Supporting Tens Of Thousands Of Users With Apache?
"I can only think of a couple of ways of doing this. One is to have an enormous single fileserver and have a cluster of apache web servers that NFS mount the home directories to serve up web pages. Then the users FTP into the main file server to store their web pages. To me, this seems wildly inefficient, and you have no real redundancy if the main fileserver crashes unless you are using a SAN (which is very expensive), or you have a hot backup that is rsync'ed or something. And I'm thinking that rsyncing up to 2TB of data would be an exercise in futility.
My other thought was to have several back end file servers with a fixed number of users on each server, and then send all HTTP requests through an LDAP server first, which would then do a redirect to the machine that user's web page resided on. The big problem then is how to make sure users are FTP'ing into the machine that their account is on? They may also use FrontPage extensions with Apache, and this could complicate things even more.
I know there has got to be a better architecture for this. How do enormous sites like Yahoo and Excite tackle this problem? They have hundreds of thousands of users! Better yet, how could they tackle it with Open Source tools? Would, for example, a Turbo Linux cluster help this problem any, or would I still have to replicate the data across every node in the cluster (meaning I'd need up to 2TB of storage for each cluster node!) Then what happens if they decide they want to add another 10,000 users? I can't find pointers to information, or ideas on how to do this *anywhere*. Can you fellow Slashdotters give me any advice?"
Vague rambling: :-)
Use multiple NFS (or CODA?) servers on the backend. All of the web servers (N separate machines) would mount "/webspace" or something from each NFS server (probably a soft mount, or via CODA). You need to specify all of the server mounts via Apache "UserDir" directives so it will search for the user's homepage correctly.
If the NFS servers also crossmount then they can be FTP servers as well and be able to access any user's home directory. Then you just list all of the NFS servers in a DNS round robin for "ftp.blah" and the WEB servers in a round robin for 'www.blah' and it should work.
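A rough sketch of the "UserDir" idea above, assuming each NFS server is mounted under its own directory on every web server. The mount paths and the `public_html` subdirectory are made up for illustration, and the snippet only builds the directive text; check the mod_userdir documentation for how your Apache version treats several patterns on one line.

```python
def userdir_directive(mount_points, subdir="public_html"):
    """Build one Apache UserDir line that lists every mount in order."""
    patterns = [f"{mount.rstrip('/')}/*/{subdir}" for mount in mount_points]
    return "UserDir " + " ".join(patterns)


# Hypothetical mount points, one per backend NFS server.
mounts = ["/webspace/nfs1", "/webspace/nfs2", "/webspace/nfs3"]
directive = userdir_directive(mounts)
print(directive)
```

The same list of mounts could also drive the FTP servers' configuration, so adding a backend server means editing one list.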
i'd use coda or AFS ( i prefer AFS ) as the base filesystem. at the uni i work we serve 125K users..not the 50K you want but we use AFS as the main filestore on an origin 2000 (192 CPUs) for compute and an AIX cluster (6 x dual/quad cpu RS/6000s). In general, i'd recommend you get a couple of dual cpu alpha boxen running tru64 or linux/alpha ( say two or 3 at $10K each ), a RAID array for each alpha box or just get two sun E450s with 4 CPUs each and get one A3500FC array with two FC cards running solaris. either way it's going to cost $70K or so. you can use apache/linux x86 boxen on the front end authenticating using kerberos and kerberised FTP to the back end cluster of alphas or sparcs. Around 50 of these machines should be enough..you get cheap dual cpu asus p2bds mobos + dual 650Mhz piiis for 1700 each with HDD, NIC, 512MB of ram etc... look at spending $250K for the whole thing..your company should be able to afford it easily. i recommend cisco for the switches and 3c905Bs for the linux boxes' NICs. we spend $2-3mil a year for our infrastructure but we can afford it.
If you can hold off a little (October-ish), you should check out VA Linux's NAS (network attached storage) solution. I think they call it 'VANA'. From the presentation I saw, a 2TB storage unit should cost about $125k. Not beautiful, but better than the $250k that someone else was quoting. Beyond that, service is simple, and VA Linux is putting a kickass 3 year service contract behind it. As far as front-end computation to put in front of it, for actually serving pages and ftp access off of, I'm not sure what to recommend, but with offloading all the file i/o on to a NAS, you should be able to drop quite a bit of the computation from the application servers. As a random guess, I'd say 4 dual-p3's running Apache/BSD or Apache/Linux that are load balanced for web serving, and then 1 or 2 of the same for ftp access should do.
First of all, I love Linux and I'm not a huge FreeBSD fan. However, FreeBSD with Apache is probably a better choice for such a large site - it's known to hold up with these types of heavy loads.
:-) Then security isn't a concern, since the most they can fuck up is their own stuff.
Also, recommend to them to not allow CGI scripting - that would be a NIGHTMARE to support with 30,000 users. Not only would there be a huge amount of security holes, imagine the amount of server power that would take.
Of course, if you have a large amount of money to spend, get an S/390 and give each user a virtual machine running Linux
--
They say they have 40,000 users and want to provide 50MB per user.
Ok, but how much are people *really* going to use? The university I am at has about 20,000 users, and provides each with 50MB of disk space (to be used for everything including webspace). In total about 100GB is used, so the average per person is only 5MB. Since this includes much more than just webspace, I'm guessing that you'd find that 200GB would be more than enough for your users.
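To make the arithmetic above explicit, here is a small sketch using only the figures quoted in the thread (40,000 users, 50MB quotas, and 100GB in use across 20,000 users at the poster's university). It assumes decimal units (1GB = 1000MB), which is what the round numbers imply.

```python
MB_PER_GB = 1000  # decimal units, matching the round numbers in the thread


def total_gb(users, mb_per_user):
    """Total storage in GB for a given per-user figure in MB."""
    return users * mb_per_user / MB_PER_GB


# Worst case: every one of the 40,000 users fills the 50MB quota.
quota_worst_case_gb = total_gb(40_000, 50)

# Observed average at the university: 100GB spread over 20,000 users.
observed_avg_mb = 100 * MB_PER_GB / 20_000

# Expected usage if the 40,000 users behave like that population.
expected_gb = total_gb(40_000, observed_avg_mb)

print(quota_worst_case_gb, observed_avg_mb, expected_gb)
```

The gap between the 2TB quota total and the roughly 200GB estimate is the whole argument: size the disks for observed usage plus headroom, not for the sum of the quotas.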
Other notes which might be of interest; my university runs apache on solaris, with the file system on a separate NFS-mounted box. The webserver (which is also FTP server and telnet server) is a four processor SUN box IIRC.
Tarsnap: Online backups for the truly paranoid
If there's something there that your users are familiar with, it'll be easier for them and for your support people.
You didn't mention anything about how much traffic you expect. If the traffic is small enough to fit on one machine, you could use several 3ware IDE controller cards with 80GB drives attached to them to get a lot of storage cheap.
The most straightforward way is to avoid the use of a fileserver entirely. The problem you're having is that you're assuming you're going to use URLs of the form http://www.sitename.edu/path-to-userdir/. If you instead give everyone their own subdomain this is quite easy.
Set up 10 apache/linux-or-bsd servers with 3000 users each. Set up a single DNS server that manages the subdomain "users.sitename.edu". Then give each user a subdomain of "username.users.sitename.edu". Map each subdomain to the IP of the appropriate server. You can manually configure apache or you can use one of the dynamic boot-time configuration schemes.
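As a sketch of the per-user subdomain mapping, the snippet below fills servers in order and emits one A record per user. The usernames and the 192.0.2.x addresses are placeholders (that range is reserved for documentation), and a real zone file would also need SOA and NS records that are not shown here.

```python
def zone_records(usernames, server_ips, users_per_server=3000,
                 zone="users.sitename.edu"):
    """Return one A record per user, filling servers in list order."""
    records = []
    for index, user in enumerate(usernames):
        server = index // users_per_server
        if server >= len(server_ips):
            raise ValueError("not enough servers for this many users")
        records.append(f"{user}.{zone}. IN A {server_ips[server]}")
    return records


# Tiny demonstration with 2 users per server instead of 3000.
demo = zone_records(["alice", "bob", "carol"],
                    ["192.0.2.11", "192.0.2.12"],
                    users_per_server=2)
for line in demo:
    print(line)
```

Because each user's FTP login can use the same hostname as their web page, this also answers the "which machine do I FTP into" question without a separate lookup.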
Of course this could be a problem if your client has a policy about DNS subdomain allocation. And 50MB * 3000 users per server is still big... you might want to buy a big big disk array and plug all the linux servers into it.
For that matter a rack full of Sun Netra T1s and a fibre channel disk array should be cheap enough and supported by Sun to boot. Netras are a dream to manage.
If it's HTML, graphics, and some animations, then 10 meg is still plentiful. But if you keep the quotas down, say, to a meg per person, it gets easier to do backups, it prevents the warez, MP3, and pr0n sites. Or at least limits them.
If folks want more than that, they can pay extra, or go to a third-party hosting system.
Hey -- I was just thinking -- this is a damn clever idea.
Damn. good thinking--
willis/
there is no thing
what else could you want?
Last I heard approx. 80% are still BSD. Microsoft will probably, eventually, have to move to win2k, for technical (PR) reasons.
cheers, G
ps. Netcraft is now reporting (at least for me) hotmail.com is running Microsoft-IIS/5.0 on Windows 2000
If you decide to go with more than one web server take a look at Squid. It can reverse proxy web requests to make many servers look like one. You should be able to split the user names on alpha ranges.
From the Squid FAQ
This doesn't quite solve your ftp problem. I did a quick search and didn't find anything that would direct a ftp to a different server based off of username. It shouldn't be hard to adapt a ftp proxy to do this for you, but I've never tried.
It wouldn't be hard to write a quick php/cgi help page that, given the user name, would provide the user with the correct server address. Or you could make a few dns entries like a.ftp.host, b.ftp.host, etc and if the user's name was tom they would use t.ftp.host.
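The lookup behind such a help page could be as small as the sketch below. The function name is made up, "ftp.host" is the placeholder domain from the comment, and restricting the first character to ASCII letters and digits is an assumption about what usernames look like.

```python
def ftp_host_for(username, domain="ftp.host"):
    """Map a username to a per-letter FTP hostname, e.g. tom -> t.ftp.host."""
    first = username.strip().lower()[:1]
    # Reject empty names and anything that would not be a valid DNS label.
    if not (first.isascii() and first.isalnum()):
        raise ValueError(f"cannot derive an FTP host from {username!r}")
    return f"{first}.{domain}"


print(ftp_host_for("tom"))
```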
Or you could ask GeoCities for their user management software ;~)
Leknor
http://Leknor.com
"So many idiots, so few comets"
I set up a webserver that serves our customers' personal homepages (1-10megs each) (10k+ accounts and growing) running Apache+Frontpage and NCFTPD on a single Sun Microsystems Ultra2 with 500mb of RAM and 2x300mhz Sparc cpus with a single A1000 filled with ?4-9 gig drives set up for RAID 0+1.
The load on the box is very minimal and I fully expect this box to handle well over 50k before needing any type of upgrade.
The CGIs could easily kill just about any box if you allow 50k people to have access to the CGI-BIN; you might consider using a reverse proxy to distribute the CGI-BIN across several boxes.
-luke (lstepnio@majjix.com)
and you are a moron. Thanks for sharing.
This sig is false.
This is definitely the path to real expandability and much like the path "real" sites take. Also, make as many a.ftp.host sites as possible. It is then always possible to direct 10 or 20 of them to the same server by dns, but also to transparently split apart the ones that are overloaded later.
If your budget is really tight then load as much disk as convenient on each server, make them cheap linux or BSD servers, and base the node size on what's cheap in HDs these days. Organize your directory structure so one tree contains all data from each named server (atree/ and btree/ if a & b are on the same physical server) so you can split easily by node. And keep all the CGI et cetera on the local servers, so it can only slow that one server down a bit.
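A sketch of the "many names, few servers" approach: every letter gets its own hostname from day one, several names point at the same box, and an overloaded group is split later by changing only the DNS mapping. The addresses are documentation placeholders and the round-robin starting layout is an arbitrary choice for illustration.

```python
import string


def initial_layout(servers):
    """Spread a.ftp.host .. z.ftp.host across the servers round-robin."""
    return {letter: servers[i % len(servers)]
            for i, letter in enumerate(string.ascii_lowercase)}


def split_off(layout, letters, new_server):
    """Return a new layout with the given letters moved to new_server."""
    updated = dict(layout)
    for letter in letters:
        updated[letter] = new_server
    return updated


layout = initial_layout(["192.0.2.21", "192.0.2.22"])
# Later: the 's' and 't' users outgrow their box, so move just those names.
layout = split_off(layout, "st", "192.0.2.23")
print(layout["a"], layout["s"], layout["t"])
```

Users keep typing the same hostname before and after the split; only the per-name data tree (atree/, btree/ and so on) has to be copied to the new machine before the DNS change.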
On another note, a typical thing to do is professors.host (who have fewer procs usually, but therefore get better response times), nongeeks.host, and a.cs.host, b.cs.host, so that the cs users can get angry at each other for bringing down a server, while the nongeeks, who'll be more likely to think it is your fault, won't have problems. On a similar note, if your setup allows easy user migration, you could disallow cgi on most servers, allowing it only by permission (and migrating that user, then).
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
If this is a university, here's my suggestion
;-}
Split up the users by the school they are in, i.e. at GWU, we've got seas.gwu.edu(engineering), esia.gwu.edu(international affairs), and a couple others. Most university students don't run high volume stuff off their website - so you could probably get away with decently powered boxes - 1 for each school, with tons of SCSI HD space on each box. Or - get a big honkin SUN box and let it do everything (gwis2.circ.gwu.edu).
Of course - if this isn't a university - just throw my comments out the window
slashdot username - at - email.domain.name
Well, how cheap is cheap? What type of budget do you have to work with? Are the sites dynamic, static, or what? Do you know what kind of traffic these sites currently have? Are you looking for high availability and/or redundancy? Do you want load balancing of any sort? Besides frontpage do your customers want anything else like php or zope (this will add to server load)? Are you providing other services like SQL, SMTP etc?
In an old article in Slashdot these guys had the ticket for you. This origin server could have been upgraded to easily handle your load. Concentric serves over 100k users with theirs. Of course, you want to load it with Apache and stay away from crazy things like JSPs and too much CGI.
From the message this sounds like a web server to support personal web pages at a university. Based on that, you can bet that a large percentage (most?) of the people will not use the space at all. Some small number will use the maximum of their quota (whatever that number is) and a bunch will put up just 1-2MB of stuff.
My swag would be that *maybe* 1/10th of the 2TB of theoretical maximum would be used. 200GB of storage is doable on a good server, with simple backup strategies, etc.
As for the Web server - I would bet that a single well-tuned machine (of the PC make) could do a reasonable job, assuming that no user has really nasty CGI scripts running. Yeah - it would slow down at peak times, but hey - it's a freebie for the students. You might saturate the uplink before the server.
It will take some system admin time on an ongoing basis to monitor performance and slap those students that run nasty CGI scripts that chew up lots of CPU time. That requires a carefully written user policy that basically states you can do stuff, but you can't hog the CPU. Then it also requires a system admin that can be a BOFH to clue in the lusers that need to be helped out. The need for maintenance is probably a greater cost than the actual hardware.