Building a Scaleable Apache Site?
bobm writes "I'm looking for feedback on any experience building a scaleable site. This would be a database driven site, not just a bunch of static pages. I've been looking for pointers to what other people have learned (either the easy way or hard way). I would like to keep it Apache based and am looking for feedback on the max # of children processes that you've been able to run, etc. Hardware-wise, I'm looking at using quad Xeons or even Sun E10K systems. I would like to stay non-clustered if possible."
Post the URL, not just the question, and let the /. effect take its toll.
Increase the threads etc. until it stays up.
Just in case you haven't thought about this, for a database-backed website, getting rid of the database connection overhead is just about the smartest thing you can do performance-wise. Think mod_perl. Furthermore, consider moving your SQL server to another machine before making any other hardware changes. (If you haven't already...) The demands of an HTTP server are definitely different than those of a SQL server. If you're going to have a lot of dynamic content, plus a decent number of SSL requests, think about putting a proxy in front of your page-generating server. I know these aren't Apache tweaks, but they're worth considering anyway.
Nonperiodic Central Trajectory
I've built a site that's able to handle 1-2 million dynamic page views per day. There's not a single static page on the whole site except for the 404 page.
One trick that we currently use is a little daemon that runs on our app servers (custom java app). It's essentially a tcp socket interface to a hashtable with an expiration timestamp. Here's how the site works:
- request comes in
- front end server takes GET params and queries the local cache daemon to see if those objects are local
- if the objects are local - great - slap them together and deliver the page, otherwise
- query the database for the object info
- populate the cache daemon
- deliver the page
Another trick we use is dumping the output from one dynamic page to be included by another. So, have a page that generates nothing but an element (eg. slashbox). Have a mechanism on the back end that requests that page and stores the result as a text file. The dynamic page (say, php or jsp) just uses an include directive pointing to the static text file - which can be formatted html.Of course, the real weak point of the system (without clustering) is the database. Make sure that your data is index properly and that your queries are optimised. We have 2 tables with over a million rows each that get hit all the time. Proper data layout, quick queries and the local caches help our puny dual P3-733 (NON xeon) with a paltry 1GB of RAM dish out well over a million dynamic pages per day.
Combined with a proper connection pool, they can really save your butt.
However, persistent connections may be too much of a burden for an overworked db server. If you're using PHP/MySQL for example, mysql_pconnect may not be the way to go if you have a few front end servers hitting the database. It seems that the PHP connection pooling limit is per process. If you have 100 Apache processes w/ a 10 connection limit per and 10 web servers, that's a max of 10,000 db connections!!!
One idea might be an intermediate "connection broker" on a per server basis. We use something similar to this.
Apache's fork() model is great for stability, but it really hinders interprocess resource sharing. We're mostly Java based here, which allows us to use beans and such. Does mod_perl allow for resource sharing between processes?
You have provided way too little info. First, do you really mean scalable or do you mean high-traffic. They are not the same thing. You can build a high-traffic site using technology that won't cluster/expand well and be screwed when you need a higher-traffic site. Converesly you can build a very low-traffic site that will scale quite well (a technology that allows you to easily add hardware as your traffic dictates for example).
It would be helpful to know, for example, what portion of the traffic (both # of requests and bytes) is static and what is dynamic (include images)? What is the peak (say 98th percentile) expected traffic? What are typical page sizes and how much are they compressible with gzip? etc.
Apache itself doesn't really handle dynamic content - its modules or an underlying app server do that. That is probably where you will have to do the most work.
As another poster mentioned, persistent database connections are essential. You may want to look into a "real" app server. JBoss is open source and just won some awards at Java One. If that is too much complexity at least be sure to use persistent connections in whatever other technology you select.
Persistent connections have a down side. Don't forget that your underlying database must be able to handle both your number of requests and your number of connections. If you just increase Apache processes you may find that the database is unable to manage that many simultaneous connections efficiently. Opening/closing connections for each request kills you. Maintaining hundreds of open connections kills you. This is one of the real strengths of any technology that can handle connection pooling - you will probably find that you only need a handful of connections to handle lots of front-ends and connection pooling allows you to do it efficiently. It can also help you scale by distributing connections to multiple database servers for you when your needs dictate.
The faster you can dispense with a request the better. This includes not only all your processing time but the transmission time to the client. A process/thread can't move on till the client has the data. Therefore...
Design your pages to give yourself a fighting chance. For example: if you have any static images be sure to set your http headers to prevent browsers from reloading them. Even the request overhead to the server to determine that the cached image is up-to-date is more than the size of the image itself so set a LONG expiration.
Trim unnecessary whitespace, using short names (ie. i/x.png instead of buttonimages/left_page_arrow_top.png) and so on.
If the pages are large enough and the clients slow enough then you may want to use gzip (mod_gzip) to compress the data. It will cost you processing time to compress dynamic content but will save you transmission time. If you pay for bandwidth you can see a 50-80% reduction in your bandwidth usage as well.
Note: if your spec of "non-clustered" and scalable still allows multiple machines and if you do have images or other static content you may want to move that content to a separate machine. The Linux kernel http server screams on static content (of course the static-content load on your server may be so small a percentage that it isn't worth the effort).
Try Apache 2.x first. One problem with 1.x on most (all??) platforms is the "thundering herd" problem. You may try to increase performance by running lots of processes but when a request comes in, all sleeping processes are awoken (the thundering herd) and although only one will end up servicing the request, the effect of waking up huge numbers of sleeping processes can be "bad".
Be sure to test with clients of varying speed. We discovered we could crash a site faster with slow clients than fast. Once while testing a Cold Fusion/IIS site it seemed like we could realy get some screaming throughput when testing on the LAN. Unfortunately when the server had to keep threads/connections alive long enough to service slow clients it wasn't so pretty. When we ran the simulation that way we could crash the server in 2 seconds.
Give me more specifics and I may be able to give better advice.
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
Dynamic: currently mod_perl but open to something faster (if there is a proven faster technology).
Apache: current 1.3.x move to 2.0.x when it's ready for prime time.
OS/Hardware: open, currently Solaris/Sun, open to quad Xeon/Linux if it has the performance.
The reason for asking about a single vs multiple machines is that I wanted to get a handle on what one box could do as opposed to the gut reaction to just keep adding servers.
Although I'm not expecting magic I didn't want to get too specific because I'm interested in feedback from across the board, for example how does Orbitz or Yahoo or *New York Times* maintain uptime? I haven't found anywhere that discusses places like that.
A good reference on this is from one of the eToys architects. It uses mod_perl as the technology but the general strategies -- caching in particular -- will work for any application server technology.
Chris
M-x auto-bs-mode
Contains very good information. http://www.kegel.com/c10k.html
Slashdot = Sarcasm