Slashdot Mirror


Building a Scalable Apache Site?

bobm writes "I'm looking for feedback on any experience building a scalable site. This would be a database-driven site, not just a bunch of static pages. I've been looking for pointers to what other people have learned (either the easy way or hard way). I would like to keep it Apache based and am looking for feedback on the max # of child processes that you've been able to run, etc. Hardware-wise, I'm looking at using quad Xeons or even Sun E10K systems. I would like to stay non-clustered if possible."

11 of 60 comments

  1. Persistent Connections Are Your Friend by Aix · · Score: 5, Informative

    Just in case you haven't thought about this, for a database-backed website, getting rid of the database connection overhead is just about the smartest thing you can do performance-wise. Think mod_perl. Furthermore, consider moving your SQL server to another machine before making any other hardware changes. (If you haven't already...) The demands of an HTTP server are definitely different than those of a SQL server. If you're going to have a lot of dynamic content, plus a decent number of SSL requests, think about putting a proxy in front of your page-generating server. I know these aren't Apache tweaks, but they're worth considering anyway.
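
    To make the point about connection overhead concrete, here is a minimal sketch of per-request versus per-process connections. It is not mod_perl; sqlite3 is only a stand-in so it runs anywhere, and the names `connect`, `handle_request_naive` and `handle_request_persistent` are made up for illustration. A networked database makes each connect far more expensive than it is here.

```python
import sqlite3

connects = 0  # how many times we paid the connection cost


def connect():
    """Open a connection and count it (sqlite3 is a stand-in for a real DB)."""
    global connects
    connects += 1
    return sqlite3.connect(":memory:")


def handle_request_naive():
    """Open and close a connection on every request."""
    conn = connect()
    try:
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        conn.close()


_conn = None  # one handle per server process, opened lazily


def handle_request_persistent():
    """Reuse the process-wide connection across requests."""
    global _conn
    if _conn is None:
        _conn = connect()
    return _conn.execute("SELECT 1").fetchone()[0]


def count_connects(handler, n):
    """Run n requests through handler and return how many connects it cost."""
    global connects
    connects = 0
    for _ in range(n):
        handler()
    return connects
```

    Running 100 requests through each handler costs 100 connects the naive way and at most one the persistent way.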

  2. Look in the right place by linuxwrangler · · Score: 5, Informative

    You have provided way too little info. First, do you really mean scalable or do you mean high-traffic? They are not the same thing. You can build a high-traffic site using technology that won't cluster/expand well and be screwed when you need a higher-traffic site. Conversely you can build a very low-traffic site that will scale quite well (a technology that allows you to easily add hardware as your traffic dictates for example).

    It would be helpful to know, for example, what portion of the traffic (both # of requests and bytes) is static and what is dynamic (include images)? What is the peak (say 98th percentile) expected traffic? What are typical page sizes and how much are they compressible with gzip? etc.

    Apache itself doesn't really handle dynamic content - its modules or an underlying app server do that. That is probably where you will have to do the most work.

    As another poster mentioned, persistent database connections are essential. You may want to look into a "real" app server. JBoss is open source and just won some awards at Java One. If that is too much complexity at least be sure to use persistent connections in whatever other technology you select.

    Persistent connections have a down side. Don't forget that your underlying database must be able to handle both your number of requests and your number of connections. If you just increase Apache processes you may find that the database is unable to manage that many simultaneous connections efficiently. Opening/closing connections for each request kills you. Maintaining hundreds of open connections kills you. This is one of the real strengths of any technology that can handle connection pooling - you will probably find that you only need a handful of connections to handle lots of front-ends and connection pooling allows you to do it efficiently. It can also help you scale by distributing connections to multiple database servers for you when your needs dictate.
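
    A rough sketch of what pooling buys you, assuming a threaded front end: many workers share a handful of connections and block briefly when all are busy. The `ConnectionPool` class is hypothetical (real app servers ship their own), and sqlite3 again stands in for the real database.

```python
import queue
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: callers borrow a connection and always give it back."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while all are busy
        try:
            yield conn
        finally:
            self._idle.put(conn)


def make_conn():
    # check_same_thread=False lets a pooled sqlite3 connection be used by
    # different threads, one at a time (the pool guarantees exclusivity).
    return sqlite3.connect(":memory:", check_same_thread=False)


pool = ConnectionPool(size=4, factory=make_conn)


def handle_request(n):
    """One 'page': borrow a connection, run a query, return it to the pool."""
    with pool.connection() as conn:
        return conn.execute("SELECT ? * 2", (n,)).fetchone()[0]


def serve_many(requests, workers=32):
    """Serve the requests from many more workers than there are connections."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(handle_request, range(requests)))
```

    Here 32 workers get by on 4 connections; the database only ever sees 4 sessions no matter how many front-end workers you add.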

    The faster you can dispense with a request the better. This includes not only all your processing time but the transmission time to the client. A process/thread can't move on till the client has the data. Therefore...

    Design your pages to give yourself a fighting chance. For example: if you have any static images be sure to set your http headers to prevent browsers from reloading them. Even the request overhead to the server to determine that the cached image is up-to-date is more than the size of the image itself so set a LONG expiration.
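
    As an illustration of the LONG expiration, this sketch builds the two standard HTTP caching headers for a static image. The helper name `static_cache_headers` and the one-year default are assumptions for the example; in Apache itself you would normally set these through configuration (mod_expires) rather than application code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def static_cache_headers(max_age_days=365, now=None):
    """Return headers telling browsers to keep a static file for a long time."""
    now = now or datetime.now(timezone.utc)
    lifetime = timedelta(days=max_age_days)
    return {
        # HTTP/1.1 caches: lifetime in seconds from the time of the response
        "Cache-Control": f"public, max-age={int(lifetime.total_seconds())}",
        # HTTP/1.0 caches: absolute expiry date in RFC 1123 format
        "Expires": format_datetime(now + lifetime, usegmt=True),
    }
```

    The catch with a long expiration is that you must rename an image when it changes, since browsers will not come back to check.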

    Trim unnecessary whitespace, use short names (e.g. i/x.png instead of buttonimages/left_page_arrow_top.png) and so on.

    If the pages are large enough and the clients slow enough then you may want to use gzip (mod_gzip) to compress the data. It will cost you processing time to compress dynamic content but will save you transmission time. If you pay for bandwidth you can see a 50-80% reduction in your bandwidth usage as well.
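
    You can estimate the savings on your own pages before committing to anything. This sketch uses Python's gzip module, not mod_gzip; the function name and the sample page are made up, and the repetitive sample compresses far better than a real page would.

```python
import gzip


def compression_savings(body: bytes, level: int = 6) -> float:
    """Fraction of bytes saved by gzip-compressing a response body."""
    compressed = gzip.compress(body, compresslevel=level)
    return 1 - len(compressed) / len(body)


# Hypothetical page: a table with 500 near-identical rows.
row = "<tr><td class='item'>row</td><td class='price'>9.99</td></tr>\n"
page = (row * 500).encode("utf-8")

print(f"saved {compression_savings(page):.0%} of {len(page)} bytes")
```

    Feed it a sample of your real generated pages to see where you land in that 50-80% range.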

    Note: if your spec of "non-clustered" and scalable still allows multiple machines and if you do have images or other static content you may want to move that content to a separate machine. The Linux kernel http server screams on static content (of course the static-content load on your server may be so small a percentage that it isn't worth the effort).

    Try Apache 2.x first. One problem with 1.x on most (all??) platforms is the "thundering herd" problem. You may try to increase performance by running lots of processes but when a request comes in, all sleeping processes are awoken (the thundering herd) and although only one will end up servicing the request, the effect of waking up huge numbers of sleeping processes can be "bad".

    Be sure to test with clients of varying speed. We discovered we could crash a site faster with slow clients than fast. Once while testing a Cold Fusion/IIS site it seemed like we could really get some screaming throughput when testing on the LAN. Unfortunately when the server had to keep threads/connections alive long enough to service slow clients it wasn't so pretty. When we ran the simulation that way we could crash the server in 2 seconds.
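
    The slow-client effect is easy to see with back-of-the-envelope arithmetic: workers tied up = request rate x time each request holds a worker. A rough sketch, with all numbers hypothetical; it ignores OS socket buffering and keepalive, so treat it as intuition rather than a capacity plan.

```python
def busy_workers(requests_per_sec, generate_sec, response_bytes,
                 client_bytes_per_sec):
    """Average number of processes/threads occupied at a given request rate.

    Each request holds a worker for the time to generate the page plus the
    time to push it down the client's link.
    """
    transfer_sec = response_bytes / client_bytes_per_sec
    return requests_per_sec * (generate_sec + transfer_sec)


# Hypothetical load: 100 requests/s, 50 ms to build a 40,000-byte page.
lan = busy_workers(100, 0.05, 40_000, 10_000_000)  # ~10 MB/s LAN client
modem = busy_workers(100, 0.05, 40_000, 5_000)     # ~5 KB/s dial-up client

print(f"LAN clients:   about {lan:.0f} busy workers")
print(f"modem clients: about {modem:.0f} busy workers")
```

    Same server, same pages, same request rate; the only thing that changed is how long each client takes to drain the response.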

    Give me more specifics and I may be able to give better advice.

    --

    ~~~~~~~
    "You are not remembered for doing what is expected of you." - Atul Chitnis
    1. Re:Look in the right place by babbage · · Score: 5, Informative
      I was going to reply to the article, but you hit most of the points I was going to ...and a lot of them I wasn't thinking of. So to cut down on the redundancy, I'll just reply here & add a couple more points:
      • To control the timeout problem for slow connections/clients, Apache can be tuned to use very short keepalive times. HTTP/1.1's keepalive header can be useful for clustering a burst of multiple requests (such as an HTML file plus a collection of images for it) but the dormant processes it can generate can be more costly than the TCP connection overhead time you were trying to avoid by enabling HTTP/1.1. Oops. Set the timeout low enough to hit the sweet spot between "too many new TCP connections" and "too many idle Apache children".
      • Reconsider your resistance to clustering. Yes it can make things more complicated, but it can also make your life a lot easier. Want to ease the Apache or MySQL load? Buy a couple more boxes & have them NFS mount the content or data directories. You can also do clever things like putting all your static content on a server optimized for that purpose (no mod_cgi or mod_include or anything like that) or dedicating hardware to a mod_perl instance, a mod_php instance, etc. Whatever. You gain a lot more flexibility, you compartmentalize things so that (hopefully) you don't have a single point of failure, and it's easier to swap/upgrade/replace components in one area without disrupting things in others.
      • As another commenter noted, caching can be a big help. Caching proxies can reduce the load on the main server significantly. Not everything can be cached, but it's possible to strike a balance between readily cachable data (home page, section headers, images, stylesheets) and material that really does have to get generated for each request. On a big site, every little bit helps.
      • mod_gzip is your friend. Processing power is always going to be cheaper than bandwidth, so spending your money on compressing data is cheaper than paying for increased bandwidth. Even if not all clients can take advantage of it, if a significant fraction of them can then you'll quickly come out ahead.
      • If you have any huge content (audio or video) that places a heavy load on the rest of your systems, you might consider outsourcing it to a company like Akamai that specializes in delivering such content quickly. Services like this are probably expensive, but if you need it then you need it, and going with an Akamai is surely cheaper than setting up & maintaining your own data centers all over the country & world.
      • As the above poster notes, consider Apache 2.0. Among the many neat-o features it offers is a choice in execution model: in addition to the fork/exec multiprocess model that 1.3.x used, you can also try threaded modes (which should be a big boost to Win32 servers if you need to go in that direction for anything), and I think maybe some more exotic execution methods. Depending on your setup, you might be able to find a big speedup by switching to threads. (Note though that, as of now, Apache2 and PHP4 don't play nicely together, and the same is probably true for a lot of Apache extensions (mod_perl, others), so make sure that whatever modules you need to use are going to work.) Test test test!
  3. You need to provide way more info by DevilM · · Score: 3, Informative

    You say you want it Apache based and dynamic. Well how are you going to build the dynamic pages? Are you writing your own Apache module? Are you using Perl, PHP, JSP, CFML... what? Are you using Apache 1.3.* or Apache 2.0.x? What database are you using? What kind of application is it? What OS are you using? What about disk subsystem, is it RAID based? If so, what level? Why do you want to use a single big machine instead of many small machines?

    Anyway, you need to provide way more information in order to get help. There is no magic way to make a site scalable. It just depends on the answers to all the above questions and more.

    1. Re:You need to provide way more info by Anonymous Coward · · Score: 1, Informative
      How do they maintain uptime? Uhh, see earlier posts.... :-)

      I've recently started with a NYT sibling company. Suffice to say, our network design -- while far from perfect & entirely too arcane in a lot of ways (like, say, daily munging of data from old VAX & PDP mainframes into a web presentable form) -- works. I'm told we had emails from visitors telling us that on 9/11 last year, our site was the only major one that a lot of people could get to: when NYT, CNN, MSNBC etc were inaccessible for several hours, our site was able to handle it.

      Now granted, our traffic during that spike still might not have matched theirs on an average day, but still -- the ability of the system to withstand sudden shocks like that day (or, say, a Slashdotting earlier this week) has been well proven. And in an abstract way, the points being raised in this thread -- by several posters -- are all design aspects incorporated into the site I work for.

      Gimme an email address & I might elaborate. I don't want to go into detail on Slashdot... :-/

  4. Re:Cache, Cache, Cache by Longstaff · · Score: 3, Informative

    Well, this is a problem with any cache system. With ours, you adjust the TTL to an acceptable value of "staleness" down to the second.

    When optimising any system, relaxing granularity is something that you should look at. Do I really need the latest version of the news story up to this very second, or can I deal with one that's a minute or more old? In our case, the news stories are edited and reviewed before they're published, so it doesn't matter if the story is 1 minute old or 10 days old.

    In an emergency, we can forcibly expire an element.

    There are cases on our site where we can't cache the data - we *need* the live data. Those cases are scrutinized thoroughly before we actually make a live call to the db to see if there's some way to get around it. However, most of our data is cacheable and we have a hit rate of ~80%
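
    For anyone who wants to play with the idea, here is a minimal sketch of a TTL cache with forced expiry and hit counting. It is not the poster's system; the `TTLCache` class, its method names and the injectable clock are all assumptions made for the example.

```python
import time


class TTLCache:
    """Serve cached values until they are ttl_seconds old, then reload."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so staleness can be tested
        self._store = {}     # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, calling load(key) only when stale or absent."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = load(key)    # the expensive call to the database
        self._store[key] = (now + self.ttl, value)
        return value

    def expire(self, key):
        """Forcibly drop one element, e.g. after an emergency correction."""
        self._store.pop(key, None)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

    The TTL is the knob for acceptable staleness: set it to 60 and a story is at most a minute out of date, while the database sees at most one query per story per minute.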

  5. A good article by cwinters · · Score: 4, Informative

    A good reference on this is from one of the eToys architects. It uses mod_perl as the technology but the general strategies -- caching in particular -- will work for any application server technology.

    --

    Chris
    M-x auto-bs-mode

  6. Kegel's site by jawahar · · Score: 5, Informative

    Contains very good information. http://www.kegel.com/c10k.html

  7. Re:Cache, Cache, Cache by Anonymous Coward · · Score: 1, Informative

    We use PostgreSQL NOTIFY facilities to accomplish this. That is, when related table information changes, we have PostgreSQL NOTIFY all webservers (via their database connection) that information has changed and they'll want to look at the new information. Often they pull a new copy of any relevant data.
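
    The pattern is roughly the following. This is only a sketch of invalidate-on-notify: `NotifyChannel` is an in-process stand-in for the database's LISTEN/NOTIFY channel, not a real PostgreSQL client, and every name in it is hypothetical.

```python
class NotifyChannel:
    """Stand-in for the database notification channel (not a real client)."""

    def __init__(self):
        self._listeners = []

    def listen(self, callback):
        self._listeners.append(callback)

    def notify(self, payload):
        for callback in list(self._listeners):
            callback(payload)


class WebServerCache:
    """Per-webserver cache that drops a table's data when told it changed."""

    def __init__(self, channel, load):
        self._load = load    # function that fetches a table from the DB
        self._data = {}
        channel.listen(self._on_notify)

    def _on_notify(self, table):
        # Invalidate only; the next request pulls a fresh copy.
        self._data.pop(table, None)

    def get(self, table):
        if table not in self._data:
            self._data[table] = self._load(table)
        return self._data[table]


# Usage: two webservers share one channel; an update notifies both.
database = {"stories": ["old headline"]}
channel = NotifyChannel()
web1 = WebServerCache(channel, lambda t: list(database[t]))
web2 = WebServerCache(channel, lambda t: list(database[t]))
```

    Unlike a TTL, nothing is ever served stale, but every webserver has to keep a listening database connection open.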

  8. How about... by jabbo · · Score: 3, Informative

    When I worked at XOOM we had a farm of about 30 front-end FreeBSD webservers mounting member directories via NFS and serving an average of 500Mbps (peaking up to about 1Gbps at times). The key to that architecture was that member logins were cached via a proprietary daemon that all pages authenticated from. Templates, dynamic content, etc. were all pickled to flat files whenever possible (at first, not much; later, once the merger with Snap! was done, much more, as their content caching system was superior).

    The database on the Xoom side was an E450 IIRC. Snap used much burlier hardware because they were basically a silver-spoon project of CNET/NBC.

    The lesson for scalability is simple: cache like a motherfucker and make everything you can static. And run DSR. ;-)
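
    "Make everything you can static" can be as simple as rendering a page once and dumping it where the webserver serves plain files. A minimal sketch, assuming a hypothetical `render` callable that produces the HTML; this is not the XOOM or Snap! system, just the general idea.

```python
import os
import tempfile


def publish_static(path, render):
    """Render a page once and write it as a flat file the webserver can serve.

    The file is written to a temporary name and renamed into place, so a
    request never sees a half-written page.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(render())
        os.chmod(tmp, 0o644)   # mkstemp creates the file owner-only
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

    Call it whenever the underlying data changes (or from cron), and the front ends never touch the database for that page.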

    If you decouple the database from the webservers you need to make extra sure that you proxy the high-traffic requests, either by running a static-file-dumping daemon process (for content) or a proxy daemon (for authentication). My moderately-low-traffic site at my current job can handle two saturated DS3's worth of traffic with 1024 apache child processes running on each of 2 dual PII boxes w/512MB RAM, plus the database running on a dual PII w/1GB RAM. Doesn't even break a sweat. Postgres (the database) runs 1024 child processes with a lot of buffers, NFS caches are pretty good sized (if your frontend webservers are Sun, you can use cachefs aggressively, I would), and overall it just took some serious tuning to make sure that nothing fazes it.

    I'm working on a couple of "community" sites with similar demands (~1million visitors/month) and mod_throttle + caching will solve one's problems, the other is where I stole the throttling idea from :-). Just tune, tune, tune... you'll get it.

    For the whiners, Xoom failed in the end because it lost sight of the cheap-ass principles that made it a good stock market scam. Right up until the end, performance on the member servers was sub-4 second per page on average.

    --
    Remember that what's inside of you doesn't matter because nobody can see it.
  9. Re:ACS/OpenACS by consumer · · Score: 2, Informative
    > The problem with Apache 1.x/PHP/mod_perl/MySQL/PostgreSQL is that the so-called persistent database connection is per-process based.

    And how is this a problem exactly? If your server is handling only dynamic pages (your static stuff should be split onto another server) you will almost certainly need a database handle on every request. Connection pooling is only useful if your application spends a lot of time NOT using the database.

    > Then there is the problem of running out of db connections for any particular process.

    Why would a particular process need more than one database connection? Each process only handles one request at a time.

    > Apache 2.0 is likely to be better in this respect, but I still think that AOLServer is cleaner.

    Apache 2 provides full support for threading, so it can use the same approach as AOLServer. It doesn't sound like you know very much about it, so maybe you should check it out before you tell everyone it's no good.