Slashdot Mirror


On Building High Volume Dynamic Web Sites

kolestrol asks: "A while back I built a Web site using mysql and Java servlets to track Kosova refugees. That experience taught me a lot. I had severely underestimated the job. I was wondering if anyone has any similar experiences, i.e. maintaining highly data-driven interactive Web sites with a high volume, and how they have managed to handle the load. Furthermore, how have they managed to handle content (site redesigns, etc.)? The reason I ask is that ever since the above-mentioned project, I have been doing a lot more research, trying to find a free Linux solution. The only thing I found was at The Linux Virtual Server Project. I don't know how the larger Web sites do it, but I assume they evolved in stages to add their current features. What kind of design decisions are made when designing such sites?"

"Apart from this I have been talking to commercial vendors like BEA (I was very impressed) who provided application servers with load-balancing, replication, etc., starting at $20,000 (Australian) -- they run sites like Amazon.com, Qwest, Wells-Fargo etc.

There is an issue here (is there? I don't have any experience to really know hence am asking you) ... I can build a custom solution with load balancing written at the application level. But how does this affect my maintainability (for example Amazon.com moving from just books to all sorts of other stuff .. how long did it take to redesign the site etc.)?

The site I first built could potentially hold information about a million refugees, and allowed searching on most fields regarding information on a person (wildcard queries). Unfortunately, on doing some stress testing (with around 700,000 records) I found that at most 15 hits could be handled every ten seconds. I optimized the code, switched to a faster JDBC driver, wrote a simple load balancer (and I mean very simple), limited searching to a few fields, and prevented bad wildcard queries (e.g., a wildcard at the start would make little if any use of the index). Consequently, I managed to get the system to handle slightly more load (200 hits at 5 seconds). The hardware was a dual Pentium II 450MHz I think, 512MB RAM, 2x8G Ultra-wide SCSI hard drives, and running Linux of course. BTW, the Kosova refugees article has a lot of misinformation, e.g. encrypted databases, and the time to build it was actually one week (and two weeks of overcoming red tape, etc.)."
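The wildcard restriction described above can be sketched as a small check run before a search reaches the database. This is only an illustration, not the code from the original site: the function name, the SQL `LIKE` wildcard characters, and the minimum prefix length of 2 are all assumptions.

```python
def is_index_friendly(pattern: str, min_prefix: int = 2) -> bool:
    """Return True if a LIKE-style pattern starts with a literal prefix.

    A pattern such as 'Kras%' lets an index narrow the search to one
    range of keys, while '%niqi' forces every row to be examined.
    min_prefix is an assumed threshold, not a value from the thread.
    """
    wildcards = {"%", "_"}
    prefix_len = 0
    for ch in pattern:
        if ch in wildcards:
            break  # stop counting at the first wildcard
        prefix_len += 1
    return prefix_len >= min_prefix
```

A search form could reject or rewrite any pattern for which this returns False, so that only queries able to use the index are sent to the database.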

7 of 222 comments

  1. Tips by Anonymous Coward · · Score: 4
    We use Apache/PHP/MySQL to serve out about a million dynamic pages per day. Here are some general tips that hopefully will help, although a further understanding of how your data is structured would help.

    Cache your dynamic data
    I cannot emphasize enough how important this is. If your site pulls up a page of 50 records, each of which pulls in other database info, create a cache for the whole page. Then create a cache for each record as well. Cache elements as well as entire pages of dynamic data into separate cache tables in the database wherever possible.
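    A minimal sketch of the page-plus-record caching idea follows. It keeps the caches in Python dictionaries rather than in database cache tables as the comment suggests, purely to stay self-contained; the class name, `load_record` callback, and invalidation scheme are assumptions.

```python
class PageCache:
    """Two-level cache: rendered records, and whole pages built from them."""

    def __init__(self, load_record):
        # load_record is a hypothetical callback: record id -> rendered text.
        self._load_record = load_record
        self._records = {}          # record id -> rendered fragment
        self._pages = {}            # page key -> rendered page
        self._pages_by_record = {}  # record id -> page keys that include it

    def record(self, record_id):
        if record_id not in self._records:
            self._records[record_id] = self._load_record(record_id)
        return self._records[record_id]

    def page(self, page_key, record_ids):
        if page_key not in self._pages:
            fragments = [self.record(r) for r in record_ids]
            self._pages[page_key] = "\n".join(fragments)
            for r in record_ids:
                self._pages_by_record.setdefault(r, set()).add(page_key)
        return self._pages[page_key]

    def invalidate(self, record_id):
        # Drop the record and every cached page that was built from it.
        self._records.pop(record_id, None)
        for page_key in self._pages_by_record.pop(record_id, set()):
            self._pages.pop(page_key, None)
```

    When one record changes, only that record and the pages containing it are rebuilt; the other records on those pages are still served from the record cache.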

    Split your servers
    Separate your database and Apache servers. They should communicate with each other through a 100Mb network switch at the least (make sure you use full duplexing as well).

    Split your tables
    I know this sounds funny, but the current version of MySQL has a tendency to lock tables when doing writes to the table. This means that one update on a table can halt all other reads on the table. MySQL is very fast, so normally this isn't much of a problem, but when you start getting into a high volume of requests, you're going to start to get bogged down. The solution to this is to split your table into multiple tables, i.e. the table mydata becomes mydata_1, mydata_2, mydata_3, etc, where for example mydata_1 might hold records starting with a-e, mydata_2 is for f-h, etc. This might sound tricky but it will save you a lot of trouble. I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.
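    The table split can be sketched as a function that maps a record key to a table name. The a-e and f-h ranges follow the example in the comment above; the remaining boundaries and the `mydata_other` catch-all table are made up for illustration.

```python
import bisect

# Inclusive upper bound of the first letter handled by each table.
# 'e' and 'h' come from the example above; 'm', 'r', 'z' are assumptions.
UPPER_BOUNDS = ["e", "h", "m", "r", "z"]


def table_for(key: str, base: str = "mydata") -> str:
    """Pick the split table that holds records whose key starts like `key`."""
    first = key[:1].lower()
    if not ("a" <= first <= "z"):
        # Hypothetical catch-all for digits, punctuation, empty keys, etc.
        return f"{base}_other"
    index = bisect.bisect_left(UPPER_BOUNDS, first)
    return f"{base}_{index + 1}"
```

    Every read and write would call `table_for` first, so a write lock on one table only blocks queries for keys in the same letter range.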

    Your MySQL server
    Should have a ton of memory plus a fast disk. Preferably a gig or more for memory, use RAID or a fast (10,000 RPM+, 6ms or less seek time) SCSI disk for the data, also keep it separate from your OS disk (i.e. don't have your OS running on the same disk as your MySQL tables). Save the high MHz processors for...

    Your Apache servers
    We've found that the really processor intensive stuff happens on the Apache servers. So you should keep an eye on the load average, etc. on these servers. If they start to get bogged down, you can just pop in another Apache server and split the load.

    The nice thing about this setup is that you can keep adding Apache servers as your processing needs go up. From our own experience MySQL is not very processor intensive but very dependent on memory and disk speed. When using MySQL with Linux, you also have to be careful about file system limitations, handlers etc. Tweaking certain variables should help. Good luck!
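    The comment does not say how the load is split between Apache servers. One common and very simple approach is round-robin, sketched below under that assumption; the class and the host names in the usage note are hypothetical.

```python
class RoundRobin:
    """Hand out back-end servers in rotation; new servers can be added live."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._next = 0  # count of requests dispatched so far

    def add(self, server):
        # "Pop in another Apache server": it joins the rotation immediately.
        self._servers.append(server)

    def next_server(self):
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server
```

    For example, `RoundRobin(["web1", "web2"])` alternates between the two hosts, and after `add("web3")` each of the three receives roughly a third of the requests.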

  2. Slashdot is terrible example by Raul+Acevedo · · Score: 4
    Slashdot is the most unreliable site I visit on a regular basis. Throughout the day, page loads can take several seconds; sometimes not, sometimes longer, sometimes not accessible at all. This has been the case on both the East Coast and the West Coast (I just moved from one to the other). Note that on the East Coast, I had an extremely fast cable modem that was faster than most T1 connections I've had at actual companies, and on the West Coast, it's been through actual T1s. Also, I compare to other sites at the same time that Slashdot is slow.

    No, I'm not flaming Slashdot; I love everything else about the site. But its accessibility unfortunately didn't improve with the Andover.net takeover, nor through any of the other changes that have been happening in the last two years.

    I'm sure other people's mileage will vary, I'm interested in hearing other people's experience.

    --
    In a real emergency, we would have all fled in terror, and you would not have been notified.
  3. FastCGI is the unsung hero by Get+Behind+the+Mule · · Score: 4
    The fashionable technologies for persistent server-side programming for the Apache server these days are mod_perl, PHP and Java servlets, but I've had very good experience with another one that doesn't seem to be so "hip": FastCGI. FastCGI is not the fastest of them all (for that, you need to program your own Apache module), but its robustness and maintainability, in addition to very good speed, make it one of the best choices of the lot.

    Here are some of the advantages I've seen:

    • Low impact on the server. FastCGI processes run independently of the server; they communicate with the server via a protocol, although if you use one of the programmer's interfaces you never really have to worry about that. To the programmer, a FastCGI program looks very similar to a CGI program, except that it has all the advantages of persistence.

      Since FastCGI processes are independent of the server, they are less likely to weigh the server down with a heavy processing load, and buggy FastCGI's are less likely to slow down or crash the server. If a FastCGI is going haywire, the problem can be diagnosed with the usual tools for analyzing the process behavior (like ps, top, Sun's proctool, etc). And FastCGI can be configured to adjust the number of running processes to fit the load.

      In contrast, technologies like mod_perl or PHP, which are embedded in the server, place an extra load on the server itself. It increases their memory footprint (especially in the case of mod_perl), which can be very problematic when Apache forks extra servers to handle request spikes -- you run the risk of running out of memory. They can make the web servers start up more slowly, and if one of your programs has gone on the blink, it can adversely impact the servers themselves. And embedded programs are not as easy to debug as independent processes.

      In the case of servlets, since they all run as threads within a JVM, then if one of them is buggy or slow, it's not easy to find out which one is causing the problem. Usually you just notice that your JVM has slowed down, deteriorating everything else; then you have to go about finding out which thread is responsible.

    • No commitment to a particular programming language. This takes away one of the most contentious debating points concerning mod_perl, PHP, Java servlets, and other technologies like mod_pyapache and the Apache API. Each of them requires a specific language, and hence a discussion of the various approaches often deteriorates into a language flamewar. More to the point, many programmers simply cannot use one paradigm or the other because they don't know its "proprietary" language very well. And once you've started, you're locked in; you may have to think twice about hiring a perfectly good new programmer, because your candidate doesn't happen to be fluent in the language you've chosen.

      None of these problems come up with FastCGI. You can write FastCGI processes in whatever language you like, as long as you honor the protocol. And there are re-usable FCGI interfaces for C, C++, Java, Perl, Python and TCL.

    • With Perl's CGI::Fast, the FCGI program can also run as conventional CGI or from the command line. Having just praised FCGI's language independence, I do have to mention this advantage of the Perl interface. It makes FastCGI processes much easier to debug than, say, a mod_perl handler.
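    The persistence advantage in the list above can be illustrated with a toy comparison. This is a simulation only: it does not speak the FastCGI protocol or use any FastCGI library, and the `App` class and both serve functions are hypothetical stand-ins.

```python
class App:
    """Stands in for a program with expensive startup (config, DB connection)."""

    startups = 0  # how many times the startup cost has been paid

    def __init__(self):
        App.startups += 1

    def handle(self, path: str) -> str:
        return f"200 OK {path}"


def serve_cgi_style(paths):
    # Plain CGI: a fresh process (modelled here as a fresh App) per request.
    return [App().handle(p) for p in paths]


def serve_persistent(paths):
    # FastCGI-style: start once, then loop over incoming requests.
    app = App()
    return [app.handle(p) for p in paths]
```

    Both functions return the same responses, but the CGI-style version pays the startup cost once per request while the persistent version pays it once in total.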


    I personally happen to like Perl a lot, and I very much like the idea of mod_perl. Programming to the Apache API with Perl is way cool, and so many Perl programmers fall all over themselves praising it as a panacea. But because of the memory impact on the server, I have found it very difficult to implement mod_perl so that the server is stable and doesn't eat up all my RAM. It can be done, with a lot of effort on the part of the web server administrators, but it's certainly a lot harder than it is with FastCGI.

    And for the record, I do recognize the strengths of the various other techniques that I've mentioned as well. They all deserve their status as highly respected technologies for server-side programming, but FastCGI ranks up there with them in quality and deserves more attention than it's been getting.
  4. Some links pertinent to this comment by Steven+Pulito · · Score: 4
  5. Read O'Reilly's "Web Performance Tuning" by rambone · · Score: 4
    This book is really quite good considering how early it came out relative to the maturity of most high volume sites.

    Most points mentioned here are covered in detail in this book.

  6. photo.net & ArsDigita by Anonymous Coward · · Score: 5

    Philip & Alex's Guide to Web Publishing and the Web Tools Review are some good sources of information on this topic. Both can be easily found at http://www.photo.net/. Philip Greenspun, who is the creator of photo.net and wrote the Guide to Web Publishing, also is the founder of ArsDigita. ArsDigita does web dev consulting and offers a free, open source toolkit for building robust, high-utilization sites. The previous poster directed you to a good info source, I'm not sure why they were rated down to 0...

  7. Big Secrets Given Away... by Matts · · Score: 5
    I'm going to give away the big secret of this:
    There are no shortcuts
    Wow - amazing huh? There are some things you can do, like not using spawning CGI scripts (which you're not) and using persistent database connections (which you are), but short of that there's no shortcut. That's not to say there's nothing you can do though:
    • Ignore your application server vendor. They have to pass on some of the cost to Oracle, and they don't really manage Amazon.com with their product - but they probably do some small part of it so they can say that legally. I'm willing to bet that it's the most unreliable part of Amazon.com.
    • Use well known, well respected, and evolved tools. These include things like mod_perl, Apache, and Oracle. Java servlets are getting there (but you saw that they don't scale fantastically, and their JDBC drivers are much slower than Perl's equivalent), but they just aren't that fast yet on large projects. AOLServer also looks like a fairly nippy option, but you need to use tcl to program it AFAIK.
    • Tune your database. This can't be stressed enough. It may take the rest of your life, but do it anyway. And if you can't do it, then hire a professional. These guys are expensive though - but you get what you pay for in this respect.
    • Split up your hardware. A separate DB and Web server can increase your application's speed no end due to removing contention for resources.
    • Cache! Cache whatever you can. If using something like mod_perl then stick the "Oops" proxy server in front of it to cache page accesses (there are good reasons why this speeds things up). Cache stuff in your server's ram. Cache stuff in shared memory.
    • Be ready to spend. Running a fast, high-traffic web site is expensive. There are no ifs or buts about this unless you don't mind downtime. PhilG of "Philip and Alex's" fame estimates something like $100,000+++ a year to run a web site like this, taking into account Oracle costs, support, DBA costs (yes, you do need one), hardware and network costs.
    And read "Philip and Alex..." - even if you only get the web version - somewhere off http://photo.net. He debunks the myths of application servers and reducing the costs and time of development of this sort of thing. And read "The Mythical Man Month" - that also debunks the idea of reducing the time to develop complex things.
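    The persistent database connections mentioned at the top of this comment come down to opening a connection once per server process and reusing it for every request. A rough sketch, using Python's built-in sqlite3 module as a stand-in for MySQL or Oracle; the function names and the in-memory database are assumptions.

```python
import sqlite3

_connection = None  # one connection per server process, reused across requests


def get_connection(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database on first use, then keep handing back the same handle."""
    global _connection
    if _connection is None:
        _connection = sqlite3.connect(path)
    return _connection


def handle_request(sql: str, params=()):
    # No connect/disconnect per request: only the query itself is paid for.
    return get_connection().execute(sql, params).fetchall()
```

    A spawning CGI script would instead pay for process startup and a new database connection on every single hit.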

    Good Luck!

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.