Slashdot Mirror


On Building High Volume Dynamic Web Sites

kolestrol asks: "A while back I built a Web site using MySQL and Java servlets to track Kosova refugees. That experience taught me a lot. I had severely underestimated the job. I was wondering if anyone has any similar experiences, i.e. maintaining highly data-driven interactive Web sites with a high volume, and how they have managed to handle the load. Furthermore, how have they managed to handle content (site redesigns, etc.)? The reason I ask is that ever since the above-mentioned project, I have been doing a lot more research, trying to find a free Linux solution. The only thing I found was at The Linux Virtual Server Project. I don't know how the larger Web sites do it, but I assume they evolved in stages to add their current features. What kind of design decisions are made when designing such sites?"

"Apart from this I have been talking to commercial vendors like BEA (I was very impressed) who provided application servers with load-balancing, replication, etc., starting at $20,000 (Australian) -- they run sites like Amazon.com, Qwest, Wells-Fargo etc.

There is an issue here (is there? I don't have any experience to really know hence am asking you) ... I can build a custom solution with load balancing written at the application level. But how does this affect my maintainability (for example Amazon.com moving from just books to all sorts of other stuff .. how long did it take to redesign the site etc.)?

The site I first built could potentially hold information about a million refugees, and allowed searching on most fields regarding information on a person (wildcard queries). Unfortunately, on doing some stress testing (with around 700,000 records) I found that at most 15 hits could be handled every ten seconds. I optimized the code, switched to a faster JDBC driver, wrote a simple load balancer (and I mean very simple), limited searching to a few fields, and prevented bad wildcard queries (e.g., a wildcard at the start would make little if any use of the index). Consequently, I managed to get the system to handle slightly more load (200 hits in 5 seconds). (Hardware was a dual Pentium II 450MHz I think, 512MB RAM, 2x8G Ultra-wide SCSI hard drives, and running Linux of course.) BTW, the Kosova refugees articles have a lot of misinformation, e.g. encrypted databases, and the time to actually build it was one week (and two weeks of overcoming red tape, etc.)."
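
The wildcard restriction described above can be sketched roughly as follows. This is only an illustration, not the submitter's servlet code: the function name `is_index_friendly` and the `min_prefix` threshold are assumptions, and the check ignores escaped wildcard characters.

```python
def is_index_friendly(pattern: str, min_prefix: int = 2) -> bool:
    """Return True if a SQL LIKE pattern begins with enough literal
    characters for an index on the column to narrow the search.

    A pattern such as '%son' starts with a wildcard, so the database
    cannot use the index to skip rows and must scan the whole table.
    `min_prefix` is a hypothetical tuning knob, not from the post.
    """
    prefix_len = 0
    for ch in pattern:
        if ch in "%_":  # LIKE wildcards: any run of chars / one char
            break
        prefix_len += 1
    return prefix_len >= min_prefix
```

A search form would reject (or ask the user to refine) any pattern for which this returns False before the query ever reaches the database.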

16 of 222 comments

  1. Dynamic High Traffic Site? Slashdot! by bert · · Score: 3

    Slashdot would be the obvious example, right? So ask CmdrTaco and his crew, and take care to download and fiddle with Slash first!

  2. Resin by JohnZed · · Score: 3

    Ok, I promise I don't work for these guys, but I'd have to highly recommend Resin from www.caucho.com. It's open source and amazingly fast. We can serve dozens of requests per second of a reasonably complicated site on a crappy $400 Linux PC.
    Also, what JVM are you using? Definitely try the newest Sun (with Inprise JIT, must be downloaded separately) for a single-processor system and IBM's jdk for an SMP box.
    For App Servers, you should check out WebSphere from IBM (not too expensive, relatively speaking). Also, TowerJ can DRAMATICALLY speed up Linux server-side Java and improve scalability (towerj.com, I believe), but it ain't cheap.
    Good luck!
    --JRZ

  3. Headers and Tiers by elliot · · Score: 3

    Be sure to separate dynamic and static content. Make sure the servers that are hitting the database and generating content are not wasting CPU and memory serving images.
    Make sure that you use HTTP headers to your advantage. By setting these correctly, you can benefit from caching on the client end, as well as on a front-end server (Squid, mod_proxy, etc.).
    Find a tool that lets you separate code and HTML. Perl modules work great in conjunction with mod_perl and one of the embedded Perl modules (HTML::Mason, HTML::Embperl, ePerl, etc.). However, many tools, such as Java Server Pages, allow you to do similar things. The key is to be consistent and keep the logic separated from the layout and display.
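
    As a rough illustration of the HTTP header advice above, here is a minimal sketch that builds the caching headers for a response. The helper name `cache_headers` and its parameters are assumptions made for the example; `Cache-Control` and `Expires` are the standard HTTP headers that browsers and front-end caches look at.

```python
import time
from email.utils import formatdate
from typing import Dict, Optional


def cache_headers(max_age_seconds: int, now: Optional[float] = None) -> Dict[str, str]:
    """Build response headers telling clients and proxies how long a page may be cached.

    `now` is injectable only so the function is easy to test.
    """
    now = time.time() if now is None else now
    if max_age_seconds <= 0:
        # Dynamic, per-user pages: force revalidation on every request.
        return {
            "Cache-Control": "no-cache",
            "Expires": formatdate(now, usegmt=True),
        }
    # Static or slowly changing pages: any cache may keep a copy.
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
```

    Images and stylesheets might get a max age of a day or more, while a search results page would get zero so it is never served stale.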

    HTH, Aaron

  4. My two approaches... One good, one bad (?) by Sun+Tzu · · Score: 3

    In Starshiptraders.com (a game), I wrote the entire thing in C -- there isn't even a copy of Apache involved. That system serves about 140,000 pages per day, and at peak times, the single-cpu (Celeron 366/320MB) is about 99% idle. All the files will fit in memory -- I only have about 120MB of data, but it is frequently intensively used. The data is all stored in flat files with pointers to related records in the same and other files. All data is, therefore, directly addressable and there is not much in the way of wasteful I/O. I can't really recommend this approach though for the obvious development and maintainability problems that it entails. ;)

    My other dynamic site project, SiteReview.org (user-posted website reviews), is written in PHP, serves pages with Apache, and has a MySQL back end data store. The /. folks are proof that MySQL and Apache are up to the task for some serious work. I have gone to some trouble to minimize the number of database calls and will work a bit more to minimize the size of returned pages. Each of my tables is indexed on the (very few) columns that are used to access it, so I get no full table scans. PHP can be compiled as a module for Apache, eliminating the startup overhead and resulting in quite efficient processing. PHP is also very easy to work with.

    Currently that site is a work in progress with very small volume and I therefore have no evidence yet that I did anything right ;). (Anyway, it's at a hosting provider who is not optimized for PHP -- they call a php executable for each of my pages. If the need should arise, I will move it.) However, I think that this approach is a good balance between maintainability, efficiency and scalability. You can start with a single system and, when the load exceeds the capabilities of the box, you can easily offload the database onto a dedicated database server and put up multiple webservers on the front end with DNS round-robin or somesuch.

  5. My site by noeld · · Score: 3
    My site RootPrompt.org -- Nothing but Unix is written in php3 with a MySQL database backend.

    I have worked to minimize the database calls and keep the pages as small as I can. By minimizing the database calls I give myself more room to grow before I start needing more hardware, and by keeping the pages small I make the site friendlier to slow connections and make better use of my bandwidth. I think that if you are waiting for something to download it should be what you want (content) not fluff.

    I have added features slowly as I have gotten them working. Comments, user logins, syndication pages, etc. I think that if you get a good idea get it online and then work to make it better.

    I think you should always keep in mind that anything cool may soon be much bigger, so write a site that is cool when ten people use it and is still cool (and fast) when ten thousand people (or more) are using it.

    I would also recommend setting things up so that your content can be syndicated and shared on other sites.

    RootPrompt.org's headlines, for example, can be had in Netscape's RSS format at:
    http://rootprompt.org/rss/

    and in text format at:
    http://rootprompt.org/rss/text.php3

    Doing this will allow you to share the content that you create with the world without requiring a lot of machine on your end.

    Noel

    RootPrompt.org -- Nothing but Unix

  6. JServ and Apache by Hairy+Fop · · Score: 3

    If you're using JServ and Apache, then you can use multiple front-end servers (e.g. with round-robin DNS) and multiple back-end JServ engines.

    If you want to cluster multiple SQL servers, then you can have multiple read only mysql servers and one write mysql server which updates the other mysql servers from the update log.
    Your DB code would have to be aware of the read and write DB servers.
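
    A minimal sketch of what "aware of the read and write DB servers" could look like in application code. The class name `ReplicatedRouter` and the server names in the usage note are hypothetical, and the statement classification is deliberately crude. Because the read-only servers are fed from the update log, a read issued right after a write may see slightly stale data.

```python
import itertools


class ReplicatedRouter:
    """Send writes to the single write server and spread reads over the replicas."""

    READ_VERBS = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, writer, readers):
        self.writer = writer
        # Fall back to the writer if no read-only servers are configured.
        self._readers = itertools.cycle(list(readers) or [writer])

    def server_for(self, sql: str):
        """Return the server that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.READ_VERBS:
            return next(self._readers)  # simple round-robin over replicas
        return self.writer
```

    For example, with `ReplicatedRouter("db-write", ["db-read1", "db-read2"])`, successive SELECT statements alternate between the two read servers while every INSERT or UPDATE goes to `db-write`.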

    As for coping with changes: if it's the interface that changes, then you need to design your architecture to separate application logic from interface logic,
    e.g.
    Servlet (Java) code handling DB interaction and application-type logic, and an intelligent templating language (XSL) handling the interface.

  7. Optimization, Scalability, and AOLServer by slazlo · · Score: 3

    About 6 months ago I began development on a site that I wanted to be scalable to the extreme. After research into which tools best fit my job I decided on AOLServer for many reasons, including: multithreaded vs. forking architecture, persistent db connections, shared memory space, proven track record, and simple implementation, including embedded Tcl in pseudo-ASP-like pages. Fortunately, since then AOLServer has become Open Source under the GPL, allowing my complete architecture to rely on only Open Source tools: Linux, Postgres, AOLServer, Postfix, and more. I believe AOLServer to be one of the best kept secrets as far as Open Source tools out there. A company named ArsDigita has an Open Source toolkit designed for building online communities, and online forums for any problems in case you get stuck. I would write more but it's 8AM and I haven't been to bed yet ... maybe when I get up after noon ;)

  8. Some links on scalability, optimization by slazlo · · Score: 3
    Before I sleep here are some useful links I found when I was investigating which tools to use:

    http://www.acme.com/software/thttpd/benchmarks.html

    http://www.cs.wustl.edu/~jxh/research/

    http://photo.net/wtr/thebook/server.html

    http://aolserver.com/features/

    http://www.aolserver.com/tcl2k/html/index.htm

    http://www.linux-ha.org/

    http://www.linuxvirtualserver.org/

    http://www.citi.umich.edu/projects/citi-netscape/reports/web-opt.html

    http://linuxperf.nl.linux.org/

    http://www-4.ibm.com/software/developer/library/java2/index.html

    http://www.squid-cache.org/

    Hopefully this may be of help to you also. After all this research I was very pleased to go with AOLServer. Although it was not in the web server comparison at the thttpd site, its model was represented there by Zeus and thttpd, and AOLServer has many additional features that really sold me.

  9. Re:Tips by Tassach · · Score: 3
    I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.


    In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables it sounds like MySQL isn't ready for prime time yet. Dodgy workarounds are no substitute for a quality DB server. My personal preference, Sybase ASE 11.0.3.3 for Linux, is available with a zero-cost license for both production and development deployments. Sybase ASE 11.9.2 has some significant improvements over 11.0 and is zero-cost for development only. (Unless you REALLY need row level locking, 11.0 will probably meet your needs.)

    Question - are you splitting between rows or between columns? If you are having to split between rows, the problem is most likely resulting from an inappropriate clustered index. In an insert-intensive database, a bad clustered index will result in a hot spot in the last data and/or index page of the database. The best solution here is to cluster on a surrogate key. Your surrogate key generation algorithm needs to be carefully designed to distribute inserts evenly in the table. If you are doing primarily single-row updates and inserts, you should only be seeing page-level locks. If you are updating records frequently, try and use only fixed-length datatypes (or at least only update the fixed-length fields); this allows in-place updates. You should avoid indexing frequently updated fields, if possible.

    Database design is an art. Ditto for performance tuning. An expert DBA is worth his/her weight in gold. There's a good reason top DBAs command top rates :-)

    "The axiom 'An honest man has nothing to fear from the police'
    --
    Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
  10. Tips by Anonymous Coward · · Score: 4
    We use Apache/PHP/MySQL to serve out about a million dynamic pages per day. Here are some general tips that hopefully will help, although a further understanding of how your data is structured would help.

    Cache your dynamic data
    I cannot emphasize enough how important this is. If your site pulls up a page of 50 records, each of which pulls in other database info, create a cache for the whole page. Then create a cache for each record as well. Cache elements as well as entire pages of dynamic data into separate cache tables in the database wherever possible.
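
    The idea can be sketched as a small time-based cache. Note the swap: the comment above stores cached pages in database tables, while this sketch keeps them in an in-process dictionary to stay self-contained. The class name `TTLCache`, the key format and the injectable `clock` are all assumptions for illustration.

```python
import time


class TTLCache:
    """Cache rendered pages or records for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (expires_at, value)

    def get_or_build(self, key, build):
        """Return the cached value for `key`, calling `build()` only on a miss."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = build()  # the expensive part: queries plus rendering
        self._store[key] = (now + self.ttl, value)
        return value
```

    The same object can hold whole pages (key `"page:42"`) and individual records (key `"record:7"`), so a page rebuild reuses any records that are still fresh.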

    Split your servers
    Separate your database and Apache servers. They should communicate with each other through a 100Mb network switch at the least (make sure you use full duplexing as well).

    Split your tables
    I know this sounds funny, but the current version of MySQL has a tendency to lock tables when doing writes to the table. This means that one update on a table can halt all other reads on the table. MySQL is very fast, so normally this isn't much of a problem, but when you start getting into a high volume of requests, you're going to start to get bogged down. The solution to this is to split your table into multiple tables, i.e. the table mydata becomes mydata_1, mydata_2, mydata_3, etc, where for example mydata_1 might hold records starting with a-e, mydata_2 is for f-h, etc. This might sound tricky but it will save you a lot of trouble. I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.
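
    A rough sketch of how application code might pick the right table under this scheme. Only the first two ranges (a-e and f-h) come from the example above; the remaining boundaries, the `_other` catch-all table and the function name `table_for` are assumptions.

```python
import bisect

# Inclusive upper bound of the first letter handled by each table.
# "e" and "h" follow the example in the comment; "p" and "z" are made up.
UPPER_BOUNDS = ["e", "h", "p", "z"]


def table_for(key: str, base: str = "mydata") -> str:
    """Map a record key to the split table that stores it."""
    first = key[:1].lower()
    if not ("a" <= first <= "z"):
        # Digits, punctuation, empty keys: hypothetical catch-all table.
        return f"{base}_other"
    index = bisect.bisect_left(UPPER_BOUNDS, first)
    return f"{base}_{index + 1}"
```

    Every query then builds its table name through this one function, so a write to `mydata_2` locks only the keys starting with f-h and leaves readers of the other tables alone.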

    Your MySQL server
    Should have a ton of memory plus a fast disk. Preferably a gig or more for memory, use RAID or a fast (10,000 RPM+, 6ms or less seek time) SCSI disk for the data, also keep it separate from your OS disk (i.e. don't have your OS running on the same disk as your MySQL tables). Save the high MHz processors for...

    Your Apache servers
    We've found that the really processor intensive stuff happens on the Apache servers. So you should keep an eye on the load average, etc. on these servers. If they start to get bogged down, you can just pop in another Apache server and split the load.

    The nice thing about this setup is that you can keep adding Apache servers as your processing needs go up. From our own experience MySQL is not very processor intensive but very dependent on memory and disk speed. When using MySQL with Linux, you also have to be careful about file system limitations, handlers etc. Tweaking certain variables should help. Good luck!

  11. Slashdot is terrible example by Raul+Acevedo · · Score: 4
    Slashdot is the most unreliable site I visit on a regular basis. Throughout the day, page loads can take several seconds; sometimes not, sometimes longer, sometimes not accessible at all. This has been the case both in the East Coast and the West Coast (I just moved from one to the other). Note that in the East Coast, I had an extremely fast cable modem that was faster than most T1 connections I've had at actual companies, and in the West Coast, it's been through actual T1s. Also, I compare to other sites at the same time that Slashdot is slow.

    No, I'm not flaming Slashdot; I love everything else about the site. But its accessibility unfortunately didn't improve with the Andover.net takeover, nor through any of the other changes that have been happening in the last two years.

    I'm sure other people's mileage will vary, I'm interested in hearing other people's experience.
    ----------

    --
    In a real emergency, we would have all fled in terror, and you would not have been notified.
  12. FastCGI is the unsung hero by Get+Behind+the+Mule · · Score: 4
    The fashionable technologies for persistent server-side programming for the Apache server these days are mod_perl, PHP and Java servlets, but I've had very good experience with another one that doesn't seem to be so "hip": FastCGI. FastCGI is not the fastest of them all (for that, you need to program your own Apache module), but its robustness and maintainability, in addition to very good speed, make it one of the best choices of the lot.

    Here are some of the advantages I've seen:

    • Low impact on the server. FastCGI processes run independently of the server; they communicate with the server via a protocol, although if you use one of the programmer's interfaces you never really have to worry about that. To the programmer, a FastCGI program looks very similar to a CGI program; except that it has all the advantages of persistence.

      Since FastCGI processes are independent of the server, they are less likely to weigh the server down with a heavy processing load, and buggy FastCGI's are less likely to slow down or crash the server. If a FastCGI is going haywire, the problem can be diagnosed with the usual tools for analyzing the process behavior (like ps, top, Sun's proctool, etc). And FastCGI can be configured to adjust the number of running processes to fit the load.

      In contrast, technologies like mod_perl or PHP, which are embedded in the server, place an extra load on the server itself. They increase the server's memory footprint (especially in the case of mod_perl), which can be very problematic when Apache forks extra servers to handle request spikes -- you run the risk of running out of memory. They can make the web servers start up more slowly, and if one of your programs has gone on the blink, it can adversely impact the servers themselves. And embedded programs are not as easy to debug as independent processes.

      In the case of servlets, since they all run as threads within a JVM, then if one of them is buggy or slow, it's not easy to find out which one is causing the problem. Usually you just notice that your JVM has slowed down, deteriorating everything else; then you have to go about finding out which thread is responsible.

    • No commitment to a particular programming language. This takes away one of the most contentious debating points concerning mod_perl, PHP, Java servlets, and other technologies like mod_pyapache and the Apache API. Each of them requires a specific language, and hence a discussion of the various approaches often deteriorates into a language flamewar. More to the point, many programmers simply cannot use one paradigm or the other because they don't know its "proprietary" language very well. And once you've started, you're locked in; you may have to think twice about hiring a perfectly good new programmer, because your candidate doesn't happen to be fluent in the language you've chosen.

      None of these problems come up with FastCGI. You can write FastCGI processes in whatever language you like, as long as you honor the protocol. And there are re-usable FCGI interfaces for C, C++, Java, Perl, Python and TCL.

    • With Perl's CGI::Fast, the FCGI program can also run as conventional CGI or from the command line. Having just praised FCGI's language independence, I do have to mention this advantage of the Perl interface. It makes FastCGI processes much easier to debug than, say, a mod_perl handler.


    I personally happen to like Perl a lot, and I very much like the idea of mod_perl. Programming to the Apache API with Perl is way cool, and so many Perl programmers fall all over themselves praising it as a panacea. But because of the memory impact on the server, I have found it very difficult to implement mod_perl so that the server is stable and doesn't eat up all my RAM. It can be done, with a lot of effort on the part of the web server administrators, but it's certainly a lot harder than it is with FastCGI.

    And for the record, I do recognize the strengths of the various other techniques that I've mentioned as well. They all deserve their status as highly respected technologies for server-side programming, but FastCGI ranks up there with them in quality and deserves more attention than it's been getting.
  13. Some links pertinent to this comment by Steven+Pulito · · Score: 4
  14. Read O'Reilly's "Web Performance Tuning" by rambone · · Score: 4
    This book is really quite good considering how early it came out relative to the maturity of most high volume sites.

    Most points mentioned here are covered in detail in this book.

  15. photo.net & ArsDigita by Anonymous Coward · · Score: 5

    Philip & Alex's Guide to Web Publishing and the Web Tools Review are some good sources of information on this topic. Both can be easily found at http://www.photo.net/. Philip Greenspun, who is the creator of photo.net and wrote the Guide to Web Publishing, also is the founder of ArsDigita. ArsDigita does web dev consulting and offers a free, open source toolkit for building robust, high-utilization sites. The previous poster directed you to a good info source, I'm not sure why they were rated down to 0...

  16. Big Secrets Given Away... by Matts · · Score: 5
    I'm going to give away the big secret of this:
    There are no shortcuts
    Wow - amazing huh? There are some things you can do, like not using spawning CGI scripts (which you're not) and using persistent database connections (which you are), but short of that there's no shortcut. That's not to say there's nothing you can do though:
    • Ignore your application server vendor. They have to pass on some of the cost to Oracle, and they don't really manage Amazon.com with their product - but they probably do some small part of it so they can say that legally. I'm willing to bet that it's the most unreliable part of Amazon.com.
    • Use well known, well respected, and evolved tools. These include things like mod_perl, Apache, and Oracle. Java servlets are getting there (but you saw that they don't scale fantastically, and their JDBC drivers are much slower than Perl's equivalent), but they just aren't that fast yet on large projects. AOLServer also looks like a fairly nippy option, but you need to use Tcl to program it AFAIK.
    • Tune your database. This can't be stressed enough. It may take the rest of your life, but do it anyway. And if you can't do it, then hire a professional. These guys are expensive though - but you get what you pay for in this respect.
    • Split up your hardware. A separate DB and Web server can increase your application's speed no end due to removing contention for resources.
    • Cache! Cache whatever you can. If using something like mod_perl then stick the "Oops" proxy server in front of it to cache page accesses (there are good reasons why this speeds things up). Cache stuff in your server's ram. Cache stuff in shared memory.
    • Be ready to spend. Running a fast, high-traffic web site is expensive. There's no ifs nor buts about this unless you don't mind downtime. PhilG of "Philip and Alex's" fame estimates something like $100,000+++ a year to run a web site like this, taking into account Oracle costs, support, DBA costs (yes, you do need one), hardware and network costs.
    And read "Philip and Alex..." - even if you only get the web version - somewhere off http://photo.net. He debunks the myths of application servers and reducing the costs and time of development of this sort of thing. And read "The Mythical Man Month" - that also debunks the idea of reducing the time to develop complex things.

    Good Luck!

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.