Building a Scalable Apache Site?

Slashdot by Anonymous Coward · 2002-06-13 10:45 · Score: 4, Funny

Post the URL, not just the question, and let the /. effect take its toll.
Increase the threads etc. until it stays up.

Whew! by dalutong · 2002-06-13 10:52 · Score: 0, Offtopic

I've been looking for pointers to what other people have learned (either the easy way or hard way).

Wow. People are really picky today. Back when I was a young'n we didn't care whether what we learned was learned the easy way or the hard way.

Now these tooty stooty youngsters are picky about what their elders have learned! Damn it! They should be happy we let them KNOW about our past apache experiences...

Now let me tell you the story of me and my Apache friend...

:)

--

What comes first, finding a teacher or becoming a student?

Persistent Connections Are Your Friend by Aix · 2002-06-13 10:57 · Score: 5, Informative

Just in case you haven't thought about this, for a database-backed website, getting rid of the database connection overhead is just about the smartest thing you can do performance-wise. Think mod_perl. Furthermore, consider moving your SQL server to another machine before making any other hardware changes. (If you haven't already...) The demands of an HTTP server are definitely different than those of a SQL server. If you're going to have a lot of dynamic content, plus a decent number of SSL requests, think about putting a proxy in front of your page-generating server. I know these aren't Apache tweaks, but they're worth considering anyway.

--
Nonperiodic Central Trajectory

Re:Persistent Connections Are Your Friend by Anonymous Coward · 2002-06-13 11:16 · Score: 0

Exactly. Unless each page needs to be live it should be cached somehow.

Cache, Cache, Cache by Longstaff · 2002-06-13 11:13 · Score: 5, Insightful

is *the* most important word around for dynamic sites.

I've built a site that's able to handle 1-2 million dynamic page views per day. There's not a single static page on the whole site except for the 404 page.

/. doesn't generate these pages on the fly, they're generated by a background process that runs every minute or so and stored as a file. There's no reason to requery the database if you don't have to.

One trick that we currently use is a little daemon that runs on our app servers (custom java app). It's essentially a tcp socket interface to a hashtable with an expiration timestamp. Here's how the site works:

request comes in
front end server takes GET params and queries the local cache daemon to see if those objects are local
if the objects are local - great - slap them together and deliver the page, otherwise
query the database for the object info
populate the cache daemon
deliver the page
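
For illustration, the flow above is roughly the cache-aside pattern. A minimal Python sketch follows; it is not the Java daemon described here. The dict standing in for the cache daemon and the `load_from_db`, `get_object` and `render_page` names are all hypothetical:

```python
# Hypothetical stand-ins: a dict plays the local cache daemon,
# and load_from_db() plays the database query.
cache = {}
db_hits = 0

def load_from_db(key):
    """Pretend database lookup; counts how often it is called."""
    global db_hits
    db_hits += 1
    return f"<div>object {key}</div>"

def get_object(key):
    """Return the object from the cache, falling back to the database."""
    if key in cache:             # object is local: use it
        return cache[key]
    obj = load_from_db(key)      # otherwise query the database
    cache[key] = obj             # populate the cache
    return obj

def render_page(keys):
    """Assemble the cached objects into one page."""
    return "\n".join(get_object(k) for k in keys)

page = render_page(["story:1", "poll:7"])
page_again = render_page(["story:1", "poll:7"])  # no database hits this time
```

The second call is served entirely from the cache, which is the whole point: the database only sees the first request for each object.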

Another trick we use is dumping the output from one dynamic page to be included by another. So, have a page that generates nothing but an element (eg. slashbox). Have a mechanism on the back end that requests that page and stores the result as a text file. The dynamic page (say, php or jsp) just uses an include directive pointing to the static text file - which can be formatted html.

Of course, the real weak point of the system (without clustering) is the database. Make sure that your data is indexed properly and that your queries are optimised. We have 2 tables with over a million rows each that get hit all the time. Proper data layout, quick queries and the local caches help our puny dual P3-733 (NON xeon) with a paltry 1GB of RAM dish out well over a million dynamic pages per day.

Re:Cache, Cache, Cache by Old+Uncle+Bill · 2002-06-13 11:44 · Score: 1

Good design, but how do you keep the cache valid?

--
Yes, I am an agent of Satan, but my duties are largely ceremonial.
Re:Cache, Cache, Cache by Longstaff · 2002-06-13 12:02 · Score: 2

I'm not sure I understand the question, but I'll take a crack.

The local cache is simply time based. Each element in the cache has its own expiration time and part of the API allows you to specify a TTL for each element. The element's timestamp is checked against its TTL with every request - if it's expired, the daemon deletes the element and simply reports that it couldn't find the object.

Another reactive behavior of the daemon is that it will call a trim() (which walks through the hashtable and purges any expired objects that simply haven't been requested since they turned "sour") whenever the hashtable grows to a specified max size. There's some additional logic that keeps trim() storms from occurring.

On a proactive side, the daemon itself does some housekeeping. After X seconds (we have it set to about 2 hours) it trim()'s itself.
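
A rough Python sketch of that expiry behavior, under stated assumptions: the `TTLCache` class, its method names and the injectable `clock` are invented for illustration, the 30 second poll TTL is a made-up value, and the trim-storm guard and the periodic self-trim are left out.

```python
import time

class TTLCache:
    """Hashtable with a per-element expiration time (illustrative only)."""

    def __init__(self, max_size=1000, clock=time.monotonic):
        self._items = {}            # key -> (value, expires_at)
        self._max_size = max_size
        self._clock = clock

    def put(self, key, value, ttl):
        # Reactive housekeeping: purge expired entries once the table is full.
        if len(self._items) >= self._max_size:
            self.trim()
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]    # expired: delete and report "not found"
            return None
        return value

    def trim(self):
        """Purge every expired entry; return how many were removed."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._items.items() if now >= exp]
        for k in expired:
            del self._items[k]
        return len(expired)

now = [0.0]                                   # fake clock for the demo
cache = TTLCache(max_size=2, clock=lambda: now[0])
cache.put("story:1", "news story", ttl=7200)  # stories rarely change
cache.put("poll:7", "poll box", ttl=30)       # polls change constantly
now[0] = 60.0                                 # one minute later
poll = cache.get("poll:7")                    # expired, so reported as missing
story = cache.get("story:1")                  # still cached
```

Note that this sketch never evicts live entries, so a table full of unexpired objects can grow past `max_size`; a real daemon would need a policy for that case.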
Re:Cache, Cache, Cache by DevilM · 2002-06-13 12:21 · Score: 1

That's no good. What happens when your page is being built based on data coming from the URL query string or form variables or cookies or session state, etc. Do you just not cache those pages?

Clearly you have to index the cache based on the composite of the input data the page requires to process.
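
One way to build such a composite key, sketched in Python. The `cache_key` function and its parameters are hypothetical, and it assumes every input is a flat mapping of strings:

```python
import hashlib

def cache_key(path, query=None, cookies=None, session=None):
    """Derive one cache key from everything the page output depends on."""
    parts = [path]
    for inputs in (query, cookies, session):
        # Sort so that the same inputs in a different order give the same key.
        parts.append(repr(sorted((inputs or {}).items())))
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

key_a = cache_key("/report", query={"id": "7", "sort": "date"})
key_b = cache_key("/report", query={"sort": "date", "id": "7"})
key_c = cache_key("/report", query={"id": "8", "sort": "date"})
```

Here `key_a` and `key_b` match while `key_c` differs. The more inputs go into the key, the lower the hit rate, so it pays to include only the inputs that actually change the output.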
Re:Cache, Cache, Cache by Old+Uncle+Bill · 2002-06-13 12:52 · Score: 1

And the other part is, what happens when the data in the database changes before the cache TTL expires?

--
Yes, I am an agent of Satan, but my duties are largely ceremonial.
Re:Cache, Cache, Cache by Longstaff · 2002-06-13 13:07 · Score: 2

Well, we don't actually cache whole pages. The closest we come is our front page, where the entire page is generated, stored as a text file and included except for the header where your login name appears.

If you were to log into our site, your name is displayed on *each* page - how would you cache that whole page effectively?

We cache objects and have the web servers assemble the objects in a page on the fly. So, a news story is an object, a poll box is an object, etc.

One reason for this is different TTLs. Our news stories don't change that often - if ever - once they've been published. A news story TTL may be set to 1 or 2 hours, while our polls are constantly changing and need a much shorter TTL.

Our main goal with this design was to ease our database load. Scaling a database up is *expensive* (Oracle quoted us $250,000 - for the one box) and complex once you start moving to clusters, etc. Scaling the front end is simple - add another server behind the load balancer. We currently have 6 web servers for redundancy and general zippyness, but our load can be handled by 2 or 3 of those.
Re:Cache, Cache, Cache by Longstaff · 2002-06-13 14:14 · Score: 3, Informative

Well, this is a problem with any cache system. With ours, you adjust the TTL to an acceptable value of "staleness" down to the second.

When optimising any system, relaxing granularity is something that you should look at. Do I really need the latest version of the news story up to this very second - or can I deal with one that's a minute or more old. In our case, the news stories are edited and reviewed before they're published, so it doesn't matter if the story is 1 minute old or 10 days old.

In an emergency, we can forcibly expire an element.

There are cases on our site where we can't cache the data - we *need* the live data. Those cases are scrutinized thoroughly before we actually make a live call to the db to see if there's some way to get around it. However, most of our data is cacheable and we have a hit rate of ~80%
Re:Cache, Cache, Cache by Anonymous Coward · 2002-06-14 01:47 · Score: 1, Informative

We use PostgreSQL NOTIFY facilities to accomplish this. That is, when related table information changes, we have PostgreSQL NOTIFY all webservers (via their database connection) that information has changed and they'll want to look at the new information. Often they pull a new copy of any relevant data.
Re:Cache, Cache, Cache by mmcshane · 2002-06-14 15:58 · Score: 1

HTTP 1.1 (possibly 1.0 - i don't remember) has a header field called "Vary". the contents of that header can tell caches that the content of the response will vary by certain request parameters, language encodings, etc.
Re:Cache, Cache, Cache by JWSmythe · 2002-06-16 12:54 · Score: 1

We do a variety of this on Voyeurweb. We do millions of users/day (read hundreds of millions of requests/day).. For our feedback BB's, a request for the list of messages gets built once per minute on an as-needed basis..

The CGI looks to see if the page has already been generated. If it exists, it just dumps out the pre-created page to the user. If it doesn't, it creates the page..

Ugly as it may be, we have a cron which deletes any files older than x minutes. It works very well. There are 2 machines handling the CGI's for that area, and a few other lesser functions. There is one database machine which is actively used, and a backup machine which is never hit unless the primary dies. Multiple machines are for redundancy, should I want to do something silly like take one down and play with the hardware. :)

It definitely reduced the load when we started caching the results.. It's easy for a perl script to dump out a HTML file, rather than doing several SQL queries and generating the HTML from the results...

To get an idea of the section I'm talking about, go to voyeurweb.com, look at a set of pictures, scroll to the bottom, and click "Leave a Comment for this Contri".

--
Serious? Seriousness is well above my pay grade.

Re:Persistent Connections Are Your Friend - MAYBE by Longstaff · 2002-06-13 11:24 · Score: 4, Interesting

Combined with a proper connection pool, they can really save your butt.

However, persistent connections may be too much of a burden for an overworked db server. If you're using PHP/MySQL for example, mysql_pconnect may not be the way to go if you have a few front end servers hitting the database. It seems that the PHP connection pooling limit is per process. If you have 100 Apache processes w/ a 10 connection limit per and 10 web servers, that's a max of 10,000 db connections!!!
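
The worst case above is plain multiplication. A quick check in Python, using the numbers from the example; the broker pool size of 20 per server is a made-up figure for comparison only:

```python
apache_processes_per_server = 100
connections_per_process = 10
web_servers = 10

# Every Apache process on every server may hold its full quota open.
worst_case = apache_processes_per_server * connections_per_process * web_servers

# Assumption: one shared broker per server with a fixed pool of 20.
broker_pool_size = 20
with_broker = broker_pool_size * web_servers

print(worst_case, with_broker)  # 10000 200
```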

One idea might be an intermediate "connection broker" on a per server basis. We use something similar to this.

Apache's fork() model is great for stability, but it really hinders interprocess resource sharing. We're mostly Java based here, which allows us to use beans and such. Does mod_perl allow for resource sharing between processes?

Look in the right place by linuxwrangler · 2002-06-13 12:12 · Score: 5, Informative

You have provided way too little info. First, do you really mean scalable or do you mean high-traffic? They are not the same thing. You can build a high-traffic site using technology that won't cluster/expand well and be screwed when you need a higher-traffic site. Conversely you can build a very low-traffic site that will scale quite well (a technology that allows you to easily add hardware as your traffic dictates for example).

It would be helpful to know, for example, what portion of the traffic (both # of requests and bytes) is static and what is dynamic (include images)? What is the peak (say 98th percentile) expected traffic? What are typical page sizes and how much are they compressible with gzip? etc.

Apache itself doesn't really handle dynamic content - its modules or an underlying app server do that. That is probably where you will have to do the most work.

As another poster mentioned, persistent database connections are essential. You may want to look into a "real" app server. JBoss is open source and just won some awards at Java One. If that is too much complexity at least be sure to use persistent connections in whatever other technology you select.

Persistent connections have a down side. Don't forget that your underlying database must be able to handle both your number of requests and your number of connections. If you just increase Apache processes you may find that the database is unable to manage that many simultaneous connections efficiently. Opening/closing connections for each request kills you. Maintaining hundreds of open connections kills you. This is one of the real strengths of any technology that can handle connection pooling - you will probably find that you only need a handful of connections to handle lots of front-ends and connection pooling allows you to do it efficiently. It can also help you scale by distributing connections to multiple database servers for you when your needs dictate.
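
To make the pooling idea concrete, here is a minimal Python sketch under stated assumptions: `ConnectionPool` and `connect` are hypothetical names, and the dict returned by `connect` stands in for a real database connection.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Fixed set of connections shared by many request handlers."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())   # open every connection once, up front

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until one is free
        try:
            yield conn
        finally:
            self._idle.put(conn)        # hand it back, never close it

opened = 0

def connect():
    """Hypothetical stand-in for an expensive database connect call."""
    global opened
    opened += 1
    return {"id": opened}

pool = ConnectionPool(connect, size=3)
served = 0
for _ in range(100):                    # 100 requests share 3 connections
    with pool.connection() as conn:
        served += 1                     # run the query on conn here
```

The database sees three connections no matter how many requests arrive; excess requests wait briefly for a free connection instead of opening a new one.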

The faster you can dispense with a request the better. This includes not only all your processing time but the transmission time to the client. A process/thread can't move on till the client has the data. Therefore...

Design your pages to give yourself a fighting chance. For example: if you have any static images be sure to set your http headers to prevent browsers from reloading them. Even the request overhead to the server to determine that the cached image is up-to-date is more than the size of the image itself so set a LONG expiration.
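
As an illustration of a long expiration, this Python sketch builds the two standard HTTP caching headers for a static image. The `static_cache_headers` helper and the one-year default are assumptions; in Apache itself this would normally be configured rather than coded.

```python
import time
from email.utils import formatdate

def static_cache_headers(max_age_seconds=365 * 24 * 3600, now=None):
    """Headers telling browsers to reuse a static file without asking again."""
    now = time.time() if now is None else now
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }

headers = static_cache_headers()
one_day = static_cache_headers(max_age_seconds=86400, now=0)
```

With a far-future expiry the browser skips even the "has this changed?" request. The catch is that a changed image then needs a new file name to be picked up.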

Trim unnecessary whitespace, using short names (ie. i/x.png instead of buttonimages/left_page_arrow_top.png) and so on.

If the pages are large enough and the clients slow enough then you may want to use gzip (mod_gzip) to compress the data. It will cost you processing time to compress dynamic content but will save you transmission time. If you pay for bandwidth you can see a 50-80% reduction in your bandwidth usage as well.
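
A quick way to estimate the saving for your own pages is to compress a sample with Python's standard `gzip` module. The HTML below is a made-up, highly repetitive report table, so it compresses far better than a typical real page would:

```python
import gzip

# Hypothetical sample: 1500 near-identical report rows.
row = "<tr><td class='num'>1234</td><td class='label'>report row</td></tr>\n"
raw = (row * 1500).encode("utf-8")

packed = gzip.compress(raw)
saving = 1 - len(packed) / len(raw)

print(f"{len(raw)} bytes -> {len(packed)} bytes ({saving:.0%} smaller)")
```

Run it against captured output from your real pages before deciding whether the CPU cost is worth it.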

Note: if your spec of "non-clustered" and scalable still allows multiple machines and if you do have images or other static content you may want to move that content to a separate machine. The Linux kernel http server screams on static content (of course the static-content load on your server may be so small a percentage that it isn't worth the effort).

Try Apache 2.x first. One problem with 1.x on most (all??) platforms is the "thundering herd" problem. You may try to increase performance by running lots of processes but when a request comes in, all sleeping processes are awoken (the thundering herd) and although only one will end up servicing the request, the effect of waking up huge numbers of sleeping processes can be "bad".

Be sure to test with clients of varying speed. We discovered we could crash a site faster with slow clients than fast. Once while testing a Cold Fusion/IIS site it seemed like we could really get some screaming throughput when testing on the LAN. Unfortunately when the server had to keep threads/connections alive long enough to service slow clients it wasn't so pretty. When we ran the simulation that way we could crash the server in 2 seconds.

Give me more specifics and I may be able to give better advice.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis

Re:Look in the right place by babbage · 2002-06-13 13:17 · Score: 5, Informative
I was going to reply to the article, but you hit most of the points I was going to ...and a lot of them I wasn't thinking of. So to cut down on the redundancy, I'll just reply here & add a couple more points:
- To control the timeout problem for slow connections/clients, Apache can be tuned to use very short keepalive times. HTTP/1.1's keepalive header can be useful for clustering a burst of multiple requests (such as an HTML file plus a collection of images for it) but the dormant processes it can generate can be more costly than the TCP connection overhead time you were trying to avoid by enabling HTTP/1.1. Oops. Set the timeout low enough to hit the sweet spot between "too many new TCP connections" and "too many idle Apache children".
- Reconsider your resistance to clustering. Yes it can make things more complicated, but it can also make your life a lot easier. Want to ease the Apache or MySQL load? Buy a couple more boxes & have them NFS mount the content or data directories. You can also do clever things like putting all your static content on a server optimized for that purpose (no mod_cgi or mod_include or anything like that) or dedicating hardware to a mod_perl instance, a mod_php instance, etc. Whatever. You gain a lot more flexibility, you compartmentalize things so that (hopefully) you don't have a single point of failure, and it's easier to swap/upgrade/replace components in one area without disrupting things in others.
- As another commenter noted, caching can be a big help. Caching proxies can reduce the load on the main server significantly. Not everything can be cached, but it's possible to strike a balance between readily cachable data (home page, section headers, images, stylesheets) and material that really does have to get generated for each request. On a big site, every little bit helps.
- mod_gzip is your friend. Processing power is always going to be cheaper than bandwidth, so spending your money on compressing data is cheaper than paying for increased bandwidth. Even if not all clients can take advantage of it, if a significant fraction of them can then you'll quickly come out ahead.
- If you have any huge content (audio or video) that places a heavy load on the rest of your systems, you might consider outsourcing it to a company like Akamai that specializes in delivering such content quickly. Services like this are probably expensive, but if you need it then you need it, and going with an Akamai is surely cheaper than setting up & maintaining your own data centers all over the country & world.
- As the above poster notes, consider Apache 2.0. Among the many neat-o features it offers is a choice in execution model: in addition to the fork/exec multiprocess model that 1.3.x used, you can also try threaded modes (which should be a big boost to Win32 servers if you need to go in that direction for anything), and I think maybe some more exotic execution methods. Depending on your setup, you might be able to find a big speedup by switching to threads. (Note though that, as of now, Apache2 and PHP4 don't play nicely together, and the same is probably true for a lot of Apache extensions (mod_perl, others), so make sure that whatever modules you need to use are going to work. Test test test!)
--
DO NOT LEAVE IT IS NOT REAL
Re:Look in the right place by bobm · 2002-06-13 14:58 · Score: 4, Interesting

Thanks for the info, I left the specifics out since I'm looking for generic feedback (for the learning).

The site will be mostly serving dynamic content with the average page being about 60-120k of code and around 10k of images. And yes, that's a lot of code but the site is serving up reports and whatnots. There are small pages between reports and the usual login, etc screens.

The real purpose of the question was to see how different tuning is being used in the real world, as the web has matured there has to be some interesting information on keeping the systems up 24/7, etc.
For example we're looking into a replicated database with just the important info (and I know that important is a real fuzzy term) for periods when we need to bring the primary database down.
what would be interesting is the proactive analysis (when do you add more hardware, etc) that is done on a live running system.
thanks
Re:Look in the right place by sydb · 2002-06-13 23:54 · Score: 2

Trim unnecessary whitespace, using short names (ie. i/x.png instead of buttonimages/left_page_arrow_top.png) and so on.

While the rest of your post contains many good points, I find this comment bizarre. The overhead of a few extra bytes is insignificant compared to the benefit of having maintainable code.

--
Yours Sincerely, Michael.
Re:Look in the right place by RAMMS+EIN · 2002-06-14 06:55 · Score: 1

``Apache2 and PHP4 don't play nicely together''

Works fine with me. I have been running PHP 4.2.0 on Apache 2.0.35 on Linux 2.4.19-pre7 and I have not experienced any problems whatsoever.

---
A computer without COBOL and Fortran is like a piece of chocolate cake
without ketchup and mustard.

--
Please correct me if I got my facts wrong.
Re:Look in the right place by mmcshane · 2002-06-14 15:55 · Score: 1

wow, that is a lot of code. depending on your end-user needs, you may want to look at compressing the content on the server-side before sending it over the wire. Most modern browsers can handle compress and/or gzip content (check the HTTP Accept header).

It's more processing on the server side but you'll save bandwidth and the user-perceived performance will be better.
Re:Look in the right place by praktike · 2002-06-16 15:59 · Score: 1

well, you can write a little app to change long names for the production copies of the pages, and save the development copies separately.

--
-------- -praktike

You need to provide way more info by DevilM · 2002-06-13 12:27 · Score: 3, Informative

You say you want it Apache-based and dynamic. Well how are you going to build the dynamic pages? Are you writing your own Apache module? Are you using Perl, PHP, JSP, CFML... what? Are you using Apache 1.3.* or Apache 2.0.x? What database are you using? What kind of application is it? What OS are you using? What about disk subsystem, is it RAID based? If so, what level? Why do you want to use a single big machine instead of many small machines?

Anyway, you need to provide way more information in order to get help. There is no magic way to make a site scalable. It just depends on the answers to all the above questions and more.

Re:You need to provide way more info by bobm · 2002-06-13 15:11 · Score: 4, Interesting

Database: Informix on EDS served from an E10K.

Dynamic: currently mod_perl but open to something faster (if there is a proven faster technology).

Apache: current 1.3.x move to 2.0.x when it's ready for prime time.

OS/Hardware: open, currently Solaris/Sun, open to quad Xeon/Linux if it has the performance.

The reason for asking about a single vs multiple machines is that I wanted to get a handle on what one box could do as opposed to the gut reaction to just keep adding servers.

Although I'm not expecting magic I didn't want to get too specific because I'm interested in feedback from across the board, for example how does Orbitz or Yahoo or *New York Times* maintain uptime? I haven't found anywhere that discusses places like that.
Re:You need to provide way more info by Longstaff · 2002-06-13 16:03 · Score: 2

Places like Yahoo will globally distribute their servers.

Services provided by Digital Island, Mirror Image and Akamai will distribute your content to a node as close to the client as possible. We use those services for our images (only static content we have), but Akamai (at least) is pushing a new distributed processing model. You give them a Java WAR file or a .Net app and they'll push the *app* out to the edge. Expensive, but interesting.
Re:You need to provide way more info by Anonymous Coward · 2002-06-13 16:07 · Score: 1, Informative

How do they maintain uptime? Uhh, see earlier posts.... :-)
I've recently started with a NYT sibling company. Suffice to say, our network design -- while far from perfect & entirely too arcane in a lot of ways (like, say, daily munging of data from old VAX & PDP mainframes into a web presentable form) -- works. I'm told we had emails from visitors telling us that on 9/11 last year, our site was the only major one that a lot of people could get to: when NYT, CNN, MSNBC etc were inaccessible for several hours, our site was able to handle it.
Now granted, our traffic during that spike still might not have matched theirs on an average day, but still -- the ability of the system to withstand sudden shocks like that day (or, say, a Slashdotting earlier this week) has been well proven. And in an abstract way, the points being raised in this thread -- by several posters -- are all design aspects incorporated into the site I work for.
Gimme an email address & I might elaborate. I don't want to go into detail on Slashdot... :-/
Re:You need to provide way more info by consumer · 2002-06-15 13:45 · Score: 1

How does Yahoo maintain uptime? They have thousands of Intel servers in racks. Nothing too tricky.
The only thing that's truly proven to be faster than mod_perl is custom-coded Apache modules written in C. You can do that, but it will take you a long time.
Re:You need to provide way more info by Anonymous Coward · 2002-06-16 13:06 · Score: 0

It's not that difficult:

Use squid in front-end.
Use thttpd for all static data(images and such).
Use apache for dynamic content.

I have a site running using this and everyone is happy.
70 million hits per month.

Powered by OpenBSD. :}

Session Management by JMandingo · 2002-06-13 13:25 · Score: 2, Interesting

Assuming you are using multiple web servers, and that your app is complex enough to require a session data management scheme (rather than just passing vars from page to page in the query strings), I recommend using cookies for session data. Naturally this only applies IF you don't mind requiring your clients have cookies enabled, IF you don't need to store anything more complex than strings, and IF the total amount of data you need to store is small.

Another option is to store session data in your top-level frame on the client, but this can be messy and hard to debug. Storing session in your database is elegant and easy to debug but can increase the hits on your database to a prohibitive degree. Adding database bandwidth in the future is difficult and expensive. Adding web servers to your system is comparatively cheap and easy.

--
Vonnegut was right: Of all the words of mice and men, the saddest are, "It might have been."

Re:Session Management by Clay+Mitchell · 2002-06-14 07:01 · Score: 1

A third option is to look into using a J2EE App Server - there's nothing nicer than logging a person in, creating an object with all their stuff, and sticking it in their session. you never have to worry about database hits every time you need some info. you get all the info on the initial object creation and just pull it out when you need it. you can even use beans this way...

everybody likes to make fun of java - the poor thing, it was created to be a run anywhere client language, but its true calling was serving up applications!

A good article by cwinters · 2002-06-13 15:33 · Score: 4, Informative

A good reference on this is from one of the eToys architects. It uses mod_perl as the technology but the general strategies -- caching in particular -- will work for any application server technology.

--

Chris
M-x auto-bs-mode

Kegel's site by jawahar · 2002-06-13 18:01 · Score: 5, Informative

Contains very good information. http://www.kegel.com/c10k.html

--
Slashdot = Sarcasm

Re:Kegel's site by Longstaff · 2002-06-13 18:19 · Score: 4, Insightful

mod parent up - great link!
Re:Kegel's site by amevba · 2002-06-18 08:25 · Score: 1

One thing that really makes their advice look good: the site had not been slashdotted -- at least not yet a moment ago!

Code maintenance by delibes · 2002-06-14 04:00 · Score: 1

Though this isn't directly Apache related, one more aspect of scalability is code flexibility and ease of change.

Don't be short-termist just because the person generating the business requirements thinks like that. After you're up and running, things may still change. By using good design patterns you'll find it easier to add new functionality or change the system behaviour.

--
This is not a sig

This person is lucky to have a job by ukpyr · 2002-06-14 05:53 · Score: 1

My feedback:
How does someone who's obviously never done this, let alone thought about it for more than a few minutes, have a job DOing this? Maybe it's just my area but there are NO web architecting type jobs around here and this numb-skull is having slashdot do their job for them....
Life is like so fair! :)

Re:Persistent Connections Are Your Friend - MAYBE by RAMMS+EIN · 2002-06-14 06:39 · Score: 1

``It seems that the PHP connection pooling limit is per process.''

Then use Apache 2.x in threaded mode -- that way you don't create a new process for every connection. A threaded server may be a good idea anyway.

---
Harrison's Postulate:
For every action, there is an equal and opposite criticism.

--
Please correct me if I got my facts wrong.

Re:Look in the right place, but what is Clustering by Anonymous Coward · 2002-06-14 06:42 · Score: 0

Thanks for all the great info. But I have a question about what "clustering" means? It seems I hear it used so much, and unfortunately in my line of work I hear it mostly used by sales people who attach to a buzzword like parasites and then spit them out when there is dead air.

Does "clustering" mean multiple web servers, application servers, and database servers all serving the same application?

Can you just have a "cluster" of database servers? If so, and this is what confuses me most, how does the data stay in sync? If I write an application that hits a "Clustered" database do I have to do anything special or does updating one database server cause the others to get updated?

Thanks for any help on this subject!

Re:Look in the right place, but what is Clustering by babbage · 2002-06-14 10:48 · Score: 2

It can mean different things. In the sense I'm used to it being applied, you split up services in different ways across multiple machines. Making up some terminology that I don't think anyone would object to very strongly, you can split them vertically (an Apache box, a MySQL box, etc) or horizontally (multiple Apache frontends sharing the same content somehow). A very tight definition might mean getting the computers to act as if they are one unit -- I think this may be what your sales guys are talking about -- but it's simpler to just have them working together loosely without having the extra step of pretending to be one homogeneous entity.

Like I say, there are different ways of doing this, and really you ought to browse through a good bookstore or two to get more details. One strategy that's easy to implement might be to split up your content so that plain html is on www.site.com while your images are on img.site.com, your cgi scripts are on cgi.site.com, and your data is housed on db.site.com (which probably shouldn't be web accessible, by the way -- this protects you!). This is a vertical split. Or you can go horizontal by placing everything behind a load balancer that redirects incoming requests to one of several web servers -- each of which can be getting content from a single shared NFS partition. Or you can do a mix of those: maybe all the front end web servers communicate with dedicated database etc boxes behind them (which, again, would not be otherwise internet accessible).
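
To show what the horizontal split amounts to, here is a toy round-robin dispatcher in Python. It is only a sketch of what a load balancer does, not a replacement for one; the backend host names and the `route` function are hypothetical.

```python
import itertools

# Hypothetical pool of identical web servers behind the balancer.
backends = ["web1.internal", "web2.internal", "web3.internal"]
next_backend = itertools.cycle(backends)

def route(request_path):
    """Pick the next backend in turn for this request."""
    return next(next_backend), request_path

assignments = [route(f"/page/{i}")[0] for i in range(6)]
```

Six requests land evenly, two per server. This only works if every backend can serve every request, which is why the shared content (NFS, rsync and so on) matters.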

On a Linux or Unix system, NFS is a pretty easy way to mirror content across all your servers. For a Win2k served site, you could probably get away with CIFS Windows shared drives. Or if you want to be really cutting edge, WebDAV might be able to meet similar needs. Less clever -- but debatably easier -- ways to do it might involve rsync'ing content from a master content server to a set of web-facing server "clients". A variation on that idea ends up being more or less identical to content proxy caching, as one big expensive app server in the back gets its data cached onto a pool of cheap web facing proxies.

But, like I said at the beginning, the devil is in the details and you really ought to pick up a couple of good books if you want to learn more about this. Strategies for e.g. database "clustering" can vary widely depending on the RDBMS being used: I doubt the method would be the same for MySQL as it would for PostgreSQL, Oracle or SQLServer, for example. Some of those might be able to do this work almost transparently, while others would involve more manual planning & setup.

--
DO NOT LEAVE IT IS NOT REAL

never spend more than $2,000 on a web server by toki · 2002-06-14 11:08 · Score: 1

Never go with more than two processors for a web server. Not even Solaris, which is renowned for its processor scalability, can scale networking functions with more than 4 processors.

Get a single or dual processor intel/AMD rackmount system for your web servers, spending the extra money on a quad system isn't worth it. You don't need SCSI either for them.

Sun's idea of a web server is a $20,000 E280R. Their Netra T1's are only single 500 MHz UltraSPARC IIe at roughly the same price as a dual processor intel/AMD machine, and they don't really compare performance wise running as a web server.

For the backend, you can go with the huge systems to run the database. I wouldn't recommend running MySQL, it won't scale. You probably need something like Oracle if it's going to be a heavily trafficked/high transaction site.

Re:never spend more than $2,000 on a web server by toki · 2002-06-14 11:10 · Score: 1

Rather, keep adding small web servers at around $2,000 each until you've scaled enough. Get a load balancer to distribute the load. Web functions scale horribly, on any system and any OS, as you add processors, so one big machine won't run well and will be a helluva lot more expensive than a larger number of smaller, more affordable machines with load distributed between them.
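To illustrate the "keep adding small boxes behind a load balancer" idea, here is a minimal round-robin sketch. It is not how any particular load balancer product works; the server names and function names are made up, and real balancers also track health and load.

```python
import itertools
from collections import Counter


def round_robin(servers):
    """Return an endless iterator that hands out servers in a repeating cycle."""
    return itertools.cycle(servers)


def distribute(num_requests, servers):
    """Count how many of num_requests each server receives under round robin."""
    picker = round_robin(servers)
    return Counter(next(picker) for _ in range(num_requests))


if __name__ == "__main__":
    pool = ["web1", "web2", "web3"]  # hypothetical $2,000 boxes
    print(distribute(9000, pool))
    # Scaling out is just a longer list: each box now sees a smaller share.
    pool.append("web4")
    print(distribute(9000, pool))
```

Adding a fourth machine drops every server's share of the same traffic, which is the whole argument for many cheap boxes over one big one.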

ACS/OpenACS by tin_the_fatty · 2002-06-14 15:07 · Score: 1

I am somewhat surprised that nobody has mentioned Philip Greenspun's "Guide to Web Publishing"; the online version is available at http://philip.greenspun.com/panda.

Check out http://www.openacs.org/. It is a toolkit derived from the original ACS, but instead of Oracle, it works with PostgreSQL. It is under active development. Greenspun's website survived a slashdotting, and so did OpenACS. That certainly says something about the scalability.

The problem with Apache 1.x/PHP/mod_perl/MySQL/PostgreSQL is that the so-called persistent database connection is per-process based. There is no guarantee that requests for a particular site will always be served by the same process with the appropriate db connections. Then there is the problem of running out of db connections for any particular process. It seems to be a lot of fine tuning work, a lot of memory and CPU power. Apache 2.0 is likely to be better in this respect, but I still think that AOLServer is cleaner.

Re:ACS/OpenACS by consumer · 2002-06-15 13:51 · Score: 2, Informative

"The problem with Apache 1.x/PHP/mod_perl/MySQL/PostgreSQL is that the so-called persistent database connection is per-process based."
And how is this a problem exactly? If your server is handling only dynamic pages (your static stuff should be split onto another server) you will almost certainly need a database handle on every request. Connection pooling is only useful if your application spends a lot of time NOT using the database.
"Then there is the problem of running out of db connections for any particular process."
Why would a particular process need more than one database connection? Each process only handles one request at a time.
"Apache 2.0 is likely to be better in this respect, but I still think that AOLServer is cleaner."
Apache 2 provides full support for threading, so it can use the same approach as AOLServer. It doesn't sound like you know very much about it, so maybe you should check it out before you tell everyone it's no good.

Experience by Anonymous Coward · 2002-06-15 02:39 · Score: 0

I'll just note my experience:

Stock market sites, heavy load with peaks, a lot of generated graphics with a particular validity (i.e., valid for 60 seconds). We were getting more than enough hits to make caching just for those 60 seconds worthwhile.

Strategy: Ok, this was a few years ago now, but basically our end architecture went like this:

- A database server
- 3 dual processor apache servers, all mirrors, to handle the chart generation, page serving etc
- 1 inverted Squid proxy running a custom patch (it's floating around somewhere on the 'net) that turned it into a load-balancing fiend that could handle nodes falling out of the cluster transparently, and balance load with a 5 minute predictor and queue tracking that meant it could accurately predict what server would serve the client soonest.

Squid did a fantastic job of caching the data in ram where necessary, and yet dumping things as soon as they had expired. It took an enormous load of static content off the apache servers letting them do what they do best, which was heavy graphics calculations and dynamic pages.

The best thing about it was that it was effectively transparent, it just worked. We had the occasional expiration/caching issue while we were bedding it down but overall it operated brilliantly, we could take machines out of the cluster without concern whenever necessary, and we could change the format of our site without having to redesign the caching system.

On top of this, the dynamic code had its own caching efficiencies where useful, but in the main it wasn't necessary. As hits to the site ramped up, the proxy absorbed more and more of the traffic, since 60 hits/second all get the same chart from RAM on the proxy machine, 'cos it's cached for 30 seconds.
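The short-lived caching described above can be sketched as a tiny time-to-live cache. This is not the Squid patch from the post, just an illustration; the class name, the render function, and the "XYZ" symbol are all made up.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds, then regenerate them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self.store = {}     # key -> (expires_at, value)

    def get(self, key, generate):
        """Return the cached value for key, calling generate(key) if expired."""
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = generate(key)
        self.store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    renders = []

    def render_chart(symbol):
        # Stand-in for the expensive chart generation on the Apache boxes.
        renders.append(symbol)
        return f"chart for {symbol}"

    cache = TTLCache(ttl_seconds=60)
    for _ in range(1000):
        cache.get("XYZ", render_chart)
    print("requests: 1000, charts actually rendered:", len(renders))
```

The more hits arrive inside one validity window, the larger the share the cache absorbs, which matches the behaviour described in the post.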

There is a lot of room for smarts in the web performance game.

How about... by jabbo · 2002-06-15 04:00 · Score: 3, Informative

When I worked at XOOM we had a farm of about 30 front-end FreeBSD webservers mounting member directories via NFS and serving an average of 500mbps (peaking up to about 1Gbps at times). The key to that architecture was that member logins were cached via a proprietary daemon that all pages authenticated from. Templates, dynamic content, etc. were all pickled to flat files whenever possible (at first, not much; later, once the merger with Snap! was done, much more, as their content caching system was superior).

The database on the Xoom side was an E450 IIRC. Snap used much burlier hardware because they were basically a silver-spoon project of CNET/NBC.

The lesson for scalability is simple, cache like a motherfucker and make everything you can static. And run DSR. ;-)

If you decouple the database from the webservers you need to make extra sure that you proxy the high-traffic requests, either by running a static-file-dumping daemon process (for content) or a proxy daemon (for authentication). My moderately-low-traffic site at my current job can handle two saturated DS3's worth of traffic with 1024 apache child processes running on each of 2 dual PII boxes w/512MB RAM, plus the database running on a dual PII w/1GB RAM. Doesn't even break a sweat. Postgres (the database) runs 1024 child processes with a lot of buffers, NFS caches are pretty good sized (if your frontend webservers are Sun, you can use cachefs aggressively, I would), and overall it just took some serious tuning to make sure that nothing fazes it.

I'm working on a couple of "community" sites with similar demands (~1 million visitors/month) and mod_throttle + caching will solve one's problems; the other is where I stole the throttling idea from :-). Just tune, tune, tune... you'll get it.

For the whiners, Xoom failed in the end because it lost sight of the cheap-ass principles that made it a good stock market scam. Right up until the end, performance on the member servers was sub-4 second per page on average.

--
Remember that what's inside of you doesn't matter because nobody can see it.

You can read how we did it by consumer · 2002-06-15 13:26 · Score: 1

We presented this article at ApacheCon a couple of years ago. It describes how we built an e-commerce site that handled over 2.5 million page views per hour.

Re:Persistent Connections Are Your Friend - MAYBE by consumer · 2002-06-15 13:33 · Score: 1

Does mod_perl allow for resource sharing between processes?

With perl 5.8, apache 2, and mod_perl 2, you can share resources between threads. With earlier multi-process versions you can use shared memory or disk. On Linux, sharing on disk is very fast since frequently accessed files are kept in memory.

got cash? by edrugtrader · 2002-06-15 19:06 · Score: 1

just get a sun e450 and hire a unix guru to maintain it for you. you aren't qualified, and will look pretty stupid the first time it goes down.

--
MARIJUANA, SHROOMS, X: ONLINE?! - E

Great article on Scaling your DB by RevDigger · 2002-06-16 07:52 · Score: 2, Interesting

This is a great article on scaling a website really fast. I found their techniques for scaling their database especially interesting.

http://www.webtechniques.com/archives/2001/05/hong/

It's about the guys who built amihotornot.

- H

Whoa... by percey · 2002-06-16 14:39 · Score: 1

You'd rather buy a million dollar machine than get a few small ones? I'm sure the CIO will love that. If your object is absurdity, and making it into some freaky unix journal for weirdest non-clustered apache site, then you may just want to get an IBM mainframe and put on it that Linux they're advertising for it. That way you can run 1000 linux partitions with 1 apache running on each partition.
However, as a sane person, you would realize that would be dumb. Alright look, if you've got the loot to spare why not try this approach.
Get yourself a solid database machine. Any kind of multiprocessor (2 or 4 proc should be fine) Sun, IBM, or HP would do, though not the E10k or that Regatta from IBM. I recommend that you get a true Unix system, because you'll end up with amazing uptime and it can take lots of load. Now remember this fact for the database:
It's the I/O, stupid.
Well, it's the I/O and RAM. Get the best disks you can get your hands on and lots of 'em. And if you can, get a gig or two of RAM. Now, your database may have specific requirements for RAID. I was told once that with DB2 RAID 5 was adequate (but that's hearsay), whereas I have seen many Oracle people recommend RAID 0+1. What I understand to be true is that RAID 5 is good for data warehouses, and RAID 0+1 for OLTP. You need to decide; most websites are probably in the realm of OLTP, but your situation may differ. Please remember: get at least SCSI, better if you can afford it. You would be better off getting a Penguin Computing system and a Fibre Channel drive array than going with a high price unix vendor and plain old SCSI (IMHO). Now if you're just going to be using MySQL or Postgres, don't be dumb, get dual Xeon and run Linux, but the same rules apply.

As for replication I happen to work with a database cluster system with IBM's HACMP, if we need to take down one system we can fail the database over to the second machine, the process takes about two minutes and the database is back up and running. You then switch it back over when you're done doing whatever.

Ok, now that I've addressed your database server, let's evaluate your website needs.
Firstly, apache (of course) allows you to handle multiple websites as virtual hosts. It does it very well. Now, I'm not sure what kind of scalability you're talking about but that's one kind right there.

I would suggest that if you've got the means, you may want to look at an application server that's based on apache, such as Oracle's or WebSphere. It may make administering a ton of web addresses easier.

You want to stay away from clustering? Why? To save on administration costs? But yet you're entertaining an E10k.... I don't get it.
Since I don't think your server consolidation is a smart move I'm going to propose this:
IBM and Sun and HP (HP just came out with a really inexpensive rack mounted 1U Unix machine) all have small 1U webservers. But honestly, those systems are great for databases, where you may want to get the extra performance of specialized hardware, but you could do quite well with a rack full of dual processor Xeons running Red Hat. (This would be where an app server would be a good idea, because a lot of them feature cloning so you don't have to copy all the new html over with each change.) With all the modules and the native ability to compile it from scratch I think you're better off.

What you would probably do then is load balance them with round robin DNS (there are some expensive hardware load balancers that I'm not too familiar with that you can buy too). Look it up; it's very simple IP based load balancing from the DNS.
Remember also that if you're looking to get 10 mil hits a month you're going to need bandwidth (I'd guess a DS3?) to support that.
This is of course my opinion. I have administered and set up the webservers and databases for several companies, and I've never figured out a really good answer to this problem of scalability, except to keep open the possibility of future growth (although there are some pretty specific formulas for capacity planning for databases). You're making a huge hardware investment, and you need to keep that in mind too: if you need 2 CPUs then get the capability for four, etc. Keep in mind that companies grow (hopefully) and the CXOs that are approving this purchase won't like it if next week you come down and say "Oh sorry, we need to get the E15k, the E10k wasn't enough" (but they won't dislike it so much if it's 'only' a 5k Xeon system you're getting). Also, let me say this: try to get a professional opinion from someone, say, not on slashdot (the problem is obviously where; vendors will lie, so will consultants). If you're spending megabucks, you may have to defend your suggestion at some point, and telling 'em, 'Um, this guy on slashdot said it would be cool.' never really pans out.
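A rough back-of-the-envelope check of the "10 mil hits a month" and DS3 guess above might look like this. The 44.736 Mbit/s figure is the nominal DS3 line rate; the 50 KB per hit and the 10x peak factor are pure assumptions for illustration, not numbers from the thread.

```python
DS3_MBPS = 44.736  # nominal DS3 line rate in Mbit/s


def average_hits_per_second(hits_per_month, days_per_month=30):
    """Spread a monthly hit count evenly over the month."""
    return hits_per_month / (days_per_month * 24 * 60 * 60)


def required_mbps(hits_per_second, bytes_per_hit):
    """Bandwidth in Mbit/s needed to serve the given hit rate."""
    return hits_per_second * bytes_per_hit * 8 / 1_000_000


if __name__ == "__main__":
    avg = average_hits_per_second(10_000_000)
    # Assumptions, not figures from the thread: 50 KB per hit, peaks at 10x average.
    avg_mbps = required_mbps(avg, 50_000)
    peak_mbps = avg_mbps * 10
    print(f"{avg:.2f} hits/s, {avg_mbps:.2f} Mbit/s average, {peak_mbps:.2f} Mbit/s peak")
    print(f"peak as a fraction of one DS3: {peak_mbps / DS3_MBPS:.0%}")
```

Plug in your own page sizes and peak ratio; the point is only that the average rate hides the peaks you actually have to provision for.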

Re:Whoa... by blinx_ · 2002-06-17 20:49 · Score: 1

Or instead of a really expensive hardware load balancer you could use Linux Virtual Server, a Linux-based software load balancer. It works really well. I've been using it together with NFS to balance a quite high traffic website.

--
Resistance is not futile - www.gnu.org

Good Resource by LedZeplin · 2002-06-17 07:18 · Score: 1

Here is a page that I found that has all the tweaking information that I've needed, and if you are thinking about using PHP then it's even more suited for you: http://php.weblogs.com/tuning_apache_unix

Re:Good Resource by LedZeplin · 2002-06-17 07:20 · Score: 1

Sorry, here is a link for convenience: Tuning Apache and PHP for Speed on Unix

NFS vs. rsync? by LedZeplin · 2002-06-17 07:39 · Score: 2, Insightful

Several of you have discussed using NFS for cluster webservers to access a shared web root. My current setup uses rsync to distribute the files to the cluster nodes and I'm wondering why NFS? It seems that the rsync method would be a lot more failure resistant. If my primary server goes down the cluster nodes can serve the site as it was at time of failure. With an NFS server you would need a high availability failover, otherwise all the cluster nodes are SOL, right? I'm curious what the plus side to NFS is; maybe I'm missing out on something.

Re:NFS vs. rsync? by rumwrks · 2002-06-25 08:21 · Score: 1

You are absolutely correct.
Short answer: NFS is simpler to set up and manage (not by much) and people are more familiar with it.
There are a few reasons I can see for using NFS over rsync (I have an e-mail spool over NFS to a bunch of frontend email servers [using NFS-aware maildir, really it's ok ;]), but most of them are pretty thin.

Uptime by queenb**ch · 2002-06-18 10:01 · Score: 1

IMHO....

I agree with many of the things that the other posters here are telling you to do. I have a few suggestions though. Put the money into areas where it will do you some good. Buy the cheaper servers. The real bottleneck on most servers is the PCI bus and the hard drive I/O anyway. Multiple smaller servers eliminate this issue, since the same load is spread across more PCI buses and read-write heads. It's nice to have the budget for an E10K, but we have two at work and I'm not terribly impressed, especially not for the price tag, never mind the footprint, weight, power usage, or massive BTUs.

My first concern is your network. You are far more likely to saturate your network than to run out of server capacity. Sites with page views in the thousands per second generally run multiple high end connections, like DS-3's. You are going to need gear that can handle this kind of load, not to mention the redundancy and multipathing. Some serious gig gear is definitely in order.

I'd also suggest that you do some tweaking to your firewalls. Security is, or at least should be, a major concern for sites that take that kind of hits. Just ask Wingspan Bank. You can build a great site, but if people get their information stolen, they won't be back.

My next suggestion would be to put a dedicated load balancing hardware device out in front of them. Either use Squid or use an actual hardware device that is supposed to do this kind of thing, ala Cisco's CSS products.

I'd also suggest sinking some of that cabbage into a real database. Understand that I'm not knocking PostgreSQL or any of the other open source databases. Oracle has roughly 85% of the high end database market FOR A REASON. Once you shuck out for the Oracle instance, shuck out for someone to tune it as well. Tuning is VERY important to databases.

Anyway that's my 2 cents worth. This advice is worth exactly what you have paid for it.

--
HDGary secures my bank :/

Re:Look in the right place, but what is Clustering by Anonymous Coward · 2002-06-19 00:18 · Score: 0

Thanks for clearing that up. I appreciate the response.

Re:You need to provide way more info - mod_perl by rumwrks · 2002-06-25 07:17 · Score: 1

I don't have any current results, but a couple of years ago I did some fairly extensive testing with perl-cgi, mod_perl and fastcgi. We tried to tweak everything on each example as much as we could (preloading modules, persistent DB connections, etc...). By FAR and away the best was fastcgi (which sort of surprised me btw). Also it has some nifty characteristics that may solve some of those "excess" DB connection problems (cgi pools, etc..). Also you can run your cgi's on a separate server (or server pool) from the main web server(s), which is pretty cool. The downside (especially bad with poorly written perl) is that you have to be extra careful with persistent variables (although if you're on mod_perl already you ought to be pretty good there). Worth checking out anyway.
http://www.fastcgi.com/
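The "persistent variables" pitfall mentioned above applies to any long-lived worker process, whether fastcgi or mod_perl. Here is a small illustration in Python rather than Perl, with no FastCGI library involved; the handler names are invented and the loop merely simulates one process serving several requests.

```python
# Module-level state lives as long as the worker process does.
seen_params = []


def leaky_handler(param):
    """Buggy: appends to a global that is never cleared between requests."""
    seen_params.append(param)
    return list(seen_params)


def clean_handler(param):
    """Safer: all per-request state is local and dies with the request."""
    request_params = [param]
    return request_params


if __name__ == "__main__":
    # Simulate one persistent process serving three separate requests.
    for p in ["a", "b", "c"]:
        print("leaky:", leaky_handler(p), "clean:", clean_handler(p))
```

Under plain CGI the leaky version would look fine because every request gets a fresh process; it only misbehaves once the process sticks around.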

60 comments