Building/Testing of a High Traffic Infrastructure?

← Back to Stories (view on slashdot.org)

Building/Testing of a High Traffic Infrastructure?

Posted by Cliff on Saturday November 13, 2004 @05:21AM from the transactions-and-pages-per-second dept.

New Breeze asks: "I'm currently working on my first web 'application', and have discovered that I know less than nothing about setting up the infrastructure to manage a high traffic system. Where does one go to learn about setting up the infrastructure required to host something like Slashdot? Or do you just say, 'Not my area!' and help them find a consultant?" "My experience is pretty much limited to:

1. Install the web server on one box, the database on the same box if it's a small installation or a separate box if performance seems like it will need it. Add more memory and processors based on SWAG criteria. (Scientific Wild Ass Guess)

2. Contract with a hosting company.

I had a potential customer ask what I would recommend if they wanted to self host, they have around 300 remote locations and would have multiple users from each location hitting the application at the same time, so saying a couple of beefy servers probably isn't the right answer.

I haven't a clue. The last place I worked with on something like this hired a high dollar consultant who spend a huge pile of their money setting up a load balanced, oracle parallel server redundant everything system.

How do you test it? I've worked where they actually had a room with hundreds of systems on racks that they would configured to run test transactions against different servers and software builds for stress testing, but that's not in my budget..."

19 of 231 comments (clear)

A Beowolf Cluster of Course by l810c · 2004-11-13 05:24 · Score: 4, Funny

That one was easy, ...Next
Ask a Pr0n serving company by chia_monkey · 2004-11-13 05:24 · Score: 5, Insightful

Seriously...they know all about serving up content on high traffic sites. Not only is it high traffic, but it's rather big files that they're delivering. When we're testing the networks that we set up, both wired and wireless, we often visit pr0n sites for our benchmarks.

--

"He uses statistics as a drunken man uses lampposts...for support rather than illumination." - Andrew Lang
1. Re:Ask a Pr0n serving company by AndroidCat · 2004-11-13 05:27 · Score: 4, Funny
  
  Okay, let's put it to a real test. Post some good pr0n links and we'll try to slashdot them!
  
  --
  One line blog. I hear that they're called Twitters now.
2. Re:Ask a Pr0n serving company by Neil+Blender · 2004-11-13 05:48 · Score: 5, Informative
  
  Okay, let's put it to a real test. Post some good pr0n links and we'll try to slashdot them!
  
  Having worked for porn sites, I will tell you this: They, more than anybody, will rise to the challenge. Porn = Traffic = Ad Click Throughs = Money. If a porn site sees a sudden rise in traffic, they will drop more servers into their delicately load balanced system without blinking an eye. Porns sites aren't slow. There is a reason why.
3. Re:Ask a Pr0n serving company by DogDude · 2004-11-13 06:06 · Score: 4, Interesting
  
  Absolutely. Most companies hand the whole thing off to a hosting companies that specialize in porn hosting. These places are rooms upon rooms of racks, on raised floors, 6 times redundant connections, dual power backup systems (generators), and all the fiber you could ever want. They're the best. Take a look Candid Hosting. They had a few hurricanes go over them, and they didn't bat en eye. Incredible uptime.
  
  --
  I don't respond to AC's.
Post a URL by MarkusQ · 2004-11-13 05:24 · Score: 5, Funny

Post a URL here and we'll help.
-- MarkusQ
P.S. Clever use of the text describing the link may help you control how much trafic you get, from low ("M. Moore Nude!") to high ("SCO caught robbing courthouse").
Simple Flow chart for learning by drachenfyre · 2004-11-13 05:26 · Score: 5, Funny

1. Submit Link to slashdot with your webserver hosting a lot of large video files supporting the link.
2. Have link approved (Note - duplicate any story just posted is probably the best way to get approval and lots of people crying dupe)
3. Learn what caused the webserver to melt and how long it took to melt.
4. Fix the problem that caused step #3
5. Repeat 1-4 until server doesn't melt.
6. Congrats! You've learned how to host a high demand web server.
PLEASE by asadodetira · 2004-11-13 05:29 · Score: 5, Funny

Please don't tell him. We don't need another slashdot. Servers worlwide surrender

OK.It's easy. There are three steps involved
1.Build a low performance infrastructure.
2.Put a RT sticker and chromed exhaust pipes
3.Done
Do the math by MarkusQ · 2004-11-13 05:33 · Score: 5, Insightful

First step, do the math.
What was once a "high volume" app may be nothing for modern equipment. You're talking about on the order of 1K concurrent users (300 sites * several users per site).
If "use" means manually typing data into forms, viewing mostly static pages, etc. this isn't really a very "high volume" application, and a single decent server should handle it.
If, on the other hand, "use" means constantly running complex queries against a billion item data set, you're doomed.
So where do you fall in this spectrum?
Coming up next...where's the bottleneck?
-- MarkusQ
1. Re:Do the math by New+Breeze · 2004-11-13 06:49 · Score: 4, Informative
  
  300 sites, between 12 and 200 concurrent users at a site.
  
  It's a CRM system, i.e. some basic data entry, some portions are transaction processing. i.e. the workflow portion for the base part of the app is very simply:
  
  Search for customer by various criteria.
  No customer found, add one.
  Retrieve customer information.
  Add current order information being stored for this customer.
  Process loyalty/discount programs to see if customer qualifies for an award.
  Return award to order entry system for processing.
  
  There's a lot more to it, but that's the meat of it. It's fairly data intensive, there is a great deal of information stored for customers for use in data mining the collected information. It's primarly web service based, but there is a fairly extensive management and reporting tool that is all HTML based.
  
  My guess is going to be that the bottleneck is going to the the database, but we've done extensive testing with a million customer sample database running multiple instances of test applications from 10 other boxes, but that doesn't exactly prove much as it's too predictable.
Apache Benchmark is your friend by sseremeth · 2004-11-13 05:34 · Score: 5, Informative

If you want to throw some serious load at your equipment, get a few other systems saturating your network with Apache Benchmark (ab) requests. It gives lots of useful data, like response times, etc. . And you're best off toppling the application and trying to find the cause that it failed and working on that as someone already suggested. The rinse and repeat.

Looks like Apache has updated their tools since the last time I had to do this...

http://httpd.apache.org/test/
Look at the other high load websites by GrAfFiT · 2004-11-13 05:36 · Score: 4, Informative

Check out what those guys do at Wikipedia. Don't forget to look at their useful links at the bottom.
Or maybe it's overkill.
Dear Slashdot... by Anonymous Coward · 2004-11-13 05:36 · Score: 5, Insightful

I'm currently working cooking in a restaurant, and have discovered that I know less than nothing about performing stomach surgery. Where does one go to learn about the techniques and tools necessary for curing stomach cancer? Or do you just say, 'Not my area!' and help them find an oncologist?

Seriously.. you have a lot to learn, and a lot of what you need to know just comes from experience which you can't get from a book.

First: learn how everything works. When you click a link in your "application" (why the quotes?), what happens? For instance, does it run a Controller object? If you're using a language like Ruby or Perl, is it "pre-compiled" or does it have to interpret a script on each hit? Does the controller then go to the database and populate variables, then insert them into a template, then render the template? Is the template cached? How are your database settings? Enough memory for joins? Are all your queries using the appropriate indexes? Are you familiar with your database's performance-measuring variables and tools? Are you pulling more data than you need to in each query?

Once you have an understanding of what's happening, then you can start measuring. Where are the bottlenecks? This is a very important thing to keep in mind in programming or system architecture: DON'T OPTIMIZE UNLESS YOU NEED TO! Keep your system and code as simple as possible. For instance don't cache things in your program (making it more complicated and harder to maintain) unless you have a BENCHMARK IN HAND showing a performance bottleneck.

You might not need to move your database to another machine. What you need to do depends on your app.

Yes, you will need to do a lot of testing to identify your "first round" of bottlenecks. You need to build a lot of diagnostics into your app to help you identify how long different steps take.

Always deploy your app in stages, one site at a time, until you start identifying some problems. Then fix those problems before continuing deployment. Never "flip a switch" and reveal any change all at once.

Good luck!
How to do it with little/no budget by multipartmixed · 2004-11-13 05:38 · Score: 5, Informative

First off, if this is a "must succeed with no problems" project, all bets are off -- hire an experienced consultant so you have someone to blame. Also, this technique only works when you have the type of site which will *build up* to expected load -- not get turned on instantly.

This is tough to generalize without knowing specifics, but here goes:

1. Make sure your application can work correctly when load balanced across multiple boxes
2. Keep webserving and DB work on different machines
3. Make sure your application can work with another database without much work (this gives you the option to hire, say, an Oracle DBA and buy an Oracle license if MySQL can't keep up.. does it even support row-locking yet?)
4. Have extra hardware handy, in the rack. Do NOT turn it on yet.
5. Observe the application running; determine bottlenecks, tune
6. If you can't tune it to perform adequately, NOW is the time to break out the extra hardware while re-evaluating the implementation.

If you throw all your hardware at the problem at once, you get very little warning when the shit starts to hit the fan, and no response scenario. Do NOT make that mistake. Load, test, tune, repair, repeat.

--

Do daemons dream of electric sleep()?
Two scenarios: by Jerf · 2004-11-13 05:38 · Score: 4, Insightful

1: Gradual growth. Find bottleneck, remove it. Repeat. Make sure to start with a growable database and web site technology, but that shouldn't be too tough. Also, stay ahead of the game, always with overcapacity, both to cover for outages and for sudden growth spurts.

2: Instant growth from 0 to thousands+: Hire someone who knows what they are doing. In the first scenario, you have the time to learn what is actually going on, which is an advantage. In this one, you don't, and the customer base is to big (i.e., $$$) to screw with.

That basically covers it. Specific advice will vary widely based on databases and web technology deployed, so just about any other specific advice you get here is as likely to be wrong for you as right.
A few basic things... by Em+Ellel · 2004-11-13 05:48 · Score: 5, Informative

These are some very basic thoughts on the subject. They may not be 100% right for you, but will get you thinking in the right way:

Rule 1 - Three tier archictecture is popular for a reason - it works. Offload user interface (web) to dedicated boxes, make application itself run on separate boxes and make database separate

Rule 2 - When possible, scale horizontaly not vertically. Make sure your application is as stateless as possible and is capable of you just dropping in an extra server when needed without a lot of reconfiguring. Make sure you can survive a loss of a server without loss of data. Lots of cheap servers will most always work out better (and cheaper) than one big ass box.

Rule 3 - Make as much of your application as static as possible. Even pseudostatic data (something that updates every minute or so) should be made static and have a process re-generating it every minute or so. Not wasting your CPU time to render a menu or something on every hit will add up fast under heavy stress.

Rule 4 - Strip your HTML. For example, some crappier web languages (think ColdFusion) have a tendancy of inserting spaces for every line of code etc. A large application running CF (dont ask) would insert enough spaces to make a simple page hundreds of kb in size. Just turning on "the write to output only on demand" option will drop size of the page to next to nothing. So know what it is that you are producing on output and make sure it is lean. Turning on server side compression solves this better, however adds to CPU requirements. On trully stateless web servers this just mean you need more web servers. So MAKE YOUR WEB SERVERS STATELESS.

Rule 5 - Know how many users your upstream connection can handle (in simplest terms - average size of HTML communication * number of users) and make sure you do not exceed it. Limit your connectivity at load ballancer. Having some users not be able to access your site is better than having ALL users not be able to access your site. Make sure you get plenty of bandwidth to spare. If you are setting up a multi-site presence, make sure your intersite communication is a - not going over same line as incoming and b - has sufficient bandwith and latency to serve the traffic.

Rule 6 - Professional load testing tools cost big bucks. But if you are carefull you can fake it with some open source software. Google it. When testing remember to take into consideration the limitation of your tester system and bandwidth.

-Em

--
RelevantElephants: A Somatic WebComic...
Experience is key by xrayspx · 2004-11-13 06:17 · Score: 4, Informative

Knowledge comes from Experience, and experience comes from Doing.

Mistakes will be made, They key is in mitigating the effects of those mistakes. Redundancy and Manageability are your two biggest buzzwords here. A good load test and utilization projections are definitely key, but no matter what you think your userbase will be, if it's a public application, you'll almost certainly be wrong. Try to prepare for the most traffic possible.

Redundancy on every level, including switching infrastructure is a very good plan. Any decent server sold can use multiple bonded NICs for reduncancy, if possible design your network such that if a switch fails, your network will fail over to another switch, etc.

I would suggest going to many local datacenters and interviewing each with probing questions relating to your situation. You will find that they are all relatively equal in terms of Standard DC items:
Diversity of route (physical entrance of cabling into the building) and redundant carriers.

Cooling

Power and backup gens

The things they differ on will be the readiness of their NOC team (do you have to fill out a web-form or call a call-center in East St. Louis to get a problem fixed in San Jose, or can you just "call the NOC and somene goes to your cage"), the monitoring/alerting they provide their customers for issues on the datacenter network. Infrastructure-wise, most DC's can provide you with Ping/Power/Pipe, but the service and SLAs are where they get points.

Do a LOT of reading. Depending on your platform, you have many choices. Linux vendors and Microsoft both have good platforms WRT building redundant networks, provided you do your homework.

Which brings you to manageability. Make sure that you have a deployment framework you can live with right from the start. Deploying code by hand is alright when you have 2 sites in IIS x 3 or 4 machines, but it gets hairy when you have 15 sites x 20 webservers. Make sure you can deploy web content, mid-tier apps, etc, with the "click of a button". This helps to ease the possibility of repetitive mistakes being made. Depending on the app, you may have to roll-your-own, but it's worth it.

Scalability. Make sure you pick a DC that can grow with you. If you plan to start out with 4 1u rackmount webservers and maybe a 7u DB, plus some storage array, make sure there is "room to move" in the DC without needing to cross-connect all over their facility with a cage here and a couple cabinets on the other end. Scalability testing by your engineers would be a great plan also. During load testing, if you're planning on using 2 mid-tier servers to process "Project X" from the web-users, set up 6 or 8 and load them up with bogus traffic. See how long it takes to kill your DB server.

Monitoring/analysis. Make sure you have a monitoring system into which you can hook custom monitors and alerts. Of your installation, those parts with the lowest levels of monitoring will be the ones most prone to breakage. Good packages here are NetCool and HP Openview. Expensive though. It's something you can probably write in-house until you need to spend the big bucks for an enterprise package.

Look to do a lot of reading, but break it into chunks. There is (I hope) no book called "Building and Maintaining High Traffic Enterprise Networks, for dummies, vol2". Every network will be different. But if you componentize your search, you will yeild great results. If you look to build your own monitoring or code deployment system, read up on WMI, read Cisco related newsgroups for network layer redundancy, etc.

Consultant is NOT a dirty word. Make sure you hire one for the right reasons. You do not want someone to come in and "make it so". You want someone with more experience than you have to work WITH you to design a network that you understand, can maintain, and which will scale. There's an art to it, hire Chris van Allsburg, not Picasso, Dali or certainly not Poll

--
I like music
Load balancer + content differentiation by chrysalis · 2004-11-13 06:34 · Score: 5, Informative

I have some experience with administration of web sites with very high traffic. My previous experience was with p0rn sites (lots of sites, lots of concurrent accesses). My current job is at Skyrock / Skyblog, that serves about 25 million pages every day.

In both jobs, the infrastructure was extremely similar.

The entry point is one (or more) load balancer.
A load balancer will not only blindly allow you to have multiple backends. It will also accept client connections, buffer the request, get the data from already established (keepalive) sessions, buffer it, and transmit it though large chunks to the client. This, alone, really helps to reduce the number of Apache processes that are taking resources (especially memory) for nothing.

The load balancer can also do other things, like protecting the servers against some attacks, plotting the current workload of every backend, compress HTML pages, etc.

At my previous job, we were using Foundry Serverirons. Now, we are using Zeus ZXTM http://www.zeus.co.uk/ with great success. Although it's very expensive software, it's way cheaper than Foundries, way more configurable, way more user-friendly and we are very pleased with it so far. A single PC handle 300 Mb/s (Linux 2.6 is needed for epoll).

The load balancer can also be configured to send the requests to this or that server according to the request.

Thus, servers are dedicated to specific tasks.

We have a bunch of static servers for static HTML, CSS, images, etc. They run minimal Apache servers, designed for speed, with NPTL and the worker MPM. Non-forking servers like thttpd or lighttpd is also an option. The static servers are mainly old P3 machines, with only 512 Mb RAM.

Then, we have servers for PHP. The Apache they are running is huge (our web sites need a lot of modules), the hosts are dual 3 Ghz Xeon with 2 Gb RAM and there are some other specific tweaks.

Content differentiation is important. It's a waste to spawn huge Apache process to serve static stuff, just because the same host should also be able to serve PHP. Also, tuning (esp. NFS) is very different for static and dynamic content. And as a specialized server often serves the same files, caching is more efficient.

We run Gentoo Linux on all web servers, plus one DragonFlyBSD (mostly for testing).

The same content differentiation is made for SQL server. One SQL server serves one sort of thing, so that caching is efficient. Also don't forget that on x86, Linux and MySQL can hardly use more than 2 Gb of RAM. So with big tables, this is really annoying. We are switching SQL servers to Transtec Opteron-based servers for that.

On high traffic infrastructures, the I/O is often the bottleneck especially if you serve a lot of different content.

For our blog service, we had to buy a Storagetek disk array with 56 disks (fiber channel, 15k) in RAID 10. As NFS would introduce too much delay, we directly plugged two web servers to the controller of the disk array. These web servers are the NFS servers for the PHP servers, but they also directly serve the static content.

The access time of hard disk is really annoying. For shared data, but also for databases. We found that RAID 5 was way too slow (even with the high-end Storagetek/LSI controller) since we have about 1 write for 5 reads. So we had to switch everything to RAID 10. It really performs better, but it's obviously more expensive.

Another bottleneck was the share of PHP sessions between all load-balanced PHP server. We first used a MySQL/InnoDB-based solution, but it poorly scaled. That's why I had to write specific software : Sharedance http://sharedance.pureftpd.org/

In a high-traffic infrastructure, my hint would be to use many modest, but specialized servers over one huge mega-fast server that does everything. This is way more scalable. And easier to manage, even from a financial point of view. You can b

--
{{.sig}}
Lessons since '99... by xanthan · 2004-11-13 06:42 · Score: 4, Informative

You don't mention if you're on the applications side of the world or the network, so I'll cover a little about both.

1. If you're on the app side, make friends with the network side and vice versa. To understand web site management and acceleration, you will need to know about both parts. Making peace with the other team is crucial to a successful site.

2. If you are on the app side, start thinking about concurrency from the start. You're going to have not 2-3 users at the same time, but more like hundreds if not thousands. This means that you can't do things like lock up tables and the like in the database. If at all possible write your application so that users don't need to come back to the same server to track their session information. Make sure each request is tracked quickly and easily. Also, differentiate your static content from the dynamic content -- you'll eventually want to cache the static content and life will be easier with static objects being served out of a known location. And please... please, please, please... make sure your app generates clean HTTP headers. Set your cache controls correctly, don't duplicate headers, don't be a smart-ass with your headers. Just use clean headers. ASSUME that there will be proxies between you and the client. ASSUME that you will not be able to control all of them.

3. Don't forget about megaproxies. Depending on the nature of your site, you're going to have a ton of your users coming from a small handful of addresses. (e.g., AOL) While some megaproxies have fixed the issue of a single user coming out of multiple proxy servers, all have not. This means anything that you use for client IP persistence is broken.

4. Client IP addresses... don't assume you have them. Don't assume they represent a unique user. They don't. Many load balancers/web accelerators also need to act as proxy and will replace the client IP address anyway. (Don't stress about logging -- any reasonable one will insert the client IP address in a HTTP header that you can extract like X-Forwarded-For:)

5. Peak load on your web servers. Apache can go fast, scale, blah blah blah... my ass. It's not the web server or operating system that is going to determine your peak performance. It is your application itself. Be prepared to fess up to the reality that your application peak performance is not going to be hundreds or thousands of requests per second unless you go insane with the optimization. (e.g., write your application into the web server and embed the whole thing into the kernel, etc.) Assume you're more likely going to get a few dozen requests/sec per app server. Keep that in mind as you plan server purchases and scaling.

6. HTTP request does not equal TCP connection. Don't assume that. With HTTP multiplexing like the stuff that Netscaler does (web accelerator), you're going to see most of your requests coming out of a small handful of TCP connections. Make sure your application supports that. Even if you don't use a web accelerator, browsers will do that do. Don't cheat and force the connection closed on every HTTP request, your web server will crap.

7. This is related to 6, but don't forget that web connections are very short lived compared to what the original designers of TCP were thinking about. As a result, you're going to run into cases where you run out of epheral ports (netstat -an will show a ton of ports in TIME_WAIT) even though your machine is idle. This is why HTTP Multiplexing is important -- you don't want a lot of connection churn. Yes, you can tweak your OS settings so that TIME_WAIT expires quickly, but that isn't going to help your overall performance. (TCP connection setup/teardown is a huge burden on a HTTP request that may only span a few packets...)

8. Look into HTTP acceleration technology from the get go. I've used several different brands and I've found Netscaler's to be the best. They are crazy fast and capable boxes that have a ton of features (like the HTTP multiplexing, SSL acceleration, HTTP compression, web