Building/Testing of a High Traffic Infrastructure?

Posted by Cliff on Saturday November 13, 2004 @05:21AM from the transactions-and-pages-per-second dept.

New Breeze asks: "I'm currently working on my first web 'application', and have discovered that I know less than nothing about setting up the infrastructure to manage a high traffic system. Where does one go to learn about setting up the infrastructure required to host something like Slashdot? Or do you just say, 'Not my area!' and help them find a consultant?" "My experience is pretty much limited to:

1. Install the web server on one box, the database on the same box if it's a small installation or a separate box if performance seems like it will need it. Add more memory and processors based on SWAG criteria. (Scientific Wild Ass Guess)

2. Contract with a hosting company.

I had a potential customer ask what I would recommend if they wanted to self-host. They have around 300 remote locations and would have multiple users from each location hitting the application at the same time, so saying "a couple of beefy servers" probably isn't the right answer.

I haven't a clue. The last place I worked with on something like this hired a high-dollar consultant who spent a huge pile of their money setting up a load-balanced, Oracle Parallel Server, redundant-everything system.

How do you test it? I've worked where they actually had a room with hundreds of systems on racks that they would configure to run test transactions against different servers and software builds for stress testing, but that's not in my budget..."

Ask a Pr0n serving company by chia_monkey · 2004-11-13 05:24 · Score: 5, Insightful

Seriously...they know all about serving up content on high traffic sites. Not only is it high traffic, but it's rather big files that they're delivering. When we're testing the networks that we set up, both wired and wireless, we often visit pr0n sites for our benchmarks.

--

"He uses statistics as a drunken man uses lampposts...for support rather than illumination." - Andrew Lang
1. Re:Ask a Pr0n serving company by Anonymous Coward · 2004-11-13 06:46 · Score: 1, Insightful
  
  Objects can create more load in memory. On a particularly heavily loaded site I've seen sessions take up a fair amount of memory, and objects generally seem to use more memory than standard variables. Yes, the code may be smaller and better written, but that doesn't make it more efficient at conserving resources. For database transactions, objects are normally used because they make it easier to change the connection details. Of course I'm sure I will get stoned by OOP zealots since it is the new(old) thing.
  
  I agree with the static nature. Commonly, when content is edited for a mass audience, it can be saved into a static page rather than pulled from a database. Failing that, writing a caching module, or using one that already exists, can reduce overheads significantly. Images viewed as thumbnails can be written out once viewed, so the resize script does not need to run on every request. Additionally, just make sure thumbnails are generated upon upload.
  
  Generally to conserve resources, elegant generic code with objects, session variables and sql queries with multiple joins aren't the way to go. Like the parent says, only use what you need. However if you need more than one table you can always consider merging the tables into one. Okay, yes this goes against normalisation, but normalisation is used to reduce redundant data, not speed up a query or reduce overhead.
  
  Large savings can be made by ensuring things aren't recursive, by eliminating those loops and hardcoding stuff that might not be changed for a long while. This all sounds like bad programming, but it does reduce the overheads.
  Simple concepts reducing memory usage. Cut down on session vars, reduce sql queries to a minimum. Set connection limits. Ensure you have www log rotation and information. If the server is full, then make sure you're warned about it.
  
  I guess what I'm trying to point out is that elegant code isn't always fast, and most developers aim for a streamlined, well programmed application. Sometimes, however, inelegant programming can result in optimisations that you might not notice except on a heavily loaded server.
2. Re:Ask a Pr0n serving company by Barryke · 2004-11-13 06:57 · Score: 2, Insightful
  
  Candid Hosting is expensive, at least in comparison to Dutch hosting (Amsterdam Internet Exchange)
  
  --
  Hivemind harvest in progress..
Test using Slashdot itself! by xmas2003 · 2004-11-13 05:26 · Score: 2, Insightful

In answer to your question about testing, have your web site /.'ed and see how it handles the Slashdot Effect which is a pretty good stress test! ;-)
P.S. When I first tried to read this story, I got "Nothing for you to see here. Please move along" ... somewhat ironic I'd say ...

--
Hulk SMASH Celiac Disease
Do the math by MarkusQ · 2004-11-13 05:33 · Score: 5, Insightful

First step, do the math.
What was once a "high volume" app may be nothing for modern equipment. You're talking about on the order of 1K concurrent users (300 sites * several users per site).
If "use" means manually typing data into forms, viewing mostly static pages, etc. this isn't really a very "high volume" application, and a single decent server should handle it.
If, on the other hand, "use" means constantly running complex queries against a billion item data set, you're doomed.
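To make "do the math" concrete, here is a rough sketch in Python. Only the 300 sites come from the question; the three users per site and the 30 seconds between clicks are assumptions picked for illustration, so plug in your own numbers.

```python
# Back-of-envelope load estimate. Everything except the site count is assumed.
SITES = 300                   # from the question
USERS_PER_SITE = 3            # assumption: "several users per site"
SECONDS_BETWEEN_CLICKS = 30   # assumption: people typing data into forms


def estimate_load(sites, users_per_site, seconds_between_clicks):
    """Return (concurrent users, average requests per second)."""
    concurrent_users = sites * users_per_site
    requests_per_second = concurrent_users / seconds_between_clicks
    return concurrent_users, requests_per_second


users, rps = estimate_load(SITES, USERS_PER_SITE, SECONDS_BETWEEN_CLICKS)
print(f"{users} concurrent users, about {rps:.0f} requests/second")
```

With those assumed inputs the average request rate is small. The answer changes a lot if each request kicks off a heavy query, which is the other end of the spectrum.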
So where do you fall in this spectrum?
Coming up next...where's the bottleneck?
-- MarkusQ
Dear Slashdot... by Anonymous Coward · 2004-11-13 05:36 · Score: 5, Insightful

I'm currently working cooking in a restaurant, and have discovered that I know less than nothing about performing stomach surgery. Where does one go to learn about the techniques and tools necessary for curing stomach cancer? Or do you just say, 'Not my area!' and help them find an oncologist?

Seriously.. you have a lot to learn, and a lot of what you need to know just comes from experience which you can't get from a book.

First: learn how everything works. When you click a link in your "application" (why the quotes?), what happens? For instance, does it run a Controller object? If you're using a language like Ruby or Perl, is it "pre-compiled" or does it have to interpret a script on each hit? Does the controller then go to the database and populate variables, then insert them into a template, then render the template? Is the template cached? How are your database settings? Enough memory for joins? Are all your queries using the appropriate indexes? Are you familiar with your database's performance-measuring variables and tools? Are you pulling more data than you need to in each query?

Once you have an understanding of what's happening, then you can start measuring. Where are the bottlenecks? This is a very important thing to keep in mind in programming or system architecture: DON'T OPTIMIZE UNLESS YOU NEED TO! Keep your system and code as simple as possible. For instance don't cache things in your program (making it more complicated and harder to maintain) unless you have a BENCHMARK IN HAND showing a performance bottleneck.

You might not need to move your database to another machine. What you need to do depends on your app.

Yes, you will need to do a lot of testing to identify your "first round" of bottlenecks. You need to build a lot of diagnostics into your app to help you identify how long different steps take.
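As one possible shape for those diagnostics, here is a minimal Python sketch that times named steps of a request. The handler and the step names are hypothetical, and the sleeps only stand in for real work; the point is just that each step records its own elapsed time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# step name -> list of elapsed seconds, one entry per call
timings = defaultdict(list)


@contextmanager
def timed(step):
    """Record how long the wrapped block takes under the given step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step].append(time.perf_counter() - start)


def handle_request():
    # Hypothetical request handler; the sleeps stand in for real work.
    with timed("db_query"):
        time.sleep(0.02)
    with timed("render_template"):
        time.sleep(0.005)


def report():
    """Return (step, calls, average seconds) rows, slowest average first."""
    rows = [(step, len(v), sum(v) / len(v)) for step, v in timings.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for _ in range(5):
    handle_request()
for step, calls, avg in report():
    print(f"{step:16s} calls={calls} avg={avg * 1000:.1f} ms")
```

In a real app you would write these numbers to a log rather than print them, so you can compare them before and after each deployment stage.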

Always deploy your app in stages, one site at a time, until you start identifying some problems. Then fix those problems before continuing deployment. Never "flip a switch" and reveal any change all at once.

Good luck!
Two scenarios: by Jerf · 2004-11-13 05:38 · Score: 4, Insightful

1: Gradual growth. Find bottleneck, remove it. Repeat. Make sure to start with a growable database and web site technology, but that shouldn't be too tough. Also, stay ahead of the game, always with overcapacity, both to cover for outages and for sudden growth spurts.

2: Instant growth from 0 to thousands+: Hire someone who knows what they are doing. In the first scenario, you have the time to learn what is actually going on, which is an advantage. In this one, you don't, and the customer base is too big (i.e., $$$) to screw with.

That basically covers it. Specific advice will vary widely based on databases and web technology deployed, so just about any other specific advice you get here is as likely to be wrong for you as right.
Depends by JediTrainer · 2004-11-13 05:43 · Score: 3, Insightful

What constitutes 'high traffic' for you?

I've been developing a high traffic site (well, maybe medium traffic) at about 1.5 million transactions per month. We have customers using the site all over North America, plus a few in Europe and Asia, and the whole thing is hosted internally off of our 10MB link.

We have each 'tier' clustered as a pair of servers - 1Ghz/256M is more than sufficient for our 2 Apache servers. 3Ghz/1GB is our Tomcat tier, and I'm not sure what the DB runs on, but they're the beefiest servers of all the tiers.

Within the app architecture, try to ensure that you can scale to more servers. We have the ability to add more servers to any of the above tiers without any changes, plus any long-running processes (complicated reports and such) get dispatched to a fourth layer of servers we call 'backend' (by RMI). These 'backend' servers can be low-end (300mhz/256M are fine), because they run non-time-critical tasks and generally might email their results or whatever.

In this way, we've avoided the EJB complications while also having full redundancy at every level. There was some custom framework involved, but it's been working well. Our application was complex enough to warrant an advanced framework (similar to Struts, except we wrote ours before Struts came out), yet EJB seemed too heavy for what we wanted to accomplish. Of course it didn't hurt that the only thing we paid licenses for was the DB.

Importantly, though, this was the right solution *for us*. It's serving us well, and already scaling well beyond the number of customers we originally anticipated would be using it. While this meets our needs fairly well, it may or may not be the right type of solution for what you're looking for, particularly because I don't know what your application is supposed to do.

--

You can accomplish anything you set your mind to. The impossible just takes a little longer.
Can you qualify some of this stuff? by UVABlows · 2004-11-13 06:09 · Score: 2, Insightful

I don't really understand a lot of the stuff you said (I am not a sysadmin). For example:

What does it mean to not scale "vertically"? When I read that, the only thing that comes to mind is to put the boxes next to each other, not on top of each other. From context I gather that horizontally means extra machines, but what does vertically mean?

For "dropping in an extra server when needed without a lot of reconfiguring", what do you mean by "a lot of reconfiguring"? Obviously you need to get the machine, install the os, set up networking, install the web server, setup the web application, point it at the database, etc. How does the application being "stateless" help? I guess, what are some examples of state that an application can have that will make configuring an additional web server difficult?

Concerning the pseudo static data regeneration, what if the thing that was being updated was only accessed once every half-hour on average? I am assuming then that generating the page on demand would be better?

I don't really know what you mean by "MAKE YOUR WEB SERVERS STATELESS". I mean, they have to know if a request just came in, where the data is, what time it is etc, and that stuff gives it state. I am assuming you mean something else by stateless but I cannot figure it out.

Thanks for the help!

--
<high-level position here>
<name of stupid small company here>
Re:This is the type of question by deesine · 2004-11-13 06:26 · Score: 2, Insightful

You've got to be kidding, right?!

This guy's asking how he might setup a race car for the NASCAR circuit. And you're telling him; forget about $big block engines, forget about $super injected fuel & exhaust flow, forget about $blue-printing the motor...you can get the same performance from your Escort, just press harder on the gas pedal!

Thanks for the laugh! LOL

-d

--
damaged by dogma
...as always, it depends... by Bob+Bitchen · 2004-11-13 06:27 · Score: 3, Insightful

First off, I'd say you're doing this bass-ackwards. You really should have already answered all these and many other questions before ever laying fingers to keyboard.

It depends on lots of things. Who's going to manage the self-hosted host? If they have an IT dept. maybe they can provide the hardware sizing. In any case you will first need to establish the usage patterns and then go forward from there.

--
http://tinyurl.com/3t236
Scaling a high traffic site by DFossmeister · 2004-11-13 06:57 · Score: 2, Insightful

First, although 300 locations with a few users each may sound like a high-volume site, it is not. I don't want to burst any bubbles, but it simply is not high-traffic in today's world. I work with large e-tailing sites that get 200,000 unique visitors per hour.

The first step is to determine the type of load you will receive. Is it call-center type traffic, where they will have dedicated staff accessing the application, or will it be more like Internet traffic that comes in waves when it feels like it? If your application fits the call-center model, then you need to know the maximum number of operator-types that will be online at any given time. If it is more like an Internet site, such as Slashdot, then you need to either project the number of sessions per hour, also called the arrival rate, or examine the web logs to find out.

Concurrent users and arrival rates are not the same--one is the output of the other. In arrival rate mode, the number of concurrent users varies depending on the number of visitors arriving that minute and the speed of the site. If the site slows down, which it will at a higher rate of visitors, then the sessions will take longer. If the sessions take longer while visitors continue to come to the site, the number of concurrent users rises. Internet visitors do not know how many users are on the site and certainly won't obey any threshold that you determine.
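Queueing theory calls this relationship Little's Law: average concurrent users equals arrival rate times average session length. A minimal sketch in Python, where the arrival rate of 100 visitors per minute and the session lengths are assumed numbers, not measurements:

```python
def concurrent_users(arrivals_per_minute, session_minutes):
    """Little's Law: average number in the system = arrival rate * time in system."""
    return arrivals_per_minute * session_minutes


ARRIVALS_PER_MINUTE = 100  # assumption for illustration

# Same arrival rate, but the site gets slower, so sessions take longer.
for session_minutes in (3, 6, 12):
    users = concurrent_users(ARRIVALS_PER_MINUTE, session_minutes)
    print(f"{session_minutes:2d}-minute sessions -> {users} concurrent users")
```

The arrival rate never changed in that loop. Only the session length did, which is why a slowdown feeds on itself.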

The second step is to test over the Internet, and from as many remote locations as possible. You said that there were to be 300 remote offices. Are these all in the US, or are any of them international? Testing on a local LAN does not tell you much of anything, because there is no latency and everything runs at the speed of your switch. Very few people have 100 megabit connections to the Internet, so it is not realistic to test that way. Real users have a mix of line speeds, and come from a variety of locations. It is best to test from 5 or more geographically dispersed locations, using a distribution of the line speeds that your end users will be using. If each of these 300 sites has a T1, and each site has an average of 3 users, then each user should run at about 512Kbps, not 1.54Mbps.
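The per-user line speed is just the shared line divided by the users on it. A small Python sketch of that arithmetic follows; the even split is a simplifying assumption, since real users will not all transfer at the same moment.

```python
T1_KBPS = 1544  # a T1 line carries 1.544 Mbps


def per_user_kbps(line_kbps, users_on_line):
    """Bandwidth each user gets if the line is shared evenly (a simplification)."""
    return line_kbps / users_on_line


print(f"{per_user_kbps(T1_KBPS, 3):.0f} Kbps per user with 3 users on a T1")
```

That comes out close to the 512Kbps figure, so throttle each simulated user to roughly that speed instead of the full line rate.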

Lastly, perform realistic transactions on the site; don't just hit the home page. Real users on the site will probably start at the home page and traverse the site, doing various things. You should have an idea of what these actions will be, or you can examine the web logs to determine the top 10 paths through the site. Then write scripts for each path and run them proportionately. You also need to build think or dwell times into each page. Real users don't go from page to page as fast as possible! They take time to fill out forms. A good load test takes into account how familiar a person is with the site and what the person's patience with the site will be. A person using an SSL connection purchasing something has more patience than someone browsing a catalog. By the same token, an operator-type person does not have any choice about whether they can use the site or not; however, their productivity will be directly proportional to the speed of the site.
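Here is a rough Python sketch of what "run the paths proportionately, with think times" can look like. The paths, their weights, and the 5 to 30 second think-time range are all made up for illustration; take yours from the web logs. The `fetch` and `sleep` arguments are hypothetical hooks so the example runs without touching a network.

```python
import random

# Hypothetical paths through a site: name -> (share of sessions, pages visited).
PATHS = {
    "browse_catalog": (0.6, ["/", "/catalog", "/item"]),
    "search": (0.3, ["/", "/search", "/item"]),
    "checkout": (0.1, ["/", "/cart", "/checkout"]),
}
THINK_SECONDS = (5, 30)  # assumed range of dwell time per page


def plan_session(rng):
    """Pick one path in proportion to its weight and give each page a think time."""
    names = list(PATHS)
    weights = [PATHS[name][0] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    steps = [(url, rng.uniform(*THINK_SECONDS)) for url in PATHS[name][1]]
    return name, steps


def run_session(steps, fetch, sleep):
    """Visit each page, then wait out the think time before the next one."""
    for url, think in steps:
        fetch(url)
        sleep(think)


rng = random.Random(42)
name, steps = plan_session(rng)
visited = []
# Dry run: record the URLs instead of requesting them, and skip the waiting.
run_session(steps, fetch=visited.append, sleep=lambda seconds: None)
print(name, visited)
```

For a real test you would pass a function that performs the HTTP request as `fetch` and `time.sleep` as `sleep`, then run many sessions in parallel at the arrival rate you projected.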

There are very few open source or free tools that do these things for you. Your options are to 1) wing it as best you can using the SWAG method you described, or 2) seek help. There are various Do-It-Yourself outsourced solutions, such as Test Perspective or some other total outsourced solution. The DIY method will probably get you the best value, but you are subject to your own work, and don't have anyone to blame if things go wrong.

--
No Not Again! Its whats for dinner.
See how Wikipedia does it on a shoestring by gtoomey · 2004-11-13 17:38 · Score: 3, Insightful

Look how Wikipedia organises its cluster on a shoestring budget.
- Over 750 requests/second on 29 servers - average >20 requests/second each (Yes, I know some are not http servers). Compare that to some commercial solutions.
- commodity hardware
- squid for caching/load balancing, feeding Apache
- multi-tiered architecture
- dual Opteron for the master mysql database
Don't guess by dubl-u · 2004-11-13 20:57 · Score: 2, Insightful

My guess is going to be that the bottleneck is going to be the database, but we've done extensive testing with a million customer sample database running multiple instances of test applications from 10 other boxes, but that doesn't exactly prove much as it's too predictable.

Don't guess. You have too much riding on it to guess. Build proper test infrastructure. Not only will it pay off now, but it'll be hugely valuable in the future as you change and expand the application.

If you're not sure what real users will do, get some to try it out and record their activity. Then spend a little time building a load model, where you describe types of users, their activity patterns, and their expectations. (E.g.: "At the end of the month the 800 salespeople will rush to meet their quota, and during a peak hour they'll each do..." Generally I end up with nice series of spreadsheets, so I can adjust registered users and see peak hits per second come out the other end.) Then simulate the projected load and see where the real bottlenecks are.
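The same kind of load model fits in a few lines of Python if you'd rather not use a spreadsheet. This is only a sketch: the 800 salespeople come from the example above, while the second user type, the active shares, and the hits per hour are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass
class UserType:
    name: str
    registered: int            # how many such users exist
    active_share: float        # fraction active during the peak hour
    hits_per_active_hour: int  # requests each active user makes in that hour


def peak_hits_per_second(user_types):
    """Sum the peak-hour hits across all user types and convert to per-second."""
    hits_per_hour = sum(
        u.registered * u.active_share * u.hits_per_active_hour for u in user_types
    )
    return hits_per_hour / 3600


MODEL = [
    UserType("salespeople", 800, 1.0, 90),    # 800 from the example; rest assumed
    UserType("support_staff", 100, 0.5, 36),  # entirely assumed
]

peak = peak_hits_per_second(MODEL)
print(f"projected peak: {peak:.1f} hits/second")

# Adjust registered users and see what comes out the other end.
doubled = [replace(MODEL[0], registered=1600), MODEL[1]]
print(f"with twice the salespeople: {peak_hits_per_second(doubled):.1f} hits/second")
```

The projected peak is what you then feed into the simulated load, and the model is easy to revise once you have recorded real user activity.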

You should be really wary about optimizing without data. As Knuth says, "Premature optimization is the root of all evil." I know a number of people who build very high volume stuff, and I don't know any of them who haven't been frequently surprised at exactly where the bottleneck turned out to be.

Also, start small and work up. There's no need to build a huge load testing suite all in one go; often you'll learn enough from the first simple tests to point developers and sysadmins in the right direction.