Why eCommerce Sites collapse
Rahul Mehra writes "ZDNet has an interesting article about how eBay and other e-commerce sites collapse under heavy loads. It talks about how massive growth, incomplete planning, rising expectations (24x7 uptimes) and immature technology all contribute." This train of thought, for me at least, leads to a neo-Luddite question - what do you folks think?
At least, that's the way a lot of companies seem to treat sysadmins. A good sysadmin, who keeps systems running smoothly, *appears* to be doing nothing. "Why should such a person be paid very much just to run backups and turn a few screws?" they'll ask. The sysadmin is viewed as a high-tech janitor, and is given about as much respect. This often results in companies hiring only one sysadmin, or worse, foisting the sysadmin's duties onto other people on a standby basis. So when the fires come, the company is suddenly understaffed. A bad sysadmin, who's always recovering from crashes, restoring backups, and rerouting network traffic, looks like a busy employee. "If not for him, our machines would all be down, right?" they will say. So what we have here is a scenario that favors bad sysadmins, overworked sysadmins, or standby sysadmins who are actually full-time employees with other stuff to do and worry about. Welcome to hell.
One thing the article didn't mention directly, but which plays a very important part in the stability and quality of your infrastructure, is the quality of your sysadmins. It doesn't matter if your boxes are triply redundant, hot-swappable, never-go-down systems if the sysadmins inadvertently blow away key files periodically. Lots and lots of IT places seem to hire semi-trained monkeys as sysadmins and then wonder why their site is always going down. Look at the chart of outages on the second or third page of that article and notice how often "Failed software upgrade" appears. The problem is that the hardware vendor is usually blamed for those kinds of problems, which draws attention away from the true problem of unqualified sysadmins. Of course, most of the Slashdot crowd doesn't fall in that category.
I read the internet for the articles.
Doing it right the first time of course isn't that easy, but once it works, don't break it. It must be possible to run a 24x7 site for an entire year while stuck on Gilligan's Island, without any way to contact the rest of the world, including the site.
NASA has computers in buildings where deadly chemicals are around at any moment (a few seconds after a small leak, everyone in the building is dead!). Do you think that their IS wants to touch the computers? Not unless they first send everyone else home and empty those tanks. If it wasn't so heavy they would probably insist on space suits too.
You decide a year or more in advance how much bandwidth you will get. Then decide how many customers that will support, and don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them, and long-term satisfaction will go up.
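For what it's worth, the "fix the bandwidth, derive the customer ceiling, then refuse anyone past it" idea can be sketched in a few lines of Python. This is only an illustration: the function and class names, the 30% headroom, and every figure in the usage note are assumptions of mine, not numbers from the article or from any real site.

```python
def max_customers(link_kbps: int, peak_kbps_per_customer: int,
                  headroom_pct: int = 30) -> int:
    """Customers the link can carry at peak while keeping some of it in reserve.

    All inputs are hypothetical planning figures; integer math avoids
    floating-point rounding surprises.
    """
    usable_kbps = link_kbps * (100 - headroom_pct) // 100
    return usable_kbps // peak_kbps_per_customer


class SignupGate:
    """Refuses new customers once the planned ceiling has been reached."""

    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.active = 0

    def admit(self) -> bool:
        if self.active >= self.ceiling:
            return False  # this is the point where marketing is told "no"
        self.active += 1
        return True
```

With a made-up 45,000 kbps link and 20 kbps per customer at peak, `max_customers(45000, 20)` comes out to 1575, and a `SignupGate(1575)` would turn away customer number 1576.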
Once you know how much bandwidth you will have, you make sure you have computers that can deal with it. Mainframes have been doing 24x7 for years. Unix is very close to matching that (with Sun's redundant hot-swappable systems perhaps better; not that Sun is the only choice). I have seen triple-redundant systems with a polling mechanism where, if one computer gives a different result, it is shut off. Guess what: none of this is cheap. That's right, doing business on the internet in volume isn't cheap. Spend the money on systems that will stay up, and on enough power that you don't run out, and you will run 24x7. There are plenty of companies that make equipment that is meant for this use.
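The triple-redundant polling setup mentioned above boils down to majority voting. Here is a rough Python sketch of that idea, assuming each machine's answer can be compared for equality. The function name `vote` is mine, and real systems of this kind do the comparison in hardware or firmware rather than in a script like this.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def vote(results: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return the majority result and the indices of machines that disagreed.

    Raises ValueError when no strict majority exists, since then there is
    no way to tell which machine is the faulty one.
    """
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority; cannot tell which machine is wrong")
    dissenters = [i for i, r in enumerate(results) if r != winner]
    return winner, dissenters
```

Calling `vote([42, 42, 41])` returns `(42, [2])`, so machine 2 is the one to shut off. With three machines that all disagree, the function raises instead of guessing.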
Last, and foremost: hire system administrators who have proven they can keep the systems running 24x7, and pay them to do so. These people are older, in their 50s or so. Hire the best of the experienced, and then give them a deal: you pay them to keep the systems up, whether they are there or not. They should soon find a paycheck arriving every two weeks, with only a few hours a month of work.
Remember, design the system so you can run it from Gilligan's Island (no access by you) without your boss realising, and you will do fine.
Of course the reality is that you do have to replace crashed hard drives, but with RAID-6 (RAID-5 plus more redundancy; RAID-6 isn't officially defined) that can be done at any time. You do need to buy more backup tapes once in a while, but automated backups are the norm in 24x7 environments.
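The reason a dead drive can wait is parity. A minimal Python sketch of the single-parity (RAID-5 style) part: the parity block is the XOR of the data blocks, so any one lost block can be rebuilt from the rest. The extra redundancy that RAID-6 adds on top is not shown here, and the block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"ebay", b"auct", b"ions"]  # hypothetical data blocks, one per drive
parity = xor_blocks(data)           # stored on a further drive

# The drive holding data[1] dies: XOR the survivors with the parity block
# to recover its contents.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Here `rebuilt` equals `data[1]`, which is why the array keeps serving while you wait for the replacement drive.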
The first link basically says that the eBay guys weren't paranoid enough about making sure the setup was reliable. This is always a problem. (Hey, I'm working on a commercial web site that only got a proper sysadmin 2 years after it started...) Little side note: one guy says Sun's clustering stuff is not that great... I know Sun has been a bit late in getting started on clustering, but I've also heard that what they have done is pretty good, *shrug*. Actually, they just announced version 3 last week, which also allows clustering of 16 Starfires, for 1024 processors. (They're also making the source code for this available...)
This is all very interesting to read about. One would think that the internet is generating "unheard amounts" of load on various systems for the first time. Mainframes (IBM, Unisys, Amdahl) have taken much more than this in terms of load or transactions per second. The problem that I see is that people tend to abandon architectures that have worked in the past for the new cool things that vendors shove down their throats. CICS, or for that matter virtually any transaction-intensive database on a proper mainframe, could handle that load (some of my customers are doing 20-30K transactions per second and they are in no way "big" users). At times, the whole internet revolution reminds me of the "client server" phase that the industry went through. Ziff Davis was one of the proponents of this phase (well, they had to sell them damn magazines, didn't they?), often claiming that a Novell file server would be damaging to companies like IBM. Well, perhaps it is time to step back and examine how some of the legacy systems have worked (heck... imagine your bank telling you that their systems got overloaded on pay day!?!). Then let's see how we can adapt them to the Internet. IBM is doing an awesome job on this and so is HP. I strongly believe that the systems we're seeing today are "prototypes" doing proofs of concept, waiting on the big iron boxes to become internet enabled. One more point. Most of the classic "brick and mortar" businesses, people who know their technology, customers, systems... are NOT internet or e-business enabled. Let's drop a few names of the Dow Jones components: Ford, GM, GE, Dow, Coke, etc. do more business than the e-business startups and probably process more transactions per day on their mainframes. I'd be more concerned about what happens when they start up their internet "storefronts"... OK, just a few random thoughts before I head into work...
As it stands now, eBay's auctions are so time-critical that they're in the same league as online brokerages. And speaking of brokerages...
Fidelity is running TV ads (plastered all over Pirates of Silicon Valley last night) touting the speed of their systems and how seconds count, with a quick disclaimer at the end of the ad that response time depends on network conditions. This is a pet peeve of mine: ads with disclaimers which make the rest of the ad meaningless. Example: "99c Big Macs! That's right, 99 cents! Only 99 cents! Prices may vary." But the point is that they're promoting the idea that the internet is suitable for real-time transactions, even though they recognize that it isn't quite there.
That was funny. You got me thinking about other Fox computer-related specials:
America's Funniest Core Dumps
When Spammers Attack
I Married A SysAdmin
Real Life Reboots
Totally Shocking Backups -- Caught On Tape
/* Alright -- quit yer groanin' */
Save the whales. Feed the hungry. Free the mallocs.
Oracle can use either a file or a partition for its datastore. In recent versions of Solaris, Oracle will put the file in a mode which can bypass the VFS layer of the OS to get near raw partition speeds.
The inside scoop was that, because eBay did not install the latest kernel patch to Solaris 2.5.1, they ran into a bug where if you have a kernel core dump of more than 2GB, it will piss all over your disks. I suspect they do not have root or swap under Veritas control.
So when the machine panicked, it overwrote most of root (the core dump is written from the end of swap back toward the beginning, and many users have root just before swap on their disk).
So they not only had to restore Solaris, they had to restore their configuration. Not something that can occur quickly, esp. when the CEO of a company is breathing down your neck. It is also my understanding that the eBay database itself was okay and didn't have any data corruption.
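To make the overwrite described above concrete, here is a toy Python model of a dump written backwards from the end of swap, with root sitting just before swap on the disk. It only illustrates the arithmetic in this comment: it does not model the 2GB bug itself or what Solaris really does, and the function name and slice sizes are hypothetical.

```python
def root_overwritten(root, swap, dump_size):
    """How much of root a backwards-written dump tramples.

    root and swap are (start, end) offsets in the same unit (say, MB),
    with root located before swap on the disk. Purely illustrative.
    """
    root_start, root_end = root
    swap_start, swap_end = swap
    dump_start = swap_end - dump_size  # dump grows downward from end of swap
    if dump_start >= swap_start:
        return 0                       # the dump fits inside swap
    overlap = min(root_end, swap_start) - max(root_start, dump_start)
    return max(0, overlap)
```

With a made-up 1024 MB root slice at `(0, 1024)` and 1024 MB of swap at `(1024, 2048)`, a 512 MB dump touches nothing outside swap, a 1536 MB dump eats 512 MB of root, and anything of 2048 MB or more wipes out the whole root slice.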
You want to keep in mind that Dell is using UNIX servers on the backend. It's my understanding that they have several Sun E10K's and IBM S70s. (All in Dell black and grey, of course)