The root of all eBay's troubles

Posted by ryuzaki0 on Friday June 11, 1999 @10:07AM from the supposed-to-be-back-up-soon dept.

UncleRoger writes "A friend pointed me to this article which would appear to explain why eBay has had such troubles with downtime, including the outage since Wednesday evening." It would appear that MS is tired of having the finger pointed at them - as they point out, it's an Oracle database running on Solaris that's causing the troubles.

This is absurd. by prolix · 1999-06-11 09:15 · Score: 3

Some of this is just absurd. For example, the six points of failure with the Starfire. Hrmm:

"Applications running in Domains are only as reliable as the instance of the Solaris operating system. For applications to gain enhanced reliability from Domains, users must explicitly set up clustering, just as in standalone systems. Sun does not recommend clustering between Domains, suggesting instead that fail-over occur to either separate, standalone systems or Domains in other Enterprise 10000 systems."

Uhh, duh, isn't that the whole idea? Am I missing something here?

"Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center."

Excuse me? The Ultra 5 an "unreliable device"?? We have a farm of Ultra 5s that have been running for a year now. Total number of system failures or crashes of any kind: 0. Period. How is the Ultra 5 any less reliable than any other workstation-class system?

"When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence."

No kidding. When (or rather, *if*) security is compromised, you could do a whole lot more than bringing down all running domains. Just the same as any other platform. How is this a weakness specific to Solaris or the Starfire?

And besides, these are supposed to be *secured* (meaning, physically) control consoles. Meaning, locked in a cabinet in the datacenter.

"System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board."

Ummm, yeah. So? How is this any different from any other operating system? Again, I fail to see what the problem is. And besides, how often do you change system boards? Please.

Sure, go ahead... try and remove a CPU card from any NT-based system without first warning the OS. Not only will it hang horribly (i.e., you can't do it!), you'll probably fry hardware as well!

The fact that the Starfire can even do this is pretty amazing.

"System boards that are hosting Token Ring adapters, ATM adapters, or non-Sun disk controllers cannot be present in a domain if board-remove operations involving kernel quiescence are to be performed on that domain."

Uh-huh. Sure. I know lots of people with Starfires running Token Ring off of non-Sun hardware that are removing boards with non-pageable data. Happens every day.

I'm not saying it doesn't happen per se, I just think that these arguments are rather ridiculous.

"If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available."

What kind of idiot doesn't leave enough swap space? What kind of admin would go ripping out system boards without really thinking it through first? What kind of person spends the incredible amount of money the E10000s cost without being informed as to the basics of running a Solaris-based system? Come on.

It's like saying "If you remove a CPU card from an NT-based system while running domains are active, the system will be brought down and all domains brought offline." Ummm, duh. If you remove your legs, you can't walk either. Apparently, M$ thinks that true Unix sysadmins are as stupid and as lacking in common sense as the server admins that they're used to dealing with.

"Reliable hardware is getting even more reliable. For example, customers can take advantage of 99.9% system-level uptime guarantees for Windows NT-based servers from major systems vendors, such as Compaq, Hewlett-Packard, IBM, and Data General."

These are guarantees on the hardware, not software. I'm sure this looks great for the PR, but hello? I'd love to know what the "major system vendors" think about Windows-based servers being equated with their hardware guarantees.

"Microsoft Windows® 2000 Server builds on these gains. For example, Windows 2000 Server supports COM+ load balancing, which eases customer development of highly available and scalable applications in a multi-tiered environment. On the back-end, Windows 2000 Advanced Server supports two-node fail-over clustering, whereas Windows 2000 Datacenter Server will support four-node clustering. IBM and other vendors will provide support for up to eight nodes."

WOW! I am truly impressed. Two or four-node fail-over. Please.

Finally, at the end:

"Which brings us back to eBay. For those keeping score, eBay relies on Windows NT-based servers running Internet Information Server to provide front-end web services, and a single Enterprise 10000 from Sun Microsystems to host an Oracle database on the back-end. According to published reports, the outages at eBay, which began in February, are due to problems at the back-end."

This is curious. Maybe I'm missing something, but a telnet to port 80 shows that www.ebay.com is using Apache 1.3.6 on Solaris. It doesn't get any more front-end than that, does it?

I did notice that pages.ebay.com and listings.ebay.com are running IIS 3.0, and cgi.ebay.com is running IIS 4.0.
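
The check described above (connect to port 80, ask for the headers, read the Server line) can be sketched in Python. This is only a minimal sketch and not what the poster actually typed: the function names `parse_server_header` and `fetch_server_header` are made up for illustration, and a server is free to report whatever it likes in that header.

```python
import socket
from typing import Optional


def parse_server_header(raw_response: bytes) -> Optional[str]:
    """Return the value of the Server header from a raw HTTP response, if any."""
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"server":
            return value.strip().decode("latin-1")
    return None


def fetch_server_header(host: str, port: int = 80, timeout: float = 5.0) -> Optional[str]:
    """Send a bare HEAD request, much as one would type into telnet, and parse the reply."""
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return parse_server_header(b"".join(chunks))
```

Calling `fetch_server_header("www.ebay.com")` would do the network round trip; what comes back today will of course not match what was reported in 1999.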

Also notice that their web site is still up and running. Not that that means a whole lot, but hey.

I find a lot of what this article had to say utterly hilarious. The implication that the Starfire is an unreliable and dangerous system is the greatest work of FUD that I've seen in my life.

OK, enough said.

--
--globalnap.net, product of pure caffeine--
Some fun statistics from ms... by reverse solidus · 1999-06-11 05:24 · Score: 4

"The percentage of users running Windows NT Workstation 4.0 whose PCs stopped working more than once a month was less than half that of Windows 95 users."

More here

Yeah, yeah, it's workstation and not server, totally different operating systems. Not.
The FUD is so thick I can hardly see... by BeBoxer · 1999-06-11 05:25 · Score: 5

Just a couple of choice pieces of FUD from the M$ web page.

--------quote on----------
3.When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
--------quote off----------
So, let me get this straight. The workstation which is responsible for "controlling" operations can be used to stop operations? What was Sun thinking, including a command that would turn things off! Of course, we all know that one of the main features of NT Server over NT Workstation is that the "Shutdown" command has been removed from the "Start" menu.

--------quote on------------
4.System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
--------quote off------------
This is supposed to be a problem? Now, I think it's pretty neat that you can migrate kernel memory off of a certain piece of hardware and swap it out at all. We're supposed to believe that under NT you can do this at all? Much less without telling the kernel first? The only conceivable way to allow this to happen without telling the kernel to clear the board out first would be to make sure that all the kernel memory has had a copy paged out to disk. Or perhaps keep multiple copies of all kernel data structures (and hope they don't get out of sync.) Maybe NT does do the last one. That would certainly explain why it's a memory hog.

Pretty amazing if you ask me. This web page is clearly meant for the PHBs of the world, as anybody with any knowledge at all of how computers work is simply going to laugh at this.
Ebay uses not one E10K, but... by afabbro · 1999-06-11 05:27 · Score: 3

...two. At least according to an InternetWorld article a month or so ago.

--
Advice: on VPS providers
eBay's custom software is buggy by Josh Turpen · 1999-06-11 05:38 · Score: 4

Is it that hard to see? They changed their page layout and tweaked the software, and now they are getting data corruption. It doesn't have anything to do with Solaris or Oracle. If you design a database that can't correctly handle concurrency, doesn't have good constraints, no triggers, etc. ad nauseam, then you are going to get data corruption. Also, the whole 'high availability' argument is laughable. eBay's buggy software still has high availability. ;) Just because their software crashes doesn't mean Oracle and Solaris are crashing.

It's just more FUD from the Empire.
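
As a rough illustration of the constraints and triggers mentioned in this comment, here is a sketch using Python's built-in `sqlite3` module. The `items`/`bids` schema and the `place_bid` helper are invented for this example; nothing here reflects eBay's actual Oracle schema, and SQLite's locking differs from Oracle's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    high_bid INTEGER NOT NULL DEFAULT 0 CHECK (high_bid >= 0)
);
CREATE TABLE bids (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES items(id),
    amount INTEGER NOT NULL CHECK (amount > 0)
);
-- Reject any bid that does not beat the current high bid.
CREATE TRIGGER bid_must_increase BEFORE INSERT ON bids
WHEN NEW.amount <= (SELECT high_bid FROM items WHERE id = NEW.item_id)
BEGIN
    SELECT RAISE(ABORT, 'bid too low');
END;
-- Keep the item's high bid in step with accepted bids.
CREATE TRIGGER bid_updates_item AFTER INSERT ON bids
BEGIN
    UPDATE items SET high_bid = NEW.amount WHERE id = NEW.item_id;
END;
""")


def place_bid(item_id: int, amount: int) -> bool:
    """Try to record a bid; return False if the database rejects it."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bids (item_id, amount) VALUES (?, ?)", (item_id, amount)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The point of the design is that the rules live in the database, so a buggy front end cannot write a losing bid or a bid on a nonexistent item no matter what it sends.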

--
--- A Jesus Fish eating a Darwin Fish only proves Darwin's point.
Classic obfuscation by edwards · 1999-06-11 05:42 · Score: 3

Let me translate Microsoft's position:

1. Sun Enterprise 10000 systems have single points of failure. You can't hot-swap CPU boards arbitrarily, and the Ultra-5 front-end is a critical component.

2. Sun recommends that for high availability you cluster between multiple 10000 systems. This is bad.

3. Microsoft's commodity hardware platforms do not offer any of the scalability or reliability features of the Enterprise 10000, so clustering is the only option. This is good.

4. Microsoft's current clustering offering is primitive. In a survey, a majority of people said it was adequate.

5. Microsoft promises that Windows 2000 will have better clustering than NT.

6. eBay is not following Sun's recommendations that high-availability requires multiple systems. They have experienced outages.

BTW, it is shocking to me that eBay could have only a single server. This is at best incredibly naive; at worst blatant incompetence. Therefore I suspect it is false.
1. Re:Classic obfuscation by Anonymous Coward · 1999-06-11 06:09 · Score: 4
  
  Well, yes. I just showed this to an NT admin and he didn't understand. If you don't mind, I would like to elaborate:
  
  1:The first point is wrong, AFAIK -- E10ks are fully triple redundant. The second point is that no one but a maniac would hot-swap components without correctly varying them off. That would be like sticking a fork into a running toaster to change elements and being surprised at a nasty result. As for the Ultra5 front end being critical: this is not true. You can manage them from any machine. You can attach a terminal or one of those awful JavaStations. So, two lies and a really bizarre attempt at deception (Buy my car -- that brand sucks because you can't change the brake pads while it is speeding down the Interstate!).
  
  2:If you don't understand this, you don't understand business computing, clustering, or applied EE/CS. This requires a lot of remedial work in security basics. I would suggest Computer Security Handbook by Arthur E. Hutt (Editor), Seymour Bosworth (Editor), Douglas B. Hoyt (Editor), Paperback, 3rd edition (September 1995), John Wiley & Sons; ISBN: 047111854. So, another really bizarre attempt at deception (Their car sucks because it needs tires.)
  
  3:see above (Their car has those big steel bumpers and huge brakes, leading to costly repairs over the life of the vehicle. Our car has neither brakes nor bumpers, so you should get a few of them and not worry about costly repair jobs.)(Why get a bus -- you can yoke 14 Yugos together -- see the user friendly and brightly-painted YugoYoke(TM)!)
  
  4:Yeah, and 3/4 of the American population strongly felt that they made up 75% of the population. Duh. I do not ask a plumber for stock tips. I ask a stockbroker. I do not ask my stockbroker to perform dentistry. I ask my dentist. Etc. ... You would have to be pretty stupid not to understand that the people who set the pace that the industry aspires to (uptime, ease of use, security, robustness -- the classic RAS mainframe stuff) don't feel that this is adequate, and THEY ARE THE PEOPLE WHOSE OPINIONS YOU SHOULD BE CONSIDERING. (My uncle, Crazy Eddie, says those cars suck and my cars rule. So you should buy my cars.)
  
  5:And at some point in the future, you may win the lottery. Are you doing business in the future or right frigging now? Hmmm? I can't hear you ... (And next year's model will actually have disk brakes all around, so you should buy this year's model now.)
  
  6:Well, a)I think that they have two E10ks (please correct me if I am wrong) so this is actually not true (again) and b)thank you Captain Obvious. And if you try to drill through your skull with a drill press contrary to all logic and common sense, you might miss some work. Yes, you should do things that matter with some care.
  
  Of course, this is just one woman's opinion ...
Always question competence first... by Sun Tzu · 1999-06-11 05:48 · Score: 3

When people can't design a reliable system with budgets that allow the purchase of Sun 10000's!

I run a Sun 10000 with two SSP's. 10000's are connected to their SSP's via private ethernets. I have three private networks; two to allow redundant interconnects between the SSP's and the two 10K control boards and a third for general use, NFS mounting CDROM's and the like. Most people will have no reason to put the SSP's on a public network at all -- I certainly don't. In order to hack the SSP, one must first hack the 10000. Once they've done that, the ability to reach the SSP's by network is irrelevant. The point about the "problem" of the SSP having control is as silly as claiming that EMC Symmetrix disk arrays (heavily used in IBM mainframe shops) can be crashed by the single laptop each array contains.

I would love to know the details of their failures -- I suspect the article is hinting at issues that have nothing to do with their real problems. Further, I'd bet that the main vulnerability that people cluster Sun's against is hardware failure -- and I'd also bet that the main reason people cluster NT boxes is software unreliability!

--
Geeky modern art T-shirts
Kinda funny... by ChrisRijk · 1999-06-11 05:54 · Score: 3

It's so absolutely laughable that Microsoft is trying to claim the high ground in availability/reliability over Sun.... This is from a company where, to get '99.9%' reliability, you need 4 computers - the other 3 purely for backup - and this figure has so many cop-outs in the contract (ie it doesn't count if it's planned - ie you're installing something and the OS decides you need to reboot, you need to change the hardware etc - or if it's a network problem, or a problem with any of the applications) that it is hardly even worth the paper it's printed on.

A single Starfire is rated as being able to deliver 99.95% availability with one - ie no clusters, and without all those caveats above - though it does need to be set up with reliability in mind for this - there's plenty of options.... Starfires aren't simple either - up to 64 CPUs, many more PCI and similar slots, memory slots, etc, etc. So, plenty of things to go wrong. Similar sized computers (from everyone) are really hard to transport without something going wrong. The only people more nutso on reliability on 'big iron' computers are IBM (from the companies I know a fair bit about anyway). Not only do they have backup CPUs, in their CPUs they do the same operation twice (in parallel, with checking at the end) to trap the ultra-freak chance of cosmic radiation or something causing a flipped bit, or worse. (yes, they do seriously actually worry about such things... I remember an IBM proposal about how to design memory that can handle a once-a-month chance, for when you have a huge amount of RAM, for some particular kind of radiation....)

The only complaint I've ever heard about Starfires in general is that if a PCI card (though not SBus card) breaks down it can hang the entire system until an operator manually flicks a switch to say that that particular card is defunct. Though this is really because of how PCI works - Sun's 99.999% reliable Netra 1800 has some highly specialised custom hardware to get around this problem with PCI cards, as well as backup CPUs and plenty of other stuff... ridiculously expensive too, they are... though apparently more cost effective than anything in the same class. The Netra 1800s are a few months old, while the Starfire design is over 2 years old, btw.

I dunno about all of MS's claims, but I'm pretty damn sure that you can have hardware redundancy for just about everything, if you want, including the Ultra-5 controller. Most of the other claims seem to be related to the fact that you can hot-swap PCI cards, memory, CPUs and even mother-boards in a Starfire...

EBay do seem to have had more than their fair share of problems though... quite a few hardware problems it seems - I vaguely remember a problem earlier in the year was due to some controller card or something. As far as I know, nobody has had anything close to the problems EBay are having with their Starfire(s)...

Another little point... MS's idea of expensive downtime is $10,000/hour. I remember reading something on Sun's site a while back about high end availability systems. Sun's idea of expensive downtime is $10,000,000/hour - ie stockbrokers. They also had a list of most common causes for 'unplanned' downtime on their HA systems - first was 'operator error' (or lack of training, etc). I can't remember what second was, but third was 'fire'! (I'm pretty sure Sun's computers don't have a reputation for spontaneous combustion!)
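
For scale, the availability percentages quoted in this comment can be turned into yearly downtime with simple arithmetic. A small sketch: the function names are illustrative, a 365-day year is assumed, and the per-hour cost is just one of the figures the poster recalls, not a verified number.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system is down at the given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def downtime_cost_per_year(availability_percent: float, cost_per_hour: float) -> float:
    """Yearly cost of that downtime at a flat cost per hour."""
    return downtime_hours_per_year(availability_percent) * cost_per_hour


if __name__ == "__main__":
    for pct in (99.9, 99.95, 99.999):
        hours = downtime_hours_per_year(pct)
        cost = downtime_cost_per_year(pct, 10_000)
        print(f"{pct}% -> {hours:.2f} h/year, ${cost:,.0f} at $10,000/h")
```

So 99.9% allows a bit under nine hours of downtime a year, 99.95% about half that, and 99.999% roughly five minutes, before any of the contractual exclusions the comment complains about.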
FUD SHARKS to the left! FUD SHARKS to the right! by dublin · 1999-06-11 06:36 · Score: 5

Sometimes you have to wonder how things ever get approved to be on their website. Let's look at a few of the more inflammatory claims, which is really quite a kettle o' fish:

RED HERRING: Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center.

So what? The E10000 will continue to truck on as before without it. This is a complete red herring. The SSP is really just a console station, nothing more. If it dies, you reboot it, or in the worst case, replace it with another one from the closet, which with Sun's AutoClient technology, can take on the entire identity of the failed box in a couple of minutes. (AutoClient allows Wall St. traders to replace their workstations and be working again with NO IMPACT in 5 minutes. Let's see NT do that.)

FUD SHARK: When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.

No one in their right mind would put the SSP on a network that extends beyond the glass house!! It's a *console*, designed to be locked up securely, like all other mission-critical control consoles. MS still doesn't get the data center, do they?

DUH WHALE: System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.

This is incredible. They're knocking the E10K because you can't walk up to it and pull a CPU card at random without telling the machine first that you plan to do this. These cards contain memory, too, folks, which is why it's pretty reasonable to let the system move things to a safe place before the card goes bye-bye. Pretty much only Tandems can accept this sort of thing (because they've got at least two of everything all the time, and they cost like they have even more), and if you're after real fault tolerance, you won't be running NT on them, even though you could...

STING RAID: If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available.

This is pretty low. Yeah, it can happen - what else is an OS supposed to do when it has more processes than now remains as memory? Although a warning would be nice, E10K admins aren't stupid (we hope), and they understand that there are easy workarounds to this - the E10K makes it very easy to move enough resources into the OS domain in question on a temporary basis. If you don't have enough hardware to do that, you misconfigured the machine in the first place. This is hardly a weakness.
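
The failure mode in that claim comes down to a capacity check: once the board is gone, whatever was in use must still fit in the remaining memory plus free swap. The toy model below is only a sketch of that arithmetic, showing the kind of warning the quoted claim says is missing. The `Board` and `Domain` classes are hypothetical and do not correspond to any real Solaris or SSP tool or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Board:
    name: str
    memory_mb: int


@dataclass
class Domain:
    boards: List[Board] = field(default_factory=list)
    used_mb: int = 0        # RAM currently in use across the domain
    free_swap_mb: int = 0   # swap available to absorb evicted pages

    def _remaining_mb(self, board_name: str) -> int:
        """RAM left in the domain if the named board were detached."""
        return sum(b.memory_mb for b in self.boards if b.name != board_name)

    def can_remove(self, board_name: str) -> bool:
        """True if in-use memory still fits in remaining RAM plus free swap."""
        return self.used_mb <= self._remaining_mb(board_name) + self.free_swap_mb

    def remove_board(self, board_name: str) -> None:
        """Detach a board, refusing up front if the domain would run out of room."""
        if not any(b.name == board_name for b in self.boards):
            raise KeyError(board_name)
        if not self.can_remove(board_name):
            raise RuntimeError(f"not enough memory or swap to remove {board_name}")
        remaining = self._remaining_mb(board_name)
        overflow = max(0, self.used_mb - remaining)
        self.free_swap_mb -= overflow  # pages that no longer fit in RAM go to swap
        self.used_mb -= overflow
        self.boards = [b for b in self.boards if b.name != board_name]
```

In this model, the workaround the comment describes (temporarily moving more resources into the domain) amounts to adding a `Board` or raising `free_swap_mb` until `can_remove` returns True.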

On the whole, the incredible thing about this is that MS is throwing rocks at a really good system with availability features far in excess of that for any practical NT box. You've gotta admire their guts, though - some people will read this and think the E10K is a really expensive, dangerous computer. Funny how they neglect to mention that there's not an NT box on the planet that can provide the performance of an E10K, regardless of how much you spend. This may change eventually, but it's pretty cheeky now.

If you need real fault-tolerance, get a Tandem/Compaq - but after you've paid all that money, I bet the Compaq folks would be the first to advise against using NT on it if you really want fault tolerance.

--
"The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
E-10000's are error prone, in practise by Anonymous Coward · 1999-06-11 07:40 · Score: 4

I don't know about EBay, but I know that E-10000's are extremely tricky to configure correctly.

Sun markets them as ultra-reliable and hardware-level redundant, but the truth is that configuring them is so complicated that even a team of experienced sysadmins is bound to screw something up sooner or later. If you bet the store on a single E-10000, then sooner or later the machine will crash hard and your store is hosed.

Given their size, expense and complexity, they are not appropriate for use as the main server in an internet commerce company. Sun should not sell these machines to companies like EBay.

In defense of Sun, I should point out that their "smaller" systems, namely the E-4000, E-6000, E-3000 etc., are rock solid and just as easy to configure as a small server. But no one would dream of running a whole store on a single one of them -- for reliability, you need to run several of them redundantly.

And Windows NT is far less reliable than any Sun machine. NT is the opposite of reliability. Production Solaris machines routinely stay up and running for months or years at a time. Show me an NT server which can do that.