The root of all eBay's troubles
UncleRoger writes "A friend pointed me to this article which would appear to explain why eBay has had such troubles with downtime, including the outage since Wednesday evening." It would appear that MS is tired of having the finger pointed at them - as they point out, it's an Oracle database that's running on Solaris that's causing the troubles.
"The percentage of users running Windows NT Workstation 4.0 whose PCs stopped working more than once a month was less than half that of Windows 95 users."
More here
Yeah, yeah, it's workstation and not server, totally different operating systems. Not.
Just a couple of choice pieces of FUD from the M$ web page.
--------quote on----------
3.When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
--------quote off----------
So, let me get this straight. The workstation which is responsible for "controlling" operations can be used to stop operations? What was Sun thinking, including a command that would turn things off! Of course, we all know that one of the main features of NT Server over NT Workstation is that the "Shutdown" command has been removed from the "Start" menu.
--------quote on------------
4.System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
--------quote off------------
This is supposed to be a problem? Now, I think it's pretty neat that you can migrate kernel memory off of a certain piece of hardware and swap it out at all. We're supposed to believe that under NT you can do this at all? Much less without telling the kernel first? The only conceivable way to allow this to happen without telling the kernel to clear the board out first would be to make sure that all the kernel memory has had a copy paged out to disk. Or perhaps keep multiple copies of all kernel data structures (and hope they don't get out of sync.) Maybe NT does do the last one. That would certainly explain why it's a memory hog.
Pretty amazing if you ask me. This web page is clearly meant for the PHBs of the world, as anybody with any knowledge at all of how computers work is simply going to laugh at this.
Is it that hard to see? They changed their page layout and tweaked the software, and now they are getting data corruption. It doesn't have anything to do with Solaris or Oracle. If you design a database that can't correctly handle concurrency, doesn't have good constraints, no triggers, etc. ad nauseam, then you are going to get data corruption. Also, the whole 'high availability' argument is laughable. eBay's buggy software still has high availability. ;) Just because their software crashes doesn't mean Oracle and Solaris are crashing.
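To make the point about constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not eBay's schema or Oracle's syntax; the items and bids tables and the place_bid helper are made up for illustration. The idea is only that the database itself rejects a bad row, so buggy application code cannot corrupt the data.

```python
import sqlite3

# Hypothetical two-table schema, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE bids (
           id INTEGER PRIMARY KEY,
           item_id INTEGER NOT NULL REFERENCES items(id),
           amount INTEGER NOT NULL CHECK (amount > 0)
       )"""
)
conn.execute("INSERT INTO items (id, title) VALUES (1, 'lamp')")
conn.commit()


def place_bid(conn, item_id, amount):
    """Insert a bid; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO bids (item_id, amount) VALUES (?, ?)",
                (item_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_bid(conn, 1, 500)         # valid item, positive amount
bad_item = place_bid(conn, 99, 500)  # no such item: foreign key fails
bad_amount = place_bid(conn, 1, -5)  # negative amount: CHECK fails
bid_count = conn.execute("SELECT COUNT(*) FROM bids").fetchone()[0]
```

Only the first bid lands in the table; the other two are refused by the database no matter what the calling code intended.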
It's just more FUD from the Empire.
--- A Jesus Fish eating a Darwin Fish only proves Darwin's point.
Well, yes. I just showed this to an NT admin and he didn't understand. If you don't mind, I would like to elaborate:
... if you are so stupid that you do not understand that the people who set the pace that the industry aspires to (uptime, ease of use, security, robustness -- the classic RAS mainframe stuff) don't feel that this is adequate and THEY ARE THE PEOPLE WHOSE OPINIONS YOU SHOULD BE CONSIDERING. (My uncle, Crazy Eddie, says those cars suck and my cars rule. So you should buy my cars.)
... (And next year's model will actually have disk brakes all around, so you should buy this year's model now.)
...
1:The first point is wrong, AFAIK -- E10ks are fully triple redundant. The second point is that no one but a maniac would hot-swap components without correctly varying them off. That would be like sticking a fork into a running toaster to change elements and being surprised at a nasty result. The Ultra5 front end is not critical. This is not true. You can manage them from any machine. You can attach a terminal or one of those awful JavaStations. So, two lies and a really bizarre attempt at deception (Buy my car -- that brand sucks because you can't change the brake pads while it is speeding down the Interstate!).
2:If you don't understand this, you don't understand business computing, clustering, or applied EE/CS. This requires a lot of remedial work in security basics. I would suggest Computer Security Handbook by Arthur E. Hutt (Editor), Seymour Bosworth (Editor), Douglas B. Hoyt (Editor), Paperback, 3rd edition (September 1995), John Wiley & Sons; ISBN: 047111854. So, another really bizarre attempt at deception (Their car sucks because it needs tires.)
3:see above (Their car has those big steel bumpers and huge brakes, leading to costly repairs over the life of the vehicle. Our car has neither brakes nor bumpers, so you should get a few of them and not worry about costly repair jobs.)(Why get a bus -- you can yoke 14 Yugos together -- see the user friendly and brightly-painted YugoYoke(TM)!)
4:Yeah, and 3/4 of the American population strongly felt that they made up 75% of the population. Duh. I do not ask a plumber for stock tips. I ask a stockbroker. I do not ask my stockbroker to perform dentistry. I ask my dentist. Etc.
5:And at some point in the future, you may win the lottery. Are you doing business in the future or right frigging now? Hmmm? I can't hear you
6:Well, a) I think that they have two E10ks (please correct me if I am wrong) so this is actually not true (again) and b) thank you Captain Obvious. And if you try to drill through your skull with a drill press contrary to all logic and common sense, you might miss some work. Yes, you should do things that matter with some care.
Of course, this is just one woman's opinion
Sometimes you have to wonder how things ever get approved to be on their website. Let's look at a few of the more inflammatory claims, which are really quite a kettle o' fish:
RED HERRING: Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center.
So what? The E10000 will continue to truck on as before without it. This is a complete red herring. The SSP is really just a console station, nothing more. If it dies, you reboot it, or in the worst case, replace it with another one from the closet, which with Sun's AutoClient technology, can take on the entire identity of the failed box in a couple of minutes. (AutoClient allows Wall St. traders to replace their workstations and be working again with NO IMPACT in 5 minutes. Let's see NT do that.)
FUD SHARK: When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
No one in their right mind would put the SSP on a network that extends beyond the glass house!! It's a *console*, designed to be locked up securely, like all other mission-critical control consoles. MS still doesn't get the data center, do they?
DUH WHALE: System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
This is incredible. They're knocking the E10K because you can't walk up to it and pull a CPU card at random without telling the machine first that you plan to do this. These cards contain memory, too, folks, which is why it's pretty reasonable to let the system move things to a safe place before the card goes bye-bye. Pretty much only Tandems can accept this sort of thing (because they've got at least two of everything all the time, and they cost like they have even more), and if you're after real fault tolerance, you won't be running NT on them, even though you could...
STING RAID: If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available.
This is pretty low. Yeah, it can happen - what else is an OS supposed to do when it has more processes than the remaining memory can hold? Although a warning would be nice, E10K admins aren't stupid (we hope), and they understand that there are easy workarounds to this - the E10K makes it very easy to move enough resources into the OS domain in question on a temporary basis. If you don't have enough hardware to do that, you misconfigured the machine in the first place. This is hardly a weakness.
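For what it's worth, the warning the tools are missing comes down to simple arithmetic. The sketch below is not Sun's tooling or any real Solaris interface; the function name and the megabyte figures are invented for illustration. It just asks whether the memory in use would still fit in what remains after the board is pulled, plus free swap.

```python
def can_remove_board(board_mem_mb, in_use_mb, total_mem_mb, free_swap_mb):
    """Return True if the domain can absorb the loss of one board.

    All arguments are hypothetical inputs in megabytes; a real tool
    would have to read them from the running system.
    """
    remaining_mem_mb = total_mem_mb - board_mem_mb
    # Whatever no longer fits in physical memory has to go to swap.
    overflow_mb = max(0, in_use_mb - remaining_mem_mb)
    return overflow_mb <= free_swap_mb


# 16 GB domain, pulling a 4 GB board, 512 MB of free swap.
fits = can_remove_board(4096, 12000, 16384, 512)
too_tight = can_remove_board(4096, 14000, 16384, 512)
```

In the second case about 1.7 GB would have to spill into 512 MB of swap, which is the situation where you first move resources into the domain or add swap.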
On the whole, the incredible thing about this is that MS is throwing rocks at a really good system with availability features far in excess of that for any practical NT box. You've gotta admire their guts, though - some people will read this and think the E10K is a really expensive, dangerous computer. Funny how they neglect to mention that there's not an NT box on the planet that can provide the performance of an E10K, regardless of how much you spend. This may change eventually, but it's pretty cheeky now.
If you need real fault-tolerance, get a Tandem/Compaq - but after you've paid all that money, I bet the Compaq folks would be the first to advise against using NT on it if you really want fault tolerance.
"The future's good and the present is nothing to sneeze at." - Roblimo's last
I don't know about eBay, but I know that E-10000's are extremely tricky to configure correctly.
Sun markets them as ultra-reliable and hardware-level redundant, but the truth is that configuring them is so complicated that even a team of experienced sysadmins is bound to screw something up sooner or later. If you bet the store on a single E-10000, then sooner or later the machine will crash hard and your store is hosed.
Given their size, expense and complexity, they are not appropriate for use as the main server in an internet commerce company. Sun should not sell these machines to companies like eBay.
In defense of Sun, I should point out that their "smaller" systems, namely the E-4000, E-6000, E-3000 etc., are rock solid and just as easy to configure as a small server. But no one would dream of running a whole store on a single one of them -- for reliability, you need to run several of them redundantly.
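A rough way to see why several redundant machines beat one: if each box is up a fraction a of the time and failures are independent (a big assumption, since shared power, network or bad software can take them all out together), then at least one of n boxes is up with probability 1 - (1 - a)^n. The 0.99 figure below is made up for illustration, not a measured number for any Sun machine.

```python
def combined_availability(single, n):
    """Chance that at least one of n independent machines is up."""
    return 1 - (1 - single) ** n


one_box = combined_availability(0.99, 1)
two_boxes = combined_availability(0.99, 2)
three_boxes = combined_availability(0.99, 3)
```

Each extra machine multiplies the expected downtime by the single-box failure fraction, as long as the independence assumption holds.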
And Windows NT is far less reliable than any Sun machine. NT is the opposite of reliability. Production Solaris machines routinely stay up and running for months or years at a time. Show me an NT server which can do that.