The root of all eBay's troubles
UncleRoger writes "A friend pointed me to this article, which would appear to explain why eBay has had such troubles with downtime, including the outage since Wednesday evening." It would appear that MS is tired of having the finger pointed at them - as they point out, it's an Oracle database that's running on Solaris that's causing the troubles.
Just a couple of choice pieces of FUD from the M$ web page.
--------quote on----------
3.When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
--------quote off----------
So, let me get this straight. The workstation which is responsible for "controlling" operations can be used to stop operations? What was Sun thinking, including a command that would turn things off! Of course, we all know that one of the main features of NT Server over NT Workstation is that the "Shutdown" command has been removed from the "Start" menu.
--------quote on------------
4.System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
--------quote off------------
This is supposed to be a problem? Now, I think it's pretty neat that you can migrate kernel memory off of a certain piece of hardware and swap it out at all. We're supposed to believe that under NT you can do this at all? Much less without telling the kernel first? The only conceivable way to allow this to happen without telling the kernel to clear the board out first would be to make sure that all the kernel memory has had a copy paged out to disk. Or perhaps keep multiple copies of all kernel data structures (and hope they don't get out of sync.) Maybe NT does do the last one. That would certainly explain why it's a memory hog.
Pretty amazing if you ask me. This web page is clearly meant for the PHBs of the world, as anybody with any knowledge at all of how computers work is simply going to laugh at this.
Sometimes you have to wonder how things ever get approved to be on their website. Let's look at a few of the more inflammatory claims, which are really quite a kettle o' fish:
RED HERRING: Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center.
So what? The E10000 will continue to truck on as before without it. This is a complete red herring. The SSP is really just a console station, nothing more. If it dies, you reboot it, or in the worst case, replace it with another one from the closet, which with Sun's AutoClient technology, can take on the entire identity of the failed box in a couple of minutes. (AutoClient allows Wall St. traders to replace their workstations and be working again with NO IMPACT in 5 minutes. Let's see NT do that.)
FUD SHARK: When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
No one in their right mind would put the SSP on a network that extends beyond the glass house!! It's a *console*, designed to be locked up securely, like all other mission-critical control consoles. MS still doesn't get the data center, do they?
DUH WHALE: System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
This is incredible. They're knocking the E10K because you can't walk up to it and pull a CPU card at random without telling the machine first that you plan to do this. These cards contain memory, too, folks, which is why it's pretty reasonable to let the system move things to a safe place before the card goes bye-bye. Pretty much only Tandems can accept this sort of thing (because they've got at least two of everything all the time, and they cost like they have even more), and if you're after real fault tolerance, you won't be running NT on them, even though you could...
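For anyone who wants the "move things to a safe place first" step spelled out, here's a toy model in Python. To be clear, this is a sketch of the idea only, not how Solaris or the E10K actually implements it: the `Board` class, the `detach` function, and the page names are all made up for illustration. Pageable pages on the departing board can simply go out to disk; non-pageable kernel pages have to be copied to another board, and that copy is the part that needs the brief suspend.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    """Hypothetical system board: a fixed number of page slots."""
    name: str
    capacity: int
    pinned: list = field(default_factory=list)    # non-pageable kernel pages
    pageable: list = field(default_factory=list)  # ordinary pages

    def free(self):
        return self.capacity - len(self.pinned) - len(self.pageable)


def detach(boards, victim_name):
    """Remove one board; return (remaining_boards, paged_out, quiesced)."""
    victim = next(b for b in boards if b.name == victim_name)
    others = [b for b in boards if b.name != victim_name]

    # Refuse up front if the pinned pages have nowhere to go.
    if sum(b.free() for b in others) < len(victim.pinned):
        raise RuntimeError("no room on other boards for pinned kernel pages")

    # Ordinary pages can be written out to swap without stopping anything.
    paged_out = list(victim.pageable)
    victim.pageable.clear()

    # Pinned pages cannot be paged out, so the system pauses ("quiesces")
    # while they are copied to another board.
    quiesced = bool(victim.pinned)
    for page in list(victim.pinned):
        target = next(b for b in others if b.free() > 0)
        target.pinned.append(page)
    victim.pinned.clear()

    return others, paged_out, quiesced


boards = [
    Board("sb0", 4, pinned=["k1", "k2"], pageable=["u1"]),
    Board("sb1", 4, pageable=["u2"]),
]
remaining, paged_out, quiesced = detach(boards, "sb0")
print(quiesced, paged_out, remaining[0].pinned)
```

In this model a board holding no pinned pages detaches with no pause at all, which is the whole point: the suspend is only needed where kernel data actually lives.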
STING RAID: If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available.
This is pretty low. Yeah, it can happen - what else is an OS supposed to do when its processes need more memory than now remains? Although a warning would be nice, E10K admins aren't stupid (we hope), and they understand that there are easy workarounds to this - the E10K makes it very easy to move enough resources into the OS domain in question on a temporary basis. If you don't have enough hardware to do that, you misconfigured the machine in the first place. This is hardly a weakness.
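The swap complaint boils down to arithmetic an admin can do on the back of an envelope before pulling a board. A rough sketch in Python, assuming a simplified model where everything in use must fit in the RAM that's left plus free swap (the function name and the numbers are hypothetical, and real Solaris memory accounting is more involved than this):

```python
def swap_shortfall_mb(in_use_mb, total_ram_mb, board_ram_mb, free_swap_mb):
    """How many MB short the domain would be after losing one board.

    Returns 0 when memory in use fits in the remaining RAM plus free swap.
    """
    remaining_ram = total_ram_mb - board_ram_mb
    # Whatever no longer fits in RAM has to spill over to swap.
    overflow = max(0, in_use_mb - remaining_ram)
    return max(0, overflow - free_swap_mb)


# Domain with 8 GB RAM, about to lose a 4 GB board, 6000 MB in use,
# and only 1 GB of free swap.
print(swap_shortfall_mb(6000, 8192, 4096, 1024))
```

A non-zero result is the case where you either add swap or borrow resources from another domain before detaching the board.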
On the whole, the incredible thing about this is that MS is throwing rocks at a really good system with availability features far in excess of those of any practical NT box. You've gotta admire their guts, though - some people will read this and think the E10K is a really expensive, dangerous computer. Funny how they neglect to mention that there's not an NT box on the planet that can provide the performance of an E10K, regardless of how much you spend. This may change eventually, but it's pretty cheeky now.
If you need real fault-tolerance, get a Tandem/Compaq - but after you've paid all that money, I bet the Compaq folks would be the first to advise against using NT on it if you really want fault tolerance.
"The future's good and the present is nothing to sneeze at." - Roblimo's last