The London Stock Exchange Goes Down For Whole Day
Colin Smith writes "TradElect, the Microsoft .Net based trading platform for the London Stock Exchange, was offline for about seven hours, meaning that their 5-nines SLAs are shot for approximately the next 100 years. The TradElect system was launched back in June of 2007 and was designed for increased speed and system capacity."
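For a rough sense of where the "next 100 years" figure comes from, here is a back-of-the-envelope sketch. It assumes the SLA is measured against a 24x7 calendar year (the summary doesn't say how the LSE actually measures it), and the function names are made up for illustration.

```python
# Assumption: availability is measured over a full calendar year, 24x7.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_hours: float, availability: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_hours * 60 / downtime_budget_minutes(availability)


budget = downtime_budget_minutes(0.99999)      # five nines
years = years_of_budget_consumed(7, 0.99999)   # the seven-hour outage
print(f"{budget:.2f} min/year allowed, outage uses {years:.0f} years of budget")
```

Under that assumption five nines allows a little over five minutes a year, so a seven-hour outage eats roughly 80 years of budget. The summary's "approximately 100 years" is the right order of magnitude.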
Since when is 7 hours even close to "a whole day"? Maybe you meant "almost a whole business day"?
It's a whole trading day--and that's all that really matters when it comes to a major market.
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Agreed. It's a bit of flamebait to mention .NET in the summary when the exchange is being tight-lipped about what the root cause is (if they even know at this point). I do a lot of .NET stuff and, as with other platforms, e.g. Java, there are many things that could cause problems, like plain old programming bugs.
Perhaps the bit you're missing is that Windows isn't quite as bad as the /. crowd likes to say it is. Especially if it's an older (translation: fixed & stable) variety like win2k or even nt4.
I'm not sure if you're serious or not, but surely you aren't trying to compare NT4 uptime with the 5 9s of a solid System z platform?
Oh please. Persuasive marketers can get Windows installed just about anywhere, including US warships.
While it is commonly accepted by many techies (and strongly denied by others) that Microsoft Windows is not a suitable platform for that level of computing, sales people often bypass the techies who know better and sell to managers and executives who still believe "you can't get fired for using Microsoft."
With all this said, it will be quite some time, if ever, before we know for certain what the root cause of the failure was. You can be sure that Microsoft is all over this problem both technically and P.R.-wise. They won't let the facts get out if they are damaging. Recall the major power outage that many still believe was caused by a worm attacking Microsoft servers? As far as I can see, the true cause of that failure has yet to be revealed.
But if this was a planned event, or an unplanned disaster resulting from a planned event gone bad (updates, upgrade, other maintenance), you would think they would have provided for mishaps in some way or another.
But as this news story is all I have to go on, there is no indication of cause and so I will not presume this is a Microsoft problem. But it says a lot that NYSE runs on Linux and not Microsoft. It seems SOMEONE did listen to the techies.
That they were using MS Windows for this type of environment is stunning... Transactional processing, which is the bulk of this type of setup, is where Solaris and Linux excel. Any company that builds a system like that on .Net should be thrown out on the street.
In short: not to knock Windows, but different platforms always offer different strengths.
Wait! Are you suggesting that downtime can be caused by application problems, network problems, hardware problems, dumbass systems administrators and a whole slew of other things completely unrelated to the platform on which it is running?
I am *shocked*! *Shocked* I tell you!
Followed by the youngest member of the team becoming the scapegoat and being fired.
-Ours is the wisdom of Solomon, the magic of Merlyn, the fall of Icarus.
As is normally the case, M$ threw lots of money at the exchange to get it to switch from a Unix/Linux base to Windows and .NET so that M$ can tout that a major exchange is running Windows.
Full-page ads touted the switch, and the reasons they cited were better throughput and better uptime.
They even had ads touting it here on /.
Let me explain computers to you. See, the developer uses a set of platforms, languages, integration components, etc. to deliver his functionality to the end user. A failure at any level can cause the application to fail. It could be application logic, network issues, hardware issues, integration with third party systems, a dipshit systems administrator, etc...
And yet the 90-105 IQ SlashDweeb set comes out in numbers with no data and says "lolz Windoze! .NET haha!". Crikey.
No... Actually I deal with this every day. Windows is great for places where you need desktop apps or such. It also does well when you must have generic developers for web development.
Where Unix/Linux/BSD truly shines is in back-office transactional processing. There are many reasons for this, and these systems have a long history of doing exactly this. That is, mainframes may never have been considered sexy, but they ran critical systems in companies for decades with very few problems... Actually they built such a reputation that when they failed most instantly assumed it was a hardware failure... Working on them, however, takes a more polished developer...
These "better languages" are easier to use, which allows less experienced coders to perform the tasks. This is not an ideal world we live in.
No different than what can happen on a unix box I suppose.
Note that the current system is built around a large cluster of 2.2GHz servers, while the unix-based system it replaced (which coped perfectly happily with a substantial portion of the same traffic) ran from a smaller cluster of much slower servers.
The primary purpose for the new system, introduced less than a year ago, was to expand capacity. For it to have failed within a year due to lack of capacity basically means that it has failed in that objective.
In other words, he used the "no one ever got fired for buying IBM" defense.
Oh, ye of lesser cynicism. I also, long ago, used to believe that language features could improve software reliability. Nowadays the idea just makes me cackle -- in actuality the universe just invents better idiots.
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
No, actually the Windows system (10 ms per transaction) was a 13x speedup over the older system (135 ms per transaction), followed quickly by an additional 50% speedup (6 ms per transaction). The Windows system was just recently updated to double performance again (3 ms per transaction), so it's now 45 times as fast as the unix-based system it replaced.
You may be able to fault it on reliability (though the old system wasn't perfect either), but you can't fault it on performance.
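The ratios quoted here are easy to check. A quick sketch using only the per-transaction latencies given in this post (the variable and function names are just for illustration):

```python
# Per-transaction latencies in milliseconds, as quoted in the post.
OLD_UNIX_MS = 135.0
LAUNCH_MS = 10.0
FIRST_UPDATE_MS = 6.0
LATEST_MS = 3.0


def speedup(before_ms: float, after_ms: float) -> float:
    """Speedup factor when latency drops from before_ms to after_ms."""
    return before_ms / after_ms


print(f"old -> launch:       {speedup(OLD_UNIX_MS, LAUNCH_MS):.2f}x")        # 13.50x
print(f"launch -> update:    {speedup(LAUNCH_MS, FIRST_UPDATE_MS):.2f}x")    # 1.67x
print(f"update -> latest:    {speedup(FIRST_UPDATE_MS, LATEST_MS):.2f}x")    # 2.00x
print(f"old -> latest:       {speedup(OLD_UNIX_MS, LATEST_MS):.2f}x")        # 45.00x
```

The 13x and 45x figures hold up. Going from 10 ms to 6 ms is a 40% cut in latency (about a 1.67x speedup), so the "50%" reads as a loose figure rather than an exact one.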
Socialism: a lie told by totalitarians and believed by fools.
I couldn't disagree more. Although automatic garbage collection is nice, this doesn't mean that you'll get "five nines uptime" systems by working with "less experienced" coders.
If you're building a system that must guarantee 99.999% uptime, you wait until your best professionals become available, because it doesn't only involve code. You DON'T give the job to the less experienced ones, no matter how great the programming language. Five nines uptime requires a very robust design and very solid code quality running on a very solid platform which is running on a very solid OS on a very solid infrastructure. You'll want everything to be tested by unit tests, integration tests, regression tests, and whatnot. That involves a whole lot more than 'just' coders, but whoever works on it, they'd better be good at it.
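To put a number on why every layer has to be solid: if the layers fail independently and the system needs all of them up, the availabilities multiply. A minimal sketch, where the layer names follow the paragraph above and the 99.99% figures are hypothetical, not anything measured:

```python
from math import prod

# Hypothetical per-layer availabilities; assumes independent failures
# and that the system is down whenever any one layer is down.
layers = {
    "code": 0.9999,
    "platform": 0.9999,
    "os": 0.9999,
    "infrastructure": 0.9999,
}


def serial_availability(availabilities) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


total = serial_availability(layers.values())
print(f"stack availability: {total:.6f}")
```

Four layers at four nines each come out at about 99.96% overall, worse than any single layer, which is why five nines for the whole stack means each piece has to be considerably better than five nines.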
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
"Why did the upgrade fail?" is, I guess, what an intelligent person would ask. You haven't asked that. You've hilariously assumed it's .NET or Microsoft's fault.
As a matter of like for like, I'm going to assume it was because some Linux dweeb walked in and tripped over a network cable. Ergo, I now claim Linux dweebs are clumsy oafs who should be banned from computer rooms.
It's about the same thing when people say that "XP does not crash, it's faulty device drivers that crash".
If a system should be reliable, then it should be reliable, no excuses accepted. It does not matter if it's system bugs, application bugs, hardware failures or power outages, a system that pretends to achieve 99.999% availability should take all that into account.
The operating system is not at fault if the power goes down, of course, it's a sloppy engineer that designs a system without redundant power supply. But, likewise, a sloppy engineer will prefer a system that lets him configure and operate it by click-and-drag, instead of a carefully designed and tested set of procedures.
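The redundant power supply point can be made concrete. A small sketch, assuming the supplies fail independently (a shared circuit or a shared battery vendor would break that assumption) and using a made-up 99.9% figure for a single supply:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies where any one copy is enough.

    Assumes failures are independent, which real installations often violate.
    """
    return 1.0 - (1.0 - component_availability) ** copies


single = 0.999  # hypothetical availability of one power supply
for n in (1, 2, 3):
    print(n, f"{parallel_availability(single, n):.9f}")
```

Under those assumptions a second supply takes a three-nines component to roughly six nines, which is the cheap kind of design decision a five-nines system can't skip.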
A critical system should NEVER depend on an operating system that does not have a proper batch language. That should be a compact and powerful script language, using TEXT files for configuration that can be hand edited if needed, that can be stored and archived in a version control system, so that bugs can be tracked.
Right from your article "and be cheaper to manage"
sounds like the LSE fired expensive, knowledgeable admins and went for 'cheaper' ones; there is your problem right there. windows server isn't perfect, but clearly they had good hardware and were running mission-critical apps, yet went with cheaper, less experienced admins.
also, your fine article specified there were 'no production outages'. they don't claim the system ran 24/7/365 with no reboots or glitches, only that there were no production outages for six years. there is quite a bit of difference. the claim says the admins and hardware delivered the specific services needed, when they were needed, for 6 years; it says nothing about the amount of redundant hardware, etc. required to accomplish that.
so given everything i've read here, an underexperienced windows admin approved an undertested system upgrade that epic failed and took down the production server for the first time in 6 years. no shock here: they wanted to cut corners on admin costs, and they brought the epic fail on themselves.
https://www.gnu.org/philosophy/free-sw.html
I mean, that might be what they worked on, but it's kinda pointless; what's interesting is the # of transactions per second, and that can usually be improved at the expense of individual latency. For example, databases can be configured to wait a few milliseconds to group transactions, so as to write several to disk in one single write/sync.
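Here's a toy model of that trade-off, in case it helps. It is not how any particular database implements group commit; the 5 ms sync cost and the arrival rate are invented numbers, and the function name is hypothetical.

```python
def group_commit_model(sync_ms: float, wait_ms: float, arrivals_per_ms: float):
    """Toy model: wait wait_ms to gather transactions, then do one sync.

    Returns (transactions_per_second, worst_case_latency_ms).
    Assumes a steady arrival rate and a fixed cost per disk sync.
    """
    # Transactions gathered during the wait window (at least the one that
    # triggered the commit).
    batch = max(1, int(wait_ms * arrivals_per_ms))
    cycle_ms = wait_ms + sync_ms
    throughput_tps = 1000 * batch / cycle_ms
    # The first transaction in a batch sits through the whole wait plus the sync.
    worst_latency_ms = wait_ms + sync_ms
    return throughput_tps, worst_latency_ms


no_wait = group_commit_model(sync_ms=5.0, wait_ms=0.0, arrivals_per_ms=2.0)
grouped = group_commit_model(sync_ms=5.0, wait_ms=5.0, arrivals_per_ms=2.0)
print("no grouping:", no_wait)
print("5 ms window:", grouped)
```

With these made-up numbers, waiting 5 ms doubles the worst-case latency (5 ms to 10 ms) but raises throughput fivefold (200 to 1000 transactions per second), so a per-transaction latency figure on its own doesn't tell you much about capacity.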
This same kind of thing (replace *nix with Windows) is what took out the LAX comm system a few years ago and left dozens and dozens of airplanes in the air and on the ground at/over LAX without communications.
What blows me away is that for years, UNIX systems were one of the de facto standards for mission critical OSs. Along comes a marketing company, Microsoft, and people are saying it is capable of mission critical use even when there are constant disruptions from virus attacks, Ctl-Alt-Del and BSoD are well-known features, and there are any of a hundred other reasons it is NOT ready for mission critical systems.
What kinds of morons are running the show anyways? And it is about time people start getting fired for this junk. From my experience on operating systems, UNIX was the one OS where when you wrote code, you dealt with the business logic/code and not OS issues. Only once in a blue moon did an OS patch or structure tweak get in the way of coding the application(s). OS/2 was pretty good but not as good as UNIX and Windows was the worst. Gawd, I still hear people complaining about that little Windows Mobile OS crashing. They can't even get a small chunk of code working properly let alone the behemoth that is the Windows desktop and server OS.
LoB
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus
IIRC, Brazil Bovespa had a small glitch last month or two.
Back in the day when Wall Street and financial markets ran on Solaris systems (AFAIK), this shit wasn't common.
Now it's probably going to become *acceptable* for stock exchanges and aviation reservation software to crash.
Apparently, there's a new generation of a-holes in the system administration market who grew up with Windows and the Blue Screen of Death and think it's acceptable for operating systems to crash once in a while. Is it evolution?
Main difference between the BSD license and the GPL license: one is from California and the other is from Massachusetts
Oh, yes.. battery backed write cache. With batteries produced by the lowest bidder. The warranty is for 3 years, and the battery lasts just that long before silently failing. When the power goes, well you really didn't need that data written to disk on your database server, did you?
We now do not allow any server to be put into production with any kind of write cache on it. Ever.
"Be grateful for what you have. You may never know when you may lose it."
It's called framing and it is making public debate in western society increasingly difficult.
May the Maths Be with you!
I have a feeling that the 'normal' IT situation was to blame for this.
Preamble: Technical Expertise provided a wonderful architecture that was HA and robust, fast, and scalable.
Bean Counters looked at the cost and said "You Tech guys spend too much money."
IT architects: "How much is your data worth?"
Bean Counters: "Not this much. Look we don't really need all of these systems. My home system has been working for 4 years with no problems. And I've talked with Microsoft Execs and they will cut us a deal for their platform. Now go away, I've just decided how the architecture will be done. Why did we hire you anyways?"
There are no loopholes. It's either legal or it's not.