Failed Software Upgrade Halts Transit Service
linuxwrangler writes "San Francisco Bay Area commuters awoke this morning to the news that BART, the major regional transit system which carries hundreds of thousands of daily riders, was entirely shut down due to a computer failure. Commuters stood stranded at stations and traffic backed up as residents took to the roads. The system has returned to service and BART says the outage resulted from a botched software upgrade."
They should have brought their skateboards to work.
Why was a weekday selected for this software update?
Have you tried turning it off and on again?
If we colonize Mars, it won't be the World Wide Web anymore. UWW?
BART is run by the dumbest people on Earth. First off, it takes a special kind of stupid to create a rail system that goes almost, but not quite, all the way to the airport. 30 years later they extended to one of them, but you still have to transfer to a bus for the last mile on another. Then you have to wonder what kind of idiot puts light carpet and cloth seating on public transport. 35 years later they start testing non-porous flooring/seating, and maybe in another five years all of the trains will be switched over. Then, some bean counter got a bonus when they closed all the station bathrooms when 9/11 happened, ostensibly for security. Now a fifth of the escalators are out of service at any one time because they are clogged with human shit.
I also heard there was some sort of labor dispute.
This is really surprising to me.
For all the "can not fail" systems I've worked on, there has been an identical set of hardware, along with other hardware to simulate load, on which you could try upgrades before you put them on a live system and cost the local economy tens of millions of dollars by screwing up.
I guess you can't always save by eliminating humans and their expensive unions. Although, I'm sure the software was intended to pick up the financial slack for all of those expensive peeps. Don't worry, Wall Street is highly motivated to eliminate the humans with the software, eventually...
Because there is no means in the "cockpit" to actually make the train go. There are three buttons in a BART rail car:
Open Doors
Go to next stop
Emergency Stop
Not even a "close doors" button - that is handled by door sensors and the computer when "Go to the next stop" is pressed.
Everything is automated. A chimpanzee could operate a BART train.
First, I'm not going to plug any VM vendor... but with certain VM backends, snapshots are possible, and they're a godsend when crap like this happens.
READY.
PRINT ""+-0
"assistant general manager for operations, said the system's backup computer had gone down at the same time its central supervisory computer crashed."
Redundancy is not just running two boxes... How many times do we need to point out that there's a reason true redundancy is hard and expensive?
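To put rough numbers on why two boxes are not automatically redundant, here is a back-of-the-envelope sketch in Python. It assumes a simple common-cause split in which some fraction of failures (a shared bad upgrade, a shared power feed) takes out both units at once. The function name, the 1% failure probability and the beta values are all made up for illustration; none of them come from BART.

```python
def pair_failure_probability(p, beta):
    """Probability that both units of a redundant pair are down.

    p    -- assumed probability that a single unit fails in some window
    beta -- assumed fraction of failures with a common cause that takes
            out both units at once (same upgrade, same power feed, ...)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must be in [0, 1]")
    p_common = beta * p                # one event, both boxes down
    p_independent = (1.0 - beta) * p   # each box failing on its own
    # Either the common-cause event happens, or it does not and
    # both boxes happen to fail independently.
    return p_common + (1.0 - p_common) * p_independent ** 2


for beta in (0.0, 0.1, 1.0):
    prob = pair_failure_probability(0.01, beta)
    print(f"beta={beta:.1f}  P(both down)={prob:.6f}")
```

With fully independent failures the pair is a hundred times better than a single box. With only 10% of failures shared it is barely ten times better, and if the same upgrade goes onto both boxes at the same time (beta = 1) the second box buys nothing.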
TFA (sorry for reading it) states that the problem showed up 12 hours after the upgrade. That's why it's time-consuming to test hi-rel stuff, whatever bean counters say...
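A minimal sketch of what a soak-time gate could look like, in Python. The idea is that an upgrade is only promoted after the test system has run fault-free for a minimum window. The function name, the 24-hour window and the timestamps are assumptions for illustration and do not describe BART's actual process.

```python
from datetime import datetime, timedelta


def soak_passed(upgraded_at, fault_times, now, min_soak=timedelta(hours=24)):
    """Return True only if the upgraded test system has run for at least
    min_soak since the upgrade with no fault recorded in that time."""
    if now - upgraded_at < min_soak:
        return False  # not enough run time yet, whatever the logs say
    return not any(upgraded_at <= t <= now for t in fault_times)


upgraded = datetime(2000, 1, 1, 0, 0)     # made-up timestamp
fault = upgraded + timedelta(hours=12)    # a fault that surfaces 12 hours in

print(soak_passed(upgraded, [], upgraded + timedelta(hours=2)))        # False
print(soak_passed(upgraded, [fault], upgraded + timedelta(hours=30)))  # False
print(soak_passed(upgraded, [], upgraded + timedelta(hours=30)))       # True
```

A two-hour smoke test would have signed off on an upgrade whose fault only appears at hour twelve, which is why the window has to be longer than the slowest failure you are worried about.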
See what happens when you give these guys root access? ;-)
So you're posting from an unpatched Windows 98 box? Or are you still on 3.1?
I have seen quite efficient manual train network operation, but the workers behind the success explained that it was only possible because they had a few old-timers who were still able to organize train flows using paper and pencil. Younger workers had always worked with computers, and once the old-timers have all retired, the know-how will be lost.
It's more the contractors refusing to train and keep their hires. Nobody wants to keep someone around. They cost more every year. But for programmers that means nobody knows how anything works. It keeps profits high for the guy running the sub-contractor, but it means crummy software...
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
Terry Childs was locked up on the off chance that something far less disruptive than this would happen. At least that was the excuse.
You've almost certainly never ridden BART, much less seen the driver's cab. Why do I say this? Because there's a section of the BART system (the Oakland Wye, bane of commuters who want to get anywhere during rush hour) where drivers are instructed to go to manual control, limited to 25 MPH. It's the result of your vaunted "automated" system designed in the '60s never having worked properly in the past 50 years, and one of the contributing factors to a crash in 2009 (thankfully no one was seriously injured). There are many well-documented incidents of entire train sets disappearing from the computer system, as well as "ghost" trains randomly appearing.
Here is what an actual BART cab looks like:
http://i.imgur.com/IbYtYTa.jpg
It was broken decades ago and remains so. The automated system never really worked properly.
Computers run the track switches.
Terry Childs pissed off the city and he worked for them.
Likely in this case some outside vendor / contractor messed up.
If the recent strike wasn't bad enough, now a computer glitch. Man, if I was riding the transit to work and back I would be extremely pissed. Wonder how many people lost their jobs because they couldn't make it to work??
They pilot their solar powered dirigibles.
I'm sure that if you asked them the answer would be along the lines of "Huh? What's a production system? We just call it the system."
I once argued for retention of a QA system, which was basically a 4-week-old copy of Prod. Things like being able to replicate actual problems with actual data and test new functionality & patches without impacting the business counted for less than some little tart's fluttering eyelashes. Of course that's what management wanted to hear, because an extra server is just a wasted expense, right?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."