Slashdot Mirror


Failed Software Upgrade Halts Transit Service

linuxwrangler writes "San Francisco Bay Area commuters awoke this morning to the news that BART, the major regional transit system which carries hundreds of thousands of daily riders, was entirely shut down due to a computer failure. Commuters stood stranded at stations and traffic backed up as residents took to the roads. The system has returned to service and BART says the outage resulted from a botched software upgrade."

35 of 125 comments

  1. I Guess by dale.furno · · Score: 2, Funny

    They should have brought their skateboards to work.

    1. Re:I Guess by noh8rz10 · · Score: 2

      wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.

    2. Re:I Guess by Trax3001BBS · · Score: 2, Interesting

      San Fran will turn into Detroit?

      This was posted on Reddit a day ago, but it's so on topic to your post that I had to post it in reply.

      http://www.reddit.com/r/explainlikeimfive/comments/1r6f8w/eli5_americans_what_exactly_happened_to_detroit_i/
      Very good read if you want to know about Detroit

    3. Re:I Guess by RabidReindeer · · Score: 4, Funny

      wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.

      Unionized software.

      Ironic, isn't it? Silicon Valley commutes wrecked due to bad IT practices!

    4. Re:I Guess by milkmage · · Score: 2

      most of them already have cars. BART serves the Bay Area. 50 miles south and east of SF.

      the week-long strike earlier this year caused havoc on the roads: people were on the road at 0400, and still late for work. extra buses, extra boats, not enough.

      https://www.google.com/search?q=bart+strike+traffic&espv=210&es_sm=119&tbm=isch&tbo=u&source=univ&sa=X&ei=EhyQUtq2FYb9iQKq2oG4CQ&ved=0CDYQsAQ&biw=1354&bih=647

  2. Strange times by nightsky30 · · Score: 5, Insightful

    Why was a weekday selected for this software update?

    1. Re:Strange times by TWX · · Score: 4, Informative

      Well, based on my own experience with bureaucracies, there is some existing rule that ensures that certain types of staff have certain days off unless there's an emergency, and a software update probably didn't previously count as an emergency.

      From one standpoint, it makes sense, especially if those doing the work need technical support from a vendor. On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.

      --
      Do not look into laser with remaining eye.
    2. Re:Strange times by DavidClarkeHR · · Score: 2

      Why was a weekday selected for this software update?

      Should have been a Tuesday. Then our Windows updates and our transit updates would match! (... 14% ... for ... ever ...)

      --
      - Nec Impar Pluribus, or so I'm told.
    3. Re:Strange times by x181 · · Score: 2

      so they can purposely botch it and justify the need to have human operators. in case you don't know, BART is currently going through a tense union battle resulting in a few worker strikes and contract disputes.

    4. Re:Strange times by B33rNinj4 · · Score: 4, Insightful

      Man, my company hasn't had a QA environment that mirrored production in over a decade. I'd like to think that they had something set up, but the few state-run departments I've seen have been sorely lacking.

    5. Re:Strange times by s1d3track3D · · Score: 2

      Yes and I bet there was at least one developer saying the exact same thing who was overruled by mgmt who proceeded with the push regardless!

    6. Re:Strange times by Salo2112 · · Score: 3, Funny

      Patch *Tuesday*. Duh.

    7. Re:Strange times by girlintraining · · Score: 5, Insightful

      On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.

      And that's pretty hopeful. The thing is, in the real world, you just don't test all your patches. You can't; in any non-trivially sized network you're going to have hundreds of them to go through every week, and the workload is the same for a small or large business. That's why large businesses tend to do better (strangely enough) than small ones when it comes to patch management. And this is an attitude that is backed up by the numbers -- I would say over 9 times out of 10, a break/fix patch has no consequences being pushed into the production environment. It goes out. The version increments. The end. It's that 1 time that screws everyone up -- but it happens infrequently enough that management doesn't update its policies.

      Most managers operate under a triage approach to maintenance -- that is, throw resources at a problem when something breaks and complaints start coming in, rather than throwing resources at prevention. In the short run, this is the right approach -- in a crisis you want all hands on deck. The problem is that over time, neglecting preventative maintenance procedures, which show up only as a cost without a defined benefit, results in departments moving to a triage model all the time. Basically, the problem is short-term prioritization over long-term cost reduction.

      And I've seen it in almost every IT department I've worked for. I've even sat down with managers and explained to them that when 35% of their workflow is emergency break/fix and that number is trending upwards, we have a process control issue. They invariably agree with me, but say they can't get out from under the workload. Of course, when I come back three months later and it's now at 47% and the workload is now a third higher, they say the same thing.
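To put the two figures above on a common footing, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes the total workload is normalised to 1 at the first visit and that "a third higher" means 4/3 of that baseline. The function name `split_workload` is made up for illustration.

```python
from fractions import Fraction


def split_workload(total, emergency_share):
    """Return (emergency, planned) work in the same units as total."""
    emergency = total * emergency_share
    return emergency, total - emergency


# Assumption: baseline workload is 1; three months later it is 4/3 of that.
before = split_workload(Fraction(1), Fraction(35, 100))
after = split_workload(Fraction(4, 3), Fraction(47, 100))

# Relative growth of each kind of work between the two visits.
emergency_growth = after[0] / before[0] - 1
planned_growth = after[1] / before[1] - 1

print(f"emergency work grew by {float(emergency_growth):.0%}")
print(f"planned work grew by {float(planned_growth):.0%}")
```

Under those assumptions, emergency break/fix work grows by roughly 79% while planned work grows by under 10%, so most of the extra load is break/fix.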

      I would lay money that this is how project management is happening at BART, and it has now deteriorated to the point where it's starting to impact its core business. The problem is, while it is still likely at a point where effective project management can right this sinking ship... it almost never happens. Unfortunately, the solution most of the time here is to throw someone under the bus, blaming them for the failure, and insisting that as the system has worked up until this point, it does not need an overhaul.

      They couldn't be more wrong; but unfortunately it will take several people being thrown under the bus and a few more high-profile failures before senior management fires the mid-level manager responsible for the project and brings on someone with a strong background in project management and they restructure their department from the ground up following the best practices of change management. Of course, they'll overdo it in the attempt and the pendulum will have to start swinging back the other way, but... that's what happens.

      --
      #fuckbeta #iamslashdot #dicemustdie
    8. Re:Strange times by causality · · Score: 2

      Yes, of course, it's always clueless management ignoring the brave developer who warns of catastrophe.

      If management wants the power in the form of the final decisions (which they have), and the ability to take most of the credit (which is often the case), then they also get to keep the responsibility.

      Sounds fair to me. Power and responsibility should never be separated. Ever.

      --
      It is a miracle that curiosity survives formal education. - Einstein
  3. Hello, IT. by tech.kyle · · Score: 3, Funny

    Have you tried turning it off and on again?

    --
    If we colonize Mars, it won't be the World Wide Web anymore. UWW?
  4. BART by Anonymous Coward · · Score: 5, Interesting

    BART is run by the dumbest people on Earth. First off, it takes a special kind of stupid to create a rail system that goes almost, but not quite all the way to the airport. 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another. Then you have to wonder what kind of idiot puts light carpet and cloth seating on public transport. 35 years later they start testing non-porous flooring/seating and maybe in another five years all of the trains will be switched over. Then, some bean counter got a bonus when they closed all the station bathrooms when 9/11 happened, ostensibly for security. Now a fifth of the escalators are out of service at any one time because they are clogged with human shit.

    I also heard there was some sort of labor dispute.

    1. Re:BART by Jane+Q.+Public · · Score: 3, Insightful

      "BART is run by the dumbest people on Earth."

      Well, you really do have to wonder when they say they worked through the whole night only to discover that this new, mysterious problem was caused by the update they'd made the night before.

      I mean, wow. Wouldn't that be the first thing that popped into your mind?

    2. Re:BART by MrEricSir · · Score: 4, Informative

      The BART-SFO extension was a matter of politics; you can't blame the people who run BART for that. You also can't blame the initial designers for not building the OAK extension, since OAK was a much smaller airport in those days (and had very few passenger flights).

      The train design was done by an aerospace company with absolutely no rail experience, which explains BART's quirky design elements. But you can't blame BART's current management for construction contracts awarded in the 1960s.

      --
      There's no -1 for "I don't get it."
    3. Re:BART by Anonymous Coward · · Score: 3, Funny

      So people take a dump while riding the escalator? That's actually a cool idea.

    4. Re:BART by Anonymous Coward · · Score: 5, Insightful

      Plus, BART is not exactly a metro system like in Boston, Chicago, or New York. It's somewhere between a metro and commuter rail, but closer to the latter. It's a product of 1960s thinking, where people were trying to deal with the population shift out of the urban core. So part of the idea was to create high-speed transit from bedroom communities to downtown Oakland and San Francisco.

      Connecting the airports probably never figured much into the equation. It wasn't built to supplement the transportation needs of carless San Francisco residents. It was built to shuttle people around the Bay Area. If you needed to get to the airport, you got there like everybody else--you drove your car.

    5. Re:BART by gagol · · Score: 2

      To suspect something is one thing; to be sure of it, you need to gather and analyse data. A night to confirm it is reasonable. And a bathroom in a metro is a luxury: how many undergrounds have those facilities? (I don't know; there are none in Montreal, Canada.)

      --
      Tomorrow is another day...
    6. Re:BART by gagol · · Score: 2

      Let us know how it went for you!

      --
      Tomorrow is another day...
    7. Re:BART by bluemonq · · Score: 2

      > 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another.

      Pity you didn't have a spare $100 million a couple decades ago. I'm SURE you'd have been willing to pay for it, right? The extension to SFO wasn't built until recent times because back in the '60s San Mateo County quit the BART project, and the money wasn't around until the tech bubble started growing; ground was broken in 1997. The Oakland extension wasn't started until recently (opens in 2014) because again, there wasn't any money for it. The only reason it's getting built now is because the Feds are footing a good chunk of the bill. OAK wasn't even all that popular an airport until last decade, after their renovation.

    8. Re:BART by SeaFox · · Score: 2

      If you needed to get to the airport, you got there like everybody else--you drove your car.

      But this just comes right back to how BART is stupid. Because when you build public transportation, it's going to be used by people who don't have cars, and to not take them into account is fucking stupid.

      Maybe the assumption was if you couldn't afford a car, you probably couldn't afford to be going on many flights either. Keep in mind air fare was a bit pricier in the '60s and gas was quite a bit cheaper. The financial bar for car ownership was lower.

    9. Re:BART by drinkypoo · · Score: 2

      Well, what I meant was that they should have taken both classes of passenger into account.

      Ideally this means having lines segregated by socioeconomic status. You don't want to go to the airport and the ghetto.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    10. Re:BART by xaxa · · Score: 2

      London Underground toilet map (not so great in the centre, but pretty good elsewhere).

      They're in probably half of European underground stations, on average. Expect to pay 0-50c, depending on the country.

      My local station (in London) has one, it's always very clean. I don't think many people use it.

  5. Snapshots? by Neo-Rio-101 · · Score: 2

    First I'm not going to plug any VM vendor.... but with certain VM backends, snapshots are possible, and it's a godsend when crap like this happens.

    --
    READY.
    PRINT ""+-0
    1. Re:Snapshots? by Runaway1956 · · Score: 2

      You have to realize how few people even know what a VM is. Or a snapshot. Where I work, there is one backup made each week, on the server. No other machine has a snapshot, a disk image, a backup, there are no VMs - nothing. If/when a disk fails, that machine comes to a halt until a vendor is called in to replace the disk, the OS, and all the software.

      We have some fool who is referred to as "the IT guy". I can't even say that with a straight face. This is one of those who got a Microsoft-centric education, and proved to be pretty adept at accomplishing Microsoft-centric tasks - and just happens to be related to the company president.

      I know that our situation isn't unique.

      --
      "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
    2. Re:Snapshots? by Anonymous Coward · · Score: 2, Interesting

      No. Just no.

      Have you ever actually tried this on a production system? I haven't (I'm not stupid enough to do that), but I've seen many others try. In almost every case, the resulting mess from "rolling back" a VM was greater than the mess of a botched software update to begin with. In one particular case, I witnessed a certain VM running some very expensive enterprise software totally hose itself and then proceed to blow away the majority of a database hosted on another VM after it was restored following a broken update. Despite their attempts to restore both VMs and bring them back in sync, they eventually determined that the data couldn't be trusted on either and the entire system had to be restored from backup. The downtime this cost them was greater than the downtime would have been had they simply called the vendor and said "your update broke our stuff, fix it" (they had the support contracts and the fix would have taken 10 minutes instead of 8 hours).

      Another time I saw someone restore a VM that was running a network daemon for a cluster of hardware locks attached to one of the nodes (of course, this VM was locked to that particular node since it required passthrough access to the USB dongles). That was a good one: not only did none of the licenses get checked back into the network daemon (so they basically lost all the capacity they had in use at the time of restore), but the licensing software freaked out and shat itself when the time stamps coming off the hardware were suddenly in the future (as the clock had not yet been synchronized back to local time). It took those guys several days of pleading with the software vendor to send them new keys and get the licensing system sorted out and working again (snapshots were permanently disabled on that VM thereafter).

      Now, it's an awesome feature to have for testing and development stuff, but for production, you should have procedures in place to deal with this kind of thing rather than reaching for the Big Red Button and nuking everything from orbit. I keep hearing about this kind of thing: "oh just restore the VM from snapshot in prod", and it makes me cringe every time I hear it. You don't restore a server from tape unless you absolutely have to. I fail to see why anyone thinks that restoring a VM from snapshot is any different; the only difference is that it takes seconds to complete, instead of hours.

    3. Re:Snapshots? by Todd+Knarr · · Score: 2

      Gods, no. Just... no. Think for a minute. If your VM's running a database server and you roll back to a snapshot, what happens? Well, the snapshot doesn't know anything about the database since that's an application-level thing, so it'll roll back to being mid-operation (times however many database operations were in progress). The problem is that since the clients haven't been rolled back to the same moment down to the nanosecond, the database is now mid-operation while the clients that're supposedly performing those operations... aren't. From here things proceed to go pear-shaped in a big way.

      It can be done safely, but it requires either intimate knowledge of the application by the VM host or bringing the applications to a safe idle state before starting the snapshot. Basically snapshots are far less useful than they're made out to be because the problem you're trying to solve is far more complex than just taking a snapshot.
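A minimal sketch of the "safe idle state" idea above, using a toy in-memory store rather than a real database or VM host. Everything here (the `QuiescableStore` class and its methods) is hypothetical and only meant to show the shape of the approach: the snapshot waits for in-flight operations to finish and blocks new ones while it copies the data.

```python
import copy
import threading
from contextlib import contextmanager


class QuiescableStore:
    """Toy key-value store that can be paused before a snapshot (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # held for the whole of each operation

    @contextmanager
    def transaction(self):
        # A multi-step operation holds the lock from start to finish, so a
        # snapshot can never observe it half-applied.
        with self._lock:
            yield self._data

    def snapshot(self):
        # Taking the lock waits for any in-flight operation and blocks new
        # ones: this is the "safe idle state" before copying.
        with self._lock:
            return copy.deepcopy(self._data)

    def restore(self, snap):
        with self._lock:
            self._data = copy.deepcopy(snap)


store = QuiescableStore()
with store.transaction() as data:
    data["alice"] = 100
    data["bob"] = 0

snap = store.snapshot()

# A transfer is two writes; the invariant is that the total stays at 100.
with store.transaction() as data:
    data["alice"] -= 30
    data["bob"] += 30

store.restore(snap)
with store.transaction() as data:
    total = data["alice"] + data["bob"]
print(total)
```

This only covers the server side. Clients that still believe an operation is in progress after a rollback remain a separate problem, which is the harder part the comment describes.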

  6. Looks like Terry Childs had a point by Somebody+Is+Using+My · · Score: 4, Funny

    See what happens when you give these guys root access? ;-)

  7. Manual operation by manu0601 · · Score: 2

    I have seen quite efficient manual train network operation, but the workers behind the success could explain it was only possible because they had a few old timers who were still able to organize train flows using paper and pencil. Younger workers had always worked with computers, and when the old timers have all retired, the know-how will be lost.

  8. Re:BART has drivers. by bluemonq · · Score: 4, Interesting

    You've almost certainly never ridden BART, much less seen the driver's cab. Why do I say this? Because there's a section of the BART system (the Oakland Wye, bane of commuters who want to get anywhere during rush hour) where drivers are instructed to go to manual control, limited to 25 MPH. It's the result of your vaunted "automated" system designed in the '60s never having worked properly in the past 50 years, and one of the contributing factors to a crash in 2009 (thankfully no one was seriously injured). There are many well-documented incidents of entire train sets disappearing from the computer system, as well as "ghost" trains randomly appearing.

    Here is what an actual BART cab looks like:
    http://i.imgur.com/IbYtYTa.jpg

  9. computers run the track switches by Joe_Dragon · · Score: 2

    computers run the track switches

  10. Re:This is really surprising to me. by bill_mcgonigle · · Score: 2

    and cost the local economy tens of millions of dollars by screwing up.

    So what? What's BART's incentive to avoid this? The customers will go to a competitor? They'll lose their jobs?

    Unionized monopolies are a wonderful thing.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)