Windows Upgrade, FAA Error Cause LAX Shutdown

Repent, Sinners! by mfh · 2004-09-21 09:49 · Score: 5, Insightful

The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.

Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer! /sarcasm

a technician didn't reboot the system monthly as he should have

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

--
The dangers of knowledge trigger emotional distress in human beings.

Re:Repent, Sinners! by LostCluster · 2004-09-21 09:54 · Score: 3, Insightful

I've seen AIX-based database systems that require an overnight downtime to do reindexing, since non-SQL formats like DBase have always been a little funky when they start having to deal with million-record tables. It's amazing how ugly legacy databases can be compared to today's tech.
Re:Repent, Sinners! by (H)elix1 · 2004-09-21 10:08 · Score: 5, Insightful

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

All right, I cannot throw the first stone here. I can raise my hand as an AIX C programmer back in the day...

We inherited a huge ball of spaghetti wire, nasty stuff that had memory leaks. Rather than taking the time to fix it, the powers that be determined it was better to keep working on new features rather than hash out the issues. At first the restart happened once a quarter, then once a month, and as time ticked by it became a weekly 'fix' to recycle the server. Lord knows I added to the mix as well, as they picked 'cheap' and 'build it fast' (not to be confused with running fast), skipping the entire 'do it right' part. That is how it happens... stuff gets rushed before its time. OSS is more immune than the typical commercial gig, but anytime a deadline comes without enough time to finish, something is going to give. Downtime is just duct tape.

--
+++ UGUCAUCGUAUUUCU
Re:Repent, Sinners! by pchan- · 2004-09-21 10:21 · Score: 5, Insightful

where do you want to go today?

dear microsoft,

the above question was posed in a line of your advertisements. well, after spending an hour and a half on a plane on the runway in oakland, and another hour on the runway in l.a. (sunday night), i think i have the answer. i want to go home. sounds like a simple enough request, or so i thought.

but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready to do so. when my desktop computer at work crashes (admittedly a rare occurrence nowadays), i am inconvenienced. when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable. i realize that buyers of software and IT systems are easily suckered or bribed into using your systems, that is why i am appealing directly to you. please exit this market before we are forced to legislate you out.

thanks,
pc
Re:Repent, Sinners! by claar · 2004-09-21 10:29 · Score: 4, Insightful

Bah, what a cop out. If "we" won't accept criticisms similar to our own, we have no right to criticize in the first place..

Yes, init 6 is counter-intuitive. I remember that it actually did confuse me a bit the first time I heard of it. Does that mean we need to remove or change it? Nah, let 'em use `shutdown -r` or `alias restart="init 6"`. But just don't be an apologist for Linux, it just makes "us" look hypocritical.

--
I'd give my right arm to be ambidextrous...
Re:Repent, Sinners! by 47Ronin · 2004-09-21 10:33 · Score: 4, Insightful

Personally, I use "reboot".

"shutdown -r now" also works (r stands for reboot). To shut down, use -h (for halt).

Personally I use sudo reboot because I would never log in as root for security/safety reasons.

--
Those who laugh at you for you having a Mac.. are the people who constantly call you to fix their PC.
Re:Repent, Sinners! by Phillup · 2004-09-21 11:34 · Score: 2, Insightful

But just don't be an apologist for Linux, it just makes "us" look hypocritical.

I wasn't apologizing. It makes perfect sense to me.

Then again, I have a calculator that you turn off by pressing the "ON" key. ;-)

Seriously tho...

Many devices have a single power button. You push it... thing comes on... push it again... thing turns off.

If anyone should apologize, it is the person that decided on "Start" for the button label.

And, in *nix... init 6 does just what it says it does.

It initializes run level six. Run level six can do anything you want it to do. It doesn't have to shut down the system.

So... WTF would I even have to apologize for? The fact that the parent associates it in his mind with shutting down?

It doesn't shut down... it initializes run level six. If you don't want it to shut down when you init 6... change it.

If you don't want to go to the "Start" button in Windows to shut down... well... that one is your problem. Not mine.

--

--Phillip

Can you say BIRTH TAX
Re:Repent, Sinners! by Awptimus+Prime · 2004-09-21 11:42 · Score: 4, Insightful

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.
Re:Repent, Sinners! by n3k5 · 2004-09-21 12:22 · Score: 3, Insightful

If anyone should apologize, it is the person that decided on "Start" for the button label.
Originally the button just showed the Windows flag, so basically the choice of label was the same as in Gnome and KDE today. However, the average Windows user didn't figure out that this logo isn't just there for decorative purposes, but you actually have to click it in order to accomplish just about anything. So someone had to come up with a short piece of text that clues newbies in, and it worked rather well (in usability tests). 'Start' may not be optimal, but has anyone thought of something better? (Not that it matters anymore.)

--
but what do i know, i'm just a model.
Re:Repent, Sinners! by autopr0n · 2004-09-21 12:58 · Score: 4, Insightful

Yes, but maybe that was controlled by a cron-job and not some poor person manually initiating it every night? Just like an automated reboot is also not too scary on any decent Unix, but a manual action in MS-world?

a) This could easily have been done as a scheduled task in windows 2000.

b) This could have been done by their code, in windows 2000 and windows 95.

c) Windows 2000 does not require a reboot after 49.7 days. Maybe their software relied on GetTickCount() or something.

The problem lies with the developers of the software, not microsoft.

--
autopr0n is like, down and stuff.
Re:Repent, Sinners! by ckaminski · 2004-09-21 13:50 · Score: 4, Insightful

Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.

Your plane *WILL* land. It may be at a different airport, and sooner or later than planned, but you will get on the ground in one piece.
Re:Repent, Sinners! by Awptimus+Prime · 2004-09-21 13:54 · Score: 4, Insightful

and these are heavily used mail servers.. no need to reboot on a nightly basis!! good grief (charley brown)

Right, the code used for mail serving is some of the most mature server code out there. This is far more reliable than say a Linux box set up with proprietary, closed src, business applications with their own bugs.

My feelings are the article may have mistakenly blamed Windows for a problem with one of the server applications running on it. It is not typical for even Win2k to hang unexpectedly when running good hardware and well-written code.

I say fuck it. There is no point in ever trying to defend logic when it stands in the way of the Microsoft bash-fests on /..

Just to clarify, I am not saying Windows servers can and will run as reliably as a properly configured BSD, Solaris, or Linux box. I am just trying to state that Windows is reliable, if properly configured, but will probably not win an uptime competition. Big whoop. Reboot your shit during maintenance windows, regardless of OS, you run a much better chance of finding pending hardware failures. It is much better to powercycle that database server and get an error detecting the SCSI bus during a maintenance window than for it to happen at 5:30AM on a Monday or during your vacation.

Then again, I could be overly anal. I just like to avoid the reputations gained by those before me. :)
Re:Repent, Sinners! by pchan- · 2004-09-21 13:58 · Score: 5, Insightful

see what you've done, now i had to go and rtfa just to respond. here's a choice quote:

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days.

now, let's do a little math. the number of milliseconds in 49.7 days = (49.7 * 24 * 60 * 60 * 1000) = 4,294,080,000. recognize that number? that's right, it's 2^32 (actually, this is: 4,294,967,296, but it's pretty damn close). and why is that significant, you ask? because at 2^32, the unsigned int used by some versions of windows to keep the time since boot overflows back to zero, and bad things begin to happen.
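
for anyone who wants to check that arithmetic, here is a quick sketch in python. the variable names are mine and nothing here comes from the article; it only assumes a 32-bit unsigned counter that ticks once per millisecond.

```python
# Back-of-the-envelope check of the 49.7-day figure.
# Assumption: a 32-bit unsigned counter incremented once per millisecond.
MS_PER_DAY = 24 * 60 * 60 * 1000          # milliseconds in one day

ms_in_49_7_days = round(49.7 * MS_PER_DAY)  # the number quoted above
counter_range = 2 ** 32                     # distinct values in 32 bits

# How many days pass before the counter wraps back to zero
days_until_wrap = counter_range / MS_PER_DAY

print(ms_in_49_7_days)             # 4294080000
print(counter_range)               # 4294967296
print(round(days_until_wrap, 2))   # 49.71
```

so a counter like that runs out after roughly 49.71 days, which lines up with the figure the union official gave.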

is the problem microsoft's fault? goddamn right it is. in software that runs A MAJOR AIRPORT and controls the flight control and radar systems that affect thousands of lives in the air, an error like this is just not an option. the people who put this system into production ought to be fired. i don't know what the right os for this task is. solaris? aix? vms? something with provable uptime and reliability, something that can deliver uptime of longer than a month and a half, that's for sure.

I'm sure Linux doesn't store time in an infinite bit counter either.

i don't recall advocating linux for the job. maybe it can do it, maybe not. and in regards to being free, when my life is on the line, they better spend every god-damn dollar they can to make sure that critical systems do not fail under any circumstances. microsoft was absolutely the wrong choice in this case.
Re:Repent, Sinners! by Atzanteol · 2004-09-21 14:15 · Score: 4, Insightful

You don't work with other people much do you? It's probably for the best.

These things cost money. Migrating apps that use the old DB to the new one, testing, bugs introduced in the migration, etc. If it works most companies will stick with it and not risk spending large amounts of money for no 'gain' (in their mind).

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:Repent, Sinners! by nathanh · 2004-09-21 14:47 · Score: 3, Insightful

So someone had to come up with a short piece of text that clues newbies in, and it worked rather well (in usability tests). 'Start' may not be optimal, but has anyone thought of something better? (Not that it matters anymore.)

Click Me. Menu. Actions. Tasks. Open Here.
Any of those make more sense than "Start".
Re:Repent, Sinners! by tulare · 2004-09-21 15:54 · Score: 2, Insightful

Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.
Or, more appropriately, to the hands of the pilots, including the one who had to take evasive action. What's glossed here is that a stupid application flaw very nearly did result in serious loss of life. Kudos to the pilot who knew what the fuck to do when the time came.

--
political_news.c: warning: comparison is always true due to limited range of data type
Re:Repent, Sinners! by sxpert · 2004-09-21 17:44 · Score: 1, Insightful

but "commence" is a french word, a certain part of the US population surely don't want a french word in the place of the idiotic "start" on their windows do something button...

"The problem with the french is that they don't have a word for 'entrepreneur'" (G.W. Bush)
Re:Repent, Sinners! by Anonymous Coward · 2004-09-21 21:33 · Score: 1, Insightful

"is the problem microsoft's fault? goddamn right it is."

No, it is the fault of those who selected a product with this problem to run an airport.

Anyone want to clue them in to scheduled jobs? by FyRE666 · 2004-09-21 09:50 · Score: 3, Insightful

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month. That said, I was under the impression that a simple "at" job could be used on a Windows machine to run a script periodically (at is similar to cron, except far less capable, of course). Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.

We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).

--
Code, Hardware, stuff like that.

Re:Anyone want to clue them in to scheduled jobs? by Ann+Elk · 2004-09-21 10:11 · Score: 4, Insightful

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service...

It's obvious you have never toured an ARTCC (Air Route Traffic Control Center). The system that is being replaced was barely hanging together by voodoo and chicken wire. It was designed back in the 60's to handle maybe 1/10th the current capacity. It is in dire need of replacement.

That said, I'm not convinced Windows (or Linux for that matter) is an appropriate OS for an application that practically defines the phrase "mission critical".
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:33 · Score: 2, Insightful

The cost of having a trained monkey reboot the system every month for 10 years is probably less than the cost of maintenance on the old hardware.

It makes sense on paper. It doesn't work out when the human element "screws the pooch" (they rarely show you that slide in the powerpoint, do they?!)

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by drinkypoo · 2004-09-21 10:56 · Score: 2, Insightful

I bet I could get you a replacement board for an IBM RT PC. I gave some Model 135s to a guy I used to work with, and I bet he's still got them or knows who has them. Since there's nothing better than a 135 I can't imagine you'd evince any significant dismay over that idea. There's a lot of that kind of crap running around assorted towns where IBM's got offices, like Austin - which is where I got them. I had AOS 4.3 and BSD-4.3-lite... More or less the same thing really.

Er anyway back to the point, you don't replace an old workhorse with a new POS. You get a newer workhorse than the last workhorse, and maybe not even a new one. I'd rather go dig up some Sparcstation 10s with supersparcs in them to replace (for example) your RT PC. Running SunOS or perhaps netbsd, you should be able to port your software from BSD. If you are running AIX on your RT, maybe you'd be better off with an old RS6k, they're available very cheaply. Hell, I once sold a 603e laptop RS6k (thinkpad power series) to a guy for like nine hundred bucks or so. That little bastard would make a better server than your average wintel box, given it was SCSI, assuming that you were replacing an antique.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

And the lesson is... by jcr · 2004-09-21 09:50 · Score: 2, Insightful

Don't use this stuff in mission-critical applications.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."

Re:And the lesson is... by Dun+Malg · 2004-09-21 10:54 · Score: 2, Insightful

Good IT is so hard to pull off because you have to convince people that events that strike once every few years have to be prepared for otherwise a disruption in service will occur.
Like the PHB at the office where my wife works said after announcing that the IT guy was to be laid off and not replaced: "I don't see why we need an IT guy-- we never have any computer problems" (cluebat time!)

--
If a job's not worth doing, it's not worth doing right.

I Hate to Say It by DarkKnightRadick · 2004-09-21 09:51 · Score: 2, Insightful

But I'm going to.

It's M$'s fault. Why do I hate to say it? Because it'll just be seen as more anti-MS crap from another /.er.

All I have to say is if the shoe fits, wear it.

In this individual case a PHB made a decision to scrap the old, stable OS for a new, known-to-be-unstable OS. That screams PHB.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)

Heh by GypC · 2004-09-21 09:52 · Score: 3, Insightful

upgrade from Unix to Windows

AKA, "The PHB Special"

Of course, the guy who was supposed to reboot the box will get all the blame. Shit rolls downhill.

humans rule by Doc+Ruby · 2004-09-21 09:53 · Score: 3, Insightful

It is human error: those bugs didn't write themselves. Nor did the operations protocol that required "rebooting LAX" every 49.69(!) days. Nor did the upgrade procedure that ignored that bottleneck. Nor did the upgrade decision that moved from Unix to Windows. Those were all human errors, as was the decision to keep a job at LAX that would face blame for shutting down the airport (or risking lives) if the reboot was missed, or unsuccessful.

"Not I," says the referee,
"Don't point your finger at me.
I could've stopped it in the eighth
An' maybe kept him from his fate,
But the crowd would've booed, I'm sure,
At not gettin' their money's worth.
It's too bad he had to go,
But there was a pressure on me too, you know.
It wasn't me that made him fall.
No, you can't blame me at all."
- Bob Dylan, "Who Killed Davey Moore?"

--
make install -not war

Before the torrent of "windows sucks" posts... by rasafras · 2004-09-21 09:56 · Score: 3, Insightful

...keep in mind that we have established numerous times that windows is not suitable for systems that need reliability and stability. It is not the operating system's fault that this happened, it is the FAA's for choosing to use it instead of considering the better alternatives. If you get run over on a bicycle while riding on the highway, don't blame the bike.
Quick addition: it seems that the fault does not belong entirely to windows, but rather a combination of the software running on it and the system architecture.

With that said, Windows could stand to improve a lot. It has too many bugs, too many flaws, and so on. And it definitely does not have a stable, secure, reliable base. So don't expect it to.

--
webpage

Now even the submitters aren't reading the article by Holi · 2004-09-21 09:57 · Score: 2, Insightful

From the submission
possibly related to an old Windows 95 bug

From the Article.
The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.

The shutdown is not a crash but a scheduled event to bring the servers down to flush data.
So it does not seem to be a problem with Windows (Ok now I get marked as troll) but with the FAA's own software.

--
Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.

Re:Why not automate it? by Anonymous Coward · 2004-09-21 09:57 · Score: 1, Insightful

"Have they never thought to just schedule an event to reboot the computer every 30 days?"

Would it not worry you to know that the ATC were relying on a computer that reboots itself so often?

Re:A hit for the other team... by PPGMD · 2004-09-21 09:59 · Score: 5, Insightful

The Patriot missile system had a similar problem. Its timing broke down after a period of time without a reboot (it was a much shorter cycle, either one day or one week).

Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.

I simply blame the FAA for wasting money every year. Billions are sunk into the system, but rarely does anything come out of it. Lockheed could deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.

Re:A hit for the other team... by oGMo · 2004-09-21 10:00 · Score: 2, Insightful

Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.

An error is scored against a player if the player is determined to have been negligent in their position according to the rules. If someone hits a line drive right past the first baseman, it's still a hit. If the first baseman catches it, then drops it instead of making a tag, it's an error.

If multiple players are negligent, then multiple errors are scored. We've all seen "blooper" videos where there are cascading errors; one guy drops a catch, throws it to the next guy who drops it in turn, etc.

This is what happened here; it's not a hit, it's a cascade of errors. Everyone is to blame, because they all did something stupid. That doesn't make it "OK," it doesn't make any particular party less at fault.

I don't think this contradicts what you're saying here, I just wanted to emphasize the point. ;-)

--

Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

Re:Check out this little pile of bullshit by k4_pacific · 2004-09-21 10:02 · Score: 2, Insightful

"Maybe they left off a percent sign?"

Or maybe there's some kind of failover to a backup system (Which they also forgot to reboot)?

--
Unknown host pong.

Re:Ahh yes... by Qeyser · 2004-09-21 10:02 · Score: 2, Insightful

Moreover: why do you have a critical system that hasn't been patched in over five years?

Check the date on that news.com article linked in the main story -- it's from March of 1999. The bug is that old, and as I recall the fix didn't take that long to get out.

If LAX was trying to upgrade to/integrate win2k with ancient, unpatched Win95 systems, it's no wonder that they're having problems . . .

-Q

Don't be so hasty to blame the OS... by Ann+Elk · 2004-09-21 10:03 · Score: 5, Insightful

OK, I know it's a violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:

Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.
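
To show the kind of bug I'm guessing at, here is a rough Python sketch. It is only an illustration of a common pattern, not the FAA's actual code; the function names are made up, and the 32-bit mask stands in for a ULONG.

```python
MASK = 0xFFFFFFFF  # 32-bit unsigned range, standing in for a ULONG

def naive_expired(now, start, timeout_ms):
    """Buggy pattern: compare against an absolute deadline that can wrap."""
    deadline = (start + timeout_ms) & MASK  # wraps to a small number near rollover
    return now >= deadline

def wrap_safe_expired(now, start, timeout_ms):
    """Compare elapsed time computed modulo 2**32 instead of absolute values."""
    elapsed = (now - start) & MASK  # correct as long as the interval is < 2**32 ms
    return elapsed >= timeout_ms

# Hypothetical scenario: a 5-second timer started one second before rollover
start = 2**32 - 1000
now = start + 100  # only 100 ms later, counter has not wrapped yet

print(naive_expired(now, start, 5000))      # True  (fires far too early)
print(wrap_safe_expired(now, start, 5000))  # False (100 ms elapsed)
```

The second version only holds up if the interval being measured is shorter than the counter's full range, which is the usual assumption for short timeouts.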

Re:A hit for the other team... by LostCluster · 2004-09-21 10:05 · Score: 2, Insightful

If multiple players are negligent, then multiple errors are scored. We've all seen "blooper" videos where there are cascading errors; one guy drops a catch, throws it to the next guy who drops it in turn, etc.

Only one error can be scored per base advanced by the runner, and if the runner took first by a "hit" before the errant throw, then there is only one "error" for his advancement to second. If two players crash into each other and the ball drops, it's usually a hit because it's hard to say either would have been able to make the catch "with normal effort" which is the real standard for an error.

windows update anyone? by roadrunnerro · 2004-09-21 10:06 · Score: 3, Insightful

and office update while you're at it too...

Wouldn't want to spoil a nice MS bashing session, but I think the bug was in the ported application, not in the OS - probably someone used the wrong data type to hold timestamps somewhere within the program (win95 had the same bug) - I've seen win2k last more than 47 days without reboots...

depends by Tsiangkun · 2004-09-21 10:06 · Score: 2, Insightful

I think it depends on what the company rep said when they convinced them to replace Unix with Windows.

If they advertised a consumer OS as an OS suitable for mission critical applications . . . then this flaw should not be in the software. It could be the software company's fault for aggressively marketing their product where it should not be.

Maybe we should throw some blame to the PHB who ordered the switch. Perhaps there was no hard sell from MS, and a PHB saw a product brochure and got a hard on to switch.

I see your point though, the tech knew about the problem and failed to do his job.

I guess my question is, should the problem have been addressed before now, or is it common practice to wait for a catastrophic success like this to occur before addressing the problem ?

No proof the old system was stable. by rdunnell · 2004-09-21 10:07 · Score: 2, Insightful

A system running UNIX doesn't necessarily mean it was stable. It could have all sorts of flaws in the code, hardware failures, etc.

Sure, Windows 95 in particular and Windows in general is often less stable than modern counterparts. But an upgrade from an old, obsolete UNIX to a new Windows system could have had significant benefits and made a lot of sense at the time. Without the full information behind the decision, how can you judge whether the decision was bad or not?

An urban legend... by eddy · 2004-09-21 10:08 · Score: 2, Insightful

.. is what I'm going to consider this for the time being. I've seen it reported everywhere, but it's just too absurd to take at face value.

--
Belief is the currency of delusion.

Re:Why 49.7 days? by PhrostyMcByte · 2004-09-21 10:15 · Score: 5, Insightful

It sounds to me like an application they were running was badly designed to use GetTickCount() as a long-term counter. If so, it's not Win2k's fault.

Re:2K is based on NT kernel by gl4ss · 2004-09-21 10:18 · Score: 4, Insightful

so what if it is "completely different os"? that's the whole point, if it were continuation of the win95 line it would have been fixed!

now the bug was present in both codebases, but fixed just in one.

that's at least how the article and the writeup make it sound like.

--
world was created 5 seconds before this post as it is.

Re:If it's in the job description... by serviscope_minor · 2004-09-21 10:22 · Score: 5, Insightful

How can you intimate blaming the software company here?

You are joking, right? The majority of accidents happen due to human error. This is supposed to be mission critical software (and there's more than just money at stake). Yet, it relies on needless human intervention once a month! This is simply unacceptable for a piece of software in such a position. The main blame lies in the hands of the company that provided it, the person who decided to switch to it and the person who decided to bring the new system online and remove the old one despite this flaw. The technician is almost irrelevant, since this happening was an inevitability. It would have happened sooner or later because the system left room in there for human error to happen.

And yet, you still don't blame a company which ships mission critical software which leaves such a huge hole open for human errors. I hope our nuclear power plants are running on better designed stuff.

--
SJW n. One who posts facts.

Who's really at fault? by mcguyver · 2004-09-21 10:24 · Score: 3, Insightful

Whoever approved this process of manually rebooting a machine should be at fault. The fact that it was a windows operating system, or a unix OS or a purple OS is irrelevant. The problem here is someone thought a valid solution was to reboot a machine once a month.

Re:Why 49.7 days? by caluml · 2004-09-21 10:26 · Score: 2, Insightful

I think they solved it by Windows 98 - however, maybe there is an old app running on said Windows 2000 server that uses 32 bit milliseconds. Come on guys - we're going to get nowhere by harping on about issues that were fixed years ago. If we stand still, and laugh, Windows is going to sneak up, and run past.

--
Get your own free personal location tracker

Re:Liability. by Sloppy · 2004-09-21 10:27 · Score: 3, Insightful

You know, if strict product liability were applied to Microsoft, they'd be paying big time.

If you duct tape a wing to an airplane and then the wing falls off and the plane crashes, you don't sue the duct tape maker. You sue the idiot who decided to use the duct tape.

The grossly negligent party in this situation, is the contractor who built a real-life system on top of Windows. And the FAA idiots who didn't spot this glaring flaw in the proposal. Microsoft shouldn't have to pay a cent.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Re:Wait, I know this one.... by Codebender · 2004-09-21 10:38 · Score: 3, Insightful

No, the FAA is responsible for maintaining the safety of that system. They failed bigtime by allowing Windows to be used for a mission-critical system. Technically, a contractor was the one who made the decision, but the final responsibility for oversight rests on the FAA.

Another mouse wiggler bites the dust.... by Proudrooster · 2004-09-21 10:50 · Score: 2, Insightful

Let this be a lesson out there to all the mouse wiggling MCSE's who scorn the uptime of UNIX and shun the power of the command line. If you are running a critical Windows Server, REBOOT EARLY and REBOOT OFTEN. Remember, REBOOT-ing is part of the job description and it has to be done. Please protect our key infrastructure and reboot your servers WEEKLY! Just because the UNIX guys get 2 years of uptime, doesn't mean you can too. It just doesn't work that way.

Might I suggest this wonderful little tool. Poweroff. It's the only tool I know of which seems to be able to reliably reboot Windows boxes, even when they are crippled due to worms and/or memory leaks. It can even close running apps. Also, you can get it to work over the network with a magic packet, in case Terminal Server crashes or is too slow to use.

The main article should get flagged as troll/flamebait due to the phrase upgrade from Unix to Windows. That wasn't an upgrade; it was (as we now know) a disaster waiting to happen. Wait until the worm of the month comes through and shuts it down. When will people learn to use the RIGHT TOOL FOR THE JOB! If it has to run 24x7 forever, don't put it on Windows. Geez...

What failed? by AK+Marc · 2004-09-21 10:50 · Score: 5, Insightful

A system whose application (not the OS) failed after a finite time was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.

I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
Don't deploy flawed software.
Make sure redundant systems work.

As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?

--
Learn to love Alaska

Re:Please mod parent up by drsmithy · 2004-09-21 10:54 · Score: 2, Insightful

This is pure speculation by the editor. Nowhere in the article is the blame put on the OS. Linking the failure to an error in a previous version of the OS just doesn't make sense.

Particularly when it's not a "previous version" at all but a completely different Operating System.

Windows 95 and Windows NT (2000/XP/2003) are not the same OS. They're completely different. They share a common API and that's about it. Blaming this on "Windows 95" makes about as much sense as blaming an application bug under FreeBSD 5.x on Slackware 1.0.

...Blame the API instead by tyler_larson · 2004-09-21 11:05 · Score: 5, Insightful

This sounds to me like more of a problem with the application, not the OS.

Three words:

GetTickCount()

Returns the number of milliseconds since the machine was last booted.

From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
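The 49.7-day figure follows directly from the size of a 32-bit millisecond counter. As a rough sketch (a Python model of such a counter, not the real Windows API; `tick_count` and the constant names are made up for illustration):

```python
# Hypothetical model of a 32-bit millisecond tick counter (not the Windows API itself).
TICK_MODULUS = 2 ** 32              # a DWORD holds values 0 .. 2**32 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000    # milliseconds in one day


def tick_count(uptime_ms: int) -> int:
    """Return what a 32-bit millisecond counter would read after uptime_ms of uptime."""
    return uptime_ms % TICK_MODULUS


# How long the counter can run before it wraps back to zero.
rollover_days = TICK_MODULUS / MS_PER_DAY

# Counter readings one millisecond before the rollover and exactly at it.
before = tick_count(TICK_MODULUS - 1)
after = tick_count(TICK_MODULUS)

print(f"rollover after {rollover_days:.2f} days")
print(f"before={before}, after={after}")
```

Any code that assumes a later reading is always numerically larger than an earlier one breaks at that boundary, which is the kind of failure speculated about above.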

The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.

So while the problem wasn't the old Win95 bug, it was the same crappy Windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).
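The "next 30 years or so" estimate is presumably a reference to the seconds field overflowing in 2038. A quick check of that arithmetic, assuming (this is not stated in the post) that the seconds are held in a signed 32-bit integer counted from 1970-01-01 UTC:

```python
from datetime import datetime, timezone

# Assumption: seconds since 1970-01-01 UTC stored in a signed 32-bit integer.
MAX_INT32 = 2 ** 31 - 1

# Last moment such a counter can represent.
last_second = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)

# Date of the comment, used only to measure how much headroom was left.
posted = datetime(2004, 9, 21, tzinfo=timezone.utc)
years_left = (last_second - posted).days / 365.25

print(last_second.isoformat())
print(f"{years_left:.1f} years after the post date")
```

On systems where the seconds field is 64 bits wide, that limit moves far beyond any practical horizon.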

Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.

--
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925

Since We're Being Technical About the Answer by techsoldaten · 2004-09-21 11:09 · Score: 3, Insightful

Since we are being technical about the answer, does this mean Microsoft or the software vendor qualifies as a terrorist organization?

Consider the fact that an entire airport was shut down, lives were disrupted, and major economic harm was caused to our airlines as a result of flights not getting out on time. LAX is a major hub that connects travelers throughout the country; it is conceivable that traffic patterns throughout the U.S. were put out by this problem.

Think of it like a car bomb that went off without anyone dying, and you see my point.

M

Re:A few remarks by evilviper · 2004-09-21 11:19 · Score: 2, Insightful

You just can't talk about computers like you talk about machines. The analogy does not work.

If the fault was going to happen every 48 days, they should have scheduled a reboot for every 22 days at most. Just like everything else, it's insane to have a single point of failure like this.

If you know a machine needs to be rebooted regularly, there is no reason not to automate the process. Windows task scheduler should do the job quite well.

There's no reason the computer could not have reported an error, by whatever means, to an administrator when it detects it is operating in excess of its design parameters. Send a barrage of e-mails, IMs, faxes, SMS messages, etc. I can guarantee this life-or-death system would get somebody's attention, and it would be restarted as it should be.
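A minimal sketch of the kind of self-check described here, assuming the 30-day reboot schedule and the 32-bit millisecond limit discussed elsewhere in the thread. The function names and status labels are invented for illustration, and the actual notification channel (e-mail, SMS, and so on) is left out:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
ROLLOVER_MS = 2 ** 32   # 32-bit millisecond counter limit, roughly 49.7 days


def uptime_status(uptime_ms: int, reboot_interval_days: int = 30) -> str:
    """Classify uptime against the scheduled reboot interval and the hard limit."""
    scheduled_ms = reboot_interval_days * MS_PER_DAY
    if uptime_ms >= ROLLOVER_MS:
        return "FAILED"     # the counter has already wrapped
    if uptime_ms >= scheduled_ms:
        return "OVERDUE"    # a scheduled reboot was missed: alert someone now
    return "OK"


def days_until_rollover(uptime_ms: int) -> float:
    """Days of margin left before the counter wraps (never negative)."""
    return max(0, ROLLOVER_MS - uptime_ms) / MS_PER_DAY


for days in (10, 35, 50):
    print(days, uptime_status(days * MS_PER_DAY))
```

With a 30-day schedule, the "OVERDUE" window lasts almost 20 days, which is plenty of time for a barrage of messages to reach somebody.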

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Pilot or no... by juuri · 2004-09-21 11:19 · Score: 4, Insightful

... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?

Crappy design, regardless of who is at fault.

--
--- I do not moderate.

Re:If it's in the job description... by Dun+Malg · 2004-09-21 11:31 · Score: 2, Insightful

You design this sort of system _expecting_ that a reboot or two will be missed. Okay.. blame the tech if he didn't follow procedure.. but what if the reboot didn't happen because the tech's wife was in labor or if his kid got hit by a truck? You design systems thinking of the _worst_ case scenario.

You don't run a fucking air traffic control system with a "one truck" vulnerability.

Exactly. If you find a bug that requires a restart before a 49.7 day timer runs out, you are indeed an idiot if you decide a restart once a month is good enough. At the very least I'd have a tech down there on the 1st and 15th of the month, so they'd have to miss three scheduled restarts to cause this problem. Better yet, have two guys there every damn Wednesday at noon. If they both miss seven Wednesdays in a row, well, you got bigger problems than bad software. Whoever decided once a month was adequate needs to have his head handed to him.
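The "three restarts" and "seven Wednesdays" figures can be checked with a few lines of arithmetic. This sketch assumes a fixed interval between scheduled restarts (real months vary slightly) and a failure exactly at the 32-bit millisecond rollover; `misses_until_failure` is a made-up helper name:

```python
ROLLOVER_DAYS = 2 ** 32 / (24 * 60 * 60 * 1000)   # about 49.7 days


def misses_until_failure(interval_days: float) -> int:
    """Smallest number of consecutive missed restarts that lets uptime reach the rollover."""
    misses = 0
    # With k misses in a row, uptime reaches (k + 1) * interval_days
    # before the next scheduled restart comes around.
    while (misses + 1) * interval_days < ROLLOVER_DAYS:
        misses += 1
    return misses


for interval in (30, 15, 7):
    print(f"restart every {interval} days: fails after {misses_until_failure(interval)} missed restart(s)")
```

Under those assumptions a monthly schedule tolerates no missed restart at all, while the twice-monthly and weekly schedules match the numbers given above.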

--
If a job's not worth doing, it's not worth doing right.

Fire the Department of the Interior's IT staff... by Dr.Dubious+DDQ · 2004-09-21 11:55 · Score: 4, Insightful

The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)

Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...

(Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)

--
Hacker Public Radio is our Friend

Re:"Upgrade"? by upsidedown_duck · 2004-09-21 12:09 · Score: 3, Insightful

It depends on how bad their previous UNIX system was. Any operating system can be neglected into oblivion. Also, if they got all new hardware to run Windows 2000, when the old hardware might have been ten-year-old 50MHz SMP boxes, then upgrade would be the right term. It's unfortunate that they didn't decide to upgrade to faster UNIX boxes, but that's politics for you.

--
-- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak

Lol, only on Slashdot by jayhawk88 · 2004-09-21 12:12 · Score: 3, Insightful

I don't think blame should be assigned to the technician who missed the task...

Boss: OK Tech, it's your job to see to it this computer is rebooted monthly.
Tech: Will do Boss!
*Time Passes, System Crashes*
Boss: The system crashed, why is that?
Tech: Well, it's because I didn't reboot the system like I should have.
Boss: Oh well, I guess it's not your fault, obviously I failed to realize maximum security synergy in my systems.

Wherever the submitter works, I wanna get a job there!

Re:Lol, only on Slashdot by reverius · 2004-09-21 18:05 · Score: 2, Insightful

it's the boss' fault for making a task like that necessary in the first place.

if i design a system in which someone has to press a button every 12 hours or the world blows up, would anyone want to use that system? no, you think? what if you could -order someone who works below you- to do it!?

that's just plain stupid management. the rebooting job is a waste of the tech's time (anyone competent could make it reboot automatically) and a completely unnecessary job (any competent operating system doesn't need to be rebooted every 30 days, or even every 3 years).

If the boss had scheduled maintenance (Windows Update, to get service pack 4) or had used an operating system that doesn't require that much maintenance to function correctly, the job wouldn't have needed to be performed.

the boss should be fired for general incompetence/negligence (since he had the responsibility to make the system stable), and the tech should be put to work carrying boxes or something (or just fired as well), since he isn't competent enough to put an automatic timer on the rebooting.

Re:no such thing as a Windows 2000 49.7 day bug by Ahnteis · 2004-09-21 13:05 · Score: 2, Insightful

"There is nothing Microsoft could do to prevent this."

But this is slashdot so we won't let little things like facts get in the way of a good MS bashing session.

Re:no such thing as a Windows 2000 49.7 day bug by SuiteSisterMary · 2004-09-21 13:29 · Score: 2, Insightful

Nonsense. That would be like saying 'warning: you're taking a step, and might trip.'

Typing naught but 'GetTickCount()' into Google lands me right on the MSDN page, which clearly says:

The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

and goes on to suggest alternative timing capabilities.

This was a major fuckup by the application programmers, incorrectly using a clearly defined API call.
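For illustration, here is one common way to measure an interval across a 32-bit rollover: take the difference modulo 2^32, which is what unsigned 32-bit subtraction does. This is a Python model with made-up function names, not necessarily the alternative the MSDN page recommends, and it only works for intervals shorter than one full rollover period:

```python
TICK_MODULUS = 2 ** 32   # size of a 32-bit counter


def elapsed_naive(start: int, now: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - start


def elapsed_wrap_safe(start: int, now: int) -> int:
    """Difference modulo 2**32, mirroring unsigned 32-bit subtraction."""
    return (now - start) % TICK_MODULUS


# Start 500 ms before the rollover, then let 1500 ms pass.
start = TICK_MODULUS - 500
now = (start + 1500) % TICK_MODULUS

print(elapsed_naive(start, now))
print(elapsed_wrap_safe(start, now))
```

The naive version reports a huge negative interval for the same pair of readings that the modular version handles correctly.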

--
Vintage computer games and RPG books available. Email me if you're interested.

Maintenance. by Zebra_X · 2004-09-21 13:41 · Score: 3, Insightful

it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.

Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?

Critical System + Maintenance = Must Be Done.

The system was designed and set up in a particular manner. In fact, the reboot rule was added to the design of the system so that this very thing would not happen.

Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.

The discussion of whether the procedure of rebooting a machine every month is inane is something different.

Why is it that.... by jwcorder · 2004-09-21 13:47 · Score: 2, Insightful

no one has put the blame where it belongs....on the system admin. We can have a shit throwing contest all day about whether or not this is MS's fault. But the fact remains that the problem was addressed and fixed in SP 4 for Win 2000.

If the system had been updated the problem would not have occurred. How is this a microsoft problem? They cannot force system maintenance.

--
http://jayceecorder.blogspot.com

MTBF - what boggles my mind by apikoros · 2004-09-21 14:08 · Score: 2, Insightful

Forgetting all the talk about Microsoft and Win95/98 and the defect in the OS that has been well known for years and for which a patch has also been available for years....

If you have a system that has a known failure point at 49 days, when do you perform the mandatory reset?

For the failure that is described the scheduled reset must have been "every 30 days" which is, frankly, INSANE!

If they had scheduled a mandatory reset every 14 or 15 days, they would have had to have had three failures before disaster struck. As it seems, one failure was all it took.

I wanted to say stupid - I say ??? by tuomoks · 2004-09-21 14:21 · Score: 2, Insightful

I started to write a long comment, but there's no point; unfortunately this is the way things are today. Trust me: the more computer system decisions are made at the manager level instead of by people who know how to build systems, the worse it gets. It wasn't always this way. Compare the financial and manufacturing systems that have been running for years to what we do today. Any questions? Some of my old systems from the 70's are still running; none of my new systems can stay up more than 10-12 months, AND I was told to build them that way. And no, complexity is not an excuse: CAD systems, CRM, protocols, and worldwide networks for finance, airlines, etc. have been around since the early 70's. Just don't give up; maybe some day (after my time...). And let's forget Windows vs. *nix. It is more difficult to build reliable systems on Windows, but it can be done. Windows is just more primitive: you have to design and code at a lower level. It is harder than *nix, but so what?

Which UNIX is that? by mangu · 2004-09-21 15:49 · Score: 2, Insightful

Unix systems are often installed with the instruction that they get reboots regularly.

In 25 years working with Unix systems, I've never seen that instruction. That must be because I've never worked with any Microsoft Unix system...

Re:Space Shuttle accidents and software bugs by GlassHeart · 2004-09-21 17:13 · Score: 4, Insightful

The only regret you'll have from paying for too much quality is the money. You'll have everything to regret from spending on too little quality.

That's a nice thing for a professor to advocate, but real world projects like the space shuttle do not have an infinite budget to accomplish the assigned task. Therefore, spending too much money on one aspect can mean that another is sacrificed and becomes the point of failure. Therefore, while being responsible for the part that never failed is an understandable source of pride, it may actually reveal a misallocation of resources.

Engineering is about spending the least amount of time and money to achieve the required quality. Nobody said anything about spending too little.

Incorrect. by jwigum · 2004-09-21 18:39 · Score: 5, Insightful

Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.

If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.

My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.

Patches, do ya have 'em?

--

Look behind you...

Re:Incorrect. by fallen1 · 2004-09-22 00:59 · Score: 2, Insightful

Quote: My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.
Well, yes, this may be true BUT Microsoft patches are _notorious_ for breaking as many, if not more, things than they fix. How long can a critical system such as this one stay down for "routine" maintenance? WHEN would the breaks introduced by the patches show up? In the middle of routing 20 or more airplanes in the airspace around LAX?
Although the specific bug had a patch, perhaps this was a case of "do we patch and pray OR do we reboot monthly?"
*shrug* Maybe the heads of the department overrode the IT personnel and instead of paying the money to patch and test they told them to just reboot the system? No, I didn't RTFA but who knows exactly what went down? The department heads are all in a CYA mode right now and the "truth" may never be known.

--
Dream as if you'll live forever.
Live as if you'll die tomorrow.
~Anonymous~

Re:maintenance task (yyeahhh, rrriiiiighht) by timerider · 2004-09-21 19:14 · Score: 2, Insightful

it WAS a human error... i mean, it must have been some form of human life form who decided to use windows for those systems...

Every coin has two sides by babybird · 2004-09-21 19:54 · Score: 2, Insightful

By that same logic, doesn't a Windows user "Start" the shutdown procedure?

And if you don't want to go to the "Start" button in Windows to shut it down, you could always hit ctrl-alt-del and click shutdown. Or press the power button if you have power management enabled in the BIOS. I don't really see a fundamental difference between the two; it's just semantics really.

When I first started using Linux, one of the things that baffled me for hours until I could ask someone who knew Linux was how the heck do you rename a file?? I searched and searched for anything resembling a rename command and found nothing. It never occurred to me that you might use the move command to rename a file by essentially just "moving" the file to a new filename. That's at least as illogical (to me and every newbie I've ever known) as clicking Start to Shutdown for someone who isn't familiar with the idiosyncrasies of a particular operating system.

--
Keith D.

you ass by RMH101 · 2004-09-21 20:54 · Score: 2, Insightful

big projects don't work like this. if you find a bug mid testing, then you don't throw the whole thing back at the vendor and chuck the baby out with the bathwater; you simply cannot organise big projects like this. you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system. if you have accepted it on this basis and don't do what you said you would when you signed the constraint off, it's your problem. yes, the vendor shouldn't sell buggy software, but *all* software has *some* bugs in it.
