Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.
/sarcasm
Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer!
a technician didn't reboot the system monthly as he should have
You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)
The dangers of knowledge trigger emotional distress in human beings.
It's obviously lunacy for any company to replace a proven system, which has given years of reliable service, with some piece of trash that crashes if left running for over a month. That said, I was under the impression that a simple "at" job could be used on a Windows machine to run a script periodically (at is similar to cron, except far less capable, of course). Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.
We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).
Code, Hardware, stuff like that.
Have they never thought to just schedule an event to reboot the computer every 30 days?
Don't use this stuff in mission-critical applications.
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
"Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.
Use Ctrl-C instead of ESC in Vim!
....all in-flight movies are played on Windows Media Player.
If you think
But most off-the-shelf software has disclaimers expressly stating it is not to be used in mission-critical situations. E.g.:
"technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."
-- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.
I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?
But I'm going to.
It's M$'s fault. Why do I hate to say it? Because it'll just be seen as more anti-MS crap from another /.er.
All I have to say is if the shoe fits, wear it.
In this individual case a PHB made a decision to scrap the old, stable OS to a new, known-to-be-unstable OS. That screams PHB.
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.
This mess was big enough that there's a large enough supply of blame to give some to everybody involved.
- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor schmuck.
Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.
Why did they move from Unix to Windows in the first place? And why should a bug from Win95 crash a migrated Win2K?
How sad that such a sprawling metropolis of commerce and travel can be brought to its knees by the magic that is Windows.
Color me surprised.
Read the only personal Runyon page out there.
upgrade from Unix to Windows
AKA, "The PHB Special"
Of course, the guy who was supposed to reboot the box will get all the blame. Shit rolls downhill.
... I can think of no one else to fault *BUT* the technician. The IT guys know full well that this "quirk" exists, and in fact, part of their planning and maintenance involved resetting the machine in order to get around this potential problem. These guys did not complete their job duties, and as such, the system went down.
How can you intimate blaming the software company here?
- DaftShadow
To the rescue!
http://www.nbc.com/LAX/
-- The Funk, The Whole Funk, And Nothing But The Funk
The newspaper said that a Microsoft-based replacement for an older Unix system needed to be reset every thirty days 'to prevent data overload', as a result of problems found when the system was first rolled out. However, a technician failed to perform the reset at the right time and an internal clock within the system subsequently shut it down. A back-up system also failed
Guess there was a backup, I feel for that guy.
"This happened after an upgrade from Unix to Windows."
That's the funniest thing I heard all day. Windows is an upgrade from unix. I almost choked on my coffee.
...why did they switch to Windows in the first place?
US businesses that currently accept chip and PIN/signature
It is human error: those bugs didn't write themselves. Nor did the operations protocol that required "rebooting LAX" every 49.69(!) days. Nor did the upgrade procedure that ignored that bottleneck. Nor did the upgrade decision that moved from Unix to Windows. Those were all human errors, as was the decision to keep a job at LAX that would face blame for shutting down the airport (or risking lives) if the reboot was missed, or unsuccessful.
"Not I," says the referee,
"Don't point your finger at me.
I could've stopped it in the eighth
An' maybe kept him from his fate,
But the crowd would've booed, I'm sure,
At not gettin' their money's worth.
It's too bad he had to go,
But there was a pressure on me too, you know.
It wasn't me that made him fall.
No, you can't blame me at all."
- Bob Dylan, "Who Killed Davey Moore?"
--
make install -not war
sleep 4294080 /s
shutdown
I remember when the 49.7 day bug was discovered. That was right after I had just hit the 49.7 day freeze in an attempt to keep my personal machine alive as long as possible.
When it froze, I didn't know why until I read the story, just figured it finally gave up the ghost for no real reason. It was time for a reboot anyway, that system was hurtin' bad.
Why the hell they have a critical system running on an OS that can't stay up for at least 50 days, I do not know.
"With sufficient thrust, pigs fly just fine." -- RFC 1925
There's no conceivable reason not to. How do you justify your money going to a company that keeps the source to itself?
You paid for it with your taxes - you own it. Demand open source at ALL government levels.
the major advances in civilization are processes which all but wreck the societies in which they occur - A.N. White
MS lawyer: "It all worked in the flight2000 simulator? We always rebooted after every crash and everytime it was OK afterwards?"
It would suck if all the radios shut down in the middle of an emergency landing. Better to have it manual.
Of course the technician was blamed - if not, some CIO-type in charge would have had to take it, and he wouldn't allow that to happen. It always runs downhill...
Because there are 4294080000 milliseconds in that time period. Just enough to cause a roll-over when using a 32 bit counter (and yes, 49.7 is an approximate value).
Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.
Wanted: witty unique signature. Must be willing to relocate.
...keep in mind that we have established numerous times that windows is not suitable for systems that need reliability and stability. It is not the operating system's fault that this happened, it is the FAA's for choosing to use it instead of considering the better alternatives. If you get run over on a bicycle while riding on the highway, don't blame the bike.
Quick addition: it seems that the fault does not belong entirely to windows, but rather a combination of the software running on it and the system architecture.
With that said, Windows could stand to improve a lot. It has too many bugs, too many flaws, and so on. And it definitely does not have a stable, secure, reliable base. So don't expect it to.
So I installed Linux.
Fight Spammers!
From the submission
possibly related to an old Windows 95 bug
From the Article.
The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.
The shutdown is not a crash but a scheduled event to bring the servers down to flush data.
So it does not seem to be a problem with Windows (Ok now I get marked as troll) but with the FAA's own software.
Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Silly IT departments.
If you "upgrade" a piece of software, then discover it requires a complete manual system restart to remain stable, the prudent thing to do in any other circumstance would be a rollback.
Unfortunately, since this is an IT department, it must run Windows; after all, where could you ever find support for Linux?
It's only an insult if it's not true.
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999
Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.
Maybe they left off a percent sign?
Do you have ESP?
Surely the simplest 'fudge' to fix this problem is
to write a script that beeps loudly every 10 mins
or some other (read: more sensible) notification
after the system uptime exceeds 30 or so days?
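A rough sketch of that kind of reminder, in Python. Everything in it is an assumption for illustration: the 30-day threshold is the one from the comment above, the function names are made up, and how the uptime is actually read (that part is platform-specific) is left to the caller.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def reboot_overdue(uptime_seconds, threshold_days=30):
    """Return True once uptime has passed the threshold (hypothetical helper)."""
    return uptime_seconds > threshold_days * SECONDS_PER_DAY

def reminder(uptime_seconds, threshold_days=30):
    """Return a warning string when a reboot is overdue, otherwise an empty string."""
    if not reboot_overdue(uptime_seconds, threshold_days):
        return ""
    days_up = uptime_seconds / SECONDS_PER_DAY
    return "WARNING: up %.1f days, reboot was due at %d days" % (days_up, threshold_days)

# Example with a made-up uptime of 31 days.
print(reminder(31 * SECONDS_PER_DAY))
```

Run from a scheduler every 10 minutes or so, something like this would at least nag before the 49.7-day limit arrives. It only hides the problem, of course.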
But seriously, if it's running windows it's not the
monthly reboots they need to worry about, it's the
quarterly format/reinstall procedure that's required
for stable operation.
I don't think I've had a stable (home) windows install for
more than 6 months without reinstall, but maybe I'm
pushing my luck by actually USING the computer.
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19
I wrote a VB program years ago for the Win95 to solve this problem. I just had the scheduler run my program that rebooted the system for me.
Umm.... Duh
You say things that offend me and I can deal with it. Can you?
I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?
cheap labor conservatives - they want to keep you hungry enough to be thankful for minimum wage.
I was sitting in Atlanta-Hartsfield for an extra 70 minutes thanks to that bastard.
There is a difference between "insightful" and "inciteful" other than spelling.
I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.
Unknown host pong.
The employee missed the maintenance window. If you forget to do something that is a part of your job, I would have to suggest that you are responsible for the consequences. Now, does placing the employee in such a situation apply some burden of responsibility upon higher-ups? Certainly. But, the employee should be held responsible...ESPECIALLY if the importance of the maintenance was made clear.
Since when is going from Unix to Windows an upgrade?
Mod Wisely.
Was the flaw left unfixed for too long because they did not have access to the source code? Or was it because it was too expensive? If this is such a critical system that it can cause loss of life (on a massive scale, no less), the root cause should have been fixed, rather than the workaround. I remember reading somewhere that this flaw has now been fixed. Smells like a managerial issue within the FAA, not just a technician problem. Remember NASA and the space shuttles?
If I recall, doesn't MS have something that absolves them of any liability listed towards the end of the license agreement. Something along the lines of, "Do not use in mission critical places." Or was it more like do not install in missile silos or nuclear facilities, something like that right? Someone correct me. If I am right about the license agreement, that was stupid of LAX to have been suckered into switching from UNIX to M$. Oh wait, I forgot, everything works better on MS products right? That's why we have many security/virus/worm/bug/whatever flaws. What a great product Bill!
I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
Actually, I do. You've got a job; you've got deadlines. Do the work.
In the future, I would want to not be isolated from my friends in the Space Station.
OK, I know it's violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:
This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.
By the very nature of the system, the blame can fall on no one other than the maintenance personnel. Otherwise the PHB that authorized the "upgrade," and the system that put the said PHB in the position to authorize said "upgrade" would look incompetent and foolish, and we can't very well have that.
ELOI, ELOI, LAMA SABACHTHANI!?
This happened after an upgrade from Unix to Windows.
Shouldn't that read downgrade instead ???
Before the torrent of "windows sucks" posts...
Too late.
Ah, the old "windows maintenance reboot" problem. It always amazes me how IT managers (hell even some techos) accept the need to re-boot their windows systems every week. At my work, the windows guys accept it as normal maintenance. If I had to reboot my AIX and z/OS systems every week there would be hell to pay. But because it's windows, it's accepted. I dunno, mediocrity is the new standard these days...........
" upgrade from Unix to Windows "
:)
Ahahahahahahahahahahahahahahahahahahahah that's the funniest thing i've read in ages
Last.fm - join the social music revolution
I believe the 49.7 days of uptime for a Windows 95 box is a new record, shattering the previous record in Norway of 27.9 days back on January through February of 2001. Congratulations!
--
http://www.aikiweb.com - AikiWeb Aikido Information
after 584542046 years. Okay, I admit... when you reach that time, you'll probably have other problems than a Win2K crash.
and office update while you're at it too...
Wouldn't want to spoil a nice MS bashing session, but I think the bug was in the ported application, not in the OS - probably someone used the wrong data type to hold timestamps somewhere within the program (win95 had the same bug) - I've seen win2k last more than 47 days without reboots...
I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.
Let me give you an overview of the failure approach of just one of those systems.
1) Everything on Unix, ruggedised releases of UNIX
2) Every box must be able to FAIL ON ITS OWN
3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.
4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.
5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.
6) 4 Years of testing of FULL system before live.
This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.
The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week); this is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.
Welcome to the US... we will be landing slightly quicker than expected.
An Eye for an Eye will make the whole world blind - Gandhi
I think it depends on what the company rep said when they convinced them to replace Unix with Windows.
If they advertised a consumer OS as an OS suitable for mission critical applications . . . then this flaw should not be in the software. It could be the software company's fault for aggressively marketing their product where it should not be.
Maybe we should throw some blame to the PHB who ordered the switch. Perhaps there was no hard sell from MS, and a PHB saw a product brochure and got a hard on to switch.
I see your point though, the tech knew about the problem and failed to do his job.
I guess my question is, should the problem have been addressed before now, or is it common practice to wait for a catastrophic success like this to occur before addressing the problem ?
Tower: Who is this General Protection, anyway? And, how did he break my computer?
This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).
2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?
Do you feel more secure?
The real "Libtards" are the Libertarians!
A system running UNIX doesn't necessarily mean it was stable. It could have all sorts of flaws in the code, hardware failures, etc.
Sure, Windows 95 in particular and Windows in general is often less stable than modern counterparts. But an upgrade from an old, obsolete UNIX to a new Windows system could have had significant benefits and made a lot of sense at the time. Without the full information behind the decision, how can you judge whether the decision was bad or not?
There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.
The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.
So the badly written Harris software has this bug and their solution (which was really not that bad of a work around) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.
There is nothing Microsoft could do to prevent this.
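For what it's worth, the usual way an application copes with a wrapping 32-bit tick counter is to subtract in modulo 2^32 arithmetic. A minimal Python sketch of that idea follows; the function name is made up, the tick values are invented, and this is not the Harris code, which none of us have seen.

```python
UINT32_MASK = 0xFFFFFFFF  # a 32-bit unsigned tick count holds 0 .. 2**32 - 1

def elapsed_ms(start_tick, now_tick):
    """Milliseconds between two 32-bit tick readings.

    Masking the difference to 32 bits gives the right answer even if the
    counter wrapped to 0 once between the two readings.
    """
    return (now_tick - start_tick) & UINT32_MASK

# Made-up readings: start 256 ms before the wrap, read again 100 ms after it.
start = 0xFFFFFF00
now = 0x00000064
print(elapsed_ms(start, now))
```

The catch is that this only works for intervals shorter than one full wrap (about 49.7 days), so it suits timeouts and intervals, not absolute timestamps.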
.. is what I'm going to consider this for the time being. I've seen it reported everywhere, but it's just too absurd to take at face value.
Belief is the currency of delusion.
Hey, I submitted this two days ago. What makes it slashdot worthy now?
Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?
Where does it say that this was due to the Win95/Win98 bug? (If I missed something, please let me know.) Just because it happens to be the same amount of time as the Win95 bug doesn't mean it is the same bug. The bug was never present in Windows 2000, AFAIK. And in any case, there's a reason why 49.7 is a "magic number" for uptime (hint: how many milliseconds are there in 49.7 days?), just as there was a reason why "2000" was a magic number for date problems and why 2038 will be another magic number for date problems.
Just because it runs on (OS) and just because it crashes doesn't mean it is (OS vendor)'s fault. In this case, you certainly can't blame Microsoft: there was a problem in the radio software, the software developers knew about it, the maintenance staff knew about it, it didn't get fixed, and it caused a problem. Where does Microsoft fit into that?
Time flies like an arrow. Fruit flies like a banana.
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless boneheaded, sweeping decisions, are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
Computer Science is Applied Philosophy
You know, if strict product liability were applied to Microsoft, they'd be paying big time.
What those who want activist courts fear is rule by the people.
Frank Stallone.
You know what?
Guys, most of the equipment in use by the FAA isn't new enough to run Windows 2000. I worked on the "state of the art" search radar, and it was built around Sun Ultra 5s.
It's good to use your head, but not as a battering ram.
+1 Funny
-><- no
The shutdown wasn't the problem, or more appropriately, the shutdown that would have prevented the problem was missed. But also the FAA's software probably has some issues of its own that need to be fixed.
On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.
Number of milliseconds in 49.7 days:
60*60*24*49.7 * 1000 = 4,294,080,000
which just about overflows uint32.
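To put a number on "just about", here is the same arithmetic as a small Python sketch. The constant and function names are mine; the only assumption is a counter that ticks once per millisecond and holds 2^32 distinct values.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day
UINT32_RANGE = 2 ** 32            # values 0 .. 4,294,967,295

def days_until_rollover(counter_range=UINT32_RANGE):
    """Days a 1 ms tick counter with `counter_range` values runs before wrapping to 0."""
    return counter_range / MS_PER_DAY

# Milliseconds in the rounded figure of 49.7 days.
approx_ms = round(49.7 * MS_PER_DAY)

# How far short of the 32-bit limit the rounded figure falls.
headroom_ms = UINT32_RANGE - approx_ms

print(approx_ms, headroom_ms, days_until_rollover())
```

So 49.7 days lands a little under 15 minutes short of the actual wrap, which comes at roughly 49.71 days.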
"You mortals are so obtuse." -Q
That information had been filtered at least three times, can't count on that either...
Software analyst -> LA Times reporter -> TechWorld reporter.
http://msdn.microsoft.com/library/default.asp?url= /library/en-us/sysinfo/base/gettickcount.asp
Sounds like who ever wrote the software/OS module they were relying on used this gem. I hereby dub who soever was so silly as to do this as a 'code monkey, first class'.
putting the 'B' in LGBTQ+
Having to shut down a system to maintain its uptime is first a ridiculous idea.
Second, it took several years to find that bug because most windows machines never made it to that 49.7 days and if they did the users just assumed it was normal because it is considered normal for windows to "lock up", freeze or whatever.
Third, replacing unix, known for its stability, with any variant of windows (known for instability) in a system where people's lives are at stake and then having this happen, the guys at LAX who decided to do this should be fired because they just risked a lot of lives and caused massive delays for travellers. In a political situation they would have to resign.
I remember a similar story about an Aegis-class cruiser stuck out in the ocean for three days because they decided to use windows. "Yea, that will work great during a war.."
*sigh* Microsoft has good lobby power and hires a fleet of sales people to keep selling their shod-ware that really should just be kept to mom and pop living rooms.
But then, this is the opinion of a guy who works only with linux and is sitting on an uptime on an openmosix cluster-leader (that also is my dev box) that looks like this:
19:03:06 up 319 days, 5:20, 3 users, load average: 1.28, 0.73, 0.37
eat your heart out LAX.. you got punk'd
anime+manga together at last.. in real time.
I really wish Microsoft would go out of business, quickly and quietly. I don't hate Bill Gates, I don't hate Windows, I'm just tired of hearing everyone bitch about them so much.
No one cares what your captcha was
Houston TX, USA
giant advertisement:
I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.(I'm not making this up- that's really the ad I'm seeing.)
314-15-9265
Say no to software patents.
One of the things that is delightfully unambiguous is the naval tradition.
If the ship trades paint with anything, it's the Commanding Officer's fault. Yeah, some shrapnel may work its way down the organization chart, but the glory and the gory both rest on one neck...
Would that less time were spent on blamesmanship in our decadent, modern day...
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
so what if it is "completely different os"? that's the whole point, if it were continuation of the win95 line it would have been fixed!
now the bug was present in both codebases, but fixed just in one.
that's at least how the article and the writeup make it sound like.
world was created 5 seconds before this post as it is.
... should be:
"Microsoft: Writing the software to prevent SkyNet since 1981."
Do not look into laser with remaining eye.
And I hardly see how the Windows 95 bug is relevant to this issue as that clearly isn't what caused the shutdown.
Editors please learn how to do your fucking jobs and reject crap like this. Just because it bashes MS doesn't mean it's newsworthy.
Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.
I'm pretty sure the union official who says that it requires a reboot to avoid 'data overload' misspoke, and meant 'data overflow'. 49.7 days is 2^32 milliseconds.
What you may not have taken the time to observe is that when you run init with a name of telinit or with a process ID other than 1 it runs in 'telinit' mode. In this mode it passes a message via /dev/initctl (a FIFO) to tell the running copy of 'init' (the process responsible for initialising services and managing them thereafter) to perform a specific action (eg shutdown, reboot... etc)
The article at: http://www.techworld.com/opsys/news/index.cfm?News ID=2275
has a headline: Microsoft server crash nearly causes 800-plane pile-up. And next to it you'll see a Microsoft advertisement that says: Make a name for yourself with Windows server systems
And I guess the FAA did just that too.
I don't think switching from Unix to Windows can be considered an "upgrade."
This sounds like more Microsoft FUD to me. But I might be wrong because I like to use Unix/Linux and therefore my opinion is suspect.
Whoever approved this process of manually rebooting a machine should be at fault. The fact that it was a windows operating system, or a unix OS or a purple OS is irrelevant. The problem here is someone thought a valid solution was to reboot a machine once a month.
Oh. Good. Lord.
There's just so much wrong with this picture. At least they picked the version of Windows least likely to flake out.
(Personal nightmare: finding a Windows computer running your life support)
1) this is not a windows OS bug
GetTickCount() will rollover. An _application_ which assumes it is a strictly increasing value will misbehave after the 40 some odd days expire. That appears to be what is happening here.
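A toy illustration of that failure mode, in Python. This is a guess at the general shape of such a bug, not the real application; `tick_at` is a hypothetical stand-in for a 32-bit millisecond counter like GetTickCount(), and the staleness check is invented for the example.

```python
UINT32_RANGE = 2 ** 32

def tick_at(ms_since_boot):
    """Simulated 32-bit tick counter: wraps to 0 every 2**32 ms (hypothetical stand-in)."""
    return ms_since_boot % UINT32_RANGE

def naive_is_stale(last_update_tick, now_tick, timeout_ms):
    """Buggy check that assumes the tick value only ever increases."""
    return now_tick - last_update_tick > timeout_ms

# Well before the wrap: 10 s since the last update, 5 s timeout. Works.
before = naive_is_stale(tick_at(1_000), tick_at(11_000), 5_000)

# Across the wrap: still 10 s since the last update, but the counter has
# restarted from 0, so the difference goes negative and the check never fires.
across = naive_is_stale(tick_at(UINT32_RANGE - 1_000), tick_at(UINT32_RANGE + 9_000), 5_000)

print(before, across)
```

In a sketch like this the data looks fresh forever after the wrap, which is the sort of thing you would not want feeding a controller's display.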
Note that nowhere in the article is there a distinction between the "system" and the "OS" or the "application".
2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing. Aircraft engines are serviced at hour based intervals, whether they need it or not. It's better to just tear the thing down and rebuild it than to have it tear itself apart. Software doesn't _have_ to be this way, but it sometimes is.
Making a complete hardware -> app layer stack 100% failsafe is.. tricky. For some applications, designing the system with a known restart point.. i.e. a reboot of the app or the entire machine, can be more cost effective.. (see earlier the paper on crash-only software design)..a periodic shutdown/restart in complicated systems can be a valid operational practice.
The fault here is two fold - one, the application/system had a known issue that is probably avoidable, but for whatever reasons, it still has the issue.
Two, knowing that the issue existed, the proper maintenance was not observed, with the expected result - a failure.
Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.
If the system called for a 48th day restart, that's what it requires, and deviation from that has consequences. Luckily no one was hurt.
My opinions are my own, and do not necessarily represent those of my employer.
Windows NT has a disclaimer in the license agreement stating that it should not be used in critical job roles like nuclear reactor control, etc.
Maybe they need to update the list. I would suggest everything except their Mine Sweep game.
The race isn't always to the swift... but that's the way to bet!
No, the OP is using something called "inference". In fact, I am inferring that the OP was inferring that the journalist reporting the article either doesn't understand the 32 bit rollover problem or does not want to report all the details required to describe the 32 bit rollover problem.
It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.
As you can read in the OP, he questions "could this be?", not "this is".
Suggest you pull your head out.
I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.
Yes, it really is. They had a system in place which they chose, knowing its deficiencies. To combat one of the deficiencies, they prescribed a procedure to be followed monthly. The procedure was not followed by the technician, so it was human error.
Would you expect your car to run flawlessly if you never put gas in it or changed the oil on a regular basis? If you didn't, whose fault is it? The car's or yours?
As many others have pointed out here, it's the same bug that brought down Windows 9x reappearing.
Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.
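To put numbers on that 49.7-day limit, here is a rough sketch of the arithmetic. The function name and the tick rates are assumptions for illustration; nothing here is taken from the actual FAA or Harris software. A counter that ticks once per millisecond gives the 49.7-day figure, while one that ticks once per microsecond would wrap in a little over an hour.

```python
# Sketch: how long an unsigned counter lasts before wrapping to zero.
# The tick rates are assumptions for illustration, not taken from the FAA system.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until_rollover(bits: int, ticks_per_ms: int = 1) -> float:
    """Days before an unsigned counter of `bits` bits overflows."""
    return (2 ** bits) / ticks_per_ms / MS_PER_DAY


ms_days = days_until_rollover(32)                     # one tick per millisecond
us_minutes = days_until_rollover(32, 1000) * 24 * 60  # one tick per microsecond

print(f"32-bit millisecond counter: {ms_days:.1f} days")
print(f"32-bit microsecond counter: {us_minutes:.1f} minutes")
```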
If two planes had crashed as a result of the communication loss, I think that the resulting lawsuits, both criminal and civil, against the FAA, Harris and Microsoft would have been large enough to possibly cripple the latter two.
I used to have to reboot our NT Servers due to memory leaks once a month. Although this problem seems related to the application software rather than Win2k, I really have to ask myself what the fucking hell Windows, any version, is doing in a life-critical computing environment. Is Windows even licenced for operation in such areas???? And I'm not saying Linux is better, but there are OSes around that ARE licenced for such operation (Tru64 if I'm not mistaken).
And the fact that the system had to be regularly rebooted, and was actually used in the field although this was known, is simply pathetic. Add to that the fact that they couldn't even automate the reboot, and it smacks of gross incompetence.
Whoah! 7 nines uptime!
About 3 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
Norman Cook's Ode to Sl
PsShutdown with Task Scheduler would have been enough. I honestly don't think M$ should be held entirely responsible. Any f00l could have set this up.
BTW, going from UNIX to Windows is more of a migration, not necessarily an upgrade.
You need people like me so you can point your fuckin fingers and say, "That's the bad guy." So what that make you? Good?
While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code, not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
You don't blame the bike, you blame the person trying to use a grossly inappropriate tool.
Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:
That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.
Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.
Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
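For anyone who wants to check that figure, a quick sketch of the arithmetic, assuming an unsigned 64-bit counter ticking once per millisecond and 365-day years (the variable names are mine):

```python
# Sketch: capacity of an unsigned 64-bit millisecond counter.
# Assumes one tick per millisecond and 365-day years (leap days ignored).
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000

days_64 = 2 ** 64 // MS_PER_DAY   # whole days before the counter wraps
years_64 = days_64 // 365         # whole 365-day years

print(f"{days_64:,} days, or roughly {years_64:,} years")
```

Integer division keeps the result exact; floating point would round away the last few digits at this magnitude.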
This sig intentionally left blank.
It probably should. My company uses XP Embedded for a few systems, and doesn't have any software-related problems on them. Ever. The only problems we have are when people snap off antennae that we use for the wireless connections, or something similar. There's no reason that they shouldn't be using something like this to scan baggage. It sounds like someone at O'Hare didn't do their homework.
The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doesn't make much sense to replace strong, stable XP systems that never fail.
This is great:
"The shutdown is intended to keep the system from becoming overloaded with data ... according to a software analyst..."
This "analyst" knows nothing about computers, works for Microsoft, or both.
"Windows Upgrade, FAA Error Cause LAX Shutdown"
Sounds to me like Windows causes constipation. Use moderately.
No, the FAA is responsible for maintaining the safety of that system. They failed bigtime by allowing Windows to be used for a mission-critical system. Technically, a contractor was the one who made the decision, but the final responsibility for oversight rests on the FAA.
If I were /., I'd be careful. They're getting very close to libel. To take something this serious, completely spin it around, and announce it in a public forum is just ASKING for a lawsuit. In this case, I think that /. would be fucked if MS saw this and wanted to pursue it.
I don't respond to AC's.
It's actually kind of amazing that it stayed up that long in the first place, when you think about it. Especially if the machine is doing anything at all.
Trying is the First Step to Failing --Homer Simpson
On some systems (Solaris specifically) the Linux-weaned will quickly learn that reboot or halt is NOT the command they wanted to run...
Actually the Linux-derived programs reboot, halt and poweroff do exactly that, but they first check the runlevel... if reboot detects the runlevel is not 6 or s, it will call shutdown to tell init to enter runlevel 6. If halt/poweroff detects the runlevel is not s or 0, it will call shutdown to tell init to enter runlevel 0. They are designed to do double duty: to be called at the end of rc.d scripts and for super-user usage.
You can force them to immediately shutdown or reboot without checking the runlevel by using the -f option.
Of course, the SunOS supplied binaries do not have this safety check... I'd recommend against getting used to that. Just pass the appropriate option to shutdown... (-r for reboot, -p for poweroff, halt is the default)
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
It stops running if I don't fill up the tank every 300 miles.
paintball
Holy jeez, I can't wait till we get medical equipment that's built on Windows XP Embedded!
Nurse: Uh, the respirator shut down
Tech: reboot it, everything should be fine
Nurse: ok, respirator is working again, what about the patient?
Tech: hmm... can't reboot him, huh..?
Nurse: nope, he's cold
Tech: well, at least the embedded web browser is working, maybe we can find him a family plot.. or email John Edwards!
That's every 584,942,417 years. Which is simply not going to be good enough in my book.
"It is a greater offense to steal men's labor, than their clothes"
One of the greatest ads ever to appear on Slashdot: http://a1767.g.akamai.net/v/1767/2939/30d/imageserv.adtech.de/images/Ad247098St1Sz225Sq1Id1.gif
They could have all sorts of software that requires manual steps on shutdown and restart. It happens all the time.
Whereas on a modern Linux box you could probably script most actions, on Windows it's usually not that easy - even with Windows Scripting Host, most MS shops like to keep everything "standard" or "out of the box."
- It's not the Macs I hate. It's Digg users. -
...LAX is pronounced as "laks" and means something like "too lazy to do anything". :)
- Save a tree, eat more woodpeckers
It's probably not a Microsoft problem if the system is running on NT; it uses a 64-bit time.
It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.
It _could_ be that a user-land air traffic control application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
OR
It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
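The mis-cast scenario is easy to demonstrate. Below is a minimal sketch, assuming a hypothetical application that takes a wide millisecond uptime value and stores it in an unsigned 32-bit field; the function name and the sample value are made up for illustration.

```python
# Sketch of the "mis-cast" failure mode: a 64-bit millisecond value
# stored into a 32-bit field. All names and values here are hypothetical.
UINT32_MASK = 0xFFFFFFFF


def store_as_uint32(value: int) -> int:
    """Mimic a C-style cast to an unsigned 32-bit integer (keeps the low 32 bits)."""
    return value & UINT32_MASK


uptime_ms = 2 ** 32 + 5_000           # about 49.7 days plus five seconds
stored = store_as_uint32(uptime_ms)   # the high bits are silently dropped

print(f"real uptime: {uptime_ms} ms, stored value: {stored} ms")
```

After the cast, the application sees an uptime of five seconds, so any record stamped with it looks older than records written the day before. The OS clock is fine in this scenario; only the application's copy is wrong.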
I really don't see how Microsoft could be to blame here at all...
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened. However, if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occurred at 0029:C02FDEC6', would that person please come forward to the cabin immediately?
Liberals call everyone Nazis yet they are the closest thing to it.
Unless it was SCO Unix, switching from something that works to something that doesn't is not an upgrade.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
Let this be a lesson to all the mouse-wiggling MCSEs out there who scorn the uptime of UNIX and shun the power of the command line. If you are running a critical Windows server, REBOOT EARLY and REBOOT OFTEN. Remember, REBOOT-ing is part of the job description and it has to be done. Please protect our key infrastructure and reboot your servers WEEKLY! Just because the UNIX guys get 2 years of uptime doesn't mean you can too. It just doesn't work that way.
Might I suggest this wonderful little tool: Poweroff. It's the only tool I know of which seems to be able to reliably reboot Windows boxes, even when they are crippled due to worms and/or memory leaks. It can even close running apps. Also, you can get it to work over the network with a magic packet, in case Terminal Server crashes or is too slow to use.
The main article should get flagged as troll/flamebait due to the phrase "upgrade from Unix to Windows". That wasn't an upgrade; that (as we now know) was a disaster waiting to happen. Wait until the worm of the month comes through and shuts it down. When will people learn to use the RIGHT TOOL FOR THE JOB! If it has to run 24x7 forever, don't put it on Windows. Geez...
The post you are replying to says 0.26 seconds. You boot in 24 seconds. That's about 92 times longer than the allotted time...
warning: This post is likely to contain gobs of dripping sarcasm. Consume at your own risk.
A system whose application (not the OS) failed after a finite time was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.
I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
Don't deploy flawed software.
Make sure redundant systems work.
As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?
Learn to love Alaska
Where do you want to land today?
"Eve of Destruction", it's not just for old hippies anymore...
...that it's not Microsoft's fault.
Here's what happened:
The FAA installed a new system. There were bugs in that system, in the custom software the FAA uses to move planes around the sky. Instead of fixing those bugs properly (as they apparently did in Seattle), the FAA went with the quick fix of rebooting the server every month, and backed that up with a script rebooting the server automatically if it's not done manually. Then the FAA techs didn't follow the FAA's workaround procedures, and chaos resulted.
Exactly how was this Microsoft's fault? Maybe I'm wrong here, but I don't see what MS did here. And OpenSource wouldn't have solved this problem, because I really doubt that anyone is going to write FAA flight control software under an open source license.
144l. ph34r my 133t l3g4l 5k1lz!
lol CAASD@MITRE ownzors j00
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Let me see, you claim that the error was in the integration of the Windows server. Thanks for clearing that up, because the submitter wrote, "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw."
Ya, way to correct that error. It was in the integration of the Windows servers, not a Windows 2000 integration flaw.
Particularly when it's not a "previous version" at all but a completely different Operating System.
Windows 95 and Windows NT (2000/XP/2003) are not the same OS. They're completely different. They share a common API and that's about it. Blaming this on "Windows 95" makes about as much sense as blaming an application bug under FreeBSD 5.x on Slackware 1.0.
Are you stuck on Win95 or 98? My current XP box will go 30 days without a sweat, and that's under heavy use (compiling, video work, games). The only time I really need to reboot is when there's a big update released (like SP2); other than that I'm fine.
And when it comes to my servers, all of the Win2k ones stay up freaking forever. I've had my SQL/ASP abuse box (The one I use to play around with code) up for almost 470 days before the power went out.
I also had an NT4.0 PDC up for over 600 days.
Looking for hardware (Currently need: Large Etch-a-Sketch) Have one? See my journal!
Don't tell anyone...
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Right?
I mean, why bother writing a timed script if it doesn't have a failsafe?
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
A UINT32 will overflow after about 49.7 days if it contains the number of milliseconds since execution start. This is just a fact. Yes, there was a bug in Win9x where such a counter would overflow in the OS. But still, it's amazing how quick the Slashdot anti-MS zealots are to point fingers without even considering that it might have been a bug in the FAA software?!?!?
...but they couldn't script it to do an orderly shutdown? I mean what does the technician do differently that it doesn't interrupt air traffic?
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Three words: GetTickCount()
Returns the number of milliseconds since the machine was last booted.
From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.
So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).
Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.
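For what it's worth, a 32-bit tick counter doesn't have to be fatal if the application only ever looks at differences between readings. Here is a minimal sketch of wraparound-safe elapsed time, assuming unsigned 32-bit millisecond ticks and intervals shorter than one full rollover; the helper name is made up, and this is not a claim about what the FAA code did.

```python
# Sketch: measuring elapsed time across a 32-bit tick rollover.
# Assumes unsigned 32-bit millisecond ticks and intervals under 2**32 ms.
UINT32_MOD = 2 ** 32


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed milliseconds between two tick readings, correct across one wrap."""
    return (now_tick - start_tick) % UINT32_MOD


start = UINT32_MOD - 1_000   # reading taken one second before the rollover
now = 1_000                  # reading taken one second after the rollover

naive = now - start              # plain subtraction goes hugely negative
safe = elapsed_ms(start, now)    # modular subtraction gives the true interval

print(f"naive: {naive} ms, wraparound-safe: {safe} ms")
```

This only rescues interval measurements. Using the raw tick value as an absolute timestamp on a record still breaks at the rollover, which matches the kind of failure guessed at above.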
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925
That is the perfect response!
Since we are being technical about the answer, does this mean Microsoft or the software vendor qualifies as a terrorist organization?
Consider the fact that an entire airport was shut down, lives were disrupted, major economic harm was caused our airlines as a result of flights not getting out on time. LAX is a major hub that connects travelers throughout the country, it is conceivable traffic patterns throughout the U.S. were put out by this problem.
Think of it like a car bomb that went off without anyone dying, and you see my point.
M
It's far too great a coincidence that a Windows machine should halt consistently after 40-some days, and that this same bug plagued the Windows operating system.
It also happened to some Cisco routers. Should I presume that those affected IOS versions were rare Windows based IOSs?
If anything breaks, it must be Windows' fault. It could never be the application developers that made bad code. The only people that make bad code work for Microsoft.
Learn to love Alaska
I was kinda just assuming that some human interaction was required for the reboot process, such as engaging backup radar, or notifying appropriate people first (though that is just an assumption). Otherwise, yes, as other people have suggested, you could just have an automatic reboot.
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19
From Harris.com
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.
Less than one percent uptime!?!?! No wonder the thing crashed; it's supposed to do that, ALL THE TIME! Bill Gates must be proud.
Linux O Muerte!
Song Airlines' in-flight entertainment system runs Linux. The system allows the passengers to listen to MP3s, see a moving map or watch Dish Network live.
After my flight landed they rebooted the system and I saw a friendly penguin and a bunch of startup messages. I noted that they were using a non-GPLed video driver.
Why do we have to point fingers at each other after a major failure?
Mostly it's systems that are poorly planned and fail and not people. Fix the problem.
Pointing fingers makes people defensive in the long run, and raises the probability of it all happening again.
... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?
Crappy design, regardless of who is at fault.
--- I do not moderate.
an upgrade from Unix to Windows.
Yeah, maybe in upside-down crazy world...
http://www.harris.com/view_pressrelease.asp?act=lookup&pr_id=77
Weird reloading link from the Slashdot article text!
now3djp
According to this press release, VSCS offers "an operational availability of 0.9999999."
Someone check my math, but that appears to come out as 3.16 seconds annually, so their 3-hour outage burned up all their allowed downtime for the next 3,422 years.
So it should be quite safe to fly now, statistically speaking.
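Checking the math, as requested. This is a small sketch that assumes a 365.25-day year and treats the quoted availability as a fraction of total time; the function name is just for illustration.

```python
# Sketch: downtime budget implied by an availability figure of N nines.
# Assumes a 365.25-day year; other year lengths shift the last digit or two.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


seven_nines = downtime_seconds_per_year(7)   # the 0.9999999 quoted by Harris
five_nines = downtime_seconds_per_year(5)    # the usual "as good as it gets"
outage_seconds = 3 * 3600                    # the three-hour outage
years_of_budget = outage_seconds / seven_nines

print(f"seven nines: {seven_nines:.2f} s/year")
print(f"five nines:  {five_nines / 60:.2f} min/year")
print(f"a 3-hour outage uses about {years_of_budget:.0f} years of budget")
```

So the 3.16 seconds and 3,422 years figures hold up under those assumptions.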
How can you upgrade from Unix to Windows? Downgrade, perhaps; I've never heard of any upgrade like that before.
Now I have to worry about the fact that my safety (and my family's) is in the hands of incompetent Microsoft.
That sucks.
...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).
* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS
* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety-related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criterion they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!
* Retarded developers at Harris who used an API call that tracks milliseconds in a 32-bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.
* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" Windows machines in parallel with the UNIX servers for much longer. Failing that, they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and refused the upgrade (and withheld payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).
I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.
Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...
If your servers are used in any sort of business environment I would recommend rebooting them every 30 days even if it seems they don't need it.
Why? One, it's just good practice. Two, you are much more likely to apply patches or fix wonky hardware if you know you are going to take the system down anyway. Three, there are all sorts of problems that are likely to be prevented/spotted with frequent reboots. For example, hardware self-tests don't get run if the system isn't cycled periodically. Four, it lets you verify that things like failover are working properly before it becomes a problem.
It doesn't matter what the OS is either: Windows, Novell, Linux, and commercial Unix servers all benefit from periodic reboots. Even Big Iron like IBM mainframes, AS/400s, HP/Tandem servers, and Unisys A-series usually will have occasional reboots as a part of scheduled maintenance.
Happy Fun Ball is for external use only.
recommends rebooting our production AIX box at least once a month -- it serves a database only (no interactive users).
Couple of TB of disk, couple of GB of RAM (or more) and a dozen CPUs, and we have to reboot it monthly.
It's called maintenance. It is required.
Now that I've heard that the application was possibly part of the problem, I can't help but think of the large number of barely literate Windows "programmers" that are out there. What was the application written in? VB? .Net? My kids can write code with VB (it's just not GOOD code). Let's get some geeky grey haired UNIX programmer to do the job and do it right!
=-=-=-=-=-=-=-= - The Celtic - =-=-=-=-=-=-=-=
> Tell you what, can you get me new boards for an IBM RT pc? I
> highly doubt it.
I've actually dealt with IBM in the "we need support and replacement parts for legacy hardware" capacity before.
And yes, if you've bought IBM in a professional/enterprise capacity, you've also bought the support contract. And if you've bought the support contract (And if you didn't, you deserve to be fired. Why the hell would you pay the IBM premium except for their support?), you can get parts and expert support for damn near everything IBM's ever made; all the way back to card punches/readers, and farther I'd bet. Remember, when you buy IBM, you're buying a MTBF of thirty YEARS.
cya,
john
Imagine all the people...
I didn't even know Windows kept track of uptime.
Obviously since uptime is stored in a 32 bit integer, Microsoft themselves never expected Windows to reach 50 days of uptime. Kinda telling isn't it.
I'm sure this has been brought up before, but why not bring a suit against M$ for selling a defective product? What makes bugs in their product any different than a car whose wheels fall off because of faulty lug nuts?
Loading...
I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.
:P
It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.
In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.
I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use.
burnin
The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)
Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...
(Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)
Hacker Public Radio is our Friend
IW4M.
Got time? Spend some of it coding or testing
If it went down for three hours, now it's got to run for 3400 years in order to make up the claimed operational availability:
"The Harris-developed VSCS - based on independent, distributed processors and switches - allows air traffic controllers to establish all air-to-ground and ground-to-ground communications with pilots and other air traffic controllers. The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999."
This is a problem that goes deeper than Windows vs. Unix, it has little to do with the operating system or the hardware and even the application has little to do with it.
If you must assign blame, you should probably point fingers at the people who spec'd the system out and perhaps submitted to the cost-constraints demanded by the bean counters.
Any system that lives depend on needs to have fail-safe features and redundancy built into it, and a completely separate fall-back procedure that can be implemented at a moment's notice.
It can almost be assumed that something will go wrong and when it does, the equipment needs to be able to handle it as transparently as possible, otherwise you can just about count on human error to make matters worse.
These kinds of systems can end up being very expensive to build. It is very tempting to remove what can be seen as "bells and whistles" from the package to save money.
Unfortunately, these bells and whistles are designed and intended to save lives.
I honestly don't know if that is what happened in this case, but I've seen it before where lives were at stake. It is another version of the low-bidder syndrome.
I don't think blame should be assigned to the technician who missed the task...
Boss: OK Tech, it's your job to see to it this computer is rebooted monthly.
Tech: Will do Boss!
*Time Passes, System Crashes*
Boss: The system crashed, why is that?
Tech: Well, it's because I didn't reboot the system like I should have.
Boss: Oh well, I guess it's not your fault, obviously I failed to realize maximum security synergy in my systems.
Wherever the submitter works, I wanna get a job there!
...try working with someone who describes the system on his machine as "Word" and complains about the (boilerplate) fax template from MS-Word not being present in OpenOffice as being one of the most important "failings" in the system. I kid you not.
From the sound of it, that's not far from the territory the GPP is in.
Got time? Spend some of it coding or testing
The radar and the guidance system had separate clocks, and they'd drift out of sync.
Here's a detailed analysis by the General Accounting Office.
The reboot was to reset the logic flaw in the MS system timer. Read my post here on it. It has affected other MS-made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect programmers not employed by MS to get affected too, since they do not have the same level of access to the proprietary OS.
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
WTF is the FAA?
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Is it still called an upgrade when it flops so badly?
You're almost entirely right. NTLM2 is a separate protocol from Kerberos, though. It's used by downlevel clients who can't speak Kerberos, and is also used by your DC's when you're not running in Windows 2000/2003 Native Mode.
OpenLDAP and AD's LDAP crap are both based on the original UMich LDAP code.
Which is an improvement - albeit a questionable one in this case....
I am very small, utmostly microscopic.
This happened after an upgrade from Unix to Windows.
I did not know that this could be considered as an upgrade........
Havin' it large, livin' the life, Welcome to the land of the rising sun.
windows 2000 can stay up for more then 232 milliseconds, but software that depends on GetTickCount() being correct can't. That's probably what happened. They could have rewritten the software to use a 64 bit time variable, or they could have worked around the bug.
They didn't, and that caused the crash. Not "buggy windows".
The fact that they couldn't even figure out how to run a scheduled task in Windows to reboot the machine is just pathetic, and shows how incompetent they really are.
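For what the "64-bit time variable" workaround might look like, here is a minimal sketch in Python. It is not the LAX software and not the Windows API: `ExtendedTickCounter` and `update` are hypothetical names, and the sketch assumes the raw counter is read at least once per wrap period (about 49.7 days), otherwise a rollover would go unnoticed.

```python
TICK_MODULUS = 2**32  # a 32-bit millisecond counter wraps at this value


class ExtendedTickCounter:
    """Widen a wrapping 32-bit millisecond counter into an ever-growing count.

    Hypothetical helper for illustration. It must see a reading at least
    once per wrap period, or a rollover will be missed.
    """

    def __init__(self):
        self._wraps = 0      # how many rollovers have been observed
        self._last_raw = 0   # previous raw 32-bit reading

    def update(self, raw_ticks):
        """Feed in the latest raw reading and get the widened value back."""
        raw_ticks %= TICK_MODULUS
        if raw_ticks < self._last_raw:
            # The raw value went backwards, so the counter rolled over.
            self._wraps += 1
        self._last_raw = raw_ticks
        return self._wraps * TICK_MODULUS + raw_ticks


counter = ExtendedTickCounter()
before = counter.update(4294967000)  # just before the rollover
after = counter.update(200)          # just after the rollover
print(after - before)                # 496 ms elapsed, not a huge negative number
```

The widened value keeps increasing across the rollover, so code that compares timestamps no longer cares about the 49.7-day boundary.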
autopr0n is like, down and stuff.
I shouldn't bother even replying, but...
An NT machine with uptime > 5 years is perfectly possible. WinNT 4.0, 5.0, 5.1, 5.2 (that's Windows NT4/2k/XP/2k+3) are not that bad, and keep on getting better. I'd even say that 2k and 2k+3 are good. It's true what MS say about most crashes being the result of driver problems. I develop .Net code on this Windows box all day at work, and I reboot once a week, when I power down my machine for the weekend, if I remember. I've gone a couple of months without a reboot to see what happened. Nothing happened. Last time we took down our production DB (ok, to apply a security patch), which handles way over 500,000 transactions a day on ms-sql2k, it had been up for 8 months without missing a beat.
Yes, MS releases security patches. No, it's not always necessary to install them. A good admin will have disabled all unnecessary services & features, and if there is a patch for a service you aren't using, why would you install the patch, especially if the machine was running inside a trusted network?
Who's really at fault?
According to the headline, looks like Slashdot's already decided.
The decision to replace the legacy system was made the same week RadioShack quit selling vacuum tubes. Coincidence? I think not.
I like my women how I like my golf courses: with a windmill hole.
Ouch!!!
it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.
Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?
Critical System + Maintenance = Must Be Done.
The system was designed and set up in a particular manner. In fact, the reboot rule was added to the design of the system, so that this very thing would not happen.
Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.
The discussion of whether the procedure of rebooting a machine every month is inane, is something different.
What would happen if a group of people out of the goodness of their hearts wrote them a new system that truly did everything they needed. Would they adopt it?
Or are the corporate powers that be so out of touch with reality that they wouldn't touch anything having to do with "open sores!"
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
If the system had been updated the problem would not have occurred. How is this a microsoft problem? They cannot force system maintenance.
http://jayceecorder.blogspot.com
The thing that is baffling the Windows administrators here is the 49.7 day bug was NOT in Windows 2000. I certainly had uptimes months past that timeframe. So how is a bug in Windows 95 affecting the FAA in Windows 2000? Can you say FUD? Can you say the FAA is blaming Microsoft for something they most likely screwed up? Knew you could.
nuff said
Oh, by all means, be the good consultant will you? Which of the raft of binary cruft which must compose the system was compiled with the wrong SDK? I'm sure everyone would love you to death if you could reach into the DLL hell and pull out the offending bits. The guy who's supposed to go and reboot the thing once a month will be especially pleased with how clever you are.
It's funny how people pointing their fingers at one or another potential causes think that mitigates how nasty M$ is as a platform. How pathetic a system is it that does not have reliable system timers? How much even more pathetic that someone's goofed timer can pull the whole system down. Oh, but it's a timer, see? No, it's just a "data overload" that will give traffic control incorrect information. How about they should have automated the reboot? As if you want faulty software deciding when it should stop giving your air traffic control info or you would trust it to come back up on it's own. The boss blamed his tech who missed the once a month reboot as if that was never going to happen. It's junk and you should not use it so it's not M$'s fault is my favorite though, right behind just don't use it.
The last two hit it on the head. M$, You have to be crazy to use it. Remember that the next time you think Winblows might be a reasonable candidate for anything. When the thing goes tits up, the blame gets put everywhere but and on you. So much for vendor support.
Friends don't help friends install M$ junk.
Maintenance Task.....Holy shit.....!!!
Maintenance Task.....A monthly Reboot ?????
Well there goes the farm Mildred....
Shut er down Ma....she's suckin mudd....LOL
Aren't there about 7 million freeware apps that will reboot your computer after a certain amount of time? Can't you just write a stupid shell script?
Did the REAL computer tech quit, so they couldn't figure out how to operate the Unix box? Christ.
Please stop stalking me, bro.
Actually, yes it does boot the local machine. It is run on the server that needs the nightly boot.
Coding my way to the next BSOD!
http://www.brothersoft.com/Utilities_System_Utilities_Sleep_Timer_4576.html
Sheesh.
Please stop stalking me, bro.
Forgetting all the talk about Microsoft and Win95/98 and the defect in the OS that has been well known for years and for which a patch has also been available for years....
If you have a system that has a known failure point at 49 days, when do you perform the mandatory reset?
For the failure that is described the scheduled reset must have been "every 30 days" which is, frankly, INSANE!
If they had scheduled a mandatory reset every 14 or 15 days, they would have had to have three failures before disaster struck. As it seems, one failure was all it took.
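The margin argument above can be checked with a few lines of arithmetic. This is only a sketch under a simplifying assumption: the system fails once uptime exceeds 2^32 milliseconds, and each missed reset simply lets the uptime run for one more full interval. The function name is made up for illustration.

```python
import math

# 2^32 milliseconds expressed in days (about 49.7).
WRAP_LIMIT_DAYS = 2**32 / 1000 / 86400


def consecutive_misses_until_failure(interval_days, limit_days=WRAP_LIMIT_DAYS):
    """Smallest number k of consecutive missed resets that breaks the system.

    After k misses the uptime reaches (k + 1) * interval_days; the system
    fails once that exceeds limit_days, which gives k = floor(limit / interval).
    """
    return math.floor(limit_days / interval_days)


for interval in (30, 15, 14):
    print(interval, consecutive_misses_until_failure(interval))
```

With a 30-day schedule a single missed reset is enough (60 days of uptime), while a 14- or 15-day schedule needs three misses in a row.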
"It only takes 20 years for a liberal to become a conservative without changing a single idea." Robert Anton Wilson
Yeah, you've got to hate what that Alzheimer's does. This proves that you should not get old. I'm just glad "W" is not a conservative; he would give conservatives a bad name.
Win2K is a completely different OS than Win95.
I am sure MS completely rewrites new OSs and hence no old bugs reappear in newer MS OSs. This is one reason MS has such a great security record.
FUD, shameless speculation, and bias. Man, this is just bad.
I really wish the FOSS community would follow the MS and SCO leads and avoid all of this FUD and such. Can't we be as good as MS?
Perfect timing for this comment. I was in the airport yesterday (Detroit). The screens over the metal detectors/ carryon xray machines do nothing except tell you whether the lane is open (a large arrow) or closed (a large X). 4 of the lanes had some sort of Windows error message. Apparently they couldn't handle the workload.
I started to write a long comment, no point, unfortunately this is the way today. Trust me - the more computer system decisions are made on manager level instead of using people who know how to build systems - the worse it gets. Used to be that way - compare the financial / manufacturing systems running for years to what we do today - any questions ? Some of my old systems are still running from the 70's - none of my new systems can stay up more than 10-12 months AND I was told to build them that way. And no - CAD systems, CRM, protocols, world wide networks for finance / air lines / etc.. have been there since the early 70's, so complexity is not any excuse. Just don't give up - maybe some day ( after my time.. ) And let's forget the Windows / *nix, Windows is more difficult to build reliable systems on but it can be done - Windows is just more primitive, you have to design / code on a lower level, it is harder than *nix but so what ?
I was on the lucky team that *lost* the bidding for the replacement system; IBM's team were the poor bastards who won, and were stuck investing seven years into building an unbuildable replacement, pouring billions of dollars down the drain while being micromanaged by the FAA, who didn't know much about software design or reliability in spite of having a methodology that required producing 175 design documents over the optimistically 3-year design period.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
This happened after an upgrade from Unix to Windows.
what's your definition of an upgrade?
The 40-year-old system was pretty much the Mos Eisley of software design - you'll never see a more wretched hive of scum, villainy, and undocumented unmaintainable Jovial code running on IBM 360/50 and 360/90 hardware. The backup system was much cleaner (and much dumber); I think the main thing they did in the 1970s enhancement was retread the design to use transistors instead of vacuum tubes, though I never worked directly with that side.
Yes, Sun and IBM machines fail - that's why all of the critical parts in our designs had to be at least doubly redundant, and often triply redundant, because the design spec of "Eight 9s of reliability" meant that doing an hour a year of preventive maintenance might expose you to too much risk from the backup system failing. I haven't seen IBM's design; I was on the lucky team that didn't win the bidding to build the final system, unlike the poor suckers at IBM who had to implement theirs, but the requirements were not only insanely non-implementable, they were excessively focussed on No Possible Downtime Ever, because if anything goes wrong resulting in an airline crash, the FAA gets insane amounts of political heat. Doesn't matter if the system is N years late, because you can try to blame the contractor for that, or if you can't fly supersonic planes across the Continental US because they're too fast for the new ARTCCs, because tough luck for the French and for bi-coastal business travellers.
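To put the "Eight 9s" figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function name is hypothetical and the numbers are plain arithmetic, not anything taken from the FAA spec.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def allowed_downtime_seconds(nines):
    """Downtime budget per year for an availability of 'nines' nines.

    Five nines means 99.999% available, i.e. unavailable 10**-5 of the time.
    """
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


for n in (3, 5, 8):
    print(n, allowed_downtime_seconds(n))
```

Five nines allows roughly 315 seconds (a bit over five minutes) a year; eight nines allows about a third of a second, which is why even an hour of planned maintenance on a non-redundant component blows the budget.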
Of course, that doesn't mean that I'm inclined to trust a system running on Windows, either...
Bill Stewart
Except if an X bug is identified on a FreeBSD system there's a good chance that it's on Slackware, and many other systems. 9x and NT have different *KERNELS* but I suspect lots of the userland is either the same, or mostly the same. You really think they *reimplemented* the entirety of the win32 api just because the kernel changed? hell no
It's 10 PM. Do you know if you're un-American?
Wow, interesting, informative tidbit. Thanks.
No version of Windows has been certified telecom carrier grade reliable (99.999%). All of Microsoft's programmers and billions can't make Windows reliable. Microsoft won't even attempt to pass the certified telecom carrier grade test. There are versions of Linux and embedded Linux that are certified telecom carrier grade reliable.
There has been a serious security flaw in Windows NT 4.0 for a couple of years that has not been fixed. What is Microsoft's solution? Let support for Windows NT 4.0 expire at the end of the year, then Microsoft won't have to fix the serious security flaw. Linux 2.0 (which is older than Windows NT 4.0), 2.2, 2.4 and 2.6 are still supported with the latest security patches.
It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.
That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.
In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
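The 49.7-day figure and the kind of mistake described here are easy to reproduce. The sketch below is Python standing in for the real thing: `naive_has_expired` is a made-up, deliberately buggy timeout check, not code from the LAX system, and the tick values are simulated rather than read from any Windows call.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
UINT32_MODULUS = 2**32


def wrap_period_days():
    """How long a 32-bit millisecond counter runs before rolling over."""
    return UINT32_MODULUS / MS_PER_DAY


def naive_has_expired(now_ticks, start_ticks, timeout_ms):
    """Buggy on purpose: assumes tick values only ever increase."""
    return now_ticks >= start_ticks + timeout_ms


print(round(wrap_period_days(), 1))  # 49.7

# A 5-second timeout is started 1 second before the counter rolls over.
start = UINT32_MODULUS - 1000
# Six seconds later the 32-bit counter has wrapped and reads 5000.
now = (start + 6000) % UINT32_MODULUS
print(naive_has_expired(now, start, 5000))  # False, although 6 s have passed
```

The timeout that should have fired after five seconds is still "pending" because the wrapped reading looks smaller than the start time.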
Failover clusters aren't trivial - I worked on a non-winning design for one of the predecessors to this system back in the late 80s (fortunately for us, we lost, and unfortunately for IBM, they won.) Yes, you can have two, three, or N of everything, but then you need a lot of code watching the redundant components to see if any of them appear to be failing, and code deciding which redundant subsystem is correct if two or more of them disagree, and code watching the watchers to make sure they're still watching well, and data communication protocols that work ok when all messages are transmitted redundantly to the redundant processors, possibly getting different results at microsecondly different times. One of my coworkers had worked with an early "fault-tolerant computer" system which had triply or quadruply redundant hardware, but had an operating system that crashed at least weekly because it was too complex.
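The "code deciding which redundant subsystem is correct" part is, at its very simplest, a majority voter. The Python sketch below is only that simplest case, with a hypothetical function name; it ignores everything that makes the real problem hard (timing skew between replicas, a failing voter, partial messages).

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of replicas, else None.

    'readings' is a list with one result per redundant subsystem.
    None signals that no value can be trusted and a fallback is needed.
    """
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


print(majority_vote([42, 42, 41]))  # 42: one replica is outvoted
print(majority_vote([1, 2, 3]))     # None: all three replicas disagree
```

With triple redundancy one bad replica is outvoted; with only two replicas a disagreement cannot be resolved at all, which is one reason designs went to three or more.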
You also have to be extremely careful and flexible in your design for the granularity of the redundant subsystems - if you make the separately processed chunks too big or too small, you can have an order of magnitude change in performance and sometimes several orders of magnitude change in reliability, and then there's the problem that the definition of "reliability" includes "probability that the calculation finishes in N milliseconds", so it's inextricably linked with performance.
Moore's Law is really your friend here. Improved performance means you can use a lot fewer parts, which reduces complexity and failures. Disk drives are more than an order of magnitude more reliable, and the increase in size means that a cluster of disks containing N gigabytes is several orders of magnitude more reliable because it's a lot fewer disks, and CPUs that are 2+ orders of magnitude faster mean that it's easier to guarantee that something happens in a given time, and cuts down on communications steps between different modules, so you cut down on all the failure modes for those communications, and on the monitoring software watching for failures, and on the failures of the monitoring software. On the other hand, Moore's Law lets operating system vendors and application vendors bloat their software with features - X Windows 11R3 ran just fine on my 386/25 machine with 8MB RAM, but of course I was using twm, not Enlightenment or Gnome.
Backup plans do introduce the danger of complexity - the FAA doctrines of the 1980s were that any new system had to be able to interoperate with everything its predecessor interoperated with, because you weren't going to flash-cut upgrade everything at once. That meant that everything you designed had to be bug-for-bug backwards compatible with the predecessor's interfaces, and when they redesigned the thing _your_ system interoperates with, it has to be bug-for-bug compatible with everything your system does, which means being compatible with its predecessor, which was compatible with your system's predecessor, etc. It's a vicious circle similar to the messes Windows and Intel CPUs had to put up with, except that while the 8088 and MS-DOS 1.0 were *ugly*, they were at least small and well-documented late-70s technology, as opposed to poorly-documented 1960s JOVIAL and 1950s analog.
Bill Stewart
If cars are a valid analogy to operating systems, Linux cars work on "zero point energy", which means that, at the worst case, you should stop every few hundred miles to drain your bladder.
Sweet we finally found a job that George W Bush has created and can simultaneously perform!
Looking at the www.faa.gov home page, it says "Department of Transportation". However, having been a systems engineer and administrator in a couple of stints at one of the DOI Bureaus ... you don't want to know.
Fragmentation is another problem besides leaking, but it can also lead to systems getting progressively slower until they drop below some critical performance threshold.
And disk drives _do_ fill up with log files unless you do something about it.
Back when my department used Vaxen, we'd reboot them every Friday night, fsck the disks, and do backups. Around the time we were running SVR2, the file system really was stable enough and the removable disk packs high enough quality that fsck seldom found anything and didn't need manual intervention, and the rebooting process was reliable enough that we could let a cron job run it, and while we could have cut back to monthly, people had gotten in the habit of knowing the machine would be down, so they could get a life, and it was a good schedule for the times we actually did want to do upgrades or hardware maintenance.
Bill Stewart
Huh, how so? How about fixing the shitty windoze API and making GetTickCount() return a 64 bit value?
In 25 years working with Unix systems, I've never seen that instruction. That must be because I've never worked with any Microsoft Unix system...
[...]it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task[...]
:)
:)
:P
Now I'm thrilled. So now it seems rebooting regularly just to avoid death of that Windows has "evolved" from a ridiculous flaw to a technician's maintenance task
That's really worth a smile, at least where I come from
And right, critical radio outage at the Federal Aviation Administration caused by some Windows version ? Naaah, can't happen in a Windows world, everybody would bet on human error in such a case, right ?
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
Dude, if the OS requires a reboot, it doesn't matter how bad the application software is. A true Operating System should work flawlessly FOREVER. It's not impossible, because VMS does it, FreeBSD does it, Linux does it, so why cannot micro$oft windoze do it?
Can you imagine knowing about this problem, putting it into production and not riding your MS rep like a pony until it was verified fixed ? ...with any other vendor.. sheesh.. but I guess it doesn't work that way with MS - even for the FAA.
Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.
One interesting bit is that Quake 1 servers had problems running for more than 49.7 days for what I assume is precisely the same reason.
The Tandems lived up to their hype, in terms of reliability. I never saw a VSCS failure in almost ten years of use--I barely remember how to use the backup system, VTABS. Maybe now I'll get some practice with it.
Unix to Windows95? more like downgrade...big time
Ladies and Gentlemen, this is your captain speaking. We're cruising at 30,000 feet, you can see the Mississippi out of the left side of the plane, and...uh, what the hell does "STOP 0xc0000005 (0x00000029,0xc02fdec6,0x00000000,0x00000000)" mean?
Hmmm, I seem to recall glancing at the MS EULA one time (when they were printed on the disk-envelope (that I never opened - it came that way - honest)), and the thing said in part that it wasn't to be used in life-safety operations, for running a nuke plant, air traffic control, or other real-time operations...
So ummm, unless MS suddenly created a hardened RTOS, why the fsck is this thing even running anywhere near ATC?
I say FIRE the morons who installed it, ordered it, designed it, and sold it... Finally, FINE the hell out of the asshole company that wrote it and allowed it to be sold for that use... I'd say $150/hr PER person inconvenienced by this debacle, PLUS whatever the airlines lost (or might have earned) PLUS a punitive sanction to make it fucking hurt bad enough that they'll realize that this can't ever occur again - I'd say $15 billion would do it...
Of course, the most important thing is to spend a lot of time carefully defining what events are or are not failures, because that can make a couple of nines difference in what you call the reliability numbers...
And yes, the FAA has always been on drugs. One of the drugs they're on is knowing that if there's an airplane crash and hundreds of dead bodies due to problems with air traffic control, they get infinite amounts of political heat, whereas if major hub airports don't have enough capacity because the ATC system is antiquated, well, that's only money, and usually somebody else's money at that, and if there are appalling delays and cost overruns, maybe it takes a bit longer to get promoted, but often you can _get_ more budget, because if two 747s full of school children crash over LAX a month before Election Day due to ATC glitches, nobody wants to be the Congresscritter who voted against fixing the ATC system. So the system's rigged against them, forcing them to be overconservative, and to _look_ extremely conservative, except that every once in a while the fragility and brokenness of the system catches up with them and forces them to do something in a hurry, especially if there's going to be an election where the top people get replaced for partisan political reasons, which gives them an opportunity to let the outgoing guy take any blame after he's gone. So just because they're on drugs doesn't mean that it doesn't suck to be them...
On the other hand, you really can get equipment that reliable if you're willing to pay for it, and component reliability has improved wonderfully since the 1980s, e.g. disk drive MTBFs of 500000-1M hours instead of 10,000 hours, so you really can wait until midnight slowdown to rebuild the RAID partition after you hot-swap the drive, and computers are a couple orders of magnitude faster so you need fewer of them to get a given job done fast enough, making it much easier to make subsystems reliable and monitor their status.
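A rough feel for what those MTBF numbers mean for a whole disk farm, as a Python sketch. It assumes a constant failure rate (so expected failures scale linearly with drive-hours), which is a textbook simplification; the function name and the drive counts in the example are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(drive_count, mtbf_hours):
    """Expected drive failures per year, assuming a constant failure rate."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


# A hypothetical farm of 100 drives, 1980s-class versus newer drives.
print(expected_failures_per_year(100, 10_000))
print(expected_failures_per_year(100, 500_000))
```

Under these assumptions, 100 drives at a 10,000-hour MTBF means a failure every few days (about 87.6 a year), while the same count at 500,000 hours means fewer than two a year, and bigger drives mean you need far fewer of them in the first place.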
Bill Stewart
So you thought you like Fly-By-Wire airplanes?
Bill Stewart
The flaw isn't in Windows, it is in an application written by a high priced consulting company. It was discovered late in the evaluation process, and since it is easy to work around (by rebooting once per month), and fixing it would have delayed delivery, the software was accepted with the bug.
If the flaw was in the application then the fix would be to restart the application. Since the "fix" is to reboot the entire machine then it's self evident that the flaw is somewhere in the operating system.
If it would be possible to simply restart the application and the advice to reboot the computer is incorrect then the "high priced consulting company" isn't competent to write software in the first place.
Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.
If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.
My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.
Patches, do ya have 'em?
Look behind you...
Sooooo. They were converting to Windows, eh? Do we really think they were installing Win95 anytime recently to force this bug unto themselves?
I would find it hard to believe that they were installing Win9x OR that Win2K+ was affected by this bug, as I have found no current documentation tying this bug to an installed W2K+ OS.
Blah, blah, blah.
I was on a Northwest flight from det->fra and was trying to watch a movie using the in-flight video on demand system, when to my surprise the client rebooted and what did I see but a lill' penguin in the corner! -- my guess is that the 'server' on the plane that serves the menus and movies gets overloaded when the flight staff activates the video system -- and there must be some timeout that occurs and the client reboots. Unfortunately I was not able to see what kind of hardware was driving the LCD displays.
-best
-greg
This is not addressed to the parent, but is for everyone who responded to the parent -
I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!
It's hypocritical, pharisitical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.
Those who can, do. Those who can't, go into business for themselves.
My comment was poking fun at people that assume that UNIX systems are the end all be all of uptime, because the OP's clear implication was that something requiring high uptime should be on a UNIX system, not a Windows system. VMS still beats the pants off UNIX in terms of uptime. It was a joke, you know. Laugh.
Still, regardless of where the bug was in this particular case, the fact remains that servers handling mission critical applications (ie, where people's lives are at stake) should not, under any condition, be running Windows. In this case, the problem was with the application, but just because Windows wasn't the issue this time doesn't mean we should all wait around until it is.
What you're saying is like, "I know there are two gaping security holes in this setup, but the hacker that just took our system down only used one -- therefore, I'm just going to patch that one and be on my merry way."
Personally, I'd rather not trust my life to a computer in general, but I'll be really plain and say that if I had to choose a mature UNIX system versus a Windows system, I'd pick the former any day of the week. And if I had the choice of VMS thrown in there, well, all the better. Things can still go wrong at the application level, but the chances of a BSOD turning the whole airport into a carnage of burning crashed planes is that much reduced. And that, my friend, is a good thing.
PS. Saying that Windows works as well on the server as UNIX or VMS is like saying that mentally challenged kids are as capable as normal ones because they too run the special olympics. Windows may have versions aimed at the server, but until systems that need to be up for a decade under high load have actually been up for a decade under high load, I'm not going to trust it. VMS and Solaris are proven server solutions that really do work. A stable NT that doesn't crash is vaporware, as much as Windows nuts wish it weren't. I'm not saying Windows can never be as stable as UNIX/VMS/MVS/whatever, but the simple fact is that today it is not and we're talking about deploying it on mission-critical servers today, not a decade from now when MS gets its act together.
By that same logic, doesn't a Windows user "Start" the shutdown procedure?
And if you don't want to go to the "Start" button in Windows to shut it down, you could always hit ctrl-alt-del and click shutdown. Or press the power button if you have power management enabled in the bios. I don't really see a fundamental difference between the two, it's just semantics really.
When I first started using Linux, one of the things that baffled me for hours until I could ask someone who knew Linux was how the heck do you rename a file?? I searched and searched for anything resembling a rename command and found nothing. It never occurred to me that you might use the move command to rename a file by essentially just "moving" the file to a new filename. That's at least as illogical (to me and every newbie I've ever known) as clicking Start to Shutdown for someone who isn't familiar with the idiosyncrasies of a particular operating system.
Keith D.
Now you've become a thrustworthly company!
big projects don't work like this. if you find a bug mid testing, then you don't throw the whole thing back at the vendor and chuck the baby out with the bathwater; you simply cannot organise big projects like this. you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system. if you have accepted it on this basis and don't do what you said you would when you signed the constraint off, it's your problem. yes, the vendor shouldn't sell buggy software, but *all* software has *some* bugs in it.
If that's the case - a purely userland decision to store a time value in an int32 - I still say that Microsoft and those who applied Microsoft to this situation are at fault. Why? Because in Unix we have gettimeofday(2) which stores its result in a struct timeval. In other words, we have a well-established way of storing millisecond-resolution timestamps, and a cultural expectation that timestamps will be relative to the Epoch, not to the start of the program.
It pays for Unix programmers to learn the API, because the API is well thought out, and is not constantly churning due to marketing pressure. Unix is a more stable, mature platform. This leads to more reliable apps.
Damn straight -- I for one don't want any patch installed on a system which can endanger my life unless it's been fully tested.
Phil
I guess today is a passable day to die.
If you have to resort to bickering about button captions on the shell to give sit to Microsoft, you have problems. Furthermore, this is in no way related to the article; why is it +2 insightful?
If you let the caption of a button get to you, you need to remove the tin-foil hat and seek help immediately.
By the way, that "Unix to Windows" link just sits there reloading. I'm assuming it's a cookie thing.
Assume I was drunk when I posted this.
You have to get your traffic in a holding pattern and/or switch over to the redundant before rebooting a piece of critical ATC hardware. This cannot be done automagically because your Bravo space might be full of planes at the time, in which case a controller would not want his/her display to go away... I am sure the pilots wouldn't, either..
There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.
I thought so, too, but this persuaded me differently: RPCSS bug
(RPCSS being an integral part of the OS, and suddenly burning a huge amount of CPU cycles being a bug)
At least for server versions of NT and 2000, and my money is on the same thing happening in client versions if you run them long enough.
As I recall, since Windows 2000/NT was once the same product as IBM OS/2 (remember Microsoft OS/2, anybody?), this bug originated from the OS/2 side of the codebase.
IBM ran into the problem quicker, as OS/2 was adopted for various critical things like Automated Teller Machines (ATMs), while Windows NT was mostly used for simple file servers. As a result, the problem was fixed in OS/2 about 2 years before Microsoft got around to fixing the problem in Windows.
Considering that I remember this patch existing for Windows NT and 2000 back in 1999, it is disheartening that the FAA did not feel it necessary to upgrade to something as simple and critical as Service Pack 2 or 3.
"I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
If a single maintenance task (refueling) is missed on airplanes, they will crash.
Why is having to regularly work on extremely complicated systems anyone's fault? I'd lean towards blaming the idiot who didn't...you know...do his job.
Brant
Argle. Bargle.
This happened after an upgrade from Unix to Windows.
How does one upgrade from Unix to Windows?
The only flaw is that the consulting company that wrote the software was incompetent. They used the GetTickCount API, which returns the number of milliseconds since the system was brought up as an unsigned 32-bit value. The documentation clearly states that this value will roll over to 0 and continue counting from there after 49.7 days. The documentation also mentions timers with higher resolution as well as better places to get system uptime as a 64-bit value.
The only reason rebooting Windows was necessary was because this tick value is tracked by the OS and not the application, so restarting the application would not prevent the software bug from causing problems. But the flaw is certainly in the application for using the wrong API for the job.
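To make the 49.7-day figure and the rollover concrete, here is a minimal sketch. It simulates a 32-bit millisecond counter in Python rather than calling the Win32 API, and the function names are hypothetical. How the FAA application actually used the tick value is not known from this thread; the "naive" version below is only one plausible way such a bug could arise.

```python
# Simulation of an unsigned 32-bit millisecond tick counter.
# All names here are illustrative assumptions, not part of any real API.

TICK_MODULUS = 2 ** 32               # an unsigned 32-bit value wraps at 2^32
MS_PER_DAY = 24 * 60 * 60 * 1000     # milliseconds in one day


def days_until_rollover():
    """How long a 32-bit millisecond counter runs before wrapping to 0."""
    return TICK_MODULUS / MS_PER_DAY


def naive_elapsed(start_tick, now_tick):
    """Elapsed time computed as if the counter never wrapped."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Elapsed time computed modulo 2^32, correct across a single wrap."""
    return (now_tick - start_tick) % TICK_MODULUS


# A timer started 1 second before rollover and read 5 seconds later.
start = TICK_MODULUS - 1000
now = (start + 5000) % TICK_MODULUS  # the counter has wrapped past 0

print(round(days_until_rollover(), 1), "days until rollover")
print("naive:", naive_elapsed(start, now), "ms")
print("wrap-safe:", wrap_safe_elapsed(start, now), "ms")
```

The modular subtraction only stays correct if the measured interval is itself shorter than one full wrap, which is why a 64-bit uptime source is the safer choice for long intervals.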
Once upon a time, a certain vendor recommended monthly reboots of their server which collected call data from one of their products. This may have changed with newer releases. The server ran Solaris.
Of course that's not to say it was a Solaris problem. The point is that by "UNIX systems" the parent might have meant systems with UNIX as the OS, but running other crappy code on top of it.
I think that the recommendation was more to cover their posteriors: if for some reason the software failed, and the customer didn't do the monthly reboots, whose fault is it?! Of course, our server ran problem free with over 2 years of uptime before a drive failure ruined that.
No. The FAA is part of the Department of Transportation
i guess you *are* a moron. considering that refueling a plane is a multistep process with multiple points of failure (the refueler must forget, the pilot must ignore the fuel indicator, etc...) whereas rebooting the server has ... how many points of failure? apparently just one. the original author is correct to call out the question.
in systems i've designed where one person is responsible for some event in a complicated workflow, the system sends an email out when the event draws near. it sends another as it is missed. it alerts the person's manager if it is late. if the FAA system has no such alerts, then the reboot event was poorly architected. if the FAA system DOES have such alerts, then the entire organization has a problem. it is never the fault of one individual when a system fails.
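The escalation scheme described above (remind before the deadline, notify when it is missed, alert the manager when it is late) could be sketched roughly as follows. The function name, the message strings, and the three-day and one-day windows are assumptions made for illustration; nothing here reflects how any FAA system works.

```python
from datetime import datetime, timedelta


def alerts_for(due, now, done,
               warn_before=timedelta(days=3),
               escalate_after=timedelta(days=1)):
    """Return the alerts that should be active for one scheduled task.

    due            -- when the maintenance task must be finished
    now            -- current time
    done           -- True once the task has been completed
    warn_before    -- how early to start reminding (assumed value)
    escalate_after -- how late before the manager is told (assumed value)
    """
    if done:
        return []
    alerts = []
    if due - warn_before <= now < due:
        alerts.append("remind technician: task due soon")
    if now >= due:
        alerts.append("notify technician: task missed")
    if now >= due + escalate_after:
        alerts.append("escalate to manager: task overdue")
    return alerts


# Example: a reboot due at midnight on the 14th, checked on three dates.
due = datetime(2004, 9, 14)
for check in (datetime(2004, 9, 12), datetime(2004, 9, 14, 6), datetime(2004, 9, 16)):
    print(check.date(), alerts_for(due, check, done=False))
```

The point of the design is that no single missed step stays silent: each stage adds a second party who can catch the omission.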
This is a terrorist offense. Yes, the vendor execs could be dragged into court and sentenced to death for this.
The society for a thought-free internet welcomes you.
The air traffic CONTROL system was NOT affected. The controllers could watch the planes "near miss" on their radar scopes. The *radio* / communications system was hosed; controllers could not contact the planes by voice.
2^32.... reboots, schmeboots.... the system was down for *5 HOURS*. Not even a crapped-all-over-itself Windoze 95 box takes that long to come back up and load its app. This is obviously a SERIOUSLY FLAWED system that could not restore itself, and all backups failed (for a time) too.
"Delta, go around..." -- (message to Flight 191 as it was crashing in a thunderstorm, Dallas, 1989)
Unfortunately, this "promotion" doesn't always take the form of innocuous sales calls-- it includes significant political lobbying, with donations, gifts, dinners, etc. To put it more bluntly, Microsoft is paying politicians to select Windows for applications where MS knows it isn't really appropriate.
While the politicians are certainly to blame for being corrupt, it's not like Microsoft can avoid responsibility for their role in the decision-making process. If I suggest to a government official that something might be a good idea, I can reasonably avoid some of the responsibility when it doesn't pan out. When I bribe that official to do it, I'm taking a much more active part in that decision, and thus I deserve every bit as much blame for the end result.
My neighbor tried to use an electric handmixer to make gravy over a hot stove. The cord caught on fire and when he tried to put it out he got electrocuted. He read the recipe on the internet using Internet Explorer Browser. If it wasn't for MS, he would be alive today.
See you can tag everything back to MS if you fish hard enough. Next it will be earthquakes and hurricanes we tag blame on MS for.
Doesn't anyone see this as a bit silly to blame MS for an obvious blunder on not only their IT dept. but the morons in charge of maintenance and engineering? If you drive your car through the back of your garage, is it the car manufacturer's fault? How stupid would you look for trying to place the blame back that far?
I don't quite share your fanatical hatred of Windows; when used properly, it's quite capable of handling whatever you throw at it.
VMS and Solaris are proven, aye; I do, however, have fond memories of reading, fifteen years ago, the exact same things about Solaris that people now say about Windows; rampant holes, services with root access open to the Internet (lpr, rpc, sendmail, and so on), accusations of terrible bloat (xclock takes HOW MUCH RAM?) and slowness ('Slowaris' indeed) and so on.
It also never fails to amaze me that the zealots tout the security of an OS that can be defined as 'the results of taking a secure OS and ripping out most of the security functionality.' UNIX, after all, is a play on its MULTICS parent, as it's a castrated version thereof.
Still, as far as I'm concerned, if you want reliability and no downtime, you use a mainframe. Period. We're talking about a system that you can rip running processors out of, and the damn thing won't even blink.
Vintage computer games and RPG books available. Email me if you're interested.
So would this be the same "wrong API for the job" that Microsoft's developers are using to code Windows services?
Print Spooler Stops Scheduling Print Jobs
http://support.microsoft.com/default.aspx?scid=kb;EN-US;318152
I agree the developers should not have used this tick counter. And when they discovered there was a problem it should have been fixed immediately as the code change would not be that significant if it was only a matter of the tick counter rolling over.
But from what I've seen first hand and heard from others, I still believe that Windows is not up to the task. And rather than it being the wrong API for the task, it appears to me it's the entire system (Operating System, API, Developers, Vendors, etc.) that is wrong for the job.
burnin
You are correct, I did mean the FAA. My bad.
...would call a migration from Unix to Windows 2K an upgrade?! "...upgrade from Unix to Windows."
In the mid 1980's, I knew a software engineer at Caltech's Jet Propulsion Laboratory who worked on a multi-year JPL project for the FAA. The project was to replace the obsolete voice communication system for air traffic controllers. The new system had touch screens with onscreen menus and buttons were dynamically reconfigured depending on the controller's workload. It worked correctly, and the engineer enjoyed describing to me how it worked. This was all before there was any version of Windows. If I recall correctly, they developed on MODCOMP minicomputers running VMS but deployed on an embedded system with an in-house design for task switching, not a complete OS. I might be fuzzy about the technical details at this time, but a FOIA request should be able to retrieve them for the intensely curious.
I do clearly remember that the working system was presented to the FAA in Monterey, and the FAA then terminated the contract and hired IBM to start over from scratch on a new system. Rumor was that this was a political payback. I should emphasize that's just a rumor I heard. Looks like Harris eventually got the contract. I wonder if any of the original code from JPL was ever deployed.
That certainly sounds fair.
Regarding some of the engineers at Harris.
As a matter of fact, I DO work as an engineer for a large, multinational company--and our projects do in fact involve mission critical systems. You are right--engineers do not always get what they want and it does often mean dealing with politically/non-technically made design choices like using Windows when we'd prefer not to. However, there is a limit--a time and place where commodity/consumer grade hardware and software is appropriate--and it's NOT at a level at which a crash will bring down an entire system. I do not have to know how the software works to make that observation--it has been shown that a Windows box failed and the result was a major system disruption and hours of chaos. It's not the fact that they used Windows that is disturbing--it's the fact that they used it in a mission critical situation...without adequate testing to boot. And yes, I do have a clue as to how complex the system is and the intricacies of how it works--our company's products run systems in oil refineries, factories and power generating stations. In a similar situation and project we would handle things differently:
1 If program managers were indeed making critical decisions, they would HAVE to be registered Professional Engineers by law, just like the lead developers.
2 Lead developers are explicitly instructed NOT to simply do as they're told. If they see a serious flaw in a design decision they are obligated to make their views known. Of course, you can't counter one political decision with another--you must have a solid case. If your boss refuses you go to his boss. If you are stonewalled right to the top and you think the issue is really important you can bring the issue to the professional association. The final course is to perform the work and refuse to sign off on it (make the boss do it). That way, if the result is failure, you are in the clear and your higher-ups take all the heat and not just some--it's "due diligence" (ass covering, really).
3 During development and testing, we identify any potential single points of failure, bottlenecks and known issues. In my situation, Windows-based systems are ALWAYS considered "unreliable" (that is, not to be relied upon for critical or safety related systems), therefore we prescribe redundancy. Our test plans always call for us to do controlled AND uncontrolled (pull the plug) shutdowns of each machine in sequence (to test failover) and simultaneously (to determine how the PLCs and other embedded systems, plus electromechanical systems, handle catastrophic failure).
4 If hardware cannot be supported for at least ten years (and in some cases up to 25 years) we MUST design such that there will be a drop-in replacement that will cause minimal disruption (for example an old VAX VMS server could be upgraded to a current Alpha VMS, or an old PLC can be replaced with a next generation one that will execute the same routines rung-for-rung).
5 It is typical to keep the previous, pre-upgrade equipment around as a standby system, ready to put back in service, until the new system has worked as-advertised WITHOUT INTERVENTION for at least a year. A crash or other fault would reset the 1-year clock and we'd be doing a thorough root-cause analysis.
It sounds like there is a lack of professionalism within your group of engineers. I'm not sure about how things are done where you live, but "just following orders" is not an excuse for poor engineering--a failure of that nature where I am would result in being temporarily barred from practicing engineering. Sometimes it can be tough to go against the PHB--I've heard of engineers being fired for refusing to sign off on designs, but I'd rather be fired and be able to work as an engineer elsewhere than have my ability to work as an engineer revoked entirely.
I guess I would have to ask the FAA as to why they made the decision to migrate a working critical system to Windows--a radically different architecture from UNIX. My employer builds
Management at work. Upgrade indeed.
shutdown -t 0 -r
This says to reboot the computer and wait 0 seconds before doing so. Stick a -f in there to force a shutdown if you've got ornery apps. Piece of fucking cake, people. Shit like this makes me wonder why I'm still unemployed; I obviously have some skills that would be appreciated by the FAA. Just put the command in a task scheduler entry, set it to recur every 2 weeks, and you're golden. I mean, seriously, what the fuck?
I use task scheduler to make backups of my current Opera session and to run periodic defrags and clean temporary folders and so forth. The system provides a way to maintain itself at scheduled intervals, why rely upon a technical lackey who can (and obviously did) screw up?
Tangentially, when the blaster worm came out and was giving everyone the NT Authority you-must-shutdown-now message, I discovered that a quick shutdown -a would abort the shutdown process and allow you to continue working with the (albeit unstable) system, to install a patch or the like.
Reinvent the wheel only at either a lower cost, greater effectiveness, or your own personal enrichment and satisfaction.
I did a search for the 49.7 days in Microsoft's knowledge base and found one possibly related bug, the non-related bug referenced by the article submitter, and some other non-related bugs. The one thing they all have in common is an improperly used GetTickCount function in the code.
First, there's the five and a half year old patch fixing an issue in Windows 95/98. There's no reason this should have been mentioned anywhere in reference to this incident. Shame on the poster and all the people backing this theory. It's pure reverse FUD because there's nothing indicating that this bug was related and everything shows that this only affects 9x. Personally, I'm positive that this problem isn't in 2000 because I supported 2000 for Microsoft when it was released and never heard of this happening. Also, Microsoft is good about testing all of its products to see which are affected. If this type of screw-up were common, the articles would be common on Slashdot since the typical reader lusts after examples of MS screw-ups. There's also the fact that there's a LOT of Windows 2000 boxes with uptimes way past a month and a half.
But then there's the CPU utilization rpcss.exe bug. If this is what was happening, then it's partially Microsoft's fault for not having enough QC testing targeted towards idiot programming mistakes. Nobody tested enough to see what happens under different scenarios when GetTickCount is improperly used. Also, the hotfix from Microsoft is only a few months old, probably not enough time to test and deploy. On the other hand, GetTickCount is designed to only work for 49.7 days and shouldn't have been used for this application. I'd assume that they didn't know what was going on when shit hit the fan after a month and a half of running relatively smoothly and only after the MS patch was released did they review their code and see that they were improperly using the function. Still though, any company that has an internally written or contracted program with this serious of a bug should have invested the resources required to find the problem and fix it. They should have known that the problem was related to software installed on the server, most likely their proprietary FAA program because if every Windows 2000 computer running on a Dell had this problem, Microsoft would have released a patch long ago. Heck, they should have found that they were using the function improperly. If the programmers knew how long it ran for before dying (49.7 days), they should have realized that it's related to the GetTickCount function and could have narrowed in their efforts to wherever the function was used.
If the problem was not related to the rpcss.exe bug, then I don't see how MS is to blame. The blame lies solely with the programmers of the FAA software for improperly using the GetTickCount function.
In conclusion, with either of these scenarios, I'd be replacing some of my programmers if I were the manager in charge of the project that wrote the FAA software.
-Lucas
The Rpcss.exe bug appears to be fixed in W2K SP1 since it only applies to Windows 2000 Server (i.e. no service packs).
It looks like the print spooler bug was introduced in W2K SP1 and wasn't discovered or fixed until after SP3 (since only W2K SP1-SP3 are listed).
Considering how long SP1 has been out, not to mention SP4, I don't see this as a Microsoft problem (assuming it really is an OS issue).
-- Argel
This is about the only comment by someone with a clue in this whole thread
Does that herb facilitate time travel?
Because the last time I had to schedule reboots for a machine of mine was around 10 years ago.
Oh yes, last time I used Windows, my bad.
I have administered Solaris, Linux, HP-UX, Irix and a few others, and frankly the one who should either go get a job in the real world or stop hallucinating is you.
IANAL but write like a drunk one.
Well, I stand corrected - for some reason I thought the Federal Government would have been putting DOT in with all of the OTHER "interior" federal issues. Stupid me...
Considering the bizarre web of overlapping police agencies they've got, I should have known better...
Kinda worries me more to see more than one department having that kind of problem...
Hacker Public Radio is our Friend
it would have been cool if you had gotten a picture of this, but then they would probably have wanted to arrest you for 'terrorist-like' activities.
I can't wait till it doesn't feel like a police state anymore.
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.
I used Solaris 2 on UltraSparc before, and it was the one system that never froze or crashed on me (apart from GNOME/KDE apps and XFree86 itself). Even the most stable Linux that I saw cannot compete with it.
Our Windows servers have reboot schedules and these are monitored via our enterprise management tools to ensure that the uptime is not too high. Not drastically different to checking the fuel gauge really. A bit obvious, I thought.
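The kind of uptime check described above might look something like this minimal sketch. The 30-day policy limit, the function name, and the status strings are assumptions for illustration, and the uptime value is passed in rather than read from the OS, since that step differs per platform and monitoring tool.

```python
# Rollover point of a 32-bit millisecond counter, in days (about 49.7).
ROLLOVER_DAYS = 2 ** 32 / (24 * 60 * 60 * 1000)


def reboot_status(uptime_days, max_uptime_days=30.0):
    """Classify a server's uptime against a reboot policy.

    uptime_days     -- current uptime, supplied by whatever tool collects it
    max_uptime_days -- policy limit (an assumed monthly schedule)
    """
    if uptime_days >= ROLLOVER_DAYS:
        return "past rollover"
    if uptime_days >= max_uptime_days:
        return "reboot overdue"
    return "ok"


# Example: three servers with different uptimes.
for days in (12.0, 41.5, 50.0):
    print(days, "days ->", reboot_status(days))
```

Keeping the policy limit well below the rollover point leaves a margin for a missed reboot to be noticed and corrected before anything breaks.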