Windows Upgrade, FAA Error Cause LAX Shutdown
fname writes "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.
/sarcasm
Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer!
a technician didn't reboot the system monthly as he should have
You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)
The dangers of knowledge trigger emotional distress in human beings.
It's obviously lunacy for any company to replace a proven system, which has given years of reliable service, with some piece of trash that crashes if left running for over a month. That said, I was under the impression that a simple "at" job could be used on a Windows machine to run a script periodically (at is similar to cron, except far less capable, of course). Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.
We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).
Code, Hardware, stuff like that.
I'm surprised we didn't get 'mission' critical there in the blurb
Have they never thought to just schedule an event to reboot the computer every 30 days?
Don't use this stuff in mission-critical applications.
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
"Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.
Use Ctrl-C instead of ESC in Vim!
Why Microsoft of course. Unix doesn't have any flaws.
....all in-flight movies are played on Windows Media Player.
If you think
But most off-the-shelf software has disclaimers expressly stating that it is not to be used in mission-critical situations. E.g.:
"[Java] technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."
-- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.
I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?
But I'm going to.
It's M$'s fault. Why do I hate to say it? Because it'll just be seen as more anti-MS crap from another /.er.
All I have to say is if the shoe fits, wear it.
In this individual case a PHB made a decision to scrap the old, stable OS for a new, known-to-be-unstable OS. That screams PHB.
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.
This mess was big enough that there's a large enough supply of blame to give some to everybody involved.
- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor schmuck.
Bottom line is that everybody's at fault, because had any one piece in the chain done its job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.
Why did they move from Unix to Windows in the first place? And why should a bug from Win95 crash a migrated Win2K?
How sad that such a sprawling metropolis of commerce and travel can be brought to its knees by the magic that is Windows.
Color me surprised.
Read the only personal Runyon page out there.
crontab -e :P
59 23 15 * * shutdown -r now >/dev/null
http://shit.slashdot.org/article.pl?sid=04/09/21/2120203
upgrade from Unix to Windows
AKA, "The PHB Special"
Of course, the guy who was supposed to reboot the box will get all the blame. Shit rolls downhill.
... I can think of no one else to fault *BUT* the technician. The IT guys know full well that this "quirk" exists, and in fact, part of their planning and maintenance involved resetting the machine in order to get around this potential problem. These guys did not complete their job duties, and as such, the system went down.
How can you intimate blaming the software company here?
- DaftShadow
slashdot posts this FUD and calls it news, implying that MS is at fault. this is pure trolling....
grow up slashdot editors, this has been old for a long time, grow up and stop the senseless bashing!!!
To the rescue!
http://www.nbc.com/LAX/
-- The Funk, The Whole Funk, And Nothing But The Funk
The newspaper said that a Microsoft-based replacement for an older Unix system needed to be reset every thirty days 'to prevent data overload', as a result of problems found when the system was first rolled out. However, a technician failed to perform the reset at the right time and an internal clock within the system subsequently shut it down. A back-up system also failed
Guess there was a backup, I feel for that guy.
"This happened after an upgrade from Unix to Windows."
That's the funniest thing I heard all day. Windows is an upgrade from unix. I almost choked on my coffee.
...why did they switch to Windows in the first place?
US businesses that currently accept chip and PIN/signature
It is human error: those bugs didn't write themselves. Nor did the operations protocol that required "rebooting LAX" every 49.69(!) days. Nor did the upgrade procedure that ignored that bottleneck. Nor did the upgrade decision that moved from Unix to Windows. Those were all human errors, as was the decision to keep a job at LAX that would face blame for shutting down the airport (or risking lives) if the reboot was missed, or unsuccessful.
"Not I," says the referee,
"Don't point your finger at me.
I could've stopped it in the eighth
An' maybe kept him from his fate,
But the crowd would've booed, I'm sure,
At not gettin' their money's worth.
It's too bad he had to go,
But there was a pressure on me too, you know.
It wasn't me that made him fall.
No, you can't blame me at all."
- Bob Dylan, "Who Killed Davey Moore?"
--
make install -not war
sleep 4294080 /s
shutdown
I remember when the 49.7 day bug was discovered. That was right after I had just hit the 49.7 day freeze in an attempt to keep my personal machine alive as long as possible.
When it froze, I didn't know why until I read the story, just figured it finally gave up the ghost for no real reason. It was time for a reboot anyway, that system was hurtin' bad.
Why the hell they have a critical system running on an OS that can't stay up for at least 50 days, I do not know.
"With sufficient thrust, pigs fly just fine." -- RFC 1925
There's no conceivable reason not to. How do you justify your money going to a company that keeps the source to itself?
You paid for it with your taxes - you own it. Demand open source at ALL government levels.
the major advances in civilization are processes which all but wreck the societies in which they occur - A.N. White
MS lawyer: "It all worked in the flight2000 simulator? We always rebooted after every crash and everytime it was OK afterwards?"
It would suck if all the radios shut down in the middle of an emergency landing. Better to have it manual.
Of course the technician was blamed - if not, some CIO-type in charge would have had to take it, and he wouldn't allow that to happen. It always runs downhill...
Because there are 4294080000 milliseconds in that time period. Just enough to cause a roll-over when using a 32 bit counter (and yes, 49.7 is an approximate value).
Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.
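The arithmetic is easy to check. A minimal sketch, assuming an unsigned 32-bit counter that ticks once per millisecond, as described above; the constant names are made up for illustration:

```python
# Sketch: when does a 32-bit millisecond counter wrap around?
MS_PER_DAY = 24 * 60 * 60 * 1000        # 86,400,000 ms in one day
COUNTER_STATES = 2 ** 32                # distinct values of an unsigned 32-bit counter

# 49.7 days expressed in milliseconds, using integer math to avoid float noise.
ms_in_49_7_days = 497 * MS_PER_DAY // 10

# Exact number of days before the counter returns to zero.
days_until_wrap = COUNTER_STATES / MS_PER_DAY

print(f"ms in 49.7 days: {ms_in_49_7_days:,}")
print(f"2**32:           {COUNTER_STATES:,}")
print(f"days until wrap: {days_until_wrap:.4f}")
```

So 49.7 days sits just under the 2**32 limit, and the counter actually wraps a few minutes after the 49.71-day mark.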
Wanted: witty unique signature. Must be willing to relocate.
...keep in mind that we have established numerous times that windows is not suitable for systems that need reliability and stability. It is not the operating system's fault that this happened, it is the FAA's for choosing to use it instead of considering the better alternatives. If you get run over on a bicycle while riding on the highway, don't blame the bike.
Quick addition: it seems that the fault does not belong entirely to windows, but rather a combination of the software running on it and the system architecture.
With that said, Windows could stand to improve a lot. It has too many bugs, too many flaws, and so on. And it definitely does not have a stable, secure, reliable base. So don't expect it to.
webpage
So I installed Linux.
Fight Spammers!
They often do provide the source code to governments. It does not mean that it is free, or that the government can submit fixes.
Flight BA 91429 on final approach.
Tower: what does "svga.dll performed an illegal instruction" mean?
Pilot: Oh sh...
From the submission
possibly related to an old Windows 95 bug
From the Article.
The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.
The shutdown is not a crash but a scheduled event to bring the servers down to flush data.
So it does not seem to be a problem with Windows (Ok now I get marked as troll) but with the FAA's own software.
Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Silly IT departments.
If you "upgrade" a piece of software, then discover it requires a complete manual system restart to remain stable, the prudent thing to do in any other circumstance would be a rollback.
Unfortunately, since this is an IT department, it must run Windows; after all, where could you ever find support for Linux?
It's only an insult if it's not true.
the blame rests with microsoft. windows is the only "operating system" i know of that *needs* to be rebooted to maintain regular operation. i'm sure this wouldn't have happened if they stuck with a rebootless unix machine.
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999
Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .27 seconds. That's a damned fast boot time for Win2K.
Maybe they left off a percent sign?
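For anyone who wants to check the back-of-envelope figure, here is a rough sketch. The only inputs are the quoted availability and the 31-day month used above; the variable names are invented for the example:

```python
# Sketch: downtime budget implied by a stated availability figure.
AVAILABILITY = 0.9999999                      # figure quoted from the product blurb
SECONDS_PER_31_DAY_MONTH = 31 * 24 * 60 * 60  # 2,678,400 s

# Fraction of time the system is allowed to be down.
unavailability = 1 - AVAILABILITY

# Seconds of downtime permitted in one 31-day month.
budget_s = unavailability * SECONDS_PER_31_DAY_MONTH

print(f"allowed downtime per 31-day month: {budget_s:.3f} s")
```

That works out to roughly a quarter of a second per month, which no reboot of any OS is going to fit into.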
Do you have ESP?
Surely the simplest 'fudge' to fix this problem is
to write a script that beeps loudly every 10 mins
or some other (read: more sensible) notification
after the system uptime exceeds 30 or so days?
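Something along these lines would do as a reminder. This is only a sketch: the function name and thresholds are hypothetical, and how the uptime is actually read is left to the caller, since that part is platform-specific:

```python
from datetime import timedelta

REBOOT_INTERVAL = timedelta(days=30)          # the monthly reset described in the story
HARD_LIMIT = timedelta(milliseconds=2 ** 32)  # assumed 32-bit millisecond rollover


def reboot_status(uptime: timedelta) -> str:
    """Classify an uptime reading as OK, OVERDUE, or EXPIRED."""
    if uptime >= HARD_LIMIT:
        return "EXPIRED"
    if uptime >= REBOOT_INTERVAL:
        # Whole days remaining before the assumed rollover.
        days_left = (HARD_LIMIT - uptime).days
        return f"OVERDUE: reboot within {days_left} days"
    return "OK"


for days in (10, 35, 50):
    print(days, reboot_status(timedelta(days=days)))
```

Run it from whatever scheduler is handy and wire the OVERDUE case to a beep, a pager, or an email.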
But seriously, if it's running windows it's not the
monthly reboots they need to worry about, it's the
quarterly format/reinstall procedure that's required
for stable operation.
I don't think I've had a stable (home) windows install for
more than 6 months without reinstall, but maybe I'm
pushing my luck by actually USING the computer.
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19
Who's really at fault?"
...it's the technician's fault for not doing his job
...it's MS's fault for not patching such a blatant bug
...it's the guy's fault who decided to move from the stable unix to shite Windows 2000
I wrote a VB program years ago for the Win95 to solve this problem. I just had the scheduler run my program that rebooted the system for me.
Umm.... Duh
You say things that offend me and I can deal with it. Can you?
I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?
cheap labor conservatives - they want to keep you hungry enough to be thankful for minimum wage.
I was sitting in Atlanta-Hartsfield for an extra 70 minutes thanks to that bastard.
There is a difference between "insightful" and "inciteful" other than spelling.
I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.
Unknown host pong.
The employee missed the maintenance window. If you forget to do something that is a part of your job, I would have to suggest that you are responsible for the consequences. Now, does placing the employee in such a situation apply some burden of responsibility upon higher-ups? Certainly. But, the employee should be held responsible...ESPECIALLY if the importance of the maintenance was made clear.
Since when is going from Unix to Windows an upgrade?
Mod Wisely.
upgrade from Unix to Windows
does that statement boggle anyone else's mind?
------ no thanks... I've quit
Was the flaw left unfixed for too long because they did not have access to the source code? Or was it because it was too expensive? If this is such a critical system that it can cause loss of life (on a massive scale, no less), the root cause should have been fixed, rather than the workaround. I remember reading somewhere that this flaw has now been fixed. Smells like a managerial issue within the FAA, not just a technician problem. Remember NASA and the space shuttles?
I can honestly say I have never encountered a situation where this would be a problem, as I have never had a Windows box stay up for more than a week without either crashing, getting so bogged down it needed a reboot just to open Word, or requiring me to reboot after I installed some ridiculously little program.
If I recall, doesn't MS have something that absolves them of any liability listed towards the end of the license agreement. Something along the lines of, "Do not use in mission critical places." Or was it more like do not install in missile silos or nuclear facilities, something like that right? Someone correct me. If I am right about the license agreement, that was stupid of LAX to have been suckered into switching from UNIX to M$. Oh wait, I forgot, everything works better on MS products right? That's why we have many security/virus/worm/bug/whatever flaws. What a great product Bill!
I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
Actually, I do. You've got a job; you've got deadlines. Do the work.
In the future, I would want to not be isolated from my friends in the Space Station.
OK, I know it's violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:
This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.
By the very nature of the system, the blame can fall on no one other than the maintenance personnel. Otherwise the PHB that authorized the "upgrade," and the system that put the said PHB in the position to authorize said "upgrade" would look incompetent and foolish, and we can't very well have that.
ELOI, ELOI, LAMA SABACHTHANI!?
This happened after an upgrade from Unix to Windows.
Shouldn't that read downgrade instead ???
Nice going, Microsoft!
First you crash a US Navy ship a couple of years back thanks to an NT4 divide by zero bug, now this.
Today's lesson kids: Don't use Windows for any mission-critical apps!!
Before the torrent of "windows sucks" posts...
Too late.
Ah, the old "windows maintenance reboot" problem. It always amazes me how IT managers (hell, even some techos) accept the need to reboot their windows systems every week. At my work, the windows guys accept it as normal maintenance. If I had to reboot my AIX and z/OS systems every week there would be hell to pay. But because it's windows, it's accepted. I dunno, mediocrity is the new standard these days...........
" upgrade from Unix to Windows "
:)
Ahahahahahahahahahahahahahahahahahahahah that's the funniest thing i've read in ages
Last.fm - join the social music revolution
I believe the 49.7 days of uptime for a Windows 95 box is a new record, shattering the previous record in Norway of 27.9 days back on January through February of 2001. Congratulations!
--
http://www.aikiweb.com - AikiWeb Aikido Information
after 584542046 years. Okay, I admit... when you reach that time, you'll probably have other problems than a Win2K crash.
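That number checks out if you assume it refers to a 64-bit counter ticking once per millisecond and a 365.25-day year (both are assumptions here, since the post doesn't spell them out):

```python
# Sketch: rollover time of a 64-bit millisecond counter.
MS_PER_DAY = 24 * 60 * 60 * 1000

# A 365.25-day year in milliseconds, kept as an exact integer.
MS_PER_YEAR = MS_PER_DAY * 36525 // 100

# Whole years before an unsigned 64-bit millisecond counter wraps to zero.
years_until_wrap = 2 ** 64 // MS_PER_YEAR

print(f"{years_until_wrap:,} years")
```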
and office update while you're at it too...
Wouldn't want to spoil a nice MS bashing session, but I think the bug was in the ported application, not in the OS - probably someone used the wrong data type to hold timestamps somewhere within the program (win95 had the same bug) - I've seen win2k last more than 47 days without reboots...
I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.
Let me give you an overview of the failure approach of just one of those systems.
1) Everything on Unix, ruggedised releases of UNIX
2) Every box must be able to FAIL ON ITS OWN
3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.
4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.
5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.
6) 4 Years of testing of FULL system before live.
This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.
The idea of a safety critical (if it fails people could die) system that requires a reboot is fine in only one case... if it can be non-operational on a regular basis, in which case it should be done EVERY non-operational window (say every week). This is therefore okay for some hospital scanners that are certified for 12 hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.
Welcome to the US... we will be landing slightly quicker than expected.
An Eye for an Eye will make the whole world blind - Gandhi
I think it depends on what the company rep said when they convinced them to replace Unix with Windows.
If they advertised a consumer OS as an OS suitable for mission critical applications . . . then this flaw should not be in the software. It could be the software company's fault for aggressively marketing their product where it should not be.
Maybe we should throw some blame to the PHB who ordered the switch. Perhaps there was no hard sell from MS, and a PHB saw a product brochure and got a hard on to switch.
I see your point though, the tech knew about the problem and failed to do his job.
I guess my question is, should the problem have been addressed before now, or is it common practice to wait for a catastrophic success like this to occur before addressing the problem ?
This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).
2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?
Do you feel more secure?
The real "Libtards" are the Libertarians!
A system running UNIX doesn't necessarily mean it was stable. It could have all sorts of flaws in the code, hardware failures, etc.
Sure, Windows 95 in particular and Windows in general is often less stable than modern counterparts. But an upgrade from an old, obsolete UNIX to a new Windows system could have had significant benefits and made a lot of sense at the time. Without the full information behind the decision, how can you judge whether the decision was bad or not?
There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.
The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.
So the badly written Harris software has this bug and their solution (which was really not that bad of a work around) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.
There is nothing Microsoft could do to prevent this.
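For the curious, the usual way to survive the rollover is to compare elapsed ticks with modulo-2**32 subtraction rather than comparing absolute tick values. The sketch below simulates a 32-bit tick counter in Python; it is not the Harris code (which nobody here has seen), and every function name in it is made up for illustration:

```python
# Sketch: wraparound-safe timeouts with a simulated 32-bit millisecond counter.
MASK32 = 0xFFFFFFFF  # keeps values in the unsigned 32-bit range


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Milliseconds between two readings, correct across one rollover."""
    # Subtracting modulo 2**32 still works when now_tick has wrapped past
    # zero and is numerically smaller than start_tick.
    return (now_tick - start_tick) & MASK32


def naive_timeout_expired(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Fragile pattern: compares against an absolute deadline, which can wrap."""
    deadline = (start_tick + timeout_ms) & MASK32
    return now_tick >= deadline


def safe_timeout_expired(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Robust pattern: compares elapsed time instead of absolute tick values."""
    return elapsed_ms(start_tick, now_tick) >= timeout_ms


# A 5-second timer started 1 second before rollover, checked 500 ms later.
start = MASK32 - 999            # 1000 ticks remain before the counter hits 0
now = (start + 500) & MASK32    # counter has not wrapped yet

print(naive_timeout_expired(start, now, 5000))  # True: deadline wrapped to 4000
print(safe_timeout_expired(start, now, 5000))   # False: only 500 ms have passed
```

The naive check fires 4.5 seconds early because its deadline wrapped to a small number. The elapsed-time check gives the right answer both before and after the counter returns to zero, as long as the interval being measured is shorter than 2**32 ms.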
.. is what I'm going to consider this for the time being. I've seen it reported everywhere, but it's just too absurd to take at face value.
Belief is the currency of delusion.
Hey, I submitted this two days ago. What makes it slashdot worthy now?
Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?
Where does it say that this was due to the Win95/Win98 bug? (If I missed something, please let me know.) Just because it happens to be the same amount of time as the Win95 bug doesn't mean it is the same bug. The bug was never present in Windows 2000, AFAIK. And in any case, there's a reason why 49.7 is a "magic number" for uptime (hint: how many milliseconds are there in 49.7 days?), just as there was a reason why "2000" was a magic number for date problems and why 2038 will be another magic number for date problems.
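The later date-related magic number comes from the same kind of overflow. A quick sketch, assuming time is kept as a signed 32-bit count of seconds since 1 January 1970 UTC (the classic Unix convention):

```python
from datetime import datetime, timedelta, timezone

# Unix epoch: the zero point for a seconds-since-1970 counter.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value a signed 32-bit integer can hold.
MAX_INT32 = 2 ** 31 - 1

# Last moment representable before a signed 32-bit seconds counter overflows.
last_32bit_second = EPOCH + timedelta(seconds=MAX_INT32)

print(last_32bit_second.isoformat())
```

By that reckoning the counter runs out in January 2038.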
Just because it runs on (OS) and just because it crashes doesn't mean it is (OS vendor)'s fault. In this case, you certainly can't blame Microsoft: there was a problem in the radio software, the software developers knew about it, the maintenance staff knew about it, it didn't get fixed, and it caused a problem. Where does Microsoft fit into that?
Time flies like an arrow. Fruit flies like a banana.
Wow, slashdot has hit a new low.
"The servers are timed to shut down after 49.7 days of use in order to prevent a data overload,"
That's from the article. Win2K is a completely different OS than Win95. FUD, shameless speculation, and bias. Man, this is just bad.
I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:
Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...
It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.
Computer Science is Applied Philosophy
My goodness, you're funny! You should be a writer for some sort of comedy show! That, sir, is sheer comedic genius wrapped up in only two words. I am stupefied that no one has ever said that before, it's just so damned funny! Oh, Sir Laughsalot, you slay me with your jabs of wit. I must cease my reading of comments for the night, as I have no need for further entertainment and fear that one more comment such as yours will require me to call for medical assistance (if I am able to dial the phone - I will be laughing so heartily, you see.) I will chuckle to myself for days on end due to your obvious command of what is to be construed as comedy. Please continue to post these gems of yours. Oh, how I look forward to seeing more.
Jackass.
You know, if strict product liability were applied to Microsoft, they'd be paying big time.
What those who want activist courts fear is rule by the people.
Frank Stallone.
You know what?
Guys, most of the equipment in use by the FAA isn't new enough to run Windows 2000. I worked on the "state of the art" search radar, and it was built around Sun Ultra 5s.
It's good to use your head, but not as a battering ram.
+1 Funny
-><- no
The shutdown wasn't the problem, or more appropriately, the shutdown that would have prevented the problem was missed. But also the FAA's software probably has some issues of its own that need to be fixed.
On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.
Number of milliseconds in 49.7 days:
60*60*24*49.7 * 1000 = 4,294,080,000
which just about overflows uint32.
"You mortals are so obtuse." -Q
Since when is going from Unix to Windows an upgrade?
That's a lot of nines.
On the bright side, though, it means that
the recent three and a half hour outage
was just a fluke. And also that they can
expect to get another 4000 years of uptime
before something like this happens again.
What was someone thinking when they made
that stat up? Just this outage makes their
system exceed its downtime spec by *OVER
FOUR ORDERS OF MAGNITUDE*. Sheesh.
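Both figures can be reproduced with a few lines. This is a sketch only: it takes the quoted 0.9999999 availability and the 3.5-hour outage at face value, and it assumes a 31-day month as the measurement window for the "orders of magnitude" comparison, which is a guess at what was meant:

```python
import math

AVAILABILITY = 0.9999999          # quoted operational availability
OUTAGE_HOURS = 3.5                # length of the outage
HOURS_PER_YEAR = 365.25 * 24

# Total operating time over which a 3.5-hour outage would still meet the spec.
required_hours = OUTAGE_HOURS / (1 - AVAILABILITY)
required_years = required_hours / HOURS_PER_YEAR

# Downtime allowed in one 31-day month (assumed window), and the overshoot.
budget_hours_per_month = (1 - AVAILABILITY) * 31 * 24
orders_over = math.log10(OUTAGE_HOURS / budget_hours_per_month)

print(f"years of uptime needed to absorb the outage: {required_years:,.0f}")
print(f"orders of magnitude over a monthly budget:   {orders_over:.2f}")
```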
Next upgrade will be to Windows 3.1! W00t! I can't wait.
That information had been filtered at least three times, can't count on that either...
Software analyst -> LA Times reporter -> TechWorld reporter.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
Sounds like whoever wrote the software/OS module they were relying on used this gem. I hereby dub whosoever was so silly as to do this a 'code monkey, first class'.
putting the 'B' in LGBTQ+
Having to shut down a system to maintain its uptime is first a ridiculous idea.
Second, it took several years to find that bug because most windows machines never made it to that 49.7 days and if they did the users just assumed it was normal because it is considered normal for windows to "lock up", freeze or whatever.
Third, replacing unix, known for its stability, with any variant of windows (known for instability) in a system where people's lives are at stake and then having this happen, the guys at LAX who decided to do this should be fired because they just risked a lot of lives and caused massive delays for travellers. In a political situation they would have to resign.
I remember a similar story about an Aegis-class cruiser stuck out in the ocean for three days because they decided to use windows. "Yea, that will work great during a war.."
*sigh* Microsoft has good lobby power and hires a fleet of sales people to keep selling their shod-ware that really should just be kept to mom and pop living rooms.
But then, this is the opinion of a guy who works only with linux and is sitting on an uptime on an openmosix cluster-leader (that also is my dev box) that looks like this:
19:03:06 up 319 days, 5:20, 3 users, load average: 1.28, 0.73, 0.37
eat your heart out LAX.. you got punk'd
anime+manga together at last.. in real time.
I really wish Microsoft would go out of business, quickly and quietly. I don't hate Bill Gates, I don't hate Windows, I'm just tired of hearing everyone bitch about them so much.
No one cares what your captcha was
Houston TX, USA
giant advertisement:
I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.(I'm not making this up- that's really the ad I'm seeing.)
314-15-9265
As an LM employee who works in ATC, all I can say is keep up the good PR!
URET is TEH R0XX0RZ!
Say no to software patents.
This is not Microsoft's fault. It's the FAA's fault. This is an application where there shouldn't be any legacy need to run some old MS Excel macros, or some game that only knows how to talk to graphics hardware using the DirectX API. Windows shouldn't have even been seriously considered, much less used for anything this important.
Blaming this on Microsoft is like blaming Hasbro when their toy Tonka truck fails to pull your mobile home. Someone should have taken one peek at the proposal, seen Windows, and said, "Hey wait a minute -- this is the real world, not SimAirport."
This happened after an upgrade from Unix to Windows.
Because that isn't an oxymoron...
If I throw a stick, will you go away?
One of the things that is delightfully unambiguous is the naval tradition.
If the ship trades paint with anything, it's the Commanding Officer's fault. Yeah, some shrapnel may work its way down the organization chart, but the glory and the gory both rest on one neck...
Would that less time were spent on blamesmanship in our decadent, modern day...
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
... should be:
"Microsoft: Writing the software to prevent SkyNet since 1981."
Do not look into laser with remaining eye.
I'd like to report a typo in this post.
"This happened after an upgrade from Unix to Windows.", should read, "This happened after a downgrade from Unix to Windows"
http://www.freeipods.com/default.aspx?referer=915
My lame blog.
And I hardly see how the Windows 95 bug is relevant to this issue as that clearly isn't what caused the shutdown.
Editors please learn how to do your fucking jobs and reject crap like this. Just because it bashes MS doesn't mean it's newsworthy.
Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.
I'm pretty sure the union official who says that it requires a reboot to avoid 'data overload' misspoke, and meant 'data overflow'. 49.7 days is 2^32 milliseconds.
What you may not have taken the time to observe is that when you run init with a name of telinit or with a process ID other than 1 it runs in 'telinit' mode. In this mode it passes a message via /dev/initctl (a FIFO) to tell the running copy of 'init' (the process responsible for initialising services and managing them thereafter) to perform a specific action (eg shutdown, reboot... etc)
Maybe the administrator intentionally "missed" the reboot when management wouldn't fix the system or fix the backup system.
Maybe the administrator's manager intentionally ordered the administrator to forgo the reboot so there wouldn't be downtime, or to save money, or both.
Maybe the management didn't allocate enough time for documenting procedures.
Maybe the management "saved" money by not adequately training administrators.
Maybe the management "saved" money by hiring administrators fresh out of school (with minimal intern "experience") instead of those with years of real-world experience.
Maybe it's management's ignorance to blame here.
" upgrade from Unix to Windows."
wouldn't a downgrade be the correct term for this?
i upgraded my nt box to unix.
i downgraded my unix box to nt.
sounds about right
Did you buy your account on ebay? Because this is pretty low quality bargain-basement karma whoring for someone that's been on Slashdot for 7 years. It's written more like someone with a 700K account that discovered Linux 2 weeks ago. Oh well, did the job, even if you ended up looking foolish.
The article at: http://www.techworld.com/opsys/news/index.cfm?NewsID=2275
has a headline: Microsoft server crash nearly causes 800-plane pile-up. And next to it you'll see a Microsoft ad that says: Make a name for yourself with Windows server systems
And I guess the FAA did just that too.
I don't think switching from Unix to Windows can be considered an "upgrade."
This sounds like more Microsoft FUD to me. But I might be wrong because I like to use Unix/Linux and therefore my opinion is suspect.
I second that:
possibility related to an old Windows 95 bug
This is pure speculation by the editor. Nowhere in the article is the blame put on the OS. Linking the failure to an error in a previous version of the OS just doesn't make sense.
To quote the article a bit more:
A major breakdown in Southern California's air traffic control system last week was partly due to a "design anomaly" in the way Microsoft Windows servers were integrated into the system
The failure was ultimately down to a combination of human error and a design glitch in the Windows servers
So the failure is due to a "design anomaly" in the integration of the Windows servers or a design glitch in the Windows servers, not a design anomaly in Windows or a design glitch in the Windows OS
Whoever approved this process of manually rebooting a machine should be at fault. The fact that it was a windows operating system, or a unix OS or a purple OS is irrelevant. The problem here is someone thought a valid solution was to reboot a machine once a month.
Oh. Good. Lord.
There's just so much wrong with this picture. At least they picked the version of Windows least likely to flake out.
(Personal nightmare: finding a Windows computer running your life support)
1) this is not a windows OS bug
GetTickCount() will roll over. An _application_ which assumes it is a strictly increasing value will misbehave after the 40-some-odd days expire. That appears to be what is happening here.
Note that nowhere in the article is there a distinction between the "system" and the "OS" or the "application".
2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing. Aircraft engines are serviced at hour-based intervals, whether they need it or not. It's better to just tear the thing down and rebuild it than to have it tear itself apart. Software doesn't _have_ to be this way, but it sometimes is.
Making a complete hardware -> app layer stack 100% failsafe is... tricky. For some applications, designing the system with a known restart point, i.e. a reboot of the app or the entire machine, can be more cost effective (see the earlier paper on crash-only software design). A periodic shutdown/restart in complicated systems can be a valid operational practice.
The fault here is twofold. One, the application/system had a known issue that is probably avoidable, but for whatever reasons, it still has the issue.
Two, knowing that the issue existed, the proper maintenance was not observed, with the expected result: a failure.
Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.
If the system called for a 48th day restart, that's what it requires, and deviation from that has consequences. Luckily no one was hurt.
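For what it's worth, the application-side problem from point 1 can be sketched in a few lines. This is Python standing in for whatever the real application does (which nobody here has seen), with a 32-bit mask modelling the unsigned tick counter. The function names are made up. The point is only that subtracting modulo 2**32 survives a rollover, while code that assumes the counter always increases does not:

```python
MASK32 = 0xFFFFFFFF  # keep values in unsigned 32-bit range

def naive_elapsed(start: int, now: int) -> int:
    """Assumes the counter only ever increases; goes negative after a wrap."""
    return now - start

def wrap_safe_elapsed(start: int, now: int) -> int:
    """Subtract modulo 2**32; correct while the true interval is < 2**32 ms."""
    return (now - start) & MASK32

start = 0xFFFFFF00               # 256 ms before the counter wraps
now = (start + 1000) & MASK32    # 1000 ms later, i.e. after the wrap

print(naive_elapsed(start, now))      # a large negative number
print(wrap_safe_elapsed(start, now))  # 1000
```

The wrap-safe version only helps for measuring intervals shorter than the counter's range. It does nothing for code that stores the raw tick as an absolute timestamp.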
My opinions are my own, and do not necessarily represent those of my employer.
Windows NT has a disclaimer in the license agreement stating that it should not be used in critical job roles like nuclear reactor control, etc.
Maybe they need to update the list. I would suggest everything except their Mine Sweep game.
The race isn't always to the swift... but that's the way to bet!
)Mandate open source software in Government, schools, libraries.
)Make private closed-source software companies financially liable for vulnerabilities.
)Dump the DMCA.
)Invalidate IP patents.
Who's really at fault?
No, don't tell me... ooh, I know this! (Smacks forehead.) Is it... Microsoft?
No, the OP is using something called "inference". In fact, I am inferring that the OP was inferring that the journalist reporting the article either doesn't understand the 32 bit rollover problem or does not want to report all the details required to describe the 32 bit rollover problem.
It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.
As you can read in the OP, he questions "could this be?", not "this is".
Suggest you pull your head out.
I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.
Yes, it really is. They had a system in place which they chose, knowing its deficiencies. To combat one of the deficiencies, they prescribed a procedure to be followed monthly. The procedure was not followed by the technician, so it was human error.
Would you expect your car to run flawlessly if you never put gas in it or changed the oil on a regular basis? If you didn't, whose fault is it? The car's or yours?
If two planes had crashed as a result of the communication loss, I think that the resulting lawsuits, both criminal and civil, against the FAA, Harris and Microsoft would have been large enough to possibly cripple the latter two.
I used to have to reboot our NT Servers due to memory leaks once a month. Although this problem seems related to the application software rather than Win2k, I really have to ask myself what the fucking hell Windows, any version, is doing in a life critical computing environment. Is Windows even licenced for operation in such areas???? And I'm not saying Linux is better, but there are OS'es around that ARE licenced for such operation (Tru64 if I'm not mistaken).
And the fact that the system had to be regularly rebooted, and was actually used in the field although this fact was known, is simply pathetic. Added to which, the fact that they couldn't even automate the reboot smacks of gross incompetence.
Whoah! 7 nines uptime!
About 3 seconds of downtime per year.
Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.
"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.
Wonder how much their failure clause is going to set them back?
Norman Cook's Ode to Sl
psShutdown with task scheduler would have been enough. honestly don't think M$ should be held entirely responsible. any f00l could have set this up.
BTW, going from UNIX to Windows is more of a migration, not necessarily an upgrade.
You need people like me so you can point your fuckin fingers and say, "That's the bad guy." So what that make you? Good?
While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code, not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
You don't blame the bike, you blame the person trying to use a grossly inappropriate tool.
Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:
That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.
Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.
Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
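The 64-bit figure checks out. A quick sketch in Python, assuming the same millisecond tick and counting 365-day years (the variable names are just for illustration):

```python
MS_PER_DAY = 86_400_000  # milliseconds in a day

# Whole days (and 365-day years) before an unsigned 64-bit
# millisecond counter wraps back to zero.
days_64 = 2 ** 64 // MS_PER_DAY
years_64 = days_64 // 365   # ignoring leap days

print(f"{days_64:,} days")    # 213,503,982,334 days
print(f"{years_64:,} years")  # 584,942,417 years
```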
This sig intentionally left blank.
For several hours the airport was EX-LAX... ...after which it got reLAXed.
It probably should. My company uses XP Embedded for a few systems, and doesn't have any software-related problems on them. Ever. The only problems we have are when people snap off antennae that we use for the wireless connections, or something similar. There's no reason that they shouldn't be using something like this to scan baggage. It sounds like someone at O'Hare didn't do their homework.
The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doesn't make much sense to replace strong, stable XP systems that never fail.
This is great:
"The shutdown is intended to keep the system from becoming overloaded with data ... according to a software analyst..."
This "analyst" knows nothing about computers, works for Microsoft, or both.
"Windows Upgrade, FAA Error Cause LAX Shutdown"
Sounds to me like Windows causes constipation. Use moderately.
Despite all the glorious claims made in their advertising, if you have a bad experience while using Microsoft products that was caused by flaws in those products, Microsoft's official position is:
TOUGH NOOGIES (and you were stupid to believe the shit in our ads).
If I were /., I'd be careful. They're getting very close to libel. To take something this serious, and completely spin it around, and announce it in a public forum is just ASKING for a lawsuit. In this case, I think that /. would be fucked if MS saw this and wanted to pursue it.
I don't respond to AC's.
It's actually kind of amazing that it stayed up that long in the first place, when you think about it. Especially if the machine is doing anything at all.
Trying is the First Step to Failing --Homer Simpson
On some systems (Solaris specifically) the Linux-weaned will quickly learn that reboot or halt is NOT the command they wanted to run...
Actually the Linux-derived programs reboot, halt and poweroff do exactly that, but they first check the runlevel... If reboot detects the runlevel is not 6 or s, it will call shutdown to tell init to enter runlevel 6. If halt/poweroff detects the runlevel is not s or 0, it will call shutdown to tell init to enter runlevel 0. They are designed to do double duty... to be called at the end of rc.d scripts and for super-user usage.
You can force them to immediately shutdown or reboot without checking the runlevel by using the -f option.
Of course, the SunOS supplied binaries do not have this safety check... I'd recommend against getting used to that. Just pass the appropriate option to shutdown... (-r for reboot, -p for poweroff, halt is the default)
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
It stops running if I don't fill up the tank every 300 miles.
paintball
Holy Jeez, I can't wait till we get medical equipment that's built on Windows XP Embedded!
Nurse: Uh, the respirator shut down
Tech: reboot it, everything should be fine
Nurse: ok, respirator is working again, what about the patient?
Tech: hmm... can't reboot him, huh..?
Nurse: nope, he's cold
Tech: well, at least the embedded web browser is working, maybe we can find him a family plot.. or email john edwards!
Well, as I pointed out in one of the forums, blame may be something that is shared amongst several parties, including MS. Software is custom code and prebuilt parts, some of it MS, and some from other vendors. The days of completely from-scratch code are over. All of the above interact in complex ways. The truth, unfortunately, more often than not makes for lousy press.
That's every 584,942,417 years. Which is simply not going to be good enough in my book.
"It is a greater offense to steal men's labor, than their clothes"
Obviously, this links Microsoft to terrorism. Could there be any more conclusive evidence!?
Seriously, though, LAX needs to start to learn to check on integration stuff before they implement it into one of the biggest airports in the world. If I recall correctly, a recent radio commercial for some TV show or movie described it as "a city in itself". I would have thought they would have tested a little bit more before implementing this stuff.
- Code Dark
I toured a RAPCON facility on an Air Force Base shortly after 9/11 (was a miracle for a civilian to get to do so, but I have friends in locally high places). Most of the serious computing hardware that was obviously visible was all very recent IBM RS/6000 AIX boxes, but there were a couple Compaq Proliants running Windows 2000. All the radar displays were Tektronix gear, the likes of which you'll not find anywhere on Tek's website :-)
One of the greatest ads ever to appear on Slashdot: http://a1767.g.akamai.net/v/1767/2939/30d/imageserv.adtech.de/images/Ad247098St1Sz225Sq1Id1.gif
They could have all sorts of software that requires manual steps on shutdown and restart. It happens all the time.
Whereas on a modern Linux box you could probably script most actions, on Windows it's usually not that easy - even with Windows Scripting Host, most MS shops like to keep everything "standard" or "out of the box."
- It's not the Macs I hate. It's Digg users. -
...LAX is pronounced as "laks" and means something like "too lazy to do anything". :)
- Save a tree, eat more woodpeckers
It's probably not a Microsoft problem if the system is running on NT; NT uses a 64-bit time.
It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.
It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
OR
It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
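The mis-cast scenario is easy to picture. A hypothetical sketch in Python (the helper name is invented, and the mask stands in for storing a wider value in an unsigned 32-bit field): an uptime in milliseconds survives the truncation on day 49 and comes out as nonsense on day 50.

```python
MS_PER_DAY = 86_400_000  # milliseconds in a day

def to_uint32(value: int) -> int:
    """Keep only the low 32 bits, as a cast to an unsigned 32-bit field would."""
    return value & 0xFFFFFFFF

uptime_ms_day_49 = 49 * MS_PER_DAY   # 4,233,600,000: still fits in 32 bits
uptime_ms_day_50 = 50 * MS_PER_DAY   # 4,320,000,000: exceeds 2**32 - 1

print(to_uint32(uptime_ms_day_49) == uptime_ms_day_49)  # True
print(to_uint32(uptime_ms_day_50))                      # 25032704
```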
I really don't see how Microsoft could be to blame here at all...
Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened. However, if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occurred at 0029:C02FDEC6', would that person please come forward to the cabin immediately?
Liberals call everyone Nazis yet they are the closest thing to it.
Unless it was SCO Unix, switching from something that works to something that doesn't is not an upgrade.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
Let this be a lesson to all the mouse-wiggling MCSEs out there who scorn the uptime of UNIX and shun the power of the command line. If you are running a critical Windows Server, REBOOT EARLY and REBOOT OFTEN. Remember, REBOOT-ing is part of the job description and it has to be done. Please protect our key infrastructure and reboot your servers WEEKLY! Just because the UNIX guys get 2 years of uptime doesn't mean you can too. It just doesn't work that way.
Might I suggest this wonderful little tool. Poweroff. It's the only tool I know of which seems to be able to reliably reboot Windows boxes, even when they are crippled due to worms and/or memory leaks. It can even close running apps. Also, you can get it to work over the network with a magic packet, in case Terminal Server crashes or is too slow to use.
The main article should get flagged as troll/flamebait due to the phrase upgrade from Unix to Windows. That wasn't an upgrade; it was (as we now know) a disaster waiting to happen. Wait until the worm of the month comes through and shuts it down. When will people learn to use the RIGHT TOOL FOR THE JOB! If it has to run 24x7 forever, don't put it on Windows. Geez...
The post you are replying to says 0.26 seconds. You boot in 24 seconds. That's about 92 times faster than the allotted time...
warning: This post is likely to contain gobs of dripping sarcasm. Consume at your own risk.
A system was deployed in which the application (not the OS) failed after a finite time, and it was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.
I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
Don't deploy flawed software.
Make sure redundant systems work.
As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?
Learn to love Alaska
Where do you want to land today?
"Eve of Destruction", it's not just for old hippies anymore...
...that it's not Microsoft's fault.
Here's what happened:
The FAA installed a new system. There were bugs in that system, in the custom software the FAA uses to move planes around the sky. Instead of fixing those bugs properly (as they apparently did in Seattle), the FAA instead went with the quick fix of rebooting the server every month, and backed that up with a script rebooting the server automatically if it's not done manually. Then, the FAA techs didn't follow the FAA's workaround procedures, and Chaos results.
Exactly how was this Microsoft's fault? Maybe I'm wrong here, but I don't see what MS did here. And OpenSource wouldn't have solved this problem, because I really doubt that anyone is going to write FAA flight control software under an open source license.
144l. ph34r my 133t l3g4l 5k1lz!
No, the OP is using something called "inference". In fact, I am inferring that the OP was inferring that the journalist reporting the article either doesn't understand the 32 bit rollover problem or does not want to report all the details required to describe the 32 bit rollover problem.
Exactly. Bringing servers down to "flush data"? Methinks Holi is not a programmer but still feels qualified to talk about the internals of the software. But this is slashdot, so what else do you expect?
lol CAASD@MITRE ownzors j00
Let me see, you claim that the error was in the integration of the Windows server. Thanks for clearing that up, because the submitter wrote, "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw."
Ya, way to correct that error. It was in the integration of the Windows servers, not a Windows 2000 integration flaw.
...upgrade from Unix to Windows? Wouldn't that constitute a downgrade?
Upgrade? Is going from the automobile to the horse and buggy an upgrade? Is going from a modern jet to the Wright Flyer an upgrade? Is going from running shoes to bare feet an upgrade? I could go on forever.
How ya like dat?
Don't tell anyone...
Pilot here, and this has been a well known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Right?
I mean, why bother writing a timed script if it doesn't have a failsafe?
Note to self: NEVER FLY IN OR OUT OF LAX. I've got enough computer problems as it is, I sure as hell don't want more while I'm six miles above the ground.
A UINT32 will overflow after about 49.7 days if it contains the number of milliseconds since execution start. This is just a fact. Yes, there was a bug in Win9x where such a counter would overflow in the OS. But still, it's amazing how quick the Slashdot anti-MS zealots are to point fingers without even considering the fact that it might have been a bug in the FAA software?!?!?
...but they couldn't script it to do an orderly shutdown? I mean what does the technician do differently that it doesn't interrupt air traffic?
Three words: GetTickCount()
Returns the number of milliseconds since the machine was last booted.
From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.
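To make the timestamp problem concrete, here is a hypothetical sketch in Python (the record labels and times are invented, and the mask stands in for a 32-bit tick counter). Three records are created in order A, B, C, with C created just after the rollover, and sorting by the stamped tick puts the newest record first:

```python
MASK32 = 0xFFFFFFFF
ROLLOVER = 2 ** 32

# (label, true milliseconds since boot) for three made-up records
records = [("A", ROLLOVER - 2000), ("B", ROLLOVER - 1000), ("C", ROLLOVER + 500)]

# What a 32-bit tick counter would have stamped on each record
stamped = [(label, true_ms & MASK32) for label, true_ms in records]

# Ordering by the stamp: C carries tick 500, so it sorts ahead of A and B
by_stamp = [label for label, _ in sorted(stamped, key=lambda r: r[1])]
print(by_stamp)   # ['C', 'A', 'B']
```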
The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.
So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).
Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925
That is the perfect response!
Since we are being technical about the answer, does this mean Microsoft or the software vendor qualifies as a terrorist organization?
Consider the fact that an entire airport was shut down, lives were disrupted, major economic harm was caused our airlines as a result of flights not getting out on time. LAX is a major hub that connects travelers throughout the country, it is conceivable traffic patterns throughout the U.S. were put out by this problem.
Think of it like a car bomb that went off without anyone dying, and you see my point.
M
It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.
It also happened to some Cisco routers. Should I presume that those affected IOS versions were rare Windows based IOSs?
If anything breaks, it must be Windows' fault. It could never be the application developers that made bad code. The only people that make bad code work for Microsoft.
I love it when the cashpoints crash personally. I mean, money isn't important is it?
Your cashpoint; built on NT technology.
It is at this point that you can be certain of only one thing: PHBs rule the world.
Microsoft operating systems have never been good. Never. Not once in the past 10 years. How long does it take for this to sink in folks?
Friends don't let friends buy Microsoft products.
I was kinda just assuming that some human interaction was required for the reboot process, such as engaging backup radar, or notifying appropriate people first (though that is just an assumption). Otherwise, yes, as other people have suggested, you could just have an automatic reboot.
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19
a move from Unix to Windows is an UPGRADE!!!???!!!
since when???
From Harris.com
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.
Less than one percent uptime!?!?! No wonder the thing crashed, it's supposed to do that, ALL THE TIME! Bill Gates must be proud.
Linux O Muerte!
Microsoft operating systems have never been good. Never. Not once in the past 10 years. How long does it take for this to sink in folks?
You are just incompetent. The only reason you can't keep up a Windows box is because you don't know how to work a computer. I had to move the UPS in a server room. There were NT servers I had to turn off with uptime of over 5 years. It brings a tear to your eye to reboot something like that, whether it is running NT or UNIX.
Song Airlines' in-flight entertainment system runs Linux. The system allows the passengers to listen to MP3s, see a moving map or watch Dish Network live.
After my flight landed they rebooted the system and I saw a friendly penguin and a bunch of startup messages. I noted that they were using a non-GPLed video driver.
Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.
I'll bet that (curiously) every time you did this, your local system shut down and re-booted instead.
Why do we have to point fingers at each other after a major failure?
Mostly it's poorly planned systems that fail, not people. Fix the problem.
Pointing fingers makes people defensive in the long run, and raises the probability of it all happening again.
... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?
Crappy design, regardless of who is at fault.
--- I do not moderate.
"overloaded with data"?? Har har har. Translation: memory leaks, a standard feature of Microsoft software.
an upgrade from Unix to Windows.
Yeah, maybe in upside-down crazy world...
[I said, no text.]
If you have a windows machine with a 5 year uptime then YOU are the incompetent one. I don't believe any windows OS has gone 5 years without security patches being released.
And as much as you want to blather about how people must be incompetent, nobody believes you. We all know you are a little shitface in your mom's basement sticking up for windows because you are too fucking stupid to use a computer.
http://www.harris.com/view_pressrelease.asp?act=lookup&pr_id=77
Weird reloading link from the slashdot article text!
now3djp
According to this press release, VSCS offers "an operational availability of 0.9999999."
Someone check my math, but that appears to come out as 3.16 seconds annually, so their 3-hour outage burned up all their allowed downtime for the next 3,422 years.
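Checking the math, as requested. A small sketch in Python, assuming a 365.25-day year and taking the outage as exactly three hours (both are assumptions for the arithmetic, as is the helper name):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # 31,557,600 s

def allowed_downtime_s(availability: float) -> float:
    """Downtime per year, in seconds, permitted by an availability figure."""
    return (1.0 - availability) * SECONDS_PER_YEAR

budget = allowed_downtime_s(0.9999999)   # the "seven nines" figure
outage_s = 3 * 60 * 60                   # a three-hour outage, in seconds
years_consumed = outage_s / budget

print(f"{budget:.2f} s per year")        # 3.16 s per year
print(f"{years_consumed:.0f} years")     # 3422 years
```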
So it should be quite safe to fly now, statistically speaking.
omg i love nerds. especially ones that love math. oh baby. oh baby. my name is beth.
Vandemar.org
The 32-bit Unix clock will roll over in 2038, so 32-bit machines will have to run a 64-bit clock by then. Oh well, computers using this system were meant to be replaced by then.
Thank god AMD decided to roll out 64 bit now.
The Unix uptime counter is simple: record the start time, subtract it from the current time (both recorded in seconds from a particular date), and that gives you uptime in seconds. Is the uptime counter in Windows where the extreme overhead comes from? Why in heck would you need to be recording milliseconds? No real need; seconds are good enough.
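For reference, the date at which a signed 32-bit count of seconds since 1970 runs out can be computed directly. A minimal sketch in Python, assuming the usual Unix epoch of 1 January 1970 UTC (the constant names are illustrative):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1   # largest value a signed 32-bit counter can hold

# Last moment representable before the counter overflows
last_moment = EPOCH + timedelta(seconds=MAX_INT32)
print(last_moment.isoformat())   # 2038-01-19T03:14:07+00:00
```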
How can you upgrade from Unix to Windows, downgrade perhaps, never heard of any upgrade like that before.
Now I have to worry about the fact that my safety (and my family's) is in the hands of incompetent Microsoft.
That sucks.
...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).
* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS
* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criterion they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!
* Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.
* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).
I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.
Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...
Not surprising, considering the Navy previously had a destroyer running Windows NT that crashed on a divide by zero or something, leaving the ship stranded at sea until it had to be towed back. This is no different.
recommends rebooting our production AIX box at least once a month -- it serves a database only (no interactive users).
Couple of tb of disk, couple of gb of ram (or more) and a dozen cpu's and we have to reboot it monthly.
It's called maintenance. It is required.
Now that I've heard that the application was possibly part of the problem, I can't help but think of the large number of barely literate Windows "programmers" that are out there. What was the application written in? VB? .Net? My kids can write code with VB (it's just not GOOD code). Let's get some geeky grey haired UNIX programmer to do the job and do it right!
=-=-=-=-=-=-=-= - The Celtic - =-=-=-=-=-=-=-=
http://shit.slashdot.org/article.pl?sid=04/09/21/2120203
WTF, this is an amateurish bug. An uptime overflow bug, my lord; this is the CS201 kind of bug you see on a PowerPoint presentation with the title "what's the bug here, students?". I think M$ should divert some of those marketing funds towards better quality oversight.
If we scale that to count nanoseconds instead, we run into a problem during the year 2554AD. We may have found a solution by then.
> Tell you what, can you get me new boards for an IBM RT pc? I
> highly doubt it.
I've actually dealt with IBM in the "we need support and replacement parts for legacy hardware" capacity before.
And yes, if you've bought IBM in a professional/enterprise capacity, you've also bought the support contract. And if you've bought the support contract (And if you didn't, you deserve to be fired. Why the hell would you pay the IBM premium except for their support?), you can get parts and expert support for damn near everything IBM's ever made; all the way back to card punches/readers, and farther I'd bet. Remember, when you buy IBM, you're buying a MTBF of thirty YEARS.
cya,
john
Imagine all the people...
I didn't even know Windows kept track of uptime.
Obviously since uptime is stored in a 32 bit integer, Microsoft themselves never expected Windows to reach 50 days of uptime. Kinda telling isn't it.
I'm sure this has been brought up before, but why not bring a suit against M$ for selling a defective product? What makes bugs in their product any different than a car whose wheels fall off because of faulty lug nuts?
I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.
:P
It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.
In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.
I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FAA case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use.
burnin
The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)
Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...
(Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)
Hacker Public Radio is our Friend
As someone who has worked on the project in question, I would just like to note that the original system was hosted on a Tandem computer system, which uses its own OS. In that respect, Windows was an upgrade. Personally, I would have rather seen a UNIX replacement put into place.
IW4M.
Got time? Spend some of it coding or testing
If it went down for three hours, now it's got to run for 3400 years in order to make up the claimed operational availability:
"The Harris-developed VSCS - based on independent, distributed processors and switches - allows air traffic controllers to establish all air-to-ground and ground-to-ground communications with pilots and other air traffic controllers. The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999."
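For anyone who wants to check the 3400-year figure, here is a rough sketch of the arithmetic. It assumes the outage really was three hours and that "operational availability" means uptime divided by total time; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, including leap years


def required_runtime_years(outage_hours, availability):
    """Total operating time (in years) over which `outage_hours` of
    downtime is still consistent with the claimed availability."""
    unavailability = 1.0 - availability        # allowed fraction of downtime
    total_hours = outage_hours / unavailability
    return total_hours / HOURS_PER_YEAR


# Three hours of downtime against the claimed 0.9999999 availability.
print(round(required_runtime_years(3, 0.9999999)))
```

The result lands a little above 3400 years, which matches the estimate in the comment.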
Microsoft makes plenty of other timers available. The nature and behavior of GetTickCount has always been documented and publicized, it's just that some stupid dumbasses (Microsoft not excluded) decided that it would make a good general purpose timer for processes that would be running. It's not the recommended method for maintaining a date and it has the worst resolution of any of the system timers.
This is a problem that goes deeper than Windows vs. Unix, it has little to do with the operating system or the hardware and even the application has little to do with it.
If you must assign blame, you should probably point fingers at the people who spec'd the system out and perhaps submitted to the cost-constraints demanded by the bean counters.
Any system that lives depend on needs to have fail-safe features and redundancy built into it and a completely separate fall-back procedure that can be implemented at a moment's notice.
It can almost be assumed that something will go wrong and when it does, the equipment needs to be able to handle it as transparently as possible, otherwise you can just about count on human error to make matters worse.
These kinds of systems can end up being very expensive to build. It is very tempting to remove what can be seen as "bells and whistles" from the package to save money.
Unfortunately, these bells and whistles are designed and intended to save lives.
I honestly don't know if that is what happened in this case, but I've seen it before where lives were at stake. It is another version of the low-bidder syndrome.
I don't think blame should be assigned to the technician who missed the task...
Boss: OK Tech, it's your job to see to it this computer is rebooted monthly.
Tech: Will do Boss!
*Time Passes, System Crashes*
Boss: The system crashed, why is that?
Tech: Well, it's because I didn't reboot the system like I should have.
Boss: Oh well, I guess it's not your fault, obviously I failed to realize maximum security synergy in my systems.
Wherever the submitter works, I wanna get a job there!
...try working with someone who describes the system on his machine as "Word" and complains about the (boilerplate) fax template from MS-Word not being present in OpenOffice as being one of the most important "failings" in the system. I kid you not.
From the sound of it, that's not far from the territory the GPP is in.
Got time? Spend some of it coding or testing
The radar and the guidance system had separate clocks, and they'd drift out of sync.
Here's a detailed analysis by the General Accounting Office.
'Life's easy.' There, now somebody finally said it. I don't ever want to hear 'Nobody ever said Life's easy' again.
"If anyone ever told you life was easy, they lied." - mom
The reboot was to reset the logic flaw in the MS system timer. Read my post here on it. It has affected other MS-made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect non-MS-employed programmers to get affected too, since they do not have the same level of access to the proprietary OS.
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
WTF is the FAA?
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Is it still called an upgrade when it flops so badly?
You're almost entirely right. NTLM2 is a separate protocol from Kerberos, though. It's used by downlevel clients who can't speak Kerberos, and is also used by your DC's when you're not running in Windows 2000/2003 Native Mode.
OpenLDAP and AD's LDAP crap are both based on the original UMich LDAP code.
Which is an improvement - albeit a questionable one in this case....
I am very small, utmostly microscopic.
This happened after an upgrade from Unix to Windows.
I did not know that this could be considered as an upgrade........
Havin' it large, livin' the life, Welcome to the land of the rising sun.
Windows 2000 can stay up for more than 2^32 milliseconds, but software that depends on GetTickCount() being correct can't. That's probably what happened. They could have rewritten the software to use a 64 bit time variable, or they could have worked around the bug.
They didn't, and that caused the crash. Not "buggy windows".
The fact that they couldn't even figure out how to run a scheduled task in Windows to reboot the machine is just pathetic, and shows how incompetent they really are.
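To illustrate the point about software that depends on the tick count, here is a small sketch of the difference between a naive elapsed-time calculation and one that survives the rollover. It does not call GetTickCount() itself; it just simulates a 32-bit millisecond counter in Python, and both function names are invented for the example.

```python
TICK_MODULUS = 2 ** 32  # a 32-bit millisecond counter wraps to 0 after this many ticks


def naive_elapsed(start_tick, now_tick):
    """Plain subtraction: goes badly wrong once the counter has wrapped."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Elapsed ticks computed modulo 2**32, so a single wrap is harmless.

    Only valid while the real interval is shorter than 2**32 ms.
    """
    return (now_tick - start_tick) % TICK_MODULUS


# Simulated readings: one second before the wrap, and half a second after it.
start = TICK_MODULUS - 1000
now = 500

print(naive_elapsed(start, now))      # large negative number
print(wrap_safe_elapsed(start, now))  # 1500
```

In C the same effect falls out of subtracting two unsigned 32-bit values; the modulo here just makes that explicit, since Python integers never overflow.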
autopr0n is like, down and stuff.
I shouldn't bother even replying, but...
An NT machine with uptime > 5 years is perfectly possible. WinNT 4.0, 5.0, 5.1, 5.2 (that's Windows NT4/2k/XP/2k+3) are not that bad, and keep on getting better. I'd even say that 2k and 2k+3 are good. It's true what MS say about most crashes being the result of driver problems. I develop .Net code on this Windows box all day at work, and I reboot once a week, when I power down my machine for the weekend, if I remember. I've gone a couple of months without a reboot to see what happened. Nothing happened. Last time we took down our production DB (ok, to apply a security patch), which handles way over 500,000 transactions a day on ms-sql2k, it had been up for 8 months without missing a beat.
Yes, MS releases security patches. No, it's not always necessary to install them. A good admin will have disabled all unnecessary services & features, and if there is a patch for a service you aren't using, why would you install the patch, especially if the machine is running inside a trusted network?
Who's really at fault?
According to the headline, looks like Slashdot's already decided.
The decision to replace the legacy system was made the same week RadioShack quit selling vacuum tubes. Coincidence? I think not.
I like my women how I like my golf courses: with a windmill hole.
Ouch!!!
And they're not alone. Our "core business" runs on an IBM zSeries machine running IBM's z/OS. It only needs to be IPL'ed about once every 9 months or so. Actually, we do that just so that we remember how to! The hardware is so reliable that it can even do non-disruptive microcode loads most of the time. And it is always doing "self checks" to detect and automatically (no software notice) correct itself.
Anyway, UniSys is doing a review to show how this business can be "transparently" rehosted onto their Windows server(s). Why? "It's so much less expensive." We reboot our Windows servers at least once every two months due to problems. Oh, this is __not__ scheduled. The Windows support team doesn't believe in __scheduled__ outages. Why? Because they can only be scheduled around 2 am on Sunday. And __they don't want to be bothered!!__
it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.
Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?
Critical System + Maintenance = Must Be Done.
The system was designed and setup in a particular manner. In fact, the reboot rule was added to the design of the system, so that this very thing would not happen.
Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.
The discussion of whether the procedure of rebooting a machine every month is inane, is something different.
What would happen if a group of people out of the goodness of their hearts wrote them a new system that truly did everything they needed. Would they adopt it?
Or are the corporate powers that be so out of touch with reality that they wouldn't touch anything having to do with "open sores!"
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
If the system had been updated the problem would not have occurred. How is this a microsoft problem? They cannot force system maintenance.
http://jayceecorder.blogspot.com
nuff said
Oh, by all means, be the good consultant will you? Which of the raft of binary cruft which must compose the system was compiled with the wrong SDK? I'm sure everyone would love you to death if you could reach into the DLL hell and pull out the offending bits. The guy who's supposed to go and reboot the thing once a month will be especially pleased with how clever you are.
It's funny how people pointing their fingers at one or another potential causes think that mitigates how nasty M$ is as a platform. How pathetic a system is it that does not have reliable system timers? How much even more pathetic that someone's goofed timer can pull the whole system down. Oh, but it's a timer, see? No, it's just a "data overload" that will give traffic control incorrect information. How about they should have automated the reboot? As if you want faulty software deciding when it should stop giving your air traffic control info or you would trust it to come back up on its own. The boss blamed his tech who missed the once a month reboot as if that was never going to happen. It's junk and you should not use it so it's not M$'s fault is my favorite though, right behind just don't use it.
The last two hit it on the head. M$, You have to be crazy to use it. Remember that the next time you think Winblows might be a reasonable candidate for anything. When the thing goes tits up, the blame gets put everywhere but and on you. So much for vendor support.
Friends don't help friends install M$ junk.
Maintenance Task.....Holy shit.....!!!
Maintenance Task.....A monthly Reboot ?????
Well there goes the farm Mildred....
Shut er down Ma....she's suckin mudd....LOL
Aren't there about 7 million freeware apps that will reboot your computer after a certain amount of time? Can't you just write a stupid shell script?
Did the REAL computer tech quit, so they couldn't figure out how to operate the Unix box? Christ.
Please stop stalking me, bro.
http://www.brothersoft.com/Utilities_System_Utilities_Sleep_Timer_4576.html
Sheesh.
Please stop stalking me, bro.
Forgetting all the talk about Microsoft and Win95/98 and the defect in the OS that has been well known for years and for which a patch has also been available for years....
If you have a system that has a known failure point at 49 days, when do you perform the mandatory reset?
For the failure that is described, the scheduled reset must have been "every 30 days", which is, frankly, INSANE!
If they had scheduled a mandatory reset every 14 or 15 days, they would have had to have had three failures before disaster struck. As it seems, one failure was all it took.
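The reasoning above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the failure point is exactly 2^32 milliseconds of uptime (about 49.7 days), that resets are scheduled at fixed multiples of the interval, and the function name is hypothetical.

```python
import math

MS_PER_DAY = 24 * 60 * 60 * 1000
FAILURE_DAYS = 2 ** 32 / MS_PER_DAY  # about 49.7 days of continuous uptime


def misses_until_failure(reset_interval_days, failure_days=FAILURE_DAYS):
    """Number of consecutive scheduled resets that have to be skipped
    before uptime reaches the failure point.

    After k consecutive misses the next reset opportunity is at
    (k + 1) * interval days, so the system fails once that opportunity
    falls after the failure point.
    """
    return math.floor(failure_days / reset_interval_days)


for interval in (30, 15, 14):
    print(interval, misses_until_failure(interval))
```

With a 30-day schedule a single missed reset is enough, while a 14- or 15-day schedule needs three misses in a row, which is the margin the comment is arguing for.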
I want to know who is the moron for making a switch from a unix system to a windows system.
to me that is the moron who should get fired.
I think there something more going here - haven't they heard of automation and scheduling a job?
Perfect timing for this comment. I was in the airport yesterday (Detroit). The screens over the metal detectors/ carryon xray machines do nothing except tell you whether the lane is open (a large arrow) or closed (a large X). 4 of the lanes had some sort of Windows error message. Apparently they couldn't handle the workload.
I started to write a long comment, but there's no point; unfortunately this is the way things are today. Trust me - the more computer system decisions are made at manager level instead of by people who know how to build systems, the worse it gets. Used to be that way - compare the financial / manufacturing systems that have been running for years to what we do today - any questions? Some of my old systems from the 70's are still running - none of my new systems can stay up more than 10-12 months, AND I was told to build them that way. And no - CAD systems, CRM, protocols, worldwide networks for finance / airlines / etc. have been around since the early 70's, so complexity is no excuse. Just don't give up - maybe some day (after my time...). And let's forget the Windows / *nix argument: it is more difficult to build reliable systems on Windows, but it can be done - Windows is just more primitive, you have to design / code at a lower level, it is harder than *nix, but so what?
I was on the lucky team that *lost* the bidding for the replacement system; IBM's team were the poor bastards who won, and were stuck investing seven years into building an unbuildable replacement, pouring billions of dollars down the drain while being micromanaged by the FAA, who didn't know much about software design or reliability in spite of having a methodology that required producing 175 design documents over the optimistically 3-year design period.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Or just write proper software that can handle the rollover. Sheesh, that's the correct fix.
The system that failed at the Los Angeles ARTCC (Air Route Traffic Control Center) is known as the VSCS (Voice Switching Control System). The backup system is known as VTABS (VSCS Training And Backup Switch). The system was installed to replace the aging WECO (Western Electric Company) system, basically a hard-wired system. If one line went down, there were others to take its place. I knew that it would only be a matter of time before a failure in the VSCS system would take down all of our communications, not just ground-to-air, but ground-to-ground as well.
Why didn't the backup system work? Well, the rumor mill has it that VTABS was not properly configured such that when VSCS failed, the frequencies and the lines each controller needs at his particular sector were not available in VTABS. If this was the case, it sounds as if someone in automation wasn't doing his job.
I know that one of the worst scenarios for a controller is to lose his ability to communicate. He can no longer control.
Here's some more info if you're interested...
http://www.harris.com/view_pressrelease.asp?act=lookup&pr_id=78
FAA would be Dept. of Transportation or maybe Homeland Security.
This happened after an upgrade from Unix to Windows.
what's your definition of an upgrade?
The 40-year-old system was pretty much the Mos Eisley of software design - you'll never see a more wretched hive of scum, villainy, and undocumented unmaintainable Jovial code running on IBM 360/50 and 360/90 hardware. The backup system was much cleaner (and much dumber); I think the main thing they did in the 1970s enhancement was retread the design to use transistors instead of vacuum tubes, though I never worked directly with that side.
Yes, Sun and IBM machines fail - that's why all of the critical parts in our designs had to be at least doubly redundant, and often triply redundant, because the design spec of "Eight 9s of reliability" meant that doing an hour a year of preventive maintenance might expose you to too much risk from the backup system failing. I haven't seen IBM's design; I was on the lucky team that didn't win the bidding to build the final system, unlike the poor suckers at IBM who had to implement theirs, but the requirements were not only insanely non-implementable, they were excessively focussed on No Possible Downtime Ever, because if anything goes wrong resulting in an airline crash, the FAA gets insane amounts of political heat. Doesn't matter if the system is N years late, because you can try to blame the contractor for that, or if you can't fly supersonic planes across the Continental US because they're too fast for the new ARTCCs, because tough luck for the French and for bi-coastal business travellers.
Of course, that doesn't mean that I'm inclined to trust a system running on Windows, either...
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Wow, interesting, informative tidbit. Thanks.
No version of Windows has been certified as telecom carrier grade (99.999% reliable). All of Microsoft's programmers and billions can't make Windows reliable. Microsoft won't even attempt to pass the telecom carrier grade certification test. There are versions of Linux and embedded Linux that are certified telecom carrier grade reliable.
There has been a serious security flaw in Windows NT 4.0 for a couple of years that has not been fixed. What is Microsoft's solution? Let support for Windows NT 4.0 expire at the end of the year, so Microsoft won't have to fix the serious security flaw. Linux 2.0 (which is older than Windows NT 4.0), 2.2, 2.4 and 2.6 are still supported with the latest security patches.
It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.
That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.
In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
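For the curious, the 49.7-day figure is easy to reproduce. The sketch below just does the unit conversion for a counter of a given width and tick rate; the function name and parameters are made up for illustration and have nothing to do with the actual LAX software.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rollover_days(bits, ticks_per_second=1000):
    """Days until an unsigned counter of `bits` bits wraps around,
    counting `ticks_per_second` ticks every second (1000 = milliseconds)."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY


# 32-bit millisecond counter, as returned by GetTickCount().
print(f"{rollover_days(32):.1f} days")

# The same counter widened to 64 bits, expressed in years.
print(f"{rollover_days(64) / 365.25:.3g} years")
```

The first line prints 49.7 days. The 64-bit case comes out in the hundreds of millions of years, which is why widening the variable makes the rollover a non-issue in practice.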
Failover clusters aren't trivial - I worked on a non-winning design for one of the predecessors to this system back in the late 80s (fortunately for us, we lost, and unfortunately for IBM, they won.) Yes, you can have two, three, or N of everything, but then you need a lot of code watching the redundant components to see if any of them appear to be failing, and code deciding which redundant subsystem is correct if two or more of them disagree, and code watching the watchers to make sure they're still watching well, and data communication protocols that work ok when all messages are transmitted redundantly to the redundant processors, possibly getting different results at microsecondly different times. One of my coworkers had worked with an early "fault-tolerant computer" system which had triply or quadruply redundant hardware, but had an operating system that crashed at least weekly because it was too complex.
You also have to be extremely careful and flexible in your design for the granularity of the redundant subsystems - if you make the separately processed chunks too big or too small, you can have an order of magnitude change in performance and sometimes several orders of magnitude change in reliability, and then there's the problem that the definition of "reliability" includes "probability that the calculation finishes in N milliseconds", so it's inextricably linked with performance.
Moore's Law is really your friend here. Improved performance means you can use a lot fewer parts, which reduces complexity and failures. Disk drives are more than an order of magnitude more reliable, and the increase in size means that a cluster of disks containing N gigabytes is several orders of magnitude more reliable because it's a lot fewer disks, and CPUs that are 2+ orders of magnitude faster mean that it's easier to guarantee that something happens in a given time, and cuts down on communications steps between different modules, so you cut down on all the failure modes for those communications, and on the monitoring software watching for failures, and on the failures of the monitoring software. On the other hand, Moore's Law lets operating system vendors and application vendors bloat their software with features - X Windows 11R3 ran just fine on my 386/25 machine with 8MB RAM, but of course I was using twm, not Enlightenment or Gnome.
Backup plans do introduce the danger of complexity - the FAA doctrines of the 1980s were that any new system had to be able to interoperate with everything its predecessor interoperated with, because you weren't going to flash-cut upgrade everything at once. That meant that everything you designed had to be bug-for-bug backwards compatible with the predecessor's interfaces, and when they redesigned the thing _your_ system interoperates with, it has to be bug-for-bug compatible with everything your system does, which means being compatible with its predecessor, which was compatible with your system's predecessor, etc. It's a vicious circle similar to the messes Windows and Intel CPUs had to put up with, except that while the 8088 and MS-DOS 1.0 were *ugly*, they were at least small and well-documented late-70s technology, as opposed to poorly-documented 1960s JOVIAL and 1950s analog.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
See this smart man's post.
http://it.slashdot.org/comments.pl?sid=122665&cid=10314632
MS is very much on the hook for this one.
If cars are a valid analogy to operating systems, Linux cars work on "zero point energy", which means that, at the worst case, you should stop every few hundred miles to drain your bladder.
Sweet we finally found a job that George W Bush has created and can simultaneously perform!
You are blaming M$ because the tech forgot to reboot? That's like blaming Ford if the idiot at your garage forgets to screw your lug nuts back on, and your wheels fall off at 90 mph! Jesus you fuking linux cry babies. How easy would it have been to write an auto-reboot script on a schedule? This isn't an open/closed source argument dumbshit! BTW, this is syrrys, I'm over my post limit. So fukyou!
Looking at the www.faa.gov home page, it says "Department of Transportation". However, having been a systems engineer and administrator in a couple of stints at one of the DOI Bureaus ... you don't want to know.
Fragmentation is another problem besides leaking, but it can also lead to systems getting progressively slower until they drop below some critical performance threshold.
And disk drives _do_ fill up with log files unless you do something about it.
Back when my department used Vaxen, we'd reboot them every Friday night, fsck the disks, and do backups. Around the time we were running SVR2, the file system really was stable enough and the removable disk packs high enough quality that fsck seldom found anything and didn't need manual intervention, and the rebooting process was reliable enough that we could let a cron job run it, and while we could have cut back to monthly, people had gotten in the habit of knowing the machine would be down, so they could get a life, and it was a good schedule for the times we actually did want to do upgrades or hardware maintenance.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Huh, how so? How about fixing the shitty windoze API and making GetTickCount() return a 64 bit value?
In 25 years working with Unix systems, I've never seen that instruction. That must be because I've never worked with any Microsoft Unix system...
[...]it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task[...]
:)
:)
:P
Now I'm thrilled. So now it seems rebooting regularly just to avoid death of that Windows has "evolved" from a ridiculous flaw to a technician's maintenance task
That's really worth a smile, at least where I come from
And right, critical radio outage at the Federal Aviation Administration caused by some Windows version ? Naaah, can't happen in a Windows world, everybody would bet on human error in such a case, right ?
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
Dude, if the OS requires a reboot, it doesn't matter how bad the application software is. A true Operating System should work flawlessly FOREVER. It's not impossible, because VMS does it, FreeBSD does it, Linux does it, so why cannot micro$oft windoze do it?
Thanks, teahouse. The M$ marketing department will be sending your check to the usual address.
Can you imagine knowing about this problem, putting it into production and not riding your MS rep like a pony until it was verified fixed ? ...with any other vendor.. sheesh.. but I guess it doesn't work that way with MS - even for the FAA.
I just have to comment on this one. The flaw isn't in Windows, it is in an application written by a high priced consulting company. It was discovered late in the evaluation process, and since it is easy to work around (by rebooting once per month), and fixing it would have delayed delivery, the software was accepted with the bug.
Blaming this on Windows is wrong. Just plain wrong. Completely wrong. For all the faults Windows has, this is not one of them.
The bug is in an application written by consultants. It's the same bug Windows 95 had, but now it is a critical application, and not in the operating system.
Microsoft isn't at fault this time.
The error in question is in an application provided by a consulting company. The bug is known, and the workaround is to reboot once per month. Windows is not the problem here; it gives good service. I expect Linux would do the same, but the consultants preferred to use Windows.
Personally, I think that using as much off the shelf software as possible, and trusting them to write as little as possible, is a good decision.
The Tandems lived up to their hype, in terms of reliability. I never saw a VSCS failure in almost ten years of use--I barely remember how to use the backup system, VTABS. Maybe now I'll get some practice with it.
Unix to Windows95? more like downgrade...big time
Ladies and Gentlemen, this is your captain speaking we're cruising at 30,000 feet, you can see the Mississipi out of the left side of the plane, and...uh, what the hell does "STOP 0xc0000005 (0x00000029,0xc02fdec6,0x00000000,0x00000000)" mean?
Hmmm, I seem to recall glancing at the MS EULA one time (when they were printed on the disk-envelope (that I never opened - it came that way - honest), and the thing said in part that it wasn't to be used in life-safety operations, for running a nuke plant, air traffic control, or other real-time operations...
So ummm, unless MS suddenly created a hardened RTOS, why the fsck is this thing even running anywhere near ATC?
I say FIRE the morons who installed it, ordered it, designed it, and sold it... Finally, FINE the hell out of the asshole company that wrote it and allowed it to be sold for that use... I'd say $150/hr PER person inconvenienced by this debacle, PLUS whatever the airlines lost (or might have earned) PLUS a punitive sanction to make it fucking hurt bad enough that they'll realize that this can't ever occur again - I'd say $15 billion would do it...
That is because the Windows uptime counter uses milliseconds, not seconds.
Of course, the most important thing is to spend a lot of time carefully defining what events are or are not failures, because that can make a couple of nines difference in what you call the reliability numbers...
And yes, the FAA has always been on drugs. One of the drugs they're on is knowing that if there's an airplane crash and hundreds of dead bodies due to problems with air traffic control, they get infinite amounts of political heat, whereas if major hub airports don't have enough capacity because the ATC system is antiquated, well, that's only money, and usually somebody else's money at that, and if there are appalling delays and cost overruns, maybe it takes a bit longer to get promoted, but often you can _get_ more budget, because if two 747s full of school children crash over LAX a month before Election Day due to ATC glitches, nobody wants to be the Congresscritter who voted against fixing the ATC system. So the system's rigged against them, forcing them to be overconservative, and to _look_ extremely conservative, except that every once in a while the fragility and brokenness of the system catches up with them and forces them to do something in a hurry, especially if there's going to be an election where the top people get replaced for partisan political reasons, which gives them an opportunity to let the outgoing guy take any blame after he's gone. So just because they're on drugs doesn't mean that it doesn't suck to be them...
On the other hand, you really can get equipment that reliable if you're willing to pay for it, and component reliability has improved wonderfully since the 1980s, e.g. disk drive MTBFs of 500000-1M hours instead of 10,000 hours, so you really can wait until midnight slowdown to rebuild the RAID partition after you hot-swap the drive, and computers are a couple orders of magnitude faster so you need fewer of them to get a given job done fast enough, making it much easier to make subsystems reliable and monitor their status.
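To put rough numbers on the MTBF point, here is a small sketch. It assumes a constant failure rate (the usual simplification behind an MTBF figure), and the fleet of 100 drives is a made-up example, not anything from a real installation.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected drive failures per year, assuming a constant failure rate."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours

# Hypothetical fleet of 100 drives at 1980s-class and modern MTBF figures.
for mtbf in (10_000, 500_000, 1_000_000):
    print(f"MTBF {mtbf:>9,} h: {expected_failures_per_year(100, mtbf):.2f} failures/year")
```

At 10,000 hours you are swapping a drive every few days; at 500,000 hours or more it is a once-or-twice-a-year event, which is why waiting for the midnight slowdown becomes reasonable.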
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
So you thought you like Fly-By-Wire airplanes?
Bill Stewart
Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.
If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.
My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.
Patches, do ya have 'em?
Look behind you...
Sooooo. They were converting to Windows, eh? Do we really think they were installing Win95 anytime recently to force this bug unto themselves?
I would find it hard to believe that they were installing Win9x OR that Win2K+ was affected by this bug, as I have found no current documentation tying this bug to an installed W2K+ OS.
Blah, blah, blah.
I was on a Northwest flight from det->fra and was trying to watch a movie using the in-flight video on demand system, when to my surprise the client rebooted and what did I see but a lil' penguin in the corner! -- my guess is that the 'server' on the plane that serves the menus and movies gets overloaded when the flight staff activates the video system -- and there must be some timeout that occurs and the client reboots. Unfortunately I was not able to see what kind of hardware was driving the LCD displays.
-best
-greg
This is not addressed to the parent, but is for everyone who responded to the parent -
I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!
It's hypocritical, pharisaical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.
Those who can, do. Those who can't, go into business for themselves.
Twitter, you're a petulant cock-gobbling sycophant to Linux Torvaldyos! Quit taking DP from ESR and RMS's feculent cocks and why don't you try to stop sucking quite so much? Get out of your parents' basement and see the real world - maybe then you'll see how pathetic you sound, with your neverending stream of bullshit about how Microsoft is stalking you. Wasn't it you who said that Microsoft believes your insane ranting is actually a threat to them, so they PAY PEOPLE to reply to you on Slashdot? No sir, I don't get any money. I do it for the love. Someone has to go up against your paranoid whining. So get back in your cage and shut the fuck up already.
Doesn't windoze have a cron type thingy for scheduling jobs? ;)
When the Micro$soft salesman said "but your system will need to be rebooted every month because it crashes", didn't they smell a rat?
I'm just glad I will never fly to the US, and if for any reason I do, I'll make sure to avoid LAX in the future
My comment was poking fun at people that assume that UNIX systems are the end all be all of uptime, because the OP's clear implication was that something requiring high uptime should be on a UNIX system, not a Windows system. VMS still beats the pants off UNIX in terms of uptime. It was a joke, you know. Laugh.
Still, regardless of where the bug was in this particular case, the fact remains that servers handling mission critical applications (ie, where people's lives are at stake) should not, under any condition, be running Windows. In this case, the problem was with the application, but just because Windows wasn't the issue this time doesn't mean we should all wait around until it is.
What you're saying is like, "I know there are two gaping security holes in this setup, but the hacker that just took our system down only used one -- therefore, I'm just going to patch that one and be on my merry way."
Personally, I'd rather not trust my life to a computer in general, but I'll be really plain and say that if I had to choose a mature UNIX system versus a Windows system, I'd pick the former any day of the week. And if I had the choice of VMS thrown in there, well, all the better. Things can still go wrong at the application level, but the chances of a BSOD turning the whole airport into a carnage of burning crashed planes is that much reduced. And that, my friend, is a good thing.
PS. Saying that Windows works as well on the server as UNIX or VMS is like saying that mentally challenged kids are as capable as normal ones because they too run the special olympics. Windows may have versions aimed at the server, but until systems that need to be up for a decade under high load have actually been up for a decade under high load, I'm not going to trust it. VMS and Solaris are proven server solutions that really do work. A stable NT that doesn't crash is vaporware, as much as Windows nuts wish it weren't. I'm not saying Windows can never be as stable as UNIX/VMS/MVS/whatever, but the simple fact is that today it is not and we're talking about deploying it on mission-critical servers today, not a decade from now when MS gets its act together.
By that same logic, doesn't a Windows user "Start" the shutdown procedure?
And if you don't want to go to the "Start" button in Windows to shut it down, you could always hit ctrl-alt-del and click shutdown. Or press the power button if you have power management enabled in the bios. I don't really see a fundamental difference between the two, it's just semantics really.
When I first started using Linux, one of the things that baffled me for hours until I could ask someone who knew Linux was how the heck do you rename a file?? I searched and searched for anything resembling a rename command and found nothing. It never occurred to me that you might use the move command to rename a file by essentially just "moving" the file to a new filename. That's at least as illogical (to me and every newbie I've ever known) as clicking Start to Shutdown for someone who isn't familiar with the idiosyncrasies of a particular operating system.
Keith D.
Now you've become a trustworthy company!
What is the difference with Commence and Start?
Start writing.
Start listening to music.
Start shutdown procedure.
It works just the same!
I happen to like Start better than Commence. Commence is a very heavy word compared to just Start.
Why else are there so many clueless Microsoft Windows users out there? They never got a manual with the OS they licensed unknowingly through their computer purchase. Neither do you get a manual when actually purchasing Microsoft products.
How are you supposed to magically know how such complex system works without a good manual?
Irresponsible and cheap!
big projects don't work like this. if you find a bug mid testing, then you don't throw the whole thing back at the vendor and chuck the baby out with the bathwater; you simply cannot organise big projects like this. you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system. if you have accepted it on this basis and don't do what you said you would when you signed the constraint off, it's your problem. yes, the vendor shouldn't sell buggy software, but *all* software has *some* bugs in it.
Damn straight -- I for one don't want any patch installed on a system which can endanger my life unless it's been fully tested.
Phil
I guess today is a passable day to die.
If you have to resort to bickering about button captions on the shell to give shit to Microsoft, you have problems. Furthermore, this is in no way related to the article; why is it +2 insightful?
If you let the caption of a button get to you, you need to remove the tin-foil hat and seek help immediately.
By the way, that "Unix to Windows" link just sits there reloading. I'm assuming it's a cookie thing.
Assume I was drunk when I posted this.
You have to get your traffic in a holding pattern and/or switch over to the redundant before rebooting a piece of critical ATC hardware. This cannot be done automagically because your Bravo space might be full of planes at the time, in which case a controller would not want his/her display to go away... I am sure the pilots wouldn't, either..
There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.
I thought so, too, but this persuaded me differently: RPCSS bug
(RPCSS being an integral part of the OS, and suddenly burning a huge amount of CPU cycles being a bug)
At least for server versions of NT and 2000, and my money is on the same thing happening in client versions if you run them long enough.
As I recall, since Windows 2000/NT was once the same product as IBM OS/2 (remember Microsoft OS/2, anybody?), this bug originated from the OS/2 side of the codebase.
IBM ran into the problem quicker, as OS/2 was adopted for various critical things like Automated Teller Machines (ATMs), while Windows NT was mostly used for simple file servers. As a result, the problem was fixed in OS/2 about 2 years before Microsoft got around to fixing the problem in Windows.
Considering that I remember this patch existing for Windows NT and 2000 back in 1999, it is disheartening that the FAA did not feel it necessary to upgrade to something as simple and critical as Service Pack 2 or 3.
"I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
If a single maintenance task (refueling) is missed on airplanes, they will crash.
Why is having to regularly work on extremely complicated systems anyone's fault? I'd lean towards blaming the idiot who didn't...you know...do his job.
Brant
Argle. Bargle.
This happened after an upgrade from Unix to Windows.
How does one upgrade from Unix to Windows?
"This happened after an upgrade from Unix to Windows. I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"
That's an upgrade? Having your system shut down after 40 days.
Once upon a time, a certain vendor recommended monthly reboots of their server which collected call data from one of their products. This may have changed with newer releases. The server ran Solaris.
Of course that's not to say it was a Solaris problem. Point is, by UNIX systems parent might have meant systems with UNIX as an OS, but running other crappy code on it?
I think that the recommendation was more to cover their posteriors: if for some reason the software failed, and the customer didn't do the monthly reboots, whose fault is it?! Of course, our server ran problem free with over 2 years of uptime before a drive failure ruined that.
So this organisation just lost their ROI & guess what - TCO just shot up! You wouldn't listen, would you?
While Windows is a sub-optimal OS, the 49.7 day bug never existed in Win2k, WinXP or Win2k3. It only existed in Win9x (Google for it).
So this is either a completely bogus claim on someone's part or the FAA is running a mission-critical application on an unpatched Windows 9x box.
No. The FAA is part of the Department of Transportation
i guess you *are* a moron. considering that refueling a plane is a multistep process with multiple points of failure (the refueler must forget, the pilot must ignore the fuel indicator, etc...) whereas rebooting the server has ... how many points of failure? apparently just one. the original author is correct to call out the question.
in systems i've designed where one person is responsible for some event in a complicated workflow, the system sends an email out when the event draws near. it sends another as it is missed. it alerts the person's manager if it is late. if the FAA system has no such alerts, then the reboot event was poorly architected. if the FAA system DOES have such alerts, then the entire organization has a problem. it is never the fault of one individual when a system fails.
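a minimal sketch of that escalation rule, for illustration only. the function name, the action labels, the three-day warning window and the twelve-hour grace period are all assumptions; a real system would send mail or page someone instead of returning strings.

```python
from datetime import datetime, timedelta

def escalation_actions(due, now,
                       warn_before=timedelta(days=3),
                       manager_grace=timedelta(hours=12)):
    """Return the notifications that should be active at time `now`."""
    actions = []
    if due - warn_before <= now < due:
        actions.append("remind_owner")          # the event is drawing near
    if now >= due:
        actions.append("notify_owner_missed")   # the event was missed
    if now >= due + manager_grace:
        actions.append("alert_manager")         # still not done, escalate
    return actions

# Hypothetical due date for a scheduled maintenance reboot.
due = datetime(2004, 1, 30, 2, 0)
print(escalation_actions(due, due - timedelta(days=1)))   # ['remind_owner']
print(escalation_actions(due, due + timedelta(hours=1)))  # ['notify_owner_missed']
print(escalation_actions(due, due + timedelta(days=1)))   # ['notify_owner_missed', 'alert_manager']
```

the point of the design is that no single forgotten calendar entry can silently take the system down: at least two people hear about it before the deadline becomes a crash.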
This is a terrorist offense. Yes, the vendor execs could be dragged into court and sentenced to death for this.
The society for a thought-free internet welcomes you.
The air traffic CONTROL system was NOT affected. The controllers could watch the planes "near miss" on their radar scopes. The *radio* / communications system was hosed; controllers could not contact the planes by voice.
2^32.... reboots, schmeboots.... the system was down for *5 HOURS*. Not even a crapped-all-over-itself Windoze 95 box takes that long to come back up and load its app. This is obviously a SERIOUSLY FLAWED system that could not restore itself, and all backups failed (for a time) too.
"Delta, go around..." -- (message to Flight 191 as it was crashing in a thunderstorm, Dallas, 1989)
Unfortunately, this "promotion" doesn't always take the form of innocuous sales calls-- it includes significant political lobbying, with donations, gifts, dinners, etc. To put it more bluntly, Microsoft is paying politicians to select Windows for applications where MS knows it isn't really appropriate.
While the politicians are certainly to blame for being corrupt, it's not like Microsoft can avoid responsibility for their role in the decision-making process. If I suggest to a government official that something might be a good idea, I can reasonably avoid some of the responsibility when it doesn't pan out. When I bribe that official to do it, I'm taking a much more active part in that decision, and thus I deserve every bit as much blame for the end result.
On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.
Isn't "on a completely different subject" just other words for "Off Topic"? So shouldn't your post automatically be modded as "Off Topic"?
My neighbor tried to use an electric hand mixer to make gravy over a hot stove. The cord caught on fire and when he tried to put it out he got electrocuted. He read the recipe on the internet using the Internet Explorer browser. If it wasn't for MS, he would be alive today.
See you can tag everything back to MS if you fish hard enough. Next it will be earthquakes and hurricanes we tag blame on MS for.
Doesn't anyone see this as a bit silly to blame MS for an obvious blunder on not only their IT dept. but the morons in charge of maintenance and engineering? If you drive your car through the back of your garage, is it the car manufacturer's fault? How stupid would you look for trying to place the blame back that far?
I don't quite share your fanatical hatred of Windows; when used properly, it's quite capable of handling whatever you throw at it.
VMS and Solaris are proven, aye; I do, however, have fond memories of reading, fifteen years ago, the exact same things about Solaris that people now say about Windows; rampant holes, services with root access open to the Internet (lpr, rpc, sendmail, and so on), accusations of terrible bloat (xclock takes HOW MUCH RAM?) and slowness ('Slowaris' indeed) and so on.
It also never fails to amaze me that the zealots tout the security of an OS that can be defined as 'the results of taking a secure OS and ripping out most of the security functionality.' UNIX, after all, is a play on its MULTICS parent, as it's a castrated version thereof.
Still, as far as I'm concerned, if you want reliability and no downtime, you use a mainframe. Period. We're talking about a system that you can rip running processors out of, and the damn thing won't even blink.
Vintage computer games and RPG books available. Email me if you're interested.
So would this be the same "wrong API for the job" that Microsoft's developers are using to code Windows services?
Print Spooler Stops Scheduling Print Jobs
http://support.microsoft.com/default.aspx?scid=kb;EN-US;318152
I agree the developers should not have used this tick counter. And when they discovered there was a problem it should have been fixed immediately as the code change would not be that significant if it was only a matter of the tick counter rolling over.
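To show how small the fix can be when rollover is the only problem, here is a sketch in Python that simulates 32-bit tick arithmetic. This is not the vendor's code, which none of us have seen; the function names are made up, and the mask stands in for the unsigned subtraction you would get for free with a DWORD in C.

```python
MASK32 = 0xFFFFFFFF  # simulate unsigned 32-bit arithmetic

def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed ticks between two 32-bit readings, correct across one wrap."""
    return (now_tick - start_tick) & MASK32

def naive_expired(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    # Buggy pattern: compares against an absolute deadline that may have wrapped.
    deadline = (start_tick + timeout_ms) & MASK32
    return now_tick >= deadline

def safe_expired(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    # Wrap-safe pattern: compares the elapsed interval, not absolute values.
    return elapsed_ms(start_tick, now_tick) >= timeout_ms

start = 0xFFFFFF00            # timer started 256 ticks before the rollover
just_after = 0xFFFFFF10       # 16 ms later, still before the rollover
print(naive_expired(start, just_after, 1000))  # True  (wrong: only 16 ms passed)
print(safe_expired(start, just_after, 1000))   # False (right)
print(safe_expired(start, 800, 1000))          # True  (1056 ms passed, across the wrap)
```

The interval version still breaks if more than 49.7 days pass between the two readings, so it suits timeouts, not long-term uptime tracking.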
But from what I've seen first hand and heard from others, I still believe that Windows is not up to the task. And rather than it being the wrong API for the task, it appears to me it's the entire system (Operating System, API, Developers, Vendors, etc.) that is wrong for the job.
burnin
You are correct, I did mean the FAA. My bad.
...would call a migration from Unix to Windows 2K an upgrade?! "...upgrade from Unix to Windows."
Why can't Windows XP reset itself every 30 days?
I can reboot Windows XP using a script
@c:\windows\system32\shutdown -r
But when I tried to put the above script into a scheduled task, NOTHING HAPPENS!!!
On Linux I have NO problem. I just put a reboot command into a crontab of Linux and it reboots the machine at the scheduled time.
In the mid 1980's, I knew a software engineer at Caltech's Jet Propulsion Laboratory who worked on a multi-year JPL project for the FAA. The project was to replace the obsolete voice communication system for air traffic controllers. The new system had touch screens with onscreen menus and buttons were dynamically reconfigured depending on the controller's workload. It worked correctly, and the engineer enjoyed describing to me how it worked. This was all before there was any version of Windows. If I recall correctly, they developed on MODCOMP minicomputers running VMS but deployed on an embedded system with an in-house design for task switching, not a complete OS. I might be fuzzy about the technical details at this time, but a FOIA request should be able to retrieve them for the intensely curious.
I do clearly remember that the working system was presented to the FAA in Monterey, and the FAA then terminated the contract and hired IBM to start over from scratch on a new system. Rumor was that this was a political payback. I should emphasize that's just a rumor I heard. Looks like Harris eventually got the contract. I wonder if any of the original code from JPL was ever deployed.
That certainly sounds fair.
Regarding some of the engineers at Harris.
As a matter of fact, I DO work as an engineer for a large, multinational company--and our projects do in fact involve mission critical systems. You are right--engineers do not always get what they want and it does often mean dealing with politically/non-technically made design choices like using Windows when we'd prefer not to. However, there is a limit--a time and place where commodity/consumer grade hardware and software is appropriate--and it's NOT at a level at which a crash will bring down an entire system. I do not have to know how the software works to make that observation--it has been shown that a Windows box failed and the result was a major system disruption and hours of chaos. It's not the fact that they used Windows that is disturbing--it's the fact that they used it in a mission critical situation...without adequate testing to boot. And yes, I do have a clue as to how complex the system is and the intricacies of how it works--our company's products run systems in oil refineries, factories and power generating stations. In a similar situation and project we would handle things differently:
1 If program managers were indeed making critical decisions, they would HAVE to be registered Professional Engineers by law, just like the lead developers.
2 Lead developers are explicitly instructed NOT to simply do as they're told. If they see a serious flaw in a design decision they are obligated to make their views known. Of course, you can't counter one political decision with another--you must have a solid case. If your boss refuses you go to his boss. If you are stonewalled right to the top and you think the issue is really important you can bring the issue to the professional association. The final course is to perform the work and refuse to sign off on it (make the boss do it). That way, if the result is failure, you are in the clear and your higher-ups take all the heat and not just some--it's "due diligence" (ass covering, really).
3 During development and testing, we identify any potential single points of failure, bottlenecks and known issues. In my situation, Windows-based systems are ALWAYS considered "unreliable" (that is, not to be relied upon for critical or safety related systems), therefore we prescribe redundancy. Our test plans always call for us to do controlled AND uncontrolled (pull the plug) shutdowns of each machine in sequence (to test failover) and simultaneously (to determine how the PLCs and other embedded systems, plus electromechanical systems, handle catastrophic failure).
4 If hardware cannot be supported for at least ten years (and in some cases up to 25 years) we MUST design such that there will be a drop-in replacement that will cause minimal disruption (for example, an old VAX VMS server could be upgraded to a current Alpha VMS, or an old PLC can be replaced with a next generation one that will execute the same routines rung-for-rung).
5 It is typical to keep the previous, pre-upgrade equipment around as a standby system, ready to put back in service, until the new system has worked as-advertised WITHOUT INTERVENTION for at least a year. A crash or other fault would reset the 1-year clock and we'd be doing a thorough root-cause analysis.
It sounds like there is a lack of professionalism within your group of engineers. I'm not sure about how things are done where you live, but "just following orders" is not an excuse for poor engineering--a failure of that nature where I am would result in being temporarily barred from practicing engineering. Sometimes it can be tough to go against the PHB--I've heard of engineers being fired for refusing to sign off on designs, but I'd rather be fired and be able to work as an engineer elsewhere than have my ability to work as an engineer revoked entirely.
I guess I would have to ask the FAA as to why they made the decision to migrate a working critical system to Windows--a radically different architecture from UNIX. My employer builds
Upgrade from unix to windows? You should have put downgrade. Windows is never an upgrade to Unix.
Management at work. Upgrade indeed.
shutdown -t 0 -r
This says to reboot the computer and wait 0 seconds before doing so. Stick a -f in there to force a shutdown if you've got ornery apps. Piece of fucking cake, people. Shit like this makes me wonder why I'm still unemployed; I obviously have some skills that would be appreciated by the FAA. Just put the command in a task scheduler entry, set it to recur every 2 weeks, and you're golden. I mean, seriously, what the fuck?
I use task scheduler to make backups of my current Opera session and to run periodic defrags and clean temporary folders and so forth. The system provides a way to maintain itself at scheduled intervals, why rely upon a technical lackey who can (and obviously did) screw up?
Tangentially, when the blaster worm came out and was giving everyone the NT Authority you-must-shutdown-now message, I discovered that a quick shutdown -a would abort the shutdown process and allow you to continue working with the (albeit unstable) system, to install a patch or the like.
Reinvent the wheel only at either a lower cost, greater effectiveness, or your own personal enrichment and satisfaction.
I did a search for the 49.7 days in Microsoft's knowledge base and found one possibly related bug, the non-related bug referenced by the article submitter, and some other non-related bugs. The one thing they all have in common is an improperly used GetTickCount function in the code.
First, there's the five and a half year old patch fixing an issue in Windows 95/98. There's no reason this should have been mentioned anywhere in reference to this incident. Shame on the poster and all the people backing this theory. It's pure reverse FUD because there's nothing indicating that this bug was related and everything shows that this only affects 9x. Personally, I'm positive that this problem isn't in 2000 because I supported 2000 for Microsoft when it was released and never heard of this happening. Also, Microsoft is good about testing all of its products to see which are affected. If this type of screw-up were common, the articles would be common on Slashdot since the typical reader lusts after examples of MS screw-ups. There's also the fact that there's a LOT of Windows 2000 boxes with uptimes way past a month and a half.
But then there's the CPU utilization rpcss.exe bug. If this is what was happening, then it's partially Microsoft's fault for not having enough QC testing targeted towards idiot programming mistakes. Nobody tested enough to see what happens under different scenarios when GetTickCount is improperly used. Also, the hotfix from Microsoft is only a few months old, probably not enough time to test and deploy. On the other hand, GetTickCount is designed to only work for 49.7 days and shouldn't have been used for this application. I'd assume that they didn't know what was going on when shit hit the fan after a month and a half of running relatively smoothly and only after the MS patch was released did they review their code and see that they were improperly using the function. Still though, any company that has an internally written or contracted program with this serious of a bug should have invested the resources required to find the problem and fix it. They should have known that the problem was related to software installed on the server, most likely their proprietary FAA program because if every Windows 2000 computer running on a Dell had this problem, Microsoft would have released a patch long ago. Heck, they should have found that they were using the function improperly. If the programmers knew how long it ran for before dying (49.7 days), they should have realized that it's related to the GetTickCount function and could have narrowed in their efforts to wherever the function was used.
If the problem was not related to the rpcss.exe bug, then I don't see how MS is to blame. The blame lies solely with the programmers of the FAA software for improperly using the GetTickCount function.
In conclusion, with either of these scenarios, I'd be replacing some of my programmers if I were the manager in charge of the project that wrote the FAA software.
-Lucas
The Rpcss.exe bug appears to be fixed in W2K SP1 since it only applies to Windows 2000 Server (i.e. no service packs).
It looks like the print spooler bug was introduced in W2K SP1 and wasn't discovered or fixed until after SP3 (since only W2K SP1-SP3 are listed).
Considering how long SP1 has been out, not to mention SP4, I don't see this as a Microsoft problem (assuming it really is an OS issue). -- Argel
This is about the only comment by someone with a clue in this whole thread
Does that herb facilitate time travel?
Because the last time I had to schedule reboots for a machine of mine was around 10 years ago.
Oh yes, last time I used Windows, my bad.
I have administered Solaris, Linux, HP-UX, Irix and a few others, and frankly the one who should either go get a job in the real world or stop talking hallucinations is you.
IANAL but write like a drunk one.
What an ass you are. You claim to know that "M$" is a "nasty" platform. Do you develop on Win"doze"? If you do, and you use this old function call, then you're an ASS.
Oh, wait. I already said that. Should I now link to the slew of articles written by others who also claim you to be an ass? (by your editorial logic that should be proof-positive of your ass-dom)
I think they meant downgrade.
Well, I stand corrected - for some reason I thought the Federal Government would have been putting DOT in with all of the OTHER "interior" federal issues. Stupid me...
Considering the bizarre web of overlapping police agencies they've got, I should have known better...
Kinda worries me more to see more than one department having that kind of problem...
Hacker Public Radio is our Friend
I'm posting this so that you (the moderator) have some context to consider twitter and not mod him up whenever he posts his filler preformatted rants about installing Knoppix or Mepis or whatever that unfortunately get him karma every single time and allow him to continue posting his trademark toxic crap (read on) day in and day out. You may consider this a troll - I consider it community service. And I ain't kidding.
If you're a /. subscriber, I invite you to look through some of his posting history. I guarantee that you'll be hard pressed to find someone that is more "out there" than twitter. You'll also probably notice he's got quite an AC following. Don't just read his posts, make sure you go through the replies.
To get an idea of what I'm talking about, check this post out. This is an article about email disclaimers. The parent of the post is complaining about the ads in the linked page and so on, and twitter actually goes off on a rant to blame it on Microsoft and recommend Lynx, because "is teh free".
Here's another. In this post twitter not only calls the OP a troll but attempts to "tell it like it is" while making some vague argument about "GNU". Yes, if you're confused, you're not alone. The reply (modded +4) proceeds to simply destroy his bogus argument. You will notice he did not reply. This is what some people call "drive-by advocacy". A sort of I'll just leave you with my thoughts here and move on to the next flamebait kind of deal. In fact, he almost never replies because he knows that his fanatical arguments simply do not hold up to any sort of discussion. It's not that he's chosen the wrong cause - he's just going at it in a completely wrong way.
Here's that drive-by advocacy and FUD in motion: twitter goes on about some topic and then drops the usual "oh and M$ is teh evil" because "WMP phones home" or some such. Called on his FUD, he then claims that WMP stores every song and movie you've ever played in a file, somewhere. Pressed further, he just sort of slithers out of sight, his FUD-spreading complete. This is not about some Microsoft technology that nobody likes anyway; it's about lying for the sake of lying. Way too many of his posts are exactly like this one.
More? Just read through this post and the subsequent replies. I guess this stands on its own. Or these two. Or this one. Or this one.
Still not convinced? This is what twitter considers "humour" while going about his daily "M$" routine.
This place is really going to the dogs. Why don't you go back to IRC my man "twitter"? Do us all a favor and disappear, k?
It would have been cool if you had gotten a picture of this, but then they would probably have wanted to arrest you for 'terrorist-like' activities.
I can't wait till it doesn't feel like a police state anymore.
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.
I used Solaris 2 on UltraSparc before, and it was the one system that never froze or crashed on me (apart from GNOME/KDE apps and XFree86 itself). Even the most stable Linux I have seen cannot compete with it.
Our Windows servers have reboot schedules and these are monitored via our enterprise management tools to ensure that the uptime is not too high. Not drastically different to checking the fuel gauge really. A bit obvious, I thought.
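For what it's worth, that kind of uptime check can be sketched in a few lines of Python. This is only a sketch, assuming the 49.7-day figure from the article comes from a 32-bit millisecond tick counter wrapping around. The function names and the 30-day limit are made up for illustration, and a real monitoring tool would read the uptime from the OS rather than take it as an argument.

```python
# Assumption: the counter is an unsigned 32-bit count of milliseconds since boot.
WRAP_MS = 2 ** 32
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(uptime_ms: int) -> float:
    """Days left before a 32-bit millisecond counter rolls over to zero."""
    return (WRAP_MS - uptime_ms % WRAP_MS) / MS_PER_DAY


def needs_reboot(uptime_days: float, limit_days: float = 30.0) -> bool:
    """Hypothetical policy: flag any box that has been up longer than the limit."""
    return uptime_days >= limit_days


if __name__ == "__main__":
    # Days from boot to rollover under the 32-bit assumption.
    print(round(WRAP_MS / MS_PER_DAY, 1))
    # A box that has been up 31 days is past the hypothetical 30-day limit.
    print(needs_reboot(31.0))
```

With a 30-day limit, one missed reboot still leaves a margin of nearly three weeks before the counter wraps, which is the whole point of monitoring it instead of trusting someone to remember.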