Windows Upgrade, FAA Error Cause LAX Shutdown

Repent, Sinners! by mfh · 2004-09-21 09:49 · Score: 5, Insightful

The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibly related to an old Windows 95 bug.

Okay... a Win95 bug leads to the LAX shutdown because the *same* bug was later found in Win2k? Yup, closed source is the answer, Mr. Gates. I hereby repent my sins of Open Source Freedom and agree that security by obscurity is the answer! /sarcasm

a technician didn't reboot the system monthly as he should have

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

--
The dangers of knowledge trigger emotional distress in human beings.

Re:Repent, Sinners! by LostCluster · 2004-09-21 09:54 · Score: 3, Insightful

I've seen AIX-based database systems that require an overnight downtime to do reindexing, since non-SQL formats like DBase have always been a little funky when they start having to deal with million-record tables. It's amazing how ugly legacy databases can be compared to today's tech.
Re:Repent, Sinners! by Da+Twink+Daddy · 2004-09-21 09:57 · Score: 5, Funny

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

Sure,
init 6
doesn't sound like it should start (initialize) anything...
Re:Repent, Sinners! by Phillup · 2004-09-21 10:05 · Score: 2, Informative

doesn't sound like it should start (initialize) anything

So... it should not initialize (begin) run level 6?

--

--Phillip

Can you say BIRTH TAX
Re:Repent, Sinners! by SoSueMe · 2004-09-21 10:06 · Score: 2, Funny

... the LAX shutdown...

Would that be 'exLAX'?
Re:Repent, Sinners! by (H)elix1 · 2004-09-21 10:08 · Score: 5, Insightful

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

All right, I cannot throw the first stone here. I can raise my hand as an AIX C programmer back in the day...

We inherited a huge ball of spaghetti wire, nasty stuff that had memory leaks. Rather than taking the time to fix it, the powers that be determined it was better to keep working on new features rather than hash out the issues. At first it happened once a quarter, then once a month, and as time ticked by it became a weekly 'fix' to recycle the server. Lord knows I added to the mix as well, as they picked 'cheap' and 'build it fast' (not to be confused with running fast), skipping the 'do it right' part entirely. That is how it happens... stuff gets rushed before its time. OSS is more immune than the typical commercial gig, but anytime a deadline comes without enough time to finish, something is going to give. Downtime is just duct tape.

--
+++ UGUCAUCGUAUUUCU
Re:Repent, Sinners! by pchan- · 2004-09-21 10:21 · Score: 5, Insightful

where do you want to go today?

dear microsoft,

the above question was posed in a line of your advertisements. well, after spending an hour and a half on a plane on the runway in oakland, and another hour on the runway in l.a. (sunday night), i think i have the answer. i want to go home. sounds like a simple enough request, or so i thought.

but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready to do so. when my desktop computer at work crashes (admittedly a rare occurrence nowadays), i am inconvenienced. when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable. i realize that buyers of software and IT systems are easily suckered or bribed into using your systems, that is why i am appealing directly to you. please exit this market before we are forced to legislate you out.

thanks,
pc
Re:Repent, Sinners! by archen · 2004-09-21 10:27 · Score: 1

If it was really that important to reboot the system they could just install Back Orifice, then script another host to reboot any win9x host. On Windows 2000 you could just schedule a reboot every month with the task scheduler. Win98 and ME have a scheduler also, but I've found that to be rather... unreliable.
Re:Repent, Sinners! by Surazal · 2004-09-21 10:27 · Score: 1

> How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

Sure,

init 6

Personally, I use "reboot".

"shutdown -r now" also works (r stands for reboot). To shut down, use -h (for halt).

--
--- Journals are boring; Go to my web page instead
Re:Repent, Sinners! by claar · 2004-09-21 10:29 · Score: 4, Insightful

Bah, what a cop out. If "we" won't accept criticisms similar to our own, we have no right to criticize in the first place..

Yes, init 6 is counter-intuitive. I remember that it actually did confuse me a bit the first time I heard of it. Does that mean we need to remove or change it? Nah, let 'em use `shutdown -r` or `alias restart="init 6"`. But just don't be an apologist for Linux, it just makes "us" look hypocritical.

--
I'd give my right arm to be ambidextrous...
Re:Repent, Sinners! by 47Ronin · 2004-09-21 10:33 · Score: 4, Insightful

Personally, I use "reboot".

"shutdown -r now" also works (r stands for reboot). To shut down, use -h (for halt).

Personally i use sudo reboot because I would never login as root for security/safety reasons.

--
Those who laugh at you for you having a Mac.. are the people who constantly call you to fix their PC.
Re:Repent, Sinners! by 0racle · 2004-09-21 10:37 · Score: 1

Just like you go to the start button to start system tasks, tasks like cleanly shutting down running tasks and making sure everything is properly written to disk.

--
"I use a Mac because I'm just better than you are."
Re:Repent, Sinners! by glitch23 · 2004-09-21 10:48 · Score: 1

I personally use 'shutdown -h now'

--
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
Re:Repent, Sinners! by admdrew · 2004-09-21 11:02 · Score: 2, Funny

Personally, I use an axe.

--
LegendMUD
Re:Repent, Sinners! by LnxAddct · 2004-09-21 11:03 · Score: 1

doesn't sound like it should start (initialize) anything...

It initializes the beginning of the end, run for your life!
-Steve
Re:Repent, Sinners! by jurv!s · 2004-09-21 11:17 · Score: 2, Informative

in my labs- users logged in on the console can reboot without sudo. Anything less would be uncivilized!

(ps man console.apps and pam_console)

--
sigs are for fools and trolls. no signature is *always* appropriate. you should turn them off in your preferences.
Re:Repent, Sinners! by Phillup · 2004-09-21 11:34 · Score: 2, Insightful

But just don't be an apologist for Linux, it just makes "us" look hypocritical.

I wasn't apologizing. It makes perfect sense to me.

Then again, I have a calculator that you turn off by pressing the "ON" key. ;-)

Seriously tho...

Many devices have a single power button. You push it... thing comes on... push it again... thing turns off.

If anyone should apologize, it is the person that decided on "Start" for the button label.

And, in *nix... init 6 does just what it says it does.

It initializes run level six. Run level six can do anything you want it to do. It doesn't have to shut down the system.

So... WTF would I even have to apologize for? The fact that the parent associates it in his mind with shutting down?

It doesn't shut down... it initializes run level six. If you don't want it to shut down when you init 6... change it.

If you don't want to go to the "Start" button in Windows to shut down... well... that one is your problem. Not mine.

--

--Phillip

Can you say BIRTH TAX
Re:Repent, Sinners! by neura · 2004-09-21 11:36 · Score: 2

Ya know, the day the poster of the first comment has actually READ any of the linked articles before posting, I may just drop dead from surprise. People apparently would rather get their post in as soon as possible instead of actually READING WTF they're POSTING about. WORST OF ALL: This initial news item should have been moderated, since every non-factual suggestion made (about 75% of the post) is wrong. WHY CAN'T PEOPLE ACTUALLY POST REAL NEWS ANYMORE?!?!
Re:Repent, Sinners! by Hatta · 2004-09-21 11:39 · Score: 4, Funny

Personally i use sudo reboot because I would never login as root for security/safety reasons.

Funny, those are the only reasons I ever log in as root.

--
Give me Classic Slashdot or give me death!
Re:Repent, Sinners! by slittle · 2004-09-21 11:40 · Score: 1

We've also got VMS, Tandem and zSeries systems that do similar things, daily to monthly (ish) depending on the subsystem in question, and what's being done to it. Not exactly rare.

--
Opportunity knocks. Karma hunts you down.
Re:Repent, Sinners! by Awptimus+Prime · 2004-09-21 11:42 · Score: 4, Insightful

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem? (Please press the Start button to shut down (stop) the computer.)

Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.
Re:Repent, Sinners! by Phillup · 2004-09-21 11:42 · Score: 3, Informative

Only if that is what you have run level 6 configured to do.

All the init 6 command does is initialize run level 6. You can have run level 6 configured any way you want.

It isn't hard wired to shut down. (On debian run level 6 does a reboot... run level 0 halts the system.)

--

--Phillip

Can you say BIRTH TAX
Re:Repent, Sinners! by 0x0d0a · 2004-09-21 11:42 · Score: 1

a) That's nonportable. Use telinit 6.

b) What, you don't like "reboot"?

--
May we never see th
Re:Repent, Sinners! by Lisandro · 2004-09-21 11:43 · Score: 1

I believe the word you're looking for is touché.
Re:Repent, Sinners! by NeoSkandranon · 2004-09-21 11:44 · Score: 1

The databases used by my college (UNC-charlotte) basically turn off at a certain hour of the night and don't come back on till morning. This effectively renders the entire student portal useless as email, scheduling and other things are disabled. I was told by an IT student this was because the databases recorded changes each night...could this be similar to what you're talking about with AIX?

--
If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
Re:Repent, Sinners! by I_redwolf · 2004-09-21 11:49 · Score: 1

It initializes a run level.. which is what init is.. process control initialization.

Init is the parent of all processes. Its primary role is to create processes from a script stored in the file /etc/inittab (see inittab(5)). This file usually has entries which cause init to spawn gettys on each line that users can log in. It also controls autonomous processes required by any particular system.

6 is a runlevel used to initiate the rebooting of the system.
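
To make the point about /etc/inittab concrete, here is a minimal sketch of how runlevel entries map to commands. The sample entries are illustrative only (an assumption, not copied from any particular distribution), and the helper names `parse_inittab` and `entries_for_runlevel` are hypothetical. The only thing taken as given is the documented `id:runlevels:action:process` field layout.

```python
# Illustrative entries only; a real /etc/inittab varies by distribution.
SAMPLE_INITTAB = """\
# default runlevel
id:2:initdefault:
l0:0:wait:/etc/init.d/rc 0
l6:6:wait:/etc/init.d/rc 6
1:2345:respawn:/sbin/getty 38400 tty1
"""


def parse_inittab(text):
    """Parse inittab-style lines of the form id:runlevels:action:process."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ident, runlevels, action, process = line.split(":", 3)
        entries.append(
            {"id": ident, "runlevels": runlevels, "action": action, "process": process}
        )
    return entries


def entries_for_runlevel(entries, level):
    """Return the entries that apply when init switches to the given runlevel."""
    return [e for e in entries if level in e["runlevels"]]


entries = parse_inittab(SAMPLE_INITTAB)
for entry in entries_for_runlevel(entries, "6"):
    print(entry["process"])
```

In this sample, runlevel 6 reboots only because its entry points at a script that does so; editing that one line would change what "init 6" means on the box.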
Re:Repent, Sinners! by Feanturi · 2004-09-21 11:57 · Score: 1

You have to love a system that requires downtime as part of uptime.

You mean like MMORPGS that shut down everything and bring it all back up again every single day, to make sure things keep running smoothly?
Re:Repent, Sinners! by gweihir · 2004-09-21 12:16 · Score: 1

I've seen AIX-based database systems that require an overnight downtime to do reindexing,

Yes, but maybe that was controlled by a cron-job and not some poor person manually initiating it every night? Just like an automated reboot is also not too scary on any decent Unix, but a manual action in MS-world?

Just another indication of what Windows is: A toy.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Repent, Sinners! by sydb · 2004-09-21 12:20 · Score: 1

Sounds like a batch run. Like what banks do every night to reconcile accounts. At least I know one place where they still do that, can't speak for anywhere else.

Why a college would do big batch jobs I don't know, though.

--
Yours Sincerely, Michael.
Re:Repent, Sinners! by Turmio · 2004-09-21 12:21 · Score: 3, Interesting

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?
Actually I was hit by the max 497 days uptime bug of Linux 2.4 (and with a desktop machine no less). The box at work ran for about 650 days, well past the halfway milestone on the way to a second consecutive uptime reset. Then it was time for me to change rooms. I wasn't at the office that day and my co-worker just unplugged the box. Was I pissed or not? Yes I was.
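
The 497-day figure falls out of simple arithmetic if one assumes a 32-bit tick counter advancing 100 times per second, which is the commonly cited setup for Linux 2.4 on 32-bit x86 (an assumption here, not something stated in the post). A rough sketch, with hypothetical function names:

```python
SECONDS_PER_DAY = 86400


def wrap_period_days(counter_bits, ticks_per_second):
    """Days until a tick counter of the given width wraps back to zero."""
    seconds = (2 ** counter_bits) / ticks_per_second
    return seconds / SECONDS_PER_DAY


def reported_uptime_days(real_uptime_days, counter_bits=32, ticks_per_second=100):
    """Uptime as it would be displayed if it is derived from the wrapped counter."""
    return real_uptime_days % wrap_period_days(counter_bits, ticks_per_second)


print(round(wrap_period_days(32, 100), 1))   # 497.1
print(round(reported_uptime_days(650), 1))   # 152.9
```

Under those assumptions a box that has really been up 650 days would report roughly 153 days, having wrapped once.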
Re:Repent, Sinners! by n3k5 · 2004-09-21 12:22 · Score: 3, Insightful

If anyone should apologize, it is the person that decided on "Start" for the button label.
Originally the button just showed the Windows flag, so the choice of a label was basically the same as in Gnome and KDE today. However, the average Windows user didn't figure out that this logo isn't just there for decorative purposes, but you actually have to click it in order to accomplish just about anything. So someone had to come up with a short piece of text that clues newbies in, and it worked rather well (in usability tests). 'Start' may not be optimal, but has anyone thought of something better? (Not that it matters anymore.)

--
but what do i know, i'm just a model.
Re:Repent, Sinners! by ScourgeOfGod · 2004-09-21 12:23 · Score: 1

Never mind the system. We've found someone to blame. Clearly that tech's job should be outsourced to the great unwashed!

--
If you're happy and you know it, think again!
Re:Repent, Sinners! by secolactico · 2004-09-21 12:30 · Score: 4, Funny

Golly gee-whiz, if someone is too stupid to migrate a million-record dBase table to SQL, he only deserves a real good whacking (and a career re-orientation into would you like grits with that ???)...

Most of the time it is not because of the inability of the database tech, but the "hey, it's been working so far" attitude of the decision makers.

Maybe the powers that be are allergic to Open Source solutions and commercial databases can be expensive. Maybe the client applications are tied to the current system and porting them would be too expensive (example, POS systems).

I can imagine the conversation:

- "We are closed at night anyway"
- "Yes, boss, but recovering from a failure (knock on wood) can be too difficult in the current system"
- "Well, that's what we are paying you for"
- "Yes, sir. Thank you, sir. Would you like grits with that?"

--
No sig
Re:Repent, Sinners! by NuclearDog · 2004-09-21 12:34 · Score: 1

Well, then it could also be said that "Shutdown is a button used to start the shutdown of the system.".

--
This statement is forty-five characters long.
Re:Repent, Sinners! by Anonymous Coward · 2004-09-21 12:38 · Score: 2, Informative

But just don't be an apologist for Linux, it just makes "us" look hypocritical.
>>I wasn't apoligizing. It makes perfect sense to me....
>>So... WTF would I even have to apologize for? The fact that the parent associates it in his mind with shutting down?
Down boy! Heel!
apologist n. A person who argues in defense or justification of something, such as a doctrine, policy, or institution.
All words that sound vaguely alike don't necessarily mean the same thing.
Re:Repent, Sinners! by crackshoe · 2004-09-21 12:52 · Score: 1

My college (well, ex college) did the same thing. the online registrar and student portal closed @ 8pm, except on registration days, when it was open 24/7

--
Don't worry - its just stigmata. Pass me a napkin and don't you dare tell my mother.
Re:Repent, Sinners! by valkraider · 2004-09-21 12:53 · Score: 2, Interesting

Our college did batch runs for all sorts of stuff. We only had about 3000 students, but between faculty and staff it worked out to around 5000 people in various systems. Things had to run to calculate and process dorm room phone bills, cafeteria plans, accounts payable and receivable, invoices, transcripts, and DOOM wads...

And we were small. I can only imagine what a big school with 30 to 50 thousand students would need done... Not to mention all the DOOM3 wads nowadays. ;)
Re:Repent, Sinners! by Soporific · 2004-09-21 12:56 · Score: 1

The guy above you would have linux use init 574R7 for that short piece of text.

~S
Re:Repent, Sinners! by autopr0n · 2004-09-21 12:58 · Score: 4, Insightful

Yes, but maybe that was controlled by a cron-job and not some poor person manually initiating it every night? Just like an automated reboot is also not too scary on any decent Unix, but a manual action in MS-world?

a) This could easily have been done as a scheduled task in Windows 2000.

b) This could have been done by their code, in Windows 2000 and Windows 95.

c) Windows 2000 does not require a reboot after 49.7 days. Maybe their software relied on GetTickCount() or something.

The problem lies with the developers of the software, not Microsoft.
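
If the software did lean on a 32-bit millisecond tick count (that part is speculation in the post above, not a confirmed cause), the usual application-side bug is comparing raw tick values across the wrap. A minimal sketch of the difference, simulating the counter in Python; the function names are hypothetical:

```python
TICK_MODULUS = 2 ** 32  # a 32-bit millisecond counter wraps at this value


def naive_elapsed(start_tick, now_tick):
    """Breaks once the counter wraps: the result goes hugely negative."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Modular subtraction stays correct across a single wrap."""
    return (now_tick - start_tick) % TICK_MODULUS


start = TICK_MODULUS - 1000  # one second before the counter wraps
now = 500                    # half a second after the wrap, 1500 ms later

print(naive_elapsed(start, now))      # -4294965796
print(wrap_safe_elapsed(start, now))  # 1500
```

The modular version is only valid for intervals shorter than one full wrap period, which is why code that needs longer intervals typically tracks time some other way.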

--
autopr0n is like, down and stuff.
Re:Repent, Sinners! by multipartmixed · 2004-09-21 13:06 · Score: 3, Informative

> since non-SQL formats like DBase have always been
> a little funky when they start having to deal
> with million-record tables.

Oh, yes, SQL the magic bullet. I have a database problem! No matter what it is, I can solve it by migrating to a database system which uses SQL!

> It's amazing how ugly legacy databases can be
> compared to today's tech.

Yes, today's tech! SQL, the magic bullet! Why, we should use Oracle! It's SQL and thus must be modern! It's only been around since 1979!

Wait!

1979 was a long time ago.

Oh, dear?

Could it be that Oracle is not modern tech? But, how could it not be? It uses SQL, the magic bullet!

Hint: query language and scalability are not related.
Hint II: RDBMS is no magic bullet, either.

--

Do daemons dream of electric sleep()?
Re:Repent, Sinners! by matth · 2004-09-21 13:12 · Score: 1

I'm not sure where you gather your information from but I'll have you know I sysadmin a large number of linux and windows boxen, and we reboot the windows boxen much more often than the linux ones. Additionally I have linux machines that have been up for over a year.. and these are heavily used mail servers.. no need to reboot on a nightly basis!! good grief (charlie brown)
Re:Repent, Sinners! by AhabTheArab · 2004-09-21 13:15 · Score: 1

So, by this same logic, [not criticizing anybody for something as silly as what word is used when shutting down] couldn't you say Start(begin) the shutdown process for the Windows world?
Re:Repent, Sinners! by Darby · 2004-09-21 13:18 · Score: 1, Informative

Windows 2000 you could just schedule a reboot every month with the task scheduler. Win98 and ME have a scheduler also, but I've found that to be rather... unreliable.

The W2K scheduler isn't reliable either as we recently found out.
In the first place, you can *only* run scheduled tasks as the system user unless the user who has the task scheduled is actually logged in at the console. This means no non-system scheduled tasks can run if the system reboots. This means driving in to type your user name and password in. Pretty stupid to even bother scheduling in this situation.
Second, you have to explicitly type in the admin password to the scheduler for each and every task you want to actually schedule (see above).
Third, it has a habit of forgetting the password you typed in causing all of your scheduled tasks to fail.

I've never been a fan of Windows, but these recent discoveries led me to the conclusion that it seriously is a toy single user operating system.

Don't even get me started on the fact that .NET doesn't even support simple basic internet protocols like...say...FT freaking P.
Re:Repent, Sinners! by severoon · 2004-09-21 13:20 · Score: 1

I rate this post +5 snarky. (That's a good thing.)

While I agree there's no magic bullet, at the same time, I like my air traffic control fault-tolerant, thank you. One missed maintenance cycle should not result in even the remotest possibility of planes dropping out of the air like moths near a zapper.

--
but have you considered the following argument: shut up.
Re:Repent, Sinners! by severoon · 2004-09-21 13:29 · Score: 1

Yes, I would willingly place my life in the hands of an application called Back Orifice.

I don't care how many hackers wrote it, and how godly their skills. I'm not getting on a plane that's even tenuously related to anything called Back Orifice. (One million marketers can't all be wrong about the significance of product naming.)

--
but have you considered the following argument: shut up.
Re:Repent, Sinners! by Metzli · 2004-09-21 13:34 · Score: 1

If I was feeling particularly jaunty in my old Compaq days, I'd use shutdown -c now. Yeah, that's fun. Shutdown the whole cluster.

--
"It's too bad stupidity isn't painful." - A. S. LaVey
Re:Repent, Sinners! by Metzli · 2004-09-21 13:38 · Score: 1

I have to raise my hand too. I used to be one of the SysAdmins for a bunch of Solaris machines that had to be rebooted every month. The application (crapplication?) was so bad and had so many memory leaks that leaving it up for ~6 weeks would cause system flakiness and/or crashes. We used to joke that it was amazing that Windows was ported to the E450.

--
"It's too bad stupidity isn't painful." - A. S. LaVey
Re:Repent, Sinners! by rjdohnert · 2004-09-21 13:49 · Score: 1

What a moron. Let me guess: they should be using Linux, and Microsoft's going to lose that account because some moron didn't do his job. Give me a break. This has nothing to do with Microsoft. It sure hasn't changed my opinion about them or my decision to deploy 8 more Windows Server 2003 boxen in my datacenter.
Re:Repent, Sinners! by ckaminski · 2004-09-21 13:50 · Score: 4, Insightful

Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.

Your plane *WILL* land. It may be at a different airport, and sooner or later than planned, but you will get on the ground in one piece.
Re:Repent, Sinners! by jack_csk · 2004-09-21 13:50 · Score: 1

I second with this. When I took database course in college, I saw a database system that used SQL while it was storing data as text.
Re:Repent, Sinners! by Awptimus+Prime · 2004-09-21 13:54 · Score: 4, Insightful

and these are heavy used mail servers.. no need to reboot on a nightly basis!! good grief (charley brown)

Right, the code used for mail serving is some of the most mature server code out there. This is far more reliable than say a Linux box set up with proprietary, closed src, business applications with their own bugs.

My feelings are the article may have mistakenly blamed Windows for a problem with one of the server applications running on it. It is not typical for even Win2k to hang unexpectedly when running good hardware and well-written code.

I say fuck it. There is no point in ever trying to defend logic when it stands in the way of the Microsoft bash-fests on /..

Just to clarify, I am not saying Windows servers can and will run as reliably as a properly configured BSD, Solaris, or Linux box. I am just trying to state that Windows is reliable, if properly configured, but will probably not win an uptime competition. Big whoop. Reboot your shit during maintenance windows, regardless of OS; you run a much better chance of finding pending hardware failures. It is much better to powercycle that database server and get an error detecting the SCSI bus during a maintenance window than for it to happen at 5:30AM on a Monday or during your vacation.

Then again, I could be overly anal. I just like to avoid the reputations gained by those before me. :)
Re:Repent, Sinners! by pchan- · 2004-09-21 13:58 · Score: 5, Insightful

see what you've done, now i had to go and rtfa just to respond. here's a choice quote:

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days.

now, let's do a little math. the number of milliseconds in 49.7 days = (49.7 * 24 * 60 * 60 * 1000) = 4,294,080,000. recognize that number? that's right, it's 2^32 (actually, this is: 4,294,967,296, but it's pretty damn close). and why is that significant, you ask? because at 2^32, the unsigned int used by some versions of windows to keep the time since boot overflows back to zero, and bad things begin to happen.
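
The arithmetic above is easy to check. A short sketch, assuming (as the post does) an unsigned 32-bit counter of milliseconds since boot; the names are illustrative:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day
COUNTER_LIMIT = 2 ** 32           # first value a 32-bit counter cannot hold

# 49.7 days expressed in milliseconds, using integers to avoid float rounding.
ms_in_49_7_days = 497 * MS_PER_DAY // 10

# Exact number of days until the counter overflows back to zero.
days_until_wrap = COUNTER_LIMIT / MS_PER_DAY

print(ms_in_49_7_days)             # 4294080000
print(COUNTER_LIMIT)               # 4294967296
print(round(days_until_wrap, 2))   # 49.71
```

The gap between the two numbers is 887,296 ms, a bit under 15 minutes, so "49.7 days" is just the overflow point rounded to one decimal.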

is the problem microsoft's fault? goddamn right it is. in software that runs A MAJOR AIRPORT and controls the flight control and radar systems that affect thousands of lives in the air, an error like this is just not an option. the people who put this system into production ought to be fired. i don't know what the right os for this task is. solaris? aix? vms? something with provable uptime and reliability, something that can deliver uptime of longer than a month and a half, that's for sure.

I'm sure Linux doesn't store time in an infinite bit counter either.

i don't recall advocating linux for the job. maybe it can do it, maybe not. and in regards to being free, when my life is on the line, they better spend every god-damn dollar they can to make sure that critical systems do not fail under any circumstances. microsoft was absolutely the wrong choice in this case.
Re:Repent, Sinners! by dgatwood · 2004-09-21 14:06 · Score: 3, Interesting

As another link in this discussion noted, an unpatched Win2k does, in fact, require a reboot every 49.7 days because a certain process goes nuts and eats 60% of the CPU. The fact that MS has a patch for the problem does not mean that the problem does not exist.

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Repent, Sinners! by Atzanteol · 2004-09-21 14:15 · Score: 4, Insightful

You don't work with other people much do you? It's probably for the best.

These things cost money. Migrating apps that use the old DB to the new one, testing, bugs introduced in the migration, etc. If it works most companies will stick with it and not risk spending large amounts of money for no 'gain' (in their mind).

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:Repent, Sinners! by gnuman99 · 2004-09-21 14:18 · Score: 1

Originally the button just showed the Windows flag, so the choice of a label was basically the same as in Gnome
Mine has "Applications" next to it. "Log off" or "Shut down" is under the "Actions" menu... Lock Screen is there too..
Re:Repent, Sinners! by nathanh · 2004-09-21 14:47 · Score: 3, Insightful

So someone had to come up with a short piece of text that clues newbies in, and it worked rather well (in usability tests). 'Start' may not be optimal, but has anyone thought of something better? (Not that it matters anymore.)

Click Me. Menu. Actions. Tasks. Open Here.
Any of those make more sense than "Start".
Re:Repent, Sinners! by Bush+Pig · 2004-09-21 15:09 · Score: 3, Funny

I'm still having a bit of trouble with the notion that moving from UNIX to Windows was regarded as an upgrade.

--
What a long, strange trip it's been.
Re:Repent, Sinners! by AJWM · 2004-09-21 15:16 · Score: 1

This sounds like it has far more to do with DBase than with AIX, so why even mention the OS? And what, exactly, is a "non-SQL format"? SQL is a language, not a database or file format.

--
-- Alastair
Re:Repent, Sinners! by Vinson+Massif · 2004-09-21 15:17 · Score: 3, Funny

Personally I:
% su -
# uname -n
and MAKE SURE I'M ON THE RIGHT MACHINE !!
# shutdown -r 120 'go away!!'

On most systems, reboot just invokes `shutdown -r now`.

--
"Remember, any tool can be the right tool." -- Red Green
Re:Repent, Sinners! by GMFTatsujin · 2004-09-21 15:19 · Score: 2, Funny

I'm sure Emilia Airhart said the same thing before she patched her Windows 3.11!
Re:Repent, Sinners! by AJWM · 2004-09-21 15:31 · Score: 1

Who uses 'init 6' to shutdown Linux?

There's a perfectly good 'halt' binary, with (at least on my SUSE box) symbolic links from 'shutdown', 'poweroff', and 'reboot'.

Initializing runlevel 6 starts all the S* scripts in /etc/init.d/rc6.d, in this case there's just one, which is linked to the 'halt' script in /etc/init.d. Of course, "init 6" could do just about anything, depending on how the runlevels are defined.

--
-- Alastair
Re:Repent, Sinners! by NeoSkandranon · 2004-09-21 15:35 · Score: 1

I don't know either, other than the original infrastructure was of a kind that would need to do that, and they didn't feel like upgrading.

--
If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
Re:Repent, Sinners! by LostCluster · 2004-09-21 15:36 · Score: 1

Older database formats that don't support SQL aren't doing that just to be mean, it's because they don't naturally do as much indexing as even MS Access databases do. That's the critical flaw that causes them to start acting funny when they get to large sizes... if there's ever a mistake in the index, it can cause problems making that record come up on command.
Re:Repent, Sinners! by tulare · 2004-09-21 15:54 · Score: 2, Insightful

Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.
Or, more appropriately, to the hands of the pilots, including the one who had to take evasive action. What's glossed here is that a stupid application flaw very nearly did result in serious loss of life. Kudos to the pilot who knew what the fuck to do when the time came.

--
political_news.c: warning: comparison is always true due to limited range of data type
Re:Repent, Sinners! by MonsterChicharo · 2004-09-21 16:02 · Score: 1

I think that is precisely the kind of attitude that delivers the most expensive solution in the long run. You don't have to be state-of-the-art all the time, but come on; dBase to SQL is a no-brainer.

The company I work for had been delaying migration from an HP-3000 for years and years. Postponing the decision was made on the basis of if-it's-not-broken-don't-fix-it. Presently not even HP will give support to these machines, and the only people available for it are charging obscene amounts of money for doing it, no warranties attached. Now that we have been forced by headquarters to implement strict security measures we find that it is simply not possible to do it on the HP-3000, therefore forcing us to migrate to another platform. Quickly. As in right now. You can picture the nightmare it has become.
Re:Repent, Sinners! by eclectechie · 2004-09-21 16:08 · Score: 1

I've seen AIX-based database systems that require an overnight downtime to do reindexing, since non-SQL formats like DBase have always been a little funky when they start having to deal with million-record tables. It's amazing how ugly legacy databases can be compared to today's tech.
I have written applications that use million+ -record tables. And yes, rebuilding indexes over such tables is slow.
But an index is an index is an index. I don't care what legacy or modern technology your database uses, when you need to rebuild an index, you need to read all your million+ records to know what to index. That is what takes the time, barring crappy index insertion algorithms (both are SQL-neutral issues).
The right answer is to never let your index get out of date (or corrupted). Worked for me on AS/400...
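
A toy sketch of the point above: a full rebuild has to touch every record, while keeping the index current costs one small update per insert. The dict-based index and the function names are assumptions for illustration only, not how any particular database engine stores its indexes.

```python
def rebuild_index(records, key):
    """Full rebuild: reads every record once, so cost grows with table size."""
    index = {}
    records_read = 0
    for position, record in enumerate(records):
        records_read += 1
        index.setdefault(record[key], []).append(position)
    return index, records_read


def insert_record(records, index, record, key):
    """Incremental maintenance: update the index as part of the insert."""
    records.append(record)
    index.setdefault(record[key], []).append(len(records) - 1)


# Build a small table and index it once.
table = [{"id": n, "city": "LAX" if n % 2 else "OAK"} for n in range(1000)]
city_index, reads = rebuild_index(table, "city")
print(reads)  # 1000

# Keep the index current on insert instead of rebuilding overnight.
insert_record(table, city_index, {"id": 1000, "city": "LAX"}, "city")
fresh_index, _ = rebuild_index(table, "city")
print(city_index == fresh_index)  # True
```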

--
"The empty vessel makes the greatest sound." -- William Shakespeare; Henry V, 4. 4
Re:Repent, Sinners! by adamfranco · 2004-09-21 16:16 · Score: 2, Interesting

I personally like "Commence".

Commence writing.
Commence listening to music.
Commence shutdown procedure.

It works for everything!

Usage of "Start" instead of "Commence" probably has something to do with the majority of the population wondering who was graduating when they clicked the button...

--
"When ideology and theology couple, their offspring are not always bad but they are always blind." -- Bill Moyers
Re:Repent, Sinners! by dnoyeb · 2004-09-21 16:24 · Score: 1

You have to love a system that requires downtime as part of uptime.

Yea, i gotta go through 4-5 hours of downtime every day. And I'm only getting worse...
Re:Repent, Sinners! by ckedge · 2004-09-21 17:02 · Score: 2, Interesting

> a) This could easily have been done as a scheduled task in windows 2000.

Uh, no, no it could not.

Scheduled Tasks in Microsoft Windows have never been reliable. Quite frequently mine have their security credentials "screwed up" somehow and stop working until I notice and "touch" them so I'm forced to re-enter a user/pwd.

I have never EVER heard of Solaris cron failing to run on time.

> and not some poor person manually initiating it every night?

It's windows, you have to have a person present to ensure that the system actually a) goes down b) comes back up as intended.

I've done a half year consulting gig and spent a month walking 5 blocks through the downtown core of San Francisco at 5am every single FUCKING morning to hit the power button on a 4 way 400 MHz $50,000 Compaq windows box at one of the biggest banks in the world. Database held holdings information on around half a trillion dollars in equities.
Re:Repent, Sinners! by sxpert · 2004-09-21 17:44 · Score: 1, Insightful

but "commence" is a french word, a certain part of the US population surely don't want a french word in the place of the idiotic "start" on their windows do something button...

"The problem with the french is that they don't have a word for 'entrepreneur'" (G.W. Bush)
Re:Repent, Sinners! by mpe · 2004-09-21 18:13 · Score: 1

I've seen AIX-based database systems that require an overnight downtime to do reindexing, since non-SQL formats like DBase have always been a little funky when they start having to deal with million-record tables.

This is an application issue rather than an Operating System issue, though.
Re:Repent, Sinners! by mpe · 2004-09-21 19:55 · Score: 1

Click Me. Menu. Actions. Tasks. Open Here.

How about "Do Things" or "Do Stuff"? Which would be both suitably ambiguous and understandable by people of any literacy level.
Re:Repent, Sinners! by babybird · 2004-09-21 19:58 · Score: 1

Yes, those people would probably rather have "Freedom" there....but I'm not sure that could really apply to Windows, do you think?

Oh well, those same Americans are perfectly happy with a government that just tells them they have freedom, rather than actually having it... so maybe that would be poetic in some sense.

--
Keith D.
Re:Repent, Sinners! by Jeppe+Salvesen · 2004-09-21 20:12 · Score: 1

You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?

Say - if you want to take a full backup of large mysql tables without the comfort of a replicating slave. It is preferable to have planned downtime on a regular basis rather than unplanned downtime.

--
Stop the brainwash
Re:Repent, Sinners! by timmyf2371 · 2004-09-21 20:36 · Score: 1

If you don't want to go to the "Start" button in Windows to shut down... well... that one is your problem. Not mine.
I press the "Start" button in Windows to begin the shutdown process, just as you would type "init 6" to begin run level 6.
I've always failed to understand the significance of the joke about pressing "Start" to begin the shutdown process.

--

Backup not found: (A)bort (R)etry (P)anic
Re:Repent, Sinners! by Anonymous Coward · 2004-09-21 21:33 · Score: 1, Insightful

"is the problem microsoft's fault? goddamn right it is."

No, it is the fault of those who selected a product with this problem to run an airport.
Re:Repent, Sinners! by Shadowlore · 2004-09-21 23:51 · Score: 2, Interesting

Well, in the past 10 years I have had a number of clients who have had Linux, Unix, Windows, and Mac systems that were critical to their day to day routine and they did nightly/weekly/monthly reboots as part of their maintenance.

I guess when you grow up and get out of high school, you will find that your linux box running as a DSL router is not a good example of a production server.

Yeah they did that to the Linux boxes here, because they didn't know better. Now, with real Linux experts, our Linuxen are not rebooted or taken down for routine maintenance. And no we aren't talking about "DSL Routers". We are talking about systems that process email to the tune of a million messages per server per day.

Critical? You bet it is. Merrill Lynch, HP, APL, and many others. Planned downtime for "regular maintenance"? Nope. The only time we plan downtime is for hardware replacement/upgrade and kernel upgrades, the occasional (rare) server moves, and full data center shutdowns to perform data center failover verification.

I guess when you grow up and get out of community college, you'll find that running a dormitory quake server is not a good example of a business critical production server. /pointed sarcasm.

--
My Suburban burns less gasoline than your Prius.
Re:Repent, Sinners! by rabtech · 2004-09-22 00:39 · Score: 1

In Windows 2000 you have the various QueryPerformance APIs to get exact system time as well as POSIX-compliant gettimeofday() which returns a 64-bit int.

GetTickCount() was retained for backward compatibility only.

--
Natural != (nontoxic || beneficial)
Re:Repent, Sinners! by Daytona955i · 2004-09-22 00:43 · Score: 1

What, you don't have the machine name in your prompt? Talk about sinning...
Re:Repent, Sinners! by jones948 · 2004-09-22 00:49 · Score: 2, Funny

Click Me. Menu. Actions. Tasks. Open Here.
Any of those make more sense than "Start".

Ah, but then you couldn't tie in a catchy Rolling Stones song with your product launch.
Re:Repent, Sinners! by gadget+junkie · 2004-09-22 00:56 · Score: 1

I would beware of equating "old stuff" with "doesn't work". Most of the military stuff, ASICs and so on, is basically the best technology money could buy in the eighties, precisely because all the S**t has been thrown out. What causes problems in these cases is, in my view:

1. feature drift; at some point, old stuff cannot cope, and the camel's back is broken;

2. Vendor behaviour: MS has a "force them to upgrade" strategy by which you cannot recall any release that can be called "stable", by rigid standards.

I wouldn't want to go into my usual MS bashing, but consider this: if, by chance, win 2000 SP XYZ was completely stable, I'd buy ten CD's and lock them into ten different vaults, because all the world would program with that as a target, even five - ten years from now.

--
"If a boss demands loyalty, give him integrity. But if he demands integrity, give him loyalty." (John Boyd, 1927-1997)
Re:Repent, Sinners! by mpe · 2004-09-22 01:13 · Score: 1

but here is what i really want: i would like you (microsoft, inc.), to stop selling your products to mission critical and infrastructure operations until such a time as they are ready to do so.

IIRC Microsoft do actually state in their documentation that Windows is not suitable for such tasks.
But this is probably more intended as CYA. Since their advertising and marketing arms push the myth of Windows as a general purpose OS.
Re:Repent, Sinners! by nfsilkey · 2004-09-22 01:22 · Score: 1

I guess when you closely read your parent, you will find that the author has a two digit UID. High school? Pfft. :)
Re:Repent, Sinners! by mpe · 2004-09-22 01:24 · Score: 1

now, let's do a little math. the number of milliseconds in 49.7 days = (49.7 * 24 * 60 * 60 * 1000) = 4,294,080,000. recognize that number? that's right, it's 2^32 (actually, this is: 4,294,967,296, but it's pretty damn close). and why is that significant, you ask? because at 2^32, the unsigned int used by some versions of windows to keep the time since boot overflows back to zero, and bad things begin to happen.

Bad things may happen if the software does not take account of this system variable being capable of "roll over".
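The arithmetic quoted above, and the point about software needing to account for roll-over, can be sketched in a few lines. This is only an illustration, not the code of the system involved: `days_until_wrap`, `elapsed_ms` and `naive_elapsed_ms` are hypothetical helpers, and the 32-bit millisecond counter is modelled with plain Python integers.

```python
# Illustrative sketch: a 32-bit millisecond tick counter and a wrap-safe
# elapsed-time calculation. All names are hypothetical, not a real API.

TICK_MODULUS = 2 ** 32               # an unsigned 32-bit counter wraps at 2^32
MS_PER_DAY = 24 * 60 * 60 * 1000     # milliseconds in one day


def days_until_wrap() -> float:
    """Days a millisecond counter runs before wrapping back to zero."""
    return TICK_MODULUS / MS_PER_DAY


def naive_elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Plain subtraction: goes negative if the counter wrapped in between."""
    return now_tick - start_tick


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed milliseconds between two readings, tolerating one wrap.

    Subtracting modulo 2^32 stays correct across a roll-over, provided the
    real interval is shorter than one full wrap period.
    """
    return (now_tick - start_tick) % TICK_MODULUS


if __name__ == "__main__":
    print(round(days_until_wrap(), 2))        # 49.71
    start = TICK_MODULUS - 500                # half a second before the wrap
    now = 1500                                # 1.5 seconds after the wrap
    print(naive_elapsed_ms(start, now))       # large negative number
    print(elapsed_ms(start, now))             # 2000
```

Software that only ever compares tick differences this way keeps working through the roll-over; software that compares raw tick values does not.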
is the problem microsoft's fault? goddamn right it is. in software that runs A MAJOR AIRPORT and controls the flight control and radar systems that affect thousands of lives in the air, an error like this is just not an option. the people who put this system into production ought to be fired. i don't know what the right os for this task is. solaris? aix? vms? something with provable uptime and reliability, something that can deliver uptime of longer than a month and a half, that's for sure.

The fault is more with the people who chose to use MS Windows in this way. Microsoft's blame is more at the level of promoting their products as something they are not as well as encouraging a culture of "everything Microsoft".
Re:Repent, Sinners! by mpe · 2004-09-22 01:50 · Score: 1

Thankfully, Chicken Little, planes do NOT fall out of the sky during a total air traffic control outage, but control regresses to pencil and paper.

You use paper and pencil when you have radio and no radar. It isn't much use when you have radar and no radio.
Re:Repent, Sinners! by MacGod · 2004-09-22 01:53 · Score: 1

Only if that is what you have run level 6 configured to do.

All the init 6 command does is initialize run level 6. You can have run level 6 configured any way you want.

It isn't hard wired to shut down. (On debian run level 6 does a reboot... run level 0 halts the system.)

A few days ago there was yet another article lamenting how Linux wasn't ready for the mass market yet, and of course, dozens of Slashdotters came out of the woodwork saying how easy everything was.

Well, it seems to me that it's stuff like this that makes it hard for a new user. "Initiate run level 6" isn't intuitive, "configure run level six to whatever you want" equally obscure. Bash the naming of the start menu all you want, but it's understood. You don't have to configure it to shut down, and there's no run levels or anything like that.

It is my personal opinion that Linux's mass-market adoption problems stem partly from too many options for the average computer user, and from too-obscure terminology/command names.

--
"Reality is merely an illusion, albeit a very persistent one " -Albert Einstein
Re:Repent, Sinners! by freqres · 2004-09-22 02:27 · Score: 1

but come on; dBase to SQL is a no-brainer.

Have you actually done any database app upgrade/conversions? I'll agree that the DATA conversion is usually pretty straightforward (usually, there are some pretty archaic and unsupported db systems out there) but it's the actual application that ends up being the bitch. Converting some huge Foxpro or Clipper app to a WebApp+Oracle can be horrible. These are the huge old apps that only some old-timers at the company use but don't really know much about it, have poor documentation of both code and business logic and have gone through lots of hack revisions to just get something to work or add a new feature. It's just great when you ask to talk with the guy who programmed the system and the answer is either he retired in 1987 or died 10 years ago. Then the real kicker is when you finally get things figured out and get the thing ported, everyone comes up with new ideas for it and changes they think should have been made decades ago.

Granted this is more a general code updating/porting problem (ok, it's a rant), but when talking about old dBase related systems I have never run across a situation where a business wanted just the data updated.

--
Rampant Ninja related crimes these days...Whitehouse is not the exception
Re:Repent, Sinners! by jack_csk · 2004-09-22 02:30 · Score: 1

Interesting, back in the days when I worked in an ISP, I never have to reboot those Linux servers besides Kernel update.
Re:Repent, Sinners! by freqres · 2004-09-22 02:32 · Score: 1

Maybe 'Clickit Ritecheer' would be more appropriate for those segments of the population?

--
Rampant Ninja related crimes these days...Whitehouse is not the exception
Re:Repent, Sinners! by SeanDuggan · 2004-09-22 02:44 · Score: 1

Click Me. Menu. Actions. Tasks. Open Here.
Interestingly enough, that's what they wound up doing. On Windows 95, you may notice that it often came up with a little moving arrow pointing to the Start button, saying "Click here".
Anyhow, if you're interested in an actual explanation of the start button's history, there have been blog entries from Microsoft engineers explaining it. To summarize their reasoning for labelling it 'Start', "it sent our usability numbers through the roof, because all of a sudden, people knew what to click when they wanted to do something."

--
This sig has absolutely no significance and serves only to take up screen space and waste the time of the reader.
Re:Repent, Sinners! by fatphil · 2004-09-22 02:53 · Score: 1

"""
You have to love a system that requires downtime as part of uptime. How many Linux users have this problem?
"""

I run linux and have a similar "data overload" problem:

phil$ uptime
17:48:38 up 82 days, 5:20, 11 users, load average: 1.00, 1.00, 1.00

The "data overload"? My uptime's actually 497+82=579 days.
The digits 4, 9 and 7 in that order may ring a bell.
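The 497 is easy to check, assuming the usual explanation: a 32-bit tick counter incremented 100 times per second (my assumption about this particular box, not something stated above). A small sketch with hypothetical helper names:

```python
# Illustrative sketch: how long a fixed-width tick counter lasts before it
# wraps, and what the real uptime is once it has wrapped. The 100 Hz tick
# rate and 32-bit width are assumptions, not facts taken from the post.

SECONDS_PER_DAY = 24 * 60 * 60


def wrap_period_days(counter_bits: int, ticks_per_second: int) -> float:
    """Days until a counter of the given width wraps at the given tick rate."""
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY


def true_uptime_days(reported_days: float, wraps: int,
                     counter_bits: int = 32,
                     ticks_per_second: int = 100) -> float:
    """Reported uptime plus one full wrap period per roll-over."""
    return reported_days + wraps * wrap_period_days(counter_bits,
                                                    ticks_per_second)


if __name__ == "__main__":
    print(round(wrap_period_days(32, 100), 1))    # 497.1 (100 ticks/second)
    print(round(wrap_period_days(32, 1000), 1))   # 49.7  (1000 ticks/second)
    print(int(true_uptime_days(82, wraps=1)))     # 579
```

Same counter width, different tick rate: 49.7 days at a millisecond tick, 497 days at a 10 ms tick.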

FP.

--
Also FatPhil on SoylentNews, id 863
Re:Repent, Sinners! by MonsterChicharo · 2004-09-22 02:53 · Score: 1

You are absolutely right. Older applications are usually dependent on one or two persons, poorly documented (if documented at all), and have so many patches that it is difficult to make sense of them at all.

Sometimes the only solution is to start all over again, gathering requirements directly from the sponsor or user rather than trying to migrate an existing app.
Re:Repent, Sinners! by wikdwarlock · 2004-09-22 02:55 · Score: 1

Amen! Love it or hate it, I believe (can't find any supporting evidence, though) that MS specifically says not to use their software for things that could potentially cost human life via failure.

Can you build your kids a treehouse using duct tape and balsa wood? Sure! Is it possible it could be safe for 20 years? Absolutely! Is it still reckless to use the wrong tools and materials in this way? You bet.

--

"I must not fear. Fear is the mind killer." -Bene Gesserit Litany Against Fear
Re:Repent, Sinners! by _STL99 · 2004-09-22 04:54 · Score: 1

Ok, so init 6 initializes run level six. ...and the Start button starts the process for the user to shut down. What's the difference?
Re:Repent, Sinners! by AnyoneEB · 2004-09-22 07:02 · Score: 1

when hundreds of thousands of travellers in airports across the world are delayed because one of the busiest airports in the world is shut down due to a 10 year old known bug in your operating systems that has not been fixed, that is simply not acceptable.

Wait a minute! I'd understand if there was no patch (and it sounds like it was pretty stupid to release it in the first place), but the link to the fix is in the article text. I'm not sure how recent it is (only date I see on the link is "Last Reviewed: 8/9/2004 (3.2)"), but a fix does exist.

I'm not saying they should be using Windows. They should certainly use a more stable OS, but this problem is not purely Microsoft's fault, they already patched it.

--
Centralization breaks the internet.
Re:Repent, Sinners! by hesiod · 2004-09-22 07:09 · Score: 1

> Maybe 'Clickit Ritecheer' would be more appropriate

Reminds me of the language selection in Redhat installs where you could select "Redneck."
Re:Repent, Sinners! by hesiod · 2004-09-22 07:16 · Score: 1

> Your personal opinion doesn't matter, simply because you don't know wtf you are speaking about.

Except when the argument revolves around the fact that the target audience doesn't know "wtf" you are talking about either. So, in this case, it matters completely. Unless you just want to be snobbishly "right," like you are doing.
Re:Repent, Sinners! by svallarian · 2004-09-22 07:30 · Score: 1

Lemme introduce you to an IP or phone enabled surge protector my friend!

Steven V>

--
I patented screwing your mom. But it got revoked for "prior art."
Re:Repent, Sinners! by smartdreamer · 2004-09-22 08:07 · Score: 1

In fact capitalism has no problem with bogus software.
I suggest that the fault goes to the manager who decided to make the transition from UNIX to Windows.
;-)
Re:Repent, Sinners! by jschottm · 2004-09-22 09:47 · Score: 1

The problem is when should that maintenance window be? My servers are co-loed and have a "slow" time of 1AM through 6AM, times when I'd really rather be asleep or at least, not-doing-work(tm). And because some of them are Dell PowerEdge 350s (someone else bought them, not my fault), they take about 5-6 minutes to reboot. Even during my off hours, that's still quite a few connections that fail to get through.

Before some of the straggling Windows boxes got converted to something else, we did weekly preventative reboots, and from time to time, they'd fail to come back up. And the powers that be saw it as being more affordable to have an IT staffer go out to the co-lo, rather than investing in a remotely controlled power unit.

Of course, I could just be like some admins in the branch I work in and shut down the database server at 5:30 every Monday afternoon, nevermind that people are still using it...
Re:Repent, Sinners! by lanner · 2004-09-22 17:23 · Score: 1

Three years ago I worked for a computer gaming firm and we put out a well-known MMO product. It, unfortunately, ran on Windows 2000 systems because the server software was written to run on Windows. This exact same problem plagued our systems, where the servers would refuse to allow UDP sessions to be opened and closed after approximately 50 days! Bad stuff would start to happen overall, and the system would become unstable. We had to reboot systems every 30 days in the middle of the night. It was a real pain. For all of this time, I had no idea what was causing the bug!
Re:Repent, Sinners! by Awptimus+Prime · 2004-09-22 17:59 · Score: 1

Those guys sound mean.

Any company with proper communication requires IT to send an email to staff reminding people of upcoming downtimes, whether the outage is by your staff or ISP. I think it is unprofessional to abruptly take away a client or employee's resources without warning. Nobody likes that type of IT guy. I find being liked comes in very handy when reviews come around and better-paying positions open up. :)
Re:Repent, Sinners! by n3k5 · 2004-09-22 19:30 · Score: 1

Click Me. Menu. Actions. Tasks. Open Here.

Any of those make more sense than "Start".

100% wrong. They needed to make a design choice that was to stay around for many years. "Click Me" or "Open Here" would have been, and still would be, ridiculed much more than "Start", and not just by MSFT opponents. "Menu" isn't very inventive, as you could label any menu "Menu". It isn't completely braindead, but I doubt it would improve on usability compared to "Start". As the start menu contains, among other things, links to documents, which are neither "Actions", nor "Tasks", these labels would be simply wrong.

I don't want to play the MSFT apologist at all, in fact if I got to design a desktop interface, I'd do it very differently. I just want to point out that UI design can be very difficult, as it involves figuring out what's going on in the minds of complete strangers. Even though you don't like the word 'Start' and think that for you your suggestions make more sense, they really don't.

--
but what do i know, i'm just a model.
Re:Repent, Sinners! by nathanh · 2004-09-22 19:45 · Score: 1

I don't want to play the MSFT apologist at all, in fact if I got to design a desktop interface, I'd do it very differently. I just want to point out that UI design can be very difficult,

No, it really isn't. You just think it is because you've never tried.
as it involves figuring out what's going on in the minds of complete strangers. Even though you don't like the word 'Start' and think that for you your suggestions make more sense, they really don't.

Microsoft later put the words "Click Here" on the taskbar with an arrow pointing to the Start button. Obviously my first suggestion was not picked at random; I based it on what Microsoft ended up doing to correct their stupid UI design.
Re:Repent, Sinners! by n3k5 · 2004-09-23 01:59 · Score: 1

[UI design can be very difficult]
No, it really isn't. You just think it is because you've never tried.
I have worked on user interfaces and still do; in fact it's part of my course of studies. We see frustrating mistakes and flaws in user interfaces all the time. Many computer programs, even ones that are used by millions of people and were developed with huge amounts of money, have aspects that seem counter-intuitive to most users. Pretty much every web site's design is far from optimal. This wouldn't be the case if UI design was always very easy.

Microsoft later put the words "Click Here" on the taskbar with an arrow pointing to the Start button.
This hint appeared when a user had just rebooted or logged in, but did not click the start button right away. As soon as s-/he did, it vanished again. This is not correcting a broken start button, but adding a new feature targeted at n00bs. Using "Click Here" as a hint is very different from using it as a button's label.

--
but what do i know, i'm just a model.
Re:Repent, Sinners! by severoon · 2004-09-23 10:07 · Score: 1

Yea, you're right. This is no big deal...there's no sense in trying to minimize risks since they're already pretty low...it's good enough, right?

What the hell are you talking about? Chicken Little? Bad ATC is the number two reason airplanes do drop out of the sky after pilot error. And if you look at most cases of pilot error, bad ATC nearly always contributes to the errors that are ultimately assessed as the responsibility of the pilot.

Besides, even if I give you the benefit of the doubt, my requirements for airlines are slightly higher than simple survival. I want to do more than just land...I want to land: (1) on a runway, as opposed to in a corn field or a swamp, (2) at the destination declared on my ticket, and (3) on or ahead of the arrival time declared on my ticket. Anything less than that is unacceptable. You can't claim this isn't a big deal simply because, hey, most people will probably still make it out alive.

Chicken Little, indeed.

--
but have you considered the following argument: shut up.
Re:Repent, Sinners! by Paradise+Pete · 2004-09-29 17:42 · Score: 1

Usage of "Start" instead of "Commence" probably has something to do with the majority of the population wondering who was graduating when they clicked the button...
I think it was because some of the testers kept saying "Commence? But I don't have anything to say!," while others would write stuff like "I think it's very nice."
Re:Repent, Sinners! by Paradise+Pete · 2004-09-29 17:47 · Score: 1

I press the "Start" button in Windows to begin the shutdown process
By that logic every command in the world belongs there. Maybe they should put the Close Window command in there, for when I want to start closing the window.

Anyone want to clue them in to scheduled jobs? by FyRE666 · 2004-09-21 09:50 · Score: 3, Insightful

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month. That said, I was under the impression that a simple "at" job could be used on a Windows machine to run a script periodically (at is similar to cron, except far less capable, of course). Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.

We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).

--
Code, Hardware, stuff like that.

Re:Anyone want to clue them in to scheduled jobs? by DarkKnightRadick · 2004-09-21 09:54 · Score: 1

We use a similar system to reboot all of our NT servers every weekend to help prevent crashes during the week (doesn't work of course, but still).

You and LAX must not have installed Windows properly. /sarcasm. (;

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
Re:Anyone want to clue them in to scheduled jobs? by TykeClone · 2004-09-21 09:54 · Score: 3, Interesting

at sucks. Very, very much.
I've got an NT server that would hang after 2 weeks. I set up an at job to restart that service nightly and do not have that problem.
I've also got several linux servers that just plain run (and some NT/2000 servers as well).
That being said, rebooting sometimes does clear up many evils. We have a speakerphone (around 10 years old - no OS) that just wouldn't work one day. After looking at it, I unplugged it and plugged it back in (I rebooted it!) and it worked. No good reason, it just helps.

--
A fine is a tax you pay for doing wrong and a tax is a fine you pay for doing all right.
Re:Anyone want to clue them in to scheduled jobs? by dbottaro · 2004-09-21 09:58 · Score: 5, Informative

Agreed. A well written AT script something like this: Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c
Would do the trick... We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.
While we knew that this was not a great solution, no one needed to access the server at that time of night. Any right minded IT person should be able to see the flaw in the FAA's logic.

--
Coding my way to the next BSOD!
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:10 · Score: 4, Interesting

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service with some piece of trash that crashes if left running for over a month

What if that proven system is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by Ann+Elk · 2004-09-21 10:11 · Score: 4, Insightful

It's obviously lunacy for any company to replace a proven system, which has given years of reliable service...

It's obvious you have never toured an ARTCC (Air Route Traffic Control Center). The system that is being replaced was barely hanging together by voodoo and chicken wire. It was designed back in the 60's to handle maybe 1/10th the current capacity. It is in dire need of replacement.

That said, I'm not convinced Windows (or Linux for that matter) is an appropriate OS for an application that practically defines the phrase "mission critical".
Re:Anyone want to clue them in to scheduled jobs? by FyRE666 · 2004-09-21 10:23 · Score: 2, Interesting

What about "olde" mainframes running assembler code? The pool of expertise is drying up... sometimes you need to pitch the hardware.

Yeah but maybe they should have replaced it with something that, you know, actually works...

I'm all for change, but I wouldn't swap my car for a brand new sparkling wheelchair, my haircut for a mullet, or my soul/self respect for a job writing VBScript. It just doesn't seem right, you know?

--
Code, Hardware, stuff like that.
Re:Anyone want to clue them in to scheduled jobs? by LifesABeach · 2004-09-21 10:27 · Score: 2, Interesting

Well, I guess I've seen a first here. The system was 'upgraded' to Windows 2000? The manager that made that decision has done more than any staff member at Bin-Laden University for the Scrambled of Brains.
Re:Anyone want to clue them in to scheduled jobs? by Billy+the+Mountain · 2004-09-21 10:29 · Score: 5, Funny

Each M T W Th R S Su 12:45 AM shutdown /l /r /y /c

We have used that exact script for YEARS to nightly reboot a troublesome NT4 BDC at a remote location.

Does it work on Friday? You might want to check on that...

BTM

--
That was the turning point of my life--I went from negative zero to positive zero.
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:30 · Score: 1

This wasn't an ARTCC. Besides, the ARTCC's are all on DSR now, and a bunch have URET on top of that. They're slowly but surely entering the modern age!

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 10:33 · Score: 2, Insightful

The cost of having a trained monkey reboot the system every month for 10 years is probably less than the cost of maintenance on the old hardware.

It makes sense on paper. It doesn't work out when the human element "screws the pooch" (they rarely show you that slide in the powerpoint, do they?!)

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · 2004-09-21 10:37 · Score: 4, Funny

Such a script could, if I'm not mistaken, be used to reboot the machine. One would think this would be an ideal way to hide the problem very nicely.
For a real-time application like air traffic control, you really can't automate reboots like that. You need someone standing there to say "crap! crap! crap!" and take the necessary actions when the system decides it doesn't want to reboot properly.*
*even if they don't know what to do, they can at least shout "crap!", which is more than a system stuck at the BIOS screen with an "elbow parity error" can say.

--
If a job's not worth doing, it's not worth doing right.
Re:Anyone want to clue them in to scheduled jobs? by jelle · 2004-09-21 10:38 · Score: 1

They should have used duck-tape and tie-wraps and the old system would still have been fine for another 45 years.

--
--- Hindsight is 20/20, but walking backwards is not the answer.
Re:Anyone want to clue them in to scheduled jobs? by doorbot.com · 2004-09-21 10:43 · Score: 1

What if that proven system is decaying out from under you? HD's failing, memory going bad... Tell you what, can you get me new boards for an IBM RT pc? I highly doubt it.

Easy! Just give John Titor a call... he's got all the time in the world, plus he has prior experience finding old tech!
Re:Anyone want to clue them in to scheduled jobs? by thrills33ker · 2004-09-21 10:49 · Score: 4, Funny

"This wasn't an ARTCC. Besides, the ARTCC's are all on DSR now, and a bunch have URET on top of that."

Well, I'm glad you cleared that up!
Re:Anyone want to clue them in to scheduled jobs? by glitch23 · 2004-09-21 10:50 · Score: 1

I'd much rather have an OS in a mission critical environment at least tell me something when it crashes (linux: oops!) than give me a BSOD full of incomprehensible characters.

--
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
Re:Anyone want to clue them in to scheduled jobs? by drinkypoo · 2004-09-21 10:56 · Score: 2, Insightful

I bet I could get you a replacement board for an IBM RT PC. I gave some Model 135s to a guy I used to work with, and I bet he's still got them or knows who has them. Since there's nothing better than a 135 I can't imagine you'd evince any significant dismay over that idea. There's a lot of that kind of crap running around assorted towns where IBM's got offices, like Austin - which is where I got them. I had AOS 4.3 and BSD-4.3-lite... More or less the same thing really.

Er anyway back to the point, you don't replace an old workhorse with a new POS. You get a newer workhorse than the last workhorse, and maybe not even a new one. I'd rather go dig up some Sparcstation 10s with supersparcs in them to replace (for example) your RT PC. Running SunOS or perhaps netbsd, you should be able to port your software from BSD. If you are running AIX on your RT, maybe you'd be better off with an old RS6k, they're available very cheaply. Hell, I once sold a 603e laptop RS6k (thinkpad power series) to a guy for like nine hundred bucks or so. That little bastard would make a better server than your average wintel box, given it was SCSI, assuming that you were replacing an antique.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Anyone want to clue them in to scheduled jobs? by Maserati · 2004-09-21 10:56 · Score: 1

If they showed "that slide" I'd go to even fewer meetings than I do now.

Remember, even Unix systems like a little rebootin' now and then.

Not Linux, that's perfect.

--
Veteran, Bermuda Triangle Expeditionary Force, 1992-1951
Re:Anyone want to clue them in to scheduled jobs? by Anonymous Coward · 2004-09-21 11:26 · Score: 5, Informative

I used to write aviation message handling systems. We migrated from Tru64 (now extinct) to Linux and have had much better: performance, maintainability, hardware support, and reliability.

Of course, the code leap from Tru64 to Linux is quite small, which is the biggest reason why Linux was chosen.

Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring. All circuits were split and would simultaneously enter both servers. Only the primary server would route the message.
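For anyone trying to picture the split-circuit arrangement, here is a rough Python sketch. It is not the actual system: the class and function names are invented for illustration, and the routing callback is a stand-in. It only shows the idea that every message enters both servers while only the primary forwards it.

```python
from typing import Callable, List


class MessageServer:
    """One half of a hot-standby pair: always stores, routes only when primary.

    Hypothetical model for illustration; not the original system's design.
    """

    def __init__(self, name: str, is_primary: bool,
                 route: Callable[[str, str], None]):
        self.name = name
        self.is_primary = is_primary
        self.route = route
        self.stored: List[str] = []

    def receive(self, message: str) -> None:
        self.stored.append(message)          # both servers keep every message
        if self.is_primary:
            self.route(self.name, message)   # only the primary forwards it


def deliver(message: str, servers: List[MessageServer]) -> None:
    """Model a split circuit: the same message enters every server."""
    for server in servers:
        server.receive(message)
```

Since the standby already holds every message, promoting it amounts to flipping `is_primary`; how the real system detected failure and promoted the standby isn't stated here.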

No, we didn't require the customer to reboot. The system could run for years at a time.

Putting mission critical applications on Windows 95 is just plain stupid.
Re:Anyone want to clue them in to scheduled jobs? by j3110 · 2004-09-21 11:38 · Score: 1

My favorite part was the human error part... It was human error, alright... the jack-a$$ that decided to use a 32bit millisecond counter for uptime, and crash when the counter overflows.
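To put numbers on that: a 32-bit counter of milliseconds wraps after 2^32 ms, which is a bit under 50 days. A minimal Python sketch follows (function names are made up for illustration, and this is not the FAA application's code). It shows the wrap point and the usual modular-subtraction trick for getting a correct elapsed time across a rollover.

```python
# Illustrative only: models an unsigned 32-bit millisecond tick counter.
MASK32 = 0xFFFFFFFF                    # largest value the counter can hold
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap() -> float:
    """Days of uptime before a 32-bit millisecond counter rolls over."""
    return (MASK32 + 1) / MS_PER_DAY


def tick(counter: int, delta_ms: int) -> int:
    """Advance the counter, wrapping the way a 32-bit register would."""
    return (counter + delta_ms) & MASK32


def elapsed_ms(start: int, now: int) -> int:
    """Wrap-safe elapsed time: subtract modulo 2**32 rather than compare raw values."""
    return (now - start) & MASK32


if __name__ == "__main__":
    print(f"wraps after {days_until_wrap():.2f} days")
    start = MASK32 - 500               # half a second before rollover
    now = tick(start, 2000)            # two seconds later the counter has wrapped
    print(now < start)                 # a naive comparison sees time going backwards
    print(elapsed_ms(start, now))      # modular subtraction still gives the real gap
```

Code that only ever looks at differences computed this way survives the rollover; code that compares raw counter values does not.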

Granted Linux isn't the best solution, but it's a hell of a lot better than running "mission critical" systems on "voodoo and chicken wire". ...and in the end, isn't it about getting marginally better for marginal cost? Take your above average /.er and give him a week in the airlines, and he'll probably have something more reliable set up using Wine running on Linux that runs off a USB keychain flash stick, and is administrable from anywhere in the world using SSH and VNC.

--
Karma Clown
Re:Anyone want to clue them in to scheduled jobs? by 0x0d0a · 2004-09-21 11:39 · Score: 2, Funny

Aviation expects 99.9999% uptime with absolutely no message loss, and we would achieve that with hot-standbys and MySQL mirroring.

Yes, that was a jab at you, Postgres fans. ;-)

--
May we never see th
Re:Anyone want to clue them in to scheduled jobs? by sydb · 2004-09-21 11:49 · Score: 1

No good reason, it just helps.

Well, there is a good reason, you just don't know what it is!

I agree though, sometimes a reboot will fix problems you don't have the time, knowledge or tools to identify and resolve methodically. It's about business priorities.

I run webmail and stuff from home. I went on holiday a few weeks back - half way through the trip I went to an ISP (in Budapest, I believe they were running Windows 95 on 486s, at least it felt that way) and I couldn't access my server. This has never happened to me in several years. I thought the fan on my athlon must have given up the ghost, or my block must have burned down. I mean, I run Debian and as everyone knows, Debian never lets you down. Call me a troll, but it's true.

Got home, turned out there had been lots of electrical storms while I was away. My ADSL modem's lights were all on - not normal. A reboot (of the modem) fixed that. I'm thinking of putting it on a timer switch now, to restart at 3am every night...

So yeah these people should basically be shot.

--
Yours Sincerely, Michael.
Re:Anyone want to clue them in to scheduled jobs? by Erik+Hollensbe · 2004-09-21 11:56 · Score: 1

Dunno about you,

but if I had the choice of running 40-year-old tested software and hardware (even if it is "voodoo and chicken wire") vs. something some 4-year fresh out of college wrote on hardware they could get at newegg controlling the airways...

You get the idea. Even good Sun and IBM machines fail - they do it better, but they still fail. I'll stick with the tested solution, thanks.
Re:Anyone want to clue them in to scheduled jobs? by sydb · 2004-09-21 12:00 · Score: 1

So do you need someone standing there watching it all the time in case it crashes? That's what you're implying - that unless a system is watched then there's no guarantee it will do what you want it to do. Fair enough, but I would have thought an automated monitoring system could happily replace a baby sitter.

You might say that a reboot introduces a level of risk which, combined with the risk of a monitoring system failure (which should itself be alerted on using failsafe methods) is too high for such a system. In which case, implement a system which doesn't need rebooted. If the system is so important, it's important enough to be stable!

--
Yours Sincerely, Michael.
Re:Anyone want to clue them in to scheduled jobs? by agallagh42 · 2004-09-21 12:02 · Score: 3, Informative

"Since when does Windows 2000 include a "shutdown" command?"

Uh, since about 2000 I believe.:)
C:\>shutdown /?
Usage: shutdown [-i | -l | -s | -r | -a] [-f] [-m \\computername] [-t xx] [-c "comment"] [-d up:xx:yy]

        No args                 Display this message (same as -?)
        -i                      Display GUI interface, must be the first option
        -l                      Log off (cannot be used with -m option)
        -s                      Shutdown the computer
        -r                      Shutdown and restart the computer
        -a                      Abort a system shutdown
        -m \\computername       Remote computer to shutdown/restart/abort
        -t xx                   Set timeout for shutdown to xx seconds
        -c "comment"            Shutdown comment (maximum of 127 characters)
        -f                      Forces running applications to close without warning
        -d [u][p]:xx:yy         The reason code for the shutdown
                                u is the user code
                                p is a planned shutdown code
                                xx is the major reason code (positive integer less than 256)
                                yy is the minor reason code (positive integer less than 65536)

C:\>

--
Carpe Cerevisi - Seize the Beer
Re:Anyone want to clue them in to scheduled jobs? by surprise_audit · 2004-09-21 12:08 · Score: 1

Putting mission critical applications on Windows 95 is just plain stupid.

Not just stupid, it's a violation of the EULA. Or did they finally remove the bit that says something like: "not for use in life-or-death situations"??
Re:Anyone want to clue them in to scheduled jobs? by agallagh42 · 2004-09-21 12:35 · Score: 2, Informative

"Nope. Windows 2000 server:
C:\>shutdown /?
'shutdown' is not recognized as an internal or external command, operable program or batch file."

Well, you have to install the resource kit tools. You wouldn't want everything installed by default would you?

--
Carpe Cerevisi - Seize the Beer
Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · 2004-09-21 12:57 · Score: 1

You might say that a reboot introduces a level of risk which, combined with the risk of a monitoring system failure is too high for such a system. In which case, implement a system which doesn't need rebooted. If the system is so important, it's important enough to be stable!
Well yeah, that's the real crux of the matter now isn't it. Since the manual reboot is only a temporary kludge intended to work around a bug that's (presumably) being fixed, the atypical activity of restarting should be attended by a human. I would assume that the uptime monitoring system was not designed to account for this reboot, but was designed to handle other incidents and as such would require no further attendance.

--
If a job's not worth doing, it's not worth doing right.
Re:Anyone want to clue them in to scheduled jobs? by einhverfr · 2004-09-21 13:02 · Score: 1

iirc there is also a way to use rundll to initiate a shutdown. I don't remember it because I don't do that much with Windows anymore.

--

LedgerSMB: Open source Accounting/ERP
Re:Anyone want to clue them in to scheduled jobs? by einhverfr · 2004-09-21 13:05 · Score: 1

You have never had Linux crash when running XDM have you? The most you get are flashing keyboard lights and a lockup.

The last time this happened to me it was due to a failing CPU. Not Linux's fault, but I would really like it if the oops message would be sent somewhere where I can read it in this case :-(

--

LedgerSMB: Open source Accounting/ERP
Re:Anyone want to clue them in to scheduled jobs? by einhverfr · 2004-09-21 13:09 · Score: 1

What is an elbow parity error? A keyboard error caused by your elbow resting on the space bar?

Also, I think it would be quite possible for a reboot error to cause the BIOS to send out a large number of "CRAP! CRAP!CRAP!" lines before giving the errors. The system can say whatever its programmers want it to say ;-)

--

LedgerSMB: Open source Accounting/ERP
Re:Anyone want to clue them in to scheduled jobs? by j3110 · 2004-09-21 13:17 · Score: 1

I thought the point of this article is that it's tried, tested, and found lacking. :) Who says you can't actually test a system alongside the old until you feel comfortable with it? When the old Voodoo fails, you'll have the new stuff there to take its place. Then once you've actually been on the new stuff for a while, jettison the old POS system out of one of the airplanes at about 10,000 feet, because I doubt it will be missed.

--
Karma Clown
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-21 13:27 · Score: 1

Actually, RS6ks are dying. They are currently used in air traffic control, too. In fact, the UK was planning on buying up ALL the 595s in all of Europe at a significant cost; that's how desperate the situation is.

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by DarkVader · 2004-09-21 13:53 · Score: 2, Interesting

A nightly reboot seems like a sledgehammer approach to me.

I've got a script that pings my upstream router every 10 minutes. If it misses a ping, it waits 30 seconds and tries again. 2 missed pings, and it power cycles my DSL router, using an activehome box and an x10 appliance module.
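Not the poster's actual script, but the retry logic described is easy to sketch in Python. Assumptions: a Unix-style `ping -c 1`, and a `power_cycle` callback standing in for whatever drives the X10 module (that part is hypothetical and left to the caller). The scheduling every 10 minutes would come from cron or similar.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str) -> bool:
    """Return True if one ICMP echo succeeds (assumes a Unix-style `ping -c`)."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_link(
    ping: Callable[[], bool],
    power_cycle: Callable[[], None],
    retry_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Ping once; on a miss, wait and retry; power-cycle only after two misses.

    Returns True if the link answered, False if a power cycle was triggered.
    """
    if ping():
        return True
    sleep(retry_delay)          # give a transient glitch time to clear
    if ping():
        return True
    power_cycle()               # hypothetical hook, e.g. toggling an X10 module
    return False
```

Passing `ping` and `sleep` in as callables keeps the decision logic testable without touching the network; in real use something like `check_link(lambda: ping_once("192.0.2.1"), my_power_cycle)` would do, where the address and `my_power_cycle` are placeholders.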
Re:Anyone want to clue them in to scheduled jobs? by dbottaro · 2004-09-21 14:00 · Score: 1

Eek! Posting Typo! Bad fingers! Thanks for the eagle eyes.

--
Coding my way to the next BSOD!
Re:Anyone want to clue them in to scheduled jobs? by Jameth · 2004-09-21 14:09 · Score: 1

"even if they don't know what to do, they can at least shout "crap!", which is more than a system stuck at the BIOS screen with an "elbow parity error" can say."

Actually, it wouldn't be very hard to have the reboot process send out a useful message which, on its receipt at the appropriate time, would prevent the vocalization of "crap! crap! crap!" by another system. And, as such, a failure to reboot would draw the appropriate crap cries in a similar time-frame as a human rebooter could reasonably issue such fearful announcements.
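What's being described is basically a dead-man's switch: the second system stays quiet only if the "I rebooted fine" message shows up on time. A small Python sketch of that check, with invented names and plain numeric timestamps as assumptions (how the message travels between machines is left out entirely):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RebootWatch:
    """Tracks whether an expected 'reboot finished' message arrived in time."""

    deadline: float                        # time by which the message must arrive
    received_at: Optional[float] = None    # when it actually arrived, if ever

    def heartbeat(self, now: float) -> None:
        """Record the first 'I rebooted fine' message from the monitored system."""
        if self.received_at is None:
            self.received_at = now

    def should_alarm(self, now: float) -> bool:
        """True once the deadline has passed without a timely message."""
        if self.received_at is not None and self.received_at <= self.deadline:
            return False
        return now > self.deadline
```

A message that arrives after the deadline still counts as a failure here, on the theory that a late reboot is exactly the case a human would want to hear about.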
Re:Anyone want to clue them in to scheduled jobs? by tcgroat · 2004-09-21 14:57 · Score: 1

"Air Force One, we understand you're low on fuel, but please hold up on that emergency landing while the system reboots."
Half-way measures are not suitable for life-safety applications, no matter what OS you run. Get to the root cause and remove it, permanently. A cron job or "At" command is a kludgy work-around, not a solution. From the safety point of view, it's still broken no matter how you hide the flaw!
Re:Anyone want to clue them in to scheduled jobs? by morcheeba · 2004-09-21 15:15 · Score: 1

dbottaro isn't using an English version of windows, you insensitive clod! He's using the Jamaican version where the days are labeled "moonday twoesday whensday thurzday RASTADAY!! saunterday sunningday." If it were up to me, I'd like my company to adopt these names.

--
HIV Crosses Species Barrier... into Muppets
Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · 2004-09-21 15:16 · Score: 1

What is an elbow parity error?
That's when you count your elbows and don't come up with an even number.
PS: if you only have one arm, remember to change your elbow parity from EVEN to ODD.

--
If a job's not worth doing, it's not worth doing right.
Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · 2004-09-21 15:24 · Score: 1

Actually, it wouldn't be very hard to have the reboot process send out a useful message which, on its receipt at the appropriate time, would prevent the vocalization of "crap! crap! crap!" by another system. And, as such, a failure to reboot would draw the appropriate crap cries in a similar time-frame as a human rebooter could reasonably issue such fearful announcements.
But implementing such a system would probably require as much work as fixing the apparent millisecond rollover bug. The reboot is only a kludge until the bug is fixed.

--
If a job's not worth doing, it's not worth doing right.
Re:Anyone want to clue them in to scheduled jobs? by trolman · 2004-09-21 16:09 · Score: 1

I install cable modems and will not touch a system with WIN95. WTF is the FAA doing here and why is this not a news headline vs. what some candidate for president did forty years ago?
Re:Anyone want to clue them in to scheduled jobs? by The+Fink · 2004-09-21 17:19 · Score: 1

That almost reminds me of a quote (from a.s.r, repeated many times over on many different sites) related to reannual wine: rebooting in response to the crash you'll have later on. :-)
Apologies, I'm sure, to Terry Pratchett.
Re:Anyone want to clue them in to scheduled jobs? by stoborrobots · 2004-09-21 17:21 · Score: 1

Aviation expects 99.9999% uptime ... we didn't require the customer to reboot.
It's a good damn thing too... Unless your system could reboot in 31.536 seconds...

I get worried when someone mentions five-9s uptime (5 minutes a year downtime), six-9s is just out there... that's less than 32 SECONDS of downtime a YEAR!
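The arithmetic behind those figures, as a quick Python sketch (assuming a 365-day year; the function name is just for illustration):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # 31,536,000 in a non-leap year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for N nines of availability (5 -> 99.999%)."""
    unavailability = 10 ** -nines        # e.g. six nines -> one millionth
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {downtime_budget_seconds(n):.3f} s/year")
```

Five nines comes out to roughly 315 seconds (a little over 5 minutes) and six nines to roughly 31.5 seconds, matching the numbers above.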

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Re:Anyone want to clue them in to scheduled jobs? by 6th+time+lucky · 2004-09-21 19:39 · Score: 1

And what system would *they* get to run on it?
Re:Anyone want to clue them in to scheduled jobs? by sumdumass · 2004-09-21 20:34 · Score: 1

Why won't you touch a Win95 box? I find they are simple to manage. That is, if you can find a NIC that has driver support for them. The 3Com 3c905 series seems to work well. You can find them used for around $10.

There is nothing wrong with the networking abilities of Win95 that I am aware of. Well, nothing with DHCP-assigned IP addresses and static IPs on Ethernet. You might have some problems with token ring, or with the box being generally slower than the modem's capable speed, but this is moot if it is a 100MHz or faster system.

I have a couple of Win95 boxes running cable modem access and a proxy to share the internet. I kinda like using Coyote Linux and an old 486 better, but it works well.
Re:Anyone want to clue them in to scheduled jobs? by cheekyboy · 2004-09-21 20:38 · Score: 1

Didn't someone say recently it's cheaper and better to just ditch the old mainframes and run the code under emulation in a virtual mainframe like VMWare? That way you get modern reliable hardware, 100x faster spec and 50x smaller floor space, but with the same old reliable code.

OT: I think the govt should FORCE Microsoft to release all source code more than 10 years old, or for any OS that is 100% discontinued, like Win95. Then perhaps the OSS community can make a retro Win95-redux running on the Linux kernel. But we know the govt is 100% corrupted up the arse

--
Liberty freedom are no1, not dicks in suits.
Re:Anyone want to clue them in to scheduled jobs? by sumdumass · 2004-09-21 20:56 · Score: 1

I think what he means is that during a reboot you need a person there to monitor it.

I'm sure there are backup procedures for mission critical systems, including taking over the jobs manually or starting a backup system. One of the problems would be not knowing why or what when the system did go down. If they had known the error was just an application crash, or that the service failure was because of a hardware problem in the machine, they could have switched to backup systems or gone manual without much interruption in service. If the process is scripted and there is a problem, it will take some time to determine what went wrong, whether it is because of outside forces (like terrorism), or whether it will bring other systems down and cascade into more problems. I think this is the reason LAX shut down for a while.

I agree that the system should be stable. Rebooting for stability is really scary when we are talking about anything further than the desktop. Someone must have been smoking crack when they OK'ed the situations that led to this. I feel that if an accident happened that caused injury or loss of life, the management as well as the person making the decision leading to this type of situation (for any mission critical application/situation, not just LAX) should be held both criminally and civilly responsible. There is really no excuse for it. If cost is the reason behind it, the companies in charge should have to pay 1,000,000-fold whatever the savings were, as well as damages to each individual involved. If some lies were told during a sales pitch to unknowing PHBs, then the salesmen as well as the vendor should be held just as accountable. I'm surprised there isn't already some regulation or law that would prevent things like this.
Re:Anyone want to clue them in to scheduled jobs? by sydb · 2004-09-22 00:19 · Score: 1

Right, but I can implement a restart of the ADSL modem in 1 minute with a £3.99 timer switch, rather than the considerably greater time and expense of going the X10 route.

--
Yours Sincerely, Michael.
Re:Anyone want to clue them in to scheduled jobs? by lachlan76 · 2004-09-22 00:51 · Score: 1

Or, you could make the script hand over control to another server before the reboot. THAT might work nicely.
Re:Anyone want to clue them in to scheduled jobs? by iammrjvo · 2004-09-22 01:20 · Score: 1

Not to mention vendors "retiring" old software and hardware so that it's no longer supported. Eventually something is going to break - you have to plan for failure.

During a production phase isn't the time that you want to have to put in an unplanned upgrade because something broke which can't be repaired or replaced.

--
Ha, ha! Nobody ever says Italy.
Re:Anyone want to clue them in to scheduled jobs? by strictfoo · 2004-09-22 01:34 · Score: 1

A story about an FAA system that needed to be taken offline and wasn't, and a post discussing this is deemed offtopic?

Seriously now, wtf mods?

--
I've just signed legislation that'll outlaw Russia forever. We'll begin bombing in five minutes.
Re:Anyone want to clue them in to scheduled jobs? by ooby · 2004-09-22 01:58 · Score: 1

It was at an ARTCC, and it was using Voice Switch Communication System. DSR does not handle air-traffic control voice communications.

Also, Intel's processor warranty does not cover mission critical usage, and it explicitly uses air traffic control as an example of such activity.
Re:Anyone want to clue them in to scheduled jobs? by mekkab · 2004-09-22 03:58 · Score: 1

from the article:
but departing planes were held on the runways until a failed radio communications system could be repaired at Los Angeles Center, a remote facility in the desert north of the city.

Oops, you are right. That's ZLA. I'm not too up on VSCS, but I thought there was some part of it that bounces the signal back to the controllers that isn't on site...? maybe?

--
In the future, I would want to not be isolated from my friends in the Space Station.
Re:Anyone want to clue them in to scheduled jobs? by Dun+Malg · 2004-09-22 04:01 · Score: 1

The status of the machine can be monitored by another machine which can shout 'crap' if the first one does not reboot.
"Quis custodiet ipsos custodes" - who watches the watchmen? Adding another machine to watch the first machine for what is already only a temporary kludge is adding too much unnecessary complexity. They just need to send the $10/hr intern down there to restart it every Wednesday morning until the bug is fixed.

--
If a job's not worth doing, it's not worth doing right.
Re:Anyone want to clue them in to scheduled jobs? by TykeClone · 2004-09-22 12:14 · Score: 1

But that's not nearly so cool and you can't brag about it on /.

--
A fine is a tax you pay for doing wrong and a tax is a fine you pay for doing all right.
Re:Anyone want to clue them in to scheduled jobs? by julesh · 2004-09-29 02:25 · Score: 1

Seriously now, wtf mods?

I just came to this story from metamoderation. There are a _lot_ of junk moderations of perfectly valid posts as offtopic or troll at the moment. And one with "fp!" as insightful. I think it's a new form of trolling.
Re:Anyone want to clue them in to scheduled jobs? by julesh · 2004-09-29 02:35 · Score: 1

I'd much rather have an OS in a mission critical environment at least tell me something when it crashes (linux: oops!) than give me a BSOD full of incomprehensible characters.

Windows BSODs contain very similar information to a Linux kernel oops: a brief description of what caused the crash, details of the process that was running at the time, and details of loaded drivers.

The only thing that's missing is the stack trace and disassembly of the executing code presented by Linux. These items are only useful if you have the source code to the module where the crash occurred, which is unlikely with Windows.

I understand there are debugging versions of the kernel available; these might give more information.

Why not automate it? by DevilJeff · 2004-09-21 09:50 · Score: 1

Have they never thought to just schedule an event to reboot the computer every 30 days?

Re:Why not automate it? by Embedded2004 · 2004-09-21 09:54 · Score: 2, Informative

Well, if it is running windows, and somehow someone made a mistake and decided to run it on some mission critical system, they should reghost it as often as they can.

Windows has an odd tendency to corrupt itself.
Re:Why not automate it? by Anonymous Coward · 2004-09-21 09:57 · Score: 1, Insightful

"Have they never thought to just schedule an event to reboot the computer every 30 days?"

Would it not worry you to know that the ATC were relying on a computer that reboots itself so often?
Re:Why not automate it? by bstone · 2004-09-21 10:18 · Score: 2, Interesting

I don't see the logic in a system being so critical to be working 24/7 that they force it to crash if the maintenance is missed. Does anyone else see a problem with this logic?
Re:Why not automate it? by black+mariah · 2004-09-21 10:41 · Score: 1

No, it wouldn't. You schedule the reboot during shift change. No planes are landing, no safety issues. It doesn't matter how often the system is rebooted, what matters is WHEN it is rebooted. Uptime means absolutely nothing to those with actual jobs to do.

--
'Standards' in computing only impress those who are impressed by things like 'standards'.
Re:Why not automate it? by 1u3hr · 2004-09-21 13:11 · Score: 1

You schedule the reboot during shift change. No planes are landing, no safety issues.
Except in an emergency, which by its nature will happen unpredictably, maybe even on a shift change (or the TERRORISTS might plan something nasty then once they know).

And the lesson is... by jcr · 2004-09-21 09:50 · Score: 2, Insightful

Don't use this stuff in mission-critical applications.

-jcr

--
The only title of honor that a tyrant can grant is "Enemy of the State."

Re:And the lesson is... by LostCluster · 2004-09-21 09:56 · Score: 2

"This stuff" being all of IT. HDs will fail within 5-7 years no matter what OS you put on them...

Good IT is so hard to pull off because you have to convince people that events that strike once every few years have to be prepared for otherwise a disruption in service will occur.
Re:And the lesson is... by Dun+Malg · 2004-09-21 10:54 · Score: 2, Insightful

Good IT is so hard to pull off because you have to convince people that events that strike once every few years have to be prepared for otherwise a disruption in service will occur.
Like the PHB at the office where my wife works said after announcing that the IT guy was to be laid off and not replaced: "I don't see why we need an IT guy-- we never have any computer problems" (cluebat time!)

--
If a job's not worth doing, it's not worth doing right.
Re:And the lesson is... by drinkypoo · 2004-09-21 11:04 · Score: 1

"This stuff" being simple single systems. There are basically two ways to build mission-critical systems. You can build proven systems where they simply don't fail in any way you can't recover from, or you can build failover clusters. Planning for failure is always a good idea, and you have to do it one way or another. Finally, always have a backup plan. Not having a plan means that everyone has to think on their feet when things go wrong and some people aren't very good at that. Hopefully not too many of them are pilots but this is the real world where bad things happen to everyone.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:And the lesson is... by mrseigen · 2004-09-21 13:12 · Score: 1

Like the PHB at the office where my wife works said after announcing that the IT guy was to be laid off and not replaced: "I don't see why we need an IT guy-- we never have any computer problems" (cluebat time!)

Are you kidding? That's the top kind of boss. You just wait two weeks until something shits itself, then show up and offer to fix everything again and work as a fulltime IT consultant at twice the rate.

If that doesn't work, you just discuss the situation with his boss.
Re:And the lesson is... by Dun+Malg · 2004-09-21 15:10 · Score: 1

Are you kidding? That's the top kind of boss. You just wait two weeks until something shits itself, then show up and offer to fix everything again and work as a fulltime IT consultant at twice the rate.
Nothing hardly ever shits itself enough to affect him. It just decays gently and makes the lives of the admin staff more difficult. At any rate, they eventually convinced his replacement (and then only after his computer finally had trouble) to hire a student part time to take care of IT needs (inadequate solution, but better than nothing).
If that doesn't work, you just discuss the situation with his boss.
Heh. It's in academia. His boss is the University of California Regents Office. They'd just as soon see his entire department die entirely, but can't kill it for various convoluted political reasons.

--
If a job's not worth doing, it's not worth doing right.

"Upgrade"? by thelenm · 2004-09-21 09:50 · Score: 5, Funny

"Upgrade" from Unix to Windows, eh. You keep using that word. I do not think it means what you think it means.

--
Use Ctrl-C instead of ESC in Vim!

Re:"Upgrade"? by Anonymous Coward · 2004-09-21 10:10 · Score: 1, Funny

"Hello. My name is Inigo Montoya. You killed my server. Prepare to die."
Re:"Upgrade"? by wmopnc · 2004-09-21 10:29 · Score: 1

I love that movie.
Re:"Upgrade"? by upsidedown_duck · 2004-09-21 12:09 · Score: 3, Insightful

It depends on how bad their previous UNIX system was. Any operating system can be neglected into oblivion. Also, if they got all new hardware to run Windows 2000, when the old hardware might have been ten-year-old 50MHz SMP boxes, then upgrade would be the right term. It's unfortunate that they didn't decide to upgrade to faster UNIX boxes, but that's politics for you.

--
-- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
Re:"Upgrade"? by abb3w · 2004-09-21 13:59 · Score: 1

It's inconceivable!
Drat it, the proper canonical format for that joke is "My pid is Inigo Montoya. You kill-9 my parent process. Prepare to vi."

--
//Information does not want to be free; it wants to breed.
Re:"Upgrade"? by Daytona955i · 2004-09-22 01:03 · Score: 1

hehe... yeah, my school did that with the mail server back when I was in school still. Took a perfectly good/working/stable UNIX mail server and made it an exchange server. Before the switch I never had any problems getting mail. After the switch I had tons of issues mostly not being able to get to it at times.

In a related story by lateralus_1024 · 2004-09-21 09:51 · Score: 1

....all in-flight movies are played on Windows Media Player.

--
If you think /. comments are bad, check out Digg.

Re:In a related story by databank · 2004-09-21 10:01 · Score: 2, Interesting

Actually there's a lot of truth to that..I once flew in an airliner overseas which had the tv screens built into the back of the seat in front of me.

In the middle of the movie, the screen did the classic "blue screen of death" and rebooted with the Windows logo. There were quite a few chuckles in the aircraft when the movie was restarted and then the jokes started flying about the plane running on Microsoft Windows....(uh..oh..we're going to crash!..no wait, that's just Microsoft Windows)

Why is the FAA using off the shelf software? by Samir+Gupta · 2004-09-21 09:51 · Score: 4, Informative

This is not an attack on Microsoft.

But most off the shelf software have disclaimers expressly stating they are not to be used in mission critical situations. Eg:

"technology is not fault tolerant and is not designed, manufactured, or intended for use or resale as on-line control equipment in hazardous environments requiring fail-safe performance, such as in the operation of nuclear facilities, aircraft navigation or communication systems, air traffic control, direct life support machines, or weapons systems, in which the failure of Java technology could lead directly to death, personal injury, or severe physical or environmental damage."

--
-- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.

Re:Why is the FAA using off the shelf software? by pyro101 · 2004-09-21 09:59 · Score: 2, Informative

I don't know about using windows 95, but here at the nuclear facility that I work at we use not only Java but also windows. Have been using windows for some time and have to use java because that is the way Oracle is going. We have more problems with hardware issues than with the off the shelf software, but no matter what problems we get from any of it, we as software developers are supposed to anticipate it and prove that we can, within reason, catch the user/machine/other devices before they screw stuff up. But most of all we go through huge testing on any small addition or change to the code base; even changing the color on menus requires 10-20 signatures (never know what else could have been added by accident).
Re:Why is the FAA using off the shelf software? by vsprintf · 2004-09-21 11:00 · Score: 1

But most off the shelf software have disclaimers expressly stating they are not to be used in mission critical situations.

Most of the government's software is contracted out, and many of the low-bid *enterprise solution providers* that provide the software are just integrators of COTS. Even in situations where a gov't agency is paying for custom software, there are still gov't project managers complaining about the cost and how COTS would save money because they read it in some magazine. (Been there, still doing that.) Common sense is not necessarily a requirement for a federal management position.

What?! by ottergoose · 2004-09-21 09:51 · Score: 5, Funny

I thought switching to Windows from *nix saved time, money, and hassle! Haven't you guys seen those banner ads here?

Re:What?! by drew · 2004-09-21 10:41 · Score: 2, Informative

Funniest thing is that was actually the ad i saw when i read one of the linked articles :)

--
If I don't put anything here, will anyone recognize me anymore?
Re:What?! by tool462 · 2004-09-21 10:48 · Score: 2, Funny

Nope. :)
Re:What?! by Almost-Retired · 2004-09-21 13:46 · Score: 1

You're new here I assume...

I Hate to Say It by DarkKnightRadick · 2004-09-21 09:51 · Score: 2, Insightful

But I'm going to.

It's M$'s fault. Why do I hate to say it? Because it'll just be seen as more anti-MS crap from another /.er.

All I have to say is if the shoe fits, wear it.

In this individual case a PHB made a decision to scrap the old, stable OS to a new, known-to-be-unstable OS. That screams PHB.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)

Re:I Hate to Say It by multimed · 2004-09-21 09:58 · Score: 4, Funny

No way is it Microsoft's fault. It even says so in their EULA...
I'm still amused & surprised the poster left off the quotes as in "upgrade" from Unix to Windows.

--
Vote Quimby.
Re:I Hate to Say It by DarkKnightRadick · 2004-09-21 10:00 · Score: 1

Haha. I do agree with some other posters that the PHB's involved as well as the tech should take some of the blame. After all, they chose and installed the software.

That is indeed amusing.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
Re:I Hate to Say It by rjstanford · 2004-09-21 10:37 · Score: 1

But I'm going to.

It's M$'s fault. Why do I hate to say it? Because it'll just be seen as more anti-MS crap from another /.er.

All I have to say is if the shoe fits, wear it.

In this individual case a PHB made a decision to scrap the old, stable OS to a new, known-to-be-unstable OS. That screams PHB

Yeah, it would be. Except that the OS kept running along just fine, and one of the FAA's applications cratered. Which means... er... that it's not "M$"'s fault. It's somebody else's fault and Microsoft is blameless in this matter. But apart from that minor detail, I agree with your post entirely.

--
You're special forces then? That's great! I just love your olympics!
Re:I Hate to Say It by SoSueMe · 2004-09-21 10:48 · Score: 1

The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said. Backup systems failed because of a software failure, according to a report in The New York Times.

And what backup system failed?
Re:I Hate to Say It by silicon+not+in+the+v · 2004-09-21 11:04 · Score: 1

Seriously that decision made no sense. Someone made the remark earlier that they had to scrap the old system because it was held together with spit and baling wire or whatever. That still doesn't make sense because they could very easily scrap the old HARDWARE and replace it with Shiny new Sun Blade XXXX servers or something and still use UNIX systems to run it. Why would you switch to entirely new SOFTWARE on Windows?

--
We may experience some slight turbulence and then...explode. -Capt. Mal Reynolds
Re:I Hate to Say It by AstroDrabb · 2004-09-21 11:21 · Score: 1

It's somebody else's fault and Microsoft is blameless in this matter.
Wow, nothing like being an MS apologist. MS had this same logic flaw in Win95 and they did it again in their WinNT based systems. They stuck the system time in milliseconds in a 32-bit int. It basically gives about 49.7 days. What a coincidence. This app was probably relying on the system timer when MS's logic flaw caused the issues.
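
As a rough check of the 49.7-day figure, here is a small sketch of the arithmetic. The constant names are mine for illustration and are not taken from any Microsoft header; the only assumption is an unsigned 32-bit counter that ticks once per millisecond.

```python
# Sketch: how long an unsigned 32-bit millisecond counter runs before wrapping.
COUNTER_BITS = 32                       # width of the counter (a DWORD)
MS_PER_DAY = 24 * 60 * 60 * 1000        # 86,400,000 milliseconds in a day

distinct_values = 2 ** COUNTER_BITS     # 4,294,967,296 possible tick values
days_until_wrap = distinct_values / MS_PER_DAY

print(f"{days_until_wrap:.2f} days")    # 49.71 days
```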

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:I Hate to Say It by rjstanford · 2004-09-21 11:43 · Score: 1

Wow, nothing like being an MS apologist. MS had this same logic flaw in Win95 and they did it again in their WinNT based systems. They stuck the system time in milliseconds in a 32-bit int. It basically gives about 49.7 days. What a coincidence. This app was probably relying on the system timer when MS's logic flaw caused the issues.

Yeah, except that that call had been deprecated by Microsoft for years. Try GetSystemTimeAsFileTime for a change, which has been available since the Windows 95 days.

Or is it now Microsoft's fault that someone called a method that they had been told not to call for the past decade? Hmm. In your world, hey, maybe it is. From the source...

The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. ... The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

Yeah, evil Microsoft.

--
You're special forces then? That's great! I just love your olympics!
Re:I Hate to Say It by AstroDrabb · 2004-09-21 12:15 · Score: 4, Informative

Funny, nowhere in the doc for GetTickCount() does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don't know what the program needs since I did not write it nor have I seen the code. Maybe they didn't need a high-res timer and wanted a tick count for how long the system has been up? I don't think that is too much to ask from an OS.
The GetSystemTimeAsFileTime() function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn't tell you how long the system has been up.
Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack? Also, right in that link MS says
Microsoft has confirmed that this is a problem in Windows NT 4.0 and Windows NT Server 4.0, Terminal Server Edition. This problem was first corrected in Windows NT 4.0 Service Pack 4.0 and Windows NT Server 4.0, Terminal Server Edition Service Pack 4.

MS also didn't seem to fix it in Win2000 Server and their own engineers got hurt by it, specifically with Rpcss.exe which according to MS
SYMPTOMS
The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.
CAUSE
This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.
If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe? According to this site
rpcss.exe is an executable of Microsoft Windows Operating System. It is responsible for Remote Procedure Call services on the local machine. These are public services available to the local network. This program is important for the stable and secure running of your computer and should not be terminated.

Still not convinced and want to apologize for MS? Well, here is some more of MS's software that is affected by it on Windows 2000 servers (what this FAA project is using).
Print Spooler Stops Scheduling Print Jobs
The Print Spooler service may stop scheduling print jobs to specific Simple Port Monitor (SPM) ports. Although incoming jobs are queuing into the spooler, print jobs may not start. Note that this symptom occurs 49.7 days after you start the Print Spooler service.

There are a bunch of MS apps affected by this logic flaw that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don't see how other developers would not stumble over it as well since they do not have access to the proprietary OS.
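
A minimal sketch of how code built on a wrapping 32-bit tick count can misbehave around day 49.7, and one common way to write the comparison so the wrap is harmless. This only models DWORD arithmetic in Python; the function names are hypothetical and nothing here is taken from rpcss.exe, the print spooler, or the FAA application.

```python
MASK32 = 0xFFFFFFFF  # a DWORD holds an unsigned 32-bit value


def tick_add(tick: int, delta_ms: int) -> int:
    """Advance a tick value the way a 32-bit counter would, wrapping at 2**32."""
    return (tick + delta_ms) & MASK32


def naive_timeout_expired(start: int, now: int, timeout_ms: int) -> bool:
    """Compare against an absolute deadline; wrong once the deadline wraps."""
    deadline = tick_add(start, timeout_ms)
    return now >= deadline


def elapsed_ms(start: int, now: int) -> int:
    """Unsigned 32-bit subtraction; correct across a single wraparound."""
    return (now - start) & MASK32


def safe_timeout_expired(start: int, now: int, timeout_ms: int) -> bool:
    """Compare elapsed time instead of absolute tick values."""
    return elapsed_ms(start, now) >= timeout_ms


start = 0xFFFFFF00                      # 256 ms before the counter wraps
just_after = tick_add(start, 10)        # 10 ms later, counter not yet wrapped
much_later = tick_add(start, 1500)      # 1500 ms later, counter has wrapped

print(naive_timeout_expired(start, just_after, 1000))  # True (fires 990 ms early)
print(safe_timeout_expired(start, just_after, 1000))   # False
print(safe_timeout_expired(start, much_later, 1000))   # True
```

The subtraction form only holds while the interval being measured is shorter than one full wrap, so it suits timeouts and not total uptime.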

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:I Hate to Say It by Christopher_G_Lewis · 2004-09-22 03:59 · Score: 2

From the SDK (bold face by me):
GetTickCount

The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.

DWORD GetTickCount(void);

Parameters
This function has no parameters.
Return Values
The return value is the number of milliseconds that have elapsed since the system was started.

Remarks
The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.
...

So the official SDK tells you *not* to use GetTickCount for uptime, but to use HKEY_PERFORMANCE_DATA.

Just a case of RTFM, for all parties involved, including the Microsofties...

--
www.christopherlewis.com
Re:I Hate to Say It by rjstanford · 2004-09-22 04:00 · Score: 1

Funny, nowhere in the doc for GetTickCount() does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don't know what the program needs since I did not write it nor have I seen the code. Maybe they didn't need a high-res timer and wanted a tick count for how long the system has been up? I don't think that is too much to ask from an OS.

Let me simply quote from that very same webpage. Highlighting mine:

The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.

If GetTickCount is "deprecated" as you state, why in the world are MS's own programmers using it in rpcss.exe?

Because they're lazy? And it's still there? And it's not deprecated in the Java @deprecated sense, just no longer considered to be an acceptable way of doing those kinds of operations? As referenced in the docs?

What would you have them do instead - break the API because some people misuse it?

--
You're special forces then? That's great! I just love your olympics!
Re:I Hate to Say It by rjstanford · 2004-09-22 04:51 · Score: 1

Actually, it's been around since the 3.0 days IIRC. Maybe even before then. And MSFT very rarely removes any API call, since they have an exceptional (no joke here) track record of backwards compatibility. After all, if they pulled a function that caused a major - or even a minor - product to stop working, even if that product hadn't been certified against their new OS version, you just know that they'd take the blame when the users upgraded Windows versions.

--
You're special forces then? That's great! I just love your olympics!
Re:I Hate to Say It by AstroDrabb · 2004-09-22 08:17 · Score: 1

It _still_ doesn't say it is deprecated and it is funny that a bunch of apps from MS used GetTickCount() and got hit by the logic flaw. Also, HKEY_PERFORMANCE_DATA is not supported on Win9x, so GetTickCount() can be used instead for apps that need to run on Win9x and WinNT.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:I Hate to Say It by Christopher_G_Lewis · 2004-09-22 09:16 · Score: 1

It's interesting that the "bunch of apps" turn out to be:

Vtdapi.vxd - Win95 & Win98
SNMP - WinNT 4.0
RPCSS - Win2K
PrintSpool - Win2K

The other KB articles are summary articles with the above, or other Int32 overflow errors.

The VTDAPI.vxd was fixed in 6/1998 and the SNMP error was fixed in SP4 (4/1999).

--
www.christopherlewis.com
Re:I Hate to Say It by DarkKnightRadick · 2004-09-22 14:22 · Score: 1

Not a friggin' clue.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)

A hit for the other team... by LostCluster · 2004-09-21 09:52 · Score: 3, Interesting

When a ball drops on a baseball field at the midpoint between two positions, it's scored a "hit" for the opposition rather than an "error" against either player. Still, a hit for the other side is a bad thing for the entire team.

This mess was big enough that there's a large enough supply of blame to give some to everybody involved.

- No system should require a manual reboot on a regular basis... there should at least be a script capable of accomplishing that. But somehow, one got implemented. Blame whoever bought it.
- Windows shouldn't have had a flaw that required monthly reboots. Blame Microsoft.
- Somebody should have done the reboots like they were told to. Blame that poor schmuck.

Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.

Re:A hit for the other team... by PPGMD · 2004-09-21 09:59 · Score: 5, Insightful

The Patriot missile system had a similar problem. Its timing broke down after a period of time without a reboot (it was a much shorter cycle, either one day or one week).
Microsoft isn't the only one to have issues like that. But it has been patched and there should have been more than enough time for the FAA to test and deploy the patch on the few legacy machines running Windows 95.
I simply blame the FAA for wasting money every year. Billions are sunk into the system, but rarely does anything come out of it. Lockheed can deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.
Re:A hit for the other team... by oGMo · 2004-09-21 10:00 · Score: 2, Insightful

Bottom line is that everybody's at fault because had any one piece in the chain done their job properly the failure wouldn't have happened, but a cascade of mistakes led to the ball hitting the grass instead of a glove.

An error is scored against a player if the player is determined to have been negligent in their position according to the rules. If someone hits a line drive right past the first baseman, it's still a hit. If the first baseman catches it, then drops it instead of making a tag, it's an error.

If multiple players are negligent, then multiple errors are scored. We've all seen "blooper" videos where there are cascading errors; one guy drops a catch, throws it to the next guy who drops it in turn, etc.

This is what happened here; it's not a hit, it's a cascade of errors. Everyone is to blame, because they all did something stupid. That doesn't make it "OK," it doesn't make any particular party less at fault.

I don't think this contradicts what you're saying here, I just wanted to emphasize the point. ;-)

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
Re:A hit for the other team... by LostCluster · 2004-09-21 10:05 · Score: 2, Insightful

If multiple players are negligent, then multiple errors are scored. We've all seen "blooper" videos where there are cascading errors; one guy drops a catch, throws it to the next guy who drops it in turn, etc.

Only one error can be scored per base advanced by the runner, and if the runner took first by a "hit" before the errant throw, then there is only one "error" for his advancement to second. If two players crash into each other and the ball drops, it's usually a hit because it's hard to say either would have been able to make the catch "with normal effort" which is the real standard for an error.
Re:A hit for the other team... by dcw3 · 2004-09-21 22:35 · Score: 1

Lockheed can deploy a complete new system to every airport for the amount of money that is being dumped into the old TRACONs and towers for MX.

Yeah, I'm sure that Lockheed would deliver one just like they did for that Mars Orbiter...try getting your miles & kilometers straight next time. Hey, you pick on my Patriot, I pick on your crap.

--
Just another day in Paradise
Re:A hit for the other team... by Royoken · 2004-09-22 02:36 · Score: 1

I work at Lockheed, and it bothers me to no end when we have to fix old systems that have been in place since the 70's... written in some crippled form of assembly for god's sake.

I would love to see newer systems out there; in fact, we are in the process of replacing the enroute traffic systems across the country right now.

http://www.aviationnow.com/avnow/news/channel_aviationdaily_story.jsp?id=news/pro08134.xml
Re:A hit for the other team... by PPGMD · 2004-09-22 03:46 · Score: 1

...Lockheed would deliver one just like they did for that Mars Orbiter...try getting your miles & kilometers straight next time.
Well first I am not a LM employee. Second the Mars Orbiter didn't kill 28 people.
Re:A hit for the other team... by Sinical · 2004-09-22 06:27 · Score: 1

Patriot stores time in a 24-bit floating point number. Round-off and truncation errors accumulate over time.

It was a limitation of the hardware of the time, probably matched to cost constraints based on requirements from the Army as to the necessary uptime a Patriot battery should support.

You cannot fault Patriot here, since it functioned as designed and the Army was aware (even if the operators of the batteries were not informed) of the Patriot uptime limitations. In missile system design, a lot of decisions get made based on the requirements from the customer: picking a battery based on power consumption and required missile endurance, picking rocket motors based on required shelf lives (15 years or more!), etc.

Contrast this with Windows, which simply overflows a 32-bit counter of milliseconds (49.71 days). There really shouldn't be a limitation like this, even in a desktop system.

Trust me, I know about these things.
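
A rough sketch of how truncation error in a stored tick can add up. The thread gives no figures, so everything numeric here is an assumption: a 0.1 s tick, chopped to 23 fractional bits as in the commonly cited account of the Patriot clock (which describes a 24-bit fixed-point register), and 100 hours of continuous uptime.

```python
import math

# Assumed values, not taken from the thread.
TICK_SECONDS = 0.1      # nominal length of one clock tick
FRACTION_BITS = 23      # fractional bits kept when the tick is stored
UPTIME_HOURS = 100      # continuous running time to examine

# Chop (truncate) the tick length to the available fractional bits.
scale = 2 ** FRACTION_BITS
stored_tick = math.floor(TICK_SECONDS * scale) / scale

# Each tick is counted as slightly shorter than it really is.
error_per_tick = TICK_SECONDS - stored_tick

ticks = UPTIME_HOURS * 3600 * 10          # ten ticks per second
drift_seconds = ticks * error_per_tick

print(f"error per tick: {error_per_tick:.3e} s")
print(f"drift after {UPTIME_HOURS} h: {drift_seconds:.3f} s")
```

The per-tick error is tiny, but it only ever accumulates in one direction, which is why the clock needs a restart to clear it.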

Migration by OxygenPenguin · 2004-09-21 09:52 · Score: 1

Why did they move from Unix to Windows in the first place? And why should a bug from Win95 crash a migrated Win2K?

How sad that such a sprawling metropolis of commerce and travel can be brought to its knees by the magic that is Windows.

Color me surprised.

--
Read the only personal Runyon page out there.

Re:Migration by legirons · 2004-09-21 10:01 · Score: 5, Funny

"Why did they move from Unix to Windows in the first place?"

Maybe they didn't want to have to reboot on January 19, 2038
Re:Migration by cofaboy · 2004-09-21 11:22 · Score: 1

Why they moved is no doubt due to the age of the old system; these can only be upgraded and updated so far. The question I would like answered is how long this system is supposed to last. The old system was close on 40 when it was retired, without regular reboots. The hardware will last 40 years? Doubt that: since every reboot stresses the hardware, it's likely that it will need replacement within 3 - 6 years. HDs will last for years, even decades, as long as you *don't* turn them off. What happens on a reboot? Bus reset, disks spin down then back up. Try it and listen to yours; you may even get to hear the click as they STOP. PS: as an aside, will NT5 still be supported in 2040?

--
In the end, It's all bovine dung you know
Re:Migration by Tatarize · 2004-09-21 13:36 · Score: 1

Well duh, that reboot would be during high traffic hours. 19:14:08 LA time. 19:00 is fairly busy. Would you honestly prefer they reboot in peak hours?

--

It is no longer uncommon to be uncommon.
Re:Migration by Tatarize · 2004-09-21 13:38 · Score: 1

Also, on the 18th pacific time.
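
For anyone who wants to check the 2038 date and the Los Angeles local time, here is a small sketch. It assumes a signed 32-bit count of seconds since 1970-01-01 UTC and treats Los Angeles as a fixed UTC-8 offset (PST, no daylight saving in January); the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1                                  # largest signed 32-bit value
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)      # Unix epoch
PST = timezone(timedelta(hours=-8), "PST")             # assumed fixed offset for LA

# Last second that fits in the counter, in UTC.
last_representable = EPOCH + timedelta(seconds=INT32_MAX)

# One second later the counter would overflow; show that instant in LA time.
rollover_local = (last_representable + timedelta(seconds=1)).astimezone(PST)

print(last_representable.isoformat())   # 2038-01-19T03:14:07+00:00
print(rollover_local.isoformat())       # 2038-01-18T19:14:08-08:00
```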

--

It is no longer uncommon to be uncommon.

Heh by GypC · 2004-09-21 09:52 · Score: 3, Insightful

upgrade from Unix to Windows

AKA, "The PHB Special"

Of course, the guy who was supposed to reboot the box will get all the blame. Shit rolls downhill.

Re:Heh by Nuclear+Elephant · 2004-09-21 09:55 · Score: 4, Funny

It's an upgrade because it helps to create thousands of jobs for full-time system power cycling engineers.
Re:Heh by Michael+Woodhams · 2004-09-21 11:42 · Score: 5, Informative

There is a rather more extreme case of this with the FAA - when first deployed, the cargo doors of the DC-10 were unsafe, with a failure mode that was likely to make the plane uncontrollable in flight.

This occurred in flight, and through luck (which allowed some degree of control) and extraordinary airmanship, the plane was landed safely. (This is known as "The Windsor Incident.")

McDonnell-Douglas didn't want to do a proper redesign of the door mechanism, and the FAA head was a 'companies know best' political appointee, so the result was McD added little windows to the door so that the guy closing the door could look to see it had all engaged properly. (This was over vigorous opposition by the NTSB, who recognized the inadequacy of the fix.)

The situation: A single failure (not looking, or looking but not noticing an unsafe condition) by a non-safety trained close to minimum wage employee could cause the deaths of hundreds of people.

Result: over 300 dead when a Turkish Airlines DC-10 crashed near Paris. The guy who closed the door hadn't even been told he was supposed to check the little windows.

Safety critical systems must be tolerant of human error. If a single omission by a human leads to a hazardous situation, this is primarily the fault of the system, not the human.

--
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
Re:Heh by InfiniteWisdom · 2004-09-21 11:49 · Score: 4, Funny

Surely you mean Microsoft Server Cycling Engineers (MSCE)
Re:Heh by Cecil · 2004-09-21 13:58 · Score: 1

Uh, your analogy sucks.

In the first case, you say you blame the guy who didn't reboot the server. Yet in the second, you say you're not surprised that the lunatic fired the gun. Well, here's some news for you: According to your analogy, they're the same guy.

The person who's really responsible in both cases is the person in the middle, who decided in the first case to use a Windows 95 box for a mission critical server, and in the second case to give a known psychopath a gun.

If rebooting a server is not either pre-planned maintenance or a random unexpected hardware failure, something's fucking wrong. No one has any right to blame the poor techie who is trying to nurse it along for as long as he can when it finally fails.

--
Random and weird software I've written.
Re:Heh by Michael+Woodhams · 2004-09-21 15:38 · Score: 2, Informative

Once (Chicago O'Hare, c1980.) Due to faulty maintenance procedures (now discontinued), lack of locking on slats (now fixed) and engine-out-on-takeoff procedures that sacrificed air speed for altitude.

There are three DC-10 crashes (that I can think of off hand) that could reasonably be blamed at least partially on the design of the plane: we've mentioned two (Paris, Chicago). The third is Sioux City, where an uncontained engine failure in cruise disabled all three hydraulic systems. The plane crash landed with (from memory) about 110 deaths and 180 survivors.

Other planes of similar size and age (Lockheed L1011 Tristar, 747) had four hydraulic systems. Had the DC-10 had four *and* (that is a big 'and') the fourth had not been disabled, it is unlikely there would have been any deaths. (A 747 once had 3 out of 4 hydraulic systems disabled on takeoff, and landed safely.)

In terms of safety, I'd be more worried about any model of airplane less than a few years old than I'd be about a well maintained DC-10. Let other people find the surprises first.

--
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.

If it's in the job description... by DaftShadow · 2004-09-21 09:52 · Score: 1

... I can think of no one else to fault *BUT* the technician. The IT guys know full well that this "quirk" exists, and in fact, part of their planning and maintenance involved resetting the machine in order to get around this potential problem. These guys did not complete their job duties, and as such, the system went down.

How can you intimate blaming the software company here?

- DaftShadow

Re:If it's in the job description... by DarkKnightRadick · 2004-09-21 09:57 · Score: 1

Because the software company provided shitty software for way more than it was actually worth.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
Re:If it's in the job description... by Cyb3r · 2004-09-21 09:59 · Score: 1

Is it normal for a software made by such a company to need to be rebooted monthly?

Come on, let's be serious here...
Re:If it's in the job description... by Chess_the_cat · 2004-09-21 10:08 · Score: 1

So what is the technician getting paid for? I'd love to see his job description: Make sure computers work but don't knock yourself out or anything. Personally, if I were that technician you can bet I'd have made sure to reboot at the scheduled time. It's just that simple. Has this guy ever heard of a Post-It Note?

--
Support the First Amendment. Read at -1
Re:If it's in the job description... by Mateito · 2004-09-21 10:15 · Score: 1

If they found a bug that means monthly reboots, for a mission critical system that stops planes banging into each other (something very likely to cause loss of human life), Why the fuck did they roll it out?

You don't implement this sort of system without months and months of testing with real data in real time.

Somebody cut costs here, and it obviously wasn't the tech.

You design this sort of system _expecting_ that a reboot or two will be missed. Okay.. blame the tech if he didn't follow procedure.. but what if the reboot didn't happen because the tech's wife was in labor or if his kid got hit by a truck? You design systems thinking of the _worst_ case scenario.

You don't run a fucking air traffic control system with a "one truck" vulnerability.

--
Norman Cook's Ode to Sl
Re:If it's in the job description... by Mateito · 2004-09-21 10:18 · Score: 1
Why did he miss it?
- Umm.. I forgot
- Sorry, I got hit by a bus
In the end, it doesn't matter. You can't roll out a system as critical as air traffic control that is not tolerant to one tech not following his job description. The worst case scenario, his death, could lead to the deaths of hundreds of other people.
--
Norman Cook's Ode to Sl
Re:If it's in the job description... by serviscope_minor · 2004-09-21 10:22 · Score: 5, Insightful

How can you intimate blaming the software company here?

You are joking, right? The majority of accidents happen due to human error. This is supposed to be mission critical software (and there's more than just money at stake). Yet, it relies on needless human intervention once a month! This is simply unacceptable for a piece of software in such a position. The main blame lies in the hands of the company that provided it, the person who decided to switch to it and the person who decided to bring the new system online and remove the old one despite this flaw. The technician is almost irrelevant, since this happening was an inevitability. It would have happened sooner or later because the system left room in there for human error to happen.

And yet, you still don't blame a company which ships mission critical software which leaves such a huge hole open for human errors. I hope our nuclear power plants are running on better designed stuff.

--
SJW n. One who posts facts.
Re:If it's in the job description... by pfleming · 2004-09-21 10:32 · Score: 2, Interesting

How many guns have you seen that fire on a monthly basis unless you 'prevented' it?
Re:If it's in the job description... by LBArrettAnderson · 2004-09-21 10:41 · Score: 1

It was not Microsoft's decision for them to use their operating system. In fact, they aren't allowed to use it for critical anything in the first place.
Re:If it's in the job description... by SillyNickName4me · 2004-09-21 10:44 · Score: 1

> Blaming the software is like gun control, guns don't kill people, people kill people.

The software was sold for use in a system that can't afford downtime, especially unexpected downtime.

Whoever sold that software for that purpose as well as the people who bought it are to blame.

It is not like it is unknown that Windows (any version) is one of those systems that is not suitable for that (definitely not the only one).

Having some guy reboot such a system once a month to prevent it from crashing is like using duct tape to keep your car together.. sure, it will work for a while, but it is bound to fail, and as such can be no more than an emergency measure.
Re:If it's in the job description... by SoSueMe · 2004-09-21 11:10 · Score: 1

I cannot comment on the "value" of the software in question.
I can question the quality with another statement from the article:
Backup systems failed because of a software failure,...

If your backup software fails, where are you?
Re:If it's in the job description... by SoSueMe · 2004-09-21 11:12 · Score: 1

To 90%+ of the world, Yes.
Re:If it's in the job description... by SoSueMe · 2004-09-21 11:21 · Score: 1

I would hazard a guess that more people (PHB's) read the ads than those who read the EULA's.
Re:If it's in the job description... by Dun+Malg · 2004-09-21 11:31 · Score: 2, Insightful

You design this sort of system _expecting_ that a reboot or two will be missed. Okay.. blame the tech if he didn't follow procedure.. but what if the reboot didn't happen because the tech's wife was in labor or if his kid got hit by a truck? You design systems thinking of the _worst_ case scenario.
You don't run a fucking air traffic control system with a "one truck" vulnerability.
Exactly. If you find a bug that requires a restart before a 49.7 day timer runs out, you are indeed an idiot if you decide a restart once a month is good enough. At the very least I'd have a tech down there on the 1st and 15th of the month, so they'd have to miss three scheduled restarts to cause this problem. Better yet, have two guys there every damn Wednesday at noon. If they both miss seven Wednesdays in a row, well, you got bigger problems than bad software. Whoever decided once a month was adequate needs to have his head handed to him.
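
The margins above can be checked with a few lines. This is only a sketch under simple assumptions: restarts are evenly spaced, the "1st and 15th" schedule is approximated as every 15 days, and the function name is made up for illustration.

```python
WRAP_DAYS = 2**32 / 86_400_000   # about 49.71 days for a 32-bit millisecond counter


def misses_until_failure(interval_days: float) -> int:
    """Consecutive scheduled restarts that must be skipped before the wrap is hit.

    After k skipped restarts, the next one happens at (k + 1) * interval_days
    of uptime, so count how many can be skipped while that still beats the wrap.
    """
    k = 0
    while (k + 1) * interval_days <= WRAP_DAYS:
        k += 1
    return k


for label, interval in [("monthly", 30), ("1st and 15th", 15), ("weekly", 7)]:
    print(f"{label}: {misses_until_failure(interval)} missed restart(s) to fail")
# monthly: 1 missed restart(s) to fail
# 1st and 15th: 3 missed restart(s) to fail
# weekly: 7 missed restart(s) to fail
```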

--
If a job's not worth doing, it's not worth doing right.
Re:If it's in the job description... by AstroDrabb · 2004-09-21 11:34 · Score: 1

Sure you could blame the poor tech dude. However, humans make mistakes far more often than well-written software. They should have known that it _would_ happen. The real meat of the problem is that they had to come up with this kludge to remember to reboot at least every 30 days because of a logic error in MS software that MS _repeated_ from Win95. MS stored the system time in milliseconds in a 32-bit int. That gives about 49.7 days until it loops. What a coincidence. The FAA is to blame because their PHBs were suckered into using MS for a _very_ critical system. Just imagine what would have happened if planes had crashed into each other. 800+ planes delayed, thousands of unhappy customers and what does MS say? "We cannot comment at this time".
It is MS's software's fault. It is the FAA's fault for getting suckered into using MS software for such a critical system (I have no problem if a company wants to use MS software for non-critical corporate tasks like desktops etc). And least of all is it the poor tech guy's fault for not rebooting.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:If it's in the job description... by soliptic · 2004-09-21 11:39 · Score: 1

Point needs reinforcing.
How many bridges collapse on a monthly basis unless you 'prevent' it?
How many bank vaults open up on a monthly basis unless you 'prevent' it?
How many supertankers sink on a monthly basis unless you 'prevent' it?
How many cars spontaneously explode on a monthly basis unless you 'prevent' it?
Re:If it's in the job description... by vsprintf · 2004-09-21 11:42 · Score: 1

Personally, if I were that technician you can bet I'd have made sure to reboot at the scheduled time. It's just that simple. Has this guy ever heard of a Post-It Note?

Perhaps the machine had never made it to 30 days uptime before, so the task became a non-issue? I've had a Post-it that says "Timesheet!" on my monitor for two years, and it doesn't do any good. You stop noticing those things after a few days.
Re:If it's in the job description... by Jedi+Alec · 2004-09-21 11:51 · Score: 1

Just imagine what would have happened if planes had crashed into each other. 800+ planes delayed, thousands of unhappy customers

Not to mention a few hundred *dead* ones...

--

People replying to my sig annoy me. That's why I change it all the time.
Re:If it's in the job description... by DarkKnightRadick · 2004-09-22 14:19 · Score: 1

Probably wanking off to porn. :p

Seriously though, the systems should have been more closely monitored.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)

Heather Locklear by Billy+Donahue · 2004-09-21 09:53 · Score: 4, Funny

To the rescue!
http://www.nbc.com/LAX/

--
-- The Funk, The Whole Funk, And Nothing But The Funk

Re:Heather Locklear by DavidBrown · 2004-09-21 10:46 · Score: 1

Mod parent up as insightful - c'mon, the way Hollywood works, you have got to know that there are four or five hacks out there working on spec scripts for an episode of LAX based on this, and the executive producer probably thought, as his plane was routed to another airport "Hmmm, ripped from the headlines..."

--
144l. ph34r my 133t l3g4l 5k1lz!
Re:Heather Locklear by pyrrhonist · 2004-09-21 13:03 · Score: 1

there are four or five hacks out there working on spec scripts for an episode of LAX based on this, and the executive producer probably thought, as his plane was routed to another airport "Hmmm, ripped from the headlines..."
In this episode, an angry mob of late passengers kills the technician who was supposed to reset the computer. The episode will be used to launch CSI: Los Angeles.

--
Show me on the doll where his noodly appendage touched you.

Ouch... by hypermike · 2004-09-21 09:53 · Score: 1

The newspaper said that a Microsoft-based replacement for an older Unix system needed to be reset every thirty days 'to prevent data overload', as a result of problems found when the system was first rolled out. However, a technician failed to perform the reset at the right time and an internal clock within the system subsequently shut it down. A back-up system also failed

Guess there was a backup, I feel for that guy.

--

Re:Ouch... by surprise_audit · 2004-09-21 12:14 · Score: 1

Any bets on when the backup was last booted?? Just after the primary, perhaps??

Upgrade from UNIX to Windows.. by Anonymous Coward · 2004-09-21 09:53 · Score: 4, Funny

"This happened after an upgrade from Unix to Windows."

That's the funniest thing I heard all day. Windows is an upgrade from Unix. I almost choked on my coffee.

Re:Upgrade from UNIX to Windows.. by Mateito · 2004-09-21 10:11 · Score: 4, Funny

I almost choked on my coffee.
Try preparing the coffee with some sort of liquid. I recommend water.
You don't get the instant caffeine high like you do with chewing the beans*, but it does go down easier**
*Yes, I do this. Chocolate coated coffee beans rock
**Unless it's Starbucks, which needs a shot of snotberry flavoring to make it tolerable.

--
Norman Cook's Ode to Sl

Which begs the question... by mind21_98 · 2004-09-21 09:53 · Score: 1, Redundant

...why did they switch to Windows in the first place?

--
US businesses that currently accept chip and PIN/signature

humans rule by Doc+Ruby · 2004-09-21 09:53 · Score: 3, Insightful

It is human error: those bugs didn't write themselves. Nor did the operations protocol that required "rebooting LAX" every 49.69(!) days. Nor did the upgrade procedure that ignored that bottleneck. Nor did the upgrade decision that moved from Unix to Windows. Those were all human errors, as was the decision to keep a job at LAX that would face blame for shutting down the airport (or risking lives) if the reboot was missed, or unsuccessful.

"Not I," says the referee,
"Don't point your finger at me.
I could've stopped it in the eighth
An' maybe kept him from his fate,
But the crowd would've booed, I'm sure,
At not gettin' their money's worth.
It's too bad he had to go,
But there was a pressure on me too, you know.
It wasn't me that made him fall.
No, you can't blame me at all."
- Bob Dylan, "Who Killed Davey Moore?"

--

--
make install -not war

Re:humans rule by pestario · 2004-09-21 15:25 · Score: 1

those bugs didn't write themselves.

Actually, more often than not, a bug appears when a certain scenario is overlooked. When discovered, a patch is written to fix the bug.

--
:n
Re:humans rule by Doc+Ruby · 2004-09-21 15:34 · Score: 1

Those mismatch bugs are the product of two human acts: writing the original code, and writing the subsequent code that produces the bug. It's like how a chord is produced by sounding two separate notes: sounding both notes makes a single chord, and writing both sets of code produces a single bug. The patch written to fix the bug would have been written as part of the second code set, if not for human error at that time. Overlooking the scenario is the human error, put into action by writing the code. Sorry, I don't have a music analogy for the patch :).

--
--
make install -not war

integration flaw exposed: by overbom · 2004-09-21 09:54 · Score: 3, Funny

sleep 4294080
shutdown /s

Re:integration flaw exposed: by LostCluster · 2004-09-21 10:01 · Score: 1

Oh... there it is. Unit conversion flaw. They gave the value for seconds into a value for minutes... and ended up booting once every 10 years because of the factor of 60 mistake.

Ahh yes... by WD_40 · 2004-09-21 09:54 · Score: 1, Flamebait

I remember when the 49.7 day bug was discovered. That was right after I had just hit the 49.7 day freeze in an attempt to keep my personal machine alive as long as possible.

When it froze, I didn't know why until I read the story, just figured it finally gave up the ghost for no real reason. It was time for a reboot anyway, that system was hurtin' bad.

Why the hell they have a critical system running on an OS that can't stay up for at least 50 days, I do not know.

--

"With sufficient thrust, pigs fly just fine." -- RFC 1925

Re:Ahh yes... by Qeyser · 2004-09-21 10:02 · Score: 2, Insightful

Moreover: why do you have a critical system that hasn't been patched in over five years?

Check the date on that news.com article linked in the main story -- it's from March of 1999. The bug is that old, and as I recall the fix didn't take that long to get out.

If LAX was trying to upgrade to/integrate win2k with ancient, unpatched Win95 systems, it's no wonder that they're having problems . . .

-Q
Re:Ahh yes... by 0x0d0a · 2004-09-21 11:35 · Score: 1

I'm not real sure that I want some MCSE patching the system that tracks the plane that I'm in. Sure, you don't *think* that there will be any side effects...

--
May we never see th

Mandate Open Source for Government work by nightsweat · 2004-09-21 09:54 · Score: 1

There's no conceivable reason not to. How do you justify your money going to a company that keeps the source to itself?

You paid for it with your taxes - you own it. Demand open source at ALL government levels.

--

the major advances in civilization are processes which all but wreck the societies in which they occur - A.N. White

But yer Honour! by Skiron · 2004-09-21 09:54 · Score: 1

MS lawyer: "It all worked in the flight2000 simulator? We always rebooted after every crash and everytime it was OK afterwards?"

No by temojen · 2004-09-21 09:54 · Score: 1

It would suck if all the radios shut down in the middle of an emergency landing. Better to have it manual.

Simple Politics by Cobblepop · 2004-09-21 09:55 · Score: 1

Of course the technician was blamed - if not, some CIO-type in charge would have had to take it, and he wouldn't allow that to happen. It always runs downhill...

Why 49.7 days? by FirstTimeCaller · 2004-09-21 09:56 · Score: 4, Informative

Because there are 4294080000 milliseconds in that time period. Just enough to cause a roll-over when using a 32 bit counter (and yes, 49.7 is an approximate value).

Very few Win95 systems ever made it that long without a reboot... but you would've thought that it would've been fixed by Windows 2000.

--
Wanted: witty unique signature. Must be willing to relocate.

Re:Why 49.7 days? by Holi · 2004-09-21 10:02 · Score: 4, Informative

This issue has nothing to do with the Win95 bug. It was just the submitter's opinion (which happens to be very wrong).

--
Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
Re:Why 49.7 days? by PhrostyMcByte · 2004-09-21 10:15 · Score: 5, Insightful

It sounds to me like an application they were running was badly designed to use GetTickCount() as a long-term counter. If so, it's not Win2k's fault.
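
If an application did lean on the 32-bit tick count as a long-term clock, one common workaround is to widen it in software by counting rollovers. The following is a minimal sketch in portable C, not anything taken from the FAA system: `tick64_state` and `tick64_update` are hypothetical names invented for illustration, and the raw readings are passed in as plain integers rather than read from `GetTickCount()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state; not part of any Windows API. */
typedef struct {
    uint32_t last_tick;  /* most recent raw 32-bit reading */
    uint64_t wraps;      /* number of rollovers seen so far */
} tick64_state;

/* Feed each raw 32-bit millisecond reading in, in order.
 * Returns a 64-bit millisecond count that keeps growing past 2^32.
 * Assumption: readings arrive at least once per 2^32 ms (~49.7 days),
 * otherwise a rollover can pass unnoticed. */
uint64_t tick64_update(tick64_state *s, uint32_t raw_tick)
{
    assert(s != NULL);
    if (raw_tick < s->last_tick) {
        s->wraps++;  /* the raw counter went backwards, so it rolled over */
    }
    s->last_tick = raw_tick;
    return (s->wraps << 32) | raw_tick;
}
```

The catch is the polling requirement: the helper only sees a rollover if it is called between wraps, so it trades one hidden deadline for a much looser one.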
Re:Why 49.7 days? by caluml · 2004-09-21 10:26 · Score: 2, Insightful

I think they solved it by Windows 98 - however, maybe there is an old app running on said Windows 2000 server that uses 32 bit milliseconds. Come on guys - we're going to get nowhere by harping on about issues that were fixed years ago. If we stand still, and laugh, Windows is going to sneak up, and run past.

--
Get your own free personal location tracker
Re:Why 49.7 days? by AK+Marc · 2004-09-21 10:33 · Score: 5, Informative

and yes, 49.7 is an approximate value

The exact value is 49 and 59,929/84,375 days, or 49 days, 17 hours, 2 minutes, and 47.296 seconds (exact).
Hey, news for nerds, what did you expect...
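
For anyone who wants to check that figure, here is a small sketch that splits 2^32 milliseconds into days, hours, minutes and seconds. The names `WRAP_TICKS_MS`, `duration_parts` and `split_ms` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit counter has 2^32 distinct values, so a millisecond
 * counter returns to zero after this many ticks. */
#define WRAP_TICKS_MS (UINT64_C(1) << 32)

typedef struct {
    uint64_t days;
    unsigned hours, minutes, seconds, millis;
} duration_parts;

/* Split a millisecond count into days/hours/minutes/seconds/ms. */
duration_parts split_ms(uint64_t total_ms)
{
    duration_parts p;
    p.millis  = (unsigned)(total_ms % 1000); total_ms /= 1000;
    p.seconds = (unsigned)(total_ms % 60);   total_ms /= 60;
    p.minutes = (unsigned)(total_ms % 60);   total_ms /= 60;
    p.hours   = (unsigned)(total_ms % 24);   total_ms /= 24;
    p.days    = total_ms;
    return p;
}

/* Usage: the wrap period comes out as 49 d 17 h 2 min 47.296 s. */
void check_wrap_period(void)
{
    duration_parts p = split_ms(WRAP_TICKS_MS);
    assert(p.days == 49 && p.hours == 17 && p.minutes == 2);
    assert(p.seconds == 47 && p.millis == 296);
}
```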

--
Learn to love Alaska
Re:Why 49.7 days? by antiMStroll · 2004-09-21 10:50 · Score: 1

"Very few Win95 systems ever made it that long without a reboot..."
Damn, of all the things to contradict. We ran Win95 on machines running proprietary software which recorded audio in Real format, 24/7. The systems ran trouble-free season after season. After the 49.7 day patch of course.
Re:Why 49.7 days? by slittle · 2004-09-21 11:02 · Score: 1

Hey, news for nerds, what did you expect...
A beowulf cluster of petrified actresses covered with hot grits in Soviet Russia?

(WTF did you expect?)

--
Opportunity knocks. Karma hunts you down.
Re:Why 49.7 days? by rex+vonireful · 2004-09-21 11:32 · Score: 1

Very few Win95 systems ever made it that long without a reboot...

Apparently Windows 2000 has trouble with this too.

I never understood why server operating systems can achieve uptimes measured in months or years with the notable exception of one. Why is it so hard for Microsoft to do it? Others, like Unix, Linux, BSD, NetWare and OSX, can achieve this. Why not MS?
Re:Why 49.7 days? by siriuskase · 2004-09-21 11:41 · Score: 1

But you weren't debugging complicated aviation software. Apparently they figured they had all the bugs out after over a month of uptime.

--
If you must moderate, please moderate as irrelevent, not something bad, because I'm sure someone will find this interest

Before the torrent of "windows sucks" posts... by rasafras · 2004-09-21 09:56 · Score: 3, Insightful

...keep in mind that we have established numerous times that windows is not suitable for systems that need reliability and stability. It is not the operating system's fault that this happened, it is the FAA's for choosing to use it instead of considering the better alternatives. If you get run over on a bicycle while riding on the highway, don't blame the bike.
Quick addition: it seems that the fault does not belong entirely to windows, but rather a combination of the software running on it and the system architecture.

With that said, Windows could stand to improve a lot. It has too many bugs, too many flaws, and so on. And it definitely does not have a stable, secure, reliable base. So don't expect it to.

--
webpage

Re:Before the torrent of "windows sucks" posts... by EvilGrin666 · 2004-09-21 11:24 · Score: 1

I've got bittorrent loaded up, where can I download 'windows sucks'?
Re:Before the torrent of "windows sucks" posts... by hkb · 2004-09-21 12:42 · Score: 1

Sure NDS is (was?) superior to AD. But we're not talking about NDS, we're talking about AD and OpenLDAP and you're just a blind zealot if you think OpenLDAP scales better.

Let's see, let's try and think of some massive, global-scale OpenLDAP directories... Hmm can't think of any. If you can, let me know, and let me know how many objects they're storing, too.

Let's see, let's try and think of some massive, global-scale Active Directory directories. Hmm, Microsoft, Compaq/HP, IBM, Dell, oh well, you get the idea. Need more examples?

And AD "supports SSL" just fine. No, it's not OpenSSL, but it's just your plain old standard SSL.

AD's SSL and LDAP? PHP works with it, Perl works with it, C works with it. Right out of the box. I ought to know, I've coded tons of shit for it because I can't stand the garbage that is VBScript.

Works... just... fine...

Consider yourself schooled.

--
/* Moderating all non-anonymous trolls up since 2004 */
Re:Before the torrent of "windows sucks" posts... by hkb · 2004-09-21 12:47 · Score: 1

Because Active Directory is the leading directory service, right now. Nearly all the big players use it. Despite having a fantastic product, Novell's gone downhill big time. Here's to hoping they climb back up and kick ass again, because NDS was wonderful.

But yeah, no argument here.

--
/* Moderating all non-anonymous trolls up since 2004 */

They said Windows 98 or Better by www.sorehands.com · 2004-09-21 09:56 · Score: 4, Funny

So I installed Linux.

--
Fight Spammers!

Now even the submitters aren't reading the article by Holi · 2004-09-21 09:57 · Score: 2, Insightful

From the submission
possibility related to an old Windows 95 bug

From the Article.
The shutdown is intended to keep the system from becoming overloaded with data and potentially giving controllers wrong information about flights, according to a software analyst cited by the LA Times.

The shutdown is not a crash but a scheduled event to bring the servers down to flush data.
So it does not seem to be a problem with Windows (Ok now I get marked as troll) but with the FAA's own software.

--
Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.

32 bit timer by charnov · 2004-09-21 09:57 · Score: 5, Interesting

This old error was from the use of a 32 bit 1 ms increment timer (comes out to 49.7 days until rollover). AFAIK, this was fixed in Win2k and above when the timer got bumped to 64 bit. Maybe whoever set up LAX was using some ancient legacy middleware that used the old timer. This is just bizarre. In both locations that I have worked the last three years, none of the Win2k or Win2k3 servers went down ever. Sounds like bad consultants.

--
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.

Re:32 bit timer by Draknor · 2004-09-21 10:06 · Score: 4, Informative

Parent is right - it's not a bug in Windows itself, but rather a piece of software running on Windows - from (one of the) FAs:

Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

(emphasis added)
Re:32 bit timer by djwolf · 2004-09-21 10:34 · Score: 3, Informative

The timer has not been widened to 64 bits. It hasn't been changed, for the sake of API compatibility. Microsoft does give you some warning though:

GetTickCount

The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.

DWORD GetTickCount(void);

Parameters
This function has no parameters.
Return Values
The return value is the number of milliseconds that have elapsed since the system was started.

Remarks
The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.

Example Code
The following example demonstrates how to use this function to wait for a time interval to pass. Due to the nature of unsigned arithmetic, this code works correctly if the return value wraps one time. If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead.

DWORD dwStart = GetTickCount();
// Stop if this has taken too long
if( GetTickCount() - dwStart >= TIMELIMIT )
    Cancel();
Note that TIMELIMIT is defined as the time interval of interest to the application, in milliseconds.

Requirements
Client: Requires Windows XP, Windows 2000 Professional, Windows NT Workstation, Windows Me, Windows 98, or Windows 95.
Server: Requires Windows Server 2003, Windows 2000 Server, or Windows NT Server.
Header: Declared in Winbase.h; include Windows.h.
Library: Use Kernel32.lib.
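
The "nature of unsigned arithmetic" remark in the quoted documentation can be illustrated without Windows at all. The sketch below uses plain `uint32_t` values standing in for tick readings. The function names are hypothetical, and `timed_out_naive` is included only to show the pattern that breaks near a rollover.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ms between two readings of a 32-bit tick counter.
 * The result is taken modulo 2^32, so a single rollover between
 * the two readings cancels out. Only valid while the true
 * interval is shorter than 2^32 ms. */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Returns 1 once at least limit_ms have passed since start. */
int timed_out(uint32_t start, uint32_t now, uint32_t limit_ms)
{
    return elapsed_ms(start, now) >= limit_ms;
}

/* Fragile alternative: precompute an absolute deadline and compare
 * raw counter values. The deadline itself can wrap to a small number. */
int timed_out_naive(uint32_t start, uint32_t now, uint32_t limit_ms)
{
    uint32_t deadline = (uint32_t)(start + limit_ms);
    return now >= deadline;
}

/* Usage: start 256 ms before the rollover, with a 500 ms limit. */
void check_timeouts(void)
{
    uint32_t start = UINT32_C(0xFFFFFF00);
    /* 128 ms later, still before the rollover: not timed out yet */
    assert(!timed_out(start, UINT32_C(0xFFFFFF80), 500));
    /* the naive version fires early because its deadline wrapped to 244 */
    assert(timed_out_naive(start, UINT32_C(0xFFFFFF80), 500));
    /* 512 ms later, after the rollover: timed out */
    assert(timed_out(start, UINT32_C(0x00000100), 500));
}
```

Comparing an elapsed difference rather than an absolute deadline is the design choice that makes the documented example survive one wrap.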

--
---- I like compilers
Re:32 bit timer by Weird_one · 2004-09-21 11:03 · Score: 1

Actually, I believe the problem is that the 32bit timer was used as a date stamp for the database listing the current position of the planes or which planes were expected.

when the value rolled over, while the system still had room for more entries, the data was improperly ordered with current flights mixed with those from 49.7 days ago..

just a guess, but it seems like an error I had to fix with a y2k-type problem.
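
To make that guess concrete (and it is only a guess about the FAA software), here is a sketch of how raw 32-bit stamps sort wrongly across a rollover, next to a wrap-aware comparison in the style of serial number arithmetic. Both function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Naive ordering: compares the raw 32-bit stamps directly. */
int stamp_before_naive(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Wrap-aware ordering: a is before b if stepping forward from a
 * reaches b in fewer than 2^31 ticks (modulo 2^32).
 * Assumption: stamps being compared are never more than
 * 2^31 ms (~24.8 days) apart. */
int stamp_before(uint32_t a, uint32_t b)
{
    uint32_t forward = (uint32_t)(b - a);
    return forward != 0 && forward < UINT32_C(0x80000000);
}

/* Usage: one stamp just before the rollover, one just after. */
void check_ordering(void)
{
    uint32_t old_stamp = UINT32_C(0xFFFFFFF0);
    uint32_t new_stamp = UINT32_C(0x00000010);
    assert(!stamp_before_naive(old_stamp, new_stamp)); /* sorts wrongly */
    assert(stamp_before(old_stamp, new_stamp));        /* sorts correctly */
    assert(!stamp_before(new_stamp, old_stamp));
}
```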

--
"Secrecy is the keystone of all tyranny. Not force, but secrecy ... [sic] censorship.
Re:32 bit timer by Foolhardy · 2004-09-21 11:05 · Score: 1

The NT kernel has always kept track of time with 64-bit number of 100ns periods. Absolute time is tracked as a 64-bit number of 100ns periods since January 1, 1601 (UTC). See the syscall NtQuerySystemTime, or the equivalent win32 function GetSystemTimeAsFileTime. (available since at least win95)

The FAA program that crashed probably used the grossly obsolete function GetTickCount as djwolf already posted.
Re:32 bit timer by mindriot · 2004-09-21 13:14 · Score: 1

when the timer got bumped to 64 bit

My god, are you saying that timer is gonna roll over now in only 584 554 531 years? Stupid short-sighted programmers...
Re:32 bit timer by lachlan76 · 2004-09-22 01:09 · Score: 1

How about I fix it now?
#define MAX_TIME 0xFFFFFFFF
unsigned long int start = GetTickCount();
DoStuff();
unsigned long int finish = GetTickCount();
unsigned long int elapsed;
if (start <= finish) {
    elapsed = finish - start;
} else {
    /* wrapped once: ticks up to MAX_TIME, one tick to roll over to 0, then up to finish */
    elapsed = (MAX_TIME - start) + 1 + finish;
}

There, done. Wasn't that hard, was it?

...eh-heh-heh. by rincebrain · 2004-09-21 09:58 · Score: 1

Silly IT departments.

If you "upgrade" a piece of software, then discover it requires a complete manual system restart to remain stable, the prudent thing to do in any other circumstance would be a rollback.

Unfortunately, since this is an IT department, it must run Windows; after all, where could you ever find support for Linux?

--
It's only an insult if it's not true.

Re:...eh-heh-heh. by rincebrain · 2004-09-21 10:39 · Score: 1

My apologies for being a dolt.

I shall attempt to RTFA more carefully in the future.

--
It's only an insult if it's not true.

Check out this little pile of bullshit by Trailer+Trash · 2004-09-21 09:58 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999

Okay, bullshit. If I have to reboot a server every month, .0000001 of a month is- oh, let's be generous and only count months with 31 days- about .26 seconds. That's a damned fast boot time for Win2K.

Maybe they left off a percent sign?

--
Do you have ESP?

Re:Check out this little pile of bullshit by k4_pacific · 2004-09-21 10:02 · Score: 2, Insightful

"Maybe they left off a percent sign?"

Or maybe there's some kind of failover to a backup system (Which they also forgot to reboot)?

--
Unknown host pong.
Re:Check out this little pile of bullshit by larien · 2004-09-21 10:23 · Score: 4, Informative

Welcome to planned vs unplanned downtime; in many cases, a 10 hour outage can still give you a 100% availability if you planned that outage. What they're probably quoting is 0.0000001 unplanned downtime.
Lies, damned lies and availability stats...
Re:Check out this little pile of bullshit by Phosphor3k · 2004-09-21 10:24 · Score: 1

They didn't actually have to reboot the OS. It was a legacy piece of software running on the OS that needed to be restarted.
Re:Check out this little pile of bullshit by Anonymous Coward · 2004-09-21 10:38 · Score: 1, Funny

I'd hate to sit next to you during Price is Right.
Re:Check out this little pile of bullshit by drinkypoo · 2004-09-21 10:49 · Score: 1

That implies that they're expecting windows to be down for scheduled maintenance about one month out of the year... which is right about in line with my experience :)

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Check out this little pile of bullshit by archen · 2004-09-21 10:56 · Score: 1

On some machines I work with I can take FreeBSD completely down and up (not just up) in about 23 seconds, and that's with 2 seconds for you to escape in the boot loader. But that of course depends on your hardware. Running the same system on a Dell server could take like 3 or more minutes simply because of all the initialization crap for hardware. In any case the boot time required would be 0.26 not 26. Allowing for a reboot a year to obtain seven nines would be about a 3 second boot cycle.
Re:Check out this little pile of bullshit by Anonymous Coward · 2004-09-21 11:04 · Score: 1, Informative

I'm pretty sure that the stat is
still blown wide open. With an
allowable downtime of about 3 seconds
per year, the recent ~12600 second
outage means they are probably not
at the promised .9999999 uptime,
unless the system was actually
brought up four thousand years ago,
and this was its first unplanned
outage.
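
The availability arithmetic in this subthread is easy to reproduce. The sketch below is illustrative only: the function names are invented, a year is taken as 365 days, and the ~12600 second outage figure is simply the one quoted above.

```c
#include <assert.h>

/* Downtime budget in seconds for a given availability (0..1)
 * over a period of period_s seconds. */
double downtime_budget_s(double availability, double period_s)
{
    assert(availability >= 0.0 && availability <= 1.0);
    return (1.0 - availability) * period_s;
}

/* Number of otherwise outage-free periods needed for a single
 * outage of outage_s seconds to stay within the budget.
 * Not meaningful for availability == 1.0 (zero budget). */
double periods_to_absorb(double availability, double period_s,
                         double outage_s)
{
    return outage_s / downtime_budget_s(availability, period_s);
}
```

With availability 0.9999999, a 31-day month (2678400 s) allows roughly 0.27 s of downtime and a 365-day year (31536000 s) roughly 3.2 s, so a 12600 s outage would need on the order of four thousand clean years to average out.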
Re:Check out this little pile of bullshit by Shadowlore · 2004-09-22 19:23 · Score: 1

Maybe they left off a percent sign?

Here, is this better?:

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999%

Come on, it was obvious where the % went. ;)

--
My Suburban burns less gasoline than your Prius.

Simplest Fix (And real concern) by FalconZero · 2004-09-21 09:59 · Score: 1

Surely the simplest 'fudge' to fix this problem is
to write a script that beeps loudly every 10 mins
or some other (read: more sensible) notification
after the system uptime exceeds 30 or so days?

But seriously, if it's running Windows it's not the
monthly reboots they need to worry about, it's the
quarterly format/reinstall procedure that's required
for stable operation.

I don't think I've had a stable (home) Windows install for
more than 6 months without reinstall, but maybe I'm
pushing my luck by actually USING the computer.

--
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19

I could fix that problem by randomErr · 2004-09-21 09:59 · Score: 1

I wrote a VB program years ago for the Win95 to solve this problem. I just had the scheduler run my program that rebooted the system for me.

Umm.... Duh

--
You say things that offend me and I can deal with it. Can you?

We used to joke by multiplexo · 2004-09-21 09:59 · Score: 3, Interesting

that no one would ever run into the 49.7 day bug on a Windows system because the chances of having that much uptime were slim to none. Having a system where you know that things are broken and you have to reboot it every 30 days to keep it from breaking down is a bad thing, deploying such a system into a production environment is even worse (but it's been done, I don't know how many times I wrote cron jobs to kill bad pieces of software and restart them) but deploying such a system in an environment where lives are at stake is completely inexcusable, regardless of whether or not it is closed or open source. This is similar to having a circuit in your house that overheats because occasionally too much load is placed on it. The idiot solution is to reset the breaker when it trips, the correct solution is to put in a bigger circuit that can handle the peak load. This vendor provided the idiot solution to this problem and should be punished for it, this never should have been deployed, I can only hope that they won't blame the technician for failing to do something that he wouldn't have had to do if the system had been designed properly.

I also love the statement that the system was upgraded from UNIX to Windows. Isn't this kind of like upgrading from being in very good health but not being good looking to being somewhat good looking but suffering from cancer, AIDS and heart disease?

--
cheap labor conservatives - they want to keep you hungry enough to be thankful for minimum wage.

Re:We used to joke by Anonymous Coward · 2004-09-21 10:04 · Score: 1, Funny

UNIX -> Windows is like Clinton -> Bush.

Sorry, but someone had to say it.

I want to know who dunnit by ElForesto · 2004-09-21 09:59 · Score: 1

I was sitting in Atlanta-Hartsfield for an extra 70 minutes thanks to that bastard.

--
There is a difference between "insightful" and "inciteful" other than spelling.

49.7 days by k4_pacific · 2004-09-21 09:59 · Score: 5, Funny

I remember back when that bug was announced. Seems it was at least a couple of years after Windows 95 had been out. I guess they had to work through a lot of other bugs to get Windows 95 to make it long enough for this bug to occur.

--
Unknown host pong.

Re:49.7 days by ArchieBunker · 2004-09-21 10:02 · Score: 1

The story is a troll, all the win95 code died with Windows ME. In case you haven't realized, win2k is based on NT4, which never had the 49 day uptime bug.

--
Only the State obtains its revenue by coercion. - Murray Rothbard

Maintenance by apoplectic · 2004-09-21 10:00 · Score: 1

The employee missed the maintenance window. If you forget to do something that is a part of your job, I would have to suggest that you are responsible for the consequences. Now, does placing the employee in such a situation apply some burden of responsibility upon higher-ups? Certainly. But, the employee should be held responsible...ESPECIALLY if the importance of the maintenance was made clear.

Re:Maintenance by chris_mahan · 2004-09-21 10:10 · Score: 1

No. Management.

Management is ALWAYS at fault.

--If they knew and did nothing.
--If they knew, tried to fix it, and failed.
--If they did not know, they should have.

All I know is that management is always at fault.

--
"Piter, too, is dead."
Re:Maintenance by lachlan76 · 2004-09-22 01:20 · Score: 1

Human error will happen whether you like it or not, punishing people severely won't prevent the inevitable.

If the software had been written properly, this human error wouldn't be a problem.
Re:Maintenance by apoplectic · 2004-09-22 02:28 · Score: 1

I certainly didn't indicate anything regarding punishment...merely responsibility.

However, given you assert that human error wouldn't be a problem if "the software had been written properly", and that the software was written by humans, then I would suppose your ultimate assertion would have to be "If it hadn't been for human error in writing the software, this human error wouldn't be a problem". Not a very convincing argument regarding responsibility, IMHO.
Re:Maintenance by lachlan76 · 2004-09-22 02:57 · Score: 1

To bring down air-traffic control with this bug, one person needs to make a mistake once.

To introduce the bug, many people have to make the same mistake many times - not just when it is written, but every time someone reads it.

I certainly didn't indicate anything regarding punishment...merely responsibility

Well, what do YOU think will happen to the techie? Management aren't exactly going to be rushing to his aid, are they?

Upgrade?? by Zevets · 2004-09-21 10:01 · Score: 1, Redundant

Since when is going from Unix to Windows an upgrade?

--

Mod Wisely.

Flaw left unfixed for too long? by Astro-pilot · 2004-09-21 10:01 · Score: 2, Interesting

Was the flaw left unfixed for too long because they did not have access to the source code? Or was it because it was too expensive? If this is such a critical system that it can cause loss of life (on a massive scale, no less), the root cause should have been fixed, rather than the workaround. I remember reading somewhere that this flaw has now been fixed. Smells like a managerial issue within the FAA, not just a technician problem. Remember NASA and the space shuttles?

If I recall by sweetshot97 · 2004-09-21 10:02 · Score: 1

If I recall, doesn't MS have something that absolves them of any liability listed towards the end of the license agreement. Something along the lines of, "Do not use in mission critical places." Or was it more like do not install in missile silos or nuclear facilities, something like that right? Someone correct me. If I am right about the license agreement, that was stupid of LAX to have been suckered into switching from UNIX to M$. Oh wait, I forgot, everything works better on MS products right? That's why we have many security/virus/worm/bug/whatever flaws. What a great product Bill!

Uhm, THE TECHNICIAN by mekkab · 2004-09-21 10:02 · Score: 1

I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

Actually, I do. You've got a job; you've got deadlines. Do the work.

--
In the future, I would want to not be isolated from my friends in the Space Station.

Re:Uhm, THE TECHNICIAN by Kwil · 2004-09-21 10:07 · Score: 2, Funny

Are you kidding?

Think about it.. the Tech managed to keep Windows up and running for almost 50 days. The guy's a hero!

--
That Jesus Christ guy is getting some terrible lag... it took him 3 days to respawn! -NJ CoolBreeze
Re:Uhm, THE TECHNICIAN by serviscope_minor · 2004-09-21 10:29 · Score: 1

I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

Actually, I do. You've got a job; you've got deadlines. Do the work.

So... the technician gets run over by a car, or collapses of a heart attack at work. Well, hell, the planes are just going to have to bang into each other because we didn't foresee that. Mission critical means designing for the worst case. Requiring needless manual intervention once a month is not mission critical design. So maybe the technician was to blame, but not as much as the bozo who rolled this thing out with this known flaw.

--
SJW n. One who posts facts.
Re:Uhm, THE TECHNICIAN by mekkab · 2004-09-21 10:35 · Score: 1

So... the technician gets run over by a car, or collapses of a heart attack at work.

See, thats part of the job description. He's not ALLOWED to die. ;)

--
In the future, I would want to not be isolated from my friends in the Space Station.

Don't be so hasty to blame the OS... by Ann+Elk · 2004-09-21 10:03 · Score: 5, Insightful

OK, I know it's violation of /. policy to actually read a referenced article. My bad. But, according to the software.silicon.com article:

Richard Riggs, an advisor to the technicians union, said the FAA - the American aviation regulator - had been planning to fix the program for some time. "They should have done it before they fielded the system," he said.

This sounds to me like more of a problem with the application, not the OS. The "system" crashed after 49.7 days, which is about 4 million seconds, which is about 4 billion milliseconds, which is (obviously) MAX_ULONG. I suspect the application is using a ULONG to store a timeout value and got pissed-off when it rolled over.
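
A quick sanity check of that arithmetic, as a minimal sketch. It assumes the value in question is an unsigned 32-bit count of milliseconds, which is this thread's guess rather than anything the article confirms; the function name is made up for illustration.

```python
# Sketch: how long an unsigned millisecond counter runs before it wraps.
# Assumption (not confirmed by the article): the application keeps its
# timeout in a 32-bit unsigned value that counts milliseconds.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def days_until_wrap(bits: int = 32) -> float:
    """Days of continuous uptime before a `bits`-wide ms counter wraps to 0."""
    return (2 ** bits) / MS_PER_DAY


print(f"{days_until_wrap():.2f} days")
```

With the default of 32 bits this comes out at roughly 49.7 days, which matches the uptime reported in the article.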

Blame the human by Lead+Butthead · 2004-09-21 10:03 · Score: 1

By the very nature of the system, the blame can fall on no one other than the maintenance personnel. Otherwise the PHB that authorized the "upgrade," and the system that put the said PHB in the position to authorize said "upgrade," would look incompetent and foolish, and we can't very well have that.

--
ELOI, ELOI, LAMA SABACHTHANI!?

hum, guess they selected the wrong word... by sxpert · 2004-09-21 10:04 · Score: 1, Troll

This happened after an upgrade from Unix to Windows.

Shouldn't that read downgrade instead ???

Not soon enough by ravenspear · 2004-09-21 10:04 · Score: 1

Before the torrent of "windows sucks" posts...

Too late.

Its a mind set sort of thing...... by zootman · 2004-09-21 10:05 · Score: 1

Ah, the old "windows maintenance reboot" problem. It always amazes me how IT managers (hell, even some techos) accept the need to reboot their windows systems every week. At my work, the windows guys accept it as normal maintenance. If I had to reboot my AIX and z/OS systems every week there would be hell to pay. But because it's windows, it's accepted. I dunno, mediocrity is the new standard these days...

"upgrade from Unix to Windows." by captainclever · 2004-09-21 10:05 · Score: 1

" upgrade from Unix to Windows "

Ahahahahahahahahahahahahahahahahahahahah that's the funniest thing i've read in ages :)

--
Last.fm - join the social music revolution

49.7 Days - A New Record for Windows 95! by akiy · 2004-09-21 10:05 · Score: 3, Funny

I believe the 49.7 days of uptime for a Windows 95 box is a new record, shattering the previous record in Norway of 27.9 days back in January through February of 2001. Congratulations!

--
http://www.aikiweb.com - AikiWeb Aikido Information

Re:49.7 Days - A New Record for Windows 95! by toddestan · 2004-09-21 14:14 · Score: 1

I've gotten a Windows 95 system up to the fabled 49.7 days. Even did some web browsing, chatting, Photoshopping, and used a scanner that was attached to it too. I didn't decide to go for the record until I noticed it had been up for about 3.5 weeks, then I decided to see if it would go all the way. It made it, still worked fine right to the end.

Ironically, that's the record holder I have for any of my Windows systems. 2000/XP are far more stable, but I usually end up rebooting them for some reason long before 49 days.

It'll still crash... by Kippesoep · 2004-09-21 10:06 · Score: 1

after 584542046 years. Okay, I admit... when you reach that time, you'll probably have other problems than a Win2K crash.
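
For anyone checking that figure, here is a minimal sketch. It assumes the fix is a 64-bit unsigned millisecond counter and that a year is 365.25 days; both are assumptions chosen for illustration, and the names are hypothetical.

```python
# Sketch: whole years before a 64-bit millisecond counter wraps.
# Assumptions: unsigned 64-bit counter, years of 365.25 days.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 36525 // 100  # ms in 365.25 days


def years_until_wrap_64() -> int:
    """Whole years of uptime before a 64-bit ms counter returns to 0."""
    return 2 ** 64 // MS_PER_YEAR


print(years_until_wrap_64())
```

Integer arithmetic is used throughout so the result is not affected by floating-point rounding.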

windows update anyone? by roadrunnerro · 2004-09-21 10:06 · Score: 3, Insightful

and office update while you're at it too...

Wouldn't want to spoil a nice MS bashing session, but I think the bug was in the ported application, not in the OS - probably someone used the wrong data type to hold timestamps somewhere within the program (win95 had the same bug) - I've seen win2k last more than 47 days without reboots...

Lessions from other Aviation Authorities by MosesJones · 2004-09-21 10:06 · Score: 5, Interesting

I worked for around 5 years in Air Traffic Control projects, both in delivery of radar processing and displays and in R&D for next generation systems.

Let me give you an overview of the failure approach of just one of those systems.

1) Everything on Unix, ruggedised releases of UNIX

2) Every box must be able to FAIL ON ITS OWN

3) Every box must have a direct replacement, or replacements, which carry the SAME LOAD.

4) ZERO total system downtime allowed, partial systems failures are allowed, but core systems must keep running.

5) 5 stages of power supply failure, double mains, double generation and lastly a great big warehouse of car batteries if all else fails.

6) 4 Years of testing of FULL system before live.

This is what is normal when safety is the primary concern. What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

The idea of a safety-critical system (one where people could die if it fails) that requires a reboot is fine in only one case: if it can be non-operational on a regular basis, in which case the reboot should be done in EVERY non-operational window (say, every week). That is okay for some hospital scanners that are certified for 12-hour runs. It's not okay for a 24/7 system that controls objects flying around at 500 miles an hour.

Welcome to the US... we will be landing slightly quicker than expected.

--
An Eye for an Eye will make the whole world blind - Gandhi

Re:Lessions from other Aviation Authorities by n9mdh · 2004-09-21 11:10 · Score: 1, Funny

...2) Every box must be able to FAIL ON ITS OWN

That requirement makes it obvious why they needed to switch from Unix to a Microsoft product, doesn't it?
Re:Lessions from other Aviation Authorities by Anonymous Coward · 2004-09-21 18:45 · Score: 1, Interesting

What the FAA decision sounds like is a cost driven process which chose the cheapest solution that "could" meet the requirements.

Making incremental changes to the existing system, such as faster hardware, would almost certainly have met the requirements and been much cheaper than the solution chosen.

I think someone just got a bad case of "shiny thing" and thousands of travellers ended up paying (fortunately not with their lives).
Re:Lessions from other Aviation Authorities by 6th+time+lucky · 2004-09-21 20:00 · Score: 1

Welcome to the US... we will be landing slightly quicker than expected.
Welcome to the US... We have 'upgraded' our flight paths from horizontal to a vertical mode.

depends by Tsiangkun · 2004-09-21 10:06 · Score: 2, Insightful

I think it depends on what the company rep said when they convinced them to replace Unix with Windows.

If they advertised a consumer OS as an OS suitable for mission critical applications . . . then this flaw should not be in the software. It could be the software company's fault for aggressively marketing their product where it should not be.

Maybe we should throw some blame to the PHB who ordered the switch. Perhaps there was no hard sell from MS, and a PHB saw a product brochure and got a hard on to switch.

I see your point though, the tech knew about the problem and failed to do his job.

I guess my question is, should the problem have been addressed before now, or is it common practice to wait for a catastrophic success like this to occur before addressing the problem ?

Re:depends by black+mariah · 2004-09-21 10:47 · Score: 1

Could we please take about thirty fucking seconds to read a goddamned article? It wasn't a software switch, it was the ENTIRE FUCKING SYSTEM, HARDWARE AND ALL, that was replaced. The UNIX systems were old-ass and needed replacement.
I guess my question is, should the problem have been addressed before now, or is it common practice to wait for a catastrophic success like this to occur before addressing the problem ?
You're right. The incompetent dipshit that let this happen should have been fired beforehand. How the fuck can you forget to do something that's part of your BASIC FUCKING DUTIES?

--
'Standards' in computing only impress those who are impressed by things like 'standards'.
Re:depends by Tsiangkun · 2004-09-21 11:36 · Score: 1

From one of the seven god damned links:

An improperly trained employee failed to reset the system

So who is the motherfucking god damned fool that forgot to tell this fucker his job included rebooting the mission critical machines every 30 days ?
Re:depends by AstroDrabb · 2004-09-21 11:39 · Score: 1

Perhaps there was no hard sell from MS
Not likely. I have worked at 3 fortune 500 companies and MS _always_ have their cronies around at all 3 to help influence decisions, especially for bigger purchases or where a non-MS competitor was also being looked at. A big switch like this is like pay-day for MS. I am sure they had some of their techs and PHB cronies there.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Re:depends by vsprintf · 2004-09-21 12:37 · Score: 1

Maybe we should throw some blame to the PHB who ordered the switch. Perhaps there was no hard sell from MS, and a PHB saw a product brochure and got a hard on to switch.

I doubt that it was a hard sell from MS, but it could have been another company. The typical sales job is from an integrator/reseller who provides the final product. There is usually a slick marketeer with a young, curvy, female partner to provide a distraction for any male techies present. The power-point says it all: Lowest total project cost, lowest time-to-completion, scalable, innovative, rich features (that one always seems to be present, whatever it means), easy administration, state-of-the-art, enterprise-class, lowest TCO. What more could a PHB ask for? (Well, yeah, there's the truth, but we're talking business here. :)

Re:Would you trust your life to windows 95? by lucabrasi999 · 2004-09-21 10:07 · Score: 1

Tower: Who is this General Protection, anyway? And, how did he break my computer?

Seen this week at various airports by whoever57 · 2004-09-21 10:07 · Score: 4, Interesting

This week, while flying, I saw:
1. Windows-based terminal used by the public to print tickets (I think) with a "you have chosen to download a file, what do you want to do with it: save, open" or similar (I don't recall the exact wording).

2. A windows-based machine that was part of the baggage scanning setup at Chicago-O'Hare going through a scandisk process. OK, this may have been due to operators turning the machine off using the power switch, but should not such a machine use a read-only boot drive/partition?

Do you feel more secure?

--
The real "Libtards" are the Libertarians!

No proof the old system was stable. by rdunnell · 2004-09-21 10:07 · Score: 2, Insightful

A system running UNIX doesn't necessarily mean it was stable. It could have all sorts of flaws in the code, hardware failures, etc.

Sure, Windows 95 in particular and Windows in general is often less stable than modern counterparts. But an upgrade from an old, obsolete UNIX to a new Windows system could have had significant benefits and made a lot of sense at the time. Without the full information behind the decision, how can you judge whether the decision was bad or not?

Re:No proof the old system was stable. by DarkKnightRadick · 2004-09-22 14:32 · Score: 1

While that is true, I imagine it would have been mentioned if the old system had been worse.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)

no such thing as a Windows 2000 49.7 day bug by art123 · 2004-09-21 10:08 · Score: 4, Informative

There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

The problem here is the software made by Harris does not handle a rollover of the GetTickCount() function turning back to 0. This function counts the number of milliseconds since the OS was last booted so it should be obvious to anybody that the returned unsigned 4 byte integer cannot go on forever.

So the badly written Harris software has this bug and their solution (which was really not that bad of a workaround) was to manually reboot the system every 30 days, but as a fail-safe, they had a scheduled task to do a reboot on the 49th day just in case. The 49th day came because of procedural error.

There is nothing Microsoft could do to prevent this.
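
To make the rollover concrete, here is a minimal sketch of the failure mode and one common way around it. It simulates a 32-bit tick counter in Python rather than calling GetTickCount(); the function names are hypothetical, and nothing here is taken from the actual Harris code.

```python
# Sketch of interval timing against a 32-bit millisecond tick counter.
# The counter is simulated; this does not call the real GetTickCount().
TICK_MODULUS = 2 ** 32  # a DWORD holds values 0 .. 2**32 - 1


def naive_elapsed(start_tick: int, now_tick: int) -> int:
    """Buggy pattern: plain subtraction goes negative once the counter wraps."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick: int, now_tick: int) -> int:
    """Modular subtraction: correct for any interval shorter than one wrap."""
    return (now_tick - start_tick) % TICK_MODULUS


# Start timing 5 seconds before the wrap and look again 10 seconds later.
start = TICK_MODULUS - 5_000
now = (start + 10_000) % TICK_MODULUS  # the counter has wrapped past zero

print(naive_elapsed(start, now))      # negative, so timeout logic misfires
print(wrap_safe_elapsed(start, now))  # 10000
```

In C the modular form falls out of ordinary unsigned 32-bit subtraction, so an application only gets into trouble when it compares raw tick values or stores them in a wider or signed type.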

Re:no such thing as a Windows 2000 49.7 day bug by recharged95 · 2004-09-21 12:31 · Score: 1

Regardless, I think in Windows 2000 (even NT), they should have thrown a system message (warning, info or error) in the event viewer i.e. saying that the tick count was rolling over due to a hardware+OS limitation (32 bits). Then at least the users would 'instantly' know that the OS is doing something abstract, upon rebooting. I'm pretty sure the problem was finding it and validating everything would be ok than trying to get the computer rebooted. I can see the IT staff somewhat clueless since no info was generated in the logs.
It was the users' error that they forgot procedure and didn't know what the problem was immediately, but it was the OS's error that it didn't report anything in a log for a problem that is OS+hardware oriented and you get no exception/error on the GetTickCount() call (i.e. proprietary & uneditable call). This applies to all OSes. The OS should report these backdoor-like fixes to limitations; it makes debugging much easier. We did something similar (special status message) for satellite nav systems.
Re:no such thing as a Windows 2000 49.7 day bug by Ahnteis · 2004-09-21 13:05 · Score: 2, Insightful

"There is nothing Microsoft could do to prevent this."

But this is slashdot so we won't let little things like facts get in the way of a good MS bashing session.
Re:no such thing as a Windows 2000 49.7 day bug by SuiteSisterMary · 2004-09-21 13:29 · Score: 2, Insightful

Nonsense. That would be like saying 'warning: you're taking a step, and might trip.'

Typing naught but 'GetTickCount()' into Google lands me right onto the MSDN page and clearly says:

The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

and goes on to suggest alternative timing capabilities.

This was a major fuckup by the application programmers, incorrectly using a clearly defined API call.

--
Vintage computer games and RPG books available. Email me if you're interested.
Re:no such thing as a Windows 2000 49.7 day bug by some_guy_out_there · 2004-09-21 16:33 · Score: 1

Actually M$ has some fault in this. As a current CS student, I was taught in our OS class that a well designed and developed operating system should be able to manage resources and provide good user interaction. The managing resources part also includes applications. And yes, I will be the first to say nobody was born knowing how to program. Errors/bugs/features that are found in an application can and will cause that application to fail. However, when an application failure brings down the OS (sends it to a screeching halt), the OS is at fault. The OS has to be designed to handle these types of failures.
Re:no such thing as a Windows 2000 49.7 day bug by Keeper · 2004-09-21 16:46 · Score: 1

The OS doesn't come to a screeching halt. It continues working just fine. The poorly programmed application starts behaving badly, essentially making it useless. Instead of telling people how to restart the application in question, they apparently found it easier to tell people to hit the reset button instead.
Re:no such thing as a Windows 2000 49.7 day bug by recharged95 · 2004-09-27 13:33 · Score: 1

"the application programmers, incorrectly using a clearly defined API"
Yep, you are right about that. The app programmers should have identified this in a test scenario/test case and provided some feedback in their code. I would. But of course app programmers aren't as savvy as we may think, since the DWORD handling is due to the choices the OS programmers made due to the hardware (i.e. outside the application programmer domain?). Hence it may be a draw & one of those fuzzy lines on who "owns" the state (and should write the behavior).
In the end, someone (app developer or OS) should have reported something.

An urban legend... by eddy · 2004-09-21 10:08 · Score: 2, Insightful

.. is what I'm going to consider this for the time being. I've seen it reported everywhere, but it's just too absurd to take at face value.

--
Belief is the currency of delusion.

I don't feel redeemed, I feel cheated... by jbwolfe · 2004-09-21 10:08 · Score: 2, Informative

Hey, I submitted this two days ago. What makes it slashdot worthy now?

--
Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?

Re:I don't feel redeemed, I feel cheated... by Dun+Malg · 2004-09-21 11:04 · Score: 1

Hey, I submitted this two days ago. What makes it slashdot worthy now?
You probably failed to include the requisite Windows bashing, this time appearing in the form of unfounded speculation that it's a repeat of the Win95 bug.

--
If a job's not worth doing, it's not worth doing right.

Hold on just one minute! by cookd · 2004-09-21 10:09 · Score: 1

Where does it say that this was due to the Win95/Win98 bug? (If I missed something, please let me know.) Just because it happens to be the same amount of time as the Win95 bug doesn't mean it is the same bug. The bug was never present in Windows 2000, AFAIK. And in any case, there's a reason why 49.7 is a "magic number" for uptime (hint: how many milliseconds are there in 49.7 days?), just as there was a reason why "2000" was a magic number for date problems and why 2038 will be another magic number for date problems.

Just because it runs on (OS) and just because it crashes doesn't mean it is (OS vendor)'s fault. In this case, you certainly can't blame Microsoft: there was a problem in the radio software, the software developers knew about it, the maintenance staff knew about it, it didn't get fixed, and it caused a problem. Where does Microsoft fit into that?
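
The other date limit hinted at above can be checked the same way. This is a minimal sketch that assumes time is stored as a signed 32-bit count of seconds since 1 January 1970 UTC; the names are illustrative only.

```python
# Sketch: the last moment a signed 32-bit seconds-since-1970 counter can hold.
# Assumption: time is a signed 32-bit count of seconds from the Unix epoch.
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_SIGNED_32 = 2 ** 31 - 1  # largest value of a signed 32-bit integer


def last_representable_moment() -> datetime:
    """Timestamp corresponding to the maximum signed 32-bit second count."""
    return UNIX_EPOCH + timedelta(seconds=MAX_SIGNED_32)


print(last_representable_moment().isoformat())
```

The result falls in January 2038, which is why that year is the one usually cited for this class of overflow.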

--
Time flies like an arrow. Fruit flies like a banana.

You insensitive clod by rutledjw · 2004-09-21 10:09 · Score: 4, Interesting

As a PHB, I resemble that remark! Clearly you do not appreciate the fine art which is combining management and technical decision-making. Neither does my parent corp.

I have the distinct, but sadly not unusual, pleasure of watching my company execute a brilliant strategy of:

Outsourcing Data Center Operations (systems that used to be down for seconds a year are now down for days and in some cases weeks per year)
Outsource development to India (which has been a mess I won't use the foul language to describe) _AND_
Squeeze remaining people to make up for items 1 and 2!

Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident. We're "encouraged" to support their "strategies". Indeed...

It's a much higher order PHB. Kinda like a 4th degree black-belt, but not.

--

Computer Science is Applied Philosophy

Re:You insensitive clod by Hatta · 2004-09-21 11:43 · Score: 1

Since becoming a PHB (although I still do architecture work - thankfully), I've found that mindless, boneheaded, sweeping decisions are usually driven by some empty-suit, bean-counting, incompetent, barely literate, sh!t-for-brains sycophant who found themselves in an executive position purely by accident.

Consider for a moment that your underlings may feel the same way about you.

--
Give me Classic Slashdot or give me death!
Re:You insensitive clod by rutledjw · 2004-09-22 04:12 · Score: 1
Typically I ignore such drivel, but since I have some time to kill...
1. I don't refer to the guys on my team as "underlings"
2. You have no basis for making that statement. Your post is more of a reflection on you than me
3. I bust my ass to make sure what comes out of the greater architecture team is logical and spend a lot of time "running interference" to keep the morons away from my team.
Further, I've built this team virtually from scratch and it's a pretty d@mn fine crew. They have skills I can't hold a candle to and I give them due respect. Considering they had other offer(s) when they chose this job and have stuck with it through the current round of BS - I'd say they don't feel that way.
Am I a little defensive? Yeah, I've seen incompetent mgmt dorks and I don't respond well to insipid one-liners.
--

Computer Science is Applied Philosophy
Re:You insensitive clod by LaCosaNostradamus · 2004-09-22 06:52 · Score: 1

Outsourcing Data Center Operations (systems that used to be down for seconds a year are now down for days and in some cases weeks per year)

... because it's cheaper.

Outsource development to India (which has been a mess I won't use the foul language to describe)

... because it's cheaper.

Squeeze remaining people to make up for items 1 and 2!

... because it's cheaper.

There. Now you've been trained in modern business methods. I'll send you my bill. It'll be a whopper.

--
[You have a stable society when some nut guns down a schoolyard and the law doesn't change.]

Liability. by Dausha · 2004-09-21 10:10 · Score: 1

You know, if strict product liability were applied to Microsoft, they'd be paying big time.

--
What those who want activist courts fear is rule by the people.

Re:Liability. by DaFallus · 2004-09-21 10:22 · Score: 1

That's the same kind of mentality that awards people for spilling hot coffee on themselves. I think the only people that should be held liable for their decisions are the actual decision makers. Everyone knows Windows isn't what we would call entirely stable. That's why nuclear facilities and any facility where security and stability are key are advised against using Windows. The idiots who decided to "upgrade" from Unix to Windows should be held accountable, not Microsoft or the lowly tech who was supposed to reboot the system.

--
No one cares what your captcha was

Houston TX, USA
Re:Liability. by Sloppy · 2004-09-21 10:27 · Score: 3, Insightful

You know, if strict product liability were applied to Microsoft, they'd be paying big time.
If you duct tape a wing to an airplane and then the wing falls off and the plane crashes, you don't sue the duct tape maker. You sue the idiot who decided to use the duct tape.
The grossly negligent party in this situation, is the contractor who built a real-life system on top of Windows. And the FAA idiots who didn't spot this glaring flaw in the proposal. Microsoft shouldn't have to pay a cent.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

"Who's really at fault?" by switcha · 2004-09-21 10:11 · Score: 3, Funny

You guessed it.

Frank Stallone.

--
You know what? ... A little club soda *did* get that out!

Re:"Who's really at fault?" by bstadil · 2004-09-21 10:41 · Score: 1

Arnold
Nuff Said

--
Help fight continental drift.
Re:"Who's really at fault?" by red+floyd · 2004-09-21 11:28 · Score: 1

Or so the Germans would have us think.

--
The only reason we have the rights we have is that people just like us died to gain those rights. -- Cheerio Boy

Equipment age by raider_red · 2004-09-21 10:11 · Score: 1

Guys, most of the equipment in use by the FAA isn't new enough to run Windows 2000. I worked on the "state of the art" search radar, and it was built around Sun Ultra 5s.

--
It's good to use your head, but not as a battering ram.

(...) an upgrade from Unix to Windows. by Pope+Raymond+Lama · 2004-09-21 10:11 · Score: 1

+1 Funny

--
-><- no .sig is good sig.

Re:Now even the submitters aren't reading the arti by Kehvarl · 2004-09-21 10:11 · Score: 1, Informative

The shutdown wasn't the problem, or more appropriately, the shutdown that would have prevented the problem was missed. But also the FAA's software probably has some issues of its own that need to be fixed.

On a completely different subject, I move that any post containing a phrase along the lines of "This is going to get me moderated as Troll" be automatically moderated Troll. Too many of us seem to use it because it tends to lead to the opposite result.

Bug due to UINT32 overflow by slobber · 2004-09-21 10:12 · Score: 1

Number of milliseconds in 49.7 days:

60*60*24*49.7 * 1000 = 4,294,080,000

which just about overflows uint32.

--
"You mortals are so obtuse." -Q

Yea, but... by HaeMaker · 2004-09-21 10:13 · Score: 2, Interesting

That information had been filtered at least three times, can't count on that either...

Software analyst -> LA Times reporter -> TechWorld reporter.

gettickcount maybe? by plopez · 2004-09-21 10:13 · Score: 4, Funny

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

Sounds like whoever wrote the software/OS module they were relying on used this gem. I hereby dub whosoever was so silly as to do this a 'code monkey, first class'.

--
putting the 'B' in LGBTQ+

poor guy.. by joeldg · 2004-09-21 10:14 · Score: 2, Interesting

First, having to shut down a system to maintain its uptime is a ridiculous idea.

Second, it took several years to find that bug because most windows machines never made it to that 49.7 days, and if they did the users just assumed it was normal, because it is considered normal for windows to "lock up", freeze or whatever.

Third, replacing unix, known for its stability, with any variant of windows (known for instability) in a system where people's lives are at stake, and then having this happen: the guys at LAX who decided to do this should be fired because they just risked a lot of lives and caused massive delays for travellers. In a political situation they would have to resign.

I remember a similar story about an Aegis-class cruiser stuck out in the ocean for three days because they decided to use windows. "Yea, that will work great during a war.."

*sigh* Microsoft has good lobby power and hires a fleet of sales people to keep selling their shod-ware that really should just be kept to mom and pop living rooms.

But then, this is the opinion of a guy who works only with linux and is sitting on an uptime on an openmosix cluster-leader (that also is my dev box) that looks like this:
19:03:06 up 319 days, 5:20, 3 users, load average: 1.28, 0.73, 0.37

eat your heart out LAX.. you got punk'd

--
anime+manga together at last.. in real time.

Goodbye Microsoft by DaFallus · 2004-09-21 10:14 · Score: 1

I really wish Microsoft would go out of business, quickly and quietly. I don't hate Bill Gates, I don't hate Windows, I'm just tired of hearing everyone bitch about them so much.

--
No one cares what your captcha was

Houston TX, USA

Re:Goodbye Microsoft by BarakMich · 2004-09-21 10:29 · Score: 1

You seem not to understand a basic rule of life:

People like to bitch

Even if MS was replaced with something quick, bug free, and exactly what the average user wanted, people will still bitch about it somehow.

You can only minimize the bitching, never can you eliminate it.

Ouch, poor ad placement by Eric+Seppanen · 2004-09-21 10:15 · Score: 5, Funny

Headline:

Microsoft server crash nearly causes 800-plane pile-up
failure to restart system caused data overload

giant advertisement:

Make a name for yourself with Windows Server System

I'm thinking that maybe "the guy that almost crashed a bunch of planes" is not the name they were looking for.

(I'm not making this up- that's really the ad I'm seeing.)

--
314-15-9265

Space Shuttle accidents and software bugs by BlueUnderwear · 2004-09-21 10:17 · Score: 4, Interesting

Was at JAOO today, and on the closing panel discussion for the Test-Driven Development track, Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame". Well, this may be correct if he was thinking only about the flight software... but there is other software than what rides in the shuttle itself...

--
Say no to software patents.

Re:Space Shuttle accidents and software bugs by GlassHeart · 2004-09-21 13:22 · Score: 1

Mr Kevlin Henney was praising NASA's rigorous software testing procedures. He was so proud of them that he let out a "and in both space shuttle crashes, software was not to blame".
This may sound a bit funny, but perhaps some of the software budget should've been shifted to the parts of the shuttles that did break, or parts that may have saved the astronauts. Obviously not so much of it that the software ends up broken, but overengineering a module in a system is only slightly less wrong than underengineering it.
Re:Space Shuttle accidents and software bugs by hazem · 2004-09-21 16:35 · Score: 1

An engineering prof where I worked had a sign on his door that said something like:

The only regret you'll have from paying for too much quality is the money. You'll have everything to regret from spending on too little quality.
Re:Space Shuttle accidents and software bugs by GlassHeart · 2004-09-21 17:13 · Score: 4, Insightful

The only regret you'll have from paying for too much quality is the money. You'll have everything to regret from spending on too little quality.
That's a nice thing for a professor to advocate, but real world projects like the space shuttle do not have an infinite budget to accomplish the assigned task. Therefore, spending too much money on one aspect can mean that another is sacrificed and becomes the point of failure. Therefore, while being responsible for the part that never failed is an understandable source of pride, it may actually reveal a misallocation of resources.
Engineering is about spending the least amount of time and money to achieve the required quality. Nobody said anything about spending too little.
Re:Space Shuttle accidents and software bugs by hazem · 2004-09-21 17:30 · Score: 1

And that reminds me of the instructions from the shop manual for the Model T. For bolting the head back on, instead of giving a torque measurement, it said something like "tighten to just before the point of stripping".

If one part is not failing while others are, it doesn't necessarily mean that you're over-spending on the non-failing part. You might be at the "just before the point of stripping."

In which case, the people providing the budget should be aware of the need to buy enough quality. If they don't buy enough, then they shouldn't be bothering with the project at all. No amount of engineering, no matter how gallant, will overcome a lack of the minimum resources to get the job done.

This is where good engineers are able to communicate with the non-engineer policy makers. It becomes an ethical issue as well.

I definitely understand budgetary constraints, but there's a point where you have to say it can't be done safely for what's being spent. If you're tasked with designing/building a bridge or some other critical structure, what do you do when you know that you've not been assigned enough resources to get the job done safely and properly? A good PE would be willing to refuse the project and not sign his name on the plans, or demand more resources - as it's his ass on the line in the end.

(It's funny, I don't use the phrase "non-engineer" often, but I'm actually working right now on a presentation titled "traffic safety for non-engineers".)
Re:Space Shuttle accidents and software bugs by Superjhemp · 2004-09-23 03:27 · Score: 1

(It's funny, I don't use the phrase "non-engineer" often, but I'm actually working right now on a presentation titled "traffic safety for non-engineers".)
What software did you use to make that presentation? I hope it isn't powerpoint.
Re:Space Shuttle accidents and software bugs by hazem · 2004-09-23 04:40 · Score: 1

Well, sadly yes. But, I'm paid as a contractor on this, and I used the tools required by the people who pay me. Seeing that at the moment I'm otherwise unemployed...
Re:Space Shuttle accidents and software bugs by Superjhemp · 2004-09-23 04:54 · Score: 1

I definitely understand budgetary constraints, but there's a point where you have to say it can't be done safely for what's being spent. If you're tasked with designing/building a bridge or some other critical structure, what do you do when you know that you've not been assigned enough resources to get the job done safely and properly? A good PE would be willing to refuse the project and not sign his name on the plans, or demand more resources - as it's his ass on the line in the end.
Hmm, but if you're tasked with designing/building a bridge, or some other critical structure, what do you do if you get all the monetary resources needed, but instead are required to use a certain "quality" of steel which you know becomes brittle at low temperatures (... and the bridge is indeed being built at a place known for its harsh winters...) Would a good PE still be willing to sign his name on the plans?
The answer may be easy, if lots of other clients are queuing up before his door.
But what if there are no other clients, and he would be unemployed without accepting this project?
Re:Space Shuttle accidents and software bugs by hazem · 2004-09-23 05:11 · Score: 1

There's a big difference between this presentation and building a faulty bridge.

This presentation is simply summarizing some crash statistics and showcasing features of "traffic calming" to be presented to DOT employees. I can think of no scenario where the choice of using PowerPoint over another product will result in harm or death to anyone. Even if PowerPoint crashes, the presentation can be handled using the printed notes.

Now, I'm not a PE (in fact, I'm an MBA with a BA in Middle East Studies*), but I do have a strong sense of ethics and personal responsibility. If I were a PE, and were presented with your choice on the bridge, the answer for me is easy. 1) refuse to sign my name on the project, and 2) if they continue to go forward when I am confident others will be endangered, then I go to anyone who'll listen to expose what is going on.

For me, my economic welfare is trumped by the safety of others. There is always another way to put food in my mouth, but there are few ways to ease the guilt from harming someone else.

* but I did complete 2.5 years of an EE program and worked 5 years as a systems administrator in engineering school. That by no means makes me an engineer, but I do have a pretty good grasp of engineering problems and engineers.
Re:Space Shuttle accidents and software bugs by fshalor · 2004-09-23 05:43 · Score: 1

Read "What Do You Care What Other People Think?" by Richard Feynman. In 1986, the NASA software was tops and the engineers and techs were using the three-diameters rule to certify roundness of the SRBs.

A poll of one of the top engineer committees revealed an average of 1/300 as a failure rate for the entire shuttle. And the manager/engineer spouted out the 1/10^6 like the company line.

--
-=fshalor ::this post not spellchecked. move along::

Who's really at fault? by smittyoneeach · 2004-09-21 10:18 · Score: 1

One of the things that is delightfully unambiguous is the naval tradition.
If the ship trades paint with anything, it's the Commanding Officer's fault. Yeah, some shrapnel may work its way down the organization chart, but the glory and the gory both rest on one neck...
Would that less time were spent on blamesmanship in our decadent, modern day...

--
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear

Re:2K is based on NT kernel by gl4ss · 2004-09-21 10:18 · Score: 4, Insightful

so what if it is "completely different os"? that's the whole point, if it were continuation of the win95 line it would have been fixed!

now the bug was present in both codebases, but fixed just in one.

that's at least how the article and the writeup make it sound like.

--
world was created 5 seconds before this post as it is.

Microsoft's new slogan... by TWX · 2004-09-21 10:19 · Score: 3, Funny

... should be:

"Microsoft: Writing the software to prevent SkyNet since 1981."

--
Do not look into laser with remaining eye.

Re:Microsoft's new slogan... by CoderDog · 2004-09-21 17:55 · Score: 1

Except that in Rise of the Machines we learned that what became SkyNet was originally mistaken for a rash of viruses. Everyone knows that Microsoft and viruses have a pretty tight relationship going on. Bill Gates is just drooling over the idea of having a planetary computer to charge for upgrades and maintenance.
Re:Microsoft's new slogan... by TWX · 2004-09-22 06:50 · Score: 1

If the computers that run SkyNet constantly crash, requiring human intervention to keep them running, that doesn't bode well for SkyNet's functionality, regardless of how malicious it is supposed to be.

--
Do not look into laser with remaining eye.

Windows Bug by nwbvt · 2004-09-21 10:19 · Score: 2

Is there any evidence that this was caused by a Windows bug, or is this just more /. anti-Windows FUD? None of the articles support such a hypothesis; they seem to put the blame on the integration and maintenance of the system, not on the design of the operating system.

And I hardly see how the Windows 95 bug is relevant to this issue as that clearly isn't what caused the shutdown.

Editors please learn how to do your fucking jobs and reject crap like this. Just because it bashes MS doesn't mean it's newsworthy.

--
Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.

Re:Windows Bug by nwbvt · 2004-09-21 10:28 · Score: 1

BTW, before claiming that it must be a huge coincidence if the system's reboot procedure and MS's 95 bug happen after the same amount of time, do the math. 49.7 days is about 2^32 milliseconds. In case you are like the submitter of this article and know nothing about computers, that is a very significant number.
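
For anyone who wants to "do the math" rather than take it on faith, here is a minimal sketch. The function name `rollover_days` is made up for illustration; the only assumption is an unsigned counter that ticks a fixed number of times per second.

```python
def rollover_days(bits: int, ticks_per_second: int = 1000) -> float:
    """Days until an unsigned counter of the given width wraps back to zero."""
    ticks_per_day = 24 * 60 * 60 * ticks_per_second
    return (2 ** bits) / ticks_per_day


# A 32-bit counter of milliseconds wraps after roughly 49.71 days.
print(f"{rollover_days(32):.2f} days")
```

So any 32-bit millisecond counter, on any operating system, runs out at the same point.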

--
Mathematics is made of 50 percent formulas, 50 percent proofs, and 50 percent imagination.

Data overload by ceswiedler · 2004-09-21 10:20 · Score: 1

I'm pretty sure the union official who says that it requires a reboot to avoid 'data overload' misspoke, and meant 'data overflow'. 49.7 days is 2^32 milliseconds.

Try telinit by TheScienceKid · 2004-09-21 10:21 · Score: 2, Informative

What you may not have taken the time to observe is that when you run init with a name of telinit or with a process ID other than 1 it runs in 'telinit' mode. In this mode it passes a message via /dev/initctl (a FIFO) to tell the running copy of 'init' (the process responsible for initialising services and managing them thereafter) to perform a specific action (eg shutdown, reboot... etc)

Re:Try telinit by multipartmixed · 2004-09-21 13:09 · Score: 1

telinit doesn't tell init about any special actions.

It tells init what run level to initialize.

What happens at that run level is defined by (depending on OS) /etc/inittab and /etc/rc?.d/[SK]*

--

Do daemons dream of electric sleep()?

Ironically ... by yo_tuco · 2004-09-21 10:23 · Score: 1

The article at: http://www.techworld.com/opsys/news/index.cfm?NewsID=2275
has a headline: Microsoft server crash nearly causes 800-plane pile-up
And next to it you'll see a Microsoft advertisement that says: Make a name for yourself with Windows server systems

And I guess the FAA did just that too.

Upgade from Unix to Windows??? by dokhebi · 2004-09-21 10:24 · Score: 1

I don't think switching from Unix to Windows can be considered an "upgrade."

This sounds like more Microsoft FUD to me. But I might be wrong because I like to use Unix/Linux and therefore my opinion is suspect.

Re:Upgade from Unix to Windows??? by smchris · 2004-09-21 12:36 · Score: 1

You know what they say, "Nobody ever lost his job choosing Microsoft!"

Oh, wait....

Who's really at fault? by mcguyver · 2004-09-21 10:24 · Score: 3, Insightful

Whoever approved this process of manually rebooting a machine should be at fault. The fact that it was a windows operating system, or a unix OS or a purple OS is irrelevant. The problem here is someone thought a valid solution was to reboot a machine once a month.

Life and death critical apps running under Windows by lee+n.+field · 2004-09-21 10:24 · Score: 1

The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw, possibility related to an old Windows 95 bug. An article at the LA Times claims that the outage was caused by human error, as the system will automatically shut down after 49.7 days (related to this Windows 95 flaw?), and a technician didn't reboot the system monthly as he should have. This happened after an upgrade from Unix to Windows.

Oh. Good. Lord.

There's just so much wrong with this picture. At least they picked the version of Windows least likely to flake out.

(Personal nightmare: finding a Windows computer running your life support)

A few remarks by bmajik · 2004-09-21 10:24 · Score: 2, Informative

1) this is not a windows OS bug

GetTickCount() will rollover. An _application_ which assumes it is a strictly increasing value will misbehave after the 40 some odd days expire. That appears to be what is happening here.
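
To make the rollover point concrete, here is a small sketch that simulates a 32-bit millisecond tick counter in Python. It does not call the real GetTickCount(); the function names and the sample values are made up for illustration. In C the same effect comes for free from subtracting two unsigned 32-bit values.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit tick counter wraps at 2**32


def naive_elapsed(start: int, now: int) -> int:
    """Buggy: assumes the counter is strictly increasing."""
    return now - start


def wrap_safe_elapsed(start: int, now: int) -> int:
    """Modular subtraction: still correct if the counter wrapped once."""
    return (now - start) & MASK32


start = 0xFFFFFF00                 # 256 ms before the counter wraps
now = (start + 1000) & MASK32      # one second later, after the wrap

print(naive_elapsed(start, now))      # large negative number
print(wrap_safe_elapsed(start, now))  # 1000
```

An application that does the first kind of arithmetic runs fine for weeks and then sees time jump backwards, which fits the kind of failure described here.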

Note that nowhere in the article is there a distinction between the "system" and the "OS" or the "application".

2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing. Aircraft engines are serviced at hour-based intervals, whether they need it or not. It's better to just tear the thing down and rebuild it than to have it tear itself apart. Software doesn't _have_ to be this way, but it sometimes is.

Making a complete hardware -> app layer stack 100% failsafe is.. tricky. For some applications, designing the system with a known restart point.. i.e. a reboot of the app or the entire machine, can be more cost effective.. (see earlier the paper on crash-only software design)..a periodic shutdown/restart in complicated systems can be a valid operational practice.

The fault here is two fold - one, the application/system had a known issue that is probably avoidable, but for whatever reasons, it still has the issue.

Two, knowing that the issue existed, the proper maintenance was not observed, with the expected result - a failure.

Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.

If the system called for a 48th day restart, that's what it requires, and deviation from that has consequences. Luckily no one was hurt.

--
My opinions are my own, and do not necessarily represent those of my employer.

Re:A few remarks by evilviper · 2004-09-21 11:19 · Score: 2, Insightful

You just can't talk about computers like you talk about machines. The analogy does not work.

If the fault was going to happen every 48 days, they should have scheduled a reboot for every 22 days at most. Just like everything else, it's insane to have a single point of failure like this.

If you know a machine needs to be rebooted regularly, there is no reason not to automate the process. Windows task scheduler should do the job quite well.

There's no reason the computer could not have reported an error, by whatever means, to an administrator when it detects it is operating in excess of its design parameters. Send a barrage of e-mails, IMs, faxes, SMS messages, etc. I can guarantee this life-or-death system would get somebody's attention, and it would be restarted as it should be.
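
As a rough sketch of what such a self-check might look like: the function name, the status strings and the thresholds below are all assumptions for illustration (22 and 48 days are just the figures from this post), and the actual alerting by e-mail or SMS is left out.

```python
def uptime_status(uptime_days: float,
                  reboot_every_days: float = 22.0,
                  hard_limit_days: float = 48.0) -> str:
    """Classify uptime against a reboot schedule and a known failure point."""
    if uptime_days < reboot_every_days:
        return "ok"
    # One scheduled reboot was missed, but there is still room for another.
    if uptime_days < min(2 * reboot_every_days, hard_limit_days):
        return "reboot overdue"
    # Two reboots missed, or the known failure point is imminent.
    return "critical"


for days in (10, 30, 45):
    print(days, uptime_status(days))
```

With a 22-day schedule, one missed reboot still leaves a full second chance before the 48-day mark, which is the whole argument for not scheduling right up against the limit.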

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:A few remarks by Dun+Malg · 2004-09-21 11:23 · Score: 1

Only in America do you get away with blaming Audi for oil sludge problems when you don't change your oil every maintenance interval.
Heh. Only in America can you blame Audi because you were too dumb to tell the gas from the brake.

--
If a job's not worth doing, it's not worth doing right.
Re:A few remarks by kupci · 2004-09-21 15:28 · Score: 1

1) this is not a windows OS bug
Unless it's your code, don't be so sure. For the original Win95/98 issue, it was an "OS" bug. Also, depends on what you call the "OS". As MSFT has argued in court, the "OS" encompasses lots of things. That's the problem with a company run by a bunch of lawyers instead of technical folks.
2) Regardless of where the fault is (hint: it's not in Windows), it is not unreasonable for a machine to need servicing.
That's why the software houses try to call 'em "Service" packs, sounds much more marketable than "patches". But don't think the airplane analogy works, what we have here is more like a "recall".
a periodic shutdown/restart in complicated systems can be a valid operational practice.
Not in the on-line 24x7 world, and especially not at a major airport.
If the system called for a 48th day restart, thats what it requires, and deviation from that has consequences.
Bad design is bad design. As others have pointed out it's ridiculous LAX is running on such a weak system. It does give you a bit of respect for those who wrote, for example, the air traffic controller systems, that manage all this magic every day. It also shows you just how difficult it is to replace something like that.

Microsoft needs to update its list... by Eric+Damron · 2004-09-21 10:24 · Score: 1

Windows NT has a disclaimer in the license agreement stating that it should not be used in critical job roles like nuclear reactor control, etc.

Maybe they need to update the list. I would suggest everything except their Mine Sweep game.

--
The race isn't always to the swift... but that's the way to bet!

Re:Now even the submitters aren't reading the arti by Anonymous Coward · 2004-09-21 10:27 · Score: 1, Interesting

No, the OP is using something called "inference". In fact, I am inferring that the OP was inferring that the journalist reporting the article either doesn't understand the 32 bit rollover problem or does not want to report all the details required to describe the 32 bit rollover problem.

It's far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.

As you can read in the OP, he questions "could this be?", not "this is".

Suggest you pull your head out.

blame should be assigned to the technician by KillerCow · 2004-09-21 10:28 · Score: 1

I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.

Yes, it really is. They had a system in place which they chose, knowing its deficiencies. To combat one of the deficiencies, they prescribed a procedure to be followed monthly. The procedure was not followed by the technician, so it was human error.

Would you expect your car to run flawlessly if you never put gas in it or changed the oil on a regular basis? If you didn't, whose fault is it? The car's or yours?

Re:blame should be assigned to the technician by cranos · 2004-09-21 11:02 · Score: 1

Yes, but the procedure was flawed. In an operation as vital as aircraft control there should be at least three levels of redundancy, where one mistake does not bring down an entire operation.
Re:blame should be assigned to the technician by YU+Nicks+NE+Way · 2004-09-21 11:27 · Score: 1

They had three levels of redundancy: a mandatory monthly reboot to protect the application, a mandatory 49 day reboot if the application hadn't been restarted at the monthly reboot, and a backup system. All of them failed.

Two of them were technician errors: the procedure was documented, along with the known flaw in the radio application software that caused it to be required. I'd say that, yes, this was technician error.
Re:blame should be assigned to the technician by cranos · 2004-09-21 11:57 · Score: 1

But really this is only one level of redundancy, the backup system. Relying on the technician to reboot the system every month or so manually is just asking for trouble. An automated system backed up by the technician and then the backup system would be three levels of redundancy.

Re:2K is based on NT kernel by LostCluster · 2004-09-21 10:28 · Score: 4, Informative

As many others have pointed out here, it's the same bug that brought down Windows 9x reappearing.

Just like the "Y2K glitch" was a platform-independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days, which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.

What would have happened if planes had crashed? by theolein · 2004-09-21 10:28 · Score: 1

If two planes had crashed as a result of the communication loss, I think that the resulting lawsuits, both criminal and civil, against the FAA, Harris and Microsoft would have been large enough to possibly cripple the latter two.

I used to have to reboot our NT Servers due to memory leaks once a month. Although this problem seems related to the application software rather than Win2k, I really have to ask myself what the fucking hell Windows, any version, is doing in a life critical computing environment. Is Windows even licenced for operation in such areas???? And I'm not saying Linux is better, but there are OS'es around that ARE licenced for such operation (Tru64 if I'm not mistaken).

And the fact that the system had to be regularly rebooted, and was actually used in the field although this fact was known, is simply pathetic. Added to that, the fact that they couldn't even automate the reboot smacks of gross incompetence.

Uptime: From one of the artticle links by Mateito · 2004-09-21 10:28 · Score: 5, Interesting

The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.

Whoah! 7 nines uptime!

About 3 seconds of downtime per year.

Somebody is on drugs if they sold that. Somebody is on even stronger drugs if they bought that story.

"5 nines", for all intents and purposes, is as good as it gets, with "6 nines" seen as the holy grail. The top HA system I've ever dealt with (running a Telco's billing operation spanning 4 countries!) quoted a figure of 0.999996. To nobody's surprise, it did not run Windows.

Wonder how much their failure clause is going to set them back?
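
For reference, the arithmetic behind the "nines" is easy to check. This is a minimal sketch assuming a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_seconds_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


for label, a in [("five nines", 0.99999),
                 ("0.999996", 0.999996),
                 ("six nines", 0.999999),
                 ("seven nines", 0.9999999)]:
    print(f"{label:12s} {downtime_seconds_per_year(a):8.2f} s/year")
```

Five nines works out to a bit over five minutes a year, six nines to about half a minute, and seven nines to roughly three seconds.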

--
Norman Cook's Ode to Sl

Re:Uptime: From one of the artticle links by phiwum · 2004-09-21 20:35 · Score: 1

"The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999."

Whoah! 7 nines uptime!

About 3 seconds of downtime per year.

Nah, just a little typo. Someone forgot the trailing "%" sign.

--
Phiwum's law: anyone that names an obvious law after himself and then puts it in his own sig is just pathetic.

psShutdown w/Task Scheduler by danZenie · 2004-09-21 10:28 · Score: 1

psShutdown with Task Scheduler would have been enough. honestly don't think M$ should be held entirely responsible. any f00l could have set this up.

BTW, going from UNIX to Windows is more of a migration, not necessarily an upgrade.

--
You need people like me so you can point your fuckin fingers and say, "That's the bad guy." So what that make you? Good?

Not necessarily Windows' fault by DunbarTheInept · 2004-09-21 10:29 · Score: 4, Interesting

While I hate MS as much as the next guy, this might not really be directly their fault. Unix systems are often installed with the instruction that they get reboots regularly. Often there is a problem that is caused by application code, not the OS. If you have a memory leak in an application that runs and stays up all the time, it's going to cause the system to get horribly unusable in the long run regardless of whether it's UNIX or Windows. While a reboot might be overkill when it was just one application misbehaving, a reboot is a guaranteed way to kill and reset the responsible program no matter which one it is. At a previous place of employment we told the customer to do monthly reboots mainly because we didn't trust *our own* code to be that perfect.

--

Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

Re:Not necessarily Windows' fault by meme_police · 2004-09-21 12:14 · Score: 2, Informative

Spoken by someone who obviously hasn't adminned any enterprise UNIX servers.

--
The meme police, They live inside of my head
Re:Not necessarily Windows' fault by DunbarTheInept · 2004-09-21 18:39 · Score: 1

You have just lied.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Re:Not necessarily Windows' fault by shippo · 2004-09-21 23:59 · Score: 1

I came across a commercial Windows application once that had a terrible memory leak. The developers of this app knew there was a leak, but hadn't or couldn't find it. It was some form of email system that cycled through all mailboxes at a regular interval running various tasks, and only for the users not currently logged in.

However in their wisdom one of the minor updates added a command line switch to reboot after a defined number of passes. They never introduced any other fixes, and documented this switch as the solution to this problem.

The reboot actually did a full reset via the BIOS, and not a full shutdown (it was back in the days of Windows 95 - due to network OS API issues it would not run on NT 3.x); hence often this reboot caused disk corruption.
Re:Not necessarily Windows' fault by meme_police · 2004-09-22 09:08 · Score: 1

I seriously doubt it.

--
The meme police, They live inside of my head
Re:Not necessarily Windows' fault by DunbarTheInept · 2004-09-23 08:02 · Score: 1

Then your ignorance was accidental. Ever heard of a "turnkey" system? Where the hardware box, OS, and an application on the OS are all sold as a single unit? They do get used in enterprise-level installations, like the customers we had. Ever heard of tiny companies called "The Home Depot", "Sunbeam/Oster", "Crystler", "Bay Area Networks"?

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Re:Not necessarily Windows' fault by meme_police · 2004-09-27 08:57 · Score: 1

Have you heard of GE? We don't buy pathetic turnkey solutions that have to be rebooted nightly. If a vendor suggested that we'd laugh them out of the conf room.
And wtf is "Crystler"? Do you mean www.chrysler.com?

--
The meme police, They live inside of my head
Re:Not necessarily Windows' fault by DunbarTheInept · 2004-09-28 06:54 · Score: 1

I notice you've changed your tune from "you've never done enterprise stuff" to "the place I work for wouldn't accept your stuff". Thank you for backpedaling to the truth. I know the software was crappy. I chose to leave that place once I could. (a few years after I left they were in such bad shape that Nasdaq had to de-list them, making me very glad I never exercised my stock options).

I just get very angry at people who lie. And when you pretend to know something that you know damned well you don't, that's lying.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.
Re:Not necessarily Windows' fault by meme_police · 2004-09-28 12:55 · Score: 1

You're funny. Just because you put turnkey solutions in place in large enterprises doesn't mean they're being used for "enterprise stuff". Don't be so absurd. They're being used as toys.

--
The meme police, They live inside of my head
Re:Not necessarily Windows' fault by DunbarTheInept · 2004-09-28 17:12 · Score: 1

Since you don't want me to call you a liar, then that would have to mean you actually know what this software does and actually therefore believe that a company would consider a software package to be a mere toy when it is in charge of all their distribution centers, and that if the software were to fail and therefore the company wouldn't ever be able to ship a single product to a single customer, that this would just be no big deal to them. The ability to run their distribution center is, after all, just a toy, right?

If you don't want me to assume you are lying, then I have to assume you believe what you're saying, which makes you really dumb. Frankly, thinking of you as a liar is more flattering.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

You don't blame the bike by lee+n.+field · 2004-09-21 10:29 · Score: 1

"If you get run over on a bicycle while riding on the highway, don't blame the bike."

You don't blame the bike, you blame the person trying to use a grossly inappropriate tool.

Re:Retard by Keith+Russell · 2004-09-21 10:32 · Score: 5, Informative

Search Microsoft's Knowledge Base for "49.7 days", and you'll find a few bugs, all of them related to storing uptime in milliseconds in an unsigned 32-bit integer. Two were reported in Windows 2000:

That rpcss.exe issue looks like a prime suspect. The OS doesn't crash, but, given the time-sensitive nature of air traffic control data, it's quite possible that the applications running on that server would degrade to the point of failure.

Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.

Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.
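
The 64-bit figure can be checked with plain integer arithmetic. A minimal sketch, assuming a millisecond counter and 365-day years (names are illustrative only):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000


def wrap_days(bits: int) -> int:
    """Whole days before an unsigned millisecond counter of this width wraps."""
    return (2 ** bits) // MS_PER_DAY


days_32 = wrap_days(32)        # 49 whole days (49.7 with the fraction)
days_64 = wrap_days(64)        # 213,503,982,334 days
years_64 = days_64 // 365      # 584,942,417 years

print(f"{days_32} days vs {days_64:,} days (about {years_64:,} years)")
```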

--
This sig intentionally left blank.

Re:Seen this week at various airports by Anonymous Coward · 2004-09-21 10:33 · Score: 3, Interesting

It probably should. My company uses XP Embedded for a few systems, and doesn't have any software-related problems on them. Ever. The only problems we have are when people snap off antennae that we use for the wireless connections, or something similar. There's no reason that they shouldn't be using something like this to scan baggage. It sounds like someone at O'Hare didn't do their homework.

The only drawback to XP Embedded, for my company at least, is that the Windows license costs us more than the solid-state drive that we run it from. Looking into Linux for new installations as an alternative, but it doesn't make much sense to replace strong, stable XP systems that never fail.

Putting the "anal" into analyst by Codebender · 2004-09-21 10:35 · Score: 1

This is great:

"The shutdown is intended to keep the system from becoming overloaded with data ... according to a software analyst..."

This "analyst" knows nothing about computers, works for Microsoft, or both.

Windows Upgrade, FAA Error Cause LAX Shutdown by Dacmot · 2004-09-21 10:36 · Score: 1

"Windows Upgrade, FAA Error Cause LAX Shutdown"

Sounds to me like Windows causes constipation. Use moderately.

Re:Wait, I know this one.... by Codebender · 2004-09-21 10:38 · Score: 3, Insightful

No, the FAA is responsible for maintaining the safety of that system. They failed bigtime by allowing Windows to be used for a mission-critical system. Technically, a contractor was the one who made the decision, but the final responsibility for oversight rests on the FAA.

Libel? by DogDude · 2004-09-21 10:40 · Score: 1

If I were /., I'd be careful. They're getting very close to libel. To take something this serious, and completely spin it around, and announce it in a public forum is just ASKING for a law suit. In this case, I think that /. would be fucked if MS saw this and wanted to pursue it.

--
I don't respond to AC's.

Amazing! by LinuxOnHal · 2004-09-21 10:41 · Score: 1

It's actually kind of amazing that it stayed up that long in the first place, when you think about it. Especially if the machine is doing anything at all.

--
Trying is the First Step to Failing --Homer Simpson

But DON'T get into the habit of using reboot. by Ayanami+Rei · 2004-09-21 10:41 · Score: 1

On some systems (Solaris specifically) the Linux-weaned will quickly learn that reboot or halt is NOT the command they wanted to run...

Actually the Linux-derived programs reboot, halt and poweroff do exactly that, but they first check the runlevel... if reboot detects the runlevel is not 6 or s it will call shutdown to tell init to enter runlevel 6. If halt/poweroff detects the runlevel is not s or 0 it will call shutdown to tell init to enter runlevel 0. They are designed to do double duty... to be called at the end of rc.d scripts and for super-user usage.
You can force them to immediately shutdown or reboot without checking the runlevel by using the -f option.

Of course, the SunOS supplied binaries do not have this safety check... I'd recommend against getting used to that. Just pass the appropriate option to shutdown... (-r for reboot, -p for poweroff, halt is the default)
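
Spelled out as a decision table, the Linux behaviour described above looks roughly like this. It is only a model of the description in this post, not the actual source of those programs; the function name and the returned strings are made up for illustration.

```python
def plan(command: str, runlevel: str, force: bool = False) -> str:
    """Return the action described above for reboot, halt or poweroff."""
    # Runlevel each command is meant to finish in: 6 = reboot, 0 = halt.
    target = {"reboot": "6", "halt": "0", "poweroff": "0"}[command]
    # Already in the right runlevel (or single-user), or -f given: act now.
    if force or runlevel in (target, "s"):
        return f"{command} immediately"
    # Otherwise go through shutdown so init runs the stop scripts first.
    return f"ask init (via shutdown) to enter runlevel {target}"


print(plan("reboot", "3"))             # goes through shutdown
print(plan("reboot", "6"))             # called at the end of the rc scripts
print(plan("halt", "3", force=True))   # -f skips the runlevel check
```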

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:But DON'T get into the habit of using reboot. by drinkypoo · 2004-09-21 11:02 · Score: 2, Interesting

The funny thing is that halt used to halt the system RIGHT GODDAMN NOW on most Unixes, and famously on Xenix. They called it haltsys and you typed sync twice before running it. The second one was just to give the system time to sync while your fingers were moving. Most Xenix systems didn't have much of a buffer (I had Xenix on a 286 with 1MB RAM, but the 386 product was of course much more popular) but they don't have much of a filesystem either. Anyway other elderly Unixes and Unix derivatives are simple like that too. Halt just halts, it doesn't stroke you first.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:But DON'T get into the habit of using reboot. by uid8472 · 2004-09-21 11:28 · Score: 1

*BSD also has reboot/halt/poweroff that aren't confused as to their function the way Linux's are. But...

[On] SunOS ... [j]ust pass the appropriate option to shutdown... (-r for reboot, -p for poweroff, halt is the default)

...on BSD (well, NetBSD at least), however, it's shutdown -h for halt, and the default is to drop back to single-user. Fun, no?
Re:But DON'T get into the habit of using reboot. by red+floyd · 2004-09-21 11:34 · Score: 1

Ah yes, the good old "sync;sync;sync;reboot" or "sync;sync;sync;haltsys". We wrote a script to do that on a (highly customized) ODT2 system. The user never saw a command prompt; we had given a button in our app to "Shut Down".

--
The only reason we have the rights we have is that people just like us died to gain those rights. -- Cheerio Boy
Re:But DON'T get into the habit of using reboot. by TeraCo · 2004-09-21 11:55 · Score: 1

Um..
> sync
> sync
> sync
> halt

is not the same as: sync;sync;sync;halt

The whole point of doing 3 syncs was that typing them out gave the system the time to finish the first one.. by scripting it, you are missing the point.

--
Not Meta-modding due to apathy.
Re:But DON'T get into the habit of using reboot. by multipartmixed · 2004-09-21 13:16 · Score: 1

There is no problem using reboot on Solaris, as long as you understand what it does. In fact, I use it more often than shutdown -g0 -i0 -y because the few times I ever reboot a box it's because I want to pass arguments to the kernel or generate a crash dump.

Of course, if you have any running apps which can't handle an off-the-cuff reboot, you should probably either stop 'em or pop down the run level first..

--

Do daemons dream of electric sleep()?
Re:But DON'T get into the habit of using reboot. by red+floyd · 2004-09-21 13:59 · Score: 1

IIRC, the ODT2 sync command waited until the data flushed.

SCO's own docs said to do that in certain cases.

--
The only reason we have the rights we have is that people just like us died to gain those rights. -- Cheerio Boy
Re:But DON'T get into the habit of using reboot. by TeraCo · 2004-09-21 15:49 · Score: 1

Then you only need one sync :)

--
Not Meta-modding due to apathy.
Re:But DON'T get into the habit of using reboot. by PsychoSid · 2004-09-21 17:17 · Score: 1

reboot on Solaris can be passed arguments; for instance, reboot -- -r will do a reconfigure boot.
The thing with reboot rather than init 6 on Solaris is that /etc/inittab defines init to run the scripts in /etc/rcX.d. Generally these are stop scripts on Solaris.
reboot is passed directly to "uadmin", which doesn't cleanly run any /etc/rcX.d scripts.

My GM car is defective! by raehl · 2004-09-21 10:41 · Score: 1

It stops running if I don't fill up the tank every 300 miles.

--
paintball

Horray Windows Embedded by sPaKr · 2004-09-21 10:41 · Score: 1

Holy jeez, I can't wait till we get medical equipment that's built on Windows XP Embedded!

Nurse: Uh, the respirator shut down.

Tech: Reboot it, everything should be fine.

Nurse: OK, the respirator is working again. What about the patient?

Tech: Hmm... can't reboot him, huh?

Nurse: Nope, he's cold.

Tech: Well, at least the embedded web browser is working. Maybe we can find him a family plot... or email John Edwards!

64 bit int by Alien54 · 2004-09-21 10:42 · Score: 4, Funny

Note to Microsoft (or anyone else storing milliseconds, for that matter): unsigned 64-bit int! Instead of having to reboot every 49.7 days, you'll have to reboot every 213,503,982,334 days, give or take a leap-second.

That's every 584,942,417 years. Which is simply not going to be good enough in my book.
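
For anyone who wants to check those figures, here is a minimal sketch of the arithmetic. It assumes an unsigned millisecond counter and a flat 365-day year (which is what the numbers above appear to use); the function names are made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def rollover_days(bits):
    """Whole days before an unsigned millisecond counter of `bits` bits wraps."""
    return 2 ** bits // MS_PER_DAY


def rollover_years(bits, days_per_year=365):
    """Whole years before the wrap, assuming a flat 365-day year."""
    return rollover_days(bits) // days_per_year


# 32-bit counter: the familiar 49.7-day limit.
print(round(2 ** 32 / MS_PER_DAY, 1))

# 64-bit counter: days and years before the wrap.
print(rollover_days(64))
print(rollover_years(64))
```

With those assumptions the 64-bit results match the 213,503,982,334 days and 584,942,417 years quoted above.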

--
"It is a greater offense to steal men's labor, than their clothes"

Re:64 bit int by Dmala · 2004-09-21 13:29 · Score: 2, Funny

That's every 584,942,417 years. Which is simply not going to be good enough in my book.

What are you? A geologist?

semi-OT: pathetic ad placement by bersl2 · 2004-09-21 10:46 · Score: 1

One of the greatest ads ever to appear on Slashdot:
http://a1767.g.akamai.net/v/1767/2939/30d/imageserv.adtech.de/images/Ad247098St1Sz225Sq1Id1.gif

You have no idea what kind of software they run.. by cbreaker · 2004-09-21 10:46 · Score: 1

They could have all sorts of software that requires manual steps on shutdown and restart. It happens all the time.

Whereas on a modern Linux box you could probably script most actions, on Windows it's usually not that easy - even with Windows Scripting Host, most MS shops like to keep everything "standard" or "out of the box."

--
- It's not the Macs I hate. It's Digg users. -

In The Netherlands... by .+visplek+. · 2004-09-21 10:46 · Score: 1

...LAX is pronounced as "laks" and means something like "too lazy to do anything". :)

--
- Save a tree, eat more woodpeckers

The article is light on details... by Ayanami+Rei · 2004-09-21 10:48 · Score: 4, Informative

It's probably not a Microsoft problem if the system is running on NT; it uses a 64-bit time.

It _could_ be that an important part of the system is running Windows 95 interfaced to a 2k domain that implements the rest of the system.
That really isn't Microsoft's fault that they didn't patch that critical machine to fix the flaw... or that they felt they needed to run Windows 95 (gag) in such a critical portion of the system.

It _could_ be that a user-land air traffic control related application itself calls a deprecated API to return the time in milliseconds, which overflows/wraps around, causing the software to crash.
OR
It _could_ be that the user-land air traffic control software just mis-casts the time from the modern API into a 32-bit data structure, which wraps around, causing the software to crash.
In the latter two cases the article writer or LAX's press staff may have incorrectly drawn the connection to the famous Windows 95 problem... even when it wasn't Microsoft's fault in that case.
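
To make the mis-cast case concrete, here is a rough sketch of what happens when a wider millisecond count is squeezed into a 32-bit field. This is purely illustrative: nothing here is taken from the actual FAA software, and the helper name is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits, like a cast to an unsigned 32-bit int


def truncate_to_u32(ticks64):
    """Simulate storing a wide millisecond count in a 32-bit unsigned field."""
    return ticks64 & MASK32


# Two readings taken 6 ms apart, straddling the 2**32 ms (about 49.7 day) mark.
before = 2 ** 32 - 1
after = 2 ** 32 + 5

stored_before = truncate_to_u32(before)
stored_after = truncate_to_u32(after)

print(stored_before)  # still the full value, it fits in 32 bits
print(stored_after)   # only the low bits survive
print(stored_after < stored_before)  # the later reading now looks older
```

The source value is fine the whole time; only the truncated copy goes backwards, which is why this kind of bug would not be the OS's doing.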

I really don't see how Microsoft could be to blame here at all...

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:The article is light on details... by Awptimus+Prime · 2004-09-21 12:04 · Score: 1

I agree. I have had uptimes of well over 50 days with Windows boxes. Busy ones, at that.

I think this article misses the actual cause as probably being something related to the third party applications they are running on the system.

Pray You Never Hear This by craXORjack · 2004-09-21 10:48 · Score: 5, Funny

Ladies and Gentlemen, at this time the Captain would like to ask you to remain seated with your seatbelt firmly fastened; however, if there are any computer technicians flying with us today, especially if they know what to do when a 'Fatal Exception has occurred at 0029:C02FDEC6', would that person please come forward to the cabin immediately?

--
Liberals call everyone Nazis yet they are the closest thing to it.

Objection! by HangingChad · 2004-09-21 10:49 · Score: 1

This happened after an upgrade from Unix to Windows

Unless it was SCO Unix, switching from something that works to something that doesn't is not an upgrade.

--
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage

Another mouse wiggler bites the dust.... by Proudrooster · 2004-09-21 10:50 · Score: 2, Insightful

Let this be a lesson to all the mouse-wiggling MCSEs out there who scorn the uptime of UNIX and shun the power of the command line. If you are running a critical Windows server, REBOOT EARLY and REBOOT OFTEN. Remember, REBOOT-ing is part of the job description and it has to be done. Please protect our key infrastructure and reboot your servers WEEKLY! Just because the UNIX guys get 2 years of uptime doesn't mean you can too. It just doesn't work that way.

Might I suggest this wonderful little tool: Poweroff. It's the only tool I know of which seems to be able to reliably reboot Windows boxes, even when they are crippled due to worms and/or memory leaks. It can even close running apps. Also, you can get it to work over the network with a magic packet, in case Terminal Server crashes or is too slow to use.

The main article should get flagged as troll/flamebait due to the phrase upgrade from Unix to Windows. That wasn't an upgrade; that (as we now know) was a disaster waiting to happen. Wait until the worm of the month comes through and shuts it down. When will people learn to use the RIGHT TOOL FOR THE JOB! If it has to run 24x7 forever, don't put it on Windows. Geez...

Re:Another mouse wiggler bites the dust.... by 808140 · 2004-09-21 12:23 · Score: 1

When will people learn to use the RIGHT TOOL FOR THE JOB! If it has to run 24x7 forever, don't put it on Windows. Geez...

Couldn't have said it better myself. System should have been running VMS.
Ah well.
Re:Another mouse wiggler bites the dust.... by SuiteSisterMary · 2004-09-21 13:19 · Score: 1

And when the APPLICATION has a bug requiring it to be restarted every X days, how would running VMS have changed ANYTHING?

--
Vintage computer games and RPG books available. Email me if you're interested.

I think you missed the decimal point. by freakmn · 2004-09-21 10:50 · Score: 1

The post you are replying to says 0.26 seconds. You boot in 24 seconds. That's about 92 times longer than the allotted time...

--
warning: This post is likely to contain gobs of dripping sarcasm. Consume at your own risk.

What failed? by AK+Marc · 2004-09-21 10:50 · Score: 5, Insightful

A system in which the application (not the OS) failed after a finite time was deployed by people who knew it was faulty. An under-trained technician failed to reboot the server as scheduled. There was a backup which we don't have details on. It failed to work as well.

I don't see what the OS has to do with this. It could have been written for *NIX, OS/2, or any other OS. The lessons are two:
Don't deploy flawed software.
Make sure redundant systems work.

As an aside, since we don't know what the backup was, we could hypothetically say that it was the UNIX system that previously was primary that was relegated to backup duty. In that case, it would be a failure of Windows and UNIX at the same time. So, is it that UNIX sucks and is worthless for any important systems, or is it that the people that screwed this up would have screwed up something, no matter what OS they were working with?

--
Learn to love Alaska

Re:What failed? by Dr.Dubious+DDQ · 2004-09-21 11:46 · Score: 1

There was a backup which we don't have details on. It failed to work as well.
I immediately assumed (maybe true, maybe not) that the problem was that the "backup" system was identical to the main one, and being identical, had exactly the same problem...and therefore ALSO died at the same time due to the problem.
Stupid.

--
Hacker Public Radio is our Friend

Followed by: by HotNeedleOfInquiry · 2004-09-21 10:50 · Score: 1

Where do you want to land today?

--
"Eve of Destruction", it's not just for old hippies anymore...

Having read the article, I have decided... by DavidBrown · 2004-09-21 10:51 · Score: 1

...that it's not Microsoft's fault.

Here's what happened:

The FAA installed a new system. There were bugs in that system, in the custom software the FAA uses to move planes around the sky. Instead of fixing those bugs properly (as they apparently did in Seattle), the FAA instead went with the quick fix of rebooting the server every month, and backed that up with a script rebooting the server automatically if it's not done manually. Then, the FAA techs didn't follow the FAA's workaround procedures, and Chaos results.

Exactly how was this Microsoft's fault? Maybe I'm wrong here, but I don't see what MS did here. And OpenSource wouldn't have solved this problem, because I really doubt that anyone is going to write FAA flight control software under an open source license.

--
144l. ph34r my 133t l3g4l 5k1lz!

Re:Having read the article, I have decided... by aXis100 · 2004-09-21 13:00 · Score: 1

I agree. I'm no MS fanboy, but everyone has been far too quick to call this a "Microsoft problem".

The media doesn't help - the sort of headlines I was seeing on the news when this story came up helped to feed the ignorance.
Re:Having read the article, I have decided... by tbogart · 2004-09-21 16:15 · Score: 1

Just where do you read in what article that said anything other than the "combination of human error and a design glitch in the Windows servers"?

No, I don't want to offload the responsibility of any humans involved, but the point everyone has been making (and I thought it was obvious enough) is that it is ridiculous to have a system so flawed that human intervention was required as a workaround when LOSS OF LIVES AND/OR PROPERTY is involved.

LM FAA SUCKS by Ayanami+Rei · 2004-09-21 10:53 · Score: 1

lol CAASD@MITRE ownzors j00

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Mod parent down by Anonymous Coward · 2004-09-21 10:54 · Score: 1, Funny

Let me see, you claim that the error was in the integration of the Windows server. Thanks for clearing that up, because the submitter wrote, "The recent shutdown of LAX due to an FAA radio outage was apparently caused by a Windows 2000 integration flaw."

Ya, way to correct that error. It was in the integration of the Windows servers, not a Windows 2000 integration flaw.

Re:Please mod parent up by drsmithy · 2004-09-21 10:54 · Score: 2, Insightful

This is pure speculation of the editor. Nowhere in the article the blame is put on the OS. Linking the failure to an error in a previous version of the OS just doesn't make sense.

Particularly when it's not a "previous version" at all but a completely different Operating System.

Windows 95 and Windows NT (2000/XP/2003) are not the same OS. They're completely different. They share a common API and that's about it. Blaming this on "Windows 95" makes about as much sense as blaming an application bug under FreeBSD 5.x on Slackware 1.0.

Re:Can't be a common problem... by Judg3 · 2004-09-21 10:57 · Score: 1

Are you stuck on Win95 or 98? My current XP box will go 30 days without a sweat, and that's under heavy use (compiling, video work, games). The only time I really need to reboot is when there's a big update released (like SP2); other than that I'm fine.
And when it comes to my servers, all of the Win2k ones stay up freaking forever. I've had my SQL/ASP abuse box (The one I use to play around with code) up for almost 470 days before the power went out.
I also had an NT4.0 PDC up for over 600 days.

--
Looking for hardware (Currently need: Large Etch-a-Sketch) Have one? See my journal!

PSSST!!! Active Directory/NTLM2 is LDAP3+Kerberos5 by Ayanami+Rei · 2004-09-21 10:58 · Score: 1

Don't tell anyone...

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

It was the app, not the OS by Teahouse · 2004-09-21 11:01 · Score: 5, Informative

Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.

--
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright

Re:It was the app, not the OS by BCW2 · 2004-09-21 13:50 · Score: 1

Windows allows so many TSR's in any version that memory in any quantity will run out over time. The only simple solution is a regular reboot.

Why is there no back up system on a different reboot schedule? Seems like a simple solution.

--
Professional Politicians are not the solution, they ARE the problem.
Re:It was the app, not the OS by pilsner.urquell · 2004-09-21 13:55 · Score: 1

Pilot here, and this has been a well-known peccadillo of the tracking system for SoCal Approach for a few years. It's an application problem that came into being after an upgrade of the application, not the OS. It's a memory allocation error that retains some of the old tracking on the system; thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS. Look guys, I'm a Linux user and all, but let's not run around blaming M$ for problems with buggy software apps.
Yeah, right! You can't write a bug-free app on a buggy OS. OSes are like roads: if they are full of potholes, then all the vehicles using them will have bad shocks and broken springs.
Re:It was the app, not the OS by tbogart · 2004-09-21 15:53 · Score: 2, Interesting

Just curious - but how does being a pilot give you more insight into the system? I would particularly like to see the "memory allocation error that retains some of the old tracking on the system". That would be quite amazing in itself.
Re:It was the app, not the OS by Teahouse · 2004-09-21 18:32 · Score: 1

Not really, they go partially blind to transponder codes, so they have a hard time determining who is in a Cessna and who is in a 747. All they have are radar blips and the transponder number, but on the radar screens, they lose the designations that show heavy, vfr, and destinations. It's not the end of the world, but it does make it very difficult to keep traffic at a safe distance in class B airspace. The software was down for almost two hours, so it wasn't just a short moment on the backup information.

--
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Re:It was the app, not the OS by handorf · 2004-09-22 01:11 · Score: 1

In addition to the other response, terminating VFR Flight Following isn't an option inside the class B.
I'm sure everyone flying VFR outside the class B got a quick "Radar Service terminated, remain clear class Bravo", but if they're already inside ATC has to provide separation.

--
-- IANAEG - I am not an elder god.
Re:It was the app, not the OS by Teahouse · 2004-09-22 12:22 · Score: 1

I work out of Long Beach Airport, and am friends with the ATs at that tower. Just like the programmer community, most people within the aircraft industry talk and cross-pollinate. A lot of pilots know the problems with Air Traffic because we want to know what to expect. Same thing goes for the weather service. We get to know the guys there as well, and for the same reasons.

--
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright
Re:It was the app, not the OS by tbogart · 2004-09-22 19:27 · Score: 2, Informative

Don't get me wrong - I am not questioning that you seem familiar with the effect the problems have on operations. And of course it just shows good sense that as a pilot, you network (!) with the folks you depend on as you describe. But do you network with the programmers or the administrators? It still sounds like you are getting information at least two levels removed from the level where any real dirt is available. Perhaps an analogy would be talking to someone who works in the next office to the folks who supervise Air Traffic Controllers rather than the controllers themselves. Sure, if those folks are interested in aviation and ask the right questions they can gain reliable information, but it is not like going to the, er, appropriate end of the horse.

FWIW, my father was a machinist/aircraft mechanic and finally technical writer who worked with oil company research labs on improving lubrication, publishing articles in their company publications including doing his own photomicroscopy to analyse corrosion effects.

My first job out of school with an EE degree was at the Johnson Space Center training astronauts and sitting console. About 70% of the folks I worked with were either military pilots still flying in the reserves or private pilots (and I was fool enough to go do light aerobatics with some of them), plus of course the flight crews. While there, I started dealing with computers as they first started appearing in offices, and eventually went into full-time system administration/systems engineering, primarily for development groups and test labs.

Now, the reason I blabbed on like that was to try to establish

1) I am somewhat familiar with the aviation community from both the 'user' and 'support' aspects.

2) I am somewhat familiar with the computer community, starting as a user, and moving into the support realm.

3) I would claim that both the classes I wrote and taught, as well as the time spent on console, directly give me a somewhat intimate knowledge of translating information from one community into another. You generally don't explain an onboard system to a pilot the same way you would to a PhD in EE, or a medical experiment to a pilot as you would to an MD.

One particular conclusion based on my experience in those worlds (and I know this is a bit of a generalization) is that when a pilot or any member of an air crew tells me something about their aircraft or its surrounding operations, I can probably bet on the information being pretty good.

If a programmer or administrator tells me something about their program or system, before I put any stock in what they say (beyond my own experience in similar veins), I probe their background and quiz them as much as possible.

If I wanted to be glib, if programmers/administrators had to go through the kind of training programs that pilots or even support personnel do, about 85% would not cut it. Or if these folks made it into the sky, they would be weeded out by the flaming holes in the ground they made.

If, as I expect, your information is based on what an ATC heard from a guy down the hall, or maybe even was touching a computer, or even from a distilled briefing from the contractor - I would first have to ask how much that ATC knew about systems and programming and see how critically s/he processed (!) that data.

If you even got the information directly from an admin/programmer, (as you might guess by now), the same set of questions would apply.

In either case, the point is to wonder aloud if you take that information as if it were coming from folks who are the caliber of the people you are used to relying on.

Consider your description of the memory issue:
"It's a memory allocation error that retains some of the old tracking on the system, thus, the whole box needs to be rebooted every 45 days or the memory overloads and crashes the OS."

The typical memory allocation error doesn't have anything to do with old data still being in the system, but simply that m
Re:It was the app, not the OS by Teahouse · 2004-09-23 04:27 · Score: 1

Good points all. I understand your concerns, but that is why I simply posted "Pilot here" instead of trying to post a long qualification. I was simply posting the information I received from my ATC pals. I figured people could make their own assumptions from that. Pilots know how to fly and how to negotiate complicated airspace, but we rely on a whole mess of other experts to actually keep that airspace organized. All I have to do is get my King Air and its passengers down safely.

As for the use of Linux, I switched to Linux 3 years ago, and do all my logs and planning on a laptop set to dual boot Suse and M$2k. I am not a fan of M$, and do not depend on their software for much. Luckily, even if the ground system goes down, the best OS of all (Pilot Brain 1.0) is still in control of the aircraft and getting it down safely (as you know). Although it would be dicey, as long as we have radios and UNICOM frequencies, I have no doubt most of us (except for the guys flying the heavies) could get down without any outside aid. The problem is pretty much solved by knowing your AIM/FAR. There is no M$ in my cockpit when I fly, and for that I am thankful.

--
"Curiosity killed the cat, but for a while I was a suspect."- Steven Wright

And then reboots the box without prompting at 49? by Ayanami+Rei · 2004-09-21 11:01 · Score: 1

Right?

I mean, why bother writing a timed script if it doesn't have a failsafe?

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Buffer overruns, not just a window thing. by sebby1234 · 2004-09-21 11:04 · Score: 1

A UINT32 will overflow after about 49.7 days if it contains the number of milliseconds since execution start. This is just a fact. Yes, there was a bug in Win9x where such a counter would overflow in the OS. But still, it's amazing how quick the Slashdot anti-MS zealots are to point fingers without even considering the fact that it might have been a bug in the FAA software?!?!?

Re:Buffer overruns, not just a window thing. by tbogart · 2004-09-21 15:47 · Score: 1

Actually, us zealots were responding to multiple reports pointing directly to MS software. If you are going to start making up possibilities - maybe it never happened at all?

So they could schedule a reboot at the 49th... by Ayanami+Rei · 2004-09-21 11:05 · Score: 1

...but they couldn't script it to do an orderly shutdown? I mean what does the technician do differently that it doesn't interrupt air traffic?

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:So they could schedule a reboot at the 49th... by NeoSkandranon · 2004-09-21 13:14 · Score: 1

I suspect he probably does it in the dead of night when there's less traffic

--
If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)

...Blame the API instead by tyler_larson · 2004-09-21 11:05 · Score: 5, Insightful

This sounds to me like more of a problem with the application, not the OS.

Three words:

GetTickCount()

Returns the number of milliseconds since the machine was last booted.

From reading the article, one would surmise that this function is used to assign a timestamp to a particular flight plan or other record. After the machine has been running for 49.7 days, the GetTickCount() function rolls over to zero, which could cause a whole plethora of problems. Almost certainly those problems would include things like corruption of data, lost records, old records showing up as new, application crashes, and, of course, swarms of locusts. The only fix is to reboot.

The developers cleverly noticed the potential disaster before it crashed any planes, and as a workaround, instituted a policy requiring the servers to be rebooted at monthly intervals. Failure to do so would result in the calamities described above.

So while the problem wasn't the old Win95 bug, it was the same crappy windows API that caused both. The POSIX-compliant gettimeofday() function uses a 64-bit structure and does not suffer from the same flaw, and can be relied upon for at least the next 30 years or so (which isn't amazing, but it's a lot better than 50 days).

Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.

--
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925

Re:...Blame the API instead by 0x0d0a · 2004-09-21 11:32 · Score: 1

True, but the also-POSIX-compliant time() is going to cause plenty of interesting problems in Y2038.
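
A quick sketch of where the Y2038 limit comes from, assuming time_t is a signed 32-bit count of seconds since 1970-01-01 UTC (true of many 32-bit systems, not all). The wrap helper is hypothetical and only simulates two's-complement overflow.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1  # largest value a signed 32-bit time_t can hold


def wrap_int32(n):
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (n + 2 ** 31) % 2 ** 32 - 2 ** 31


# Last second representable before the counter overflows.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)
print(last_moment.isoformat())

# One second later the counter wraps negative and lands back in 1901.
after_wrap = EPOCH + timedelta(seconds=wrap_int32(MAX_INT32 + 1))
print(after_wrap.isoformat())
```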

--
May we never see th
Re:...Blame the API instead by Ann+Elk · 2004-09-21 11:37 · Score: 1

From the current Win32 SDK:

Windows time is stored as a 32-bit value, which means the system can record no more than 2^32 millisecond intervals before the 32-bit value overflows to zero. This is approximately 49.7 days. If you use Windows time, check for the overflow condition when comparing times.

The programmer used an inappropriate API and it bit him/her on the ass. Big surprise.
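
The "check for the overflow condition" advice usually boils down to doing the subtraction modulo 2^32. A minimal sketch, with plain integers standing in for tick readings (the function names are made up, and this is not the Win32 API itself):

```python
WRAP = 2 ** 32  # a 32-bit millisecond tick counter wraps at this value


def naive_elapsed_ms(start_tick, now_tick):
    """Plain subtraction: goes badly negative once the counter has wrapped."""
    return now_tick - start_tick


def elapsed_ms(start_tick, now_tick):
    """Modular subtraction: correct across one wrap of the counter."""
    return (now_tick - start_tick) % WRAP


# A reading taken 100 ms before the wrap and another 50 ms after it.
start = WRAP - 100
now = 50

print(naive_elapsed_ms(start, now))  # large negative number
print(elapsed_ms(start, now))        # the real 150 ms interval
```

This only holds if the true interval is shorter than one full wrap (about 49.7 days); anything longer needs a wider counter.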
Re:...Blame the API instead by DraKKon · 2004-09-21 11:40 · Score: 1

Note that the FAA insists that they're currently implementing a better solution than "reboot every month". Better hurry, guys, you've only got 47.3 days left.

You owe me a new keyboard. That line was awesome!

--
"It's not like your minds are as open as the source you love..." - Me to the majority of Slashdot.
Re:...Blame the API instead by gnuman99 · 2004-09-21 14:50 · Score: 1

True, but the also-POSIX-compliant time() is going to cause plenty of interesting problems in Y2038.
So, how many people are still running 8-bit processors?
By 2038, no 32-bit processors will be in production. Heck, there will probably be no mainstream 32-bit only chips in production by end of next year (hint: AMD64, G5, etc..).
Re:...Blame the API instead by amorsen · 2004-09-21 22:09 · Score: 1

By 2038, no 32-bit processors will be in production. Heck, there will probably be no mainstream 32-bit only chips in production by end of next year (hint: AMD64, G5, etc..).
There are plenty of 8-bit processors in production right now. In fact, the production of 8-bit processors is likely orders of magnitude higher than the production of 32-bit processors right now. 32-bit processors will likely be produced in mass quantities in 2038. (Barring any interruption by Singularity or collapse of civilisation).

--
Finally! A year of moderation! Ready for 2019?

Eh? by RichM · 2004-09-21 11:06 · Score: 1

This happened after an upgrade from Unix to Windows.

This was an upgrade???

Re:Now even the submitters aren't reading the arti by Kehvarl · 2004-09-21 11:07 · Score: 1

That is the perfect response!

Since We're Being Technical About the Answer by techsoldaten · 2004-09-21 11:09 · Score: 3, Insightful

Since we are being technical about the answer, does this mean Microsoft or the software vendor qualifies as a terrorist organization?

Consider the fact that an entire airport was shut down, lives were disrupted, major economic harm was caused our airlines as a result of flights not getting out on time. LAX is a major hub that connects travelers throughout the country, it is conceivable traffic patterns throughout the U.S. were put out by this problem.

Think of it like a car bomb that went off without anyone dying, and you see my point.

M

Re:Now even the submitters aren't reading the arti by AK+Marc · 2004-09-21 11:09 · Score: 1

Its far too great a coincidence that a Windows machine should halt consistently after 40 some days, and that this same bug plagued the Windows operating system.

It also happened to some Cisco routers. Should I presume that those affected IOS versions were rare Windows based IOSs?

If anything breaks, it must be Windows' fault. It could never be the application developers that made bad code. The only people that make bad code work for Microsoft.

--
Learn to love Alaska

Re:And then reboots the box without prompting at 4 by FalconZero · 2004-09-21 11:13 · Score: 1

I was kinda just assuming that some human interaction was required for the reboot process, such as engaging backup radar, or notifying appropriate people first. (Though that is just an assumption.) Otherwise, yes, as other people have suggested, you could just have an automatic reboot.

--
Windows in 6 Bytes (IA-32) : 90 90 90 90 CD 19

Nice uptime! by Mullen · 2004-09-21 11:15 · Score: 1

From Harris.com
The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999.

Less than a one percent uptime!?!?! No wonder the thing crashed, it's supposed to do that, ALL THE TIME! Bill Gates must be proud.

--
Linux O Muerte!

Re:Nice uptime! by tomas.bjornerback · 2004-09-21 17:22 · Score: 1

Doh, 1 = 100%, hence 0.9999999 = 99.99999% uptime. ...on a Windows box? Gotta be kidding!

--
I have 1 Gbps Internet access@home

On the brighter side... by dlleigh · 2004-09-21 11:17 · Score: 1

Song Airlines' in-flight entertainment system runs Linux. The system allows the passengers to listen to MP3s, see a moving map or watch Dish Network live.

After my flight landed they rebooted the system and I saw a friendly penguin and a bunch of startup messages. I noted that they were using a non-GPLed video driver.

Pointing fingers? by Netdoctor · 2004-09-21 11:19 · Score: 1

Why do we have to point fingers at each other after a major failure?

Mostly it's systems that are poorly planned and fail and not people. Fix the problem.

Pointing fingers makes people defensive in the long run, and raises the probability of it all happening again.

Pilot or no... by juuri · 2004-09-21 11:19 · Score: 4, Insightful

... how does a single app bring down the entire OS? You mean the app can't be restarted and brought back up with the same state at a moment's notice in a mere minute or two?

Crappy design, regardless of who is at fault.

--
--- I do not moderate.

upgrade?! by accessdeniednsp · 2004-09-21 11:23 · Score: 1

an upgrade from Unix to Windows.

Yeah, maybe in upside-down crazy world...

Weird reloading harris.com link by now3djp · 2004-09-21 11:28 · Score: 1

http://www.harris.com/view_pressrelease.asp?act=lookup&pr_id=77 Weird reloading link from the Slashdot article text! now3djp

34 centuries... by greck · 2004-09-21 11:31 · Score: 1

According to this press release, VSCS offers "an operational availability of 0.9999999."

Someone check my math, but that appears to come out as 3.16 seconds annually, so their 3-hour outage burned up all their allowed downtime for the next 3,422 years.
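
Checking the math, as requested. This sketch assumes a 365.25-day year and treats the outage as exactly 3 hours; exact fractions are used so floating-point rounding doesn't muddy a figure this close to 1.

```python
from fractions import Fraction

AVAILABILITY = Fraction(9999999, 10000000)  # the quoted 0.9999999
SECONDS_PER_YEAR = Fraction(36525, 100) * 24 * 60 * 60  # assumes a 365.25-day year

# Downtime per year permitted by that availability figure.
allowed_downtime = (1 - AVAILABILITY) * SECONDS_PER_YEAR

# How many years of that budget a 3-hour outage uses up.
outage_seconds = 3 * 60 * 60
years_of_budget = outage_seconds / allowed_downtime

print(float(allowed_downtime))  # seconds of downtime allowed per year
print(float(years_of_budget))   # years of budget consumed by the outage
```

Under those assumptions the figures come out at roughly 3.16 seconds per year and a little over 3,422 years, so the math above holds.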

So it should be quite safe to fly now, statistically speaking.

Upgrade? by kg4gyt · 2004-09-21 11:36 · Score: 1

How can you upgrade from Unix to Windows? Downgrade, perhaps; I've never heard of any upgrade like that before.

I used to worry about terrorists when flying. by Pausanias · 2004-09-21 11:36 · Score: 1

Now I have to worry about the fact that my safety (and my family's) is in the hands of incompetent Microsoft.

That sucks.

I don't blame the OS per se... by WebCowboy · 2004-09-21 11:37 · Score: 3, Interesting

...but I blame a lot of people for carelessness and incompetence (except for the actual techie that forgot to reboot last month--that is an honest mistake).

* Bill Gates and developers of Win2000 for the convoluted, kludgy API they designed for their OS

* Product managers at Harris--the crap-for-brains who actually thought changing out robust UNIX servers that weren't really THAT old with consumer-grade PCs running an unproven OS was an UPGRADE to a critical, safety-related system. WHAT THE HELL WERE THEY THINKING? In one of the article links (the Harris press release), Harris touted SEVEN NINES reliability! If that was a criterion they should've NEVER considered Windows...Not even BillG himself would say Win2k could provide that sort of uptime!

* Retarded developers at Harris who used an API call that tracks milliseconds in a 32 bit integer despite the fact that bugs related to the use of said function call were WELL KNOWN by that time.

* Dough-heads at LAX and the FAA who, upon finding the error early in development, decided it was OK to rely on MANUAL MONTHLY REBOOTS as a workaround to a potentially fatal problem. They should've run the "upgraded" windows machines in parallel with the UNIX servers for much longer, and failing that they should've IMMEDIATELY restored the old UNIX servers to service as soon as the problem was discovered, and to refuse the upgrade (and revoke payment to Harris) until the problem was properly resolved (and NOT just worked around with a kludge like an email reminder to reboot, or a reboot script or a shutdown warning either).

I'm surprised that this sort of error got into such a critical system, and at the way it was handled. I would've certainly tested the new system in parallel for long enough to catch this sort of error and kept the old system around for longer as a standby (in my experience, replacements of critical systems were often tested in parallel for 3 months to a year). I also would've acted much more decisively in resolving the problem if it did slip through the cracks, given a system crash could put lives in danger.

Maybe my girlfriend's fear of flying is more justified than I thought if these are the kind of clowns we trust our safety to...

Re:I don't blame the OS per se... by codedaddy · 2004-09-22 03:04 · Score: 1

Your comment makes me question my belief in free speech. Your comment has no basis in fact, just conjecture. Did it ever occur to you that the FAA might not have wanted to spend any more time or money on a more stable solution? I work at Harris and know some of the engineers that worked on that system. If anyone is a retard it's you. I don't know if you have ever worked at a large company, but the engineers don't always get to do what they want. The program managers make the system-critical decisions. We engineers just do what we are told. So if you want to call the program managers retards, go ahead. By the way, you haven't a clue how the software works. There are so many levels of software, from high-level apps to embedded software (which I worked on), that there is no way of knowing where the error occurred. You can't imagine the complexity of this system. As for the comment on why change from Unix to Windows, ask the FAA. The hardware is becoming obsolete because the vendors have stopped making it. Upgrading is the natural progression of things. Yes, I agree that W2K was probably not the best decision, but don't blame the software developers. They probably didn't make that decision, but had to deal with it.

Upgrade!?!? by Izago909 · 2004-09-21 11:40 · Score: 1

This happened after an upgrade from Unix to Windows.

Sounds like the time a car dealer tried to 'upgrade' my 2001 Chevy into an 82 Honda. I didn't accept the offer, but from the sounds of this article, a sucker is still born every minute.

Re:Can't be a common problem... by ces · 2004-09-21 11:41 · Score: 1

If your servers are used in any sort of business environment I would recommend rebooting them every 30 days even if it seems they don't need it.

Why? First, it's just good practice. Second, you are much more likely to apply patches or fix wonky hardware if you know you are going to take the system down anyway. Third, there are all sorts of problems that are likely to be prevented or spotted with frequent reboots. For example, hardware self-tests don't get run if the system isn't cycled periodically. Fourth, it lets you verify that things like failover are working properly before it becomes a problem.

It doesn't matter what the OS is either: Windows, Novell, Linux, and commercial Unix servers all benefit from periodic reboots. Even Big Iron like IBM mainframes, AS/400s, HP/Tandem servers, and Unisys A-series usually will have occasional reboots as a part of scheduled maintenance.

--
Happy Fun Ball is for external use only.

And yet our vendor... by OSgod · 2004-09-21 11:41 · Score: 1

recommends rebooting our production AIX box at least once a month -- it serves a database only (no interactive users).

Couple of TB of disk, couple of GB of RAM (or more) and a dozen CPUs and we have to reboot it monthly.

It's called maintenance. It is required.

Re:And yet our vendor... by Rick+Genter · 2004-09-21 11:59 · Score: 1

That doesn't seem right. Even our Windows 2000-based database server, running SQL Server 2000, ran for 15 months before we finally retired it for new hardware.

Of course, the server had no access to the Internet, so we didn't have to worry about the patch-o-the-month. Otherwise we wouldn't have made it 2 months without a reboot.

--
Don't underestimate the power of The Source
Re:And yet our vendor... by tbogart · 2004-09-21 16:06 · Score: 1

The fact that your vendor should be submitted to some very clever form of torture doesn't say anything about your system.

IBM uses Notes internally. I got a peek at the server list for about 100,000 seats. Even before the internal ban on Windows servers, the server of choice was RS6K. That didn't stop some of the salesmen from the subdivision from outright lying to customers that the AIX version was going away so they could sell bunches of Windows licenses.
These are the people driving next to you on the freeway and voting. Actually, that explains a lot if you think about it.....

Sounds like a typical Windows "programmer".. by TheCeltic · 2004-09-21 11:41 · Score: 1

Now that I've heard that the application was possibly part of the problem, I can't help but think of the large number of barely literate Windows "programmers" that are out there. What was the application written in? VB? .Net? My kids can write code with VB (it's just not GOOD code). Let's get some geeky grey haired UNIX programmer to do the job and do it right!

--
=-=-=-=-=-=-=-= - The Celtic - =-=-=-=-=-=-=-=

IBM product support kicks all ass. by SvnLyrBrto · 2004-09-21 11:49 · Score: 2, Interesting

> Tell you what, can you get me new boards for an IBM RT pc? I
> highly doubt it.

I've actually dealt with IBM in the "we need support and replacement parts for legacy hardware" capacity before.

And yes, if you've bought IBM in a professional/enterprise capacity, you've also bought the support contract. And if you've bought the support contract (And if you didn't, you deserve to be fired. Why the hell would you pay the IBM premium except for their support?), you can get parts and expert support for damn near everything IBM's ever made; all the way back to card punches/readers, and farther I'd bet. Remember, when you buy IBM, you're buying a MTBF of thirty YEARS.

cya,
john

--
Imagine all the people...

Re:IBM product support kicks all ass. by mekkab · 2004-09-21 13:29 · Score: 1

Actually, you end up buying AIX 3.2.5 source code. NO SHIT.

--
In the future, I would want to not be isolated from my friends in the Space Station.

News to me by ptelligence · 2004-09-21 11:51 · Score: 1

I didn't even know Windows kept track of uptime.

Uptime??? by starnix · 2004-09-21 11:52 · Score: 1

Obviously since uptime is stored in a 32 bit integer, Microsoft themselves never expected Windows to reach 50 days of uptime. Kinda telling isn't it.
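For anyone wondering where the "50 days" comes from, here is a minimal sketch. It assumes the counter in question is an unsigned count of milliseconds, as with GetTickCount(); the helper name is invented for illustration.

```c
#include <assert.h>

#define MS_PER_DAY (24.0 * 60.0 * 60.0 * 1000.0)

/* Days until an unsigned millisecond counter of the given width wraps to zero. */
double wrap_period_days(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    double ticks = 1.0;
    for (unsigned i = 0; i < bits; i++)
        ticks *= 2.0;          /* builds 2^bits, exact in a double */
    return ticks / MS_PER_DAY;
}
```

wrap_period_days(32) comes out at about 49.7 days. The same counter at 64 bits would take hundreds of millions of years to wrap.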

So sue them by C3ntaur · 2004-09-21 11:53 · Score: 1

I'm sure this has been brought up before, but why not bring a suit against M$ for selling a defective product? What makes bugs in their product any different than a car whose wheels fall off because of faulty lug nuts?

--
Loading...

Re:So sue them by Dr.Dubious+DDQ · 2004-09-21 12:14 · Score: 1

why not bring a suit against M$ for selling a defective product?
Because the license that you supposedly agree to by running the software says you won't. And that you agree that the software may not be suitable "for any particular purpose" or similar language.

You know, I would have sworn that use for Air Traffic Control was in the official list of "stuff you agree not to use Windows NT for" in the license for at least previous versions of Windows NT Server (3.51?). Did they remove that, or did the FAA ignore it?....

--
Hacker Public Radio is our Friend
Re:So sue them by Mybrid · 2004-09-21 13:46 · Score: 1

Because the license that you supposedly agree to by running the software says you won't.
Which, after watching years of Judge Wapner, doesn't mean it will hold up in court. Doctors always have you sign a waiver saying you will not sue them if something goes wrong in surgery. Guess what: you do have the right to sue, and people regularly win malpractice lawsuits even though they signed waivers saying they wouldn't.

Downtime vs Failure by burnin1965 · 2004-09-21 11:55 · Score: 5, Interesting

I'm not sure exactly what downtime for routine maintenance on an AIX system running DBase has to do with a Windows bug that causes a system failure. However, in response, there is a difference between planned downtime where a service is made unavailable while planned routine maintenance is performed and planned downtime or an unplanned failure due to a flaw in the system.

It appears that in this case Windows has a flaw which they try to work around with routine maintenance during planned downtime.

In your case I would say you have planned downtime for routine maintenance to work around the need for an appropriate system to handle the work load.

I suppose what is the same between these two cases is that you both need to change your system to something that is more appropriate for the task at hand. And to be more specific in the FCC case, Windows should not be allowed for use in any application where life, limb, or property is at risk. Hmm, I suppose that may rule out just about every use. :P

burnin

Re:Downtime vs Failure by Epi-man · 2004-09-22 02:53 · Score: 1

And to be more specific in the FCC case

Sorry to be a bunghole, but just wanted to point out you meant FAA case. It took me a minute to realize that as I thought I missed another issue involving the FCC.
Re:Downtime vs Failure by horatio · 2004-09-22 06:54 · Score: 1

Windows should not be allowed for use in any application where life, limb, or property is at risk

Wasn't this in the MS EULA for one of the 9x's? I'm almost positive I read that in one of the manuals or EULAs that came with an OEM disc.

--
There is very little future in being right when your boss is wrong.

Fire the Department of the Interior's IT staff... by Dr.Dubious+DDQ · 2004-09-21 11:55 · Score: 4, Insightful

The FAA is under the auspices of the US Department of the Interior, aren't they? You know, the same department that was ordered by a court to take ALL of their systems off line because they were apparently unable to secure them? TWICE? (No, wait, the latter link says THREE times, most recently March 2004...!)

Is there some secret plot to make them look bad, or is the Department of the Interior riddled with incompetence? I certainly don't feel real secure about the safety of our airlines right now - and it's got nothing to do with "terrorists"...

(Not to say that terrorism isn't a real concern, but I'm somewhat less worried that their intentional plots will slip through observation by the authorities than "accidental" screwed up software being deployed by the FAA...)

--
Hacker Public Radio is our Friend

Should have used MPlayer by leonbrooks · 2004-09-21 12:02 · Score: 1

IW4M.

--
Got time? Spend some of it coding or testing

Oopsie by alw53 · 2004-09-21 12:08 · Score: 1

If it went down for three hours, now it's got to run for 3400 years in order to make up the claimed operational availability:

"The Harris-developed VSCS - based on independent, distributed processors and switches - allows air traffic controllers to establish all air-to-ground and ground-to-ground communications with pilots and other air traffic controllers. The system offers unprecedented voice quality, touch-screen technology, dynamic reconfiguration capabilities to meet changing needs, and an operational availability of 0.9999999."

Not a Windows or Unix thing by gone.fishing · 2004-09-21 12:12 · Score: 1

This is a problem that goes deeper than Windows vs. Unix, it has little to do with the operating system or the hardware and even the application has little to do with it.

If you must assign blame, you should probably point fingers at the people who spec'd the system out and perhaps submitted to the cost-constraints demanded by the bean counters.

Any system that lives depend on needs to have fail-safe features and redundancy built in to it and a completely separate fall-back procedure that can be implemented at a moment's notice.

It can almost be assumed that something will go wrong and when it does, the equipment needs to be able to handle it as transparently as possible, otherwise you can just about count on human error to make matters worse.

These kinds of systems can end up being very expensive to build. It is very tempting to remove what can be seen as "bells and whistles" from the package to save money.

Unfortunately, these bells and whistles are designed and intended to save lives.

I honestly don't know if that is what happened in this case, but I've seen it before where lives were at stake. It is another version of the low-bidder syndrome.

Lol, only on Slashdot by jayhawk88 · 2004-09-21 12:12 · Score: 3, Insightful

I don't think blame should be assigned to the technician who missed the task...

Boss: OK Tech, it's your job to see to it this computer is rebooted monthly.
Tech: Will do Boss!
*Time Passes, System Crashes*
Boss: The system crashed, why is that?
Tech: Well, it's because I didn't reboot the system like I should have.
Boss: Oh well, I guess it's not your fault, obviously I failed to realize maximum security synergy in my systems.

Wherever the submitter works, I wanna get a job there!

Re:Lol, only on Slashdot by reverius · 2004-09-21 18:05 · Score: 2, Insightful

it's the boss' fault for making a task like that necessary in the first place.

if i design a system in which someone has to press a button every 12 hours or the world blows up, would anyone want to use that system? no, you think? what if you could -order someone who works below you- to do it!?

that's just plain stupid management. the rebooting job is a waste of the tech's time (anyone competent could make it reboot automatically) and a completely unnecessary job (any competent operating system doesn't need to be rebooted every 30 days, or even every 3 years).

If the boss had scheduled maintenance (Windows Update, to get service pack 4) or had used an operating system that doesn't require that much maintenance to function correctly, the job wouldn't have needed to be performed.

the boss should be fired for general incompetence/negligence (since he had the responsibility to make the system stable), and the tech should be put to work carrying boxes or something (or just fired as well), since he isn't competent enough to put an automatic timer on the rebooting.

Good point, but... by leonbrooks · 2004-09-21 12:13 · Score: 1

...try working with someone who describes the system on his machine as "Word" and complains about the (boilerplate) fax template from MS-Word not being present in OpenOffice as being one of the most important "failings" in the system. I kid you not.

From the sound of it, that's not far from the territory the GPP is in.

--
Got time? Spend some of it coding or testing

Patriot bug details by Animats · 2004-09-21 12:21 · Score: 2, Informative

That was a bad bug. It didn't cause system crashes. It caused missile misses. This bug was responsible for an interception failure which allowed an incoming Scud missile to hit a barracks in Saudi Arabia, killing 28 people.

The radar and the guidance system had separate clocks, and they'd drift out of sync.

Here's a detailed analysis by the General Accounting Office.

Re:Patriot bug details by ergo98 · 2004-09-21 15:33 · Score: 1

Actually didn't the missile that hit the barracks get hit, and knocked off course, by a patriot? My recollection of that story was that the interception was complete (not dead on, as that is rather difficult at those speeds, but close enough to severely affect the trajectory), but in a stroke of bad luck the new trajectory was straight for the barracks.

Re:Now even the submitters aren't reading the arti by AstroDrabb · 2004-09-21 12:25 · Score: 3, Interesting

The shutdown is not a crash but a scheduled event to bring the servers down to flush data.

That is MS PHB speak to "assure" other PHBs that it was not MS's fault. What _modern_ server OS needs to reboot to flush freakin' data? Why do you think technical details are never released in these types of press releases?

The reboot was to reset the logic flaw in the MS system timer. Read my post here on it. It has affected other MS made apps on MS Windows 2000 servers. So if MS's programmers get affected by it, you can expect non-MS-employed programmers to get affected too since they do not have the same level of access to the proprietary OS.

--
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison

For those of us not in the states by Trogre · 2004-09-21 12:27 · Score: 1

WTF is the FAA?

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife

Re:For those of us not in the states by Dr.Dubious+DDQ · 2004-09-21 12:52 · Score: 1

" Federal Aviation Administration"

"Bringing Safety to America's Skies", they say...

--
Hacker Public Radio is our Friend

upgrade? by Baalam · 2004-09-21 12:31 · Score: 1

Is it still called an upgrade when it flops so badly?

Re:PSSST!!! Active Directory/NTLM2 is LDAP3+Kerber by hkb · 2004-09-21 12:45 · Score: 1

You're almost entirely right. NTLM2 is a separate protocol from Kerberos, though. It's used by downlevel clients who can't speak Kerberos, and is also used by your DC's when you're not running in Windows 2000/2003 Native Mode.

OpenLDAP and AD's LDAP crap are both based on the original UMich LDAP code.

--
/* Moderating all non-anonymous trolls up since 2004 */

Hey - at least it stayed up that long! by gatkinso · 2004-09-21 12:49 · Score: 1

Which is an improvement - albeit a questionable one in this case....

--
I am very small, utmostly microscopic.

An upgrade? by Hido · 2004-09-21 12:50 · Score: 1

This happened after an upgrade from Unix to Windows.

I did not know that this could be considered as an upgrade........

--
Havin' it large, livin' the life, Welcome to the land of the rising sun.

Exactly by autopr0n · 2004-09-21 13:03 · Score: 2, Informative

Windows 2000 can stay up for more than 2^32 milliseconds, but software that depends on GetTickCount() being correct can't. That's probably what happened. They could have rewritten the software to use a 64-bit time variable, or they could have worked around the bug.

They didn't, and that caused the crash. Not "buggy windows".

The fact that they couldn't even figure out how to run a scheduled task in Windows to reboot the machine is just pathetic, and shows how incompetent they really are.
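One common way to work around a wrapping 32-bit tick counter is to compare elapsed time instead of absolute tick values. The sketch below is portable C that takes tick readings as plain parameters, so it makes no claim about how the actual system was written; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Milliseconds between two readings of a 32-bit tick counter.
 * Unsigned subtraction is modulo 2^32, so the result stays correct
 * even if the counter wrapped once between the two readings. */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Wrap-safe timeout check: compare elapsed time against the timeout. */
bool timed_out(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    assert(timeout_ms > 0);
    return elapsed_ms(start, now) >= timeout_ms;
}

/* Fragile version, for contrast: the absolute deadline itself can wrap. */
bool naive_timed_out(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    uint32_t deadline = (uint32_t)(start + timeout_ms);
    return now >= deadline;
}
```

With start = 0xFFFFFF00 and a 1000 ms timeout, the naive check reports a timeout after a single millisecond because the deadline wraps to a small number, while the elapsed-time check does not. The safe version still assumes fewer than 2^32 ms pass between the two readings.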

--
autopr0n is like, down and stuff.

Re:Exactly by dgatwood · 2004-09-21 14:39 · Score: 1

Strictly speaking, you can fix it with something like this:
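    #include <stdint.h>
    #include <windows.h>

    #define SOMEBIGNUMBER 0x100000000ULL  /* 2^32: one full wrap of the 32-bit counter */

    /* Not thread-safe as written. */
    uint64_t myGetTickCount(void)
    {
        static uint32_t last32;
        static uint64_t base64;
        uint32_t cur = GetTickCount();

        if (cur < last32) {    /* counter wrapped since the last call */
            base64 += SOMEBIGNUMBER;
        }
        last32 = cur;
        return base64 + cur;
    }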
followed by a global search and replace. Of course, this assumes you call it at least once every 49.7 days. If you don't, you should clearly be using a different API.
The real question in my mind is why Microsoft didn't do a 64-bit GetTickCount64 years ago. (I did find a web page that suggests that they're planning one for Longhorn....) Both Linux (64-bit jiffies) and Mac OS X (clock_get_uptime) have had such facilities for years....

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Exactly by Foolhardy · 2004-09-21 15:08 · Score: 1

Internally, NT has always tracked time using a 64 bit number of 100ns periods. The function GetSystemTimeAsFileTime outputs "a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC)." This function has been available since NT3.1 and Win95.
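To make the 100 ns / 1601 format concrete, here is a small portable sketch that converts such a 64-bit value to and from Unix seconds. It does not call the Win32 API; the helper names are invented, and the only constant it relies on is the 134,774 days (11,644,473,600 seconds) between 1601-01-01 and 1970-01-01.

```c
#include <assert.h>
#include <stdint.h>

/* 100 ns intervals per second. */
#define FILETIME_TICKS_PER_SECOND 10000000ULL
/* Seconds from 1601-01-01 to 1970-01-01 (UTC): 134,774 days. */
#define EPOCH_DIFF_SECONDS 11644473600LL

/* Convert a count of 100 ns intervals since 1601 to whole Unix seconds. */
int64_t filetime_to_unix_seconds(uint64_t filetime)
{
    return (int64_t)(filetime / FILETIME_TICKS_PER_SECOND) - EPOCH_DIFF_SECONDS;
}

/* Convert Unix seconds back to a count of 100 ns intervals since 1601. */
uint64_t unix_seconds_to_filetime(int64_t unix_seconds)
{
    assert(unix_seconds >= -EPOCH_DIFF_SECONDS);  /* nothing before 1601 */
    return (uint64_t)(unix_seconds + EPOCH_DIFF_SECONDS) * FILETIME_TICKS_PER_SECOND;
}
```

A 64-bit count of 100 ns intervals covers tens of thousands of years, so wraparound is not a practical concern with this representation.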

Re:Seen this week at various airports by Johnno74 · 2004-09-21 13:04 · Score: 1

I shouldn't bother even replying, but...

An NT machine with uptime > 5 years is perfectly possible. WinNT 4.0, 5.0, 5.1, 5.2 (that's Windows NT4/2k/XP/2k+3) are not that bad, and keep on getting better. I'd even say that 2k and 2k+3 are good. It's true what MS says about most crashes being the result of driver problems. I develop .Net code on this Windows box all day at work, and I reboot once a week, when I power down my machine for the weekend, if I remember. I've gone a couple of months without a reboot to see what happened. Nothing happened. Last time we took down our production DB (ok, to apply a security patch), which handles way over 500,000 transactions a day on ms-sql2k, it had been up for 8 months without missing a beat.

Yes, MS releases security patches. No, it's not always necessary to install them. A good admin will have disabled all unnecessary services & features, and if there is a patch for a service you aren't using, why would you install the patch, especially if the machine was running inside a trusted network?

Bias by rd_syringe · 2004-09-21 13:17 · Score: 1

Who's really at fault?

According to the headline, looks like Slashdot's already decided.

Why replaced by jamesl · 2004-09-21 13:19 · Score: 2, Funny

The decision to replace the legacy system was made the same week RadioShack quit selling vacuum tubes. Coincidence? I think not.

OT Your sig by Darby · 2004-09-21 13:24 · Score: 1

I like my women how I like my golf courses: with a windmill hole.

Ouch!!!

Maintenance. by Zebra_X · 2004-09-21 13:41 · Score: 3, Insightful

it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task.

Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?

Critical System + Maintenance = Must Be Done.

The system was designed and setup in a particular manner. In fact, the reboot rule was added to the design of the system, so that this very thing would not happen.

Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.

The discussion of whether the procedure of rebooting a machine every month is inane, is something different.

Re:Maintenance. by argent · 2004-09-21 15:36 · Score: 1

Would you feel this way if the airplane that you were flying in missed its engine overhaul time, the engine failed catastrophically and your plane crashed?

If the airplane I was flying had its engine shut down in midflight because of a missed maintenance task that was only necessary because of an easily fixed design flaw that had been known for years and that would have made it shut down a couple of weeks later... I think I'd be kind of pissed, yeh.

Whoever's job it was to reboot the machine is at fault for not maintaining the system properly.

True. But in addition, whoever decided to fix the problem by adding a manual restart was at fault. And whoever used a Windows-based computer for a critical system where an unscheduled reboot would fail the system was at fault as well. And whoever designed this system so any single system failure would bring it down... they were also at fault.

Lots of fault to go around. Laying it all on the last straw and dismissing the design flaw as "something different"... that seems almost like a diversionary tactic.

For this kind of safety-critical control system you need at *least* a completely redundant configuration with a hot spare activated by a heartbeat failure. And if a periodic reboot is necessary, and you can't fix the problem any other way, you'd do it by having an automatic mechanism to force a failover.
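The hot-spare idea can be sketched roughly as follows. This is a toy model, not a description of any real air traffic control system: the struct, the function names, and the caller-supplied monotonic millisecond clock are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t last_beat_ms;    /* when the primary was last heard from */
    uint64_t timeout_ms;      /* silence longer than this triggers failover */
    bool     standby_active;  /* true once the hot spare has taken over */
} failover_monitor;

void monitor_init(failover_monitor *m, uint64_t now_ms, uint64_t timeout_ms)
{
    assert(m != NULL && timeout_ms > 0);
    m->last_beat_ms = now_ms;
    m->timeout_ms = timeout_ms;
    m->standby_active = false;
}

/* Called whenever a heartbeat arrives from the primary. */
void monitor_heartbeat(failover_monitor *m, uint64_t now_ms)
{
    m->last_beat_ms = now_ms;
}

/* Planned maintenance: move to the standby deliberately before rebooting. */
void monitor_force_failover(failover_monitor *m)
{
    m->standby_active = true;
}

/* Called periodically; returns true if the standby is (now) in charge. */
bool monitor_poll(failover_monitor *m, uint64_t now_ms)
{
    if (!m->standby_active && now_ms - m->last_beat_ms > m->timeout_ms)
        m->standby_active = true;
    return m->standby_active;
}
```

In this model a periodic reboot becomes a call to monitor_force_failover() followed by a restart of the old primary, so the service never depends on one machine staying up. Once the standby has taken over, it stays in charge until someone resets the monitor.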

What if OSS gave them software? by Mustang+Matt · 2004-09-21 13:43 · Score: 2, Interesting

What would happen if a group of people out of the goodness of their hearts wrote them a new system that truly did everything they needed. Would they adopt it?

Or are the corporate powers that be so out of touch with reality that they wouldn't touch anything having to do with "open sores!"

--
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin

Why is it that.... by jwcorder · 2004-09-21 13:47 · Score: 2, Insightful

no one has put the blame where it belongs....on the system admin. We can have a shit throwing contest all day about whether or not this is MS's fault. But the fact remains that the problem was addressed and fixed in SP 4 for Win 2000.

If the system had been updated the problem would not have occurred. How is this a microsoft problem? They cannot force system maintenance.

--
http://jayceecorder.blogspot.com

Re:2K is based on NT kernel by banzai51 · 2004-09-21 13:47 · Score: 1

The thing that is baffling the Windows administrators here is the 49.7 day bug was NOT in Windows 2000. I certainly had uptimes months past that timeframe. So how is a bug in Windows 95 affecting the FAA in Windows 2000? Can you say FUD? Can you say the FAA is blaming Microsoft for something they most likely screwed up? Knew you could.

OMFG by confused+one · 2004-09-21 13:48 · Score: 1

nuff said

be more specific please. Windoze sucks. by twitter · 2004-09-21 13:49 · Score: 1, Flamebait

Sounds like bad consultants.

Oh, by all means, be the good consultant will you? Which of the raft of binary cruft which must compose the system was compiled with the wrong SDK? I'm sure everyone would love you to death if you could reach into the DLL hell and pull out the offending bits. The guy who's supposed to go and reboot the thing once a month will be especially pleased with how clever you are.

It's funny how people pointing their fingers at one or another potential cause think that mitigates how nasty M$ is as a platform. How pathetic a system is it that does not have reliable system timers? How much even more pathetic that someone's goofed timer can pull the whole system down. Oh, but it's a timer, see? No, it's just a "data overload" that will give traffic control incorrect information. How about they should have automated the reboot? As if you want faulty software deciding when it should stop giving your air traffic control info or you would trust it to come back up on its own. The boss blamed his tech who missed the once a month reboot as if that was never going to happen. It's junk and you should not use it so it's not M$'s fault is my favorite though, right behind just don't use it.

The last two hit it on the head. M$, You have to be crazy to use it. Remember that the next time you think Winblows might be a reasonable candidate for anything. When the thing goes tits up, the blame gets put everywhere but and on you. So much for vendor support.

--

Friends don't help friends install M$ junk.

Maintenance Task ???? by burdicda · 2004-09-21 13:52 · Score: 1

Maintenance Task.....Holy shit.....!!!
Maintenance Task.....A monthly Reboot ?????

Well there goes the farm Mildred....

Shut er down Ma....she's suckin mudd....LOL

LAX Problem Solved by Freeware App by thegnu · 2004-09-21 13:54 · Score: 1

Aren't there about 7 million freeware apps that will reboot your computer after a certain amount of time? Can't you just write a stupid shell script?

Did the REAL computer tech quit, so they couldn't figure out how to operate the Unix box? Christ.

--
Please stop stalking me, bro.

Re:Well-Written ? by dbottaro · 2004-09-21 13:56 · Score: 1

Actually, yes it does boot the local machine. It is run on the server that needs the nightly boot.

--
Coding my way to the next BSOD!

OK, sorry $9.99 app by thegnu · 2004-09-21 13:58 · Score: 1

http://www.brothersoft.com/Utilities_System_Utilities_Sleep_Timer_4576.html

Sheesh.

--
Please stop stalking me, bro.

MTBF - what boggles my mind by apikoros · 2004-09-21 14:08 · Score: 2, Insightful

Forgetting all the talk about Microsoft and Win95/98 and the defect in the OS that has been well known for years and for which a patch has also been available for years....

If you have a system that has a known failure point at 49 days, when do you perform the mandatory reset?

For the failure that is described the scheduled reset must have been "every 30 days" which is, frankly, INSANE!

If they had scheduled a mandatory reset every 14 or 15 days, they would have needed three failures before disaster struck. As it seems, one failure was all it took.
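The margin argument is easy to put into numbers. This sketch assumes the failure point is exactly the 2^32 ms wrap (about 49.7 days) and that reboots are scheduled at fixed intervals counted from the last successful one; the function name is made up.

```c
#include <assert.h>

/* 2^32 ms expressed in days: the wrap period of a 32-bit millisecond counter. */
#define WRAP_PERIOD_DAYS (4294967296.0 / 86400000.0)

/* How many scheduled reboots in a row must be skipped before the
 * counter wraps, given the interval between scheduled reboots. */
int missed_reboots_until_failure(double interval_days)
{
    assert(interval_days > 0.0 && interval_days < WRAP_PERIOD_DAYS);
    return (int)(WRAP_PERIOD_DAYS / interval_days);
}
```

A 30-day schedule fails after a single missed reboot, while a 14-day or 15-day schedule needs three misses in a row, and a weekly one needs seven.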

Re:MTBF - what boggles my mind by tbogart · 2004-09-21 15:21 · Score: 1

Follow your own link. No patch there for Windows 2000 Advanced Server.
Re:MTBF - what boggles my mind by apikoros · 2004-09-22 00:53 · Score: 1

Absolutely correct... my bad :-(
I probably should not have included the link as it is peripheral to my point that they could have prevented this merely by scheduling the required reboots at any interval short of the maximum to give themselves the least bit of redundancy. Most failure analyses show a chain of trivial mistakes all of which had to be made to cause the failure. Correct any of these trivial mistakes and the disaster does not happen.
Re:MTBF - what boggles my mind by tbogart · 2004-09-22 05:21 · Score: 1

Quite true. And has been pointed out by others, the ability to automate something like a reboot is so trivial that the whole issue still smacks of not having the whole story.

That said, I don't think the point that making bad systems decisions such as basing your system on such a weak product to begin with only increases your chances for disaster should be lost or discounted.

Nothing is bullet proof - but some implementations certainly come closer than others.

Re:2K is based on NT kernel by Phragmen-Lindelof · 2004-09-21 14:10 · Score: 1

"It only takes 20 years for a liberal to become a conservative without changing a single idea." Robert Anton Wilson
Yeah, you've got to hate what that Alzheimer's does. This proves that you should not get old. I'm just glad "W" is not a conservative; he would give conservatives a bad name.

Win2K is a completely different OS than Win95.
I am sure MS completely rewrites new OSs and hence no old bugs reappear in newer MS OSs. This is one reason MS has such a great security record.

FUD, shameless speculation, and bias. Man, this is just bad.
I really wish the FOSS community would follow the MS and SCO leads and avoid all of this FUD and such. Can't we be as good as MS?

Re:Seen this week at various airports by Anonymous Coward · 2004-09-21 14:17 · Score: 2, Interesting

Perfect timing for this comment. I was in the airport yesterday (Detroit). The screens over the metal detectors/ carryon xray machines do nothing except tell you whether the lane is open (a large arrow) or closed (a large X). 4 of the lanes had some sort of Windows error message. Apparently they couldn't handle the workload.

I wanted to say stupid - I say ??? by tuomoks · 2004-09-21 14:21 · Score: 2, Insightful

I started to write a long comment, but there's no point; unfortunately this is the way things are today. Trust me: the more computer system decisions are made at the manager level instead of by people who know how to build systems, the worse it gets. It used to be done that way: compare the financial and manufacturing systems that have been running for years to what we do today. Any questions? Some of my old systems from the 70s are still running; none of my new systems can stay up more than 10-12 months, AND I was told to build them that way. And no, complexity is not an excuse: CAD systems, CRM, protocols, and worldwide networks for finance, airlines, etc. have been around since the early 70s. Just don't give up; maybe some day (after my time...). And let's forget Windows vs. *nix. It is more difficult to build reliable systems on Windows, but it can be done. Windows is just more primitive: you have to design and code at a lower level, and it is harder than *nix, but so what?

Re:I wanted to say stupid - I say ??? by tbogart · 2004-09-21 15:29 · Score: 1

Well, Stone tablets could still be used, but if obviously better tools exist - doesn't it make engineering sense to use them? It hardly seems wise to "forget the Windows / *nix" issue when it goes to logic of choosing one of the most basic building blocks on which any system is built.

They _do_ use Duct tape and baling wire by billstewart · 2004-09-21 14:34 · Score: 2, Interesting

Back when I was working on ARTCC replacement in the late 80s, during the daytime they were running the "modern" 1960s IBM System 360/90 system, which was an ugly undocumented unmaintainable hack job written mostly in JOVIAL. For about four hours a night, they'd run the backup system EDARC, which was a 1970s "Enhanced" version of the 1950s "DARC" radar controller. There were all sorts of parts you couldn't get back in the 1980s - IBM had stopped making the "Serpentine" cable connector, for instance.

I was on the lucky team that *lost* the bidding for the replacement system; IBM's team were the poor bastards who won, and were stuck investing seven years into building an unbuildable replacement, pouring billions of dollars down the drain while being micromanaged by the FAA, who didn't know much about software design or reliability in spite of having a methodology that required producing 175 design documents over the optimistically 3-year design period.

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

Say what? by teklob · 2004-09-21 14:51 · Score: 1

This happened after an upgrade from Unix to Windows.
what's your definition of an upgrade?

Laws, sausages, and air traffic control software by billstewart · 2004-09-21 14:52 · Score: 1

What was the old line about laws and sausages, that if you're fool enough to like either one of them you ought to be forced to see them being made? Yes, the old system was "tested", probably, but the loads it was tested for don't resemble the modern airspace environment.

The 40-year-old system was pretty much the Mos Eisley of software design - you'll never see a more wretched hive of scum, villainy, and undocumented unmaintainable Jovial code running on IBM 360/50 and 360/90 hardware. The backup system was much cleaner (and much dumber); I think the main thing they did in the 1970s enhancement was retread the design to use transistors instead of vacuum tubes, though I never worked directly with that side.

Yes, Sun and IBM machines fail - that's why all of the critical parts in our designs had to be at least doubly redundant, and often triply redundant, because the design spec of "Eight 9s of reliability" meant that doing an hour a year of preventive maintenance might expose you to too much risk from the backup system failing. I haven't seen IBM's design; I was on the lucky team that didn't win the bidding to build the final system, unlike the poor suckers at IBM who had to implement theirs, but the requirements were not only insanely non-implementable, they were excessively focussed on No Possible Downtime Ever, because if anything goes wrong resulting in an airline crash, the FAA gets insane amounts of political heat. Doesn't matter if the system is N years late, because you can try to blame the contractor for that, or if you can't fly supersonic planes across the Continental US because they're too fast for the new ARTCCs, because tough luck for the French and for bi-coastal business travellers.

Of course, that doesn't mean that I'm inclined to trust a system running on Windows, either...

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

Re:Please mod parent up by Hard_Code · 2004-09-21 14:53 · Score: 1

Except if an X bug is identified on a FreeBSD system there's a good chance that it's on Slackware, and many other systems. 9x and NT have different *KERNELS* but I suspect lots of the userland is either the same, or mostly the same. You really think they *reimplemented* the entirety of the win32 api just because the kernel changed? hell no

--

It's 10 PM. Do you know if you're un-American?

Re:Laws, sausages, and air traffic control softwar by Erik+Hollensbe · 2004-09-21 14:57 · Score: 1

Wow, interesting, informative tidbit. Thanks.

Windows is NOT telecom carrier grade reliable by Anonymous Coward · 2004-09-21 15:13 · Score: 1, Informative

No version of Windows has been certified as telecom carrier grade (99.999% reliable). All of Microsoft's programmers and billions can't make Windows reliable. Microsoft won't even attempt the carrier-grade certification test. There are versions of Linux and embedded Linux that are certified as telecom carrier grade.
There has been a serious security flaw in Windows NT 4.0 for a couple of years that has not been fixed. What is Microsoft's solution? Let support for Windows NT 4.0 expire at the end of the year; then Microsoft won't have to fix the flaw. Linux 2.0 (which is older than Windows NT 4.0), 2.2, 2.4 and 2.6 are still supported with the latest security patches.

49.7-day bug not exclusive to Windows. by Temporal · 2004-09-21 15:23 · Score: 3, Interesting

It may seem suspicious that the max uptime of the LAX system is the same as the max uptime of a Windows 95 box... until you realize that 49.7 days is 2^32 milliseconds. If you have a piece of software that counts milliseconds using a 32-bit integer, it will inevitably roll over after 49.7 days and -- unless designed to compensate for it -- will probably crash. Windows 95 is certainly not the only piece of software that counts milliseconds in a 32-bit integer.
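
For anyone who wants to check the arithmetic, here is a quick sketch. It is in Python purely for convenience; the function name and parameters are made up for illustration and have nothing to do with the LAX software.

```python
# How long does a fixed-width tick counter run before it wraps back to zero?

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def days_until_rollover(bits: int = 32, ticks_per_second: int = 1000) -> float:
    """Days until a counter of `bits` bits overflows at the given tick rate."""
    ticks_per_day = SECONDS_PER_DAY * ticks_per_second
    return (2 ** bits) / ticks_per_day


# 32-bit counter of milliseconds: the familiar 49.7 days.
print(round(days_until_rollover(), 2))  # 49.71

# Same counter ticking in microseconds wraps in a little over an hour.
minutes = days_until_rollover(ticks_per_second=1_000_000) * 24 * 60
print(round(minutes, 1))  # 71.6
```

So the 49.7-day figure is specific to the combination of 32 bits and millisecond ticks; change either one and the number moves.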

That said, the Windows GetTickCount() system call returns a timer value as a 32-bit count of milliseconds since the system was booted. Now, any good programmer knows better than to use GetTickCount() -- there are other, better, more robust ways to tell time in Windows -- but it would not surprise me if a newbie had made the mistake of using this system call in the LAX software, thus leading to the problems.

In other words, the Windows timer is not at fault, but it is possible that one of the programmers was confused by the convoluted Win32 API and made a programming error as a result.
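
To make the rollover concrete, here is a rough sketch of the difference between a comparison that breaks at the wrap and one that survives it. GetTickCount() returns a 32-bit unsigned value, and in C the unsigned subtraction wraps on its own; Python integers don't wrap, so the mask below emulates it. The function names are hypothetical, and I have no idea what the actual LAX code looked like.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed ticks, correct across a single rollover of a 32-bit counter."""
    return (now_tick - start_tick) & MASK32


def naive_timed_out(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Buggy: compares absolute tick values, so it fails once the counter wraps."""
    return now_tick >= start_tick + timeout_ms


def safe_timed_out(start_tick: int, now_tick: int, timeout_ms: int) -> bool:
    """Compares the wrapped difference instead of absolute values."""
    return elapsed_ms(start_tick, now_tick) >= timeout_ms


# Timer started 256 ms before the wrap, checked 100 ms after it.
start = 0xFFFFFF00
now = 0x00000064

print(elapsed_ms(start, now))            # 356
print(naive_timed_out(start, now, 300))  # False (the timeout is missed)
print(safe_timed_out(start, now, 300))   # True
```

The wrapped difference only works while the real interval is shorter than 2^32 ms, so it is fine for timeouts and useless for measuring total uptime.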

Backup Plans and Failover Clusters by billstewart · 2004-09-21 15:26 · Score: 1

Backup plans, and people who know how to use them, are really much more critical, because sometimes you will need them.

Failover clusters aren't trivial - I worked on a non-winning design for one of the predecessors to this system back in the late 80s (fortunately for us, we lost, and unfortunately for IBM, they won.) Yes, you can have two, three, or N of everything, but then you need a lot of code watching the redundant components to see if any of them appear to be failing, and code deciding which redundant subsystem is correct if two or more of them disagree, and code watching the watchers to make sure they're still watching well, and data communication protocols that work ok when all messages are transmitted redundantly to the redundant processors, possibly getting different results at microsecondly different times. One of my coworkers had worked with an early "fault-tolerant computer" system which had triply or quadruply redundant hardware, but had an operating system that crashed at least weekly because it was too complex.
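
A toy version of the "which redundant subsystem is correct" code, just to show the shape of the problem. This is a plain majority voter in Python, not anything from the FAA designs, and the names are invented for the example; a real voter also has to deal with late answers, near-equal floating point values, and its own failures.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value a strict majority of replicas agree on, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required; a tie means nobody can be trusted.
    return value if count > len(outputs) // 2 else None


print(majority_vote([42, 42, 41]))  # 42   (one replica is outvoted)
print(majority_vote([1, 2, 3]))     # None (three-way disagreement)
print(majority_vote([1, 1, 2, 2]))  # None (tie)
```

Even this tiny function raises the next question from the paragraph above: what watches the voter?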

You also have to be extremely careful and flexible in your design for the granularity of the redundant subsystems - if you make the separately processed chunks too big or too small, you can have an order of magnitude change in performance and sometimes several orders of magnitude change in reliability, and then there's the problem that the definition of "reliability" includes "probability that the calculation finishes in N milliseconds", so it's inextricably linked with performance.

Moore's Law is really your friend here. Improved performance means you can use a lot fewer parts, which reduces complexity and failures. Disk drives are more than an order of magnitude more reliable, and the increase in size means that a cluster of disks containing N gigabytes is several orders of magnitude more reliable because it's a lot fewer disks, and CPUs that are 2+ orders of magnitude faster mean that it's easier to guarantee that something happens in a given time, and cuts down on communications steps between different modules, so you cut down on all the failure modes for those communications, and on the monitoring software watching for failures, and on the failures of the monitoring software. On the other hand, Moore's Law lets operating system vendors and application vendors bloat their software with features - X Windows 11R3 ran just fine on my 386/25 machine with 8MB RAM, but of course I was using twm, not Enlightenment or Gnome.

Backup plans do introduce the danger of complexity - the FAA doctrines of the 1980s were that any new system had to be able to interoperate with everything its predecessor interoperated with, because you weren't going to flash-cut upgrade everything at once. That meant that everything you designed had to be bug-for-bug backwards compatible with the predecessor's interfaces, and when they redesigned the thing _your_ system interoperates with, it has to be bug-for-bug compatible with everything your system does, which means being compatible with its predecessor, which was compatible with your system's predecessor, etc. It's a vicious circle similar to the messes Windows and Intel CPUs had to put up with, except that while the 8088 and MS-DOS 1.0 were *ugly*, they were at least small and well-documented late-70s technology, as opposed to poorly-documented 1960s JOVIAL and 1950s analog.

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

My Linux car works fine! by mangu · 2004-09-21 15:35 · Score: 1

It stops running if I don't fill up the tank every 300 miles.

If cars are a valid analogy to operating systems, Linux cars work on "zero point energy", which means that, at the worst case, you should stop every few hundred miles to drain your bladder.

Sweet we finally found a job by BoomerSooner · 2004-09-21 15:36 · Score: 1

Sweet we finally found a job that George W Bush has created and can simultaneously perform!

Re:Sweet we finally found a job by babybird · 2004-09-21 20:18 · Score: 1

My God, I didn't realize his problem is just that he hasn't been rebooted in over 49 days!

--
Keith D.

Re:Fire the Department of the Interior's IT staff. by tbogart · 2004-09-21 15:41 · Score: 3, Interesting

Looking at the www.faa.gov home page, it says "Department of Transportation". However, having been a systems engineer and administrator in a couple of stints at one of the DOI Bureaus ... you don't want to know.

Lots of reasons that memory can leak by billstewart · 2004-09-21 15:44 · Score: 1

Memory can leak because of applications. Memory can leak because of operating systems. Memory can leak because of obscure timing bugs nobody can find between the OS and the application's garbage collector. Memory can leak because the hardware clock that drives the timer for the garbage collector sometimes skips a beat because of bus loading. Memory can leak because the moon is full, affecting the frequency of cosmic rays hitting the memory chips. Memory can even leak because somebody didn't RTFM.

Fragmentation is another problem besides leaking, but it can also lead to systems getting progressively slower until they drop below some critical performance threshold.

And disk drives _do_ fill up with log files unless you do something about it.

Back when my department used Vaxen, we'd reboot them every Friday night, fsck the disks, and do backups. Around the time we were running SVR2, the file system really was stable enough and the removable disk packs high enough quality that fsck seldom found anything and didn't need manual intervention, and the rebooting process was reliable enough that we could let a cron job run it, and while we could have cut back to monthly, people had gotten in the habit of knowing the machine would be down, so they could get a life, and it was a good schedule for the times we actually did want to do upgrades or hardware maintenance.

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

What m$ SHOULD have done by mangu · 2004-09-21 15:46 · Score: 1

There is nothing Microsoft could do to prevent this.

Huh, how so? How about fixing the shitty windoze API and making GetTickCount() return a 64 bit value?

Re:What m$ SHOULD have done by Keeper · 2004-09-21 16:43 · Score: 1

Then you break backwards compatibility with all of the existing applications that call that API. Compiled code doesn't tend to work nicely when it expects 4 bytes on the stack and gets 8 bytes instead.

If your code can't deal with a rollover condition, you should be calling GetSystemTime instead. The reason people call GetTickCount instead of GetSystemTime is generally laziness (it's harder to "sort" the system time structure than a dword).
Re:What m$ SHOULD have done by Shimbo · 2004-09-21 21:56 · Score: 1

Then you break backwards compatibility with all of the existing applications that call that API. Compiled code doesn't tend to work nicely when it expects 4 bytes on the stack and gets 8 bytes instead.

True; part of the problem though is that Microsoft doesn't seem to *design* APIs. They just publish the first interface that comes into their head, then when it turns out to be horribly broken, invent another one.
Re:What m$ SHOULD have done by Keeper · 2004-09-22 06:10 · Score: 1

Prove it instead of FUDing about it.

Which UNIX is that? by mangu · 2004-09-21 15:49 · Score: 2, Insightful

Unix systems are often installed with the instruction that they get reboots regularly.

In 25 years working with Unix systems, I've never seen that instruction. That must be because I've never worked with any Microsoft Unix system...

Re:Which UNIX is that? by DunbarTheInept · 2004-09-21 18:41 · Score: 1

These were large systems built on Unix. It wasn't the Unix that made the reboots necessary. It was the software on top of it. The whole package (unix plus application software) was sold as a single turnkey system.

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

maintenance task (yyeahhh, rrriiiiighht) by l3v1 · 2004-09-21 15:52 · Score: 1

[...]it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task[...]

Now I'm thrilled. So now it seems rebooting regularly just to avoid death of that Windows has "evolved" from a ridiculous flaw to a technician's maintenance task :)

That's really worth a smile, at least where I come from :)

And right, critical radio outage at the Federal Aviation Administration caused by some Windows version ? Naaah, can't happen in a Windows world, everybody would bet on human error in such a case, right ? :P

--
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.

Re:maintenance task (yyeahhh, rrriiiiighht) by timerider · 2004-09-21 19:14 · Score: 2, Insightful

it WAS a human error... i mean, it must have been some form of human life form who decided to use windows for those systems...

It's the OS's fault! by mangu · 2004-09-21 15:53 · Score: 1

An under-trained technician failed to reboot the server as scheduled.

Dude, if the OS requires a reboot, it doesn't matter how bad the application software is. A true Operating System should work flawlessly FOREVER. It's not impossible, because VMS does it, FreeBSD does it, Linux does it, so why cannot micro$oft windoze do it?

Re:It's the OS's fault! by AK+Marc · 2004-09-22 03:42 · Score: 1

A true Operating System should work flawlessly FOREVER.

Yes, and the server did not require a reboot because the OS failed. It required a reboot because the application failed to properly handle a variable.

I still don't see how an application programmer failing to handle a variable correctly is the fault of the OS. I've personally administered Windows servers with uptime of over 5 years, so if you have never seen it, the problem may be that you (and all your friends) are too stupid to use the simplest server OS there is.

--
Learn to love Alaska

Can you imagine? by dickens · 2004-09-21 16:00 · Score: 2, Interesting

Can you imagine knowing about this problem, putting it into production and not riding your MS rep like a pony until it was verified fixed ? ...with any other vendor.. sheesh.. but I guess it doesn't work that way with MS - even for the FAA.

Re:2K is based on NT kernel by omicronish · 2004-09-21 16:27 · Score: 2, Interesting

Just like the "Y2K glitch" was a platform independent problem based upon the 2-digit-year shorthand causing logical flaws, if you store time in a 32-bit variable by the millisecond... you'll hit the hard limit after about 49.7 days which is why that number can show up in kernels other than Win9x. If there's no proper handling of that rollover, things go haywire.

One interesting bit is that Quake 1 servers had problems running for more than 49.7 days for what I assume is precisely the same reason.

Interesting timing for me... by Oswald · 2004-09-21 16:32 · Score: 1

...since just this week I noticed the old Tandem servers sitting by the loading dock, shrink-wrapped and addressed to some recycling outfit in Chicago. "Hmmm," says I, "I wonder what they've replaced these with." And now I know. Dell/Microsoft.

The Tandems lived up to their hype, in terms of reliability. I never saw a VSCS failure in almost ten years of use--I barely remember how to use the backup system, VTABS. Maybe now I'll get some practice with it.

Unix to Windows an Upgrade? by CodeBuster · 2004-09-21 17:25 · Score: 2, Funny

Unix to Windows95? more like downgrade...big time

Fix the Joke by MyHair · 2004-09-21 17:26 · Score: 1

Ladies and Gentlemen, this is your captain speaking. We're cruising at 30,000 feet, you can see the Mississippi out of the left side of the plane, and...uh, what the hell does "STOP 0xc0000005 (0x00000029,0xc02fdec6,0x00000000,0x00000000)" mean?

MS EULA Says.... by Anonymous Coward · 2004-09-21 17:30 · Score: 1, Interesting

Hmmm, I seem to recall glancing at the MS EULA one time (when they were printed on the disk-envelope (that I never opened - it came that way - honest)), and the thing said in part that it wasn't to be used in life-safety operations, for running a nuke plant, air traffic control, or other real-time operations...

So ummm, unless MS suddenly created a hardened RTOS, why the fsck is this thing even running anywhere near ATC?

I say FIRE the morons who installed it, ordered it, designed it, and sold it... Finally, FINE the hell out of the asshole company that wrote it and allowed it to be sold for that use... I'd say $150/hr PER person inconvenienced by this debacle, PLUS whatever the airlines lost (or might have earned) PLUS a punitive sanction to make it fucking hurt bad enough that they'll realize that this can't ever occur again - I'd say $15 billion would do it...

Designing for 8 Nines is *fun* by billstewart · 2004-09-21 17:58 · Score: 1

Individual pieces of equipment seldom get above 4-5 nines of reliability (4 nines is about 1 hour of downtime a year; 5 is about 6 minutes.) That's fine - so you use duplicate equipment, with a third piece of equipment watching the two working pieces to be sure they're both running correctly, or N+2 pieces of critical equipment if failures are obvious to the operators, and you make it hot swappable, so that if one piece fails, you can replace it while the other one's running, and you do a lot of work to prevent common-mode failures and undetected failures.
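
If you want the numbers behind "nines", this is the back-of-the-envelope calculation. It's a small Python sketch with a hypothetical helper name, and I'm assuming a 365.25-day year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 4 nines -> 0.0001
    return unavailability * SECONDS_PER_YEAR


for n in (4, 5, 8):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds:.2f} s/year ({seconds / 60:.2f} min)")
```

Four nines comes out a bit under an hour a year, five nines a bit over five minutes, and eight nines is roughly a third of a second.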

Of course, the most important thing is to spend a lot of time carefully defining what events are or are not failures, because that can make a couple of nines difference in what you call the reliability numbers...

For instance, what about preventive maintenance? If Thing1 and Thing2 each have 99.99% reliability, and you take Thing1 down for an hour for maintenance, have you blown your 8 9's reliability for the year because there was a 0.0001 chance of Thing2 failing during that hour? If so, then you need triplicate equipment, not just duplicate. And remember that the 1970s backup equipment is running the whole system for 4 hours a night to keep it working and keep the operators trained - can you fire it up during your 1-hour preventive maintenance and not get dinged for failure risk?
What about partial failures? If you've got a box that's supposed to manage 100 radar lines, obviously an event that takes down all 100 is a "failure", so you have to duplicate enough to cut the probability of that event down below your spec. But what if just one radar line fails - do you need to make the line cards supporting _each_ line 10 nines reliable, so that the chance that not all lines are working is 8 nines, or is it good enough to make sure that each radar line is independently 8 nines reliable (including per-line and per-half-system and per-system failures)? Hint: If your management is too conservative about the decision they make, you need triplicate line card support, which is much much harder than duplicate. And this gets to be _really_ annoying if each radar line your processor supports is on a telco circuit that's only 3-4 nines reliable in itself, but you're running it on a 10 nines triplicate set of line cards....
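
The Thing1/Thing2 and 100-radar-line arithmetic can be sketched like this. It assumes failures are completely independent, which is exactly what common-mode failures violate, so treat the results as a best case. Python, with made-up function names.

```python
import math
from typing import Iterable


def all_fail_probability(unavailabilities: Iterable[float]) -> float:
    """Chance that every redundant unit is down at once (independent failures)."""
    result = 1.0
    for u in unavailabilities:
        result *= u
    return result


def any_fail_probability(unavailability: float, count: int) -> float:
    """Chance that at least one of `count` identical units is down."""
    # 1 - (1 - u)^count, computed with log1p/expm1 to keep precision for tiny u.
    return -math.expm1(count * math.log1p(-unavailability))


def nines(unavailability: float) -> float:
    """Convert an unavailability into a 'number of nines'."""
    return -math.log10(unavailability)


# Two 99.99% units in parallel: about 8 nines while both are in service.
print(round(nines(all_fail_probability([1e-4, 1e-4])), 2))

# Thing1 down for maintenance: you are back to Thing2's 4 nines for that hour.
print(round(nines(all_fail_probability([1e-4])), 2))

# 100 lines at 10 nines each: chance that not all are working is about 8 nines.
print(round(nines(any_fail_probability(1e-10, 100)), 2))
```

This is why the definition of "failure" matters so much: requiring all 100 lines up costs two nines per line compared with requiring each line individually.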

And yes, the FAA has always been on drugs. One of the drugs they're on is knowing that if there's an airplane crash and hundreds of dead bodies due to problems with air traffic control, they get infinite amounts of political heat, whereas if major hub airports don't have enough capacity because the ATC system is antiquated, well, that's only money, and usually somebody else's money at that, and if there are appalling delays and cost overruns, maybe it takes a bit longer to get promoted, but often you can _get_ more budget, because if two 747s full of school children crash over LAX a month before Election Day due to ATC glitches, nobody wants to be the Congresscritter who voted against fixing the ATC system. So the system's rigged against them, forcing them to be overconservative, and to _look_ extremely conservative, except that every once in a while the fragility and brokenness of the system catches up with them and forces them to do something in a hurry, especially if there's going to be an election where the top people get replaced for partisan political reasons, which gives them an opportunity to let the outgoing guy take any blame after he's gone. So just because they're on drugs doesn't mean that it doesn't suck to be them...

On the other hand, you really can get equipment that reliable if you're willing to pay for it, and component reliability has improved wonderfully since the 1980s, e.g. disk drive MTBFs of 500000-1M hours instead of 10,000 hours, so you really can wait until midnight slowdown to rebuild the RAID partition after you hot-swap the drive, and computers are a couple orders of magnitude faster so you need fewer of them to get a given job done fast enough, making it much easier to make subsystems reliable and monitor their status.

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

Re:Designing for 8 Nines is *fun* by Mateito · 2004-09-22 02:55 · Score: 1

Agree with all you say. HA system design is what I do for a living. But losing a line or a card isn't what we are worried about. The story was losing a node due to a reboot. Thus that's the level I'm addressing here.

22 seconds is something you'd find on an acid trip. I could build you a system that has all the redundancy in the world - My last job had to be redundant over cities in case an earthquake took out the infrastructure of one of them (I'm in Chile) - but you will not get a failover in 22 seconds.

Think about it. Something goes down and you lose a cluster node. Something somewhere needs to realize that there is a change in the system.

Assume we are sending keepalives every second. You need to miss 4 keep alives to be assured of a failure. That's 4 seconds.

Given that most contracts specify the penalty clauses on uptime to be calculated on a monthly or three monthly basis, that's your 0.9999999 blown already.

You want shorter keepalives? What 0.1 sec? When you have 40km between sites, that's not going to happen.

No. IMHO 7 nines is impossible for any "real world" service.
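
Putting numbers on the keepalive argument: a rough Python sketch with hypothetical helper names, assuming a 30-day month and that the whole detection window counts as downtime.

```python
def downtime_budget_seconds(nines: int, period_days: float) -> float:
    """Downtime allowed in a measurement period at the given number of nines."""
    return period_days * 86400 * 10.0 ** (-nines)


def detection_time_seconds(keepalive_interval_s: float, missed_needed: int) -> float:
    """Time to declare a node dead after `missed_needed` missed keepalives."""
    return keepalive_interval_s * missed_needed


monthly_budget = downtime_budget_seconds(7, 30)      # about 0.26 s
quarterly_budget = downtime_budget_seconds(7, 90)    # about 0.78 s
detection = detection_time_seconds(1.0, 4)           # 4.0 s

print(f"7 nines, 30 days: {monthly_budget:.4f} s allowed")
print(f"7 nines, 90 days: {quarterly_budget:.4f} s allowed")
print(f"failure detection alone: {detection:.1f} s")
print("budget blown" if detection > quarterly_budget else "within budget")
```

One detected failure with 1-second keepalives uses more than five times the three-month budget before any failover work has even started.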

--
Norman Cook's Ode to Sl

It means "Kiss Your Ass Goodbye" by billstewart · 2004-09-21 18:02 · Score: 1

As they say, if car technology evolved the way computer technology did, compared to the cars of the 1980s, you should expect that today you can get a Rolls Royce that goes 800 mph, gets 400 miles per gallon, holds 82 passengers, and explodes twice a week killing everybody inside.

So you thought you liked Fly-By-Wire airplanes?

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks

Re:flaw isn't in Windows by mpe · 2004-09-21 18:19 · Score: 1

The flaw isn't in Windows, it is in an application written by a high priced consulting company. It was discovered late in the evaluation process, and since it is easy to work around (by rebooting once per month), and fixing it would have delayed delivery, the software was accepted with the bug.

If the flaw was in the application then the fix would be to restart the application. Since the "fix" is to reboot the entire machine then it's self evident that the flaw is somewhere in the operating system.
If it would be possible to simply restart the application and the advice to reboot the computer is incorrect then the "high priced consulting company" isn't competent to write software in the first place.

Incorrect. by jwigum · 2004-09-21 18:39 · Score: 5, Insightful

Part of being on the ball in any tech department means having the system up to date. If you don't have it up to date, and an error FOR WHICH A PATCH EXISTS gives you trouble, everyone else in the company should rip your head off. That's inexcusable.

If you install an unpatched version of an OS, and leave it as such, it's your own dumb fault. If a patch is out that fixes the problem, then the problem doesn't exist as far as anyone with half a brain is concerned.

My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.

Patches, do ya have 'em?

--

Look behind you...

Re:Incorrect. by fallen1 · 2004-09-22 00:59 · Score: 2, Insightful

Quote: My apologies for the abrasive manner of the response, but patches are around for a reason: to fix known problems.
Well, yes, this may be true BUT Microsoft patches are _notorious_ for breaking as many, if not more, things than they fix. How long can a critical system such as this one stay down for "routine" maintenance? WHEN would the breaks introduced by the patches show up? In the middle of routing 20 or more airplanes in the airspace around LAX?
Although the specific bug had a patch, perhaps this was a case of "do we patch and pray OR do we reboot monthly?"
*shrug* Maybe the heads of the department overrode the IT personnel and instead of paying the money to patch and test they told them to just reboot the system? No, I didn't RTFA but who knows exactly what went down? The department heads are all in a CYA mode right now and the "truth" may never be known.

--
Dream as if you'll live forever.
Live as if you'll die tomorrow.
~Anonymous~
Re:Incorrect. by Equinox · 2004-09-22 01:49 · Score: 1

I didn't read much of anything other than your response, but what if that particular patch broke critical functionality? Either way, the poor guy's getting his head bitten off...
Re:Incorrect. by Paracelcus · 2004-09-22 03:24 · Score: 1

Hey cat, sometimes new patches break things, maybe you remember when NT4 SP6 first came out and Notes servers went down everywhere forcing M$ to roll out NT4 SP6a!

Maybe the crummy app that they were running would break if they installed the aforementioned patch.

Also, whoever thought they could use Windows where only Unix or OS/400 has a sufficiently established record of reliability has rocks in their head.

--
I killed da wabbit -Elmer Fudd
Re:Incorrect. by Nintendork · 2004-09-22 07:10 · Score: 1

The patch in question was released almost 5 and a half years ago and does not affect the NT family of operating systems. There is no excuse for not testing and rolling it out on an operating system (95/98/ME) that shouldn't be used for critical operations to begin with.
-Lucas

A Win95 issue presents itself in W2K? by Anonymous Coward · 2004-09-21 18:42 · Score: 1, Interesting

Sooooo. They were converting to Windows, eh? Do we really think they were installing Win95 anytime recently to force this bug unto themselves?

I would find it hard to believe that they were installing Win9x OR that Win2K+ was affected by this bug as I have found no current documentation pointing this bug to an installed W2K+ OS.

Blah, blah, blah.

NWA Video on demand by gngulrajani · 2004-09-21 18:56 · Score: 1

I was on a Northwest flight from det->fra and was trying to watch a movie using the in-flight video on demand system, when to my surprise the client rebooted, and what did I see but a li'l penguin in the corner! My guess is that the 'server' on the plane that serves the menus and movies gets overloaded when the flight staff activates the video system, so some timeout occurs and the client reboots. Unfortunately I was not able to see what kind of hardware was driving the LCD displays.

-best
-greg

Hypocrites? by coronaride · 2004-09-21 18:58 · Score: 3, Interesting

This is not addressed to the parent, but is for everyone who responded to the parent -

I'm throwing stones, now - especially after reading this incredibly long and geeky thread about shutting down your OS variants. God bless you for having multiple ways of shutting down/halting/suspending/restarting your computer in user/superuser/megauser/whosyourdaddyuser modes, but shame on you for being a stickler on MS's decision to place a Shutdown option on the "Start" menu when you can't even agree on how to shut your own damned computers down!

It's hypocritical, pharisaical, and parasitical (I like alliterations, even when they're not in context...makes me feel like Don King) to bring up such an argument as "Please press the Start button to shut down (stop) the computer". I'm not saying that "Start" is the most incredible choice for a button, but it makes sense. If you are shutting down your computer, you START THE SHUTDOWN PROCESS.

--
Those who can, do. Those who can't, go into business for themselves.

Re:Hypocrites? by Paradise+Pete · 2004-09-29 17:38 · Score: 1

If you are shutting down your computer, you START THE SHUTDOWN PROCESS.
I agree 100%. That's why on my car I changed it so that I get to the brakes through the ignition switch, because after all, I do want to START the braking process.

Re:Another mouse wiggler bites the dust.... [OT] by 808140 · 2004-09-21 19:44 · Score: 1

My comment was poking fun at people that assume that UNIX systems are the end all be all of uptime, because the OP's clear implication was that something requiring high uptime should be on a UNIX system, not a Windows system. VMS still beats the pants off UNIX in terms of uptime. It was a joke, you know. Laugh.

Still, regardless of where the bug was in this particular case, the fact remains that servers handling mission critical applications (ie, where people's lives are at stake) should not, under any condition, be running Windows. In this case, the problem was with the application, but just because Windows wasn't the issue this time doesn't mean we should all wait around until it is.

What you're saying is like, "I know there are two gaping security holes in this setup, but the hacker that just took our system down only used one -- therefore, I'm just going to patch that one and be on my merry way."

Personally, I'd rather not trust my life to a computer in general, but I'll be really plain and say that if I had to choose a mature UNIX system versus a Windows system, I'd pick the former any day of the week. And if I had the choice of VMS thrown in there, well, all the better. Things can still go wrong at the application level, but the chances of a BSOD turning the whole airport into a carnage of burning crashed planes are that much reduced. And that, my friend, is a good thing.

PS. Saying that Windows works as well on the server as UNIX or VMS is like saying that mentally challenged kids are as capable as normal ones because they too run the special olympics. Windows may have versions aimed at the server, but until systems that need to be up for a decade under high load have actually been up for a decade under high load, I'm not going to trust it. VMS and Solaris are proven server solutions that really do work. A stable NT that doesn't crash is vaporware, as much as Windows nuts wish it weren't. I'm not saying Windows can never be as stable as UNIX/VMS/MVS/whatever, but the simple fact is that today it is not and we're talking about deploying it on mission-critical servers today, not a decade from now when MS gets its act together.

Every coin has two sides by babybird · 2004-09-21 19:54 · Score: 2, Insightful

By that same logic, doesn't a Windows user "Start" the shutdown procedure?

And if you don't want to go to the "Start" button in Windows to shut it down, you could always hit ctrl-alt-del and click shutdown. Or press the power button if you have power management enabled in the BIOS. I don't really see a fundamental difference between the two, it's just semantics really.

When I first started using Linux, one of the things that baffled me for hours until I could ask someone who knew Linux was how the heck do you rename a file?? I searched and searched for anything resembling a rename command and found nothing. It never occurred to me that you might use the move command to rename a file by essentially just "moving" the file to a new filename. That's at least as illogical (to me and every newbie I've ever known) as clicking Start to Shutdown for someone who isn't familiar with the idiosyncrasies of a particular operating system.
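For what it's worth, the "rename is just a move" idea is not unique to the shell. A minimal Python sketch of the same thing, assuming throwaway file names (old_name.txt and new_name.txt are made up for illustration) inside a temporary directory:

```python
import os
import tempfile


def rename_demo():
    """Rename a file by 'moving' it to a new name in the same directory."""
    with tempfile.TemporaryDirectory() as workdir:
        old_path = os.path.join(workdir, "old_name.txt")  # hypothetical name
        new_path = os.path.join(workdir, "new_name.txt")  # hypothetical name

        with open(old_path, "w") as handle:
            handle.write("same contents\n")

        # One call covers both "rename" and "move", much like mv does.
        os.rename(old_path, new_path)

        with open(new_path) as handle:
            contents = handle.read()

        # The old name is gone, the new name exists, the data is unchanged.
        return os.path.exists(old_path), os.path.exists(new_path), contents
```

The file's contents never change; only the name it is reachable under does, which is why a single "move" operation can serve both purposes.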

--
Keith D.

Re:Every coin has two sides by Daytona955i · 2004-09-22 00:35 · Score: 1

Yes, I had this discussion with one of my coworkers when he was telling me how his mom was confused about why you had to click the start button in the shutdown process. I said it kind of makes sense because you are starting the shutdown process.

Now if there was a button that said Start and all you did was click it and it shut down, that would be stupid. But clicking Start->Shutdown, to me, indicates that you start the shutdown process. Same thing with init 6. You initialize runlevel 6, which in most cases causes the computer to shut down and reboot.

I think saying my naming convention is better than yours is just silly. If you really don't like the Start->shutdown method, just use ctrl-alt-del->shutdown instead. Just like if you don't like init 6, use 'shutdown -r now'.

Now in terms of customizing the shutdown process Linux is hands down the winner but in terms of method that the end user shuts down it's a wash.

Well done Harris! by BouffeMoiLaChatte · 2004-09-21 20:02 · Score: 2, Funny

Now you've become a thrustworthly company!

you ass by RMH101 · 2004-09-21 20:54 · Score: 2, Insightful

big projects don't work like this. if you find a bug mid testing, then you don't throw the whole thing back at the vendor and chuck the baby out with the bathwater; you simply cannot organise big projects like this. you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system. if you have accepted it on this basis and don't do what you said you would when you signed the constraint off, it's your problem. yes, the vendor shouldn't sell buggy software, but *all* software has *some* bugs in it.

Re:you ass by mpe · 2004-09-22 01:05 · Score: 1

big projects don't work like this. if you find a bug mid testing, then you don't throw the whole thing back at the vendor and chuck the baby out with the bathwater; you simply cannot organise big projects like this.

In many cases the "chucking the baby out with the bathwater" stage appears to come with the choice of using MS Windows in the first place.

you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system.

I wonder if "risk analysis" is actually something along the lines of "TCO". i.e. meaning something other than what it says.
Re:you ass by lachlan76 · 2004-09-22 01:50 · Score: 1

you simply cannot organise big projects like this. you do risk analysis and if it's decided you can accept it with a constraint that you, say, boot it occasionally then you may be able to accept the system

Most people would expect more from air-traffic control systems, though.
Re:you ass by RMH101 · 2004-09-22 07:22 · Score: 1

do you wonder? oh good. whilst you're doing that, i'll get on with doing WHAT I DO FOR A LIVING rather than bitch about stuff on slashdot, shall i?

Re:2K is based on NT kernel by crucini · 2004-09-21 21:00 · Score: 1

If that's the case - a purely userland decision to store a time value in an int32 - I still say that Microsoft and those who applied Microsoft to this situation are at fault. Why? Because in Unix we have gettimeofday(2) which stores its result in a struct timeval. In other words, we have a well-established way of storing millisecond-resolution timestamps, and a cultural expectation that timestamps will be relative to the Epoch, not to the start of the program.
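To put a number on why the storage type matters: struct timeval keeps seconds and microseconds as separate fields, so it never has to squeeze an Epoch-relative millisecond count into 32 bits. A rough Python sketch of the arithmetic (the helper names epoch_millis and fits_in_int32 are made up for illustration, not part of any API):

```python
import time

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def epoch_millis():
    """Milliseconds since the Unix Epoch (1970-01-01 UTC), as an integer."""
    return int(time.time() * 1000)


def fits_in_int32(value):
    """True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# A signed 32-bit millisecond count tops out after roughly 24.9 days,
# so an Epoch-relative millisecond timestamp cannot fit in one.
now_ms = epoch_millis()
needs_wider_type = not fits_in_int32(now_ms)
```

In other words, anyone who wants millisecond timestamps relative to the Epoch is pushed toward a wider type or a two-field structure, and the rollover question does not come up in the same way.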

It pays for Unix programmers to learn the API, because the API is well thought out, and is not constantly churning due to marketing pressure. Unix is a more stable, mature platform. This leads to more reliable apps.

MOD PARENT UP! by WillerZ · 2004-09-21 21:12 · Score: 1

Damn straight -- I for one don't want any patch installed on a system which can endanger my life unless it's been fully tested.

Phil

--
I guess today is a passable day to die.

Cry me a fucking river. by hfis · 2004-09-21 21:36 · Score: 1

Who seriously cares? Seriously?

If you have to resort to bickering about button captions on the shell to give shit to Microsoft, you have problems. Furthermore, this is in no way related to the article; why is it +2 insightful?

If you let the caption of a button get to you, you need to remove the tin-foil hat and seek help immediately.

"Upgrade from Unix to Windows" by jcuervo · 2004-09-21 22:35 · Score: 1

root@faa:~# /usr/games/atc
^Croot@faa:~# shutdown -h now "Installing Windows"

Broadcast message from root@faa ...

The system is going down for system halt NOW !

Installing Windows

C:\> /usr/games/atc
Command not found
C:\>

And planes start crashing...

By the way, that "Unix to Windows" link just sits there reloading. I'm assuming it's a cookie thing.

--
Assume I was drunk when I posted this.

Can't auto-boot in ATC by EmagGeek · 2004-09-21 22:46 · Score: 1

You have to get your traffic in a holding pattern and/or switch over to the redundant before rebooting a piece of critical ATC hardware. This cannot be done automagically because your Bravo space might be full of planes at the time, in which case a controller would not want his/her display to go away... I am sure the pilots wouldn't, either..

Re:Can't auto-boot in ATC by reverius · 2004-09-22 00:27 · Score: 1

well, my point was more that rebooting a piece of "critical ATC hardware" absolutely should not -happen-... hell, rebooting critical anything shouldn't happen.

does anyone else see how completely ridiculous it is that they were -okay- with using a system that had to be rebooted every 30 days?

that's the incompetence.
Re:Can't auto-boot in ATC by EmagGeek · 2004-09-22 01:13 · Score: 1

I completely agree with you on this point... rebooting is NOT a normal maintenance task.. hell, I don't even think HOME users should ever have to reboot their PCs.

Oh well.. someone will lose their job over this - probably the maintenance guy... and that'll be the end of it. They'll still run windows, and won't even question its viability in that role because they spent so much money on the new system. They'd rather keep a broken system than admit they made a mistake and wasted millions of bucks...

Are you sure? by fnj · 2004-09-21 23:06 · Score: 1

There is no such thing as a Windows 2000 49.7 day bug that causes an OS problem.

I thought so, too, but this persuaded me differently: RPCSS bug

(RPCSS being an integral part of the OS, and suddenly burning a huge amount of CPU cycles being a bug)

At least for server versions of NT and 2000, and my money is on the same thing happening in client versions if you run them long enough.

Old OS/2 Bug, Not Windows 95 by JohnThreePound · 2004-09-21 23:13 · Score: 2, Interesting

As I recall, since Windows 2000/NT was once the same product as IBM OS/2 (remember Microsoft OS/2, anybody?), this bug originated from the OS/2 side of the codebase.

IBM ran into the problem quicker, as OS/2 was adopted for various critical things like Automated Teller Machines (ATMs), while Windows NT was mostly used for simple file servers. As a result, the problem was fixed in OS/2 about 2 years before Microsoft got around to fixing the problem in Windows.

Considering that I remember this patch existing for Windows NT and 2000 back in 1999, it is disheartening that the FAA did not feel it necessary to upgrade to something as simple and critical as Service Pack 2 or 3.

Hi, I'm submitting articles to /. and I'm a moron. by banuaba · 2004-09-21 23:19 · Score: 1

"I don't think blame should be assigned to the technician who missed the task; rather, it seems a gross oversight for the FAA to guarantee that such a critical system will crash after only one missed maintenance task. Who's really at fault?"

If a single maintenance task (refueling) is missed on airplanes, they will crash.

Why is having to regularly work on extremely complicated systems anyone's fault? I'd lean towards blaming the idiot who didn't...you know...do his job.

--

Brant

Argle. Bargle.

Huh?!?! by mr_z_beeblebrox · 2004-09-21 23:22 · Score: 1

This happened after an upgrade from Unix to Windows.

How does one upgrade from Unix to Windows?

Re:flaw isn't in Windows by Anonymous Coward · 2004-09-21 23:35 · Score: 1, Informative

The only flaw is that the consulting company that wrote the software was incompetent. They used the GetTickCount API, which returns the number of milliseconds since the system was brought up as an unsigned 32-bit value. The documentation clearly states that this value will roll over to 0 and continue counting from there after 49.7 days. The documentation also mentions timers with higher resolution as well as better places to get system uptime as a 64-bit value.

The only reason rebooting Windows was necessary was because this tick value is tracked by the OS and not the application, so restarting the application would not prevent the software bug from causing problems. But the flaw is certainly in the application for using the wrong API for the job.
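For anyone wondering where 49.7 days comes from, it is just 2^32 milliseconds. A small Python simulation of a 32-bit millisecond counter follows; it does not call the real Windows API, and the function names are made up for illustration:

```python
WRAP = 2**32                      # an unsigned 32-bit counter holds 0 .. 2**32 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_rollover():
    """How long a 32-bit millisecond counter runs before wrapping to 0."""
    return WRAP / MS_PER_DAY


def tick_count(uptime_ms):
    """Simulated 32-bit tick counter for a given true uptime in milliseconds."""
    return uptime_ms % WRAP


def naive_elapsed(start_tick, now_tick):
    """Plain subtraction: goes negative if the counter wrapped in between."""
    return now_tick - start_tick


def wrap_safe_elapsed(start_tick, now_tick):
    """Modular subtraction: correct for any interval shorter than 2**32 ms."""
    return (now_tick - start_tick) % WRAP


# An interval that straddles the rollover: starts 1000 ms before the wrap,
# ends 500 ms after it, so the true elapsed time is 1500 ms.
start = tick_count(WRAP - 1000)
now = tick_count(WRAP + 500)
```

Here naive_elapsed(start, now) comes out hugely negative while wrap_safe_elapsed(start, now) gives 1500. Whether the FAA application tripped over exactly this kind of comparison is not something the thread establishes; the sketch only shows the general failure mode.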

I think the key is the vendor by Mordaximus · 2004-09-22 00:04 · Score: 1

Once upon a time, a certain vendor recommended monthly reboots of their server which collected call data from one of their products. This may have changed with newer releases. The server ran Solaris.

Of course that's not to say it was a Solaris problem. Point is, by UNIX systems parent might have meant systems with UNIX as an OS, but running other crappy code on it?

I think that the recommendation was more to cover their posteriors: if for some reason the software failed, and the customer didn't do the monthly reboots, whose fault is it?! Of course, our server ran problem free with over 2 years of uptime before a drive failure ruined that.

Re:I think the key is the vendor by DunbarTheInept · 2004-09-23 07:59 · Score: 1

parent might have meant systems with UNIX as an OS, but running other crappy code on it?

Yes. It was a turnkey system (That's where you buy the hardware box, OS, and the application on it as a single unit.)

--
Don't label something "offtopic" unless you know the topic well enough to tell what's on topic.

Re:Fire the Department of the Interior's IT staff. by spagiola · 2004-09-22 01:13 · Score: 1

The FAA is under the auspices of the US Department of the Interior, aren't they?

No. The FAA is part of the Department of Transportation.

Re:Hi, I'm submitting articles to /. and I'm a mor by DevCybiko · 2004-09-22 01:16 · Score: 1

i guess you *are* a moron. considering that refueling a plane is a multistep process with multiple points of failure (the refueler must forget, the pilot must ignore the fuel indicator, etc...) whereas rebooting the server has ... how many points of failure? apparently just one. the original author is correct to call out the question. in systems i've designed where one person is responsible for some event in a complicated workflow, the system sends an email out when the event draws near. it sends another as it is missed. it alerts the person's manager if it is late. if the FAA system has no such alerts, then the reboot event was poorly architected. if the FAA system DOES have such alerts, then the entire organization has a problem. it is never the fault of one individual when a system fails.
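a rough sketch of that escalation rule in Python, for the curious. the function name, the 3-day warning window and the 1-day grace period are all assumptions picked for illustration; a real system would send mail instead of returning strings.

```python
from datetime import datetime, timedelta


def reminders_due(due, now,
                  warn_window=timedelta(days=3),   # assumed warning lead time
                  grace=timedelta(days=1)):        # assumed grace before escalating
    """Return the notifications that should go out for one maintenance task."""
    if now < due - warn_window:
        return []                                  # too early to nag anyone
    if now < due:
        return ["technician: task due soon"]       # the event draws near
    if now < due + grace:
        return ["technician: task missed"]         # the event was missed
    # Still not done after the grace period: bring in the manager as well.
    return ["technician: task missed", "manager: task overdue"]


# Example: a reboot that was due on 2004-09-14.
reboot_due = datetime(2004, 9, 14)
```

the point of the design is that a missed task produces noise at more than one level, so a single person forgetting is not enough to let it slide silently.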

By a literal reading of the Patriot Act... by gillbates · 2004-09-22 01:18 · Score: 1

This is a terrorist offense. Yes, the vendor execs could be dragged into court and sentenced to death for this.

--
The society for a thought-free internet welcomes you.

Major Clarifications by jedman · 2004-09-22 01:37 · Score: 1

The air traffic CONTROL system was NOT affected. The controllers could watch the planes "near miss" on their radar scopes. The *radio* / communications system was hosed; controllers could not contact the planes by voice.

2^32.... reboots, schmeboots.... the system was down for *5 HOURS*. Not even a crapped-all-over-itself Windoze 95 box takes that long to come back up and load its app. This is obviously a SERIOUSLY FLAWED system that could not restore itself, and all backups failed (for a time) too.

"Delta, go around..." -- (message to Flight 191 as it was crashing in a thunderstorm, Dallas, 1985)

"Promotion" by dachshund · 2004-09-22 01:46 · Score: 1

The fault is more with the people who chose to use MS Windows in this way. Microsoft's blame is more at the level of promoting their products as something they are not as well as encouraging a culture of "everything Microsoft".

Unfortunately, this "promotion" doesn't always take the form of innocuous sales calls-- it includes significant political lobbying, with donations, gifts, dinners, etc. To put it more bluntly, Microsoft is paying politicians to select Windows for applications where MS knows it isn't really appropriate.

While the politicians are certainly to blame for being corrupt, it's not like Microsoft can avoid responsibility for their role in the decision-making process. If I suggest to a government official that something might be a good idea, I can reasonably avoid some of the responsibility when it doesn't pan out. When I bribe that official to do it, I'm taking a much more active part in that decision, and thus I deserve every bit as much blame for the end result.

MS caused ALL problems, didn't you know? by Blitzenn · 2004-09-22 01:53 · Score: 1

My neighbor tried to use an electric handmixer to make gravy over a hot stove. The cord caught on fire and when he tried to put it out he got electrocuted. He read the recipe on the internet using Internet Explorer Browser. If it wasn't for MS, he would be alive today.

See you can tag everything back to MS if you fish hard enough. Next it will be earthquakes and hurricanes we tag blame on MS for.

Doesn't anyone see this as a bit silly to blame MS for an obvious blunder on not only their IT dept. but the morons in charge of maintenance and engineering? If you drive your car through the back of your garage, is it the car manufacturer's fault? How stupid would you look for trying to place the blame back that far?

Re:Another mouse wiggler bites the dust.... [OT] by SuiteSisterMary · 2004-09-22 01:56 · Score: 1

I don't quite share your fanatical hatred of Windows; when used properly, it's quite capable of handling whatever you throw at it.

VMS and Solaris are proven, aye; I do, however, have fond memories of reading, fifteen years ago, the exact same things about Solaris that people now say about Windows; rampant holes, services with root access open to the Internet (lpr, rpc, sendmail, and so on), accusations of terrible bloat (xclock takes HOW MUCH RAM?) and slowness ('Slowaris' indeed) and so on.

It also never fails to amaze me that the zealots tout the security of an OS that can be defined as 'the results of taking a secure OS and ripping out most of the security functionality.' UNIX, after all, is a play on its MULTICS parent, as it's a castrated version thereof.

Still, as far as I'm concerned, if you want reliability and no downtime, you use a mainframe. Period. We're talking about a system that you can rip running processors out of, and the damn thing won't even blink.

--
Vintage computer games and RPG books available. Email me if you're interested.

I still disagree, Re:flaw isn't in Windows by burnin1965 · 2004-09-22 02:56 · Score: 1

So would this be the same "wrong API for the job" that Microsoft's developers are using to code Windows services?

Print Spooler Stops Scheduling Print Jobs
http://support.microsoft.com/default.aspx?scid=kb;EN-US;318152

I agree the developers should not have used this tick counter. And when they discovered there was a problem it should have been fixed immediately as the code change would not be that significant if it was only a matter of the tick counter rolling over.

But from what I've seen first hand and heard from others I still believe that Windows is not up to the task. And rather than it being the wrong API for the task, it appears to me it's the entire system (Operating System, API, Developers, Vendors, etc.) that is wrong for the job.

burnin

DOH! by burnin1965 · 2004-09-22 02:58 · Score: 1

You are correct, I did mean the FAA. My bad.

WHAT?!?! Who in their right mind... by seanbo · 2004-09-22 03:10 · Score: 1

...would call a migration from Unix to Windows 2K an upgrade?! "...upgrade from Unix to Windows."

JPL had a working system for the FAA around 1985 by pdxChris · 2004-09-22 05:11 · Score: 2, Informative

In the mid 1980's, I knew a software engineer at Caltech's Jet Propulsion Laboratory who worked on a multi-year JPL project for the FAA. The project was to replace the obsolete voice communication system for air traffic controllers. The new system had touch screens with onscreen menus and buttons were dynamically reconfigured depending on the controller's workload. It worked correctly, and the engineer enjoyed describing to me how it worked. This was all before there was any version of Windows. If I recall correctly, they developed on MODCOMP minicomputers running VMS but deployed on an embedded system with an in-house design for task switching, not a complete OS. I might be fuzzy about the technical details at this time, but a FOIA request should be able to retrieve them for the intensely curious.

I do clearly remember that the working system was presented to the FAA in Monterey, and the FAA then terminated the contract and hired IBM to start over from scratch on a new system. Rumor was that this was a political payback. I should emphasize that's just a rumor I heard. Looks like Harris eventually got the contract. I wonder if any of the original code from JPL was ever deployed.

Re:Now even the submitters aren't reading the arti by Kehvarl · 2004-09-22 05:31 · Score: 1

That certainly sounds fair.

I think you've confirmed my suspicions... by WebCowboy · 2004-09-22 06:07 · Score: 1

Regarding some of the engineers at Harris.

As a matter of fact, I DO work as an engineer for a large, multinational company--and our projects do in fact involve mission critical systems. You are right--engineers do not always get what they want and it does often mean dealing with politically/non-technically made design choices like using Windows when we'd prefer not to. However, there is a limit--a time and place where commodity/consumer grade hardware and software is appropriate--and it's NOT at a level at which a crash will bring down an entire system. I do not have to know how the software works to make that observation--it has been shown that a Windows box failed and the result was a major system disruption and hours of chaos. It's not the fact that they used Windows that is disturbing--it's the fact that they used it in a mission critical situation...without adequate testing to boot. And yes, I do have a clue as to how complex the system is and the intricacies of how it works--our company's products run systems in oil refineries, factories and power generating stations. In a similar situation and project we would handle things differently:

1 If program managers were indeed making critical decisions, they would HAVE to be registered Professional Engineers by law, just like the lead developers.

2 Lead developers are explicitly instructed NOT to simply do as they're told. If they see a serious flaw in a design decision they are obligated to make their views known. Of course, you can't counter one political decision with another--you must have a solid case. If your boss refuses you go to his boss. If you are stonewalled right to the top and you think the issue is really important you can bring the issue to the professional association. The final course is to perform the work and refuse to sign off on it (make the boss do it). That way, if the result is failure, you are in the clear and your higher-ups take all the heat and not just some--it's "due diligence" (ass covering, really).

3 During development and testing, we identify any potential single points of failure, bottlenecks and known issues. In my situation, Windows-based systems are ALWAYS considered "unreliable" (that is, not to be relied upon for critical or safety related systems), therefore we prescribe redundancy. Our test plans always call for us to do controlled AND uncontrolled (pull the plug) shutdowns of each machine in sequence (to test failover) and simultaneously (to determine how the PLCs and other embedded systems, plus electromechanical systems, handle catastrophic failure).

4 If hardware cannot be supported for at least ten years (and in some cases up to 25 years) we MUST design such that there will be a drop-in replacement that will cause minimal disruption (for example an old VAX VMS server could be upgraded to a current Alpha VMS, or an old PLC can be replaced with a next generation one that will execute the same routines rung-for-rung).

5 It is typical to keep the previous, pre-upgrade equipment around as a standby system, ready to put back in service, until the new system has worked as-advertised WITHOUT INTERVENTION for at least a year. A crash or other fault would reset the 1-year clock and we'd be doing a thorough root-cause analysis.

It sounds like there is a lack of professionalism within your group of engineers. I'm not sure about how things are done where you live, but "just following orders" is not an excuse for poor engineering--a failure of that nature where I am would result in being temporarily barred from practicing engineering. Sometimes it can be tough to go against the PHB--I've heard of engineers being fired for refusing to sign off on designs, but I'd rather be fired and be able to work as an engineer elsewhere than have my ability to work as an engineer revoked entirely.

I guess I would have to ask the FAA as to why they made the decision to migrate a working critical system to Windows--a radically different architecture from UNIX. My employer builds

Re:I think you've confirmed my suspicions... by Explet1ve! · 2004-09-27 14:24 · Score: 1

Agreed on all points. Linux and Windows should be nowhere near the server side of any truly critical systems. That's where mainframes, midframes, or UNIX systems with a 10+ year proven track record of reliability belong. And if Linux and Windows ARE used for some masochistic reason, the whole system better be redundant as hell.
We used to say that we don't have to worry about Windows' lack of reliability because no one would ever be stupid enough to run nuclear power plants, electric grids, AIR TRAFFIC CONTROL SYSTEMS, etc., on Windows. I guess we have to revisit our assumptions.

upgrade from Unix to windows by sglines · 2004-09-22 07:17 · Score: 1

Management at work. Upgrade indeed.

There is absolutely no excuse. by syukton · 2004-09-22 07:30 · Score: 1

C:\>shutdown
Usage: shutdown [-i | -l | -s | -r | -a] [-f] [-m \\computername] [-t xx] [-c "comment"] [-d up:xx:yy]

        No args                 Display this message (same as -?)
        -i                      Display GUI interface, must be the first option
        -l                      Log off (cannot be used with -m option)
        -s                      Shutdown the computer
        -r                      Shutdown and restart the computer
        -a                      Abort a system shutdown
        -m \\computername       Remote computer to shutdown/restart/abort
        -t xx                   Set timeout for shutdown to xx seconds
        -c "comment"            Shutdown comment (maximum of 127 characters)
        -f                      Forces running applications to close without warning
        -d [u][p]:xx:yy         The reason code for the shutdown
                                u is the user code
                                p is a planned shutdown code
                                xx is the major reason code (positive integer less than 256)
                                yy is the minor reason code (positive integer less than 65536)

The above command, shutdown, is present in Windows XP and in Windows 2000 with the Resource Kit installed. Windows has supported Task Scheduling for quite a number of years now. If the technician's procedure is solely to shut down the computer via the start menu without shutting down any extraneous applications pre-shutdown, then this is all he needs to do to restart instantly with the same effect, from the command line:

shutdown -t 0 -r

This says to reboot the computer and wait 0 seconds before doing so. Stick a -f in there to force a shutdown if you've got ornery apps. Piece of fucking cake, people. Shit like this makes me wonder why I'm still unemployed; I obviously have some skills that would be appreciated by the FAA. Just put the command in a task scheduler entry, set it to recur every 2 weeks, and you're golden. I mean, seriously, what the fuck?

I use task scheduler to make backups of my current Opera session and to run periodic defrags and clean temporary folders and so forth. The system provides a way to maintain itself at scheduled intervals, why rely upon a technical lackey who can (and obviously did) screw up?

Tangentially, when the blaster worm came out and was giving everyone the NT Authority you-must-shutdown-now message, I discovered that a quick shutdown -a would abort the shutdown process and allow you to continue working with the (albeit unstable) system, to install a patch or the like.

--
Reinvent the wheel only at either a lower cost, greater effectiveness, or your own personal enrichment and satisfaction.

Here's the Real Story by Nintendork · 2004-09-22 08:16 · Score: 1

There was an issue with the FAA software running on Windows 2000 Server. They produced a procedure to work around the issue instead of fixing it. A technician messed up the procedure and all hell broke loose. Some of the writers got confused and declared that it's Windows' fault.

I did a search for the 49.7 days in Microsoft's knowledge base and found one possibly related bug, the non-related bug referenced by the article submitter, and some other non-related bugs. The one thing they all have in common is an improperly used GetTickCount function in the code.

First, there's the five and a half year old patch fixing an issue in Windows 95/98. There's no reason this should have been mentioned anywhere in reference to this incident. Shame on the poster and all the people backing this theory. It's pure reverse FUD because there's nothing indicating that this bug was related and everything shows that this only affects 9x. Personally, I'm positive that this problem isn't in 2000 because I supported 2000 for Microsoft when it was released and never heard of this happening. Also, Microsoft is good about testing all of its products to see which are affected. If this type of screw-up were common, the articles would be common on Slashdot since the typical reader lusts after examples of MS screw-ups. There's also the fact that there's a LOT of Windows 2000 boxes with uptimes way past a month and a half.

But then there's the CPU utilization rpcss.exe bug. If this is what was happening, then it's partially Microsoft's fault for not having enough QC testing targeted towards idiot programming mistakes. Nobody tested enough to see what happens under different scenarios when GetTickCount is improperly used. Also, the hotfix from Microsoft is only a few months old, probably not enough time to test and deploy. On the other hand, GetTickCount is designed to only work for 49.7 days and shouldn't have been used for this application. I'd assume that they didn't know what was going on when shit hit the fan after a month and a half of running relatively smoothly and only after the MS patch was released did they review their code and see that they were improperly using the function. Still though, any company that has an internally written or contracted program with this serious of a bug should have invested the resources required to find the problem and fix it. They should have known that the problem was related to software installed on the server, most likely their proprietary FAA program because if every Windows 2000 computer running on a Dell had this problem, Microsoft would have released a patch long ago. Heck, they should have found that they were using the function improperly. If the programmers knew how long it ran for before dying (49.7 days), they should have realized that it's related to the GetTickCount function and could have narrowed in their efforts to wherever the function was used.

If the problem was not related to the rpcss.exe bug, then I don't see how MS is to blame. The blame lies solely with the programmers of the FAA software for improperly using the GetTickCount function.

In conclusion, with either of these scenarios, I'd be replacing some of my programmers if I were the manager in charge of the project that wrote the FAA software.

-Lucas

Re:Retard by argel · 2004-09-22 08:19 · Score: 1

Both look like they were found, or at least entered into the KB, after the release of Windows 2000 Service Pack 4 (Nov. 2003), and hotfixes are available for both.

The Rpcss.exe bug appears to be fixed in W2K SP1 since it only applies to Windows 2000 Server (i.e. no service packs).

It looks like the print spooler bug was introduced in W2K SP1 and wasn't discovered or fixed until after SP3 (since only W2K SP1-SP3 are listed).

Considering how long SP1 has been out, not to mention SP4, I don't see this as a Microsoft problem (assuming it really is an OS issue). -- Argel

--

-- Argel

mod parent up! by Mr+44 · 2004-09-22 09:10 · Score: 1

This is about the only comment by someone with a clue in this whole thread

What are you smoking? by jotaeleemeese · 2004-09-22 12:42 · Score: 1

Does that herb facilitate time travel?

Because the last time I had to schedule reboots for a machine of mine was around 10 years ago.

Oh yes, last time I used Windows, my bad.

I have administered Solaris, Linux, HP-UX, Irix and a few others, and frankly the one that either should go to get a job in the real world or stop talking hallucinations is you.

--
IANAL but write like a drunk one.

Re:What are you smoking? by Awptimus+Prime · 2004-09-24 12:35 · Score: 1

Because the last time I had to schedule reboots for a machine of mine was around 10 years ago.

So you have not upgraded a kernel in 10 years?

Oh yes, last time I used Windows, my bad.

No Windows experience in 10 years? Good for you. I am glad to see someone so well-balanced in their experience comment on this topic. Personally, I don't go bragging on forums about gaping holes in my resume.

Re:Fire the Department of the Interior's IT staff. by Dr.Dubious+DDQ · 2004-09-22 17:23 · Score: 1

Well, I stand corrected - for some reason I thought the Federal Government would have been putting DOT in with all of the OTHER "interior" federal issues. Stupid me...

Considering the bizarre web of overlapping police agencies they've got, I should have known better...

Kinda worries me more to see more than one department having that kind of problem...

--
Hacker Public Radio is our Friend

Re:Seen this week at various airports by squarefish · 2004-09-24 02:13 · Score: 1

it would have been cool if you had gotten a picture of this, but then they would probably have wanted to arrest you for 'terrorist-like' activities.

I can't wait till it doesn't feel like a police state anymore.

--
Creationists are a lot like zombies. Slow, but powerful and numerous. And they all want to eat our brains.

Re:Another mouse wiggler bites the dust.... [OT] by jack_csk · 2004-09-24 03:01 · Score: 1

I used Solaris 2 on UltraSparc before, and it was the system that never froze or crashed on me (besides those GNOME/KDE apps and XFree86 itself). Even the most stable Linux that I saw cannot compete with it.

Re:Hi, I'm submitting articles to /. and I'm a mor by fr0dicus · 2004-09-26 04:31 · Score: 1

Our Windows servers have reboot schedules and these are monitored via our enterprise management tools to ensure that the uptime is not too high. Not drastically different to checking the fuel gauge really. A bit obvious, I thought.
