Stupid Data Center Tricks
jcatcw writes "A university network is brought down when two network cables are plugged into the wrong hub. An employee is injured after an ill-timed entry into a data center. Overheated systems are shut down by a thermostat setting changed from Fahrenheit to Celsius. And, of course, Big Red Buttons. These are just a few of the data center disasters caused by human folly."
The summary reads like a digg post, and has two different links that, in actuality, link to the exact same thing.
This needs some fixin'.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
So this is why Comcast has been stonewalling me with their excuses.
Can this really happen easily? I thought for really ugly things to happen, you need to have switches (without working STP, that is).
Where I work a couple years ago one of the non-technical people decided to plug a router into itself. Ended up bringing down the whole network for ~25 people in a company which depended on the Internet (Internet marketing company).
Unfortunately one of the tech guys figured it out literally as everyone was standing by the elevator waiting for it to take us home. We were that close to freedom :(
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
Our entire network was brought down a few years ago when a student plugged a consumer router into his dorm room's port. Said router provided DHCP, and having two conflicting DHCP servers on the network terminally confused everything that didn't use static IPs.
Took our networking guys hours to trace that one down.
Hail Eris, full of mischief...
E pluribus sanguinem
In the summer of 2000 I worked at Quad/Graphics (printer, at least at that time, of Time, Newsweek, Playboy, and several other big-name publications). I was on a team of interns inventorying the company's computer equipment -- scanning bar coded equipment, and giving bar codes to those odds and ends that managed to slip through the cracks in the previous years. (It's amazing what grew legs and walked from one plant to another 40 miles away without being noticed.)
One of my co-workers got curious about the unlabeled big red button in the server room. Because he lied about hitting it, the servers were down for a day and a half while a team tried to find out what wiring or environmental monitor fault caused the shutdown. That little stunt cost my co-worker his job and cost the company several million dollars in productivity. It slowed or stopped work at three plants in Wisconsin, one in New York, and one in Georgia.
The real pisser was the guilty party lying about it, thereby starting the wild goose chase. If he had been honest, or even claimed it was an accident, the servers would have all been up within the hour, and at most plants little or no productivity would have been lost.
The reality: a 20 year old's shame cost a company millions.
The Etherkiller
Sure, technology causes its share of headaches, but human error accounts for roughly 70% of all data-center problems.
And 70% of all statistics are made up on the spot.
It's very disturbing and you'll see why these things happen.
RIP America
July 4, 1776 - September 11, 2001
Our University was brought to its knees when a student in the residence halls was putzing around and accidentally installed a DHCP server on his box. Because the effects were unknown to the student that installed the DHCP server, it took about a day before they knew what was going on and disabled his switchport on the network.
Someone plugged a home router into the network at the government office where I was doing consulting. (He wanted a switch to plug in a networked printer.)
The router started giving 192.168.x.x IPs to everyone on the floor, soon including a few servers (including the Lotus Notes one).
Took 3 days for the admins to find out the source of the problem and where the router was... abysmal loss of productivity. Needless to say, I gave them a good speech on not routing 192.168 packets on the network and isolating their networks.
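For what it's worth, the rogue-router hunts in these stories boil down to one check: is any DHCP offer coming from a server you don't recognise, or handing out addresses outside your own range? Here is a minimal sketch of that check in Python. It assumes you already have a list of (server, offered address) pairs from a packet capture; the function name and every address in the usage note are made up for illustration, and nothing here talks to a real network.

```python
import ipaddress


def find_rogue_dhcp_offers(offers, authorized_servers, expected_network):
    """Return the offers that look like they came from a rogue DHCP server.

    offers: iterable of (server_ip, offered_ip) string pairs, e.g. parsed
            from a packet capture (how you obtain them is up to you).
    authorized_servers: IPs of the DHCP servers you actually run.
    expected_network: the subnet your real servers hand out, in CIDR form.
    """
    net = ipaddress.ip_network(expected_network)
    allowed = {ipaddress.ip_address(s) for s in authorized_servers}

    rogue = []
    for server_ip, offered_ip in offers:
        server = ipaddress.ip_address(server_ip)
        offered = ipaddress.ip_address(offered_ip)
        # Flag an offer if the server is unknown OR the lease is off-subnet.
        if server not in allowed or offered not in net:
            rogue.append((server_ip, offered_ip))
    return rogue
```

With a hypothetical campus range of 10.1.0.0/16 and one legitimate server at 10.1.0.2, an offer of 192.168.1.100 from 192.168.1.1 is flagged, while a normal lease from 10.1.0.2 is not. Finding which switch port the offender is on is a separate job.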
Way back in the day at the B.U. computer center, the machine room had an extensive Halon fire system with nozzles under the raised flooring and on the ceiling. Pretty big room that housed an IBM mainframe, about a half dozen tape drives, maybe 50 refrigerator-sized disk drives, racks and racks of magnetic tape, a laser printer the size of a small car, networking hardware, etc. etc. One day, the maintenance people were walking through and their two-way radios set off the secondary fire alarm. At that point, you had about 10 seconds to escape. Watching the security camera video afterward was highly entertaining. One moment you saw the operator standing in front of the consoles and the next you saw him bolting out of the double doors.
Thank you... you've single-handedly made spending my time on recycled, old digg news completely and totally worth it.
When he arrived, most of the staff had gone home and the skeleton IT staff didn't want to hang around. So, they sent him away on the basis that his work wasn't "scheduled".
Everybody came back on Monday to find totally fried servers.
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
I am fortunate in working in an organisation with perhaps the best and most competent ops manager I have ever worked with, but even with well-written procedures and well-trained ops staff, errors still occur — but very rarely.
How can this leave out the standard cascade failure scenario?
Trying to achieve redundancy, someone gets what they think is worst-case 30A of servers with multiple power supplies, then plugs one power supply from each server into one PDU rated at 30A and the other power supply into a second PDU.
They may or may not know that the derated capacity of the circuit is only 24A. The data center is unlikely to warn them, as they only appear to be using 15A per circuit at most.
Anyway, something happens to one of the PDUs and the power is lost from it. Perhaps power factor corrections (remember the derating?) and cron jobs running at midnight on all the servers that raise the load high simultaneously. Maybe just the failure of one of the PDUs that was feared, causing the attempt at "redundancy".
In any case, all of the load is then put on the remaining circuit, and it always fails. The whole rack loses power.
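The arithmetic behind that cascade is easy to check ahead of time. A rough sketch in Python, using only the figures from the scenario above (30A breakers, 24A usable after derating, 15A per circuit in normal operation); the function name and the 0.8 default are illustrative assumptions, so substitute whatever your facility actually specifies.

```python
def failover_check(load_a_amps, load_b_amps, breaker_rating_amps, derating=0.8):
    """Check whether a dual-fed rack survives the loss of one circuit.

    derating=0.8 is an assumption matching the 24A-of-30A figure above.
    """
    usable = breaker_rating_amps * derating
    # If either feed dies, the surviving circuit carries the whole load.
    combined = load_a_amps + load_b_amps
    return {
        "usable_amps": usable,
        "combined_amps": combined,
        "normal_ok": load_a_amps <= usable and load_b_amps <= usable,
        "survives_single_failure": combined <= usable,
    }


result = failover_check(15, 15, 30)
print(result["normal_ok"], result["survives_single_failure"])
```

For the 15A + 15A case, each circuit looks comfortable on its own, but the combined 30A exceeds the 24A usable capacity, so the "redundant" setup does not survive a single failure. To be safe, the total rack load has to fit within one derated circuit.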
So I'm working in this company's datacenter on their networking equipment. But it's installed in such a crappy way that there's a floor tile pulled right next to the rack and the cables are run down into that hole. I'm working around on the equipment and step down into the hole by accident. At that point I notice that it's suddenly a lot quieter where I'm standing; I look down and realize I'd just stepped on the power button of a power strip that most of the networking equipment was plugged into. Oh Sh!t. At the time the room was empty except for me, so I quickly turn the strip back on. About the time the switches are just finishing coming back up, one of the company's IT guys comes in and asks if anything's going on. I look at him a little confused and say "I'm not sure, what's up?". The network's back up by the time they noticed it.... I probably should have admitted it, but no harm, no foul. :)
Those data centers in the article sound huge, some may even have up to ten servers!
You've got to admit, although the results were disastrous, someone will remember this and have a good laugh over it. I am now.
I can definitely relate to that one. I've never had one that didn't try to deviate from plan to increase their profit on the job. I've even seen them put breakers in a panel that weren't connected to anything to make it appear as if they ran the circuit, when all they did was piggyback a circuit on another one to save the cost of running the wire. By the time you find the problem, they're long gone.
Gotta watch them like a hawk and make sure they do everything they're supposed to do.
The old tape machines (six foot tall) used to put out a tremendous amount of heat. Space is at a premium, so in the mainframe room the drives were normally put edge to edge, with one pushing air in and the other pulling air out. The machines had two 10-12" fans per unit, so stacking two or three units was fine. One site had so many machines side to side (over 7), the air coming out the last machine regularly set things on FIRE. It was not uncommon for the machine to ignite lint going through the stack, with it coming out the end as a small explosion like dust in a grain silo explosion. A fire extinguisher was kept on hand, and the wall eventually got a stainless steel panel because it was so common.
I've seen a network brought down when a student (or employee) plugged their toy windows 2000 server into the campus network. Said "server" was configured as a domain controller (or whatever they called it before active directory, it's been a while). Toss in DHCP and their box got DOS'd as the entire campus tried using them for authentication.
Good times. Can you even do that kind of thing these days?
When I was IT manager for a big retail mfg we had a cross-country move from the SF bay area to TN (closer to shipping hubs and lower tax rates). I was hired for the new plant, and I was there setting up everything (I did not know the company knew next to nothing about technology). The last thing shipped before the company shut down for the move was the data server, via 2 day FedEx. The CFO packed it up and shipped it out; as the driver pulled away from the bay, the server fell off the bumper and onto the cement. They picked it up (looking undamaged in its box). When I opened it there was a shower of parts. A hard drive had detached from the case but not the cable and had swung around in that case like a flail. CFO had NOT INSURED the shipment or taken anything apart. That and much more to save $50 here and there.
6.8SPC TR of 550, l xwind at 6, drift rt at 26" drops 77". AT has 503 ft-lbs at 1403 fps. FT 0.86
Deathly silence after someone does press the button should be adequate punishment.
Naturally, potential super-criminals, James Bond villains and right-leaning survivalist nationalist employees should have the button's real purpose explained to them, to avoid accidents caused by someone deciding to rid the world of communism during their lunch break.
Against stupidity the gods themselves contend in vain
I worked in a datacenter that was two blocks from the harbor. The datacenter is on the second floor, but what the hell do you do if you're in the building and there is a flood, or if you're at home and have to get to the DC? It reminds me of New Orleans, but that didn't stop them from building it.
http://packetnexus.com
Many years ago I worked at a mainframe installation (IBM S/360 to give you an idea of my age ;-). The computer was installed at the back of a huge room with plenty of space for expansion. For some incomprehensible reason BRBs (Big Red Buttons) were placed along the skirting board every ten feet or so, which had hitherto not been a problem -- with all the space nobody came near during daily (and nightly) operations.
Every morning at around two AM a guy came with a load of cassettes containing cheques from the banks for clearing. He usually just opened the door to the room and shoved each cassette in to slide, like curling stones, across the floor to the cheque sorter.
And one morning, well... a cassette decided to slide all the way across the room and unerringly triggered one of the BRBs square on. Half a night's work to be redone.
Back when I worked for Boeing, we had an "interesting" condition in our major Seattle area data center (the one built right on top of a major earthquake fault line). It seems that the contractors who had built the power system had cut a few corners and used a couple of incorrect bolts on lugs in some switchgear. The result of this was that, over time, poor connections could lead to high temperatures and electrical fires. So, plans were made to do maintenance work on the panels.
Initially, it was believed that the system, a dually redundant utility feed with diesel gen sets, UPS supplies and redundant circuits feeding each rack could be shut down in sections. So the repairs could be done on one part at a time, keeping critical systems running on the alternate circuits. No such luck. It seems that bolts were not the only thing contractors skimped upon. We had half of a dual power system. We had to shut down the entire server center (and the company) over an extended weekend*.
*Antics ensued here as well. The IT folks took months putting together a shut down/power up plan which considered numerous dependencies between systems. Everything had a scheduled time and everyone was supposed to check in with coordinators before touching anything. But on the shutdown day, the DNS folks came in early (there was a football game on TV they didn't want to miss) and pulled the plug on their stuff, effectively bringing everything else to a screeching halt.
Have gnu, will travel.
... that an idiot with his/her hand on a switch, a breaker or a power cord is more dangerous than even the worst computer bug.
(Judging from the houses that I see on my way to work each morning, some people shouldn't even be allowed to buy PAINT without supervision. And we provide them with computers and access to the Internet nowadays!)
(If that doesn't terrify you, you have nerves of steel.)
Cogito, igitur comedam pizza.
http://www.youtube.com/watch?v=7wRxASytPuQ is the most common reason servers go down. Come on, show of hands, how many of you have been a part of a scenario like this?
We've had multiple incidents nearly identical to one of the stupid tricks described in the article. One of our (former) techs had a habit of running two cables between the same pair of switches... or even plugging both ends of a single cable into the same switch! Needless to say, neither of these scenarios ends well.
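Both of those cabling mistakes are the same thing seen from the topology side: a link that closes a loop. If you keep a list of which switch each cable connects, you can flag the offending links before (or after) someone plugs them in. A minimal sketch in Python using union-find; the switch names are hypothetical, and this only models the cabling, not what spanning tree or link aggregation might do about it on real gear.

```python
def find_looping_links(links):
    """Return the links that close a loop in a switch topology.

    links: iterable of (switch_a, switch_b) pairs, one per physical cable.
    A second cable between the same pair, or a cable with both ends in one
    switch, shows up here because its endpoints are already connected.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    loops = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            loops.append((a, b))  # endpoints already connected: a loop
        else:
            parent[root_a] = root_b  # merge the two connected groups
    return loops
```

Feeding it two cables between "sw1" and "sw2" reports the second one, and a cable from "sw3" to "sw3" reports itself. A deliberately redundant design will also be flagged, so treat the output as "links something has to block or bond", not automatically as errors.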
My mother, who is a database admin for a county office (and has been for a long time), was getting a tour of a brand new mainframe server in the basement of her department's building back in the early 80's. At some point during the tour a large red button was pointed out that controlled the water-free fire suppression system. When pressed it activated a countdown safety timer that could be deactivated when the button was pulled back out.
Always wanting to try things for herself, she went to the red button at the end of the tour and pressed it. No timer was activated, instead a noticeable shutting down sound was heard as the buzzing of the mainframe died down. She accidentally hit the manual power-off button for the mainframe which was situated very close to the fire suppression button and happened to look similar.
All the IT staff of that building got to go home early that day because the mainframe took several hours to reboot and it was already lunch. She was very embarrassed and I have heard that story many times.
Ah, the memories! Here are some of the stories I've heard and/or witnessed over the years.
THE WEBSITE'S DOWN!!!
http://www.youtube.com/watch?v=W8_Kfjo3VjU
Wherever You Go, There You Are
My favorite was at a big office building. An electrician was upgrading the fluorescent fixtures in the server room. He dropped a washer into one of the UPSs, where it promptly completed a circuit that was never meant to be. The batteries unloaded and fried the step-down transformer out at the street. The building had a diesel backup generator, which kicked in -- and sucked the fuel tank dry later that day. For the next week there were fuel trucks pulling up a few times a day. Construction of a larger fuel tank began about a week later.
Stop-Prism.org: Opt Out of Surveillance
I had one a few years back which highlighted issues with both our attention to the network behavior, and the ISP's procedures. One day the network engineer came over and asked if I knew why all the traffic on our upstream seemed to be going over the 'B' link, where it would typically head over the 'A' link to the same provider. The equipment was symmetrical and there was no performance impact, it was just odd because A was the preferred link. We looked back over the throughput graphs and saw that the change had occurred abruptly several days ago. We then inspected the A link and found it down. Our equipment seemed fine, though, so we got in touch with the outfit that was both colo provider and ISP.
After the usual confusion it was finally determined that one of the ISP's staff had "noticed a cable not quite seated" while working on the data center floor. He had apparently followed a "standard procedure" to remove and clean the cable before plugging it back in. It was a fiber cable and he managed to plug it back in wrong (transposed connectors on a fiber cable). Not only was the notion of cleaning the cable end bizarre -- what, wipe it on his t-shirt? -- and never fully explained, but there was no followup check to find out what that cable was for and whether it still worked. It didn't, for nearly a week. That highlighted that we were missing checks on the individual links to the ISP and needed those in addition to checks for upstream connectivity. We fixed those promptly.
Best part was that our CTO had, in a former misguided life, been a lawyer and had been largely responsible for drafting the hosting contract. As such, the sliding scale of penalties for outages went up to one-month free for multi-day incidents. The special kicker was that the credit applied to "the facility in which the outage occurred", rather than just to the directly affected items. Less power (not included in the penalty), the ISP ended up crediting us over $70K for that mistake. I have no idea if they train their DC staff better these days about well-meaning interference with random bits of equipment.
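The monitoring gap in that story (upstream reachable, so nobody noticed the A link was dead for days) can be closed with a check that looks at each link as well as at end-to-end connectivity. A rough sketch in Python of the decision logic only; the function name and link labels are hypothetical, and how you actually collect link state (SNMP, interface counters, whatever) is left out.

```python
def assess_uplinks(link_up, upstream_reachable):
    """Combine per-link state with an end-to-end check into one status line.

    link_up: dict mapping a link name (e.g. "A", "B") to True/False.
    upstream_reachable: result of whatever end-to-end probe you already run.
    """
    down = sorted(name for name, up in link_up.items() if not up)
    if not upstream_reachable:
        return "OUTAGE: upstream unreachable"
    if down:
        # Traffic still flows, but redundancy is gone: alert anyway.
        return "DEGRADED: upstream reachable but link(s) down: " + ", ".join(down)
    return "OK: all links up"


print(assess_uplinks({"A": False, "B": True}, True))
```

The point is the middle branch: with only an upstream check, a dead A link and a healthy B link look identical to everything being fine.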
I had fun with a company awhile back. They are about 300 employees and ~90mil/year, so this is a small corporation.
Anyway, the company was trying to get a VPN tunnel established to their China office, and they were having a hell of a time at it. The employees on the China side had no IT experience so everything was done remotely.
It just so happens that one of the Chinese employees was recruited to make a change to the PIX firewall on the China side in order to get everything working. To our astonishment, it worked, and we had a secure VPN tunnel established.
The problem was accounts in the US started to get locked out, alphabetically, every 30 minutes. Our Active Directory was getting tons of password crack attempts from inside our internal network. I was using LDAP to develop an application at the time, so naturally I was suspect for causing all these lockouts.
Fast-forward a week. We look at the configuration of the Chinese firewall and it allowed all access from any IP address on the Chinese side. In other words, crackers were trying to get into our systems through our VPN tunnel in China. In effect, our corporate LAN had been directly connected to the Internet. Once we figured that out, I was free to go back to work and the network lived to see another day, but that incident caused major trouble for all our employees.
Moral of the story: Don't trust a Chinese firewall.
A year or so ago my company's entire network nationwide was taken down for several hours by a single misconfigured router in Texas.
Good judgement comes from experience. And most experience comes as a result of bad judgement.
Just about anyone who has been in the line of fire as sysadmin for long enough will recall some ill-conceived notion that caused untold trouble. Since my earliest experience with commercial computers was in a batch-processing environment, my initial mishaps rarely inconvenienced anybody other than myself. But I still recall an incident much later (early '90s) when I inadvertently managed to delete the ":per" directory on a Data General mainframe (more or less equivalent to /dev on a *nix box), then having to watch for about 45 minutes while my users' PIDs disappeared. I'll never forget that red-faced moment of knocking on my boss's door and letting him know he might want to leave his phone off the hook for the next hour...
I was employed at a 50-employee publicity company. They have a couple of offices across the country and need to share a filesystem through WAFS. The main repository for the WAFS was running off a USB drive, connected to the server using a wire that was too short. I pointed out the problem multiple times to my IT boss (no IT background whatsoever) without success, tried to raise the issue with the owner of the company, without success, and one day the worst happened. The USB controller of the drive fried and we lost the last day of work. The Windows server system went AWOL. It took an external consultant 3½ days to rebuild the main server, which was running the AD, WAFS, Exchange and our enterprise database. It cost us an account worth 12 MILLION $. The big boss then hired consultants and paid them over a thousand bucks to be told the exact same thing I pointed out 3 months earlier when I audited the IT infrastructure. Two months later she comes to me and asks me how much it would cost to have a bullet-proof infrastructure. I told her to invest around 80K in a virtualisation solution with scripts to move VMs around when workload changes, and to go with consolidated storage with live backups and replication. It was too expensive. Another three months pass, she hires some consultants and pays them thousands more to be told basically the same thing I told her 3 months earlier... That is where I quit.
Tomorrow is another day...
Computer room was in middle of plant on second floor. Fire sprinkler pipe went through concrete floor under raised floor via a hole lined with a somewhat oversized pipe. Was small clearance around the pipe.
Plant was shut down for model changeover (when the line workers go deer hunting and the plant engineers and related workers fix or change everything that needs fixing or changing.) Somebody was welding a cable rack near the ceiling and the smoke drifted up through the gap, into the space under the raised floor, and set off the halon. A decade's worth of dust and many of the raised floor sections went flying.
Security responded to the alarm. No sign of fire. Per procedure they switched to the backup halon tank. Half an hour later...
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
I don't know how old these tape machines were, but I can assure you that back in the day we had power systems that used vacuum tubes, and the tube space needed to be air cooled. The air temperature could reach several hundred Celsius if the fans stopped. Shortly after this would come the plop of inrushing air as the envelope of a KT88 collapsed at the hottest point. It would not be good design practice to series the units like this, but again back in the day thermal management wasn't even a black art. The last piece of electronic equipment I recall that used large power tubes in its control circuits was still in service in 1982, and the power resistors had to be replaced regularly because otherwise they would eventually burn out.
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
Supervisor said try this key in the fire pull station in room "x" that had a halon type fire suppression system. I asked what would happen if the alarm went off? Sup said there was a ~10 sec delay in which a press of the button above the pull station would stop the discharge. As you've already guessed, there was no delay --- instantly $20,000 of gas discharged into the room, half of which came from a nozzle aimed at my head (who thought of that layout?) while doors and vents slammed shut.
Fastest I've ever spent money with nothing to show for it. $72,000,000 an hour if the discharge lasted only a second. Of course that pales with the U.S. Federal government which spent $302,511,415.53 an hour during 2009 according to this quasi source.
note: sup actually took responsibility
Working at a small web hosting company as senior tech support lead, plus junior sysadmin for 100+ servers, I had a very busy day explaining why people's websites were not coming online. Our 45mbps DS3 was down, with nothing but a 20mbps DS3 over ATM to handle the load. We ended up shutting down web services to reduce bandwidth consumption just so that people could check their email.
This went on for 9 hours. Our ISP at the time was at a complete loss as to why our line was down. Their guys started at their POP in the San Francisco bay area and drove to every single POP along the way to Reno, NV (our location) until they got to Sacramento. There, they found a cable that had been bumped loose during maintenance done earlier in the day.
We lost a ton of money that day. In return, so did they.
Nobodies Prefect
Tidbits for Techs Technology Blog
but only on the drives which were oriented north-south; those oriented east-west were not affected. So came the directive that all drives, henceforth, needed to be oriented north-south.
That seems counter-productive. They were oriented into the less optimal position?
http://www.computerworld.com/s/article/print/9180479/Stupid_data_center_tricks?taxonomyName=Data+Center&taxonomyId=154 ... You're welcome! :)
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
One of my favorite stories my grandfather told me is a story about a computer that would screw up its calculations at the same time every day, about 2pm. This was back in the 60s, when computers were rather large. Basically, if the accountants ran the job in the morning, everything checked out. But if they were to run the batch through in the afternoon, their results would all be off. After two days of checking all the standard stuff (bad memory modules, bad cooling, what have you), he noticed that there was this loud banging noise that would start in the afternoon. He went out of the office to the next set of offices over, which had a machine shop. Turns out the machine shop press would start running about 2pm every day, and that machine press happened to be on the same circuit as the adding machine, so it would draw off just enough power to screw with the results.
but only on the drives which were oriented north-south; those oriented east-west were not affected. So came the directive that all drives, henceforth, needed to be oriented north-south.
That seems counter-productive. They were oriented into the less optimal position?
Yes, I blew that one... Oops! But let me take this opportunity to point out something that I realized only after posting the GP post... That I was able to deduce the problem I had with the PBX, because I applied what I learned from the situation with the cleaning staff using a slot on a rack's outlet strip to plug in their vacuum cleaner.
IOW, although some of these stories seem funny in retrospect, they can also prove to be great learning opportunities, too! I'm looking forward to reading the other posts in this thread. I should probably head over to the "daily wtf" web site, again, too.
I don't care where you work, if you're on site doing training, you're probably also sucked back into the work cycle. I see it all the time at work. I have always preferred offsite training, with the cell phones turned off. It also helps if you have to use your laptop in the lab, because 99% of the time it means you cannot VPN into work, so email is not a concern either.
I think my fellow Data Center operators would agree we're all understaffed, and I work on a network with hundreds of millions of customers using it on a 24/7 cycle. The other danger nobody speaks of is that some companies are too passive when it comes to testing redundancy. Half the time, while there's redundancy in the system to keep a DMZ up and running, there's no spare DMZ capacity to handle a true outage such as a fiber ring failure that isolates the data center, or some other disaster. Companies need to design their redundancy so you can unplug the entire data center and your customers never know it, because if you do not, you will rue the day a true outage happens that impacts the entire datacenter and you will hear about it on the news later. Not a good thing.
Some Air Force instructors told us in class that one time one of the tech school instructors wanted to brush up on his cisco skills, so he asked the IT if they had any old routers lying around. They did have one that they thought was cleared out, so they gave it over to him telling him to play around with it. He made the wonderful mistake of plugging it into the network inside the building and it started propagating all the old router information all over the network, which was hooked in the unclassified base network.
Why did you have to bring this up? You brought back bad memories of the time that I actually did this. I worked at the time as a computer operator in the Southland Corporation data center in Dallas. We had moved into our newly built headquarters building and there was a red light switch on the wall by the master breakers. We all wondered for days what the switch would do and I was the only one who eventually got brave (stupid?) enough to throw it. The master breakers to the computer immediately dropped out. So we tried to flip the master breakers up again, but they wouldn't budge. We had to call building maintenance, and after an hour of delaying production waiting for them to call in, we were able to get power to the computers back. We didn't know that you had to reset the breakers first by forcing them all the way down. I was scheduled to be promoted into programming anytime so I was really sweating it. The other computer room operators and evening manager decided to not tell anyone who caused the breakers to trip. There was a major management inquisition about what had happened but everyone kept quiet. Finally, the evening manager was told that he had until the next day to out the culprit. He was going to do this the next day, but said that management decided to drop it. I was saved. A glass covered wooden box was made to cover the switch. I was promoted to programmer shortly afterwards. Climb mountains, but don't ever flip switches because they are there.
Not really - just that trust in human nature or physical hospital security is flawed or there's a budget that means there is no choice other than to trust somebody to stick to the policy. Fully managed switches that could block dhcp used to be very expensive, and even now they are significantly more expensive than dumb switches that get every other part of the job done.
Don't make the mistake of thinking a hospital considers computer networking part of its core business, or upgrades networking components as often as a software company or a small office that doesn't have much equipment to upgrade.
I can't believe no one's posted Guy Steele's Magic/More Magic story, yet:
http://everything2.com/user/Accipiter/writeups/Magic
Sit, Ubuntu, sit. Good dog.
I would usually turn off the server room lights on the way out the door but left them on one night. A non-IT guy leaving tired around 9pm opened the door, hit the light switch and the AC switch next to it by mistake. It took less than five minutes to hit 60C, big spike on the temperature graph before it all shut down. Luckily nothing had to be replaced.
This encouraged a cover over the switch and a second AC unit. Then one day we lost one phase of power; both AC units were on the same phase and went down while the servers stayed up, so I shut down what I could and hunted around for industrial fans.
A Big Red Button incident knocked livejournal.com offline for 2 days back in 2003. I was working for their colo provider (the owner of said Button) at the time.
I stumbled onto a story of a PDP-10 with a mysterious "magic switch" some time ago; did it really happen or is it just a story? http://catb.org/jargon/html/magic-story.html
great learning opportunities
That one goes in the euphemism file entry for "horrible disaster".
"Little does he know, but there is no 'I' in 'Idiot'!"
You should have had the button labelled.
If he didn't admit it, how do you know he did it?
Also, the tech team were shown to be pretty damn stupid, not being able to track down the fact that a large red button had been pushed. Didn't they know that the button existed? If they knew it existed, were they aware of its significance? If so, then why the fsck did it take them so long to consider it?
The newbie was at fault, but so were the team who failed to identify the problem for a day and a half.
I can understand not insuring it though. I shipped a 1U fully insured. Double boxed w/ foam inserts and all. It arrived at its destination in a different box. The back was caved in exactly as would be expected if it were gored w/ a forklift. They refused to pay, claiming it was "improperly packed".
Always take a picture of box and content before shipping anything.
I lost my sig.
My dad told me this one:
He was a tech installing a replacement network. Once the new network was up (some 40 PCs), most of them would run for a while, then suddenly crash and reboot. After checking cables and PCs for a couple of days he noticed that one PC didn't seem to have the problem, so he took a long look at that one. That PC was the only one not plugged into an earthed power socket... so the problem was current running from that PC through the earth wire of the network.
Sometimes the one that works causes the problem!
Many years ago Northrop University had 2 PDP 11/34 boxes sitting next to each other. One day the sysadmin decided to network them together by connecting an RS-232 cable between them -- boom, both systems crashed. Reboot, try connecting again, same thing. Suddenly it dawns on him -- he neglected to turn off the echo on the ports used for the interconnection, meaning the first character sent got echoed back and forth in an infinite loop, generating interrupts on both machines faster than the CPU could handle them.
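For anyone who wants to see why one stray character is enough, here is a minimal sketch of the echo loop in Python. It is a toy model, not PDP-11 code: the function name, the port labels "A" and "B", and the `max_steps` cutoff are all made up for illustration, and each received character is simply counted as one interrupt.

```python
from collections import deque


def simulate_echo_loop(echo_a: bool, echo_b: bool, max_steps: int = 1000) -> int:
    """Count receive interrupts after port A sends one character to port B.

    Hypothetical model: each port either echoes every received character
    back down the line or swallows it. Stops after max_steps interrupts so
    the both-echo case terminates.
    """
    echo = {"A": echo_a, "B": echo_b}
    other = {"A": "B", "B": "A"}
    in_flight = deque(["B"])  # one character on the wire, heading for B
    interrupts = 0
    while in_flight and interrupts < max_steps:
        receiver = in_flight.popleft()
        interrupts += 1  # the receiving machine takes an interrupt
        if echo[receiver]:
            # echo enabled: the character goes straight back to the sender
            in_flight.append(other[receiver])
    return interrupts


if __name__ == "__main__":
    print("echo off on both ports:", simulate_echo_loop(False, False))
    print("echo on at B only:     ", simulate_echo_loop(False, True))
    print("echo on at both ports: ", simulate_echo_loop(True, True))
```

With echo off on either end the character dies after one or two hops; with echo on at both ends the count only stops because the sketch caps it.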
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Ah, the memories! Here are some of the stories I've heard and/or witnessed over the years.
Hey, I have a similar story from when I was working at Dartmouth College in the mid-80's. I was on third shift with two other guys, one who knew what he was doing, and one who was, uh, not fully technology-enabled.
For some reason, one night the latter person thought it would be a good idea to clean out the cabinet of our Honeywell mainframe. With a broom. A long-handled push broom.
This was on a weekend, when we normally do a full backup (onto good old 9-track tapes), reboot the system into protected mode, verify the system integrity, and go into multi-user mode. Well, we finished the backups, and tried to reboot. Nothing was working, and the diagnostics were wonky and pretty uninformative, and we (the useful co-worker and I) spent an hour or so trying to debug what was going on. It wasn't until we asked the third guy about the machine that he mentioned his cleaning. The boot switches for the IPL were on the door, and when he was in there cleaning, the broom handle toggled several of them, leaving the machine in its unusual state.
Needless to say, we asked him to avoid cleaning mainframes with brooms in the future.
This is definitely possible. If there's two live jacks at your desk, just plug them both into a desktop hub/switch and it will bring your network down completely.
Something like a 4 or 8 port 3com or netgear switch will do the trick.
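A rough sketch of why the two-jack trick hurts, assuming dumb switches that flood a broadcast out of every port except the one it arrived on and nothing like spanning tree to block the redundant link. The node names (`pc`, `sw_building`, `sw_desk`) and the `flood` function are invented for the illustration; this is a toy simulation, not a model of any particular vendor's hardware.

```python
def flood(links, first_switch, ingress_link, steps):
    """Return how many frame copies are transmitted in each step.

    links: list of (node_a, node_b) cables; names starting with "sw" are
    switches, anything else is a host that absorbs frames.
    first_switch / ingress_link: where the broadcast first arrives, and the
    index of the cable it arrived on.
    """
    frames = [(first_switch, ingress_link)]
    history = []
    for _ in range(steps):
        next_frames = []
        for node, ingress in frames:
            if not node.startswith("sw"):
                continue  # hosts do not forward
            for i, (a, b) in enumerate(links):
                if i == ingress or node not in (a, b):
                    continue  # never send back out the ingress cable
                next_frames.append((b if node == a else a, i))
        frames = next_frames
        history.append(len(frames))
    return history


# One cable between the building switch and the desk switch: no loop.
single = [("pc", "sw_building"), ("sw_building", "sw_desk")]
# Both wall jacks plugged into the same desk switch: a loop.
looped = [("pc", "sw_building"), ("sw_building", "sw_desk"), ("sw_building", "sw_desk")]

if __name__ == "__main__":
    print("single link:", flood(single, "sw_building", 0, 6))
    print("looped link:", flood(looped, "sw_building", 0, 6))
```

With the single cable the broadcast dies out after one hop. With the loop, copies keep circulating in both directions and never reach zero, and every new broadcast on the network adds more on top.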
... who would have been working on some IBM big iron back in the early 70's. This was in NZ and there were no computers there at the time, so the cards were punched and then sent to the nearest computer (in Sydney), with the results coming back a week or two later. Inexplicably some of the runs would fail with random errors, causing a great deal of lost time, and it wasn't until they noticed one of the assistants picking up the punched holes and pushing them back into the cards that they figured out why. Apparently they didn't like to see all those cards go to waste.
They also had another issue with one of the systems shutting down due to a fault fairly regularly, but only when one of the operators (a woman) was using it. They eventually traced the problem to her wearing nylon underwear and causing a static charge.
Why do people type "*nix" instead of spelling it out?
http://en.wikipedia.org/wiki/*nix
Every end has half a stick.
I have 3 data center stories:
I was installing mainframe software at a Fortune 100 site. There was a line of printers in the computer room that spit paper into a hall to be picked up. If a printer ran out of paper, a yellow flashing light went off. If one of the super fast page printers ran out of paper, a red light flashed. I was in the computer room around midnight when a new watchman came in on his rounds, saw the flashing lights, panicked, and pressed the red button on the IBM mainframe console. Needless to say, I was sent home and told to come back in a couple of weeks.
Another time I was in the computer room at a small mainframe installation out in the middle of nowhere. The managers decided that they needed a full bank of batteries for backup so there were a bunch of carpenters banging away next to me. One of them put a nail from a nail gun through the 220 volt main. The computer room sounded like an explosion as the heads of the drop-in disks retracted simultaneously. That bang was followed by dead silence. You have no idea how loud mainframe computer rooms are until the power goes out.
The last time I experienced a meltdown was at another data center that had redundant everything ... except cooling water. The 3 inch pressurized, chilled water main blew apart, draining the coolant system and leaving 6 inches of water under the raised floor. Fortunately there were no shorts, but within about 20 minutes the temp in the rack room reached 130. It was a sauna in there. One by one the systems powered down, starting with the big DEC VAX's ... the only systems that didn't shut down before we got to them were the Sun servers, mostly Sun 50's. It took us 3 days to get everything back up.
SG