Stupid Data Center Tricks

Network meltdown due to hub cross-connects by Florian Weimer · 2010-08-15 01:37 · Score: 5, Interesting

Can this really happen easily? I thought for really ugly things to happen, you need to have switches (without working STP, that is).

Re:Network meltdown due to hub cross-connects by ianalis · 2010-08-15 02:33 · Score: 4, Interesting

According to CCNA Sem 1, a hub is a multiport repeater that operates in layer 1. A switch is a multiport bridge that operates in layer 2. I thought these definitions were universally accepted and used, until I used non-Cisco devices. I now have to refer to L2 and L3 switches even if CCNA taught me that these are switches and routers, respectively.

Re:Network meltdown due to hub cross-connects by X0563511 · 2010-08-15 02:52 · Score: 2, Interesting

It's so irritating when you ask for a hub, and someone hands you a switch. Stores do the same thing. It's hard enough to find hubs, let alone find them when the categorization lumps them together.
No, I said hub. I don't want switching. I want bits coming in one port to come back out of all the others.
You can do that with a switch, but getting a switch that can do that is a bit more pricey than a real hub...

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...

Re:Network meltdown due to hub cross-connects by Mad Bad Rabbit · 2010-08-15 03:39 · Score: 2, Interesting

Cheap deep-packet inspection (using an old hub and Wireshark)?

--
>;k

Re:bad article is bad by Anonymous Coward · 2010-08-15 01:39 · Score: 2, Interesting

I seem to remember in the early days of Telehouse London an engineer switched off power to the
entire building. Only two routes out of the UK remained (one was a 256k satellite connection)
that had their own back-up power.

Don't try this at work... by alphatel · 2010-08-15 01:42 · Score: 2, Interesting

Plug all the ethernet-like T1 cables into a switch
Change the administrator password and forget what you changed it to
Hang everything off a single power strip, no UPS
Buy expensive remote management cards but don't bother to configure them

--
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.

Re:Don't try this at work... by v1 · 2010-08-15 02:03 · Score: 3, Interesting

- run thinnet lines along the floor under people's desks, for them to occasionally get kicked and aggravate loose crimps, taking entire banks of computers (in a different wing of the building) off the LAN with maddening irregularity
- plug a critical switch into one of the ups's "surge only" outlets
- install expensive new baytech RPMs on the servers at all remote locations, and forget to configure several of the servers to "power on after power failure".
- on the one local server you cannot remote manage, plug its inaccessible monitor into a wall outlet
honorable mention:
- junk the last service machine you have laying around that has a scsi card in it while you still have a few servers using scsi drives

--
I work for the Department of Redundancy Department.

Not using Cisco ACLs by Nimey · 2010-08-15 01:48 · Score: 3, Interesting

Our entire network was brought down a few years ago when a student plugged a consumer router into his dorm room's port. Said router provided DHCP, and having two conflicting DHCP servers on the network terminally confused everything that didn't use static IPs.

Took our networking guys hours to trace that one down.

--
Hail Eris, full of mischief...

E pluribus sanguinem

Re:Not using Cisco ACLs by GuldKalle · 2010-08-15 02:02 · Score: 2, Interesting

I had that error too, on a city-wide network. The solution? Get an IP from the offending router, go to its web interface, use the default password to get in, and disable DHCP.
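The two-DHCP-server problem described in this thread is easy to spot once you compare who is answering against who should be. A minimal sketch, assuming the DHCP offers have already been captured somehow (a packet capture, switch logs); the function name, the allowlist, and the sample addresses are all made up for illustration:

```python
def rogue_dhcp_servers(offers, authorized):
    """Return DHCP server addresses that sent offers but are not on the allowlist.

    offers: iterable of (server_ip, offered_client_ip) pairs, however captured.
    authorized: collection of server IPs that are supposed to hand out leases.
    """
    seen = {server for server, _offered in offers}
    return sorted(seen - set(authorized))


# Hypothetical capture: the campus server plus a consumer router in a dorm room.
observed = [
    ("10.0.0.2", "10.0.14.37"),
    ("192.168.1.1", "192.168.1.100"),
    ("10.0.0.2", "10.0.14.38"),
]
print(rogue_dhcp_servers(observed, authorized={"10.0.0.2"}))  # ['192.168.1.1']
```

This only flags the offender; finding the physical port it hangs off is still the hours-long part.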

--
What?

Quad Graphics 2000 by Anonymous Coward · 2010-08-15 01:51 · Score: 5, Interesting

In the summer of 2000 I worked at Quad/Graphics (printer, at least at that time, of Time, Newsweek, Playboy, and several other big-name publications). I was on a team of interns inventorying the company's computer equipment -- scanning bar coded equipment, and giving bar codes to those odds and ends that managed to slip through the cracks in the previous years. (It's amazing what grew legs and walked from one plant to another 40 miles away without being noticed.)

One of my co-workers got curious about the unlabeled big red button in the server room. Because he lied about hitting it, the servers were down for a day and a half while a team tried to find out what wiring or environmental monitor fault caused the shutdown. That little stunt cost my co-worker his job and cost the company several million dollars in productivity. It slowed or stopped work at three plants in Wisconsin, one in New York, and one in Georgia.

The real pisser was the guilty party lying about it, thereby starting the wild goose chase. If he had been honest, or even claimed it was an accident, the servers would have all been up within the hour, and at most plants little or no productivity would have been lost.

The reality: a 20 year old's shame cost a company millions.

Re:bad article is bad by commodore64_love · 2010-08-15 02:39 · Score: 2, Interesting

But.....

I only got a 200 on my English SAT. I's got no writin' skills. That's why I became a computer geek instead.

--
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall

Re:bad article is bad by OnlineAlias · 2010-08-15 02:45 · Score: 2, Interesting

The first one, at the IU school of medicine: I'm very familiar with that place... they have no data center to speak of, and I do not know that person. I never heard of that incident. Also, who doesn't run spanning tree with BPDU guard and other such protections? I know IU does, for a fact.

Something is very very wrong with that article.

My favourite human error - a true story by Kupfernigk · 2010-08-15 02:50 · Score: 5, Interesting

This was a server room at an (unnamed) UK PLC. The air conditioning had remote management, and the remote management notified the maintenance people that attention was needed. So someone was sent out, on a Friday afternoon.

When he arrived, most of the staff had gone home and the skeleton IT staff didn't want to hang around. So, they sent him away on the basis that his work wasn't "scheduled".

Everybody came back on Monday to find totally fried servers.

--
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."

cascade failures by Velox_SwiftFox · 2010-08-15 03:02 · Score: 3, Interesting

How can this leave out the standard cascade failure scenario?

Trying to achieve redundancy, someone gets what they think is a worst-case 30A of servers with multiple power supplies, and plugs one power supply on each server into one PDU rated 30A and the other power supply into a second one.

They may or may not know that the derated capacity of the circuit is only 24A. The data center is unlikely to warn them, as they only appear to be using 15A per circuit at most.

Anyway, something happens to one of the PDUs and the power is lost from it. Perhaps power factor corrections (remember the derating?) and cron jobs running at midnight on all the servers that raise the load high simultaneously. Maybe just the failure of one of the PDUs that was feared, causing the attempt at "redundancy".

In any case, all of the load is then put on the remaining circuit, and it always fails. The whole rack loses power.
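The arithmetic behind this failure can be checked before anything is plugged in: ask what one circuit would carry if its partner died. A minimal sketch, assuming the figures from the post (30A breakers, 80% derating giving 24A usable, about 15A per side when healthy); the function names and the even load split are illustrative assumptions:

```python
def normal_per_circuit(total_load_amps, circuits=2):
    """Load each circuit sees while everything is healthy (even split assumed)."""
    return total_load_amps / circuits


def survives_failover(total_load_amps, breaker_amps, derate=0.8):
    """Return True if ONE circuit can carry the whole rack after the other fails."""
    usable_amps = breaker_amps * derate  # e.g. 24 A usable on a 30 A circuit
    return total_load_amps <= usable_amps


rack_load = 30.0  # worst-case draw of the whole rack, in amps

print(normal_per_circuit(rack_load))       # 15.0, looks harmless per circuit
print(survives_failover(rack_load, 30.0))  # False: 30 A will not fit in 24 A
print(survives_failover(20.0, 30.0))       # True: 20 A fits on one derated circuit
```

The point is that the per-circuit number a data center sees in normal operation says nothing about the failover case; the check has to use the whole rack's load.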

Re:cascade failures by omglolbah · 2010-08-15 05:00 · Score: 2, Interesting

Yep, it is one of the specific steps when we define requirements for server racks. Sadly not all the customers pay attention and then yell for us to come fix the mess when they find out years later :p
This is especially fun if the trip to the "datacenter" involves a helicopter ride to the oil rig where it is located :p

Mainframe days story by assemblerex · 2010-08-15 03:45 · Score: 5, Interesting

The old tape machines (six foot tall) used to put out a tremendous amount of heat. Space is at a premium, so in the mainframe room the drives were normally put edge to edge,
with one pushing air in and the other pulling air out. The machines had two 10-12" fans per unit, so stacking two or three units was fine. One site had so many machines side to
side (over 7), the air coming out the last machine regularly set things on FIRE. It was not uncommon for the machine to ignite lint going through the stack, with it coming out the
end as a small explosion like dust in a grain silo explosion. A fire extinguisher was kept on hand, and the wall eventually got a stainless steel panel because it was so common.

FedEx, get insurance/ship your server by AnAdventurer · 2010-08-15 03:52 · Score: 3, Interesting

When I was IT manager for a big retail mfg we had a cross-country move from the SF bay area to TN (closer to shipping hubs and lower tax rates). I was hired for the new plant, and I was there setting up everything (I did not know the company knew next to nothing about technology). The last thing done before the company shut down for the move was to ship the data server via 2-day FedEx. The CFO packed it up and shipped it out; as the driver pulled away from the bay the server fell off the bumper and onto the cement. They picked it up (looking undamaged in its box). When I opened it there was a shower of parts. A hard drive had detached from the case but not the cable and had swung around in that case like a flail. The CFO had NOT INSURED the shipment or taken anything apart. That and much more to save $50 here and there.

--
6.8SPC TR of 550, l xwind at 6, drift rt at 26" drops 77". AT has 503 ft-lbs at 1403 fps. FT 0.86

Data center power by PPH · 2010-08-15 04:22 · Score: 3, Interesting

Back when I worked for Boeing, we had an "interesting" condition in our major Seattle area data center (the one built right on top of a major earthquake fault line). It seems that the contractors who had built the power system had cut a few corners and used a couple of incorrect bolts on lugs in some switchgear. The result of this was that, over time, poor connections could lead to high temperatures and electrical fires. So, plans were made to do maintenance work on the panels.

Initially, it was believed that the system, a dually redundant utility feed with diesel gen sets, UPS supplies and redundant circuits feeding each rack could be shut down in sections. So the repairs could be done on one part at a time, keeping critical systems running on the alternate circuits. No such luck. It seems that bolts were not the only thing contractors skimped upon. We had half of a dual power system. We had to shut down the entire server center (and the company) over an extended weekend*.

*Antics ensued here as well. The IT folks took months putting together a shut down/power up plan which considered numerous dependencies between systems. Everything had a scheduled time and everyone was supposed to check in with coordinators before touching anything. But on the shutdown day, the DNS folks came in early (there was a football game on TV they didn't want to miss) and pulled the plug on their stuff, effectively bringing everything else to a screeching halt.
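A shutdown plan like the one described is at heart a dependency ordering problem, which is why pulling DNS first stalls everything that depends on it. A minimal sketch using Python's standard `graphlib` (3.9+); the service names and their dependencies are hypothetical, not the actual plan from the story:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what it needs to be running.
depends_on = {
    "dns": [],
    "database": ["dns"],
    "app_servers": ["database", "dns"],
    "web_frontend": ["app_servers", "dns"],
}

# static_order() yields dependencies before dependents: the power-up order.
power_up = list(TopologicalSorter(depends_on).static_order())

# Shutdown is the reverse: dependents go down before what they rely on.
shut_down = list(reversed(power_up))

print(power_up)   # ['dns', 'database', 'app_servers', 'web_frontend']
print(shut_down)  # ['web_frontend', 'app_servers', 'database', 'dns']
```

In this toy map DNS comes up first and goes down last, which is exactly the slot the football fans skipped.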

--
Have gnu, will travel.

Re:Data center power by thegarbz · 2010-08-15 21:20 · Score: 2, Interesting

Basic rules of redundancy. A UPS isn't!

We had a similar situation to yours except we actually had a dual power system. The circuit breakers on the output however had very dodgy lugs on their cables which caused the circuit breakers to heat up, A LOT. This moved them very close to their rated trip current. When we eventually came in to do maintenance on one of the UPSes we turned it off as per procedure, and naturally the entire load moved to the other. About 30 seconds later we heard a click come from a distribution board on the wall, and suddenly refinery operators were shouting panicked abuse through the 2ways to turn the damn thing back on.

These UPSes fed the emergency shutdown system of an oil refinery. Operators don't like their naps interrupted.

Washer in the UPS by Bob9113 · 2010-08-15 05:43 · Score: 4, Interesting

My favorite was at a big office building. An electrician was upgrading the fluorescent fixtures in the server room. He dropped a washer into one of the UPSs, where it promptly completed a circuit that was never meant to be. The batteries unloaded and fried the step-down transformer out at the street. The building had a diesel backup generator, which kicked in -- and sucked the fuel tank dry later that day. For the next week there were fuel trucks pulling up a few times a day. Construction of a larger fuel tank began about a week later.

--
Stop-Prism.org: Opt Out of Surveillance

Know your colo contracts by 1984 · 2010-08-15 05:45 · Score: 2, Interesting

I had one a few years back which highlighted issues with both our attention to the network behavior, and the ISP's procedures. One day the network engineer came over and asked if I knew why all the traffic on our upstream seemed to be going over the 'B' link, where it would typically head over the 'A' link to the same provider. The equipment was symmetrical and there was no performance impact, it was just odd because A was the preferred link. We looked back over the throughput graphs and saw that the change had occurred abruptly several days ago. We then inspected the A link and found it down. Our equipment seemed fine, though, so we got in touch with the outfit that was both colo provider and ISP.

After the usual confusion it was finally determined that one of the ISP's staff had "noticed a cable not quite seated" while working on the data center floor. He had apparently followed a "standard procedure" to remove and clean the cable before plugging it back in. It was a fiber cable and he managed to plug it back in wrong (transposed connectors on a fiber cable). Not only was the notion of cleaning the cable end bizarre -- what, wipe it on his t-shirt? -- and never fully explained, but there was no followup check to find out what that cable was for and whether it still worked. It didn't, for nearly a week. That highlighted that we were missing checks on the individual links to the ISP and needed those in addition to checks for upstream connectivity. We fixed those promptly.

Best part was that our CTO had, in a former misguided life, been a lawyer and had been largely responsible for drafting the hosting contract. As such, the sliding scale of penalties for outages went up to one-month free for multi-day incidents. The special kicker was that the credit applied to "the facility in which the outage occurred", rather than just to the directly affected items. Less power (not included in the penalty) the ISP ended up crediting us over $70K for that mistake. I have no idea if they train their DC staff better these days about well-meaning interference with random bits of equipment.
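The monitoring gap in this story is that a check for upstream connectivity stays green as long as any one link works. A minimal sketch of checking individual links as well, assuming link state is already collected from somewhere; the function name, alert strings, and link names are illustrative:

```python
def link_alerts(link_up):
    """Given {link_name: is_up}, return alerts for overall and per-link state."""
    down = sorted(name for name, up in link_up.items() if not up)
    alerts = []
    # Overall connectivity is only lost when every link is down...
    if link_up and len(down) == len(link_up):
        alerts.append("CRITICAL: no upstream connectivity")
    # ...but each dead link is worth a warning on its own.
    alerts.extend(f"WARNING: link {name} is down" for name in down)
    return alerts


print(link_alerts({"A": False, "B": True}))  # ['WARNING: link A is down']
print(link_alerts({"A": True, "B": True}))   # []
```

With only the overall check, the first case above produces no alert at all, which is how a dead A link can go unnoticed for days.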

None of us are innocent. by BrokenHalo · 2010-08-15 06:07 · Score: 3, Interesting

Good judgement comes from experience. And most experience comes as a result of bad judgement.

Just about anyone who has been in the line of fire as sysadmin for long enough will recall some ill-conceived notion that caused untold trouble. Since my earliest experience with commercial computers was in a batch-processing environment, my initial mishaps rarely inconvenienced anybody other than myself. But I still recall an incident much later (early '90s) when I inadvertently managed to delete the ":per" directory on a Data General mainframe (more or less equivalent to /dev on a *nix box), then having to watch for about 45 minutes while my users' PIDs disappeared. I'll never forget that red-faced moment of knocking on my boss's door and letting him know he might want to leave his phone off the hook for the next hour...

Re:None of us are innocent. by Helen O'Boyle · 2010-08-15 12:18 · Score: 3, Interesting

Good post title, BrokenHalo. I'll chime in with my two. 1987, my first full time job. I was a small ISV's UNIX guru. I wanted to remove everything under /usr/someone. I cd'd to /usr/someone and typed, "rm -r *", then I realized, hey, I know that won't get everything, better add some more, and the command became, "rm -r * .*". I realized, oh, no, this'll get .. too, so I better change it to: "rm -r * .?*". It took about 12 microseconds after I hit enter to realize that ".?*" still included "..". Yes, disastrous results ensued, even though I was able to ^C to avoid most of the damage, and I had the backup tape (back in the day, we used reels) in the tape drive just as users (other devs) began to notice that /usr/lib wasn't there. Yep, I have my own memories of red-facedly telling my boss, "oops, I did this, I'm in the process of fixing it now. Give me half an hour." In the future, "rm -r /usr/someone" did the trick nicely.

Early 1990's, I was consulting in the data center of a company with 8 locations around the world. It contained the company's central servers that were accessed by about 700 users. Being a consultant, they didn't have a good place to put me, so I ended up at a desk in the computer room. Behind me was a large counter-high UPS that the previous occupant had used as somewhat of a credenza, and I carried on the tradition. That is, until the day I had put my cape on there, and the cape slid down and through one of those Rube Goldberg miracles caught the UPS master shutoff handle, pulled it down, and I heard about 30 servers (thank goodness there weren't more) powering down instantaneously. Amazingly, I lived, based on the ops manager pointing out to the powers that be that it was a freak accident and that others had been sitting similar stuff in the same place for years. The cape, however, was not allowed back in the data center.

Fortunately, I've had better luck and/or been more careful over the past 20 years.
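Why ".?*" still catches ".." is easy to see by testing the patterns against a few names. A minimal sketch using Python's `fnmatch` as a stand-in for the shell's pattern matching (an assumption: the 1987 shell is obviously not Python, but these three patterns behave the same way under classic glob rules); the directory entries are made up:

```python
from fnmatch import fnmatchcase


def matched(pattern, names):
    """Names a shell-style pattern would match (no dotfile special-casing)."""
    return [n for n in names if fnmatchcase(n, pattern)]


entries = [".", "..", ".profile", ".x", "notes.txt"]

print(matched(".*", entries))      # ['.', '..', '.profile', '.x']
print(matched(".?*", entries))     # ['..', '.profile', '.x']
print(matched(".[!.]*", entries))  # ['.profile', '.x']
```

The "?" must match one character, and the second dot of ".." qualifies. The ".[!.]*" form excludes it, at the cost of also skipping any file whose name starts with two dots.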

Re:None of us are innocent. by Anonymous Coward · 2010-08-15 21:06 · Score: 1, Interesting

My co-worker was a Windows guy learning *nix commands on a Mac with OS X. He tried rm -r on a mount point that he had learned to map to our dev server housing the builds and source. Just like Mr. Jobs kept saying, it simply works. Luckily, the server was backed up nightly. The restore took about a day. Everybody got read-only access after that.

USB drive running mission critical WAFS by gagol · 2010-08-15 06:14 · Score: 4, Interesting

I was employed at a 50-employee publicity company. They have a couple of offices across the country and need to share a filesystem through WAFS. The main repository for the WAFS was running off a USB drive, connected to the server by a cable that was too short. I pointed out the problem multiple times to my IT boss (no IT background whatsoever) without success, and tried to raise the issue with the owner of the company, also without success. One day the worst happened: the USB controller of the drive fried and we lost the last day of work. The Windows server went AWOL. It took an external consultant 3½ days to rebuild the main server, which was running AD, WAFS, Exchange and our enterprise database. It cost us an account worth $12 MILLION. The big boss then hired consultants and gave them over a thousand bucks to be told the exact same thing I had pointed out 3 months earlier when I audited the IT infrastructure. Two months later she came to me and asked how much it would cost to have a bullet-proof infrastructure. I told her to invest around 80K in a virtualisation solution with scripts to move VMs around when the workload changes, and to go with consolidated storage with live backups and replication. It was too expensive. Another three months passed, she hired some consultants and gave them thousands of dollars more to be told basically the same thing I had told her 3 months earlier... That is where I quit.

--
Tomorrow is another day...

Re:bad article is bad by BrokenHalo · 2010-08-15 06:19 · Score: 2, Interesting

while the second story was procedural user error (do the backup every day, no matter what) being blamed on a technical problem (the backup system).

Back in the late '80s when I was working on Prime "mini-computers" (as such machines were then known), I would receive periodic calls from Prime's tech support to alert me to (yet) another bug found in their BRMS (Backup/Restore Management System), and would I pretty-please stop using it. As it happened, I was using their less sophisticated but otherwise bombproof dump/restore utilities, so this was never an issue for me, but it was still pretty funny...

A classic by Anonymous Coward · 2010-08-15 08:30 · Score: 1, Interesting

One of my favorite stories my grandfather told me is a story about a computer that would screw up its calculations at the same time every day, about 2pm. This was back in the 60s, when computers were rather large. Basically, if the accountants ran the job in the morning, everything checked out. But if they were to run the batch through in the afternoon, their results would all be off. After two days of checking all the standard stuff (bad memory modules, bad cooling, what have you), he noticed that there was this loud banging noise that would start in the afternoon. He went out of the office to the next set of offices over, which had a machine shop. Turns out the machine shop press would start running about 2pm every day, and that machine press happened to be on the same circuit as the adding machine, so it would draw off just enough power to screw with the results.

Air Force fun by Anonymous Coward · 2010-08-15 13:48 · Score: 1, Interesting

Some Air Force instructors told us in class that one time one of the tech school instructors wanted to brush up on his Cisco skills, so he asked IT if they had any old routers lying around. They did have one that they thought was cleared out, so they gave it to him, telling him to play around with it. He made the wonderful mistake of plugging it into the network inside the building and it started propagating all the old router information all over the network, which was hooked into the unclassified base network.

Why did you have to bring this up? by Anonymous Coward · 2010-08-15 13:58 · Score: 1, Interesting

Why did you have to bring this up? You brought back bad memories of the time that I actually did this. I worked at the time as a computer operator in the Southland Corporation data center in Dallas. We had moved into our newly built headquarters building and there was a red light switch on the wall by the master breakers. We all wondered for days what the switch would do and I was the only one who eventually got brave (stupid?) enough to throw it. The master breakers to the computer immediately dropped out. So we tried to flip the master breakers up again, but they wouldn't budge. We had to call building maintenance and, after an hour of delayed production waiting for them to call in, we were able to get power to the computers back. We didn't know that you had to reset the breakers first by forcing them all the way down. I was scheduled to be promoted into programming any time so I was really sweating it. The other computer room operators and evening manager decided to not tell anyone who caused the breakers to trip. There was a major management inquisition about what had happened but everyone kept quiet. Finally, the evening manager was told that he had until the next day to out the culprit. He was going to do this the next day, but said that management decided to drop it. I was saved. A glass covered wooden box was made to cover the switch. I was promoted to programmer shortly afterwards. Climb mountains, but don't ever flip switches because they are there.

Re:bad article is bad by afidel · 2010-08-15 14:21 · Score: 2, Interesting

Actual Cisco stuff (as opposed to Linksys gear with a Cisco badge) will discover a loop in an adjacent switch and shut down the uplink port. Of course if you haven't turned on sw portfast the switch will do spanning tree which will keep the port from ever coming up, so yes better switches will definitely solve the problem. I had a network where the training room and C* row were serviced from the same 48-port switch. Our very ADD CEO was in the training room trying to ignore a boring meeting and plugged two adjacent popups into each other, which took down C* row, but the upstream switch caught the problem so the rest of the company kept working. The CFO threw a fit until the root cause was determined, and then it was solved by putting a warning on the ports in the training room =)

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
