ISP Recovers in 72 Hours After Leveling by Tornado
aldheorte writes "Amazing story of how an ISP in Jackson, TN, whose main facility was completely leveled by a tornado, recovered in 72 hours. The story is a great recounting of how they executed their disaster recovery plan, what they found they had left out of that plan, data recovery from destroyed hard drives, and perhaps the best argument ever for offsite backups. (Not affiliated with the ISP in question)"
Hopefully no one was hurt when the trailer park got levelled.
when Munchkins overrun the web now that this ISP got relocated by the twister.
"So, ah, your ISP here.. what's your uptime for the last year?"
"99.18% for our service, and 96.2% for our building."
And I'm sure every minute of those 72 hours was characterized by irate phone calls to tech support.
"Are you guys down again? You're down more than you're up! I'm going to find another service... etc..."
"Ma'am our facilities have been entirely leveled by a tornado, we'll be back up in 72 hours."
"72 HOURS?! I have photos of my grandchildren I have to mail! Worst ISP ever! Let me speak to your supervisor!"
"Ma'am our supervisor was also leveled by the tornado."
*click*
Not that I work tech support for an ISP and am bitter...
This is what happens when people make intelligent plans and then modify them as they see other plans work or fail. I'm glad to see that this was a work in progress rather than some arcane plan in a binder somewhere that no one ever looked at.
The Blaster Master Fighting for Truth, Justice, and Evil Pie since 1979
When your business gets pelted with the equivalent force of 100,000 elephants, you better have a friggin contingency plan.
--"The perfect example of the man of action is the suicide." - William Carlos Williams
...is a good enough argument for off site backups. If you don't have them, your backup plan is not enough.
A Tornado huh?
Well that's what you casemodders get for installing twenty overpowered cooling fans in every one of your 1000 servers!
Slashdot Syndrome: the sudden, extreme urge to correct someone in order to validate one's self.
Let the OZ jokes flow:
"Bring me the router of the wicked switch of the Qwest!"
Although, I am starting to wonder. Has anyone checked to see if this ISP has a record of resisting RIAA subpoenas? Perhaps the RIAA levelled it after acquiring cloudbuster equipment.
Don't blame Durga. I voted for Centauri.
A couple of friends of mine were badly burned because the web hosting company they were using lost all their data (customer and their own) in one humungous crash, and didn't have any backups. They didn't even have a spare copy of their customer database, so they couldn't even contact their customers to tell them what was going on. Nor could they tell what customers they had and how much service they'd paid for, etc.
The next Cmdr Taco duplicate will be ready soon, but subscribers can beat the rush and see it early!
Those businesses should realize they need a backup/disaster plan as well, if they absolutely could not withstand a day of downtime.
Perhaps having the sites mirrored on two colos in two locations, and routing to the other one when the first goes offline.
I don't need no instructions to know how to rock!!!!
No, in Russia Tornado does not own you. Neither does ISP. It is not, step 1) tornado step 2) ??? step 3) ISP recovers. There is not a beowulf cluster of these, and the tornado doesn't run Linux.
Then I've seen the other end of the spectrum - a 6 Billion dollar corporation's world HQ IT center... wow. They have disaster recovery sessions and planning like I never would have imagined. Very cool facility, but it has to be like that. Some day if they get burned, it's all over.
Berto
That's actually interesting - how many sites have contingency plans for the /. effect? How many businesses?
It's not just /., but just about any media can refer people to a real business site. For small companies, this could bring them down for some time. Imagine the "Bruce Almighty" effect, only with some business with a small-to-medium capacity connection, bombarded just because someone used http://www.slashdotme.com/ or spam@.me.into.oblivion.org in their movie.
The fact that so many sites are taken down by the /. effect causes me to believe that few sites and those who run them are truly prepared.
Terrycloth Lobster
But, as a programmer, I just don't care.
When I was a sophomore, working on my electrical engineering degree, I worked for a small, network-centric company that employed what seemed to be an abnormal number of snooty programmers and technical writers. Maybe it wasn't so abnormal.
Me: "Hi, IT support."
Stratjakt: "Hey, I know you're just a high-school educated 'IT person', but you need to get one of your cable monkeys up here and find out why I can't see the network!"
Me: "OK, but let's check a couple of things quickly before I dispatch a technician. It may save some time."
Stratjakt: "Hey, I'm a programmer! I just don't care!"
Me: "I understand... I realize that my mundane existence doesn't have the exhilaration and excitement of the thrilling, edge-of-your-seat world of a computer programmer, but there are just a few simple things that we could do to resolve this problem that will be faster than you waiting for a technician."
Stratjakt: "I just don't care."
Me: "No problem, I'll dispatch a technician."
An hour later...
Technician: "Stratjakt is all fixed up. I plugged his network cable back into the jack."
What amazes me isn't that these people were able to restore service to their customers in 72 hours. They used standard systems administration techniques. BGP was specifically mentioned.
No, what amazes me is that this is news. The IT industry is so full of idiots and morons and MCSEs that taking basic precautions earns you a six-figure salary and news coverage. These folks didn't even have off-site backups; it was luck that they were able to resume business operations (i.e., billing) so soon.
Moral of the story? When automobile manufacturers start getting press coverage for doing a great job because unlike their competition, they install brakes in their vehicles, you know that the top-tier IT managers and executives have switched industries.
Barclay family motto:
Aut agere aut mori.
(Either action or death.)
Wrong on SOOOOOO many levels.
Let me start with this line:
"I realize that slashdot is mostly populated by high-school educated "IT people", who give a shit about logs and backups"
You claim to be a programmer. I have been a programmer and am now a sys admin, and in both roles the BEST way to troubleshoot was from the logs. Unless you are the supreme programmer whose code never needs debugging and whose users never mispunch something and cause an error, a log file will let you see and know what has happened.
Now for this line:
"and restoring backup tapes is exhillirating and exciting."
I have restored from tape backup. We had a "programmer" (BS from Virginia Tech, Masters from UMass) who was certain he knew exactly what he was doing when he blew away an entire production database. (Actually, he was a really good guy who just made a simple mistake.) Fortunately we had tapes to restore from. But if ANYONE thinks that a restore is "exhillirating" (yes, I left your typo/mistake in there) then they are just strange. That was one of the most tedious and boring things I have had to do. But we had been meticulous in backing EVERYTHING up, so production was not severely impacted.
Now for where you directly insult everyone:
"I fully expect the PHBs and army of cable monkeys to get the network up and running in our new location."
So as a systems admin do I become a cable monkey? Or am I a PHB? Either way I would be VERY needed if a disaster strikes, just as I am needed every day. As for the elitist attitude and your lack of knowledge and concern for the backend of systems, I am glad you do not work anywhere near me, as I hate IT personnel who have to call me to run Windows Update on their system when the latest worm comes around or to show them how to NOT click ignore when Norton tells them they have a virus.
In short, please show some respect for your coworkers and realize that these guys were prepared and did what their plan stated they could do.
If not, don't be alarmed if somehow your account gets disabled and everything blown away and, surprisingly, they won't have backups, 'cause you "just don't care" for them.
I am 31337 or something.
I, for one, welcome our new Tornado-beating ISP overlords.
Slashdot still doesn't support Unicode after it was added to the HTML standard in 1997.
Can they recover from the slashdot effect???
The slashdot effect differs from a tornado in a few subtle ways:
1) You can't see it coming (unless you pay money to be a subscriber)
2) It doesn't hurt anything, except for webservers, the occasional OC line lit up like New Year's Eve, spammers, and the odd *IAA executive.
3) A tornado doesn't typically smell like armpits, Cheetos, empty 64oz soda cups, burning plastic, your parents' basement and/or too much cologne for that first date.
4) It travels at the speed of light, a lot quicker than a tornado.
5) Does not require specific atmospheric conditions to be present...just a link on the front page.
Anything else?
You're a VB programmer, aren't you?
As the air to a bird or the sea to a fish, so is contempt to the contemptible -W.B.
Many companies in the World Trade Center thought that off-site backup meant the other building.
Cave, wreck, and deep diver.
that's computerworld receiving the /.ing
the isp is here
picture of the aftermath here
There is much cruelty in the universe, John.
Yeah, we seem to have the tour map.
I am also a former Aeneas customer.
Unless Aeneas has made some major changes they are quite certainly the worst ISP I have ever worked with. Aeneas has contracts with the Jackson-Madison County School System to provide internet service district wide. The quality of such service is, bar none, the worst I have experienced.
I did some volunteer work at a local elementary school helping teachers work out any lingering computing problems they had (viruses, printer drivers, misconfigured IP settings, file transfer to a new computer, etc.). The internet service I experienced while I was there led me to believe I was on a 128k ISDN line. Not until I went to the server room did I realize that I was, in fact, on a T1. Now this was during the middle of summer; maybe four other people were in the building, three of whom were in the same room as myself. The service was also intermittent, with several dead periods while I was working. Needless to say, I remained unimpressed by said experience.
When I was an Aeneas dialup customer, in 1998, the service provided by Aeneas was also subpar. The dialup speeds were averaging 21.6kbps, whereas when I switched to U.S. Internet (now owned by Earthlink) my dialup speeds were always above 26.4kbps (except on Mother's Day). There were frequent disconnections, and they had a limit of 150hrs/month.
I'm not surprised by how easy it is to restore subpar service. All they had to do was tie together the strings that are their backbone.
Yep, that's the way it works. I don't crawl around on the floor plugging shit in and getting dirty.
...
They're just added bureaucracy for the computer world, and I work to replace them each and every day with more sophisticated self-administrating software.
If you don't know how to crawl around on the floor plugging shit in and getting dirty, you do not have the perspective necessary to write software to replace the people who do. The best programmers are not arrogantly disconnected from the people in the trenches, especially if they're working on software directed towards their field. A good programmer needs at least to know what people commonly need support about in order to address it in future software. If your CTO is as out of touch and disconnected as you, I pity your fellow employees.
You're also a poor team player, which is a liability to you and your career unless you work solo. You're also incredibly stuck up and elitist, which unfortunately probably actually helps your career. You're also way off base: you obviously consider yourself "above" the type of people who enjoyed this article, and your comments have been way more of an advertisement for yourself than anything to do with the issue. Why don't you drop out of this conversation and let the high school kids who spend all day plugging shit in enjoy it? Believe it or not, there are a lot more nerds in high schools than in high-paying programming positions. That being the case, this site should have more stories about them than you.
72 hours seems way too long to be out of business. That's 3 days in which the ISP is not pulling in dough. Unless the whole internet is crippled, I'd ditch an ISP that was out for three days. One of the main selling points for an ISP is connectivity in rain, snow, shine, OR rabid squirrels...
The company (ISP/consulting/services hosting) I used to work for had a DR plan to be executed in 24 hours with 75% functionality. Offsite servers and backups of course...
More impressive to me are the World Trade Center folks like American Express and other companies that had DR plans situated across the river. A lot of datacenters and information services were functional again within 18-24 hours. That's PPP PPP (prior planning prevents piss-poor performance).
I write good sigs on my bathroom wall...but this is not a real sig.
That was on a Monday. The next Monday was the Northridge quake.
They came into the next meeting a couple of weeks after the quake with a whole new perspective on disaster planning and training:
I think a lot of sites already have contingency plans for sudden traffic increases, and if not, they begin to think about them very seriously once they get a large spike in traffic that causes disruption of service. Even with traffic spike contingency plans, the level you establish as the maximum amount of traffic that you need to be able to sustain, and what amount of latency or down time is acceptable to the business, can be and often is debated ad nauseam. It costs a lot of money to maintain readiness for, say, double or triple normal site traffic for a large site, and you have to make a business case for balancing that cost with the cost of an outage due to increased traffic.
There are several things you can do to quickly add the capability to handle additional load, and most of them rely on forethought when establishing contracts with your colocation facilities and software/hardware vendors. For instance, most large colo facilities allow you to reserve additional bandwidth capability. You may pay more for that privilege, but that's part of the cost of preparedness. Also, you may purchase or lease additional hardware, have it set up and ready to install in a short amount of time, but not use it on a regular basis because of high licensing costs.
Licensing costs for database software can be enormous, but in the event of a large spike in traffic, turning on an additional 20 or 30 CPUs on a large database server could save the company a lot of money in lost revenues, especially if your database software vendor specifically allows this in your contract. If the contract doesn't allow this, you may end up paying a lot more in licensing fees than you would have made in revenue during the outage.
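To put rough numbers on that trade-off, here's a minimal sketch in Python. Every figure in it is made up for illustration (revenue per hour, outage length, per-CPU fees), and the function name is hypothetical; none of it comes from a real contract or vendor price list.

```python
def emergency_cpu_decision(revenue_per_hour, expected_outage_hours,
                           extra_cpus, license_fee_per_cpu):
    """Compare revenue lost to an outage with the cost of licensing extra CPUs.

    All inputs are hypothetical planning estimates.
    """
    lost_revenue = revenue_per_hour * expected_outage_hours
    licensing_cost = extra_cpus * license_fee_per_cpu
    return {
        "lost_revenue": lost_revenue,
        "licensing_cost": licensing_cost,
        # Turning the CPUs on only pays off if it costs less than the outage.
        "turn_on_cpus": licensing_cost < lost_revenue,
    }


# Hypothetical: $20,000/hour in sales, a 6-hour outage, and 20 extra CPUs
# at a pre-negotiated emergency rate of $4,000 each.
covered = emergency_cpu_decision(20_000, 6, 20, 4_000)

# Same spike with no emergency clause: full price, assumed $40,000 per CPU.
uncovered = emergency_cpu_decision(20_000, 6, 20, 40_000)

print(covered)
print(uncovered)
```

With these invented numbers the emergency clause makes the extra CPUs worth turning on, while the uncovered case costs far more than the outage itself. The point is only that the comparison is cheap to run ahead of time, when you can still negotiate the contract.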
My main point here is that planning for extra traffic is a big cost-benefit balancing act, and it requires a lot of forethought. Most large software, hardware and service providers allow for emergency clauses in contractual agreements, but it's often up to the customer to specifically call those out.
But then again, it's like insurance. You hope you don't need it, but you're glad you have it when you do. And you have to pay for it even if you don't need it.
Also, when you plan for a traffic spike, you need to consider the source of the traffic. Denial of service attacks are often easy to mitigate with common network practices, and it's just a matter of preparing for those. But real, human-driven traffic is much different, less predictable, and actually capable of generating revenue.
Understanding your company's site infrastructure, software architecture and day-to-day traffic patterns is very important when it comes to handling real traffic spikes. When a real spike happens, network operators, developers and database admins (among others), will probably need to jump into action, looking for and attempting to mitigate bottlenecks as they appear. This can be a difficult task, and there's nothing worse than knowing what the problem is and not being able to do anything effective to combat it in a reasonable amount of time.
Real traffic doesn't just come from other sites; it can also be driven by other forms of communication, such as television, print and other media... even word of mouth (although I haven't seen an example of this). A large, syndicated national television news program that runs during primetime can generate a lot more traffic than most web sites, and those spikes seem to grow by orders of magnitude as the duration and repetition of air time increases. A fifteen minute segment that is marginally compelling might be enough to swamp all but the largest and most prepared sites. The silver lining of the television spike is that it declines very quickly after the segment ends.
A spike from multiple media sources, for instance print, web, and television, could be very difficult to handle, both in magnitude and duration. Although, duration isn't often a problem, because even the most prepared sites will succumb under a huge spike and
When you go to a DRP seminar, they make the claim that the majority of businesses that are knocked out for longer than 48 hours go out of business within 1 year.
This was from a magazine for managers, after all. Now there's some good news that pointy-haired bosses can understand!
30 days may be a bit much, but as I found out one day, 48 hours comes close to being too little in some situations. We had a massive generator capable of running most of our 4-story suburban office building for a couple of days, including the datacenter, AC for the datacenter, lights, and desktops. It would not run AC for the rest of the building or the elevator. At the ~35% load we placed on it, and with its 500 gallon tank, the engineer from Caterpillar said it should run for around 48 hours. Well, we called our fuel supplier to get some offroad diesel delivered the next morning. No can do, they no longer stock it!?!? WHAT! Then we tried every other listed company in the area; none of them could get to us the next day with fuel. We ended up getting a fuel company to deliver 300 gallons from Detroit to our offices in Akron, Ohio, paying a $500 delivery charge and 70 cents a mile. After that we made sure to get a contract with a fuel company that guaranteed 24 hour delivery of offroad diesel =)
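For what it's worth, the runtime math is easy to sanity-check. A rough sketch in Python, assuming fuel burn stays constant at that ~35% load (a simplification; a real generator's consumption curve comes from the manufacturer's data sheet):

```python
TANK_GALLONS = 500        # tank size from the post
RUNTIME_HOURS = 48        # engineer's estimate at ~35% load
DELIVERY_GALLONS = 300    # emergency delivery from the post

# Assumption: the burn rate is constant at that load.
burn_rate = TANK_GALLONS / RUNTIME_HOURS       # gallons per hour
extra_hours = DELIVERY_GALLONS / burn_rate     # runtime the delivery buys

print(f"burn rate: {burn_rate:.1f} gal/hr")
print(f"{DELIVERY_GALLONS} gallons buys about {extra_hours:.1f} more hours")
```

Under that assumption the generator drinks a little over 10 gallons an hour, so the 300 gallon delivery bought only a bit more than one extra day. That's why a guaranteed 24 hour delivery contract matters more than the size of any one delivery.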
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.