Server Failure Destroys Sidekick Users' Backup Data

← Back to Stories (view on slashdot.org)

Server Failure Destroys Sidekick Users' Backup Data

Posted by timothy on Saturday October 10, 2009 @09:29PM from the oh-well-enough-said dept.

Expanding on the T-Mobile data loss mentioned in an update to an earlier story, reader stigmato writes "T-Mobile's popular Sidekick brand of devices and their users are facing a data loss crisis. According to the T-Mobile community forums, Microsoft/Danger has suffered a catastrophic server failure that has resulted in the loss of all personal data not stored on the phones. They are advising users not to turn off their phones, reset them or let the batteries die in them for fear of losing what data remains on the devices. Microsoft/Danger has stated that they cannot recover the data but are still trying. Already people are clamoring for a lawsuit. Should we continue to trust cloud computing content providers with our personal information? Perhaps they should have used ZFS or btrfs for their servers."

15 of 304 comments (clear)

Min score:

Reason:

Sort:

As if millions... by Anonymous Coward · 2009-10-10 21:33 · Score: 5, Funny

homemade cell phone porn videos cried out and then were silenced.
"they should have used ZFS or btrfs" by Manip · 2009-10-10 21:34 · Score: 5, Insightful

This seems a rather silly point to make. I know this is Slashdot and we have to suggest Open Source alternatives but throwing out random file systems as a suggestion to fix poor management and HARDWARE issues is some place between ignorant and silly.
Perhaps they should have had at least mirrored or stripped raid, with an off-site backup every week or so?
1. Re:"they should have used ZFS or btrfs" by sopssa · 2009-10-10 22:12 · Score: 5, Insightful
  
  Exactly, this can be a software bug too and that could possibly easily destroy or corrupt backup data too. I really doubt this service was ran without backups.
  The type of filesystem has nothing to do with this.
2. Re:"they should have used ZFS or btrfs" by Znork · 2009-10-10 22:47 · Score: 5, Insightful
  
  I really doubt this service was ran without backups.
  Knowing 'enterprise' backups I'd bet there was at least a backup client installed and running. However, I'm equally sure that the backups were, at best, tested once in a disaster recovery exercise and were otherwise never verified.
  Further, responsibility would probably be shared between a storage department, a server operations department and an application management department, neatly ensuring that no single person or function is in the position to even know what data is supposed to be backed up, what limitations there are to ensure consistency (cold/hot/inc/etc), to monitor that that's actually what does happen and that it keeps happening as the application and server configuration evolves.
  Backups of dubious value do not seem to be a rarity in enterprise settings.
3. Re:"they should have used ZFS or btrfs" by malchus842 · 2009-10-10 22:49 · Score: 5, Interesting
  
  One reason why our corporate policy is that we actually have to validate backups for every system on a regular basis (this means doing a full restore of a tape called from off-site), where the regularity is directly proportional to the criticality of the system. The more critical, the more often we test. On our iSeries, they restore the weekly backup tape EVERY week on the QA server - both for the purposes of refreshing it, AND to validate the backups. We also have a quarterly 'random' test where a system is chosen randomly and it must be recovered from bare metal using only our standard procedures + the backup tape.
  We've discovered all kinds of strangeness with backup tapes through the years. Our Tier 1 systems have completely separate instances in geographically diverse areas, with data-replication.
  Granted, this isn't cheap, but our data isn't either.
4. Re:"they should have used ZFS or btrfs" by asaul · 2009-10-10 23:19 · Score: 5, Interesting
  
  Dubious backups? Depends. We had a system which was a 6TB cluster that was notoriously difficult to back up. This went on for years, it took too long, failures caused issues downstream etc. Then someone took a moment to realise that the application was not capable of re-using that 6Tb of data if it was restored - once the data came in it was processed and archived. To recover the application all they had to do was backup a few gig of config and binaries, and restart slurping data from upstream again. Viola - backup stripped down to nothing, 6TB a day of data less to backup, and next to no failures as it was now so quick to backup.
  Then there is the case of an application which the vendor and application developer signed off on using a backup solution using a daily BCV snapshot. What they failed to tell us was application not only held data in a database, but in a 6G binary blob file buried deep in the application filesystem. If the database and the binary where out of sync in any way, it could mean missed or replayed transactions or generally that the application was inconsistant. As this was an order management platform, that was bad. You can guess the day we found out about this dependancy.... yup, data corruption, bad vendor advice screwed the binary file and all we had to go on was a backup some 23 hours old where the database was backed up an hour after the application. Because of a corresponding database SNAFU, the recover point was actually another day before that, with the database having to be rolled forward. It was at this point we found out the despite the signed off backup solution, the vendors documented recommendations (that were not supplied to us) was that the only good backup was a cold application one - not possible on a core order platform. Thankfully after some 56 hours of solid work the application vendor managed to help sort the issue out and the restore from backup was not actually needed. The backups were never really tested as the DR solution worked on SRDF - the DR consideration for data corruption was never really part of the design (from a very high level, not just this platform).
  So there you have it. Two dubious Enterprise backups - one not needed, the other not usable.
  
  --
  "If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
5. Re:"they should have used ZFS or btrfs" by petes_PoV · 2009-10-10 23:35 · Score: 5, Insightful
  
  It's not a backup unless you can prove it will restore. Until then it's just a waste of tape, or disk, and time
  The point about backups is not to tick the box saying "taken backup?" but to provide your business / customers / whatever with a reliable last resort for restoring almost all their data. If you don't have 100% certainty that it will work, you don't have a backup.
  
  --
  politicians are like babies' nappies: they should both be changed regularly and for the same reasons
6. Re:"they should have used ZFS or btrfs" by vk2 · 2009-10-11 00:36 · Score: 5, Interesting
  
  Thats why you have logical redundancies. I work for a fortune 10 company and this is a standard practice for all mission critical applications. The application has be to geographically redundant with install base at least at 3 data centers (ATL,SEA and DLS). Different SAN technology at each DC. All Oracle databases have 2 physical dataguard configuration with 4 hours and 8 hours latency (to guard against user errors) and all J2EE apps hard configured to switch connections from one db to the other almost on the fly or with a reboot. Some really really critical databases have all this and transaction duplication via Goldengate to remote databases to off load reporting queries. We have had issues where SAs screwed up allocating LUNs and ended up f*cking up the file systems but we recovered in every scenario even a 30 TB DB restore over 2 days.
  
  Its amazing a consumer serving company like T-Mobile risked itself by hosting their application on Microsoft platform;. Furthermore where is the DR in all this? Who the F*ck in the right mind fiddle something on SAN without confirming a full backup of all applications/databases? It appears that Hitachi and Microsoft are at fault here (if SAN maintenance is the root cause of this failure) but T-Mobile is the fool allowing these companies to ruin their data. Not only there won't be any consequences because of this issue to MS or Hitachi - T-Mobile will be pouring in more money to fly in the MS and Hitachi consultants.
  
  --
  No Sig for you.!
7. Re:"they should have used ZFS or btrfs" by jimicus · 2009-10-11 01:08 · Score: 5, Informative
  I've always been amazed that tape is trusted as much as it is. It seem (anecdotally at least) to have a disproportionately high failure rate.
  I'm not sure that's the problem so much - after all, LTO has a read head positioned directly after the write head and automatically verifies as it goes along. A tape error is dead easy to spot.
  There are a number of places where things can fall apart, and tapes don't even need to come into the matter:
  Nobody checking the logs
  Failure to understand the processes necessary to get a good backup. (You can't just dump the files that comprise a database to disk - you must either quiesce the database or use the DBMS' inbuilt backup routine - or you will wind up with inconsistent files and hence an inconsistent database. You'd be amazed how many people don't understand this.)
  Failure to maintain backup processes. (When you moved the database to another disk because you were running out of space, you did update your backup process? Right?)
  Not doing any test restores.
  Not doing enough test restores, or doing them carefully enough. (If you're unlucky, your database will come back up OK even though you didn't quiesce it before carrying out the backup. Why do I say unlucky? Well, if it had not come up OK, you'd know immediately that there was a problem with your process. Then once the database is back up, make sure you check the restored data to ensure that recent transactions which should be on the backup actually are).
Re:Backups? by TheSunborn · 2009-10-10 21:47 · Score: 5, Insightful

Or this was really a software error, and the backup servers in an other datacenter, just copied the faulty data/delete command.
They should really be far to big to have all their data stored in a single datacenter with no offsite backup. (Or they should have an entry on thedailywtf.com)
Microsoft was testing the US gov edition by AHuxley · 2009-10-10 21:58 · Score: 5, Funny

Right feature, wrong server? MS understands the need for a "Rose Mary Stretch" default setting.
The congress critters have learned a lot from the "terrible mistake" of email backups.
From cute page boys to Iran contra, MS can market this as a feature.

--
Domestic spying is now "Benign Information Gathering"
This may have to do with the "Pink" project fiasco by HonestButCurious · 2009-10-10 23:32 · Score: 5, Interesting

According to a very long article on AppleInsider:
http://www.appleinsider.com/articles/09/10/09/exclusive_pink_danger_leaks_from_microsofts_windows_phone.html&page=3
MS was misleading T-Mobile about the state of Sidekick support, and apparently charging hundreds of millions every year for, and I quote "a handful of people in Palo Alto managing some contractors in Romania, Ukraine, etc". This is apparently because most of the Sidekick devs had either moved to Pink or quit out of disgust.
Yesterday... all those backups seemed a waste... by argent · 2009-10-11 00:18 · Score: 5, Funny

Yesterday,
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.
Suddenly,
There's not half the files there used to be,
And there's a milestone hanging over me
The system crashed so suddenly.
I pushed something wrong
What it was I could not say.
Now all my data's gone and I long for yesterday-ay-ay-ay.
Yesterday,
Need for backup seemed so far away.
Seemed my data were all here to stay,
Now I believe in yesterday.
Anonymous
Claimed information from the inside by cshbell · 2009-10-11 02:08 · Score: 5, Interesting

According to this comment post on Engadget, it was a contractor working for Danger/Microsoft who screwed up a SAN upgrade and caused the data loss. Obviously, take this with a grain of salt until it's substantiated:

"I've been getting the straight dope from the inside on this. Let me assure you, your data IS gone. Currently MS is trying to get the devices to sync the data they have back to the service as a form of recovery.

It's not a server failure. They were upgrading their SAN, and they outsourced it to a Hitachi consulting firm. There was room for a backup of the data on the SAN, but they didn't do it (some say they started it but didn't wait for it to complete). They upgraded the SAN, screwed it up and lost all the data.

All the apps in the developer store are gone too.

This is surely the end of Danger. I only hope it's the end of those involved who screwed this up and the MS folks who laid off and drove out anyone at Danger who knew what they were doing.
"Epic fail" doesn't begin to describe this one.
Re:WTF by Locutus · 2009-10-11 03:02 · Score: 5, Interesting

from that sounds of it, Microsoft couldn't turn Danger into a WinMo platform so they gutted it of employees instead of spinning it back off since they'd rather have it dead than spreading more Java but not dead before they had Pink out the door. So when you fire everyone from the top downward, you end up with people who's job is to turn the lights off when the doors get locked for good. they're not motivated much nor are they skilled in all of what used to be required to run the shop. Auto-pilot mode comes to mind.

So maybe the backup system needed to be checked or a CRON job verified or maybe the computer in Joe Fired's office was part of the backup process in some little way but important enough that the whole job was failing every night.

As I said, Microsoft tried to replace the Danger stack with Microsoft software but it wasn't going to work or got too much backtalk( thinking of Softimage ) and threats of everyone leaving if they had to port to the WiMo pile/stack. They moved anyone who'd go, over to Pink and left the rest to keep life support systems running. oops, they failed.

With Ballmer publicly saying that WinMo has been a failure, he's hearing the press say WinMo 6.5 is a yawn and expectations are that the Sony PS3 will eclipse MS XBox, and recently reading about how he's telling people that IBM doesn't know what they are doing....There's probably a new monkey-boy dance going on inside his office we'd probably love to see. It might be too dangerous being so close as to record it.

Will Microsoft ever make any profits from anything outside of MS Windows and MS Office? Ballmers 8-Ball still seems to be telling him something very different from what everyone else is seeing.

LoB

--
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus