Server Failure Destroys Sidekick Users' Backup Data

As if millions... by Anonymous Coward · 2009-10-10 21:33 · Score: 5, Funny

homemade cell phone porn videos cried out and then were silenced.

Re:As if millions... by davester666 · 2009-10-11 06:31 · Score: 2, Insightful

Is it really 'backup' data?
From the sounds of it, each Danger phone loads its data from the 'cloud' whenever it's powered on, and syncs the data as it changes. To me, this makes the 'cloud' the live data store, and the phone just the local cache...

--
Sleep your way to a whiter smile...date a dentist!

"they should have used ZFS or btrfs" by Manip · 2009-10-10 21:34 · Score: 5, Insightful

This seems a rather silly point to make. I know this is Slashdot and we have to suggest Open Source alternatives but throwing out random file systems as a suggestion to fix poor management and HARDWARE issues is some place between ignorant and silly.

Perhaps they should have had at least mirrored or striped RAID, with an off-site backup every week or so?

Re:"they should have used ZFS or btrfs" by timmarhy · 2009-10-10 21:46 · Score: 4, Insightful

retarded comments like that are the reason these zealots aren't taken seriously in the enterprise.
i'd hazard a guess that the offsite backups were corrupted as well somehow or were silently failing.

--
If you mod me down, I will become more powerful than you can imagine....
Re:"they should have used ZFS or btrfs" by rastilin · 2009-10-10 21:50 · Score: 4, Informative

"This seems a rather silly point to make. I know this is Slashdot and we have to suggest Open Source alternatives but throwing out random file systems as a suggestion to fix poor management and HARDWARE issues is some place between ignorant and silly."
Not as silly as it might appear. One of ZFS's main functions is that it can compensate for some degree of hardware failure.

--
How do you kill that which has no life?
Re:"they should have used ZFS or btrfs" by gravos · 2009-10-10 22:01 · Score: 4, Informative

The current major cloud providers (Google and Amazon) both replicate your permanent data to multiple hard disks (Google: 3, not sure about Amazon) in multiple areas of the datacenter, and I know Google is looking at providing replication to different datacenters (which is more complex than replication in the same datacenter because of the time delay).

--
This game will waste your life. Don't clicky!
Re:"they should have used ZFS or btrfs" by sopssa · 2009-10-10 22:12 · Score: 5, Insightful

Exactly, this could be a software bug too, and a bug could easily destroy or corrupt the backup data as well. I really doubt this service was ran without backups.
The type of filesystem has nothing to do with this.
Re:"they should have used ZFS or btrfs" by Znork · 2009-10-10 22:47 · Score: 5, Insightful

"I really doubt this service was ran without backups."
Knowing 'enterprise' backups I'd bet there was at least a backup client installed and running. However, I'm equally sure that the backups were, at best, tested once in a disaster recovery exercise and were otherwise never verified.
Further, responsibility would probably be shared between a storage department, a server operations department and an application management department, neatly ensuring that no single person or function is in the position to even know what data is supposed to be backed up, what limitations there are to ensure consistency (cold/hot/inc/etc), to monitor that that's actually what does happen and that it keeps happening as the application and server configuration evolves.
Backups of dubious value do not seem to be a rarity in enterprise settings.
Re:"they should have used ZFS or btrfs" by malchus842 · 2009-10-10 22:49 · Score: 5, Interesting

One reason why our corporate policy is that we actually have to validate backups for every system on a regular basis (this means doing a full restore of a tape called from off-site), where the regularity is directly proportional to the criticality of the system. The more critical, the more often we test. On our iSeries, they restore the weekly backup tape EVERY week on the QA server - both for the purposes of refreshing it, AND to validate the backups. We also have a quarterly 'random' test where a system is chosen randomly and it must be recovered from bare metal using only our standard procedures + the backup tape.
We've discovered all kinds of strangeness with backup tapes through the years. Our Tier 1 systems have completely separate instances in geographically diverse areas, with data-replication.
Granted, this isn't cheap, but our data isn't either.
Re:"they should have used ZFS or btrfs" by WarlockD · 2009-10-10 22:55 · Score: 4, Interesting

Ever try to restore from a ZFS corruption? It IS easy and it can be done. However...

What if the data was on an EMC storage array and the tech told them it's all lost? What if you're dealing with a Tier 1 vendor (I am looking at you, Dell EqualLogic) that swears UP and DOWN that there is no way to recover the system after a second drive out of a RAID 5 has been pulled? Hell, try just a standard RAID 5 card from a Tier 1 vendor. (Not talking about calling, say, 3ware support directly; they are honestly good and I have recovered a few arrays with them.)

I "suspect" that they are running it off a storage array that failed big time, or lost the LUN, or just someone decided to die and take the server with it. There is just too much we don't know. Was Danger installed on multi-servers? Was it clustered? Is it a cloud system? Does it run its own storage system or require additional hardware?

But you know what? ZFS, EMC, even Windows 2008: all moot. Why? WHERE ARE THE TAPE BACKUPS?!?! SERIOUSLY. The ONLY way they could have lost ALL that data is that they didn't have a backup solution. Otherwise their "press release" would say "...however we will be restoring the data from last week's/month's tapes..."

I do like how they keep saying "Microsoft/Danger" as if they are at fault. A good admin would expect a new car would catch fire and run into a bus full of nuns.
Re:"they should have used ZFS or btrfs" by mike260 · 2009-10-10 22:55 · Score: 4, Funny

There are plausible reports as to how this happened here.
tl;dr - They tried upgrading their SAN without making a backup first, and the upgrade somehow hosed the entire SAN.
Re:"they should have used ZFS or btrfs" by Anonymous Coward · 2009-10-10 22:58 · Score: 3, Insightful

Repeat after me, you haven't got backups unless you've tested RESTORES.
Re:"they should have used ZFS or btrfs" by Rakshasa+Taisab · 2009-10-10 23:05 · Score: 3, Funny

A bug that sneaks into the two or three offsite locations, destroying the tapes which are randomly checked before being shipped to ensure they contain valid data? Really nasty those bugs.

--
- These characters were randomly selected.
Re:"they should have used ZFS or btrfs" by asaul · 2009-10-10 23:19 · Score: 5, Interesting

Dubious backups? Depends. We had a system which was a 6TB cluster that was notoriously difficult to back up. This went on for years: it took too long, failures caused issues downstream, etc. Then someone took a moment to realise that the application was not capable of re-using that 6TB of data if it was restored - once the data came in it was processed and archived. To recover the application all they had to do was back up a few gig of config and binaries, and restart slurping data from upstream again. Voilà - backup stripped down to nothing, 6TB a day of data less to back up, and next to no failures as it was now so quick to back up.
Then there is the case of an application for which the vendor and application developer signed off on a backup solution using a daily BCV snapshot. What they failed to tell us was that the application not only held data in a database, but also in a 6GB binary blob file buried deep in the application filesystem. If the database and the binary were out of sync in any way, it could mean missed or replayed transactions, or generally that the application was inconsistent. As this was an order management platform, that was bad. You can guess the day we found out about this dependency.... yup, data corruption: bad vendor advice screwed the binary file and all we had to go on was a backup some 23 hours old where the database was backed up an hour after the application. Because of a corresponding database SNAFU, the recovery point was actually another day before that, with the database having to be rolled forward. It was at this point we found out that, despite the signed-off backup solution, the vendor's documented recommendation (which was not supplied to us) was that the only good backup was a cold application one - not possible on a core order platform. Thankfully after some 56 hours of solid work the application vendor managed to help sort the issue out and the restore from backup was not actually needed. The backups were never really tested as the DR solution worked on SRDF - the DR consideration for data corruption was never really part of the design (from a very high level, not just this platform).
So there you have it. Two dubious Enterprise backups - one not needed, the other not usable.

--
"If everybody is thinking alike, somebody isn't thinking" - Gen. George S. Patton
Re:"they should have used ZFS or btrfs" by petes_PoV · 2009-10-10 23:35 · Score: 5, Insightful

It's not a backup unless you can prove it will restore. Until then it's just a waste of tape, or disk, and time
The point about backups is not to tick the box saying "taken backup?" but to provide your business / customers / whatever with a reliable last resort for restoring almost all their data. If you don't have 100% certainty that it will work, you don't have a backup.

--
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Re:"they should have used ZFS or btrfs" by bertok · 2009-10-10 23:57 · Score: 4, Interesting

"There are plausible reports as to how this happened here.
tl;dr - They tried upgrading their SAN without making a backup first, and the upgrade somehow hosed the entire SAN."
That's the thing that has always worried me most about SANs: you have all your eggs in one basket. No matter how redundant or reliable the hardware is, one bad update or trigger-happy admin can cause the instant loss of all your data. That's only slightly better than having your data center burn down. You still have your hardware, but a total restore like that can be a nightmare. I've heard somewhere that 80% of corporations couldn't recover from a scenario like that.
Here's some fun numbers: a typical tape restore runs at something like 70MB/sec, if you're lucky, per tape drive. Some small low-end SANs that I see people buying these days are 10TB or bigger. At those speeds, it takes 40 hours to restore the complete system. What's worse is that it doesn't scale all that well either, you can get more drives, but the storage controllers and back-end FC loops become a limit. If you have some big cloud provider scenario, a complete restore could take days, or even weeks.
What's scary is that mirroring or off-site replicas don't help. If your array starts writing bad blocks, those will get mirrored also.
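The restore-time figure in the comment above is easy to check. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and, optimistically, that extra tape drives scale perfectly; the function name is made up for illustration:

```python
def restore_hours(capacity_tb, mb_per_sec, drives=1):
    """Hours needed to stream capacity_tb back from tape.

    Assumes 1 TB = 1,000,000 MB and that multiple drives run fully in
    parallel, which the comment above notes is not true in practice.
    """
    total_mb = capacity_tb * 1_000_000
    seconds = total_mb / (mb_per_sec * drives)
    return seconds / 3600


# 10 TB at 70 MB/s on a single drive
print(round(restore_hours(10, 70), 1))  # 39.7
```

That comes out at just under 40 hours, matching the number quoted. Real restores will be slower once controller and FC loop limits come into play.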
Re:"they should have used ZFS or btrfs" by vk2 · 2009-10-11 00:36 · Score: 5, Interesting

That's why you have logical redundancies. I work for a Fortune 10 company and this is standard practice for all mission critical applications. The application has to be geographically redundant with an install base in at least 3 data centers (ATL, SEA and DLS). Different SAN technology at each DC. All Oracle databases have 2 physical Data Guard configurations with 4 hours and 8 hours latency (to guard against user errors) and all J2EE apps are hard configured to switch connections from one db to the other almost on the fly or with a reboot. Some really really critical databases have all this and transaction duplication via GoldenGate to remote databases to offload reporting queries. We have had issues where SAs screwed up allocating LUNs and ended up f*cking up the file systems but we recovered in every scenario, even a 30 TB DB restore over 2 days.

It's amazing a consumer serving company like T-Mobile risked itself by hosting their application on a Microsoft platform. Furthermore where is the DR in all this? Who the F*ck in the right mind fiddle something on SAN without confirming a full backup of all applications/databases? It appears that Hitachi and Microsoft are at fault here (if SAN maintenance is the root cause of this failure) but T-Mobile is the fool for allowing these companies to ruin their data. Not only will there be no consequences for MS or Hitachi because of this issue, T-Mobile will be pouring in more money to fly in the MS and Hitachi consultants.

--
No Sig for you.!
Re:"they should have used ZFS or btrfs" by JasonBee · 2009-10-11 00:39 · Score: 2, Interesting

In our environment, a large government shop, our data volumes are capped at around 1 TB of storage for that very reason. Between the SAN, and the tape backups...they just simply have to create a physical cutoff point for data storage due to those onerous recovery periods.
There is nothing wrong in our shop with having TWO 1 TB volumes, but you will never get approved to have one single 2TB. Problem solved...at least for file storage. Database backups are managed via other mechanisms like replication.
Re:"they should have used ZFS or btrfs" by IamTheRealMike · 2009-10-11 01:02 · Score: 4, Informative

I'm not sure what you mean by "cloud provider" as such but Google App Engine has always been replicated across datacenters.
Re:"they should have used ZFS or btrfs" by Tweezer · 2009-10-11 01:03 · Score: 2, Informative

Even with a SAN you need to limit volumes sizes to whatever size you can restore within the acceptable restoration window. There are also those times where you just want to run a chkdsk and if the volume is too big, it takes too long.
That being said, I can't believe they didn't have any backup. Even if they skipped the pre-upgrade backup, they should have had one from last night/week/month. Any of those options would be better than nothing. I have to assume they were doing backup to disk on the same SAN they were upgrading, which is pretty dumb. I still can't understand why they didn't have a backup at another site somewhere else in the world. We do that sort of thing all the time where I work.
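The sizing rule at the top of this comment (cap each volume at what you can restore inside the acceptable window) can be turned into a quick calculation. This is only a sketch: the 70 MB/s throughput and the 4-hour window are assumed example values, not figures from the poster's shop, and the function name is hypothetical.

```python
def max_volume_tb(window_hours, mb_per_sec):
    """Largest volume (in TB, 1 TB = 1,000,000 MB) that can be streamed
    back within window_hours at a sustained mb_per_sec."""
    return window_hours * 3600 * mb_per_sec / 1_000_000


# Assumed example: 4-hour restore window, 70 MB/s sustained restore speed
print(max_volume_tb(4, 70))  # 1.008
```

Under those assumptions the cap lands at about 1 TB per volume. Halve the window or the throughput and the cap halves with it.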
Re:"they should have used ZFS or btrfs" by jimicus · 2009-10-11 01:08 · Score: 5, Informative
"I've always been amazed that tape is trusted as much as it is. It seems (anecdotally at least) to have a disproportionately high failure rate."
I'm not sure that's the problem so much - after all, LTO has a read head positioned directly after the write head and automatically verifies as it goes along. A tape error is dead easy to spot.
There are a number of places where things can fall apart, and tapes don't even need to come into the matter:
- Nobody checking the logs
- Failure to understand the processes necessary to get a good backup. (You can't just dump the files that comprise a database to disk - you must either quiesce the database or use the DBMS' inbuilt backup routine - or you will wind up with inconsistent files and hence an inconsistent database. You'd be amazed how many people don't understand this.)
- Failure to maintain backup processes. (When you moved the database to another disk because you were running out of space, you did update your backup process? Right?)
- Not doing any test restores.
- Not doing enough test restores, or not doing them carefully enough. (If you're unlucky, your database will come back up OK even though you didn't quiesce it before carrying out the backup. Why do I say unlucky? Well, if it had not come up OK, you'd know immediately that there was a problem with your process. Then once the database is back up, make sure you check the restored data to ensure that recent transactions which should be on the backup actually are).
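For plain files, the "test restore" items in the list above can be partly automated by restoring to a scratch location and comparing checksums against a known-good copy. The sketch below is one way to do that with the Python standard library; `compare_trees` and `file_digest` are names invented here. It is a file-level check only, so it says nothing about whether a database restored this way is internally consistent. That still needs the application-level check described above.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(source, restored):
    """Report files missing from the restore and files whose content differs."""
    missing, mismatched = [], []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            mismatched.append(rel.as_posix())
    return {"missing": missing, "mismatched": mismatched}
```

Call it as `compare_trees(Path("/data"), Path("/scratch/restore"))` (both paths are placeholders). Anything in either list means the backup process needs a look before you depend on it.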
Re:"they should have used ZFS or btrfs" by Jezza · 2009-10-11 01:10 · Score: 3, Interesting

The kind of filesystem can help - I'm familiar with ZFS concepts so I'll stick to those:
In ZFS when you write to a file you don't write over the pre-existing data, you write elsewhere and then that gets mapped in upon success; the old data is still there and you can see the aged mapping (you know what was there). Now you can at this point recycle this space. However, you can switch this pruning off, and now you have a complete record of everything that was ever done on the disk. To stop it ever running out of space I can either add disks to the disk-pool, or prune very old data (older than a given age - maybe 6 months?).
So it helps.
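The copy-on-write idea described above can be shown with a toy in-memory model. This is not how ZFS is implemented and it does not use any ZFS API; `CowStore` and its methods are invented for illustration. As far as I know, real ZFS keeps old data around through snapshots rather than a switch that disables reclamation. What the sketch shows is only the principle: writes add a new version instead of overwriting, old versions stay readable, and a prune step drops versions older than a cutoff.

```python
import time


class CowStore:
    """Toy copy-on-write store: a write appends a version, never overwrites."""

    def __init__(self):
        self._versions = {}  # name -> list of (timestamp, data), oldest first

    def write(self, name, data, now=None):
        ts = time.time() if now is None else now
        self._versions.setdefault(name, []).append((ts, data))

    def read(self, name, as_of=None):
        """Newest version, or the newest one written at or before as_of."""
        for ts, data in reversed(self._versions.get(name, [])):
            if as_of is None or ts <= as_of:
                return data
        raise KeyError(name)

    def prune(self, older_than):
        """Drop superseded versions older than the cutoff; keep the newest."""
        for name, versions in self._versions.items():
            kept = [v for v in versions[:-1] if v[0] >= older_than]
            self._versions[name] = kept + [versions[-1]]


store = CowStore()
store.write("contacts", b"v1", now=100)
store.write("contacts", b"v2", now=200)
print(store.read("contacts"))             # b'v2'
print(store.read("contacts", as_of=150))  # b'v1'
```

After `store.prune(older_than=150)` the `as_of=150` read raises `KeyError`. That is the trade-off in the comment: history costs space until you prune it, and pruned history is gone.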
Re:"they should have used ZFS or btrfs" by cupantae · 2009-10-11 01:11 · Score: 2, Insightful

When I read that you had quoted "I really doubt this service was ran without backups," I twitched and the thought "I know it's bad grammar, but let's just ignore it, please" was loud in my ears. I was so relieved when I saw that you weren't mentioning it. I don't know what this makes me, but it happens all the time. I'm definitely bothered by poor grammar and spelling, but I want no one to ever point it out.

--
--
Re:"they should have used ZFS or btrfs" by Antique+Geekmeister · 2009-10-11 01:46 · Score: 2, Insightful

I've had something like that happen. The recovery system for a partner had never been tested with a _full_ recovery, only with recovering a few selected files. But because someone decided to get cute with the backup system to pick and choose which targets got backed up, individual directories each got their own backup target. Thousands and thousands of them. And the backup system had a single tape drive, not a changer.
The result was that to restore the filesystem, the tapes had to be swapped in and out to get the last full dump, then the incremental dump, of _each_ of the thousands of targets. Fortunately for them, I managed to liberate an under-used tape library, but the incredible amount of time having the tape drive grind back and forth to find the different targets on each tape was also incredibly nasty. We helped them find other solutions for that issue, but it was nasty to clean up. And unfortunately for them, they didn't _have_ a large enough repository to have tested the full restoration procedure.
The point is that "random checks" are not enough. You have to actually do a full test, once a year. This is also why I despise people who sell monolithic, "high availability" storage systems that are not partitioned enough to create a mirror of your active data anywhere.
Re:"they should have used ZFS or btrfs" by Anonymous Coward · 2009-10-11 02:04 · Score: 2, Funny

-1 "Thinks he's funny"
Re:"they should have used ZFS or btrfs" by Cylix · 2009-10-11 02:18 · Score: 2, Interesting

Well the first problem was the EMC storage array.
The second problem is believing the tech when he says the data cannot be reclaimed.
The third problem is using a simple raid 5 volume on a great deal of data. Multiple drives fail all the time! Hell, racks of servers fail in unison.
Even if the DCB data is corrupted this can be corrected even on a large SAN.
All or part of the data is generally recoverable.
Either this was an impossibly horribly managed install or something very complex has happened. Generally, the more severe instances are because of multi-faceted failures and not something so simple as lost array data.

--
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
Re:"they should have used ZFS or btrfs" by Alpha830RulZ · 2009-10-11 02:36 · Score: 2, Funny

Something tells me you have grey hair and wrinkles. And I say that in a good way.

--
I was taught to respect my elders. The trouble is, it's getting harder and harder to find some.
Re:"they should have used ZFS or btrfs" by uncleFester · 2009-10-11 02:40 · Score: 2, Informative

"Who the F*ck in the right mind fiddle something on SAN without confirming a full backup of all applications/databases?"
people who drink the kool-aid whenever vendors of said products repeatedly swear up and down all their tasks/patching/operations are 'totally no-impact and no-visibility changes.' combine that with people unwilling to take downtime or spend $$$ to properly protect the contents ahead of time and you have just cooked a recipe for disaster.
-r (not speaking from personal experience.. of course.. :/ )

--
-'fester
Re:"they should have used ZFS or btrfs" by drjzzz · 2009-10-11 03:14 · Score: 3, Interesting

"It's not a backup unless you can prove it will restore. Until then it's just a waste of tape, or disk, and time"
True. There's a similar problem in biological research, where people think they have secured frozen samples but they haven't tested whether the samples are valuable after thawing. For example, frozen cells might not be viable, or RNA might be degraded. Too often the samples are just wasting freezer space. Anybody can freeze (or backup), the question is whether what you thaw (restore) is valuable.

--
to err is human, to forgive is divine, to forget is... umm...
Re:"they should have used ZFS or btrfs" by Antique+Geekmeister · 2009-10-11 03:41 · Score: 4, Funny

It's not the gray hair (or what is left of it!), and those aren't wrinkles. They're laugh lines from the terrific amusement when some youngster ignores the hard-won lessons of the last millennium, especially when they have to call me or someone like me to clean up the mess. The laugh lines are especially deep from when I collected a paper trail to show where their supervisor ignored my written warnings about the danger: those are used with caution, but can be very, very handy.
Re:"they should have used ZFS or btrfs" by runningduck · 2009-10-11 04:04 · Score: 2, Interesting

At the very least they should have been segmenting customer data. How could a single failure outside of a ten mile wide asteroid hit wipe out all customer data? Was everything stored in a single giant registry? I see this a one of the single greatest failings in current system design. Top professionals trust tools more than data design and management processes. I would say the same thing if they were using ZFS or btrfs. Technology is NOT a solution. Technology is at most a tool that contributes to an overall solution. Without proper automated control systems and at least some form of manual verification reliance on pure technology solutions is little more than blind faith.

--
-rd
Re:"they should have used ZFS or btrfs" by AK+Marc · 2009-10-11 08:44 · Score: 2, Informative

Ever have a tape drive with misaligned heads? That one drive and only that one drive will be able to read those tapes, and sometimes even it can't read them after the tape is ejected, but will show OK on a verify done before the tape is ejected. You either have a verified backup that can't be used, or a pile of tapes that are completely useless if that drive ever fails.

I found one of these when doing a backup/restore to upgrade a server (backup the data from ServerA and restore the data on ServerB). It took a while to figure out why the tapes worked perfectly in ServerA and not at all in Server B (internal tape drives, fixed by swapping the drive from ServerA into ServerB for the restore, then discarding ServerA and the drive from it after).

For a server-loss scenario (fire, theft), this means there is no backup, yet something that wouldn't be discovered without restoring on a separate system. No idea how common this is, but in dealing with not many situations where it could pop up, I've seen it all of once.

--
Learn to love Alaska
Re:"they should have used ZFS or btrfs" by Anonymous Coward · 2009-10-11 15:50 · Score: 2, Insightful

Nice background, but all useless when the problem they had was morons upgrading the SAN firmware without a proper backup...
Re:"they should have used ZFS or btrfs" by kiwi-backup · 2009-10-11 18:29 · Score: 2, Insightful

Backup is expensive. A disaster recovery exercise is very expensive and brings no extra value to the customer. Managers want more value for the customer to get more money, not extra expense. It's very hard for the security team to get any time for this kind of thing.
Re:"they should have used ZFS or btrfs" by Cytotoxic · 2009-10-12 03:18 · Score: 2, Informative

We had a similar failure here. Had to replace a battery in a redundant SAN controller... it was under support with the vendor so they sent out a rep to do the fix - everything went just fine. Then poof - one whole shelf went dark. No problem, we designed the system to handle that - all arrays striped vertically with no two drives on any one shelf. Then the vendor took the backup card offline to repair the problem. Poof - another shelf down. Uh, oh! A little more work got the shelves back on line - but the drives had been totally corrupted by the glitchy controller. Luckily, not being idiots our engineers had full backups. Unluckily it took days to fully recover everything. Lesson learned - there is no such thing as a safe fix. We moved critical systems off of our "Fisher-Price SAN" over the next several months and it has not caused any additional catastrophes, but we learned a lot about redundancy - a single hardware failure can cut through a lot of layers of redundancy and bring you down hard when the failure mode is less than "off".
Re:"they should have used ZFS or btrfs" by cbreaker · 2009-10-12 03:43 · Score: 3, Informative

The technology is available to get good, solid backups for anything. They just didn't use it, test it, verify it, etc. And in the case of this, users cannot back up their own data. And what they lost isn't backups.

I used to have one of these things.

The phone is (like someone above pointed out) a local cache of what's on the server side. The live database/back end is what crashed. When you make a change on the phone, it immediately sends that change to the server. You can login to the sidekick web site and make changes there, which appear quickly on your phone. If you reboot your phone, it will retrieve anything it needs from the server side. Apparently, the phone doesn't even keep a permanent local copy on some sort of non-volatile storage (hence "Don't turn off your phone.")
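The arrangement described above (server as the live store, phone as a volatile cache) can be modelled in a few lines. This is a toy sketch built only from the description in this thread, not Danger's actual protocol; the `Server` and `Phone` classes are invented for illustration.

```python
class Server:
    """Stands in for the server-side store, which holds the live data."""

    def __init__(self):
        self.data = {}


class Phone:
    """Volatile local cache with write-through to the server."""

    def __init__(self, server):
        self.server = server
        self.cache = dict(server.data)  # loaded from the server at power-on

    def set(self, key, value):
        self.cache[key] = value
        self.server.data[key] = value  # every change goes straight to the server

    def reboot(self):
        # Nothing is kept locally across a restart: rebuild from the server.
        self.cache = dict(self.server.data)


server = Server()
phone = Phone(server)
phone.set("mum", "555-0100")

server.data.clear()   # the server-side store is lost
print(phone.cache)    # {'mum': '555-0100'}
phone.reboot()
print(phone.cache)    # {}
```

While the phone stays powered on, the cache is the only surviving copy. One reboot replaces it with whatever the server has, which is nothing.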

It's like someone that uses Google apps and stores all their documents on their system. If that system should go down, you'd be screwed, except that you COULD back up your documents locally. With this case, you can not.

I don't really like the term "cloud computing." All it means is server storage somewhere on the Internet. Under this term you could call any web site a "Cloud." It's ambiguous at best.

--
- It's not the Macs I hate. It's Digg users. -
Re:"they should have used ZFS or btrfs" by amicusNYCL · 2009-10-12 04:16 · Score: 2, Funny

I don't know where you hang out at night, but where I hang out people who call themselves things like "webmistressrachel" are not men.
Like I said, your mileage may vary..

--
"Our two-party system is like a bowl of shit looking at itself in a mirror." - Lewis Black
Re:"they should have used ZFS or btrfs" by Anonymous+McCartneyf · 2009-10-12 07:26 · Score: 2, Interesting

The current millennium has only been around for nine years and ten months. (Eight + ten months if you are a traditionalist and think the Nineties ended in 2001.)
Then again, good back-up policy predates computers. If Microsoft/Danger had the same dedication to backups of valuable documents as monasteries did back in the 1000s, this sort of mess wouldn't have happened.

--
There is a fine line between recklessness and courage... -- Paul McCartney

A server failure? by corsec67 · 2009-10-10 21:36 · Score: 3, Informative

A server failure caused all of the data to be lost?

No backups? Not even a spare server with a mirror of the data? No servers in different places? No off-site backup strategy?

As an aside, why would that data be stored in volatile non-battery backed up ram? All of my graphing calculators have a special battery to keep the ram, and they aren't even supposed to store important stuff. Flash is cheap enough these days, why should simply removing the battery cause important data to be lost?

--
If I have nothing to hide, don't search me

Re:A server failure? by Hadlock · 2009-10-10 21:58 · Score: 3, Insightful

Reportedly sidekicks are thin clients, other than making phone calls, everything on the phone is saved on the server side. Which is a special kind of retarded, in today's world where a blackberry performs all the same functions, and provides a local backup feature. But yeah as for the backups, all your backups are worthless if your data backup code is flawed, and nobody ever checks the backup tapes. When MS bought the service, they probably changed the location the servers were in, plugged everything back in, and kept going. I imagine a project like that would be on a short timetable, and "checking to see that the backup tapes are really being backed up to" is low on the priority list when the service is already live.

--
moox. for a new generation.
Re:A server failure? by PolygamousRanchKid+ · 2009-10-10 22:15 · Score: 4, Funny

"A server failure caused all of the data to be lost?"
Maybe it was the server failure . . . maybe they only had one . . . ?

--
Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
Re:A server failure? by Serious+Callers+Only · 2009-10-10 22:44 · Score: 4, Informative

There's some interesting background leaks on the takeover of Danger in this article which seem to imply they cut a lot of staff, and gutted the company, which is now running on a skeleton staff. So I guess it's not too surprising when this sort of mistake is made. Not the most reliable source, but they did definitely cut a lot of Danger staff after the acquisition.
Re:A server failure? by Locutus · 2009-10-11 02:41 · Score: 3, Funny

in hindsight, firing the person(s) doing backups was probably not a good move. ;-)

LoB

--
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus

What about the backups? by christwohig · 2009-10-10 21:38 · Score: 4, Interesting

So are we saying Microsoft didn't have a backup? What about an offsite backup? Who wants to bet they were using their own backup solution? If they had a decent storage array they could have had snapshots and offsite replicas to restore from.

Sidekick by nadaou · 2009-10-10 21:38 · Score: 4, Funny

shit, is that TSR still hanging around? goodness!

If the above means anything to you, "apt-get install joe mc" will make you smile as well.

--
~.~
I'm a peripheral visionary.

Re:Sidekick by tangent3 · 2009-10-11 03:06 · Score: 2, Informative

Ohh yes.. Need an ASCII table? It's just a Ctrl-Alt away

Backups? by ipsi · 2009-10-10 21:41 · Score: 3, Interesting

Either this is a really, really serious meltdown which completely killed not only the server but all their backups as well (and what're the chances of that?), or their IT guys have been really, really slack and just didn't make any backups...

Guess they should have used a better smartphone, like *anything* else on the market... Even the cloud-centric Pre will still work if you don't have access to the Cloud - even if Google and/or Palm dies, you'll still have all your information on your phone! Jesus... Doesn't inspire confidence...

Re:Backups? by TheSunborn · 2009-10-10 21:47 · Score: 5, Insightful

Or this was really a software error, and the backup servers in another datacenter just copied the faulty data/delete command.
They should really be far too big to have all their data stored in a single datacenter with no offsite backup. (Or they should have an entry on thedailywtf.com)

Microsoft/Danger by delta98 · 2009-10-10 21:52 · Score: 3, Funny

'nuff said.

It's The Backups Stooped by tres · 2009-10-10 21:57 · Score: 4, Insightful

This is an issue of irresponsibility. Plain and Simple. The company responsible for maintaining the data should -- at the very least -- have had some full system backup from last month. If they had some old backup somewhere at least you could chalk it up to systems failure or bad backup tape or bad admin or something.

But the fact that there is no backup anywhere indicates brazen negligence on the part of everyone responsible for the data. Everyone who had a part in designing the system and managing the system is culpable. The most ridiculous part of this is the over-reliance on server-side data storage by the sidekick designers.

--
Notes From Under *nix: blas.phemo.us

Re:It's The Backups Stooped by 1s44c · 2009-10-11 03:54 · Score: 4, Insightful

But the fact that there is no backup anywhere indicates brazen negligence on the part of everyone responsible for the data. Everyone who had a part in designing the system and managing the system is culpable. The most ridiculous part of this is the over-reliance on server-side data storage by the sidekick designers.
I will bet you there were good people -SCREAMING- to fix the backups, implement and test failover and all sorts of other good things. In my experience things like this are due to management refusing to spend money fixing problems that have not lost customers yet.

Re:Why not store the data on phone permanent memor by Anonymous Coward · 2009-10-10 21:58 · Score: 4, Informative

Because the entire Sidekick architecture is very client-serverish, not transparent as with ordinary phones (GPRS/EDGE/UMTS/etc. through a NAT to internet at large); the server is supposed to be responsible for all that data, and the phone is just caching it. Given that architecture, asking why the local copy is on volatile RAM is analogous to asking why your CPU doesn't have a battery backup for system RAM, or even L2 cache.

That's one of the big reasons I didn't go with a sidekick, even though they have (or had, last I was shopping around) basically the cheapest internet plans available; they push all sorts of stuff that's handled by the phone in any other system off to the Danger servers. While that does expose you to other people losing your data, as seen here, I didn't even consider that. I just like having a direct internet pipe, so I can run whatever software I want locally.

That said, there are plain benefits to the Sidekick model, for some people. Basically, if you don't want to do funny stuff on your phone, and if you're no less incompetent than the MS/Danger sysadmins, it's better. After all, if you drop your sidekick in a toilet, run over it with a truck, and vaporise it with a plasgun, you can just get a new one and have all your data back -- which is good, since if you're 95% of people, you've _never_ backed up your phone's data. But it's not for me, and given your desire to have your phone work as a PDA even if you power-cycle it in a wilderness/cave/other net-less place, it's not for you either.

Microsoft was testing the US gov edition by AHuxley · 2009-10-10 21:58 · Score: 5, Funny

Right feature, wrong server? MS understands the need for a "Rose Mary Stretch" default setting.
The congress critters have learned a lot from the "terrible mistake" of email backups.
From cute page boys to Iran contra, MS can market this as a feature.

--
Domestic spying is now "Benign Information Gathering"

DIY phone backups by golfnomad · 2009-10-10 22:03 · Score: 4, Informative

There are 3rd party apps out there that will let you "backup" your phone data yourself. I personally use a program called bitpim www.bitpim.org (make sure you d/l latest version). It works with many different phone models and I have used it several times to "restore" my phone data (had 2 phones with hardware issues). It restored my calendar, notes, phone book and ring tones (that last one can save you d/l $$$). It is easy enough to install and use, you do not have to be a total geek to make it functional (but having one available to help you set up backups would probably help). Been working in the IT industry too long to rely on someone else backing up my data for me, and I will not encourage Murphy to have a party in my honor!

WTF by ShooterNeo · 2009-10-10 22:08 · Score: 4, Insightful

This is unbelievably bad. The real problem is : why aren't there incremental off site backups to another server farm? A weekly binary difference snapshot would have made this failure less catastrophic.

Ultimately, with a complex application like this, you can't guarantee 100% that the code doesn't have a bug in it that could result in loss of user data. You can be ALMOST sure it won't, but 100% is not possible with current analysis techniques. (even a mathematical proof of correctness wouldn't protect you from a hacker)

But a properly done set of OFFLINE backups, stored on racks of tapes or hard disks in a separate physical facility : you can be pretty sure that data isn't going anywhere.

Re:WTF by Locutus · 2009-10-11 03:02 · Score: 5, Interesting

From the sounds of it, Microsoft couldn't turn Danger into a WinMo platform, so they gutted it of employees instead of spinning it back off, since they'd rather have it dead than spreading more Java, but not dead before they had Pink out the door. So when you fire everyone from the top downward, you end up with people whose job is to turn the lights off when the doors get locked for good. They're not motivated much nor are they skilled in all of what used to be required to run the shop. Auto-pilot mode comes to mind.

So maybe the backup system needed to be checked or a CRON job verified or maybe the computer in Joe Fired's office was part of the backup process in some little way but important enough that the whole job was failing every night.

As I said, Microsoft tried to replace the Danger stack with Microsoft software but it wasn't going to work or got too much backtalk (thinking of Softimage) and threats of everyone leaving if they had to port to the WinMo pile/stack. They moved anyone who'd go over to Pink and left the rest to keep life support systems running. Oops, they failed.

With Ballmer publicly saying that WinMo has been a failure, he's hearing the press say WinMo 6.5 is a yawn and expectations are that the Sony PS3 will eclipse MS XBox, and recently reading about how he's telling people that IBM doesn't know what they are doing....There's probably a new monkey-boy dance going on inside his office we'd probably love to see. It might be too dangerous being so close as to record it.

Will Microsoft ever make any profits from anything outside of MS Windows and MS Office? Ballmer's 8-Ball still seems to be telling him something very different from what everyone else is seeing.

LoB

--
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus

Re:See it as an opportunity by AnotherUsername · 2009-10-10 22:52 · Score: 3, Insightful

Now is the opportunity for opensource to show what it's good for. Someone whip together a small app to extract all info from the Sidekick, put it up on sourceforge for FREE and you have tons of goodwill for OSS. Of course, the app should be Linux-only, thus forcing all Sidekick users to install Ubuntu...

Thus eliminating any goodwill that would have been gained...

Really, if you think that open source is a viable option for the masses, you shouldn't care which operating system a powerful application like the one you describe is on. If you really care about using open source for goodwill, releasing it simultaneously on all operating systems should be your goal. How is forcing people to use Ubuntu via software applications any different from Microsoft forcing people to use Windows via software applications?

--
I don't like Linux. This doesn't make me a troll.

Bad brand by MM-tng · 2009-10-10 23:00 · Score: 2, Funny

It's like being kicked in the side.

T-Mobile Press Release by mr_lizard13 · 2009-10-10 23:02 · Score: 2, Funny

All your data are lost by us.

--
"We live in a global world" - Harvey Pitt, former Securities and Exchange Commission Chairman

The clue is in the name of the software by Barsteward · 2009-10-10 23:10 · Score: 2, Funny

Microsoft/Danger

--
"The hands that help are better far than lips that pray." - Robert Ingersoll (1833-1899)

Thin client: Android, too? by KlaymenDK · 2009-10-10 23:12 · Score: 2, Insightful

Reportedly sidekicks are thin clients, other than making phone calls, everything on the phone is saved on the server side. Which is a special kind of retarded

Isn't that also how Android works?

I mean sure, the apps and such are on internal flash, but it's a different story for your "important" data such as email or contacts list. Heck, as I've learned, one can't even read one's existing ("synced") email without a working web connection. How they can call that "syncing", and what it's doing besides simple header indexing, is beyond me.

This is another reason I am loath to trust "the cloud" -- if I know I can be self-sufficient (in a data accessibility context), that's going to be much better than storing things on a corporate server and hope that said corporation is not going to, um, fall from the sky.

--
"Good news, everyone!"

Re:Thin client: Android, too? by RedK · 2009-10-11 01:33 · Score: 4, Informative

No, it's not how Android works, or how the iPhone works either. You can have cloud enabled applications, but you can also have local storage based ones without any problems. There is nothing in the SDKs that forces you to use the cloud for storage at all.

--
"Not to mention all the idiots who use words like boxen."
Anonymous Coward on Monday August 04, @06:49PM
Re:Thin client: Android, too? by hedwards · 2009-10-11 02:17 · Score: 2, Informative

It's not as much of an issue. You might be using a product that the Data Liberation Front hasn't gotten to, but Google does have people working on any of those applications to make it possible to make one's own backup. I'm not sure what specifically triggered that, but I keep a backup of any important information on my computer which is backed up to my local backup mirror and remotely.

RIP Sidekick by drinkypoo · 2009-10-10 23:13 · Score: 4, Insightful

With all the competition in the smartphone market today, this is probably an unrecoverable error. If they manage to recover the data then they will come off as heroes for having the courage to tell their customers promptly. Otherwise they just look like what they are: incompetent. No great loss, though.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Re:undelete (not de-corrupt) by myxiplx · 2009-10-10 23:16 · Score: 2, Informative

Yes, it's called a snapshot. Take a snapshot and you can either roll the entire system back to that point in time, or just browse its contents and extract the files you want.
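
As a rough illustration of the two operations described above (roll everything back, or pull a single file out of a snapshot), here is a toy in-memory sketch. The `SnapshotStore` class and its method names are made up for this example; real filesystems such as ZFS implement snapshots with copy-on-write rather than full copies.

```python
import copy


class SnapshotStore:
    """Toy in-memory file store with point-in-time snapshots (hypothetical)."""

    def __init__(self):
        self.files = {}        # live data: path -> contents
        self.snapshots = {}    # snapshot name -> frozen copy of the files

    def snapshot(self, name):
        # A real filesystem would use copy-on-write; a deep copy is enough here.
        self.snapshots[name] = copy.deepcopy(self.files)

    def rollback(self, name):
        # Roll the entire store back to the state captured by the snapshot.
        self.files = copy.deepcopy(self.snapshots[name])

    def extract(self, name, path):
        # Browse a snapshot and pull out one file without rolling back.
        return self.snapshots[name][path]


store = SnapshotStore()
store.files["contacts.txt"] = "alice, bob"
store.snapshot("before-upgrade")

store.files["contacts.txt"] = ""                        # simulated corruption
single_file = store.extract("before-upgrade", "contacts.txt")
store.rollback("before-upgrade")
```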

Irresponsibility to EPIC proportions. by MrCrassic · 2009-10-10 23:25 · Score: 2, Insightful

HOW THE HELL DO THEY NOT HAVE OFF-SITE TAPE BACKUPS????

So essentially, everybody's Sidekick backup data, which is apparently critical should they ever lose power, was all concentrated on A SINGLE SERVER? I hope they at least say their tape backups caught fire and their replicated server died on the same day too...

Their retention lines are going to be hot this Columbus Day weekend! The iPhone is getting cheaper...

Re:Irresponsibility to EPIC proportions. by AHuxley · 2009-10-10 23:48 · Score: 2, Insightful

"Back him up, boys!"
T-Mobile says, "but I thought you were going to back us up!"
Robbie says, "We didn't get rich buying a lot of servers, you know!"

--
Domestic spying is now "Benign Information Gathering"

This may have to do with the "Pink" project fiasco by HonestButCurious · 2009-10-10 23:32 · Score: 5, Interesting

According to a very long article on AppleInsider:
http://www.appleinsider.com/articles/09/10/09/exclusive_pink_danger_leaks_from_microsofts_windows_phone.html&page=3

MS was misleading T-Mobile about the state of Sidekick support, and apparently charging hundreds of millions every year for, and I quote "a handful of people in Palo Alto managing some contractors in Romania, Ukraine, etc". This is apparently because most of the Sidekick devs had either moved to Pink or quit out of disgust.

Huh? by msauve · 2009-10-10 23:36 · Score: 2, Insightful

"incremental..."weekly binary difference"

Uh, those would do nothing in this case, where it appears the entire DB has been lost. You need a regular full backup, or diffs and incrementals are just cruft. It appears they don't even have that, since there's no talk of restoring to month (or ?) old data.
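
To make the dependency concrete, here is a minimal sketch of a restore that replays incrementals on top of a full backup. The data layout (a dict of changes, with `None` marking a delete) and the `restore` function are assumptions for illustration, not any particular backup tool's format.

```python
def restore(full_backup, incrementals):
    """Rebuild state from one full backup plus ordered incrementals.

    Each incremental is a dict of changes: key -> new value, or None for a delete.
    """
    if full_backup is None:
        raise ValueError("no full backup: incrementals alone cannot be restored")
    state = dict(full_backup)
    for delta in incrementals:
        for key, value in delta.items():
            if value is None:
                state.pop(key, None)   # the key was deleted in this incremental
            else:
                state[key] = value     # the key was added or changed
    return state


full = {"contacts": "v1", "calendar": "v1"}
deltas = [{"contacts": "v2"}, {"notes": "v1", "calendar": None}]
restored = restore(full, deltas)
```

Without the base image there is nothing for the deltas to apply to, which is why the call with no full backup raises instead of returning partial data.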

--
"National Security is the chief cause of national insecurity." - Celine's First Law

Re:Huh? by kobaz · 2009-10-11 03:20 · Score: 2, Interesting

"incremental..."weekly binary difference"
Uh, those would do nothing in this case,
I agree. Weekly? WEEKLY?!!! What is this... 1980? Hell even in 1980 people with critical data on their apple2 spreadsheet kept more than one copy of their data on a daily basis.
I'm not sure why, but one of our customers had a backup daemon running with just incrementals being done. There was one full backup done two years ago and an incremental every night. Well.. they had a computer fry one weekend. It was a crappy windows backup program with only a point and click interface. No way in hell am I going to sit there for days and click restore on 600+ individual backups. I wrote a pretty cool little windows script using autoit3. It was a real pita to write though since every button clicked had to have a "wait-for-next-window" sequence. After five days of the restore script running, they were back in business.
Since then I've gone through every customer's system and made sure they have full backups done weekly and incrementals done daily. And we also do routine backup testing.
A good quote:
"A backup is not a backup, until you try and restore from it"

--

The goal of computer science is to build something that will last at least until we've finished building it.
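
The "not a backup until you restore from it" rule in the comment above can be automated. The sketch below archives a directory, restores it into a scratch directory, and compares checksums. The function names and the tar.gz format are illustrative choices, and the archive is assumed to live outside the directory being backed up.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's relative path to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup_and_verify(source_dir, archive_path):
    """Archive source_dir, restore it to a scratch directory, compare checksums."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source_dir), arcname=".")
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(str(archive_path), "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(source_dir) == checksums(scratch)


with tempfile.TemporaryDirectory() as work:
    data = Path(work) / "data"
    data.mkdir()
    (data / "contacts.txt").write_text("alice\nbob\n")
    verified = backup_and_verify(data, Path(work) / "backup.tar.gz")
```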

What do you expect with a name like by Masterofpsi · 2009-10-10 23:42 · Score: 2, Funny

Danger?

Interesting article about Pink/Danger/Sidekick by Richard+W.M.+Jones · 2009-10-10 23:46 · Score: 4, Interesting

Interesting article about the Microsoft/Pink/Danger/Sidekick relationship and leaks indicating that Microsoft are trying to kill Sidekick without telling the partners. Microsoft would never do such a thing of course ...

Rich.

--
libguestfs - tools for accessing and modifying virtual machine disk images

It is an ancient story, endlessly repeated by SmallFurryCreature · 2009-10-10 23:52 · Score: 4, Informative

It is development dome.

Two companies enter, MS comes out, slightly fatter.

If you do business with MS, you are riding a tiger with the brains to realize that lunch is only a roll on the ground away.

MS really should be renamed to BubbaSoft. Get into the shower with BubbaSoft and you know what is going to happen.

--

MMO Quests are like orgasms:

You may solo them, I prefer them in a group.

Re:It is an ancient story, endlessly repeated by harmonise · 2009-10-11 05:32 · Score: 4, Funny

Get into the shower with BubbaSoft and you know what is going to happen.
Just don't drop the SOAP.

--
Cory Doctorow talking about cloud computing makes as much sense as George W Bush talking about electrical engineering.

Re:Irresponsibility to EPIC proportions. -- yes by MrCrassic · 2009-10-11 00:09 · Score: 3, Interesting

A) The Sidekick apparently doesn't store anything, so customers can't make backups that easily, even if they wanted to, and

B) Danger designed this phone to store everything server-side. It is incomprehensibly foolish to not include a SUPER SOLID backup strategy as well. This problem has been ongoing for several days now; I don't know if the data was fine at the onset of this problem, but the infuriated customers have all the right to demand everything AND the kitchen sink for losing practically everything they had.

Yesterday... all those backups seemed a waste... by argent · 2009-10-11 00:18 · Score: 5, Funny

Yesterday,
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.

Suddenly,
There's not half the files there used to be,
And there's a milestone hanging over me
The system crashed so suddenly.

I pushed something wrong
What it was I could not say.
Now all my data's gone and I long for yesterday-ay-ay-ay.

Yesterday,
Need for backup seemed so far away.
Seemed my data were all here to stay,
Now I believe in yesterday.

Anonymous

Foggy idea? by Porchroof · 2009-10-11 00:22 · Score: 2

Cloud computing?

That ain't no cloud. That's the fog obscuring the view of sanity.

IT has been trying this crap ever since the emergence of personal computers.

--
Fata viam invenient.

I work in telecom - Sr Tech Arch by Anonymous Coward · 2009-10-11 00:46 · Score: 3, Interesting

I work in telecom at a different provider. SAN upgrades are performed by the SAN vendor and, IME, they always demand a complete backup prior to starting any work unless the customer demands otherwise. If the customer doesn't want the backup, we always had to get a Sr VP to sign off. There were about 10 Sr VPs in the company - not like at a bank where everyone is a VP.

Usually, we would perform firmware upgrades only when migrating from old SAN equipment into new. The old equipment would be upgraded and used to upgrade either lower performing SAN or directly attached disk arrays that had been neglected for 5+ years. Being out of warranty was avoided. Most data is too important to risk that.

BTW, we measured storage in petabytes and our storage team was **never** on the cutting edge. We were always 2+ years behind other BIG companies. Our labs may have this quarter's latest and greatest, but it would take years to get from the lab into production service. That drove some vendors nuts, but not the "names you know."

I saw where someone above said they randomly verified recovery quarterly. What a joke. On my systems (Sr Tech Arch), we deployed with redundant systems at least 500 miles apart. Many systems did have instant fail over, but if instant fail over was not possible due to the amount of data, **never** would we lose more than 24 hours worth of data. Between RAID-10, near disk backups, tape backups, remote replication and backups at the alternate location, we had the data. Further, to verify the alternate system worked, we swapped primary production locations every week. I and my internal customer slept very well, thank you.

I have a good friend who works at T-Mobile in their architecture design team. It will be interesting to see whether this subcontractor had anything to do with the issues. I called T-Mobile for an unrelated personal item on Tuesday, they were already swamped with calls and said that a sub to Microsoft was working the issue. I'm thinking MS outsourced/bought the provider and the garage shop team was still running things - but I don't know. I do know that Microsoft has excellent engineers for systems like this and they are more cautious than Google with their upgrades and deployed systems. Over the years, I've had to deploy a few Windows-Server-based solutions - usually for voice response systems. I was never really happy doing it. I don't trust backup systems much unless it is really a mirror that I can get to 1 file from 3 weeks ago easily.

Ok, back to upgrading the company email servers. A system version upgrade will impact users for less than 10 minutes - probably under 3 minutes, but we like to under promise and over deliver.

The Tao of Backup by ei4anb · 2009-10-11 01:05 · Score: 4, Interesting

Sadly it comes to pass that every generation the Tao of Backup is forgotten and must be relearned through such trial by fire. http://www.taobackup.com/

Claimed information from the inside by cshbell · 2009-10-11 02:08 · Score: 5, Interesting

According to this comment post on Engadget, it was a contractor working for Danger/Microsoft who screwed up a SAN upgrade and caused the data loss. Obviously, take this with a grain of salt until it's substantiated:

"I've been getting the straight dope from the inside on this. Let me assure you, your data IS gone. Currently MS is trying to get the devices to sync the data they have back to the service as a form of recovery.

It's not a server failure. They were upgrading their SAN, and they outsourced it to a Hitachi consulting firm. There was room for a backup of the data on the SAN, but they didn't do it (some say they started it but didn't wait for it to complete). They upgraded the SAN, screwed it up and lost all the data.

All the apps in the developer store are gone too.

This is surely the end of Danger. I only hope it's the end of those involved who screwed this up and the MS folks who laid off and drove out anyone at Danger who knew what they were doing.

"Epic fail" doesn't begin to describe this one.

Re:Claimed information from the inside by Anonymous Coward · 2009-10-11 04:14 · Score: 2, Insightful

This doesn't mean the data is "gone", it means that most likely a bunch of disks with user data have had their metadata changed and perhaps a bit of new data has overwritten them. Reformatting drives or changing the RAID configuration doesn't delete data, it just makes it inconvenient to access it. Unless their SAN is designed to magically write zeros over every disk within a few minutes of a configuration change, at least some data is still there. How hard it is to access it depends on how much support they can get from the people who designed the storage system (file system, database, or a raw object store of some kind).

You assume Danger used a MSFT platform by xswl0931 · 2009-10-11 03:05 · Score: 3, Insightful

Looking at the timeframe that Danger was acquired by MSFT and that the Danger OS was likely based on NetBSD (http://en.wikipedia.org/wiki/Danger_Hiptop), it's more likely that Danger was still using NetBSD as their Server Software and this was merely a process issue. Blaming it on the "Microsoft Platform" without any real data is just spreading FUD.

Re:You assume Danger used a MSFT platform by Anonymous Coward · 2009-10-11 06:44 · Score: 2, Informative

You know nothing of which you speak. I assure you it was running on Microsoft software. Unfortunately, I should know.
Re:You assume Danger used a MSFT platform by xswl0931 · 2009-10-11 10:46 · Score: 2, Insightful

You assure us anonymously without any proof? Of course.
Re:You assume Danger used a MSFT platform by Ecuador · 2009-10-12 02:58 · Score: 2, Funny

He's modded +1 Informative. I guess that's proof enough! :D

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS

Re:Why not store the data on phone permanent memor by Oshawapilot · 2009-10-11 03:21 · Score: 2, Informative

I'll admit to having one of the original (and second version) of the Sidekick (They were called the Hiptop everywhere else except the USA) and the idea of storing everything on the cloud seemed great at the time - through several device upgrades, warranty replacements, and other hardware changes everything just automagically restored to the new phone within 10-15 minutes of switching the SIM.

One should add that the devices themselves are designed to "Play dead" when the battery gets low and shut down while still maintaining enough power to ensure the volatile RAM holding the device's local cache of data remains intact. Only if the battery is fully exhausted to the point of not being able to accomplish this, or a critical error/OS crash (the dreaded "red X of death") is encountered, is the volatile RAM actually in danger of being erased.

Therefore all the warnings about not letting the phones go "dead" or turning them off are a bit misleading since, excluding one of the two above situations, everything is actually safe, but it's not without warrant since I'm sure MS/Danger are going to try to "backwards restore" whatever is salvageable.

Furthermore, since the OS is locked down extremely tight there's no (to my understanding, admittedly a few years old now) method of locally backing up a Sidekick's data. Contacts stored on the device can be backed up to the SIM card one at a time (with only the basic name/phone data, all other extraneous data such as profile pics, etc will not be included) but it was tedious to accomplish (one contact at a time) and the average Sidekick user (read as teen/clueless) probably has no idea how to do it anyways.

Re:When Paranoia Pays by larry+bagina · 2009-10-11 04:53 · Score: 2, Informative

it runs NetBSD and Java.

--
Do you even lift?

These aren't the 'roids you're looking for.

Autorestore - multiple birds one stone. by Colin+Smith · 2009-10-11 05:17 · Score: 2, Insightful

To the standby or testing system. Our staging/testing systems all run yesterday's production data, restored from the most recent backup.

If your backups don't work then neither will your test/staging server... Which will be noticed.

What do you get?
* Backups tested every day.
* A test/staging/standby system identical to the production.
* Something the business can run all the crappy queries they like against without affecting the production system.
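
A minimal sketch of that nightly restore-into-staging job, using SQLite from the Python standard library as a stand-in for whatever database is really in use. The `contacts` table, the `nightly_restore` function, and the in-memory databases are all assumptions made for the example.

```python
import sqlite3


def nightly_restore(backup_conn):
    """Restore a backup into a fresh staging database and sanity-check it."""
    staging = sqlite3.connect(":memory:")
    backup_conn.backup(staging)   # copy the backup's contents into staging
    # If the backup is unusable, this query fails and the job gets noticed.
    (count,) = staging.execute("SELECT COUNT(*) FROM contacts").fetchone()
    if count == 0:
        raise RuntimeError("restored staging database is empty")
    return staging


production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE contacts (name TEXT)")
production.executemany("INSERT INTO contacts VALUES (?)", [("alice",), ("bob",)])
production.commit()

backup = sqlite3.connect(":memory:")
production.backup(backup)         # stands in for last night's backup

staging = nightly_restore(backup)
staging_rows = staging.execute("SELECT name FROM contacts ORDER BY name").fetchall()
```

Because staging only ever gets its data from the backup, a broken backup shows up as a broken staging system the next morning.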

--
Deleted

Means what it said by SuperKendall · 2009-10-11 05:42 · Score: 2, Informative

shit, is that TSR still hanging around? goodness!

Dude, what part of "Stay Resident" did you not understand? It's not like selling your computer rids you of it.

That's why I never ran them, nor consorted with Deamons.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

The value of data by symbolset · 2009-10-11 05:49 · Score: 4, Insightful

Granted, this isn't cheap, but our data isn't either.

Microsoft bought Danger for half a billion dollars. Current estimates of the value of this data are roughly... half a billion dollars, plus a little. There's little doubt that in addition to destroying the entire value of the acquisition they've created a connection between "Microsoft", "Danger" and "data loss". In their release T-Mobile isn't being shy about tying those things together. Not good. That's going to have impacts even for some completely unrelated cloud-based products like Azure.

Somebody's about to get a really awkward performance review.

--
Help stamp out iliturcy.
