Why Mirroring Is Not a Backup Solution
Craig writes "Journalspace.com has fallen and can't get up. The post on their site describes how their entire database was overwritten through either some inconceivable OS or application bug, or more likely a malicious act. Regardless of how the data was lost, their undoing appears to have been that they treated drive mirroring as a backup and have now paid the ultimate price for not having point-in-time backups of the data that was their business." The site had been in business since 2002 and had an Alexa page rank of 106,881. Quantcast said they had 14,000 monthly visitors recently. No word on how many thousands of bloggers' entire output has evaporated.
DUH!
While this mirrors previous comments, it's not really a backup solution.
Mirroring, RAID, grid, whatever. At some point, you want your data safe and secure on something not physically attached to any power source.
And that's why your IT department actually needs funding. Sleep tight.
That is one reason why mirroring isn't a backup, and why backups should ideally be off-line.
If I have nothing to hide, don't search me
We do data hosting, and I can't imagine how catastrophic that would be. Jebus. Let this be an ultimate example of why numerous backups are needed. Always. Without question.
Excellent! We can use their demise as yet another cautionary tale.
Mirroring is inexpensive protection against a total hard disk failure, and it is effective at that. It can't help when software goes rogue or a user deletes the wrong files.
It's really unfortunate that this happened. If they had simply had a backup snapshot of the DB they could have restored it. RAID only saves you from disk failures. It doesn't work on OS/user failures.
Unfortunately this is the kind of thing you tend to learn from experience (either yours or someone else's). It's very easy to think "RAID 1 = disks are safe".
Just like a database cluster wouldn't have saved them. A clustering database can save you from load, or you can swap servers if a disk goes bad. But when someone issues "DELETE FROM ..." the other cluster nodes start to happily run the same thing and now you have 2 (or 3 or 10 or...) empty database boxes.
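To make the point concrete, here is a toy sketch in Python. The `MirroredStore` class is purely hypothetical (it models neither a real RAID controller nor any particular database cluster); it only shows that a replicated delete empties every copy, while a point-in-time snapshot kept outside the mirror survives.

```python
import copy


class MirroredStore:
    """Toy model of RAID 1 or a replicated cluster: every write hits every replica."""

    def __init__(self, replicas=2):
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete_all(self):
        # A destructive command is replicated just as faithfully as a good one.
        for replica in self.replicas:
            replica.clear()


store = MirroredStore(replicas=3)
store.put("post:1", "hello world")

# Point-in-time backup: an independent copy that later writes cannot touch.
snapshot = copy.deepcopy(store.replicas[0])

store.delete_all()
surviving = [len(replica) for replica in store.replicas]  # every replica is empty
restored = dict(snapshot)                                  # the snapshot still has the post
```

More replicas would not change the outcome; only the copy that sits outside the write path does.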
I hope those bloggers had a backup of some sort of their own.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
That's all I can say about this. I'm really surprised that, with all the users they had, they are so quick to say "everything is gone and we're giving up" instead of just starting over and maybe implementing a protocol that would make sure this doesn't happen again.
Ave Molech Setting
This is fascinating and altogether newsworthy. I had never before thought of this. I am very pleased, indeed, that kdawson engaged his most finely-honed editorial faculties to post this article to the front page, as it is not only stunning and fascinating in substance but also rather eloquently written.
I do not think it means what you think it means.
Today's weirdness is tomorrow's reason why. -- Hunter S. Thompson
Mirroring: High availability
Backups: High reliability
The rules of backups:
1. Backup all your data
2. Backup frequently
3. Take some backups off-site
4. Keep some old backups
5. Test your backups
6. Secure your backups
7. Perform integrity checking
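A minimal sketch of rules 4 and 7 in Python, assuming a single-file data store. The helper names (`take_backup`, `verify_backup`), the `.bak` and `.sha256` naming scheme, and the date labels are all made up for illustration; a real setup would also ship copies off-site (rule 3) and keep the checksums somewhere other than next to the backup.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def take_backup(source: Path, backup_dir: Path, label: str, keep: int = 7) -> Path:
    """Copy `source` to a labelled backup, record its checksum, prune old copies."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{source.name}.{label}.bak"
    shutil.copy2(source, dest)
    # Rule 7: record the checksum of the source so the copy can be verified later.
    Path(str(dest) + ".sha256").write_text(sha256_of(source))
    # Rule 4: keep the newest `keep` backups (labels are assumed to sort by age).
    backups = sorted(backup_dir.glob(f"{source.name}.*.bak"))
    for old in backups[:-keep]:
        old.unlink()
        Path(str(old) + ".sha256").unlink(missing_ok=True)
    return dest


def verify_backup(backup: Path) -> bool:
    """Return True if the backup still matches the checksum recorded at backup time."""
    expected = Path(str(backup) + ".sha256").read_text().strip()
    return sha256_of(backup) == expected


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    db = root / "journal.db"
    db.write_bytes(b"entries v1")
    for label in ("2009-01-01", "2009-01-02", "2009-01-03"):
        latest = take_backup(db, root / "backups", label, keep=2)
    remaining = sorted(p.name for p in (root / "backups").glob("*.bak"))
    ok_before = verify_backup(latest)
    latest.write_bytes(b"bit rot")      # simulate silent corruption of the backup
    ok_after = verify_backup(latest)
```

The checksum only tells you the file is unchanged; it does not tell you the file can actually be restored, which is what rule 5 is for.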
The real issue is that some people think HA == DR, and this story reminds us that they are not the same thing.
Mirroring / RAID == HA. If one of your HDDs lets the smoke out, you still don't incur downtime. If you have a hot spare, even better. All it does is buy you a little time to correct the issue (i.e. "It can wait until morning").
Also, one other very important thing: mirroring doesn't prevent or repair data corruption. If you're mirroring your rm -rf (as pointed out by Corsec67 below), your RAID will happily do what it does and propagate your command to all your disks. Congrats, you just successfully gave yourself HA for your disk erasing! :]
Backups are DR. If your RAID croaks, you're SOL if you don't have off-machine backups. If you accidentally nuke your disks with an rm or something, you can still go back and restore data. Sure, you'll likely lose -some- data, but -some- is better than all in this case.
----- The internet has given everyone the ability to have their voice heard equally as loud.. even if they shouldn't be
I am experiencing a strange phenomenon. The jaw-drop reflex has been popping my mouth open for several minutes and won't stop. If I focus I can close it, but then it pops open again. wow.
The cost of that cleanup, of course, will be borne by taxpayers, not industry.
Maybe I could understand that there might be issues with backing up live databases, and they didn't want to deal with it. Still not an excuse.
BUT, according to the site "the server which held the journalspace data had two large drives in a RAID configuration". Only TWO drives.
All they had to do was pull one of the drives, replace it, and lock up the original off site. In a couple of hours the drives would have been mirrored again.
Important note: don't hire the IT dude with Journalspace.com on his resume.
No doubt this incident is the admin's fault. He confused mirroring with backup and carried on with the mistake until it was too late, as pointed out in other comments.
Now what about the user's angle? The moral is that you should never assume your data is safer because it's "in the cloud". If you value your blog and your readers, you *should* save a copy of your work as well as the readers' info, *locally*, somewhere you have control over.
There's no place like $HOME.
Colorless green Cthulhu waits dreaming furiously.
Even the greenest IT employee knows that mirroring is to protect against hard drive failure and not software corruption.
I only wish that were true. I've given up arguing with friends about this, who insist that their mirrors are good enough backups. I just stare at colleagues who think such, especially those who SHOULD know better. And I *know* coworkers are doing this @ work, too, and I'm just waiting for about 50TB of data to suddenly go missing...
"The urge to save humanity is almost always a false front for the urge to rule." --H.L. Mencken
They also purposely blocked archive.org via a robots.txt exclusion, so the bloggers can't use that to try and recover some of their blogs.
In today's world, where primary storage and protection storage are well defined and an entire industry has grown around them (examples: NetApp, Data Domain), one is hard-pressed to understand the reason for such a debacle. Reading the note referred to in the article leads me to believe, unfortunately, that Journalspace's IT department did not understand the difference.
It is sometimes considered bad form to say something bad about fellow techies. We prefer to look for 'outside' causes. Still, to learn and avoid the same problems in the future, one has to admit his mistakes first. This paragraph from Journalspace's page:
The value of such a setup is that if one drive fails, the server keeps running, using the remaining drive. Since the remaining drive has a copy of the data on the other drive, the data is intact. The administrator simply replaces the drive that's gone bad, and the server is back to operating with two redundant drives.
makes me believe there is denial going on.
You pay your infrastructure people to maintain business continuity. I mean, the title of this post made me go, "Really, no shit." That's like systems admin 101! If the admin was aware, then the manager who didn't listen needs to be fired. If the manager listened and they are just run by retards, then they got what they deserve. You'd think 14,000 visitors a month would be worth enough to do it right, in ad revenue alone.
The cost of a consumer machine running Linux with a few TB of SATA space: $1,200.
How much the company paid to have a systems admin play video games all day: $50,000.
The cost of a 14,000-visitor-a-month site going down because they had no database backups: priceless.
See mirroring is like...well a mirror. If you stand before one and stick a fork in your eye your mirror-image does the same. In real time. Analogies are there for a reason.
The article says the data recovery company has found the drives wiped. There is no recoverable data.
It seems like the actual site failure was on the 23rd or so.
IMarv
PS, the Internet Archive was blocked by their robots.txt, so there isn't even that to look at. http://web.archive.org/web/*/http://www.journalspace.com
Looks like at least some content is still in Google's cache, those looking to salvage their journals should act quickly.
You can limit Google's search results to a particular site by using the "site:domainname.com" search term and then clicking the "Cached" link of each result to see Google's copy.
There's also a Greasemonkey script for Firefox that can automatically add Google Cache links next to page links, so you can navigate from one cached page to another more easily.
This is just compound foolishness. I gather they did it in an attempt to control bandwidth costs since it's hard to imagine any other reason.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
You don't just need backups. You need to TEST them. Having a backup run every night is nice and all; but if the tapes are unreadable and no error was reported, or if you're doing it wrong and the backup is corrupted and you only find out when you come to restore ....
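A hedged sketch of what "test them" can look like, using Python's standard `sqlite3` module as a stand-in (the thread never says which database Journalspace ran). The table name `entries` and the helper names are assumptions for illustration; the idea is simply to open the backup as if restoring it and check that it is readable and complete.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_database(live: sqlite3.Connection, backup_path: Path) -> None:
    """Write a consistent copy of the live database using SQLite's backup API."""
    target = sqlite3.connect(backup_path)
    try:
        live.backup(target)
    finally:
        target.close()


def restore_test(backup_path: Path, expected_rows: int) -> bool:
    """Open the backup as a restore would and check that it is sane and complete."""
    restored = sqlite3.connect(backup_path)
    try:
        status = restored.execute("PRAGMA integrity_check").fetchone()[0]
        rows = restored.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    except sqlite3.DatabaseError:
        return False  # unreadable, or not a database at all
    finally:
        restored.close()
    return status == "ok" and rows == expected_rows


with tempfile.TemporaryDirectory() as tmp:
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, body TEXT)")
    live.executemany("INSERT INTO entries (body) VALUES (?)", [("first",), ("second",)])
    live.commit()

    backup_path = Path(tmp) / "entries-backup.db"
    backup_database(live, backup_path)
    good = restore_test(backup_path, expected_rows=2)

    backup_path.write_bytes(b"unreadable tape")  # simulate a backup that went bad
    bad = restore_test(backup_path, expected_rows=2)
    live.close()
```

Running a check like this after every backup job turns "no error was reported" into "a restore actually worked".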
Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it ;)
Adding the OSX comment and that a bug in their code is impossible is even lamer.
The drives were overwritten sector by sector on a machine that didn't have any of their code running on it. Their application couldn't have done it because it couldn't execute arbitrary code on that server. The "impossible" comment makes sense to me.
As for it being lame/unprofessional to name the possibilities, I disagree. He states the OS it was running on and said that it was either an OS problem or sabotage. There might be a few possibilities, but that about sums them up right there. He was being thorough and open; what's the problem with that?
Since they apparently used OSX Server this is particularly bad. All they needed was a large enough USB attached disk and then to turn on Time Machine. Might not be the best solution for their needs but it is hard to imagine one which requires less effort.
This is why users should be able to easily back up their own data for any online service. If a service entrusted with your data provides no straightforward way to drop a copy of it onto your own hard drive, don't trust it. I'd go as far to say that any service that doesn't strongly recommend you keep your own backups shouldn't be trusted.
Do the big kahunas of the "Web 2.0" world give users that option? Gmail, Myspace, Facebook, Twitter etcetera ad nauseam?
Prisencolinensinainciusol. Ol Rait!
I hope affected users are looking into this, I just did a search of a random JS blog and 2,000 entries were returned, all cached it would seem. So many people might be able to recover their work in a very painstaking manner.
Since you're the only poster to reply without yelling "idiot" (thanks, btw) -- Zeroing the drive makes software recovery impossible. It doesn't make data recovery impossible. There are ways to read the offset data, though this is getting harder as magnetic densities increase every year. Ontrack data recovery specializes in that kind of thing. I've seen them do it. Granted, it's not a 100% thing -- you don't get back something that even resembles a filesystem. At least a third of it is uselessly garbled binary.
The site was run on OS X Server... I think this may be indicative of the level of IT effort with the company. Look, *I* run an OS X Server... but *I* am a Biology major that knows approximately dick about the UNIX command line, and use it to run a server that I probably wouldn't be able to run any other way. I also have it back up nightly to a cheap NAS, archiving old backups, and I've tested a restore to make sure it works.
This is probably just a couple guys who ran a website in their spare time... not a huge IT effort that failed.
That's bullshit, and has been for decades.
It's a myth. Just learn about it. Even if we use our newest AFM or XMCD microscopy, you won't see an overwritten byte in any drive of the last 5 years. And even the last decade is very doubtful (basically, since GMR drives have been around).
There IS NO SPACE between tracks anymore. Bits are right next to each other. If you overwrite, nothing above the superparamagnetic limit is left.
Not even the NSA could get anything useful out of a single overwrite with zeros (well, except relocated sectors and other specialities that might compromise security, but that doesn't help with a backup).
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
I'm just sayin'
Are there Darwin awards for websites?
ACID-compliant databases use a log, much like a filesystem journal, that contains all the changes made to the database before those changes are actually written out to the main database storage. When you back up the raw database, you also back up all the logs from at least the time you started backing up the raw files until the time the backup finished, and when you need to restore the database you put the raw data back and then let the database replay the logs.
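The restore-then-replay idea can be sketched with a toy key-value store in Python. Everything here (`LoggedStore`, the log tuple layout, the `up_to_seq` cutoff) is a made-up illustration, not how any particular database lays out its log. Stopping the replay early, as shown at the end, is an extra assumption beyond what the comment describes: it is what lets you recover to just before a bad command.

```python
import copy


class LoggedStore:
    """Toy store that appends every change to a log before applying it."""

    def __init__(self):
        self.data = {}
        self.log = []  # entries are (sequence_number, operation, key, value)

    def _append(self, op, key, value=None):
        self.log.append((len(self.log) + 1, op, key, value))

    def put(self, key, value):
        self._append("put", key, value)  # log first, then apply
        self.data[key] = value

    def delete(self, key):
        self._append("delete", key)
        self.data.pop(key, None)


def restore(snapshot, log, snapshot_seq, up_to_seq=None):
    """Start from the raw snapshot and replay logged changes made after it."""
    data = copy.deepcopy(snapshot)
    for seq, op, key, value in log:
        if seq <= snapshot_seq:
            continue  # already contained in the snapshot
        if up_to_seq is not None and seq > up_to_seq:
            break     # stop before the changes we do not want back
        if op == "put":
            data[key] = value
        else:
            data.pop(key, None)
    return data


store = LoggedStore()
store.put("a", "first post")
store.put("b", "second post")
snapshot, snapshot_seq = copy.deepcopy(store.data), len(store.log)  # raw backup taken here

store.put("c", "third post")       # sequence 3, made after the backup
for key in ("a", "b", "c"):        # sequences 4 to 6: the disastrous mass delete
    store.delete(key)

replayed_fully = restore(snapshot, store.log, snapshot_seq)
just_before_delete = restore(snapshot, store.log, snapshot_seq, up_to_seq=3)
```

Replaying the whole log faithfully reproduces the disaster, which is the mirroring problem all over again; the value of the snapshot plus log is being able to stop at a chosen point.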
Back in the nineties, a friend of mine was backing his mac system up weekly with a tape drive. The thing is, he was using the same tape to repeatedly back up onto. One day he calls to tell me he needed some help recovering files on his hard drive after a crash. I asked, "What about the tape backups?" He said, "That thing backs up perfectly. The problem is, it doesn't restore at all."
Seth
$5 / month hosted VPS on linux = awesome!
The company that runs Journalspace (or used to, anyway) is Lagomorphics. They will host your site for you...
http://www.lagomorphics.com/hosting/
At Lagomorphics, we're OS X hosting experts. We've been using the Mac mini and Xserve platforms for years, and we're proud to offer you the opportunity to use our colocation facility. Just send us your Mac mini, or let us provide the hardware.
I'm a big tall mofo.
For everything to be just gone and I mean LONG gone, then something besides a truncation or un-linking of the file had to occur.
Now I don't know all that much about the apple file system, but I would imagine it is like most file systems in that it links clusters and sectors of data together using some sort of allocation table, hash, b-tree or something.
Now unless they had file scrubbing turned on and the OS purposefully went out and overwrote every segment of the file with 01010101 and 10101010 then the vast majority of the data should still be there, at least I would think it would be. I mean even the nastiest revenge oriented guy, would have to be able to invoke some kind of program to do that.
I am assuming that it was an SQL database of some flavor. I don't know much about MySQL internals but I am pretty sure a
delete from table
simply goes through the index and marks pages deleted, and does not physically go out and scrub every page that has data on it. I know that is how Oracle works.
So this leaves me wondering about the data recovery house... If they were doing a sector-by-sector read on the entire drive (either of them), they should still see all sorts of data on the disk. Now, I don't know if the database compresses data on the fly (some do, some don't), and I don't know if drive compression is an option on OS X. If so, I can see where they would see mostly large amounts of compressed data, making things VERY difficult if not impossible to recover. But barring that, most OSes have the hooks built in to simply do a sector-by-sector read of the storage device, and although your binary data (images and the like) might be unrecoverable, you could probably get most if not all of the text.
Just a thought, but hey I might be crazy, it is just the hacker in me that brings these things to mind...
Hey KID! Yeah you, get the fuck off my lawn!
I see your point, but something about this does not pass the smell test.
To have nothing on the HD(s), someone had to very, very carefully wipe the entire disk by overwriting every block and sector that the data occupied. That would have made whatever DB system was running shit its pants as it started seeing data disappear, so it would have been really obvious, really fast, that something was amiss. As you relate, it would more than likely have caused a kernel panic and/or at least a core dump of the DB system.