GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

Posted by BeauHD on Tuesday January 31, 2017 @07:00PM from the put-in-a-hard-day's-work dept.

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

Yawn... by Anonymous Coward · 2017-01-31 19:04 · Score: 5, Insightful

No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
This has been going on since the dawn of computing and it seems there's no end in sight.
1. Re: Yawn... by Nutria · 2017-01-31 20:35 · Score: 5, Funny
  
  paki chimps in jungle
  Someone failed geography class...
  
  --
  "I don't know, therefore Aliens" Wafflebox1
2. Re: Yawn... by Anonymous Coward · 2017-01-31 21:16 · Score: 3, Insightful
  
  No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.
  He's just a racist.
3. Re:Yawn... by Anonymous Coward · 2017-01-31 23:11 · Score: 2
  
  http://knowyourmeme.com/memes/disaster-girl
4. Re:Yawn... by Big+Hairy+Ian · 2017-02-01 00:38 · Score: 2, Insightful
  
  Clearly their DR Plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!
  
  --
  Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.
5. Re:Yawn... by zifn4b · 2017-02-01 00:47 · Score: 1
  
  No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.
  This has been going on since the dawn of computing and it seems there's no end in sight.
  You'd think so, but the level of incompetence these days rivals the incompetence of 20 years ago. I just heard yesterday that a global multi-national company that's been around for years lost a file because "another file from a different source came in too soon and overwrote it". At that point, I did a complete facepalm because I was astounded that we still have software like this around running critical business operations, sometimes even global operations.
  
  --
  We'll make great pets
6. Re:Yawn... by Anonymous Coward · 2017-02-01 01:11 · Score: 1
  
  It's one of those called "PowerPoint DR Plan".
  Everyone has one nowadays but unfortunately it isn't as effective as the real deal.
7. Re: Yawn... by moronoxyd · 2017-02-01 02:11 · Score: 1
  
  What does GitHub have to do with Gitlab.com?
8. Re: Yawn... by nitehawk214 · 2017-02-01 03:16 · Score: 1
  
  Sounds like they used the "mirror = backup" solution. Best way to destroy everything.
  
  --
  I'm a good cook. I'm a fantastic eater. - Steven Brust
9. Re:Yawn... by Anonymous Coward · 2017-02-01 03:28 · Score: 2, Insightful
  
  And will continue to happen because experienced admins are expensive. Cheaper to hire the new grad who knows the buzzwords than the experienced admin who has lived through a couple of catastrophes and now knows how to plan for them.
10. Re:Yawn... by shaitand · 2017-02-01 04:30 · Score: 1
  
  It is a tough scenario: executing DR plans for real involves disruption to the involved systems, so you either definitely impact your operations from time to time or you take the risk there might be some kind of disruption IF you ever need to fall back on DR.
  
  That is why most organizations actually have reliable backup systems that periodically verify data integrity, alert when agents aren't communicating, etc.
11. Re:Yawn... by Penguinisto · 2017-02-01 04:44 · Score: 1
  
  But... but... The Cloud! The Cloud is our DR solution!
  (*chuckle*)
  
  --
  Quo usque tandem abutere, Nimbus, patientia nostra?
12. Re: Yawn... by Qzukk · 2017-02-01 06:02 · Score: 3, Interesting
  
  There's two levels of redundancy. There's "oh my god the database server is on fire! Promote the replicated server to master and failover!" which, depending on the database, should take a few seconds to perform manually. Testing automation for this (pull the plug and see what happens) depends on your setup and how long it takes your heartbeat to decide that the server is dead and how (If we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds, it takes more than one failed connection for automation to decide the server is down).
  Then, there's "oh my god the datacenter is on fire!". This is what people usually call "Disaster Recovery". One dead server isn't a disaster when you have failovers, but when your entire datacenter is dead, THAT's a disaster. It's tough as nails to automate too, since without having at least three datacenters, it's inherently a split-brain issue. If Datacenter A stops responding to Datacenter B, which one is actually down? If you aren't an AS and can't just republish your IPs at Datacenter B with a BGP routing change, that means you're going to have to publish new DNS records and wait one TTL for everyone to see them. If you had an authoritative DNS server at Datacenter A, then hopefully it was able to recognize that it's down and shot itself (or at least updated its zone files with B's IPs) or you can somehow get to it and kill it, otherwise when Datacenter A comes back online, it'll be serving up A's IPs again and conflict with the other DNS server. This also is setting aside replicating your data between datacenters and how much of that is lost when you switch back and forth.
  
  --
  If I have been able to see further than others, it is because I bought a pair of binoculars.
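The point above about needing more than one failed connection before automation declares a server dead can be sketched roughly as follows. This is only an illustration: `declare_down`, its threshold argument, and the idea of passing the health check as a command are hypothetical names chosen here, not part of any particular failover tool.

```shell
#!/bin/sh
# declare_down THRESHOLD CHECK_CMD [ARGS...]
# Runs the health check up to THRESHOLD times and reports "down"
# (exit status 0) only if every attempt fails. One success means "up".
declare_down() {
    threshold=$1
    shift
    attempt=0
    while [ "$attempt" -lt "$threshold" ]; do
        if "$@"; then
            return 1    # a probe succeeded: do not fail over
        fi
        attempt=$((attempt + 1))
        # a real monitor would sleep between probes here
    done
    return 0
}

# Example: a check that always fails is declared down after 3 probes.
declare_down 3 false && echo "down after 3 failed probes"
```

A single dropped probe never triggers a failover here; how many probes to require, and how long to wait between them, depends on the setup.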
13. Re:Yawn... by LeftCoastThinker · 2017-02-01 07:17 · Score: 1
  
  It is the classic human factor. The sysadmin probably knew all the right steps to take, but got lazy thinking he didn't need all the extra work and then, working late he made the mistake that all those steps would have protected against.
  No one is immune to complacency.
  
  --
  If you disagree, please post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like
14. Re:Yawn... by The-Ixian · 2017-02-01 08:41 · Score: 1
  
  Luckily this is a lab environment and not production...
  
  --
  My eyes reflect the stars and a smile lights up my face.
15. Re:Yawn... by AK+Marc · 2017-02-01 08:58 · Score: 1
  
  Experts still call it a "backup plan". A "backup" is a copy. Nobody wants a copy. People want a usable service. There should never be a "backup plan". There should only be a "restore plan." So people stop saying "the backup completed without error, it should work". If you didn't restore it, you didn't test the restore.
  
  I've seen too many places with backups that completed without error that didn't have usable backups. And never tested a restore.
  
  --
  Learn to love Alaska
16. Re:Yawn... by riffraff · 2017-02-01 09:52 · Score: 1
  
  At one of my jobs we were required to test our backups at least once a quarter. 8T database backups that took hours to restore, then hours to import, so several days of work, so that we made sure the backups were tested and accurate. It was a lot of work, but worth the trouble to make sure something like this doesn't happen. And with deduping and versioning, it should be easy to go back to an older version, even only a few minutes old.
17. Re: Yawn... by Tablizer · 2017-02-01 10:08 · Score: 1
  
  No! It's right there in pop-up #73 of "Sarah Palin's Illustrated Geography".
  
  --
  Table-ized A.I.
18. Re: Yawn... by chris_osulliva · 2017-02-01 13:35 · Score: 1
  
  if you have 2 data centers you should be load-balancing already!
19. Re:Yawn... by K.+S.+Kyosuke · 2017-02-01 14:51 · Score: 1
  
  I thought Git was supposed to be its own backup, on the basis of all copies being complete (and basically a persistent data structure, if I'm not mistaken)? Shouldn't it be possible for developers to just sync the server with their local copies?
  
  --
  Ezekiel 23:20
20. Re: Yawn... by volodymyrbiryuk · 2017-02-01 22:42 · Score: 1
  
  It was actually a sarcastic reply to "The most worthless race is the human race. Genocide everything that shits." But the sarcasm was never strong with the anonymous coward crowd.
  
  --
  sudo rm -r -f --no-preserve-root /
21. Re:Yawn... by minstrelmike · 2017-02-02 07:14 · Score: 1
  
  We have a database we can actually export and import in less than 2 hours. So every night, we export production, drop our test database, and import the latest and greatest data. And that version is what we use for testing and what our power users use to modify data experimentally to see if they like the changes.
  
  We know fairly quickly if we have a bad "backup." I've been burned many times by files that won't open, won't load, won't whatever.
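A file-level analogue of that nightly export/import habit might look like the sketch below: restore the archive into a scratch directory and compare it with the source. `verify_backup` is a hypothetical helper written for illustration, and it assumes a plain tar archive created from inside the source directory; a database would need its own export, import, and sanity queries instead.

```shell
#!/bin/sh
# verify_backup SRC_DIR ARCHIVE
# Restores ARCHIVE (a tar file made from inside SRC_DIR) into a scratch
# directory and compares the result with SRC_DIR.
# Returns 0 only if the restore succeeds and the trees match.
verify_backup() {
    src=$1
    archive=$2
    restore=$(mktemp -d) || return 1
    if tar -xf "$archive" -C "$restore" &&
       diff -r "$src" "$restore" >/dev/null; then
        status=0
    else
        status=1
    fi
    rm -rf "$restore"   # scratch directory created by mktemp above
    return "$status"
}

# Example run against throwaway data.
data=$(mktemp -d)
echo "important" > "$data/file.txt"
archive=$(mktemp)
tar -cf "$archive" -C "$data" .
verify_backup "$data" "$archive" && echo "restore OK"
```

The comparison only makes sense while the source is unchanged since the backup; for live data, comparing against a snapshot taken at backup time is one option.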
22. Re: Yawn... by peawormsworth · 2017-02-02 23:32 · Score: 1
  
  ...they used the "mirror = backup" solution.
  I used that solution and it worked ideally. I ran a laptop off an SD card and set RAID to mirror to a 2nd SD card. To make a backup, I just pull one card out and stuff a fresh card in. That's it. The new disk would automatically sync up and be ready for the next backup when I choose to pull that one out. To restore, I just shut down the computer, put in the last backup and boot up. My computer system was instantly restored back to the last time I pulled out that disk.
  I didn't continue with this test set up because laptops do not have 2 SD cards and having a dongle hanging out of a laptop all the time is a recipe for loose and broken USB ports.
  The sysadmins told me that RAID should never be used as a backup. But I found this set up dead simple and the backup restore process was instantaneous.
  Perhaps RAID in mirror mode is not an ideal backup solution. But it is very close to the way backups should work.
23. Re: Yawn... by nitehawk214 · 2017-02-03 09:47 · Score: 1
  
  Please repeat after me, "RAID is not backup." Not just, "not an ideal backup..." it is not a backup at all.
  If you accidentally do a "rm -rf /" or ransomware encrypts your entire drive, RAID won't help you a bit. If you don't test your backups, you are equally likely to fall into the same pit.
  That being said I do have mirrored disks on my main computers. A spinning disk is more likely to die than I am to get malware or screw up on the command line, and it is far easier to swap a disk than restore a system from backups. And just about everything supports Raid 1 in hardware these days and disks are dirt cheap.
  
  --
  I'm a good cook. I'm a fantastic eater. - Steven Brust
24. Re: Yawn... by psycheitout · 2017-02-05 01:34 · Score: 1
  
  The importance of things like backups and redundancies is a lesson people usually have to learn the hard way. I bet you GitLab won't ever make this mistake again.
I feel that lone sysadmin's pain by sixdrum · 2017-01-31 19:06 · Score: 5, Insightful

A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

rm -rf /home/user1 /home/user2 /home/ user3

Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
1. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 19:28 · Score: 5, Insightful
  
  That's why you always always run ls first.
  ls -ld /home/user1 /home/user2 /home/ user3
  Then edit the command to rm. Always.
2. Re:I feel that lone sysadmin's pain by mmell · 2017-01-31 19:35 · Score: 5, Interesting
  
  Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).
3. Re:I feel that lone sysadmin's pain by arglebargle_xiv · 2017-01-31 20:04 · Score: 4, Funny
  
  Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
  Actually: Check your privilege! (Especially if rm -rf is involved).
4. Re:I feel that lone sysadmin's pain by Opportunist · 2017-01-31 20:31 · Score: 2
  
  That's fine until he decides that typing "rm user1 user2 user3 user4..." is too much of a hassle and he replaces it with a script that lists the directories and removes them all. ...blissfully forgetting that there is a ".." directory. Oh .., how many well intended scripts have thee turned into the spawn of hell...
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
5. Re:I feel that lone sysadmin's pain by Narcocide · 2017-01-31 20:35 · Score: 1
  
  cd /home
  mkdir removed_home_directories_2017_02_01
  mv user1 user2 user3 removed_home_directories_2017_02_01/
  chmod 0500 removed_home_directories_2017_02_01
6. Re:I feel that lone sysadmin's pain by Megane · 2017-01-31 20:58 · Score: 1
  
  This is when tab completion is your friend, especially when you have path names with spaces in them. Also, for me the big one is overwriting stuff with the mv command (tab completion can make this easier to do), so I have it aliased to "mv -i". I almost never want to delete a file by overwriting it with the mv command.
  
  --
  #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
7. Re:I feel that lone sysadmin's pain by AmiMoJo · 2017-01-31 21:07 · Score: 4, Interesting
  
  mkdir ./trash
  mv file_to_delete ./trash
  If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
8. Re:I feel that lone sysadmin's pain by stridebird · 2017-01-31 21:08 · Score: 5, Informative
  
  Correct pattern is:
  > cd /home && rm ...
  ie don't run rm unless cd worked.
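A minimal sketch of that pattern, run against a scratch directory so it is safe to try. The paths and user names are made up for illustration.

```shell
#!/bin/sh
# Scratch area standing in for /home; everything here is throwaway.
scratch=$(mktemp -d)
mkdir -p "$scratch/home/user1" "$scratch/home/user2"

# Guarded form: rm runs only if cd succeeded, and the targets are
# relative names, so the delete can only act inside that directory.
(cd "$scratch/home" && rm -rf -- user1)

# A failed cd short-circuits the && chain, so rm is never reached.
if ! (cd "$scratch/no-such-dir" 2>/dev/null && rm -rf -- user2); then
    echo "cd failed, nothing deleted"
fi

# rm -rf "$scratch"   # clean up the scratch area when done
```

The subshell parentheses keep the `cd` from changing the working directory of the calling shell.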
9. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 21:15 · Score: 1
  
  Agreed, but there are occasions where this fails too. I had ssh'd into an old server and proceeded to delete a few folders of old data before redoing the backup, the ssh connection had dropped while I was in another window and I ran the rm -rf command on the main server. Still gives me nightmares...
10. Re: I feel that lone sysadmin's pain by saloomy · 2017-01-31 21:15 · Score: 2, Informative
  
  I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
11. Re:I feel that lone sysadmin's pain by Megol · 2017-01-31 21:52 · Score: 1
  
  Or perhaps the operating system (shell) should prevent these kinds of errors? I guess it isn't macho enough...
12. Re:I feel that lone sysadmin's pain by jez9999 · 2017-01-31 21:55 · Score: 1
  
  Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.
  
  --
  == Jez ==
  Do you miss Firefox? Try Pale Moon.
13. Re:I feel that lone sysadmin's pain by Anonymous Coward · 2017-01-31 22:32 · Score: 2, Insightful
  
  Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?
14. Re:I feel that lone sysadmin's pain by Angstroem · 2017-01-31 22:45 · Score: 3, Insightful
  
  The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P
  Then you have never worked on a repository with users of TortoiseSVN and the likes.
  "Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."
15. Re:I feel that lone sysadmin's pain by Applehu+Akbar · 2017-01-31 22:48 · Score: 1
  
  Moral: the command line is too powerful for puny humans who might not be totally attentive to every character being entered at all times.
16. Re:I feel that lone sysadmin's pain by Applehu+Akbar · 2017-01-31 22:52 · Score: 1
  
  Including that purist-hated trash can/recycle bin.
17. Re: I feel that lone sysadmin's pain by gsslay · 2017-01-31 23:10 · Score: 5, Insightful
  
  This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse; on someone else's server.
  
  Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
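One way such a separately named helper could look is sketched below. The name `trash`, the `TRASH_DIR` variable, and the one-directory-per-call layout are assumptions made for illustration; the layout is there so that two files with the same name, trashed at different times, do not overwrite each other.

```shell
#!/bin/sh
# Hypothetical "trash" helper: moves its arguments into a fresh,
# uniquely named directory under $TRASH_DIR instead of deleting them.
TRASH_DIR=${TRASH_DIR:-"$HOME/.trash"}

trash() {
    [ "$#" -gt 0 ] || { echo "usage: trash FILE..." >&2; return 2; }
    mkdir -p "$TRASH_DIR" || return 1
    # One new directory per call, named after today's date plus a
    # random suffix, so separate calls never collide.
    batch=$(mktemp -d "$TRASH_DIR/$(date +%Y-%m-%d).XXXXXX") || return 1
    # Print where the files went so they are easy to find again.
    mv -- "$@" "$batch"/ && echo "$batch"
}
```

Usage would be something like `trash old_build/ notes.txt`. On a machine where the helper is missing, the worst outcome is "command not found". Note that `mv` across filesystems copies the data, so a trash directory on a different filesystem can be slow and can fill the destination.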
18. Re:I feel that lone sysadmin's pain by 140Mandak262Jamuna · 2017-01-31 23:37 · Score: 1
  
  This. I do this.
  
  --
  sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
19. Re:I feel that lone sysadmin's pain by quenda · 2017-01-31 23:56 · Score: 1
  
  GUI? We don't need no stinkin' GUI!
  # mkdir junk
  # mv file1 dir2 .... junk
  # ls -la junk
  Look carefully!!
  # rm -rf junk
20. Re:I feel that lone sysadmin's pain by Zontar+The+Mindless · 2017-02-01 00:09 · Score: 1
  
  This is when tab completion is your friend...
  This. Very first thing I thought of as well.
  
  --
  Il n'y a pas de Planet B.
21. Re: I feel that lone sysadmin's pain by rgbatduke · 2017-02-01 00:10 · Score: 5, Interesting
  
  Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.
  With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.
  Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...
  
  --
  Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
22. Re: I feel that lone sysadmin's pain by joboss · 2017-02-01 00:13 · Score: 1
  
  That's not the pro way to do it. You can have snapshots on the FS so you can do something like restore a file as it was an hour ago. You can do similar with database replication.
23. Re:I feel that lone sysadmin's pain by donaldm · 2017-02-01 00:18 · Score: 1
  
  Or use a GUI that moves stuff to a recycle bin first. :-) It's saved my bacon on more than one occasion.
  May I ask what if you are required to do housekeeping on a corporate server that does not have a GUI?
  Answer: In the case of a corporate server whether it is classified as production, development or test, you raise a change request and get it signed off before you do anything.
  If you own the machine and are not answerable to anyone then any mistakes on your part will hopefully be a good lesson for you.
  
  --
  There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
24. Re: I feel that lone sysadmin's pain by Joce640k · 2017-02-01 00:22 · Score: 5, Insightful
  
  I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
  This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.
  If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.
  
  --
  No sig today...
25. Re: I feel that lone sysadmin's pain by jeremyp · 2017-02-01 00:23 · Score: 1
  
  This is a bad idea. rm is a sharp tool and you should never do anything to it that makes you think it isn't. One day you'll be working on somebody else's system but you'll have forgotten that rm can be dangerous and you'll merrily delete something career ending, go look for it in /trash and then have to commit ritual suicide.
  
  --
  All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
26. Re:I feel that lone sysadmin's pain by Joce640k · 2017-02-01 00:26 · Score: 2
  
  Exactly right.
  Exactly wrong.
  Learn to use "mv" instead of "rm -rf".
  eg. Create a folder called /trash and move the files there.
  When you see the system is still working and you need some disk space then you can empty the trash. Not before.
  
  --
  No sig today...
27. Re: I feel that lone sysadmin's pain by RabidReindeer · 2017-02-01 00:32 · Score: 2
  
  There are many tricks. Personally, I like to tar stuff or do a ZIP-with-delete and keep it for a day or 3 before removal. For large quantities of data, that can take a while, though, so another possibility if one is working with snapshot-capable storage management is to snapshot it and work "offline" on the snapshot. I do this on VM images, for example.
  Hot mirrors updated just infrequently enough that you can break the link before the damage propagates isn't a bad idea, either. Filesystems with "time machine" rollback capabilities, too.
  You can pretty well bet that a backup is never as usable as it's supposed to be. It's going to be outdated, corrupted, or something critical won't be in it.
  To be really viable, you need to devote the same level of attention to backup/restore (accent on the restore) as you do on security management. There is a very strong case for keeping an entire server or set of servers to run frequent checks on backups, including bare-metal restores, and these days, a spare computer or 6 is not a bank-breaking investment any more if your data means anything to you. I also like to employ multiple types of backup media on the premise that not all types of hardware will be affected equally by most failures.
  At least apparently they only lost a few hours worth of work, and although potentially a large amount of data, it's a job that is inherently distributed among many disgruntled clients. Other organizations haven't been so fortunate.
28. Re:I feel that lone sysadmin's pain by jabuzz · 2017-02-01 00:37 · Score: 2
  
  Oh, I wish that were really the case. Unfortunately, where a single run of a job on an HPC facility can produce 1TB of files, that is not actually the case in the real world for everyone.
29. Re:I feel that lone sysadmin's pain by rholtzjr · 2017-02-01 00:44 · Score: 1
  
  My major whoops early in my career.
  Brand new install of Slackware with Kernel 1.2.8 (circa late 1994) which was a statically linked build. Thought I was in /usr/local/lib (shell only had current level directory not the full path) but was really in /lib. Proceeded to rm -rf * to get rid of a test build (or so I thought). Well, then I was wondering why, after about 10 sec, the rm command was throwing errors. Seems that once the rm command hit libc.a any and all operations ceased.
  After that I always had the root user have the full path in the shell. Luckily no data was lost that a quick reinstall did not fix. But people did start asking me why they were getting a bizarre error when trying to get their mail with their pop client.
30. Re:I feel that lone sysadmin's pain by Calydor · 2017-02-01 01:21 · Score: 1
  
  I think it was a patch to EVE Online that did the same thing, accidentally deleting / instead of some specific directory within the game.
  
  --
  -=This sig has nothing to do with my comment. Move along now=-
31. Re:I feel that lone sysadmin's pain by John+Allsup · 2017-02-01 01:42 · Score: 1
  
  Also rm -rf /home/{user1,user2,user3} is safer: if you accidentally include a space, the braces don't get expanded at all:
  rm -rf /home/{user1,user2}
  is equivalent to
  rm -rf /home/user1 /home/user2
  but rm -rf /home/{user1, user2}
  is equivalent to
  rm -rf "/home/{user1," "user2}"
  so 'rm -ageddon' is avoided.
  
  --
  John_Chalisque
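The effect of a stray space can be checked without deleting anything by printing the arguments that rm would have received. The sketch below is only an illustration: `show_args` is a made-up helper, and it calls bash explicitly because brace expansion is a feature of bash (and ksh/zsh), not of POSIX sh.

```shell
#!/bin/sh
# show_args FRAGMENT
# Lets bash expand FRAGMENT and prints each resulting argument on its
# own line between angle brackets. Nothing is deleted.
show_args() {
    bash -c "printf '<%s>\n' $1"
}

show_args '/home/{user1,user2}'
# </home/user1>
# </home/user2>

show_args '/home/{user1, user2}'
# </home/{user1,>
# <user2}>

show_args '/home/user1 /home/ user2'
# </home/user1>
# </home/>
# <user2>
```

With the braces, the stray space produces two harmless literal names; without them, the same slip hands rm the whole of `/home/`.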
32. Re:I feel that lone sysadmin's pain by John+Allsup · 2017-02-01 01:43 · Score: 1
  
  Seriously, though, much more thought needs to be given to two things: one is making accidents harder, the other is making effective backups a no-brainer.
  
  --
  John_Chalisque
33. Re:I feel that lone sysadmin's pain by BlackHawk-666 · 2017-02-01 02:01 · Score: 1
  
  Accenture made exactly this blunder on the London Stock Exchange website root folder (running on IIS). Some nimrod came in and accidentally deleted all the files from that folder taking about 30 different financial products offline. We noticed pretty quick and scrambled to restore from a backup.
  Funny thing is...some other nimrod or the same one did almost the same thing a month later, this time only removing a few key products :-)
  
  --
  All those moments will be lost in time, like tears in rain.
34. Re:I feel that lone sysadmin's pain by BlackHawk-666 · 2017-02-01 02:03 · Score: 1
  
  I habitually shift-delete things because it saves a lot of time moving large folders with massive numbers of files into the recycle bin. I have been caught out by this once or twice over the years, but always had a recent backup and so have never lost anything that way.
  
  --
  All those moments will be lost in time, like tears in rain.
35. Re:I feel that lone sysadmin's pain by sinij · 2017-02-01 02:10 · Score: 1
  
  Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
  Actually: Check your privilege!
  Sudo is a real victim here. Let's not make it worse by engaging in victim-blaming.
36. Re: I feel that lone sysadmin's pain by PincushionMan · 2017-02-01 02:22 · Score: 2
  
  You might want to look into Squashfs. The archive command for a single directory (or file) is:
  mksquashfs source_dir target_image.sqfs
  If you want to do multiple directories or files, no problem:
  mksquashfs source_dir1 source_dir2 source_file1 source_file2 target_image.sqfs
  Squashfs generation is comparable to that of tar.gz files. Not only does it do gzip compression natively, it can compress the inodes in the directory tree and also do fs level de-duplication. Squashfs is compatible with any kernel from 2009+ (maybe before), and newer kernels also have the ability to use lzo and xz compressors. It's intended to be used anywhere that you would use tar.gz or cpio, with the added benefit that you can mount it loopback and extract a file that you need without the overhead of sequentially scanning through the tape archive. I've heard the windows version of 7zip can access a squashfs archive as well (as of 16.04 it must be a gzip compressed sqfs image). Squashfs natively detects sparse files - unless you tell it not to.
  The only thing I'm not sure of is how well unsquashfs handles the extraction of sparse files. Linux tar is totally unsuitable when dealing with sparse files, as it requires the full amount of space to extract a sparse file. For Linux tar, there's a workaround for sparse files, and that is to install BSD tar, which seems to extract sparse files correctly.
37. Re:I feel that lone sysadmin's pain by aaarrrgggh · 2017-02-01 02:33 · Score: 1
  
  But for it to be effective you really need to do a mv /trash/$date/, which still makes a restore/recovery a complete pain in the ass... but hopefully avoids one delete from overwriting another.
  
  When you have a system that uses multiple levels of backup, making sure that all of them are always working takes serious commitment, and the trash concept doesn't change that. Doing good backups is hard, especially for large data sets and tight budgets. We had a fantastic system for our Linux server, but had to migrate to Windows and over a year or two our sensitive data grew from 3TB to 5TB, and the backup system couldn't accommodate our archive process for old data (it doesn't reduce the total backup size). So, you add another marginally tested layer of backups, just in case...
38. Re: I feel that lone sysadmin's pain by TheRaven64 · 2017-02-01 02:34 · Score: 1
  
  That's going to make rm very slow unless everything is on a single filesystem, which makes backups difficult. We tend to put each user's home directory in a separate ZFS filesystem and have a cron job creating and pruning snapshots. If a user accidentally deletes anything, the snapshots are all automounted in their ~/.zfs directory so that they can just copy the older version out themselves. On the main network, home directories are all on the NetApp filer that does this automatically (though using their own filesystem and putting snapshots in the ~/.snap directory)
  
  --
  I am TheRaven on Soylent News
39. Re:I feel that lone sysadmin's pain by TheRaven64 · 2017-02-01 02:35 · Score: 1
  
  And when your home directories are all mounted over NFS, your mv command copies a massive amount of data over the network, fills up the local disk and, if run as root, breaks the system by filling up the emergency part of the FS reserved for the root user. Good plan.
  
  --
  I am TheRaven on Soylent News
40. Re: I feel that lone sysadmin's pain by mrchaotica · 2017-02-01 02:41 · Score: 2
  
  Call your little script something like rm2 instead
  Or better yet, something that doesn't even have the string "rm" in it, like trash.
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
41. Re:I feel that lone sysadmin's pain by TheRaven64 · 2017-02-01 02:43 · Score: 2
  
  I have. It's just as easy to accidentally click on the wrong folder, or delete the foo folder from the window showing bar instead of the window showing baz. This is why good UIs are all about making sure that there's an undo button that works after you've done the stupid thing, not about trying to make the stupid thing impossible. Most GUI systems will move things to the trash, rather than deleting. The problem is that users then get into the habit of reflexively emptying the trash immediately after a delete. You really want a filesystem design that adds blocks from deleted files to the end of a reuse list, so that new file allocation will overwrite the oldest deleted data by default and you can always undelete recently deleted things if you haven't written significant amounts of data in between the delete and the 'oh crap' moment.
  
  --
  I am TheRaven on Soylent News
42. Re: I feel that lone sysadmin's pain by budgenator · 2017-02-01 02:45 · Score: 1
  
  Oh hell yeah,
  format C:\ press any key to continue
  is sooo much easier, safer and modern in powershell.
  
  --
  Apocalypse Cancelled, Sorry, No Ticket Refunds
43. Re:I feel that lone sysadmin's pain by budgenator · 2017-02-01 03:30 · Score: 1
  
  I've had the GUI choke on "File too Large" when deleting, which sucks when the reason you're deleting files is because GUI told you the remaining file system free space is 0 b. At that point all you can do is drop into a terminal and start using rm.
  
  --
  Apocalypse Cancelled, Sorry, No Ticket Refunds
44. Re:I feel that lone sysadmin's pain by sunking2 · 2017-02-01 03:37 · Score: 1
  
  Maybe you should be doing server maintenance on the actual machine that needs the maintenance? If you don't understand where your files actually reside then you should maybe be in another job.
45. Re: I feel that lone sysadmin's pain by tibit · 2017-02-01 04:41 · Score: 1
  
  I do precisely that: each invocation of rm creates a fresh timestamped folder under mountpoint/rm_saved
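A minimal sketch of the "fresh timestamped folder per invocation" idea, assuming a POSIX shell with `mktemp -d` available. The function name `trash` and the `TRASH_ROOT` variable are hypothetical, not the poster's actual setup.

```shell
# Hypothetical helper: move targets into a fresh, timestamped folder
# instead of deleting them. TRASH_ROOT is an assumed variable name.
trash() {
    trash_root="${TRASH_ROOT:-$HOME/.trash}"
    mkdir -p "$trash_root" || return 1
    # mktemp -d yields a unique directory even for two calls in the same second,
    # so deleting the same filename twice never overwrites the earlier copy.
    dest=$(mktemp -d "$trash_root/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
    # Print the destination so the caller knows where to restore from.
    mv -- "$@" "$dest"/ && printf '%s\n' "$dest"
}
```

Pointing `TRASH_ROOT` at a directory on the same mount point as the data (as in `mountpoint/rm_saved`) keeps `mv` a cheap rename; across filesystems or NFS it turns into a full copy, which is the objection raised earlier in the thread.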
  
  --
  A successful API design takes a mixture of software design and pedagogy.
46. Re:I feel that lone sysadmin's pain by fuzzywig · 2017-02-01 04:53 · Score: 1
  
  Sharp knives are safer than blunt ones.
47. Re: I feel that lone sysadmin's pain by Penguinisto · 2017-02-01 04:54 · Score: 1
  
  This, right here... holy shit this!
  On critical stuff, you want to make it a habit to mv stuff you're not familiar with somewhere (/tmp works most cases), test the system, test affected applications, double-check once more, and *then* rm.
  On rm itself, I make it a habit to type the rm, double-check the command forwards and (literally) backwards, and only when satisfied hit enter. Ain't perfect, but I've caught potential disaster more times than I can count by bad regex, misplaced spacing, and other dumb tricks by reading it forwards and (literally!) backwards.
  PS: The very first time I screwed up on rm, I learned the hard way to never, ever, ever type rm -rf .* to blanket-remove hidden files. Tends to nuke your entire server, including NFS mounted disks.
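The `.*` trap comes from the glob also matching `.` and `..` in many shells (newer shells and rm versions may guard against this, so treat the details as shell-dependent). A hedged sketch of a pattern that selects hidden entries without the parent directory, using a hypothetical helper `list_hidden` to review the matches before deleting anything:

```shell
# Hypothetical helper: print hidden entries of a directory, never "." or "..".
list_hidden() {
    dir="${1:-.}"
    # .[!.]* matches names like .bashrc; ..?* matches names like ..foo.
    # Neither pattern can match "." or "..".
    for entry in "$dir"/.[!.]* "$dir"/..?*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
    done
}
```

Running `list_hidden /some/dir` first, and only then feeding the same two patterns to rm, follows the "look before you delete" habit described above.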
  
  --
  Quo usque tandem abutere, Nimbus, patientia nostra?
48. Re: I feel that lone sysadmin's pain by Jesus+H+Rolle · 2017-02-01 05:11 · Score: 1
  
  No, sharp knives are safer than dull knives. Blunt knives are safer than either.
49. Re:I feel that lone sysadmin's pain by mmell · 2017-02-01 05:20 · Score: 1
  
  No - the SAN's internal wipe. It took nearly thirty minutes to wipe the filesystems. Unfortunately, the fact that I'd wiped the wrong device didn't become evident until four hours after that. In all honesty, I'd have fired me that day. I'm glad my manager was a more understanding fellow than myself.
50. Re:I feel that lone sysadmin's pain by nasch · 2017-02-01 05:32 · Score: 1
  
  That is not a perfect substitute for the Windows (and I assume other OSes) trash/recycle bin. For example, try using both techniques to create foo.txt, delete it, create it again in the same place, and delete it again. The recycle bin will have both deleted files and you can restore either. The mv version will have only one copy.
51. Re:I feel that lone sysadmin's pain by gosand · 2017-02-01 05:58 · Score: 1
  
  yep.
  I am not a sysadmin, except on my own linux machine at home. I have been since 1998.
  I have learned that when I write scripts to do things, which is quite often, I always echo the key commands before actually running them.
  for i in 1 2 3 4 5
  do
  echo "rm -f $i"
  done
  I run it, look at what the command is going to do, then remove the echo. When messing around with files that might have spaces in them, or using multiple functions/calculations/variables, there is always something that can go wrong.
  I still remember back in 1999, I was working at a startup and a new developer, as root, did "rm -rf /" on a test server. He didn't live that down.
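The echo-first habit can be wrapped in a small function so the echo does not have to be added and removed by hand. This is only a sketch: the name `run` and the `DRY_RUN` variable are assumptions, and it defaults to printing rather than executing.

```shell
# Hypothetical wrapper: print a command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" != "0" ]; then
        # Brackets make argument boundaries visible, which exposes
        # filenames with spaces that were split by mistake.
        printf 'would run:'
        printf ' [%s]' "$@"
        printf '\n'
    else
        "$@"
    fi
}
```

In a script, writing `run rm -f "$i"` inside the loop prints each command on the first pass; setting `DRY_RUN=0` afterwards executes the same lines unchanged.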
  
  --
  
  My beliefs do not require that you agree with them.
52. Re:I feel that lone sysadmin's pain by rthille · 2017-02-01 06:00 · Score: 1
  
  Your manager and company had already paid the cost of "training" you, why would they then fire you and have to hire someone who might not be as careful as you would then be?
  
  --
  Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
53. Re: I feel that lone sysadmin's pain by dgatwood · 2017-02-01 06:25 · Score: 1
  
  Mine is even better because of its simplicity. On my production systems, rm is aliased to 'echo "Use /bin/rm if you really want to do this"' so that it forces me to take a second look at what I'm doing before I run the command in the first place.
  
  --
  Check out my sci-fi/humor trilogy at PatriotsBooks.
54. Re:I feel that lone sysadmin's pain by LordLimecat · 2017-02-01 06:47 · Score: 1
  
  I mean, when literally every chef who has ever cooked has a story about how KniveCo's knives chopped off one of their fingers because the handle was too small, maybe it's worth looking at mitigations.
55. Re:I feel that lone sysadmin's pain by N!k0N · 2017-02-01 07:07 · Score: 1
  
  The situation was handled though.
56. Re: I feel that lone sysadmin's pain by RabidReindeer · 2017-02-01 07:16 · Score: 1
  
  I'd never considered using squashfs as a backup mechanism, but if it works for you...
  My Red Hat tar utility does support sparse files, although it's turned off by default, I think. A compressed tarball wouldn't care about holes, since holes compress down to virtually nothing anyway. The real issue is more in how well the receiving filesystem will honor the holes when the files are transferred into it.
  My day-to-day backups are based on Bacula, which supports sparse files. Most of my alternative strategies are for short-term safety or long-term image storage.
  I'm sorry to say that most of the commercial backup products I've worked with over the years have let me down at the worst possible times. Linux tar has not, ZIP has not, as long as I don't have multi-volume ZIPs, and bacula files have not. In addition to being reliable and free, you can work with them on virtually any platform and not get nuked by OS, hardware, or data version issues.
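For the sparse-file point, a rough sketch assuming GNU tar, whose `--sparse` (`-S`) option is off by default. The function names are hypothetical, and whether holes survive on extraction still depends on the receiving filesystem.

```shell
# Create a 1 MiB file that is a single hole on filesystems that support holes.
make_sparse_file() {
    dd if=/dev/zero of="$1" bs=1 count=0 seek=1048576 2>/dev/null
}

# Archive a directory with GNU tar's sparse-file detection turned on.
archive_sparse() {
    src_dir="$1"; archive="$2"
    tar --sparse -czf "$archive" -C "$src_dir" .
}

# Extract the archive into a fresh directory.
restore_archive() {
    archive="$1"; dest_dir="$2"
    mkdir -p "$dest_dir" && tar -xzf "$archive" -C "$dest_dir"
}
```

After a round trip, comparing `ls -l` (apparent size) with `du -k` (allocated blocks) on the restored file shows whether the holes were honored.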
57. Re:I feel that lone sysadmin's pain by Cederic · 2017-02-01 07:30 · Score: 1
  
  Why would anybody not start with
  cd /home
  The moment you ever type 'rm -rf /' you've failed, no matter what you put after it.
58. Re:I feel that lone sysadmin's pain by grep+-v+'.*'+* · 2017-02-01 07:53 · Score: 1
  
  With a 40TB user SAN system from years ago, to delete major files from users and groups I told them they were gone but actually moved them to a user inaccessible directory. Then I waited 3 weeks or so. If no one complained the next day (or the next week for vacations) then I was pretty much good to go.
  
  I also -- being paranoid -- checked the time stamps to make sure no file access had occurred. Then again the users couldn't realistically find or access files from within an open and wet paper bag so I wasn't worried much, but I still checked. (Never mind the 4 hour snapshots, the daily incrementals and weekly full backups. They all worked but GOD were restores slow.)
  
  --
  If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
59. Re:I feel that lone sysadmin's pain by UnknownSoldier · 2017-02-01 08:42 · Score: 1
  
  Tab filename completion would have caught that.
  That, along with double-checking the command line before you do ANY rm stuff ...
60. Re:I feel that lone sysadmin's pain by CanadianMacFan · 2017-02-01 10:32 · Score: 1
  
  Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
  It isn't a backup until you have verified that you can restore from it.
61. Re: I feel that lone sysadmin's pain by david_thornley · 2017-02-01 11:50 · Score: 1
  
  For most of what I do, I type "ls whatever" and examine the output. If it's what I want to delete, I do a little command-line editing.
  Also, I have a little ritual. When I'm typing something potentially dangerous, I type it, sit on my hands, and examine it carefully. This means concentrating and not paying attention to anything else.
  
  --
  "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
62. Re:I feel that lone sysadmin's pain by LinuxIsGarbage · 2017-02-01 11:55 · Score: 1
  
  Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!
  IT always says "Don't store stuff on your local hard drive, store it on the network drive where it's regularly backed up!". I still think I'm better off storing on my hard drive, and managing local backups. Particularly where the user's private network drive has a retention policy of 1 year, after which it guarantees the files are deleted.
  And also after I filed a ticket with IT when a user deleted another user's file on the network drive. Two weeks later IT STILL couldn't find the backup tape from the ninth... That was one file, imagine losing the whole server.
  Shared network drives also have a tendency of people accidentally dragging and dropping a subdir into another subdir.
63. Re:I feel that lone sysadmin's pain by aaarrrgggh · 2017-02-01 14:52 · Score: 1
  
  ZFS is not backup just like RAID is not backup. I have used ZFS and BTRFS and do love what it can *add* to a backup system, but it still isn't a substitute for multiple offline revisions off site, and no, taking drives out of a zpool and archiving them isn't the same.
  
  The threats today are different than what we used tape for way back when. We often get by with the same mindset, but it isn't perfect.
64. Re:I feel that lone sysadmin's pain by quenda · 2017-02-01 19:29 · Score: 1
  
  No, but if you can't be bothered to log in ...
65. Re:I feel that lone sysadmin's pain by jabuzz · 2017-02-01 21:42 · Score: 1
  
  Yeah I said a *SINGLE* run mate.
66. Re: I feel that lone sysadmin's pain by minstrelmike · 2017-02-02 07:17 · Score: 1
  
  I've heard good things about version control. I use the stuff they have on GitHub. It's incredible.
67. Re: I feel that lone sysadmin's pain by ploppy · 2017-02-03 14:05 · Score: 1
  
  > The only thing I'm not sure of is how well unsquashfs handles the extraction of sparse files.
  If the file is stored as a sparse file in the Squashfs filesystem (normally the case), then Unsquashfs will create it as a sparse file when extracting it. It doesn't need any more filesystem space than the filled parts of the file when doing so.
  I wrote the code and so I should know :-)
Repeat after me (and others) by Nkwe · 2017-01-31 19:12 · Score: 5, Interesting

If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
1. Re:Repeat after me (and others) by dbIII · 2017-01-31 19:28 · Score: 2, Informative
  
  Good advice but it's a misleading headline above. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
  It is a good example that replication is not a backup and is often a way to just mirror mistakes.
2. Re:Repeat after me (and others) by hcs_$reboot · 2017-01-31 19:37 · Score: 1
  
  Typical case of "we're unlikely to lose our data, and anyway we've got a backup which in turn is unlikely to fail; so why test an unlikely x unlikely event?"
  
  --
  Slashdot, fix the reply notifications... You won't get away with it...
3. Re:Repeat after me (and others) by hcs_$reboot · 2017-01-31 20:09 · Score: 1
  
  Test your backups, at least once, seriously!
  
  --
  Slashdot, fix the reply notifications... You won't get away with it...
4. Re:Repeat after me (and others) by MatthiasF · 2017-01-31 20:24 · Score: 5, Informative
  
  Uh, did you read the article?
  
  The six hours old snapshot was a fluke manual LVM snapshot run, normally they are 24 hours. The SQL_dumps weren't running at all because of mis-configuration, producing tiny little files and failing silently. Webhooks will need to be rolled back to the 24 hour backup since they were removed in the 6 hour one because of a synchronization process (meaning at best 18 hours of updates will have no webhooks but possibly all 24 hours at worst). Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3 for what I assume is vendor agnostic redundancy has sent no files at all ("the bucket is empty").
  
  It's like they thought out everything but never made sure any of it was working.
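Silent failures like tiny dump files or an empty bucket can at least be flagged by a cheap sanity check. The sketch below is an assumption-laden illustration, not what GitLab ran: `check_backup` and its thresholds are hypothetical, and it relies on `find -mmin`, which GNU and BSD find provide.

```shell
# Hypothetical check: fail loudly if a backup file is missing, suspiciously
# small, or older than expected. All thresholds are assumptions to tune.
check_backup() {
    file="$1"; min_bytes="$2"; max_age_min="$3"
    if [ ! -f "$file" ]; then
        echo "FAIL: $file is missing" >&2; return 1
    fi
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is only $size bytes" >&2; return 1
    fi
    # find prints the path only if it was modified within the last N minutes.
    if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
        echo "FAIL: $file is older than $max_age_min minutes" >&2; return 1
    fi
    echo "OK: $file ($size bytes)"
}
```

A call such as `check_backup /backups/latest.sql 1048576 1500` (path and numbers invented for illustration) could run from cron and page someone on a non-zero exit. It is only a smoke test; it does not replace an actual test restore.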
5. Re:Repeat after me (and others) by dbIII · 2017-01-31 20:30 · Score: 5, Insightful
  
  Uh, did you read the article?
  No, and I got the wrong impression from skimming the article.
  You are correct and I am not.
6. Re:Repeat after me (and others) by phantomfive · 2017-01-31 20:31 · Score: 1
  
  I think backups are surprisingly likely to fail. Just like RAID is surprisingly likely to have more than one disk fail at a time, even though intuitively that seems extremely unlikely.
  
  --
  "First they came for the slanderers and i said nothing."
7. Re:Repeat after me (and others) by Opportunist · 2017-01-31 20:36 · Score: 5, Funny
  
  Sing with me, kids:
  One backup in my bunk
  One backup in my trunk
  One backup at the town's other end
  One backup on another continent
  All of them tested and verified sane
  now go to bed, you can sleep once again
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
8. Re:Repeat after me (and others) by tonymercmobily · 2017-01-31 21:16 · Score: 5, Insightful
  
  "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
  OK, now that I have repeated it, let me add.
  As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.
  THEN you will see, for real, how your company reacts to real disasters.
  The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.
  Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.
  http://www.datacenterknowledge...
  Merc.
9. Re:Repeat after me (and others) by JaredOfEuropa · 2017-01-31 22:19 · Score: 1
  
  In other words: IT fire drills. Smart companies conduct them... but somehow I have never seen them done, or even seen companies asking their outsourcing partners to produce some proof of recovery procedures having been tested. No, "they are ISO-over-9000 and that is good enough for us". Good enough to cover your arse when things go south, sure.
  
  We had plenty of actual fire drills, though.
  
  --
  If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
10. Re:Repeat after me (and others) by Daimanta · 2017-01-31 22:26 · Score: 2
  
  RAID fails because hard disks (probably of the same type and batch) running together take the same load, so they do not fail independently of one another. Their failure correlation is therefore likely to be quite high. This explains why rebuilding a RAID array after a failure can be a very dangerous operation and could easily lead to total failure. Usually, doing (incremental) backups is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.
  
  --
  Knowledge is power. Knowledge shared is power lost.
11. Re:Repeat after me (and others) by Applehu+Akbar · 2017-01-31 22:55 · Score: 2
  
  If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
  Especially now that ransomware is overwriting online backups.
12. Re:Repeat after me (and others) by Anonymous Coward · 2017-01-31 22:56 · Score: 1
  
  ...Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3 for what I assume is vendor agnostic redundancy has sent no files at all ("the bucket is empty").
  It's like they thought out everything but never made sure any of it was working.
  It's one thing to not validate your backups with a test restore.
  It's another level of stupidity entirely when you don't even check to see that there are no fucking files at all.
  Perhaps not the best time to rub salt in the wound, but I find myself recalling the wise words of the immortal Red Foreman...
  "Dumbass!"
13. Re:Repeat after me (and others) by Anonymous Coward · 2017-01-31 23:00 · Score: 1
  
  Uh, did you read the article?
  No, and I got the wrong impression from skimming the article.
  You are correct and I am not.
  This kind of humility does not belong on Slashdot. Honestly I don't know what's happening to this place...
14. Re:Repeat after me (and others) by MatiasKiviniemi · 2017-01-31 23:20 · Score: 2
  
  The internetz council convened and decided we will none of that "admit my mistakes"-bullshit here. Please hand in your card and exit the premises immediately.
15. Re:Repeat after me (and others) by 140Mandak262Jamuna · 2017-01-31 23:43 · Score: 1
  
  Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company
  You make the assumption that the CXO wants to save the company. Downtime costs happen this quarter. Benefit accrues to whoever is the CXO five years down the line. Why should the current CEO save the a** of the next CEO? Squeeze the company dry, show as much revenue/profit as possible, cash the stock options and skip town. By the time they discover that the shoddy backup vendor you hired to cut costs had been saving the data on "1TB" thumbdrives bought in some flea market in outer Mongolia, you are already well into wrecking the next company.
  
  --
  sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
16. Re:Repeat after me (and others) by cdrudge · 2017-02-01 00:37 · Score: 4, Insightful
  
  "they are ISO-over-9000 and that is good enough for us"
  Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:
  1. Perform backup
  2. Pray nothing goes wrong.
  Now hopefully they have something a lot more than that. But if they don't test the backups. If they don't hold an "IT fire drill" to practice what to do when the feces hits the fan. If they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have completely documented their process and follow it like the standards require.
17. Re:Repeat after me (and others) by drinkypoo · 2017-02-01 00:46 · Score: 1
  
  As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events.
  As an IT professional, and occasional admin, you MUST have backup for your hardware to switch to, which mitigates the pain of live testing. The hardware is typically a small portion of the total cost of the business, even if you double it.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
18. Re:Repeat after me (and others) by Zontar+The+Mindless · 2017-02-01 01:08 · Score: 3, Funny
  
  Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)
  (A: "Too many backups".)
  --Actual question from the final exam for the Networking 100 class I took in 1998.
  
  --
  Il n'y a pas de Planet B.
19. Re:Repeat after me (and others) by tomhath · 2017-02-01 01:18 · Score: 2
  
  If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.
  And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.
20. Re:Repeat after me (and others) by TheRaven64 · 2017-02-01 02:47 · Score: 3, Informative
  
  Please mod the parent up. After the uptick in trolling and invective in the last couple of months, this post is a breath of fresh air around here.
  
  --
  I am TheRaven on Soylent News
21. Re:Repeat after me (and others) by TheRaven64 · 2017-02-01 02:50 · Score: 1
  
  That sounds like a great idea, after you've tested that you can bring up a clone of your production system onto a spare [virtual] machine from the backups. If you don't do that first, then it sounds like an expensive way of discovering the bug that caused you to lose all of your customers' data.
  
  --
  I am TheRaven on Soylent News
22. Re:Repeat after me (and others) by tonymercmobily · 2017-02-01 03:01 · Score: 1
  
  That sounds like a great idea, after you've tested that you can bring up a clone of your production system onto a spare [virtual] machine from the backups. If you don't do that first, then it sounds like an expensive way of discovering the bug that caused you to lose all of your customers' data.
  That should be a given. But, being able to do that doesn't mean that you WILL be able to recover quickly from a REAL outage (hence the voluntary, self-inflicted outage).
23. Re:Repeat after me (and others) by aaarrrgggh · 2017-02-01 03:15 · Score: 1
  
  Ok, but what about the next level? I worked with a bank 10-15 years ago that dropped their data center due to old ups batteries, restarted the mainframes when the generator kicked on, and now had to make a difficult decision on switching to their DR site, or continue to run on generators until the batteries could be replaced in a couple weeks.
  
  It was a difficult decision because while they regularly tested going to DR, there was no way to roll back to primary. (This was true for nearly all banks at the time.) Using the DR site actively cost enough to make their CFO really squirm (more than the fact that he didn't authorize battery replacement in the first place).
  
  Point being, the best laid plans of mice and men...
24. Re:Repeat after me (and others) by Notabadguy · 2017-02-01 03:39 · Score: 3, Funny
  
  Look at his user ID. Give him time, he'll come around.
25. Re:Repeat after me (and others) by Hognoxious · 2017-02-01 03:52 · Score: 1
  
  The old joke was that lead lifejackets were ISO 9000 compliant ... as long as you write down that they're made of lead.
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
26. Re:Repeat after me (and others) by budgenator · 2017-02-01 04:41 · Score: 1
  
  Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)
  (A: "Too many backups".)
  --Actual question from the final exam for the Networking 100 class I took in 1998.
  I would argue that there is always a point where the expense of another backup exceeds the benefit of having it. Sometimes the expense is monetary, sometimes it's lost availability. I'm lucky, I can rsync the server once a day to another computer and backup from that machine, it's easier to replace any lost data by hand than to do a more fine-grained backup routine.
  
  --
  Apocalypse Cancelled, Sorry, No Ticket Refunds
27. Re:Repeat after me (and others) by sl3xd · 2017-02-01 05:53 · Score: 1
  
  Is there a tune that's supposed to be sung to?
  
  --
  -- Sometimes you have to turn the lights off in order to see.
28. Re:Repeat after me (and others) by h4ck7h3p14n37 · 2017-02-01 07:50 · Score: 1
  
  If you use RAID, you need to do regular disk scrubbing, SMART surface scans, etc...
  Or you could run ZFS and not worry about it.
29. Re:Repeat after me (and others) by cdrudge · 2017-02-01 08:53 · Score: 1
  
  The ISO-9001 certification has little to do with best practices.
  Exactly. During my ISO-9001 internal auditor training, we had it drilled into us that the standard said nothing about what was the right or wrong way to do something, best practices, common sense, etc. It was all about documenting how something is done and doing something how it's documented.
30. Re:Repeat after me (and others) by phantomfive · 2017-02-01 09:17 · Score: 1
  
  Yeah, I now think of RAID as a way to improve disk access times, rather than as a backup method.
  
  --
  "First they came for the slanderers and i said nothing."
31. Re:Repeat after me (and others) by KlomDark · 2017-02-01 10:09 · Score: 1
  
  Damn whippersnappers!
32. Re:Repeat after me (and others) by Opportunist · 2017-02-01 12:49 · Score: 1
  
  Fffft. You whip up a rhyming routine on the spot in a foreign language.
  
  --
  We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
33. Re:Repeat after me (and others) by Zontar+The+Mindless · 2017-02-01 13:21 · Score: 1
  
  You didn't even count the blanks. FAIL.
  
  --
  Il n'y a pas de Planet B.
34. Re:Repeat after me (and others) by david_thornley · 2017-02-02 03:53 · Score: 1
  
  I don't really believe in backups unless I've demonstrated I can restore from them. When I'm dealing with anything important, I figure that data that isn't backed up doesn't exist, and data isn't backed up until restoration has been tested.
  I'm reminded of Knuth's comment that he hadn't tested certain code, but merely proved it to be correct.
  
  --
  "When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Don't use rm! by subk · 2017-01-31 19:13 · Score: 2

Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

--
Now, if you'll excuse me, I have backups to corrupt.
1. Re:Don't use rm! by hcs_$reboot · 2017-01-31 19:39 · Score: 1
  
  # rm `which rm`
  
  --
  Slashdot, fix the reply notifications... You won't get away with it...
2. Re:Don't use rm! by subk · 2017-01-31 19:57 · Score: 1
  
  Yeah just mv foo /dev/null
  No, you're missing the point. mv foo /some/safe/place and when everything is working again... and you're sure you don't need it.. Then and only then use rm.
  
  --
  Now, if you'll excuse me, I have backups to corrupt.
3. Re:Don't use rm! by infolation · 2017-01-31 19:58 · Score: 5, Funny
  
  Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
  Don't tell the customer anything!! Geez... What's with these semi-pros?
4. Re:Don't use rm! by Darinbob · 2017-01-31 20:07 · Score: 3, Interesting
  
  Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.
  This is why sysadmins were created, because the people actually using the computers didn't want to manage them.
5. Re:Don't use rm! by ThunderBird89 · 2017-01-31 20:25 · Score: 1
  
  > Don't tell the customer anything until the dust settles!
  That's one way to handle a major crisis, but if you're transparent about an issue, it puts a lot more minds at ease than it upsets, since then at least your customers know that you're aware of the problem, that you're working to fix it, and that they can communicate with you.
  
  --
  Hyperbole: I use it liberally!
6. Re:Don't use rm! by sodul · 2017-01-31 20:43 · Score: 3, Insightful
  
  Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is DevOps are Ops that Develop scripts, or worse, a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos driven with anything goes since Sales promised a feature to a prospect customer yesterday, every week.
7. Re:Don't use rm! by Anonymous Coward · 2017-01-31 22:59 · Score: 1
  
  Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
  Don't tell the customer anything!! Geez... What's with these semi-pros?
  Tell the customers everything is A-OK, then blame everyone else for everything!!! Geez... What's with these upstart pros?
8. Re:Don't use rm! by c4757p · 2017-02-01 00:40 · Score: 1
  
  Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?
  I've never even used them before, and this transparency has moved them to the top of my list for the future.
  Fuckups happen to everybody, despite all the 20/20 Captain Hindsights here pointing out everything that went wrong. I like to see how people handle their fuckups, and they're handling this one with grace.
Test your backups! by djinn6 · 2017-01-31 19:13 · Score: 2

Two things:
1. Test your backups
2. TEST your BACKUPS!
1. Re:Test your backups! by Anonymous Coward · 2017-01-31 20:15 · Score: 2, Funny
  
  but NOT on your production hardware running live services.
  me thinks gitlab should have browsed their hosted repos for some backup software.
2. Re:Test your backups! by asylumx · 2017-02-01 00:36 · Score: 1
  
  but NOT on your production hardware running live services.
  There are plenty who disagree with this. Right or wrong, their arguments have merit.
3. Re:Test your backups! by aaarrrgggh · 2017-02-01 03:24 · Score: 1
  
  All I can say is I sure as hell preferred SnapBack to Veeam. While I originally crafted a solution that provided the same net result, SnapBack was so painless. Throw in btrfs snapshots and you have a fast, robust, reliable system that you can run from a NAS unit doing pull backups.
  
  I get the benefits of Veeam, but it is an awful tool for hourly backups and deleted file recovery.
4. Re:Test your backups! by AK+Marc · 2017-02-01 10:06 · Score: 1
  
  The CIO who demanded we not test backups (the email, printed and filed) was promoted after the server failed, and we found out that the "never errored" backups didn't actually back up the systems in question, but the wrong sets of files from a selection of servers, set up long before I got there, but since the BackupExec job completed successfully every day for years, there couldn't be a problem, and it would be a waste of time to check them. I wasn't fired for that, but I was thrown under the bus by the guy that caused the problem.
  
  --
  Learn to love Alaska
Re:Inexcusable for a hosting provider by glenebob · 2017-01-31 19:32 · Score: 1

The first sentence is true. The second one only achieves "should be true" status.
Simple by Anonymous Coward · 2017-01-31 19:36 · Score: 1

Recycle bin, restore.
Re: Not quite surprised by Anonymous Coward · 2017-01-31 19:52 · Score: 1

Github is not Gitlab
At least it wasn't github.com by jtara · 2017-01-31 20:02 · Score: 2

At least it wasn't github.com.
So, it didn't break the Internet.
And practically everything else.
1. Re:At least it wasn't github.com by phantomfive · 2017-01-31 20:38 · Score: 1
  
  Github goes down from time to time, too. Self-hosting code is so easy (that's what git was designed to do), that there's really no reason to have your company depend on Github. Unless you're early stage startup and don't even have an office or something.
  
  --
  "First they came for the slanderers and i said nothing."
2. Re:At least it wasn't github.com by Richard_at_work · 2017-01-31 22:08 · Score: 1
  
  Github isn't just code - there is a heck of a lot there which you don't get locally without lots of third party tools and the hassle that comes with them.
3. Re:At least it wasn't github.com by guruevi · 2017-02-01 01:26 · Score: 1
  
  That's only the systemd repo, most repos don't do that.
  
  --
  Custom electronics and digital signage for your business: www.evcircuits.com
4. Re:At least it wasn't github.com by Richard_at_work · 2017-02-01 06:09 · Score: 1
  
  Good for you ... now the vast majority of everyone else, however, doesn't think that because they don't use something it's worthless...
5. Re:At least it wasn't github.com by phantomfive · 2017-02-01 09:15 · Score: 1
  
  If you can handle the downtime and have good local backups, then fine, go with that. Otherwise you're going to be in pain and regret, sooner rather than later.
  
  --
  "First they came for the slanderers and i said nothing."
Made this mistake once... by daid303 · 2017-01-31 20:06 · Score: 2

I've made this mistake, deleted all attachments on a live system once.
After this, I made all the prompts for critical servers a different color:
export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\]'
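A minimal sketch of how the same idea could be applied from a shared profile, so that only critical hosts get the red background. The `prompt_for_host` name and the `prod-*` hostname convention are assumptions made for illustration, not something from the original setup. The `\[` and `\]` markers tell bash that the color codes take up no width on the line.

```shell
#!/usr/bin/env bash
# Sketch: pick a prompt based on the host name.
# "prompt_for_host" and the "prod-*" naming scheme are hypothetical.
prompt_for_host() {
    case $1 in
        # Critical hosts: red background (ESC[41m), reset afterwards (ESC[49m).
        prod-*) printf '%s' '\[\e[41m\]\u@\h:\w\$\[\e[49m\] ' ;;
        # Everything else: plain prompt.
        *)      printf '%s' '\u@\h:\w\$ ' ;;
    esac
}

# uname -n prints the host name of the current machine.
PS1=$(prompt_for_host "$(uname -n)")
```

Sourcing this from the same file on every machine keeps the rule in one place instead of relying on someone remembering to color each new server by hand.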
1. Re:Made this mistake once... by serviscope_minor · 2017-01-31 21:00 · Score: 4, Funny
  
  Good choice. But, I always use this prompt:
  PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\$[^\\]\\{6\\}\$[^\\]\\{2,\\}/\\1~1/g" ) >'
  
  --
  SJW n. One who posts facts.
2. Re:Made this mistake once... by Pascoea · 2017-02-01 04:26 · Score: 1
  
  Haha. Now that's just awesome enough that I'm going to keep it how you suggested. My Linux skills would be classified as "Knows enough to be dangerous", and I have to admit I had no idea exactly what that would do, but the "C:" intrigued me enough to try...
How can this keep happening? by Anonymous Coward · 2017-01-31 20:17 · Score: 3, Interesting

I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?
http://jefferai.org/2013/03/29/distillation/
When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.
1. Re: How can this keep happening? by Anonymous Coward · 2017-01-31 22:58 · Score: 1
  
  The KDE incident could reasonably be called a flaw with Git (I don't know if it's been fixed since then), but this time it's just a case of someone deleting the wrong data, and that's hardly Git's fault. If anything, distributed systems like Git are more robust against that than centralised ones, because more of the data is copied to the client whenever they clone or update.
2. Re:How can this keep happening? by Entrope · 2017-02-01 00:46 · Score: 3, Informative
  
  KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.
  If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.
If only there was another copy of the repo by HxBro · 2017-01-31 20:28 · Score: 5, Funny

Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this
1. Re:If only there was another copy of the repo by gweihir · 2017-01-31 22:37 · Score: 2
  
  Just imagine if you had actually read the story. The git-repos are not affected.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
2. Re:If only there was another copy of the repo by AmiMoJo · 2017-02-01 02:15 · Score: 1
  
  Repos will be okay, it's all the ancillary stuff, i.e. the things that make them worth using over other git hosting companies. User management, wikis, release management, issue tracking etc.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re: In defence of lone sysadmins by Anonymous Coward · 2017-01-31 20:52 · Score: 1

libtard
Unfortunately common by CustomSolvers2 · 2017-01-31 20:56 · Score: 4, Interesting

I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not precisely ideal, but within the kind of errors which should be expected): allowing these files to be deleted at all.

If your whole business is about dealing with the data which a big number of users generate at any point, you should (after having made completely sure that your backup system is rock solid) restrict as much as possible the access to such valuable information; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after having developed the whole system; for example, giving read-only access unless strictly required like high-level admin personnel (who can use these credentials only after passing through a further validation step) or automated applications (whose credentials are regularly generated and nobody knows).

These problems are usually provoked when developing/dealing with a system without putting the whole focus on technical aspects/what is best for it from a technical perspective. They shouldn't exist at all when doing everything properly at each stage from development to deployment, administration, general policies, etc.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
1. Re:Unfortunately common by CustomSolvers2 · 2017-02-03 01:25 · Score: 1
  
  The person who executed the command -is- their high-level admin person
  If this is the case, the fact of these files being deletable would certainly make (some) sense.
  
  Although there are many alternatives to further constrain certain deletions (even the ones done by the most-privileged users) and to minimise the chances of problems. For example, by asking for additional confirmation or (as suggested in various posts above) always keeping a copy of the deleted items for some days or always checking whether an actual backup of the deleted files exists before going ahead with the deletion, etc. I am not saying that it is impossible, but with a proper system in place giving a very special treatment to the most important parts, it would be really difficult.
  
  Their database replication stopped because the replicating server....
  None of this sounds like a valid excuse to me. A proper backup system shouldn't be affected by almost anything. Some examples to minimise these risks: automatically duplicating each user input in the moment, always having backups to the backups to the backups (e.g., programs running 24/7 checking that the main backup is OK and, if not, starting using the alternative 1 and then 2, etc. After having clearly warned everyone about such an issue!), etc.
  
  I don't know the exact situation and that's why I cannot deliver a worthy enough assessment. Additionally, I don't like talking in abstract terms and do firmly believe that there are lots of exceptions everywhere and every time. But even by bearing all this in mind, I cannot think of many good reasons for a company, whose main business is dealing with user data, to not be able to avoid a data loss by doing everything properly. I can even add that if I were ever personally involved in such a situation, I would feel really ashamed and wouldn't even think about trying to (somehow) justify my behaviour.
  
  --
  Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
Or you use scripts by perpenso · 2017-01-31 21:08 · Score: 2

That's why you always always run ls first.
ls -ld /home/user1 /home/user2 /home/ user3
Then edit the command to rm. Always.
Or you use scripts.

somescript user1 user2 user3
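A rough sketch of what such a script could look like, combining both habits: it runs the ls first and only deletes after a typed confirmation. `confirm_rm` is a made-up name, and the behaviour here (refuse if any path is missing, require the literal word "yes") is one assumption among many possible designs.

```shell
#!/usr/bin/env bash
# Sketch: show exactly what would be deleted, then require a typed "yes".
# "confirm_rm" is a hypothetical helper, not a standard tool.
confirm_rm() {
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_rm PATH..." >&2
        return 2
    fi
    # List the targets first. ls fails if any argument does not exist,
    # which is usually the sign of a typo such as a stray space.
    if ! ls -ld -- "$@"; then
        echo "refusing: at least one path does not exist" >&2
        return 1
    fi
    # Print the host name so a wrong-server mistake is visible too.
    printf 'Delete the %d path(s) listed above on host %s? Type "yes": ' \
        "$#" "$(uname -n)"
    read -r answer
    if [ "$answer" != "yes" ]; then
        echo "aborted, nothing deleted" >&2
        return 1
    fi
    rm -rf -- "$@"
}
```

With the mistyped arguments from the ls example above, the listing step fails on the nonexistent `user3` and nothing is removed, instead of `/home/` being passed to rm.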
1. Re:Or you use scripts by jbrown.za · 2017-02-01 02:46 · Score: 1
  
  Scripts can be very dangerous as well.
  Many problems have been caused by scripts that are tested while logged in as a user and then run under the root crontab, where the starting directory or environment variables are not the same.
2. Re:Or you use scripts by perpenso · 2017-02-01 06:05 · Score: 1
  
  Scripts can be very dangerous as well. Many problems have been caused by scripts that are tested while logged in as a user and then run under the root crontab, where the starting directory or environment variables are not the same.
  You can have a typo or mistake in the script and have it occur once (in a test environment usually). Or you can have a potential typo or mistake (typing from wrong directory here too) every time you manually execute commands on a production system. There are potential typos and mistakes on either path, but one reduces the risks.
3. Re:Or you use scripts by grep+-v+'.*'+* · 2017-02-01 07:41 · Score: 1
  
  Or you use scripts. somescript user1 user2 user3
  Certainly: somescript . /user1
  
  I was originally going to say: .. /user1 but figured that would just be mean on rm /home/$user success. You could always try to make the script smarter but that just breeds more intelligent idiots.
  
  Signed: Bobby Tables
  
  --
  If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
Every tech company by Anonymous Coward · 2017-01-31 21:33 · Score: 1

goes through this kind of incident. It's part of growing up.
We lost 30TB of data from a flaky NAS a few years back, still there !
(and good luck, Gitlab is great software)
All my sympathy... by Gumbercules!! · 2017-01-31 22:23 · Score: 4, Insightful

I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.
1. Re:All my sympathy... by malkavian · 2017-02-01 01:13 · Score: 2
  
  Definitely feel for 'em.. And really feel for the guy who was on the keyboard..
2. Re:All my sympathy... by Pascoea · 2017-02-01 04:31 · Score: 1
  
  I think everybody with any kind of root access has done it. I found out the hard way that "delete * from userSettings where username=whatever" is a significantly different query than "delete * from userSettings where username-whatever". Seeing a result of "23134 records affected" when expecting "1 records affected" will wake a guy up in a hurry.
An that is why you run BCM and recovery tests by gweihir · 2017-01-31 22:36 · Score: 3, Interesting

Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.
Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:An that is why you run BCM and recovery tests by Entrope · 2017-02-01 00:52 · Score: 1
  
  BCM? Bravo Company, manufacturer of firearm parts so you can shoot your servers? Buzzword-Centric Methodology? The SourceForge "BCM" project, a file compression utility? Baylor College of Medicine? Bear Creek Mining? Bacau International Airport? Broadcom?
2. Re: An that is why you run BCM and recovery tests by Entrope · 2017-02-01 01:48 · Score: 1
  
  You first. If your head is so far up your ass that you can't tell when you're using a buzzword acronym with little exposure in the tech world and a lot of plausible meanings, you might be a tool.
3. Re:An that is why you run BCM and recovery tests by msauve · 2017-02-01 03:37 · Score: 1
  
  He also mentioned bean counters, so maybe it means Bean Counting Management. But based on his later post, it's clear that even he doesn't know what it means.
  
  --
  "National Security is the chief cause of national insecurity." - Celine's First Law
4. Re: An that is why you run BCM and recovery tests by gweihir · 2017-02-01 04:37 · Score: 1
  
  You need both. BCM to continue operating while the DR activities are ongoing and DR to get back to normal. For example, a fail-over system is a BCM measure, while restoring from backups is DR. For a university it may be acceptable to just do without IT until DR is completed. Both are absolute standard terms in enterprise IT.
  @Entrope: Incidentally, a civil question gets a civil answer. The first 2 Google hits also decipher BCM in the context here and if you add "outage" it is a few more. So I guess you were not even trying, except trying to be an ass.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
5. Re: An that is why you run BCM and recovery tests by AK+Marc · 2017-02-01 10:25 · Score: 1
  
  Nope. DR is a "solution" to the question of BCM. But you are speaking as if you assume BCM is some manner of redundancy (or diversity). They had BCM. BCM doesn't help. They had backups. They didn't properly test them. That means they had BCM, and BCM was worthless. So bringing it up looks to be more a way to talk buzzwords than help people.
  
  BCM is "when the site goes down, we spin up the backups in AWS" or something like that. That the backups don't work, and aren't tested is unrelated to BCM, or DR, or any of your other worthless buzzwords.
  
  --
  Learn to love Alaska
6. Re: An that is why you run BCM and recovery tests by gweihir · 2017-02-01 11:29 · Score: 1
  
  That is really not how this works. DR is how you re-establish normal operations. BCM is what you do before you reach that state.
  Also, have you missed the part where I said "An that is why you run BCM and recovery tests"? You do not just need plans for BCM and DR, you need to test them and that was my whole point.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
7. Re: An that is why you run BCM and recovery tests by gweihir · 2017-02-01 11:35 · Score: 1
  
  Indeed. There are however a lot of amateurs around and quite a few BCM and DR plans that are not very good or not adequately tested. This story demonstrates that nicely. The thing is that both BCM and DR and respective tests are costly and do not directly create value and so the bean-counters always try to reduce them, because they do not grasp risk management.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
8. Re:An that is why you run BCM and recovery tests by gweihir · 2017-02-01 11:38 · Score: 1
  
  Your powers of deduction are amazing in their ineffectiveness. By now everybody else in this thread has probably seen that it is of course "Business Continuity Management". Google has about 521'000 hits for it (in quotes, i.e. the full term), hence it can hardly be an obscure concept.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
9. Re: An that is why you run BCM and recovery tests by AK+Marc · 2017-02-01 11:59 · Score: 1
  
  You don't "run" BCM. Business Continuity Management is about calculating the costs and risks. It is done by the CFO and COO, not the CIO. The CIO (or grunts below) come up with a BCP to meet the BCM. DR is one option for a BCP to meet the BCM.
  
  Yes, I read your words. They didn't make sense. Expand the acronym, and it's not even proper English. "An that is why you run Business Continuity Management (BCM) and recovery tests". You don't run BCM tests. You run BCP tests.
  
  Or do you not know the difference between BCM and BCP, and are lecturing others for using the terms wrong?
  
  --
  Learn to love Alaska
10. Re: An that is why you run BCM and recovery tests by gweihir · 2017-02-01 12:20 · Score: 1
  
  This is slashdot. You will always find inaccuracies because of lack of detail as nobody writes a dissertation here. (And even in a dissertation, that problem would exist. I know, I did one.) It is completely clear what I meant and where I simplified. Your argument has no merit.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
11. Re: An that is why you run BCM and recovery tests by AK+Marc · 2017-02-01 12:38 · Score: 1
  
  You over simplified to the point it was incorrect. That makes you wrong. Your argument has no merit. If you want to stop being wrong and looking like an argumentative idiot, spend more effort in being accurate. It wouldn't have taken any more words to have been more accurate.
  
  --
  Learn to love Alaska
DR Testing as a business model by swb · 2017-02-01 00:25 · Score: 1

Does anyone think that Backups/DR Testing as a business would be something that businesses would go for?
Everybody "runs backups" but due to all the usual limitations in time and capacity, nobody really tests whether they can restore everything and actually make it work, and how long it might actually take to accomplish this.
I always wondered if you could mount a hundred TB of storage, a couple of tape drives, and switching into one of those rock band roadie cases and take it to a business with the idea that they would hand over their backup media and then see what happens when they try to restore their data to your equipment.
The customer would provide all software and media, just as they would in a real disaster.
It would eliminate the "we can't restore everything" capacity issue most places have, the fact that the equipment would differ from what they have (even if its only slight model derivations) would be the kind of variation likely in a real DR scenario -- if you have to physically replace hardware, it likely won't be the same model stuff you have now.
An option would exist to have/not have the staff participate in the process -- I'm sure many CxOs are curious if their "system" can survive being restored by someone else.
1. Re:DR Testing as a business model by coofercat · 2017-02-01 01:53 · Score: 1
  
  As a sysadmin, this sounds great (a bit 'brown trousers' for me personally, but great). However, one of my clients is entirely 'in the cloud', so no need for your truck of kit - just provide as many VMs as we like somewhere on t'internet. Ideally you'd be able to do this in a 'little internet' which has a VPN to get into it, has its own DNS servers, and maybe ways to 'bend' or alter requests to other cloudy services, such as Google or Amazon such that the app 'thinks' it's talking to the real, live production service, but actually it's talking to a test account or some such. That means I can spin up my client's world in your environment and have it think it was on the internet, but actually not interact with anything real - and I don't need to change every account and password baked into the code and config so I don't do any damage to real data.
  Secondly, just like the backups and drills that most companies don't bother to do, they won't bother to hire a service like this either. You'll probably be able to make a few top-dollar sales to some big shops who already have very good DR procedures, but the little place (or even medium place) probably won't bother.
  One way I could imagine this working would be to gain some sort of certification. Say for example, the fiduciary regulations of Elbonia were changed to say that all app providers must have externally verified DR capability, then your business would fit right in and solve that need - and you'd probably get lots of work, and hopefully lots of repeat work too. Short of regulations though, whatever certification you could come up with on your own wouldn't be worth enough to have people want to pay to get it.
2. Re:DR Testing as a business model by swb · 2017-02-01 03:43 · Score: 1
  
  Secondly, just like the backups and drills that most companies don't bother to do, they won't bother to hire a service like this either.
  
  Maybe, but often the real problem is that they don't have the facilities to do it in. It becomes kind of an existential question they can't answer. I think if you attacked the CIO/CFO with the idea of this service and why your staff can't do it now and what they don't know, you'd get more uptake than you might think. You might even get line staff on board with it, too, since a successful restore or the ability to adjust procedures to get a successful restore might (a) make them sleep better at night and (b) be an ace in their pocket if something does go wrong in an actual disaster -- "we hired the service, and tested the system as completely as possible and it worked. This failure is act-of-god/statistical improbability that you can't blame on us."
  
  Say for example, the fiduciary regulations of Elbonia were changed to say that all app providers must have externally verified DR capability, then your business would fit right in and solve that need
  I'd bet between SOX, HIPAA, partner agreements, insurance, etc, there's already enough soft requirements that you could say "Sure, you're not *mandated* to have more than "just" a DR plan, but if your plan is shit and non-viable your civil liability is limitless. A proven and certified execution of your DR plan is a get out of jail free card if it doesn't work for act-of-god reasons."
  The cloud part is tougher, but to be honest, I don't really know how people protect themselves in those environments, and I'd wager a lot don't besides making redundant data copies and hoping that the cloud has them covered -- which it might, from a lot of physical failures, but I think they impart too much faith in cloud systems from a recovery perspective, but that's almost a different discussion.
3. Re:DR Testing as a business model by _Sharp'r_ · 2017-02-01 04:16 · Score: 1
  
  An expensive way, which is also pretty bulletproof:
  At least two geographically separate production environments, run in each for approximately half the year total, switching periodically which is the target DR setup and which is the Prod environment.
  Then you always know your backups to your DR are working (hint: use snapshotting/versioning as well, to avoid the replicating the disaster issue), because you are periodically forced to actually use it as a real production environment. You know your switchover and switchback processes work and how long they really take, because you routinely follow them. It's not just data. In these days of Internetworking, you need to be sure your IP space, firewall rules, partner's firewall rules, routing, proxies, DDOS, VPNs, etc... will all function properly if you need to fail-over during a disaster. It also helps to go live in an environment with only a set or two of patching/upgrade cycles having passed, rather than hoping years of OS and firmware changes were also properly applied to your backup environment.
  If you've never run in an environment, then you may have some hardware and such, but you don't quite have an actual environment yet.
  
  --
  The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
4. Re:DR Testing as a business model by AK+Marc · 2017-02-01 11:16 · Score: 1
  
  SOX and HIPAA have no requirements around "uptime" other than being able to provide historic records within a matter of weeks, if required. DR and uptime are unrelated to most statutory requirements.
  
  --
  Learn to love Alaska
5. Re:DR Testing as a business model by AK+Marc · 2017-02-01 11:20 · Score: 1
  
  Sounds like someone who hasn't heard of Blue/Green discovering Blue/Green. You don't have "prod" and "dev" but you build in dev (or test, whatever you like to call it), then promote dev to prod, and down-grade prod to dev. A short hold-down to be able to roll back to previous prod, if problems occur, then old prod becomes new dev. Switching between environments happen every release.
  
  But you could do that without any backups or DR.
  
  --
  Learn to love Alaska
6. Re:DR Testing as a business model by _Sharp'r_ · 2017-02-02 10:36 · Score: 1
  
  Sounds like a similar concept in a lot of ways, but some of us have regulatory requirements for separation of duties which (among other things) prohibit the use of production data in test environments.
  
  --
  The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
Solution by aliquis · 2017-02-01 00:27 · Score: 1

Maybe they could have uploaded any new changes to some sort of online repository.
HAMMERFS by Anonymous Coward · 2017-02-01 00:38 · Score: 1

Allows exactly that, as does I am told ZFS.
Assuming you have a large enough disk, and small enough space requirements to leave the room necessary for it, a filesystem solution like either of the above allows exactly this, and barring filesystem corruption due to cornercases or hardware issues, should allow reverting all but the largest PEBKAC issues.
Only perform reversible actions by marko123 · 2017-02-01 00:58 · Score: 1

A lesson always learnt the hard way. Those of us who have learnt it the hard way have known the feeling before: I'll trust that this is correct and the feeling after: Shiat!

--
http://pcblues.com - Digits and Wood
Devops by funkymonkjay · 2017-02-01 01:25 · Score: 1

Clearly a case of too much deving and not enough oping.
I see no mention of alarms. With all those failures there must have been some notification.
Chances are they were lost in the nested folders of outlook with millions of other false alarm alerts.
They must have a mirrored test zone that they can copy from.
Lesson learned. Test your backups, regularly!
And of course by John+Allsup · 2017-02-01 01:48 · Score: 1

And of course everybody here knows _never_ to _rely_ upon cloud storage. Use it, by all means, but plan as if the cloud storage facility could have a meltdown at any moment. Gitlab users should just push their project to a different git server. There is also something to be said for having git server projects mirrored, e.g. a master on github and a second on gitlab, so that, in the event of one cloud service failing, you have a hot spare.
What is frustrating is that, given all the progress in hardware reliability since when I grew up, people take reliability for granted, whereas way-back-when, people who did that learned pretty f'ing quickly that stuff can and does go wrong.

--
John_Chalisque
Go ahead...yawn, but by Provocateur · 2017-02-01 01:51 · Score: 1

That 4.5 GB of data, happens to hold the answer! To life, the universe, and EVERYTHING!! Mankind is fortunate that the weary sysadmin was able to abort the procedure before it completely wiped the slate clean!
So don't blame the guy, praise him and thank him for saving us all!

--
WARNING: Smartphones have side effects--most of them undocumented.
Six hours of loss is a "melt-down"? by thesandbender · 2017-02-01 01:53 · Score: 4, Insightful

Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".
1. Re:Six hours of loss is a "melt-down"? by sysrammer · 2017-02-01 07:21 · Score: 2
  
  I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.
  Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" goes gone.
  But sometimes it's the entire financial transaction or contractual history. Or the finished version of your novel. Or the priceless-to-you pictures of your baby/SO/parent/nana.
  Pro sysadmins pretty much find that a "no data loss mindset" is career enhancing.
  
  --
  His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain
Um, to clarify: by rickb928 · 2017-02-01 02:06 · Score: 1

"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
Or, more accurately, less than 5 backup/replication techniques were deployed.
I've seen this before. The backup strategy you didn't deploy didn't fail. It never existed except in documentation. And your unwarranted trust.
I do not miss sysadmin work so much.

--
deleting the extra space after periods so i can stay relevant, yeah.
so... by JustNiz · 2017-02-01 02:17 · Score: 1

What kind of moron just deploys a new backup strategy then just sits back and trusts their entire infrastructure to it, without ever having actually performed a test recovery?
Tickets/issues/tasks in repository by Lennie · 2017-02-01 02:21 · Score: 1

This is why I think it would be good to keep the tickets in the repository or a second repository. Easy to replicate, easy to keep history, easy to backup.

--
New things are always on the horizon
Re:The missing disclaimer to this article by rodia · 2017-02-01 03:03 · Score: 1

Sourceforge is still a thing..?
That made me so insecure that I actually had to check.. :O)
Yes, they are still there.
Oh Gitlab... by SpencerWilliams · 2017-02-01 03:05 · Score: 1

I love you, but when will you stop sucking?
Same story, different day by darkain · 2017-02-01 04:47 · Score: 1

I preach the same thing every time.
ZFS snapshots.
ZFS Send/Recv to other data centers.
Is it really that hard? That is literally all you have to do. Delete a folder? Copy it from snapshot. Things are more fucked than that? Revert to snapshot. Entire server is nuked? You have 100% replication off-site with snapshotting intact. Don't know how to set it all up? Install FreeNAS and use the built in web UI for it. No longer are any other excuses viable.
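For anyone who has not seen it, a snapshot-and-replicate cycle might look roughly like the sketch below. The function only prints the commands so they can be reviewed before anything runs. `zfs_backup_plan`, the dataset names, the host and the snapshot names are all placeholders, and a real setup would also need snapshot pruning and error handling.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the commands for one snapshot + replication cycle.
# "zfs_backup_plan" and every name passed to it are hypothetical.
# Arguments: local dataset, remote host, remote dataset,
#            previous snapshot name (empty on the first run), new snapshot name.
zfs_backup_plan() {
    local dataset=$1 remote=$2 remote_dataset=$3 prev=$4 now=$5
    # Step 1: take a point-in-time snapshot of the dataset.
    echo "zfs snapshot ${dataset}@${now}"
    if [ -n "$prev" ]; then
        # Step 2a: incremental stream, only the changes since $prev.
        echo "zfs send -i ${dataset}@${prev} ${dataset}@${now} | ssh ${remote} zfs recv ${remote_dataset}"
    else
        # Step 2b: first run, send the whole snapshot.
        echo "zfs send ${dataset}@${now} | ssh ${remote} zfs recv ${remote_dataset}"
    fi
}
```

Calling `zfs_backup_plan tank/db backuphost backup/db hourly-01 hourly-02` prints the snapshot command followed by the incremental send. A deleted folder can then normally be copied back out of the dataset's `.zfs/snapshot/` directory, and `zfs rollback` covers the "revert to snapshot" case.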
check your precision! by gosand · 2017-02-01 06:03 · Score: 1

To be more precise... test your RESTORE PROCESS.
It is important to not only know that your backups are good, but that your process of restoring them is sound and that you have at least tried it.
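As a minimal sketch of what "trying it" can mean for plain files: restore the archive into a scratch directory and compare the result with the source. `verify_restore` is a made-up name, and the sketch assumes the archive was created with `tar -czf ARCHIVE -C SOURCE_DIR .`. For a live database the comparison step would not apply as written; there you would more likely restore into a scratch instance and run sanity queries.

```shell
#!/usr/bin/env bash
# Sketch: prove that an archive can actually be restored.
# "verify_restore" is a hypothetical helper; it assumes the archive was
# made with: tar -czf ARCHIVE -C SOURCE_DIR .
verify_restore() {
    local archive=$1 source_dir=$2 scratch status
    scratch=$(mktemp -d) || return 1
    # Step 1: the restore itself has to succeed.
    if ! tar -xzf "$archive" -C "$scratch"; then
        echo "RESTORE FAILED: cannot extract $archive" >&2
        rm -rf -- "$scratch"
        return 1
    fi
    # Step 2: the restored tree has to match the source.
    if diff -r "$source_dir" "$scratch" >/dev/null; then
        echo "restore OK: $archive matches $source_dir"
        status=0
    else
        echo "RESTORE MISMATCH: $archive differs from $source_dir" >&2
        status=1
    fi
    rm -rf -- "$scratch"
    return "$status"
}
```

Run from cron with an alert on a non-zero exit, a check like this would flag an empty or unreadable archive on the day it is written rather than on the day it is needed.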

--

My beliefs do not require that you agree with them.
Re:did they even try a recovery? by ChrisMaple · 2017-02-01 06:51 · Score: 1

That was my first thought. Recovery isn't guaranteed, and the process might be labor intensive. They'd have to notify their customers that their data might be corrupted.

--
Contribute to civilization: ari.aynrand.org/donate
ouch by D,Petkow · 2017-02-01 07:45 · Score: 1

a fellow admin once wiped a live prod box serving lots of customers - but it was load balanced and we managed to rsync it from the other host but i still remember his face
If you are not testing restores... by DdJ · 2017-02-01 09:23 · Score: 1

...then you are not performing backups.
COW filesystem by DrYak · 2017-02-02 09:05 · Score: 1

A much simpler way to do it, that won't require you to hack standard system command-line tools,
would be to use some copy-on-write or log-structured file system (e.g.: BTRFS, ZFS, etc. depending on your taste),
and use snapshots to keep older versions of your file tree.
If anything goes wrong you can still recover from a previous snapshot.
Some Linux distributions (like: opensuse) have tools (like snapper) that can automate this task for you (and opensuse uses snapper to similarly snapshot system upgrades).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]