Slashdot Mirror


GitLab.com Melts Down After Wrong Directory Deleted, Backups Fail (theregister.co.uk)

An anonymous reader quotes a report from The Register: Source-code hub Gitlab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets, starting with "We are performing emergency database maintenance, GitLab.com will be taken offline" and ending with "We accidentally deleted production data and might have to restore from backup. Google Doc with live notes [link]." Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated. Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand. That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)." So some solace there for users because not all is lost. But the document concludes with the following: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place." At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be "without webhooks" but is "the only available snapshot." That source is six hours old, so there will be some data loss.

60 of 356 comments

  1. Yawn... by Anonymous Coward · · Score: 5, Insightful

    No backups, untested backups, overwriting backups, etc. You'd think a code repository would understand this shit.

    This has been going on since the dawn of computing and it seems there's no end in sight.

     

    1. Re: Yawn... by Nutria · · Score: 5, Funny

      paki chimps in jungle

      Someone failed geography class...

      --
      "I don't know, therefore Aliens" Wafflebox1
    2. Re: Yawn... by Anonymous Coward · · Score: 3, Insightful

      No no, he's being """ironic""" and """trolling""" you. He isn't actually a stupid racist.

      He's just a racist.

    3. Re:Yawn... by Anonymous Coward · · Score: 2
    4. Re:Yawn... by Big+Hairy+Ian · · Score: 2, Insightful

      Clearly their DR Plan didn't get any form of QA. It's no good having five forms of backup/replication if none of them work!

      --

      Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

    5. Re:Yawn... by Anonymous Coward · · Score: 2, Insightful

      And will continue to happen because experienced admins are expensive. Cheaper to hire the new grad who knows the buzzwords than the experienced admin who has lived through a couple of catastrophes and now knows how to plan for them.

    6. Re: Yawn... by Qzukk · · Score: 3, Interesting

      There are two levels of redundancy. There's "oh my god the database server is on fire! Promote the replicated server to master and failover!" which, depending on the database, should take a few seconds to perform manually. Testing automation for this (pull the plug and see what happens) depends on your setup and on how long it takes your heartbeat to decide that the server is dead. (If we shot servers in the head every time we got a DDoS, we'd burn through servers in a few seconds; it takes more than one failed connection for automation to decide the server is down.)
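The "more than one failed connection" rule can be sketched as a tiny shell function. This is only an illustration: the name is_down, the argument order and the probe command are assumptions, not any particular heartbeat tool.

```shell
# Hypothetical health check: report "down" only after N consecutive failed probes.
# Usage: is_down COUNT PAUSE_SECONDS PROBE_COMMAND [ARGS...]
is_down() {
    needed=$1
    pause=$2
    shift 2
    failures=0
    while [ "$failures" -lt "$needed" ]; do
        if "$@"; then
            return 1                      # one successful probe: host is not down
        fi
        failures=$((failures + 1))
        if [ "$failures" -lt "$needed" ]; then
            sleep "$pause"                # wait before probing again
        fi
    done
    return 0                              # every probe failed
}
```

Something like `is_down 3 5 some_probe_command` would only trigger a failover after three failures spaced five seconds apart, so a single dropped connection does not shoot a healthy server.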

      Then, there's "oh my god the datacenter is on fire!". This is what people usually call "Disaster Recovery". One dead server isn't a disaster when you have failovers, but when your entire datacenter is dead, THAT's a disaster. It's tough as nails to automate too, since without having at least three datacenters, it's inherently a split-brain issue. If Datacenter A stops responding to Datacenter B, which one is actually down? If you aren't an AS and can't just republish your IPs at Datacenter B with a BGP routing change, that means you're going to have to publish new DNS records and wait one TTL for everyone to see them. If you had an authoritative DNS server at Datacenter A, then hopefully it was able to recognize that it's down and shot itself (or at least updated its zone files with B's IPs) or you can somehow get to it and kill it, otherwise when Datacenter A comes back online, it'll be serving up A's IPs again and conflict with the other DNS server. This also is setting aside replicating your data between datacenters and how much of that is lost when you switch back and forth.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
  2. I feel that lone sysadmin's pain by sixdrum · · Score: 5, Insightful

    A few years back, I caught and stopped a fellow sysadmin's rm -rf on /home on our home directory server. He had typo'd while cleaning up some old home directories, i.e.:

    rm -rf /home/user1 /home/user2 /home/ user3

    Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

    1. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 5, Insightful

      That's why you always always run ls first.

      ls -ld /home/user1 /home/user2 /home/ user3

      Then edit the command to rm. Always.
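For what it's worth, the ls-then-rm habit can be wrapped in a small shell function. This is only a sketch: the name preview_rm, the --yes flag and the list of protected paths are all made up for illustration.

```shell
# Hypothetical helper: list the targets first, refuse obviously dangerous ones,
# and only delete when called with --yes.
preview_rm() {
    confirm=no
    if [ "$1" = "--yes" ]; then
        confirm=yes
        shift
    fi
    for target in "$@"; do
        clean=${target%/}                 # strip one trailing slash: "/home/" -> "/home"
        case "$clean" in
            ""|/home|/etc|/var|/usr)      # "" is what "/" becomes after stripping
                echo "refusing to remove protected path: $target" >&2
                return 1
                ;;
        esac
    done
    ls -ld -- "$@" || return 1            # a missing target aborts before any delete
    if [ "$confirm" = yes ]; then
        rm -rf -- "$@"
    else
        echo "dry run only; re-run with --yes to delete" >&2
    fi
}
```

With the typo'd argument list from the parent (`/home/ user3`), the function stops at the protected-path check and nothing is deleted.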

    2. Re:I feel that lone sysadmin's pain by mmell · · Score: 5, Interesting

      Sadly, I remember personally making a similar mistake about a decade ago. Upgrading SAN hardware, preparing the old hardware for decommissioning (deleting data prior to sending the units to vendor). Even with offsite data replication, I survived several uncomfortable days and never did fully live down my error. Could've been worse - I thought I had a career change opportunity on my hands. My only saving grace was that I was acting under direction from vendor tech support when the error occurred (although it was still my fingers on the keyboard).

    3. Re:I feel that lone sysadmin's pain by arglebargle_xiv · · Score: 4, Funny

      Backups were ineffective. 30% of our users lost their home directories permanently. He never lived it down. Check your backups!

      Actually: Check your privilege! (Especially if rm -rf is involved).

    4. Re:I feel that lone sysadmin's pain by Opportunist · · Score: 2

      That's fine until he decides that typing "rm user1 user2 user3 user4..." is too much of a hassle and he replaces it with a script that lists the directories and removes them all. ...blissfully forgetting that there is a ".." directory. Oh .., how many well intended scripts have thee turned into the spawn of hell...

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    5. Re:I feel that lone sysadmin's pain by AmiMoJo · · Score: 4, Interesting

      mkdir ./trash
      mv file_to_delete ./trash

      If it's still working next month you can empty trash, but just leaving it there forever is a valid option too. In a production environment, storage is too cheap to warrant deleting anything.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    6. Re:I feel that lone sysadmin's pain by stridebird · · Score: 5, Informative

      Correct pattern is:
      > cd /home && rm ...

      ie don't run rm unless cd worked.
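In a script, the same idea might look like the sketch below. The function name purge_contents is an assumption; the point is only that rm never runs unless the cd succeeded, and that an empty argument aborts instead of expanding to a bare path.

```shell
# Hypothetical helper: empty a directory, but only if we really got into it.
purge_contents() {
    dir=${1:?usage: purge_contents DIRECTORY}   # abort on a missing or empty argument
    (
        cd -- "$dir" || exit 1        # a failed cd stops here, before any rm runs
        # Paths are relative to the verified directory. The second glob matches
        # dotfiles but can never match "." or "..".
        rm -rf -- ./* ./.[!.]*
    )
}
```

The subshell keeps the caller's working directory unchanged, so a failure cannot leave the rest of the script sitting in the wrong place.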

    7. Re: I feel that lone sysadmin's pain by saloomy · · Score: 2, Informative

      I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!
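The cron half of a setup like this could look roughly like the sketch below. The /trash default and the 48-hour limit come from the description above; the TRASH_DIR variable and the function name are assumptions, and -mindepth, -maxdepth and -mmin are find extensions available in GNU and BSD find rather than strict POSIX.

```shell
# Hypothetical cron helper: delete top-level trash entries older than a limit.
purge_trash() {
    trash=${TRASH_DIR:-/trash}
    max_age_min=${1:-2880}            # 48 hours, in minutes
    [ -d "$trash" ] || return 1
    # Only look at direct children of the trash directory, never the directory itself.
    find "$trash" -mindepth 1 -maxdepth 1 -mmin "+$max_age_min" -exec rm -rf -- {} +
}
```

One caveat: mv keeps a file's original modification time, so an old file trashed today would look "older than 48 hours" straight away. The moving script would need to touch each entry, or file things under a freshly created per-run directory, for the age test to mean "time spent in the trash".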

    8. Re:I feel that lone sysadmin's pain by Anonymous Coward · · Score: 2, Insightful

      Do you prefer your kitchen knives un-sharpened because then you're less likely to cut yourself?

    9. Re:I feel that lone sysadmin's pain by Angstroem · · Score: 3, Insightful

      The command-line wizards like to mock the GUI crowd, but I've never seen anyone make this kind of blunder with a GUI admin tool. :-P

      Then you have never worked on a repository with users of TortoiseSVN and the likes.

      "Hey, my commit didn't get through because of some funky error I didn't care about. But if I flip this 'force' switch, then everything always goes smoothly."

    10. Re: I feel that lone sysadmin's pain by gsslay · · Score: 5, Insightful

      This seems like a good idea, but it gets you into the habit of thinking that "rm" is a safe command that you can easily recover from. Then one day you use it on a server where you have forgotten to, or haven't yet, done your "sweet script" trick. Or worse; on someone else's server.

      Far better to treat the command "rm" with the full respect it deserves at all times and never assume it does anything but wipe data. Call your little script something like rm2 instead and get into the habit of always using that. That way the worst thing that can happen when it doesn't exist is "command not found".
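A separately named command along those lines might be sketched like this. The TRASH_DIR variable and its $HOME/.trash default are assumptions for illustration; the real rm is left completely alone.

```shell
# Hypothetical "rm2": move targets into a trash directory instead of deleting them.
rm2() {
    [ "$#" -gt 0 ] || { echo "usage: rm2 FILE..." >&2; return 1; }
    trash=${TRASH_DIR:-$HOME/.trash}
    mkdir -p "$trash" || return 1
    # One fresh subdirectory per call, so files with the same name trashed in
    # separate calls do not overwrite each other.
    batch=$(mktemp -d "$trash/$(date +%Y%m%d-%H%M%S).XXXXXX") || return 1
    mv -- "$@" "$batch"/
}
```

On a machine where the function was never installed, typing rm2 fails with "command not found" instead of silently deleting anything. Two same-named files passed in a single call would still collide inside the batch directory, so this is not a complete answer to name clashes.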

    11. Re: I feel that lone sysadmin's pain by rgbatduke · · Score: 5, Interesting

      Having used the "sweet rm" trick back in the 80's somewhere (with much more limited space, and a cron FIFO groomer) it also doesn't protect you from a wide variety of file corruption issues and overwrites. Remove a file, recreate it, remove it again? Delete two files from different parts of your tree -- e.g. README -- that have the same name? Original file gone (unless you don't just alias rm, you write a very complicated script). If you run out of space and have an alias/script like "flush" to take out the trash and make room for more, it just moves the problem one notch downstream.

      With that said, it did save my ass a few times. Then I learned personal discipline, started using version control (SCCS at the time, IIRC) onto a reliable server to not just back up any files of any importance I create but to save reversible strings of revisions back to the Egg, and stopped using my reversible rm altogether after one or two of the disasters it still leaves open.

      Moral: Version control with frequent checkins usually leaves your working image itself on your working machine. Keeping the repository on a different machine is already one level of redundancy. Keeping it on a server class machine in a tier 1 or tier 2 facility with reliable, regular backups and RAIDed disk is suddenly very, very, very reliable. As the current incident shows, not perfectly reliable. Human error, multiple disk failures in an array, nuclear war, internal malice or incompetence or just plain accident can still cause data loss, but in this case what is being reported isn't disaster -- they had 6 hour backups! Even though I'm sure there will be some folks who are inconvenienced, MOST of the users will still have usable, current working copies and be out anywhere from zero to a few hours of work. I've been on both sides of the sysadmin aisle in data loss server crashes, and -- they happen. Wise users use a belt AND suspenders to the extent possible lest they find their pants gathered around their ankles one day...

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    12. Re: I feel that lone sysadmin's pain by Joce640k · · Score: 5, Insightful

      I usually have a /trash directory in my Linux servers, I have moved the rm command to "removed" and wrote a sweet script named rm which moves files/folders to /trash. Then a cron job "removes" files and folders from trash after 48 hours. Works awesome unless I'm space-bound, and I usually am not. Saved my ass more than once!

      This sounds clever but it's a facepalming fail on so many levels. Modifying the system is ALWAYS a bad idea. Shame on anybody who upvoted it.

      If that's your intention then why not learn to type "mv" instead of "rm"? This way you're not depending on using a hacked system (or not) and you'll be safe anywhere.

      --
      No sig today...
    13. Re:I feel that lone sysadmin's pain by Joce640k · · Score: 2

      Exactly right.

      Exactly wrong.

      Learn to use "mv" instead of "rm -rf".

      eg. Create a folder called /trash and move the files there.

      When you see the system is still working and you need some disk space then you can empty the trash. Not before.

      --
      No sig today...
    14. Re: I feel that lone sysadmin's pain by RabidReindeer · · Score: 2

      There are many tricks. Personally, I like to tar stuff or do a ZIP-with-delete and keep it for a day or 3 before removal. For large quantities of data, that can take a while, though, so another possibility if one is working with snapshot-capable storage management is to snapshot it and work "offline" on the snapshot. I do this on VM images, for example.

      Hot mirrors updated just infrequently enough that you can break the link before the damage propagates isn't a bad idea, either. Filesystems with "time machine" rollback capabilities, too.

      You can pretty well bet that a backup is never as usable as it's supposed to be. It's going to be outdated, corrupted, or something critical won't be in it.

      To be really viable, you need to devote the same level of attention to backup/restore (accent on the restore) as you do on security management. There is a very strong case for keeping an entire server or set of servers to run frequent checks on backups, including bare-metal restores, and these days, a spare computer or 6 is not a bank-breaking investment any more if your data means anything to you. I also like to employ multiple types of backup media on the premise that not all types of hardware will be affected equally by most failures.

      At least apparently they only lost a few hours worth of work, and although potentially a large amount of data, it's a job that is inherently distributed among many disgruntled clients. Other organizations haven't been so fortunate.

    15. Re:I feel that lone sysadmin's pain by jabuzz · · Score: 2

      Oh I wish that were really the case. Unfortunately, where a single run of a job on an HPC facility can produce 1TB of files, that is not actually the case in the real world for everyone.

    16. Re: I feel that lone sysadmin's pain by PincushionMan · · Score: 2

      You might want to look into Squashfs. The archive command for a single directory (or file) is:

      mksquashfs source_dir target_image.sqfs

      If you want to do multiple directories or files, no problem:

      mksquashfs source_dir1 source_dir2 source_file1 source_file2 target_image.sqfs

      Squashfs generation is comparable to that of tar.gz files. Not only does it do gzip compression natively, it can compress the inodes in the directory tree and also do fs level de-duplication. Squashfs is compatible with any kernel from 2009+ (maybe before), and newer kernels also have the ability to use lzo and xz compressors. It's intended to be used anywhere that you would use tar.gz or cpio, with the added benefit that you can mount it loopback and extract a file that you need without the overhead of sequentially scanning through the tape archive. I've heard the windows version of 7zip can access a squashfs archive as well (as of 16.04 it must be a gzip compressed sqfs image). Squashfs natively detects sparse files - unless you tell it not to.

      The only thing I'm not sure about is how well unsquashfs handles the extraction of sparse files. Linux tar is totally unsuitable when dealing with sparse files, as it requires the full amount of space to extract a sparse file. For Linux tar, there's a workaround for sparse files, and that is to install BSD tar, which seems to extract sparse files correctly.

    17. Re: I feel that lone sysadmin's pain by mrchaotica · · Score: 2

      Call your little script something like rm2 instead

      Or better yet, something that doesn't even have the string "rm" in it, like trash.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    18. Re:I feel that lone sysadmin's pain by TheRaven64 · · Score: 2

      I have. It's just as easy to accidentally click on the wrong folder, or delete the foo folder from the window showing bar instead of the window showing baz. This is why good UIs are all about making sure that there's an undo button that works after you've done the stupid thing, not about trying to make the stupid thing impossible. Most GUI systems will move things to the trash, rather than deleting. The problem is that users then get into the habit of reflexively emptying the trash immediately after a delete. You really want a filesystem design that adds blocks from deleted files to the end of a reuse list, so that new file allocation will overwrite the oldest deleted data by default and you can always undelete recently deleted things if you haven't written significant amounts of data in between the delete and the 'oh crap' moment.

      --
      I am TheRaven on Soylent News
  3. Repeat after me (and others) by Nkwe · · Score: 5, Interesting

    If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

    1. Re:Repeat after me (and others) by dbIII · · Score: 2, Informative

      Good advice but it's a misleading headline above. It appears their real backup exists and is six hours old, so annoying but not catastrophic.
      It is a good example that replication is not a backup and is often a way to just mirror mistakes.

    2. Re:Repeat after me (and others) by MatthiasF · · Score: 5, Informative

      Uh, did you read the article?

      The six hours old snapshot was a fluke manual LVM snapshot run, normally they are 24 hours. The SQL_dumps weren't running at all because of mis-configuration, producing tiny little files and failing silently. Webhooks will need to be rolled back to the 24 hour backup since they were removed in the 6 hour one because of a synchronization process (meaning at best 18 hours of updates will have no webhooks but possibly all 24 hours at worst). Lastly, their replication of their backups from Microsoft's Azure to Amazon's S3 for what I assume is vendor agnostic redundancy has sent no files at all ("the bucket is empty").

      It's like they thought out everything but never made sure any of it was working.
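A dump that silently produces tiny files is the kind of thing a very small check could catch. Here is a rough sketch; the function name and the thresholds are assumptions, and -mmin is a GNU/BSD find extension rather than strict POSIX.

```shell
# Hypothetical check: a backup file must exist, be at least MIN_BYTES in size,
# and have been modified within the last MAX_AGE_MIN minutes.
# Usage: check_backup FILE MIN_BYTES MAX_AGE_MIN
check_backup() {
    file=$1
    min_bytes=$2
    max_age_min=$3
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
    size=$(( $(wc -c < "$file") ))    # arithmetic strips any padding from wc
    [ "$size" -ge "$min_bytes" ] || { echo "too small ($size bytes): $file" >&2; return 1; }
    # find prints the name only if the file is newer than the age limit.
    [ -n "$(find "$file" -mmin "-$max_age_min")" ] || { echo "stale: $file" >&2; return 1; }
}
```

Run from cron right after the dump job, a non-zero exit here is something to alert on. It only proves the file is plausible, not that it restores, so it complements a real restore test and does not replace one.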

    3. Re:Repeat after me (and others) by dbIII · · Score: 5, Insightful

      Uh, did you read the article?

      No, and I got the wrong impression from skimming the article.
      You are correct and I am not.

    4. Re:Repeat after me (and others) by Opportunist · · Score: 5, Funny

      Sing with me, kids:

      One backup in my bunk
      One backup in my trunk
      One backup at the town's other end
      One backup on another continent

      All of them tested and verified sane
      now go to bed, you can sleep once again

      --
      We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
    5. Re:Repeat after me (and others) by tonymercmobily · · Score: 5, Insightful

      "If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup."
      OK, now that I have repeated it, let me add.

      As a CEO, and a CTO, you MUST test backups and resiliency by artificially creating downtime and real-life events. You switch off the main server. Or instruct the hosting company to reboot the main server, unplug the main hard drive, and plug it back in. Then you sit up, and watch with great interest what happens.

      THEN you will see, for real, how your company reacts to real disasters.

      The difference is that if anything _really_ wrong happens, you can turn the hard drive back on and fire a few people.

      Smart companies do it. For a reason. Yes it creates downtime. But yes it can save your company.

      http://www.datacenterknowledge...

      Merc.

    6. Re:Repeat after me (and others) by Daimanta · · Score: 2

      RAID fails because hard disks (probably the same type and batch) running together wear at the same rate, so the matching disks do not fail independently of each other. Their failure correlation is therefore quite high. This explains why rebuilding a RAID array after a failure can be a very dangerous operation and could easily lead to total failure. Usually, doing (incremental) backups is the safer option when a single disk fails, as that is not nearly as invasive as a complete RAID rebuild.

      --
      Knowledge is power. Knowledge shared is power lost.
    7. Re:Repeat after me (and others) by Applehu+Akbar · · Score: 2

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      Especially now that ransomware is overwriting online backups.

    8. Re:Repeat after me (and others) by MatiasKiviniemi · · Score: 2

      The internetz council convened and decided we will none of that "admit my mistakes"-bullshit here. Please hand in your card and exit the premises immediately.

    9. Re:Repeat after me (and others) by cdrudge · · Score: 4, Insightful

      "they are ISO-over-9000 and that is good enough for us"

      Distilled down, all that ISO-around-9000 says is that "we say what we do and do what we say" when it comes to business processes. It's perfectly acceptable from an ISO-around-9000 standpoint to have a disaster recovery process that reads like the below as long as that is really what they do:

      1. Perform backup
      2. Pray nothing goes wrong.

      Now hopefully they have something a lot more than that. But if they don't test the backups, if they don't hold an "IT fire drill" to practice what to do when the feces hits the fan, if they don't have disaster recovery backup servers and snapshots and whatever else they should have, then they have still completely documented their process and follow it like the standards require.

    10. Re:Repeat after me (and others) by Zontar+The+Mindless · · Score: 3, Funny

      Q: You can never have too much money, too much sex, or ___ ____ ______. (Fill in the blanks.)

      (A: "Too many backups".)

      --Actual question from the final exam for the Networking 100 class I took in 1998.

      --
      Il n'y a pas de Planet B.
    11. Re:Repeat after me (and others) by tomhath · · Score: 2

      If you have not successfully tested a restore and you do not have a completely offline copy, you do not have a backup.

      And don't trust someone else who says they made and tested the backup. Our DBAs had proof that the sysadmins told them the disk backups worked. But the DBAs never did a practice restore of their own. You can guess what happened when a failed update trashed the database.

    12. Re:Repeat after me (and others) by TheRaven64 · · Score: 3, Informative

      Please mod the parent up. After the uptick in trolling and invective in the last couple of months, this post is a breath of fresh air around here.

      --
      I am TheRaven on Soylent News
    13. Re:Repeat after me (and others) by Notabadguy · · Score: 3, Funny

      Look at his user ID. Give him time, he'll come around.

  4. Don't use rm! by subk · · Score: 2

    Use mv! Also.. What's with the need to tweet? Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

    --
    Now, if you'll excuse me, I have backups to corrupt.
    1. Re:Don't use rm! by infolation · · Score: 5, Funny

      Don't tell the customer anything until the dust settles! Geez... What's with these amateurs?

      Don't tell the customer anything!! Geez... What's with these semi-pros?

    2. Re:Don't use rm! by Darinbob · · Score: 3, Interesting

      Boring job, doesn't pay as much as others. Everyone wants to be the rockstar since that's who the recruiters look for, nobody wants to be the janitor that cleans up after the concert. Turn that into a startup and seriously, no one at a startup wants to be the grunt, and (almost) no one at a startup has an ounce of experience with real world issues.

      This is why sysadmins were created, because the people actually using the computers didn't want to manage them.

    3. Re:Don't use rm! by sodul · · Score: 3, Insightful

      Nowadays, since nobody wants to do sysadmin work and since most startups and companies feel that a pure sysadmin job is a waste of money, they slap 'must code shell and chef' on top, call it DevOps, but then treat them just as badly as before. The 'DevOps' term is just as misused as 'Agile' nowadays. What I have seen in practice is DevOps are Ops that Develop scripts, or worse a DevOps team/role between Devs and Ops ... and a new silo is created instead of walls broken. Most Agile shops are actually chaos driven with anything goes since Sales promised a feature to a prospect customer yesterday, every week.

  5. Test your backups! by djinn6 · · Score: 2

    Two things:
    1. Test your backups
    2. TEST your BACKUPS!
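A bare-bones version of such a test, for a plain tar backup, might look like the sketch below. The function name is made up, and it assumes the archive was created with something like `tar -cf ARCHIVE -C PARENT_DIR DIRNAME`, so that it unpacks into a single top-level directory.

```shell
# Hypothetical restore drill: unpack ARCHIVE into a scratch directory and
# compare the result with SOURCE_DIR. Returns 0 only if both steps succeed.
restore_drill() {
    source_dir=$1
    archive=$2
    scratch=$(mktemp -d) || return 1
    if tar -xf "$archive" -C "$scratch" &&
       diff -r "$source_dir" "$scratch/$(basename "$source_dir")" >/dev/null; then
        result=0
    else
        result=1
    fi
    rm -rf "$scratch"                 # the drill never touches the source itself
    return "$result"
}
```

Comparing against live data only works if nothing changed since the backup ran; for a busy system, comparing against a snapshot or a stored checksum list would be the more realistic choice.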

    1. Re:Test your backups! by Anonymous Coward · · Score: 2, Funny

      but NOT on your production hardware running live services.

      me thinks gitlab should have browsed their hosted repos for some backup software.

  6. At least it wasn't github.com by jtara · · Score: 2

    At least it wasn't github.com.

    So, it didn't break the Internet.

    And practically everything else.

  7. Made this mistake once... by daid303 · · Score: 2

    I've made this mistake, deleted all attachments on a live system once.

    After this, I made all the prompts for critical servers a different color:
    export PS1='\[\e[41m\]\u@\h:\w\$\[\e[49m\]'
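To avoid setting this by hand on every box, the prompt could be chosen from the host name in a shared profile. A minimal sketch for bash follows; the function name and the "prod"/"db" patterns are assumptions, so adapt them to your own naming scheme.

```shell
# Hypothetical helper: print a red-background prompt string for critical hosts
# and a plain one for everything else.
prompt_for_host() {
    case "$1" in
        *prod*|db*) printf '%s' '\[\e[41m\]\u@\h:\w\$\[\e[0m\] ' ;;
        *)          printf '%s' '\u@\h:\w\$ ' ;;
    esac
}

# uname -n prints the host name of the current machine.
PS1=$(prompt_for_host "$(uname -n)")
```

The `\[` and `\]` markers tell bash that the colour codes take up no space on screen, which keeps long command lines from wrapping oddly.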

    1. Re:Made this mistake once... by serviscope_minor · · Score: 4, Funny

      Good choice. But, I always use this prompt:

      PS1='C:$(echo ${PWD//\//\\\} | tr "[:lower:]" "[:upper:]" | sed -e"s/\\([^\\]\\{6\\}\\)[^\\]\\{2,\\}/\\1~1/g" ) >'

      --
      SJW n. One who posts facts.
  8. How can this keep happening? by Anonymous Coward · · Score: 3, Interesting

    I'm not a fan of git, I'm not happy when I'm forced to use it and I don't understand how it works, not really. But remember how KDE deleted all their projects, everywhere, globally, except for a powered-down virtual machine?

      http://jefferai.org/2013/03/29/distillation/

    When I remember that, and I read this story, I can't understand why people use something that is so sensitive to mistakes. It's like giving everybody root on every machine, which is running DOS in real mode. Somebody please explain it to me.

    1. Re:How can this keep happening? by Entrope · · Score: 3, Informative

      KDE's problems were not due to Git. They were due to a corrupt filesystem, a home-brew mirroring setup, and overworked admins.

      If you're going to troll-ol-ol a blame vector for that, at least be remotely fair and blame Linux (or whatever OS their master server was running), open source, and the associated culture.

  9. If only there was another copy of the repo by HxBro · · Score: 5, Funny

    Just imagine if git had some other magical copy of the repo somewhere, maybe even on the local machine you develop on, now that would save your data in a case like this

    1. Re:If only there was another copy of the repo by gweihir · · Score: 2

      Just imagine if you had actually read the story. The git-repos are not affected.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  10. Unfortunately common by CustomSolvers2 · · Score: 4, Interesting

    I see another problem on top of failing backups (really?) and a tired system admin deleting the wrong files (not precisely ideal, but within the kind of errors which should be expected): allowing these files to be deleted at all.

    If your whole business is about dealing with the data which a large number of users generate at any point, you should (after having made completely sure that your backup system is rock solid) restrict access to such valuable information as much as possible; not just to avoid unintended deletions, but also to account for other potential problems (e.g., privacy protection). There are many ways to do so, even after having developed the whole system; for example, giving read-only access by default and granting more only where strictly required, as with high-level admin personnel (who can use these credentials only after passing through a further validation step) or automated applications (whose credentials are regularly regenerated and known to nobody).

    These problems usually arise when a system is developed or run without putting the whole focus on technical aspects and on what is best for it from a technical perspective. They shouldn't exist at all when everything is done properly at each stage, from development to deployment, administration, general policies, etc.

    --
    Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
  11. Or you use scripts by perpenso · · Score: 2

    That's why you always always run ls first.

    ls -ld /home/user1 /home/user2 /home/ user3

    Then edit the command to rm. Always.

    Or you use scripts.

    somescript user1 user2 user3
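Such a script could take bare user names and build the paths itself, so a stray space can never turn into a top-level directory. This is a rough sketch only: the function name, the HOME_ROOT variable and the name rules are assumptions.

```shell
# Hypothetical "somescript": remove home directories by user name, never by path.
remove_homes() {
    root=${HOME_ROOT:-/home}
    for user in "$@"; do
        case "$user" in
            ""|.*|*/*|*[!A-Za-z0-9._-]*)
                # Empty names, dot names, anything with a slash or odd characters.
                echo "skipping suspicious name: '$user'" >&2
                continue
                ;;
        esac
        if [ -d "$root/$user" ]; then
            rm -rf -- "${root:?}/$user"
        else
            echo "no such home directory: $root/$user" >&2
        fi
    done
}
```

Called as `remove_homes user1 user2 user3`, the worst a typo can do is name a user that does not exist, which only produces a warning.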

  12. All my sympathy... by Gumbercules!! · · Score: 4, Insightful

    I don't care if this is a mistake and screw up of their own making (and it is, on every level) - if you've ever worked as sysadmin you have got to feel for these guys.

    1. Re:All my sympathy... by malkavian · · Score: 2

      Definitely feel for 'em.. And really feel for the guy who was on the keyboard..

  13. And that is why you run BCM and recovery tests by gweihir · · Score: 3, Interesting

    Something like this is going to happen sooner or later. It cannot really be avoided. BCM and recovery tests are the only way to be sure your replication/journaling/etc. works and your backups can be restored.

    Of course in this age of incompetent bean-counters, these are often skipped, because "everything works" and these tests do involve downtime.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  14. Six hours of loss is a "melt-down"? by thesandbender · · Score: 4, Insightful

    Editors. I understand that any loss is bad but holy hyperbole batman... the title reads like a nuke was dropped on Gitlab's datacenters. I had to read halfway through the post to see they lost six (6!) hours of data. Again, really bad, but just losing six hours of data would be a case study in success for a lot of companies and definitely not a "melt-down".

    1. Re:Six hours of loss is a "melt-down"? by sysrammer · · Score: 2

      I see your point but I'd guess you are not a professional sysadmin. TFA should have been prefaced "For SysAdmins only". Most don't care about losing data: this far along in the computer revolution, most of us have lost years of data due to a disk or pebcak failure.

      Most of the time it is not a deal-breaker, or "melt-down" in this case. A company might have to spend some money, or a worker has to spend a lot of time, or the two dozen drafts of your "Great American Novel" are gone.

      But sometimes it's the entire financial transaction or contractual history. Or the finished version of your novel. Or the priceless-to-you pictures of your baby/SO/parent/nana.

      Pro sysadmins pretty much find that a "no data loss mindset" is career enhancing.

      --
      His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain