A Note On Thursday's Downtime
If you were browsing the site on Thursday, you may have noticed that we went static for a big chunk of the day. A few of you asked what the deal was, so here's a quick follow-up. The short version is that a storage fault led to significant filesystem corruption, and we had to restore a massive amount of data from backups. There's a post at the SourceForge blog going into a bit more detail, and describing the steps our Siteops team took (and is still taking) to restore service. (Slashdot and SourceForge share a corporate overlord, as well as a fair bit of infrastructure.)
oh, I thought some of that shitware they sling got loose and bit them in the ass
Isn't NSA world's most famous backup solution? Just ask them for the data. 'NSA cloud', now for more than just political prosecution and government oppression and corporate theft (and an occasional viewing of your dick). Backup your data with us! It is also paid by debt and inflation, so you don't even have to pay for the service, you have already paid!
As to /. ..... I have no words, only adjectives left for you.
You can't handle the truth.
Lately I have noticed quite a fair bit less of the typical trolls.
Has Slashdot improved filtering mechanisms?
Sourceforge is Badware risks .. http://i.imgur.com/Hhtgv0H.png
like unicode support and ipv6.
Could have been far worse...
All right! Nobody moves, or the storage gets it! .... Help me! Help me! .... Shut down! ..... Won't somebody help that bad drive?!
The reboot is near.
much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
I clicked on a "firehose" link and the most recent story was "YouTube's ready to select a winner" from March 2013.
But the "help us select the next story" link was ok, as was directly entering Slashdot.org/recent.
Good luck with the restore / clean up / troubleshooting. That's not a fun way to spend a weekend.
much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
Hardware isn't free and employees aren't free. I seriously don't understand how Sourceforge has kept the lights on all these years.
And by the way, I'm a very satisfied user of their services. But I do worry about their future.
Anyone else get an "Internal Server Error" 500 from this page?
Had to reload 2-3 times before it came through successfully.
All is not well afloat the good ship SlashDice!
Not APK.
Thank you to the Slashdot team. Bringing systems back up like that is emergency-mode-fun, but a lot of work, and we appreciate it.
It was fast as hell!
here
http://saveie6.com/
And right before the Pluto flyby.
Seriously, though, imagine the thoughts going through NASA minds when the probe crapped out a week before the big encounter. Their toilets must have been full of bricks.
It's not like rover problems where you can continue where you left off after you fix it. New Horizons couldn't stop.
Table-ized A.I.
Next time use an open source data storage solution. Proprietary and fickle technologies like EMC must go extinct!!
It's great to see how you responded to the failure and got services resumed pretty quickly. However, I'd rather like to see a follow-up sometime, describing a root cause analysis. With all the clustered, distributed servers and filesystems you use today, such an outage shouldn't be possible, right?
To Terminate, or not to Terminate, that's the question - SCSIROB
Serious question: how much of this could have either been prevented, or restored much more quickly, if they were using ZFS with proper parity, checksumming, snapshotting, and sending (backups)? This really is the one-size-fits-all storage solution at this point.
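For anyone wondering what per-block checksumming actually buys you, here is a rough sketch in Python. It is not ZFS code and says nothing about how ZFS does this internally; the function names (checksum, verify, read_with_repair) are made up for the example, and SHA-256 is just a convenient stand-in hash. The idea is only that a checksum stored apart from the data lets a reader detect a corrupt copy and fall back to a redundant one.

```python
import hashlib


def checksum(data):
    """Return a hex digest for a block of bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def verify(data, expected):
    """True if the block still matches the checksum recorded at write time."""
    return checksum(data) == expected


def read_with_repair(copies, expected):
    """Return the first redundant copy that passes verification.

    `copies` is a list of byte strings read from different devices
    (a hypothetical mirror); raises IOError if every copy is corrupt.
    """
    for copy in copies:
        if verify(copy, expected):
            return copy
    raise IOError("all copies failed checksum verification")


if __name__ == "__main__":
    block = b"comment #42"
    stored = checksum(block)              # recorded when the block was written
    damaged = b"comment #4\x00"           # simulated on-disk corruption
    print(verify(damaged, stored))        # the corruption is detected
    print(read_with_repair([damaged, block], stored))  # the good mirror copy wins
```

Note that this only detects and routes around bad copies. If every copy is damaged, as can happen when the fault sits below all of them, you are still restoring from backups.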
[...] This incident impacted all block devices on our Ceph cluster.
Power/communications/routing down event? Was monitor quorum lost? Inquiring minds that are not trolls are curious and grateful that the path to restoration was clear. Best wishes.
<blink>down the rabbit hole</blink>
Autorefresh sucks. Turn it off.
I've negligible experience in this sort of failure and recovery, but...
Shouldn't slashdot and sourceforge be entirely separate, so that the failure of one can't bring down the other?
Shouldn't there be live redundant systems, so that when one fails, one of the redundant systems is switched online in minutes? I don't mean just redundant storage, but 3 or 4 systems running concurrently, taking the same input and monitoring to confirm that the output is the same.
Is this too expensive or not technically feasible?
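The "3 or 4 systems taking the same input and comparing output" idea is usually called majority voting. A minimal sketch in Python, assuming each replica's output can be compared for equality; the vote function and the sample outputs are hypothetical and not anything Slashdot or SourceForge is known to run:

```python
from collections import Counter


def vote(outputs):
    """Pick the output a strict majority of replicas agree on.

    Returns (winning_output, indexes_of_dissenting_replicas).
    Raises RuntimeError when no strict majority exists.
    """
    if not outputs:
        raise ValueError("no replica outputs to compare")
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("replicas disagree and no majority exists")
    dissenters = [i for i, out in enumerate(outputs) if out != winner]
    return winner, dissenters


if __name__ == "__main__":
    # Three replicas answered the same request; replica 2 returned garbage.
    print(vote(["page-v7", "page-v7", "page-??"]))  # ('page-v7', [2])
```

The voting itself is cheap. The cost is in running every request three or four times and keeping the replicas' state in sync, which is a big part of why most sites settle for redundant storage plus failover instead.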
Contribute to civilization: ari.aynrand.org/donate
Your important files encryption produced on this computer: photos, videos, documents, etc. Here is a complete list of encrypted files, and you can personally verify this.
Encryption was produced using a unique public key RSA-2048 generated for this computer. To decrypt files you need to obtain a private key. The single copy of the private key, which will allow you to decrypt the files, located on a secret server on the Internet; the server will destroy the key after a time specified in this window. After that, nobody and never will be able to restore files.
To obtain the private key for this computer, which will automatically decrypt files, you need to pay 666 Bitcoin like similar amount in another country. Click >nezt> to select the method of payment and the currency. Any attempt to remove or damage this software will lead to the immediate destruction of the private key by server.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
"Storage corruption" is fairly vague. I've been bit by it in the past - once due to a vendor software bug (Oracle block corruption), and once due to hardware (flaky storage controller chip writing garbage (Supermicro MB)). I would like to hear more about the root cause.
RM
}#q NO CARRIER
It looks like /. had a Plan B ready in the case of a catastrophic failure. For some sites one just gets a blank page with some strange message when that happens. /. did the right thing letting users know they had a problem and were working on it and then let us know a bit about what happened. Thanks, /. techs.
In a time of universal deceit, telling the truth is a revolutionary act. George Orwell
On the same Thursday that Slashdot experienced data storage corruption, the 1TB hard drive on my Windows gaming PC crashed, reporting 4GB of free space available and unresponsive to IO block commands. (I've seen that behavior on USB sticks, but never on a hard drive.) Except for several years of email, all my data was on the file server. Oh, well. I got a good excuse to rebuild my eight-year-old PC, especially with Windows 10 around the corner. Meanwhile, I'm using a $250 Dell laptop for everything except gaming.
They tried Unicode before. It allowed spoofing comment scores. SoylentNews claims to support Unicode; I wonder how it prevents spoofing comment scores.
I say this as someone that runs ZFS on his backup/file server; if you do have to restore or resilver it can take a long while! A single slow drive in a vdev will limit the entire pool's IO (the extent of which is entirely dependent on topology, but the weakest link always crushes you in ZFS). After a handful of TB of data, even with a pool of mirrored vdevs and a flash cache device, the resilver for a single drive can take a day unless you've got some serious spindle count at high RPMs. Even SAS drives don't provide that many IOPS.
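To put a rough number on "can take a day": a back-of-the-envelope estimate in Python. The figures are illustrative assumptions, not measurements from any real pool, and resilver_hours is a made-up helper. It treats the resilver as purely throughput-bound, which is the optimistic case; an IOPS-bound resilver on a fragmented pool will be slower.

```python
def resilver_hours(data_tb, effective_mb_per_s):
    """Hours to copy `data_tb` terabytes at a sustained `effective_mb_per_s`.

    Uses decimal units (1 TB = 1,000,000 MB). The effective rate is an
    assumption the caller supplies; it is typically well below a drive's
    sequential spec while the pool is also serving normal I/O.
    """
    if effective_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_tb * 1_000_000 / effective_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Assumed: 4 TB to rebuild at an effective 50 MB/s.
    print(round(resilver_hours(4, 50), 1))  # 22.2
```

So 4 TB at an effective 50 MB/s works out to roughly 22 hours, which is in line with the "a day" figure above.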
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Nice to see that blurb of text again. Can we get this to happen every time you post a Nerval's Lobster/Dice slashvertisement, too?