New Linux Petabyte-Scale Distributed File System
An anonymous reader writes "A recent addition to Linux's impressive selection of file systems is Ceph, a distributed file system that incorporates replication and fault tolerance while maintaining POSIX compatibility. Explore the architecture of Ceph and learn how it provides fault tolerance and simplifies the management of massive amounts of data."
Ceph was designed by Sage Weil (of WebRing fame), who is also one of the founders of DreamHost. They will likely be using it internally soon, if they aren't already. http://en.wikipedia.org/wiki/DreamHost
Look at Google and Facebook, arguably among the top users of massive databases. They have petabytes upon petabytes of data stored and are constantly growing. But what happens if they lose some data?
Nothing. They can always go back and regenerate that data. It's just a matter of time.
So at this large scale, it doesn't make any sense at all to focus on data integrity beyond making sure that fopen() and fread() don't return garbage. It's the smaller databases that contain critical information that need data integrity. These are typically sub-terabyte, though some may creep over that limit in a few uncommon instances.
And realistically, if you don't want your data to be hacked up, lost, then thrown out with a bad drive, ReiserFS or any other modern journaling filesystem is the right choice.
I wouldn't bet money on distributed filesystems just yet.
The headline in the Ceph wiki: Ceph is under heavy development, and is not yet suitable for any uses other than benchmarking and review.
"Maybe this world is another planet's hell"
Aldous Huxley
"It took a lot of work, but this latest Linux patch enables support for multi-petabyte file organization and storage!"
"Do you have support for smooth, full-screen Flash video yet?"
"No, but who uses that?"
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
Let me guess - you work for the SEC and need it for your porn collection
Google's GFS/BigTable architecture is a distributed filesystem. If a node goes down, the data that was on that node gets copied to other nodes to keep the replication count up.
Facebook is using Apache Cassandra, which adopts a similar design.
I think the big issue in the programming community as a whole is the current lack of understanding of the differences between eventual and atomic consistency.
Distributed file systems work quite well when you have a single source of truth, but when you have multiple data stores, you can have multiple sources of truth. It essentially adds a temporal dimension to your data. As in, John Smith is a debtor of XYZ corp on Monday morning, but due to the server being down, we haven't realised on Tuesday morning that he paid his bill on Monday afternoon. Add late fee penalties.
It adds another layer of complexity to an application: late-arriving events have to roll back the actions that other actors in the ecosystem already took on stale data. In the example, that means sending out another letter stating that the late fee penalties have been removed and, if they were already paid, that a refund is to be issued.
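To make the John Smith example concrete, here is a rough sketch in Python. Everything in it is made up for illustration (the `Event` type, the hour-based timestamps, the `reconcile` helper); it is not how any particular database does it, just one way to show a stale replica charging a fee and the merge step issuing the compensating action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int   # hypothetical logical clock: hours since Monday 00:00
    kind: str   # "invoice", "payment", "late_fee" or "fee_reversal"


def needs_late_fee(log: set[Event], deadline: int) -> bool:
    """True if this replica's view contains no payment made by the deadline."""
    return not any(e.kind == "payment" and e.time <= deadline for e in log)


def reconcile(local: set[Event], remote: set[Event], deadline: int) -> set[Event]:
    """Merge two replica logs and compensate for a fee charged on stale data."""
    merged = local | remote
    fee_charged = any(e.kind == "late_fee" for e in merged)
    fee_reversed = any(e.kind == "fee_reversal" for e in merged)
    if fee_charged and not fee_reversed and not needs_late_fee(merged, deadline):
        # The payment was on time after all: append a compensating event
        # instead of rewriting history.
        latest = max(e.time for e in merged)
        merged.add(Event(latest + 1, "fee_reversal"))
    return merged


# Replica A saw the Monday afternoon payment; replica B was cut off from it.
replica_a = {Event(0, "invoice"), Event(15, "payment")}
replica_b = {Event(0, "invoice")}
DEADLINE = 24  # end of Monday

# Tuesday morning: B decides on stale data and charges the fee.
if needs_late_fee(replica_b, DEADLINE):
    replica_b.add(Event(33, "late_fee"))

# Once the replicas can talk again, the merge appends the reversal.
merged = reconcile(replica_a, replica_b, DEADLINE)
```

The point is that the late fee never disappears from the log. The application has to carry both the mistake and the correction, which is the extra temporal dimension.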
Science advances one funeral at a time- Max Planck
While google may be able to go ahead and re-index websites if it loses that data, "regenerating" gmail and google docs stuff isn't quite so easy, and even small amounts of data loss would kill those applications (especially among paid users).
You just contradicted yourself. You're right; it's just a matter of time. Only thing is, this is the Internet. How long to recreate that data? Weeks? Months? Years? 6 months is an eternity on the Net.
If all the accounts and stories were lost on Slashdot due to a massive database failure, how many people would come back, creating a new account and so forth? How long would it take before there was enough content and accounts to make it interesting again? Now realize that Slashdot is a drop in the bucket compared to Google.
My blog
640 petabytes should be enough for everybody.
Everything I write is lies, read between the lines.
It was noble of you to try to wrest control of a troll thread, but your comment loses a lot of credibility for being titled "Re: Do niggers use linux?"
Would it hurt to at least change the title while you strive for visibility and relevance? When I saw the title of your post, I half-expected to see a poorly-written diatribe against Jamal Jackson for playing basketball and chasing caucasian women.
Thank you, kind sir, for listening. We all must do our part to prevent trolling!
Would it hurt to at least change the title while you strive for visibility and relevance?
Well you didn't change it
http://michaelsmith.id.au
I see far too many layers upon layers there, which to me always smells like the inner-platform anti-pattern that an “enterprise consultant” would produce.
But maybe I’m just misunderstanding things and that many layers are needed for large installations. Is there anyone here who actually administers such large storage systems and has read the article? It would be interesting to hear from someone with daily experience in this.
Also, I could not find any mention of ZFS-like scrubbing going on, which in my experience means no reliability at all with today’s unreliable drives. How would that system detect a controller creating corruption? Or data degradation? I had those problems, and they killed half my data, despite having a RAID, doing automatic backups with verification and having a git-like history of changes (to protect from accidental overwriting). None of that helped me at all.
Only constantly checking all data, and fixing it before the errors become too big for ECC to correct, can prevent this.
Did I miss it, or did they really forget that crucial part?
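For anyone unsure what I mean by scrubbing, a minimal sketch in Python. It assumes each block is stored as several replicas plus a checksum recorded at write time; the `checksum` and `scrub` names are mine, and this is neither ZFS's nor Ceph's actual implementation, only the general idea of verifying every copy and repairing bad ones from a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum recorded when the block is written (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: list[bytearray], expected: str) -> list[int]:
    """Verify every replica against the stored checksum.

    Bad replicas are overwritten from an intact copy. Returns the indices
    of the replicas that had to be repaired.
    """
    good = next((bytes(r) for r in replicas if checksum(bytes(r)) == expected), None)
    if good is None:
        raise RuntimeError("no intact replica left; block is unrecoverable")
    repaired = []
    for i, replica in enumerate(replicas):
        if checksum(bytes(replica)) != expected:
            replica[:] = good  # repair in place from the good copy
            repaired.append(i)
    return repaired


# Three replicas of one block; a flaky controller flips bits in the second.
original = b"important data"
expected = checksum(original)
replicas = [bytearray(original) for _ in range(3)]
replicas[1][0] ^= 0xFF

repaired = scrub(replicas, expected)
```

Run periodically over all blocks, this catches silent corruption while at least one good copy still exists. Backups and RAID alone do not, because they happily copy the bad data.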
Any sufficiently advanced intelligence is indistinguishable from stupidity.
Second, you have other sectors producing large amounts of data besides your favourite networking website. One example is the LHC. It is going to produce terabytes of data per DAY (15 petabytes per year). Another is space telescopes. That data can't just be 'regenerated'. One day's worth of data is incredibly expensive to produce.
Distributed file systems are already there, and people use them. Maybe not on your level of computer usage.
When you don't know what you are talking about, I think it is better to just keep quiet.
EULA : By reading the above message, you agree that I now own your soul.
The first word in the article summary is "Linux®"
Does that look weird to anyone else? I realize it's technically correct for the registered trademark symbol to be there, but somehow it just doesn't seem right.
this copying of the node happens after the node goes down?
One of the remaining replicas of each block on the failed node is copied so the total replication count does not go down. The original was perhaps poorly phrased, no need to be a dick about it, though.
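A toy version of that in Python, in case it helps. The `placement` map and the `handle_node_failure` function are invented for this sketch, and the "pick the least loaded node" rule is just an assumption to keep it short. It is not Google's actual algorithm, and Ceph computes placement with CRUSH rather than a table like this.

```python
def handle_node_failure(placement: dict[str, set[str]], failed: str) -> dict[str, str]:
    """Re-replicate the blocks that lived on a failed node.

    placement maps node name -> set of block ids stored there.
    Returns a map of block id -> node that received the new copy.
    """
    lost_blocks = placement.pop(failed)
    moves = {}
    for block in sorted(lost_blocks):
        holders = [n for n, blocks in placement.items() if block in blocks]
        if not holders:
            continue  # no surviving replica: this block is gone
        candidates = [n for n in placement if block not in placement[n]]
        if not candidates:
            continue  # every surviving node already holds a copy
        # Copy from a surviving holder to the least loaded node (ties by name).
        target = min(candidates, key=lambda n: (len(placement[n]), n))
        placement[target].add(block)
        moves[block] = target
    return moves


# Each block starts with two replicas; node "a" then dies.
placement = {"a": {"b1", "b2"}, "b": {"b1"}, "c": {"b2"}, "d": set()}
moves = handle_node_failure(placement, "a")
```

So the copy is made from the survivors after the failure, and each block ends up with the same number of replicas as before.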
sic transit gloria mundi
I am not really familiar with Ceph, and after going through the pain of learning more about GlusterFS (http://www.gluster.org/) only to find that Gluster was not quite ready for primetime (this was about 6 months ago; that may have changed), I am a bit skeptical. Does anyone know the main differences between Ceph and GlusterFS (besides that GlusterFS can run in userspace)?
Yes, but Google's file system makes no attempt to implement either the POSIX standard or the Linux VFS. It's highly specialized to only deal with the types of loads that Google sees. As a general solution, its worth is debatable.
If you woke up one morning in Tokyo to discover that someone had blurred your genitalia during the night, I'd bet you would consider puking on someone too.
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
Why? Is there something special about those?
You must be new here!
Nothing special at all. It only means Taco used sequential instead of randomised integers for user ids, which in turn can be viewed as a very loose chronology of user registrations.
In other words, no.
Tera -> Tetra -> 4 -> 1000^4
Peta -> Penta (like Pentagram) -> 5 -> 1000^5
Exa -> Hexa (like Hexagon) -> 6 -> 1000^6
Zetta -> Setta (like 7 in many languages) -> 7 -> 1000^7
Yotta -> Otta -> 8 -> 1000^8
Or use 1024 if you don't like IEEE/IEC norms...
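The same mnemonic as a few lines of Python, if you'd rather compute than memorize. The `PREFIXES` list and `prefix_value` function are just names I picked for the sketch; the `base` argument switches between the decimal (1000) and binary (1024) readings.

```python
# Prefixes in order: the n-th one (counting from 1) means base ** n.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def prefix_value(name: str, base: int = 1000) -> int:
    """Return the multiplier for a prefix, e.g. peta -> 1000 ** 5."""
    return base ** (PREFIXES.index(name) + 1)


petabyte = prefix_value("peta")               # decimal reading
binary_terabyte = prefix_value("tera", 1024)  # binary reading
```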