New Linux Petabyte-Scale Distributed File System
An anonymous reader writes "A recent addition to Linux's impressive selection of file systems is Ceph, a distributed file system that incorporates replication and fault tolerance while maintaining POSIX compatibility. Explore the architecture of Ceph and learn how it provides fault tolerance and simplifies the management of massive amounts of data."
but in soviet Russia file systems Distribute you
Ceph was designed by Sage Weil (of WebRing fame), who is also one of the founders of DreamHost. They will likely be using it internally soon, if they aren't already. http://en.wikipedia.org/wiki/DreamHost
Look at Google and Facebook, arguably among the top users of massive databases. They have petabytes upon petabytes of data stored and are constantly growing. But what happens if they lose some data?
Nothing. They can always go back and regenerate that data. It's just a matter of time.
So at this large scale, it doesn't make any sense at all to focus on data integrity beyond making sure that fopen() and fread() don't return garbage. It's the smaller databases that contain critical information that need data integrity. These are typically sub-terabyte, though some may creep over that limit in a few uncommon instances.
And realistically, if you don't want your data to be hacked up, lost, then thrown out with a bad drive, ReiserFS or any other modern journaling filesystem is the right choice.
I wouldn't bet money on distributed filesystems just yet.
The headline in the Ceph wiki: Ceph is under heavy development, and is not yet suitable for any uses other than benchmarking and review.
"Maybe this world is another planet's hell"
Aldous Huxley
"It took a lot of work, but this latest Linux patch enables support for multi-petabyte file organization and storage!"
"Do you have support for smooth, full-screen Flash video yet?"
"No, but who uses that?"
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
Google's BigFile/BigTable architecture is a distributed filesystem. if a node goes down, the data that was on that node gets copied to other nodes to keep the replication count up.
Facebook is using apache cassandra, which adopts similar designs.
Oh, and I forgot about Amazon Dynamo.
I think the big issue in the programming community as a whole is the current lack of understanding of the differences between eventual and atomic consistency.
Distributed file systems work quite well when you have a single source of truth, but when you have multiple data stores, you can have multiple sources of truth. It essentially adds a temporal dimension to your data. As in, John Smith is a debtor of XYZ corp on Monday morning, but due to the server being down, we haven't realised on Tuesday morning that he paid his bill on Monday afternoon. Add late fee penalties.
It adds another layer of complexity to an application that delayed gestures roll back transitive actions between actors in an Ecosystem. In the example, it would be to send out another letter stating that the late fee penalties have been removed, and if already paid, a refund is to be issued.
Science advances one funeral at a time- Max Planck
I'm not really sure how much a petabyte is. Could someone please translate to Natalie Portmans? or Station wagons full of congresses? or Rods to the Hogshead?
While google may be able to go ahead and re-index websites if it loses that data, "regenerating" gmail and google docs stuff isn't quite so easy, and even small amounts of data loss would kill those applications (especially among paid users).
You just contradicted yourself. You're right; it's just a matter of time. Only, thing is, this is the Internet. How long to recreate that data? Weeks? Months? Years? 6 months is an eternity on the Net.
If all the accounts and stories were lost on Slashdot due to a massive database failure, how many people would come back, creating a new account and so forth? How many long would it take before there was enough content and accounts to make it interesting again? Now realize that Slashdot is a drop in the bucket compared to Google.
My blog
It was noble of you to try to wrest control of a troll thread, but your comment loses a lot of credibility for being titled "Re: Do niggers use linux?"
Would it hurt to at least change the title while you strive for visibility and relevance? When I saw the title of your post, I half-expected to see a poorly-written diatribe against Jamal Jackson for playing basketball and chasing caucasian women.
Thank you, kind sir, for listening. We all must do our part to prevent trolling!
How is this different than Lustre?
Would it hurt to at least change the title while you strive for visibility and relevance?
Well you didn't change it
http://michaelsmith.id.au
I see a lot too many layers over layers there. Which always smells like the inner-platform anti-pattern that a “enterprise consultant” would to, to me.
But maybe I’m just misunderstanding things and that amount of layers is needed for large installations. Anyone here, who actually administers such large storage systems and read the article? Would be interesting to hear from someone with daily experience in this.
Also, I could not find any mentioning of any ZFS-like scrubbing going on. Which in my experience equals zero reliability at all with today’s unreliable drives. How would that system detect a controller creating corruption? Or data degradation? I had those problems. And they killed half my data. Despite having a RAID, doing automatic backups with verification and having a git-like history of changes (to protect from accidental overwriting). Nothing of that helped me at all.
Only constantly checking all data, and fixing them, before the errors become big enough for ECC to stop working, can prevent this.
Did I miss it, or did they really forget that crucial part?
Any sufficiently advanced intelligence is indistinguishable from stupidity.
I think I'll stick with ZFS. It's a million times better, give or take.
Second, you have other sectors producing large amount of data beside your favourite networking website. One example is the LHC. It is going to produce terabytes of data per DAY (15 petabytes per year). Another are space telescopes. Those data can't just be 'regenerated'. 1 day worth of data is incredibly expensive to produce.
Distributed file systems are already there, and people use them. Maybe not on your level of computer usage.
When you don't know what you are talking about, I think it is better to just keep quiet.
EULA : By reading the above message, you agree that I now own your soul.
If that were to happen, I'd finally be able to get a low UID!
That's Goldman-Sach's job, you insensitive clod!
The first word in the article summary is "Linux®"
Does that look weird to anyone else? I realize it's technically correct for the registered trademark symbol to be there, but somehow it just doesn't seem right.
this copying of the node happens after the node goes down? so the software time travels? That totally disproves Stephen Hawkin's recent time travel can only go forward statement! DUDE - AWSOME!
this copying of the node happens after the node goes down?
One of the remaining replicas of each block on the failed node is copied so the total replication count does not go down. The original was perhaps poorly phrased, no need to be a dick about it, though.
sic transit gloria mundi
I am not real familiar with ceph and after going through the pain to learn more about glusterfs (http://www.gluster.org/) only to learn that gluster was not quite ready for primetime (this was about 6 month ago - may have changed), I am a bit skeptical. Anyone know the main differences between ceph and glusterfs (besides that glusterfs can run in userspace)?
Why? Is there something special about those?
Yes, but Google's file system makes no attempt to implement either the POSIX standard or the Linux VFS. It's highly specialized to only deal with the types of loads that Google sees. As a general solution, it's worth is debatable.
If you've stored the data, you can reproduce the data in the event some of it is lost.
RAID much? PAR often?
Why? Is there something special about those?
You must be new here!
But that is not what the original question was about. The original question was about sites like Google or Facebook using anything like a distributed file system to keep from losing data.
Nothing special at all. It only means Taco used sequential instead of randomised integers for user ids, which in turn can be viewed as a very loose chronology of user registrations.
In other words, no.
Why do you assume that:
A: PB storage is very rare and only used by several large organizations.
B: PB storage is used to house generated data the can easily be replaced.
- Gilboa
..and the pretty amazing open source distributed multi-master no-single-point-of-failure database Riak.
My other account has a 3-digit UID.
This was a reference to ReiserFS.
A bad taste joke, but not offtopic.
Nothing. They can always go back and regenerate that data. It's just a matter of time.
No, they can't. This is a really, really important distinction to make. They cannot "regenerate" the data. They *might* (perhaps even "probably") be able to "recopy" the data, *assuming the original source is still available*.
Calculate the overhead of say, RAID 6, for 1 Petabyte of data.
So there. :)
The diversity and expression of human opinion is essential to human survival.
Acutally your raid array can't regenerate your data in most failure scenarios because of idiotic design :
Bit error in RAID 1 :
disk A : 000000111011011
disk B : 001000111011011
that's the information your raid array has in case of a bit error. Do tell, which is the correct one ?
Or, better, yet, a 3 disk RAID-5 array :
disk A : 000000111011011
disk B : 001001010011001
parity disk : 001101101000010
clearly something is wrong ... now fix the problem.
RAID is worthless unless you know which data set is wrong.
Just curious - too far in my technical past for me to recall - Ceph is claimed to adhere to POSIX standards. Do POSIX standards accommodate the "eventually consistent" filesystem models?
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Big parallel filesystems are a dime a dozen now. There is CXFS, Panasis, pNFS is coming, Lustre, PVFS (not good), Ceph (not good), GFS (Google File System, which is being rewritten/seriously updated), GPFS, Hadoop, QFS (not sure of its scaling) and more.
Almost all of these things have the same basic architecture where the metatdata (actual inodes) are separate from the data, and the data can be RAIDED/duplicated and/or a single file can be striped across a number of storage machines. Like memory, filesystems on large scale have a hierarchy as well. Memory, you have, registers, L1/2/3 cache, system memory, swap if you still use that. Storage is going the same way.
The crazy thing is that the disparity between CPU speed and IO speed is becoming greater and greater. Also, datasets are getting larger and larger.
kernel.org only has up to 2.6.33.
It was noble of you to try to wrest control of a troll thread, but your comment loses a lot of credibility for being titled "Re: Do niggers use linux?"
While it's off-topic, it's at least an honest question! I'm sure the slashbots want to know the answer.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Facebook uses MySQL/memcached, cassandra is only used for systems running the statistical analysis.
New things are always on the horizon
RAID5 doesn't use a parity disk, nor does it protect against bit level errors, if a drive fails, you obviously know which drive is bad, and unless a second drive fails, recover your data.
if you want bit level error checking, use a ZFS based RAID.
That would reduce the number of posts on slashdot by about 99%.
To have a right to do a thing is not at all the same as to be right in doing it
The parity in RAID5 is spread over the disks, so for any block of data, yes there is a "parity disk" for it.
Except that not a single one of those is usable across existing multiple platforms (Solaris,BSD/Linux/Windows, etc.), where you might want to have cardinality of your data. You put your .Net data on your .Net platforms, and your J2EE data on your J2EE platforms, etc. I always have to have a layer like CIFS or SMB on top.
I suppose that's fine, you can always throw extra layers, but if a system is doing distribution already, why can't you throw in location transparency on top and add a Posix layer.
And none of them so far are Free for all the above mentioned platforms. Hell, I'll take just Linux and Windows. GPFS is, but IBM will ass-rape you in $$$ for it. Lustre has the same problem if you want Windows support.
The rest (almost universally) don't support Posix semantics (though OCFS is apparently shooting to do so).
Then there's the fact that almost all of the open Linux only ones have a SPOF in the form of the metadata server.
I deal with genomic data sets. Each data set is 30-80Gb. We're a small group, but the big ones have petabytes of storage. If the biological material is used up, you cannot regenerate the data.