Google Switching To EXT4 Filesystem

Btrfs? by Wonko the Sane · 2010-01-14 09:00 · Score: 2, Interesting

I guess they didn't consider btrfs ready enough for benchmarking yet.

Re:Btrfs? by Anonymous Coward · 2010-01-14 09:37 · Score: 1, Interesting

Ext3 is just a couple flags added to ext2. For ext4, if you want to take advantage of its features, you have to start from scratch. However, I don't think this is an issue for Google, as they have a ton of redundancy.

No ReiserFS? by CRCulver · 2010-01-14 09:00 · Score: 3, Interesting

It's interesting that ReiserFS wasn't even an option here. I myself even ended up using Ext4 when I set up a new box not too long ago. It's a real shame that just because the creator of the filesystem committed a crime, people are drawn to treat the technology itself as somehow dishonored.

Re:No ReiserFS? by mqduck · 2010-01-14 11:51 · Score: 3, Interesting

So it's not because the creator of the filesystem committed a crime, it's because the product has an unsavoury name
Actually, it's more likely because the creator and main developer of the filesystem is suddenly gone. As I understand it, he wasn't a very friendly guy (surprise!) and drove others away from the project.

--
Property is theft.

Google doesn't need journaling? by Paradigm_Complex · 2010-01-14 09:00 · Score: 3, Interesting

The main advantage of EXT3 over EXT2 is that, with journaling, if you ever need to fsck the data, it goes a LOT quicker. It's interesting to note that Google never felt it needed that functionality.

Additionally, I was under the impression that Google used massive numbers of commodity consumer-grade harddrives, as opposed to high-grade stuff which I presume is less likely to err. Couple this fact with the massive amount of data Google is working with and there has got to be a lot of filesystem errors, no?

Can anyone else with experience with big database stuff hint as to why Google would not need to fsck their data (often enough for EXT3 to be worthwhile)? Is it cheaper just to overwrite the data from some backup elsewhere at this scale? How do they know the backup is clean without fscking that?

--
"A witty saying proves nothing." - Voltaire

Re:Google doesn't need journaling? by tytso · 2010-01-14 11:55 · Score: 4, Interesting

So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.
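As a rough illustration of the application-side half of this point, here is a minimal Python sketch of pushing a write toward stable storage. The helper name `durable_write` is made up for this example. `os.fsync` only asks the kernel to flush the file; whether the drive's own write cache gets flushed depends on the filesystem issuing a barrier and on the hardware honoring it, which is exactly the gap described above.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and ask the OS to flush it to the device.

    Hypothetical helper for illustration only. os.fsync() flushes the
    kernel's buffers; by itself it cannot prove that a drive with a
    volatile write cache has really put the bytes on the platter.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer
        os.fsync(f.fileno())  # ask the kernel to push it to the device
    return len(data)


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "record.bin")
written = durable_write(demo_path, b"hello, platter")
```

Skipping the `os.fsync` call makes the write look much faster, for the same reason that skipping barriers does: the data may still be sitting in a cache when the power drops.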
In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies and burn quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which granted, may not show up in benchmarks which are disk bound. But if the applications or the large number of jobs running on a shared machine are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.
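To give a feel for what "tracking the dependencies" means, here is a deliberately simplified Python sketch. It is not the Soft Updates algorithm: the three pending writes and their ordering are assumptions chosen for illustration, and the real code tracks dependencies at a much finer grain and has to roll updates back when they form cycles. The sketch only shows the basic idea of ordering writes so that nothing on disk points at something that is not there yet.

```python
from graphlib import TopologicalSorter

# Hypothetical pending metadata writes for creating one file. Each key
# may reach disk only after everything in its set is already there.
pending = {
    "inode_bitmap": set(),
    "new_inode": {"inode_bitmap"},
    "dir_entry": {"new_inode"},
}


def safe_write_order(deps):
    """Return one write order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = safe_write_order(pending)
```

With a cyclic dependency graph, `static_order()` raises `graphlib.CycleError`; that is roughly the situation in which a real implementation has to undo part of an update before writing a block, which is where the memory and CPU cost comes from.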
BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs. But if Soft Updates is so great, why is NetBSD replacing it and why did FreeBSD add a file system journaling alternative to UFS?
Re:Google doesn't need journaling? by Anonymous Coward · 2010-01-14 13:10 · Score: 2, Interesting

Well, the performance is not that easy to compare in pure theory. SU will often require fewer writes than journaling. But SU requires that complex dependency tracking.
About the barriers. Is it really that different from journaling file systems? If the disk drive can change the order of the operations that surely has an impact on journaling file systems. The journal would be quite useless when the transaction it represents is committed before it is logged in the journal. That way the operation could be committed half way and there is no journal entry to roll it back or complete it. Maybe I am wrong but you would need a barrier for every operation with the journal.
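The ordering requirement in question can be sketched with a toy write-ahead journal in Python. Everything here is an assumption made for illustration (the JSON file formats and the function names `journaled_set`, `apply_change`, `load_store` and `replay`); no real file system journal looks like this, and the sketch ignores torn writes to the store itself. The point is only that the journal entry is forced out before the change is applied, so a crash in between can be repaired by replaying the log.

```python
import json
import os
import tempfile


def load_store(store_path):
    """Read the toy key/value store, or return an empty one."""
    if not os.path.exists(store_path):
        return {}
    with open(store_path) as s:
        return json.load(s)


def apply_change(store_path, key, value):
    """Apply one change to the toy store (a single JSON file)."""
    store = load_store(store_path)
    store[key] = value
    with open(store_path, "w") as s:
        json.dump(store, s)
        s.flush()
        os.fsync(s.fileno())


def journaled_set(journal_path, store_path, key, value):
    """Log the intended change, force it out, and only then apply it."""
    with open(journal_path, "a") as j:
        j.write(json.dumps({"key": key, "value": value}) + "\n")
        j.flush()
        os.fsync(j.fileno())  # the "barrier": journal first, data second
    apply_change(store_path, key, value)


def replay(journal_path, store_path):
    """After a crash, re-apply every logged change (replay is idempotent)."""
    if not os.path.exists(journal_path):
        return
    with open(journal_path) as j:
        for line in j:
            entry = json.loads(line)
            apply_change(store_path, entry["key"], entry["value"])


workdir = tempfile.mkdtemp()
journal = os.path.join(workdir, "journal.log")
store_file = os.path.join(workdir, "store.json")

journaled_set(journal, store_file, "a", 1)
# Simulate a crash after logging "b" but before applying it to the store.
with open(journal, "a") as j:
    j.write(json.dumps({"key": "b", "value": 2}) + "\n")
replay(journal, store_file)
recovered = load_store(store_file)
```

If the device were free to reorder the two writes in `journaled_set`, the store could change with no journal entry behind it, which is the failure described above.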
About the BSDs:
I found two reasons for NetBSD switching to WAPBL. Their implementation of soft updates (called softdeps) seems to be buggy in some corner cases. Journaling is less complex and easier to get right, while having similar performance characteristics. They often cite performance statistics where WAPBL wins by about 10%-15%. But that is not very solid; it only covers one usage pattern. The research I know of usually shows that in general journaling and soft updates are very similar with each one winning in some patterns. I think using the simpler solution is really the right choice for a project like NetBSD.
Journaling in FreeBSD is another quite interesting story. Journaling for FreeBSD is implemented in GEOM. My knowledge here is really limited, but GEOM acts below the file system. So the implementation in GEOM could provide journaling for every file system, like GEOM can provide encryption for every file system. AFAIK journaling in GEOM provides hooks that are used by UFS. I don't know why but my guess is performance improvements.
Additionally there is this quite new UFS SU+J implementation. That is UFS with soft updates and a journal to keep track of the freed space.
What I really am ranting about is that for Linux this hasn't even been tried. Although there are loads of Linux file systems there isn't much innovation going on. Really the only reason I found was that soft updates is complex. At least BtrFS comes with copy-on-write.
Re:Google doesn't need journaling? by tytso · 2010-01-14 17:52 · Score: 2, Interesting

So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep it Simple, Stupid principle is an important one, especially as you say, "Journalling and Soft Updates have similar performance characteristics."
If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACLs and Extended Attributes (which appeared in BSD much later than Linux, and I think Soft Updates made it much harder to find people capable of extending the file system) --- then the choice of the simpler technology seems to be obvious. The performance gains are a toss up, and using a hideously complex algorithm for its own sake is only good if you are an academic gunning for a Ph.D. thesis or a paper publication, or if you are trying to ensure job security by implementing something so hard to maintain that only you and a few other people can hack it.

It's Not Hans by TheNinjaroach · 2010-01-14 09:06 · Score: 4, Interesting

I too have abandoned using ReiserFS but it's not about the horrible crime Hans committed. It's about the fact I don't think the company that he owned (who developed ReiserFS) has a great future, so I foresee maintenance problems with that filesystem. Sure, somebody else can continue their work but I'm not going to hold my breath.

--
I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..

Re:It's Not Hans by Anonymous Coward · 2010-01-14 10:05 · Score: 1, Interesting

If Google had found that it gave some badass speeds, they probably would have just picked up maintenance themselves.
Re:It's Not Hans by mqduck · 2010-01-14 11:46 · Score: 2, Interesting

Personally, I think Hans should have been allowed to continue his work on ReiserFS while incarcerated. Better to let a guilty man contribute to society than do nothing but rot in prison, no?

--
Property is theft.

Re:Time for a backup? by Anonymous Coward · 2010-01-14 09:10 · Score: 1, Interesting

The "upgrade" process is to simply mount your old ext3 volume as ext4, and let new writes take advantage of ext4 features.

You say that like it's a good thing. One error, like an assumption about the maximum number of files or clusters, causes a wraparound and it all goes tits up.

It's not like they haven't dropped the ball before: http://www.techcrunch.com/2006/12/28/gmail-disaster-reports-of-mass-email-deletions/

Do no evil, but be a bit incompetent sometimes.

Re:Not A Nerd? by MBGMorden · 2010-01-14 09:11 · Score: 3, Interesting

I too found it interesting, because it basically alleviates any need for me to worry about "upgrading" to ext4. My current Linux systems use an ext3 /boot partition and everything else xfs. Given some of the press ext4 has gotten lately, I just trust xfs more, and knowing that I'm not really giving up any performance is a huge plus.

Truthfully though, where the heck are the meta-data based filesystems that we were promised? I'd love to be able to, on a filesystem level, instantly pull up a folder view of all videos - or all images. Or all images of my dog. Or all images outdoors. Or all images of my dog outdoors.

Basically, just the ability to organize via an arbitrary number of categorized tags.
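For what it's worth, the lookup side of this is simple to sketch in userspace. The following Python toy is an assumption-laden illustration, not a filesystem: the `TagIndex` class, its method names and the file paths are all made up, and it keeps everything in memory. It shows "all images of my dog outdoors" as a set intersection over tags.

```python
from collections import defaultdict


class TagIndex:
    """Toy in-memory index mapping tags to file paths (illustrative only)."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach any number of tags to a path."""
        for t in tags:
            self._by_tag[t].add(path)

    def find(self, *tags):
        """Return the sorted paths that carry every one of the given tags."""
        if not tags:
            return []
        sets = [self._by_tag.get(t, set()) for t in tags]
        return sorted(set.intersection(*sets))


index = TagIndex()
index.tag("/photos/rex_park.jpg", "image", "dog", "outdoors")
index.tag("/photos/rex_sofa.jpg", "image", "dog")
index.tag("/photos/beach.jpg", "image", "outdoors")
index.tag("/videos/rex.mp4", "video", "dog")

dog_outdoors = index.find("image", "dog", "outdoors")
```

The query is the easy part; getting the tags onto the files in the first place is the work this sketch does not address.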

--
"People who think they know everything are very annoying to those of us who do."-Mark Twain

XFS performance highly variable by bzipitidoo · 2010-01-14 09:14 · Score: 3, Interesting

I've used XFS on a RAID1 setup with SATA drives, and found the performance of the delete operation extremely dependent on how the partition was formatted.

I saw times of up to 5 minutes to delete a Linux kernel source tree on a partition that was formatted XFS with the defaults. Have to use something like sunit=64, swidth=64, and even then it takes 5 seconds to rm -rf /usr/src/linux. I've heard that SAS drives wouldn't exhibit this slowness. Under Reiserfs on the same system, the delete took 1 second. Anyway, XFS is notorious for slow delete operations.

--
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"

Re:XFS performance highly variable by Anonymous Coward · 2010-01-14 10:00 · Score: 2, Interesting

Mounting with nobarrier will change those 5 minutes to 5 seconds, but don't turn off your computer during the delete then.

Ubuntu 9.10? by GF678 · 2010-01-14 09:36 · Score: 4, Interesting

Gee, I hope they're not using Ubuntu 9.10 by any chance: http://www.ubuntu.com/getubuntu/releasenotes/910

There have been some reports of data corruption with fresh (not upgraded) ext4 file systems using the Ubuntu 9.10 kernel when writing to large files (over 512MB). The issue is under investigation, and if confirmed will be resolved in a post-release update. Users who routinely manipulate large files may want to consider using ext3 file systems until this issue is resolved. (453579)

The damn bug is STILL not fixed apparently. Some people get the corruption, and some don't. Scares me enough to not even try using ext4 just yet, and I'm still surprised Canonical was stupid enough to have ext4 as the default filesystem in Karmic.

Then again, perhaps Google knows what they're doing.

Re:Ubuntu 9.10? by Anonymous Coward · 2010-01-14 10:24 · Score: 1, Interesting

From the bug comments, this could be linked to a latent kernel bug in journal checksums, which went unnoticed until they were enabled by default after 2.6.31 and reverted in 2.6.32-rc6. If Ubuntu picked up that patch for their kernel, that would have caused corruption.
http://bugzilla.kernel.org/show_bug.cgi?id=14354
Re:Ubuntu 9.10? by RoboRay · 2010-01-14 14:58 · Score: 2, Interesting

Yeah, they've got their own custom OS... Goobuntu.

Give us a +-0 Counterbalance by itomato · 2010-01-14 09:38 · Score: 2, Interesting

When does black become white?
#CCCCCC or #888888

Is there overlap with Flamebait?

When does an otherwise 'troll' moderation-worthy comment lose out on status that could validate 19 responses, with 50% scoring +2?

Sometimes a troll is a troll, but sometimes it's just a shadow.

Downtime by Joucifer · 2010-01-14 10:01 · Score: 2, Interesting

Is this why Google was down for about 30 minutes today? Did anyone else even experience this or was it a local issue?

Re:Not A Nerd? by Hurricane78 · 2010-01-14 10:05 · Score: 2, Interesting

I tried TagFS. And I found the main problem is that the tagging is way too much work to get to the level of tagging I want.

Also I avoid XFS, since it keeps huge amounts of (log?) data in RAM. So on a power failure, it’s goodbye data.
XFS is for servers with battery backup. Not for normal home computers.

I also tried JFS, and I got corruption with it. So I avoid it too.

I wish I could use ZFS... especially the scrubbing functionality.

--
Any sufficiently advanced intelligence is indistinguishable from stupidity.

Re:Use of commas. by dloose · 2010-01-14 10:10 · Score: 1, Interesting

Who gives a fuck about an Oxford comma?

Re:Not A Nerd? by marcansoft · 2010-01-14 11:51 · Score: 2, Interesting

SSD (NAND Flash) is still a block device. In fact, it's even "more" block, insomuch as it requires a filesystem a lot more aware of blocks, their limitations, and the proper way of using them (wear leveling, error correction, etc). It also uses larger blocks and also addresses groups of blocks for certain operations (erase). You either need a Flash-specific filesystem, or a translation to a more typical block device via a flash translation layer (FTL). Furthermore, I'm not aware of a single NAND Flash device that is accessible as memory mapped storage, nor can you run code from NAND, nor do I know of any CPUs capable of booting from NAND (they tend to have built-in ROM bootloaders to do the job). NOR Flash is another matter, but it's not competitive for SSDs. Going from HDDs to SSDs is hardly anything like going to RAM, except for the "solid state" part.

Re:Not A Nerd? by TheRaven64 · 2010-01-14 12:16 · Score: 2, Interesting

Everything you say is true about Flash, but not about SSDs in general. Flash can be written to one byte at a time, but then it is stuck in that state until it is erased. The circuitry for erasing is bigger than the circuitry for writing, so it is shared among a group of bytes in a cell. These can be any size, but there are trade-offs. The smaller you make them, the more copies of the erase circuit are needed, so the fewer bytes of storage you get per area of die size (and per dollar). The larger you make them, the more you need to erase to modify a single byte. I think most devices use 128KB cells, but I haven't really been paying attention.
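The erase-size trade-off can be made concrete with a small Python model. This is a sketch under stated assumptions, not a description of any real device: the `ToyFlash` class is invented for illustration, erased bytes are assumed to read back as 0xFF, a byte is treated as programmable exactly once per erase, and the sizes are tiny compared to real parts.

```python
ERASED = 0xFF  # assumption: an erased byte reads back as 0xFF


class ToyFlash:
    """Toy flash model: bytes are programmed once, erased only per block."""

    def __init__(self, size, erase_block):
        self.erase_block = erase_block
        self.cells = bytearray([ERASED] * size)
        self.bytes_erased = 0

    def program(self, addr, value):
        """Program one byte; only legal while the cell is still erased."""
        if self.cells[addr] != ERASED:
            raise ValueError("cell already programmed; erase the block first")
        self.cells[addr] = value

    def erase(self, block_no):
        """Erase a whole block, the smallest unit that can be erased."""
        start = block_no * self.erase_block
        self.cells[start:start + self.erase_block] = bytes([ERASED]) * self.erase_block
        self.bytes_erased += self.erase_block

    def modify(self, addr, value):
        """Change one byte by saving, erasing and rewriting its whole block."""
        block_no = addr // self.erase_block
        start = block_no * self.erase_block
        saved = bytearray(self.cells[start:start + self.erase_block])
        saved[addr - start] = value
        self.erase(block_no)
        for offset, b in enumerate(saved):
            if b != ERASED:
                self.program(start + offset, b)


flash = ToyFlash(size=1024, erase_block=256)
flash.program(300, 0x41)
flash.modify(300, 0x42)  # one changed byte costs a 256-byte erase
cost = flash.bytes_erased
```

Doubling `erase_block` doubles the cost of that single-byte change, while halving it would, on real silicon, mean more copies of the erase circuitry.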

Other technologies, such as Magnetic RAM and Phase Change RAM that are starting to hit the market do not have these limitations. The most exciting technology at the moment is Phase Change RAM, which is slightly (about 50%) slower than DRAM, but is non-volatile. You can use it just like RAM, but the contents don't go away when you turn off the power. They're currently at around 64MB, so there's a way to go before they're hard drive replacements, but Flash was at that sort of capacity not long ago.

--
I am TheRaven on Soylent News

Re:Not A Nerd? by smash · 2010-01-14 12:18 · Score: 3, Interesting

You can use ZFS. Just run FreeBSD or opensolaris. The amount of software that runs on Linux but not FreeBSD (particularly if you're talking about open-source) is exceedingly minimal.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Slashdot Mirror
