Ask ReiserFS Project Leader Hans Reiser
Hans Reiser leads a successful Free Software project that has attracted plenty of attention, many users, and even that Holy Grail of so many who have started their own Free or Open Source projects: Big-time funding from DARPA, SuSE, and others. How did he do it? What's his advice for other project leaders? Ask him! And ask him any other question you have in mind. Please stick to one question per post, and avoid questions that can be answered with a few minutes' worth of research. We'll publish Mr. Reiser's answers as soon as he gets them back to us.
Among his customers have been DARPA and BigStorage, which are noted sponsors right on the front page. I think I remember reading that BigStorage is using ReiserFS for some sort of SAN.
My journal has hot
Integrity of data? Er, please read up on ext3 - a journaling filesystem, same as reiserfs, that seems to have the same (or slightly better) filesystem integrity as reiserfs.
The correct answer would be along the lines that reiserfs is better at handling some files than ext3 - especially small files. I have a ton of text files on an 80 gig shared drive - all small files. Since I'm using ext3, a lot of space is being wasted.
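The wasted space described here comes from rounding every file up to a whole number of blocks. A minimal sketch of that arithmetic follows, assuming a 4096-byte block size (ext3 can be formatted with other block sizes) and ignoring inode and other metadata overhead. The function names are made up for illustration.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes consumed on disk when a file can only occupy whole blocks."""
    # Ceiling division with integers: round the block count up.
    blocks = -(-file_size // block_size)
    return blocks * block_size


def wasted_bytes(file_sizes, block_size: int = 4096) -> int:
    """Total slack: allocated space minus the bytes actually stored."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


if __name__ == "__main__":
    # Hypothetical workload: ten thousand 300-byte text files.
    sizes = [300] * 10_000
    stored = sum(sizes)
    slack = wasted_bytes(sizes)
    print(f"stored {stored} bytes, slack {slack} bytes")
```

A filesystem that packs small files or file tails together avoids most of this slack, which is the behaviour the notail mount option mentioned elsewhere in this thread switches off.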
It was significantly faster for me. I downloaded movies and junk off of newsgroups. When I would open one of KDE's windows to start Parring/Unraring it would sit there for quite a while and the hard drive light was on as it was reading the information. When I switched to ReiserFS just to try it, it took significantly less time to load the information. It's a lot faster than ext3 and it's just as secure if not more.
I also like the way it's designed: it's written so that you can put modules in it. Say you want to add encryption support to the filesystem. You can write a module, load it into the filesystem, and encrypt everything written to the drive transparently. Not saying that I know how to do that; that's far beyond my programming skill at the moment. Mostly I like it for the speed that I gained.
find ~your -name '*base* | xargs chown
ReiserFS's main competitor isn't really EXT3. :-)
EXT3 is a journaling addition to EXT2, and much more interesting for people who want to change their existing file systems instead of creating new file systems. Note that EXT3 is slower than both ReiserFS and EXT2, but it does have journaling, and provides faster reboots.
The main competitor for performance is SGI's excellent XFS. The latest implementations are quite solid, and the performance is likewise excellent, even compared to ReiserFS.
Both ReiserFS and XFS suffer from the potential of data loss on system failures, and XFS probably more so than ReiserFS, as tiny files might not be committed at all. However, for RAID users, I cannot see any reason to use ReiserFS instead of XFS, and definitely not EXT3 unless upgrading the file system.
Regards,
--
Arthur Hagen
In fact, I disable access time tracking on every box I work with. I haven't found a worthwhile reason to ever enable it. And that's my 2 cents!
"Reiser4 is due June 30, 2003!"
atime can be quite useful for caches, like client and proxy web caches and man page caches. It's also used for other services that expire data based on access time, like usenet leaf servers, and log rotating programs.
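To make the expiry use case concrete, here is a minimal sketch of how a cache cleaner might pick files by access time. The function name and the one-hour threshold in the usage note are hypothetical and not taken from any particular proxy or news server.

```python
import os
import time


def stale_files(directory, max_idle_seconds, now=None):
    """Return paths of regular files not accessed within max_idle_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            # st_atime is the last access time recorded by the filesystem.
            idle = now - entry.stat(follow_symlinks=False).st_atime
            if idle > max_idle_seconds:
                stale.append(entry.path)
    return sorted(stale)
```

Calling stale_files("/some/cache/dir", 3600) would list everything untouched for an hour. On a partition mounted with noatime the recorded access time stops being updated on reads, so a cleaner like this would treat files that are still in active use as stale.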
Before turning off atime, I advise making an effort to identify what data really needs atime, and if possible creating separate partitions for that data, with atime enabled.
Regards,
--
*Art
atime is necessary for one major component of a lot of websites: The PHP Session files.
The default PHP session handler uses the atime of the files to expire them properly. If they don't have atime, they get expired prematurely. (I think... It's been a while since I made the mistake of noatime on the partition that holds the session files.)
My solution to this is to use noatime everywhere except the /tmp partition. I also use notail on the /tmp partition, and anywhere that has frequent file IO.
Matthew Walker
http://www.tweeterdiet.com/ - My Diet Tracking Tool
SGI's XFS still occasionally hangs my machine under heavy load. Plus, by the time they have a release out for 2.4.20 (they still don't), I'm sure I'll be running 2.4.21. In addition, it's still not part of the standard kernel sources. XFS would have to be considered the least supported choice of the three.
Even though ext3 is a journaling filesystem, it still does a lengthy (and annoying) filesystem check every 20 mounts or so. To its credit it has never found an error, but still. I thought getting rid of that stuff was why we wanted journaling filesystems.
ReiserFS has been rock solid for me, and has been the default Slackware filesystem for two releases. I don't foresee something else replacing it as default any time soon. It's still a bit of a moving target, though... if you're thinking of running a few different kernel versions you may run into situations where your filesystem has features that are too new to be mounted. (In those kinds of cases ext2 is still the safe choice.)
There's also IBM's JFS. The one thing I've noticed about that is that a newly formatted partition won't mount cleanly until you've run fsck.jfs on it. This doesn't inspire great trust, but other than that I've had no problems while testing it.
The tricky thing about solving the hash problem (in cryptography) is finding a value that when hashed matches a given string. Here, we are asking something different: given several hundred thousand keys, what is the probability that any two of them hash to the same value?
The probability is far enough from zero to be a significant danger. Just because hashtables and one-way encryption both use hashing algorithms does not mean that you can use the same figures.
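The "any two of them" question is the birthday problem. The sketch below uses the standard approximation 1 - exp(-n(n-1)/(2N)) for n keys and N possible hash values. It assumes a uniformly distributed hash, and the bit widths in the usage note are illustrative only, not a statement about the hash width ReiserFS actually uses.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys share a hash value."""
    buckets = 2 ** hash_bits
    exponent = num_keys * (num_keys - 1) / (2 * buckets)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        p = collision_probability(300_000, bits)
        print(f"{bits}-bit hash, 300000 keys: P(collision) ~ {p:.3g}")
```

With a few hundred thousand keys, a 32-bit hash is almost certain to produce at least one colliding pair, while a 128-bit hash makes it vanishingly unlikely. That gap is why figures quoted for cryptographic digests do not transfer to short hashtable keys.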
Even though ext3 is a journaling filesystem, it still does a lengthy (and annoying) filesystem check every 20 mounts or so. To its credit it has never found an error, but still. I thought getting rid of that stuff was why we wanted journaling filesystems.
I personally think that the occasional check is probably a good idea, but if it annoys you then you can always change the interval, or even disable it.
Just use "tune2fs -c <how many mounts> /dev/PARTITIONNAME"
-c 0 should cause it to not use that functionality.
Wait until one of your boxes gets r00ted, and you (or some other poor soul dealing with one of your mangled boxes) need to do some fairly in-depth forensic analysis on the box to work out exactly what was happening, to what file, in what order.
The Access Time attribute can yield some useful clues to what was going on during an attack when you are doing a forensic analysis. Sure, there are plenty of other things to look at before you get that deep into things, but it's still useful to have sometimes!
Disclaimer: I meant what I thought, not what I wrote! What? You can't read my Mind? Oh dear!
Just go to http://oss.sgi.com/projects/xfs/patchlist.html and pick up the patch against 2.4.20. Works very well for me. All the releases get you is a bunch of release notes and rpms against RedHat kernels. I always get these patches which come out very promptly after the stock kernel release and work very well.
As a Java developer this is what I am interested in...Java produces very large numbers of small files. Any file system that handles this more efficiently is going to make for faster compilation.
To get something done, a committee should consist of no more than three persons, two of them absent.
ReiserFS doesn't use a cryptographic hash by default.
Get a clue before you post irrelevant (and incorrect) information.
Here's something to try:
This month, I had two disk failures on a 1.0 TB software raid5 with ReiserFS.
I was able to copy most of the data with dd_rescue and myrescue.
By the time I was finished mucking around, I had done mkraid -f several times, so there were spots of missing data on the disk. The filesystem would not mount. So I used reiserfsck --rebuild-tree, and once it completed five days later, I was able to mount the filesystem, with most of the files intact.
Non-Journalled, or Unsure:
Journalled:
Network file-systems
To summarize: We have a horribly large number of filesystems, most of which are incompatible, many of which do not support the Linux security module extensions, one (e2fs) provides defragging under Linux, and none at all provide support for conversions.
Hey, diversity is good! I -like- diversity! I want MORE diversity! I also want ways to efficiently move data around.
Will future versions of ReiserFS include additional userland tools for defrag, fs conversion, scope of logging (eg: none, meta, full), pluggable hashing algorithm, etc?
Ultimately, all the choice in the world is no choice at all if there's no way to make use of those choices.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Gentoo -- and no, I'm not, Gentoo gives you the choice of vanilla sources, gentoo sources, xfs sources, etc.
My journal has hot
Up to 127 filenames per directory (MAX_GENERATION_NUMBER, defined in include/linux/reiserfs.h) can have the same hash value. After this, creating more filenames with this hash is impossible (the EBUSY error code is returned). ReiserFS does NOT blindly overwrite files because of hash collisions.
You can choose from multiple hash algorithms when you create the filesystem (faster hashes have a greater probability of hash collision). But collisions aren't a reason to avoid ReiserFS - most other filesystems (including ext2/ext3) won't get anywhere near a million files in a directory before suffering huge performance losses.
The following code was taken from linux-2.4.21-rc1/fs/reiserfs/namei.c and demonstrates the handling of hash collisions.
    gen_number = find_first_zero_bit ((unsigned long *)bit_string, MAX_GENERATION_NUMBER + 1);
    if (gen_number > MAX_GENERATION_NUMBER) {
        /* there is no free generation number */
        reiserfs_warning ("reiserfs_add_entry: Congratulations! we have got hash function screwed up\n");
        if (buffer != small_buf)
            reiserfs_kfree (buffer, buflen, dir->i_sb);
        pathrelse (&path);
        /*
         * Trivial changes by Alan Cox to remove EHASHCOLLISION for compatibility
         *
         * Trivial Changes:
         * Rights granted to Hans Reiser to redistribute under other terms providing
         * he accepts all liability including but not limited to patent, fitness
         * for purpose, and direct or indirect claims arising from failure to perform.
         *
         * NO WARRANTY
         * This is one of two lines that this fix consist of.
         */
        return -EBUSY;
        /* I think it was better to have an error code with a name that says
           what it means, but I choose not to fight over it. Persons porting to
           other operating systems should consider keeping it as it was
           (return -EHASHCOLLISION;). -Hans */
    }
    /* adjust offset of directory enrty */
    put_deh_offset(deh, SET_GENERATION_NUMBER(deh_offset(deh), gen_number));
    set_cpu_key_k_offset (&entry_key, deh_offset(deh));
    /* update max-hash-collisions counter in reiserfs_sb_info */
    PROC_INFO_MAX( th -> t_super, max_hash_collisions, gen_number );
    if (gen_number != 0) {
        /* we need to re-search for the insertion point */
        if (search_by_entry_key (dir->i_sb, &entry_key, &path, &de) != NAME_NOT_FOUND) {
            reiserfs_warning ("vs-7032: reiserfs_add_entry: "
                              "entry with this key (%K) already exists\n", &entry_key);
            if (buffer != small_buf)
                reiserfs_kfree (buffer, buflen, dir->i_sb);
            pathrelse (&path);
            /* Following line is 2nd line touched by Alan Cox' trivial fix */
            return -EBUSY;
            /* I think it was better to have an error code with a name that says
               what it means, but I choose not to fight over it. Persons porting to
               other operating systems should consider keeping it as it was
               (return -EHASHCOLLISION;). -Hans */
        }
    }
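The generation-number idea in the kernel excerpt can be modelled outside the kernel. The toy below is a sketch, not the ReiserFS implementation. ToyDirectory and HashCollisionError are invented names, the value 127 for MAX_GENERATION_NUMBER is taken from the post above, and the loop follows the quoted check literally by handing out the lowest free generation number from 0 through MAX_GENERATION_NUMBER and failing with EBUSY once none is left.

```python
import errno

MAX_GENERATION_NUMBER = 127  # value as stated in the post; an assumption here


class HashCollisionError(OSError):
    """Stand-in for the kernel code returning -EBUSY."""


class ToyDirectory:
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.entries = {}  # (hash value, generation number) -> file name

    def add_entry(self, name):
        """Store name under the lowest free generation number for its hash."""
        h = self.hash_fn(name)
        # Plays the role of find_first_zero_bit over the generation bitmap.
        for gen in range(MAX_GENERATION_NUMBER + 1):
            if (h, gen) not in self.entries:
                self.entries[(h, gen)] = name
                return gen
        raise HashCollisionError(errno.EBUSY, "no free generation number")


if __name__ == "__main__":
    # Deliberately terrible hash so that every name collides.
    d = ToyDirectory(lambda name: 0)
    try:
        for i in range(1000):
            d.add_entry(f"file{i}")
    except HashCollisionError as exc:
        print(f"stopped after {len(d.entries)} entries: {exc.strerror}")
```

Colliding names coexist under distinct (hash, generation) keys, so nothing is overwritten. The failure mode is an error on create rather than silent data loss.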
Another big reason why a lot of people implement snapshots differently from NetApp is to avoid shooting yourself in the foot. With NetApp, the snapshot data is kept on the same volume as the data itself, which leads to a situation where you jump from, say, 50% usage to 99% overnight (the snapshot area is allowed to run over the data area). This is quite a delicate situation, as deleting files makes things worse (you have to get rid of old snapshots to free up space). I have seen big production databases brought to their knees because of this.
On the other hand, the other implementations are a bit slower because the blocks are copied instead of simply not being deleted, but snapshots never take space from data. The implementer has to make the choice: space control and easily understood space usage versus speed of snapshot and recovery.
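The space trade-off can be shown with a toy block counter. This is a sketch of the argument only and not a model of how any vendor's product works: ToyVolume and its methods are invented for illustration, block contents are ignored, and the "separate area" case simply does not count preserved blocks against the data pool.

```python
class ToyVolume:
    """Counts blocks only; contents do not matter for the space argument."""

    def __init__(self, separate_snapshot_area: bool):
        self.separate = separate_snapshot_area
        self.live = set()       # block ids currently holding file data
        self.frozen = set()     # block ids that existed at snapshot time
        self.preserved = set()  # frozen blocks whose old version must be kept

    def take_snapshot(self):
        self.frozen = set(self.live)
        self.preserved = set()

    def _preserve(self, block):
        # The first change to a frozen block forces its old version to be kept.
        if block in self.frozen:
            self.preserved.add(block)

    def write(self, block):
        self._preserve(block)
        self.live.add(block)

    def delete(self, block):
        self._preserve(block)
        self.live.discard(block)

    def data_pool_used(self):
        """Blocks charged against the space available for live data."""
        if self.separate:
            return len(self.live)
        return len(self.live) + len(self.preserved)


if __name__ == "__main__":
    for separate in (False, True):
        vol = ToyVolume(separate_snapshot_area=separate)
        for b in range(50):
            vol.write(b)
        vol.take_snapshot()
        for b in range(10):
            vol.delete(b)       # deleting snapshotted data
        after_delete = vol.data_pool_used()
        for b in range(10, 50):
            vol.write(b)        # rewriting the rest
        print(separate, after_delete, vol.data_pool_used())
```

In the shared-pool case the deletes free nothing and the rewrites nearly double the usage, which is the overnight jump described above. In the separate-area case the data pool only ever reflects live data, at the cost of copying blocks elsewhere.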
--Reiser4 was planned from the ground up to surpass v3. One of its features is delayed block allocation until write-to-disk, which is expected to make the whole filesystem much more efficient. They will also be making the size of the journal smaller, which should finally enable me to start using Reiserfs on 100-Meg Zip disks. :)
--Hans has said in the past that he believes filesystems should be re-written from scratch every few years, so they can take advantage of algorithm improvements and new concepts. He's making good on his word. The V4 whitepaper ( http://www.namesys.com/v4/v4.html ) is an interesting read, especially if you're into/understand database design.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??