PVFS2 - a High-Performance Parallel File System

Posted by timothy on Tuesday November 9, 2004 @01:49PM from the good-nodes-are-still-available dept.

neillm78 writes "As part of the development team, we're announcing PVFS2 version 1.0 here in Pittsburgh at the SC2004 conference! PVFS2 is a GPL/LGPL based parallel file system for cluster-based applications. It logically groups any number of storage servers into a coherent file system for use by client nodes, specifically tailored to handle efficient access to large shared files. PVFS2 supports access via an MPI-IO interface for high-performance parallel applications, but you can still mount it like a regular GNU/Linux file system for traditional serial applications and management. The PVFS2 project is conducted jointly between The Parallel Architecture Research Laboratory at Clemson University and The Mathematics and Computer Science Division at Argonne National Laboratory. Please feel free to give it a try!"

26 comments

Been following it for a while... by brsmith4 · 2004-11-09 14:08 · Score: 2, Informative

PVFS (in its first incarnation), despite some instability (more so due to the fact that our first cluster was COTS cheap-o hardware), really helped drive down the load on our clusters by removing the need to perform NFS writes to a single head node for scratch space. The setup is extremely simple and the code base was really small.

I plan on evaluating PVFS2 for our new clusters along with Lustre and GFS although I have heard nothing about the latter two operating over the MPI-ROMIO subsystem (which would definitely offer a performance increase).
1. Re:Been following it for a while... by superpulpsicle · 2004-11-09 15:07 · Score: 1
  
  This sounds like a very advanced version of XFS?! Are they saying MANY people can rm and cp and write to the same exact point in the filesystem simultaneously? Looking at the specs, I am struggling to see what's special.
2. Re:Been following it for a while... by brsmith4 · 2004-11-09 15:56 · Score: 4, Informative
  
  It's a parallel file system, not a drop-in replacement for local FS's like XFS or ext3. It runs across multiple hosts, striping the data on each host. Also, having multiple I/O hosts in the array helps to distribute the read/write across multiple nodes, thus reducing the overhead for those operations.
  
  Think of it as "Distributed NFS". That description does it a huge injustice, but it should help to get the point across.
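
As a rough illustration of the striping described above, the sketch below maps a byte range of one logical file onto several I/O servers in round-robin fashion. It is only a sketch under stated assumptions: the function name `stripe_map`, the 64 KiB default stripe size, and the plain round-robin layout are chosen for illustration and are not taken from the PVFS2 sources.

```python
from typing import List, Tuple


def stripe_map(offset: int, length: int, num_servers: int,
               stripe_size: int = 65536) -> List[Tuple[int, int, int]]:
    """Split a logical byte range into per-server pieces.

    Returns a list of (server_index, server_local_offset, chunk_length)
    tuples under simple round-robin striping. Hypothetical helper, not
    PVFS2's real distribution code.
    """
    if num_servers <= 0 or stripe_size <= 0:
        raise ValueError("num_servers and stripe_size must be positive")
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")

    pieces = []
    end = offset + length
    while offset < end:
        stripe_no = offset // stripe_size          # which stripe of the logical file
        within = offset % stripe_size              # position inside that stripe
        server = stripe_no % num_servers           # round-robin server choice
        # Each server stores every num_servers-th stripe back to back.
        local_offset = (stripe_no // num_servers) * stripe_size + within
        chunk = min(stripe_size - within, end - offset)
        pieces.append((server, local_offset, chunk))
        offset += chunk
    return pieces
```

With two servers and a 4-byte stripe, `stripe_map(0, 10, 2, 4)` yields `[(0, 0, 4), (1, 0, 4), (0, 4, 2)]`: the first and third pieces land on server 0, the second on server 1, so a large read or write is spread across both hosts instead of hitting a single head node.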
3. Re:Been following it for a while... by mikefe · 2004-11-09 19:29 · Score: 2, Interesting
  
  So it works over the network without needing a network block device layer?
  
  That would mean it should compete on the level of OpenAFS, Intermezzo and CODA for fault tolerant network filesystems -- except it would have internode locking which the others don't at the moment.
  
  That would also mean it doesn't directly compete at the same level as GFS (which is targeted at configurations of servers connected by a SAN or similar).
  
  Is this project set on integrating with the mainline kernel? What has/will happen on that front?
  
  This also looks perfect for an active/active LinuxHA failover cluster -- if it has redundancy, which any clustering filesystem should have. Right now the LinuxHA project is integrating GFS into their stack of interwoven sub-projects.
  
  After looking at the site, it looks like it would be good for server to server connections, and not good for server to workstation connections. For instance, it doesn't look like it has any caching functionality like OpenAFS does and it looks like each node needs to have a copy of some of the cluster data (or does that end at the meta-data nodes?). PVFS2 looks like it has a similar architecture to Lustre, except PVFS2 is developed openly.
  
  --
  There: Something at a specific location.
  Their: Owned by someone.
  Please make sure your english compiles.
4. Re:Been following it for a while... by rizzy · 2004-11-10 02:47 · Score: 2, Interesting
  
  That would mean it should compete on the level of OpenAFS, Intermezzo and CODA for fault tolerant network filesystems -- except it would have internode locking which the others don't at the moment.
  
  That's an interesting thought, but at no time have we ever thought of ourselves as a replacement for those file systems. The ones you mention are general purpose file systems whereas PVFS2 is meant to be a fast file system for parallel applications.
  
  except it would have internode locking which the others don't at the moment.
  
  I'm not sure what you mean here. We have no locking anywhere -- which is exactly why we can deliver such high performance. Scientific applications often don't need a locking subsystem getting in their way.
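
The lock-free approach described above works when each writer touches its own byte range of the shared file. Below is a minimal sketch of that access pattern on an ordinary local file, using positional writes via Python's `os.pwrite` (Unix only). This is an assumption made for illustration: real PVFS2 applications would typically go through MPI-IO, the writers here run one after another rather than on separate nodes, and the function name `write_disjoint_blocks` is hypothetical.

```python
import os
import tempfile


def write_disjoint_blocks(path: str, num_writers: int, block_size: int) -> None:
    """Simulate several writers, each filling its own block of one file.

    Writer `rank` owns bytes [rank * block_size, (rank + 1) * block_size).
    Because the regions never overlap, no lock is needed and the order of
    the writes does not affect the result.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Deliberately write in reverse order to show order independence.
        for rank in reversed(range(num_writers)):
            payload = bytes([ord("A") + rank % 26]) * block_size
            # Positional write: no shared file pointer is involved.
            os.pwrite(fd, payload, rank * block_size)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "shared.dat")
        write_disjoint_blocks(shared, num_writers=4, block_size=8)
        with open(shared, "rb") as f:
            print(f.read())
```

If two writers did target overlapping ranges, nothing in this pattern would arbitrate between them; that coordination is left to the application.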
  
  Is this project set on integrating with the mainline kernel? What has/will happen on that front?
  
  There really isn't much for us *to* integrate into the kernel. We do have a VFS interface, but it acts primarily as a way to convert kernel system calls into userspace PVFS2 calls. Yes, there are lots of "file system in userspace" projects, but by making something that works just for PVFS2, we can get better performance.
  
  This also looks perfect for an active/active LinuxHA failover cluster -- if it has redundancy, which any clustering filesystem should have. Right now the LinuxHA project is integrating GFS into their stack of interwoven sub-projects.
  
  Funny you should mention LinuxHA. I spent some time this summer setting it up with PVFS2. If you really care about redundancy, you can invest in shared storage solutions (SCSI and firewire drives can be shared between two hosts simultaneously -- if you buy the really expensive stuff). With shared storage, you've got a way to tolerate node failure. You're still screwed if something eats your big expensive hard drive, granted. We're working on software replication.
  
  PVFS2 looks like it has a similar architecture to Lustre, except PVFS2 is developed openly.
  
  Thanks for noticing! While I understand why CFS has taken the approach they have, we really feel that the HPC community (and Linux in general) needs a file system that's free software.
5. Re:Been following it for a while... by mikefe · 2004-11-10 08:19 · Score: 1
  
  "That's an interesting thought, but at no time have we ever thought of ourselves as a replacement for those file systems. The ones you mention are general purpose file systems whereas PVFS2 is meant to be a fast file system for parallel applications."
  
  Yes, I realize that now. Everything except for the last paragraph of my post was speculation, and the last paragraph, written after reviewing the web site a bit, was there to correct those speculations.
  
  "I'm not sure what you mean here. We have no locking anywhere -- which is exactly why we can deliver such high performance. Scientific applications often don't need a locking subsystem getting in their way."
  
  File server clusters would need inter-node fcntl locking, which I presumed was offered by PVFS2. This is one thing GFS does, so it is good to know they are both useful in different types of clusters.
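
For context, the fcntl locking mentioned above is the POSIX advisory locking interface. The sketch below shows it on a local file through Python's standard `fcntl` module (Unix only). Whether such a lock is honored between nodes depends entirely on the file system underneath; per the replies in this thread, PVFS2 has no locking subsystem. The helper name `locked_append` is hypothetical.

```python
import fcntl
import os
import tempfile


def locked_append(path: str, data: bytes) -> int:
    """Append `data` while holding an exclusive advisory lock.

    Returns the file size after the write. Cooperating processes that
    also call lockf() on the same file will wait their turn; processes
    that ignore the lock are not stopped (the lock is advisory).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)      # blocks until the lock is granted
        try:
            os.write(fd, data)
            return os.fstat(fd).st_size
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)  # release before closing
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "log.txt")
        print(locked_append(log, b"first\n"))
        print(locked_append(log, b"second\n"))
```

A file server cluster needs every node to agree on who holds such a lock, which is the coordination cost a lock-free design avoids.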
  
  "There really isn't much for us *to* integrate into the kernel. We do have a VFS interface, but it acts primarily as a way to convert kernel system calls into userspace PVFS2 calls. Yes, there are lots of "file system in userspace" projects, but by making something that works just for PVFS2, we can get better performance."
  
  Small kernel drivers are definitely a good sign. I'm just wondering if that driver is being split into small incremental patches for mainline integration review and if there are any plans or progress made on this.
  
  "Funny you should mention LinuxHA. I spent some time this summer setting it up with PVFS2. If you really care about redundancy, you can invest in shared storage solutions (SCSI and firewire drives can be shared between two hosts simultaneously -- if you buy the really expensive stuff). With shared storage, you've got a way to tolerate node failure. You're still screwed if something eats your big expensive hard drive, granted. We're working on software replication."
  
  What advantages does PVFS2 have over GFS then? The replication (which I presumed was there) looked like the most likely advantage over GFS, but I don't know enough about PVFS2 to say.
  
  "Thanks for noticing! While I understand why CFS has taken the approach they have, we really feel that the HPC community (and Linux in general) needs a file system that's free software."
  
  I think that over time companies will understand that in order to work on infrastructure in FLOSS systems they will have to have open development practices also.
  
  --
  There: Something at a specific location.
  Their: Owned by someone.
  Please make sure your english compiles.
It's Linux! by egarland · 2004-11-09 14:21 · Score: 2, Insightful

The kernel is called Linux. Yeah, you may compile with GCC, but come on, people! It's a Linux-specific kernel module. Leave the GNU/ out of it.

That said, Nice job! I love to see the capabilities of Linux expanded in new directions like this. Cool work. I wish I had time to work on cool projects like that.

--
set softtabstop=4 shiftwidth=4 expandtab nocp worlddomination
1. Re:It's Linux! by neillm78 · 2004-11-09 17:18 · Score: 1, Troll
  
  hello friend. i'm a developer on the project and i firmly stand by calling a distribution "GNU/Linux" rather than Linux. the kernel is another story, but then again, we run on clusters based on distros -- not on kernels. cheers!
  
  -Neill;
2. Re:It's Linux! by kelnos · 2004-11-09 17:39 · Score: 1
  
  Right, but I think the parent's point is that the filesystem isn't GNU/Linux-specific; it's a Linux kernel module. Calling distros GNU/Linux is fine; while I don't share your insistence in calling them that (mainly due to laziness), that is correct in that that's what they are. But calling a Linux filesystem driver "GNU/Linux" is incorrect.
  
  Sorry. I'm just a nitpicky, pedantic bastard.
  
  Having said that, I skimmed through the info on the project website, and it looks like some interesting stuff. At the very least, NFS is old and nasty and performs terribly. It's nice to see a new solution (albeit one not really intended for standard network shares).
  
  --
  Xfce: Lighter than some, heavier than others. Just right.
3. Re:It's Linux! by Anonymous Coward · 2004-11-09 18:11 · Score: 0
  
  Linux is a kernel. But linux is NOT an operating system. In the context that he referenced it (ie a network file system style) it's absolutely correct to use GNU/Linux.
  
  Now if he was actually referring to the kernel module, then yes, he should of used just Linux because it's a linux kernel. But in that sentence he didn't mention anything about modules.
4. Re:It's Linux! by Anonymous Coward · 2004-11-10 01:58 · Score: 0
  
  he should of used
  
  "should have used".
5. Re:It's Linux! by Anonymous Coward · 2004-11-12 20:29 · Score: 0
  
  ...and the contraction that sounds like "should of" is properly written "should've". Learn it by heart, and the spelling nazis will love you.
I hope the meta-data performance improved... by Ayanami+Rei · 2004-11-09 14:29 · Score: 2, Interesting

I found that gigabit NFS was usually much faster with files smaller than 1MB. I guess because either way, you still had to go through one server to set up each FS operation. NFS had been around longer; the Sun implementation was hard to beat.

Has the meta-data server been sped up at all, or made distributed with some kind of coherency-syncro backend?

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
1. Re:I hope the meta-data performance improved... by alecthomas · 2004-11-09 16:25 · Score: 2, Informative
  
  Has the meta-data server been sped up at all, or made distributed with some kind of coherency-syncro backend?
  
  From the PVFS2 Guide:
  
  The new design has a number of important features, including:
  
  * modular networking and storage subsystems,
  * powerful request format for structured non-contiguous accesses,
  * flexible and extensible data distribution modules,
  * distributed metadata,
  * stateless servers and clients (no locking subsystem),
  * explicit concurrency support,
  * tunable semantics,
  * flexible mapping from file references to servers,
  * tight MPI-IO integration, and
  * support for data and metadata redundancy.
2. Re:I hope the meta-data performance improved... by rizzy · 2004-11-09 16:53 · Score: 3, Informative
  
  > * flexible and extensible data distribution modules,
  > * distributed metadata,
  > * stateless servers and clients (no locking subsystem),
  
  Just to clarify... while we have distributed metadata, we don't have *replicated* metadata. At least, not yet.
  
  If you have multiple metadata servers they will do load balancing. If you are working with lots and lots of small files, having a couple of metadata servers might alleviate a possible bottleneck.
Parallel Architecture Research Lab by St.+Arbirix · 2004-11-09 14:56 · Score: 1

This is exciting and all, but the really importing thing about PARL is that they were the only ones at Clemson willing to host our site.
</SELF-PLUG>

--
Direct away from face when opening.
1. Re:Parallel Architecture Research Lab by Anonymous Coward · 2004-11-10 01:42 · Score: 0
  
  the really importing thing
  
  "important".
I know this is for large clusters..... by zogger · 2004-11-09 15:10 · Score: 1

..but..uhh..not to appear too lame because I probably don't understand it.. but...could this be used in conjunction with something like bittorrent so that big files like ISOs or whatnot could be shared easier cross platform? Do you understand what I am asking? An Esperanto for computers with large numbers of people working all over?
1. Re:I know this is for large clusters..... by brsmith4 · 2004-11-09 15:59 · Score: 3, Informative
  
  Simple answer: No. This is commonly used for allocated scratch space in cluster environments, e.g. Beowulf. We use it to reduce the reads and writes that usually bring an NFS system to its knees. It would not help Bittorrent.
Oh. Neato. Well, I could have looked... heh... by Ayanami+Rei · 2004-11-09 20:28 · Score: 1

Sorry.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Redundancy ? by Bothari · 2004-11-09 23:14 · Score: 1

I've been skimming the documentation for this.
Does anyone use this for big, transparent file storage networks?
I've been looking for something better than "a bunch of nfs servers with some code to redirect each client to his storage". That is a pain to manage and has lots-'n-lots of points of failure...

I've noticed that the metadata is not in a single node anymore, but it's not replicated yet either. I could live with this reliability problem if it could give me the transparency to just add a server when needed and not worry about wasted space in old servers...
1. Re:Redundancy ? by rizzy · 2004-11-10 02:25 · Score: 1
  
  We don't encourage anyone to rely on PVFS2 to host the sole copy of their data. So it might not be the best idea to use PVFS2 as a "transparent storage network".
  
  PVFS2's real sweet spot is for scratch space for scientific applications -- writing out checkpoints, reading in datasets.
  
  I don't know if I'd call what PVFS2 has a "reliability problem". If you've got money, hardware-based failover solutions exist today and work well with PVFS2 (think heartbeat). In the not-so-distant future we've got people working on software based replication of data, but no matter how you slice it, there's going to be a performance hit. The trick is to find a way to replicate data while hiding the signs of that extra work from the clients. A final solution is a little ways away, but we're pretty confident we can eventually implement a good replication method.
  
  Your other item -- about wanting to add servers as needed -- is something we've heard from a lot of people. We think we can make that happen without a ton of effort, but it didn't make the 1.0 cutoff.
  
  Thanks for the feedback
2. Re:Redundancy ? by REggert · 2004-11-10 04:09 · Score: 2, Interesting
  
  I use Andrew File System (specifically, http://www.openafs.org/) for my files, since I was used to using it at school, and I'm fond of its access control system. It allows you to designate redundant sites for your volumes for backup or load balancing purposes. However, its major downside is that it's optimized for reads but not for writes (PVFS would probably work better if you need optimal write performance), and it can be a real bitch to set up for the first time. I've also yet to figure out how to get it to work through my NAT, though it's supposed to be possible. It beats the hell out of NFS (v2, at least, I haven't really taken a look at NFS v3) in terms of reliability, security, and scalability, though.
  
  --
  cp /dev/zero ~/signature.txt
3. Re:Redundancy ? by philci52 · 2004-11-10 06:48 · Score: 1
  
  You might want to check out ZFS when solaris 10 comes out. http://www.sun.com/2004-0914/feature/
Missing feature: Undeletion facilities by Anonymous Coward · 2004-11-10 05:10 · Score: 0

Am I the only one who thinks they overlooked this in the design? After all, accidents happen, and you can't possibly expect everyone to have backup copies of everything. The recycle bin idea simply does not work, because it does not preserve the directory structure of the files prior to deletion, and it can actually make the system less secure by having to delete everything twice.
1. Re:Missing feature: Undeletion facilities by RAMMS+EIN · 2004-11-13 10:01 · Score: 1
  
  ``The recycle bin idea simply does not work, because it does not preserve the directory structure of the files prior to deletion, and it can actually make the system less secure by having to delete everything twice.''
  
  There's nothing about the recycle bin idea that makes directory structures disappear, except that that's how it's implemented on some systems.
  
  Supporting undelete in your filesystem can be a huge pain, and stop you from doing many more interesting and useful things.
  
  And yes, accidents do happen. That's why we have backups. In fact, recycle bins are much like backups, except that they usually back up things that you don't want to keep.
  
  --
  Please correct me if I got my facts wrong.