Slashdot Mirror


Using Relational Databases as Virtual Filesystems?

Pogie asks: "At my office, we've got what one could only describe as a huge network-attached storage infrastructure. We're talking multiple terabytes of applications, user trees, data files, Sybase and Oracle databases, etc. 'In the beginning' it was a conscious decision to create a shared NFS infrastructure using NetApp Filers (I humbly recommend them over SAN solutions any day...flame on!), but our data center has grown so large, and there are so many interdependencies that we're becoming concerned that if the wrong filer goes down, our production network would be, to say the least, hosed." To combat this problem, Pogie wants to implement his filesystem in a relational database...Oracle to be precise. Read on for his reasoning.

"To conquer our fears we're trying to get a handle on exactly what is where, with the goal of reorganizing the true physical locations of data to minimize the business impact if any single NFS server goes down. At the moment, the plan of attack is to construct a relational Oracle 8.1.6 database on linux which will basically mirror the filesystem in a DB. To accomplish this, I'm writing a horde of scripts using the perl DBI which will poll the entirety of the NFS filesystems on our network and create what basically amounts to a virtual filesystem in the DB which we can then drill into for specific information in much less time than it would take us to search through the actual filesystems in question. In addition, we gain the ability to maintain historical data, which allows us, among other things, to know exactly what went wrong if a luser rm's, mv's, or cp's the wrong thing to the wrong place.

Has anyone tried this before? And is this even a good idea? Does anyone know of existing packages that will do this? I'm really curious what the slashdot community thinks of the idea. I was several hours into this before someone said to me, 'Do you realize you're writing a filesystem in SQL?'"
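
A minimal sketch of the scan-and-record idea described above, written in Python with the standard-library sqlite3 module standing in for Oracle and the Perl DBI (a substitution made here so the example is self-contained, not what the poster used). The table `file_snapshot`, its columns, and the helpers `scan_tree` and `removed_between` are hypothetical names chosen for illustration.

```python
import os
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_snapshot (
    scan_id INTEGER NOT NULL,   -- which polling run saw this entry
    path    TEXT    NOT NULL,   -- full path as seen by the scanner
    size    INTEGER NOT NULL,
    mtime   REAL    NOT NULL,
    uid     INTEGER NOT NULL,
    PRIMARY KEY (scan_id, path)
)
"""

def scan_tree(conn, root, scan_id):
    """Walk `root` and record one row per non-directory entry under `scan_id`."""
    conn.execute(SCHEMA)
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # entry vanished between listing and stat
            rows.append((scan_id, full, st.st_size, st.st_mtime, st.st_uid))
    with conn:  # commit the whole scan as one transaction
        conn.executemany(
            "INSERT INTO file_snapshot VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)

def removed_between(conn, old_scan, new_scan):
    """Paths present in `old_scan` but missing from `new_scan`."""
    cur = conn.execute(
        "SELECT path FROM file_snapshot WHERE scan_id = ? "
        "EXCEPT "
        "SELECT path FROM file_snapshot WHERE scan_id = ? "
        "ORDER BY path",
        (old_scan, new_scan),
    )
    return [row[0] for row in cur]
```

Keeping every scan under its own `scan_id` is what would give the historical view: comparing two scans shows what disappeared in between, though not who removed it or when exactly.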

18 of 52 comments

  1. You're really putting too much effort into this by Mr. Foogle · · Score: 2, Insightful
    Or rather, too much effort in the wrong direction. I can't imagine this would work very well.

    You don't need to stuff your file system data into a database, you need to investigate high availability... ya know, failover systems, redundant everything, etc.

    My biggest objection would be that you're violating the KISS rule, and making life a living hell for whoever follows you.

    Or, maybe I just have a limited imagination.

    --
    Display some adaptability.
    1. Re:You're really putting too much effort into this by DNS-and-BIND · · Score: 2, Funny

      But...it's fun! And he gets to build a linux box and write perl scripts! Isn't that what having a computer job is all about?

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
  2. Oracle iFS by sohp · · Score: 3, Insightful

    Why yes, this has been done, and you can get it from Oracle, under the name Oracle Internet File System, and I've played with it a bit. Interesting concept, not a very robust implementation, but perhaps it's gotten better since I tried it under 8.1.7? It's kind of neat to be able to mount a drive under windows that's really data in an Oracle table.

  3. Why directories? by gnovos · · Score: 2

    Just store every file in a huge table with two columns: Key and File. The key could be a URL, like "file:///usr/local/blah" and the file would be the file. Of course you'll need to text-index the keys so that you can search for "file:///user/*", but since you would only be making a single call (instead of call after call as you recurse down directory tables) it would make for a very simple system that would be fast as fast can be.
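
    A rough sketch of the two-column idea, using Python's sqlite3 as a stand-in for whatever database you'd really use. The names `make_store`, `put` and `list_prefix` are made up for this example, and the range trick assumes no key ever contains the code point U+10FFFF.

```python
import sqlite3

def make_store():
    conn = sqlite3.connect(":memory:")
    # The primary key gives us an index on `key` for free
    conn.execute("CREATE TABLE files (key TEXT PRIMARY KEY, file BLOB NOT NULL)")
    return conn

def put(conn, key, data):
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO files (key, file) VALUES (?, ?)", (key, data)
        )

def list_prefix(conn, prefix):
    """Return all keys that start with `prefix`, in a single query."""
    # Every key starting with `prefix` sorts at or after `prefix` and before
    # `prefix` followed by the largest code point, so this is one index range.
    upper = prefix + "\U0010ffff"
    cur = conn.execute(
        "SELECT key FROM files WHERE key >= ? AND key < ? ORDER BY key",
        (prefix, upper),
    )
    return [row[0] for row in cur]
```

    Listing a whole subtree is then a single call, e.g. `list_prefix(conn, "file:///usr/")`, with no walking of directory tables.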

    --
    "Your superior intellect is no match for our puny weapons!"
  4. Re:HA Linux by sigwinch · · Score: 3, Insightful
    In the end, what extra capabilities is Oracle going to give you that the right sort of filesystem wouldn't?

    1. Consistent backups are trivial. Are there any common filesystems that provide this?
    2. Applications can execute atomic transactions that involve multiple files and directories.
    3. It is easy to keep older versions of files around and do undeletes.
    4. If you keep the data and metadata in separate tables, it is easy to create totally different views of the same files. Dunno if this is useful...
    5. You can do access controls as appropriate to your problem.
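
    To make points 2 and 3 concrete, here is a small sketch in Python with sqlite3 standing in for a real RDBMS. The `files` table and the helpers `make_db`, `write_many` and `read` are hypothetical; the scheme (never overwrite, always append a new version) is just one assumed way of keeping history.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    path    TEXT    NOT NULL,
    version INTEGER NOT NULL,
    content BLOB,               -- NULL content records a deletion
    PRIMARY KEY (path, version)
)
"""

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def write_many(conn, changes):
    """Apply {path: bytes-or-None} as one transaction; None records a deletion."""
    with conn:  # commits on success, rolls back if any statement raises
        for path, content in changes.items():
            (latest,) = conn.execute(
                "SELECT COALESCE(MAX(version), 0) FROM files WHERE path = ?",
                (path,),
            ).fetchone()
            conn.execute(
                "INSERT INTO files (path, version, content) VALUES (?, ?, ?)",
                (path, latest + 1, content),
            )

def read(conn, path, version=None):
    """Content of `path` at `version` (default: latest), or None."""
    if version is None:
        row = conn.execute(
            "SELECT content FROM files WHERE path = ? "
            "ORDER BY version DESC LIMIT 1",
            (path,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT content FROM files WHERE path = ? AND version = ?",
            (path, version),
        ).fetchone()
    return row[0] if row else None
```

    If any write in a batch fails, none of the batch is visible afterwards, and an "undelete" is just reading an earlier version back.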

    I'm not necessarily advocating RDBMS-as-filesystem, but the idea does have some merit.

    They're two very different ideas for data storage (hierarchical vs relational is just for starters).

    Hierarchical is a special case of relational: 1) Each item has a foreign key for its parent directory, or NULL if it's in the root. 2) There is a UNIQUE constraint on foreign key + item name.
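
    The two rules above can be sketched as follows, in Python with sqlite3 as an assumed stand-in database. The table `item` and the helpers `make_tree`, `add` and `full_path` are names invented for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE item (
    id     INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES item(id),  -- NULL means "in the root"
    name   TEXT NOT NULL,
    UNIQUE (parent, name)
);
"""

def make_tree():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn

def add(conn, parent, name):
    """Insert an item under `parent` (None for the root) and return its id."""
    with conn:
        return conn.execute(
            "INSERT INTO item (parent, name) VALUES (?, ?)", (parent, name)
        ).lastrowid

def full_path(conn, item_id):
    """Rebuild an item's path by following parent links up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id, parent, name, depth) AS (
            SELECT id, parent, name, 0 FROM item WHERE id = ?
            UNION ALL
            SELECT i.id, i.parent, i.name, up.depth + 1
            FROM item AS i JOIN up ON i.id = up.parent
        )
        SELECT name FROM up ORDER BY depth DESC
        """,
        (item_id,),
    ).fetchall()
    return "/" + "/".join(name for (name,) in rows)
```

    One caveat worth checking on your own database: SQLite treats NULLs as distinct inside a UNIQUE constraint, so two root-level items with the same name would both be accepted. Giving the root its own row and pointing top-level items at it is one way around that.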

    --
    Kuro5hin.org: where the good times never end. ;-)

  5. Why relational? by cperciva · · Score: 3, Insightful

    IANADE (... Database Expert), but...

    Why are you considering a *relational* database? Unless you're planning on completely changing filesystem semantics I don't see why you wouldn't just use a simple hierarchical database.

    I mean, seriously, you want to have a filesystem which acts like a distributed database; but you don't really need to be able to run RDBMS queries do you? You'll probably end up with a much better result if you work down a checklist and decide which database features you want and which will just add bloat.

  6. The idea dates back to the 60's by one-egg · · Score: 4, Insightful
    Back in the 60's, there was the Michigan Terminal System, out of the University of Michigan. Their filesystem was DB-based. That was before relational became the "in" thing, so it was ISAM.

    It was an interesting idea. I think that the problem they had in MTS will be the same with your idea: not everything fits neatly into the DB model. In fact, some things really have to be shoehorned in.

    The insightful reader will be saying, "But wait! You also have to shoehorn stuff into the conventional FS model." True enough. The question is how much fits naturally and how much has to be shoehorned.

    My contention is that the conventional model is a better fit for most stuff. That's especially (perhaps sadly) true because of legacy software that expects the conventional model. Perhaps a ground-up OS and application implementation would be able to rethink some of those issues and find new insights. But I'm naturally skeptical.

    There is also the issue of performance. I know little about DBs (my loss), but it seems to me that if the FS is stored in an existing relational system, you're going to have to warp some stuff to make it fit. I'd suspect that either you're going to have to make every file be a different table, or you're going to have to store the contents of every file as a variable-length text field. Either option is going to have really nasty effects on the efficiency of the DB, which has been highly optimized under the assumption that each table contains tons of highly homogeneous records.

    I wouldn't want to dive into that kind of can of worms as an "I want to use it in production" project. It might make interesting research on a 5-year horizon, though.

  7. This is how things should really be anyway by cookd · · Score: 2, Interesting

    I see nothing wrong with what you are doing, and it is one heck of a good idea. I really wish I had the time to go in and write some little drivers that would journal the addition/deletion of files and folders to the personal SQL server on my PC, so that when I do my (frequent) searches for files, they would be quick. Locate is nice, but it isn't real-time...

    You'll get good info, and the info is the most important part for doing a good job of reorganizing things. Ad-hoc can be fine when everybody is responsible for their own stuff, but when the whole system is supposed to work cohesively, nothing is as cool as a really well-engineered large system that is bulletproof, and you can't do that without really good planning.

    Maybe off topic...

    I think the whole "files and folders" system is artificial anyway. We had no concept of files at first, until the technology on the mainframes was good enough that we were able to finally put some structure around the data. Then we started categorizing the files, and eventually the device, folders/directories, etc. structure evolved. In addition, we have this server hierarchy, with mount points, etc. It is a lot more complex, and somewhat more capable of organizing our data, but it still isn't even close to how we really think. It was invented because it was a good way to organize that was efficient for the horsepower available at the time. Now we're somewhat entrenched by it.

    Database filesystems are a much more natural way to do things, though. How do you categorize your MP3s? By Hard rock/Soft Rock/Pop/Oldies? By Artist? By title of the song? With folders you have to pick one method. With a database, you can switch anytime.

    Now that the computers are capable of it, we're starting to move in the direction of database filesystems already. MP3 categorizers are coming in quickly, as are filesystem indexers (locate, and MS's indexing server). Handheld devices kind of go by a "database" filesystem as well.

    I envision a filesystem as follows: a flat set of files, each with a serial number (inode number basically). Also, a database that associates each inode with any number of attributes. There are certain pre-defined attributes with globally well-understood meanings, and namespace rules about defining new attributes and personal-use attributes. The attributes can themselves have attributes (more on this later).

    Attributes include file names (as opposed to "filenames," though similar), creation/modification/access dates, owner, comment, file type, keywords. And a file may or may not be assigned a "Default open with" attribute... Or maybe 2 or 3 (in which case a list box would pop up when you double clicked on the file!)

    If the file didn't have a "default open with," what then? Well, it probably has a file type attribute. Say, File Type text/plain. Well, the attribute text/plain has a "default open with" attribute of "GVim.EXE." Well, cool!
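
    Here is roughly what that lookup could look like, as a plain in-memory Python sketch (no real filesystem or database behind it). The class `AttrStore`, its methods, and the convention of keying an attribute's own attributes by a `("file type", value)` tuple are all assumptions made for illustration.

```python
from collections import defaultdict

class AttrStore:
    """Flat set of subjects (inode numbers or attribute values), each with attributes."""

    def __init__(self):
        # subject -> attribute name -> list of values
        self._attrs = defaultdict(lambda: defaultdict(list))

    def tag(self, subject, name, value):
        self._attrs[subject][name].append(value)

    def get(self, subject, name):
        # Read without creating empty entries as a side effect
        return list(self._attrs.get(subject, {}).get(name, []))

    def find(self, name, value):
        """All subjects carrying attribute `name` with `value`, in insertion order."""
        return [s for s, a in self._attrs.items() if value in a.get(name, [])]

    def open_with(self, inode):
        """Programs for an inode: its own 'open with', else its file type's."""
        own = self.get(inode, "open with")
        if own:
            return own
        programs = []
        for file_type in self.get(inode, "file type"):
            # Attributes can have attributes: look up the type's own 'open with'
            programs.extend(self.get(("file type", file_type), "open with"))
        return programs
```

    With `("file type", "text/plain")` tagged as opening with "GVim.EXE", any inode whose file type is text/plain and that has no "open with" of its own falls back to GVim, and `find` gives the "switch your view anytime" queries.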

    This would be nice in a lot of places: source code control would be very simple. Just put a "version" attribute on, and move the "latestversion" tag around. Also, this eliminates the need for multiple copies of a file in a build environment (well, theoretically, you should be able to eliminate this by proper engineering, but we all know that sometimes you miss something and have to copy a file somewhere...) -- you just refer to the same file twice.

    That is my idea of how the database file system should work. Of course, it really messes with our current paradigm, and it introduces some problems (bye bye canonical pathnames for files!) but sometimes I really wish I had a database behind my filesystem, not a hierarchy.

    --
    Time flies like an arrow. Fruit flies like a banana.
  8. Re:Some possible advantages and shortcomings by Ayende Rahien · · Score: 2

    A> Why use a relational database to store hierarchical data? An HDB is *much* faster than a comparable RDBMS, because its architecture is much simpler.
    B> All of the things that you mention could be solved with a good, high-end file system. Forgive me for introducing a non-Unix concept, but I haven't had the chance to work with file systems there. On NTFS, you can have pretty much anything you list here, including triggers (called reparse points) and a change journal; additional properties and notes can be handled via streams.

    Other file systems provide a more complete solution, in that they journal data as well (NTFS journals metadata only), which can help increase your stability.

    All in all, I don't see the reason to go with an RDB instead of a good, high-end file system or an HDB.

    --
    Two witches watched two watches.
    Which witch watched which watch?
  9. Re:Some possible advantages and shortcomings by duffbeer703 · · Score: 3, Insightful

    There are plenty of reasons to want to use an RDBMS.

    If you are working with financial data or health records, there are Federal reporting requirements relating to who accesses what data where & when. Using a central Oracle or other RDBMS makes it easier to keep track of what's up.

    Why Oracle? Maybe the organization has a bunch of PL/SQL gurus. Maybe having Java integrated into the DB is advantageous. Or maybe they have a giant Oracle server sitting around with extra cycles.

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
  10. slower.. by josepha48 · · Score: 2
    Using a database as a filesystem is a bad idea if there is a lot of traffic. Database I/O is generally slower than filesystem access. My former internet company would agree with this, as we tried to keep DB access as small as possible. One solution was to cache tables in memory to make access faster, then write timed updates back to the database.
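
    A bare-bones sketch of that cache-then-timed-update pattern in Python. This is not the code we ran; `WriteBehindCache`, the `load`/`store` callables and the injectable `clock` are all hypothetical, and the timer is only checked when a write comes in.

```python
import time

class WriteBehindCache:
    """Serve reads and writes from memory; flush dirty entries on a timer."""

    def __init__(self, load, store, flush_interval=5.0, clock=time.monotonic):
        self._load = load            # callable: key -> value (reads the database)
        self._store = store          # callable: dict of changed items -> None
        self._interval = flush_interval
        self._clock = clock
        self._cache = {}
        self._dirty = set()
        self._last_flush = clock()

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)  # only a miss touches the database
        return self._cache[key]

    def put(self, key, value):
        self._cache[key] = value
        self._dirty.add(key)
        self._maybe_flush()

    def _maybe_flush(self):
        # Flush only once the interval has elapsed since the last flush
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        if self._dirty:
            self._store({k: self._cache[k] for k in self._dirty})
            self._dirty.clear()
        self._last_flush = self._clock()
```

    The trade-off is the obvious one: anything written since the last flush is lost if the process dies, so this only suits data you can afford to lose a few seconds of.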

    I'd recommend setting up failover servers instead of DB access.

    --
    Only 'flamers' flame!

  11. Interesting idea, *if* your data fits a DB model by tshoppa · · Score: 2
    The way you propose your idea - without any details - it's hard to tell if your data really does fit a DB model well. If your data is just "all the files", it probably doesn't map well.

    It's also not clear what problem you're trying to solve. Is it a problem in administering the large amounts of data? Is it a problem in managing the complex relationships between the data? Is it just a worry about the reliability of the storage media?

    If you've been using part of your filesystem *as* a database (i.e. very large numbers of files in a directory, where the filename is a key) then you may come out ahead by putting the data where it really belonged in the first place, a database!

    It may very well be time to closely scrutinize what all the files you have are and what they contain. Store files with complex interrelationships to each other on the same machine; store other groups with little relations somewhere else. Draw ER diagrams, do it in UML, whatever tool you like. But you won't solve a lack of understanding of all your files by hiding them all behind a database layer of abstraction, you'll just be brushing everything under the rug.

  12. CVS by DrZaius · · Score: 2, Interesting
    Why not use CVS?

    We don't use Filers where I work, but just the local file system. Our CVS root looks something like this:

    systems/$hostname
    systems/$hostname/home
    systems/$hostname/etc
    systems/$hostname/var

    and so forth. You can then add files as you want -- if you don't want to back up everything in /etc as it was created automagically, then don't.

    This setup uses a very mature piece of software. It has a lot of nice interfaces (cvsweb, wincvs). It also gives your users the ability to pull their own backups and branch their home directories :)

    It works really well, as you can back up the central repository at whatever frequency you want.

    It may take a lot of storage to back it all up, but it will probably be smaller than an Oracle database and you will get diffs for your history.

    --
    -- DrZaius - Minister of Sciences and Protector of the Faith
  13. Re:sounds crazy by rfreynol · · Score: 2, Informative

    Actually, if he used Oracle iFS (Internet File System) the files could reside in the DB (security and the ability to get good, clean backups) while being mountable via many protocols (NFS, Samba, HTTP, FTP, etc.)

    I often hit files with grep/sed/awk from a Unix box that has an iFS filesystem mounted via NFS.

    The solution to his problem is iFS, and he probably already has a license for it.

  14. Reiser FS by CaraCalla · · Score: 2, Interesting

    The guy who wrote Reiser FS has quite a few things to say on this very subject. What he says (in short) is:

    .) Databases shouldn't exist. They exist only because System-Programmers didn't listen to Application-Programmer's needs.

    .) Those Application-Programmers implemented their own solution to the problem: Databases.

    That's why he started Reiser FS. It is today the ideal filesystem to store little bits of information (100 bytes) which you normally store in databases, because you don't want millions of small files lying around.

    (I know this is a simplified view, read his own arguments for a broader discussion.)

  15. Database storage -- and the real battle: the users by webwench_72 · · Score: 2, Informative

    There's a product I've been working with for 3 years which does something similar to what you are asking for:

    OpenText Livelink

    They have a number of other products that probably have similar underpinnings.

    Basically, a database (can be Oracle, Sybase, or SQL Server) keeps meta-information on items stored in the database. Items can be documents, folders, URL links, tasks, discussion topics and replies, etc. Meta-information includes dates of creation, updating, deletion (i.e. the makings of an audit trail), whether the document is checked out for editing and if so, by whom (the makings of a simple source-control system), who can see or edit the document (implies that a table of users is also maintained, which it is), etc.

    Livelink provides a server (basically a big CGI app) that you can run through a web server, allowing this stuff to be navigated and maintained through the web. The web pages are customizable. Really, the whole thing is customizable, which means you can write all kinds of little apps and processes above and beyond what is supplied by Livelink (our most common examples are scanning apps, that scan and store new documents in one step, and document expiration processes, which force certain documents to be read and revised every 6 months or whatever). There's also an API for VB, C++, and Java, to allow access methods other than the web.

    Depending on the number and size of documents you're storing, documents can be kept in the database itself, or can be kept in a filesystem, with pointers to those documents stored in the database. The second option is usually preferred because the first option will cause trouble when it comes time to back up or restore from backup, or to migrate data, etc.

    The biggest disadvantage from a user's point of view is the need to log in, if you plan to keep any semblance of an ACL, source control, or auditing. You could provide one common login for read-only access to most of your files, which would ease the pain a bit. Or you could 'roll your own' solution, based on some of the premises used by this type of system.

    I'd like to add that I think the technology to use is the least of your issues. The biggest issue will be in finding and categorizing all of the content that's already out there. When you find 4 different variations of the same document, or 3 different builds of the same source code, or tables for 3 different apps in one database schema, and it's not clear what is what, how will you know who to contact? Who among your users has the extra headcount to spare to give you detailed info on all their files and databases, etc? How much stuff is out there, that was owned by people who have left or been laid-off, who no one else can provide info on? That's gonna be your real battle.

  16. Re:HA Linux by Tet · · Score: 2
    Consistent backups are trivial. Are there any common filesystems that provide this?

    VxFS does, for a start. My first port of call would be to check whether the freevxfs filesystem in Linux supports the point-in-time copies of the real thing. Or wait for Veritas's official release (which is purely a marketing issue -- it's been running on Linux for a long time inside Veritas). Other than that, maybe check if XFS or JFS support similar features.

    --
    "The invisible and the non-existent look very much alike." -- Delos B. McKown
  17. LDAP? GFS? by runswithd6s · · Score: 2

    If I had my choice on what to implement for the ultimate network distributed filesystem, I would concentrate on LDAP, GFS, and Kerberos. LDAP, by its very nature, was designed to be a distributed, redundant resource locator and data repository. It can be back-ended by any number of engines, including your more popular RDBMSs. It may seem a bit overwhelming, but it is well worth the investment in time and energy. Check out the OpenLDAP site for more information.

    The second issue you're trying to address is data redundancy and failover. You want a high-availability solution. Look into using the Linux Global Filesystem (GFS). In a nutshell, it's a clustered journaling filesystem whose participants are equally responsible for the data on disc. If one of the servers in the cluster goes down, the first server to see it plays back the unfinished journal of the downed server, and the whole cluster continues on its merry old way.

    So, it would be one GFS+LDAP cluster with multiple 1U Fibre Channel servers attached to a Fibre Channel disc array. Tack on a gigabit Ethernet backbone, and you've got a winner.

    --
    assert(expired(knowledge)); /* core dump */