Slashdot Mirror


Rethinking the Nature of Files

An anonymous reader writes "Two recent papers, one from Microsoft Research and one from University of Wisconsin (PDF), are providing a refreshing take on rethinking 'what a file is.' This could have major implications for the next-gen file system design, and will probably cause a stir among Slashdotters, given that it will affect the programmatic interface. The first paper has some hints as to what went wrong with the previous WinFS approach. Quoting the first paper: 'For over 40 years the notion of the file, as devised by pioneers in the field of computing, has proved robust and has remained unchallenged. Yet this concept is not a given, but serves as a boundary object between users and engineers. In the current landscape, this boundary is showing signs of slippage, and we propose the boundary object be reconstituted. New abstractions of file are needed, which reflect what users seek to do with their digital data, and which allow engineers to solve the networking, storage and data management problems that ensue when files move from the PC on to the networked world of today. We suggest that one aspect of this adaptation is to encompass metadata within a file abstraction; another has to do what such a shift would mean for enduring user actions such as "copy" and "delete" applicable to the deriving file types. We finish by arguing that there is an especial need to support the notion of "ownership" that adequately serves both users and engineers as they engage with the world of networked sociality. '"

34 of 369 comments

  1. There is no "issue." *I* own my files and data by elrous0 · · Score: 2, Insightful

    I'm sorry, but MS issuing a paper on the "issues of file ownership" and the cloud sends a little chill up my spine. Makes me think that engineering may not be the only impetus behind their paper. It also makes me wonder if someone isn't looking to take a little more "ownership" of what has traditionally been considered *my* data.

    It's bad enough I'm already forced into "buying" software and media that I can never resell. Now they want my fucking Word files too I guess.

    --
    SJW: Someone who has run out of real oppression, and has to fake it.
    1. Re:There is no "issue." *I* own my files and data by fuzzyfuzzyfungus · · Score: 4, Insightful

      Don't worry, user, of course you own those little files of yours.

      We just want to install some robust Technological Protection Measures to preserve your ownership of those files across all devices and platforms and legal systems aligned with international norms... Totally harmless, nothing to worry about.

    2. Re:There is no "issue." *I* own my files and data by imric · · Score: 2

      Well since the {xxAA} already owns most modern works of art and all performances forever (with the blessings of our government), and companies already own ideas (thoughts), it stands to reason that Microsoft would want to own the results of any actions facilitated by software written by them as well. I mean, how can they continue to expand their market if they don't? Be REASONABLE! I mean, this can get rid of any ambiguity about ownership and remove copyright and patent issues forever! It's simple - "All your files are belong to us"!

      --
      Paranoia is a Survival Trait!
    3. Re:There is no "issue." *I* own my files and data by CharlyFoxtrot · · Score: 3, Insightful

      You should read the article, you are illustrating their point. They talk about how users associate ownership with having a file on a known physical location and how in order for people to feel comfortable with cloud storage the definition of file needs to be redefined in a way that people feel they have ownership over data that exists "out there".

      "[...] ownership is what we are thinking of, when ownership stands as proxy for what used to be knowledge of location and responsibility for that location. What was once a relationship between a user and a physical thing now needs to stand as a relationship between a user and a digital thing. Just what this ownership might be and how it might function in terms of what is specified in this new entity we are thinking of, one that somehow has the properties we have described above and which also allows this new characteristic, we have begun to outline but a beginning is all it is."

      Part of this is the ability to delete their data even after it has been put out there in the wild.

      "A boundary object needs to be developed that can bridge the abstraction of the user and the one of the engineer, who needs to worry about where this thing that keeps growing and changing, and where the locale of storage changes too, such that when a user says ‘delete’, the thing whatever it is and wherever the entities constitutive of it are, are indeed, done away with."

      This is a paper talking about your concerns and how to address them.

      --
      If all else fails, immortality can always be assured by spectacular error.
    4. Re:There is no "issue." *I* own my files and data by elrous0 · · Score: 2, Insightful

      A quote from the conclusion of the article:

      A boundary object needs to be developed that can bridge the abstraction of the user and the one of the engineer, who needs to worry about where this thing that keeps growing and changing, and where the locale of storage changes too, such that when a user says ‘delete’, the thing whatever it is and wherever the entities constitutive of it are, are indeed, done away with.

      I'm sorry, but that sounds a *lot* like DRMing every file to me, with a central service controlling every file (how else could you implement such a system?). The authors even admit as much a few sentences later:

      At first reading one might think they are alluding to digital rights management.

      Of course, they seem to deny that this is DRM. But that's sure what it sounds like to me. And DRM needs some sort of central service to work, which I'm sure MS will be happy to provide of course.

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    5. Re:There is no "issue." *I* own my files and data by Short+Circuit · · Score: 2

      I was amused when I discovered that the Xen hypervisor allows you to emulate a TPM in software. I didn't dig into it enough to find out if there were a way to extract stored data from within the dom0.

      What's that about a secure keystore again?

    6. Re:There is no "issue." *I* own my files and data by elrous0 · · Score: 5, Informative

      No, they're talking about DRM. They try to deny it a few sentences later, but how else would you implement a system where any given file downloaded off the web could be deleted by a central authority at any time?

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    7. Re:There is no "issue." *I* own my files and data by CharlyFoxtrot · · Score: 2

      There's nothing wrong with DRM when it's used to protect my ownership of my files. Would you be opposed to a DRM scheme that would allow you to totally and irrevocably delete a picture you posted to Facebook because it allows you to retain total ownership ? The problem with DRM is when it's used to take away rights you traditionally hold, i.e. when DRM is used to reduce your ownership instead of increasing it.

      --
      If all else fails, immortality can always be assured by spectacular error.
    8. Re:There is no "issue." *I* own my files and data by CharlyFoxtrot · · Score: 2

      But the ability of a user to delete his "cloud" files would be a benefit. DRM is only evil when it gives a third party control over your stuff, not when it gives you control over your own stuff.

      --
      If all else fails, immortality can always be assured by spectacular error.
    9. Re:There is no "issue." *I* own my files and data by elrous0 · · Score: 2

      DRM is only evil when it gives a third party control

      Who do you think is going to be running the central service that administers all this DRM?

      I'll give you a hint. It rhymes with Picrosoft.

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    10. Re:There is no "issue." *I* own my files and data by elrous0 · · Score: 2

      I am nominating your post for the Irony Awards. I think you're a shoo-in this year.

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    11. Re:There is no "issue." *I* own my files and data by digitig · · Score: 3, Informative

      And if instead of a picture it was a music track or a book? And if you charged the customer for access to it? And you could still delete it after they had "bought" it? And how does that look from the other side of the fence? How is your sort of DRM any different from the "bad" sort?

      --
      Quidnam Latine loqui modo coepi?
    12. Re:There is no "issue." *I* own my files and data by StuartHankins · · Score: 3, Insightful

      +1 Insightful. Allowing Microsoft to do this sort of thing would be a horrible mistake. They've shown they can't be trusted too many times. Maybe the kids weren't aware when this stuff started, but I still remember the tricks Microsoft played... and are still playing. Boo on them forever in my book.

      Poetic justice would have Apple purchase Microsoft and break it into divisions.

    13. Re:There is no "issue." *I* own my files and data by StuartHankins · · Score: 2

      Yes I would be opposed. Nothing is 100% secure and having all my files disappear would be unacceptable. My files, my ownership, on my machines. That's how I like it.

    14. Re:There is no "issue." *I* own my files and data by biodata · · Score: 4, Insightful

      The cloud idea likes to project an illusion of it not mattering where the file is, but it is predicated on (more or less) limitless bandwidth with near zero latency, and limitless local storage/cache. If the file you want is not on the local hard disk then it isn't. If your OS needs to fetch it behind the scenes then you need to wait until it arrives. Yes you might think you don't want to know where the file is physically, but when it takes ten minutes to open a file that should take ten seconds, you will probably want to know why (oh, it's in another country and the network is busy because everyone is watching some new TV prog, I see now). Not knowing where the file is just means needing to ask all the time. Is it really better not to know, than just knowing in the first place, and making sure it is where you need it to be? Bandwidth will never be unlimited and latency will never be zero. We are routinely working on 10GB files now where I work, and you always need to know where they are, and to care because however big the pipes are and however big the disk space and the RAM, the data streams grow even faster. The technologies underlying data capture devices obey their own version of Moore's law, frequently with higher multiplicities.

      --
      Korma: Good
    15. Re:There is no "issue." *I* own my files and data by Dripdry · · Score: 2

      Before you know it we'll have to send in Tron to stop the Microsoft Control Program
      (showing my age here)

      --
      -
  2. Ugh by Anrego · · Score: 2

    I couldn’t make it through the first paper. It came across as meandering and very academic. Didn’t try the second.

    Either way, of all the stuff that is currently broken, files are one of the few things that still mostly work. Yes, it would be nice to have more standardization and maybe metadata, but I don’t foresee it happening. And yes, users sometimes get confused, but they generally figure stuff out... and nothing described in the article seemed any more intuitive, and it would probably be just as misunderstood by users.

    We’ll end up with 10 different standards, and no one will bother keeping metadata accurate on all their files. At best metadata is useful for a single person on a small subset of files where they find it useful. Everything else, the only metadata anyone is going to care about (and be bothered to enter) is title, which is served fairly effectively by the file name.

    1. Re:Ugh by fuzzyfuzzyfungus · · Score: 2

      Are you saying that quoting Wittgenstein in a paper that is ostensibly concerned with file structures is pretentious, content-free twaddle?

      Couldn't be...

  3. Auto deleting files... by klubar · · Score: 4, Interesting

    I've always thought it would be useful if you could mark a file as automatically deleting at a certain date. If you create a temporary file, it would be nice to flag it as "delete after 60 days" so it doesn't need attention in the future. (The same functionality would be really useful for emails... I want to save this email until after the event (or whatever it's about) and then have it automatically deleted.) I once saw this file functionality on a custom Cray operating system in 1977.

    1. Re:Auto deleting files... by deniable · · Score: 2

      Lots of people have 'temp' files that don't live in %TEMP%. I had to move *important* data for one of our units a couple of months ago and saw a file 'To do December 2002' or some such. Things like that should have expiry dates.

    2. Re:Auto deleting files... by Khopesh · · Score: 2

      I have coworkers that do this (on Posix systems). They prefix temporary files' names with commas. Then all they need is a daily cron job like this:

      0 4 * * * find "$HOME" -name ',*' -mtime +30 -print0 2>/dev/null | xargs -0 rm -rf

      Voilà!
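
      A variation on the comma trick, as a rough sketch only: put the expiry date in the name itself, so each file carries its own "delete after" marker. The `.until-YYYYMMDD` suffix and the `sweep_expired` function are made-up conventions for illustration, not anything standard, and `-print0` assumes GNU or BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical convention: a name ending in ".until-YYYYMMDD" may be
# deleted once that date has passed.

# sweep_expired DIR [TODAY]
# Delete regular files under DIR whose marked date is earlier than TODAY.
# TODAY defaults to the current date in YYYYMMDD form.
sweep_expired() {
    local dir=$1 today=${2:-$(date +%Y%m%d)} path stamp
    find "$dir" -type f \
        -name '*.until-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' -print0 |
    while IFS= read -r -d '' path; do
        stamp=${path##*.until-}          # the eight digits after the marker
        if [ "$stamp" -lt "$today" ]; then
            rm -f -- "$path"
        fi
    done
}
```

      Run from the same kind of daily cron job, `sweep_expired "$HOME"` would leave unmarked files and not-yet-expired files alone.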

      --
      Use my userscript to add story images to Slashdot. There's no going back.
  4. Hmm where have I seen by OzPeter · · Score: 2

    We suggest that one aspect of this adaptation is to encompass metadata within a file abstraction

    this before? Are resource forks coming back into vogue?

    --
    I am Slashdot. Are you Slashdot as well?
  5. an especial need by blackmesadude · · Score: 2

    really?

  6. Re:I like fuzzy folder structures... by wertarbyte · · Score: 5, Interesting

    # Store the document once under its content hash, then hard-link it into one directory per tag.
    DOCUMENT=~/myschematics.pdf
    SHAID=$(sha512sum "$DOCUMENT" | cut -f1 -d' ')
    mkdir -p heap tags/Schematics tags/Pentagon tags/Operation_Zesty_Lemon
    mv "$DOCUMENT" "heap/$SHAID"

    # Every tag directory entry is another name for the same inode in heap/.
    ln "heap/$SHAID" tags/Pentagon/
    ln "heap/$SHAID" tags/Schematics/
    ln "heap/$SHAID" tags/Operation_Zesty_Lemon/
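
    The same idea can be wrapped into reusable helpers, plus a query for "everything carrying both of these tags". This is only a sketch: `tag_file` and `files_with_tags` are hypothetical names, the layout is assumed to be `heap/` and `tags/` in the current directory, and `sha512sum` is assumed to come from GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a heap/ + tags/ layout in the current directory.

# tag_file FILE TAG...
# Move FILE into heap/ under its SHA-512 and hard-link it into each tags/TAG/.
# Prints the hash so callers can refer to the stored file.
tag_file() {
    local file=$1 id tag
    shift
    id=$(sha512sum "$file" | cut -d' ' -f1) || return 1
    mkdir -p heap
    mv -- "$file" "heap/$id"
    for tag in "$@"; do
        mkdir -p "tags/$tag"
        ln -f -- "heap/$id" "tags/$tag/$id"
    done
    printf '%s\n' "$id"
}

# files_with_tags TAG1 TAG2
# Print the hashes present under both tags (a set intersection on names).
files_with_tags() {
    comm -12 <(ls "tags/$1" | sort) <(ls "tags/$2" | sort)
}
```

    Because the link name is the content hash, intersecting two tags is just a comparison of directory listings. For example, `tag_file ~/myschematics.pdf Schematics Pentagon` followed by `files_with_tags Schematics Pentagon` prints that file's hash.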

    --
    Life is just nature's way of keeping meat fresh.
  7. Re:Are they confusing form with function? by SuricouRaven · · Score: 2

    NTFS supports the same thing, it's just that hardly anyone ever uses it. Including Microsoft.

  8. Oh dear god, please, please, please.... by gestalt_n_pepper · · Score: 3, Insightful

    Do NOT "improve" the file. I'd like to continue to be able to use my computer and other devices.

    --
    Please do not read this sig. Thank you.
  9. POSIX xattrs by Salamander · · Score: 3, Insightful

    Look them up. They already allow you to attach arbitrary metadata to a file. Most modern filesystems and user-level utilities support them already. They're even used as the underpinnings for security mechanisms such as POSIX ACLs and SELinux. Sure, there are issues with performance when you have *lots* of xattrs on a file, and that's a fruitful area of research, but we sure don't need some brand-new Microsoft-invented thing to deal with metadata.
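
    For anyone who has not tried them, a minimal sketch follows. It assumes Linux, the `setfattr` and `getfattr` tools from the attr package, and a filesystem that allows attributes in the `user.` namespace. The wrappers `set_tag` and `get_tag` are hypothetical names, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch assuming Linux, the attr package (setfattr/getfattr), and a
# filesystem that permits user.* extended attributes.

# set_tag FILE KEY VALUE
# Store VALUE in the user.KEY extended attribute of FILE.
set_tag() {
    setfattr -n "user.$2" -v "$3" "$1"
}

# get_tag FILE KEY
# Print the raw value of user.KEY, or fail if the attribute is absent.
get_tag() {
    getfattr --only-values -n "user.$2" "$1" 2>/dev/null
}
```

    After `set_tag notes.txt project zesty_lemon`, calling `get_tag notes.txt project` prints the stored value. Whether the attribute survives a copy or a network transfer depends on the tool used to move the file.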

    --
    Slashdot - News for Herds. Stuff that Splatters.
  10. Re:Also, by Millennium · · Score: 2

    The thing is, those particular file systems also use a different notion of what a file is than what Unix folks are used to. One major example of this is that on these systems, a file can contain multiple streams of data, which HFS+ calls forks and NTFS calls alternate data streams. NTFS doesn't use them much, but Macs used forks heavily in the pre-OSX days (not so much anymore).

    Files-11 and HFS+ also support a notion of files as being containers of discrete data records, rather than streams of bytes. Again, Macs used this concept heavily in the pre-OSX days, mostly when dealing with a file's resource fork, but it's not as common anymore.

  11. Metadata and sharing by Kjella · · Score: 2

    Personally, I've found that the biggest issue with all the "metadata" systems that try to improve on the basic file/folder system is that they don't transfer anywhere. Send the file once through Samba, NFS, email, FTP, rsync or whatever and the metadata is lost. The only systems that actually get used are those that are embedded in the file, like EXIF for JPG, ID3 for MP3 and so on.

    The stupid thing is that we didn't make that a generic part of all file formats, a simple key-value list appended to the file would do. But today that'd break almost everything, plus most things working on the file system would have to know that each file has a data and metadata part. Maybe use a compatibility layer for metadata-unaware applications, where they only see the data part?

    That way we really could have a standard form of metadata. It might not cover every use but it'd sure cover a lot. Copy the file, copy the metadata (if you want, of course). Of course most of these researchers seem to want to get rid of the file altogether and replace it with some sort of cloud service, but I'd rather not. I'd rather know where I have my stuff and be able to put it where I want.
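
    To make the "key-value list appended to the file" idea concrete, here is a rough sketch. The layout is entirely made up for illustration: the data, then `key=value` lines, then a fixed 16-byte footer (`META` plus a 12-digit byte count) so a reader can find the metadata from the end of the file. The three function names are hypothetical, and `head -c` assumes GNU or BSD head.

```shell
#!/usr/bin/env bash
# Hypothetical trailer layout: DATA | key=value lines | "META" + 12-digit length.

# append_meta FILE KEY=VALUE...
# Append the key=value lines and a 16-byte footer recording their byte length.
append_meta() {
    local file=$1 meta len
    shift
    meta=$(printf '%s\n' "$@")$'\n'      # one line per pair, newline-terminated
    len=$(( $(printf '%s' "$meta" | wc -c) ))
    printf '%s' "$meta" >> "$file"
    printf 'META%012d' "$len" >> "$file"
}

# meta_length FILE
# Print the metadata length from the footer, or fail if there is no footer.
meta_length() {
    local footer
    footer=$(tail -c 16 "$1" 2>/dev/null)
    case $footer in
        META[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            echo $(( 10#${footer#META} )) ;;   # 10# avoids octal parsing
        *) return 1 ;;
    esac
}

# read_meta FILE: print only the key=value lines.
read_meta() {
    local len
    len=$(meta_length "$1") || return 1
    tail -c $(( len + 16 )) "$1" | head -c "$len"
}

# data_only FILE: the "compatibility layer" view, with the trailer hidden.
data_only() {
    local len size
    len=$(meta_length "$1") || { cat "$1"; return; }
    size=$(( $(wc -c < "$1") ))
    head -c $(( size - len - 16 )) "$1"
}
```

    A metadata-unaware program would be handed the output of `data_only`, while anything that copies the raw bytes carries the metadata along for free. That is also the weakness: any existing program that opens the file directly sees the trailer as part of the data.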

    --
    Live today, because you never know what tomorrow brings
  12. Yes by biodata · · Score: 2

    Yes I would. If I deliberately transmit a message to someone else, then I have no expectation of being able to 'untransmit' that message. The logic error here is thinking that files are like objects. They are not (only), they are also like messages. Big business wants files to be like objects so they can own them. Everyone knows they can't do it, and this effort will fail like all others, due to the nature of reality. Files are not objects.

    --
    Korma: Good
  13. Buzz barf by LoRdTAW · · Score: 2

    "Quoting the first paper: 'For over 40 years the notion of the file, as devised by pioneers in the field of computing, has proved robust and has remained unchallenged. Yet this concept is not a given, but serves as a boundary object between users and engineers. In the current landscape, this boundary is showing signs of slippage, and we propose the boundary object be reconstituted. New abstractions of file are needed, which reflect what users seek to do with their digital data, and which allow engineers to solve the networking, storage and data management problems that ensue when files move from the PC on to the networked world of today."

    They pretty much peppered the report with bullshit and buzz words to make "meta data" and "internet based storage" sound all new and shiny for the brain dead market droids and managers.

    This reminds me of that MIT operating system hoax that was going to take current file system ideas and throw them out the window. Face it, how else do you organize bits of information? The concept of a file is simple: an organized arrangement of bits that contains data which can be moved, re-sized or deleted. How do you change that? The only thing that can change is the method in which they are stored on physical media (file system) or cataloged and indexed.

    I just want one thing: a file system that is part database for fast file searches. I don't want to manually build indexes or any other bullshit; just look at the file table and give me my fucking file. Even if you had 100,000 files with file names of 256 characters, it's only about 25 MB, how long does that take to parse? Maybe I don't understand file systems but even a 10 MB file table should only take a few seconds to scan. When I do a search of a directory or entire disk with tens of thousands of files it sometimes takes a minute or two. The disk is thrashing away as if the program is looking all over for the file names. Shouldn't they all be in one place pointing to where they are on disk? Maybe I don't understand file systems in general, someone care to explain?

    And one thing that just popped into my mind is a better method to tag and store files. When I download a file or save a document/image/whatever I shouldn't have to dig through a huge directory hierarchy. I should be able to type the name of a directory and something along the lines of Google's auto complete or IntelliSense will begin to auto complete my search, regardless of what volume it's stored on. As I type vacation.. it should list all directories beginning with that string or tag. Maybe I am ignorant of similar functionality for Windows and Linux. The tags and file/directory names should be system wide and accessible to all programs and commands that interact with files, not just a built-in shell.
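
    A crude version of that "one place for all the names" idea can be sketched with a flat index file, which is roughly the approach tools like locate take with a prebuilt database. On many filesystems the names live in per-directory structures rather than one contiguous table, which would explain the seeking during a live search. The function names below are hypothetical, and paths containing newlines are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical flat-file name index: one path per line.

# build_index ROOT INDEXFILE
# Walk ROOT once and record every path in INDEXFILE.
build_index() {
    find "$1" -print > "$2"
}

# search_index INDEXFILE PREFIX
# Print indexed paths whose last component starts with PREFIX, ignoring case.
search_index() {
    awk -v q="$2" '
        BEGIN { q = tolower(q) }
        {
            n = split($0, parts, "/")
            if (index(tolower(parts[n]), q) == 1) print
        }' "$1"
}
```

    The slow directory walk happens once in `build_index`. Each later `search_index idx vacation` is a single sequential read of a text file, at the cost of the index going stale until it is rebuilt.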

  14. Re:I like fuzzy folder structures... by dotancohen · · Score: 2

    That will break as soon as I edit the file with a non-supported application (that doesn't know to update the stored SHA-512 hash). This is why it is important to implement the feature at the filesystem level.

    --
    It is dangerous to be right when the government is wrong.
  15. FILE(1) by tqk · · Score: 2

    I'd like to take this opportunity to point out the brilliance of the "file" command (in *nix). All its smarts, plus all the details mentioned in its manpage, are all I ever needed to know about any file's technical details. This BS from Microsoft is re-inventing the wheel, badly and foolishly, with suspiciously strange priorities. No surprise there.

    The "file(1)" manpage is a great read, including potshots at SysV, BSD, and mention that it (or at least Debian's version) was written by a fellow Canuck (Ian F. Darwin).

    FYI, a point & click interface to manpages:

                  xman -notopbox -bothshown &

    Enjoy the odd behaviour of the Athena Widget Set's scrollbars. :-)

    --
    "Tongue tied and twisted, just an Earth bound misfit ..." -- Pink Floyd.
  16. Universal container formats by Tetsujin · · Score: 2

    Wouldn't it be possible to make a "universal" file container, in that any other file type could be embedded with a text file that listed: what type of file it is, what program it is associated with, owner, creation/mod dates, and especially, tags and other types of metadata?

    Do you know that they tried this already?
    In 1985? (Well, I'm speaking specifically of IFF - but there were other efforts. Mac's file forks were kind of the same sort of thing, except that they maintained the abstraction all the way down to the filesystem layer.)

    Now, just because they tried it already and more or less failed doesn't mean it couldn't work... But they were in a much better position in 1985 to make this work than they are now (we've gone too long and come too far without a "universal format", it'd be nearly impossible to get people to embrace that kind of change now...) so I think it's kind of a lost cause.

    I found it absolutely fascinating, personally, when I read one of the original documents on IFF. The ambition, the hubris perhaps, with which they were trying to guide the future of personal computing. They weren't just seeking to create "a" format, they were aiming for it to be the format. And it would have been capable of just about everything you suggest - embed a FORM of whatever you like in a LIST, put in descriptive chunks, etc... I believe Amiga embraced the concept to a fairly high degree.

    There are various historical and technical reasons why it didn't really pan out. I think one of the big ones is simply that IFF wasn't the right format for everything. Perhaps no one format can be. Among other things, IFF required that a four-byte payload size appear in the header of each chunk. That limits a chunk (and therefore a file) to 4GiB maximum (not such a big deal in 1985 or even 1995... But these days it'd be an unacceptable limitation) - but another problem is that sometimes you need to write out some data and you just don't know how big it's gonna be. Streaming audio and video are a pretty good example. You can discretize the stream, populate it with known-size chunks, but you don't know the size of the whole stream until it ends.
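
    To show why the size has to be known up front, here is a sketch of walking IFF-style chunks: a four-byte ID, a four-byte big-endian size, the payload, and a pad byte when the size is odd. It deliberately walks a flat sequence of chunks and does not descend into FORM or LIST containers. `read_u32be` and `list_chunks` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of walking a flat sequence of IFF-style chunks:
# 4-byte ID, 4-byte big-endian size, payload, pad byte if size is odd.

# read_u32be FILE OFFSET
# Print the big-endian unsigned 32-bit integer stored at OFFSET.
read_u32be() {
    local b
    read -r -a b < <(od -An -v -tu1 -j "$2" -N 4 "$1")
    [ "${#b[@]}" -eq 4 ] || return 1
    echo $(( (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3] ))
}

# list_chunks FILE
# Print "ID SIZE" for each chunk, using each size to find the next header.
list_chunks() {
    local file=$1 total off=0 id size
    total=$(( $(wc -c < "$file") ))
    while [ $(( off + 8 )) -le "$total" ]; do
        id=$(dd if="$file" bs=1 skip="$off" count=4 2>/dev/null)
        size=$(read_u32be "$file" $(( off + 4 ))) || return 1
        printf '%s %s\n' "$id" "$size"
        off=$(( off + 8 + size + (size & 1) ))   # skip payload and pad byte
    done
}
```

    A reader can only skip to the next chunk because the size sits in front of the payload, which is exactly what a writer producing an open-ended stream cannot supply until it is finished.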

    I think general-purpose data formats are a good thing - but I believe it's very important to consider that there may be cases where a particular format just isn't right for the problem. And that brings us back more or less to the current scenario, in which different applications tend to have totally distinct file formats, not even sharing an overall containment structure. From that perspective, it's wasteful to continue re-inventing metadata storage for each new file format that comes along, and wasteful to implement all these different methods of reading metadata out of different application-specific file formats. There's also the danger that we will want to change the format of the data in the metadata fields (just as we shifted from "whatever local variant of ASCII your region uses" to mostly using UTF-8 - which still isn't necessarily adequate for all regions, incidentally). Another all-new text encoding so soon after Unicode's introduction isn't too likely, but the OS, in defining how these metadata fields are defined and used, could change the requirements in ways that go beyond what the container format can provide (for instance, storing data that goes beyond the limit of a particular format's "metadata region" size limit, or storing something that's better encoded in some binary form other than text). Decoupling the encoding of metadata from the definition of file formats eliminates a bunch of redundant work and leaves us more room to change what metadata contains and how those contents are used, as we get a better idea of how, ultimately, it will be used as the dust settles around this whole issue.

    --
    Bow-ties are cool.