Rethinking the Nature of Files
An anonymous reader writes "Two recent papers, one from Microsoft Research and one from University of Wisconsin (PDF), are providing a refreshing take on rethinking 'what a file is.' This could have major implications for the next-gen file system design, and will probably cause a stir among Slashdotters, given that it will affect the programmatic interface. The first paper has some hints as to what went wrong with the previous WinFS approach. Quoting the first paper: 'For over 40 years the notion of the file, as devised by pioneers in the field of computing, has proved robust and has remained unchallenged. Yet this concept is not a given, but serves as a boundary object between users and engineers. In the current landscape, this boundary is showing signs of slippage, and we propose the boundary object be reconstituted. New abstractions of file are needed, which reflect what users seek to do with their digital data, and which allow engineers to solve the networking, storage and data management problems that ensue when files move from the PC on to the networked world of today. We suggest that one aspect of this adaptation is to encompass metadata within a file abstraction; another has to do with what such a shift would mean for enduring user actions such as "copy" and "delete" applicable to the deriving file types. We finish by arguing that there is an especial need to support the notion of "ownership" that adequately serves both users and engineers as they engage with the world of networked sociality.'"
I'm sorry, but MS issuing a paper on the "issues of file ownership" and the cloud sends a little chill up my spine. Makes me think that engineering may not be the only impetus behind their paper. It also makes me wonder if someone isn't looking to take a little more "ownership" of what has traditionally been considered *my* data.
It's bad enough I'm already forced into "buying" software and media that I can never resell. Now they want my fucking Word files too I guess.
SJW: Someone who has run out of real oppression, and has to fake it.
I couldn’t make it through the first paper. It came across as meandering and very academic. Didn’t try the second.
Either way, of all the stuff that is currently broken, files are one of the few things that still mostly work. Yes, it would be nice to have more standardization and maybe metadata, but I don’t foresee it happening. And yes, users sometimes get confused, but they generally figure stuff out, and nothing described in the article seemed any more intuitive; it would probably be just as misunderstood by users.
We’ll end up with 10 different standards, and no one will bother keeping metadata accurate on all their files. At best metadata is useful for a single person on a small subset of files where they find it useful. Everything else, the only metadata anyone is going to care about (and be bothered to enter) is title, which is served fairly effectively by the file name.
I've always thought it would be useful if you could mark a file as automatically deleting at a certain date. If you create a temporary file, it would be nice to flag it as "delete after 60 days" so it doesn't need attention in the future. (The same functionality would be really useful for emails... I want to save this email until after the event (or whatever it's about) and then have it automatically deleted.) I once saw this file functionality on a custom Cray operating system in 1977.
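Without filesystem support, the "delete after 60 days" idea can be approximated with a periodic sweep. The following is only a minimal sketch, assuming GNU find and touch; the scratch directory and file names are made up for the demo, and it keys on modification time rather than on a per-file expiry flag.

```shell
#!/bin/sh
# Sketch: sweep files older than 60 days out of a scratch directory.
# Assumes GNU find/touch; SCRATCH and the file names are hypothetical.
SCRATCH=$(mktemp -d)

touch "$SCRATCH/fresh.tmp"                    # modified just now
touch -d '90 days ago' "$SCRATCH/stale.tmp"   # pretend it is 90 days old

# -mtime +60 matches files last modified more than 60 days ago.
find "$SCRATCH" -type f -mtime +60 -delete

ls "$SCRATCH"
```

Only fresh.tmp should be left afterwards. Run from cron against a real temp directory, this gets most of the way there, though a true per-file expiry date would still need somewhere to live.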
We suggest that one aspect of this adaptation is to encompass metadata within a file abstraction
this before? Are resource forks coming back into vogue?
I am Slashdot. Are you Slashdot as well?
really?
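# Store the document under its content hash, then "tag" it with hard links.
DOCUMENT=~/myschematics.pdf
SHAID=$(sha512sum "$DOCUMENT" | cut -f1 -d' ')
mkdir -p heap
mv "$DOCUMENT" "heap/$SHAID"
# One directory per tag; -p keeps reruns from failing on existing directories.
mkdir -p tags/Schematics tags/Pentagon tags/Operation_Zesty_Lemon
# Each tag directory gets a hard link to the same heap entry.
ln "heap/$SHAID" tags/Pentagon/
ln "heap/$SHAID" tags/Schematics/
ln "heap/$SHAID" tags/Operation_Zesty_Lemon/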
Life is just nature's way of keeping meat fresh.
NTFS supports the same thing, it's just that hardly anyone ever uses it. Including Microsoft.
Do NOT "improve" the file. I'd like to continue to be able to use my computer and other devices.
Please do not read this sig. Thank you.
Look up extended attributes (xattrs). They already allow you to attach arbitrary metadata to a file. Most modern filesystems and user-level utilities support them already. They're even used as the underpinnings for security mechanisms such as POSIX ACLs and SELinux. Sure, there are issues with performance when you have *lots* of xattrs on a file, and that's a fruitful area of research, but we sure don't need some brand-new Microsoft-invented thing to deal with metadata.
Slashdot - News for Herds. Stuff that Splatters.
The thing is, those particular file systems also use a different notion of what a file is than what Unix folks are used to. One major example of this is that on these systems, a file can contain multiple streams of data, which HFS+ calls forks and NTFS calls alternate data streams. NTFS doesn't use them much, but Macs used forks heavily in the pre-OSX days (not so much anymore).
Files-11 and HFS+ also support a notion of files as being containers of discrete data records, rather than streams of bytes. Again, Macs used this concept heavily in the pre-OSX days, mostly when dealing with a file's resource fork, but it's not as common anymore.
Personally, I've found that the biggest issue with all the "metadata" systems that try to improve on the basic file/folder system is that they don't transfer anywhere. Send the file once through Samba, NFS, email, FTP, rsync or whatever and the metadata is lost. The only systems that actually get used are those that are embedded in the file, like EXIF for JPG, ID3 for MP3 and so on.
The stupid thing is that we didn't make that a generic part of all file formats, a simple key-value list appended to the file would do. But today that'd break almost everything, plus most things working on the file system would have to know that each file has a data and metadata part. Maybe use a compatibility layer for metadata-unaware applications, where they only see the data part?
That way we really could have a standard form of metadata. It might not cover every use but it'd sure cover a lot. Copy the file, copy the metadata (if you want, of course). Of course most of these researchers seem to want to get rid of the file altogether and replace it with some sort of cloud service, but I'd rather not. I'd rather know where I have my stuff and be able to put it where I want.
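To make the "key-value list appended to the file" idea concrete, here is a rough sketch. The layout (key=value lines followed by a "META <bytes>" footer recording the data length) is invented for this example and is not an existing format; data_part and meta_part are hypothetical helper names, and head -c is assumed to be available.

```shell
#!/bin/sh
# Sketch: key-value metadata appended to a file, with the data length in
# the last line so a reader can split the two parts. The "META <bytes>"
# footer layout is invented for this example, not an existing format.
FILE=$(mktemp)
printf 'hello world\n' > "$FILE"

# Append metadata: key=value lines, then a footer recording the data size.
SIZE=$(wc -c < "$FILE" | tr -d ' ')
printf 'title=Greeting\ntags=demo,test\nMETA %s\n' "$SIZE" >> "$FILE"

# "Compatibility layer": hand only the data part to unaware programs.
data_part() {
    size=$(tail -n 1 "$1" | cut -d' ' -f2)
    head -c "$size" "$1"
}

# Metadata-aware readers take everything after the data, minus the footer.
meta_part() {
    size=$(tail -n 1 "$1" | cut -d' ' -f2)
    tail -c "+$((size + 1))" "$1" | sed '$d'
}

data_part "$FILE"
meta_part "$FILE"
```

The catch is the one already mentioned: any program that reads the raw file without going through something like data_part sees the trailer as part of the data, which is why retrofitting this today would break almost everything.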
Live today, because you never know what tomorrow brings
Yes I would. If I deliberately transmit a message to someone else, then I have no expectation of being able to 'untransmit' that message. The logic error here is thinking that files are like objects. They are not (only), they are also like messages. Big business wants files to be like objects so they can own them. Everyone knows they can't do it, and this effort will fail like all others, due to the nature of reality. Files are not objects.
Korma: Good
"Quoting the first paper: 'For over 40 years the notion of the file, as devised by pioneers in the field of computing, has proved robust and has remained unchallenged. Yet this concept is not a given, but serves as a boundary object between users and engineers. In the current landscape, this boundary is showing signs of slippage, and we propose the boundary object be reconstituted. New abstractions of file are needed, which reflect what users seek to do with their digital data, and which allow engineers to solve the networking, storage and data management problems that ensue when files move from the PC on to the networked world of today."
They pretty much peppered the report with bullshit and buzz words to make "meta data" and "internet based storage" sound all new and shiny for the brain dead market droids and managers.
This reminds me of that MIT operating system hoax that was going to take current file system ideas and throw them out the window. Face it, how else do you organize bits of information? The concept of a file is simple: an organized arrangement of bits that contains data which can be moved, re-sized or deleted. How do you change that? The only thing that can change is the method in which they are stored on physical media (file system) or cataloged and indexed.
I just want one thing: a file system that is part database for fast file searches. I don't want to manually build indexes or any other bullshit; just look at the file table and give me my fucking file. Even if you had 100,000 files with file names of 256 characters, it's only about 25.6 MB, how long does that take to parse? Maybe I don't understand file systems, but even a file table of a few tens of MB should only take a few seconds to scan. When I do a search of a directory or entire disk with tens of thousands of files it sometimes takes a minute or two. The disk is thrashing away as if the program is looking all over for the file names. Shouldn't they all be in one place pointing to where they are on disk? Maybe I don't understand file systems in general, someone care to explain?
And one thing that just popped into my mind is a better method to tag and store files. When I download a file or save a document/image/whatever I shouldn't have to dig through a huge directory hierarchy. I should be able to type the name of a directory and something along the lines of Google's autocomplete or IntelliSense will begin to complete my search, regardless of what volume it's stored on. As I type vacation.. it should list all directories beginning with that string or tag. Maybe I am ignorant of similar functionality for Windows and Linux. The tags and file/directory names should be system wide and accessible to all programs and commands that interact with files, not just a built-in shell.
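On most file systems the names live in per-directory structures scattered across the disk rather than in one table, which is roughly why a cold search seeks all over the place. Tools in the spirit of locate(1) work around that by keeping a flat list of paths on the side. A minimal sketch of the idea follows; the demo tree and index file are hypothetical, and a real index would also need to be refreshed as files change.

```shell
#!/bin/sh
# Sketch: a flat file-name index, in the spirit of locate(1).
# The demo tree and the index file are hypothetical.
ROOT=$(mktemp -d)
INDEX=$(mktemp)
mkdir -p "$ROOT/photos/vacation_2009" "$ROOT/docs"
touch "$ROOT/photos/vacation_2009/beach.jpg" \
      "$ROOT/docs/vacation_plans.txt" \
      "$ROOT/docs/taxes.txt"

# Walk the tree once and write every file path into one place.
cd "$ROOT" || exit 1
find . -type f > "$INDEX"

# Later searches scan the index sequentially instead of walking the disk.
grep -i 'vacation' "$INDEX"
```

Here the search should print the two paths containing "vacation" and skip taxes.txt. The same list could feed a type-ahead completion, since matching a typed prefix against it is just another grep.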
That will break as soon as I edit the file with a non-supported application (that doesn't know to update the stored SHA-512 hash). This is why it is important to implement the feature at the filesystem level.
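The breakage is easy to reproduce with the heap/tags scheme from the earlier comment. This is only a sketch in a throwaway directory; the document name and its contents are made up.

```shell
#!/bin/sh
# Sketch: detect heap entries whose name no longer matches their content.
# Runs in a throwaway directory; file name and contents are hypothetical.
WORK=$(mktemp -d)
cd "$WORK" || exit 1
mkdir -p heap tags/Schematics

printf 'version 1\n' > doc.txt
SHAID=$(sha512sum doc.txt | cut -f1 -d' ')
mv doc.txt "heap/$SHAID"
ln "heap/$SHAID" tags/Schematics/doc.txt

# An unaware application edits the file in place through the tag link.
printf 'version 2\n' >> tags/Schematics/doc.txt

# Re-hash every heap entry and compare the result with its file name.
STALE=0
for f in heap/*; do
    actual=$(sha512sum "$f" | cut -f1 -d' ')
    if [ "$actual" != "$(basename "$f")" ]; then
        echo "stale: $f"
        STALE=$((STALE + 1))
    fi
done
echo "stale entries: $STALE"
```

The one heap entry is reported as stale, because the hard link shares its content with the edited file. Editors that save by writing a new file and renaming it over the old one fail differently: the tag link ends up pointing at a new inode and the heap entry silently keeps the old version.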
It is dangerous to be right when the government is wrong.
I'd like to take this opportunity to point out the brilliance of the "file" command (in *nix). All its smarts, plus all the details mentioned in its manpage, are all I ever needed to know about any file's technical details. This BS from Microsoft is re-inventing the wheel, badly and foolishly, with suspiciously strange priorities. No surprise there.
The "file(1)" manpage is a great read, including potshots at SysV, BSD, and mention that it (or at least Debian's version) was written by a fellow Canuck (Ian F. Darwin).
FYI, a point & click interface to manpages:
xman -notopbox -bothshown &
Enjoy the odd behaviour of the Athena Widget Set's scrollbars. :-)
"Tongue tied and twisted, just an Earth bound misfit
Wouldn't it be possible to make a "universal" file container, in which any other file type could be embedded along with a text file that listed: what type of file it is, what program it is associated with, owner, creation/mod dates, and especially, tags and other types of metadata?
Do you know that they tried this already?
In 1985? (Well, I'm speaking specifically of IFF - but there were other efforts. Mac's file forks were kind of the same sort of thing, except that they maintained the abstraction all the way down to the filesystem layer.)
Now, just because they tried it already and more or less failed doesn't mean it couldn't work... But they were in a much better position in 1985 to make this work than they are now (we've gone too long and come too far without a "universal format", it'd be nearly impossible to get people to embrace that kind of change now...) so I think it's kind of a lost cause.
I found it absolutely fascinating, personally, when I read one of the original documents on IFF. The ambition, the hubris perhaps, with which they were trying to guide the future of personal computing. They weren't just seeking to create "a" format, they were aiming for it to be the format. And it would have been capable of just about everything you suggest - embed a FORM of whatever you like in a LIST, put in descriptive chunks, etc... I believe Amiga embraced the concept to a fairly high degree.
There are various historical and technical reasons why it didn't really pan out. I think one of the big ones is simply that IFF wasn't the right format for everything. Perhaps no one format can be. Among other things, IFF required that a four-byte payload size appear at the start of each chunk. That limits a chunk (and therefore a file) to 4GiB maximum (not such a big deal in 1985 or even 1995... But these days it'd be an unacceptable limitation) - but another problem is that sometimes you need to write out some data and you just don't know how big it's gonna be. Streaming audio and video are a pretty good example. You can discretize the stream, populate it with known-size chunks, but you don't know the size of the whole stream until it ends.
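For anyone who hasn't looked at the layout, a chunk header is small enough to build by hand. This is a rough sketch of an IFF-style chunk (4-byte ID, 4-byte big-endian size, then the payload); the chunk ID and payload are made up, and it ignores the pad byte that real IFF adds after odd-length payloads.

```shell
#!/bin/sh
# Sketch: write one IFF-style chunk (4-byte ID, 4-byte big-endian size,
# payload) and read its header back. The chunk ID and payload are made up.
CHUNK=$(mktemp)
PAYLOAD='hello'

# Header: ID "TEXT", then the payload length 5 encoded as 00 00 00 05.
printf 'TEXT\000\000\000\005%s' "$PAYLOAD" > "$CHUNK"

# Read the ID, then decode the four size bytes, most significant first.
ID=$(head -c 4 "$CHUNK")
set -- $(od -An -tu1 -j4 -N4 "$CHUNK")
SIZE=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))

# The writer had to know the payload length before emitting the header.
echo "chunk $ID, $SIZE bytes"
```

Both complaints are visible here: the size field is a fixed four bytes, and it comes before the payload, so a writer that is streaming either has to buffer the data or go back and patch the header once the length is known.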
I think general-purpose data formats are a good thing - but I believe it's very important to consider that there may be cases where a particular format just isn't right for the problem. And that brings us back more or less to the current scenario, in which different applications tend to have totally distinct file formats, not even sharing an overall containment structure. From that perspective, it's wasteful to continue re-inventing metadata storage for each new file format that comes along, and wasteful to implement all these different methods of reading metadata out of different application-specific file formats. There's also the danger that we will want to change the format of the data in the metadata fields (just as we shifted from "whatever local variant of ASCII your region uses" to mostly using UTF-8 - which still isn't necessarily adequate for all regions, incidentally). Another all-new text encoding so soon after Unicode's introduction isn't too likely, but the OS, in defining how these metadata fields are defined and used, could change the requirements in ways that go beyond what the container format can provide (for instance, storing data that goes beyond the limit of a particular format's "metadata region" size limit, or storing something that's better encoded in some binary form other than text). Decoupling the encoding of metadata from the definition of file formats eliminates a bunch of redundant work and leaves us more room to change what metadata contains and how those contents are used, as we get a better idea of how, ultimately, it will be used as the dust settles around this whole issue.
Bow-ties are cool.