File Organization — How Do You Do It In 2011?
siddesu writes "After 30 years of being around computers, I have, like everyone else, amassed a huge number of files in a huge number of formats about a huge number of topics. And it isn't only me: the family now has a ton of data that they want managed and easily accessible. Keeping all that information in order has always been a pain, but it has gotten harder as storage has increased and people, files and sizes have multiplied. What do you folks use to keep your odd terabyte of document, picture, video and code files organized (that is, relatively uniformly tagged, versioned, searchable and ultimately findable) without 50 duplicates over your 50 devices and without typing arcane commands in a terminal window? I found this discussion from 2003 and this tangentially relevant post from 2006. How have things changed for you in 2011? And how satisfied is your extended family with the solution you have unleashed upon them?"
.. seriously.. they still work for me.
I’ve got a 12TB file server (~6TB filled). It’s arranged as follows:
documents/
incoming_downloads/ (before you ask.. yes.. _legit_ downloads)
media/
media/video/
media/video/movies/
media/video/tv_shows/
media/video/tv_shows/some_tv_show/
media/video/standup
media/video/etc..
media/music/
media/images/
media/images/various_subfolders/
code/
virtual_machines/
tmp/
backup_links/
backups/
That’s always been enough for me. Never got into all this tagging/meta data stuff. If there’s anything I’d ever want to search on... I put it in the file name. Indexed every night via slocate.
backup_links is part of my hacked together backup system.
The thing is RAID6, set up so two drives can fail without loss of data. I see this as adequate “backup” for stuff that is replaceable (the large portion of my media is rips of DVDs I own... so although it would be a huge pain in the ass to re-rip them all... it’s not impossible). Stuff that is irreplaceable, I back up to separate hard drives (via hot swap trays).
I leave one backup drive plugged into the machine, and keep the other elsewhere. I periodically swap these drives. I have a script that just rsyncs the files and directories pointed to in backup_links (the irreplaceable ones) to the currently plugged in drive (and yes I verified that I’m not getting a backup of my links ;p). This way I always have one drive that has a pretty recent backup (runs nightly), and one drive that has at most a month or so old backup if the plugged in one fails for some reason.
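For anyone wanting to copy the backup_links idea, here is a minimal sketch of what such a script could look like. It is not the poster's actual script: the function names and the choice of plain `rsync -a` are assumptions. It resolves each symlink first, so the real directories get copied rather than the links themselves.

```python
import subprocess
from pathlib import Path


def resolve_backup_targets(links_dir):
    """Return the real paths that the symlinks in links_dir point to."""
    targets = []
    for entry in sorted(Path(links_dir).iterdir()):
        if entry.is_symlink():
            real = entry.resolve()
            if real.exists():  # skip dangling links instead of failing
                targets.append(real)
    return targets


def build_rsync_command(targets, dest):
    """Build one rsync call that copies every target into dest."""
    # -a (archive) recurses and preserves permissions and timestamps.
    return ["rsync", "-a", *[str(t) for t in targets], str(dest)]


def run_backup(links_dir, dest):
    """Copy everything referenced from links_dir to dest (assumed mount point)."""
    targets = resolve_backup_targets(links_dir)
    if targets:
        subprocess.run(build_rsync_command(targets, dest), check=True)
```

Calling `run_backup("backup_links", "/mnt/backup_drive/")` from a nightly cron job would match the description above; both paths are placeholders.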
backups is backed up files from other machines.
Keeping everything in one place helps with the organization, I think. Most of the other machines on this network are basically just OS installs. All the real files are on the file server. My desktop runs off a small SSD, which is not even half filled.
I think you left a directory out. ;)
Fap and non-fap, and must-not-fap.
I've tried forcing myself to use various schemes including relying completely on metadata and search. The last couple of years this is how I've ended up setting things up:
"Public" network storage
This is for data that should be accessible to the entire network at home. NFS mounted on all my machines, stored on ZFS volume on my file server.
Private network storage
I use my home directory on the file server (also on the ZFS volume) for storing personal files and mirroring home directories from client machines in ~/Backup/homes/.
Local storage
On individual client machines I generally try to stick with whatever the operating system tries to make me use with an rsync script that syncs everything to the file server (automatically for desktops, run manually on portable machines).
This is what works for me. I would probably have stuck to the "just use metadata" approach if most user interfaces didn't seem to try and make it a major chore to edit and view metadata...
Greylisting is to SMTP as NAT is to IPv4
I have recently found an incredibly fast search tool called Everything. We're talking about Google-like searching where the results pop up as you type. It must be something on the order of a fifth of a second for my 1.5 million files. This kind of technology should be widespread - it makes searches actually *pleasant* to do. Anyway thanks to Everything, I worry less now about where I store my files, and I also try to pack in keywords into the filename.
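The "keywords in the filename plus a fast index" approach is easy to try out. The sketch below is only an illustration of the idea and makes no claim about how Everything (or slocate) is implemented: it walks the tree once, keeps the paths in memory, and then answers each query by matching every query word against the file name.

```python
import os


def build_index(root):
    """Walk root once and remember every file path (the slow part)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def search(index, query):
    """Return paths whose file name contains every word of the query."""
    words = query.lower().split()
    return [path for path in index
            if all(word in os.path.basename(path).lower() for word in words)]
```

Build the index once (say, nightly) and call `search(index, "tax 2010")` as often as you like; the matching itself is a single pass over a list of strings held in memory.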
Anyway, this kind of program is just a glimpse of what a future OS would look like. Imagine a system where everything is stored in tags and where folders become obsolete or used far less often. What you have then is a database or metadata file-system. The relatively new Haiku OS uses such a system, and I wrote about the massive advantages on this old page:
http://www.skytopia.com/project/articles/filesystem.html
Honestly, we'll all be better off the sooner we switch.
Why OpalCalc is the best Windows calc
Simple: Delete stuff.
Do you need all those installation files for 10-year-old shareware? Do you really need gigabytes of movies you will never watch again? A music collection so big that your playlist is months in length? Irrelevant TV shows? More ebooks than you can possibly read?
What you really need to keep are personal files: photos, home video, documents. Those can easily be managed: tag by occasion, file under year/month. Done. (They do not take that much space either, and people get tired of documenting everything sooner or later.)
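Filing under year/month is simple to automate. This is a rough sketch that assumes the file's modification time is a good enough stand-in for when the photo was taken (it often isn't after a copy); the function names are made up for illustration.

```python
import os
import shutil
import time


def dated_destination(archive_root, path):
    """Return archive_root/YYYY/MM/<name> based on the file's modification time."""
    stamp = time.localtime(os.path.getmtime(path))
    return os.path.join(archive_root, f"{stamp.tm_year:04d}",
                        f"{stamp.tm_mon:02d}", os.path.basename(path))


def file_by_month(archive_root, path):
    """Move one file into its year/month folder and return the new location."""
    dest = dated_destination(archive_root, path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if os.path.exists(dest):
        raise FileExistsError(dest)  # never silently overwrite a photo
    shutil.move(path, dest)
    return dest
```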
-- Technology for the sake of technology is as pathetic as eschewing technology because it's technology.
I also still use a similar directory structure, but I've made one change in the past few years that makes it much easier to manage: I keep the special, personal, irreplaceable stuff in a separate hierarchy.
This negates the need for something like a backup_links directory, and makes it much easier to just share the "normal" media directory with everyone/thing on my home network and then handle permissions on the personal stuff with more granularity. It's also much easier when I know I'm looking for a photo I've taken or a document I've made that it'll be in the personal hierarchy under those categories rather than the main ones.
It's a small change, but keeping a separation between stuff I've made and the easily replaceable stuff I've acquired has gone a long way to making my personal data and treasures more secure--both from loss and accidental sharing.
My main file server, where anything not in immediate use is stored, is organized mostly for human convenience. That is, a tree-hierarchy of folders.
media
media/video
media/video/movies
media/video/tv
media/video/shorts
media/video/educational
media/audio
media/audio/music
media/audio/drama
media/audio/comedy
media/audio/educational
media/pictures
media/pictures/family (with various subfolders like "zoo", "picnic", "christmas 2010", etc.)
documents
documents/work/[person's name]
documents/school/[person's name]
documents/misc
web/[site name]
programming/[person's name]/project
family history/
misc/
At the end of the year, or when I do a mass data import, I spend more time getting the meta-data and tags correct than anything else. All of my audio and video are properly tagged. Ditto for any documents.
Almost all audio and video is accessed with "smart" programs, like Amarok or XBMC, which automatically pull in things like lyrics, trailers, cover art, etc. That stuff is almost never accessed thru the directory tree. The interfaces on the programs are way too good -- assuming the stuff is properly tagged.
The web and programming folders are basically .tar.gz files that are backed up and copied over (drag-n-drop via smb mounted share). They're archives of whatever project someone is working on their local system. I've set up cron/scheduled tasks to update those daily on everyone's PCs, even the kids.
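A daily project snapshot like this can be produced with a few lines of script. The sketch below is an assumption about how one might do it, not the poster's setup: it writes `<project name>.tar.gz` into the mounted share, going through a temporary `.part` file so a half-written archive never replaces yesterday's good one.

```python
import os
import tarfile


def archive_project(project_dir, share_dir):
    """Write <project name>.tar.gz into share_dir and return its path."""
    name = os.path.basename(os.path.normpath(project_dir))
    archive_path = os.path.join(share_dir, name + ".tar.gz")
    tmp_path = archive_path + ".part"
    with tarfile.open(tmp_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the project name
        tar.add(project_dir, arcname=name)
    os.replace(tmp_path, archive_path)  # swap in the finished archive
    return archive_path
```

Run it from cron on Linux or from Task Scheduler on Windows, pointing `share_dir` at wherever the SMB share is mounted.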
Most media folders are read-only, to prevent accidental deletion. My account is the master and I can upload stuff there, but I don't want accidents from people wanting to just watch a movie. 600+ DVDs/BluRays, including movies, educational & television shows all on a 2 TB file server in h.264 format. All *music* is FLAC format, with Amarok auto-transcoding if people want to transfer to an iPod. All other audio, like drama/comedy/educational is 128 Kbps MP3 for ease of streaming. And old comedy albums aren't exactly THX-quality to begin with.
Learning HOW to think is more important than learning WHAT to think.
Who has the time to hand-pick all the relevant tags for every file they download? Yeah, me neither.
Finding time to put things in their own directory, and not dumping them all in "downloads", is a great accomplishment.
However finding a meaningful, hierarchical structure is non-trivial. I'm still working on it.
I have media drives that hold the bulk and they are easily organized into games/pictures/books/movies/tv/music. Smaller document/coding directories are on my C drive for source/text/spreadsheets I make myself.
I don't tag anything. For my pictures, I simply name the directories Year_date_mainContent (e.g. 2010_12_25_Xmas). Media names are self-evident, but I also run XBMC for video, so I guess that has internal tagging. It's still easy to find video outside of XBMC, which I only use about 50% of the time.
I almost never even use search to find things, because the layout is very logical and it is pretty much obvious where everything is.
Everything is online and in my computer, multiple TB drives. No raid.
For backup I simply use external multi-TB eSATA drives and FreeFileSync, which I run once a week.
Search, don't sort.
..don't panic
I had great success with Google Desktop Search (on windoze) for a while. It would index my mail, files, and web history (if instructed to) - and the best part was hitting one key to get an instant, minimalist search box with auto-preview. From there, you could jump straight to what you were looking for, or open a further page to narrow the search.
Sadly, it doesn't work with Thunderbird 3.0, and Google doesn't appear to care, or even to be supporting it anymore. So now I'm on a hodgepodge of GDS, Windows built-in search, and the sucky T-bird search bar.
I honestly can't believe that nobody has duplicated this Spotlight-esque functionality yet. I realize there are other desktop search options, but none of the ones I've come across have that one-key mini search that goes away as easily as it is called up. For an operation that I'm performing dozens of times daily, that's pretty crucial. It even replaced the file browser for me - much easier to call up the GDS box & type a couple letters than to grab the mouse and drill down into some directory structure - even if I know exactly where I'm going.
sticky labels on each floppy disk.
"We live in a global world" - Harvey Pitt, former Securities and Exchange Commission Chairman
I'm pretty much a "have a lot of structured directories" guy myself; I don't see your complaint about rising file sizes, or even total number of files. They've pretty much increased linearly in number while the speed of the Linux "locate" command has gone up exponentially with Moore's Law. It's the other way around from management trouble - with TB hard drives, I have so much space I leave around TV shows and other media files I'll likely never watch again, "just in case".
At work, the search problems are harder, because I've got quite the multi-tasking job where I may spend just minutes on some problem, then be asked for an update months later, totally skeptical that I ever addressed the issue. And my favourite file-management with that is the most insane-sounding of all: one big directory. I sort it by date and rely on the fact that I take time to write out helpful file names like "downtown_condition_assessment_newmall_4_ernie.xlsx" (not actually that long, I use abbrevs in RL). Only files that have a whole lot of subject-matter friends get their own subdirectory; lonely "one-off" files go in the Big Pile.
The "sort the directory by date" uses the theory behind "lifestreams" promoted by Eric Freeman and David Gelernter at Yale. It really is the best thing I've found (same 30 years) to stimulate the memory - seeing the names of other things you did at the same time; you can actually sense yourself getting close to the file as you remember, "Oh yeah, I worked on that in the spring".
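The Big Pile view is just a directory listing ordered by time. A small sketch of that listing follows; the function name is made up, and it is only a nod to the lifestreams idea, not an implementation of Freeman and Gelernter's system.

```python
import os
import time


def lifestream(directory):
    """Return (date, name) pairs for the files in directory, newest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.stat().st_mtime, entry.name))
    entries.sort(reverse=True)  # most recently modified first
    return [(time.strftime("%Y-%m-%d", time.localtime(mtime)), name)
            for mtime, name in entries]
```

Printing the pairs one per line gives the "what else was I doing that week" context next to each file name.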
An additional word of Fear & Loathing for "document management systems" like LiveLink by Formark. Required to use this by work (shared directories are strictly for 'short-term' storage), it's awful. Terribly slow, the search function approaches useless, and it's hard (and slow, did I mention slow) to even re-sort a directory (sorry, that's a 'filter down' in Livelink's vocab) by name or date or whatever. After promising that photos would be displayed with thumbnails by the great new Version 4 for two years, it came, broke some stuff that was working, and did not provide thumbnails - all media files are unsearchable in any way. I suspect for long-term archiving, putting documents in a database would have advantages, but for active business usage, it's been crippling.
Everything is what I use on the PC to quickly find any file I am looking for.
On the Mac I use Spotlight.
While it would be nice to be completely organized, these tools let me find my files anywhere they are located on my PC. I try to keep things organized into folders, but I am always falling behind so these are what I can use in the interim.
I was also intrigued by the idea of a database-oriented file system. A basic operation, though, is to get a file. How? By its name or ID. What is its name? Something you have to define. It could have a category (e.g. JavaScript development library), but that's something the schema would impose upon you. What if you're more interested in the file's contributor (author, downloader, etc.)?
By itself, a file system backed by a database engine doesn't make the problem smaller; it adds overhead.
There's only one thing that definitively distinguishes one file from another, and that's the exact bytes contained within its storage. Comparison can be simplified by hashing schemes like md5sum, but they can all have collisions (rare, but how much of a chance are you willing to take?).
Is a file of bytes ended by CR the same as the same file of bytes ended by CRLF? While the system itself might use null termination, files from other systems won't.
The low-hanging fruit for file de-duplication is in backup storage. When you and another person need to retain the same file, it can easily be merged into the stream. When you have two files on your own system that are byte-identical, that's not so easy, because you have probably defined some separation criteria (e.g. different file paths), so on your system they still need to remain discrete.
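To make the hash-versus-bytes point concrete, here is a sketch of a duplicate finder that uses the hash only as a shortcut and then confirms every match byte for byte, so a collision can't cause a false positive. The choice of SHA-256 and the function names are assumptions for illustration. A CR-terminated file and its CRLF twin differ in their bytes, so they are (correctly, for this definition) not reported as duplicates.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root whose contents are byte-identical."""
    by_size = defaultdict(list)  # cheap first filter: equal files have equal size
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)  # second filter: hash only same-sized files
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[file_digest(path)].append(path)

    groups = []
    for paths in by_digest.values():
        remaining = sorted(paths)
        while remaining:
            first, rest = remaining[0], remaining[1:]
            # final check: compare actual bytes, never trust the hash alone
            same = [first] + [p for p in rest
                              if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

The function only reports groups; deciding which copy to keep is left to a human, precisely because the separate paths may be there for a reason.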
I've not heard much about how they would integrate this at the OS level but I think that's the trick.
Some savant once said "DON'T TRUST ANYONE!" with your money or your wife (or today, your data). I think that includes Google...
Sometimes, real fast is almost as good as real-time.
What happened to Beagle for Linux? It used to work pretty well for me, and now it seems to have been abandoned.
I finally dealt with this problem once and for all in the following way. I found the best personal wiki out there (Zim: http://zim-wiki.org/), and wrote a simple Python script (http://www.inrim.it/~magni/zimDMS.htm) that scans my folder structure nightly, keeping my wiki up to date. My wiki, therefore, is a perfect mirror of my folder structure, with the added bonuses that I can navigate to each folder, comment on it, describe its content, insert images, insert links to other folders, and finally open it in the file manager with a single click. My ~15000 folders are managed perfectly...
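The general shape of such a mirroring script can be sketched as below. To be clear, this is not the zimDMS script linked above, and it does not write Zim's own page format; it just creates one plain-text note per folder in a separate notes tree (an assumed layout) and leaves existing notes alone so hand-written comments survive the nightly run.

```python
import os


def mirror_folders(root, notes_root):
    """Create one note page per folder under root; return the pages created.

    notes_root is assumed to live outside root.
    """
    created = []
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # stable order for both traversal and page content
        rel = os.path.relpath(dirpath, root)
        page = os.path.join(notes_root,
                            "index.txt" if rel == "." else rel + ".txt")
        if os.path.exists(page):
            continue  # keep comments already written on this page
        os.makedirs(os.path.dirname(page), exist_ok=True)
        with open(page, "w", encoding="utf-8") as f:
            f.write(f"Folder: {os.path.abspath(dirpath)}\n\n")
            for sub in dirnames:
                f.write(f"* {sub}\n")  # list subfolders as a starting point
        created.append(page)
    return created
```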
Applying the Infinite Monkey Theorem I put everything into one folder, assigning each file a pseudo-random name. Although there's only one of me, in time, I'm confident that a pattern will emerge...
It must have been something you assimilated. . . .
Of everyone who posted their fancy choice of directory structure *nobody* told us where they keep their ~/.pr0n
If you are on Windows you might want to give Nemo Documents a try. It gives a time based view and allows one to use tags. Disclaimer: author posting ;-)
Having just started cleaning my house, this story comes close to my heart. Looking around, I have 6 boxes of old “documents”. What to do with them?
First to cover the common areas:
Video:
I have two TIVO boxes, one is high definition, both recording constantly.
I have one system with 8TB of storage to sort/organize the incoming TIVO recording.
I’m setting up two 60TB servers for my “movies and TV shows”. (Each will handle 26 hard drives). I use the term “setting up” as I’ve run into some issues with these systems.
Binary:
I have a 2TB system set up for binary files. (This would be development, OS, drivers, patches and the like). You never know when you will need a DOS bootable disc.
Music:
I have one system (with 2TB storage) to handle my MP3’s. (Still need to sort/organize/remove duplicates). Currently this one also houses my image collection, important documents and the like. It is acting as kind of a catchall for everything else.
Data:
I’ve recently set up a system to handle “data” (document based), with 130 GB of space. I’m using “Home Document Manager”. Though it is not mature, its developers are more amenable to fixing the problems.
And now to the point: Organization.
Overview
The first, glaring issue is the lack of a good storage house. Most management systems store a single file in a single location, sometimes with tags. A good example of the problem that I found: what if I have a medical bill, which is being kept for legal reasons, and which I will need at tax time? What if I have an MP3, a music video and a movie that I would like to tie together (or, heaven forbid, multiple playlists)? Or movie props that I’ve purchased off eBay?
I would not like to keep the medical bill after 3 years, but for legal reasons would like to keep it for seven. I don’t want to delete the “item”, but I no longer need to be reminded about the “bill”. I don’t want to have multiple copies of the same item, which makes searching a nightmare. And “tags” are a start, but are not granular enough.
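One way to picture "more granular than tags" is to let each item carry several roles, each with its own expiry date, so a role can lapse without the item being deleted or duplicated. This is only a sketch of that idea; the class, the role names and the dates are hypothetical and do not describe any existing product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Item:
    """One stored file that can play several roles at once."""
    path: str
    roles: dict = field(default_factory=dict)  # role name -> date the role expires

    def active_roles(self, today):
        """Roles that still apply on the given day."""
        return sorted(role for role, end in self.roles.items() if today < end)

    def can_delete(self, today):
        """The item itself may go only when every role has lapsed."""
        return not self.active_roles(today)
```

A medical bill would be one `Item` with a "bill" role ending after three years and a "legal" role ending after seven; after year three it stops showing up as a bill but is still held for legal reasons, and there is only ever one copy to search.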
Video organization:
Extreme Movie Manager. Ok, it has some bugs, but it does a VERY good job. With its multiple views, and multiple ways of keeping track of movies, it is the best one that I’ve seen.
Music: Currently I’m (just) using Media Monkey and MS Media Player. Media Monkey has a severe limitation in that it does not handle video (read: music videos; watch "Vertical Lines" by Leather Hands to get the point). I attempted to use an “automated sorting” system, but it has significant issues, the biggest being that it took MP3s from a known group (1970’s for example) and moved them to “Unknown”, “Unknown”. Can’t use that. I also used Clone Master, and found that I have almost 2500 duplicate MP3 files. Unfortunately, most of the time it “guesses” the wrong file as the one to delete.
Binary is actually the most straightforward, simple file structure.
Other issues:
Video Servers: I’m also running into hard drive selection issues with the video servers. The problem is: Enterprise class SATA drives are expensive, “small” (only 2TB) and fast (as such they use a lot more energy). “Green” drives are cheap and plentiful and use a lot less power (and generate a lot less heat); however, they are not compatible with the RAID controllers needed.
Video Playback: I have a decent system to handle the Blu-ray, high def requirements. However the software also has problems: In/with high def you can’t read the “default” fonts displayed