How Do You Organize Your Experimental Data?
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
In my experience, the best thing is to let the structure stand as it was the first time you stored the data.
Your statement of this policy illustrates a hidden assumption that makes the policy fail before it gets out of the gate: there should be no notion of "structure" in the way you store the data. A filesystem-- all production* examples of which are hierarchical-- imposes a particular set of relations on the elements of your data: that of a hierarchy (and a specific one at that!). By caring about the hierarchy, you tie yourself both to one kind of relationship and to one specific instance of that relationship among your data.
(* There's been some work on relational-- as opposed to hierarchical-- filesystems; cf. WinFS. If this type of file representation/storage were the de facto standard, the data "organization" habit into which so many people unknowingly fall wouldn't be a bad one.)
While a symlink farm is one way to separate representation from presentation, it still causes all your views into the data to be hierarchical relationships (each "view" directory in a "views" directory is still organized under some other, higher directory; it's ambiguous where in the hierarchy the hierarchy semantics end and relational semantics begin).
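To make the relational alternative concrete, here is a minimal sketch of keeping the relations in a small database instead of in directory names. It is only an illustration under my own assumptions: the table layout, the function names (open_catalog, add_dataset, find) and the key=value tag scheme are all made up for this example, and the files themselves can sit in one flat directory under whatever names they were first given.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open (or create) a catalog. Table and column names are illustrative."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS dataset ("
               "id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS tag ("
               "dataset_id INTEGER NOT NULL REFERENCES dataset(id), "
               "key TEXT NOT NULL, value TEXT NOT NULL, "
               "UNIQUE(dataset_id, key, value))")
    return db

def add_dataset(db, path, **tags):
    """Register a file once and attach any number of key=value tags to it."""
    db.execute("INSERT OR IGNORE INTO dataset(path) VALUES (?)", (path,))
    (ds_id,) = db.execute("SELECT id FROM dataset WHERE path = ?",
                          (path,)).fetchone()
    db.executemany("INSERT OR IGNORE INTO tag VALUES (?, ?, ?)",
                   [(ds_id, k, str(v)) for k, v in tags.items()])
    db.commit()

def find(db, **tags):
    """Return the paths of datasets that carry every given key=value tag."""
    if not tags:
        return [p for (p,) in
                db.execute("SELECT path FROM dataset ORDER BY path")]
    clauses = " OR ".join("(t.key = ? AND t.value = ?)" for _ in tags)
    params = [x for k, v in tags.items() for x in (k, str(v))]
    sql = ("SELECT d.path FROM dataset d JOIN tag t ON t.dataset_id = d.id "
           f"WHERE {clauses} GROUP BY d.id HAVING COUNT(*) = ? "
           "ORDER BY d.path")
    # A dataset qualifies only if it matched all of the requested tags.
    return [p for (p,) in db.execute(sql, params + [len(tags)])]
```

Every "view" is then just a query such as find(db, sample=..., pH=...), so nothing needs to be moved, renamed or relinked when you think of a new way to slice the data. Note that values are compared as strings in this sketch, so 7 and 7.0 would count as different tags.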
Raw data should stand untouched, or you may delete it by mistake.
Leaving data "untouched" is the wrong solution to that problem. Your data was expensive to produce, and it should not be brittle. You need backups, and they must 1) protect against data corruption (so you can undo a mistake made by you or your hardware, such as an accidental/regrettable data modification or filesystem corruption), and 2) provide data persistence (so your data will survive if a copy of it is destroyed, as in a disaster).
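A backup can only undo corruption if you notice the corruption in the first place. One common way to do that, sketched here under my own assumptions (the function names and the in-memory dict manifest are illustrative, not any particular backup tool's format), is to record a checksum for every raw file and re-check it from time to time:

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root, manifest):
    """Compare the tree under root with a previously stored manifest.

    Returns (missing, changed): files that have disappeared and files
    whose contents no longer match the recorded digest.
    """
    current = build_manifest(root)
    missing = sorted(set(manifest) - set(current))
    changed = sorted(name for name in manifest
                     if name in current and current[name] != manifest[name])
    return missing, changed
```

Store the manifest somewhere other than the disk it describes (alongside the backups, for instance). Anything verify() reports as missing or changed is a candidate for restoring from a backup copy.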