How Do You Organize Your Experimental Data?

Posted by timothy on Sunday August 15, 2010 @04:10AM from the can't-remember-where-I-put-my-memory dept.

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"

16 of 235 comments

Use databases! by Cyberax · 2010-08-15 04:13 · Score: 3, Insightful

Subj.
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
1. Re:Use databases! by garcia · 2010-08-15 04:31 · Score: 2, Insightful
  
  If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
  That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here, here, here, and here) for several reasons:
  1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with MySQL.
  2. Yes, I could use MySQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.
  3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in MySQL I would have to make the CSV, pipe it into the MySQL, convert it back out to CSV and then upload it. An additional step for nothing.
  Hey, no method is perfect for everyone and every project is a little different and while it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.
  YMMV.
2. Re:Use databases! by Idiomatick · 2010-08-15 04:33 · Score: 2, Insightful
  
  I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone who isn't a programmer of some sort doing so.
  
  The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'
3. Re:Use databases! by Dynedain · 2010-08-15 05:50 · Score: 2, Insightful
  
  Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intention of hiring one.
  Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagine that you'd like to provide or incorporate data with some outside sources or other researchers. If you're using something "standard" like a relational DB, it will be much easier to hire a DB wizard than to find a programmer who can piece together a lot of mismatched files and convoluted organization schemes.
  This is what databases are designed to do. Just because you're not an expert at setting them up, and there's a performance hit to setting them up wrong, doesn't mean that they aren't still the right tool.
  
  --
  I'm out of my mind right now, but feel free to leave a message.....
4. Re:Use databases! by BlitzTech · 2010-08-15 06:05 · Score: 2, Insightful
  
  Apache: click the install button (use default options, or switch to non-service mode which it very clearly explains means it only runs when you run it instead of whenever you start your computer)
  MySQL: click the install button (use default options, they're all fine)
  phpMyAdmin: put in document root, configure ("click the install button")
  
  And you're set. How was that hard...?
  
  Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learning to use a database is probably a good idea. Many scientists are taught MATLAB, and setting up a WAMP stack is much, much easier than learning MATLAB.
  
  I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone who isn't a programmer of some sort doing so.
  Scientists are pretty smart. He should learn to use a database.
5. Re:Use databases! by bunratty · 2010-08-15 07:15 · Score: 3, Insightful
  
  I've helped my wife, who is a research scientist, by writing scripts to process her data. The IT departments at the companies where she's worked have no idea what her work is or how she does it, and perhaps have even less interest in helping her. Their function is to keep the infrastructure (networks, file servers, email servers, etc.) working and install software packages onto the computers. They aren't of any use in helping individual users with their work. They will install SAS and S-Plus but will not help by writing SAS and S-Plus code, for example. Judging from your comment about your wife, you might be in the same situation as I am.
  
  --
  What a fool believes, he sees, no wise man has the power to reason away.
sqlite by sugarmotor · 2010-08-15 04:23 · Score: 3, Insightful

http://www.sqlite.org/ a "replacement for fopen()" -- http://www.sqlite.org/about.html
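A minimal sketch of what the "replacement for fopen()" idea could look like for this problem, using Python's built-in sqlite3 module. The table layout, the column names (sample, ph, experiment) and the example paths are assumptions made up for illustration; they are not taken from the original poster's setup.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open (or create) a single-file SQLite catalog of datasets."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datasets (
               path       TEXT PRIMARY KEY,
               sample     TEXT,
               ph         REAL,
               experiment TEXT
           )"""
    )
    return conn

def add_dataset(conn, path, sample, ph, experiment):
    """Record one data file together with its qualifiers."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?)",
            (path, sample, ph, experiment),
        )

def find_datasets(conn, **criteria):
    """Return the paths that match every given column=value pair."""
    allowed = {"sample", "ph", "experiment"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Column names come from the allow-list above; values are bound as parameters.
    where = " AND ".join(f"{col} = ?" for col in criteria) or "1"
    rows = conn.execute(
        f"SELECT path FROM datasets WHERE {where} ORDER BY path",
        tuple(criteria.values()),
    )
    return [row[0] for row in rows]

# Hypothetical example data, kept in memory so nothing is written to disk.
catalog = open_catalog()
add_dataset(catalog, "/data/20100815/run1.dat", "sampleA", 7.0, "nmr")
add_dataset(catalog, "/data/20100815/run2.dat", "sampleA", 5.5, "nmr")
print(find_datasets(catalog, sample="sampleA", ph=7.0))
```

The data files stay where they are; only the path and the qualifiers go into the catalog, so one file can be found by sample, by pH, or by experiment type without any symbolic links.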

--
http://stephan.sugarmotor.org
I used to be anal about organization... by taoboy · 2010-08-15 04:27 · Score: 2, Insightful

...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.
For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.
Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)
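One possible way to keep such names consistent is to generate and parse them with a small helper, sketched here in Python. The field order (date, sample, pH, experiment) and the underscore separator are assumptions for illustration; only the 20100815 date format comes from the comment above.

```python
import re
from datetime import date

def make_name(day, sample, ph, experiment, ext="dat"):
    """Build a searchable, sortable name from a date and a few qualifiers."""
    return f"{day:%Y%m%d}_{sample}_pH{ph:.1f}_{experiment}.{ext}"

# Matches only names produced by make_name (fields must not contain "_").
NAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<sample>[^_]+)_pH(?P<ph>\d+\.\d)"
    r"_(?P<experiment>[^_.]+)\.\w+$"
)

def parse_name(name):
    """Recover the fields from a name built by make_name, or return None."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ph"] = float(fields["ph"])
    return fields

print(make_name(date(2010, 8, 15), "sampleA", 7.0, "nmr"))
# 20100815_sampleA_pH7.0_nmr.dat
```

Because the date leads the name in year-month-day order, a plain alphabetical sort of the file names is also a chronological sort.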
Try using a scientific workflow system by moglito · 2010-08-15 04:34 · Score: 3, Insightful

You may want to consider a scientific workflow system. These systems handle both data storage (including meta-data and provenance -- where the data came from), and design and execution of computational experiments. If you are concerned about the complexity of the meta-data (e.g., pH value..) and would like to make sure you are able to sort things according to this, you may want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox.
What about a wiki? by gotfork · 2010-08-15 05:12 · Score: 3, Insightful

In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to troubleshoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.
Use tags in Apple OS X by wealthychef · 2010-08-15 05:23 · Score: 2, Insightful

If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget the keywords that I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.

--
Currently hooked on AMP
Re:Relational Databases won't do! by Anonymous Coward · 2010-08-15 05:24 · Score: 1, Insightful

You can just pipe the output from the SQL client to a text file (or export the results to a CSV file if you use a Query Browser).
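The same export can be scripted. The sketch below assumes the data sits in SQLite and uses Python's sqlite3 and csv modules; the table name measurements and its columns are hypothetical, and a different database would need its own client library.

```python
import csv
import io
import sqlite3

def export_query_csv(conn, sql, out, params=()):
    """Run a query and write the result to `out` as CSV with a header row."""
    cur = conn.execute(sql, params)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)                                 # one row per record

# Hypothetical in-memory table standing in for real measurements.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE measurements (sample TEXT, ph REAL)")
db.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("sampleA", 7.0), ("sampleB", 5.5)],
)

buffer = io.StringIO()
export_query_csv(db, "SELECT sample, ph FROM measurements ORDER BY sample", buffer)
print(buffer.getvalue(), end="")
```

Passing an open file instead of the StringIO buffer writes the CSV to disk, where a spreadsheet or plotting tool can read it.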
You need a LIMS by pigreco314 · 2010-08-15 05:53 · Score: 2, Insightful

A Laboratory Information Management System will help you store, organize, analyze and data-mine your data.

--
"linux" is a very common word and was not included in your search.
Consistently by rwa2 · 2010-08-15 06:37 · Score: 3, Insightful

Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.
You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
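A rough sketch of that idea in Python, assuming a line-oriented, tab-separated raw format stored as gzip. The record layout (sample, pH, value) is invented for illustration. The gzip format carries a CRC-32 checksum that is verified when the file is read to the end, which is where the integrity check comes from.

```python
import gzip
import os
import tempfile

def write_records(path, records):
    """Write one tab-separated record per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            fh.write("\t".join(str(value) for value in rec) + "\n")

def grep_records(path, needle):
    """Yield the fields of every line containing `needle`, decompressing on the fly."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if needle in line:
                yield line.rstrip("\n").split("\t")

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.tsv.gz")
    write_records(raw, [("sampleA", 7.0, 1.23), ("sampleB", 5.5, 4.56)])
    for rec in grep_records(raw, "sampleB"):
        print(rec)  # ['sampleB', '5.5', '4.56']
```

Post-processing scripts that convert to CSV, Octave, or anything else can all start from grep_records, so the compressed raw file remains the single source.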
Re:Relational Databases won't do! by gmueckl · 2010-08-15 11:12 · Score: 3, Insightful

I hereby humbly suggest that you are, instead, wrong. Here is why:
Scientists are not software developers and never want to be that. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number crunching tools that have to be easy to use and utterly flexible.
At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea what I was actually looking at. And I had to write them myself because these weren't your run of the mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.
In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.
Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.
My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.
Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient to use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.
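To put the line-count question above in concrete terms, here is a sketch of both ways of recording three related values in Python. The database side assumes SQLite via the built-in sqlite3 module and a pre-existing table called results with columns t, x and y; all of those names are hypothetical. Whether the extra setup is worth it is exactly what is being argued here, so take this as an illustration rather than a verdict.

```python
import io
import sqlite3

def record_text(out, t, x, y):
    """The one-liner: three related values on one line of text."""
    print(t, x, y, file=out)

def record_db(conn, t, x, y):
    """The database version, given that the table already exists."""
    with conn:  # commit the insert, or roll back if it fails
        conn.execute(
            "INSERT INTO results (t, x, y) VALUES (?, ?, ?)", (t, x, y)
        )

# Text output needs only an open file (a StringIO stands in for one here).
log = io.StringIO()
record_text(log, 0.1, 2.5, 3.5)

# Database output also needs a connection and a schema set up beforehand.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (t REAL, x REAL, y REAL)")
record_db(db, 0.1, 2.5, 3.5)

print(log.getvalue(), end="")
print(db.execute("SELECT t, x, y FROM results").fetchall())
```

The per-value cost is small once a connection and table exist; the overhead the comment describes sits mostly in creating and maintaining that schema for every one-off program.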

--
http://www.moonlight3d.eu/
Great comments by gringer · 2010-08-15 11:15 · Score: 2, Insightful
Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.
The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.
So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:
- Use a file system, rather than a database
- New data gets put in a consistent place, e.g. /data/<date>/<source>/
- Back up this data. Assuming immutable files, incremental backups don't make much sense.
- Symlink categories to the original data
- When a file is referenced (or attached) in an email, keep a link between email date and file, e.g. /categories/<email>/<date>/<sink>/
- Maintain (preferably autogenerate) and back up a plain-text file linking categories to files. This will help when data gets lost (i.e. accidentally deleted).
My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
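The autogenerated plain-text file from the list above could be produced by a short script along these lines. This is a sketch that assumes the categories are a directory tree of symbolic links on a POSIX file system; the tab-separated index layout and the DANGLING marker are choices made here for illustration.

```python
import os

def index_links(categories_root):
    """Return sorted (link, target, exists) tuples for every symlink under the root."""
    entries = []
    # os.walk does not follow symlinks by default: links to directories show
    # up in dirnames, links to files and broken links show up in filenames.
    for dirpath, dirnames, filenames in os.walk(categories_root):
        for name in dirnames + filenames:
            link = os.path.join(dirpath, name)
            if os.path.islink(link):
                target = os.readlink(link)
                # os.path.exists follows the link, so it is False when dangling.
                entries.append((link, target, os.path.exists(link)))
    return sorted(entries)

def write_index(entries, index_path):
    """Write one 'link<TAB>target<TAB>status' line per symlink."""
    with open(index_path, "w", encoding="utf-8") as fh:
        for link, target, ok in entries:
            status = "ok" if ok else "DANGLING"
            fh.write(f"{link}\t{target}\t{status}\n")
```

Calling write_index(index_links("/categories"), "links_index.tsv") would then record every category link and flag the ones whose data has been moved or deleted; both paths in that call are placeholders.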
--
Ask me about repetitive DNA