How Do You Organize Your Experimental Data?

Posted by timothy on Sunday August 15, 2010 @04:10AM from the can't-remember-where-I-put-my-memory dept.

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"

Matlab Structures by Anonymous Coward · 2010-08-15 04:24 · Score: 4, Interesting

I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
1. Re:Matlab Structures by pz · 2010-08-15 12:09 · Score: 4, Interesting
  
  I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
  Yes, yes, yes.
  I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.
  Also, every single data file, log file, and whatever else needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and, since experiments in my world are day-based, put into a single directory called YYMMDD. I've used this system now for nearly 20 years and have not yet screwed up by using the wrong file. Files are always named in a way that (a) doing a directory listing with alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to what experiment was done.
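A minimal sketch of the naming and read-only conventions described above, written in Python rather than the MATLAB the poster uses. The function names (`archive_path`, `write_original`) and the label argument are hypothetical, chosen only for illustration; the prefix and directory layout follow the YYMMDD-HHMMSS- and YYMMDD scheme from the comment.

```python
import stat
from datetime import datetime
from pathlib import Path


def archive_path(root: Path, when: datetime, label: str) -> Path:
    """Build <root>/YYMMDD/YYMMDD-HHMMSS-<label> for one original file."""
    day_dir = root / when.strftime("%y%m%d")
    return day_dir / f"{when.strftime('%y%m%d-%H%M%S')}-{label}"


def write_original(root: Path, when: datetime, label: str, text: str) -> Path:
    """Write an original data file once, then clear its write bits."""
    path = archive_path(root, when, label)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        # Originals are never modified; a correction gets its own tagged file.
        raise FileExistsError(f"refusing to overwrite original: {path}")
    path.write_text(text)
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return path
```

Because the timestamp leads the name, a plain alphabetical directory listing is also a chronological one, which is the property the comment relies on.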
  In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.
  Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set so that when bugs are discovered we can know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions, and need to be accorded the appropriate care. If you're doing anything ad-hoc after more than one experiment, then you aren't putting enough time into devising a proper system.
  (I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)
  I also make sure that when figures are created for my papers I've got a clear and absolutely reproducible path from the raw data to the final figures that includes ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing that these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.
  
  --
  
  Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Don't bother with hierarchies by ccleve · 2010-08-15 04:25 · Score: 5, Interesting

Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.
Then, to find what you want, get a search engine that supports faceted navigation.
Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
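The tagging idea above can be sketched in a few lines of Python. This is an illustration only, not how any of the named search engines work: the catalog contents, facet names, and function names are all made up, and each file carries exactly one value per facet to keep the example short.

```python
from math import prod

# Hypothetical catalog: file name -> one tag per facet.
CATALOG = {
    "run_001.dat": {"sample": "lysozyme", "ph": "7.0", "experiment": "nmr"},
    "run_002.dat": {"sample": "lysozyme", "ph": "5.5", "experiment": "nmr"},
    "run_003.dat": {"sample": "ubiquitin", "ph": "7.0", "experiment": "cd"},
}


def search(catalog, **wanted):
    """Return the files whose tags match every requested facet value."""
    return sorted(
        name
        for name, tags in catalog.items()
        if all(tags.get(facet) == value for facet, value in wanted.items())
    )


def distinct_combinations(facet_sizes):
    """Number of distinct tag combinations a set of facets can express."""
    return prod(facet_sizes)
```

For example, `search(CATALOG, sample="lysozyme")` narrows by one facet and `search(CATALOG, ph="7.0", experiment="cd")` by two, with no directory to reorganize. `distinct_combinations([10, 10, 10, 10])` gives the 10,000 figure quoted above.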
Databases are not as convenient as files by goombah99 · 2010-08-15 04:40 · Score: 2, Interesting

I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand, and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted 5, let alone 10, years and still functions without being maintained. Files can do that.
So if you want some bandaid approaches:
1) If you have a Mac, use aliases rather than symbolic links. Aliases don't get messed up if you move the file.
2) Use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change, these are just as space efficient as a softlink but tolerate renaming. They can't span across different disks, however.
3) Poor man's database:
Give your files a numerical name that changes from file to file, typically the date and time they were created. Then have a flat file that lists the files in some set for each category.
4) Low-tech database. If you decide to use a database, then choose one that is likely never to go out of style. For example, pick something like a Perl tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
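Option 2 is easy to check for yourself. The sketch below, in Python, creates one data file, points a hard link and a symbolic link at it, renames the original, and reports what survived. The file names and the `demo_links` helper are hypothetical, and it assumes a filesystem that supports both link types (typical on Linux and macOS).

```python
import os
from pathlib import Path


def demo_links(workdir: Path) -> dict:
    """Rename a file and report how a hard link and a symlink to it fare."""
    original = workdir / "run_001.dat"
    original.write_text("raw data\n")

    hard = workdir / "by_sample_hard.dat"
    soft = workdir / "by_sample_soft.dat"
    os.link(original, hard)     # hard link: a second name for the same data
    os.symlink(original, soft)  # symlink: stores only the path it points to

    original.rename(workdir / "renamed_run_001.dat")

    return {
        # The hard link still reaches the data after the rename.
        "hard_link_readable": hard.exists() and hard.read_text() == "raw data\n",
        # The symlink still exists as a link, but its target path is gone.
        "symlink_dangling": soft.is_symlink() and not soft.exists(),
    }
```

The same caveats as in the list apply: the hard link must live on the same disk as the file, and a tool that "modifies" a file by writing a new copy and renaming it over the old one will silently detach the link.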

--
Some drink at the fountain of knowledge. Others just gargle.
Re:Use databases! by rumith · 2010-08-15 04:54 · Score: 4, Interesting

Hello, I'm a space research guy.
I've recently made a comparison of MySQL 5.0, Oracle 10g and HDF5 file based data storage for our space data. The results are amusing (the linked page contains charts and explanations; pay attention to the conclusive chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.
Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to be one. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.
So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
Re:Use databases! by clive_p · 2010-08-15 06:06 · Score: 2, Interesting

As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV format (comma-separated values). We ended up using Postgres as it had the best spatial (2-d) indexing, beating MySQL at the time. The expensive commercial DBMSs like Oracle didn't have anything that the open-source ones lacked for modest-sized scientific datasets. I found Postgres was fine for our tables, which were no bigger than around 10 million rows long and 300 columns wide. You might well get better performance using something like HDF but you'll probably spend a lot more time programming to do that, and it won't be as flexible. The only thing you can be sure of in scientific data handling is that the requirements will change often, so flexibility is important. If your scientific data are smallish in volume and pretty consistent in format from one run to the next, you might consider storing the data in the database, in a BLOB (binary large object) if no other data type seems to suit. But a fairly good alternative is just to store the metadata in the database, e.g. filename, date of observation, size, shape, parameters, etc., and leave the scientific data in the files. You can then use the database to select the files you need according to the parameters of the observation or experiment.
Re:Use databases! by rockmuelle · 2010-08-15 06:22 · Score: 3, Interesting

I've built LIMS systems that manage petabytes of data for companies and small scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: Databases + files + programming languages. Put your meta-data and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.
Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing meta-data. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.
The main points:
1) Use databases for what they're good at: managing meta-data and relations.
2) Use programming languages for what they're good at: analysis.
3) Use file systems for what they're good at: storing raw data.
-Chris
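The pattern above (metadata in SQLite, raw data in files, a pointer from one to the other) can be sketched with Python's standard `sqlite3` module. The table name, column names, and helper functions here are assumptions made for illustration; a real schema would follow your own experimental conditions.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a metadata catalog; the raw data stays in files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id         INTEGER PRIMARY KEY,
               file_path  TEXT NOT NULL UNIQUE,  -- pointer to the raw file
               sample     TEXT NOT NULL,
               ph         REAL,
               experiment TEXT NOT NULL,
               acquired   TEXT NOT NULL          -- ISO 8601 timestamp
           )"""
    )
    return conn


def add_run(conn, file_path, sample, ph, experiment, acquired):
    """Record one raw data file and its experimental conditions."""
    conn.execute(
        "INSERT INTO runs (file_path, sample, ph, experiment, acquired) "
        "VALUES (?, ?, ?, ?, ?)",
        (file_path, sample, ph, experiment, acquired),
    )
    conn.commit()


def find_files(conn, sample=None, experiment=None):
    """Return file paths matching the given conditions, oldest first."""
    clauses, params = [], []
    if sample is not None:
        clauses.append("sample = ?")
        params.append(sample)
    if experiment is not None:
        clauses.append("experiment = ?")
        params.append(experiment)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    rows = conn.execute(
        f"SELECT file_path FROM runs{where} ORDER BY acquired", params
    )
    return [row[0] for row in rows]
```

The database only answers "which files do I need?"; the analysis code then opens those paths itself, which keeps the analysis out of SQL as recommended above.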
Re:Use databases! by mobby_6kl · 2010-08-15 06:50 · Score: 3, Interesting
Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.
1. I don't know where you get the data from, but anything awk/sed can do, so can Perl. And from Perl (or PHP, if you're lame) it's very easy to load the data into a database.
2. It's easy to connect to an SQL server from the remote machine and either dump everything or just select what you need. You'll need more than notepad.exe to do this, but it's not rocket science. Pivots in Excel can be really useful, but Excel can easily connect to the same database and query the data directly from there and use it for your charts/tables.
3. Since by this point you'll already have all the data in the db, exporting it to CSV would be trivial. Or you could even skip Google Docs entirely and generate your charts with tools which can automatically query your database.
I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solutions, or maybe stone tablets are.
SciDB, Open Source DB for Science by geoffrobinson · 2010-08-15 07:10 · Score: 2, Interesting

http://www.scidb.org/

--
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
It really depends, be more specific by floydman · 2010-08-15 12:00 · Score: 2, Interesting

I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
Your question is really not clear. In both these fields that I work in, the requirements vary vastly, and they also vary among the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB/file; a single simulation run can give a geologist a 1 TB file.
Others have a few hundred MB of data. Each is handled differently.
The data itself can be parsed and stored in a DB for analysis in some cases, and in others, that is very impractical and will slow down your work.
Each scientist has a different way of doing things.
So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, what kind of data are you interested in? You should also seriously consider an archiving solution, because I guarantee you will run out of space.

--
The lunatic is in my head
1. Re:It really depends, be more specific by Anonymous Coward · 2010-08-15 14:00 · Score: 1, Interesting
  
  As a former 'commercial' programmer and now a scientific programmer, I have gone through an adjustment to an environment where I now use flat files (mostly binary, sometimes text) and HDF and NetCDF files. The datasets are non-homogeneous and very dynamic in nature: you (and not just you, but the scientists involved) may need to blow away a dataset, or just a specific portion of it, and then regenerate it. You may produce near-duplicate datasets with a slightly different algorithm and need to keep every version. It can be a challenge at times and is very, very different from a CRUD scenario.
  I have no magic bullet but we designate a key server with a very large RAID subsystem as our data repository. We split everything into directory structures and symlink where appropriate to make navigation easier for novices exploring a new dataset. We nominate individuals as the 'structure gatekeeper' and mandate any changes in that dataset structure to be OK'd by that scientist. For current experiments which are I/O bound, we symlink separately mounted SSDs and high-perf HDDs to give us the best possible sequential read and write performance. For duplication onto scientists' local drives, we have sync scripts that will take portions of the complete data repository and copy them to/from the local drives.
  Of the NoSQL's, I have used/played with couchdb and although it is a great db, it's not appropriate for us. There have been changes to the internal structures over the last year and it would be a nightmare to have to migrate our mass of data from an old format to a new. Flat files are easily understood and manipulated by every operating system, easy to write code to read/write from and if you want to formalize a number of datasets as 'gold quality', HDF5 is the accepted standard.
Re:Use databases! by Anonymous Coward · 2010-08-15 18:33 · Score: 1, Interesting

Pipes are executed in order, and unless you split your data and pass it to several processes, there is nothing "inherently parallel" in them.
The processes in a pipeline run in parallel, unless you are in dos/windows (as konohitowa pointed out).
You often have a process that is reading the file, and a series of successive processes that are filtering and processing the data. Each one can run on a different core, in parallel.
You can do the same with perl, but most perl isn't written that way. It's really cool that scripts I wrote fifteen years ago are scaling incredibly well with additional cores (and that will only get better). I can't say the same for most all perl, python, C programs, etc. Perhaps it would be more fair to say the difference isn't so much the language as it is the monolithic thinking and implementation.
Also, many datasets are often in columnar format. So the data is already split, and the columns are easily divided across multiple processes.
And parallelism aside, awk is often significantly faster at processing text than perl. Especially if you use awka to convert the script to C and compile.
That stuff matters when you're crunching transaction logs at the largest e-commerce sites, etc.
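To make the pipeline point concrete, here is a small Python sketch that wires two separate processes together with an OS pipe, the way a shell does for `filter | sum`. Both stages are alive at the same time, so the operating system is free to schedule them on different cores. The stage scripts and the three-column input format are invented for the example; in practice the stages would be awk, sort, and friends.

```python
import subprocess
import sys

# Stage 1 (hypothetical filter): pass through lines whose 2nd column is "ok".
FILTER_STAGE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if line.split()[1] == 'ok':\n"
    "        sys.stdout.write(line)\n"
)

# Stage 2: sum the 3rd column of whatever stage 1 lets through.
SUM_STAGE = (
    "import sys\n"
    "print(sum(float(line.split()[2]) for line in sys.stdin))\n"
)


def run_pipeline(text: str) -> float:
    """Feed text through two concurrently running processes joined by a pipe."""
    stage1 = subprocess.Popen(
        [sys.executable, "-c", FILTER_STAGE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    stage2 = subprocess.Popen(
        [sys.executable, "-c", SUM_STAGE],
        stdin=stage1.stdout, stdout=subprocess.PIPE, text=True,
    )
    stage1.stdout.close()  # parent's copy; stage2 now owns the read end
    stage1.stdin.write(text)
    stage1.stdin.close()   # signals end of input to stage 1
    output, _ = stage2.communicate()
    stage1.wait()
    return float(output)
```

Stage 2 starts consuming lines as soon as stage 1 emits them; neither waits for the other to finish first, which is the parallelism the comment describes.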
Re:Use databases! by jandersen · 2010-08-15 20:14 · Score: 2, Interesting

...using relational databases for pretty common scientific tasks sucks badly performance-wise.
Well, it has never been a secret that relational databases do not perform as well as e.g. a bespoke ISAM or hash-indexed data-file, mostly due to the fact that it involves interpretation of SQL. But then the main purpose of SQL databases has never been to optimise raw performance; rather, the idea is to provide maximum logical flexibility. The beauty of relational databases is that you can change both data and metadata at the wave of a hand, whereas in the high-performing, bespoke database you have to go and rewrite significant amounts of code.
At the end of the day you choose your tools to fit your needs, or at least that is what you ought to do.
As for the OP's question: the main problem seems to be one of having to rename and move stuff; this is clearly the area where SQL is strongest.
Not all IT is the same -- you want 'Informatics' by oneiros27 · 2010-08-15 23:41 · Score: 3, Interesting
The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).
There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.
Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.
Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:
- Geoinformatics : GEON Grid
- Earth Science : ESIP Federation
- Earth and Space Science : ESSI
- Astronomy : IVOA
There's also bioinformatics, health/medical informatics, chemical informatics, etc. Plug your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group or person you can write to for more info and advice.
Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.
--
Build it, and they will come^Hplain.