How Do You Organize Your Experimental Data?
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is the processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchal matlab structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.
Then, to find what you want, get a search engine that supports faceted navigation.
Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
I agree that this is a candidate for a database. One problem with data bases for researchers is that generally one does not know the right schema before hand ond one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single data base scheme that has lasted 5 let alone ten years and still function without being maintained. Files can do that.
So if you want some bandaid approaches:
1) if you have a mac then, uses aliases rather than symbolic links. alias don't get messed up if you move the file.
2) use hard links rather than symbolic links. THe problem here is that these can get unlinked if you plan to modify the file. But if the file will never change these are just as space efficient and a softlink but tolerate renaming. They can't span across different disks however.
3) poormans database:
give your files a numerical name that chages, typically the date and time they were created. then have a flat file that list the files in some set for each category.
4) low tech database. If you decide to use a database then choose one that is likely never to go out of style. for example pick something like a perl-tie. those are so close to the language that they probably won't get depricated in the next 10 years.
Some drink at the fountain of knowledge. Others just gargle.
I've recently made a comparison of MySQL 5.0, Oracle 10i and HDF5 file based data storage for our space data. The results are amusing (the linked page contains charts and explanations; pay attention to the conclusive chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.
Disclaimer: while I'm positively not a guru DBA and thus admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.
So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV format (comma-separated-value). We ended up using Postgres as it had the best spatial (2-d) indexing, beating MySQL at the time. The expensive commercial DBMS like Oracle didn't have anything that the open-source ones did for modest-sized scientific datasets. I found Postgres was fine for our tables, which were no bigger than around 10 million rows long and 300 columns wide. You might well get better performance using something like HDF but you'll probably spend a lot more time programming to do that, and it won't be as flexible. The only thing you can be sure of in scientific data handling is that the requirements will change often, so flexibility is important. If your scientific data are smallish in volume and pretty consistent in format from one run to the next, you might consider storing the data in the database, in a BLOB (binary large object) if no other data type seems to suit. But a fairly good alternative is just to store the metadata in the database, e.g. filename, date of observation, size, shape, parameters, etc and leave the scienficic data in the files. You can then use the database to select the files you need according to the parameters of the observation or experiment.
I've built LIMS systems that manage peta-bytes of data for companies and small scale data management solutions for my own research. Regardless the scale, the same basic pattern applies: Databases + files + programming languages. Put your meta-data and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type heiracchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis
Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing meta-data. Anything else and you'll be tempted to to your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.
The main points:
1) Use databases for what they're good at: managing meta-data and relations.
2) Use programming languages for what they're good at: analysis.
3) Use file systems for what they're good at: storing raw data.
-Chris
Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.
I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solutions, or maybe stone tablets are.
http://www.scidb.org/
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
Your question is really not clear, in both these fields that I work on, the requirements vary vastly, and it also varies to the users I support (over 100 scientist). some of them have huge data sets, spanning up to 600 GB/file, a single simulation run can give a geologist a 1 TB file.
Others, have a few hundred MB of data. Each is handled differently.
The data itself, can be parsed and stored in in a DB for analysis in some cases, and in others, that is very impractical and will slow down your work.
Each scientist has a different way of doing things.
So the bottom line, if you want any useful answers, be more specific. What field of science (i can tell you are a chemist?), what simulations/tests do you use, how fine are your models are your data sets and what is their format, what kind of data are you interested in, you should seriously consider an archiving solution because i guarantee you will run out of space.
The lunatic is in my head
Pipes are executed in order, and unless you split your data and pass it to several processes, there is nothing "inherently parallel" in them.
The processes in a pipeline run in parallel, unless you are in dos/windows (as konohitowa pointed out).
You often have a process that is reading the file, and a series of successive proceses that are filtering and processing the data. Each one can run on a different core, in parallel
You can do the same with perl, but most perl isn't written that way. It's really cool that scripts I wrote fifteen years ago are scaling incredibly well with additional cores (and that will only get better). I can't say the same for most all perl, python, C programs, etc. Perhaps it would be more fair to say the difference isn't so much the language as it is the monolithic thinking and implementation.
Also, many datasets are often in columnar format. So the data is already split, and the columns are easily divided across multiple processes.
And parallelism aside, awk is often significantly faster at processing text than perl. Especially if you use awka to convert the script to C and compile.
That stuff matters when you're crunching transaction logs at the largest e-commerce sites, etc.
...using relational databases for pretty common scientific tasks sucks badly performance-wise.
Well, it has never been a secret that relational databases do not performs as well as e.g. a bespoke ISAM or hash-indexed data-file, mostly due to the fact that it involves interpretation of of SQL. But then the main purpose of SQL databases has never been to optimise raw performance - rather the idea is to provide maximum, logical flexibility. The beauty of relational databases is that you can change both data annd metadata at the wave of a hand, where you in the high-performing, bespoke database have to go and rewrite significant amounts of code.
At the end of the day you choose your tools to fit your needs, or at least that is what you ought to do.
As for the OP's question: the main problem seems to be one of having to rename and move stuff; this is clearly the area where SQL is strongest.
The problem is, most IT people have no idea what do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).
There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.
Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.
Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:
There's also Bioinformatics, Health/medical informatics, chemical informatics, etc. plug in your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group, or person you can write to to try to get more info and advice.
Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.
Build it, and they will come^Hplain.