How Do You Organize Your Experimental Data?
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
I store them in first posts.
Subj.
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
Whenever I need to find anything, I use "Command-F"
For justice, we must go to Don Corleone
In my experience, the best thing is to let the structure stand as it was the first time you stored the data.
Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.
I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
Organize your data like I organize my bedroom: Everything on the floor.
Look, how big is your desk? 8 square feet? How big is your floor? Several hundred square feet? If you can see all of your stuff, then you can access it instantly. Organized Chaos.
Now, if you'll excuse me... I think something's moving around in my trash can.
http://www.sqlite.org/ a "replacement for fopen()" -- http://www.sqlite.org/about.html
http://stephan.sugarmotor.org
SQL comes in really handy. I can imagine several simple scripts + an SQLite indexing table. Or anything else.
I was going to say the same thing. You can also check whether there is any software in your domain that might help you insert the data into a database. If not, you can keep the data as flat files but have records in the database that store the paths to them. A little bit of programming, but not much, will get you a list of file paths that you can then retrieve with a bash script.
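A minimal sketch of the "flat files on disk, paths and metadata in a database" idea, using Python's built-in sqlite3 module. The table name, column names, and example paths are all made up for illustration; they are not from anyone's real setup.

```python
import sqlite3

# In-memory database for the sketch; pass a filename to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE datasets (
           path   TEXT PRIMARY KEY,  -- where the flat file lives
           sample TEXT,
           ph     REAL,
           kind   TEXT               -- experiment type
       )"""
)

# Hypothetical records; in practice these come from your own bookkeeping.
rows = [
    ("raw/20100815/run01.dat", "lysozyme", 7.0, "NMR"),
    ("raw/20100815/run02.dat", "lysozyme", 4.5, "NMR"),
    ("raw/20100816/run01.dat", "ubiquitin", 7.0, "CD"),
]
conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", rows)


def find_paths(conn, **criteria):
    """Return paths whose metadata columns equal the given values."""
    allowed = {"sample", "ph", "kind"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    where = " AND ".join(f"{col} = ?" for col in criteria) or "1"
    sql = f"SELECT path FROM datasets WHERE {where} ORDER BY path"
    return [r[0] for r in conn.execute(sql, tuple(criteria.values()))]


print(find_paths(conn, sample="lysozyme", kind="NMR"))
```

The list that comes back is just file paths, so it can be handed to a shell script or any analysis tool that reads plain files.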
I wonder if there's an opensource project to create and manage extended attributes on supporting filesystems?
http://www.freedesktop.org/wiki/CommonExtendedAttributes
But you're likely to get better results from having filenames be a field in a DB, and letting all the metadata live in other DB fields.
ps: here's a CPAN entry that manipulates extended attributes: http://search.cpan.org/dist/File-ExtAttr/lib/File/ExtAttr.pm
I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
OK, subject is the short answer, here's the big answer
Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:
at the deeper, most basic level organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc)
Then you store this in a NoSQL db (like CouchDb or Redis) and index it the way you like, still if you don't index you can always search it manually (slower, still...)
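A small sketch of the idea, assuming nothing beyond Python's standard json module: records with different fields are serialized as JSON and searched with an unindexed scan. The record contents and field names are invented, and a plain list stands in for the document store.

```python
import json

# Two hypothetical records with different fields, as experiments often have.
records = [
    {"id": "exp-001", "type": "titration", "ph": [4.0, 5.0, 6.0], "sample": "A"},
    {"id": "exp-002", "type": "imaging", "magnification": 300, "sample": "A"},
]

# Serialize to JSON text (what you would hand to a document store or a file).
stored = [json.dumps(r) for r in records]


def search(stored_docs, **criteria):
    """Unindexed linear scan: slow for big collections, but needs no schema."""
    hits = []
    for text in stored_docs:
        doc = json.loads(text)
        if all(doc.get(k) == v for k, v in criteria.items()):
            hits.append(doc["id"])
    return hits


print(search(stored, sample="A"))
print(search(stored, type="imaging"))
```

A record that lacks a field simply never matches a query on that field, which is what makes this tolerant of experiments that vary from run to run.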
how long until
Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.
Then, to find what you want, get a search engine that supports faceted navigation.
Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
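To make the facet arithmetic concrete: four facets with ten values each give 10 x 10 x 10 x 10 = 10,000 distinct combinations. The sketch below counts them and shows a tiny faceted lookup with a controlled vocabulary. Facet names, values, and file names are invented for illustration; this is not how any of the commercial engines work internally.

```python
from itertools import product

# Four hypothetical facets with a controlled vocabulary of ten values each.
facets = {
    "sample": [f"sample{i}" for i in range(10)],
    "ph": [f"pH{i + 3}" for i in range(10)],
    "method": [f"method{i}" for i in range(10)],
    "year": [str(2001 + i) for i in range(10)],
}

# Every distinct combination corresponds to one leaf of an equivalent hierarchy.
combinations = sum(1 for _ in product(*facets.values()))
print(combinations)

# Tag each dataset with one value per facet; no directory tree is involved.
datasets = {
    "run_a.dat": {"sample": "sample1", "ph": "pH7", "method": "method2", "year": "2009"},
    "run_b.dat": {"sample": "sample1", "ph": "pH4", "method": "method2", "year": "2010"},
    "run_c.dat": {"sample": "sample3", "ph": "pH7", "method": "method5", "year": "2010"},
}


def select(datasets, **wanted):
    """Faceted lookup: keep datasets matching every requested facet value."""
    for facet, value in wanted.items():
        if value not in facets[facet]:
            raise ValueError(f"{value!r} is not in the vocabulary for {facet!r}")
    return sorted(
        name for name, tags in datasets.items()
        if all(tags[f] == v for f, v in wanted.items())
    )


print(select(datasets, sample="sample1"))
print(select(datasets, ph="pH7", year="2010"))
```

Rejecting values outside the vocabulary is the cheap way to stop tags from drifting into "pH7", "ph 7", and "pH=7" over the years.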
Instead of symlinking to directories,
create directories of hard links to the files.
Then you can move files around whenever you like,
and you never have any dangling links.
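A quick sketch of why this works, assuming a POSIX-style filesystem and that the data and the link directories live on the same volume (hard links cannot cross filesystems). The directory and file names are invented.

```python
import os
import tempfile

root = tempfile.mkdtemp()  # throwaway directory for the sketch
os.makedirs(os.path.join(root, "raw"))
os.makedirs(os.path.join(root, "by_ph", "7"))

original = os.path.join(root, "raw", "run01.dat")
with open(original, "w") as f:
    f.write("1.0 2.0 3.0\n")

# A hard link is a second directory entry for the same file contents.
linked = os.path.join(root, "by_ph", "7", "run01.dat")
os.link(original, linked)

# Move the "original" somewhere else; the hard link does not care.
moved = os.path.join(root, "raw", "renamed_run01.dat")
os.rename(original, moved)

with open(linked) as f:
    contents = f.read()
same_file = os.path.samefile(linked, moved)
print(contents.strip(), same_file)
```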
...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.
For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.
Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)
Life is too short. Get someone else to do it, under the disguise of valuable field experience.
The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data and supporting graph data stores. Give it a spin.
var sig = function() { sig(); }
I never understood how you can have something organized or not.
I organize my stuff at the planetary level.
Universe > Solar System > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
I think I'm pretty well organized even though I misplace stuff all the time.
You may want to consider a scientific workflow system. These systems handle both data storage (including meta-data and provenance -- where the data came from), and design and execution of computational experiments. If you are concerned about the complexity of the meta-data (e.g., pH value..) and would like to make sure to be able sort things according to this, you want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox.
"Search, don't sort".
The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.
Now, you can start to search through the data by thinking about which sets of data to compare. Much easier.
This is very general advice - if you have one experimenter and a couple of experiments, just use a lab notebook. If you have a handful of experimenters and ~100 experiments, try a spreadsheet or a well organized structure on disk. If you have many people involved, or thousands of experiments, or both, you need something to help manage all of that in a way that lets you think in terms of sets rather than individual data files. Otherwise, you'll find yourself wearing your 'data steward' hat way too often, and not wearing your 'experimentalist' or 'analyst' hats much at all.
-V-
Who can decide a priori? Nobody.
-Sartre
$PRJ_ROOT/data/theoretical
$PRJ_ROOT/data/fits
$PRJ_ROOT/data/doesnt_fit
$PRJ_ROOT/data/doesnt_fit/fixed
$PRJ_ROOT/data/made_up
This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.
I have to agree with the database suggestions, though something NoSQL-ish may work better.
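The write-to-a-temporary-file-then-rename behaviour is easy to reproduce. This sketch assumes a POSIX-style filesystem with symlink support; the file names are invented. After the "safe save", the hard link still points at the old contents while the symlink follows the path to the new ones.

```python
import os
import tempfile

root = tempfile.mkdtemp()
data = os.path.join(root, "run01.dat")
hard = os.path.join(root, "hard_run01.dat")
soft = os.path.join(root, "soft_run01.dat")

with open(data, "w") as f:
    f.write("old\n")
os.link(data, hard)      # hard link: another name for the same inode
os.symlink(data, soft)   # symlink: a pointer to the path

# "Safe save": write a new file, then atomically rename it over the old path.
tmp = data + ".tmp"
with open(tmp, "w") as f:
    f.write("new\n")
os.replace(tmp, data)

with open(hard) as f:
    via_hard = f.read()
with open(soft) as f:
    via_soft = f.read()
print(repr(via_hard), repr(via_soft))
```

Whether this bites you depends on the tools that touch your files: programs that overwrite in place keep hard links intact, while editors and sync tools that save via rename silently split them.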
Don't thank God, thank a doctor!
I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted 5, let alone 10, years and still functions without being maintained. Files can do that.
So if you want some bandaid approaches:
1) If you have a Mac, then use aliases rather than symbolic links. Aliases don't get messed up if you move the file.
2) Use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change, these are just as space efficient as a softlink but tolerate renaming. They can't span different disks, however.
3) Poor man's database:
Give your files a numerical name that changes, typically the date and time they were created. Then have a flat file that lists the files in some set for each category.
4) Low-tech database. If you decide to use a database, then choose one that is likely never to go out of style. For example, pick something like a Perl tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
Some drink at the fountain of knowledge. Others just gargle.
is to use CSV (comma/tab separated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.
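For what it's worth, a short sketch of writing and re-reading such a file with Python's standard csv module. The column names and values are invented, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical measurements; column names are made up for the sketch.
rows = [
    {"sample": "A", "ph": 7.0, "absorbance": 0.412},
    {"sample": "B", "ph": 4.5, "absorbance": 0.388},
]

buffer = io.StringIO()  # stands in for open("results.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["sample", "ph", "absorbance"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading it back: every value comes back as a string, so convert explicitly.
loaded = [
    {"sample": r["sample"], "ph": float(r["ph"]), "absorbance": float(r["absorbance"])}
    for r in csv.DictReader(io.StringIO(text))
]
```

Keeping a header row costs one line per file and is what lets a spreadsheet or database importer name the columns later.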
Mod me up/Mod me down: I wont frown as I've no crown
To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:
1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.
2. Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?
I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".
http://www.moonlight3d.eu/
This is exactly what SparkLab aims to solve, take a look here: http://sparklix.com/demo-movie It's free for academic and non-profit organizations. Personal free edition will be up later this year.
A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.
If your data is derived or crunched, You may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html , and take heed.
The previous suggestions about leaving your data intact and refining the access are good advice. I have used and developed some network dbms systems for this type of data. The current trend seems to be toward Object-Oriented Network dbms systems, but I'm not sure that is the way to go; OONDBMS tends to be static and hard to maintain in a dynamic experimental environment. The largest experimental environment that I've had the opportunity to work on, with an energy company here in Houston, was a statistical analysis of nuclear reactions. The data was constantly changing and we needed a self-referencing, dynamic data repository. This is the type of system where you download data sets and do your analysis AFTER you have acquired it locally. The dbms was written in FORTRAN90 and was very fast, but you need a team for something like this unless you are expert enough to program it all yourself. It actually used very little code, but the record management and indexes (mostly ISAM/inverted ISAM) took massive amounts of computer power. There are now some cute tools in FORTRAN 2000 that allow you to use a web browser as a front end, but I don't usually want to look at the data being gathered; I usually want to crunch the statistics and see the results. The browser front-ends I have seen tend to require too much tweaking in order to adapt to the changing data parameters. Remote terminals make more sense. Maybe you should be willing to change the method of accessing the data and not try to maintain dozens of links.
"The mind works quicker than you think!"
I would just put each set of experimental data in a separate subdirectory. Within each subdirectory I'd put a file with a specific name (e.g., "description.txt") in which you briefly write up exactly what the experimental data is, how it was generated (e.g., if generated by a program, give the arguments and/or pointers to input data), and some keywords to allow it to be indexed/searched. Then I'd use your standard OS search tools to find the description file(s) you're looking for, thereby allowing you to locate your data based on its description rather than some brittle directory hierarchy.
I have a pretty standard setup for generating experimental data in my work. Whenever I run an experiment (which are usually simulations), I have a wrapper script that generates a random (meaningless) subdirectory name, copies my simulation binary and configuration to that directory (so I can reproduce the results later in case either my simulator code or its configuration changes), and prompts me to enter a description of what it is I'm simulating, and asks me to provide some keyword tags. The only way I can find the data afterward is to search the description files from the last step, because the data is otherwise just in a randomly-named directory.
Of course, this scheme depends on you doing a decent job of describing your data and providing keywords, but I don't think you can get around that with any technique. At some point you have to inject some human labeling/categorization. Directories and symlinks are just a pretty restrictive way of organizing things.
Good luck and enjoy!
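A rough sketch of that kind of wrapper, under a few assumptions: the file name "description.txt" comes from the comment above, but the function names, the "keywords:" line format, and the use of uuid for the random directory name are my own choices. The real wrapper also copies the simulation binary and configuration into the directory; that step is left out here.

```python
import os
import tempfile
import uuid

archive = tempfile.mkdtemp()  # stands in for your real data root


def new_run(archive, description, keywords):
    """Create a randomly named run directory holding a description.txt."""
    run_dir = os.path.join(archive, uuid.uuid4().hex)
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "description.txt"), "w") as f:
        f.write(description + "\n")
        f.write("keywords: " + ", ".join(keywords) + "\n")
    return run_dir


def find_runs(archive, term):
    """Return run directories whose description.txt mentions the term."""
    term = term.lower()
    hits = []
    for name in sorted(os.listdir(archive)):
        desc = os.path.join(archive, name, "description.txt")
        if os.path.isfile(desc):
            with open(desc) as f:
                if term in f.read().lower():
                    hits.append(os.path.join(archive, name))
    return hits


run_a = new_run(archive, "Aggregation of particles at pH 7", ["aggregation", "pH7"])
run_b = new_run(archive, "Baseline with no polymer", ["baseline"])
print(find_runs(archive, "aggregation"))
```

The search here is a plain scan of the description files, which is also what grep or the OS search tool would be doing for you.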
In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to trouble shoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.
Sorry, I forgot to include the fact that the network dbms system does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.
"The mind works quicker than you think!"
You need to start using a database. You don't have to actually put the data in a database but all of the meta data needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time record the file system location along with all other meta data that is relevant. Then some simple database queries, e.g. embedded in some web pages can retrieve the location and even the data. You can of course also store the data in a database as well if you wish. I personally find it more practical to do it this way.
If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget the keywords that I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.
Currently hooked on AMP
Well CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and add to it the simulated data and we have a helluva lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root
the /store is a beginning marker of the logical filename region that different sites can map differently (who uses NFS, who uses http etc etc)
/mc/ -> it's monte carlo data
/Summer10/ -> the data was produced during Summer of 2010
/Wenu/ -> it's a simulation of W decaying to electron and neutrino
/GEN-SIM-RECO/ -> the data generation steps that have been done
/START37_.../ -> the detector conditions that have been used (the actual full description of the conditions is in some central database)
/0136/ -> is the serial number (actually I'm not 100% sure, but it's related to the production workflow etc)
/0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename, the hash is due to the fact that the process has to make sure there are no conflicts in filenames
Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root
This file is real data, taken during the first run of 2010 and filtered to the MinimumBias primary dataset (related to event trigger content). The datafiles in there contain RECO content and were done during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers) and then the filename.
You could use a similar structure to differentiate the datafiles that you actually use. The good thing is that you can map such filenames separately everywhere as long as you change the prefix according to the protocol used (we use for example file:, http:, gridftp:, srm: etc). You can also easily share data with other collaborating sites as long as everyone uses similar structure it's quite good. No need for special databases etc. If you need some lookup functionality, then one option is a simple find (assuming you have filesystem access) or you could build a database in parallel and you can use the LFN structure to index things etc.
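A toy sketch of the two ideas, parsing a logical file name into its components and mapping it onto a site-specific prefix. This is not CMS software: the site prefixes, the example host, and the field names (kind, period, dataset, tier, tag, serial, filename) are all invented labels for the path components described above.

```python
# Hypothetical site prefixes; real sites define their own mappings.
SITE_PREFIXES = {
    "local": "file:/data/cms",
    "web": "http://storage.example.org/cms",
}

# Made-up labels for the path components that follow /store/.
FIELDS = ["kind", "period", "dataset", "tier", "tag", "serial", "filename"]


def to_physical(lfn, site):
    """Map a logical file name to a site-specific location by prefixing it."""
    if not lfn.startswith("/store/"):
        raise ValueError("logical file names start with /store/")
    return SITE_PREFIXES[site] + lfn


def parse_lfn(lfn):
    """Split an LFN into named components, following the layout described above."""
    parts = lfn.strip("/").split("/")
    if parts[0] != "store" or len(parts) != len(FIELDS) + 1:
        raise ValueError(f"unexpected LFN layout: {lfn}")
    return dict(zip(FIELDS, parts[1:]))


lfn = ("/store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/"
       "0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root")
info = parse_lfn(lfn)
print(info["kind"], info["dataset"], info["serial"])
print(to_physical(lfn, "web"))
```

Because the metadata is recoverable from the name alone, a parallel lookup database can always be rebuilt by walking the tree and parsing each path.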
I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
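To give a feel for the key-value approach, here is a sketch using Python's standard dbm.dumb module as a stand-in; it is not BerkeleyDB and would not scale to millions of files, but the access pattern is the same. The keys reuse the original relative paths so the old directory structure is preserved. Paths and contents are invented.

```python
import dbm.dumb
import os
import tempfile

store_path = os.path.join(tempfile.mkdtemp(), "datasets")

# Keys mirror the original relative paths, so the old structure is preserved.
files = {
    "sampleA/pH7/run01.txt": "0.1 0.2 0.3\n",
    "sampleA/pH4/run01.txt": "0.4 0.5 0.6\n",
    "sampleB/pH7/run01.txt": "0.7 0.8 0.9\n",
}

db = dbm.dumb.open(store_path, "c")
for key, text in files.items():
    db[key] = text  # str keys and values are stored as UTF-8 bytes
db.close()

db = dbm.dumb.open(store_path, "r")
one = db["sampleA/pH7/run01.txt"].decode("utf-8")
sample_a = sorted(k.decode("utf-8") for k in db.keys() if k.startswith(b"sampleA/"))
db.close()
print(one.strip())
print(sample_a)
```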
I haven't had to store experimental results like that. My work produces prototypes, some data, demos and support documentation. There are tons of KM tools out there to manage heterogeneous data in a recoverable way. We've used document repositories like Hummingbird (acceptable) and of course SharePoint. The key (literally) is including the right metadata and tags when you check in the element. When a data set goes dormant (static) you can tarball the CVS tree or whatever and drop it in the repo. Then there's Knowledge Discovery, something we've created tools for. They let you understand how you got that idea from three hours of web/repo surfing.
First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.
Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.
If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.
In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').
I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
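One way to read this advice in terms of the original question: keep the data in a fixed layout, and treat the symlink tree as a disposable index that a script regenerates from recorded metadata, so there is never a reason to repair thousands of links by hand. The sketch below assumes a filesystem with symlink support; the "meta.json" file name, the metadata keys, and the run names are hypothetical.

```python
import json
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
data_root = os.path.join(root, "data")    # canonical layout, never renamed
index_root = os.path.join(root, "index")  # disposable, rebuilt on demand

# Hypothetical runs named date_machine_user, each with a small metadata file.
runs = {
    "20100815_nmr600_alice": {"sample": "lysozyme", "ph": "7.0"},
    "20100816_nmr600_bob": {"sample": "lysozyme", "ph": "4.5"},
    "20100817_cd1_alice": {"sample": "ubiquitin", "ph": "7.0"},
}
for name, meta in runs.items():
    os.makedirs(os.path.join(data_root, name))
    with open(os.path.join(data_root, name, "meta.json"), "w") as f:
        json.dump(meta, f)


def rebuild_index(data_root, index_root, keys):
    """Delete the old symlink index and regenerate it from the metadata files."""
    shutil.rmtree(index_root, ignore_errors=True)
    for name in sorted(os.listdir(data_root)):
        run_dir = os.path.join(data_root, name)
        with open(os.path.join(run_dir, "meta.json")) as f:
            meta = json.load(f)
        for key in keys:
            bucket = os.path.join(index_root, f"by_{key}", meta[key])
            os.makedirs(bucket, exist_ok=True)
            os.symlink(run_dir, os.path.join(bucket, name))


rebuild_index(data_root, index_root, ["sample", "ph"])
ph7 = sorted(os.listdir(os.path.join(index_root, "by_ph", "7.0")))
print(ph7)
```

Changing the indexing scheme then means editing the list of keys and rerunning the script, and the script itself is the record of how the index was built.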
I did a little bioinformatics in the past, and we were using PostgreSQL to manage our results. It was nice because you can create meaningful fields to query in the future. It took some time developing the system, but it really helped out in the long run. We had to consider errors in the readings of the results and had to incorporate a little bit of fuzzy logic into the tools we used to run comparisons on the database.
If you are at a university or near a university, the computer science dept may give a few students credit to build you a system that can handle it, so you don't have to.
A Laboratory Information Management System will help you store, organize, analyze and data-mine your data.
"linux" is a very common word and was not included in your search.
I agree. It depends.
Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.
If not, then I'd look at MongoDB with GridFS. I'd even go further and explore GridFS-FUSE (a mountable file system version of MongoDB/GridFS).
With GridFS-FUSE, you have a crazy powerful database/file system combo. Now, since MongoDB is a NoSQL database, you cannot do SQL queries against it. You can store and retrieve key-value pairs, NoSQL "documents," and actual files with MongoDB/GridFS/GridFS-FUSE.
Instead of sorting datasets, use a testlist database (flat files). The test contains/links/points to its dataset. The test lists are selected at test run time. Each entry in a test list tells how to generate the specific test environment for the test. A test list entry contains the test, the RCS tag/version of the test to be "gotten", the test seed, an array of exit codes that should be retried, how many retries, whether the test is gating, and an array of test dependencies. A test run can be considered to pass even though an individual, non-gating test fails. One test entry may extract and prepare the test data and other dependent entries can then run against that test dataset.
It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.
This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kinds of systems should you ever transfer to industry and work in data-generating or data-processing positions.
Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or of not being able to make cross-references (patents!), is much too high if you let people run their own set-ups. Do yourself, and your research group, a favor: Get some grant money and purchase a robust commercial set-up at least for your group, or better your department. Entry level systems, with academic discounts, are affordable. There are no competitive open-source solutions.
Start your research here:
http://en.wikipedia.org/wiki/LIMS
(though the systems listed there are instrument-centric; if you are more into generic chemistry there are other standard packages by companies such as Accelrys and CambridgeSoft).
I used to have a two word answer for this question: Use BeOS
But now it's a six word answer (*sigh*): Invent time machine, then use BeOS
--MarkusQ
It's already been said, but it bears saying again. Directories and symlinks.... oh my!
real men upload their stuff to kernel.org and let the world mirror it.
I want to delete my account but Slashdot doesn't allow it.
Here's what I do:
Directory for each data set, labeled by date (20100815).
Short README file inside each directory with description of the run.
Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.
You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
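A small sketch of the compressed, grep-friendly approach using Python's standard gzip module. The line format and values are invented. The gzip container carries a CRC-32 of the uncompressed data, which is where the basic integrity check comes from: a corrupted file raises an error on read instead of yielding silently wrong numbers.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "run01.txt.gz")

# Repetitive, line-oriented raw data compresses well. Values are made up.
lines = [f"t={i} channel=A value={i * 0.5:.1f}\n" for i in range(1000)]
raw_size = sum(len(line) for line in lines)

with gzip.open(path, "wt", encoding="ascii") as f:
    f.writelines(lines)

# Stream the file back line by line and filter it, grep-style.
with gzip.open(path, "rt", encoding="ascii") as f:
    matches = [line.rstrip("\n") for line in f if "t=42 " in line]

compressed_size = os.path.getsize(path)
print(matches, raw_size, compressed_size)
```

On the command line the same filter is a one-liner with zgrep, so the data stays usable without any custom reader.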
http://www.scidb.org/
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and run it on the free vmwareplayer or a real VM solution if you have one.
I recently set up a demo showing the benefits of such a system. I was able to, in about one day, download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files thanks to OpenOffice libraries), and I could go about changing directory structure, moving around and renaming files, etc., and the source control would show me changes. In fact, I could go into the backend and write SQL queries if I wanted to, with detailed reports of how things were on date X or Y revisions ago. Was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and modified in Alfresco's database.
Here is a bitnami VM image, will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.
Make sure everyone's vote counts: Verified Voting
What I am looking for is a multidimensional addressing file or database system.
Something like multiple B-trees for the content, with the possibility to add another B-tree index if required later.
Maybe the Google people have an answer?
One can mix-n-match: use flat files to store raw data, post-processed stuff, etc, but use a database to keep track of everything. The latter can even be handled by your OS from the get-go. On OS X you could write a spotlight plugin or two for your data files, and as long as the file format allows storing metadata within the files, spotlight will index it. Same goes for Windows Search. You could also use native mechanisms for adding metadata to files - those exist IIRC on both Windows 7 and on OS X.
A successful API design takes a mixture of software design and pedagogy.
A personal wiki that runs from one file, I link my files from there and I can add documentation and references at will. http://www.tiddlywiki.com/
DNA in your Linux: DNALinux
If you're really serious about tracking metadata, it may be worthwhile to take a look at some of the tools offered by caBIG:
https://cabig.nci.nih.gov/
The caBIG tools are geared toward using a model-driven approach to define precise metadata which promote semantic interoperability. Underlying the caBIG tools is a metadata repository called the caDSR, which follows the ISO/IEC 11179 Standard for Metadata Registries:
https://cabig.nci.nih.gov/concepts/caDSR/
The caBIG tools are all open-source, developed by the National Cancer Institute.
File corruption is unlikely with text files. If you should have corrupted files, you have a chance to recover them with text files. With binary files, databases etc. this becomes much harder.
To find your files later, tag them properly. Something like OpenMeta might help you.
All this talk of using whatever kind of database to organize your experimental data is nuts. It's well intentioned, I'm sure, but it's still insane. I always tell students that there is no general way to organise one's data; you have to find a system that works for you. I reckon that > 99% of physical science researchers (not just physicists, as seems to be a confusion in several replies) wouldn't be able to set up and use a database in a way that's better in terms of time and effort efficiency than just doing whatever it is that they already do to organise their data. Worse still, I reckon that it's the sort of thing that one would spend a huge chunk of time doing and then only use for a short while before you got bored of inputting stuff into the database properly and then started to forget to do it or, worse, resolving to "do it in batches". Anyway, the result of this will be that one eventually stops using the database and goes back to what one was doing before, but now with a huge hole in the data-trail where the database used to be. Alternatively, one will struggle on with the database for a while and then try to re-design it to put in all the features that were missed out in the original design, all the while sucking up loads of time and eventually going back to the original method.
Getting on to some actual advice. I would suggest two features that your chosen system should have. It should be:
1. robust
2. quick
Personally, I have a lab book in which I record experimental details (what I actually did to generate the data). There's a date at the start of every day and the rough "titles" of the experiments and then the details of what it was. When I generate data files I organise them into a hierarchy of directories. So, there's a "projects" directory that has all the different projects in it. There might be a project that I'm working on to do with nanoparticles or something, and that'll have a directory called "nanoparticles". There's a bunch of directories in the project folder, such as "data", "analyses", "reports", etc. The Data directory is divided by the experimental technique that was used to get the data, such as "fluorescenceMicroscopy", "TEM" or "SEM" or whatever. Then the actual experiments are in directories inside the relevant technique. I name the directories by date first and then a brief indicator of what the experiment is about. So if I was looking for the aggregation behaviour of my nanoparticles in the presence of different polymers or something, I'd have a directory called something like
"20100815-FePt_particles_with_100k_PEG_in_PBS_at_pH7"
or something like that. Something that people often do is call a directory "20100815" or something like that. I used to do this, but I didn't find it useful to look back on after 6 months. People also forget that you can have something like 256 characters for the file/directory name - USE THEM! Inside this directory will be a bunch of data files that I acquire. I tend to start the naming of files with a number and then a description of the sample and what the point of this data is. So, for example, the first image in a set of TEM images might be named "01-100mM_PEG_generalGrid_300x", the next will be named "02-...", "03-...", etc. This way, all the files are listed in the order that they were acquired, which I find helps to find them later, since I think that you remember the order of things better than their absolute position. So, if I wanted to find an image of aggregated nanoparticles that I took some time in the winter, I could easily find "Projects/nanoparticles/Data/TEM/20091106-FeNP_with_20k_PEG_in_EtOH/03-10mM_PEG_aggregatedParticles_20kx".
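The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the helper names `experiment_dir` and `numbered_name` are made up here, and the layout (project, then Data, then technique, then a date-plus-description directory) is simply read off the description above.

```python
from datetime import date
from pathlib import PurePosixPath


def experiment_dir(project, technique, day, description, root="Projects"):
    """Build <root>/<project>/Data/<technique>/<YYYYMMDD>-<description>."""
    slug = description.strip().replace(" ", "_")
    return PurePosixPath(root) / project / "Data" / technique / f"{day:%Y%m%d}-{slug}"


def numbered_name(index, description):
    """Prefix a file name with its two-digit acquisition order."""
    slug = description.strip().replace(" ", "_")
    return f"{index:02d}-{slug}"


folder = experiment_dir("nanoparticles", "TEM", date(2009, 11, 6),
                        "FeNP with 20k PEG in EtOH")
print(folder / numbered_name(3, "10mM PEG aggregatedParticles 20kx"))
```

Putting the date first in year-month-day form means a plain alphabetical listing is also a chronological one, and the zero-padded counter does the same for files within a directory.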
Anyway, this works for me, but my data needs to be organised so that I can get to the relevant data and then do something with it, not aggregate large amounts together and get some numbers.
Yes, agreed, a combination is good (SQL + NoSQL + filesystem).
There is no one-size-fits-all scenario, here.
However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.
MongoDB's GridFS (especially with FUSE support) marries many of these features together. MongoDB does have some SQL DB features (such as indexing/searching and transactions) but not others.
Check out the whole stack here:
http://www.mongodb.org/
http://www.mongodb.org/display/DOCS/GridFS
http://github.com/mikejs/gridfs-fuse
Depending on the nature of your data, I used to use netcdf files quite a lot (https://www.unidata.ucar.edu/software/netcdf/). I also work on data sharing and standardisation, this is a full time job, so really you can spend as much time as you want on this and still not get it done. There are a variety of international data standards which exist to facilitate the management and sharing of data. I know you aren't necessarily talking about sharing your data, but many of the same issues apply. In short, unless you wish to spend a great deal of time on this or your employer has some data management solution for you then there are probably only a variety of unsatisfactory solutions which no doubt the slashdotters here will suggest :)
Just give all your datasets a number and put them in a database so you can search on all criteria you want.
You can also use Excel to keep track of your metadata if you want...
Privacy is terrorism.
You are finding the same problem everyone has with any data set. Hierarchical folders with one name only allow for a single, pre-arranged organization. It's terrible for the way we really use files, data, whatever really.
Store your data sets with simple "inventory names" like 00001 through 99999 or random serial numbers. Have a spreadsheet or database that associates all of your data sets with as many characteristics as you like. Then you can sort and find by any combination you can think of in the future.
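A rough sketch of the "inventory number plus lookup table" idea, using the SQLite module that ships with Python. The table names, the `register` and `find` helpers, and the sample attributes are all invented for illustration; a spreadsheet with one row per data set would serve the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file name instead to keep the catalogue
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE attribute (
        dataset_id INTEGER NOT NULL REFERENCES dataset(id),
        name  TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")


def register(path, **attributes):
    """Record a data set and any number of name=value characteristics."""
    cur = conn.execute("INSERT INTO dataset (path) VALUES (?)", (path,))
    conn.executemany(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in attributes.items()],
    )
    return cur.lastrowid


def find(**criteria):
    """Return the paths of data sets matching every name=value pair given."""
    sql = "SELECT path FROM dataset WHERE 1=1"
    params = []
    for name, value in criteria.items():
        sql += (" AND id IN (SELECT dataset_id FROM attribute"
                " WHERE name = ? AND value = ?)")
        params += [name, str(value)]
    return [row[0] for row in conn.execute(sql + " ORDER BY id", params)]


# Hypothetical entries: the directory names are just serial numbers.
register("00001", sample="lysozyme", pH=7, technique="NMR")
register("00002", sample="lysozyme", pH=5, technique="NMR")
register("00003", sample="BSA", pH=7, technique="CD")

print(find(sample="lysozyme", pH=7))
```

Because attributes are stored as free name/value pairs, a new characteristic can be added later without changing the tables or renaming anything on disk.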
- For the complete works of Shakespeare: cat
Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.
The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.
So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:
My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
Ask me about repetitive DNA
I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
Your question is really not clear. In both of the fields that I work in, the requirements vary vastly, and they also vary among the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB/file; a single simulation run can give a geologist a 1 TB file.
Others have a few hundred MB of data. Each is handled differently.
The data itself can be parsed and stored in a DB for analysis in some cases, and in others that is very impractical and will slow down your work.
Each scientist has a different way of doing things.
So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, what kind of data are you interested in? Also, you should seriously consider an archiving solution, because I guarantee you will run out of space.
The lunatic is in my head
It works well, and we designed the database schema for extensibility, normalized and all that. (and the thing is growing and adapting well to changes, but that DB design is all important to make that not so hard)
Looks like the best plan for stuffing a lot of data away and finding it later on. www.coultersmithing.com will show you a bit of what we are up to here. Or our forum at Science/Engineering/Tech forums
Why guess when you can know? Measure!
I'm also an experimental physical scientist. My experience tells me that I have absolutely no idea what kind of meta-data I'll want to keep track of in the future, and I only know what I want to keep track of now, which is probably a small subset of what I'll want to keep track of in the future. Every sample that I make is assigned a unique serial number (Experiment N Sample M Piece Q etc). All the master data is in my lab notebook, which I keep anal-retentively. Any metadata that I know now that I want to keep track of is contained in there. Any analysis I do on any sample I make is also filed under this serial number. Now I just need to convince my boss to let me switch to an electronic notebook (like Microsoft OneNote) so that I can assign each sample or each experiment its own tab so I don't have to jump pages back and forth in my paper notebook.
One way of looking at the problem is organizing information about the data versus organizing the data itself. One way to do this might be to have a text file describing facts about your data in the same directory as the data. Use something like Solr and Lucene to index these text files. You can make search queries without the need for a uniform schema that describes the meta-data about your experimental data. It would be like "Googling" your data. You could organize it in such a way that you can search for specific info like date etc., or look up by search terms that might be found in the text.
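To make the idea concrete without setting up a search server, here is a toy stand-in in plain Python: it is not Solr or Lucene, just a small inverted index over one notes file per data directory. The file name `README.txt`, the function names and the two example experiments are assumptions for the demo.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def build_index(root, sidecar="README.txt"):
    """Map each lower-cased word to the data directories whose notes use it."""
    index = defaultdict(set)
    for note in Path(root).rglob(sidecar):
        for word in re.findall(r"[a-z0-9.]+", note.read_text().lower()):
            index[word].add(note.parent.name)
    return index


def search(index, *terms):
    """Return the directories whose notes contain all of the given terms."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []


# Demo with two made-up experiments in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for name, text in [("exp001", "TEM images, FePt particles, pH 7"),
                       ("exp002", "SEM images, FePt particles, pH 5")]:
        (Path(root) / name).mkdir()
        (Path(root) / name / "README.txt").write_text(text)
    index = build_index(root)

print(search(index, "fept", "tem"))
```

The notes need no fixed schema; whatever words end up in the text file become searchable, which is the main attraction of the approach.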
I face the same issues too. Immensely large datasets, that change and no proper way of tracking them through the file system. Trust me, when I say this -- it will be worth your while to spend some time thinking about your requirements and do some quick coding to get an infrastructure in place.
It was said here before (I guess just a couple of posts above), but this is right on mark -- You have got to separate out your data and meta-data. Text files are immensely convenient and to be honest, that is also where I prefer to store my actual data. But statistics about my data I store in a quick relational database. My meta-data db consists of fixed columns that have all the statistics for my data sets that I usually need. For example: Date/Time, Number of Columns, Number of Rows, Row Description, Column Description, Algorithm Description, Parameters, Special Notes, File Name of Data, etc.
Row Description, Column Description, and Algo Description point to separate helper tables.
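For what it's worth, a metadata table along these lines could look like the following in SQLite. This is a sketch under assumptions: the table and column names, the `helper_id` function and the example rows are invented here, and only the list of fields comes from the description above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE row_description       (id INTEGER PRIMARY KEY, description TEXT UNIQUE);
CREATE TABLE column_description    (id INTEGER PRIMARY KEY, description TEXT UNIQUE);
CREATE TABLE algorithm_description (id INTEGER PRIMARY KEY, description TEXT UNIQUE);
CREATE TABLE dataset (
    id           INTEGER PRIMARY KEY,
    created      TEXT NOT NULL,      -- date/time as ISO 8601 text
    n_columns    INTEGER,
    n_rows       INTEGER,
    row_desc_id  INTEGER REFERENCES row_description(id),
    col_desc_id  INTEGER REFERENCES column_description(id),
    algo_desc_id INTEGER REFERENCES algorithm_description(id),
    parameters   TEXT,
    notes        TEXT,
    file_name    TEXT NOT NULL       -- the data itself stays in this text file
);
"""


def helper_id(conn, table, description):
    """Return the id of a description in a helper table, adding it if new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (description) VALUES (?)",
                 (description,))
    row = conn.execute(f"SELECT id FROM {table} WHERE description = ?",
                       (description,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

algo = helper_id(conn, "algorithm_description", "baseline subtraction")
for name, rows in [("run_001.txt", 1200), ("run_002.txt", 1500)]:
    conn.execute(
        "INSERT INTO dataset (created, n_columns, n_rows, algo_desc_id,"
        " parameters, file_name) VALUES (?, ?, ?, ?, ?, ?)",
        ("2010-08-15T10:00:00", 4, rows, algo, "window=5", name),
    )

# Trending across data sets becomes one query instead of opening every file.
trend = conn.execute(
    "SELECT d.file_name, d.n_rows, a.description FROM dataset d"
    " JOIN algorithm_description a ON a.id = d.algo_desc_id"
    " ORDER BY d.file_name").fetchall()
print(trend)
```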
I also agree with many people here that relational databases can be an overkill for a manageable database, but if you generate a lot of datasets, the break-even point is reached almost immediately. Besides, text files even though extremely convenient for a quick grab and feed into a software are simply horrid when it comes to trending across many datasets.
Now depending upon your skill set, setting up this database could be a day's job or it can take you weeks. If you are computationally inclined though -- go for the relational database to store the meta-data, keep your main data in text files.
If you are not, there is excellent software out there that gives you a nice interface to a relational database.
Nevertheless, my main point is, whatever route you choose, it will help you to separate out your actual data, and stuff about that data (meta-data).
But one thing you will find is that you have to use SQL to get the best out of any relational database, and this involves thinking in a new way - it's basically set-oriented - rather than sequentially row by row. This takes a bit of effort, but can be rewarding, as you will discover new ways of achieving some of the things you want to do.
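A small illustration of the difference, using Python's built-in SQLite module and a made-up `measurement` table. Both halves compute the mean value per pH; the first walks the rows one at a time in application code, the second states the result wanted and leaves the work to the database.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (sample TEXT, ph REAL, value REAL)")
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", [
    ("A", 5.0, 1.0), ("A", 5.0, 3.0), ("A", 7.0, 4.0), ("B", 7.0, 6.0),
])

# Row by row: fetch everything and accumulate sums and counts by hand.
totals = defaultdict(lambda: [0.0, 0])
for sample, ph, value in conn.execute("SELECT sample, ph, value FROM measurement"):
    totals[ph][0] += value
    totals[ph][1] += 1
row_by_row = {ph: total / count for ph, (total, count) in sorted(totals.items())}

# Set-oriented: describe the grouping and let the database compute it.
set_oriented = dict(conn.execute(
    "SELECT ph, AVG(value) FROM measurement GROUP BY ph ORDER BY ph"))

print(row_by_row)
print(set_oriented)
```

The two results are the same; the set-oriented form is shorter and keeps working unchanged when the grouping gets more complicated, for example per sample and per pH.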
Depending on your exact requirements, this is perhaps a fit for an Enterprise Content Management system. If these datasets are heterogeneous, I'd think looking into some kind of flexible meta-data system would be the way to go. This can be anything from a custom application to a bought solution or an open-source ECM system.
Though I caution you, some of these systems can be convoluted to set up, maintain and learn. Don't let that scare you away from the concept of it though.
The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).
There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.
Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.
Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:
There's also Bioinformatics, Health/medical informatics, chemical informatics, etc. plug in your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group, or person you can write to to try to get more info and advice.
Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.
Build it, and they will come^Hplain.
You said you're dealing with physical science. From what you describe, I'd guess that you're dealing with earth science, from what we call "small science". (lots of smaller investigations that can be done with a small team, rather than the multi-million dollar satellite or sensor grid projects).
I'd suggest talking to one of the following groups:
There's a hell of a lot more groups out there, but those two larger groups would be able to steer you towards more specialized groups that deal with a specific scientific discipline.
Build it, and they will come^Hplain.
I'm not sure this answers my question. It sounds like you're basically saying all the people who are telling us to put our data in the database are wrong. Which is possible, but it doesn't explain what they thought we were going to do with it in the database in the first place. But then you're suggesting that we do a tremendous amount of extra work, including some very touchy coding, to get the metadata into a database. Unless I'm missing something, keeping the metadata up to date would then require rewriting all of the software we use (the software modifies, creates, and deletes files). Even if this were possible, it would be extremely fragile. And I'm not sure what the benefit would be. I have no idea what the distinction is between "row by row" and "set-oriented," because we don't use a database at present, and I can't even imagine what "row by row" would mean for our data. This is rich scientific data, not transactions in some financial system. And the supposed benefit sounds very meager. We don't need new ways of achieving things, unless they're better than our current things.
Still baffled.
I use a program called Personal Brain, available at www.thebrain.com. I prefer to pay for the upper-end version due to its better functionality, but the company has a free version as well. They also sell an enterprise version that is usable by multiple users. I don't have experience with this version, but I presume the functions are similar.
they're even good for storing video!
You can use Nutch (http://nutch.apache.org/) to crawl your documents, then you'll have a Lucene index. Nutch also gives you a basic search page as a frontend as you go through your documents. For grouping of search results Carrot (http://project.carrot2.org/) has some very interesting algorithms for research work, and Nutch has a native plugin for Carrot that has shipped since 1.0.
fak3r.com
It doesn't work so well for biology subjects (they all become autopsy experiments, and after a week or so the smell from my desk reminds me of a "Body Farm"), but I LIKE sticking things on a long vertical metal rod. (I KID, I KID...)
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
with poor spelling, the idea of RDBs or other stuff just sounds like a nonstarter. (If you think "install Apache, 1 click", etc. works, then you have no idea of what real scientists are like; perhaps this means that they don't have the training they should have, but I have used these one-click installers, and they are way, way, waaaay too complex for me, and, I am sure, for most scientists, for whom an Excel pivot table or vlookup is the height of programming expertise.) The problem is that "scientific data" is often like the real world - messy, undefined stuff. It also typically includes all sorts of things - images from cameras ranging from consumer point-and-clicks to Peltier-cooled 24-bit CCDs, flat files uploaded from machines, complex proprietary files uploaded, handwritten notes, etc. etc. For instance, for a recent experiment I did, I have
".xad" files, which are a proprietary format from an Agilent instrument
".nef" files from a Nikon consumer camera
several .txt files which need different converters to put them into useful Excel files
several .pngs
some of the key .txt files were imported into .xlsx, then imported into a specialized graphics program (Igor) which has its own proprietary format...
my written notes on how to do the experiment (Word doc with embedded Excel minitables, which show calculations for things like how many grams of salt per liter of solution..)
proprietary files from a nanodrop (don't ask) which can be opened with Excel to get sort of usable data
hand-entered data from a Thermo Electron conductivity meter (integers), a cheap balance, lot numbers of key reagents, pointers to previous experiments where I describe how key reagents were prepared, pointers to key experiments with background info
I don't think a single one of these files has the data organized in a really useful manner; all require extensive hand editing to get them into a form that is even understandable, let alone integrated.
not to mention the hideous problems of writing reports in Word with embedded pictures, the loss of functionality in Excel 2007 (x error bars hard to do, xy graph format reverts to line format without telling you, etc.)
what this guy really wants is sort of the holy grail of web searching - the ability to do natural language queries on his data set.
Like, for growing fish, what is the effect of pH (acidity) and ionic strength (how much salt is in the water) on body length vs time
In the meantime, what would help would be
1) if everyone complained to Gates and Ballmer, and got them to divert some minuscule fraction of their monopoly-level profits from Office, and put those profits to work making Office usable for technically minded people
2) A way to keep track of files across changes in file name and directory name.
this is a huge problem; what would help is a way to embed a serial number into a file, so that every time you use the file, the serial number goes with it. E.g., you have a .tif, you crop it in an image editor, import the crop into PowerPoint, add text, export the mess into Word, save the Word doc, re-edit it a year later, then two years later, looking at the re-edited Word doc, you need to find the ORIGINAL .tiff, and in the meantime you have moved to a new computer and have a different directory tree.....
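A partial workaround, sketched in Python: record a hash of the file's contents when it is first acquired, and later search for that hash instead of the name. To be clear about the limits, this is not the embedded serial number asked for above, and it will not follow a file through cropping or re-export; it only re-locates an unmodified original after it has been renamed or moved. The function names and the demo file are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def locate(root, wanted):
    """Find every file under root whose contents match a recorded fingerprint."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and fingerprint(p) == wanted)


with tempfile.TemporaryDirectory() as root:
    original = Path(root) / "scan.tif"
    original.write_bytes(b"pretend image data")
    recorded = fingerprint(original)           # note this in the lab book

    moved = Path(root) / "archive" / "2010" / "renamed_scan.tif"
    moved.parent.mkdir(parents=True)
    original.rename(moved)                     # simulate a later reorganisation

    found = [p.relative_to(root).as_posix() for p in locate(root, recorded)]

print(found)
```

Writing the fingerprint next to each figure in a report gives at least one stable handle back to the raw file, whatever the directory tree looks like by then.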
I write a Rails server for each of my experiments. Rake tasks are GREAT for bioinformatics pipelines, and migrations and database backups make it incredibly easy to save old experiments.
Plus, various Ruby gems (starling, workling) enable me to farm out long-running experiments to a variety of lab machines. I can use Rice to write C++ Ruby extensions that are compiled individually on different machines.
And all of this is stored in a PostgreSQL database. (MySQL is slow for complex joins, which you sometimes have to write in bioinformatics.)
I have pretty much the same problem. The file system became a mess and databases didn't help much with entire directories of up to 1 gig. Plus, for scientific traceability, experimental data should be self-sufficient: external links (to other data, scripts, software, etc.) are toxic. So, I developed 2 open-source tools to organize my experimental data.
1) New experiments: Basic Experimenter. A wizard that stores together the datasets and all the files of the experiment (protocol, checklists, scripts..). http://sites.google.com/site/basiclabbook/basicexperimenter
2) Old experiments: Basic Bookcase. Datasets are zipped and stored in a scientific repository as documents (BibTeX category 'dataset'). http://sites.google.com/site/basiclabbook/basicbookcase
This did not solve the previous cross links, but it helped a lot.
Science Informatics. You just told me what I do, when I'm not doing my own research... which seems to be happening less and less these days. A lot of the kids we train now can write something in Matlab or IDV, but couldn't perform a first normalization on a database if their lives depended on it. A lot of them have learned a little perl... or .Net, but know nothing of C or Fortran, and can't spell MPICH even when the models they use depend on it.
And seriously, corporate and university IT staff are not suited for this purpose, just as I'm poorly suited to help you with your next Windows installation. My expertise lies elsewhere. Getting the senior management to understand these differences, however, is a problem.
Never ascribe to malice that which can adequately be explained by tenure.
I run a lot of numerical simulations, and we run into the same issues, as a few people have noted above. Many of my colleagues still write code in fortran 77 and manually give descriptive filenames to their data. It's a mess. There's a guy at the University of Toronto, Greg Wilson, whose work centers on getting computational scientists--largely people who run simulations--to become more sophisticated. He runs a one-week course teaching software engineering principles to computational scientists. It's available online for self-study at http://www.sofware-carpentry.org./ One section is a lesson on using databases (he demonstrates SQL) to handle numerical data, which I imagine looks a lot like experimental data. He's currently rewriting the course, but I have learned a lot from it.
You need your "master data" (i.e., data that describes your work and is considered to be of quality) centralized in something like an RDBMS (SQL Server, etc). Once you have it there you can write scripts to find orphans. Basically you will have a metadata table that will contain all available "links" to your master data. You then try to find the other side of each "link". The question you are trying to answer is "is the metadata of any quality?".
So, if you needed a table it would look something like this.
Unique key of Master Data, Flag to determine if unique key exists (because this changes over time), Meta Data key, flag to determine if meta data exists(because this changes as well)
You then run a script against this nightly to determine if your "links" are broken. I also like to have it visualized by having an HTML page display a nice green check for a good link and a big red x for a bad link.
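A minimal sketch of such a nightly check, with SQLite standing in for the RDBMS. The `link` table, its column names and the two demo entries are assumptions made for this example; the HTML report is left out.

```python
import sqlite3
import tempfile
from pathlib import Path


def refresh_flags(conn):
    """Re-check both ends of every link and store the result as 0/1 flags."""
    rows = conn.execute(
        "SELECT master_key, master_path, meta_path FROM link").fetchall()
    for key, master_path, meta_path in rows:
        conn.execute(
            "UPDATE link SET master_exists = ?, meta_exists = ?"
            " WHERE master_key = ?",
            (int(Path(master_path).exists()), int(Path(meta_path).exists()), key),
        )


def broken_links(conn):
    """Return the keys whose data or metadata is currently missing."""
    return [key for (key,) in conn.execute(
        "SELECT master_key FROM link"
        " WHERE master_exists = 0 OR meta_exists = 0 ORDER BY master_key")]


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE link (
    master_key TEXT PRIMARY KEY, master_path TEXT, master_exists INTEGER,
    meta_path TEXT, meta_exists INTEGER)""")

with tempfile.TemporaryDirectory() as root:
    data, meta = Path(root) / "00001.dat", Path(root) / "00001.meta"
    data.write_text("1 2 3\n")
    meta.write_text("pH 7\n")
    conn.executemany("INSERT INTO link VALUES (?, ?, NULL, ?, NULL)", [
        ("00001", str(data), str(meta)),
        ("00002", str(Path(root) / "00002.dat"), str(meta)),  # data file missing
    ])
    refresh_flags(conn)
    report = broken_links(conn)

print(report)
```

Since `Path.exists()` follows symbolic links, a path that is a dangling symlink is reported as missing too, which covers the situation in the original question.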
If you do this, you will really have a good idea of your data and it will allow you to QA it much better.
I am a behavioral neuroscientist and MySQL has been invaluable for my data storage and analysis.
MATLAB is my main analytical tool, so I generally keep my MATLAB code in a Subversion repository and use MyM to communicate back and forth between MATLAB and MySQL.
I've been using MySQL since 1999. If I were getting into the game now, I might choose another relational DB.
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765
As an example of the former: The patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5 year old blue whales" are strongly disjoint and there is essentially no description reduction scheme that can handle such a broad range of inputs. Equivalently, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to have a unified data reduction description. I've led programs to address this range of problems in several ways:
Sort them as they should be sorted, in a mindmap.
'I am become Shiva, destroyer of worlds'