How Do You Organize Your Experimental Data?
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
I store them in first posts.
Subj.
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
Whenever I need to find anything, I use "Command-F"
For justice, we must go to Don Corleone
Data isn't just data - it has, as you've learned, a history. Learn about how RCS works and use one to store your data in from now on.
Or, you could just store it in a proper SQL database, and be able to query it any way you like, without having to create all these link farms giving you different views on the underlying data.
In my experience, the best thing is to let the structure stand as it was the first time you stored the data.
Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.
I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
Organize your data like I organize my bedroom: Everything on the floor.
Look, how big is your desk? 8 square feet? How big is your floor? Several hundred square feet? If you can see all of your stuff, then you can access it instantly. Organized Chaos.
Now, if you'll excuse me... I think something's moving around in my trash can.
http://www.sqlite.org/ a "replacement for fopen()" -- http://www.sqlite.org/about.html
http://stephan.sugarmotor.org
SQL comes really handy. I can imagine several simple scripts + SQLite indexing table. Or anything else.
I was going to say the same thing. You can also check whether there is any software in your domain that might help you insert the data into a database. If not, you can keep the data as flat files and store a record for each one in the database, including its path. A little bit of programming, but not much, will get you a list of file paths that you can then feed to a bash script to retrieve the files.
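A minimal sketch of the "flat files on disk, paths in a database" idea, using Python's built-in sqlite3 module. The table name, the columns (sample, ph, exp_type, path) and the example paths are all made up for illustration; swap in whatever metadata you actually record.

```python
import sqlite3

def build_index(db_path=":memory:"):
    """Create a small metadata table. Column names are hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datasets (
               id       INTEGER PRIMARY KEY,
               sample   TEXT NOT NULL,
               ph       REAL,
               exp_type TEXT,
               path     TEXT NOT NULL UNIQUE
           )"""
    )
    return conn

def add_dataset(conn, sample, ph, exp_type, path):
    """Record one data file; the file itself stays where it is on disk."""
    conn.execute(
        "INSERT INTO datasets (sample, ph, exp_type, path) VALUES (?, ?, ?, ?)",
        (sample, ph, exp_type, path),
    )
    conn.commit()

def find_paths(conn, exp_type, ph_min, ph_max):
    """Return the paths of all datasets of one type within a pH range."""
    rows = conn.execute(
        "SELECT path FROM datasets "
        "WHERE exp_type = ? AND ph BETWEEN ? AND ? ORDER BY path",
        (exp_type, ph_min, ph_max),
    )
    return [row[0] for row in rows]

conn = build_index()
add_dataset(conn, "lysozyme", 4.5, "nmr", "raw/20100801_001.dat")
add_dataset(conn, "lysozyme", 7.0, "nmr", "raw/20100802_001.dat")
add_dataset(conn, "ubiquitin", 7.2, "cd", "raw/20100803_001.dat")

for path in find_paths(conn, "nmr", 6.5, 7.5):
    print(path)
```

The printed paths can be piped into whatever shell script does the actual retrieval, so the files never need to move.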
I wonder if there's an opensource project to create and manage extended attributes on supporting filesystems?
http://www.freedesktop.org/wiki/CommonExtendedAttributes
But you're likely to get better results from having filenames be a field in a DB, and let all the metadata live in other DB fields..
ps: here's a CPAN entry that manipulates extended attributes: http://search.cpan.org/dist/File-ExtAttr/lib/File/ExtAttr.pm
I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
OK, subject is the short answer, here's the big answer
Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:
at the deeper, most basic level organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc)
Then you store this in a NoSQL db (like CouchDb or Redis) and index it the way you like, still if you don't index you can always search it manually (slower, still...)
how long until
Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.
Then, to find what you want, get a search engine that supports faceted navigation.
Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
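To make the arithmetic concrete: four independent facets with ten values each give 10 x 10 x 10 x 10 = 10,000 distinct combinations. Here is a rough sketch of tag-based filtering in Python. The facet names and records are invented for the example, and a real faceted search engine does far more than this.

```python
def count_combinations(facets):
    """Number of distinct tag combinations a set of facets can express."""
    total = 1
    for values in facets.values():
        total *= len(values)
    return total

def facet_search(records, **wanted):
    """Return names of records whose tags match every requested facet value."""
    return sorted(
        name
        for name, tags in records.items()
        if all(tags.get(facet) == value for facet, value in wanted.items())
    )

# Four hypothetical facets of ten values each.
facets = {"facet%d" % i: list(range(10)) for i in range(4)}
print(count_combinations(facets))

# Hypothetical datasets, each tagged along three facets.
records = {
    "run001": {"sample": "lysozyme", "ph": "4.5", "exp_type": "nmr"},
    "run002": {"sample": "lysozyme", "ph": "7.0", "exp_type": "nmr"},
    "run003": {"sample": "ubiquitin", "ph": "7.0", "exp_type": "cd"},
}
print(facet_search(records, ph="7.0"))
print(facet_search(records, sample="lysozyme", exp_type="nmr"))
```

Each record sits on several "shelves" at once, and narrowing the search is just a matter of adding another facet to the query.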
There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
Instead of symlinking to directories,
create directories of hard links to the files.
Then you can move files around whenever you like,
and you never have any dangling links.
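A small sketch of that in Python, assuming a POSIX-style filesystem where everything lives on the same volume (hard links cannot cross filesystems). The directory and file names are invented.

```python
import os
import tempfile

def link_into_view(data_file, view_dir):
    """Add a hard link to data_file inside view_dir and return its path."""
    os.makedirs(view_dir, exist_ok=True)
    link_path = os.path.join(view_dir, os.path.basename(data_file))
    os.link(data_file, link_path)
    return link_path

with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "run001.dat")
    with open(original, "w") as f:
        f.write("1.0 2.0 3.0\n")

    # A "view" directory that groups files by pH.
    view = link_into_view(original, os.path.join(root, "by_ph", "7.0"))

    # Move/rename the original; the hard link still reaches the same data.
    os.rename(original, os.path.join(root, "renamed_run001.dat"))
    with open(view) as f:
        survived = f.read()

print(survived)
```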
...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.
For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.
Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)
Life is too short. Get someone else to do it, under the disguise of valuable field experience.
The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data and supporting graph data stores. Give it a spin.
var sig = function() { sig(); }
I never understood how you can have something organized or not.
I organize my stuff at the planetary level.
Universe > Solar System > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
I think I'm pretty well organized even though I misplace stuff all the time.
You may want to consider a scientific workflow system. These systems handle both data storage (including meta-data and provenance -- where the data came from), and design and execution of computational experiments. If you are concerned about the complexity of the meta-data (e.g., pH value..) and would like to make sure to be able sort things according to this, you want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox.
"Search, don't sort".
The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.
Now, you can start to search though the data by thinking about which sets of data to compare. Much easier.
This is very general advice - if you have one experimenter and a couple of experiments, just use a lab notebook. If you have a handful of experimenters and ~100 experiments, try a spreadsheet or a well-organized structure on disk. If you have many people involved, or thousands of experiments, or both, you need something to help manage all of that in a way that lets you think in terms of sets rather than individual data files. Otherwise, you'll find yourself wearing your 'data steward' hat way too often, and not wearing your 'experimentalist' or 'analyst' hats much at all.
-V-
Who can decide a priori? Nobody.
-Sartre
$PRJ_ROOT/data/theoretical
$PRJ_ROOT/data/fits
$PRJ_ROOT/data/doesnt_fit
$PRJ_ROOT/data/doesnt_fit/fixed
$PRJ_ROOT/data/made_up
This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.
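Here is a rough Python demonstration of that failure mode, assuming a POSIX-like system where creating symlinks is allowed. The file names are arbitrary, and atomic_rewrite is just my stand-in for what many editors and tools do internally.

```python
import os
import tempfile

def atomic_rewrite(path, new_text):
    """Write to a temp file next to path, then rename it over the original."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_text)
    os.replace(tmp, path)  # path now refers to a brand-new inode

with tempfile.TemporaryDirectory() as root:
    data = os.path.join(root, "data.txt")
    with open(data, "w") as f:
        f.write("old\n")

    hard = os.path.join(root, "hard.txt")
    soft = os.path.join(root, "soft.txt")
    os.link(data, hard)      # hard link: second name for the same inode
    os.symlink(data, soft)   # symlink: points at the *name* data.txt

    atomic_rewrite(data, "new\n")

    with open(hard) as f:
        via_hard = f.read()
    with open(soft) as f:
        via_soft = f.read()

print("hard link sees:", via_hard.strip())
print("symlink sees:", via_soft.strip())
```

The hard link keeps pointing at the old contents, while the symlink follows the name and picks up the new file. If the update had been written in place instead, both would see the new data.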
I have to agree with the database suggestions, though something NoSQL-ish may work better.
Don't thank God, thank a doctor!
I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand, and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.
So if you want some bandaid approaches:
1) If you have a Mac, use aliases rather than symbolic links. Aliases don't get messed up if you move the file.
2) Use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change, these are just as space efficient as a softlink but tolerate renaming. They can't span different disks, however.
3) Poor man's database:
Give your files a numerical name that changes, typically the date and time they were created. Then have a flat file that lists the files in some set for each category.
4) Low-tech database. If you decide to use a database, then choose one that is likely never to go out of style. For example, pick something like a Perl tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
Some drink at the fountain of knowledge. Others just gargle.
My research area is about programming with MindMaps, MindMaps as source code, I'm developing a programming language based on them
I choose MindMaps because I could see the detail and the global in the same GUI so I recommend you Freemind
The MindMaps software could map your filesystem structure, even with the symbolic link structure
Good luck
is to use CSV (comma/tab separated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.
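For what it's worth, here is a bare-bones sketch with Python's standard csv module. The column names are invented, and an in-memory buffer stands in for a real file so the example is self-contained.

```python
import csv
import io

FIELDS = ["date", "sample", "ph", "signal"]  # hypothetical columns

def write_rows(rows, fileobj):
    """Write a header line followed by one CSV line per measurement."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def read_rows(fileobj):
    """Read the rows back as dictionaries keyed by the header names."""
    return list(csv.DictReader(fileobj))

buf = io.StringIO()  # stands in for open("results.csv", "w", newline="")
write_rows(
    [
        {"date": "20100815", "sample": "A", "ph": 7.0, "signal": 0.42},
        {"date": "20100816", "sample": "B", "ph": 4.5, "signal": 0.17},
    ],
    buf,
)
buf.seek(0)
rows = read_rows(buf)
print(rows[0]["sample"], rows[0]["ph"])
```

Note that everything comes back as text, so numeric columns need converting (float(row["ph"])) before analysis.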
Mod me up/Mod me down: I wont frown as I've no crown
To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:
1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.
2. Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?
I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".
http://www.moonlight3d.eu/
This is exactly what SparkLab aims to solve, take a look here: http://sparklix.com/demo-movie It's free for academic and non-profit organizations. Personal free edition will be up later this year.
Hire a local contractor (read local grad students) to program a simple system for you. This really needs to be in a database which is accessed through an interface you will be comfortable with and which makes it easy for you to manipulate your data.
Write down how your data is described, how you access and update the data, as well as what output is needed from the system, like how you need to view the data in order to use it in reports or calculations. It doesn't sound like it would be very hard to write something to organize your data. A good price for something like this where I live would be three to seven hundred dollars. Find someone with a decent track record and you should be much more organized in no time.
I have recently started using PyTables to store my data. Very fast, great compression and in Python!
http://www.pytables.org/
A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.
If your data is derived or crunched, You may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html , and take heed.
The previous suggestions about leaving your data intact and refining the access are good advice. I have used and developed some network dbms systems for this type of data. The current trend seems to be toward Object-Oriented Network dbms systems, but I'm not sure that is the way to go; OONDBMS tends to be static and hard to maintain in a dynamic experimental environment. The largest experimental environment that I've had the opportunity to work on, with an energy company here in Houston, was a statistical analysis of nuclear reactions. The data was constantly changing and we needed a self-referencing, dynamic data repository. This is the type of system where you download data sets and do your analysis AFTER you have acquired it locally. The dbms was written in FORTRAN90 and was very fast, but you need a team for something like this unless you are expert enough to program it all yourself. It actually used very little code, but the record management and indexes (mostly ISAM/inverted ISAM) took massive amounts of computer power. There are now some cute tools in FORTRAN 2000 that allow you to use a web browser as a front end, but I don't usually want to look at the data being gathered; I usually want to crunch the statistics and see the results. The browser front-ends I have seen tend to require too much tweaking in order to adapt to the changing data parameters. Remote terminals make more sense. Maybe you should be willing to change the method of accessing the data and not try to maintain dozens of links.
"The mind works quicker than you think!"
I would just put each set of experimental data in a separate subdirectory. Within each subdirectory I'd put a file with a specific name (e.g., "description.txt") in which you briefly write up exactly what the experimental data is, how it was generated (e.g., if generated by a program, give the arguments and/or pointers to input data), and some keywords to allow it to be indexed/searched. Then I'd use your standard OS search tools to find the description file(s) you're looking for, thereby allowing you to locate your data based on its description rather than some brittle directory hierarchy.
I have a pretty standard setup for generating experimental data in my work. Whenever I run an experiment (which are usually simulations), I have a wrapper script that generates a random (meaningless) subdirectory name, copies my simulation binary and configuration to that directory (so I can reproduce the results later in case either my simulator code or its configuration changes), and prompts me to enter a description of what it is I'm simulating, and asks me to provide some keyword tags. The only way I can find the data afterward is to search the description files from the last step, because the data is otherwise just in a randomly-named directory.
Of course, this scheme depends on you doing a decent job of describing your data and providing keywords, but I don't think you can get around that with any technique. At some point you have to inject some human labeling/categorization. Directories and symlinks are just a pretty restrictive way of organizing things.
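A stripped-down sketch of what such a wrapper and search might look like in Python. This is not my actual script: the function names and the description file layout are invented here, and the step that copies the binary and configuration is left out.

```python
import os
import tempfile
import uuid

DESCRIPTION_FILE = "description.txt"

def new_run_dir(root, description, keywords):
    """Create a randomly named run directory containing a description file."""
    run_dir = os.path.join(root, uuid.uuid4().hex)
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, DESCRIPTION_FILE), "w") as f:
        f.write(description.strip() + "\n")
        f.write("keywords: " + ", ".join(keywords) + "\n")
    return run_dir

def find_runs(root, term):
    """Return run directories whose description mentions term (case-insensitive)."""
    term = term.lower()
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        if DESCRIPTION_FILE in files:
            with open(os.path.join(dirpath, DESCRIPTION_FILE)) as f:
                if term in f.read().lower():
                    hits.append(dirpath)
    return sorted(hits)

with tempfile.TemporaryDirectory() as root:
    cache_run = new_run_dir(root, "Cache simulation, 4 cores", ["cache", "baseline"])
    net_run = new_run_dir(root, "Network simulation", ["latency"])
    found = find_runs(root, "cache")
    print(found == [cache_run])
```

In practice a plain grep over the description files does the same job; the point is only that the directory names carry no meaning at all.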
Good luck and enjoy!
In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to troubleshoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.
Sorry, I forgot to include the fact that the network dbms system does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.
"The mind works quicker than you think!"
Name them by date and save them in one directory. That's how you'll end up saving the files for your LaTeX paper anyway.
You need to start using a database. You don't have to actually put the data in a database but all of the meta data needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time record the file system location along with all other meta data that is relevant. Then some simple database queries, e.g. embedded in some web pages can retrieve the location and even the data. You can of course also store the data in a database as well if you wish. I personally find it more practical to do it this way.
If you are using Mac OS X, you can tag the files using the Finder's Get Info window and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is that I tend to forget the keywords I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.
Currently hooked on AMP
Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and add to it the simulated data and we have a hell of a lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root
the /store is a beginning marker of the logical filename region that different sites can map differently (who uses NFS, who uses http etc etc)
/mc/ -> it's monte carlo data
/Summer10/ -> the data was produced during Summer of 2010
/Wenu/ -> it's a simulation of W decaying to electron and neutrino
/GEN-SIM-RECO/ -> the data generation steps that have been done
/START37_.../ -> The detector conditions that have been used (the actual full description of the conditions is in some central database)
/0136/ -> is the serial number (actually I'm not 100%, but it's related to the production workflow etc)
/0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename, the hash is due to the fact that the process has to make sure there are no conflicts in filenames
Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root
This file is real data, taken during the first run of 2010 and filtered to the MinimumBias primary dataset (related to event trigger content). The datafiles in there contain RECO content and were done during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers) and then the filename.
You could use a similar structure to differentiate the datafiles that you actually use. The good thing is that you can map such filenames separately everywhere as long as you change the prefix according to the protocol used (we use for example file:, http:, gridftp:, srm: etc). You can also easily share data with other collaborating sites as long as everyone uses similar structure it's quite good. No need for special databases etc. If you need some lookup functionality, then one option is a simple find (assuming you have filesystem access) or you could build a database in parallel and you can use the LFN structure to index things etc.
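As a rough illustration (this is not CMS software), here is how one might split such a logical filename into its parts and map it onto a site-specific prefix in Python. The field names are just my labels for the components described above, and the prefixes are made-up examples.

```python
from typing import NamedTuple

class LFN(NamedTuple):
    """Components of a logical filename; field names are informal labels."""
    kind: str        # "mc" or "data"
    era: str         # e.g. "Summer10", "Run2010A"
    dataset: str     # e.g. "Wenu", "MinimumBias"
    tier: str        # e.g. "GEN-SIM-RECO", "RECO"
    conditions: str  # detector conditions / processing tag
    serial: str      # serial number block
    filename: str    # unique file name

def parse_lfn(lfn):
    """Split an LFN of the form /store/<7 components> into named fields."""
    parts = lfn.strip("/").split("/")
    if len(parts) != 8 or parts[0] != "store":
        raise ValueError("not a recognised logical filename: %r" % lfn)
    return LFN(*parts[1:])

def to_physical(lfn, prefix):
    """Prepend a site-specific prefix (protocol plus local root) to an LFN."""
    return prefix.rstrip("/") + lfn

lfn = ("/store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/"
       "0400DDE2-F681-DF11-BA13-00215E21DC1E.root")
info = parse_lfn(lfn)
print(info.kind, info.era, info.dataset, info.tier)

# Hypothetical site mappings: same logical name, different access method.
print(to_physical(lfn, "file:/mnt/storage"))
print(to_physical(lfn, "http://site.example.org/"))
```

Because the logical name never changes, only the prefix table differs from site to site.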
I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
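I don't have that group's code, but the general idea can be sketched with Python's standard dbm module, which gives a similar on-disk key-value interface (it is not BerkeleyDB itself, and which backend it picks depends on the installation). The key is each file's path relative to the tree, and the value is the raw file contents, so the original layout is preserved in the keys. The directory and file names are invented.

```python
import dbm
import os
import tempfile

def consolidate(src_dir, db_path):
    """Copy every file under src_dir into one key-value store; return the count."""
    count = 0
    with dbm.open(db_path, "c") as db:
        for dirpath, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(dirpath, name)
                key = os.path.relpath(full, src_dir)  # keeps the tree structure
                with open(full, "rb") as f:
                    db[key] = f.read()
                count += 1
    return count

with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "tree")
    os.makedirs(os.path.join(src, "sampleA"))
    with open(os.path.join(src, "sampleA", "run1.txt"), "w") as f:
        f.write("0.1 0.2\n")

    db_path = os.path.join(root, "data.db")  # kept outside the source tree
    n = consolidate(src, db_path)

    with dbm.open(db_path, "r") as db:
        value = db[os.path.join("sampleA", "run1.txt")]

print(n, value)
```

With tens of millions of small files, one database file like this is much kinder to the filesystem than one inode per data point.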
I haven't had to store experimental results like that. My work produces prototypes, some data, demos and support documentation. There are tons of KM tools out there to manage heterogenous data in a recoverable way. We've used document repositories like Hummingbird (acceptable) and of course SharePoint. The key (literally) is including the right metadata and tags when you check in the element. When a data set goes dormant (static) you can tarball the CVS tree or whatever and drop it in the repo. Then there's Knowledge Discovery, something we've created tools for. They let you understand how you got that idea from three hours of web/repo surfing.
First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.
Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.
If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.
In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').
I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
In the butt Bob.
SharePoint lists items, basically a database. You can sort and group by any parameter and attach files. No programming or special technical knowledge necessary.
I did a little bioinformatics in the past, and we were using postgreSql to manage our results. It was nice because you can create meaningful fields to query in the future. It took some time developing the system, but it really helped out in the long run. We had to consider errors in the readings of the results and had to incorporate a little bit of fuzzy logic into the tools we used to run comparisons on the database.
If you are at a university or near a university, the computer science dept may give a few students credit to build you a system that can handle it, so you don't have to.
A Laboratory Information Management System will help you store, organize, analyze and data-mine your data.
"linux" is a very common word and was not included in your search.
I agree. It depends.
Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.
If not, then I'd look at MongoDB with GridFS. I'd even go further and explore GridFS-FUSE (a mountable file system version of MongoDB/GridFS).
With GridFS-FUSE, you have a crazy powerful database/file system combo. Now, since MongoDB is a NoSQL database, you cannot do SQL queries against it. You can store and retrieve key-value pairs, NoSQL "documents," and actual files with MongoDB/GridFS/GridFS-FUSE.
Instead of sorting datasets, use a test list database (flat files). The test contains/links/points to its dataset. The test lists are selected at test run time. Each entry in a test list tells how to generate the specific test environment for the test. A test list entry contains the test, the RCS tag/version of the test to be "gotten", the test seed, an array of exit codes that should be retried, how many retries, whether the test is gating, and an array of test dependencies. A test run can be considered to pass even though an individual, non-gating test fails. One test entry may extract and prepare the test data and other dependent entries can then run against that test dataset.
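Roughly, a test list entry could be modelled like this in Python. This is only a sketch of the fields listed above: the class and field names are mine, and the pass rule is the simple "every gating test must pass" reading of the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ListEntry:
    """One line of a test list; field names are illustrative."""
    test: str                      # test name (points to its dataset)
    rcs_tag: str                   # revision of the test to check out
    seed: int                      # test seed
    retry_exit_codes: List[int] = field(default_factory=list)
    max_retries: int = 0
    gating: bool = True
    depends_on: List[str] = field(default_factory=list)

def run_passed(entries: List[ListEntry], results: Dict[str, bool]) -> bool:
    """A run passes if every gating test passed; non-gating failures are ignored."""
    return all(results.get(e.test, False) for e in entries if e.gating)

entries = [
    ListEntry("prepare_data", "v1.2", seed=1),
    ListEntry("fit_model", "v1.2", seed=2, depends_on=["prepare_data"]),
    ListEntry("plot_extras", "v1.0", seed=3, gating=False,
              depends_on=["prepare_data"]),
]
results = {"prepare_data": True, "fit_model": True, "plot_extras": False}
print(run_passed(entries, results))
```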
It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.
This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kinds of systems should you ever transfer to industry and work in data-generating or data-processing positions.
Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or of not being able to make cross-references (patents!), is much too high if you let people run their own set-ups. Do yourself, and your research group, a favor: Get some grant money and purchase a robust commercial set-up at least for your group, or better your department. Entry level systems, with academic discounts, are affordable. There are no competitive open-source solutions.
Start your research here:
http://en.wikipedia.org/wiki/LIMS
(though the systems listed there are instrument-centric; if you are more into generic chemistry there are other standard packages by companies such as Accelrys and CambridgeSoft).
I used to have a two word answer for this question: Use BeOS
But now it's a six word answer (*sigh*): Invent time machine, then use BeOS
--MarkusQ
This was developed in the context of computational biology experiments, but should hold true for other types of computational projects:
http://www.ncbi.nlm.nih.gov/pubmed/19649301
It's already been said, but it bears saying again. Directories and symlinks.... oh my!
Google Desktop on a PC and Spotlight on my Mac have helped me a great deal.
real men upload their stuff to kernel.org and let the world mirror it.
I want to delete my account but Slashdot doesn't allow it.
Here's what I do:
Directory for each data set, labeled by date (20100815).
Short README file inside each directory with description of the run.
Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.
You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
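Something along these lines, as a sketch using Python's standard gzip module. The line format ("key=value" pairs separated by spaces) is invented for the example. The gzip format carries a CRC-32 of the uncompressed data that is checked when the stream is read through to the end, which is where the integrity checking comes from.

```python
import gzip
import os
import tempfile

def write_compressed(path, lines):
    """Store repetitive, line-oriented raw data as gzip-compressed text."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

def grep_compressed(path, needle):
    """Return the lines containing needle without unpacking the file to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line]

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "20100815_raw.txt.gz")
    write_compressed(path, [
        "sample=A ph=7.0 t=0.0 signal=0.10",
        "sample=A ph=7.0 t=0.1 signal=0.12",
        "sample=B ph=4.5 t=0.0 signal=0.31",
    ])
    matches = grep_compressed(path, "sample=A")

print(len(matches))
```

On the command line, zgrep does the same filtering without any code at all.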
Mark each file with its description and insert each transformation after the original description as suggested in "More Programming Pearls" by Jon Bentley.
Put some keywords in the description and feel free to add more as you go. You can use a free format or go to a more rigid organization.
In my experience, the best thing is to let the structure stand as it was the first time you stored the data.
Your statement of this policy illustrates a hidden assumption which causes the policy to fail before it gets out of the gate: there should be no notion of "structure" in the way you store the data. A filesystem-- all production* examples of which are hierarchical-- imparts a particular set of relations between the elements of your data: that of a hierarchy (and a specific one at that!). By caring about the hierarchy, you're being beholden both to one kind of relationship and to a specific instance of that relationship with respect to your data.
(* There's been some work on relational-- as opposed to hierarchical-- filesystems; cf. WinFS. If this type of file representation/storage were de facto, the data "organization" habit into which so many people unknowingly fall wouldn't be a bad one.)
While a symlink farm is one way to separate representation from presentation, it still limits all your views of the data to hierarchical relationships.
Raw data should stand untouched, or you may delete it by mistake.
Leaving data "untouched" is the wrong solution to that problem. You need backups, and they must 1) prevent data corruption (so you can undo a mistake made by you or your hardware, such as accidental/regrettable data modification or filesystem corruption), and 2) provide data persistence (so your data will persist if a copy of it is destroyed, as in a disaster).
I've...had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.
The representation of data should not impose the need for contortion on the part of viewers of the data. If it does, the representation is flawed in that it embeds relationships rather than abstracting them from the representation (and you should always be interested in fixing flaws). The only thing that should contribute to the complexity of the presentation is the complexity of the relationship, not baggage from insufficiently general representation.
The central point is that storing your data in hierarchies inherently implies relationships (as long as the user looks directly at the representation, which can't be helped in the case of filesystems, yet). Another post points this out, perhaps unintentionally: "(Title: How can you not?) I never understood how you can have something organized or not." Exactly. If it lives in a hierarchy, then it's organized, ipso facto. Use the relational model; use a relational database.
ccleve's post "Don't bother with hierarchies" has a couple good tidbits too.
It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.
If your data is as brittle as this, you are doing it [data management] wrong.
http://www.scidb.org/
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and running it on the free VMware Player or a real VM solution if you have one.
I recently set up a demo showing the benefits of such a system. In about one day I was able to download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to OpenOffice libraries), and I could go about changing directory structure, moving around and renaming files, etc., and the source control would show me changes. In fact, I could go into the backend and write SQL queries if I wanted to, with detailed reports of how things were on date X or Y revisions ago. Was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and modified in Alfresco's database.
Here is a bitnami VM image, will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.
Make sure everyone's vote counts: Verified Voting
Use software specifically designed to archive scientific datasets and make them available to others. See, for example, DataVerse (http://thedata.org/home).
Hi, I've been working in immunology and microbiology for ~20 years at the bench [well, actually in a biosafety cabinet...] and face this same problem.
I've created a data directory on our server that is organized by experiment number [M199, M200, etc.] Each of these folders begins its life as a sequential number in either my lab notebook or in my boss's lab notebook as a hypothesis -- "does immunity to bacteria x depend on secreted cell product y in the lung?" I sketch an experimental design that will answer this hypothesis in my notebook. Next, I create a folder on our server with the experimental code, "M200" for example. Then I formalize the experimental design by creating an Excel spreadsheet that lists my experimental groups ["control uninfected mice", "control infected mice", "infected mice with gene knockout z", etc.] and set the timepoints for analysis as well as the assay by which I'll measure the validity of my hypothesis, typically something like bacterial load in the target organ. This Excel sheet also serves as an experimental record -- of the lot of bacteria I used to infect the mice, the various reagents used for secondary assays [like phenotyping the cells that are responding to the infection.] All of the quantitative instrumental data that I collect on the various cells and tissues [flow cytometry, fluorescent microscopy, Elispot assays, etc.] goes into subdirectories by time point: day 30, day 60, etc.
My lab has a couple of funding sources, they each get their own high-level directory tree. Grant "XX-10" > experiment "M200" > "Day 10" > "Flow Cytometry" > etc.
While I really, really appreciate the inherent limitations of this system, it is straightforward -- meaning new technicians and post-docs can use it without screwing it up too much. I can find a particular experiment by referring to my lab notebook. It satisfies my institution's intellectual property documentation requirements.
The key point of all this: I'm paid to answer the questions that were in the grant submission. My success [or failure] depends on my ability to do this work -- not to make or administer a database. I could try to get a grant to hire someone to do this for me, but the chances of getting it funded seem pretty slim -- what value does this really offer to the taxpayer? Will a database [although cool] tell me something about my results that I didn't know? If I design my experiments properly, the primary assay will clearly support my hypothesis, or not. By the way, my taxes are already way too high.
My IT department is very small, and basically only provides server backup and very limited user support. Any kind of database work is well outside of their capability.
Good luck. If you find something that fits the bill, please let us know. J
What I am looking for is a multidimensional addressing - file or database system.
Something like multiple B-trees for the content, with the possibility to add another B-tree index if required later.
Maybe the Google people have an answer?
One can mix-n-match: use flat files to store raw data, post-processed stuff, etc, but use a database to keep track of everything. The latter can even be handled by your OS from the get-go. On OS X you could write a spotlight plugin or two for your data files, and as long as the file format allows storing metadata within the files, spotlight will index it. Same goes for Windows Search. You could also use native mechanisms for adding metadata to files - those exist IIRC on both Windows 7 and on OS X.
A successful API design takes a mixture of software design and pedagogy.
A personal wiki that runs from one file, I link my files from there and I can add documentation and references at will. http://www.tiddlywiki.com/
DNA in your Linux: DNALinux
If you're really serious about tracking metadata, it may be worthwhile to take a look at some of the tools offered by caBIG:
https://cabig.nci.nih.gov/
The caBIG tools are geared toward using a model-driven approach to define precise metadata which promote semantic interoperability. Underlying the caBIG tools is a metadata repository called the caDSR, which follows the ISO/IEC 11179 Standard for Metadata Registries:
https://cabig.nci.nih.gov/concepts/caDSR/
The caBIG tools are all open-source, developed by the National Cancer Institute.
I currently look after 20+ million data files from a synchrotron radiation source totalling about 65 terabytes. This is rising at close on 1 terabyte per day during experimental runs. We have a Lustre file system storing the files as they are made. Filenames are mostly generated automatically, based on experimental station, date and a unique label given to each visit by a research team. File metadata is extracted automatically and wrapped up into a NeXuS file http://www.nexusformat.org which is also added to the archive. Each file generated by the instruments is queued up into a tape archival system, and inserted into the DB along with the metadata extracted earlier. The metadata is stored as triples and a strict ontology means that terms are always used as intended. Datasets can be created from this archive in many different ways and on the fly, and a web interface permits downloading of datasets from the archive. The schema we use for metadata is found at http://icatproject.googlecode.com
We do NOT store data in the database, just metadata. Part of the metadata is a persistent identifier to the data file within the archive meaning that filenames are largely unimportant as far as searching the archive is concerned. No duplicate files can occur because of automated filename generation.
This very same system is also used in a neutron source and a laser facility, so it is pretty generic.
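To make the "metadata as triples plus a strict ontology" idea concrete, here is a toy in-memory sketch in Python. It is not the ICAT schema or its API; the class, the predicate names and the identifiers are all hypothetical, and a real deployment would sit on a proper database.

```python
# Hypothetical ontology: the only predicates the store will accept.
ALLOWED_PREDICATES = {"station", "visit", "date", "archive_id"}


class TripleStore:
    """Toy store of (subject, predicate, value) metadata triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, value):
        # Reject terms outside the ontology so vocabulary cannot drift.
        if predicate not in ALLOWED_PREDICATES:
            raise ValueError(f"unknown term: {predicate}")
        self.triples.add((subject, predicate, value))

    def subjects_where(self, **conditions):
        """Return subjects that carry every requested predicate/value pair."""
        matches = None
        for predicate, value in conditions.items():
            found = {s for s, p, v in self.triples if p == predicate and v == value}
            matches = found if matches is None else matches & found
        return sorted(matches or [])
```

Because every file is addressed by its identifier (the subject) rather than by its name, a dataset is just the result of a query such as `store.subjects_where(station="I04", visit="v42")`.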
It's called the command line. Use it.
File corruption is unlikely with text files. If you should have corrupted files, you have a chance to recover them with text files. With binary files, databases etc. this becomes much harder.
To find your files later, tag them properly. Something like OpenMeta might help you.
All this talk of using whatever kind of database to organize your experimental data is nuts. It's well intentioned, I'm sure, but it's still insane. I always tell students that there is no general way to organise one's data; you have to find a system that works for you. I reckon that > 99% of physical science researchers (not just physicists, as seems to be a confusion in several replies) wouldn't be able to set up and use a database in a way that's better in terms of time and effort efficiency than just doing whatever it is that they already do to organise their data. Worse still, I reckon that it's the sort of thing that one would spend a huge chunk of time doing and then only use for a short while before you got bored of inputting stuff into the database properly and then started to forget to do it or, worse, resolving to "do it in batches". Anyway, the result of this will be that one eventually stops using the database and goes back to what one was doing before, but now with a huge hole in the data-trail where the database used to be. Alternatively, one will struggle on with the database for a while and then try and re-design it to put in all the features that were missed out in the original design, all the while sucking loads of time and eventually going back to the original method.
Getting on to some actual advice. I would suggest two features that your chosen system should have. It should be:
1. robust
2. quick
Personally, I have a lab book in which I record experimental details (what I actually did to generate the data). There's a date at the start of every day and the rough "titles" of the experiments and then the details of what it was. When I generate data files I organise them into a hierarchy of directories. So, there's a "projects" directory that has all the different projects in it. There might be a project that I'm working on to do with nanoparticles or something, and that'll have a directory called "nanoparticles". There's a bunch of directories in the project folder, such as "data", "analyses", "reports", etc. The Data directory is divided by the experimental technique that was used to get the data, such as "fluorescenceMicroscopy", "TEM" or "SEM" or whatever. Then the actual experiments are in directories inside the relevant technique. I name the directories by date first and then a brief indicator of what the experiment is about. So if I was looking for the aggregation behaviour of my nanoparticles in the presence of different polymers or something, I'd have a directory called something like
"20100815-FePt_particles_with_100k_PEG_in_PBS_at_pH7"
or something like that. Something that people often do is call a directory "20100815" or something like that. I used to do this, but I didn't find it useful to look back on after 6 months. People also forget that you can have something like 256 characters for the file/directory name - USE THEM! Inside this directory will be a bunch of data files that I acquire. I tend to start the naming of files with a number and then a description of the sample and what's the point of this data. So, for example the first image in a set of TEM images might be named "01-100mM_PEG_generalGrid_300x", the next will be named "02-...", "03-...", etc. This way, all the files are ordered in the order that they were acquired, which I find helps to find them later, since I think that you remember the order of things better than their absolute position. So, if I wanted to find an image of aggregated nanoparticles that I took some time in the winter, I could easily find "Projects/nanoparticles/Data/TEM/20091106-FeNP_with_20k_PEG_in_EtOH/03-10mM_PEG_aggregatedParticles_20kx".
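If you'd rather not type such names by hand, the convention is easy to script. A small Python sketch, assuming spaces in the description should become underscores and that two digits are enough for the acquisition number; the helper names are made up for illustration.

```python
import datetime
import re


def experiment_dir_name(day, description):
    """Build 'YYYYMMDD-description', with whitespace turned into underscores."""
    slug = re.sub(r"\s+", "_", description.strip())
    return f"{day:%Y%m%d}-{slug}"


def numbered_file_name(index, description):
    """Prefix a file description with a zero-padded acquisition number."""
    return f"{index:02d}-{description}"
```

For example, `experiment_dir_name(datetime.date(2010, 8, 15), "FePt particles with 100k PEG in PBS at pH7")` gives the directory name quoted above. The zero padding keeps a plain alphabetical listing in acquisition order.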
Anyway, this works for me, but my data needs to be organised so that I can get to the relevant data and then do something with it, not aggregate large amounts together and get some numbers.
Depending on the type of data, maybe a GIS might help you? In ArcGIS for example, you can link files (xls, csv), access databases, pictures/photo's, etc.
For processing RNA/DNA microarray data sets, we use a well defined (==strict) structure and naming convention for directories and files. We and 100s of users have used it for about 5 years now and it works quite well and suits most of our needs.
Directory structure:
First of all, paths should be relative to the current/working directory. This ensures that any scripts/code will be the same regardless of the absolute location of the data. Each data set is in a separate directory. Optionally, it may contain one or more subdirectories (possibly in multiple levels). The actual data files are located in these subdirectories. Depending on the type of data, the data set directories are located in different so-called root directories. Conceptually, the directory structure is: <rootPath>/<dataSet>/<subDir>/.
Format of file and directory names:
What really adds to the above, is that the names of the directories and the files follow a certain syntax. A filename (without the path) can be split up in two parts, its fullname and its filename extension:
<filename> := <fullname>.<extension>
In turn, the fullname part can be split up into the name and comma-separated tags:
<fullname> := <name>[,<tags>]*
This setup makes it possible to annotate data files with both human readable as well as computer interpretable information.
Similarly, the directory names have a fullname with a name part and optional tags.
An example of a file set is:
rawData/HapMap270,CEU/Mapping250K_Nsp/*.CEL
where rawData/ is the root path, HapMap270,CEU/ is the data set, Mapping250K_Nsp/ is a subdirectory and *.CEL are the data files. The name of the data set is "HapMap270" with tag "CEU". The subdirectory indicates that the technology used to measure the data is for microarray type "Mapping250K_Nsp".
With this strict directory structure, we can have methods/functions that automatically locate file sets by their (full)names without specifying absolute paths and so on.
When we process data, we store intermediate and final results in new file set directories where we add tags indicating what type of processing has been done. We sometimes also change the root path. For example, one of the first preprocessing steps of our raw data removes systematic effects. That step adds tag "ACC" and stores the data files in:
probeData/HapMap270,CEU,ACC/Mapping250K_Nsp/*.CEL
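The naming grammar above is simple enough to parse in a few lines. This is an independent Python sketch of the syntax only, not the API of the R.filesets package; the function names are hypothetical.

```python
def parse_fullname(fullname):
    """Split '<name>[,<tags>]*' into (name, [tags])."""
    name, *tags = fullname.split(",")
    return name, tags


def parse_filename(filename):
    """Split '<fullname>.<extension>' into (name, [tags], extension)."""
    if "." in filename:
        fullname, _, extension = filename.rpartition(".")
    else:
        # Assumption: a name without a dot has an empty extension.
        fullname, extension = filename, ""
    name, tags = parse_fullname(fullname)
    return name, tags, extension


def add_tag(fullname, tag):
    """Return the fullname with one more tag appended, as a processing step would."""
    return f"{fullname},{tag}"
```

So `parse_fullname("HapMap270,CEU")` separates the name from its tags, and `add_tag("HapMap270,CEU", "ACC")` yields the directory name used for the processed data.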
We were quite careful to design it so that it would work on as many operating and file systems as possible, including Unix, OSX and Windows. Commas are valid symbols in filenames in all these systems.
We did consider using relational databases to store the data, but realized that it adds lots of complications when it comes to backups as well as migration/sharing all or part of the data. At the end of the day, file systems are pretty neat and allow people to quickly get an overview of the content. There are file system browsers, you can check progress with a simple 'ls', access it via [s]ftp, web browsers etc. There are of course trade-offs you have to consider, but we found the file system to be more than good enough for what we wanted.
We do all our work in the R language. We have implemented utility classes and functions that provide easy access to such file sets. See the R.filesets package [http://cran.r-project.org/web/packages/R.filesets/] for more examples.
Yes, agreed, a combination is good (SQL + NoSQL + filesystem).
There is no one-size-fits-all scenario, here.
However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.
MongoDB's GridFS (especially with FUSE support) marries many of these features together. MongoDB does have some SQL DB features (such as indexing/searching and transactions) but not others.
Check out the whole stack here:
http://www.mongodb.org/
http://www.mongodb.org/display/DOCS/GridFS
http://github.com/mikejs/gridfs-fuse
Depending on the nature of your data, I used to use netcdf files quite a lot (https://www.unidata.ucar.edu/software/netcdf/). I also work on data sharing and standardisation, this is a full time job, so really you can spend as much time as you want on this and still not get it done. There are a variety of international data standards which exist to facilitate the management and sharing of data. I know you aren't necessarily talking about sharing your data, but many of the same issues apply. In short, unless you wish to spend a great deal of time on this or your employer has some data management solution for you then there are probably only a variety of unsatisfactory solutions which no doubt the slashdotters here will suggest :)
What you're looking for is a LIMS (laboratory information management system).
http://www.bikalabs.com/ -open source
Plenty of commercial LIMS available, but expect to pay $$$.
Just give all your datasets a number and put them in a database so you can search on all criteria you want.
You can also use Excel to keep track of your metadata if you want...
Privacy is terrorism.
You are finding the same problem everyone has with any data set. Hierarchical folders with one name only allow for a single, pre-arranged organization. It's terrible for the way we really use files, data, whatever really.
Store your data sets with simple "inventory names" like 00001 through 99999 or random serial numbers. Have a spreadsheet or database that associates all of your data sets with as many characteristics as you like. Then you can sort and find by any combination you can think of in the future.
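As a rough illustration of the "inventory name plus a table of characteristics" idea, here is a Python sketch using the built-in sqlite3 module. The table layout, the sample attributes and the `find_datasets` helper are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (dataset, characteristic); new characteristics need no schema change.
conn.execute(
    "CREATE TABLE attribute (dataset TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (dataset, key))"
)
conn.executemany(
    "INSERT INTO attribute VALUES (?, ?, ?)",
    [
        ("00001", "sample", "A"), ("00001", "pH", "7.0"),
        ("00002", "sample", "A"), ("00002", "pH", "5.5"),
        ("00003", "sample", "B"), ("00003", "pH", "7.0"),
    ],
)


def find_datasets(conn, **criteria):
    """Return inventory names whose attributes match every key/value given."""
    if not criteria:
        return []
    clauses = " OR ".join("(key = ? AND value = ?)" for _ in criteria)
    params = [item for pair in criteria.items() for item in pair]
    rows = conn.execute(
        f"SELECT dataset FROM attribute WHERE {clauses} "
        "GROUP BY dataset HAVING COUNT(*) = ? ORDER BY dataset",
        params + [len(criteria)],
    )
    return [row[0] for row in rows]
```

`find_datasets(conn, sample="A", pH="7.0")` then plays the role of a directory of symlinks, except that no combination has to be decided in advance.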
- For the complete works of Shakespeare: cat
There has been a move to content repositories like fedora and content/document management systems like alfresco. I'll throw rdf repositories into the mix as well.
Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.
The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.
So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:
My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
Ask me about repetitive DNA
I recently reengineered our data acquisition / storage system, building basically everything around XML.
The big advantage of XML is the huge ecosystem of tools that exist around it.
Spend some time coming up with an (extensible) schema for storing your data that guarantees the data is not ambiguous. At any point you can then validate your data against the schema. This lets you find problems right away that would otherwise go unnoticed until it comes time to analyze the data.
With XML you can make use of XQuery, XSLT etc to easily transform your data into another format, making it easier to collaborate with other people who use different formats.
I use eXist-db for storing and querying the data. This is one of these new NoSQL, document-oriented databases. The big advantage of this compared to relational databases is that you have to learn just one data structure (your schema); you don't have to shoe-horn the data from one format into database tables. You can then do very rich queries on the data using XQuery. Depending on your application and amount of data this might be overkill; you can always use the standard XML tools that come with any modern language on individual documents.
The only disadvantage I guess is if you have terabytes of data. You can store 'heavy' parts of the data in binary converted to base 64 inside the XML files and still maintain the benefits of the structure of the XML, though there is a bit of a penalty in size.
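The base64 trick can be sketched with nothing but the Python standard library. The element and attribute names below are invented for the example and are not from any particular schema.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_payload(sample_id, payload):
    """Build a <measurement> element with the binary payload base64-encoded."""
    element = ET.Element("measurement", {"sample": sample_id})
    data = ET.SubElement(element, "data", {"encoding": "base64"})
    data.text = base64.b64encode(payload).decode("ascii")
    return element


def unwrap_payload(element):
    """Recover the original bytes from a <measurement> element."""
    return base64.b64decode(element.find("data").text)
```

The size penalty is roughly one third: base64 emits four characters for every three bytes, so a 768-byte payload becomes 1024 characters of text.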
I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
Your question is really not clear. In both of the fields I work in, the requirements vary vastly, and they also vary among the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB/file; a single simulation run can give a geologist a 1 TB file.
Others, have a few hundred MB of data. Each is handled differently.
The data itself can be parsed and stored in a DB for analysis in some cases, and in others, that is very impractical and will slow down your work.
Each scientist has a different way of doing things.
So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, and what kind of data are you interested in? You should also seriously consider an archiving solution, because I guarantee you will run out of space.
The lunatic is in my head
I have worked for scientific instrument companies for many years and have come across this problem many times.
The optimal solution we have found is a hybrid approach.
We keep all the data in files; typically, fields of study or analysis programs have a file format they prefer.
Then, to manage the data, we use a database (usually SQLite, for its ease of use) to keep track of the files.
The database contains all the metadata needed to find a particular data set and a link to the original data: a URL or file path.
This allows you to find things fast using the database; once you have, you get the original data file using the link.
You need to write a program that walks all your files and builds the database. At first this can be simple, and you can evolve it as needed, rebuilding the database each time.
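A first version of such a walker can be very small. The Python sketch below records only what the filesystem already knows (path, extension, size, modification time); the table and column names are assumptions, and the domain-specific metadata would be added as the program evolves.

```python
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Walk root and (re)build a table with one row per data file."""
    conn = sqlite3.connect(db_path)
    # Rebuild from scratch each time, as suggested above.
    conn.execute("DROP TABLE IF EXISTS datafile")
    conn.execute(
        "CREATE TABLE datafile (path TEXT PRIMARY KEY, extension TEXT, "
        "size_bytes INTEGER, modified REAL)"
    )
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            if not os.path.isfile(full):
                continue  # skips dangling symlinks, which cannot be stat'ed
            info = os.stat(full)
            conn.execute(
                "INSERT INTO datafile VALUES (?, ?, ?, ?)",
                (full, os.path.splitext(name)[1].lower(), info.st_size, info.st_mtime),
            )
    conn.commit()
    return conn
```

The `path` column is the link back to the original file; everything else exists only to make the search fast. If the database lives on disk, keep it outside `root` so it does not index itself.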
It works well, and we designed the database schema for extensibility, normalized and all that. (and the thing is growing and adapting well to changes, but that DB design is all important to make that not so hard)
Looks like the best plan for stuffing a lot of data away and finding it later on. www.coultersmithing.com will show you a bit of what we are up to here. Or our forum at Science/Engineering/Tech forums
Why guess when you can know? Measure!
Try the National Instruments product DIAdem, which is designed to store, manage and analyze large datasets. It will probably work better than any roll your own solution, with less pain in setup and maintenance. Go to http://www.ni.com/diadem for information, and links to a free 30 day demo. It is not cheap, so definitely try it before you buy! As with any other approach to file organization and management, it isn't a magic bullet if you start with a mess - you'll have some work to do, transferring everything so that the new arrangement is organized, and if you are continually haphazard about organization, this tool will be of limited use.
Not a shill for the company - I am a wage slave (staff) at a large public university, and have been relatively pleased with both the products and the support from the company.
I'm also an experimental physical scientist. My experience tells me that I have absolutely no idea what kind of meta-data I'll want to keep track of in the future, and I only know what I want to keep track of now, which is probably a small subset of what I'll want to keep track of in the future. Every sample that I make is assigned a unique serial number (Experiment N Sample M Piece Q etc). All the master data is in my lab notebook which I keep anal retentively. Any metadata that I know now that I want to keep track of is contained in there. Any analysis I do on any sample I make is also filed under this serial number. Now I just need to convince my boss to let me switch to an electronic notebook (like Microsoft OneNote) so that I can assign each sample or each experiment its own tab so I don't have to jump pages back and forth in my paper notebook.
One way of looking at the problem is by organizing information about the data versus organizing the data itself. One way to do this might be to have a text file describing facts about your data in the same directory as the data. Use something like Solr and Lucene to index these text files. You can make search queries without needing a uniform schema that describes the meta-data about your experimental data. It would be like "Googling" your data. You could organize it in such a way that you can search for specific info like date etc., or look up by search terms that might be found in the text.
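To show the principle without installing anything, here is a toy inverted index in plain Python. It is a stand-in for what Solr/Lucene do at much larger scale, not their API; the function names and the sample notes are invented.

```python
import re
from collections import defaultdict


def build_inverted_index(descriptions):
    """Map each lowercase word to the set of directories whose notes contain it."""
    index = defaultdict(set)
    for directory, text in descriptions.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(directory)
    return index


def search(index, query):
    """Return directories whose notes contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Here `descriptions` would be filled by reading the free-text notes file from each data directory; no schema is needed because every word becomes searchable.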
I worked on simulations in Materials Science during undergrads and grad school. My accumulated files consisted of raw data, processed data, scripts for processing, intermediate documents (presentations, utility programs etc). In general, IMHO -
1. Flat files with no metadata work best for personal use and sharing with collaborators.
2. RDBMS, tagging etc might be an overkill.
3. Arranging data in the way you obtained it is the best way. Relationships between the data etc. go into your publication.
Hence I used -
$ROOT/project/experiment-strategy/date/file-type/meaningful-file-name.extension
Here,
1. project : Name of the project (e.g. Nickel_Plasticity)
2. experiment-strategy : The sort of work you did to get the data (e.g. Tensile_Test)
3. date : MMDDYY (e.g. 120210)
4. file-type : raw | plot | script | misc
5. meaningful-file-name.extension : Some file name which will immediately remind you of whats inside (e.g. two_cycle-100mpa.full)
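A scheme like this is easy to enforce with a small helper. A Python sketch, assuming POSIX-style paths and treating the four file types listed above as the only valid ones; the function name is hypothetical.

```python
import datetime
from pathlib import PurePosixPath

FILE_TYPES = {"raw", "plot", "script", "misc"}


def data_path(root, project, strategy, day, file_type, file_name):
    """Build $ROOT/project/experiment-strategy/date/file-type/file-name."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"file type must be one of {sorted(FILE_TYPES)}")
    # The date component follows the MMDDYY convention described above.
    return PurePosixPath(
        root, project, strategy, day.strftime("%m%d%y"), file_type, file_name
    )
```

For instance, `data_path("/data", "Nickel_Plasticity", "Tensile_Test", datetime.date(2010, 12, 2), "raw", "two_cycle-100mpa.full")` produces a path whose date component is `120210`.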
Once the project is over, I gzip the data and send to backup tape.
I suggest you try dBASE and its clones/dialects. Some consider it old-fashioned and bash it for silly reasons[1], but it's a nice compromise between flat-files and RDBMS in my opinion. Its pre-mouse design is great for ad-hoc textual scripting and makes it very keyboard-friendly, a lost art.
It was even invented for scientific usage at NASA's JPL lab in the late 70's (although was influenced by other products). Unfortunately, there's no current (finished) interpretative open-source versions out there, only compiled. For certain data-chomping tasks I really miss it. (My shop won't approve it.)
[1] It may be one of those tools that's highly subjective: some love it, some hate it with few in-between, kind of like Neil Diamond tunes.
I face the same issues too. Immensely large datasets, that change and no proper way of tracking them through the file system. Trust me, when I say this -- it will be worth your while to spend some time thinking about your requirements and do some quick coding to get an infrastructure in place.
It was said here before (I guess just a couple of posts above), but this is right on mark -- You have got to separate out your data and meta-data. Text files are immensely convenient and to be honest, that is also where I prefer to store my actual data. But statistics about my data I store in a quick relational database. My meta-data db consists of fixed columns that have all the statistics for my data sets that I usually need. For example: Date/Time, Number of Columns, Number of Rows, Row Description, Column Description, Algorithm Description, Parameters, Special Notes, File Name of Data, etc.
Row Description, Column Description, and Algo Description point to separate helper tables.
I also agree with many people here that relational databases can be an overkill for a manageable database, but if you generate a lot of datasets, the break-even point is reached almost immediately. Besides, text files even though extremely convenient for a quick grab and feed into a software are simply horrid when it comes to trending across many datasets.
Now depending upon your skill set, setting up this database could be a day's job or it can take you weeks. If you are computationally inclined though -- go for the relational database to store the meta-data, keep your main data in text files.
If you are not, there is excellent software out there that gives you a nice interface to a relational database.
Nevertheless, my main point is, whatever route you choose, it will help you to separate out your actual data, and stuff about that data (meta-data).
But one thing you will find is that you have to use SQL to get the best out of any relational database, and this involves thinking in a new way - it's basically set-oriented - rather than sequentially row by row. This takes a bit of effort, but can be rewarding, as you will discover new ways of achieving some of the things you want to do.
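The difference between the two styles is easiest to see side by side. A Python/sqlite3 sketch over a made-up metadata table (the column names and values are illustrative only): both functions compute the average number of rows per algorithm, one by looping in application code and one by describing the result to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (file_name TEXT, algorithm TEXT, n_rows INTEGER)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        ("run01.txt", "fft", 1000),
        ("run02.txt", "fft", 3000),
        ("run03.txt", "wavelet", 500),
    ],
)


def mean_rows_loop(conn):
    """Row by row: fetch every record and accumulate in application code."""
    totals = {}
    for _name, algorithm, n_rows in conn.execute("SELECT * FROM dataset"):
        count, total = totals.get(algorithm, (0, 0))
        totals[algorithm] = (count + 1, total + n_rows)
    return {alg: total / count for alg, (count, total) in totals.items()}


def mean_rows_sql(conn):
    """Set-oriented: state what result is wanted and let the database compute it."""
    rows = conn.execute(
        "SELECT algorithm, AVG(n_rows) FROM dataset GROUP BY algorithm"
    )
    return dict(rows)
```

Both return the same dictionary; the set-oriented version just leaves the grouping and arithmetic to the database, which is where the "new way of thinking" comes in.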
Depending on your exact requirements, this is perhaps a fit for an Enterprise Content Management system. If these datasets are heterogeneous, I'd think looking into some kind of flexible meta-data system would be the way to go. This can be anything from a custom application, a bought solution or an open-source ECM system.
Though I caution you, some of these systems can be convoluted to set up, maintain and learn. Don't let that scare you away from the concept of it though.
I do a similar job, and I use that combined with the ancient technology of a paper notebook. Just as you do the experiment you note what it is in each experiment/series of experiments.
The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).
There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.
Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.
Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:
There's also Bioinformatics, Health/medical informatics, chemical informatics, etc. plug in your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group, or person you can write to to try to get more info and advice.
Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.
Build it, and they will come^Hplain.
You said you're dealing with physical science. From what you describe, I'd guess that you're dealing with earth science, from what we call "small science". (lots of smaller investigations that can be done with a small team, rather than the multi-million dollar satellite or sensor grid projects).
I'd suggest talking to one of the following groups:
There's a hell of a lot more groups out there, but those two larger groups would be able to steer you towards more specialized groups that deal with a specific scientific discipline.
Build it, and they will come^Hplain.
I'm not sure this answers my question. It sounds like you're basically saying all the people who are telling us to put our data in the database are wrong. Which is possible, but it doesn't explain what they thought we were going to do with it in the database in the first place. But then you're suggesting that we do a tremendous amount of extra work, including some very touchy coding, to get the metadata into a database. Unless I'm missing something, keeping the metadata up to date would then require rewriting all of the software we use (the software modifies, creates, and deletes files). Even if this were possible, it would be extremely fragile. And I'm not sure what the benefit would be. I have no idea what the distinction is between "row by row" and "set-oriented," because we don't use a database at present, and I can't even imagine what "row by row" would mean for our data. This is rich scientific data, not transactions in some financial system. And the supposed benefit sounds very meager. We don't need new ways of achieving things, unless they're better than our current things.
Still baffled.
I use a program called Personal Brain, available at www.thebrain.com. I prefer to pay for the upper-end version due to its better functionality, but the company has a free version as well. They also sell an enterprise version that is usable by multiple users. I don't have experience with this version, but I presume the function(s) is/are similar.
they're even good for storing video!
You can use Nutch (http://nutch.apache.org/) to crawl your documents, then you'll have a Lucene index. Nutch also gives you a basic search page as a frontend as you go through your documents. For grouping of search results Carrot (http://project.carrot2.org/) has some very interesting algorithms for research work, and Nutch has a native plugin for Carrot that has shipped since 1.0.
fak3r.com
It doesn't work so well for biology subjects (they all become autopsy experiments, and after a week or so the smell from my desk reminds me of a "Body Farm"), but I LIKE sticking things on a long vertical metal rod. (I KID, I KID...)
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
with poor spelling, the idea of RDBs or other stuff just sounds like a nonstarter. (If you think installing Apache with a one-click installer works, then you have no idea what real scientists are like; perhaps this means that they don't have the training they should have, but I have used these one-click installers, and they are way, way, waaaay too complex for me, and I am sure, for most scientists, for whom an Excel pivot table or VLOOKUP is the height of programming expertise.)
The problem is that "scientific data" is often like the real world: messy, undefined stuff. It also typically includes all sorts of things: images from cameras ranging from consumer point and clicks to Peltier-cooled 24-bit CCDs, flat files uploaded from machines, complex proprietary files, handwritten notes, etc. For instance, for a recent experiment I did, I have
".xad" files, which are a proprietary format from an Agilent instrument
".nef" files from a Nikon consumer camera
several .txt files which need different converters to put them into useful Excel files
several .pngs
some of the key .txt files were imported into .xlsx, then imported into a specialized graphics program (Igor) which has its own proprietary format...
my written notes on how to do the experiment (word doc with embedded excel minitables, which show calculations for things like how many grams of salt per liter of solution..)
proprietary files from a nanodrop (don't ask) which can be opened with Excel to get sort of usable data
hand-entered data from a thermo electron conductivity meter (integers), a cheap balance, lot numbers of key reagents, pointers to previous experiments where I describe how key reagents were prepared, pointers to key experiments with background info
I don't think a single one of these files has the data organized in a really useful manner; all require extensive hand editing to get them into a form that is even understandable, let alone integrated.
not to mention the hideous problems of writing reports in word with embedded pictures, the loss of functionality in excel 2007 (x error bars hard to do, xy graph format reverts to line format without telling you,etc)
what this guy really wants is sort of the holy grail of web searching - the ability to do natural language queries on his data set.
Like, for growing fish, what is the effect of pH (acidity) and ionic strength (how much salt is in the water) on body length vs time
In the meantime, what would help would be
1) if everyone complained to Gates and Ballmer, and got them to divert some minuscule fraction of their monopoly-level profits from Office, and put those profits to work making Office usable for technically minded people
2) A way to keep track of files across changes in file name and directory name.
this is a huge problem; what would help is a way to embed a serial number into a file, so that every time you use the file, the serial number goes with it. E.g., you have a .tif, you crop it in an image editor, import the crop into PowerPoint, add text, export the mess into Word, save the Word doc, re-edit it a year later, then two years later, looking at the re-edited Word doc, you need to find the ORIGINAL .tiff, and in the meantime, you have moved to a new computer, and have a different directory tree.....
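A weaker version of that serial number can be had today by keying on file contents rather than file names. The sketch below is only an illustration in Python: the function names (sha256_of, register, locate) and the idea of a plain dict as the registry are assumptions made here, not an existing tool. It finds an untouched original after it has been moved or renamed, but it cannot follow a crop or an export, since those change the bytes.

```python
import hashlib
import os
import uuid


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(path, registry):
    """Assign a serial number to a file, keyed by its content hash.

    `registry` is a hypothetical plain dict {content_hash: serial}.
    Registering the same bytes twice returns the same serial.
    """
    content_hash = sha256_of(path)
    if content_hash not in registry:
        registry[content_hash] = str(uuid.uuid4())
    return registry[content_hash]


def locate(serial, registry, search_root):
    """Return paths under search_root whose contents carry this serial."""
    wanted = {h for h, s in registry.items() if s == serial}
    matches = []
    for folder, _dirs, files in os.walk(search_root):
        for name in files:
            candidate = os.path.join(folder, name)
            if sha256_of(candidate) in wanted:
                matches.append(candidate)
    return matches
```

The registry is just a dict, so it could be saved as JSON next to the data; noting the serial of the original in the Word doc or the lab notes is then the manual half of the job.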
You might want to take a look at perfbase (perfbase.tigris.org).
It's an experiment management tool with a postgresql backend.
I write a Rails server for each of my experiments. Rake tasks are GREAT for bioinformatics pipelines, and migrations and database backups make it incredibly easy to save old experiments.
Plus, various Ruby gems (starling, workling) enable me to farm out long-running experiments to a variety of lab machines. I can use Rice to write C++ Ruby extensions that are compiled individually on different machines.
And all of this is stored in a PostgreSQL database. (MySQL is slow for complex joins, which you sometimes have to write in bioinformatics.)
I also like plain old text files. I have been using version control systems like git to keep track of my research data.
You are going to the wrong people. Go to your Records and Information Managers (RIMS) if you want advice on organising information and futureproofing it. This profession has been organising information since Sumerian times c. 5-7000 years ago, and the modern canon of information management really took off about 150 years ago. Asking the IT guys is like inviting a mechanic to help you design a car: they are great at fine tuning and understanding performance, but their skill set is not primarily that of a designer.
Most IT professionals think in time scales of 6 months or 6 years, not in 60 or 600 years like RIMS do. I am not in any way denigrating IT professionals; it's just that the different disciplines have a different focus. I have worked at the front end as a Records and Information Manager and at the back end as a Digital Archivist, and I can tell you that in the IT world no one gets any money to fix 'yesterday's problem'; old 'stuff' just gets neglected. It is telling that the term 'archiving' in the IT sense means sticking something on off-line or near-line storage with minimal management, rather than the recordkeeping sense of being catalogued, arranged, described, monitored and having retention and disposal applied, as well as running obsolescence risk profiling and migration protocols. Hopefully all three perspectives will start working together to solve some of these issues. We can still access the science of ancient Greece, Isaac Newton, and Captain Cook's scientific measurements of weather patterns in the Pacific from the 1700s; will we still have access to today's data in 2310?
Regards
Stephen Clarke
I agree with many people here: plain ASCII does a good job for measurement data (as long as it does not run to terabytes). Nothing beats good old CSV files.
1. They are easily readable by humans on any (upcoming) platform and with nearly all kinds of software
2. You can easily process them
3. You can easily store them
4. They will still be valid at a time when MATLAB structures, binary blobs, LabVIEW blablabla or SQL-whatever are long gone or so heavily modified that "it can't read that old junk".
Ask a senior about his data posted on the presentation from 1990. Quite likely he'll tell you it's on a 5 1/4 inch disk, and if you get him an old box with LOTUS whatever and MS-DOS as well as a floppy drive, he might try to read it. Whereas the last two might still be available somewhere, it's rather unlikely that you will find the very special software he used 30 years ago to read the data, and even if you can read the data there might be no way to export them.
However, I found that it is much harder to keep track of the surrounding parameters and data than of the measurement data itself. If I look at a file of a measurement from a few years ago, I can still generate a fancy plot. But hey...
did I rinse before I changed the analyte media ???
How long did I wait for the set-up to settle down ???
Did I switch on the measurement devices an hour early to warm up ???
What was the room temperature ??
And was this measurement performed with this faulty BNC cable as I found out months later ???
Did I use this or that reference electrode for this measurement ???
Ohhhh which version of my home-made software did I use? Maybe it was still the (more) buggy one ???
And what the heck was the reason for this peak at the measurement start ???
No doubt all this could be written down (or has to be written down) in your laboratory journal. But it's hard to keep it up-to-date and maintain everything else in electronic form.
I propose to check out org-mode (http://orgmode.org/) an Emacs mode with a very rich feature set to organise plain text files.
This mode is perfect for creating an electronic version of a laboratory journal. You can refer, link, annotate, outline, mark, prioritise, timestamp, set reminders, create agendas, set deadlines, even use literate programming to create your results..... the feature set is simply exhausting.
Furthermore, I use Git to keep track of all changes I make to my home-made data evaluation software, the org-mode files and my publications. The log gives you a decent way to check what you did with your original data over time and which version of your software tools you had at that time. Thus, finding out why the graph now looks different compared to the one from two months before is much easier, since you can check how your software changed over that period.
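For the questions listed above (room temperature, warm-up time, which electrode, which software version), a small machine-readable record written next to each data file at measurement time is one more way to pin them down alongside the journal. This is only a sketch in Python; the function name write_sidecar, the ".meta.json" suffix and every field name are made up for illustration and have nothing to do with org-mode or Git themselves.

```python
import json
from datetime import datetime, timezone


def write_sidecar(data_path, **conditions):
    """Write <data_path>.meta.json describing one measurement.

    `conditions` holds whatever the experimenter wants to remember;
    the keys are free-form and purely illustrative.
    Returns the path of the sidecar file that was written.
    """
    record = {
        "data_file": data_path,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "conditions": conditions,
    }
    sidecar_path = data_path + ".meta.json"
    with open(sidecar_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return sidecar_path
```

A call might look like write_sidecar("run_042.csv", room_temperature_c=21.5, warm_up_minutes=60, reference_electrode="Ag/AgCl", software_commit="abc1234"), where the commit id would be copied from the Git log of the evaluation software. The result is plain text, so it can be read on any platform and committed alongside the data.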
Just my two cents
I have pretty much the same problem. The file system became a mess, and databases didn't help much with entire directories of up to 1 gig. Plus, for scientific traceability, experimental data should be self-sufficient: external links (to other data, scripts, software, etc.) are toxic. So, I developed two open-source tools to organize my experimental data.
1) New experiments: Basic Experimenter. A wizard that stores together the datasets and all the files of the experiment (protocol, checklists, scripts...). http://sites.google.com/site/basiclabbook/basicexperimenter
2) Old experiments: Basic Bookcase. Datasets are zipped and stored in a scientific repository as documents (BibTeX category 'dataset'). http://sites.google.com/site/basiclabbook/basicbookcase
This did not solve the previous cross links, but it helped a lot.
Science Informatics. You just told me what I do, when I'm not doing my own research... which seems to be happening less and less these days. A lot of the kids we train now can write something in Matlab or IDV, but couldn't perform a first normalization on a database if their lives depended on it. A lot of them have learned a little perl... or .Net, but know nothing of C or Fortran, and can't spell MPICH even when the models they use depend on it.
And seriously, corporate and university IT staff are not suited for this purpose, just as I'm poorly suited to help you with your next Windows installation. My expertise lies elsewhere. Getting the senior management to understand these differences, however, is a problem.
Never ascribe to malice that which can adequately be explained by tenure.
I run a lot of numerical simulations, and we run into the same issues, as a few people have noted above. Many of my colleagues still write code in Fortran 77 and manually give descriptive filenames to their data. It's a mess. There's a guy at the University of Toronto, Greg Wilson, whose work centers on getting computational scientists--largely people who run simulations--to become more sophisticated. He runs a one-week course teaching software engineering principles to computational scientists. It's available online for self-study at http://www.software-carpentry.org/ One section is a lesson on using databases (he demonstrates SQL) to handle numerical data, which I imagine looks a lot like experimental data. He's currently rewriting the course, but I have learned a lot from it.
You need your "master data" (ie: data that describes your work and is considered to be of quality) centralized in something like an RDBMS (SQL Server, etc). Once you have it there you can write scripts to find orphans. Basically you will have a metadata table that will contain all available 'links' to your master data. You then try and find the other side of the "link". The question you are trying to answer is "is the meta data of any quality?".
So, if you needed a table it would look something like this.
Unique key of Master Data, Flag to determine if unique key exists (because this changes over time), Meta Data key, flag to determine if meta data exists(because this changes as well)
You then run a script against this nightly to determine if your "links" are broken. I also like to have it visualized by having an HTML page display a nice green check for a good link and a big red x for a bad link.
If you do this, you will really have a good idea of your data and it will allow you to QA it much better.
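The table and nightly check described above could be sketched roughly like this. It uses Python's built-in sqlite3 instead of SQL Server purely so the example is self-contained, and the table and column names (master_data, meta_data, link_check and so on) are invented for illustration.

```python
import sqlite3

# Hypothetical schema: two source tables plus the link-check table with
# one existence flag per side, as in the four columns listed above.
SCHEMA = """
CREATE TABLE master_data (master_key TEXT PRIMARY KEY);
CREATE TABLE meta_data (meta_key TEXT PRIMARY KEY);
CREATE TABLE link_check (
    master_key    TEXT NOT NULL,
    master_exists INTEGER NOT NULL DEFAULT 0,
    meta_key      TEXT NOT NULL,
    meta_exists   INTEGER NOT NULL DEFAULT 0
);
"""


def refresh_link_flags(conn):
    """Recompute both existence flags for every row of link_check."""
    conn.execute(
        """
        UPDATE link_check SET
            master_exists = EXISTS (
                SELECT 1 FROM master_data m
                WHERE m.master_key = link_check.master_key),
            meta_exists = EXISTS (
                SELECT 1 FROM meta_data d
                WHERE d.meta_key = link_check.meta_key)
        """
    )
    conn.commit()


def broken_links(conn):
    """Return (master_key, meta_key) pairs where either side is missing."""
    rows = conn.execute(
        """
        SELECT master_key, meta_key FROM link_check
        WHERE master_exists = 0 OR meta_exists = 0
        ORDER BY master_key, meta_key
        """
    )
    return rows.fetchall()
```

The nightly job would call refresh_link_flags and then feed the result of broken_links to whatever draws the green checks and red x's.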
It will most certainly be a pain in the a$$, but you could program a crawler for finding dead links, and you could also use it to batch change metadata. You also might want to keep a crawler-maintained index of what you have and its creation date.
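For the symbolic-link case in the original question, such a crawler does not have to be big. A minimal sketch in Python, assuming a POSIX file system; the function names find_dangling_links and retarget are made up here, and the prefix-swap in retarget is just one possible repair rule.

```python
import os


def find_dangling_links(root):
    """Yield (link_path, target) for each symlink under root whose target is missing."""
    for folder, dirs, files in os.walk(root, followlinks=False):
        # Symlinks can show up in either list, so check both.
        for name in dirs + files:
            path = os.path.join(folder, name)
            # exists() follows the link, so it is False when the target is gone.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)


def retarget(link_path, old_prefix, new_prefix):
    """Repoint one symlink whose target starts with old_prefix.

    Returns True if the link was rewritten, False if it was left alone.
    """
    target = os.readlink(link_path)
    if not target.startswith(old_prefix):
        return False
    new_target = new_prefix + target[len(old_prefix):]
    os.remove(link_path)  # removes the link itself, never the data
    os.symlink(new_target, link_path)
    return True
```

Looping retarget over the output of find_dangling_links after a directory has been moved would redirect all affected links in one pass; printing the list first and checking it by eye is probably wise before letting it rewrite thousands of links.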
I am a behavioral neuroscientist and MySQL has been invaluable for my data storage and analysis.
MATLAB is my main analytical tool, so i generally keep my matlab code in a subversion repository and use MyM to communicate back and forth between MATLAB and MySQL.
I've been using MySQL since 1999. If I was getting into the game now, i might choose another relational DB.
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765
As an example of the former: The patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5 year old blue whales" are strongly disjoint and there is essentially no description reduction scheme that can handle such a broad range of inputs. Equivalently, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to have a unified data reduction description. I've led programs to address this range of problems in several ways:
Sort them as they should be sorted, in a mindmap.
'I am become Shiva, destroyer of worlds'
Hey, a hint for organizing data in natural sciences etc.:
I keep all my biology data in flat files (xls mostly, csv, pcg, etc...) in folders named by Year-Protocol, but they are all submitted to Mendeley (mendeley.com), which is like Endnote but freeware, and allows you to search the text *inside* files and also to create separate libraries inside your global library. It doesn't move files, it only links to them, so you can actually have the same file appearing in more than one library (eg. one Methods file appearing in Experiment 1 and Experiment 2). It'll tell you the file's location. It's been working for me... Mendeley is my Explorer now...
Depending on your budget / affiliation, you might want to look at a scientific data management system (SDMS), the original, and still, (IMHO) is NuGenesis (http://www.waters.com/waters/nav.htm?cid=513068&locale=en_GB), now owned by Waters Inc. It will automatically capture and catalogue both raw data files and electronic printouts, store any describing (meta) information in an Oracle DB, and securely store the files for you. It can even be put in front of fancy content-addressable storage (http://en.wikipedia.org/wiki/Content-addressable_storage) like Centera.