Slashdot Mirror


How Do You Organize Your Experimental Data?

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"

44 of 235 comments

  1. Here by Xamusk · · Score: 2, Funny

    I store them in first posts.

  2. Use databases! by Cyberax · · Score: 3, Insightful

    Subj.

    If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

    1. Re:Use databases! by garcia · · Score: 2, Insightful

      If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

      That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here, here, here, and here) for several reasons:

      1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with MySQL.

      2. Yes, I could use MySQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.

      3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in MySQL I would have to make the CSV, pipe it into the MySQL, convert it back out to CSV and then upload it. An additional step for nothing.

      Hey, no method is perfect for everyone and every project is a little different and while it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.

      YMMV.

    2. Re:Use databases! by Idiomatick · · Score: 2, Insightful

      I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

      The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

    3. Re:Use databases! by grub · · Score: 2, Funny

      If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

      --
      Trolling is a art,
    4. Re:Use databases! by rumith · · Score: 4, Interesting
      Hello, I'm a space research guy.
      I've recently made a comparison of MySQL 5.0, Oracle 10g and HDF5 file based data storage for our space data. The results are amusing (the linked page contains charts and explanations; pay attention to the conclusive chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.

      Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.

      So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...

    5. Re:Use databases! by Dynedain · · Score: 2, Insightful

      Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intent on hiring one.

      Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagine that you'd like to provide or incorporate data with some outside sources or other researchers. If you're using something "standard" like a relational DB, it will be much easier to hire a DB wizard than to find a programmer who can piece together a lot of mismatched files and convoluted organization schemes.

      This is what databases are designed to do. Just because you're not an expert at setting them up, and there's a performance hit to setting them up wrong, doesn't mean that they aren't still the right tool.

      --
      I'm out of my mind right now, but feel free to leave a message.....
    6. Re:Use databases! by BlitzTech · · Score: 2, Insightful
      Apache: click the install button (use default options, or switch to non-service mode which it very clearly explains means it only runs when you run it instead of whenever you start your computer)
      MySQL: click the install button (use default options, they're all fine)
      phpMyAdmin: put in document root, configure ("click the install button")

      And you're set. How was that hard...?

      Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learning to use a database is probably a good idea. Many scientists are taught MATLAB, and setting up a WAMP stack is much, much easier than learning MATLAB.

      I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

      Scientists are pretty smart. He should learn to use a database.

    7. Re:Use databases! by clive_p · · Score: 2, Interesting

      As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV format (comma-separated-value). We ended up using Postgres as it had the best spatial (2-d) indexing, beating MySQL at the time. The expensive commercial DBMS like Oracle didn't have anything that the open-source ones did for modest-sized scientific datasets. I found Postgres was fine for our tables, which were no bigger than around 10 million rows long and 300 columns wide. You might well get better performance using something like HDF but you'll probably spend a lot more time programming to do that, and it won't be as flexible. The only thing you can be sure of in scientific data handling is that the requirements will change often, so flexibility is important. If your scientific data are smallish in volume and pretty consistent in format from one run to the next, you might consider storing the data in the database, in a BLOB (binary large object) if no other data type seems to suit. But a fairly good alternative is just to store the metadata in the database, e.g. filename, date of observation, size, shape, parameters, etc and leave the scientific data in the files. You can then use the database to select the files you need according to the parameters of the observation or experiment.

    8. Re:Use databases! by rockmuelle · · Score: 3, Interesting

      I've built LIMS systems that manage petabytes of data for companies and small scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: Databases + files + programming languages. Put your meta-data and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.

      Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing meta-data. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.

      The main points:

      1) Use databases for what they're good at: managing meta-data and relations.
      2) Use programming languages for what they're good at: analysis.
      3) Use file systems for what they're good at: storing raw data.
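
      A minimal sketch of that split, assuming Python's built-in sqlite3 module. The table name, column names, sample names and file paths below are all made up for illustration; only the metadata and a pointer to each raw file go into the database.

```python
import sqlite3

# Hypothetical schema: one row per raw data file, with the experimental
# conditions stored as columns and the file itself left on disk.
conn = sqlite3.connect(":memory:")  # use a file path for a real catalog
conn.execute(
    """
    CREATE TABLE runs (
        id         INTEGER PRIMARY KEY,
        sample     TEXT NOT NULL,
        ph         REAL,
        experiment TEXT NOT NULL,
        run_date   TEXT NOT NULL,        -- ISO 8601 date
        path       TEXT NOT NULL UNIQUE  -- pointer to the raw file
    )
    """
)

# Made-up example rows following instrument_name/project_name/date paths.
rows = [
    ("lysozyme", 4.5, "nmr", "2010-08-12", "nmr600/lysozyme/2010-08-12/run01.dat"),
    ("lysozyme", 7.0, "nmr", "2010-08-13", "nmr600/lysozyme/2010-08-13/run01.dat"),
    ("ubiquitin", 7.0, "cd", "2010-08-13", "cd1/ubiquitin/2010-08-13/run01.dat"),
]
with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO runs (sample, ph, experiment, run_date, path) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )


def find_runs(conn, sample=None, ph=None):
    """Return raw-file paths matching the given conditions (None means any)."""
    query = "SELECT path FROM runs WHERE 1=1"
    params = []
    if sample is not None:
        query += " AND sample = ?"
        params.append(sample)
    if ph is not None:
        query += " AND ph = ?"
        params.append(ph)
    query += " ORDER BY run_date, path"
    return [row[0] for row in conn.execute(query, params)]


neutral_runs = find_runs(conn, ph=7.0)
```

      The analysis code then opens whatever paths the query returns, so the files never have to move when you want a new way of slicing them.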

      -Chris

    9. Re:Use databases! by BrokenHalo · · Score: 2, Informative

      If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

      I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from wife) spent sorting them out when a subscript went out of range.

    10. Re:Use databases! by mobby_6kl · · Score: 3, Interesting

      Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.

      1. I don't know where you get the data from, but anything awk/sed can do, so can Perl. And from Perl (or PHP, if you're lame) it's very easy to load the data into a database
      2. It's easy to connect to an SQL server from the remote machine and either dump everything or just select what you need. You'll need more than notepad.exe to do this, but it's not rocket science. Pivots in Excel can be really useful, but Excel can easily connect to the same database and query the data directly from there and use it for your charts/tables.
      3. Since by this point you'll already have all the data in the db, exporting it to CSV would be trivial. Or you could even skip Google Docs entirely and generate your charts with tools which can automatically query your database.

      I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solutions, or maybe stone tablets are.

    11. Re:Use databases! by bunratty · · Score: 3, Insightful

      I've helped my wife, who is a research scientist, by writing scripts to process her data. The IT departments at the companies where she's worked have no idea about what her work is or how she does it, and perhaps have even less interest in helping her. Their function is to keep the infrastructure (networks, file servers, email servers, etc.) working and install software packages onto the computers. They aren't of any use in helping individual users with their work. They will install SAS and S-Plus but will not help by writing SAS and S-Plus code, for example. You might be in the same situation as I am from your comment about your wife.

      --
      What a fool believes, he sees, no wise man has the power to reason away.
    12. Re:Use databases! by jandersen · · Score: 2, Interesting

      ...using relational databases for pretty common scientific tasks sucks badly performance-wise.

      Well, it has never been a secret that relational databases do not perform as well as e.g. a bespoke ISAM or hash-indexed data-file, mostly due to the fact that it involves interpretation of SQL. But then the main purpose of SQL databases has never been to optimise raw performance - rather the idea is to provide maximum, logical flexibility. The beauty of relational databases is that you can change both data and metadata at the wave of a hand, where you in the high-performing, bespoke database have to go and rewrite significant amounts of code.

      At the end of the day you choose your tools to fit your needs, or at least that is what you ought to do.

      As for the OP's question: the main problem seems to be one of having to rename and move stuff; this is clearly the area where SQL is strongest.

  3. Separate data from presentation by mangu · · Score: 4, Informative

    In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

    Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.

    I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
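
    If the links have already gone stale, something along these lines might help. It is only a sketch in Python using the standard os module; the function names are mine, and it assumes the links broke because a directory prefix was renamed.

```python
import os


def find_dangling_links(root):
    """Yield (link_path, target) for each symlink under root whose target is missing."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)


def retarget_links(root, old_prefix, new_prefix):
    """Re-point dangling links whose target starts with old_prefix; return the count."""
    fixed = 0
    # Collect first so the tree is not modified while it is being walked.
    for link, target in list(find_dangling_links(root)):
        if target.startswith(old_prefix):
            new_target = new_prefix + target[len(old_prefix):]
            os.remove(link)              # removes the link, never the data
            os.symlink(new_target, link)
            fixed += 1
    return fixed
```

    Only the links are ever deleted and recreated, so the raw data stays untouched, which is the whole point.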

  4. sqlite by sugarmotor · · Score: 3, Insightful
    --
    http://stephan.sugarmotor.org
  5. Matlab Structures by Anonymous Coward · · Score: 4, Interesting

    I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical matlab structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

    1. Re:Matlab Structures by pz · · Score: 4, Interesting

      I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical matlab structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

      Yes, yes, yes.

      I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.

      Also, every single data file, log file, and whatever else that needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and since experiments in my world are day-based, are put into a single directory called YYMMDD. I've used this system now for nearly 20 years and not screwed up with using the wrong file, yet. Files are always named in a way that (a) doing a directory listing with alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to what experiment was done.
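
      For illustration only, here is roughly how the naming and read-only rules could look in Python. The function names, the default ".dat" extension and the description argument are placeholders, not what my collection code actually uses.

```python
import os
import stat
from datetime import datetime


def archive_name(description, when, ext="dat"):
    """Build a 'YYMMDD-HHMMSS-description.ext' file name."""
    return f"{when:%y%m%d-%H%M%S}-{description}.{ext}"


def save_original(root, description, payload, when=None):
    """Write payload under root/YYMMDD/ and mark the file read-only."""
    when = when or datetime.now()
    day_dir = os.path.join(root, f"{when:%y%m%d}")
    os.makedirs(day_dir, exist_ok=True)
    path = os.path.join(day_dir, archive_name(description, when))
    # Mode "x" refuses to overwrite: an original is never modified.
    with open(path, "x") as handle:
        handle.write(payload)
    # Turn off every write bit as soon as the file is closed.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

      Because the prefix is fixed-width and starts with the year, a plain alphabetical listing of the day's directory is also a chronological one.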

      In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.

      Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set so that when bugs are discovered we can know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions, and need to be accorded the appropriate care. If you're doing anything ad-hoc after more than one experiment, then you aren't putting enough time into devising a proper system.

      (I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)

      I also make sure that when figures are created for my papers I've got a clear and absolutely reproducible path from the raw data to the final figures that include ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing that these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
  6. Go for NoSQL! by JamesP · · Score: 3, Funny

    OK, subject is the short answer, here's the big answer

    Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:

    at the deeper, most basic level organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc)

    Then you store this in a NoSQL db (like CouchDB or Redis) and index it the way you like. Still, if you don't index you can always search it manually (slower, still...)
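
    A rough sketch of the unindexed case, using only Python's json module rather than a real CouchDB or Redis client. The record fields are invented; the point is that the two documents do not share a structure and can still be searched.

```python
import json

# Two hypothetical experiment records with different fields.
documents = [
    json.dumps({"id": "run01", "type": "titration", "sample": "A",
                "ph": [4.0, 4.5, 5.0]}),
    json.dumps({"id": "run02", "type": "spectrum", "sample": "A",
                "wavelength_nm": [280, 290]}),
]


def scan(documents, **criteria):
    """Unindexed search: decode every document and keep the ids that match."""
    hits = []
    for raw in documents:
        record = json.loads(raw)
        if all(record.get(key) == value for key, value in criteria.items()):
            hits.append(record["id"])
    return hits
```

    Calling scan(documents, type="spectrum") walks every document, which is the slow manual search; an index in the db just avoids that walk.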

    --
    how long until /. fixes commenting on Chrome?
  7. Don't bother with hierarchies by ccleve · · Score: 5, Interesting

    Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.

    Then, to find what you want, get a search engine that supports faceted navigation.

    Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
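
    To make the arithmetic concrete: four facets of ten values each means 40 terms to maintain, but 10 x 10 x 10 x 10 = 10,000 distinct combinations. Here is a small Python sketch of the idea, not of any vendor's engine; the facet names and values are a made-up controlled vocabulary.

```python
from math import prod

# Hypothetical controlled vocabulary: four facets, ten allowed values each.
FACETS = {
    "sample": [f"S{i:02d}" for i in range(10)],
    "ph": [str(4 + 0.5 * i) for i in range(10)],  # "4.0" through "8.5"
    "experiment": [f"E{i:02d}" for i in range(10)],
    "instrument": [f"I{i:02d}" for i in range(10)],
}

tags_to_maintain = sum(len(values) for values in FACETS.values())
distinct_combinations = prod(len(values) for values in FACETS.values())


def tag(catalog, path, **tags):
    """Attach facet values to a file, rejecting anything outside the vocabulary."""
    for facet, value in tags.items():
        if value not in FACETS.get(facet, ()):
            raise ValueError(f"{facet}={value!r} is not in the controlled vocabulary")
    catalog[path] = tags


def search(catalog, **wanted):
    """Return the paths whose tags match every requested facet value."""
    return sorted(
        path for path, tags in catalog.items()
        if all(tags.get(facet) == value for facet, value in wanted.items())
    )
```

    A file tagged with sample, pH and experiment shows up under any combination of those three, with no directory to move and no link to repair.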

    There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.

  8. I used to be anal about organization... by taoboy · · Score: 2, Insightful

    ...but then google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.

    For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.

    Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)

  9. Linked Data, of course by Rui+Lopes · · Score: 2, Informative

    The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data and supporting graph data stores. Give it a spin.

    --
    var sig = function() { sig(); }
  10. Try using a scientific workflow system by moglito · · Score: 3, Insightful

    You may want to consider a scientific workflow system. These systems handle both data storage (including meta-data and provenance -- where the data came from), and design and execution of computational experiments. If you are concerned about the complexity of the meta-data (e.g., pH value..) and would like to make sure to be able to sort things according to this, you want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox.

  11. four directories by arielCo · · Score: 4, Funny

    $PRJ_ROOT/data/theoretical
    $PRJ_ROOT/data/fits
    $PRJ_ROOT/data/doesnt_fit
    $PRJ_ROOT/data/doesnt_fit/fixed
    $PRJ_ROOT/data/made_up

    --
    This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
    1. Re:four directories by morgan_greywolf · · Score: 3, Funny

      Oh, come on! Who let the climatologists in here?

  12. Re:Interns. by spacefight · · Score: 3, Informative

    Yeah right, let the interns do the job. Not. Interns use new tools no one understands, then finish the project during their term, then move on and leave the most probably buggy or unfinished project behind. Pity for the person who has to clean up the mess. Better do the job on your own, know the tools or hire someone permanently for the whole department.

  13. Databases are not as convenient as files by goombah99 · · Score: 2, Interesting

    I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted 5 let alone ten years and still functions without being maintained. Files can do that.

    So if you want some bandaid approaches:

    1) if you have a mac then use aliases rather than symbolic links. Aliases don't get messed up if you move the file.

    2) use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change these are just as space efficient as a softlink but tolerate renaming. They can't span across different disks however.

    3) poor man's database:
    give your files a numerical name that changes, typically the date and time they were created. Then have a flat file that lists the files in some set for each category.

    4) low tech database. If you decide to use a database then choose one that is likely never to go out of style. For example pick something like a perl-tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
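
    The difference in option 2 is easy to see for yourself. This is a small Python sketch, assuming a POSIX filesystem; the file names are made up.

```python
import os


def link_survives_rename(workdir):
    """Rename a file and report which kind of link still resolves afterwards."""
    original = os.path.join(workdir, "run01.dat")
    with open(original, "w") as handle:
        handle.write("raw data\n")

    hard = os.path.join(workdir, "by_sample_run01.dat")
    soft = os.path.join(workdir, "by_ph_run01.dat")
    os.link(original, hard)     # a second name for the same file contents
    os.symlink(original, soft)  # stores the path, not the contents

    os.rename(original, os.path.join(workdir, "run01_renamed.dat"))
    return {"hard": os.path.exists(hard), "soft": os.path.exists(soft)}
```

    On a typical Linux or Mac filesystem the hard link still opens after the rename and the symbolic link dangles.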

    --
    Some drink at the fountain of knowledge. Others just gargle.
  14. Relational Databases won't do! by gmueckl · · Score: 3, Informative

    To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:

    1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.

    2. Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

    I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".

    --
    http://www.moonlight3d.eu/
    1. Re:Relational Databases won't do! by gmueckl · · Score: 3, Insightful

      I hereby humbly suggest that you are, instead, wrong. Here is why:

      Scientists are not software developers and never want to be that. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number crunching tools that have to be easy to use and utterly flexible.

      At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea what I was actually looking at. And I had to write them myself because these weren't your run of the mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.

      In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.

      Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.

      My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.

      Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient to use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.

      --
      http://www.moonlight3d.eu/
  15. SQLite + Scripting language by ericbg05 · · Score: 2, Informative
    Others have already mentioned SQLite. Let me briefly expound on the features that are likely the most important to you, assuming (if you'll permit me) that you don't have much experience with databases:
    0. The basic idea here is that you are replacing this whole hierarchy of files and directories by a single file that will contain all your data from an experiment. You figure out ahead of time what data the database will hold and specify that to SQLite. Then you create, update, read, and destroy records as you see fit--pretty much as many records as you want. (I personally have created billions of records in a single database, though I'm sure you could make more.) Once you have records in your database, you can with great flexibility define which result sets you want from the data. SQLite will compute the result sets for you.
    1. SQLite is easy to learn and use properly. This is as opposed to other database management systems, which require you to do lots of computery things that are probably overkill for you.
    2. Your entire data set sits in a single file. If you're not in the middle of using the file, you can back up the database by simply copying the file somewhere else.
    3. Transactions. You can wrap a large set of updates into a single "transaction". These have some nice properties that you will want:
      3.1. Atomic. A transaction either fully happens or (if e.g. there was some problem) fully does not happen.
      3.2. Consistent. If you write some consistency rules into your database, then those consistency rules are always satisfied after a transaction (whether or not the transaction was successful).
      3.3. Isolated. (Not likely to be important to you.) If you have two programs, one writing a transaction to the database file while the other reads it, then the reader will either see the WHOLE transaction or NONE of it, even if the writer and reader are operating concurrently.
      3.4. Durable. Once SQLite tells you the transaction has happened, it never "un-happens".
      These properties hold even if your computer loses power in the middle of the transaction.
    4. Excellent scripting APIs. You are a physical sciences researcher -- in my experience this means you have at least a little knowledge of basic programming. Depending on what you're doing, this might greatly help you to get what you need out of your data set. You may have a scripting language that you prefer -- if so, it likely has a nice interface to SQLite. If you don't already know a language, I personally recommend Tcl -- it's an extremely easy language to get going with, and has tremendous support directly from the SQLite developers.
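
    To give a feel for the atomic and consistent properties, here is a short sketch. It uses Python's built-in sqlite3 module rather than Tcl, purely for illustration, and the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file name would make this persistent
# The NOT NULL constraints are the "consistency rules" for this table.
conn.execute("CREATE TABLE readings (run TEXT NOT NULL, value REAL NOT NULL)")


def store_run(conn, run, values):
    """Insert all readings for a run in one transaction, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for value in values:
                conn.execute(
                    "INSERT INTO readings (run, value) VALUES (?, ?)",
                    (run, value),
                )
        return True
    except sqlite3.IntegrityError:
        return False


ok = store_run(conn, "run01", [1.0, 2.0, 3.0])
failed = store_run(conn, "run02", [4.0, None, 6.0])  # None violates NOT NULL
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

    The second call fails halfway through, and the reading it had already inserted is rolled back with it, so the table holds only the three rows from the first run.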

    Good luck and enjoy!

  16. What about a wiki? by gotfork · · Score: 3, Insightful

    In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to trouble shoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.

  17. Comment removed by account_deleted · · Score: 2, Informative

    Comment removed based on user account deletion

  18. Use tags in Apple OS X by wealthychef · · Score: 2, Insightful

    If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget my keywords that I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.

    --
    Currently hooked on AMP
  19. How CMS sorts data by toruonu · · Score: 2, Informative

    Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and if you add the simulated data to that, we have a hell of a lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root

    • /store -> the beginning marker of the logical filename region, which different sites can map differently (some use NFS, some use http, etc.)
    • /mc/ -> it's Monte Carlo data
    • /Summer10/ -> the data was produced during Summer of 2010
    • /Wenu/ -> it's a simulation of W decaying to electron and neutrino
    • /GEN-SIM-RECO/ -> the data generation steps that have been done
    • /START37_.../ -> the detector conditions that have been used (the actual full description of the conditions is in some central database)
    • /0136/ -> the serial number (actually I'm not 100% sure, but it's related to the production workflow etc.)
    • /0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename; the hash is there because the process has to make sure there are no conflicts in filenames

    Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root

    This file is real data, taken during the first run of 2010 and filtered to the MinimumBias primary dataset (related to event trigger content). The datafiles in there contain RECO content and were done during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers) and then the filename.

    You could use a similar structure to differentiate the datafiles that you actually use. The good thing is that you can map such filenames separately everywhere as long as you change the prefix according to the protocol used (we use for example file:, http:, gridftp:, srm: etc). You can also easily share data with other collaborating sites, as long as everyone uses a similar structure. No need for special databases etc. If you need some lookup functionality, then one option is a simple find (assuming you have filesystem access), or you could build a database in parallel and use the LFN structure to index things etc.
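
    A rough sketch of how such a naming scheme can be taken apart and re-mapped in a script. The field names (kind, era, dataset, tier, processing, serial) and the example prefix are labels made up for this illustration, not CMS terminology or CMS software:

```python
from typing import NamedTuple


class LFN(NamedTuple):
    """Components of a /store/... logical filename (hypothetical field names)."""
    kind: str        # "mc" or "data"
    era: str         # e.g. "Summer10" or "Run2010A"
    dataset: str     # e.g. "Wenu" or "MinimumBias"
    tier: str        # e.g. "GEN-SIM-RECO" or "RECO"
    processing: str  # conditions or reprocessing label
    serial: str      # e.g. "0136"
    filename: str    # unique file name


def parse_lfn(lfn):
    """Split a logical filename into its seven components after /store."""
    parts = lfn.strip("/").split("/")
    if len(parts) != 8 or parts[0] != "store":
        raise ValueError(f"not a recognised logical filename: {lfn!r}")
    return LFN(*parts[1:])


def to_physical(lfn, prefix):
    """Map a logical filename to a site-specific location by adding a prefix."""
    return prefix.rstrip("/") + lfn


if __name__ == "__main__":
    name = ("/store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/"
            "0018523B-D490-DF11-BF5B-00E08178C111.root")
    print(parse_lfn(name).dataset)
    # "file:/mnt/storage" is a made-up local mount point.
    print(to_physical(name, "file:/mnt/storage"))
```

    Because every component has a fixed position, a plain find plus a parser like this already gives basic lookup without any database.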

  20. Another vote for NoSQL and some experience by wolf87 · · Score: 2, Informative

    I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
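
    To make the key-value idea concrete, here is a small sketch that folds a directory tree of small files into a single store keyed by relative path. It uses Python's standard dbm module as a stand-in for BerkeleyDB (which backend dbm picks depends on the installation), and the function names are invented for this example:

```python
import dbm
import os


def consolidate(root, db_path):
    """Copy every file under root into a key-value store.

    Keys are paths relative to root, so the original directory layout is
    preserved in the key names. db_path should live outside root so the
    store does not end up inside itself.
    """
    count = 0
    with dbm.open(db_path, "c") as db:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                key = os.path.relpath(full, root).replace(os.sep, "/")
                with open(full, "rb") as fh:
                    db[key] = fh.read()
                count += 1
    return count


def fetch(db_path, key):
    """Return the raw bytes stored under key."""
    with dbm.open(db_path, "r") as db:
        return db[key]
```

    With tens of millions of tiny files, the win comes mostly from replacing per-file filesystem overhead with one indexed store; the data inside each value is left untouched.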

  21. First devise a meaningful stable primary key by RandCraw · · Score: 2, Informative

    First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.

    Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.

    If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.

    In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').

    I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
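
    One possible way to sketch the "stable key plus regenerable index" idea in code. It assumes, purely for illustration, that each dataset directory holds a small meta.json file describing it; that file name, the key format and the function names are hypothetical:

```python
import json
import os
from collections import defaultdict
from datetime import date


def dataset_key(acquired: date, machine: str, username: str) -> str:
    """Build the permanent directory name: acquisition date + machine + user."""
    return f"{acquired.isoformat()}_{machine}_{username}"


def rebuild_index(root, field):
    """Regenerate an index {field value -> [dataset keys]} from scratch.

    Reads the (hypothetical) meta.json in every dataset directory under
    root, so the index can always be thrown away and rebuilt.
    """
    index = defaultdict(list)
    for key in sorted(os.listdir(root)):
        meta_path = os.path.join(root, key, "meta.json")
        if not os.path.isfile(meta_path):
            continue  # not a dataset directory
        with open(meta_path) as fh:
            meta = json.load(fh)
        if field in meta:
            index[str(meta[field])].append(key)
    return dict(index)
```

    Keeping this script under version control alongside the lab notebook means the index itself never needs to be backed up, only the data and the code that derives the index.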

  22. You need a LIMS by pigreco314 · · Score: 2, Insightful

    A Laboratory Information Management System will help you store, organize, analyze and data-mine your data.

    --
    "linux" is a very common word and was not included in your search.
  23. Consistently by rwa2 · · Score: 3, Insightful

    Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.

    You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
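
    As an illustration of "simple, repetitive, compressed, still easy to filter", here is a sketch that keeps raw data as gzip-compressed CSV and filters it without unpacking to disk. CSV is just one possible choice of raw format, and the function names are invented for this example:

```python
import csv
import gzip


def write_raw(path, header, rows):
    """Write rows as gzip-compressed CSV with a header line."""
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


def grep_rows(path, column, value):
    """Yield rows (as dicts) whose column equals value.

    The file is decompressed on the fly; gzip's built-in CRC check raises
    an error at the end of the stream if the file is corrupted.
    """
    with gzip.open(path, "rt", newline="") as fh:
        for row in csv.DictReader(fh):
            if row[column] == value:
                yield row
```

    Post-processing scripts for matlab, octave or anything else can then all start from grep_rows (or an equivalent) over the same raw files.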

  24. SciDB, Open Source DB for Science by geoffrobinson · · Score: 2, Interesting
    --
    Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
  25. Use a document management system by rsborg · · Score: 2, Informative

    Document Management Systems are great - they combine (some of) the benefits of source control, file systems, and email (collaboration).

    I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and run it on the free vmwareplayer or a real VM solution if you have one.

    I recently set up a demo showing the benefits of such a system. In about one day I was able to download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to OpenOffice libraries), and I could go about changing directory structure, moving around and renaming files, etc., and the source control would show me the changes. In fact, I could go into the backend and write SQL queries if I wanted to, with detailed reports of how things were on date X or Y revisions ago. Was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and modified in Alfresco's database.

    Here is a bitnami VM image, will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.

    --
    Make sure everyone's vote counts: Verified Voting
  26. Re:Use databases! (maybe, maybe not) by mikehoskins · · Score: 2, Informative

    Yes, agreed, a combination is good (SQL + NoSQL + filesystem).

    There is no one-size-fits-all scenario, here.

    However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.

    MongoDB's GridFS (especially with FUSE support) marries many of these features together. MongoDB does have some SQL DB features (such as indexing/searching and transactions) but not others.

    Check out the whole stack here:
        http://www.mongodb.org/
        http://www.mongodb.org/display/DOCS/GridFS
        http://github.com/mikejs/gridfs-fuse

  27. Great comments by gringer · · Score: 2, Insightful

    Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.

    The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.

    So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:

    • Use a file system, rather than a database
    • New data gets put in a consistent place, e.g. /data/<date>/<source>/
    • Backup this data. Assuming immutable files, incremental backups don't make much sense.
    • Symlink categories to the original data
    • When a file is referenced (or attached) in an email, keep a link between email date and file, e.g. /categories/<email>/<date>/<sink>/
    • Maintain (preferably autogenerate) and backup a plain-text file linking categories to files. This will help when data gets lost (i.e. accidentally deleted).
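
    The symlink and manifest steps in the list above could be scripted roughly as follows. This is only a sketch: the tab-separated manifest layout and the function names are assumptions made for the example, and it expects a POSIX-style filesystem with symlink support:

```python
import os


def link_category(data_path, category_dir):
    """Create (or refresh) a symlink to data_path inside category_dir."""
    os.makedirs(category_dir, exist_ok=True)
    name = os.path.basename(os.path.normpath(data_path))
    link = os.path.join(category_dir, name)
    if os.path.lexists(link):
        os.remove(link)  # replace a stale link of the same name
    os.symlink(os.path.abspath(data_path), link)
    return link


def iter_links(categories_root):
    """Yield every symlink under categories_root, dangling or not."""
    for dirpath, dirnames, filenames in os.walk(categories_root):
        # Links to directories show up in dirnames, broken links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                yield path


def write_manifest(categories_root, manifest_path):
    """Write one 'link<TAB>target' line per symlink; return the link count."""
    lines = sorted(f"{p}\t{os.readlink(p)}\n" for p in iter_links(categories_root))
    with open(manifest_path, "w") as fh:
        fh.writelines(lines)
    return len(lines)


def dangling_links(categories_root):
    """Return the symlinks whose targets no longer exist."""
    return sorted(p for p in iter_links(categories_root) if not os.path.exists(p))
```

    Run from cron or after each reorganisation, write_manifest gives the plain-text backup of the category structure, and dangling_links reports what a move or rename has broken before the breakage piles up.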

    My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.

    --
    Ask me about repetitive DNA
  28. It really depends, be more specific by floydman · · Score: 2, Interesting

    I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
    Your question is really not clear. In both of the fields I work in, the requirements vary vastly, and they also vary among the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB/file; a single simulation run can give a geologist a 1 TB file.
    Others, have a few hundred MB of data. Each is handled differently.
    The data itself can be parsed and stored in a DB for analysis in some cases; in others, that is very impractical and will slow down your work.
    Each scientist has a different way of doing things.

    So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, and what kind of data are you interested in? Also, you should seriously consider an archiving solution, because I guarantee you will run out of space.

    --
    The lunatic is in my head
  29. Not all IT is the same -- you want 'Informatics' by oneiros27 · · Score: 3, Interesting

    The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).

    There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.

    Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.

    Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:

    There's also bioinformatics, health/medical informatics, chemical informatics, etc. Plug your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group or person you can write to for more info and advice.

    Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.

    --
    Build it, and they will come^Hplain.