Slashdot Mirror


Open Source Experiment Management Software?

Alea asks: "I do a lot of empirical computer science, running new algorithms on hundreds of datasets, trying many combinations of parameters, and with several versions of many pieces of software. Keeping track of these experiments is turning into a nightmare and I spend an unreasonable amount of time writing code to smooth the way. Rather than investing this effort over and over again, I have been toying with writing a framework to manage everything, but don't want to reinvent the wheel. I can find commercial solutions (often specific to a particular domain) but does anyone know of an open source effort? Failing that, does anyone have any thoughts on such a beast?"

"The features I would want would be:

  • management of all details of an experiment, including parameter sets, datasets, and the resulting data
  • ability to "execute" experiments and report their status
  • an API for obtaining parameter values and writing out results (available to multiple languages)
  • additionally (alternately?) a standard format for transferring data (XDF might be good)
  • ability to extract selected results from experimental data
  • ability to add notes
  • ability to differentiate versions of software
In my dreamworld, it would also (via plugin architecture?) provide these:
  • automatically run experiments over several parameter values
  • distribute jobs and data over a cluster
  • output to various formats (spreadsheets, Matlab, LaTeX tables, etc.)
Things I don't think it needs to do:
  • provide a fancy front-end (that can be done separately - I'm thinking mainly in terms of libraries)
  • visualize data
  • statistical analysis (although some basic stats would be handy)
The amount of output data I'm dealing with doesn't necessitate database software (some sort of structured markup is ok for me), but some people would probably like more powerful storage backends. I can see it as experiment management 'middleware'. There's no reason such software should be limited to computer science (nothing I'm contemplating is very domain specific). I can imagine many disciplines that would benefit."
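A minimal sketch of the "run experiments over several parameter values" wish, assuming the experiment can be called as a function. The names `run_experiment` and `sweep`, and the parameters `alpha` and `depth`, are hypothetical and chosen only for illustration; the real algorithm would replace the placeholder body.

```python
import itertools
import json


def run_experiment(params):
    """Hypothetical stand-in for a real algorithm run; returns a result dict."""
    return {"score": params["alpha"] * params["depth"]}


def sweep(grid):
    """Run run_experiment once per combination of values in `grid`.

    `grid` maps each parameter name to the list of values to try.
    """
    names = sorted(grid)
    records = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        records.append({"params": params, "result": run_experiment(params)})
    return records


records = sweep({"alpha": [0.1, 0.5], "depth": [2, 4, 8]})
# One JSON object per line keeps each parameter set next to its result in plain text.
lines = [json.dumps(r, sort_keys=True) for r in records]
```

Storing the parameters alongside every result means a later query never has to guess which settings produced a number.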

17 of 122 comments

  1. Object Modeling System by Anonymous Coward · · Score: 5, Informative

    Take a look at the object modeling system. It is currently being developed by Agricultural Research Service but many other agencies are cooperating.

    http://oms.ars.usda.gov/

  2. R? by Elektroschock · · Score: 4, Informative

    Have you considered R, an S-PLUS clone? It is a very flexible solution for scientific statistics. http://www.r-project.org

  3. AppLeS? by kst · · Score: 2, Informative

    Something like the AppLeS Parameter Sweep Template software might suit your needs. I've never used it myself, but it looks like it might be close to what you're looking for.

    See here for other projects from the GRAIL lab at SDSC and UCSD.

  4. ROOT by kenthorvath · · Score: 4, Informative
    http://root.cern.ch/

    We experimental high-energy physics folk have been using it (and PAW) for some time. It offers scripting and histogramming and analysis and a bunch of other features. And it's open source. Check it out.

  5. suggest jdb for managing individual experiments by john_heidemann · · Score: 4, Informative

    I've been very happy using jdb (see below) to handle individual experiments, and directories and shell scripts to handle sets of experiments.

    JDB is a package of commands for manipulating flat-ASCII databases from shell scripts. JDB is useful to process medium amounts of data (with very little data you'd do it by hand, with megabytes you might want a real database). JDB is very good at doing things like:

    • extracting measurements from experimental output
    • re-examining data to address different hypotheses
    • joining data from different experiments
    • eliminating/detecting outliers
    • computing statistics on data (mean, confidence intervals, histograms, correlations)
    • reformatting data for graphing programs

    For more details, see http://www.isi.edu/~johnh/SOFTWARE/JDB/.
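To illustrate the flat-ASCII idea (this is not JDB's own command set, just a sketch of the same style of processing in Python), the following reads a whitespace-delimited table with a header line and computes a mean and a normal-approximation 95% confidence interval per dataset. The column names `dataset` and `runtime` and the sample values are made up for the example.

```python
import io
import math
import statistics


def read_table(stream):
    """Parse whitespace-delimited text: first line is column names, rest are rows."""
    header = stream.readline().split()
    return [dict(zip(header, line.split())) for line in stream if line.strip()]


def mean_ci(values, z=1.96):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    mean = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half


# Illustrative data only; a real run would open an experiment's output file.
sample = io.StringIO("dataset runtime\na 1.0\na 3.0\nb 2.0\nb 2.0\n")
rows = read_table(sample)
by_dataset = {}
for row in rows:
    by_dataset.setdefault(row["dataset"], []).append(float(row["runtime"]))
summary = {name: mean_ci(vals) for name, vals in by_dataset.items()}
```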

  6. A project with similar goals by Anonymous Coward · · Score: 1, Informative


    http://sourceforge.net/projects/pythonlabtools/

  7. tcltest by trb · · Score: 2, Informative

    It might not satisfy all your requirements out of the box, but could you put something together with tcltest?

  8. Sounds like High Energy Physics by Anonymous Coward · · Score: 4, Informative

    What you describe does indeed sound like High Energy Physics.

    And the "middleware" you need are the GNU tools gluing together the specialized programs that do the specific things you want.

    We have been using unix for a long time, and many of us prefer the philosophy of combining small, targeted tools over a single monolithic package.

    I will repeat, and you can stop reading now if you want. The GNU tools, unix, and specialized scriptable programs are already the "middleware" you seek.

    If you are just missing some of the tools in the middle, here are the ones used in HEP. You might find more appropriate ones closer to whatever discipline you work in.

    All the basic unix text processing tools and shells.
    bash. csh. Perl. grep. sed. and so on.

    Filename schemes ranging from appropriate to clever to bizarre.
    (See other posts here)

    Make it so that all the inputs you want to change can be done on the command line or with an input steering text file.

    Same tools combined with some simple C code to produce formats for spreadsheets or PAW or ROOT or whatever visualization or post-processing thing you need done. Those packages come with ntuple and histogram support, which might be all you need.

    Almost always I choose space delimited text for simple output to push into PAW, ROOT, or spreadsheets. I keep a directory of templates to help me out here.
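A rough sketch of the steering-file approach, assuming a made-up format of one `name value` pair per line with `#` comments and `name=value` overrides on the command line. The function names and the parameters `nevents` and `seed` are hypothetical.

```python
def parse_steering(text):
    """Read `name value` pairs, ignoring blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            name, value = line.split(None, 1)
            params[name] = value
    return params


def apply_overrides(params, argv):
    """Let `name=value` command-line arguments override the steering file."""
    merged = dict(params)
    for arg in argv:
        name, value = arg.split("=", 1)
        merged[name] = value
    return merged


def format_row(values):
    """One space-delimited line, ready for a spreadsheet or plotting tool."""
    return " ".join(str(v) for v in values)


steering = "nevents 1000   # events per run\nseed 42\n"
# In a real script the override list would come from sys.argv[1:].
params = apply_overrides(parse_steering(steering), ["seed=7"])
row = format_row([params["nevents"], params["seed"], 0.125])
```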

    Some people use full blown databases to manage output. For a long time there have been databases specific to the HEP needs. I recently have started using XML-style data formats to encapsulate such things in text files if the resulting output is more complicated than a single line. You mention XDF, sure, that sounds like the same idea.

    CONDOR (U Wisconsin) has worked nicely for me for clustering and batch job submission when I need to tool through 100 data files or 100 different parameter lists on tens of computers. The standard unix "at" is good enough in a pinch if you play on only 5 computers or so.

    HEP folks use things like PAW and ROOT (find them at CERN) which contain many statistical analysis things and monstrous computation algorithms. Or at least ntuples, histograms, averages, and standard deviations. You could go commercial or use the GSL here if you prefer such things.

    CVS or similar to take care of code versions.
    Don't forget to comment your code.

    We write our own code and compile from fortran or c or c++ for most everything else.

    Output all plots to postscript or eps.

    LaTeX is scriptable.

    And use shells, grep, perl to glue it all together. Did I mention those already?
    I get a good night's sleep more often than not.

    And decide what to do next after coffee the following morning.
    This is where you put your brain, and if you have done the above well enough, this is where you spend most of your time.
    The answer I get each morning (as another post suggests) is always so surprising that I need to start from scratch anyway.

    I bet that is what you are doing already. Probably no monolithic software will be as efficient as that in a dynamic research environment.

    What did I miss from your question?

    Oh, yes. Get a ten-pack of computation notebooks with 11 3/4 x 9 1/4 inch pages (if you print things with standard US letter paper). And lots of pens. And scotch tape to tape plots into that notebook. Laser printer and photocopier. Post-it notes to remind yourself what you wanted to do next (or e-mail memos to yourself). Maybe I should have listed this first.

    Good luck.

  9. schema by Tablizer · · Score: 2, Informative

    Draft relational schema:

    Table: experiments
    ----
    exprmntID
    exprmntWhen // date-time stamp
    exprmntDescr // description
    outcome

    Table: params
    ----
    paramID // auto-num
    exprmntRef // foreign key to experiments table
    paramName
    paramValue

    Table: dataSet
    ----
    dataSetID // auto-num
    filePath
    datasetDescr
    isGenerated // "True" if from experiment
    CRC // ASCII check-sum to make sure not changed

    Table: dataSetUsed
    ----
    exprmntRef // foreign key to experiments table
    dataSetRef // foreign key to dataSet table

    Table: softwareVersion
    ----
    svID
    softwareTitle
    svVersion

    Table: softwareVersionUsed
    ----
    svRef // foreign key to softwareVersion
    exprmntRef // foreign key to experiments table

    Just use something like MySQL or MS-Access, and perhaps some kind of CRUD[1] tool to create front ends. You can expand from there based on new needs you encounter.

    [1] CRUD = typical Create, Read (list), Update, Delete screens.

    (Note: slashdot's filter scrambles certain variable names.)
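For anyone who wants to try the draft schema without installing a server, here is a sketch using SQLite through Python's standard `sqlite3` module. SQLite is swapped in here for the MySQL or MS-Access suggested above; the column types are assumptions, the two softwareVersion tables are left out for brevity, and the inserted row is illustrative data only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE experiments (
    exprmntID    INTEGER PRIMARY KEY,
    exprmntWhen  TEXT,      -- date-time stamp
    exprmntDescr TEXT,
    outcome      TEXT
);
CREATE TABLE params (
    paramID    INTEGER PRIMARY KEY,
    exprmntRef INTEGER REFERENCES experiments(exprmntID),
    paramName  TEXT,
    paramValue TEXT
);
CREATE TABLE dataSet (
    dataSetID    INTEGER PRIMARY KEY,
    filePath     TEXT,
    datasetDescr TEXT,
    isGenerated  INTEGER,   -- 1 if produced by an experiment
    CRC          TEXT
);
CREATE TABLE dataSetUsed (
    exprmntRef INTEGER REFERENCES experiments(exprmntID),
    dataSetRef INTEGER REFERENCES dataSet(dataSetID)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Record one experiment and its parameters (made-up values).
cur = conn.execute(
    "INSERT INTO experiments (exprmntWhen, exprmntDescr, outcome) VALUES (?, ?, ?)",
    ("2003-05-01 12:00", "baseline run", "ok"),
)
exp_id = cur.lastrowid
conn.executemany(
    "INSERT INTO params (exprmntRef, paramName, paramValue) VALUES (?, ?, ?)",
    [(exp_id, "alpha", "0.1"), (exp_id, "depth", "4")],
)

# Look up every parameter used by that experiment.
found = conn.execute(
    "SELECT paramName, paramValue FROM params WHERE exprmntRef = ? ORDER BY paramName",
    (exp_id,),
).fetchall()
```

Storing parameters as name/value rows rather than fixed columns means a new parameter needs no schema change, which fits experiments whose settings keep evolving.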

  10. configuration management, build scripts, etc... by foog · · Score: 2, Informative

    The features I would want would be:

    management of all details of an experiment, including parameter sets, datasets, and the resulting data


    This can be handled by an ad-hoc database, a flat file in most cases. If you were a Windows power user, you'd spend an hour or two putting together something in Access for it.

    ability to "execute" experiments and report their status

    make with a little scripting, or whatever you use as a build system.

    an API for obtaining parameter values and writing out results (available to multiple languages)
    additionally (alternately?) a standard format for transferring data (XDF might be good)
    ability to extract selected results from experimental data
    ability to add notes


    Again, an ad-hoc database would be your friend.

    ability to differentiate versions of software

    This is conventionally handled with a configuration management system like CVS, Sourcesafe, or Clearcase.

    I hate reinventing the wheel, too, and I'd love to see a good book on using standard free Unix tools like make, CVS, Postgres, perl or some other common scripting language, TeX, etc. for cleanly and efficiently automating complex computing processes and producing nice reports from them.

    PAW and ROOT look interesting though they look like overkill for many apps.

    Also, get a copy of Writing The Laboratory Notebook, some hardbound buffered laboratory notebooks, and Sakura 05 Pigma Micron archival pigment pens to keep your paper records. You'll thank me.

  11. that's what UNIX is there for by g4dget · · Score: 4, Informative
    Managing and organizing really huge amounts of data is one of the big strengths of UNIX--you just have to learn how to use it well:
    • Consider using "make" or "mk" for automating complex processing steps. "make" also lets you parallelize complex experiments (by figuring out which jobs can be run safely in parallel), and some versions of "make" are capable of dealing with compute clusters. If you need to try something with multiple parameter values, write make rules and put the parameter values in there as dependencies.
    • Organize your data into directory hierarchies; pick meaningful and self-explanatory names. Don't let directories become too big. Keep related data files and results together in the same directory, and keep different data files in different directories.
    • Keep scripts and programs along with the data, not in completely separate source trees.
    • Write scripts that summarize the data and give them obvious names; you can figure out later from that what needs looking at and what it means.
    • Use textual data files as much as possible and have your programs add information to those files as comments that document what they did.
    • If you generate an important result, keep a snapshot of the sources that generated it along with it.
    • Leave copious README files everywhere, containing notes to yourself, so that you can figure out what you did.
    • If you generate junk during some trial runs, delete it, or at least rename it to something like "results.junk", otherwise you'll trip over it later.
    • Back things up.
    • Learn the core UNIX command line tools, tools like "sort", "uniq", "awk", "cut", "paste", "find", "xargs", etc.; they are really powerful. You probably also want to learn Perl, but don't get into the habit of trying to do everything in Perl--the traditional UNIX tools are often simpler.
    • If you are using Windows, switch to UNIX. Windows may be good for starting up MS Office, but it is no good for this sort of thing. If you absolutely must use Windows for data analysis, stick your data into a relational database or Excel spreadsheets.
    • Learn to use environment variables.
    • Learn to use the Bourne/Korn/Bash shell; the C-shell is no good for this sort of thing.
    • For certain kinds of automation, expect is also very handy.
    • For visualizing data, write scripts that analyze your data and automatically generate the plots/graphs--you will run them again and again.

    Distribution of jobs, running things with multiple parameter values, etc., all can be handled smoothly from the shell. This is really the sort of thing that UNIX was designed for, and the entire UNIX environment is your "experiment management software".
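Two of the points above, make-style rebuilding and having programs document themselves in their output, can be sketched together. This is not a replacement for "make", only an illustration of its timestamp rule in Python; the function names, file names, and the placeholder result line are all hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if `target` is missing or older than any of its `sources` (make's rule)."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def build(target, sources, command):
    """Regenerate `target` only when needed and record how it was made as comments."""
    if not is_stale(target, sources):
        return False
    with open(target, "w") as out:
        out.write("# generated by: %s\n" % command)
        out.write("# inputs: %s\n" % " ".join(sources))
        out.write("result 42\n")  # placeholder for the real computation
    return True


workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "input.dat")
result = os.path.join(workdir, "result.txt")
with open(data, "w") as f:
    f.write("1 2 3\n")

first = build(result, [data], "analyze input.dat")   # target missing, so it is built
second = build(result, [data], "analyze input.dat")  # up to date, so it is skipped
```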

  12. SMIRP by sco08y · · Score: 2, Informative

    I'm one of the principal designers of a system called SMIRP.

    It started out as a very simple system that didn't act as much more than a set of tables with some simple linking structures. On top of that is an alerting system, (so you can track new experiments being done) a full text index, bots for automating certain procedures, and a system for transferring data to Excel.

    What's surprising is that for the most part, the underlying structure stayed exactly the same even though we've been running all the operations in an inorganic chemistry lab on it for, oh, four years now. I've been chewing over ways of rewriting it because, honestly, it's still the same prototype. I'd love to go with an all Perl solution... but the damned thing just works and I have other stuff to do.

    Some lessons I've learned, problems I've run into:

    A general interface. You really need a flexible structure because scientists never know what parameters they're going to use until they do the experiment. Our big success has been such a simple structure that people can throw a SMIRPSpace together in minutes.

    Browser based interface. It's great because it's ubiquitous, but it's painful because of the inflexibility of forms. One big win with it is that you can get a horde of workstudies to form a pipeline. For example, a grad student might put a request in the system for an article, a workstudy receives a notification of the change and hits the web to fill in details, another then gets notified and sends a request to the library, another gets notified and scans the result and finally the grad student sees a scanned copy of the article.

    Excel based interface. It's great because people can play with data, but it's Excel...

    XML is garbage. There's nothing you can do in XML that you can't do better with a flat file + regexes, or a SQL DBMS. XML is utterly, completely worthless.

    Proprietary products. This won't be a huge surprise to /.'ers, but we got seriously screwed when the prototype we did in Cold Fusion became production code and we realised that Allaire (and later Macromedia) would not consider redistribution for less than 10,000 units. I could try to get it running on another CF implementation (I think there's some Blue Dragon or something) but honestly, I'd rather rewrite the whole thing.

    Reporting. This is *hard* to do. We still don't have any serious system for handling reports beyond "import the data to Excel and do it manually."

  13. there are many projects developing such software! by edeljoe · · Score: 4, Informative

    Funding agencies in the USA (NSF, NIH) and Europe have recently decided to target the construction of such software, and many competing projects have been given grants, most of which involve the production of open source software.

    Relevant keywords are "eScience", "Experimental Data Management", "Experimental Metadata", and to some extent "Grid Computing".

    Here is a paper which lays out the program of research.

    I work for one such NSF & NIH funded project at Dartmouth College. We're developing such a tool: Java-based, completely open, available at sourceforge, currently in alpha, to be released for fMRI use in July, but designed from the start to be generalizable for all of experimental science. This is built on top of a pre-existing framework for semantic data management and modeling from Stanford.

    I'll try to list some of the features relevant to your needs:

    • the thing will organize all your data across all experiments and sports a nice Java API, annotations, a set of interchangeable & sophisticated query engines, and Java plugins for supporting, among other things, application specific tasks, application specific rendering widgets for data, and new backend data formats.
    • currently supported backend formats include: RDF, DAML+OIL, XML, text files, and SQL databases.
    • we should have cluster job submission support integrated in by July, but it depends on your cluster set-up. Currently this is presented to the user by way of executing "processing pipelines" for data. If this metaphor doesn't work for you, you may have to write some additional code for us!
    • since the experimental designs are represented in a prolog-style knowledge-base, it would be very simple to put some intelligence in about how to "run" or "execute" a given class of experimental designs and do a lot of automatic reasoning or planning re: dependencies. In fact, I think that someone at Stanford has already done this, but I'd have to look into it.

    Finally, I would like to stress that our project is one of many, and that if it doesn't meet your needs, within a year there will be many competing "eScience" toolkits.

    You may contact me for more information by reversing the following string: "ude.htuomtrad@exj".

  14. I Develop This Kind of Software by spirality · · Score: 3, Informative


    The Computer Aided Engineering (CAE) world has much the same problem you do.

    They model their products with several different analysis codes, each with its own input and output format. This generates a gob of data, which is currently managed in ad hoc ways, is not easy to integrate with other results, and wastes the time of lots of engineers.

    The product we've come up with to manage the models, the process for executing the models, and the data generated by running the models is a software framework called CoMeT (Computational Modeling Toolkit).

    We are also capable of managing different versions of the model, parameter studies, and some basic data mining. The whole thing is scriptable with Scheme.

    Unfortunately, we are a commercial software company, and the software is still under development, although everything I mentioned above can currently be done. We are mostly working on a front end now, although we still need to make a few improvements to the framework and add support for many analysis codes.

    The reason I'm replying to this is that your list of requirements is a perfect subset of ours. We are aiming our product at CAE in the mechanical and electrical domains (Mechatronics).

    I know, it's not free, but we feel we've done some very innovative things and it has taken several people many years of low pay to get this far. We really want to make some money off it eventually.... :)

    If you want more information check out the web-site or email me here. We're in need of proving this technology in a production environment so maybe we can work something out.

    -Craig.

  15. Might be suitable? by gowdy · · Score: 2, Informative

    http://roofit.sourceforge.net/

  16. ExpLab by The+Visiting+Priest · · Score: 2, Informative

    I'm in precisely the same situation as Alea, so I read the suggestions here with considerable interest.

    I'd like to mention ExpLab.

    Though I haven't used ExpLab yet, these folks have been associated with other very high quality work (CGAL) so I expect good things. Here are three goals they list for the project:

    • to provide a simple way to set up and run computational experiments;
    • to provide a means of automatically documenting the environment in which an experiment is run so the experiment can be easily rerun (provided the same environment is still available) and the results can be more accurately compared to the results of other computational experiments;
    • to eliminate some of the tedium involved in collecting and analyzing output by providing basic text output processing tools.
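The second goal, automatically documenting the environment of a run, can be approximated in a few lines. This is not ExpLab's interface, only a sketch of the idea using Python's standard library; the function name `capture_environment` and the `algorithm_version` field are hypothetical.

```python
import json
import platform
import sys


def capture_environment(extra=None):
    """Collect basic facts about the machine and interpreter running an experiment."""
    env = {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "argv": list(sys.argv),
    }
    env.update(extra or {})
    return env


# The caller supplies anything the standard library cannot know, such as the
# version of the algorithm under test (a made-up value here).
env = capture_environment({"algorithm_version": "hypothetical-1.2"})
header = json.dumps(env, sort_keys=True)
```

Writing `header` as the first line of each results file keeps the description of the environment attached to the numbers it produced.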