Slashdot Mirror


How Do You Organize Your Experimental Data?

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which have left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"

235 comments

  1. Here by Xamusk · · Score: 2, Funny

    I store them in first posts.

    1. Re:Here by thomasdz · · Score: 0

      I store them in first posts.

      Thanks for reminding me. 16 8483 917 8.321 9118387126 and 42

      --
      Karma: Excellent. 15 moderator points expire sometime.
    2. Re:Here by AaronLS · · Score: 1

      Sounds like he needs a database, not a file system, and then there would be no concept of "first" since rows are unordered until an order by clause is applied.

  2. Use databases! by Cyberax · · Score: 3, Insightful

    Subj.

    If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

    1. Re:Use databases! by garcia · · Score: 2, Insightful

      If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

      That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here, here, here, and here) for several reasons:

      1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with MySQL.

      2. Yes, I could use MySQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.

      3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in MySQL I would have to make the CSV, pipe it into the MySQL, convert it back out to CSV and then upload it. An additional step for nothing.

      Hey, no method is perfect for everyone and every project is a little different and while it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.

      YMMV.

    2. Re:Use databases! by Idiomatick · · Score: 2, Insightful

      I'm amused by all the /.ers suggesting the physics nerd setup a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

      The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

    3. Re:Use databases! by StormReaver · · Score: 1

      The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

      That is what Nepomuk is supposed to do: allow you to build semantic meaning into your data. The problem I have with it as it currently stands is that the Nepomuk processes suck the life out of my computers, so I disable Nepomuk entirely.

    4. Re:Use databases! by grub · · Score: 2, Funny

      If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

      --
      Trolling is a art,
    5. Re:Use databases! by rumith · · Score: 4, Interesting
      Hello, I'm a space research guy.
      I've recently made a comparison of MySQL 5.0, Oracle 10i and HDF5 file based data storage for our space data. The results are amusing (the linked page contains charts and explanations; pay attention to the conclusive chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.

      Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to be one. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.

      So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...

    6. Re:Use databases! by turbidostato · · Score: 1

      "If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files."

      Ever tried to access from Access (pun intended) big blob fields?

      The guy has a lot of data. Therefore he needs a data-base. He already has one with a hierarchical storage (the filesystem) that probably conveys the way data is generated (hierarchically, if only by date).

      Then he needs to access the data by different criteria. Are they relational? Then a relational data-base is probably what's needed. Access is still hierarchical, tree-based...?

      My point is that (quite obviously) data storage and data mining/retrieval are different beasts, so they can (and probably should) be managed in different ways/by different tools.

      One first thing to do is "freezing" the data storage deployment. He says he has problems because of dangling symlinks. That wouldn't happen if he didn't find the need to move the "original" data around. OK: don't do that and the problem will go away. Just let the "original" data go into a plain or almost plain structure: as long as there are no name collisions anything will do, from a single directory to a somewhat optimized tree structure (like a first-level directory list ordered by date or alphabetically or anything that fits, adding subdirs as needed to the point that any subdirectory only holds files in the hundreds or lower thousands).

      The only important thing to remember is that once a file gets stored it never moves from its place (at least logically: with time maybe newer data can go to faster filesystems while older/less used data can go to second tier storage, etc.).

      On top of that you add searching/datamining tools. Since your data storage is fixed by convention you can add as many and as unrelated searching tools as you see fit. It can be a forest of symlinks for different ordering criteria, it can be an RDBMS, it can be a search engine, it can be accessible only by means of command line tools, a web interface... whatever, and you can use all or part of them as the need arises.

      A practical example: Linux distributions' sources of packages like apt or yum. Packages are stored on an (almost) flat alphabetically ordered filesystem and then an in-parallel structure handles access (ordered by architecture, usage subset, etc.) by different means.
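      A minimal sketch of the "fixed storage plus disposable views" idea above, assuming a POSIX filesystem that supports symlinks. The names rebuild_view, demo, and the run/metadata values are hypothetical and only for illustration. The stored files never move; the symlink forest is regenerated from an index whenever the sorting criteria change, so a link can never be left dangling by a reorganization.

```python
import tempfile
from pathlib import Path


def rebuild_view(view_root, index, key):
    """Create symlinks under view_root/key/<value>/ for every indexed file.

    index maps the fixed path of a stored file to its metadata dict.
    The view is disposable: delete it and rebuild it whenever needed.
    """
    for stored_path, meta in index.items():
        bucket = Path(view_root) / key / meta[key]
        bucket.mkdir(parents=True, exist_ok=True)
        link = bucket / Path(stored_path).name
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier rebuild
        link.symlink_to(stored_path)


def demo():
    """Build a tiny store plus two views and return the view listing."""
    # Hypothetical runs; in practice the index would come from a database.
    runs = {
        "run0001.dat": {"sample": "A", "pH": "7.0"},
        "run0002.dat": {"sample": "A", "pH": "5.5"},
        "run0003.dat": {"sample": "B", "pH": "7.0"},
    }
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "store"  # files are written once and never moved
        store.mkdir()
        index = {}
        for name, meta in runs.items():
            path = store / name
            path.write_text("raw data placeholder\n")
            index[str(path)] = meta
        views = Path(tmp) / "views"
        for key in ("sample", "pH"):
            rebuild_view(views, index, key)
        return sorted(p.relative_to(views).as_posix() for p in views.rglob("*.dat"))
```

      Calling demo() lists the same three files once under views/sample/ and once under views/pH/, while the store itself stays untouched.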

    7. Re:Use databases! by Planesdragon · · Score: 1

      I'm amused by all the /.ers suggesting the physics nerd setup a SQL database. It ain't super easy

      one SQL database? As in, one gigantic SQL database? yeah, that's a headache.

      OTOH, one SQL database per "type" of data (i.e., "last summer's research.sql")? Hell yes. If it's not in a database, you need to spend the ten minutes to learn how to make one. MS Access exists for a reason, and this is it. It won't be a very pretty db, but it doesn't need to be -- it'll be indexed, searchable, and more or less protected.

      The only reason NOT to put it into a relational database is if you have some system that's essentially a database already, and in that case you can just leave it as-is. (Microsoft's SharePoint, if the data's small and simple enough, is an example.)

    8. Re:Use databases! by Shikaku · · Score: 1

      http://sourceforge.net/projects/vym/

      I think using a flat file and this would be helpful. Maybe. I don't know exactly what the data is, however, so this may not be very helpful.

    9. Re:Use databases! by Dynedain · · Score: 2, Insightful

      Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intent on hiring one.

      Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagine that you'd like to provide or incorporate data with some outside sources or other researchers. If you're using something "standard" like a relational DB, it will be much easier to hire a DB wizard than trying to find a programmer who can piece together a lot of mismatched files and convoluted organization schemes.

      This is what databases are designed to do. Just because you're not an expert at setting them up, and there's a performance hit to setting them up wrong, doesn't mean that they aren't still the right tool.

      --
      I'm out of my mind right now, but feel free to leave a message.....
    10. Re:Use databases! by oldhack · · Score: 1

      "While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well."

      While I get what you are getting at, it's the same shit - you muck with it (hacking perl/python/dbms) because there is no prepackaged stuff for your needs.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    11. Re:Use databases! by BlitzTech · · Score: 2, Insightful
      Apache: click the install button (use default options, or switch to non-service mode which it very clearly explains means it only runs when you run it instead of whenever you start your computer)
      MySQL: click the install button (use default options, they're all fine)
      phpMyAdmin: put in document root, configure ("click the install button")

      And you're set. How was that hard...?

      Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learning to use a database is probably a good idea. Many scientists are taught MATLAB, and setting up a WAMP stack is much, much easier than learning MATLAB.

      I'm amused by all the /.ers suggesting the physics nerd setup a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

      Scientists are pretty smart. He should learn to use a database.

    12. Re:Use databases! by clive_p · · Score: 2, Interesting

      As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV format (comma-separated-value). We ended up using Postgres as it had the best spatial (2-d) indexing, beating MySQL at the time. The expensive commercial DBMS like Oracle didn't have anything that the open-source ones did for modest-sized scientific datasets. I found Postgres was fine for our tables, which were no bigger than around 10 million rows long and 300 columns wide. You might well get better performance using something like HDF but you'll probably spend a lot more time programming to do that, and it won't be as flexible. The only thing you can be sure of in scientific data handling is that the requirements will change often, so flexibility is important. If your scientific data are smallish in volume and pretty consistent in format from one run to the next, you might consider storing the data in the database, in a BLOB (binary large object) if no other data type seems to suit. But a fairly good alternative is just to store the metadata in the database, e.g. filename, date of observation, size, shape, parameters, etc. and leave the scientific data in the files. You can then use the database to select the files you need according to the parameters of the observation or experiment.
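      A rough sketch of the CSV round trip described above. It uses Python's built-in sqlite3 as a stand-in for Postgres, purely so the example runs without a server; the table name, column names, and sample values are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from an instrument or reduction script.
CSV_TEXT = """obs_id,ra_deg,dec_deg,flux
1,10.5,-3.2,0.81
2,10.7,-3.1,1.95
3,200.1,45.0,0.40
"""


def load_csv(conn, csv_file):
    """Load CSV rows into an 'observations' table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "obs_id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, flux REAL)"
    )
    rows = [
        (int(r["obs_id"]), float(r["ra_deg"]), float(r["dec_deg"]), float(r["flux"]))
        for r in csv.DictReader(csv_file)
    ]
    with conn:  # commit all rows together, or none on error
        conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def export_csv(conn, query, params=()):
    """Run a SELECT and return the result as CSV text with a header row."""
    cur = conn.execute(query, params)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return out.getvalue()
```

      The database does the selecting and sorting; CSV stays the interchange format on both ends, so spreadsheets and other tools can still read the results.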

    13. Re:Use databases! by rockmuelle · · Score: 3, Interesting

      I've built LIMS systems that manage petabytes of data for companies and small scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: Databases + files + programming languages. Put your meta-data and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.

      Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing meta-data. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.

      The main points:

      1) Use databases for what they're good at: managing meta-data and relations.
      2) Use programming languages for what they're good at: analysis.
      3) Use file systems for what they're good at: storing raw data.

      -Chris
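      The three points above could be sketched roughly like this with SQLite. The schema, the function names, and the example paths are assumptions for illustration, not the poster's actual system: the database holds only metadata and a pointer to each raw file, and the analysis code asks it which files to open.

```python
import sqlite3

# Hypothetical schema: one row of metadata per raw data file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS datasets (
    id         INTEGER PRIMARY KEY,
    path       TEXT NOT NULL UNIQUE,  -- pointer to the raw file on disk
    instrument TEXT NOT NULL,
    project    TEXT NOT NULL,
    acquired   TEXT NOT NULL,         -- ISO 8601 date
    sample     TEXT,
    ph         REAL
);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the metadata index."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def register(conn, path, instrument, project, acquired, sample=None, ph=None):
    """Record where a raw file lives and under which conditions it was taken."""
    with conn:
        conn.execute(
            "INSERT INTO datasets (path, instrument, project, acquired, sample, ph) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (path, instrument, project, acquired, sample, ph),
        )


def find_paths(conn, sample, ph_min, ph_max):
    """Return the paths of all files for a sample within a pH range, oldest first."""
    rows = conn.execute(
        "SELECT path FROM datasets WHERE sample = ? AND ph BETWEEN ? AND ? "
        "ORDER BY acquired",
        (sample, ph_min, ph_max),
    )
    return [row[0] for row in rows]
```

      The list returned by find_paths() is then handed to whatever analysis language you prefer; the raw files themselves are never touched by the database.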

    14. Re:Use databases! by RobertLTux · · Score: 1

      Nice idea

      Nice service you have advertised in your sig
      Even Nicer referral link

      --
      Any person using FTFY or editing my postings agrees to a US$50.00 charge
    15. Re:Use databases! by PrecambrianRabbit · · Score: 1

      Depending on the size and stability of the GP's research budget, that may not be practical. I worked on a fairly large academic research team (by EE/CS standards) that had the budget to hire a few full-time staff members for certain things. After the main implementation push the project wound down a bit, and those staff moved on to other jobs, leaving the grad students to maintain the infrastructure. That was fine as it was, but could have been massively not-fine if the staff had used complex tools that required specialized knowledge that the students didn't have, and would have to divert their energies from research to tool-learning.

      Basically, if you're hiring a DBA, make sure that you can keep them on staff indefinitely.

    16. Re:Use databases! by BrokenHalo · · Score: 2, Informative

      If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

      I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from wife) spent sorting them out when a subscript went out of range.

    17. Re:Use databases! by bkaul01 · · Score: 1

      Funny indeed. Having worked at both the university and National Lab levels, the IT departments I've encountered have more of the function of pushing so many restrictive security policies and pieces of corporate spyware to everyone's systems than of actually enabling researchers to be more productive...

      Really, this is exactly what WinFS would've been perfect for, if MS had ever gotten it working and released it. As it is, I use hierarchical directory structure - top level is the research project, then date/experiment from there. Definitely subject to the kind of problems the original poster is encountering, though.

    18. Re:Use databases! by bitingduck · · Score: 1

      Any physics nerd should be able to set up a MySQL database pretty easily-- it's not quite as easy as falling out of a tree, but it's not anywhere near as difficult as a lot of other things in physics. A great deal of data acquisition and analysis for many (if not most) physical scientists involves a bunch of custom programming, and many of the theoretical sorts do a lot of computer modeling. MySQL is pretty easy to install on just about anything, and if you have a reasonable idea of what your data will look like it's pretty easy to decide how to set up the tables you need. The first iteration may not be more than a few simple tables and some straightforward queries, but it's way easier to maintain than a tangled nest of symlinks.

      (I'm a physical scientist who plays around with SQL for cheap entertainment)

    19. Re:Use databases! by mobby_6kl · · Score: 3, Interesting

      Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.

      1. I don't know where you get the data from, but anything awk/sed can do, so can Perl. And from Perl (or PHP, if you're lame) it's very easy to load the data into a database
      2. It's easy to connect to an SQL server from the remote machine and either dump everything or just select what you need. You'll need more than notepad.exe to do this, but it's not rocket science. Pivots in Excel can be really useful, but Excel can easily connect to the same database and query the data directly from there and use it for your charts/tables.
      3. Since by this point you'll already have all the data in the db, exporting it to CSV would be trivial. Or you could even skip Google Docs entirely and generate your charts with tools which can automatically query your database.

      I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solutions, or maybe stone tablets are.

    20. Re:Use databases! by Anonymous Coward · · Score: 0

      Translation:

      I'm a DB wizard, hire me!

    21. Re:Use databases! by didroe84 · · Score: 1
      I would say put the data in the database too. It doesn't have to be relational, you can use BLOBs. It has the following benefits:
      • You can use transactions on both the metadata and the data
      • Backing up the database backs up everything in one go
      • You can easily set up standby systems that mirror the whole database consistently
      • It makes writing applications easier, anything that can talk to the database can get at the data. No messing around with mounts or file servers
      • You can use a single security model. Most databases support a decent access rights system
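      As a small illustration of the first bullet, here is a sketch using SQLite, chosen only because it ships with Python; the table layout and function names are invented for the example. Metadata and the binary payload are written in one transaction, so a failed write cannot leave metadata pointing at data that was never stored.

```python
import sqlite3


def open_db(db_path=":memory:"):
    """Create two hypothetical tables: run metadata and the raw payload."""
    conn = sqlite3.connect(db_path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS runs (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            ph   REAL
        );
        CREATE TABLE IF NOT EXISTS run_data (
            run_id  INTEGER NOT NULL REFERENCES runs(id),
            payload BLOB NOT NULL
        );
        """
    )
    return conn


def store_run(conn, name, ph, payload):
    """Insert metadata and data together; both are rolled back on any error."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute("INSERT INTO runs (name, ph) VALUES (?, ?)", (name, ph))
        run_id = cur.lastrowid
        conn.execute(
            "INSERT INTO run_data (run_id, payload) VALUES (?, ?)", (run_id, payload)
        )
    return run_id


def load_payload(conn, run_id):
    """Fetch the raw bytes stored for one run."""
    row = conn.execute(
        "SELECT payload FROM run_data WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0]
```

      Whether this scales to large raw files depends heavily on the database and the data sizes involved, so treat it as a starting point rather than a recommendation.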
    22. Re:Use databases! by Anonymous Coward · · Score: 0

      Access is still hierarchical, tree-based...?

      Then put it in LDAP. Duh. :-)

    23. Re:Use databases! by bunratty · · Score: 3, Insightful

      I've helped my wife, who is a research scientist, by writing scripts to process her data. The IT departments at the companies where she's worked have no idea about what her work is or how she does it, and perhaps have even less interest in helping her. Their function is to keep the infrastructure (networks, file servers, email servers, etc.) working and install software packages onto the computers. They aren't of any use in helping individual users with their work. They will install SAS and S-Plus but will not help by writing SAS and S-Plus code, for example. You might be in the same situation as I am from your comment about your wife.

      --
      What a fool believes, he sees, no wise man has the power to reason away.
    24. Re:Use databases! by tibit · · Score: 1

      Umm, what? Any decent scientist these days should be able to program. Setting up an SQL database takes next to no time, as long as you use the right tools to do it. pgAdmin comes to mind, or oo.ORG database frontend. Now the problem may be that many scientists and engineers still don't know how to program, but that's another story.

      --
      A successful API design takes a mixture of software design and pedagogy.
    25. Re:Use databases! by tibit · · Score: 1

      That's where documentation comes in handy. They did document the system enough so that a less-than-specialist could still keep it running, right?

      --
      A successful API design takes a mixture of software design and pedagogy.
    26. Re:Use databases! by Anonymous Coward · · Score: 0

      hey, i'm smarter than you, know pH, and wouldn't mind your job. how about telling us where you work, putting in a 2 week notice, and letting a slashdotter take your job!

    27. Re:Use databases! by PrecambrianRabbit · · Score: 1

      In my case, yes, the system was exceedingly well documented, and also made use of standard tools (Makefiles, perl and bash scripts, etc.).

      But I don't think documentation is a panacea if the tool used is particularly rarified. Perhaps the DBA in question (this is purely hypothetical now) set something up using Oracle, and then left. Now, maybe it's easy enough to use as the interface for SQL queries and the like, but what happens if there are major reorganizations that really do require specialized knowledge? Can you document all possible contingencies? Without simply giving enough learning materials for the user to become a DBA? (I have no idea, honestly, since I don't have any experience with high-end DBs, I can't say anything about how hard it is to maintain one, so I'm more making a general point rather than a specific one.)

    28. Re:Use databases! by Anonymous Coward · · Score: 0

      You are all very used to working with data in flat files ... That's what I'm used to also. So, if you're treating a relational database like a flat file, and executing your scripts with a bunch of SQL calls, you're not going to gain much except a central repository. However, if you get a little advice from a guy who's good at databases, you might find that there's an entirely new paradigm for programming and processing your data, and that you can get way, way more use out of the data and that once you get past the paradigm, you can run all sorts of analysis that you hadn't thought of before.

      No point in a database if you treat it as a flat file.

    29. Re:Use databases! by drfireman · · Score: 1

      Can I ask a dumb question as a completely database-naive person? I'm an occasional scientist, and I have a fair bit of experimental data lying around, stored as files. I've had it suggested to me before that I clearly need to get the data into a database (and that our center in fact should force people to store their raw data this way). At the same time, my daily workflow revolves around files. I use software that opens and deals with files. Some of the software expects to find the files organized in directories in particular ways. So my question is simply this: how do I put my data into a database but have it still work with the many software packages that have been written to do this kind of work? I can't rewrite the software. Do database systems let you expose the contents as though they were in directories, while somehow maintaining the organizing facility of the database? I know this is a dumb question, but having never used databases, I don't know what I would tell, for example, a medical image viewer when it opens up a file dialog and asks me which brain image I want to look at.

      I'm also an occasional software developer. Maybe that's claiming too much, but I do write software for doing this kind of research. My software opens, reads, and writes files. I'm not anxious to rewrite it to access databases, and of course I can't know in advance what kind of database system end users will have set up (the software is not just used locally). So at some level, I need to know the answer to this question from the other end as well. However, the users of our software also use dozens of other packages that will never be rewritten, I suspect there's no point to modifying my code to deal with data stored in database systems.

      I should mention that when I've thought about this before, it's occurred to me that possibly it's only being suggested that the raw data be organized in databases, and that processed data can be stored as files. But this is obviously irrelevant, since it's both (but especially the processed and intermediate data) that need to be organized in labs with a lot of data.

      In any case, please help us database-naive scientists understand. Thanks.

    30. Re:Use databases! by Kvasio · · Score: 1

      perhaps IT staff there sucks in anything more complicated than rebooting or running minesweeper and solitaire.

    31. Re:Use databases! by Anonymous Coward · · Score: 0

      I don't know where you get the data from, but anything awk/sed can do, so can Perl.

      Sure, but in processing text data, perl is often much slower than awk. A neat feature of combining commands in a pipeline is they are inherently parallel. So with higher core count processors becoming common, those old scripts just get faster and faster. Your typical perl equiv does not.

      And the answer "Use databases" ignores the problem of having bunches and bunches of databases. And if you say put it all in one database, then you just shift the dataset management problem to "how do I manage all of these datasets that are in my giant nightmare mess of a slow database?!"

      Since the OP's question is one of organization, maybe he should hire an assistant ;)

    32. Re:Use databases! by Anonymous Coward · · Score: 0

      "A neat feature of combining commands in a pipeline is they are inherently parallel. "

      LOLWUT? Pipes are executed in order, and unless you split your data and pass it to several processes, there is nothing "inherently parallel" in them.

      And that's much easier done in perl than in the shell.

    33. Re:Use databases! by konohitowa · · Score: 1

      LOLWUT? Pipes are executed in order...

      In DOS maybe.

      cat /dev/random | od -bc | less
      ^Z
      $ ps
          PID TTY TIME CMD
      57677 ttys000 0:00.02 -bash
      57958 ttys000 0:00.01 cat /dev/random
      57959 ttys000 0:00.01 od -bc
      57960 ttys000 0:00.00 less

    34. Re:Use databases! by eggnoglatte · · Score: 1

      It is funny because anybody who is familiar with levels and quality of IT support at universities knows how ludicrous the idea is.

    35. Re:Use databases! by kramulous · · Score: 1

      85 Thousand rows and 25 columns of floats is not what I consider a large dataset. Also, due to the scaling of your y-axis on those graphs, it is not immediately apparent that even the PyTable doesn't scale. What happens when you scale up by a couple of orders of magnitude? I'm not trying to be flamey, but how does it work when you have 85x10^6 or 85x10^9 rows?

      One dataset I have is a timeseries of thousands of square kilometres of points where there is a density of around 400 points per square metre. Each point has 12 dimensions. Honestly, a flat file structure with well named directories and filenames, along with metadata stored in 2 files (txt and XML - I know, I don't like XML either, but it is not a big file - couple of KBs each) in each directory works best for us. Files are currently compressed with lzo (parallel decompression, buffered and sequential - seems to work ok for now).

      We can search the repo quickly for a desired ROI using any programming language we want. There are no problems such as version changes.

      --
      .
    36. Re:Use databases! by Anonymous Coward · · Score: 0

      OR just use the database to store the metadata (what it's for, when created, whatever you want to know, etc.) about each file so you can query it and find it.
      Create data directories (date0,1,2,3) to hold 10,000 files each, number your files sequentially, and store the sequential number in the database - you don't need to store the directory since you know each directory will hold 10,000 files; thus, you can compute which one it should be in. A bit complicated, but if you really do have thousands AND they are of value THEN it's worth the extra effort. Plus now you can let other people find your files with a simple SQL query.
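      The directory arithmetic is just integer division. A small sketch, where the root name, zero-padded directory names, and the .dat suffix are assumptions made up for the example:

```python
from pathlib import PurePosixPath

FILES_PER_DIR = 10_000  # bucket size from the scheme described above


def path_for(seq, root="data", suffix=".dat"):
    """Compute where file number `seq` lives; nothing but `seq` is stored."""
    if seq < 0:
        raise ValueError("sequence numbers start at 0")
    bucket = seq // FILES_PER_DIR  # files 0..9999 -> 0, 10000..19999 -> 1, ...
    return PurePosixPath(root) / f"{bucket:04d}" / f"{seq:08d}{suffix}"
```

      Because the path is derived from the number, a file can never be "moved" into the wrong directory, and the database only needs the sequential number as its key.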

    37. Re:Use databases! by tibit · · Score: 1

      Documentation is not a panacea, but is perhaps a good starting point. When you need to do changes, you're on your own of course, but to keep it running I'd say documentation should be enough.

      --
      A successful API design takes a mixture of software design and pedagogy.
    38. Re:Use databases! by TheObruniSpeaks · · Score: 1

      50-90% (depending on project) of what physics grad students do is programming. It's definitely reasonable for a physics nerd to set up an SQL database. Whether that's the right way to spend their time and effort organizing the data is another question.

    39. Re:Use databases! by Anonymous Coward · · Score: 0

      Check out ZFS. You don't need a database if you just intend to use BLOBs.

    40. Re:Use databases! by GaryOlson · · Score: 1

      The other side of this discussion:
      The IT staff would probably be very interested to help. But research scientists are notorious for unwillingness to describe the work in any but the most esoteric terminology. Any attempts to clarify or redefine the information provided by the research scientist lead to denigrating attitudes and/or tirades on how the IT staff cannot change the whole concept of the work and are idiots. If this first hurdle is overcome, then comes the obstacle of informing the research scientist that we are not his/her full-time lab assistant and cannot devote 100% of our time to their project. Any attempt to limit our involvement after initial buy-in is met with various immature and pedantic attitudes.

      Good research scientists budget money for development support and use it effectively. The majority of research scientists would rather suffer in silence/isolation and hack at the work themselves than attempt to work with other professionals with differing mindsets.

      Been there, have the monogrammed beaker, drank afterwards [multiples] to celebrate with the good scientists; drank alone to wash the experience away with the not so agreeable scientists.

      --
      Every mans' island needs an ocean; choose your ocean carefully.
    41. Re:Use databases! by Marillion · · Score: 1

      Use a full blown system. Our lab is looking to implement this system BioSig. We'll be running lots of microscopy with lots of variables and we need to share the data with lots of collaborators.

      --
      This is a boring sig
    42. Re:Use databases! by Anonymous Coward · · Score: 1, Interesting

      Pipes are executed in order, and unless you split your data and pass it to several processes, there is nothing "inherently parallel" in them.

      The processes in a pipeline run in parallel, unless you are in dos/windows (as konohitowa pointed out).

      You often have a process that is reading the file, and a series of successive processes that are filtering and processing the data. Each one can run on a different core, in parallel.

      You can do the same with perl, but most perl isn't written that way. It's really cool that scripts I wrote fifteen years ago are scaling incredibly well with additional cores (and that will only get better). I can't say the same for most all perl, python, C programs, etc. Perhaps it would be more fair to say the difference isn't so much the language as it is the monolithic thinking and implementation.

      Also, many datasets are often in columnar format. So the data is already split, and the columns are easily divided across multiple processes.

      And parallelism aside, awk is often significantly faster at processing text than perl. Especially if you use awka to convert the script to C and compile.

      That stuff matters when you're crunching transaction logs at the largest e-commerce sites, etc.

    43. Re:Use databases! by jandersen · · Score: 2, Interesting

      ...using relational databases for pretty common scientific tasks sucks badly performance-wise.

      Well, it has never been a secret that relational databases do not perform as well as e.g. a bespoke ISAM or hash-indexed data-file, mostly due to the fact that it involves interpretation of SQL. But then the main purpose of SQL databases has never been to optimise raw performance - rather the idea is to provide maximum logical flexibility. The beauty of relational databases is that you can change both data and metadata at the wave of a hand, where in the high-performing, bespoke database you have to go and rewrite significant amounts of code.

      At the end of the day you choose your tools to fit your needs, or at least that is what you ought to do.

      As for the OP's question: the main problem seems to be one of having to rename and move stuff; this is clearly the area where SQL is strongest.

    44. Re:Use databases! by fonske · · Score: 1

      My wife, who is a research scientist, was specifically hired for writing some sort of scripts to pipeline pilot to organize the data of her company.
      The IT department also has no clue what she is up to.
      They even managed to hire an "IT specialist" behind her back (she was on birth leave or how do you call that in the US?).
      After three months of the new guy on the job her boss called her to ask how much time she spent on some specific job.
      They threw the guy out after that phone call - "One day." "Don't be so modest, you can honestly tell us how long you spend on that job." "OK, so in fact not even a whole day..."

      All the data I create with our TOF-SIMS is pinned on the wall - a mass spectrum with some information printed on it.
      My boss thinks it is a great way of archiving.

    45. Re:Use databases! by Zemplar · · Score: 1

      If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

      That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here, here, here, and here) for several reasons:

      1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with PostgreSQL.

      2. Yes, I could use PostgreSQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.

      3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in PostgreSQL I would have to make the CSV, pipe it into the PostgreSQL, convert it back out to CSV and then upload it. An additional step for nothing.

      Hey, no method is perfect for everyone and every project is a little different and while it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.

      YMMV.

      There, fixed that for 'ya.

    46. Re:Use databases! by Bill+Barth · · Score: 1

      You might find this interesting.

      --
      Yes...I am a rocket scientist.
    47. Re:Use databases! by Gibbs-Duhem · · Score: 1

      As someone who just finished their PhD in materials science, and is now doing a PostDoc, I found myself basically doing the same thing you are doing. I found a few flaws with my original directory structure, but the one I've adopted for my PostDoc is rather designed to avoid those issues.

      As someone else mentioned (so I won't bother), while I'm fine at programming, and have built SQL databases, as someone whose job is *not* programming, it seems like a horrible waste of time to set up such complex tools when there aren't even any frontends that would display all of the data in a single application. For instance, my XRD data is in .xml files that have to be translated to a graphic to be interesting, my TEM images are graphics, my GC data is in pdfs and openoffice spreadsheets, and so on. Similarly, although I was originally a physicist and am somewhat of a magician at Matlab, I found that it took me less time to make publication quality graphics in *OpenOffice* than in either Matlab, Mathematica, or Microsoft Office. How messed up is that, right? I found that with every tool I used (and I did try briefly some other more exotic/expensive ones recommended by colleagues), I had to do so much postprocessing to get sane default settings for the graphics; i.e. size 36 fonts, line widths of at least 4 points, colors that colorblind people can distinguish as the first two options (wtf gnuplot), a graphic that takes up the *entire* bounding box so I can make the sides, tops and bottoms of graphics align, and the ability to easily combine graphics using OO draw into larger graphics to save space by sharing legends and such, not to mention the ability to export as perfect pdf files for including in my latex source... but I digress.

      Here is the layout I have used very successfully to organize similarly many thousands of experiments worth of data, graphics, papers, presentations, and my own writing. It's a little odd in that I organize in one direction for some types of documents and the exact opposite for others, but I find it to be extremely natural as this order is the way I naturally think about the data.

      I. Project Folder (each project gets its own folder - I find I'm usually doing 3-5)
        1. Data Folder
          i. Data Type Folder (TEM image, AFM image, XRD scan, FTIR, GC, etc)
            a. Experiment Number Folder (I use at least three digits for numbers so they show up in order, i.e. BN001-BN999 followed by a short description for my memory -- but really that's what my lab notebook is for anyway.)
            b. Composite Folder (for graphics made from multiple experiments, assuming there are only a handful of these in comparison).
        2. Proposals/Presentations Folder (my initial proposals for funding)
        3. Graphics Folder (Publication quality graphs I will likely want to reuse many times, which incorporate data from many sources)
      II. Paper Collection Folder
        1. Project Folder containing related papers (if you have related topics, don't bother).
          i. Each paper with the filename written as "author name - title.pdf" so that you can find papers in your bibliography later on easily as well as use standard filesystem search tools to find papers that have titles with a keyword when you know you have it, but can't remember the author or full title.
      III. Publications Folder
        1. Project folder containing source materials for each publication you're writing.
          i. Main Figures Folder (I just copy and have duplicates of the graphics from the project graphics folders, but I suppose you could also make a symbolic link in linux, or whatever the equivalent is in windows).
          ii. Supplementary Figures Folder (ditto. This also helps me find specific graphics that are in papers in case someone asks for a high resolution version).
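
      The per-project part of this layout could be created by a small script. What follows is only a sketch under assumptions: the folder names ("Data", "Composite", "Proposals-Presentations", "Graphics"), the data-type list, and both function names are illustrative choices, not something the layout above prescribes.

```python
from pathlib import Path

# Hypothetical labels: these paraphrase the outline, they are not a fixed convention.
DATA_TYPES = ["TEM", "AFM", "XRD", "FTIR", "GC"]


def experiment_dir_name(prefix: str, number: int, description: str) -> str:
    """Zero-pad the experiment number to three digits so listings sort in order."""
    return f"{prefix}{number:03d}-{description}"


def create_project_skeleton(root: Path, project: str) -> Path:
    """Create the per-project folders; folders that already exist are left alone."""
    project_dir = root / project
    for data_type in DATA_TYPES:
        # Each data type gets a Composite folder for multi-experiment graphics.
        (project_dir / "Data" / data_type / "Composite").mkdir(parents=True, exist_ok=True)
    (project_dir / "Proposals-Presentations").mkdir(parents=True, exist_ok=True)
    (project_dir / "Graphics").mkdir(parents=True, exist_ok=True)
    return project_dir
```

      Experiment folders would then be added under a data-type folder as the work progresses, e.g. with a name from experiment_dir_name("BN", 7, "anneal").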

    48. Re:Use databases! by Gibbs-Duhem · · Score: 1

      Because in his proposed solution you require one IT person per five staff scientists. =) Extra!

      I worked at MIT, which has amazing tech support from what I experienced, and even there I was still *setting up* the databases myself, even if they did set up SQL for me. If I hadn't known how to use PHP (sigh) to make websites, I still wouldn't have had much of a frontend... all in all, quite a bit of work for me, and I didn't bother in the end.

      Other universities do not have *nearly* the level of IT support that I saw there. Want a SQL server? Better buy a computer to run it on then, and hire a postdoc that knows linux.

    49. Re:Use databases! by jrminter · · Score: 1

      Setting up database software is easy; designing a proper database for a given research project takes much more thought and often quite a bit of experimentation evaluating prototypes. That said, it is a good exercise for a research scientist because it helps to see the interrelationships/dependencies between the variables in the data set. Beginners often design a system that is far too complicated and needs far more "care and feeding" than the project warrants. Been there, done that, don't want to do it again. As others have mentioned, I like to keep things simple. While one can store large binary objects in a database, I prefer to store the key descriptors and conditions (with electron microscope images it is key instrument, specimen, and project information) in the database along with a directory path (or a way to create it) to the large image data sets. The real question the research scientist needs to tackle is the use cases they have for the database. What queries do they need to make and what should those queries return? Again, in my experience, the database project can take on a life of its own and consume many more resources than the value it delivers. The most successful cases I have seen involved close collaboration between patient database designers/administrators and principal investigators.
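
      A minimal sketch of the "descriptors plus path in the database, images on disk" idea, using SQLite from Python's standard library. The table name, column names, and function names are assumptions made for illustration; a real schema would follow from the use cases discussed above.

```python
import sqlite3

# Hypothetical schema: key descriptors plus the path to the large file on disk.
SCHEMA = """
CREATE TABLE IF NOT EXISTS image (
    id         INTEGER PRIMARY KEY,
    instrument TEXT NOT NULL,
    specimen   TEXT NOT NULL,
    project    TEXT NOT NULL,
    path       TEXT NOT NULL UNIQUE
)
"""

SEARCHABLE = {"instrument", "specimen", "project"}


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog; the default is a throwaway in-memory database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def register_image(conn, instrument, specimen, project, path):
    """Record where an image lives; the image itself stays in the file system."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO image (instrument, specimen, project, path) VALUES (?, ?, ?, ?)",
            (instrument, specimen, project, path),
        )


def find_paths(conn, **criteria):
    """Return the file paths whose descriptors match every given criterion."""
    unknown = set(criteria) - SEARCHABLE
    if unknown:
        raise ValueError(f"unknown descriptor(s): {sorted(unknown)}")
    sql = "SELECT path FROM image"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in criteria)
    sql += " ORDER BY path"
    return [row[0] for row in conn.execute(sql, tuple(criteria.values()))]
```

      Moving a data set then means one UPDATE of its path row rather than re-pointing links.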

    50. Re:Use databases! by rumith · · Score: 1

      Actually, these 85k rows represent only a day's worth of data from a single satellite instrument. We often operate on data sets tens of millions of rows large, and the general performance trend is nearly the same.
      About scaling: as far as I understand, PyTables' main bottleneck is raw disk I/O. I think that applying a couple of primitive tricks like using memcached and a large amount of RAM will tip the scales in its favor even further, but that needs to be verified.

    51. Re:Use databases! by rumith · · Score: 1

      While theoretically it's definitely possible, I'm not sure if storing BLOBs in a database is a good solution; at least none of our colleagues from other universities and agencies like ESA or NOAA that I personally know (or at least have some knowledge of how their systems are built) use this approach. All of them do exactly as GP said: store metadata and e.g. file paths in the database, and store tables, BLOBs etc in the file system. Since some of these projects have a huge amount of resources, I think that this approach may have more to it than it may seem.
      And as far as backups and mirroring go, there are projects that use popular protocols like BitTorrent for mirroring and load balancing, and there are some that have built non-trivial custom data distribution systems (like the SDO guys).

    52. Re:Use databases! by GPSguy · · Score: 1

      Mod up +3. You're right on-target.

      --
      Never ascribe to malice that which can adequately be explained by tenure.
    53. Re:Use databases! by spokedoke · · Score: 1

      I agree, your three step program to data analysis is exactly what I use. I have become a believer in running lots of small scripts on data that is consistently distributed, as opposed to the lumping and sifting approach. I am suspect at being organized enough to eat lunch, let alone keep track of data.


  3. Totally disorganized by countertrolling · · Score: 1, Funny

    Whenever I need to find anything, I use "Command-F"

    --
    For justice, we must go to Don Corleone
    1. Re:Totally disorganized by Anonymous Coward · · Score: 0

      Quiet now, adults are speaking.

      He's a researcher in physical sciences. The computer he uses won't have a command button.

    2. Re:Totally disorganized by ElektronSpinRezonans · · Score: 1

      I wish that was true. # of command buttons are steadily increasing in all sciences. I genuinely fear for the future of science...

    3. Re:Totally disorganized by Idiomatick · · Score: 1

      Seriously? We have a Mac lab @ my uni which people use to play flash games on and browse email because they have giant ass glossy panel screens. Linux machines get the old shitty hardware. Really biased :S. Pretty sure the linux lab cost 1/8th that of the mac lab. But even so nobody does work on them...

    4. Re:Totally disorganized by cwebster · · Score: 1

      Why fear? I personally use a macbook and have it set up to run Matlab, Maple, Grads, gnuplot, octave, R, NCAR graphics/ncl, CERN's root stuff, Vis5D+, full utilities to manipulate netCDF 3/4 and HDF 4/5 files, GEMPAK, ldm. I haven't tried running WRF on this machine yet (still confined to my linux machines) because my laptop would probably melt if I tried.

      Get fink and ports installed and start installing useful programs and mac is just as good as a linux box for science.

    5. Re:Totally disorganized by marcosdumay · · Score: 1

      It's a widely known fact that work machines are cheap, and toys are the really expensive lot. That isn't an exclusivity of yours.

  4. Use a revision control system or a database by Anonymous Coward · · Score: 0

    Data isn't just data - it has, as you've learned, a history. Learn about how RCS works and use one to store your data in from now on.

    Or, you could just store it in a proper SQL database, and be able to query it any way you like, without having to create all these link farms giving you different views on the underlying data.

  5. Separate data from presentation by mangu · · Score: 4, Informative

    In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

    Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.

    I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
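
    If the links are the part that gets reorganized, it helps to be able to list the ones that no longer point anywhere. A minimal sketch, assuming a POSIX-style file system; the function name is made up for illustration:

```python
import os


def find_dangling_symlinks(root):
    """Return the symbolic links under root whose targets no longer exist."""
    dangling = []
    # os.walk does not follow symlinks by default, so link loops are not an issue.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

    The raw data is never touched by this; it only reports links that need to be re-pointed or removed.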

    1. Re:Separate data from presentation by DamonHD · · Score: 1

      Absolutely agreed.

      And indeed my biggest 'data' collection of over a decade old is all directly exposed to the Web, and other than taking care of significant errors in the original naming, I've followed the "Cool URIs Don't Change" mantra and had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.

      It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.

      Rgds

      Damon

      --
      http://m.earth.org.uk/
    2. Re:Separate data from presentation by drolli · · Score: 1

      Yes, and store it in a human readable form.

    3. Re:Separate data from presentation by mangu · · Score: 1

      store it in a human readable form.

      Absolutely. Nothing is worse than finding that an SQL table which supposedly contained the raw data to some experiment is actually a collection of BLOBs from which only the long departed creator once knew the details of the structure.

      Text files, text files all the way and all hail the Unix philosophy!

  6. Organize it with style by Anonymous Coward · · Score: 1, Funny

    Organize your data like I organize my bedroom: Everything on the floor.

    Look, how big is your desk? 8 square feet? How big is your floor? Several hundred square feet? If you can see all of your stuff, then you can access it instantly. Organized Chaos.

    Now, if you'll excuse me... I think something's moving around in my trash can.

  7. sqlite by sugarmotor · · Score: 3, Insightful
    --
    http://stephan.sugarmotor.org
  8. Databases by eexaa · · Score: 1

    SQL comes really handy. I can imagine several simple scripts + SQLite indexing table. Or anything else.

    1. Re:Databases by obstacleman · · Score: 1

      I agree. As the amount of data and metadata increases, a good way to organize it all is via a database. Then access can be done through queries on the metadata and all relevant locations returned. In some sense, it will no longer matter where the data is stored on disk as long as the database knows the location (and moving it can be done easily but requires the database be updated too). One of the simplest forms for the directory structure is along the lines of date ordering, e.g. year-dir, month-dir, day-dir, dataset-dir. One of the advantages of a database is it can allow you to replicate the data, say for instance on tape copies, and store the location on tape in the database too. In High Energy Physics there are petabytes of data stored this way.

  9. Re:Databases. by nicolas.kassis · · Score: 1

    I was going to say the same thing. You can also check to see if there is any software in your domain that might help you insert it into a database. If not, you can keep the data as flat files but have records in the database and have the path to them in there. A little bit of programming, but not much, will get you a list of file paths that you can then just use a bash script to retrieve.

  10. extended attributes? by otis+wildflower · · Score: 1

    I wonder if there's an opensource project to create and manage extended attributes on supporting filesystems?

    http://www.freedesktop.org/wiki/CommonExtendedAttributes

    But you're likely to get better results from having filenames be a field in a DB, and let all the metadata live in other DB fields..

    ps: here's a CPAN entry that manipulates extended attributes: http://search.cpan.org/dist/File-ExtAttr/lib/File/ExtAttr.pm

  11. Matlab Structures by Anonymous Coward · · Score: 4, Interesting

    I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

    1. Re:Matlab Structures by pz · · Score: 4, Interesting

      I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

      Yes, yes, yes.

      I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.

      Also, every single data file, log file, and whatever else that needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and, since experiments in my world are day-based, is put into a single directory called YYMMDD. I've used this system now for nearly 20 years and not screwed up with using the wrong file, yet. Files are always named in a way that (a) doing a directory listing with alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to what experiment was done.
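
      The naming and read-only rules can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used in the lab described here; the function names are made up, and clearing the write bits this way applies to Unix-style permissions.

```python
import os
import stat
from datetime import datetime


def day_directory(when: datetime) -> str:
    """Directory name for a day-based experiment: YYMMDD."""
    return f"{when:%y%m%d}"


def archive_name(name: str, when: datetime) -> str:
    """Prefix a file name with YYMMDD-HHMMSS- so an alpha sort is also a time sort."""
    return f"{when:%y%m%d-%H%M%S}-{name}"


def make_read_only(path: str) -> None:
    """Clear the write bits for user, group, and others once the file is closed."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

      A corrected version of a file would then be written under a new, specially tagged name, leaving the read-only original in place.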

      In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.

      Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set so that when bugs are discovered we can know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions, and need to be accorded the appropriate care. If you're doing anything ad-hoc after more than one experiment, then you aren't putting enough time into devising a proper system.

      (I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)

      I also make sure that when figures are created for my papers I've got a clear and absolutely reproducible path from the raw data to the final figures that include ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing that these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    2. Re:Matlab Structures by witch-doktor · · Score: 1

      Dear Prof. Pz. I don't know if this is a good time to tell you this, but we have been secretly migrating to python behind your back. Sincerely, Your OSI-nut students.

    3. Re:Matlab Structures by Anonymous Coward · · Score: 0

      Sounds like you store a Matlab analysis code base in Subversion: How do you set things up in Matlab / SVN so that the SVN metadata (i.e. the .SVN sub-directories) does not interfere?

      I have a Matlab script that removes the .SVN subdirs from the Matlab path. That works, but it's still a bit iffy.

      Suggestions?

    4. Re:Matlab Structures by GPSguy · · Score: 1

      In the end, it really is all about organization, and a few basic rules.
      1. Store your data first. Before you do anything to modify it, transmogrify it, or distort it.
      2. Assimilate like datasets. This obviously doesn't apply to the parent's datasets, but it does to mine.
      3. Normalize for rapid retrieval (CSV, ISAM, flat file, or RDBMS as appropriate; flash cards are fine with me, if that's what you like). Note that you've not done anything to the original data.
      4. Process/analyze. SQL queries, process chains, whatever it takes. That's what the normalization stage allows you to do readily.

      What you've described is an incredibly straightforward approach to scientific data management, and worth wading through the other comments to find.

      --
      Never ascribe to malice that which can adequately be explained by tenure.
  12. Go for NoSQL! by JamesP · · Score: 3, Funny

    OK, subject is the short answer, here's the big answer

    Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:

    at the deeper, most basic level organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc)

    Then you store this in a NoSQL db (like CouchDb or Redis) and index it the way you like, still if you don't index you can always search it manually (slower, still...)

    --
    how long until /. fixes commenting on Chrome?
  13. Don't bother with hierarchies by ccleve · · Score: 5, Interesting

    Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.

    Then, to find what you want, get a search engine that supports faceted navigation.

    Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
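
    To make the arithmetic concrete: four facets with ten values each distinguish 10 x 10 x 10 x 10 = 10,000 combinations. The sketch below shows facet-style filtering over a plain dictionary; the facet names, the sample catalog layout, and the function names are assumptions for illustration, not any particular search engine's API.

```python
from math import prod  # available in Python 3.8 and later


def distinguishable_combinations(facet_sizes):
    """Number of distinct tag combinations for independent facets."""
    return prod(facet_sizes)


def matches(tags, query):
    """True if a dataset's tags agree with every facet value in the query."""
    return all(tags.get(facet) == value for facet, value in query.items())


def select(catalog, **query):
    """Names of all datasets in the catalog that match the query, sorted."""
    return sorted(name for name, tags in catalog.items() if matches(tags, query))
```

    With a catalog such as {"run1": {"sample": "A", "ph": 7.0, "type": "NMR"}}, select(catalog, sample="A", type="NMR") narrows by two facets at once, and no dataset has to live in exactly one "shelf".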

    There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.

    1. Re:Don't bother with hierarchies by Anonymous Coward · · Score: 0

      This is horribly off-topic, but a lot of the problems faced by librarians are similar, if not the same, as the problems faced by warehouse designers.

    2. Re:Don't bother with hierarchies by Yvanhoe · · Score: 1

      What ccleve said.

      I would also add that if your datasets are like mine (enormous), it may be a good idea to md5 them. Have a log file where you enter information about each file : its creation date, its md5, nature of data, source, etc...

      Do not use name or path hierarchies to keep track of metadata, it is doomed to fail. If you feel this is worth the effort you can set up a database for this info but in my opinion if you have hundreds to thousands files, a simple flat file can be good enough.
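
      A minimal sketch of such a log, assuming a tab-separated flat file. The column order and function names are illustrative, and the file's modification time stands in for the creation date, since a true creation time is not available on every platform.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so enormous datasets never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_file(log_path, data_path, nature, source):
    """Append one row (path, md5, timestamp, nature of data, source) to the log."""
    # Assumption: modification time is used as a stand-in for the creation date.
    stamp = datetime.fromtimestamp(os.path.getmtime(data_path), tz=timezone.utc)
    row = [data_path, md5_of_file(data_path), stamp.isoformat(), nature, source]
    with open(log_path, "a", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerow(row)
    return row
```

      Re-hashing a file later and comparing against the logged md5 shows whether the data has been altered since it was recorded.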

      --
      The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
    3. Re:Don't bother with hierarchies by aperdaens · · Score: 1

      You could also use a sharing tool focused on faceted navigation like Knowledge Plaza. This way you have an interface where you can both store and categorize your results and be able to browse and search those results.

  14. Use hard links by swm · · Score: 1

    Instead of symlinking to directories,
    create directories of hard links to the files.

    Then you can move files around whenever you like,
    and you never have any dangling links.

    1. Re:Use hard links by vrmlguy · · Score: 1

      Instead of symlinking to directories,
      create directories of hard links to the files.

      Then you can move files around whenever you like,
      and you never have any dangling links.

      I second this. I have a big collection of photos that I've downloaded over the years, and I "tag" them via hard-links into directories. The same photo may be found under "party/jane/nyc", "party/nyc/jane", "nyc/party/jane", "nyc/jane/party", "jane/party/nyc" and "jane/nyc/party". If two people are in the photo, that's twenty-four links, but I have Perl scripts that take care of the grunt work; a picture with N tags will have "just" N! links. I don't link photos to intermediate directories, but all pictures from parties in New York can be found via either "find party/nyc -type f" or "find nyc/party -type f"; removing the dups is left as an exercise for the student.

      BTW, this works with Windows as well as Unix. NTFS supports hard-links and while there isn't a native command to create them, Perl will do so.
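
      The scripts mentioned above are in Perl and are not shown; the following is only a Python sketch of the same idea, with a made-up function name. It assumes the photo and the tag tree are on the same file system, since hard links cannot cross file systems.

```python
import os
from itertools import permutations


def tag_with_hard_links(photo, tags, root):
    """Hard-link photo into one directory per ordering of tags (N! links for N tags)."""
    links = []
    name = os.path.basename(photo)
    for order in permutations(tags):
        directory = os.path.join(root, *order)
        os.makedirs(directory, exist_ok=True)
        link = os.path.join(directory, name)
        if not os.path.exists(link):
            os.link(photo, link)  # a second name for the same file, not a copy
        links.append(link)
    return links
```

      Because every link is just another name for the same file, moving or renaming the original leaves all the tagged names valid.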

      --
      Nothing for 6-digit uids?
  15. I used to be anal about organization... by taoboy · · Score: 2, Insightful

    ...but then google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.

    For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.

    Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)

    1. Re:I used to be anal about organization... by Idiomatick · · Score: 1

      Fine if you want to find something. How about doing analysis on 30 things at once?

    2. Re:I used to be anal about organization... by taoboy · · Score: 1

      YMMV, but my experience is that datasets aren't really usable for a particular endeavor without some munging; stripping unneeded columns, common-format date/times, etc. Also, I tend not to work on the seminal copy; too easy to alter it and screw up separate md5 sums, or worse. So, if I have to perform analysis, I tend to round up copies of all the relevant datasets in one directory. Finding them is still the key...

  16. Interns. by Gordonjcp · · Score: 1

    Life is too short. Get someone else to do it, under the disguise of valuable field experience.

    1. Re:Interns. by spacefight · · Score: 3, Informative

      Yeah right, let the interns do the job. Not. Interns use new tools no one understands, then finish the project during their term, then move on and let the most probably buggy or unfinished project behind. Pitty for the person who has to cleanup the mess. Better do the job on your own, know the tools or hire someone permanently for the whole deptartment.

    2. Re:Interns. by grcumb · · Score: 1

      Yeah right, let the interns do the job. Not. Interns use new tools no one understands, then finish the project during their term, then move on and let the most probably buggy or unfinished project behind. Pitty for the person who has to cleanup the mess. Better do the job on your own, know the tools or hire someone permanently for the whole deptartment.

      Just guessing here, but is your copy editor an intern?

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
  17. Linked Data, of course by Rui+Lopes · · Score: 2, Informative

    The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data and supporting graph data stores. Give it a spin.

    --
    var sig = function() { sig(); }
    1. Re:Linked Data, of course by pvanheus · · Score: 1

      Yes, this! I work in bioinformatics, and while relational DBMSes are used by a large number of projects, the problem you face with experimental data is that its organisation is non-obvious and relationships between different bits of data are not apparent. This means that RDBMSes aren't a natural fit. One of my colleagues has been experimenting with graph oriented databases (specifically neo4j (http://neo4j.org/)), an approach which has interesting intersections with declarative programming. In the near future I think much of use will come from being a scientific "data geek", utilising the kind of skills described by Deepak Singh (blog: http://mndoci.com/).

  18. How can you not? by Rivalz · · Score: 1

    I never understood how you can have something organized or not.
    I organize my stuff at the planetary level.
    Universe > Solar System > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
    I think I'm pretty well organized even though I misplace stuff all the time.

    1. Re:How can you not? by TimSSG · · Score: 1

      You forgot the Galaxy. Tim S.

    2. Re:How can you not? by Rivalz · · Score: 1

      I skip the little things... Plus I do not recognize U.S. soccer teams as a tool for organization. I had to remove Galaxy from my organizational charts when they formed the L.A. Galaxy. That little naming convention set me back 10 years' worth of organization. I became confused about which galaxy I lived in and had to seek extensive therapy. I feel a relapse coming on.

      But honestly thanks for the clarification, I actually forgot a little thing like galaxies.

  19. Try using a scientific workflow system by moglito · · Score: 3, Insightful

    You may want to consider a scientific workflow system. These systems handle both data storage (including metadata and provenance, i.e. where the data came from) and the design and execution of computational experiments. If you are concerned about the complexity of the metadata (e.g., pH value) and would like to be able to sort things by it, you may want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox.

  20. Be like Google... by Vornzog · · Score: 1

    "Search, don't sort".

    The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.

    Now, you can start to search through the data by thinking about which sets of data to compare. Much easier.
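
    A minimal sketch of the "metadata in the database, raw files in a date-based archive" idea, using Python's built-in sqlite3 module. The table name, column names, sample names, and archive paths are all made up for illustration; adapt them to whatever qualifiers you actually record.

```python
import sqlite3


def build_catalog(conn):
    """Create a one-table catalog: metadata columns plus the path to the raw file."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dataset (
               id INTEGER PRIMARY KEY,
               sample TEXT,
               ph REAL,
               experiment_type TEXT,
               recorded_on TEXT,   -- ISO date, e.g. 2009-02-17
               path TEXT UNIQUE    -- where the raw data file lives on disk
           )"""
    )


def add_dataset(conn, sample, ph, experiment_type, recorded_on, path):
    """Register one raw data file together with its metadata."""
    conn.execute(
        "INSERT INTO dataset (sample, ph, experiment_type, recorded_on, path) "
        "VALUES (?, ?, ?, ?, ?)",
        (sample, ph, experiment_type, recorded_on, path),
    )


def find_paths(conn, experiment_type, ph_min, ph_max):
    """Return the file paths for one experiment type within a pH range."""
    rows = conn.execute(
        "SELECT path FROM dataset "
        "WHERE experiment_type = ? AND ph BETWEEN ? AND ? "
        "ORDER BY recorded_on",
        (experiment_type, ph_min, ph_max),
    )
    return [row[0] for row in rows]


# Hypothetical entries; an in-memory database keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
build_catalog(conn)
add_dataset(conn, "lysozyme", 4.5, "titration", "2009-02-17", "archive/2009/02/17/run01.dat")
add_dataset(conn, "lysozyme", 7.0, "titration", "2009-02-18", "archive/2009/02/18/run01.dat")
add_dataset(conn, "bsa", 7.2, "fluorescence", "2009-03-01", "archive/2009/03/01/run01.dat")

print(find_paths(conn, "titration", 6.5, 7.5))
```

    The files themselves never move; only the query changes when you want a different slice of the data.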

    This is very general advice - if you have one experimenter and a couple of experiments, just use a lab notebook. If you have a handful of experimenters and ~100 experiments, try a spreadsheet or well organized structure on disk. If you have many people involved, or thousands of experiments, or both, you need something to help manage all of that in a way that lets you think in terms of sets rather than individual data files. Otherwise, you'll find yourself wearing your 'data steward' hat way too often, and not wearing your 'experimentalist' or 'analyst' hats much at all.

    --

    -V-

    Who can decide a priori? Nobody.
    -Sartre

  21. four directories by arielCo · · Score: 4, Funny

    $PRJ_ROOT/data/theoretical
    $PRJ_ROOT/data/fits
    $PRJ_ROOT/data/doesnt_fit
    $PRJ_ROOT/data/doesnt_fit/fixed
    $PRJ_ROOT/data/made_up

    --
    This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
    1. Re:four directories by Anonymous Coward · · Score: 0

      you say it's four because there's no "fits", right?

    2. Re:four directories by Anonymous Coward · · Score: 0

      $ ls -lrt $PRJ_ROOT/data
      total 5
      -rw-r--r-- 1 geek phd 4096 1998-04-15 15:16 theoretical
      -rw-r--r-- 1 geek phd 4096 2000-10-01 17:20 doesnt_fit
      -rw-r--r-- 1 geek phd 4096 2006-06-29 22:17 doesnt_fit/fixed
      lrwxrwxrwx 1 geek phd 11 2007-03-12 23:03 fits -> theoretical
      -rw-r--r-- 1 geek phd 65536 2009-02-17 23:33 made_up

    3. Re:four directories by morgan_greywolf · · Score: 3, Funny

      Oh, come on! Who let the climatologists in here?

    4. Re:four directories by jochem_m · · Score: 1

      with $PRJ_ROOT/data/made_up being the biggest one? ;)

    5. Re:four directories by arielCo · · Score: 1

      (I posted AC above by mistake):

      $ ls -lrt $PRJ_ROOT/data
      total 5
      -rw-r--r-- 1 geek phd 4096 1998-04-15 15:16 theoretical
      -rw-r--r-- 1 geek phd 4096 2000-10-01 17:20 doesnt_fit
      -rw-r--r-- 1 geek phd 4096 2006-06-29 22:17 doesnt_fit/fixed
      lrwxrwxrwx 1 geek phd 11 2007-03-12 23:03 fits -> theoretical
      -rw-r--r-- 1 geek phd 65536 2009-02-17 23:33 made_up

      --
      This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
  22. Careful... by SanityInAnarchy · · Score: 1

    It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.
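
    The difference can be shown with a short sketch, assuming a POSIX filesystem where hard links and symlinks are available. The file names are hypothetical, and atomic_rewrite is only an illustration of the write-then-rename pattern, not any particular program's implementation.

```python
import os
import tempfile


def atomic_rewrite(path, text):
    """Write text to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # the name now points at a brand-new inode


def read(path):
    with open(path) as handle:
        return handle.read()


workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "run01.dat")
hard = os.path.join(workdir, "hard.dat")
soft = os.path.join(workdir, "soft.dat")

with open(original, "w") as handle:
    handle.write("old")
os.link(original, hard)        # hard link: a second name for the same inode
os.symlink("run01.dat", soft)  # symlink: resolved by name on every access

atomic_rewrite(original, "new")

print(read(hard))  # old  (still attached to the replaced inode)
print(read(soft))  # new  (follows the name to the new file)
```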

    I have to agree with the database suggestions, though something NoSQL-ish may work better.

    --
    Don't thank God, thank a doctor!
  23. Databases are not as convenient as files by goombah99 · · Score: 2, Interesting

    I agree that this is a candidate for a database. One problem with databases for researchers is that one generally does not know the right schema beforehand and has to deal with a lot of ad hoc contingencies. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

    So if you want some bandaid approaches:

    1) If you have a Mac, use aliases rather than symbolic links. Aliases don't get messed up if you move the file.

    2) Use hard links rather than symbolic links. The problem here is that these can get unlinked if you plan to modify the file. But if the file will never change, these are just as space-efficient as a soft link and tolerate renaming. They can't span different disks, however.

    3) Poor man's database:
    Give each file its own numerical name, typically the date and time it was created. Then keep a flat file for each category that lists the files in that set.

    4) Low-tech database. If you decide to use a database, choose one that is unlikely ever to go out of style. For example, pick something like a Perl tie. Those are so close to the language that they probably won't get deprecated in the next 10 years.
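
    Option 3 can be sketched in a few lines of Python. This is only one way to read the idea: the ".dat" and ".list" suffixes, the category names, and the store/members helpers are assumptions made for the example.

```python
import os
import tempfile
import time


def store(data_dir, payload, categories, stamp=None):
    """Save payload under a timestamp name and append that name to each category list."""
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    name = stamp + ".dat"
    with open(os.path.join(data_dir, name), "w") as handle:
        handle.write(payload)
    for category in categories:
        # One plain text file per category, one data file name per line.
        with open(os.path.join(data_dir, category + ".list"), "a") as index:
            index.write(name + "\n")
    return name


def members(data_dir, category):
    """Return the file names listed for one category (empty if no list exists)."""
    try:
        with open(os.path.join(data_dir, category + ".list")) as index:
            return [line.strip() for line in index if line.strip()]
    except FileNotFoundError:
        return []


data_dir = tempfile.mkdtemp()
store(data_dir, "1.0 2.0\n", ["pH7", "titration"], stamp="20090217-233301")
store(data_dir, "3.0 4.0\n", ["pH4", "titration"], stamp="20090218-101500")

print(members(data_dir, "titration"))
print(members(data_dir, "pH4"))
```

    Because the data files are never renamed or moved, the category lists cannot go stale the way symlinks do.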

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re:Databases are not as convenient as files by atamido · · Score: 1

      I agree that this is a candidate for a database. One problem with databases for researchers is that one generally does not know the right schema beforehand and has to deal with a lot of ad hoc contingencies. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

      The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.
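
      As a rough illustration of "add a column" when a new qualifier turns up, here is a sketch with Python's sqlite3 module. The table, the column names, and the file paths are invented for the example; rows that predate the new column simply hold NULL there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Start with only the qualifiers known on day one.
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, ph REAL)")
conn.executemany(
    "INSERT INTO files (path, ph) VALUES (?, ?)",
    [("2009-02/run01.dat", 7.0), ("2009-02/run02.dat", 4.5)],
)

# Later, a new qualifier matters: add a column, fill it only where known.
conn.execute("ALTER TABLE files ADD COLUMN temperature_c REAL")
conn.execute(
    "UPDATE files SET temperature_c = ? WHERE path = ?",
    (25.0, "2009-02/run02.dat"),
)


def paths_at_temperature(conn, temperature_c):
    """Return paths of files recorded at the given temperature."""
    rows = conn.execute(
        "SELECT path FROM files WHERE temperature_c = ? ORDER BY path",
        (temperature_c,),
    )
    return [row[0] for row in rows]


def temperature_of(conn, path):
    """Return the stored temperature for a path, or None if it was never filled in."""
    row = conn.execute(
        "SELECT temperature_c FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row[0]


print(paths_at_temperature(conn, 25.0))
print(temperature_of(conn, "2009-02/run01.dat"))
```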

    2. Re:Databases are not as convenient as files by bitingduck · · Score: 1

      The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.

      And right now he's using a complicated mess of symlinks that amounts to a db schema that's probably a huge pain to maintain. Pick one straightforward way to organize files (e.g. date, with directories by month or something) and use the db for sifting through them to pick files by pH, lunar phase, and hair color (or whatever).

    3. Re:Databases are not as convenient as files by GPSguy · · Score: 1

      My personal preference is to store data in NetCDF (or another self-describing common data format) and also encode it into a database. I can do manipulations better in the database, but NetCDF has a staying power derived from its NASA and UCAR origins.

      --
      Never ascribe to malice that which can adequately be explained by tenure.
  24. MindMaps is a perfect solution for you, I think by Anonymous Coward · · Score: 0

    My research area is programming with MindMaps (MindMaps as source code); I'm developing a programming language based on them.

    I chose MindMaps because I can see the detail and the global view in the same GUI, so I recommend FreeMind.

    MindMap software could map your filesystem structure, even the symbolic link structure.

    Good luck

  25. The Obvious Solution by fast+turtle · · Score: 1

    is to use CVS (comma/tab separated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.

    --
    Mod me up/Mod me down: I wont frown as I've no crown
    1. Re:The Obvious Solution by Improv · · Score: 1

      I think you mean CSV

      --
      For every problem, there is at least one solution that is simple, neat, and wrong.
    2. Re:The Obvious Solution by raymansean · · Score: 1

      No, doing it the way he suggested you would want to use a CVS (drugstore in the USA).

      --
      insert inflammatory comment here!
  26. Relational Databases won't do! by gmueckl · · Score: 3, Informative

    To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:

    1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.

    2. Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

    I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".

    --
    http://www.moonlight3d.eu/
    1. Re:Relational Databases won't do! by bjourne · · Score: 1

      Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files!

      Well, I believe UTF-8 encoded text files beat them hands down. Especially for scientific research. //snarky

    2. Re:Relational Databases won't do! by Anonymous Coward · · Score: 1, Insightful

      You can just pipe the output from the SQL client to a text file (or export the results to a CSV file if you use a Query Browser).
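
      The same can be done from a script. A small sketch, assuming an SQLite database and Python's sqlite3 and csv modules: the measurement table and its columns are made up, and export_query is a hypothetical helper. It writes tab-separated columns, which whitespace-oriented plotting tools such as gnuplot can read as plain data.

```python
import csv
import io
import sqlite3


def export_query(conn, sql, params, out):
    """Write the rows of a query to `out` as tab-separated text; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in conn.execute(sql, params):
        writer.writerow(row)
        count += 1
    return count


# Hypothetical table with two numeric columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (sample TEXT, ph REAL, absorbance REAL)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [("lysozyme", 7.0, 0.31), ("lysozyme", 4.5, 0.12), ("bsa", 7.2, 0.50)],
)

buffer = io.StringIO()  # swap in open("lysozyme.dat", "w", newline="") to write a real file
written = export_query(
    conn,
    "SELECT ph, absorbance FROM measurement WHERE sample = ? ORDER BY ph",
    ("lysozyme",),
    buffer,
)
print(written)
print(buffer.getvalue(), end="")
```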

    3. Re:Relational Databases won't do! by Anonymous Coward · · Score: 0

      2. Storing and fetching data through database interfaces is vastly more difficult than just using the standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

      I kinda think that the guy posted this question because flatfiles/directories/symlnks are not working for him.

      Having done scientific computing for 20 years or so, I would recommend a relational database from what little information that the guy has given. His data seems pretty simple and relatively small, I'm guessing around 1GB, certainly much less than 1TB. Also, since the data is so small, the guy can leave the raw data as is and write scripts to insert and retrieve the data. The DB can be wiped, and reloaded with raw data during the transition.

      I believe the original poster's use and issues with symlinks is a clear indication that a relational DB is required.

    4. Re:Relational Databases won't do! by jc42 · · Score: 1

      I can tell you: nothing beats plain old ASCII text files!

      Well, I believe UTF-8 encoded text files beat them hands down. Especially for scientific research. //snarky

      Hey, no need to be snarky about it!

      I've run into a number of tasks where we "standardized" on ASCII plain-text files, and then ran across problems like people and place names in French, Russian, Greek, and/or Chinese. It was a real relief when we officially replaced the "ASCII" with "UTF-8" throughout the docs. The only problem then was finding versions of all the system commands that didn't garble the non-English names. That's a problem that you just have to fight one app at a time.

      But we're slowly supplanting the "English only" attitude of the American vendors, primarily by not dealing with them when we stumble across this problem. It gets harder with the European vendors with their "8859-1 (Latin1) only" attitude. But we're finding ways to replace them with more cooperative vendors. All this is probably good news for the Asian vendors. We do find that if we buy software that supports Chinese and Japanese, then it also supports all European and American languages, too.

      (But I just know that someday soon, we'll have to support Mayan writing. To my knowledge, Mayan still doesn't have its own Unicode block. But I could be wrong. And yes, I actually do have some Mayan writing on my own personal web site. ;-)

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    5. Re:Relational Databases won't do! by ericbg05 · · Score: 1
      Informative?! This post is so, so terribly wrong.

      To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data

      Wrong. We're not on the wrong track. Databases don't only handle "homogeneous data" sets. You just don't know how to use them flexibly.

      if you can be bothered to write software for all the I/O around them

      Wrong. Databases abstract away I/O primitives and file formats, making creating/accessing your data much simpler than using (e.g.) flat text files.

      nothing beats plain old ASCII text files!

      Wrong. A great many things beat flat text files, under a great many use cases. The capabilities of (e.g.) a sqlite database are a strict (and much larger) superset of those of flat text files, while usually being *less* burdensome to their users.

      how would you load the contents of your database table into gnuplot

      You can always dump your db contents to a flat ascii format if you need to (like to send the data to gnuplot).

      Just because *you* don't know how to properly use a db doesn't mean you should shoot down the very idea in such broad strokes.

    6. Re:Relational Databases won't do! by syraq · · Score: 1

      There is no problem to use a relational database for this. It all depends how you design it. Have a look at the EAV model (http://en.wikipedia.org/wiki/Entity-attribute-value_model). It is specifically designed to handle this scenario. The downside is that the queries tend to be complex.
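
      A bare-bones sketch of the EAV layout in SQLite, via Python's sqlite3 module. The table names, the attribute names, and the add_entity/find helpers are assumptions for illustration; values are stored as text to keep the example short. Note how each extra search criterion costs one more self-join, which is the query complexity mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE attribute_value (
        entity_id INTEGER REFERENCES entity(id),
        attribute TEXT,
        value TEXT
    );
    """
)


def add_entity(conn, path, **attributes):
    """Store one dataset and any number of attribute=value pairs for it."""
    cursor = conn.execute("INSERT INTO entity (path) VALUES (?)", (path,))
    entity_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO attribute_value VALUES (?, ?, ?)",
        [(entity_id, key, str(value)) for key, value in attributes.items()],
    )
    return entity_id


def find(conn, **wanted):
    """Return paths of entities matching every attribute=value pair given."""
    sql = "SELECT e.path FROM entity e"
    params = []
    for i, (key, value) in enumerate(wanted.items()):
        # One join per criterion; the alias a0, a1, ... is generated, never user input.
        sql += (
            f" JOIN attribute_value a{i} ON a{i}.entity_id = e.id"
            f" AND a{i}.attribute = ? AND a{i}.value = ?"
        )
        params += [key, str(value)]
    sql += " ORDER BY e.path"
    return [row[0] for row in conn.execute(sql, params)]


# Different experiments can carry different attributes without a schema change.
add_entity(conn, "run01.dat", sample="lysozyme", pH=7.0)
add_entity(conn, "run02.dat", sample="lysozyme", pH=4.5, laser_nm=532)

print(find(conn, sample="lysozyme"))
print(find(conn, sample="lysozyme", laser_nm=532))
```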

      --
      You know, I always wanted to be a dancer, but I could never get the shit off my shoes
    7. Re:Relational Databases won't do! by gmueckl · · Score: 3, Insightful

      I hereby humbly suggest that you are, instead, wrong. Here is why:

      Scientists are not software developers and never want to be that. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number crunching tools that have to be easy to use and utterly flexible.

      At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea what I was actually looking at. And I had to write them myself because these weren't your run of the mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.

      In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.

      Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.

      My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.

      Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient to use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.

      --
      http://www.moonlight3d.eu/
    8. Re:Relational Databases won't do! by gmueckl · · Score: 1

      If you really have to handle all those languages correctly, then you are of course right. But when your files only store digits, 7 bit ASCII is indistinguishable from UTF-8 ;-).

      As a matter of fact, I am German and I still try to avoid writing Umlauts and stuff in text files for the reasons that you cite. Ah, the cruelties of 8 bit code page incompatibilities...

      --
      http://www.moonlight3d.eu/
    9. Re:Relational Databases won't do! by Dynedain · · Score: 1

      A huge part of my job is to bridge between people who are experts in their domain, and know very little about technology other than "I need the technology to get my job done".

      If your scientists are in that group, then it's time for them to HIRE SOMEONE who knows how to work with complex data in an abstract manner. Same reason why you buy Windows or download Linux instead of rolling your own kernels. You're smart people, do you do your own janitorial work as well? Learn to recognize when it is the time to get someone who is smart in their own field, so you can spend more time on what you are good at, namely the research and the analysis.

      This is the reason why investment bankers hire programmers and mathematicians instead of simply doing it themselves.

      --
      I'm out of my mind right now, but feel free to leave a message.....
    10. Re:Relational Databases won't do! by toddestan · · Score: 1

      Actually, Origin now supports reading from SQL (and Access) databases:
      http://www.originlab.com/index.aspx?go=Products/Origin/ImportingData/DatabaseAccess

      Excel also supports importing from SQL databases, though I don't know if it plays nicely with anything other than MSSQL.

    11. Re:Relational Databases won't do! by jc42 · · Score: 1

      On the other hand, there has been widespread acknowledgement that UTF-8 solves the charset problems well enough that we should simply decree it the standard and move on. But then, some organizations started doing that 15 years ago, and we still have a long way to go. Lots of people, especially most American corporate managers, simply don't care. Everyone should just learn English, y'know.

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
    12. Re:Relational Databases won't do! by TheTyrannyOfForcedRe · · Score: 1

      Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.

      WHAT??? I hope you don't believe that. Here are some similar statements:

      Geneticist - Many of them can't utilize the gene sequencer and rightly so!

      Astronomer - Many of them don't know how to use these darn telescopes...and rightly so!

      Physicist - Many of them can't operate the doors leading into the lab...and rightly so!

      How in the world can scientists justify this? If they don't have the time or will to get the most out of their tools they NEED TO HIRE AN EXPERT who does.

      --
      "Liechtenstein is the world's largest producer of sausage casings, potassium storage units, and false teeth."
    13. Re:Relational Databases won't do! by gmueckl · · Score: 1

      I don't believe it. I experienced it. They learn just what they need to get their particular problem solved, when they need it. Remember, these people are lab rats, not coders. And most probably coders wouldn't understand what the scientists actually want to get done. So it's a catch-22 here.

      --
      http://www.moonlight3d.eu/
    14. Re:Relational Databases won't do! by Anonymous Coward · · Score: 0

      We have a file archiving and retrieval system that uses a database for managing the metadata and location of datafiles. When an engineer or scientist needs to get at one or more datafiles, he or she enters the search terms and the system returns a list of 0 or more files that match the search criteria.

      Once you have thousands (and in our case millions) of files, you'd be crazy not to consider a database just for religious reasons. ("I'm a scientist, I don't do databases...")

  27. SparkLab by guznik · · Score: 1

    This is exactly what SparkLab aims to solve, take a look here: http://sparklix.com/demo-movie It's free for academic and non-profit organizations. Personal free edition will be up later this year.

  28. Contractors/Grad Students by Anonymous Coward · · Score: 0

    Hire a local contractor (read local grad students) to program a simple system for you. This really needs to be in a database which is accessed through an interface you will be comfortable with and which makes it easy for you to manipulate your data.

    Write down how your data is described, how you access and update the data, as well as what output is needed from the system, like how you need to view the data in order to use it in reports or calculations. It doesn't sound like it would be very hard to write something to organize your data. A good price for something like this where I live would be three to seven hundred dollars. Find someone with a decent track record and you should be much more organized in no time.

  29. HDF5 database files / PyTables by Anonymous Coward · · Score: 0

    I have recently started using PyTables to store my data. Very fast, great compression and in Python!
    http://www.pytables.org/

  30. Two approaches... by meburke · · Score: 1

    A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.

    If your data is derived or crunched, you may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html , and take heed.

    The previous suggestions about leaving your data intact and refining the access are good advice. I have used and developed some network dbms systems for this type of data. The current trend seems to be toward Object-Oriented Network dbms systems, but I'm not sure that is the way to go; OONDBMS tends to be static and hard to maintain in a dynamic experimental environment. The largest experimental environment that I've had the opportunity to work on, with an energy company here in Houston, was a statistical analysis of nuclear reactions. The data was constantly changing and we needed a self-referencing, dynamic data repository. This is the type of system where you download data sets and do your analysis AFTER you have acquired it locally. The dbms was written in FORTRAN90 and was very fast, but you need a team for something like this unless you are expert enough to program it all yourself. It actually used very little code, but the record management and indexes (mostly ISAM/inverted ISAM) took massive amounts of computer power. There are now some cute tools in FORTRAN 2000 that allow you to use a web browser as a front end, but I don't usually want to look at the data being gathered; I usually want to crunch the statistics and see the results. The browser front-ends I have seen tend to require too much tweaking in order to adapt to the changing data parameters. Remote terminals make more sense. Maybe you should be willing to change the method of accessing the data and not try to maintain dozens of links.

    --
    "The mind works quicker than you think!"
  31. tags and search (find/grep) by nycguy · · Score: 1

    I would just put each set of experimental data in a separate subdirectory. Within each subdirectory I'd put a file with specific name (e.g., "description.txt") in which you briefly write up exactly what the experimental data is, how it was generated (e.g., if generated by a program, give the arguments and/or pointers to input data), and some keywords to allow it to be indexed/searched. Then I'd use your standard OS search tools to find the description file(s) you're looking for, thereby allowing you to locate your data based on its description rather than some brittle directory hierarchy.
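
    For those who would rather script the search than use find/grep, here is a small Python sketch of the same lookup. The description.txt name comes from the paragraph above; the directory names, the descriptions, and the find_experiments helper are invented for the example.

```python
import os
import tempfile


def find_experiments(root, *keywords):
    """Return directories under root whose description.txt mentions every keyword."""
    matches = []
    for directory, _subdirs, files in os.walk(root):
        if "description.txt" not in files:
            continue
        description = os.path.join(directory, "description.txt")
        with open(description, encoding="utf-8") as handle:
            text = handle.read().lower()
        # Case-insensitive substring match on every keyword.
        if all(word.lower() in text for word in keywords):
            matches.append(directory)
    return sorted(matches)


# Build two hypothetical experiment directories with meaningless names.
root = tempfile.mkdtemp()
for name, text in [
    ("x7f3a", "Titration of lysozyme at pH 7. Keywords: titration, lysozyme"),
    ("k29dd", "Fluorescence of BSA at pH 4. Keywords: fluorescence, bsa"),
]:
    os.makedirs(os.path.join(root, name))
    with open(os.path.join(root, name, "description.txt"), "w", encoding="utf-8") as handle:
        handle.write(text)

print([os.path.basename(d) for d in find_experiments(root, "lysozyme", "titration")])
```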

    I have a pretty standard setup for generating experimental data in my work. Whenever I run an experiment (usually a simulation), I have a wrapper script that generates a random (meaningless) subdirectory name, copies my simulation binary and configuration to that directory (so I can reproduce the results later in case either my simulator code or its configuration changes), prompts me to enter a description of what it is I'm simulating, and asks me to provide some keyword tags. The only way I can find the data afterward is to search the description files from the last step, because the data is otherwise just in a randomly-named directory.

    Of course, this scheme depends on you doing a decent job of describing your data and providing keywords, but I don't think you can get around that with any technique. At some point you have to inject some human labeling/categorization. Directories and symlinks are just a pretty restrictive way of organizing things.

  32. SQLite + Scripting language by ericbg05 · · Score: 2, Informative
    Others have already mentioned SQLite. Let me briefly expound on the features that are likely the most important to you, assuming (if you'll permit me) that you don't have much experience with databases:
    0. The basic idea here is that you are replacing this whole hierarchy of files and directories by a single file that will contain all your data from an experiment. You figure out ahead of time what data the database will hold and specify that to SQLite. Then you create, update, read, and destroy records as you see fit--pretty much as many records as you want. (I personally have created billions of records in a single database, though I'm sure you could make more.) Once you have records in your database, you can with great flexibility define which result sets you want from the data. SQLite will compute the result sets for you.
    1. SQLite is easy to learn and use properly. This is as opposed to other database management systems, which require you to do lots of computery things that are probably overkill for you.
    2. Your entire data set sits in a single file. If you're not in the middle of using the file, you can back up the database by simply copying the file somewhere else.
    3. Transactions. You can wrap a large set of updates into a single "transaction". These have some nice properties that you will want:
      3.1. Atomic. A transaction either fully happens or (if e.g. there was some problem) fully does not happen.
      3.2. Consistent. If you write some consistency rules into your database, then those consistency rules are always satisfied after a transaction (whether or not the transaction was successful).
      3.3. Isolated. (Not likely to be important to you.) If you have two programs, one writing a transaction to the database file while the other reads it, then the reader will either see the WHOLE transaction or NONE of it, even if the writer and reader are operating concurrently.
      3.4. Durable. Once SQLite tells you the transaction has happened, it never "un-happens".
      These properties hold even if your computer loses power in the middle of the transaction.
    4. Excellent scripting APIs. You are a physical sciences researcher -- in my experience this means you have at least a little knowledge of basic programming. Depending on what you're doing, this might greatly help you to get what you need out of your data set. You may have a scripting language that you prefer -- if so, it likely has a nice interface to SQLite. If you don't already know a language, I personally recommend Tcl -- it's an extremely easy language to get going with, and has tremendous support directly from the SQLite developers.
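
    To make the "atomic" and "consistent" points concrete, here is a short sketch. It uses Python's sqlite3 module rather than Tcl purely as an illustration; the reading table, its CHECK rule, and the record_run helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A consistency rule written into the database: pH must lie between 0 and 14.
conn.execute("CREATE TABLE reading (run TEXT, ph REAL CHECK (ph BETWEEN 0 AND 14))")


def record_run(conn, run, values):
    """Insert all readings of a run in one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO reading VALUES (?, ?)",
                [(run, value) for value in values],
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]


print(record_run(conn, "run-a", [6.9, 7.0, 7.1]))  # True: all three rows stored
print(record_run(conn, "run-b", [7.0, 42.0]))      # False: 42.0 violates the rule
print(count(conn))                                  # 3: nothing from run-b was kept
```

    The valid 7.0 reading from the second run is discarded along with the bad one, so the database never holds half a run.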

    Good luck and enjoy!

    1. Re:SQLite + Scripting language by Anonymous Coward · · Score: 0

      I'm not sure that SQLite is truly concurrency safe, because it relies on locking mechanisms provided by the underlying OS/file system. Caution is certainly needed. See for instance http://www.sqlite.org/faq.html#q5

    2. Re:SQLite + Scripting language by Anonymous Coward · · Score: 0

      *any* database is a really stupid idea because the data needs to last longer than your lifetime, and most any database storage suggested so far is less than a decade old and will last less than a decade.

      You already have knowledge to work with complex topics with files in a directory structure - data migration, searches, computer upgrades, OS changes, backup systems, and probably you have some scripting skills.

      Stick with files - don't let anyone convince you that you need multi-user atomic isolated transactions or a normalised database.

  33. What about a wiki? by gotfork · · Score: 3, Insightful

    In my previous lab group we used a mediawiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to troubleshoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.

  34. Comment removed by account_deleted · · Score: 2, Informative

    Comment removed based on user account deletion

  35. Re:Two approaches...OOPS! by meburke · · Score: 1

    Sorry, I forgot to include the fact that the network dbms system does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.

    --
    "The mind works quicker than you think!"
  36. Master of Chaos by Anonymous Coward · · Score: 0

    Name them by dates and save them in 1 directory. That's how you'll end up saving the files for your LaTeX paper anyway.

  37. A database ... by frogzilla · · Score: 1

    You need to start using a database. You don't have to actually put the data in a database but all of the meta data needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time record the file system location along with all other meta data that is relevant. Then some simple database queries, e.g. embedded in some web pages can retrieve the location and even the data. You can of course also store the data in a database as well if you wish. I personally find it more practical to do it this way.
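
A rough sketch of that idea in Python with the standard `sqlite3` module: the data files stay wherever they are, and the database only records each file's location plus arbitrary name/value metadata. The table names (`datafile`, `meta`), the helper names and the example paths are all hypothetical.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Create the catalog: one row per data file, plus name/value metadata rows."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS datafile (
            id   INTEGER PRIMARY KEY,
            path TEXT NOT NULL UNIQUE      -- where the file lives; never moved
        );
        CREATE TABLE IF NOT EXISTS meta (
            file_id INTEGER NOT NULL REFERENCES datafile(id),
            name    TEXT NOT NULL,         -- e.g. 'sample', 'pH', 'type'
            value   TEXT NOT NULL,
            UNIQUE (file_id, name)
        );
        """
    )
    return conn

def register(conn, path, **meta):
    """Record a file's location and its metadata in a single transaction."""
    with conn:
        cur = conn.execute("INSERT INTO datafile (path) VALUES (?)", (path,))
        conn.executemany(
            "INSERT INTO meta (file_id, name, value) VALUES (?, ?, ?)",
            [(cur.lastrowid, k, str(v)) for k, v in meta.items()],
        )

def find(conn, **criteria):
    """Return paths of files whose metadata matches every name=value given."""
    sql = "SELECT path FROM datafile d WHERE 1=1"
    params = []
    for k, v in criteria.items():
        sql += (" AND EXISTS (SELECT 1 FROM meta m WHERE m.file_id = d.id"
                " AND m.name = ? AND m.value = ?)")
        params += [k, str(v)]
    return [row[0] for row in conn.execute(sql + " ORDER BY path", params)]
```

Each query here does the job that a directory of symbolic links used to do, except that nothing dangles when you decide to slice the data a different way: `find(conn, sample="lysozyme", pH=7.0)` is just another query.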

  38. Use tags in Apple OS X by wealthychef · · Score: 2, Insightful

    If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget my keywords that I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.

    --
    Currently hooked on AMP
    1. Re:Use tags in Apple OS X by Anonymous Coward · · Score: 0

      If you are using Mac OS X, you can tag the files using the Finder Get Info and putting "Spotlight comments" there. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget my keywords that I'm storing stuff with, so I don't really know what to search for.
      OS X Spotlight is promising and might work very well for you.

      Extremely unlikely that a scientist would be using OSX, but luckily Linux and Windows can do the same thing with better performance.

    2. Re:Use tags in Apple OS X by wealthychef · · Score: 1

      Extremely unlikely that a scientist would be using OSX

      Apparently you don't know many scientists. I work at a national laboratory for our computing center, and there are hundreds of scientists using Macs at our laboratory. I've noticed lots of the Linux heads switching to Mac. It's nice to have a sweet GUI along with a BSD Unix underpinning.
      Apple does not make good cluster machines, but they make excellent desktops for doing science.
      Get your head out of your bias.
      I didn't know that Windows and Linux have metadata integrated into their filesystems to allow for constant-time search results. Tell me more.

      --
      Currently hooked on AMP
    3. Re:Use tags in Apple OS X by Anonymous Coward · · Score: 0

      All true. But don't feed the trolls.

    4. Re:Use tags in Apple OS X by wealthychef · · Score: 1

      My bad! You are right. Sigh

      --
      Currently hooked on AMP
  39. How CMS sorts data by toruonu · · Score: 2, Informative

    Well CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and once you add the simulated data we have a helluva lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root

    the /store is a beginning marker of the logical filename region that different sites can map differently (who uses NFS, who uses http etc etc)
    /mc/ -> it's monte carlo data
    /Summer10/ -> the data was produced during Summer of 2010
    /Wenu/ -> it's a simulation of W decaying to electron and neutrino
    /GEN-SIM-RECO/ -> the data generation steps that have been done
    /START37_.../ -> The detector conditions that have been used (the actual full description of the conditions is in some central database)
    /0136/ -> is the serial number (actually I'm not 100% sure, but it's related to the production workflow etc)
    /0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename, the hash is due to the fact that the process has to make sure there are no conflicts in filenames

    Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root

    This file is real data, taken during the first run of 2010 and filtered to the MinimumBias primary dataset (related to event trigger content). The datafiles in there contain RECO content and were done during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers) and then the filename.

    You could use a similar structure to differentiate the datafiles that you actually use. The good thing is that you can map such filenames separately everywhere as long as you change the prefix according to the protocol used (we use for example file:, http:, gridftp:, srm: etc). You can also easily share data with other collaborating sites, as long as everyone uses a similar structure. No need for special databases etc. If you need some lookup functionality, then one option is a simple find (assuming you have filesystem access) or you could build a database in parallel and you can use the LFN structure to index things etc.
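
To make the mapping concrete, here is a small Python sketch that splits a logical filename of the shape shown above into its components and prepends a site-specific prefix. This is only an illustration, not CMS software: the field names in `LFN`, the fixed depth of eight path components and the example prefix are assumptions read off the two example paths.

```python
from typing import NamedTuple

class LFN(NamedTuple):
    kind: str        # 'mc' or 'data'
    era: str         # e.g. 'Summer10' or 'Run2010A'
    dataset: str     # e.g. 'Wenu' or 'MinimumBias'
    tier: str        # e.g. 'GEN-SIM-RECO' or 'RECO'
    processing: str  # conditions / processing pass, e.g. 'START37_V5_S09-v1'
    serial: str      # e.g. '0136'
    filename: str    # unique, hash-like file name

def parse_lfn(lfn: str) -> LFN:
    """Split a logical filename into named fields (assumes the 8-part layout)."""
    parts = lfn.strip("/").split("/")
    if len(parts) != 8 or parts[0] != "store":
        raise ValueError(f"not a recognised logical filename: {lfn!r}")
    return LFN(*parts[1:])

def to_physical(lfn: str, site_prefix: str) -> str:
    """Map the logical name to a site-specific one by prepending that site's prefix."""
    return site_prefix.rstrip("/") + lfn
```

Because every field has a fixed position, the same parsed tuple can double as the row you would insert if you later build a lookup database in parallel.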

  40. Another vote for NoSQL and some experience by wolf87 · · Score: 2, Informative

    I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
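
For a feel of what such a consolidation looks like, here is a sketch using Python's standard `dbm` module as a stand-in for BerkeleyDB (it is whichever simple key-value backend your Python build provides, not BerkeleyDB itself). Each small file becomes one entry keyed by its path relative to the tree root, so the original structure survives in the keys. The function names are made up.

```python
import dbm
from pathlib import Path

def consolidate(root, db_path):
    """Copy every file under root into one key-value store, keyed by relative path."""
    root = Path(root)
    count = 0
    with dbm.open(str(db_path), "c") as db:   # "c": create the store if missing
        for f in sorted(root.rglob("*")):
            if f.is_file():
                db[f.relative_to(root).as_posix()] = f.read_bytes()
                count += 1
    return count

def fetch(db_path, key):
    """Read one former file back out of the store as ASCII text."""
    with dbm.open(str(db_path), "r") as db:
        return db[key].decode("ascii")
```

Keep the store outside the tree being consolidated, otherwise the store's own files get swept up on the next run.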

    1. Re:Another vote for NoSQL and some experience by jgrahn · · Score: 1

      I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access [...] I think the general idea of a key-value store that lets you keep your data in the original structure would work well.

      A file system *is* a key-value store.

      I suspect those 100,000,000 files were in fact tiny pieces of data which didn't make sense to access using normal tools (from ls to MS Word). That the conversion worked out for *you* doesn't mean that it would be useful to convert *every* set of files into a BerkeleyDB. Especially not sets of (say) 500 files, 10GB each.

    2. Re:Another vote for NoSQL and some experience by wolf87 · · Score: 1

      I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access [...] I think the general idea of a key-value store that lets you keep your data in the original structure would work well.

      A file system *is* a key-value store.

      I suspect those 100,000,000 files were in fact tiny pieces of data which didn't make sense to access using normal tools (from ls to MS Word). That the conversion worked out for *you* doesn't mean that it would be useful to convert *every* set of files into a BerkeleyDB. Especially not sets of (say) 500 files, 10GB each.

      I completely agree. If you have a lot of small datasets that break ls and such (as was the case in my situation), BerkeleyDB provided a great solution. If you have a smaller set of very large files, a different solution is needed (perhaps just the file system with some kind of automated indexing).

  41. Knowledge Management Tools by Isao · · Score: 1

    I haven't had to store experimental results like that. My work produces prototypes, some data, demos and support documentation. There are tons of KM tools out there to manage heterogeneous data in a recoverable way. We've used document repositories like Hummingbird (acceptable) and of course SharePoint. The key (literally) is including the right metadata and tags when you check in the element. When a data set goes dormant (static) you can tarball the CVS tree or whatever and drop it in the repo. Then there's Knowledge Discovery, something we've created tools for. They let you understand how you got that idea from three hours of web/repo surfing.

  42. First devise a meaningful stable primary key by RandCraw · · Score: 2, Informative

    First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.

    Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.

    If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.

    In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').

    I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
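
As one possible reading of "the code that automatically re-generates it", here is a Python sketch that treats the dated directories as the permanent store and rebuilds a disposable tree of soft links from their names. It assumes directories are named `<date>_<machine>_<user>` with no underscores inside a field; that convention, the `by-...` view layout and the function name are all hypothetical.

```python
import os
import shutil
from pathlib import Path

FIELDS = ("date", "machine", "user")   # assumed naming: <date>_<machine>_<user>

def rebuild_views(store, views):
    """Throw away the old link tree and regenerate it from the stable store."""
    store, views = Path(store).resolve(), Path(views)
    if views.exists():
        shutil.rmtree(views)           # views are disposable; the store is not
    made = 0
    for d in sorted(p for p in store.iterdir() if p.is_dir()):
        parts = d.name.split("_")
        if len(parts) != len(FIELDS):
            continue                   # skip anything not following the convention
        for field, value in zip(FIELDS, parts):
            link_dir = views / f"by-{field}" / value
            link_dir.mkdir(parents=True, exist_ok=True)
            os.symlink(d, link_dir / d.name, target_is_directory=True)
            made += 1
    return made
```

Since the links are regenerated wholesale rather than edited by hand, a renamed or reorganised view never leaves dangling links behind; you change the script and run it again.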

    1. Re:First devise a meaningful stable primary key by kubitus · · Score: 1

      answers some of my previous questions.

  43. Well, I'd have to say by Anonymous Coward · · Score: 0

    In the butt Bob.

  44. SharePoint! by Anonymous Coward · · Score: 0

    SharePoint lists items, basically a database. You can sort and group by any parameter and attach files. No programming or special technical knowledge necessary.

  45. postgreSql by Alanonfire · · Score: 1

    I did a little bioinformatics in the past, and we were using postgreSql to manage our results. It was nice because you can create meaningful fields to query in the future. It took some time developing the system, but it really helped out in the long run. We had to consider errors in the readings of the results and had to incorporate a little bit of fuzzy logic into the tools we used to run comparisons on the database.

    If you are at a university or near a university, the computer science dept may give a few students credit to build you a system that can handle it, so you don't have to.

  46. You need a LIMS by pigreco314 · · Score: 2, Insightful

    A Laboratory Information Management System will help you store, organize, analyze and data-mine your data.

    --
    "linux" is a very common word and was not included in your search.
  47. Re:Use databases! (maybe, maybe not) by mikehoskins · · Score: 1

    I agree. It depends.

    Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.

    If not, then I'd look at MongoDB with GridFS. I'd even go further and explore GridFS-FUSE (a mountable file system version of MongoDB/GridFS).

    With GridFS-FUSE, you have a crazy powerful database/file system combo. Now, since MongoDB is a NoSQL database, you cannot do SQL queries against it. You can store and retrieve key-value pairs, NoSQL "documents," and actual files with MongoDB/GridFS/GridFS-FUSE.

  48. test lists and RCS by wrench+turner · · Score: 1

    Instead of sorting datasets, use a testlist database (flat files). The test contains/links/points to its dataset. The test lists are selected at test run time. Each entry in a test list tells how to generate the specific test environment for the test. A test list entry contains the test, the RCS tag/version of the test to be "gotten", the test seed, an array of exit codes that should be retried, how many retries, whether the test is gating, and an array of test dependencies. A test run can be considered to pass even though an individual, non-gating test fails. One test entry may extract and prepare the test data and other dependent entries can then run against that test dataset.

  49. LIMS! This is a no-brainer! by Wdi · · Score: 1

    It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.

    This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kind of systems should you ever transfer to industry and work in data-generating or data-processing positions.

    Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or of not being able to make cross-references (patents!), is much too high if you let people run their own set-ups. Do yourself, and your research group, a favor: Get some grant money and purchase a robust commercial set-up at least for your group, or better your department. Entry level systems, with academic discounts, are affordable. There are no competitive open-source solutions.

    Start your research here:

    http://en.wikipedia.org/wiki/LIMS

    (though the systems listed there are instrument-centric, if you are more into generic chemistry there are other standard packages by companies such as Accelrys and CambridgeSoft).

  50. Used to be two-word answer by MarkusQ · · Score: 1

    I used to have a two word answer for this question: Use BeOS

    But now it's a six word answer (*sigh*): Invent time machine, then use BeOS

    --MarkusQ

    1. Re:Used to be two-word answer by yyxx · · Score: 1

      Wrong two word answer. I seriously doubt BeOS was ever used much for scientific research.

      The correct two word answer is: "lab notebook".

    2. Re:Used to be two-word answer by Convector · · Score: 1

      Seven words: Invent backwards time machine, then use BeOS.

  51. Computation project organization by Anonymous Coward · · Score: 0

    This was developed in the context of computational biology experiments, but should hold true for other types of computational projects:

    http://www.ncbi.nlm.nih.gov/pubmed/19649301

  52. Learn a Relational Database! by theNAM666 · · Score: 1

    It's already been said, but it bears saying again. Directories and symlinks.... oh my!

  53. Spotlight by Anonymous Coward · · Score: 0

    Google Desktop on a PC and Spotlight on my Mac have helped me a great deal.

  54. backups are for wimps by Gothmolly · · Score: 1

    real men upload their stuff to kernel.org and let the world mirror it.

    --
    I want to delete my account but Slashdot doesn't allow it.
  55. By date, external file for metadata. by goodmanj · · Score: 1

    Here's what I do:

    Directory for each data set, labeled by date (20100815).
    Short README file inside each directory with description of the run.
    Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
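
If you would rather not maintain the spreadsheet by hand, a script can at least generate its skeleton. The Python sketch below assumes each run lives in a directory named YYYYMMDD containing a file called `README` whose first line is a one-line summary; the column names and the function name are made up.

```python
import csv
import re
from pathlib import Path

DATE_DIR = re.compile(r"^\d{8}$")      # e.g. 20100815

def write_index(root, csv_path):
    """Scan dated run directories and write one CSV row per run."""
    rows = []
    for d in sorted(Path(root).iterdir()):
        if not (d.is_dir() and DATE_DIR.match(d.name)):
            continue                   # ignore anything that is not a dated run
        readme = d / "README"
        text = readme.read_text().strip() if readme.is_file() else ""
        rows.append({
            "date": d.name,
            "summary": text.splitlines()[0] if text else "",
            "n_files": sum(1 for f in d.iterdir() if f.is_file()),
        })
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["date", "summary", "n_files"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The resulting CSV opens directly in any spreadsheet program, where you can add the experimental parameters and core results as extra columns.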

  56. Consistently by rwa2 · · Score: 3, Insightful

    Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.

    You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
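
A small Python sketch of that kind of format: plain tab-separated lines, stored gzip-compressed and filtered line by line on the way back in. The three-column layout (sample, pH, value) and the function names are invented for the example; the integrity checking mentioned above comes from the checksum the gzip format carries.

```python
import gzip

def write_raw(path, records):
    """Store records as repetitive, grep-friendly text lines, gzip-compressed."""
    with gzip.open(path, "wt", encoding="ascii") as fh:
        for sample, ph, value in records:
            fh.write(f"{sample}\t{ph}\t{value}\n")

def grep_raw(path, sample):
    """Stream the compressed file and yield (pH, value) for one sample."""
    with gzip.open(path, "rt", encoding="ascii") as fh:
        for line in fh:
            name, ph, value = line.rstrip("\n").split("\t")
            if name == sample:
                yield float(ph), float(value)
```

The same file stays usable from the shell with the standard `zcat` and `grep` tools, so post-processing scripts in other languages can draw from it too.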

  57. Self-describing data by Anonymous Coward · · Score: 0

    Mark each file with its description and insert each transformation after the original description as suggested in "More Programming Pearls" by Jon Bentley.
    Put some keywords in the description and feel free to add more as you go. You can use a free format or go to a more rigid organization.

  58. two wrong solutions by Anonymous Coward · · Score: 0

    In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

    Your statement of this policy illustrates a hidden assumption which causes the policy to fail before it gets out of the gate: there should be no notion of "structure" in the way you store the data. A filesystem-- all production* examples of which are hierarchical-- imparts a particular set of relations between the elements of your data: that of a hierarchy (and a specific one at that!). By caring about the hierarchy, you're being beholden both to one kind of relationship and to a specific instance of that relationship with respect to your data.

    (* There's been some work on relational-- as opposed to hierarchical-- filesystems; cf. WinFS. If this type of file representation/storage were de facto, the data "organization" habit into which so many people unknowingly fall wouldn't be a bad one.)

    While a symlink farm is one way to separate representation from presentation, it still limits all your views of the data to hierarchical relationships.

    Raw data should stand untouched, or you may delete it by mistake.

    Leaving data "untouched" is the wrong solution to that problem. You need backups, and they must 1) prevent data corruption (so you can undo a mistake made by you or your hardware, such as accidental/regrettable data modification or filesystem corruption), and 2) provide data persistence (so your data will persist if a copy of it is destroyed, as in a disaster).

    I've...had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.

    The representation of data should not impose the need for contortion on the part of viewers of the data. If it does, the representation is flawed in that it embeds relationships rather than abstracting them from the representation (and you should always be interested in fixing flaws). The only thing that should contribute to the complexity of the presentation is the complexity of the relationship, not baggage from insufficiently general representation.

    The central point is that storing your data in hierarchies inherently implies relationships (as long as the user looks directly at the representation, which can't be helped in the case of filesystems, yet). Another post points this out, perhaps unintentionally: "(Title: How can you not?) I never understood how you can have something organized or not." Exactly. If it lives in a hierarchy, then it's organized, ipso facto. Use the relational model; use a relational database.

    ccleve's post "Don't bother with hierarchies" has a couple good tidbits too.

    It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.

    If your data is as brittle as this, you are doing it [data management] wrong.

  59. SciDB, Open Source DB for Science by geoffrobinson · · Score: 2, Interesting
    --
    Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
  60. Use a document management system by rsborg · · Score: 2, Informative
    Document Management Systems are great - they combine (some of) the benefits of source control, file systems, and email (collaboration).

    I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and run it on the free vmwareplayer or a real VM solution if you have one.

    I recently set up a demo showing the benefits of such a system, I was able to, in about one day, download and set up Alfresco, expose CIFS interface (ie, \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even word docs and excel files thanks to OpenOffice libraries), and I could go about changing directory structure, moving around and renaming files, etc. .. and the source control would show me changes. In fact, I could go into the backend and write SQL queries if I wanted to with detailed reports of how things were on date X or Y revisions ago. Was quite sweet. All the while, the users still saw the same windows directory structure and modifications they made there would be versioned and modified in Alfresco's database.

    Here is a bitnami VM image, will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.

    --
    Make sure everyone's vote counts: Verified Voting
  61. Use Data Archiving software by Anonymous Coward · · Score: 0

    Use software specifically designed to archive scientific datasets and make them available to others. See, for example, DataVerse (http://thedata.org/home).

  62. Organize by HYPOTHESIS, then funding source by Anonymous Coward · · Score: 0

    Hi, I've been working in immunology and microbiology for ~20 years at the bench [well, actually in a biosafety cabinet...] and face this same problem.

    I've created a data directory on our server that is organized by experiment number [M199, M200, etc.] Each of these folders begins its life as a sequential number in either my lab notebook or in my boss's lab notebook as a hypothesis -- "does immunity to bacteria x depend on secreted cell product y in the lung?" I sketch an experimental design that will answer this hypothesis in my notebook. Next, I create a folder on our server with the experimental code, "M200" for example. Then I formalize the experimental design by creating an Excel spreadsheet that lists my experimental groups ["control uninfected mice", "control infected mice", "infected mice with gene knockout z", etc.] and set the timepoints for analysis as well as the assay by which I'll measure the validity of my hypothesis, typically something like bacterial load in the target organ. This Excel sheet also serves as an experimental record -- of the lot of bacteria I used to infect the mice, the various reagents used for secondary assays [like phenotyping the cells that are responding to the infection.] All of the quantitative instrumental data that I collect on the various cells and tissues [flow cytometry, fluorescent microscopy, Elispot assays, etc.] goes into subdirectories by time point day 30, day 60, etc.

    My lab has a couple of funding sources, they each get their own high-level directory tree. Grant "XX-10" > experiment "M200" > "Day 10" > "Flow Cytometry" > etc.

    While I really, really appreciate the inherent limitations of this system, it is straight-forward -- meaning new technicians and post-docs can use it without screwing it up too much. I can find a particular experiment by referring to my lab notebook. It satisfies my institution's intellectual property documentation requirements.

    The key point of all this: I'm paid to answer the questions that were in the grant submission. My success [or failure] depends on my ability to do this work -- not to make or administer a database. I could try to get a grant to hire someone to do this for me, but the chances of getting it funded seem pretty slim -- what value does this really offer to the taxpayer? Will a database [although cool] tell me something about my results that I didn't know? If I design my experiments properly, the primary assay will clearly support my hypothesis, or not. By the way, my taxes are already way too high.

    My IT department is very small, and basically only provides server backup and very limited user support. Any kind of database work is well outside of their capability.

    Good luck. If you find something that fits the bill, please let us know. J

  63. Re:RDB won't do! by kubitus · · Score: 1
    IMHO best answer so far

    What I am looking for is a multidimensional addressing - file or database system.

    something like multiple B-trees for the content with the possibility to add another B-tree index if required later.

    Maybe the Google people have an answer?

  64. Re:Use databases! (maybe, maybe not) by tibit · · Score: 1

    One can mix-n-match: use flat files to store raw data, post-processed stuff, etc, but use a database to keep track of everything. The latter can even be handled by your OS from the get-go. On OS X you could write a spotlight plugin or two for your data files, and as long as the file format allows storing metadata within the files, spotlight will index it. Same goes for Windows Search. You could also use native mechanisms for adding metadata to files - those exist IIRC on both Windows 7 and on OS X.

    --
    A successful API design takes a mixture of software design and pedagogy.
  65. TiddlyWiki by stm2 · · Score: 1

    A personal wiki that runs from one file, I link my files from there and I can add documentation and references at will. http://www.tiddlywiki.com/

    --
    DNA in your Linux: DNALinux
  66. caBIG, ISO/IEC 11179 metadata by BatesMethod · · Score: 1

    If you're really serious about tracking metadata, it may be worthwhile to take a look at some of the tools offered by caBIG:

    https://cabig.nci.nih.gov/

    The caBIG tools are geared toward using a model-driven approach to define precise metadata which promote semantic interoperability. Underlying the caBIG tools is a metadata repository called the caDSR, which follows the ISO/IEC 11179 Standard for Metadata Registries:

    https://cabig.nci.nih.gov/concepts/caDSR/

    The caBIG tools are all open-source, developed by the National Cancer Institute.

  67. Relational DB can be the right way by Anonymous Coward · · Score: 0

    I currently look after 20+ million data files from a synchrotron radiation source totalling about 65 terabytes. This is rising at close on 1 terabyte per day during experimental runs. We have a Lustre file system storing the files as they are made. Filenames are mostly generated automatically, based on experimental station, date and a unique label given to each visit by a research team. File metadata is extracted automatically and wrapped up into a NeXuS file http://www.nexusformat.org which is also added to the archive. Each file generated by the instruments is queued up into a tape archival system, and inserted into the DB along with the metadata extracted earlier. The metadata is stored as triples and a strict ontology means that terms are always used as intended. Datasets can be created from this archive in many different ways and on the fly, and a web interface permits downloading of datasets from the archive. The schema we use for metadata is found at http://icatproject.googlecode.com

    We do NOT store data in the database, just metadata. Part of the metadata is a persistent identifier to the data file within the archive meaning that filenames are largely unimportant as far as searching the archive is concerned. No duplicate files can occur because of automated filename generation.

    This very same system is also used in a neutron source and a laser facility, so it is pretty generic.

  68. Virtual Terminal by Anonymous Coward · · Score: 0

    It's called the command line. Use it.

  69. Use plain text files, tag them properly by geggo98 · · Score: 1
    Use plain text files. They can be easily managed: Version control and comparing is easy. The best thing: Text files can be read by nearly every application.

    File corruption is unlikely with text files. If you should have corrupted files, you have a chance to recover them with text files. With binary files, databases etc. this becomes much harder.

    To find your files later, tag them properly. Something like OpenMeta might help you.

  70. Crazy database talk by ravenacious · · Score: 1

    All this talk of using whatever kind of database to organize your experimental data is nuts. It's well intentioned, I'm sure, but it's still insane. I always tell students that there is no general way to organise one's data; you have to find a system that works for you. I reckon that > 99% of physical science researchers (not just physicists, as seems to be a confusion in several replies) wouldn't be able to set up and use a database in a way that's better in terms of time and effort efficiency than just doing whatever it is that they already do to organise their data. Worse still, I reckon that it's the sort of thing that one would spend a huge chunk of time doing and then only use for a short while before you got bored of inputting stuff into the database properly and then started to forget to do it or, worse, resolved to "do it in batches". Anyway, the result of this will be that one eventually stops using the database and goes back to what one was doing before, but now with a huge hole in the data-trail where the database used to be. Alternatively, one will struggle on with the database for a while and then try and re-design it to put in all the features that were missed out in the original design, all the while sucking loads of time and eventually going back to the original method.

    Getting on to some actual advice. I would suggest two features that your chosen system should have. It should be:

    1. robust
    2. quick

    Personally, I have a lab book in which I record experimental details (what I actually did to generate the data). There's a date at the start of every day and the rough "titles" of the experiments and then the details of what it was. When I generate data files I organise them into a hierarchy of directories. So, there's a "projects" directory that has all the different projects in it. There might be a project that I'm working on to do with nanoparticles or something, and that'll have a directory called "nanoparticles". There's a bunch of directories in the project folder, such as "data", "analyses", "reports", etc. The Data directory is divided by the experimental technique that was used to get the data, such as "fluorescenceMicroscopy", "TEM" or "SEM" or whatever. Then the actual experiments are in directories inside the relevant technique. I name the directories by date first and then a brief indicator of what the experiment is about. So if I was looking for the aggregation behaviour of my nanoparticles in the presence of different polymers or something, I'd have a directory called something like

    "20100815-FePt_particles_with_100k_PEG_in_PBS_at_pH7"

    or something like that. Something that people often do is call a directory "20100815" or something like that. I used to do this, but I didn't find it useful to look back on after 6 months. People also forget that you can have something like 256 characters for the file/directory name - USE THEM! Inside this directory will be a bunch of data files that I acquire. I tend to start the naming of files with a number and then a description of the sample and what's the point of this data. So, for example the first image in a set of TEM images might be named "01-100mM_PEG_generalGrid_300x", the next will be named "02-...", "03-...", etc. This way, all the files are ordered in the order that they were acquired, which I find helps to find them later, since I think that you remember the order of things better than their absolute position. So, if I wanted to find an image of aggregated nanoparticles that I took some time in the winter, I could easily find "Projects/nanoparticles/Data/TEM/20091106-FeNP_with_20k_PEG_in_EtOH/03-10mM_PEG_aggregatedParticles_20kx".
    Anyway, this works for me, but my data needs to be organised so that I can get to the relevant data and then do something with it, not aggregate large amounts together and get some numbers.
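
    If you would rather not type these names by hand, the scheme above is easy to generate. A rough Python sketch follows; the function names are invented for illustration, and the 255-character limit is an assumption about a typical file system that you should check on your own.

```python
from datetime import date
from pathlib import PurePosixPath

MAX_NAME_LENGTH = 255  # assumed per-name limit; check your own file system


def _slug(description):
    """Collapse whitespace to underscores so the name is shell-friendly."""
    return "_".join(description.split())


def experiment_dir(day, description):
    """Date first (YYYYMMDD) so directories sort chronologically."""
    name = f"{day:%Y%m%d}-{_slug(description)}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name is {len(name)} characters, limit is {MAX_NAME_LENGTH}")
    return name


def data_file(index, description):
    """Acquisition number first so files sort in the order they were taken."""
    return f"{index:02d}-{_slug(description)}"


def experiment_path(project, technique, day, description):
    """Projects/<project>/Data/<technique>/<date>-<description>"""
    return PurePosixPath("Projects", project, "Data", technique,
                         experiment_dir(day, description))
```

    For example, experiment_path("nanoparticles", "TEM", date(2009, 11, 6), "FeNP with 20k PEG in EtOH") / data_file(3, "10mM PEG aggregatedParticles 20kx") rebuilds the path quoted above.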

  71. GIS? by Anonymous Coward · · Score: 0

    Depending on the type of data, maybe a GIS might help you? In ArcGIS for example, you can link files (xls, csv), access databases, pictures/photo's, etc.

  72. File sets with strict structure & names by Anonymous Coward · · Score: 0

    For processing RNA/DNA microarray data sets, we use a well defined (==strict) structure and naming convention for directories and files. We and 100s of users have used it for about 5 years now and it works quite well and suits most of our needs.

    Directory structure:
    First of all, paths should be relative to the current/working directory. This ensures that any scripts/code will be the same regardless of the absolute location of the data. Each data set is in a separate directory. Optionally, it may contain one or more subdirectories (possibly in multiple levels). The actual data files are located in these subdirectories. Depending on the type of data, the data set directories are located in different so-called root directories. Conceptually, the directory structure is: <rootPath>/<dataSet>/<subDir>/.

    Format of file and directory names:
    What really adds to the above, is that the names of the directories and the files follow a certain syntax. A filename (without the path) can be split up in two parts, its fullname and its filename extension:

    <filename> := <fullname>.<extension>

    In turn, the fullname part can be split up into the name and comma-separated tags:

    <fullname> := <name>[,<tags>]*

    This setup makes it possible to annotate data files with both human readable as well as computer interpretable information.

    Similarly, the directory names have a fullname with a name part and optional tags.

    An example of a file set is:

    rawData/HapMap270,CEU,test/Mapping250K_Nsp/*.CEL

    where rawData/ is the root path, HapMap270,CEU,test/ is the data set, Mapping250K_Nsp/ is a subdirectory and *.CEL are the data files. The name of the data set is "HapMap270" with tags "CEU" and "test". The subdirectory indicates that the technology used to measure the data is for microarray type "Mapping250K_Nsp".

    With this strict directory structure, we can have methods/functions that automatically locate file sets by their (full)names without specifying absolute paths and so on.

    When we process data, we store intermediate and final results in new file set directories where we add tags indicating what type of processing has been done. We sometimes also change the root path. For example, one of the first preprocessing steps of our raw data removes systematic effects. That step adds tag "ACC" and stores the data files in:

    probeData/HapMap270,CEU,ACC/Mapping250K_Nsp/*.CEL

    We were quite careful to design it so that it would work on as many operating and file systems as possible, including Unix, OSX and Windows. Commas are valid symbols in filenames in all these systems.

    We did consider using relational databases to store the data, but realized that it adds lots of complications when it comes to backups as well as migration/sharing all or part of the data. At the end of the day, file systems are pretty neat and allow people to quickly get an overview of the content. There are file system browsers, you can check progress with a simple 'ls', access it via [s]ftp, web browsers etc. There are of course trade-offs you have to consider, but we found the file system to be more than good enough for what we wanted.

    We do all our work in the R language. We have implemented utility classes and functions that provide easy access to such file sets. See the R.filesets package [http://cran.r-project.org/web/packages/R.filesets/] for more examples.
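
    For readers who do not use R, the name/tag syntax above is simple enough to sketch in a few lines of Python. This is only an illustration of the convention, not the R.filesets API; the function names and the example file name "sample01,rep2.CEL" are made up.

```python
from pathlib import PurePosixPath


def split_fullname(fullname):
    """<fullname> := <name>[,<tags>]*  ->  (name, [tags])"""
    name, *tags = fullname.split(",")
    return name, tags


def parse_filename(filename):
    """<filename> := <fullname>.<extension>  ->  (name, [tags], extension)"""
    fullname, dot, extension = filename.rpartition(".")
    if not dot:  # no extension at all
        fullname, extension = filename, ""
    name, tags = split_fullname(fullname)
    return name, tags, extension


def fileset_dir(root, name, tags, subdir):
    """Build the relative path <rootPath>/<dataSet>/<subDir>."""
    dataset = ",".join([name, *tags])
    return PurePosixPath(root, dataset, subdir)
```

    Adding a processing tag is then just a matter of appending to the tag list, e.g. fileset_dir("probeData", "HapMap270", ["CEU", "ACC"], "Mapping250K_Nsp").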

  73. Re:Use databases! (maybe, maybe not) by mikehoskins · · Score: 2, Informative

    Yes, agreed, a combination is good (SQL + NoSQL + filesystem).

    There is no one-size-fits-all scenario, here.

    However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.

    MongoDB's GridFS (especially with FUSE support) marries many of these features together. MongoDB does have some SQL DB features (such as indexing/searching and transactions) but not others.

    Check out the whole stack here:
        http://www.mongodb.org/
        http://www.mongodb.org/display/DOCS/GridFS
        http://github.com/mikejs/gridfs-fuse

  74. Depending on the nature of your data... by patch0 · · Score: 1

    Depending on the nature of your data, I used to use netcdf files quite a lot (https://www.unidata.ucar.edu/software/netcdf/). I also work on data sharing and standardisation, this is a full time job, so really you can spend as much time as you want on this and still not get it done. There are a variety of international data standards which exist to facilitate the management and sharing of data. I know you aren't necessarily talking about sharing your data, but many of the same issues apply. In short, unless you wish to spend a great deal of time on this or your employer has some data management solution for you then there are probably only a variety of unsatisfactory solutions which no doubt the slashdotters here will suggest :)

  75. LIMS by Anonymous Coward · · Score: 0

    What you're looking for is a LIMS (laboratory information management system).

    http://www.bikalabs.com/ -open source

    Plenty of commercial LIMS available, but expect to pay $$$.

  76. Directory names : experiment number by Fuzzums · · Score: 1

    Just give all your datasets a number and put them in a database so you can search on all criteria you want.
    You can also use Excel to keep track of your metadata if you want...

    --
    Privacy is terrorism.
  77. Tagging in a database by hoggoth · · Score: 1

    You are finding the same problem everyone has with any data set. Hierarchical folders with one name only allow for a single, pre-arranged organization. It's terrible for the way we really use files, data, whatever really.

    Store your data sets with simple "inventory names" like 00001 through 99999 or random serial numbers. Have a spreadsheet or database that associates all of your data sets with as many characteristics as you like. Then you can sort and find by any combination you can think of in the future.
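
    A rough sketch of the spreadsheet variant in Python, assuming the catalogue is saved as CSV. The column names and rows are invented for illustration; in practice the text would come from your own file.

```python
import csv
import io

# Hypothetical catalogue: one row per data set, keyed by its inventory name.
CATALOG = """id,sample,ph,technique
00001,FePt,7.0,TEM
00002,FePt,5.5,TEM
00003,Au,7.0,SEM
"""


def find(catalog_text, **criteria):
    """Return the ids of all data sets matching every given column=value pair."""
    rows = csv.DictReader(io.StringIO(catalog_text))
    return [row["id"] for row in rows
            if all(row[column] == value for column, value in criteria.items())]
```

    Since CSV values are plain strings, find(CATALOG, ph="7.0") and find(CATALOG, sample="FePt", technique="TEM") both work without deciding up front which characteristic the data is "filed under".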

    --
    - For the complete works of Shakespeare: cat /dev/random (may take some time)
  78. Anonymous Coward by Anonymous Coward · · Score: 0

    There has been a move to content repositories like fedora and content/document management systems like alfresco. I'll throw rdf repositories into the mix as well.

  79. Great comments by gringer · · Score: 2, Insightful

    Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.

    The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.

    So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:

    • Use a file system, rather than a database
    • New data gets put in a consistent place, e.g. /data/<date>/<source>/
    • Backup this data. Assuming immutable files, incremental backups don't make much sense.
    • Symlink categories to the original data
    • When a file is referenced (or attached) in an email, keep a link between email date and file, e.g. /categories/<email>/<date>/<sink>/
    • Maintain (preferably autogenerate) and backup a plain-text file linking categories to files. This will help when data gets lost (i.e. accidentally deleted).

    My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
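
    The last bullet (autogenerating a plain-text file that links categories to files) could look something like the Python sketch below. The tab-separated layout and the function names are my own assumptions. As a side effect it flags symlinks whose target no longer exists.

```python
import os


def link_manifest(categories_root):
    """Return sorted (link, target, is_dangling) for every symlink under the root."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(categories_root):
        # Symlinks to directories show up in dirnames, all others in filenames.
        for name in dirnames + filenames:
            link = os.path.join(dirpath, name)
            if os.path.islink(link):
                target = os.readlink(link)
                dangling = not os.path.exists(link)  # exists() follows the link
                entries.append((os.path.relpath(link, categories_root), target, dangling))
    return sorted(entries)


def write_manifest(categories_root, manifest_path):
    """Write one 'link<TAB>target<TAB>status' line per symlink."""
    with open(manifest_path, "w") as handle:
        for link, target, dangling in link_manifest(categories_root):
            status = "DANGLING" if dangling else "ok"
            handle.write(f"{link}\t{target}\t{status}\n")
```

    Backing up the manifest alongside the data means the category links can be recreated from it if they get lost.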

    --
    Ask me about repetitive DNA
  80. XML by Anonymous Coward · · Score: 0

    I recently reengineered our data acquisition / storage system, building basically everything around XML.

    The big advantage of XML is the huge ecosystem of tools that exist around it.

    Spend some time coming up with an (extensible) schema for storing your data that guarantees the data is not ambiguous. At any point you can then validate your data against the schema. This lets you find problems right away that would otherwise go unnoticed until it comes time to analyze the data.

    With XML you can make use of XQuery, XSLT etc to easily transform your data into another format, making it easier to collaborate with other people who use different formats.

    I use eXist-db for storing and querying the data. This is one of these new NoSQL, document orientated databases. The big advantage of this compared to relational databases is that you have to learn just one data structure (your schema); you don't have to shoe-horn the data from one format into database tables. You can then do very rich queries on the data using XQuery. Depending on your application and amount of data this might be overkill; you can always use the standard XML tools that come with any modern language on individual documents.

    The only disadvantage I guess is if you have terabytes of data. You can store 'heavy' parts of the data in binary converted to base 64 inside the XML files and still maintain the benefits of the structure of the XML, though there is a bit of a penalty in size.
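
    To put a number on that size penalty: base64 turns every 3 bytes into 4 characters, so the encoded text is about a third larger than the raw binary. A small Python sketch with the standard library follows; the element and attribute names are invented for illustration and do not come from any particular schema.

```python
import base64
import xml.etree.ElementTree as ET


def embed(raw, name):
    """Wrap binary data in an XML element as base64 text."""
    element = ET.Element("measurement", {"name": name, "encoding": "base64"})
    element.text = base64.b64encode(raw).decode("ascii")
    return element


def extract(element):
    """Recover the original bytes from an element written by embed()."""
    return base64.b64decode(element.text)


raw = bytes(range(256)) * 3            # 768 bytes of stand-in detector data
element = embed(raw, "detector_frame")
xml_text = ET.tostring(element, encoding="unicode")
roundtrip = extract(ET.fromstring(xml_text))
encoded_length = len(element.text)     # 4 characters for every 3 input bytes
```

    Here 768 bytes become 1024 characters of text, and the round trip gives back exactly the original bytes.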

  81. It really depends, be more specific by floydman · · Score: 2, Interesting

    I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
    Your question is really not clear. In both these fields that I work in, the requirements vary vastly, and they also vary among the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB/file; a single simulation run can give a geologist a 1 TB file.
    Others have a few hundred MB of data. Each is handled differently.
    The data itself can be parsed and stored in a DB for analysis in some cases, and in others that is very impractical and will slow down your work.
    Each scientist has a different way of doing things.

    So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, what kind of data are you interested in? You should also seriously consider an archiving solution, because I guarantee you will run out of space.

    --
    The lunatic is in my head
    1. Re:It really depends, be more specific by Anonymous Coward · · Score: 1, Interesting

      As a former 'commercial' programmer and now a scientific programmer, I have gone through an adjustment to an environment where I now use flat files (mostly binary, sometimes text) and HDF and NetCDF files. The datasets are non-homogeneous and very dynamic in nature: you (and not just you, but the scientists involved) may need to blow away a dataset, or just a specific portion of it, and then regenerate it again. You may produce near-duplicate datasets with a slightly different algorithm and need to keep every version. It can be a challenge at times and is very, very different from a CRUD scenario.

      I have no magic bullet but we designate a key server with a very large RAID subsystem as our data repository. We split every thing into directory structures and symlink where appropriate to make navigation easier for novices exploring a new dataset. We nominate individuals as the 'structure gatekeeper' and mandate any changes in that dataset structure to be OK'd by that scientist. For current experiments which are I/O bound, we symlink separately mounted SSD's and high-perf HDD's to give us the best possible sequential read and write performance. For duplication onto scientist's local drives, we have sync scripts that will take portions of the complete data repository and copy them to/from the local drives.

      Of the NoSQL's, I have used/played with couchdb and although it is a great db, it's not appropriate for us. There have been changes to the internal structures over the last year and it would be a nightmare to have to migrate our mass of data from an old format to a new. Flat files are easily understood and manipulated by every operating system, easy to write code to read/write from and if you want to formalize a number of datasets as 'gold quality', HDF5 is the accepted standard.

  82. Hybrid approach by Anonymous Coward · · Score: 0

    I have worked for scientific instrument companies for many years and have come across this problem many times.
    The optimal solution we have found is a hybrid approach.
    We keep all the data in files, typically fields of study or analysis programs have a file format they prefer.
    Then to manage the data we use a database, usually SQLite for its ease of use, to keep track of the files.
    The database contains all the metadata needed to find a particular data set and a link to the original data, a URL or file path.
    This allows you to find things fast using the database, but once you have, you get the original data file using the link.
    You need to write a program that walks all your files and builds the database. At first this can be simple, and you
    can evolve it as needed, rebuilding the database each time.
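
    A first, deliberately simple version of such a walker might look like the Python sketch below. The table layout is an assumption; real metadata (sample, pH, instrument and so on) would be added as extra columns as the program evolves.

```python
import os
import sqlite3


def build_index(data_root, db_path):
    """Walk data_root and rebuild a table with one row per data file."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS dataset")  # rebuild from scratch each run
    conn.execute(
        "CREATE TABLE dataset ("
        " path TEXT PRIMARY KEY,"   # link back to the original file
        " extension TEXT,"
        " size_bytes INTEGER,"
        " modified REAL)"           # modification time, seconds since the epoch
    )
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):  # skip dangling symlinks and the like
                continue
            info = os.stat(full)
            conn.execute(
                "INSERT INTO dataset VALUES (?, ?, ?, ?)",
                (os.path.relpath(full, data_root),
                 os.path.splitext(name)[1].lower(),
                 info.st_size,
                 info.st_mtime),
            )
    conn.commit()
    return conn
```

    Keep the database file outside data_root, otherwise the walker will index its own database. Because the table is rebuilt on every run, moved or renamed files never leave stale entries behind.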

  83. Re:Use databases! I am. by DCFusor · · Score: 1
    Recently our fusion efforts have demanded similar things so we can data mine and look at fleeting events, sweep a multiparameter space and find "sweet spots" and so on. We are doing just what parent suggests -- putting into a MySQL database, using perl. It's going along swimmingly.

    It works well, and we designed the database schema for extensibility, normalized and all that. (and the thing is growing and adapting well to changes, but that DB design is all important to make that not so hard)

    Looks like the best plan for stuffing a lot of data away and finding it later on. www.coultersmithing.com will show you a bit of what we are up to here. Or our forum at Science/Engineering/Tech forums

    --
    Why guess when you can know? Measure!
  84. Try NI DIAdem as a data management tool. by Anonymous Coward · · Score: 0

    Try the National Instruments product DIAdem, which is designed to store, manage and analyze large datasets. It will probably work better than any roll your own solution, with less pain in setup and maintenance. Go to http://www.ni.com/diadem for information, and links to a free 30 day demo. It is not cheap, so definitely try it before you buy! As with any other approach to file organization and management, it isn't a magic bullet if you start with a mess - you'll have some work to do, transferring everything so that the new arrangement is organized, and if you are continually haphazard about organization, this tool will be of limited use.

    Not a shill for the company - I am a wage slave (staff) at a large public university, and have been relatively pleased with both the products and the support from the company.

  85. Master Naming Scheme by EagleFalconn · · Score: 1

    I'm also an experimental physical scientist. My experience tells me that I have absolutely no idea what kind of meta-data I'll want to keep track of in the future, and I only know what I want to keep track of now, which is probably a small subset of what I'll want to keep track of in the future. Every sample that I make is assigned a unique serial number (Experiment N Sample M Piece Q etc). All the master data is in my lab notebook which I keep anal retentively. Any metadata that I know now that I want to keep track of is contained in there. Any analysis I do on any sample I make is also filed under this serial number. Now I just need to convince my boss to let me switch to an electronic notebook (like Microsoft OneNote) so that I can assign each sample or each experiment its own tab so I don't have to jump pages back and forth in my paper notebook.

  86. You need to differentiate between data and meta-data by strangeattraction · · Score: 1

    One way of looking at the problem is by organizing information about the data versus organizing the data itself. One way to do this might be to have a text file describing facts about your data in the same directory as the data. Use something like solr and lucene to index these text files. You can make search queries without the need for a uniform schema that describes the meta-data about your experimental data. It would be like "Googling" your data. You could organize it in such a way that you can search for specific info like date etc., or look up by search terms that might be found in the text.

  87. I would use flat-files in a meaningful dir tree by Anonymous Coward · · Score: 0

    I worked on simulations in Materials Science during undergrads and grad school. My accumulated files consisted of raw data, processed data, scripts for processing, intermediate documents (presentations, utility programs etc). In general, IMHO -
    1. Flat files with no metadata work best for personal use and sharing with collaborators.
    2. RDBMS, tagging etc might be an overkill.
    3. Arranging data in the way you obtained it is the best way. Relationships between the data etc. go into your publication.

    Hence I used -
    $ROOT/project/experiment-strategy/date/file-type/meaningful-file-name.extension

    Here,
    1. project : Name of the project (e.g. Nickel_Plasticity)
    2. experiment-strategy : The sort of work you did to get the data (e.g. Tensile_Test)
    3. date : MMDDYY (e.g. 120210)
    4. file-type : raw | plot | script | misc
    5. meaningful-file-name.extension : Some file name which will immediately remind you of what's inside (e.g. two_cycle-100mpa.full)

    Once the project is over, I gzip the data and send to backup tape.

  88. Try dBASE by Anonymous Coward · · Score: 0

    I suggest you try dBASE and its clones/dialects. Some consider it old-fashioned and bash it for silly reasons[1], but it's a nice compromise between flat-files and RDBMS in my opinion. Its pre-mouse design is great for ad-hoc textual scripting and makes it very keyboard-friendly, a lost art.

    It was even invented for scientific usage at NASA's JPL lab in the late 70's (although was influenced by other products). Unfortunately, there's no current (finished) interpretative open-source versions out there, only compiled. For certain data-chomping tasks I really miss it. (My shop won't approve it.)

    [1] It may be one of those tools that's highly subjective: some love it, some hate it with few in-between, kind of like Neil Diamond tunes.

  89. Separate Data and Meta-Data by pankajmay · · Score: 1

    I face the same issues too. Immensely large datasets, that change and no proper way of tracking them through the file system. Trust me, when I say this -- it will be worth your while to spend some time thinking about your requirements and do some quick coding to get an infrastructure in place.

    It was said here before (I guess just a couple of posts above), but this is right on mark -- You have got to separate out your data and meta-data. Text files are immensely convenient and to be honest, that is also where I prefer to store my actual data. But statistics about my data I store in a quick relational database. My meta-data db consists of fixed columns that have all the statistics for my data sets that I usually need. For example: Date/Time, Number of Columns, Number of Rows, Row Description, Column Description, Algorithm Description, Parameters, Special Notes, File Name of Data, etc.
    Row Description, Column Description, and Algo Description point to separate helper tables.

    I also agree with many people here that relational databases can be an overkill for a manageable database, but if you generate a lot of datasets, the break-even point is reached almost immediately. Besides, text files even though extremely convenient for a quick grab and feed into a software are simply horrid when it comes to trending across many datasets.

    Now depending upon your skill set, setting up this database could be a day's job or it can take you weeks. If you are computationally inclined though -- go for the relational database to store the meta-data, keep your main data in text files.

    If you are not, there is excellent software out there that gives you a nice interface to a relational database.

    Nevertheless, my main point is, whatever route you choose, it will help you to separate out your actual data, and stuff about that data (meta-data).

  90. Re:Use database for metadata by clive_p · · Score: 1
    The obvious answer is to use a database but only for the metadata, that is the items that describe the main dataset, such as filename, date of collection, dates of processing etc, shape, size, important parameters. Leave the data in the files so you can use your usual software packages, but use the database to organise the data collection. On the whole that gets you the best of both worlds.

    But one thing you will find is that you have to use SQL to get the best out of any relational database, and this involves thinking in a new way - it's basically set-oriented - rather than sequentially row by row. This takes a bit of effort, but can be rewarding, as you will discover new ways of achieving some of the things you want to do.
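
    To make the row-by-row versus set-oriented distinction concrete, here is a small Python/sqlite3 sketch. The table, its columns and the three rows are invented for illustration. Both halves answer the same question: how many data sets per sample were taken at pH 7?

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (filename TEXT, sample TEXT, ph REAL)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [("run001.dat", "A", 7.0), ("run002.dat", "A", 5.5), ("run003.dat", "B", 7.0)],
)

# Row by row: fetch everything, then filter and count in your own loop.
counts_loop = {}
for filename, sample, ph in conn.execute("SELECT filename, sample, ph FROM dataset"):
    if ph == 7.0:
        counts_loop[sample] = counts_loop.get(sample, 0) + 1

# Set-oriented: describe the set you want and let the database do the work.
counts_sql = dict(conn.execute(
    "SELECT sample, COUNT(*) FROM dataset WHERE ph = 7.0 GROUP BY sample"
))
```

    Both give the same answer; the second states what is wanted rather than how to compute it, which is the shift in thinking described above.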

  91. Enterprise Content Management by GravityStar · · Score: 1

    Depending on your exact requirements, this is perhaps a fit for an Enterprise Content Management system. If these datasets are heterogeneous, I'd think looking into some kind of flexible meta-data system would be the way to go. This can be anything from a custom application, a bought solution or an open source ECM system.

    Though I caution you, some of these systems can be convoluted to set up, maintain and learn. Don't let that scare you away from the concept of it though.

  92. Re:Directory names : experiment number + notebook by Anonymous Coward · · Score: 0

    I do a similar job, and I use that combined with the ancient technology of a paper notebook. Just as you do the experiment you note what it is in each experiment/series of experiments.

  93. Not all IT is the same -- you want 'Informatics' by oneiros27 · · Score: 3, Interesting

    The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT might be people who are really good at diagnosing hardware, but they might suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).

    There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.

    Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.

    Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:

    There's also Bioinformatics, Health/medical informatics, chemical informatics, etc. plug in your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group, or person you can write to to try to get more info and advice.

    Recently, NSF just funded a few more groups to try to build out systems and communities : DataOne and the Data Conservancy, and I believe there's some more money still to be awarded.

    --
    Build it, and they will come^Hplain.
  94. Ask the correct crowd -- Earth Science Informatics by oneiros27 · · Score: 1

    You said you're dealing with physical science. From what you describe, I'd guess that you're dealing with earth science, from what we call "small science". (lots of smaller investigations that can be done with a small team, rather than the multi-million dollar satellite or sensor grid projects).

    I'd suggest talking to one of the following groups:

    There's a hell of a lot more groups out there, but those two larger groups would be able to steer you towards more specialized groups that deal with a specific scientific discipline.

    --
    Build it, and they will come^Hplain.
  95. Re:Use database for metadata by drfireman · · Score: 1

    I'm not sure this answers my question. It sounds like you're basically saying all the people who are telling us to put our data in the database are wrong. Which is possible, but it doesn't explain what they thought we were going to do with it in the database in the first place. But then you're suggesting that we do a tremendous amount of extra work, including some very touchy coding, to get the metadata into a database. Unless I'm missing something, keeping the metadata up to date would then require rewriting all of the software we use (the software modifies, creates, and deletes files). Even if this were possible, it would be extremely fragile. And I'm not sure what the benefit would be. I have no idea what the distinction is between "row by row" and "set-oriented," because we don't use a database at present, and I can't even imagine what "row by row" would mean for our data. This is rich scientific data, not transactions in some financial system. And the supposed benefit sounds very meager. We don't need new ways of achieving things, unless they're better than our current things.

    Still baffled.

  96. store the data how you think by rbubb2 · · Score: 1

    I use a program called Personal Brain, available at www.thebrain.com. I prefer to pay for the upper-end version due to its better functionality, but the company has a free version as well. They also sell an enterprise version that is usable for multiple users. I don't have experience with this version, but I presume the functions are similar.

  97. post notes, ofc by aradnik · · Score: 1

    they're even good for storing video!

  98. Crawl the documents in Nutch, build a Lucene index by fak3r · · Score: 1

    You can use Nutch (http://nutch.apache.org/) to crawl your documents, then you'll have a Lucene index. Nutch also gives you a basic search page as a frontend as you go through your documents. For grouping of search results Carrot (http://project.carrot2.org/) has some very interesting algorithms for research work, and Nutch has a native plugin for Carrot that has shipped since 1.0.

  99. I use the "spike" method myself. by crovira · · Score: 1

    It doesn't work so well for biology subjects (they all become autopsy experiments, and after a week or so the smell from my desk reminds me of a "Body Farm"), but I LIKE sticking things on a long vertical metal rod. (I KID, I KID...)

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
  100. as a practicing scientist by cinnamon+colbert · · Score: 1

    with poor spelling, the idea of RDBs or other stuff just sounds like a nonstarter. (If you think "install Apache, 1-click," etc. works, then you have no idea what real scientists are like; perhaps this means that they don't have the training they should have, but I have used these one-click installers, and they are way, way, waaaay too complex for me, and, I am sure, for most scientists, for whom an Excel pivot table or VLOOKUP is the height of programming expertise.)

    The problem is that "scientific data" is often like the real world: messy, undefined stuff. It also typically includes all sorts of things: images from cameras ranging from consumer point-and-clicks to Peltier-cooled 24-bit CCDs, flat files uploaded from machines, complex proprietary files, handwritten notes, etc. etc.

    For instance, for a recent experiment I did, I have:
    • ".xad" files, which are a proprietary format from an Agilent instrument
    • ".nef" files from a Nikon consumer camera
    • several .txt files, which need different converters to put them into useful Excel files
    • several .pngs
    • some of the key .txt files, which were imported into .xlsx, then imported into a specialized graphics program (Igor) which has its own proprietary format...
    • my written notes on how to do the experiment (Word doc with embedded Excel minitables, which show calculations for things like how many grams of salt per liter of solution..)
    • proprietary files from a NanoDrop (don't ask), which can be opened with Excel to get sort of usable data
    • hand-entered data from a Thermo Electron conductivity meter (integers) and a cheap balance
    • lot numbers of key reagents
    • pointers to previous experiments where I describe how key reagents were prepared
    • pointers to key experiments with background info

    I don't think a single one of these files has the data organized in a really useful manner; all require extensive hand editing to get them into a form that is even understandable, let alone integrated. Not to mention the hideous problems of writing reports in Word with embedded pictures, and the loss of functionality in Excel 2007 (x error bars are hard to do, the xy graph format reverts to line format without telling you, etc.).

    What this guy really wants is sort of the holy grail of web searching: the ability to do natural language queries on his data set. Like: for growing fish, what is the effect of pH (acidity) and ionic strength (how much salt is in the water) on body length vs. time?

    In the meantime, what would help would be:
    1) if everyone complained to Gates and Ballmer, and got them to divert some minuscule fraction of their monopoly-level profits from Office, and put those profits to work making Office usable for technically minded people
    2) a way to keep track of files across changes in file name and directory name. This is a huge problem; what would help is a way to embed a serial number into a file, so that every time you use the file, the serial number goes with it. E.g., you have a .tif, you crop it in an image editor, import the crop into PowerPoint, add text, export the mess into Word, save the Word doc, re-edit it a year later, and then two years later, looking at the re-edited Word doc, you need to find the ORIGINAL .tif, and in the meantime you have moved to a new computer and have a different directory tree.....
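
    A partial take on the "serial number" wish, as a rough Python sketch: use a hash of the file's bytes as the serial number, so an untouched original can be found again after any rename or move. The function names (file_digest, build_index) are made up for illustration. The big caveat is that the hash changes as soon as the file is edited, so this finds the ORIGINAL .tif on a new directory tree only if you noted its hash somewhere (e.g. in your notes); it does not travel with a crop pasted into PowerPoint.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes (our stand-in 'serial number')."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large instrument files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_index(root):
    """Map digest -> list of paths for every regular file under root.

    Symlinks are skipped so each physical file is listed once.
    """
    index = {}
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and not p.is_symlink():
            index.setdefault(file_digest(p), []).append(p)
    return index
```

    Rebuilding the index after a reorganisation and looking up a recorded digest tells you where the original lives now, whatever it has been renamed to.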

  101. Perfbase by Anonymous Coward · · Score: 0

    You might want to take a look at perfbase (perfbase.tigris.org).

    It's an experiment management tool with a postgresql backend.

  102. Ruby on Rails by Aciel · · Score: 1

    I write a Rails server for each of my experiments. Rake tasks are GREAT for bioinformatics pipelines, and migrations and database backups make it incredibly easy to save old experiments.

    Plus, various Ruby gems (starling, workling) enable me to farm out long-running experiments to a variety of lab machines. I can use Rice to write C++ Ruby extensions that are compiled individually on different machines.

    And all of this is stored in a PostgreSQL database. (MySQL is slow for complex joins, which you sometimes have to write in bioinformatics.)

  103. Version control system like Git can be useful here by Anonymous Coward · · Score: 0

    I also like plain old text files. I have been using version control systems like git to keep track of my research data.

  104. Speak to RIMS people by Anonymous Coward · · Score: 0

    You are going to the wrong people. Go to your Records and Information Managers (RIMS) if you want advice on organising information and futureproofing it. This profession has been organising information since Sumerian times, c. 5,000-7,000 years ago, and the modern canon of information management really took off about 150 years ago. Asking the IT guys is like inviting a mechanic to help you design a car: they are great at fine tuning and understanding performance, but their skill set is not primarily that of a designer.
    Most IT professionals think in time scales of 6 months or 6 years, not in 60 or 600 years like RIMS do. I am not in any way denigrating IT professionals; it's just that the different disciplines have a different focus. I have worked at the front end as a Records and Information Manager and at the back end as a Digital Archivist, and I can tell you that in the IT world no one gets any money to fix 'yesterday's problem'; old 'stuff' just gets neglected. It is telling that the term 'archiving' in the IT sense means sticking something on off-line or near-line storage with minimal management, rather than the recordkeeping sense of being catalogued, arranged, described and monitored, having retention and disposal applied, and running obsolescence risk profiling and migration protocols. Hopefully all three perspectives will start working together to solve some of these issues. We can still access the science of ancient Greece, Isaac Newton, and Captain Cook's scientific measurements of weather patterns in the Pacific from the 1700s; will we still have access to today's data in 2310?

    Regards

    Stephen Clarke

  105. I propose Emacs, org-mode and CVS by Anonymous Coward · · Score: 0

    I agree with many people here: plain ASCII does a good job for measurement data (as long as it doesn't run to terabytes). Nothing beats good old CSV files.
    1. They are easily readable by humans on any (upcoming) platform and with nearly any kind of software
    2. You can easily process them
    3. You can easily store them
    4. They will still be valid at a time when MATLAB structures, binary blobs, LabVIEW blablabla or SQL-whatever are long gone or so heavily modified that "it can't read that old junk".

    Ask a senior colleague about the data shown in his presentation from 1990. Likely he'll tell you it's on a 5 1/4 inch disk, and that if you get him an old box with LOTUS whatever and MS-DOS as well as a floppy drive, he might be able to try to read it. Whereas the last two might still be available somewhere, it's rather unlikely that you will find the very special software he used 30 years ago to read the data, and even if you can read the data there might be no way to export it.

    However, I found that it is much harder to keep track of the surrounding parameters and data than of the measurement data itself. If I look at a file of a measurement from a few years ago, I can still generate a fancy plot. But hey...
    did I rinse before I changed the analyte media ???
    How long did I wait for the set-up to settle down ???
    Did I switch on the measurement devices an hour early to warm up ???
    What was the room temperature ??
    And was this measurement performed with this faulty BNC cable as I found out months later ???
    Did I use this or that reference electrode for this measurement ???
    Ohhhh which version of my home-made software did I use? Maybe it was still the (more) buggy one ???
    And what the heck was the reason for this peak at the measurement start ???

    No doubt all this could be written down (or has to be written down) in your laboratory journal. But it's hard to keep it up-to-date and maintain everything else in electronic form.

    I propose checking out org-mode (http://orgmode.org/), an Emacs mode with a very rich feature set for organising plain text files.
    This mode is perfect for creating an electronic version of a laboratory journal. You can refer, link, annotate, outline, mark, prioritise, timestamp, set reminders, create agendas, set deadlines, even use literate programming to create your results..... the feature set is simply exhaustive.
    Furthermore, I use Git to keep track of all changes I make to my home-made data evaluation software, the org-mode files and my publications. The log gives you a decent way to check what you did with your original data over time and which version of your software tools you used at that time. Thus, finding out why the graph now looks different from the one two months ago is much easier, since you can check how your software changed over that period.
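
    For those who don't use Emacs, the same idea can be roughed out in a few lines of Python: write a small plain-text "sidecar" next to every CSV that records the surrounding conditions and the software version. This is only a sketch; the names write_sidecar and git_commit, the ".meta.json" suffix and the JSON keys are my own assumptions, not part of org-mode or Git.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def git_commit(repo_dir="."):
    """Return the current Git commit hash of repo_dir, or None if it cannot be read."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git not installed, directory missing, or not a repository.
        return None
    return out.stdout.strip()


def write_sidecar(data_path, conditions, software_version=None):
    """Write <data file name>.meta.json beside the data file and return its path.

    `conditions` is any JSON-serialisable dict, e.g. room temperature,
    reference electrode, warm-up time.
    """
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    record = {
        "data_file": data_path.name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "software_version": software_version,
        "conditions": conditions,
    }
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar
```

    Calling write_sidecar("run01.csv", {"room_temp_C": 21.5}, software_version=git_commit()) at the end of a measurement answers at least the "which version of my software" and "what was the room temperature" questions a few years later.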

    Just my two cents

  106. scientific data: 2 tools that may help by ejf · · Score: 1

    I have pretty much the same problem. The file system became a mess, and databases didn't help much with entire directories of up to 1 gig. Plus, for scientific traceability, experimental data should be self-sufficient: external links (to other data, scripts, software, etc.) are toxic. So, I developed 2 open source tools to organize my experimental data.
    1) New experiments: Basic Experimenter. A wizard that stores together the datasets and all the files of the experiment (protocol, checklists, scripts..). http://sites.google.com/site/basiclabbook/basicexperimenter
    2) Old experiments: Basic Bookcase. Datasets are zipped and stored in a scientific repository as documents (BibTeX category 'dataset'). http://sites.google.com/site/basiclabbook/basicbookcase
    This did not solve the previous cross links, but it helped a lot.

  107. Re:Not all IT is the same -- you want 'Informatics by GPSguy · · Score: 1

    Science Informatics. You just told me what I do, when I'm not doing my own research... which seems to be happening less and less these days. A lot of the kids we train now can write something in Matlab or IDV, but couldn't perform a first normalization on a database if their lives depended on it. A lot of them have learned a little perl... or .Net, but know nothing of C or Fortran, and can't spell MPICH even when the models they use depend on it.

    And seriously, corporate and university IT staff are not suited for this purpose, just as I'm poorly suited to help you with your next Windows installation. My expertise lies elsewhere. Getting the senior management to understand these differences, however, is a problem.

    --
    Never ascribe to malice that which can adequately be explained by tenure.
  108. Software carpentry by Haberdasher · · Score: 1

    I run a lot of numerical simulations, and we run into the same issues, as a few people have noted above. Many of my colleagues still write code in Fortran 77 and manually give descriptive filenames to their data. It's a mess. There's a guy at the University of Toronto, Greg Wilson, whose work centers on getting computational scientists (largely people who run simulations) to become more sophisticated. He runs a one-week course teaching software engineering principles to computational scientists. It's available online for self-study at http://www.software-carpentry.org/. One section is a lesson on using databases (he demonstrates SQL) to handle numerical data, which I imagine looks a lot like experimental data. He's currently rewriting the course, but I have learned a lot from it.

  109. I solved this one in my business ... by gregor_jk · · Score: 1

    You need your "master data" (ie: data that describes your work and is considered to be of quality) centralized in something like a RDBMS (SQL Server, etc). Once you have it there you can write scripts to find orphans. Basically you will have a metadata table that will contain all available 'links' to your master data. You then try and find the other side of the "link". The question you are trying to determine is "is the meta data of any quality?".

    So, if you needed a table it would look something like this.

    Unique key of Master Data, Flag to determine if unique key exists (because this changes over time), Meta Data key, flag to determine if meta data exists(because this changes as well)

    You then run a script against this nightly to determine if your "links" are broken. I also like to have it visualized by having an HTML page display a nice green check for a good link and a big red x for a bad link.

    If you do this, you will really have a good idea of your data and it will allow you to QA it much better.
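
    A minimal sketch of that table and the nightly check, in Python with SQLite standing in for the RDBMS. The table name, column names and the check_links function are illustrative assumptions, and "exists" here is simply a file-path check by default; swap in whatever test fits your master data.

```python
import os
import sqlite3


def create_link_table(conn):
    """Create the link table: two keys plus an 'exists' flag for each side."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS link (
               master_key    TEXT NOT NULL,
               master_exists INTEGER NOT NULL DEFAULT 0,
               meta_key      TEXT NOT NULL,
               meta_exists   INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (master_key, meta_key)
           )"""
    )


def check_links(conn, exists=os.path.exists):
    """Refresh both flags for every row and return the broken (master, meta) pairs."""
    rows = conn.execute("SELECT master_key, meta_key FROM link").fetchall()
    for master_key, meta_key in rows:
        conn.execute(
            "UPDATE link SET master_exists = ?, meta_exists = ? "
            "WHERE master_key = ? AND meta_key = ?",
            (int(exists(master_key)), int(exists(meta_key)), master_key, meta_key),
        )
    conn.commit()
    # A link is broken when either side is missing.
    return conn.execute(
        "SELECT master_key, meta_key FROM link "
        "WHERE master_exists = 0 OR meta_exists = 0 "
        "ORDER BY master_key, meta_key"
    ).fetchall()
```

    Run check_links from cron each night and feed the returned list to whatever produces the green-check / red-x page.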

  110. Re:Use databases! and Bots by Anonymous Coward · · Score: 0

    It will most certainly be a pain in the a$$, but you could program a crawler to find dead links, and you could also use it to batch change metadata. You also might want to keep a crawler-maintained index of what you have and its creation date.
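
    Such a crawler does not have to be painful. Here is a rough Python sketch that walks a directory tree, reports dangling symbolic links, and re-points a link whose target starts with an old directory prefix. Both function names are made up for illustration, and the prefix match is done on the target exactly as stored in the link, so relative targets need a relative prefix.

```python
import os


def find_dangling_symlinks(root):
    """Yield (link_path, stored_target) for each symlink under root whose target is missing."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks to directories show up in dirnames, all other links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows the link, so it is False for a dangling one.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)


def retarget(link_path, old_prefix, new_prefix):
    """Re-point one symlink from old_prefix... to new_prefix...; return True if changed."""
    target = os.readlink(link_path)
    if not target.startswith(old_prefix):
        return False
    new_target = new_prefix + target[len(old_prefix):]
    os.remove(link_path)  # removes the link itself, never the target
    os.symlink(new_target, link_path)
    return True
```

    Looping retarget over the output of find_dangling_symlinks redirects thousands of links in one pass; try it on a copy of the tree first.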

  111. SVN + MATLAB + MySQL by jackchance · · Score: 1

    I am a behavioral neuroscientist and MySQL has been invaluable for my data storage and analysis.

    MATLAB is my main analytical tool, so I generally keep my MATLAB code in a Subversion repository and use MyM to communicate back and forth between MATLAB and MySQL.

    I've been using MySQL since 1999. If I were getting into the game now, I might choose another relational DB.

    --
    1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765
  112. It Depends by Fuzzy+Eric · · Score: 1
    I have been involved in experimental science ranging in scale from preliminary survey of the support variable space to rigorously designed (as in design of experiment = "DoX") production support runs. The short answer to your question is: It depends. Mostly it depends on two things:
    • How similar is the space of experiments you are performing?
    • What sorts of questions do you intend to answer from your data?

    As an example of the former: The patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5 year old blue whales" are strongly disjoint and there is essentially no description reduction scheme that can handle such a broad range of inputs. Equivalently, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to have a unified data reduction description. I've led programs to address this range of problems in several ways:

    • Don't bother with links. Like any other "two representations of the data" problems, it WILL go out of sync as soon as something is reorganized.
    • Data goes in the leaf folders. Subsequent processing, folding, spindling, mutilating, and hand-waving with statistics occurs in parent folders. This typically includes interim reports and similar information. This leads to a strong visual model of data being hoisted from lower directories to higher directories by means of the data analysis tools that are *in* those directories. (This pins the version of the analysis tool that was used, so that the analysis can be replicated together with whatever oddities in processing were in that version of the tools.)
    • For back-of-the-envelope experiments (preliminary support variable space surveys), we tend to store the data in single directories named for the category of experiment. Distinct instrumental data streams are stored in folders by instrument name (yeah, yeah, I know, that sounds transverse, but it solves any number of "process all the spectrometer data the same way" sorts of problems, because all the spectrometer data is together in one place instead of trying to solve a potentially intractable programmatic data format recognition problem), and files from one "run" are named identically. For small support spaces, the variable values are logged right in the file names. For medium experiments (typically too many variables to make workable filenames), a metadata file is created. This file either has a rigorous layout of support variable information separated by known section boundaries, or uses a form of pidgin markup that's not really too complicated (it only brackets unformatted strings) but makes automatic parsing of the metadata file feasible. Pidgin markup is required for, for example, optical filter stacks, where a not-previously-specified number of filters may be electrical-taped in a stack.
    • For medium sized experiments, with a specific ending condition (makes more sense in the context of items a couple of bullets down), the pidgin metadata file can be used, but it tends to transmogrify into a *real* (strict) markup language. I'm not pushing XML, but there are plenty of tools out there for automatically parsing XML. However, most of them are broken in that they require loading the *entire* XML file before they start parsing. Oddly, for large experiments (next item), the metadata can be oppressively large.
    • For large experiments, the strict markup metadata file tends to transmogrify into an (actual) relational database. It really doesn't matter which one you use, they are all equally inaccessible to your data analysis tools. You will find yourself writing an export or report routine that dumps the database into something like the strict markup metadata file just so your other tools can access it. This is especially true for large DoX runs, with data gathering occurring in parallel in multiple labs where management wants to see something
  113. Mind Map by webhat · · Score: 1

    Sort them as they should be sorted, in a mindmap.

    --
    'I am become Shiva, destroyer of worlds'
  114. tried this? by Anonymous Coward · · Score: 0

    Hey, a hint for organizing data in natural sciences etc.:

    I keep all my biology data in flat files (xls mostly, csv, pcg, etc...) in folders named by Year-Protocol, but they are all submitted to Mendeley (mendeley.com), which is like EndNote but freeware. It allows you to search the text *inside* files and also to create separate libraries inside your global library. It doesn't move files, it only links to them, so you can actually have the same file appearing in more than one library (e.g. one Methods file appearing in Experiment 1 and Experiment 2). It'll tell you the file's location. It's been working for me... Mendeley is my Explorer now...

  115. Consider NuGenesis SDMS by Anonymous Coward · · Score: 0

    Depending on your budget / affiliation, you might want to look at a scientific data management system (SDMS); the original, and still, (IMHO) is NuGenesis (http://www.waters.com/waters/nav.htm?cid=513068&locale=en_GB), now owned by Waters Inc. It will automatically capture and catalogue both raw data files and electronic printouts, store any descriptive (meta) information in an Oracle DB, and securely store the files for you. It can even be put in front of fancy content-addressable storage (http://en.wikipedia.org/wiki/Content-addressable_storage) like Centera.