Building and Maintaining Large, Collaborative Databases?

← Back to Stories (view on slashdot.org)

Building and Maintaining Large, Collaborative Databases?

Posted by Cliff on Monday May 12, 2003 @06:30AM from the software-for-large-scale-jobs dept.

hherb asks: "We are in the process of building and maintaining a free pharmaceutical reference database, in order to liberate medical decision support systems from vendor driven databases. For that purpose, we need some way of allowing multiple authors to contribute to a large number of data records - most of them small...too small for CVS. We need version tracking as well as authentication of authors. We need to tag every bit of information enterrred with information about source reference, author, peer review result, and so forth. I had a look at existing version tracking software. like CVS and Subversion, and I did not have the impression that any of them would suit our needs. Does anyone have ideas for *free* software solutions that we can use?"

32 comments

Min score:

Reason:

Sort:

Free as in Beer ? by MerlynEmrys67 · 2003-05-12 06:40 · Score: 1

I am assuming you are looking for a "Free as in Beer" solution. If one of the standard VCS solutions will not work for you, you are stuck writting your own ontop of MySQL or something like that to handle the small records you are looking for.
I guess I don't understand why CVS won't work for you, I assume you create a record for each drug/whatever and then track them... So you end up with a flat directory structure that holds all of 1,000,000,000 records. If CVS won't work for you maybe you should look into ClearCase or one of the other supported tools that will, you won't own the source to your VCS system, but that isn't what you want, you want to own the database itself

--
I have mod points and I am not afraid to use them
1. Re:Free as in Beer ? by hherb · 2003-05-13 10:27 · Score: 1
  
  What I don't seem to have made clear enough that this is about highly structured data that needs to be modified by hundreds of health professionals with little to no knowledge from a multitude of different platforms.
  
  Hence, we need a web interface and cannot use a more suitable tool like a XML editor which would enforce compliance of entered data with a DTD.
  
  For medicolegal and other reasons, we need to be able to backtrack which junk of information has been modified by whom and when.
  
  Currently, as you can see from our web site, we are using a PHP driven web frontend for a highly normalized database running on PostgreSQL. Data entry and display works fine, but the audit trailing still doesn't - gets rather complex.
2. Re:Free as in Beer ? by Jellybob · 2003-05-14 00:56 · Score: 1
  
  If you're already using PHP from a web based interface, why can't you switch to using XML as you'd like to?
wikipedia is a good example by vaskin · 2003-05-12 06:41 · Score: 1

Wikipedia is a very nice example of a collaborative, moderated, reference site.

--
--- cut here ---
1. Re:wikipedia is a good example by sohp · 2003-05-12 08:28 · Score: 1
  
  Second the recommendation to use a Wiki. There are Wiki engines that support user authentication and history. Various backend storage mechanisms are also supported, from flat text files to full-blown relational databases. Take a look at the Wiki Clones Directory
2. Re:wikipedia is a good example by littlerubberfeet · 2003-05-12 09:40 · Score: 1
  
  Beyond Wiki, I would also suggest everything2.com. The engine for that site is at everydevel.com. everything2 is a project of Nate and Hemos, among others, who developed most of this site. The engine seems flexable in my use of it, and provides a mechanism through which similar pages are linked based on user search patterns. Adding a little for source authentication should be minor. Best of luck.
  
  --
  Sig (appended to the end of comments you post, 120 chars)
Sounds like simple data entry by michaelggreer · 2003-05-12 06:45 · Score: 1

Looks like you are looking for some kind of front end to the database. Fairly simple, but you will probably need to write it yourself if noon-techies will be doing the entry. Frankly, I don't see why you need to poll the Slashdot community on how to write a simple front-end to a database.
1. Re:Sounds like simple data entry by hherb · 2003-05-13 09:29 · Score: 1
  
  This is about collaborating on highly structured data. We can't require users to run specific software, it must run via web interface. In fact, currently we are using PostgreSQL as backend and a simple web interface as frontend.
  
  The problem is the version tracking - assigning authorship to the differences between records - since no matter how normalized a record (table row) is, it might always have a number of authors. OTOH, this tracking/auditing must not obfuscate the record text (inline tagging not feasible)
  
  Alas, not as simple as it looks.
2. Re:Sounds like simple data entry by HarleyPig · 2003-05-13 14:55 · Score: 1
  The problem is the version tracking - assigning authorship to the differences between records - since no matter how normalized a record (table row) is, it might always have a number of authors. OTOH, this tracking/auditing must not obfuscate the record text (inline tagging not feasible)
  
  Alas, not as simple as it looks.
  
  Unless I'm missing something, it should be simple. I know there are other solutions out there but one solution I would suggest is like so:
  
  User enters/modifies information
  
  User submits information to script via form
  
  Script validates data
  
  Script submits data to cvs
  
  Script enters/updates data in database
  
  Script returns canned response
  
  Use cvs for your audit and the database for your presentation. You're duplicating data, but think of it as a cheap backup.
  
  Am I misunderstanding something?
  --
  Liberation is not deliverance.
DIY by pmz · 2003-05-12 06:50 · Score: 1

If the records are so small, just create your own database schema that allows keeping a whole copy of each version of each record along with appropriate metadata tags (version, author, etc.).

This isn't very difficult, and anyone with data modeling experience can pull it off. As far as free software goes, just use PostgreSQL or MySQL and a PERL or PHP or JSP front end. Problem solved.

--
Healthcare article at Kuro5hin
1. Re:DIY by hherb · 2003-05-13 10:05 · Score: 1
  
  We ARE using Postgresql and PHP currently. But the problem is NOT solved.
  
  Just copying the whole record for autit purposes does not help, since we need to assign authorship to the changes, not to the whole record.
2. Re:DIY by pmz · 2003-05-14 04:29 · Score: 1
  
  Just copying the whole record for autit purposes does not help, since we need to assign authorship to the changes, not to the whole record.
  
  You could store a whole new copy of the record along with the set of people responsible for that version. Each author entry can have an associated revision comment. That would probably be enough of a "paper trail" that the group of responsible people is small enough for an investigation to flush out the remaining details.
  
  It seems that if the record is small, then the metadata will dominate, regardless of the strategy employed. However, finding a way to keep everything in the database will pay off, because SQL is so powerful relative to CVS-style diffs and greps. Relational databases also have a very high performance potential relative to flat files (they tend to scale very well).
  
  --
  Healthcare article at Kuro5hin
PostGreSQL by stanwirth · 2003-05-12 07:04 · Score: 2, Informative

You might want to try an open-source relational database engine like PostGreSQL

This means, of course, you'll have to create your own schemata for logging version control and records added, but versioning additions to the database -- and user/connect information -- can be tracked with a combination of triggers and timestamps in a separate table. These are fairly standard techniques in generating an audit trail for tracking relational database changes and enforcing data integrity constraints. While the syntax varies from database engine to database engine--PostGreSQL is a good (free) place to start, and if the spirit moves you (or if the DB becomes too large) the syntax differences do not preclude your moving to, say, a development edition of IBM's DB2 UDB or even (aack!) Oracle. All three run rather nicely under SuSE Linux, and are said to run quite nicely under RedHat, as well.

It probably wouldn't harm you at all to develop this database with a relational database, and test it under different engines-- in fact, it would likely make you highly employable in the very near future.
1. Re:PostGreSQL by hherb · 2003-05-13 10:48 · Score: 1
  
  We are experienced in writing standard database applications with standard audit trails (for example our health record system at http://www.gnumed.net). In fact, as you can see from the drugref.org website, we are using a PostgreSQL based drug database at present.
  
  However, the granularity of versioning and audit trailing typically used in relational daabases does not suit our needs, see my coments to other replies.
  
  As far as "being employable" goes: I don't. I do employ people, including doctors, since I am in the fortunate position of owning my own clinic (http://www.dorrigomedical.com) and being completely independent. :-) If somebody comes up with a really smart solution, I might employ him as developer.
2. Re:PostGreSQL by stanwirth · 2003-05-13 17:27 · Score: 1
  
  ...or her.
Zope by Scaba · 2003-05-12 07:06 · Score: 1

It sounds like you're really talking about content management more than version control. Maybe Zope and Plone will do it for you? Out of the box it gives you versioning, authentication and a decent database. The database isn't the most scalable, but you can extend it to use any number of database and database-like backends, like MySQL, PostgreSQL, Berkeley, and various pay-for dbs.
A FOIA project by neitzsche · 2003-05-12 07:24 · Score: 1

http://hardhats.org/

The VA has the VistA project, which covers pretty much every aspect of medicine. The application is based on 30 year old technology known as M (aka MUMPS, which evolved into "Caché".) The source code is not GPL, it is available because of the Freedom Of Information Act (FOIA). And IIRC I believe RMS said a long time ago that the VA FileMan code was GPL-compatible.

The normal audit trail for record addition/modidifcation sounds like it would cover your needs. But more likely, they have something covering ICD9 in their medical dictionary subsystem.

Because M is a hierarchal dynamicly extensible b-tree database, it is VERY fast. Cache tends to be very scalable, in my experience.

See
http://hardhats.org/
http://intersys.com/

I know this doesn't address your specific question exactly, but perhaps a fresh look at your approach might help.

--
"God is dead." - Frederik Nietzsche
1. Re:A FOIA project by Grotus · 2003-05-12 08:15 · Score: 1
  
  The VistA source is GPL compatible because it is Public Domain.
  
  There are parts of the VistA system which are not included in the FOIA release. The poster's primary interest (medication instructions) would appear to be one of those bits. The bits which aren't covered are primarily information copyrighted by other organizations (like the AMA).
  
  If you aren't already an experienced M programmer, I would recommend having a large bottle of your favorite painkiller handy for when you review the code, it is pretty hard to read.
  
  On the bright side, once you do understand M, you are pretty much guaranteed a job at a VA.
  
  --
  "From my cold, dead hands you damn, dirty apes!" - CH
2. Re:A FOIA project by gmhowell · 2003-05-12 15:30 · Score: 1
  
  Fortunately, the AMA is supposedly developing an open source EMR system. (Or is it AAFP? Yeah, it's AAFP) Whether this sort of data will also be open is anyone's guess at this point (also whether or not the eventual license will be cleared by OSI).
  
  --
  Jesus was all right but his disciples were thick and ordinary. -John Lennon
Content Management system, not Version Control by plsuh · 2003-05-12 07:33 · Score: 1

From your description and your web site, what you need is not "version control" or "source code management" or "configuration management" software (which is what cvs or subversion provide), but a "content management" system, which contains editorial workflow, publish/unpublish capability, probably the ability to generate static pages, be a web services provider, etc.

Try one of the following for size:
Tiki
phpWebSite
OpenDocMan

Your workflow requirements seem to be a lot more sophisticated compared to most sites, but one of these projects may serve as a starting point.

--Paul
1. Re:Content Management system, not Version Control by jj_johny · 2003-05-12 07:59 · Score: 1
  
  Yeah but I would look at going to a deeper level of CMS that has workflow built in.
  eZ Publish from ez.no, and a bunch of others listed at OSCOM
  But I would stay well away from some of the new stuff. There are plenty of CMS's out there without a project needing to use something that is new and is a duplicate of an existing product.
Use File Loading Processes by justanyone · 2003-05-12 07:37 · Score: 2, Interesting

I work for a large multinational bank in Chicago. We are aggregating data in a data warehouse and have both lots of sources and lots of data.

The way we cope with this problem is that each data source is given a code (it's usually just the filename). We have a Perl program parse these files (they're comma delimited ascii, tab delimited, DBase IV, etc., and some even human-readable reports), and load the database with the contents. Each record includes the source ID, for easy attribution / tracking.

We keep each file version for a while. Each file has a business date so if they want to clobber a previous version of that data, they get to do so.

This could keep your troubles to a miminum. Write a parser and have a file-upload site that lets people upload data. Define a group of people if you want. They should only be able to add/replace/delete their own data by the nature of the file. Each group can only create a certain filename or the group id is in the file.

This way, many can share database updates via a batch run where updates are tracked and possibly even approved before committing them.

--
Unitarian Church: Freethinkers Congregate!
CMS by Samus · 2003-05-12 07:54 · Score: 1

Sounds like you need a light weight content management system that at the very least has an audit trail associated with it. That isn't trivial but you don't need a rocket scientist to build it for you either. Just use one of the free relational dbs and in a week or so you could have a pretty functional system.

--
In Republican America phones tap you.
Few questions: by Muhammar · 2003-05-12 08:20 · Score: 1

Do you plan to compete with databases like Merck Index?
How are you going to select your contributors? How you are going check for the correctness of their data input?
How do you plan to solve the problem with synonymous keywords in the search? [Good old- fashion Chemical Abstract database published realy extensive indexes on general subject keywords - apart from indexes on the substances, chemical core structures etc. You will need to do this and it is a lot of awful work]

The main problem is you will need alot good contributors and made them motivated to spend their (highly paid) time to do the work, which is not much fun to begin with. Then you will need a lot - I mean a lot - of highly qualified editors to go through the entries. You do not want to be hit with a lawsuit for inacuracy of your data. Also, the propaganda of the big pharma companies is done in very professional way - if it is good enough to fool the physicians, it will be hard to avoid swallowing it when it is submited masked as unbiased info. It is hard to check summary info about drugs - you will need to go into primary data - lots of articles and the underlying complex statistics - to see the validity of claims based on clinical trials. Skillful statistics data "renormalisation" can make wonders.

And who are your intended custommers? If it is pharma companies, physicians and other health care pros, the cost savings from open project will not be too important. The main priority of such custommers will be the data reliabilty.

--
I doubt that we will ever figure out - and I suspect that even if we did figure out we couldn't do much about it
1. Re:Few questions: by hherb · 2003-05-13 10:14 · Score: 1
  
  Our scope goes beyond the Merck Index.
  Correctness of data is checked by peer review.
  We have literally hundreds of volunteering qualified physicians and pharmacologists to perform data entry and peer review - any hands make any work small.
  
  The medicolegal aspect of it is exactly the reason why we need a finer granularity of database auditing than most projects do.
  
  Our intended "customers" (our work is free in every aspect!) are health professionals wanting to use independent medical software and decision support - which is probably the majority of non-corporate doctors. We need the database desperately for our own health software project (http://www.gnumed.net)
  
  Just to make clear that I do understand the implications: I am a M.D. with dual qualification in informatics, writing health domain software (mostly pharmaceutical simulations) for more than 15 years. Our core contributors have similar qualifications.
Try TWiki by herderofcats · 2003-05-12 08:37 · Score: 1

You should take a close look at open source TWiki (www.twiki.org) -- from what you've said, I think it might be perfect for you. I personally have been using it with large groups of people for several years.

TWiki is a Wiki variant that adds RCS for every page that is editable, the ability to have access control lists for different pages, yet like Wiki has an "edit this page" button, and is very accessible to those you are not familiar with HTML. TWiki also has a large base of "plugins" that allow you to some very complex structured documents.

One of my favorite features as a manager is the "changes" button -- it gives me a list of all the documents in a particular section that have been changed recently, and with an additional click, I can see in 'change bars' what exactly those changes were. This allows me to get a snapshot of the activities of my group very quickly.

-- Herder of Cats
What other functionality will you need? by epatek · 2003-05-12 08:56 · Score: 2, Informative

It sounds to me like you need more of the Database (probably relational) than just a version management system. Since you mentioned of different authors working on the (potentially) same pieces of the information - you will need one with decent support for locking and transaction isolation levels. PostgreSQL is probably the right candidate from the OpenSource domain and from proprietary ones - I would say Oracle. You might need to look into some additional features such as how your database will be searched - will you just provide search by drug name (relatively simple, if everybody can provide the correct spelling for drug names) or you need something with richer functionality, such as text search in the drug description and/or reviews search of the contributing members. Again - Oracle will give you something called interMedia text search, which works OK. There are some projects in the Free Source world to provide similar functionality (inclduding in MySQL) - but I do not know the name(s) of them. Out of proprietary world for the special Text Search engines I can name Thunderstone and AltaVista - I worked with both, have their own pros and cons, but both are pretty expensive ;-). And... you still need to write the front end/application for functionality you need ;-).

Good Luck - sounds like an interesting project to do.
Bugzilla by The+Bungi · 2003-05-12 09:05 · Score: 2, Informative

Some people have recommended a Wiki... I'd go with Bugzilla. Maybe with a little bit of Perl hacking you can modify it so that it looks (and behaves) the way you want.
I have a modified version of Bugzilla running on Windows (W2KAS/IIS) that is being tested for something vaguely similar. It works great. The hard part was getting it to work on IIS with ActivePerl - if you're running Apache/Linux or BSD it won't be nearly as big a chore. I didn't even have to touch the MySQL schemas.
The Wiki Way by fm6 · 2003-05-12 11:06 · Score: 1

I'm not sure I agree. But I do share your enthusiasm for the Wiki concept. And the best way to understand that enthusiasm is to read (or rather use) the authoritative book on the concept.
(Warning! A lot of remainders vendors have cheap copies of this book, even though it's still in print. The CDs on these copies are screwed up, but are usable if you know to convert the Perl scripts from Macintosh format!)
Wiki? by oz_ko · 2003-05-12 17:20 · Score: 1

Have a look at http://www.wikipedia.com. It's a collaborative encyclopaedia that sounds very much like what you are after. It's really just a very big wiki site.
Also has version control features, peer review, tags etc etc.
HTH.
Oz
Use a relational DB as a delta-system by skywalker404 · 2003-05-13 13:06 · Score: 1

As was mentioned previously, the easiest solution sounds like a redesign of your existing database. But it sounds like you'd want [at least] three specific tables:

drugs:
drugID (unique, autoincrement, long/int)
drugName (text)

users:
userid (unique, autoincrement, long/int)
name (text)
securityLevel (if you want multiple levels, if not everyone's the same level)

data:
dID (same as above)
uID (same as above)
date (timestamp)
added (text)
deleted (text)
rating (float)

Then each addition is run through the UNIX command diff with the result of all previous entries for that drug. This means that each one will be a delta/change entry.

You might want to even do it more explicitely with each change having it's own entry (and then including the line number). That way if someone just changes a single word (ie: "there" -> "their") it won't change all the instances of "there".

Or you can make it a long list of XML style tags: use <line> and <text> as the tags. I just implemented something along that line for the wiki that I built for my site (it allows the storage of different types of media within one db by doing that).

Last, you keep saying things like "needs to be modified by ... health professionals with little to no knowledge from a multitude of different platforms...need a web interface" as a reason against using a relational DB. Where's the problem in this? I suppose you could build a non-web app to access the DBs, but it'd be a big pain. I don't think most of the people who have submitted have suggested anything other than end user access by a web-app.

If you don't follow, send me an email, and I can try to explain further.

--
SuperCalendar.com - Web calendar with RSS, AIM/SMS notific
Content management systems (CMS) not CVS... by aquarian · 2003-05-13 13:52 · Score: 2, Informative

What you want is a content management system, or CMS. These do exactly what you're talking about. There are a whole slew of them out there, free and not free. Furthermore, there are some general web services toolkits with good CMS modules. Find one that comes closest to meeting your needs, then modify it to get exactly what you want. Some that I've used are Zope, OpenACS, Redhat CCM, OpenCMS, MMBase, and Vignette.
Of all of these, I like OpenACS the best, mostly because of its developer community. There are a lot of great people involved, and there's a high signal to noise ratio on the developer forums. Even though OpenACS probably has the least of what you're looking for, it might be the easiest to develop. OpenACS runs on top of Postgres or Oracle, and is written in Tcl.
Redhat CCM is basically a Java rewrite of the original OpenACS. Its CMS modules are supposedly more mature. It runs on a Redhat version of Postgres, and I think Oracle too.
Zope is a whole lotta product, and probably has most of what you're looking for. However, I find it kind of murky, difficult to figure out. YMMV.
These three are the most promising in terms of developer community. This is a bigger undertaking than it might seem at the outset. You'll need all the help you can get, and getting involved with these communities will spare you from trying to reinvent the wheel.
Of course, I'd love to have you guys use and extend the OpenACS toolkit, and share your efforts with the community!