Building and Maintaining Large, Collaborative Databases?

← Back to Stories (view on slashdot.org)

Building and Maintaining Large, Collaborative Databases?

Posted by Cliff on Monday May 12, 2003 @06:30AM from the software-for-large-scale-jobs dept.

hherb asks: "We are in the process of building and maintaining a free pharmaceutical reference database, in order to liberate medical decision support systems from vendor driven databases. For that purpose, we need some way of allowing multiple authors to contribute to a large number of data records - most of them small...too small for CVS. We need version tracking as well as authentication of authors. We need to tag every bit of information enterrred with information about source reference, author, peer review result, and so forth. I had a look at existing version tracking software. like CVS and Subversion, and I did not have the impression that any of them would suit our needs. Does anyone have ideas for *free* software solutions that we can use?"

5 of 32 comments (clear)

Min score:

Reason:

Sort:

PostGreSQL by stanwirth · 2003-05-12 07:04 · Score: 2, Informative

You might want to try an open-source relational database engine like PostGreSQL

This means, of course, you'll have to create your own schemata for logging version control and records added, but versioning additions to the database -- and user/connect information -- can be tracked with a combination of triggers and timestamps in a separate table. These are fairly standard techniques in generating an audit trail for tracking relational database changes and enforcing data integrity constraints. While the syntax varies from database engine to database engine--PostGreSQL is a good (free) place to start, and if the spirit moves you (or if the DB becomes too large) the syntax differences do not preclude your moving to, say, a development edition of IBM's DB2 UDB or even (aack!) Oracle. All three run rather nicely under SuSE Linux, and are said to run quite nicely under RedHat, as well.

It probably wouldn't harm you at all to develop this database with a relational database, and test it under different engines-- in fact, it would likely make you highly employable in the very near future.
Use File Loading Processes by justanyone · 2003-05-12 07:37 · Score: 2, Interesting

I work for a large multinational bank in Chicago. We are aggregating data in a data warehouse and have both lots of sources and lots of data.

The way we cope with this problem is that each data source is given a code (it's usually just the filename). We have a Perl program parse these files (they're comma delimited ascii, tab delimited, DBase IV, etc., and some even human-readable reports), and load the database with the contents. Each record includes the source ID, for easy attribution / tracking.

We keep each file version for a while. Each file has a business date so if they want to clobber a previous version of that data, they get to do so.

This could keep your troubles to a miminum. Write a parser and have a file-upload site that lets people upload data. Define a group of people if you want. They should only be able to add/replace/delete their own data by the nature of the file. Each group can only create a certain filename or the group id is in the file.

This way, many can share database updates via a batch run where updates are tracked and possibly even approved before committing them.

--
Unitarian Church: Freethinkers Congregate!
What other functionality will you need? by epatek · 2003-05-12 08:56 · Score: 2, Informative

It sounds to me like you need more of the Database (probably relational) than just a version management system. Since you mentioned of different authors working on the (potentially) same pieces of the information - you will need one with decent support for locking and transaction isolation levels. PostgreSQL is probably the right candidate from the OpenSource domain and from proprietary ones - I would say Oracle. You might need to look into some additional features such as how your database will be searched - will you just provide search by drug name (relatively simple, if everybody can provide the correct spelling for drug names) or you need something with richer functionality, such as text search in the drug description and/or reviews search of the contributing members. Again - Oracle will give you something called interMedia text search, which works OK. There are some projects in the Free Source world to provide similar functionality (inclduding in MySQL) - but I do not know the name(s) of them. Out of proprietary world for the special Text Search engines I can name Thunderstone and AltaVista - I worked with both, have their own pros and cons, but both are pretty expensive ;-). And... you still need to write the front end/application for functionality you need ;-).

Good Luck - sounds like an interesting project to do.
Bugzilla by The+Bungi · 2003-05-12 09:05 · Score: 2, Informative

Some people have recommended a Wiki... I'd go with Bugzilla. Maybe with a little bit of Perl hacking you can modify it so that it looks (and behaves) the way you want.
I have a modified version of Bugzilla running on Windows (W2KAS/IIS) that is being tested for something vaguely similar. It works great. The hard part was getting it to work on IIS with ActivePerl - if you're running Apache/Linux or BSD it won't be nearly as big a chore. I didn't even have to touch the MySQL schemas.
Content management systems (CMS) not CVS... by aquarian · 2003-05-13 13:52 · Score: 2, Informative

What you want is a content management system, or CMS. These do exactly what you're talking about. There are a whole slew of them out there, free and not free. Furthermore, there are some general web services toolkits with good CMS modules. Find one that comes closest to meeting your needs, then modify it to get exactly what you want. Some that I've used are Zope, OpenACS, Redhat CCM, OpenCMS, MMBase, and Vignette.
Of all of these, I like OpenACS the best, mostly because of its developer community. There are a lot of great people involved, and there's a high signal to noise ratio on the developer forums. Even though OpenACS probably has the least of what you're looking for, it might be the easiest to develop. OpenACS runs on top of Postgres or Oracle, and is written in Tcl.
Redhat CCM is basically a Java rewrite of the original OpenACS. Its CMS modules are supposedly more mature. It runs on a Redhat version of Postgres, and I think Oracle too.
Zope is a whole lotta product, and probably has most of what you're looking for. However, I find it kind of murky, difficult to figure out. YMMV.
These three are the most promising in terms of developer community. This is a bigger undertaking than it might seem at the outset. You'll need all the help you can get, and getting involved with these communities will spare you from trying to reinvent the wheel.
Of course, I'd love to have you guys use and extend the OpenACS toolkit, and share your efforts with the community!