Questions about Database Implementation.
Metuchen asks: "I'm developing a program for Linux that will eventually be managing a medical practice. This program will be running on several terminals on the same computer, but they must all access the same database simultaneously (i.e. if one user were to update a record, the next time another user accesses that patient--on a different terminal--the correct information must be displayed). What would be the best way to implement such a program? I would prefer to stay away from writing everything to file after each update since the database will be updated frequently; for the same reasons, I would like to be able to keep everything in RAM without using a RAMdrive (since that may require up to one file per patient). Any suggestions?"
Postgres, MySQL, or mSQL for varying degrees of "open source-ness". Or you may wish to go with one of the "big 5" commercial databases, Oracle, Informix, DB2, Ingres, (what's the other one, Unidata?). Each of these DBMSes provide a rich API - basically SQL, or a superset thereof - to which to write your application. With careful use of SQL statements, your app will be a lot more portable across DBMSes as well as across platforms. This could be a considerable advantage if your project grows beyond all expectations, or if a new client already has a SQL database. Hell, it might even work on MSSQL!
Ahh - My eye!
The doctor said I'm not supposed to get Slashdot in it!
Sounds to me like you just need a database backend with your own custom frontend. For the backend, I'd say go with one of the open source databases, (eg MySql, Postgress, etc). But you could probably go with whatever you want.
In regards to your concern of if one user were to update a record, the next time another user accesses that patient--on a different terminal--the correct information must be displayed. That is no troubles, you'd have to go out of your way to have that as a problem. (eg write special caching in the client or something).
As for your concerns on speed, etc. Let the backend deal with that. If you really feel a desire to help it along, you could access the database on a ramdrive, and do regular copies to a physical disk. (You don't want to loose a days changes if the server crashes).
I seriously doubt you'd need to use a ramdisk, as you mentioned it would be run on serveral terminals, which means "not hundreds". And even if 10 users were pounding out records every few secounds, I suspect even a slow server could handle the writes
For the frontend, you can do basically whatever you like, from a Command Line interface, to ncurses, to a webinterface. (if you do a webinterface, I recommend a Apache+php3+MySql combo).
Have fun, and remember to do lots of input checks, and don't forget to escape the input string before parsing it to the database. (eg " becomes \")
I use to have a funny sig, but slash cut it off, and I forgot what the punchline was.
Okay, there are two things you can do to solve this problem. First, you could use an "off-the-shelf" database engine. There are quite a few very good open source database engines out there, as well as a few good low cost and high cost engines.
OTOH, I assume you are probably interested in rolling your own engine. So here's a few strategies that you may wish to consider if you are going to do it yourself.
Overall, one of the most powerful metaphores for breaking down a problem into managable pieces is the notion of "layers" or of a program "stack". That is, you break a large problem into a collection of software "layers", like the TCP/IP protocol stack--at the lowest level you have the piece of software that talks to the hardware. Above that, you have the thing which encapsulates the packets. Above that, you have the thing which handles routing. And so forth.
It's a powerful notion because it allows you to break a complex problem such as "develop a database engine which provides transaction rollback and RAM caching" into a bunch of easy to understand units.
For a database, eventually you are going to have to write the records to a file, and take the performance hit that entails. That's just life. And worse--if you want to have proper transaction rollbacks (meaning if the power crashes while you are in the middle of writing a record, you need to have a mechanism to recover from this), that means you are going to have to do *two* file writes--one to write some sort of "recover" record, and one to write the record you are trying to save.
A common way to handle this sort of transaction rollback is to write a record or group of records in three steps. Step (1), read the records you are replacing and write them into a "backup" or "rollback" file. The idea is that this "rollback" file contains a record of everything you changed in the database file, so that if the power fails while you are updating your database, you can "fix" the database by reversing the steps in your rollback file. (You should also checksum the rollback file so you can detect if the power failure occured while creating the rollback file.) Step (2), you update the database by replacing records and by appending new records. Step (3), you delete the rollback file.
Now there are two common database formats that I've encountered out there. The first common database format is a simple file array of fixed-sized records. If each patient record (for example) is a fixed-sized record, you simply create an array of these records in your database.
Of course this has a couple of disadvantages. First, each record is fixed--that means you can potentially waste a lot of space. Second, it's inflexable. And third, there really is no good way to handle adding "out of band" records--that is, records which store information that is not part of this simple array table. Things like hash indexes or B-Tree indexes wind up being stored in separate auxiliary files.
The second common file format I've encountered is documented in "Transaction Processing: Concepts and Techniques" by Jim Gray and Andreas Reuter. (Very heavy, but excellent book.) The idea is that you break your file into a bunch of fixed-sized records whose size is roughly 4K to 32K. Each record contains in it's header a sort of "page table of contents" which indicate the contents of this page. The body contains the data. The idea is that you fit oddly sized blocks of data within the fixed-size record, and mark where those are in the page table of contents so you can find each odd sized block of data using the tuple (page_index, toc_index).
What's good about this is that you can do all sorts of interesting tricks: if you need to replace a block of data with a larger block that no longer fits in the page, you can simply mark in the table of contents a "forwarding address"--that way, the (page_index, toc_index) refers to the same block--just forwarded on another page in the database. You can also cache these records in memory as you read them--creating a higher-level cache which stores a handfull of these pages can speed up access. And it does solve the problem of "how do I handle varchar records" and "where do I store my B-tree index records."
So this is the bottom level--creating code which handles your underlying database, and handles data recovery in case of a system failure.
Now it sounds like you want to keep stuff in memory as long as possible in order to minimize I/O to the file. While that's nice, it has the problem that in a system failure, stuff in RAM can be lost. That's why most commercial databases write transactions to the database as soon as you indicate "END TRANS"--because the idea is that when you end a transaction, the data had better be recoverable in case of a system failure.
I suspect you'll find that most people are looking up names and reading their status a lot more often than they are updating or changing patient information. So on top of your database file I/O engine, I would suggest two things. First, I would suggest a "write-through cache" of data records. The idea is that you keep around a collection of records in RAM, marking how old each record is, and updating that mark when a record is read--that way, records that are no longer being actively used eventually drop out of memory. When the user updates a record, you immediately write that record to the database, update it's timestamp, and move on.
Second, I would recommend that you find the fields that are commonly accessed to look up records (such as name, hospital patient ID, etc), and create either a hash table or a B-tree structure which speeds up looking up by those fields. The reason why is that when your database hits a few thousand records, you need a way to find records that is faster than reading the entire database into memory. A well-managed B-tree object will do the trick--it will permit you to find a patient, yet minimize the number of I/O reads you have to perform to O(log(n))--a big win when n is a few thousand.
If you insist on living a risky life with your data, you can implement a "lazy write" scheme where you write the updated record after it "expires" in memory--and then, optimize your writes by scanning the other records you have in memory that need to be written with this expired record, and write them all together. (This can potentially be a win, especially if you have multiple records living on the same database page--you potentially reduce several page read/page write operations into one page read/page write op.)
However, in my personal tinkering with these techniques in my own database engine, I find that these methods don't necessarly do the trick. That is, if you have 50 updated records, sooner or later you are going to have to do 50 database writes--and if these records are scattered through your database file, at best you will be able to collapse one or two of these records into a single write. So in a sense, you are risking losing a lot of data (losing 50 patients sounds like a lot to me) in the hope that you may get a 2%-4% performance gain. Better to spend a few hundred more on a faster hard disk for your server computer.
In short, I'd avoid any sort of lazy write scheme if possible.
Notice that I haven't put in a word about the multi-user aspect of this. That's because I'm assuming that your writing the database engine as a single process--that is, that the database engine is written as a single process which controls the database, which then communicates with the database clients through a pipe or socket. This has the advantage that it simplifies the whole "record locking" and "race condition" aspect by having one process own the file. And that means on top of your "file I/O", "record cache", and "record search" layers you have a "client access" layer and you're done. Another approach is to implement record locking and create a database engine which allows multiple processes to access the same file. This requires that you be able to implement some form of record locking on the file--that is, that you can (either in the operating system, or simulate through a "record lock file") lock segments of your database file; that way, you can update one segment of the database while another process access other records in the same file. (File locking, by the way, is really only used AFAIK to determine if the database page one process is attempting to read is currently being written over by another process.)
This method isn't much harder to implement than "one database file/one process". It has the advantage that if you are on a box with multiple processors, multiple client access can nicely divide across those processors. However, it's my experience in tinkering with such things that this sort of technique can be harder to code and get right at the record I/O level. (Though it does eliminate all of that "keep track of which client I'm talking to" code.)
Anyways, that's my thoughts on the matter. I do highly recommend the book "Transaction Processing", by the way, as it covers a lot of this (and more!) in incredible detail. The three lessons I've come away from the book is to "layer your code"--that way, error checking can be handled in the same way that error checking is handled on TCP/IP stacks. And second, always have a way to "rollback" or "undo" the last operation (or "transaction"); that way, you can guarentee database consistency in the event of a catestrophic failure, as well as implement a good recovery mechanism in the case of a minor error that doesn't even affect higher level code (such as waiting for a record to become unlocked). And three, you can speed a lot of things up by using a good tree structure to index your records.
*shrug*
I only outline the above because it's interesting, because sometimes you don't need Oracle-SQL to store a few records, and because frankly, a hell of a lot of people who write applications which use a primitive "database" that simply consists of an array of records. And it wouldn't hurt to add a couple of layers to the low level record I/O routines in that application to make it a touch more bullet proof.
Me, I'm not a big believer in using a cannon to kill a fly. Using a full-on DB engine like Oracle or Transact-SQL just to store an array of records is a wee bit overkill...
I think the thing you're missing here is that the term "database" doesn't necessarly apply to just a SQL server performing transactions across a wire. Granted, in this case, the person doing this project should probably invest in Oracle--after all, that's what he's doing. But I did suggest that at the top of my post.
Thing is, a "database" is not just SQL running remotely. Did you ever develop an application that loaded and saved data in a file? Well, that could also be considered a "database". Transaction processing doesn't just apply to traditional databases, either--they're a strategy to prevent your user from losing data just because he hit "save" at the wrong time.
Further, what ever happened to curiosity? What ever happened to the hacker mentality of wanting to know how it works and tinkering with something that's interesting? It does supprise me the number of people who instead of saying "I reinvented the wheel because I thought it would be cool" (hacker mentality) are saying "why the hell did you waste your time doing 'x'?"
Besides, once the guy realizes all that's involved, perhaps he can be the one who makes the mindful decision as to how to perform his employer's wishes?