Dumping Lots of Data to Disk in Realtime?
AmiChris asks: "At work I need something that can dump sequential entries for several hundred thousand instruments in realtime. It also needs to be able to retrieve data for a single instrument relatively quickly. A standard relational database won't cut it. It has to keep up with 2000+ updates per second, mostly on a subset of a few hundred instruments active at a given time. I've got some ideas of how I would build such a beast, based on flat files and a system of caching entries in memory. I would like to know if: someone has already built something like this; and if not, would someone want to use it if I build it? I'm not sure what other applications there might be. I could see recording massive amounts of network traffic or scientific data with such a library. I'm guessing someone out there has done something like this before. I'm currently working with C++ on Windows."
I did some work on a DVD-Video authoring system that had some incredible file system requirements (unsurprising, given video data and the typical 4 GB data load for a single DVD).
The standard file API architecture just didn't hold up, so we (the development team I was working with) had to rewrite some of the file management routines ourselves and work with the memory-mapped file architecture directly. This gives you some advantages beyond speed as well: once you establish the file mapping and place it in a memory address range, you can treat the data in the file as if it were RAM within your program, having fun with pointers and everything else you can imagine. Copying data to the file is simply a matter of a memory move operation, or copying from one pointer to another.
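To make the "copying to the file is just a memory move" point concrete, here is a minimal sketch, not the code from the authoring system described above. The function name `write_through_mapping` and its parameters are made up for illustration. The Windows branch uses `CreateFileMappingA` and `MapViewOfFile`; the `#else` branch swaps in POSIX `mmap` so the same sketch also builds on other systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical helper: creates (or truncates) `path` at `file_size` bytes,
// maps the whole file, and copies `len` bytes from `data` to byte `offset`
// with a plain memcpy. Returns false on any failure.
bool write_through_mapping(const char* path, std::size_t file_size,
                           std::size_t offset, const void* data,
                           std::size_t len) {
    assert(data != nullptr || len == 0);
    if (file_size == 0 || offset > file_size || len > file_size - offset) {
        return false;  // the write would not fit inside the mapping
    }
#ifdef _WIN32
    HANDLE file = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return false;
    // A non-zero maximum size grows the file to that size.
    const unsigned long long size64 = file_size;
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE,
                                        static_cast<DWORD>(size64 >> 32),
                                        static_cast<DWORD>(size64 & 0xFFFFFFFFu),
                                        nullptr);
    if (mapping == nullptr) { CloseHandle(file); return false; }
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);
    const bool ok = (view != nullptr);
    if (ok) {
        // The "file write" is an ordinary memory copy into the view.
        std::memcpy(static_cast<char*>(view) + offset, data, len);
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return ok;
#else
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    if (ftruncate(fd, static_cast<off_t>(file_size)) != 0) {
        close(fd);
        return false;
    }
    void* view = mmap(nullptr, file_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, 0);
    const bool ok = (view != MAP_FAILED);
    if (ok) {
        // The "file write" is an ordinary memory copy into the view.
        std::memcpy(static_cast<char*>(view) + offset, data, len);
        munmap(view, file_size);
    }
    close(fd);
    return ok;
#endif
}
```

A real recorder would keep the view mapped between writes rather than mapping and unmapping per call; the sketch does both in one function only to stay short.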
The thing to remember is that, in my experience, Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB, and under FAT32 file systems (Windows 95/98/ME and some low-end XP systems) the total of all memory-mapped files on the entire operating system must be below 1 GB (this requirement really sucks the breath out of some applications).
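Remember that if you are putting pointers into the file directly, it works better if they are offsets relative to the start of the mapping rather than direct memory pointers. Direct memory pointers are in theory possible during a single session run, but the file may be mapped at a different base address the next time you open it, and then every stored pointer is wrong.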
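A rough sketch of the relative-offset idea, with everything in it (the `Record` layout, the `Region` wrapper, the `kNoRecord` sentinel) invented for illustration rather than taken from any real file format. Records link to each other by byte offset from the start of the mapping, so the chain stays valid no matter where the data ends up in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sentinel meaning "no next record".
const std::uint64_t kNoRecord = ~static_cast<std::uint64_t>(0);

// Hypothetical record layout: the link is a byte offset from the start
// of the mapping, never a raw pointer.
struct Record {
    std::uint32_t instrument_id;
    double        value;
    std::uint64_t next_offset;
};

// Stand-in for a mapped view: wherever the file landed this session.
struct Region {
    unsigned char* base;
    std::size_t    size;
};

// Copy a record out of the region at the given offset.
Record read_record(const Region& region, std::uint64_t offset) {
    assert(offset <= region.size && sizeof(Record) <= region.size - offset);
    Record r;
    std::memcpy(&r, region.base + offset, sizeof r);
    return r;
}

// Copy a record into the region at the given offset.
void write_record(const Region& region, std::uint64_t offset, const Record& r) {
    assert(offset <= region.size && sizeof(Record) <= region.size - offset);
    std::memcpy(region.base + offset, &r, sizeof r);
}

// Follow next_offset links starting at `first` and count the records.
std::size_t chain_length(const Region& region, std::uint64_t first) {
    std::size_t n = 0;
    for (std::uint64_t off = first; off != kNoRecord;
         off = read_record(region, off).next_offset) {
        ++n;
    }
    return n;
}
```

If you copy the bytes to a different buffer (which is what a remap at a new base address amounts to), `chain_length` still walks the same records, because nothing stored in the data depends on the address.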
"Can [the storage backend] handle 2000 random seeks per second?"
The short answer is "no."
A 10,000 RPM disk has a rotation period of 6 ms. That's 3 ms of rotational latency on average for random access (not counting seek time, or the fact that a read-modify-write will take at least 3 times this long: read, wait one full rotation, write).
So one disk can do, as a generous upper bound, 333 random accesses per second (1000 ms / 3 ms). I'll spare you the details of the Poisson distribution, but if you managed to spread these updates randomly over a disk farm, you'd need about (2000 / 333) * e = 16 independent spindles.
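The arithmetic above, written out as a small sketch so you can plug in other spindle speeds or update rates. The function names are made up for illustration, and the factor e is taken from the estimate as stated; the Poisson argument behind it is not reproduced here.

```cpp
#include <cassert>

// One full platter rotation, in milliseconds.
double rotation_period_ms(double rpm) {
    assert(rpm > 0.0);
    return 60.0 * 1000.0 / rpm;
}

// On average a random access waits half a rotation for its sector.
double avg_rotational_latency_ms(double rpm) {
    return rotation_period_ms(rpm) / 2.0;
}

// Generous upper bound on random accesses per second for one spindle
// (seek time and read-modify-write cost are ignored).
double max_random_accesses_per_sec(double rpm) {
    return 1000.0 / avg_rotational_latency_ms(rpm);
}

// Spindle estimate as given in the text: (load / per-disk rate) * e.
double spindles_needed(double updates_per_sec, double rpm) {
    const double kE = 2.718281828459045;
    return updates_per_sec / max_random_accesses_per_sec(rpm) * kE;
}
```

For 10,000 RPM and 2000 updates per second, `spindles_needed` comes out at roughly 16.3, which is where the "about 16" figure comes from.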
The trick to high throughput is harnessing, and creating, non-randomness. You can do a much better job of this with a purpose-built solution.
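One way to "create non-randomness", sketched under my own assumptions rather than as anybody's actual design: buffer incoming updates in memory, grouped by instrument, and periodically drain them as a single ordered batch that can go to disk in one sequential write. The `Update` layout and the `UpdateBatcher` class are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical update record.
struct Update {
    std::uint32_t instrument_id;
    std::uint64_t sequence;
    double        value;
};

// Collects updates that arrive in random instrument order and hands them
// back grouped by instrument, so they can be written in one sequential pass.
class UpdateBatcher {
public:
    void add(const Update& u) {
        pending_[u.instrument_id].push_back(u);
        ++count_;
    }

    std::size_t pending() const { return count_; }

    // Returns all pending updates ordered by instrument id, and by arrival
    // order within each instrument, then empties the buffer.
    std::vector<Update> drain() {
        std::vector<Update> batch;
        batch.reserve(count_);
        for (const auto& entry : pending_) {
            batch.insert(batch.end(), entry.second.begin(), entry.second.end());
        }
        assert(batch.size() == count_);
        pending_.clear();
        count_ = 0;
        return batch;
    }

private:
    std::map<std::uint32_t, std::vector<Update>> pending_;
    std::size_t count_ = 0;
};
```

The trade-off is that anything still sitting in the buffer is lost if the process dies before the next drain, so how often you drain is a durability decision as much as a throughput one.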