Dumping Lots of Data to Disk in Realtime?

Posted by Cliff on Saturday May 14, 2005 @01:04AM from the too-much-for-an-RDBMS? dept.

AmiChris asks: "At work I need something that can dump sequential entries for several hundred thousand instruments in realtime. It also needs to be able to retrieve data for a single instrument relatively quickly. A standard relational database won't cut it. It has to keep up with 2000+ updates per second, mostly on a subset of a few hundred instruments active at a given time. I've got some ideas of how I would build such a beast, based on flat files and a system of caching entries in memory. I would like to know if: someone has already built something like this; and if not, would someone want to use it if I build it? I'm not sure what other applications there might be. I could see recording massive amounts of network traffic or scientific data with such a library. I'm guessing someone out there has done something like this before. I'm currently working with C++ on Windows. "

11 of 127 comments

2-stage approach by eagl · 2005-05-14 01:11 · Score: 5, Informative

Have you considered a 2-stage approach? Stuff it to disk, and process/index it separately? A fast stream of data would let it all get recorded without loss, and then you could use whatever resources are necessary to index and search without impacting the data dump.
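
A rough sketch of the two-stage idea, purely as an illustration: stage 1 appends fixed-size records to one sequential log and does nothing else, and stage 2 scans that log later to build a per-instrument offset index. The record layout (instrument id, timestamp, value) and the names `RECORD`, `append_updates`, `build_index` and `read_instrument` are assumptions made up for this example, not anything from the original question.

```python
import os
import struct
import tempfile
from collections import defaultdict

# Hypothetical fixed-size record: uint32 instrument id, double timestamp, double value.
RECORD = struct.Struct("<Idd")


def append_updates(log_path, updates):
    """Stage 1: append every update to one sequential log, with no indexing."""
    with open(log_path, "ab") as log:
        for instrument_id, timestamp, value in updates:
            log.write(RECORD.pack(instrument_id, timestamp, value))


def build_index(log_path):
    """Stage 2: scan the log afterwards and map instrument id -> record offsets."""
    index = defaultdict(list)
    offset = 0
    with open(log_path, "rb") as log:
        while True:
            chunk = log.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            instrument_id, _, _ = RECORD.unpack(chunk)
            index[instrument_id].append(offset)
            offset += RECORD.size
    return index


def read_instrument(log_path, index, instrument_id):
    """Fetch all records for one instrument using the offsets found in stage 2."""
    rows = []
    with open(log_path, "rb") as log:
        for offset in index.get(instrument_id, []):
            log.seek(offset)
            rows.append(RECORD.unpack(log.read(RECORD.size)))
    return rows


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "ticks.log")
        append_updates(log_path, [(7, 1.0, 100.5), (9, 1.1, 20.0), (7, 1.2, 100.75)])
        index = build_index(log_path)
        return read_instrument(log_path, index, 7)


if __name__ == "__main__":
    print(demo())
```

In a real setup the indexing pass would presumably run on a separate process or machine, so it never competes with the writer for the disk.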

Cost... Are you going to go for local storage or NAS? Need SCSI and RAID or a less expensive hardware setup? Do you think gigabit ethernet will be sufficient for the transfer from the data dump hardware to the processing/indexing/search machines?

Sounds like you might want to run a test case using commodity hardware first.
Suuuure. by Seumas · 2005-05-14 01:12 · Score: 4, Funny

Yeah, like it isn't obvious that this guy works for the government's TIA program and is looking for ways to maintain all of the data culled from the thousands of audio and video sensors they have planted around.

Suuuure.
Wonderware InSQL by Dios · 2005-05-14 01:15 · Score: 4, Informative

Check out Wonderware InSQL. We update roughly 50k points every 30 seconds without loading the server much at all. It's a pretty nice product, and it also has some custom extensions to SQL built in for querying the data (e.g. cyclic, resolution, delta storage, etc.).

http://www.wonderware.com/

Of course, you'll need your data to come from OPC, SuiteLink, or another supported protocol, but it should work nicely for you.

- Joshua
Don't roll your own by btlzu2 · 2005-05-14 01:21 · Score: 3, Informative

Unless you really want to do a LOT of work. This sounds very much like a SCADA system, and there are vendors of such systems. Most of their realtime databases are designed to live in a large, proprietary, RAM-resident store that is occasionally dumped to disk for backup purposes.

To process so many points in realtime, the data usually has to be in RAM for performance reasons.

--
Zed's dead baby. Zed's dead.
Cluster it by canuck57 · 2005-05-14 01:24 · Score: 3, Insightful

I know you're working with Windows, but when I read this I said yes.

I'm guessing someone out there has done something like this before.

Google has a cluster of machines far larger than you need, but their approach was a Linux cluster. Plus, with the amount of writes going on, you're going to want to avoid putting any unneeded burdens on the system.
A commercial RDMS can cut it by jbplou · 2005-05-14 01:28 · Score: 4, Informative

You can definitely use Oracle to write out 2000 updates per second if your hardware is up to it and your db skills are good.
1. Re:A commercial RDMS can cut it by gvc · 2005-05-14 03:56 · Score: 4, Interesting
  
  "Can [the storage backend] handle 2000 random seeks per second?"
  
  The short answer is "no."
  
  A 10,000 RPM disk has a rotation period of 6 ms. That's 3 ms of rotational latency on average for random access (not counting seek time, or the fact that a read-modify-write will take at least 3 times this long: read, wait one full rotation, write).
  
  So one disk can do, as a generous upper bound, 1000/3 = 333 random accesses per second. I'll spare you the details of the Poisson distribution, but if you managed to spread these updates randomly over a disk farm, you'd need about 2000/333*e = 16 independent spindles.
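
For anyone who wants to plug in their own numbers, the arithmetic above can be written out as a few lines of Python. This is only a restatement of the estimate in this comment: the function name `spindles_needed` is made up, and the factor of e is taken as given from the Poisson argument rather than derived here.

```python
import math


def spindles_needed(rpm, updates_per_second):
    """Return (rotation period in ms, accesses/s per disk, spindles needed)."""
    rotation_ms = 60_000 / rpm                  # one full rotation, in milliseconds
    avg_latency_ms = rotation_ms / 2            # on average, wait half a rotation
    accesses_per_disk = 1000 / avg_latency_ms   # generous upper bound, seek time ignored
    # The factor e is the headroom quoted above for randomly spread updates.
    spindles = updates_per_second / accesses_per_disk * math.e
    return rotation_ms, accesses_per_disk, spindles


if __name__ == "__main__":
    rotation_ms, per_disk, spindles = spindles_needed(10_000, 2000)
    print(f"{rotation_ms:.0f} ms/rotation, {per_disk:.0f} accesses/s, {spindles:.1f} spindles")
```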
  
  The trick to high throughput is harnessing, and creating, non-randomness. You can do a much better job of this with a purpose-built solution.
Ramdisk database by Glonoinha · 2005-05-14 03:18 · Score: 4, Informative

Here's a thought - just use a RAM-based database.
Either make a big ramdisk and put your database on it (see my Journal from a few months back: ramdisk throughput is pretty damn fast from the local machine, given certain constraints, and random-access writing is hella fast), or use a database that runs entirely in memory (think Derby, aka Cloudscape, which comes with WebSphere Application Developer).

Once you've got your data, save it out to the hard drive.
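
As a small illustration of the "collect in memory, then save to disk" pattern, here is a sketch using SQLite's in-memory mode from Python's standard library. SQLite is swapped in here only because it needs no setup; it is not the Derby/Cloudscape product mentioned above, and the `ticks` table and function names are assumptions invented for the example.

```python
import os
import sqlite3
import tempfile


def record_in_memory_then_save(updates, disk_path):
    """Insert all updates into a RAM-only database, then copy it to disk once."""
    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE ticks (instrument INTEGER, ts REAL, value REAL)")
    mem.executemany("INSERT INTO ticks VALUES (?, ?, ?)", updates)
    mem.commit()

    disk = sqlite3.connect(disk_path)
    mem.backup(disk)  # Connection.backup() is available from Python 3.7
    disk.close()
    mem.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ticks.db")
        record_in_memory_then_save(
            [(7, 1.0, 100.5), (9, 1.1, 20.0), (7, 1.2, 100.75)], path
        )
        # Read back from the on-disk copy to confirm the data survived.
        disk = sqlite3.connect(path)
        rows = disk.execute(
            "SELECT ts, value FROM ticks WHERE instrument = ? ORDER BY ts", (7,)
        ).fetchall()
        disk.close()
        return rows


if __name__ == "__main__":
    print(demo())
```

The obvious trade-off is that anything still in RAM is lost if the box goes down before the save.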

Granted, it helps to have a box with a ton of memory in it, but they are out there now, almost affordable. If you are collecting more than 4 GB of data in one session, well, YMMV - but 4 GB is a LOT of data, so perhaps reconsider your approach.

--
Glonoinha the MebiByte Slayer
Have you considered memory-mapped files? by Teancum · 2005-05-14 03:50 · Score: 3, Interesting

I did some work on a DVD-Video authoring system that had some incredible file system requirements (obviously, when video data is involved and the typical data load for a single DVD disc is 4 GB).

The standard file API architecture just didn't hold up, so we (the development team I was working with) had to rewrite some of the file management routines ourselves and work with the memory-mapped architecture directly. This gives you some other advantages beyond speed as well: once you establish the file link and map it into a memory address range, you can treat the data in the file as if it were RAM within your program, having fun with pointers and everything else you can imagine. Copying data to the file is simply a matter of a memory move operation, or copying from one pointer to another.

The thing to remember is that Windows (this is undocumented) won't allow you to open a memory-mapped file that is larger than 1 GB, and under FAT32 file systems (Windows 95/98/ME and some low-end XP systems) the total of all memory-mapped files on the entire operating system must be below 1 GB (this requirement really sucks the breath out of some applications).

Remember that if you are putting pointers into the file directly, it works better if they are relative offsets rather than direct memory pointers, since the file may be mapped at a different address the next time you open it. Direct memory pointers are in theory possible, but only within a single session run.
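
A minimal sketch of both points, using Python's `mmap` module rather than the Win32 calls we actually worked with: records are written into a mapped file as if it were memory, and each record links to the previous one by a relative file offset, so the chain still works after the file is mapped again. The header and record layouts here are invented for the example.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<Q")   # file offset of the newest record (0 = none)
RECORD = struct.Struct("<dQ")  # value, file offset of the previous record (0 = end)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        size = 4096
        with open(path, "wb") as f:
            f.write(b"\x00" * size)  # pre-size the file; this mapping cannot grow it

        # Write through the mapping as if it were ordinary memory.
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), size) as view:
                head = 0
                write_at = HEADER.size
                for value in (1.5, 2.5, 3.5):
                    RECORD.pack_into(view, write_at, value, head)
                    head = write_at          # relative offset, not a memory address
                    write_at += RECORD.size
                HEADER.pack_into(view, 0, head)
                view.flush()

        # Map the file again (possibly at a different address) and walk the chain.
        values = []
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                (offset,) = HEADER.unpack_from(view, 0)
                while offset:
                    value, offset = RECORD.unpack_from(view, offset)
                    values.append(value)
        return values


if __name__ == "__main__":
    print(demo())  # newest record first
```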
Kdb+ by RussHart · 2005-05-14 05:28 · Score: 3, Informative

Kdb+ by KX Systems (http://www.kx.com/) is far and away the best thing for this. Its main use is storing tick data from financial markets, and it is excellent at this (if expensive).

From how you described your needs, this would probably fit the bill.
HP-IB and ISAM by Decker-Mage · 2005-05-14 21:44 · Score: 3, Informative

This is what the Hewlett Packard Interface Bus (HP-IB) was invented for, and your instruments may already be equipped for it. As for what to do with the data stream from the instruments, you stuff it into an ISAM database. Why anyone would even think of using an RDBMS for this is beyond me. ISAM (Indexed Sequential Access Method) has been around forever and exists to take tons of sequential data and store it on the media of choice. From your description, retrieval is only going to be based on a few criteria anyway (instrument, time), so those indices are perfect in this instance.
On the coding end, there are numerous (hell, hundreds of) commercial and F/OSS ISAM libraries, plus books on the subject, for you to use for the actual storage and retrieval. It may even be included in your existing libraries, given how old the technique is now. I was doing this back in the '80s for the US Navy using a 24-bit, very slow minicomputer, so any normal box should be able to handle it today!
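
To give a feel for the idea (this is a toy, not how any particular ISAM library works): records go into a sequential data file in arrival order, and a sorted index keyed on (instrument, time) maps each key to a file offset, so a query for one instrument over a time range becomes a binary search plus a handful of seeks. The class name `TinyIsam`, the record layout, and keeping the index only in memory are all simplifying assumptions; a real ISAM implementation persists its index as well.

```python
import bisect
import os
import struct
import tempfile

RECORD = struct.Struct("<Idd")  # hypothetical layout: instrument id, timestamp, value


class TinyIsam:
    """Sequential data file plus a sorted (instrument, timestamp, offset) index."""

    def __init__(self, path):
        self.data = open(path, "a+b")
        self.keys = []  # kept sorted; held in memory only in this sketch

    def append(self, instrument, timestamp, value):
        self.data.seek(0, os.SEEK_END)
        offset = self.data.tell()
        self.data.write(RECORD.pack(instrument, timestamp, value))
        bisect.insort(self.keys, (instrument, timestamp, offset))

    def query(self, instrument, start, end):
        """Return records for one instrument with start <= timestamp <= end."""
        lo = bisect.bisect_left(self.keys, (instrument, start, -1))
        hi = bisect.bisect_right(self.keys, (instrument, end, float("inf")))
        rows = []
        for _, _, offset in self.keys[lo:hi]:
            self.data.seek(offset)
            rows.append(RECORD.unpack(self.data.read(RECORD.size)))
        return rows

    def close(self):
        self.data.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyIsam(os.path.join(tmp, "ticks.dat"))
        for row in [(7, 1.0, 100.5), (9, 1.1, 20.0), (7, 1.2, 100.75), (7, 2.0, 101.0)]:
            store.append(*row)
        rows = store.query(7, 1.0, 1.5)
        store.close()
        return rows


if __name__ == "__main__":
    print(demo())
```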

We use these techniques in electronic instrument monitoring, logistical systems, systems engineering; you get the idea. You may want to mosey over to the HP developer web site to see if there is a drop-in solution, as I imagine there is (sorry, I haven't looked).

I hope this helps.

--
"[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go