Slashdot Mirror


Dumping Lots of Data to Disk in Realtime?

AmiChris asks: "At work I need something that can dump sequential entries for several hundred thousand instruments in realtime. It also needs to be able to retrieve data for a single instrument relatively quickly. A standard relational database won't cut it. It has to keep up with 2000+ updates per second, mostly on a subset of a few hundred instruments active at a given time. I've got some ideas of how I would build such a beast, based on flat files and a system of caching entries in memory. I would like to know if: someone has already built something like this; and if not, would someone want to use it if I build it? I'm not sure what other applications there might be. I could see recording massive amounts of network traffic or scientific data with such a library. I'm guessing someone out there has done something like this before. I'm currently working with C++ on Windows."

7 of 127 comments

  1. 2-stage approach by eagl · · Score: 5, Informative

    Have you considered a 2-stage approach? Stuff it to disk, and process/index it separately? A fast stream of data would let it all get recorded without loss, and then you could use whatever resources are necessary to index and search without impacting the data dump.

    Cost... Are you going to go for local storage or NAS? Need SCSI and RAID or a less expensive hardware setup? Do you think gigabit ethernet will be sufficient for the transfer from the data dump hardware to the processing/indexing/search machines?

    Sounds like you might want to run a test case using commodity hardware first.
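
    A minimal sketch of the two-stage idea, in Python for brevity: stage 1 only appends fixed-size records to a log, and stage 2 scans that log separately to build a per-instrument index. The record layout (instrument id, timestamp, value) and the names `append_records`, `build_index`, and `read_instrument` are assumptions made up for illustration, not anything from the original question.

```python
import os
import struct
import tempfile
from collections import defaultdict

# Hypothetical fixed-size record: instrument id (uint32), timestamp (float64),
# value (float64). "<" means little-endian with no padding, so 20 bytes each.
RECORD = struct.Struct("<Idd")


def append_records(path, records):
    """Stage 1: append records sequentially; no indexing on the hot path."""
    with open(path, "ab") as log:
        for instrument_id, timestamp, value in records:
            log.write(RECORD.pack(instrument_id, timestamp, value))


def build_index(path):
    """Stage 2: scan the log separately, mapping instrument id -> byte offsets."""
    index = defaultdict(list)
    with open(path, "rb") as log:
        offset = 0
        while True:
            chunk = log.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a partially written trailing record)
            instrument_id, _, _ = RECORD.unpack(chunk)
            index[instrument_id].append(offset)
            offset += RECORD.size
    return index


def read_instrument(path, index, instrument_id):
    """Fetch every record for one instrument using the offsets from stage 2."""
    out = []
    with open(path, "rb") as log:
        for offset in index.get(instrument_id, []):
            log.seek(offset)
            out.append(RECORD.unpack(log.read(RECORD.size)))
    return out
```

    The point of the split is that the writer never waits on the index; the indexer can run on another machine or on a schedule, at the cost of recent data not being searchable until it has been scanned.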

  2. Wonderware InSQL by Dios · · Score: 4, Informative


    Check out Wonderware InSQL. We update roughly 50k points every 30 seconds without loading the server much at all. Pretty nice product, also has some custom extensions to SQL built in for querying the data (e.g. cyclic, resolution, delta storage, etc.).

    http://www.wonderware.com/

    Of course, you'll need your data to come from an OPC/Suitelink/other supported protocol, but should work nicely for you.

    - Joshua

  3. Don't roll your own by btlzu2 · · Score: 3, Informative

    Unless you really want to do a LOT of work. This sounds very much like a SCADA system. There are vendors of such systems. Most of the realtime databases are designed to stay in a large, proprietary, RAM database which is occasionally dumped to disk for backup purposes.

    In order to process so many points in realtime, the data will usually have to be in RAM for performance reasons.

    --
    Zed's dead baby. Zed's dead.
  4. A commercial RDBMS can cut it by jbplou · · Score: 4, Informative

    You can definitely use Oracle to write out 2000 updates per second if your hardware is up to it and your db skills are good.

  5. Ramdisk database by Glonoinha · · Score: 4, Informative

    Here's a thought - just use a hard-RAM based database.
    Either make a big ramdisk and put your database out there (see my Journal from a few months back, ramdisk throughput is pretty damn fast from the local machine, given certain constraints, and random access writing is hella fast), or use a database that runs entirely in memory (think Derby, aka Cloudscape that comes with WebSphere Application Developer.)

    Once you've got your data, save it out to the hard drive.

    Granted it helps to have a box with a ton of memory in it, but they are out there now, almost affordable. If you are collecting more than 4G of data in one session, well YMMV - but 4G is a LOT of data, perhaps reconsider your approach.
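
    A rough sketch of the "keep it in memory, save it out later" pattern, in Python for brevity. It buffers updates per instrument in RAM and appends them to one flat file per instrument when a threshold is hit. The class name `BufferedStore`, the one-file-per-instrument layout, and the `max_buffered` threshold are all hypothetical choices for illustration, not a real product's design.

```python
import os
import tempfile
from collections import defaultdict


class BufferedStore:
    """Keeps recent entries in RAM; appends them to per-instrument files on flush."""

    def __init__(self, directory, max_buffered=1000):
        self.directory = directory
        self.max_buffered = max_buffered  # hypothetical threshold, tune by testing
        self.buffers = defaultdict(list)  # instrument id -> unflushed lines
        self.buffered = 0                 # total unflushed lines across instruments

    def _path(self, instrument_id):
        return os.path.join(self.directory, f"{instrument_id}.log")

    def update(self, instrument_id, line):
        """Record one entry in memory; flush everything once the buffer is full."""
        self.buffers[instrument_id].append(line)
        self.buffered += 1
        if self.buffered >= self.max_buffered:
            self.flush()

    def flush(self):
        """Append all buffered lines to their instrument's file, then clear RAM."""
        for instrument_id, lines in self.buffers.items():
            with open(self._path(instrument_id), "a", encoding="utf-8") as f:
                f.writelines(entry + "\n" for entry in lines)
        self.buffers.clear()
        self.buffered = 0

    def read(self, instrument_id):
        """Return flushed entries followed by any still sitting in memory."""
        lines = []
        path = self._path(instrument_id)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                lines = [entry.rstrip("\n") for entry in f]
        return lines + self.buffers.get(instrument_id, [])
```

    Anything still in the buffer is lost if the process dies before a flush, so the threshold is a trade-off between write throughput and how much data you can afford to lose.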

    --
    Glonoinha the MebiByte Slayer
  6. Kdb+ by RussHart · · Score: 3, Informative

    Kdb+ by KX Systems (http://www.kx.com/) is by far and away the best thing for this. Its main use is to store tick data from financial markets, and it is excellent at this (if expensive).

    From how you described your needs, this would probably fit the bill.

  7. HP-IB and ISAM by Decker-Mage · · Score: 3, Informative

    This was what the Hewlett Packard Interface Bus (HP-IB) was invented for and your instruments may already be equipped for it. As for what to do with the data stream from the instruments, you stuff it into an ISAM database. Why anyone would even think of using an RDBMS for this is beyond me. ISAM (Indexed Sequential Access Method) has been around forever and exists to take tons of sequential data and store it to the media of choice. From your description, retrieval is only going to be based on a few criteria anyway (instrument, time), so those indices are perfect in this instance.
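
    To make the (instrument, time) indexing concrete, here is a toy sketch in Python. It is not a real ISAM library: an in-memory list stands in for the sequential data file, and a sorted list searched with `bisect` stands in for the index. The class name `TinyIsam` and its methods are made up for illustration.

```python
import bisect


class TinyIsam:
    """Toy ISAM-style store: sequential data plus a sorted (instrument, time) index."""

    def __init__(self):
        self.data = []   # records in arrival order (stands in for the data file)
        self.index = []  # sorted (instrument, time, position in self.data) entries

    def insert(self, instrument, time, value):
        """Append the record sequentially and add its key to the sorted index."""
        position = len(self.data)
        self.data.append((instrument, time, value))
        bisect.insort(self.index, (instrument, time, position))

    def query(self, instrument, start, end):
        """Return one instrument's records with start <= time <= end, in time order."""
        # Position -1 sorts before every real entry with the same key, and
        # len(self.data) sorts after, so both range endpoints are inclusive.
        lo = bisect.bisect_left(self.index, (instrument, start, -1))
        hi = bisect.bisect_right(self.index, (instrument, end, len(self.data)))
        return [self.data[pos] for _, _, pos in self.index[lo:hi]]
```

    The data is only ever appended, which keeps writes sequential; lookups cost a binary search on the index rather than a scan of the data.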

    On the coding end, there are numerous (hell, hundreds of) commercial and F/OSS ISAM libraries, plus books on them, for you to use for the actual storage and retrieval. It may even be included in your existing libraries given how old the technique is now. I was doing this back in the '80s for the US Navy using a 24 bit, very slow, mini-computer, so any normal box should be able to handle it today!

    We use these techniques in electronic instrument monitoring, logistical systems, systems engineering, you get the idea. You may want to mosey over to the HP developer web site to see if there is a drop in solution, as I imagine there is (sorry, haven't looked).

    I hope this helps.

    --
    "[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go