Slashdot Mirror


Streaming a Database in Real Time

Roland Piquepaille writes "Michael Stonebraker is well known in the database business, and for good reason. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company's software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Streaming data without storing it on disk gives it a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, while its competitors can handle only 900 messages per second. Too good to be true? This overview contains more details and references."

9 of 194 comments

  1. Seriously, Michael by Anonymous Coward · · Score: 4, Insightful

    How much does Roland Piquepaille pay you to link to his shitty articles?

    It must be a lot, since the pay for play is so obvious.

  2. I call foul by RFC959 · · Score: 3, Insightful

    I call foul. This quote from the article was what got to me:

    Traditional systems bog down because they first store data on hard drives or in main memory and then query it, Stonebraker says.

    So they manage to do their analysis without even touching main memory? Nifty! What do they do, make it all fit in the L1 data cache? OK, maybe the guy was misquoted - I trust reporters about as far as I can throw them - but the whole thing just smells funny to me. I'm betting that the massive speedup they report is only for carefully selected, pre-groomed data sets. I agree that analyzing data as it comes in rather than storing it up to recrunch later is the smart thing to do, but that insight isn't a breakthrough of the kind the article is spinning this as.

  3. Has nothing to do with relational databases by Wesley+Felter · · Score: 5, Insightful

    If Roland had RTFA, he'd have realized that this StreamBase thing is not a relational database and does not do the job of a traditional relational database. The whole point is that it uses a different architecture to solve problems that don't map well to relational databases.

    1. Re:Has nothing to do with relational databases by Anonymous Coward · · Score: 5, Insightful

      I'm not sure that is an accurate critique of Roland. He likely did RTFA -- he just didn't UTFA.

  4. Re:Duh by Anonymous Coward · · Score: 4, Insightful

    You've possibly misunderstood the point of this software.

    At no time is the data 'stored' in any way. As it's collected (or INSERTed) it passes through a collection of preconfigured SELECT statements, and then disappears. There are no tables full of data, only tables as defined structures for handling incoming and outgoing data.

    You cannot query anything that happened in the past, because the program doesn't remember it.
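
To make the "preconfigured SELECT statements" idea concrete, here is a minimal Python sketch. It is not StreamBase's actual API or query language; the names `Tick`, `StandingQuery`, and `run_queries` are hypothetical and exist only for illustration. Each incoming record is tested against a fixed set of standing queries, matches are emitted, and the record itself is never kept.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Tick:
    """One incoming message (hypothetical shape, for illustration only)."""
    symbol: str
    price: float


@dataclass(frozen=True)
class StandingQuery:
    """A query registered before any data arrives, like a preconfigured SELECT."""
    name: str
    predicate: Callable[[Tick], bool]


def run_queries(stream: Iterable[Tick],
                queries: List[StandingQuery]) -> Iterator[Tuple[str, Tick]]:
    """Pass each record through every standing query, then let it go.

    Nothing is appended to a table or list here, so once a record has
    been checked there is no way to query it again later.
    """
    for tick in stream:
        for query in queries:
            if query.predicate(tick):
                yield (query.name, tick)


def demo() -> List[Tuple[str, Tick]]:
    queries = [
        StandingQuery("big_price", lambda t: t.price > 100.0),
        StandingQuery("acme_only", lambda t: t.symbol == "ACME"),
    ]
    # A generator stands in for the network stream: it can be read only once.
    ticks = (Tick(s, p) for s, p in [("ACME", 99.0), ("XYZ", 150.0), ("ACME", 101.0)])
    return list(run_queries(ticks, queries))
```

Because the input is a one-shot generator, registering a new query after `demo()` has run would find nothing to read, which is the "cannot query the past" property described above.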

  5. Read the article before posting by bigtallmofo · · Score: 5, Insightful

    Before another dozen people post about how in-memory databases have been done before, please read the article. They're specifically not talking about in-memory or on-disk databases. They're reading the data and analyzing it in real time as it flows through the network. For everyone asking how they're going to back such data up, you don't need to back up data that is useless 1 second after it has flowed through your network.

    --
    I'm a big tall mofo.

    1. Re:Read the article before posting by kpharmer · · Score: 4, Insightful

      Right, and this solution has its own limitations within this context: namely that if you crunch your data real time, rather than read it from a data store:

      1. if you decide to add a new analytic you have to start with new data - you can't deploy a new analytical component and run it against historical data.

      2. if your machine crashes - it takes all your accumulated analytical data along with it. Maintaining a distribution of activity calculated every 5 minutes over 90 days? Great, but after the server comes back up your data starts all over.

      3. if your analytical component needs to run against a lot of history each time (ex: total number of unique telephone numbers accessed by day, calculate rolling median) then you'll have to maintain that detail data in memory. As you can imagine - you can *easily* identify calculations that will exceed your memory. So, to tune you'll be forced to keep your calculations to relatively recent data only.

      ken
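
Point 3 can be sketched in a few lines of Python. This is only an illustration under assumed names (`RollingMedian` and `window_size` are hypothetical, not anything from StreamBase): a rolling median needs the detail values themselves, so the only way to bound memory is to cap the window and silently forget older data.

```python
from collections import deque
from statistics import median


class RollingMedian:
    """Median over the most recent `window_size` values only.

    Memory use is proportional to window_size. Anything older than the
    window is discarded, which is the "relatively recent data only"
    trade-off: a longer history means more detail held in memory.
    """

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # deque with maxlen drops the oldest value automatically on append.
        self._window = deque(maxlen=window_size)

    def update(self, value: float) -> float:
        """Add one value from the stream and return the current median."""
        self._window.append(value)
        return median(self._window)


def demo() -> list:
    rolling = RollingMedian(window_size=3)
    return [rolling.update(v) for v in [5, 1, 9, 2, 8]]
```

The same object also illustrates point 2: the window lives only in process memory, so a restart begins again with an empty window.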

  6. Seems kinda silly to me. by boodaman · · Score: 3, Insightful

    OK, I get what they're trying to do, but my question: so what?

    Sooner or later you have to put something somewhere. Let's say you monitor a battalion in battle in realtime. All of these messages are streaming in and being analyzed. Great. But now what? So something triggers an alert, say. Well, what's tracking the status of the alert? Wouldn't you want to track the status of an alert saying "this Humvee is off course"? Wouldn't you want to track whether someone had acknowledged the alert, and what they did about it?

    And don't forget there are liability issues, historical issues, and more. You're a stock trader, all of these messages are coming in and being analyzed. You get an alert...one of your triggers tripped. You make a trade as a result, only to find out 30 minutes later that the trigger was WRONG and your trade was WRONG and you (or your company) are out $10 million. How do you prove that you made the trade based on the trigger like you were supposed to and not because you f**ked up? The trigger, and the data that caused it to trip, is long gone. What do you do now?

    Eventually something has to be written (stored) somewhere, sometime. I guess I can see the need for summarizing data and only storing what StreamBase says is "important" but how would you know if everything was OK if the actual data driving everything was long gone?

  7. Re:speed focus by Aeiri · · Score: 3, Insightful

    One question might be...why write the data directly to a database initially? Why not utilize a faster format, then write to the DB when things have slowed down (i.e. caching)?

    If the server crashes while it's still in "write later mode", then data will be lost. Since most of the time servers crash BECAUSE of high traffic, this can be kind of bad.
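
A rough Python sketch of that failure mode, with hypothetical names throughout (`WriteBehindBuffer`, `flush_threshold`, and a plain list standing in for the durable database): records are acknowledged as soon as they are buffered, but only flushed batches survive a crash.

```python
class WriteBehindBuffer:
    """Toy write-behind cache: buffer writes in memory, flush in batches.

    `store` stands in for the durable database (an assumption for this
    sketch). Anything still in `_pending` exists only in process memory.
    """

    def __init__(self, store: list, flush_threshold: int) -> None:
        self._store = store
        self._pending = []
        self._threshold = flush_threshold

    def write(self, record) -> None:
        """Accept a record immediately; persist it only once a batch fills."""
        self._pending.append(record)
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self) -> None:
        """Move all buffered records to the durable store."""
        self._store.extend(self._pending)
        self._pending.clear()

    def pending_count(self) -> int:
        return len(self._pending)


def simulate_crash() -> tuple:
    durable = []
    buffer = WriteBehindBuffer(durable, flush_threshold=3)
    for record in range(5):
        buffer.write(record)
    lost = buffer.pending_count()  # records accepted but never persisted
    del buffer                     # the "crash": in-memory state is gone
    return durable, lost
```

Under heavy traffic the pending batch is at its fullest, so the moment a crash is most likely is also the moment it loses the most data.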