Streaming a Database in Real Time
Roland Piquepaille writes "Michael Stonebraker is well known in the database business, and for good reason. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company's software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Streaming data without storing it on disk gives them a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, while its competitors can only handle 900 messages per second. Too good to be true? This overview contains more details and references."
How much does Roland Piquepaille pay you to link to his shitty articles?
It must be a lot, since the pay-for-play is so obvious.
So they manage to do their analysis without even touching main memory? Nifty! What do they do, make it all fit in the L1 data cache? OK, maybe the guy was misquoted - I trust reporters about as far as I can throw them - but the whole thing just smells funny to me. I'm betting that the massive speedup they report is only for carefully selected, pre-groomed data sets. I agree that analyzing data as it comes in rather than storing it up to recrunch later is the smart thing to do, but that insight isn't a breakthrough of the kind the article is spinning this as.
If Roland had RTFA, he'd have realized that this StreamBase thing is not a relational database and does not do the job of a traditional relational database. The whole point is that it uses a different architecture to solve problems that don't map well to relational databases.
You've possibly misunderstood the point of this software.
.. As it's collected (or INSERTed) it passes through a collection of preconfigured SELECT statements, and then disappears. There are no tables full of data, only tables as defined structures for handling incoming and outgoing data.
At no time is the data 'stored' in any way.
You cannot query anything that happened in the past, because the program doesn't remember it.
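To make the "preconfigured SELECT statements" idea concrete, here is a minimal sketch in Python. It is not StreamBase's actual API or query language; the names Tick, StandingQuery, and run_stream are hypothetical and made up for illustration. Each record is checked against every standing query as it arrives, matches are emitted, and then the record is dropped.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Tick:
    """One incoming message (hypothetical shape, for illustration only)."""
    symbol: str
    price: float


@dataclass
class StandingQuery:
    """A query registered up front, before any data arrives."""
    name: str
    predicate: Callable[[Tick], bool]


def run_stream(ticks: Iterable[Tick],
               queries: List[StandingQuery]) -> Iterator[Tuple[str, Tick]]:
    """Pass each tick through every standing query, then forget it."""
    for tick in ticks:
        for query in queries:
            if query.predicate(tick):
                yield (query.name, tick)
        # Nothing is appended to any table here: once the loop moves on,
        # this function keeps no reference to the tick.


# Hypothetical standing query: alert when ACME trades above 100.
queries = [
    StandingQuery("acme_above_100",
                  lambda t: t.symbol == "ACME" and t.price > 100.0),
]

feed = [Tick("ACME", 99.5), Tick("ACME", 101.2), Tick("INIT", 250.0)]
alerts = list(run_stream(iter(feed), queries))
```

Note that there is no way to ask this sketch "what was ACME's price five ticks ago?", which is the trade-off described above.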
Before another dozen people post about how in-memory databases have been done before, please read the article. They're specifically not talking about in-memory or on-disk databases. They're reading the data and analyzing it in real time as it flows through the network. For everyone asking how they're going to back such data up, you don't need to back up data that is useless 1 second after it has flowed through your network.
OK, I get what they're trying to do, but my question is: so what?
Sooner or later you have to put something somewhere. Let's say you monitor a battalion in battle in realtime. All of these messages are streaming in and being analyzed. Great. But now what? So something triggers an alert, say. Well, what's tracking the status of the alert? Wouldn't you want to track the status of an alert saying "this Humvee is off course"? Wouldn't you want to track whether someone had acknowledged the alert, and what they did about it?
And don't forget there are liability issues, historical issues, and more. Say you're a stock trader, and all of these messages are coming in and being analyzed. You get an alert: one of your triggers tripped. You make a trade as a result, only to find out 30 minutes later that the trigger was WRONG and your trade was WRONG and you (or your company) are out $10 million. How do you prove that you made the trade based on the trigger like you were supposed to and not because you f**ked up? The trigger, and the data that caused it to trip, is long gone. What do you do now?
Eventually something has to be written (stored) somewhere, sometime. I guess I can see the need for summarizing data and only storing what StreamBase says is "important" but how would you know if everything was OK if the actual data driving everything was long gone?
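One possible middle ground, sketched below in Python, is to keep nothing from the stream except the alerts plus the exact record that tripped each one. This is only an illustration of that idea, not anything the article says StreamBase does; the function process, the field deviation_m, and the in-memory log standing in for durable storage are all assumptions.

```python
import io
import json


def process(readings, limit, audit_log):
    """Scan readings once; persist only the ones that trip the trigger."""
    alerts = 0
    for reading in readings:
        if reading["deviation_m"] > limit:
            # Write the alert together with the data that caused it, so the
            # decision can be reconstructed later. Everything else is dropped.
            entry = {"alert": "off_course", "trigger": reading}
            audit_log.write(json.dumps(entry) + "\n")
            alerts += 1
    return alerts


# io.StringIO stands in for an append-only file or database table.
log = io.StringIO()
readings = [
    {"vehicle": "humvee-7", "deviation_m": 12.0},
    {"vehicle": "humvee-7", "deviation_m": 480.0},
]
alert_count = process(readings, limit=100.0, audit_log=log)
audit_entries = [json.loads(line) for line in log.getvalue().splitlines()]
```

This answers "why did the trigger fire?" but not "was everything else OK?", since the readings that did not trip anything are still gone.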
One question might be...why write the data directly to a database initially? Why not utilize a faster format, then write to the DB when things have slowed down (i.e. caching)?
If the server crashes while it's still in "write later mode", then data will be lost. Since most of the time servers crash BECAUSE of high traffic, this can be kind of bad.
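A rough sketch of that "write later" risk, in Python. The class WriteBehindBuffer and the list standing in for the database are hypothetical; the point is only that whatever is still sitting in the buffer at crash time never reaches durable storage.

```python
class WriteBehindBuffer:
    """Collect records in memory and flush them to the sink in batches."""

    def __init__(self, sink, max_pending):
        self.sink = sink              # stands in for the durable database
        self.pending = []             # records accepted but not yet durable
        self.max_pending = max_pending

    def write(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        self.sink.extend(self.pending)
        self.pending.clear()


durable = []
buf = WriteBehindBuffer(durable, max_pending=3)
for i in range(5):
    buf.write(i)

# If the process died right now, these records would be gone for good.
lost_on_crash = list(buf.pending)
```

A smaller max_pending shrinks the window of loss but gives back some of the speed the cache was supposed to buy.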