Streaming a Database in Real Time



Posted by michael on Friday January 21, 2005 @11:20AM from the never-query-in-the-same-river-twice dept.

Roland Piquepaille writes "Michael Stonebraker is well-known in the database business, and for good reason. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company's software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Streaming data without storing it on disk gives it a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, while its competitors can only handle 900 messages per second. Too good to be true? This overview contains more details and references."



For sensor networks by Anonymous Coward · 2005-01-21 11:41 · Score: 4, Interesting

So this is mostly for sensor networks, where you have hundreds (or thousands) of small, cheap sensors sending data to a nearby controller. The controller doesn't need to store every bit of data it receives; it just calculates some prespecified queries (histograms, running sums, checking for trigger conditions, etc.) on them and might store some small window of data for ad hoc queries. These systems are more similar to dataflow applications than traditional databases.
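
A minimal Python sketch of that kind of controller, to make the idea concrete. Everything here is an assumption for illustration (the `SensorController` name, the threshold, the bucket width, the window size); it is not how StreamBase or any particular sensor system works. Each reading updates a few fixed aggregates and is then forgotten, apart from a small bounded window kept for ad hoc queries.

```python
from collections import Counter, deque


class SensorController:
    """Hypothetical controller: updates prespecified aggregates per reading."""

    def __init__(self, trigger_threshold, bucket_width, window_size):
        self.count = 0
        self.running_sum = 0.0
        self.histogram = Counter()               # bucket index -> reading count
        self.recent = deque(maxlen=window_size)  # small window for ad hoc queries
        self.trigger_threshold = trigger_threshold
        self.bucket_width = bucket_width
        self.alerts = []

    def on_reading(self, sensor_id, value):
        # Constant work per reading; the full history is never stored.
        self.count += 1
        self.running_sum += value
        self.histogram[int(value // self.bucket_width)] += 1
        if value > self.trigger_threshold:       # trigger condition
            self.alerts.append((sensor_id, value))
        self.recent.append((sensor_id, value))   # oldest entry drops off when full

    def mean(self):
        return self.running_sum / self.count if self.count else None
```

Memory use stays bounded by the number of histogram buckets plus the window size, no matter how many readings arrive (the `alerts` list is the exception and would need draining in a real system).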

Seems similar to his Aurora project... Stonebraker has a history of turning his university research projects into successful startups.

Re:I call foul by ComputerSlicer23 · 2005-01-21 11:50 · Score: 5, Interesting

Hmmm, I guess. My guess is that they have implemented something akin to SQL for data streams. You define a message format. Think of each message as a row in the table. The message format is the table schema.
You have a "standing query". So you can ask things like: what's the rolling average for the last 60 seconds for this ticker name? What's the minimum price for this commodity?
You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.
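
To illustrate the "standing query" idea, here is a rough Python sketch of a rolling 60-second average per ticker. The class and method names are made up for this example, and it assumes timestamps arrive in non-decreasing order for each ticker; it says nothing about how StreamBase actually does it. The query is registered once, and every incoming message updates the answer.

```python
from collections import defaultdict, deque


class RollingAverage:
    """Hypothetical standing query: average price per ticker over a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)  # ticker -> deque of (timestamp, price)
        self.sums = defaultdict(float)    # ticker -> sum of prices in the window

    def on_message(self, ticker, price, timestamp):
        q = self.events[ticker]
        q.append((timestamp, price))
        self.sums[ticker] += price
        # Evict messages that have fallen out of the window.
        while q[0][0] <= timestamp - self.window:
            _, old_price = q.popleft()
            self.sums[ticker] -= old_price
        # The message just added is always inside the window, so q is non-empty.
        return self.sums[ticker] / len(q)


ra = RollingAverage(window_seconds=60.0)
ra.on_message("ACME", 10.0, timestamp=0.0)
ra.on_message("ACME", 20.0, timestamp=30.0)
print(ra.on_message("ACME", 30.0, timestamp=70.0))  # the t=0 message has expired
```

Only the messages inside the window are kept in memory. A long-running version would probably want to recompute the sum now and then, since repeated floating-point adds and subtracts can drift.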
It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.
It's much smarter to do that than to write every last transaction to disk and then query the transactions after the fact. That'd be the natural way to think about it if you used a relational database.
Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.
While each single application of this sounds trivial to implement, if he has done it in a generalized way that can keep pace with larger systems, bully for him.
The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.
Even for those patients, there were rows you weren't interested in and had to filter out. You spent a bunch of time filtering, and re-filtering.
We wrote the raw messages to disk, and spooled them to ensure we didn't miss messages due to database problems (if the database was down, you had to spool until the database came back up; it was unacceptable to miss patient records because of database maintenance).
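
A rough Python sketch of that filter-then-spool pattern, under heavy assumptions: messages are treated as already-parsed dicts with a `patient_id` key (real HL7 is pipe-delimited segments), `store` is a hypothetical callable that raises `ConnectionError` while the database is down, and the spool is a plain JSON-lines file. It is only meant to show the shape of the approach, not the system described above.

```python
import json
import os


class SpoolingLoader:
    """Hypothetical loader: filter messages, spool to disk, load when the DB is up."""

    def __init__(self, spool_path, patients_of_interest, store):
        self.spool_path = spool_path
        self.patients = set(patients_of_interest)
        self.store = store  # assumed to raise ConnectionError when the DB is down

    def on_message(self, message):
        if message["patient_id"] not in self.patients:
            return  # filtered out: not a patient this department cares about
        # Spool first, so nothing is lost if the database is unavailable.
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")

    def drain(self):
        """Try to store spooled messages; keep whatever could not be stored."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        stored = 0
        try:
            for msg in pending:
                self.store(msg)
                stored += 1
        except ConnectionError:
            pass  # database is down; leave the rest in the spool
        with open(self.spool_path, "w", encoding="utf-8") as f:
            for msg in pending[stored:]:
                f.write(json.dumps(msg) + "\n")
        return stored
```

A production version would also need to fsync the spool and rewrite it atomically, which this sketch skips.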
Kirby