Slashdot Mirror


Streaming a Database in Real Time

Roland Piquepaille writes "Michael Stonebraker is well-known in the database business, and for good reasons. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company's software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Streaming data without storing it on disk gives them a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, when its competitors can only deal with 900 messages per second. Too good to be true? This overview contains more details and references."

12 of 194 comments

  1. Go Roland by Anonymous Coward · · Score: 0, Interesting

    Yet another winning post by Roland Piquepaille!

    The guy should write a book. It would be bland, devoid of content, have an ad on every page, but it would quote prolifically from NYT Top 100 Best-sellers list.

  2. For sensor networks by Anonymous Coward · · Score: 4, Interesting

    So this is mostly for sensor networks, where you have hundreds (or thousands) of small, cheap sensors sending data to a nearby controller. The controller doesn't need to store every bit of data it receives; it just calculates some prespecified queries (histograms, running sums, checking for trigger conditions, etc.) on them and might store some small window of data for ad hoc queries. These systems are more similar to dataflow applications than traditional databases.

    Seems similar to his Aurora project. Stonebraker has a history of turning his university research projects into successful startups.
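
    A minimal sketch of the kind of controller described above, assuming readings arrive as (sensor id, value) pairs. The class name, bucket width, and alarm threshold are made up for illustration; nothing here is taken from StreamBase or Aurora.

```python
from collections import deque


class SensorController:
    """Keeps running aggregates plus a small window of recent readings."""

    def __init__(self, window_size, bucket_width, alarm_threshold):
        self.count = 0
        self.total = 0.0
        self.histogram = {}                       # bucket start -> number of readings
        self.recent = deque(maxlen=window_size)   # small window for ad hoc queries
        self.bucket_width = bucket_width
        self.alarm_threshold = alarm_threshold
        self.alarms = []

    def on_reading(self, sensor_id, value):
        # Update each prespecified query incrementally; the full history is never stored.
        self.count += 1
        self.total += value
        bucket = int(value // self.bucket_width) * self.bucket_width
        self.histogram[bucket] = self.histogram.get(bucket, 0) + 1
        if value > self.alarm_threshold:          # trigger condition
            self.alarms.append((sensor_id, value))
        self.recent.append((sensor_id, value))    # oldest reading drops off automatically

    def mean(self):
        return self.total / self.count if self.count else None


controller = SensorController(window_size=3, bucket_width=10, alarm_threshold=95)
for sensor_id, value in [("a", 12.0), ("b", 18.0), ("a", 97.0), ("c", 41.0)]:
    controller.on_reading(sensor_id, value)
```

    Memory use depends on the window size and the number of histogram buckets, not on how many readings have gone by.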

  3. Re:I call foul by ComputerSlicer23 · · Score: 5, Interesting

    Hmmm, I guess. My guess is that they have implemented something akin to SQL for data streams. You define a message format. Think of each message as a row in the table. The message format is the table schema.

    You have a "standing query". So you can ask things, like, what's the rolling average for the last 60 seconds for this ticker name. What's the minimum price for this commodity.
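
    A rough sketch of what such a standing query could look like, assuming each message carries a ticker, a timestamp in seconds, and a price, and that timestamps never go backwards for a given ticker. The class and method names are hypothetical and are not StreamBase's API.

```python
from collections import defaultdict, deque


class RollingAverage:
    """Standing query: average price per ticker over the last `window` seconds."""

    def __init__(self, window=60.0):
        self.window = window
        self.ticks = defaultdict(deque)   # ticker -> deque of (timestamp, price)
        self.sums = defaultdict(float)    # ticker -> sum of prices still in the window

    def on_message(self, ticker, timestamp, price):
        q = self.ticks[ticker]
        q.append((timestamp, price))
        self.sums[ticker] += price
        # Evict ticks that have fallen out of the window.
        while q[0][0] <= timestamp - self.window:
            _, old_price = q.popleft()
            self.sums[ticker] -= old_price
        # The newest tick is always inside the window, so q is never empty here.
        return self.sums[ticker] / len(q)


query = RollingAverage(window=60.0)
first = query.on_message("ACME", 0.0, 10.0)
second = query.on_message("ACME", 30.0, 20.0)
third = query.on_message("ACME", 70.0, 30.0)   # the tick at t=0 has expired by now
```

    The answer is updated as each message arrives, so only the ticks inside the window are held in memory and nothing has to be written to disk first.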

    You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.

    It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.

    It's much smarter to do that than to write every last transaction to disk, and then query the transactions after the fact. That'd be the natural way to think about it if you used a relational database.

    Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.

    While each single application of this sounds trivial to implement, if he has done it in a generalized way that can keep pace with larger systems, bully for him.

    The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.

    Even for those patients, there were rows you weren't interested in and had to filter out. You spent a bunch of time filtering, and re-filtering.

    We wrote the raw messages to disk, and spooled them to ensure we didn't miss messages due to database problems (if the database was down, you had to spool until the database came back up; it was unacceptable to miss patient records because of database maintenance).
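
    A simplified sketch of the filter-then-spool pattern described above. It makes no attempt to parse real HL7: messages are plain dicts with a hypothetical "patient_id" field, the spool is a JSON-lines file, and the database is a stand-in object that raises ConnectionError while it is down.

```python
import json
import os
import tempfile


class SpoolingFilter:
    def __init__(self, spool_path, wanted_patients, store):
        self.spool_path = spool_path
        self.wanted = set(wanted_patients)
        self.store = store   # callable that writes one message to the database

    def on_message(self, msg):
        if msg["patient_id"] not in self.wanted:
            return           # not one of this department's patients
        # Spool to disk first so the message survives a database outage.
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(msg) + "\n")
        self.drain()

    def drain(self):
        """Deliver spooled messages in order; keep whatever could not be stored."""
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        delivered = 0
        try:
            for msg in pending:
                self.store(msg)
                delivered += 1
        except ConnectionError:
            pass             # database still down; try again on the next message
        # Rewriting the spool in place is not crash-safe; good enough for a sketch.
        with open(self.spool_path, "w", encoding="utf-8") as f:
            for msg in pending[delivered:]:
                f.write(json.dumps(msg) + "\n")


class FlakyDatabase:
    """Stand-in for the real database, for demonstration only."""

    def __init__(self):
        self.up = False
        self.rows = []

    def insert(self, msg):
        if not self.up:
            raise ConnectionError("database is down")
        self.rows.append(msg)


db = FlakyDatabase()
spool_file = os.path.join(tempfile.mkdtemp(), "spool.jsonl")
feed = SpoolingFilter(spool_file, {"p1"}, db.insert)
feed.on_message({"patient_id": "p1", "event": "admit"})      # spooled, database is down
feed.on_message({"patient_id": "p2", "event": "admit"})      # filtered out
db.up = True
feed.on_message({"patient_id": "p1", "event": "discharge"})  # spool drains in order
```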

    Kirby

  4. Classifier Systems: the Genetic Algor of streaming by G4from128k · · Score: 2, Interesting

    Classifier Systems are a genetic algorithm analog for this type of streaming data/pattern analysis. With classifier systems a stream of incoming messages interacts with a constantly evolving population of classifier rules and an internally changing pool of working messages to create a stream of outputs. A reward/feedback loop drives adaption of the rule system to reinforce when it creates "good" outputs. The entire Classifier System concept is analogous to the mammalian immune system in the way that neural nets are analogous to brains and genetic algorithms are analogous to Darwinian evolution.

    With a high enough stream processing speed (using StreamBase's methods), classifier systems might be useful for AI/adaptive learning scenarios.
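
    A heavily simplified sketch of the matching and reinforcement half of a classifier system, assuming fixed-length binary messages and rule conditions over "0", "1", and "#" (a wildcard). It leaves out the internal pool of working messages and the genetic search that evolves new rules, and the reward function is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    condition: str        # "0", "1", or "#" per position; "#" matches either bit
    action: str
    strength: float = 1.0


def matches(condition, message):
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message)
    )


def step(rules, message, reward_for):
    """Fire the strongest matching rule and reinforce it with the feedback."""
    candidates = [r for r in rules if matches(r.condition, message)]
    if not candidates:
        return None
    winner = max(candidates, key=lambda r: r.strength)
    winner.strength += reward_for(message, winner.action)
    return winner.action


def feedback(message, action):
    # Hypothetical environment: "buy" is right exactly when the message ends in 1.
    return 1.0 if (action == "buy") == message.endswith("1") else -0.5


rules = [Rule("1#", "sell"), Rule("#1", "buy")]
outputs = [step(rules, m, feedback) for m in ["11", "11", "01", "10"]]
```

    On the first "11" both rules match with equal strength and "sell" fires and is penalized; on the second "11" the stronger "buy" rule wins instead, which is the feedback loop at work.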

    --
    Two wrongs don't make a right, but three lefts do.

  5. Re:Duh by dubl-u · · Score: 3, Interesting

    As others have pointed out, the article is talking about something completely different than what you had in mind. Even so:

    Any of the enterprise databases will with gobs of memory end up caching the entire database in memory.

    That's still much slower than in-memory approaches that don't use a database at all. For apps that are amenable to the stick-it-all-in-RAM approach, serializing all your data access is a performance killer.

    A writeable database that doesn't need to be written to disk is not a database, it's called a nonpersistent cache.

    Well, there are different ways of guaranteeing reliability than the way databases do it. If you're keeping all your data hot, transaction logs with lazy snapshots may be a better solution than the database's approach, which treats the disk as the master copy and RAM as a place to store temporary copies.
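
    A minimal sketch of the log-plus-lazy-snapshot idea, assuming the whole data set fits in a Python dict and that keys and values are JSON-serializable. The file names and class are made up for illustration, and a real system would also need locking and log checksums.

```python
import json
import os
import tempfile


class LoggedStore:
    """In-memory dict made durable by an append-only log plus occasional snapshots."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.log_path = os.path.join(directory, "log.jsonl")
        self.data = {}
        self._recover()
        self.log = open(self.log_path, "a", encoding="utf-8")

    def _recover(self):
        # RAM is the master copy; disk is only read when starting up.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value   # replaying a put twice is harmless

    def put(self, key, value):
        # Append to the log and force it to disk before changing memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def snapshot(self):
        # Lazy snapshot: write a full copy, swap it in atomically, then reset the log.
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.snapshot_path)
        self.log.close()
        self.log = open(self.log_path, "w", encoding="utf-8")

    def close(self):
        self.log.close()


directory = tempfile.mkdtemp()
store = LoggedStore(directory)
store.put("ACME", 10)
store.put("INIT", 5)
store.snapshot()
store.put("ACME", 12)          # only in the log, not in the snapshot
store.close()

recovered = LoggedStore(directory)   # simulates a restart
state_after_restart = dict(recovered.data)
recovered.close()
```

    Reads never touch the disk, and writes cost one sequential append, which is where the speed advantage over a disk-as-master design would come from.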

  6. Re:speed focus by airjrdn · · Score: 3, Interesting

    SQL Server Table Variable, and, to a certain extent, derived table, same basic premise...it's in RAM, not on Disk.

    One question might be...why write the data directly to a database initially? Why not utilize a faster format, then write to the DB when things have slowed down (i.e. caching)?

    Admittedly I haven't read the article, but I am familiar with 200+G databases, and there are ways to deal with performance with current DB tech.

    I do welcome any new competition, but there are ways of querying data in memory already. Heck, put the whole thing on a RAM Drive...how much data can there be for stock tickers?

  7. Re:Classifier Systems: the Genetic Algor of stream by headkase · · Score: 2, Interesting

    Check out this diagram of a classifier system. It's taken from The Computational Beauty of Nature. The website isn't really up to date nowadays, but the full source code for everything in the book is available in both Linux and Windows downloads and there's a Java applet of all the examples too.
    The material covered in the book is also still very relevant and the book's a joy to read.
    You should buy it :^) Not astroturfing just really enjoyed the book myself.

    --
    Shh.

  8. My RTOS will do more than 1400 messages/sec by flyingrobots · · Score: 2, Interesting

    But the idea of a query engine in front of those messages is interesting.

    Yet, then what is LabView? We've been processing live real-time data streams for years.

    I still don't get the scope of it. It seems on one hand to be a lot of the same. This idea that they need this type of software to process data from remote sensors doesn't click. I process data from remote sensors in real-time all the time (no pun intended). There is no need to store it in a DBMS and then query it in order for the data to be useful. For historical reasons, yes, but it's never necessary.

  9. Who cares whether it runs on a $1500 PC ? by murr · · Score: 2, Interesting

    ... if the software costs $300K

  10. This is an old concept. by enewhuis · · Score: 3, Interesting

    My first reaction is: He is late in the game. Check out www.kx.com. They've already done this. And this kind of thing has been used for years to analyze real-time stock and commodities trading data as the trades occur in real-time. I've deployed several systems that are essentially streaming databases like this. Or did I miss something here?

  11. Re:Read the article before posting by CounterZer0 · · Score: 2, Interesting

    Why do distributions and such on the live data set? Stream through this system at high speed, and drop the data onto a data warehouse, whose *entire purpose in life* is to do historical crap.

  12. Re:THE TRUTH ABOUT ROLAND PIQUEPAILLE by acslat3r · · Score: 2, Interesting

    Perhaps you should stop worrying about how much he is making off of his online journal and instead put that time into a competitive "online journal" that will net you $1200 a month!! Go for 20 accepted submissions and sit back and watch the cash come in by the truckloads...