Database Bigwigs Lead Stealthy Open Source Startup
BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him, he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise — a Linux-based system that handles queries 100 times faster than traditional relational database management systems."
The article mentions that Red Hat and HP are listed among their partners. I'm not surprised by Red Hat or Informatica (another partner, though it isn't mentioned in the article), but I was a little surprised by HP, since they have been trying to get the word out about their own data warehousing and BI stuff. I wonder what that indicates about how they regard this new player.
Also interesting is the Wikipedia article on Michael Stonebraker, if you aren't already familiar with him.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
The question is when will this be ported to a mainstream OS such as Windows?
It was LAMP, now it's LAVA. Much cooler name.
-----BEGIN PGP SIGNATURE-----
12345
-----END PGP SIGNATURE-----
"grid-enabled, column-oriented relational database management system"
What does that mean?
If anything.
Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.
These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.
Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time.
This is some new Network World definition of "Stealthy", apparently...
KX is primarily in-memory. The competing column-oriented product is primarily Sybase IQ, which has been on the market for a while now.
What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely replacing magnetic disks? The technology field is littered with such fiascos.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
http://en.wikipedia.org/wiki/Column-oriented_DBMS
It's basically an optimization of the current data access patterns. Databases have been row-oriented for decades, because they evolved from fixed width flat files. Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible. For example:
select * from names where lastname LIKE '%son'
The above query might have an index available to find what it needs. But it's just as likely that the database will need to do a table-scan. Since table-scans involve looking through every record in the database, you can imagine that it would be faster to just load the lastname column rather than loading every row in the database just to discard 90% of that data.
Javascript + Nintendo DSi = DSiCade
Column oriented is easy. Imagine a database as a set of tables, each of which has rows of data records, in organized columns (column 1 = "User name", column 2 = "User ID", column 3 = "Favorite slashdot admin", etc).
Normal row-oriented databases store records which have a row of the data: "User name", "User ID", "Favorite slashdot admin" for user row #12345.
Column oriented databases store records which have a column of the data: "User name" for user rows 1-100,000; "User ID" for user rows 1-100,000; etc.
Updates are faster with row-oriented: you access the last record file and append something, or access an intermediate record file and update one "row" across.
Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match. Instead of going through the whole database top to bottom for the search, you just search on the one column. If you have 100 columns of data, then you look through 1/100th of the total data in the search. To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.
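To make the row vs. column difference concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the sample users, the `row_store` / `column_store` names, the two search functions); it is not how any real product lays out its files, just the idea described above in miniature.

```python
# Hypothetical toy data; names and values are illustrative only.
rows = [
    ("alice", 1, "Phred"),
    ("bob", 2, "Taco"),
    ("carol", 3, "Phred"),
]

# Row-oriented layout: one record per row, all fields stored together.
row_store = list(rows)

# Column-oriented layout: one list per column, aligned by row position.
column_store = {
    "user_name": [r[0] for r in rows],
    "user_id": [r[1] for r in rows],
    "favorite_admin": [r[2] for r in rows],
}

def search_row_store(store, value):
    # Every full record is touched, even though only one field is tested.
    return [rec for rec in store if rec[2] == value]

def search_column_store(store, value):
    # Scan only the one column, collecting the positions that match...
    hits = [i for i, v in enumerate(store["favorite_admin"]) if v == value]
    # ...then index into the other columns to rebuild the matching rows.
    return [
        (store["user_name"][i], store["user_id"][i], store["favorite_admin"][i])
        for i in hits
    ]

print(search_row_store(row_store, "Phred"))
print(search_column_store(column_store, "Phred"))
```

Both searches return the same rows. The difference is how much data each one has to read to get there, which only starts to matter once the table is wide and no longer fits in memory.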
Indexes are useful, but column-oriented is more efficient in some ways. You don't have to maintain the indexes, and can just automatically search any column without having indexed it, in a reasonably efficient manner.
Column-oriented also lets you compress the data on the fly efficiently: all the records are the same data type (string, integer, date, whatever) and lists of same data types compress well, and uncompress typically far faster than you can pull them off disk, so you can just automatically do it for all the data and save both space and time...
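As a rough illustration of why a single column compresses well, here is a run-length encoding sketch in Python. RLE is just one simple scheme picked for the example; I'm not claiming it is what any particular column store uses, and the function names and sample column are made up.

```python
from itertools import groupby

def rle_encode(column):
    # Collapse each run of identical adjacent values into (value, run_length).
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    # Expand (value, run_length) pairs back into the original column.
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical column: values of one type, with long repeated runs.
column = ["Phred"] * 4 + ["Taco"] * 3 + ["Phred"] * 2
encoded = rle_encode(column)
print(encoded)
```

Nine stored values shrink to three pairs here. A row-oriented file interleaves names, ids, and dates, so runs like this rarely show up.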
during the transition when you tell people your business runs on LAVA-LAMP technology.
I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.
He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with ids 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
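A minimal sketch of that partition-and-aggregate idea in Python, assuming simple range partitioning with 10 rows per node as in the example above. The "nodes" are plain dicts standing in for separate servers, and all names here (`node_for`, `build_cluster`, `query_id_less_than`) are hypothetical.

```python
ROWS_PER_NODE = 10  # ids 1-10 on node 0, 11-20 on node 1, and so on

def node_for(row_id):
    # Range partitioning: map a row id to the node that owns it.
    return (row_id - 1) // ROWS_PER_NODE

def build_cluster(num_nodes):
    # Each node holds only its own slice of the data (shared nothing).
    nodes = [dict() for _ in range(num_nodes)]
    for row_id in range(1, num_nodes * ROWS_PER_NODE + 1):
        nodes[node_for(row_id)][row_id] = f"row-{row_id}"
    return nodes

def query_id_less_than(nodes, limit):
    # Rough equivalent of "select * from table where id < limit".
    results = []
    for index, node in enumerate(nodes):
        lowest_id = index * ROWS_PER_NODE + 1
        if lowest_id >= limit:
            continue  # this node cannot hold a matching row, so skip it
        # Each node filters its own rows; the coordinator gathers them.
        results.extend((rid, val) for rid, val in node.items() if rid < limit)
    return sorted(results)

nodes = build_cluster(3)
result = query_id_less_than(nodes, 15)
print(result)
```

In a real system the per-node filtering would run in parallel on separate machines, and the gather step is where network bandwidth and latency come into the optimizer's cost model.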
My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.
There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.
By which I am asking: Vertica is obviously well-researched and well funded as a startup, but MonetDB is well-researched, already benchmarked, and available now. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than put that time and energy into developing market leadership in my chosen corporate area in the present?
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
You're 100 times faster than anyone else, obviously.
How dare you be so modest!! You conceited bastard!!
Vertica is not open source. Not sure where the confusion came from.
Note: I work for Vertica.
See http://en.wikipedia.org/wiki/Bigtable for a description of Google's column oriented database.
Note: I work for Vertica
This looks like it will be a commercial version of C-Store, the column-oriented DBMS developed by Michael Stonebraker and MIT:
- Web site: http://db.lcs.mit.edu/projects/cstore/
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
Oh ye of little faith, here I present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark
+Raider of the lost BBS