Database Bigwigs Lead Stealthy Open Source Startup
BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him, he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise — a Linux-based system that handles queries 100 times faster than traditional relational database management systems."
The article mentions that redhat and hp are listed among their partners. i'm not surprised by red hat or informatica (another partner though they aren't mentioned in the article) but i was a little surprised by hp - since they have been trying to get the word out about their own data warehousing and bi stuff. i wonder what that indicates about how they regard this new player.
also interesting is the wikipedia article on Michael Stonebraker if you aren't already familiar with him.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
The article seems to describe the big advantage as being column oriented.
How does this differ from Kx Systems' kdb (www.kx.com), which IIRC is similar in that way and is already in use at many if not most major financial institutions (see their customer list)?
The question is when will this be ported to a mainstream OS such as Windows?
It was LAMP, now it's LAVA. Much cooler name.
"grid-enabled, column-oriented relational database management system"
What does that mean?
If anything.
This is totally what we need.
With commodity hardware getting faster and cheaper by the minute, having a system that can handle a higher than average load with optimized software is, imho, a winner.
I'm sure everyone here can add some anecdotal evidence to how they had a heavy-hardware, database serving machine die on them because of some software bug.
This is one of the reasons I've been looking forward to ZFS. Hopefully the DB gurus will take the best of what's good about software, drop the legacy crap and really deliver something that's going to handle the kind of load that a good slashdotting delivers with hardware that didn't require a lease to be affordable.
This is not the greatest
V
What if the Hokey Pokey really is what it's all about?
Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.
These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.
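To put a rough number on the random-read problem described above, here is a back-of-envelope sketch. The 10 ms per random read and the disk counts are assumptions picked for illustration, not measurements of any particular drive or database.

```python
# Back-of-envelope estimate; every number below is an assumption, not a measurement.
SEEK_MS = 10.0          # assumed average seek + rotational latency per random read
RECORDS = 1_000_000     # random records to fetch, as in the comment above


def random_read_seconds(records: int, seek_ms: float, spindles: int = 1) -> float:
    """Time to fetch `records` rows if each one costs a full random seek
    and the work is spread evenly over `spindles` independent disks."""
    return records * seek_ms / 1000.0 / spindles


one_disk = random_read_seconds(RECORDS, SEEK_MS)
eight_disks = random_read_seconds(RECORDS, SEEK_MS, spindles=8)
print(f"1 disk: {one_disk / 3600:.1f} h, 8 disks: {eight_disks / 60:.1f} min")
```

Under those assumptions a single disk spends hours on seeks alone, which is why caching, clustering the data, or spreading it over many spindles ends up being reinvented in every application.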
smaller in number - but i'm willing to bet much more profitable and growing rapidly. we've been looking at data warehousing options and frankly most of them suck in one way or another. if someone can do it right - they can make a killing.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time.
This is some new Network World definition of "Stealthy", apparently...
What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely replacing magnetic disks? The technology field is littered with such fiascos.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Network World is a trade rag. To them, anything not advertised is stealthy. Especially since they want to motivate people to think "oh no, I don't want to be stealthy, that means unknown! quick buy some advertising!"
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Watch...they'll run into patent problems with patents held by Oracle, Sybase, and MS.
http://en.wikipedia.org/wiki/Column-oriented_DBMS
It's basically an optimization of the current data access patterns. Databases have been row-oriented for decades, because they evolved from fixed width flat files. Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible. For example:
select * from names where lastname LIKE '%son'
The above query might have an index available to find what it needs. But it's just as likely that the database will need to do a table-scan. Since table-scans involve looking through every record in the database, you can imagine that it would be faster to just load the lastname column rather than loading every row in the database just to discard 90% of that data.
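A toy sketch of that table-scan difference, counting how many field values each layout has to touch to answer the LIKE '%son' query. The table contents and the function names are made up for illustration, and a real engine works on disk pages rather than Python lists.

```python
# Toy table; names and values are invented for illustration.
rows = [
    {"id": 1, "firstname": "Ann", "lastname": "Johnson", "city": "Oslo"},
    {"id": 2, "firstname": "Bob", "lastname": "Smith", "city": "Lyon"},
    {"id": 3, "firstname": "Cy", "lastname": "Anderson", "city": "Pisa"},
]

# Column layout: one list per column, aligned by position.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def scan_rows(rows):
    """Row store: every field of every row is loaded to test one column."""
    touched = 0
    hits = []
    for pos, row in enumerate(rows):
        touched += len(row)
        if row["lastname"].endswith("son"):   # LIKE '%son'
            hits.append(pos)
    return hits, touched


def scan_column(columns):
    """Column store: only the lastname values are loaded."""
    col = columns["lastname"]
    hits = [pos for pos, v in enumerate(col) if v.endswith("son")]
    return hits, len(col)


row_hits, row_touched = scan_rows(rows)
col_hits, col_touched = scan_column(columns)
```

Both scans find the same rows, but the row scan touches four values per row and the column scan touches one. With wider tables the gap grows accordingly.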
Javascript + Nintendo DSi = DSiCade
Column oriented is easy. Imagine a database as a set of tables, each of which has rows of data records, in organized columns (column 1 = "User name", column 2 = "User ID", column 3 = "Favorite slashdot admin", etc).
Normal row-oriented databases store records which have a row of the data: "User name", "User ID", "Favorite slashdot admin" for user row #12345.
Column oriented databases store records which have a column of the data: "User name" for user rows 1-100,000; "User ID" for user rows 1-100,000; etc.
Updates are faster with row-oriented: you access the last record file and append something, or access an intermediate record file and update one "row" across.
Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match. Instead of going through the whole database top to bottom for the search, you just search on the one column. If you have 100 columns of data, then you look through 1/100th of the total data in the search. To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.
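The search-then-fetch step above can be sketched like this. The "column files" are plain Python lists here and the sample values are invented; the point is only that a match is a position, and every other column is read at that same position.

```python
# Each "column file" is a list; position i in every list belongs to row i.
user_name = ["alice", "bob", "carol", "dave"]
user_id = [101, 102, 103, 104]
favorite_admin = ["Phred", "Taco", "Phred", "Hemos"]


def search(column, value):
    """Scan a single column and return the matching row positions."""
    return [pos for pos, v in enumerate(column) if v == value]


def fetch(positions, *cols):
    """Rebuild full rows by indexing every column at the same positions."""
    return [tuple(col[pos] for col in cols) for pos in positions]


matches = search(favorite_admin, "Phred")
result = fetch(matches, user_name, user_id, favorite_admin)
```

Only the final `fetch` touches the other columns, and only at the positions that matched.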
Indexes are useful, but column-oriented is more efficient in some ways. You don't have to maintain the indexes, and can just automatically search any column without having indexed it, in a reasonably efficient manner.
Column-oriented also lets you compress the data on the fly efficiently: all the records are the same data type (string, integer, date, whatever) and lists of same data types compress well, and uncompress typically far faster than you can pull them off disk, so you can just automatically do it for all the data and save both space and time...
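As one illustration of why a single-type column compresses well, here is a minimal run-length encoding sketch. RLE is just a convenient example chosen for this sketch; nothing in the thread says which compression scheme any particular product uses.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in runs for _ in range(n)]


# A sorted or low-cardinality column, the kind that compresses best this way.
column = ["CA", "CA", "CA", "NY", "NY", "TX", "TX", "TX", "TX"]
runs = rle_encode(column)
```

Nine values shrink to three pairs, and decoding is a simple loop. The same trick does much less for a row-oriented file, where neighbouring values belong to different columns and rarely repeat.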
run Windows for their website?
I prefer the "u" in honour as it seems to be missing these days.
during the transition when you tell people your business runs on LAVA-LAMP technology.
I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.
He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with id's 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
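A minimal sketch of that partition-and-aggregate idea, with in-process objects standing in for separate servers. The `Node` class, the ten-ids-per-node rule and the helper names are all hypothetical, and a real optimizer would also prune nodes that cannot hold matching rows.

```python
NODE_SIZE = 10  # assumed ids per node, matching the 1-10, 11-20 example


def node_for(row_id: int) -> int:
    """Range partitioning: ids 1-10 go to node 0, 11-20 to node 1, and so on."""
    return (row_id - 1) // NODE_SIZE


class Node:
    """Stands in for one server with its own private storage."""

    def __init__(self):
        self.rows = {}

    def select_id_less_than(self, limit):
        # Each node evaluates the predicate against its own data only.
        return [row for rid, row in self.rows.items() if rid < limit]


def build_cluster(rows, node_count):
    nodes = [Node() for _ in range(node_count)]
    for rid, payload in rows:
        nodes[node_for(rid)].rows[rid] = (rid, payload)
    return nodes


def scatter_gather(nodes, limit):
    """Send the same query to every node, then merge the partial results."""
    results = []
    for node in nodes:
        results.extend(node.select_id_less_than(limit))
    return sorted(results)


cluster = build_cluster([(i, f"row-{i}") for i in range(1, 31)], node_count=3)
answer = scatter_gather(cluster, limit=15)
```

Here the query "id < 15" is answered partly by node 0 and partly by node 1, while node 2 returns nothing.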
My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.
There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.
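For readers who have not met the two-phase distributed commit mentioned in drawback 2, here is a bare-bones sketch of the control flow. The `Participant` class is hypothetical and leaves out logging, timeouts and crash recovery, which is where most of the real cost and difficulty lives.

```python
class Participant:
    """Stands in for one database node taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: the node votes yes only if it can guarantee a later commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on every node or on none of them."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: everyone commits
            p.commit()
        return True
    for p in participants:                        # phase 2: everyone rolls back
        p.rollback()
    return False


good = [Participant("node-a"), Participant("node-b")]
bad = [Participant("node-a"), Participant("node-b", can_commit=False)]
good_ok = two_phase_commit(good)
bad_ok = two_phase_commit(bad)
```

Every multi-node update pays for two rounds of messages instead of one local write, which is the extra expense the comment refers to.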
... for a long time.
Classic RDBMSes are crutches. A forced-upon necessity we have to put up with for our app models to latch on to real world hardware and its limitations. A historically grown mess with an overhead so huge it's insane. With a Database PL and 30+ dialects of it from back in the days when we flew to the moon using a slide rule as primary means of calculation.
If what they claim is true, these guys are probably finally ditching the omnipresent, redundant n-fold layers of user and connection management in favour of a lean system that at last does away with the distinction between filesystem, database and data access layer. Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.
I tell you, finally ditching classic RDBMSes is *long* overdue, they're basically all the same ancient pile of rubble, from MySQL up to Oracle. If these guys are up to taking on this deed (or part of it) and they get finished when solid-state finally relieves our current super-slowpoking spinning metal disks on a broad scale we'll feel like being in heaven compared to the shit we still have to put up with today.
I wish these guys all the best. They appear to have the skills to do it and the authority to emphasise that today's RDBMSes and their underlying concepts are a relic of the past.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
Yeah, but what does its radar signature look like?
It's not bad, but the new startup synergistica that I'm working on is gonna be completely invisible.
Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
Yup, it is all about making the individual files smaller and more regular. Kinda the opposite of XML.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
By which I am asking that while Vertica is obviously well-researched and well funded as a start up, MonetDB is well-researched, already benchmarked and available now.. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than the time and energy, etc. to develop market leadership in my chosen corporate area in the present?
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
info week just ran an article on hp getting into data warehousing and bi that had this paragraph pretty early on: Until sitting down with InformationWeek recently, the company has been mum on the initiative--not so much as a peep from its normally talkative marketing team. Indeed, it's an unlikely move into a sector where IBM, Oracle, SAS Institute, and Teradata have years of experience, well regarded products, and loyal customers. Those four vendors--along with Microsoft, which has muscled in on the strength of its SQL Server database--hold about 85% of the $5.2 billion-a-year data warehousing software market, a sector IDC projects will grow 9.5% annually through 2010.
so you are right - there's a lot of opportunity there, even for a small player.
on a side note, i thought the opening paragraph described the current situation pretty well
For more than a decade, big companies and sophisticated data aggregators have adopted data warehouses, yet few have mastered them, and many have outright failed in the effort or have been scared off by the complexity. The goal is to give workers access to real-time data across departments and geographic units, but more often than not, data warehouses end up as costly clunkers with outdated, inconsistent, and missing information.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
You're 100 times faster than anyone else, obviously.
How dare you be so modest!! You conceited bastard!!
Uhm... wtf?
Seriously, you tested MySQL vs. other databases with "out of the box" setups? MySQL isn't a real database when running the MyISAM engine; you simply cannot compare that with anything else. And on top of that, try doing a proper insertion in MySQL, one single transaction with a few million rows, and see how well that does. Oh, and did you ever stop to think about _why_ MySQL performs so much faster on that test? Try doing it on an InnoDB table with a standard setup; even at 600k rows it slows to a crawl. (Easily fixable, but it requires some optimizations.)
Seriously, the reason why big vendors have a clause in their EULA telling people NOT to do benchmarks is exactly people like you: you have no idea what you are comparing, and just figured that setting something up out of the box would give good insight into the speed. Sheesh.
Oh, and the 100-fold increase in speed is very likely to happen, on certain types of queries. With a column-oriented (vertical) representation you can do a sequential scan on only the part of the data that you need, not the entire set, which should be very, very fast.
Vertica is not open source. Not sure where the confusion came from.
Note: I work for Vertica.
See http://en.wikipedia.org/wiki/Bigtable for a description of Google's column oriented database.
Note: I work for Vertica
Microsoft is backing them?
...following the principles of Heisenburger's Uncertain Cat...
This looks like it will be a commercial version of C-Store, the column-oriented database developed by Michael Stonebraker and MIT:
- Web site: http://db.lcs.mit.edu/projects/cstore/
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
This is a different kind of issue, really, more like the difference between a CPU and a GPU. At the moment, a good GPU has >100x the performance of a good CPU on a certain class of computations. Column stores will clearly never replace row stores for transaction processing for obvious reasons, but (coupled with a few other architectural decisions) they do exhibit >100x the performance of row stores for the kinds of queries seen in data warehouses.
Also, the two technologies are complementary. The goal is not to replace one thing with another, but to provide more kinds of tools and make them work together. Keep a row store for transaction processing, and feed the data into a column store for analysis in near real-time, much like a video game uses a CPU for AI and a GPU for 3D rendering.
I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.
I think the shared-nothing approach is the best one for an open-source RDBMS offering. Organizations which use open source will almost certainly want to use commodity, open hardware. Shared-nothing will allow them to do that.
Is that you do not scale as well to a large number of columns. To access a set of X records with 100 columns, you have 100 asynchronous I/O calls to the separate column stores. I sell analytical software that does just this, and it is not a technical detail that should just be ignored. In some regards the single-file row-oriented system has less I/O overhead. We have come up with some ways to reduce the file system overhead, but while it is small, it is noticeable, more so on systems not designed to have a large number of simultaneously open files. All that really happened is that part of the bottleneck shifted to rely less on the product architecture and more on the system architecture. Whether you think that is wise, well, that's up to you.
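A crude cost model of that trade-off, for anyone who wants to play with the numbers. Every constant is an assumption chosen for illustration (per-call overhead, bandwidth, value size), and the model ignores caching and parallel I/O entirely.

```python
# All constants are assumptions chosen for illustration, not measurements.
PER_CALL_OVERHEAD_S = 0.0001   # assumed fixed cost of one I/O call (0.1 ms)
BANDWIDTH_BPS = 100_000_000    # assumed sequential throughput, 100 MB/s
TOTAL_COLUMNS = 100
BYTES_PER_VALUE = 8
RECORDS = 10_000


def row_store_cost() -> float:
    """One call, but every column of every record comes along."""
    data = RECORDS * TOTAL_COLUMNS * BYTES_PER_VALUE
    return PER_CALL_OVERHEAD_S + data / BANDWIDTH_BPS


def column_store_cost(columns_needed: int) -> float:
    """One call per column file, reading only the columns asked for."""
    data = RECORDS * columns_needed * BYTES_PER_VALUE
    return columns_needed * PER_CALL_OVERHEAD_S + data / BANDWIDTH_BPS


narrow = column_store_cost(3)      # typical analytic query: a few columns
wide = column_store_cost(100)      # full-record access: every column file
baseline = row_store_cost()
```

With these made-up numbers the column layout wins easily when three columns are needed and loses slightly when all 100 are, because the per-call overhead is paid 100 times over.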
BTW, first post, I am no longer an eavesdropper, yay
Josh
it's on the front page of slashdot.. how stealthy can it be?
VLC FOR MAC IS DYING! IF YOU DEVELOP, PLEASE SAVE IT!!
Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark
+Raider of the lost BBS
I've worked in DW for a time, and I can tell you that it's not easy to "get it right" because so far it's not something that can be packaged. You can get the data models and fancy machinery, but you will most definitely need an architect to tailor it to the particular organization because all companies work differently on the inside. And that architect will have a dickens of a time understanding how the company works because the bigger they are, the more likely not even their own employees do. As long as there isn't an official structural model imposed on them like it happens with accounting, corporations will grow and be structured however best suits them (or sometimes they just "grow" like wild weeds, unruly and chaotic). And a Data Warehouse is an attempt to code this internal structure and its dealings in a central repository that will serve a number of goals like Business Intelligence, Trend Analysis, etc... So you won't find a product or solution that will fit your company out of the box. It's pretty much like with self-help books. The general idea works in general terms, but you have to adjust it to your own reality and quirks for it to be of any value to you in particular.
+Raider of the lost BBS
I've never heard of column based databases prior to this article. Would I be correct in assuming that you still can work with these using regular SQL?
Yes! No! Sort of!
Indexes only optimize some types of queries. To get the absolute maximum performance out of your database, you have to make sure that there is a specific index for each query you run, and that your indexes are properly rebuilt and optimized for least-time search. Suffice it to say, this rarely happens in the real world. So there's almost always some scanning, even after the indexes narrow things down a bit. By going with a column-oriented storage design, the scan can be streamed at higher levels of throughput than is possible with row-oriented databases.
The downside is that you're sacrificing the time to access individual rows, so if you're pulling and processing millions of rows of data, this might actually be slower than a traditional row-oriented database. Updates are almost guaranteed to be slower as you have to write to several column-oriented data stores rather than a single row-oriented store.
Still, column orientation makes a lot of sense for a variety of today's database applications. So if you're in need of querying a multi-terabyte table, this product may be just what the (senior database) administrator ordered.
Javascript + Nintendo DSi = DSiCade
One of the benefits of column oriented DBs is that tables have an ordering, and that ordering can be exploited in queries. SQL doesn't give a good way to exploit it. Column DBs do allow SQL, but they also have other native languages that people tend to use.
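A small sketch of what exploiting the ordering can look like, assuming a table kept sorted by timestamp. The column names and values are invented, and this uses Python's standard `bisect` module rather than any column database's native query language.

```python
from bisect import bisect_left, bisect_right

# Two aligned columns; the table is assumed to be kept sorted by timestamp.
timestamps = [100, 105, 110, 120, 130, 145]
prices = [10.0, 10.5, 10.2, 10.8, 11.0, 10.9]


def range_slice(lo, hi):
    """Prices with lo <= timestamp <= hi, found by binary search, not a scan."""
    start = bisect_left(timestamps, lo)
    stop = bisect_right(timestamps, hi)
    return prices[start:stop]


def price_as_of(t):
    """Latest price at or before time t, or None if there is none yet."""
    i = bisect_right(timestamps, t) - 1
    return prices[i] if i >= 0 else None


window = range_slice(105, 120)
latest = price_as_of(125)
```

An "as of" lookup like `price_as_of` is awkward to express in plain SQL but falls out naturally once the engine knows the column is ordered.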