Database Bigwigs Lead Stealthy Open Source Startup
BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him, he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise — a Linux-based system that handles queries 100 times faster than traditional relational database management systems."
The article mentions that Red Hat and HP are listed among their partners. I'm not surprised by Red Hat or Informatica (another partner, though they aren't mentioned in the article), but I was a little surprised by HP, since they have been trying to get the word out about their own data warehousing and BI stuff. I wonder what that indicates about how they regard this new player.
Also interesting is the Wikipedia article on Michael Stonebraker, if you aren't already familiar with him.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
The article seems to describe the big advantage as being column oriented.
How does this differ from KX Systems' kdb (www.kx.com), which IIRC is similar in that way, and is already in use at many if not most major financial institutions (see their customer list)?
It appears this was made new from the ground up? I am so used to logical progression with specific technologies trying to squeeze out the last drop of performance with no real new innovation that this idea seems foreign. It is refreshing to actually see something that is potentially new. Hopefully it holds up to the quoted 100 times faster.
Wow- that was fast.
The question is when will this be ported to a mainstream OS such as Windows?
It was LAMP, now it's LAVA. Much cooler name.
-----BEGIN PGP SIGNATURE-----
12345
-----END PGP SIGNATURE-----
Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica ... The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems.
Yeah, but what does its radar signature look like?
Wizard Needs Food, Badly
"grid-enabled, column-oriented relational database management system"
What does that mean?
If anything.
A column oriented relational database? I'd like some more details on how that works. I don't suppose it's just a regular SQL db with Excel's Pivot Tables run on it...
Seriously, though, the target market for grid-based, high-volume, data-warehousing type dbs is a lot smaller than the MySQL crowd. Not as big a deal as it seems, but it'd be nice to have if you needed it.
This is totally what we need.
With commodity hardware getting faster and cheaper by the minute, having a system that can handle a higher than average load with optimized software is, imho, a winner.
I'm sure everyone here can add some anecdotal evidence to how they had a heavy-hardware, database serving machine die on them because of some software bug.
This is one of the reasons I've been looking forward to ZFS. Hopefully the DB gurus will take the best of what's good about software, drop the legacy crap and really deliver something that's going to handle the kind of load that a good slashdotting delivers with hardware that didn't require a lease to be affordable.
This is not the greatest
how is this open-source?
The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems... ...using the power of oxygen!
https://www.eff.org/https-everywhere
Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.
These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.
Without benchmarks of any kind and with so little data I remain skeptical, but if it works this could be a huge breakthrough for database management as data storage amounts continue to skyrocket. I am curious if it will be ported to Windows or other proprietary systems and, if so, what effect that will have on the speed claims. Because if the speed claims are true and it stays Linux, I would think companies would have to consider moving to Linux to realize the speed gains.
My user ID is a palindrome!
Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time.
This is some new Network World definition of "Stealthy", apparently...
This sounds great but will it work with Windows applications? How proprietary is their system? Do they have a suitable set of signed ODBC drivers that will let my legacy applications talk to their system? Do they have .NET enabled database connectors so I can dump it into my project? How well has their DB been tested against chatty network environments like a mix of Windows and Macs or weird routing? What is their DB management system like? Is it CLI or GUI?
I can claim my custom-written DOS database system is 20X faster than anything on the market (which it is), but if it can't easily work in Windows and/or Linux (which it can't) then it is worthless as a marketable product. (But you should see what it can do on a serial network.)
You say things that offend me and I can deal with it. Can you?
What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely replacing magnetic disks? The technology field is littered with such fiascos.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Network World is a trade rag. To them, anything not advertised is stealthy. Especially since they want to motivate people to think "oh no, I don't want to be stealthy, that means unknown! quick buy some advertising!"
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Watch...they'll run into patent problems with patents held by Oracle, Sybase, and MS.
Where does it say that Vertica is going to be open source?
In any case, if people wonder how they get 100x speedups, it's probably related to Stonebraker's previous company called Streambase.
run Windows for their website?
I prefer the "u" in honour as it seems to be missing these days.
during the transition when you tell people your business runs on LAVA-LAMP technology.
It's hard for something like this to be relevant if it cannot interface with existing systems.
I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.
He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with id's 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
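A minimal sketch of the range partitioning and scatter/gather described above, in Python. The `Node` class, `ROWS_PER_NODE`, and the function names are hypothetical, invented only for illustration; a real shared-nothing RDBMS would talk to separate servers over a network and use a cost-based optimizer, neither of which is modeled here.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS_PER_NODE = 10  # assumption: ids 1-10 on node 0, 11-20 on node 1, etc.


class Node:
    """One shared-nothing server: it owns its rows and shares no memory or disk."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_id, payload):
        self.rows[row_id] = payload

    def scan_less_than(self, limit):
        # Local scan over this node's own partition only.
        return [(i, p) for i, p in self.rows.items() if i < limit]


def node_for(row_id):
    """Map an id to the index of the node that owns its range."""
    return (row_id - 1) // ROWS_PER_NODE


def build_cluster(n_nodes, n_rows):
    # n_rows must not exceed n_nodes * ROWS_PER_NODE in this toy model.
    nodes = [Node() for _ in range(n_nodes)]
    for row_id in range(1, n_rows + 1):
        nodes[node_for(row_id)].insert(row_id, f"row-{row_id}")
    return nodes


def select_where_id_less_than(nodes, limit):
    """Rough equivalent of: select * from table where id < limit."""
    # Skip nodes whose smallest id is already >= limit (partition pruning).
    candidates = [n for idx, n in enumerate(nodes)
                  if idx * ROWS_PER_NODE + 1 < limit]
    # Scatter the request to the remaining nodes in parallel, then gather.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda n: n.scan_less_than(limit), candidates))
    return sorted(row for part in partials for row in part)
```

With three nodes and 30 rows, `select_where_id_less_than(nodes, 15)` only contacts the first two nodes, since the third holds ids 21-30 and cannot match.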
My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.
There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
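To make drawback 2 concrete, here is a toy two-phase commit in Python. It is only a sketch under simplifying assumptions: `Participant` and `two_phase_commit` are hypothetical names, everything runs in one process, and there is no logging, timeout, or crash recovery. The point is that every participating node is contacted twice (prepare, then commit or abort) instead of doing one local commit.

```python
class Participant:
    """A database node taking part in a distributed transaction (toy model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that must vote "no"
        self.state = "idle"
        self.pending = None
        self.committed = {}

    def prepare(self, updates):
        # Phase 1: stage the updates and vote yes or no.
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.pending = dict(updates)
        self.state = "prepared"
        return True

    def commit(self):
        # Phase 2 (success): make the staged updates visible.
        self.committed.update(self.pending)
        self.pending = None
        self.state = "committed"

    def abort(self):
        # Phase 2 (failure): discard anything that was staged.
        self.pending = None
        self.state = "aborted"


def two_phase_commit(work):
    """work is a list of (participant, updates) pairs. Returns True on commit."""
    votes = [participant.prepare(updates) for participant, updates in work]
    if all(votes):
        for participant, _ in work:
            participant.commit()
        return True
    for participant, _ in work:
        participant.abort()
    return False
```

A single "no" vote from any node aborts the transaction on all of them, which is why a cross-partition update costs so much more than a local one.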
The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.
... for a long time.
Classic RDBMSes are crutches. A forced-upon necessity we have to put up with for our app models to latch on to real world hardware and its limitations. A historically grown mess with an overhead so huge it's insane. With a Database PL and 30+ dialects of it from back in the days when we flew to the moon using a slide rule as primary means of calculation.
If what they claim is true, these guys are probably finally ditching the omnipresent redundant n-fold layers of user and connection management in favour of a lean system that at last does away with the distinction of filesystem and database and data access layer. Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.
I tell you, finally ditching classic RDBMSes is *long* overdue, they're basically all the same ancient pile of rubble, from MySQL up to Oracle. If these guys are up to taking on this deed (or part of it) and they get finished when solid-state finally relieves our current super-slowpoking spinning metal disks on a broad scale we'll feel like being in heaven compared to the shit we still have to put up with today.
I wish these guys all the best. They appear to have the skills to do it and the authority to emphasise that today's RDBMSes and their underlying concepts are a relic of the past.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
On the subject I've just published a new benchmark.
And the largest margin of all the tests that we ran is around 4 times in multi-threaded tests in favour of MySQL.
This is just marketing, nothing concrete to see - move along.
TODO: 753) write sig.
There wasn't much information on the web site, but everything is in Wikipedia (look under C-Store, the BSD-licensed open source version). It really is just a column-oriented database.
As relational databases aren't known for their astonishing efficiency, I wouldn't hold my breath if they are comparing their vaporware to rdbms performance.
By which I am asking that while Vertica is obviously well-researched and well funded as a start up, MonetDB is well-researched, already benchmarked and available now.. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than the time and energy, etc. to develop market leadership in my chosen corporate area in the present?
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
You're 100 times faster than anyone else, obviously.
How dare you be so modest!! You conceited bastard!!
I wonder how this compares to Netezza (http://en.wikipedia.org/wiki/Netezza).
I've worked with another of this type of system - Alterian. For query intensive work, as compared to well designed / indexed Oracle, it was easily 100s of times faster. Sure, LOADING sucks, but given the goal was occasional loads with lots and lots of queries, it worked, and worked well.
> people like you, you have no idea about what you are comparing, just figured that setting up something out of the box will give a good insight into the speed.
I guess you didn't read the first page, or the second?
As stated (multiple times), the purpose of this report is to compare various aspects with "out of the box" performance, with all the caveats that it implies.
And FYI I will be comparing MySQL InnoDB next time around.
> Ohh and the 100 fold increase in speed is very much likely to happen
> With horizontal representation you can do sequential scan only on the part of the data that you need..
A scan is still a scan is still a scan.
And even with horizontal representation you shouldn't be too far off indexed data access speed, so the "100 times" figure is still unrealistic.
But, heh, you don't know me, so keep talking...
This is the second time that Vertica has managed to sneak into Slashdot on non-news, the previous time based on some shallow whitepaper. What Vertica is doing has been done by multiple other companies (kx kdb, Sybase IQ). It is bound to run against patents held by them. I also have to wonder about Vertica and the company Stonebraker keeps there. These fluffy submissions are not the mark of a company which has pride in its originality or intellectual heft.
See http://en.wikipedia.org/wiki/Bigtable for a description of Google's column oriented database.
Right, so you've got a table with, let's say, 5 million rows spanning perhaps 2 GB of data. Now if you want to find out how many of the rows contain a specific value, let's say foo_bar is below 50, with vertical representation you need to scan your entire dataset, that is 2 GB. With horizontal representation you only need to dig through 5 million entries; let's assume it's 32-bit integers and you are down to 20MB of data. Of course you can fix a bit of this by using an index covering just that column, but on very large datasets it just isn't an option.
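The arithmetic in that example can be checked directly. This sketch assumes decimal units (1 GB = 10^9 bytes) and 4-byte integers, as the comment suggests; the constant and function names are made up for illustration.

```python
ROWS = 5_000_000             # rows in the table
TABLE_BYTES = 2 * 1000**3    # whole table: 2 GB (decimal units assumed)
INT_BYTES = 4                # one 32-bit integer per row for foo_bar


def whole_table_scan_bytes():
    """Bytes read when every full row has to be scanned."""
    return TABLE_BYTES


def single_column_scan_bytes(rows=ROWS, width=INT_BYTES):
    """Bytes read when only the foo_bar column is scanned."""
    return rows * width


# How many times less data the single-column scan has to read.
ratio = whole_table_scan_bytes() / single_column_scan_bytes()
```

Under these assumptions the single-column scan reads 20,000,000 bytes, so `ratio` works out to 100. That covers bytes read only; it says nothing about seeks, caching, or compression.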
And about me not knowing you - I don't need to, a quick scan of your paper told me you had a lot to learn. If you already know your test is flawed, why the hell do you keep it online?
How about a database with the exact same query API (not just "but it's all SQL") as, say, Oracle or MS-SQL, or even Postgres, that allows any number of parallel query servers to work against a single datastore?
In other words, instead of yet another incompatible database, how about one that we could just switch to from an existing one, that is arbitrarily scalable against shared data. If you're going to get clever and act like you can solve hard problems, why not give people what we need, and not just what you think you can give us?
--
make install -not war
Wait how is this different than Sybase IQ server?
This looks like it will be a commercial version of C-Store, the column-oriented DBMS developed by Michael Stonebraker and MIT:
- Web site: http://db.lcs.mit.edu/projects/cstore/
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
Gee, I don't know anyone who's been successfully doing this for years... or getting crazy performance with partitioned databases, or anything...
/Caveat, I work for the folks who make this product... but nobody pays me for PR or anything
That would make sense for a remote application. When ran, they're ran so far away.
SQLite has offered high speed column storage for at least the past year. What's so good about Vertica's offering?
Heh.
I'm always amazed at the vehemence.
What good is a huge pile of data with no order? Someone has to pay the bills, someone has to see where the profits are, someone has to see which shipments went out late. These are reporting functions. You create data once, but use it many, many times if you are paying attention to it. I'm sorry you feel differently.
Execution of insert queries is extremely important and time sensitive. Execution of everything else is often not quite as mission critical, but it is still important enough to be worthwhile, or no one would do it.
My little site.
This is a different kind of issue, really, more like the difference between a CPU and a GPU. At the moment, a good GPU has >100x the performance of a good CPU on a certain class of computations. Column stores will clearly never replace row stores for transaction processing for obvious reasons, but (coupled with a few other architectural decisions) they do exhibit >100x the performance of row stores for the kinds of queries seen in data warehouses.
Also, the two technologies are complementary. The goal is not to replace one thing with another, but to provide more kinds of tools and make them work together. Keep a row store for transaction processing, and feed the data into a column store for analysis in near real-time, much like a video game uses a CPU for AI and a GPU for 3D rendering.
I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.
I think the shared-nothing approach is the best one for an open-source RDBMS offering. Organizations which use open source will almost certainly want to use commodity, open hardware. Shared-nothing will allow them to do that.
Did anyone else notice they said it would run on linux based HARDWARE. Well informed reporter.
Is that you do not scale as well to a large number of columns. To access a set of X records with 100 columns, you have 100 asynchronous I/O calls to the separate column stores. I sell analytical software that does just this, and it is not a technical issue that should just be ignored. In some regards the single-file row-oriented system has less I/O overhead. We have come up with some ways to reduce the file system overhead, but while it is small, it is noticeable, more so on systems not designed to have a large number of simultaneously open files. All that really happened is that it switched part of the bottleneck to rely less on the product architecture and more on the system architecture. Whether you think that is wise, well, that's up to you.
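A small Python sketch of that per-record cost, under stated assumptions: in-memory `io.BytesIO` buffers stand in for files, every value is a 4-byte integer, and all names (`row_file`, `col_files`, the two reader functions) are hypothetical. It only counts read calls; it does not measure real file system or open-file overhead.

```python
import io
import struct

N_COLS = 100
N_ROWS = 50
INT = struct.Struct("<i")  # one little-endian 32-bit integer per value

# Row layout: one "file" with whole records stored back to back.
row_file = io.BytesIO(b"".join(
    INT.pack(r * N_COLS + c) for r in range(N_ROWS) for c in range(N_COLS)))

# Column layout: one "file" per column, each holding that column for all rows.
col_files = [
    io.BytesIO(b"".join(INT.pack(r * N_COLS + c) for r in range(N_ROWS)))
    for c in range(N_COLS)
]


def read_record_row_store(f, r):
    """Fetch record r with a single seek and a single read."""
    f.seek(r * N_COLS * INT.size)
    data = f.read(N_COLS * INT.size)
    return [v[0] for v in INT.iter_unpack(data)], 1


def read_record_column_store(files, r):
    """Fetch record r by visiting every column file: one read per column."""
    values = []
    for f in files:
        f.seek(r * INT.size)
        values.append(INT.unpack(f.read(INT.size))[0])
    return values, len(files)
```

Both functions return the same record, but the column version needs 100 reads where the row version needs one. The trade reverses when a query touches only a few of the 100 columns across many rows.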
BTW, first post, I am no longer an eavesdropper, yay
Josh
it's on the front page of slashdot.. how stealthy can it be?
VLC FOR MAC IS DYING! IF YOU DEVELOP, PLEASE SAVE IT!!
What does this have to do with StreamBase? Is Stonebraker just throwing StreamBase under the bus? Are they complementary? How can one person (even someone with his abilities) function as CTO of two separate companies?
http://www.research.att.com/viewProject.cfm?prjID=69
Other than this text, there is no discernible information contained in this sig.
Any relation to Required technologies? Unfortunately, that's what I think of when I hear about a column store
if you think this is bad, you should have seen my last sig
All I am saying is that claiming that the performance is going to be 100 times faster is not a good measurement. Every database vendor will find a scenario that suits their engine and proves unequivocally that they are the best - but they can't all be right, now can they?
This is exactly the kind of thing that I tried to avoid in the paper that you didn't read, by focusing on very simple cases (and very simple differences being measured).
Some examples to save you reading it:
- IBM JDK 1.5 threading model does not scale well and causes much more system load
- PostgreSQL does not benefit from using prepared statements as much as other dbs (even less so for updates)
- On the same hardware, MySQL is generally faster as i686 than x86_64
etc...But eh, "this benchmark is flawed", right? so just delete this information from your brain, you'll be fine.
I won't deny I still have plenty to learn, that's what I enjoy doing! And I like sharing it too!
read it!.
Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark
+Raider of the lost BBS
I've never heard of column based databases prior to this article. Would I be correct in assuming that you still can work with these using regular SQL?
Is that a Microsoft 100x increase or a Linux 100x increase? (For reference 100x(Microsoft) = 1.1x(Linux))
I may agree with what you say, but I will defend to the death your right to face the consequences of saying it.
One has to ask the question: does our memory system (in our brain, literally) run on a column-based, row-based, or different type of database? :)
> Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access
> to a huge amount of computing power. The typical DBMS runs on, at best, a multi-processor server. This is kind of like a DBMS
> server running on a "seti at home" type network.
Or like Teradata in around, what, 1992? Informix around 1994? DB2 around 1995? Oracle isn't there yet, since their grid solution is more about failover than partitioning.
This is now lower-end functionality in the high-end database market. The typical database only runs best on an SMP if you mean PostgreSQL or MySQL when you say typical. The large commercial databases can easily split your data across 2, 4 or 500 servers for handling 1 second queries that require complex queries across billions of rows of data.
One of the benefits of column-oriented DBs is that tables have an ordering, and that ordering can be exploited in queries. SQL doesn't give a good way to exploit it. Column DBs do allow SQL, but they also have other native languages that people tend to use.
Column-oriented DBs should scale better with more columns if the query doesn't access all the columns (which they rarely do). The DB only needs to keep the columns in memory that are being accessed. This is far better than a row-oriented DB that needs to cycle through the entire table or numerous indexes to get a result set.
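A toy Python illustration of that point, counting how many stored values each layout has to touch for a one-column filter. The four-column table, the column names, and both functions are assumptions made up for the example; real engines work on disk pages and compressed blocks, not Python lists.

```python
from array import array

COLUMN_NAMES = ("id", "foo_bar", "c", "d")  # hypothetical four-column table

# Row layout: each record keeps all of its columns together.
rows = [(i, i % 100, i * 2, i % 7) for i in range(1000)]

# Column layout: one contiguous array per column.
columns = {
    name: array("q", (row[k] for row in rows))
    for k, name in enumerate(COLUMN_NAMES)
}


def count_row_store(rows, col_index, threshold):
    """Count matches in the row layout; every column of every row is fetched."""
    matches = 0
    cells_touched = 0
    for row in rows:
        cells_touched += len(row)
        if row[col_index] < threshold:
            matches += 1
    return matches, cells_touched


def count_column_store(columns, name, threshold):
    """Count matches in the column layout; only the named column is fetched."""
    col = columns[name]
    matches = sum(1 for value in col if value < threshold)
    return matches, len(col)
```

For `foo_bar < 50` both layouts find the same 500 rows, but the row layout touches 4,000 values and the column layout 1,000. The gap grows with the number of columns the query does not need.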
Column oriented sounds to me a lot like an index with a single field. Even if you place all of a column into a single block, other columns could still be required when displaying the record; if you have multiple columns like this it could mean a lot of blocks being read once you actually know the direct block locations of your data. So do you combine row and column styles, using some DDL to mark which columns to store independently of the rest of the row?
If you're using a data warehouse for operational reporting, you're a tard. We do, and we're tards. Probably 95% of management reports are binned unread.
KDB actually does its best when pulling from disk. If you can cache a B-tree in memory, there isn't as much of a performance increase, since disk seeking is your typical enemy in these large datasets. A hash table is even good when everything is in memory, so standard indexes do well. However, when going to disk, KDB can get away with far fewer disk seeks since it is pulling contiguous regions into memory.
It will be interesting to see how Stonebraker's new DB performs, since there are a number of column DBs out there, but only a few actually work well with the massive amounts of data that Stonebraker seems to be targeting.
People who manage are sometimes tards. Ideally, in a streamlined shop, there is a lot of use for data warehousing. Instead, because it's a tech task that can be sold as a big budget increase, it sometimes gets a lot of people working on it and produces huge piles of crud.
Instead of producing reams and reams of reports no one needs, what if the people in charge were capable of and willing to dig into the warehouse and find out what was going right and what was going wrong? Or hire someone to do so? That's what data warehousing would be great for. It's not used that way because of organizational factors, but the fact is that it is a tool, and expanding its usefulness or making it more useful has nothing to do with whether that tool is well used. The best managers I have known are interested in acquiring as much information as possible in an organized format. And redefining the format again and again in the hopes that it means something. Your particular set seems to be more interested in process; that's fine sometimes. But don't assume that that is the way it always goes.
You just described Teradata, which has been around since 1979, and does just what you said.
Initially, they used parallelized hardware with each "node" having its own disks, with tables partitioned, and a specialized interconnect. They then migrated all that in software.
See the diagrams on page 4, 5 and 6 here (PDF).
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
Unless the query is slow enough to be a problem and has to be optimized.
Then it pretty much always happens.
I know, I'm the dude that makes slow processes run 10x faster, sometimes more, depending on the level of incompetence of the original developers. The level of incompetence can sometimes be mind blowing. I've gotten 100x speed improvements when I was the 30th developer (all of whom were supposed to be looking for optimizations) looking at a block of client code/stored procs/queries.
As to updates, what percent of updates only change one column worth of data? It depends on the database design of course. The percent is higher in highly normalized designs but I suspect it rarely hits 50%.
The database design in turn is often a thoughtful trade-off between transaction processing and analytic tasks.
My question: how much server power does this product need vs. setting up a mainstream db and just indexing ALL the columns?
Let's leave compound indexes out of the discussion for now, but note the mainstream dbs will likely come out ahead on that metric.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Reviewing my 20 years at this crap...
Never seen a requirement to load random records. They are always based on a search criteria. Which is usually indexable.
Granted, there's the case of field x that contains the text 'blah'. That is a small part of database operations and doesn't particularly gain much from this.
Define well designed / indexed.
I suspect it was not all that well designed or indexed (for analysis). Most highly normalized designs are optimized for transaction processing.
You realize that mainstream dbs have ways to tell the server to keep certain indexes in memory? That can make carefully tuned queries scream. You can see 100x differences between designs put together following normalization rules and those thoughtfully de-normalized and with performance tweaks applied (pinned indices, sometimes compound indexes, sometimes pinned compound indexes, realized views etc etc).
Designing a mainstream db for query intensive work is very different from designing a mainstream db for TP.
You almost had me convinced you knew something of what you said. Until that line.
You sir are an idiot. Go back to your CS homework.
Are you telling me indexes stop working at a certain record count? They have a cost, of course, as they contain a copy of the column's worth of data in an easy-to-scan format. Just like column-oriented dbs. When scanning an index you are down to the same 20MB of data.
Yes in fact they stop working at a certain point. Try a fun experiment, populate a table where 5 percent meets a certain criteria, 7.5 percent another, 12.5 percent another, 25 percent another and 50 percent for the last one. Run each query against a cold database (reboot your system - or stop the database, unmount the filesystem, remount it and start the database again). Then create an index covering the queries and run the queries again, again against a cold database.
You will notice that at first the index is much faster than no index, but as you get nearer 25 percent coverage the two are about the same, with the index most likely a bit slower. At 50 percent they are still almost the same, with the indexed query a bit slower.
Regarding the index at 20MB: yes, you could do that, but when you've got millions of rows, indexes get very expensive, and that's why we want horizontal representation for these types of queries.
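For anyone who wants to try the experiment, here is a sketch of the data setup only, using Python's built-in sqlite3 module. The table name `t`, the bucket labels, the row count, and the padding column are all assumptions for illustration. It uses an in-memory database, so it does not reproduce the cold-cache condition described above; for that you would point `sqlite3.connect` at a file and clear the OS cache between runs, and then time the queries yourself.

```python
import random
import sqlite3

# Share of rows per bucket in parts per thousand: 5%, 7.5%, 12.5%, 25%, 50%.
PER_THOUSAND = {"a": 50, "b": 75, "c": 125, "d": 250, "e": 500}


def build_table(conn, n_rows=8000, seed=0):
    """Create table t with the five selectivities scattered in random order."""
    buckets = []
    for bucket, share in PER_THOUSAND.items():
        buckets.extend([bucket] * (n_rows * share // 1000))
    # Shuffle so matching rows are spread out rather than stored contiguously.
    random.Random(seed).shuffle(buckets)
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, bucket TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO t (id, bucket, payload) VALUES (?, ?, ?)",
        ((i, b, "x" * 100) for i, b in enumerate(buckets)))
    conn.commit()


def selectivity(conn, bucket):
    """Fraction of rows in t that a query on this bucket would return."""
    (matches,) = conn.execute(
        "SELECT COUNT(*) FROM t WHERE bucket = ?", (bucket,)).fetchone()
    (total,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    return matches / total


conn = sqlite3.connect(":memory:")
build_table(conn)
# Second half of the experiment: add the index, then rerun the same queries.
conn.execute("CREATE INDEX t_bucket ON t (bucket)")
```

Timing `SELECT payload FROM t WHERE bucket = ?` for each bucket before and after the `CREATE INDEX` line is the comparison being described; no results are claimed here.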
Vertica has indeed made the business/venture steps to follow the MonetDB approach to exploit column-based stores for large-scale data warehouse solutions. Its science library provides many studies on the underlying technology.
MonetDB has already built a business history in the area of analytical CRM solutions, available through SPSS. In the area of data mining, PROXIMITY is a leading product for relational mining. Not to mention the support for both SQL and XQuery engines. All this in the context of an open-source community activity for several years.
See http://monetdb.cwi.nl/ and http://monetdb.cwi.nl/projects/monetdb/Development/Credits/Partners/index.html
I had a long chat with Mike Stonebraker a few weeks ago, and came away with the following tentative opinions about Vertica's prospects, and those for columnar systems in general.
* Pinpoint data lookup doesn't seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
* Constrained query and reporting would seem to be a sweet spot, even though it's a sweet spot for some of the best competition as well.
* Cube-filling calculations involve big intermediate result sets. I'm not sure that's a great fit for columnar systems.
* Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
* Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica's have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they're starting from vaguely similar philosophical roots.
There also are some technical details in that article; a link to a short, somewhat hagiographic intro to Mike himself; and so on.
To err is human. To forgive is good system design.
And it's not just Sybase IQ, either. There are lots of columnar players. Kognitio also has a columnar VLDB offering, but it's quite different from Vertica's. And the columnar memory-centric BI offerings are interesting as well, such as QlikTech's and SAP's. Also, full-text indexing is pretty columnar itself.