Database Bigwigs Lead Stealthy Open Source Startup
BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him, he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise — a Linux-based system that handles queries 100 times faster than traditional relational database management systems."
The article mentions that Red Hat and HP are listed among their partners. I'm not surprised by Red Hat or Informatica (another partner, though they aren't mentioned in the article), but I was a little surprised by HP, since they have been trying to get the word out about their own data warehousing and BI stuff. I wonder what that indicates about how they regard this new player.
Also interesting is the Wikipedia article on Michael Stonebraker, if you aren't already familiar with him.
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
The article seems to describe the big advantage as being column oriented.
How does this differ from Kx Systems' kdb (www.kx.com), which IIRC is similar in that way and is already in use at many if not most major financial institutions (see their customer list)?
The question is when will this be ported to a mainstream OS such as Windows?
It was LAMP, now it's LAVA. Much cooler name.
Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica ... The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems.
Yeah, but what does its radar signature look like?
Wizard Needs Food, Badly
"grid-enabled, column-oriented relational database management system"
What does that mean?
If anything.
A column oriented relational database? I'd like some more details on how that works. I don't suppose it's just a regular SQL db with Excel's Pivot Tables run on it...
Seriously, though, the target market for grid-based, high-volume data-warehousing type DBs is a lot smaller than the MySQL crowd. Not as big a deal as it seems, but it'd be nice to have if you needed it.
This is totally what we need.
With commodity hardware getting faster and cheaper by the minute, having a system that can handle a higher than average load with optimized software is, imho, a winner.
I'm sure everyone here can add some anecdotal evidence to how they had a heavy-hardware, database serving machine die on them because of some software bug.
This is one of the reasons I've been looking forward to ZFS. Hopefully the DB gurus will take the best of what's good about software, drop the legacy crap and really deliver something that's going to handle the kind of load that a good slashdotting delivers with hardware that didn't require a lease to be affordable.
This is not the greatest
how is this open-source?
The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems... ...using the power of oxygen!
https://www.eff.org/https-everywhere
Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.
These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.
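A back-of-the-envelope sketch of why the random load hurts on rotating disks. Only the record counts come from the post; the seek time, record size and sequential read rate are assumptions picked for illustration, not measurements.

```python
# All hardware numbers below are assumptions for illustration, not measurements.
RECORDS_WANTED = 1_000_000                  # from the post
TABLE_RECORDS = 100_000_000                 # from the post
SEEK_SECONDS = 0.010                        # assumed: about 10 ms per random read
RECORD_BYTES = 100                          # assumed average record size
SEQUENTIAL_BYTES_PER_SECOND = 50 * 10**6    # assumed 50 MB/s sequential read rate


def random_read_seconds(records, seek_seconds=SEEK_SECONDS):
    """Worst case: every wanted record costs one disk seek."""
    return records * seek_seconds


def full_scan_seconds(records, record_bytes=RECORD_BYTES,
                      rate=SEQUENTIAL_BYTES_PER_SECOND):
    """Time to read the whole table front to back at the sequential rate."""
    return records * record_bytes / rate


random_s = random_read_seconds(RECORDS_WANTED)
scan_s = full_scan_seconds(TABLE_RECORDS)
print(f"random reads: {random_s:.0f} s, full sequential scan: {scan_s:.0f} s")
```

Under these assumed figures the million seeks take far longer than reading the entire table sequentially, which is the gap the post is pointing at.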
Without benchmarks of any kind and with a lack of data, I remain skeptical, but if it works this could be a huge breakthrough for database management as data storage amounts continue to skyrocket. I am curious if it will be ported to Windows or other proprietary systems, and if so what effect it will have on the speed claims. Because if the speed claims are true and it stays on Linux, I would think companies would have to consider moving to Linux to realize the speed gains.
My user ID is a palindrome!
Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time.
This is some new Network World definition of "Stealthy", apparently...
What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely replacing magnetic disks? The technology field is littered with such fiascos.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Network World is a trade rag. To them, anything not advertised is stealthy. Especially since they want to motivate people to think "oh no, I don't want to be stealthy, that means unknown! quick buy some advertising!"
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Watch...they'll run into patent problems with patents held by Oracle, Sybase, and MS.
Where does it say that Vertica is going to be open source?
In any case, if people wonder how they get 100x speedups, it's probably related to Stonebraker's previous company called Streambase.
run Windows for their website?
I prefer the "u" in honour as it seems to be missing these days.
during the transition when you tell people your business runs on LAVA-LAMP technology.
It's hard for something like this to be relevant if it cannot interface with existing systems.
I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.
He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with id's 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
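A minimal sketch of the partitioning described above, assuming simple range partitioning by id. The `Node` and `Cluster` classes are hypothetical names made up for illustration; a real system would send the sub-queries over the network and the optimizer would weigh bandwidth and latency.

```python
class Node:
    """One shared-nothing server: it owns its rows and shares no memory or disk."""

    def __init__(self, low, high):
        self.low = low      # smallest id stored on this node (inclusive)
        self.high = high    # largest id stored on this node (inclusive)
        self.rows = {}      # id -> row

    def select_id_less_than(self, limit):
        # Each node evaluates the predicate against its local data only.
        return [self.rows[i] for i in sorted(self.rows) if i < limit]


class Cluster:
    """Coordinator that routes inserts and fans queries out to the nodes."""

    def __init__(self, node_count, ids_per_node):
        self.ids_per_node = ids_per_node
        self.nodes = [Node(n * ids_per_node + 1, (n + 1) * ids_per_node)
                      for n in range(node_count)]

    def insert(self, row_id, row):
        index = (row_id - 1) // self.ids_per_node
        if row_id < 1 or index >= len(self.nodes):
            raise ValueError(f"id {row_id} is outside every partition")
        self.nodes[index].rows[row_id] = row

    def select_id_less_than(self, limit):
        # Only contact nodes whose id range can contain a match, then merge.
        contacted = [node for node in self.nodes if node.low < limit]
        merged = []
        for node in contacted:
            merged.extend(node.select_id_less_than(limit))
        return merged, len(contacted)


cluster = Cluster(node_count=10, ids_per_node=10)   # ids 1-10, 11-20, ...
for row_id in range(1, 101):
    cluster.insert(row_id, {"id": row_id})

rows, nodes_contacted = cluster.select_id_less_than(25)
print(len(rows), "rows from", nodes_contacted, "nodes")
```

Here the coordinator skips nodes that cannot hold matching ids, so a narrow predicate touches only a few servers while a wide one fans out to all of them.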
My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.
There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
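To make drawback 2 concrete, here is a toy sketch of a two-phase commit. The `Participant` class and its message counter are hypothetical; real implementations also need logging, timeouts and recovery, all of which are left out here.

```python
class Participant:
    """A database node taking part in a distributed transaction (toy model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"
        self.messages = 0   # round trips this node has handled

    def prepare(self):
        self.messages += 1
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.messages += 1
        self.state = "committed"

    def abort(self):
        self.messages += 1
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: every node touched by the transaction votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


nodes = [Participant("node-1"), Participant("node-2")]
print(two_phase_commit(nodes), [p.state for p in nodes])
```

Every participant handles two round trips per transaction, which is the extra expense compared with a commit that stays on one node.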
The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.
... for a long time.
Classic RDBMSes are crutches. A forced-upon necessity we have to put up with for our app models to latch on to real-world hardware and its limitations. A historically grown mess with an overhead so huge it's insane. With a database PL and 30+ dialects of it from back in the days when we flew to the moon using a slide rule as primary means of calculation.
If what they claim is true, these guys are probably finally ditching the omnipresent redundant n-fold layers of user and connection management in favour of a lean system that at last does away with the distinction of filesystem and database and data access layer. Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.
I tell you, finally ditching classic RDBMSes is *long* overdue, they're basically all the same ancient pile of rubble, from MySQL up to Oracle. If these guys are up to taking on this deed (or part of it) and they get finished when solid-state finally relieves our current super-slowpoking spinning metal disks on a broad scale we'll feel like being in heaven compared to the shit we still have to put up with today.
I wish these guys all the best. They appear to have the skills to do it and the authority to emphasise that today's RDBMSes and their underlying concepts are a relic of the past.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
There wasn't much information on the web site, but everything is in Wikipedia (look under C-Store, the BSD-licensed open source version). It really is just a column-oriented database.
A) Is your benchmark a data warehouse type app benchmark or transactional? Column oriented is slower for transactional typically but much faster for data warehouse. I don't care how many frames per second you measure if I'm buying a LAMP web server system.
B) Your benchmark data doesn't show that you've tried to run Sybase IQ or C-store column-oriented databases against the workload.
Are you really sure that you want to be so sure about this, given that you may not be testing the right thing, and haven't tested the comparable things? 8-)
By which I am asking: while Vertica is obviously well-researched and well funded as a start-up, MonetDB is well-researched, already benchmarked and available now. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than the time and energy, etc. to develop market leadership in my chosen corporate area in the present?
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
You're 100 times faster than anyone else, obviously.
How dare you be so modest!! You conceited bastard!!
I wonder how this compares to Netezza (http://en.wikipedia.org/wiki/Netezza).
Uhm... wtf?
Seriously, you tested MySQL vs. other databases with "out of the box" setups? MySQL isn't a real database when running the MyISAM engine; you simply cannot compare that with anything else. And on top of that, try doing a proper insertion in MySQL, one single transaction with a few million rows, and see how well that does. Oh, and did you ever stop to think about _why_ MySQL performs so much faster on that test? Try doing it on an InnoDB table with standard setup; even at 600k rows it slows to a crawl. (Easily fixable, but requires some optimizations.)
Seriously the reason why big vendors have a clause in their eula for people to NOT do benchmarks is exactly people like you, you have no idea about what you are comparing, just figured that setting up something out of the box will give a good insight into the speed. Sheesh.
Ohh and the 100 fold increase in speed is very much likely to happen - on certain types of queries. With horizontal representation you can do sequential scan only on the part of the data that you need, not the entire set, which should be very very fast.
I've worked with another of this type of system - Alterian. For query intensive work, as compared to well designed / indexed Oracle, it was easily 100s of times faster. Sure, LOADING sucks, but given the goal was occasional loads with lots and lots of queries, it worked, and worked well.
> people like you, you have no idea about what you are comparing, just figured that setting up something out of the box will give a good insight into the speed.
I guess you didn't read the first page, or the second?
As stated (multiple times), the purpose of this report is to compare various aspects with "out of the box" performance, with all the caveats that it implies.
And FYI I will be comparing MySQL InnoDB next time around.
> Ohh and the 100 fold increase in speed is very much likely to happen
> With horizontal representation you can do sequential scan only on the part of the data that you need..
A scan is still a scan is still a scan.
And even with horizontal representation you shouldn't be too far off indexed data access speed, so the "100 times" figure is still unrealistic.
But, heh, you don't know me, so keep talking...
TODO: 753) write sig.
See http://en.wikipedia.org/wiki/Bigtable for a description of Google's column oriented database.
Note: I work for Vertica
Right, so you've got a table with, let's say, 5 million rows spanning perhaps 2 GB of data. Now if you want to find out how many of the rows contain a specific value, let's say foo_bar is below 50, with vertical representation you need to scan your entire dataset, that is 2 GB. With horizontal representation you only need to dig through 5 million entries; let's assume it's 32-bit integers and you are down to 20 MB of data. Of course you can fix a bit of this by using an index covering just that column, but on very large datasets it just isn't an option.
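The arithmetic in that example can be checked in a few lines. The row count, table size and integer width are the post's figures (decimal gigabytes assumed); the tiny `foo_bar` array is a made-up stand-in for a column stored contiguously, which is what the post calls horizontal representation.

```python
from array import array

ROWS = 5_000_000
TABLE_BYTES = 2 * 10**9     # "perhaps 2 GB"; decimal gigabytes assumed
INT_BYTES = 4               # 32-bit integers

column_bytes = ROWS * INT_BYTES             # bytes in the foo_bar column alone
scan_ratio = TABLE_BYTES / column_bytes     # how much less data the column scan reads


def count_below(column, threshold):
    """Scan one contiguous column and count the values under the threshold."""
    return sum(1 for value in column if value < threshold)


foo_bar = array("i", [10, 75, 49, 50, 3])   # made-up stand-in for the real column
print(column_bytes, scan_ratio, count_below(foo_bar, 50))
```

With these figures the single-column scan reads one hundredth of the data, which lines up with the order of magnitude of the speedup being argued about.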
And about me not knowing you - I don't need to; a quick scan of your paper told me you had a lot to learn. If you already know your test is flawed, why the hell do you keep it online?
How about a database with the exact same query API (not just "but it's all SQL") as, say, Oracle or MS-SQL, or even Postgres, that allows any number of parallel query servers to work against a single datastore?
In other words, instead of yet another incompatible database, how about one that we could just switch to from an existing one, that is arbitrarily scalable against shared data. If you're going to get clever and act like you can solve hard problems, why not give people what we need, and not just what you think you can give us?
--
make install -not war
This looks like it will be a commercial version of C-Store, the column-oriented database developed by Michael Stonebraker and MIT:
- Web site: http://db.lcs.mit.edu/projects/cstore/
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
It takes balls to say things like that about Michael Stonebraker in the database field... ...and lack of brains or historical clue...
That would make sense for a remote application. When ran, they're ran so far away.
Heh.
I'm always amazed at the vehemence.
What good is a huge pile of data with no order? Someone has to pay the bills, someone has to see where the profits are, someone has to see which shipments went out late. These are reporting functions. You create data once, but use it many, many times if you are paying attention to it. I'm sorry you feel differently.
Execution of insert queries is extremely important and time sensitive. Execution of everything else is often not quite as mission critical, but it is still important enough to be worthwhile, or no one would do it.
My little site.
This is a different kind of issue, really, more like the difference between a CPU and a GPU. At the moment, a good GPU has >100x the performance of a good CPU on a certain class of computations. Column stores will clearly never replace row stores for transaction processing for obvious reasons, but (coupled with a few other architectural decisions) they do exhibit >100x the performance of row stores for the kinds of queries seen in data warehouses.
Also, the two technologies are complementary. The goal is not to replace one thing with another, but to provide more kinds of tools and make them work together. Keep a row store for transaction processing, and feed the data into a column store for analysis in near real-time, much like a video game uses a CPU for AI and a GPU for 3D rendering.
http://www.sqlite.org/whentouse.html
I haven't worked with it up close (browse around the website regularly, but don't run it), but all the docs I have seen say SQLite uses a B-tree, not a column store. Do you have an alternate reference to such?
SQLite says not to use it for more than a few GB to tens of GB of data. Sybase IQ, for example, is routinely run with TB-plus quantities of data. It's been tested to a trillion-plus rows of data and 155 TB of input data (which autocompressed down to 55 TB of disk space required to store it).
Vertica's headed thataways, I think, but also to be lighterweight and more general purpose than IQ.
I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.
I think the shared-nothing approach is the best one for an open-source RDBMS offering. Organizations which use open source will almost certainly want to use commodity, open hardware. Shared-nothing will allow them to do that.
Is that you do not scale as well to a large number of columns. To access a set of X records with 100 columns, you have 100 asynchronous I/O calls to the separate column stores. I sell analytical software that does just this, and it is not a technicality that should just be ignored. In some regards the single-file row-oriented system has less I/O overhead. We have come up with some ways to reduce the file system overhead, but while it is small, it is noticeable, more so on systems not designed to have a large number of simultaneously open files. All that really happened is that part of the bottleneck shifted to rely less on the product architecture and more on the system architecture. Whether you think that is wise, well, that's up to you.
BTW, first post, I am no longer an eavesdropper, yay
Josh
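A toy model of the overhead described in this post, counting one read request per file touched. That counting rule is a simplification, and the `ColumnStore` and `RowStore` classes are hypothetical, with in-memory lists standing in for files.

```python
class ColumnStore:
    """Keeps every column in its own list, standing in for one file per column."""

    def __init__(self, columns):
        self.columns = columns      # {column_name: [values...]}
        self.read_calls = 0

    def fetch(self, row_ids, column_names):
        result = {}
        for name in column_names:
            self.read_calls += 1    # one I/O request per column touched
            column = self.columns[name]
            result[name] = [column[i] for i in row_ids]
        return result


class RowStore:
    """Keeps whole rows together, standing in for a single row-oriented file."""

    def __init__(self, rows):
        self.rows = rows            # list of {column_name: value}
        self.read_calls = 0

    def fetch(self, row_ids, column_names):
        self.read_calls += 1        # one request against the single row file
        return {name: [self.rows[i][name] for i in row_ids]
                for name in column_names}


names = [f"c{n}" for n in range(100)]
rows = [{name: r * 1000 + n for n, name in enumerate(names)} for r in range(10)]
columns = {name: [row[name] for row in rows] for name in names}

column_store = ColumnStore(columns)
row_store = RowStore(rows)
wide_from_columns = column_store.fetch([0, 1, 2], names)
wide_from_rows = row_store.fetch([0, 1, 2], names)
print(column_store.read_calls, row_store.read_calls)
```

In this model the request count for the column layout grows with the number of columns asked for, so fetching all 100 columns is the worst case and a query that names only a couple of columns issues only a couple of requests.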
it's on the front page of slashdot.. how stealthy can it be?
VLC FOR MAC IS DYING! IF YOU DEVELOP, PLEASE SAVE IT!!
What does this have to do with StreamBase? Is Stonebraker just throwing StreamBase under the bus? Are they complementary? How can one person (even someone with his abilities) function as CTO of two separate companies?
http://www.research.att.com/viewProject.cfm?prjID=69
Other than this text, there is no discernible information contained in this sig.
Any relation to Required technologies? Unfortunately, that's what I think of when I hear about a column store
if you think this is bad, you should have seen my last sig
All I am saying is that claiming that the performance is going to be 100 times faster is not a good measurement. Every database vendor will find a scenario that suits their engine and proves unequivocally that they are the best - but they can't all be right, now can they?
This is exactly the kind of thing that I tried to avoid in the paper that you didn't read: by focusing on very simple cases (and very simple differences being measured)
Some examples to save you reading it:
- IBM JDK 1.5 threading model does not scale well and causes much more system load
- PostgreSQL does not benefit from using prepared statements as much as other dbs (even less so for updates)
- On the same hardware, MySQL is generally faster as i686 than x86_64
etc...But eh, "this benchmark is flawed", right? so just delete this information from your brain, you'll be fine.
I won't deny I still have plenty to learn, that's what I enjoy doing! And I like sharing it too!
read it!.
Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark
+Raider of the lost BBS
I've never heard of column based databases prior to this article. Would I be correct in assuming that you still can work with these using regular SQL?
Is that a Microsoft 100x increase or a Linux 100x increase? (For reference 100x(Microsoft) = 1.1x(Linux))
I may agree with what you say, but I will defend to the death your right to face the consequences of saying it.
> Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access
> to a huge amount of computing power. The typical DBMS runs on at best a multi-processor server. This is kind of like a DBMS
> server running on a "seti at home" type network.
Or like Teradata in around, what? 1992? Informix around 1994? DB2 around 1995? Oracle isn't there yet since their grid solution is more about failover than partitioning.
This is now lower-end functionality in the high-end database market. The typical database only runs best on an SMP if you mean PostgreSQL or MySQL when you say typical. The large commercial databases can easily split your data across 2, 4 or 500 servers for handling 1-second queries that require complex queries across billions of rows of data.
One of the benefits of column-oriented DBs is that tables have an ordering, and that ordering can be exploited in queries. SQL doesn't give a good way to exploit it. Column DBs do allow SQL, but they also have other native languages that people tend to use.
Column-oriented DBs should scale better with more columns if the query doesn't access all the columns (which queries rarely do). The DB only needs to keep the columns in memory that are being accessed. This is far better than a row-oriented DB that needs to cycle through the entire table or numerous indexes to get a result set.
Column oriented sounds to me a lot like an index with a single field. Even if you place all columns into a single block other columns could still be required when displaying the record, if you have multiple columns like this it could mean a lot of blocks being read once you actually know the direct block locations of your data. So do you combine row + column style, then using some DDL highlighting which columns to store independent of the rest of that row.
We use Sybase IQ where I work, and other than rebooting it every few days, frequent crashes, and slow running queries, it's pretty good. We are right now offloading a huge reporting database into SQL Server because our queries run about 6x faster. This is a star-schema data warehouse with a few TB of data. The big difference is that SQL Server supports partitioning.
KDB actually does its best when pulling from disk. If you can cache a B-tree in memory, there isn't as much of a performance increase, since disk seeking is your typical enemy in these large datasets. A hash table is even good when everything is in memory, so standard indexes do well. However, when going to disk, KDB can get away with far fewer disk seeks since it is pulling contiguous regions into memory.
It will be interesting to see how Stonebraker's new DB performs, since there are a number of column DBs out there, but only a few actually work well with the massive amounts of data that Stonebraker seems to be targeting.
People who manage are sometimes tards. Ideally, in a streamlined shop, there is a lot of use for data warehousing. Instead, because it's a tech task that can be sold as a big budget increase, it sometimes gets a lot of people working on it and produces huge piles of crud.
Instead of producing reams and reams of reports no one needs, what if the people in charge were capable of and willing to dig into the warehouse and find out what was going right and what was going wrong? Or hire someone to do so? That's what data warehousing would be great for. It's not used that way because of organizational factors, but the fact is that it is a tool, and expanding its usefulness or making it more useful has nothing to do with whether that tool is well used. The best managers I have known are interested in acquiring as much information as possible in an organized format. And redefining the format again and again in the hopes that it means something. Your particular set seems to be more interested in process; that's fine sometimes. But don't assume that that is the way it always goes.
You just described Teradata, which has been around since 1979, and does just what you said.
Initially, they used parallelized hardware with each "node" having its own disks, with tables partitioned, and a specialized interconnect. They then migrated all that in software.
See the diagrams on page 4, 5 and 6 here (PDF).
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
Unless the query is slow enough to be a problem and has to be optimized.
Then it pretty much always happens.
I know, I'm the dude that makes slow processes run 10x faster, sometimes more, depending on the level of incompetence of the original developers. The level of incompetence can sometimes be mind blowing. I've gotten 100x speed improvements when I was the 30th developer (all of whom were supposed to be looking for optimizations) looking at a block of client code/stored procs/queries.
As to updates, what percent of updates only change one column worth of data? It depends on the database design of course. The percent is higher in highly normalized designs but I suspect it rarely hits 50%.
The database design in turn is often a thoughtful trade-off between transaction processing and analytic tasks.
My question. How much server power does this product need vs. setting up a mainstream db and just indexing ALL the columns.
Let's leave compound indexes out of the discussion for now, but note the mainstream DBs will likely come out ahead on that metric.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Reviewing my 20 years at this crap...
Never seen a requirement to load random records. They are always based on a search criteria. Which is usually indexable.
Granted, there's the case of field x that contains the text 'blah'. That is a small part of database operations and doesn't particularly gain much from this.
Define well designed / indexed.
I suspect it was not all that well designed or indexed (for analysis). Most highly normalized designs are optimized for transaction processing.
You realize that mainstream dbs have ways to tell the server to keep certain indexes in memory? That can make carefully tuned queries scream. You can see 100x differences between designs put together following normalization rules and those thoughtfully de-normalized and with performance tweaks applied (pinned indices, sometimes compound indexes, sometimes pinned compound indexes, realized views etc etc).
Designing a mainstream db for query intensive work is very different to designing a mainstream db to TP.
You almost had me convinced you knew something of what you said. Until that line.
You sir are an idiot. Go back to your CS homework.
Are you telling me indexes stop working at a certain record count? They have a cost of course as they contain a copy of the columns worth of data in an easy to scan format. Just like column oriented dbs. When scanning an index you are down to the same 20MB of data.
Yes in fact they stop working at a certain point. Try a fun experiment, populate a table where 5 percent meets a certain criteria, 7.5 percent another, 12.5 percent another, 25 percent another and 50 percent for the last one. Run each query against a cold database (reboot your system - or stop the database, unmount the filesystem, remount it and start the database again). Then create an index covering the queries and run the queries again, again against a cold database.
You will notice that at first index is much faster than non index, but when you get nearer 25 percent coverage the two types are about the same, index will most likely be a bit slower than non indexed. But at 50 percent they are going to be almost the same, indexed a bit slower.
Regarding the index at 20MB, yes you could do that, but when you've got millions of rows, indexes will get very expensive, and that's why we want horizontal representation for these types of queries.
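A rough scaffold for the experiment described above, for anyone who wants to try it. Using in-memory SQLite is my substitution, so this cannot reproduce the cold-cache disk behaviour the post asks for, and the timings it prints are illustrative only. The table name, column names and row count are made up.

```python
import random
import sqlite3
import time

# Share of rows per category, from the post: 5, 7.5, 12.5, 25 and 50 percent.
SHARES = {"a": 5.0, "b": 7.5, "c": 12.5, "d": 25.0, "e": 50.0}


def build_table(conn, total_rows=40_000, seed=0):
    """Create table t and fill it so each category holds its share of the rows."""
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, category TEXT, payload TEXT)")
    categories = []
    for name, percent in SHARES.items():
        categories.extend([name] * int(total_rows * percent / 100))
    random.Random(seed).shuffle(categories)   # spread matches across the table
    conn.executemany(
        "INSERT INTO t (category, payload) VALUES (?, ?)",
        ((category, "x" * 100) for category in categories))
    conn.commit()


def run_queries(conn):
    """Return {category: (matching_rows, seconds)} for one pass over all queries."""
    results = {}
    for name in SHARES:
        start = time.perf_counter()
        # SUM(LENGTH(payload)) forces a visit to the table rows, so an index on
        # category alone cannot answer the query by itself.
        count, _ = conn.execute(
            "SELECT COUNT(*), SUM(LENGTH(payload)) FROM t WHERE category = ?",
            (name,)).fetchone()
        results[name] = (count, time.perf_counter() - start)
    return results


conn = sqlite3.connect(":memory:")
build_table(conn)
without_index = run_queries(conn)
conn.execute("CREATE INDEX idx_category ON t (category)")
with_index = run_queries(conn)

for name in SHARES:
    matching, scan_seconds = without_index[name]
    _, index_seconds = with_index[name]
    print(f"{name}: {matching} rows  scan={scan_seconds:.4f}s  index={index_seconds:.4f}s")
```

To get closer to what the post describes, point `sqlite3.connect` at a file on disk, use a table much larger than RAM, and restart or remount between runs as suggested.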
Vertica has indeed made the business/venture steps to follow the MonetDB approach to exploit column-based stores for large-scale data warehouse solutions. Its science library provides many studies on the underlying technology. MonetDB has already built a business history in the area of analytical CRM solutions available through SPSS. In the area of data mining, PROXIMITY is a leading product for relational mining. Not to mention the support for both SQL and XQuery. This all in the context of an open-source community activity for several years.
See http://monetdb.cwi.nl/
http://monetdb.cwi.nl/projects/monetdb/Development/Credits/Partners/index.html
I had a long chat with Mike Stonebraker a few weeks ago, and came away with the following tentative opinions about Vertica's prospects, and those for columnar systems in general.
* Pinpoint data lookup doesn't seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
* Constrained query and reporting would seem to be a sweet spot, even though it's a sweet spot for some of the best competition as well.
* Cube-filling calculations involve big intermediate result sets. I'm not sure that's a great fit for columnar systems.
* Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
* Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica's have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they're starting from vaguely similar philosophical roots.
There also are some technical details in that article; a link to a short, somewhat hagiographic intro to Mike himself; and so on.
To err is human. To forgive is good system design.
And it's not just Sybase IQ, either. There are lots of columnar players. Kognitio also has a columnar VLDB offering, but it's quite different from Vertica's. And the columnar memory-centric BI offerings are interesting as well, such as QlikTech's and SAP's. Also, full-text indexing is pretty columnar itself.