Database Bigwigs Lead Stealthy Open Source Startup

Partners by stoolpigeon · 2007-02-14 09:22 · Score: 5, Informative

The article mentions that redhat and hp are listed among their partners. i'm not surprised by red hat or informatica (another partner though they aren't mentioned in the article) but i was a little surprised by hp - since they have been trying to get the word out about their own data warehousing and bi stuff. i wonder what that indicates about how they regard this new player.

also interesting is the wikipedia article on Michael Stonebraker if you aren't already familiar with him.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?

Re:Partners by AKAImBatman · 2007-02-14 09:25 · Score: 4, Insightful

i was a little surprised by hp - since they have been trying to get the word out about their own data warehousing and bi stuff.

It's called "hedging your bets". If the little company doesn't work out, no big deal. If it does, then HP is in a position to either benefit from contractual relations, acquire it, or squash it. Whichever happens to be their fancy.

--
Javascript + Nintendo DSi = DSiCade
Re:Partners by stoolpigeon · 2007-02-14 10:22 · Score: 1

based on stonebraker's history -- somebody acquiring it at some point seems a relatively safe bet.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?

Column oriented databases by Anonymous Coward · 2007-02-14 09:22 · Score: 2, Interesting

The article seems to describe the big advantage as being column oriented.

How does this differ from KX Systems' kdb (www.kx.com), which IIRC is similar in that way and is already in use at many if not most major financial institutions (see their customer list)?

Re:Column oriented databases by georgewilliamherbert · 2007-02-14 09:37 · Score: 4, Informative

KX is primarily in-memory. The competing column-oriented product is primarily Sybase IQ, which has been on the market for a while now.
Re:Column oriented databases by JimDaGeek · 2007-02-14 11:24 · Score: 1

Well, kdb+ is proprietary and expensive. Maybe this product will be Open Source or at the very least kill kdb+ on price? There are not many real players in this market, the more the better IMO. What would be best is a competitive Open Source offering in this space. The Open Source product could steal away most of the market share, or at the very least, really drive down prices! :-)

--
General, you are listening to a machine! Do the world a favor and don't act like one.

When Will This Be Ported? by Anonymous Coward · 2007-02-14 09:23 · Score: 4, Funny

The question is when will this be ported to a mainstream OS such as Windows?

Re:When Will This Be Ported? by dfgchgfxrjtdhgh.jjhv · 2007-02-14 10:13 · Score: 1

lol

--
Web Design
Re:When Will This Be Ported? by Mad+Merlin · 2007-02-14 11:07 · Score: 2, Funny

The question is when will this be ported to a mainstream OS such as Windows?

Where by mainstream, you mean useless?

--
Game! - Where the stick is mightier than the sword!
Re:When Will This Be Ported? by JimDaGeek · 2007-02-14 12:25 · Score: 1

No offense, but you must be as dumb as a rock. This is not a "mainstream" application. This is a very business specific application. If the app does what a company needs, not running on MS Windows is _really_ no big deal.

I have been a senior developer for more than a decade now and have worked at 2 fortune 500 companies and 1 fortune 1000 company. All of the big companies use a multi-OS server setup. While most of the desktops are MS Windows, a lot of the servers are *nix. In fact, all of the really critical servers have been Solaris or Linux at the 3 large corps that I have worked at. All of the really big, critical financial systems at all three separate companies have been running Oracle on a non-MS OS.

I personally am glad about that since I program a lot of financial systems. All of the MS-Only SQL Server _certified_ "dba's" that I have worked with have been rocks. On the other hand, _all_ of the Oracle certified DBA's that I have worked with at all three large corps have known their shite!!

--
General, you are listening to a machine! Do the world a favor and don't act like one.
Re:When Will This Be Ported? by Anonymous Coward · 2007-02-14 23:43 · Score: 1, Funny

at 2 fortune 500 companies and 1 fortune 1000 company

that's 1500 companies with 3 fortunes, and you say he is dumb!!!

Everyone, we are moving to ASP now by varmittang · 2007-02-14 09:24 · Score: 3, Funny

It was LAMP, now it's LAVA. Much cooler name.

--
-----BEGIN PGP SIGNATURE-----
12345
-----END PGP SIGNATURE-----

Re:Everyone, we are moving to ASP now by SatanicPuppy · 2007-02-14 10:27 · Score: 1

If only you could get A on LA with V...The OSS version would be LAVM(ono).

Bah, it's no use...This system is already doomed like Postgres because it has no cool acronym.

--
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Re:Everyone, we are moving to ASP now by einhverfr · 2007-02-14 13:49 · Score: 1

Bah, it's no use...This system is already doomed like Postgres because it has no cool acronym.

You have something against LAPPs?

Racist!

--

LedgerSMB: Open source Accounting/ERP

it's fast, but can it penetrate enemy airspace? by President_Camacho · 2007-02-14 09:24 · Score: 1

Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica ... The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems.

Yeah, but what does its radar signature look like?

--
Wizard Needs Food, Badly

Re:it's fast, but can it penetrate enemy airspace? by varmittang · 2007-02-14 09:28 · Score: 2, Funny

V

--
-----BEGIN PGP SIGNATURE-----
12345
-----END PGP SIGNATURE-----
Re:it's fast, but can it penetrate enemy airspace? by Aqua_boy17 · 2007-02-14 09:29 · Score: 2, Funny

Yeah, but what does its radar signature look like?
Probably, a flock of seagulls.

--
What if the Hokey Pokey really is what it's all about?
Re:it's fast, but can it penetrate enemy airspace? by misleb · 2007-02-14 09:56 · Score: 1

I think in this case "stealthy" means that nobody has really heard of them before and nobody seems to care.

-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Re:it's fast, but can it penetrate enemy airspace? by eclectro · 2007-02-14 10:02 · Score: 2, Funny

Yeah, but what does its radar signature look like?

It's not bad, but the new startup synergistica that I'm working on is gonna be completely invisible.

--
Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
Re:it's fast, but can it penetrate enemy airspace? by ShieldW0lf · 2007-02-14 10:02 · Score: 1

They've got 23 million in funding... apparently someone cares...

--
-1 Uncomfortable Truth
Re:it's fast, but can it penetrate enemy airspace? by misleb · 2007-02-14 10:11 · Score: 1

But it is monopoly money!

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Re:it's fast, but can it penetrate enemy airspace? by Bastard+of+Subhumani · 2007-02-14 10:43 · Score: 1

Well thanks for informing us that it doesn't refer to the company's radar cross section or other measure of its radiative/reflective detectability. It's a miracle we survived so far under such a delusion.

--
Only three things are certain; death, taxes, and apocryphal quotations - Ben Franklin.
Re:it's fast, but can it penetrate enemy airspace? by Gospodin · 2007-02-14 11:11 · Score: 2, Funny

Microsoft is backing them?

--
...following the principles of Heisenburger's Uncertain Cat...

buzzword enabled by hey · 2007-02-14 09:24 · Score: 3, Insightful

"grid-enabled, column-oriented relational database management system"
What does that mean?
If anything.

Re:buzzword enabled by c0nst · 2007-02-14 09:59 · Score: 5, Informative

Here you go:
Stonebraker, Mike; et al. (2005). C-Store: A Column-oriented DBMS (PDF). Proceedings of the 31st VLDB Conference.
From the paper:
Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures
:-)
Re:buzzword enabled by Jherek+Carnelian · 2007-02-14 10:39 · Score: 5, Funny

"grid-enabled, column-oriented relational database management system"
What does that mean?

Uh, a spreadsheet?
Re:buzzword enabled by perfczar · 2007-02-14 10:54 · Score: 5, Informative

Buzzwords, yes, but they have a little bit of meaning left. Grid-enabled means that it works in a "shared nothing" environment, that you can use a networked cluster of commodity computers if one isn't enough to hold the data, and so on. This is in contrast to using one big huge box (big computer, big storage array, or whatever). Of course many databases are similarly grid-enabled. Column-oriented means that data is stored on disk by column; this makes it fast to process a subset of columns that touch lots of rows, as is typical in data warehouse applications. This is a key architectural difference among databases; Oracle, DB2, etc., are "row stores", while Sybase IQ, Vertica, etc. are "column stores". Note: I work for Vertica Systems.
Re:buzzword enabled by ChrisA90278 · 2007-02-14 10:54 · Score: 4, Informative

Column oriented means it can read data in from one column from the disk without pulling in all the other bytes in the row. Possibly much less I/O bandwidth usage, depending on the query. (Kind of like if you turned the normal file structure sideways.)
Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access to a huge amount of computing power. The typical DBMS runs on, at best, a multi-processor server. This is kind of like a DBMS server running on a "seti at home" type network.
Going solely by the developer's reputation, this could be a big deal. He is not some random hacker. He is a well known university professor who has several times in the past led projects that have been revolutionary and turned the field around. His ideas are widely used. Still, "100X faster" is a big claim. Lots of smart people have been working on DBMSes for many years; a two order of magnitude improvement is an "I will have to see it to believe it" type claim.
I'm using PostgreSQL to handle some telemetry data right now. If my 45 minute run times can be reduced to seconds, I'll be happy.
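
As a rough illustration of the "grid enabled" / shared-nothing idea described above, here is a minimal sketch in Python. It assumes a hypothetical four-node cluster simulated with plain lists and simple modulo partitioning; none of the names come from Vertica, and a real system would add networking, replication, and failure handling.

```python
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(key: int) -> int:
    """Pick the node that owns a row by its key (simple modulo partitioning)."""
    return key % NUM_NODES


# Each "node" holds only its own slice of the data; nothing is shared.
nodes = [[] for _ in range(NUM_NODES)]
for user_id in range(1000):
    row = {"user_id": user_id, "region": "east" if user_id % 3 == 0 else "west"}
    nodes[node_for(user_id)].append(row)


def local_count(rows):
    """Work done independently on one node: count rows per region."""
    return Counter(r["region"] for r in rows)


# Scatter the query to every node, then gather and merge the partial results.
total = Counter()
for partial in map(local_count, nodes):
    total += partial

print(dict(total))
```

The point of the sketch is only that each node answers the query over its own slice, and a cheap merge step combines the partial answers.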
Re:buzzword enabled by shmlco · 2007-02-14 12:24 · Score: 1

I guess on a column-oriented database I don't want to do a 'SELECT * FROM X' ?

--
Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
Re:buzzword enabled by smittyoneeach · 2007-02-14 13:59 · Score: 1

Soooo last decade:
s/spreadsheet/XML/

--
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
Re:buzzword enabled by kpharmer · 2007-02-14 17:37 · Score: 1

> Seriously though, I look after a system that has to run on oracle, postgresql (very tuned) and mysql. The reporting
> nature results over 10,000 grouping queries each time it's run. mysql takes less than a minute using InnoDB tables.
> The others crawl home closer to 20.

The problem is that mysql doesn't have partitioning, parallelism or a good optimizer. So even if in your specific application you're lucky enough to have a situation in which you've got a ton of both highly selective and very simple queries - that's not the normal reporting or business intelligence scenario. This is making the huge assumption that you've even tuned oracle correctly.

A more typical scenario is one in which you've got to read 5-10% of the total data volume (and your b-tree indexes don't work so you need partitioning), the queries *aren't* highly repetitive and are complex. In this scenario mysql sucks, and needs a few years to catch up. Meanwhile, expect query times 10-40x the length of oracle or db2.
Re:buzzword enabled by bytesex · 2007-02-14 21:15 · Score: 1

'grid enabled' like a Beowulf cluster
'column oriented' like a table, but then turned on its side.
'relational database management system' you've got me there. I have no idea.

--
Religion is what happens when nature strikes and groupthink goes wrong.
Re:buzzword enabled by Kjella · 2007-02-15 01:18 · Score: 3, Insightful

Under ideal conditions, I don't have a problem seeing that:

1. Make up lots of 100-column+ tables
2. Select one column from each table
3. If you're IO bound, you should now see about a 100:1 increase

However, most real data models don't work that way. Usually you put stuff that's useful at the same time in the same table, in which case it probably won't make much of a difference.
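
The arithmetic behind that 100:1 figure can be sketched as follows. This is a back-of-the-envelope model in Python, assuming a purely I/O-bound full scan, fixed-width values, and a hypothetical table of 1 million rows with 100 columns; the function names are made up for illustration.

```python
def bytes_read(num_rows, total_columns, columns_needed, bytes_per_value, layout):
    """Estimate bytes pulled from disk for a full scan under a given layout."""
    if layout == "row":
        # A row store reads every field of every row, needed or not.
        return num_rows * total_columns * bytes_per_value
    if layout == "column":
        # A column store reads only the columns the query touches.
        return num_rows * columns_needed * bytes_per_value
    raise ValueError(f"unknown layout: {layout}")


# Hypothetical table: 1 million rows, 100 columns, 8 bytes per value.
ROWS, COLUMNS, VALUE_BYTES = 1_000_000, 100, 8


def speedup(columns_needed):
    """Ratio of row-store I/O to column-store I/O for the same query."""
    row_io = bytes_read(ROWS, COLUMNS, columns_needed, VALUE_BYTES, "row")
    col_io = bytes_read(ROWS, COLUMNS, columns_needed, VALUE_BYTES, "column")
    return row_io / col_io


best_case = speedup(1)      # query touches a single column
typical_case = speedup(50)  # query touches half of the columns
print(best_case, typical_case)
```

Under these assumptions the advantage shrinks in direct proportion to the number of columns a query actually needs, which is the caveat raised above.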

--
Live today, because you never know what tomorrow brings
Re:buzzword enabled by kpharmer · 2007-02-15 02:37 · Score: 1

> Nowadays MySQL even does partitioning.

The GA (general availability) or production release does not support partitioning. Once they put it into production and it gets serious reviews we'll find out whether or not they've got a worthwhile solution. Until then it's not applicable.

> Whether you think the optimizer is any good is up to you, basically, but it, too has improved lately.

Yes, it has improved - it used to completely tank on joins of 4+ tables just a couple of years ago. However, I've seen no indications that it is yet in the league of any other options.

> Parallelism on the other hand i did not understand... what do you mean by that?
The ability to split a single query that you submit into components, run them separately under the covers, join the data together and present you with the results. Sometimes it may just be setting up multiple agents to handle reading the data - so that pulling data off disk happens in parallel, with the rest of the operations happening serially. With db2, for example, we typically see linear performance improvements by allowing parallelism up to 4 or 8 way on 4 or 8 way servers with good IO subsystems. In other words a table scan (ie, no use of an index) of 100 million rows on a 4-way server that takes 60 seconds can be shortened to 15 seconds with parallelism.
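
The decomposition described above (split one scan into pieces, run them separately, combine the results) can be sketched in Python. This is only an illustration of the structure, not of DB2's implementation: the function names are hypothetical, and because CPython threads share one interpreter lock, this toy version will not show the linear speedup a real database gets from parallel I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, threshold):
    """Scan one slice of the table and count the values above the threshold."""
    return sum(1 for v in chunk if v > threshold)


def parallel_count(values, threshold, workers=4):
    """Split a full scan into one slice per worker, then add up the partial counts."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(scan_chunk, chunks, [threshold] * len(chunks))
        return sum(partials)


values = list(range(100_000))  # stand-in for a column being table-scanned
serial = scan_chunk(values, 90_000)
parallel = parallel_count(values, 90_000)
print(serial, parallel)
```

With perfectly linear scaling, the 60 second scan in the example above divided across 4 workers gives the quoted 15 seconds.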
Re:buzzword enabled by fbjon · 2007-02-15 05:47 · Score: 1

That's great. Now I can refer to my dead-tree calendar as "grid-enabled". Oo-rah!

--
True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.

Column oriented? by JLavezzo · 2007-02-14 09:24 · Score: 1

A column oriented relational database? I'd like some more details on how that works. I don't suppose it's just a regular SQL db with Excel's Pivot Tables run on it...

Seriously, though, the target market for grid-based high volume data-warehousing type dbs is a lot smaller than the MySQL crowd. Not as big a deal as it seems, but it'd be nice to have if you needed it.

Re:Column oriented? by stoolpigeon · 2007-02-14 09:32 · Score: 2, Insightful

smaller in number - but i'm willing to bet much more profitable and growing rapidly. we've been looking at data warehousing options and frankly most of them suck in one way or another. if someone can do it right - they can make a killing.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:Column oriented? by MrAnnoyanceToYou · 2007-02-14 09:33 · Score: 1

Depends. Reporting and data warehousing are pretty important; Business Objects / Crystal Reports / etc. all seem to be slower than they could be. If you were to be able to throw in the rows as quickly as in MySQL or Postgres and then report on it with ten times the efficiency, you've got a decent demand in store for you. If, say, Google or Amazon could run with 1/10th the overall servers I have this feeling they would. Just a guess though. It's always possible a new approach to the old problems has resulted in real performance increase. I have my doubts, but it's a DBMS so no one REALLY gets validated either way for at least a couple years.

--
My little site.
Re:Column oriented? by truthsearch · 2007-02-14 09:35 · Score: 1

A lot of web sites that started out with small MySQL databases are now using replication. It can be a tough transition if not accounted for in the original development of the site. But if those sites started out with something that's "grid-based" maybe it would be much easier to grow (maybe). I have the feeling the market may be bigger than many people realize, especially if they start with something free.

--
Developers: We can use your help.
Re:Column oriented? by AKAImBatman · 2007-02-14 09:43 · Score: 4, Informative

A column oriented relational database? I'd like some more details on how that works.

http://en.wikipedia.org/wiki/Column-oriented_DBMS

It's basically an optimization of the current data access patterns. Databases have been row-oriented for decades, because they evolved from fixed width flat files. Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible. For example:

select * from names where lastname LIKE '%son'

The above query might have an index available to find what it needs. But it's just as likely that the database will need to do a table-scan. Since table-scans involve looking through every record in the database, you can imagine that it would be faster to just load the lastname column rather than loading every row in the database just to discard 90% of that data.
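
A minimal sketch of that difference, in Python with a toy in-memory table. The data and helper names are hypothetical; a real engine works on disk pages rather than lists, but the access pattern is the same idea.

```python
# Hypothetical toy data; the lastname column follows the query above.
column_names = ("id", "firstname", "lastname", "city")
rows = [
    (1, "Ann", "Johnson", "Boston"),
    (2, "Bob", "Smith", "Denver"),
    (3, "Cy", "Anderson", "Austin"),
]

# Row-oriented layout: one record per row, all fields stored together.
row_store = rows

# Column-oriented layout: one list per column, aligned by position.
column_store = {name: [r[i] for r in rows] for i, name in enumerate(column_names)}


def scan_row_store(store):
    """Loads every field of every row just to test the lastname field."""
    return [r for r in store if r[2].endswith("son")]


def scan_column_store(store):
    """Scans only the lastname column, then fetches matching rows by position."""
    hits = [i for i, name in enumerate(store["lastname"]) if name.endswith("son")]
    return [tuple(store[c][i] for c in column_names) for i in hits]


print(scan_column_store(column_store))
```

Both functions return the same rows; the column version only touches the other columns for the rows that matched.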

--
Javascript + Nintendo DSi = DSiCade
Re:Column oriented? by prog99 · 2007-02-14 09:43 · Score: 1

Fairly certain Sybase IQ server is column orientated.
Re:Column oriented? by georgewilliamherbert · 2007-02-14 09:47 · Score: 5, Insightful

A column oriented relational database? I'd like some more details on how that works.

Column oriented is easy. Imagine a database as a set of tables, each of which has rows of data records, in organized columns (column 1 = "User name", column 2 = "User ID", column 3 = "Favorite slashdot admin", etc).
Normal row-oriented databases store records which have a row of the data: "User name", "User ID", "Favorite slashdot admin" for user row #12345.
Column oriented databases store records which have a column of the data: "User name" for user rows 1-100,000; "User ID" for user rows 1-100,000; etc.
Updates are faster with row-oriented: you access the last record file and append something, or access an intermediate record file and update one "row" across.
Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match. Instead of going through the whole database top to bottom for the search, you just search on the one column. If you have 100 columns of data, then you look through 1/100th of the total data in the search. To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.
Indexes are useful, but column-oriented is more efficient in some ways. You don't have to maintain the indexes, and can just automatically search any column without having indexed it, in a reasonably efficient manner.
Column-oriented also lets you compress the data on the fly efficiently: all the records are the same data type (string, integer, date, whatever) and lists of same data types compress well, and uncompress typically far faster than you can pull them off disk, so you can just automatically do it for all the data and save both space and time...
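
To make the compression point concrete, here is a small sketch using Python's standard zlib module on a single column of repetitive values. The column contents are invented for illustration, and zlib is just a convenient stand-in for whatever codec a real column store would use.

```python
import zlib

# Hypothetical "Favorite slashdot admin" column: few distinct values, many rows.
admins = ["Phred", "Alice", "Bob", "Phred"] * 25_000

# Serialize the column on its own, as a column store would keep it.
raw = "\n".join(admins).encode("utf-8")

# Fast compression level, in the spirit of compressing on the fly.
packed = zlib.compress(raw, 1)

# Decompress and confirm nothing was lost.
restored = zlib.decompress(packed).decode("utf-8").split("\n")

print(len(raw), len(packed))
```

A column with few distinct values shrinks dramatically here; how well real data compresses depends on its actual distribution.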
Re:Column oriented? by Anonymous Coward · 2007-02-14 09:53 · Score: 1, Informative

I don't suppose it's just a regular SQL db with Excel's Pivot Tables run on it...
Essentially it is - take each column and put it in a file, sequentially by row number. Queries are really easy (read record n out of each column-file) but inserts are rather difficult. Searches are quite efficient (you can jam a lot of data in a data block without all those other columns in the way) but updates aren't so much. Data compresses better because a column tends to be consistent in format and repetitive, so you can pack even more information in each data block (and search even faster, but make updating even slower). It's cool, as long as you don't change much data.
I can't find anything to suggest it, but I suspect this group has some tricks to make updates less painful, or maybe they're just shooting for the warehouse market. It'll never take over the OLTP market but they may find a niche.
Re:Column oriented? by stoolpigeon · 2007-02-14 10:07 · Score: 1

I can't find anything to suggest it, but I suspect this group has some tricks to make updates less painful,

or - they are doing it in an environment where data gets in via etl (or this streams stuff) and you aren't doing updates -- you are doing bi and reporting to make management's widgets do all kinds of nice things on their dashboards.

i think they are targeting the data warehouse market - not the transactional or general purpose market.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:Column oriented? by flyingfsck · 2007-02-14 10:11 · Score: 2, Insightful

Yup, it is all about making the individual files smaller and more regular. Kinda the opposite of XML.

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Re:Column oriented? by stoolpigeon · 2007-02-14 10:28 · Score: 2, Interesting

info week just ran an article on hp getting into data warehousing and bi that had this paragraph pretty early on: Until sitting down with InformationWeek recently, the company has been mum on the initiative--not so much as a peep from its normally talkative marketing team. Indeed, it's an unlikely move into a sector where IBM, Oracle, SAS Institute, and Teradata have years of experience, well regarded products, and loyal customers. Those four vendors--along with Microsoft, which has muscled in on the strength of its SQL Server database--hold about 85% of the $5.2 billion-a-year data warehousing software market, a sector IDC projects will grow 9.5% annually through 2010.

so you are right - there's a lot of opportunity there, even for a small player.

on a side note, i thought the opening paragraph described the current situation pretty well
For more than a decade, big companies and sophisticated data aggregators have adopted data warehouses, yet few have mastered them, and many have outright failed in the effort or have been scared off by the complexity. The goal is to give workers access to real-time data across departments and geographic units, but more often than not, data warehouses end up as costly clunkers with outdated, inconsistent, and missing information.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:Column oriented? by Virtual_Raider · 2007-02-14 14:49 · Score: 2, Interesting

I've worked in DW for a time, and I can tell you that it's not easy to "get it right" because so far it's not something that can be packaged. You can get the data models and fancy machinery, but you will most definitely need an architect to tailor it to the particular organization because all companies work differently on the inside. And that architect will have a dickens of a time understanding how the company works because the bigger they are, the more likely not even their own employees do. As long as there isn't an official structural model imposed on them like it happens with accounting, corporations will grow and be structured however best suits them (or sometimes they just "grow" like wild weeds, unruly and chaotic). And a Data Warehouse is an attempt to code this internal structure and its dealings in a central repository that will serve a number of goals like Business Intelligence, Trend Analysis, etc... So you won't find a product or solution that will fit your company out of the box. It's pretty much like with self-help books. The general idea works in general terms, but you have to adjust it to your own reality and quirks for it to be of any value to you in particular.

--
+Raider of the lost BBS
Re:Column oriented? by mysticgoat · 2007-02-14 16:16 · Score: 1

the target market for grid-based high volume data-warehousing type dbs is a lot smaller than the MySQL crowd.

The growth potential for that market is staggering. We've now got desktop computers with enough storage capacity to hold everything a person has written or has ever read, from first grade to grave. We'll be looking for ways to organize these huge attics sometime soon.
Re:Column oriented? by Tablizer · 2007-02-14 16:40 · Score: 1

Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible

Isn't that what indexes do? Indexes basically stick an indexed column(s) (with row-ID) into a tree for faster searching.

--
Table-ized A.I.
Re:Column oriented? by AKAImBatman · 2007-02-14 17:33 · Score: 2, Insightful

Isn't that what indexes do?

Yes! No! Sort of!

Indexes only optimize some types of queries. To get the absolute maximum performance out of your database, you have to make sure that there is a specific index for each query you run, and that your indexes are properly rebuilt and optimized for least-time search. Suffice it to say, this rarely happens in the real world. So there's almost always some scanning, even after the indexes narrow things down a bit. By going with a column-oriented storage design, the scan can be streamed at higher levels of throughput than is possible with row-oriented databases.

The downside is that you're sacrificing the time to access individual rows, so if you're pulling and processing millions of rows of data, this might actually be slower than a traditional row-oriented database. Updates are almost guaranteed to be slower as you have to write to several column-oriented data stores rather than a single row-oriented store.

Still, column orientation makes a lot of sense for a variety of today's database applications. So if you're in need of querying a multi-terabyte table, this product may be just what the (senior database) administrator ordered.

--
Javascript + Nintendo DSi = DSiCade
Re:Column oriented? by SanityInAnarchy · 2007-02-14 20:47 · Score: 1

Column-oriented also lets you compress the data on the fly efficiently: all the records are the same data type (string, integer, date, whatever) and lists of same data types compress well, and uncompress typically far faster than you can pull them off disk, so you can just automatically do it for all the data and save both space and time...

Given enough spare CPU cycles, yes. LZO compression is probably good for that. In fact, this is part of the theory behind which Hans Reiser claimed Reiser4 will be over twice as fast as any other filesystem -- the cryptocompress plugin.

It's worth mentioning that there are other ways of improving performance here than compression -- you could stick it on a RAID, for instance. Ultimately, it comes to a question of cost. Are the CPU cycles cheaper than the disk space and RAID controllers? (My guess is that with LZO and very compressible data, the answer is yes, absolutely.)

Updates are faster with row-oriented: you access the last record file and append something, or access an intermediate record file and update one "row" across.

This seems right. However:

Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match.

In row-oriented, you have a couple of ways of dealing with this. Indexes are one; a separate table is another. In fact, if you split each field out into a separate (id,value) table, you can essentially create a column-oriented database out of a row-oriented one.
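
The (id,value) split can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table names and sample rows are hypothetical, and SQLite is used only because it is handy, not because it behaves like a real column store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One narrow (id, value) table per original column.
conn.executescript("""
CREATE TABLE users_name  (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE users_admin (id INTEGER PRIMARY KEY, value TEXT);
""")
conn.executemany(
    "INSERT INTO users_name VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
conn.executemany(
    "INSERT INTO users_admin VALUES (?, ?)",
    [(1, "Phred"), (2, "Other"), (3, "Phred")],
)

# Search one "column" table, then join on id to rebuild the matching rows.
matches = conn.execute("""
SELECT n.id, n.value, a.value
FROM users_admin AS a
JOIN users_name AS n ON n.id = a.id
WHERE a.value = 'Phred'
ORDER BY n.id
""").fetchall()

print(matches)
```

The join on id is the row-store equivalent of "look at all the column files and index in the right number of records".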

To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.

Exactly as quickly as in the above scenario. However, it could easily be slower than if you were to look up a record by an indexed column, then pull the whole column.

Indexes are useful, but column-oriented is more efficient in some ways. You don't have to maintain the indexes, and can just automatically search any column without having indexed it, in a reasonably efficient manner.

I call BS on this. Yes, it's faster to do a fulltext search on one column of names, but some sort of hashing/index system will improve the performance of a search in just about all circumstances. An example where this might not be the case would be, say, a serial number -- to look up something related to item n, just go to the nth row in that table -- but in this case, deletions mean you either need to leave holes in your database, or update every single serial number in that table (and everything referring to that number) -- which could easily involve searching the whole database, and would certainly be slower than an index.

The only advantage here is if you were doing a lot of updates on a very small column of short-ish data. It may get to the point where updates here are almost as fast as updates on an un-indexed row-oriented table would be, but looking something up by column would be much faster -- but still not as fast as indexed.

But that just seems like an incredible edge case, or at least, the kind of performance advantage you'd have to benchmark thoroughly to justify, considering what you're giving up.

And it seems to me like row-oriented with indexes is faster than column-oriented with indexes.

However, there is one advantage -- it's sort of automagically doing some normalization for you, or forcing normalization onto you. That, and the storage advantage, means this would be great for a general-purpose database -- a replacement for MySQL, say. But it looks to me like if you have a genius DBA, it would be possible to tweak even more performance out of row-oriented, unless column-oriented provides a way to emulate it to a point -- at which point, the distinction is really moot.

It seems to me like this is either bold-but-stupid marketing, or they have some other advantage than column-orientedness. One such advantage might be the grid-computing aspect -- at first glance, strictly column-oriented looks easier to distribute. But I really don't know.

--
Don't thank God, thank a doctor!
Re:Column oriented? by swilver · 2007-02-15 02:48 · Score: 1

Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match. Instead of going through the whole database top to bottom for the search, you just search on the one column. If you have 100 columns of data, then you look through 1/100th of the total data in the search. To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.
Assuming your search returned relatively few results. In row orientated databases there is a cut-off point where it becomes more efficient to do a table scan instead of using an index, this is usually when at least 5-10% of all the rows must be accessed.
With a column orientated database, this cut-off point would be reached far sooner, quite possibly for a 25 column table at just 1/25th the records returned.
Reading the complete table would be a far more frequent occurence in a column orientated database in cases where full records are being requested by applications (which is a common practice in almost all programs that use somekind of Object Relational mapping). A 25 column table with a record size of 2000 bytes from which you select roughly 0.1% of the records for a financial report would mean that in a row orientated database you read 2k of data out of every 2 MB of table on average. With a column orientated setup, you'd read roughly 80 (2000/25) bytes of data out of every 80 kB of "column-table" multiplied 25 times.
Obviously reading 80 bytes of every 80 kB would be immensily slow (hard drives usually read stuff in far larger chunks without any performance impact), so at 0.1% of rows selected you're effectively already doing a full table scan. The row orientated database however will still be able to get some more performance by skipping huge chunks of table.
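To put rough numbers on the chunk-size argument above, here is a back-of-the-envelope sketch in Python. It assumes matching records are scattered uniformly at random and that the disk reads in 64 kB chunks. The chunk size and the function name are assumptions made for illustration; neither comes from the thread or from any real database.

```python
# Rough model: what fraction of fixed-size disk chunks contains at least one
# selected value? Assumes matches are independent and uniformly scattered,
# and a hypothetical 64 kB read size.

def fraction_of_chunks_touched(value_bytes, selectivity, chunk_bytes=64 * 1024):
    """Probability that a chunk holds at least one selected value."""
    values_per_chunk = max(1, chunk_bytes // value_bytes)
    return 1.0 - (1.0 - selectivity) ** values_per_chunk

RECORD_BYTES = 2000   # full row, as in the example above
COLUMNS = 25
SELECTIVITY = 0.001   # 0.1% of rows selected

row_store = fraction_of_chunks_touched(RECORD_BYTES, SELECTIVITY)
column_store = fraction_of_chunks_touched(RECORD_BYTES // COLUMNS, SELECTIVITY)

print(f"row store:    {row_store:.1%} of chunks read")
print(f"column store: {column_store:.1%} of chunks read, in each of {COLUMNS} files")
```

Under those assumptions the row store touches roughly 3% of its chunks, while each column file has more than half of its chunks touched, which is the "effectively a full table scan" effect. A smaller read size or clustered matches would change the picture.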
Also, the column based approach lowers the performance of retrieving a row by its primary key, one of the single most common operations of a database. In a row orientated database this would involve 2 or 3 index accesses plus one to retrieve the entire row. In a column orientated database it would be 2 or 3 index accesses plus one access for every column to retrieve. Retrieving a customer with their address records could easily result in hundreds of extra accesses in such a model.
I'm a bit skeptical if such a database would consistently perform better. It could do well in some cases, but it can do far worse in others.

Awesome by Fyre2012 · 2007-02-14 09:25 · Score: 2, Interesting

This is totally what we need.

With commodity hardware getting faster and cheaper by the minute, having a system that can handle a higher than average load with optimized software is, imho, a winner.

I'm sure everyone here can add some anecdotal evidence to how they had a heavy-hardware, database serving machine die on them because of some software bug.
This is one of the reasons I've been looking forward to ZFS. Hopefully the DB gurus will take the best of what's good about software, drop the legacy crap and really deliver something that's going to handle the kind of load that a good slashdotting delivers with hardware that didn't require a lease to be affordable.

--
This is not the greatest .sig in the world, no. This is just a tribute.

Re:Awesome by Grinin · 2007-02-14 09:55 · Score: 1

I couldn't agree more. The OpenSource community always comes up short when it comes to taking on the big corporate names. Ultimately the more choices the consumers have the lower the prices, the higher the standards, and that's what a mixed economy is all about.

--
Relocating to San Francisco / Palo Alto... Hire me?

open source? by Anonymous Coward · 2007-02-14 09:30 · Score: 1

how is this open-source?

Re:open source? by perfczar · 2007-02-14 11:00 · Score: 3, Informative

Vertica is not open source. Not sure where the confusion came from.

Note: I work for Vertica.
Re:open source? by Fyre2012 · 2007-02-14 11:11 · Score: 1

I didn't mention that it was Open Source.

Which one of us is confused?
...or were you not actually replying to me?
Now i got myself all confused =\

--
This is not the greatest .sig in the world, no. This is just a tribute.
Re:open source? by snoyberg · 2007-02-14 14:22 · Score: 1

My guess is that they titled it that way since it's built on Linux. (OK, they really titled it that way so everyone would read the article, but whatever.)

--
Thank God for evolution.
Re:open source? by gbarta · 2007-02-14 15:10 · Score: 1

This is a commercialization of C-Store by the same guys. It remains to be seen what commercial development would be fed back to the original open source version.

But does it save the children? by StikyPad · 2007-02-14 09:31 · Score: 1

The promise -- a Linux-based system that handles queries 100 times faster than traditional relational database management systems... ...using the power of oxygen!

--
https://www.eff.org/https-everywhere

Perfect timing by defile · 2007-02-14 09:31 · Score: 3, Interesting

Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.

These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.

Re:Perfect timing by symbolic · 2007-02-14 14:57 · Score: 1

Duplication of effort isn't bad at all....without it, you'll wind up with another Microsoft.

Good..If it works by Gomer79 · 2007-02-14 09:32 · Score: 1

Without benchmarks of any kind, and with so little data, I remain skeptical, but if it works this could be a huge breakthrough for database management as data storage amounts continue to skyrocket. I am curious whether it will be ported to Windows or other proprietary systems and, if so, what effect that will have on the speed claims. If the speed claims are true and it stays Linux-only, I would think companies would have to consider moving to Linux to realize the speed gains.

--
My user ID is a palindrome!

Re:Good..If it works by nuzak · 2007-02-14 10:01 · Score: 1

It's not a breakthrough, it's simply a vertical database design, and it will accelerate SOME kinds of queries, and not do so well on others. It's great for the kind of data mining where you're going to vertically slice the data anyway, not so good for OLAP and decision support where you usually want the whole record at once. You replicate to one of these databases, you don't usually primarily enter data into it -- with trading data being one notable exception. Financial apps love using kx, which is blindingly fast and has a programming language drawing from APL, including its awesome terseness. I'm told that kx doesn't do so hot once you need to hit the disk though.

Oracle is able to do this with vertical partitions too, though partitions are a rather large-grained thing, so I imagine there are some limits to doing it that way.

I think I've karma-whored enough for one post :p

--
Done with slashdot, done with nerds, getting a life.
Re:Good..If it works by Anonymous Coward · 2007-02-14 10:15 · Score: 1, Informative

Personally, I think the breakthrough for managing data warehousing volumes of data with real-time response is going to come from NitroSecurity's NitroEDB. I saw a demo they gave running on a single commodity laptop which delivered query responses thousands of times faster than Oracle, on a data set with billions of records. They're working with MySQL creating an interface to use NitroEDB as a storage engine as well.
Re:Good..If it works by I!heartU · 2007-02-14 11:13 · Score: 1

According to Wikipedia, column-oriented is better for reads, meaning OLAP (data warehousing), and slower on writes, so not so good for OLTP. The idea is that since columns are stored in different files, and queries usually don't want every column, you only have to look at some of the files and not all of them.
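A toy sketch of the two layouts, in Python, may make the read/write trade-off concrete. In-memory lists stand in for on-disk files, and every name here is made up for illustration.

```python
# Toy illustration only: in-memory lists stand in for on-disk files.
rows = [
    {"id": 1, "name": "alice", "city": "Boston", "balance": 120},
    {"id": 2, "name": "bob", "city": "Austin", "balance": 80},
    {"id": 3, "name": "carol", "city": "Boston", "balance": 200},
]

# Row layout: one record per entry. Column layout: one list ("file") per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def total_balance_row_store(rows):
    # Every full record is visited even though only one field is used.
    return sum(row["balance"] for row in rows)

def total_balance_column_store(columns):
    # Only the "balance" column is read; the other columns are never touched.
    return sum(columns["balance"])

def append_row_store(rows, record):
    rows.append(record)              # one write

def append_column_store(columns, record):
    for name, value in record.items():
        columns[name].append(value)  # one write per column

print(total_balance_row_store(rows), total_balance_column_store(columns))
```

Both totals agree, but the column version reads one list out of four, while appending a record costs one write per column instead of one write overall.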
Re:Good..If it works by nuzak · 2007-02-14 11:22 · Score: 1

Yikes, I actually meant to say OLTP, since I contrasted it with data warehousing. I just choked on my alphabet soup a little. Thanks :)

--
Done with slashdot, done with nerds, getting a life.

Doesn't "stealthy" require some stealth anymore? by georgewilliamherbert · 2007-02-14 09:33 · Score: 3, Insightful

Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time.

This is some new Network World definition of "Stealthy", apparently...

Best of luck by 140Mandak262Jamuna · 2007-02-14 09:40 · Score: 5, Insightful

I don't want to rain on their parade. But typically whenever people start with a spec like "100 times better than what they can do", they assume the incumbents will continue to perform at current levels while these people take years to develop and mature their new technology. In the real world, the traditional methods too improve and unless they can maintain a 100x lead continually the new technology flops.

What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely replacing magnetic disks? The technology field is littered with such fiascos.

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact

Re:Best of luck by georgewilliamherbert · 2007-02-14 09:50 · Score: 1

Sybase IQ already shows that class of speedups on lots of datasets. Proof of concept is out there...
Re:Best of luck by PCM2 · 2007-02-14 12:08 · Score: 1

In the real world, the traditional methods too improve and unless they can maintain a 100x lead continually the new technology flops.

This might be the obvious conclusion if Vertica were targeting the mass market and trying to compete directly with Oracle, SQL Server, or DB2, but they are not. TFA says Vertica is targeted at the data warehousing market, which is a very specific application area that can be better served with niche products than with the traditional general-purpose RDBMSs. Based on what I've read, it sounds like Vertica is addressing a market similar to that of Greenplum. These guys are trying to use open source to go up against the entrenched proprietary players, such as Teradata, that charge literal millions of dollars for software to run big data warehouses.

--
Breakfast served all day!
Re:Best of luck by einhverfr · 2007-02-14 14:04 · Score: 2, Insightful

For certain applications (particularly BI), I think that 100x speedups are practical, but I would not expect it in general OLTP systems.

Let me give you an example.

Suppose you have a table with, say, 100 billion rows. You want to create a report which provides aggregated data on a very large subset of a few columns of the table. With a traditional RDBMS, you have to read through every single one of the 100 billion rows to aggregate the data (indexes don't help if you are going to be searching through a sizeable percentage of disk pages).

Most systems currently tackle this problem using massive parallelism. I.e. you break up the table into little pieces and store them on different systems. Now imagine that in addition to this, you break up each column into its own table. Now you have fewer disk pages to search through. Fewer memory and disk bandwidth issues, faster performance.

Now, this would be less useful if you were trying to do more complex queries on larger numbers of columns, and inserts/updates suck.

So like many things, it is a tradeoff.

--

LedgerSMB: Open source Accounting/ERP
Re:Best of luck by Ant+P. · 2007-02-14 16:22 · Score: 1

"100 times better" is perfectly feasible, if your average dataset is 100 (arbitrary units), your old algorithm is O(n^2) and the new one is O(n).

Re:Doesn't "stealthy" require some stealth anymore by drinkypoo · 2007-02-14 09:41 · Score: 2, Funny

Vertica's website has had all the details about what they're doing for months. They've had a Wikipedia article for a long time. This is some new Network World definition of "Stealthy", apparently...

Network World is a trade rag. To them, anything not advertised is stealthy. Especially since they want to motivate people to think "oh no, I don't want to be stealthy, that means unknown! quick buy some advertising!"

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Patent Problems by IflyRC · 2007-02-14 09:42 · Score: 2

Watch...they'll run into patent problems with patents held by Oracle, Sybase, and MS.

Re:Patent Problems by kfg · 2007-02-14 12:01 · Score: 1

...they'll run into patent problems with patents held by Oracle, Sybase, and MS.

The principal patent base of RDBMSs is actually held by IBM, and many of those patents have expired. In any case there is a known solution to most patent issues. We call it "money."

Linux is only free if your time is worthless

Windows TCO is only lower if your data is worthless.

KFG

open source? by oohshiny · 2007-02-14 09:42 · Score: 1

Where does it say that Vertica is going to be open source?

In any case, if people wonder how they get 100x speedups, it's probably related to Stonebraker's previous company called Streambase.

Why does a company promising Linux solutions... by WindBourne · 2007-02-14 09:50 · Score: 2, Interesting

run Windows for their website?

--
I prefer the "u" in honour as it seems to be missing these days.

Re:Why does a company promising Linux solutions... by Mad+Merlin · 2007-02-14 11:03 · Score: 2, Interesting

Look again...
$ curl -I www.vertica.com
HTTP/1.1 200 OK
Date: Wed, 14 Feb 2007 23:00:26 GMT
Server: Apache/1.3.33 (Unix)
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Pragma: no-cache
X-Powered-By: PHP/4.4.4
Set-Cookie: PHPSESSID=488de093f5b89a78277a234e1e9886a6; expires=Sat, 10 Mar 2007 02:33:46 GMT; path=/
Last-Modified: Wed, 14 Feb 2007 23:00:26 GMT
Content-Type: text/html; charset=utf-8

--
Game! - Where the stick is mightier than the sword!
Re:Why does a company promising Linux solutions... by Bill,+Shooter+of+Bul · 2007-02-14 15:51 · Score: 1

apparently, they read slashdot.

--
Well.. maybe. Or Maybe not. But Definitely not sort of.

You're bound to get some strange looks... by Anonymous Coward · 2007-02-14 09:51 · Score: 5, Funny

during the transition when you tell people your business runs on LAVA-LAMP technology.

Re:You're bound to get some strange looks... by smittyoneeach · 2007-02-14 13:55 · Score: 1

...and, if some bonehead intern melts it down, you can always market the leftovers on eBay as a Pet Rock.
Once you've another internet connection, of course.

--
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear

MOD PARENT UP! by Dysfnctnl85 · 2007-02-14 09:52 · Score: 1

It's hard for something like this to be relevant if it cannot interface with existing systems.

Speculation by cartman · 2007-02-14 09:55 · Score: 5, Informative

I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.

He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with id's 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
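The partitioning example above can be sketched in a few lines of Python. This is a hypothetical toy, not how any shipping product works: `Node` and `Cluster` are made-up names, the "network" is a plain method call, and the only optimization shown is skipping nodes whose id range cannot match.

```python
# Hypothetical sketch of range partitioning with scatter-gather.
# Node and Cluster are made-up names, not any product's API.

class Node:
    def __init__(self, low, high):
        self.low, self.high = low, high   # inclusive id range owned by this node
        self.rows = {}

    def select_id_less_than(self, bound):
        return [(i, self.rows[i]) for i in sorted(self.rows) if i < bound]


class Cluster:
    def __init__(self, ranges):
        # Ranges are expected in ascending, non-overlapping order.
        self.nodes = [Node(low, high) for low, high in ranges]

    def insert(self, row_id, payload):
        for node in self.nodes:
            if node.low <= row_id <= node.high:
                node.rows[row_id] = payload
                return
        raise KeyError(f"no node owns id {row_id}")

    def select_id_less_than(self, bound):
        """Return (matching rows, number of nodes contacted)."""
        results, contacted = [], 0
        for node in self.nodes:
            if node.low < bound:          # skip nodes that cannot hold a match
                contacted += 1
                results.extend(node.select_id_less_than(bound))
        return results, contacted


cluster = Cluster([(1, 10), (11, 20), (21, 30)])
for row_id in range(1, 31):
    cluster.insert(row_id, f"row-{row_id}")

matches, contacted = cluster.select_id_less_than(15)
print(len(matches), "rows from", contacted, "of", len(cluster.nodes), "nodes")
```

Here the query for ids below 15 is answered by the first two nodes only; the third is never asked. A real optimizer would also have to weigh network bandwidth and latency, which this sketch ignores.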

My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.

There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.

The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.

Re:Speculation by El_Oscuro · 2007-02-14 17:03 · Score: 1
The big problem with "commodity hardware" is it breaks all the time. If a system is using a shared-nothing approach, then all of the commodity servers and storage arrays have to be available or the query will fail. Thus, if you have 10 servers and storage arrays participating in a shared-nothing approach, if any one of those servers goes down, the query will fail. Having a large amount of moving parts (commodity hardware) involved in any operation makes the MTBF go way down.

We are preparing for a large deployment (200+ sites) of Oracle RAC clusters on commodity hardware, in which we fully expect servers and the shared storage to fail on a regular basis. Our most recent test of this configuration involved yanking the power cord from the shared array while the database was up. It is a realistic test, and we expect to see something similar in production within a few months of deployment.

One of the biggest misconceptions about redundancy technologies is that they are actually truly redundant. For example, that RAIDed storage array you have? The RAID protects against the failure of a single hard drive (more if you have hot spares). However, your database is toast if someone forgot to turn off write cache (the default on most arrays) and your battery goes bad. Or if your SAN "panics" and loses a 600GB LUN. Or if your controller goes berserk and starts writing corrupt archivelogs without any error messages?
How to achieve redundancy:
1. Use different hardware. If you are using a commodity storage array for your primary storage, keep a copy of critical files like redologs and archivelogs on a different hardware type, like an external drive. If you use SAN as primary storage, use NAS for your backup files.
2. Use different sites. Don't use a vendor's hardware replication. Instead use something like Oracle Advanced Replication or a Data Guard standby database.
3. Clusters protect against the failure of a server. RAID protects against the failure of a hard drive. Assume no redundancy in anything else and plan accordingly.
I would never deploy anything with a "shared-nothing" approach on commodity hardware. It is just too likely to fail.
--
"Be grateful for what you have. You may never know when you may lose it."
Re:Speculation by Ed+Avis · 2007-02-14 22:12 · Score: 1

It sounds pretty straightforward to fix that: each part of the database is stored identically on three different commodity boxes. That triples your hardware costs but it still works out cheaper than a single monster machine.
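A quick probability sketch in Python shows how much replication helps on paper. It assumes node failures are independent and that each node is up 99% of the time; that figure and the function name are illustrative assumptions, not measurements.

```python
# Assumes node failures are independent and each node is up with probability p.
# The numbers are illustrative, not measurements.

def query_availability(partitions, node_uptime, replicas=1):
    """Chance that every partition has at least one live copy."""
    partition_up = 1.0 - (1.0 - node_uptime) ** replicas
    return partition_up ** partitions

P = 0.99  # hypothetical per-node availability
single = query_availability(partitions=10, node_uptime=P)
tripled = query_availability(partitions=10, node_uptime=P, replicas=3)

print(f"10 partitions, no replicas: {single:.4f}")
print(f"10 partitions, 3 replicas:  {tripled:.6f}")
```

With ten unreplicated partitions the query succeeds roughly 90% of the time; with three copies of each partition it is about 99.999%. Correlated failures (a shared power feed, a bad controller model) break the independence assumption, which is where the advice about different hardware and different sites comes in.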

--
-- Ed Avis ed@membled.com

I've been waiting for something like this ... by Qbertino · 2007-02-14 10:01 · Score: 2, Insightful

... for a long time.
Classic RDBMSes are crutches. A forced-upon necessity we have to put up with for our app models to latch on to real world hardware and its limitations. A historically grown mess with an overhead so huge it's insane. With a Database PL and 30+ dialects of it from back in the days when we flew to the moon using a slide rule as primary means of calculation.
If what they claim is true, these guys are probably finally ditching the omnipresent redundant n-fold layers of user and connection management in favour of a lean system that at last does away with the distinction of filesystem and database and data access layer. Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.
I tell you, finally ditching classic RDBMSes is *long* overdue, they're basically all the same ancient pile of rubble, from MySQL up to Oracle. If these guys are up to taking on this deed (or part of it) and they get finished when solid-state finally relieves our current super-slowpoking spinning metal disks on a broad scale we'll feel like being in heaven compared to the shit we still have to put up with today.
I wish these guys all the best. They appear to have the skills to do it and the authority to emphasise that today's RDBMSes and their underlying concepts are a relic of the past.
My 2 cents.

--
We suffer more in our imagination than in reality. - Seneca

Re:I've been waiting for something like this ... by Anonymous Coward · 2007-02-14 15:36 · Score: 1, Insightful

Score 3, Insightful?

You have got to be kidding me. This comment is the most nonsensical idiotic drivel ever. Read a goddamned database textbook before making such an asinine comment to at least learn the fundamentals of why RDBMS systems are used (and used successfully when used properly).

Yes, SQL isn't all that great, but you can take your object persistence system and shove it right up your ass. Most of them are total pieces of shit.
Re:I've been waiting for something like this ... by Qbertino · 2007-02-14 20:50 · Score: 1

You know, that last line of yours was really convincing. Solid arguments, I must say.

--
We suffer more in our imagination than in reality. - Seneca
Re:I've been waiting for something like this ... by LizardKing · 2007-02-14 22:41 · Score: 1

Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.

I worked on just such a system, and ended up replacing it with a straightforward RDBMS. The object persistence layer serialised to disk, which offered no benefits over using an RDBMS as the backend data store (which had been in the original design oddly enough). It had to keep everything in memory - which proved impossible when the dataset grew to 80GB. It was unreliable, couldn't perform adequately and could not be distributed across multiple machines without a ground up rewrite.

The replacement RDBMS, queried directly from the business logic layer, proved to be faster on the development machine (which had half the processors of the persistence store's production boxes). It was also easier to work with and far more reliable.
Re:I've been waiting for something like this ... by sonofagunn · 2007-02-15 03:49 · Score: 1

Well, keep waiting, because this is nothing like what you have described. This is another SQL RDBMS with a different way of storing the data on disk so that it is optimized for data warehouses instead of transaction based DBs.
Re:I've been waiting for something like this ... by jadavis · 2007-02-16 20:46 · Score: 1

Imagine a persistence layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object support for any PL you wish to compile in.

They've had databases like that for a long time. And then they invented Relational databases to solve the problems with the simple types of databases you describe.

The advantages of an RDBMS are:
* abstraction of the physical representation of data from the way the data is logically presented
* a model for constraints that are meaningful yet can perform well: normalization. Of course, constraints can only detract from performance, but normalization performs well in comparison with alternative implementations of constraints.
* a declarative query language that tells the database what you want not how to get it. This allows the database engine to optimize and choose a new strategy based on, e.g. the size of a table. No need to rewrite the query in the application or re-code the application to ask for the data in a different way.

I encourage you to read and think about why these things are valuable. The database can be a part of the application development process, and can make radical application changes much simpler. A simple example is that you can almost always avoid painful data format changes when your application needs to meet a new requirement.
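The declarative-query point in the list above can be seen with Python's built-in sqlite3 module. This is only a small sketch: SQLite stands in for "the database", and the table and index names are made up. The application's query text never changes, yet the engine picks a different strategy once an index exists.

```python
import sqlite3

# In-memory SQLite stands in for "the database"; table and index names are
# made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT id, total FROM orders WHERE customer_id = ?"

def plan(conn, sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(conn, QUERY, (42,))
result_before = conn.execute(QUERY, (42,)).fetchall()

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan_after = plan(conn, QUERY, (42,))
result_after = conn.execute(QUERY, (42,)).fetchall()

print("before index:", plan_before)
print("after index: ", plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan is a table scan and the second uses the new index, while the rows returned are the same.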

--
Social scientists are inspired by theories; scientists are humbled by facts.

never mind by oohshiny · 2007-02-14 10:08 · Score: 1

There wasn't much information on the web site, but everything is in Wikipedia (look under C-Store, the BSD-licensed open source version). It really is just a column-oriented database.

Re:it won't work 100 times faster - I'll take bets by georgewilliamherbert · 2007-02-14 10:10 · Score: 1

A) Is your benchmark a data warehouse type app benchmark or transactional? Column oriented is slower for transactional typically but much faster for data warehouse. I don't care how many frames per second you measure if I'm buying a LAMP web server system.

B) Your benchmark data doesn't show that you've tried to run Sybase IQ or C-store column-oriented databases against the workload.

Are you really sure that you want to be so sure about this, given that you may not be testing the right thing, and haven't tested the comparable things? 8-)

Given that... by CodeShark · 2007-02-14 10:26 · Score: 4, Informative

MonetDB is similarly column-oriented AND Open Source, and appears to clean the clock of most of the major commercial and Open Source databases for huge data set queries (see the benchmarks at axyana.com for an example), where is Vertica's market advantage supposed to be?

What I'm asking is this: Vertica is obviously well-researched and well-funded as a startup, but MonetDB is well-researched, already benchmarked, and available now. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than spend that time and energy developing market leadership in my chosen corporate area in the present?

--
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...

Re:Given that... by perfczar · 2007-02-14 11:46 · Score: 5, Informative

Here are a few of the technical reasons one might choose Vertica over Monet; I'll not get into business issues.

Vertica is designed for large amounts of data, and is optimized for disk based systems. Monet does benchmarks against TPC-H Scale Factor 5 (30 million records, an amount which would fit in main memory) running on Postgres; Vertica does TPC-H Scale factor 1000 (6 billion records) against commercial row stores tuned by people who do such work to make a living.

Vertica runs on multi-node clusters, allowing the cluster to grow as the amount of data grows, while Monet doesn't scale to multiple machines.

There are numerous differences in the transaction systems, update architecture, tolerance of hardware failure, and so on, that make Vertica better suited to the enterprise DW market.

Note: I work for Vertica
Re:Given that... by fivelittlemonkeys · 2007-02-14 19:44 · Score: 1

For another new column-store DBMS which actually is open-source (GPL), check out LucidDB.
Re:Given that... by CodeShark · 2007-02-15 03:05 · Score: 1

Thanks for the perspective. That's what I like about /. I can put out a thought or question and mostly get good information back relatively quickly.

--
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
Re:Given that... by Circuit+Breaker · 2007-02-15 04:03 · Score: 1

Can you compare kdb+ ( of which you're probably aware, but if not - http://kx.com/ ) to Vertica ?

Re:Omg top 5 by bob.appleyard · 2007-02-14 10:31 · Score: 3, Funny

You're 100 times faster than anyone else, obviously.

--
How dare you be so modest!! You conceited bastard!!

Comprable? by Pinback · 2007-02-14 10:31 · Score: 1

I wonder how this compares to Netezza: http://en.wikipedia.org/wiki/Netezza

Re:Comprable? by rla3rd · 2007-02-15 02:16 · Score: 1

the wiki link answers your own question: The Netezza appliance is composed of an embedded database server (originally based on PostgreSQL).

Re:it won't work 100 times faster - I'll take bets by Splab · 2007-02-14 10:33 · Score: 2, Insightful

Uhm... wtf?

Seriously, you tested MySQL vs. other databases with "out of the box" setups? MySQL isn't a real database when running the MyISAM engine, you simply cannot compare that with anything else. And on top of that, try doing a proper insertion in MySQL, one single transaction with a few million rows, and see how well that does. Oh and did you ever stop to think about _why_ MySQL does perform so much faster on that test? Try doing it on an InnoDB table with standard setup, even at 600k rows it slows to a crawl. (Easily fixable, but requires some optimizations)

Seriously the reason why big vendors have a clause in their eula for people to NOT do benchmarks is exactly people like you, you have no idea about what you are comparing, just figured that setting up something out of the box will give a good insight into the speed. Sheesh.

Ohh and the 100 fold increase in speed is very much likely to happen - on certain types of queries. With horizontal representation you can do sequential scan only on the part of the data that you need, not the entire set, which should be very very fast.

You would lose by acidmonth · 2007-02-14 10:46 · Score: 1

I've worked with another of this type of system - Alterian. For query intensive work, as compared to well designed / indexed Oracle, it was easily 100s of times faster. Sure, LOADING sucks, but given the goal was occasional loads with lots and lots of queries, it worked, and worked well.

you read 40 pages in under 4 mins - you're fast? by tota · 2007-02-14 10:49 · Score: 1

> people like you, you have no idea about what you are comparing, just figured that setting up something out of the box will give a good insight into the speed.

I guess you didn't read the first page, or the second?

As stated (multiple times), the purpose of this report is to compare various aspects with "out of the box" performance, with all the caveats that it implies.
And FYI I will be comparing MySQL InnoDB next time around.

> Ohh and the 100 fold increase in speed is very much likely to happen

> With horizontal representation you can do sequential scan only on the part of the data that you need..

A scan is still a scan is still a scan.
And even with horizontal representation you shouldn't be too far off indexed data access speed, so the "100 times" figure is still unrealistic.

But, heh, you don't know me, so keep talking...

--
TODO: 753) write sig.

Google uses this approach by russryan · 2007-02-14 11:09 · Score: 3, Informative

See http://en.wikipedia.org/wiki/Bigtable for a description of Google's column oriented database.

Re:Google uses this approach by ramakant · 2007-02-14 12:03 · Score: 3, Informative

Here's a good comparison of the two approaches:
http://glinden.blogspot.com/2006/05/c-store-and-google-bigtable.html
(per my post below, Vertica is a commercial version of MIT C-Store: http://db.lcs.mit.edu/projects/cstore/ )

Re:Sounds great but.. by perfczar · 2007-02-14 11:10 · Score: 4, Informative

The Vertica business model is to sell a database engine (software to store and query data). Clearly use of standard interfaces is important, otherwise nobody would be able to make use of the product (which really ends up being a component of a larger system or strategy) without going to a heap of trouble. So of course Vertica has:

A JDBC driver
An ODBC driver
An interactive SQL client
A growing list of tested integrations with other software

Note: I work for Vertica

Re:you read 40 pages in under 4 mins - you're fast by Splab · 2007-02-14 11:21 · Score: 1

Right, so you got a table with let's say, 5 million rows spanning perhaps 2 GB of data. Now if you want to find out how many of the rows contain a specific value, let's say foo_bar is below 50, with vertical representation you need to scan your entire dataset, that is 2 GB. With horizontal representation you only need to dig through 5 million entries, let's assume it's 32bit integers and you are down to 20MB of data. Of course you can fix a bit of this by using an index covering just that column, but on very large datasets it just isn't an option.
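The arithmetic in that paragraph is easy to check in Python. The sketch below takes "2 GB" as decimal gigabytes and follows the comment's own wording, in which "horizontal representation" means reading just the one column; the variable names are made up.

```python
import struct

ROWS = 5_000_000
TABLE_BYTES = 2 * 10**9                  # "perhaps 2 GB", taken as decimal GB
INT_BYTES = struct.calcsize("<i")        # 4 bytes for a 32-bit integer

full_scan_bytes = TABLE_BYTES            # every record is read in full
column_scan_bytes = ROWS * INT_BYTES     # only the foo_bar values are read

print(f"whole-record scan:  {full_scan_bytes / 10**6:.0f} MB")
print(f"single-column scan: {column_scan_bytes / 10**6:.0f} MB")
print(f"ratio: {full_scan_bytes / column_scan_bytes:.0f}x")
```

For these particular numbers the single-column scan reads 20 MB against 2000 MB, a factor of 100 in bytes read. Wider columns or narrower rows would shrink that factor.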

And about me not knowing you - I don't need to, a quick scan of your paper told me you had a lot to learn. If you already know your test is flawed why the hell do you keep it online?

More Scalability by Doc+Ruby · 2007-02-14 11:26 · Score: 1

How about a database with the exact same query API (not just "but it's all SQL") as, say, Oracle or MS-SQL, or even Postgres, that allows any number of parallel query servers to work against a single datastore?

In other words, instead of yet another incompatible database, how about one that we could just switch to from an existing one, that is arbitrarily scalable against shared data. If you're going to get clever and act like you can solve hard problems, why not give people what we need, and not just what you think you can give us?

--

--
make install -not war

Re:More Scalability by PCM2 · 2007-02-14 12:38 · Score: 1

How about a database with the exact same query API (not just "but it's all SQL") as, say, Oracle or MS-SQL, or even Postgres, that allows any number of parallel query servers to work against a single datastore?

What would be the purpose of that? Performance gains? I/O is going to be your bottleneck there, and it sounds like it would start to clog up sooner, rather than later.

In other words, instead of yet another incompatible database, how about one that we could just switch to from an existing one, that is arbitrarily scalable against shared data.

Vertica isn't using shared data. It's a "shared nothing" database. Each server in the cluster holds only part of the database. Therefore, query processing and data access I/O is distributed equally. FYI, IBM DB2 does clustering this way. By comparison, Oracle RAC is a "shared everything" database. Each server holds a replica of the entire dataset. This makes for a very reliable database, but replicating the data limits the performance of the cluster when compared to a shared-nothing design.

If you're going to get clever and act like you can solve hard problems, why not give people what we need, and not just what you think you can give us?

If you're going to get all snotty and superior, why not first try to understand the problem that the product is intended to solve, not just the problem you're imagining people have? This product is targeted at the data warehousing market. If you know a better way to do data warehousing than what Vertica is proposing, it seems like the time you spend posting to Slashdot could be better spent publishing research papers.

--
Breakfast served all day!
Re:More Scalability by Doc+Ruby · 2007-02-15 02:23 · Score: 1

IO is the bottleneck anyway. The scheme I mentioned reduces the bottlenecks to that single one. And it allows arbitrary scaling with minimal (if any) recoding, just by adding HW.

If you're going to get snotty and dismissive, why not recognize that the needs of the market, easily/cheaply scalable databases without complex planning in application design, are more important than what this team happens to think it can do better, and don't need a vendor white paper to make clear in a few sentences?

--
--
make install -not war
Re:More Scalability by PCM2 · 2007-02-15 07:44 · Score: 1

IO is the bottleneck anyway. The scheme I mentioned reduces the bottlenecks to that single one. And it allows arbitrary scaling with minimal (if any) recoding, just by adding HW.

That's dumb.

In the shared-nothing design, each query server only has access to part of the database. Therefore, each server is able to use 100 percent of its I/O to access its portion of the data, because no other servers are trying to access that data. Bottlenecks only occur when the cluster is poorly designed and a disproportionate number of queries hit a single server.
By comparison, in a shared-everything design, each server has a complete copy of the data, therefore each server can use 100 percent of its I/O to access that data. The problem is that you have to keep all those copies of the data replicated, which introduces overhead that the shared-nothing design doesn't have.

In a scheme like you describe, you have a data store and you want to add an arbitrary number of query servers to access it. Unlike either of the actual, real-world designs I just mentioned, in your scheme the data store is the bottleneck. If you want to have ten query servers accessing the data store, then the data store must have ten times the I/O bandwidth of any one of the query servers if the system is to operate with no degradation of performance. If you want to add more, you need to find some way to increase the bandwidth of the data store. This should be obvious. If you can't do the math, just consider the fact that no real-world database clusters use this design.
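The bandwidth argument above can be put into a toy model. This is only a sketch: the function names and the MB/s figures in the usage note are assumptions, and it ignores caching, network cost, and skewed query load.

```python
# Toy model of aggregate scan bandwidth (MB/s) for the designs discussed above.
# All names and numbers are illustrative assumptions.


def shared_nothing_bandwidth(nodes: int, per_node_io: float) -> float:
    """Each node scans only its own partition from its own disks."""
    return nodes * per_node_io


def shared_store_bandwidth(nodes: int, per_node_io: float, store_io: float) -> float:
    """All query servers read from one data store, which caps the total."""
    return min(nodes * per_node_io, store_io)


def required_store_io(nodes: int, per_node_io: float) -> float:
    """I/O the single store needs so that no query server is starved."""
    return nodes * per_node_io
```

For example, with ten query servers that can each consume 100 MB/s, `required_store_io(10, 100.0)` is 1000.0, while a store that only delivers 100 MB/s holds `shared_store_bandwidth(10, 100.0, 100.0)` at 100.0 no matter how many servers are added.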
If you're going to get snotty and dismissive, why not recognize that the needs of the market, easily/cheaply scalable databases without complex planning in application design, are more important than what this team happens to think it can do better, and don't need a vendor white paper to make clear in a few sentences?

More important to whom? Vertica is not a general purpose database. Period. It doesn't matter how many people might want something else. Existing general-purpose RDBMSs serve the general database market well (and they don't do what you're asking for because it won't work). This is a niche product.

--
Breakfast served all day!

This is a commercial version of MIT C-Store by ramakant · 2007-02-14 11:37 · Score: 4, Informative

This looks like it will be a commercial version of C-Store, the column-oriented database developed by Michael Stonebraker and MIT:
- Web site: http://db.lcs.mit.edu/projects/cstore/
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.

Re:No scoop. PR being sneaked by Vertica! by georgewilliamherbert · 2007-02-14 11:47 · Score: 1

It takes balls to say things like that about Michael Stonebraker in the database field... ...and lack of brains or historical clue...

I apologize in advance by President_Camacho · 2007-02-14 11:48 · Score: 1

Yeah, but what does its radar signature look like?
Probably, a flock of seagulls.

That would make sense for a remote application. When ran, they're ran so far away.

--
Wizard Needs Food, Badly

Re:Don't forget to put the cover sheet, shitcock by MrAnnoyanceToYou · 2007-02-14 12:02 · Score: 1

Heh.

I'm always amazed at the vehemence.

What good is a huge pile of data with no order? Someone has to pay the bills, someone has to see where the profits are, someone has to see which shipments went out late. These are reporting functions. You create data once, but use it many, many times if you are paying attention to it. I'm sorry you feel differently.

Execution of insert queries is extremely important and time sensitive. Execution of everything else is often not quite as mission critical, but it is still important enough to be worthwhile, or no one would do it.

--
My little site.

One size doesn't fit all by perfczar · 2007-02-14 12:13 · Score: 2, Interesting

This is a different kind of issue, really, more like the difference between a CPU and a GPU. At the moment, a good GPU has >100x the performance of a good CPU on a certain class of computations. Column stores will clearly never replace row stores for transaction processing for obvious reasons, but (coupled with a few other architectural decisions) they do exhibit >100x the performance of row stores for the kinds of queries seen in data warehouses.

Also, the two technologies are complementary. The goal is not to replace one thing with another, but to provide more kinds of tools and make them work together. Keep a row store for transaction processing, and feed the data into a column store for analysis in near real-time, much like a video game uses a CPU for AI and a GPU for 3D rendering.
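A minimal sketch of the two layouts, assuming a made-up orders table. It is not how Vertica or any other product stores data; it only shows why an aggregate over one column favours the column layout while a single-record insert favours the row layout.

```python
# The same three records in a row-oriented and a column-oriented layout.
# Table and field names are hypothetical.

rows = [
    {"order_id": 1, "customer": "acme", "region": "east", "amount": 120.0},
    {"order_id": 2, "customer": "bolt", "region": "west", "amount": 80.0},
    {"order_id": 3, "customer": "acme", "region": "east", "amount": 50.0},
]

# Column layout: one list per column, values at matching positions.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_amount_row_store(table):
    """Walks every record to pick out one field (models reading whole rows)."""
    return sum(record["amount"] for record in table)


def total_amount_column_store(cols):
    """Reads only the 'amount' column."""
    return sum(cols["amount"])


def insert_row_store(table, record):
    """One append per new record: the cheap case for transaction processing."""
    table.append(record)


def insert_column_store(cols, record):
    """One write per column for every new record."""
    for name, value in record.items():
        cols[name].append(value)
```

Both `total_amount_*` functions return the same sum; the difference is how much data each layout has to touch to get there.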

Re:How does this compare to SQLite's column store? by georgewilliamherbert · 2007-02-14 12:22 · Score: 1

http://www.sqlite.org/whentouse.html

I haven't worked with it up close (browse around the website regularly, but don't run it), but all the docs I have seen say SQLite uses a B-tree, not a column store. Do you have an alternate reference to such?

SQLite says not to use it for more than a few GB to tens of GB of data. Sybase IQ, for example, is routinely run with TB-plus quantities of data. It's been tested to a trillion-plus rows of data and 155 TB of input data (which autocompressed down to 55 TB of disk space required to store it).

Vertica's headed thataways, I think, but also to be lighterweight and more general purpose than IQ.

Re: Shared-Nothing Architecture by cartman · 2007-02-14 12:26 · Score: 2, Informative

Gee, I don't know anyone who's been successfully doing this for years...

I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.

I think the shared-nothing approach is the best one for an open-source RDBMS offering. Organizations which use open source will almost certainly want to use commodity, open hardware. Shared-nothing will allow them to do that.

An issue with column orientation by jfroelich · 2007-02-14 13:15 · Score: 2, Informative

Is that you do not scale as well to a large number of columns. To access a set of X records with 100 columns, you have 100 asynchronous I/O calls to the separate column stores. I sell analytical software that does just this, and it is not a technical detail that should just be ignored. In some regards the single-file, row-oriented system has less I/O overhead. We have come up with some ways to reduce the file system overhead, but while it is small, it is noticeable, more so on systems not designed to have a large number of simultaneously open files. All that really happened is that part of the bottleneck shifted to rely less on the product architecture and more on the system architecture. Whether you think that is wise, well, that's up to you.
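To make the per-column cost concrete, here is a toy column store that counts how many column stores a fetch has to open. The class and method names are hypothetical, and an in-memory dict stands in for the separate files.

```python
# Toy column store that counts reads against per-column stores.
# Names are hypothetical; a dict of lists stands in for one file per column.


class ToyColumnStore:
    def __init__(self, columns):
        self._columns = columns      # column name -> list of values
        self.reads = 0               # number of column stores opened so far

    def _read_column(self, name):
        self.reads += 1
        return self._columns[name]

    def fetch(self, row_ids, names):
        """Rebuild the requested rows using one read per requested column."""
        cols = {name: self._read_column(name) for name in names}
        return [{name: cols[name][i] for name in names} for i in row_ids]
```

Fetching all 100 columns of a 100-column table costs 100 reads in this model, while a query that names two columns costs two, regardless of how wide the table is.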

BTW, first post, I am no longer an eavesdropper, yay

Josh

Stealthy? by plasmacutter · 2007-02-14 13:33 · Score: 2, Funny

it's on the front page of slashdot.. how stealthy can it be?

--
VLC FOR MAC IS DYING! IF YOU DEVELOP, PLEASE SAVE IT!!

The interesting thing to me is... by BillAtHRST · 2007-02-14 13:43 · Score: 1

What does this have to do with StreamBase? Is Stonebraker just throwing StreamBase under the bus? Are they complementary? How can one person (even someone with his abilities) function as CTO of two separate companies?

I wonder if it's anything like Daytona? by peektwice · 2007-02-14 14:16 · Score: 1

http://www.research.att.com/viewProject.cfm?prjID=69

--
Other than this text, there is no discernible information contained in this sig.

Required technologies? by bestguruever · 2007-02-14 14:20 · Score: 1

Any relation to Required technologies? Unfortunately, that's what I think of when I hear about a column store

--
if you think this is bad, you should have seen my last sig

thanks for proving my point - more examples below by tota · 2007-02-14 14:21 · Score: 1

So under some very specific circumstances, when compared with a table that is not indexed, horizontal is going to be much faster (since that's what it is designed to do...). Well that's no surprise!
All I am saying is that claiming that the performance is going to be 100 times faster is not a good measurement. Every database vendor will find a scenario that suits their engine and proves unequivocally that they are the best - but they can't all be right, now can they?

This is exactly the kind of thing that I tried to avoid in the paper that you didn't read: by focusing on very simple cases (and very simple differences being measured)
Some examples to save you reading it:

IBM JDK 1.5 threading model does not scale well and causes much more system load
PostgreSQL does not benefit from using prepared statements as much as other dbs (even less so for updates)
On the same hardware, MySQL is generally faster as i686 than x86_64

etc...

But eh, "this benchmark is flawed", right? so just delete this information from your brain, you'll be fine.
I won't deny I still have plenty to learn, that's what I enjoy doing! And I like sharing it too!
read it!.

--
TODO: 753) write sig.

Big claims are backed by Virtual_Raider · 2007-02-14 14:29 · Score: 3, Informative

Still "100X faster" is a big claim. Lots of smart people have been working on DBMSes for many years, a two order of magnitude improvement is a "I will have to see it to believe it" type claim

Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark

--
+Raider of the lost BBS

Stupid question: Still SQL? by WoTG · 2007-02-14 15:54 · Score: 2, Interesting

I've never heard of column based databases prior to this article. Would I be correct in assuming that you still can work with these using regular SQL?

Re:Stupid question: Still SQL? by sonofagunn · 2007-02-15 03:08 · Score: 1

Yes, still SQL. Column oriented DBs are meant to optimize SQL reads where you only are using a few columns in your SQL, but the tables have many columns. This doesn't change anything about SQL.

Microsoft or Linux? by MadnessASAP · 2007-02-14 16:25 · Score: 1

Is that a Microsoft 100x increase or a Linux 100x increase? (For reference 100x(Microsoft) = 1.1x(Linux))

--
I may agree with what you say, but I will defend to the death your right to face the consequences of saying it.

welcome to 1994 by kpharmer · 2007-02-14 17:45 · Score: 1

> Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access
> to a huge amount of computing power. The typical DBMS runs on at best a multi-processor server. This is kind of like a DBMS
> server running on a "seti at home" type network.

Or like Teradata in around, what? 1992? Informix around 1994? DB2 around 1995? Oracle isn't there yet since their grid solution is more about failover than partitioning.

This is now lower-end functionality in the high-end database market. The typical database only runs best on an SMP if you mean PostgreSQL or MySQL when you say typical. The large commercial databases can easily split your data across 2, 4 or 500 servers to answer, in about a second, complex queries across billions of rows of data.

SQL would be inefficient by Jayson · 2007-02-14 17:46 · Score: 2, Informative

One of the benefits of column oriented DBs is that tables have an ordering, and that ordering can be exploited in queries. SQL doesn't give a good way to exploit it. Column DBs do allow SQL, but they also have other native languages that people tend to use.
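One way ordering can be exploited, sketched with Python's standard bisect module: if a column is kept sorted, a range predicate becomes two binary searches and a contiguous slice. The timestamp and price columns are made-up data, and this is not a claim about how kdb or any other column DB implements it.

```python
from bisect import bisect_left, bisect_right

# A column kept in sorted order (hypothetical timestamps) and a parallel
# column holding the matching values at the same positions.
timestamps = [100, 105, 110, 110, 120, 130, 145]
prices = [9.0, 9.5, 9.4, 9.6, 10.1, 10.0, 10.3]


def range_slice(sorted_col, low, high):
    """Return (start, stop) positions of values in [low, high] via binary search."""
    return bisect_left(sorted_col, low), bisect_right(sorted_col, high)


def prices_between(low, high):
    """Prices whose timestamp falls in [low, high], read as one contiguous slice."""
    start, stop = range_slice(timestamps, low, high)
    return prices[start:stop]
```

Here `prices_between(105, 120)` returns `[9.5, 9.4, 9.6, 10.1]` without examining the timestamps outside the range.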

Should scale better by Jayson · 2007-02-14 17:49 · Score: 1

Column-oriented DBs should scale better with more columns if the query doesn't access all the columns (which they rarely do). The DB only needs to keep the columns in memory that are being accessed. This is far better than a row-oriented DB that needs to cycle through the entire table or numerous indexes to get a result set.

So it's like an Index... by DoChEx · 2007-02-14 19:18 · Score: 1

Column oriented sounds to me a lot like an index with a single field. Even if you place all columns into a single block, other columns could still be required when displaying the record; if you have multiple columns like this, it could mean a lot of blocks being read once you actually know the direct block locations of your data. So do you combine row and column styles, then use some DDL to mark which columns to store independently of the rest of the row?

Re:How does this compare to SQLite's column store? by sonofagunn · 2007-02-15 00:05 · Score: 1

We use Sybase IQ where I work, and other than rebooting it every few days, frequent crashes, and slow running queries, it's pretty good. We are right now offloading a huge reporting database into SQL Server because our queries run about 6x faster. This is a star-schema data warehouse with a few TB of data. The big difference is that SQL Server supports partitioning.

KDB on disk by Jayson · 2007-02-15 02:46 · Score: 1

KDB actually does its best when pulling from disk. If you can cache a B-tree in memory, there isn't as much of a performance increase, since disk seeking is your typical enemy in these large datasets. A hash table is even good when everything is in memory, so standard indexes do well. However, when going to disk, KDB can get away with far fewer disk seeks since it is pulling contiguous regions into memory.

It will be interesting to see how Stonebraker's new DB performs, since there are a number of column DBs out there, but only a few actually work well with the massive amounts of data that Stonebraker seems to be targeting.

Re:Don't forget to put the cover sheet, shitcock by MrAnnoyanceToYou · 2007-02-15 03:50 · Score: 1

People who manage are sometimes tards. Ideally, in a streamlined shop, there is a lot of use for data warehousing. Instead, because it's a tech task that can be sold as a big budget increase, it sometimes gets a lot of people working on it and produces huge piles of crud.

Instead of producing reams and reams of reports no one needs, what if the people in charge were capable of and willing to dig into the warehouse and find out what was going right and what was going wrong? Or hire someone to do so? That's what data warehousing would be great for. It's not used that way because of organizational factors, but the fact is that it is a tool, and expanding its usefulness or making it more useful has nothing to do with whether that tool is well used. The best managers I have known are interested in acquiring as much information as possible in an organized format. And redefining the format again and again in the hopes that it means something. Your particular set seems to be more interested in process; that's fine sometimes. But don't assume that that is the way it always goes.

--
My little site.

Teradata by kbahey · 2007-02-15 04:39 · Score: 1

You just described Teradata, which has been around since 1979, and does just what you said.

Initially, they used parallelized hardware with each "node" having its own disks, with tables partitioned, and a specialized interconnect. They then migrated all that to software.

See the diagrams on page 4, 5 and 6 here (PDF).

--
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.

rarely happens in the real world?? by HornWumpus · 2007-02-15 05:53 · Score: 1

Unless the query is slow enough to be a problem and has to be optimized.

Then it pretty much always happens.

I know, I'm the dude that makes slow processes run 10x faster, sometimes more, depending on the level of incompetence of the original developers. The level of incompetence can sometimes be mind blowing. I've gotten 100x speed improvements when I was the 30th developer (all of whom were supposed to be looking for optimizations) looking at a block of client code/stored procs/queries.

As to updates, what percent of updates only change one column worth of data? It depends on the database design of course. The percent is higher in highly normalized designs but I suspect it rarely hits 50%.

The database design in turn is often a thoughtful trade-off between transaction processing and analytic tasks.

My question. How much server power does this product need vs. setting up a mainstream db and just indexing ALL the columns.

Let's leave compound indexes out of the discussion for now but note the mainstream dbs will likely come out ahead on that metric.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'

Re:rarely happens in the real world?? by AKAImBatman · 2007-02-15 06:29 · Score: 1

You *do* realize you're saying the exact same things I did, right?

--
Javascript + Nintendo DSi = DSiCade

Millions of random records? by HornWumpus · 2007-02-15 06:00 · Score: 1

Reviewing my 20 years at this crap...

Never seen a requirement to load random records. They are always based on a search criteria. Which is usually indexable.

Granted there's the case of field x that contains the text 'blah'. That is a small part of database operations and doesn't particularly gain much from this.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'

as compared to well designed / indexed Oracle by HornWumpus · 2007-02-15 06:09 · Score: 1

Define well designed / indexed.

I suspect it was not all that well designed or indexed (for analysis). Most highly normalized designs are optimized for transaction processing.

You realize that mainstream dbs have ways to tell the server to keep certain indexes in memory? That can make carefully tuned queries scream. You can see 100x differences between designs put together following normalization rules and those thoughtfully de-normalized and with performance tweaks applied (pinned indices, sometimes compound indexes, sometimes pinned compound indexes, realized views etc etc).

Designing a mainstream db for query intensive work is very different to designing a mainstream db to TP.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'

but on very large datasets it just isn't an option by HornWumpus · 2007-02-15 06:17 · Score: 1

You almost had me convinced you knew something of what you said. Until that line.

You sir are an idiot. Go back to your CS homework.

Are you telling me indexes stop working at a certain record count? They have a cost of course, as they contain a copy of the column's worth of data in an easy to scan format. Just like column oriented dbs. When scanning an index you are down to the same 20MB of data.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'

Re:but on very large datasets it just isn't an opt by Splab · 2007-02-15 11:02 · Score: 1

Yes in fact they stop working at a certain point. Try a fun experiment, populate a table where 5 percent meets a certain criteria, 7.5 percent another, 12.5 percent another, 25 percent another and 50 percent for the last one. Run each query against a cold database (reboot your system - or stop the database, unmount the filesystem, remount it and start the database again). Then create an index covering the queries and run the queries again, again against a cold database.

You will notice that at first index is much faster than non index, but when you get nearer 25 percent coverage the two types are about the same, index will most likely be a bit slower than non indexed. But at 50 percent they are going to be almost the same, indexed a bit slower.
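A rough sketch of the experiment described above, using Python's built-in sqlite3 module. The table layout, bucket labels, and row count are assumptions. It runs against an in-memory database, so it does not reproduce the cold-cache disk behaviour the post is about; pointing it at a file-backed database and restarting between runs would be needed for that.

```python
import random
import sqlite3
import time

# Hypothetical bucket labels with the selectivities from the post.
# Shares are in 1/1000ths so that 7.5% and 12.5% stay exact.
BUCKETS = [("a", 50), ("b", 75), ("c", 125), ("d", 250), ("e", 500)]


def build_table(conn, n_rows=100_000):
    """Create t(id, bucket, payload) with the bucket proportions above."""
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, bucket TEXT, payload TEXT)")
    labels = []
    for label, per_mille in BUCKETS:
        labels.extend([label] * (n_rows * per_mille // 1000))
    random.Random(0).shuffle(labels)   # spread matching rows across the table
    conn.executemany(
        "INSERT INTO t (bucket, payload) VALUES (?, ?)",
        ((label, "x" * 100) for label in labels),
    )
    conn.commit()


def time_query(conn, label):
    """Return (matching row count, elapsed seconds) for one bucket."""
    start = time.perf_counter()
    rows = conn.execute("SELECT payload FROM t WHERE bucket = ?", (label,)).fetchall()
    return len(rows), time.perf_counter() - start


def run_experiment(n_rows=100_000):
    """Time each bucket query without an index, then again with one."""
    conn = sqlite3.connect(":memory:")
    build_table(conn, n_rows)
    results = {}
    for label, _ in BUCKETS:
        count, elapsed = time_query(conn, label)
        results[label] = {"rows": count, "no_index": elapsed}
    conn.execute("CREATE INDEX idx_bucket ON t (bucket)")
    for label, _ in BUCKETS:
        _, elapsed = time_query(conn, label)
        results[label]["with_index"] = elapsed
    conn.close()
    return results


if __name__ == "__main__":
    for label, r in run_experiment().items():
        print(label, r["rows"], f"{r['no_index']:.4f}s", f"{r['with_index']:.4f}s")
```

The timings will vary by machine and engine, so no expected numbers are given here; the point is only the shape of the test.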

Regarding the index at 20MB, yes you could do that, but when you've got millions of rows, indexes will get very expensive and that's why we want horizontal representation for these types of queries.

MonetDB does not need stealth by mkersten · 2007-02-16 09:52 · Score: 1

Vertica has indeed made the business/venture steps to
follow the MonetDB approach to exploit column-based stores
for large scale datawarehouse solutions.
Its science library provides many studies on the underlying
technology.

MonetDB has already built a business history in the
area of analytical CRM solutions available through SPSS.
In the area of datamining PROXIMITY is a leading
product for relational mining.

Not to mention the support for both SQL and XQuery
engines. This all in the context of an open-source
community activity for several years.

See http://monetdb.cwi.nl/
http://monetdb.cwi.nl/projects/monetdb/Development/Credits/Partners/index.html

Very promising for SOME applications by CurtMonash · 2007-02-16 11:22 · Score: 1

I had a long chat with Mike Stonebraker a few weeks ago, and came away with the following tentative opinions about Vertica's prospects, and those for columnar systems in general.

* Pinpoint data lookup doesn't seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
* Constrained query and reporting would seem to be a sweet spot, even though it's a sweet spot for some of the best competition as well.
* Cube-filling calculations involve big intermediate result sets. I'm not sure that's a great fit for columnar systems.
* Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
* Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica's have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they're starting from vaguely similar philosophical roots.

There also are some technical details in that article; a link to a short, somewhat hagiographic intro to Mike himself; and so on.

--
To err is human. To forgive is good system design.

Vertica is far from the only columnar game in town by CurtMonash · 2007-02-16 11:27 · Score: 1

And it's not just Sybase IQ, either. There are lots of columnar players. Kognitio also has a columnar VLDB offering, but it's quite different from Vertica's. And the columnar memory-centric BI offerings are interesting as well, such as QlikTech's and SAP's. Also, full-text indexing is pretty columnar itself.

--
To err is human. To forgive is good system design.
