Cassandra 0.7 Can Pack 2 Billion Columns Into a Row

Posted by timothy on Sunday January 16, 2011 @12:58PM from the but-only-if-they're-really-thin dept.

angry tapir writes "The cadre of volunteer developers behind the Cassandra distributed database has released the latest version of its open source software. The newly added Large Row Support feature of Cassandra version 0.7 allows the database to hold up to 2 billion columns per row. Previous versions had no set limit on the number of columns, though a single row could hold only about 2GB of data. That size limit has been eliminated."

Re:Typical applications? by Brummund · 2011-01-16 13:12 · Score: 5, Funny

Any application developed by one or more Visual Basic developers, given enough time.
2 billion columns... by aBaldrich · 2011-01-16 13:12 · Score: 4, Funny

ought to be enough for everybody

--
In soviet russia the government regulates the companies.
This is a triumph for hideously bad schema by Sarusa · 2011-01-16 13:16 · Score: 4, Informative

Well good on them for solving an interesting technical problem, but the use cases for this are all bad.
Obvious first use: boss will suggest we optimize the database by using only one gigantic row with two billion columns.
Re:Only 2 billion? by Anonymous Coward · 2011-01-16 13:19 · Score: 5, Funny

You work for Gillette, don't you.
Re:Typical applications? by gratuitous_arp · 2011-01-16 13:20 · Score: 5, Interesting

Apparently the extra columns can be used to do "more" than store data. A link in the article explains how lots of extra columns can be useful for querying data (Cassandra doesn't use SQL). http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
So the primary reason for this doesn't seem to be that one's run-of-the-mill database needs more columns.
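
A minimal sketch of that idea, not taken from the linked article: if the columns of a row are kept sorted by name, one very wide row can act as a precomputed query result that you slice by name range. The `WideRow` class and the timestamp-style column names are assumptions made up for illustration; this is plain Python, not Cassandra's API.

```python
import bisect


class WideRow:
    """Toy stand-in for one wide row: columns are kept sorted by name."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def insert(self, name, value):
        # Only new names need a slot in the sorted list; overwrites reuse it.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# One row per user; each column name is a timestamp, each value an event.
timeline = WideRow()
timeline.insert("2011-01-16T13:20", "commented")
timeline.insert("2011-01-15T09:00", "logged in")
timeline.insert("2011-01-17T00:28", "voted")

jan16 = timeline.slice("2011-01-16T00:00", "2011-01-16T23:59")
print(jan16)
```

The point of the sketch is that the "query" is just a range lookup over sorted column names, which is why a row with a huge number of columns can be useful rather than a schema mistake.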
for those that absolutely positively cannot RTFA by Son+of+Byrne · 2011-01-16 13:28 · Score: 5, Informative

Cassandra appears to be a multi-dimensional datastore that does not store data in the same fashion as a typical RDBMS. It uses columns and rows both to store sets of data uniquely. If you're familiar with Big Table, then, apparently, it's kinda like that.
That just means that they've added even more storage vectors to it than before...not sure why it made slashdot front page...

--
I'd happily pay you Tuesday for a biopsy today!
Re:If you have more than 30 columns by ogrisel · 2011-01-16 13:35 · Score: 5, Informative

Not with column store databases such as Cassandra, HBase and BigTable.
Re:Typical applications? by jrumney · 2011-01-16 13:39 · Score: 5, Funny

What sorta applications need so many columns?
Facebook needs one column for every privacy violation.
Cassandra by tverbeek · 2011-01-16 13:49 · Score: 5, Funny

I predict that bad things will come of this.
Not that anyone will believe me.

--
http://alternatives.rzero.com/
Yes and the funniest thing about all this is by Giant+Electronic+Bra · 2011-01-16 15:35 · Score: 4, Insightful

That we had all of this stuff 30 years ago. It was called 'network' databases, which were pretty much the standard sort of technology before RDBMS came along and everyone realized how incredibly much better relational algebra was for the vast majority of problems. As with many other things older ideas eventually resurface with new names and a few more features. There are times when this kind of facility is useful. Nothing wrong with it. The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span. The best part though is that when something turns into a giant mess guys like me can make nice money fixing the mess. lol.

--
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
Re:Only 2 billion? by zach_the_lizard · 2011-01-16 15:39 · Score: 4, Funny

He doesn't, otherwise it'd be uint64_t and a lather strip!

--
SSC
Re:And Oracle supports EXABYTE sized databases by DavidTC · 2011-01-16 18:01 · Score: 4, Interesting

NoSQL stuff is useful in weird extreme fringe cases, where you need to access data in essentially random ways. Digg, Facebook, and Google all use NoSQL databases, and I think the first two use Cassandra.
Specifically, you kinda make your own rows. It's like having permanent multiple JOINs that you can access instantly, from what I understand. (This is what this article is talking about: the row size is now unlimited.)
Essentially, it's a giant blob of data that exists, and you draw lines on it in advance that are your results, and you can get those result instantly, at the cost of being unable to decide to get other results in real time.
Many of the products let you have them on different servers, so you can have a 'people who have voted for this Digg' table or something, on the server that handles that thing.
I'm not entirely sure how it works, but that's basically it. Oh, and the fact they talk about 'columns' and 'rows' is just utter stupidity in naming to confuse everyone. Basically, they simply tend to keep each column as a file, which allows them to do what I mentioned above: copy needed columns, and just needed columns, to other servers.
It's really weird, and, like I said, only relevant for giant giant databases. There's no way that Google could do a full text search on an RDBMS, regardless of whether it fits in Oracle. What it can do is make a 'column' for each word, and a 'row' for each URL, put different columns on different servers, and that actually works in the non-relational database they use, when there's no way in hell that would work on an RDBMS.
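
A rough sketch of the 'column for each word, row for each URL' layout described above, in plain Python. It is not how Google's system is implemented; the function names and the example.com URLs are made up for illustration. Each word's column holds the set of URLs (rows) that contain it, so each column could in principle live on a different server.

```python
from collections import defaultdict


def build_columns(pages):
    """Map each word (a 'column') to the set of URLs ('rows') containing it."""
    columns = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            columns[word].add(url)
    return columns


def search(columns, *words):
    """Return the URLs whose text contains every one of the given words."""
    hits = [columns.get(word.lower(), set()) for word in words]
    return set.intersection(*hits) if hits else set()


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "example.com/a": "Cassandra stores wide rows",
    "example.com/b": "Oracle stores rows and tables",
    "example.com/c": "Cassandra scales out",
}

columns = build_columns(pages)
print(search(columns, "cassandra", "rows"))
```

A multi-word search only has to read the columns for those words and intersect them, which is the part that spreads well across machines.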
However, more importantly for slashdot, a fuckload of fools think that SQL is somehow 'retarded' and that NoSQL is 'awesome, dude', so they like to play with it, usually by spewing out some crap PHP or Perl or something that works about a tenth as well as just using an RDBMS would work. If they actually understood how to use an RDBMS, that is.

--
If corporations are people, aren't stockholders guilty of slavery?
Re:Typical applications? by Sarten-X · 2011-01-16 18:16 · Score: 4, Informative

Welcome to the first five minutes of using a column store. Screwy, ain't it?
My understanding is that rows' contents are indexed such that they may be retrieved quickly. Think of a row name as a primary key. It's easy to get the whole row when you know its name. Continuing the census application, it'd be like asking for all the birth years of everyone in a geographical region. The requested column family (geographical region) is opened, and each column (person) is quickly checked for the particular row's contents (in case the birth year wasn't provided). Partitioning is done by both row and column family, so only some of the column family's data is actually scanned. That's where the cluster provides a very nice speedup, as well.

"locating a value in a specific row can't tell how to retrieve that entire column"
Now, I'm not sure if I understand your rage-induced rambling correctly, but if you're trying to make a SQL example, you're starting from the wrong premise, which explains why you're having trouble making sense of it all.
Quick review: The "R" in "RDBMS" stands for "relational", referring to an n-ary relation. SQL is intended to manipulate those relations, isolating the data you want to extract. Something that is not described as an RDBMS should not be expected to have relations.
Cassandra functions (from the application perspective) as a key-value store, with no relation structure. That means you don't work with sets, and you don't need to think about set operations. Pull out a row, and you get a list of columns with defined values, as well as those values. Iterate through each value looking for whatever value you're looking for. When you find it, you already have the column name. Just ask for the whole column next. Since the whole thing is running in a cluster, you can parallelize the iterations (I think... I've used HBase, but not Cassandra personally) to speed up the scan.
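
The lookup just described can be sketched roughly as follows. This is plain Python over nested dicts, not Cassandra's or HBase's client API; the `store` contents, the function names, and the choice of one row per attribute with one column per person are assumptions that follow the census example in this comment.

```python
# Hypothetical data: row key -> {column name -> value}.
store = {
    "birth_year": {"alice": 1970, "bob": 1982, "carol": 1982},
    "occupation": {"alice": "engineer", "carol": "nurse"},
}


def columns_with_value(store, row_key, wanted):
    """Pull one row by key, then scan its columns for the wanted value."""
    row = store.get(row_key, {})
    return [name for name, value in row.items() if value == wanted]


def whole_column(store, column_name):
    """Collect one column across every row that actually defines it."""
    return {
        row_key: row[column_name]
        for row_key, row in store.items()
        if column_name in row
    }


born_1982 = columns_with_value(store, "birth_year", 1982)
carol = whole_column(store, "carol")
print(born_1982)
print(carol)
```

Note that "bob" has no occupation column at all; missing cells simply aren't stored, which is why the scan has to check each row for the column.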
If that's not fast enough for you (which is likely), you can use Hadoop's MapReduce framework to scan each cell and create an index, possibly laid over the other table as just more rows & columns (though a different table would be better, from a sanity perspective). Since there's no mandatory structure, that's legit.
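
As a sketch of what such an index-building job does, here is the map/group/reduce shape in plain Python. It does not use Hadoop's API; `map_cell`, `reduce_locations`, `build_index`, and the sample table are names and data made up for illustration, and the whole thing runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_cell(row_key, column_name, value):
    """Map step: emit the cell's value as key and its location as payload."""
    yield value, (row_key, column_name)


def reduce_locations(value, locations):
    """Reduce step: gather every location where the value was seen."""
    return value, sorted(locations)


def build_index(table):
    """Scan every cell, group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for row_key, row in table.items():
        for column_name, value in row.items():
            for key, location in map_cell(row_key, column_name, value):
                grouped[key].append(location)
    return dict(reduce_locations(key, locs) for key, locs in grouped.items())


# Hypothetical census-style table: row key -> {column name -> value}.
table = {"birth_year": {"alice": 1970, "bob": 1982, "carol": 1982}}

index = build_index(table)
print(index[1982])
```

The resulting `index` maps a value straight to its (row, column) locations, so later lookups skip the full scan.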
Of course, that's only valid for this particular census application, which assumes that the only reason for the database is either basic statistics or something complex enough for a MapReduce program.
It's entirely possible to run Cassandra arranged similarly to a normal RDBMS. Use only a few column families with very specific columns (such as a single family for all the "Name, address, etc."). Throw in a bunch of index families, updated with MapReduce. Then, your processing can be a complex MapReduce job, iterating over each row with a particular set of rows meeting all your needed criteria. It'd be just like a normal RDBMS, except you have better scalability, and maintain indexes yourself.
If the trouble of indexing is too much for you, you can follow Google's route with Colossus, which runs MapReduce-like tasks when rows are changed. That's your dynamic indexing.
Here are some links to help your understanding:
- Looking to the future with Cassandra
- Understanding the Cassandra Data Model from a SQL Perspective
- WTF is a SuperColumn? An Intro to the Cassandra Data Model (While reading this, I note some discrepancy of terms I've used due to my familiarity with HBase. Please excuse that.)
--
You do not have a moral or legal right to do absolutely anything you want.
Re:Typical applications? by bjourne · 2011-01-17 00:28 · Score: 4, Informative

Maybe Cassandra should have chosen some other terminology for their database, one that doesn't so obviously conflict with already existing terms. A column in Cassandra is a tuple, which in an RDBMS is a row. Confusion all around.
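
To make the terminology concrete, here is a small sketch assuming the usual description of a Cassandra column as a (name, value, timestamp) triple, with a row being a collection of such columns under one key. The `Column` type and `upsert` helper are illustrative names, not Cassandra's API, and letting ties go to the later call is a simplification.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """One 'column': really a small tuple, not a vertical slice of a table."""
    name: str
    value: str
    timestamp: int


def upsert(row, column):
    """Keep the column with the newest timestamp for each name."""
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


row = {}  # one row: column name -> Column
upsert(row, Column("email", "old@example.com", 1))
upsert(row, Column("email", "new@example.com", 2))
upsert(row, Column("email", "stale@example.com", 1))  # older write, ignored
print(row["email"])
```

Seen this way, "2 billion columns in a row" means 2 billion of these little tuples under one key, not a table 2 billion fields wide.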

--
Football Odds