Cassandra 0.7 Can Pack 2 Billion Columns Into a Row
angry tapir writes "The cadre of volunteer developers behind the Cassandra distributed database has released the latest version of its open source software. The newly added Large Row Support feature of Cassandra version 0.7 allows the database to hold up to 2 billion columns per row. Previous versions had no set upper limit on the number of columns, though the maximum amount of data that could be held in a single row was approximately 2GB. This upper limit has been eliminated."
What sorta applications need so many columns? Curious.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
This is a feature in need of an application and I can see very few applications.
ought to be enough for everybody
In soviet russia the government regulates the companies.
Well good on them for solving an interesting technical problem, but the use cases for this are all bad.
Obvious first use: boss will suggest we optimize the database by using only one gigantic row with two billion columns.
You work for Gillette, don't you.
Cassandra appears to be a multi-dimensional datastore that does not store data in the same fashion as a typical RDBMS. It uses both columns and rows to store sets of data uniquely. If you're familiar with Big Table, then, apparently, it's kinda like that.
That just means that they've added even more storage vectors to it than before...not sure why it made slashdot front page...
I'd happily pay you Tuesday for a biopsy today!
Not with column store databases such as Cassandra, HBase and BigTable.
on the fly
Like storing the contents of a web crawl. The row key is the URL, the column is the crawl timestamp, and the cell contains the page (or keywords). That's a column created on the fly. Another application off the top of my head is storing access logs, where each row is a date, each column is a person, and each cell contains a resource they accessed. Having two billion columns is hardly excessive (in theory) for a suitably-large application.
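A minimal sketch of the web crawl layout described above, using plain Python dicts as a stand-in for a wide-row store. The class name `WideRowStore` and its methods are hypothetical and are not the Cassandra API. They only illustrate the idea of one row per URL with one column per crawl timestamp, created on the fly.

```python
from collections import defaultdict


class WideRowStore:
    """Hypothetical in-memory stand-in for a wide-row store (not the Cassandra API)."""

    def __init__(self):
        # row key -> {column name -> cell value}; rows need not share columns
        self.rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        # A column springs into existence the first time it is written.
        self.rows[row_key][column] = value

    def slice(self, row_key, start, end):
        # Return only the columns in [start, end], in column order.
        row = self.rows.get(row_key, {})
        return [(c, row[c]) for c in sorted(row) if start <= c <= end]


crawls = WideRowStore()
# Row key is the URL, column is the crawl timestamp, cell is the page.
crawls.insert("http://example.com/", 1294531200, "<html>v1</html>")
crawls.insert("http://example.com/", 1294617600, "<html>v2</html>")
crawls.insert("http://example.org/", 1294531200, "<html>other</html>")

# Fetch only the recent crawls of one URL; older columns are not touched.
recent = crawls.slice("http://example.com/", 1294600000, 1294700000)
```

Each new crawl adds a column to the URL's row, so a frequently crawled page accumulates columns over time, which is where a high per-row column limit starts to matter.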
Cassandra, like BigTable and HBase, is not the same as a traditional RDBMS. It's also a column-oriented DBMS. Since each group of columns is stored separately, there's no performance impact to having extra columns. Columns that aren't needed (like old crawls in the example above) simply aren't loaded into memory. What's bad design for an RDBMS is perfect for Cassandra or HBase.
You do not have a moral or legal right to do absolutely anything you want.
I predict that bad things will come of this.
Not that anyone will believe me.
http://alternatives.rzero.com/
I know why the developers thought this would be a good idea. A feature this mental would be sure to get them free publicity on slashdot
portfolio
If you are writing SQL, maybe. Cassandra is not a relational database.
Cassandra doesn't use SQL, and isn't even like an RDBMS in any way other than "it stores a table of data", so the SQL statement would be nonexistent.
You do not have a moral or legal right to do absolutely anything you want.
Cassandra, like many of the "no sql" type databases, doesn't have classic indexes.
So instead of having an index you typically have a separate table that acts as the index.
Imagine you have a users table. One of the fields is country. Now you want to know all the users for a particular country.
In standard RDBMS-type systems you just scan each row or have an index that has done that "ahead of time" or as rows are inserted.
In Cassandra the rows of users are distributed, possibly among 100s of servers. So scanning for all users that have a particular country would require scanning all rows, which could take a long time.
Unlike in RDBMS-like systems, rows don't have a 2D structure and have no real limitation on the number of columns they can have. And columns can essentially be arrays/rows of objects.
So as you design/bang out your application you typically realize you need to know "users by country" for some stupid report. So you create a new table to hold these values. This has one row per country. As users are entered you append to this row. This essentially creates an array-like structure. You then look up the row for a particular country and you now know all the users for that particular country.
Sounds like Cassandra is getting rid of a limitation that could have caused a very large index to require multiple rows.
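The "users by country" index row described above can be sketched roughly like this. Plain dicts stand in for the two tables, and the names `users`, `users_by_country`, `add_user` and `users_in` are made up for illustration; this is not Cassandra client code.

```python
# Hypothetical stand-ins for two tables (not the Cassandra API).
users = {}             # row key: user id -> columns of that user
users_by_country = {}  # row key: country -> one column per user id


def add_user(user_id, name, country):
    # Write the user row itself.
    users[user_id] = {"name": name, "country": country}
    # Append a column to the index row for that country.
    # The column name carries the data; the value is left empty.
    users_by_country.setdefault(country, {})[user_id] = ""


def users_in(country):
    # One row lookup instead of scanning every user row on every server.
    return sorted(users_by_country.get(country, {}))


add_user("u1", "Alice", "NZ")
add_user("u2", "Bob", "US")
add_user("u3", "Carol", "NZ")

nz_users = users_in("NZ")
```

The index row for a populous country gains one column per user, so under these assumptions a per-row size or column limit caps how many users a single index row can hold.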
That we had all of this stuff 30 years ago. It was called 'network' databases, which were pretty much the standard sort of technology before RDBMS came along and everyone realized how incredibly much better relational algebra was for the vast majority of problems. As with many other things older ideas eventually resurface with new names and a few more features. There are times when this kind of facility is useful. Nothing wrong with it. The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span. The best part though is that when something turns into a giant mess guys like me can make nice money fixing the mess. lol.
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
He doesn't, otherwise it'd be uint64_t and a lather strip!
SSC
I was having trouble making a table for my new Web 3.0 m-commerce application on lesser databases:
CREATE TABLE peeps( ...
peep1_first_name VARCHAR(255),
peep1_last_name VARCHAR(255),
peep1_address VARCHAR(255),
peep1_address2 VARCHAR(255),
peep1_address3 VARCHAR(255),
peep1_creditcard VARCHAR(255),
peep1_creditcard2 VARCHAR(255),
peep1_creditcard3 VARCHAR(255),
peep2_first_name VARCHAR(255),
peep2_last_name VARCHAR(255),
peep2_address VARCHAR(255),
peep2_address2 VARCHAR(255),
peep2_address3 VARCHAR(255),
peep2_creditcard VARCHAR(255),
peep2_creditcard2 VARCHAR(255),
peep2_creditcard3 VARCHAR(255),
509 Bandwidth Limit Exceeded
I'm not a lawyer, but I play one on the Internet. Blog
So I can appreciate that this announcement sounds like News for Nerds, but can someone explain why it Matters that Cassandra can support 2 billion columns?
The article basically says "because you can't execute SQL you need lots of columns". OK, great, why would I want that? The article doesn't tell me. The Cassandra website sure doesn't tell me.
Oracle 11 supports up to 8 fucking EXABYTES of data in an RDBMS that I can execute SQL against. What Cassandra puts in columns, I put in rows.
I've scoured this thread like all the other ones on Cassandra for the killer feature, for the "you can do this with Cassandra that you can't do as well with an RDBMS" and I can't find it.
The best I can come up with is "I want to store lots of indexed data, I don't care about transactional integrity, and I don't want to pay Oracle". Is that it? That's fine if it's it, Oracle doesn't come cheap and that can be a deal breaker for new companies, but I just wish someone would spell out that this is the justification for Cassandra's existence.
- Not everyone answers every question. There is skip logic involved and there are loops, sometimes nested. These would lend themselves well to a multi-table relational approach, but the data does not come out of the data collection systems like that (most of them anyway). Would be nice to normalize, but as mentioned, there are new datasets every week, most of them having 1000s of columns. Good luck with normalizing all that before your deadline.
- Normalized data is not as easy to use in statistical applications. SPSS, the 800 pound gorilla in stats land, only supports flat data, for example.
- There are things called multiple response questions, sometimes having 100s of options, sometimes 1000s. Ergo 100s to 1000s of columns per question. "Which car models have you ever owned" + every single car model produced in the last 40 years is a good example. Of course there are alternatives such as blob fields and bit shifting, or storing only max 20 answers (first car, second car, etc) but it costs time to convert them. And these formats are also harder to use in statistical analysis, even in flat data.
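To make the multiple response point concrete, here is a small sketch of how one such question turns into one flat column per option. The option list, respondent ids and the `flatten` helper are invented for illustration and do not come from any particular survey tool.

```python
def flatten(responses, options):
    """Expand sparse per-respondent answers into flat 0/1 rows, one column per option."""
    return [
        [1 if option in ticked else 0 for option in options]
        for ticked in responses.values()
    ]


# Hypothetical question: "Which car models have you ever owned?"
options = ["golf", "civic", "mustang"]

# Sparse form: each respondent stores only the options they ticked.
responses = {
    "r1": {"golf"},
    "r2": {"civic", "mustang"},
}

# Flat form: one column per option, as flat-file stats packages expect.
flat = flatten(responses, options)
```

With three options this is trivial; with every car model of the last 40 years, the same single question becomes 100s or 1000s of columns.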
In a world where you have complete control over the provided input, and the required output, you are right. In the real world, not so much.