Slashdot Mirror


Cassandra 0.7 Can Pack 2 Billion Columns Into a Row

angry tapir writes "The cadre of volunteer developers behind the Cassandra distributed database have released the latest version of their open source software, able to hold up to 2 billion columns per row. The newly installed Large Row Support feature of Cassandra version 0.7 allows the database to hold up to 2 billion columns per row. Previous versions had no set upper limit, though the maximum amount of material that could be held in a single row was approximately 2GB. This upper limit has been eliminated."

235 comments

  1. Typical applications? by oldhack · · Score: 3, Interesting

    What sorta applications need so many columns? Curious.

    --
    Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    1. Re:Typical applications? by Brummund · · Score: 5, Funny

      Any application developed by one or more Visual Basic developers, given enough time.

    2. Re:Typical applications? by Musically_ut · · Score: 2

      What sorta applications need so many columns? Curious.

      From the article:

      An open source database capable of holding such lengthy rows could be most useful to big data cloud computing projects and large-scale Web applications, the developers behind the Apache Software Foundation project assert.

      So, basically, they don't know either but think (probably rightly so) that this is a pretty cool feature. So cool that they made this the heading of their article.

      --
      Never trust a spiritual leader who cannot dance -- Mr. Miyagi
    3. Re:Typical applications? by feedayeen · · Score: 1

      Chinese and Indian census data.

    4. Re:Typical applications? by gratuitous_arp · · Score: 5, Interesting

      Apparently the extra columns can be used to the effect of doing "more" than store data. A link in the article explains how lots of extra columns can be useful for querying data (Cassandra doesn't use SQL). http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/

      So the primary reason for this doesn't seem to be that one's run-of-the-mill database needs more columns.

    5. Re:Typical applications? by Anonymous Coward · · Score: 1

      I'm not sure, but I don't think that the summary said "hold up to 2 billion columns per row" quite enough. Having it in the headline and then repeating it in two consecutive sentences was a nice touch, but I really think they should have mentioned it a couple more times.

    6. Re:Typical applications? by SQL+Error · · Score: 3, Interesting

      The main reason was that Cassandra prior to 0.7 didn't support secondary indexes. Your keys in a table ("columnfamily" in Cassandra-speak) were indexed, and the names of the columns in a row were indexed. And Cassandra is schemaless, so the columns in one row could be completely different to the columns in another.

      So you'd use columns as sub-records to get the data structures you need.

      With 0.7 and secondary indexes, that's going to be less important.
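
      A minimal sketch of the columns-as-sub-records idea, in plain Python rather than any real Cassandra client API. The `column_family` dict, the `order:<id>:<field>` naming scheme and `slice_columns` are made up for illustration; the only things taken from the comment above are that column names within a row are indexed and that rows need not share columns.

```python
from bisect import bisect_left

# Toy model (an assumption, not Cassandra's API): a column family maps
# row key -> {column name: value}. Rows do not share a schema.
column_family = {
    "alice": {
        "email": "alice@example.com",
        "order:1001:item": "keyboard",
        "order:1001:total": "49.00",
        "order:1002:item": "mouse",
        "order:1002:total": "19.00",
    },
    "bob": {"email": "bob@example.com", "nickname": "bobby"},
}


def slice_columns(row, prefix):
    """Return the columns whose names start with prefix, in name order."""
    names = sorted(row)                 # stand-in for the ordered column index
    start = bisect_left(names, prefix)  # jump to the first candidate name
    result = {}
    for name in names[start:]:
        if not name.startswith(prefix):
            break
        result[name] = row[name]
    return result


order_1001 = slice_columns(column_family["alice"], "order:1001:")
print(order_1001)
```

      Here the "sub-record" for order 1001 is just the run of columns sharing a name prefix, so fetching it is a range lookup on column names rather than a query on values.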

    7. Re:Typical applications? by jrumney · · Score: 5, Funny

      What sorta applications need so many columns?

      Facebook needs one column for every privacy violation.

    8. Re:Typical applications? by adamofgreyskull · · Score: 0

      I'm as clueless as you, perhaps more so, but the only thing I can think of is maybe for large amounts of raw data for Bio-Informatics or from a sufficiently large experiment, e.g. particle collider?

      There's no way a sufficiently normalised data model for any normal, everyday application would require even 2,000 columns. This belief is normally tested whenever I go to TheDailyWTF though...

    9. Re:Typical applications? by jrumney · · Score: 1

      Any application developed by one or more Visual Basic developers, given enough time.

      How could that possibly be true, MS Access only supports 255 columns.

    10. Re:Typical applications? by Whiternoise · · Score: 2

      I can only think of something where you might want to input something ridiculously large like an image (or similar matrix of information with millions of points) so you could perform statistical analysis on a per-pixel basis. The pixel example would be for an image, but if you wanted to store something like, say, some parameter at a grid point and you wanted to compare those parameters between a load of different grids. It seems a very laborious way of doing things, but maybe if each point is storing a lot of data, it's easier to have a database where you can run "SELECT row1000col2000 FROM Things" (where row1000col2000 contains a blob or something) and get a long list instead of comparing a load of arrays.

      In the example of an image, you could feasibly run into hundreds of millions of columns (assuming you want to store your data in one table and not a table per comparison object and of course for some obscure reason you're storing each pixel in a field) with astronomical cameras.

      Failing that, never underestimate government and/or military databases. Heck, even someone like Google could probably find a use for a 2 billion column table.

    11. Re:Typical applications? by oldhack · · Score: 1

      Bio-informatics... OK, I see now. Hook this sucka up with some massive Perl codebase cooked up by postdocs and grad students, and you've got yourself the mother of all "hammer time".

      Oh Yeah.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    12. Re:Typical applications? by bieber · · Score: 2

      In all seriousness, I'm horrified to see the potential abuses people will come up with for this.

      "Still using MySQL? Man, you need to check out Cassandra! MySQL kept clashing with my every-user-gets-their-own-column architecture..."

    13. Re:Typical applications? by Malcolm+Chan · · Score: 1

      I know that was meant as a joke, but such data would be stored in the rows of the table, not the columns of the individual rows.

      --

      /MC

    14. Re:Typical applications? by adonoman · · Score: 1, Funny

      No no, one column for each resident, plus a column for the row header. Each row holds one item of information: Name, address, etc...
      That way, adding a new data point to keep track of is as simple as inserting a new row.

    15. Re:Typical applications? by Anonymous Coward · · Score: 0

      Facebook developed it to store per-user inverted indexes for their mailbox/feeds/whatever, among other things. Key is a user, column is a term, value is a list of documents.
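
      Roughly what that layout looks like, sketched with nested Python dicts instead of a real Cassandra client. The names `inbox_index`, `index_message` and `search` are hypothetical, and splitting on whitespace is a stand-in for whatever tokenizing the real system does.

```python
from collections import defaultdict

# Toy model (assumption): row key = user, column name = term,
# column value = list of message ids containing that term.
inbox_index = defaultdict(lambda: defaultdict(list))


def index_message(user, message_id, text):
    """Add one message to the user's row, one column per distinct term."""
    for term in set(text.lower().split()):
        inbox_index[user][term].append(message_id)


def search(user, term):
    """Look up one column in one row; other users' rows are never touched."""
    return inbox_index.get(user, {}).get(term.lower(), [])


index_message("alice", "m1", "Lunch on Friday")
index_message("alice", "m2", "Friday deploy plan")
index_message("bob", "m3", "Friday party")

print(search("alice", "friday"))
```

      A user with a big mailbox ends up with one column per distinct term, which is how a single row can grow very wide.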

    16. Re:Typical applications? by Sarten-X · · Score: 2

      I don't know if that was sarcastic or not, but given that Cassandra is column-oriented, that's pretty much right (not so much with the header, but metadata is likely). Use a column family for each region, and you can process statistics in small chunks without a ridiculously-overpowered server. Only the requested column families need to be loaded into memory for processing.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    17. Re:Typical applications? by g4b · · Score: 1

      my first applicable use would be to have one row saving a domain in the first column (or other fixed data in a fixed number of additional columns) and change the db dimensions on the fly by adding multiple columns serially saving information like access time and ip address.
      that would mean i can search by row to get the domain and log accesses easy

      i would just try that and look if this is speedier than having two tables saving by row.

      also comes in handy to add a column for each new user and a row for each new thread and save the views there

      all depends on how fast database changes are in a nosql db, but i think pretty fast.

    18. Re:Typical applications? by NNKK · · Score: 2

      Cassandra doesn't have "tables", and Cassandra's rows and columns have nothing to do with the rows and columns you're used to in SQL databases. Until you understand this, you will continue to be confused.

      The "name" of a column is an arbitrary key -- you could have a row with a bunch of columns named things like "Country", or "Username", but you could also have columns named "jsmith", "jdoe", "12345", "USA", "Canada", etc., and you don't have to pre-define the column names.

    19. Re:Typical applications? by Anonymous Coward · · Score: 1

      Cassandra doesn't have "tables", and Cassandra's rows and columns have nothing to do with the rows and columns you're used to in SQL databases. Until you understand this, you will continue to be confused.

      Then for the love of Pete, don't call them "columns and rows". You'll just confuse the hell out of us.

    20. Re:Typical applications? by RobertM1968 · · Score: 3, Funny

      Any application developed by one or more Visual Basic developers, given enough time.

      How could that possibly be true, MS Access only supports 255 columns.

      And now you understand why Cassandra is so important! :-)

    21. Re:Typical applications? by RobertM1968 · · Score: 1

      In all seriousness, I'm horrified to see the potential abuses people will come up with for this.

      "Still using MySQL? Man, you need to check out Cassandra! MySQL kept clashing with my every-user-gets-their-own-column architecture..."

      Wow, that is sloppy. I give each of my users their own table.

    22. Re:Typical applications? by RobertM1968 · · Score: 2, Funny

      Apparently the extra columns can be used to the effect of doing "more" than store data. A link in the article...

      Not sure what that last word means....

    23. Re:Typical applications? by TooMuchToDo · · Score: 1

      Nope. We use flat files for storing collider data from the LHC.

    24. Re:Typical applications? by Daniel+Dvorkin · · Score: 1

      Speaking as a bioinformatician who does a lot of DB work (the only one in the lab who has professional DBA experience ...), I'll be the first to say that I can't see myself storing data this way. I'd be willing to be convinced, but as it stands, I don't see any use for this. IMO, YMMV, etc.

      --
      The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
    25. Re:Typical applications? by goombah99 · · Score: 1

      Who the hell cares. I mean whup tee doo. So someone has a larger address space. Like wow. For all 12 people with such a bad design that they need 12 billion columns, I'm sure they already figured out how to have keyed indexes. Why is this on slashdot?

      --
      Some drink at the fountain of knowledge. Others just gargle.
    26. Re:Typical applications? by AuMatar · · Score: 1

      Wow. I'm trying to think of any idea that could possibly beg for more bugs and bad design decisions than a feature like that. And I'm not coming up with anything.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    27. Re:Typical applications? by DavidTC · · Score: 2, Interesting

      Wow, it's almost like you've invented databases, but rotated 90 degrees so that every single existing programming paradigm fails and you have to invent new ones to loop through columns.

      Instead of what every other database does, load the rows you want, and just those rows. With nicely named headers that get used to label the parts of each row. Oh, and types that vary per column.

      And indexes on columns...wait, let me guess, you can now index rows...although that can't actually work, programmatically, because the columns aren't stored next to each other, so locating a value in a specific row can't tell how to retrieve that entire column..WAIT!

      Did this just exchange the meaning of rows and columns in some sort of mindfuck, but left everything the same?

      This is making more and more sense for Bizarro, but not really for anyone else.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    28. Re:Typical applications? by Anonymous Coward · · Score: 1

      The point of this change is the switch between having no limit on the number of columns and a fixed data size, to a situation where we have any amount of data being stored. The downside is a fixed number of columns, but at 2B columns it's not really a limit anyone can reach.

      Good work guys.

    29. Re:Typical applications? by gargleblast · · Score: 1

      Let's turn that question around. What applications need limits such as 255 or 1024? Normalized database designs can hit such limits, and working around them (by spreading the relation over multiple tables, for example) is unpleasant.

    30. Re:Typical applications? by oldhack · · Score: 1

      I've been in the business for more than two decades, and I have never ever encountered a situation where I need 256(!) columns. True, I have worked mostly in tech/business sectors, and that's why I asked the question: what sorta application need so many columns.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    31. Re:Typical applications? by DavidTC · · Score: 0

      Cassandra doesn't have "tables", and Cassandra's rows and columns have nothing to do with the rows and columns you're used to in SQL databases. Until the Cassandra people fix their stupid nomenclature, you will continue to be confused.

      Fixed that for you.

      As an aside what sort of logic led to calling them 'columns' and 'rows' when they don't exist in tables? Do they not grasp that the entire concept of tables is, to quote the first sentence in Wikipedia, 'a means of arranging data in rows and columns.', and that if you don't have 'tables', you don't have 'columns' and 'rows'? That is what a table is.

      Oh, wait, they're NoSQL people. They should be classified over with the functional programmers and the microkernel guys in the 'Everything computer programming does that actually works in the real world is wrong' group.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    32. Re:Typical applications? by red_blue_yellow · · Score: 3, Informative

      Columns in Cassandra aren't analogous to columns in an RDBMS. Every row is basically a list of (key, value) pairs. This is referred to as a column, with the key being the column name. There's no requirement that rows have the same set of column names.

      Typically large rows are used for indexes or timelines. In a timeline example, you might use a timestamp for every column name and store the entry as the column value. Cassandra keeps the row sorted by column name, so all of the entries in the row (timeline) will be in chronological order.

      In the case of indexes, you may use one row for every indexed value (say, one row for all users from Utah, one for all from Texas, etc). Here, each column would store the row key (primary key) of a row in another column family (table) that matches that indexed value; in this case, every column might hold a userId.
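
      Both patterns can be sketched in a few lines of plain Python. This is a toy model, not a Cassandra client: `add_column`, `timeline_rows` and `users_by_state` are invented names, and putting the userId in the column name with an empty value is one assumed way to lay out the index row.

```python
from bisect import insort

# Toy model (assumption, not a Cassandra client): each row is a list of
# (column name, value) pairs kept sorted by column name.
timeline_rows = {}   # row key -> sorted list of (timestamp, entry)
users_by_state = {}  # row key = state -> sorted list of (user_id, "")


def add_column(table, row_key, name, value):
    """Insert a column so the row stays ordered by column name."""
    insort(table.setdefault(row_key, []), (name, value))


# Timeline: column name is a timestamp, column value is the entry.
add_column(timeline_rows, "alice", 1294700300, "third post")
add_column(timeline_rows, "alice", 1294700100, "first post")
add_column(timeline_rows, "alice", 1294700200, "second post")

# Index: one row per indexed value, one column per matching user id.
add_column(users_by_state, "Utah", "user42", "")
add_column(users_by_state, "Texas", "user7", "")
add_column(users_by_state, "Utah", "user13", "")

entries = [value for _, value in timeline_rows["alice"]]
utah_users = [name for name, _ in users_by_state["Utah"]]
print(entries)
print(utah_users)
```

      The entries were written out of order but read back chronologically, because ordering by column name does the sorting at write time.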

      --
      A neutral communications medium is essential. It is the basis of science, by which humankind should decide what is true.
    33. Re:Typical applications? by AHuxley · · Score: 1

      312000000 by ~50 states/extraterritorial jurisdiction/territories with a code from each Fusion center. Add in extra space for other 3 letter agencies, faith/militia/gang/vet details, no fly list flagged ... gets big fast.

      --
      Domestic spying is now "Benign Information Gathering"
    34. Re:Typical applications? by Lehk228 · · Score: 1

      each user gets a system login with a sqlite db in their home directory that holds their account information, posts are appended to "static" HTML files representing each thread, user db's include hyperlinks to each post to view posts by a particular user

      --
      Snowden and Manning are heroes.
    35. Re:Typical applications? by WuphonsReach · · Score: 1

      What sorta applications need so many columns? Curious.

      Sample collection data where you are collecting a few hundred individual loosely related (and often completely unrelated other than the sample number) attributes per sample. For the most part, due to a lot of databases having a 255 column limit, this means you have to have multiple data tables. Which may or may not be a problem depending on how you need to report the data.

      --
      Wolde you bothe eate your cake, and have your cake?
    36. Re:Typical applications? by oldhack · · Score: 1

      "(the only one in the lab who has professional DBA experience ...)"

      See, imagine all other teams without a person like you...

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    37. Re:Typical applications? by WuphonsReach · · Score: 2

      I've been in the business for more than two decades, and I have never ever encountered a situation where I need 256(!) columns. True, I have worked mostly in tech/business sectors, and that's why I asked the question: what sorta application need so many columns.

      Data collection where you are reporting across samples (averages, means, group by) but where you are collecting dozens or hundreds of generally unrelated attributes for each sample. Some attributes might be related, but only loosely, other attributes are completely distinct from the other attributes. Because all of the attributes need to map back to the sample ID, there's no point in creating lots of different tables unless you are forced to due to database constraints. Plus the users want to be able to scan across the columns and down rows to spot patterns, so storing as "sampleID-attributekey-value" triplets means you have to do a lot more work in the presentation layer, converting it back to a "sample-attr1-attr2-attr3" style.
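
      To make the "more work in the presentation layer" concrete, here is a small sketch of the conversion from triplets back to one wide row per sample. The sample ids, attribute names and the `pivot` helper are hypothetical.

```python
# Hypothetical sample data in "sampleID-attributekey-value" triplet form.
triplets = [
    ("S1", "ph", 7.2),
    ("S1", "temp_c", 21.5),
    ("S2", "ph", 6.8),
    ("S2", "turbidity", 3.1),
]


def pivot(rows):
    """Turn (sample, attribute, value) triplets into one wide row per sample."""
    attributes = sorted({attr for _, attr, _ in rows})
    wide = {}
    for sample, attr, value in rows:
        # Start every sample with all attributes missing (None).
        wide.setdefault(sample, dict.fromkeys(attributes))[attr] = value
    return attributes, wide


header, table = pivot(triplets)
print(["sample"] + header)
for sample in sorted(table):
    print([sample] + [table[sample][attr] for attr in header])
```

      With a few hundred attributes per sample this step runs on every report, which is the cost a wide table avoids.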

      --
      Wolde you bothe eate your cake, and have your cake?
    38. Re:Typical applications? by mini+me · · Score: 3, Informative

      Cassandra did not support said indexes until this very release. Even with secondary indexes, storing data in columns is still a reasonable design choice for many requirements. A column in Cassandra is not like a column in a relational database.

      I am sure that this is welcome news for big Cassandra users, but I do agree that it is a strange choice for the front page of Slashdot. Then again, with the number of comments asking why you would need so many columns, it seems that Slashdot needs to talk about Cassandra a little more.

    39. Re:Typical applications? by Mr+Z · · Score: 2

      In perl-speak, a Cassandra table sounds suspiciously like a nested hash if Cassandra's rows and columns are unsorted, or an array of array of key-value pairs if they are sorted.

      And if I understood the brief description of the use model from the article someone else linked, it sounds like you make a new table (columnfamily?) for each of the different criteria you might query against. The index for that table would be the parameterized bits of that query, and the other columns represent all the data that would match that particular query. Their example showed indexing employees by DOB.

      If I had a more complex query (say "all employees of a particular gender and date of birth"), then the index column would contain both details, so there'd be a row for "Feb 31st, 2013, Male," "Feb 31st, 2013, Female," etc. Am I understanding this correctly?

      Sounds like it'd be total crap (read: next to, if not completely impossible to use for) fully ad hoc queries whose structure isn't known beforehand. But, if you were building up something that had fairly static parameterized query structures, then it'd work out pretty well. The main thing to remember is that it transforms a pull-oriented SQL-style datamining operation into a more push-oriented "sort it as it arrives" structure.
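
      A sketch of that "sort it as it arrives" model, assuming the composite-key reading above is right. It is plain Python, not Cassandra's API; `employees_by_dob_gender`, `insert_employee` and `find_employees` are made-up names.

```python
from collections import defaultdict

# Toy model (assumption): one "column family" per query shape.
# Row key = the query parameters, columns = employee id -> employee name.
employees_by_dob_gender = defaultdict(dict)


def insert_employee(emp_id, name, dob, gender):
    """Write path: file the employee under the key the query will use."""
    employees_by_dob_gender[(dob, gender)][emp_id] = name


def find_employees(dob, gender):
    """Read path: a single row lookup, no filtering at query time."""
    return dict(employees_by_dob_gender.get((dob, gender), {}))


insert_employee("e1", "Ann", "1980-02-14", "F")
insert_employee("e2", "Raj", "1980-02-14", "M")
insert_employee("e3", "Mia", "1980-02-14", "F")

print(find_employees("1980-02-14", "F"))
```

      The trade-off shows up immediately: a query by gender alone has no row to read, so it would need either another column family or a full scan.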

    40. Re:Typical applications? by NFN_NLN · · Score: 3, Informative

      Any application developed by one or more Visual Basic developers, given enough time.

      How could that possibly be true, MS Access only supports 255 columns.

      And now you understand why Cassandra is so important! :-)

      In all seriousness I had no idea what Cassandra was or what made it unique as a database. However, I did find this tutorial that others might also find useful:

      http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model

    41. Re:Typical applications? by The_Wilschon · · Score: 1

      I don't know if I would describe ROOT's abominable TFiles as "flat files". I'm not even sure I would describe them as "files"... Certainly TTrees are extremely like an object database, except poorly designed, described, and implemented.

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    42. Re:Typical applications? by davester666 · · Score: 1

      Nailed it. There is currently a limit of 8192 rows...

      --
      Sleep your way to a whiter smile...date a dentist!
    43. Re:Typical applications? by Anonymous Coward · · Score: 0

      It is for a concept known as "wide" tables. It basically transposes the structure you are familiar with now.

    44. Re:Typical applications? by Sarten-X · · Score: 4, Informative

      Welcome to the first five minutes of using a column store. Screwy, ain't it?

      My understanding is that rows' contents are indexed such that they may be retrieved quickly. Think of a row name as a primary key. It's easy to get the whole row when you know its name. Continuing the census application, it'd be like asking for all the birth years of everyone in a geographical region. The requested column family (geographical region) is opened, and each column (person) is quickly checked for the particular row's contents (in case the birth year wasn't provided). Partitioning is done by both row and column family, so only some of the column family's data is actually scanned. That's where the cluster provides a very nice speedup, as well.

      locating a value in a specific row can't tell how to retrieve that entire column

      Now, I'm not sure if I understand your rage-induced rambling correctly, but if you're trying to make a SQL example, you're starting from the wrong premise, which explains why you're having trouble making sense of it all.

      Quick review: The "R" in "RDBMS" stands for "relational", referring to a n-ary relation. SQL is intended to manipulate those relations, isolating the data you want to extract. Something that is not described as an RDBMS should not be expected to have relations.

      Cassandra functions (from the application perspective) as a key-value store, with no relation structure. That means you don't work with sets, and you don't need to think about set operations. Pull out a row, and you get a list of columns with defined values, as well as those values. Iterate through each value looking for whatever value you're looking for. When you find it, you already have the column name. Just ask for the whole column next. Since the whole thing is running in a cluster, you can parallelize the iterations (I think... I've used HBase, but not Cassandra personally) to speed up the scan.

      If that's not fast enough for you (which is likely), you can use Hadoop's MapReduce framework to scan each cell and create an index, possibly laid over the other table as just more rows & columns (though a different table would be better, from a sanity perspective). Since there's no mandatory structure, that's legit.
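
      The scan-every-cell-and-build-an-index step might look something like this. It is plain Python that only mimics the map and reduce phases, not Hadoop's actual API, and it runs serially where a cluster would run the map step in parallel. The `people` table and both function names are invented for the example.

```python
from collections import defaultdict

# Toy table (assumption): row key -> {column name: value}.
people = {
    "p1": {"name": "Ann", "birth_year": 1980},
    "p2": {"name": "Raj", "birth_year": 1975},
    "p3": {"name": "Mia", "birth_year": 1980},
}


def map_cells(row_key, columns):
    """Map step: emit ((column name, value), row key) for every cell."""
    for name, value in columns.items():
        yield (name, value), row_key


def reduce_index(pairs):
    """Reduce step: group the row keys under each (column, value) key."""
    index = defaultdict(list)
    for key, row_key in pairs:
        index[key].append(row_key)
    return {key: sorted(rows) for key, rows in index.items()}


pairs = [pair for key, cols in people.items() for pair in map_cells(key, cols)]
index_table = reduce_index(pairs)
print(index_table[("birth_year", 1980)])
```

      The resulting `index_table` is just more rows and columns, so it can be stored alongside the data it indexes or in a separate table.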

      Of course, that's only valid for this particular census application, which assumes that the only reason for the database is either basic statistics or something complex enough for a MapReduce program.

      It's entirely possible to run Cassandra arranged similar to a normal RDBMS. Use only a few column families with very specific columns (such as a single family for all the "Name, address, etc."). Throw in a bunch of index families, updated with MapReduce. Then, your processing can be a complex MapReduce job, iterating over each row with a particular set of rows meeting all your needed criteria. It'd be just like a normal RDBMS, except you have better scalability, and maintain indexes yourself.

      If the trouble of indexing is too much for you, you can follow Google's route with Colossus, which runs MapReduce-like tasks when rows are changed. That's your dynamic indexing.

      Here's some links to help your understanding:

      --
      You do not have a moral or legal right to do absolutely anything you want.
    45. Re:Typical applications? by Sarten-X · · Score: 3, Informative

      Close. It's more of a hash table of a sorted hash table... Columns are unsorted, but rows are (I think... I've only used HBase personally).

      If you know what you'll be looking for ahead of time, you can make your life easy with a write-heavy system. What's missing in standard Cassandra is a way to run ad-hoc queries. My understanding is that Cassandra can now run with Hadoop's MapReduce framework. Any query or computation can be run against the Cassandra table in a widely-distributed fashion as a MapReduce job. It's not as fast as an SQL query on an indexed column, but far better than a query on an unindexed one, because everything runs in parallel across the cluster.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    46. Re:Typical applications? by mwvdlee · · Score: 1

      Now what good is the CREATE DATABASE statement if you're only gonna work with tables?

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    47. Re:Typical applications? by Hognoxious · · Score: 2

      Obviously you're trapped in a relational mindset.

      That makes two of us.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    48. Re:Typical applications? by Undead+Waffle · · Score: 1

      "(the only one in the lab who has professional DBA experience ...)"

      See, imagine all other teams without a person like you...

      So your argument is you should store data this way if you have no idea what you're doing?

      I'm not bashing Cassandra here or anything (never used nosql), but I've seen this argument used enough times that I just have to point out how stupid it is.

    49. Re:Typical applications? by Jugalator · · Score: 1

      One of these days, I am going to learn how that can be so efficient and why NoSQL databases are all the rage.

      --
      Beware: In C++, your friends can see your privates!
    50. Re:Typical applications? by AlXtreme · · Score: 3, Insightful

      Dear $DEITY, the number of times I've seen (mostly) PHP crapplications use CREATE DATABASE and CREATE / ALTER TABLE, often with ingenious naming schemes, instead of simply inserting new rows. Certain people shouldn't be allowed to touch databases.

      If anyone needs me I'll be sobbing over my coffee.

      --
      This sig is intentionally left blank
    51. Re:Typical applications? by Anonymous Coward · · Score: 0

      I read it and I'm now more confused than I was before I started.

    52. Re:Typical applications? by Noughmad · · Score: 2

      Each user gets a single wooden table with a plastic folder containing a paper file.

      --
      PlusFive Slashdot reader for Android. Can post comments.
    53. Re:Typical applications? by he-sk · · Score: 1

      Column-oriented databases (also called column stores) have been around for more than a decade and there is loads of research going on in the academic community. Their main application is in-memory-based databases which present different challenges to optimize than traditional disk-based databases. Google MonetDB or X100 for more information. These systems support SQL as the query language, so most if not all of your database knowledge is applicable to them.

      I've also seen data storage systems that basically implement a proprietary column store going back 20 years.

      --
      Free Manning, jail Obama.
    54. Re:Typical applications? by jellomizer · · Score: 3, Interesting

      I don't think it is a good idea to propose limitation just to stop bad coding practices.

      For 1 the limitations rarely encourage good ones they only make them worse. Eg 254 columns with the 255th pointing to the tablename2 with more data.

      Second by preventing people from doing something stupid they also prevent them from doing something ingenious.

      Third there may be a good reason to do this as well.

      Fourth you make it big enough so you won't need to make it bigger

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    55. Re:Typical applications? by bjourne · · Score: 4, Informative

      Maybe Cassandra should have chosen some other terminology for their database that doesn't so obviously conflict with already existing terms. A column in Cassandra is a tuple which in an RDBMS is a row. Confusion all around.

    56. Re:Typical applications? by arth1 · · Score: 1

      Quick review: The "R" in "RDBMS" stands for "relational", referring to a n-ary relation.

      This is true.

      For N currently not exceeding "bin" (you'd have to invent a new API for a relational database with rows, columns and slices. Else you'd just use a 2-dimensional rows/columns approach like the rest of us, and multiple tables. Which is arguably why the relational functionality comes in handy -- if you could do true N dimensionality, there would be less use for relations, as you'd store the data where it would be fetched.)

      Oh, and except for when the R stands for Rigid, like in mainframe databases.

    57. Re:Typical applications? by camperdave · · Score: 1

      I don't think it is a good idea to propose limitation just to stop bad coding practices.

      For 1 the limitations rarely encourage good ones they only make them worse.

      Good point.

      Second by preventing people from doing something stupid they also prevent them from doing something ingenious.

      Another good point.

      Third there may be a good reason to do this as well.

      Still waiting for a good reason.

      Fourth you make it big enough so you won't need to make it bigger

      "640K ought to be enough for anybody."

      --
      When our name is on the back of your car, we're behind you all the way!
    58. Re:Typical applications? by Sarten-X · · Score: 1
      --
      You do not have a moral or legal right to do absolutely anything you want.
    59. Re:Typical applications? by camperdave · · Score: 1

      The column is the lowest/smallest increment of data. It’s a tuple (triplet) that contains a name, a value and a timestamp.

      Here’s a column represented in JSON-ish notation:

      { // this is a column
      name: "emailAddress",
      value: "arin@example.com",
      timestamp: 123456789
      }

      That’s all it is. For simplicity sake let’s ignore the timestamp. Just think of it as a name/value pair.

      Ah... So a column is a field/value pair; a record, in other words.

      --
      When our name is on the back of your car, we're behind you all the way!
    60. Re:Typical applications? by ferar · · Score: 0

      Data Mining Applications usually have databases with hundreds or thousands of columns. Imagine that you need to derive a rule based on as many input variables as possible. It is very easy to generate many columns. For example, take a Telco Churn Model: for each phone number you will have # of calls in the last month, # of calls in the month before, etc., then the same for call duration, max, min, ratios between months, you name it. If you are using, for example, just 20 vars and one year of history, you can have 240 raw vars. Then you have ratios between them; if you take 4 months ratios you have 6 new vars for each original var, add 72, and so on.
      Never seen more than 5 thousand columns because of technical limitations, but I'm sure that if you give a DM Expert the possibility to have 2 billion columns they will fill them all.

    61. Re:Typical applications? by ultranova · · Score: 2

      What sorta applications need so many columns? Curious.

      Judging by the name, a pretty incredible one.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    62. Re:Typical applications? by DavidTC · · Score: 1

      I know what column-oriented databases are, I had a sarcasm failure. I hadn't realized so many people lower in this discussion actually didn't know how NoSQL stuff worked, or I wouldn't have satirized the whole thing by pretending not to know.

      And you realize what he described makes no sense to use a column-oriented database for, right? He's basically taken something that should be in a perfectly normal RDBMS and made it twice as complicated. He's described a one dimensional array of data, and decided to access it sideways. Using something utterly unsuited for that.

      Which is, you know, what a lot of NoSQL stuff is used for: stuff that works perfectly fine in an RDBMS.

      Neither does what you describe. If your database can't handle enough data to return a column consisting of an intersection of a single year and a single region, you either need to stop using sqlite or learn how to make indexes. Considering there are only 7 billion people on the earth, it's hard to see how census data could possibly contain enough data to make an RDBMS impractical.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    63. Re:Typical applications? by siride · · Score: 1

      You don't need extra columns for those things, you need additional tables and proper normalization. Unless there are performance problems with that, I can't see why putting those kinds of columns in one table would make sense.

    64. Re:Typical applications? by DavidTC · · Score: 1

      You actually hit the nail on the head as to why some people use NoSQL stuff...they don't really understand how to use a database to do 3D stuff, or hierarchical stuff. (Hint: you have an id column and a parent_id column, and recurse. I promise you that you're allowed to do enough queries.)
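
      A minimal sketch of the id/parent_id hint, using Python's built-in sqlite3 module. The table name `node`, its columns, and the sample tree are hypothetical; the point is only that recursing with one small, indexed query per node is a workable way to walk a hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
# Index the parent pointer so each "children of X" query is a cheap lookup.
conn.execute("CREATE INDEX node_parent ON node (parent_id)")
conn.executemany(
    "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"), (5, 4, "a1x")],
)


def descendants(conn, node_id):
    """Return the names of all descendants, depth first, one query per node."""
    names = []
    rows = conn.execute(
        "SELECT id, name FROM node WHERE parent_id = ? ORDER BY id", (node_id,)
    ).fetchall()
    for child_id, name in rows:
        names.append(name)
        names.extend(descendants(conn, child_id))
    return names


print(descendants(conn, 1))
```

      Each call fetches only the direct children of one node, so the result sets stay small even though the number of queries grows with the size of the subtree.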

      And the easiest way to do a 'third dimension' is just to have a field specifying 'in what dimension' a record is, obviously with an index, and filtering on it. Using multiple tables would get somewhat unwieldy, but I can see that if the dataset is huge and you'd never need to get records from multiple dimensions at once.

      Of course, I'm not sure why people move to NoSQL, because you can't do hierarchical stuff or 3 dimensional stuff or anything like that in NoSQL either.

      The problem is that a lot of people describing how NoSQL stuff works seem to have no idea how RDBMSes work to start with. Whereas I, OTOH, only have the vaguest idea of how column-oriented databases work, but I do know what is possible in RDBMSes, and know most of the NoSQL examples that people give are trivially easy in an RDBMS. (Even the ones I think of.)

      --
      If corporations are people, aren't stockholders guilty of slavery?
    65. Re:Typical applications? by Anonymous Coward · · Score: 0

      See in my new database the row is the table and the column is the row, it's an entirely new way of interacting with data. I call it the visual cloud database system and made it specially for visual basic programmers.

    66. Re:Typical applications? by DavidTC · · Score: 1

      Column-oriented databases are the first databases.

      Much very old code still uses something like that: each column lives in its own file, and each field is written directly at offset record_number*file_multiplier in that column's file.

      This allows a lot of neat tricks. You need almost no file locking, assuming that writing a range of bytes in a file is atomic, which is a reasonably safe assumption. (You still have to lock on adding records.) You can backup easily, or fix errors, although it's very hard to have consistency errors anyway.
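
      The file-per-column scheme described above can be sketched roughly as follows. The `ColumnFile` class, the file names, and the field widths are all made up for illustration, and `width` plays the role of the "file_multiplier"; no locking is attempted here.

```python
import os
import tempfile


class ColumnFile:
    """One fixed-width column stored in its own file (illustrative only)."""

    def __init__(self, path, width):
        self.path = path
        self.width = width
        # Create the file if it does not exist yet.
        open(path, "ab").close()

    def write(self, record_number, value):
        # Pad or truncate to the fixed width, then overwrite in place.
        data = value.encode("ascii")[: self.width].ljust(self.width, b" ")
        with open(self.path, "r+b") as f:
            f.seek(record_number * self.width)
            f.write(data)

    def read(self, record_number):
        with open(self.path, "rb") as f:
            f.seek(record_number * self.width)
            return f.read(self.width).rstrip(b" \x00").decode("ascii")


workdir = tempfile.mkdtemp()
names = ColumnFile(os.path.join(workdir, "name.col"), width=16)
cities = ColumnFile(os.path.join(workdir, "city.col"), width=12)

names.write(0, "alice")
cities.write(0, "Oslo")
names.write(1, "bob")
cities.write(1, "Lima")
names.write(0, "alicia")  # update one field without touching any other column

print(names.read(0), cities.read(0))
```

      Adding a new column is just a new file, which is the flexibility the single-file layout lacked.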

      It was the best method of the time. Certainly better than just one file, which made adding new columns utterly impossible. And you can have additional non-fixed-record-size files if you really need them, but only use them when you need them.

      And then, tada, we invented RDBMSes. Which work better than that in 99.99999% of the cases.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    67. Re:Typical applications? by HTH+NE1 · · Score: 1

      Fourth you make it big enough so you won't need to make it bigger

      "640K ought to be enough for anybody."

      Reasonable Limits Aren't.

      --
      Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
    68. Re:Typical applications? by Anonymous Coward · · Score: 0

      Hey man... don't put your, like, totally bourgeois terminology on us. We're starting a fuckin' revolution here.

    69. Re:Typical applications? by Anonymous Coward · · Score: 0

      Hey man... don't push your, like, totally bourgeois terminology on us. We're starting a fuckin' revolution here.

    70. Re:Typical applications? by EsbenMoseHansen · · Score: 1

      You need to imagine your DB needs to be so big that storing it on a dozen computers is not enough and imagine this DB is updated as well as read. From this follows that you need more or less transparent redundancy, since some of those servers will be failing at a given time. Oh, and you need to be able to add new servers as needed. And do it on cheapish hardware.

      It is a scenario outside of my experience, and, I would guess, outside yours.

      --
      Religion is regarded by the common people as true, by the wise as false, and by rulers as useful.
    71. Re:Typical applications? by Anonymous Coward · · Score: 0

      Here's an analogy. If a relational database is like C++, then Cassandra is like Python.

      In an RDBMS the schema of a table is like a struct definition and a bare relation is like an array of that struct. When you index that table, it's like using STL collections.

      But with Cassandra, which is not a RDBMS, a row is like an object in Python and a column is an attribute of that object. You can add attributes arbitrarily as desired to any object at any time.
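
      The analogy can be made concrete in Python itself. This is only a sketch of the analogy, not of any Cassandra API: `FixedRow` stands in for a schema-bound table row, and a `SimpleNamespace` stands in for a row that accepts new columns at any time. All names are invented for the example.

```python
from types import SimpleNamespace


class FixedRow:
    """Struct-like row: the set of fields is fixed up front, like a schema."""

    __slots__ = ("user_id", "email")

    def __init__(self, user_id, email):
        self.user_id = user_id
        self.email = email


fixed = FixedRow(1, "arin@example.com")
try:
    fixed.phone = "555-0100"  # not declared in the "schema"
    schema_allows_new_field = True
except AttributeError:
    schema_allows_new_field = False

# Python-object-like row: attributes (columns) can be added whenever needed.
flexible = SimpleNamespace(user_id=1, email="arin@example.com")
flexible.phone = "555-0100"

print(schema_allows_new_field, sorted(vars(flexible)))
```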

    72. Re:Typical applications? by Hognoxious · · Score: 2

      Doesn't a name/value pair act more like a field? A record would be several of those that are related to each other. Or maybe not, who knows?

      But in any case, borrowing a term that's already in established usage to mean something superficially similar but significantly different is just fucking retarded. Whoever made that decision ought to be dragged behind a moderately slow horse.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    73. Re:Typical applications? by Anonymous Coward · · Score: 0

      Yes.

      Columns in Cassandra are like rows in traditional databases. On the other hand, there is a 255 row limit for Cassandra databases.

    74. Re:Typical applications? by hvm2hvm · · Score: 1

      Yeah, when you say it like that, it actually sounds retarded. I may have known RDBMSes for too long, but a row is a more logical term for the tuple of data, and column the logical one for naming attributes. Unless of course your tables are sideways... But who does that? Not even Arabs or Japanese are that crazy.

      --
      ics
    75. Re:Typical applications? by hvm2hvm · · Score: 1

      Before anyone goes all political correctness on my ass, I'm referring of course to the fact that Arabs write from right to left and Japanese from top to bottom, as opposed to westerners, who write left to right.

      --
      ics
    76. Re:Typical applications? by hvm2hvm · · Score: 1

      "Cassandra is schemaless": I read that as "Cassandra is shemaleless"...
      I think I need to look at some sites with cute cats or something like that.

      --
      ics
    77. Re:Typical applications? by QuantumBeep · · Score: 1

      Seems like a reasonable place to mention the Zero One Infinity rule.

    78. Re:Typical applications? by butlerm · · Score: 2

      Welcome to the first five minutes of using a column store.

      Calling Cassandra a "column store" or "column oriented database" is an abuse of the language. Real column oriented databases store "columns" of data in a linear sequential manner, so that they can be scanned in the fastest manner possible.

      Cassandra isn't like that - it stores denormalized rows with repeating groups in a free form manner, not "columns" at all. If it were a real column oriented database it would be completely unusable for most online web and OLTP applications. Real column oriented databases are designed for OLAP and numerical-statistical analysis; getting a single "row", "record", "record group", or even "document" at a time is the opposite of what they are designed for.

      At best, you could call Cassandra a "cell" or "field" oriented database. But a "column oriented" database it is not. It is practically the opposite of a column oriented database.
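
      The distinction can be illustrated with plain Python containers. This is a toy model, not how any real engine lays out storage, and the field names and values are invented: a column-oriented layout keeps each column's values together so one column can be scanned in sequence, at the cost of reassembling single records from several places.

```python
# Row-oriented layout: each record's fields sit together.
row_store = [
    {"id": 1, "region": "north", "birth_year": 1970},
    {"id": 2, "region": "south", "birth_year": 1985},
    {"id": 3, "region": "north", "birth_year": 1992},
]

# Column-oriented layout: each column's values sit together, in record order.
column_store = {
    "id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "birth_year": [1970, 1985, 1992],
}


def average_birth_year_rows(rows):
    # Has to visit every whole record to pick out one field.
    return sum(row["birth_year"] for row in rows) / len(rows)


def average_birth_year_columns(columns):
    # Scans one sequential list and never touches the other columns.
    values = columns["birth_year"]
    return sum(values) / len(values)


def fetch_record_columns(columns, position):
    # Rebuilding a single record needs one lookup per column.
    return {name: values[position] for name, values in columns.items()}


print(average_birth_year_columns(column_store))
print(fetch_record_columns(column_store, 1))
```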

    79. Re:Typical applications? by lonecrow · · Score: 1

      And don't forget one of the most performant methods for handling trees in SQL...nested sets.

      http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
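
      A rough sketch of the nested set idea using sqlite3. The `category` table, its `lft`/`rgt` columns, and the sample tree are assumptions made for this example rather than something taken from the linked article. Every node stores a left and right number, and a node's subtree is exactly the set of rows whose `lft` falls between that node's `lft` and `rgt`, so a whole subtree comes back from a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO category (name, lft, rgt) VALUES (?, ?, ?)",
    [
        ("electronics", 1, 10),
        ("televisions", 2, 5),
        ("lcd", 3, 4),
        ("portable", 6, 9),
        ("mp3", 7, 8),
    ],
)


def subtree(conn, root_name):
    """Return the root and all of its descendants with one self-join."""
    rows = conn.execute(
        "SELECT child.name "
        "FROM category AS parent "
        "JOIN category AS child ON child.lft BETWEEN parent.lft AND parent.rgt "
        "WHERE parent.name = ? "
        "ORDER BY child.lft",
        (root_name,),
    )
    return [name for (name,) in rows]


print(subtree(conn, "portable"))
```

      Reads are cheap with this layout; the trade-off is that inserting a node means renumbering the `lft`/`rgt` values to its right.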

    80. Re:Typical applications? by dave87656 · · Score: 1

      That was perfect. ROTFLMAO.

    81. Re:Typical applications? by Sarten-X · · Score: 1

      It's a slight abuse, yes. My understanding is that the table is split first by column family, then by partition as needed. That makes it more a column store than anything else I can think of offhand.

      Regardless of the slaughter of language, I think the point's still clear: Cassandra is a different beast from a traditional database.

      That's not to say I actually disagree. It's very likely that I'm misunderstanding how Cassandra in particular works. I worked through the BigTable paper and I've used HBase extensively. My personal experience with Cassandra is setting up a small system and running enough of a test to see that it wasn't suitable for my needs. Any further knowledge is appreciated.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    82. Re:Typical applications? by DavidTC · · Score: 1

      'I promise you that you're allowed to do enough queries.'

      That's sorta what I meant by that.

      There are two classic mistakes of SQL programmers:

      1) Not having the server do any filtering. You need four columns from a single row? Just SELECT * FROM table, and then filter yourself. Do that SELECT for all ten records you need.

      If at any point, you're getting more rows than you need, you're almost certainly doing it wrong. If you don't know how to make SQL filter that way, then look it up, no one expects you to remember all that. (Feel free to have some smaller extra fields, though, if you think you might need them. Nothing is more annoying than constantly editing in and out fields of a query as you use them later or not. That sort of optimization goes at the end.)

      This way, obviously, causes insanely poor performance, and people cautioning against it leads to:

      2) People afraid to do any query. One, maybe two, and that's it. Under worse circumstances this results in it functionally not being SQL, because they query each entire table once, and stick it in an array somewhere. They'd never even consider doing a sub-query. (OTOH, half the people seem to think JOINs, inexplicably, are free. I don't understand that logic.)

      No. You really can do complicated queries, you can do lots of queries, you can even do recursive querying. Three fourths of the problem with #1 people isn't the 'queries', it's moving all that data around in memory. It's not the amount of queries, it's the size of the results, and the fact you then have to loop through it in your program to find it.
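
      Mistake 1 can be shown side by side with the server-side version. This is a sketch against an in-memory SQLite database with an invented `person` table and synthetic rows; both approaches give the same answer, but the first drags every row into the program before throwing most of them away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, region TEXT, birth_year INTEGER)"
)
conn.executemany(
    "INSERT INTO person (name, region, birth_year) VALUES (?, ?, ?)",
    [
        ("p%d" % i, "north" if i % 4 == 0 else "south", 1950 + i % 50)
        for i in range(1000)
    ],
)

# Mistake 1: fetch the whole table, then filter and project in the application.
all_rows = conn.execute("SELECT * FROM person ORDER BY id").fetchall()
client_side = [
    (row[1], row[3]) for row in all_rows if row[2] == "north" and row[3] == 1970
]

# Letting the server filter: only the needed rows and columns come back.
server_side = conn.execute(
    "SELECT name, birth_year FROM person "
    "WHERE region = ? AND birth_year = ? ORDER BY id",
    ("north", 1970),
).fetchall()

print(len(all_rows), len(server_side))
```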

      That's not to say you shouldn't think of an SQL query as 'costly', but it's no more costly than looping over an array of 100 strings and doing a compare with each. It's something you don't want to do for no reason, but it's fine if you have a reason. Most programming languages have some sort of timing ability...if you don't believe me, test how long a statement takes.

      Of course, a lot of programmers are working on shittily designed databases, where it might actually be pretty costly to run queries.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    83. Re:Typical applications? by butlerm · · Score: 1

      Column family based partitioning gives it a shade of column orientation I admit. That is a feature no relational database I know of really has, although it would be a great enhancement. The closest most relational databases come to column partitioning is out of line storage for BLOBs and CLOBs.

      My complaint is that the "column orientation" of Cassandra is more like free form "field orientation". If row columns (fields) are optional, you are not really storing "columns" at all, but rather arbitrary sets of fields (including repeating groups) in rows or sub rows.

    84. Re:Typical applications? by TooMuchToDo · · Score: 1

      I'd almost compare them to SQLite files. Almost.

      /hates TFiles
      //could always be worse

    85. Re:Typical applications? by Anonymous Coward · · Score: 0

      Human genome project? One row per person, 1 column per base pair.

    86. Re:Typical applications? by Hognoxious · · Score: 1

      like asking for all the birth years of everyone in a geographical region. The requested column family (geographical region) is opened, and each column (person) is quickly checked for the particular row's contents (in case the birth year wasn't provided).

      Most RDBMSs I've worked with allow alternate/secondary indexes - in this case you'd create one on region.
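
      As a sketch of that secondary-index approach, here is the birth-years-by-region query in SQLite. The `person` table and the index name `person_region` are hypothetical. `EXPLAIN QUERY PLAN` is SQLite's way of showing whether the index is used, and its exact wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, region TEXT, birth_year INTEGER)"
)
# The alternate/secondary index on region.
conn.execute("CREATE INDEX person_region ON person (region)")
conn.executemany(
    "INSERT INTO person (region, birth_year) VALUES (?, ?)",
    [("north", 1970), ("south", 1985), ("north", None), ("north", 1992)],
)

query = (
    "SELECT birth_year FROM person "
    "WHERE region = ? AND birth_year IS NOT NULL ORDER BY id"
)
birth_years = [row[0] for row in conn.execute(query, ("north",))]

# The last column of each plan row is the human-readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("north",))]

print(birth_years)
print(plan)
```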

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    87. Re:Typical applications? by Hognoxious · · Score: 1

      I can narrow it down to a type but I can't pinpoint the individual instance.

      Must be an indefinite article.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    88. Re:Typical applications? by Hognoxious · · Score: 1

      Second by preventing people from doing something stupid they also prevent them from doing something ingenious.

      Well, the two categories aren't mutually exclusive, but experience suggests that for every elegant and clever usage of a feature there'll be a thousand retards who use it to do something that a) it doesn't do well and b) something else does much better.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  2. Yeah boy. by Anonymous Coward · · Score: 0

    Now I don't need that 2NF! Just use one column per customer!!!

  3. Only 2 billion? by Jeremi · · Score: 1

    They should have gone with the uint32_t counter, then they could support up to 4 billion!

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
    1. Re:Only 2 billion? by Anonymous Coward · · Score: 5, Funny

      You work for Gillette, don't you.

    2. Re:Only 2 billion? by zach_the_lizard · · Score: 4, Funny

      He doesn't, otherwise it'd be uint64_t and a lather strip!

      --
      SSC
    3. Re:Only 2 billion? by Korin43 · · Score: 1

      They probably use signed ints to support the people who want negative two billion columns.
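
      For reference, the "2 billion" versus "4 billion" figures line up with the ranges of signed and unsigned 32-bit integers. Whether Cassandra's limit actually comes from a signed 32-bit counter is an assumption here; the thread does not say. The helper `fits` is invented for the example, and the `struct` module is only used to check the ranges.

```python
import struct

INT32_MAX = 2**31 - 1   # 2,147,483,647: the "2 billion" figure
UINT32_MAX = 2**32 - 1  # 4,294,967,295: the "4 billion" figure


def fits(fmt, value):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# "<i" is a signed 32-bit integer, "<I" an unsigned one.
print(fits("<i", INT32_MAX), fits("<i", INT32_MAX + 1))
print(fits("<I", INT32_MAX + 1), fits("<I", -1))
```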

  4. If you have more than 30 columns by loufoque · · Score: 1, Insightful

    ... then you're doing it wrong

    1. Re:If you have more than 30 columns by Anonymous Coward · · Score: 0

      I don't know if that's *guaranteed* to be true, but a good rule of thumb.

      Actually, can probably use a much smaller number as a rule of thumb, even...

    2. Re:If you have more than 30 columns by Anonymous Coward · · Score: 0

      I don't know... I store names in my name table as a column for each letter.

    3. Re:If you have more than 30 columns by ogrisel · · Score: 5, Informative

      Not with column store databases such as Cassandra, HBase and BigTable.

    4. Re:If you have more than 30 columns by Anonymous Coward · · Score: 1

      Just that you (and so many others) can't see a use case doesn't mean that there aren't any. I deal a lot with data from very lengthy questionnaires. There are usually several thousand columns, sometimes tens of thousands. I run into the column limits of conventional row based databases more often than not. That's why I tend to use fixed width text files rather than databases. Being able to easily convert this data into a very wide table that can be queried (be it SQL or otherwise) would definitely be useful.

      I suppose careful redesign into a relational structure could reduce the number of columns, but there are new datasets every week, so there is no time for that. Also, customers are not used to relational data.

    5. Re:If you have more than 30 columns by mini+me · · Score: 2

      If you are writing SQL, maybe. Cassandra is not a relational database.

    6. Re:If you have more than 30 columns by butlerm · · Score: 1

      Cassandra, HBase and BigTable aren't traditionally what is meant by the term column store database (http://en.wikipedia.org/wiki/Column-oriented_DBMS) at all. They are much closer to hybrid "repeating group" databases like Adabas and Pick.

      True column store databases are almost unheard of for online transaction processing because they are optimized for streaming, unindexed data storage and subsequent column oriented analysis over large datasets with very low per row overhead. A bitmap index is the closest a traditional relational database comes to column storage, although at least two major relational databases have means of physically clustering related rows from different tables on the same page, which is more or less what Cassandra is described as doing here, except perhaps with more flexibility and more overhead to go along with it.

    7. Re:If you have more than 30 columns by Mitchell314 · · Score: 1

      Two words: normalization.

      --
      I read TFA and all I got was this lousy cookie
    8. Re:If you have more than 30 columns by DavidTC · · Score: 1

      I call bullshit.

      No one answers questionnaires with 'tens of thousands' of questions. At 5 seconds a question, that's 28 hours for 20,000 questions. (And let's not even hypothesize how long the damn results would take to read.)

      Alternately, you think you need more than one field a question, which means you are doing it wrong.

      I don't know what you mean by 'careful redesign into a relational structure' either. A sane design might be to move the person's info to another table, if people answer more than one questionnaire(1), but if that information is taking up more than 15 columns, you are doing it wrong there also.

      1) Which we know they don't, because they don't have time. They're still answering your last insanely massive questionnaire.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    9. Re:If you have more than 30 columns by The_Wilschon · · Score: 1

      Your mail must get forwarded an awful lot! I usually try to cut things off at about 75 columns, so that when those pesky ">" characters get prepended to every line, it won't get all wonky...

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    10. Re:If you have more than 30 columns by Anonymous Coward · · Score: 2, Insightful

      - Not everyone answers every question. There is skip logic involved and there are loops, sometimes nested. These would lend themselves well for a multi table relational approach but the data does not come out of the data collection systems like that (most of them anyway). Would be nice to normalize, but as mentioned, there are new datasets every week, most of them having 1000s of columns. Good luck with normalizing all that before your deadline.

      - Normalized data is not as easy to use in statistical applications. SPSS, the 800 pound gorilla in stats land, only supports flat data, for example.

      - There are things called multiple response questions, sometimes having 100s of options, sometimes 1000s. Ergo 100s to 1000s of columns per question. "Which car models have you ever owned" + every single car model produced in the last 40 years is a good example. Of course there are alternatives such as blob fields and bit shifting, or storing only max 20 answers (first car, second car, etc) but it costs time to convert them. And these formats are also harder to use in statistical analysis, even in flat data.

      In a world where you have complete control over the provided input, and the required output, you are right. In the real world, not so much.

    11. Re:If you have more than 30 columns by Anonymous Coward · · Score: 0

      I am _so_ doing your mom "wrong" tonight.

    12. Re:If you have more than 30 columns by Hognoxious · · Score: 1

      1) Which we know they don't, because they don't have time. They're still answering your last insanely massive questionnaire.

      I think it has those extra columns for all the extra user information because it takes several generations to finish it. The questionnaire that's also a family heirloom! Sign up today, if your grandfather didn't do it last century!

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    13. Re:If you have more than 30 columns by DavidTC · · Score: 1

      Ah, so the answer is 'you think you need more than one field a question'.

      I already stated my opinion on that.

      --
      If corporations are people, aren't stockholders guilty of slavery?
  5. Bah, this is silly. by intellitech · · Score: 1

    If this really matters at all, besides being slightly cool, it will just lead to more bad db design.

    --
    vos nescitis quicquam, nec cogitatis quia expedit nobis ut unus moriatur homo pro populo et non tota gens pereat.
    1. Re:Bah, this is silly. by Musically_ut · · Score: 1

      If this really matters at all, besides being slightly cool, it will just lead to more bad db design.

      Of course not! They clearly state the importance of "creating so many columns that they are nearly unlimited" in the article:

      The ability to create so many columns is valuable because it allows systems to create a nearly unlimited number of columns on the fly, Ellis explained in a follow-up e-mail.

      So that's that.

      --
      Never trust a spiritual leader who cannot dance -- Mr. Miyagi
    2. Re:Bah, this is silly. by Sarten-X · · Score: 2

      on the fly

      Like storing the contents of a web crawl. The row key is the URL, the column is the crawl timestamp, and the cell contains the page (or keywords). That's a column created on the fly. Another application off the top of my head is storing access logs, where each row is a date, each column is a person, and each cell contains a resource they accessed. Having two billion columns is hardly excessive (in theory) for a suitably-large application.
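
      The web crawl layout can be modelled with a dictionary of dictionaries. This is a toy in-memory model, not a Cassandra client: `WideRowTable`, its methods, and the sample data are invented for illustration. Each row key maps to its own set of columns, and a new column appears the moment something is written to it.

```python
from collections import defaultdict


class WideRowTable:
    """Toy model of wide rows: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column_name, value):
        # No schema change needed: the column is created on the fly.
        self._rows[row_key][column_name] = value

    def get_slice(self, row_key, start, finish):
        """Return (column, value) pairs with start <= column <= finish."""
        row = self._rows.get(row_key, {})
        return [(name, row[name]) for name in sorted(row) if start <= name <= finish]

    def column_count(self, row_key):
        return len(self._rows.get(row_key, {}))


crawls = WideRowTable()
# Row key is the URL, column name is the crawl timestamp, cell is the page.
crawls.insert("http://example.com/", 20110101, "<html>v1</html>")
crawls.insert("http://example.com/", 20110108, "<html>v2</html>")
crawls.insert("http://example.com/", 20110115, "<html>v3</html>")
crawls.insert("http://example.org/", 20110108, "<html>other</html>")

print(crawls.get_slice("http://example.com/", 20110105, 20110131))
```

      Rows end up with different numbers of columns, and a slice reads only the crawls in the requested time range.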

      Cassandra, like BigTable and HBase, is not the same as a traditional RDBMS. It's also a column-oriented DBMS. Since each group of columns is stored separately, there's no performance impact to having extra columns. Columns that aren't needed (like old crawls in the example above) simply aren't loaded into memory. What's bad design for an RDBMS is perfect for Cassandra or HBase.

      --
      You do not have a moral or legal right to do absolutely anything you want.
  6. Why? by Xoc-S · · Score: 3, Insightful
    Only a completely de-normalized flat-file database would need anything like that number of columns. That would mean many duplicate pieces of information, and a complete maintenance nightmare. The only purpose I can see is to have views of existing normalized data for fast searching, but that would be read-only data.

    This is a feature in need of an application and I can see very few applications.

    1. Re:Why? by Jeremi · · Score: 3, Funny

      This is a feature in need of an application and I can see very few applications.

      I think you're right, but as long as we're adding features for the sake of having features... why limit the table to two dimensions? Perhaps the next version of Cassandra can support 3D-data-cubes, with each cell specified via a (row,column,level) triplet. And the version after that will allow hypercubes of data with any number of dimensions (up to 2 billion dimensions maximum, of course).

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    2. Re:Why? by Anonymous Coward · · Score: 0

      Actually that wouldn't be an entirely bad way to handle versioning. With some sensible defaults, of course.

    3. Re:Why? by Anonymous Coward · · Score: 0

      Well that's exactly why databases like Cassandra exist, maybe you're sharper than you give yourself credit for being. Facebook developed it to hold an inverted index for each user's mailbox/feed. The key is the user, the column a term, and the cell contains a list of documents.
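
      That key/column/cell layout can be sketched with nested dictionaries. This is an in-memory illustration only; the function names and sample messages are made up, and it says nothing about how Facebook's system was actually implemented.

```python
from collections import defaultdict

# user (row key) -> term (column name) -> list of message ids (cell value)
index = defaultdict(lambda: defaultdict(list))


def index_message(user, message_id, text):
    """Add one message to the user's row, one column per distinct term."""
    for term in set(text.lower().split()):
        index[user][term].append(message_id)


def search(user, term):
    """Look up one column in one user's row."""
    return index.get(user, {}).get(term.lower(), [])


index_message("alice", 1, "lunch on friday")
index_message("alice", 2, "Friday deadline moved")
index_message("bob", 3, "friday party")

print(search("alice", "friday"))
print(len(index["alice"]))  # number of columns (terms) in alice's row
```

      The number of columns in a row grows with the user's vocabulary rather than with any schema, which is how a single row can end up very wide.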

    4. Re:Why? by Sarten-X · · Score: 1

      Disclaimer: I haven't used Cassandra personally, but I have used HBase which operates similarly.

      Cassandra uses column families, which are groups of columns, and are individually selectable. If all families contain the same columns, you have 3D (family, column, row) storage! Now, with HBase, excessive column family creation and maintenance isn't the ideal route, but if you actually need 3D storage, it would work pretty decently.

      Cassandra, BigTable, and HBase are designed for applications that need lots of rarely-accessed details, for relatively few rows. As an example, let's consider a forum. One row per thread, one column per post. A single query (usually in a manner not related to SQL) pulls out a single row, and all the columns associated with it. Since columns are created by the application at runtime, 2 billion isn't that excessive, but it is worth bragging about.

      --
      You do not have a moral or legal right to do absolutely anything you want.
    5. Re:Why? by NNKK · · Score: 1

      Only a completely de-normalized flat-file database would need anything like that number of columns. That would mean many duplicate pieces of information, and a complete maintenance nightmare. The only purpose I can see is to have views of existing normalized data for fast searching, but that would be read-only data.

      This is a feature in need of an application and I can see very few applications.

      Um, a very common answer to Cassandra questions is "denormalize". This is not an RDBMS, stop treating it like one.

    6. Re:Why? by Anonymous Coward · · Score: 0

      Way to miss the point.

      You don't even have to read TFA to find out that it is not about the number of columns and that there was no limit beforehand.

      They created a 2 billion limit after removing a very limited 2GB~ limit on row size. Likely for closure/standards/testing reasons (system integrity is hard to prove on an infinite scale).

      I can see many applications in the wild that require more than 2GB of data in a row.

      It would be wise in future, when you can't see a plausible explanation, not to scream "GOD DID IT" and instead just presume you don't have enough information to form an intelligent opinion.

    7. Re:Why? by schnozzy · · Score: 1

      Failure to see applications is a failure of imagination. Also, Cassandra is highly optimized for extremely fast write rates at large scale, not read-only. Cassandra also has no SPOF, includes variable consistency guarantees, and many other great features. Specifically focusing on the fact that it can have 2 billion columns in a row glosses over a number of other things, but that hardly means denormalized or non-transactional databases have few applications.

    8. Re:Why? by maraist · · Score: 1

      Have you reviewed the BigTable architecture? The central idea is to store what would normally be normalized, joined data as in-line column families instead. Within a column family, you have related columns that are effectively your name-value pairs. Each name in the name-value pair is called a column (in an RDBMS it would more likely be a table with 3 columns: foreign key, name, value, but with the tremendous inefficiency of having to do the join). All this effectively means is that prior to this version, Cassandra only supported a logical collection of name-value pairs that were less than 2 GB. Now you're unlimited, or more correctly, I'm sure they're using a larger bit-value for some grouping thing.

      --
      -Michael
    9. Re:Why? by maraist · · Score: 2

      There are many problem sets where you might like to perform associative mapping. If the keys and/or values are large, you can easily hit the 2GB limit on a single primary key. Imagine if you felt that Cassandra could help you in CPU node mappings, or weather patterns. The associations can be in the billions, and while you may or may not have a primary key for each main node, the association list may approach N. In a traditional RDBMS, such large association mappings (M:N tables) are impractical to traverse. An object oriented database might be a better fit, but the open-source ones I'm aware of aren't sufficient in horsepower. I'm not saying Cassandra fits the bill, but with TB sized total DBs this would be significantly faster than an RDBMS with row-oriented storage (column store tables might do ok). And probably more to the point, the population of those large associations is what's going to kill traditional RDBMS M:N tables (or even proprietary blobs).

      --
      -Michael
    10. Re:Why? by Anonymous Coward · · Score: 0

      Depending on what you sacrifice in ACID you can get decent speed. For example, Facebook usually sacrifices C. From one refresh to another you can get 2 totally different pages, even when no one updates anything that would be on your page. But it is not that big of a deal if for 10 mins you miss that update of someone getting the high score in some game...

    11. Re:Why? by Daniel+Dvorkin · · Score: 2

      As an example, let's consider a forum. One row per thread, one column per post.

      Um, okay, but why would you set your database up that way in the first place? I really don't see the advantage of this over a more standard table having columns for, say, forum ID, thread ID, poster ID, timestamp, and content.

      --
      The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
    12. Re:Why? by Dynedain · · Score: 2

      In the example you just made, I can see that the benefit is that you don't need another layer (PHP, stored queries, etc) to stitch the thread back together. The data structure inherently "knows" how the thread of posts are assembled.

      --
      I'm out of my mind right now, but feel free to leave a message.....
    13. Re:Why? by Sarten-X · · Score: 2

      Why not, if you expect to have several billion posts?

      The more important issue in this architecture decision would be scaling needs and abilities. How many billion rows can a typical RDBMS handle on a $20,000 budget? If that budget goes to $40,000, will that capacity double? With a column-oriented database, only the needed column families are loaded into memory. For this forum example, you could have a family for each month of operation. Old threads would then be entirely in old column families, so they would remain untouched on disk, never even read until they're needed. The lower memory use leads to less expensive servers (since storage is cheap now), and linear scaling.

      If your application doesn't need ridiculously large storage, go ahead and use a plain old RDBMS, just so you can avoid learning a new skill set. You're probably not going to hit a meaningful limit. If you're looking at a huge amount of data, or a moderate amount of data with a lot of processing, newer technology may be a better choice.

      Cassandra can also run with Hadoop's MapReduce framework. Taking the forum example further, a periodic job could process all the posts, updating another table (or set of columns) with a map of keywords to posts that contain them. Scanning one thread at a time, each node in the Hadoop cluster could compute the index in parallel, allowing the index creation to be separated from the load of making a post. Again, it's not a big deal for a small application, but when you're dealing with something scaling up to the size of Facebook, StumbleUpon, or Google, new tools with new designs just work better.
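
The periodic indexing job described above can be sketched in plain Python. This is only an illustration of the map and reduce steps, not Hadoop or Cassandra code: the `threads` data and the function names `map_thread`, `reduce_index`, and `build_index` are hypothetical, and a real job would run the map step on many nodes in parallel.

```python
from collections import defaultdict

# Hypothetical data: one "row" per thread, one "column" per post (post id -> text).
threads = {
    "thread-1": {1: "Cassandra stores wide rows", 2: "Wide rows scale well"},
    "thread-2": {1: "Hadoop can scan rows in parallel"},
}

def map_thread(thread_id, posts):
    """Map step: emit (keyword, (thread_id, post_id)) pairs for one thread."""
    for post_id, text in posts.items():
        # set() so a word repeated inside one post is emitted only once
        for word in set(text.lower().split()):
            yield word, (thread_id, post_id)

def reduce_index(pairs):
    """Reduce step: group the emitted pairs by keyword."""
    index = defaultdict(list)
    for word, location in pairs:
        index[word].append(location)
    return {word: sorted(locations) for word, locations in index.items()}

def build_index(all_threads):
    """Run the map step thread by thread, then reduce everything at once."""
    pairs = []
    for thread_id, posts in all_threads.items():
        pairs.extend(map_thread(thread_id, posts))
    return reduce_index(pairs)

keyword_index = build_index(threads)
```

Because each call to `map_thread` touches only one thread, the threads could be handed to different workers without coordination, which is the property the post relies on.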

      --
      You do not have a moral or legal right to do absolutely anything you want.
    14. Re:Why? by Anonymous Coward · · Score: 0

      You don't have to do the join by setting up PK-FK (declarative referential integrity), just index both of the "join" columns (hint: look behind the scenes of Oracle Financials and other enterprise dbms apps... they tend to NOT use declarative referential integrity...).

      IF it's textual data, you can sometimes get kind of geeky and do "reverse-value" indexes too.

      But, Cassandra, BigTable, etc. are for other problem domains that don't fit well into the RDBMS way of looking at data.

    15. Re:Why? by c6gunner · · Score: 1

      For this forum example, you could have a family for each month of operation.

      So what's the difference between that and using a typical SQL database and just having a new table for each month of operation? Aren't tables loaded into memory as necessary?

    16. Re:Why? by melted · · Score: 1

      You've just described BigTable. :-)

    17. Re:Why? by DI4BL0S · · Score: 1

      You clearly have no idea what Cassandra is able to do with those columns.
      I suggest that you read the article as posted before and do your research before making a fool of yourself with comments like that.

      Article: NoSQL Cassandra

    18. Re:Why? by Sarten-X · · Score: 1

      A dynamic table name leads to all kinds of trouble for an RDBMS. A closer analogue would be to partition the giant table by date, but even then there are some applications that end up with huge amounts of data in one partition. Cassandra can split by column family, as well as by sections of rows. Since the row partitioning is done automatically (in a "don't worry about it" fashion), there's seldom a painful amount of data in one place.

      That's the point of this story: The maximum column count (which would require manual partitioning) has just risen above any reasonable limits. Big Data can be even bigger.
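
A rough picture of the "don't worry about it" row partitioning is a hash ring: each row key is hashed to a token, and the node owning that part of the ring stores the row. The sketch below is loosely modeled on hash-based partitioning in general; the `Ring` class, the `token` helper, and the node names are made up for illustration and are not Cassandra's API.

```python
import bisect
import hashlib

def token(key):
    """Hash a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Toy hash ring: each node owns the keys hashing at or below its token."""

    def __init__(self, nodes):
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, row_key):
        # First node token >= the key's token, wrapping around at the end.
        i = bisect.bisect_left(self._tokens, token(row_key)) % len(self._ring)
        return self._ring[i][1]

nodes = ["node-a", "node-b", "node-c"]   # hypothetical node names
ring = Ring(nodes)
placement = {f"thread-{n}": ring.node_for(f"thread-{n}") for n in range(1000)}
```

The application only ever supplies the row key; which machine ends up holding `thread-42` falls out of the hash, so no table name or partition has to be chosen by hand.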

      --
      You do not have a moral or legal right to do absolutely anything you want.
    19. Re:Why? by Anonymous Coward · · Score: 0

      As an example, let's consider a forum. One row per thread, one column per post.

      Um, okay, but why would you set your database up that way in the first place? I really don't see the advantage of this over a more standard table having columns for, say, forum ID, thread ID, poster ID, timestamp, and content.

      Take a look at your traditional relational database, join all the tables into one, then rotate it 90 degrees. Now you can see exactly why you would need 2 billion columns in one row. Want to display posts #20 through #40? Just extract those columns from the thread row. Good thing Cassandra doesn't support SQL statements...

      (Seriously though, even with big tables, if you are doing "one column per post" you're doing it wrong.)
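
For what "extract those columns from the thread row" means in practice, here is a minimal sketch of a column slice over a row whose columns are kept sorted by name. `WideRow` and its methods are hypothetical names for illustration, not a Cassandra client API.

```python
import bisect

class WideRow:
    """Toy wide row: column names are kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]

# One row for the thread, one column per post, named by post number.
thread = WideRow()
for post_number in range(1, 101):
    thread.insert(post_number, f"post body {post_number}")

page = thread.slice(20, 40)   # posts #20 through #40 inclusive
```

Because the names are sorted, the slice is two binary searches plus a contiguous read, regardless of how many columns the row holds. Note that #20 through #40 inclusive is 21 columns.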

    20. Re:Why? by MetalAngel · · Score: 1

      This feature matters for those that need to store data efficiently (a good ratio between payload and overhead).
      Columns (in Cassandra specifically) are the most efficient form to store data, because a column has the lowest overhead.
      The overhead for a row is much higher.

  7. SQL statement from hell by Anonymous Coward · · Score: 0

    What would the SQL statement look like if you wanted to select nearly all of those 2 billion columns except a few?

    1. Re:SQL statement from hell by Sarten-X · · Score: 2

      Cassandra doesn't use SQL, and isn't even like a RDBMS in any way other than "it stores a table of data", so the SQL statement would be nonexistent.

      --
      You do not have a moral or legal right to do absolutely anything you want.
  8. Sorry... by Nemyst · · Score: 0

    I couldn't not link to this xkcd comic.

  9. 2 billion columns... by aBaldrich · · Score: 4, Funny

    ought to be enough for everybody

    --
    In soviet russia the government regulates the companies.
    1. Re:2 billion columns... by adamofgreyskull · · Score: 1

      Joke away but, going by some of the shit I've seen at TheDailyWTF, that could well come back and bite you in the ass one day.

    2. Re:2 billion columns... by Anonymous Coward · · Score: 0

      According to wikipedia, we need about 6.894 billion columns to have enough for everybody.

    3. Re:2 billion columns... by WGFCrafty · · Score: 1

      Woooooooooooooooosh. WOOOSH wooosh woosh. Four woosh's ought to be enough for you.

    4. Re:2 billion columns... by Phil06 · · Score: 0

      When I turn my head sideways it looks just like 2 billion rows

      --
      "...and yet, I blame society" Duke - Repo Man
    5. Re:2 billion columns... by ian_from_brisbane · · Score: 2

      When I turn my head sideways it looks just like 2 billion rows

      Where can I buy a monitor like yours?

  10. Awesome! by Anonymous Coward · · Score: 0

    Now I can write that application I have been wanting to write forever. Just couldn't find a suitable database because none of them supported two billion columns per row. Oh happy day.

  11. This is a triumph for hideously bad schema by Sarusa · · Score: 4, Informative

    Well good on them for solving an interesting technical problem, but the use cases for this are all bad.

    Obvious first use: boss will suggest we optimize the database by using only one gigantic row with two billion columns.

    1. Re:This is a triumph for hideously bad schema by teknopurge · · Score: 1

      Database? Psha - we only use Excel for our most critical data storage needs....

    2. Re:This is a triumph for hideously bad schema by Jugalator · · Score: 2

      This is a triumph for hideously bad schema

      This isn't a relational database. There is no schema. [/matrix]

      --
      Beware: In C++, your friends can see your privates!
  12. Re:About time by MichaelKristopeit401 · · Score: 1

    you'd still have to have multiple columns...

    Seriously though, WTF?

  13. really... by Bizzeh · · Score: 1

    ...whats the point...

  14. Thank goodness! by wonkavader · · Score: 1

    Now I can finally shoe-horn my coworkers' Excel spreadsheets into a database.

  15. for those that absolutely positively cannot RTFA by Son+of+Byrne · · Score: 5, Informative

    Cassandra appears to be a multi-dimensional datastore that does not store data in the same fashion as a typical RDBMS. It uses columns and rows both to store sets of data uniquely. If you're familiar with Big Table, then, apparently, it's kinda like that.

    That just means that they've added even more storage vectors to it than before...not sure why it made slashdot front page...

    --
    I'd happily pay you Tuesday for a biopsy today!
  16. This upper limit has been eliminated by Anonymous Coward · · Score: 0

    By establishing an upper limit on a formerly unlimited limit.

  17. Re:designer shoes online for less-cheap wholesale by Relayman · · Score: 0

    Wow, the first spam comment I have ever seen on /. And not one piece is authentic. I especially like how they made the security icons clickable but not the way they should be.

    --
    If I used a sig over again, would anyone notice?
  18. Cassandra by tverbeek · · Score: 5, Funny

    I predict that bad things will come of this.

    Not that anyone will believe me.

    --
    http://alternatives.rzero.com/
    1. Re:Cassandra by thewils · · Score: 1

      I believe you :) There's a subset of coders who don't see anything wrong with "Select *" all over the place and I have a feeling this construct might chew up available memory real quick if a table has anywhere near this number of columns...

      --
      Once I was a four stone apology. Now I am two separate gorillas.
    2. Re:Cassandra by NNKK · · Score: 0

      I believe you :) There's a subset of coders who don't see anything wrong with "Select *" all over the place and I have a feeling this construct might chew up available memory real quick if a table has anywhere near this number of columns...

      What's table?

      (Seriously, Cassandra doesn't have tables. It's not an RDBMS, and doesn't use SQL.)

    3. Re:Cassandra by WeatherGod · · Score: 1

      Nah, that never will happen! (For those who didn't get it: http://en.wikipedia.org/wiki/Cassandra)

    4. Re:Cassandra by Dails · · Score: 1

      Clever.

  19. figured it out by Bizzeh · · Score: 2

    I know why the developers thought this would be a good idea. A feature this mental would be sure to get them free publicity on slashdot

    1. Re:figured it out by mini+me · · Score: 2

      A column in Cassandra is sort of, if you have to make a comparison, like a join in SQL. Using Slashdot as an example, the topic would be the row, and each comment within that topic would be a column. Wanting to store more than 2GB of column data doesn't seem mental at all.

      Whether or not it is worthy of the front page is another question.

    2. Re:figured it out by butlerm · · Score: 1

      Non-relational databases that do this have been around for decades. Adabas and Pick are the examples that come to mind. The pertinent difference here is that the developers of those databases were sane enough not to call repeating groups "columns".

    3. Re:figured it out by mini+me · · Score: 1

      Comparing it to a join was, of course, an oversimplification. Cassandra is not relational, so it is difficult to directly compare features with a relational database.

      Cassandra utilizes ideas from column-oriented databases to store its data, so it is not wrong to call them columns. They are just not columns in the relational database sense.

    4. Re:figured it out by butlerm · · Score: 2

      They are just not columns in the relational database sense.

      They are not columns _even_ in the sense that column oriented databases use. They are repeating groups. What column oriented databases call "columns" have a perfect logical correspondence with what relational databases call columns. Nothing about the relational model dictates either row or column orientation, so far as storage is concerned.

      The logical and physical structure of a Cassandra row has been used in some databases (Adabas, Pick, etc) for thirty or forty years. In fact others (such as Oracle and DB2) can implement the logical model of a relational database on the same physical model as that used by repeating group databases like Cassandra, getting the best of both worlds.

      The end game for contemporary NoSQL databases is to evolve step by step in the direction of distributed, shared nothing relational databases with some special purpose relaxations. Secondary indexes? We have that. Transactions? We have that too! And so on...

      The only problem with Cassandra is that the designers seem to have ignored everything written about database implementation in the past fifty years. Either that or they have a severe case of not-invented-here syndrome. Why use standard nomenclature when we can make up one maximally designed to confuse everyone and everybody?

  20. WHat?! by snizzle · · Score: 1

    Welcome to the new online dating experience when we match you to someone else with up to 2 Billion traits!

  21. 2 billion columns? by flimflammer · · Score: 1

    This sounds purely like marketing gibberish when you can't create enough meaningful features to boast about.

    I can't even think of a reason why you would need 2 billion columns. If you did, I think the ability to store it is the least of your problems.

  22. Nobody read "Jurassic Park"? by Ken+Hall · · Score: 1

    As I recall, one of the tasks given to Nedry in the design of the computer systems was to devise a database capable of holding a couple of billion fields to handle the sequencing of DNA strands.

    1. Re:Nobody read "Jurassic Park"? by DavidTC · · Score: 1

      That is possibly the stupidest design imaginable. You wouldn't be storing each DNA sequence in a field. DNA is full of variable length stuff, so that one slight insertion or deletion or change between species would result in the rest of the fields being offset, which would a) be hell to actually update, and b) entirely pointless because you can't compare them or search for them as fields.

      I'm not entirely sure what you would be doing, but it wouldn't be that. I'm not entirely sure what the Jurassic Park people were trying to do...I know they had to 'patch' some DNA, both to cause the lysine defect and to fix gaps in the record. (Inexplicably with frog DNA instead of bird DNA.) I don't know that a 'database' really makes any sense for either of those, but even if you do need some sort of database, putting individual DNA in fields would render it unusable.

      What really should have been done is 'decompiling' to some sort of 'machine language' and then editing, with diffs to frog DNA and merges to patches they wished to apply, and 'recompiling'. (Which is not even slightly possible even today, as we barely know what any DNA does, but it's called suspension of disbelief.)
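
The offset problem, and the diff-based alternative, can be shown on a toy sequence. This sketch uses Python's standard `difflib`; the two sequences and the helper names `positional_mismatches` and `edit_script` are invented for illustration and have nothing to do with real genomics tooling.

```python
from difflib import SequenceMatcher

reference = "GATTACA"
variant = "GATCTACA"   # the same sequence with one base inserted at index 3

def positional_mismatches(a, b):
    """Compare field by field, as if each base lived in a fixed column."""
    return sum(1 for x, y in zip(a, b) if x != y)

def edit_script(a, b):
    """Return only the non-matching operations needed to turn a into b."""
    matcher = SequenceMatcher(None, a, b)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

shifted = positional_mismatches(reference, variant)
edits = edit_script(reference, variant)
```

A single inserted base makes every later fixed position disagree, while the diff view records it as one small insertion, which is why storing the sequence as fixed fields buys nothing for comparison.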

      --
      If corporations are people, aren't stockholders guilty of slavery?
    2. Re:Nobody read "Jurassic Park"? by Ken+Hall · · Score: 1

      Nobody ever said Michael Crichton had the best grasp of technical issues. While the book was a BIT better than the movie, most of his stuff is still pretty well divorced from reality.

  23. Re:for those that absolutely positively cannot RTF by Anonymous Coward · · Score: 0

    Cassandra appears to be a multi-dimensional datastore that does not store data in the same fashion as a typical RDBMS. It uses columns and rows both to store sets of data uniquely. If you're familiar with Big Table, then, apparently, it's kinda like that.

    That just means that they've added even more storage vectors to it than before...not sure why it made slashdot front page...

    ah yes.... this will put my linear algebra class to good use!

  24. Indexes by Twillerror · · Score: 3, Informative

    Cassandra, like many of the "no sql" type databases, doesn't have classic indexes.

    So instead of having an index you typically have a separate table that acts as the index.

    Imagine you have a users table. One of the fields is country. Now you want to know all the users for a particular country.

    In standard RDBMS type systems you just scan each row, or have an index that has done that "ahead of time" or as rows are inserted.

    In Cassandra the rows of users are distributed, possibly among 100s of servers. So scanning for all users that have a particular country would require scanning all rows, which could take a long time.

    Unlike in RDBMS-like systems, rows don't have a 2D structure and don't have a real limitation on the number of columns they can have. And columns can essentially be arrays/rows of objects.

    So as you design/bang out your application you typically realize you need to know "users by country" for some stupid report. So you create a new table to hold these values. This has one row per country. As users are entered you append to this row. This essentially creates an array-like structure. You then look up the row for a particular country and you now know all the users for that particular country.

    Sounds like Cassandra is getting rid of a limitation that could have caused a very large index to require multiple rows.
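
The hand-maintained "users by country" index can be sketched with two in-memory maps standing in for the two tables. All names here (`users`, `users_by_country`, `add_user`, `users_in`) are hypothetical; the point is only that the application writes to both places on every insert.

```python
from collections import defaultdict

users = {}                             # user_id -> attributes (the "users" table)
users_by_country = defaultdict(dict)   # country -> {user_id: ""} (the index "table")

def add_user(user_id, name, country):
    """Write the user row, then append a column to the country's index row."""
    users[user_id] = {"name": name, "country": country}
    users_by_country[country][user_id] = ""   # column name is the user id, value unused

def users_in(country):
    """One row lookup in the index, then one fetch per user id found there."""
    index_row = users_by_country.get(country, {})
    return [users[user_id]["name"] for user_id in sorted(index_row)]

add_user(1, "alice", "NO")
add_user(2, "bob", "US")
add_user(3, "carol", "NO")
norwegians = users_in("NO")
```

Each index row grows by one column per user in that country, so a populous country produces exactly the kind of very wide row that a per-row column limit would otherwise force you to split.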

    1. Re:Indexes by shutdown+-p+now · · Score: 1

      So as you design/bang out your application you typically realize you need to know "users by country" for some stupid report. So you create a new table to hold these values. This has one row per country. As users are entered you append to this row. This essentially creates an array-like structure. You then look up the row for a particular country and you now know all the users for that particular country.

      So how is that any different from the usual RDBMS index, except that in this case you have to maintain it manually?

  25. Re:About time by LukeWebber · · Score: 1

    My bad. I meant "multiple rows".

  26. Introduction to Cassandra by Fnord666 · · Score: 1

    Here is a link to an introduction to the Cassandra database system. One thing to realize is that Cassandra is one of the new "noSQL" DBMS. These operate very differently than an RDBMS such as Oracle or DB2.

    --
    'The tyrant will always find pretext for his tyranny.' - Aesop's Fables
  27. 2 billion is too small... by Anonymous Coward · · Score: 1

    That's less than one column per person!

  28. Paging Microsoft by Nom+du+Keyboard · · Score: 1

    Now if only Excel would follow.

    --
    "It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
  29. SELECT * FROM TWO_BILLION_COLUMN_TABLE; by timeOday · · Score: 1

    Man this is great! Now I only need one table and never have to JOIN again. Most of the rows won't use most of the columns but that's what NULL is for, am I right?

  30. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  31. Re:for those that absolutely positively cannot RTF by maraist · · Score: 1

    I wonder if it's possible to represent a non-cartesian basis vector-space with a DB. Maybe one of the columns is sinusoidally looped - haha,, every 32nd insert wraps around itself.. Oh this could be a cool MLK holiday project.

    --
    -Michael
  32. Re:for those that absolutely positively cannot RTF by Dahamma · · Score: 1

    Not knocking Cassandra, but basically it means that this metric of "2 billion columns", being completely different from the concept of RDBMS columns, really doesn't mean much from a comparative point of view...

    It's kinda like saying "that army of ants will conquer all nations, they have 2 billion soldiers!" :)

  33. Yes and the funniest thing about all this is by Giant+Electronic+Bra · · Score: 4, Insightful

    That we had all of this stuff 30 years ago. It was called 'network' databases, which were pretty much the standard sort of technology before RDBMS came along and everyone realized how incredibly much better relational algebra was for the vast majority of problems. As with many other things older ideas eventually resurface with new names and a few more features. There are times when this kind of facility is useful. Nothing wrong with it. The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span. The best part though is that when something turns into a giant mess guys like me can make nice money fixing the mess. lol.

    --
    "Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
    1. Re:Yes and the funniest thing about all this is by Anonymous Coward · · Score: 1

      I wish there was a mod "-1, Get Off My Lawn".

    2. Re:Yes and the funniest thing about all this is by DavidTC · · Score: 2

      The vast majority of cases though where I've seen people using something like Cassandra or Big Table were ill advised. A properly optimized RDBMS with correctly designed schema can handle all but a few edge cases. Most of the hype these tools are generating is based on a lack of real understanding of how to properly use databases combined with people believing myths about other technologies and helped along by the industry's short memory span.

      Indeed, and there are edge cases, like Facebook, or Google, or whatever. The edge cases are gigantic databases that are accessed in certain specific ways.

      There are probably fewer edge cases than actual NoSQL codebases, which is pretty surreal. There are more actual products than the number of people who need the products.

      And 99.99% of the people playing with them don't need them at all.

      Rule of thumb: If you ever have to decide whether or not you need NoSQL...you don't. Because the actual databases that would be better off under NoSQL are operating inside corporations where individuals don't ever make that sort of decision anyway.

      The real joke is people using them in ways that are actually slower than any RDBMS, but they think it's 'easier', usually because they never bothered to learn how JOINs work, and don't understand that it's perfectly fine to make a dozen SQL queries on a web page...that's what indexes are for. (Someone should ask them to estimate how many SQL queries a slashdot page takes.)

      --
      If corporations are people, aren't stockholders guilty of slavery?
    3. Re:Yes and the funniest thing about all this is by red_blue_yellow · · Score: 3, Informative

      Indeed, and there are edge cases, like Facebook, or Google, or whatever. The edge cases are gigantic databases that are accessed in certain specific ways.

      It's true that many people attempt to prematurely optimize by using Cassandra first instead of something they are already familiar with. However, when faced with some of the pains of growing an RDBMS beyond what a single box can handle, it's worth it to consider your other options. Keep in mind that if it's easy to store and make use of a huge pile of data, you're more tempted to gather that data in the first place, where 10 years ago it might have been prohibitively expensive or difficult.

      There are probably fewer edge cases than actual NoSQL codebases, which is pretty surreal. There are more actual products than the number of people who need the products. And 99.99% of the people playing with them don't need them at all.

      I can assure you that you're incorrect, but since you don't have any data to back this up, I won't bother either.

      The real joke is people using them in ways that are actually slower than any RDBMS, but they think it's 'easier', usually because they never bothered to learn how JOINs work, and don't understand that it's perfectly fine to make a dozen SQL queries on a web page...that's what indexes are for.

      Yes, only knuckle-dragging imbeciles are interested in new systems... *sigh*. This is an often-touted piece of flamebait that has little basis in reality. Some of the largest Cassandra users are companies who already have extensive experience scaling MySQL and other RDBMS.

      While some might find that document stores like MongoDB are "easier" and use it for that reason, Cassandra has a reputation for being difficult to get started with; the reason it gets used nevertheless is because the benefits outweigh the steep learning curve.

      --
      A neutral communications medium is essential. It is the basis of science, by which humankind should decide what is true.
    4. Re:Yes and the funniest thing about all this is by Giant+Electronic+Bra · · Score: 2

      My comment would just be along the lines of what DavidTC stated though: in the case where that kind of technology is warranted you're either in a huge organization with very specialized needs or well beyond the competency level of small shops. It isn't so much a factor of being able to find a tool that could do the job. It is a matter that the various factors going into that kind of scale of system are so complex and varied. You need expertise in large scale mass storage, clustering, management, etc. to back up something like that, and you're probably well into the realm of highly tweaked network, kernel, networking layer, etc. What you need is a dedicated team of experts. Something like Cassandra can be a great boon to them, but you better already have a very hefty IT budget or you're probably better off with vertical scaling and an RDBMS. You can go a LONG way with that, trust me.

      --
      "Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
    5. Re:Yes and the funniest thing about all this is by Anonymous Coward · · Score: 0

      The reason old ideas resurface is because technology changes. Nowadays, memory, disk and CPU are radically different than they were 30 years ago.

      What this means is that sometimes "bad" ideas become good ideas with newer technology.

  34. Finally! by Compaqt · · Score: 2

    I was having trouble making a table for my new Web 3.0 m-commerce application on lesser databases:

    CREATE TABLE peeps(
    peep1_first_name VARCHAR(255),
    peep1_last_name VARCHAR(255),
    peep1_address VARCHAR(255),
    peep1_address2 VARCHAR(255),
    peep1_address3 VARCHAR(255),
    peep1_creditcard VARCHAR(255),
    peep1_creditcard2 VARCHAR(255),
    peep1_creditcard3 VARCHAR(255),
    peep2_first_name VARCHAR(255),
    peep2_last_name VARCHAR(255),
    peep2_address VARCHAR(255),
    peep2_address2 VARCHAR(255),
    peep2_address3 VARCHAR(255),
    peep2_creditcard VARCHAR(255),
    peep2_creditcard2 VARCHAR(255),
    peep2_creditcard3 VARCHAR(255), ...

    509 Bandwidth Limit Exceeded

    --
    I'm not a lawyer, but I play one on the Internet. Blog
  35. And Oracle supports EXABYTE sized databases by dirkdodgers · · Score: 3, Interesting

    So I can appreciate that this announcement sounds like News for Nerds, but can someone explain why it Matters that Cassandra can support 2 billion columns?

    The article basically says "because you can't execute SQL you need lots of columns". OK, great, why would I want that? The article doesn't tell me. The Cassandra website sure doesn't tell me.

    Oracle 11 supports up to 8 fucking EXABYTES of data in an RDBMS that I can execute SQL against. What Cassandra puts in columns, I put in rows.

    I've scoured this thread like all the other ones on Cassandra for the killer feature, for the "you can do this with Cassandra that you can't do as well with an RDBMS" and I can't find it.

    The best I can come up with is "I want to store lots of indexed data, I don't care about transactional integrity, and I don't want to pay Oracle". Is that it? That's fine if it's it, Oracle doesn't come cheap and that can be a deal breaker for new companies, but I just wish someone would spell out that this is the justification for Cassandra's existence.

    1. Re:And Oracle supports EXABYTE sized databases by melted · · Score: 1

      The killer feature is that it is actually horizontally scalable and fault-tolerant out of the box.

    2. Re:And Oracle supports EXABYTE sized databases by red_blue_yellow · · Score: 1

      2B columns in a row isn't why you use Cassandra.

      Here are some of the actual reasons why you use Cassandra:
      - No single point of failure. Every node in the cluster has the same role. This is also nice for maintenance purposes.
      - Linearly scalable. Increasing the size of your cluster by N times increases the total ops/second N times.
      - Tunable consistency per operation. For every read and write, you can specify how many replicas for a piece of data must respond for the operation to be considered a success. A typical strategy is requiring a quorum of replicas to respond for both reads and writes, ensuring a strongly consistent view of your data (by the pigeonhole principle) while still tolerating the failure of a minority of your replicas. If you're familiar with the CAP theorem, this lets you trade off some amount of C and A for every operation.
      - It's fast (while still durable). Sub-millisecond write latencies, read latencies typically 1ms to 10ms, depending on caching, write patterns, and other factors. Pretty standard hardware with cheap rotating media drives works very well, so you don't have to buy any super-boxes.
      - Multi-datacenter replication. Cassandra really does a good job of making this transparent while still giving good latencies and tunable availability/consistency.

      Basing the clustering aspects of Cassandra off of Amazon Dynamo is what really brings a lot to the table here. The BigTable data model just happens to work really well with this.
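
The pigeonhole argument behind quorum reads and writes can be checked by brute force for small replica counts. This is a standalone sketch of the arithmetic, not Cassandra code; `quorum`, `always_overlap`, and `tolerated_failures` are names invented here.

```python
from itertools import combinations

def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def always_overlap(n, r, w):
    """True if every set of w written replicas shares a member with every set of r read replicas."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )

def tolerated_failures(n):
    """Replicas that can be down while a quorum is still reachable."""
    return n - quorum(n)

three_replicas_safe = always_overlap(3, quorum(3), quorum(3))   # R + W = 4 > 3
one_and_one_safe = always_overlap(3, 1, 1)                      # R + W = 2 <= 3
```

Whenever R + W exceeds the replica count, at least one replica in any read set has seen the latest write. With 3 replicas a quorum is 2, so one replica can be down; with 5 it is 3, so two can be.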

      --
      A neutral communications medium is essential. It is the basis of science, by which humankind should decide what is true.
    3. Re:And Oracle supports EXABYTE sized databases by teknopurge · · Score: 1

      The killer feature is that it is actually horizontally scalable and fault-tolerant out of the box.

      So is Postgres. Like the OP, I'm still waiting for a good reason to use NoSQL-type storage. I have to agree that these are all solutions looking for problems: trying to re-invent the wheel for no other reason than they don't know how to correctly do it with the existing products.

    4. Re:And Oracle supports EXABYTE sized databases by DavidTC · · Score: 4, Interesting

      NoSQL stuff is useful in weird extreme fringe cases, where you need to access data in essentially random ways. Digg, Facebook, and Google all use NoSQL databases, and I think the first two use Cassandra.

      Specifically, you kinda make your own rows. It's like having permanent multiple JOINs that you can access instantly, from what I understand. (This is what this article is talking about, it's now unlimited.)

      Essentially, it's a giant blob of data that exists, and you draw lines on it in advance that are your results, and you can get those result instantly, at the cost of being unable to decide to get other results in real time.

      Many of the products let you have them on different servers, so you can have a 'people who have voted for this Digg' table or something, on the server that handles that thing.

      I'm not entirely sure how it works, but that's basically it. Oh, and the fact they talk about 'columns' and 'rows' is just utter stupidity in naming to confuse everyone. Basically, they simply tend to keep each column as a file, which allows them to do what I mentioned above..copy needed columns, and just needed columns, to other servers.

      It's really weird, and, like I said, only relevant for giant giant databases. There's no way that google could do a full text search on a RDBMS, regardless if it fits in Oracle. What it can do is make a 'column' for each word, and a 'row' for each URL, put different columns on different servers, and that actually works in the non-relational database they use, when there's no way in hell that would work on a RDBMS.

      However, more importantly for slashdot, a fuckload of fools think that SQL is somehow 'retarded' and that NoSQL is 'awesome, dude', so they like to play with it, usually by spewing out some crap PHP or Perl or something that works about a tenth as well as just using an RDBMS would work. If they actually understood how to use an RDBMS, that is.

      --
      If corporations are people, aren't stockholders guilty of slavery?
    5. Re:And Oracle supports EXABYTE sized databases by melted · · Score: 2

      Try to deploy Postgres on a 5000 machine cluster, with replication and failover and then get back to me. And by "failover" here I mean the entire racks or ever network segments going away with nary a hiccup in serving, no manual intervention (except for bringing up replacement nodes), and no data loss.

      Then there's the issue of RDBMSs being suboptimal for straightforward user profile storage. You have to implement a lot of things by hand. Cassandra (or BigTable) gives you a versioned, fault tolerant, scalable, multidimensional map. It's really convenient for a lot of things, but it's not a replacement for proper DBs when you need realtime aggregation or joins.

      Heck, even Google used MySQL here and there for this exact reason.

      I think it's really unfortunate that folks consider these a replacement for your typical DB-like scenarios. They're merely a replacement for DBs in cases where the use of DB was a kludge in the first place, and calling multidimensional maps "databases" is really a misnomer.

    6. Re:And Oracle supports EXABYTE sized databases by Anonymous Coward · · Score: 0

      BigTable is a sorted map of sorted maps of sorted maps of sorted maps ... and so on.
      Cassandra is one of the implementations of the BigTable data model.
      It is a completely different way of thinking about your data.
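
      A rough sketch of the "map of maps" idea, assuming a toy in-memory model. The class WideRowStore and its methods are hypothetical names, not Cassandra's or BigTable's API, and it sorts at read time instead of keeping the maps sorted:

```python
class WideRowStore:
    """Toy model: row key -> column name -> timestamp -> value (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value, timestamp):
        """Store one version of a column value under the given timestamp."""
        self._rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def slice(self, row, start, end):
        """Return [(column, latest value)] for start <= column <= end, in column order."""
        columns = self._rows.get(row, {})
        return [(name, self.get(row, name))
                for name in sorted(columns) if start <= name <= end]


store = WideRowStore()
store.put("user:42", "email", "old@example.com", timestamp=1)
store.put("user:42", "email", "new@example.com", timestamp=2)
store.put("user:42", "name", "Alice", timestamp=1)
store.put("user:42", "zip", "12345", timestamp=1)
```

      Reads are either a single lookup by (row, column) or a range of column names within one row. There is nothing here that resembles a join.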

      Is it better? Faster? More scalable?
      Depends.

    7. Re:And Oracle supports EXABYTE sized databases by Ant+P. · · Score: 1

      Oracle still imposes size limits on the database? How crude. You should try a real database like PostgreSQL.

    8. Re:And Oracle supports EXABYTE sized databases by Anonymous Coward · · Score: 0

      So I can appreciate that this announcement sounds like News for Nerds, but can someone explain why it Matters that Cassandra can support 2 billion columns?

      ...

      Oracle 11 supports up to 8 fucking EXABYTES of data in an RDBMS that I can execute SQL against. What Cassandra puts in columns, I put in rows.

      The "NoSQL" databases serve a useful role in allowing us to see a wider range of options for storing data in apps, but articles like these show that there's also lots of hot air waiting to be let out of the valuation of these products.

      The wisest option might be to withhold our justified cynicism about the poorly thought out PR and try to find something useful to expand our toolset. Not just in terms of technology, even: many of the techniques keyval/map/reduce products use can be directly applied to use mature RDBMS technology in a more flexible manner than we have before.

    9. Re:And Oracle supports EXABYTE sized databases by w_dragon · · Score: 1

      You consider 8 EXABYTES a limit? Get back to me when you actually have a postgres database with that much data in it.

    10. Re:And Oracle supports EXABYTE sized databases by teknopurge · · Score: 1

      Try to deploy Postgres on a 5000 machine cluster, with replication and failover and then get back to me. And by "failover" here I mean the entire racks or even network segments going away with nary a hiccup in serving, no manual intervention (except for bringing up replacement nodes), and no data loss.

      See this is the problem: redundancy isn't the job of the DB: it's the job of the infrastructure. Tell you what, I'll deploy my 5000 postgres nodes and have everything vmotion and swing luns like a stripper working for her tuition and we'll see exactly why NoSQL DBs are exactly that. Replication? The SANs are mirrored over MPLS. Checkmate.

      And if you think RDBMS are "suboptimal for straightforward user profile storage.." there's a problem with your data model, not the system. A poor workman blames his tools.

      My intention is not to come off as vicious, though it may seem that way. I'm just really tired of people doing shit and thinking it's innovation, only because they are too ignorant to understand the problems they created their solutions for; either that or they forgot that something older than 15 years is available to do the same thing, only better.

    11. Re:And Oracle supports EXABYTE sized databases by melted · · Score: 2

      You still haven't described how you'd implement sharding (at which point with most realistic relational schemas you're likely to lose the ability to do joins), load balancing and transparent, reliable failover. SANs don't free you from the need to do DB replication, since several processes can't write to the same set of files without massive synchronization overhead, so you're in a losing position right from the start, SANs notwithstanding.

      But even assuming you got all of this to work, you're still doing it on high end hardware, whereas Cassandra can do the same on bare motherboards with cheap SATA drives velcroed on (which is how FB uses it), at a fraction of the cost, with higher reliability (since it was built for shitty hardware), and possibly with higher per-node performance.

      Right tool for the job, man, right tool for the job.

    12. Re:And Oracle supports EXABYTE sized databases by Ant+P. · · Score: 1

      Hey, someone considers it enough of a limit to turn capslock on to spell it out. Twice.

    13. Re:And Oracle supports EXABYTE sized databases by geekoid · · Score: 1

      "since several processes can't write to the same set of files without massive synchronization overhead,"

      I thought you knew what you were talking about, all the way up to there.

      If you think that's true, you need to learn your databases much deeper.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    14. Re:And Oracle supports EXABYTE sized databases by melted · · Score: 1

      Several Postgres processes can't operate simultaneously on the same DB - says so right in the docs (see 8.2.19). You could conceivably hack them to do so, but you would have to prevent them from overwriting each other's changes, or keep one process in standby mode of some kind in case the other one dies.

      Therefore, if one process is fucked, your shard disappears even if data files are still there. That's bad.

      What exactly is not clear?

  36. Nice by Anonymous Coward · · Score: 0, Funny

    Nice, but can it run Flash?

  37. Re:About time by Anonymous Coward · · Score: 0, Funny

    Good morning, Michael! How are you? I am fine.

    Say, I haven't seen you in awhile; whatzup with dat?

    Are you related to http://slashdot.org/~MichaelKristopeit301 thru http://slashdot.org/~MichaelKristopeit360?

    I thought so. Yeah, I know how it is, bro--sixty /. accounts is the absolute minimum one should have, eh, Michael?

    Well, it's good to see your friendly posts again ol' buddy!

  38. Wonderful by dynamo · · Score: 1

    This is great for those of us in the database community who are purists about only using one row of data.

  39. Re:"would be stored in the rows..." by Joce640k · · Score: 1

    You've not done much outsourcing, have you?

    --
    No sig today...
  40. Same tired old crap by Anonymous Coward · · Score: 0

    Post a story that has no value whatsoever and cue the predictable quibbling about how screwed up in the head one must be to build a table with so many columns.

    -1 redundant + AC -100 modifier

  41. SELECT * FROM stupid by Xaroth · · Score: 1

    I can hardly wait to see the first story on The Daily WTF about someone doing a SELECT *... on a 2-billion column row, whether intentionally or not. Bonus points for excluding a LIMIT clause, too.

  42. Congradulations! by VortexCortex · · Score: 1

    ... You're our -2,147,483,648th Column in the User Row!

    <blink>This is not a Joke!</blink>

    Click here to claim your free ERROR [MEMTABLE-FLUSHER-POOL:7] 2010-01-17 08:16:53,628 DebuggableThreadPoolExecutor.java (line 110) Error in ThreadPoolExecutor!

  43. Why? by Lord+Kano · · Score: 1

    No seriously, why?

    What could possibly necessitate two billion columns per row?

    Is this just a "because we can" kind of thing, or is there a practical reason for it?

    LK

    --
    "Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
  44. Re:"would be stored in the rows..." by MichaelSmith · · Score: 1

    True story. They were over there in India using some meta data derived from our application dataset to generate a UML model which was generating java source which was compiling to class files one gigabyte in size. We fired the application up for the customer and it never actually finished starting...

  45. TFA forgets to mention... by RichiH · · Score: 1

    ...the most important feature: 2 billion columns per row!

  46. Re:About time by Teun · · Score: 1

    Well spotted :)

    What an absolute moron!

    --
    "The likes of Facebook and WhatsApp are free to those whose privacy is of zero value."
  47. But no transactions... this is old my-sql. by leuk_he · · Score: 1

    Looking at Cassandra from a traditional DBMS viewpoint, I notice one thing: it does not have transactions like a true transactional database.

    This reminds me of the early MySQL database engines, which also had no transactional support (but had "huge" speed, until you went multi-user and the lack of row-level locking bit you hard).

    There are a lot of applications where the lack of traditional ACID is acceptable, but one has to keep this in mind when designing the application.

    (SQL/ACID is an incomplete model anyway, since in itself it does not have a facility to show updates to data to users.)

  48. So this is a spreadsheet? by Anonymous Coward · · Score: 0

    Columns per row?

    What happened to "fields per record"?

  49. Re:"would be stored in the rows..." by Joce640k · · Score: 1

    All that story needs is for the Java classes to be stored as XML inside a database.

    --
    No sig today...
  50. Hugely advantageous by Anonymous Coward · · Score: 0

    Many people have been asking why it needs to support so many columns. This is hugely advantageous for users of Cassandra, because it only supports databases with a single row.

  51. OMG by CouchP · · Score: 1

    - WHY?

  52. Max column name? by HTH+NE1 · · Score: 1

    So, what is the label for the 2 billionth column as a base26 number with an initial leading base27 "digit" (i.e. alphabetic name of A-Z[A-Z]* where A is zero except for the first digit where A is 1, AA is 10, AAA is 100)?
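
    One way to compute such a label, assuming the usual spreadsheet convention (A is 1, Z is 26, AA is 27), which may not be exactly the digit scheme described above. The function names column_label and column_number are made up for this sketch:

```python
def column_label(n):
    """Spreadsheet-style label for the n-th column (1 -> A, 26 -> Z, 27 -> AA)."""
    if n < 1:
        raise ValueError("column numbers start at 1")
    letters = []
    while n > 0:
        # Subtract 1 first because this numbering has no zero digit.
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("A") + rem))
    return "".join(reversed(letters))


def column_number(label):
    """Inverse of column_label: turn a label back into its column number."""
    n = 0
    for ch in label:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n
```

    Calling column_label(2_000_000_000) gives the label for the 2 billionth column under that convention, and column_number turns it back into the number.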

    --
    Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
  53. Can you make an ER Diagram of this? by systematical · · Score: 1

    Developer: "Sure, how many rows did you say that was again?" Asshat: "Umm... 2 billion." Developer: ....

  54. Whoosh by Locke2005 · · Score: 1

    Apparently most geeks, unlike most Greeks, are unfamiliar with who Cassandra was...

    --
    I've abandoned my search for truth; now I'm just looking for some useful delusions.
  55. The non-existent upper limit was eliminated by johanatan · · Score: 1

    "Previous versions had no set upper limit, though the maximum amount of material that could be held in a single row was approximately 2GB. This upper limit has been eliminated."

    Poor writing. I think you meant to say that they exchanged one type of upper limit for another.

  56. And they asked 'Why' .... by PPH · · Score: 1

    ... when we built guitar amps that went to eleven.

    --
    Have gnu, will travel.
  57. This is a very bad thing by geekoid · · Score: 1

    I can't wait to have to fix some 3rd party database app that has millions of columns that was created by some automated tool.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  58. Re:About time by Teun · · Score: 1

    Your script needs tuning.

    --
    "The likes of Facebook and WhatsApp are free to those whose privacy is of zero value."
  59. 90 degrees? by Anonymous Coward · · Score: 0

    Why not flip an RDBMS 90 degrees to get a column-oriented database, using the rows in your RDBMS as you would use your columns in Cassandra?
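
    A minimal sketch of that flip, assuming SQLite through Python's standard sqlite3 module. The table name cells, the helper wide_row, and the sample keys are hypothetical. Each Cassandra-style column becomes one row of (row_key, column_name, value):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells ("
    " row_key TEXT NOT NULL,"
    " column_name TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (row_key, column_name))"
)
# Hypothetical sample data: who voted on which story.
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [
        ("story:1", "voter:alice", "up"),
        ("story:1", "voter:bob", "up"),
        ("story:2", "voter:alice", "down"),
    ],
)


def wide_row(conn, row_key):
    """Return the 'columns' of one logical row as (name, value) pairs, sorted by name."""
    cur = conn.execute(
        "SELECT column_name, value FROM cells WHERE row_key = ? ORDER BY column_name",
        (row_key,),
    )
    return cur.fetchall()
```

    This covers the data model only. It says nothing about spreading those rows over many machines, which is the part the NoSQL crowd cares about.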

  60. Postgre/MySQL ... by layer3switch · · Score: 1

    Postgre/MySQL can pack more than 2 BILLION ROWS in a single column!

    --
    "Don't let fools fool you. They are the clever ones."
  61. Great by Anonymous Coward · · Score: 0

    And some d*ck out there is going to need 2 billion and 1 columns, it will never end.