Cassandra NoSQL Database 1.2 Released
Billly Gates writes "The Apache Foundation released version 1.2 of Cassandra today, which is becoming quite popular for those wanting more performance than a traditional RDBMS. You can grab a copy from this list of mirrors. This release includes virtual nodes for backup and recovery. Another added feature is 'atomic batches,' where patches can be reapplied if one of them fails. They've also added support for integrating into Hadoop. Although Cassandra does not directly support MapReduce, it can more easily integrate with other NoSQL databases that use it with this release."
I think I'd been led the wrong direction on use cases for nosql solutions.
It sounds like you probably have. There's a lot of misinformation out there parroted by folks who don't really understand NoSQL paradigms. They'll say it lacks ACID, has no schema, relations, or joins, and they'd be right, but sometimes those features aren't actually necessary for a particular application. That's why I keep coming back to statistics: Statistical analysis is perfect for minimizing the effect of outliers such as corrupt data.
The idea of "agility" sounded good, which to my mind meant worrying less about the schema.
Ah, but that's only half of it. You don't have to worry about the schema in a rigid form. You do still need to arrange data in a way that makes sense, and you'll need to have a general idea of what you'll want to query by later, just to set up your keys. If you're working with, for instance, Web crawling records, a URL might make a good key.
If I need to add a field to something, I add a field.
Most NoSQL products are column-centric. Adding a column is a trivial matter, and that's exactly how they're meant to be used. Consider the notion of using columns whose names are timestamps. In an RDBMS, that's madness. In HBase, that's almost* ideal. A query by date range can simply ask for rows that have columns matching that particular range. For that web crawler, it'd make perfect sense to have one column for each datum you want to record about a page at a particular time. Perhaps just headers, content, and HTTP code each time, but that's three new columns every time a page is crawled - and assuming a sufficiently-slow crawler, each row could have entirely different sets of columns!
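To make the timestamp-as-column-name idea concrete, here's a rough sketch in plain Python. It is not the HBase API, just an in-memory dict standing in for a wide-column table, and the names (table, record_crawl, columns_in_range) and the "<timestamp>:<field>" naming scheme are made up for illustration. It assumes ISO-8601 timestamps so that sorting column names as strings also sorts them by time.

```python
from bisect import bisect_left

# Hypothetical stand-in for a wide-column table:
# row key (the URL) -> {column name -> value}.
# Column names look like "<ISO timestamp>:<field>".
table = {}

def record_crawl(url, ts, headers, content, code):
    """Add three new columns to the row for this URL, one per datum."""
    row = table.setdefault(url, {})
    row[f"{ts}:headers"] = headers
    row[f"{ts}:content"] = content
    row[f"{ts}:code"] = code

def columns_in_range(url, start_ts, end_ts):
    """Return columns whose names fall in [start_ts, end_ts).

    Sorted column names let a date-range query become a simple slice.
    """
    row = table.get(url, {})
    names = sorted(row)
    lo = bisect_left(names, start_ts)
    hi = bisect_left(names, end_ts)
    return {name: row[name] for name in names[lo:hi]}

record_crawl("http://example.com/", "2012-12-01T10:00:00Z",
             "Server: nginx", "<html>v1</html>", 200)
record_crawl("http://example.com/", "2012-12-05T09:30:00Z",
             "Server: nginx", "<html>v2</html>", 200)

# Only the three columns written on December 1st come back.
dec_first = columns_in_range("http://example.com/", "2012-12-01", "2012-12-02")
```

A real column store keeps the column names sorted on disk for you, which is why this kind of range slice is cheap there; the sorted() call above only fakes that.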
But the part about no relations always seemed like a show stopper for any case I'm likely to encounter.
It's not that there aren't relations, but that they aren't enforced. A web site might have had a crawl attempted, but a 404 was returned. It could still be logged by just having a missing content column for that particular timestamp, and only the 404 column filled. On later queries about content, a filter would ignore everything but 200 responses. For statistics about dead links, the HTTP code might be all that's queried. On-the-fly analysis can be done without reconfiguring the data store.
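Here's roughly what that looks like, again as a plain-Python sketch rather than any real database client. The crawls dict, the "<timestamp>:<field>" column names, and the helper functions are all assumptions for the sake of the example. The point is that the failed crawl simply has no content column, and each query picks the columns it cares about.

```python
# Hypothetical rows: URL -> {"<timestamp>:<field>" -> value}.
crawls = {
    "http://example.com/": {
        "2012-12-01:code": 200,
        "2012-12-01:content": "<html>hi</html>",
        "2012-12-02:code": 404,  # failed crawl: no content column at all
    },
    "http://example.com/gone": {
        "2012-12-01:code": 404,
    },
}

def attempts(row):
    """Group a row's columns by timestamp: {ts: {field: value}}."""
    grouped = {}
    for name, value in row.items():
        ts, field = name.rsplit(":", 1)
        grouped.setdefault(ts, {})[field] = value
    return grouped

def good_content(row):
    """Content queries ignore everything but 200 responses."""
    return [
        attempt["content"]
        for ts, attempt in sorted(attempts(row).items())
        if attempt.get("code") == 200 and "content" in attempt
    ]

def dead_link_count(rows):
    """Dead-link statistics only need to look at the HTTP code columns."""
    return sum(
        1
        for row in rows.values()
        for name, value in row.items()
        if name.endswith(":code") and value == 404
    )
```

Nothing about the stored data had to change to support the second query; it just reads a different subset of columns.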
It'd be nice to store user status updates in a way where I don't have to worry too much about types of update, but I can't do that if correlating 'mentions', the user that posted it, and visibility against user groups would be a problem.
Here's one solution, taking advantage of the multi-value aspect of each row (because that's really the important part):
Store a timestamped column for each event (status update, mention, visibility change). As you guessed, don't worry much about what each event is, but just store the details (much like Facebook's timeline thing). When someone tries to view a status, run a query to pull all events for the user, and run through them to determine the effective visibility privileges, the most recent status, and the number of "this person was mentioned" events. There's your answer.
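A minimal sketch of that replay, in plain Python. The event shape (timestamp, kind, details) and the grant/revoke visibility operations are my own assumptions here, not anything a particular product gives you.

```python
def effective_view(events):
    """Replay a user's events oldest-first and derive the current state."""
    visible_to = set()
    status = None
    mentions = 0
    for ts, kind, details in sorted(events, key=lambda e: e[0]):
        if kind == "status":
            status = details["text"]          # later statuses overwrite earlier ones
        elif kind == "visibility":
            if details["op"] == "grant":
                visible_to.add(details["group"])
            else:                              # assumed to be "revoke"
                visible_to.discard(details["group"])
        elif kind == "mention":
            mentions += 1
    return {"status": status, "visible_to": visible_to, "mentions": mentions}

# Hypothetical event log for one user.
events = [
    (1, "visibility", {"op": "grant", "group": "friends"}),
    (2, "status", {"text": "hello"}),
    (3, "mention", {"by": "bob"}),
    (4, "visibility", {"op": "grant", "group": "coworkers"}),
    (5, "status", {"text": "at lunch"}),
    (6, "visibility", {"op": "revoke", "group": "coworkers"}),
    (7, "mention", {"by": "carol"}),
]
view = effective_view(events)
```

Every lookup walks the user's whole history, so the cost grows with the number of events.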
As you may guess, that'd be pretty slow, but we do have the flexibility to do any kind of analysis without reconfiguring our whole database. We could think ahead a bit, though, and add to our schema for a big speed boost: Whenever a visibility change happens, the new settings are stored serialized in the event. Sure, it violates normalization, but we don't really care about that. Now, our query need not replay all of the user's events... just enough to get the last status and visibility, and any "mentioned" events. That should be close to constant time in practice, regardless of how long our users have been around.
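Sketching the faster version under the same made-up event shape: each visibility event now carries the full settings as a JSON string (an assumption on my part; any serialization would do), so the query can scan newest-first and stop as soon as it has both a status and a visibility snapshot. This sketch covers only status and visibility and leaves the "mentioned" events aside.

```python
import json

def latest_status_and_visibility(events_newest_first):
    """Scan newest-first; stop once both the status and visibility are known.

    Returns (status, visibility, number_of_events_scanned).
    """
    status = None
    visibility = None
    scanned = 0
    for ts, kind, payload in events_newest_first:
        scanned += 1
        if kind == "status" and status is None:
            status = payload
        elif kind == "visibility" and visibility is None:
            # The full settings are stored denormalized in the event itself.
            visibility = json.loads(payload)
        if status is not None and visibility is not None:
            break
    return status, visibility, scanned

# Hypothetical log, already ordered newest-first.
log = [
    (9, "mention", "bob"),
    (8, "status", "at lunch"),
    (7, "visibility", json.dumps({"groups": ["friends"]})),
    (6, "status", "hello"),
    (5, "visibility", json.dumps({"groups": ["friends", "coworkers"]})),
    (4, "status", "first post"),
]
status, visibility, scanned = latest_status_and_visibility(log)
```

With this log, the scan stops after three events no matter how much older history sits behind them.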
Counting all those "mentioned" events might
You do not have a moral or legal right to do absolutely anything you want.