The NoSQL Ecosystem
abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'
Microsoft Access is here!
So... every time I open my inbox in Facebook, it has to search through 50TB of data? That sounds like a design problem. What has always floored me is why people think everything needs to be stuffed into a database. Terabyte sized binary blobs? You know, there's a certain point where people need to stop and actually think about the implementation.
#fuckbeta #iamslashdot #dicemustdie
With regard to scalability, it strikes me that the problem isn't so much SQL but the fact that current SQL-based RDBMS implementations are optimized for smaller data sets.
The performance claims will probably be disputed by Oracle whizzes. However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables can also be created just by inserting a row into the (new) target table. No need for explicit schema management. Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified. We have dynamic app languages, so why not dynamic RDBMS also? Let's fiddle with and stretch RDBMS before outright tossing them. Maybe also overhaul or enhance SQL. It's a bit long in the tooth.
More at:
http://geocities.com/tablizer/dynrelat.htm
(And you thought geocities was de
Table-ized A.I.
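For what it's worth, the "dynamic relational" idea above (rows as maps, missing columns read as Null, tables created by the first insert, constraints bolted on later) is easy to mock up. This is only a rough in-memory sketch, not anything from the linked page; the names DynamicDatabase, DynamicTable and add_constraint are made up for illustration.

```python
class DynamicTable:
    """Hypothetical table: each row is a dict, absent columns read as None."""

    def __init__(self):
        self.rows = []
        self.constraints = {}  # column name -> predicate, added incrementally

    def add_constraint(self, column, predicate):
        # Check the rows already stored, so a constraint can be added late
        # once the schema has "solidified".
        for row in self.rows:
            if not predicate(row.get(column)):
                raise ValueError(f"existing row {row!r} violates {column!r}")
        self.constraints[column] = predicate

    def insert(self, **row):
        for column, predicate in self.constraints.items():
            if not predicate(row.get(column)):
                raise ValueError(f"constraint on {column!r} rejected {row!r}")
        self.rows.append(dict(row))

    def select(self, *columns, where=lambda row: True):
        # row.get() returns None for a column this row never had.
        return [tuple(r.get(c) for c in columns) for r in self.rows if where(r)]


class DynamicDatabase:
    """Hypothetical database: a table exists as soon as it is referenced."""

    def __init__(self):
        self.tables = {}

    def table(self, name):
        return self.tables.setdefault(name, DynamicTable())

    def insert(self, table, **row):
        self.table(table).insert(**row)


db = DynamicDatabase()
db.insert("people", name="Ann", age=30)   # creates the table implicitly
db.insert("people", name="Bob")           # no age column on this row
print(db.table("people").select("name", "age"))
```

Adding `db.table("people").add_constraint("name", lambda v: v is not None)` afterwards makes name a required column from that point on, which is roughly the "tighten it up later" workflow described.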
I think I've heard of non-relational databases before. There's a particularly famous one, in fact. What could it be? Let's see: first started shipping in 1969, now in its eleventh major version, JDBC and ODBC access, full XML support in and out, available with an optional paired transaction manager, extremely high performance, and holds a very large chunk of the world's financial information (among other things). It also ranks up there with Microsoft Windows as among the world's all-time highest grossing software products.
....You bet non-relational is still highly relevant and useful in many different roles. Different tools for different jobs and all.
I'm a huge PostgreSQL fan and took classes in formal database theory in college. I'm saying this as someone who understands and thoroughly appreciates relational databases: I'm starting to love schema-less systems. I've only been playing with CouchDB for a few weeks but can certainly see what such stores bring to the table. Specifically, a lot of the data I've stored over the years doesn't neatly map to a predefined tuple, and while one-to-one tables can go a long way toward addressing that, they're certainly not the most elegant or efficient or convenient representation of arbitrary data.
I'm certainly not going to stop using an RDBMS for most purposes, but neither am I going to waste a lot of time trying to shoehorn an everchanging blob into one. Each tool has its place and I'm excited to see what niche this ecosystem evolves to fill.
Dewey, what part of this looks like authorities should be involved?
In the olden days you didn't have centralized message stores. That's largely a relic of PC-based networking schemes like Novell, Lotus Notes and Exchange. The Unix model used individual mailboxes (in fact, the whole breakdown was for all of a user's data being in their own hierarchy). Obviously the Unix mailbox scheme wasn't that great as we started saving many megabytes of data, so you create indexed systems, but each user's mail is still effectively independent. I've used Pine to navigate my old mbox archives and it can move through even unindexed email at speeds that put bloated monsters like Exchange to shame.
Clearly the issue with scalability in general is simply one of optimization. If you're returning relatively small pieces of information, then an RDBMS is the way to go. If all your databases are basically blobs, well then it's probably not going to be that effective. I still feel that blobs are heavily abused.
I think part of the problem with RDBMSs is simply that a lot of people don't use them properly, and create the bottlenecks through bad design.
The world's burning. Moped Jesus spotted on I50. Details at 11.
We didn't start with relational databases. RDBMSes were responses to the seductive but unmanageable navigational databases that preceded them. There were good reasons for moving to relational databases, and those reasons are still valid today.
Computer Science doesn't change because we're writing in Javascript now instead of PL/1.
That is indeed suspicious. But if they want to sell clouds, then make an RDBMS that *does* scale across cloud nodes instead of bashing SQL. (SQL as a language doesn't define implementation; that's one of its selling points.) It may be that since there's not one out yet, they instead hype the existing non-RDBMS that can span clouds.
(I agree that SQL could use some improvements, such as named sub-queries instead of massive deep nesting to make one big run-on statement. Some dialects already have this to some extent.)
Table-ized A.I.
I was an admin on a system that spread the data across 10 database servers. Each server had a complete set of some data, like accounts, but the system was designed so that ranges of accounts stored their transaction type data on a specific server, and each server held about the same number of accounts and transactions. As data came in, it was temporarily housed on the incoming server until a background process picked it up and moved it to the 'correct' one. This is a very simplistic view, but the reality was that it worked quite well. Occasionally, there was a re-balancing that had to be done. But it was very scalable. The incoming data wasn't so time sensitive that if it took a few hours to get moved, everything was still OK. When an 'online' session needed data, it knew which server to connect to to get it. Processing was done overnight on each server, then summarized and combined as needed.
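A toy version of that layout, for anyone who hasn't seen range sharding done by hand. This is a sketch under assumptions, not the actual system: the class name, the three account ranges and the in-memory lists standing in for servers are all invented for illustration.

```python
import bisect


class RangeShardedStore:
    """Hypothetical model: account ranges map to servers, new data is staged."""

    def __init__(self, upper_bounds):
        # upper_bounds[i] is the highest account id kept on server i (ascending).
        self.upper_bounds = list(upper_bounds)
        self.servers = [[] for _ in self.upper_bounds]
        self.incoming = []  # stands in for the "incoming" server

    def server_for(self, account_id):
        index = bisect.bisect_left(self.upper_bounds, account_id)
        if index == len(self.upper_bounds):
            raise KeyError(f"account {account_id} is outside every range")
        return index

    def receive(self, account_id, transaction):
        self.server_for(account_id)  # reject unroutable data up front
        self.incoming.append((account_id, transaction))

    def run_mover(self):
        # The background process: drain staging onto the 'correct' servers.
        pending, self.incoming = self.incoming, []
        for account_id, transaction in pending:
            self.servers[self.server_for(account_id)].append((account_id, transaction))
        return len(pending)

    def transactions_for(self, account_id):
        # An online session goes straight to the one server that owns the range.
        shard = self.servers[self.server_for(account_id)]
        return [t for a, t in shard if a == account_id]


store = RangeShardedStore([999, 1999, 2999])
store.receive(42, "deposit")
store.receive(1500, "withdrawal")
store.run_mover()
print(store.server_for(1500), store.transactions_for(42))
```

Re-balancing in this model would mean changing upper_bounds and moving the affected rows, which is the occasional manual step mentioned above.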
So yes... people have been coming up with innovative ways to solve these problems for a very long time.
And they will continue to do so.
I rarely read replies, it's my opinion and if you thought about your opinion a little more, I'm OK with that.
Wow a "object oriented" database discussion again. I've never read one of these :P I've only been doing this 15 years and I've lost count of these talks a long time ago.
What is the difference between schema less and schema rigid anyway? I don't see what that has to do with performance. The real issue is uptime and transaction support. People want to add a column or index without taking the system down. That is different than dealing with PBs of data. Most table structures can easily deal with that much data.
If you have a DB that is big you have lots of outs. Pay...get Enterprise version of whatever. Break it into many DB/tables and merge together. Archive. Archive I bet will get most people by. Does eBay really need all that bidding info for items over a few weeks old...only for analysis maybe. Move that old stale data out of the active heavily hit data tiers.
The fact remains that MySQL should be able to scale to TBs of data. The fact that it can't is a failure of the product. All the others have been for a while. Why can't it...I don't know...the fact that it uses a F'in different file for each index on a table. If you don't understand how old school that is start using Paradox. Just because it is open source doesn't mean it has to be so damn out of date. Please for the love of god save multiple tables/indexes in the same pre sized file...god.
Google has all the power to go and use something different. Google gets to cheat. Google is a collection of pretty static data. They scan the internet a lot, but imagine if every time you did a search Google had to scan every web page on the planet, index them, and then give you search results. That would be impractical for sure. So for now they just store big collections of blobs and a big fast index for searching keywords and links to pages. Impressive nonetheless, but it's not like your typical app. GMail is...funny that it is one system they've had problems with. Even then EMAIL DOESN'T CHANGE. It's user specific, but it's still f'in static. GoogleTastic if you ask me.
The fact is people are using RDBMS right now to solve real world problems. Some start up is finding a way to tweak MySQL to do something cool and then posting it on a blog...then all of a sudden RDBMS is dead. RDBMS is fine, it will be fine for at least 10 years if not longer. In that time it will evolve as well so that it will be around for even longer. MySQL in 5 years will have online index addition, performance hitless online column addition, partitioning, geo indexing, XML columns, BigASS table support, Oracle RAC like support, and a thousand other features that some RDBMSs have today and some will not see for even longer. Then developers that spent all that cash developing custom shit will revert and post comments like this one.
That's the way it goes in software development. The middle tier gets bigger, gets inept, custom shit comes out, it gets integrated into the middle tier shit....continue;
Instead of pronouncing death start talking about how dated a 2 dimensional result set is. JOINs should return N dimension result sets similar to XML with butt loads of meta data. ODBC/JDBC are dated...so update them.
select u.login, ul.when from users u join user_logins ul as logins.login ON ul.user_id = u.user_id where u.name = 'me' should equal something like a nested XML packet instead of duplicated crap when there is more than one user_logins row.
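Until drivers do that for you, the nesting has to be done client-side. Here is a rough sketch of what that looks like with SQLite from Python: the flat JOIN repeats the login on every row, and a small loop folds it into one nested record per user. The schema, the sample data and the function names are assumptions made up for the demo (and "when" has to be quoted because it is an SQL keyword).

```python
import sqlite3


def make_demo_db():
    # Hypothetical schema loosely matching the query in the comment above.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, login TEXT);
        CREATE TABLE user_logins (user_id INTEGER, "when" TEXT);
        INSERT INTO users VALUES (1, 'me', 'me_login');
        INSERT INTO user_logins VALUES (1, '2010-01-01'), (1, '2010-01-02');
        """
    )
    return conn


def nested_logins(conn, name):
    rows = conn.execute(
        """
        SELECT u.user_id, u.login, ul."when"
        FROM users u JOIN user_logins ul ON ul.user_id = u.user_id
        WHERE u.name = ?
        ORDER BY u.user_id, ul."when"
        """,
        (name,),
    )
    nested = {}
    for user_id, login, when in rows:
        # The flat result repeats login once per user_logins row; keep it once.
        entry = nested.setdefault(user_id, {"login": login, "logins": []})
        entry["logins"].append(when)
    return list(nested.values())


print(nested_logins(make_demo_db(), "me"))
```

The point of the complaint stands: the grouping loop is boilerplate that every application rewrites, because the wire format only knows flat rows.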
MS-Access had some really great features: it could be accessed with both SQL and with a blazingly fast (because almost running on the bare OS) ISAM-style library. I am still missing anything like it on Linux. SQLite is a file-system database, but why on earth should it parse full-blown SQL at runtime and why on earth should my program write another program in SQL at runtime just to load some data? Get serious. Parsing and building SQL is just overhead, and especially parsing SQL is no easy and light task.
Since I switched to OO programming, most (95%) of my queries are "This table/index. Number 5 please." In essence that is the get/put method, or the ISAM style method. I really would like something like that to exist on Linux. The closest thing around is MySQL's HANDLER statement, but that can only be used for constant data (because it does dirty reads) and for reading only.
SQLite could even be faster if it just accepted some basic "get row by index" and "put row by index" commands that do not try to parse, optimize or outsmart anything. The problem with "modern" databases is that they are either "SQL" or "NoSQL". That's awful. Some programs speak SQL (because of compatibility, because it is a reporting program or just because the programmer does not know anything else) and some programs are better off with direct row management. That does not mean that the data should not be accessible by both programs. I really wish that the regular SQL databases would develop ISAM-style access methods. Programming would be a hell of a lot easier then, and the programs themselves would speed up significantly as well.
This is no idle remark. I worked a lot with MS-Access and most rants about it being slow come from the fact that most programmers treat the file-system database as a server. So it must emulate itself as a server and do a lot of household parsing and does not even have a physical server to relieve its load.
But if you know how to program a file-system database with ISAM-style methods, MS-Access is by far the fastest database I ever encountered. No Joke. Really. It can be fast because there is no need to do all these household jobs to just dig up a row.
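To make the "get row by index / put row by index" idea concrete, here is about the smallest thing that qualifies: fixed-width records addressed by row number, with no query language anywhere. It is a sketch only and has nothing to do with how MS-Access or SQLite work internally; the RecordFile name and the record layout (a 32-bit id plus a 32-byte name) are assumptions for the example.

```python
import io
import struct


class RecordFile:
    """Hypothetical fixed-width record store: row n lives at byte n * size."""

    RECORD = struct.Struct("<i32s")  # assumed layout: int id + 32-byte name

    def __init__(self, fileobj):
        self.f = fileobj  # any seekable binary file object

    def put(self, row_number, ident, name):
        self.f.seek(row_number * self.RECORD.size)
        # struct pads short names with NUL bytes and truncates long ones.
        self.f.write(self.RECORD.pack(ident, name.encode("utf-8")))

    def get(self, row_number):
        self.f.seek(row_number * self.RECORD.size)
        data = self.f.read(self.RECORD.size)
        if len(data) < self.RECORD.size:
            raise IndexError(row_number)
        ident, raw = self.RECORD.unpack(data)
        return ident, raw.rstrip(b"\0").decode("utf-8")


table = RecordFile(io.BytesIO())   # swap in open("table.dat", "r+b") on disk
table.put(5, 7, "five")            # "This table. Number 5 please."
print(table.get(5))
```

A seek and a read is all a lookup costs here, which is the whole argument; what you give up is everything a real engine adds on top (variable-length rows, secondary indexes, locking, transactions).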
Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
I work on a very large db2 system. Enterprise systems cost money because they work. There still seems to be this ignorant self absorbed counter culture which believes big iron and similar (anything about look what I can build in my basement) isn't cool so it cannot work.
Between radix, sparse, derived, encoded vector indexes I can pretty much serve up anything my partners want, whether they are native or foreign db2, jdbc or odbc connected. With the tools I have at my disposal I can analyze statements presented by developers to ensure I have the access paths needed for their work and guide them to better data retrieval. I can tell if their choices result in full table scans, index probes, hash tables, rrn tables, etc. If I need support it's a phone call away.
I do not care who my client is, data is my job. As such I need tools which are so reliable that only concerns I have are, just what is my customer doing and how can I make their request better. When they query 5tb tables and don't even notice a delay I think I am doing just fine.
* Winners compare their achievements to their goals, losers compare theirs to that of others.
SQL databases if designed properly DO handle enormous datasets. The problem starts when you have wits designing the database and then managers attempting to use the DB for purposes it wasn't meant for.
If you mod me down, I will become more powerful than you can imagine....
Right. Don't forget PostgreSQL too. Really, the problem here is MySQL. Hell, look at the "tips and tricks" comments for this story: they all deal with ways to work around deficiencies in MySQL (and old versions of MySQL at that.)
The guy who recommends using the first two characters of the MD5 hash to select a table is particularly hilarious. Doesn't he realize that's what a database index already does, and that databases (even MySQL) will do that for him?
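For reference, the trick being laughed at boils down to something like the following. This is a guess at its shape, not the original tip; table_for and the "items" base name are hypothetical.

```python
import hashlib


def table_for(key, base="items"):
    # First two hex characters of the MD5 digest pick one of 256 tables.
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:2]
    return f"{base}_{prefix}"


print(table_for("a"))  # md5("a") starts with 0c, so this prints items_0c
```

That is hand-rolled hash partitioning: the application computes the bucket that an index lookup (or the database's own partitioning feature, where it has one) would otherwise locate for you, and every query now has to know the scheme.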
No offense, but you probably have no idea what you are talking about. MS-SQL is a relatively solid product. SQL Server 2000 and SQL Server 2005 are pretty stable and can easily handle rather large data sets (in the TB). Of all the Microsoft Products, personally Visual Studio and SQL Server are my favorites. I like PostgreSQL as well, so I'm not strictly a Microsoft Fan. But an awful lot of companies are realizing that MS SQL can manage their data much cheaper than Oracle can. Of course PostgreSQL can do it even cheaper...but many companies like to pay $$ to sleep better at night.