Yale Researchers Prove That ACID Is Scalable
An anonymous reader writes "The has been a lot of buzz in the industry lately about NoSQL databases helping Twitter, Amazon, and Digg scale their transactional workloads. But there has been some recent pushback from database luminaries such as Michael Stonebraker. Now, a couple of researchers at Yale University claim that NoSQL is no longer necessary now that they have scaled traditional ACID compliant database systems."
NoSQL never was necessary. Traditional SQL database - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.
Digg's engineers wear clown shoes to work.
StoneCypher is Full of BS
digg has chased all their users away with the new version of their site so they could probably change over to MS Access and be ok.
Didn't Berkeley prove back in the 60s and 70s that acid was scalable?
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
In essence, TFA claims that if the traditional ACID guarantee "if three transactions (let's call them A, B and C) are active ... the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B" is not abandoned (as in NoSQL systems), but is even strengthened to a guarantee that the result will always be as if they arrived in A-B-C order, then it solves all kinds of possible replication problems, requires less networking between the many servers involved, and allows for high scaling while also keeping all the integrity constraints.
A bigger issue may be the cost of ACID even if it can in theory scale. Supporting ACID is not free. A free web service may be able to afford losing say 1 out of 10,000 web transactions. Banks cannot do it, but Google Experiments can. The extra expense of big-iron ACID may not make up for the relatively minor cost of losing an occasional transaction or customer. It's a business decision.
Table-ized A.I.
For instance, Neo4J is a scalable graph-based "nosql" DB with ACID.
Old people fall. Young people spring. Rich people summer and winter.
Because it works.
"It's old" is a terrible reason to replace something. Go back to your previous arguments an you have a case. After all, a Core i7 is based on a 1960's view of a problem with an enormous number of band-aids applied in the intervening years, but you don't seem too concerned with replacing that.
Spoken with proud ignorance.
Anyone who has properly scaled an application knows the database isn't the problem. If it was, it wouldn't take 12 applications servers to bring the thing to its knees. That said, most of your gripes equate to:
I am not a DBA and therefore I do not understand DBA and therefore I must complain.
Further SQL has nothing to do with ACID. AT ALL!
Get your PostgreSQL here: http://www.commandprompt.com/
NoSQL's two big features are scalability and the arbitrary schemas. While the paper covers the first (though I still think map/reduce has its place) NoSQL does do taxonomy-based (hierarchical) schema better. The only way to do that in SQL is to have a property table, where the parent object is a object RID, and a huge table of attached properties and values to that. You might be able to get your indexes to perform reasonably well, but only by duplicating the some data. And on top of that, just try writing a query for hierarchical data! You'll have sub-selects for each level of hierarchy. This means in order to to something relatively simple, like KPCOFGS of species classifications, you'll need a select and 6 sub-selects. At least that one is well defined to . If its not, you just don't know how many, and you have to write a recursive function to generate your select query, or process the results from it. Either way, you repeatedly consider 99% useless records at every level. True, you can cheat at this because there are always 7 levels. But that is not true for most other trees.
Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
Actually, if you look at set theory and declarative languages, SQL is coming to more traditionally procedural environments. (MS's LINQ, for example.) It's an amazing language, good at what it's supposed to do. You could nearly complain the same about XML transforms as SQL. They just collect & format data. It's the programmers who make it complex.
Unavoidable bottlenecks in systems come from storage, searches and transforms. If you want to remove the DB from the equation, what layer of your system should be performing these things?
BTW: The math in set theory hasn't changed since the 1960's, it doesn't "get old" and need replacing. And you should learn to spell COBOL, your rants will appear more credible.
Sounds like you don't really understand what you're talking about. The reason we continue to use ACID compliant RDBMS is because they work and they work well. If you don't think that RDBMS have changed over the years, you're simply lacking experience. I feel this is most likely the case as you comlain about the interface language (SQL), and don't understand how to CM stored procedures, or how to test a DB (OMG I have to make a copy of the DB to test - so hard!) Comlaining about the overhead of using an RDBMS in an application that doesn't require an RDBMS is tantamount to complaining about how hot you get while wearing a spacsuit when you jog in the park.
We knew ACID can scale already.
With enough money poured into it, and new implementations, ACID can scale.
They solved some problems with scaling out, not necessarily the problems with it scaling up. Scaling does not necessarily just mean replicas and quick failover -- it means good performance without millions spent on hardware too, in terms of overhead, storage requirements, storage performance, server performance.
NoSQL scales in certain cases less expensively, with less work, and doesn't require complicated DBM algorithms. The representation of data is also simpler, and requires less work to maintain than tables.
It's just a result of major existing SQL implementations being so expensive with large datasets, that sometimes it costs more in terms of performance and required hardware, than simply using NoSQL.
I also love this gem from the article:
If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated. ... given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.
I suppose the authors are from a land where hard drive space is infinite, database server resources are always guaranteed ahead of time... I/Os never have unrecoverable errors, syscalls never return error codes, RAM is infinite, programs never crash.
The conclusion that ACID alone is the bottleneck is not necessarily true. The SQL language itself requires a complex implementation just to parse and implement queries, that can add latency.
The parent is not a troll, it is spot on. The problem is that the database backend and the language frontend are tied together. To invent a new query language you need to invent a database backend to go with it, and you can't try out a new query language on an existing database deployment. Similarly, any innovations in the database backend are hampered by the limited syntax of SQL. If you can't make a small extensions to SQL to get it working, then you can forget about implementing it at all. This pretty much means game over for any database innovations.
Even Relational Algebra is infinitely easier to understand than the pseudo-English mess that is SQL. Much like even Haskell is easier to read than COBOL.
Finally! A year of moderation! Ready for 2019?
Which problem? Storing your data, retrieving your data, modifying your data while guaranteeing transactional integrity, analyzing your data in aggregate, providing ways to recover your data, providing ways to reset your data to a previous state?
I'm not saying a traditional relational database is the perfect solution to everything, but it's silly to think that every approach will address the same set of concerns.
I will absolutely agree that a well designed database does not have performance issues. However, I work in a segment of the industry that works with Health and Human services, and the databases have issues that make any reasonable DBA sick.
None the less, database throughput is always an issue. Our applications scale just fine for our needs (as you imply) but it remains that even if only one person is running one application against the database, the through put is just "meh" at best. This is because every operation requires queries against the database to move significant amounts of data from many different tables. Could we build applications with better performance? Absolutely, and using traditional Relational Databases too, if the Schema was properly designed.
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Absolutely the developer doesn't have to build applications that inherit all these problems from the database. I have designed applications that sit on databases, and have none of these faults. But unfortunately not all the applications I work on were designed to avoid these issues.
Now you ARE right that I am not a DBA. But if I have a fault, it isn't because I don't understand the DBA, but that I don't understand the database....
And yeah, in my rant I criticized SQL and ACID and relational databases in general as if they were all the same. They are not, and in fact need not have any overlap at all. Still, I'll stand by my rant as an expression of my annoyance with various aspects (these and others) of this particular approach to the persistence problem.
Any decent framework abstracts out the SQL syntax for you in a nice manner (say, ARel in the Rails 3.0 framework is quite nice) , but gain a lot of compatibility by using SQL, allowing to choose from engines from SQLite in a flat file to Oracle on a cluster.
Irony much?
You might wanna read my 2nd sentence. I know, I know. That's really far into my post.
I don't think they've proven it yet, they simply offer some solutions to what they admit is a very difficult problem. In other words, we'll see how their ideas pan out.
I went to eat some animal crackers and the box said, "Do not eat if seal is broken." I opened the box and sure enough..
From the Wikipedia Article (http://en.wikipedia.org/wiki/ACID)
"In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction."
SQL syntax is dated and very obtuse. Just look at the different syntax between insert and an update. ...wouldn't you rather just have "save"?
Object-relational mapping is cumbersome and mis-matched in SQL. 1:many either yields n+1 queries or a monster cartesian product set. And, what about inheritance? It just doesn't jive.
It isn't about losing ACID- although not every purpose needs ACID. Your average shared drive filesystem isn't ACID, for example.
When you have anemic domains that aren't nailed down and need to be readily flexible without big re-designs, JSON-based No-SQL works very well.
When you want to avoid n+1 and have well-defined data needs with 4MB of data across your object graph, No-SQL works... very very well.
When you want to segregate the business services and its backing data store from the separate concern of BI, No-SQL keeps the riff-raff out of your data store.
It's different. It solves different problems. Keep your mind open.
-Ouija- poke 53280,11:poke 53281,12
Interesting article )and yes, I read the article), but the point of the NoSQL movement isn't so much about SQL, or ACID, as much as it is about Schema.
Most applications today are written in object-oriented languges like Java, C#, Ruby, etc... and most common frameworks in these languages use object-relational models to essentially 'unpack' the object into a relational model, and then reconstitute the objects on demand. this post explains the kinds of problems better than most.
NoSchema is about storing data closer to the format we process it in today. Key-Value pairs. XML. Sets and Lists. Object-Oriented data structures. This is about abstractions that make developers more productive. It is a tool in a toolbox, and useful in some circumstance and not in others.
SQL databases do not have to be the 'one persistence data mechanism to rules them all'. We don't need one; we need many that solve differing classes of problems well.
What alternative have you seen that handles the same workload more efficiently? Flat files? I've seen plenty of database-related performance issues, but it's almost never inherent in the database - it's the idiot that wrote the lousy table-scanning code that's reading a couple rows out of a table with millions that's the problem.
If only you could start something like a "transaction", which you could then "roll back" after finishing the test, leaving the database in its original state. And if you could somehow "back up" the database and "restore" it on a test server, or under a different name. That would be awesome.
Checking your create/change scripts into source control is no more difficult than checking your C source in prior to compiling it.
While I don't totally disagree on this point, calling SQL "fixed" is a bit like saying C# and Java are the same. I promise you any meaty SQL Server code will not run on Oracle without very significant changes that will have to be done by someone that will cost you a lot of money (and likewise with Oracle to SQL Server). The capabilities vary wildly by platform, and the syntax is only identical for the simplest of CRUD statements.
I have to give this one a LOLWUT. If you're using a big RDBMS, it's likely a multi-user system. If you've got multiple users and connections, you want ACID. This isn't like imposing sorting overhead on data structures, it's like imposing the basic memory protection, process isolation, and filesystem durability you find in any competent operating system. If you want to see what it's like without those protections, go use Mac OS 9 for a week or so, or an Access database used by a few dozen people over a network.
Normalization is not just some plot by database programmer to annoy application programmers (That is merely a satisfying side effect!)
Acid is definitely scalable if you use blotter paper.
Absolutely true. I rewrote an application that had a 70 table database to use a simple tree structured representation - it ran two orders of magnitude faster and the code was easier to understand because the data representation conformed well to the actual problem domain. Relational databases are great but they aren't always the appropriate answer.
But as an aside I don't think hyperbole is the enemy of critical thinking - it is just a tool (perhaps weapon) the proper employment of which requires immensely more skill than most people possess.
The tyrant will always find a pretext for his tyranny - Aesop
SQL is still SQL. SQL is fixed in a syntax and written with naming conventions and styles that can best be described as neo-Cobal.
Has relational algebra changed (no, it's complete)? Why would the basics of SQL change then? Sounds like you just don't understand relational math and structured informaion basics.
No comprende? Let me type that a little slower for you...
The editors have a loose definition of the work prove. I read the article and they provide some compelling arguments. However, I saw no proof in a mathematics or scientific way.
Of course ACID is scalable, but you have to be very careful with the dosage. Even Albert Hofmann himself never doubted that.
To achieve 'nonconcurrency' one needs to introduce a global ordering of transactions. Which WILL require a shared resource among ALL of the transactions. No way around it, sorry.
And what's funny, this resource some of the problems of ACID systems. However, there should be advantages (no need for rollbacks, etc.).
Besides, all of this doesn't tackle another advantage of NoSQL systems: working with HUGE amounts of data. There'll still be problems in ACID systems if data access requires communication between several storage nodes.
And don't forget the CAP theorem. You can't get Consistency, Atomicity and Partition Tolerance at the same time. RDBMS typically 'solve' it by dropping the requirement for the partition tolerance. Usually by using quorum sensing schemas, etc.
Uh.... If I never said that "being old" is a reason to replace something.... As you would have known if you actually read the sentence you quoted. Given this observation, what am I to say about the fact that the Core i7 is based on a 1960's view of a problem? Besides, the Core i7 ISN'T a 1960's based solution, but is based on a 1960's solution. There is an important difference between the two statements.
Everything we do in CS is based on work that goes back to 1939 and even earlier. However, in the case of the Core i7 (as an example) we CHANGE the approach to try and fix various problems we have with our performance.
Personally, I think going back to old ideas and realizing that we can now implement them better/faster/cleaner is a great way to approach many problems. That a solution is "old" isn't a problem, but it is a problem if a solution has known issues, and we just live with them.
hmm. or you could have put an index on the right columns... which generally are implemented as tree structures. I'm sure your code was perfectly understandable to all who came after you, thinking they were working with a DB :)
I don't know if you are using SQL and "relational database" as equivalent... it seems that way. Anyhow a long time ago there were many different database solutions and most of them weren't relational databases. Then relational databases became popular and anything else almost seemed to disappear. I didn't really get this enormous shift because there are lots of domains where a relational database is not the natural representation of the information being modelled. But for most applications that most people are interested in relational databases work well and SQL represents the ideas behind relational databases quite well. So SQL is still here relatively unchanged decades later because nothing better has come along - apparently it fills its niche quite well - well enough that it hasn't been dislodged.
As for "neo Cobol" I think it was either Wirth or Dijkstra that said that typing speed was not the limiting factor in programming.
The tyrant will always find a pretext for his tyranny - Aesop
Yes, I'd like to be able to work with RDMBS data in REAL languages, not in ugly SQL or even more uglier DB internal languages.
DB tables can be represented with lists, on which composable pure (side-effect free) functions could operate. So JOINs can be expressed as list comprehensions. 'where' naturally is expressed as filters, etc. Care should be taken to maintain purity of functions used in queries, so they can be optimized efficiently.
LINQ in C# has beginnings of something similar.
PS: Am I describing Haskell, by any chance? :)
PPS: If your query requires complex complex and non-trivial optimizations by the RDBMS engine, then it's a bad query.
c'mon we use web services and only a few people complain about the inefficiencies there, we use XML and only some people complain about sprawling XML documents you can get.
You need to go learn a bit about DBs. SQL is pretty easy, once you've grasped the list-based concepts behind it. Stick to the simple bits and you're 90% done. They're not as bad as you think - its just your ignorance that's confusing you.
All technology suffers from the flaws you point out, all technology is fragile and easy to create total crap out of. (I know, I've worked with some 'professional' developers who make the most godawful mess, some of them even think they really are god's gift to coding).
DBs incidentally are one of those strange technologies where a 'clean, elegant and well designed' schema is a bad thing. If you over-normalise a DB performance will suffer, as will the code you have to write to use it. If you cobble everything into a few tables, it actually goes faster and is easier to code against. Strange, but true.
Its worth noting that, in additional to the arguments from proponents of non-relational databases, SQL also gets criticism from proponents of actually doing set theory right (e.g., Date and Darwen.)
Really, SQL and the databases using it are shaped as much by optimization of disk-based storage using popular computing architectures of the time at which it took shape as any mathematical model of data.
As computing architectures and performance attributes (not speed, but relative costs of different access patterns) of storage media change, underlying database implementations and the languages that best leverage them may change, even when you want to be generally guided by set theory.
This seems to be a reinvention of field calls, with a slightly different purpose.
"Referential integrity? We ain't need no stinkin' referential integrity."
Table-ized A.I.
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Because fairly consistently, for the past forty years, every time someone says they've created something better than SQL and released to the market, the market proves them woefully and completely wrong. As such, as much as people piss and moan about SQL, SQL has consistently proven to be an excellent, general purpose solution and amazingly poorly understood by the masses. And solutions such as MySQL has only made things worse. That's not to say there are not superior niche solutions, only that SQL is one of the few database technologies which has continued to survive for decades as a general purpose solution, and rightfully so.
Its like the world suddenly doing their own plumbing, framing, and mechanical work and then proudly exclaiming the state of architecture and the car industry stinks because the world is falling apart around them. In reality, that means we need far more qualified DBAs and far fewer people who can barely spell, "SQL", designing and condemning the world around us.
Its literally been years since I've run into a qualified DBA, despite the fact "DBA" was part of their title. Turns out, being able to spell, "DBA" is all too often enough to qualify one for such a position. And don't get me started on the all the more common case of people who don't even know what a DBA does and yet they are responsible for actually creating the schema/data model.
Maybe it is the fact that RDBS based solutions are too fragile and are too often crap. We need to develop in representations that make sense to developers, and have the right sorts of compiler technologies and tools that build the proper run time representations for performance.
That you CAN build manageable/fast/testable/efficient applications is only the first step.
The second is wringing manageable/fast/testable/efficient applications out of mere mediocre developers.
Just in case anyone's interested: http://voltdb.com/
Stonebreaker started an open source database to implement the concepts he talks about.
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
You said:
Why is it that we continue to use a technology based on a 1960's view of a problem
Your complaint: It's an old way of doing things.
My point: stick with everything else in your post, where you talk about efficiency and finding the language awkward. Your last sentence is summarized by "It's old, and we've thought of other things since then". That's not a useful argument.
When I explicitly referred to the rest of your post, that was kind of a clue that I read it.
As long as we pretend that CISC is new, for example.
Because SQL isn't a particularly faithful implementation of relational algebra?
Nope, trust me that wasn't going to work. It was much more natural to represent directly as a tree structure using XML.
The tyrant will always find a pretext for his tyranny - Aesop
because on every application I have ever worked on, the Database has always been the performance bottleneck.
That means you need to fire your DBA and hire one that actually knows how to structure tables for performance.
Testing of DB applications is always a problem, because the running of tests generally changes the database, rendering tests unrepeatable without reseting the database.
And how is that different from testing any sort of application that has a persistent state?
Configuring applications to use this database or that database also ends up being a problem for most applications.
Really? What sort of libraries are you using? Every framework and DB library I've used has had a priority towards making it very easy to connect to a database. Usually, if you're only connecting to a single database, all you need to do is write your connection string in the appropriate file, and you're set. The only time you need to change that is when you're deploying your application from development to test, and from test to production.
Why is it that we continue to use a technology based on a 1960's view of a problem when clearly there ARE other solutions and ways to approach said problem?
Why do we use quicksort when there are other approaches to sorting?
We all know what to do, but we don't know how to get re-elected once we have done it
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Those statements could be applied to any technology that's being used inappropriately. Why are our programs so sensitive to bad algorithm design?
We all know what to do, but we don't know how to get re-elected once we have done it
I have a different image of ACID on Windows than they do.
Is it the image of Bill Gates in an Easter bunny outfit trying to force Steve Ballmer into a large cast iron kettle filled with Skittles and baby mice? 'Cause that's the image I have of ACID on Windows...
- None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
TFA hints at this but doesn't come out and say it: the larger you scale, the more you swamp yourself with atomicity protocol overhead. If your database is geographically distributed, then you have to decide if atomicity is more important than forgoing the very large bills for the associated network usage. I suspect that this may explain a lot about why Google, Amazon, etc., went with NoSQL solutions.
In the course of every project, it will become necessary to shoot the scientists and begin production.
I would agree with that, which is why I said *SQL basics*. You won't change the basic reduction down to working with sets no matter what you do with the language or abstractions.
No comprende? Let me type that a little slower for you...
"Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer."
I thought it was more like an array of structs, where each array entry is a row and each struct member is a column. In non-C you might say each row is an object, each field-of-a-class is a column (where class : table) and each field-of-an-object is a single cell.
Then the cartesian product operation on tables of types T1 and T2 (respectively) has a type which is the product of T1 and T2, and everything matches up neatly.
but it stores its data in a way that doesn't require me to deconstruct all of my data structures into tables.
I take it this is not business-type data? Otherwise you're doing it backwards. Start with your Entity-Relationship diagrams, devolve into logical than physical data models, and THEN start programming.
I forget who said it but it's true: The data belongs to the business, not to the application. The data should be structured and stored in a way that it will still be readable years after your program has become obsolete. (Unless it's data that has a short "best before" date.)
--
.nosig
Finally. I've been telling Bob that for years, but nooo, he insists that we keep using blotter paper and sour patch kids.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
The reason these other database types went away is because the relational db + SQL handles ad-hoc queries very well. In many if not most db applications that is a killer application.
Why is it that we continue to use a technology based on a 1960's view of a problem when clearly there ARE other solutions and ways to approach said problem?
- seriously. I have this same problem with the entire DNA thing - it's too damn old and hard to understand.
I say we switch to a new paradigm - NoDNA.
From now on we don't need all those silly As and Gs and Ts and Cs and the entire twin helical strand idea, it's too freaking old. We must move on with times, so that we can implement NoDNA-DNA2 paradigm. It's going to be faster and easier on the eyes, it's going to have more Zaz in it. Zing, Zork, Kapowza, Mazooma in the bank!
It's just what cool kids would use.
You can't handle the truth.
Oh no you don't like the syntax. That's a great reason to turn away from a technology that has been implemented enough times and had enough research to bring it to where it is today.
If you can get over not liking the syntax, the SQL standard is pretty awesome, as are many (but not all) actual databases that use it. It's powerful enough to let you do some pretty complex queries, it's reasonably easily optimisable (and there is a lot of literature about that) provided you're not using a lousy database engine (like MySQL which can't even handle basic relational calculus planning in a sane way), it's pretty fast, and it offers some great guarantees. I have absolutely no idea what you mean by being difficult to test - either you know how to test or you don't. SQL doesn't get in the way there. You have a production data store and a test data store; you test changes together.
Stored procedures are not so widely used because they're not standard enough. However, they're not hard to use with source code management - you're making the wrong argument.
Your last gripe is fair, and if you are *really* sure you don't need ACID overhead and you have a reasonable alternative database, go for it. You're giving up on all the other research that's gone into the common platform, but that's a tradeoff that might be worth it for some purposes.
For every problem, there is at least one solution that is simple, neat, and wrong.
Perhaps, but the question facing an individual company is do they want to take on the risk of being the Guinea pig for relational's challengers. Most such experiments fail to catch on, statistically. (Some do find a nice little niche, but often outside of where it was tested.)
But generally needs become more complex and diverse over time for any growing or competitive company, and one-trick-pony self-rolled databases fail to flex. If a sufficient-performing general-purpose RDBMS is only slightly more expensive than the self-rolled one at the start, it's better future-proofing.
Table-ized A.I.
abstracts out the SQL syntax for you in a nice manner (say, ARel in the Rails 3.0 framework is quite nice)
In the end it still boils down to SQL, and while I think SQL is pretty damn good and getting better, it's never going to become not a pain in the ass for certain complex operations. Some databases like postgresql work around it with extensions (such as postgresql's SELECT DISTINCT ON (...) which made questions like "Give me a list of every customer and the date and amount of their most recent purchase" trivial to answer in a single query before WINDOW became part of the standard, and even now it's easier to understand than the WINDOW syntax, at least as long as you know it's a postgresql special)
If I have been able to see further than others, it is because I bought a pair of binoculars.
And don't get me started on stored procedures and the difficulty of using source code management with stored procedures.
That's easily solvable:
Stored procedures don't have to be any more difficult to manage than any other code.
Dewey, what part of this looks like authorities should be involved?
Short Summary:
We make some claims about scaling ACID databases, but then don't support them.
Longer summary:
We don't like NoSQL and enjoy making baseless cracks about it such as it being a "lazy" approach.
In our paper we demonstrate that our unconventional version of an ACID database scales better than a traditional ACID database in a specific environment, while merely throwing away some robustness guarantees and changing how transaction ordering works.
No direct comparison to any NoSQL implementation is made.
So yea, I'm not holding my breath for companies to start migrating away from NoSQL.
Please don't take this wrong. I really do mean my comments respectfully and politely. It's been a long day, and I'm not sure I managed to write as sincerely as I intended.
... because on every application I have ever worked on, the Database has always been the performance bottleneck
Wow. We've had very different experiences, then. Sure, there have been plenty of times when the database was the bottle neck. But it seems like I've have more issues with network speeds. And I can think of a few cases where the file system was the issue. At my current day job, the system bus seems to be the most common bottle-neck. Not that we touch databases all that often.
Testing of DB applications is always a problem, because the running of tests generally changes the database, rendering tests unrepeatable without reseting the database.
Isn't that generally considered a "best practice" anyway? I mean, I've pretty much always just taken that as a given. What do you consider a feasible alternative?
Configuring applications to use this database or that database also ends up being a problem for most applications.
OK, now I really have to ask what kind of development environment you're using. That's always seemed like a fairly moderate "no-brainer." Sure, it's mildly inconvenient to make sure connection strings got changed when migrating from dev to test to staging to production, but it's not that big a deal.
Furthermore, while programming in general has continued to progress through many languages, exploring many different ways to describe problems, SQL is still SQL. SQL is fixed in a syntax and written with naming conventions and styles that can best be described as neo-Cobal.
That's one way of looking at it, sure. Maybe you're missing the point, though? I mean, so many other languages and approaches have changed so drastically over the years...maybe SQL hasn't because it's good enough for what it does?
Bottom line: SQL is tedious, ugly, slow, and difficult to test.
Compared to what? Keep in mind its original purpose: letting business users look up algebraic sets while programmers got on with the serious data analysis. It just happened that having a standardized API that made it relatively easy to swap out back-ends turned out to be the easiest way for programmers to do our jobs.
If you really do have access to some magic technology that lets you look up persisted data (in a way that's anywhere near as flexible as SQL) significantly faster than any of the major RDBMSs...why haven't you founded a business on that and made your fortune?
And don't get me started on stored procedures and the difficulty of using source code management with stored procedures.
You definitely need to look into some better tools. File | Save As... to stash your SP's in some directory, add to source control (if it's new), check in.
Last gripe: A traditional Relational database imposes ACID overhead on every application, even if you don't really need it or use it. This is like a programming language that imposes a SORT overhead on all your data structures even if you rarely or never need to sort them.
It's been a while since I had to mess with SQL, but I seem to recall specifying hints about how much transactional consistency I actually needed. I think you may be exaggerating the overhead a smidge. And I'm pretty sure there are ways to work around it. But that's getting way off track.
Why is it that we continue to use a technology based on a 1960's view of a problem when clearly there ARE other solutions and ways to approach said problem?
Two suggestions. 1) It works. And DBA's hate learning new technology. 2) No one's come up with an alternative that's compelling enough to convince more than a tiny fraction of companies to
4 words: You're doing it wrong.
I would be surprised if you've got significant experience. You sound like I did 12 years ago when I didn't understand. I had the same complaints you do. I patted myself on the back as I wrote several different storage systems for various projects. And every one of those systems was useful for no more than the toy problems I was involved in at the time. Once I had to make things that lasted in the real world, I saw the light.
For any non-trivial, multi-user persistent data storage, a database is usually the correct view, whether it came up in 1960 or not. The notion of information and the way things interrelate has not changed significantly since then, and is unlikely to in the near future. Those old timers did their homework. I eventually realized that they understood the topic far better than I did.
Yes, there are special cases where you can get an angle of advantage by skipping a traditional database, but you lose an enormous amount of power too. If the trade off is worth it, then great. Usually it's not. Usually you will eventually have to amend the structures, import and export data, distribute it to different teams, integrate with existing systems, provide reporting tools, etc, etc, etc and you'll wish to hell you had just had it all in an database.
Good luck.
My complaint ISN'T that the solution is old, but that very little has changed in the SQL/RDBMS approach through the years, despite the vast changes in hardware, the vast experience we have gathered in the awkwardness of using SQL to persist data, and the known performance costs to the SQL/RDBMS approach to persistence.
Maybe my summary statement stinks; it certainly has turned into a useless analysis of what the statement was taken to say vs. what I as an author think it says and meant for it to say.
So let's call you right about how poor the statement was as a summary, and me not so wrong in what I intended it to mean.
You don't need SQL in order to get ACID properties. And some common SQL-like languages don't provide ACID.
Furthermore, SQL wasn't designed for what it is being used for today; SQL was meant to be a database interaction language for non-experts.
And "information theory" doesn't mean what you seem to think it means...
Academic determines that if only you're willing to insert a single point of failure, all of your replication problems can be hand waved away. Also if you have this new single point of failure, somehow magically transactions will never need to abort ever again.
The reason that NoSQL is necessary is that ACID is not the only thing that developers need to think about. RDBMS was an innovative solution to the limitations of mainframe hierarchical databases circa 1970. Since then it has been the only game in town (At least for most enterprise software. Some of us do other things occasionally.)
It turns out that there are reasons to do things other ways, and having other options allows you to consider trade-offs. For many applications eventually consistent data scales just fine. For some applications, both big and small, an enterprise RDBMS is overkill. Why not just persist objects to a document store? Or even the file system?
The research is interesting, although I agree that we already knew we could scale the ACID paradigm. The conclusion is ridiculous. NoSQL has nothing to do with ACID, and it brings a richness to the conversation that has been missing for far too long. Like the Perl folks say, TMTOWTDI.
Thanks for this, this is EXACTLY what I think everytime a slashdot article concerning databases come up - here come the blowhards who understand every possible application and its needs. How long before someone trashes MySQL? It's tiresome.
To be fair, I think this thread has a lot of signal compared to many - and then someone'll ruin it by asserting some absolute.
As such, as much as people piss and moan about SQL, SQL has consistently proven to be an excellent, general purpose solution
I would disagree with that assertion. It would be more precise to say that SQL has proven to be a solution that is better than anything else offered so far, when taking into account factors such as the existing implementations, supporting tools, learning materials, and availability of skilled professionals (which are all major factors when comparing a mainstream entrenched tech with any newcomer; see Java vs Ruby etc).
SQL by itself isn't particularly well designed as a query language - it was based on fundamentally wrong premises (as many "4GLs", it was supposed to "enable non-programmers to write code"; as everything that has ever tried that before and after, it failed). Consequently, its syntax is quirky and overly verbose with no good reason, and its semantics very non-orthogonal. And that's without even getting into the issue of standard conformance of real-world implementations, and portability of non-trivial queries between them...
SQL's real strength is that it's 1) already there, and 2) is good enough. In that regard, it's very similar to C. It's very easy to come up with a systems programming language better than C, but C is so widely used that it beats any competitors simply by virtue of being good enough; you'd need some real breakthrough to unseat it. Same thing here.
If only you could start something like a "transaction", which you could then "roll back" after finishing the test, leaving the database in its original state.
...which means you now can't test transactions.
And if you could somehow "back up" the database and "restore" it on a test server, or under a different name.
That'd be obnoxious, but sure.
If you've got multiple users and connections, you want ACID.
Spoken like someone who's never written a web app.
Let's take Slashdot as a trivial example. Do I need transactions to be atomic? That'd be nice -- I don't want to see half a post succeed and the other half fail -- but it isn't really needed outside profile settings. Consistency? Posts really only depend on their parents, which are immutable -- pretty much by definition, if a user can see a parent post to reply to, by the time Slashdot will let them reply, that parent post will be committed. How about isolation? Nope -- if two users post two comments simultaneously, the application can trivially figure out which one should go on top -- no need for a transaction. Durability? Again, I don't see this being needed anywhere but in the user's profile.
So, we could take atomicity and durability as useful and easily drop isolation and consistency -- they simply are not needed and add zero value, unless I'm grossly misunderstanding what they are.
If you want to see what it's like without those protections, go use Mac OS 9 for a week or so,
Or an embedded system. Or an OS kernel.
Don't thank God, thank a doctor!
Why do we use quicksort when there are other approaches to sorting?
That's actually a good question, and many languages have switched to things like mergesort as a default sort.
Don't thank God, thank a doctor!
SQL Server supports nested transactions. I'd imagine other vendors do as well.
most geek arguments are based on the Highlander principle
bite my glorious golden ass.
Most major databases support nested transactions. If the ones you've had to test against don't, that would be a real PITA.
I'm probably thinking of something like SQLite. Good point, though.
Of course, there is the added problem of being able to access the result of running a given failed test before rolling it back. Should we dump the entire production database to do that?
You would need to do essentially the same things with any persistent store, right? Unless you mean to test against production data...
Eventually, the idea is to test against a clone of production data, which is I think what you're suggesting. However, there are a lot of tests you can do without even touching the database, and a lot more you can do with generated data. This means you need a quick way to get a completely clean datastore, and a quick way to get a datastore which is a clone of a given snapshot of production. (Has to be a given snapshot, otherwise you might have a test that fails on one run and succeeds on the next.)
OTOH, why do you care if your web app that doesn't need ACID happens to run on an ACID database? ACID is usually transparent to the application programmer, and I would assert that most web apps will never grow large enough to notice any performance hit, assuming a reasonably sane schema is in place.
As soon as you grow beyond a single database server, that's a serious performance hit, or quite possibly an administration nightmare, assuming your chosen database can scale to multiple servers -- last I checked, Postgres only does replication.
Also, SQL is ugly. If I don't need ACID, why should I put up with it? More generally, if my web app doesn't need ACID, why would I want to use an ACID-compliant database at all?
Hardly any of our web apps are ever going to become the next Google, Amazon, Twitter, or whatever.
Even without that, I'd like to be able to handle spikes in traffic gracefully. While I doubt it was the database that was the bottleneck, recently, a number of students in my physics class have been unable to work on or submit online homework, because the online homework system seems to buckle under the load of all the procrastinators. Being able to trivially scale the database with the application is still a Good Thing, even for a tiny app.
Don't thank God, thank a doctor!
Further SQL has nothing to do with ACID. AT ALL!
Mr Begin Transaction and Mrs Commit Transaction called for you. They aren't happy.
http://tinyurl.com/TMTOWTDI
Over all, I don't disagree with the substance of your reply.
it was supposed to "enable non-programmers to write code"
This is very important. The problem is, those same non-programmers believe they can design databases too.
Consequently, its syntax is quirky and overly verbose
There is a standard which reduces the vast majority of syntax quirkiness. Much of the quirkiness exists not from the SQL standard but from various SQL RDBMS which do not follow the standard.
I've frequently heard SQL is overly verbose but have never bought into that. Can you provide an example of SQL which is overly verbose and an imaginary, more concise example, which still meets "non-programmer" demands in readability?
In my own opinion, people confuse "overly verbose" with having many options. Again, in my opinion, many people don't understand the many nuances of various SQL clauses and is likely the root of the "overly verbose" argument.
Don't get me wrong, SQL is not perfect. There are certainly some oddities in the SQL standard. At the same time, I'm hard pressed to think of any technology by committee which doesn't have faults. Furthermore, I can't think of any significant technology base which is perfect.
SQL Antipatterns may interest you. As one of the reviews says, "An excellent guide to database design tradeoffs".
On that level, sure. But a lot of things that one might consider fairly basic to SQL aren't essential to leverage the generality of the relational model.
For instance, you could have a datastore where you assert tuples without reference to a particular relvar (or table), and then the datastore assures that the asserted tuple satisfies the constraints for all defined base relvars whose headers it satisfies. The body of a relvar is simply, then, the set of all asserted tuples that satisfy the header of the relvar.
If you do this, you'll want to have an independent namespace scheme for attribute (column) names rather than using relvars/tables as column namespaces as well.
This produces a database that is going to be very different from an SQL-based database on a fairly fundamental level, but still can leverage mathematical set theory and relational algebra in much the same way.
It also could conceivably support asserting (and querying) facts/tuples before they fit into defined relvars/tables, while still assuring that any defined integrity constraints on defined relvars were satisfied.
To give a specific example of SQL being messed up, WHERE vs HAVING. No other query language I'm aware of needs two distinct filtering clauses. In XQuery, for example, you simply apply "where" before or after "group by" as needed - they are fully orthogonal operators, so you can interleave as many as needed.
By the way, in general, I would point out XQuery as an example of a cleanly designed (syntactically, at least) query language from which SQL could learn a lot.
Most developers simply drop their application scalability problems down to the DB layer and/or OS layer. Then bitch that those DBAs are dumbasses, the DB server doesn't scale.
Deleted
I'm tired of all this nonsense. A DB provides persistence, storage, integrate it with the storage interface of the PL in question (you do have one, right). Heck, PL/I had handles and record oriented storage, and any descent multi paradigm language provides the necesary higher order functions or equivalent, the rest is the compilers job. FWIW
I know tobacco is bad for you, so I smoke weed with crack.
Why should they? NULL doesn't mean anything in the relational model; NULL is an SQL construct that violates the fundamental underpinnings of the relational model.
Well, it's kinda the idea behind LINQ - integrate query comprehensions (and convenient syntactic sugar for them) into the language, and then provide extensibility points so that they can be mapped to various data sources as needed.
it seems to me that once the transaction being tested is committed and the test detects a failure, it should be able to record details before rolling back the test-level transaction.
It could, but would I be able to explore a snapshot of the data at that point in time? In particular, could I watch changes to the data as I step through with a debugger?
These are things which are easy if I have an isolated copy (as you described in SQLite), but don't work as well just using a transaction of a development or production database.
If you've got tests that don't need a copy of the store, I'd say partition those from tests that do need it;
It can be done, sure, but I think it's far easier just to have object factory classes -- though I'm not sure, now that I think of it, that these are easier without SQL, it seems to be fairly independent of datastore.
along with dump files to repopulate either DB schema.
Not just the schema, though. You'd want a clone of the production schema and (likely) the production data. But I see your point.
(they don't actually *need* Oracle for this app, but they've standardized on it for all their internal apps).
I hate when that happens, but then, I'm not sure I see any real applications of Oracle. It seems to me that when you get to the scale where you would need Oracle, you're very nearly at a scale which would make Oracle useless and force you to think in distributed terms.
I think you're right about PostgreSQL, but I'm too lazy to check.
I think there may, at one point, have been some proprietary options for true multimaster replication, but I don't think they went anywhere. And multimaster replication doesn't really buy you much in terms of scalability, if you have to keep all nodes in sync.
Anyway, you should get a firm grasp during planning of how large (or distributed) the data needs to be within a foreseeable time frame. If you anticipate truly huge data on multiple servers or a need for distributed data, then that should strongly influence what data platform you choose.
Point is, if I design and plan for truly huge data on multiple servers, it's still going to work at a smaller scale -- and as an added bonus, I find many of the NoSQL databases much more pleasant to use than SQL.
However, if I design and plan for a small scale, well, it's probably not going to be pleasant upgrading from SQLite to MySQL or from MySQL to Oracle, so you can imagine how not fun it would be to go from any of these to a truly sharded design, or to one of these NoSQL designs.
One thing I love about Google App Engine, for instance, is that it makes me think about the shape of my data. If I can think about how this works at a high level, I can design my application around it such that it can scale to pretty much any load anyone throws at it.
maybe not if the data model just won't work well with a key-value DB.
Well, but what data is that? Consider CouchDB -- you get "views", which are essentially mapping functions, and you can also create arbitrary "reduce" functions. This means you essentially get a cached view of the result of arbitrary code execution, and these can each be pretty much arbitrarily parallized.
Most of the major databases can handle medium-sized data sets, say 10s of GBs to 100s of GBs, on a single server.
That doesn't tell anywhere near the whole story -- I handle terabytes of data on a single server with a flat filesystem, because they happen to be terabytes of video.
So what kinds of data, and what kinds of access patterns are we talking about? And how big a server -- how many cheap PCs could I buy for the same price? How are you going to handle availability -- that is, what happens if that one server goes down?
OK, you don't like SQL.
Don't thank God, thank a doctor!