Slashdot Mirror


The NoSQL Ecosystem

abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'

381 comments

  1. Why worry? by Anonymous Coward · · Score: 5, Funny

    Microsoft Access is here!

    1. Re:Why worry? by MichaelSmith · · Score: 4, Funny

      Don't forget excel!

    2. Re:Why worry? by socceroos · · Score: 1

      Ah yes, Excel 97 - the days when you could be in a flight simulator and legitimately tell your boss you were crunching numbers.

    3. Re:Why worry? by mikael_j · · Score: 1

      Sadly there is plenty of production code that uses Access databases for things they just shouldn't be used for, at a previous job I actually built several production websites that used Access as the db backend because the client didn't want to use MySQL (Open source is scary!) and they didn't want to pay for MSSQL...

      /Mikael

      --
      Greylisting is to SMTP as NAT is to IPv4
    4. Re:Why worry? by Anonymous Coward · · Score: 1, Funny

      ya don't forget!

    5. Re:Why worry? by rainhill · · Score: 1

      Seriously... why the +5 funny?

    6. Re:Why worry? by Manos_Of_Fate · · Score: 4, Funny

      Because there's no "scary because it's true" mod.

      --
      Isn't enough that I ruined a pony, making a gift for you?
    7. Re:Why worry? by Linker3000 · · Score: 2, Insightful

      Oh Great! I have just migrated 5 offices from a veterinary management system based around Access 97 onto the new, MS-SQL-based one.

      How can I expect to maintain my value to the company if they stick with old, reliable systems instead of moving onto more sophisticated 'solutions' that require a shit-load of tweaking and technical guesswork to keep them running smoothly?

      --
      AT&ROFLMAO
    8. Re:Why worry? by Yoozer · · Score: 2, Informative

      That's when you tell customers about MSDE (now SQL Server Express) which does the job a lot better without breaking the bank.

    9. Re:Why worry? by Hognoxious · · Score: 2, Funny

      Don't be so pessimistic. There's OO databases and the cloud. That should see you almost through to retirement.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    10. Re:Why worry? by Anonymous Coward · · Score: 5, Interesting

      You laugh, but the things I see done in Excel on a daily basis in production environments getting a LOT of work done are a testament to it's power. It is one of the best rapid application development platforms in existance. People with no CS background programming away in a functional style and getting shit done and not even realising they are programming. It could be so much better but it's still the best of breed. Any yes I have tried, and seen others try, O.O. et al. Forget it. Lets not go down that worn old road.

    11. Re:Why worry? by Prototerm · · Score: 1

      dBase3 FTW

      --
      "My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)
    12. Re:Why worry? by MBGMorden · · Score: 1

      Indeed. SQL Server Express lacks mostly features that smaller outfits wouldn't care about anyways. The main limitation you hit is that the individual databases are limited to 4GB in size. If you're looking for something "free" though (which Access is not, but we'll assume it's cheap enough to be considered) and need something bigger than 4GB, then going with Access over ANY open source database is just insane.

      And if they really end up just HAVING to pay big money for software (too many baby boomers are obsessed with the outdated and often inaccurate notion of "you get what you pay for"), then just give him PostgreSQL and bill him a nice fat "setup fee" on the invoice so he thinks he's paying for the software itself. He gets a good database, accepts it as such because it was expensive, and you pocket some nice extra money.

      --
      "People who think they know everything are very annoying to those of us who do."-Mark Twain
    13. Re:Why worry? by QuoteMstr · · Score: 1

      just give him PostgreSQL and bill him a nice fat "setup fee"

      There are entire companies built around this idea.

    14. Re:Why worry? by BlackSnake112 · · Score: 1

      Only if you write in the usual care and feeding scripts yourself. Otherwise the express version is lacking. I have seen many SQL express DBs set up with no backup, no automated updating of any of the indexes and other care and feeding tasks. Also showing the bosses a script file and saying this is backing up and rebuilding X in your production database does not go over as well as opening up the GUI and saying here, look at your database. This is the task that backs it up, this is the task that rebuilds the indexes so your searches are faster, etc. Seeing things makes the boss types happier.

      Writing the script files to do these tasks is not very hard to do. Why is it that most of the out sourced SQL express jobs (or turn key systems) I see never do that?

    15. Re:Why worry? by Yetihehe · · Score: 1

      You didn't see anything until you've seen simulations of vibrations in fast rotating shafts done in Access. I think programming those in any compiled language could bring calculation time from 2 days to 5 minutes...

      --
      Extreme Programming - Redundant Array of Inexpensive Developers
    16. Re:Why worry? by tordon · · Score: 1

      Er, if you had been around in the Lotus 1-2-3 days, you would have seen those same people doing _real_ programming with Macros.

      That worn old road is exactly what MS went down when they created the psuedo OO macro language behind Excel.
      In doing so they alienated all of the business gurus who could handle macros in 123 and get stuff done.

      When Excel arrived they were stuffed because they couldn't handle the cryptic way macros are written in Excel.

    17. Re:Why worry? by CastrTroy · · Score: 1

      I think that's probably due more to bad coding than to not running compiled code. Compiled code might run 10% faster, but not that much faster. It's probably much more likely due to something being coded with completely inefficient algorithms, or people not realizing that things like appending 1 character to the end of a string 10,000 times is going to be slow, or people not realizing just how expensive "Redim Preserve" really is.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    18. Re:Why worry? by shutdown+-p+now · · Score: 1

      There are actually two things to replace Access, depending on which of the main two scenarios it was used in. For a shared database, MSSQL Express is indeed the way to go.

      However, in many cases Access was used as an "embedded" database for a particular piece of software, used either by a single user at a time, or by several users on the same machine (use of the same database located on a network share being possible but not frequent). For that case, MSSQL Express is likely overkill, and is inconvenient in that it still requires a server process (and a Windows service registration). A better option here is MSSQL Compact, which is an in-process database engine (similar to SQLite or Firebird Embedded), with simple locking for concurrent database access.

    19. Re:Why worry? by Nefarious+Wheel · · Score: 1

      You laugh, but the things I see done in Excel on a daily basis in production environments getting a LOT of work done are a testament to it's power. It is one of the best rapid application development platforms in existance.

      Maybe not the best, but quite possibly the most popular and persistent. As I've said before, a tremendous - a truly disturbingly tremendous - amount of the world's wealth exists nowhere else than on the common Excel spreadsheet. This statement brought to you courtesy of a bit of work I did once for a major bank's managed funds section (I think working on nuclear munitions would have scared me less).

      For those of you who aren't across why, it's simple - convenience. Open Excel, create a macro, do a few things (open a spreadsheet file, format, move, change or populate a few columns, whatever) then close the macro.

      Now, press Alt+F11. Find the macro code. You're sitting in an IDE that allows you to write VBA, with the entire development and production environment encapsulated in that single Excel document, backed by the Office platform that most of the worlds businesses, correctly or in, have installed on every desktop.

      What this has meant for financial organisations is - you can effectively wrap that spreadsheet around a rock* and throw it through another company's window, and they will also be able to run the software on that document.

      You cannot ever underestimate the power of simple convenience. Those suckers are busy, the rules are always changing and fast+simple software development rules the waves.

      ----

      *It's one of those mettyphores. Honest. If it's insufficiently literal, substitute "a USB key tied to a lead sinker".

      --
      Do not mock my vision of impractical footwear
    20. Re:Why worry? by Anonymous Coward · · Score: 0

      ...a testament to it's power.

      It's just too bad that Excel's grammar check isn't available for /. posts!

    21. Re:Why worry? by Fulcrum+of+Evil · · Score: 1

      Compiled code might run 10% faster, but not that much faster.

      More like 10x - this is access, which is quite possibly running VB6.

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    22. Re:Why worry? by clockwise_music · · Score: 1

      Don't forget Filemaker Pro - the world's worst database.

    23. Re:Why worry? by sincewhen · · Score: 1
      I have developed Brad's Law of Data, which states:

      All data spends some part of it's lifetime in an Excel spreadsheet.

      So far, (and particularly in the project I'm on) it seems to be true.

      --
      -- Braden's law of data: All data spends some of its lifetime in an excel spreadsheet.
    24. Re:Why worry? by Anonymous Coward · · Score: 0

      I think the "world's worst database" award would go to FoxPro. I am responsible (although that is a highly inaccurate term; "thrown at it because everyone abandoned it and my boss thought it would be a good idea since I do all the other jobs nobody is willing to take up" is a much better, but less succinct, one) for providing technical support for a steaming pile of interpreted VFP (version 6!!! no less) code that is in use in over 1500 sites around the world. I have never been allowed access to the sources, and I highly doubt I'd be able to hold my breakfast in if I did see it. Every day I get calls from one or other of the dealers tasked with supporting those 1500-odd end-use sites with inane "it don't work - it crashed and gave this here operator a "FATAL ERROR" message... whudduidooooooo now?" complaints. At least I get paid the big bucks and get to travel to exotic locations around the globe on my company's dime, because if I was a tech grunt saddled with this, I'd have quit a LONG time ago.

  2. bad design by girlintraining · · Score: 2, Insightful

    So... every time I open my inbox in Facebook, it has to search through 50TB of data? That sounds like a design problem. What has always floored me is why people think everything needs to be stuffed into a database. Terabyte sized binary blobs? You know, there's a certain point where people need to stop and actually think about the implimentation.

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:bad design by bennomatic · · Score: 5, Funny

      I'm a terabyte sized binary blob, you insensitive clod!

      --
      The CB App. What's your 20?
    2. Re:bad design by munctional · · Score: 5, Insightful

      Ever heard of bloom filters? Sharding? Indexes? They are clearly not doing a table scan on 50gb of data every time you open your Facebook inbox.

      You know, there's a certain point where people need to stop and actually think about the implimentation.

      Um, they do. They regularly blog about their solutions to their problems and open source their solutions and contributions to existing projects. They come up with amazing solutions to their large scale problems. They're running over five million Erlang processes for their chat system!

      http://developers.facebook.com/news.php?blog=1

      http://github.com/facebook

      Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation".

      --
      Functional programming... for real men!
    3. Re:bad design by JavaPunk · · Score: 4, Interesting

      Yes it does (look through 50TB of data), and how would you design it? It has to access all of your friends and find their postings. Robert Johnson gave an excellent talk on facebook's design two weeks ago at OOPSLA (it should be in the ACM digital library soon). He stated that there is no clear segregation of data, the (friend) network is too connected and extracting groups of friends isn't possible. Basically they have a huge mysql farm with memcached on top. Loading an inbox will hit multiple servers (maybe even a different server for each of your friends) across the farm.

    4. Re:bad design by socceroos · · Score: 5, Funny

      Ever heard of bloom filters? Sharding? Indexes?

      Don't forget flux capacitors, FTL drives and crossfading splicers.

    5. Re:bad design by TheModelEskimo · · Score: 2, Funny

      Yeah. And those guys down the street, the tweakers, nose jobs, and johnny-come-latelys.

    6. Re:bad design by ErikTheRed · · Score: 4, Funny

      So... every time I open my inbox in Facebook, it has to search through 50TB of data? That sounds like a design problem. What has always floored me is why people think everything needs to be stuffed into a database. Terabyte sized binary blobs? You know, there's a certain point where people need to stop and actually think about the implimentation.

      Could be worse. They could try to find something on my desk.

      --

      Help save the critically endangered Blue Iguana
    7. Re:bad design by kestasjk · · Score: 2, Insightful

      They use bloom filters for messaging? What for?

      --
      // MD_Update(&m,buf,j);
    8. Re:bad design by mabhatter654 · · Score: 1

      all electronic data is stored in a "database"... even a file system is a type of database you just can't query it with SQL... it's just that database programs that use SQL as the front end have more tools built for them.

      You have a point that the implementation has to be thought about. If you RTFA you'll see the issue is more that RDMS implementations like Oracle rely on breaking up clusters so you can cram as much of the data into RAM as possible on to fast CPUs... in the case of something like Facebook that's just not possible, it's not reasonable to build that type of system, so they have to optimize their systems in new ways. This is also why mainframes and midrange systems still rule the enterprise arena because they have massive amounts of ram, cache, and disk directly connected . They don't architecturally fix the problem, just throw more resources at it. Google and Facebook are built on small, cheap machines in massive arrays...but RDMS doesn't really fit that storage model. Also internet clusters don't get "batch downtime" like mainframes do... time to cleanup tables and reset indexes. The big joke I thought of in my Oracle class was how much manually "tending" the package took....where Oracle systems try to have PERFECT data, Facebook or Google simply "evolves" the data as they build new banks of servers and throws out the unoptimized nodes, they know they'll never have perfect data... stop employing armies of DBAs to rearrange the house of cards. (in their defense Oracle is building "perfect" systems because that's how accounting, inventory, shipping, etc NEED to be... if you lose a Facebook post or Google loses a months worth of Bot data it's not the end of the world... lose a month of TPS reports and there's hell to pay!)

      In a parallel problem it's the same on your home computer. Why do I care about my OS having a "file system" and why can't my data, contacts, pictures, music, document, etc be in some kind of database that is self contained, properly metatagged, and enables a simple method of backup. In many ways the first 30 years of home computing was about making things generalized and marketable... the next will be building systems that ignore "files" and manage the raw data and metatags directly.

    9. Re:bad design by DarkAxi0m · · Score: 1

      clearly to make the messages look better for the people with fancy video cards...
      http://www.vgcats.com/comics/?strip_id=224

    10. Re:bad design by syousef · · Score: 0, Offtopic

      Ever heard of bloom filters? Sharding? Indexes?

      What does World of Warcraft have to do with it?

      --
      These posts express my own personal views, not those of my employer
    11. Re:bad design by Anonymous Coward · · Score: 1, Informative

      Bloom Filter

      Been around before it was used to describe computer graphics lighting effects.

    12. Re:bad design by Hal_Porter · · Score: 1

      They're running over five million Erlang processes for their chat system!

      http://developers.facebook.com/news.php?blog=1

      If the objective were to maximize the number of Erlang processes, that would be an indeed be an impressive achievement.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    13. Re:bad design by Tablizer · · Score: 1

      Go on a diet then.

    14. Re:bad design by Tablizer · · Score: 1

      So... every time I open my inbox in Facebook, it has to search through 50TB of data?

      You do if you are Paris Hilton.
         

    15. Re:bad design by Ragzouken · · Score: 3, Interesting

      "Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation"."

      When was the last time you tried to use Facebook or Facebook chat and didn't get failed transport requests, unsent chat messages, unavailable photos, or random blank pages?

    16. Re:bad design by gutter · · Score: 2, Interesting

      Sounds like you don't know much about Erlang. Erlang processes are MUCH lighter weight than unix processes, and are designed to scale to millions of processes. Generally, you want one Erlang process for each concurrent task in the system, like maybe one process for each active chat session. So, having 5 million Erlang processes would be as designed.

      --
      Check out DRM-free movies at http://www.bside.com
    17. Re:bad design by oldhack · · Score: 1

      To reverse the polarity when the flux capacitor is overloaded.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    18. Re:bad design by KiwiSurfer · · Score: 1

      Mod parent up. Facebook chat is probably one of the most unreliable IM systems I've used to date.

    19. Re:bad design by donaggie03 · · Score: 1

      None of that ever happens to me, and I use facebook all the time. Maybe facebook just doesn't like you!

      --
      Three days from now?? Thats tomorrow!! ~Peter Griffin
    20. Re:bad design by Linker3000 · · Score: 1

      ..reverse tachyon beams...oh, and more cowbell.

      --
      AT&ROFLMAO
    21. Re:bad design by Zombywuf · · Score: 2, Funny

      Sounds like you don't understand sarcasm. I'll spell it out for you: Simply because Facebook are running 5 million processes is neither here nor there. The impressive thing is that it actually works (from what I hear it does any way. If it did it with one process or 5 million it has nothing to do with the relative weight of Erlang and Unix processes.

      Next up, tying your own shoelaces...

      --
      If you can read this you've gone too far.
    22. Re:bad design by Zombywuf · · Score: 4, Insightful

      The problem is when people don't think about the solution and apply the cargo cult mentality. Facebook uses Eeeerlaaaang therefore we should. Facebook wrote it's own database, therefore we should. People end up writing their own database engines that do exactly the same thing as modern relational engines, with all the bugs that were fixed in the relational engines 10 years ago (5 for Microsoft). Even MS SQL will split a large group by aggregate operation (which takes 3 lines to specify) across multiple CPUS by turning it into a map reduce problem, and it will do this all without you having to be aware of it. Oracle (and many others, Oracles is supposed to be the best) will maintain multiple concurrent versions of your data in order to allow multiple users to work with a snapshot that doesn't change under them while others are changing the data, and this happens transparently. You can go ahead and implement all this stuff yourself if you want, in C and sockets, call me when your done, in 10-20 years.

      The real issue I have with the NoSQL people is they're a bunch of whiny babies, who haven't even taken the time to understand the problem before lashing out at the first thing they see. Just the name tells you this, they call themselves "No SQL" and then lash out at relational databases. SQL is is a terrible language, which really needs replacing, but it is only one possible language for querying relational databases. Relational databases represent several decades of research into how to query data in a fault tolerant scalable way as a standing implementation, re-implementing them is a waste of time.

      --
      If you can read this you've gone too far.
    23. Re:bad design by Hal_Porter · · Score: 1

      These new interpreted GCd languages always claim to be ultra lightweight.

      In this case an Erlang process needs 1232 bytes (more if you want HiPE and/or SMP support, which you probably do)

      http://www3.erlang.org/doc/efficiency_guide/processes.html

      That gives you 5GB of space just for the process structures it you want 5 million of them. A quick Google shows that lots of people have problems with Erlang running out of memory because they've screwed up the GC somehow, so this is definitely a minimum. Once those process start to allocate memory things will get much worse. Even the minimum, 5GB is not really good. What will it do to access locality for example?

      Basically saying "lightweight compared to Unix and and Windows" isn't that impressive because both of those OSs are extremely bloated compared by the standards of embedded systems. Java, which Erlang claims to be faster than, is horrible for performance too.

      Hokey interpreted languages are no match for native code written by someone who understands how processors execute it if you want lightweight.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    24. Re:bad design by Anonymous Coward · · Score: 0

      Yeah, they needed to use more XML and adopt a lean "implimentation" methodology.

    25. Re:bad design by Anonymous Coward · · Score: 0

      Robert Johnson gave an excellent talk on facebook's design two weeks ago at OOPSLA

      I got facebook, I got facebook on my mind
      I got facebook I got facebook all on my mind
      Hate to leave my friends, but you treats me so unkind

      I got mean things, I got mean things all on my mind
      You wrote them on my wall, I got mean things all on my mind
      Hate to leave you my friends, but you treats me so unkind

      Going over to myspace, or the first service I see
      Going over to myspace, or the first service I see
      I got the blues about my facebook page and facebook got the blues about me

      And I'm leavin' this mornin', with my arm' fold' up and cryin'
      And I'm leavin' this mornin', with my arm' fold' up and cryin'
      Hate to leave my friends, but they treat me so unkind

      I got mean things, I've got mean things on my mind
      I got mean things, I got mean things all on my mind
      I got to leave my friends, well, they treat me so unkind

    26. Re:bad design by Anonymous Coward · · Score: 1, Funny

      Maybe it's not on your desk, but UNDER it... you naughty boy

    27. Re:bad design by DarkOx · · Score: 1

      SQL and the RDBMS is an great tool for most jobs; similar to the way a pair of pliers can be very versicle. It does not mean its the only tool for every job. I have seen in IT and application development a trend over the past few years to cram everything into a database that would have been stuff in a flat file before. I am not always convinced this is actually a good choice.

      There are also task for with table oriented relational structures actually don't fit well. GIS is probably the best example. You don't really want in a ideal world to be using rasterized information which is pretty much what an underlying relational data model will always force you into. Its also not good for asking questions like find me all the bodies of water within 50miles of the treatment facility at point x. I am not saying this can't be done, or that its not done millions of times everyday even on the web. I am saying that it can't be done with a purely relational model without also making some compromises in the form of assumptions, limitations in granularity, and efficiency.

      I am also inclined to agree with the parent posters that where the NOSQL people are spending a great deal of time is on tasks that really can be done with RDB technology quite well and probably don't fully understand some of the problems they are attempting to solve.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    28. Re:bad design by QuoteMstr · · Score: 2, Insightful

      What makes you think that relational calculus can't be extended to support spatial information? After all, it's just another kind of index.

    29. Re:bad design by Tim+C · · Score: 1

      Also, when was the last time you tried to visit Facebook and it was down?

      Well it's been a couple of months, but it does happen. Also one of my friends used to have a problem with her profile being unavailable for hours at a time quite frequently.

      Not major issues, it's true, and they're doing a great job, but don't think that they're perfect because they're not. (But then all software has problems from time to time of course)

    30. Re:bad design by Anonymous Coward · · Score: 0

      The flux capacitor is a bad idea, you need 1.21 jigawatts to run it!!!!

    31. Re:bad design by AvitarX · · Score: 1

      For something on the scale of Facebook, 5 GB of wasted overhead for the chat system would not scare me.

      When embedded systems get updated regularly to changing requirements, while comfortably scaling at the rate Facebook has it may be a reasonable comparison, but until then I don't think it is.

      And Facebook has problems too, and it definitely feels wonky at times (10pm eastern usually), and updates don't go perfect (ever since the live feed was added, the news view is out of order for me sometimes). It still feels like an amazing accomplishment, and if they waste a few GB here, and a few GB there of overhead to keep it going easier, I would think it is a good thing (from their perspective).

      In the end it amounts to hundreds, maybe thousands of dollars, which really is not much in the scheme of things.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    32. Re:bad design by Muad'Dave · · Score: 2, Informative

      ...they call themselves "No SQL" and then lash out at relational databases.

      Had you read the article, you would've seen that the "No" in NoSQL stands for Not Only, not No, as in none whatsoever. I welcome any and all research into better, tighter synergy between databases and object persistence.

      --
      Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
    33. Re:bad design by TheLink · · Score: 1

      > SQL is is a terrible language

      Yes it is. However one of the biggest benefits of SQL and "rigid schema design" is it forces people to conform to a standard.

      Those "whiny babies" might say that's a flaw, and it is a flaw when there's just one whiny baby to keep happy. But it's often a feature when you have 100 whiny babies that want different things from a _single_ database.

      Yes RDBMSes might be slower. But thanks to Intel and friends, RDBMSes have been good enough for most organizations and the performance curve stays ahead of their growth. and the performance problems tend to not be due to the RDBMSes but due to other problems e.g. the DBA screwing up, or the app doing the wrong thing 1000 times :).

      --
    34. Re:bad design by kestasjk · · Score: 1

      I know what a bloom filter is, that's why I asked what they use it for.

      --
      // MD_Update(&m,buf,j);
    35. Re:bad design by gander666 · · Score: 1

      Damn, and I thought I was the only one who was thinking Hellhounds on my mind....

      Bravo!

      --
      Suppose you were an idiot and suppose you were a member of Congress ... but I repeat myself. - Mark T
    36. Re:bad design by jcnnghm · · Score: 2, Insightful

      Yes it does (look through 50TB of data), and how would you design it?

      When a users posts a message, I would have the web server pass the message to a server that listens for messages that are being sent. That server would collect the mail then place them as a payload package in the messaging queue when either a fixed number of mail recipients, probably around 500, or a fixed time passes, probably 500ms, whichever comes first. When the payload reaches the front of the queue, the messaging server working on the payload would parse through all the messages building a model of all the data it needs to render all of the messages. It would then send a low priority FQL multiquery requesting all of the data it needs to render and send all of the requests. From there, the messaging server would render both the updated view of the mail when viewing the thread, and view of the thread when viewing the inbox. These would be passed to a persistent memcached setup.

      An FQL query would be generated for each user that would increment their inbox message counter, remove the memcached key of the old thread preview from the array of keys representing their inbox while prepending the new key to the array, and append the key to the array representing the thread. When this was assembled for all mail, another low priority multi-query would be sent committing this change.

      At this point I'd purge the old thread preview keys from the persistent memcached setup, and store the raw data in a table indexed by both the thread preview key, and the mail view key. The raw data would be stored in case a design change ever necessitated re-rendering all of the mail, or in the case of a user name change.

      Finally, I would generate and send an e-mail to each user telling them they have a new message.

      This is complex, but it also means that to render an inbox, the only thing that has to be done is to retrieve the array of message thread preview keys, and request each thread preview by key from memcached. Of course, this collection could also be cached.

      Note: I intentionally left out some things in the interest of time, like sent message display, read and unread flagging, spam filtering, new message highlighting, and I'm sure others. It shouldn't be difficult to see how this basic model can be expanded to cover these cases.

      --
      You don't make the poor richer by making the rich poorer. - Winston Churchill
    37. Re:bad design by BitZtream · · Score: 1

      And you have once again shown that they are likely doing it wrong.

      According to their own page, they have 300 million 'active' users, 50% log on each day, so 150 million users login each day. No where near that are going to to 'chat' each day.

      So what you are saying is they get 3 users to each chat process if EVERY user that logs in that day chats at the same time, which doesn't happen.

      Yes, they are doing it wrong.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    38. Re:bad design by SenFo · · Score: 1

      Just out of curiosity, are there any modern file systems that are better at this? For example, rather than storing the entire content of a message in the database, what if sites like Facebook stored only a pointer to a file in the database and then saved the actual message content on disk? That way, the database can stay lean. Furthermore, it is my understanding (though I have no personal experience) that distributed file systems scale fairly well.

    39. Re:bad design by Anonymous Coward · · Score: 0

      You either don't understand GIS or you don't understand relational DBs.

      http://postgis.refractions.net/

    40. Re:bad design by LukeWebber · · Score: 1

      I'm a terabyte sized binary clod, you insensitive blob!

      Well, OK. I'm actually 1.5TB, but I have a diet plan and I've joined a gym.

    41. Re:bad design by Hal_Porter · · Score: 2, Insightful

      For something on the scale of Facebook, 5 GB of wasted overhead for the chat system would not scare me.

      It's not about the cost of the memory. Big systens tend to run more slowly because of locality effects. Systems running byte code run more slowly too. I think the Facebook guys have been saved by big ass hardware, not an efficient design.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    42. Re:bad design by Zombywuf · · Score: 1

      You still fail to understand the points that:

      1. SQL is a language used to query relational databases, by no means the only one, just the most widely supported.
      2. Relational databases are a subset of databases.
      3. Object persistence has nothing to do with databases. You might use a database to store and manage your serialized state, but the point where it becomes an object you can send messages to it stops being data.
      --
      If you can read this you've gone too far.
    43. Re:bad design by Anonymous Coward · · Score: 0

      The real issue I have with the NoSQL people is they're a bunch of whiny babies, who haven't even taken the time to understand the problem before lashing out at the first thing they see.

      Yeah, some NoSQL advocates are certainly like that, but do you know what's even worse? The whiny ACID-babies who haven't taken the time to understand the problems that most NoSQL systems are really trying to solve, and are just deathly afraid of having to learn something besides their comfortable RDBMS-tweaking ways. Things really do get a bit different when you have petabytes spread across multiple sites, where even millions spent on Oracle licenses won't really yield a solution. A lot of this work is based on Eric Brewer's CAP Theorem, which was presented as a keynote at PODC in 2004 and has since been formally proven. How's that for a bunch of folks who didn't know what's already out there? Brewer's work was in turn informed by Lamport's (e.g. vector clocks and eventual consistency), who in turn built on others going back at least as far as Codd and the relational model. The simple fact is that you can't have all of C/A/P, some people legitimately value A/P more than C, and C (consistency) in this context includes the I (isolation) of which you seem so enamored. Stonebraker already made the point that this has nothing to do with SQL, and much better than you, but it does have to do with ACID and ACID is simply irreconcilable with some needs. Raising facile objections to the name is a poor substitute for tackling the real issues.

      I've written about the cargo cult mentality myself, even in this particular context, anticipating your remarks by at more than a month. Someone here has indeed not taken time to understand the problem before lashing out: you. Please get over the puerile attitude that different knowledge must be inferior knowledge, and educate yourself a little.

    44. Re:bad design by AvitarX · · Score: 2, Informative

      But that is my point really, big ass hardware is still quite cheap.

      An inefficient, but easy to create and manipulate design, that is scalable affordabley, has benefits.

      when the workload scales to 10,000 (April 2008) servers, the inefficiency is not problematic.

      If the super efficient embedded style program was easy to maintain and scaled so well, it may only be 5,000 servers, or even 1,000,, but how well would it scale to 15,000 or 3,000 with new features being added (October 2009 Facebook has 30,000 servers).

      Lightweight is not near as important as being able to throw big ass hardware at a problem.

      And if that big ass hardware is lots of little pieces, because the system scales easily that way, it could be less expensive than less hardware overall, but having to be stuffed into larger pieces (not saying that would be an inevitability of efficient embedded style programming, simply a likely outcome).

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    45. Re:bad design by Salamander · · Score: 1

      Don't know why that got posted as AC. I *was* logged in; maybe /. isn't handling these sorts of consistency issues as well as if they'd used an appropriate storage system. ;) In any case, I do stand behind what I said.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    46. Re:bad design by jedidiah · · Score: 1

      The thing that struck me is how an anti-SQL troll seems to think that 50TB is somehow "large" or "unmanageable".

      If you code/design like a total bonehead than no spiffy technology will save you.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    47. Re:bad design by Hal_Porter · · Score: 1

      Yeah but making a super efficient site in PHP or Perl or whatever the current sweet spot is makes you a good engineer. Making an inefficient site in a fashionable language that only the cool kids really use makes you a poseur.

      There's a principle at stake.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    48. Re:bad design by BitZtream · · Score: 1

      30 per process, not 3. Missed the zero in there.

      As opposed to the 40-50 thousand per thread EFNet was doing 10 years ago on several orders of magnitude less CPU power.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    49. Re:bad design by Anomalyst · · Score: 1

      I've joined a gym.

      Don't leave us in suspense.
      INNER, OUTER, LEFT, or RIGHT?
      or perhaps the question should be Elmer's, Epoxy or Superglue?

      --
      There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
    50. Re:bad design by kthejoker · · Score: 1

      No, the problem is that the "real issues" you are talking about are things that 99% of your typical DBAs will never see in their lifetime, because they work at a church or a pharmacy or a box factory.

      It's great that Facebook and Google and eBay need map-reduce and Erlang and something more scalable than SQL Server Express or Berkeley DB. But they are the exception, not the rule. Excoriating people for pointing that out is, at best, irrelevant and at worst harmful to the idea of alternative data storage mechanisms.

      I'm not picking on you directly, I see it as a larger symptom, that somehow because SQL/RBDMS is not ideal for certain projects, that it should be abandoned at all levels, sooner rather than later, even though there's 40+ years of RDBMS architecture manuals, best practices, knowledge bases, 3rd party apps, "SQL for Dummies", and so on to help the involuntary DBA succeed without having to figure out Cassandra.

      I guess my concern is that a lot of small businesses and shops will see something like this, will think, "You know, our Access database sucks," and try to port themselves over to this, and guess what? The learning curve here is a lot steeper than SQL (the *academic* side of SQL-alternatives is just now getting into 3rd gear), the business case for it is pretty poor in most cases, and you'll end up with a lot of people wasting time trying to get Erlang processes going instead of just migrating to MySQL and keep on carrying on. There's way too much "Rah, Rah, Death to SQL" being attached to these new things, and to me it seems overblown.

      But you know, I'm optimistic. 5 years from now, it may be a different ball game altogether, and then us DBAs just have more things to learn and to do.

    51. Re:bad design by MarkWatson · · Score: 1

      I would mod you up as "mucho interesting" if I had the points...

      I also appreciate how open Facebook is on their techniques to solve problems, open sourcing things like Cassandra, etc.

    52. Re:bad design by Salamander · · Score: 1

      So, basically, you'd move most of the work from the read side to the write side (with an unacknowledged application of the Nagle algorithm in the first paragraph). Well and good, but are you sure that's a good idea? Are you sure the Facebook engineers didn't consider and reject it because (a) it's actually more work total and (b) their access patterns indicate that they should be trying to offload the write side? It's very easy to come up with an architecture that you think (in a near vacuum of actual data or relevant experience) will work better, and conclude that they must be fools for having made the wrong tradeoff. Isn't it at least possible that they actually understand the tradeoffs and applicable technologies quite well, and that we might be able to learn something from the choices they made? Even if we don't actually believe that, it can be a useful mental exercise. Ditto for the folks who designed Dynamo, or PNUTS, or anything else in this area. Many of the objections and alternatives I'm seeing here are explicitly addressed in the papers these folks have written, and it's definitely worthwhile to look those up no matter what verdict you eventually hand down to us mere mortals.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    53. Re:bad design by Muad'Dave · · Score: 1

      I actually do understand your points. That's why I'm in favor of expanding the number of tools we software architects have available to bring to bear against projects that don't always fit the relational paradigm and/or that don't mesh well with SQL.

      As for object persistence, I mention that as my current area of interest where I think more research will be needed as object stores grow in size and are scaled horizontally.

      --
      Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
    54. Re:bad design by QuoteMstr · · Score: 1

      Every repository of knowledge can be described relationally. When a programmer claims that a problem doesn't map well to the relational model, what he means is that his mind can't transform the problem into that notation, and that's a deficiency in the programmer, not the data.

    55. Re:bad design by bartoku · · Score: 1

      I want to take a crack at this, I know enough about databases to make a PHP MySQL web pageis all! How can there be no clear segregation? The data is about me, about someone else, or about a group (a group being a single entity on Facebook similar to an individual) and the data either came from me, someone else, or a group.

      My inbox in Facebook is no different than my email inbox, all the messages are to me, just like my email inbox it should reside on one search-able database restricted to a certain manageable size like every other email inbox in the world. My inbox can be on one server and my friend's on another, the whole thing should be segmented by user. Same with my outbox, just like email I retain a copy of sent messages in my personal Facebook database. Wall posts and pictures should work the same way, anything that shows up in my profile should be copied to my personal tables and database.

      Now the trickiest part is when another user posts a picture and tags me in that picture. That picture reference should then be duplicated and placed in my picture database. This is in contrast to retaining one copy stored in a Facebook wide database and searching that database for pictures of me each time someone wants to bring up my pictures. The picture data storage can be spread across multiple servers and when someone views the my pictures section of my profile there is simply a dump done on my picture database of references to the pictures to present, links to the pictures and the pictures are retrieved from storage and displayed. When the owner of a picture deletes a picture or the owner untags me or I untag myself from a picture, the picture would simply be deleted from my picture reference database.

      I am going to go watch Robert Johnson's ACM talk cause this seems easy and the segregation is as clear as good ole email. From what I can see each individual's profile is fairly small in database terms and never viewed all at once. A profile is viewed by wall, inbox, pictures (and not even all the pictures at once), info...

      I am amateur at this at best, but with my hack mySQL skills I do not see it as a big deal to create an easily scalable Facebook segmented by users.

    56. Re:bad design by squallbsr · · Score: 1

      hmm, I take it you have never used MSN?

      --
      Sleep: A completely inadequate substitution for Caffeine.
    57. Re:bad design by QuoteMstr · · Score: 1

      Your instinct is correct in that segmentation is the correct approach. But what you MySQL weenies don't appreciate is that real databases do the segmentation for you. In a real DBMS, it's perfectly reasonable to have a single table with 5 trillion rows, which the database engine will replicate, split, and partition without your having to care about that on the application layer.

      That's where the true power is a DBMS starts to become apparent.

    58. Re:bad design by vajrabum · · Score: 2, Interesting

      Bloom filters give constant time probablistic answers to set membership questions in a very space efficient manner. Moreover set union and intersection for the filters can be computed by simple AND and OR operations--also in constant time. The downside is that delete is hard. That union and intersection property means that it's easy to distribute query's over an arbitrary number of machines. Sounds kind of perfect to me for implementing a distributed index for searching, no?

    59. Re:bad design by bartoku · · Score: 1

      Thanks, someday I hope grow up to be a real DBMS hot dog!
      What about Robert Johnson's claim that there is no clear segregation? Is he wrong?
      What is the problem, Facebook should be even easier with a real DBMS and scale very nicely.
      The only problem I see is over time an individual's profile expanding, like saving all your emails form the last 10 years.
      But computer hardware advances or archiving the older less accessed stuff off to another server will help.

    60. Re:bad design by CodeBuster · · Score: 1

      Relational databases represent several decades of research into how to query data in a fault tolerant scalable way as a standing implementation, re-implementing them is a waste of time.

      Except that worthless trade rags keep publishing bullshit PR articles on "the death of SQL" and the "next great database tech" (paid for by the company plugging that next great database tech). What happens next? Some B-Level executive reads this bullshit article on his next airline flight because somebody left the magazine in the seat pocket next to the sky-mall and when he gets back the engineers have to waste a bunch of time defending proven technologies, namely relational databases and SQL, from attacks by this B-Level executive who, armed with his bullshit trade rag knowledge, accuses the engineers of "thinking inside the box". If the executive wins the argument by twisting a few arms (aka cage match negotiator) then he gets promoted and by the time everyone realizes that the next great database tech really isn't he has moved on and the engineers are left holding the bag of shit leftover from the uninformed meddling of MBA asshats.

      For those of you wondering, cage match negotiator refers to a management anti-pattern where the negotiator (the manager) takes a "win the argument at any cost" approach to dispute resolution, up to and including driving other team members off the project. The name comes from the cage match format in wrestling where multiple wrestlers enter the cage but only one exits victorious when the match is finished.

    61. Re:bad design by Abcd1234 · · Score: 1

      Every repository of knowledge can be described relationally. When a programmer claims that a problem doesn't map well to the relational model, what he means is that his mind can't transform the problem into that notation, and that's a deficiency in the programmer, not the data.

      While that may be true, it's also true that there *is* an impedance mismatch between object and relational data modeling, and as such, you have to perform some non-trivial transformations to move from one to the other. And that creates cognitive complexity, wherein bugs often lie, not to mention real computational overhead.

      As such, there's certainly room for data storage solutions which try to better fit object oriented data modeling (hell, the OODBMS has been an interesting research area for decades).

      That said, I think the backlash against SQL in general, and relational databases specifically, is more than a little absurd. RDBMS's have been around for decades, and are very good at what they do. As such, I think it's ridiculous to position the "NoSQL" approach as a replacement for RDBMS's. Rather, I see them as complimentary tools (although I happen to believe that the non-relational approach is much more of a niche solution than it's proponents would like us to believe, as I think, for most problem spaces, the relational approach works very well, and has the advantage of greater maturity).

    62. Re:bad design by kestasjk · · Score: 1

      An index is for taking an input and finding a set of records, a bloom filter takes an input and tells you whether it hasn't been given before. (And being unable to delete messages from your inbox is a bit of a showstopper.) Not sure how it applies

      --
      // MD_Update(&m,buf,j);
    63. Re:bad design by Zombywuf · · Score: 1

      Rigid schema design is a feature of the relational model + some type checking, not SQL. SQL queries can often fail with run time type errors (in all the SQL systems I've used at least), so SQL is not that rigid in its enforcement.

      I'd love to be able to use Datalog with a database (perhaps with some type of dependent type system to enforce constraints at compile time, and a pony...), but there's nothing outside academia that supports it as far as I know.

      The idea of RDBMSs being slow seems to stem from people confusing special purpose databases with general purpose key value stores. The former can achieve access times in the sub 10ms range, the latter are more in the 100ms lower bound that RDBMSs are. Also, stored procedures are kind to your query cache people, ad hoc queries are bad for performance.

      --
      If you can read this you've gone too far.
    64. Re:bad design by AvitarX · · Score: 1

      I guess my point was if you design a system that scales effectively over many small systems, it is not so important that it be particularly efficient.

      Especially if data integrity is not too important (Facebook for example can probably get away with losing .1% of status updates without upsetting anyone, and delaying the visibility of pictures to friends for a minute or so).

      Efficiency is highly over-rated I'm sure Facebook is more concerned about being able to throw twice as much hardware at the problem in a year to handle twice the users. If they could get away with 1/4 or even 1/10th the hardware, but not scale it easily it would in the end grind the entire business to a halt.

      There is a time for efficiency, but more important is scalability when your systems number in the thousands.

      Whatever the language overhead is, is irrelevant, as long as it scales linearly.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    65. Re:bad design by Anonymous Coward · · Score: 0

      Implementation.

    66. Re:bad design by Tellarin · · Score: 1

      While I do like Erlang a lot and I liked some of Facebook's design decisions I've read about, their chat is really bad.

      In my experience it is indeed unusable most of the time. And when it's working, it is just too slow.

    67. Re:bad design by dimeglio · · Score: 1

      I doubt it is only a question of bad design. SQL was introduced as a general database query language. It was designed to offer a common way of connecting to data. With generalization however comes a trade-off. You lose a bit of control over how the data is extracted or presented. The trade-off is insignificant for say 98% of applications so people are mostly happy with SQL. When you hit insane number of transactions and/or need to manage terabytes of data, you might have to consider other options. This is why, analog computers still exist or why not everyone is using a COTS as their business systems.

      --
      Views expressed do not necessarily reflect those of the author.
    68. Re:bad design by TheLink · · Score: 1

      "Rigid schema design" was one of the "problems" quoted in the story submission.

      RDBMSs are slow compared to simple key-value stuff. But they do a lot more than just key-value. And even so, stuff like postgresql can do better than 10ms for key-value especially on modern servers.

      See: http://www.kaltenbrunner.cc/blog/index.php?/archives/26-Benchmarking-8.4-Chapter-1Read-Only-workloads.html

      --
    69. Re:bad design by vajrabum · · Score: 2, Informative

      But because they can used to partition the filter accross machines in effect it can be used as index. Each machine that stores a portion of the filter gets all the queries that might apply to it and sends any results up to a machine that dispatches what you might call pre-queries on the bloom filters to the machines where the data and traditional indexes are stored. If the search vendor implements delete--google doesn't really, and this is the reason why--then you simply recompute the bloom filters when they become sufficiently out of date. That can be determined tracking how many times you get a false positive. Index lookups are slow for large data sets not because it takes that long to return an individual result but because there are so many queries. Bloom filters allow you to reduce the number of traditional index lookups and to dispatch the ones that have to be computed only to the machines where the data is available.

    70. Re:bad design by Eravnrekaree · · Score: 1

      I certainly agree. Every time we hear this NoSQL arguments and all of the disinformation from people who don't know what they are talking about, apparently dont know anything about SQL or relational databases, or dont use them properly. There is nothing in particular about SQL that is inefficient. In fact, it offers lots of opportunity to really speed things up. The multicolumn lookup on SQL does not slow down the speed of a single column lookup, so you get added benefit over a single column "nonsql" db without a downside. NonSQL brings back problems that were solved by SQL, its a regression, and without any valid reason.

    71. Re:bad design by Anonymous Coward · · Score: 0

      the "No" in NoSQL stands for Not Only, not No, as in none whatsoever

      That's possibly the stupidest thing I've heard all day.

    72. Re:bad design by Anonymous Coward · · Score: 0

      Wow, you can insult people with thesaurus words like puerile, you must be better than them. Way to really engage in a discussion.

    73. Re:bad design by kestasjk · · Score: 1

      So they might use a bloom filter as a way to find which machine is hosting a message, sending the message id to each machine for lookup, periodically rebuilding the filters from scratch by iterating over every message in the system?
      Of all the ways to subdivide id-indexed data among a distributed system that is the most peculiar suggestion I've heard.

      And like most CS people I am mysteriously desperate to find uses for Bloom filters. A partitioned ldap directory just isn't as neat

      --
      // MD_Update(&m,buf,j);
    74. Re:bad design by GaryOlson · · Score: 1

      Impressive, SQL humor. Do I get to SELECT * FROM Humor?

      --
      Every mans' island needs an ocean; choose your ocean carefully.
    75. Re:bad design by shutdown+-p+now · · Score: 1

      I'm a terabyte sized binary blob, you insensitive clod!

      In my experience, terabyte sized binary blobs in database usually contain XML (with no schemas, because XML is designed such that it's simply impossible to write a schema for one - like, say, encoding information in element names themselves).

      If you're indeed a piece of XML, I truly feel sorry for you - you're quite angular, very fat, and very redundant.

    76. Re:bad design by shutdown+-p+now · · Score: 1

      Every repository of knowledge can be described relationally. When a programmer claims that a problem doesn't map well to the relational model, what he means is that his mind can't transform the problem into that notation, and that's a deficiency in the programmer, not the data.

      The third option is a deficiency in the model. It's true that anything can be mapped to relational, but not anything maps well. You can also do business modeling in Brainfuck, but I've yet to meet anyone who thinks it's a good idea. And yet we keep trying to shove all kinds of data into relational...

    77. Re:bad design by KiwiSurfer · · Score: 1

      Even MSN is more reliable. At least it doesn't lose random messages, drop connections every 5 mins, go offline randomly, etc like Facebook chat does on a very regular basis (read every time I try and talk with someone on it). MSN does that every once in a while (read: once a month or so).

    78. Re:bad design by MikeBabcock · · Score: 1

      Why is that a design problem? The whole point the RDBMS proponents have made all these years is precisely that -- all the messages are in a 'messages' table, indexed by an id, with 'from' and 'to' fields and (in Facebook's case), a thread-id of some form probably.

      Then when you want your message list, you do something like:

      SELECT Date,Name,Subject from Messages where Messages.UserID = 37173 ORDER BY Date DESC LIMIT 10;

      or if your workload shows it makes more sense, you create a view based on the above and cursor your way through it.

      That said, I use flat DB files through BDB and CDB every day, filesystem based storage algorithms as well as real SQL and an MVDB as well. Each has its benefits under specific circumstances.

      --
      - Michael T. Babcock (Yes, I blog)
    79. Re:bad design by MikeBabcock · · Score: 1

      I have that happen quite regularly as well. I often click one of the tabs beside a friend's name and have nothing happen at all, or have the wrong data appear. Its great fun when "Info" shows up after clicking "Photos".

      Obviously they're doing a better job of handling that amount of data than most people would be, but not as good at it as say VISA is.

      --
      - Michael T. Babcock (Yes, I blog)
    80. Re:bad design by Anonymous Coward · · Score: 0

      another armchair engineer. you're a moron.

    81. Re:bad design by Anonymous Coward · · Score: 0

      They use bloom filters for messaging? What for?

      Damm you NVidia, shaders are like crack to developers...

    82. Re:bad design by vajrabum · · Score: 1

      Bloom filters like perfect hash functions are fun so I understand your desire to use them. I think I didn't reinvent this architecture. I seem to remember reading a discussion like this one where some google folks sketched out something very like this query's happens in parallel. The bloom filter is in essence a concordance or book index. Instead of a page number you end up with a CPU number. An almost necessary refinement is to insert every word you find on the disk into the filter except words like the since those sorts of words show up in every email or document. Really parallelising things is hard. If it matters for performance you almost always have to do it by developing a parallel algorithm specific to the problem from the beginning instead of relying on the hardware, OS or compiler to do the dirty work for you.

    83. Re:bad design by rescendent · · Score: 1
    84. Re:bad design by Anonymous Coward · · Score: 0

      wow

    85. Re:bad design by lennier · · Score: 1

      "What makes you think that relational calculus can't be extended to support spatial information "

      Isn't "extend" a word that makes mathematicians cringe, because it's indistinguishable from "completely break all the algebraic properties"?

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
    86. Re:bad design by nairbv · · Score: 1

      no... it wouldn't "look through" all 50 TB of data. Actually reading all 50 TB of data would make it take a long long time to see your inbox.

      I don't know what their architecture is exactly, but they probably something like $mysql_host = 'sqlhosts' + ($friend_id MODULUS 1000);

      Sure, 100 friends might be on 100 different servers. That doesn't mean they need to scan through every row checking to see if it's a friend of yours. It's the same thing as using indexes, but distributed.

    87. Re:bad design by thethibs · · Score: 1

      You're confused because you are thinking in SQL terms. When you open your inbox, in Facebook, it doesn't search at all; it follows the link to your inbox.

      Think of it this way: In a relational database, you find Alice's marbles by asking every marble in the universe whether it belongs to Alice. In an OO, hierarchical, or key/value database, you ask Alice for her list of marbles.

      Codd's model was designed without regard to speed; the design imperatives were integrity and minimal size, in that order. We no longer have expensive and unreliable storage to worry about and most of our applications are tolerant of temporary inconsistencies.

      This is not new. Seven years ago I built a system whose only viable database solution involved three database instances with references going between tables in different instances (dropping referential integrity over the side). I was forced by "corporate architecture" to use a relational database product. The solution with something like BerkelyDB would have been a whole lot simpler and a whole lot faster in construction, deployment and production.

      On another project, I was dealing with an anal-retentive DB group who were very slow delivering the database. As a stop-gap, we built a layer below the objects that used the file system. Directories for classes, one file per object. The fun thing about this is that the system ran between 3 and 4 times as fast with the file system than it did with the eventual Oracle DB. You can blame object-relational impedance for some of that but not all.

      --
      I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
    88. Re:bad design by thethibs · · Score: 1

      Relational databases represent several decades of research into designing and building relational databases.

      There are other kinds of databases, and problems to be solved that aren't easily modeled as tables. There's a reason for the agony over "object-relational impedance mismatch." The common answer--to build object models that are restricted to looking like relations (findByXXX is the smoking gun) gives up much of the power of objects.

      Relational databases solve problems that look like relations. That's a subset of all the problems to be solved--and it's shrinking.

      --
      I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
    89. Re:bad design by mudshark · · Score: 1

      Damn straight...I had been using Skype chat for at least a couple of years before trying FB chat. I don't think Skype's implementation is anything special, but it's pretty solid and utilitarian enough that in spite of our location at the end of one of the world's longest undersea tethers, the Southern Cross Cable, I take it for granted (my wife and I use it to hail one another across the floors of our house). FB's is abysmal by comparison.

      --
      In other news, astrophysicists have announced that they now know what all that dark matter is: it's stupidity.
    90. Re:bad design by JDHowells · · Score: 1

      "Isn't "extend" a word that makes mathematicians cringe" ... Yeah, but it's every Bubble Blowing Dinosaur's dream.

    91. Re:bad design by Zombywuf · · Score: 1

      Er no. Adding spacial data is a completely reasonable addition to the relational calculus, it doesn't even change any of it's properties. One requirement of the relational calculus is the ability to form a subset by a predicate i.e. {x | x in X /\ P(x)}. For spacial data the predicate is just selection over an area. No breaking of the calculus, no changes to it.

      --
      If you can read this you've gone too far.
    92. Re:bad design by Zombywuf · · Score: 1

      Show me your data which is hard to model as a collection of sets.

      --
      If you can read this you've gone too far.
    93. Re:bad design by cheekyboy · · Score: 1

      arggg

      look, the reason people write their own stuff, is because the normal stuff wont scale for the same price.

      A custom solution could use 5x less disk space, so if its 50tb compared to 250tb, then one is cheaper
      to keep, backup, verify.

      And bottom line is, managers count money cost first.

      Yeah Oracle can do the same as xyz custom solution, but it would take 5x disk space, 10x more servers, require 15x more POWER.

      50tb can easily bit in one rack fridge, but 250 is starting to require more space/power/yearly costs, hd replacements.

      Its like saying VBasic can still be used to write a quake game, sure but it would be 1fps.

      Maybe its just too easy to make the wrong design/solution in SQLdb, and takes real talent to do it right.

      Not everyone has a spare 50tb to play with to 'try things', or even 250 if 50 isnt enough.

      Maybe its database sql gurus getting scared of loosing that $$$ contract design work to old school programmers.

      --
      Liberty freedom are no1, not dicks in suits.
    94. Re:bad design by Zombywuf · · Score: 1

      Writing a custom solution costs money. Or do you work for free?

      --
      If you can read this you've gone too far.
    95. Re:bad design by Anonymous Coward · · Score: 0

      When was the last time you tried to use Facebook or Facebook chat and didn't get failed transport requests, unsent chat messages, unavailable photos, or random blank pages?

      Let's see...yes, never. So far.

      Anecdotal "evidence" isn't.

      Note: In neither direction. That is has never failed for me doesn't mean it hasn't failed for you. But the point for you to realize is that just because it has failed for you does not mean it necessarily has had to fail for anyone else.

    96. Re:bad design by Anonymous Coward · · Score: 0

      A more relevant question is actually "when was the last time your business failed b/c you were on FaceBook and got failed transport requests, unsent chat messages, unavailable photos, or random blank pages?"

    97. Re:bad design by convolvatron · · Score: 1

      actually, fuck off. language design people have distilled the semantics of sql down to something with is actually composable, compilable, able to be evaluated in a distributed context, and actually..useful. datalog. mercury.

      sql is a festering sore. its a poorly conceived idea that for random reasons survived

    98. Re:bad design by Zombywuf · · Score: 1

      I'm confused, are you agreeing or disagreeing with me? (Or just drunk?)

      --
      If you can read this you've gone too far.
    99. Re:bad design by moosesocks · · Score: 1

      Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation".

      You bring up an excellent point. I've been using Facebook since shortly after its launch in 2004, and can't remember any downtime over the course of its (very impressive) growth.

      I don't know of any other site with a track record like that. Even GMail has had a few (fairly severe) outages over its history.

      --
      -- If you try to fail and succeed, which have you done? - Uli's moose
    100. Re:bad design by Pseudonym · · Score: 2, Interesting

      Bloom filters are not as useful as they once were for large-scale indexing. As memory sizes increase, the tradeoff between precision and space efficiency changes. It's just as easy to distribute a hash table or a radix trie across multiple machines these days.

      A more common modern use is when you have data which is logically tabular, with potentially many "columns" which can contain arbitrary-sized objects, but the table is expected to be sparse. Traditional SQL table representations rely on predetermined maximum sizes for data values to optimise their representation, which is inappropriate for this because it would waste space. However, you also don't want to waste time accessing disk to find that a value isn't there. Using a Bloom filter costs a small amount of space (enough to fit in a small "descriptor") but can potentially save a huge number of disk seeks.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    101. Re:bad design by Ragzouken · · Score: 1

      But I already know it fails for anyone else because I am friends with other Facebook users.

  3. hmm by buddyglass · · Score: 4, Insightful

    With regard to scalability, it strikes me that the problem isn't so much SQL but the fact that current SQL-based RDBMS implementations are optimized for smaller data sets.

    1. Re:hmm by phantomfive · · Score: 5, Insightful

      The biggest problem is the cloud. A lot of cloud APIs don't allow full relational database access, so now it seems we are coming up with all these justifications for why we don't really need it. Notice that this blog is from a company pushing a cloud based solution.

      --
      Qxe4
    2. Re:hmm by MightyMartian · · Score: 4, Insightful

      That's my take as well. We have these crippled semi-databases that lack a lot of useful features that anyone that has used RDBMSs over the last few decades have gotten used to, so suddenly it becomes a justification game; "Well, SQL doesn't deliver the output we need, so here's some half-way-to-SQL tools which are really better, kinda... oh yes, and Netcraft confirms it, SQL is dying!!!!"

      I have a feeling that this part hype, part inept programmers who don't actually understand SQL, or database optimization. The first sign for me that someone is selling bullshit is when they try to act like this is some never before seen problem, when in fact there is a good four decades of research of database optimization.

      --
      The world's burning. Moped Jesus spotted on I50. Details at 11.
    3. Re:hmm by KalvinB · · Score: 4, Insightful

      For the vast majority of use cases, large data sets can be made logically small with indexes or physically small with hashes.

      If you're dealing with massive data you're probably not dealing with complex relationships. E-Mail servers associate data with only one index: the e-mail address. Google only associates content with keywords. E-mail servers logically and physically separate email folders. Google logically and physically separates the datasets for various keywords. So by the time you hit it, it knows instantly where to look for what you want. You don't have a whole complex system of relationships between the data. It looks at the keywords , finds the predetermined results for each and combines the results.

    4. Re:hmm by Prof.Phreak · · Score: 2, Interesting

      Depends. We've been using Netezza with ~100T of data, and... well... it takes seconds to search tables that are 30T in size. I'd imagine Teradata, greenplum and other parallel db's get similar performance---all while using standard SQL with all the bells and whistles you'd normally expect Oracle SQL to have (windowing functions, etc.).

      --

      "If anything can go wrong, it will." - Murphy

    5. Re:hmm by Anonymous Coward · · Score: 0

      Well, that and the fact that relational/transactional guarantees don't scale particularly well, especially if you want to cluster your solution.

      Sometimes the only way to get the performance you need is to change the model.

    6. Re:hmm by Firehed · · Score: 1

      What now? The problem is that relational databases suck at scaling, and as a result we have to come up with absurd hacks like sharding to fix problems that are the fault of the storage engines (if the engine has to do that to not fall apart when dealing with large datasets, fine; but that should be entirely behind-the-scenes and transparent to the application). If these various NoSQL tools are faster than traditional databases and your data isn't particularly relational, then great! But I'd much rather see effort put into solving the lack of horizontal scalability associated with relational DBs especially since they have no problem accommodating non-relational data.

      Amazon offers full MySQL cloud-based hosting (in addition to their own simpledb stuff), and every managed cloud platform I've looked at also uses standard SQL DBs (MySQL, postgres, etc.). Then again, the term "cloud" has really just come to mean "someone else's datacenter" which includes pretty much every casual web hosting plan on the planet and plenty of higher-end stuff as well. The only problem with the cloud is that the term is wildly overused.

      --
      How are sites slashdotted when nobody reads TFAs?
    7. Re:hmm by Tablizer · · Score: 1

      One can skip transactions for speed in some RDBMS. MySql grew in popularity largely because it has an engine choice that excludes transaction handling, making it faster at the expense of the risk of incomplete or inconsistent data. It's a matter of choosing your trade-offs.

    8. Re:hmm by mzito · · Score: 4, Insightful

      Uh, no, that is not correct. Relational DBMSes such as Oracle, Teradata, DB2, even SQL Server are all designed to scale into the multi-terabyte to petabyte range. The issue is one of a couple of things:

      - Cost - "real" relational databases are expensive. I once had a conversation with someone who worked at Google, who talked about how much infrastructure they have written/built/maintain to deal with MySQL. Many of those problems were solved in an "enterprise" DBMS 3-10 years ago. However, the cost of implementing one of those enterprise DBMS is so high that it is cheaper to build application layer intelligence on top of a stupid RDBMS than purchase something that works out of the box
      - Workload style - most of the literature around tuning DBMS is for OLTP or DSS workloads. Either small question, small response time (show me the five last things I bought from amazon.com) or big question, long response time (look through the last two years worth of shipping data and figure out where the best places to put our distribution centers would be). Many of these workloads are combos - there could be very large data sets and complex data interdependencies, with low latency requirements. It may be possible to write good SQL that does these things (in fact, I know a couple luminaries in the SQL space that will claim just that), but the community knowledge isn't there.
      - Application development - when you're building your app from scratch, you can afford to work around "quirks" (bugs) and "gaps" (fatal flaws) to get what you need. This dovetails with the other issues, but when your core business is building infrastructure, it's worth your while to deal with this. When your core business is selling insurance or widgets, or whatever, it is not.

      None of this is to say that the "nosql" movement is a bad thing, or that there's no reason for its existence, or that no one should bother looking at it. However, there is a definite trend of "this is so much better than SQL" for no good reason. SQL has scaled for years, and I know loads of companies who work with terabytes and terabytes of data on a single database without any issue.

      A far more interesting discussion is the data warehouse appliance space - partitioning SQL down to a large number of small CPUs and pushing those as close to the disk as possible.

      --
      me@mzi.to
    9. Re:hmm by buchner.johannes · · Score: 4, Interesting

      The first sign for me that someone is selling bullshit is when they try to act like this is some never before seen problem, when in fact there is a good four decades of research of database optimization.

      Your point is valid, but I think there is more to it. And the problems these solutions try to solve are quite old too. For example:

      Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to not deleting (ever), and 'deleted' flag fields and other uglyness. A multi-version relational database would be nice, you actually don't need modification/delete operations in this scenario, just 'updates' that add to the previous status. CouchDB does append operations.

      In some cases you may not need a complete SQL database, just key->value relations, but have them scaling very well. http://project-voldemort.com/ states: "It is basically just a big, distributed, persistent, fault-tolerant hash table." Then they state that they provide horizontal scalability, which MySQL doesn't (OTOH, we should really look at Oracle for these things).

      And you can't really say MapReduce/Hadoop is pointless.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    10. Re:hmm by FlyingGuy · · Score: 1

      Well said.

      Lack of imagination is another problem when people look at the dry subject of data storage and retrieval. Yes at some point the math has to work, but in the meantime looking outside the box and taking an RDBMS in different and new directions within the engine itself can give those mountains of scale that are the holy grail these days.

      While I was writing that paragraph I thought it might be interesting to take a structure such as an n-node b-tree and then make the leaves table references and the nodes pointers to those tables and overlay that structural idea into a single table. That single table could then be queried to gain access to a very interesting amount of data. Now if the leaves are not required to be unique, then the entry that would normal be a record pointer could then be an instance pointer, or a server pointer, cluster pointer and simply take the original query and have the result be returned from just about anywhere.

      Ehh someone undoubtedly thought of that already...

      --
      Hey KID! Yeah you, get the fuck off my lawn!
    11. Re:hmm by Anonymous Coward · · Score: 0

      If you're talking about table scans, it depends on how much money (and operational expense - space, maintenance, power) you spend. For a fixed time requirement for a scan of a lot of data, Teradata's probably going to cost you both arms and legs compared to Netezza which in turn is probably going to cost you an arm and leg compared to Greenplum which in turn might cost you an arm compared to Paraccel which, in turn might just cost you your little finger (depending on the nature of the constraints).

      If your data is sufficiently static that you can afford to maintain a lot of indexes, the number of limbs you have to sacrifice is probably more similar across the products.

      Of course, "seconds" is an eternity in a consumer facing web app.

    12. Re:hmm by countach · · Score: 1

      "But I'd much rather see effort put into solving the lack of horizontal scalability associated with relational DBs"

      I think I'd rather see the opposite: That non-relation DBs become the mainstream, and they have SQL added for the odd occasion it is useful. Relational has some nice properties for ad-hoc querying, but for everything else they are a nuisance.

    13. Re:hmm by PhrostyMcByte · · Score: 1

      There's a reason Google, Amazon, and Microsoft all designed their cloud databases without SQL -- it has a lot of features that don't scale well when your data spans a crap ton of servers. Imagine a website that does several JOIN queries for each page view -- now if you've got data spanning 50 servers, that's a hell of a lot of I/O that will be very hard to scale. When you take out these extra features, you end up not having much more than the basics -- usually just a simple insert, update, and delete with very limited transaction support. Without all the bells and whistles, there's no point in pretending you support SQL anymore.

      Think of the current concurrency push, and how most people consider it very challenging to do correctly, and are hoping some magic silver bullet comes along to fix it all. The old designs don't translate to multi-core very well. This is the same thing. Cloud databases require a fundamental redesign of how you structure and query your data. The only difference is, concurrency can usually be added in steps. With Cloud databases, it's usually all-in or nothing with no hand holding.

      I think a bigger problem is that the name "NoSQL" sounds like a direct attack at SQL. Really, it's just a stupid name for a larger mantra that most programmers try to abide by: use what's best for the job. SQL is great for a lot of jobs, but not all. Maybe someone will come along with that "magic silver bullet" for SQL that makes it super-scalable in all situations, but until then, we've now got more options for when it's not.

    14. Re:hmm by phantomfive · · Score: 2, Interesting

      Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to not deleting (ever), and 'deleted' flag fields and other uglyness.

      I did it by every time I did an INSERT, DELETE, or UPDATE query, taking an exact copy of the query and dumping it into a special table in the database (along with a stack trace of where it was called from). To reconstruct I could just run those commands straight from the database, to whatever point was desired. It was simple, straightforward and efficient, although I'm sure someone else has a better idea.

      --
      Qxe4
    15. Re:hmm by jilles · · Score: 1

      That's just another way of saying sql databases are a poor match for the requirements big websites face. SQL databases used at scale almost always throw characteristic features like transactions, joins, or even ACID out of the window in order to scale. Once you start doing that, SQL databases just become a really complicated way to store stuff. The one database that is really popular on big websites is mysql, which started out its popularity as a non transactional database. While most common features have been slapped on since, the proper way to use mysql at large scale still involves not relying on those.The way sites like facebook, ebay, etc. use it is as a dumb key-value store. Apparently, Amazon does not use database transactions. That's pretty steep for a billion dollar+ revenue ecommerce site that handles millions of financial transactions per day. I'm pretty sure they'd use database transactions if it was feasible to do so. Instead they handle transactions at the application level.

      So the problem is not APIs but the fact that the underlying technology can't live up to the requirements. Never mind what is possible in theory because that's not worth shit in practice. Now there are several non sql storage systems under development with different designs that are designed from the ground up to be scalable and have all sorts of desirable qualities regarding data integrity and a growing number of people relying on them in real world situations.

      --

      Jilles
    16. Re:hmm by CaptainZapp · · Score: 2, Insightful

      I have a feeling that this part hype, part inept programmers who don't actually understand SQL, or database optimization. The first sign for me that someone is selling bullshit is when they try to act like this is some never before seen problem, when in fact there is a good four decades of research of database optimization.

      Thank you very much for this comment, you put it far more eloquently then my venting, I just wanted to grace this thread with. The real kicker though is

      There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.

      This statement is just so full of shit. And the real larff riot, for me at least, is when people or shops employing MySQL (for heavens sake!) make such statements.

      Ej, folks: Rigid schema design is an asset, not a liability!

      --
      ich bin der musikant

      mit taschenrechner in der hand

      kraftwerk

    17. Re:hmm by QuoteMstr · · Score: 2, Informative

      I don't think you've thought clearly about the problem.

      If a JOIN is causing problems because it's causing too much non-local data access, then you're going to run into the same problem when you re-code the JOIN in the application. In fact, it might hit you worse because you won't benefit from the database's query optimizer.

      The solution is clearly to improve locality of reference. You can do that by duplicating some data, denormalizing the database, and so on. But you can do all those things just as easily within a RDBMS, and without losing the other benefits a RDBMS gives you.

      Really, your problem is that some of the things RDBMSes allow hurt when a database grows beyond a certain size. The solution is to not do the things that hurt, not ditch the things that RDBMSes do allow.

      It's like complaining that your feet are sore if you walk 20 miles, then cutting off your leg to make it stop.

    18. Re:hmm by QuoteMstr · · Score: 4, Insightful

      I think I'd rather see the opposite: That non-relation DBs become the mainstream, and they have SQL added for the odd occasion it is useful. Relational has some nice properties for ad-hoc querying, but for everything else they are a nuisance.

      Berkeley DB is a very good non-relational database with multiple language bindings, several storage engines, and transaction support. It's been around for 24 years, and has seen some appreciable use.

      But that use was nothing compared to the database explosion that SQLite brought about when it was released. SQLite is almost exactly like Berkeley DB, except that it has a SQL engine on top. Almost everyone is using SQLite, and many Berkeley DB users are moving over to it.

      Why? Because SQLite is relational! That constitutes some serious evidence that relationship databases are more than "a nuisance".

    19. Re:hmm by Anonymous Coward · · Score: 0

      Or it could be people who have to deal with far, far more data than any RDBMS is actually capable of handling. Dismissing everything here as "part hype, part inept programmers who don't actually understand SQL, or database optimization" is simply stupid - do you honestly think that Google, Yahoo, Facebook, eBay, Microsoft, and so on don't have one person between them who understands SQL?

      Seriously - show me an RDBMS that can handle 50TB of index data without needing to resort to sharding. If you have to resort to sharding, you lose all the features that make an RDBMS worth using in the first place.

    20. Re:hmm by xtracto · · Score: 1

      I find the NoSQL appraoch stupid.

      Although as they say in TFA, NoSQL is more of a "Not Only SQL" more than "No Sql at all", I think people should make a difference between the "Structured SQL Language" ;-) and the implementation.

      SQL is only a language ("artificial language designed to express computations that can be performed by a machine, particularly a computer"), used to query data from some place.

      I am sure that a lot of the problems the guy is talking in his blog can be solved by using the appropriate backend technology. There is stuff like RAID for data replication, faster access, etc. There surely are other solutions for transparent distribution of data (say, a layer between the MySQL driver and the actual data) and the like.

      But that does not mean that on the client side we should stop doing a "SELECT * FROM MY_FRIENDS WHERE SEX='F'".

      --
      Ubuntu is an African word meaning 'I can't configure Debian'
    21. Re:hmm by tkinnun0 · · Score: 1

      Isn't that the point, though? Drop all the features of SQL databases that hinder scalability and you end up with NoSQL.

    22. Re:hmm by h4rm0ny · · Score: 2, Interesting

      It was simple, straightforward and efficient, although I'm sure someone else has a better idea.

      I'd love someone to post it if they do. We use the same method and the one time we had to replay the sequence to get what we wanted, it took most of a day. Yes, that was because are last snapshot "starting point" was nearly a week old, but nonetheless... if technology has moved on and there's a better way of doing this, then I'm sure a lot of us will be interested.

      --

      Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
    23. Re:hmm by Anonymous Coward · · Score: 0

      That is the main problem with this no-sql hype. I worked a while in a small company doing SaaS, with buzzwords "cloud" and "scalable" repeated in their website. They were so full of themselves that they can make "scalable" solutions, that if you wanted to talk about SQL you were mocked. Quite truly, you were not allowed to talk about SQL, or any query formalism. It was a curse word.

      Their product and customers so far only basically needed a web portal for a hash table: you give a key, or a range of keys as they are actually timestamps, and the server fetches you a page with the values. Once you login that identifies the table from which the keys you are going to ask might be and every customer usually had only 1 person who might be asking these values. They didn't even have much parallelism that multiple people are going to ask the same values. So you could just put up a server for every customer. Well they did manage a small group of servers to store the actual data for each customer, but we're still talking about a handful of dedicated machines.

      It was nothing like Facebook or Ebay. Basically the simplest kind of a storage problem you can imagine with just enough volume that a single Microsoft SQL database cannot handle it, and the solution was to restrict the users to be able to do only one simple query at a time.

      For them this was a ground breaking discovery and they considered themselves to be a gift from god. Their CEO actually told several professors of computer science that they should concentrate on scalable systems and that they can offer some expertise on that. The same CEO also thought that their inventions were so staggering that it should not be a problem to write a few articles and get them published. Needless to say that this did not happen quite that easy.

    24. Re:hmm by PhrostyMcByte · · Score: 1

      These databases are all schemaless, so it's not like they could use an established RDBMS... they had to make something new. I was trying to give an example of why they wouldn't bother implementing something like JOIN in something new that they planned to scale from the beginning.

      I'd actually be pretty stoked to find a schemaless DB that uses mostly standard SQL, because I find schemaless makes a lot of sense for use beyond cloud-level scalability. Most of the time they let you store things in a more natural, accessible way than what tables provide.

      But a big benefit they have for scalability (and I suspect the reason those companies went schemaless) is that if you decide you want to change your application-defined object schema, you don't have to copy everything to a new table or ALTER the existing one. If you've got terabytes of data, it can be well worth it to simply keep code around that knows how to read the old object type and update them on demand.

    25. Re:hmm by Hognoxious · · Score: 3, Funny

      Rigid schema design is an asset, not a liability!

      Not to people who think a free format text field is the ideal place to store the price, quantity and delivery date of an order. Why not, it's long enough for it all to fit. And it saves all that moving between fields.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    26. Re:hmm by QuoteMstr · · Score: 1

      There exists some schema for every data set. The lack of a schema is (being generous) actually a lack of imagination and ingenuity on the part of the programmer.

      you don't have to copy everything to a new table or ALTER the existing one

      How is that any better than creating a second table with the new schema, and slowly (or lazily) migrating records from the old to the new table?

    27. Re:hmm by tkinnun0 · · Score: 2, Insightful

      What if you ARE dealing with massive data AND complex relationships?

    28. Re:hmm by sohp · · Score: 1

      Thank St. Codd we are finally getting away from the relational database dogma. I've seen more companies and projects crippled or completely killed by DB architects trying to enforce the One True Schema than just about any other idiocy in IT.

    29. Re:hmm by sohp · · Score: 1

      They may have been "designed" to scale, but they were clearly never tested or even trialed under real-world conditions. Distributed replication and transactions are a complete joke for scalability. As for "Many of those problems were solved in an "enterprise" DBMS 3-10 years ago", which ones specifically? Oh, right, none of them, because "enterprise" is just marketspeak for "really expensive and complicated but we promise it'll work great if you just sign over more money".

      If I had a nickel for every organization handicapped by politically powerful but technically backwards database management people insisting that every development project be constrained to fit their idea of the "right" way to do data storage, I'd be retired and living on St. Croix.

    30. Re:hmm by Anonymous Coward · · Score: 0

      Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to not deleting (ever), and 'deleted' flag fields and other uglyness.

      I did it by every time I did an INSERT, DELETE, or UPDATE query, taking an exact copy of the query and dumping it into a special table in the database (along with a stack trace of where it was called from). To reconstruct I could just run those commands straight from the database, to whatever point was desired. It was simple, straightforward and efficient, although I'm sure someone else has a better idea.

      Er...logical log maybe?

    31. Re:hmm by chthon · · Score: 1

      Prior to the relational database, there where network databases and hierarchical databases. They must be navigated to search for data. What you propose here seems to be the same, but then mapped on top of a relational database, which makes it probably slower than the previous solutions.

    32. Re:hmm by Trepidity · · Score: 1

      Ej, folks: Rigid schema design is an asset, not a liability!

      Is that like "it's a feature, not a bug"? Sure, there are plenty of cases where rigid schema design is an asset, just like there are plenty of cases where architected-to-hell OO design is an asset. But there are cases where being able to change your damn schema without jumping through hoops is desirable, just as there are cases where you might want to be able to modify your programs nimbly.

      Oddly enough, these seem to go together--- the sorts of people using non-SQL storage also tend to use dynamic languages rather than C++ or Ada.

    33. Re:hmm by CaptainZapp · · Score: 1

      But there are cases where being able to change your damn schema without jumping through hoops is desirable

      No question. However, changing the schema became pretty straight forward with the introduction of relational databases.

      I'm not claiming that alternative storage and rerieval technologies don't have a place (see Googles Map Reduce), but what I read from proponents of such schemes usually indicates that a lot of them don't really have a clue what they're talking about.

      Alas, NoSQL certainly sounds like a cool, new buzzword

      --
      ich bin der musikant

      mit taschenrechner in der hand

      kraftwerk

    34. Re:hmm by Anonymous Coward · · Score: 0

      I thought relational databases were a "nuisance" in school too. I think most students do. Let's face it: database class is one of the least fun classes to take, especially when it's your first introduction to database theory. Before you even get to SQL, let alone touch a live database, you're worn out from all the boring math.

      But then I got a real-world job, and 12 years later, not only can I appreciate the relational database (and SQL), I honestly enjoy working as a database programmer/admin. It's actually fun stuff, especially if you work on both the backend and frontend. It's one of those things where the real world is quite different than what you learned in school.

    35. Re:hmm by geminidomino · · Score: 3, Insightful

      It's not without precedent. Drop all the features of SQL databases that make them a good idea and you end up with MySQL.

      (Burn, baby, burn)

    36. Re:hmm by geminidomino · · Score: 1

      That's just because CPU architects haven't been able to perfect the LART instruction yet. Current RFCs suggest 40vDC into the keyboard.

      (Hey, if Microsoft can get that fucking "Windows Key" added to every goddamn commodity keyboard in the world...)

    37. Re:hmm by geminidomino · · Score: 1

      But that does not mean that on the client side we should stop doing a "SELECT * FROM MY_FRIENDS WHERE SEX='F'".

      0 rows found.

    38. Re:hmm by Anonymous Coward · · Score: 0

      >A lot of cloud APIs don't allow full relational database access,

      if you mean stuff like this, its for the good reason that yr data might actually be all over the place. cloud data access strategies are necessarily different.

    39. Re:hmm by MyDixieWrecked · · Score: 1

      I have a feeling that this part hype, part inept programmers who don't actually understand SQL, or database optimization.

      This is part of the problem... similar to PHP, most people learn some examples that teach some bad habits right off the bat (sticking SQL in your view, etc) because it's so easy to get started, but you've gotta get a grasp on the tech before you can do anything big.

      Also, I feel that one of the root causes of the hype is that SQL and RDBMSs in general don't solve all your problems and sometimes get in the way of your application design. Between rigid schema definitions and the SQL language that has a bit of a learning curve when you start dealing with nested queries and handling shards/partitions/etc, I think that's the reason we're starting to see more non-RDBMS databases.

      At work, we had a project that we started building on MySQL, but was falling short because we were constantly making schema changes. We begun to build a system where we could have arbitrary attributes attached to arbitrary objects, but then our queries were getting REALLY nasty. We discovered MarkLogic which is an XML database server and uses XQuery to query the data. We were ingesting around 100MB of XML a day, and we needed to be able to handle just about any XML that went into the system. MarkLogic was a natural fit since we needed to put XML in and we wanted XML out most of the time.

      We're still using MySQL for tracking the ingestions and managing the frontend to the system (which is built on Rails), but having XQuery at our fingertips has been a godsend.

      There's a lot to be said about new technologies that solve needs and get around shortcomings of the more ubiquitous technologies, but, as with anything that people see as a solution, it's not a silver bullet. You've gotta be careful not to get trapped in "everything looks like a nail" syndrome.

      --



      ...spike
      Ewwwwww, coconut...
    40. Re:hmm by gpierre · · Score: 1
      Actually, consistency does have a performance cost (i.e., the throughput you get from a given setup) but it does not have to hinder scalability (the ability to grow the maximum throughput by adding extra hardware). Scalaris offers transactional consistency, for example. If you prefer using an existing no-SQL database then you can impose transactional consistency on top of it without breaking scalability properties.

      Disclaimer: I co-authored the second link of this post.

    41. Re:hmm by PostPhil · · Score: 1

      Exactly. One good example is when you look at how a SQL RDBMS indexes its data at the implementation level. Indexes are local and are concerned with local things such as the ordering of the data on the disk (e.g. clustered indexes). If you want a distributed SQL database, indexes are still typically a local operation and replication is handled at a much higher level. Much of these "NoSQL" implementations are about fast searches (requiring fast indexing) that are inherently distributed. If SQL RDBMS implementations had more sophisticated implementations that assumed the database is inherently distributed, and if the indexes were treated as such from the start, then searches would be much faster and there would be less need for "NoSQL" databases. When you realize that the relational model is based on "relations" (and in turn, set theory) you realize that there is nothing inherent about what is essentially math that prevents it from working efficiently in a distributed manner. If you can make a distributed "key-value store", you can make a distributed UNIQUE index too, etc. If speed is the ultimate concern and that is why they opt for "eventual consistency", then a special-purpose "NoSQL" database may be the way to go, for the *web* anyway. But if you need massive data *and* constant consistency (e.g. such as with financial transaction data), distributed SQL databases would be a better choice (given there is a good implementation available).

    42. Re:hmm by larry+bagina · · Score: 1
      I've been using triggers to file a copy into an audit table for years. Standard stuff. Maybe not for the average mysql guy, although mysql 5 should be able to do that kind of thing.

      Seriously, many of the "problems" aren't.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    43. Re:hmm by larry+bagina · · Score: 3, Informative

      create post insert, update, and delete triggers which file the data (as well as the action, timestamp, and user) into an audit table.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    44. Re:hmm by mzito · · Score: 2, Informative

      I sort of agree with you, from the perspective that there's crusaders on either side - people who insist that traditional RDBMSes are the Only Way and people like you who insist they've "never been trialed under real-world conditions". Both statements are clearly incorrect on their face.

      However, there are a multitude of features that these systems have that are not available in NoSQL systems, or only available in such a watered down form that its unfair to compare the two. A list:

      - On-disk encryption
      - Compression
      - Schema/data versioning (present one picture of data to one set of clients, while presenting another layout of the same data to another set during a data migration)
      - Automated failover between servers, clusters, facilities, datacenters
      - "Flashback" - say "I want to run a query against my data as it looked last week at 3pm", and it just works.
      - Shared-disk clustering

      As far as transactions go, they may be a "joke" for scalability (not quite sure what that means), but they're awfully useful when dealing with sensitive information you need ACID compliance for. For example, I would prefer my bank not use an "eventual consistency" model when dealing with my credit card transactions.

      Now, as I said above, a relational database *may not* be the right decision for your application. But the idea that relational databases don't scale is ridiculous. I've seen petabyte datawarehouses running teradata that absolutely scream through data. I've seen Oracle systems that do 10s of thousands of write transactions per second, and several times that in reads. They exist.

      --
      me@mzi.to
    45. Re:hmm by Anonymous Coward · · Score: 0

      I spend all day working with a non-RDBMS DB. It's faster and more powerful than SQL, but at the cost of far more programmer time and much more difficult maintenance. There's also an SQL front-end and, when possible, I use it.

    46. Re:hmm by Anonymous Coward · · Score: 0

      Most of the modern relational database engines can do transparently partitioning, which splits the tasks among a larger amount of servers. I can EASILY scale my databases upto terabytes and beyond. But then again, I know what I am doing instead of these noSQL folks.

    47. Re:hmm by QuoteMstr · · Score: 1

      these noSQL folks

      Movements founded based on opposition to a technology or idea never succeed in the long run. A sure sign of organizational rot is a name in the form "noX".

    48. Re:hmm by awol · · Score: 1

      I looked hard at netezza for a big project with "absurd" requirements (many 10^4 new records per second, ad hoc queryable by clients). It seemed to be the ideal solution. Nice to see it might have worked. How fast does your data grow?

      --
      "The first thing to do when you find yourself in a hole is stop digging."
    49. Re:hmm by mbourgon · · Score: 2, Insightful

      Mod parent up. That way you're not dealing with the statements themselves, just the data. And you can add the UserID to the Audit table - then find the most recent row for that particular person, or get the most recent row for each ID and apply that.

      --
      "Sometimes a woman is a kind of religion, she can save your soul & set you free from all your sins" - Bad Examples
    50. Re:hmm by xtracto · · Score: 1

      But that does not mean that on the client side we should stop doing a "SELECT * FROM MY_FRIENDS WHERE SEX='F'".

      0 rows found.

      OMG this guy hacked into my facebook !

      --
      Ubuntu is an African word meaning 'I can't configure Debian'
    51. Re:hmm by oh_my_080980980 · · Score: 1

      And from a clod to boot!

      There's nothing wrong with RDMS. The problem is that people are poorly trained in the technology to properly implement them. Because of this, jackasses like him spout their voodoo to fix a "supposed" problem.

    52. Re:hmm by bjd145 · · Score: 1

      I think you hit the nail on the head with your post. Cost is the exact reason why this NoSQL movement came up. Look who are pushing it the most. Internet companies. When they started to grow, do you think they had millions in license fees to Oracle or Microsoft? They all went with MySQL to start with and then started to have growing pains as their datasets grew larger. These internet companies learned from their Web 1.0 brothers. Its better to pay a few highly skilled engineers to come up with some new technology than to pay millions in OS and DBMS license fees.

      To me this reminds me so much of the Unix (Solaris, AIX, etc ) vs Linux debates in the late 90s and early 00s.

    53. Re:hmm by Just+Some+Guy · · Score: 1

      I did it by every time I did an INSERT, DELETE, or UPDATE query, taking an exact copy of the query and dumping it into a special table in the database (along with a stack trace of where it was called from). To reconstruct I could just run those commands straight from the database, to whatever point was desired. It was simple, straightforward and efficient, although I'm sure someone else has a better idea.

      Here's the same thing in Python and CouchDB:

      #!/usr/bin/env python

      from couchdb.client import Server

      # Connect to the server
      server = Server('http://db1:5984/')

      print 'databases:', list(server)

      # Connect to a database on the server
      db = server['example']

      print 'documents in the example database:', list(db)

      # Create a new record, store it, then make a few modifications and
      # store them too.
      record = {'foo': 1}
      db['foo'] = record
      record['bar'] = 2
      db['foo'] = record
      record['baz'] = 3
      db['foo'] = record
      record['foo'] = 4
      db['foo'] = record

      # Get a list of all the revisions of our new object
      revinfo = db.get('foo', revs=True)['_revisions']

      # Turn the revision information into a list of keys used by CouchDB
      revs = ['%d-%s' % pair for pair in zip(range(revinfo['start'], 0, -1), revinfo['ids'])]

      # Print each revision so we can see the changes over time
      for rev in revs: print db.get('foo', rev=rev)

      # Clean up after ourselves
      db.delete(record)

      I'm sure this could be made easier (such as the Python client library implementing the revision-generating code as a method on the record - I'm off to submit patches now), but not by a whole lot.

      If you know a simpler way to do this with, say, PostgreSQL, then please share.

      --
      Dewey, what part of this looks like authorities should be involved?
    54. Re:hmm by emilper · · Score: 1

      I have a feeling that this part hype, part inept programmers who don't actually understand SQL, or database optimization.

      Please, can you think a way to optimize a RDBMS to hold MARC data ?

    55. Re:hmm by popeyethesailor · · Score: 2, Interesting

      Why not use the DB features? Most enterprise-y databases have PITR(Point-in-time Recovery features).. Although it's not designed for that sort of thing, it could be used in such a fashion.
      Most DBs do the same thing you guys do, i.e, use a transaction log. The transaction log could be replayed to get into a Point-in-time state. The one disadvantage is it's all or nothing i.e, you can't do it for specific transactions(although I'm sure some DBA will wander in correct me on this ;)

    56. Re:hmm by Anonymous Coward · · Score: 0

      Those people with dynamic languages need to learn that you can generate schema and new table definitions on the fly via the same SQL connection that is used to run queries. There is nothing rigid nor static about it. And if you truly need per-instance duck typing, you can use a fully normalized storage format with the same expressive power as RDF triple stores.

    57. Re:hmm by Tablizer · · Score: 1

      Because SQLite is relational! That constitutes some serious evidence that relationship databases are more than "a nuisance".

      Trivia Police here. Note that "relational" is not really about relationships. "Relational" is more or less a mathy term for "table".
           

    58. Re:hmm by Anonymous Coward · · Score: 0

      If you want an easy test to see whether a "NoSQL" type understands SQL or is just trying to cover up for a lack of ability, all you have to do is ask him to create a 3NF entity relationship model describing some real-world system, then map it to a Java object design.

      Someone whose actually got some chops will lay it out for you on paper while you watch.

      Someone who doesn't understand SQL will mutter, hem and haw, then try to justify using ORM to do something that's not even close to what you're asking for.

      This is all about lack of ability on the programmer side.

    59. Re:hmm by larry+bagina · · Score: 1

      In postgres (or pretty much every adult sql database) you can create post insert/update/delete/ triggers which automatically log changes into an audit table. On the server side. Atomically. Entirely transparent and independent of the client. That's a little easier, safer, and cleaner in my book.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    60. Re:hmm by Just+Some+Guy · · Score: 1

      I make a lot of money doing exactly that stuff with "adult" databases. I just think it's fan-freakin'-tastic that some people are experimenting with different ways of doing the same things. Best case: it catches on and everyone's life gets a little easier. Worst case: it goes nowhere and we're no worse off than we are today.

      --
      Dewey, what part of this looks like authorities should be involved?
    61. Re:hmm by Pollardito · · Score: 1

      A lot of cloud APIs don't allow full relational database access, so now it seems we are coming up with all these justifications for why we don't really need it.

      Need someone to justify the lack of a feature? There's a Steve Jobs for that!

    62. Re:hmm by PhrostyMcByte · · Score: 1

      The migration happens within the same table. In a schemaless environment, you simply store objects. JSON seems to be popular for them, so imagine you store a bunch of objects like {"Id":1,"Value":2}, {"Id":2,"Value":3}, etc.

      If you decide you want to add a new feature "foo" to your app and it needs a new attribute to work, you don't need to tell the database anything. Just keep your old code around that knows how to read the old objects, and the next time you need to update one, update it as {"Id":1,"Value":2,"foo":"bar"}. Even after this "foo" attribute is added to the first object, the second one is still perfectly valid being just {"Id":2,"Value":3}.

      It is up to the application to enforce any schema, if that's what you choose to do -- the database doesn't care what you put in it. Either way you aren't forced to make your data tabular, and that is where the big win comes along in expressing your data more naturally.

    63. Re:hmm by sohp · · Score: 1

      It's distributed atomic transactions that don't scale.

      As for the list of things you claim that NoSQL doesn't have, they are either so far outside of what a storage management system needs to do like On-disk encryption or clearly already part of some existing implementations, like automated failover between servers, clusters, facilities, datacenters, that I wonder if maybe Oracle hands out a talking points list of things to say about NoSQL technologies.

    64. Re:hmm by Anonymous Coward · · Score: 1, Interesting

      That might be part of it but the big selling point of SQlite is its license which is a lot less restrictive than Berkeley DB.

      SQLite is kind of cool but if really doesn't seem very well optimized as far as accessing the filesystem below it. Unless your whole database fits in RAM then it can really thrash your drives something terrible. Needless to say, it doesn't work well with large databases, mostly because of this. Even with lots of research and tweaking on high-end hardware I have never gotten SQLite to perform very well.

      Firebird can be embedded just like SQLite (single file database, one library to link to, etc), also has a very permissive license, has tons more features (stored procedures, etc), and performs orders of magnitude better than SQLite. Its license is also more permissive than MySQL and it has more features (but MySQL does have multiple database engines and more indexing types). I'm successfully using Firebird for terabyte size databases and it works well. I'm not sure why more people don't use it.

      (What about PostpreSQL you ask? Meh, decent license but it can't be embedded, performs better than SQLite but much worse than Firebird and MySQL; feature-wise its about on par with Firebird)

    65. Re:hmm by mzito · · Score: 1

      Well, if Oracle does hand this out, I don't get it certainly. And for the record, I think that Oracle is overpriced, and believe their pricing model is not sustainable over the long term.

      But you see my point, no? These are features that are in more traditional enterprise RDBMS systems. If you don't need them, you don't have to spend the money. If you need them, you can spend the money, or you can cobble together a system on your own. If you do cobble together a system, you're responsible for supporting that infrastructure internally.

      You've made several separate points:
      - RDBMS technology has never been "trialed under real-world conditions" and "transactions are a joke for scalability" - neither of those statements are true, nor have you backed them up with any data. A ~100TB DB2 or Oracle database is hardly impressive these days. (I'll also throw in that if ACID compliance bothers you that much, you can disable it at a session, table, or database level in an RDBMS. At least you have the option.)
      - "Distributed atomic transactions don't scale" - this is demonstrably true, but irrelevant to the point at hand. 2PC has its place, but in the RDBMS world, has been largely discarded for the exact reasons you specify.
      - "It's a good thing you have all those tools ...because you need them to make that complex pile of enterprisey spaghetti work. If only there was something out there that just worked, and didn't need all that hand-optimizing and tool-fiddling to kludge it into usability. - I think it's hysterical that you would throw this elsewhere into the thread, when you freely admit that NoSQL-esque systems are missing all sorts of features that you expect someone to implement outside of the solution.

      Let me rewrite your sentence - "It's a good thing you have all of those third-party software packages...because you'll need them to get that immature open-source key-value store to work. If only there were something out there that had a fully integrated stack of DR, HA, and management capabilities that just worked, and didn't need all of that custom infrastructure around it, and had a large community of people and knowledge who know how to manage and operate it"

      Let me restate AGAIN my point - NoSQL data stores are interesting, have their place, and will no doubt continue to grow and be an important part of the data management ecosystem. HOWEVER, to say that traditional RDBMS systems are done for, or don't scale, or are useless, displays an ignorance of reality that undermines the whole discussion.

      --
      me@mzi.to
    66. Re:hmm by Anonymous Coward · · Score: 0

      Reconstruct the modification history: Isn't that what an archive log is for? You have the ability to reconstruct a DB to a point in time. No need to prevent deletes. Or are you talking about needing a live reconstruction at all times? Can't imagine why that would be useful to have.

      MySQL is horizontally scalable. http://planet.mysql.com/entry/?id=15962

    67. Re:hmm by Anonymous Coward · · Score: 0

      Look into Change Data Capture in SQL Server 2008

    68. Re:hmm by Anonymous Coward · · Score: 0

      Even MySQL is more full-featured than most "cloud DBs" out there, which are usually little more than glorified key-value maps. I agree with the original assertion in this thread - "SQL/relational is not needed" is misinformation spread by companies selling cloud solutions to promote their products, just as "transactions are not needed" was misinformation spread by MySQL.

      By the way, RDBMS do scale well, and you can have an "RDBMS for the cloud", with SQL and transactions. There's no inherent contradiction in principles there.

    69. Re:hmm by shutdown+-p+now · · Score: 1

      There exists some schema for every data set.

      Indeed. If there is no schema at all, then you effectively have just a stream of bits. Which is definitely a valid way to store data, but how are you going to run queries over it?

    70. Re:hmm by tthomas48 · · Score: 1

      Really? I find in the real world my data almost always has a relationship and a relational database does a good job of modeling it. I can see some small webapps that wouldn't need a relational database. And perhaps something like Digg doesn't. But most of the world still deals with data that has strong relationships to one another.

      That said, I've been playing with CouchDB for a wiki type application I'm working on that allows users to add metadata to documents. And a relational database is not good for those sort of end-user driven changes.

      Too many people here who don't understand the value of both types of databases. They're both valuable. But very rarely is a nosql database a good replacement for a relational database.

    71. Re:hmm by Anonymous Coward · · Score: 0

      Unfortunately you just described the Pick database. I regularly regurgitate a little while using it.

    72. Re:hmm by countach · · Score: 1

      Yeah, because Berkeley DB is a piece of shit. I was talking about something UP the scale of sophistication, not way DOWN. Something like an advanced object database as an example.

    73. Re:hmm by Eil · · Score: 1

      SQL has scaled for years, and I know loads of companies who work with terabytes and terabytes of data on a single database without any issue.

      Whenever I hear someone say SQL doesn't scale, I simply point them to Slashdot. It's been running Perl and MySQL since the beginning. No, it's not Digg or Facebook, but it's got higher traffic than 99.99% of all new web development projects ever will see.

    74. Re:hmm by clockwise_music · · Score: 1

      Unmod parent up.

      We did this for our DB and it completely killed the server. Every
      single update also doing an insert into another table? Performance killer.

      Unfortunately there's just no other way to do it. You can either:

      1 - store the entire history of updates
      2 - store delta's
      3 - backup your DB every day.

      There's nothing else you can do. Or is there??

    75. Re:hmm by lennier · · Score: 1

      "Before you even get to SQL, let alone touch a live database, you're worn out from all the boring math."

      Grr!

      It's the math that makes programming correct. We need MORE math, not less. If you don't want to understand what you're doing and whether it's right or wrong, why do you get to do it on the Internet and to millions of other people's computers?

      Kids these days.

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
    76. Re:hmm by lennier · · Score: 1

      Relationship databases are the new 'little black book', right?

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
    77. Re:hmm by lennier · · Score: 1

      Then you're working for Total Information Awareness, and it's easily solved by [REDACTED] the petaquad antiterrorist [REDACTED] and cross-circuiting helm control to [REDACTED] [REDACTED] [REDACTED] with a squirrel.

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
    78. Re:hmm by Anonymous Coward · · Score: 0

      My question is: Why do you need to do that in your application? Where I worked last, the database did that for us.

      We kept full backups in regular intervals and in between that, the complete transaction log was saved/archived. If you needed to go back, just load an older backup and replay everything BUT the stuff you don't want. For just viewving history, triggers were inserting the current row into "protocol" tables, before doing the actual insert/update/delete. Of course, these protocol tables have to be kept clean. every month, we pushed everything older than three months into archive tables, because the triggers were taking too long to update the protocol table. That was with updating multiple tables each having millions of records in the "current" tables and multiples of that in protocol/archive.

      Another great thing: The transaction logs are actually what Sybase uses for replication. I totally don't get why PostgreSQL doesn't have decent replication solutions by now. The Sybase method works very well. The replication agent just looks through the transaction log, marking parts that it already replicated as "ok to delete" (although, as stated, before actually deleting the transaction logs, we would back them up).

    79. Re:hmm by elnyka · · Score: 1

      What if you ARE dealing with massive data AND complex relationships?

      And how often does that happen?

    80. Re:hmm by metachimp · · Score: 1

      Instead of a 'deleted' flag, you can use a create timestamp and an end timestamp, for tables that must maintain a modification history. The one record that has an end timestamp in the indefinite future is the current one, all the changes are tracked, and there's no hackery with 'deleted' flags or current record indicators, and so then it becomes just a matter of 'where current_timestamp between create_timestamp and end_timestamp'.

      You also get the benefit of a continuous history of the modifications, which gives you point-in-time ability, definitely an advantage over a simple flag.

      --
      The system has failed you, don't fail yourself. --Billy Bragg
  4. Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 5, Interesting

    The performance claims will probably be disputed by Oracle whizzes. However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables can also be created just by inserting a row into the (new) target table. No need for explicit schema management. Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified. We have dynamic app languages, so why not dynamic RDBMS also? Let's fiddle with and stretch RDBMS before outright tossing them. Maybe also overhaul or enhance SQL. It's a bit long in the tooth.

    More at:
    http://geocities.com/tablizer/dynrelat.htm
    (And you thought geocities was de

    1. Re:Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 1
    2. Re:Dynamic Relational: change it, DON'T toss it by Prodigy+Savant · · Score: 2, Insightful

      What you are suggesting is to mimic a key-value design with something like a json or serialized data as the value.

      This would work if you never had to index on any of the values in the json. All your sql queries must have there where parts running off the key.

      This is a problem that couchdb and mongodb solve.

      I am not trying to paint SQL in an unflattering shade -- there would still be a lot of situations where an RDBMS design would be optimal. Infact, I am currently working on a mongodb/mysql hybrid solution for a large web site (larger than /. )

      --
      Dont make a better sig, you insensitive clod!
    3. Re:Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 2, Interesting

      What prevents indexing a dynamic-relational DB? Although I said that you didn't need a data-definition language, but that doesn't mean one *must* skip the DDL (for things such as indexes). Another thing to explore is auto-indexing. If so many queries keep filtering by a given column, then it could automatically put an index on it.

    4. Re:Dynamic Relational: change it, DON'T toss it by TubeSteak · · Score: 1

      Let's fiddle with and stretch RDBMS before outright tossing them.

      This isn't "it ain't broke, don't fix it"
      Instead we're dealing with "I have a hammer, so every problem looks like a nail"

      The desire to "fiddle with and stretch" software instead of sinking dollars into something new is
      part of the reason we have a clusterfark of decades old technologies & hardware that won't go away.
      Sometimes you have to accept that a hammer isn't the right tool for the job.

      --
      [Fuck Beta]
      o0t!
    5. Re:Dynamic Relational: change it, DON'T toss it by sco08y · · Score: 4, Interesting

      However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array).

      You described an entity attribute value model, which winds up reinventing half the DBMS, poorly. Don't worry, *everyone* does one once until they realize it's a bad idea.

      Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified.

      A "rigid" schema is preventing a ton of totally redundant code being written on the app side. All those constraints wind up in the schema because your UI designer doesn't want to consider that Mary might have 5 addresses or 6 mothers or work 7 jobs simultaneously. And your UI tester doesn't want to test an exploding combinatorial number of possibilities.

      I'd like to see, however, a decent type system, proper logical / physical separation, etc.

      Maybe also overhaul or enhance SQL. It's a bit long in the tooth.

      I'm starting from scratch. (Currently I'm slowly retyping about 40 pages into Latex...)

    6. Re:Dynamic Relational: change it, DON'T toss it by QuoteMstr · · Score: 1

      UI designer doesn't want to consider that Mary might have 5 addresses or 6 mothers or work 7 jobs simultaneously

      As an aside, it's a good rule of thumb that the only limits that make sense are 0, 1, and infinity. If you find yourself hard-coding a limit of 7, 42, or 6175, there's probably something wrong with your design.

    7. Re:Dynamic Relational: change it, DON'T toss it by pla · · Score: 2, Insightful

      RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables can also be created just by inserting a row into the (new) target table. No need for explicit schema management.

      Aaaaaaaand, congratulations, you've described "fixing" the problem of schema flexibility by using an RDBMS as a non-relational flat hashed memory storage area, with at least three layers of indirection (not even counting underlying complexity of the DB engine itself).

      Aside from why the hell you would ever do this in favor of, y'know, just using a flat block of real memory (since you've given the application the fun task of memory management below what the OS usually handles, with all the overhead of framing each read or write as an SQL query)... Well, no. I have no aside, just what I've written.

      Sorry, I'll grant that you have a clever solution to a problem, but a far more effective solution would throw away the problem itself and not try to frame everything in terms of DBM - Kinda like Amazon did.

    8. Re:Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 1

      I'm not sure what you mean by "flat". I didn't preclude indexes or joins. It simply applies concepts found in dynamic/scripting languages to RDBMS. How about some example scenarios to explore.

    9. Re:Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 1

      You described an entity attribute value model, which winds up reinventing half the DBMS, poorly.

      May I ask for a scenario demonstrating "poorly"? It can still do joins, aggregation, and indexing just like any other RDBMS. It's not clear to me what you think is missing.

      A "rigid" schema is preventing a ton of totally redundant code being written on the app side.

      Depends on the project. Sometimes nimbleness is a strategic advantage, such as prototyping or ad-hoc analysis.

      I'm starting from scratch [query language]. (Currently I'm slowly retyping about 40 pages into Latex...)

      I couldn't open any of those for some reason. But I too have also formed a draft query language influenced from IBM's BS2. See http://c2.com/cgi/wiki?TqlRoadmap

      Here's a sample that returns the top 6 earners in each company department:

      srt = orderBy(Employees, [dept, salary], order)
      top = group(srt, [(dept) dept2, max(order) order])
      join(srt, top, a.dept=b.dept2 and b.order - a.order < 6)

      ("a" and "b" represent the left and right side of the join parameters.) But SQL is too entrenched and could probably be stretched a bit further with some enhancements.

    10. Re:Dynamic Relational: change it, DON'T toss it by Anonymous Coward · · Score: 0

      You know the problem with all this object-oriented database/JSON shite. They are very interesting - but be honest... your first port of call for REALLY IMPORTANT apps is a relational database. They work. They are well understood. They've done the job for decades now. They just aren't very fashionable these days.

      People don't make reputations for genius by supporting the status-quo. Hence the legions of people claiming that RDBMS have had their time, and THEY have the right answer. That's not to say that there aren't better solutions in some cases... but really... you should view any of this stuff with a wary eye. New types of databases is major source of CompSci quackery.

    11. Re:Dynamic Relational: change it, DON'T toss it by pla · · Score: 1

      I'm not sure what you mean by "flat".

      By "flat", I mean that you've solved the problem of flexibility by reducing your DB to nothing more than a collection of unrelated binary blobs, each with a completely arbitrary meaning and not necessarily bearing any actual relation to one another - No different than storing a collection of binary blobs in RAM (or in individual files for that matter), "indexed" by address.


      I didn't preclude indexes or joins.

      How do you join against a field that contains data that requires context-dependent unpacking? Sure, you can still use the DB for easily working with other fields, but once you start using data with a dynamic interpretation, you've effectively made it off-limits to any server-side manipulation - Thus removing virtually every benefit of using a RDBMS in the first place.


      How about some example scenarios to explore.

      I've had the "pleasure" of working with an MS-SQL DB "extended" something like you suggest. They used XML inside various fields, which could and sometimes did contain base64 zipped files, which could and sometimes did contain XML plaintext files (because the original fields had slowly grown much too long to work with directly), which could and sometimes did contain base64 image data (I could go on, no joke) - And while I had to deal with this mess, the original creators would get mad at me every time I pointed out that their latest minor update to the core app had trashed an assortment of previously working features.

      Try setting up a filtered view when you need to run some-but-not-all fields through three passes of two different imported DLL decoders.

    12. Re:Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 1

      By "flat", I mean that you've solved the problem of flexibility by reducing your DB to nothing more than a collection of unrelated binary blobs

      I did not dictate implementation, only how it "looks" to the query user. I gave some implementation suggestions for conceptual study, but didn't limit it to that.

      How do you join against a field that contains data that requires context-dependent unpacking?

      Please clarify. The indexing approaches can be almost identical to what exists now. If you set an index on Column X, then an index file for that column is created using B-trees etc.

      Internally interpreting a map-based record is not significantly slower than a tuple. Yes, it will use more storage because of map "key" storage (column name), but that's the price of dynamism. A typical sequential search will take roughly twice as long because of this. If you do a lot of sequential searches on large numbers of records, then this contraption is not for you. Its benefit is schema flexibility. There's almost always going to be some trade-off between flexibility and speed in ANY tool. I'm simply providing that trade-off choice here.

      It's still not clear to me what you are imagining as a big problem. How about a specific query for a scenario. I am not packing or zipping anything.

    13. Re:Dynamic Relational: change it, DON'T toss it by clockwise_music · · Score: 1

      sco08y is completely correct. Everyone thinks of this once until someone points out that it's a bad idea.

    14. Re:Dynamic Relational: change it, DON'T toss it by sco08y · · Score: 1

      May I ask for a scenario demonstrating "poorly"? It can still do joins, aggregation, and indexing just like any other RDBMS. It's not clear to me what you think is missing.

      Any kind of integrity constraints, even as simple as including enforcing types. You can kind of sort of do it with triggers but, like I said, you're reimplementing the DBMS. I speak from having _extensively_ implementing an EAV system and totally engineering myself into a corner with it.

      The only thing you really gain is transactional support for creating / altering tables, but a good DBMS like PostgreSQL can do that anyway.

      Depends on the project. Sometimes nimbleness is a strategic advantage, such as prototyping or ad-hoc analysis.

      Those are phases of development, not types of projects. And nimbleness sucks when someone else has to figure out what you're doing, i.e. the maintenance phase.

      But I too have also formed a draft query language influenced from IBM's BS2.

      Hey, good luck with it! I didn't have any PDFs up there, there are couple now. The section I'm working on is fairly basic mostly because a lot of people think they understand relational theory because they've used SQL DBMSs. Most of the actual language comes out in the next part.

  5. NoSQL? That'd Be DL/I, Right? by BBCWatcher · · Score: 4, Informative

    I think I've heard of non-relational databases before. There's a particularly famous one, in fact. What could it be? Let's see: first started shipping in 1969, now in its eleventh major version, JDBC and ODBC access, full XML support in and out, available with an optional paired transaction manager, extremely high performance, and holds a very large chunk of the world's financial information (among other things). It also ranks up there with Microsoft Windows as among the world's all-time highest grossing software products.

    ....You bet non-relational is still highly relevant and useful in many different roles. Different tools for different jobs and all.

    1. Re:NoSQL? That'd Be DL/I, Right? by Tablizer · · Score: 2, Informative

      IMS is very efficient for known query patterns, but not very flexible for stuff not anticipated. This is a common characteristic of non-relational databases: optimize for specific query paths at the expense of general queries (variety).

      Often IMS data is exported and re-mapped nightly or periodically to a RDBMS so that more complex queries can be performed on the adjusted copy. The down-side is that it's several hours "old".

      Note that it's also possible to optimize RDBMS for common queries using well-planned indexing and techniques such as clustered indexes, which put the physical data in the same order as the primary or target key. Whether that can be as fast as non-relational techniques is hard to say. It may depend on the skills of the tuner.
                       

    2. Re:NoSQL? That'd Be DL/I, Right? by Trepidity · · Score: 1

      To some extent relational databases are making a similar bet, optimizing for particular kinds of access at the expense of more general queries. There are more expressive database languages that support more general kinds of queries, including quantification and variable binding and such, like Datalog, but they're harder to make efficient (though they can also be optimized for common query paths). I see SQL as something of a middle-of-the-road choice: more general than tuple stores and some other approaches, but less general than Datalog and similar approaches. That middle-of-the-road choice might be the optimal one for a lot of applications, but it's not clear to me that it's some sort of global optimum.

    3. Re:NoSQL? That'd Be DL/I, Right? by Tablizer · · Score: 1

      Well, I'll buy that. Maybe DataLog performance would be improved if enough people used it to justify spending resources on engine improvements. But it has a much longer learning curve than SQL for most people. So unless somebody can show a "killer app", SQL is still the alpha male on the street.

    4. Re:NoSQL? That'd Be DL/I, Right? by sohp · · Score: 0, Troll

      Hey Bryce Jacobs, people know you are completely clueless. You've never known what you are talking about, and trying to hide your ignorance in wordy jargon-filled rants filled with weasel words isn't fooling anyone any more. Go back to playing with your little toy dBASE programs and leave the heavy lifting to people who understand how to implement post 1980s solutions.

    5. Re:NoSQL? That'd Be DL/I, Right? by petrus4 · · Score: 1

      Let's see...

      • people know you are completely clueless.

        Subjective ad hominem. (Calling someone clueless, while making no objective citations in support of said claim)

      • You've never known what you are talking about...Go back to playing with your little toy dBASE programs...

        Attempts at condescension/belittlement, again without any kind of rhetorical support.

      • who understand how to implement post 1980s solutions.

        An also baseless appeal to modernity, otherwise known as chronological snobbery. This is the assertion that that which was used in the past, is inherently inferior to that which is used in the present, purely because of its' age. There is no inherent causal relationship between the age of something, on its' own, and its' degree of quality or effectiveness for a given purpose.

      The person you are responding to, may well be incorrect, but I also have not been swayed by your own argument.

    6. Re:NoSQL? That'd Be DL/I, Right? by Anonymous Coward · · Score: 0

      And for the pro-relational, nosql croud, IBM also produced Model 204 in 1968 for the NSA and it is still in use today in many intelligence and other government agencies. Uses an attribute-value structure and repeating groups in record sets, makes for extremely fast procedure based database, but again, with flexibility comes development cost....

    7. Re:NoSQL? That'd Be DL/I, Right? by Anonymous Coward · · Score: 0

      Whao, tool fight! Actually xbase (dbase) coding can be quite compact compared to the equivalent SQL. Maybe "heavy lifting" is an issue sometimes if dealing with billions of records, but if you want lean coding then sometimes it's hard to beat. Use the right tool for the job. There's no use living with big iron constraints if not dealing with big iron data sets. For example, sometimes one may first pull specific information from the big iron database, and then sift the smaller sub-set locally.

    8. Re:NoSQL? That'd Be DL/I, Right? by Anonymous Coward · · Score: 0

      By the way, you can do SQL queries against _that_ database ;)

  6. Starting to love the idea by Just+Some+Guy · · Score: 4, Interesting

    I'm a huge PostgreSQL fan and took classes in formal database theory in college. I'm saying this as someone who understands and thoroughly appreciates relational databases: I'm starting to love schema-less systems. I've only been playing with CouchDB for a few weeks but can certainly see what such stores bring to the table. Specifically, a lot of the data I've stored over the years doesn't neatly map to a predefined tuple, and while one-to-one tables can go a long way toward addressing that, they're certainly not the most elegant or efficient or convenient representation of arbitrary data.

    I'm certainly not going to stop using an RDBMS for most purposes, but neither am I going to waste a lot of time trying to shoehorn an everchanging blob into one. Each tool has its place and I'm excited to see what niche this ecosystem evolves to fill.

    --
    Dewey, what part of this looks like authorities should be involved?
    1. Re:Starting to love the idea by butlerdi · · Score: 1

      Also Neo4j and a multitude of high performance RDF/Owl data stores which are into the billions of rows with reasoning.

      --
      "If the King's English was good enough for Jesus, it's good enough for me!" -- "Ma" Ferguson, Governor of Texas (circa
  7. Hashes are your friend by KalvinB · · Score: 0

    In the example of inbox's no user has to look at another user's inbox so the first step is to simply find the current user's mail.

    I typically use MD5 since it's very good at evenly distributing information. For example stock symbols are heavily weighted to common letters so there are lots of stock symbols that start with "s". But, if you MD5 the stock symbol you get an even distribution based on the first two hash characters to put the historical data into 256 tables. You could also just put it all in one massive table and use the first two characters in their own column with an index. The advantage of using multiple tables is that it's easier to later split the tables onto multiple physical systems.

    So MD5 the Facebook user ID. Use the first four characters to pick the database server. Use the next four characters to pick the table and then select from there. By the time you're even referencing the table you're down to a handful of accounts sharing one table. Searching the User's email is then trivial as the dataset is small.

    Another example of MD5 awesomeness is finding a URL and associated data very quickly (useful for DMOZ data). In MySQL varchars can be up to 255 characters while URLs with various parameters can be any length so you could try to index the TEXT field OR you simply hash the URL and when you want to look up a URL you search for the easily indexed hash.

    Working with large sets of data is only a problem if you don't devise ways to break up the data. If Facebook needs to search all the user's email for various stuff then they can run a script that goes through every table in every database. They don't have to run a single query which would take forever. With distinct sets of data you can quickly start getting results to verify your code is accurate and start digging through the results while the script continues to run.

    1. Re:Hashes are your friend by MightyMartian · · Score: 2, Insightful

      In the olden days you didn't have centralized message stores. That's largely a relic of PC-based networking schemes like Novell, Lotus Notes and Exchange. The Unix model used individual mailboxes (in fact, the whole breakdown was for all of a user's data being in their own hierarchy). Obviously the Unix mailbox scheme wasn't that great as we started saving many megabytes of data, so you create indexed systems, but each user's mail is still effectively independent. I've used Pine to navigate my old mbox archives and it can move through even unindexed email at speeds that put bloated monsters like Exchange to shame.

      Clearly the issue with scalability in general is simply one of optimization. If you're returning relatively small pieces of information, then an RDBMS is the way to go. If all your databases are basically blobs, well then it's probably not going to be that effective. I still feel that blobs are heavily abused.

      I think part of the problem with RDBMSs is simply that a lot of people don't use them properly, and create the bottlenecks through bad design.

      --
      The world's burning. Moped Jesus spotted on I50. Details at 11.
    2. Re:Hashes are your friend by Firehed · · Score: 1

      That's very clever and all (and I'm sure quite effective), but it doesn't address the original issue: RDBMSs suck at scaling. We should be able to throw a rack of servers with a load balancer and a SAN at the problem and have it go away. We shouldn't have to rewrite our application logic to scale it out any more than we currently have to write special code because our hard drives are in RAID5 (read: not at all).

      The storage engines and their indexing should take care of all of this nonsense automatically. You might have to help them out by being a bit more specific than key `user_id` (`user_id`) (your stock tickers are a good example), but fundamentally the code that helps scale out a database should be part of the database and not the application that's using it.

      But, life isn't so kind to us. Oh well, maybe in time.

      --
      How are sites slashdotted when nobody reads TFAs?
    3. Re:Hashes are your friend by Malc · · Score: 1

      I bet it can't find old messages at the speed I do with X1 + 10 years of Exchange-based email (more than 250,000 messages). I stuck with Pine through the end of the 90s, when everybody else I worked with was switching to Netscape Communicator. I wouldn't go back now though.

    4. Re:Hashes are your friend by Anonymous Coward · · Score: 0

      How do Teradata or Greenplum or Netezza not scale?

      Sure, they will cost you and for limited query patterns, it may be cheaper to and sensible to implement a special purpose solution. The customer facing application portion of things like Facebook, gmail, Amazon, LinkedIn and the like are pretty simple and predictable so it's relatively easy to implement application specific solutions. In data warehouse and hard core BI, there are often many tables (and new ones appearing from time to time) and the queries are quite unpredictable - analysts want to ask their question now based on business questions and get an answer real soon - not submit a Work Order to the MapReduce Question Department (like in the old days).

    5. Re:Hashes are your friend by davidbrit2 · · Score: 1

      But, if you MD5 the stock symbol you get an even distribution based on the first two hash characters to put the historical data into 256 tables.

      That works great until you decide to use the R in RDBMS and actually join some tables. Plus you'd be using all sorts of dynamic SQL to allow every query to pick the appropriate table, putting yourself at risk of SQL injection vulnerabilities. You don't want a bunch of interns coding dynamic SQL against a system big enough and important enough to warrant this kind of data partitioning.

      If you really need to split a single table's data across multiple file/disk systems, use a DBMS that supports this at the physical storage level, rather than forcing you to do it logically with 256 tables. SQL Server, for example, allows creating file groups, which can contain multiple files on different file systems. Assign a table to a specific file group, and it will get spread across all those files. Or if you need finer control, use table partitioning which allows you to pick which file group each specific range of data is stored in. This works great, because the data is physically stored as though it were in multiple tables/indexes, allowing you to very quickly narrow your searches based on the partitioning key, and thus isolating all the I/O to a specific partition.

      But 256 separate tables? Egad. It's irritating enough working with our ERP system, which splits most data into separate "open" and "historic" tables. If I had to deal with 256 of them, I'd probably quit.

    6. Re:Hashes are your friend by Tellarin · · Score: 1

      One word, alpine. :)

    7. Re:Hashes are your friend by mindstrm · · Score: 1

      The reason you can't just keep throwing more boxes at it boils down to CAP -
      Consitency, Availability, and Partition Toleranace.

      In designing any distributed system, you can only get two out of three.

      http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

  8. hi monkeys by Anonymous Coward · · Score: 0

    Hi Monkeys. There are MPP databases that scale way past this and allow you speedy access that includes ansi sql access (petabytes in teradata's case). The newer compresed column store engines for many uses destroy hadoop in analytics use cases, per in performance and far fewer machines, plus the ability to use sql.

    Stop the trype hype.

  9. Everything old is new again by QuoteMstr · · Score: 5, Interesting

    We didn't start with relationship databases. RDBMSes were responses to the seductive but unmanageable navigational databases that preceded them. There were good reasons for moving to relational databases, and those reasons are still valid today.

    Computer Science doesn't change because we're writing in Javascript now instead of PL/1.

    1. Re:Everything old is new again by sohp · · Score: 1

      Wrong. Every dogmatic RDBMS lapdog drags out the old "navigational database" claims, but that's just bs to cover up the fact that relational is a solution seeking a problem. Question for you: if relation is so great, why isn't the DNS system that is the backbone of the internet built on top of a relational schema? It's because the decentralized scalability we need for real-world applications can never be properly addressed by any relational implementation.

    2. Re:Everything old is new again by QuoteMstr · · Score: 3, Interesting

      Your question reminds me of the people who say, "if flight records are so strong, why don't we just build the whole plane out of the stuff they use to make them?" You might as well ask, "if DNS is so great, why don't we implement filesystems in terms of it?" Your post demonstrates that you you haven't considered context and purpose.

      Relational databases are models. You can certainly describe DNS in terms of a relational schema. In principle, you could construct a wrapper and query it with SQL. But there's no reason to do that, because with someone as simple as DNS, the full power of a relational query engine doesn't buy you much.

      Most datasets aren't that simple.

      Furthermore, DNS is an open standard that needs to be accessible in as simple a way as possible. Complicating it with relational semantics wouldn't have been worthwhile (because of DNS's relative simplicity), and would have significantly hampered DNS's interoperability.

      That is, if relational databases had existed when DNS was implemented, which they didn't.

      Furthermore, DNS is a distributed, decentralized database. You couldn't use a RDBMS (the software that realized the abstract model of a relational database) to manage it even if you wanted to. That doesn't apply to most datasets, which however large, are still managed by a single organization, and which are accessed by software under the control of that organization.

      Your comparison really makes no sense whatsoever. The vast majority of databases aren't put under the same constraints DNS, and so can take advantage of the much greater flexibility an RDBMS affords.

      You're basically arguing that we can't have efficient engines in automobiles because of a few of them might need to tow 18 ton trailers and withstand mortar rounds. It's ridiculous.

    3. Re:Everything old is new again by QuoteMstr · · Score: 2

      I gave concrete reasons why DNS wouldn't work well implemented as if it were a single global relational database. You reply with hysterical rhetoric and conspiratorial allegations. I'm done with you.

      By the way: it's perfectly possible (and in some cases, even reasonable) for a DNS server to use a relational database to store its records.

  10. Vendor Hype Orange Alert (Re:hmm) by Tablizer · · Score: 3, Interesting

    Notice that this blog is from a company pushing a cloud based solution.

    That is indeed suspicious. But if they want to sell clouds, then make a RDBMS that *does* scale across cloud nodes instead of bashing SQL. (SQL as a language doesn't define implementation; that's one of it's selling points.) It may be that since there's not one out yet, they instead hype the existing non-RDBMS that can span clouds.

    (I agree that SQL could use some improvements, such as named sub-queries instead of massive deep nesting to make one big run-on statement. Some dialects already have this to some extent.)
             

    1. Re:Vendor Hype Orange Alert (Re:hmm) by QuoteMstr · · Score: 1

      named sub-queries

      What do you think stored functions and procedures are?

    2. Re:Vendor Hype Orange Alert (Re:hmm) by alansingfield · · Score: 1

      In MSSQL a FUNCTION doesn't perform well as a named sub-query - because it merely evaluates the function over and over for each row it encounters.

      However, a VIEW is precisely what you need, a named sub-query. When you reference a view within a SELECT statement, behind the scenes it merely pastes the view definition into your statement. This allows the optimiser to look through into the VIEW definition and apply any indexes in the same way it would if you just typed the whole thing out.

      Caveat - this does NOT apply for materialised views, but unless you are using the highly expensive Enterprise or DataCenter editions you don't get this anyway.

    3. Re:Vendor Hype Orange Alert (Re:hmm) by QuoteMstr · · Score: 1

      In MSSQL a FUNCTION doesn't perform well as a named sub-query - because it merely evaluates the function over and over for each row it encounters.

      I'm surprised the query optimizer doesn't try to inline the function. That's a shame. Views are nice too, but aren't quite as general-purpose.

    4. Re:Vendor Hype Orange Alert (Re:hmm) by jellomizer · · Score: 1, Informative

      The problem with SQL isn't really SQL fault it is the fact that most people don't know how to use it properly. It is amazing how many people "with SQL in their resumes" cannot do joins, cursors, or grouping. I don't consider myself an SQL expert but I am good at it, and I can get the job done. But I know if I make a query and it takes too long to run if I don't need to run it once. Then there is probably some optimization that I will need to do to improve speed. Creating Temporary tables with a reduced dataset, Changing from a single nested sql call to a cursor loop or the other way around depending which will run faster. A lot of times these optimizations are the difference between 40 minutes to run to 30 seconds. Giving the same correct results. A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c
      What this will do is go down the x table O(n) then it will match up the correct record from Y where if you have the correct indexing it will take O(ln(n)) so the overall performance is O(n) meaning it is an order of magnitude faster.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    5. Re:Vendor Hype Orange Alert (Re:hmm) by khchung · · Score: 1

      A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches.

      I don't know what RDBMS you used, but from my experience with both Oracle and DB2, neither would do something that stupid normally. I very much expect SQL Server, Sybase, Postgres, etc would not either, and I suspect even MySQL would be smarter than that.

      In fact, I would say most RDBMS would NOT do that.

      If yours is doing something so stupid on something so fundamental as a simple inner join, go and use another database.

      I do agree that a lot of people have trouble writing sensible SQL though.

      --
      Oliver.
    6. Re:Vendor Hype Orange Alert (Re:hmm) by QuoteMstr · · Score: 1

      Yes, even MySQL isn't that stupid. It's appalling how people will choose between two tools based on supposition and misunderstood rumor without knowing the slightest bit about the actual capabilities of the systems involved.

    7. Re:Vendor Hype Orange Alert (Re:hmm) by Orestesx · · Score: 1

      Isn't the index search ln(n) for each entry in x? So the overall complexity is O(n*ln(n))? You can't magically make something O(n) when the operation involves finding the matched pairs in two sets.

    8. Re:Vendor Hype Orange Alert (Re:hmm) by hibiki_r · · Score: 1

      I only know of one that operates in that fashion, and it's not really a relational database, but a multi-value database that has an awful SQL interface to be able to say that they support a relational model. The query plans that include more than one join become cartesian joins with filtering, making any SQL query that has even a bit of weight to it unusable on datasets over ten thousand rows.

      Needless to say, the same queries in Postgres run with very good query plans, and are two orders of magnitude faster even in small datasets.

    9. Re:Vendor Hype Orange Alert (Re:hmm) by cervo · · Score: 1

      Are you for real? Most commercial database engines have a query optimizer which among other things, will recognize that your query is a join and apply a join. The choice of how to do the join (via N^2 nested loop, via hashing one of the tables, via O(n) merge) depends on the sizes of the tables and the indexes available which is all decided by the query optimizer. Not that it is perfect, sometimes it makes a bad decision and you need to drop a hint to override it. Any DBMS that cannot recognize that select .... from x, y where x.a=y.b is a join probably should not be seriously used.

    10. Re:Vendor Hype Orange Alert (Re:hmm) by radtea · · Score: 1

      I don't know what RDBMS you used, but from my experience with both Oracle and DB2, neither would do something that stupid normally.

      There actually was a major database vendor in the'90's that did things that way, but I can't recall who it was. I just remember reading some of their internal docs that had a comment to the effect "Outer joins need to be updated in the next release. Current implementation is totally broken", and based on performance compared to everything else I assumed that they had the order of evalution wrong, as the GP describes.

      So it did happen back in the day, but it's been a decade since and all the other major vendors had things right even then.

      --
      Blasphemy is a human right. Blasphemophobia kills.
    11. Re:Vendor Hype Orange Alert (Re:hmm) by Just+Some+Guy · · Score: 4, Informative

      A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c

      Dude. That is so unbelievably wrong. First, implicit (comma) joins are inner, not left: your results will differ from the original query. Second, please name one popular database released in the last 3 years that implements inner joins with predicates in the way you describe. I can't speak for the others, but PostgreSQL sure as hell doesn't:

      => select count(1) from invoice;
      select c count
      ---------
      1241342

      => select count(1) from ship;
      count
      --------
      664708

      => select invoice.invid from invoice, ship where invoice.shipid = ship.shipid and ship.name_delpt = 'redacted';
      invid
      ---------
      12345
      12346

      Each of those queries against our live production database ran in under a second (and I only edited the input and output of the final query). PostgreSQL may be quick, but I promise you it didn't have time or RAM to create 825,129,958,136 tuples and then winnow out the non-matches. Maybe you're stuck on an ancient version of a DB that was crappy to start with, but the rest of us don't put up with the same insanities you describe.

      --
      Dewey, what part of this looks like authorities should be involved?
    12. Re:Vendor Hype Orange Alert (Re:hmm) by kthejoker · · Score: 1

      The solution to this is to use a table-valued function instead of a scalar and then CROSS APPLY it to the rest of your data set:

      http://www.databasejournal.com/features/mssql/article.php/3845381/T-SQL-Best-Practices--Dont-Use-Scalar-Value-Functions-in-Column-List-or-WHERE-Clauses.htm

      Or, as stated, just use a view. Scalar UDFs are good for setting SQL variables based on today's date or a customer ID one-time; not much else. Indeed, a shame.

    13. Re:Vendor Hype Orange Alert (Re:hmm) by Rich0 · · Score: 1

      Aside from what others have pointed out (your join is not equivalent, and virtually all modern rdbms's will do an inner join given the syntax you supplied), I wanted to point out one other thing: ANSI SQL.

      I'm not an expert on SQL, but if I'm not mistaken, the syntax you suggested shouldn't be used is the ONLY syntax for an inner join that actually works on every ANSI-compliant database server out there. I don't think that JOIN is a valid keyword in ANSI SQL, even though it is obviously a universal concept.

      I think one of the biggest issues out there is that ANSI SQL is getting fairly rusty, and as a result almost everybody uses some vendor-specific SQL variant. An SQL statement written for mysql will most likely not work in Oracle, and vice-versa.

      ANSI SQL needs a major update.

      (If I'm wrong on the JOIN bit by all means speak up - but the last time I went looking I couldn't find any syntax for JOINs other than the typical a=b or a*=b or a=*b ANSI syntax. I don't think there is any standard way of doing an OUTER join at all.)

    14. Re:Vendor Hype Orange Alert (Re:hmm) by Tablizer · · Score: 1

      User-defined functions and user-defined views are not something allowed in most RDBMS without DBA permissions and tons of coding. Besides, they are permanent when often just a temporary view is needed. And it may not be an actual view, but merely a syntactic shortcut, which may allow for more efficient execution. An actual view is usually blind to the context that it's being used in.

    15. Re:Vendor Hype Orange Alert (Re:hmm) by akaariai · · Score: 1

      Your queries do not produce the same result. SELECT from x left join y on x.c is an outer join whereas SELECT from x, y where x.c = y.c is an inner join. Ok, it is a cross join, but most databases will transform it to inner join. I would like to know what are those SQL engines which do a cross join when using the select from x, y where x.c = y.c syntax. Even SQLite converts comma joins to inner joins if it is more efficient to do so. Also, using left joins when inner join is the thing you want is not good for performance. You can look the above link for explanation why.

    16. Re:Vendor Hype Orange Alert (Re:hmm) by Big_Mamma · · Score: 1

      A lot of times people who don't know about joins do the basic join of select x.a y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, the it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c

      Sorry, but it doesn't work that way. As far as I know, none of the decent SQL engine choke on it, although I'm not sure on Access :P

      Also, a lot depends on the size of the dataset and other parameters in the where clause. Real life example, with len(r) = ~1M and len(g) = ~20k: select * from core_report r, core_guild g where r.guild_id = g.id and g.id = 7. With this query, postgres executes it as: scan core_report_guild_id index, look for id=7. Then, lookup g by primary key and join it in a nested loop with loops=1. Without the g.id = 7, it executes as: table scan g, hash it, table scan r and join the two with a hash join. Note that the query planner switched from fetch by primary key, which is O(log n) * n rows -> O(n log n), to table scan x2, O(n), but with a much lower actual cost because walking a BTree isn't cheap. It also ordered it so that only 20k rows get hashed and copy-pasted into the main dataset, not the other way around. That's the advantage of using a proper DBMS.

      You can pry PostgreSQL from my dead, cold hands. It's just so much easier to do meaningful things in a relational database, and until you hit the db-size > largest SSD (used to be RAM) you can buy limit, there is absolutely no reason to limit yourself to glorified tuple stores and hash tables. Okay, sometimes, ORM's can be slightly too eager to join stuff (causing queries like this one), but it's easily fixed by rewriting the line executing the query. Or just ignore it, even that monstrosity (1 index scan, 1 fetch-by-id loop, 3 full table scans) took only 1s max - who cares on a homepage/intranet/most websites.

    17. Re:Vendor Hype Orange Alert (Re:hmm) by Anonymous Coward · · Score: 0

      (I agree that SQL could use some improvements, such as named sub-queries instead of massive deep nesting to make one big run-on statement. Some dialects already have this to some extent.)

      Or how about some semblance of standardization, rather than the laughable (or sad) mess we're left with today. Currently it is impossible to write a non-trivial SQL statement that will run on all major databases.

    18. Re:Vendor Hype Orange Alert (Re:hmm) by Anonymous Coward · · Score: 0

      Heh, this reminds me of folks who think even an ancient, lowly non-server database like Foxpro has to pull the entire table down from the network to grab a few records. Funny how I can pull a recordset from a nearly 2 GB-sized table in sub-second times. Now THAT'S a fast network! *smile*

      Foxpro doesn't even have to pull the whole index file down, much less the DBF -- it optimizes the index slices it takes, depending on join and filter expressions, gets the record numbers it needs, and slices them out of the DBF. All in a matter of milliseconds.

      Makes me wonder what is so wrong with relational databases that proper design/implementation can't fix?

    19. Re:Vendor Hype Orange Alert (Re:hmm) by shutdown+-p+now · · Score: 1

      I just remember reading some of their internal docs that had a comment to the effect "Outer joins need to be updated in the next release. Current implementation is totally broken", and based on performance compared to everything else I assumed that they had the order of evalution wrong, as the GP describes.

      Since GP described cross/inner joins (you can't really do an outer join using multiple tables in FROM with a condition in WHERE so long as you stick to standard SQL), I don't see how it could be in any way relevant.

    20. Re:Vendor Hype Orange Alert (Re:hmm) by shutdown+-p+now · · Score: 1

      Second, please name one popular database released in the last 3 years that implements inner joins with predicates in the way you describe.

      You can amend that to "last 20 years". Keep in mind that ANSI-style joins (with JOIN keyword) were introduced in ANSI SQL-92; before that, implicit cross join via FROM combined with a WHERE clause was the only way to do an inner join, and various RDBMS provided different syntactic extensions to ANSI SQL to do outer joins that way, too. Of course they know how to optimize such queries well!

      Heck, Oracle has only added ANSI joins in 9i - that's 2001. MSSQL did it in 6.5 in 1996. It would be rather silly to think that, before those additions, they couldn't optimize old-style joins well, or that they've suddenly stopped doing so after ANSI joins were added.

    21. Re:Vendor Hype Orange Alert (Re:hmm) by Just+Some+Guy · · Score: 1

      You can amend that to "last 20 years".

      I imagined as much, but figured that was guaranteed to start an argument about whether ObscureSQL 0.5.7 that some Slashdotter wrote as a class project qualified as a popular database, which would somehow evolve to include racism, Glenn Beck, Microsoft, and the Bilderbergers. I figured it'd be easier to underestimate and save us all the pain.

      --
      Dewey, what part of this looks like authorities should be involved?
    22. Re:Vendor Hype Orange Alert (Re:hmm) by Tablizer · · Score: 1

      Or how about some semblance of standardization, rather than the laughable (or sad) mess we're left with today. Currently it is impossible to write a non-trivial SQL statement that will run on all major databases.

      Adding bells and whistles is how vendors distinguish themselves from each other. They don't want you to switch, but rather become addicted to their way. But a half-ass-standard is generally better than no standard.

      In my pet draft query language (SMEQL), the syntax is based on uniform and simple constructs such that a DBA could in theory add functionality to mirror other vendors if needed simply by adding custom statements. It borrows composability ideas from functional programming. It's hard to do that with SQL because its syntax is so convoluted. It's almost like comparing a COBOL syntax diagram to Lisp's. Lisp uses the libraries to extend itself, not syntax changes. (Don't worry, it looks more like C than Lisp).
           

  11. Curiously spurious by KeensMustard · · Score: 1

    Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads.

    I'm sceptical. Why is the problem worse now then in the past? Relational theory in practice is abstracting the data such that a human/application can understand it as logical constructs. How the data is PHYSICALLY organised is a matter of implementation - the relational theory doesn't place any constraint (!) on how the data is organised/retrieved/updated - except that by giving a broad design pattern , duplication is minmised, and so then is processing overhead. MPP (Parallel Processing) lends itself quite neatly to any large set of data - many implementations will continue to scale linearly above the PB size (e.g Teradata). Looks to me like a sales pitch.

  12. 10 years ago, they had the same problem by johnlcallaway · · Score: 2, Interesting

    I was an admin on a system that spread the data across 10 database servers. Each server had a complete set of some data, like accounts, but the system was designed so that ranges of accounts stored their transaction type data a specific server, and each server held about the same number of accounts and transactions. As data came in, it was temporarily housed on the incoming server until a background process picked it up and moved it to the 'correct' one. This is a very simplistic view, but the reality was that it worked quite well. Occasionally, there was a re-balancing that had to be done. But it was very scalable. The incoming data wasn't so time sensitive that if it took a few hours to get moved, everything was still OK. When an 'online' session needed data, it knew which server to connect to to get it. Processing was done overnight on each server, then summarized and combined as needed.

    So yes .. .people have been coming up with innovative ways to solve these problems for a very long time.

    And they will continue to do so.

    --
    I rarely read replies, it's my opinion and if you thought about your opinion a little more, I'm OK with that.
    1. Re:10 years ago, they had the same problem by mindstrm · · Score: 1

      Right - but the solution to your problem wasn't within the RDBMS itself - it was outside that context, and your apps had to be aware of that abstraction.

    2. Re:10 years ago, they had the same problem by johnlcallaway · · Score: 1

      And the solution to my RDBMS problem today isn't in the RDBMS based on the article and comment, it requires me to move to something else.

      And someday, this new thing will have similar problems. Because no matter how clever we think we are, reality has a way of finding new twists and turns that no one thought of.

      --
      I rarely read replies, it's my opinion and if you thought about your opinion a little more, I'm OK with that.
  13. Oh no... by Anonymous Coward · · Score: 0, Funny

    I just sharded

    1. Re:Oh no... by Anonymous Coward · · Score: 0, Troll

      Hahaha.. One time I was shit faced drunk and picked up a trollop (more accurately, she picked up me). Back at my place, she goes down and ignores the cock, ignores the balls, ignores the taint and starts giving me the best fucking rim job ever (more accurately, only one ever). As I mentioned, I was shit faced off of bud or natty light, so I had some beer farts, and I felt a massive one coming on and I knew it was more than just gas. I tried to move her head, but it was too late. I sharted in her face. A real nasty smelling beast, like a dead animal or rotten eggs. And, of course, the liquid shit.

      I'm sure you've seen the smoking gun pic of the guy who was huffing paint. Imagine that, but with shit. And a girl, not a dude.

  14. I/O bottleneck by Begemot · · Score: 1, Interesting

    Let's not forget where the bottleneck is - the I/O. It's expensive but once you build a fast and solid storage system, correctly configure it and partition your data properly over a sufficiently large number of hard drives, RAIDs, LUNs etc., you might be able to use SQL. We run a database of 10TB on MS SQL with hundreds of millions of records with an equal rate of reads and writes and could not be happier.

    1. Re:I/O bottleneck by FlyingGuy · · Score: 1

      WOW! Do you get ahold of an older SyBase version?!

      But seriously, yeah even MS-SQL can hang in if you set up it up just so and never let MS patch the thing forever after.

      --
      Hey KID! Yeah you, get the fuck off my lawn!
    2. Re:I/O bottleneck by cervo · · Score: 2, Insightful

      NO offense, but you probably have no idea what you are talking about. MS-SQL is a relatively solid product. SQL Server 2000 and SQL Server 2005 are pretty stable and can easily handle rather large data sets (in the TB). Of all the Microsoft Products, personally Visual Studio and SQL Server are my favorites. I like PostgreSQL as well, so I'm not strictly a Microsoft Fan. But an awful lot of companies are realizing that MS SQL can manage their data much cheaper than Oracle can. Of course PostgreSQL can do it even cheaper...but many companies like to pay $$ to sleep better at night.

    3. Re:I/O bottleneck by FlyingGuy · · Score: 1

      Well, I do, and sheesh ease up dude. Line 1 was obviously a joke. And yes, MS-SQL has come a long way way, even to the point of being able to be almost on par with Oracle but always misses the mark somehow and being able to be massively scaled up is one of those marks.

      If Microsoft really wants to take their SQL Server to the next level they would do well to port it to a *nix flavor so they could run it on a Mainframe or even something like Google's distributed context, but, we all know they wont.

      --
      Hey KID! Yeah you, get the fuck off my lawn!
  15. Shards and clusters and servers, oh my! by shmlco · · Score: 1

    Worse, sharding and other such solutions usually end up requiring the application to know way, way too much about the back end structure, how tables are split, where they are split, and so on.

    And your solution to improving the storage engine doesn't help. At some point in a RDBMS you need to do joins and so forth, and that assumes that the machine doing the join is capable of doing so AND of handling the load and the number of transactions being tossed at it. Hence we start getting into clusters and other solutions that again need to be understood and managed.

    The NoSQL solution let's you toss your request out to the "cloud" and get an answer without needing to know clusters, shards, tables, or really anything on the physical implementation side of the fence.

    --
    Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
    1. Re:Shards and clusters and servers, oh my! by Otterley · · Score: 1

      But if you are concerned about performance, and you are already running your RDBMS servers at their limits, then you also already know way, way too much about the internal RDBMS structure, how tables are split, where they are split, and so on.

      At some point the comparative cost of doing your own joins is less than tweaking your RDBMS to scale. However, this point is rarely reached in most organizations.

  16. You pick the DBMS that works for you by mkairys · · Score: 1

    Most RDBMS implementations on the web are generally only used to store data and perform very basic queries such as get and store operations. Personally I don't really see the issue of using one for a web applications since they are proven to work well and with the right design and caching solution are more than capable of handling a popular website such as Digg or Facebook. The only real issue with these sites is to prevent bottlenecks you would generally need to throw more hardware at it than may be necessary (although memory is very cheap these days so its a non-issue for most companies).

    Memcached has shown to really help solve many performance issues for relational databases since the database won't constantly perform complex queries to grab data, it will just pull the result from a hashed index stored in memory. MemcachedDB http://memcachedb.org/memcachedb-guide-1.0.pdf is looking very promising to use to get rid of a RDBMS all together for certain data such as user sessions since it focuses on performance rather than functionality. Even then I think it all really boils down to choosing the right tool for the job, if there's data that you know is going to be a performance bottleneck in the database, you look for more creative solutions to store and process that data. There's nothing stopping you from running two or more different types of databases for the task at hand.

  17. 30 Years? by uncqual · · Score: 1

    ...the traditional relational database technology that has served us well for over thirty years...

    Hmm... Before 1979, market share for RDBMS was TINY. It really didn't begin to "serve us well" until the mid 80's.

    --
    Why is there an "insightful" mod and why isn't it "-1"? If I wanted insight, I wouldn't be reading /.
  18. This again by Twillerror · · Score: 2, Interesting

    Wow a "object oriented" database discussion again. I've never read one of these :P I've only been doing this 15 years and I've lost count of these talks a long time ago.

    What is the difference between schema less and schema rigid anyways. I don't see what that has anything to do with performance. The real issue is uptime and transaction support. People want to add a column or index without taking the system down. That is different then dealing with PBs of data. Most table structures can easily deal with that much data.

    If you have a DB that is big you have lots of outs. Pay...get Enterprise version of whatever. Break it into many DB/tables and merge together. Archive. Archive I bet will get most people by. Does eBay really need all that bidding info for items over a few weeks old...only for analysis maybe. Move that old stale data out of the active heavily hit data tiers.

    The fact remains that MySQL should be able to scale to TBs of data. The fact that it can't is a failure of the product. All the others have been for a while. Why can't it...I don't know...the fact that it uses a F'in different file for each index on a table. If you don't understand how old school that is start using Paradox. Just because it is open source doesn't mean it has to be so damn out of date. Please for the love of god save multiple tables/indexes in the same pre sized file...god.

    Google has all the power to go and use something different. Google gets to cheat. Google is a collection of pretty static data. They scan the internet a lot, but imagine if every time you did a search Google had to scan every web page on the planet, index them, and then give you search results. That would be impractical for sure. So for now they just store big collections of blobs and a big fast index for searching keywords and links to pages. Impressive none the less, but it's not like your typical app. GMail is...funny that it is one system they've had problem with. Even then EMAIL DOESN'T CHANGE. It's user specific, but it's still f'in static. GoogleTastic if you ask me.

    The fact is people are using RDBMS right now to solve real world problems. Some start up is finding a way to tweek MySQL to do something cool and then posting it on a blog...then all of the sudden RDBMS is dead. RDBMS is fine, it will be fine for at least 10 years if not longer. In that time it will evolve as well so that it will be around for even longer. MySQL in 5 years will have online index addition, performance hitless online column addition, partitioning, geo indexing, XML columns, BigASS table support, Oracle RAC like support, and a thousand other features that some RDBMSs have today and some will not see for even longer. Then developers that spent all that cash developing custom shit will revert and post comments like this one.

    That's the way it goes in software development. The middle tier gets bigger, gets inept, custom shit comes out, it gets integrated into the middle tier shit....continue;

    Instead of pronouncing death start talking about how dated a 2 dimensional result set is. JOINs should return N dimension result sets similar to XML with butt loads of meta data. ODBC/JDBC are dated...so updated them.

    select u.login, ul.when from users u join user_logins ul as logins.login ON ul.user_id = u.user_id where u.name = 'me' should equal something like a nested XML packet instead of duplicated crap when there is more then one user_logins.

    1. Re:This again by clockwise_music · · Score: 1

      That's the way it goes in software development. The middle tier gets bigger, gets inept, custom shit comes out, it gets integrated into the middle tier shit....continue;

      Lol. Great sentence.

  19. SQL is NOT the Physical Storage or RDBMS engine by Invisible+Now · · Score: 1

    Can we agree that SQL is a high level language for capturing the set theory query logic and is COMPLETELY INDEPENDENT of the engine and physical storage that actually generates the query plan and makes the heads fly to cache and return data?

    Structured
    Query
    Language

    not

    Stupid
    Quixotic
    Layout
    (Of tables, pages, indexes, drives, heads,spindles, SANs, etc...)

    Right?

    --

    "Knowing everything doesn't help..."

  20. Perhaps you just aren't that popular? by Anonymous Coward · · Score: 1, Insightful

    Has this ever occured to you: Maybe people just choose not to answer you? :)

  21. TFA is bullshit by WarwickRyan · · Score: 1

    I've seen OLAP systems in the 100TB range which work fantastically well on Oracle.

    Object databases could be a nice idea, but not for performance or scaling reasons. An object oriented database would be beneficial as a method to sidestep ORM. So you can, effortlessly and without any significant amount extra work persist the state of your objects.

    Then you can build POxOs to represent your objects and just implement a few lines of code to have them persisted.

    Not sure if anything like that already exists. I certainly don't know of anything in the C# world, but I expect there's some funky named java project which does it.

    1. Re:TFA is bullshit by QuoteMstr · · Score: 1

      How on earth is that better than using an existing ORM library? Even if you have to write your own, an ORM isn't particularly difficult to write.

    2. Re:TFA is bullshit by WarwickRyan · · Score: 1

      I can only talk about nHibernate or LINQ2SQL, but in either of those cases I have to do something in the database and/or write some XML. That's duplication of work: you've already defined the properties of the object in the class, so adding anything else on top of that is a waste of time.

      With ORM you usually end up keeping three seperate definitions in sync - the database table, the ORM metadata (mappings) and the object. That costs time, and time costs money.

      Rails have solved this through scaffolding and use of ActiveRecord. That's nice, but your definition's in a database. Not in your programming language. Which is dumb, because as a developer you're spending 90%+ of your time coding in your programming language. So it follows that the best place to define the entities and relationships is there. Then let an object-store handle the persistance, searching, indexing etc.

      In my job (building small scale LOB application) the bottlenecks are more my time than scaling or performance (which'd be a problem with an object-store), so an object-cache solution would be idea.

    3. Re:TFA is bullshit by QuoteMstr · · Score: 1

      Is writing SQL table definitions instead of equivalent assertions in your language of choice really the bottleneck of your development? Switching to an entirely different database paradigm solely to avoid writing "CREATE TABLE" sounds inefficient.

    4. Re:TFA is bullshit by WarwickRyan · · Score: 1

      The problem is not that it's 'instead of', it's 'as well as'. Sure, you can go all ActiveRecord, but then you loose a lot when it comes to your object design. Which you really don't want to do.

    5. Re:TFA is bullshit by sohp · · Score: 1

      Object databases could be a nice idea, but not for performance or scaling reasons.

      [citation needed]

      You're just talking out of your ass.

    6. Re:TFA is bullshit by WarwickRyan · · Score: 1

      Nice an constructive comment. You must be proud of yourself.

      You've quoted me out of context, so for clarity I'll just restate: I'm saying that for me performance and scaling are not reasons to move to an object database. Relational databases already have excellent performance and scale extremely well. So they're going to drive me to switch.

      Are you insinuating that I'm suggesting that object databases are slow, or that they don't scale? Because I've not said anything of the sort, and anecdotal evidence (quick google on CouchDb) suggest that they scale very well whilst performing adequately.

    7. Re:TFA is bullshit by sohp · · Score: 1

      Relational databases already have excellent performance and scale extremely well

      That must be why Google, Amazon, eBay, and Orbitz make them the foundation of their systems. Oh, wait, they don't.

      Perhaps your definition of "excellent performance" that scales "extremely well" is one I'm not familiar with, however.

    8. Re:TFA is bullshit by WarwickRyan · · Score: 1

      For a start, Amazon's system is built on top of MySql: ahref=http://arstechnica.com/business/news/2009/10/mysql-gets-cloudy-with-amazons-new-database-service.ars/rel=url2html-20785http://arstechnica.com/business/news/2009/10/mysql-gets-cloudy-with-amazons-new-database-service.ars/>

      Large relational? A lot of telecoms and banking systems are based on Oracle. All have massive datasets and require excellent performance.

      In google's case then relational's not worth it. Much of their data doesn't have an natural relational structure. All a relational database would do is add massive overheads but no benefits. Plus they're big enough (and smart enough) to gain the efficiencies of scale of building their own data persistance systems.

      For some problems relational is more suited, and for some object databases. Then for some a completely custom data persistance / management system is required.

      Not sure why I'm arguing with you, you've been trolling slashdot with the 'relational is dead' line for a while now. Plus I'm pro object-dbs when they're the right tool for the job. I'd just like to see them (and their integration with development environments) improved to the point that they're a real alternative to relation databases at the bottom end / for LOBs.

    9. Re:TFA is bullshit by makapuf · · Score: 1

      Django ORM (not the best ORM by their own say but simple and has largely worked for 90%+ of my needs) let you define your "tables" as an object with some field types and let you create your tables in SQL.

      Schema migration solutions does exist however it's a HARD solution, you'd better write SQL scripts when you're there. (having separate sql structure and code is a good thing imo and the reason why n the long term / larger projects with a life of its own an "orm" like sqlalchemy is better than a simple ActiveRecord-like ORM).

      Anyway, the real hard problems for a OODB or ORM are :
      - inheritance (multiple, change of father object & other subtle semantics)
      - schema (change your object definition, how does that reflect your _current data_ ?
      - polymorphism (query base table A, shall returned objects be of derived type of not ?)

      I haven't seen a uniform definition of those in programmig languages. How do you tie this to a DB (which should be multi language I hope ..)

      But in the end, having tables for data with stable schema linked to it + OO code in your language is the simplest way, to each its own semantics.

    10. Re:TFA is bullshit by tkinnun0 · · Score: 1

      Unless I'm tragically mistaken, your link refers to Amazon starting to offer hosted MySQL servers. It doesn't mean that Amazon runs their own business on MySQL or that their MySQL offering scales from a simple website to Google levels automatically. It's just single MySQL servers that you can rent.

  22. And I am missing it greatly on Linux by Errol+backfiring · · Score: 2, Interesting

    MS-Access had some really great features: it could be accessed with both SQL and with a blazingly fast (because almost running on the bare OS) ISAM-style library. I am still missing anything like it on Linux. SQLite is a file-system database, but why on earth should it parse full-blown SQL at runtime and why on earth should my program write another program in SQL at runtime just to load some data? Get serious. Parsing and building SQL is just overhead, and especially parsing SQL is no easy and light task.

    Since I switched to OO programming, most (95%) of my queries are "This table/index. Number 5 please." In essence that is the get/put method, or the ISAM style method. I really would like something like that to exist on Linux. The closest thing around is MySQL's HANDLER statement, but that can only be used for constant data (because it does dirty reads) and for reading only.

    SQLite could even be faster if it just accepted some basic "get row by index" and "put row by index" commands that do not try to parse, optimize or outsmart anything. The problem with "modern" databases is that they are either "SQL" or "NoSQL". That's awful. Some programs speak SQL (because of compatibility, because it is a reporting program or just because the programmer does not know anything else) and some programs are better off with direct row management. That does not mean that the data should not be accessible by both programs. I really wish that the regular SQL databases would develop ISAM-style access methods. Programming would be a hell of a lot easier then, and the programs themselves would speed up significantly was well.

    This is no idle remark. I worked a lot with MS-Access and most rants about it being slow comes from the fact that most programmers treat the file-system database as a server. So it must emulate itself as a server and do a lot of household parsing and does not even have a physical server to relieve its load.
    But if you know how to program a file-system database with ISAM-style methods, MS-Access is by far the fastest database I ever encountered. No Joke. Really. It can be fast because there is no need to do all these household jobs to just dig up a row.

    --
    Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
    1. Re:And I am missing it greatly on Linux by QuoteMstr · · Score: 1

      Parsing and building SQL is just overhead, and especially parsing SQL is no easy and light task.

      No optimization without quantification. Parsing is very fast, especially compared to disk IO. Are you sure the SQL is slowing your program to any appreciate (or even measurable) degree? You should be able to measure any supposed effect with a profiler.

      Nevertheless, if SQLite so offends your sensibilities, you can always use Berkeley DB. It gives you a similarly powerful storage engine without the necessity (or ability) to write SQL to access it.

    2. Re:And I am missing it greatly on Linux by Errol+backfiring · · Score: 1

      Nevertheless, if SQLite so offends your sensibilities, you can always use Berkeley DB [oracle.com]. It gives you a similarly powerful storage engine without the necessity (or ability) to write SQL to access it.

      Well, that's exactly the problem, isn't it? Once I put my data in a "NoSQL" database, I can not reach it with SQL anymore and vice versa. For some applications (especially reporting), SQL is the best tool. The problem is that "the warehouse determines what trucks you should use". And I just want to be able to use light and small trucks for small urgent jobs and large trucks for bulk storage.

      --
      Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
    3. Re:And I am missing it greatly on Linux by QuoteMstr · · Score: 2, Insightful

      One of the big attractions of using a database to store your information is having a consistent API for accessing your data. I'm not convinced that what you want, having both SQL and non-SQL methods to access the same dataset, is ever actually useful. The overhead SQL imposes is actually minuscule compared to the cost of data access itself.

      If you go the Berkeley DB route, you're going to need to build an application-level data access layer anyway. If you have a complex query to perform, just do it through that access layer.

      On the other hand, if you use a SQL engine, you can go "small and light" simply by using "small and light" queries. There's no particular reason you can't simply run SELECT * FROM mytable WHERE id=? repeatedly, incrementing id each time.

    4. Re:And I am missing it greatly on Linux by Errol+backfiring · · Score: 1

      The overhead SQL imposes is actually minuscule compared to the cost of data access itself.

      That depends. If you take a server database, the networking overhead can be of a significant magnitude. But with ISAM-syle methods and file-system databases, you just look up a pointer from an index and set the pointer "at the OS level" and start reading the file to get the record. There is a really thin library layer on top of it, but that is how it works.

      The thing is, I know the database. I know what index to use because I defined it myself (or my colleague did and documented it). There is no need to check for aggregate functions, see what columns are used in the WHERE clause to get the list of index that are really used, pick one of them, no need to check all the options for a select. That is really nice for a complex JOIN, but not for a "just give me this row". My queries can be pre-optimized at compile time by using ISAM-style methods.

      --
      Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
    5. Re:And I am missing it greatly on Linux by QuoteMstr · · Score: 1

      First, we're comparing local database access via SQL with local database access using lower-level APIs. Networks don't factor into this discussion.

      Second, you're letting your prejudices get in the way. You feel like the SQL parsing must be incurring some significant overhead, so you feel like the best way to avoid it is to "just" seek and read. (The word "just" in a rationale is a giant red flag.)

      Why don't you actually profile your code? The time spend dealing with the SQL will be minuscule compared to the time you spend accessing your actual data.

    6. Re:And I am missing it greatly on Linux by Errol+backfiring · · Score: 4, Informative

      I did profile my code. It is not my gut feeling, but my experience.

      --
      Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
    7. Re:And I am missing it greatly on Linux by mugurel · · Score: 2, Funny

      most (95%) of my queries are "This table/index. Number 5 please."

      Admirable! Despite the strong desire for efficiency, you still have the prudence to phrase you queries politely.

    8. Re:And I am missing it greatly on Linux by Errol+backfiring · · Score: 2, Funny

      Well, actually only for the INTERCAL connector...

      --
      Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
    9. Re:And I am missing it greatly on Linux by SQLGuru · · Score: 1

      One of the fun things about programming in .Net. You get access to data using many methods. You can write SQL statements against CSV files if you set your connection string up correctly (I recommend http://www.connectionstrings.com/). And then, if you want more direct access, you can do that, too. All you need are the correct drivers.

    10. Re:And I am missing it greatly on Linux by gbutler69 · · Score: 1

      You, sir, have a fundamental misunderstanding. Running that query over and over again, simply incrementing the ID, is not "small", "light", "efficient", nor "fast"! That would be a 100% retarded thing to do.

      --
      Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
    11. Re:And I am missing it greatly on Linux by gbutler69 · · Score: 1

      To elaborate further...If there are N entries in the above table and ID is in a b-Tree index to iterate over all N records you wish to retrieve would involve N * LOG(N) time complexity, whereas simply getting the whole list (or a proper sub-range) would involve only LOG(N) time complexity. The difference between N * LOG(N) and LOG(N) is DRAMATIC!

      --
      Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
    12. Re:And I am missing it greatly on Linux by QuoteMstr · · Score: 1

      Of course. But when N is small, it doesn't matter: and I guarantee you, the OP was talking about small N. Besides: the beauty of SQL is that you can easily change the query to pull a range of rows instead of asking for each individually; on the other hand, if the retrieval is hand-coded, it takes a lot longer to change the search strategy.

    13. Re:And I am missing it greatly on Linux by Jurily · · Score: 1

      On the other hand, if you use a SQL engine, you can go "small and light" simply by using "small and light" queries. There's no particular reason you can't simply run SELECT * FROM mytable WHERE id=? repeatedly, incrementing id each time.

      I'm sorry, but that definition of "small and light" only applies within the context of using an SQL engine, if at all. For example, compare it with the complexity of using a raw file in the same manner:

      offset += sizeof(record);

    14. Re:And I am missing it greatly on Linux by Anonymous Coward · · Score: 0

      On the other hand, if you use a SQL engine, you can go "small and light" simply by using "small and light" queries. There's no particular reason you can't simply run SELECT * FROM mytable WHERE id=? repeatedly, incrementing id each time.

      IMHO -- that can be quite a problem if you consider the overhead involved in executing the query. Sometimes it is actually better to query a larger resultset and iterate over the results.

    15. Re:And I am missing it greatly on Linux by Tetsujin · · Score: 1

      most (95%) of my queries are "This table/index. Number 5 please."

      Admirable! Despite the strong desire for efficiency, you still have the prudence to phrase you queries politely.

      Well, the database gets all sulky if it doesn't hear the magic word on a regular basis... So it's really in one's own best interests to be polite.'); DROP TABLE stories;--

      --
      Bow-ties are cool.
    16. Re:And I am missing it greatly on Linux by dotgain · · Score: 1

      I know for a fact that the Perl DBI module can do that too. I'm sure many other popular languages can do it to, point being that's nothing peculiar to .Net

    17. Re:And I am missing it greatly on Linux by Anonymous Coward · · Score: 0

      I guarantee you the overhead of preparing queries is not the bottleneck. But if you really need it, you can send sqlite's VDBE interpreter raw bytecode, and nothing prevents you from writing your own language that generates it.

      The funny thing is, aside from having fairly verbose identifiers, sql is actually pretty concise. About the only practical thing that would significantly shrink the actual syntax elements would be some kind of join shorthand people would want to use (there's NATURAL JOIN but it sucks because it's by name and not structural declarations like foreign keys)

      Access isn't even particularly fast when it comes to ISAM since it goes through standard windows filesystem APIs, which aren't particularly tuned for that sort of thing.

    18. Re:And I am missing it greatly on Linux by JimFive · · Score: 1

      What is the overhead in keeping your raw file in an arbitrary order so that
      offset += sizeof(record);
      returns anything meaningful.

      Relational tables are not ordered. But even apart from that, if all you are doing is iterating through your records then why are you using a database at all? The power of SQL is in performing actions on a set of results, e.g.
      UPDATE Employees
      SET Salary = Salary * 1.1
      WHERE Department = 'IT'
      When you start treating the relational table as an ordered file you lose most (all?) of the benefit of the relational model.

      Finally, MS-SQL does compile and cache SQL code so if you're not constantly building all new, ad-hoc statements you benefit from that. In almost all cases, you are not smarter than the built in optimizer.
      --
      JimFive

      --
      Please stop using the word theory when you mean hypothesis.
  23. like the network effect and developer laziness by Colin+Smith · · Score: 1

    its simpler to switch to a different rdbms when your queriees are already in sql.

    It's mostly just human ignorance and laziness.

    --
    Deleted
    1. Re:like the network effect and developer laziness by Abcd1234 · · Score: 1

      its simpler to switch to a different rdbms when your queriees are already in sql.

      It's mostly just human ignorance and laziness.

      Bullshit. That argument *might* be true for people who actively migrate from an existing SQL DB to SQLite, but the reality is that SQLite's *real* niche is in supplanting BDB. In that case, you don't have people with queries "already in SQL"... these are new projects that choose SQLite because it fits a need they have, and does so better than any existing, non-relational storage solution.

    2. Re:like the network effect and developer laziness by PCM2 · · Score: 1

      I think what the GP might have been trying to say is that just about any developer these days will have encountered SQL databases. You'll find high school students with plenty of MySQL experience. It's easier to draw upon that knowledge to write simple SQLite applications than to have to go and learn how to do it in BerkeleyDB (a skillset which only applies to the simple applications and won't help you when you want to write an app that needs more advanced RDBMS features).

      --
      Breakfast served all day!
  24. Hstore by Anonymous Coward · · Score: 1, Informative

    You are aware of PostgreSQL's hstore: a type representing basically a name-value mapping (think Perl hash or Python dictionary). You can put an index on it answering queries like "find all records where the field has a mapping "foo => bar", or contains mappings {foo => bar, baz => grumble} and more.

    Cool stuff.

  25. One? by Chapter80 · · Score: 1

    E-Mail servers associate data with only one index: the e-mail address.

    ...Valid points, except for your use of the word "one". My email can be retrieved by my email address, but also selected by the folder that it's in, sorted by sender, subject, date or priority, and searched by keyword.

    There are only a couple of handfuls of thing that need to be indexed, but certainly more than 1.

  26. Sorry, you need to get real by Shivetya · · Score: 2, Informative

    I work on a very large db2 system. Enterprise systems cost money because they work. There still seems to be this ignorant self absorbed counter culture which believes big iron and similar (anything about look what I can build in my basement) isn't cool so it cannot work.

    Between radix, sparse, derived, encoded vector indexes I can pretty much serve up anything my partners want, whether they are native or foreign db2 ,jdbc or odbc connected. With the tools I have at my disposal I can analyze statements presented by developers to insure I have the access paths needed for their work and guide them to better data retrieval. I can tell if their choices result in full table scans, index probes, hash tables, rrn tables, etc. If I need support its a phone call away.

    I do not care who my client is, data is my job. As such I need tools which are so reliable that only concerns I have are, just what is my customer doing and how can I make their request better. When they query 5tb tables and don't even notice a delay I think I am doing just fine.

    --
    * Winners compare their achievements to their goals, losers compare theirs to that of others.
    1. Re:Sorry, you need to get real by sohp · · Score: 0, Troll

      It's a good thing you have all those tools to analyze the queries and tables, because you need them to make that complex pile of enterprisey spaghetti work. If only there was something out there that just worked, and didn't need all that hand-optimizing and tool-fiddling to kludge it into usability.

      More importantly, it would be great to have some kind of data storage system that developers could just use and not be forced to go through the bottleneck of the custodian of tools who, often as not, will use the "guide them to better data retrieval" excuse to refuse reasonable requests for straightforward functionality as violating some kind of holy normalization or relational purity dogma.

    2. Re:Sorry, you need to get real by QuoteMstr · · Score: 2, Funny

      And so we come to the core of the issue: people aren't really opposed to relational databases, but instead to relational database administrators.

  27. You can take my SQL when you pry it from my... by turing_m · · Score: 1

    This isn't "it ain't broke, don't fix it" Instead we're dealing with "I have a hammer, so every problem looks like a nail"

    SQL is hardly a hammer - a hammer only has one general use. It's not a Swiss army knife either - a lot of fairly low-grade tools that are convenient in a pinch. If anything, a Swiss army knife is a spreadsheet.

    An RDBMS is more like a well equipped workshop that you build and equip at the site of likely problems. It will take far more work to set up than buying a leatherman tool. However, it will solve almost any unanticipated problem you throw at it, once it is built. That is the beauty of an RDBMS, and why businesses and governments like to build both workshops and relational databases.

    Of course, there are circumstances where a RDBMS is not called for. If you are doing anything that needs to be highly optimized for just one thing, and will only ever be used for just that one thing, then you do not use an RDBMS. (e.g. an FPS). Much like you wouldn't use a workshop if all you are going to be doing is manufacturing widgets - then you need a factory.

    I guess the analogy kind of breaks down there, because a workshop isn't efficient enough to run as a business compared to a factory - it is support infrastructure. For many, many things though, an RDBMS can be the core of a business information system and can also quickly and conveniently answer questions that weren't thought of at design time. Their RDBMS problem domain will only increase as computing power grows, unlike more specialized systems. I would not be surprised at all if SQL is still dominant in a hundred years time.

    --
    If I have seen further it is by stealing the Intellectual Property of giants.
  28. solution looking for a problem? by timmarhy · · Score: 3, Insightful

    SQL databases if designed properly DO handle enourmous datasets. the problem starts when you have wits designing the database and then managers attempting to use the DB for purposes it wasn't meant for.

    --
    If you mod me down, I will become more powerful than you can imagine....
    1. Re:solution looking for a problem? by Jearil · · Score: 1

      That's not exactly where the problem comes in actually. It has more to do with large datasets.

      I know that people have mentioned Oracle and other large DB vendors that make databases that will scale to terabytes (or maybe even petabytes), but as some others have also pointed out: those are REALLY expensive.

      I work for a particular purple company that deals with web searches and scalability issues. Even on some of the 'small' apps that run on DB servers eventually run into scalability considerations when reaching a certain size. The main bottleneck that you face is IO latency. While for most applications you will have the bulk of your data sitting on disk and a smaller index for that data sitting in ram so that you can find the tuple your looking for quickly, when the database gets too large that index becomes big enough to no longer fit into ram. When you have to perform a disk seek to access the index, access times go way down.

      Also, with systems such as MySQL you can horizontally scale for reads which helps a lot. You add slave machines that replicate the data, and you can even set them up in tiers so as to not put so much strain on your master server. All of your read requests go to a slave that's behind a load balancer, while all of the writes and read critical requests go to your master. Unfortunately when you've fully saturated your master with writes, you can't just add a second master without sharding. Sharding can be a pain as it often requires you to rewrite the data access portion of your application to deal with the sharding details. Also, you can't perform joins over data in two different shards, which means you'll be joining that data in your application. Messy.

      The only way to scale write databases in a MySQL setup is vertically, with a bigger box. That works for a little while but eventually that will be saturated as well. What is needed is some way to utilize a lot of commodity (cheap) servers to store and process all of your data, and currently implementing SQL on something like Hadoop has a lot engineering challenges to overcome.

  29. Neo4J is really interesting here... by Anonymous Coward · · Score: 0

    All these programmers that know how to create tables and normalize a DB but that don't really understand advanced programming techniques are cornering themselves in the "Vietnam of software development": endless OO-to-RDB plumbing. They invent lots of tools to ease their immediate pain, without looking at the big picture: OO and RDB do not match.

    You have hierarchical datas? Then learn something new: learn what OO really is about, use an OO DB. It has proven to be really fast.

    But I don't expect this to become mainstream: most people don't understand advanced programming techniques. They don't understand OO, they don't understand multi-threaded programming. Hence they rely on the SQL DB to "keep things in synch" and to "organize" their data. It kinda works, for naive stuff.

    Once it comes to real amount of data, then the relational paradigm and especially the SQL implementation of that relational paradigm simply ain't cutting it anymore.

    1. Re:Neo4J is really interesting here... by QuoteMstr · · Score: 1

      Reading your post, I was half-expecting to see "educated stupid" crop up. You're claiming that 40 years of accumulated wisdom managing petabytes worth of data is worthless, and that your "advanced programming techniques" can do a better job? Put a sock in it. Maybe it's you who has the problem here. After all, you can't grasp that people actually do model data relationally, do produce object-relational layers that work just fine, and do produce systems that process vast amounts of information relationally.

      What's next? Will you advocate that people who advocate antiobiotics don't your "advanced medicinal techniques" based on the four humors?

      Oh, and by the way: "data" is already plural, you numbskull.

    2. Re:Neo4J is really interesting here... by petrus4 · · Score: 1

      I agree with you.

      I've never understood why people like the GP have such attitudes. They scoff at pre-existing UNIX methods, and then insist that various forms of obscenely, excessively complex, perverted evil such as C++ or XML are somehow preferable and superior.

      Although I love using Linux and FreeBSD, and also engaging in shell scripting and simple forms of programming, whenever I see these being discussed in a public forum recently, I've noticed that my immediate response is usually to become angry.

      This is because there is now apparently an entire generation of chronically elitist, misinformed, horribly uneducated and misguided programmers, who engage in chronological snobbery and various other subjective logical fallacies while expressing derision towards the UNIX philosophy, (which is, as you rightly point out, the product of 40 years' worth of accumulated experience) and then advocate such horrors as the aforementioned C++ and XML as preferable solutions.

      C++, XML, and "object oriented," programming are nothing other than facilitators of elitism. I've never come across a single advocate of C++ who was not a condescending elitist of the worst possible kind. They use its' degree of needless complexity as a means of gratifying their ego, and they thus fight tirelessly to ensure that the means of said ego gratification is preserved.

      Inflating epeens, however, is the only thing that C++ really accomplishes.

  30. Link corrected by WarwickRyan · · Score: 1

    I mangled the first link.. http://tinyurl.com/ybepcqr/

  31. RDBMS do the job by Stormcrow309 · · Score: 1

    Database size is usually not an issue for modern RDBMS, such as Microsoft SQL, Sybase ASE, Oracle, or IBM's DB2. I am running an ERP on Sybase with 3 TB worth of data, a datamart on Microsoft with 5 TB, a Patient Record System on Microsoft with 20 TB, a HR system with 2 TB, and a Patient Accounting system on Oracle with 8 TB of data. All of these systems talk with at least one other system, usually with the assistance of SSIS (Thank god for SSIS, our ETL is heavy lifting, approx. 5 TB a night of incrementals). With enough server hardware, we can scale up to very large levels easily. We forcast out our data size needs out for the next three years and have been very accurate, not running across SAN issues.

    Only systems we have had issues with in the area of data size is MySQL and Informix.

    --

    In God we trust, all others require data.

  32. Amazon doesn't use database transactions?? by Anonymous Coward · · Score: 0

    You do realize that a site can use more than one database, right? A database that is primarily read-only to display data quickly, and another database to handle the financial transactions. I really hope that you are not telling me that any business - web business or not - does not depend on ACID requirements.

    Also, there's more to the RDBMS world than MySQL, PostGres, etc. - the commercial databases, on proper hardware and proper database design, can actually scale up and scale out quite well: You just have to learn to use the right tools for the job!

  33. I know the type well by Anonymous Coward · · Score: 0

    The typical architect who opts for a NO-SQL approach is basing her decision on what an RDBMS can do / can't do primarily on experience with mySQL. She would never consider something much more scaleable on the extreme like Oracle or even heaven forbid DB2. She has never tuned let alone touched one of these real RDBMS. Similarly, her idea of hardware doesn't much transcend a set of independent servers linked with GBE. So her hammer is anything but an RDBMS and the conclusion is totally foregone that an RDBMS won't work. The real conclusion is that mySQL won't work which is totally accurate. Go look at the Larry Ellison video of the Oracle/Sun database machine which will eat most of these "unsolveable" problems for lunch. Yes it is expensive, but building an empire so your pet project can succeed is also expensive and probably more risky as well.

    1. Re:I know the type well by QuoteMstr · · Score: 3, Interesting

      Right. Don't forget PostgreSQL too. Really, the problem here is MySQL. Hell, look at the "tips and tricks" comments for this story: they all deal with ways to work around deficiencies in MySQL (and old versions of MySQL at that.)

      The guy who recommends using the first two characters of the MD5 hash to select a table is particularly hilarious. Doesn't he realize that's what a database index already does, and that databases (even MySQL) will do that for him?

    2. Re:I know the type well by Anonymous Coward · · Score: 0

      At my company, we don't call those kind of people "architects". Instead, we call them "fucktards".

  34. One big problem with SQL is ... by Skapare · · Score: 1

    ... that too many developers and integrators will just use an SQL database by default without considering whether or not it is appropriate for the task. I see so many databases where there is little or no hint of any relationships even being involved. Some forums, for example, store postings in a database where the message content is a blob and it is indexed by a number. To get a post, look by number. While an SQL database can do this, so can many other database types. There's no complex relational searching with this; it's just basic indexing (with maybe a tree of index relationships). I'd sooner do this with a B-tree based filesystem.

    --
    now we need to go OSS in diesel cars
    1. Re:One big problem with SQL is ... by Joe+U · · Score: 1

      Or, for example, this NNTP/Usenet clone thing called Slashdot. I never understood why developers love to shove message boards into SQL. The accounting system, yes, the messages, no.

      I ran a site that did private messages between users, we had 3 choices, SQL database, custom app, or mail server. I suggested using a mail server with a custom interface so could bypass SMTP/IMAP. We went with a custom server instead, It ran rings around SQL and could handle millions of messages with little CPU/disk/network use, but we did have to re-invent some things that were already done had we used a mail server or SQL.

      You need to use the best tool for the job at hand.

    2. Re:One big problem with SQL is ... by QuoteMstr · · Score: 1

      I suggested using a mail server with a custom interface so could bypass SMTP/IMAP.

      So in what way exactly were you left with a mailserver?

    3. Re:One big problem with SQL is ... by Joe+U · · Score: 1

      For the accounts and message storage system, it's optimized for millions of messages and quick lookups of account info.

      It was only talking to a web server, so SMTP and IMAP would have been wasteful overhead, a direct interface with a constantly open connection would have been much quicker. (Also, this was 1997, we were limited)

      User to user messages(IM) is just fast email when both users are there and is pretty much regular email when one isn't.

    4. Re:One big problem with SQL is ... by QuoteMstr · · Score: 1

      In order words, you traded a database over which you had full control for whichever database the gutted remnants of your mailserver used.

    5. Re:One big problem with SQL is ... by Joe+U · · Score: 1

      No, like I said before, we wrote a custom server. So we had complete control over every aspect of the system.

    6. Re:One big problem with SQL is ... by Abcd1234 · · Score: 1

      While an SQL database can do this, so can many other database types. There's no complex relational searching with this; it's just basic indexing (with maybe a tree of index relationships). I'd sooner do this with a B-tree based filesystem.

      And how many B-tree-based filesystems out there are there? And how portable are their implementations? Yeah, exactly.

      Even for straight, non-relational, tabular data, the fact is, your average RDBMS provides decent performance and portability while allowing you to leverage the facilities your pet language of choice offers for database integration. That's a *huge* win, particularly if your alternative is to roll your own custom storage solution (as you suggest).

      Frankly, I'd seriously question someone's competence if they chose to write their own filesystem-based storage layer over leveraging an existing solution like an RDBMS, *particularly* for simple applications that don't have special perform or data storage/querying requirements.

    7. Re:One big problem with SQL is ... by shutdown+-p+now · · Score: 1

      ... that too many developers and integrators will just use an SQL database by default without considering whether or not it is appropriate for the task. I see so many databases where there is little or no hint of any relationships even being involved. Some forums, for example, store postings in a database where the message content is a blob and it is indexed by a number. To get a post, look by number. While an SQL database can do this, so can many other database types. There's no complex relational searching with this; it's just basic indexing (with maybe a tree of index relationships). I'd sooner do this with a B-tree based filesystem.

      It's not any different than using XML as a default format for any data that can be more or less reasonably shoved into it.

      And I don't think that it's really a problem. Yes, you do not normally use a screwdriver to hammer nails, you use a hammer. But what if screwdriver is in fact convenient enough for that, and standardized to boot, while hammers are all different, and everyone is prone to making their own?

      Using stock solutions - even if they do 3x more than you actually need to do - is often a lot of time saved for negligible (in the big scheme of things) performance losses. Hence XML. Hence SQL.

    8. Re:One big problem with SQL is ... by larry+bagina · · Score: 1

      HFS/HFS+ is B-tree-based.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  35. MySQL Sucks by democritus · · Score: 1

    Actually, the real problem is that MySQL sucks. Sure, you can patch over some of its suck with Memcache, but at somepoint your still stuck waiting 30 seconds for a query to return, no matter how optimized you make it. Yes, it's trivial to get Oracle and MSSQL to scale to billions of rows, but those cost money no one is willing to spend. NoSQL is wonderful in that it scales easily and is free.

    Sure, you have to denormalize your data, but you probably already were to try to squeeze the last bit of performance out of MySQL.

    You want people to use RDBMS? Make a free one that doesn't suck donkey balls and they will.

    1. Re:MySQL Sucks by Zombywuf · · Score: 1

      PostgreSQL.

      NoSQL can not possibly scale, because it is not an implementation (protip: implementations scale/don't scale, concepts don't). Unless you are referring simply to the fact that it is a meme, and as such scales trivially.

      BTW, when you say "scale" are you talking about data size, active data size, CPU load, memory bandwidth, latency, concurrent queries or something else?

      --
      If you can read this you've gone too far.
    2. Re:MySQL Sucks by democritus · · Score: 1

      Don't be pedantic. When people talk about NoSQL they're talking about more then a meme. They mean the various implementations: Cassandra, HBase, BigTable, Tokyo Cabinet. All these were designed from day 1 to scale to TB+ datasizes, and all do it pretty well.

      Scale: handle tons of data, respond quickly, be able to grow across dozens, if not 100s of hosts.

    3. Re:MySQL Sucks by Zombywuf · · Score: 1

      If I'm not pedantic my databases fail to scale. The reason being that I pay attention to detail, as such your requirement that they "handle tons of data" - for example - is crap. What are the inputs? what are the outputs? What form of query? What form is the data in on insertion/output? What if I need to do more than a single big flat index into my data?

      The word "scale" is my most hated, to use that word on its own without context is to make the claim that all knowledge about building large scale computer systems acquired over the last 40 years can be boiled down to a single attribute that can be simply applied, like a salve, to your software.

      Simply applying BigTable to your problems will not make them go away, it will give you a different set of problems to solve. If you chose wisely it will be a simpler set of problems, but it is not "make my website faster" cream.

      --
      If you can read this you've gone too far.
    4. Re:MySQL Sucks by TheSunborn · · Score: 1

      The problem is not the database size as such, it is the requirements to do 50000 inserts/updates and 200000 select(Read) each second that kill the current sql implementations.

    5. Re:MySQL Sucks by Salamander · · Score: 1

      Here's a tip from a *real* pro: some concepts don't scale. Take strictly sequential execution (please!) as an example. That's clearly a concept, not an implementation, and yet it clearly conflicts with scalability - i.e. maintaining a ratio of output to some factor X as X increases. In this case X could be records or bytes per second, or number of clients, or network latency. In all of those cases, a point is quickly reached - still well within a normal range of interest - where sequential execution fails. Concept? Check. Doesn't scale? Check. Q.E.D. Back to school, n00b.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    6. Re:MySQL Sucks by QuoteMstr · · Score: 1

      The scalability issues of sequential execution are entirely implementation driven. There's no conceptual limit as to how fast a Turing machine can shuffle its tape around. You're confusing implementation and idea while simultaneously lecturing others not to do that. You're a sophomoric hypocrite.

    7. Re:MySQL Sucks by Salamander · · Score: 1

      You're confusing performance with scalability. Don't worry, though; it's a common rookie mistake. Saying that there's no limit to how fast a Turing machine can process its tape is a comment about performance and is totally irrelevant to the already-stated definition of scalability. Talking about scalability in terms of infinitely fast components is just silly, because no such component can ever exist. All reasonable discussion of scalability must be grounded in ratios of load offered vs. load handled. Please spend less time picking which laughably self-referential insult to sling, and more time actually thinking.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    8. Re:MySQL Sucks by QuoteMstr · · Score: 1

      Talking about scalability in terms of infinitely fast components is just silly, because no such component can ever exist.

      That's my point. You can't talk about the scalability of abstract concepts because they can never actually exist. Only implementations can be scalable. If you want to talk about ratios, then an arbitrarily fast Turing machine can handle as much work offered in as little time desired: that is, it has infinite scalability. Such a concept is clearly nonsense.

      Please, limit yourself to the real world instead of trying to embarrass yourself with generalizations to nothingness.

    9. Re:MySQL Sucks by Salamander · · Score: 1

      Abstract concepts cannot exist, but a statement about one can nonetheless be true. "Red" does not exist, but I can make statements about the relationship between red and green and have those statements be meaningful. So it is with sequential execution and scalability. Both are concepts, neither can exist in the real world, but there is a relationship between them and it's one of opposition. To say otherwise is to deny that any logical statement of the form "for any x..." can be valid, and there's no conversation to be had with someone who has turned their back on logic like that. "Concepts can't be scalable" is nothing more than a silly little slogan that you thought would help you avoid the need for real argument, and apparently you can't bring yourself to step away from it no matter how painful it becomes. Good luck with that OCD.

      --
      Slashdot - News for Herds. Stuff that Splatters.
    10. Re:MySQL Sucks by Fulcrum+of+Evil · · Score: 1

      Talking about scalability in terms of infinitely fast components is just silly, because no such component can ever exist.

      That's my point. You can't talk about the scalability of abstract concepts because they can never actually exist. Only implementations can be scalable. If you want to talk about ratios, then an arbitrarily fast Turing machine can handle as much work offered in as little time desired: that is, it has infinite scalability. Such a concept is clearly nonsense.

      Please, limit yourself to the real world instead of trying to embarrass yourself with generalizations to nothingness.

      You sound like a college student. Now, given your turing machine, if you have sequential operation, then two turing machines can complete the same work in, um, the same amount of time. Scalability = 0. How is that implementation driven?

      --
      "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
    11. Re:MySQL Sucks by Zombywuf · · Score: 1

      Sequential execution scales perfectly if your batches can be run in parallel :-)

      Take a webserver, each CGI script runs sequentially but a load balancer and an array of cheap machines gives you scalability. And that by the way is scalability, the ability to vary resource consumption with regard to desired output capacity, if the former varies linearly with regard to the latter your system is said to be scalable. What you describe is capacity. Performance is what you worry about when you exceed capacity.

      --
      If you can read this you've gone too far.
  36. CODASYL Hierarchical Databases are faster by briddle · · Score: 1

    CODASYL Hierarchical Databases are faster for large complex databases. I've supported extremely large databases and user bases with 3 second or better end-to-end response times for over 300,000 real-time customer service rep users with such software. These databases allow precise physical positioning; including the ability to group related child record rows on the same physical page. One I/O can retrieve the entire set. They also support hash or other custom indexing that directly yields the physical page address instead of wading thru relational index pages to get there. Tool support is not as good and it takes someone who understands them to get the best results. Functionality such as producing report output is more work. But they work great on large datasets.

  37. Index Desk Contents by Dareth · · Score: 1

    I can quickly find anything on my desk using my index of food wrappers and containers.

    I know that report was done about the time I ate that snicker's, ah found it.

    --

    I only look human.
    My mother is a halfling and my dad is an ogre, so that makes me an Ogreling
  38. NoSQL? How about... by Anonymous Coward · · Score: 0

    Let's call it Nazgul instead? That's how I pronounce NoSQL anyway :)

  39. NoSQL - good tech, bad name by Dominican · · Score: 1

    At the ACM site Michael Stonebraker wrote an article titled "The "NoSQL" Discussion has Nothing to Do With SQL" where he discusses how the NoSQL group is solving real problems, but using a name.. that well.. really has nothing to do with the problems getting solved.

    http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext

    For anyone not familiar with Stonebreaker..
    http://en.wikipedia.org/wiki/Michael_Stonebraker

    Great article from someone who truly knows what he is talking about.

    1. Re:NoSQL - good tech, bad name by QuoteMstr · · Score: 1

      That article is a pile of garbage, and until I see further evidence, I'm going to consider its author the same. Besides managementspeak tidbits like "net-net", the article contains such wonderful truisms as:

      In summary, blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove...overhead. This is possible in either a SQL context or some other context.

      Well, yes, that's true in the same way it's true than an integer is either positive or negative. So?

      Then Stonebraker proceeds to end the article with two citations, one of which points to an article he wrote himself!

      This other article, well, here's a summary:

      A not directly PL-related paper about a new database architecture, but the authors provide some interesting and possibly controversial perspectives:

      • They split the application into per-core, single-threaded instances without any communication between them.
      • Instead of using SQL from an external (web app) process to communicate with the database, they envision embedding Ruby on Rails directly into the database.
      • They state that most database warehouse tasks rely on pre-canned queries only, so there is no need for ad-hoc querying.

      Yes, Stonebraker has rediscovered the stored procedure and the shared-nothing architecture. Does he have any original thoughts?

  40. I keep MongoDB, Sesame, and CouchDB always running by MarkWatson · · Score: 1

    MongoDB starts as a service on my MacBook and on my local network I always keep services for Sesame (RDF data store, SPARQL endpoint), MongoDB, and CouchDB running.

    It is easier to use NoSQL datastores (when they are appropriate) if you always have them running, have client libraries in place, etc.

    If you want to use a relational database, you don't have to stop to install it, get client libraires, etc. I think the same 'ready at hand-ness' shoud apply to whatever NoSQL datastores that meet your needs.

  41. IMS Fastpath by phulax · · Score: 1

    The world rely on IMS

  42. all Java? by jipn4 · · Score: 1

    Are all of those in Java? What about people who want something efficient and scalable without running JVMs everywhere? Have some of them been ported to Mono?

  43. Filesystems are Databases! by Anonymous Coward · · Score: 0

    Everything stored on disk or memory is a database! Many filesystems use B-Trees to organize files. Those directory paths are essentailly schemas. The simple pointer address to a struct, record, or object in memory is organized data. A memory block and the ASCII charset define a string.

    Every 50TB block is a blob with interpreted meaning. Whatever that meaning is, that's your database!

  44. Re:JOIN Syntax Query Plan by Anonymous Coward · · Score: 0

    The JOIN syntax should not affect the query plan. Most SQL-engines today create equally good plans for both SQL-86 and SQL-92 JOINs. This is true among MySQL, PostgreSQL, & Oracle from the last decade.

    The query plan will read any indexes it has to find column C in both tables X and Y. Then it will do an index-scan to retrieve the columns. See the "EXPLAIN PLAN" (oracle) or just "EXPLAIN" statement to see the query plans.

    SQL> select x.a, y.b from x, y where x.c = y.c;
    SQL> select x.a, y.b from x left join y on x.c;

    Also, there nothing restricting these SQL queries to one computer. The SELECT can occur recursively on a server cluster. Treat the queries on computers A & B as the UNION of the two single-computer SELECTs. SQL can be used in a cloud!

    SQL> (NOTE: AT... is not SQL!)
    select x.a, y.b from x left join y on x.c AT Computer A
    UNION
    select x.a, y.b from x left join y on x.c AT Computer B;

  45. Re:hmm - General Purpose DB by transparen · · Score: 1

    There's something to be said for using the right tool for the job. A general purpose database will be optimized for the general case, not for your specific problem. Large databases spanning multiple servers, taking extreme traffic, are sufficiently outside the scope of normal database operations, that a custom solution can be the only way to do it.

    I suggest:

    * figure out exactly what needs to be solved
    * check if existing solutions solve it
    * if not, then develop a solution (and if you're in Canada, claim IRAP and SR&ED for it ;-) )

    Reasons why my suggestion would not always work:

    * risk, both financial and project
    * skills bias (preference to change the problem to match what skills are available)
    * technology bias (preference to change the problem to match a specific technology)
    * vendor bias (preference to change the problem to match a specific vendor)

    --
    SR&ED
  46. Oh yeah, relational DBs so aren't up to the task by Anonymous Coward · · Score: 0

    50TB of data? OMG! WTF! MOREACRONYMSINCAPS! With an index and an average allocation unit of 1kB and no caching whatsoever, that could be, like, up to almost 37 seeks!!! OH NOES! DO WE HAVE ENOUGH POWER?!?!?

  47. Oh really? by Kupfernigk · · Score: 1
    Enterprise means something. It means that an application is either capable of scaling beyond a single machine, or is capable of using the resources of a large multi-CPU server, and can support many users across a large company. The design challenges are quite different from optimising an application to run efficiently on a single machine. I constantly encounter this issue, with end users and marketing not understanding why, for instance, you do not want an enterprise application to have the pretties and feature set of Word or Powerpoint.

    I too get annoyed by database luddites, especially the ones who are in there because they have no social skills, no desire to co-operate with others, and who know all the MS latest terminology but don't, for instance, actually understand how indexes work because they have never really learnt system programming. But valuable corporate data does need to be protected; its loss or corruption costs profits and jobs. SQL is a proven language with a strong track record that is largely portable and, except when queries are generated by some hopeless automated query generation engine, can be made human readable and checkable. Way to go for corporate data.

    If you had a nickel as you suggest, you probably wouldn't have enough to buy lunch in a decent restaurant. If you had a nickel for every swear word uttered by every dba or IT manager sweating blood trying to overcome data loss or corruption, you might be able to retire as you suggest.

    Remember: social networking applications are not mission critical business processes, and they do not have significant SLAs to meet.

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
  48. The examples use sql by Anonymous Coward · · Score: 0

    > Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall)

    All of these use sql though.

  49. SQLite by Nefarious+Wheel · · Score: 1

    SQLite is a nice alternative for embedded systems. The whole distribution is less than half a meg. Works quite well for the opposite side of the spectrum covered by TFA. Smaller than Access, smaller than darn near anything, a fully self contained SQL environment expressed in a file. For the Big Huge scale (petabytes) look at Google's BigTables.

    --
    Do not mock my vision of impractical footwear
    1. Re:SQLite by shutdown+-p+now · · Score: 1

      Poster specifically mentioned MSSQL, and it usually makes sense when you otherwise build on MS stack (say, .NET) to stick to MS offerings - they go along together better. Hence why I spoke of SQLCE. You get all the benefits of IDE integration with VS (can add a database file to project, and design tables within), LINQ to SQL and Entity Framework both work with CE, and so on.

      For any other solution, especially portable one, I'd definitely recommend going with SQLite. There is, in fact, an ADO.NET provider for it, which also supports Entity Framework (but not LINQ to SQL, which is non-extensible), but I don't see many advantages over SQLCE there. Both are database-in-a-single-file, both are in-process, both are properly relational/ACID. CE has a few extra minor features (e.g. first-class support for date as a datatype; SQLite manulates dates as strings only, with a few helper functions to ensure consistent representation for comparisons).

      SQLCE DLLs are somewhat larger - 1000k if used in native code only, 1300k with managed/ADO.NET. On the other hand, sqlite3.dll is 500k. I doubt this makes a difference in the majority of cases, though I've heard that some people use SQLite for WinMobile applications in lieu of SQLCE specifically because of smaller size.

      I'm not aware of any performance (memory/speed/whatever) comparisons between them, so can't say anything on that.

  50. Regarding Entity/Attribute/Value model by Tablizer · · Score: 1

    You described an entity attribute value model

    No. The EAV model creates a row-centric view of attributes. My suggestion keeps the traditional column-centric view intact. Other than being more careful about implied types when comparing and asterisk usage, most SQL will look just like it does in a "static" RDBMS. This is not the case with EAV's; they completely change the way one queries.

  51. Why choose when you can have both? by ivoras · · Score: 1

    PostgreSQL at least, and probably other databases, has a generic "key-value store" data type: http://www.postgresql.org/docs/8.3/interactive/datatype.html. With it, rows can contain some strictly-typed data (such as IDs, types, other metadata) and also contain a field (or many fields) which store all other loosely-typed data. And since it's PostgreSQL, all data is safe, can be replicated, you can have complex indexes, full text search, etc.

    --
    -- Sig down
  52. How about... by DaVince21 · · Score: 1

    ...One database per registered user.

    --
    I am not devoid of humor.