Slashdot Mirror


Is the One-Size-Fits-All Database Dead?

jlbrown writes "In a new benchmarking paper, MIT professor Mike Stonebraker and colleagues demonstrate that specialized databases can have dramatic performance advantages over traditional databases (PDF) in four areas: text processing, data warehousing, stream processing, and scientific and intelligence applications. The advantage can be a factor of 10 or higher. The paper includes some interesting 'apples to apples' performance comparisons between commercial implementations of specialized architectures and relational databases in two areas: data warehousing and stream processing." From the paper: "A single code line will succeed whenever the intended customer base is reasonably uniform in their feature and query requirements. One can easily argue this uniformity for business data processing. However, in the last quarter century, a collection of new markets with new requirements has arisen. In addition, the relentless advance of technology has a tendency to change the optimization tactics from time to time."

208 comments

  1. Perl & CSV by baldass_newbie · · Score: 1, Funny

    How did Perl & CSV fare?

    --
    The opposite of progress is congress
    1. Re:Perl & CSV by Ingolfke · · Score: 5, Funny

      How did Perl & CSV fare?

      It failed the "relational" part of the test. But it failed very quickly.

    2. Re:Perl & CSV by Tablizer · · Score: 1

      It would be interesting to see how it does on joins that don't fit in RAM.

    3. Re:Perl & CSV by patio11 · · Score: 4, Funny
      It failed the "relational" part of the test. But it failed very quickly.

      Yep. On the plus side, the Perl hacker who put it together only wasted the time it took to write one line. Granted, the line was 103,954 characters long. He considered breaking it up into two lines to improve readability but ultimately rejected the notion -- anyone not capable of reading the program clearly had no business messing with it anyhow. (Quick question aside from the snark: since Perl has associative arrays can't it emulate a relational database? It was my understanding that after you've got associative arrays you can get to any other conceivable data structure... assuming you're willing to take the performance hit.)

    4. Re:Perl & CSV by nuzak · · Score: 3, Interesting

      > It was my understanding that after you've got associative arrays you can get to any other conceivable data structure

      Once you have lambda you can get to any conceivable data structure. The question is, do you really want to?

      sub Y (&) { my $le=shift; return &{sub {&{sub {my $f=shift; &$f($f)}}(sub {my $f=shift; &$le(sub {&{&$f($f)}(@_)})});}}}

      --
      Done with slashdot, done with nerds, getting a life.
    5. Re:Perl & CSV by Anpheus · · Score: 1

      Could a perl hacker explain the code for those of us who can't read crazy?

      (No offense, I'm just not a fan of Perl syntax. If you can make working programs out of it, then it's my loss by not being able to debug them if I ever have to use Perl.)

    6. Re:Perl & CSV by mysticgoat · · Score: 1

      since Perl has associative arrays can't it emulate a relational database?

      I've built actual relational databases to run in memory using Perl's hashes. This was a good way of doing some prototyping for user feedback before telling the MUMPS coders what it was exactly that we wanted them to do. (Their titles were "Programmer/Analyst", but neither one had any interest or skill in analyzing clinical needs: they were both happy to be just codemonkeys.) Performance with Perl was pretty snazzy but my constant worry was that some clever user would find a repeatable way to thrash the disk cache and make the project look bad— but that never happened. Persistence was with modified csv files (using the pipe char as the delimiter since it never occurred in the data sets). The memory resident tables were loaded on startup and written back to disk on shutdown, and we didn't worry about losing data in crashes since these were prototypes, not live. We could open up the disk files between runs with Excel, and use it to do some sanity checking, or introduce strange conditions. The biggest problem was cajoling the doctors and nurses to drop by and play with the prototype, and then try to get useful feedback out of some of them.
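
      A minimal sketch of that kind of prototype, in Python rather than Perl: pipe-delimited text is loaded into dictionaries keyed by primary key, joined with a hash lookup, and serialized back. The table names, columns, and sample rows are made up for illustration, and in-memory strings stand in for the files on disk.

```python
import csv
import io

def load_table(text, key):
    """Read pipe-delimited text into a dict of rows keyed by column `key`."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return {row[key]: row for row in reader}

def dump_table(table, fieldnames):
    """Serialize the in-memory table back to pipe-delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter="|", lineterminator="\n")
    writer.writeheader()
    writer.writerows(table.values())
    return out.getvalue()

# Hypothetical sample data standing in for the files on disk.
PATIENTS = "id|name\n1|Ada\n2|Grace\n"
VISITS = ("visit_id|patient_id|reason\n"
          "10|1|checkup\n11|2|x-ray\n12|1|follow-up\n")

patients = load_table(PATIENTS, "id")
visits = load_table(VISITS, "visit_id")

def visits_with_names():
    """Join visits to patients on patient_id via a dictionary lookup."""
    return [
        (v["visit_id"], patients[v["patient_id"]]["name"], v["reason"])
        for v in visits.values()
    ]
```

      Nothing here survives a crash between load and dump, which matches the trade-off described above for a throwaway prototype.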

    7. Re:Perl & CSV by patio11 · · Score: 4, Interesting

      I think it implements a Y combinator. Then again, it could just print out "Just another perl hacker". But I'm guessing on the Y combinator. Let's break it down so it's readable:

      sub Y (&) {
        my $le=shift;
        return &{
          sub { ## SUB_A
            &{
              sub { ## SUB_B
                my $f=shift;
                &$f($f)
              }
            } ##Close SUB_A's block
            (sub { ## SUB_C
              my $f=shift;
              &$le(sub { ##SUB_D
                &{
                  &$f($f)
                }
                (@_)
              }## END SUB_D
            )} ##END SUB_C
            ); ##End the block enclosing SUB_C
          } ## END SUB_A
        } ## Close the return line
      } ##Close sub Y

      Y can have any number of parameters you want (this is sort of a "welcome to Perl, n00b, hope you enjoy your stay" bit of pain). The first line of the program assigns le to the first parameter and pops that one off the list. That & used in the next line passes the rest of the list to the function he's about to declare. So we're going to be returning the output of that function evaluated on the remaining argument list. Clear so far?

      OK, moving on to SUB_A. We again use the & to pass the list of arguments through to ... another block. This one actually makes sense if you look at it -- take the first argument from the list, evaluate it as a function on itself. We're assuming that is going to return a function. Why? Because that opening paren means we have arguments, such as they are, coming to the function.

      OK, unwrapping the arguments. There is only one argument -- a block of code encompassing SUB_C. (Wasted 15 minutes figuring that out. That's what I get for doing this in Notepad instead of an IDE that would auto-indent for me. Friends don't let friends read Perl code.)

      By now, bits and pieces of this are starting to look almost easy, if no closer to actual readable computer code. We reuse the function we popped from the list of arguments earlier, and we use the same trick to get a second function off of the argument list. We then apply that function to itself, assume the result is a function, and then run that function on the rest of the argument list. Then we pop that up the call stack and we're, blissfully, done.

      So, now that we understand WTF this code is doing, how do we know it's the Y combinator? Well, we've essentially got a bunch of arguments (f, x, whatever). We ended up doing LAMBDA(f,(LAMBDA(x,f (x x)),(LAMBDA(x,f (x x)))) . Which, since I took a compiler class once and have the nightmares to prove it, is the Y combinator.

      Now you want to know the REALLY warped thing about this? I program Perl for a living (under protest!), I knew the answer going in (Googled the code), and I have an expensive theoretical CS education which includes all of the concepts trotted out here... and the Perl syntax STILL made me bloody swim through WTF was going on.

      I. Hate. Perl.

      And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.
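
      For comparison, a rough Python rendering of the same combinator. This is a sketch, not a translation checked against the Perl: the names Y, le and fact are illustrative. The inner lambda f: f(f) plays the role of SUB_B, the second lambda corresponds to SUB_C, and the lambda *args wrapper corresponds to SUB_D. That wrapper delays the self-application so the definition terminates in a strictly evaluated language.

```python
def Y(le):
    """Fixed-point combinator, eta-expanded so it works under strict evaluation."""
    return (lambda f: f(f))(                      # SUB_B: apply f to itself
        lambda f: le(                             # SUB_C: hand `le` a recursion handle
            lambda *args: f(f)(*args)             # SUB_D: delay f(f) until actually called
        )
    )

# Usage: an anonymous factorial that recurses through `self`, never by name.
fact = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
```

      Calling fact(5) unwinds through the delayed self-application one level per recursive call.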

    8. Re:Perl & CSV by tootlemonde · · Score: 1

      we didn't worry about losing data in crashes since these were prototypes

      You highlight a critical point in evaluating databases, namely, performance is not necessarily the most important consideration, even in high demand environments.

      Databases also

      • have to recover from system crashes,
      • have to be backed up while running,
      • have to handle both reads and writes,
      • have to be replicated,
      • have to scale, and
      • have to be supported.

      Customized systems that are optimized for performance often sacrifice one or more of the other requirements. In a medical environment, all of these would be more important than performance regardless of how high the demand was.

      A typical example is using caching strategies to reduce access times. Every effort to deal with crashes and backups degrades the advantages of caching.

      If you have relatively static content with a predictable growth rate then you can generally concentrate on performance to the exclusion of all other factors. When all the other factors are critical, the best way to deal with performance is through hardware. In my experience, the effort to squeeze more performance out of an application is regularly overtaken by improvements in the performance of the underlying hardware. Performance is often a problem that solves itself just by waiting.

    9. Re:Perl & CSV by theskipper · · Score: 0, Flamebait

      Cool post.

      Which reminds us of everyone's first reaction to discovering the Obfuscated Perl Contest:

      "Wow, now that's redundant."

    10. Re:Perl & CSV by shotgunefx · · Score: 2, Insightful

      Outside of "Golfing", I'd strongly disagree. I don't think the community encourages it for the most part.

      This is from someone who's spent the last seven years with Perl and in the community. YMMV

      --

      -William Shatner can be neither created nor destroyed.
    11. Re:Perl & CSV by 00lmz · · Score: 1

      DBD::CSV does SQL on CSV files... It's relational, right? :-)

    12. Re:Perl & CSV by Ingolfke · · Score: 1

      Ah... another one of my ideas shot down by CPAN :)

    13. Re:Perl & CSV by FacePlant · · Score: 2, Informative

      And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.

      Not all of us encourage this.

      It's considered *clever* and a mark of great skill that you can strip out all the code that actually explains WTF your code is doing and be left with the perfectly compressed version.

      They call this Perl Golf (shaving strokes off your game. Get it?)
      Many of us do not consider it clever. Rather, we consider it stupid and counter-productive.

      On the other hand, all of the sample answers posted at the Python Challenge are all golf style, and the Python Challenge is supposed to be a learning tool.

      This is modeled as good Perl style to folks just starting with the language,
      People who do this should be tied up with string and left in small dark rooms. For a month.

      the Llama book has lots of code which looks like that, and code samples you find will look like it too.
      This just isn't the case. The code samples in the Llama are no more or less obtuse than the code samples in my Pragmatic Ruby book.

      It appears that the community largely does not teach perl like it is a language that needs to be read.
      I wish I could argue more strongly with you here, other than to assert that I come across code in many languages (Perl, Ruby, Java, C, Lisp), on a regular (daily, weekly, monthly) basis, at work and at home, in books, magazines, and online that appears written to not be read.

      Your complaint of bad coding practices is endemic to the industry, and should not be used to condemn a language because it allows the freedom to code poorly.

      --
      My Heart Is A Flower
    14. Re:Perl & CSV by Anonymous Coward · · Score: 1, Informative

      And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.

      I think it's unfair to say that the community actually encourages this sort of unreadable code ... there is a fairly strong distinction between a well-meaning Perl obfuscation and an actual project attempting to accomplish a specific goal (although I will concede that the lines between the two can be blurry at times). If we were working on a work-related project I would have smacked you upside the head had you tried to commit that atrocity to the codebase, as would quite a few other Perl developers. And, please, it is easy to write muck no matter which language you're working with. Your example really doesn't prove anything because, clearly, most people in their right minds wouldn't do what you just did.
    15. Re:Perl & CSV by Schraegstrichpunkt · · Score: 1

      Yeah, Perl sucks. He should have written it in PHP instead... ;-)

    16. Re:Perl & CSV by phlamingo · · Score: 1

      I think it's unfair to say that the community actually encourages this sort of unreadable code ... there is a fairly strong distinction between a well-meaning Perl obfuscation and an actual project attempting to accomplish a specific goal (although I will concede that the lines between the two can be blurry at times). If we were working on a work-related project I would have smacked you upside the head had you tried to commit that atrocity to the codebase, as would quite a few other Perl developers. And, please, it is easy to write muck no matter which language you're working with. Your example really doesn't prove anything because, clearly, most people in their right minds wouldn't do what you just did.

      ... and what has that got to do with Perl hackers?



      But, seriously, a lot of people have that impression of the Perl community, that they take perverse pride in writing unreadable code, and that there is a certain snobbery about it. I agree that there is no technical reason that Perl must be unreadable. When I have to write Perl, it looks more like Pascal than the kind of glossolalia that started this sub-thread, mostly because I don't want to spend the effort to learn Perl that deeply. Life is too short, and there are too many other things to do.


      It seems to me that the Perl community is about equally divided between pragmatists who appreciate the power of the language and CPAN, and wackos who promote a cult of Gnostic mysticism. Guess which group I think of when Perl is mentioned?


      --
      I had forgotten how much cooler teenagers look when they are smoking. Oh, wait ...
  2. "In the last quarter century..." by AndroidCat · · Score: 2, Funny

    Well it's about time we had some change around here!

    --
    One line blog. I hear that they're called Twitters now.
  3. Was there ever a one-size-fits-all database? by Ant+P. · · Score: 1

    The closest thing I can think of that fits that description is Postgres.

    1. Re:Was there ever a one-size-fits-all database? by Architect_sasyr · · Score: 2, Funny

      There's a difference between fitting and being forced to fit into something ;)

      --
      Me failed English...
      FreeBSD over Linux. If my comments seem odd, this may explain...
    2. Re:Was there ever a one-size-fits-all database? by mwanaheri · · Score: 1

      Well, you might say that an xxl-sized shirt fits all, but only if you say that if you can get in, it fits you. For most of my s-uses, postgres offers far more than I need (still, postgres is my default).

      --
      Idha khatabahum lijahiluna qalu salaman
    3. Re:Was there ever a one-size-fits-all database? by trACE666 · · Score: 1

      If I am not mistaken, both Oracle and IBM use the same code base for all the versions of their RDBMS products.

    4. Re:Was there ever a one-size-fits-all database? by egghat · · Score: 2, Informative

      Btw. Postgres was a project from Stonebraker meant to deal with the limitations of SQL (POST inGRES).

      See the history of PostgreSQL.

      When the community picked the old, dormant Postgres source code up (no problem due to the BSD licensing), the first thing that was added (after some debates) was the SQL syntax, hence the name change to PostgreSQL.

      Bye egghat.

      --
      -- "As a human being I claim the right to be widely inconsistent", John Peel
    5. Re:Was there ever a one-size-fits-all database? by Anonymous Coward · · Score: 0

      From the original POSTGRES white paper

      The INGRES relational database management system (DBMS) was implemented during 1975-1977 at the University of California. Since 1978 various prototype extensions have been made to support distributed databases [STON83a], ordered relations [STON83b], abstract data types [STON83c], and QUEL as a data type [STON84a]. In addition, we proposed but never prototyped a new application program interface [STON84b]. The University of California version of INGRES has been ''hacked up enough'' to make the inclusion of substantial new function extremely difficult. Another problem with continuing to extend the existing system is that many of our proposed ideas would be difficult to integrate into that system because of earlier design decisions. Consequently, we are building a new database system, called POSTGRES (POST inGRES).

      From the Wikipedia article

      Since the mid-1980s, Ingres had spawned a number of commercial database applications, including Sybase, Microsoft SQL Server, NonStop SQL and a number of others. Postgres (Post Ingres), a project which started in the mid-1980s, later evolved into PostgreSQL. By any measure, Ingres is one of the most influential modern computer research projects.
  4. Noticed how roll your own is faster? by BillGatesLoveChild · · Score: 2, Interesting

    Have you noticed when you code your own routines for manipulating data (in effect, your own application specific database) you can produce stuff that is very, very fast? In the good old days of the Internet Bubble 1.0 I took an application specific database like this (originally for a record store) and generalized it into a generic database capable of handling all sorts of data. But every change I made to make the code more general also made it less efficient. The end result wasn't bad by any means: we sold it as an eCommerce database to a number of solutions, but as far as the original record store database went, the original version was by far the best. Yes. I *know* generic databases with fantastic optimization engines designed by database experts should be faster, but have you noticed how much time you have to spend with the likes of Oracle or MySQL trying to get it to do what to you is an exceedingly obvious way of doing something?

    1. Re:Noticed how roll your own is faster? by smilindog2000 · · Score: 4, Interesting

      I write all my databases with the fairly generic DataDraw database generator. The resulting C code is faster than if you wrote it manually using pointers to C structures (really). http:datadraw.sourceforge.net. It's generic, and faster than anything EVER.

      --
      Beer is proof that God loves us, and wants us to be happy.
    2. Re:Noticed how roll your own is faster? by Anonymous Coward · · Score: 5, Informative

      Looks interesting, will check it out. Working URL for the lazy: http://datadraw.sourceforge.net/

    3. Re:Noticed how roll your own is faster? by BillGatesLoveChild · · Score: 1

      That link is broken. It's actually http://datadraw.sourceforge.net/ but thanks SmilingDog. Checking it out now. Looks interesting.

    4. Re:Noticed how roll your own is faster? by trimbo · · Score: 1

      Looks interesting. Would be nice if it worked with C++ classes. Has anyone tried creating a C++ app around this?

    5. Re:Noticed how roll your own is faster? by The+Real+Nem · · Score: 1

      It's hard to take any project seriously (professional or not) when its web page has such glaring mistakes as random letter b's in its source (clearly visible in all the browsers I've tried), more white space than anyone can reasonably shake a stick at and poor graphics (I'm looking at the rounded corners of the main content).

      As interesting as it sounds, it makes me wonder what could be wrong with the code...

    6. Re:Noticed how roll your own is faster? by smilindog2000 · · Score: 1

      There was a C++ wrapper written for the old (and more stable) version of DataDraw. However, we really wanted it to generate C++ classes, and that turned into a mess. It required DataDraw to parse your C++ header files and insert code into your manually written classes. Microsoft pulled it off with their Class Wizard, but we just didn't have the bandwidth to finish it.

      --
      Beer is proof that God loves us, and wants us to be happy.
    7. Re:Noticed how roll your own is faster? by RAMMS+EIN · · Score: 1

      Is there any documentation for it (didn't see a link on the webpage)? How do I use it in my program?

      Are there any benchmark results that prove the claims about it being faster? How much faster (than what?) is it, really?

      --
      Please correct me if I got my facts wrong.
    8. Re:Noticed how roll your own is faster? by RAMMS+EIN · · Score: 1

      ``Is there any documentation for it (didn't see a link on the webpage)? How do I use it in my program?''

      Never mind, I found the link. Must have skipped past it the first time. Perhaps it would be a good idea to add it to one of the edges of the page?

      --
      Please correct me if I got my facts wrong.
    9. Re:Noticed how roll your own is faster? by smilindog2000 · · Score: 1

      There is a manual in OpenOffice format (yeah, I really AM a PITA). It was benchmarked heavily in the late 90's, internally at an EDA company before deciding to use the integer-based object references. All programs (including a placer and a router) sped up, and the range was from 10% to 20%, depending on the tool. The average was about 15%. Improvements were more pronounced in tools with larger amounts of data, which we felt was due to cache effects. It would be nice to redo the benchmarks, with open-source programs, but it takes a TON of work. You basically have to take a program that is already well optimized by hand, and convert it to use data in a DataDraw database rather than custom C structures. However, the trend has been that cache effects have become even more important, so I expect the benchmarks to be even better next time.
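
      One way to picture "integer-based object references" is a table whose fields live in separate contiguous arrays, with a row's position serving as its handle. The sketch below is only an illustration of that layout under that assumption; PinTable and its fields are hypothetical and are not DataDraw's generated code. Python will not show the cache benefit, only the data layout.

```python
from array import array

class PinTable:
    """Hypothetical table: each field is stored in its own contiguous array."""

    def __init__(self):
        self.x = array("i")
        self.y = array("i")
        self.next_pin = array("i")  # integer reference to another row; -1 means none

    def create(self, x, y):
        """Append a row and return its integer handle."""
        self.x.append(x)
        self.y.append(y)
        self.next_pin.append(-1)
        return len(self.x) - 1

def chain_length(table, start):
    """Follow integer references from `start` until the chain ends."""
    count, cur = 0, start
    while cur != -1:
        count += 1
        cur = table.next_pin[cur]
    return count

pins = PinTable()
a = pins.create(0, 0)
b = pins.create(3, 4)
c = pins.create(6, 8)
pins.next_pin[a] = b
pins.next_pin[b] = c
```

      A loop that only needs x, such as sum(pins.x), touches one dense array instead of striding across whole records, which is the kind of access pattern where cache effects would show up in compiled code.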

      --
      Beer is proof that God loves us, and wants us to be happy.
    10. Re:Noticed how roll your own is faster? by sonofagunn · · Score: 1

      In our company, we use the database mostly as a warehouse. Our daily processing is done via flat files and Java code. It's just much, much, much faster that way and easier to maintain. I think we're kind of a special case though.

    11. Re:Noticed how roll your own is faster? by suggsjc · · Score: 1
      I think we're kind of a special case though.
      Yep, you are special...just like everyone else.


      On a side note. I know the term flat files can mean different things to different people, but I find that they are almost always a bad idea (to some degree and depending on your definition). You always run the risk of whatever you are using as delimiters coming up in the data you are parsing giving those "bugs." You always think "we sanitize our data..." and it will never happen to me, but more times than not, it will.
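
      A small illustration of the delimiter problem, using Python's standard csv module. The sample row is invented. Joining and splitting on the delimiter silently produces an extra field, while a writer that quotes fields round-trips the row intact.

```python
import csv
import io

# Hypothetical row whose second field happens to contain the delimiter.
row = ["42", "Smith|Jones Ltd", 'said "hi"']

# Naive approach: join on the delimiter, then split it back.
naive_line = "|".join(row)
naive_fields = naive_line.split("|")

# Quoting approach: the csv writer quotes any field containing the
# delimiter or the quote character, and the reader undoes it.
buf = io.StringIO()
csv.writer(buf, delimiter="|", lineterminator="\n").writerow(row)
quoted_line = buf.getvalue()
quoted_fields = next(csv.reader(io.StringIO(quoted_line), delimiter="|"))
```

      The naive version yields four fields from a three-field row; the quoted version returns the original three.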
      --
      When I have a kid, I want to put him in one of those strollers for twins and then run around the mall looking frantic.
    12. Re:Noticed how roll your own is faster? by fingusernames · · Score: 2, Interesting

      Back in the late 90s, I worked on a data warehouse project. We tried Oracle, and had an Oracle tuning expert work with us. However, we couldn't get the performance we needed. We wound up developing a custom "database" system, where data was extracted from the source databases (billing, CDRs, etc.) and de-normalized into several large tables in parallel. The de-normalization performed global transformations and corrections. Those tables were then loaded into shared memory (64bit HP multi-CPU system with a huge amount of RAM for those days, 32GB IIRC), indices were built, and a highly optimized algorithm (over time it kept getting tighter and smaller) was used to join the data based on various criteria using standard, left, right and some hybrid methods. The join algorithm operated on pointers to tables of pointers. Initially, developers used a PERL script to pre-process simple pseudo-SQL into C code/macros, that would be linked to their report application. As the project grew, I developed a SQL-derived language that was run through a cross-compiler to generate the C code and macros to link to applications. That language supported joins, views, temporary tables, and some other useful features that enabled developers to work quickly in implementing report requests. The system was very fast for our purposes, performing fraud analysis and sales trends analysis nightly. In parallel to that analysis on a different server, the de-normalized data was also exported to a Redbrick database so users could perform desktop reporting over historical data. I was the overall technical architect for the system, and the developer of the joining system and the SQL-like language and compiling/development tools. I'm sure that today though there are data warehouse specific tools that would eliminate most of that.

      Larry

    13. Re:Noticed how roll your own is faster? by Anonymous Coward · · Score: 0

      > ... and had an Oracle tuning expert

      "an" Well, there's your problem right there.

  5. Prediction... by Ingolfke · · Score: 4, Insightful

    1) More and more specialized databases will begin cropping up.
    2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.
    3) Someone will write a paper about how we've gone from specialized to monolithic...
    4) Something else will trigger specialization... (repeat)

    Dvorak if you steal this one from me I'm going to stop reading your writing... oh wait.

    1. Re:Prediction... by Tablizer · · Score: 3, Interesting

      2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.

      I agree with this prediction. Database interfaces (such as SQL) do not dictate implementation. Ideally, query languages only ask for what you want, not tell the computer how to do it. As long as it returns the expected results, it does not matter if the database engine uses pointers, hashes, or gerbils to get the answer. It may however require "hints" in the schema about what to optimize. Of course, you will sacrifice general-purpose performance to speed up a specific usage pattern. But at least they will give you the option.

      It is somewhat similar to what "clustered indexes" do in some RDBMS. Clusters improve the indexing by a chosen key at the expense of other keys or certain write patterns by physically grouping the data by that *one* chosen index/key order. The other keys still work, just not as fast.
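
      A toy model of that trade-off, with made-up order data: the rows are physically kept sorted by one chosen key, so a range query on that key is a binary search plus a contiguous slice, while a lookup by any other key goes through a secondary index whose positions are scattered through the table. This is only a sketch of the idea, not how any particular RDBMS lays out pages.

```python
import bisect

# Hypothetical rows (order_id, customer, amount), stored sorted by order_id.
rows = sorted([
    (105, "bob", 20), (101, "amy", 35), (103, "bob", 15),
    (102, "cat", 50), (104, "amy", 10),
])
cluster_keys = [r[0] for r in rows]

def range_by_order_id(lo, hi):
    """Clustered key: one binary search, then a contiguous slice of rows."""
    start = bisect.bisect_left(cluster_keys, lo)
    end = bisect.bisect_right(cluster_keys, hi)
    return rows[start:end]

# Secondary index: customer -> row positions, not adjacent in storage.
by_customer = {}
for pos, (_, customer, _) in enumerate(rows):
    by_customer.setdefault(customer, []).append(pos)

def rows_for_customer(customer):
    """Non-clustered key: still works, but each hit is a separate jump."""
    return [rows[pos] for pos in by_customer.get(customer, [])]
```

      Inserting a row with a key in the middle of the range would force the sorted storage to shift, which is the write-pattern cost mentioned above.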

    2. Re:Prediction... by theshowmecanuck · · Score: 1
      The reasons for this "one size fits all" (OSFA) strategy include the following:
      Engineering costs...
      Sales costs...
      Marketing costs...

      What about the cost of maintenance for the customer?

      Maybe people will keep buying 'one size fits all' DBMSs if they meet enough of their requirements and they don't have to hire specialists for each type of database they might have for each type of application. That is, it is easier and cheaper to maintain a smaller number of *standard* architectures (e.g. one) for a company. Otherwise you have to pay for all sorts of different types of specialists. Now if your company only does say, data warehousing, then that is another matter and it is smart to purchase a specialized system. Or if you are a mega corporation you might be able to afford to have a number of specialist teams for each type of system. But I think smaller shops might need to make do with the poor old vanilla DBMS.

      --
      -- I ignore anonymous replies to my comments and postings.
    3. Re:Prediction... by Pseudonym · · Score: 2, Interesting

      Interfaces like SQL don't dictate the implementation, but they do dictate the model. Sometimes, the model that you want is so far from the interface language, that you need to either extend or replace the interface language for the problem to be tractable.

      SQL's approach has been to evolve. It isn't quite "there" for a lot of modern applications. I can foresee a day when SQL can efficiently model all the capabilities of, say, Z39.50, but we're not there now.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    4. Re:Prediction... by Tablizer · · Score: 1

      Z39.50 is specific to text searches, no? SQL and Z39.50 are apples and oranges.

    5. Re:Prediction... by dkf · · Score: 1

      SQL and text searching? Check out the FTS1 module for SQLite...

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    6. Re:Prediction... by Decaff · · Score: 2, Informative

      I agree with this prediction. Database interfaces (such as SQL) do not dictate implementation. Ideally, query languages only ask for what you want, not tell the computer how to do it.

      This can be taken a stage further, with general persistence APIs. The idea is that you don't even require SQL or relational stores: you express queries in a more abstract way and let a persistence engine generate highly optimised SQL, or some other persistence process. I use the Java JDO 2.0 API like this: I can persist and retrieve information from relational stores, CSV, XML, LDAP, Object Databases or even flat text files using exactly the same code and queries, and yet I get optimised queries on each - if I persist to Oracle, the product knows enough about Oracle (and even the specific version of Oracle) to generate very optimised SQL.

    7. Re:Prediction... by Frank+T.+Lofaro+Jr. · · Score: 1

      Using Java and optimizing for performance is like using Windows and optimizing for stability, or like living in hell and optimizing for coolness.

      --
      Just because it CAN be done, doesn't mean it should!
    8. Re:Prediction... by Pseudonym · · Score: 2, Insightful

      Z39.50 is actually much, much more than mere "text searching". If you think hard about the way that you interact with a library catalogue or Google compared with how you interact with a RDBMS, you'll realise there are quite a few more differences than just "text searching".

      Think about highly heterogeneous data. Libraries, for example, might index books, periodicals, audio-visual items and online resources such as journals. Google indexes web pages, Usenet news articles, PDF documents and so on. And you can search them all by "title".

      Think about "result sets" instead of sequences of tuples. When you search google, or a library catalogue, what you get is a bunch of summary information which you page through, then eventually retrieve the record that you want. Or you might refine your query by adding new search terms or sorting your results by some key. The key data structure here is the "result set": a sequence of record numbers. Everything happens to result sets. You sort your results by state, or intersect the set with another query. The whole process is record-oriented. SQL, on the other hand, is data-oriented: the central data structure is a sequence of tuples, and tuples contain real data.

      I hear you objecting that there are ways to do this in SQL, and you'd be right. But in this kind of application, it's always going to be at the expense of a lot more time (more processing grunt required, or less opportunity to exploit disk locality) or much more disk space, if only because of the extra indirection required. If you have terabytes of information, this bites, and bites hard. You wouldn't use Google or your library catalogue if it were ten times slower.

      SQL is optimised for the case where data is "right there". Z39.50 is optimised for the case where accessing real data is expensive, because it might involve parsing XML or PDF. People complain about how supposedly inefficient XML data is, but the fact is, there's no better way to do text with structure. The real problems are a) people use XML for things that aren't structured text, and b) relational databases can't handle it with reasonable efficiency at the moment.

      Yes, I know, SQL will eventually be able to handle things like this. But it's not there yet.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    9. Re:Prediction... by Tablizer · · Score: 1

      This can be taken a stage further, with general persistence APIs. The idea is that you don't even require SQL or relational stores: you express queries in a more abstract way and let a persistence engine generate highly optimised SQL, or some other persistence process. I use the Java JDO 2.0 API like this: I can persist and retrieve information from relational stores, CSV, XML, LDAP, Object Databases or even flat text files

      I flat-out don't believe you. OOP API's in many ways are a lower-level abstraction than relational query languages. And the features of those things you listed are not one-for-one exchangeable. Let's see you do a 3-way join and then a GROUP-BY-like summary with flat files and handling multi-user concurrency without reinventing a database to pull it off.

    10. Re:Prediction... by Pseudonym · · Score: 1

      You might say that, but it'd be quite misleading.

      SQL isn't just about searching either. It's actually an embodiment of a relational data model which includes insertion, modification, searching, sorting and data retrieval.

      Z39.50 doesn't (yet) model database modification, but it's actually a feature-rich, and quite generic, model for textual information retrieval and presentation. Searching is but one part of that.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    11. Re:Prediction... by Decaff · · Score: 1

      Using Java and optimizing for performance is like using Windows and optimizing for stability, or like living in hell and optimizing for coolness.

      Considering that the most profitable site on the internet, eBay, does exactly that, I would say, on balance, that you are wrong.

    12. Re:Prediction... by Decaff · · Score: 1

      I flat-out don't believe you. OOP API's in many ways are a lower-level abstraction than relational query languages.

      But almost all coding is done in an OOP way these days, so it is a highly useful abstraction.

      And the features of those things you listed are not one-for-one exchangeable. Let's see you do a 3-way join and then a GROUP-BY-like summary with flat files and handling multi-user concurrency without reinventing a database to pull it off.

      You are missing the point completely. No-one is saying that you will get all the features of a concurrent high-performance relational database no matter what type of persistence you use; the point is that you can use exactly the same code and queries for all those types of persistence. For example, I can open a transaction, define and run a query including joins and grouping, and retrieve objects, and modify them and have them be persisted when I close the transaction, no matter what persistence store I use - flat file, relational database and so on. Of course you don't get multi-user concurrency with flat files! But who cares? The point is that you automatically get multi-user concurrency when you run the same code on a relational database without having to change a line of code.

    13. Re:Prediction... by Tablizer · · Score: 1

      But almost all coding is done in an OOP way these days, so it is a highly useful abstraction.

      That is not true. Even many OO proponents complain that actual code being created is often not really OOP, "procedural in an OO dress" they sometimes say. OO's lip-service-to-usage ratio is high. Plus, popularity does not equal "good". Otherwise Windows is 10 times better than Linux or Mac.

      You are missing the point completely. No-one is saying that you will get all the features of a concurrent high-performance relational database no matter what type of persistence you use; the point is that you can use exactly the same code and queries for all those types of persistence. For example, I can open a transaction, define and run a query including joins and grouping, and retrieve objects, and modify them and have them be persisted when I close the transaction, no matter what persistence store I use - flat file, relational database and so on.

      For something very simple, yes. But not for non-trivial stuff. I've encountered this claim multiple times before and it turned out to be bogus when actual code example was given. They either leave out some important key limit, or use some bloated API that makes for very very verbose ugly code, almost like OOP "query" assembler. With enough effort you can wrap and swap anything, but the result can be far worse than what you wrap. Query languages are not "raw implementation", they can be a powerful abstraction tool if you know how to use them. I'll believe it when I see actual code (that doesn't have described flaws).

    14. Re:Prediction... by Decaff · · Score: 1

      That is not true.

      OK, then - I personally find OOP a highly useful abstraction. Popularity also does not equal bad!

      For something very simple, yes. But not for non-trivial stuff.

      If you look at real examples, it turns out that this approach is actually more powerful for the more complex stuff, as the abstraction saves time.

      I've encountered this claim multiple times before and it turned out to be bogus when actual code example was given. They either leave out some important key limit, or use some bloated API that makes for very very verbose ugly code, almost like OOP "query" assembler. With enough effort you can wrap and swap anything, but the result can be far worse than what you wrap. Query languages are not "raw implementation", they can be a powerful abstraction tool if you know how to use them. I'll believe it when I see actual code (that doesn't have described flaws).

      Well, there is no way to reply to that, as you are free to label any code I provide as ugly and verbose. What matters is whether the API provides performance and productivity. My experience is, unquestionably, yes.

      All I can say is that I actually use JDO 2.0 in precisely the way I have described: this is not imaginary.

      Also, I am certainly not excluding the use of query languages. JDO 2.0 specifies a rich query language. The difference from, say, SQL, is that it is truly portable, and all implementations of JDO must provide all features of the JDO query language, no matter what platform and no matter what persistence method - no incomplete implementations of standards, as has always been the case with SQL.

      Here is an example of using JDO 2.0:

      pm.currentTransaction().begin();
      Query query = pm.newQuery("SELECT UNIQUE FROM Person WHERE lastName == 'Jones'" +
                                " && age < :age_limit PARAMETERS double age_limit ORDER BY surname ASCENDING");
      List results = query.execute(24);

      That is it. Not verbose; nothing like query assembler.

    15. Re:Prediction... by shalmaneser1 · · Score: 1

      there's no better way to do text with structure...

      YAML's not bad for structured data in text. If you need something that maps directly onto code-based data structures, it's great (which, granted, isn't always the case). At any rate, it's much more compact than XML, to be sure.
    16. Re:Prediction... by Tablizer · · Score: 1

      If you look at real examples, it turns out that this approach is actually more powerful for the more complex stuff, as the abstraction saves time.

      Good abstractions do. Bad abstractions can make things worse than no abstraction.

      I personally find OOP a highly useful abstraction.

      I agree a lot of it may be a personal preference. People think differently and are tripped up or helped by different things. OOP has not been shown to improve any objective metric (except maybe systems software, which is not my domain so I don't care.)

      Your example looks like basic DB access API's. I've seen similar things in 30-year-old FORTRAN. They are just an API to send queries and receive results from databases. OOP does not change the nature of them much. One could do something similar with procedural API's:

      db = dbContext(.....);
      dbBeginTransaction(db);
      rs = dbQuery(db, "update foo where bar=" . myID);
      rs2 = dbQuery(db, "select * from foo where....");
      while (row = dbGetRow(rs2)) {
          printStuff(row["ssn"]....);
      }
      dbEndTransaction(db); ...

      Some argue that OOP allows one to switch DB API vendors, but that has almost nothing to do with domain development productivity. And the swappable benefit claims only play out in a very *limited* set of circumstances, whose likelihood OOP books exaggerate by far.

    17. Re:Prediction... by Decaff · · Score: 1

      OOP has not been shown to improve any objective metric (except maybe systems software, which is not my domain so I don't care.)

      This is just plain wrong. There is no doubt that the ability to refactor code is a major benefit in terms of code quality and productivity, and this is largely due to OOP. I have personally seen huge productivity benefits in my company resulting from code re-use, both between developers in the company and from the use of external libraries, as a result of OOP.

      Your example looks like basic DB access API's.

      Then I have really done a bad job of explaining things.

      I've seen similar things in 30-year-old FORTRAN. They are just an API to send queries and receive results from databases. OOP does not change the nature of them much. One could do something similar with procedural API's:

      You really do need to look at JDO 2.0. It is light-years away from that situation. All I am doing with that query is retrieving a few objects. But then, any changes I make to objects (in normal code) are transparently noted, and saved to the data store automatically. There is no need for an update query or API. I don't even need to retrieve most of the objects. Most of the retrieval is in the background, and automatic. And all this works using the highest efficiency SQL (on relational stores), is transactionally safe and clusterable with no coding required. The key phrase you need to look up is 'transparent persistence'.
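
      For anyone following along, here is a rough sketch of what 'transparent persistence' means, written in Python as a toy and not using the JDO API. The Transaction and Persistent classes, and the dict standing in for the data store, are hypothetical; a real implementation typically does this bookkeeping through bytecode enhancement or proxies rather than a base class.

```python
class Transaction:
    """Toy unit of work: remembers which objects changed, writes them on commit."""

    def __init__(self, store):
        self.store = store   # stand-in for the real data store
        self.dirty = []      # objects modified since the last commit

    def mark_dirty(self, obj):
        if not any(obj is d for d in self.dirty):
            self.dirty.append(obj)

    def commit(self):
        for obj in self.dirty:
            self.store[obj.key] = dict(obj.fields)
        written = len(self.dirty)
        self.dirty.clear()
        return written


class Persistent:
    """Object whose attribute writes are noted by the owning transaction."""

    def __init__(self, tx, key, **fields):
        # Bypass __setattr__ while wiring up the bookkeeping attributes.
        object.__setattr__(self, "tx", tx)
        object.__setattr__(self, "key", key)
        object.__setattr__(self, "fields", dict(fields))

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for persistent fields.
        fields = object.__getattribute__(self, "fields")
        try:
            return fields[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # An ordinary assignment both changes the field and flags the object.
        self.fields[name] = value
        self.tx.mark_dirty(self)


store = {}
tx = Transaction(store)
person = Persistent(tx, "person:1", last_name="Jones", age=23)
person.age = 24          # plain assignment; no explicit update call
written = tx.commit()    # the change is written out here
print(store)
```

      The application code never issues an update; the commit decides what to write based on which objects were touched.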

      Some argue that OOP allows one to switch DB API vendors, but that has almost nothing to do with domain development productivity. And the swappable benefit claims only play out in a very *limited* set of circumstances, whose likelihood OOP books exaggerate by far.

      Sorry, but this is a wildly out-of-date view. I have seen a huge boost in productivity, as it means that business logic need not be expressed in the constrained terms of SQL queries and relational tables - it can be expressed in normal OOP code. It allows a huge amount of freedom for developers; it allows them to express what happens to data without concerns of persistence. This saves a lot of time and code.

      As for saying that the swappable benefit only plays out in a very limited set of circumstances - that is not only wrong - it is crazy: Every day that I work on my projects, I develop not caring about what persistence store I use. It so happens that I do most of my development on a laptop using PostgreSQL. I then pass on my code to other developers who use Oracle. This is not trivial code - it is a major project with hundreds of thousands of lines of code, but they only need to change a few lines in a configuration file for my code to work with their database. A few more lines, and it would work with any of a dozen major relational databases, or an XML store or an Object Database. I have no more concern about which database I persist to than the nature of the file system driver that is used when I call a 'File.open' method.

      As for the benefits - they are major. This means that development can be done, including substantial testing, off-site, and without requiring expensive hardware and licenses for specific hardware to run enterprise databases. It means that when my company develops products, the client can choose the database - I am not telling them which database they can use. It means that I can release smaller versions of software that will operate off, say, CSV files in low-memory machines, but without having to change a line of code. And, of course, there is the fantastic freedom from vendor tie-in.

      It sounds to me like you seriously need to get up to date, and realise how things have changed.

    18. Re:Prediction... by Tablizer · · Score: 1

      and this is largely due to OOP. I have personally seen huge productivity benefits in my company resulting from code re-use, both between developers in the company and from the use of external libraries, as a result of OOP.

      I have heard this claim many times. But, I NEVER get to see actual code kicking procedure/relational's butt. Plus, even many OO proponents have changed their tune about "reuse". Reuse has fallen out of favor as a key OO claim. I keep an eye on such opinions. "It helps better organize code" is the most common claim these days. (But "well organized" still appears highly subjective.)

      All I am doing with that query is retrieving a few objects. But then, any changes I make to objects (in normal code) are transparently noted, and saved to the data store automatically.

      Example changes? SQL is doing most of the "real" domain-oriented work. That is where most of the changes will thus happen. You cannot use non-trivial SQL on flat files, XML, etc.

      The key phrase you need to look up is 'transparent persistence'.

      RDBMS are NOT about persistence. Persistence is largely a side benefit. Features/Abilities of RDBMS include:

      * Persistence
      * Query languages or query ability (standardized collection-handling idioms)
      * metadata repository
      * Multi-user contention management and concurrency (locks, transactions, rollbacks, etc.)
      * Backup and replication of data
      * Access security
      * Data computation/processing (such as aggregation and cross-referencing)
      * Data rule enforcement or validation
      * Data export and import utilities
      * Multi-language and multi-application data sharing
      * Data change and access logging
      * Automated "search path" optimization (user focuses on what, not on how)
      * "Noun modeling"

      It so happens that I do most of my development on a laptop using PostgreSQL. I then pass on my code to other developers who use Oracle.

      Procedural can do that also. But most of the difference is in the SQL dialects, not in the DB access API's. I use ODBC for similar such testing also. (Perhaps the guts of ODBC drivers are written in OO, I don't know and don't need to care. I am talking about helping custom apps, not systems software.)

      A few more lines, and it would work with any of a dozen major relational databases, or an XML store or an Object Database.....call a 'File.open' method...

      Bullsh*t! XML and flat files don't understand SQL. Again, most of the "real work" is in the query language, not the access API's. (Unless you are doing something trivial or weird.)

      It means that I can release smaller versions of software that will operate off, say, CSV files in low-memory machines, but without having to change a line of code.

      Perhaps you are one of those OO programmers who uses the RDBMS as merely a dumb file system and reinvents a database in app code. In that case, you would be right. But, I don't like reinventing databases from scratch or using highly non-standard ones that work with only one language, like Java Prevayler, etc.

      It sounds to me like you seriously need to get up to date, and realise how things have changed.

      Sounds like you need to learn what "evidence" means. Brochure-talk does not compile. OO has provided bullsh*t artists with many many new paint brands. You are starting to remind me of one.

    19. Re:Prediction... by Tablizer · · Score: 1

      I suspect that no company uses the specialized DB's for everything. For example, an airline might use a specialized DB for "live" reservations, but copies those reservations to a regular RDBMS for general reporting and marketing research. That way they can use off-the-shelf reporting and pattern analysis tools. A lot of old mainframe systems do this because they don't want to rewrite their primary customer transaction apps (which may use custom DB's or legacy ones like IMS), but want to use RDBMS on the data collected.

    20. Re:Prediction... by Pseudonym · · Score: 1

      YAML is great for structured data that has text in it, such as library catalogues. But it's not as good for data that is, first and foremost, a document, that happens to need some structure.

      If, for example, you mark up all the names of people in a newsfeed, such as <person>Bill Gates</person>, you can then search the newsfeed for all references to Bill Gates the person without falsely getting references to invoices for fence portals.
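
      A small sketch of the difference, using Python's standard XML parser. The feed below is invented; only the person element comes from the example above. Searching on the markup finds the two items about the person, while a plain word search also drags in the invoice.

```python
import xml.etree.ElementTree as ET

# Invented newsfeed fragment for illustration.
feed = """<feed>
  <item><person>Bill Gates</person> spoke at a conference.</item>
  <item>Please pay the bill. Gates and fencing will be invoiced separately.</item>
  <item>Analysts quoted <person>Bill Gates</person> on the merger.</item>
</feed>"""

root = ET.fromstring(feed)
items = list(root.iter("item"))


def mentions_person(item, name):
    """True if the item contains a <person> element with exactly this name."""
    return any(p.text == name for p in item.iter("person"))


def mentions_words(item, words):
    """Naive full-text match: every word appears somewhere in the item."""
    text = "".join(item.itertext()).lower()
    return all(w in text for w in words)


tagged = [n for n, item in enumerate(items) if mentions_person(item, "Bill Gates")]
naive = [n for n, item in enumerate(items) if mentions_words(item, ["bill", "gates"])]
print(tagged, naive)
```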

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    21. Re:Prediction... by Decaff · · Score: 1

      But, I NEVER get to see actual code kicking procedure/relational's butt.

      You need to look around then.

      Plus, even many OO proponents have changed their tune about "reuse". Reuse has fallen out of favor as a key OO claim. I keep an eye on such opinions.

      Me too - advising on technical matters is a key part of my job. I find re-use is a significant help in projects that I work on, and other aspects of OOP, especially polymorphism, are absolutely key to much of my work. I regularly write GUI apps in Java. Inheritance, polymorphism and re-use are essential for this kind of development - without it, one would be back to the ghastly event-loop and call-back coding of decades ago.

      Procedural can do that also. But most of the difference is in the SQL dialects, not in the DB access API's. I use ODBC for similar such testing also. (Perhaps the guts of ODBC drivers are written in OO, I don't know and don't need to care. I am talking about helping custom apps, not systems software.)

      Again, what's with the SQL? I am talking about portable query languages.

      Example changes? SQL is doing most of the "real" domain-oriented work. That is where most of the changes will thus happen. You cannot use non-trivial SQL on flat files, XML, etc.

      Any change you can think of.

      Where did I mention SQL? Why are you fixated on SQL? I am talking about a rich portable query language (which SQL isn't) that is mapped to SQL on systems that use SQL. Look up JDOQL or HQL.

      SQL isn't doing anything - it is simply a way of expressing what should be done. There are other ways.

      And, also, you REALLY need to get up to date. There have been SQL-based systems for a range of file formats - CSV, XML, flat, Excel etc. for years, and not trivial SQL either.

      RDBMS are NOT about persistence. Persistence is largely a side benefit. Features/Abilities of RDBMS include:

      Yes, of course I know all this, which is why RDBMSes are vital for almost all that I do. All of those facilities are available through the ORM when it is running on an RDBMS. Obviously, you don't get them when persisting to flat files. The point is not to have to re-invent a new API for persisting to flat files.

      Perhaps you are one of those OO programmers who uses the RDBMS as merely a dumb file system and reinvents a database in app code. In that case, you would be right. But, I don't like reinventing databases from scratch or using highly non-standard ones that work with only one language, like Java Prevayler, etc.

      I am not re-inventing anything. What happens is that I have full transactional and multi-user facilities when I need them, using the same API. I certainly don't use a database just as a dumb file system - that is a crazy waste of resources and the power of such systems.

      Could I ask that you stop setting up straw man arguments, and actually find out about this before you post again?

      Bullsh*t! XML and flat files don't understand SQL. Again, most of the "real work" is in the query language, not the access API's. (Unless you are doing something trivial or weird.)

      Wrong. The work happens in the implementation of the access APIs. If I use a JDO 2.0 implementation that persists to XML files, that implementation will handle the query language. And to say that the real work happens in the query language means you just don't understand how this works. You need to get up-to-date and find out about "transparent persistence".

      Sounds like you need to learn what "evidence" means. Brochure-talk does not compile. OO has provided bullsh*t artists with many many new paint brands. You are starting to remind me of one.

      Nothing like a good ad hominem attack.

      This isn't brochure talk - it is based on real projects and real experience of thousands of developers.

    22. Re:Prediction... by Tablizer · · Score: 1

      You need to look around then.

      I've been searching for 8+ years for code-proof that OOP is better. Never found any, at least for biz apps. The samples provided by others had flaws or flawed assumptions about change frequency, bad languages (C vs. C++), or they were just poor procedural programmers.

      without it, one would be back to the ghastly event-loop and call-back coding of decades ago.

      All GUI programs need an event-loop of some kind. It is just that the guts of the frameworks can hide such so that app developers are not dealing directly with it. This is true of procedural also. Event-driven GUI systems are perfectly at home with procedural. (If C cannot do it, it is a fault of C, not procedural.)

      Again, what's with the SQL? I am talking about portable query languages.

      You are not clear what you are talking about. You keep wavering between talking about DB access API's and a query language itself. Wrapping a file system in a DB API does not give you a query language unless somebody reinvents one on top of the file system under the covers. File systems have no inherent query language, so why are you talking about switching from SQL to a file system? It makes no sense minus a translator.

      Wrong. The work happens in the implementation of the access APIs. If I use a JDO 2.0 implementation that persists to XML files, that implementation will handle the query language. And to say that the real work happens in the query language means you just don't understand how this works. You need to get up-to-date and find out about "transparent persistence".

      You are not being clear. Your SQL will NOT work on XML unless you have an SQL parser that treats the XML like relational tables. Java kits may provide that, I don't know. But again this is no different than ODBC drivers. If one has an ODBC driver that allows one to query XML files via an SQL interface, then swapping can take place. (The advantage is that ODBC is not tied to one language such that one can use Fortran or Perl with the same ODBC driver. This keeps one from having to reinvent the same wheel for each app language.)

      This technology predates Java. You just think it is new.

      Nothing like a good ad hominem attack.

      I echoed your own insult style to me for that. You don't recognize your own attacks as "bad" unless they are mirrored back to you. Telling.

      This isn't brochure talk - it is based on real projects and real experience of thousands of developers.

      I mean that anecdotes are nearly useless. Anecdotes can be found to "prove" anything. What I want to see is OOP *code* kicking procedural's butt (from good procedural programmers). Present the OO version and the procedural version and describe exactly why it is better. Provide change scenarios and your code impact line/token change counts if need be. Just be clear how you are measuring betterment. "I'm more comfortable with OO" is not objective. It is not usable for such debates. I wanna see science, not personal psychology.

    23. Re:Prediction... by Decaff · · Score: 1

      I've been searching for 8+ years for code-proof that OOP is better. Never found any, at least for biz apps. The samples provided by others had flaws or flawed assumptions about change frequency, bad languages (C vs. C++), or they were just poor procedural programmers.

      Well, that is fair enough, but I have no response but to say that I use OOP in almost everything I write and I find it invaluable.

      All GUI programs need an event-loop of some kind. It is just that the guts of the frameworks can hide such so that app developers are not dealing directly with it. This is true of procedural also. Event-driven GUI systems are perfectly at home with procedural. (If C cannot do it, it is a fault of C, not procedural.)

      No, it isn't the same for procedural at all. For example, in Swing, I create a subclass of JComponent, and I get all the features of that class. I can override any particular method to add my own behaviour. You can implement this procedurally, but it simply isn't a clean and natural way to do this. This is not a new point of view - this was a major discovery in the 70s.

      You are not clear what you are talking about. You keep wavering between talking about DB access API's and a query language itself. Wrapping a file system in a DB API does not give you a query language unless somebody reinvents one on top of the file system under the covers. File systems have no inherent query language, so why are you talking about switching from SQL to a file system? It makes no sense minus a translator.

      I am not talking about switching between SQL and a file system. Let me try and explain again. You can get part of an object tree via a portable query language. Then, additional accesses to objects generate queries to transparently retrieve additional data.

      Suppose I have an object called 'Customer', and that is linked to 'Address', then all I need do with querying is retrieve the customer. When I do:

      customer.getAddress()

      then an additional query is run to retrieve the address, transparently and in the background.

      Of course, if this is inefficient, you can configure the system to always fetch the addresses along with the customers. Then, efficient and optimised SQL for this will be automatically generated.
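
      To make the lazy-versus-eager distinction concrete, here is a toy sketch in Python. It is not JDO: FakeStore, Customer and the two fetch functions are hypothetical stand-ins, and the store simply counts how many queries it is asked to run.

```python
class FakeStore:
    """Stand-in data store that counts how many queries it receives."""

    def __init__(self):
        self.customers = {1: "Ann", 2: "Bob"}
        self.addresses = {1: "1 High St", 2: "2 Low Rd"}
        self.queries = 0

    def load_customers(self):
        self.queries += 1
        return dict(self.customers)

    def load_address(self, customer_id):
        self.queries += 1
        return self.addresses[customer_id]

    def load_customers_with_addresses(self):
        self.queries += 1  # one combined query, as a join would be
        return {cid: (name, self.addresses[cid])
                for cid, name in self.customers.items()}


class Customer:
    def __init__(self, store, customer_id, name, address=None):
        self._store = store
        self.customer_id = customer_id
        self.name = name
        self._address = address

    def get_address(self):
        # Lazy: the first call triggers a query; later calls reuse the result.
        if self._address is None:
            self._address = self._store.load_address(self.customer_id)
        return self._address


def fetch_lazy(store):
    return [Customer(store, cid, name)
            for cid, name in store.load_customers().items()]


def fetch_eager(store):
    rows = store.load_customers_with_addresses()
    return [Customer(store, cid, name, addr)
            for cid, (name, addr) in rows.items()]


lazy_store = FakeStore()
lazy_addresses = [c.get_address() for c in fetch_lazy(lazy_store)]

eager_store = FakeStore()
eager_addresses = [c.get_address() for c in fetch_eager(eager_store)]

print(lazy_store.queries, eager_store.queries)
```

      The calling code is identical in both cases; only the fetch configuration changes how many round trips are made.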

      Also, what you say about wrapping a file system in the API is a valid point. What happens is the following: if someone implements the JDO 2.0 API for a file system, they have to implement functionality so that all of the JDO 2.0 query language works in their product. If I write a JDO query to fetch a customer and associated addresses with a given condition, that query will work on ANY JDO implementation, no matter what the underlying data store.

      It is not a case of re-implementing the query language - the query language (unlike SQL) is fixed and guaranteed standard. It is a case of implementing a persistence engine that understands the query language.

      You are not being clear. Your SQL will NOT work on XML unless you have an SQL parser that treats the XML like relational tables.

      Indeed.

      Java kits may provide that, I don't know.

      They do.

      But again this is no different than ODBC drivers.

      Correct, but I am not talking about JDBC or ODBC - I am talking about a far higher level of abstraction (see below).

      If one has an ODBC driver that allows one to query XML files via an SQL interface, then swapping can take place. (The advantage is that ODBC is not tied to one language such that one can use Fortran or Perl with the same ODBC driver. This keeps one from having to reinvent the same wheel for each app language.)

      I was discussing this in the context of a Java API because it does hugely more than JDBC and ODBC.

      JDBC and ODBC aren't good abstractions, and don't provide much portability. All they provide (to simplify) is a common interface for querying what a particular storage system can do, and a way to pass SQL to those storage systems, and fetch back the results.

    24. Re:Prediction... by Tablizer · · Score: 1

      but I have no response but to say that I use OOP in almost everything I write and I find it invaluable.

      Yes, but you have not given me enough info to select among these possibilities:

      1. OOP is merely a personal preference (it fits your mind better).
      2. You are not good at procedural/relational (p/r) techniques, and thus turn to OOP.
      3. OOP is objectively better
      4. The features of some product that you like are not really about OOP.

      create a subclass of JComponent, and I get all the features of that class. I can override any particular method to add my own behaviour. You can implement this procedurally, but it simply isn't a clean and natural way to do this.

      This assumes that variations fit and change in a hierarchical way. In the real world they usually do *not* fit or change tree-wise in my observation. Set theory is a superior "variation management technique" than OOP's tree-based subclassing. (Yes, there is multiple inheritance, but it is messier than formal set usage.) And, relational better fits this set-oriented view. There is no "tree cop" to force domain things to fit a tree. Sets are simply more flexible than trees. Trees are impossible to normalise when you get more than 3 or 4 factors of variation.

      Suppose I have an object called 'Customer', and that is linked to 'Address', then all I need do with querying is retrieve the customer. When I do: "customer.getAddress()" then an additional query is run to retrieve the address, transparently and in the background.

      Generally in p/r you do a JOIN to bring in the address information for a given task if you need it. It is also usually more efficient over a network than to grab one per loop iteration. (In biz apps, network bandwidth is often a bigger bottleneck than CPU.)
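
      A sketch of the JOIN approach, using Python's sqlite3 module with an in-memory database. The table and column names are made up for illustration; the point is that one statement brings back customers together with their addresses, so there is a single round trip rather than one per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data, invented for this example.
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address  (customer_id INTEGER REFERENCES customer(id), city TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO address  VALUES (1, 'Leeds'), (2, 'York');
""")

# One query returns each customer with the matching address.
rows = conn.execute("""
    SELECT c.name, a.city
    FROM customer AS c
    JOIN address AS a ON a.customer_id = c.id
    ORDER BY c.id
""").fetchall()
conn.close()

for name, city in rows:
    print(name, city)
```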

      But *if* one wanted to go your route, nothing stops us from creating something like:

      row = getAddress(customerID); // returns associative array

      Of course, if this is inefficient, you can configure the system to always fetch the addresses along with the customers. Then, efficient and optimised SQL for this will be automatically generated.

      What one wants with a particular task varies such that it usually makes more sense to decide that per task rather than summarily say "all customer retrievals will grab address also". If there are similarities, then one can create a shared function rather than copy-and-paste code (or perhaps a view). However, that should be during a cleanup step, not at the starting design. Most SQL is task-specific in my observation.

      Also, I do create utilities to help generate things like INSERT and UPDATE statements, which are otherwise tedious.

      It is not a case of re-implementing the query language - the query language (unlike SQL) is fixed and guaranteed standard. It is a case of implementing a persistence engine that understands the query language.

      Writing a persistence or access engine that understands (parses) a query language (such as SQL) is NOT something a vast majority of application developers have to concern themselves with. I won't dispute that OOP may make systems software and device drivers easier to make, but this issue will not affect the app coder. I want to see how OO helps biz app writers, not device driver writers. Most pro-OO books seem to love device-driver like examples but shun biz examples. I think I know why.

      JDBC and ODBC aren't good abstractions, and don't provide much portability. All they provide (to simplify) is a common interface for querying what a particular storage system can do, and a way to pass SQL to those storage systems, and fetch back the results.

      I don't think that is true. IINM, Some ODBC drivers implement some form of SQL such that they can read XML, comma-delimited, and spreadsheets via SQL statements because the driver has (or uses) a basic SQL engine. For formal databases, ODBC just passes the SQL along to the RDBMS rather than interpret it

    25. Re:Prediction... by Decaff · · Score: 1

      1. OOP is merely a personal preference (it fits your mind better).
      2. You are not good at procedural/relational (p/r) techniques, and thus turn to OOP.
      3. OOP is objectively better
      4. The features of some product that you like are not really about OOP.


      It is a complex mix. Although I would strongly dispute that one 'turns to OOP' because it is a poor alternative to p/r. OOP has major benefits that are very hard to model in relational systems, such as polymorphism and inheritance.

      This assumes that variations fit and change in a hierarchical way. In the real world they usually do *not* fit or change tree-wise in my observation. Set theory is a "variation management technique" superior to OOP's tree-based subclassing. (Yes, there is multiple inheritance, but it is messier than formal set usage.) And, relational better fits this set-oriented view. There is no "tree cop" to force domain things to fit a tree. Sets are simply more flexible than trees. Trees are impossible to normalise when you get more than 3 or 4 factors of variation.

      Well, over 30 years of research here have shown that you are wrong, and in the areas of GUI development things DO fit well in the real world into single-inheritance tree structures. It is practical and useful.

      It is all very well for you to claim that they don't, but the experience of millions of developers who have found the OOP approach productive suggests you are wrong.

      Anyone who suggested that user interfaces should be based on relational theory would not get much support.

      Generally in p/r you do a JOIN to bring in the address information for a given task if you need it.

      Why even bother to explicitly do a join? That is SO low level. Instead, with JDO, you express a hint to the query generator that things need to be fetched and let it do the work. After all, not all persistence methods have an SQL JOIN.

      I am sorry, but I feel like someone who codes in C trying to talk to someone who only understands assembler, and who is talking about how to optimise register use. My compiler handles all that!

      It is also usually more efficient over a network than to grab one per loop iteration. (In biz apps, network bandwidth is often a bigger bottleneck than CPU.)

      I know this.

      But *if* one wanted to go your route, nothing stops us from creating something like:

      row = getAddress(customerID); // returns associative array


      I believe you are thinking about this at far too low a level - it is almost like an 'assembler' of relational use. If you research further, you will see that the systems I describe will automatically fetch things in an optimised way, with considerations of networks.

      Also, that is so explicit. The point of transparent persistence is you simply don't need to express retrieval in standard business logic.

      I don't think that is true. IINM, Some ODBC drivers implement some form of SQL such that they can read XML, comma-delimited, and spreadsheets via SQL statements because the driver has (or uses) a basic SQL engine. For formal databases, ODBC just passes the SQL along to the RDBMS rather than interpret it itself.

      Yes, I said it was a simple description.

      And procedural-only languages can use ODBC also. This is getting away from application-centric code issues anyhow. Can't OO fans talk about something besides device drivers and swapping DB engines? It is an interesting topic, but gets into nitty details about SQL intercepting/parsing etc. And, it is very product/vendor specific.

      No, on the contrary, it allows vendor-neutrality, and it absolutely isn't product or vendor specific. I am talking about vendor-independent standards.

      What you are talking about is a vendor-neutral version of SQL that the driver translates or cleans up based on vendor etc.

      No, I am not! I am not talking about SQL.

      I see no rule of the universe that

    26. Re:Prediction... by Tablizer · · Score: 1

      OOP has major benefits that are very hard to model in relational systems, such as polymorphism and inheritance.

      Ironically, relational fans (such as I) tend to think that polymorphism and inheritance are usually poor modeling techniques and way overused. Many OO fans have agreed even, recommending composition instead of inheritance. IS-A is out out out of style in OO circles.

      Feature variation management and classification systems are more powerful or at least more manageable done with set theory in my opinion. Sets are more mathy and thus cleaner than the pointer-like spaghetti of OO structures.

      Inher and poly *could* be added to relational databases, but relational fans know better and fortunately resisted pressure.

      Well, over 30 years of research here have shown that you are wrong, and in the areas of GUI development things DO fit well in the real world into single-inheritance tree structures. It is practical and useful.

      Research? Where? And, I would rather have a relational GUI system than an OOP one. Can you objectively prove that an OOP GUI model is superior to a relational one? I agree that OOP GUI engines have been better explored and tested, but that does not mean a relational one is impossible or objectively worse. DOM is a fricken tangled mess. It is a "navigational database" of sorts. Dr. Codd and Charles Bachman fought navigational-versus-relational battles in the 70's and relational is generally considered to have won. That is until OO fans tried to resurrect navigational concepts. Navigational is modern-day GOTO's.

      The best GUI system would be mostly declarative because that would allow *multiple* languages to use the same GUI engine. OOP has not figured out a way to do that because each OO language is too different from each other. If you can prove otherwise, you will be an OO hero instead of wrong. Declarative is more cross-sharable. Declarative == Sharable.

      It is all very well for you to claim that they don't, but the experience of millions of developers who have found the OOP approach productive suggests you are wrong.

      Where has anybody seriously tried a relational-based GUI to compare? (RDBMS are not really tuned for that kind of work from a performance, typing, and installation standpoint, at least not yet.)

      As far as outside of GUI's, many complain about OOP being overhyped. Most seem to say that OO is good for some things but not others. This may be true, but there is no consensus, suggesting that where to put it is a subjective personal preference. I agreed that paradigms may be subjective; just don't shove OO down the throats of those who don't want it UNLESS you first prove it objectively better. Prove first, shove later.

      Why even bother to explicitly do a join? That is SO low level. Instead, with JDO, you express a hint to the query generator that things need to be fetched and let it do the work. After all, not all persistence methods have an SQL JOIN.

      Again, that is a query language syntax issue, not an OOP issue. (Some dialects support "natural joins" that do just that.) I would like to rework SQL myself, but that is another issue. If you are using a special magic query language that is better than SQL, fine, but that is not an OO victory even if true.

      I believe you are thinking about this at far too low a level - it is almost like an 'assembler' of relational use. If you research further, you will see that the systems I describe will automatically fetch things in an optimised way, with considerations of networks.

      Again, a good many OO fans have lots of complaints and grumbles about OO/relational mappers. I don't think your experience is typical. You are one voice among many. And it does not sound like an OO thing anyhow.

      No, on the contrary, it allows vendor-neutrality, and it absolutely isn't product or vendor specific. I am talking about vendor-independent standards.

      Where are they published?

      Procedural things simply

    27. Re:Prediction... by Decaff · · Score: 1

      Ironically, relational fans (such as I) tend to think that polymorphism and inheritance are usually poor modeling techniques and way overused. Many OO fans have agreed even, recommending composition instead of inheritance. IS-A is out out out of style in OO circles.

      You can keep using the phrase 'many OO fans', but that does not stop these features being of real use. Inheritance and polymorphism are fundamentally different ideas. Inheritance is overused, polymorphism underused.

      The best GUI system would be mostly declarative because that would allow *multiple* languages to use the same GUI engine.

      Multiple languages can already use the same GUI engine on the JVM.

      OOP has not figured out a way to do that because each OO language is too different from each other.

      No, see below.

      If you can prove otherwise, you will be an OO hero instead of wrong. Declarative is more cross-sharable. Declarative == Sharable.

      Then I am an OO hero, as I can easily prove you wrong. Here are some examples of multiple languages using the same GUI engine (Swing on the JVM):

      Java (well, no examples needed there).
      Groovy: http://www.oreillynet.com/onjava/blog/2004/10/gdgroovy_basic_swingbuilder.html
      JRuby: http://www-128.ibm.com/developerworks/java/library/j-alj09084/
      Jython: http://www.uselesspython.com/Jython_Swing_Basics.html
      Scala: http://scala.sygneca.com/code/scalagui

      I can provide many more if you wish.

      Research? Where?

      All the work of Alan Kay, Dan Ingalls, Adele Goldberg and so on.

      Again, that is a query language syntax issue, not an OOP issue. (Some dialects support "natural joins" that do just that.) I would like to rework SQL myself, but that is another issue. If you are using a special magic query language that is better than SQL, fine, but that is not an OO victory even if true.

      Yes, it is, because it is a query language that allows for inheritance, polymorphism and re-use. I can query for, say, a Contact, and I will automatically get subclasses. This allows for transparent extensibility.

      Also, as I said, but you keep ignoring, what you retrieve are instances of classes that have been proxied and advised with additional code to allow transparent persistence. That is an OOP matter.

      That is outright bad biz modeling.

      Sorry, but I can't take you seriously. You are doing nothing more than claiming that anyone who takes a different approach from you is wrong!

      You don't subclass Contact, you reference it. A customer, client, etc, may have multiple contacts. For example, a given company may have a billing contact, a sales contact, and a general (front desk) contact. Even Amazon has multiple contact options per customer. This is what happens when you think in trees instead of sets. I rest my case.

      You haven't made any case. You can certainly model by referencing, but it depends on the complexity. In a situation where you don't have multiple contacts, inheritance is far simpler.

      Dr. Codd and Charles Bachman fought navigational-versus-relational battles in the 70's and relational is generally considered to have won. That is until OO fans tried to resurrect navigational concepts. Navigational is modern-day GOTO's.

      You had better tell CERN that - they use navigational models for the vast majority of their data collection, as relational systems are neither fast enough nor flexible enough.

      Again, a good many OO fans have lots of complaints and grumbles about OO/relational mappers.

      That is true, but I am talking about modern mappers, not the way they were years

    28. Re:Prediction... by Tablizer · · Score: 1

      Inheritance and polymorphism are fundamentally different ideas.

      Actually they are fairly closely related in many ways. But that is another topic.

      Here are some examples of multiple languages using the same GUI engine (Swing on the JVM):

      "Possible" and "convenient" are two different things. It is also possible to run VB GUI's from COBOL or assembler. But it is not pleasant. Declarative frameworks don't depend on behavior as much and behavior is less common between languages. Data is easier to share than direct behavior.

      [Research? Where?] All the work of Alan Kay, Dan Ingalls, Adele Goldberg and so on.

      They just talked wishy-washy speculative philosophy and did not provide solid comparative evidence. OO seems to kill off neurons in one's science lobe.

      Yes, it is, because it is a query language that allows for inheritance, polymorphism and re-use. I can query for, say, a Contact, and I will automatically get subclasses. This allows for transparent extensibility.

      One can make a procedure or view that can bring in related stuff also as already discussed. And it is often not wise to automatically bring in a bunch of crap that may not be related to the task at hand.

      Sorry, but I can't take you seriously. You are doing nothing more than claiming that anyone who takes a different approach from you is wrong!

      I told you why it is wrong: contacts are often not one-to-one per customer, vendor, etc. You don't hard-wire taxonomies into code that are likely to change away from hierarchies. IS-A is too firm a design choice.
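
      To make the "reference it, don't subclass it" argument concrete, here is a minimal sketch of contacts modelled as rows that point at a customer. The schema, the role values and the helper `contacts_for` are all invented for illustration; they are not from any system discussed in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        role TEXT,          -- e.g. 'billing', 'sales', 'front desk'
        email TEXT
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO contact (customer_id, role, email) VALUES
        (1, 'billing', 'billing@acme.example'),
        (1, 'sales', 'sales@acme.example'),
        (1, 'front desk', 'info@acme.example');
""")

def contacts_for(conn, customer_id, role=None):
    """Return (role, email) pairs for a customer, optionally filtered by role."""
    sql = "SELECT role, email FROM contact WHERE customer_id = ?"
    params = [customer_id]
    if role is not None:
        sql += " AND role = ?"
        params.append(role)
    return conn.execute(sql + " ORDER BY id", params).fetchall()

all_contacts = contacts_for(conn, 1)
billing = contacts_for(conn, 1, "billing")
print(all_contacts)
print(billing)
```

      Adding a fourth kind of contact here is a new row rather than a new class, which is the flexibility being claimed for the set-oriented view.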

      I already have, but you have ignored it. Here it is again:

      Customer customer = query.execute(); // Retrieve a customer by query
      Address address = customer.getAddress(); // address retrieved without explicit query


      I see nothing that cannot be done procedurally or by views. They are called "functions". Show me something that functions and relational CANNOT do or do poorly. You claimed OO is objectively better than p/r. I don't care if OO walks your dog if other things can also walk your dog just as well. It runs, but so does assembler.

      No, it isn't out of style - you are just cherry-picking points of view. The 'IS-A out of style' argument was common about a decade ago, but not now.

      So IS-A is back in style again? I'll avoid OO until you guys get it right. Should I keep my disco pants around in case?

      All that happened is that a proportion of people never really understood how to use OOP, and have fallen back to other methods.

      Because OO is f8cked up. Nobody agrees on exactly what it is and how to tell good OO design from bad OO design. People slap classes together with no guiding or verifiable principles. OO produces e-shanty towns.

    29. Re:Prediction... by Decaff · · Score: 1

      Ignoring the personal views, which seem immune to debate, and your hand-waving away of decades of computer science research you don't happen to like...

      One can make a procedure or view that can bring in related stuff also as already discussed. And it is often not wise to automatically bring in a bunch of crap that may not be related to the task at hand.

      No, you can't, not TRANSPARENTLY. And why are you bringing up the straw man argument about 'bringing in a bunch of crap'? For goodness sake, why don't you actually research before posting? If you did, you would know that persistence approaches like JDO allow the use of both lazy loading (only bring in what is needed, when it is needed) and fetch groups (define what is automatically loaded for optimisation).

      I see nothing that cannot be done procedurally or by views. They are called "functions". Show me something that functions and relational CANNOT do or do poorly. You claimed OO is objectively better than p/r. I don't care if OO walks your dog if other things can also walk your dog just as well. It runs, but so does assembler.

      You are completely missing the whole point. Let me try one last time.

      I use a method call, and IN THE BACKGROUND, WITHOUT ME HAVING TO ADD A LINE OF CODE, this is transformed into a query.

      This means I can apply persistence even to legacy code, or to code written by others who don't want to have to consider persistence.

      And the advantage of OOP is that this capability is inheritable - it is automatically applied to subclasses.

      You simply can't do this transparent persistence procedurally - it requires OOP specific features, including transparent proxy generation (production of automatic subclasses).
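
      For readers unfamiliar with the term, a hand-written sketch of what lazy loading behind an ordinary attribute access looks like. This is not JDO and not generated code; the classes, the `queries_run` counter and the schema are assumptions made up for the example. The tools described above are said to produce this kind of plumbing automatically rather than having you write it.

```python
import sqlite3

class Customer:
    """Hypothetical persistent class; the address is loaded on first access."""

    def __init__(self, conn, cust_id, name):
        self._conn = conn
        self.id = cust_id
        self.name = name
        self._address = None
        self.queries_run = 0  # kept only to make the laziness visible

    @property
    def address(self):
        # The caller just reads customer.address; the query is issued here,
        # once, and the result is cached for later accesses.
        if self._address is None:
            row = self._conn.execute(
                "SELECT city FROM address WHERE customer_id = ?", (self.id,)
            ).fetchone()
            self.queries_run += 1
            self._address = row[0]
        return self._address

class PremiumCustomer(Customer):
    """A subclass inherits the lazy-loading behaviour without extra code."""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (customer_id INTEGER, city TEXT);
    INSERT INTO address VALUES (1, 'Oslo');
""")

customer = PremiumCustomer(conn, 1, "Ann")
before = customer.queries_run   # nothing fetched yet
first = customer.address        # triggers the query
second = customer.address       # served from the cached value
print(before, first, second, customer.queries_run)
```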

      It is this transparency that makes this objectively better than a procedural approach.

      Even if you are going to do nothing but code in a procedural way, it is worth wrapping code in a class to get this power.

      So IS-A is back in style again? I'll avoid OO until you guys get it right. Should I keep my disco pants around in case?

      IS-A is absolutely back in style. If you kept up to date, you would realise that it is the foundation of some of the most exciting new systems, like Rails, Hibernate, JPA, Spring.

      Because OO is f8cked up. Nobody agrees on exactly what it is and how to tell good OO design from bad OO design. People slap classes together with no guiding or verifiable principles. OO produces e-shanty towns.

      Only when people who haven't a clue use it. There are far too many of those in IT. Relational and procedural approaches can be equally f_cked up - I have seen that so many times.

      You seem like an extremely experienced and skilled person, but I don't think you are doing yourself any favours by not keeping up with things. At least find out about transparent persistence - your comments show you haven't any idea how it works. If you are going to reject something, you need to understand it.

      And also, you never answered my question - how long would it take YOU to transfer a several hundred thousand line project with hundreds of tables between different vendors' RDBMSes?

    30. Re:Prediction... by Tablizer · · Score: 1
      No, you can't, not TRANSPARENTLY.

      What do you mean by "transparently"?

      If you did, you would know that persistence approaches like JDO allow the use of both lazy loading (only bring in what is needed, when it is needed)

      I've heard horror stories about stuff getting out of sync because of this. I don't really see the necessity of such anyhow. In procedural designs one tends to do one task/event at a time. You get what is needed for such task, process it, and the task ends. OO needs crap like lazy loading because it tends to build copies of the database in RAM and spends a lot of fiddle-time trying to keep that copy mirroring the DB properly. Your tool is actually reinventing an OODBMS in RAM and trying to keep it in sync with the RDBMS. Sure, if you reinvent a DB then you have only a single interface for queries etc. But it is very nearly replacing the DB rather than wrapping it. It is a hat trick.

      I use a method call, and IN THE BACKGROUND, WITHOUT ME HAVING TO ADD A LINE OF CODE, this is transformed into a query.

      I've seen similar things using data/schema dictionaries. It still takes "configuration" info to be put somewhere. It does not read minds. You seem to not understand what comes about because of OO and what comes about because somebody automated something and put it into a library API.

      And the advantage of OOP is that this capability is inheritable - it is automatically applied to subclasses.

      This is called "default behavior", which OO did not invent. But it only works on simpler things. More complex variation-management works better with sets rather than inheritance hierarchies. I stand by that, at least for my head and most relational fans. Compared to sets, OO is primitive and ugly.

      This means I can apply persistence even to legacy code

      Databases are not about (just) persistence. You are only using it for persistence. But that is not the same as being for persistence.

      Only when people who haven't a clue use it. There are far too many of those in IT. Relational and procedural approaches can be equally f_cked up - I have seen that so many times.

      I'll take a p/r mess over an OO mess any day. Because p/r is more consistent from shop-to-shop, the messes are also more consistent.

      You seem like an extremely experienced and skilled person, but I don't think you are doing yourself any favours by not keeping up with things.

      OO is warmed-over navigational from the 1960's. It is you who is relearning generation-old lessons the hard way. If and when OO can offer public-inspectable proof that it is objectively better, I'll revisit it. Instead we see a jillion anecdotes (many not good) and zero evidence.

      And also, you never answered my question - how long would it take YOU to transfer a several hundred thousand line project with hundreds of tables between different vendors' RDBMSes?

      You haven't answered MY question: Can *only* OO make an intermediate query language that is translated to vendor-specific SQL in the API library? I will agree that there may not be many existing procedural tools that do such. But that is not the same as not being possible. OO is in style right now so it gets all the funding. Generally most companies do not switch DB vendors that often because it requires using a lowest-common-denominator of DB abilities.
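
      As a small illustration that dialect translation can be written as a plain function, here is a sketch that renders a row-limited SELECT for a few engines. The name `render_select` and the dialect labels are invented for this example, identifiers are assumed to be trusted, and a real mapper would have to cover far more than row limits.

```python
import sqlite3

def render_select(table, columns, limit, dialect):
    """Render a row-limited SELECT for a handful of SQL dialects."""
    cols = ", ".join(columns)
    n = int(limit)
    if dialect in ("postgresql", "mysql", "sqlite"):
        return f"SELECT {cols} FROM {table} LIMIT {n}"
    if dialect == "sqlserver":
        return f"SELECT TOP {n} {cols} FROM {table}"
    if dialect == "oracle":
        # Classic ROWNUM idiom used by older Oracle releases
        return f"SELECT {cols} FROM {table} WHERE ROWNUM <= {n}"
    raise ValueError(f"unsupported dialect: {dialect}")

sqlite_sql = render_select("customer", ["name"], 2, "sqlite")
sqlserver_sql = render_select("customer", ["name"], 2, "sqlserver")
oracle_sql = render_select("customer", ["name"], 2, "oracle")

# Only the SQLite rendering can be executed here; the others are just strings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (name TEXT);
    INSERT INTO customer VALUES ('Ann'), ('Bob'), ('Cy');
""")
rows = conn.execute(sqlite_sql).fetchall()
print(sqlite_sql)
print(sqlserver_sql)
print(oracle_sql)
print(len(rows))
```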

      In closing, here is a quote from somebody who had problems with such OO-DB tools:


      Here is an anecdote in support of Costin's view. I have participated in some real-life story where TopLink datalayer proved to be a performance bottleneck, and when replaced by direct SQL over JDBC, the throughput on the same box increased by 2 (two) orders of magnitude. Clarity and simplicity of code also improved significantly.

      Typically, those were simple selects joining 3 or 4 tables, but the server was supposed to do it N hundred times per second. This was kind of a new service, and client did not realize that it will be so popular, so the shit hit the fan in production time. Boy was it interesting... See also LeakyAbstraction. -- Alexey Verkhovsky [from c2-dot-com]


    31. Re:Prediction... by Decaff · · Score: 1

      What do you mean by "transparently"?

      Why don't you bother to research before posting?

      At this point I am afraid I give up! I have given you ample opportunity to get up to date. I have given you references you can read to try and understand this. You have refused to. You are so sure of your 70s attitude to persistence that you simply can't be bothered to find out anything new. Well, good luck to you.

    32. Re:Prediction... by Tablizer · · Score: 1

      I already understand it in a general sense. I've seen similar such tools in action. They are half-ass OODBMS that try to wrap over RDBMS because OO'ers hate RDBMS due to indoctrination and ignorance. Seen it multiple times. What I want to see is betterment that ONLY oop can deliver. You keep talking about what your OO-RD tool does, but have not shown anything that proves those features are something OO has a monopoly on.

      Regarding being out of date, again you are the one out-of-date, not me. This war already happened in the 70's and sets and relational kicked navigational's arse. Sticking a different name on such technology and adding some minor decorations like methods did not fix the fundamental problems. OOP is still navigational and still sucks because of that.

    33. Re:Prediction... by Tablizer · · Score: 1

      (Addendum)

      Here are two more links with yet more evidence that O/R mappers are very far from problem-free:

      http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

      http://www.codinghorror.com/mtype/mt-comments-renamed.cgi?entry_id=621

      Basically, these tools are becoming more complicated than the very database they are trying to hide/wrap! Keeping app developers inside an OO view of things is becoming more important than developer productivity and simplicity. It smells of W-style "stay the cOOurse" zealotry.

  6. No specifics by PlatinumRiver · · Score: 1

    I was hoping the article would mention results for specific relational databases (Oracle, PostgreSQL) versus specialized ones.

    1. Re:No specifics by dedrop · · Score: 2, Interesting

      There's a reason for that. Many years ago, the Wisconsin database group (David DeWitt in particular) authored one of the first popular database benchmarks, the Wisconsin benchmarks. They showed that some databases performed embarrassingly poorly, which made a lot of people really angry. In fact, Larry Ellison got so angry, he tried to get DeWitt fired (Ellison wasn't clear on the concept of tenure). Since then, major databases have a "DeWitt clause" in their end-user license, which says that the name of the database can't be used when reporting benchmark results.

      And this was years ahead of Microsoft not allowing users to benchmark Vista at all!

      --
      Don't wrestle with pigs; you'll both get muddy, but the pig likes it.
  7. one size fits 90% by JanneM · · Score: 5, Insightful

    It's natural to look at the edges of any feature or performance envelope: people who want to store petabytes of particle accelerator data, do complex queries to serve a million webpages a second, or have hundreds of thousands of employees doing concurrent things to the backend.

    But for most uses of databases - or any back-end processing - performance just isn't a factor and hasn't been for years. Enron may have needed a huge data warehouse system; "Icepick Johnny's Bail Bonds and Securities Management" does not. Amazon needs the cutting edge in customer management; "Betty's Healing Crystals Online Shop (Now With 30% More Karma!)" not so much.

    For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.

    --
    Trust the Computer. The Computer is your friend.
    1. Re:one size fits 90% by smilindog2000 · · Score: 1

      This is more true all the time. I work in the EDA industry, in chip design. The database sizes I work with are naturally well correlated with Moore's Law. In effect, I'm a permanent power user, but my circle of peers is shrinking into oblivion...

      --
      Beer is proof that God loves us, and wants us to be happy.
    2. Re:one size fits 90% by TubeSteak · · Score: 1, Insightful
      For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.
      I'm willing to concede that...
      But IMO it is not 100% relevant.

      Large corporate customers usually have a large effect on what features show up in the next version of [software]. Software companies put a lot of time & effort into pleasing their large accounts.

      And since performance isn't a factor for the majority of users, they won't really be affected by any performance losses resulting from increased specialization/optimizations. Right?
      --
      [Fuck Beta]
      o0t!
    3. Re:one size fits 90% by NotZed · · Score: 1

      Yeah, it's only 90% relevant.

      Another stupid article making unfounded and unfoundable claims. 30+ years of database design isn't totally wrong all of the time - it's even pretty good most of the time. Why do people write these stupid stories, and why do people bother to give them any weight?

      --
      _ // `Thinking is an exercise to which all too few brains
      \\/ are accustomed' - First Lensman
    4. Re:one size fits 90% by Threni · · Score: 1

      You're missing the point. There are more companies who make up the 90% for whom there's already enough power in modern databases, so there's not much need to spend a lot of effort/time/money making things faster for a minority of users.

    5. Re:one size fits 90% by Bozdune · · Score: 1

      Then why do we need specialized OLAP systems like Essbase, Kx Systems, etc.? So much for OSFA (one size fits all). Any transaction-oriented database of sufficient size, requiring multi-way joins between tables, and requiring sub-second response times to queries, is way out of range of OSFA. Furthermore, it doesn't require petabytes to take a relational database system to its knees. Just a few million transactions, and your DBMS will be on its back waving its arms feebly, along with your server.

      Performance IS a factor, a very serious factor indeed, for many applications. Not for Betty or for Icepick Johnny, to be sure; but for almost any business with more than about $200M in sales, I guarantee there's a dataset kicking around that will require specialized tools to analyze properly. Since those specialized tools are typically expensive, and typically difficult to use, that dataset will not get analyzed properly, and the business will be "running blind."

    6. Re:one size fits 90% by Hoi+Polloi · · Score: 1

      Big databases appear as soon as you want to analyze performance or sales trends, especially in the financial and retail markets. Now you start accumulating large amounts of historical data that need to be processed on a daily basis.

      --
      It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  8. Imagine that.... by NerveGas · · Score: 4, Insightful

    ... a database mechanism particularly written for the task at hand will beat a generic one. Who would have thought?

    steve

    (+1 Sarcastic)

    --
    Oh, you're not stuck, you're just unable to let go of the onion rings.
    1. Re:Imagine that.... by dedrop · · Score: 1

      It's the beating by an order of magnitude in non-esoteric tasks that's the point. It means there's user pain out there that can be addressed, which means there's a niche market where money can be made.

      --
      Don't wrestle with pigs; you'll both get muddy, but the pig likes it.
  9. Dammit by AKAImBatman · · Score: 4, Insightful

    I was just thinking about writing an article on the same issue.

    The problem I've noticed is that too many applications are becoming specialized in ways that are not handled well by traditional databases. The key example of this is forum software. Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system. Yet we keep trying to cram them into SQL databases, then get surprised when we're hit with performance problems and security issues. It's simply the wrong way to go about solving the problem.

    As anyone with a compsci degree or equivalent experience can tell you, creating a custom database is not that hard. In the past it made sense to go with off-the-shelf databases because they were more flexible and robust. But now that modern technology is causing us to fight with the databases just to get the job done, the time saved from generic databases is starting to look like a wash. We might as well go back to custom databases (or database platforms like BerkeleyDB) for these specialized needs.

    1. Re:Dammit by Tablizer · · Score: 1

      The key example of this is forum software. Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system. Yet we keep trying to cram them into SQL databases, then get surprised when we're hit with performance problems and security issues. It's simply the wrong way to go about solving the problem.

      But is this actually happening? Has slashdot had to abandon general-purpose RDBMS? Not all slashdot display orders are hierarchical anyhow.

    2. Re:Dammit by modmans2ndcoming · · Score: 1

      soooooo... you set up the code that deals with comments to access a hierarchical DB and everything else on a sql DB.

    3. Re:Dammit by AKAImBatman · · Score: 2, Insightful
      But is this actually happening? Has slashdot had to abandon general-purpose RDBMS?

      I wasn't referring to Slashdot in particular, but rather general web forum software. Your PhpBB, vBulletins, and JForums of the world are more along the lines of what I'm referring to. After dealing with the frustrations of setting up, managing, and hacking projects like these, I've come to the conclusion that the backend datastore is the problem. The relational theories still hold true, but the SQL database implementations simply aren't built with CLOBs and BLOBs in mind.

      That being said, Slashdot is a fairly good example of how they've worked around the limitations of their backend database at a cost equalling or far exceeding the cost of building a customized data store. A costly venture that bit them in the rear when they reached their maximum post count.

      Not that I'm criticizing Slashcode. Hindsight is 20/20. It's just becoming more and more apparent that for some applications the cost of using an off-the-shelf database has become greater than the cost of building a custom datastore.
    4. Re:Dammit by Jason+Earl · · Score: 2, Funny

      Eventually the folks working on web forums will realize that they are just recreating NNTP and move on to something else.

    5. Re:Dammit by Tablizer · · Score: 2, Insightful
      The relational theories still hold true, but the SQL database implementations simply aren't built with CLOBs and BLOBs in mind.

      That is very true. They don't seem to have perfected the performance handling of highly variable "cells".

      That being said, Slashdot is a fairly good example of how they've worked around the limitations of their backend database at a cost equalling or far exceeding the cost of building a customized data store. A costly venture that bit them in the rear

      Last night we crossed over 16,777,216 comments in the database. The wise amongst you might note that this number is 2^24, or in MySQLese an unsigned mediumint. Unfortunately, like 5 years ago we changed our primary keys in the comment table to unsigned int (32 bits, or 4.1 billion) but neglected to change the index that handles parents.
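
      The arithmetic in that announcement can be checked with a small Python sketch. The helper names max_unsigned and fits are made up for illustration; the widths are MySQL's documented sizes for unsigned MEDIUMINT (24 bits) and unsigned INT (32 bits). Note that 2^32 is exactly 4,294,967,296.

```python
def max_unsigned(bits: int) -> int:
    """Largest value an unsigned integer column of the given width can hold."""
    return 2**bits - 1


def fits(value: int, bits: int) -> bool:
    """True if value can be stored in an unsigned column of that width."""
    return 0 <= value <= max_unsigned(bits)


comment_count = 16_777_216          # the count quoted above, i.e. 2**24

print(fits(comment_count - 1, 24))  # the last id a 24-bit column accepts
print(fits(comment_count, 24))      # one past the 24-bit limit
print(fits(comment_count, 32))      # a 32-bit column still has room
print(max_unsigned(32) + 1)         # number of distinct 32-bit values
```

      The point of the sketch is only that the failure sits at one specific id: every comment up to 16,777,215 fits the narrower column, and the next one does not.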


      It would be nice if more RDBMS offered flexible integers such that you didn't have to pick a size up front. Fixed sizes (small-int,int,long) are from the era where variable-sized column calculations were too expensive CPU-wise. Since then CPU is cheap compared to "pipeline" issues such that variable columns are just as efficient as fixed ones, but only take the space they need.

      But it would not have been hard for slashdot to use a big integer up-front. They chose to be stingy and made a gamble; it was not forced on them. It may have cost a few cents more early, but would have prevented that disaster. Plus, bleep happens no matter what technology you use. I am sure dedicated-purpose databases have their own gotchas and trade-off decision points. Being dedicated probably means they are less road-tested also.
    6. Re:Dammit by Jerf · · Score: 1
      Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system.
      Actually, I was bitching about this very problem (and some others) recently, when I came upon this article about recursive queries on the programming reddit.

      Recursive queries would totally, completely solve the "hierarchy" part of the problem, and halfway decent database design would handle the rest.
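
      For anyone who hasn't seen one, here is a minimal sketch of a recursive query, run through Python's sqlite3 module (the SQLite bundled with current Python releases accepts WITH RECURSIVE). The comments table, its columns, and the fetch_thread helper are all hypothetical, invented for this example rather than taken from any real forum schema.

```python
import sqlite3

# Hypothetical schema: each comment stores the id of its parent
# (NULL for a top-level comment).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES comments(id),
        body      TEXT NOT NULL
    );
    INSERT INTO comments VALUES
        (1, NULL, 'Dammit'),
        (2, 1,    'Re: Dammit (first reply)'),
        (3, 1,    'Re: Dammit (second reply)'),
        (4, 2,    'Re: Re: Dammit'),
        (5, NULL, 'Duh');
""")


def fetch_thread(root_id):
    """Return (id, depth, body) for root_id and all of its descendants."""
    return conn.execute("""
        WITH RECURSIVE thread(id, depth, body) AS (
            -- anchor row: the comment we start from
            SELECT id, 0, body FROM comments WHERE id = ?
            UNION ALL
            -- recursive step: children of rows already in the result
            SELECT c.id, t.depth + 1, c.body
            FROM comments AS c
            JOIN thread AS t ON c.parent_id = t.id
        )
        SELECT id, depth, body FROM thread ORDER BY id
    """, (root_id,)).fetchall()


for comment_id, depth, body in fetch_thread(1):
    print("  " * depth + f"{comment_id}: {body}")
```

      One query pulls the whole subtree along with each row's depth, so the application does not have to issue one query per level of replies.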

      My theory is that nobody realizes that recursive queries would solve their problems, so nobody asks for them, so nobody ever discovers them, so nobody ever realizes that recursive queries would solve their problem. I don't know of an open source DB that has this, and I'd certainly never seen this in my many years of working with SQL. I wish we did have it, it would solve so many of my problems.

      Now, if we could just deal with the problem of having a key that could relate to any one of several tables in some reasonable way... that's the other problem I keep hitting over and over again.
    7. Re:Dammit by dcam · · Score: 1

      ...the SQL database implementations simply aren't built with CLOBs and BLOBs in mind.

      This is extremely true.

      I work on a web application that stores a lot of documents (one of our clients stores over 50 GB). The database back end is SQL Server (yeah I know). When it was designed (~8 years ago) we decided to store the documents in the filesystem and store the paths in the database. This was largely for performance reasons, although some other considerations were the size of database backups and general db management. It was anticipated that in the future we would move the documents into the db when performance improved sufficiently. It hasn't.

      According to Inside SQL Server 2000, all data in SQL Server is stored on 8K pages in B trees. BLOBs and CLOBs are broken up into 8K chunks. Performance on reading and writing this data is obviously not fantastic, particularly when you have largish files (we have files that are over 100 MB; the average file size would be ~2 MB). In addition the tools in SQL Server for adding and retrieving BLOBs are a major headache.

      SQL Server is not designed for BLOBs. I can't comment on other relational databases, but I suspect that they would suffer similar issues.

      --
      meh
    8. Re:Dammit by newt0311 · · Score: 1

      Bah. Limited experience. Postgres has very easy methods for handling files (called large objects). The basic methods are the I/O methods: one takes a file location and returns an OID, and the other takes a path and an OID and extracts the large object to the specified path. There are also methods for searching. In fact, searching and retrieving parts of the file from inside the db is actually slightly faster because the db is very good at optimizing disk access.

    9. Re:Dammit by Anonymous Coward · · Score: 0

      Now, if we could just deal with the problem of having a key that could relate to any one of several tables in some reasonable way... that's the other problem I keep hitting over and over again.

      In other words, distributed foreign keys. This has been discussed at some length by Chris Date and other people who work with the relational model. It's a pretty basic constraint, yet no SQL database seems to implement it.

      A quick Google turns up a hopeful PostgreSQL discussion, but it quickly turns to PostgreSQL's "table inheritance" feature which is a very flawed idea.

      I'd love to see this implemented in a mainstream database, I think about 3 out of every 4 apps I've written needed this.

    10. Re:Dammit by dcam · · Score: 1

      Comments like this one would suggest that others have different experience. I've been hunting for details on the storage mechanisms of pgsql to try to work out whether it might be faster but no luck so far.

      --
      meh
    11. Re:Dammit by a_ghostwheel · · Score: 1

      Or just use hierarchical queries - like START WITH / CONNECT BY clauses in Oracle. Probably other vendors have something similar too - not sure about that.

    12. Re:Dammit by Imsdal · · Score: 1
      My theory is that nobody realizes that recursive queries would solve their problems, so nobody asks for them, so nobody ever discovers them, so nobody ever realizes that recursive queries would solve their problem.


      It used to be that execution plans in Oracle were retrieved from the plan table via a recursive query. Since even the tiniest application will need a minimum amount of tuning, and since all db tuning should start by looking at the execution plans, everyone should have run into recursive queries sooner rather than later.


      My theory is instead that too few developers are properly trained. They simply don't know what they are doing or how it should be done. During my years as a consultant, I spent a lot of time improving db performance, and never even once did I run into in-house people who even knew what an execution plan was, let alone how to interpret it. (And, to be honest, not all of my consultant colleagues knew either...)


      Software development is a job that requires the training of a surgeon, but it's staffed by people who are trained to be janitors or, worse, economists. (I realise that this isn't true at all for the /. crowd. I'm talking about all the others all of us have run into on every job we have had.)

    13. Re:Dammit by CaymanIslandCarpedie · · Score: 1

      The handling of BLOBs isn't really all that bad anymore in many RDBMSs. Now certainly specialized DBs for this could do better I'm sure, but the old maxims about "never store BLOBs in a DB" don't really hold anymore. Since you mention SQL Server, consider that SharePoint Server uses SQL Server as its data store. We have an install of SharePoint with roughly 150 GB of documents and scanned archival PDFs with over 100 users accessing those documents pretty much continuously (not all 100 users but it is very actively used). Opening documents in SharePoint is not (noticeably to a human at least) any slower than opening the documents from a network file share.

      Now obviously MS probably had some top-of-the-line DBAs tuning this to get that type of performance. Still, BLOBs no longer seem to be a direct limitation of SQL Server; if others are still having issues with this, the limit is perhaps more the DBAs trying to get the performance out of the system.

      That being said, our current application is only dealing with roughly 100 users on the local LAN. In the next 6 months we will be testing exposing this on the internet to 10s of thousands of users. We'll see if it still holds up ;-)

      --
      "reality has a well-known liberal bias" - Steven Colbert
    14. Re:Dammit by Jerf · · Score: 1
      In other words, distributed foreign keys.
      Ah, thank you for the name. It's difficult to find the name of something based on its description. I tried before posting but just got a lot of stuff about how to use foreign keys.
    15. Re:Dammit by poot_rootbeer · · Score: 1

      The key example of this is forum software. Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system.

      Hierarchical? Yes, but I don't see any problem using SQL to access hierarchical information. It's easy to have parent/child relationships.

      Data of varying sizes? I thought this problem was solved 20 years ago when ANSI adopted a SQL standard including a VARCHAR datatype.

      Full of binary blobs? Why? What in the hell for? So that each user can have an obnormous enoxious "signature banner" graphic that readers have to look at 20 times in any given thread?

      There's very little data that belongs in a forum interface that can't be represented in plaintext. For the rest, store it on the filesystem and just store a reference to it in the database.

      As anyone with a compsci degree or equivalent experience can tell you, creating a custom database is not that hard.

      And as anyone who has ever done software development in the real world can tell you, custom components almost always suck worse than similar standard components.

    16. Re:Dammit by psocccer · · Score: 1

      Well, comments and replies naturally lend themselves to a tree, and the obvious way to store them is with a self-referential parentid pointing back into the same table. In practice this becomes difficult for exactly the reason you cited: no recursion. Recursion is hard for a database to optimize, which I presume is why it's not built into SQL. The answer for modeling trees in SQL is nested sets, which let you extract part of a tree and determine the depth at the same time. It's a very fast operation because you are simply selecting a range of numbers, which databases are very good at.
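
      A rough sketch of the nested-set idea, using Python's sqlite3 module. The table, the lft/rgt column names, and the subtree helper are hypothetical choices for this example. Each node stores a left and a right number; a node's descendants are exactly the rows whose numbers fall inside its own range, and depth is the count of enclosing ranges.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        id   INTEGER PRIMARY KEY,
        lft  INTEGER NOT NULL,
        rgt  INTEGER NOT NULL,
        body TEXT NOT NULL
    );
    -- Tree: comment 1 has replies 2 and 3; comment 4 replies to 2.
    INSERT INTO comments VALUES
        (1, 1, 8, 'root'),
        (2, 2, 5, 'first reply'),
        (4, 3, 4, 'reply to first reply'),
        (3, 6, 7, 'second reply');
""")


def subtree(root_id):
    """Return (id, depth) for root_id and its descendants, in thread order."""
    return conn.execute("""
        SELECT node.id,
               -- depth = number of ancestors that lie inside the chosen root
               (SELECT COUNT(*) FROM comments AS anc
                 WHERE anc.lft < node.lft AND anc.rgt > node.rgt
                   AND anc.lft >= root.lft AND anc.rgt <= root.rgt) AS depth
        FROM comments AS node
        JOIN comments AS root ON root.id = ?
        WHERE node.lft BETWEEN root.lft AND root.rgt  -- the range select
        ORDER BY node.lft
    """, (root_id,)).fetchall()


for comment_id, depth in subtree(1):
    print("  " * depth + str(comment_id))
```

      The trade-off, not shown here, is on the write side: inserting a reply means renumbering every lft and rgt value to the right of the insertion point.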

    17. Re:Dammit by AKAImBatman · · Score: 1
      Hierarchical? Yes, but I don't see any problem using SQL to access hierarchical information. It's easy to have parent/child relationships.

      No problem, but it's unoptimized for the task. To show the front page of a webforum, for example, you have to find all the forums, then dive down into each to get the most recent topic, then dive into the most recent topic to get the most recent post, then get all the dates and poster information (which will also have to be retrieved) to display on the front page. Given the difficulty in retrieving that data in an SQL database, the programmers invariably start denormalizing the database to provide a performance boost to the application.

      As you probably know, denormalizing is the antithesis of good database design. The result is that the application, database, and code all become several times more complex than they should be. More complexity leads to more bugs. More bugs lead to security issues. Etc.

      The optimal solution is a database that can properly index along the needs of a forum. That index (or indexes as the case may be) would allow the database to return the correct information in a fully joined query. Yes, the query path would be far too custom for your average SQL database. Which is why a custom database designed around hierarchical information starts to make sense.
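
      To make the "dive" concrete, here is a sketch of that front-page lookup against a normalized schema, using Python's sqlite3 module. The forums/topics/posts tables, their columns, and the sample rows are all invented for illustration and do not come from any real forum package. It shows what the query has to do (one correlated lookup per forum) and says nothing about how fast a given engine runs it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forums (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE topics (id INTEGER PRIMARY KEY,
                         forum_id INTEGER NOT NULL REFERENCES forums(id),
                         title TEXT NOT NULL);
    CREATE TABLE posts  (id INTEGER PRIMARY KEY,
                         topic_id INTEGER NOT NULL REFERENCES topics(id),
                         author TEXT NOT NULL,
                         posted_at TEXT NOT NULL);

    INSERT INTO forums VALUES (1, 'General'), (2, 'Hardware');
    INSERT INTO topics VALUES (1, 1, 'Hello'), (2, 1, 'Rules'), (3, 2, 'CPU');
    INSERT INTO posts  VALUES
        (1, 1, 'alice', '2007-01-01'),
        (2, 2, 'bob',   '2007-01-03'),
        (3, 1, 'carol', '2007-01-02'),
        (4, 3, 'dave',  '2007-01-05');
""")


def front_page():
    """For each forum, return (forum, topic, author, date) of its newest post."""
    return conn.execute("""
        SELECT f.name, t.title, p.author, p.posted_at
        FROM forums AS f
        JOIN topics AS t ON t.forum_id = f.id
        JOIN posts  AS p ON p.topic_id = t.id
        WHERE p.id = (
            -- newest post anywhere in this forum
            SELECT p2.id
            FROM posts AS p2
            JOIN topics AS t2 ON t2.id = p2.topic_id
            WHERE t2.forum_id = f.id
            ORDER BY p2.posted_at DESC, p2.id DESC
            LIMIT 1
        )
        ORDER BY f.id
    """).fetchall()


for row in front_page():
    print(row)
```

      The denormalized alternative would be a last_post_id column on forums, updated on every post, which is the kind of shortcut the paragraph above is describing.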

      Data of varying sizes? I thought this problem was solved 20 years ago when ANSI adopted a SQL standard including a VARCHAR datatype.

      Except that posts vary from sizes that fit in most VARCHARs (cheap) to sizes that are effectively CLOBs (expensive). What you want is a database that makes CLOBs cheap, so you can store all posts the same way.

      Full of binary blobs? Why? What in the hell for?

      Attachments, images, avatars, etc.

      So that each user can have an obnormous enoxious "signature banner" graphic that readers have to look at 20 times in any given thread?

      Not really. In the more professional forums (many of which disallow image signatures) everything from photographs of equipment to screenshots to source code to binaries to PDF documents need to be shared. Storing these in the database is wasteful and slow, and storing them flat on disk is dangerous. (e.g. If you deploy JForum as a WAR file, you'll completely overwrite your existing uploads!) It's much better if there is a single, unified datastore for all the information rather than trying to cobble together a solution based on SQL databases + the File System.

      And as anyone who has ever done software development in the real world can tell you, custom components almost always suck worse than similar standard components.

      If you still think that's true in 100% of the cases, then you haven't been programming long enough. Use of a custom component vs. a generic component depends on how critical the component is to the success of the project. The more critical the component is, the more costly it is to attempt to modify a generic solution to meet specialized needs. Sometimes it really is cheaper, faster, and easier to build your own.
    18. Re:Dammit by ahmusch · · Score: 1

      You mean like Oracle, where if you specify a column as NUMBER, INTEGER, SMALLINT, or otherwise don't specify precision and scale -- not bits, but total digits and digits right of the decimal point -- you get a number field that's 38 positions wide?

    19. Re:Dammit by doom · · Score: 1

      Eventually the folks working on web forums will realize that they are just recreating NNTP and move on to something else.

      Hey, I've got an idea: let's try to implement a weblog with a distributed P2P backend that can't be slashdotted! Doesn't that sound cool?

      (And maybe we can work out some way of quoting the text you're replying to that doesn't involve typing BLOCKQUOTE tags all the time... let's see, how might that work?)

    20. Re:Dammit by LWATCDR · · Score: 1

      "That being said, Slashdot is a fairly good example of how they've worked around the limitations of their backend database at a cost equalling or far exceeding the cost of building a customized data store"

      I don't think so. That wasn't a problem with using SQL; it was a problem with design. Had the Slashdot team built their own backend, it is very likely that more such errors would have been made, not fewer.

      The Slashdot team underestimated their traffic. A classic scaling problem.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    21. Re:Dammit by Tablizer · · Score: 1

      It would store numbers, integers in particular, kind of like a compressed string of digits. I forgot the byte-compressed format that IBM 360 used to use to store 2 digits per byte, but something like that but variable-sized. Thus, you would have something like VARINT(50) that could store integers with up to 50 digits but only take up 25 bytes to do it. If the value is 1, then it would only take up one byte. (Well, half a byte actually, but systems usually round to whole bytes.) And null or zero could take up zero bytes if defined that way.
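
      A rough sketch of that storage idea in Python. It is similar in spirit to packed decimal (two decimal digits per byte), but it is only an illustration: the function names are made up, negative numbers are not handled, there is no sign nibble as in real packed-decimal formats, and zero still takes one byte here rather than none.

```python
def pack_digits(value: int) -> bytes:
    """Encode a non-negative integer at two decimal digits per byte."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = str(value)
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])  # high nibble, low nibble
        for i in range(0, len(digits), 2)
    )


def unpack_digits(data: bytes) -> int:
    """Decode bytes produced by pack_digits back into an integer."""
    value = 0
    for byte in data:
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return value


for n in (1, 16_777_216, 10**49):
    packed = pack_digits(n)
    print(n, "->", len(packed), "byte(s)")
    assert unpack_digits(packed) == n
```

      A 50-digit value such as 10**49 takes 25 bytes and the value 1 takes a single byte, matching the sizes described above.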

  10. Duh by Reality+Master+101 · · Score: 5, Insightful

    Who thinks that a specialized application (or algorithm) won't beat a generalized one in just about every case?

    The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.

    See also: high level compiled languages versus assembly language*.

    (*and no, please don't quote the "magic compiler" myth... "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.)

    --
    Sometimes it's best to just let stupid people be stupid.
    1. Re:Duh by Waffle+Iron · · Score: 5, Informative
      *and no, please don't quote the "magic compiler" myth... "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.

      I've programmed extensively in assembly. Your statement may be true up to a couple of thousand lines of code. Past that, to avoid going insane, you'll start using things like assembler macros and your own prefab libraries of general-purpose assembler functions. Once that happens, a compiler that can tirelessly do global optimizations is probably going to beat you hands down.

    2. Re:Duh by smilindog2000 · · Score: 1

      I've never heard the "magic compiler myth" phrase, but I'll help educate others about it. It's refreshing to hear someone who understands reality. Of course, a factor of 2 to 4 improvement in speed is less and less important every day...

      --
      Beer is proof that God loves us, and wants us to be happy.
    3. Re:Duh by suv4x4 · · Score: 2, Interesting

      "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.

      Only people who haven't seen recent advancements in CPU design and compiler architecture will say what you just said.

      Modern compilers apply optimizations at a level so sophisticated that keeping equivalent hand-optimized code maintained would be a nightmare for a human.

      As an example, modern Intel processors can process certain "simple" commands in parallel and other commands are broken apart into simpler commands, processed serially. I'm simplifying the explanation a great deal, but anyone who read about how a modern CPU works, branch prediction algorithms and so on is familiar with the concept.

      Of course "they can beat human written assembly code in just about every case" is an overstatement, but still, you gotta know there's some sound logic & real reasons behind this "myth".

    4. Re:Duh by kfg · · Score: 2, Insightful

      The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly. . .

      . . .has some level of definable and guaranteed data integrity.

      KFG

    5. Re:Duh by wcbarksdale · · Score: 4, Insightful

      Also, to successfully hand-optimize you need to remember a lot of details about instruction pipelines, caches, and so on, which is fairly detrimental to remembering what your program is supposed to do.

    6. Re:Duh by mparker762 · · Score: 2, Insightful

      Only someone who hasn't recently replaced some critical C code with assembler and gotten substantial improvement would say that. This was MSVC 2003 which isn't the smartest C compiler out there, but not a bad one for the architecture. Still, a few hours with the assembler and a few more hours doing some timings to help fine-tune things improved the CPU performance of this particular service by about 8%.

      Humans have been writing optimized assembler for decades, the compilers are still trying to catch up. Modern hand-written assembler isn't necessarily any trickier or more clever than the old stuff (it's actually a bit simpler). Yes compilers are using complicated and advanced techniques, but it's still all an attempt to approximate what humans do easily and intuitively. Artificial intelligence programs use complicated and advanced techniques too, but no one would claim that this suddenly makes philosophy any harder.

      Your second point about the sophistication of the CPU's is true but orthogonal to the original claim. These sophisticated CPU's don't know who wrote the machine code, they do parallel execution and branch prediction and so forth on hand-optimized assembly just like they do on compiler-generated code. Which is one reason (along with extra registers and less segment BS) that it's easier to write and maintain assembler nowadays, even well-optimized assembler.

    7. Re:Duh by mparker762 · · Score: 1

      It sounds like "global optimization" means something different to you than it does to a compiler writer. For a compiler, this simply means optimizing across basic blocks, not optimizing across functions and files (that's usually called "whole-program optimization" or something like that). Humans optimize across basic blocks very easily; it's actually difficult to stop a programmer from doing fairly extensive optimizations at this scale -- programs just look untidy and needlessly redundant without it. Compilers still have trouble doing a decent job of this type of optimization for non-functional languages (like C).

      Even using assembler macros and prefab libraries of general-purpose assembler functions you're generally no worse off than the compiler. What the heck do you think the standard C runtime is?

      The bigger danger to doing lots of code in assembler is that you're tempted to use simpler algorithms over tricky-but-fast ones, and you're tempted to optimize too early (though this is a problem in any language. Assembler just makes this trap particularly easy to fall into).

    8. Re:Duh by Anonymous Coward · · Score: 0

      Still, a few hours with the assembler and a few more hours doing some timings to help fine-tune things improved the CPU performance of this particular service by about 8%.

      Woo hoo, so that one little loop that accounts for 5% of the total running time was sped up 8%! Did you not have anything better to do?

      Modern hand-written assembler isn't necessarily any trickier or more clever than the old stuff (it's actually a bit simpler).

      And after a day you might have a decent for loop. After that I have an entire web service.

      The speed of execution goes down in importance as the speed to market goes up and the number of programmers goes up. Why waste your time using a hammer when doing a roof? Get a nail gun.

    9. Re:Duh by Waffle+Iron · · Score: 1
      Even using assembler macros and prefab libraries of general-purpose assembler functions you're generally no worse off than the compiler.

      I don't know how true that is, given that assembler macros and fixed assembler APIs won't be particularly good at inlining calls and then integrating the optimizations of the inlined code with the particular facets of the surrounding code for each expansion.

    10. Re:Duh by BillGatesLoveChild · · Score: 1

      > "modern compilers are so good nowadays that they can beat human written assembly code in just about every case".

      Case in point: SQL Optimizers in every SQL Database Product I have ever used. Often, they will find a very stupid way of doing something where a human (who has greater insight into the data despite the UPDATE STATISTICS command) can find a much faster way. Much of my time maintaining databases is trying to get it not to do stupid things. No one trusts automatic C++ code generators. So why do we trust automatic SQL code generators?

    11. Re:Duh by Pseudonym · · Score: 1

      The reason why assembly programmers can beat high-level programmers is they can write their code in a high-level language first, then profile to see where the hotspots are, and then rewrite a 100 line subroutine or two in assembly language, using the compiler output as a first draft.

      In other words, assembly programmers beat high-level programmers because they can also use modern compilers.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    12. Re:Duh by Electrum · · Score: 1

      No one trusts automatic C++ code generators. So why do we trust automatic SQL code generators?

      We don't. That's why we have explain plans and hints.

    13. Re:Duh by BillGatesLoveChild · · Score: 1
      > That's why we have explain plans and hints.

      That's right. To fix the fact they're broken.

    14. Re:Duh by suv4x4 · · Score: 1, Insightful

      This was MSVC 2003 which isn't the smartest C compiler out there, but not a bad one for the architecture. Still, a few hours with the assembler and a few more hours doing some timings to help fine-tune things improved the CPU performance of this particular service by about 8%... These sophisticated CPU's don't know who wrote the machine code, they do parallel execution and branch prediction and so forth on hand-optimized assembly just like they do on compiler-generated code. Which is one reason (along with extra registers and less segment BS) that it's easier to write and maintain assembler nowadays, even well-optimized assembler.

      Do you know which types of commands when ordered in quadruples will execute at once on a Core Duo? Incidentally those that won't on a Pentium 4.

      I hope you're happy with your 8% improvement, enjoy it until your next CPU upgrade that requires different approach to assembly optimization.

      The advantage of a compiler is that compiling for a target CPU is a matter of a compiler switch, so compiler programmers can concentrate on performance and smart use of the CPU specifics, and you can concentrate on your program features.

      If you were that concerned about performance in the first place, you'd use a compiler provided by the processor vendor (Intel I presume) and use the Intel libraries for processor specific implementations of common math and algorithm issues needed in applications.

      Most likely this would've given you more than 8% boost and still keep your code somewhat less bound to a specific CPU, than with assembler.

      An example of "optimization surprise" I like is the removal of the barrel shifter in Pentium 4 CPUs. You see, lots of programmers know that it's faster (on most platforms) to bit shift, and not multiply by 2, 4, 8, etc (or divide).

      But bit shifting on P4 is handled by the ALU, and is slightly slower than multiplication (why, I don't know, but it's a fact). Code "optimized" for bit shifting would be "antioptimized" on P4 processors.
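
      For readers who haven't used the trick: for integers, shifting left by k bits gives the same result as multiplying by 2**k, so the two spellings are interchangeable and only the cost differs from one CPU to the next. The sketch below checks the equivalence in Python, where integers are arbitrary precision and both right shift and // round toward negative infinity. The function names are made up, and nothing here measures speed.

```python
def times_eight_shift(x: int) -> int:
    """Multiply by 8 the 'clever' way: shift left by three bits."""
    return x << 3


def times_eight_mul(x: int) -> int:
    """Multiply by 8 the plain way."""
    return x * 8


def half_shift(x: int) -> int:
    """Divide by 2 with a right shift."""
    return x >> 1


def half_div(x: int) -> int:
    """Divide by 2 with floor division."""
    return x // 2


for x in range(-1000, 1001):
    assert times_eight_shift(x) == times_eight_mul(x)
    assert half_shift(x) == half_div(x)

print("shift and multiply/divide agree on every tested value")
```

      Since the results are identical, source code written as x * 8 leaves the choice of instruction to whatever generates the machine code for the target CPU.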

      I know some people adapted their performance critical code to meet this new challenge. But then what? P4 is obsolete and instead we're back to the P3 derived architecture, and the barrel shifter is back!

      When I code a huge and complex system, I'd rather buy a 8% faster machine and use a better compiler than have to manage this hell each time a CPU comes out.

    15. Re:Duh by try_anything · · Score: 3, Insightful
      Modern compilers apply optimizations at a level so sophisticated that keeping equivalent hand-optimized code maintained would be a nightmare for a human.

      There are three quite simple things that humans can do that aren't commonly available in compilers.

      First, a human gets to start with the compiler output and work from there :-) He can even compare the output of several compilers.

      Second, a human can experiment and discover things accidentally. I recently compiled some trivial for loops to demonstrate that array bounds checking doesn't have a catastrophic effect on performance. With the optimizer cranked up, the loop containing a bounds check was faster than the loop with the bounds check removed. That did not inspire confidence.

      Third, a human can concentrate his effort for hours or days on a single section of code that profiling revealed to be critical and test it using real data. Now, I know JIT compilers and some specialized compilers can do this stuff, but as far as I know I can't tell gcc, "Compile this object file, and make the foo function as fast as possible. Here's some data to test it with. Let me know on Friday how far you got, and don't throw away your notes, because we might need further improvements."

      I hope I'm wrong about my third point (please please please) so feel free to post links proving me wrong. You'll make me dance for joy, because I do NOT have time to write assembly, but I have a nice fast machine here that is usually idle overnight.

    16. Re:Duh by Josef+Meixner · · Score: 1
      The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.

      There is another reason I believe to be even more important: I doubt I could beat a good engine when the job at hand leaves the very specific, narrow area and it moves further and further into the area where the engines are good, the generic usage pattern. So I would for example build a special engine for the search feature of a large forum, because there I have very special needs. For example I could do the adding of data offline (I don't think it has a big impact, if the search is updated once per hour or perhaps even once per day) and so don't need any locking, which will make the thing a bit faster. Also I can predict the access patterns quite well and so select a good data layout and how to do the indexes.
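
      As a sketch of what such a special-purpose search engine might look like at its simplest, here is a tiny inverted index in Python. It is built in one offline pass and then only read, so it needs no locking. The posts data and the function names are invented for illustration; a real forum search would also need stemming, ranking, and persistence.

```python
from collections import defaultdict


def build_index(posts: dict[int, str]) -> dict[str, set[int]]:
    """Offline step: map each lowercased word to the ids of posts containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Read-only step: ids of posts containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical posts, rebuilt into a fresh index once per hour or per day.
posts = {
    1: "Oracle recursive queries",
    2: "recursive queries in PostgreSQL",
    3: "BLOB storage",
}
index = build_index(posts)
print(sorted(search(index, "recursive queries")))
```

      Because the index is replaced wholesale on each rebuild, readers never see a half-updated structure, which is the property that lets this design skip locking.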

      But building a special database for essential parts of an application seems quite risky to me. My first question would be how you plan to do a consistent backup, the second how you guarantee integrity. Sorry, but performance is useless if your application can't do a rollback after a problem or produces the problems by creating inconsistencies. So partitioning the space might be a terrible decision, which could prove fatal once you truly need your backups.

      I guess it is the usual problem in programming, knowing when to leave the high level and the support libraries and start to roll your own.

    17. Re:Duh by RAMMS+EIN · · Score: 2, Insightful

      Also, the compiler may know more CPUs than you do. For example, do you know the pairing rules for instructions on an original Pentium? The differences one must pay attention to when optimizing for a Thoroughbred Athlon vs. a Prescott P4 vs. a Yonah Pentium-M vs. a VIA Nehemiah? GCC does a pretty good job of generating optimized assembly code for each of these from the same C source code. If you were to do the same in assembly, you would have to write separate code for each CPU, and know the subtle differences as well as the compiler does.

      --
      Please correct me if I got my facts wrong.
    18. Re:Duh by tootlemonde · · Score: 1

      The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.

      The article deals with that issue:

      For the purpose of this paper, we define "dramatically outperform" to mean at least a factor of 10 advantage on the same (or comparable) hardware. ... Although one can argue about whether a factor of 10 is too high a fence for a new architecture to clear, the number is clearly not a factor of 2 or 3. In the latter case, one merely waits a year or two for the next hardware advance or increases the hardware budget. A factor of 10, in contrast, makes such tactics unworkable.

      It then says that "The premise of this paper is that there are at least four markets where this factor of 10 (or higher) threshold currently exists."

      The point is that there are certain markets where changing the database design and the retrieval strategies will outperform hardware upgrades. Among several examples, it mentions the data warehouse market which "is dominated by RDBMS vendors selling systems that use the traditional row-oriented architecture" and usually responds to the trade off between programmer time and hardware costs. A different architecture (column-oriented) gives a dramatic improvement.
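
      A toy sketch of the row-versus-column distinction, for anyone who has not seen it spelled out. The table and its field names are invented for illustration, and nothing here measures performance; it only shows which data each layout has to touch to answer the same aggregate query.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 50.0},
]

# Column-oriented layout: each attribute is stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 80.0, 50.0],
}


def total_amount_rows(table):
    # Walks every whole record even though only one field is needed.
    return sum(record["amount"] for record in table)


def total_amount_columns(table):
    # Reads only the single column the query asks for.
    return sum(table["amount"])
```

      Warehouse queries tend to aggregate a few columns over many rows, which is the access pattern the column layout favors.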

      The result is that the reason people use a general database is not "because it's already written, already debugged" but because they aren't aware of the huge benefits that thinking a little harder can bring.

    19. Re:Duh by 00lmz · · Score: 1

      I think GCC can refine its guessing of branch probabilities using runtime profile data (see -fprofile-arcs and -fbranch-probabilities, also this page about improving GCC's optimizer). I don't know how close this is to your ideals though... Better ask a GCC developer.

    20. Re:Duh by mparker762 · · Score: 1

      Nope, I don't know which quadruples will execute in parallel on a Core Duo. And unless I compile a special version just for the Core Duo, the compiler doesn't either. But this small assembler change sped up the *entire* program (not just that code segment) by an average of 8% across several CPUs tested. That's a pretty big win for a hundred lines of assembler.

      One reason I'm so skeptical of claims of compiler efficiency by guys that have never done significant amounts of assembler is that these claims typically rest on the capabilities of a "sufficiently sophisticated compiler" that never seems to exist in reality. If you spend much time staring at the code coming out of MSVC it's difficult to be impressed. Yes, they use sophisticated algorithms, and yes, they optimize every single line of code relentlessly. But all those claims of the compiler producing better code than a human programmer were originally in reference to the sort of programmers that nowadays prefer VB; the claim was never made that compilers could out-perform a good programmer. Somehow this distinction has been lost in recent years, probably because the lack of exposure to assembler in college curricula and the de-emphasis of low-level concerns have made the sort of competencies needed for assembler a rarity.

    21. Re:Duh by ciggieposeur · · Score: 1

      With the optimizer cranked up, the loop containing a bounds check was faster than the loop with the bounds check removed.

      That actually makes sense to me. If your bounds check was very simple and the only loop outcome was breaking out (throw an exception, exit the loop, exit the function, etc., without altering the loop index), the optimizer could move it out of the loop entirely and alter the loop index check to incorporate the effect of the bounds check. Result is a one-time bounds check before entering the loop and a simplified loop, hence faster execution.
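
      A sketch of that transformation, written out by hand. The function names are made up, and Python still performs its own internal index checks, so this only illustrates the shape of the rewrite an optimizer could apply, not any speedup.

```python
def sum_checked(data, n):
    """Check the index on every iteration, as the source loop does."""
    total = 0
    for i in range(n):
        if i >= len(data):          # per-iteration bounds check
            raise IndexError("index out of range")
        total += data[i]
    return total


def sum_hoisted(data, n):
    """Same observable behavior with the check moved out of the loop."""
    limit = min(n, len(data))       # one comparison before the loop
    total = 0
    for i in range(limit):          # loop body no longer tests the bound
        total += data[i]
    if n > limit:                   # the failure the original check would hit
        raise IndexError("index out of range")
    return total
```

      The rewrite is only valid because the failing check does nothing but leave the loop; if it altered the loop index or had other side effects, the check could not be folded into the loop limit this way.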

      I remember in the discussion on the D compiler someone pointed this out.

    22. Re:Duh by bill_mcgonigle · · Score: 1

      Who thinks that a specialized application (or algorithm) won't beat a generalized one in just about every case?

      Managers who read Oracle marketing material?

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    23. Re:Duh by try_anything · · Score: 1

      Thanks; that will help. I'm really wondering about trial and error (since the compiler can't always predict which instruction sequence runs faster) and all the optimizations that are computationally unfeasible for normal use. It seems odd that turning on all the optimizations only costs seconds when applied to a few hundred lines of code -- if it's easier to come up with too-slow compiler optimizations than fast ones, I wonder why I don't have access to those slow optimizations when I really want them!

    24. Re:Duh by Simetrical · · Score: 1

      Also, the compiler may know more CPUs than you do. For example, do you know the pairing rules for instructions on an original Pentium? The differences one must pay attention to when optimizing for a Thoroughbred Athlon vs. a Prescott P4 vs. a Yonah Pentium-M vs. a VIA Nehemiah? GCC does a pretty good job of generating optimized assembly code for each of these from the same C source code. If you were to do the same in assembly, you would have to write separate code for each CPU, and know the subtle differences as well as the compiler does.

      I must admit to only having taken a semester's worth of assembly, but it seems to me that the major optimization is simply condensing verbose instructions. If you're trying to play fancy tricks with pipelines, sure, you'll have to do it differently, but that's a losing game.

      How does gcc know what chip the program is going to be run on, anyway? Once it's compiled, it can be run on any processor that uses the given instruction set.

      --
      MediaWiki developer, Total War Center sysadmin
  11. Parallel databases by meta-monkey · · Score: 1

    This reminds me of the parallel databases class I took in college. Sure, specialized parallel databases (not distributed, mind you, parallel) using specialized hardware were definitely faster than the standard SQL-type relational databases...but so what? The costs were so much higher they were not feasible for most applications.

    Specialized software and hardware outperforms generic implementations! Film at 11!

    --
    We don't have a state-run media we have a media-run state.
  12. SQL is Dead - Long Live SQL by Doc+Ruby · · Score: 1

    SW platform development always features a tradeoff between general purpose APIs and optimized performance engines. Databases are like this. The economic advantages for everyone in using an API even as awkward and somewhat inconsistent as SQL are more valuable than the lost performance in the fundamental relational/query model.

    But it doesn't have to be that way. SQL can be retained as an API, but different storage/query engines can be run under the hood to better fit different storage/query models for different kinds of data/access. A better way out would be a successor to SQL that is more like a procedural language for objects with all operators/functions implicitly working on collections like tables. Yes, something like object lisp, best organized as a dataflow with triggers and events. So long as SQL can be automatically compiled into the new language, and back, for at least 5 years of peaceful coexistence.

    --
    make install -not war

    1. Re:SQL is Dead - Long Live SQL by Tablizer · · Score: 1

      A better way out would be a successor to SQL that is more like a procedural language for objects with all operators/functions implicitly working on collections like tables.

      If you mean like cursor-oriented approaches (explicit loops), that tends to make automatic optimization harder. Note that in SQL you generally specify neither order nor explicit loops. The idea is that the RDBMS figures out the best performance path so that you don't have to. It is like refactoring a math equation. The less your query language is like math and the more like procedural steps (loops and conditionals), the harder it is to auto-optimize.

      As far as "objects", I think OOP set relational progress back 15 years. OO conflicts with relational on a number of fronts and has bloated up code with ugly translation layers. But that is another debate for another day.

      As far as replacing SQL with a more flexible query language, I have proposed SMEQL (Structured Meta-Enabled Query Language). It allows things like column lists to be "calculated" via a query also. It also makes it easier to split big queries into smaller ones by using named references instead of (or in addition to) nesting. Some have complained about the "string clauses", but these are just one of multiple implementation approaches.

    2. Re:SQL is Dead - Long Live SQL by Tablizer · · Score: 1

      Here is an example of the draft language SMEQL:

      This example returns the top 6 earners in each department based on this table schema: table: Employees, columns: {empID, dept, empName, salary}

          srt = orderBy(Employees, (dept, salary), order)
          top = group(srt, ((dept) dept2, max(order) order))
          join(srt, top, a.dept=b.dept2 and b.order - a.order <= 5)
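
      For readers who don't speak SMEQL, here is one reading of what the query above computes, sketched in plain Python. This is an interpretation rather than a translation of the draft language: it assumes the sort is ascending by salary, so the highest `order` values in each department are the top earners. The function name and sample rows are made up.

```python
from collections import defaultdict


def top_earners(employees, n=6):
    """Return the n highest-paid employees of each department."""
    by_dept = defaultdict(list)
    for emp in employees:
        by_dept[emp["dept"]].append(emp)

    result = []
    for dept in sorted(by_dept):
        ranked = sorted(by_dept[dept], key=lambda e: e["salary"], reverse=True)
        result.extend(ranked[:n])   # keep at most n rows per department
    return result


employees = [
    {"empID": 1, "dept": "eng", "empName": "Ann", "salary": 120},
    {"empID": 2, "dept": "eng", "empName": "Bob", "salary": 100},
    {"empID": 3, "dept": "eng", "empName": "Cy", "salary": 90},
    {"empID": 4, "dept": "ops", "empName": "Di", "salary": 70},
]
top_two = [e["empName"] for e in top_earners(employees, n=2)]
```

      Note how the Python version spells out loops and ordering, while the SMEQL version only names the relations involved.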

    3. Re:SQL is Dead - Long Live SQL by Doc+Ruby · · Score: 1

      Collections don't (necessarily) have an order.

      Objects don't have to be C++ objects. They can be just class blueprints inherited from other classes, for instantiated objects, which are just related logic and the data accessed.

      Your SMEQL looks a lot like lisp.

      Something like object lisp for large collections of multidimensional (even asymmetric) objects could bring benefits of encapsulation/reuse and relations to a syntax that better reflects both the data model and the sequence of operations, in rules like policies. A dataflow version would be easy to read, debug and maintain.

      --
      make install -not war

    4. Re:SQL is Dead - Long Live SQL by hypersql · · Score: 1

      A successor to SQL - NewSQL: http://newsql.sf.net/

    5. Re:SQL is Dead - Long Live SQL by Tablizer · · Score: 1

      Collections don't (necessarily) have an order.

      I am not sure what this is a response to.

      Objects don't have to be C++ objects. They can be just class blueprints inherited from other classes, for instantiated objects, which are just related logic and the data accessed.

      There is no uniform practice/standard for handling collections of objects. Under pure OO, each object/class handles itself. Thus, standardizing collection handling is against encapsulation. Standardizing it turns the OO engine into a quasi-database, but 60's style navigational databases are the kind of things that frustrated Dr. Codd and motivated him to invent relational. Navigational and relational battled it out in the 70's, and relational won......until OO fans tried to bring back navigational. Relational offers more discipline and consistency to design than navigational. This is largely because it borrowed from set theory and predicate logic.

      Your SMEQL looks a lot like lisp.

      I disagree. It does borrow heavily from functional programming, but the similarity to Lisp ends there.

  13. This has been known for years already by TVmisGuided · · Score: 2, Interesting

    Sheesh...and it took someone from MIT to point this out? Look at a prime example of a high-end, heavily-scaled, specialized database: American Airlines' SABRE. The reservations and ticket-sales database system alone is arguably one of the most complex databases ever devised, is constantly (and I do mean constantly) being updated, is routinely accessed by hundreds of thousands of separate clients a day...and in its purest form, is completely command-line driven. (Ever see a command line for SABRE? People just THINK the APL symbol set looked arcane!) And yet this one system is expected to maintain carrier-grade uptime or better, and respond to any command or request within eight seconds of input. I've seen desktop (read: non-networked) Oracle databases that couldn't accomplish that!

    --
    All the world's an analog stage, and digital circuits play only bit parts.
    1. Re:This has been known for years already by Tablizer · · Score: 1

      Look at a prime example of a high-end, heavily-scaled, specialized database: American Airlines' SABRE.

      But are there *other* airlines that are doing fine with "standard" RDBMS, such as Oracle or DB2?

    2. Re:This has been known for years already by sqlgeek · · Score: 3, Insightful

      I don't think that you know Oracle very well. Let's say you want to scale and so you want clustering or grid functionality -- built into Oracle. Let's say that you want to partition your enormous table into one physical table per month or quarter -- built in. Oh, and if you query the whole giant table you'd like parallel processes to run against each partition, balanced across your cluster or grid -- yeah, that's built in too. Let's say you almost always get a group of data together rather than piece by piece so you want it physically colocated to reduce disk i/o -- built in.

      This is why you pay a good wage for your Oracle data architect & DBA -- so that you can get people who know how to do these sorts of things when needed. And honestly I'm not even scratching the surface.

      Consider a data warehouse for a giant telecom in South Africa (with a DBA named Billy in case you wondered). You have over a billion rows in your main fact table, but you're only interested in a few thousand of those rows. You have an index on dates and another index on geographic region and another index on customer. Any one of those indexes will reduce the 1.1 billion rows to 10's of millions of rows, but all three restrictions will reduce it to a few thousand. What if you could read three indexes, perform bitmap comparisons on the results to get only the rows that match the results of all three indexes, and then fetch only those few thousand rows from the 1.1 billion row table? Yup, that's built in and Oracle does it for you behind the scenes.
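
      The bitmap idea is easy to sketch, using a Python integer as the bitmap. This is a toy illustration of the general technique only, not a description of how Oracle implements it; the function names and the sample rows are invented.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitmap.

    Bit i of the bitmap is set when row i holds that value.
    """
    index = {}
    for row_id, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << row_id)
    return index


def matching_row_ids(bitmap):
    """Yield the row positions whose bit is set, lowest first."""
    row_id = 0
    while bitmap:
        if bitmap & 1:
            yield row_id
        bitmap >>= 1
        row_id += 1


calls = [
    {"day": "2007-01-09", "region": "Gauteng", "customer": "A"},
    {"day": "2007-01-09", "region": "Western Cape", "customer": "A"},
    {"day": "2007-01-10", "region": "Gauteng", "customer": "B"},
    {"day": "2007-01-09", "region": "Gauteng", "customer": "A"},
]
by_day = build_bitmap_index(calls, "day")
by_region = build_bitmap_index(calls, "region")
by_customer = build_bitmap_index(calls, "customer")

# AND the three bitmaps first; only then touch the fact table itself.
hits = by_day["2007-01-09"] & by_region["Gauteng"] & by_customer["A"]
selected_ids = list(matching_row_ids(hits))
```

      The expensive step, fetching rows from the big table, happens only for the positions that survive all three filters.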

      Now yeah, you can build a faster single-purpose db. But you better have a god damn'd lot of dev hours allocated to the task. My bet is that you'll probably come out way ahead in cash & time to market with Oracle, a good data architect and a good DBA. Any time you want to put your money on the line, you let me know.

    3. Re:This has been known for years already by georgewilliamherbert · · Score: 1

      Nevertheless - anyone doing serious data warehousing who cares about read performance has been using Teradata (older apps) or column-oriented Sybase-IQ (newer apps). Oracle can store over a billion rows, sure; a terabyte's a lot of data, and people have had multi-terabyte databases for the better part of a decade, for some projects.

      Why? Despite all the tuning, Sybase-IQ can still run through a general purpose query into its data around ten times faster than tuned Oracle.

      It may not matter in the telephone company, but for people who actually have money on the line (financial companies), huge data processing uses appropriate tools. IQ and columns win.

    4. Re:This has been known for years already by TVmisGuided · · Score: 1
      Now yeah, you can build a faster single-purpose db. But you better have a god damn'd lot of dev hours allocated to the task. My bet is that you'll probably come out way ahead in cash & time to market with Oracle, a good data architect and a good DBA. Any time you want to put your money on the line, you let me know.

      Seems to me this describes AA perfectly...SABRE has been around since what, the mid- to late-70s? And it's still actively developed and maintained. At a fairly hefty annual price tag. And yeah, the user interface is antiquated and arcane, but no one's come up with anything better yet.

      Now, I don't know what they're using to get it to play nice with the Internet (since Travelocity is tied directly into SABRE), but that must have been an interesting exercise in programming on its own. That, however, is a discussion topic for another time and place.

      --
      All the world's an analog stage, and digital circuits play only bit parts.
    5. Re:This has been known for years already by TVmisGuided · · Score: 1

      I can't think of any. The other major CRS out there all look a great deal like SABRE. Some of the smaller airlines may have built some coprocessing systems on RDBMS which use the CRS as a back-end, just to make it a bit easier on the reservations and ticketing people, but since I don't work in the travel field any more I don't know who that might be or what they're using.

      --
      All the world's an analog stage, and digital circuits play only bit parts.
    6. Re:This has been known for years already by Anonymous Coward · · Score: 0

      The Wikipedia article explains how they abused their ownership of this custom-made, byzantine system to promote their own flights, just like Google abuses its search monopoly to push its own products in search results. That wouldn't have been so easy with a standardized database and an interface that lets you apply your own weighting magic.

  14. MIT, not Berkeley by Anonymous Coward · · Score: 0

    Back when he did Postgres, it was at Berkeley. He then moved on to the private world to do a start-up from it. So now he is at MIT. Well, at least MIT picks up good ones.

    1. Re:MIT, not Berkeley by hey+hey+hey · · Score: 1
      Back when he did Postgres, it was at Berkeley. He then moved on to the private world to do a start-up from it.

      He was still a professor at Berkeley while working with RTI/Ingres Inc. He didn't leave until his wife wanted to move back near her family (which was after the sale of both Ingres and Illustra). I was at (no doubt just one of) the going away lunches.

      So now he is at MIT. Well, at least MIT picks up good ones.

      Certainly true. Mike is as bright as they come.

  15. I thought I was an assembler demon by ratboy666 · · Score: 1

    I had a "simple" optimization project. It came down to one critical function (ISO JBIG compression). I coded the thing by hand in assembler, carefully manually scheduling instructions. It took me days. Managed to beat GNU gcc 2 and 3 by a reasonable margin. The latest Microsoft C compiler? Blew me away. I looked at the assembler it produced -- and I don't get where the gain is coming from. The compiler understands the machine better than I do.

    Go figure -- I hung up my assembler badge. Still a useful skill for looking at core dumps, though. And for dealing with micro-controllers.

    So, have you had at it and benchmarked your assembler vs. a compiler's?

    --
    Just another "Cubible(sic) Joe" 2 17 3061
    1. Re:I thought I was an assembler demon by Reality+Master+101 · · Score: 1

      I looked at the assembler it produced -- and I don't get where the gain is coming from. The compiler understands the machine better than I do.

      All that proves is that the compiler knew a trick you didn't (probably it understood which instructions will go into which pipelines and will parallelize). I bet if you took the time to learn more about the architecture, you could find ways to be even more clever.

      I'm not arguing for a return to assembly... it's definitely too much of a hassle these days, and again, hardware is cheap, and programmers are expensive. Just that given enough programmer time, humans can nearly always do better than the compiler, which shouldn't be surprising since humans programmed the compiler, and humans have more contextual knowledge of what a program is trying to accomplish.

      --
      Sometimes it's best to just let stupid people be stupid.
    2. Re:I thought I was an assembler demon by TheLink · · Score: 1

      "The compiler understands the machine better than I do."

      Actually the people paid lots of money to write Microsoft's C compiler understand the machine better than you do. I doubt you should be surprised.

      And the compiler will hopefully be able to keep all the tricks in mind (a human might forget to use one in some cases).

      I'm just waiting/hoping for the really smart people to make stuff like perl and python faster.

      Java has improved in speed a lot and already is quite fast in some cases, but I don't consider it a high level language (given the amount of code people have to write just to do simple stuff).

  16. Please reduce lameness by suv4x4 · · Score: 4, Insightful

    We're all sick of "new fad: X is dead?" articles. Please reduce lameness to an acceptable level!
    Can't we get used to the fact that specialized & new solutions don't magically kill the existing popular solution to a problem?

    And it's not a recent phenomenon, either. I bet it goes back to when the first proto-journalistic phenomena formed in early human societies, and haunts us to this very day...

    "Letters! Spoken speech dead?"

    "Bicycles! Walking on foot dead?"

    "Trains! Bicycles dead?"

    "Cars! Trains dead?"

    "Aeroplanes! Trains maybe dead again this time?"

    "Computers! Brains dead?"

    "Monitors! Printing dead yet?"

    "Databases! File systems dead?"

    "Specialized databases! Generic databases dead?"

    In a nutshell. Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems (as noted above), so specialized systems will thrive just as well without killing anything.

    1. Re:Please reduce lameness by suv4x4 · · Score: 1

      man, you're a fag and an idiot.

      The world's not perfect you know. You're a troll, anonymous and a coward.

      I'd still pick me over you if given the chance.

    2. Re:Please reduce lameness by msormune · · Score: 2, Funny

      I'll chip in: Public forums! Intelligence dead? Slashdot confirms!

    3. Re:Please reduce lameness by JFMulder · · Score: 1

      You forgot

      "Video! Radio star dead?"

  17. Bad myth! by Anonymous Coward · · Score: 0

    Hardware is cheap.

    Developer time slightly less so.

    Operational expenses will break your kneecaps and charge you for the bat. So often I have heard developers piss and moan about another debugging stage or standardisation in terms of installation procedures or configuration options or whatever, and haul out the hoary chestnut about their time being heinously expensive, when in fact what they could fix once will mean major benefits again, and again, and again, often on a weekly or monthly basis for scheduled tasks. Even accelerating a simple weekly data update from a three-hour, five-man task to a one-hour, one-man task saves you fourteen man-hours. (This example taken from real life with details stripped to protect the idiotic.) You don't have to have an MBA from Harvard to count these beans.
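
    The bean counting, spelled out as a quick sketch. The per-run figures come from the example above; the 52 runs per year is an assumption that the weekly update really does run every week.

```python
def man_hours(duration_hours, people):
    """Effort for one run of the task."""
    return duration_hours * people


before = man_hours(3, 5)            # three-hour, five-man task
after = man_hours(1, 1)             # one-hour, one-man task
saved_per_run = before - after      # saved on every weekly run

RUNS_PER_YEAR = 52                  # assumption: one run per week, all year
saved_per_year = saved_per_run * RUNS_PER_YEAR
```

    That is 14 man-hours every run and 728 a year, against a fix that only has to be made once.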

    Remember, kids:
    Up front costs (including dev time) are cheap.
    Recurring costs are expensive.

  18. Stonebraker has a vested interest in Stream Dbs by Anonymous Coward · · Score: 2, Informative

    He's the CTO of Streambase, so he's not just a "neutral" academic.

    http://www.streambase.com/about/management.php

  19. Re:Was there ever a one-size-fits-all anything? by EmbeddedJanitor · · Score: 1
    Languages, OSs, file systems, databases, microprocessors, cars, VCRs, diskdrives, pizzas, .... none of these are one-size-fits-all.

    There never has been, and probably never will be. A small embedded database will never be replaced by a fat-assed SQL database any more than Linux will ever find a place in the really bottom-end microcontroller systems.

    --
    Engineering is the art of compromise.
  20. Isn't it just stating the obvious? by Dekortage · · Score: 5, Funny

    I've made some similar discoveries myself!

    • Transporting 1500 pounds of bricks from the store to my house is much faster if I use a big truck rather than making dozens (if not hundreds) of trips with my Honda Civic.
    • Wearing dress pants with a nice shirt and tie often makes an interview more likely to succeed, even if I wear jeans every other day after I get the job.
    • Carving pumpkins into "jack-o-lanterns" always turns out better if I use a small, extremely sharp knife instead of a chainsaw.

    Who woulda thought that specific-use items might improve the outcome of specific situations?

    --
    $nice = $webHosting + $domainNames + $sslCerts
    1. Re:Isn't it just stating the obvious? by hotdiggitydawg · · Score: 1

      What self-respecting geek would carve pumpkins with anything other than a Dremel? Turn your card in at the door, sir...

    2. Re:Isn't it just stating the obvious? by swillden · · Score: 1

      Transporting 1500 pounds of bricks from the store to my house is much faster if I use a big truck rather than making dozens (if not hundreds) of trips with my Honda Civic.

      Nah, just strap it all on top.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    3. Re:Isn't it just stating the obvious? by Anonymous Coward · · Score: 0

      1500 pounds? That is only ~700kg. You should be able to do it in two trips if you do not have passengers. So a big ass truck may not be needed.

      Now, if you said 15 tons or 15,000kg or ~33,000lb, well, that is different. Then you'd need dozens of trips with the Civic or one big truck :)

  21. it's all (okay, mostly) in the queries by yagu · · Score: 1

    I've seen drop dead performance on flat file databases. I've seen molasses slow performance on mainframe relational databases. And I've seen about everything in between.

    What I see as a HUGE factor is less the database chosen (though that is obviously important) and more how interactions with the database (updates, queries, etc) are constructed and managed.

    For example, we one time had a relational database cycle application that was running for over eight hours every night, longer than the allotted time for all night time runs. One of our senior techs took a look at the program, changed the order of a couple of parentheses, and the program ran in less than fifteen minutes, with correct results.

    I've also written flat file "database" applications, specialized with known characteristics, that operated on extremely large databases (for the time, greater than 10G), and transactions were measured in milliseconds (typically .001 - .005 seconds) under heavy load. This application would never have held up under any kind of moderate requirement for updates, but I knew that.

    I've many times seen overkill with hugely expensive databases hammering lightweight applications into some mangled relational solution.

    I've never seen the world as a one-size-fits-all database solution. Vendors of course would tell us all different.

  22. Databases are like history and mistakes by mlwmohawk · · Score: 1, Troll

    The problem with database articles like this is that they pretty much ignore the core objectives of a database. Wait, I hear you, "That's the point, by ignoring ACID and transactions, we can improve performance." Therein lies the difference between understanding the "whole problem" vs a part of the problem.

    Specialized solutions typically accomplish their objective by removing one or more aspects of good database design.

    Databases are complex beasts and the good ones encompass a LOT of expertise and theory of data access, stuff that takes many months or even years to really understand. Specialized systems tend to focus exclusively on the highest level "problem" while ignoring the inherent problems of data access and modifications.

    While there are specialized solutions that work within a limited range of criteria, and, in fact, improve performance, it should be the exception rather than the rule, because the good SQL databases are REALLY good (MySQL is not one of them) at parsing and creating a good query.

    1. Re:Databases are like history and mistakes by NineNine · · Score: 0, Redundant

      Well said.

  23. No way. by Anonymous Coward · · Score: 0

    We still use Access every day. It's not dead yet!

  24. This is news? by Anonymous Coward · · Score: 0

    Who has _not_ known this for years? The only reasons for generic DBs have been low cost and flexibility. Does it take studies to point out common knowledge now?

  25. This is spot on -- I did some benchmarking, too by relstatic · · Score: 1

    I actually started a Free/Open Source Social Networking project (FlightFeather) based on similar reasoning, which I also covered in a talk at LinuxWorld San Francisco 2006.

    Specifically, when dealing with Web applications, get as much of the stored data as possible in HTML format -- then serve the resulting static pages. Much faster than pulling the data out of a generic SQL database, and rebuilding the page on every request. Even the page generation side can be very efficient. My benchmarking indicates that a single one-CPU machine should be able to handle several hundred comments per second in a discussion forum.
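
    A minimal sketch of the render-on-write idea. This is not FlightFeather's actual code; `StaticForum` and its methods are hypothetical names, and a real system would write the page to a file for the web server to serve rather than keep it in a string.

```python
import html


class StaticForum:
    """Hypothetical forum that rebuilds its page on write, not on read."""

    def __init__(self):
        self.comments = []
        self.renders = 0            # counts how often the page was rebuilt
        self.page = self._render()

    def _render(self):
        self.renders += 1
        items = "".join(f"<li>{html.escape(c)}</li>" for c in self.comments)
        return f"<ul>{items}</ul>"

    def add_comment(self, text):
        self.comments.append(text)
        self.page = self._render()  # pay the rendering cost once per write

    def serve(self):
        return self.page            # reads hand back the prebuilt string


forum = StaticForum()
forum.add_comment("<b>first post</b>")
for _ in range(1000):
    forum.serve()                   # a thousand reads, zero extra renders
```

    The trade-off is that every write pays for a render, which works out when reads vastly outnumber writes, as they usually do on a discussion forum.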

    --
    Bad Boss? Discuss it on bosstats.com
    1. Re:This is spot on -- I did some benchmarking, too by commanderfoxtrot · · Score: 1

      You're probably using it, but if you're dealing with caching (as most Web programmers should), look at http://www.danga.com/memcached/

      --
      http://blog.grcm.net/
    2. Re:This is spot on -- I did some benchmarking, too by relstatic · · Score: 1

      Memcached is an excellent suggestion -- especially since it is a distributed cache. Of course, there is overhead; some time ago, I did a series of tests using ApacheBench, trying to establish just how big the performance penalty is.

      In purely local tests (i.e. ignoring network overhead) performance dropped 40% with the introduction of memcached. Over a rather slow 10 Mbps LAN, the performance degradation was only 10%. Note that memcached was still local to the server in the second series of tests -- only the request from ApacheBench went over the LAN.

      Much of the memcached overhead was due to marshaling; in the non-memcached version, all objects lived entirely in memory. Increasing the number of objects in memcached 10 times resulted in a massive 70% performance drop for the LAN-based tests.

      So, if you cache few objects (or plain strings), and there is little communication between the machines in your server farm, memcached performance will be close to pure in-memory performance. On the other hand, if there is lots of local I/O to handle a request, and you maintain a complex set of objects in memcached, you will take quite a hit.
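
      To show where the marshaling cost comes from, here is a small sketch that uses `pickle` as a stand-in for the serialization step. It does not use a real memcached client; both cache classes are made up for illustration and run in a single process.

```python
import pickle


class LocalCache:
    """Keeps references to live objects; a hit costs one dict lookup."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store[key]


class MarshalingCache:
    """Stand-in for an out-of-process cache: values cross over as bytes."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = pickle.dumps(value)   # serialize on every write

    def get(self, key):
        return pickle.loads(self._store[key])    # rebuild the object on every read


session = {"user": "alice", "cart": [1, 2, 3]}
local, remote = LocalCache(), MarshalingCache()
local.set("s", session)
remote.set("s", session)

same_object = local.get("s") is session          # no copy was made
rebuilt_copy = remote.get("s")                   # equal, but freshly constructed
```

      The more complex the object graph, the more work each `get` has to do, which matches the drop seen when the number of cached objects went up.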

      The above is not the fault of memcached -- it is just that when designing distributed systems, you are actually trying to reduce the amount of communication inside the cluster. This is similar to multi-threaded design, where you must try to reduce the number of threads and especially their interaction with each other (I covered this in one of my articles for O'Reilly).

      In my system (FlightFeather) I do maintain quite a lot of in-memory state to improve performance. For example, a special subclass of the float type (this is in Python) helps create a fast session cache. In addition, the operating system helps when you are using files directly, by maintaining a cache on its own. The most important thing, however, is to try to generate static content for frequently accessed material. The authors of memcached already know that static content is "boring, easy" (PDF; see page 30). If you want reliability and performance, boring and easy is where you want to go :-)

      --
      Bad Boss? Discuss it on bosstats.com
  26. Taken seriously by matria · · Score: 3, Funny

    Almost as bad as trying to take seriously someone who dosn't know his it's from his its, right?

    1. Re:Taken seriously by Anonymous Coward · · Score: 0

      considering he only got 1 out of 3 wrong I would say that puts him head and shoulders above 90% of /. poster's.

    2. Re:Taken seriously by Anonymous Coward · · Score: 1, Funny

      poster's

      Hilarious.

  27. in other news by timmarhy · · Score: 0, Redundant

    boats perform better on water than cars, stones don't work well as parachutes and fire warms your house better than blocks of ice. how long did it take this genius to come up with this?

    --
    If you mod me down, I will become more powerful than you can imagine....
  28. Death to Trees! by Tablizer · · Score: 2, Interesting

    Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems

    Very specialized? Please explain. Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both. They were okay in 1984 when you only had a few hundred files. But they don't scale. Category philosophers have known since before computers that hierarchy taxonomies were limited.

    The problem is that the best alternative, set-based file systems, have a longer learning curve than trees. People pick up hierarchies pretty fast, but sets take longer to click. Power does not always come easy. I hope that geeks start using set-oriented file systems and then others catch up. The thing is that set-oriented file systems are enough like relational that one might as well use relational. If only the RDBMS were performance-tuned for file-like uses (with some special interfaces added).

    1. Re:Death to Trees! by suv4x4 · · Score: 2, Insightful

      Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both.

      You know, I've seen my share of RDBMS designs to know the "messiness" is not the fault of the file systems (or databases in that regard).

      Sets have more issues than you describe, and you know very well Vista had lots of set based features that were later downscaled, hidden and reduced, not because WinFS was dropped (because the sets in Vista don't use WinFS, they work with indexing too), but because it was terribly confusing to the users.

    2. Re:Death to Trees! by Tablizer · · Score: 1

      You know, I've seen my share of RDBMS designs to know the "messiness" is not the fault of the file systems (or databases in that regard).

      It is true that sloppy people will mess ANYthing up. However, trees cannot be fixed past a certain point. They create contradictions or duplications. I've never seen a messy schema that was not fixable. That is the difference: fixability. (Whether you get approval for such is another matter.)

      Sets have more issues than you describe,

      Example? Not more than trees I bet.

      was dropped ... because it was terribly confusing to the users.

      That I already agreed to. Sets take time and perhaps training to get used to. BUT they are more flexible once you take the leap, once you get over the learning hump. At least I wish my personal files could have such an engine. Easy to learn and easy to use (in the longer run) sometimes conflict. Windows-vs-Unix is sometimes suggested as such an example.

  29. Write-only languages by mysticgoat · · Score: 4, Insightful

    As any English teacher will tell you, any language that will support great poetry and prose will also make it possible to write the most gawdawful cr*p. Perl bestows great powers, but the perl user must temper his cleverness with wisdom if he is to truly master his craft.

    However in this specific case Google reveals that

    ## The Y Combinator
    sub Y (&) {
      my $le=shift;
      return &{
        sub {
          &{sub { my $f=shift; &$f($f) } }
          (sub { my $f=shift; &$le(sub { &{&$f($f)}(@_) }) });
        }
      }
    }
    was simply "borrowed" from y-combinator.pl. This is an instance of Perl being used in a self-referential manner to add a new capability (the Y combinator allows recursion of anonymous subroutines (why anyone would bother to do such an arcane thing comes back to the English teacher's remarks)). Self-referential statements are always difficult to understand because, well, they just are that way (including this one).
    1. Re:Write-only languages by nuzak · · Score: 1

      Sorry, I forgot to attribute it, and yes that's exactly where I got it, at which point I turned it into a one-liner. The sad thing is, it's about as straightforward as a Y combinator can get in perl, and it's all the damn sigils that add the noise.

      Perl6 looks like God's Own Language when it comes to the operations it has, but while they were changing the meaning of sigils in Perl6 (they now denote namespace and not context), I wish they took the next logical step and made the damn things optional. I've been a perl hacker for about 10 years, but I just can't stand leaning on the shift key anymore or having to close nested bracket sequences like ]})} ...

      --
      Done with slashdot, done with nerds, getting a life.
    2. Re:Write-only languages by dscruggs · · Score: 1

      > Self-referential statements are always difficult to understand

      I myself have no idea what you mean.

  30. Re:Was there ever a one-size-fits-all anything? by Fred_A · · Score: 3, Funny
    Languages, OSs, file systems, databases, microprocessors, cars, VCRs, diskdrives, pizzas, .... none of these are one-size-fits-all.

    There never has been, and probably never will be.
    Aren't most condoms sold in one size fits all ?

    Maybe they could make rubber databases ?

    (or it's a bit of a stretch)
    --

    May contain traces of nut.
    Made from the freshest electrons.
  31. One size still fits all by bytesex · · Score: 1

    It's just not called SQL driven RDBMS. It's called Sleepycat.

    --
    Religion is what happens when nature strikes and groupthink goes wrong.
  32. Re:Was there ever a one-size-fits-all anything? by EmbeddedJanitor · · Score: 0, Flamebait

    Speak for yourself weener-boy!

    --
    Engineering is the art of compromise.
  33. The "science" of writing articles... by flajann · · Score: 1
    Often times articles are written with a certain level of naivete -- probably because those writing the articles typically aren't the heavy-hitting experts in the field, or perhaps they are writing for an audience that can't handle all the nuances of the seasoned expert.

    So, they will always come off sounding a bit hokey to us.

    Having said that, I will now say that we are entering into a "new age" -- gee, how often have you heard that hackneyed adage? -- of massive torrents of information with a damning need to manage it all.

    What I'm surprised not to hear about are hardware projects to manage specialized requirements for massive data streams. Couldn't someone come up with, for instance, a clever way to arrange a battery of Field-Programmable Logic Arrays to crunch data in real time for some really specialized application?

  34. Re:Was there ever a one-size-fits-all anything? by Anonymous Coward · · Score: 0

    Aren't most condoms sold in one size fits all ?

    Maybe they could make rubber databases ?


    Or maybe they should make rubber condoms?

  35. Specialization is faster but can be harmful by Terje+Mathisen · · Score: 1

    23 years ago I wrote a custom DB to maintain the status of millions of "universal" gift cards; it ran 3-5 orders of magnitude faster (on a 6 MHz IBM AT) than a commercial database running on a big IBM mainframe.

    I reduced the key operations (what is the value of this gift card, when was it sold, has it been redeemed previously? etc) to just one operation:

    Check and clear a single bit in a bitmap.

    My program used 1 second to update 10K semi-randomly-ordered (i.e. in the order we got them back from the shops that had accepted them) records in a database of approximately 10 M records.
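    The "check and clear a single bit" trick can be sketched roughly like this. This is only a guess at the shape of the idea, not the original program: the class name, the method names and the one-bit-per-card-number layout are all assumptions for illustration.

```python
class CardBitmap:
    """One bit per card number: 1 = issued and not yet redeemed."""

    def __init__(self, n_cards):
        self.n_cards = n_cards
        self._bits = bytearray((n_cards + 7) // 8)

    def _locate(self, card_no):
        if not 0 <= card_no < self.n_cards:
            raise IndexError("card number out of range")
        return card_no >> 3, 1 << (card_no & 7)  # byte index, bit mask

    def issue(self, card_no):
        byte, mask = self._locate(card_no)
        self._bits[byte] |= mask

    def redeem(self, card_no):
        """Check and clear: True only the first time a valid card comes back."""
        byte, mask = self._locate(card_no)
        if self._bits[byte] & mask:
            self._bits[byte] &= ~mask & 0xFF  # clear the bit, stay within a byte
            return True
        return False


cards = CardBitmap(10_000_000)  # about 1.25 MB for 10 M cards
cards.issue(1234567)
```

    Redeeming card 1234567 succeeds once and fails on any later attempt, and each lookup touches exactly one byte regardless of the order the cards come back in.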

    20 years later I wrote a totally new version of the same application, but this time the gift cards are electronic debit cards. This time I used Linux-Apache-MySQL-Perl to make a browser-based version, and I stored everything in the DB. Today that is plenty fast enough, and it allows us to make any kind of queries against the DB, like "How many transactions of less than 100 kr were accepted in December, broken down by business area/chain/shop/etc"

    Terje

    --
    "almost all programming can be viewed as an exercise in caching"
  36. Creative Commons License by pfafrich · · Score: 2, Interesting

    Has anyone noticed the "This article is published under a Creative Commons License Agreement" notice? It's the first time I've seen this applied to an academic paper. Another small step for the open-content movement.

    --
    There are four sorts of people in the world: fools, lunatics, idiots and morons. - Umberto Eco, Foucault's Pendulum.
  37. Re:Was there ever a one-size-fits-all anything? by rufty_tufty · · Score: 2, Funny

    Which reminds me of the Robin Williams joke

    "They came in 3 sizes, extra large, large and white man"

    --
    "The weirdest thing about a mind, is that every answer that you find, is the basis of a brand new cliche" -
  38. Where's part 1? by FlopEJoe · · Score: 1

    This is titled "OSFA? - Part 2: Benchmarking Results." Has anyone found Part 1?

    1. Re:Where's part 1? by Peter+Mork · · Score: 1

      Line 1 of the abstract: "Two years ago, some of us wrote a paper ... 'One Size Fits All (OSFA)' [Sto05a]." Looking at the references indicates that the paper in question is:

      Stonebraker and Cetintemel. "One Size Fits All: An Idea Whose Time has Come and Gone." In: ICDE, 2005.

  39. Why imagine, just read ;-) by shis-ka-bob · · Score: 1

    There is an article, and it has many references. How is a 'Captain Obvious' sort of comment labeled Insightful? The insightful part is in the article. The first author, Michael Stonebraker, architected Ingres and Postgres. He looked at OLAP databases, which is a market that is much larger than a special case. He proposed storing the data in columns rather than in rows. He tested this, and it works. In fact it works so well that he can clobber a $300,000 server cluster with an $800 PC. I know that I would be pretty happy to spend a year porting to his database if I could pocket half of that annual hardware cost savings. The savings in electricity would be enough to pay for several pretty serious Starbucks addictions. His key insight seems to be that he can vastly improve OLAP performance by storing the data in columns rather than in rows. This change could be quite transparent to the end users & developers, except for the massive speed-up and cost savings, of course. This paper describes a general solution for a common problem. Stonebraker has developed Vertica, which still supports ad-hoc queries in SQL. This seems like a pretty general purpose solution for OLAP.
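    For anyone who hasn't seen the row-versus-column idea spelled out, a toy sketch follows. It says nothing about how Vertica or any real column store is implemented; the table, the field names and the helper functions are invented purely to show the two layouts.

```python
# Row layout: each record is stored together, so an aggregate over one
# field still has to walk every whole record.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.5},
    {"order_id": 3, "region": "EU", "amount": 42.25},
]


def to_columns(records):
    """Pivot a list of records into one list per field (column layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(records):
    return sum(record["amount"] for record in records)


def total_from_columns(columns):
    # Only the "amount" column is touched; the other fields are never read.
    return sum(columns["amount"])


columns = to_columns(rows)
```

    Both functions give the same total, but the column version reads a single contiguous list, which is the access pattern typical OLAP queries (few columns, many rows) tend to benefit from.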

    --
    Think global, act loco
  40. Nah, gerbiles will never work... by Anonymous Coward · · Score: 0

    They have too much of a tendency to get stuck in loops.

    *ducks*.

  41. UPGRADES by KurtisKiesel · · Score: 1

    I think you have to take into account how securely the specific-use databases are going to be maintained. Who is going to write updates for security flaws in a specific-use database and support new technologies? Might it be safe to say we'll see companies forget or drop a specific-use database in favor of a more general platform? I'd rather use a database that might be a little slower if I know that I can get regular updates to keep it current with my ever patching / updating / and re-securing operating systems. How often do we see proprietary hardware stop working because M$ decided to make a new service pack? What happens when the latest version of your Debian or Red Hat install won't run your widget-PostMySQL super designed build because the new install no longer supports an older kernel?

    1. Re:UPGRADES by Dekortage · · Score: 1

      You make excellent points. That's why we have things called "planning" and "weighing your options".

      Admittedly, many people do not do this very well, which has led to many of humanity's problems throughout history. Database selection and design are just items #92838701283743^199320 and #92838701283743^199320+1 on the list of things people ought to have thought about more over the last few million years.

      --
      $nice = $webHosting + $domainNames + $sslCerts
  42. MUMPS by dpbsmith · · Score: 1

    This is, of course, what MUMPS advocates have been saying for years.

    MUMPS is a very peculiar language that is very "politically incorrect" in terms of current language fashion. Its development has been entirely governed by pragmatic real-world requirements. It is one of the purest examples of an "application programming" language. It gets no respect from academics or theoreticians.

    Its biggest strength is its built-in "globals," which are multidimensional sparse arrays. These arrays and the elements in them are automatically created simply by referring to them. The array indices are arbitrary strings. There can be an arbitrary number of subscripts and the same array can have elements with different numbers of subscripts. Oh, and they're always sorted automatically; each element is created automatically in its proper sequence, and there are fundamental operators for traversing arrays in sequence.
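    For readers who have never touched MUMPS, here is a rough emulation of the behavior just described. It is only a sketch: `SparseGlobal` and its methods are invented names, persistence to disk is left out, and plain string ordering is used instead of MUMPS's own collation rules. `next_subscript` is loosely modeled on the idea of stepping through subscripts in sequence.

```python
import bisect


class SparseGlobal:
    """Sparse array keyed by tuples of string subscripts, kept in sorted order."""

    def __init__(self):
        self._values = {}
        self._keys = []  # sorted list of subscript tuples

    def set(self, subscripts, value):
        key = tuple(subscripts)
        if key not in self._values:  # node springs into existence on first use
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, subscripts, default=None):
        return self._values.get(tuple(subscripts), default)

    def next_subscript(self, prefix, after=""):
        """Next subscript after `after` directly below `prefix`, or None."""
        prefix = tuple(prefix)
        depth = len(prefix)
        for key in self._keys:  # linear scan, fine for a sketch
            if len(key) > depth and key[:depth] == prefix and key[depth] > after:
                return key[depth]
        return None


g = SparseGlobal()
g.set(("patient", "1002", "name"), "Smith")
g.set(("patient", "1001", "name"), "Jones")
g.set(("patient", "1001"), "record header")  # fewer subscripts, same array
```

    Walking `next_subscript(("patient",))` yields "1001", then "1002", then None, regardless of insertion order.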

    "Global" arrays are persistent across sessions, are stored on the disk, and in ordinary practice can be hundreds of megabytes in size.

    Before you say "this can all be done simply by writing a C++ class," I have to mention the important point, which is that the use of the globals is so intrinsic to the ordinary way MUMPS is really used in practice, that successful implementations of MUMPS must and in practice do make the implementation of globals efficient.

    You really can just use "globals" all the time for everything. They work well enough that you don't need to reserve their use for when they're really needed. They're not a luxury. MUMPS programmers rarely use files, except for interchange in and out of the MUMPS universe. Within MUMPS, data is simply kept in globals; it's just the MUMPS way.

    "Globals" are extremely flexible and lend themselves naturally to representations of real-world databases. These representations are typically one-off, ad-hoc representations designed by the programmer, who needs to make up-front decisions about the hierarchical organization in which the data will be stored, and writes special-purpose code to perform the accesses. Naturally, this sounds like the dark ages compared to relational technology, but there is an impressive tradeoff. If MUMPS fits the application, development times are short, and performance is dramatically better than for relational systems.

    Whether or not this is important in the year 2006, it was very clear a decade ago when medium-scale database applications were typically hosted on minicomputers, that the same hardware resources could support several times as many users running a MUMPS application as a similar application implemented with a relational database, as various organizations found when they converted... in either direction.

    Of course relational systems can be and are implemented on top of MUMPS.

    MUMPS underlies InterSystems' Cache product, and a MUMPS-like language with historical connections to MUMPS underlies the products of Meditech. I'm not sure what the current status of Pick is, but it has some similarities. The company I currently work for has nothing whatsoever to do with either system... except that our business IT system happens to be Pick-based.

    Regardless of what you think of MUMPS itself, there are almost certainly lessons to be learned from the durability of this language and its effectiveness.

  43. Re:Stonebreaker has a vested interested in Stream by Omni-Cognate · · Score: 1

    Like many academics, he has founded a company. It hardly invalidates his research. They aren't trying to hide it either - one of the other authors is contributing in his capacity as a StreamBase employee, as shown right at the top of the paper.

    Also, since nobody else has said anything about his neutrality or otherwise, you shouldn't put the word "neutral" in quotes like that. It makes you look like you are trying to set up a straw man. Neutrality is not even particularly important in a researcher. You'd have to go a long way to find one who didn't want his own particular theory to prevail, whatever the reason. Peer review of the content is the criterion on which papers are judged.

    --

    "The Milliard Gargantubrain? A mere abacus - mention it not."

  44. New hat same as the old hat by Bacon+Bits · · Score: 1

    The same argument is what gave rise to re-programmable generic processing components: CPUs. You'll note that the processor industry today (AMD in particular) is now also moving towards this kind of diversification. Gaming systems have been using dedicated GPUs for ages (today they're more powerful than entire PCs from 5 years ago) and I'm sure we remember back when math co-processors (i387) were introduced. You'll note that math co-processors were just absorbed back into the generic model.

    It's another pendulum in the computing world (much like the serial/parallel dichotomy). Moving from a disparate number of diverse systems to a small number of all-purpose systems. The advances are always for performance, and they typically happen when the current generation plateaus. We've mastered the concepts of one generation, time to explore new concepts (by re-exploring old concepts).

    In 10 or 15 years people will be complaining about the difficulty of data portability, the esoteric nature of these unique data files, and the lack of features in area X in one product and area Y in a second product, and the archaic languages you have to use on these old, unsupported systems. There will be a move back to generic storage engines, bringing with it the lessons learned from that round of insight.

    Of course, there will always be demands for specialized components just as there will always be demand for generic, standard components. It's the centrists whose demands are for the best combination of performance and features that determine popularity.

    --
    The road to tyranny has always been paved with claims of necessity.
  45. *BAD* example - SABRE is the HP-MySQL case study. by Anonymous Coward · · Score: 0

    Good point, but bad example. Sabre migrated to a generic database (with MySQL and HP as significant contractors on the project).

    http://h71028.www7.hp.com/enterprise/downloads/Sabre-HP-MySQL-case-study.pdf

    But your point is well taken, and practically all the large financial institutions (Fidelity, UBS, Merrill Lynch, etc) use KDB from kx.com; also APL derived like I think the old Sabre system was.

  46. see also KDB used by most financial institutions by Anonymous Coward · · Score: 0

    See also KDB from kx.com.
    It's the database used by most large financial institutions (Fidelity, Merrill Lynch, UBS, etc); and stores all its variables on the global K stack.

    For the types of database work stock traders do, it's orders of magnitude faster than the Oracles, etc of the world.

    http://kx.com/
    http://www.kuro5hin.org/?op=displaystory;sid=2002/11/14/22741/791

  47. SQL is part of the problem. by sonofagunn · · Score: 1

    Databases already have the ability to change storage engines as long as they support SQL. The reason my company shuns the database for many specific tasks is that SQL is ill-suited to perform many types of transformations, calculations, and aggregations on data. What may take many pages of SQL (and many temp tables) in a stored proc can be written in a simple Java class and will perform much better, as well as being easier to maintain. A lot of our processing goes like this: raw data from database (simple select queries, which are very fast) -> flat files -> custom Java code -> reporting engine or another database. The speedup over using stored procs or SQL based ETL tools ranges between a factor of 10 and a factor of 100. MDX is a better language than SQL for a lot of purposes, but not all.

    1. Re:SQL is part of the problem. by Doc+Ruby · · Score: 1

      I tend to agree. But how does putting your Java into stored procedures compare to the long/inefficient SQL? And is the Java stored procedure slowdown due to a worse JVM, more demand on the same host for both DB and JVM, or some other architectural bottleneck?

      --
      make install -not war

    2. Re:SQL is part of the problem. by sonofagunn · · Score: 1

      We don't have Java inside stored procedures.

      What we're doing is extracting data from the warehouse and then using custom Java code to process that data. We're finding this to be much faster and easier than any SQL based approach (such as stored procedures and ETL tools).
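      The "extract to a flat file, then process in ordinary code" step might look roughly like the sketch below. It is written in Python rather than Java only to keep it short, and the column names, the sample extract and the `totals_by_key` helper are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def totals_by_key(lines, key_field, value_field):
    """Single streaming pass over a CSV extract, summing one field per key."""
    totals = defaultdict(float)
    for record in csv.DictReader(lines):
        totals[record[key_field]] += float(record[value_field])
    return dict(totals)


# Stand-in for a flat file dumped from the warehouse with a simple SELECT.
extract = io.StringIO("region,amount\nEU,10.5\nUS,4.0\nEU,2.5\n")
totals = totals_by_key(extract, "region", "amount")
```

      The whole aggregation is one pass with one small dictionary in memory, with no temp tables involved.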

  48. Wrong mod by Anonymous Coward · · Score: 0

    Kinda sad that this was modded Funny rather than Insightful.

  49. Heirarchies can be done. by Static · · Score: 1

    Many many many people get stuck on the fancy things the DB writers tell you SQL can do rather than thinking about the data differently. A truly classic mistake is wetting yourself over sub-queries when you're still using parent-IDs to do hierarchies.

    Feh.

    Parent IDs are a bad idea if you want to be able to look at multiple levels of the tree at once, especially the trees you get in threaded comments. http://www.dbazine.com/oracle/or-articles/tropashko4 shows how to do trees in SQL in a way that avoids the problem. Doing it that way removes most of the need to produce recursive queries to get all the levels. Unfortunately, *most* forum software *doesn't* do it this way because such advanced data manipulation doesn't occur to them.
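    To show the general flavor of "encode the position in the tree instead of a parent ID", here is a small sketch using a materialized path. This is a simpler, commonly used encoding and not necessarily the one the linked article describes. The table name, the zero-padded path format and the `subtree` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment (id INTEGER PRIMARY KEY, path TEXT NOT NULL, body TEXT)"
)
conn.executemany(
    "INSERT INTO comment (id, path, body) VALUES (?, ?, ?)",
    [
        (1, "0001", "top-level comment"),
        (2, "0001.0001", "first reply"),
        (3, "0001.0001.0001", "reply to the reply"),
        (4, "0001.0002", "second reply"),
        (5, "0002", "another thread"),
    ],
)


def subtree(connection, path):
    """Fetch a comment and all its descendants, in display order, in one query."""
    cursor = connection.execute(
        "SELECT id, path, body FROM comment "
        "WHERE path = ? OR path LIKE ? ORDER BY path",
        (path, path + ".%"),
    )
    return cursor.fetchall()


thread = subtree(conn, "0001")
depths = [row[1].count(".") for row in thread]  # nesting level per comment
```

    One non-recursive query returns the whole thread already sorted for display, and the nesting depth falls out of the path itself.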

    1. Re:Heirarchies can be done. by AKAImBatman · · Score: 1

      That solution violates the KISS principle. It is highly complex, non-obvious, non-portable (as in tied to some very expensive dbmses; I don't think MySQL 4.0 could pull it off so easily), and worst of all, unprotected against accidental manipulation. It also wouldn't surprise me if the method became unmaintainable over long maintenance periods as team members working on the codebase shifted around.

      Creating a specialized datastore is more direct, more obvious, solves ALL the problems (not just one), is maintainable, and can be designed to protect against invalid data. It may be more costly in the short term, but in the long term it will save a great deal of time and energy, in addition to producing a less complex codebase.