Refactoring SQL Applications

← Back to Stories (view on slashdot.org)

Posted by samzenpus on Wednesday March 11, 2009 @06:04AM from the read-all-about-it dept.

stoolpigeon writes "My past as a DBA probably makes me a bit biased, but the reaction I've seen the most when a database application isn't performing as well as would be liked seems to focus on the database side of things. The search for a solution usually seems to center around tuning db parameters, the building (or removal) of indexes and, if the budget allows, throwing hardware at the problem. In their new work, Refactoring SQL Applications, Faroult and L'Hermite bring a much wider range of options to the table. There is a lot in this little book for the developer charged with fixing an existing application and I think a lot of good information that could save one from making a number of headache-inducing mistakes on a new application." Keep reading for the rest of JR's review. Refactoring SQL Applications author Stephane Faroult with Pascal L'Hermite pages 293 publisher O'Reilly Media, Inc. rating 9/10 reviewer JR Peck ISBN 978-0-596-51497-6 summary Good for a developer charged with fixing an existing application. The book is divided into eight chapters; the first two deal with how to approach a problematic application in general. In the preface the authors say, "This book tries to take a realistic and honest view of the improvement of applications with a strong SQL component, and to define a rational framework for tactical maneuvers." I found this to be true throughout the entire book and was impressed by how well the examples, suggestions and problems echoed my real-life experience. This book is first and foremost practical. There is really almost nothing in the book that does not come immediately to bear upon the problem at hand. I've seen others do a lot less with many more pages.

The examples and benchmarks are compared across three different popular relational database management systems. They are MySQL, Oracle RDBMS and Microsoft SQL Server. I thought that this brought up a couple interesting issues that are not directly addressed in the book. First is that the authors are talking about how to improve performance, not comparing platforms, but the numbers are there and may be of some interest to people who would like to compare them. Secondly, I've met a number of people over the years who get quite animated about insisting that a good DBA does not need to know any certain solution, but rather just the fundamentals. I think Faroult and L'Hermite put this idea to rest, though unintentionally. In order to discuss how to best understand what exactly is happening and how best remedy issues, they show that it is necessary to have an understanding of platform specific issues and tools. This is true on two levels. The first is that the location of use of the built in tools for each platform are different. The second is that what works for one platform does not necessarily work for another.

For example, Chapter Two "Sanity Checks" contains a section on parsing and bind variables. The authors compare performance when queries are hard coded, with new prepared statements on each iteration (firm coded) and with one prepared statement and changing the parameter value on each iteration in a loop (soft coded). On Oracle and SQL Server the performance was poorest with hard coded, better with firm coded and best with soft coded. MySQL did best with soft coded as well but actually took a performance hit moving from hard coded to firm coded. This had to do with differences in how MySQL server caches statements. The authors took the time to rewrite their code from java to C in order to ensure that the issue was not related to language or driver issues. This is not to say that one can ignore RDBMS and SQL fundamentals, but rather that to get top performance requires knowledge of platform specific issues. This also comes out again when dealing with optimizers.

With that in mind, the authors recommend that readers have a solid understanding of SQL and some programming language. Most examples are SQL and code is given in Java and PHP. There are also examples that illustrate SQL extensions showing procedures, functions, etc. written for all three RDBMS products covered. The authors stick primarily to standard SQL but do make note and at times show examples of how things will look in each of the other databases. This information is current and reflects the most recent versions of the each product.

The fourth chapter, "Testing Framework" is incredibly useful. The authors cover generating test data and then checking correctness of outcomes through comparison. This is really useful information for anyone working to improve an application, or writing one for the first time. I think it also a large part of why this book could really appeal to new and experienced developers as well as the developer working on existing or brand new applications. I think there is a good chance that only the most extremely experienced developer would find nothing new here, or at least some new way to approach a problem. New developers can learn quite a bit and avoid some bad habits and assumptions without having to gain that information the hard way. And then the tools for generating random data, large amounts of data and comparing results will provide excellent opportunities for learning and real world application.

The next three chapters cover dealing with specific types of issues and how to improve performance. The last chapter then quickly describes a scenario of just how the authors step into real world situations and start to attack a problem. This is followed with two appendices. The first is scripts and samples, the second tools that are available to help in finding issues and resolving them. Some of the authors tools use SQLite, which is discussed briefly in the chapter on creating test data as some of the tools depend upon it.

I think that it has been a while since I've read a book that could have such a rapid return on investment. There are many suggestions and insights that should enable anyone to squeeze better performance out of just about any database application. While the focus is on the application side, there is plenty that requires understanding and work on the database side as well. There is discussion of the parameters and hardware I mentioned at the start of this review. But rather than the only options, they are one part in a much larger and systematic approach.

The authors relate that often refactoring for this type of application comes into play when something that used to work does not work any more. This can often lead to an environment of high pressure and emotion. The desire for a rapid resolution can lead to casting about in the dark for a quick fix or a feeling that cost is no longer as significant since a fix must be had now. The authors argue, and I agree, that this is exactly when a rational, disciplined process of tracking down and fixing issues is the most valuable. I agree. The issue is of course that someone in a position to do something must have the ability to take that approach. This book will get one well on the way to being in that place. Of course it can't take a brand new developer or DBA an expert. Much like a degree it can give them some fundamental tools that will allow them to take full advantage of experience as it comes rather than just crashing and burning.

If I could I'd have any developer on a database centric application read this, and DBAs as well. There is a lot here for both sides to learn about just how much they depend upon and impact one another. This may be an idealistic dream, especially for larger shops where often the relationship between those two groups is adversarial, but I think that such an approach could only make life much better for everyone involved. For anyone looking to enter this world on either side of the DBA or developer equation, this may make a nice addition to their education. For that individual wearing both hats this could be a life saver. In this small book they will learn many things to look out for as well as gain exposure to some of the similarities and differences in what are arguably the top three relational database management systems right now.

You can purchase Refactoring SQL Applications from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

9 of 159 comments (clear)

Min score:

Reason:

Sort:

Re:Shoot the developers by Bright+Apollo · 2009-03-11 06:41 · Score: 2, Informative

I would cautiously agree with the developers being less educated on SQL than they should be. The trouble seems to be in the different mind-sets required to solve the application problems. SQL is a declarative language, and it operates on your set of data as described. The host language, for what of a better term, is C or Java or some other iterative language, and it operates on each individual member of the set, stepwise. If you primarily think "stepwise" or algorithmically, you're already framing your problems as challenges to be met with an iterative approach. Learning to recognize which parts of the application require declarative, functional, or procedural solutions is where good developers becomes great.
I disagree with any statement that pits developers against DBAs. I have found a few DBAs that have more ability than myself in SQL, but not many and not recently. I don't rely on them to work with my data, and then don't rely on me to manage four hundred schemas across two dozen machines. A good separation of duties always helps.
-BA
Re:compare SQL to Code by Saint+Stephen · 2009-03-11 06:42 · Score: 4, Informative

On a REAL database, like Oracle, the query optimizer will factor common expressions, eliminate unused branches, and in general execute your SQL in completely different manner than what you write.
Doing things in a "relational calculus" way, where you specify what to be done (i.e., with SQL) is superior to doing things in a "relational algebra" way (individual statements correlated by procedure code).
I've written some queries that were a dozen pages long for a individual statement, mostly because I use a python-like style where the indentation specifies the structure and thus you can string together monstrous subexpressions and not get confused. The DBA was like "you're not running that on MY box," but it ran super fast because of the query optimizer.
That's what I mean when I say MySql is a Toy, compared to DB/2, Oracle, or SQL Server. The query optimizer.
Much better ways to do complex sql by avandesande · 2009-03-11 06:58 · Score: 2, Informative

I have developed a design pattern using in memory data table objects that can satisfy the most complex requirements without using any dynamic code.
It also allows the queries and business logic to be broken into discreet chunks that are easily optimized and debugged.

--
love is just extroverted narcissism
Re:compare SQL to Code by stoolpigeon · 2009-03-11 07:18 · Score: 3, Informative

Chapter 5 - "Statement Refactoring" includes, according to the author, "...how to analyze SQL statements so as to turn the optimizer into your friend, not your foe." It's solid and probably points people towards writing things that work just as you describe.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:Performance Tuning is Not Refactoring by stoolpigeon · 2009-03-11 07:24 · Score: 2, Informative

Let me explain what I meant by "doesn't work any more". For example a query that originally took 30 seconds now take 3 hours. It still 'works' from a functional perspective but from a business perspective may have become completely useless. Refactoring can make it work again. I should have been more clear on what I meant there.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:Should have included PostgreSQL and DB2 by Splab · 2009-03-11 08:08 · Score: 4, Informative

GP is right, MySQL is a toy database, advanced toy but still a toy database.
The absolutely most important thing for a database is data integrity, the ability to trust in your system - when it says "Yeah I saved that for you", it should take catastrophic events to lose it again.
MySQL treats data in a best effort way, if what you asked it to do doesn't fly with current config, it reverts to something that looks right enough and go with that.
Consider a database setup, admin installs MySQL with default creates some tables, runs it for a while, decides he needs more log space (transaction), he adjusts the settings and restart MySQL. It starts, everything is peachy. Transactions are running, being committed, he adds more tables, and then suddenly shit hits the fan, he does a rollback, MySQL says ok, but lo and behold, the data is still there...
So what went wrong? When he changed the transaction log size MySQL during start up realized an inconsistency between the actual log file size and the wanted, MySQL can't expand this file on the fly so InnoDB is disabled, MySQL now reverts to MyISAM (I am not kidding, this is what MySQL will do). Any subsequent calls to begin and commit transaction will be accepted with an OK. Any tables created afterward will be accepted, even with explicit engine syntax MySQL will just issue query ok, 1 warning.
Now the warning will tell you that the InnoDB engine wasn't available, so MySQL chose MyISAM instead - however, most aren't aware of this behavior, especially since most programming languages does not support this.
A database should at no point _ever!_ say "OK" to a request for something that can't be handled. If I say begin transaction and something isn't right I want my database to shout on top of its binary lungs that something is wrong and my data isn't safe.
Re:Use views by bjourne · 2009-03-11 08:26 · Score: 2, Informative

That's silly. When you change the data model you must change the views too. Then you could as well have changed how the application uses the database instead and avoid a whole layer of indirection. Plus, views are read-only so the client application still needs direct access to the tables to update data. Views are useful and very under appreciated, but not in the way you suggest.

--
Football Odds
I Only Know Oracle by bloobamator · 2009-03-11 14:59 · Score: 5, Informative

I only know Oracle but I've known it since version 5.0. Intimately. I haven't read the book but I read the review. Here are a few tips I've learned over the decades that you might find useful, just in case they aren't covered in the book:

1) You have to establish a general rule of thumb for each production db whereby any one sql that consumes more than x% of the db resources needs to be tuned. The value of x varies from db to db. If it cannot be tuned below x% then it needs to be refactored.
2) Learn to use stored outlines. If you can get them to work they will save your ass and make you look like a total hero.
3) Never turn your back on the optimizer. Really. Even for simple queries, even with the deepest stats.
4) Bind variables are a necessity for high-repetition sql. Bind variables are something you might want to avoid for reports queries for which the optimal plans depend on the user input values. This is because a sql's plan is cached along with it the first time it is parsed, and if you use bind variables then the first plan you get is the plan you will always get so long as the sql remains in the shared pool.
(You can sometimes work around this issue by turning off bind variable peeking, but consider doing it on a per-session basis instead of changing it system-wide. Scary!)
5) Nowadays a 32GB SGA is no big thing. Get yourselves a ton o' RAM and set up a keep pool in the buffer cache to pin your most important reporting indexes and tables. Partition your big reporting tables and maintain a sliding window of the most recent partition(s) in the keep pool.
6) No sorting to-disk. Ever. If you cannot let the session have the PGA it needs to sort the query in memory then the SQL needs to be "refactored".
7) Once you have eliminated most of the physical disk reads it then becomes all about the buffer gets (BG's). When disk reads are low the high-logical-BG queries immediately become the new top SQL. This is because logical BG's are all CPU and your db is now cpu-bound, which is right where you want it. So from this point it's back to item #1 and we prune and tune (thanks KG!)

I could go on all day. Perhaps I should write a book?

--
"Crude and slow, clansman. Your attack was no better than that of a clumsy child."
Re:Confusing volume with data integrity by berend+botje · 2009-03-11 19:46 · Score: 2, Informative

Well, if parent is using fuzzy full-text searches he's well better off with Oracle.

Even non-fuzzy full-text searches on Postgresql are a pain. Yes, they do work great, but the syntax is an abomination.

I loathe Oracle as much as the next guy, but even MySQL does a better job at fuzzy string matching! Really.