Refactoring SQL Applications
stoolpigeon writes "My past as a DBA probably makes me a bit biased, but the reaction I've seen the most when a database application isn't performing as well as would be liked seems to focus on the database side of things. The search for a solution usually seems to center around tuning db parameters, the building (or removal) of indexes and, if the budget allows, throwing hardware at the problem. In their new work, Refactoring SQL Applications, Faroult and L'Hermite bring a much wider range of options to the table. There is a lot in this little book for the developer charged with fixing an existing application and I think a lot of good information that could save one from making a number of headache-inducing mistakes on a new application." Keep reading for the rest of JR's review.
Refactoring SQL Applications
author
Stephane Faroult with Pascal L'Hermite
pages
293
publisher
O'Reilly Media, Inc.
rating
9/10
reviewer
JR Peck
ISBN
978-0-596-51497-6
summary
Good for a developer charged with fixing an existing application.
The book is divided into eight chapters; the first two deal with how to approach a problematic application in general. In the preface the authors say, "This book tries to take a realistic and honest view of the improvement of applications with a strong SQL component, and to define a rational framework for tactical maneuvers." I found this to be true throughout the entire book and was impressed by how well the examples, suggestions and problems echoed my real-life experience. This book is first and foremost practical. There is really almost nothing in the book that does not come immediately to bear upon the problem at hand. I've seen others do a lot less with many more pages.
The examples and benchmarks are compared across three different popular relational database management systems. They are MySQL, Oracle RDBMS and Microsoft SQL Server. I thought that this brought up a couple interesting issues that are not directly addressed in the book. First is that the authors are talking about how to improve performance, not comparing platforms, but the numbers are there and may be of some interest to people who would like to compare them. Secondly, I've met a number of people over the years who get quite animated about insisting that a good DBA does not need to know any certain solution, but rather just the fundamentals. I think Faroult and L'Hermite put this idea to rest, though unintentionally. In order to discuss how to best understand what exactly is happening and how best remedy issues, they show that it is necessary to have an understanding of platform specific issues and tools. This is true on two levels. The first is that the location of use of the built in tools for each platform are different. The second is that what works for one platform does not necessarily work for another.
For example, Chapter Two "Sanity Checks" contains a section on parsing and bind variables. The authors compare performance when queries are hard coded, with new prepared statements on each iteration (firm coded) and with one prepared statement and changing the parameter value on each iteration in a loop (soft coded). On Oracle and SQL Server the performance was poorest with hard coded, better with firm coded and best with soft coded. MySQL did best with soft coded as well but actually took a performance hit moving from hard coded to firm coded. This had to do with differences in how MySQL server caches statements. The authors took the time to rewrite their code from java to C in order to ensure that the issue was not related to language or driver issues. This is not to say that one can ignore RDBMS and SQL fundamentals, but rather that to get top performance requires knowledge of platform specific issues. This also comes out again when dealing with optimizers.
With that in mind, the authors recommend that readers have a solid understanding of SQL and some programming language. Most examples are SQL and code is given in Java and PHP. There are also examples that illustrate SQL extensions showing procedures, functions, etc. written for all three RDBMS products covered. The authors stick primarily to standard SQL but do make note and at times show examples of how things will look in each of the other databases. This information is current and reflects the most recent versions of the each product.
The fourth chapter, "Testing Framework" is incredibly useful. The authors cover generating test data and then checking correctness of outcomes through comparison. This is really useful information for anyone working to improve an application, or writing one for the first time. I think it also a large part of why this book could really appeal to new and experienced developers as well as the developer working on existing or brand new applications. I think there is a good chance that only the most extremely experienced developer would find nothing new here, or at least some new way to approach a problem. New developers can learn quite a bit and avoid some bad habits and assumptions without having to gain that information the hard way. And then the tools for generating random data, large amounts of data and comparing results will provide excellent opportunities for learning and real world application.
The next three chapters cover dealing with specific types of issues and how to improve performance. The last chapter then quickly describes a scenario of just how the authors step into real world situations and start to attack a problem. This is followed with two appendices. The first is scripts and samples, the second tools that are available to help in finding issues and resolving them. Some of the authors tools use SQLite, which is discussed briefly in the chapter on creating test data as some of the tools depend upon it.
I think that it has been a while since I've read a book that could have such a rapid return on investment. There are many suggestions and insights that should enable anyone to squeeze better performance out of just about any database application. While the focus is on the application side, there is plenty that requires understanding and work on the database side as well. There is discussion of the parameters and hardware I mentioned at the start of this review. But rather than the only options, they are one part in a much larger and systematic approach.
The authors relate that often refactoring for this type of application comes into play when something that used to work does not work any more. This can often lead to an environment of high pressure and emotion. The desire for a rapid resolution can lead to casting about in the dark for a quick fix or a feeling that cost is no longer as significant since a fix must be had now. The authors argue, and I agree, that this is exactly when a rational, disciplined process of tracking down and fixing issues is the most valuable. I agree. The issue is of course that someone in a position to do something must have the ability to take that approach. This book will get one well on the way to being in that place. Of course it can't take a brand new developer or DBA an expert. Much like a degree it can give them some fundamental tools that will allow them to take full advantage of experience as it comes rather than just crashing and burning.
If I could I'd have any developer on a database centric application read this, and DBAs as well. There is a lot here for both sides to learn about just how much they depend upon and impact one another. This may be an idealistic dream, especially for larger shops where often the relationship between those two groups is adversarial, but I think that such an approach could only make life much better for everyone involved. For anyone looking to enter this world on either side of the DBA or developer equation, this may make a nice addition to their education. For that individual wearing both hats this could be a life saver. In this small book they will learn many things to look out for as well as gain exposure to some of the similarities and differences in what are arguably the top three relational database management systems right now.
You can purchase Refactoring SQL Applications from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
The examples and benchmarks are compared across three different popular relational database management systems. They are MySQL, Oracle RDBMS and Microsoft SQL Server. I thought that this brought up a couple interesting issues that are not directly addressed in the book. First is that the authors are talking about how to improve performance, not comparing platforms, but the numbers are there and may be of some interest to people who would like to compare them. Secondly, I've met a number of people over the years who get quite animated about insisting that a good DBA does not need to know any certain solution, but rather just the fundamentals. I think Faroult and L'Hermite put this idea to rest, though unintentionally. In order to discuss how to best understand what exactly is happening and how best remedy issues, they show that it is necessary to have an understanding of platform specific issues and tools. This is true on two levels. The first is that the location of use of the built in tools for each platform are different. The second is that what works for one platform does not necessarily work for another.
For example, Chapter Two "Sanity Checks" contains a section on parsing and bind variables. The authors compare performance when queries are hard coded, with new prepared statements on each iteration (firm coded) and with one prepared statement and changing the parameter value on each iteration in a loop (soft coded). On Oracle and SQL Server the performance was poorest with hard coded, better with firm coded and best with soft coded. MySQL did best with soft coded as well but actually took a performance hit moving from hard coded to firm coded. This had to do with differences in how MySQL server caches statements. The authors took the time to rewrite their code from java to C in order to ensure that the issue was not related to language or driver issues. This is not to say that one can ignore RDBMS and SQL fundamentals, but rather that to get top performance requires knowledge of platform specific issues. This also comes out again when dealing with optimizers.
With that in mind, the authors recommend that readers have a solid understanding of SQL and some programming language. Most examples are SQL and code is given in Java and PHP. There are also examples that illustrate SQL extensions showing procedures, functions, etc. written for all three RDBMS products covered. The authors stick primarily to standard SQL but do make note and at times show examples of how things will look in each of the other databases. This information is current and reflects the most recent versions of the each product.
The fourth chapter, "Testing Framework" is incredibly useful. The authors cover generating test data and then checking correctness of outcomes through comparison. This is really useful information for anyone working to improve an application, or writing one for the first time. I think it also a large part of why this book could really appeal to new and experienced developers as well as the developer working on existing or brand new applications. I think there is a good chance that only the most extremely experienced developer would find nothing new here, or at least some new way to approach a problem. New developers can learn quite a bit and avoid some bad habits and assumptions without having to gain that information the hard way. And then the tools for generating random data, large amounts of data and comparing results will provide excellent opportunities for learning and real world application.
The next three chapters cover dealing with specific types of issues and how to improve performance. The last chapter then quickly describes a scenario of just how the authors step into real world situations and start to attack a problem. This is followed with two appendices. The first is scripts and samples, the second tools that are available to help in finding issues and resolving them. Some of the authors tools use SQLite, which is discussed briefly in the chapter on creating test data as some of the tools depend upon it.
I think that it has been a while since I've read a book that could have such a rapid return on investment. There are many suggestions and insights that should enable anyone to squeeze better performance out of just about any database application. While the focus is on the application side, there is plenty that requires understanding and work on the database side as well. There is discussion of the parameters and hardware I mentioned at the start of this review. But rather than the only options, they are one part in a much larger and systematic approach.
The authors relate that often refactoring for this type of application comes into play when something that used to work does not work any more. This can often lead to an environment of high pressure and emotion. The desire for a rapid resolution can lead to casting about in the dark for a quick fix or a feeling that cost is no longer as significant since a fix must be had now. The authors argue, and I agree, that this is exactly when a rational, disciplined process of tracking down and fixing issues is the most valuable. I agree. The issue is of course that someone in a position to do something must have the ability to take that approach. This book will get one well on the way to being in that place. Of course it can't take a brand new developer or DBA an expert. Much like a degree it can give them some fundamental tools that will allow them to take full advantage of experience as it comes rather than just crashing and burning.
If I could I'd have any developer on a database centric application read this, and DBAs as well. There is a lot here for both sides to learn about just how much they depend upon and impact one another. This may be an idealistic dream, especially for larger shops where often the relationship between those two groups is adversarial, but I think that such an approach could only make life much better for everyone involved. For anyone looking to enter this world on either side of the DBA or developer equation, this may make a nice addition to their education. For that individual wearing both hats this could be a life saver. In this small book they will learn many things to look out for as well as gain exposure to some of the similarities and differences in what are arguably the top three relational database management systems right now.
You can purchase Refactoring SQL Applications from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
This study would have carried more weight if it had included PostgreSQL and IBM's DB2. These two databases do more serious work than MySQL although many believe MySQL is more widely deployed.
But with management.
when I spent a few years as a DBA it was common to be told to not work on that project any more as soon as it produced usable data. That means as soon as you have a working prototype you are required to drop it and start the next project. Many times after you get a working prototype you then go back and refine it so that it's faster and uses less resources.
Management is the blame. Unrealistic deadlines for DBA's and if you are honest with them and give a report that you have data they think it's good to go. I actually got wrote up once for taking one of the old procedures we had and rewriting it so that it worked much faster and the resource hog it was was reduced to the point that others could use the DB while it ran. I was told I was wasting time.
Do not look at laser with remaining good eye.
I'd like to see some work done on the balancing act of how much to do in code and how much to do in SQL. My coworker can put SQL statements together that if printed on an 8.5x11 would fill the whole sheet if not run over. Me, on the other hand, I tend to break up huge sql statements into a set of smaller ones and then use code to do some of the work that could possible have been done in SQL. I don't have the time to find out what works best on my own but I do have the time to read about it.
btw, how come tech books don't come on tape/cd?
I came to the datacenter drunk with a fake ID, don't you want to be just like me?
I have the misfortune of working with a database that is primarily a couple of tables with key-value pairs (not a traditional database model).
There is only one column that can be indexed, and it has to be done with a full text index.
Every once in a while, there is a discussion about moving this mess to something more traditional. I was excited to read the review on this book, but as I read through the review, it seemed like this was more of a "performance tuning guide".
Re-factoring a database is a lot more involved - changing tables, stored procedures, maybe even the underlying database.
The term Database Application is fuzzy and poorly defined. Is it the front end? The stored procedures? The database tables? I would consider a database application to be any part of the code that stores, retrieves, removes or modifies data stored in a database, and the entities that have been defined to store that data.
Using that definition, this book is about tuning, not refactoring.
There's not Postgres in the book, so, it will be usefull for me?
I've found that the biggest issues with SQL applications (writing rich clients) is not in performance turning of server/sql but in dealing with ORM issues, where to draw the line between how much work the client does vs. how much the server does, reconciling changes made in memory with data in tables, concurrency, database architecture designed to cope in advance with poorly thought-out requirements you're given, etc. I'd hope that book on refactoring SQL *applications* would touch on these issues.
I am a developer, and my experience has shown that if you use one of: Oracle, SQLServer,
:-)
PostgreSQL, DB2 and application performance is poor, 99% of the time it is poor design from our
(developer's) side.
Developers without good understanding of Relational Databases and SQL often produce problems that
cannot be solved by indexes, or throwing transistors at them.
It is so nice to see a "custom" made map implemented in the database using temporary tables instead of
using the language's built-in map functionality
sorting arrays using the database gets extra points (no kidding, I have seen this!)
I think it's usually best to have views (whether with rows that are the result of code, or with a pure 'select' definition, or materialized ones) define what your application 'sees', so that you can always change the underlying datastructure. That way refactoring becomes a bit more easy.
Religion is what happens when nature strikes and groupthink goes wrong.
While they focus on MySQL in their examples, these are the same authors of "the Art of SQL", which tackles how to formulate application schema in a optimal way. It's actually a very good read - and if anyone had read and understood their first book, they would have no need for this second one, regardless of what RBDMS they are using. Also, mod parent Flamebait please - because that's all it is.
...is probably a good read. He has a lively writing style; it kind of reminds me of Bertrand Meyer's "Object Oriented Software Construction". Anyhow, I've got both this book and Faroult's "The Art of SQL"... both are excellent.
The Army reading list
It doesn't matter what language you use, the DB will always be the bottleneck in DB applications, duh!
This, of course, only works if the rest of the database setup is more or less ok ;-)
TCAP-Abort
"Faroult and L'Hermite bring a much wider range of options to the table."
Ah! I get it! "To the table", like SQL table! You funny!
I have developed a design pattern using in memory data table objects that can satisfy the most complex requirements without using any dynamic code.
It also allows the queries and business logic to be broken into discreet chunks that are easily optimized and debugged.
love is just extroverted narcissism
Comment removed based on user account deletion
Only on Slashdot do you find someone who wants to listen to Natalie Portman talk SQL.
SELECT * FROM Memes WHERE Reference LIKE '%Portman%' AND LIKE '%naked%' AND LIKE '%petrified%' ORDER BY SlashdotCommentScore, HotGrits;
27,154,947 Rows Returned.
Boot Windows, Linux, and ESX over the network for free.
Perhaps it speaks ill of my skills but usually the first place I look is at my application / query structure. Most modern DBs have enough raw speed to pull off large tasks in amazingly short amount of times. Typically I look towards how I am managing the tool first as this frequently yields the desired gains and nets me more experience as well. Because slow queries always deal with large amounts of data, small "mistakes" have exponential effects... thus small improvements offer the same reward.
Only when that fails to I go further into modifications of the hardward / DB engine, etc. as required.
Play me online? Well you know that I'll beat you. If I ever meet you I'll "/sbin/shutdown -h now" you. -Weird Al, kinda.
..is to rewrite is in a language that actually lends itself to applications.
AC because I know people will regard this as a troll. It is not - I have seen countless DBAs who think that SQL is actually an application language that should be used for all manner of things it is really hopeless at....
This even included:
- email
- CSV processing
- error reporting
Said systems were always incredibly difficult to manage normally and next to impossible to manage by people who did not write them.
Now I know not all "applications" being dicussed are actually APPLICATIONS as such and are more just slightly complex procedures, but that is not what is implyed by the words chosen here.
I don't know. Are the ideas worth testing on PostgreSQL? I am willing to bet they are.
LedgerSMB: Open source Accounting/ERP
Recently I was asked to improve the performance of a MySQL based PHP web application. After turning on query caching and tuning the settings I was left with looking at the queries involved. It turned out the application was really the problem, not the queries. Just loading the main page involved several hundred queries. For example, settings were saved in a table. Instead of loading all of the settings with a single query, it grabbed them one at a time. It wasn't like they had a few hundred variables and only used a few. There were a couple of dozen and they loaded all of them individually. This kind of bad design was consistent though out the application. Turning on query caching made a huge difference, but the thing was packed with similar problems that could not be overcome with a simple fix like that. A complete rewrite was really the only answer -- and wasn't going to happen.
-- Will program for bandwidth
The typical argument goes something like: 'MySQL suxorz - nobody uses it for serios work' followed by: 'Yeah? well explain that to =HIGH VOLUME SITE=!'
Such responses show a misunderstanding of what serious work is being discussed.
MySQL does a fabulous job with simple, high-volume transactions, exactly the type seen by Yahoogle/Flicker/Blogsites. They need to sore simple data (EG text) and be able to retrieve it quickly, and for these uses, MySQL is probably a better bet than Postgres or DB2.
But 'serios work' means thing like strong, ACID compliant transactions, row-level locking, strong integrity of field types, and a query scheduler that holds its own when you combine inner, outer, nested, subqueries mashing together a dozen or more tables with millions/billions of records/combinations.
Postgres will do this, MySQL won't. MySQL isn't bad because of this, it's just a tool not well suited to this specific job. I use MySQL for website CMS, I use Postgres for financial applications.
Does your dishwasher suck because it does a piss-poor job cleaning your socks? Use the right tool for the job.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Why?
Sorry don't get it.
MySQL is a good db for single-app databases, where data integrity is not a tremendous issue or where you are deploying a separate MySQL instance on a different port. It is quite a bit worse than PostgreSQL for anything else.
From your description you are using MySQL for a single-app database where you run a dedicated instance of MySQL for your app. That is not the usage case I was describing, which is a central RDBMS serving out the same data to a myriad of different applications. If you are trying to go beyond ONE app for your data, you should look to PostgreSQL instead of MySQL.
LedgerSMB: Open source Accounting/ERP
"and, if the budget allows, throwing hardware at the problem." These days its actually quite the opposite. Having a full database cached in memory does wonders for performance.
I've found that the easiest way to make applications run great is to give the developers systems that are at least 2 generations older than what will be used in production (with the latest software, patches, drivers, etc)...
Then, hold them to making the application perform as you want it to, on that hardware. They don't get paid (their final lump amount) until said application performs as you'd like on the 2 gen old hardware.
Then, when you migrate to the production hardware, it's quite a bit faster, and doesn't contain the bloat that could have crept in, if the developers had been given access to hardware identical to the production hardware.
Who is general failure, and why is he reading my hard drive?
I only know Oracle but I've known it since version 5.0. Intimately. I haven't read the book but I read the review. Here are a few tips I've learned over the decades that you might find useful, just in case they aren't covered in the book:
1) You have to establish a general rule of thumb for each production db whereby any one sql that consumes more than x% of the db resources needs to be tuned. The value of x varies from db to db. If it cannot be tuned below x% then it needs to be refactored.
2) Learn to use stored outlines. If you can get them to work they will save your ass and make you look like a total hero.
3) Never turn your back on the optimizer. Really. Even for simple queries, even with the deepest stats.
4) Bind variables are a necessity for high-repetition sql. Bind variables are something you might want to avoid for reports queries for which the optimal plans depend on the user input values. This is because a sql's plan is cached along with it the first time it is parsed, and if you use bind variables then the first plan you get is the plan you will always get so long as the sql remains in the shared pool.
(You can sometimes work around this issue by turning off bind variable peeking, but consider doing it on a per-session basis instead of changing it system-wide. Scary!)
5) Nowadays a 32GB SGA is no big thing. Get yourselves a ton o' RAM and set up a keep pool in the buffer cache to pin your most important reporting indexes and tables. Partition your big reporting tables and maintain a sliding window of the most recent partition(s) in the keep pool.
6) No sorting to-disk. Ever. If you cannot let the session have the PGA it needs to sort the query in memory then the SQL needs to be "refactored".
7) Once you have eliminated most of the physical disk reads it then becomes all about the buffer gets (BG's). When disk reads are low the high-logical-BG queries immediately become the new top SQL. This is because logical BG's are all CPU and your db is now cpu-bound, which is right where you want it. So from this point it's back to item #1 and we prune and tune (thanks KG!)
I could go on all day. Perhaps I should write a book?
"Crude and slow, clansman. Your attack was no better than that of a clumsy child."
I think it's usually best to have views (whether with rows that are the result of code, or with a pure 'select' definition, or materialized ones) define what your application 'sees', so that you can always change the underlying datastructure.
As another poster noted, I don't see much benefit from that over just changing what the app sees. At least not long term.
I agree with Joel On Software Joel, in that you should set up views as you stated to present the best view possible for the application - then change all your code to use that view, and when it does make the view the real table you are going against.
Otherwise I have seen way too many views abused and grown to monsters. If you always use them as transition states, then they can never get out of hand as once they are the real tables you proceed to make new views, and the schema evolves.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
This study would have carried more weight if it had included PostgreSQL and IBM's DB2. These two databases do more serious work than MySQL although many believe MySQL is more widely deployed.
Shouldn't developer concentrate on app-logic? The db should just take care of every thing. Then what are the db vendors offering...1000 different parameters to learn & tune?
Comment removed based on user account deletion
checkout netezza.com 's data warehouse appliance. No refactoring required, it just runs fast
Its also available on BetterWorldBooks for $31.98 with free shipping to the US
And, as always, save the world, fund global literacy,
details: http://www.betterworldbooks.com/Info-Our-Impact-m-51.aspx
Over 300 titles of used (=cheap) on SQL
http://www.betterworldbooks.com/List.aspx?Category_ID=3804&&s=23329287
I'm a dba. I think I will actually buy and read this one though. :-)
by default (last I checked, and meaning you don't install innodb, etc), the tables will not be transactional.
Why would one assume InnoDB is turned off? MySQL binaries in Ubuntu 8.04's repository and the MySQL essentials installer for Windows include InnoDB. So does Go Daddy's hosting platform, if one believes this page deprecating Berkeley DB in favor of InnoDB.
Strict mode can be turned off by any application.
Per the MySQL manual: You could try not handing out the SUPER privilege like candy. This way, other apps can't SET GLOBAL sql_mode, and when they try to SET SESSION sql_mode, it won't affect any other apps' connections.
If you are looking for a single-app database try SQLite (mentioned in the tools section by the authors) http://sqlite.org/whentouse.html
SQLite's web site states that SQLite silently ignores REFERENCES (foreign key) constraints.
still, I'd prefer not to handle mysql structural data, as it mangles text removing spaces
What version of MySQL are you talking about? This page claims that MySQL 5.0.3 and later preserve trailing spaces in VARCHAR columns, and all versions preserve trailing spaces in TEXT columns.
Now, for the citation needed folks: http://sql-info.de/mysql/gotchas.html
From the linked page: "Update: Apparently this behaviour will be corrected in the planned 5.0 release", and it was.
Comment removed based on user account deletion
The SQLite source tree contains source code and documentation for a C program that will read an SQLite database, analyze the foreign key constraints, and generate appropriate triggers automatically.
The searches I tried in Synaptic on Ubuntu 8.04 LTS didn't appear to turn up any program to make sqlite triggers. Was this program added to Ubuntu only recently, or is it not in Ubuntu at all? And then the download page scares people away from downloading the source code packages: "Their use is not recommented. The SQLite developers do not use them. You should not use them either" (spelling and emphasis preserved).
For MySQL you generally needed InnoDB for referential integrity
But at least MySQL with InnoDB comes packaged for Ubuntu.
Comment removed based on user account deletion
The READ.ME for this is at http://www.sqlite.org/cvstrac/fileview?f=sqlite/tool/genfkey.README You can access this functionality directly from the sqlite3 shell tool (type sqlite3 at the CLI). The syntax is .genfkey ?--no-drop? ?--ignore-errors? ?--exec?
Thank you for pointing out genfkey. Now that I have a keyword to go on, I can understand how one would "recompile" foreign key constraints. Don't get me wrong, I've been using SQLite 3 in a Python-based internal web site for about a year now; I just wanted to get straight what it does and does not do.