Diagramming Tool For SQL Select Statements
alxtoth writes "Snowflake is a new BSD-licensed tool that parses SQL Select statements and generates a diagram. It shows parts of the underlying SQL directly in the diagram. For example: x=30, GROUP BY (year), SUM (sales), HAVING MIN (age) > 18. The primary reason for the tool was to avoid Cartesian joins and loops in SQL written by hand, with many joined tables. The database will execute such a statement, if syntactically correct, resulting in runaway queries that can bring the database down. If you sit close to the DBAs, you can hear them screaming... "
No single query will ever bring a (real) RDBMS down. Even on a terrabyte of data or more, doing a crazy multi-hundred-table cross join, you're not going to bring it down.
Now, it could seriously slow down a production server, but... you're not pushing untested SQL on a production server now, are you? Right? Riiiiiiiiiiight?
So at worse, you're slowing down your own localhost development database engine for everyone else trying to access it (read: no one).
Not much for the DBA to scream about...
I know, it would have caught this
SELECT *
FROM Event E
LEFT JOIN Document D ON D.DocumentTitle = 'Daily Note'
WHERE D.DocumentTitle IS NULL
Hint: LEFT JOIN runs in O(N) time where NOT IN runs in O(N2) time on this system.
A link to an alpha project on Sourceforge that was created three days ago and doesn't even have its own website? That apparently outputs LaTeX tables instead of something readable without having to compile it first, like HTML, SVG, or even indented text? I know it's silly to expect every story to be about a cure for cancer, but come on...
You can do what I've done and seen done a number of times, and write a hunk of middleware that parses SQL statements for runaways and send back a warning to the user. That, and not using medium and low duty databases lile MSSQL and MySQL can go a very long way to keeping users happy.
Poorly Written Summary. Something not many people many if any will care about. Yesterday we the 'Language Translation Error' One-liner and now this? How did this story get picked?
They have been there from the days of "Halting problem" (of Turing machines - before electronic computers were invented)
WITH A AS (
SELECT 1
UNION ALL
SELECT 1
) SELECT *
runs forever.
Execution of SQL statements can require the RDBMS to perform nested loops over parts of the query execution.
This can be an issue if the DBMS is forced to do something like perform a sequential scan of one table for each record matched in another table. That gets expensive *fast*.
There are many other possible performance issues, of course.
However, I don't see how SQL parsing can tell you much about the performance characteristics of the query. The database's query optimiser makes choices about how to execute the query, and is free to change its mind depending on configuration parameters, available resources, system load, disk bandwidth, present indexes, statistics gathered about data in the table, etc. PostgreSQL's planner for example does make heavy use of table statistics, so query plans may change depending on the quantity and distribution of data in a table.
Any decent database can already tell you how it will execute a query (and usually give you a performance readout from an actual execution of the query). There are plenty of GUI tools for displaying the resulting query plan output graphically. PgAdmin-II can do it, for example.
A simple SQL parser can have no idea about what indexes are configured, the distribution of the data, how much working memory the database has available for sorts and joins, etc. The database knows these things - and can already tell you how it will, or did, execute a query - so why not let it do its job?
The whole project doesn't make much sense.
Doesn't name WHICH RDBMS, and then you throw SQL at it? So what? For DB2 we have a thing called "Visual Explain" which NOT ONLY does this, but is free, provided by IBM, but also shows you other things like whch index is being used for each step, etc.
This is news? This isn't even worth a second look!
posted by the admin of the project? the spam tag is accurate... "yes this is an open source clearing house, no we will not all rapidly sign up to your cute little project." though, i would be willing to be this is a Masters Thesis project and alxtoth is hoping to get some fast-tracking going on...
we were somewhere just out of Barstow when the patent trolls attacked.
this way the programmers are prevented from infecting the database from their crapness
If you mod me down, I will become more powerful than you can imagine....
I don't see what this has over EXPLAIN and an appropriate graphical display tool like PgAdmin-III. There are large numbers of tools that display graphical query plans - and unlike this simple SQL parser, they know how the database will actually execute the query once the query optimiser is done with it.
Furthermore, a simple SQL parser has no idea about what indexes are present, available working memory for sorts and joins, etc. It can't know how the DB will really execute the query, without which it's hard to tell what performance issues may or may not arise.
See comment 24461217 for a more detailed explanation of why this whole idea makes very little sense.
Most PostgreSQL users don't seem to use the existing, and superior, tools like EXPLAIN, EXPLAIN ANALYZE, PgAdmin-III's graphical explain, etc. I'm sure the same is true for users of many other databases.
It's not like these tools are particularly difficult to use or understand. No training is required, though being willing to think and read a little documentation helps if you want to get the most out of them. Understanding at least vaguely how databases execute queries is handy for any database user anyway. The same understanding is required to get anything useful out of this just-posted tool.
Anyway, as I've noted elsewhere the exiting tools for this do a much better job due to integration with the RDBMS and superior knowledge of how the DB will execute the query.
If it was, such a query analysis tool would be provably incapable of handling all queries because of the halting problem.
Thankfully most SQL dialects are limited to expressing queries that can be executed in finite time with a defined end point.
They day of the python? There were 2 different tools written in python.
I'm not flaming, I just thought it was interesting as I loved programming in python when in college!
I've noticed that when things go horribly wrong, you don't actaully have to sit that close. To be fair, as a Unix SA who has to deal with Windoze systems, I've done my fair share of screaming. :-)
It must have been something you assimilated. . . .
That's right. Mod parent up. Any tool that only looks at the SELECT statement, without knowing about the indices or what the optimizer is doing, is nearly useless.
So SQL Server has had a graphical execution plan view for ever, and it's better than this lameness. But of course its not free, and we all know that free software is better, even when it sucks. Seriously, compare this to the real tools included with a serious RDBMS, and I have to question why this was even posted. It's almost farcical.
No sooner do I get over one, then you put a better one right next to me. Bastards.
Heck, even ANSI finally got into recursive SQL using the WITH clause:
Now let's imagine queries with multiple levels of nesting using such clauses - after all, any SELECT statement can generally be used in any FROM clause.
Now, perhaps you're Chris Date or Fabian Pascal and are truly concerned with the completeness of SQL as implementing the relational model. For the rest of us, however, recursive SQL can answer interesting questions without getting into the nastiness of procedural code.
Oh, and considering the default join in virtually any SQL database is a nested-loop join, I'd say all databases loop by default. And a statement as innocuous as :
Can absolutely crater cpu and I/O performance. If each has 1,000 rows and there's not enough memory, there's 1,000,001 table scans. Hope your disk is fast.
I've thought about such a tool for detecting SQL injection. Essentially, you have a whitelist of SELECT statement "diagrams" stored somewhere. Before running a query, you generate a diagram of the current statement and check it against the diagram.
Of course a better investment would be to write your code the right way first...
Can we have that in English please? Possibly with a diagram?
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Comment removed based on user account deletion
> Any tool that only looks at the SELECT statement, without knowing about the indices or what the optimizer is doing, is nearly useless.
This is wrong. The indices or optimizer have very little to do with the SELECT; at the time of processing the SELECT clause, most database engines already are done with the FROM, JOIN, WHERE, GROUP and HAVING clauses. At this point there will be little gain to add/drop indices from the query plan, unless the platform does support included fields. As such, SELECT is like ORDER BY, a final operation that has little weight in the overall plan (unless there is a huge amount of fields, but even then...)
For long queries with complex joins (like recursion), a diagram tool for SELECT can be very helpful, because it can help you validate that you actually get what you ask for. I would see this product as a logical optimization tool, not a physical one.
lucm, indeed.
Nobody with any sense is going to use this because it's under that fucking evil BSD license, right Linuxistas?
Bite me.
I think you might've missed the point.
The term SELECT statement generally refers to the whole statement, including FROM, WHERE, HAVING, etc clauses.
This is pretty clear in context, as it'd be nonsensical to produce a graphical explain tool for the result field list in the SELECT clause its self.
That's why the parent said SELECT statement not SELECT clause .
As it happens the same issues regarding the need for planner knowledge etc are true for DML like INSERT, UPDATE and DELETE. It's not about SELECT at all, but rather any non-DDL query.
Just use ANSI style joins. You're guaranteed never to accidentally do a cartesian join.
For example:
Normal join - "SELECT * FROM table1 t1 JOIN table2 t2 ON (t1.column = t2.column)"
Cartesian join - "SELECT * FROM table1 t1 CROSS JOIN table2 t2"
Big difference, no mistakes no matter how many tables/columns you're pulling in to your query.
--Roy
Build a website for it. (Don't have "Homepage" link to an empty directory.) Take screenshots in a decent resolution. Offer an open forum.
> I think you might've missed the point.
> The term SELECT statement generally refers to the whole statement, including FROM, WHERE, HAVING, etc clauses.
I did not miss the point:
>> For long queries with complex joins (like recursion), a diagram tool for SELECT can be very helpful
I simply disagree that the product is useless without knowing about the indices. Before one should start reviewing query plans and figuring out what index is important, one must make sure the query makes sense. Logical before physical.
A good example of this is a query with a LEFT JOIN. Applying filters in the JOIN or in the WHERE clause might look similar (it is with INNER JOIN), but you may get very different results with LEFT since the non-matching rows from the left table are added to the result AFTER the filter on the JOIN is applied. With a graphical tool, you can see it immediately and reformulate your query accordingly.
lucm, indeed.
Comment removed based on user account deletion
Ermm, yes, but 'windward' is a direction. Let me see... 'Never Eat Shredded Wheat' - nope, don't see 'credentials' on the compass, so I don't think they count as a direction, so I suppose that makes them something we look 'at'. Nice quote though.
Without knowing the indexes how are you going to know what it will do on the disk? The indexes are how the planner decides what data is in or out many times.
If you had a disk/memory with 0ns seek time then yes this makes more sense (as you do not need indexes as much). But most issues are with either some sort of loop (which this tool sounds like it can help with), or a poor plan (ie borked indexes).
SELECT *
FROM X
WHERE Y=99
on a system where there is 100 rows that will probably come back fast.
on a system where there is 1 billion rows it will take ages. (seconds depeding on if the data is up in memory and how big the rows are)
on a system where there is 1 billion rows and an index on Y it will come back in a reasonable time.
Without knowing the indexes you do not know if you are getting a full table scan (a large loop) or if the planner can get you to 90% the way to the data.
Indexes sound bad up front but are a wonderful tool. I have seen them abused too. However, almost all things in comp sci can abused.
I as a DBA am more worried about will the query finish FAST and not lock up crazy amounts of the database while doing it. Both things help minimize deadlocks and raise customer satisfaction.
Also indexes help with the locking issue. What parts of the DB are locked so the data is consistant. If you can narrow it down to row level; you get speed from other statments in your system.
I too in my day have written the big crazy statment. This tool probably would help disect what something like that does. But that is NOT MAINTAINABLE. If you need a tool to tell you what it is doing then your doing it wrong. I usually take a step back and break it down into a bunch of smaller simpler statments. This lets *ME* know what it is doing and someone eles who has to maintain it do so. Breaking it down also removes the nested loop problem quite nicely.
The problem is that sql lets you put statments inside of statments. Putting those in the wrong places can be devistating in performance. The rule is pretty easy. You can put select statments in the select clause or where clause areas but be prepared for it to take awhile to finish. Put them in the FROM clause where they belong.
When I have a perf issue I almost always start with the disk reads/writes which is almost always an index is messed. THEN I move to logic. A better tool is the one that samples how long different ones take so you can figure out which queries are messed. It is usually pretty obvious what is messed up at that point when an index doesnt take care of it. It is a data layout issue (where does you data live) or a data flow design issue (how are you asking for the data). Many times it is not even the queries themselves. It is the fact that a programmer decided to call that one procedure that takes .1 seconds to execute, 200k times in a row. Instead of using his brain and writting a new proc that returns all of the data he needs in a table. You can spend all your time getting it down to .05 or fix the real issue that the design is broken/missing something.
You missed the parenthesized word in "No single query will ever bring a (real) RDBMS down."
> The problem is that sql lets you put statments inside of statments. Putting those in the wrong places can be devistating in performance. The rule is pretty easy. You can put select statments in the select clause or where clause areas but be prepared for it to take awhile to finish. Put them in the FROM clause where they belong.
The SQL optimizer will actually do that for you. It can replace a correlated subquery by a JOIN if this appears to me more optimal.
> When I have a perf issue I almost always start with the disk reads/writes which is almost always an index is messed. THEN I move to logic.
I completely agree. But that does not mean that the OP tool is pointless, it is just not suited for performance analysis. Hence my point: use it to review the logic of a query, then move to the physical layer with other tools. In most IT companies those two steps are not done by the same people anyway.
> I too in my day have written the big crazy statment. This tool probably would help disect what something like that does. But that is NOT MAINTAINABLE. If you need a tool to tell you what it is doing then your doing it wrong. I usually take a step back and break it down into a bunch of smaller simpler statments.
I agree. In the ideal situation, this is what one should do. Sadly we do not live in an ideal world, and sometimes one must deal with huge queries. For those sad moments, the OP tool is good. That's the point.
lucm, indeed.
And these type of edicts from up on high tend to really bite you in the behind over time. You wind up with hundreds upon hundreds of stored procedures, and nobody knows which ones are even in use any longer. One project will wind up requesting a change that affects another project, and it basically excludes any O/R mapping tools. It's just one huge mess.
Your best bet is to insist that your developers are just a little clued in. How hard is it to say, "As long as your queries always have an indexed field in the where clause, and your joins are all on the primary key, fire away. Anything other than that needs DBA approval." Or some other reasonable policy depending on how much data are in the database.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
My reply is: # of CPUs on the RDBMS server.
So for a 8-way SUN server, you just need 8 user *sessions* submitting a 3 table cartesian product join, each table with 1 million rows.
Who many users does it take to bring down a server?
It could just be ONE human, issuing the same query 8 times in a row...
Yes, it could be useful for examining the output of the query in non-performance terms. For complex queries I can easily see how that could be useful. That may, in fact, be the whole idea behind the tool - to help reduce or eliminate execution of grossly incorrect queries that don't do what the user wants. Tools like EXPLAIN aren't as useful for that, either, as the query looks quite different after the query optimiser is done with it. Additionally EXPLAIN output usually drops detail about specific fields, so it's not really possible to predict the results of the query from explain output alone. That said, it's quite possible to spot issues like a join without a filter.
Indeed. I was quite careful to say "no -SINGLE- query..."
Well, unless you're running the database on a Pocket PC or something =P
Comment removed based on user account deletion
I tested this on SQL Server 2005 Express and either I am doing something wrong or your statement is false.
I created three tables Table1, Table2, Table3. Each one has two columns id INT PRIMARY KEY IDENTITY, str NVARCHAR(MAX). I ran an query that inserted 1000 rows in each of those three tables. Then I ran 2 queries concurrently: SELECT * FROM Table1,Table2,Table3
Well, my PC slowed down a bit and queries have been running some minutes now but nothing dramatic happened. I ran third query SELECT * FROM Table1 and it executed quite fine. I made some INSERTs and UPDATEs and those went well also.
So results are: some slowness, no CPU cratering (but both cores are under 100% load), mem use is almost endurable and one unhappy boss because I used company time to do this test.
You don't know what you don't know.
Comment removed based on user account deletion
What he meant was, if you only have the SQL statement, without accessing the CATALOG to know what is indexed, you can only go so far. Of course the optimizer will need even more information than just WHAT is indexed -- sizes of tables, distribution of values, etc. Oracle/DB2 require you to update the index statistics periodically, otherwise the optimizer may choose a bad plan.