How Would You Improve SQL?
theodp asks: "It was the best of languages, it was the worst of languages. SQL's handy, but it can also drive you nuts. For example, if you want all 100 columns from a table, 'SELECT *' works quite nicely. However, if you want all but 1 of the 100 columns, be prepared to spell out 99 column names. Wouldn't it not make sense to provide a Google-like shorthand notation like 'SELECT * -ColumnName' (or DROP=syntax like SAS)? So how would you improve SQL?"
Extremely useful when you need to produce a result tree instead of a result list (e.g. Slashdot's nested comments). Oracle does this with "CONNECT BY", there is also a PostgreSQL patch available. Of course there are hacks to do this, such as adding extra fields to keep track of where you are in the tree, but they are a real pain in the arse compared with using the information that's already present in the database.
Bogtha Bogtha Bogtha
You can't write poetry in SQL. So it remains an inferior language compared to Perl.
(yes, well, I ran out of ideas)
May contain traces of nut.
Made from the freshest electrons.
If you want to get an idea of some cool SQL improvements, check out theq /
http://msdn.microsoft.com/netframework/future/lin
LINQ (Language Integrated Query) project for c# 3.0. Some cool stuff tht i never really thought about.
For example, their select statements go backwords ie from table, select column1, n2, n3 etc... Seems kinda wacky at first, but it makes sense since you really should know what table your'e selecting from before you specify the columns.
ex.
public void Linq3() {
List products = GetProductList();
var expensiveInStockProducts =
from p in products
where p.UnitsInStock > 0 && p.UnitPrice > 3.00M
select p;
Console.WriteLine("In-stock products that cost more than 3.00:");
foreach (var product in expensiveInStockProducts) {
Console.WriteLine("{0} is in stock and costs more than 3.00.", product.ProductName);
}
}
Top 10 Reasons To Procrastinate
10.
If this is your main problem with SQL, then you have other problems as well. Who in their right mind needs a table with 100 columns? If you have 100 columns, you seriously need to normalize your database.
Ok, I might not be a database buff. Actually, my experience with SQL is purely academical (although I've worked with object-oriented databases). But if I were to improve SQL, my attempts would be in the direction of making it into a more pure mapping of a relational database, not in adding yet more syntactic sugar.
Did you know there have been people working on a general algebra for data management for, what 40 years now? Did you know, this is basically a SOLVED PROBLEM? Ever heard of "D"? Or Tutorial D? The Third Manifesto?
.. it tells the computer what you want, it doesn't tell it how to do it and in what order, and 2) algebraic notation is *completely general*. You can nest arbitrarily with parentheses, and you can clearly see what's a variable and what's a value and what's an operator. Easy to create, understand, and *optimize*.
You know, I totally understand why Fabian Pascal is always pissed off.
Here's something for you to chew on:
Why do programmers write this:
A + 3
when they want to add 3 to A? Why do we not write some lovely crap like:
OPERATE ON A WITH 3 USING ADDITION
why do we write:
(A + 3) * 2
and not:
OPERATE ON (OPERATE ON A WITH 3 USING ADDITION) WITH 2 USING MULTIPLICATION
Why do we do that?? Because algebraic notation is 1) declarative
Do you agree with me that the verbose syntax clouds your thinking? Keeps you from seeing the underlying operations? Makes it difficult to apply the basic algebraic skills you learned in high school? Makes it difficult for the compiler writer to do the same?
Now I ask you, why do we write:
SELECT * FROM Order
and not
Order
Why do we write:
SELECT * FROM Order JOIN OrderItem WHERE Order.order_id = OrderItem.order_id
and not
Order JOIN OrderItem
And here you are, dwelling on some little detail about projecting columns.. this is an easy one: use an "ALL BUT" operator for example:
RelvarWith100Attributes ALL BUT (Attribute100)
Once you see that relational algebra is just values, variables, and operators nested in any arbitrary way, just like arithmetic, you have opened the door a little more to understanding the fundamental theory of data management and how backwards and primitive "modern" data management is.
And let's not even get into all the crap that SQL gives us like duplicate rows, NULLs, brain-dead table-oriented storage, lack of 100% updateable views, lack of arbitrary constraints, (often) lack of composite types (why the hell do we splat objects into MULTIPLE COLUMNS?? They should be stored in ONE column). SQL also confuses logical and physical layers (keys vs. indexes), and has basically kept the database industry in the dark ages for decades now.
So the answer to your question is pretty simple: I would ditch SQL and use something that looks like relational algebra, which has been understood and documented for a probably longer than you've been *alive*. No offense.
In what way isn't it relational?
.. I think this is related to lack of table equality.
1. SQL syntax doesn't look like relational algebra (see my long rant above). This clouds thinking and hides the simplicity of the underlying model.
2. Relations are sets. SQL allows duplicate rows, so its tables aren't sets, and therefore aren't relations. (This property alone is enough to make it "not relational" by the way).
3. Relation *attributes* (the column names) are also sets. SQL allows columns with the SAME NAME in a query result!!
4. SQL has no "table equality" operator. You'd think the first operator you'd implement for a data type, especially a fundamental data type, would be equality! Imagine a programming language with no integer equality for instance.
5. Relations require each attribute to be drawn from a single type or domain. SQL allows NULLs, which are values not drawn from the column's type. And SQL gives you very little to help you work without NULLs. To add insult to injury, the default for columns is NULLable.
6. (related) Relational algebra requires boolean logic. SQL uses three-valued logic because of NULLs. And it uses it *inconsistently*.
7. The relational model does not specify a type system, it just requires one. Yet SQL specifies it's own particular type system (integers, chars, etc). What if you want to store XML or audio in one of your columns?
8. The relational model specifies nothing about physical implementation. Yet, almost every SQL product stores the columns of tables "together" in such a way that makes joins needlessly expensive.
9. SQL distinguishes between "base tables" and "views". The relational model requires them to be indistinguishable to the end user. Specifically, most SQL implementations don't let you update views! Pretty unbelievable. Imagine a programming language that didn't let you pass arguments to any function for instance.
10. SQL lets you do meaningless things like multiply the primary key values of two tables or add a weight to a height. This is related to the type system issues.
11. SQL confuses KEYs (logical) with INDEXes (physical implementation).
12. (This one gets me all the time) SQL has an EXISTS operator (is this statement true for at least one value of this result?) but not a FOREACH operator (is this statement true for all values in this result?)
13. SQL implementations don't have ANY brains whatsoever. They don't know that book_id from column A and book_id from column B are equal in a join, and that you don't need both of them in the query result. They don't "look inside" your CHECKs and foreign keys to deduce information about your database and use that information to optimize queries.
I'm sure if you picked up a basic theory book you'd find plenty of other nitpicks for the syntax, the semantics, and the basic underlying model of SQL.
And these aren't just "theoretical" problems, I run into them every day because I know there's something "more" out there. Here's a simple query you should try to do in one line of SQL: "give me a list of all customers who bought every product in product line X". Someone who knows relational theory just thinks up the solution (you just need to create a list P of all products in product line X, and pull out the list of orders where P is a subset of the order items, then join with the list of customers). Someone who only knows SQL will immediately run for the application layer, where you can't just *declare* your problem and have the app solve it, you literally have to write loops and procedural code to solve the problem.
If you are interested in learning more, get Date's O'Reilly book "Database in Depth". It's very short, roughly 200 pages, and tells you all you need to know about data management theory.
1. Standard date functions and handling.
2. Allow for SELECT statement reordering. I should be able to have the FROM first. This would be a BIG help to SQL editors!
3. Column aliases. So if I have a column in the select that is ColA+ColB as "My Value", I can use the "My Value" in the WHERE, GROUP BY, and ORDER BY instead of having to restate the equation every time.
Bad User. No biscuit!
Out of all the annoying issues, I've pulled out the most hair over unique id's, and INSERT vs UPDATE.
Most SQL implementations give you some way of assigning unique id's to newly INSERT'ed rows. It would be nice if there were a standardized way, but that's a side issue. Once rows have unique id's, you can identify rows to be updated by id. This is very fast and simple.
Except .. in order to find out what id the DBMS has assigned a row, I usually have to follow my INSERT with a SELECT, to read the id column. Slow and annoying. Sometimes the DBMS I am working with takes a few seconds to perform the INSERT, and ten minutes to perform the SELECT. That takes it beyond an optimization issue, and into a workability issue.
Also, if I do not yet know if a data record has been INSERT'ed, and I need to either UPDATE the existing record or INSERT a new one (say, with just a new timestamp), then I need to either attempt an UPDATE and then fall back on INSERT if the UPDATE fails (ew!) or attempt a SELECT and either INSERT or UPDATE depending on whether it returned any rows (ew!).
If SQL came up with a standardized way to associate unique id's with newly INSERT'ed rows, it would be very, very nice if the id column(s) assigned were returned to the client in the same packet as the message confirming that the INSERT succeeded. Nearly zero additional overhead, neat, fast, and easy.
To solve the UPDATE/INSERT issue, I'm less sure. Say, for instance, that I have a daemon which periodically scans the filesystems in a cluster of machines, and it wants to UPDATE the "exists" column of a given row identified by a ( hostname, mountpoint, path, filename) tuple with the current time, if that row already exists, or INSERT a whole new row for that file if it does not exist. Perhaps there could be a "WRITE" command which is just like INSERT but overwrites a row if it already exists? That seems like the wrong solution, too. In the meantime, I play with caches of hashes to unique id's and lose more hair.
-- TTK
Actually, it'd be faster if you listed out every column name. If you're talking about faster to write out the code for, you're obviously not writing a query for a program that's intended to be used much. There's absolutely no reason you should be deploying code containing a query that does "select *" or anything like it. You're making the database do the work of looking up the list of columns names every time that query runs. There are much more useful things to spend your caching space on (if you have any).
If you really can't stand to write queries containing the actual column names, you should be using some type of abstraction layer in whatever language you're writing your code in.
If you're not writing code and just making queries by hand to test the results, then you're even further off your rocker. (this also applies in general to the statement you made) Why would you ever NOT want to select that last value out of 100? is it going to keep your output from wrapping? (lol)
Also, those of you saying that you should never have 100 columns in your table, you're certifiable lunatics as well. If you have 100 columns that are used in every record and have very little or no duplication per row, there is no reason you should break this up into multiple tables!!! Then the database has to do joins, which again require more processing power and disk usage. It's also hard to maintain multiple tables when you really have one table after you normalize it.
For those of you that say this isn't normalized... I'm not even really sure how to answer that.... If you have several tables all with a strict 1:1 relationship, they should be in ONE table. Anything else is considered denormalized, not yet normalized. (aside from being just plain BAD!)
For those of you that say you'd never need that many columns in one table or split across multiple tables, however you'd like to think the world should work. I have an example of just that. My wife does genetic research, primarily statistical analysis of sequence data (in various forms, but that's the easiest way to sum it up). We've had discussions on this particular topic, where she had been told by someone else that she would get better performance in Oracle if she split her one table into several tables containing a smaller number of columns, each.
This is just simply not true. It also is a perfect example of a situation where you would actually need a large number of columns. There were specific bits of data that needed to be looked up quickly (like, 45'ish). You can't store it all in one column (or even just a few) and use regexes to find the bits you're looking for. You also don't want to be doing a lot of joins unless you really need to, you know.. when you actually have data that would fit into some form of normalization. Technically, you CAN do this stuff, but not if you want decent performance. If you didn't want decent performance, you could just leave the data in a text file and shell out a grep command. *sigh*
Anyway, enough ranting, but seriously people... Get a clue. Get some experience with these issues. Don't just pipe up because "hey, I've worked with databases and while I probably don't understand them very well, I don't know anybody else that understands them at all, so I'm kind of an expert!"
Lets look at something a little more realist:
c e.InvoiceNum,
SELECT
Lease.LeaseNum,
Lease.LesseeNum,
Invoice.InvoiceNum,
Invoice.AmountBilled
FROM
Lease INNER JOIN
Invoice ON
Lease.LeaseNum = Invoice.InvoiceNum
WHERE
Lease.LeaseNum = "1234"
ORDER BY
LeaseNum, InvoiceNum
Okay, that's pretty big to get some basic lease and invoice info. Now how you you write that?
Lease.LeaseNum,
Lease.LesseeNum,
Invoi
Invoice.AmountBilled
Lease JOIN
Invoice ON
Lease.LeaseNum AND Invoice.LeaseNum
Lease.LeaseNum = "1234"
Lease.LeaseNum
Invoice.InvoiceNum
??? All that's been accomplished is the removal of key words. I'm not seeing any benefit, and I'm seeing the pitfall of it being hard as hell to read.
-Rick
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
You're missing the point. You really have to study the theory, and you'd get something like this (the exact syntax is unimportant of course):
((Lease JOIN Invoice) WHERE LeaseNum = "1234")
[LeaseNum, LesseeNum, InvoiceNum, AmountBilled]
ORDER BY whatever
Why do I put the column names at the end? Because a projection operation applies to a *single relational result*, not to individual tables.
Why do I not qualify every column name? Because relational attributes *must* always be unique and unambiguous.
Why do I leave off the Theta from the join (the equality test)? Because I *already* set the foreign keys on those tables. The DBMS should be able to *deduce* which columns to join on and generate an error if it can't.
Here's what's happening:
I used the JOIN operator on two base relation values, stored in variables with the names Lease and Invoice. I got a third (anonymous) relation with all of the combined columns.
I then applied the RESTRICT ("where") operator on that relation, along with a boolean expression, and I got a fourth relation with just the rows where that expression was true.
I then PROJECTED that relation to get a fifth relation with just the desired columns.
Note that ORDER BY is a non-relational operator. It turns the relational result (unordered set) into an an ordered array. So for a final step, the relation was 1) turned into a regular array and 2) ordered.
Of course I didn't actually perform those steps. I told the DBMS what I wanted, and *it* did the work.
Do you see how the SQL conceals the underlying algebra? And how it makes YOU do the work (in the join for instance)? You might not see it. Study the theory more, and you will. Compare with the A+3 example. Imagine a 12-table join. Imagine having to do a query like "where all rows of table X are a subset of all rows in table Y" right in the middle. An algebraic notation would make this MUCH easier. Just break it down, and apply the next operator to the result of the last. You do it all the time when programming your favorite language, why not in SQL?
And yes, the differences get MUCH deeper than this. SQL can't even represent all possible relational queries!
I have yet to encounter a DBMS that didn't have an efficient, straightforward way to get the ID after an insert, but YMMV. However, the INSERT/UPDATE issue is a fundamental syntactical problem and it really should be fixed. INSERT and UPDATE do almost the same thing, yet have completely different syntax.
INSERT INTO someTable (fld1,fld2) VALUES ("foo","bar")
UPDATE someTable SET fld1=foo, fld2=bar
It is REALLY annoying when you have to write some code that generates a SQL statement because you must code for two completely different syntaxes. Someone replied about the Oracle MERGE which seems like a nice way to go.
Fortunately though, there are lots of good frameworks around SQL that make it so that writing SQL is becoming a thing of the past. I would like to see SQL treated like HTTP - nobody writes HTTP. It's a protocol. Let it be. Just use the tools. I guess it will never go away though...
SELECT * FROM whatever the hell my customers have in mind
Go hug some trees.
No doubt it has defects, but SQL has a strong theoretical underpinning in set theory. This has made it a very durable language and one that scales to sizes probably unimagined by Dr Codd when he outlined its roots in 1970 in his article "A relational model of data for large shared data banks".
Computings needs for well structured access and manipulation of large data sets has been well served by SQL.
A clear replacement has yet to emerge. There are pretenders to the throne, of which Tutorial D is certainly technically nice, XQuery is a mess and ODBMSs (and their query tools) really haven't caught on.
Its just that SQL passes a simple test - its good enough for the job and relatively ubiquitous. And standards do exist (that every major vendor breaks. sigh.).