Database Error Detection and Recovery
CowboyRobot writes "ACM Queue has an interview by Steve Bourne with Bruce Lindsay, responsible for a lot of the SQL and RDBMS we use today, in which they discuss error detection and recovery.
My favorite part other than the photos is the definition of Heisenbugs - those problems that disappear only when you explicitly look for them."
I wonder whether one of the problems they prepared for was a Slashdotting. Database-driven sites beware!
Where is it?
Heisenbugs are a rite of passage in the computer world. They result from the production environment being different from the development environment. For instance, a debugger may initialize all memory in the process space to zero. An errant loop control variable now happens to be set properly, so no error occurs; in the production environment, however, whatever is left over in memory is used, which means the loop wanders off into no-man's-land and crashes. Always initialize your variables, period! Do it even in languages that initialize them for you, so that you know what values they start with.
ignore this response.
*everything* is Orwellian to cats.
"Heisenbug as originally defined--and I was there when it happened--are bugs in which clearly the behavior of the system is incorrect, and when you try to look to see why it's incorrect, the problem goes away."
This is a really cool article, but it was especially fun to see the heisenbug mention. Years ago, some fellow CS people and I conjectured a similar phenomenon that seemed to manifest once in a while, in which a computer malfunction goes away after one "proves" that there's no cause for the error to exist.
Here's a list of heisenbug anecdotes, but note that some of these submissions aren't strictly heisenbugs.
I'm less afraid of the heisenbugs than I am of the Hindenbugs, which bring down an app in a most spectacular fashion. Oh the humanity!
BL: In the heart monitor case, you better keep the heart going, whereas in the Microsoft Word case, you can just give them a blue screen and everybody is used to that.
SB: But also in the heart monitor case, it's hard to ask users if they want to keep the heart going because the answer is pretty obvious, whereas in the Word case, you can ask the user in some cases what to do about it.
New Microsoft Pace - Heart Monitor and Pacemaker
STOP: 0x0000000A (0x0000015a, 0x0000001c, 0x00000000, 0x80116bf4)
IRQL_NOT_LESS_OR_EQUAL - Beat.exe
Please hold your breath while a dump file is created...
That picture is really something. I didn't know Gandalf wrote bsh.
I have worked on electronic hardware for forty years. Over that period, I have experienced many such bugs. You carefully trace the problem and get to the point where you say: "It must be this!" So you go there and the signal is correct but now the equipment works properly. It stays working properly. I'm used to problems that are sometimes there and sometimes aren't; this is different. The working condition stays.
It's like the equipment is playing hide and seek with you. You found the problem and the game is over. Maybe this proves that the 'great electron' has a sense of humor.
Actually, the wiz in the pic has a more friendly Dumbledore look, rather than that of a driven Istari.
A good design principle is: either do what you're told to do or tell us you didn't do it and why, but don't do something completely different.
Exactly. Compare and contrast with MySQL's behaviour.
That's why there are loads of people who point out that you can't trust MySQL for important data, or that it isn't a "real" database. A real database tells you when it fails, which is something that is necessary for trusting it with data integrity.
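To make the "do it, or tell us you didn't and why" principle concrete, here is a minimal sketch. Everything in it (`StoreError`, `store_name`, the 8-character limit) is made up for illustration and is not any real database's API. The function either stores exactly what it was given or raises with the reason; it never stores a silently shortened value.

```python
class StoreError(Exception):
    """Raised when a value cannot be stored exactly as given (hypothetical)."""


def store_name(table: dict, key: str, value: str, max_len: int = 8) -> None:
    """Store value under key unchanged, or raise and leave the table untouched."""
    if len(value) > max_len:
        # Report the failure and the reason instead of truncating quietly.
        raise StoreError(
            f"refused to store {key!r}: value has {len(value)} characters, "
            f"column allows {max_len}"
        )
    table[key] = value


table = {}
store_name(table, "a", "short")
try:
    store_name(table, "b", "much too long for the column")
except StoreError as err:
    print(err)
```

After the failed call the table still holds only the first row, so the caller knows exactly what was and was not done.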
The key point here is that if you go to sea with only one clock, you can't tell whether it's telling you the right time.
Ahh... but a man with one clock always knows the time - but a man with two is never quite sure :).
Heisenbugs are those bugs whose position in the source file you can see, and whose effects you can see, but never at the same time. God, it's like a bad pthread project....
Somebody set up us the bomb!
The guy looks like he's covered in coke dust.
Designing for failure may be the key to success.
Engineering for Failure
If you were looking for an expert in designing database management systems, you couldn't find many more qualified than IBM Fellow Bruce Lindsay. He has been involved in the architecture of RDBMS (relational database management systems) practically since before there were such systems. In 1978, fresh out of graduate school at the University of California at Berkeley with a Ph.D. in computer science, he joined IBM's San Jose Research Laboratory, where researchers were then working on what would become the foundation for IBM's SQL and DB2 database products. Lindsay has had a guiding hand in the evolution of RDBMS ever since.
In the late 1980s he helped define the DRDA (Distributed Relational Database Architecture) protocol and later was the principal architect of Starburst, an extensible database system that eventually became the query optimizer and interpreter for IBM's DB2 on Unix, Windows, and Linux. Lindsay developed the concept of database extenders, which treat multimedia data--images, voice, and audio--as objects that are extensions of standard relational database and can be queried using standard SQL (Structured Query Language). Today he is still at work deep in the data management lab at IBM's Almaden Research Center, helping to create the next generation in database management products.
Our interviewer this month is Steve Bourne, of Unix "Bourne Shell" fame. He has spent 20 years in senior engineering management positions at Cisco Systems, Sun Microsystems, Digital Equipment, and Silicon Graphics, and is now chief technology officer at the venture capital partnership El Dorado Ventures in Menlo Park, California. Earlier in his career he spent nine years at Bell Laboratories as a member of the Seventh Edition Unix team. While there, he designed the Unix Command Language ("Bourne Shell"), which is used for scripting in the Unix programming environment, and he wrote the ADB debugger tool. Bourne graduated with a degree in mathematics from King's College, London, and has a Ph.D. in mathematics from Trinity College in Cambridge, England.
Photography by Tom Upton
STEVE BOURNE Why don't we start off with the thought that you can't recover from an error until you've detected the error.
BRUCE LINDSAY Let's think a little bit about how errors happen--and they happen at all the different levels of the system, from an alpha particle discharging a capacitor in your memory to a fire, flood, or insurrection wiping out the entire site. From program logic blunders to the disk coming back with data from the wrong sector, things go wrong. You have to engineer for failure at all the different levels of the system, from the circuit level on up through subsystems like the database or the file system and on into the application programs themselves.
"Engineering for failure" sounds like a bad phrase, but that's really what's required to deliver reliable and dependable information processing.
SB It's certainly true that one of the mind-sets you have to have when you're writing code and designing systems is: What's going to break? There's a broad range of possibilities for approaching this, depending on the type of application or software. If you're writing a Microsoft Word-type program, the way you approach this might be different from if you're designing a heart monitor.
BL In the heart monitor case, you better keep the heart going, whereas in the Microsoft Word case, you can just give them a blue screen and everybody is used to that.
SB But also in the heart monitor case, it's hard to ask users if they want to keep the heart going because the answer is pretty obvious, whereas in the Word case, you can ask the user in some cases what to do about it.
BL You can sometimes ask the user, although it is better to ask the subsystem what it is going to do to get itself as healthy as it can, or abandon as the case may be, depending on its analysis of the situation.
SB Are you
Are Heisenbugs anything like the Slashdot memory relapse? You know you've been reading too much /. when you say to yourself, "Wait, I think I saw that in /. somewhere." Then you go to Slashdot search to try to find it. You end up pulling 20000 articles -- like finding a needle in a haystack.
/. needs better database tools.
Perhaps
Linux at home
Web pages that disappear when you try to look at them....
A man with two clocks that agree has a much higher degree of certainty.
One of the things that is addressed to some extent in the article is the need to make error messages meaningful! There is nothing more frustrating to me than to encounter an error message like "syntax error."
At a minimum, an error message should have a Unique ID of where in the code this message is coming from, what was expected, what was actually found, and the context where it was found.
EXAMPLE:
Which would you prefer:
In my experience, meaningful error messages save more debugging time than it takes to put them in.
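A rough sketch of the kind of message described above, with a unique ID, what was expected, what was found, and the context. The helper names, the `CFG-0042` ID, and the config-file scenario are all invented for illustration.

```python
def format_error(error_id: str, expected: str, found: str, context: str) -> str:
    """Build a message carrying a unique ID, expected vs. found, and context."""
    return f"[{error_id}] expected {expected}, found {found!r} ({context})"


def parse_port(text: str, source: str) -> int:
    """Parse a port number, reporting failures with a full diagnostic."""
    try:
        return int(text)
    except ValueError:
        # Replace the generic conversion error with one that says where we
        # were, what we wanted, and what we actually got.
        raise ValueError(
            format_error(
                "CFG-0042",
                "an integer port number",
                text,
                f"while reading {source}",
            )
        ) from None


try:
    parse_port("eighty", "server.conf line 12")
except ValueError as err:
    print(err)
```

The unique ID is what lets you grep the source for the exact place the message came from, even if the wording changes later.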
Some languages do have support for error detection, for example the EXPRESS data modelling language (and EXPRESS-C, its executable counterpart). http://www.ap210.org/tiki/tiki-index.php?page=EXPRESS has some information on EXPRESS for anyone interested. Basically, it lets you specify integrity constraints on the model it defines.
Are there Heisen-features as well?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Dumbledore's been talking to Muggles!
Not quite on topic, but, I once tried writing code in SQL (in this case for ColdFusion) by using stored procedures and exception handling.
What a nightmare.
Many people code unique inserts like this.
Check for duplicate record.
if not found, then insert.
else, prompt user.
Using exception handling, you code like this.
insert.
if error thrown, prompt user.
One less query, lots less code.
One problem, the web application language treated all db errors as fatal. When asked, I was told this was by design.
Thinking about it, I feel that Macromedia didn't want me to code efficiently. You don't sell extra ColdFusion servers if you can offload all your data logic to the SQL server. (Where it belongs)
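For anyone who wants to see the insert-first pattern from this comment in runnable form, here is a small sketch using Python's standard `sqlite3` module rather than ColdFusion. The `users` table and the `add_user` helper are made up for the example; the unique constraint is assumed to be a primary key on the name.

```python
import sqlite3


def add_user(conn: sqlite3.Connection, name: str) -> bool:
    """Try the insert directly; return False if the name already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        # The database rejected the row (here: duplicate primary key).
        # This is where the application would prompt the user.
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")

print(add_user(conn, "alice"))  # True
print(add_user(conn, "alice"))  # False
```

Besides saving a query, this avoids the gap between "check" and "insert" in which another session could add the same row. One caveat: `sqlite3.IntegrityError` covers other constraint violations too (NOT NULL, for instance), so real code may want to inspect the error before assuming a duplicate.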
pack
reindex
restore backup.....
repeat
So this Heisenberg guy is to blame for the errors. So...does he work for Microsoft or something?
I bet he didn't look into Java. Java (at least) allows and enforces that. A method can only throw an exception if it declares that it does. A caller is forced either to provide appropriate handlers or to declare that it throws the exceptions not handled at its level. If a method can throw A, B or C but gets D during its execution, it has to map D in some way to A, B or C (or not throw an exception at all).
Of course, I am talking here about checked exceptions. Unchecked exceptions are supposed to represent *bugs*, and nobody should be trying to capture those.
The sad thing is that even seasoned Java programmers do not understand how to write code w.r.t. exception handling. And beginners are usually turned off by the verbosity required by exception handling, so it is usual to see code where people capture (because they are forced by the language) and ignore exceptions (because they are too lazy and/or stupid to understand the consequences).
I just found my new avatar picture. :)
ps: not a troll, this guy's a freakin genius. I hope I look like that in 20+ years.
Of course, both clocks could be totally broken but have been set to the same (unchanging) time by some obsessive-compulsive person "tidying up". Then the clocks are each right twice a day but not when you think.
AC comments get piped to
It is called testers paradox in statistics.
It can mean that either the production or development environment isn't stable. For example there could be a bug in the program which depends on precise timings. Even on the same system with the same software it may show up sometimes but not other times.
Jesus loves you, I think you suck
In the heart monitor case, you better keep the heart going, whereas in the Microsoft Word case, you can just give them a blue screen and everybody is used to that.
But blue screens probably cause a lot of stress heart-attacks, so that the end result is the same.
Table-ized A.I.
I couldn't help noticing Mr. Lindsay's explanations of what a process would or could do. He kept describing it in the first person:
- "You asked me to do X, I didn't do it."
- "Aha, this seems like I should go further."
- "Oh, I see this as one of those really bad ones."
- "I'm going to initiate the massive dumping now."
Obviously he is an expert in his field but I'm not sure if he talks this way because of his personality or because there isn't a vocabulary big enough to describe it.
Would you imagine a medical doctor talking this way?
- "So the white blood cells fight with the cancer cells: die evil cell, die!!"
Or an engineer:
- "The little peg asks its big brother: can you help me convert this energy into circular motion?"
Heisensoftware: Programs that are believed to exist but no one seems to know where. Such as Duke Nukem Forever.
Give me my freedom, and I'll take care of my own security, thank you.
Someone asks someone about a term they used, and it's a troll? Hey, how about not moderating posts if you don't have a clue? I swear all of the kids with moderator points are trying to ruin the site.
The on-line hacker Jargon File, version 4.1.0
I think it's funny that both problems are easily solvable simply by throwing enough money at them. Alas, the world makes some sort of sick sense.
... dude looks like a yet-ti...
Perhaps if he shaved occasionally UDB for Unix/Linux/Windows wouldn't be the crufty mess it is today.
dayum, youse guys must have slow mail. this was in last month's Queue.
I am the Lorvax, I speak for the machines.
ERROR: You've forgotten that most Slashdotters are initialized with the value of Linux.
is the ACM Queue obscure.
Not everybody reads the same obscure material as you do. You're SOOO up-to-date. Do you watch television as well?
Try Corewar @ www.koth.org - rec.games.corewar
That's why I hate to leave those kinds of problems unsolved, where 'solved' is defined as "I know why it went away"
FWIW, in hardware design Heisenbugs often occur due to things like metastability, race conditions, or noise. There are plenty of boards that will work when you put them on an extender to troubleshoot them. That often means that adding some length or capacitance to the I/O changed things just enough to make it work... yes, they suck as much as software heisenbugs... ;-(
I couldn't help noticing Mr. Lindsay's explanations of what a process would or could do. He kept describing it in the first person:
- "You asked me to do X, I didn't do it."
- "Aha, this seems like I should go further."
- "Oh, I see this as one of those really bad ones."
- "I'm going to initiate the massive dumping now."
Obviously he is an expert in his field but I'm not sure if he talks this way because of his personality or because there isn't a vocabulary big enough to describe it.
Expert in his field? Language in CS? I was sure that he was talking about his experience in anilingus with coprophilic overtones...