Data Crunching


Posted by timothy on Monday June 20, 2005 @08:15AM from the constructive-critique dept.

Vern Ceder writes "I really expected to love Data Crunching. The Pragmatic Bookshelf has come up with some very good and, well, "pragmatic" texts in the past so I was looking for more of the same. Even better, the subject of the book was the routine data extraction, massaging and formatting that I (and a lot of other coders) spend so much time on. I was really looking forward to adding a couple more pragmatic tools to my coding toolbox. Unfortunately (as you may have guessed), I really can't say I love Data Crunching. It's a good book, but there are several minor points that keep it from being a truly great book." Read on for the rest of Ceder's review.

Data Crunching: Solve Everyday Problems Using Java, Python, and more
author: Greg Wilson
pages: 176
publisher: Pragmatic Bookshelf
rating: 7
reviewer: Vern Ceder
ISBN: 0974514071
summary: A good introduction to data crunching, but watch the examples.

On the positive side, there is a lot of good stuff in this book. I would even go so far as to recommend it to everyone who writes code to extract or manipulate data, particularly those less experienced. Greg Wilson should be praised for taking the idea of data crunching seriously and for systematically dealing with its patterns and pitfalls. A lot of important work gets done every day with one-off programs and behind-the-scenes scripts, and Wilson is right that the techniques that go into this sort of coding are different from, but just as important as, those that go into full-blown application development.

The strength of this book is that it offers useful approaches and patterns for dealing with a variety of common programming situations and types of data, while also pointing out their common traps and pitfalls. Wilson starts with techniques for crunching text data, moves on to the use of regular expressions, XML, binary data, and SQL databases before concluding with a special section on "horseshoe nails," various little techniques which just might help save the day. Quite often he uses examples in both Python, which he calls an "agile" language, and Java, a "sturdy" language. The basic advice offered is sound, if not shocking -- keep things simple, test as you develop, don't duplicate code, use existing scripts and utilities when possible, and so on. The combination of such sound advice with a wealth of practical examples makes for a very effective handbook, particularly for someone new to data crunching.

So is Data Crunching a good book? Definitely. Should you read it if you regularly do routine data manipulation and extraction? Absolutely. And yet...

And yet there are a number of things that just aren't quite right. The text and binary sections are the best, while I would say that the XML and SQL sections are the weakest, partly because those topics are too broad to cover in a single slim chapter. If you already have an idea of how you might want to use XML or how to extract data from a SQL database, you're likely to find something handy in those chapters. On the other hand, if you're unfamiliar with them, this book probably doesn't have enough detail to get you writing useful code. I should say it doesn't have enough detail to get you writing useful code while knowing what you're doing. And data crunching without knowing what you're doing is a bad idea. Trust me on that one.

I have another problem with the section on SQL. Several of the slicker SQL recipes rely on nested queries (pages 147-151). MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as a parenthetical mention that some things might not work on all SQL implementations. It seems to me that this is exactly the sort of pitfall a book like this should inform the reader of.
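To make the portability concern concrete, here is a minimal sketch (not taken from the book) of a nested query next to a join that returns the same rows. The schema and data are invented for illustration, and SQLite via Python's sqlite3 module merely stands in for whatever database you actually use; whether a given MySQL installation accepts the first form depends on its version.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 250.0), (2, 1, 40.0), (3, 2, 15.0);
""")

# Nested-query form: needs subquery support in the database engine.
nested = """
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders WHERE total > 100)
    ORDER BY name
"""

# Join form: the same question asked without a subquery.
joined = """
    SELECT DISTINCT c.name FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE o.total > 100
    ORDER BY c.name
"""

nested_rows = conn.execute(nested).fetchall()
joined_rows = conn.execute(joined).fetchall()
print(nested_rows, joined_rows)
```

Not every nested query can be rewritten this mechanically, so treat this as one common case rather than a general recipe.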

There are also several coding examples that bother me. Since I tend to both learn and teach by paying close attention to examples, I get uncomfortable with examples that seem to suggest something other than what they should.

For instance, the very first pieces of sample code (pages 9-10) in the text chapter are Python and Java programs to reverse the order of lines in a text file. I don't have a problem with the exercise itself; I've often assigned it to beginning programmers. However, this book is about quick and reliable solutions to common data handling problems, not leading people through basic programming exercises. Ironically, the very same chapter discusses the advantages of using the Unix command-line and its wealth of little tools. So wouldn't it be reasonable to expect at least a brief note or example showing that the REALLY easy way to solve the problem is with a single line: $ tac filename > filename2? Yet tac is not even in the list of "Useful Commands" on page 24. If reversing lines is just a programming example, it shouldn't be the lead example in a book like this, and if it is important, then the book should mention that the problem has already been solved.
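For readers who want to see the size of the exercise, a minimal Python sketch of line reversal follows. It is not the book's code; the function name is made up here, and it assumes the whole input fits in memory.

```python
import io

def reverse_lines(src, dst):
    """Write the lines of the file-like object src to dst in reverse order."""
    lines = src.readlines()
    # Give the final line a newline if it lacks one, so that it does not
    # run into the line that follows it in the reversed output.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    dst.writelines(reversed(lines))

# In-memory demonstration; real use would pass open file objects instead.
src = io.StringIO("one\ntwo\nthree")
dst = io.StringIO()
reverse_lines(src, dst)
print(dst.getvalue())
```

Handling of a missing final newline is a choice made for this sketch and may differ from what a command-line tool does with the same input.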

In the same vein, Wilson spends a fair amount of time in the text chapter illustrating code to parse command-line parameters, before admitting that libraries for the task abound in most languages. Granted, being able to snag a parameter or two off of the command-line without using a library can sometimes be handy; but implementing a more involved command-line parser is a problem that has already been abundantly solved.
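As a rough illustration of the "already solved" route, Python's standard library now ships argparse (it postdates this review; optparse and getopt filled the same role at the time). The program description and option names below are hypothetical and do not come from the book.

```python
import argparse

def parse_args(argv):
    """Parse a hypothetical command line for a line-reversing script."""
    parser = argparse.ArgumentParser(description="Reverse the lines of a file.")
    parser.add_argument("infile", help="file to read")
    parser.add_argument("-o", "--output", default="-",
                        help="file to write ('-' means standard output)")
    parser.add_argument("-n", "--number", action="store_true",
                        help="prefix each output line with its line number")
    return parser.parse_args(argv)

# Pass an explicit list here; a real script would pass sys.argv[1:].
args = parse_args(["data.txt", "-o", "out.txt"])
print(args.infile, args.output, args.number)
```

The library also generates a usage message and error reporting, which is most of what a hand-rolled parser ends up reimplementing.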

Similarly, one of the examples in the chapter on regular expressions uses a regular expression to check to see if a string contains a valid IP address (pages 65-66). After showing how to use a regular expression to scan a dotted quad of digits, the text then admits that using a regular expression alone would lead to too much complexity, since it's hard to use a regular expression to check to see if a 1 to 3 digit number is less than 255 (or 127, which is what he uses in his code). So the example on page 66 ends up compiling and matching a regular expression like this:

pat = re.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")
. . .
m = pat.match(text)
for g in m.groups():
. . .
when a Python coder would more naturally just use:

quads = text.split('.')
for number in quads:

Sure, it's a good example of how to extract matched items, but the implication is that using a regular expression is the best way to extract numbers separated by dots, when in fact Python has a simpler, easier and more reliable way to deal with it. Again, a quick mention of the "easy" way to solve the problem would have been appropriate.
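The split-based fragment above elides its checks, so here is one hedged sketch of how they might be filled in. The function name and the choice to return None on failure are assumptions made for this example, not anything from the book.

```python
def parse_dotted_quad(text):
    """Return the four octets as ints, or None if text is not a dotted quad."""
    quads = text.split(".")
    if len(quads) != 4:
        return None
    octets = []
    for part in quads:
        # One to three ASCII digits only: this rejects '', '+23', '-56',
        # '1e34', embedded spaces and non-ASCII digit characters.
        if not (part.isascii() and part.isdigit() and len(part) <= 3):
            return None
        value = int(part)
        if value > 255:
            return None
        octets.append(value)
    return octets

print(parse_dotted_quad("192.168.0.1"))
print(parse_dotted_quad("U.S.A."))
```

Without the per-part checks, a bare split accepts any string containing dots, so most of the safety here comes from the validation rather than from split itself.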

These kinds of issues are what keep Data Crunching from being a great book. In spite of them, it is still a very good and useful book and Greg Wilson has done a good job with a topic all too often ignored. The general idea is great, and the principles, problems and solutions are well-explained and relevant. If data crunching is something you do, I would certainly recommend that you read this book, but with a somewhat critical eye.

You can purchase Data Crunching: Solve Everyday Problems Using Java, Python, and more. from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

94 comments

Sounds like it's missing something by copeland3300 · 2005-06-20 08:19 · Score: 0

sounds like even though this book gives a good idea of how to do certain things, there's a lot of little, yet very useful tips and tricks that it leaves out
1. Re:Sounds like it's missing something by JamesOfTheDesert · 2005-06-20 16:13 · Score: 1
  
  I read the sample chapter online, and got the impression that the author was spreading himself too thin. While the content was more or less correct, it demonstrated a lack of real understanding, with a few clear mistakes.
  
  --
  
  Java is the blue pill
  Choose the red pill
Reviewer catches himself. by juuri · 2005-06-20 08:21 · Score: 4, Insightful

Don't berate the author for his examples using nested SQL when a paragraph later you call him out for not using "tac" because you assumed it is universal.

Like nested queries, tac isn't standard across all unix platforms.

--
--- I do not moderate.
1. Re:Reviewer catches himself. by FriedTurkey · 2005-06-20 08:38 · Score: 3, Interesting
  
  Isn't nested SQL part of the ANSI standard? MySQL is a great database for certain purposes but every other modern database has nested SQL. I don't think an author should avoid a technique native to most databases because one database's older versions don't have it.
2. Re:Reviewer catches himself. by Anonymous Coward · 2005-06-20 08:38 · Score: 0
  
  Can we review reviewers? I have no interest in this book, but this review was downright horrible.
3. Re:Reviewer catches himself. by Anonymous Coward · 2005-06-20 08:57 · Score: 0
  
  rev(1) is a more universally available command
4. Re:Reviewer catches himself. by helixblue · 2005-06-20 09:06 · Score: 3, Informative
  
  FYI: It's worth mentioning that rev is not very close to being universal either, existing only on Linux and BSD boxes as best as I can tell. tail -r is more universal in that it works under both SYSV and BSD variants, but oddly enough: not Linux.
  
  The GNU tail folks were pretty stubborn about keeping their file reversal in the tac command, wreaking havoc with cross platform scripts everywhere. :)
5. Re:Reviewer catches himself. by Anonymous Coward · 2005-06-20 11:36 · Score: 0
  
  No one complies with the ANSI SQL standard. No one. The ANSI SQL standard is a fantasy.
6. Re:Reviewer catches himself. by kayumi · 2005-06-20 11:42 · Score: 0
  
  rev and tail -r do different things.
  rev reverses each line.
  tail -r reverses the order in which lines are read
7. Re:Reviewer catches himself. by Profound · 2005-06-20 13:51 · Score: 1
  
  perl -e "print reverse " filename
8. Re:Reviewer catches himself. by Profound · 2005-06-20 13:56 · Score: 2, Informative
  
  perl -e "print reverse <>" filename
  
  (next time I'll use preview)
9. Re:Reviewer catches himself. by Anonymous Coward · 2005-06-20 15:57 · Score: 0
  
  True... but some are (much) farther away than others.... like MySQL. Learning SQL on MySQL just sets you up for looking like an idiot later when you have to use, well, pretty much anything else.
10. Re:Reviewer catches himself. by hubie · 2005-06-22 07:05 · Score: 1
  
  I disagree. I thought it was a very good book review because he described the scope and contents of the book, then talked about what in particular was good and bad supported with specific examples. Usually what gets passed off as a book review here (and elsewhere) is more a synopsis of the type you'd find from an "Amazon top 100" reviewer: a brief description similar to what is on the dust jacket, then a sentence or two talking about what is in each chapter, and if you're lucky maybe a comment or two about how it was written too dry or folksy.
  One might take issue with the reviewer's comments or opinions (e.g., whether the statements about MySQL are accurate), but he at least has comments and opinions to discuss.
quads = text.split('.') by Evro · 2005-06-20 08:25 · Score: 5, Insightful

quads = text.split('.')
This assumes valid data and not something mangled like "1.2.3" or "U.S.A.". Using the numeric regex match that the book's author suggested would be more reliable in matching IP addresses only.

--
rooooar
1. Re: quads = text.split('.') by abigor · 2005-06-20 08:37 · Score: 4, Informative
  
  quads = text.split('.')
  if len(quads) != 4:
      raise NotAnIPAddress
  for member in quads:
      try:
          quad = int(member)
          if quad < 0 or quad > 255:
              raise NotValidQuad
      except:
          raise NotValidQuad
  . . . etc.
2. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 08:53 · Score: 2, Informative
  
  Ummm... is receiving a number less than 0 or greater than 255 an exception? No, it's abnormal input sure, but that is a nasty and poor use of exceptions.
  
  You get an F on programming style :(
3. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 08:57 · Score: 0
  
  What are you suggesting then? It's the programmer's responsibility to catch exceptions as they are thrown back to him. Would you like to open a message box instead? And what if it's a command line program? Exceptions can be ignored...
4. Re: quads = text.split('.') by abigor · 2005-06-20 09:00 · Score: 3, Informative
  
  Jesus, it's just a demo to show that calling split isn't particularly unsafe. How you handle the errors is up to you. Consider the raise statements to be pseudocode.
  
  Ah, but your last line explains everything: you teach programming. You don't do it for a living. Makes sense now.
5. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 09:16 · Score: 0
  
  It's considered good Python style to use exceptions for flow control.. I shit you not.
  
  Or at least, it was when I used it many years ago. The great thing about Python is, "there's more than one way to do it". ;-)
6. Re: quads = text.split('.') by Evro · 2005-06-20 09:22 · Score: 1
  
  Well, sure, that'll work, but that's not what the reviewer included as an alternate example. The \d{1,3} syntax would do a lot of sanity checking right off the bat. But if there's one thing I learned from Perl, it's TMTOWTDI. If you're guaranteed that the data being passed to you is valid and clean, using a simple split on the '.' character would suffice. I usually prefer to err on the side of "never trust the data," especially when designing modular stuff, as you never really know who's going to be passing you the data. Better to check twice than not at all, to me anyway.
  
  This may or may not work... typing code in an HTML textarea is annoying.
  function parseData($stuff) {
      ....
      if (preg_match('/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/', trim($stuff))) {
          log_ip_address($stuff);
      }
      ...
  }

  function log_ip_address($ip) {
      $ip = preg_split('/\./', $ip);
      foreach ($ip as $q) {
          // ctype_digit() rather than is_int(): preg_split() returns strings
          if (ctype_digit($q) && ($q >= 0) && ($q <= 255)) {
              continue;
          } else {
              return FALSE;
          }
      }
      // It is a numeric IP address... If we were cool we'd check to make sure it
      // wasn't in a reserved ip block and other sanity checks.
      log($ip);
      return TRUE;
  }
  
  --
  rooooar
7. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 09:34 · Score: 0
  
  Ah, but your last line explains everything: you teach programming. You don't do it for a living. Makes sense now.
  You are a complete dipshit. People who program for a living know that handling a thrown exception can cause a major performance hit. Don't use exceptions to control program flow.
8. Re: quads = text.split('.') by Pete+Brubaker · 2005-06-20 09:36 · Score: 3, Informative
  
  "\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
  
  This will match a valid IP address.
  
  --Pete
  
  --
  What's a sig? Pete Brubaker
9. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 10:20 · Score: 0
  
  And you don't write code for a living. So go fuck yourself.
  
  If you did, you'd realise two things: one, that error severity depends on the situation (as the grandparent pointed out, the raise is just pseudocode, but it is a valid thing to do). Two, maintainability is key. Python provides exceptions as a core language feature; use them to handle exceptional conditions (errors).
  
  Now fuck off, junior.
10. Re: quads = text.split('.') by PatrickThomson · 2005-06-20 10:39 · Score: 2, Informative
  
  The whole point of python is to raise and catch exceptions instead of fucking about trying to make it all nice. So the parsing program might be called by
  
  ip = getuserinput()
  try:
      DoShitFromGrandparent()
  except NotAnIPAddress:
      print "Not an IP address, dumbass"
  except NotValidQuad:
      blah blah etc.
  
  --
  I am one of many. My idea is not unique, nor do I expect my voice alone to sway you. I speak in a chorus of opinion.
11. Re: quads = text.split('.') by feronti · 2005-06-20 11:43 · Score: 2, Informative
  
  Actually, that would depend on where this code lives... if it's in the user interface, sure, using an exception is probably not the right way to do it, since you know right there how to handle it. But what if it's deep in the bowels of a library? A library should validate that its callers are following the contract, but has no way of knowing how to handle the error when the value is out of range, so it should fail early and throw an exception so the higher layers can do something about it.
  
  Besides, as another poster mentioned, using exceptions for flow control is an actual pattern in Python. The Python philosophy is 'it's easier to ask for forgiveness than to ask for permission.' Though, the truly python way would be to build the address and just pass it on, and let someone who knows better validate it.
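A hedged sketch of the pattern discussed in this subthread: a library-style function that raises, and a caller that decides what the failure means. The exception names are borrowed from the pseudocode earlier in the thread; parse_ip and describe are made-up names. Note that int() is lenient (it accepts a leading sign and surrounding whitespace), so this version leans on the range check to reject negatives and would still accept something like "+23".

```python
class NotAnIPAddress(ValueError):
    """Raised when the text does not have four dot-separated parts."""

class NotValidQuad(ValueError):
    """Raised when a part is not an integer in the range 0-255."""

def parse_ip(text):
    """Library side: validate and fail early, without deciding how to recover."""
    quads = text.split(".")
    if len(quads) != 4:
        raise NotAnIPAddress(text)
    result = []
    for member in quads:
        try:
            quad = int(member)
        except ValueError:
            raise NotValidQuad(member) from None
        if not 0 <= quad <= 255:
            raise NotValidQuad(member)
        result.append(quad)
    return result

def describe(text):
    """Caller side: turn each failure into whatever the application needs."""
    try:
        return "ok: %s" % parse_ip(text)
    except NotAnIPAddress:
        return "not an IP address"
    except NotValidQuad as err:
        return "bad quad: %s" % err

print(describe("10.0.0.1"))
print(describe("1.2.3"))
print(describe("1.2.3.999"))
```

Catching only ValueError around int(), rather than using a bare except, keeps unrelated errors from being reported as bad quads.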
12. Re: quads = text.split('.') by thor · 2005-06-20 12:10 · Score: 1
  
  or using Perl:
  
  if((@quads)=m:(\d+)\.(\d+)\.(\d+)\.(\d+):)
  {
  # do stuff
  }
  else
  {
  warn ( "bad address" ) ;
  }
  
  QED
13. Re: quads = text.split('.') by Anonymous Coward · 2005-06-20 14:31 · Score: 0
  
  Pfft. Regexes work fine... if you know how to use them.
  
  To match a number 0-255, the easiest pattern to see it with is this:
  
  (\d|\d\d|1\d\d|2[0-4]\d|25[0-5])
  
  [Which matches any one digit (0-9), or any two digits (10-99), or a 1 followed by any two digits (100-199), or a two followed by 0-4 and any digit (200-249), or 25 followed by a 0-5 (250-255).]
  
  So use that pattern 4 times, with 3 literal periods between them. Want to shorten that? Well, the 0-199 bits can always be made into:
  
  (1?\d?\d)
  
  Which is an optional '1' followed by an optional digit, followed by a required digit (we don't want any part of our pattern to have a possibility of matching nothing at all). Merging that in will give us merely:
  
  (1?\d?\d|2[0-4]\d|25[0-5])
  
  And the whole IP address, over all, would then be:
  
  $input =~ m/(1?\d?\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])/;
  
  Which is a tad ugly, but works just fine. Yes, I'm using Perl style regexes here, but Java uses the Perl syntax within the regexes (mostly--you still have to escape all the backslashes with a second backslash).
  
  Or, yeah, split on '.' and just make sure each number is at least 0 (lest some idiot try to give you a negative IP address somehow in user input) and at most 255. If you want to get really clever, you might exclude 127.*.*.* and perhaps other RFC defined ranges from certain applications for some purposes. You can also do that with a regex, but it's a tad more contorted. I'd probably use the version that was a bit more split up and split it into 100-119, 120-126 & 128-9, 130-199 with something like:
  
  (\d|\d\d|1[01]\d|12[0-68-9]|1[3-9]\d|2[0-4]\d|25[0-5])
  
  Using that for the first digit, then the normal pattern on the remaining 3 (because all we care there is that the first digit isn't a 127--if it's not, and it matches the rest, we're fine). That gives us this pattern:
  
  (\d|\d\d|1[01]\d|12[0-68-9]|1[3-9]\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])\.(1?\d?\d|2[0-4]\d|25[0-5])
  
  I think that the O'Reilly regex book has a more detailed example of doing this, and a few ideas on other clever ways to split the pattern up.
  
  Me? I focus on pattern readability and sometimes efficiency, not pattern length. The irony is that long, specific patterns can be more efficient (if hideously ugly--see the RFC compliant 'email address' matcher in the back of the regex book I mentioned) provided they don't make a huge morass of crap for the automaton to have to backtrack through.
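The shortened octet pattern from this comment can be tried in Python as below. This is a sketch, not the poster's code: the names OCTET, DOTTED_QUAD and is_dotted_quad are invented here, the alternatives are reordered, and the whole string is required to match, which the Perl one-liner above does not do.

```python
import re

# One octet, 0-255, built from the alternatives described in the comment.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

# Four octets separated by literal dots; re.ASCII limits \d to 0-9.
DOTTED_QUAD = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}", re.ASCII)

def is_dotted_quad(text):
    """True only if the entire string is a dotted quad with octets 0-255."""
    return DOTTED_QUAD.fullmatch(text) is not None

print(is_dotted_quad("192.168.0.1"))
print(is_dotted_quad("256.1.1.1"))
# An unanchored search still finds a valid-looking substring in bad input.
print(DOTTED_QUAD.search("999.1.1.1") is not None)
```

The last line is the reason for using fullmatch (or explicit anchors): without them the pattern is satisfied by the tail of a longer, invalid string.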
14. Re: quads = text.split('.') by Wavicle · 2005-06-20 19:58 · Score: 1
  
  Ummm... is receiving a number less than 0 or greater than 255 an exception?
  
  In this case, it probably is.
  
  No, it's abnormal input sure,
  
  abnormal input is an exceptional condition by definition. Normal input is expected.
  
  but that is a nasty and poor use of exceptions.
  
  No it isn't.
  
  You get an F on programming style
  
  Your teaching credential needs revoking. As anybody worth their salt as a programmer would know that whether or not to handle something as an exception depends on the severity of the problem, the frequency of the problem, and the time critical nature of the code in question. Validating an IP address is something usually done infrequently, often in response to a user action, so any performance hit from exception handling is insignificant compared to the additional processing that will occur popping up some sort of dialog to let the user know something is wrong.
  
  --
  Education is a better safeguard of liberty than a standing army.
  Edward Everett (1794 - 1865)
15. Re: quads = text.split('.') by Hercynium · 2005-06-21 01:04 · Score: 1
  
  Maybe he's trying to be helpful... but as a perl programmer, I say, mod parent funny!!!
  
  My philosophy (yes, everybody seems to have one these days) is this:
  
  1. Define the rules for valid data.
  2. Classify types of invalid data.
  3. Break the rules down into a series of discrete steps.
  4. Write the validation code, using the simplest semantics possible.
  5. If data validation fails, try to match the problem to one of the invalid data classifications and throw an exception.
  
  Yes, it sounds complex, but for the type of work I do, we need to:
  1. Check our data for correctness.
  2. Reject invalid data.
  3. Identify what went wrong
  4. Be able to quickly understand the rules by reading the code, for future maintenance.
  
  okay, back to work. gotta mung a list of ATM PVCs
  
  --
  I'm done with sigs. Sigs are lame.
16. Re: quads = text.split('.') by Anonymous Coward · 2005-06-22 05:28 · Score: 0
  
  I think QEF is appropriate here instead of QED.
nested queries are a problem? by stoolpigeon · 2005-06-20 08:26 · Score: 4, Insightful

If a book uses nested queries and some rdbms doesn't -- the problem lies with the rdbms. I've never used mysql and I've avoided the flames about it not being a real database.... but come on. That is weak.

--
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
1. Re:nested queries are a problem? by jthayden · 2005-06-20 08:30 · Score: 2, Insightful
  
  Granted the ANSI SQL standard isn't followed as closely as perhaps other standards are, but if Nested Queries are in the standard, then I would have to say the RDBMS is at fault and not the book.
2. Re:nested queries are a problem? by poot_rootbeer · 2005-06-20 09:26 · Score: 2, Insightful
  
  I may be wrong, but I believe that an RDBMS must support nested subqueries to be conformant to the ANSI SQL92 Entry-Level specification (maybe even SQL89?).
  
  Not to fan the flames of another advocacy flamewar, but if MySQL hasn't caught up to a 13-year-old standard yet, it shouldn't be treated as a fully-functional SQL RDBMS.
  
  If you're running MySQL you should be aware of its limitations yourself; it's not the book's job to bring them to your attention for you.
3. Re:nested queries are a problem? by Anonymous Coward · 2005-06-20 10:02 · Score: 0
  
  the reviewer is just a gnu-shit-centric knowing-linux-and-nothing-else mysql-is-a-real-db yadda yadda hippie.
4. Re:nested queries are a problem? by Anonymous Coward · 2005-06-20 11:04 · Score: 0
  
  lol
I really expected to love Data Crunching. by dfn5 · 2005-06-20 08:27 · Score: 3, Funny

I also expected to love getting my teeth pulled. Trust me. It wasn't that great.

--
-- Thou hast strayed far from the path of the Avatar.
1. Re:I really expected to love Data Crunching. by Anonymous Coward · 2005-06-20 08:50 · Score: 0
  
  Well trust me you're lucky, I read that as Data Crusher and thoughts raced in my head at the implications for all StarTrek fans.
2. Re:I really expected to love Data Crunching. by GecKo213 · 2005-06-20 08:58 · Score: 1
  
  I really expected to love Data Crunching. Mmmmmm... Data Crunching. *Drool* Tastes like Chicken.
  
  --
  Generation Trance: What generation are you?
Matching a dotted quad by Anonymous Coward · 2005-06-20 08:28 · Score: 1, Informative

Shouldn't be too hard if we can use ereg() or similar. How about checking for 0-255 like so: "([1-9][0-9]{0,1}|1[0-9][0-9]|2[0-4][0-9]|25[0-5]|0)", then it's just a matter of checking for those between dots?
Re:Astroturf @ /. by MisanthropicProgram · 2005-06-20 08:31 · Score: 1

...since the publisher paid good money for this review...
Dude, I like to start controversy as much as the next Tr...errr...guy, but you need to give some evidence.
Regex method is better by Anonymous Coward · 2005-06-20 08:32 · Score: 2, Insightful

Your oversimplification of his solution for validating ip addresses is a fine example of a poor review by someone who thinks he knows more than the author.

Try passing in a string such as "I.like puppies!!!". A regex like the one the author provided will easily reject this, so there's no need to worry about checking for numericness, or any other strange characters at all. The regex in fact filters out EVERYthing so that all that has to be done is to check the actual numeric values for the right value range. I would not like to see the remainder of the alternate example (I'm sure it wouldn't be simple)

I'm all for KISS but there definitely is such a thing as too simple.
1. Re:Regex method is better by computational+super · 2005-06-20 15:52 · Score: 1
  
  Try passing in a string such as "I.like puppies!!!". A regex like the one the author provided will easily reject this
  So... you're saying that the author doesn't like puppies? Wow. What a scrooge.
  
  --
  Proud neuron in the Slashdot hivemind since 2002.
MySQL by imgumbydammit · 2005-06-20 08:33 · Score: 0, Troll

MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as parenthetical mention that some things might not work on all SQL implementations.

The fact that MySQL sucks is not a limitation of the book, as far as I'm concerned. Stumbling across the bits of SQL that some particular version of MySQL does not support (e.g. UNIONs, inline views, etc etc) is just one of the great treats in life.

--
That's right: I'm gumby dammit.
1. Re:MySQL by Tiny+Elvis · 2005-06-20 09:47 · Score: 1
  
  Agree. It's only a toy if it doesn't support subqueries. Insert coin.
2. Re:MySQL by DogDude · 2005-06-20 10:24 · Score: 2, Insightful
  
  What I can't believe (and I'm replying more to myself than anything else, because I just realized...) is that if MySQL hasn't been supporting something as basic as sub-queries until recently that means that there have been tons and tons of complex applications written without subqueries! Holy mother of christ... How would something as simple as even Slashdot get written without subqueries? There must be thousands upon thousands of apps out there that were written with almost -no- understanding of what a modern RDBMS is designed to do even though they're manipulating data. I can only imagine the middle layer of all of these apps doing many, many, many, many unnecessary database connections and queries. Wow. There are truly a LOT of bad programmers out there.
  
  --
  I don't respond to AC's.
3. Re:MySQL by iggymanz · 2005-06-20 12:34 · Score: 1
  
  heh, was that a very obscure mocking of typical J2EE persistence layer architecture?
4. Re:MySQL by quasi_steller · 2005-06-20 15:48 · Score: 2, Insightful
  DBAs and database developers do not consider MySQL a database.
  
  You have got to be kidding me. Of course MySQL is a database. A database is simply a collection of data organized so that a computer program can access pieces of that data, something a MySQL database certainly does. This would make MySQL as a whole, a DBMS (DataBase Management System), as it is a collection of programs used for managing a database. Now, Is MySQL a RDBMS (Relational DBMS)? Well, that depends on your definition of RDBMS. If you define a RDBMS as a DBMS that stores its data in the form of related tables, then MySQL is most certainly a RDBMS. However, if you're a strict follower of Codd, then you might not consider MySQL a RDBMS, as it doesn't follow all of Codd's rules. However, under this strict definition, no SQL DBMS is a RDBMS, as SQL breaks some of Codd's rules.
  
  Perhaps what you meant to say was: "DBA's don't consider MySQL a true SQL database." (Or at least until very recently, as MySQL has gained a lot of functionality.)
  
  Don't get me wrong, I don't disagree with you completely. While I believe MySQL has its uses, I also believe there are many applications where it just shouldn't be used. I just think that we need to be a little more careful when we choose our wording here, so we don't sound like we're trying to flame, or even worse troll. (By the way, I don't believe you were doing either. I'm sure that when you said database, you were thinking SQL.) MySQL is a database, it just is (was? I'm not sure about the newest version) not an SQL compliant database.
  
  References:
  
  http://en.wikipedia.org/wiki/RDBMS
  
  http://www.webopedia.com/TERM/R/RDBMS.html
  --
  ...interesting if true.
5. Re:MySQL by Anonymous Coward · 2005-06-20 16:18 · Score: 0
  
  "DBA's don't consider MySQL a true SQL database." Followed by "I just think that we need to be a little more careful when we choose our wording here"
  
  Seriously, dude, step off the soapbox. The developers called it "MySQL", not "MyDBMS". Why shouldn't we expect it to be "a true SQL database"?
6. Re:MySQL by Matje · 2005-06-20 18:23 · Score: 3, Insightful
  
  So from the fact that MySQL lacked subquery support you derive that there are a lot of bad programmers? me thinks there is only evidence here that you're a bad logician. Now that is a skill a good programmer must have ;). A couple of remarks:
  
  - if you're building a simple website, chances are you won't need any subqueries. Websites were (are?) the bread and butter of MySQL.
  
  - the fact that the dbms lacks subquery support does not imply that the programmer lacks knowledge about them, nor does it imply that programmers generally use unnecessary db connections or queries!
  
  - The MySQL manual states, correctly in my opinion, that in many situations subqueries can be rewritten to joins. Could it be possible that all those bad programmers out there were aware of this and you weren't?
7. Re:MySQL by Billly+Gates · 2005-06-21 03:34 · Score: 1
  
  Mysql is to a real RDBMS as Windows 3.11 is a true multitasking, multiuser, and reliable OS.
  
  Sure Windows 3.11 can theoretically support multiple users and multitask but I would prefer W2k thank you.
  
  Same is true with mysql. Mysql is popular because its free and is very multi-user friendly for ISP's with tons of user accounts so they bundle it with their hosting.
  
  PostgreSQL is arguably a lot better and also free. In Asia it's what most Linux users use by default. The tools for it are finally coming out, but until it's more multiuser friendly most ISPs will stay reluctant to support it.
  
  Mysql started out as just a small fast SQL filesystem to a simple embedded database for tiny apps. Its growing but its the fastest for small apps that do not need a RDBMS. This is why its experiencing growing pains. In many ways msql took over this market as mysql is turning into a RDBMS.
  
  But a real RDBMS includes MS-SQL Server, Oracle, and Sybase. Even postgresql is behind with some features but is fine for midrange applications.
  
  In a lot of ways MCSEs who grew up on Windows and are ignorant about unix are similar to those who use Mysql. You can do a ton of stuff with a real RDBMS, and going to mysql after experiencing a real RDBMS is like pulling teeth.
  
  --
  http://saveie6.com/
Amazon referral whore, mod down by Anonymous Coward · 2005-06-20 08:35 · Score: 2, Interesting

Link contains redirect to kaleidojewel's referral account. Don't encourage his spamming by rewarding him with payoffs.
1. Re:Amazon referral whore, mod down by JUSTONEMORELATTE · 2005-06-20 10:00 · Score: 1
  
  Link contains redirect to kaleidojewel's referral account. Don't encourage his spamming by rewarding him with payoffs.
  
  True, but it is indeed more than $7 cheaper than the bn.com link in the review.
  I'm not 'kaleidojewel' nor do I know him/her, I'm just sayin...
Reviewing the book or showing off geekiness? by zanderredux · 2005-06-20 08:35 · Score: 4, Insightful

Similarly, one of the examples in the chapter on regular expressions uses a regular expression to check to see if a string contains a valid IP address (pages 65-66). After showing how to use a regular expression to scan a dotted quad of digits, the text then admits that using a regular expression alone would lead to too much complexity, since it's hard to use a regular expression to check to see if a 1 to 3 digit number is less than 255 (or 127, which is what he uses in his code). So the example on page 66 ends up compiling and matching a regular expression like this:
pat = re.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")

Actually, that example is safer than just invoking text.split, as that long regex can shield you from injection attacks and help you enforce numeric IPs in one single command.
In the end, it is a matter of style, but just invoking text.split and trusting user input is... naive?!
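
For what it's worth, the two-step approach under discussion (a regex for the structure, ordinary code for the range) can be sketched roughly as follows. This is only an illustration, not the book's code: the name `is_valid_ipv4` and the 0 to 255 range per field are my assumptions, and it uses `[0-9]` rather than `\d` so that only ASCII digits match.

```python
import re

# Structure only: four dot-separated groups of 1 to 3 ASCII digits.
DOTTED_QUAD = re.compile(
    r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})"
)


def is_valid_ipv4(text):
    """Return True if text is a dotted quad with every field in 0..255."""
    match = DOTTED_QUAD.fullmatch(text)
    if match is None:
        # Wrong shape: signs, exponents, spaces, or extra fields end up here.
        return False
    # The regex guarantees digits only, so int() cannot raise at this point.
    return all(int(group) <= 255 for group in match.groups())


for candidate in ("192.168.0.1", "344.344.344.344", "+23.-56.255.1e34"):
    print(candidate, is_valid_ipv4(candidate))
```

The regex rejects anything that is not plain digits and dots before any conversion happens, so the range check stays a one-liner.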
1. Re:Reviewing the book or showing off geekiness? by owlstead · 2005-06-20 11:16 · Score: 1
  
  I mostly agree with you. You should use regex for what it is for: checking if the structure of the input is correct. Leave the checking of the actual values to the program. His comments just split at '.' characters. So this means that e.g. +23.-56.255.1e34 might evaluate to a "correct" IP address.
  
  The book shows the exact way I would do it; check for the maximum amount of structure in the IP address, allowing only digits and dots, and then proceed to make sure 344.344.344.344 is not accepted. There is nothing wrong with that. Obviously, the book should explain why to choose the most restrictive regex as well as when to use regex, and when not.
Learning concepts...bah! by Anonymous Coward · 2005-06-20 08:37 · Score: 3, Funny

Wilson spends a fair amount of time in the text chapter illustrating code to parse command-line parameters, before admitting that libraries for the task abound in most languages.
You know I had that same problem with my Operating Systems class. That text by Tanenbaum goes through countless examples of what makes a good system, and then at the end he FINALLY admits that there is something called Unix that I can just go and install. What a waste learning all of those concepts!
1. Re:Learning concepts...bah! by Rosco+P.+Coltrane · 2005-06-20 08:42 · Score: 0
  
  Come on, silly...
  
  There are perfectly good calculators for a dollar at the thrift store, yet you learn how to add, subtract, divide and multiply by hand at school. Why is that? Because it gives you a better understanding of what an addition, subtraction, division or multiplication are and involve mathematically.
  
  Same for this book. As a matter of fact, I love technical books that explain the workings of whatever subject they cover, and not just "you can get library X at http://xyz/ and use it, don't worry about how it works".
  
  --
  "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
2. Re:Learning concepts...bah! by Anonymous Coward · 2005-06-20 08:53 · Score: 0
  
  Come on, silly...
  
  I did that last night, all it got me was a slap in the face.
3. Re:Learning concepts...bah! by despik · 2005-06-20 09:31 · Score: 4, Informative
  
  Boy, did someone just miss the joke...
  
  --
  "I seem to have mastered a certain amount of control over physical reality."
Correction. by Anonymous Coward · 2005-06-20 08:38 · Score: 0

(okay, so it's 1-255, but you get the idea, and slashcode broke the string so I'll blame any further problems on that, just like I would have posted this 5 seconds following the parent if it weren't for the b0rken time limit... you'd think it would be possible to adapt this so that users who are good citizens get to wait a shorter period, or not at all? Maybe that's patented :-\)

"It's been 7 minutes since you last successfully posted a comment"

Oh, I'm not giving up. I can wait a week if that's what it takes.
Not really fair by amorico · 2005-06-20 08:40 · Score: 1

It's not fair to criticize the book because you use a tarted up text file instead of something like postgres or oracle or db2 or any number of other rdbms's that managed to support subqueries and foreign keys within 30 years of their invention.

--
"The plural of anecdote is not data." -- Roger Brinner
Perl and a new Mac by Anonymous Coward · 2005-06-20 08:42 · Score: 0

That's all you need for the perfect Data crunching machine.

Especially when the new Mactels come out !!!!
subqueries not a problem. by Anonymous Coward · 2005-06-20 08:50 · Score: 0

>I've never used mysql and I've avoided the flames about it not being a real database.... but come on. That is weak.

13.1.8. Subquery Syntax

HTH. HAND.
1. Re:subqueries not a problem. by stoolpigeon · 2005-06-20 09:00 · Score: 1
  
  which makes the criticism even weaker. If it is only old versions of the db that don't support it, the point is what?
  
  --
  It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
Re:1992 Called.... by Rosco+P.+Coltrane · 2005-06-20 08:50 · Score: 0, Troll

Your two last posts, combined with your high Slashdot ID and the general trollishness of your comments lead me to think you were born in 1992. Did I guess right?

1992 called, they want their spermatozoid-turned-spotty-teenager back...

--
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Good Advice by pinkythecat · 2005-06-20 08:52 · Score: 1

"read this book, but with a somewhat critical eye." Blindingly obvious but good advice.
1. Re:Good Advice by Rosco+P.+Coltrane · 2005-06-20 08:55 · Score: 3, Funny
  
  "read this book, but with a somewhat critical eye." Blindingly obvious but good advice.
  
  Indeed, having a critical eye can obviously make you blind.
  
  --
  "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Data "massaging"? by Kirby-meister · 2005-06-20 08:54 · Score: 1

Sounds like you're having a little too much fun with your database...
1. Re:Data "massaging"? by de+Bois-Guilbert · 2005-06-20 09:02 · Score: 3, Funny
  
  "SQL like a pig"?
2. Re:Data "massaging"? by uberslack · 2005-06-20 15:21 · Score: 1
  
  that's the funniest fucking thing i've read on slashdot in a long time... kudos...
  
  --
  Just because you're paranoid does not mean that the world is not full of assholes.
3. Re:Data "massaging"? by Anonymous Coward · 2005-06-21 03:20 · Score: 0
  
  And I thought such subtlety was lost on /.
  At least one person got it. :)
price... by Anonymous Coward · 2005-06-20 09:00 · Score: 0

the pragprog books only have IMHO one problem:
the price/page ratio is not right
Not mentioning tac is not a dealbreaker by illumin8 · 2005-06-20 09:28 · Score: 3, Insightful

I don't fault the author for not mentioning tac. It is part of the GNU textutils package, and although it might be standard on every Linux distro, it's most likely not in ANY enterprise Unix. I just checked my Sun boxes and it's not installed there, except for the ones that I've installed GNU textutils on.

I really wish a lot of Open Source developers would stop assuming that all of us have every GNU utility ever invented on our system. I can't tell you how difficult it is to get the average GNU autoconf program to compile correctly on Solaris or any flavor of enterprise Unix, simply because most authors assume they're writing platform-independent code, without realizing that GNU's M4 is different from System V M4. Also, differences between lex, flex, tar, and GNU tar abound. Please, for the love of god, don't assume that the tools you know and love on your Linux box at home are available or even installable on enterprise kit at work. Most company policies prevent the installation of these types of tools.
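
Where tac is missing, the same effect takes a few lines in whatever scripting language is already on the box. A minimal sketch in Python, assuming the input fits in memory; `reverse_lines` is just an illustrative name, and none of tac's options (custom separators and so on) are handled.

```python
import io


def reverse_lines(stream):
    """Return the lines of a text stream in reverse order, last line first."""
    lines = stream.readlines()
    lines.reverse()
    return lines


# Small demonstration on an in-memory stream instead of a real file.
demo = io.StringIO("first\nsecond\nthird\n")
print("".join(reverse_lines(demo)), end="")
```

To use it as a filter, pass `sys.stdin` instead of the in-memory demo stream.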

--
"When the president does it, that means it's not illegal." - Richard M. Nixon
1. Re:Not mentioning tac is not a dealbreaker by civilizedINTENSITY · 2005-06-20 11:31 · Score: 1
  
  Not installable by you, of course. But not installable? You seem to suggest that it is more difficult to install for Solaris. Doesn't Sun have a GNU toolchain site? I always thought that:
  It is a vital component in Linux kernel development, BSD development and a standard tool when developing software for embedded systems. Parts of the toolchain are also widely used in the Solaris Operating Environment (which, in the opinion of many, needs the GNU tools for reasonable usability) and Microsoft Windows programming with cygwin.
  
  I know I've installed cygwin painlessly, and even run the gnu toolchain under irix.
  
  I have to wonder if your post wasn't just intended to pretend a distinction.
2. Re:Not mentioning tac is not a dealbreaker by illumin8 · 2005-06-20 12:51 · Score: 2, Informative
  
  Not installable by you, of course. But not installable?
  
  Haha, yeah, I don't even know how to go to SunFreeware or Blastwave and download a copy of GNU textutils in Solaris package format. You can think that if you want to, but in the enterprise world, every software package I want to install has to be approved by about 3 levels of management. They want to know what it does, why we need it, how much it costs, and who else will know how to maintain it after I leave the company. The chance of providing them a list of all the GNU utilities necessary to compile your single average open-source app and getting approval for that is close to nil. Forget Perl modules and CPAN. These are real-world systems that might handle lots of real-world money, and they don't necessarily trust code that's been written by anyone on them.
  
  Anyway, I'm just (hopefully) educating people on some of the problems that a real-world sysadmin runs into on a daily basis.
  
  --
  "When the president does it, that means it's not illegal." - Richard M. Nixon
3. Re:Not mentioning tac is not a dealbreaker by Anonymous Coward · 2005-06-20 23:26 · Score: 0
  
  I can't tell you how difficult it is to get the average GNU autoconf program to compile correctly on Solaris or any flavor of enterprise Unix, simply because most authors assume they're writing platform-independent code, without realizing that GNU's M4 is different from System V M4.
  
  M4 has nothing to do with it. The configure script in a distribution of configurable, compilable and installable source code is purely /bin/sh shell script. The configure.ac/.in file is GNU m4, but that's beside the point because it's not needed for compilation or installation of a complete source code distribution where autoconf has already been run and the configure script exists.
  
  You fail it.
Re:Fps are crunchy too by Anonymous Coward · 2005-06-20 09:41 · Score: 0

Fuck Slashdot.
MySQL by DogDude · 2005-06-20 09:45 · Score: 4, Informative

I have another problem with the section on SQL. Several of the slicker SQL recipes rely on nested queries (page 147-151). MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as parenthetical mention that some things might not work on all SQL implementations. It seems to me that this is exactly the sort of pitfall a book like this should inform the reader of.

Nested queries are *basic* database functionality. This is just one of many reasons why those of us who are experienced DBAs and database developers do not consider MySQL a database. The fact that there are lots of people trying to use it as such is irrelevant. The author didn't mention that the book is also missing a section of spreadsheets. Why not? Lots of people use spreadsheets as a database!
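
To make the portability pitfall concrete, here is a small sketch of a nested query next to a join that returns the same rows, which is the usual workaround on servers without subquery support. It runs against an in-memory SQLite database purely for convenience; the `authors` and `books` tables and their contents are invented for illustration and are not taken from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO books VALUES (1, 1, 'First'), (2, 1, 'Second'), (3, 3, 'Third');
""")

# Nested form: authors who have at least one book.
nested = """
    SELECT name FROM authors
    WHERE id IN (SELECT author_id FROM books)
    ORDER BY name
"""

# Join form: same question, no subquery. DISTINCT is needed because
# an author with two books would otherwise appear twice.
joined = """
    SELECT DISTINCT a.name FROM authors a
    JOIN books b ON b.author_id = a.id
    ORDER BY a.name
"""

nested_rows = conn.execute(nested).fetchall()
joined_rows = conn.execute(joined).fetchall()
print(nested_rows)
print(joined_rows)
```

Not every nested query converts this mechanically, so treat it as one common case rather than a general recipe.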

--
I don't respond to AC's.
Summary in the first sentence by springbox · 2005-06-20 09:48 · Score: 1

"I really expected to love Data Crunching"
It's interesting the way that's written, because it tells me that you didn't like the book in the first sentence. If getting people to read the entire review was an issue, which is not the case here, then that would have been moved to the last paragraph.
Re:1992 Called.... by hobbesx · 2005-06-20 10:01 · Score: 0, Flamebait

1983 actually... There's the 60% exchange rate on Canadian Asshats.

--
This rating is Unfair ( ) ( ) Fair (*) Funny
Sigh... If only. Modding would be so much more fun.
Munging Alternative by PotatoMan · 2005-06-20 10:02 · Score: 2, Informative

You might want to compare this book to "Data Munging With Perl" by David Cross.

See the Slashdot Review:
http://books.slashdot.org/article.pl?sid=01/04/26/1229238&tid=145&tid=6
1. Re:Munging Alternative by Anonymous Coward · 2005-06-20 12:32 · Score: 0
  
  I read that as "Data Mugging with Perl" and thought "how appropriate."
  
  And I like PERL..
2. Re:Munging Alternative by DrHyde · 2005-06-20 20:55 · Score: 1
  
  DDJ's reviewer also liked it, making the very good point that Dave Cross's book isn't really about perl.
3. Re:Munging Alternative by Anonymous Coward · 2005-06-20 21:03 · Score: 0
  
  Looks like the author of this book was quite a fan of Data Munging with Perl. Wonder where he got the idea for this book from :-)
4. Re:Munging Alternative by Anonymous Coward · 2005-06-24 20:34 · Score: 0
  
  Just look at the publisher,
  Publisher: Pragmatic Programmers, LLC, The
  
  yeah right
Who's the author? by The_Wilschon · 2005-06-20 11:57 · Score: 1

Near the beginning of the post, in the green box, we have:

author | Greg Wilson

And yet, in the final paragraph we see:

In spite of them, it is still a very good and useful book and Mark Wilson has done a good job with a topic all too often ignored.

What's going on?

--
SIGSEGV caught, terminating

wait... not that kind of sig.
MySQL and data crunching by angio · 2005-06-20 15:10 · Score: 2, Insightful

MySQL's lack of support for some of the ANSI SQL features is annoying. But, that said, I do a lot of data crunching on a terabyte or so of Internet measurement data, and MySQL remains my database of choice. In a data-mining application like mine, I need speed and a compact on-disk representation of the data and the indices before anything. Our inserts are batched a couple of times a day; having them fast is important, but having them run concurrently with queries isn't. I don't need transactions, I can deal with table-level locking, and I'm willing to give up a couple of things like nested selects to get that speed.
Given that MySQL is the best fit for some types of data crunching applications, the earlier comment about assuming nested queries has merit.

My requirements arise in a research setting, so perhaps they're less common. Companies like Wal-Mart can afford big iron on which to do their data mining. Smaller data crunching tasks don't make the same kind of performance demands on their RDBMS. Of course, one thing to consider is that the standard RDBMS model isn't all that well suited to huge-scale data-mining in general, so there may be no silver bullet here for any of us to get religious about yet.
More, or even Better Books on the Topic? by wehe · 2005-06-20 18:59 · Score: 1

Are there any better books about data crunching? I found at least Data Munging with Perl by David Cross. BTW: check out DataConv for a survey of data conversion tools, many of them GPLed and often unix-based.
Just parse it already by Urusai · 2005-06-20 20:44 · Score: 1

Might as well use yacc/bison to generate a LALR parser while you're being stupid about it.
1. Re:Just parse it already by sinserve · 2005-06-23 03:17 · Score: 1
  
  You mean Lex/Flex. Yacc is a parser generator, Lex is a scanner generator. Two different things.