Slashdot Mirror


Data Crunching

Vern Ceder writes "I really expected to love Data Crunching. The Pragmatic Bookshelf has come up with some very good and, well, "pragmatic" texts in the past so I was looking for more of the same. Even better, the subject of the book was the routine data extraction, massaging and formatting that I (and a lot of other coders) spend so much time on. I was really looking forward to adding a couple more pragmatic tools to my coding toolbox. Unfortunately (as you may have guessed), I really can't say I love Data Crunching. It's a good book, but there are several minor points that keep it from being a truly great book." Read on for the rest of Ceder's review.

Data Crunching: Solve Everyday Problems Using Java, Python, and more.
author: Greg Wilson
pages: 176
publisher: Pragmatic Bookshelf
rating: 7
reviewer: Vern Ceder
ISBN: 0974514071
summary: A good introduction to data crunching, but watch the examples.

On the positive side, there is a lot of good stuff in this book. I would even go so far as to recommend it to everyone who writes code to extract or manipulate data, particularly those less experienced. Greg Wilson should be praised for taking the idea of data crunching seriously and for systematically dealing with its patterns and pitfalls. A lot of important work gets done every day with one-off programs and behind-the-scenes scripts, and Wilson is right that the techniques that go into this sort of coding are different from, but just as important as, those that go into full-blown application development.

The strength of this book is that it offers useful approaches and patterns for dealing with a variety of common programming situations and types of data, while also pointing out their common traps and pitfalls. Wilson starts with techniques for crunching text data, moves on to the use of regular expressions, XML, binary data, and SQL databases before concluding with a special section on "horseshoe nails," various little techniques which just might help save the day. Quite often he uses examples in both Python, which he calls an "agile" language, and Java, a "sturdy" language. The basic advice offered is sound, if not shocking -- keep things simple, test as you develop, don't duplicate code, use existing scripts and utilities when possible, and so on. The combination of such sound advice with a wealth of practical examples makes for a very effective handbook, particularly for someone new to data crunching.

So is Data Crunching a good book? Definitely. Should you read it if you regularly do routine data manipulation and extraction? Absolutely. And yet...

And yet there are a number of things that just aren't quite right. The text and binary sections are the best, while I would say that the XML and SQL sections are the weakest, partly because those topics are too broad to cover in a single slim chapter. If you already have an idea of how you might want to use XML or how to extract data from a SQL database, you're likely to find something handy in those chapters. On the other hand, if you're unfamiliar with them, this book probably doesn't have enough detail to get you writing useful code. Or rather, I should say it doesn't have enough detail to get you writing useful code while knowing what you're doing. And data crunching without knowing what you're doing is a bad idea. Trust me on that one.

I have another problem with the section on SQL. Several of the slicker SQL recipes rely on nested queries (pages 147-151). MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as a parenthetical mention that some things might not work on all SQL implementations. It seems to me that this is exactly the sort of pitfall a book like this should inform the reader of.
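
To make the portability point concrete, here is a minimal sketch of the usual workaround: rewriting a nested query as a join. It uses Python's built-in sqlite3 module purely for convenience; the tables, columns and data are invented for illustration and are not taken from the book.

```python
import sqlite3

# Hypothetical schema and data, for illustration only (not from the book).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Wilson');
    INSERT INTO authors VALUES (2, 'Cross');
    INSERT INTO books VALUES ('Data Crunching', 1);
    INSERT INTO books VALUES ('Data Munging With Perl', 2);
""")

# Nested-query form: the inner SELECT feeds the outer WHERE clause.
nested = conn.execute(
    "SELECT title FROM books "
    "WHERE author_id IN (SELECT id FROM authors WHERE name = 'Wilson')"
).fetchall()

# Join form: asks the same question without a subquery.
joined = conn.execute(
    "SELECT b.title FROM books b JOIN authors a ON b.author_id = a.id "
    "WHERE a.name = 'Wilson'"
).fetchall()

print(nested, joined)
```

Both queries return the same rows here, so on a server without nested queries the join form is a reasonable fallback for simple cases like this one.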

There are also several coding examples that bother me. Since I tend to both learn and teach by paying close attention to examples, I get uncomfortable with examples that seem to suggest something other than what they should.

For instance, the very first pieces of sample code (pages 9-10) in the text chapter are Python and Java programs to reverse the order of lines in a text file. I don't have a problem with the exercise itself; I've often assigned it to beginning programmers. However, this book is about quick and reliable solutions to common data handling problems, not leading people through basic programming exercises. Ironically, the very same chapter discusses the advantages of using the Unix command-line and its wealth of little tools. So wouldn't it be reasonable to expect at least a brief note or example showing that the REALLY easy way to solve the problem is with a single line: $ tac filename > filename2? Yet tac is not even in the list of "Useful Commands" on page 24. If reversing lines is just a programming example, it shouldn't be the lead example in a book like this, and if it is important, then the book should mention that the problem has already been solved.
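
For comparison with the tac one-liner, a hand-written version of the exercise might look roughly like the sketch below. This is not the book's code; the function names reverse_lines and reverse_file are made up here, and the sketch assumes the whole file fits in memory.

```python
def reverse_lines(lines):
    """Return the given lines in reverse order."""
    return list(lines)[::-1]


def reverse_file(src_path, dst_path):
    """Write the lines of src_path to dst_path, last line first."""
    with open(src_path) as src:
        reversed_lines = reverse_lines(src)
    with open(dst_path, "w") as dst:
        dst.writelines(reversed_lines)


# Small in-memory demonstration, so no real file is needed.
print(reverse_lines(["first\n", "second\n", "third\n"]))
```

Calling reverse_file("filename", "filename2") would then do roughly what the tac command above does, which is the point: the hand-written version is more code for the same result.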

In the same vein, Wilson spends a fair amount of time in the text chapter illustrating code to parse command-line parameters, before admitting that libraries for the task abound in most languages. Granted, being able to snag a parameter or two off of the command-line without using a library can sometimes be handy; but implementing a more involved command-line parser is a problem that has already been abundantly solved.
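
As an illustration of how little code the library route takes, here is a minimal sketch using argparse from the current Python standard library (not a module the book discusses, as far as this review says). The script description, the option names and the sample arguments are all hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse a hypothetical crunching script's command line."""
    parser = argparse.ArgumentParser(description="Hypothetical crunching script")
    parser.add_argument("infile", help="file to read")
    parser.add_argument("-n", "--count", type=int, default=10,
                        help="number of records to keep")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress messages")
    return parser.parse_args(argv)


# Parse a sample argument list instead of sys.argv so the sketch is self-contained.
args = parse_args(["-n", "3", "data.txt"])
print(args.infile, args.count, args.verbose)
```

In a real script you would pass sys.argv[1:] instead of the sample list; usage messages, type conversion and error reporting come for free.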

Similarly, one of the examples in the chapter on regular expressions uses a regular expression to check to see if a string contains a valid IP address (pages 65-66). After showing how to use a regular expression to scan a dotted quad of digits, the text then admits that using a regular expression alone would lead to too much complexity, since it's hard to use a regular expression to check to see if a 1 to 3 digit number is less than 255 (or 127, which is what he uses in his code). So the example on page 66 ends up compiling and matching a regular expression like this:

pat = re.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")
. . .
m = pat.match(text)
for g in m.groups():
. . .
when a Python coder would more naturally just use:

quads = text.split('.')
for number in quads:

Sure, it's a good example of how to extract matched items, but the implication is that using a regular expression is the best way to extract numbers separated by dots, when in fact Python has a simpler, easier and more reliable way to deal with it. Again, a quick mention of the "easy" way to solve the problem would have been appropriate.
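
Filled out a little, the split-based approach could look like the sketch below. The function name is_dotted_quad is invented for illustration, and the sketch assumes a "valid" address simply means four decimal numbers from 0 to 255 separated by dots (leading zeros are accepted).

```python
def is_dotted_quad(text):
    """Return True if text looks like four dot-separated numbers in 0..255."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isdigit() alone accepts some non-ASCII digit characters,
        # so require plain ASCII digits as well; this also rejects "".
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


for candidate in ["192.168.0.1", "256.1.1.1", "1.2.3", "a.b.c.d"]:
    print(candidate, is_dotted_quad(candidate))
```

The range check that is awkward to express in a regular expression becomes a single integer comparison here.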

These kinds of issues are what keep Data Crunching from being a great book. In spite of them, it is still a very good and useful book, and Greg Wilson has done a good job with a topic all too often ignored. The general idea is great, and the principles, problems and solutions are well-explained and relevant. If data crunching is something you do, I would certainly recommend that you read this book, but with a somewhat critical eye.

You can purchase Data Crunching: Solve Everyday Problems Using Java, Python, and more. from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

13 of 94 comments

  1. Matching a dotted quad by Anonymous Coward · · Score: 1, Informative

    Shouldn't be too hard if we can use ereg() or similar. How about checking for 0-255 like so: "([1-9][0-9]{0,1}|1[0-9][0-9]|2[0-4][0-9]|25[0-5]|0)", then it's just a matter of checking for those between dots?

  2. Re: quads = text.split('.') by abigor · · Score: 4, Informative

    quads = text.split('.')
    if len(quads) != 4:
        raise NotAnIPAddress
    for member in quads:
        try:
            quad = int(member)
            if quad < 0 or quad > 255:
                raise NotValidQuad
        except ValueError:
            raise NotValidQuad
    .
    .
    .
    etc.

  3. Re: quads = text.split('.') by Anonymous Coward · · Score: 2, Informative

    Ummm... is receiving a number less than 0 or greater than 255 an exception? No, it's abnormal input sure, but that is a nasty and poor use of exceptions.

    You get an F on programming style :(

  4. Re: quads = text.split('.') by abigor · · Score: 3, Informative

    Jesus, it's just a demo to show that calling split isn't particularly unsafe. How you handle the errors is up to you. Consider the raise statements to be pseudocode.

    Ah, but your last line explains everything: you teach programming. You don't do it for a living. Makes sense now.

  5. Re:Reviewer catches himself. by helixblue · · Score: 3, Informative

    FYI: It's worth mentioning that rev is not very close to being universal either, existing only on Linux and BSD boxes as best as I can tell. tail -r is more universal in that it works under both SYSV and BSD variants, but oddly enough: not Linux.

    The GNU tail folks were pretty stubborn about keeping their file reversal in the tac command, wreaking havoc with cross platform scripts everywhere. :)

  6. Re:Learning concepts...bah! by despik · · Score: 4, Informative

    Boy, did someone just miss the joke...

    --
    "I seem to have mastered a certain amount of control over physical reality."
  7. Re: quads = text.split('.') by Pete Brubaker · · Score: 3, Informative

    "\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"

    This will match a valid IP address.

    --Pete

    --
    What's a sig? Pete Brubaker
  8. MySQL by DogDude · · Score: 4, Informative

    I have another problem with the section on SQL. Several of the slicker SQL recipes rely on nested queries (pages 147-151). MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as a parenthetical mention that some things might not work on all SQL implementations. It seems to me that this is exactly the sort of pitfall a book like this should inform the reader of.

    Nested queries are *basic* database functionality. This is just one of many reasons why those of us who are experienced DBAs and database developers do not consider MySQL a database. The fact that there are lots of people trying to use it as such is irrelevant. The author didn't mention that the book is also missing a section of spreadsheets. Why not? Lots of people use spreadsheets as a database!

    --
    I don't respond to AC's.
  9. Munging Alternative by PotatoMan · · Score: 2, Informative

    You might want to compare this book to "Data Munging With Perl" by David Cross.

    See the Slashdot Review:
    http://books.slashdot.org/article.pl?sid=01/04/26/1229238&tid=145&tid=6

  10. Re: quads = text.split('.') by PatrickThomson · · Score: 2, Informative

    The whole point of Python is to raise and catch exceptions instead of fucking about trying to make it all nice. So the parsing program might be called by

    ip = getuserinput()
    try:
        DoShitFromGrandparent()
    except NotAnIPAddress:
        print("Not an IP address, dumbass")
    except NotValidQuad:
        pass  # blah blah etc.

    --
    I am one of many. My idea is not unique, nor do I expect my voice alone to sway you. I speak in a chorus of opinion.
  11. Re: quads = text.split('.') by feronti · · Score: 2, Informative

    Actually, that would depend on where this code lives... if it's in the user interface, sure, using an exception is probably not the right way to do it, since you know right there how to handle it. But what if it's deep in the bowels of a library? A library should validate that its callers are following the contract, but has no way of knowing how to handle the error when the value is out of range, so it should fail early and throw an exception so the higher layers can do something about it.

    Besides, as another poster mentioned, using exceptions for flow control is an actual pattern in Python. The Python philosophy is 'it's easier to ask for forgiveness than to ask for permission.' Though, the truly python way would be to build the address and just pass it on, and let someone who knows better validate it.

  12. Re:Not mentioning tac is not a dealbreaker by illumin8 · · Score: 2, Informative

    Not installable by you, of course. But not installable?

    Haha, yeah, I don't even know how to go to SunFreeware or Blastwave and download a copy of GNU textutils in Solaris package format. You can think that if you want to, but in the enterprise world, every software package I want to install has to be approved by about 3 levels of management. They want to know what it does, why we need it, how much it costs, and who else will know how to maintain it after I leave the company. The chance of providing them a list of all the GNU utilities necessary to compile your single average open-source app and getting approval for that is close to nil. Forget Perl modules and CPAN. These are real-world systems that might handle lots of real-world money, and they don't necessarily trust code that's been written by anyone on them.

    Anyway, I'm just (hopefully) educating people on some of the problems that a real-world sysadmin runs into on a daily basis.

    --
    "When the president does it, that means it's not illegal." - Richard M. Nixon
  13. Re:Reviewer catches himself. by Profound · · Score: 2, Informative
    perl -e "print reverse <>" filename


    (next time I'll use preview)