Data Munging with Perl

Posted by timothy on Thursday April 26, 2001 @04:45AM from the can-they-say-that-on-tv? dept.

For those inundated with data -- numbers, names, dates, temperatures, colors, seismographic sensor output, voting records(!), or anything else -- the paltry concerns of user interface may be less important than the assurance that they can make something useful from all that stuff. Data munger extraordinaire chromatic has again delivered his insightful dissection of a programming book aimed at people with Perl knowledge and a lot of data to wade through, and No, it's not from O'Reilly. Maybe it's for you.

Data Munging with Perl
author: David Cross
pages: 283
publisher: Manning Publications
rating: 9
reviewer: chromatic
ISBN: 1-930110-00-6
summary: Dave explores Perl's unique and compelling abilities to manage and manipulate data of all types, sizes, and shades.

The Scoop

Larry Wall, so goes the story, needed to glue together two systems on opposite sides of the country. Calling on the virtues of Laziness (why throw together something for just one job?) and Hubris (why not write a new language?), he created Perl. Though it's found new niches in the post-web world, Perl earns its bread and butter munging data.

Dave Cross has put together a friendly and handy compendium of techniques, tricks, and best practices. Suitable for raw novices to experienced intermediates, Data Munging with Perl is a gentle but firm romp from flat text, past structured and binary files, to the realm of custom parsers. Clean examples and lots of modules accompany the explanations.

What's to Like?

The book plots a natural course through topics ordered by complexity. It opens with a theoretical overview of data processing. This introduces terminology and outlines the general types of data one might encounter. Additionally, the author writes with the authority of experience when exploring the basic approaches and best practices. While other books aimed at novice users shy away from programs-as-filters and data structures, Cross prefers to instill good habits from the start.

Beyond munging data, the book provides a decent introduction to idiomatic and effective Perl programming. While the brief tutorial won't magically produce new JAPHs, the thoughtful and continual devotion to good technique and skill will inspire smarter programmers. More important than knowing many useful tricks is knowing when and how to use a handful of tools -- and where to go for more.

The overall level of quality is excellent. The binary data chapter stands out as the clearest explanation available, and the information on munging dates and times will save readers plenty of grief. Additionally, the entire parsing section introduces a handful of powerful but sorely-underused tools to handle HTML, XML, and even creating custom parsers. Rounding out the curriculum is an appendix that explores the larger modules, mentioned earlier, in more detail (XML::Parser, DBI, Date::Manip).

What's to Consider?

Only two things might turn readers from this book. The first is its deceptive length. While the text is short, the examples are clear and what's there packs a lot of wallop. Careful readers who follow the links to other resources will have little trouble supplementing their education. (On the other hand, another ten pages describing Parse::RecDescent would have been a nice addition. It's hard to fault the author for deferring to the module's voluminous documentation.)

Second, longtime Perl programmers may find little new material, particularly if they are familiar with the wealth of modules on the CPAN. The intended audience is clearly new and underexperienced programmers. While there's plenty of good advice presented well, the book falls more toward the tutorial side of the aisle than the reference section. This does not detract from the book, but it does narrow the base of potential readers slightly.

The Summary

Manning Publications continues its fine line of Perl books with the consistent and powerful Data Munging with Perl. Coders looking to transform data somehow and hackers who want to take advantage of Perl's unique features will improve their knowledge and understanding. If you find yourself working with files or records in Perl, this book will save you time and trouble.

Table of Contents

Introduction
1. Data, data munging, and Perl
2. General practices to use when munging data
3. Generally useful Perl idioms
4. Pattern matching
Data Munging
1. Unstructured data
2. Record-oriented data
3. Fixed-width & binary data
Simple Data Parsing
1. More complex data formats
2. HTML
3. XML
4. Building your own parsers
Conclusion
1. Looking back -- and ahead
1. Modules reference
2. Essential Perl

You can purchase this book at ThinkGeek.

12 of 66 comments

Re:The power of paper? by rho · 2001-04-26 04:35 · Score: 4

The best part about a book -- a well written book, not a "How to Be an Unleashed Dummy in 21 Days" book -- is the time and care put into it by a host of professionals, whereas a Web resource tends to be cobbled together from a community of geniuses and idiots alike.
Look at Slashdot -- some of it is great, some of it would wither a pile of dog poo it's so bad. php.net is similar -- the function reference is good if you're looking for arguments to a rarely used function, but the user-contributed stuff is off-and-on useful.
That's partially why you pay $50 for a good tech book -- the team of people needed to put together a *good* book is quite expensive. You need a knowledgeable author, a clued-in editor, a savvy fact-checker... all these people cost money.

"Beware by whom you are called sane."

--
Potato chips are a by-yourself food.
Re:Question by holzp · 2001-04-26 00:58 · Score: 4

take a look at the source of a perl program.
yep, it's a good book by jacobito · 2001-04-26 01:17 · Score: 4

Along with the Camel, "Effective Perl Programming" (Addison-Wesley, don't remember author's name), and the "Perl Cookbook," this has been one of my favorite programming books. Mind you, I'm not a seasoned hacker, so YMMV. But for anyone who already understands the basics of Perl, this book is a great way to learn something practical.

Like Chromatic, though, I really wished that the section on Parse::RecDescent had been longer...
1. Re:yep, it's a good book by thoughtstream · 2001-04-26 04:45 · Score: 5
  
  Like Chromatic, though, I really wished that the section on Parse::RecDescent had been longer...
  Be careful what you wish for...
  Next year I'll be writing a book about Parse::RecDescent (or its successor Parse::FastDescent) and grammatical parsing techniques.
  Damian
Unstructured data by Animats · 2001-04-26 01:04 · Score: 4

In this context, "unstructured data" often refers to text in a natural language. An SEC filing is a good example of data with enough structure that machine processing is possible, but not enough that it's easy.
We have an engine which processes such data, but it's slow, because it's in Perl. Most of the time goes into modules recommended in this book, like HTML::Parser. The big problem is that simple tokenizing, like extracting HTML tags, is incredibly slow in Perl. The classic "get next character, get character class for character, switch on character class" operation is something Perl does very badly.
Yes, you can write low-level C functions and call them from Perl to deal with such problems, but that kills portability.
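For illustration, the two tokenizing styles mentioned above can be sketched side by side. This is a minimal sketch, not the poster's engine and not how HTML::Parser works internally: the sub names `tags_by_char` and `tags_by_regex` are hypothetical, and both treat anything between `<` and the next `>` as a tag, ignoring comments and quoted attribute values.

```perl
use strict;
use warnings;

# Character-at-a-time scanner: fetch a character, decide what kind it is,
# branch on that decision. Every step runs as interpreted Perl code.
sub tags_by_char {
    my ($text) = @_;
    my @tags;
    my $current = '';
    my $in_tag  = 0;
    for my $i (0 .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if (!$in_tag) {
            # Outside a tag: only '<' is interesting.
            if ($ch eq '<') { $in_tag = 1; $current = '<'; }
        }
        else {
            # Inside a tag: collect characters until the closing '>'.
            $current .= $ch;
            if ($ch eq '>') { push @tags, $current; $in_tag = 0; }
        }
    }
    return @tags;
}

# Regex scanner: hand the whole search to the regex engine, which does
# the per-character work in compiled C code.
sub tags_by_regex {
    my ($text) = @_;
    return $text =~ /(<[^>]*>)/g;
}

my $html = '<p>Hello, <b>world</b>!</p>';
print join(' ', tags_by_char($html)),  "\n";
print join(' ', tags_by_regex($html)), "\n";
```

Both calls print `<p> <b> </b> </p>` for this input. How much the two differ in speed depends on the Perl version and the data, so it is worth measuring on real input (the core Benchmark module is one way) before drawing conclusions.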
Re:The power of paper? by rgmoore · 2001-04-26 02:05 · Score: 4

One thing that I haven't seen mentioned yet is that books are easier to read than monitors. Monitors just can't match a book's DPI, and the higher resolution of the printed page can actually improve reading speed and retention and reduce eye strain. That may or may not be a big issue for you, but it can be a big deal and a reasonable justification for the extra expense. Another advantage of a printed book is that the author has already gone to the trouble of cobbling together the data for you so that you don't have to spend your time scrounging the web for it; if you're a consultant getting paid $100 per hour it doesn't take much time scouring the web for information to add up to more than the cost of the book.
OT Note: the correct term is tome (from the Greek word meaning to cut, and the same root as in medical procedures ending in -otomy, as tomes were originally produced by cutting a long scroll into smaller sections) not tomb (which is where somebody is buried).

--
There's no point in questioning authority if you aren't going to listen to the answers.
Boycott This Book!!! by none2222 · 2001-04-26 01:10 · Score: 4

Have you stopped to consider the consequences of the information contained in books like this? This type of effort should not be supported by the Free Software community.
Books like this give corporations the tools they need to destroy our privacy and strip us of our rights. How do you think Double Click puts the information about you it sells into useable form? With techniques it learns from this type of book. Same goes for the corporate websites you visit, your supermarket, etc.
Information wants to be free, but not the information in this book. Data mining and Data munging techniques should never have left the hallowed halls of academe. Once they enter the public domain, they are immediately exploited by greedy corporations. The author should have thought about that before writing a book like this.
If you buy or support books like this, you have lost any right to complain about your privacy being violated. If you are serious about privacy, boycott this book!

--
If you have a problem with my views, REPLY, don't moderate!
1. Re:Boycott This Book!!! by Neea · 2001-04-26 02:20 · Score: 5
  
  Dude, take your medication before you post. This book isn't going to tell the bad guys how to get your credit card number from the porn site you just visited. The bad guys already know how to get every piece of data about you that they want. So go back to your room in your mom's basement and put the aluminum foil hat back on your head. Remember - shiny side out.
Re:The power of paper? by interiot · 2001-04-26 01:38 · Score: 5

Get a second monitor to read documentation from. Not only would it pay for itself within 4 books, but it's more useful than a stack of spent books.
Re:Question by babbage · 2001-04-26 01:12 · Score: 5

XML is structured data.
Log files are generally fairly structured data.
CSV files are structured data.
Free flowing ASCII text is unstructured data.
Shakespeare's sonnets, however well formed, are unstructured data (unless you can come up with a parser that recognizes iambic pentameter... :).
Falling somewhere in the middle is binary data. It has a structured format but freeform contents. Consider the various sound, image, and video formats. Maybe Shakespeare's sonnets could fall into this category too... :)
There are situations where you could want to analyze each form. Parsing Apache log files is a slightly different task than analysing formal XML documents or sloppy HTML pages or messy ASCII email. This book helps give you a feel for which situation you may be dealing with, and thus what tools & techniques might be useful for that situation.
Though some will tell you otherwise, this book has nothing to do with "Buffy the Vampire Slayer." Sorry, grep.

--
DO NOT LEAVE IT IS NOT REAL
The power of paper? by tenzig_112 · 2001-04-26 01:12 · Score: 5

This is not flame bait. I'm just curious what can be found in a paper tomb that cannot be cobbled together from various up-to-date and *free* sources from the web.
Perhaps I'm still stuck in the paper age (somewhere between bronze & silicon), but I find myself spending $50 a pop for programming books I only skim through. If I need reference material, I hit PHP.net (for my PHP projects).
Am I missing something?
1. Re:The power of paper? by rfsayre · 2001-04-26 01:27 · Score: 5
  
  Yes, you are missing something. You're absolutely right that you can get all the reference material you need on the web. That's what it does best. However, when you're trying to *learn* a new language, it's better to have your editor, a couple console windows, and a book open. That speeds up the write/compile/run cycle. No flipping back and forth from the browser. You learn faster.
  
  Art At Home