Perl & LWP


Posted by timothy on Monday August 19, 2002 @03:00AM from the really-practical-text-extraction dept.

When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces in order to do further processing. LWP (Library for WWW access in Perl) is a collection of modules to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. It introduces LWP and related modules, from simple to advanced uses, and covers various ways to extract information from the returned HTML."

Title: Perl & LWP
Author: Sean M. Burke
Pages: 264
Publisher: O'Reilly and Associates
Rating: 9
Reviewer: mir
ISBN: 0596001789
Summary: Excellent introduction to extracting and processing information from web sites.

The good: The book has a nice style and good coverage of the subject. It includes an introduction to all the modules used, reference material and good, well-developed examples. I really liked the way the author describes the basic methodology for developing screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.

The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful.

More Info:

If it had not been published by O'Reilly, Perl & LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This book will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.

At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which cover only a narrow topic in each volume but cover that topic well. Just like Perl & XML, its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that provide a jump start and help avoid many mistakes.

Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
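As a rough illustration of the simplest approach in that progression (scanning the returned HTML with regular expressions), here is a minimal sketch. It is written in Python with only the standard library, not in Perl with LWP, so it is an analogue of the technique rather than code from the book. The sample page, the LINK_RE pattern and the extract_links function are all made up for illustration, and a real script would first fetch the page over HTTP.

```python
import re
from html import unescape

# Hypothetical page content standing in for a fetched web page.
SAMPLE_HTML = """
<html><body>
<ul>
  <li><a href="/books/a">Cats &amp; Dogs</a></li>
  <li><a href="/books/b">Second
      Title</a></li>
</ul>
</body></html>
"""

# re.DOTALL lets "." match newlines, much like Perl's /s modifier, so link
# text that spans several lines is still captured.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return (href, text) pairs for each simple <a href="..."> link."""
    links = []
    for href, text in LINK_RE.findall(html):
        # Collapse runs of whitespace and decode entities such as &amp;.
        clean_text = unescape(" ".join(text.split()))
        links.append((href, clean_text))
    return links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

A pattern like this is brittle: it assumes double-quoted attributes and href as the first attribute, which is exactly the kind of limitation that motivates moving on to a real HTML parser.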

The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.

It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.

Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.

Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.

Breakdown by chapter:

1. Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
2. Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
3. The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction, but one that lays the groundwork for the good stuff ahead.
4. URLs (10p): another reference chapter; this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete, it includes little explanation as to why you will need to process URLs, and the chapter is not even mentioned in the introduction roadmap.
5. Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
6. Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
7. HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods and an alternate way of using the module, with short examples. This is the kind of reference I find the most useful; it is the simplest way to understand how to use a module.
8. Tokenizing walkthrough (13p): a long example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
9. HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
10. Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function. I really like HTML::TreeBuilder BTW, it is simple yet powerful.
11. Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description of how to code a complete session with cookies.
12. Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
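To make the later chapters more concrete, here is a minimal sketch of the core of a link-checking spider: pull the links out of a page with a real parser instead of regexps, resolve them against the page's URL, and drop anything robots.txt disallows. It uses Python's standard library as a stand-in for the Perl modules above (html.parser instead of HTML::TokeParser, urllib.parse instead of URI), so it is an analogue and not the book's code. Note that Python's HTMLParser is callback-driven rather than a pull-style tokenizer. The names LinkCollector, allowed_links and "example-spider", and the sample page, URL and robots.txt are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def allowed_links(html, base_url, robots_txt, agent="example-spider"):
    """Return the links in html that robots_txt lets agent fetch."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()

    robots = RobotFileParser()
    # parse() takes the lines of an already-downloaded robots.txt file.
    robots.parse(robots_txt.splitlines())
    return [url for url in collector.links if robots.can_fetch(agent, url)]


# Hypothetical inputs; a real spider would download both over HTTP.
PAGE = ('<p><a href="docs/intro.html">Intro</a> '
        '<a href="/private/admin.html">Admin</a> '
        '<a name="top">not a link</a></p>')
BASE = "http://www.example.com/site/index.html"
ROBOTS = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    for url in allowed_links(PAGE, BASE, ROBOTS):
        print(url)
```

A full spider would also fetch each remaining URL, record the HTTP status and follow redirects, all of which are left out here to keep the sketch free of network access.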
Appendices

I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text to be slightly confusing though; in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Contents is complete and the chapters are focused enough that I have never needed to use the index.

You can purchase Perl & LWP from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

11 of 121 comments

Yea! by molo · 2002-08-19 03:18 · Score: 4, Funny

Yea! Perl finally natively supports Light-Weight-Processes (threading)!

Oh, wait...

--
Using your sig line to advertise for friends is lame.
LWP is great! by JediTrainer · 2002-08-19 03:30 · Score: 4, Interesting

Perl's been a wonderful tool in my situation. There's been a situation in my company where we needed to gather data from a (large) supplier, who was unwilling to provide us with a CSV (or otherwise easily parseable) file. Instead, we had to 'log in' to their site, and get the data as an HTML table from the browser.

In one evening, I wrote a quick Perl routine to perform the login and navigation to the appropriate page by LWP, download the needed page, and use REs to extract the appropriate information (yes, traditional screen scrape)

The beauty was that it was easy. I don't usually do Perl, but in this case it proved to be a wonderful tool creation tool :) LWP was a lifesaver here, and that script has worked for over a year now!

--

You can accomplish anything you set your mind to. The impossible just takes a little longer.
Re:perldoc LWP by Masem · 2002-08-19 03:31 · Score: 5, Insightful

Books like this one, Perl & XML, and other "compact" books can certainly be argued to be repetition of the perldocs, but there is room for such books under ORA's wing. First off, it gives someone with a hankering to author a computer handbook the opportunity to do so: based on a search at Amazon, this is Sean Burke's first book, and so a quality job, even on something as short as the LWP module with already extensive documentation, will help him get good inroads into writing other books and other possibilities (No, I don't know Sean personally, just using him as an example author). The second advantage is that most perldocs are written in an efficient manner: tell the developer exactly what they need to know when they need to know it. While there are examples, they are usually not fleshed out very well. A good book that covers the modules inside and out, with a philosophy of use, concrete examples, and useful reference material can be very helpful in understanding the module further and using it more efficiently, and for the programmer inexperienced in the modules, it provides a solid background for them to get started quickly.
Books like these, that focus very narrowly but try to cover the topic well, are what ORA is well known for and why they are still the major distributor of books related to OSS development and usage. Other large publishers would seem to balk at these types of books and instead opt for the 1000+ pg books that try to cover everything, typically failing to cover topics adequately or making mistakes, since the size of a book can be an influencing factor to some book purchasers. In fact, one could argue that a lot of what ORA offers is simply rehashes of free documentation, but if that were the case, I'd have expected to see ORA out of business years ago. Therefore, there is a demand for ORA's quality retakes of the manpages and free documentation, and books like these continue to extend their catalog in good ways.

--
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
This book fills a niche by TTop · 2002-08-19 03:41 · Score: 4, Informative

I for one am thankful this book is available and I will probably get it. I've always thought that the LWP and URI docs are cryptic and a little too streamlined. The best docs I thought were in an out-of-print O'Reilly book called Web Client Programming with Perl, but the modules have changed too much for that book to be very relevant anymore (although the book itself has been "open-sourced" at O'Reilly's Open Book Project).
It's actually not that often that I want to grep web pages with Perl; the slightly more difficult stuff is when you want to pass cookies, etc, and that's where I always find the docs to be wanting. Yes, the docs tell you how, but to get the whole picture I remember having to flip back-and-forth between several modules' docs.
Ticketmaster Example by barnaclebarnes · 2002-08-19 03:47 · Score: 5, Informative

Ticketmaster has these terms and conditions, which specifically exclude these types of screen scrapes for commercial purposes.

Quote from their TOC's...

Access and Interference

You agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our web pages or the content contained thereon or for any other unauthorized purpose without our prior expressed written permission. You agree that you will not use any device, software or routine to interfere or attempt to interfere with the proper working of the Ticketmaster web site. You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for your own person, non-commercial use) from our website without the prior expressed written permission of Ticketmaster

This, I think, would be something that a lot of sites would want to do. (Not that I agree.)

--
[Please type your sig here.]
Worth learning LWP instead of doing it manually? by Etcetera · 2002-08-19 04:00 · Score: 4, Interesting

I've done a whoooole lot of screen-scraping working for a company that shall remain nameless :) and I've generally always used "lynx --source" or curl to download the file and parse/grep it manually.

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way? Does it provide more reliability, ease of use or deployment, or other spiffiness? Or is it just a bloated Perl module that slaps a layer of indirection onto what is sometimes a very simple task?

--
Hire a Linux system administrator, systems engineer,
Re:Screen scraping cold war by dougmc · 2002-08-19 04:06 · Score: 4, Informative

How long until web designers begin making small randomizations to their page layout to break any screen scraper's code?
This already happens.
Another thing that sites do is encode certain bits of text as images. Paypal, for example, does this. And they muck with the font to make it hard for OCR software to read it -- obviously they've had problems with people creating accounts programmatically. (Why people would, I don't know, but when there's money involved, people will certainly go to great lengths to break the system, and the system will have to go to great lengths to stop it -- or they'll lose money.)
It's nice that there's a book on this now ... but people have been doing this for a long time. For as long as there has been information on web sites, people have been downloading them and parsing the good parts out.
Re: Doesn't seem to discuss the legalities by Spackler · 2002-08-19 04:37 · Score: 4, Insightful

Who the hack can sue me if I program my own browser and call it "Perl" or "LWP" and let it pre-fetch some news sites every morning at 8am?

Many sites (Yes, our beloved Slashdot included) use detection methods. If the detector thinks you are using a script, BANG!, your IP is in the deny list until you can explain your actions. A nice profile that says "for the last 18 days, x.x.x.x IP address logged in each day at exactly 7:53 am and did blah..." will get you slapped from MSNBC pretty fast. I would advise you to get some type of permission from the owner of the site before running around with scripts to grab stuff all over the web. Someone might mistake you for a script kiddie.
Another resource by merlyn · 2002-08-19 04:44 · Score: 5, Interesting

In addition to the Perl & LWP book, about half of my 150+ columns have been about LWP in one way or another. Enjoy! (And please support the magazines that still publish me: Linux Magazine and SysAdmin Magazine).
--
- Randal L. Schwartz, Just another Perl hacker for Stonehenge
Re:Worth learning LWP instead of doing it manually by Wee · 2002-08-19 04:47 · Score: 4, Funny

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way? Does it provide more reliability, ease of use or deployment, or other spiffiness?
First, I don't know what "the right way" means. Whatever works for your situation works and is just as "right" as any other solution. Second, I don't know exactly what you're using in comparison. I can think of a dozen ways to grab text from a web page/ftp site, create a web robot, etc. The LWP modules do a good job of pulling lots of functionality into one package, though, so if you expect to expand your current process's capabilities at any point, I'd maybe recommend it over something like a set of shell scripts.
Having said all that, I can say that yes, in general, it's worth it to learn the modules if you know you're going to be doing a lot of network stuff along with other programmatic stuff. It provides all the reliability, ease of use/deployment, and other general spiffiness you get with Perl. If you have a grudge against Perl, then it probably won't do anything for you; learning LWP won't make you like Perl if you already hate it. But if you have other means to gather similar data and you think you might like to take advantage of Perl's other strengths (database access, text parsing/generation, etc) then you'd do well to use something "internal" to Perl rather than 3 or 4 disparate sets of tools glued together (version changes, patches, etc can make keeping everything together hard sometimes). Of course, you can also use Perl to glue these programs together and then integrate LWP code bit-by-bit in order to evaluate the modules' strengths and weaknesses.
Does the LWP stuff replace things like wget for quick one-liners? No. Does it make life a little easier if you have to do something else, or a whole bunch of something elses, after you do your network-related stuff? Yes.
Or is it just a bloated Perl module that slaps a layer of indirection onto what is sometimes a very simple task?
Ah, I have been trolled. Pardon me.
-B

--
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
Don't be unfair to the author...... by i_want_you_to_throw_ · 2002-08-19 05:19 · Score: 4, Insightful

Yeah yeah spammers can use it. So what? Spam/email harvesting is only one of thousands of uses for LWP and focusing on that fact alone is VERY unfair to the author. You want to address the spamming issue? Don't use mailto tags in your HTML. Use form submission instead. If you use mailto: tags you DESERVE to be spammed.

There. Now shut the fsck up about the issue.

I manage a few government web sites and this book has been tremendous help in writing the spiders that I use to crawl the sites and record HTTP responses that then generate reports about out of date pages, 404s and so on. That alone has made it worth the money.

Sean did a great job on this. His book doesn't deserve to be slammed for what the technology MAY be used for.