Perl & LWP


Posted by timothy on Monday August 19, 2002 @03:00AM from the really-practical-text-extraction dept.

When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of modules to help you do this. mir writes: "Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML."

Perl & LWP
author: Sean M. Burke
pages: 264
publisher: O'Reilly and Associates
rating: 9
reviewer: mir
ISBN: 0596001789
summary: Excellent introduction to extracting and processing information from web sites.

The good: The book has a nice style and good coverage of the subject, and it includes an introduction to all the modules used, reference material, and good, well-developed examples. I really liked the way the author describes the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.

The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendices are not really useful.

More Info:

If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This book will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
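To make the "retrieve a page and extract the information from the HTML" step concrete, here is a minimal sketch of the extraction half. It is written in Python rather than Perl and is not taken from the book; the page text, the pattern and the `extract_temperature` name are all made up for illustration, and the download step is left out so the example runs without a network connection.

```python
import re

# Hypothetical page content; a real scraper would first download this text.
PAGE = """
<html><body>
<h1>Weather for Springfield</h1>
<p>Current temperature: <b>72</b> F</p>
<p>Humidity: <b>40</b>%</p>
</body></html>
"""


def extract_temperature(html):
    """Return the temperature as an int, or None if the pattern is not found."""
    # The pattern is tied to the exact markup around the number, which is
    # why this style of scraping breaks when the page layout changes.
    match = re.search(r"Current temperature:\s*<b>(\d+)</b>", html)
    return int(match.group(1)) if match else None


print(extract_temperature(PAGE))
```

Returning None instead of raising an error when the pattern is missing makes it easier for the calling code to notice that the page layout has changed.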

At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML, its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that provide a jump start and help avoid many mistakes.

Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.

The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.

It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
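The step up from regular expressions to a tokenizing parser can be sketched as follows. This is an illustration in Python using the standard library's `html.parser`, not the Perl modules the book covers (HTML::TokeParser and HTML::TreeBuilder); the `LinkCollector` class, the `collect_links` helper and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup with inconsistent case, quoting and line breaks,
# the kind of variation that tends to defeat a hand-written regexp.
SAMPLE = """
<ul>
  <li><A HREF="/story/1"   class="x">First <i>story</i></A></li>
  <li><a class='y'
         href='/story/2'>Second story</a></li>
</ul>
"""


def collect_links(html):
    """Parse html and return the list of (href, text) pairs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(collect_links(SAMPLE))
```

Because the parser normalizes tag case, attribute order and quoting, the extraction code only has to describe what it wants (anchors and their text) rather than the exact character sequence on the page.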

Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.

Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.

Breakdown by chapter:

1. Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
2. Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
3. The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction, that lays the groundwork for the good stuff ahead.
4. URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
5. Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
6. Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
7. HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods of an alternate way of using the module, with short examples. This is the kind of reference I find the most useful; it is the simplest way to understand how to use a module.
8. Tokenizing walkthrough (13p): a long example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
9. HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
10. Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function. I really like HTML::TreeBuilder BTW, it is simple yet powerful.
11. Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
12. Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
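
Two of the ideas in the chapter list above, resolving relative URLs (chapter 4) and honoring robots.txt in a spider (chapter 12), can be sketched together. The example uses Python's standard `urllib.parse` and `urllib.robotparser` rather than the Perl URI and LWP modules the book describes; the base URL, the robots.txt rules, the agent name and the `crawlable` helper are all hypothetical, and nothing is fetched over the network.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical starting page and robots.txt content for the same host.
BASE = "http://www.example.com/docs/index.html"
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(text):
    """Build a robots.txt rule set from text already in hand (no download)."""
    robots = RobotFileParser()
    robots.parse(text.splitlines())
    return robots


def crawlable(base, hrefs, robots, agent="example-linkchecker"):
    """Resolve hrefs against base; keep same-host URLs that robots.txt allows."""
    host = urlparse(base).netloc
    allowed = []
    for href in hrefs:
        url = urljoin(base, href)          # relative link -> absolute URL
        if urlparse(url).netloc != host:   # stay on the starting site
            continue
        if robots.can_fetch(agent, url):   # respect the site's rules
            allowed.append(url)
    return allowed


links = ["chapter1.html", "../images/logo.png",
         "/private/notes.html", "http://other.example.org/"]
for url in crawlable(BASE, links, build_robots(ROBOTS_TXT)):
    print(url)
```

A real link checker would go on to request each remaining URL and record its status; this sketch stops at deciding which URLs are fair game.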
Appendices

I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text to be slightly confusing though; in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Contents is complete and the chapters are focused enough that I have never needed to use the index.

You can purchase Perl & LWP from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

121 comments

spammers thank you. by morgajel · 2002-08-19 03:09 · Score: 0, Troll

ok, my first line of thought when seeing this is, HOW do I combat it? seems like this would be a spammers dream tool.

of course, I'm probably just pessimistic.

--
Looking for Book Reviews? Check out Literary Escapism.
1. Re:spammers thank you. by killthiskid · 2002-08-19 03:17 · Score: 3, Insightful
  
  I'd combat it by making the information you DO want available very easy to get to, and everything else hard.
  
  That's why it's easy to take a penny from the penny jar and hard to get to the safe at a store.
2. Re:spammers thank you. by zoward · 2002-08-20 02:10 · Score: 2
  
  Of course it's a spammer's dream tool. Spammers have been using tools like this for years, to harvest e-mail addresses from web pages. That's why people use such funky email addresses on Slashdot (except one user I remember whose .sig cleverly read "of course it's my real email address - what kind of idiot would spider Slashdot?").
  
  --
  "Can't you see that everyone is buying station wagons?"
perldoc LWP by Anonymous Coward · 2002-08-19 03:10 · Score: 0

Once again, a perl book rendered mostly moot by the excellent documentation available free to perl users.
1. Re:perldoc LWP by Masem · 2002-08-19 03:31 · Score: 5, Insightful
  
  Books like this one, the Perl & XML, and other "compact" books certainly can be argued as repetition of the perldocs, but there is room for such books under ORA's wing. First off, it gives someone with a hankering to author a computer handbook the opportunity to do so: based on a search at Amazon, this is Seth Burke's first book, and so a quality job, even on something as short as the LWP module with already extensive documentation, will help him get good inroads into writing other books and other possibilities (No, I don't know Seth personally, just using him as an example author). The second advantage is that most perldocs are written in an efficient manner: tell the developer exactly what they need to know when they need to know it. While there are examples, they are usually not fleshed out very well. A good book that covers the modules inside and out, with a philosophy of use, concrete examples, and useful reference material can be very helpful in understanding the module further and using it more efficiently, and for the programmer inexperienced in the modules, it provides a solid background for them to get started quickly.
  Books like these, that focus very narrowly but try to cover the topic well, are what ORA is well known for and why they are still the major distributor of books related to OSS development and usage. Other large publishers would seem to balk at these types of books and instead opt for the 1000+ pg books that try to cover everything, typically failing to cover topics adequately or making mistakes, since the size of a book can be an influencing factor to some book purchasers. In fact, one could argue that a lot of what ORA offers is simply rehashes of free documentation, but if that were the case, I'd have expected to see ORA out of business years ago. Therefore, there is a demand for ORA's quality retakes of the manpages and free documentation, and books like these continue to extend their catalog in good ways.
  
  --
  "Pinky, you've left the lens cap of your mind on again." - P&TB
  "I can see my house from here!" - ST:
2. Re:perldoc LWP by Anonymous Coward · 2002-08-19 03:40 · Score: 0
  
  True! There are trolls on this topic showing how few lines of code are needed to download a web page in perl and python. Yeah, it's pretty easy, which is why it's only one chapter of the book. The stuff that's not in the perldocs is what makes up the rest of the book.
3. Re:perldoc LWP by clintp · 2002-08-19 03:54 · Score: 3, Funny
  
  No, I don't know Seth personally,
  That's fairly apparent. Especially as his name is Sean. :)
  
  --
  Get off my lawn.
Screen scraping cold war by Anonymous Coward · 2002-08-19 03:10 · Score: 2, Insightful

How long until web designers begin making small randomizations to their page layout to break any screen scrapers code?

In turn, screen scrapers will have to counter with further intelligence and the information cold war begins!

Just a thort.
1. Re:Screen scraping cold war by pheared · 2002-08-19 03:49 · Score: 1
  
  I've been waiting for this for some time with eBay (I develop bidwatcher). They claim they don't want you to be able to use an 'automated tool' to access their site. So far, nothing has happened, except for the occasional code change which will end up breaking my stuff. The biggest problem with parsing/understanding eBay html is that they really didn't care how it looked when it was generated since it's going to be rendered by a browser. It's quite a mess.
2. Re:Screen scraping cold war by dougmc · 2002-08-19 04:06 · Score: 4, Informative
  
  How long until web designers begin making small randomizations to their page layout to break any screen scrapers code?
  This already happens.
  Another thing that sites do is encode certain bits of text as images. Paypal, for example, does this. And they muck with the font to make it hard for OCR software to read it -- obviously they've had problems with people creating accounts programmatically. (why people would, I don't know, but when there's money involved, people will certainly go to great lengths to break the system, and the system will have to go to great lengths to stop it -- or they'll lose money.)
  It's nice that there's a book on this now ... but people have been doing this for a long time. For as long as there has been information on web sites, people have been downloading them and parsing the good parts out.
3. Re:Screen scraping cold war by upsilon+b · 2002-08-19 08:26 · Score: 1
  
  Hey - why look far away? /. just asked me to type in three letters from an image when I registered last week.
4. Re:Screen scraping cold war by upsilon+b · 2002-08-19 08:30 · Score: 1
  
  No need to look far away. /. just asked me to type in three letters displayed as an image when I registered last week. It had some light color smudged around the text, but it seemed to me that it would be pretty easy to filter the smudge out before feeding it to an OCR - if anybody really wanted to.
5. Re:Screen scraping cold war by Anonymous Coward · 2002-08-20 09:11 · Score: 0
  
  paypal did this because you used to get money for referrals, so people were writing code to create accounts, and then respond to thousands of ads selling things online with the message: "I am interested, and will give you $X, just register at paypal and I will deposit my first payment".
Programming troll clients in perl by Anonymous Coward · 2002-08-19 03:11 · Score: 0

O'Reilly has an out of print book on their site that is the textbook for crapflooding bots.
Look for "Programming web clients in perl" on your favorite search engine.
1. Re:Programming troll clients in perl by Anonymous Coward · 2002-08-19 05:18 · Score: 0
  
  Cool. Just what I've been looking for. Thanks.
Spam by dolo666 · 2002-08-19 03:14 · Score: 1, Offtopic

I hope spammers read that book. Then maybe I won't get twenty Viagra emails... just one, or better yet... An unsolicited email directed at my love for video games instead of incessant "free" sex spam.
Quick n' Dirty Method by mgibbs · 2002-08-19 03:15 · Score: 0, Redundant

Just use wget and regular expressions. :-)
1. Re:Quick n' Dirty Method by Anonymous Coward · 2002-08-19 03:46 · Score: 0
  
  i prefer lynx -source ;-)
2. Re:Quick n' Dirty Method by Fastball · 2002-08-19 04:23 · Score: 2
  
  That may be dirty, but it isn't quick. Fashioning regular expressions for this kind of work has to be one of the greater time pits for the average programmer.
Whoah, perl needs a whole book for this? by Tyler+Eaves · 2002-08-19 03:15 · Score: 1, Informative

#!/usr/bin/python
import urllib
obj = urllib.urlopen('http://slashdot.org')
text = obj.read()

--
TODO: Something witty here...
1. Re:Whoah, perl needs a whole book for this? by Anonymous Coward · 2002-08-19 03:24 · Score: 0
  
  Does urllib support basic-authentication? (sending user credentials MIME base-64 transfer encoded?) I agree, seems like devoting a book to doing this is pointless, other than perhaps from a reference standpoint.
2. Re:Whoah, perl needs a whole book for this? by BigWillieStyle · 2002-08-19 04:53 · Score: 0
  
  #!/usr/bin/ruby
  require 'net/http'
  
  Net::HTTP.get_print('slashdot.org', '/index.pl')
3. Re:Whoah, perl needs a whole book for this? by Anonymous Coward · 2002-08-19 22:05 · Score: 0
  
  Sorry but you'd be *very* disappointed with urllib compared to LWP performance/capabilities in the wild.
  
  It's just not mature enough.
  It's not really surprising - on my machine LWP is 5.64 and urllib - 1.15.
  
  I'm a big fan of zope actually but to do some real work I had to switch from urllib to LWP in my External Methods.
  
  Cheers
Doesn't seem to discuss the legalities by burgburgburg · 2002-08-19 03:18 · Score: 3, Interesting

I'm surprised that there isn't any discussion about the potential legal pitfalls in all of this repurposing listed in the contents. I'm not saying that scraping is illegal, but at least a mention of the possible claims and counter-arguments might have been called for.
1. Re: Doesn't seem to discuss the legalities by Antity · 2002-08-19 03:26 · Score: 3, Insightful
  
  How can this be less legal than surfing the pages with a browser regularly?
  
  Additional question for 5 bonus points: Who the hack can sue me if I program my own browser and call it "Perl" or "LWP" and let it pre-fetch some news sites every morning at 8am?
  
  VCRs can be programmed to record my favorite daily soap 5 days a week at 4pm as long as I'm on vacation. Some TV stations here in Europe even use VPS so my VCR starts and stops recording exactly when the show begins and ends, so I don't get commercials before/after. Illegal to automate this?
  
  Disclaimer: I don't watch soaps. :)
  
  --
  42. Easy. What is 32 + 8 + 2?
2. Re:Doesn't seem to discuss the legalities by Anonymous Coward · 2002-08-19 03:27 · Score: 1, Informative
  
  Actually there is a paragraph devoted to legalities. He asks you to research and read whatever TOS the website you're going to "rip" has. If that doesn't satisfy your needs, he urges you to also contact the owner of the website to get permission to do what you are doing.
3. Re:Doesn't seem to discuss the legalities by MarkWatson · 2002-08-19 03:45 · Score: 1
  
  Right on.
  I needed several good sources of news stories for a live product demo and I did not have too much trouble getting permission from site owners to automatically summarize and link to their material.
  I worked for a company that got in mild trouble for not getting permission a few years ago, so it is important to read the terms of services for web sites and respect the rights of others.
  That said, it is probably OK to scrape data for your own use if you do not permanently archive it. I am not a lawyer, but that sounds like fair use to me.
  A little off topic: the web, at its best, is non-commercial - a place (organized by content rather than location) for sharing information and forming groups interested in the same stuff. However, I would like to see more support for very low cost web services and high quality web content. A good example is Salon: for a low yearly fee, I find the writing excellent. I also really like the SOAP APIs on Google and Amazon - I hope that more companies make web services available - the Google model is especially good: you get 1000 uses a day for free, and hopefully they will also sell uses for a reasonable fee.
  -Mark
4. Re: Doesn't seem to discuss the legalities by Spackler · 2002-08-19 04:37 · Score: 4, Insightful
  
  Who the hack can sue me if I program my own browser and call it "Perl" or "LWP" and let it pre-fetch some news sites every morning at 8am?
  
  Many sites (Yes, our beloved Slashdot included) use detection methods. If the detector thinks you are using a script, BANG!, your IP is in the deny list until you can explain your actions. A nice profile that says "for the last 18 days, x.x.x.x IP address logged in each day at exactly 7:53 am and did blah..." will get you slapped from MSNBC pretty fast. I would advise you to get some type of permission from the owner of the site before running around with scripts to grab stuff all over the web. Someone might mistake you for a script kiddie.
5. Re: Doesn't seem to discuss the legalities by Antity · 2002-08-19 06:11 · Score: 1
  
  I meant: What is the difference between fetching a site every morning in a browser and - for example - have it pre-fetch with a script so the info is already there when you enter your office?
  
  Asking for permission is never a bad idea, though.
  
  --
  42. Easy. What is 32 + 8 + 2?
Yea! by molo · 2002-08-19 03:18 · Score: 4, Funny

Yea! Perl finally natively supports Light-Weight-Processes (threading)!

Oh, wait...

--
Using your sig line to advertise for friends is lame.
1. Re:Yea! by LunaticLeo · 2002-08-19 04:02 · Score: 3, Insightful
  
  I imagine you were just joking, but in the case that your jokes are a cry for help ...:)
  
  First, perl has native threads in the current perl 5.8.0.
  
  Second, if you are interested in threads (or more generally multiple concurrent processing), check out POE from CPAN. POE *is* the best thing to happen to perl since LWP. It is an event driven application framework, which allows cooperatively multi-tasking sessions to do work in parallel. It is the bees knees, and the cat's meow.
  
  --
  -- I am not a fanatic, I am a true believer.
2. Re:Yea! by sporty · 2002-08-19 04:03 · Score: 1
  
  Yeah, be careful of this one on interviews. I accidentally thought they meant the same thing. :)
  
  --
  -
  ping -f 255.255.255.255 # if only
3. Re:Yea! by kin_korn_karn · 2002-08-19 04:20 · Score: 1
  
  Please don't insult Perl by advocating any part of it with phrases like "the bee's knees". A bee's knees are small insignificant bits of chitin. That is not Perl.
  
  For that matter, anyone who uses that phrase should be drawn and quartered and then fed their own intestines, but I digress.
4. Re:Yea! by Anonymous Coward · 2002-08-19 11:06 · Score: 0
  
  come on,
  everybody knows what LWP stands for:
  'Larry Wall's Pucker'
LWP is not new but by Bob+Bitchen · 2002-08-19 03:20 · Score: 1

I suppose some will find it useful. I always try
perldoc LWP first followed by finding any examples
I can using the module, the book is always a last resort. I can learn a lot more by just playing
with the module.

--
http://tinyurl.com/3t236
Perl security links by Anonymous Coward · 2002-08-19 03:24 · Score: 0

www.cgisecurity.com/lib
LWP is great! by JediTrainer · 2002-08-19 03:30 · Score: 4, Interesting

Perl's been a wonderful tool in my situation. There's been a situation in my company where we needed to gather data from a (large) supplier, who was unwilling to provide us with a CSV (or otherwise easily parseable) file. Instead, we had to 'log in' to their site, and get the data as an HTML table from the browser.

In one evening, I wrote a quick Perl routine to perform the login and navigation to the appropriate page by LWP, download the needed page, and use REs to extract the appropriate information (yes, traditional screen scrape)

The beauty was that it was easy. I don't usually do Perl, but in this case it proved to be a wonderful tool creation tool :) LWP was a lifesaver here, and that script has worked for over a year now!

--

You can accomplish anything you set your mind to. The impossible just takes a little longer.
1. Re:LWP is great! by littleRedFriend · 2002-08-19 10:36 · Score: 1
  
  Using Perl is not even necessary. The following thing will work just as well (using *nix of course). lynx -source 'http://www.someHTMLdata.org' | awk '{ some code to parse }' I don't understand why anyone would want to use a library or something like Perl for that. Makes it more complicated than it should be. I especially don't understand why you would want to buy a book about this when you can do: man awk
  
  --
  IANAL, but imagine a beowulf cluster of in Soviet Russia all your belong are base to us welcoming the new SCO overlords.
2. Re:LWP is great! by Fjord · 2002-08-19 11:49 · Score: 2
  
  Or you could have cut and pasted the table into access and saved as CSV.
  
  --
  -no broken link
3. Re:LWP is great! by JediTrainer · 2002-08-19 12:01 · Score: 1
  
  Or you could have cut and pasted the table into access and saved as CSV.
  
  Unfortunately that wouldn't be feasible because this feed comes in daily, and the idea was to reduce manual work. Much easier to just schedule it with 'crontab' or 'at'.
  
  --
  
  You can accomplish anything you set your mind to. The impossible just takes a little longer.
4. Re:LWP is great! by JediTrainer · 2002-08-19 12:06 · Score: 1
  
  I don't understand why anyone would want to use a library or something like Perl for that. Makes it more complicated than it should be. I especially don't understand why you would want to buy a book about this when you can do: man awk
  
  Primarily because it was a cross-platform solution, and (at the time) I didn't know how to do it in Java (I do now, but can't bother to rewrite something that works).
  
  The script originally ran on a Windows box, but it has since moved to a SCO Unix box.
  
  Finally, remember that a form-based login and some navigation was required (and saving of cookies in the process). This makes lynx and such more or less useless when trying to automate this. The Perl script then can proceed to dump the data directly into the database (or output as CSV, as mentioned earlier) with just a few more lines of code.
  
  --
  
  You can accomplish anything you set your mind to. The impossible just takes a little longer.
5. Re:LWP is great! by littleRedFriend · 2002-08-19 19:17 · Score: 1
  
  Fair enough. Now I get it.
  
  --
  IANAL, but imagine a beowulf cluster of in Soviet Russia all your belong are base to us welcoming the new SCO overlords.
6. Re:LWP is great! by jethro_troll · 2002-08-20 02:41 · Score: 1
  
  > Or you could have cut and pasted the table into access and saved as CSV.
  
  Thank you for a perfect example of false laziness.
Screenscraping is hardly best practices. by InnovATIONS · 2002-08-19 03:32 · Score: 1, Offtopic

I think that it is irresponsible to promote screen scraping as a practice. Sure it is unavoidable in some cases but should be used only as a last resort because of how fragile it is. Most screen scraping is done because the programmers are too lazy or cheap to work officially with the content provider.
1. Re:Screenscraping is hardly best practices. by glwtta · 2002-08-19 04:44 · Score: 2
  
  too lazy or cheap to work officially with the content provider
  or get information from dozens of (often academic) "content providers", with a page or two of info each; updated maybe once or twice a month... yes they would definitely want everyone who uses the information they publish to "work with them officially" - good use of everyone's time.
  
  --
  sic transit gloria mundi
Screen Scraping? by EastCoastSurfer · 2002-08-19 03:35 · Score: 1, Redundant

How is parsing web pages and pulling out relevant data screen scraping? Sounds more like parsing html to me. True screen scraping is when you link into the screens buffer and pull data from particular points on the screen.
This book fills a niche by TTop · 2002-08-19 03:41 · Score: 4, Informative

I for one am thankful this book is available and I will probably get it. I've always thought that the LWP and URI docs are cryptic and a little too streamlined. The best docs I thought were in an out-of-print O'Reilly book called Web Client Programming with Perl, but the modules have changed too much for that book to be very relevant anymore (although the book itself has been "open-sourced" at O'Reilly's Open Book Project).
It's actually not that often that I want to grep web pages with Perl, the slightly-more difficult stuff is when you want to pass cookies, etc, and that's where I always find the docs to be wanting. Yes, the docs tell you how, but to get the whole picture I remember having to flip back-and-forth between several module's docs.
1. Re:This book fills a niche by Anonymous Coward · 2002-08-19 04:25 · Score: 0
  
  > Yes, the docs tell you how, but to get the whole
  > picture I remember having to flip back-and-forth
  > between several module's docs.
  
  Oh the horror... Yeah, you did the right thing...
  
  Can I borrow $50 bucks?
2. Re:This book fills a niche by BoyPlankton · 2002-08-19 04:41 · Score: 3, Informative
  
  It's actually not that often that I want to grep web pages with Perl, the slightly-more difficult stuff is when you want to pass cookies, etc, and that's where I always find the docs to be wanting.
  
  I've always found the libwww-perl cookbook to be an invaluable reference. It covers cookies and https connections. Of course, it doesn't go into too much detail, but it provides you with good working examples.
3. Re:This book fills a niche by doom · 2002-08-20 06:14 · Score: 2
  
  There seems to be a surprising amount of confusion about a simple point here: this book is the second edition of "Web Client Programming in Perl", it's just been re-titled "Perl & LWP". This is a *vast* improvement, in my opinion... once upon a time O'Reilly books had seriously geeky titles (e.g. "lex & yacc") where you would pick up the book just to figure out what the hell the title meant. Then they started trying to branch out with more "comprehensible" titles that actually turned out to be more confusing in a lot of cases. Like, when I was getting into mod_perl it took me a year to realize that the book "Writing Apache Modules with Perl and C" was what I should have been reading.
  Anyway, "Web Client Programming" was a nice slim volume that did a good job of introducing the LWP module, but had an unfortunately narrow focus on writing crawlers. If you needed to do something like do a POST of form values to enter some information there wasn't any clear example in the text. (The perldoc/man page for HTML::LWP on the other hand had a great, very prominent example. Thou shalt not neglect on-line docs.) I flipped through this new edition at LinuxWorld, and it looks like it's fixed these kinds of omissions; it's a much beefier book.
  BUT... even at a 20% discount it wasn't worth it to me to shell out my own money for it. If you don't know your way around the LWP module, this is probably a great deal, if you do it's a little harder to say.
4. Re:This book fills a niche by TTop · 2002-08-20 07:15 · Score: 2
  
  Well, I'll agree that this is a follow-up subject-wise, but really this book has an entirely different author and title than the first book, so it's hard to call it a second edition, in my opinion.
5. Re:This book fills a niche by doom · 2002-08-20 09:52 · Score: 2
  
  Well, I'll agree that this is a follow-up subject-wise, but really this book has an entirely different author and title than the first book, so it's hard to call it a second edition, in my opinion.
  Yes, I agree. You got me. It only just occurred to me to check to see if I'm the one who was confused.
2 Free O'Reilly online books with related topics by cacav · 2002-08-19 03:43 · Score: 2, Informative

I've found that 2 of the free books O'Reilly offers on their website delve into this a little bit.
You can read online their book Web Client Programming With Perl which has a chapter or two on LWP, which I've found very useful.
And on a related note, you can also read CGI Programming on the World Wide Web which covers the CGI side.
I may take a look at this LWP book, or I may just stick with what the first book I mentioned has. It's worked for me so far.
Ticketmaster Example by barnaclebarnes · 2002-08-19 03:47 · Score: 5, Informative

Ticketmaster has these terms and conditions, which specifically exclude these types of screen scrapes for commercial purposes.

Quote from their T&Cs...

Access and Interference

You agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our web pages or the content contained thereon or for any other unauthorized purpose without our prior expressed written permission. You agree that you will not use any device, software or routine to interfere or attempt to interfere with the proper working of the Ticketmaster web site. You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for your own personal, non-commercial use) from our website without the prior expressed written permission of Ticketmaster.

This, I think, would be something that a lot of sites would want to do (not that I agree).

--
[Please type your sig here.]
1. Re:Ticketmaster Example by zoward · 2002-08-20 02:17 · Score: 2
  
  You can picture someone writing a script to tag the Ticketmaster site until they can get in to purchase tickets to a hard-to-buy event, like a Bruce Springsteen concert, for example. The second tickets like these go on sale, the web site gets slashdotted, so why not automate the "try" process in a script? Because Ticketmaster will detect this and block your IP...
  
  --
  "Can't you see that everyone is buying station wagons?"
It's the repurposing that concerned me by burgburgburg · 2002-08-19 03:50 · Score: 1

It isn't the scraping. It's what you do with the info afterwards that I felt there might be issues revolving around. The AC in the next post mentions that there is a paragraph on the legalities and suggests contacting the scrapee to get permission. Seems safer, though probably unnecessary. But then, why risk angering a "source" and having them rewrite output to screw with your scraping efforts?
This is not screen scraping by drinkypoo · 2002-08-19 03:51 · Score: 3, Informative

Screen scraping is where you are reading the visible contents of an application. The prime example is GUI apps wrapped around 3270-based database apps. IBM *Still* uses this method internally for an interface to their support database, RETAIN, which is a mainframe app, rather than providing a database interface. One assumes this is to control user authentication, but you'd think they could do that in their database interface too. They do have more programmers working for them than god.
Using the source of a webpage is just interpreting HTML. It's not like the application is selecting the contents of a browser window, issuing a copy function, and then sucking the contents of the clipboard into a variable or array and munging it. THIS is what screen-scrapers do.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
1. Re:This is not screen scraping by Fjord · 2002-08-19 12:00 · Score: 2
  
  Uh, most screen scraping packages I've known only emulate the front end, not actually display it in a terminal and then select it off the screen (whatever that would mean). They give you a callable interface that lets you think you are doing that (select row 1 columns 40-45, for example) but really, it's just the packages interpreting the formatting and text coming from the application.
  
  The process of ripping data from HTML is very commonly called screen scraping.
  
  --
  -no broken link
2. Re:This is not screen scraping by drinkypoo · 2002-08-19 16:23 · Score: 2
  
  My point is that it is not, repeat not, screen scraping. Even doing it over the network is just the legacy of collecting the text from a locally-run program. It need not go to the screen buffer, but at some point it is rendered as if it has, rather than reading what WILL be rendered as you do in the case of interpreting HTML. If you got the contents of fields and the field names -- i.e., the files on the mainframe that defined the interfaces -- this would be analogous to reading the HTML, and it would not be screen scraping. You are receiving a different set of information entirely. This is what happens when you receive HTML. It would only be screen scraping if you were selecting all of the text in the HTML window, copied it, and ate it, as described previously.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Worth learning LWP instead of doing it manually? by Etcetera · 2002-08-19 04:00 · Score: 4, Interesting

I've done a whoooole lot of screen-scraping working for a company that shall remain nameless :) and I've generally always used "lynx --source" or curl to download the file and parse/grep it manually.

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way? Does it provide more reliability, ease of use or deployment, or other spiffiness? Or is it just a bloated Perl module that slaps a layer of indirection onto what is sometimes a very simple task?

--
Hire a Linux system administrator, systems engineer,
Regular Expressions for HTML? by Anonymous Coward · 2002-08-19 04:01 · Score: 1, Interesting

Anyone using RegExps to parse HTML should be shot on sight.

This is so wrong in so many ways.
For starters, you cannot parse Dyck languages with regular expressions.

You *have* to use a proper HTML-parser (that is tolerant to some extent), otherwise your program is simply wrong and I can always construct a proper HTML page that will break your regexp parser.

For those who are really hot on doing information extraction on web pages:
In my diploma thesis I found some methods to extract data from web pages that are resistant to a lot of changes.
I.e. if the structure of a web page changes, you can still extract the right information.
So you can do "screen-scraping" if you really want to, but it should be easier to contact the information provider directly.
1. Re:Regular Expressions for HTML? by BoyPlankton · 2002-08-19 04:55 · Score: 3, Informative
  
  You *have* to use a proper HTML-parser (that is tolerant to some extent), otherwise your program is simply wrong and I can always construct a proper HTML page that will break your regexp parser.
  
  The problem is that regular expressions are often faster at processing than an HTML parser. One example that I wrote used the HTML::TreeBuilder module to parse the pages. The problem is that we were parsing hundreds of MB worth of pages, and the structure of these pages made it very simple for me to write a few regexps to get the necessary data out. The regexp version of the script took much less time to run than the TreeBuilder version did.
  
  This is not to say that TreeBuilder doesn't have its place. There's a lot of stuff that I use TreeBuilder for just because sometimes it's easier and produces cleaner code.
2. Re:Regular Expressions for HTML? by Pinball+Wizard · 2002-08-19 06:03 · Score: 2
  
  I.e. if the structure of a web page changes, you can still extract the right information.
  
  Um, OK, whatever. If I have an HTML parser and your HTML page changes, my program is broken. Whereas if I'm looking for, say, the Amazon sales rank for a certain book, and the format of Amazon's page changes, but I can still grep for Amazon Sales Rank: xxx, I still have a working program.
  
  What diploma thesis? Where's the link? Parent post should be considered a troll until further explanation is given.
  
  Besides, this book in fact covers HTML parsers in addition to other useful techniques, like regular expressions. And since when is HTML a Dyck language?
  
  --
  No, Thursday's out. How about never - is never good for you?
3. Re:Regular Expressions for HTML? by j_d · 2002-08-19 08:37 · Score: 1
  
  Anyone using RegExps to parse HTML should be shot on sight.
  
  Sure. But you're missing the point if you think it's about using regexes to process a whole HTML file.
  
  The idea isn't to parse an entire HTML document, but to look for markers which signal the beginning and end of certain blocks of relevant content.
  
  What's the URL to your thesis?
Practical but... by Tonetheman · 2002-08-19 04:08 · Score: 1

Scraping web pages for data is practical and useful in some cases but I would not use it for "production" data. I have been somewhat involved with a company that is wanting to scrape data from another site. The problem is that you are relying on someone else to not change their web site. Which is just bad no matter how you look at it. I would not base a business on it.

Tone
Re:Worth learning LWP instead of doing it manually by Anonymous Coward · 2002-08-19 04:10 · Score: 2, Informative

Yes, it is worthwhile to use LWP in combination with a parser module like HTML::TokeParser in many cases. HTML can be incredibly tricky to parse using only regular expressions or similar pattern matching techniques, leading to errors, false matches and mangled input. The LWP/HTML::* solution is more flexible and reliable. This advice applies more to someone who knows a bit of Perl, though. If you've never used Perl, I don't see why you'd want to learn a whole language, plus a group of modules, just to do one task that you can already manage.
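
A small illustration of the "errors, false matches" point, as a hedged sketch. It uses Python's standard `html.parser` in place of HTML::TokeParser purely so that it runs without extra modules. The sample page, `LinkCollector`, `links_by_regex`, and `links_by_parser` are all invented for this example. The regular expression is the sort of quick pattern people tend to write first, not a claim about the best a regex can do.

```python
import re
from html.parser import HTMLParser

# Invented sample: one plain link, one oddly formatted link, one commented-out link.
PAGE = """<p>See <a href="/a.html">A</a> and
<A
  HREF='/b.html' class="x">B</A>
<!-- <a href="/old.html">gone</a> -->
</p>"""


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_regex(html):
    # Assumes lowercase tags, double quotes, and href as the first attribute.
    return re.findall(r'<a href="([^"]*)"', html)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(links_by_regex(PAGE))
print(links_by_parser(PAGE))
```

On this sample the regex misses the uppercase, single-quoted link and picks up the one inside the comment, while the parser reports only the two live links. A more careful pattern could close part of that gap, at the cost of the simplicity that made the regex attractive.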
Slash-scraping with LWP by bastion_xx · 2002-08-19 04:28 · Score: 2, Informative

Anyone who has used AvantGo to create a Slashdot channel understands the importance of reparsing the content. AvantSlash uses LWP to suck down pages and do reparsing. Hell, for years (prior to losing my iPaq), this was how I got my daily fix of Slashdot.

I just read it during regular work hours like everyone else. :>
Too little too late by Anonymous Coward · 2002-08-19 04:30 · Score: 1, Informative

I have been using LWP for over 1.5 years now, very heavily... to post data to affiliates and do all sorts of gnarly stuff.

I can't believe they have devoted a book to this subject! And why would they wait so long...? If you are into Perl enough to even know what LWP is, you probably don't need this book.

Once you build and execute the request, it is just like any other file read.

For you PHP'ers, the PHP interface for the Curl library does the same crap. Libcurl is very cool stuff indeed.

l8,
AC

"If you have to ask, you'll never know".
Red Hot Chili Peppers
Sir Psycho Sexy
Re:Worth learning LWP instead of doing it manually by Etcetera · 2002-08-19 04:32 · Score: 2

Actually, I do normally use Perl. I just dump the source to a string and then regexp to my heart's content.

Hmm.. guess I should take a closer look at it =)

--
Hire a Linux system administrator, systems engineer,
Re:Worth learning LWP instead of doing it manually by BoyPlankton · 2002-08-19 04:34 · Score: 3, Informative

I've done a whoooole lot of screen-scraping working for a company that shall remain nameless :) and I've generally always used "lynx --source" or curl to download the file and parse/grep it manually.

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way? Does it provide more reliability, ease of use or deployment, or other spiffiness? Or is it just a bloated Perl module that slaps a layer of indirection onto what is sometimes a very simple task?

The benefits come when you're trying to crawl websites that require some really advanced stuff. I use it to crawl websites where they add cookies via JavaScript and do different types of redirects to send you all over the place. One of my least favorite ones used six different frames to finally feed you the information, and their stupid software was requiring my session to download, or at least open, three or four of those pages in the frames before it would spit out the page with all the information in it. IMHO, LWP with Perl makes it way simple to handle this sort of stuff.
Re:Worth learning LWP instead of doing it manually by gosand · 2002-08-19 04:41 · Score: 2

I've done a whoooole lot of screen-scraping working for a company that shall remain nameless :) and I've generally always used "lynx --source" or curl to download the file and parse/grep it manually.
I too have done this for a long time, but not for any company. Let's just say it is useful for increasing the size of my collection of, oh, shall we say widgets. :-)
Really, it may be a little time consuming to do it manually, but it is also fun. If I find a nice site with a large collection of widgets, it is fun to figure out how to get them all in one shot with just a little shell scripting. A few minutes of "lynx -source" or "lynx -dump", cutting, grepping, and wget, and I have a nice little addition to my collection.

--

My beliefs do not require that you agree with them.
Another resource by merlyn · 2002-08-19 04:44 · Score: 5, Interesting

In addition to the Perl & LWP book, about half of my 150+ columns have been about LWP in one way or another. Enjoy! (And please support the magazines that still publish me: Linux Magazine and SysAdmin Magazine).
--
- Randal L. Schwartz, Just another Perl hacker for Stonehenge
Re:Worth learning LWP instead of doing it manually by Wee · 2002-08-19 04:47 · Score: 4, Funny

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way? Does it provide more reliability, ease of use or deployment, or other spiffiness?
First, I don't know what "the right way" means. Whatever works for your situation works and is just as "right" as any other solution. Second, I don't know exactly what you're using in comparison. I can think of a dozen ways to grab text from a web page/ftp site, create a web robot, etc. The LWP modules do a good job of pulling lots of functionality into one package, though, so if you expect to expand your current process's capabilities at any point, I'd maybe recommend it over something like a set of shell scripts.
Having said all that, I can say that yes, in general, it's worth it to learn the modules if you know you're going to be doing a lot of network stuff along with other programmatic stuff. It provides all the reliability, ease of use/deployment, and other general spiffiness you get with Perl. If you have a grudge against Perl, then it probably won't do anything for you; learning LWP won't make you like Perl if you already hate it. But if you have other means to gather similar data and you think you might like to take advantage of Perl's other strengths (database access, text parsing/generation, etc.) then you'd do well to use something "internal" to Perl rather than 3 or 4 disparate sets of tools glued together (version changes, patches, etc. can make keeping everything together hard sometimes). Of course, you can also use Perl to glue these programs together and then integrate LWP code bit by bit in order to evaluate the modules' strengths and weaknesses.
Does the LWP stuff replace things like wget for quick one-liners? No. Does it make life a little easier if you have to do something else, or a whole bunch of something elses, after you do your network-related stuff? Yes.
Or is it just a bloated Perl module that slaps a layer of indirection onto what is sometimes a very simple task?
Ah, I have been trolled. Pardon me.
-B

--
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
Parsing HTML in Perl by Animats · 2002-08-19 04:49 · Score: 3, Interesting
I parse large amounts of difficult HTML (and SGML) inside the Downside system. I do it in Perl, so I've been down this road. A few comments:
- Parsing into a tree is the way to go. HTML is obviously a tree structure, but not many applications parse it into a tree and work on it that way.
- HTML::TreeBuilder is useful, but has problems with badly constructed HTML. I wrote my own parser (SGML::TreeBuilder), which is much more robust, and will parse HTML, SGML, and XML into a tree of HTML::Element reliably without knowing anything about the tags. The key concept is to defer deciding whether a tag needs a matching close tag until you see a close tag for either that tag or an enclosing tag. When you find the close tag, you assume that all the unclosed tags within it didn't need a matching close at all. This minimizes the impact of unclosed tags; for example, if someone fails to close an <I>, it's ignored, rather than putting the entire remaining document in italics. (I really should put that up on CPAN. E-mail me if you want a copy.)
- One of the things you learn from this is that Perl doesn't do "get next character" very well. Some parsers for HTML use a C subroutine(!) to do the low-level tokenizing. It's embarrassing that Perl, a string language, does this so badly.
- The output of parsing is a tree of Perl objects. Perl does objects, but not very well. It takes a lot of memory to represent HTML this way.
- High-volume parsing is better done in Java or C++. Perl's regular expressions don't help much, and the weak object system is an obstacle. Doing the whole job in Perl can work, but there's a sizable performance penalty.
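
The deferred-close idea in the second bullet can be sketched roughly as follows. This is one reading of the description, not the SGML::TreeBuilder code mentioned above, which is not shown in this thread. It is written in Python with the standard `html.parser` as the tokenizer so that it runs as-is. `Node`, `DeferredCloseBuilder`, `parse_tolerant`, and `outline` are invented names. Treating the end of input as closing the root, so that every tag still open at that point becomes an empty element, is an assumption of this sketch.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node objects and text strings, in document order


class DeferredCloseBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)  # provisionally a container; decided later

    def handle_data(self, data):
        self.stack[-1].children.append(data)

    def handle_endtag(self, tag):
        # Look for the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                self._flatten_above(i)
                self.stack.pop()
                return
        # No matching open element: ignore the stray close tag.

    def _flatten_above(self, i):
        # Elements opened inside stack[i] but never closed are treated as
        # empty: their contents move up to become following siblings.
        while len(self.stack) - 1 > i:
            node = self.stack.pop()
            self.stack[-1].children.extend(node.children)
            node.children = []

    def close(self):
        super().close()
        self._flatten_above(0)  # assumption: end of input closes the root


def parse_tolerant(html):
    builder = DeferredCloseBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    lines = []
    for child in node.children:
        if isinstance(child, str):
            lines.append("  " * depth + repr(child))
        else:
            lines.append("  " * depth + "<" + child.tag + ">")
            lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(parse_tolerant("<p>one <i>two</p><p>three</p>"))))
```

In the sample input the `<i>` is never closed. When `</p>` arrives, the `<i>` node is left empty and "two" becomes its sibling inside the first paragraph, so the second paragraph is not swallowed by the italics.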
1. Re:Parsing HTML in Perl by Matts · 2002-08-19 05:31 · Score: 3, Interesting
  
  Try XML::LibXML instead. It parses HTML, uses a DOM tree, and is all in C code, so uses about twice the memory of your source document, instead of about 8 times for a pure perl DOM.
  
  --
  
  Matt. Want XML + Apache + Stylesheets? Get AxKit.
2. Re:Parsing HTML in Perl by rp · 2002-08-19 05:57 · Score: 1
  
  But does it have the heuristic parsing of quasi-HTML that HTML::Parser / HTML::TreeBuilder were designed to do?
3. Re:Parsing HTML in Perl by Anonymous Coward · 2002-08-19 14:09 · Score: 0
  
  and the weak object system is an obstacle
  
  This is FUD.
  
  The Perl object system is far superior to Java's, Python's or C++'s. Can you do class invariants, *true* pre and post conditions (that can refer to the object's previous state) effortlessly in those languages? Nope. But in Perl it's Class::Contract. If it's type safety you want, there's another module that gives you Pascal-like levels of type safety - far more than, say, Java, which is so crippled you can't even create a type-safe collection.
SSL coverage omitted by cazwax · 2002-08-19 04:52 · Score: 2, Insightful

This book has no coverage of configuring your LWP module to support SSL connections. Perhaps it is trivial, but an overview would be useful to newbies.
Big Time Scraping by deathcow · 2002-08-19 05:07 · Score: 3, Interesting

I work for a telecom company. You wouldn't believe the scope of devices which require screen scraping to work with. The biggest one that comes to mind that _can_ require it is the Lucent 5ESS telecom switch. While the 5ESS has an optional X.25 interface (for tens of thousands of dollars), our company uses the human-ish text-based interface.

Let's say a user (on a PC) wants to look up a customer's phone line. They pull up IE, go to a web page, and make the request into an ASP page, and it gets stored in SQL Server.

Meanwhile, a Perl program retrieves a different URL, which gives all pending requests. It keeps an open session onto the 5ESS, like a human would. It then does the human tasks, retrieves the typically 1 to 10 page report and starts parsing goodies out of it for return.

More than just 5ESS switches -- DSC telecom switches, some echo cancellers, satellite modems, lots of other devices require scraping to work with.
the Linux Wonkumentation Project by Anonymous Coward · 2002-08-19 05:16 · Score: 0

I for one am pleased to see the Linux Wonkumentation Project finally receiving some press!
PHP by ShaggusMacHaggis · 2002-08-19 05:18 · Score: 0, Offtopic

I really don't use much Perl, but I do use a lot of PHP. PHP with the curl functions seems to me about 10x easier to use/learn than Perl and LWP (not saying it's hard, it's just that PHP is really really easy), and seems to do the same exact thing.
Don't be unfair to the author...... by i_want_you_to_throw_ · 2002-08-19 05:19 · Score: 4, Insightful

Yeah yeah spammers can use it. So what? Spam/email harvesting is only one of thousands of uses for LWP and focusing on that fact alone is VERY unfair to the author. You want to address the spamming issue? Don't use mailto tags in your HTML. Use form submission instead. If you use mailto: tags you DESERVE to be spammed.

There. Now shut the fsck up about the issue.

I manage a few government web sites and this book has been tremendous help in writing the spiders that I use to crawl the sites and record HTTP responses that then generate reports about out of date pages, 404s and so on. That alone has made it worth the money.

Sean did a great job on this. His book doesn't deserve to be slammed for what the technology MAY be used for.
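
The kind of 404 report mentioned above can be sketched roughly as follows. This is not the poster's spider. It is a minimal Python illustration (Python chosen only so it runs with a standard library), and `http_status` and `broken_links` are invented names. It checks a list of URLs that has already been collected, and does no crawling of its own.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as exceptions
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout


def broken_links(urls, status_of=http_status):
    """Map each failing URL to its status code (None means no response)."""
    report = {}
    for url in urls:
        status = status_of(url)
        if status is None or status >= 400:
            report[url] = status
    return report
```

Calling `broken_links(list_of_urls)` returns a dictionary of the URLs that need attention. Passing a different `status_of` function lets the report logic be exercised without touching the network.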
Muhahaha by Anonymous Coward · 2002-08-19 05:31 · Score: 0

Just the book I need to help me develop Slashtroll v2!
LWP by linuxelf · 2002-08-19 05:55 · Score: 2, Interesting

Our company recently switched from Netscape Mail Server to Exchange 5.5. They then turned off all non-Microsoft protocols, like IMAP and POP, so suddenly, my beautiful Linux machine couldn't get mail, and they were making me switch to Outlook on Windows (ick). However, they did leave webmail as an option for me, woohoo. So, I just broke out my Perl and LWP and now I have a script that quite handily grabs all my mail from the webmail interface and populates it into my standard Unix mail spool. Problem solved.

--
- "That's just the kind of fuzzy-headed liberal thinking that leads to being eaten."
1. Re:LWP by Anonymous Coward · 2002-08-19 08:26 · Score: 0
  
  Anyone out there find a way to write something that scans yahoo mail for messages? I tried this a while back but ran into problems because of SSL and form submission difficulty.
Re:Worth learning LWP instead of doing it manually by Frank+of+Earth · 2002-08-19 06:09 · Score: 2

I've done a whoooole lot of screen-scraping working for a company that shall remain nameless :)

Unless you go to the link to my homepage and read the first paragraph? ;-)

--
Live web cams
Great Book, Cool author by Pinball+Wizard · 2002-08-19 06:26 · Score: 3, Interesting

I started working through this book the day it came out. So far, great book - I used to write bots in C++ of all languages because I had a decent C++ library to work with, but since I switched to Perl and LWP it's been a lot easier and more productive. I was able to put some things to work almost immediately retrieving info from sites like amazon.com and abebooks.com (I program for an online bookstore, so the examples in this book were very useful). The author happens to live in the same city as me and even stopped by my work to chat for 1/2 hour! Great guy, great book.
In fact I wanted to write a review for this book, but obviously got beaten to the punch. My only wish (2nd edition, perhaps) for this book is that it spent a little more time dealing with things like logging into sites, handling redirection, multi-page forms, dealing with stupid HTML tricks that try to throw off bots, etc. But for a first edition this is a great book.

--
No, Thursday's out. How about never - is never good for you?
Wow, just like Evolution! by Anonymous Coward · 2002-08-19 06:26 · Score: 0

I'm sure your company is so glad that you spent hours re-creating the wheel. I hope you didn't charge them for the privilege...
1. Re:Wow, just like Evolution! by linuxelf · 2002-08-19 06:52 · Score: 1
  
  Actually, Evolution doesn't support Exchange 5.5. And, my company appreciates all the work I do. Doesn't yours?
  
  --
  - "That's just the kind of fuzzy-headed liberal thinking that leads to being eaten."
2. Re:Wow, just like Evolution! by Anonymous Coward · 2002-08-19 10:14 · Score: 0
  
  If they truly appreciate you, maybe they would consider a polite request to stop screwing without any good reason.
3. Re:Wow, just like Evolution! by linuxelf · 2002-08-20 01:27 · Score: 1
  
  I doubt it. Big company. Mandates come from on high, we just have to live with them the best we can.
  
  --
  - "That's just the kind of fuzzy-headed liberal thinking that leads to being eaten."
when it's worth using LWP and HTML parsers by Preposterous+Coward · 2002-08-19 07:24 · Score: 2

Can anyone discuss if it's worth it to learn this module and convert HTML the "right" way?

Yes, it's worth it to learn this, even if you still end up using the quick-and-dirty approach most of the time. The abstraction and indirection is pretty much like *any* abstraction and indirection -- it's more work for small, one-off tasks, but it pays off in cases where reusability, volume, robustness, and similar factors are important. If you end up having to parse pages where the HTML is nasty, or really large volumes of pages where quality control by inspection is impractical, or more session-oriented sites, the LWP-plus-HTML-parser-solution can be really valuable.

Frankly, if you're familiar with the principles of screen scraping (and you obviously are), learning the LWP-plus-parser solution is pretty simple (and I suspect you know a big chunk of what this book would try to tell you anyway). You can just about cut and paste from the POD for the modules and have a basic working solution to play with in a few minutes, then adapt or extend that in cases where you really need it.

--

"Biped! Good cranial development. Evidently considerable human ancestry."
1. Re:when it's worth using LWP and HTML parsers by cloudmaster · 2002-08-19 09:41 · Score: 2
  
  There's that whole "saves me from spawning another process every time I wanna grab the HTML from a new page" thing, too. I like the LWP modules largely because they take care of grabbing the content for me without having to go outside of Perl. Reducing the dependency on outside programs is good. :)
Why HTML::TokeParser? by Anonymous Coward · 2002-08-19 07:52 · Score: 1, Interesting

I never liked that module. The tokens it returns are array references that don't even bother to keep similar elements in similar positions, thus forcing you to memorize the location of each element in each token type or repeatedly consult the docs. If you refuse to do event driven parsing, at least use something like HTML::TokeParser::Simple which is pretty cool as it's a factory method letting you call accessors on the returned tokens. You just memorize the method names and forget about trying to memorize the token structures.
Re:Practical but... - One Solution (more complex) by ShannonClark · 2002-08-19 08:19 · Score: 1

I agree.

So, my company developed software that uses AI-like techniques to avoid this problem - not a trivial problem to solve, but valuable when you do.

What we've done (using PHP not Perl but the techniques and languages are very similar for this piece) is do a series of extraction steps - some structural and others data related - the structural steps employ AI-like techniques to detect the structure of the page and then use it to pass the "right" sections on to the data extraction portions.

This employs some modified versions of HTML parsers, but not a full object/tree representation (too expensive from a memory and performance standpoint for our purposes) - rather we normalize the page (to reduce variability) and then build up a data structure that represents the tree structure, but does not fully contain it.

In simpler terms - this stuff can be very complex, but if you need to there are companies (such as mine) who can offer solutions that are resistant to changing content sources and/or are able to rapidly handle new sources (in near realtime).

If you are interested feel free to contact me off Slashdot for more information and/or a product demo. www.jigzaw.com

--
-- Join us in Chicago May 1-4th for MeshForum -- writer, historian, tech geek, entrepreneur, internet junky since '91 --
Or, you could save the money and look at OpenBooks by SLot · 2002-08-19 08:53 · Score: 3, Informative

A lot of this seems to be covered in Web Client Programming with Perl.

Along with the other comments listing many references for Perl & LWP, I don't think I'll be rushing out to spend the money quick-like...
Web scraping, not screen scraping by Anonymous Coward · 2002-08-19 10:12 · Score: 0

Screen scraping is analysing display buffers for information.
LWP rocks by Quixote · 2002-08-19 14:58 · Score: 2

Ahh... LWP is (was) a god-send.
Back in the days when IPOs were hot (anyone remember them?), we wrote a client to place IPO orders on WitCapital's site automatically (when they had first-come, first-served allotments). In those days, it didn't really matter what IPO you got. All you had to do was get it and flip the same day, making a tidy sum of ca$h.
Later, we automated ordering on E*Trade's site. We wrote an application that would check their site for IPOs, fill-in the series of forms and submit the orders. Got many an IPO that way, and it was fun too.
Of course, who hasn't written an EBay sniper using a few lines of LWP?
Re:Worth learning LWP instead of doing it manually by Etcetera · 2002-08-19 17:26 · Score: 2

Shush =P

And just to repeat, in case people didn't see my follow-up post, I'm already using Perl to handle my screen-scraping. My question was if I should take the time to learn to get/parse the resulting HTML using LWP instead of using Lynx and regexp-ing the resulting source to death.

--
Hire a Linux system administrator, systems engineer,
Seth (??) by Anonymous Coward · 2002-08-19 18:30 · Score: 0

One thing I tend to look for in a book reviewer is their ability to actually get the author's name correct.

It's, er, Sean Burke.
Perl and LWP by jacquio · 2002-08-19 19:00 · Score: 1

YAPB
Your money is likely better spent buying Friedl's Mastering Regular Expressions Second Edition for example, which just came out, and then being able to apply that knowledge to many situations. Screen-scraping sounds indeed like parsing HTML, in which case it should be a breeze to use regexes and CPAN modules dedicated to HTML and even modified XML parsers to do the job...after all the power of XML is user-defined tags, there's nothing stopping the user from specifying html tag events...
Re:Worth learning LWP instead of doing it manually by Anonymous Coward · 2002-08-19 19:01 · Score: 0

Widget collectors everywhere should really give up on the web and check out the alt.binaries.*.erotica.* groups.
awk != perl. awk (very much less than) perl by ip4noman · 2002-08-20 01:40 · Score: 1

> Using Perl is not even necessary.
Larry Wall borrowed many ideas from Awk when he wrote Perl (as well as ideas from Unix shell, BASIC, C, and Lisp), but awk is NOT a substitute for perl. If awk were as capable of complex parsing jobs as perl, Larry probably would have done something different with his time.

OTOH, Perl *is* a superset of awk. Any awk program can be converted to perl with the utility a2p (which comes with the Perl source distribution), although probably not optimally.
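As a rough illustration of the correspondence (hand-written, not actual a2p output), take the awk one-liner awk '{ sum += $2 } END { print sum }' and write it out in Perl. The sample data and the sub name are made up for the sketch.

```perl
use strict;
use warnings;

# Perl equivalent of:  awk '{ sum += $2 } END { print sum }'
sub sum_second_field {
    my ($fh) = @_;
    my $sum = 0;
    while (my $line = <$fh>) {
        my @f = split ' ', $line;         # awk-style whitespace splitting
        $sum += $f[1] if defined $f[1];   # awk's $2 is Perl's $f[1]
    }
    return $sum;
}

# Read from an in-memory filehandle so the sketch runs without input files.
my $data = "apples 3\npears 4\nplums 5\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print sum_second_field($fh), "\n";    # prints 12
```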
Thanks merlyn! by ip4noman · 2002-08-20 01:54 · Score: 1

Your Perl advocacy over the years has been very helpful to my perl mast^h^h^h^hhackery, and I have borrowed from your LWP columns in writing LWP servers and user agents.

At my last gig, I had to write an automation to post to LiveLink, a 'Doze-based document repository tool. The thing used password logins, cookies, and redirect trickery.

Using LWP (and your sample source code), I wrote a proxy: a server which I pointed my browser to, and a client which pointed to LiveLink. I was then able to observe the detailed shenanigans occurring between my browser and the LiveLink server, which I then simulated with a dedicated client.

I can't imagine any other way to have accomplished it as simply as with LWP, and with your sample code to study. Thanks to you and Gisle Aas (and Larry) for such wonderful tools!
I deserve to be spammed? by sorbits · 2002-08-20 16:14 · Score: 1

> Don't use mailto tags in your HTML. Use form submission instead. If you use mailto: tags you DESERVE to be spammed.

So trying to provide the audience of my web pages with some comfort (a single click and a decent, configurable mail editor appears, which allows the address to be bookmarked, the letter to be saved as a draft for later completion, a carbon copy to be sent to a friend, etc.) should be repaid with punishment like being spammed?

Are you one of those people who also claim that if you leave your stuff unprotected then you deserve to have it stolen? What a sad society we live in...
Re:Worth learning LWP instead of doing it manually by Anonymous Coward · 2002-08-25 23:16 · Score: 0

> Can anyone discuss if it's worth it to
> learn this module and convert HTML the
> "right" way?

s/right/portable/;