Spidering Hacks


Posted by timothy on Tuesday December 16, 2003 @07:00AM from the use-for-good-not-evil dept.

DrCarbonite writes "Spidering Hacks is a well-written guide to scripting and automating your data-seeking forays onto the Internet. It offers an attractive combination of solving the problems you have and exposing you to solutions that you weren't aware you needed." Read on for Martin's review of the book.

Spidering Hacks
author: Kevin Hemenway and Tara Calishain
pages: 402
publisher: O'Reilly
rating: 8
reviewer: Jeff Martin
ISBN: 0596005776
summary: A wide-ranging collection of hacks detailing how to be more productive in Internet research and data retrieval

Introduction

Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, allowing the computer to obtain data, leaving the user free to spend more time on analysis.

SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still pick up plenty of useful tips that will help them with their own projects.

Overview

Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation for the power they are utilizing.
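One piece of robot etiquette is easy to show in code: spacing out requests so that a spider never hammers a single host. The book's own examples are in Perl; what follows is only a minimal Python sketch, and the class name `PoliteThrottle` and the two-second default delay are assumptions chosen for illustration, not anything taken from SH.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Keep successive requests to the same host at least min_delay apart.

    Hypothetical helper for illustration; the clock and sleep functions
    are injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Pause if the host of `url` was contacted too recently.

        Returns the number of seconds spent sleeping.
        """
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (self._clock() - last))
        if pause > 0:
            self._sleep(pause)
        # Record the moment the caller is now free to send the request.
        self._last_hit[host] = self._clock()
        return pause
```

A spider would call `throttle.wait(url)` immediately before each fetch. Tracking the delay per host means a crawl that touches many sites is not slowed down by a limit meant to protect just one of them.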

Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
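To make the robots.txt point concrete, here is a small sketch of checking a URL against a site's rules before fetching it. It uses Python's standard `urllib.robotparser` rather than the Perl modules the book relies on, and the sample rules, the `example-spider` agent name, and the `build_robot_rules` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)

# A page outside the disallowed path may be fetched.
print(rules.can_fetch("example-spider", "http://example.com/index.html"))  # True

# Anything under /private/ is off limits to every user agent.
print(rules.can_fetch("example-spider", "http://example.com/private/data.html"))  # False
```

In a real spider the rules would come from the live site, for instance via the parser's `set_url()` and `read()` methods, instead of from a string embedded in the script.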

Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While the individual sites described here may not appeal to everyone, each one serves as a specific example of a general concept that can be applied to sites of the reader's choosing.

Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.
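For readers who have not handled RSS programmatically, the sketch below shows roughly what "manipulating" a feed involves: parse the XML and pull out each item's title and link. It is written in Python with the standard `xml.etree.ElementTree` module, not the Perl used in SH, and the feed content and the `feed_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>Made-up feed for this sketch</description>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def feed_items(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in feed_items(SAMPLE_FEED):
    print(title, link)
```

Using an XML parser here, rather than pattern matching on the raw text, keeps the script working when the feed's whitespace or element order changes.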

Hack #48, "Super Word Lookup," is a good example of why SH is so intriguing. While using a dictionary or thesaurus via a browser is simple, being able to do so from a command-line program lets the user automate the lookup and avoid the distractions of the browser.

Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.

Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
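As a rough idea of what publishing a feed involves, the following sketch builds a minimal RSS 2.0 document from a list of entries. It is Python using only the standard library, not code from the book, and the `build_rss` function and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, entries):
    """Return an RSS 2.0 document as a string.

    `entries` is an iterable of (item_title, item_link) pairs.
    Hypothetical helper; real feeds usually carry more fields,
    such as per-item descriptions and publication dates.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    # encoding="unicode" yields a str; special characters are escaped for us.
    return ET.tostring(rss, encoding="unicode")


print(build_rss(
    "Example Site Updates",
    "http://example.com/",
    "Made-up channel for this sketch",
    [("New tools & tips", "http://example.com/tools")],
))
```

Building the tree with a library instead of concatenating strings means characters such as `&` in a title are escaped automatically, which is one of the more common ways hand-rolled feeds end up invalid.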

Conclusion

There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.

The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.

You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.

XML interop? by prostoalex · 2003-12-16 07:09 · Score: 4, Interesting

From the review it looks like an excellent book to read and maybe have around. I will check it out on Safari, since it looks like they made it available to subscribers.

However, looking at these hacks:

68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds

Do they offer any hacks on working with XML, perhaps XML::RSS or other parsing engines from CPAN? Or is most of the XML handled through regexp?
1. Re:XML interop? by justMichael · 2003-12-16 07:36 · Score: 4, Interesting
  
  Hack 24 Painless RSS with Template::Extract
  
  It's actually a good read. They try to stay away from regex parsing as it tends to be fragile. They do cover it in one of the hacks though.
  
  Most of the hacks use various methods to walk the doc tree and look for what you want, like a certain cell in a table (think header with names), then jump up one level to get that row, then grab the next row to get your data cells.
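The tree-walking approach described above can be sketched roughly as follows. This is Python with the standard `xml.etree.ElementTree` module rather than the Perl modules the book uses, and it assumes the markup is well-formed XML; real-world HTML usually needs a more forgiving parser. The sample table and the `row_after_header` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table markup used only for illustration.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alice</td><td>4</td></tr>
  <tr><td>bob</td><td>5</td></tr>
</table>
"""


def row_after_header(markup, header_text):
    """Find the header cell with the given text and return the next row's cells.

    Returns a list of cell strings, or None if the header cell is not
    found or no row follows it.
    """
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    for cell in root.iter("th"):
        if (cell.text or "").strip() != header_text:
            continue
        header_row = parent_of[cell]       # jump up one level to the row
        container = parent_of[header_row]  # the table (or tbody) holding it
        rows = [el for el in container if el.tag == "tr"]
        position = rows.index(header_row)
        if position + 1 < len(rows):       # grab the row that follows
            return [(c.text or "").strip() for c in rows[position + 1]]
        return None
    return None


print(row_after_header(SAMPLE_TABLE, "Name"))  # ['alice', '4']
```

Because the lookup is anchored on the header text and the table structure, it keeps working if surrounding markup or attributes change, which is where a regex tends to break.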
Tracking yahoo popularity. by Flat+Feet+Pete · 2003-12-16 07:25 · Score: 3, Interesting

My server's going to die under the load, but I did this using Perl+Curl.

This page is used to source the data.

Is LWP the correct/new way to do this kind of stuff? I started with curl and hacked regexes to get the data.
Agents, anyone? by Wingchild · 2003-12-16 07:36 · Score: 5, Interesting

A few years ago, the big idea was that by some as-yet undetermined point in the future (say, 2005) all human beings would be freed from having to collect their own data by way of intelligent, semi-autonomous Agents that could be given some loose English-query type tasks and go on their merry way, fetching and organizing and categorizing data by relevance. It's not too far different from the proposed use of scripting talked about above.

The problem comes more in the last assertion of the story: that pulling in all of this data will free up more time for people to spend on the work of analysis. I want to say this isn't accurate, but it probably boils down to what you call "analysis" work.

The problem with spiders, agents, and their like -- yes, even those that are going out and fetching porn -- is that they are able to provide content without context, much as a modern search engine does. I can take Google and get super specific with a query (say, `pirates carribean history -movie -"johnny depp"`). That will probably fetch me back some data that has my keywords in it, much as any script or agent could do.

Unfortunately, while the engine could rank based on keyword visibility and recurrence, as well as applying some algorithms to try and guess whether the data might be good or not (encyclopedias look this way, weblogs about Johnny Depp look that way), the engine itself still has no way to physically read the information and decide if it's at all useful. A high-school website's page with a tidbit of information and some cute animated .gifs could theoretically draw more of a response from the engine than an official historian's personal recollections of his research while he was working on his master's thesis about the Jolly Roger. Any script (or engine) is only what you make of it.

The most tedious part of data analysis these days is not providing content (as spiders, scripts, and search engines all do) ... it's in providing a frame of context for the choosing, and, ultimately, rejection of sources.

What comes after that sorting process - the assimilation of good data and the drawing of conclusions there-from - that's what I call data analysis. A shame that scripts, spiders, agents, and robots haven't found a way to do that for us. :)
Perl script to query the library by Saint+Stephen · 2003-12-16 07:54 · Score: 3, Interesting

I have 3 library cards, and get a lot of DVDs, CDs, and books from them. (Lotsa free time).

I got tired of having to go to all 3 websites to see what to take back each day, so I wrote a small bash/curl script so I could do it at the command line.

There are *lots* of things like this that could be done if the web were more semantic.
An alternative by toddcw · 2003-12-16 07:58 · Score: 2, Interesting

It's a commercial app, but it's saved us scads of time: screen-scraper. It's also a lot less of a "hack".
cousin of spam? by GCP · 2003-12-16 08:00 · Score: 3, Interesting

The easier and more widespread the techniques for spidering become, the more websites will get hammered with the unintended equivalent of DoS attacks, the way spam is the equivalent of a DoS attack on your email account.

I don't have any solutions in mind. I don't want anti-spidering legislation, for example, because *I* want to be able to spider. I just don't want *you* to do it. ;-)

Really, I'm just observing that as the Web evolves we could see another spam-like problem emerge, at least for the more interesting sites.

--
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
Re:Table of content is packed with great stuff! by millette · 2003-12-16 08:01 · Score: 2, Interesting

I was way off with my 2-letter name :)
Looking at the page, and I'm pretty sure it's the little program I had lost. Thanks for finding it again!
Pssst! Mod parent up!
How much can you screen-scrape legally ? by JPMH · 2003-12-16 09:12 · Score: 4, Interesting

Question: how much screen-scraping can you do, before the legal questions start ?
In the USA, trading information that has cost somebody else time and money to build up can be caught under a doctrine of "misappropriation of trade values" or "unfair competition", dating from the INS case in 1918.
Meanwhile here in Europe, a collection of data has full authorial copyright (life + 70) under the EU Database Directive (1996), if the collecting involved personal intellectual creativity; or special database rights (last update + 15 years) if it did not.
I've done a little screen-scraping for a "one name" family history project. Presumably that is in the clear, as it was for personal non-commercial research, or (at most) quite limited private circulation.
But where are the limits ?
How much screen-scraping can one do (or advertise), before legally it becomes a "significant taking" ?
Spidering Google Illegal? by jetkust · 2003-12-16 09:48 · Score: 2, Interesting

From Google Terms of Service:

No Automated Querying

You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
"meta-searching" Google; and
performing "offline" searches on Google.

Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
1. Re:Spidering Google Illegal? by the+pickle · 2003-12-16 19:30 · Score: 2, Interesting
  
  And what, exactly, constitutes "meta-searching" Google?
  
  p
  
  --
  In Korea, long hair is for old people!