Spidering Hacks

Posted by timothy on Tuesday December 16, 2003 @07:00AM from the use-for-good-not-evil dept.

DrCarbonite writes "Spidering Hacks is a well-written guide to scripting and automating your data-seeking forays onto the Internet. It offers an attractive combination of solving the problems you have and exposing you to solutions that you weren't aware you needed." Read on for Martin's review of the book.

Spidering Hacks
author: Kevin Hemenway and Tara Calishain
pages: 402
publisher: O'Reilly
rating: 8
reviewer: Jeff Martin
ISBN: 0596005776
summary: A wide-ranging collection of hacks detailing how to be more productive in Internet research and data retrieval

Introduction

Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, letting the computer obtain the data and leaving the user free to spend more time on analysis.

SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still pick up many useful tips that will help them with their own projects.

Overview

Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation for the power they are utilizing.

Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
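
As a rough illustration of the kind of robots.txt check this chapter describes, here is a minimal sketch in Python rather than the book's Perl, using the standard library's `urllib.robotparser`. The robots.txt rules, the user-agent name `ExampleSpider`, and the example.com URLs are all made up for illustration and are not taken from the book.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from
# the target site (for example with RobotFileParser.set_url() and read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="ExampleSpider"):
    """Return True if the sample rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        print(url, "->", "fetch" if allowed(url) else "skip")
```

A spider would run a check like this before every request and simply skip any URL the site has asked robots to leave alone.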

Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While the individual sites described here may not appeal to everyone, the idea is to provide specific examples that demonstrate general concepts, which can then be applied to sites of the reader's choosing.

Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.

Hack #48, "Super Word Lookup," is a good example of why SH is so intriguing. While using a dictionary or thesaurus via a browser is simple, being able to do so from a command-line program gives the user an automated approach with fewer distractions.

Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.

Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.

Conclusion

There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.

The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.

You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.

7 of 121 comments

Use of "hacker" by davidstrauss · 2003-12-16 07:10 · Score: 2, Insightful

When are people going to realize that hackers just care about computers and the crackers are the bad guys? Oh wait...
Re:Techniques used by spammers? by tds67 · 2003-12-16 07:12 · Score: 3, Insightful

"Other than using google, why would anyone use these techniques other than to harvest email like a spammer?"
There's a lot more information on the Web than just e-mail addresses. Besides, why be reliant on search engines when you can do it yourself?
Spidering and exceeding ISP bandwidth limits by G4from128k · 2003-12-16 07:26 · Score: 5, Insightful

I suspect that more than a few people are going to hit their ISP's bandwidth limits if they start playing with spiders. A spider running on a simple 768 kbps DSL line can probably schlep down more than 4 GB per day or 129 GB/month (assuming the CPU can keep up with analyzing the flow).
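
For reference, the arithmetic behind an estimate like this can be sketched as below. It assumes 1 kbps means 1,000 bits per second, decimal gigabytes, and a 30-day month; the function name and the `utilization` parameter are illustrative. Under those assumptions a fully saturated 768 kbps line tops out near 8.3 GB per day, so the 129 GB/month figure corresponds to roughly half utilization.

```python
def gb_per_day(kbps, utilization=1.0):
    """Decimal gigabytes transferred in 24 hours at the given line rate."""
    bytes_per_second = kbps * 1000 / 8 * utilization
    return bytes_per_second * 86_400 / 1e9


if __name__ == "__main__":
    full = gb_per_day(768)
    print(f"saturated line:   {full:.2f} GB/day, {full * 30:.0f} GB per 30 days")
    half = gb_per_day(768, utilization=0.5)
    print(f"half utilization: {half:.2f} GB/day, {half * 30:.0f} GB per 30 days")
```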

--
Two wrongs don't make a right, but three lefts do.
1. Re:Spidering and exceeding ISP bandwidth limits by interiot · 2003-12-16 07:52 · Score: 4, Insightful
  
  If it's a full spider where you're considering competing with google or reimplementing google with extra features, then yes, you'd obviously need an industrial-strength account.
  More likely though, you leave the big jobs to the big boys, and you want to do very specific things, maybe even building on top of google... e.g., finding porn movies, copying edmunds' database so you can sort cars by their power/weight ratio (or list all RWD cars, or find the lightest RWD car, or...), or making your own third-party feed of slashdot from their homepage since they watch you like a hawk when you download their .rss too often, but not when you download their homepage too often.
  Little custom jobs like that can take a minimal amount of code (especially if you're a regex wizard), take minimal bandwidth, and take enough skill that target sites aren't likely to track you down because there's only three of you doing it.
Re:Agents, anyone? by LetterJ · 2003-12-16 08:07 · Score: 5, Insightful

I think that some of the things being done to filter *out* spam might also apply to filtering *in* good information from things like agents.

I know that my POPFile spam filter is getting pretty good (with 35,000 messages processed) at not only spam vs. ham type comparisons, but also work vs. personal and other categories.

Bayesian filters are just one type of learning algorithm, but they work fairly well for textual comparisons. I've personally been toying with seeing how well a toolbar/proxy combination would work for predicting the relative "value" of a site to me. Run all browsing through a Bayesian web proxy that analyses all sites visited. Then, with a browser toolbar, sites can be moderated into a series of categories.

That same database could be used by spiders to look for new content, and, if it fits into a "positive" category according to the analysis, add it to a personal content page of some sort that could be used as a browser's home page.
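
To make the idea concrete, here is a toy sketch of a multinomial naive Bayes classifier with add-one smoothing, the general family of technique that Bayesian filters are built on. It is not POPFile's code or the design of the proxy described above; the class name, category labels, and training snippets are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes:
    """Toy multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, category, text):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = self.tokenize(text)
        best_category, best_score = None, -math.inf
        for category, docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(docs / total_docs)  # prior for this category
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


if __name__ == "__main__":
    model = TinyBayes()
    model.train("interesting", "perl spider script scraping rss feeds")
    model.train("interesting", "python script to parse rss and html")
    model.train("boring", "celebrity gossip and fashion news")
    model.train("boring", "sports scores and celebrity interviews")
    print(model.classify("new perl script for rss scraping"))
```

In the scenario sketched in this comment, the toolbar would supply the `train` calls and a spider would call `classify` on each newly found page.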

With sufficient data sources (and with a book like this, it shows that there ARE plenty of sources), it could really bring the content you want to read together.

--

The Glass is Too Big: My Take on Things
Re:cousin of spam? by 1iar_parad0x · 2003-12-16 08:47 · Score: 3, Insightful

Well, if you space the time between HTTP requests, it wouldn't be spam.

This might be obvious or just a non-issue, but ignore IMG tags in your bots (it saves on bandwidth costs). You're probably not affecting their bandwidth by downloading text.
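
The two suggestions above, spacing out requests and skipping images, can be sketched roughly as follows. This is a minimal illustration using only Python's standard library; the class names, the one-second delay, and the sample HTML are assumptions made for the example, and the actual download call is left out.

```python
import time
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags; <img> sources are never queued."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Sleep just long enough to keep requests delay_seconds apart."""
        if self.last_request is not None:
            remaining = self.delay - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()


if __name__ == "__main__":
    sample = ('<a href="/page1.html">one</a>'
              '<img src="/big.jpg">'
              '<a href="/page2.html">two</a>')
    extractor = LinkExtractor()
    extractor.feed(sample)
    throttle = Throttle(delay_seconds=1.0)
    for link in extractor.links:
        throttle.wait()  # pause before each (hypothetical) fetch
        print("would fetch", link)
```

The clock and sleep functions are passed in as parameters only so the delay logic can be exercised without actually waiting.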

Incidentally, most spammers are glorified script kiddies, not data miners or AI people. The kind of "hard-earned" money in data mining isn't the kind of money spammers are looking for.

The real problem with data mining is increased server load. Perhaps running your scripts late at night would help.

Of course, if server load was spam, then Slashdot would have a lot of explaining to do. :)

--
What do you mean my sig is repetitive? What do you mean my sig is repetitive? What do you mean....
Re:cousin of spam? by sethx9 · 2003-12-17 01:50 · Score: 2, Insightful

There is an industry built on teaching businesses and web designers how to increase ranking by making pages spider-friendly. The inverse of those same techniques could be used to protect a site.

If "bad" spiders became so common that businesses began needing to weigh the pros of page ranking against the cons of data theft then the indexing services (those that wanted to remain relevant) would develop other methods for accessing web content.

On a side note: I actually bought this book a couple of weeks ago as a tool to help me learn Perl. Over the past few years I've built and used scraping tools, and when I saw this book I was thrilled to have so many real-world examples that weren't about building front-end grids and tables to databases!

--
Sorry, I keep forgetting to add the tongue-in-cheek emoticon to the bottom of my posts...