Spidering Hacks
Introduction
Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, allowing the computer to obtain data, leaving the user free to spend more time on analysis.
SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still pick up plenty of useful tips for their own projects.
Overview
Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation of the power they are wielding.
Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, how to respect robots.txt, and various other topics (Perl's LWP and WWW::Mechanize modules, for example) that give the reader a solid foundation for the rest of the book. SH does a great job introducing topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
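To give a feel for what "respecting robots.txt" means in practice, here is a minimal sketch. It uses Python's standard `urllib.robotparser` rather than the Perl modules the book relies on, and the robots.txt content, the `ExampleSpider/0.1` agent name, and the example.com URLs are all made up for illustration. It is not code from the book.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the root of the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "ExampleSpider/0.1"  # hypothetical user-agent string


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: a polite spider skips anything that is disallowed.
allowed_public = parser.can_fetch(AGENT, "http://example.com/index.html")
allowed_private = parser.can_fetch(AGENT, "http://example.com/private/data.html")

# Seconds the site asks robots to wait between requests (None if unspecified).
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

A spider built on this would check `can_fetch` before every request and sleep for `delay` seconds between requests when the site specifies one.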
Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While the individual sites described here may not appeal to everyone, each hack serves as a specific example of a general concept that can be applied to sites of the reader's choosing.
Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.
Hack #48, Super Word Lookup, is a good example of why SH is so intriguing. While utilizing a dictionary or thesaurus via a browser is simple, having the ability to do so with a command-line program allows the user an automated approach, reducing distractions.
Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.
Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
Conclusion
There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.
The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.
You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.
deepweb
I wonder if Tracking Packages with FedEx is using the new google feature. That would be too simple :)
Does anyone know the name of a small utility to query search engines on the command line? I think it was a 2-letter program, but I can't find it anymore :(
This is one of my favorite O'Reilly books. It is amazing what you can do with a few lines of Perl code and LWP.
Strange women lying in ponds distributing swords is no basis for a system of government.
Save yourself $30.
Think of datamining, prime example:
You want to track your rank on www.alexa.com and the ranking of some of your key competitors. You build a spider that goes out each night, scrapes the info you want, and stores it locally. Now you have a history of your ranking and your competitors' rankings over time.
This way you can see that when your traffic is down so is your competitor's, or maybe when yours is down theirs is up...
This also happens to be one of the examples in the book.
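For the "stores it locally" half of that idea, a rough sketch might look like the following. It uses Python's built-in `sqlite3` rather than Perl, it is not the book's code, and the table layout, the site names, and the rank numbers are placeholders invented for illustration. The scraping step is deliberately left out; the nightly job would pass in whatever rank it scraped.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the local history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ranks ("
        " day TEXT NOT NULL,"
        " site TEXT NOT NULL,"
        " rank INTEGER NOT NULL,"
        " PRIMARY KEY (day, site))"
    )
    return conn


def record_rank(conn, day, site, rank):
    """Store one night's reading; rerunning the same night overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO ranks (day, site, rank) VALUES (?, ?, ?)",
        (day, site, rank),
    )
    conn.commit()


def history(conn, site):
    """Return [(day, rank), ...] for one site, oldest first."""
    rows = conn.execute(
        "SELECT day, rank FROM ranks WHERE site = ? ORDER BY day", (site,)
    )
    return rows.fetchall()


conn = open_history()

# Placeholder readings (made-up numbers, not real rankings).
record_rank(conn, "2003-11-01", "mysite.example", 5200)
record_rank(conn, "2003-11-01", "rival.example", 4100)
record_rank(conn, "2003-11-02", "mysite.example", 5050)
record_rank(conn, "2003-11-02", "rival.example", 4300)

print(history(conn, "mysite.example"))
print(history(conn, "rival.example"))
```

Keying the table on (day, site) means a spider that runs twice in one night does not produce duplicate rows, which keeps the comparison between sites straightforward.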
Here's hoping you aren't serious. I mean, rx are one thing, but to parse xml, and to some degree html, there are way better tools specifically for the job. I usually filter html thru tidy a few times until I can more easily parse it with xml tools - but that's just me.
Is it open source? I wish there was more adult open-sores software. UBH running from cron is what I use currently to automate porn consumption, but I'm sure there are tons of other opportunities....
can be found from www.searchlores.org
Don't know if anyone's pointed it out, but there are some sample links up on the web site. Some really great stuff, just from what I saw. Made me want to buy the book. (Guess that's the point.)
In general, most people use LWP, and if you write very many programs that use the web, you're going to want to go to LWP eventually, so you might as well start learning now (and there are easier interfaces to facilitate that too).
"why would anyone use these techniques other than to harvest email like a spammer"
1. Archiving data on the web
2. Getting your files back when you forget your FTP password
3. Researching the link structure of the Internet and how it changes over time
4. Playing a joke on a friend by scraping his site and reposting the content, filtered in your favorite dialect
5. Reading your favorite site in an RSS reader, even if they don't provide an RSS feed
6. Counting how often certain words are used on the net
7. Checking to see if you have any broken links on your site
8. Testing to make sure every link is reachable on your site, and finding out how deep the deepest link is
9. Taking data from a public website and compiling useful statistics, such as GPA calculations, average completion times for cross country races, or the total number of points scored last night in the NHL.
10. Showing people that the Internet can be more than just a web browser
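Items 7 and 8 are easy to picture in code. The sketch below is in Python's standard library rather than Perl, and it walks a tiny, made-up site held in a dictionary so that it runs without touching the network; a real checker would fetch each page over HTTP instead. The page paths and contents are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: path -> HTML. Stands in for pages fetched over HTTP.
SITE = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',
    "/news": '<a href="/news/2003">Archive</a> <a href="/missing">Old</a>',
    "/news/2003": "<p>No links here.</p>",
}


def crawl(site, start="/"):
    """Breadth-first walk from start.

    Returns (depth, broken): depth maps each reachable page to its
    click distance from start; broken lists (page, dead_link) pairs.
    """
    depth = {start: 0}
    broken = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        extractor = LinkExtractor()
        extractor.feed(site[page])
        for link in extractor.links:
            if link not in site:
                broken.append((page, link))
            elif link not in depth:
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth, broken


depth, broken = crawl(SITE)
print("deepest link is", max(depth.values()), "clicks from the home page")
print("broken links:", broken)
```

Because the walk is breadth-first, the depth recorded for each page is the shortest click path to it, and any page in the site that never appears in `depth` is unreachable from the home page.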
OddManIn: A Game of guns and game theory.
- #21: WWW::Mechanize 101
- #22: Scraping with WWW::Mechanize
- #36: Downloading Images from Webshots
- #44: Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups (which uses Mech)
- #64: Super Author Searching
- #73: Scraping TV Listings
Here are some other online resources to look at:
- A random bunch of examples submitted by users, included with the Mechanize distribution.
- Chris Ball's article about using WWW::Mechanize for scraping TV listings. (repurposed into hack #73 above)
- Randal Schwartz's article on scraping Yahoo News for images.
- WWW::Mechanize on the Perl Advent Calendar 2002, by Mark Fowler.
There are basically two styles of XML parser, event-based (SAX) and document-based (DOM). I find DOM-types easier to use.
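To make the difference concrete, here is the same small task done both ways: pulling every title out of a feed. This is a sketch using Python's standard `xml.sax` and `xml.dom.minidom` rather than any Perl parser, and the feed content is made up for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical feed used only for this example.
FEED = """<feed>
  <item><title>First post</title></item>
  <item><title>Second post</title></item>
</feed>"""


class TitleHandler(xml.sax.ContentHandler):
    """Event-based: react to tags as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


def titles_dom(text):
    """Document-based: load the whole tree, then query it."""
    doc = minidom.parseString(text)
    return [node.firstChild.data for node in doc.getElementsByTagName("title")]


print(titles_sax(FEED))
print(titles_dom(FEED))
```

The DOM version is shorter because the tree is already built when you start asking questions; the SAX version has to track its own state, but it never holds the whole document in memory, which matters for very large files.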
Been there.. done that...
http://www.booble.com/
LWP really just replaces the fetching part, it doesn't do anything to extract the data. It will definitely be easier than curl on the command line, no parameter passing to worry about.
To get the data from the page you can either use a bunch of regexps (as you've done, apparently) or a parser like HTML::TokeParser::Simple. The advantage of a parser is that it makes the script more robust and more resistant to site changes. You also get higher-quality data: with regexps, if something subtle changes in the site's HTML source, you sometimes end up capturing something like "this is the data <A href="whatever"..." With a parser, you don't have to worry about quoting or tag boundaries or anything like that. Naturally, if your script allows user interaction this will tend to be more secure, as there is less chance of an XSS and/or SQL injection vuln.
But, using a parser takes a little bit of investment up front in terms of time. With the '::Simple' variant it's really pretty easy, but it still requires that you be a little familiar with the tree structure of the page so that you can pull out the stuff you want.
All in all, if it works don't switch, but in the future you'll have a more robust and maintainable setup if you use LWP and a parser instead of commandline curl and regexps.
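As a rough illustration of the parser approach, here is a sketch using Python's standard `html.parser` as a stand-in for HTML::TokeParser::Simple. The page snippet is made up; the point is that stray tags and entities inside a cell never leak into the extracted data.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of every <td>, ignoring tags nested inside it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_cell = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TD> and <td> look the same here.
        if tag == "td":
            self._in_cell = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append("".join(self._parts).strip())
            self._in_cell = False

    def handle_data(self, data):
        # Only text nodes reach this method; markup is never included.
        if self._in_cell:
            self._parts.append(data)


def cell_texts(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical page: the second cell gained a link and an entity.
PAGE = (
    "<table><tr>"
    "<td>Plain value</td>"
    '<td><A HREF="whatever">Linked &amp; quoted</A></td>'
    "</tr></table>"
)

cells = cell_texts(PAGE)
print(cells)
```

A regexp written against the first cell's layout would likely need rewriting once the link appeared in the second; the parser version only cares that the text sits inside a table cell.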
This book requires that you submit to Google for a key to search with and use their API. In the hacks that require Google access, it'll just say something like
idkey = "insert your key here!"
AFAIK, this is standard practice for most sites with API access. (If you're interested, do it yourself at google.com/apis.) If you try to pull Google info down with an HTTP object programmatically, Google will just return a 403 and tell you to read its terms of service. (Unless you spoof the header, but that requires doing it from scratch, and it will also get you in trouble if you try to use it commercially.)