Spidering Hacks
Introduction
Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, allowing the computer to obtain data, leaving the user free to spend more time on analysis.
SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still learn a lot of useful tips that will help them with their own projects.
Overview
Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers appreciation for the power they are utilizing.
Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While individual sites described here may not appeal to everyone, the idea is to provide a specific example demonstrating each of certain general concepts, which can be applied to sites of the reader's choosing.
Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.
Hack #48, Super Word Lookup, is a good example of why SH is so intriguing. While utilizing a dictionary or thesaurus via a browser is simple, having the ability to do so with a command-line program allows the user an automated approach, reducing distractions.
Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.
Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
Conclusion
There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.
The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.
deepweb
I wonder if Tracking Packages with FedEx is using the new google feature. That would be too simple :)
Does anyone know the name of a small utility to query search engines on the command line? I think it was a 2-letter program, but I couldn't find it anymore :(
Oh the shame ...
Maybe the spammers will read the ethics section and have a change of heart!
Someone forgot an </i> tag...
From the review it looks like an excellent book to read and maybe have around. I will check it out on Safari, since it looks like they made it available to subscribers.
However, looking at these hacks:
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
Do they offer any hacks on working with XML, perhaps XML::RSS or other parsing engines from CPAN? Or is most of the XML handled through regexp?
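For what it's worth, here is a rough sketch of the parser-instead-of-regexp approach the question is getting at. It is in Python rather than the book's Perl, it uses the standard library's ElementTree rather than XML::RSS, and the feed text and the `rss_items` name are made up for illustration. It says nothing about how the book's hacks actually do it.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real spider would download this instead.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return (title, link) pairs for every <item> element in the feed."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in rss_items(SAMPLE_RSS):
        print(title, link)
```

A real parser copes with attribute order, whitespace, and entity escaping, which is where hand-rolled regexps tend to fall over.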
somebody forget a by any chance?
The latest Slashdot meme.
When are people going to realize that hackers just care about computers and the crackers are the bad guys? Oh wait...
This is one of my favorite O'Reilly books. It is amazing what you can do with a few lines of Perl code and LWP.
Strange women lying in ponds distributing swords is no basis for a system of government.
There's a lot more information on the Web than just e-mail addresses. Besides, why be reliant on search engines when you can do it yourself?
"....a porn gathering spider to....
That thing's going to build one nasty sticky web!
Don't blame Durga. I voted for Centauri.
Save yourself $30.
it IS time
Think of datamining, prime example:
You want to track your rank on www.alexa.com and the ranking of some of your key competitors. You build a spider that goes out each night, scrapes the info you want, and stores it locally. Now you have a history of your ranking and your competitors' rankings over time.
This way you can see that when your traffic is down so is your competitor or maybe when yours is down theirs is up...
This also happens to be one of the examples in the book.
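The "store it locally" half of that idea might look something like the sketch below. It is Python with SQLite rather than anything from the book, the table layout and function names are invented, and the scraping step is left out entirely: you would pass in whatever rank your nightly spider found.

```python
import sqlite3
from datetime import date


def open_history(path=":memory:"):
    """Open (or create) the local history database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ranks (
               day TEXT NOT NULL,
               site TEXT NOT NULL,
               ranking INTEGER NOT NULL,
               PRIMARY KEY (day, site))"""
    )
    return conn


def record_rank(conn, site, ranking, day=None):
    """Store one night's ranking; rerunning on the same day overwrites it."""
    day = day or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO ranks (day, site, ranking) VALUES (?, ?, ?)",
        (day, site, ranking),
    )
    conn.commit()


def history(conn, site):
    """Return (day, ranking) rows for one site, oldest first."""
    return conn.execute(
        "SELECT day, ranking FROM ranks WHERE site = ? ORDER BY day", (site,)
    ).fetchall()
```

Run it from cron once a night with a file path instead of `:memory:` and the comparison against your competitors becomes a simple query.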
This is a spider hole:
The hole
The term "spider hole" has been part of military parlance since WWII, but gained common usage outside the military during Vietnam. It may refer to the trapdoor spider, which doesn't use a web, but rather pops out of a hole in the ground, surprising its prey.
"the starry sky above and the moral law within"-Kant
My server's going to die under the load, but I did this using Perl+Curl.
This page is used to source the data.
Is LWP the correct/new way to do this kind of stuff? I started with curl and hacked regex's to get the data.
I suspect that more than a few people are going to hit their ISP's bandwidth limits if they start playing with spiders. A spider running on a simple 768 kbps DSL line can probably schlep down more than 4 GB per day or 129 GB/month (assuming the CPU can keep up analyzing with the flow).
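A quick check of that arithmetic, as a small Python sketch. The `utilization` parameter is my own assumption, not something the post states. A fully saturated 768 kbps line works out to about 8.3 GB/day, and the quoted 4 GB/day and 129 GB/month figures match roughly half utilization over a 31-day month.

```python
def bytes_per_day(kbps, utilization=1.0):
    """Bytes transferred in 24 hours on a line of `kbps` kilobits per second."""
    bits_per_second = kbps * 1000 * utilization
    return bits_per_second / 8 * 86_400  # 8 bits per byte, 86,400 seconds per day


if __name__ == "__main__":
    full = bytes_per_day(768)        # line fully saturated
    half = bytes_per_day(768, 0.5)   # assumed 50% utilization
    print(f"saturated: {full / 1e9:.2f} GB/day")
    print(f"half used: {half / 1e9:.2f} GB/day, {half * 31 / 1e9:.0f} GB per 31-day month")
```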
Two wrongs don't make a right, but three lefts do.
can be found from www.searchlores.org
"Saddam Found in Spider Hole" coincidence? me thinks not
A few years ago, the big idea was that by some as-yet undetermined point in the future (say, 2005) all human beings would be freed from having to collect their own data by way of intelligent, semi-autonomous Agents that could be given some loose English-query type tasks and go on their merry way, fetching and organizing and categorizing data by relevance. It's not too far different from the proposed use of scripting talked about above.
The problem comes more in the last assertion of the story: that pulling in all of this data will free up more time for people to spend on the work of analysis. I want to say this isn't accurate, but it probably boils down to what you call "analysis" work.
The problem with spiders, agents, and their like -- yes, even those that are going out and fetching porn -- is that they are able to provide content without context, much as a modern search engine does. I can take Google and get super specific with a query (say, `pirates carribean history -movie -"johnny depp"`). That will probably fetch me back some data that has my keywords in it, much as any script or agent could do.
Unfortunately, while the engine could rank based on keyword visibility and recurrence, as well as applying some algorithms to try and guess whether the data might be good or not (encyclopedias look this way, weblogs about Johnny Depp look that way), the engine itself still has no way to physically read the information and decide if it's at all useful. A high-school website's page with a tidbit of information and some cute animated .gifs could theoretically draw more of a response from the engine than an official historian's personal recollections of his research while he was working on his master's thesis about the Jolly Roger. Any script (or engine) is only what you make of it.
The most tedious part of data analysis these days is not providing content (as spiders, scripts, and search engines all do) ... it's in providing a frame of context for the choosing, and, ultimately, rejection of sources.
What comes after that sorting process - the assimilation of good data and the drawing of conclusions there-from - that's what I call data analysis. A shame that scripts, spiders, agents, and robots haven't found a way to do that for us. :)
"Spider Holes" are not very good places to hide from the American military!
Don't know if anyone's pointed it out, but there are some sample links up on the web site. Some really great stuff, just from what I saw. Made me want to buy the book. (Guess that's the point.)
I have 3 library cards, and get a lot of DVDs, CDs, and books from them. (Lotsa free time).
I got tired of having to go to all 3 websites to see what to take back each day, so I wrote a small bash/curl script so I could do it at the command line.
There are *lots* of things like this that could be done if the web were more semantic.
It's a commercial app, but it's saved us skads of time: screen-scraper. It's also a lot less of a "hack".
The easier and more widespread the techniques for spidering become, the more websites will get hammered with the unintended equivalent of DOS attacks, the way spam is the equivalent of a DOS attack on your email account.
I don't have any solutions in mind. I don't want anti-spidering legislation, for example, because *I* want to be able to spider. I just don't want *you* to do it. ;-)
Really, I'm just observing that as the Web evolves we could see another spam-like problem emerge, at least for the more interesting sites.
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
"why would anyone use these techniques other than to harvest email like a spammer"
1. Archiving data on the web
2. Getting your files back when you forget your FTP password
3. Researching the link structure of the Internet and how it changes over time
4. Playing a joke on a friend by scraping his site and reposting the content, filtered in your favorite dialect
5. Reading your favorite site in an RSS reader, even if they don't provide an RSS feed
6. Counting how often certain words are used on the net
7. Checking to see if you have any broken links on your site
8. Testing to make sure every link is reachable on your site, and finding out how deep the deepest link is
9. Taking data from a public website and compiling useful statistics, such as GPA calculations, average completion times for cross country races, or the total number of points scored last night in the NHL.
10. Showing people that the Internet can be more than just a web browser
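Items 7 and 8 are easy to sketch. The Python below is only an illustration under big assumptions: the "site" is a hypothetical dict of path to HTML held in memory (a real checker would fetch pages over HTTP and resolve relative URLs), and `SITE`, `extract_links`, and `crawl` are names I made up. A breadth-first walk gives each page its click depth from the start page and notes any link that points nowhere.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(pages, start):
    """Breadth-first walk over `pages` (path -> HTML).

    Returns (depths, broken): depths maps each reachable page to its click
    distance from `start`; broken is a set of (page, missing_link) pairs.
    """
    depths = {start: 0}
    broken = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in extract_links(pages[page]):
            if link not in pages:
                broken.add((page, link))
            elif link not in depths:
                depths[link] = depths[page] + 1
                queue.append(link)
    return depths, broken


# Hypothetical four-page site with one dead link.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/c">C</a>',
    "/b": '<a href="/missing">gone</a>',
    "/c": '<a href="/">home</a>',
}

if __name__ == "__main__":
    depths, broken = crawl(SITE, "/")
    print("deepest link depth:", max(depths.values()))
    print("broken links:", sorted(broken))
```

Pages that never show up in `depths` are the unreachable ones, which covers the other half of item 8.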
OddManIn: A Game of guns and game theory.
sounds like a way to also keep spiders out...
There was an unknown error in the submission.
Mod down! goatse link! my eyes! (j/k)
Anyone ever spider alllmusic.com? Any interest in one?
Why are there only 19 people folding@home for slashdot?
- #21: WWW::Mechanize 101
- #22: Scraping with WWW::Mechanize
- #36: Downloading Images from Webshots
- #44: Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups (which uses Mech)
- #64: Super Author Searching
- #73: Scraping TV Listings
Here are some other online resources to look at:
A random bunch of examples submitted by users, included with the Mechanize distribution.
Chris Ball's article about using WWW::Mechanize for scraping TV listings. (repurposed into hack #73 above)
Randal Schwartz's article on scraping Yahoo News for images.
WWW::Mechanize on the Perl Advent Calendar 2002, by Mark Fowler.
In the USA, trading information that has cost somebody else time and money to build up can be caught under a doctrine of "misappropriation of trade values" or "unfair competition", dating from the INS case in 1918.
Meanwhile here in Europe, a collection of data has full authorial copyright (life + 70) under the EU Database Directive (1996), if the collecting involved personal intellectual creativity; or special database rights (last update + 15 years) if it did not.
I've done a little screen-scraping for a "one name" family history project. Presumably that is in the clear, as it was for personal non-commercial research, or (at most) quite limited private circulation.
But where are the limits?
How much screen-scraping can one do (or advertise), before legally it becomes a "significant taking"?
From Google Terms of Service:
No Automated Querying You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:
using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries; "meta-searching" Google; and performing "offline" searches on Google.
Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
Does anyone have this code? Does anyone remember how fun www.riggish.com was before they were sued?
Obey the robots.txt.
If it doesn't allow you to gather information, then don't.
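In practice that check is only a few lines. Here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt text, the `MySpider` agent name, and the `allowed` helper are all hypothetical, and a real spider would download the site's actual /robots.txt before asking.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real spider would fetch the site's own copy.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def allowed(robots_text, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html", "http://example.com/private/data"):
        verdict = "ok" if allowed(ROBOTS_TXT, "MySpider", url) else "keep out"
        print(url, verdict)
```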
The Kruger Dunning explains most post on
Been there.. done that...
http://www.booble.com/
Why'd you have to post that link? I have a lot of important work to do and now it isn't going to get done...
Well I guess this will teach me to try to help make /. a better place. My meager Karma is now 'bad'. Oh, and for you mods who can't figure out what's tagline and what isn't, "you suck" is my standard tagline.
Otherwise, I can't see what could have been taken as "Flamebait" in my post.
+5 Insightful, really!
11. Keep an eye out for news on a rare topic or person on specialty forums.
12. Keep tabs on your competitors' sites and see when they change their prices or add new merchandise.
13. Watching obscure intelligence forums for that secret message from Sydney on Alias.
14. Doing a study of what web technologies are used on sites.
15. Doing studies that track how badly Apache is beating MS servers this month.
16. Keep an eye on the item you are interested in in the online store to see when it goes on sale.
17. Track the entry and exit of businesses in a market segment.
Now, if it had the image search option that google has, it'd be great.
I really think we need an open source search engine/repository. I've always wanted to do this. It would be great to engineer an open-architecture search engine. Something designed with parsers and bulk downloads in mind. The biggest reason is for use in AI type applications. I also think some healthy competition for Google would be nice. As crazy as this sounds, maybe a P2P type of solution might alleviate some of the bandwidth and processing issues. It would be like SETI.
The biggest problem is that I (we) would have to find a way to keep the data from being tainted. Obviously, some spammerific moron would try to taint the data to rate XXXmysite at the top of every search. Is there such a project in progress?
Incidentally, I use the wayback machine as well.
What do you mean my sig is repetitive? What do you mean my sig is repetitive? What do you mean....
Another use - presenting data from another site in a more useful form:
http://redheadedleague.com/df.html
The Double Feature Finder goes to moviefone and finds movies in a row you can see.
Enjoy
Uncle Highbrow
It will be written in my biography that will end up on /. and all posts will be troll and flamebait!
I found some additional reviews for this book at this site.