Spidering Hacks


Posted by timothy on Tuesday December 16, 2003 @07:00AM from the use-for-good-not-evil dept.

DrCarbonite writes "Spidering Hacks is a well-written guide to scripting and automating your data-seeking forays onto the Internet. It offers an attractive combination of solving the problems you have and exposing you to solutions that you weren't aware you needed." Read on for Martin's review of the book.

Title: Spidering Hacks
Author: Kevin Hemenway and Tara Calishain
Pages: 402
Publisher: O'Reilly
Rating: 8
Reviewer: Jeff Martin
ISBN: 0596005776
Summary: A wide-ranging collection of hacks detailing how to be more productive in Internet research and data retrieval

Introduction

Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, allowing the computer to obtain data, leaving the user free to spend more time on analysis.

SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still learn a lot of useful tips that will help them with their own projects.

Overview

Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation for the power they are utilizing.

Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
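
To make the robots.txt point concrete, here is a minimal sketch of checking a site's rules before fetching a page. It is written in Python with the standard library's urllib.robotparser rather than the Perl modules the book uses, and the robots.txt content, the agent name ExampleSpider, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, agent, url):
    """Return True if the named agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "ExampleSpider", "http://example.com/index.html"))
print(allowed(parser, "ExampleSpider", "http://example.com/private/data.html"))
print(parser.crawl_delay("ExampleSpider"))
```

A spider that honors these answers, and waits at least the advertised crawl delay between requests, is behaving in the spirit of the etiquette the book describes.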

Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While individual sites described here may not appeal to everyone, the idea is to provide a specific example demonstrating each of certain general concepts, which can be applied to sites of the reader's choosing.

Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.

Hack #48, Super Word Lookup, is a good example of why SH is so intriguing. While utilizing a dictionary or thesaurus via a browser is simple, having the ability to do so with a command-line program allows the user an automated approach, reducing distractions.

Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.

Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
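
As a rough illustration of what creating a public RSS feed involves, the sketch below builds a tiny RSS 2.0 document. It is Python rather than the book's Perl and is not taken from the book; the function name build_rss, the feed title, and the example.com links are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Return a minimal RSS 2.0 document as a string.

    items is a list of (item_title, item_link) pairs.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical feed content, for illustration only.
feed = build_rss(
    "Example updates",
    "http://example.com/",
    "Changes collected by a nightly script",
    [
        ("First item", "http://example.com/1"),
        ("Second item", "http://example.com/2"),
    ],
)
print(feed)
```

Building the feed through an XML library instead of string concatenation means characters such as & and < in titles are escaped automatically.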

Conclusion

There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.

The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.

You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.

121 comments

Why spider when you can deepweb? by pheared · 2003-12-16 07:03 · Score: 2, Informative

deepweb
1. Re:Why spider when you can deepweb? by Anonymous Coward · 2003-12-16 07:40 · Score: 0
  
  oh lord, goto goto everywhere.
2. Re:Why spider when you can deepweb? by mgrennan · 2003-12-16 12:38 · Score: 1
  
  Thanks Dude - I've been wanting to spider my company for some time. With this and a little work from the book, I'm going to create a really simple page of pages.
  I bet I find some real dirt.
  
  --
  There are 10 type of people in the world, those who understand binary and those who don't.
Table of content is packed with great stuff! by millette · 2003-12-16 07:03 · Score: 4, Informative

Have a look at the Table of Contents - it has 100 items, some of which you wouldn't obviously qualify as spidering, but more like data mining, but whatever, it's all good stuff! There's also some PHP, besides the Java and Python code. Perl is the predominant language.
I wonder if Tracking Packages with FedEx is using the new google feature. That would be too simple :)
Does anyone know the name of a small utility to query search engines on the command line? I think it was a 2-letter program, but I couldn't find it anymore :(
1. Re:Table of content is packed with great stuff! by Theatetus · 2003-12-16 07:48 · Score: 1
  
  The funny thing is, the first page of results for the sample patent search they give is a bunch of pages about Google's ability to search via patent numbers.
  
  Time to rethink a ranking algorithm there...
  
  --
  All's true that is mistrusted
2. Re:Table of content is packed with great stuff! by millette · 2003-12-16 07:58 · Score: 1
  
  You mean hitting search from the google page with the fedex/patent example? The first result, not part of an actual search, is the info you'd be looking for. It's identified with this image.
3. Re:Table of content is packed with great stuff! by rog · 2003-12-16 07:59 · Score: 5, Informative
  
  You're probably looking for surfraw
  
  --
  Saving random seed...
4. Re:Table of content is packed with great stuff! by millette · 2003-12-16 08:01 · Score: 2, Interesting
  
  I was way off with my 2-letter name :)
  Looking at the page, and I'm pretty sure it's the little program I had lost. Thanks for finding it again!
  Pssst! Mod parent up!
Ok I admit it by JSkills · 2003-12-16 07:03 · Score: 5, Funny

I have written a porn gathering spider to seek out large movie files. It beats using a browser to find stuff.
Oh the shame ...
1. Re:Ok I admit it by trentblase · 2003-12-16 07:07 · Score: 5, Funny
  
  Maybe it's time for pr0n.google.com
2. Re:Ok I admit it by interiot · 2003-12-16 07:30 · Score: 5, Informative
  
  Is it open source? I wish there was more adult open-sores software. UBH running from cron is what I use currently to automate porn consumption, but I'm sure there are tons of other opportunities....
3. Re:Ok I admit it by BigGerman · 2003-12-16 07:40 · Score: 5, Funny
  
  Booble?
4. Re:Ok I admit it by QuasiCoLtd · 2003-12-16 07:44 · Score: 2, Funny
  
  I wish there was more adult open-sores software.
  
  What kind of pr0n have YOU been looking at....
5. Re:Ok I admit it by OmegaGeek · 2003-12-16 08:02 · Score: 1
  
  > I wish there was more adult open-sores software
  
  If adults have open-sores from harvesting pr0n, then I think they need medical (or possibly psychological) attention, not software. At least buy yourself some lotion, buddy!
  
  --
  Even heroes have the right to dream
6. Re:Ok I admit it by Anonymous Coward · 2003-12-16 08:10 · Score: 0
  
  I use brag for usenet leeching and gnaughty for light relief.
7. Re:Ok I admit it by Ambush_Bug · 2003-12-16 08:33 · Score: 1
  
  That is brilliant....
8. Re:Ok I admit it by glwtta · 2003-12-16 09:02 · Score: 1
  
  is what I use currently to automate porn consumption
  Wow! Automated porn collection is one thing, but actually automating porn consumption - that's something!
  
  --
  sic transit gloria mundi
9. Re:Ok I admit it by Chalex · 2003-12-16 09:08 · Score: 1
  
  I wish there was more adult open-sores software.
  I sure don't!
10. Re:Ok I admit it by Anonymous Coward · 2003-12-16 09:15 · Score: 0
  
  Does it filter out the gay movies? Not that there's anything wrong with that...
11. Re:Ok I admit it by Anonymous Coward · 2003-12-16 10:23 · Score: 0
  
  Quick, patent it before Optima gets a chance!
12. Re:Ok I admit it by Ignis+Flatus · 2003-12-16 14:22 · Score: 2
  
  Try altavista.com.
  Click on the video tab.
  Turn Family Filter off.
  Fap away.
Not Likely by trentblase · 2003-12-16 07:04 · Score: 2, Funny

The authors discuss ethics and guidelines for writing polite and properly behaved spiders
Maybe the spammers will read the ethics section and have a change of heart!
Error in post. by cliffy2000 · 2003-12-16 07:08 · Score: 1, Redundant

Someone forgot an </i> tag...
XML interop? by prostoalex · 2003-12-16 07:09 · Score: 4, Interesting

From the review it looks like an excellent book to read and maybe have around. I will check it out on Safari, since it looks like they made it available to subscribers.

However, looking at these hacks:

68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds

Do they offer any hacks on working with XML, perhaps XML::RSS or other parsing engines from CPAN? Or is most of the XML handled through regexp?
1. Re:XML interop? by Anonymous Coward · 2003-12-16 07:24 · Score: 0
  
  He who knows to use regex well, uses CPAN less
2. Re:XML interop? by millette · 2003-12-16 07:29 · Score: 3, Informative
  
  Here's hoping you aren't serious. I mean, rx are one thing, but to parse xml, and to some degree html, there are way better tools specifically for the job. I usually filter html thru tidy a few times until I can more easily parse it with xml tools - but that's just me.
3. Re:XML interop? by justMichael · 2003-12-16 07:36 · Score: 4, Interesting
  
  Hack 24 Painless RSS with Template::Extract
  
  It's actually a good read. They try to stay away from regex parsing as it tends to be fragile. They do cover it in one of the hacks though.
  
  Most of the hacks have to do with using various methods to walk the doc tree to look for what you want like a certain cell in a table (think header with names) then jumping up one to get that row then grabbing the next row to get your data cells.
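
The "find the header cell, then grab the next row" idea can be sketched as follows. This is Python with the standard library's html.parser, not the Perl modules the book uses, and it collects table rows into lists instead of walking a full document tree. The class name TableRows, the function row_after_header, and the sample table are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def row_after_header(html, header_text):
    """Find the row containing header_text and pair it with the next row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    for i, row in enumerate(parser.rows):
        if header_text in row and i + 1 < len(parser.rows):
            return dict(zip(row, parser.rows[i + 1]))
    return None


# Hypothetical page fragment, for illustration only.
SAMPLE = """
<table>
  <tr><th>Site</th><th>Rank</th></tr>
  <tr><td>example.com</td><td>1234</td></tr>
</table>
"""
print(row_after_header(SAMPLE, "Rank"))
```

Because the lookup keys on the header text and not on a fixed position in the markup, it tends to survive cosmetic changes to the page better than a pattern match on the raw HTML.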
4. Re:XML interop? by BenitoM · 2003-12-16 09:28 · Score: 4, Informative
  
  Try Perl and XML by O'Reilly: http://www.oreilly.com/catalog/perlxml/index.html
  There are basically two styles of XML parser, event-based (SAX) and document-based (DOM). I find DOM-types easier to use.
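
For readers who have not seen the two styles side by side, here is a small sketch that pulls item titles out of the same feed both ways. It uses Python's standard xml.dom.minidom and xml.sax modules rather than the Perl parsers the book covers, and the FEED string and function names are made up.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical feed, for illustration only.
FEED = """<rss version="2.0"><channel>
<item><title>First story</title></item>
<item><title>Second story</title></item>
</channel></rss>"""


def titles_dom(xml_text):
    """DOM style: load the whole document, then query the tree."""
    doc = minidom.parseString(xml_text)
    titles = []
    for item in doc.getElementsByTagName("item"):
        title = item.getElementsByTagName("title")[0]
        titles.append(title.firstChild.data)
    return titles


class TitleHandler(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_item = False
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None
        elif name == "item":
            self._in_item = False


def titles_sax(xml_text):
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


print(titles_dom(FEED))
print(titles_sax(FEED))
```

The DOM version is shorter and easier to follow, which matches the preference stated above; the SAX version keeps only the current state in memory, which matters mainly for very large documents.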
5. Re:XML interop? by Koschei · 2003-12-16 11:44 · Score: 1
  
  Most of the XML parsing, in the hacks that have XML in them, is with XML::Simple or modules that are more specialised (such as SOAP::Lite, XML::RSS, etc.).
  
  One of the nicest features of the book is that it promotes the use of appropriate parsers over random regexing.
  
  --
  -- koschei
o fer crissakes by Bob+Cat+-+NYMPHS · 2003-12-16 07:10 · Score: 0, Offtopic

somebody forget a </i> by any chance?

--
The latest Slashdot meme.
Use of "hacker" by davidstrauss · 2003-12-16 07:10 · Score: 2, Insightful

When are people going to realize that hackers just care about computers and the crackers are the bad guys? Oh wait...
Be sure to get LWP & Perl by toupsie · 2003-12-16 07:10 · Score: 4, Informative

This is one of my favorite O'Reilly books. It is amazing what you can do with a few lines of Perl code and LWP.

--
Strange women lying in ponds distributing swords is no basis for a system of government.
1. Re:Be sure to get LWP & Perl by interiot · 2003-12-16 07:55 · Score: 1
  
  Or, if you have minimal working knowledge of objects and modules, you can just read the lwpcook manpage. Yeah the O'Reilly LWP book goes into a little more, but look those modules up on search.cpan.org too, and buy the spidering book instead because it goes so much further.
Re:Techniques used by spammers? by tds67 · 2003-12-16 07:12 · Score: 3, Insightful

Other than using google, why would anyone use these techniques other than to harvest email like a spammer?
There's a lot more information on the Web than just e-mail addresses. Besides, why be reliant on search engines when you can do it yourself?
Yeecchhh! by AtariAmarok · 2003-12-16 07:17 · Score: 3, Funny

"....a porn gathering spider to....

That thing's going to build one nasty sticky web!

--
Don't blame Durga. I voted for Centauri.
perldoc LWP by Ars-Fartsica · 2003-12-16 07:18 · Score: 2, Informative

Save yourself $30.
MOD UP! by Anonymous Coward · 2003-12-16 07:19 · Score: 2, Funny

it IS time
That's not what this is used for... by justMichael · 2003-12-16 07:20 · Score: 5, Informative

Think of datamining, prime example:

You want to track your rank on www.alexa.com and the ranking of some of your key competitors. You build a spider that goes out each night, scrapes the info you want, and stores it locally. Now you have a history of your ranking and your competitors' rankings over time.

This way you can see that when your traffic is down so is your competitor or maybe when yours is down theirs is up...

This also happens to be one of the examples in the book.
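
The storage half of that idea might look like the sketch below. It is not the book's hack (which, per a reply further down, is written in Java and saves to RSS); it is a Python illustration that assumes the nightly scrape has already produced a rank number, and the table layout, function names, site names, and figures are all hypothetical.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the local rank history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ranks ("
        "day TEXT NOT NULL, site TEXT NOT NULL, rank INTEGER NOT NULL, "
        "PRIMARY KEY (day, site))"
    )
    return conn


def record_rank(conn, day, site, rank):
    """Store one scraped value; rerunning on the same day overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO ranks (day, site, rank) VALUES (?, ?, ?)",
        (day, site, rank),
    )
    conn.commit()


def history(conn, site):
    """Return (day, rank) pairs for one site, oldest first."""
    cur = conn.execute(
        "SELECT day, rank FROM ranks WHERE site = ? ORDER BY day", (site,)
    )
    return cur.fetchall()


# Made-up numbers standing in for values a nightly scrape would supply.
conn = open_history()
record_rank(conn, "2003-12-15", "mysite.example", 5200)
record_rank(conn, "2003-12-16", "mysite.example", 5100)
record_rank(conn, "2003-12-16", "rival.example", 4800)
print(history(conn, "mysite.example"))
```

Passing a file path instead of ":memory:" keeps the history between runs, which is what a cron-driven job would need.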
1. Re:That's not what this is used for... by beacher · 2003-12-16 08:15 · Score: 3, Informative
  
  Here's what it's used for - violating H.R. 3261, the Database and Collections of Information Misappropriation Act. Guess I'm going to buy it very soon ;) -B
2. Re:That's not what this is used for... by justMichael · 2003-12-16 08:41 · Score: 1
  
  Actually the book does stress heavily the reasons behind registering your bots and getting an OK from the site owner/responsible party and generally behaving like a decent human.
  
  I agree with you to some extent, but if I am grabbing a dozen freely available pages from a site and storing the information for my own use (not selling it or publishing it), no foul. Otherwise they could go after people who print/write down/copy-paste the info under the same act.
  
  PS: Thanks for showing me that, now my head hurts. ;-)
  
  PPS: My example doesn't fit into their definition of Liability or Injury per SEC. 3 as it's not made public. (I am not a lawyer, YMMV)
3. Re:That's not what this is used for... by outsider007 · 2003-12-16 09:10 · Score: 1
  
  Think of datamining, prime example
  
  you're misusing this buzzword. what you just described is called 'collecting data'
  
  --
  If you mod me down the terrorists will have won
4. Re:That's not what this is used for... by LoneGun04 · 2003-12-16 11:07 · Score: 1
  
  Yep, that was my hack, #58 (Competitive Data with Java). I decided to collect all of the data and save it to an RSS feed. That way other people within your company can take a look at how your domain, and even its subdomains, competes with the competition over time. Since Alexa's Traffic Detail page displays the traffic from its toolbar users, your comparisons are only as good as the limited user base.
5. Re:That's not what this is used for... by sethx9 · 2003-12-17 01:29 · Score: 1
  
  Analyzing publicly available information is not addressed by H.R. 3261.
  
  Section 3(a) describes "a quantitatively substantial part of the information in a database". The information gathered from a single web page would almost certainly not be the "substantial part" of data held by Alexa or any other indexing service.
  
  Also, I doubt that it could be seriously, let alone successfully, argued that making a record of publicly available information falls under the definition in Section 3(a)(2) that talks about inflicting "injury on the database". While there may be seen some ambiguity in that, the argument is made moot considering Section 3(b) goes on to narrowly define "injury" as "serving as a functional equivalent in the same market as the database in a manner that causes the displacement, or the disruption of the sources, of sales, licenses, advertising, or other revenue"
  
  --
  Sorry, I keep forgetting to add the tongue-in-cheek emoticon to the bottom of my posts...
Re:New book: Hacking your way into a Spider Hole by Anonymous Coward · 2003-12-16 07:20 · Score: 0

This is a spider hole:
The hole
Re:New book: Hacking your way into a Spider Hole by tsmccaff · 2003-12-16 07:21 · Score: 2, Offtopic

The term "spider hole" has been part of military parlance since WWII, but gained common usage outside the military during Vietnam. It may refer to the trapdoor spider, which doesn't use a web, but rather pops out of a hole in the ground, surprising its prey.

--
"the starry sky above and the moral law within"-Kant
Tracking yahoo popularity. by Flat+Feet+Pete · 2003-12-16 07:25 · Score: 3, Interesting

My server's going to die under the load, but I did this using Perl+Curl.

This page is used to source the data.

Is LWP the correct/new way to do this kind of stuff? I started with curl and hacked regex's to get the data.
1. Re:Tracking yahoo popularity. by interiot · 2003-12-16 08:02 · Score: 2, Informative
  
  LWP runs as part of perl, so it gives you a little easier control over the variety of options (eg. user agent and such). And it's easier to get working cross-platform (it's a bitch that you have to do extra work to get around the shell parsing of arguments to subprocesses on Win32). Also, you can do fancy asynchronous stuff with LWP, so you can have interactive programs, or stuff going in parallel, etc...
  In general, most people use LWP, and if you write very many programs that use the web, you're going to want to go to LWP eventually, so you might as well start learning now (and there are easier interfaces to facilitate that too).
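
For anyone following along in another language, the "control over the user agent" point looks like this in Python's standard urllib.request. This is a sketch for comparison, not LWP itself; the agent string, the info URL, and the function names are hypothetical.

```python
from urllib.request import Request, urlopen

# Hypothetical agent string; a polite spider identifies itself and gives
# site owners somewhere to find out what it is doing.
USER_AGENT = "ExampleSpider/0.1 (+http://example.com/spider-info)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request object that carries our own User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch a page body as bytes; this performs real network I/O."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Only build the request here, so the example runs without a network.
req = build_request("http://example.com/")
print(req.get_header("User-agent"))
```

Setting options on a request object in the same process avoids the shell-quoting problems mentioned above for command-line tools.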
2. Re:Tracking yahoo popularity. by ls+-lR · 2003-12-16 15:14 · Score: 2, Informative
  
  LWP really just replaces the fetching part, it doesn't do anything to extract the data. It will definitely be easier than curl on the command line, no parameter passing to worry about.
  
  To get the data from the page you can either use a bunch of regexps (as you've done, apparently) or a parser like HTML::TokeParser::Simple. The advantage of a parser is that it makes it more robust and immune to site changes. You also get higher quality data, for example if something subtle changes in the site's html source you sometimes get something like "this is the data <A href="whatever"..." In other words, you don't have to worry about quoting or tag boundaries or anything like that. Naturally, if your script allows user interaction this will tend to be more secure as there is less chance of a XSS and/or SQL injection vuln.
  
  But, using a parser takes a little bit of investment up front in terms of time. With the '::Simple' variant it's really pretty easy, but it still requires that you be a little familiar with the tree structure of the page so that you can pull out the stuff you want.
  
  All in all, if it works don't switch, but in the future you'll have a more robust and maintainable setup if you use LWP and a parser instead of commandline curl and regexps.
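
A small sketch of why the parser route is more robust, in Python rather than Perl: html.parser stands in for HTML::TokeParser::Simple here, and the sample markup and function names are invented. The same three links are written three different ways, and only the parser finds all of them.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, however they are written."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def links_with_regex(html):
    """A typical quick regex: it only matches one exact way of writing a link."""
    return re.findall(r'<a href="([^"]*)"', html)


# Hypothetical markup with three quoting and casing variants.
SAMPLE = """<a href="/one">one</a>
<A class="x" HREF='/two'>two</A>
<a href=three.html>three</a>"""

print(links_with_regex(SAMPLE))
print(links_with_parser(SAMPLE))
```

The regex could be extended to cover each variant, but every extension is another place for a site change to break it, which is the maintenance cost described above.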
Spidering and exceeding ISP bandwidth limits by G4from128k · 2003-12-16 07:26 · Score: 5, Insightful

I suspect that more than a few people are going to hit their ISP's bandwidth limits if they start playing with spiders. A spider running on a simple 768 kbps DSL line can probably schlep down more than 4 GB per day or 129 GB/month (assuming the CPU can keep up with analyzing the flow).
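
The arithmetic behind figures like these can be checked with a few lines of Python. The sketch assumes 1 kbps means 1,000 bits per second and 1 GB means 10^9 bytes, and the 50% utilization figure is an assumption introduced here, not something stated in the comment.

```python
LINE_RATE_BITS_PER_SECOND = 768_000  # 768 kbps, taking 1 kbps = 1,000 bit/s
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(bits_per_second, utilization=1.0):
    """Bytes moved in one day at the given fraction of the line rate."""
    return bits_per_second / 8 * SECONDS_PER_DAY * utilization


full = bytes_per_day(LINE_RATE_BITS_PER_SECOND)
half = bytes_per_day(LINE_RATE_BITS_PER_SECOND, utilization=0.5)

print(f"{full / 1e9:.2f} GB/day at full line rate")
print(f"{half / 1e9:.2f} GB/day at 50% utilization")
print(f"{half * 31 / 1e9:.1f} GB over a 31-day month at 50% utilization")
```

Under these assumptions the theoretical ceiling is about 8.3 GB per day, and running the line at half capacity for a 31-day month gives roughly 129 GB, in line with the numbers quoted above.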

--
Two wrongs don't make a right, but three lefts do.
1. Re:Spidering and exceeding ISP bandwidth limits by interiot · 2003-12-16 07:52 · Score: 4, Insightful
  
  If it's a full spider where you're considering competing with google or reimplementing google with extra features, then yes, you'd obviously need an industrial-strength account.
  More likely though, you leave the big jobs to the big boys, and you want to do very specific things, maybe even building on top of google... eg. find porn movies, copying edmunds' database so you can sort cars by their power/weight ratio (or list all RWD cars, or find the lightest RWD car, or...), or make your own third-party feed of slashdot from their homepage since they watch you like a hawk when you download their .rss too often, but not when you download their homepage too often.
  Little custom jobs like that can take a minimal amount of code (especially if you're a regex wizard), take minimal bandwidth, and take enough skill that target sites aren't likely to track you down because there's only three of you doing it.
2. Re:Spidering and exceeding ISP bandwidth limits by Anonymous Coward · 2003-12-16 09:15 · Score: 0
  
  no -- pretty quickly you'll run out of stuff to
  fetch if you are getting 4 gb per day with a
  robot.
  
  most robots don't fetch that much data... they
  just do the query at the right place without you
  having to navigate there yourself.
  
  4gig download per day is really crazy unless
  you are moving video or mirroring isos or
  something like that. Most people don't use
  their home dsl for that and those that do know
  that they are moving a lot of data.
3. Re:Spidering and exceeding ISP bandwidth limits by G4from128k · 2003-12-16 12:24 · Score: 1
  
  If it's a full spider where you're considering competing with google or reimplementing google with extra features, then yes, you'd obviously need an industrial-strength account.
  
  More likely though, you leave the big jobs to the big boys, and you want to do very specific things, maybe even building on top of google.
  
  Very good point. You are right that many people will use spiders in a naturally limited way -- a one-shot or infrequently repeated project to gather information on a very limited domain or limited set of sites.
  
  What I suspect, however, is that widespread use of spiders will lead more people to use them in more ways. For example, I often Google for obscure information that Google's search tools don't do a good job of finding (too many false positives). I'd love a feature that lets me spider the sites associated with the first 200 pages of a Google search and filter or sort the results according to my own filters/pagerank algorithms. This could be a batch process or a progressive build process (it continues to download, filter, and rank pages while showing me the interim results). This is a more bandwidth intensive process. Perhaps I am, as you say, reimplementing Google (or augmenting it), but if the spider is easy to use, then I and others will use it.
  
  My point is that spidering is a tool that can expand to fill all available bandwidth.
  
  --
  Two wrongs don't make a right, but three lefts do.
Alternative ways of searching and spidering by Anonymous Coward · 2003-12-16 07:34 · Score: 3, Informative

can be found from www.searchlores.org
1. Re:Alternative ways of searching and spidering by interiot · 2003-12-16 08:24 · Score: 1
  
  Mod up! Fravia is most famous for his previous work, tons of hard-core documents about reverse-engineering. Anyway, anything he writes is likely Really Deep Stuff, and deserves to be pored over.
This Book works by 2MinutesForRoughing · 2003-12-16 07:34 · Score: 0, Redundant

"Saddam Found in Spider Hole" coincidence? me thinks not
Agents, anyone? by Wingchild · 2003-12-16 07:36 · Score: 5, Interesting

A few years ago, the big idea was that by some as-yet undetermined point in the future (say, 2005) all human beings would be freed from having to collect their own data by way of intelligent, semi-autonomous Agents that could be given some loose english-query type tasks and go on their merry way, fetching and organizing and categorizing data by relevance. It's not too far different from the proposed use of scripting talked about above.

The problem comes more in the last assertion of the story: that pulling in all of this data will free up more time for people to spend on the work of analysis. I want to say this isn't accurate, but it probably boils down to what you call "analysis" work.

The problem with spiders, agents, and their like -- yes, even those that are going out and fetching porn -- is that they are able to provide content without context, much as a modern search engine does. I can take Google and get super specific with a query (say, `pirates carribean history -movie -"johnny depp"`). That will probably fetch me back some data that has my keywords in it, much as any script or agent could do.

Unfortunately, while the engine could rank based on keyword visibility and recurrence, as well as applying some algorithms to try and guess whether the data might be good or not (encyclopedias look this way, weblogs about Johnny Depp look that way), the engine itself still has no way to physically read the information and decide if it's at all useful. A high-school website's page with a tidbit of information and some cute animated .gifs could theoretically draw more of a response from the engine than an official historian's personal recollections of his research while he was working on his master's thesis about the Jolly Roger. Any script (or engine) is only what you make of it.

The most tedious part of data analysis these days is not providing content (as spiders, scripts, and search engines all do) ... it's in providing a frame of context for the choosing, and, ultimately, rejection of sources.

What comes after that sorting process - the assimilation of good data and the drawing of conclusions there-from - that's what I call data analysis. A shame that scripts, spiders, agents, and robots haven't found a way to do that for us. :)
1. Re:Agents, anyone? by LetterJ · 2003-12-16 08:07 · Score: 5, Insightful
  
  I think that some of the things being done to filter *out* spam might also apply to filtering *in* good information from things like agents.
  
  I know that my Popfile spam filter is getting pretty good (with 35,000 messages processed) at not only spam vs. ham type comparisons, but also work vs. personal and other categories.
  
  Bayesian filters are just one type of learning algorithm, but they work fairly well for textual comparisons. I've personally been toying with seeing how well a toolbar/proxy combination would work for predicting the relative "value" of a site to me. Run all browsing through a Bayesian web proxy that analyses all sites visited. Then, with a browser toolbar, sites can be moderated into a series of categories.
  
  That same database could be used by spiders to look for new content, and, if it fits into a "positive" category according to the analysis, add it to a personal content page of some sort that could be used as a browser's home page.
  
  With sufficient data sources (and with a book like this, it shows that there ARE plenty of sources), it could really bring the content you want to read together.
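
To show roughly how a Bayesian filter could sort pages into categories, here is a toy naive Bayes classifier in Python. It is a sketch of the general technique with add-one smoothing, not Popfile's implementation, and the category names and training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category in self.doc_counts:
            # Work in log space so many small probabilities do not underflow.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            denominator = total_words + len(self.vocabulary)
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training data standing in for pages a user has moderated.
model = NaiveBayes()
model.train("interesting", "perl spider tutorial with rss parsing")
model.train("boring", "celebrity gossip and shopping deals")
print(model.classify("new perl rss spider"))
```

With only two training sentences this is a toy; the 35,000 messages mentioned above are the kind of volume that makes the word statistics meaningful.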
  
  --
  
  The Glass is Too Big: My Take on Things
2. Re:Agents, anyone? by netringer · 2003-12-16 08:39 · Score: 1
  
  It was Apple's former CEO John Sculley who pushed the idea of an intelligent agent that sat on your desktop, knew everything you needed at your fingertips, and got you whatever else you wanted.
  Microsoft jumped on this idea and invented the Office Assistants like Clippy.
  
  --
  Ever dream you could fly? Get up from the Flight Sim. I Fly
3. Re:Agents, anyone? by costas · 2003-12-16 09:13 · Score: 1
  
  Well, my newsbot does a lot of what you describe, at least for news articles. Give it a shot.
4. Re:Agents, anyone? by LetterJ · 2003-12-16 10:32 · Score: 1
  
  The thing is . . . a bad implementation does not invalidate the concept. Many, many, many good ideas get a crappy first, second and many times third implementation.
  
  Given the initial failure of the Newton, would you want $1 for every PDA sold in the last 3 years? If the Newton was your indicator, the answer would be no.
  
  --
  
  The Glass is Too Big: My Take on Things
5. Re:Agents, anyone? by Anonymous Coward · 2003-12-16 11:11 · Score: 0
  
  "Microsoft jumped on this idea and invented the Office Assistants like Clippy."
  
  What about Bob?
In other news... by DVDAshot · 2003-12-16 07:45 · Score: 1, Funny

"Spider Holes" are not very good places to hide from the American military!
1. Re:In other news... by SparafucileMan · 2003-12-16 10:16 · Score: 5, Funny
  
  There was a sign in arabic outside his shack that read "robots.txt...do not archive rug, rug/styrofoam, rug/styrofoam/hole, rug/styrofoam/hole/saddam", etc......let this be a lesson to all: security through obscurity does not work!!
2. Re:In other news... by Anonymous Coward · 2003-12-16 22:04 · Score: 0
  
  Nah, the problem is that US sent people instead of killer robots, so they didn't have to follow robots.txt. It's a pity, but I hope this setback doesn't impede the development of the US killer robot program because killer robots are cool.
Sample hacks by Jadsky · 2003-12-16 07:50 · Score: 3, Informative

Don't know if anyone's pointed it out, but there are some sample links up on the web site. Some really great stuff, just from what I saw. Made me want to buy the book. (Guess that's the point.)
Perl script to query the library by Saint+Stephen · 2003-12-16 07:54 · Score: 3, Interesting

I have 3 library cards, and get a lot of DVDs, CDs, and books from them. (Lotsa free time).

I got tired of having to go to all 3 websites to see what to take back each day, so I wrote a small bash/curl script so I could do it at the command line.

There are *lots* of things like this that could be done if the web were more semantic.
An alternative by toddcw · 2003-12-16 07:58 · Score: 2, Interesting

It's a commercial app, but it's saved us scads of time: screen-scraper. It's also a lot less of a "hack".
cousin of spam? by GCP · 2003-12-16 08:00 · Score: 3, Interesting

The easier and more widespread the techniques for spidering become, the more websites will get hammered with the unintended equivalent of DOS attacks, the way spam is the equivalent of a DOS attack on your email account.

I don't have any solutions in mind. I don't want anti-spidering legislation, for example, because *I* want to be able to spider. I just don't want *you* to do it. ;-)

Really, I'm just observing that as the Web evolves we could see another spam-like problem emerge, at least for the more interesting sites.

--
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
1. Re:cousin of spam? by interiot · 2003-12-16 08:15 · Score: 1
  
  Slashdot's policy on RSS is sort of a response to this problem (that it's far too easy for one person to eat up far too much bandwidth, even if they no longer pay attention to the site very much). But yeah, it's not terribly difficult for a website operator to look in their logs and see one dude snarfing way too much stuff way too regularly.
  And like spam, the perp's natural first step in the battle is to start using anonymous proxies to help avoid detection / retribution.
2. Re:cousin of spam? by 1iar_parad0x · 2003-12-16 08:47 · Score: 3, Insightful
  
  Well, if you space the time between HTTP requests, it wouldn't be spam.
  
  This might be obvious or just a non-issue, but ignore IMG tags in your bots (it saves on bandwidth costs). You're probably not affecting their bandwidth much by downloading text.
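
A minimal sketch of those two ideas (spacing out requests and skipping images), in Python rather than the book's Perl. The names `LinkOnlyParser` and `Throttle` are made up for illustration, and the actual fetching is left out; this only shows the pacing and the link filtering under those assumptions.

```python
import time
from html.parser import HTMLParser


class LinkOnlyParser(HTMLParser):
    """Collect <a href> targets; count <img> tags but never queue them."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images_skipped = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without an href are ignored
                self.links.append(href)
        elif tag == "img":
            self.images_skipped += 1


class Throttle:
    """Enforce a minimum gap (in seconds) between successive requests."""

    def __init__(self, min_gap, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self._clock = clock    # injectable so the pacing can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before every fetch keeps the bot from hitting a server faster than once per `min_gap` seconds, and feeding each page to `LinkOnlyParser` yields only the links worth following.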
  
  Incidentally, most spammers are glorified script kiddies, not data miners or AI people. The kind of "hard-earned" money in data mining isn't the kind of money spammers are looking for.
  
  The real problem with data mining is increased server load. Perhaps running your scripts late at night would help.
  
  Of course, if server load was spam, then Slashdot would have a lot of explaining to do. :)
  
  --
  What do you mean my sig is repetitive? What do you mean my sig is repetitive? What do you mean....
3. Re:cousin of spam? by ReadParse · 2003-12-17 01:08 · Score: 1
  
  Spidering isn't for everybody. Well, neither is SPAM, but spidering is for far fewer people than SPAM because of the lack of financial incentive. As interesting as certain types of spidering are to certain geeks in certain situations, most people couldn't care less.
  
  I once thought about how neat it would be to start a spider running that would just go and go and go. It didn't take me long to get bored with it, just thinking about it. I do automate a lot of HTTP with Perl and LWP, and it's incredibly useful. But most people are going to have their little sandbox of usefulness, and I think the idea of hundreds of thousands of people just letting a spider run rampant is not likely.
  
  In addition, there are bandwidth considerations. Even if the idea of getting one site after another after another doesn't bore you silly (especially when you consider how long it will take you to get a decent percentage of the web, and even then you won't have anything anywhere close to as cool as Google), there are bound to be repercussions when it comes to bandwidth. Either you're on a line where you get charged for bandwidth, or your ISP will at least take notice and look closely at what you're doing.
  
  The point is that I just don't see this as something that would get out of control like SPAM has gotten.
4. Re:cousin of spam? by sethx9 · 2003-12-17 01:50 · Score: 2, Insightful
  
  There is an industry built on teaching businesses and web designers how to increase ranking by making pages spider-friendly. The inverse of those same techniques could be used to protect a site.
  
  If "bad" spiders became so common that businesses began needing to weigh the pros of page ranking against the cons of data theft then the indexing services (those that wanted to remain relevant) would develop other methods for accessing web content.
  
  On a side note: I actually bought this book a couple of weeks ago as a tool to help me learn perl. Over the past few years I've built and used scraping tools and when I saw this book I was thrilled to have so many real-world examples that weren't about building front-end grids and tables to databases!
  
  --
  Sorry, I keep forgetting to add the tongue-in-cheek emoticon to the bottom of my posts...
Re:Techniques used by spammers? by Washizu · 2003-12-16 08:23 · Score: 5, Informative

"why would anyone use these techniques other than to harvest email like a spammer"

1. Archiving data on the web
2. Getting your files back when you forget your FTP password
3. Researching the link structure of the Internet and how it changes over time
4. Playing a joke on a friend by scraping his site and reposting the content, filtered in your favorite dialect
5. Reading your favorite site in an RSS reader, even if they don't provide an RSS feed
6. Counting how often certain words are used on the net
7. Checking to see if you have any broken links on your site
8. Testing to make sure every link is reachable on your site, and finding out how deep the deepest link is
9. Taking data from a public website and compiling useful statistics, such as GPA calculations, average completion times for cross country races, or the total number of points scored last night in the NHL.
10. Showing people that the Internet can be more than just a web browser
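
Items 7 and 8 above boil down to a breadth-first walk of the link graph. Here is a rough sketch in Python, assuming the pages have already been fetched and reduced to a dictionary of page to outgoing links. The `site` map and the `link_depths` helper are hypothetical, purely for illustration.

```python
from collections import deque


def link_depths(site, start):
    """Breadth-first walk; returns {page: clicks needed to reach it from start}."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in site.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site map: page -> pages it links to.
site = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/news/2003", "/missing"],
    "/news/2003": ["/news/2003/12"],
    "/news/2003/12": [],
    "/orphan": ["/"],
}

depths = link_depths(site, "/")
deepest = max(depths, key=depths.get)                # page furthest from the start
unreachable = sorted(set(site) - set(depths))        # pages no link path leads to
broken = sorted(t for t in depths if t not in site)  # link targets that don't exist
```

In a real checker the dictionary would be filled in by the crawl itself, and "broken" would mean a failed HTTP request rather than a missing key.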

--
OddManIn: A Game of guns and game theory.
buying it now by pbjones · 2003-12-16 08:42 · Score: 1

sounds like a way to also keep spiders out...

--
There was an unknown error in the submission.
Re:New book: Hacking your way into a Spider Hole by Anonymous Coward · 2003-12-16 08:46 · Score: 0

Mod down! goatse link! my eyes! (j/k)
all music dot com by 3terrabyte · 2003-12-16 08:57 · Score: 1

Anyone ever spider allmusic.com? Any interest in one?

--
Why are there only 19 people folding@home for slashdot?
WWW::Mechanize is your friend by andy@petdance.com · 2003-12-16 09:08 · Score: 4, Informative
You Perl folks who want something a bit easier than LWP for your spidering and scraping, take a look at WWW::Mechanize. Besides the six hacks in the book that discuss Mech:
- #21: WWW::Mechanize 101
- #22: Scraping with WWW::Mechanize
- #36: Downloading Images from Webshots
- #44: Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups (which uses Mech)
- #64: Super Author Searching
- #73: Scraping TV Listings
here are some other online resources to look at:
- WWW::Mechanize::Examples
  A random bunch of examples submitted by users, included with the Mechanize distribution.
- http://www.perl.com/pub/a/2003/01/22/mechanize.html
  Chris Ball's article about using WWW::Mechanize for scraping TV listings. (repurposed into hack #73 above)
- http://www.stonehenge.com/merlyn/LinuxMag/col47.html
  Randal Schwartz's article on scraping Yahoo News for images.
- http://www.perladvent.org/2002/16th/
  WWW::Mechanize on the Perl Advent Calendar 2002, by Mark Fowler.
How much can you screen-scrape legally ? by JPMH · 2003-12-16 09:12 · Score: 4, Interesting

Question: how much screen-scraping can you do, before the legal questions start ?
In the USA, trading information that has cost somebody else time and money to build up can be caught under a doctrine of "misappropriation of trade values" or "unfair competition", dating from the INS case in 1918.
Meanwhile here in Europe, a collection of data has full authorial copyright (life + 70) under the EU Database Directive (1996), if the collecting involved personal intellectual creativity; or special database rights (last update + 15 years) if it did not.
I've done a little screen-scraping for a "one name" family history project. Presumably that is in the clear, as it was for personal non-commercial research, or (at most) quite limited private circulation.
But where are the limits ?
How much screen-scraping can one do (or advertise), before legally it becomes a "significant taking" ?
Spidering Google Illegal? by jetkust · 2003-12-16 09:48 · Score: 2, Interesting

From Google Terms of Service:

No Automated Querying You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries; "meta-searching" Google; and performing "offline" searches on Google.

Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
1. Re:Spidering Google Illegal? by iggymanz · 2003-12-16 09:59 · Score: 1
  
  Google does allow use of their API, limited to 1,000 searches a day. So *some* types of automated query are allowed, and of course since Google provides the most supremely valuable internet service aside from being able to connect to the internet in the first place, let's all respect Google's terms of use!
2. Re:Spidering Google Illegal? by cosmo7 · 2003-12-16 12:18 · Score: 1
  
  If an application sends a request to Google (or any useful site) and receives a string in return, how does the site know that it's being used by a spider?
  
  Google doesn't check referring urls, btw.
3. Re:Spidering Google Illegal? by Bubba2146 · 2003-12-16 16:21 · Score: 1
  
  I know for a fact it records the UserAgent of the program sending the queries. I once wrote a Perl module to harvest sentences using the groups.google.com collection of articles. Well I miscalculated a variable and the queries started to be sent too rapidly. Sure enough they started to time-out. However, switching the UA was a quick fix, even from the same IP. I didn't keep sending queries after this started though... ;)
4. Re:Spidering Google Illegal? by the+pickle · 2003-12-16 19:30 · Score: 2, Interesting
  
  And what, exactly, constitutes "meta-searching" Google?
  
  p
  
  --
  In Korea, long hair is for old people!
5. Re:Spidering Google Illegal? by Jadsky · 2003-12-17 01:27 · Score: 2, Informative
  
  This book requires that you submit to Google for a key to search with and use their API. In the hacks that require Google access, it'll just say something like
  
  idkey = "insert your key here!"
  
  AFAIK, this is standard practice for most sites with API access. (If you're interested, do it yourself at google.com/apis.) If you try to pull Google info down with an HTTP object programmatically, Google will just return a 403 and tell you to read its terms of service. (Unless you spoof the header, but that requires doing it from scratch, and it will also get you in trouble if you try to use it commercially.)
Riggish web spider perl code? by Anonymous Coward · 2003-12-16 09:51 · Score: 0

Does anyone have this code? Does anyone remember how fun www.riggish.com was before they were sued?
Simple by geekoid · 2003-12-16 09:57 · Score: 1

Obey the robots.txt.

If it doesn't allow you to gather information, then don't.
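
For what it's worth, checking robots.txt takes only a few lines. A small sketch using Python's standard `urllib.robotparser`: the rules, the `ExampleSpider/0.1` agent string, and the example.com URLs are all invented for illustration, and the rules are parsed from a string so nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real spider would download
# the file from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleSpider/0.1"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return rp.can_fetch(agent, url)
```

With these rules, `allowed("http://example.com/index.html")` is true and anything under `/private/` is refused.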

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
1. Re:Simple by JPMH · 2003-12-16 12:04 · Score: 1
  
  Obey the robots.txt.
  If it doesn't allow you to gather information, then don't.
  
  I'm not sure that's the whole answer.
  Many sites may not have a robots.txt file, yet may still value their copyright and/or database rights.
  On the other hand, for some purposes it may be legitimate to take some amount of data (obviously not the whole site), even in contravention of the wishes of a robots.txt
  So I think the question is deeper than just "look for robots.txt"
RE: Booble by Anm · 2003-12-16 10:15 · Score: 4, Informative

Been there.. done that...
http://www.booble.com/
Re: Booble by Anonymous Coward · 2003-12-16 11:04 · Score: 2, Funny

Why'd you have to post that link? I have a lot of important work to do and now it isn't going to get done...
Re:Whole front page is italicized now by AmigaBen · 2003-12-16 11:04 · Score: 0

Well I guess this will teach me to try to help make /. a better place. My meager Karma is now 'bad'. Oh, and for you mods who can't figure out what's tagline and what isn't, "you suck" is my standard tagline.

Otherwise, I can't see what could have been taken as "Flamebait" in my post.

--
+5 Insightful, really!
Re:Techniques used by spammers? by Anonymous Coward · 2003-12-16 12:42 · Score: 0

11. Keep an eye out for news on a rare topic or person on specialty forums.
12. Keep tabs on your competitors' sites and see when they change their prices or add new merchandise.
13. Watching obscure intelligence forums for that secret message from Sydney on Alias.
14. Doing a study of what web technologies are used on sites.
15. Doing studies that track how badly Apache is beating MS servers this month.
16. Keep an eye on the item you are interested in in the online store to see when it goes on sale.
17. Track the entry and exit of businesses in a market segment.
Re: Booble by Suppafly · 2003-12-16 14:04 · Score: 1

Now, if it had the image search option that google has, it'd be great.
Is there an OSS search engine/WWW snapshot by 1iar_parad0x · 2003-12-16 14:07 · Score: 1

I really think we need an open source search engine/repository. I've always wanted to do this. It would be great to engineer an open-architecture search engine, something designed with parsers and bulk downloads in mind. The biggest reason is for use in AI-type applications. I also think some healthy competition for Google would be nice. As crazy as this sounds, maybe a P2P type of solution might alleviate some of the bandwidth and processing issues. It would be like SETI.

The biggest problem is that I (we) would have to find a way to keep the data from being tainted. Obviously, some spammerific moron would try to taint the data to rate XXXmysite at the top of every search. Is there such a project in progress?

Incidentally, I use the wayback machine as well.

--
What do you mean my sig is repetitive? What do you mean my sig is repetitive? What do you mean....
1. Re:Is there an OSS search engine/WWW snapshot by Man_Holmes · 2003-12-17 03:36 · Score: 1
  
  There is an open source search engine project underway. It is called Nutch, as in Nutch.org.
  
  They've received money from some high profile backers such as Mitch Kapor and Overture.
  
  Same people that created the open source indexer Lucene. Haven't downloaded the code yet but I am following the project closely.
  
  Man Holmes
Re:Techniques used by spammers? by unclehighbrow · 2003-12-16 14:28 · Score: 1

Another use - presenting data from another site in a more useful form:

http://redheadedleague.com/df.html

The Double Feature Finder goes to moviefone and finds movies in a row you can see.

Enjoy
Uncle Highbrow
Re:FP??? by Anonymous Coward · 2003-12-17 02:22 · Score: 0

It will be written in my biography, which will end up on /., and all posts will be troll and flamebait!
more reviews of this book by Anonymous Coward · 2003-12-25 17:01 · Score: 0

I found some additional reviews for this book at this site.