How To Build a Web Spider On Linux

Posted by kdawson on Tuesday November 14, 2006 @07:13PM from the five-eyes dept.

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

15 of 104 comments

Hmm... by joe_cot · 2006-11-14 19:15 · Score: 5, Funny

Yes, but does it run on ... damn.
1. Re:Hmm... by strstrep · 2006-11-15 04:37 · Score: 3, Interesting
  
  PHP lightweight? Ha!
  
  The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.
  
  Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Perl and Python are thread-safe.
  
  Perl, for example, includes libraries such as Thread::Queue, which allows you to very easily create a threading model with worker threads, without having to worry too much about condition variables, mutexes, and the like.
  
  Disclaimer: All measurements done on x86 Debian Linux.
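The worker-thread model described above for Perl's Thread::Queue has a rough analogue in Python's standard library, sketched here with queue.Queue. This is Python, not the Perl module the comment names, and run_workers, the handler argument, and the worker count are hypothetical names chosen for illustration.

```python
import queue
import threading

_STOP = object()  # unique sentinel telling a worker to exit


def run_workers(items, handler, num_workers=4):
    """Process items with a pool of worker threads fed from one shared queue."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()  # blocks until an item or the sentinel arrives
            if item is _STOP:
                break
            value = handler(item)
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results  # completion order, not input order
```

In a spider, handler would be whatever fetches and parses one URL; the queue does the locking, so the workers never touch a mutex or condition variable directly.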
Crawling efficiently by BadAnalogyGuy · 2006-11-14 19:21 · Score: 5, Informative

Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

Better to use an associative array to cache the links, since its lookup is O(1). Scanning the queue for a link takes O(n), and since you check every link you find, the worst case over the whole crawl is O(n^2). A hash (associative array) performs the same checks in O(n) total.

--
/([\W_\-]@\W+)/gs
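A minimal sketch of that idea in Python, assuming the crawler keeps its pending links in a queue and remembers every link it has already queued in a set. The function name and the example.com URLs are made up for illustration.

```python
from collections import deque


def enqueue_new_links(frontier, seen, links):
    """Queue only links not seen before; set membership is O(1) on average."""
    added = 0
    for link in links:
        if link not in seen:
            seen.add(link)
            frontier.append(link)
            added += 1
    return added


frontier = deque()  # links waiting to be fetched, in discovery order
seen = set()        # every link ever queued, for fast duplicate checks
added = enqueue_new_links(
    frontier,
    seen,
    ["http://example.com/a", "http://example.com/b", "http://example.com/a"],
)
```

The queue still decides crawl order; the set only answers "have we queued this already?", which is the part that was costing O(n) per link.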
The 90s called by dave562 · 2006-11-14 19:21 · Score: 5, Funny

They want their technology back.
downloads by Bananatree3 · 2006-11-14 19:30 · Score: 4, Informative

for those of us who don't have them, here are the basics:

Wget: http://www.gnu.org/software/wget/.

Curl http://curl.haxx.se/
Hardly linux-specific by h_benderson · 2006-11-14 19:57 · Score: 5, Insightful

All my love for Linux aside, this has nothing to do with Linux the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative, or even Windows.
some points by cucucu · 2006-11-14 19:59 · Score: 5, Interesting
- Don't forget to check and respect robots.txt. Python has a module that helps you parse that file
- Samie and its Python port Pamie are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
- I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
- And once I even repeatedly voted on an online poll and changed the course of history.
- Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
- If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
- When will Firefox's automation capabilities match those of IE?
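For the robots.txt point in the list above, here is a small sketch using the parser in Python's standard library (urllib.robotparser in Python 3). The sample robots.txt, the ExampleSpider user agent, and the make_robot_checker helper are assumptions for illustration; a real spider would fetch the site's own robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt, user_agent):
    """Return a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text we already have in hand

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical robots.txt content, inlined so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robot_checker(SAMPLE_ROBOTS_TXT, "ExampleSpider")
```

Calling allowed(url) before every fetch keeps the check in one place.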
1. Re:some points by killjoe · 2006-11-14 21:09 · Score: 3, Informative
  
  "When will Firefox's automation capabilities match those of IE?"
  
  It's always had it. Look up XUL some day. The entire browser is written in xul.
  
  --
  evil is as evil does
Re:yes, I did RTFA by Faylone · 2006-11-14 20:07 · Score: 4, Funny

You RTFA? Are you sure you're in the right place?
Oh sweet Jesus! by msormune · 2006-11-14 20:25 · Score: 3, Insightful

Pull the article out. The last thing we need is more indexing bots.
Re:Just what the internet needs... by ComaVN · 2006-11-14 20:28 · Score: 3, Informative

I think that's robots.txt, *not* spider.txt

--
Be wary of any facts that confirm your opinion.
Re-inventing a square wheel by rduke15 · 2006-11-14 20:48 · Score: 5, Insightful

Basically, the article gives you ruby and python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.

The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:

HEAD slashdot.org | grep 'Server: '

But it gets worse. To extract a quote from a page, the second script suggests this:

stroffset = resp.body =~ /class="price">/
subset = resp.body.slice(stroffset+14, 10)
limit = subset.index('<')
print ARGV[0] + " current stock price " + subset[0..limit-1] + " (from stockmoney.com)\n"

You don't need to know ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.
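One slightly less brittle way to do the same extraction, sketched in Python rather than Ruby: capture everything up to the next tag instead of a fixed 10 characters, and return nothing when the marker is missing. The class="price" marker comes from the quoted script; extract_price is a hypothetical name, and a regex on HTML is still fragile next to a real parser.

```python
import re

# Text between class="price"> and the next "<", whatever its length.
PRICE_RE = re.compile(r'class="price"\s*>\s*([^<]+?)\s*<')


def extract_price(html):
    """Return the text of the first price element, or None if there is none."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None
```

Returning None makes the "page layout changed" case visible to the caller instead of printing rubbish.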

Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".
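A sketch of the fix, using Python 3's standard html.parser: look the attribute up by name rather than by position. LinkCollector and extract_links are hypothetical names, and this is not the article's script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, wherever href sits in the tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)
                break


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Anchors without an href, such as named anchors, are simply skipped.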

I suppose the only point of that article was the IBM links at the end:

Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
1. Re:Re-inventing a square wheel by rduke15 · 2006-11-14 22:42 · Score: 4, Insightful
  
  what exactly is HEAD slashdot.org
  
  It's a (perl) script which comes with libwww-perl which either is now part of the standard Perl distribution, or is installed by default in any decent Linux distribution.
  
  If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):
  
  $ perl -MLWP::Simple -e '$s=(head "http://slashdot.org/" )[4]; print $s'
  
  Either way is better than those useless 12 lines of ruby (I'm sure ruby can also do the same in a similarly simple way, but that author just doesn't have a clue)
Actually... by SanityInAnarchy · 2006-11-14 20:52 · Score: 3, Interesting

Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.

Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.

--
Don't thank God, thank a doctor!
Okay kids... by Balinares · 2006-11-14 22:30 · Score: 4, Informative

Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup or a similar tool.

Example to find all links in a document:
from BeautifulSoup import BeautifulSoup
for tag in BeautifulSoup(html_document).findAll("a"):
    print tag["href"]
Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
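In Python 3 the urllib2 functionality lives in urllib.request. A minimal sketch of an opener that keeps cookies between requests, assuming a made-up helper name and User-Agent string; nothing here touches the network until you call opener.open().

```python
import urllib.request
from http.cookiejar import CookieJar


def build_spider_opener(user_agent):
    """Build an opener that stores cookies and sends a fixed User-Agent."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)  # remembers cookies across requests
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

Proxy and HTTP auth handlers can be passed to build_opener in the same way, for example urllib.request.ProxyHandler.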

--

-- B.
This sig does in fact not have the property it claims not to have.