How To Build a Web Spider On Linux

Posted by kdawson on Tuesday November 14, 2006 @07:13PM from the five-eyes dept.

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (stock data, in this case). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

6 of 104 comments

Crawling efficiently by BadAnalogyGuy · 2006-11-14 19:21 · Score: 5, Informative

Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

Better to use an associative array to cache the links, since its lookup is O(1). The queue's lookup time is O(n), so it grows as n gets large, and since you are checking every link, the worst case for the whole crawl is O(n^2). With a hash (associative array), the same checks add up to O(n) in total.

--
/([\W_\-]@\W+)/gs
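
For anyone who wants to see the difference in code, here is a minimal Python sketch of the idea above: keep the queue for ordering, but do the "have I seen this link?" check against a hash-based set. The `get_links` callable and the `FAKE_WEB` data are made up for illustration; this is not the article's code.

```python
from collections import deque


def crawl_order(start, get_links):
    """Visit links breadth-first, each link at most once.

    `get_links` is a hypothetical callable that returns the links
    found on a given page; a real spider would fetch and parse here.
    """
    queue = deque([start])  # links still to visit, in FIFO order
    seen = {start}          # hash-based set: average O(1) membership test
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # no linear scan of the queue
                seen.add(link)
                queue.append(link)
    return order


# A tiny in-memory "web" standing in for real pages (made-up data).
FAKE_WEB = {
    "a": ["b", "c", "a"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

print(crawl_order("a", lambda url: FAKE_WEB.get(url, [])))
# prints ['a', 'b', 'c', 'd']
```

The queue still decides the visiting order; the set only answers the duplicate check, so each link costs one average O(1) lookup instead of a scan.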
downloads by Bananatree3 · 2006-11-14 19:30 · Score: 4, Informative

For those of us who don't have them, here are the basics:

Wget: http://www.gnu.org/software/wget/

Curl: http://curl.haxx.se/
Re:Just what the internet needs... by ComaVN · 2006-11-14 20:28 · Score: 3, Informative

I think that's robots.txt, *not* spider.txt
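
A rough sketch of what honoring robots.txt can look like in Python, using the standard library's `urllib.robotparser`. The rules, the `ExampleSpider` user agent and the example.com URLs are all invented for illustration; a real spider would download the site's actual robots.txt rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = make_checker(ROBOTS_TXT)

# "ExampleSpider" is a hypothetical user-agent name.
print(rules.can_fetch("ExampleSpider", "http://example.com/index.html"))
# prints True
print(rules.can_fetch("ExampleSpider", "http://example.com/private/data.html"))
# prints False
```

Calling `can_fetch` before every request is the cheap part of being polite; the spider just skips any URL for which it returns False.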

--
Be wary of any facts that confirm your opinion.
That reminds me. by archeopterix · 2006-11-14 21:06 · Score: 2, Informative

"Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix."
Unfortunately, many web developers still ignore the inevitable, leaving their sites vulnerable to the dreaded Googlebot "attack". While most of the spider developer manuals (TFA included) stress the importance of being polite (respect robots.txt & friends), most of the "become teh Web Master in x days" books don't even mention robots.txt. Go figure.
For a good chuckle, see The Spider of Doom on the Daily WTF.
And please use robots.txt.
And go see Google Webmaster tools.
And don't wear socks with sandals. Well, ok, this one is optional.
Re:some points by killjoe · 2006-11-14 21:09 · Score: 3, Informative

"When will Firefox's automation capabilities match those of IE?"

It's always had them. Look up XUL some day. The entire browser is written in XUL.

--
evil is as evil does
Okay kids... by Balinares · 2006-11-14 22:30 · Score: 4, Informative

Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup or a similar tool.

Example to find all links in a document:
    from BeautifulSoup import BeautifulSoup
    for tag in BeautifulSoup(html_document).findAll("a"):
        print tag["href"]
Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
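
If BeautifulSoup is not installed, the same link extraction can be roughed out with the standard library alone. The sketch below targets Python 3 and swaps in `html.parser` for BeautifulSoup (in Python 3, what urllib2 offered lives in `urllib.request`). The `LinkCollector` class, `extract_links` helper and `SAMPLE` markup are made-up names and data; unlike the snippet above, it skips anchors that have no href and resolves relative links against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:  # ignore anchors without a target
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_document, base_url):
    """Hypothetical helper: return all absolute link targets in a page."""
    collector = LinkCollector(base_url)
    collector.feed(html_document)
    return collector.links


# Made-up markup standing in for a downloaded page.
SAMPLE = (
    '<p><a href="/about">About</a> <a name="top">Top</a> '
    '<a href="http://example.org/">Elsewhere</a></p>'
)

print(extract_links(SAMPLE, "http://example.com/index.html"))
# prints ['http://example.com/about', 'http://example.org/']
```

For messy real-world HTML a dedicated parser like BeautifulSoup is likely still the more forgiving choice; this is only meant to show the moving parts.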

--

-- B.
This sig does in fact not have the property it claims not to have.