Slashdot Mirror

How To Build a Web Spider On Linux

Posted by kdawson on Tuesday November 14, 2006 @07:13PM from the five-eyes dept.

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

3 of 104 comments

Crawling efficiently by BadAnalogyGuy · 2006-11-14 19:21 · Score: 5, Informative

Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

Better to use an associative array to track the links already seen, since its lookup is O(1). The queue's lookup time is O(n) per check, and since you are checking every link, the worst case for the whole crawl is O(n^2). A hash (associative array) performs all of those checks in O(n) total.

--

/([\W_\-]@\W+)/gs

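A minimal sketch of that idea in Python, assuming a made-up in-memory link graph in place of real page fetches: `LINK_GRAPH`, `crawl`, and `get_links` are hypothetical names for illustration, not anything taken from the article. The queue only decides the visiting order, while a separate hash-based set answers "have we seen this link already?" without scanning the queue.

```python
from collections import deque

# Hypothetical in-memory "site": maps a page URL to the links found on it.
# A real spider would fetch and parse each page instead.
LINK_GRAPH = {
    "/": ["/a", "/b", "/a"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": [],
}

def crawl(start, get_links):
    """Breadth-first traversal that visits each URL exactly once."""
    queue = deque([start])  # pages waiting to be processed, in order
    seen = {start}          # hash-based set: membership test is O(1) on average
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/", lambda url: LINK_GRAPH.get(url, []))
print(visited)
```

Keeping the set separate from the queue also means a page that has already been dequeued and processed is still remembered, so a link back to it is not queued a second time.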
downloads by Bananatree3 · 2006-11-14 19:30 · Score: 4, Informative

For those of us who don't have them, here are the basics:

Wget: http://www.gnu.org/software/wget/

Curl: http://curl.haxx.se/

Okay kids... by Balinares · 2006-11-14 22:30 · Score: 4, Informative

Just so people who may come across this know, if you're going to do some HTML or XHTML parsing in Python, you'd be insane not to use BeautifulSoup or a similar tool.

Example to find all links in a document:

    from BeautifulSoup import BeautifulSoup
    for tag in BeautifulSoup(html_document).findAll("a"):
        print tag["href"]

Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
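As a rough sketch of what such an opener can look like: the example below uses Python 3's `urllib.request`, which took over the role of urllib2, rather than urllib2 itself. The function name `build_spider_opener`, the `example-spider/0.1` user agent string, and the example URL are assumptions made up for illustration. It only constructs the opener; the actual fetch is left commented out so nothing touches the network.

```python
import http.cookiejar
import urllib.request

def build_spider_opener(proxy_url=None, user_agent="example-spider/0.1"):
    """Build an opener that keeps cookies and optionally routes through a proxy."""
    cookie_jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from responses and replays them on later requests.
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url is not None:
        # Hypothetical proxy address supplied by the caller.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar

opener, cookie_jar = build_spider_opener()
# html_document = opener.open("http://example.com/").read()  # network call, not run here
```

The returned `html_document` bytes could then be handed to the parser of your choice.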

--

-- B.
This sig does in fact not have the property it claims not to have.