Slashdot Mirror

How To Build a Web Spider On Linux

Posted by kdawson on Tuesday November 14, 2006 @07:13PM from the five-eyes dept.

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (stock data, in this case). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

1 of 104 comments

Crawling efficiently by BadAnalogyGuy · 2006-11-14 19:21 · Score: 5, Informative

Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.

Better to use an associative array to cache the links, since a lookup is O(1) on average. The queue's lookup time is O(n), so it grows as n gets large, and because you are checking every link, the worst case for the whole crawl is O(n^2) total. A hash (associative array) will perform the same n checks in O(n) total.
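To make the point concrete, here is a minimal sketch in Python, not the article's code. It keeps the queue for crawl order but does the duplicate check against a hash set. The names `crawl_order`, `get_links`, and `SITE` are made up for illustration, and the in-memory link table stands in for actually fetching and parsing pages.

```python
from collections import deque


def crawl_order(start, get_links):
    """Breadth-first traversal that enqueues each URL at most once."""
    seen = {start}          # hash set: average O(1) membership test
    queue = deque([start])  # FIFO frontier of URLs still to visit
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # replaces a linear scan of the queue
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link graph standing in for fetched pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/a", "/c"],
    "/c": [],
}

print(crawl_order("/", lambda url: SITE.get(url, [])))
# → ['/', '/a', '/b', '/c']
```

Because the set remembers every link ever enqueued, it also catches links whose pages have already left the queue, which a scan of the queue contents alone would not.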