Slashdot Mirror


How To Build a Web Spider On Linux

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (stock data, in this case). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

1 of 104 comments

  1. Crawling efficiently by BadAnalogyGuy · · Score: 5, Informative

    Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they scan the queue to check whether the link has already been added, and discard it if so.

    Better to use an associative array (hash) to record the links already seen, since its lookup is O(1) on average. Scanning the queue is O(n) per lookup, so as n gets large, so does the lookup time. And since you are checking every link, the worst-case total over n links is O(n^2). A hash performs the same n checks in O(n) total.
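
    A minimal Python sketch of the idea, not the article's code: the queue still decides visiting order, while a hash-based set answers "have we seen this link?" without scanning the queue. The names crawl_order, links_on_page, and the tiny site graph are made up for illustration; links_on_page stands in for fetching a page and extracting its links.

```python
from collections import deque


def crawl_order(start, links_on_page):
    """Return pages in the order a breadth-first crawler would visit them.

    links_on_page is a hypothetical stand-in for "fetch the page and
    extract its links": it maps a URL to the list of URLs found there.
    """
    queue = deque([start])  # pages still to visit, in FIFO order
    seen = {start}          # hash-based set: average O(1) membership test
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in links_on_page.get(url, []):
            if link not in seen:  # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order


# A tiny made-up link graph with duplicate and circular links.
site = {
    "/": ["/a", "/b", "/a"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
print(crawl_order("/", site))
```

    Marking a link as seen when it is enqueued, rather than when it is visited, keeps duplicates out of the queue entirely, so each link costs one set lookup instead of a queue scan.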