How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.
/([\W_\-]@\W+)/gs
Better to use an associative array to cache the links, since a lookup is O(1) on average. Scanning the queue costs O(n) per lookup, and since you check every link you find, the worst case for the whole crawl is O(n^2). A hash (associative array) performs the same checks in O(n) total.
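A minimal sketch of that idea in Python, assuming a breadth-first crawl. The names crawl_order and get_links are made up for illustration, and a toy dictionary stands in for real pages: the queue only holds work to do, while a separate set answers "have I seen this link?" in constant time on average.

```python
from collections import deque


def crawl_order(start, get_links):
    """Visit every reachable link exactly once, breadth-first.

    get_links is a hypothetical callback returning the links found on a page.
    """
    queue = deque([start])   # links still to visit
    seen = {start}           # hash-based membership test, O(1) on average
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:      # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for real pages (an assumption for the demo).
graph = {"a": ["b", "c", "a"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl_order("a", lambda u: graph.get(u, [])))  # ['a', 'b', 'c', 'd']
```

The duplicate links ("a" and "c" appear more than once) are dropped without ever walking the queue.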
For those of us who don't have them, here are the basics:
Wget: http://www.gnu.org/software/wget/
Curl: http://curl.haxx.se/
I think that's robots.txt, *not* spider.txt
Be wary of any facts that confirm your opinion.
For a good chuckle, see The Spider of Doom on the Daily WTF.
And please use robots.txt.
And go see Google Webmaster tools.
And don't wear socks with sandals. Well, ok, this one is optional.
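On the robots.txt point, a small sketch of how a spider might check the rules before fetching, using the urllib.robotparser module from the Python 3 standard library. The rules, the user agent name "MySpider" and the example.com URLs are invented for the demo; a real spider would load the site's own file with set_url() and read() instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example runs offline.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the rule lines directly instead of fetching them

allowed = rp.can_fetch("MySpider", "http://example.com/index.html")
blocked = rp.can_fetch("MySpider", "http://example.com/private/data.html")
print(allowed, blocked)  # True False
```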
"When will Firefox's automation capabilities match those of IE?"
It's always had it. Look up XUL some day. The browser's entire user interface is written in XUL.
evil is as evil does
Example to find all links in a document: Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
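A minimal sketch of link extraction using only the standard library, assuming Python 3 (where urllib2's functionality lives in urllib.request). The names LinkCollector and find_links are made up for illustration, and this is not necessarily the approach the poster had in mind.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<p><a href="http://example.com/">one</a> '
          '<a name="x">no href</a> <A HREF="/two">two</A></p>')
print(find_links(sample))  # ['http://example.com/', '/two']
```

To run it against a live page, the HTML string could come from urllib.request.urlopen(url).read().decode(), with the encoding chosen to match the page.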
-- B.
This sig does in fact not have the property it claims not to have.