How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.
Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
Don't thank God, thank a doctor!
Maybe because they don't know the first thing about efficiency? You'd be surprised how much programmers don't know or care about efficiency. Once, incidentally also on a crawler (student project), I improved the function reading a tree of URLs from 1 hour(!) to 0.1 seconds! The guy tested it on an example with 10 URLs and it worked, but his implementation was O(n^2) and involved copying huge amounts of memory at each step. Don't ask me how he thought this would be scalable.
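To make the O(n^2) trap concrete, here is a rough sketch in Ruby. The student's actual code isn't shown, so both functions are hypothetical: one plausible way to end up quadratic (a linear scan plus an array copy per URL), next to a hash-based version that stays roughly linear.

```ruby
require 'set'

# Quadratic: Array#include? scans the whole list for every URL, and
# `seen + [url]` builds a fresh copy of the array on every insertion.
def unique_urls_slow(urls)
  seen = []
  urls.each do |url|
    seen = seen + [url] unless seen.include?(url)
  end
  seen
end

# Roughly linear: Set#add? is a hash lookup that returns nil for
# duplicates, so select keeps only the first occurrence of each URL.
def unique_urls_fast(urls)
  seen = Set.new
  urls.select { |url| seen.add?(url) }
end
```

With 10 URLs you will never notice the difference between the two, which is presumably how this kind of thing slips through testing.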
"It's too bad that stupidity isn't painful." - Anton LaVey
I know, I know. Flame me. But I found Heritrix (http://crawler.archive.org/) to be a very polished package. I used it for my master's research and found it very extensible. It's useful if you are doing real crawling, i.e., not concentrating on one site.
PHP lightweight? Ha!
The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.
Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Perl and Python are thread-safe.
Perl, for example, includes libraries such as Thread::Queue, which allows you to very easily create a threading model with worker threads, without having to worry too much about condition variables, mutexes, and the like.
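For comparison, the same worker-thread pattern can be sketched with Ruby's built-in Queue class. This is only an illustration of the idea, not the Perl Thread::Queue API; `fetch_all` and the `fetch` callable are names made up for the example.

```ruby
# Runs worker_count threads that pull URLs off a shared thread-safe queue.
# `fetch` is any callable that takes a URL and returns something
# (hypothetical; a real spider would do an HTTP request here).
def fetch_all(urls, worker_count, fetch)
  jobs = Queue.new
  results = Queue.new
  urls.each { |url| jobs << url }
  jobs.close # once closed and drained, pop returns nil instead of blocking

  workers = Array.new(worker_count) do
    Thread.new do
      while (url = jobs.pop)
        results << [url, fetch.call(url)]
      end
    end
  end
  workers.each(&:join)

  # All workers have finished, so draining the results queue is safe here.
  pairs = []
  pairs << results.pop until results.empty?
  pairs.to_h
end
```

The queue does the locking internally, so there are no explicit mutexes or condition variables in the calling code. Something like `fetch_all(urls, 4, ->(u) { u.length })` is enough to try it out without touching the network.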
Disclaimer: All measurements done on x86 Debian Linux.
I couldn't resist - in Ruby, using the beautiful (but much understated) hpricot library:
doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }
Check it out - I've been using it for a project, and it's really fast and really easy to use (it supports both XPath and CSS selectors for picking out links). For spidering you should check out the Ruby Mechanize library (which is like Perl's WWW::Mechanize, but also uses Hpricot, making parsing the returned document much easier).
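If you want to see the spidering loop itself without any gems, here is a minimal sketch using only Ruby's standard library. It is not how Hpricot or Mechanize work internally: the regex link extraction is deliberately crude, and `extract_links`, `crawl`, `fetch` and `max_pages` are all names invented for this example.

```ruby
require 'set'
require 'uri'

# Crude href extraction with a regular expression; a real HTML parser
# copes far better with messy markup. Relative links are resolved
# against the page they were found on.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\s[^>]*?href=["']([^"']+)["']/i).flatten
  hrefs.each_with_object([]) do |href, links|
    begin
      links << URI.join(base_url, href).to_s
    rescue URI::Error
      # skip hrefs that do not parse as URIs
    end
  end
end

# Breadth-first crawl that stays on the starting host.
# `fetch` is any callable returning the HTML for a URL, or nil on failure.
def crawl(start_url, fetch, max_pages = 100)
  start_host = URI(start_url).host
  seen = Set[start_url]
  frontier = [start_url]
  visited = []

  until frontier.empty? || visited.size >= max_pages
    url = frontier.shift
    html = fetch.call(url)
    next if html.nil?
    visited << url
    extract_links(html, url).each do |link|
      next unless URI(link).host == start_host
      frontier << link if seen.add?(link)
    end
  end
  visited
end
```

Passing the fetcher in as a callable means you can test the crawl logic against a hash of canned pages, then swap in something like `->(u) { Net::HTTP.get(URI(u)) }` (after `require 'net/http'`) for real use.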