How To Build a Web Spider On Linux

Posted by kdawson on Tuesday November 14, 2006 @07:13PM from the five-eyes dept.

IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (stock data, in this case). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."

some points by cucucu · 2006-11-14 19:59 · Score: 5, Interesting
- Don't forget to check and respect robots.txt. Python has a module that helps you parse that file.
- Samie and its Python port Pamie are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
- I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
- And once I even repeatedly voted on an online poll and changed the course of history.
- Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
- If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
- When will Firefox's automation capabilities match those of IE?
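
For the robots.txt point above, the module in question is presumably the standard library's robots.txt parser (`urllib.robotparser` in Python 3; it was called `robotparser` in Python 2). A minimal sketch, assuming a made-up robots.txt body and a hypothetical user agent name `ExampleSpider`; a real spider would fetch the file from the site it is about to crawl:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example needs no network.
# A real spider would load it with set_url(...) followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        print(url, allowed(rules, "ExampleSpider", url))
```

Checking `allowed(...)` before every request is cheap, so it can sit right in front of the fetch call.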
crawling is not so trivial by cucucu · 2006-11-14 20:33 · Score: 2, Interesting

As the two students who started a little web search company found, crawling the web is not trivial: http://infolab.stanford.edu/~backrub/google.html. An excerpt follows.

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
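
The pieces described in that excerpt (many connections open at once, a per-crawler DNS cache, asynchronous I/O, and a queue feeding the fetchers) can be sketched with Python's `asyncio`. This is an illustration only, not Google's code: the `Crawler` class, the injected `resolve` and `fetch` callables, and the fake implementations in `demo()` are all assumptions made so the example runs without touching the network.

```python
import asyncio
from urllib.parse import urlsplit


class Crawler:
    """Toy crawler: a queue of URLs, N concurrent workers, one DNS cache."""

    def __init__(self, resolve, fetch, max_connections=300):
        self.resolve = resolve            # async callable: hostname -> address
        self.fetch = fetch                # async callable: (address, url) -> body
        self.max_connections = max_connections
        self.dns_cache = {}               # hostname -> in-flight or finished lookup
        self.pages = {}                   # url -> body
        self.errors = {}                  # url -> exception

    async def lookup(self, host):
        if host not in self.dns_cache:
            # Cache the task itself so concurrent workers share one lookup.
            self.dns_cache[host] = asyncio.ensure_future(self.resolve(host))
        return await self.dns_cache[host]

    async def worker(self, queue):
        while True:
            url = await queue.get()
            try:
                address = await self.lookup(urlsplit(url).hostname)
                self.pages[url] = await self.fetch(address, url)
            except Exception as exc:      # one bad page must not stop the crawl
                self.errors[url] = exc
            finally:
                queue.task_done()

    async def crawl(self, urls):
        queue = asyncio.Queue()
        for url in urls:
            queue.put_nowait(url)
        workers = [asyncio.ensure_future(self.worker(queue))
                   for _ in range(self.max_connections)]
        await queue.join()                # wait until every URL is processed
        for task in workers:
            task.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
        return self.pages


async def demo():
    """Run the crawler against fake resolver and fetcher functions."""
    lookups = []

    async def fake_resolve(host):
        lookups.append(host)
        await asyncio.sleep(0)
        return "192.0.2.1"                # documentation-only address

    async def fake_fetch(address, url):
        await asyncio.sleep(0)
        return f"<html>{url} via {address}</html>"

    crawler = Crawler(fake_resolve, fake_fetch, max_connections=3)
    urls = [f"http://example.com/page{i}" for i in range(10)]
    pages = await crawler.crawl(urls)
    return pages, lookups


if __name__ == "__main__":
    pages, lookups = asyncio.run(demo())
    print(len(pages), "pages fetched;", len(lookups), "DNS lookup(s)")
```

Storing the lookup task rather than the resolved address means that workers asking for the same host at the same moment wait on a single resolver call instead of each issuing their own.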

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
Actually... by SanityInAnarchy · 2006-11-14 20:52 · Score: 3, Interesting

Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.

Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.

--
Don't thank God, thank a doctor!
Re:Crawling efficiently by Mr2cents · 2006-11-14 22:07 · Score: 2, Interesting

Maybe because they don't know the first thing about efficiency? You'd be surprised how little programmers know or care about efficiency. Once, incidentally also on a crawler (a student project), I cut the time of the function that reads a tree of URLs from 1 hour(!) to 0.1 seconds. The guy had tested it on an example with 10 URLs and it worked, but his implementation was O(n^2) and involved copying huge amounts of memory at each step. Don't ask me how he thought this would scale.
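
The student's code is not shown, so the following is only a guess at the kind of mistake involved: a tree walk that rebuilds the whole result list at every step, next to one that appends into a single list. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One URL in the tree, with the URLs found beneath it."""
    url: str
    children: list = field(default_factory=list)


def collect_quadratic(node):
    """Pre-order walk that copies the accumulated list at every step.

    Each `urls + ...` allocates a new list, so a node with n children
    (or a chain n levels deep) costs O(n^2) copying in total.
    """
    urls = [node.url]
    for child in node.children:
        urls = urls + collect_quadratic(child)
    return urls


def collect_linear(root):
    """Same pre-order walk with one shared list and an explicit stack: O(n)."""
    urls, stack = [], [root]
    while stack:
        node = stack.pop()
        urls.append(node.url)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(node.children))
    return urls


if __name__ == "__main__":
    tree = Node("http://example.com/", [
        Node("http://example.com/a", [Node("http://example.com/a/1")]),
        Node("http://example.com/b"),
    ])
    print(collect_linear(tree) == collect_quadratic(tree))
```

With 10 URLs the two are indistinguishable, which is how this kind of bug survives a small test.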

--
"It's too bad that stupidity isn't painful." - Anton LaVey
Reinventing the wheel by Anonymous Coward · 2006-11-15 00:37 · Score: 1, Interesting

I know, I know. Flame me. But I found Heritrix (http://crawler.archive.org/) to be a very polished package. I used it for my Master's research and found it very extensible. Useful if you are doing real crawling, i.e. not concentrating on one site.
Re:Hmm... by strstrep · 2006-11-15 04:37 · Score: 3, Interesting

PHP lightweight? Ha!

The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.

Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Perl and Python are thread-safe.

Perl, for example, includes libraries such as Thread::Queue, which allows you to very easily create a threading model with worker threads, without having to worry too much about condition variables, mutexes, and the like.
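
For comparison, the same worker-thread-plus-queue model can be sketched in Python with the standard `queue` and `threading` modules. This is an analogue of the pattern, not a port of any Perl code; `run_workers` and its `handle` callback are hypothetical names, and the callback stands in for whatever fetches and processes a page.

```python
import queue
import threading


def run_workers(urls, handle, num_workers=4):
    """Process every URL with `handle` on a small pool of worker threads."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()               # guards writes to `results`

    def worker():
        while True:
            url = tasks.get()             # blocks until work is available
            if url is None:               # sentinel: no more work for this thread
                return
            try:
                value = handle(url)
            except Exception as exc:      # record the failure, keep the thread alive
                value = exc
            with lock:
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:                     # one sentinel per worker
        tasks.put(None)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_workers(["a", "b", "c"], str.upper, num_workers=2))
```

The queue does the locking and signalling internally, so the caller never touches a condition variable directly, which is the convenience the Thread::Queue remark is pointing at.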

Disclaimer: All measurements done on x86 Debian Linux.
Re:Okay kids...(in Ruby) by amran · 2006-11-15 08:41 · Score: 2, Interesting

I couldn't resist - in Ruby, using the beautiful (but much understated) hpricot library:
require 'hpricot'
doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }  # print the href of every <a> element

Check it out - I've been using it for a project, and it's really fast and really easy to use (it supports both XPath and CSS selectors for parsing links). For spidering you should check out the Ruby Mechanize library (which is like Perl's WWW::Mechanize, but also uses hpricot, making it much easier to parse the returned document).
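
For readers following along in Python, roughly the same link extraction can be done with nothing but the standard library's `html.parser`. A minimal sketch; the `LinkExtractor` class and `extract_links` helper are hypothetical names, and the HTML string in the usage lines is made up.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/a">A</a> <a name="x">no href</a></p>'
    for link in extract_links(page):
        print(link)
```

It has no XPath or CSS selector support, so it is closer to the one-liner above than to a full scraping library.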