How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Yes, but does it run on ... damn.
Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.
/([\W_\-]@\W+)/gs
Better to use an associative array to cache the links, since its lookup is O(1) on average. Scanning the queue is O(n) per lookup, so it gets slower as n grows, and because you check every link, the worst case for the whole crawl is O(n^2). A hash (associative array) does the same checks in O(n) total.
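To make the point above concrete, here is a minimal Python sketch (not the article's code) that keeps the queue for ordering but does the "already seen?" check against a set. The names `crawl_order`, `links_of` and the in-memory `SITE` table are made up for illustration and stand in for real HTTP fetches.

```python
from collections import deque


def crawl_order(start, links_of):
    """Return pages in breadth-first order, visiting each URL once."""
    queue = deque([start])  # frontier: pages still to visit
    seen = {start}          # set membership is O(1) on average
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in links_of(url):
            if link not in seen:  # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "site"; a real spider would fetch and parse pages.
SITE = {
    "/": ["/a", "/b", "/a"],
    "/a": ["/b", "/"],
    "/b": ["/c", "/a"],
    "/c": [],
}

if __name__ == "__main__":
    print(crawl_order("/", lambda url: SITE.get(url, [])))
```

The queue still decides the visiting order; the set only answers the duplicate check, so each link costs one hash lookup instead of a pass over everything queued so far.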
They want their technology back.
For those of us who don't have them, here are the basics:
Wget: http://www.gnu.org/software/wget/
Curl: http://curl.haxx.se/
All my love for Linux aside, this has nothing to do with Linux, the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative or even Windows.
You RTFA? Are you sure you're in the right place?
Pull the article out. The last thing we need is more indexing bots.
I think that's robots.txt, *not* spider.txt
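For anyone who wants their spider to actually honor robots.txt, Python's standard library has a parser for it. A rough sketch follows, assuming you already have the file's lines in hand; the helper name `allowed`, the agent string "MySpider" and the example.com URLs are placeholders, not anything from the article.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules without any network access
    return parser.can_fetch(agent, url)


# Hypothetical rules; a real spider would download /robots.txt first.
RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    print(allowed(RULES, "MySpider", "http://example.com/private/page"))
    print(allowed(RULES, "MySpider", "http://example.com/public/page"))
```

In a real crawl you would more likely call `set_url()` and `read()` on the parser so it fetches the file itself; feeding lines by hand just keeps the sketch offline.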
Be wary of any facts that confirm your opinion.
I've never programmed in Ruby, but I think the comment in Listing 1 says it all:
"Iterate through response hash"
Why would somebody want to do that?
A quick net search "reveals": A simple resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?
Basically, the article gives you ruby and python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.
The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:
HEAD slashdot.org | grep 'Server: '
But it gets worse. To extract a quote from a page, the second script suggests this:
You don't need to know Ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.
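For contrast, a less brittle way to pull that value is to let an HTML parser find the element whose class list contains "price" and collect its text, however long it is. This is only a sketch in Python using the standard `html.parser` module; the class and function names are made up, and it assumes the price element does not contain a nested element with the same tag name.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of the first element whose class includes 'price'."""

    def __init__(self):
        super().__init__()
        self.price = None
        self._tag = None       # tag name of the price element, once found
        self._in_price = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price is None and not self._in_price and "price" in classes:
            self._in_price = True
            self._tag = tag

    def handle_data(self, data):
        if self._in_price:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._in_price and tag == self._tag:
            self.price = "".join(self._chunks).strip()
            self._in_price = False


def first_price(html):
    """Return the text of the first class='price' element, or None."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.price


if __name__ == "__main__":
    print(first_price('<td><span class="price">12.5</span> USD</td>'))
```

With the same input, grabbing a fixed 10 characters after 'class="price">' would return '12.5</span', which is exactly the kind of rubbish described above.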
Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".
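The fix for that last problem is small: look the attribute up by name instead of by position. A minimal sketch with Python's standard `html.parser` follows (this is not the article's listing, and the names `LinkParser` and `extract_links` are mine).

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values from <a> tags, wherever href sits in the tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Match on the attribute name, not on its position in the tag.
            if name == "href" and value:
                self.links.append(value)
                break


def extract_links(html):
    """Return all href values of <a> tags in document order."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a class="x" href="/one">1</a><a name="top">t</a>'
    print(extract_links(page))
```

Anchors without an href (such as `<a name="top">`) are simply skipped, and the parser lowercases tag and attribute names, so `<A HREF=...>` is handled too.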
I suppose the only point of that article was the IBM links at the end:
And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.
Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
Don't thank God, thank a doctor!
For a good chuckle, see The Spider of Doom on the Daily WTF.
And please use robots.txt.
And go see Google Webmaster tools.
And don't wear socks with sandals. Well, ok, this one is optional.
Ah, I can see it clearly now!
1. Post a decoy article to Slashdot (with Linux in the subject) describing new spam tricks
2. Watch whether spam increases 30% over the next few days
3. Bribe Cowboy Neal with 10G of midget lesbian pr0n and get the IP addresses of the article's readers
4. Load shotgun and make the world a better place!
Example to find all links in a document:
Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
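As a rough illustration of the opener idea: in Python 3, urllib2's functionality lives in `urllib.request`, so the sketch below uses that name. It only builds an opener with cookie and optional proxy support and does not fetch anything; `make_opener` and the proxy address are hypothetical.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_opener(proxies=None):
    """Build a URL opener that keeps cookies and optionally uses proxies."""
    jar = CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxies is not None:
        # proxies maps a scheme to a proxy URL, e.g. {"http": "http://..."}
        handlers.append(urllib.request.ProxyHandler(proxies))
    opener = urllib.request.build_opener(*handlers)
    return opener, jar


if __name__ == "__main__":
    # Hypothetical local proxy address, for illustration only.
    opener, jar = make_opener({"http": "http://127.0.0.1:3128"})
    print(len(opener.handlers), "handlers installed;", len(jar), "cookies so far")
```

Calling `opener.open(url)` would then send requests through those handlers; HTTP auth can be added the same way with the auth handler classes from the same module.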
-- B.
This sig does in fact not have the property it claims not to have.
Has there ever been a news story on Slashdot that doesn't have an "I, for one, welcome our new [Insert here] overlords" comment attached to it?
Should be: "How Not
I don't think I am alone in my thinking
I couldn't resist - in Ruby, using the beautiful (but much understated) hpricot library:
require 'hpricot'
# html_document is a path; for a URL, open-uri must be loaded as well
doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }
Check it out - I've been using it for a project, and it's really fast and really easy to use (supports both XPath and CSS for parsing links). For spidering you should check out the Ruby mechanize library (which is like Perl's WWW::Mechanize, but also uses hpricot, making parsing the returned document much easier).