How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information, stock data, in this case. Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Yes, but does it run on ... damn.
Their example of a web crawler uses a queue to hold links. Since a link may appear twice, they use a lookup to scan the queue to see if the link is already loaded, and discard it if so.
/([\W_\-]@\W+)/gs
Better to use an associative array to cache the links, since lookup is O(1) on average. The queue's lookup time is O(n), so as n gets large, so does the lookup time. And since you are checking every link, the worst case over the whole crawl is O(n^2). A hash (associative array) will perform the same checks in O(n) total.
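A minimal sketch of that idea in Python, assuming a hypothetical get_links callable that returns the links found on a page. The names crawl_order, get_links and the toy graph are made up for illustration and are not from the article:

```python
from collections import deque

def crawl_order(start, get_links):
    """Breadth-first traversal that visits each link at most once."""
    seen = {start}          # set membership test is O(1) on average
    queue = deque([start])  # frontier of pages still to fetch
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:   # no scan of the queue needed
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for real pages (hypothetical data).
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "d"], "d": []}
print(crawl_order("a", lambda u: graph.get(u, [])))
```

The queue still decides the visiting order; the set only answers "have I seen this already?", which is the part that was O(n) per lookup.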
They want their technology back.
Why would anyone have a need to write a simple spider nowadays? In 2006, there has to be a better way than just following links. For example, it would be interesting to see something that crawled the various social bookmarking sites and correlated the various terms. Say User A on Delicious and User B on Stumble Upon both bookmark a link about Pink Floyd and another one about Led Zep. If I'm searching for something about Floyd, the system could recommend some cool info about Led Zep too. (Email me if you need to know where to send my royalty checks).
Entrepreneur : (noun), French for "unemployed"
for those of us who don't have them, here are the basics:
Wget: http://www.gnu.org/software/wget/.
Curl http://curl.haxx.se/
The article mostly talks about scripting languages. And yes, I do know wget comes with a lot of Linux distros, but not EVERYONE has it. So there, I DID read TFA.
I for one welcome our out-of-date-eight-legged overlords!
I want to be retired when I grow up.
All my love for Linux aside, this has nothing to do with Linux, the kernel (or even GNU/Linux, the OS). It works just as well on any other Unix derivative or even Windows.
http://www.google.com/
Dammit, I was hoping this article was about the evolution of Dr Weird's phone spiders, mechanical creatures that could be sent down your cable line to maul anyone sending you phishing emails and spam.
Pull the article out. The last thing we need is more indexing bots.
I think that's robots.txt, *not* spider.txt
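For anyone writing a spider in Python, a rough sketch of honoring robots.txt with the standard library's urllib.robotparser. The rules text, the agent name "example-spider" and the example.com URLs are invented for illustration; a real crawler would fetch the site's own /robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(url, agent="example-spider"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/x.html"))
```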
Be wary of any facts that confirm your opinion.
I've never programmed in Ruby, but I think the comment in Listing 1 says it all:
"Iterate through response hash"
Why would somebody want to do that?
A quick net search "reveals": A simple resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?
How does "spider.txt" get an Insightful when it's "robots.txt"? Sheesh, bump the Mods Roster.
these Eight legged freaks!!!
Basically, the article gives you Ruby and Python examples of how to get web pages, and (badly) parse them for information. The same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most know how to do it correctly.
The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:
HEAD slashdot.org | grep 'Server: '
But it gets worse. To extract a quote from a page, the second script suggests this:
You don't need to know Ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly that will break and spit out rubbish.
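As a rough illustration of a less brittle approach (not the article's code), here is a Python sketch that lets an HTML parser find the element whose class list contains "price" and returns its text, instead of counting characters. The class name comes from the script described above; PriceParser, extract_price and the sample markup are hypothetical:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Capture the text of the first element whose class list has 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes and self.price is None:
            self.in_price = True

    def handle_data(self, data):
        # Take the first non-blank text after the matching start tag.
        if self.in_price and data.strip():
            self.price = data.strip()
            self.in_price = False

def extract_price(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.price

sample = '<td><span id="q" class="price big">41.27</span> USD</td>'
print(extract_price(sample))
```

This still breaks if the site changes its markup, but it no longer depends on attribute order or on the quote being exactly 10 characters long.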
Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parse to get links. But a closer look reveals that, to find links, it just gets the first attribute of any a tag and uses that as the link. Never mind if the 1st attribute doesn't happen to be "href".
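A short sketch of looking the attribute up by name rather than by position, using Python's standard html.parser. This is not the article's listing; LinkParser, find_links and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from <a> tags, wherever href appears."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

def find_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links

sample = ('<a class="nav" href="/news">News</a>'
          '<a name="top">Top</a>'
          '<a HREF="http://example.com/">Home</a>')
print(find_links(sample))
```

The second anchor has no href at all, so it is skipped instead of having its name attribute mistaken for a link.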
I suppose the only point of that article was the IBM links at the end:
And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
Some websites do not have good search functionality. Sometimes it's an area that Google doesn't crawl (robots.txt and such), and sometimes I'm looking for something very, very specific.
Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. In a more general way, there is a third-party site which collects vital statistics of everyone who puts those in their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
Don't thank God, thank a doctor!
For a good chuckle, see The Spider of Doom on the Daily WTF.
And please use robots.txt.
And go see Google Webmaster tools.
And don't wear socks with sandals. Well, ok, this one is optional.
Ah, I can see it clearly now!
1. Post to Slashdot a decoy article (it includes Linux in the subject) with new spam tricks
2. Watch if spam increases 30% next days
3. Bribe Cowboy Neal with 10G midget lesbian pr0n and get IP addresses of the art. readers
4. Load shotgun and make the world a better place!
I guess most male CS students will have coded something similar at least once to D/L pr0n.
I did one in shell and one in TCL/TK.
Atheism is a non-prophet organisation
Example to find all links in a document: Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL and so on, look into the urllib2 module that ships natively with Python.
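A hedged sketch of such an opener. In Python 3 the urllib2 functionality lives in urllib.request, so that is what is used here; build_spider_opener, the user agent string and the proxy mapping are hypothetical names chosen for the example:

```python
import http.cookiejar
import urllib.request

def build_spider_opener(user_agent="example-spider/0.1", proxies=None):
    """Build an opener that keeps cookies and optionally uses proxies."""
    cookie_jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxies is not None:
        # e.g. {"http": "http://proxy.example:3128"} (made-up address)
        handlers.append(urllib.request.ProxyHandler(proxies))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar

opener, jar = build_spider_opener()
# opener.open("http://example.com/").read() would fetch a page and
# store any cookies the server sets in jar.
```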
-- B.
This sig does in fact not have the property it claims not to have.
They forgot to set the User-Agent header to IE.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
Also, following relative and site-relative links, and obeying things like "base href", "href='javascript:...'" and the case-sensitivity of URLs seems to be too difficult for many beginning crawler programmers.
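A small sketch of handling those cases with Python's urllib.parse. The resolve function and the example URLs are hypothetical; the caller is assumed to pass the page's "base href" as base when the document declares one, and the page URL otherwise:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve(base, href):
    """Turn an href into an absolute http(s) URL, or None if not crawlable."""
    absolute = urljoin(base, href.strip())
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # javascript:, mailto:, and similar are not pages
    # Host names are case-insensitive, paths are not: lowercase only the
    # host (assumes no user:password part in the URL) and drop the fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path, parts.query, ""))

base = "http://Example.COM/a/b.html"
for href in ["e.html", "/d", "../C.html#top", "javascript:void(0)"]:
    print(href, "->", resolve(base, href))
```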
Has there ever been a news story on Slashdot that doesn't have an "I, for one, welcome our new [Insert here] overlords" comment attached to it?
An app to find broken links on your web site.
Checking links with LinkCheck
http://world.std.com/~swmcd/steven/perl/pm/lc/linkcheck.html
I know, I know. Flame me. But I found Heritrix (http://crawler.archive.org/) to be a very polished package. I used it for my Masters research and found that it is very extensible. Useful if you are doing real crawling, i.e. not concentrating on one site.
I for one welcome our new "I for one welcome our new X overlords" overlords.
It's true I tell you, feller at work's next door neighbour read it in the paper.
Should be: "How Not
I don't think I am alone in my thinking
Yep, it's _very_ intelligent to loop through dictionary for a specific item. (I may be wrong, since I wouldn't even have nightmares about coding ruby, but it sure as hell looks like it...)
I, for one, welcome our clichéd-overlord-joke-bearing Slashdot comments.
No Starch Press are releasing a book about this soon; they had a mockup on display at the Frankfurt book fair.
It's not different at all, steve!
Don't forget that the inet is not just http..
http://en.wikipedia.org/wiki/Archie_search_engine
Why not Nutch?
http://lucene.apache.org/nutch/
"think of it as evolution in action"
Once I had to collect a lot of info from a website. I used Java and wget and some Java HTML parser library (possibly JTidy). Anyway, the code was very short and clean. I'd recommend DOM walking over other solutions when the data isn't trivial.
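To give an idea of what DOM walking looks like, here is a sketch in Python rather than Java, using the standard xml.dom.minidom. It assumes the page has already been cleaned into well-formed XHTML (the job a tidy tool does); the walk function and the sample table are invented for illustration:

```python
from xml.dom import minidom

def walk(node, depth=0):
    """Yield (depth, tag name, direct text) for each element, depth-first."""
    if node.nodeType != node.ELEMENT_NODE:
        return
    text = "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE).strip()
    yield depth, node.tagName, text
    for child in node.childNodes:
        yield from walk(child, depth + 1)

# Hypothetical, already well-formed page fragment.
page = ("<html><body><table>"
        "<tr><td>ACME</td><td>41.27</td></tr>"
        "</table></body></html>")
doc = minidom.parseString(page)

for depth, name, text in walk(doc.documentElement):
    print("  " * depth + name, text)
```

Because the walk follows the tree structure, picking out "the second cell of each row" is a matter of filtering on tag names and position, not of searching for substrings.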
screen-scraper (http://www.screen-scraper.com/) runs fabulously on Linux, and integrates well with most modern programming languages. It can save all kinds of time over writing Perl and Python scripts. There's a free (as in beer) version available, and a pro version if more features are wanted.
I couldn't resist - in Ruby, using the beautiful (but much understated) hpricot library:
doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }
Check it out - I've been using it for a project, and it's really fast and really easy to use (supports both XPath and CSS for parsing links). For spidering you should check out the Ruby mechanize library (which is like Perl's WWW::Mechanize, but also uses hpricot, making parsing the returned document much easier).
I did similar things in college with Perl. (shudders*) The programs were OS-neutral; I think I developed mine in Windows under Cygwin.
*Yes, I know Slashdot is written in Perl.
No, I will not work for your startup
Next question?
To have a right to do a thing is not at all the same as to be right in doing it
That's it.
I'm leaving.
You won't have Anonymous Coward to kick around anymore.
Who needs this much drivel, from the idiots who write the article, to the idiots who submit it, to the idiots who approve it?
Garrrrgh.