cucucu · Slashdot Mirror

crawling is not so trivial on How To Build a Web Spider On Linux · 2006-11-14 20:33 · Score: 2, Interesting

As the two students who started a little web search company found out, crawling the web is not trivial: http://infolab.stanford.edu/~backrub/google.html. An excerpt follows.

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
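
To make the per-crawler DNS cache and the state queues from the excerpt a bit more concrete, here is a rough sketch. It is not the original crawler code: the names (State, DnsCache, crawl) and the injected resolver are my own assumptions, and it only simulates the state transitions, with no real network I/O or event loop.

```python
from collections import deque
from enum import Enum


class State(Enum):
    # The four connection states named in the excerpt, plus a final DONE state.
    LOOKING_UP_DNS = 1
    CONNECTING = 2
    SENDING_REQUEST = 3
    RECEIVING_RESPONSE = 4
    DONE = 5


class DnsCache:
    """Per-crawler cache: each host is resolved at most once."""

    def __init__(self, resolve):
        # `resolve` is an injected function host -> IP (an assumption of this
        # sketch; a real crawler might wrap socket.gethostbyname here).
        self._resolve = resolve
        self._cache = {}
        self.lookups = 0  # number of real lookups, for illustration

    def lookup(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolve(host)
        return self._cache[host]


def crawl(urls, dns):
    """Move (host, path) fetches through one queue per state until all are DONE."""
    pipeline = list(State)  # definition order: DNS -> ... -> DONE
    queues = {state: deque() for state in pipeline}
    for host, path in urls:
        queues[State.LOOKING_UP_DNS].append({"host": host, "path": path})

    while any(queues[state] for state in pipeline[:-1]):
        # Walk the states from last to first so that a fetch advances
        # by exactly one state per pass.
        for i in range(len(pipeline) - 2, -1, -1):
            state, next_state = pipeline[i], pipeline[i + 1]
            if not queues[state]:
                continue
            fetch = queues[state].popleft()
            if state is State.LOOKING_UP_DNS:
                fetch["ip"] = dns.lookup(fetch["host"])
            queues[next_state].append(fetch)
    return list(queues[State.DONE])
```

With a fake resolver such as `DnsCache(lambda host: "10.0.0.1")`, crawling two pages from the same host triggers only one lookup, which is the whole point of the cache.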

some points on How To Build a Web Spider On Linux · 2006-11-14 19:59 · Score: 5, Interesting

Don't forget to check and respect robots.txt. Python has a module that helps you parse that file.
Samie and its Python port Pamie are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
And once I even repeatedly voted on an online poll and changed the course of history.
Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
When will Firefox's automation capabilities match those of IE?
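
On the robots.txt point: a minimal sketch with the standard library's urllib.robotparser (the module was called robotparser in Python 2). The rules, the user agent name and the URLs below are made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real spider would download it
# from http://<host>/robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Usage: call `allowed(ROBOTS_TXT, "MySpider", url)` before each request and skip the URL when it returns False.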

YouTube has been ./ed! on Death of the Cell Phone Keypad As We Know It? · 2006-11-14 02:55 · Score: 1

kudos for us all

Re:Death knell of the keypad - a little overdramat on Death of the Cell Phone Keypad As We Know It? · 2006-11-14 02:32 · Score: 1

You're right. And let me add: how are you supposed to cheat in an exam with such a speech to text mechanism?

I am used on PS3 Opened For Pictures · 2006-11-13 05:20 · Score: 5, Funny

I am used to surfing sites where the pictures are the important thing

Re:GPL makes forks irrelevant on Sun Open Sources Java Under GPL · 2006-11-13 03:13 · Score: 1

If the fork introduces significant improvements, they will be back-ported to the original branch

Not necessarily. I think communities will spawn that will try to take Java in directions opposed to Sun's philosophy, such as more agile/dynamic dialects of the language, etc. While the results might be compelling, I don't think Sun will want to take them back into the main trunk.

let's see who's the first one on Sun Open Sources Java Under GPL · 2006-11-12 23:24 · Score: 3, Funny

to post a link to a forked java

Re:list composition on Top 10 List of Worldwide Internet Censors · 2006-11-12 22:58 · Score: 1

I didn't emphasize the word Muslim, other than inadvertently putting it in Uppercase. Some may claim this is a Freudian slip, but probably it was Firefox's spell checker.

While I don't think it is the job of a government to protect me from pornography, TFA speaks about censoring political opposition and bloggers. And yes, censoring governments usually justify their censorship as a way to protect the people from obscene content, while actually silencing legitimate political discussions.

list composition on Top 10 List of Worldwide Internet Censors · 2006-11-12 22:00 · Score: 2, Interesting

What is our list made of?
6+4+3=13
6 Muslim countries (Iran, Tunisia, Egypt, Saudi Arabia, Turkmenistan, Syria), 4 communist countries (China, North Korea, Cuba, Vietnam), 3 dictatorships (Myanmar, Belarus, Uzbekistan).
While I am not sure about Uzbekistan, I feel pretty safe about the classification. Countries classified as Muslim/communist probably can be tagged as dictatorships too (or as undemocratic to say the least).

So it can be safely said that internet censors are those with ideologies that are/were opposed by the US. We should not be surprised, as the internet is an American invention and is mostly dominated by English language / western content.

Re:They have every right. on Samba Team Urges Novell To Reconsider · 2006-11-12 07:52 · Score: 1

Actually they have every right to do whatever they like as long as it is within the law. There is nothing specific in the GPL that says they cannot make a deal with Microsoft.

I'm afraid what you say is true, they have every right to do it. TFA says:

arguing that the agreement is a divisive agreement, effectively splitting the open source movement into groups with and without commercial status.

But let's tell the truth: the first to do this were Red Hat and SuSE with their closed versions of Linux. And everybody stayed silent, so why are they screaming when Microsoft comes onto the scene?
It looks like the GPL has a bug: it gives the right to distribute Linux as a closed source OS. The GPL should somehow allow distributing the OS with closed applications, while the OS itself remains open.

Re:Simple! on NASA Avoids "Happy New Year" On Shuttle · 2006-11-12 03:34 · Score: 1

Your code is too risky. You have to maintain it, write unit tests, use test coverage suites, and check that the code is being covered.

A better risk containment policy is to synchronize all on board clocks to January 1st 1970 before takeoff.

yes for wikipedia on Google's Test Search Engine · 2006-11-12 03:29 · Score: 5, Informative

I used searchmash and voted for results for wikipedia. Some time ago I found the following firefox quick searches to be very useful:

Search in wikipedia: http://www.google.com/search?q=%s+site:en.wikipedia.org
Go to wikipedia entry: http://en.wikipedia.org/wiki/%s
Go to wiktionary: http://en.wiktionary.org/wiki/%s

Do ./ers have good wikipedia quick searches to share?
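
For anyone who has not used quick searches: Firefox replaces the %s placeholder in the bookmark URL with whatever you type after the keyword. A small sketch of that substitution follows; the function name is my own and the exact escaping Firefox applies may differ, so treat it as an approximation.

```python
from urllib.parse import quote


def expand_quick_search(template, terms):
    """Fill the %s placeholder of a quick search URL with URL-encoded terms."""
    # safe="" also escapes "/" so the terms cannot change the URL path.
    return template.replace("%s", quote(terms, safe=""))
```

For example, `expand_quick_search("http://en.wikipedia.org/wiki/%s", "web crawler")` gives `http://en.wikipedia.org/wiki/web%20crawler`.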

terminology change on The Ballpark Stadium of the Future · 2006-11-11 19:43 · Score: 3, Funny

pitch: ping
home run: tracert
out: ttl expired

you say...

Feature missing in MySql on Slashdot Posting Bug Infuriates Haggard Admins · 2006-11-09 03:59 · Score: 1

to filter out comments by mod

where is... on Windows Vista Released To Manufacturing · 2006-11-09 01:51 · Score: 5, Funny

the torrent of the .iso image?

perl or .net? both on Choosing Your Next Programming Job — Perl Or .NET? · 2006-11-09 00:43 · Score: 2, Insightful

Go to the .net job, earn more money and convince them to introduce IronPython. Then you'll have the money and resume of .net, and the geekiness of an open source dynamic language.

You can also make great career advances by showing them how they get more productive with Python and being their guru.

Just writing more C# or Perl lines will not take you anywhere. Try to make the highest impact and leave your personal mark on the job you do.

Re:Other fields? on Is Computer Science Still Worth It? · 2006-11-08 09:30 · Score: 2, Funny

Is studying philosophy worth it?

Yes, if you love it.

If you studied philosophy you would know that the sentence "studying X is worth it if you love it" is a tautology.

Re:a message to Eric on Google CEO — Take Your Data and Run · 2006-11-08 09:23 · Score: 1

not necessarily.
if they can afford to give you smtp and pop access, perhaps imap too.

the REAL Turing test on How to Prevent Form Spam Without Captchas · 2006-11-08 09:20 · Score: 1

for sites of enough traffic:

randomly pair users in private chat rooms (ajax, of course) and have each decide whether the other is human or computer...

from hosted apps to hosted OS on Google CEO — Take Your Data and Run · 2006-11-08 08:41 · Score: 1

I think Google should provide a Linux box with root access to each surfer.

The most basic web-based interface would be an AJAX-based command line over https, so you can log in as root.

A more sophisticated one would be using a GUI-ish web application.

Finally, when you are at your computer and not in an airport public terminal or internet cafe, you can use special purpose client software for remote desktop access.

All your Gmail attachments, and the docs or spreadsheets you edit, would end up on your computer, and could be edited with remote desktop software too.

If they want, they can add ads to the desktop. Nor does it matter to me if they charge for the service. In the latter case the model could be as in Amazon's EC2: a few cents per CPU hour and per GB transmitted/stored. If the computer is idle it is hibernated and stored, and you are not charged for it.

And if you want to leave you can download your image or have a DVD mailed to you.

a message to Eric on Google CEO — Take Your Data and Run · 2006-11-08 08:29 · Score: 1

Google - if you are so bold, let's see you provide IMAP access to gmail.

Re:API for Contacts? on Google CEO — Take Your Data and Run · 2006-11-08 08:27 · Score: 2, Informative

you can download them as a .csv file

Civilian use of such a thing on DARPA Starts Ultimate Language Translation Project · 2006-11-08 04:41 · Score: 1

Obviously such a thing will not work well without an advertising filter (imagine an analyzer sifting through washing powder ads).

So they will have to develop one.

This will be integrated into VCRs to stop/start recording when advertising starts/stops.

Great!

Re:If I were the defendant's lawyer on Spammer Can't Have Accuser's Hard Drive · 2006-11-08 04:03 · Score: 1

Yeah, I should have read that that's what the poster thought. But I didn't completely read TF post.

Lawyers have such a love for verbosity, and we programmers are so impatient and have no time to read TFA, nor TFM, nor TF anything.

If I were the defendant's lawyer on Spammer Can't Have Accuser's Hard Drive · 2006-11-08 04:00 · Score: 1

I would ask Yahoo Mail or Hotmail to turn over a tape containing an image of their customers' inboxes in the datacenter.

But fortunately I am not a lawyer!

Comments · 73