Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
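The design described above (many connections open at once, a per-crawler DNS cache, and fetches moving through a fixed set of states) can be sketched roughly as follows. This is only an illustration, not the original crawler: it swaps the explicit queues for `asyncio` coroutines, replaces real sockets with `asyncio.sleep(0)`, and every name in it (`FetchState`, `DnsCache`, `fetch`, `crawl`, `fake_resolver`, `run_demo`) is hypothetical.

```python
import asyncio
from enum import Enum, auto


class FetchState(Enum):
    LOOKING_UP_DNS = auto()
    CONNECTING = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()


class DnsCache:
    """Per-crawler cache: each host is resolved once, then reused."""

    def __init__(self, resolver):
        self._resolver = resolver  # async callable: host -> address (assumed)
        self._pending = {}         # host -> task holding the lookup result
        self.lookups = 0           # count of real (uncached) lookups

    async def resolve(self, host):
        if host not in self._pending:
            self.lookups += 1
            # Cache the in-flight task so concurrent fetches share one lookup.
            self._pending[host] = asyncio.ensure_future(self._resolver(host))
        return await self._pending[host]


async def fetch(host, path, dns, limiter, trace):
    """Walk one page fetch through the four states named in the text."""
    async with limiter:  # caps the number of simultaneously open fetches
        trace.append((host, path, FetchState.LOOKING_UP_DNS))
        address = await dns.resolve(host)
        for state in (FetchState.CONNECTING,
                      FetchState.SENDING_REQUEST,
                      FetchState.RECEIVING_RESPONSE):
            trace.append((host, path, state))
            await asyncio.sleep(0)  # stand-in for real non-blocking socket I/O
        return f"<page {path} from {address}>"


async def crawl(urls, resolver, max_connections=300):
    dns = DnsCache(resolver)
    limiter = asyncio.Semaphore(max_connections)
    trace = []
    pages = await asyncio.gather(
        *(fetch(host, path, dns, limiter, trace) for host, path in urls)
    )
    return pages, dns.lookups, trace


async def fake_resolver(host):
    """Hypothetical resolver so the sketch runs without touching the network."""
    await asyncio.sleep(0)
    return "10.0.0." + str(len(host))


def run_demo():
    urls = [("example.org", "/a"), ("example.org", "/b"), ("example.net", "/")]
    return asyncio.run(crawl(urls, fake_resolver))
```

Calling `run_demo()` fetches three pages from two hosts while performing only two DNS lookups, which is the saving the per-crawler cache is meant to provide.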
It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email and phone calls. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.
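For contrast with a free-text copyright notice, here is a minimal sketch of what a crawler can actually understand: a robots.txt file checked with Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleCrawler` user agent, the example URLs and the helper names `build_parser` and `allowed` are all assumptions made for illustration, not anything taken from the original system.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="ExampleCrawler"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed(build_parser(ROBOTS_TXT), "http://example.org/private/notes.html")` is false, while the same check on `http://example.org/index.html` is true.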
If the fork introduces significant improvements, they will be back-ported to the original branch
Not necessarily. I think communities will spawn that will try to take Java to directions opposed to Sun's philosophy, such as more agile/dynamic dialects of the language, etc. While the results might be compelling I don't think Sun will want to take them back into the main trunk.
I didn't emphasize the word Muslim, other than inadvertently putting it in Uppercase. Some may claim this is a Freudian slip, but probably it was Firefox's spell checker.
While I don't think it is the work of a government to protect me from pornography, TFA speaks about censoring political opposition and bloggers. And yes, censoring governments usually justify their censorship as a way to protect the people from obscene contents, while actually silencing legitimate political discussions.
What is our list made of?
6+4+3=13
6 Muslim countries (Iran, Tunisia, Egypt, Saudi Arabia, Turkmenistan, Syria), 4 communist countries (China, North Korea, Cuba, Vietnam), 3 dictatorships (Myanmar, Belarus, Uzbekistan).
While I am not sure about Uzbekistan, I feel pretty safe about the classification. Countries classified as Muslim/communist probably can be tagged as dictatorships too (or as undemocratic to say the least).
So it can be safely said that internet censors are those with ideologies that are/were opposed by the US. We should not be surprised, as the internet is an American invention and is mostly dominated by English-language / Western content.
Actually they have every right to do whatever they like as long as it is within the law. There is nothing specific in the GPL that says they cannot make a deal with Microsoft.
I'm afraid what you say is true, they have every right to do it. TFA says:
arguing that the agreement is a divisive agreement, effectively splitting the open source movement into groups with and without commercial status.
But let's tell the truth: the first to do this were RedHat and SuSe with their closed versions of Linux. And everybody stayed silent, so why are they screaming when Microsoft comes onto the scene? It looks like the GPL has a bug. It gives the right to distribute Linux as a closed source OS. The GPL should somehow allow distributing the OS with closed applications, but the OS itself should remain open.
Go to the .net job, earn more money and convince them to introduce IronPython. Then you'll have the money and resume of .net, and the geekiness of an open source dynamic language.
You can also make great career advances by showing them how they get more productive with Python and being their guru.
Just writing more C# or Perl lines will not take you anywhere. Try to make the highest impact and leave your personal mark on the job you do.
I think Google should provide a Linux box with root access to each surfer.
The most basic web based interface would be an AJAX based command line over https, so you can log in as root.
A more sophisticated one would be using a GUI-ish web application.
Finally, when you are at your computer and not in an airport public terminal or internet cafe, you can use special purpose client software for remote desktop access.
All your gmail attachments, docs, or spreadsheets you edit would end up on your computer, and could be edited with remote desktop software too.
If they want they can add ads in the desktop. Nor does it matter to me if they charge for the service. In the latter case the model could be as in Amazon's EC2, a few cents per CPU hour and per GB transmitted/stored. If the computer is idle it is hibernated and stored, and you are not charged for it.
And if you want to leave you can download your image or have a DVD mailed to you.
kudos for us all
You're right. And let me add: how are you supposed to cheat in an exam with such a speech to text mechanism?
I am used to surfing sites where the pictures are the important thing
to post a link to a forked java
Your code is too risky. You have to maintain it, write unit tests, use test coverage suites, check the code is being covered.
A better risk containment policy is to synchronize all on board clocks to January 1st 1970 before takeoff.
- Search in wikipedia: http://www.google.com/search?q=%25s+site:en.wikipedia.org
- Go to wikipedia entry: http://en.wikipedia.org/wiki/%25s
- Go to wiktionary: http://en.wiktionary.org/wiki/%25s
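A rough sketch of how such templates get filled in, assuming the `%25s` in the links above is simply a percent-encoded `%s` placeholder of the kind browser keyword bookmarks use (`%25` is the encoding of `%`). The function name `expand_keyword_url` is made up for the example.

```python
from urllib.parse import quote


def expand_keyword_url(template, query):
    """Replace the search placeholder in template with the encoded query."""
    # Assumption: "%25s" is just an encoded "%s", so normalise it first.
    decoded = template.replace("%25s", "%s")
    # quote() escapes spaces and other unsafe characters in the query.
    return decoded.replace("%s", quote(query))
```

For example, `expand_keyword_url("http://en.wikipedia.org/wiki/%25s", "web crawler")` gives `http://en.wikipedia.org/wiki/web%20crawler`.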
Dopitch: ping
home run: tracert
out: ttl expired
you say...
to filter out comments by mod
the torrent of the .iso image?
If you studied philosophy you would know that the sentence "studying X is worth it if you love it" is a tautology.
not necessarily.
if they can afford to give you smtp and pop access, perhaps imap too.
for sites of enough traffic:
randomly pair users in private chat rooms (ajax, of course) and have them decide on each other if they are human or computer...
Google - if you are so bold, let's see you provide IMAP access to gmail.
you can download the as a .cvs file
Obviously such a thing will not work well without an advertising filter (imagine an analyzer sifting through washing powder ads).
So they will have to develop one.
This will be integrated into VCRs to stop/start recording when advertising starts/stops.
Great!
Yeah, I should have read that that's what the poster thought. But I didn't completely read TF post.
Lawyers have such a love for verbosity, and we programmers are so impatient and have no time to read TFA, not TFM, not TF anything.
I would ask Yahoo Mail or Hotmail to turn over a tape containing an image of their customers' inboxes in the datacenter.
But fortunately I am not a lawyer!