John Owens has a nice chapter in GPU Gems 2 on this topic.
...
(I'd like to hope that Newell actually knows all this and is just posturing in the middle of his Steam pimping and that this doesn't reflect reality in Valve's world!)
I actually agree with what you're saying, except that you really should mention that you're the editor of "GPU Gems 2". Pimping a product without revealing your association to it is the worst form of pimping in my book.
I really dislike the downplaying of what Google did with the Usenet archives. Yes, they acquired older archives (Google hasn't been around as long as Usenet has...duh). They located and assembled various pre-Deja archives (1981-1995), they acquired the Deja archives (1995-2000; Deja never hosted anything from before '95), and since the end of 2000 they have been the only ones who archive/index/host a fairly complete (text-only) Usenet feed. The addition of the pre-Deja archives was a Big Thing. Nobody had ever managed to assemble such a complete Usenet archive before. Many people thought that most of these archives were lost in time, but now we have the ability to browse back to the Stone Age of the internet(!) I find this archive truly fascinating.
Info about the timeline of this archive here and its composition here.
Anyways, comparing the UI/feature set of Deja (well, before they sold out at least) to Google Groups (as it was) and to the new Google Groups Beta (which I don't like that much either) is a different topic. I'd choose the considerably improved relevance of Google Groups searches (phrase search, anyone?) over Deja's wildcards anytime.
Google is fetching these pages to analyse them for displaying AdSense (AdWords text ads targeted to the webpage you're viewing) in the free version of Opera.
This does not end up in Google's web search index.
Not the greatest way of doing this. On one of the sites I maintain, the date shows up at the top of the page. The other content changes very infrequently in most cases (a few pages hit a news&events database but that's about it). But the new date would be enough to change the checksum (unless they're allowing for it somehow)
That's why I mentioned "smart" MD5 checksums. You'd only checksum certain parts of a page, e.g., detect everything that looks like a date and make sure that it's not part of the smart checksum. As long as the checksum parser on the grub client and the one at Looksmart are identical, that should work pretty well.
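To make the "smart" checksum idea concrete, here is a rough Python sketch. It is only a guess at how such a thing could work, not what grub or Looksmart actually do: the date/time patterns and the names `smart_checksum` and `needs_refetch` are made up for illustration, and a real parser would need far more patterns.

```python
import hashlib
import re

# Hypothetical patterns for "things that look like a date or time".
# These are illustrative assumptions, not taken from any real crawler.
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),            # 2003-10-15
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),      # 10/15/2003
    re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"[a-z]*\.? \d{1,2},? \d{4}\b"
    ),                                               # October 15, 2003
    re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),     # 14:05 or 14:05:33
]


def smart_checksum(page_text: str) -> str:
    """MD5 of the page with date-like substrings removed and whitespace collapsed."""
    stripped = page_text
    for pattern in VOLATILE_PATTERNS:
        stripped = pattern.sub("", stripped)
    normalized = " ".join(stripped.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def needs_refetch(stored_checksum: str, fetched_text: str) -> bool:
    """True if the client's fetch no longer matches the central crawl's checksum."""
    return smart_checksum(fetched_text) != stored_checksum


if __name__ == "__main__":
    old = "<p>Today is October 15, 2003</p><p>Welcome to our site.</p>"
    new = "<p>Today is October 16, 2003</p><p>Welcome to our site.</p>"
    # Only the date differs, so the two checksums should agree.
    print(needs_refetch(smart_checksum(old), new))
```

The important part is that both sides run exactly the same normalization; if the client and the central crawl strip different things, every page looks modified.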
Good points. Why don't you consider using two different Google accounts? One more for ease-of-use and the other more for sensitive things.
Smart thinking. You either get a Nobel Prize or a Darwin Award. A win-win situation.
http://news.google.com/news?q=china+space
...rather a crawl with a distributed component.
They use the screensaver grub clients to check whether a web page has been modified since the last time it was crawled (by the centralized crawl done by Looksmart). They probably use some smart MD5 checksum of the pages and send that with the URLs to be crawled to the clients. If the checksum of what the grub client crawled doesn't match, then the centralized crawl is instructed to re-fetch that URL.
They go this route because the If-Modified-Since HTTP 1.1 request header is not supported by many web servers (and even if it is, you can't really trust it). This is especially true for dynamically generated web pages. I.e., if If-Modified-Since worked reliably, it would be a simple operation to check whether a previously crawled page has changed. Since that's not the case, they are outsourcing the expensive refetching of whole pages.
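For comparison, this is roughly what the cheap check would look like if If-Modified-Since could be trusted. It is a minimal Python sketch using only the standard library; the helper names `build_conditional_request` and `changed_since` are my own, and nothing here is taken from the grub client.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url: str, last_crawl_epoch: float) -> urllib.request.Request:
    """Build a GET request that asks for the page only if it changed since the last crawl."""
    # formatdate(..., usegmt=True) produces an HTTP-date such as
    # "Sun, 06 Nov 1994 08:49:37 GMT".
    stamp = formatdate(last_crawl_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def changed_since(url: str, last_crawl_epoch: float, timeout: float = 10.0) -> bool:
    """Return False on 304 Not Modified, True if the server sends the page again."""
    request = build_conditional_request(url, last_crawl_epoch)
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            # A 200 here means either the page really changed or the server
            # ignored the header, which is exactly the trust problem.
            return True
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return False
        raise
```

A server that ignores the header simply answers 200 with the full page every time, so the crawler can't tell "changed" from "header not supported". That's why comparing checksums of the fetched content is the more dependable (if more expensive) test.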
It will be interesting to see how this pans out. I think they could run into trouble with ISPs if this really takes off (because bandwidth consumption per user would increase and make flat-rate deals less profitable for some ISPs).
To see how stories change over time on Forbes you can check out Google News Search
Google News Search doesn't seem to be able to get the new versions of a story if it's always at the same URL.
A somewhat better solution:
Go to http://www.google.com/preferences and
select: 'Search for pages written in any language'
It seems that restricting searches to articles
posted in particular languages is currently not working. I'm sure they'll fix this soon...
BTW, Google Groups doesn't have its own
preferences page (I wish it had), but the cookie
generated from the preferences page of the main
site still has an effect on Google Groups
(at least in terms of language restrictions).
This scheme can be exploited too easily.
Let's assume a cooperating group of trolls somehow manages to get karma points (remember, not every troll is necessarily stupid all the time, they might just pretend to be to get their deranged kicks out of the trolling experience). Now they can cooperatively troll happily at Score 5 by shoving these karma points back and forth between each other.
In other words, the main thing that makes the current moderation system relatively abuse-proof is the fact that you get to moderate quite infrequently.