Using Google to Calculate Web Decay
scottennis writes: "Google has yet another application: measuring the rate of decay of information on the web.
By plotting the number of results at 3, 6, and 12 months for a series of phrases, this study claims to have uncovered a corresponding 60-70-80 percent decay rate.
Essentially, 60% of the web changes every 3 months." You may be amused by some of the phrases he notes as exceptional, too.
This kind of thing can be a good application of Google's SOAP interface!
"Why did they cancel my favorite Sci-Fi show? I downloaded ALL the episodes!"
Are google claiming that they can check through the entire internet inside a timescale of 3 months, ready to check through again at the start of the next quarter?
Surely this can't be true. Check Google's cached pages - see the dates on there?
Google is turning into another history book.
Roadkill is yummy.
It seems to me that in a way, the web is like an organism, whose smaller constituents are constantly (or not so constantly, depending on the webmaster) renewing themselves. It's a truly adaptive medium, and thus drastic change in short times like this as interest shifts should be quite expected.
That said, this is one of the many ways in which Google is an invaluable tool for research. Not just finding information, but generating it. Thanks Google!
It would also be interesting to see how much of the web no longer exists... like at what rate the web is dying. God knows there's enough dead links out there...
Once upon a time...
Digital libraries and World Wide Web sites and page persistence
Sig: What Happened To The Censorware Project (censorware.org)
From the evidence, he searched for very few phrases. The sample size is way too low to be representative of the web - which some estimates put at several billion more pages than there are people on the planet! There are no signs of more than about 5 different phrases being searched for here.
Can a few simple searches on Google really generate a large enough sample to draw such large conclusions?
The report is one page long, hosted on Angelfire. There is no substantial data to back up his claims. Is this report reliable in any way?
I'm amazed this got posted on the front page of Slashdot..
This makes the job of Archive.org-like sites damn tough.
P.S. Are we losing information at a comparable rate to generation....?
He creates a problem for himself by not providing us with his raw data, making any subsequent verification of the trend difficult. In fact, the one data set he gives us:
Phrase              3 mos   6 mos   12 mos   Total
buy low sell high    4700    5470     6200    7830
                      60%     70%      79%    100%
seems to demonstrate the opposite of the trend that he describes. Indeed, a current search on Google shows about 1,270,000 results (makes you wonder when he did his searches, given that the current number of results differs by so many orders of magnitude). The methodology also fails to take into account any growth in the size of the web, which could mask the effects of decay.
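For what it's worth, the percentage row in that table can be reproduced from the raw counts. A minimal sketch in Python; the names `counts`, `total`, and `share_of_total` are mine for illustration, not anything from the study:

```python
# Counts quoted in the comment above for the phrase "buy low sell high".
counts = {"3 mos": 4700, "6 mos": 5470, "12 mos": 6200}
total = 7830

def share_of_total(counts, total):
    """Return each windowed count as a rounded percentage of the total."""
    return {window: round(100 * n / total) for window, n in counts.items()}

shares = share_of_total(counts, total)
print(shares)  # {'3 mos': 60, '6 mos': 70, '12 mos': 79}
```

So the 60/70/79 figures are simply each date-restricted result count divided by the unrestricted count, which says nothing about how much the web grew over the same period.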
I'm not impressed. The article does not define what he means by decay, or how he measured it, except in the vaguest of terms. The analysis of the data is poor; anyone interested in decay would suspect some kind of exponential decay. They would therefore plot the data logarithmically, and perhaps calculate a half life. Piss poor.
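A rough sketch of what that log-plot-and-half-life analysis might look like, using the only data set quoted in this thread. Big assumption, since the article never defines decay: I'm treating each count as "pages modified within the last N months", so one minus its share of the total is the fraction left unchanged for at least N months. The function name `fit_half_life` is hypothetical, and three points from one phrase prove nothing either way.

```python
import math

# Counts quoted in this thread for "buy low sell high".
# ASSUMPTION: each count is the number of pages modified within the last
# N months, so (1 - count/total) is the share left unchanged for N months.
months = [3, 6, 12]
counts = [4700, 5470, 6200]
total = 7830

def fit_half_life(months, counts, total):
    """Least-squares line through (t, ln(unchanged share)).

    Returns (half_life_in_months, intercept_of_the_fitted_line).
    """
    ys = [math.log(1 - n / total) for n in counts]
    t_mean = sum(months) / len(months)
    y_mean = sum(ys) / len(ys)
    sxx = sum((t - t_mean) ** 2 for t in months)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(months, ys))
    slope = sxy / sxx  # negative when the unchanged share shrinks over time
    intercept = y_mean - slope * t_mean
    half_life = math.log(2) / -slope
    return half_life, intercept

half_life, intercept = fit_half_life(months, counts, total)
print(f"half-life: {half_life:.1f} months")
print(f"fitted unchanged share at t=0: {math.exp(intercept):.2f}")
```

A pure exponential would put the fitted share at t=0 close to 1.00. If it comes out far from that, a single half-life is a poor summary of these three points, which would be one more reason to want the raw data.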
Ne mæg werig mod wyrde wiðstondan, ne se hreo hyge helpe gefremman.
The nature of information is decidedly ephemeral compared to the static nature of much of the web. Perhaps the surge in Weblogging has altered this dynamic even more than the hypercommercialization, but I'll dispute the 60% figure if it is based only on those four phrases. Much of the early Web was fairly static research and information hosted on .edu domains from what I gather. Since the tide shifted away to .commercialization and tripe, the nature of "information" has little to do with the state of the web, and more to do with tidiness. How much of the Web is long abandoned fan sites and dusty old means abandoned from the "information superhighway"?
In fact, Information Superhighway would be a great data point for this subject. Another consideration, which would be difficult to accommodate, is the reality of mirrors and shuffling pages to different URLs.
Most importantly, I strongly hope that your "interesting application" never gets implemented, because I can see no application of the resulting data that doesn't make my blood run cold. Psychological Warfare and hostile advertising are the bane of the Post-WWII US, and (likely) the world. Propaganda is a pernicious technology, and I fear further development in this area.
Okay, I'll admit that was a touch trollish. Because the Psych. Warfare genie was already released from its Nazi bottle and invited into the US (along with other valuable sciences), it's a little late to advocate repression of this technology. Yet I still reel from my country's increasingly malevolent commercialism aspects, which have spun off from Capitalism without any of Capitalism's redeeming social aspects. I almost want to become a socialist, until I consider that this state of affairs sprung from the National Socialist state.
In any case, while the WWW may be evolving, it certainly isn't in the Darwinian sense that was likely intended. Vestigial Geocities homepages long abandoned are plentiful, and are less temporary, giving search engines a better shot at crawling than dynamic, or "living" news portals. This sickly "creature" is more of a construction than the product of evolution (unless you consider pre-Charles Darwin senses of the word). If you want to research the nature of information and survivability/mutability, the Freenet Project would provide a much more fruitful environment, if it ever reached widespread usage. I would have less strenuous objections to classifying the Freenet an "ever-evolving creature".
Actually (and unfortunately for any haters of the Evil that lies in the lands of Redmond) Headline News had this lovely little chart on recently, which showed public approval of several companies. Enron and Arthur Andersen had 9 and 11% approval ratings, respectively, while the big "winner" was Microsoft, with something like a 79% approval rating.
Let's face facts here. We might hate Microsoft, but the vast majority of people do not. Good? Bad? Indifferent?
Kierthos
Mr. Hu is not a ninja.
Once you have put a page on the Web, you need to keep it there indefinitely. Read more. Slow news day, eh?
I don't claim this is the authoritative answer, or an in-depth study, but the raw data comes from Bill's very own MSN search: bill gates sucks, check it out...
Google SOAP thing for compare-stuff is in the pipeline...
Our weblogs show that Google visits our site (www.up.org.nz) at least monthly, and it is by no means a huge traffic drawing site in the global sense. Its last visit was on 13th April, drawing 1888 hits...
While the numbers clearly aren't totally random, they are very fragile indeed. Some people have had a change of two orders of magnitude, within a week. And in these cases, there have usually been no real world events that could explain such a change. I guess the google page hits numbers depend as much on the internal google structure, as on the number of actual pages on the web.
So I doubt google page hits statistics are a useful research tool. Nonetheless, it can be fun. Here are some google hall of fame lists:
- A list of the most famous Danes according to google.
- A list of free software celebrities according to google.
- A list of Emacs contributors sorted according to google hits.
- A list of sequential artists sorted according to google hits.
- A list of OS (Kernel) Mindshare sorted according to google hits.
PS: Mail me to suggest new entries to the lists.
What scares me here is the conclusion that web sites need to change their content 60% every 3 months. This is not freshness, this is reorganizing to re-organize. If you are considering doing this, you had better seriously re-consider your future. It's an interesting study but a good meme doesn't die simply because the catch-phrases are tired.
At faculty meetings at our school I sit with a bingo card. On it are a series of catch-phrases. We listen for the catch-phrases and shout out when we have finished our cards. B***SH*T is the game and to reduce your content to a series of reorganized catch-phrases is like having a marketing guy develop foreign policy.
Anyone willing to write the Perl module that searches for the latest catch-phrases and inserts them randomly into your web content? Yeesh!