Using Google to Calculate Web Decay


Posted by ryuzaki0 on Monday April 29, 2002 @08:22PM from the bit-rot-quantified dept.

scottennis writes: "Google has yet another application: measuring the rate of decay of information on the web. By plotting the number of results at 3, 6, and 12 months for a series of phrases, this study claims to have uncovered a corresponding 60-70-80 percent decay rate. Essentially, 60% of the web changes every 3 months." You may be amused by some of the phrases he notes as exceptional, too.


Obligatory Full Text by rosewood · 2002-04-29 20:41 · Score: 5, Informative

I only do this since I know an angelfire page will get /. and reach bandwidth limits fast! However, there is a pretty excel chart on there so bookmark and come back much later.

Web Decay
by Scott Ennis
4/26/2002
Knowing how anxious most companies are to keep their web content "fresh," I was curious how "fresh" the web itself was.

In order to come up with a freshness rating for the web you need to sample a very large number of pages. Not wanting to do this, I opted to use the Google search engine as a method for reviewing the web as a whole.

My hypothesis is this: By searching Google using some common English phrases and returning results at various time points, a baseline can be reached for the common rate of freshness of overall web content.

I took the total number of pages found for each given phrase at 3, 6, and 12 months. I calculated a percentage for each of these points based on the total number of results found with no date specified.

For example:

Phrase               3 mos.   6 mos.   12 mos.   Total
buy low sell high    4700     5470     6200      7830
                     60%      70%      79%       100%
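The percentages in the example row can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above, not the author's actual spreadsheet; the function name `freshness_percentages` and the dictionary layout are assumptions made for illustration.

```python
def freshness_percentages(counts, total):
    """Express each date-restricted result count as a whole-number
    percentage of the total result count (no date specified)."""
    return {label: round(100 * n / total) for label, n in counts.items()}


# Result counts for "buy low sell high", taken from the example row.
counts = {"3 mos.": 4700, "6 mos.": 5470, "12 mos.": 6200}
result = freshness_percentages(counts, total=7830)
print(result)  # {'3 mos.': 60, '6 mos.': 70, '12 mos.': 79}
```

Rounding to whole percentages matches the 60/70/79 figures shown for this phrase.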

Note:
This method excludes any pages that are not text and, more specifically, not English text.
This method relies on a random sampling of phrases.

Using this methodology I determined that the average rate of decay of the web follows a 60-70-80 percent decline at 3, 6, and 12 months.

Therefore, if a company wants to maintain a freshness rate on par with the web as a whole, their site content should be updated at the inverse rate. In other words:
60% of the site should change every 3 months
70% of the site should change every 6 months
80% of the site should change every 12 months
The only way to do this effectively is to either have a very small site, or have a site with dynamically generated information.

The following graph shows the decay rate for a few phrases. I selected these phrases to display because of their unique characteristics.
bill gates sucks--This phrase had the lowest decay rate of any phrases I searched.
life's short play hard--This phrase had the greatest decay rate of any I searched (note: this search was also very small).
blessed are the cheesemakers--This phrase was relatively small, but demonstrates that quantity of pages may not be important in determining decay rate.
late at night--This phrase returned the highest number of results of any I searched and yet it also adheres closely to the 60-70-80 rule.

Conclusion:

Web content decays at a uniform, determinable rate. Sites wanting to optimize their content freshness need to maintain a rate of freshness that corresponds to the rate of web decay.

--
The ultimate network admin tool needs HELP!
The guy who posted this may have made a mistake. by eugene+ts+wong · 2002-04-29 21:10 · Score: 2, Informative

Essentially, 60% of the web changes every 3 months.
I think that is incorrect, according to the "researcher". He should have said, "Essentially, 60% of the web is getting older every 3 months."

--
testing out my trending skills
Google Study in Another Place by scottennis · 2002-04-30 01:24 · Score: 5, Informative

The study I posted on Angelfire appears to have reached a bandwidth threshold. I've made the same study available here:

http://helen.lifeseller.com/webdecay.html

I've also included a link to the raw data I used.

--
Read any good sonnets lately?
Mirror Site by scottennis · 2002-04-30 02:58 · Score: 2, Informative

http://helen.lifeseller.com/webdecay.html

--
Read any good sonnets lately?
Re:Thought and mod_rewrite are the key by Fweeky · 2002-04-30 06:43 · Score: 3, Informative

> By simply adding &threshold=-1 to the end of that, I can see all the replies [slashdot.org] at -1 easily and painlessly.

The argument wasn't "query strings are bad, m'kay"; look at the URI and see what information's in there. Does .pl serve any purpose? Does sid=3188 gain anything, aside from making the page very difficult to serve statically? Do tid=95 and pid=3436807 tell you anything?

The URIs would work just as well using something like http://slashdot.org/stories/31884/comments/3436807/5/?mode=nested&threshold=-1; even if /stories/31884 were a static file, the URI would still work and point roughly to the right place. And it's not exposing the internals of how the comments system works, and it's keeping the more readily tunable query strings clear, without making the exact resource you're pointing to difficult to find.

> Do you know how to make k5's comments nested instead of threaded purely using the URI?

No. Actually, I wasn't really pointing out k5 as being the perfect example; Scoop actually tends to really suck in this respect (like setting the URI to '/' when you change comment modes). However, I might be tempted to ask you which URI is likely to live the longest, certainly back when SlashDot used to archive articles after a couple of weeks.

> The point is, whether or not it takes the optimum number of bytes isn't always the priority

I never once said the size of the URI was important. I said it contained a lot of extraneous information that changed a lot while meaning little (i.e. the URIs changed from the dynamic query string to an .shtml file when a story was archived).

> in the case of /., it's designed to be easy to use for the (savvy) user, not easy on the server

What's easier for the "savvy" user? A URI that will work for the rest of SlashDot's life, or one that'll last until the story is archived, or the underlying architecture changes, and which contains a lot of randomly ordered and mostly meaningless information?

A well-designed URI scheme will actually give the savvy user a lot more control; say you include the date of an article, à la http://slashdot.org/stories/2002/05/30/; you can imagine going to such a URI and getting all of the stories on that day, month or year, and instantly being able to identify how old a linked-to article is. You can also imagine an archived URI and a live, dynamic URI both using the same schema.
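As a rough illustration of the date-based idea, here is a small Python sketch that reads the year, month, or day straight out of the path. The `/stories/YYYY/MM/DD/` layout is the hypothetical scheme from this comment, not anything SlashDot actually serves, and `date_scope` is a made-up helper name.

```python
import re
from urllib.parse import urlsplit

# Hypothetical scheme: /stories/YYYY/, /stories/YYYY/MM/ or /stories/YYYY/MM/DD/
_DATE_PATH = re.compile(r"^/stories/(\d{4})(?:/(\d{2}))?(?:/(\d{2}))?/?$")


def date_scope(uri):
    """Return the date components present in the path as a tuple of ints,
    or None if the path does not follow the hypothetical date scheme."""
    match = _DATE_PATH.match(urlsplit(uri).path)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


print(date_scope("http://slashdot.org/stories/2002/05/30/"))  # (2002, 5, 30)
print(date_scope("http://slashdot.org/stories/2002/05/"))     # (2002, 5)
print(date_scope("http://slashdot.org/stories/2002/"))        # (2002,)
```

The length of the tuple tells a server (or a reader) whether the request is for a day, a month, or a whole year of stories.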

You can also imagine giving a URI of an interesting article to a friend without first having to decode the query string; just strip off anything after /comments and they get the story.
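Stripping a comment URI down to its story is then a one-liner in spirit. The Python sketch below assumes the hypothetical `/stories/<id>/comments/...` layout proposed earlier in this comment (with the example URI written without stray spaces); `story_uri` is an illustrative name, not part of any real SlashDot or Scoop code.

```python
from urllib.parse import urlsplit, urlunsplit


def story_uri(comment_uri):
    """Drop the query string and everything from /comments onward,
    leaving a URI that points at the story itself."""
    parts = urlsplit(comment_uri)
    # partition() returns the whole path unchanged if "/comments" is absent.
    story_path, _, _ = parts.path.partition("/comments")
    return urlunsplit(
        (parts.scheme, parts.netloc, story_path.rstrip("/") + "/", "", "")
    )


uri = "http://slashdot.org/stories/31884/comments/3436807/5/?mode=nested&threshold=-1"
print(story_uri(uri))  # http://slashdot.org/stories/31884/
```

No knowledge of sid, tid, or pid parameters is needed, which is the point being made about keeping the resource path readable.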

Note: This applies to any site; those particular SlashDot and k5 URIs were just examples.