Slashdot Mirror


Using Google to Calculate Web Decay

scottennis writes: "Google has yet another application: measuring the rate of decay of information on the web. By plotting the number of results at 3, 6, and 12 months for a series of phrases, this study claims to have uncovered a corresponding 60-70-80 percent decay rate. Essentially, 60% of the web changes every 3 months." You may be amused by some of the phrases he notes as exceptional, too.

11 of 208 comments

  1. Obligatory Full Text by rosewood · · Score: 5, Informative

    I only do this since I know an angelfire page will get /. and reach bandwidth limits fast! However, there is a pretty excel chart on there so bookmark and come back much later.

    Web Decay
    by Scott Ennis
    4/26/2002
    Knowing how anxious most companies are to keep their web content "fresh," I was curious how "fresh" the web itself was.

    In order to come up with a freshness rating for the web you need to sample a very large number of pages. Not wanting to do this, I opted to use the Google search engine as a method for reviewing the web as a whole.

    My hypothesis is this: By searching Google using some common english phrases and returning results at various time points, a baseline can be reached for the common rate of freshness of overall web content.

    I took the total number of pages found for each given phrase at 3, 6, and 12 months. I calculated a percentage for each of these points based on the total number of results found with no date specified.

    For example:

    Phrase               3 mos.   6 mos.   12 mos.   Total
    buy low sell high    4700     5470     6200      7830
                         60%      70%      79%       100%
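
    A minimal sketch of the percentage step described above, using the "buy low sell high" counts from the example row. The function name and the rounding to whole percents are assumptions made for illustration; the study itself only states that each dated count was divided by the undated total.

```python
# Sketch of the percentage step: each dated count divided by the undated total.
# The counts are the "buy low sell high" figures quoted in the example row;
# the function name and whole-number rounding are illustrative assumptions.

def freshness_percentages(counts_by_months, total):
    """Return {months: whole-number percent of total} for each time window."""
    return {
        months: round(100 * count / total)
        for months, count in counts_by_months.items()
    }


counts = {3: 4700, 6: 5470, 12: 6200}
total = 7830

percentages = freshness_percentages(counts, total)
for months, pct in percentages.items():
    print(f"{months:>2} months: {pct}%")
```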

    Note:
    This method excludes any pages which are not text and more specifically, not English text.
    This method relies on a random sampling of phrases.
    Using this methodology I determined that the average rate of decay of the web follows a 60-70-80 percent decline at 3, 6, and 12 months.

    Therefore, if a company wants to maintain a freshness rate on par with the web as a whole, their site content should be updated at the inverse rate. In other words:
    60% of the site should change every 3 months
    70% of the site should change every 6 months
    80% of the site should change every 12 months
    The only way to do this effectively is to either have a very small site, or have a site with dynamically generated information.

    The following graph shows the decay rate for a few phrases. I selected these phrases to display because of their unique characteristics.
    bill gates sucks--This phrase had the lowest decay rate of any phrases I searched.
    life's short play hard--This phrase had the greatest decay rate of any I searched (note: this search was also very small).
    blessed are the cheesemakers--This phrase was relatively small, but demonstrates that quantity of pages may not be important in determining decay rate.
    late at night--This phrase returned the highest number of results of any I searched and yet it also adheres closely to the 60-70-80 rule.

    Conclusion:

    Web content decays at a uniform, determinable rate. Sites wanting to optimize their content freshness need to maintain a rate of freshness that corresponds to the rate of web decay.

  2. Study: World Wide Web sites and page persistence by Seth+Finkelstein · · Score: 5, Interesting
    For a more extensive (although older) study, take a look at

    Digital libraries and World Wide Web sites and page persistence

    That said, the Web and its component parts are dynamic. Web documents undergo two kinds of change. The first type, the type addressed in this paper, is "persistence" or the existence or disappearance of Web pages and sites, or in a word the lifecycle of Web documents. "Intermittence" is a variant of persistence, and is defined as the disappearance but reappearance of Web documents. At any given time, about five percent of Web pages are intermittent, which is to say they are gone but will return. Over time a Web collection erodes. Based on a 120-week longitudinal study of a sample of Web documents, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. That is to say, an unweeded Web document collection created two years ago would contain the same number of URLs, but only half of those URLs point to content. The second type of change Web documents experience is change in Web page or Web site content. Again based on the Web document samples, very nearly all Web pages and sites undergo some form of content change within the period of a year. Some change content very rapidly while others do so infrequently (Koehler, 1999a). This paper examines how Web documents can be efficiently and effectively incorporated into library collections. This paper focuses on Web document lifecycles: persistence, attrition, and intermittence.

    Sig: What Happened To The Censorware Project (censorware.org)

  3. The Web is decaying by Anonymous Coward · · Score: 5, Funny
    It is now official - Netcraft has confirmed: The web is decaying

    Yet another crippling bombshell hit the beleaguered web community when recently IDC confirmed that the web accounts for less than a fraction of 1 percent of all server usage. Coming on the heels of the latest Netcraft survey which plainly states that the web has lost more market share, this news serves to reinforce what we've known all along. The web is collapsing in complete disarray, as further exemplified by failing dead last in the recent Sys Admin comprehensive networking usage test.

    You don't need to be a Kreskin to predict the web's future. The hand writing is on the wall: the web faces a bleak future. In fact there won't be any future at all for the web because the web is decaying. Things are looking very bad for the web. As many of us are already aware, the web continues to lose market share. Red ink flows like a river of blood. Dot-coms are the most endangered of them all, having lost 93% of their core developers.

    Let's keep to the facts and look at the numbers.

    The web leader Theo states that there are 7000 users of the web. How many users of other protocols are there? Let's see. The number of the web versus other protocols posts on Usenet is roughly in ratio of 5 to 1. Therefore there are about 7000/5 = 1400 other protocols users. Web posts on Usenet are about half of the volume of other protocols posts. Therefore there are about 700 users of the web. A recent article put the web at about 80 percent of the HTTP market. Therefore there are (7000+1400+700)*4 = 36400 web users. This is consistent with the number of Usenet posts about the web.

    Due to the troubles of Walnut Creek, abysmal sales and so on, the web went out of business and was taken over by Slashdot who sell another troubled web service. Now Slashdot is also dead, its corpse turned over to yet another charnel house.

    All major surveys show that the web has steadily declined in market share. The web is very sick and its long term survival prospects are very dim. If the web is to survive at all it will be among hobbyist dabblers. The web continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, the web is dead.

    Fact: the web is dead.

  4. Better article needed by Raedwald · · Score: 5, Interesting

    I'm not impressed. The article does not define what he means by decay, or how he measured it, except in the vaguest of terms. The analysis of the data is poor; anyone interested in decay would suspect some kind of exponential decay. They would therefore plot the data logarithmically, and perhaps calculate a half-life. Piss poor.
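
    A rough sketch of the half-life check suggested here, applied to the 60/70/79% figures from the article's example row. It assumes, and this is only an assumption, that the share of results dated within the last t months approximates the fraction of pages changed within t months, so that one minus that share is the fraction still unchanged. The function name is made up for illustration.

```python
import math

# Assumption: the share of results dated within the last t months (60/70/79%,
# from the article's example row) approximates the fraction of pages changed
# within t months, so 1 - share is the fraction still unchanged after t months.
changed_share = {3: 0.60, 6: 0.70, 12: 0.79}


def implied_half_life(months, unchanged_fraction):
    """Half-life in months if unchanged(t) = exp(-k * t) held exactly."""
    k = -math.log(unchanged_fraction) / months
    return math.log(2) / k


half_lives = {
    t: implied_half_life(t, 1 - share) for t, share in changed_share.items()
}

for t, h in half_lives.items():
    log_unchanged = math.log(1 - changed_share[t])
    print(f"{t:>2} months: ln(unchanged) = {log_unchanged:.2f}, "
          f"implied half-life = {h:.1f} months")
```

    If the decay were a single exponential, the three implied half-lives would agree. With these figures they do not, which suggests that either the dated result counts are not a clean measure of page change or the process is not a simple exponential.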

    --
    Ne mæg werig mod wyrde wiðstondan, ne se hreo hyge helpe gefremman.
  5. we've lost the ability to rely on hyperlinks by thegoldenear · · Score: 5, Insightful

    Tim Berners-Lee wrote: "There are no reasons at all in theory for people to change URIs (or stop maintaining documents), but millions of reasons in practice." (http://www.w3.org/Provider/Style/URI) He advocated creating a web where documents could last, say, 20 years or more.

  6. Google/CowboyNeal Study by BoBaBrain · · Score: 5, Funny

    On a similar note, I was curious to see what the CowboyNeal content of the web is. As luck would have it, a precise answer can be found easily.

    Google gives us the following interesting results:

    3,840,000 sites contain the word Cheese.

    1,640 sites contain the words CowboyNeal and Cheese.

    Therefore, 4.27083333333333333333333333333e-2% of cheese related sites contain a reference to CowboyNeal.

    As cheese is a randomly chosen word with no special connection to CowboyNeal it is reasonable to assume that 4.27083333333333333333333333333e-2% of all sites contain a reference to The Cowboy (Assuming the number of sites dedicated to CowboyNeal equals the number dedicated to ignoring him).

    So there we have it. The web is 99.957291666666666666666666666667% CowboyNeal free. :)


    I said the results were "precise", not "accurate". :P

    --
    I am a Karma Library.
  7. Re:bill gates sucks... by Kierthos · · Score: 5, Interesting

    Actually (and unfortunately for any haters of the Evil that lies in the lands of Redmond) Headline News had this lovely little chart on recently, which showed public approval of several companies. Enron and Arthur Andersen had 9% and 11% approval ratings, respectively, while the big "winner" was Microsoft, with something like a 79% approval rating.

    Let's face facts here. We might hate Microsoft, but the vast majority of people do not. Good? Bad? Indifferent?

    Kierthos

    --
    Mr. Hu is not a ninja.
  8. Heh.. Talk about web decay. by Bowie+J.+Poag · · Score: 5, Funny



    Looks like 100% of the link mentioned in this article decayed in a little under 5 minutes! ;)
    Cheers,

    --
    Bowie J. Poag

  9. Wide jump from findings to conclusion by gpmart · · Score: 5, Interesting
    In fact, I would argue that good content need not change. Aside from the obvious issues with the small sampling of phrases, the web is, thankfully, not just a series of catch-phrases. In fact, it was designed to carry complex information such that it could not be reduced.

    What scares me here is the conclusion that web sites need to change their content 60% every 3 months. This is not freshness, this is reorganizing to re-organize. If you are considering doing this, you had better seriously re-consider your future. It's an interesting study, but a good meme doesn't die simply because the catch-phrases are tired.

    At faculty meetings at our school I sit with a bingo card. On it are a series of catch-phrases. We listen for the catch-phrases and shout out when we have finished our cards. B***SH*T is the game and to reduce your content to a series of reorganized catch-phrases is like having a marketing guy develop foreign policy.

    Anyone willing to write the perl module that searches for the latest catch-phrases and inserts them randomly into your web content? Yeesh!

  10. Google Study in Another Place by scottennis · · Score: 5, Informative

    The study I posted on Angelfire appears to have reached a bandwidth threshold. I've made the same study available here:

    http://helen.lifeseller.com/webdecay.html

    I've also included a link to the raw data I used.

  11. Thought and mod_rewrite are the key by Fweeky · · Score: 5, Insightful

    The key to making links that don't rot is to design a URI schema that's both independent of any redesigns of your site and independent of any particular way of doing things.

    Let's look at a few examples.

    The URI to this page is http://slashdot.org/comments.pl?sid=31884&op=Reply&threshold=3&commentsort=3&tid=95&mode=nested&pid=3434535 - what is it telling you that it doesn't need to?

    Well, for a start, that .pl is a bad idea. What happens in 4 years time when SlashDot is running on PHP, or Java, or Perl 7, or a Perl Server Page, or ASP? Then there's the difficult-to-decode query string that tells you nothing about the link other than "this is the information the server needs to locate your page at the moment", and doesn't give you much faith in it living forever.

    Now let's look at an equivalent Kuro5hin URI.

    http://www.kuro5hin.org/comments/2002/4/29/22137/6511/51/post#here is a URI to reply to a random comment on k5.

    For a start, you can't tell what application or script is serving you the page, and you can't see what type of file it's linking to; both these things can and will change over time.

    Second, there's a date embedded in there; you can see the developers, if they ever decide to change the meaning of '/comments', using that date as a reference; if the URI is before the change, they can map it onto the new schema or pass it onto legacy code.

    Having the date in the URI is good because it allows you to determine when the link was issued, and map it onto any changes or pass it off to a legacy system as required.

    Now let's take an apparently good link on my now horribly out of date site, aagh.net.

    http://www.aagh.net/php/style/ links to an article on PHP coding style.

    Certainly, hiding the fact that I'm using PHP to serve this document is good, and shortening the URI to remove the useless querystring is good (you can't see one? Good, that's the point), however, this URI may well stop working in a few weeks; I'm planning a redesign and the old schema may well not fit in well with it.

    A short yyyymm in there could have made all the difference; a simple if check on the URI's issue date would keep it working.
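
    A minimal sketch of that "simple if check", assuming a hypothetical URI shape of /<yyyymm>/<rest-of-path> and a hypothetical redesign cutover month. Neither the path layout, the cutover value, nor the handler names come from any real site; they only illustrate routing on the issue date.

```python
import re

# Hypothetical cutover month for a redesign, in yyyymm form (an assumption).
REDESIGN_YYYYMM = 200206

# Hypothetical URI shape: /<yyyymm>/<rest-of-path>
DATED_PATH = re.compile(r"^/(\d{6})/(.*)$")


def route(path):
    """Return (handler_name, remaining_path) for a dated URI path."""
    match = DATED_PATH.match(path)
    if match is None:
        # No issue date in the path, so there is nothing to route on.
        return ("not_found", path)
    issued = int(match.group(1))
    rest = match.group(2)
    # The "simple if check": links issued before the redesign go to legacy code.
    if issued < REDESIGN_YYYYMM:
        return ("legacy", rest)
    return ("current", rest)


print(route("/200204/php/style/"))
print(route("/200209/php/style/"))
```

    Because the date is part of the link itself, old links keep resolving through the legacy branch no matter how the current schema changes afterwards.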

    The moral of the story: Think about your URIs when you're designing a site. Try to remove as much data as you can without painting yourself into a corner.