Slashdot Mirror


Using Google to Calculate Web Decay

scottennis writes: "Google has yet another application: measuring the rate of decay of information on the web. By plotting the number of results at 3, 6, and 12 months for a series of phrases, this study claims to have uncovered a corresponding 60-70-80 percent decay rate. Essentially, 60% of the web changes every 3 months." You may be amused by some of the phrases he notes as exceptional, too.

18 of 208 comments

  1. At last! by ringbarer · · Score: 1, Interesting

    This kind of thing can be a good application of Google's SOAP interface!

    --
    "Why did they cancel my favorite Sci-Fi show? I downloaded ALL the episodes!"
  2. Google's collection of the data by Fucky the troll · · Score: 2, Interesting

    Are google claiming that they can check through the entire internet inside a timescale of 3 months, ready to check through again at the start of the next quarter?

    Surely this can't be true. Check Google's cached pages - see the dates on there?

    Google is turning into another history book.

    --
    Roadkill is yummy.
  3. Not exactly decay... by QuantumFTL · · Score: 4, Interesting

    It seems to me that in a way, the web is like an organism, whose smaller constituents are constantly (or not so constantly, depending on the webmaster) renewing themselves. It's a truly adaptive medium, and thus drastic change in short times like this as interest shifts should be quite expected.

    That said, this is one of the many ways in which Google is an invaluable tool for research. Not just finding information, but generating it. Thanks Google!

  4. Web Death by svwolfpack · · Score: 4, Interesting

    It would also be interesting to see how much of the web no longer exists... like at what rate the web is dying. God knows there are enough dead links out there...

    1. Re:Web Death by Enocasiones · · Score: 4, Interesting
      Educators and "link rot".

      In a paper to be published in the June issue of the Journal of Science Education and Technology, Brooks and Markwell likened the rate of link rot to the type of "extinction equation" commonly used to describe natural processes such as radioactive decay. They wrote that the hyperlinks in their study had an expected "half-life" of 55 months.

      Also this, which is just a link from the previous article.

      Easy! :)

      (web's half-life -game -unreal -counter -gamers)

      --
      Enoc
  5. Study: World Wide Web sites and page persistence by Seth Finkelstein · · Score: 5, Interesting
    For a more extensive (although older) study, take a look at

    Digital libraries and World Wide Web sites and page persistence

    That said, the Web and its component parts are dynamic. Web documents undergo two kinds of change. The first type, the type addressed in this paper, is "persistence" or the existence or disappearance of Web pages and sites, or in a word the lifecycle of Web documents. "Intermittence" is a variant of persistence, and is defined as the disappearance but reappearance of Web documents. At any given time, about five percent of Web pages are intermittent, which is to say they are gone but will return. Over time a Web collection erodes. Based on a 120-week longitudinal study of a sample of Web documents, it appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years. That is to say, an unweeded Web document collection created two years ago would contain the same number of URLs, but only half of those URLs point to content. The second type of change Web documents experience is change in Web page or Web site content. Again based on the Web document samples, very nearly all Web pages and sites undergo some form of content change within the period of a year. Some change content very rapidly while others do so infrequently (Koehler, 1999a). This paper examines how Web documents can be efficiently and effectively incorporated into library collections. This paper focuses on Web document lifecycles: persistence, attrition, and intermittence.
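
    As a rough illustration of the half-life figure in the abstract, the sketch below assumes pure exponential decay and a half-life of exactly 24 months (the abstract only says "somewhat less than two years"). The function name surviving_fraction is made up for this example and does not come from the paper.

```python
def surviving_fraction(months_elapsed: float, half_life_months: float) -> float:
    """Fraction of URLs still pointing to content, assuming exponential decay."""
    return 0.5 ** (months_elapsed / half_life_months)


# Assumed half-life of 24 months, an upper bound on the abstract's figure.
ASSUMED_HALF_LIFE_MONTHS = 24.0

for months in (6, 12, 24, 48):
    share = surviving_fraction(months, ASSUMED_HALF_LIFE_MONTHS)
    print(f"after {months:2d} months: {share:.3f} of URLs still resolve")
```

    Under these assumptions a two-year-old unweeded collection keeps half of its working URLs, which matches the abstract's own example. A real collection need not decay this smoothly.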

    Sig: What Happened To The Censorware Project (censorware.org)

  6. Credibility? by Gossy · · Score: 2, Interesting
    Is it me, or does this 'research' simply look like something a bored guy has just thrown together from a few minutes' work, then submitted to Slashdot to see if it gets posted?

    From the evidence, he searched for very few phrases. The sample size is way too low to be representative of the web - which some estimates put at several billion more pages than there are people on the planet! There are no signs of more than about 5 different phrases being searched for here..

    Can a few simple searches on Google really generate a large enough sample to draw such large conclusions?

    The report is one page long, hosted on Angelfire. There is no substantial data to back up his claims. Is this report reliable in any way?

    I'm amazed this got posted on the front page of Slashdot..

  7. archive.org by mmThe1 · · Score: 3, Interesting

    This makes the job of Archive.org-like sites damn tough.

    P.S. Are we losing information at a comparable rate to generation....?

  8. interesting but... by lowLark · · Score: 3, Interesting

    He creates a problem for himself by not providing us with his raw data, making any subsequent verification of the trend difficult. In fact, the one data set he gives us:
    Phrase              3 mos   6 mos   12 mos   Total
    buy low sell high    4700    5470     6200    7830
                          60%     70%      79%    100%
    seems to demonstrate the opposite of the trend that he describes. Indeed, a current search on Google shows about 1,270,000 results (which makes you wonder when he did his searches, given that the current count differs by orders of magnitude). The methodology also fails to take into account any growth in the size of the web, which could mask the effects of decay.
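
    The percentages in that data set can be reproduced from the raw counts. A minimal sketch follows; the variable names are invented for illustration and the numbers are the ones quoted above. It also measures the gap to the later count of about 1,270,000.

```python
import math

# Result counts quoted in this comment for "buy low sell high".
counts_by_window = {3: 4700, 6: 5470, 12: 6200}  # months -> number of results
total_results = 7830

# Each window's share of the total, rounded to a whole percent.
shares = {
    months: round(100 * count / total_results)
    for months, count in counts_by_window.items()
}
print(shares)

# Gap between the quoted total and the later count of about 1,270,000.
later_total = 1_270_000
orders_of_magnitude = math.log10(later_total / total_results)
print(round(orders_of_magnitude, 1))
```

    The shares come out as the 60%, 70% and 79% shown in the table, and the later count is a little over two orders of magnitude larger than the quoted total.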

  9. Better article needed by Raedwald · · Score: 5, Interesting

    I'm not impressed. The article does not define what he means by decay, or how he measured it, except in the vaguest of terms. The analysis of the data is poor; anyone interested in decay would suspect some kind of exponential decay. They would therefore plot the data logarithmically, and perhaps calculate a half life. Piss poor.
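
    A minimal sketch of that kind of analysis, using the only figures quoted in this thread (60%, 70% and 79% at 3, 6 and 12 months). It rests on a loud assumption: that those shares mean "fraction of pages changed within t months", so one minus the share is the fraction unchanged. Three points are far too few for a trustworthy fit; this only shows the mechanics.

```python
import math

# Assumption: the quoted 60/70/79% shares are "fraction changed within t
# months", so 1 - share is the fraction still unchanged after t months.
months = [3.0, 6.0, 12.0]
unchanged = [0.40, 0.30, 0.21]

# Ordinary least-squares fit of ln(unchanged) = intercept + slope * months.
log_unchanged = [math.log(u) for u in unchanged]
mean_t = sum(months) / len(months)
mean_y = sum(log_unchanged) / len(log_unchanged)
covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(months, log_unchanged))
variance = sum((t - mean_t) ** 2 for t in months)
slope = covariance / variance
intercept = mean_y - slope * mean_t

# For exponential decay, the half-life is ln(2) divided by the decay rate.
half_life_months = math.log(2) / -slope
print(f"half-life: {half_life_months:.1f} months")
print(f"fitted unchanged fraction at t=0: {math.exp(intercept):.2f}")
```

    If the fitted fraction at t=0 comes out well below 1, the points do not sit on a single clean exponential, which is itself a reason to distrust a headline decay rate.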

    --
    Ne mæg werig mod wyrde wiðstondan, ne se hreo hyge helpe gefremman.
  10. Information vs WWW by castlan · · Score: 1, Interesting

    The nature of information is decidedly ephemeral compared to the static nature of much of the web. Perhaps the surge in Weblogging has altered this dynamic even more than the hypercommercialization, but I'll dispute the 60% figure if it is based only on those four phrases. Much of the early Web was fairly static research and information hosted on .edu domains from what I gather. Since the tide shifted away to .commercialization and tripe, the nature of "information" has little to do with the state of the web, and more to do with tidiness. How much of the Web is long abandoned fan sites and dusty old means abandoned from the "information superhighway"?

    In fact, Information Superhighway would be a great data point for this subject. Another consideration, which would be difficult to accommodate, is the reality of mirrors and shuffling pages to different URLs.

    Most importantly, I strongly hope that your "interesting application" never gets implemented, because I can see no application of the resulting data that doesn't make my blood run cold. Psychological Warfare and hostile advertising are the bane of the Post-WWII US, and (likely) the world. Propaganda is a pernicious technology, and I fear further development in this area.

    Okay, I'll admit that was a touch trollish. Because the Psych. Warfare genie was already released from its Nazi bottle and invited into the US (along with other valuable sciences), it's a little late to advocate repression of this technology. Yet I still reel from my country's increasingly malevolent commercialism aspects, which have spun off from Capitalism without any of Capitalism's redeeming social aspects. I almost want to become a socialist, until I consider that this state of affairs sprung from the National Socialist state.

    In any case, while the WWW may be evolving, it certainly isn't in the Darwinian sense that was likely intended. Vestigial Geocities homepages long abandoned are plentiful, and are less temporary, giving search engines a better shot at crawling than dynamic, or "living" news portals. This sickly "creature" is more of a construction than the product of evolution (unless you consider pre-Charles Darwin senses of the word). If you want to research the nature of information and survivability/mutability, the Freenet Project would provide a much more fruitful environment, if it ever reached widespread usage. I would have less strenuous objections to classifying the Freenet an "ever-evolving creature".

  11. Re:bill gates sucks... by Kierthos · · Score: 5, Interesting

    Actually (and unfortunately for any haters of the Evil that lies in the lands of Redmond) Headline News had this lovely little chart on recently, which showed public approval of several companies. Enron and Arthur Andersen had 9 and 11% approval ratings, respectively, while the big "winner" was Microsoft, with something like a 79% approval rating.

    Let's face facts here. We might hate Microsoft, but the vast majority of people do not. Good? Bad? Indifferent?

    Kierthos

    --
    Mr. Hu is not a ninja.
  12. Jakob Nielsen: Web Pages Must Live Forever by jukal · · Score: 3, Interesting

    Once you have put a page on the Web, you need to keep it there indefinitely. Read more. Slow news day, eh?

  13. even Bill thinks he sucks more than ever by maccallr · · Score: 3, Interesting

    I don't claim this is the authoritative answer, or an in-depth study, but the raw data comes from Bill's very own MSN search: bill gates sucks, check it out...

    Google SOAP thing for compare-stuff is in the pipeline...

  14. History book? Not as far as I can tell . . . by phobonetik · · Score: 3, Interesting

    Our weblogs show that Google visits our site (www.up.org.nz) at least monthly, and it is by no means a huge traffic drawing site in the global sense. Its last visit was on 13th April, drawing 1888 hits...

  15. Google "pages found" data by Per Abrahamsen · · Score: 3, Interesting
    I have maintained a number of google celebrity lists, where celebrities in various categories are ranked based on the number of page hits by google.

    While the numbers clearly aren't totally random, they are very fragile indeed. Some people have had a change of two orders of magnitude, within a week. And in these cases, there have usually been no real world events that could explain such a change. I guess the google page hits numbers depend as much on the internal google structure, as on the number of actual pages on the web.

    So I doubt Google page hit statistics are a useful research tool. Nonetheless, it can be fun. Here are some Google hall of fame lists:

    PS: Mail me to suggest new entries to the lists.
  16. Wide jump from findings to conclusion by gpmart · · Score: 5, Interesting
    In fact, I would argue that good content need not change. Aside from the obvious issues with the small sampling of phrases, the web is, thankfully, not just a series of catch-phrases. In fact, it was designed to carry complex information such that it could not be reduced.

    What scares me here is the conclusion that web sites need to change their content 60% every 3 months. This is not freshness, this is reorganizing to re-organize. If you are considering doing this, you had better seriously re-consider your future. It's an interesting study but a good meme doesn't die simply because the catch-phrases are tired.

    At faculty meetings at our school I sit with a bingo card. On it are a series of catch-phrases. We listen for the catch-phrases and shout out when we have finished our cards. B***SH*T is the game and to reduce your content to a series of reorganized catch-phrases is like having a marketing guy develop foreign policy.

    Anyone willing to write the Perl module that searches for the latest catch-phrases and inserts them randomly into your web content? Yeesh!

  17. Correct links by Per Abrahamsen · · Score: 3, Interesting
    All the links were wrong. Hopefully, these are better: