Slashdot Mirror


Web Pages Are Weak Links in the Chain of Knowledge

PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""

37 of 361 comments

  1. Well, by jeffkjo1 · · Score: 5, Interesting

    Really, is there a reason to archive everything in the world? Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

    100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)

    1. Re:Well, by mlush · · Score: 5, Interesting
      Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

      I would be fascinated to see my Great Grandad's first drawings, his school web page, his postings to USENET. I only knew him as an old man ....

      To a historian, often the most interesting stuff is the ephemera; the diary of an ordinary person gives a view of everyday life you will never get from looking at 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.

    2. Re:Well, by 4of12 · · Score: 4, Interesting

      Really, is there a reason to archive everything in the world?

      No, only the good stuff needs to be saved. So what's good and who should save it?

      IMHO, anything that gets officially referenced by another work should be saved.

      That burden should not fall upon the original creator of the referenced work; it should fall upon the creator of the referring work.

      Despite all the hue and cry about lost revenue opportunities from controlled distribution of copyrighted information, knowledge preservation and the overall benefit to society would improve if works were able to save a local cache of referenced works.

      This would also help with the problem of morphing or revisionist works. Some works can be improved by editing (something around here comes to mind), but it would be inappropriate to change old web pages that show an earlier mistake in thinking, to show that somehow someone was particularly prescient, or to erase knowledge for a political agenda (a la Stalin).

      Just a couple of days ago I was able to retrieve an old recipe from the Google cache that had been summarily removed from a web site due to some time retention policy. An attempt to encourage repeat visits to the website because stuff disappears was circumvented. I would have been particularly annoyed with that website were it not for the delayed action of the Google cache. Google may have enabled circumvention of their policy, but they would have garnered a lot more ill will from me if their policy were effective.

      Guess what? References in scientific papers I write are not just available in libraries capable of paying $1K/year subscription rates, but as photocopies in my file cabinet. That is, I have a local cache of referenced works already.

      If a colleague's library did not have the specified volume and journal article, I would let him have a copy for the asking. It's a copyright violation, I know, but I'm not convinced that strict adherence to copyright laws in this case provides the best overall benefit to society.

      --
      "Provided by the management for your protection."
  2. Books have an ISBN... by Advocadus+Diaboli · · Score: 5, Interesting

    ...which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number when articles of interest or knowledge are published there? Then it would be easy to track an article if it's moved to another site or whatever, just by looking up a sort of catalog for these numbers.

    1. Re:Books have an ISBN... by daddywonka · · Score: 4, Interesting

      Why don't we set up a sort of unique web page number when articles of interest or knowledge are published there?

      The article mentions this: "One such system, known as DOI (for digital object identifier), assigns a virtual but permanent bar code of sorts to participating Web pages. Even if the page moves to a new URL address, it can always be found via its unique DOI."

      But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.

  3. then don't look for culture in web pages... by TechnoVooDooDaddy · · Score: 4, Interesting

    honestly, the transient nature of webpages makes them an unsuitable medium for the long term establishment of "culture". our categorization-happy, buzzword-ridden nature will have to find a new term for what the web is. boo-freaking-hoo.. meanwhile i'll keep doing my thing, posting pics for my family to see, putting calendar events up on the web so my homebrew-club will know when we're meeting, and not worry about any "culture" i might be potentially creating then destroying when i take stuff back down.

    man i need coffee, insomnia is a bitch...

    1. Re:then don't look for culture in web pages... by Araneas · · Score: 2, Interesting
      "There really should be a permanent way of storing web pages, and storing them at the state they were at one given moment of time."

      Teach browsers to speak CVS.

  4. Reliability by lukewarmfusion · · Score: 5, Interesting

    It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.

    I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.

    1. Re:Reliability by bubblewrapgrl · · Score: 2, Interesting

      For one science course I took in college, we were told that we could find a source online, but then had to find it in hardcopy (i.e., look up an article on the web, but then also make sure to look it up in a journal). Apparently, there were some issues with students who found information on the web that looked reliable (it was cited from a journal), but the information had been changed by whoever posted the article on a personal site. After that happened, the professor wasn't interested in trusting scientific articles that students found online unless you could prove that you had verified the same article in print.

  5. The final irony? by the+real+darkskye · · Score: 2, Interesting

    That matters in part because some documents exist only as Web pages -- for example, the British government's dossier on Iraqi weapons.
    "It only appeared on the Web," Worlock said. "There is no definitive reference where future historians might find it."
    Much like the WMDs themselves then ...

    --
    Music is everybody's possession.
    It's only publishers who think that people own it.
    Fuck Beta
    ~John Lenno
  6. Yes, big issue! by Erwos · · Score: 4, Interesting

    I've personally been working (internally so far) on a website of modern-day Orthodox-Jewish responsa to various issues of Jewish law, so this is an issue I've given some thought to.

    To say this is some kind of problem specific to the web is misleading. There are old, well-quoted sources of Jewish thought whose texts are simply lost to us in this current day and age. Example: a famous and extremely popular commentary on the Talmud and Torah, Rashi, is missing for at least a few chapters of Talmud. That would be the equivalent of IEEE misplacing some standards papers and then NO ONE having copies, just lost to the sands of time. Yet it did happen, proving this at least _was_ a serious issue.

    However, these days, with such things as the Way-Back Machine and Google caching, actually LOSING entire web pages doesn't happen very often, and, I'd bet, it happens far less frequently than the loss of books.

    -Erwos

    --
    Plausible conjecture should not be misrepresented as proof positive.
  7. Re:Worst Record Keeping by Urkki · · Score: 2, Interesting

    Nah. There was a time when only very very few could even read, let alone write, let alone keep any kind of records...

    But I get your point. Too bad there are some restrictions on copying the web pages you are referencing...

    There should be some service, a bit like google's cache, you could use to store the referenced pages. I submit the page to the service, then provide two links in my own document, one to the original page (which will likely expire eventually) and one to the cached version. I wonder if they could get around copyright issues the same way google cache gets around them, even though this is a bit more permanent storage than google cache... Most web page authors certainly would not have any problem with having their pages archived there, quite the opposite, most would be happy to have their work referenced by others...

  8. Re:Worst Record Keeping by robslimo · · Score: 5, Interesting

    Ummm, maybe only as applies to this topic, which is to say that web pages are a poor place to keep records.

    I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber-seeking and grid-seeking backhoes, etc).

    I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.

    What I think is much more disturbing and disruptive is the pseudo-science and misinformation that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.

  9. web pages as knowledge by Horny+Smurf · · Score: 0, Interesting

    While I use the web as a source of information (information which is unavailable in any other format), I would not cite any information unless I can personally verify it. Would you trust "Anonymous Coward" when he tells you to "click this link"? So why would you trust some random website?

  10. A problem recognized already some time ago.... by tsvk · · Score: 4, Interesting

    Usability expert Jakob Nielsen addressed the issue of linkrot in a column back in 1998: Fighting Linkrot.

  11. archive.org and copyright? by McDutchie · · Score: 5, Interesting
    I've started to keep archived copies of webpages instead of links; the next time you want it, it's gone. Unfortunately you can't share them like links.
    If you can't share them, then how come archive.org can? How come archive.org seems to be above copyright law?
    1. Re:archive.org and copyright? by Jerf · · Score: 5, Interesting

      How come archive.org seems to be above copyright law?

      Archive.org invokes the DMCA safe harbor provisions (see bottom of that page for the DMCA boilerplate), which are described in Title II of the DMCA.

      However, you'll find a careful reading of the DMCA reveals that none of the exclusions really quite applies to them; a good lawyer might be able to get them protected but I would bet against them.

      Mostly they get by because they will remove content if requested, and nobody who cares cares quite enough to sue them on behalf of "the world" when they are satisfied to have their own content removed. In other words, they are basically OK because nobody cares to sue them. Strictly speaking, archive.org probably is the world's largest copyright violation.

      This goes to show that sometimes if you break the law in a big enough way, you can get away with it. ;-)

      (Not responsible for the results of any actions based on taking that sentence to heart. For entertainment purposes only. etc.)

  12. The web can hold insight, in the right field by mactari · · Score: 3, Interesting

    That's a fairly reductionist view if taken too far. Not all researchers are tech whizzes (no pun intended), and I've seen a number of, in my case, professors of English Literature who run the same sort of, "Throw up ten pages with Under Construction signs, test publish a few papers, and let the site sit for years, one day to mysteriously disappear," web site lifespan that "Bob's World" might as well.

    Perhaps even more interestingly, it doesn't always really matter if you've done great, repeatable research in the "soft science" fields or outright humanities. You don't have to be a literature expert to have a good insight on "Bartleby the Scrivener". A grad student's blog, as an example, might contain excellent contributions to the conversation.

    Now that said, in the context of the article -- dealing with "a dermatologist with the Veterans Affairs Medical Center in Denver" -- I would tend to agree with you heartily. Hard science needs to pull, in my layman's view, from research that the article's author researched well enough to see that it wasn't a few 0's and 1's that might be pulled later, in general.

    And heck, what's the harm in saving the pages on your drive and contacting the original author if they disappear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it quite often).

    --

    It's all 0s and 1s. Or it's not.
  13. Cool URIs don't change by KjetilK · · Score: 4, Interesting

    May I remind everyone to read and understand TimBL's Cool URIs don't change. It's not that hard to design systems where you do not have to change the URI every 100 days, folks.

    --
    Employee of Inrupt, Project Release Manager and Community Manager for Solid
  14. cant erase my usenet postings by peter303 · · Score: 5, Interesting

    I started posting to usenet in the late 1980s. These g*dd*mn things are still on the net. I was less guarded at that time. Everyone *knew* that, because disk space was so scarce, usenet postings would disappear in 7-14 days.

  15. Re:DSPACE by tomknight · · Score: 3, Interesting
    Bugger, forgot to log in.

    Look at DSpace, the mission of which is "To create and establish an electronic system that captures, preserves and communicates the intellectual output of MIT's faculty and researchers."

    Each data set (collection) has a handle, supposedly longer lasting than URNs. We're talking about long term data storage here.

    There's an implementation of it at Cambridge University, and my organisation will be evaluating it as soon as the SuSE Linux Enterprise Server software lands on my desk and I've installed my server.

    Tom.

    --
    Oh arse
  16. Re:What's the problem here ? by southpolesammy · · Score: 4, Interesting

    Yes, good point. The Internet is much more akin to CB radio since it is uncontrolled, unverified, entirely volunteer-based, entirely virtual, and highly volatile. By contrast, books, TV, and other media are highly controlled, subject to external verification, have a high cost of entry, are either themselves physical media or require a physical presence in order to communicate, and are largely static in content.

    The problem with the Washington Post's article is that their premise is flawed. They assume that the Internet is a mostly static source of information, when it is definitely a mostly dynamic information source. Webpages are meant to be updated, and with updates come change. It's inevitable. To assume that we keep every update to the webpages in separate locations is a false assumption. It's cool to see sites like the Wayback machine do this, but it's not required.

    --
    Rule #1 -- Politics always trumps technology.
  17. How long does the average conversation take? by freality · · Score: 4, Interesting

    Webpages aren't replacements for books. Or rather, you shouldn't use them that way.

    If they're lasting on average 100 days, that puts them somewhere between transient culture, like spoken conversation, and printed culture, like newspapers. Big deal.

    We want to preserve culture for future generations, no doubt. But we don't want to preserve all culture for future generations. Anything that is lasting for 100 days and isn't being persisted... well, relatively that's not worth much to future culture.

    I don't remember the exact saying, but there is a Native American saying to the effect of "We don't write things down. If we don't remember it, it's not worth remembering." Now, they're not the last word (no pun intended) in wisdom traditions, but there is a certain amount of enforced vitality necessitated by forgetting the details.

    We'd better get used to the idea. We're only going to be forgetting more and more of the details as we generate more and more useless information.

  18. Re:URL + date by StormyMonday · · Score: 2, Interesting

    Bingo!

    I watch a number of political sites; it's amazing how, when Congressman Sludgepump says something stupid, it tends to disappear from his Website with no indication that it has ever changed. Occasionally, it even changes to show that he said the opposite of what was originally there.

    Checksums/digital signatures are potentially a solution, but the problem of doing it right can be quite difficult when you include real-world constraints. PDFs are a pain in the arse, but at least you can do a decent checksum on them.
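
    The checksum idea can be sketched in a few lines of Python: hash the page's bytes when you cite it, then compare later. This is only an illustration; the function names and the sample page contents are made up here and do not come from the thread. It reports that the bytes changed, not what changed, and pages with rotating ads or embedded timestamps will look "changed" on every fetch, which is one of the real-world constraints mentioned above.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(content).hexdigest()


def has_changed(cited_digest: str, current_content: bytes) -> bool:
    """True if the page bytes no longer match the digest recorded at citation time."""
    return fingerprint(current_content) != cited_digest


# Hypothetical page contents, before and after a silent edit.
original = b"<p>Congressman Sludgepump supports the bill.</p>"
edited = b"<p>Congressman Sludgepump opposes the bill.</p>"

digest_at_citation = fingerprint(original)
print(has_changed(digest_at_citation, original))  # False
print(has_changed(digest_at_citation, edited))    # True
```

    Publishing the digest next to the citation (URL + date + digest) would let a reader check an archived copy against what the author actually saw.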

    --
    Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
  19. Reviewed Content by neglige · · Score: 2, Interesting

    Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications.

    For true scientific work, this should never happen. Because you should only cite reviewed sources. Such as books, articles or conference papers. This is no guarantee for quality, but at least the review process sorts out the most obvious nonsense. And, if the reviewer is good, it may even increase the quality of the work. Plus, those sources are permanent.

    As always, there are sources that are more respected (IEEE, ACM etc.) than others. And using respectable sources is a good thing, because normally you want to prove a point and you base your argument on those publications. So if your basis for your argument is faulty... well ;)

    Furthermore, there is hardly any information that can be found on the web but not in a reviewed form. Note that there are (accepted) scientific reviewed journals using the web for publishing. Without a printed edition. And you can quote them. And, as many before me have said, the articles and links do not vanish (the URL is usually not quoted anyway - these articles are listed just like printed articles).

    This is just my personal opinion on scientific work. Let's see if my head is still on my shoulders tomorrow :)

    --
    My cats ate my karma. They also wrote this comment.
  20. Longevity by unfortunateson · · Score: 3, Interesting

    Maintaining a links page for my wife's business' site has always been a low priority, and finally, I put up a MySQL/PHP page to do the majority of the work.

    So I've been going through all the old links, and every link request we've gotten in the business' 7-year history. Of the 120 messages in the timeframe of 1997-1999, only about 15 sites still existed. Of those, two-thirds had forwarded URLs -- often from AOL or Homestead to their own brand. A couple still existed, but had totally different content.

    Many just plain didn't exist at all. A fair chunk found the server, but no such page. A few had blank pages or nearly no content. The true annoyance though, is the number of domains that are owned by spamdexers/linkfarms that have no content of their own and beg you to set your homepage to them.
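
    The outcomes described above (still there, forwarded, server found but no such page, gone entirely) map onto a small classifier. The following Python sketch is an assumption-laden illustration, not the MySQL/PHP page mentioned earlier: the `LinkState` names and the `classify` signature are invented here, and the check results are passed in as plain values so the logic can be read without any network code.

```python
from enum import Enum
from typing import Optional


class LinkState(Enum):
    ALIVE = "still at the original URL"
    FORWARDED = "redirected to a new URL"
    PAGE_MISSING = "server found, page gone"
    SITE_GONE = "no server at all"


def classify(original_url: str, status: Optional[int], final_url: Optional[str]) -> LinkState:
    """Classify one link check.

    status is the final HTTP status code, or None if no server answered.
    final_url is where the request ended up after following redirects.
    """
    if status is None:
        return LinkState.SITE_GONE
    if status >= 400:
        return LinkState.PAGE_MISSING
    if final_url is not None and final_url != original_url:
        return LinkState.FORWARDED
    return LinkState.ALIVE


# Hypothetical check results: (original URL, status, final URL).
checks = [
    ("http://example.com/a", 200, "http://example.com/a"),
    ("http://example.com/b", 200, "http://example.org/b"),
    ("http://example.com/c", 404, "http://example.com/c"),
    ("http://example.net/d", None, None),
]
for url, status, final in checks:
    print(url, "->", classify(url, status, final).name)
```

    What this cannot catch is the worst case in the list: a page that still answers normally but now carries totally different content, or a domain taken over by a linkfarm. Spotting those needs a comparison against a saved copy of the original.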

    I've still got to cover the rest of 2000-2003 link requests, but I expect that anything pre-2001 will be very sparse.

    --
    Design for Use, not Construction!
  21. Site Linking Schemes by Oculus+Habent · · Score: 2, Interesting

    An easy system would be for a server to provide each document it houses with a unique meta-data identifier. Then, when a document, story or paper moves from the "main page" into an archive section, you can still refer to the FileID. This ID should be searchable, so that an article could be linked via something like:

    http://www.cnn.com/?2001EXCJA2

    The IDs could be system generated and handled by a file system that supports meta-data or they could be designed to mean something and handled by a content management system.

    Implementation is the difficult part. Getting everyone - or at least news sites, magazines, and colleges/universities - to set up FileID searching and then document the linking process on their site is no small task.
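
    On a single server, the lookup side of such a scheme might look like the Python sketch below. Everything in it is hypothetical: the `FileIdRegistry` class, its method names, and the paths are invented for illustration, and a real site would back this with its file system metadata or content management system rather than an in-memory dictionary.

```python
from typing import Dict, Optional


class FileIdRegistry:
    """Maps a permanent document ID to wherever the document currently lives."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def publish(self, file_id: str, path: str) -> None:
        """Register a new document under a permanent ID."""
        if file_id in self._locations:
            raise ValueError(f"ID already in use: {file_id}")
        self._locations[file_id] = path

    def move(self, file_id: str, new_path: str) -> None:
        """Record that a document moved; its ID stays the same."""
        if file_id not in self._locations:
            raise KeyError(file_id)
        self._locations[file_id] = new_path

    def resolve(self, file_id: str) -> Optional[str]:
        """Return the current path for an ID, or None if unknown."""
        return self._locations.get(file_id)


registry = FileIdRegistry()
registry.publish("2001EXCJA2", "/2001/front-page/story.html")  # hypothetical path
registry.move("2001EXCJA2", "/archive/2001/story.html")        # story leaves the main page
print(registry.resolve("2001EXCJA2"))  # /archive/2001/story.html
```

    The point of the design is that only `move` ever touches the path; anyone who linked to the ID never has to know the story was archived.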

    --
    That what was all this school was for... to teach us how to solve our own problems. -- janeowit
  22. Not everything, but... by FunkyRat · · Score: 4, Interesting

    This is a real problem. When Vannevar Bush conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart built on Bush's ideas as did Ted Nelson (the guy who coined the term "hypertext") and Tim Berners-Lee. While the web today has become a vast sinkhole of pop-up ads, crappy web stores and inane blogs it is important to not forget that its inception was in aiding scientific research.

    Yet, that is not possible without some kind of permanence. Probably what is needed is some way to integrate the web into university library collections. If there was some way of indexing web pages the way libraries currently use the Library of Congress scheme to index their physical collections, then web pages could be uniquely numbered with this number incorporated into the URL. If then universities and the Library of Congress itself were to mirror (permanently) these pages, if the original URL were to become unavailable, one could try just about any major university or the LOC and retrieve the page. Of course, with the current political climate here in the US I don't foresee this ever happening.

  23. Re:Worst Record Keeping by LiquidCoooled · · Score: 2, Interesting

    The solution suggested seems perfectly reasonable to me.

    Having an archived copy allows the references to be valid and in context, whilst giving the original link allows for the updated and refreshed page to be expanded upon.

    All it takes is a header on the archive stating that this snapshot was taken at a certain time, and from a certain URL.
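
    As a rough Python sketch of that header: the `Snapshot` class, its field names, the example URL, and the capture date below are all illustrative assumptions, not a description of how archive.org or any existing cache formats its pages.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    source_url: str
    captured_at: datetime
    body: str

    def header(self) -> str:
        """One-line banner stating where and when the copy was taken."""
        stamp = self.captured_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return f"Snapshot of {self.source_url} taken {stamp}"

    def render(self) -> str:
        """Archived page: banner first, then the untouched original body."""
        return self.header() + "\n\n" + self.body


# Hypothetical page and capture time.
snap = Snapshot(
    source_url="http://example.com/paper.html",
    captured_at=datetime(2003, 11, 24, 9, 30, tzinfo=timezone.utc),
    body="Original page text.",
)
print(snap.render())
```

    Keeping the banner separate from the body means the archived text itself is never altered, so it can still be compared against the live page.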

    I'm not sure if archive.org already does similar, but the action of merely *searching* the archive for a page should send the scan bots out onto that page. This way it becomes a simple operation.

    I would push for the archive to be compulsory and above copyright - ALL the content continues to be the property of the original owner. Nobody should be able to remove data from the archive for any reason - if you posted it publicly, then you expect it to be cached.

    --
    liqbase :: faster than paper
  24. I do think that. by khasim · · Score: 2, Interesting

    "Do you think because you print it out it suddenly becomes a more stable reference?"

    Yes. Because now you have a copy of the source that you're citing.

    "Sometimes people doing professional articles have to cite web pages because that's where the information they are talking about is."

    And the article was about how the web pages don't stay live so you can't reference them later so the information is not available later.

    So, if you're going to use web pages as a citation, you need to have a means of referencing them after they go off-line.

    What better way is there than to have a copy of them yourself?

  25. Re:Worst Record Keeping by drooling-dog · · Score: 2, Interesting
    I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications

    I don't necessarily see a problem here, as long as serious academic research is maintained online by trusted, stable parties. That's not demanding any more than we have up to now with a print-based distribution system, since that depends on the continuity of a large network of brick-and-mortar libraries (and associated infrastructure) to function effectively. Imagine how difficult things would look if we were going in the opposite direction technologically!

    As for the volume of dreck available on the web... Well, that's been equally true of print media, something I'm reminded of whenever I stand in a grocery checkout line. Credibility will always be judged by the trustworthiness of the source.

  26. Misleading statistics by Alomex · · Score: 4, Interesting

    The article claims that "the average life span of a web page is 100 days". This is a very misleading statistic. What it really means is that the average web page is updated every 100 days, not that the page dies and goes away after 100 days.

    Moreover, as you can imagine, authoritative sources (the type that people are likely to quote) are updated much less frequently.

  27. Re:Worst Record Keeping by sandstress · · Score: 2, Interesting

    Slightly off topic but related are the efforts of researchers, such as John Willinsky, to create Public Knowledge Projects (PKP), where the aim is to make research that affects the public accessible and understandable to the public. Stability of links to documents and opening up citations are key to developing these sites, so this is a challenge. You would almost need a completely self-contained site, meaning you somehow provide duplicates of the necessary links.

  28. CrossRef initiative and DOIs by jtoras · · Score: 2, Interesting
    Most scholarly publishers (science, tech and medical) participate in the CrossRef initiative (crossref.org). This initiative makes it especially easy to cite electronic articles. The publisher registers a unique persistent DOI for each article with CrossRef, and this DOI is then used to cite the article, be it printed or electronic.

    Since publishers register DOIs as soon as the electronic version of the article is available online, the article is citable using DOI way before the print journal goes to press. And since the DOIs are persistent, links will work even if the journal changes ownership/publisher.

    In addition to providing free DOI resolution, CrossRef also provides a free metadata lookup for libraries (or it will provide it for free soon I think). Libraries will be able to look up DOIs using article metadata as needed.

    Many publishers also participate in a variety of archive initiatives, where a copy of every electronic article is made available in large or national libraries for safekeeping. In case the publisher goes out of business, the library or institution has the authority to make the stored archive available to the public. With persistent DOIs this will be very easy since the existing links will not break even if the servers are different.
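
    To make the citing side concrete, here is a small Python sketch that turns a bare DOI into a link through the public resolver at doi.org. The `doi_to_url` helper and its loose sanity check are assumptions made for illustration; the DOI shown is used only as an example value.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver link from a bare DOI such as '10.1000/182'."""
    doi = doi.strip()
    # Loose sanity check only: DOIs start with '10.' and have a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1000/182"))  # https://doi.org/10.1000/182
```

    The citation stores only the DOI; where the article actually lives is the resolver's problem, which is why the link survives a change of publisher or server.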

  29. Re:An example of broken down copyright laws by WNight · · Score: 2, Interesting

    I agree. The point of copyright is mainly to encourage the production of commercial works, to enrich the public domain. It was never intended to force a work to remain out of print.

    We need to change copyright law so that it doesn't prevent saving of lost works, and so that it can't be used to force a work to moulder away because it's in someone's best interest that it not be for sale. (For instance, old movies that studios don't want cutting into new movie revenue.)

    I'd like to see a short total-rights-reserved copyright, ten years or so maybe, and a longer commercial-rights copyright. I really see little reason why Warner Brothers, for instance, should be able to use Mickey Mouse in their cartoons, but fanfic, kids' pictures, and other such uses should be allowed. It's part of our culture and to deny us the right to participate is rude, and short-sighted.

    Few of today's creators grew up isolated and started creating original works immediately. Instead, they built on the culture they saw around them as they grew up. Children today won't have this ability. We're raising the bar, requiring them to create something that's safe from even an over-zealous lawyer and look-and-feel cases, as their first works.

    Tolkien would never have gotten started in our current legal climate. He intentionally built on previous stories and myths, something that wouldn't be legal to do now. Hell, for a while, TSR was trying to sue people who used their monster names in fantasy works, even where their names were derived from Tolkien.

  30. you can erase usenet postings by Anonymous Coward · · Score: 1, Interesting
    Has anyone ever been fired or denied employment due to the discovery of an ancient usenet post?

    Yes. I personally know of one very senior researcher confronted by a review board with his posts about good places for gay cruising!

    Unless I remove them, I will soon get to deal with the much more fun aspect of, "Dad, what's an acid trip and where did you go when you took them?" from my daughter.

    Yes, you can remove usenet posts from Google Groups.

  31. a solution by Anonymous Coward · · Score: 1, Interesting