Web Pages Are Weak Links in the Chain of Knowledge
PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""
Really, is there a reason to archive everything in the world? Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?
100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)
...which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number for articles of interest or knowledge that are published there? Then it would be easy to track an article if it's moved to another site or whatever, just by looking it up in a sort of catalog for these numbers.
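For what it's worth, here is a rough sketch of what such a catalog could look like. Everything in it is made up for illustration: the `PageCatalog` class, the `WPN-` number format, and the example URLs are assumptions, not an existing service or standard.

```python
# Hypothetical sketch: a catalog that maps a permanent "web page number"
# to wherever the article currently lives. Names and number format are invented.

class PageCatalog:
    def __init__(self):
        self._locations = {}    # page number -> list of URLs, oldest first
        self._next_number = 1

    def register(self, url):
        """Assign a new permanent number to an article published at `url`."""
        number = f"WPN-{self._next_number:06d}"
        self._next_number += 1
        self._locations[number] = [url]
        return number

    def move(self, number, new_url):
        """Record that the article moved; the number itself never changes."""
        self._locations[number].append(new_url)

    def resolve(self, number):
        """Return the current URL for a number, or None if it is unknown."""
        history = self._locations.get(number)
        return history[-1] if history else None


catalog = PageCatalog()
number = catalog.register("http://example.edu/~smith/paper.html")
catalog.move(number, "http://example.org/papers/smith-2003.html")
current = catalog.resolve(number)   # the citation keeps using `number`
```

The point is that a citation would carry the number rather than the URL, so a move only requires one update in the catalog.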
honestly, the transient nature of webpages makes it an unsuitable medium for the long term establishment of "culture" our categorization happy, buzz-word ridden nature so commonly prevalent will have to find a new term for what is the web. boo-freaking-hoo.. meanwhile i'll keep doing my thing, posting pics for my family to see, putting calendar events up on the web so my homebrew-club will know when we're meeting and not worry about any "culture" i might be potentially creating then destroying when i take stuff back down.
man i need coffee, insomnia is a bitch...
It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.
I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.
I've personally been working (internally so far) on a website of modern-day Orthodox-Jewish responsa to various issues of Jewish law, so this is an issue I've given some thought to.
To say this is some kind of problem specific to the web is misleading. There are old, well-quoted sources of Jewish thought whose texts are simply lost to us in this current day and age. Example: a famous and extremely popular commentary on the Talmud and Torah, Rashi, is missing for at least a few chapters of Talmud. That would be the equivalent of IEEE misplacing some standards papers and then NO ONE having copies, just lost to the sands of time. Yet it did happen, proving this at least _was_ a serious issue.
However, these days, with such things as the Wayback Machine and Google caching, actually LOSING entire web pages doesn't happen very often, and, I'd bet, it happens far less frequently than the loss of books.
-Erwos
Plausible conjecture should not be misrepresented as proof positive.
Ummm, maybe only as applies to this topic, which is to say that web pages are a poor place to keep records.
I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber-seeking and grid-seeking backhoes, etc).
I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.
What I think is much more disturbing and disruptive is the pseudo-science and misinformation that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.
Usability expert Jakob Nielsen addressed the issue of linkrot in a column back in 1998: Fighting Linkrot.
That's a fairly reductionist view if taken too far. Not all researchers are tech whizzes (no pun intended), and I've seen a number of, in my case, professors of English Literature who run the same sort of, "Throw up ten pages with Under Construction signs, test publish a few papers, and let the site sit for years, one day to mysteriously disappear," web site lifespan that "Bob's World" might as well.
Perhaps even more interestingly, it doesn't always really matter if you've done great, repeatable research in the "soft science" fields or outright humanities. You don't have to be a literature expert to have a good insight on "Bartleby the Scrivener". A grad student's blog, as an example, might contain excellent contributions to the conversation.
Now that said, in the context of the article -- dealing with "a dermatologist with the Veterans Affairs Medical Center in Denver" -- I would tend to agree with you heartily. Hard science needs to pull, in my layman's view, from research that the article's author researched well enough to see that it wasn't a few 0's and 1's that might be pulled later, in general.
And heck, what's the harm in saving the pages on your drive and contacting the original author if they disappear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it right often).
It's all 0s and 1s. Or it's not.
May I remind everyone to read and understand TimBL's Cool URIs don't change. It's not that hard to design systems where you do not have to change the URI every 100 days, folks.
Employee of Inrupt, Project Release Manager and Community Manager for Solid
I started posting to usenet in the late 1980s. These g*dd*mn things are still on the net. I was less guarded at that time. Everyone *knew* back then that, because disk space was so scarce, usenet postings would disappear in 7-14 days.
Look at DSpace, the mission of which is "To create and establish an electronic system that captures, preserves and communicates the intellectual output of MIT's faculty and researchers."
Each data set (collection) has a handle, supposedly longer lasting than URNs. We're talking about long term data storage here.
There's an implementation of it at Cambridge University, and my organisation will be evaluating it as soon as the SuSE Linux Enterprise Server software lands on my desk and I've installed my server.
Tom.
Oh arse
Yes, good point. The Internet is much more akin to CB radio since it is uncontrolled, unverified, entirely volunteer-based, entirely virtual, and highly volatile. By contrast, books, TV, and other media are highly controlled, subject to external verification, have a high cost of entry, are either themselves physical media or require a physical presence in order to communicate, and are largely static in content.
The problem with the Washington Post's article is that their premise is flawed. They assume that the Internet is a mostly static source of information, when it is definitely a mostly dynamic information source. Webpages are meant to be updated, and with updates come change. It's inevitable. To assume that we keep every update to the webpages in separate locations is a false assumption. It's cool to see sites like the Wayback machine do this, but it's not required.
Rule #1 -- Politics always trumps technology.
Webpages aren't replacements for books. Or rather, you shouldn't use them that way.
If they're lasting on average 100 days, that puts them somewhere between transient culture, like spoken conversation, and printed culture, like newspapers. Big deal.
We want to preserve culture for future generations, no doubt. But we don't want to preserve all culture for future generations. Anything that is lasting for 100 days and isn't being persisted... well, relatively that's not worth much to future culture.
I don't remember the exact saying, but there is a Native American saying to the effect of "We don't write things down. If we don't remember it, it's not worth remembering." Now, they're not the last word (no pun intended) in wisdom traditions, but there is a certain amount of enforced vitality necessitated by forgetting the details.
We'd better get used to the idea. We're only going to be forgetting more and more of the details as we generate more and more useless information.
Maintaining a links page for my wife's business' site has always been a low priority, and finally, I put up a MySQL/PHP page to do the majority of the work.
So I've been going through all the old links, and every link request we've gotten in the business' 7-year history. Of the 120 messages in the timeframe of 1997-1999, only about 15 sites still existed. Of those, two-thirds had forwarded URLs -- often from AOL or Homestead to their own brand. A couple still existed, but had totally different content.
Many just plain didn't exist at all. A fair chunk found the server, but no such page. A few had blank pages or nearly no content. The true annoyance though, is the number of domains that are owned by spamdexers/linkfarms that have no content of their own and beg you to set your homepage to them.
I've still got to cover the rest of 2000-2003 link requests, but I expect that anything pre-2001 will be very sparse.
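If anyone wants to automate the sorting, here is a minimal sketch of the buckets described above. The function name, the bucket labels, and the 200-character "nearly no content" cutoff are all my own assumptions, and it does not try to spot linkfarms, which still takes a human eye. It only classifies a result you have already fetched; it does no network access itself.

```python
from collections import Counter

def classify_link(original_url, status, final_url, body):
    """Sort one link-check result into a bucket.

    `status` is the HTTP status code, or None if the server could not be
    reached at all. `final_url` is where the request ended up after redirects.
    """
    if status is None:
        return "no such site"
    if status == 404:
        return "server found, no such page"
    if status != 200:
        return "other error"
    if len(body.strip()) < 200:          # arbitrary cutoff, an assumption
        return "blank or nearly empty"
    if final_url != original_url:
        return "forwarded"
    return "still there"

def tally(results):
    """Count buckets for a list of (original_url, status, final_url, body)."""
    return Counter(classify_link(*result) for result in results)

# Made-up sample results, not real data from the links page.
sample = [
    ("http://old.example.com/", None, None, ""),
    ("http://a.example.com/x", 404, "http://a.example.com/x", "Not found"),
    ("http://b.example.com/", 200, "http://brand.example.com/", "x" * 500),
    ("http://c.example.com/", 200, "http://c.example.com/", "x" * 500),
]
counts = tally(sample)
```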
Design for Use, not Construction!
This is a real problem. When Vannevar Bush conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart built on Bush's ideas as did Ted Nelson (the guy who coined the term "hypertext") and Tim Berners-Lee. While the web today has become a vast sinkhole of pop-up ads, crappy web stores and inane blogs, it is important to not forget that its inception was in aiding scientific research.
Yet, that is not possible without some kind of permanence. Probably what is needed is some way to integrate the web into university library collections. If there were some way of indexing web pages the way libraries currently use the Library of Congress scheme to index their physical collections, then web pages could be uniquely numbered with this number incorporated into the URL. If universities and the Library of Congress itself were then to mirror (permanently) these pages, and the original URL were to become unavailable, one could try just about any major university or the LOC and retrieve the page. Of course, with the current political climate here in the US I don't foresee this ever happening.
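To make the mirroring idea concrete, here is a small sketch of the lookup a reader's software might do. The mirror addresses, the URL layout (mirror base plus catalog number), and the catalog number itself are hypothetical; no such scheme exists as far as I know. The `fetch` argument is whatever function you use to download a page, returning None on failure, so the sketch itself touches no network.

```python
def candidate_urls(original_url, catalog_number, mirrors):
    """Original location first, then each mirror's copy under the same number."""
    return [original_url] + [
        f"{mirror.rstrip('/')}/{catalog_number}" for mirror in mirrors
    ]

def retrieve(original_url, catalog_number, mirrors, fetch):
    """Try each candidate in order; return (url, page) for the first that works."""
    for url in candidate_urls(original_url, catalog_number, mirrors):
        page = fetch(url)
        if page is not None:
            return url, page
    return None, None

# Stand-in for the network: only the second mirror still holds a copy.
fake_web = {"http://mirror2.example.org/webcat/QA76-0042": "archived copy"}

found_url, page = retrieve(
    "http://gone.example.com/paper.html",
    "QA76-0042",                                  # made-up catalog number
    ["http://mirror1.example.edu/webcat/", "http://mirror2.example.org/webcat"],
    fake_web.get,
)
```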
The article claims that "the average life span of a web page is 100 days". This is a very misleading statistic. What it really means is that the average web page is updated every 100 days, not that the page dies and goes away after 100 days.
Moreover, as you can imagine, authoritative sources (the type that people are likely to quote) are updated much less frequently.