Web Pages Are Weak Links in the Chain of Knowledge

← Back to Stories (view on slashdot.org)

Web Pages Are Weak Links in the Chain of Knowledge

Posted by Hemos on Monday November 24, 2003 @02:27AM from the destroying-our-young dept.

PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""

7 of 361 comments (clear)

Min score:

Reason:

Sort:

Re:Books have an ISBN..(but web pages are googled) by WillAdams · 2003-11-24 02:40 · Score: 5, Insightful

That was why Tim Berners-Lee wanted URL to stand for ``Universal'' (not Uniform) Resource Locator.

The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or no pages like this are worth preserving is another issue entirely).

Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?

With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).

Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get. ;)

William

--
Sphinx of black quartz, judge my vow.
What's the problem here ? by JackJudge · 2003-11-24 02:42 · Score: 5, Insightful

Why would we want to archive 99.9% of today's web content ?
Does anyone archive CB radio traffic ??

It's not a permanent storage medium, never could be, too many points of failure between your screen
and the server holding the data.
Permalinking and archiving by seldolivaw · 2003-11-24 02:44 · Score: 5, Insightful

The ephemeral nature of the web is a very real problem, but it's important not to overstate it. The reason so much more information is lost these days is partly a reflection of the fact that we produce so much more of it. The Library of Alexandria was the distilled knowledge of an entire civilisation; it was unique, irreplaceable and massively important information. The web is full of information that is of low quality, often massively redundant (thousands of pages explain the same thing in different ways) and certainly replaceable (the web is not the final repository of the information: it's a temporary place where that information is published). In the same way, for centuries, newspapers have produced thousands of redundant issues with a lifetime of just a few days. The reason no one decries the loss of our newspapers is because the publishers themselves still archive the information, even if this is somewhat hard to get to. The same is true of web pages, only the number of publishers is vastly larger.

Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contact valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.

But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc.. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.

So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include additional a piece of additional metadata (by whatever mechanism you prefer) when publishing pages; data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers) when bookmarking this information could make a note of that meta-data, and provide the option to bookmark the information, rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.

Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.

Suggestions of better solutions are always welcome :-)
Re:Well, by operagost · 2003-11-24 02:48 · Score: 5, Insightful

Do you really think goatse will be "disturbing" 100 years from now?
The day goatse.cx is no longer disturbing, is sure to be the first day of Armageddon ...

--

Gamingmuseum.com: Give your 3D accelerator a rest.
the problem is bigger by professorhojo · 2003-11-24 02:52 · Score: 5, Insightful

it's not simply webpages that are the problem. it's digital storage in toto.

because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.

i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.

yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinately. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)
Re:Well, by GeorgeH · 2003-11-24 02:56 · Score: 5, Insightful

100 years from now, should anyone be forced to accidentally stumble over goatse?
The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture. I think that anything that archives the good and bad of a culture is worth keeping around.

--
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
RTFA... it's about references in scientific papers by dpbsmith · 2003-11-24 05:31 · Score: 5, Insightful

The article is not about archiving "everything in the world." It's specifically about references in scholarly papers, which, for the past three or four centuries, have been part of the essential fabric of scientific research. In a research paper, everything you say is either supposed to be the result of your own direct observation, or backed by a traceable, verifiable, and critiquable authority.

You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."

Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.

This sort of thing is absolutely essential to the scientific process and makes science self-correcting.

The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.

--
"How to Do Nothing," kids activities, back in print!