Web Pages Are Weak Links in the Chain of Knowledge

Posted by Hemos on Monday November 24, 2003 @02:27AM from the destroying-our-young dept.

PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""

18 of 361 comments

Well, by jeffkjo1 · 2003-11-24 02:32 · Score: 5, Interesting

Really, is there a reason to archive everything in the world? Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)
1. Re:Well, by operagost · 2003-11-24 02:48 · Score: 5, Insightful
  
  Do you really think goatse will be "disturbing" 100 years from now?
  The day goatse.cx is no longer disturbing, is sure to be the first day of Armageddon ...
  
  --
  
  Gamingmuseum.com: Give your 3D accelerator a rest.
2. Re:Well, by GeorgeH · 2003-11-24 02:56 · Score: 5, Insightful
  
  100 years from now, should anyone be forced to accidentally stumble over goatse?
  The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture. I think that anything that archives the good and bad of a culture is worth keeping around.
  
  --
  Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
3. Re:Well, by mlush · 2003-11-24 03:03 · Score: 5, Interesting
  
  Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?
  I would be fascinated to see my Great Grandad's first drawings, his school web page, his postings to USENET. I only knew him as an old man ....
  
  To a historian, often the most interesting stuff is the ephemera: the diary of an ordinary person gives a view of everyday life you will never get looking at 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.
Books have an ISBN... by Advocadus+Diaboli · 2003-11-24 02:32 · Score: 5, Interesting

...which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number when articles of interest or knowledge are published there? Then it would be easy to track an article if it's moved to another site or whatever, just by looking up a sort of catalog for these numbers.
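
A rough sketch of what such a catalog could look like, assuming a toy in-memory registry. The names `PageCatalog`, `register`, `move` and `lookup` are made up for illustration and do not refer to any existing service; a real catalog would also need persistent storage and somebody to run it.

```python
class PageCatalog:
    """Toy catalog mapping a permanent article number to its current URL."""

    def __init__(self):
        self._locations = {}   # article number -> current URL
        self._next_number = 1

    def register(self, url):
        """Assign a new permanent number to an article and return it."""
        number = self._next_number
        self._next_number += 1
        self._locations[number] = url
        return number

    def move(self, number, new_url):
        """Record that an already-registered article now lives elsewhere."""
        if number not in self._locations:
            raise KeyError(f"unknown article number: {number}")
        self._locations[number] = new_url

    def lookup(self, number):
        """Return the current URL for a number, or None if never registered."""
        return self._locations.get(number)


# Hypothetical usage: the article moves, the number stays the same.
catalog = PageCatalog()
article = catalog.register("http://example.com/papers/starfish.html")
catalog.move(article, "http://example.org/archive/starfish.html")
print(catalog.lookup(article))
```

The citation would then carry the number rather than the URL, and only the catalog entry has to change when the page moves.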
1. Re:Books have an ISBN... by kalidasa · 2003-11-24 02:36 · Score: 5, Informative
  
  There already is such an identifier. It's called a Universal Resource Identifier, or URI. See Berners-Lee's essay Cool URIs Don't Change.
Reliability by lukewarmfusion · 2003-11-24 02:34 · Score: 5, Interesting

It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.

I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.
Re:Worst Record Keeping by robslimo · 2003-11-24 02:38 · Score: 5, Interesting

Ummm, maybe only as applies to this topic, which is to say that web pages are a poor place to keep records.

I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber-seeking and grid-seeking backhoes, etc).

I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.

What I think is much more disturbing and disruptive is the pseudo-science and misinformation that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.
Re:Worst Record Keeping by richy+freeway · 2003-11-24 02:39 · Score: 5, Funny

I had some evidence to back it up but all the links are long dead ;P
Re:Books have an ISBN..(but web pages are googled) by WillAdams · 2003-11-24 02:40 · Score: 5, Insightful

That was why Tim Berners-Lee wanted URL to stand for ``Universal'' (not Uniform) Resource Locator.

The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or no pages like this are worth preserving is another issue entirely).

Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?

With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).

Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get. ;)

William

--
Sphinx of black quartz, judge my vow.
What's the problem here ? by JackJudge · 2003-11-24 02:42 · Score: 5, Insightful

Why would we want to archive 99.9% of today's web content ?
Does anyone archive CB radio traffic ??

It's not a permanent storage medium, never could be, too many points of failure between your screen and the server holding the data.
archive.org and copyright? by McDutchie · 2003-11-24 02:43 · Score: 5, Interesting

I've started to keep archived copies of webpages instead of links; the next time you want it, it's gone. Unfortunately you can't share them like links.
If you can't share them, then how come archive.org can? How come archive.org seems to be above copyright law?
1. Re:archive.org and copyright? by Jerf · 2003-11-24 03:44 · Score: 5, Interesting
  
  How come archive.org seems to be above copyright law?
  
  Archive.org invokes the DMCA safe harbor provisions (see bottom of that page for the DMCA boilerplate), which are described in Title II of the DMCA.
  
  However, you'll find a careful reading of the DMCA reveals that none of the exclusions really quite applies to them; a good lawyer might be able to get them protected but I would bet against them.
  
  Mostly they get by because they will remove content if requested, and nobody who cares cares quite enough to sue them on behalf of "the world" when they are satisfied to have their own content removed. In other words, they are basically OK because nobody cares to sue them. Strictly speaking, archive.org probably is the world's largest copyright violation.
  
  This goes to show that sometimes if you break the law in a big enough way, you can get away with it. ;-)
  
  (Not responsible for the results of any actions based on taking that sentence to heart. For entertainment purposes only. etc.)
Permalinking and archiving by seldolivaw · 2003-11-24 02:44 · Score: 5, Insightful

The ephemeral nature of the web is a very real problem, but it's important not to overstate it. The reason so much more information is lost these days is partly a reflection of the fact that we produce so much more of it. The Library of Alexandria was the distilled knowledge of an entire civilisation; it was unique, irreplaceable and massively important information. The web is full of information that is of low quality, often massively redundant (thousands of pages explain the same thing in different ways) and certainly replaceable (the web is not the final repository of the information: it's a temporary place where that information is published). In the same way, for centuries, newspapers have produced thousands of redundant issues with a lifetime of just a few days. The reason no one decries the loss of our newspapers is because the publishers themselves still archive the information, even if this is somewhat hard to get to. The same is true of web pages, only the number of publishers is vastly larger.

Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.

But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.

So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include a piece of additional metadata (by whatever mechanism you prefer) when publishing pages; data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers), when bookmarking this information, could make a note of that meta-data and provide the option to bookmark the information rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.
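
To make the flow above concrete, here is a minimal sketch of the user-agent side. Everything in it is an assumption for illustration: `Bookmark`, `revisit`, and the two callables standing in for network requests are hypothetical names, and the "server" is just a pair of dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Bookmark:
    url: str        # where the page was when it was bookmarked
    info_key: str   # the page's self-description, e.g. "mysite news 2003.11.23"


def revisit(bookmark: Bookmark,
            fetch_info_key: Callable[[str], Optional[str]],
            ask_server: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the URL that currently holds the bookmarked information.

    fetch_info_key(url)  -> metadata key now published at url (or None)
    ask_server(info_key) -> current URL for that information (or None)
    Both callables are hypothetical stand-ins for network requests.
    """
    if fetch_info_key(bookmark.url) == bookmark.info_key:
        return bookmark.url               # content has not moved
    return ask_server(bookmark.info_key)  # content changed: ask where it went


# Simulated site state one day later: the front page carries a new story,
# and the server knows where yesterday's story was archived.
pages = {"http://mysite.com/index.html": "mysite news 2003.11.24"}
archive = {"mysite news 2003.11.23": "http://mysite.com/archive/2003/11/23.html"}

mark = Bookmark("http://mysite.com/index.html", "mysite news 2003.11.23")
print(revisit(mark, pages.get, archive.get))
# prints http://mysite.com/archive/2003/11/23.html
```

The comparison of metadata keys is one possible way to detect that "the content has changed"; the original scheme does not say how the user agent would notice.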

Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.
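
A sketch of the simpler client-only variant, assuming the archive location is published as a `<meta name="archive" content="...">` tag. That tag name is invented here for illustration and is not a standard; the parsing uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class ArchiveMetaParser(HTMLParser):
    """Collect the content of <meta name="archive" content="..."> if present."""

    def __init__(self):
        super().__init__()
        self.archive_url = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching <meta> tag is kept.
        if tag != "meta" or self.archive_url is not None:
            return
        attr_map = dict(attrs)
        # "archive" is a hypothetical meta name, compared case-insensitively.
        if (attr_map.get("name") or "").lower() == "archive":
            self.archive_url = attr_map.get("content")


def find_archive_url(html_text):
    """Return the archive URL declared in the page, or None if there is none."""
    parser = ArchiveMetaParser()
    parser.feed(html_text)
    return parser.archive_url


page = ('<html><head><meta name="archive" '
        'content="http://mysite.com/archive/2003/11/23.html"></head></html>')
print(find_archive_url(page))
```

A browser could store the returned URL alongside the bookmark and fall back to it when the original page no longer shows the bookmarked story.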

Suggestions of better solutions are always welcome :-)
cant erase my usenet postings by peter303 · 2003-11-24 02:50 · Score: 5, Interesting

I started posting to usenet in the late 1980s. These g*dd*mn things are still on the net. I was less guarded at that time. Everyone *knew* then, because disk space was so scarce, that usenet postings would disappear in 7-14 days.
the problem is bigger by professorhojo · 2003-11-24 02:52 · Score: 5, Insightful

it's not simply webpages that are the problem. it's digital storage in toto.

because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.

i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.

yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)
Legal citations and authority of internet sources by mtpruitt · 2003-11-24 04:40 · Score: 5, Informative

Law journals have tried to cope with the proper weight of authority to grant web pages by trying to follow the Blue Book, a citation manual.

The general rule has been that whenever you can find something in print, cite to that, but add an internet cite when either it is available and would make it easier to find, or if it is only available online.

Things that are only available online are surprisingly common in citation. The leading court reporter services (WestLaw and Lexis Nexis) both have cases that aren't "officially" printed, but are available online.

Also, many journal articles will cite to web pages such as a company's official description or press releases.

In general, these citations are treated for their functional purpose and not their form of media -- online cases are grouped (last) with other cases, and information from most web sites is considered a pamphlet or other unofficial publication.

This system seems to deal with the fact that they are ephemera pretty well. The citations really are only used to make a point that is merely illustrative or is easily accessible to legal practitioners.
RTFA... it's about references in scientific papers by dpbsmith · 2003-11-24 05:31 · Score: 5, Insightful

The article is not about archiving "everything in the world." It's specifically about references in scholarly papers, which, for the past three or four centuries, have been part of the essential fabric of scientific research. In a research paper, everything you say is either supposed to be the result of your own direct observation, or backed by a traceable, verifiable, and critiquable authority.

You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."

Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.

This sort of thing is absolutely essential to the scientific process and makes science self-correcting.

The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.

--
"How to Do Nothing," kids activities, back in print!