Web Pages Are Weak Links in the Chain of Knowledge
PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""
You probably shouldn't be quoting any kind of "Bob's World of Great Scientific Insight" type pages anyway. I mean, the majority of sites that go under in less than 100 days are the one person operations that one should identify as bad sources anyway. So it might seem obvious that quoting someone's blog in a research paper is just a plain stupid idea, but it happens way more often than you might think.
====
Crudely Drawn Games
People are worried about losing the information on the web: but all that is really happening is that the URLs are no good after a while, you lose the snapshot. The information is not necessarily going anywhere. If there is a need or a want, someone will throw it up, or another will host it. That's the beauty of the web, you get the good with the bad, but time has a way of getting rid of the chaff.
;)
What would be interesting would be a website that archives those snapshots for posterity. Well, what do you know, there are several such sites already! Looks like we're in good shape. The sky is not falling.
Auto-reply to ACs: "Truly, you have a dizzying intellect."
Any extra effort required to preserve web pages and their URLs for eternity makes it more difficult for people to create them in the first place, which will mean less knowledge available, not more. Something unobtrusive that goes around preserving pages for posterity, like the Internet Archive, is the best solution.
Energy: time to change the picture.
This is why every time I use a web reference I make a hardcopy of it and include it in my research folder. It did not take long for me to figure out that web pages are no more useful than manufacturer catalogs - once the year is up, you might never get that tidbit of information back. If it's too large to want to print, I'll hardcopy the couple of pages I need, and PDF the whole thing for digital storage.
Having a hardcopy (1) documents the information and its (purported) source, and (2) allows offline access for comparison and validation.
Is it just my observation, or are there way too many stupid people in the world?
100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)
:P
Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing
That was why Tim Berners-Lee wanted URL to stand for ``Universal'' (not Uniform) Resource Locator.
;)
The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or no pages like this are worth preserving is another issue entirely).
Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?
With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).
Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get.
William
Sphinx of black quartz, judge my vow.
Why would we want to archive 99.9% of today's web content ?
Does anyone archive CB radio traffic ??
It's not a permanent storage medium, never could be, too many points of failure between your screen and the server holding the data.
Anything worth publishing digitally should be recorded in a more permanent medium.
I constantly back up all my digital photos because they are important to me. I also print the best ones for placing in photo albums, distributing to friends, etc.
The website they are published to is just a delivery medium, and not even the primary one. It can disappear and I wouldn't care. People who know me can always get access to them. Scientists should view their work the same way.
Nostalgia isn't what it used to be.
I would be much more worried about whether the site still says the same thing. What about revisionism? I would wonder whether the cited reference even says what it was cited for. It's easy enough to change pages so that they make the person citing them look stupid (say, the author doesn't like how the reference was used), or to just out-and-out lie after being referenced. Unless the pages are locked down, and we all know that is not really possible, someone somewhere will find their way in.
Printed media, while having a low data/pound ratio, has managed to survive and span generations for centuries. I think the need for paper libraries cannot be forgotten. The challenge is distilling out what is worth keeping, and this challenge is better met now rather than later because we have more or less a good idea of what is significant information, and what is crap.
The ephemeral nature of the web is a very real problem, but it's important not to overstate it. The reason so much more information is lost these days is partly a reflection of the fact that we produce so much more of it. The Library of Alexandria was the distilled knowledge of an entire civilisation; it was unique, irreplaceable and massively important information. The web is full of information that is of low quality, often massively redundant (thousands of pages explain the same thing in different ways) and certainly replaceable (the web is not the final repository of the information: it's a temporary place where that information is published). In the same way, for centuries, newspapers have produced thousands of redundant issues with a lifetime of just a few days. The reason no one decries the loss of our newspapers is because the publishers themselves still archive the information, even if this is somewhat hard to get to. The same is true of web pages, only the number of publishers is vastly larger.
:-)
Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.
But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.
So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include an additional piece of metadata (by whatever mechanism you prefer) when publishing pages; data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers) when bookmarking this information could make a note of that meta-data, and provide the option to bookmark the information, rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.
Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.
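A rough sketch of the lookup described above, in Python, might look like the following. Everything here is hypothetical: the `Bookmark` and `InfoResolver` names, the `follow` helper, and any archive URL are made up for illustration, and a real version would need an actual protocol between browser and server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Bookmark:
    location: str  # URL of the page at bookmarking time
    info_id: str   # metadata naming the information, not the page


class InfoResolver:
    """Hypothetical server-side table: information id -> current URL."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}

    def publish(self, info_id: str, url: str) -> None:
        # Called whenever the information is published or moved.
        self._current[info_id] = url

    def resolve(self, info_id: str) -> Optional[str]:
        return self._current.get(info_id)


def follow(bookmark: Bookmark,
           info_id_at: Callable[[str], Optional[str]],
           resolver: InfoResolver) -> Optional[str]:
    """Return the URL a user agent should open for this bookmark."""
    # If the bookmarked location still serves the same information, use it.
    if info_id_at(bookmark.location) == bookmark.info_id:
        return bookmark.location
    # Otherwise ask the server where that information lives now.
    return resolver.resolve(bookmark.info_id)
```

Here `info_id_at` stands in for "fetch the page and read its metadata"; the simpler client-only variant would just store the archive URL in the bookmark and skip the resolver.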
Suggestions of better solutions are always welcome
The article states that the average life for a website is 100 days, but wouldn't journals and formal publications (the most often cited documents in research) last longer than the average? Also, is the average skewed because websites are more likely to contain 'current information'? "Average lifetime" is misleading, does this mean the average time the page stays the same, or the average time before the information in the page is unavailable?
Then DOWNLOAD the pages from your web citations.
For example, a short time ago, I did a white paper on power scavenging sources. About 1/2 the articles I read were HTML or PDF sources. Rather than just citing the URL, I downloaded/saved every online article I referenced. If someone wants the source and cannot find it, I'll just provide it to them. If your paper is going to be read by a number of people, it makes good sense to have those sources on-hand; it never hurts to cover your arse.
Hard drive/Network/Optical space is virtually unlimited, so storage isn't a problem. Paper journals are archived by most libraries, anyway, so until they start archiving technical sources, I'm going to have to do my OWN archiving.
Do you really think goatse will be "disturbing" 100 years from now?
The day goatse.cx is no longer disturbing, is sure to be the first day of Armageddon
Gamingmuseum.com: Give your 3D accelerator a rest.
Proper URL citations include the date. I'm not worried so much about the page being taken down (since it is presumably archived), as much as changing. If you don't record which version you were referring to, the content can change dramatically.
:w
it's not simply webpages that are the problem. it's digital storage in toto.
because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.
i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.
yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)
Easy come, easy go... here's another cliche: Give and Take. What's great about the web is that it has effectively demolished the barriers to entry in publishing. Everybody and their grandmother has a blog now - you can't compare webpages to magazine articles or newspapers. There's just so much more information being published now that its average lifespan is bound to go down. So what?
Publications that cite [web pages] lose their authorities? Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage is a reputable source in the first place, it'll keep it around permanently. Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.
This is no way to run a culture. Last time I checked, nobody ran our culture... It kinda runs itself. The proliferation of accessible, ephemeral webpages over permanent, privileged paper publications (wah, too many p's!) is a sign that our information culture has moved on into a new era. Liked the old one? Tough! Now information has to maintain its own relevance in order to be permanent... and I for one welcome that change.
-3Suns
~~~~
The Revolution will be Slashdotted
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."
To the contrary, I think this is highly typical of the culture we have today, where everything is a transient fad in the media, technology and politics.
And it is also self feeding, I think, since market forces need to clear out the old to make room for the new in order to meet sales forecasts and shareholder expectations. And this is very true for pop, news and technology, which explains the lack of staying power of pop icons these days and becomes interesting when you want to ask yourself if you really need that new 3GHz machine just to surf the web.
And it is highly convenient in politics where a politician doesn't have to be accountable for what he said 100 days ago.
And so, the lack of long time life on the web is simply symbolic of all the rest here really, even if it is highly questionable.
Use genguid (or another tool) to make a globally unique number and place that number at the bottom of your page. Then link to the page with a Google "I'm Feeling Lucky" search for the GUID.
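For what it's worth, the GUID trick could be sketched in Python roughly like this. The helper names are made up, `uuid.uuid4()` is swapped in for genguid, and the `btnI` query parameter is an assumption about how Google's "I'm Feeling Lucky" search is triggered; it may not behave this way today.

```python
import uuid
from urllib.parse import urlencode


def make_page_guid() -> str:
    """Generate a random GUID to embed at the bottom of a page."""
    return str(uuid.uuid4())


def lucky_search_url(guid: str) -> str:
    """Build a search link that should land on whichever page contains the GUID.

    Assumption: the search endpoint and the `btnI` parameter are not taken
    from the original comment and may change or stop working.
    """
    query = urlencode({"q": f'"{guid}"', "btnI": "1"})
    return "https://www.google.com/search?" + query
```

The link only works once the search engine has indexed the page, and only for as long as the GUID stays in the page text wherever it moves.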
URIs don't provide content-based addressing (like a hash of the document). They rely upon trustworthy name registrars, which is an assumption that might have been valid when Berners-Lee was doing his early work, but is not now. They rely on someone willing to continue hosting the original document -- not necessarily the case.
You can link to an article which is then changed by the original publisher (or someone else). With scientific papers, you can't do that -- and such behavior is probably not desirable.
On the up side, if you're currently using cited references, you should be able to build such a system without too much problem -- follow links to PDFs or automatically crawl HTML documents (and check images) and serve all papers that you refer to with your paper. It'd be big, but it provides better reliability than do current paper schemes.
Another feature that might be useful is signing of the content (assuming RSA doesn't get broken in the future).
Basically, if you put up a SHA-1 (Gnutella), MD4 (eDonkey), or similar reference, you can host the original referred-to documents as well as the original host.
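A minimal sketch of that kind of content-based reference, in Python. It uses SHA-256 from the standard library rather than the SHA-1 or MD4 mentioned above, and the `content_address` / `verify` names and the `algo:digest` format are just illustrative assumptions.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a reference derived from the document's bytes, not its location."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(data: bytes, address: str) -> bool:
    """Check that a copy fetched from any host matches the cited reference."""
    algo, _, digest = address.partition(":")
    return hashlib.new(algo, data).hexdigest() == digest
```

A citation carrying such a reference lets anyone host a copy, and lets the reader confirm that the copy is byte-for-byte what the author cited. It says nothing about who wrote it; that is what signing would add.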
If Freenet didn't have as a specific drawback the inability of someone to guarantee that a document remains hosted as long as they are willing to host it, Freenet would be a good choice for this.
One possibility is that, with a bit of manual work, one can frequently find an academic work by Googling for its title. At least for now, as long as you host the original papers as well, Google should pick up on this fact. Of course, it does nothing to prevent modification of that paper by another party...
A good system for handling this would be to have a known system that is willing to archive, in perpetuity (probably hosted by the US government or other reasonably stable, trustworthy source [yes, yes, cracks at the US government aside]). This system would act like a Tier 1 NTP server -- it would only grant access to a number of other trusted servers (universities, etc) that mirror it -- perhaps university systems -- which would keep load sane. These servers (or perhaps Tier 3 servers) then provide public access. Questions of whether there would be a hard policy of never removing content or what would be allowed (especially WRT politically controversial content) would have to be answered.
There could be multiple Tier 1 servers that would sync up with each other, and could act as checks in case one server is broken into. I'm partial to the idea of including a signature on each file, but I suppose it isn't really necessary.
Specific formats could be required to ensure that these papers are readable for all time. Project Gutenberg went with straight ASCII. This would probably have to be slightly more elaborate. Microsoft Word and PDF might not be good choices, and international support would be necessary.
May we never see th
I'm amazed that anyone doing a professional article would even think of citing a web page as a web page.
Why not just print it out?
Not only are web pages transient, but the facts they have are subject to change. This gets back to your "pseudo-science and mis-information" comment.
If you're going to use it in your work, print a copy or save an image of it or something.
Which brings us to "fair use" and copyrights and all kinds of other crap.
To a historian often the most interesting stuff is the ephemera, the diary of an ordinary person gives a view of every day life you will never get looking at 'formal' archives (ie newspaper, film libraries etc etc) which only covers 'important' stuff
If you like that, you might like the books by the historian Fernand Braudel. Rather than the "kings and battles" of most histories, he focusses on how very simple things like the foods people ate, the weather, etc, and the relationships between long-term trends and the emergent properties of those interactions (i.e. over decades or centuries) are responsible for shaping the course of history.
If we are to say that not everything is worthy of archiving, who, then, is to decide what is? The 'net shouldn't be just another memory hole when there is the potential to create a repository of information that far exceeds the scope of anything possible before. That said, people who wish to cite to information published in an electronic form should be careful to cite only to sources that are reputable not only for veracity but also for longevity.
There was Cowboy Neal at the wheel of a bus to never-ever land.
You make a good point about the abundance of mis-information on the web, and that's another problem that needs to be looked at, but I disagree with "this will all sort itself out and real scientists will continue or return to citing more traditional resources." We have an incredible resource here (the internet) for disseminating information, and to ignore it would be something that's really not going to happen. We need to solve problems like this so we can take advantage of the benefits offered by the internet.
Can I mod something +1 Scary if it's true but I wish it weren't?
There's already a method of long-term storage for established knowledge, and the library at Alexandria was pretty good at it: PRINTED BOOKS. Web pages were never intended to be static monoliths of information but were from the beginning meant to represent a "living document" where the exchange of information was the important thing.
This has been a real problem for a long time. But the web is distributed. The only real solution is for people to realize that moving stuff around all the time breaks links, and avoid it. One thing that would help is a translation layer in the web server, that separates the URL from the server's filesystem. This is basic software engineering common sense.
Non-transparent CGI, PHP and ASP scripts are even worse, they tend to change all the time. Instead they should be using the "path info", or be in the server (mod_perl, etc.)
Example: "http://science.slashdot.org/article/03/11/24/127 2 50" is a much better permanent URL for this story than exposing the details of some perl script called "article.pl" that takes a parameter named "sid", and it will be easier to adapt to all future versions of Slash or other software, or to simply archive as a static file someday. Using the PATH_INFO CGI variable you can make a CGI like "article.pl" use URLs like the one above.
The idea that the basic job of a webserver is to pull files off your disk is incomplete: its job ought to be to take your URL through *any* kind of query lookup, which might map to the filesystem and might not. The HTTP RFCs imply this as well.
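As a rough illustration of that translation layer, here is a small WSGI sketch in Python that resolves a path-style URL through a lookup table instead of the filesystem. The `STORIES` table, its placeholder id, and the function names are hypothetical; this is not how Slash itself does it.

```python
from typing import Optional

# Hypothetical store: story id -> body. It could just as well be a database.
STORIES = {"03/11/24/0000000": b"Web Pages Are Weak Links in the Chain of Knowledge"}


def lookup_sid(path_info: str) -> Optional[str]:
    """Translate '/article/YY/MM/DD/ID' into the story id 'YY/MM/DD/ID'."""
    parts = path_info.strip("/").split("/")
    if len(parts) == 5 and parts[0] == "article":
        return "/".join(parts[1:])
    return None


def app(environ, start_response):
    """WSGI entry point: the public URL never names a script or a file."""
    sid = lookup_sid(environ.get("PATH_INFO", ""))
    body = STORIES.get(sid)
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

Because only `lookup_sid` knows how URLs map to storage, the backend can change (script, database, static files) without breaking any link that was already published.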
reed
VOS/Interreality project: www.interreality.org
I wouldn't rule it out. There are people who are working very hard now to drag us all back into a new era of ignorance and superstition. Can they succeed? Maybe not, but things were pretty wide-open in the 20s, and then look what happened!
"And heck, what's the harm in saving the pages on your drive and contacting the original author if they disppear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it right often)."
You might want to make certain you have it RAID'ed. I had TWO IBM Deskstars die in the same time period. What a pain to recover what I could. And I believe that Google could fall under the same provisions as a Library.
That's not an entirely unreasonable view, however archeologists frequently gain important insights into an ancient culture by looking at dross. Near the burial sites of pharaohs were found carved complaints by workmen about poor conditions. In Greece (I think) notes were found in a ceremonial spot with curses aimed at neighbours and slutty wives. Gossip and tittle-tattle for sure but quite informative and used to get a feel for the society.
In 5000 years archeologists will learn so much about us from blogs & archives of /. They will learn Natalie Portman was a fertility goddess worshipped by the mysterious use of a dish called 'hot grits'
-he who laughs last, is a bit slow.
Yes. Because now you have a copy of the source that you're citing.
The real item of importance is that others have access to what you are citing. They may need/desire this for several reasons such as verifying your claims and gaining more background information. By citing an online resource that is not backed by hard-publication (i.e. IEEE offers full-text online articles in addition to print, slashdot has no periodical that i know of) you may cite something that is gone tomorrow, possibly making your work look suspect. Furthermore, anyone can post pretty much anything they want to the web -- think the onion.
Should the web become the *sole* source of information, the Ministry of Truth will come into being. No piece of information will be trustworthy, because all information will be mutable.
This is already happening. Read a CNN news story (something controversial or important) and save the text. Come back a couple of hours later-- you will often find changes in the text.
What is truth when there is no proof?
It's whatever they want to tell you.
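The save-and-compare check described above can be sketched in a few lines of Python with the standard `difflib` module. The `changed_lines` helper is a made-up name, and fetching the page text is left out.

```python
import difflib
from typing import List


def changed_lines(saved: str, current: str) -> List[str]:
    """Return lines removed ('- ') or added ('+ ') between two snapshots."""
    diff = difflib.ndiff(saved.splitlines(), current.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

An empty result means the text you saved still matches what is being served; anything else is a record of what was quietly edited.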
Its implications go way beyond web pages, which are just one of the first manifestations of our electronic culture creating records that never touch paper, or other more established and permanent mediums.
Businesses typically only have to archive material for around 7 years legally, although some industries like pharmaceuticals have to preserve data considerably longer. This is fine when records are primarily paper based, with some nice computers to speed our current business along. When records are totally electronic from start to end ("born digital"), we start to have problems, legally and culturally. Some researchers are talking about a digital dark ages, where many of our records today will simply vanish from history, totally inaccessible and unpreserved.
This is about storage, migration and emulation. It's about persistent identifiers. It's about technology obsolescence leading to cultural obsolescence.
Matt Palmer Digital Preservation Department UK National Archives.
I read an interesting article a few years ago about how even our hard copy (books, magazines, musical scores, etc.) won't be nearly as useful to future historians.
Why?
Current historians learn a lot about each writer's creative process, and how writers evolved their ideas, from drafts and corrections. Music scholars pore over every scratched-out note, every furious scribbled comment, in Beethoven's draft scores. Writing music was laborious and hugely frustrating for Beethoven, unlike Mozart, who hardly stopped to think and made few if any corrections.
Future scholars won't know any of this stuff, looking back at our work. We use software to edit our work... so when we fix our errors they are gone forever. We change our minds and the original idea disappears in a puff of electrons. An electronic score of a Beethoven symphony only differs from a Mozart concerto in the musical style -- all of the other data is gone.
It's a sobering thought. Where else are we going to get this data? Not letters, because we write emails now, and regularly delete them (intentionally or not). Diaries? Some people still keep them on paper... but many store them on computer, or publish them in blogs (which as discussed will mostly be gone).
Sobering thought isn't it? It's not necessarily hubris to say we ought to be saving more of this stuff; people a few hundred years from now should be able to learn from our failures, as well as our successes.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
The article is not about archiving "everything in the world." It's specifically about references in scholarly papers, which, for the past three or four centuries, have been part of the essential fabric of scientific research. In a research paper, everything you say is either supposed to be the result of your own direct observation, or backed by a traceable, verifiable, and critiquable authority.
You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."
Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.
This sort of thing is absolutely essential to the scientific process and makes science self-correcting.
The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.
"How to Do Nothing," kids activities, back in print!
It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.
The problem *is* the short lifespan of web pages. Even "reputable" publications move their pages around, or remove them entirely, breaking all links. I'm talking about major newspapers, scientific journals, etc. It's these people, the supposedly reputable ones, who need to do a better job. The way they're doing things now is indeed, "no way to run a culture."
Recently a colleague of mine published a paper in an online peer-reviewed journal which contained a trivial error (transposition typo) that nonetheless would change, in fact reverse, the interpretation of the results. They were permitted to fix this, months after the article had first been posted. Does this aid Progress, or is it Revisionist?
The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture. I think that anything that archives the good and bad of a culture is worth keeping around.
I have to disagree. An object which produces such trauma should not be preserved simply because the traumatic experience is shared. I think I have some form of post-traumatic stress disorder lingering from the day I saw the goatse thing - complete with horrifying flashbacks. That thing needs to go.
Why should any aspect of "culture" be preserved simply because it constitutes "culture"? If we preserve everything that we have in common, we will be compulsive hoarders and the people of the earth will soon be living under a heap of obsolete car tires, betamax tapes and floppy disks. When we are done with something, we should let it go.
Cool URIs don't change
A bit over-idealistic, but worth aiming towards even if you don't achieve 100% non-URI-breakage in practice.
I feel that search engines should slightly penalize sites that have a history of breaking links or making them redirect to a completely irrelevant page: partly because there is just less chance that the link you follow from the search engine will have the content you want, and partly because even if you do get to a correct page, its usefulness as a bookmark or a link from your own documents is reduced.
-- Ed Avis ed@membled.com
This is a real problem. When Vannevar Bush conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart built on Bush's ideas as did Ted Nelson (the guy who coined the term "hypertext") and Tim Berners-Lee.
And one of the design goals of the Xanadu server project was to provide exactly this sort of permanent storage and location-redundant backup. (We even referred to it as the "Library of Alexandria Problem" and named one of the machines after the Alexandrian librarian. B-) )
Unfortunately the project didn't succeed and the web filled the niche.
So now we have a distributed Library of Alexandria, holding the single copy of every "book", constant brushfires taking out important works, and a few "scribes" frantically trying to make copies of the whole thing (which copies, IF they exist, have to be accessed a different way than the original).
(Also coarse-grained one-way (text snippet->page or image) rather than fine-grained (text snippet, image region, or database entry->text snippet, image region, or database entry), one-way links rather than backfollowable links, and I could go on...)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Backups will contain the drafts in the future. Some of them will surely survive.
There's a weird kind of paradox involved in what will survive, though.
Digital media has that wonderful property that it can be reproduced *perfectly* -- such that the copy is indistinguishable from the original -- but it must be copied or it will die.
You can burn your vacation videos to CD so your grandkids will be able to see them -- but that CD won't be readable anymore in a decade, never mind a century. If you faithfully make sure they're recopied every once in a while, though (and possibly converted to whatever new video formats are invented), your descendants 500 years hence will be able to see you waving from behind that sandcastle in California, as if it were filmed yesterday. No more flipping through yellowed photographs or crumbling newspaper clippings.... Imagine it! A scientist may use your video to prove his point about how the sunsets on the west coast have improved since California sank into the ocean.
He has to use family videos, though, because two decades of scientifically-recorded data on weather patterns was all wiped out when a massive electromagnetic bomb was set off by terrorists in 2012.
Yeah, far-fetched example. I don't want to force the point, and definitely lots of stuff will survive... but our progeny won't be making the same kinds of attic discoveries that we can today.
"Hey, viddy all these ancient discs that Old Grampy Limp Devil had cached away up here! Can you run them? Nothing, huh? Oh, well."
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.