Link Rot and the US Supreme Court
necro81 writes "Hyperlinks are not forever. Link rot occurs when a source you've linked to no longer exists, or worse, exists in a different state than when the link was originally made. Even permalinks aren't necessarily permanent if a domain goes silent or switches ownership. According to new research from Harvard Law, some 49% of hyperlinks in Supreme Court documents no longer point to the correct original content. A second study on link rot from Yale stresses that for the Court, footnotes, citations, parenthetical asides, and historical context mean as much as the text of an opinion itself, which makes link rot a threat to future scholarship."
Which is not what you want to see in, say, an Apple versus Samsung style case where "prior art" and earlier applications are all that separate you from being successfully sued into the Stone Age.
Laughter is the Spackle of the Soul.
They should just start linking through the Wayback Machine.
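A minimal sketch of what "linking through the Wayback Machine" could look like, assuming the citation records the date the page was accessed. The archive serves captures under `https://web.archive.org/web/<timestamp>/<original URL>` and redirects to the capture nearest the timestamp; the function name `wayback_url` is made up for illustration, and nothing here guarantees that a capture actually exists.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, accessed: datetime) -> str:
    """Build a Wayback Machine link for the capture nearest to `accessed`."""
    # Captures are addressed by a UTC timestamp of the form YYYYMMDDhhmmss.
    stamp = accessed.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical access date for a cited page.
    cited_on = datetime(2013, 9, 23, 12, 0, 0, tzinfo=timezone.utc)
    print(wayback_url("http://www.archives.gov/exhibits/charters/constitution.html", cited_on))
```

The opinion would then print both the original URL and the archive link, so the citation still resolves if the original moves.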
Should documents then start including snapshots of the site (Wayback Machine-style) in document appendices? It's more work, sure, but it seems to be an obvious solution.
Link rot could be "a threat to future scholarship"? WHO SAID TRAINING FEWER LAWYERS WAS A BAD THING? I just don't see the problem.
Seven puppies were harmed during the making of this post.
Maybe instead of hyperlinks we need some kind of internet-ISBN and an archive of important data? Maybe the NSA is helping after all! *coughs violently*
Good thing the NSA has it all backed up!
Just go to waybackmachine.nsa.gov
As someone who has worked on multiple Supreme Court briefs, the contents of some website referenced in an opinion certainly do not "mean as much as the text of an opinion itself."
This has been a well-known problem for at least a couple of decades. Google had its famous cache, known for saving people's hides or exposing people's mistakes. The people who run the Wayback Machine have been fighting this problem for many, many years.
There is a natural resistance to preserving content as it was at the time. People, companies and governments like to write revisionist history, forgetting that certain things ever happened or changing them after the fact. Specialized companies help with reputation management by ensuring that such things disappear for good.
It's a problem everywhere, from tech support documentation that disappears to tracking down old employers that have changed their name and moved location. The only way to resolve the issue is to preserve the content as it was for posterity. Always assume your links will vanish and turn the pages you need into archive files. If you really want to do something about it, donate to the Internet Archive.
For fuck's sake, this is one reason why PURLs exist. The trainwreck that is a constant string of dynamic URLs *printed* out in court opinions is an example of shameful institutional incompetence, regardless of whether it's willful ignorance or just plain ignorance.
What is required to address this is an official government domain that hosts static screen captures of web pages, provides PURLs to point to them, and ideally uses a URL-shortening function like goo.gl or bit.ly.
Then, instead of including a long, difficult-to-retype URL in the opinion, the short, easy-to-type PURL appears in the opinion. The supplemental info for the citation includes things like original URL and date accessed, and the given PURL will point to the material in question.
Opposed to this idea will be copyright owners who fear that court opinions will eliminate their revenues by providing free access to material they usually charge for. Because this kind of opposition is an easy way to score political points (big government! wasting taxpayer dollars!! eminent domain of the little guy's copyrighted material!!!) and to make money, this obvious solution will be long delayed. When it is ultimately decided upon, it will be thousands of times more expensive than need be, take three times as long to roll out, be created using shoddy technology that will break very quickly, and be used as yet another example of government failure.
Everything goes on one massive drive, and you grep keywords. Bring along a donut and coffee - it may be a while!
PURLs and the like assume that there's going to be someone around to maintain the content, and maintain the linkage to the content.
If a document is officially 'published' and given some sort of persistent ID (eg, DOI, ARK, Handle, whatever), then citing documents *should* use those over URLs.
If however, you're just citing an example that's just some web site on the internet ... then you're SOL. They have no reason to never change their materials, keep a given version around 'til the end of time, or inform you if it's been moved elsewhere.
eg, say that there's a complaint about some process, and they cite Montgomery Ward's website as an example where it was done previously ... of course, the company doesn't exist any more. This is much different than someone locking up an article behind a paywall -- they *want* you to find the item, so they can then try to get $30 or whatever out of you.
(of course, I've just spent the last week talking about all of these issues, between meetings of DataCite, Research Data Alliance and Force 11)
Build it, and they will come^Hplain.
This should be a mission of the Library of Congress - to archive everything ever used by the government (including court cases), be it on the Internet or not.
While they're at it, they can probably archive nearly everything else.
Great warrior...hrmph! Wars not make one great.
I would like to see participation in the memento project/stable URIs (http://mementoweb.org) become considered as a fundamental element of being considered "a journalist", part of the media, etc., in order to get the protections of that status. The lack of a consistent history in the web based media is harmful, and more than one massive corporation has used the "fluidity" of the web and hyperlinks to be more than fluid with the truth.
http://www.metafilter.com/98913/Ancestors-we-will-never-know-presage-feelings-we-can-never-have-now-go-forth-and-time-travel-on-the-web
Oh, and yeah, for e-laws and the presentation of findings of governmental groups and organizations, and those receiving governmentally recognized status as candidates of recognized parties with web presences... mandate that asap, and I will hug the government!
Transparency is only as good as your hyperlink protection and preservation plan.
Two guys rent a boat and go out fishing. They find a spot where there's lots to catch, and one of them makes a sign on the side of the boat pointing down to the water, and explains that it's so they can find the spot again. You fool, says the other, the next time we come we may not get the same boat.
It is impossible for one person to create all the content needed to educate someone from K-12 through college. However, you could write a hypertext document that links to such content. I did not do this, however, because of link rot. The obvious solution for these lawyers is to back up any page they want and have it documented, and not simply use URLs.
Someday, someone will have a good system to educate people spoon fed style on the Internet. For now, learning on the Internet can be far superior to a university education, but you need to be proactive about how you go about doing it. People who don't know how to educate themselves on the Internet are the people who need an education the most. Maybe a site that would give people these tools would be of use, but I'm sure someone made one.
God spoke to me
Isn't this precisely the type of thing archive.org exists for?
dupe
http://tech.slashdot.org/story/11/09/21/122210/implications-of-broken-links
The largest such study ever done is:
http://arxiv.org/abs/1105.3459
We took every article on arXiv and in an institutional repository, and checked whether each URI it referenced still existed and whether it was in a web archive. For those in an archive, we also determined the difference between the time the article was published and the closest archived copy.
The bad news: Less than 25% of URIs referenced from the papers are currently archived. We need to be pro-active in archiving important web resources.
The good news: With perma.cc and more web archives coming online, plus active engagement (such as http://www.hiberlink.org/) we hope to see an improvement
-- Rob Sanderson (first author on the paper) // azaroth42@gmail.com
Perhaps the legal world needs something similar to the DOI that academics use (http://www.doi.org/)?
Sounds like the Justice Dept. needs a better CMS.
Company intranets also suffer from link rot, and some make it worse by using tools that inherently promote link rot.
The point is that files get moved around on filesystems now and then "for better structure", "making it easier to find" and other lame excuses, but if every file had a unique ID that links could use, then the files could be moved around as much as anyone likes without causing harm.
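As a rough illustration of the unique-ID idea, here is a sketch of a registry that maps a stable ID to a file's current location. Everything in it (the `FileRegistry` class, its methods, the example paths) is hypothetical, and an in-memory dict stands in for whatever database a real intranet would use.

```python
import uuid
from pathlib import PurePosixPath


class FileRegistry:
    """Hypothetical registry: stable file ID -> current location."""

    def __init__(self):
        # Maps ID (hex string) to the path where the file lives right now.
        self._locations = {}

    def register(self, path):
        """Assign a new permanent ID to a file and return it."""
        file_id = uuid.uuid4().hex
        self._locations[file_id] = PurePosixPath(path)
        return file_id

    def move(self, file_id, new_path):
        """Record a relocation; the ID that links use does not change."""
        if file_id not in self._locations:
            raise KeyError(file_id)
        self._locations[file_id] = PurePosixPath(new_path)

    def resolve(self, file_id):
        """Return the current location for a link that carries this ID."""
        return self._locations[file_id]


if __name__ == "__main__":
    registry = FileRegistry()
    doc_id = registry.register("/shared/hr/policies/travel.doc")
    registry.move(doc_id, "/shared/policies/2013/travel.doc")  # "better structure"
    print(doc_id, "->", registry.resolve(doc_id))
```

Links embed only the ID, so reorganizing the filesystem means updating one registry entry instead of every page that points at the file.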
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Well, don't want to sound like a dick or nothin', but, ah... you talk like a fag, and your shit's all retarded.
Get free satoshi (Bitcoin) and Dogecoins
I would also add that this should be done with a "write once" kind of storage back end. This way we have some small assurance it was not modified.
You could go even further and keep a running log on the same medium that had an md5 of each previous content item which was then md5'd with the current.
This seems (to me at least) like it would provide a verifiable trail that shows the written contents were not tampered with.
Would this kind of scheme be useful? Or am I missing something obvious?
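The running log described above is essentially a hash chain. A minimal sketch, with two assumptions labeled: SHA-256 is swapped in for md5 (MD5 has known collision attacks), and the function names and the all-zero `GENESIS` starting value are made up for illustration.

```python
import hashlib

# Assumption: a fixed placeholder stands in for the "previous digest" of the first entry.
GENESIS = "0" * 64


def chain_digest(previous_digest, content):
    """Hash the current item, then hash that together with the previous log entry."""
    item_digest = hashlib.sha256(content).hexdigest()
    return hashlib.sha256((previous_digest + item_digest).encode("ascii")).hexdigest()


def build_log(items):
    """Return one chained digest per item, in order."""
    log = []
    previous = GENESIS
    for content in items:
        previous = chain_digest(previous, content)
        log.append(previous)
    return log


def verify_log(items, log):
    """True only if every item still matches the recorded chain."""
    return len(items) == len(log) and build_log(items) == log


if __name__ == "__main__":
    pages = [b"cited page 1", b"cited page 2", b"cited page 3"]
    log = build_log(pages)
    print(verify_log(pages, log))                                       # untouched archive
    print(verify_log([pages[0], b"edited after the fact", pages[2]], log))  # altered item
```

Because each entry depends on the one before it, altering any archived item changes that entry and every later one, so tampering is detectable as long as the log itself sits on the write-once medium.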
(stolen from DaBum) I am dyslexia of borg - your ass will be laminated.
I have always found that whenever an opinion cites a URL the courts are careful to indicate the date that it was accessed. A hard copy (or at least a PDF) of the page as it existed at that time is then retained by the clerk in the case file. There's usually a footnote concerning this arrangement.
It's not that hard. No need for fancy technology or mass archiving of the Internet. The only thing they need is a basic PDF writer. Problem solved.
Forget URLs that depend on some government site's web server filesystem layout, which might change, or some PHP script's specific dynamic URL syntax. Just throw all of the damn things into a flat filesystem or DB and fetch them by the SHA1 hash. (Or something larger if collisions are a problem.)
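A sketch of that flat, fetch-by-hash store, under stated assumptions: a single directory plays the role of the flat filesystem, and the `ContentStore` class and `HASH_NAME` constant are hypothetical names. SHA-1 is used because the comment names it; changing `HASH_NAME` to `"sha256"` is the "something larger" option.

```python
import hashlib
import tempfile
from pathlib import Path

HASH_NAME = "sha1"  # per the comment; set to "sha256" if collisions are a concern


class ContentStore:
    """Hypothetical flat store: each document is saved under the hash of its bytes."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data):
        """Store the bytes and return the hash that now names them."""
        key = hashlib.new(HASH_NAME, data).hexdigest()
        (self.root / key).write_bytes(data)
        return key

    def get(self, key):
        """Fetch by hash and confirm the bytes still match their name."""
        if not key or any(c not in "0123456789abcdef" for c in key):
            raise ValueError("key must be a lowercase hex digest")
        data = (self.root / key).read_bytes()
        if hashlib.new(HASH_NAME, data).hexdigest() != key:
            raise ValueError("stored content does not match its hash")
        return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ContentStore(tmp)
        key = store.put(b"snapshot of the cited page")
        print(key, store.get(key))
```

The opinion would cite the hash, and anyone holding a copy can check that it is the right document regardless of where it is hosted.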
I think that Julian Assange talked about doing this in an interview I read. It really does make a lot of sense. You can make sure you have the right document and that it has not been altered.
Ha! I found it! Interview with Assange and Eric Schmidt.
http://techpresident.com/news/23773/googles-eric-schmidt-and-wikileaks-julian-assange-get-one-anothers-jokes
"Schmidt asks Assange what technologies he's looking out for to make it easier for an anonymous sender to reach out to a dubious recipient. He responds:
The most important one is naming things properly. If we are able to name some... a video file or a piece of text in a way that is intrinsically coupled to the information there, so that there is no ambiguity-- a hash is an example of this--but then there's variations, maybe you want one that human beings can actually remember. Then it permits this information to be spread in such a way where you don't have to trust the underlying networks. And you can flood it."
I don't read your sig. Why are you reading mine?
For something as important as court cases surely you make a copy so it can't be lost.
Court documents actually just list a link? With no copy/printout of what it links to? Really? If ever there was a doubt about how stupid and clueless judges are, it's the fact that they allow shit like this to exist in official court documents.
What next? Someone puts in their court document to "Google it"? Seems that would probably be better than a permanent link.
The only link that matters still works.
http://www.archives.gov/exhibits/charters/constitution.html
Too bad they can't reference this one more.
Why not embed a checksum of the original document (probably just the plaintext) in every URL? It would allow search engines to help find archived copies of the document on other sites, and would be a trivial thing to automate in the major blogging engines. Heck, Apache/Nginx could be easily extended to automate this, along with a 404 handler that would generate a search query for the checksum it received. Good idea/bad idea/better idea?
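One way the checksum-in-URL idea might look, as a sketch only. The query parameter name `sha256`, the whitespace normalization, and the `search.example` endpoint are all assumptions made for illustration; no existing blogging engine or web server convention is implied.

```python
import hashlib
from urllib.parse import parse_qsl, quote_plus, urlencode, urlsplit, urlunsplit

CHECKSUM_PARAM = "sha256"                       # hypothetical parameter name
SEARCH_ENDPOINT = "https://search.example/?q="  # placeholder search engine


def text_checksum(plaintext):
    """Checksum of the document's plain text, with runs of whitespace collapsed."""
    normalized = " ".join(plaintext.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_checksum(url, plaintext):
    """Return the URL with the document's checksum appended as a query parameter."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != CHECKSUM_PARAM]
    query.append((CHECKSUM_PARAM, text_checksum(plaintext)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fallback_search_url(url):
    """What a 404 handler could redirect to: a search for the embedded checksum."""
    checksum = dict(parse_qsl(urlsplit(url).query)).get(CHECKSUM_PARAM)
    if checksum is None:
        return None
    return SEARCH_ENDPOINT + quote_plus(checksum)


if __name__ == "__main__":
    link = add_checksum("http://example.com/post?id=7", "The text of the cited page.")
    print(link)
    print(fallback_search_url(link))
```

The search fallback only helps if archives publish the same checksum for their copies, so the normalization rule would need to be agreed on.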
-- member of Project Xanadu
Credo sim. - I think I am.
srsly, SCOTUS isn't the first place of long term reference to have this problem.
Credo sim. - I think I am.
This is because the current IP protocols are Dumb when it comes to data. I mean that with a capital D. Not that the designers are dumb, but the protocol itself is just dumb, in that it knows nothing about the data.
We suffer from the fact that IPv4 and IPv6 do not have store and forward. Instead of, or in addition to, endpoint IDs, all the routers need to have a large cache for versioned content. You can still have your frackin' unversioned uncacheable content, however we need a more permanent store and forward service. This will reduce bandwidth consumption, and is essential for bringing the Internet to space: it's part of the Interstellar DTN (delay tolerant network).
Imagine the entire Internet as a hybrid between a decentralized distributed file store, and the current IP stack. Instead of requesting an endpoint we could request the data hash. A distributed hash table could serve the content from within the Internet. ISPs can vastly decrease bandwidth by increasing their cache duplication size (as we have currently), but when a cache miss happens it could be served by another cache in the distributed hash table on up the chain to the origin. "What about updates to documents? My cached pages!" Fools, the doc will have a different hash. We could actually SOLVE issues whereby resource names must be changed by simply requesting them based on their internal content hashes. Additionally, we can fix the issue of mixed secure / insecure content while we're at it. A resource referenced inside a secure document can include THE HASH ID of the resource. Thus, you know the insecure and cacheable content you're pulling in is unmodified...
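A toy sketch of that lookup, assuming plain dicts stand in for the caches and the origin server (nothing here touches a real network or a real DHT). The names `content_hash` and `fetch_by_hash` are made up. It walks the caches nearest-first, falls back to the origin on a miss, and checks every answer against the requested hash, which is why an untrusted cache cannot hand back modified content.

```python
import hashlib


def content_hash(data):
    """The name of a resource is the hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


def fetch_by_hash(wanted, caches, origin):
    """Return the bytes named by `wanted`, trying caches nearest-first, then the origin."""
    missed = []
    for cache in caches:
        data = cache.get(wanted)
        if data is not None and content_hash(data) == wanted:
            break  # verified hit
        missed.append(cache)  # miss, or content that fails verification
    else:
        data = origin.get(wanted)
        if data is None or content_hash(data) != wanted:
            raise LookupError("no verified copy of " + wanted)
    # Fill (or repair) the caches that could not answer.
    for cache in missed:
        cache[wanted] = data
    return data


if __name__ == "__main__":
    page_v1 = b"opinion text, version 1"
    page_v2 = b"opinion text, version 2"
    origin = {content_hash(page_v1): page_v1, content_hash(page_v2): page_v2}
    isp_cache, regional_cache = {}, {}
    print(fetch_by_hash(content_hash(page_v1), [isp_cache, regional_cache], origin))
    # An updated document has a different hash, so the cached v1 never goes stale.
    print(content_hash(page_v1) != content_hash(page_v2))
```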
Nope, we can't have nice things because you fuckers regard the old farts who designed the current antiquated systems as if they were gods, even though store and forward works beautifully for packet radio. (Hint: The FCC disallows any use of store and forward by unlicensed civilians.) Otherwise we could have a decentralized unsnoopable high-speed (largely) wireless Internet that grows organically with demand with little or no fees (everyone's a node hosting data, buy a box once and you're done).
The main barriers to solving the problem are ISP greed, draconian copyright laws, and desire for a surveillance state.
Note, this WILL all happen eventually anyway, you idiots are just too foolish to realize it, so it'll turn out to be a cluster fuck like "The Web" is now because the end result will be evolved by bolting on shite to the current systems over the years instead of being designed with the desired end result problem space in mind. Eg: Colocation fees? WTF? This is a hack to move data closer to endpoints... like store and forward achieves by design.
kthx.
Why don't they just put what was referred to as an attachment with the documents?
Here's how to fix this: You quote, in full, the web page or article in question. Then, you *attribute* it to the url where you found it, the date that you found it there, and the author/copyright holder. Now, it doesn't matter if the page changes or the site goes away. The content is preserved, the source is attributed. And, copyright troll lawyers aside, no one is harmed!
You see? You see? Your stupid minds! Stupid! Stupid!
Every time you make a citation, copy it into a legal database, and then reference that database entry IN ADDITION to the original URL. Include date and time... end of controversy.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
And people make fun of me for saving web pages rather than bookmarking them. Link rot has always been my reason.