Inside the Internet Archives

Posted by CmdrTaco on Wednesday June 18, 2008 @02:46AM from the those-who-ignore-history-are-doomed-to-eat-a-sandwich dept.

blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."

I wished archive.org stored even more stuff by jacquesm · 2008-06-18 03:07 · Score: 3, Insightful

I keep running into bookmarks that have gone AWOL, then find that archive.org doesn't have the pages anymore either.

A combined bookmarking and caching service would be really handy.

--
MP3 Search Engine
Re:Wayback by iangoldby · 2008-06-18 03:29 · Score: 3, Insightful

wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt.
In true Raymond Chen style, think about what the world would be like if this wasn't true: a site owner would have no way to remove his content from the Wayback Machine retrospectively. That raises far more problems than the ability of a new owner to remove a previous owner's content.
Re:Wayback by SydShamino · 2008-06-18 03:44 · Score: 5, Insightful

If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively.
I don't necessarily disagree with their policy, but this is the wrong argument for it.

If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.

If you don't want it to appear on the Wayback Machine, you have a tool called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.

I see what the archive does as a courtesy service, not something that site owners should expect.

--
It doesn't hurt to be nice.
Re:Wayback by RareButSeriousSideEf · 2008-06-18 03:52 · Score: 3, Insightful

Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
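
To make the rule concrete, here is a rough Python sketch of the decision described above. It is only an illustration, not how archive.org works: `Snapshot`, `may_display`, and all field names are hypothetical, and comparing a single registrant string is a big simplification of matching whois records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One archived capture (hypothetical structure, for illustration only)."""
    url: str
    registrant_at_capture: str   # assumed: owner taken from whois at crawl time
    allowed_at_capture: bool     # robots.txt permitted archiving at crawl time


def may_display(snapshot: Snapshot,
                current_registrant: str,
                current_robots_blocks: bool) -> bool:
    """Decide whether an old capture should still be shown."""
    # Obey the robots.txt that was in force when the page was crawled.
    if not snapshot.allowed_at_capture:
        return False
    # No current exclusion: nothing to apply retroactively.
    if not current_robots_blocks:
        return True
    # A current exclusion only hides old captures if the domain has not
    # changed hands; a new owner (e.g. a squatter) cannot erase them.
    return snapshot.registrant_at_capture != current_registrant
```

With this rule, a capture made under owner "alice" stays visible if a later owner adds a blocking robots.txt, but disappears if "alice" herself adds one. Manual removal requests from the original owner would still need a separate, human-reviewed path.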

--
Pi Ran Out