Inside the Internet Archives
blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."
I keep running into bookmarks that have gone awol, then find that archive.org also doesn't have the pages anymore.
Combining a bookmarking / chaching service would be really handy.
MP3 Search Engine
If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.
If you don't want it to appear on the Wayback Machine, you have an ability called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.
I see what the archive does to be a courtesy service, not something that the site owners should expect.
It doesn't hurt to be nice.
Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
Pi Ran Out