Domain: archive.org
Stories and comments across the archive that link to archive.org.
Comments · 7,005
-
Re:Business opportunity
Well, I guess I'll respond to my own post.
The disks are manufactured by Norsam in Oregon.
You send TIFF files to the manufacturer, each being the image of a page. The disk will hold anywhere from 1000 to 100,000 page images; the more pages you squeeze on, the more powerful the microscope needed to read it.
Los Alamos tested a disc, and it seemed to hold up pretty well, although a long time in salt water caused slow corrosion, and baking it at high temperature messed it up.
Norsam's Web site is mum on pricing, but a discussion among some of the Long Now/Rosetta Disc folks suggests a one-off disc might be as low as $2K. If I weren't such a lazy ass, I might sign up as a reseller.
For $10k, you can get a cute add-on I didn't expect-- a computerized microscope reader that shifts the field of view as you point and click. Microfiche for the modern age, and future ones too, I guess.
-
There's also the Internet Archive
There's also the Internet Archive, who are building a library of snapshots of publicly accessible Internet sites, currently standing at 14 terabytes of information stored on digital linear tapes.
The Internet grows at a rate of 10 percent a month, according to the Archive's estimates, while the average life of a Web page is only 75 days. Obviously, a lot of data is being lost. Much of that comes from commerce and media sites that often kill pages containing obsolete information.
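To put the 10-percent-a-month figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the growth compounds monthly at a constant rate, which the Archive's estimate doesn't actually promise; the function names are made up for illustration.

```python
import math


def annual_growth_factor(monthly_rate: float) -> float:
    """Size multiplier after 12 months of compound growth at monthly_rate."""
    return (1.0 + monthly_rate) ** 12


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed for the size to double at a constant monthly_rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


if __name__ == "__main__":
    rate = 0.10  # the Archive's estimate quoted above
    print(f"growth per year: x{annual_growth_factor(rate):.2f}")
    print(f"doubling time:   {doubling_time_months(rate):.1f} months")
```

Under that assumption the Net roughly triples every year and doubles in a bit over seven months, so any snapshot goes stale quickly.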
But some of this information is still relevant to researchers and historians.
For example, the Internet Ecologies Area at Xerox's Palo Alto Research Center is using multiple snapshots from the Internet Archive on disk -- "the Web in a box" -- as a kind of test tube for understanding the Web.
The ultimate goal of the Internet Archive is to provide free access to the Internet's complete past, so that individuals looking for clues into how a culture changes will have one more medium to play around with.
-
This could be good and bad
The immediate benefit of this kind of thing over current distributed services such as FreeNet is of course that data stored on LOCKSS will be permanently available irrespective of how many times people actually request that page. On FreeNet a page is only kept on the network for as long as people are actually requesting the page - there is a "decay" of old information which makes it unsuitable for this kind of guaranteed archival.
The other advantage of the LOCKSS system is that it maintains a certain number of redundant copies across the network, and regularly checks these against each other to ensure that the integrity of each copy is undisturbed by accidents and general bit rot. This system could keep data in pristine form for an indefinite amount of time - as long as the system runs the data is available and correct.
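The "check the copies against each other" idea can be sketched roughly like this. To be clear, this is not the actual LOCKSS protocol, just a toy majority vote over content hashes; the function names and the in-memory dict standing in for a set of peers are invented for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def digest(data: bytes) -> str:
    """SHA-256 fingerprint of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: Dict[str, bytes]) -> List[str]:
    """Compare all copies by hash and overwrite the ones that disagree
    with the majority. Returns the names of the peers that were repaired."""
    if not copies:
        return []
    digests = {peer: digest(data) for peer, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        # Without a strict majority there is no way to tell which copy is good.
        raise ValueError("no majority among copies")
    good = next(data for peer, data in copies.items() if digests[peer] == winner)
    repaired = [peer for peer in copies if digests[peer] != winner]
    for peer in repaired:
        copies[peer] = good
    return repaired


if __name__ == "__main__":
    peers = {"a": b"page", "b": b"page", "c": b"pagX"}  # "c" has bit rot
    print(audit_and_repair(peers))
```

Note the scheme only works while most copies are still intact, which is why the number of redundant copies matters.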
But as for its use as an archive for other kinds of content as suggested in the story? Well, given that it doesn't appear to be anonymous like FreeNet, the same problems that we're now seeing with Napster will undoubtedly occur, and given that the whole point of the system is to keep files on there no matter what happens, the people running the LOCKSS servers will want to keep a close eye on what goes onto the system since removal will be fairly difficult. I doubt that it'll take off for this kind of purpose without the guaranteed anonymity that FreeNet has.
Another related project worth a look is the Internet Archive, which provides snapshots of public Internet sites for researchers.
-
The Internet Archive
The Internet Archive is devoted to preserving the information contained in the Internet.
And I have just found an article from Steve Baldwin, the guy from Ghost Sites!
-
What's legal; what can you get away with?
If you're doing something like pulling headlines and providing links to the full story where it was found, I don't think anybody could reasonably object. If you summarize the story yourself (like slashdot) that's cool; if you quote some of the page it's a little more iffy. Search engines do it. If you're pulling text and serving it yourself, more folks are going to have a problem with that. I'd imagine most sane outfits would ask you (politely first) to desist before resorting to legal action. My understanding of "fair use" is that you could summarize and provide links, and even quote chunks directly, even if the original source objected.
What others get away with is probably the standard to test against. Google caches a lot of pages and serves them; is that republication without permission legal? I think they'll stop offering pages cached from your site if you ask; haven't heard of 'em being sued yet. LinuxToday consists mostly of quotes from stories from other sites; have they gotten any complaints? Granted, their process probably isn't automated.
Then there are things like the Internet Archive:
"A digital library for the future. We have already started to build it - we have already collected much of the Web as well as other public Internet materials including Netnews and downloadable software."
They claim to have nearly 20 terabytes collected, although AFAIK they don't make any of it available to the public. At least their spider is polite: it seems to honor robots.txt, and it's throttled. What's the legality there?
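For anyone writing their own headline-puller, "polite" in the sense above could look something like the sketch below. It says nothing about how the Archive's spider really works. The PoliteFetcher class, the "example-bot" agent name and the one-second default delay are all assumptions for illustration; only Python's standard urllib.robotparser is real.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Hypothetical helper: obeys robots.txt rules and spaces out requests."""

    def __init__(self, robots_txt: str, user_agent: str, min_delay: float = 1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between requests (assumed value)
        self._last_request = None
        self._rules = RobotFileParser()
        # Parse robots.txt text that was already downloaded from the site.
        self._rules.parse(robots_txt.splitlines())

    def allowed(self, url: str) -> bool:
        """True if robots.txt lets our user agent fetch this URL."""
        return self._rules.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        """Throttle: sleep until min_delay has passed since the last request."""
        if self._last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    bot = PoliteFetcher(robots, "example-bot", min_delay=1.0)
    for url in ("http://example.com/index.html", "http://example.com/private/x"):
        if bot.allowed(url):
            bot.wait_turn()
            print("would fetch", url)
        else:
            print("skipping", url)
```

Using an honest User-Agent string matters as much as the code: the site owner can only write a rule for a spider that identifies itself.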
Then again, there's eWatch, who runs a spider over sites being monitored daily, with never a query for robots.txt and a deceptive User-Agent string. They sell "Internet Monitoring":
"Safeguard stakeholder value, improve customer service, protect corporate reputation, monitor competition, identify trends, and pinpoint corporate activism... "
While not on topic, this quote from the propaganda for their new Cybersleuth(R; TM, & FU) service is also likely of interest:
"eWatch CyberSleuth will attempt to identify the entity or entities behind the screen name(s) which have targeted your organization."
...and...
"...then containment is the next step. Containment is a two part endeavor focusing on (1.) Neutralizing the information appearing online, and; (2.) Identifying the perpetrators behind the postings, rogue website, hack, etc. Neutralizing information posted online, if appropriate, is the removal of the offending messages from where ever they appear in cyberspace. This may mean something as simple as removing a posting from a web message board on Yahoo! to the shuttering of a terrorist web site. The objective is to not only stop the spread of incorrect information, but ensure that what has already spread is also eliminated."
Go check it out on their site; I've lost some context in quoting. That'll probably get me a free target's eye view of the service.
...but I tend to ramble. To get back to the point, I'd say go ahead and do it, openly and above board, and see what kind of feedback you get from your vict... sources.