Interview with Brewster Kahle
Netmonger writes "A fascinating interview with the man behind The Wayback Machine. Some specs from the article: "It's 150-odd standard PC cases, with four drives in each.. 'Over 100 terabytes.. As plain text in book form, that'd be over 3000 miles of shelf space.." All I can say is.. Wow!"
How many miles of shelf space equal one Library of Congress? Let's use standard units here, people!
Does this thing still exist? I thought he had to kill it....
So why would you want to preserve all of it? Why not just get the good stuff and maybe he won't need so many computers. I understand that just choosing the good stuff would be very subjective, but do we really need archives of pr0n sites and popups?
Visualize Whirled Peas
to get old pr0n! :D
"Times have not become more violent. They have just become more televised."
-Marilyn Manson
It's a shame that some of the more interesting moments in Internet history are so transient the Wayback Machine can't catch them.
e.g. The Ded Kitty picture we put up when Napster shut down at the start of September: it was only there for a few hours, so it will be lost.
Of course, some of the more interesting transient events are websites that are hacked, but there exist dedicated archives for this kind of event, so you can relive the hilarity of RIAA.org being repeatedly defaced.
"As plain text in book form, that'd be over 3000 miles of shelf space.."
Huh? How about "If all data was spoken at once, it would be as loud as 674 jet engines!" Or "If this archive were a planet, it would be as large as Jupiter!"
At CFP 97... he scared me.
Now how many Libraries of Congress is that?
Trying is the first step towards failure.
As Borges once said about the Library of Babel wayback now...
...The Library is a sphere whose exact center is any one of its hexagons and whose circumference is inaccessible.
The universe (which others call the Library) is composed of an indefinite and perhaps infinite number of hexagonal galleries, with vast air shafts between, surrounded by very low railings.
Looks like he wasn't too far off...
Well, maybe not...
I'd hate to see the history of the net destroyed if the sprinkler system goes off in their server room...
Who is this Kahle guy? I know for a fact that it is Mr. Peabody who is behind the way-back machine. I was with him when he visited Nobel.
Slashdotters are stupid and biased.
Perhaps we need to propose an extension to the robots.txt file to tell certain classes of search crawler to visit more frequently or at specific times?
Here. They seek to create physical items (clocks and libraries are two items they name) that will last for very, very long periods of time. This diagram shows what is meant by the "long now", and this is a link to their first prototype clock that is on display in the Science Museum in the UK (the second clock on the page).
I did a quick price check and for 100 terabytes of data on 80GB drives (Best price/size ratio I could find), that's about $111,250 worth of storage. Of course, I guess they would get bulk discounts :).
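The arithmetic behind that estimate can be sketched roughly as follows. This assumes decimal units (1 TB = 1000 GB) and works backwards from the quoted total; the post never states a per-drive price, so the figure computed here is an inference and the function name is purely illustrative.

```python
import math

TOTAL_TB = 100              # archive size quoted in the article
DRIVE_GB = 80               # drive size the poster priced
QUOTED_TOTAL_USD = 111_250  # total the poster arrived at


def drives_needed(total_tb, drive_gb):
    """Number of drives needed to hold total_tb, assuming 1 TB = 1000 GB."""
    return math.ceil(total_tb * 1000 / drive_gb)


n_drives = drives_needed(TOTAL_TB, DRIVE_GB)

# Inferred, not quoted: the per-drive price that the quoted total implies.
implied_price = QUOTED_TOTAL_USD / n_drives

print(f"{n_drives} drives at about ${implied_price:.2f} each")
```

Note that this counts raw capacity only; it leaves out the PCs, controllers, and any redundancy.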
Just because I doubt myself does not mean I find your position compelling.
-Cyc
/.'s 10 Millionth
...Beowulf Cluster reference.
Bubble, bubble, toil and trouble... can't we just go to Starbucks for coffee?
"Over 100 terabytes.. As plain text in book form, that'd be over 3000 miles of shelf space.."
I don't understand terabyte or the shelf space analogy...
I need to know how many bananas.
nbfn
There's an excellent interview with Kahle on technical details at O'Reilly's own archive -- here.
"Freedom is kind of a hobby with me, and I have disposable income that I'll spend to find out how to get people more."
A lot of internet information is crap... So why would you want to preserve all of it? Why not just get the good stuff and maybe he won't need so many computers.
And of course, you're going to decide what is "good" and what "isn't"? He is providing the resource for, among other things, scholarly researchers. Of what use is the data if it has been hand edited according to one person's aesthetics or another's?
Indeed, your comment reminds me of one that was heard quite often, shortly before beautiful and irreplaceable old buildings were razed to make way for a new strip mall, or, in downtown Chicago, a couple of new government buildings whose architectural style is best described as "Federal Drab." Preserving as much as possible is a good thing, because none of us can tell what will be valuable, and what will not, in another 20 or 30 years, and no one's aesthetic should be dictating such a decision to entire generations to come.
The Future of Human Evolution: Autonomy
Something you may think is crap may not be crap to someone else... I'm sure someone out there is interested in the millions of Britney Spears or *NStank picture galleries.
...except for the fact that he allowed the Church of Scientology to bend him over and use him like a toy. Why doesn't he get some Google backbone and refuse to bow to their DMCA threats?
Oh, I forget that honor is dead on the internet.
Only in slashdot are posts of solidarity modded at -1 Redundant, while posts of antagonism are modded as -1 Flamebait.
And here I thought it was Mr. Peabody that invented the Wayback Machine. No, hang on, it was Al Gore...
But seriously, unless you know about this project, and the fact that you can ask to remove data from the archives (though there's no reference as to how to actually do it), it means that your Internet past can haunt you forever.
Or at least until simultaneous attacks occur on Cairo and San Francisco...
http://www.mindjack.com/feature/archive.html
In the interest of full disclosure, I wrote it, so be gentle.
Hint: Don't put security pages in your robots.txt which aren't supposed to be linked.... or at least secure them with a password.
http://www.zone-h.org/en/news/read/id=894/
I put in www.archive.org into the wayback machine and my computer exploded!
For other Brewster Kahle interviews, see also the Slashdot story that pointed to the O'Reilly interview and the Slashdot story that pointed to the Feed magazine interview (which is currently inaccessible from my machine).
This sounds great, but there are a lot of limitations. It's not just that the archive is transient (every 60 days), it's also static. Any web pages that access pay sites are not found. Any cool database links that you put into your Web Page and accessed through a cgi - I'm guessing that they are toast.
For some reason, it got modded offtopic, but it seems relevant - for an archive of the internet to remain physically intact for long periods of time, reliable storage media have to be created - media that will survive unmaintained for long periods of time.
What in the hell is wrong with you guys today? Is there a bug in the slashcode giving mod points out to trolls?! I don't see how the parent could be more on-topic unless it was written by Brewster Kahle himself, and who knows maybe it was! Jeeze, get a clue!
Yeah, that's a shitload of CD-Keys and Serial #s.
I just realized - if terrorists blow up Cairo and the Bay area, I'm going to be the first one on the suspect list!
Damn! Now I'm really interested in how to remove stuff from their archive!
I was curious as to how the Wayback Machine's operators view its legal status... I mean, it's not really a search engine in the broadly accepted meaning of the term. It doesn't just search what's out there, it archives entire pages of old information. And while search engine sites do this (Google), this is ALL the Wayback Machine site does.
Surely they must know they're treading on untested legal ground. All it might take is one offended copyright holder to bring the whole thing to its knees. Basing it in a country other than the USA might have been smarter, then, given the existence of laws like the DMCA which could serve to shut the site down.
occultae nullus est respectus musicae - originally a Greek proverb
um all the cd keys and serials ever invented by anyone wouldn't fill up a large fraction of that space. I'm talking about isos and divx/svcd
Small personal thanks from me. I had put an online exhibit of my artwork up a few years ago, but unfortunately lost all of it to a hard drive failure. Much to my surprise I was able to find nearly all of my site, http://www.gpapassavas.com, online and backed up on the WBM.
Out of curiosity, why only four drives per PC?
With a simple $10 PCI IDE card (per additional 4), you could have gotten at *least* 8 drives, possibly as many as 16, per case. Granted, not many cases will let you *mount* that many, but I would expect paying a few bucks extra for the IDE cards and a better case would save quite a bit of money (and physical space) by halving or quartering the number of PCs you need ($100 extra to save $1500 per $2000, not counting the drives themselves?).
88lf of machines vs 22lf. One requires an entire room, one would fit on a standard sized 3-or-4-tier storage rack. Of course, speaking of racks (of a different sort)... What on earth made you go with an array of standard PCs rather than a raid-in-a-rack?
well, indeed I guess that was not the case and a typo or a slip of the mouse and not censorship. Still, I plan on ramming the broken Tabasco bottle up Taco's ass, it will be a blast for all parties involved.
I always have to chuckle when I see these analogies. "If you printed all of the data on a CD-ROM, it would reach Mars!"... that's super.
There are at least two problems with such analogies:
1) People use them to comment on the marvelous efficiency of technology - but in reality, it's only a comment on the hideous inefficiency of print. It doesn't say much at all about technology. It might be useful to convince people to digitize/OCR their printed matter - but is anyone *not* doing this? Even the Library of Congress is scanning its texts now.
2) In this case it's a particularly bad analogy, because it assumes that all data is printed as hex. Example: images, which are obviously a huge, huge chunk of the Wayback archive. Virtually all website images are small enough to print on a printed page at full resolution. But consider a 500x500-pixel image, at 16 bits (2 bytes per pixel, 2 chars to represent each byte)... that's 1,000,000 characters, or 1,000 pages!
Basically the analogy is good for wildly inflating some numbers to stun the 0.00001% of the population that doesn't already realize these things.
- David Stein
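The image arithmetic in point 2 can be checked with a short sketch. The 1,000 characters per page is not stated in the post; it is inferred here from the jump from 1,000,000 characters to 1,000 pages, and the function name is illustrative.

```python
def hex_dump_size(width_px, height_px, bytes_per_pixel=2,
                  chars_per_byte=2, chars_per_page=1000):
    """Characters and pages needed to print an image as hex digits.

    chars_per_page is an assumption inferred from the post's own numbers.
    """
    chars = width_px * height_px * bytes_per_pixel * chars_per_byte
    pages = -(-chars // chars_per_page)  # ceiling division
    return chars, pages


chars, pages = hex_dump_size(500, 500)
print(f"{chars:,} characters, {pages:,} pages")
```

Printed as a picture, the same image fits on a single page, which is the poster's point about how much the hex assumption inflates the shelf-space figure.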
Computer over. Virus = very yes.
What would you give for a video clip of your great-grandmother? I'd give a lot.
What a perv! What does he want with my grandmother?
It's OK, but what's so fascinating about it? Honestly, I don't get it. I get the archive's idea - I use it myself. I just don't understand what was so 'fascinating' about the interview...
creation science book
...Who's going to archive the archive?
I knew I'd read this Slashdot story before!
http://web.archive.org/web/19971221012817/http://slashdot.org/
mwahahahaha
#include <sig.h>
If there is a way to permanently erase pages from the archive, I would be a little less worried. But I can never tell if they let you delete stuff, or just "block" it. "Blocking" is crap, we all know what that will be worth if somebody really wants the info someday and knows the Archive has it.
They'll tell my academic grades to all my future generations.
Or even worse, my children or grandchildren may realize that I used to be a Porn star!
"Yeah, but can it make coffee?"
Response being:
Of course it can! But since it's the Wayback Machine, it's yesterday's coffee... old, cold, and slightly burnt (but when you gotta, you gotta)...
... there is no spoon
In spite of the fact they seem to get a good amount of funding for this project, it seems the equipment they can afford cannot nicely handle many, if not most, of the page requests. I tried to access a website on a date I know for certain it was up, and their proxy server timed out.
Question: How many miles of shelf space equal one Library of Congress? Let's use standard units here, people!
Answer: 1 L.O.C. = 13.2 MS EULAs.
Malkovich. Malkovich Malkovich Malkovich.
Malkovich... Malkovich?
Malkovich? Malkovich!
MalkovichMalkovichMalkovichMalkovichMalkovich
~~Malkovich Malkoviiiich...~~
Malkovich!!!
Technologists have promised the digital library for decades. In 1945, Vannevar Bush, who was technology adviser to several US presidents, wrote an article in The Atlantic magazine outlining how computers might one day augment libraries.
Those who find this subject interesting, but who may not be familiar with Vannevar Bush's work, might want to read the paper to which Brewster Kahle refers.
Please donate your spare CPU cycles to help fight cancer and other diseases
The whole point of comprehensive library collections is that you can't tell in advance what will be important.
It's not the life of the web page that is the shifting sands of our intellectual society, it's what we hold to be important as we live. As in all the previous thousands of years of human history, the important stuff will be remembered and copied down.
Whether or not people preserve Visa Card or MasterCard receipts long after the bill is paid will not make their lives better.
What is your phone number?
As a woman, I find it increasingly difficult these days to find a man who is willing to have sex with me. I'm attractive and buxom. I am not interested in a whole lot of relationship baggage, just a man who is willing to let me stick my finger into his anus right when he is at the moment of climax. I know men like it, so what is their problem?
Oh well, guess I'm stuck with Ladybird for now (that's what I call my dildo).
....a prior version of the Wayback machine?
Who has one of those, hmmmm?
If the current archive has 100TB, and is growing by 10TB a month, imagine what Wayback-Wayback would need to have.....
I just almost shit myself because that was so funny!
Why you decided to broadcast that to the Slashdot community, I have no idea.
Really? And where is Mr. Ward these days?
"Love is a familiar; Love is a devil: there is no evil angel but Love." --William Shakespeare ('Love's Labors Lost')
Try accessing news stories immediately prior to and after the September 11 attack and you'll see just how valuable this website is... or rather, isn't.
I have also personally run a website which contained fairly controversial material (based on this story) that I saw listed on their website and then removed shortly thereafter. Tell me, why would a service like this ever have occasion to remove material once it's been archived, especially if there are *NO* copyright issues and the webmaster of the archived site never asked them to remove it?
The answer is simple: the powers-that-be saw how dangerous it was to make all this information available to anyone on demand so they took control. It would be a great service were it allowed to operate unfettered, but the reality is quite different.
And I'm the first to mention this here so far? You should all be modded down -1 for naiveté.
Is this truly the only Earth I can live on?
= 23.42 Bills of constitution? + div. amend.
... whenever a text is transmitted, variation occurs. This is because human beings are careless, fallible, and occasiona
I mean seriously, it's gonna take more than some CD-RWs, baby.
I'm Rick James with mod points biatch!
on how long before a politician has to resign because of some over the top statements he/she made in a flamewar back in college? Or maybe that webpage of ethnic jokes that seemed so hilarious back in high school.
I have a feeling we are either going to have to become way more forgiving, or we're going to be stuck with only faceless boring types with no opinions as our leaders (no wisecracks, it could be much worse than it is now).
That's Project Gutenberg's Distributed Proofreading effort (much more fun than that *other* DP).
-- Help Digitise the Public Domain at DP.
Because it is understood that the pages are *archived*, and are not the Wayback Machine's property or copyright but that of the original page.
Wouldn't it make more economic and performance sense to cut the number of PCs by a third and take the $50k and invest in a more high-performance and space-conservative disk subsystem?
Something like this.
Would give you far better disk performance and scalability than trying to add another 200 PCs with IDE disks.
Don't pick on the Wayback Machine... it's great for when a site gets /.'ed and it's down... use the Wayback Machine to see it :)
--"The revolution will be simulcast..."--
All Hail the Wayback Guy!
Mindjack had an in-depth report on the Internet Archive a few weeks ago, with pictures from the inside.
http://web.archive.org/web/19971221012817/http://slashdot.org/
With quality website snapshots like this, I can see how it will be a great resource for future historians!
== Jez ==
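For anyone curious how a snapshot link like that is put together, here is a minimal sketch that splits it into a capture time and the original address. It assumes the 14-digit path segment is a YYYYMMDDhhmmss timestamp, which is how these links are commonly read; the function name is made up for illustration and this is not an official Internet Archive API.

```python
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web/"


def parse_wayback_url(url):
    """Split a snapshot URL into (capture time, original URL).

    Assumes the first path segment after the prefix is a
    14-digit YYYYMMDDhhmmss timestamp.
    """
    if not url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine snapshot URL")
    rest = url[len(WAYBACK_PREFIX):]
    stamp, _, original = rest.partition("/")
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return captured, original


captured, original = parse_wayback_url(
    "http://web.archive.org/web/19971221012817/http://slashdot.org/"
)
print(captured.isoformat(), original)
```

Read this way, the snapshot in question dates from 21 December 1997.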
Do you miss Firefox? Try Pale Moon.
A site I run (sniggle.net - formerly found at syntac.net) was removed from the wayback machine when the church of Scientology complained about an image of L. Ron Hubbard on one of the site's pages.
Now, not only all of the pages on my site, but all of the pages at syntac.net have vanished from the wayback machine.
Oh yeah, and they can't be found at Bibliotheca Alexandrina either, so that's no solution.
Brewster's going to have to turn down his rhetoric about the wayback machine a bit until he gets the resources to fight back. Otherwise people might get the impression that he really is keeping the history of the web, even the parts of the web that entities like the church of Scientology don't like, alive.
Quiquid latine dictum sit altum viditur
Sigh.
Didn't you mean this?
This is a goatse link - don't click it unless you're into that kind of thing.
He said the Wayback machine is 150-odd computers. 150 is EVEN, not odd.
Not like it really matters, though.
Just one more comment to be archived forever and nitpicked from the future.
www.Beyond7.com Insane modern art water sculpture.
Well, it seems as if the wayback machine has indeed created a paradox. A simple lookup of 127.0.0.1 gives some fairly interesting results.
The most interesting of these is the one from October 19th 2000. See for yourself!
.noitacidem deen uoy siht daer nac uoy fI
I worked on some projects with the Internet Archive from 1998 - 2000.
The Archive's first storage device (circa 1996) was a large StorageTek tape robot with a multi-gigabyte disk cache to handle user requests for archived pages. As drives and processors became cheaper, it became more interesting to use them instead of tape. The cost penalty of using drives over tape is only 2x - 3x, with the enormous win of increased bandwidth and decreased latency (when the request queue for the bot got large, the wait time for a page could be 16 hours. With disk, it's a fraction of a second).
The first hard-drive based Archive storage used multiple 4U and 5U 12-20 drive Linux/FreeBSD boxes with ~80G IDE drives and Promise cards.
Drive density is greater now - you can get 200G IDE drives and 320G IDEs are on the way, so you can use regular PCs as opposed to custom or niche-market (rackable server) boxes.
--Pat / zippy@cs.brandeis.edu
I would imagine doing a rm -fR / might not go down too well, even with the backups
The Archive chose four-drive boxes because they are cheaper than higher-density boxes. It also allows them to have more CPUs for data mining. The cost of space taken by the machines is negligible. Brewster is one of the cheapest guys you will ever meet and in this case cost was everything. People may think they could get it done for less, but in reality this was done about as low as it goes.
Whether you know it or not, this man was responsible for starting a company which, to this day, gives case-modders wet dreams...
Case in point.
If it does, wouldn't that be a recursive Beowulf cluster?
Copyright Policy
10 March 2001
The Internet Archive respects the intellectual property rights and other proprietary rights of others. The Internet Archive may, in appropriate circumstances and at its discretion, remove certain content or disable access to content that appears to infringe the copyright or other intellectual property rights of others. If you believe that your copyright has been violated by material available through the Internet Archive, please provide the Internet Archive Copyright Agent with the following information:
Identification of the copyrighted work that you claim has been infringed;
An exact description of where the material about which you complain is located within the Internet Archive collections;
Your address, telephone number, and email address;
A statement by you that you have a good-faith belief that the disputed use is not authorized by the copyright owner, its agent, or the law;
A statement by you, made under penalty of perjury, that the above information in your notice is accurate and that you are the owner of the copyright interest involved or are authorized to act on behalf of that owner; and
Your electronic or physical signature.
Why was the parent moderated upward? In every story, it looks like someone mods up a few posts just to cause trouble. Why do those people have, and keep, moderator points anyway?