Slashdot Mirror


Interview with Brewster Kahle

Netmonger writes "A fascinating interview with the man behind The Wayback Machine. Some specs from the article: "It's 150-odd standard PC cases, with four drives in each.. 'Over 100 terabytes.. As plain text in book form, that'd be over 3000 miles of shelf space.." All I can say is.. Wow!"

37 of 195 comments

  1. How many by FunkSoulBrother · · Score: 4, Funny

    How many miles of shelf space equal one Library of Congress? Let's use standard units here, people!

  2. Transient Moments by szyzyg · · Score: 5, Interesting

    It's a shame that some of the more interesting moments in Internet history are so transient that the Wayback Machine can't catch them.

    e.g. The Ded Kitty picture we put up when Napster shut down at the start of September; it was only there for a few hours, but it will be lost.

    Of course, some of the more interesting transient events are websites that are hacked, but there exist dedicated archives for this kind of event, so you can relive the hilarity of RIAA.org being repeatedly defaced.

  3. Re:Wow! by cybermace5 · · Score: 3, Funny

    Yes. Yes it does still exist. That will be $5.00.

    --
    ...
  4. stupid Joe Six-Pack metaphors by p_rotator · · Score: 4, Funny


    As plain text in book form, that'd be over 3000 miles of shelf space.."

    Huh? How about "If all data was spoken at once, it would be as loud as 674 jet engines!" Or "If this archive were a planet, it would be as large as Jupiter!"

  5. Is this thing backed up? by TheSHAD0W · · Score: 3, Funny

    I'd hate to see the history of the net destroyed if the sprinkler system goes off in their server room...

  6. Kahle? by Prince_Ali · · Score: 3, Funny

    Who is this Kahle guy? I know for a fact that it is Mr. Peabody who is behind the way-back machine. I was with him when he visited Nobel.

  7. On a related note, look up the Long Now Foundation by JJAnon · · Score: 3, Interesting

    Here. They seek to create physical items (clocks and libraries are two items they name) that will last for very, very long periods of time. This diagram shows what is meant by the "long now", and this is a link to their first prototype clock that is on display in the Science Museum in the UK (the second clock on the page).

  8. 100 Terabytes! by insanecarbonbasedlif · · Score: 3, Informative

    I did a quick price check and for 100 terabytes of data on 80GB drives (Best price/size ratio I could find), that's about $111,250 worth of storage. Of course, I guess they would get bulk discounts :).
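
The total above can be roughly reproduced with a short sketch. The comment states neither the units nor the per-drive price, so both are assumptions here: 1 TB is taken as 1000 GB, and $89 per 80 GB drive is a hypothetical price chosen only because it makes the total come out to $111,250.

```python
# Rough reconstruction of the price check.
# ASSUMPTIONS: decimal units and the per-drive price are not stated in the comment.
TOTAL_GB = 100 * 1000      # 100 TB, assuming 1 TB = 1000 GB
DRIVE_GB = 80              # drive size quoted in the comment
PRICE_PER_DRIVE = 89       # hypothetical USD price per drive


def drives_needed(total_gb: int, drive_gb: int) -> int:
    """Smallest number of drives whose combined capacity covers total_gb."""
    return -(-total_gb // drive_gb)  # ceiling division on integers


drive_count = drives_needed(TOTAL_GB, DRIVE_GB)
total_cost = drive_count * PRICE_PER_DRIVE
print(f"{drive_count} drives, ${total_cost:,}")
```

With binary units (1 TB = 1024 GB) the drive count rises to 1280, so the result is sensitive to that choice.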

    --
    Just because I doubt myself does not mean I find your position compelling.
    1. Re:100 Terabytes! by dougmc · · Score: 3, Informative
      The math (100 terabytes, 150 computers, 4 drives per computer) works out to an average of 171 GB/drive. Of course, they said `over 100 TB' so it's actually higher than that.

      Obviously they're using IDE drives. Modern ones. And they must have replaced almost everything at once -- there could be a mixture of 200 GB and 120 GB drives, but it would have to be mostly 200 GB drives.

      Pretty neat, but still doesn't hold a candle to Google's massive setup :)

      (Google must have a *team* of people whose sole job is finding failed computers/drives and replacing them :)
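
The per-drive average and the "mostly 200 GB" claim can be checked with a small sketch. It assumes 1 TB = 1024 GB, which is what makes the average round to 171 GB, and treats the fleet as exactly 150 PCs with 4 drives each; the function names are illustrative only.

```python
import math

GB_PER_TB = 1024           # assumption: binary units, matching the 171 GB figure
PCS = 150
DRIVES_PER_PC = 4
TOTAL_GB = 100 * GB_PER_TB
DRIVE_COUNT = PCS * DRIVES_PER_PC


def average_gb_per_drive(total_gb: int, drive_count: int) -> float:
    """Average capacity each drive must have to reach total_gb."""
    return total_gb / drive_count


def min_large_drives(total_gb: int, drive_count: int, large_gb: int, small_gb: int) -> int:
    """Fewest large drives needed if every other drive is the small size."""
    shortfall = total_gb - drive_count * small_gb
    return max(0, math.ceil(shortfall / (large_gb - small_gb)))


avg_gb = average_gb_per_drive(TOTAL_GB, DRIVE_COUNT)
large_needed = min_large_drives(TOTAL_GB, DRIVE_COUNT, large_gb=200, small_gb=120)
print(f"average: {avg_gb:.0f} GB/drive")
print(f"at least {large_needed} of {DRIVE_COUNT} drives must be 200 GB")
```

Under these assumptions at least 380 of the 600 drives (about 63%) would have to be 200 GB models, which is consistent with "mostly 200 GB".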

    2. Re:100 Terabytes! by product+byproduct · · Score: 3, Funny

      For comparison, I did a quick price check, and for 3000 miles of shelf space on 5x26.25" bookcases (best price/size ratio I could find), that's about $29M worth of bookcases. Using hard disk drives was a smart decision.

  9. According to... by Cyclopedian · · Score: 4, Informative
    this, the LOC pales in comparison to "3000 miles of shelf space".

    -Cyc

  10. I don't understand terabytes.... by nebenfun · · Score: 4, Funny

    "Over 100 terabytes.. As plain text in book form, that'd be over 3000 miles of shelf space.."

    I don't understand terabytes or the shelf space analogy...
    I need to know how many bananas.

    nbfn

    1. Re:I don't understand terabytes.... by gid · · Score: 4, Funny

      Well, since bananas can't directly hold data that well (they rot so quickly), we'll have to use those bananas to store data by some other, indirect means.

      So, how many bananas would it take to feed all the monkeys needed to store the data? Monkeys aren't that smart, so let's approximate that each monkey can hold 4 KB worth of data.

      100 TB = 100 * 1024 * 1024 * 1024 KB = 107374182400 KB

      107374182400 KB / 4 = 26843545600 monkeys

      Now we'd want redundancy, so let's have triplicate monkeys for all our data, in case one dies, or runs away, or simply forgets.

      26843545600 * 3 = 80530636800 monkeys

      But now we want to figure out how many bananas they're going to eat. Let's say 5 bananas a day per monkey?

      80530636800 * 5 = 402653184000 bananas to feed all monkeys per day

      402653184000 * 365 = 146968412160000 bananas to feed all monkeys per year

      146,968,412,160,000, or roughly 147 trillion bananas per year, which is probably just slightly over the national debt.

      Overall, I think your method of using bananas to store all this data is quite ridiculous. The latency and data loss would be unbearable. Plus, think of all the poop these monkeys would create, and you'd NEVER be able to get PETA off your back.
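
For anyone who wants to rerun the banana budget, the arithmetic above can be sketched as follows. The figures (4 KB per monkey, triplicate monkeys, 5 bananas a day, a 365-day year) are the ones given in the comment; the variable names are made up for illustration.

```python
# Reproduces the monkey/banana arithmetic from the comment above.
KB_PER_TB = 1024 ** 3            # 1 TB = 1024 * 1024 * 1024 KB
ARCHIVE_TB = 100
KB_PER_MONKEY = 4                # each monkey remembers 4 KB
REDUNDANCY = 3                   # triplicate monkeys
BANANAS_PER_MONKEY_PER_DAY = 5
DAYS_PER_YEAR = 365

archive_kb = ARCHIVE_TB * KB_PER_TB
monkeys = archive_kb // KB_PER_MONKEY
redundant_monkeys = monkeys * REDUNDANCY
bananas_per_day = redundant_monkeys * BANANAS_PER_MONKEY_PER_DAY
bananas_per_year = bananas_per_day * DAYS_PER_YEAR

print(f"{archive_kb:,} KB")
print(f"{redundant_monkeys:,} monkeys")
print(f"{bananas_per_year:,} bananas per year")
```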

  11. Wayback technology by watchful.babbler · · Score: 5, Informative

    There's an excellent interview with Kahle on technical details at O'Reilly's own archive -- here.

    --
    "Freedom is kind of a hobby with me, and I have disposable income that I'll spend to find out how to get people more."
  12. Re:A lot of internet information is crap... by Anonymous Coward · · Score: 5, Interesting

    We're not qualified to judge what "good stuff" is.

    For example, a couple of centuries ago, old household accounts would have been considered valueless. But today's historians find a wealth of social data in them - what did people eat? how much did they get paid? did families tend to enter service together? how often did servants get new clothes?

    Disc space is cheap. Keep everything, let future historians sort it out.

  13. Like all those crappy old buildings... by FreeUser · · Score: 3, Insightful

    A lot of internet information is crap... So why would you want to preserve all of it? Why not just get the good stuff and maybe he won't need so many computers.

    And of course, you're going to decide what is "good" and what "isn't"? He is providing the resource for, among other things, scholarly researchers. Of what use is the data if it has been hand edited according to one person's aesthetics or another's?

    Indeed, your comment reminds me of one that was heard quite often, shortly before beautiful and irreplaceable old buildings were razed to make way for a new strip mall, or, in downtown Chicago, a couple of new government buildings whose architectural style is best described as "Federal Drab." Preserving as much as possible is a good thing, because none of us can tell what will be valuable, and what will not, in another 20 or 30 years, and no one's aesthetic should be dictating such a decision to entire generations to come.

    --
    The Future of Human Evolution: Autonomy
  14. Silly Me! by Cap'n+Canuck · · Score: 3, Funny

    And here I thought it was Mr. Peabody that invented the Wayback Machine. No, hang on, it was Al Gore...

    But seriously, unless you know about this project, and the fact that you can ask to remove data from the archives (though there's no reference as to how to actually do it), it means that your Internet past can haunt you forever.

    Or at least until simultaneous attacks occur on Cairo and San Francisco...

  15. Re:A lot of internet information is crap... by 0xdeadbeef · · Score: 4, Funny

    You don't consider the archiving of pr0n a noble cause? Don't be so selfish, man, think of future generations!

    I mean, hell, forget pr0n, just imagine the blackmail value for the kids of 2020, to be able to dig up pictures of their parents on amihotornot.

  16. Another site, with pics by RhBaby · · Score: 5, Informative

    http://www.mindjack.com/feature/archive.html

    In the interest of full disclosure, I wrote it, so be gentle.

  17. Robots.txt - That was how the RIAA was hacked by szyzyg · · Score: 3, Interesting

    Hint: Don't put security pages in your robots.txt which aren't supposed to be linked.... or at least secure them with a password.

    http://www.zone-h.org/en/news/read/id=894/
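
To illustrate the hint: robots.txt is a public file, so any path listed under Disallow is advertised to anyone who fetches it. The sketch below is a minimal illustration using a made-up robots.txt (the paths are hypothetical and not taken from any real site) that extracts exactly the entries a curious visitor would look at first.

```python
# Minimal sketch: list the paths a robots.txt file discloses via Disallow.
# SAMPLE_ROBOTS_TXT is a hypothetical example, not any real site's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/          # meant to be hidden, but now publicly listed
Disallow: /private/reports.html
Allow: /public/
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow value, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths(SAMPLE_ROBOTS_TXT))
```

Anything that must stay private needs real access control (a password, as the comment says); leaving it out of robots.txt only avoids advertising it.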

  18. Picture of a Picture by paughsw · · Score: 4, Funny

    I put www.archive.org into the Wayback Machine and my computer exploded!

  19. See also by danlyke · · Score: 4, Informative

    For other Brewster Kahle interviews, see also the Slashdot story that pointed to the O'Reilly interview and the Slashdot story that pointed to the Feed magazine interview (which is currently inaccessible from my machine).

    1. Re:See also by Orne · · Score: 3, Informative

      Hehe, that's what the Wayback Machine is for!

      Feed magazine interview, back from the grave...

  20. Re:A lot of internet information is crap... by ChaosDiscord · · Score: 3
    Why not just get the good stuff and maybe he won't need so many computers.

    Identifying "good stuff" is very hard and certainly not something that can be automated. Furthermore, "good stuff" is in the eye of the beholder. Perhaps Jane's web page dedicated to her kittens is useless to almost everyone in the world. However, to Jane's great-great-granddaughter who hasn't been born yet, it might provide a fascinating look into her own past. A historian a hundred years from now analyzing the first twenty years of the web would certainly want to know that porn and popups were so pervasive.

  21. Odd, no copyright questions by dsanfte · · Score: 5, Insightful

    I was curious as to how the Wayback Machine's operators view its legal status... I mean, it's not really a search engine in the broadly accepted meaning of the term. It doesn't just search what's out there, it archives entire pages of old information. And while search engine sites do this too (Google), this is ALL the Wayback Machine site does.

    Surely they must know they're treading on untested legal ground. All it might take is one offended copyright holder to bring the whole thing to its knees. Basing it in a country other than the USA might have been smarter, then, given the existence of laws like the DMCA which could serve to shut the site down.

    --
    occultae nullus est respectus musicae - originally a Greek proverb
    1. Re:Odd, no copyright questions by Wesley+Felter · · Score: 3, Interesting

      In presentations, Brewster says his policy is to take out the complainers. So if you think having your site in the Wayback Machine is a copyright infringement, he'll just take it out. Meanwhile he's taking the Napster approach: assume what you're doing is legal until someone tells you to stop. Hopefully that day won't arrive.

    2. Re:Odd, no copyright questions by Obiwan+Kenobi · · Score: 3, Interesting
      Or, as the Buddhists say:


      "It is easier to ask for forgiveness than permission."

  22. True story and a small thanks.... by Anonymous Coward · · Score: 3, Interesting

    Small personal thanks from me. I had put an online exhibit of my artwork up a few years ago, but unfortunately lost all of it to a hard drive failure. Much to my surprise, I was able to find nearly all of my site, http://www.gpapassavas.com, online and backed up on the WBM.

  23. Re:A lot of internet information is crap... by garcia · · Score: 4, Insightful

    I think that storing everything on computers will make historians' jobs MUCH less difficult but a lot less fun.
    Doing historical research is fun b/c you get to get your hands dirty (literally). I spent 6 hours a day for three weeks researching crime rates in Toledo, OH during Prohibition (before, during, and after), and b/c the books were all handwritten and so old, my hands turned black for days at a time...
    It would have been MUCH easier if all the information had been sorted and easily found. I guess it would make future historians' jobs easier, but what fun would that be?

    Just my worthless .02

  24. Why only four? by pla · · Score: 4, Insightful

    Out of curiosity, why only four drives per PC?

    With a simple $10 PCI IDE card (per additional 4), you could have gotten at *least* 8 drives, possibly as many as 16, per case. Granted, not many cases will let you *mount* that many, but I would expect paying a few bucks extra for the IDE cards and a better case would save quite a bit of money (and physical space) by halving or quartering the number of PCs you need ($100 extra to save $1500 per $2000, not counting the drives themselves?).

    88 linear feet of machines vs. 22. One requires an entire room; the other would fit on a standard-sized 3-or-4-tier storage rack. Of course, speaking of racks (of a different sort)... What on earth made you go with an array of standard PCs rather than a RAID-in-a-rack?

    1. Re:Why only four? by jandrese · · Score: 5, Informative

      Probably the limiting factor there is the PCI bus. Modern ATA HDDs tend to saturate vanilla PCI busses (which is why most chipsets have custom busses between the north and southbridge these days). Add ATA cards and your PCI bus quickly becomes saturated and not very good for serving webpages. Worse, since the NIC probably sits on the PCI bus as well, you can easily starve your NIC with too many ATA devices on PCI ATA controllers.

      I know, I have a fileserver at home that has this exact problem, but I don't care if my fileserver is slow so it's not a problem.
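
A back-of-the-envelope sketch of the saturation argument, under stated assumptions: a plain 32-bit, 33 MHz PCI bus peaks at about 133 MB/s shared by every device on it, while the per-drive sustained rate (40 MB/s) and the NIC load (12.5 MB/s, i.e. a busy 100 Mbit/s link) are hypothetical round numbers chosen for illustration, not measurements.

```python
# Rough estimate of how many add-in-card drives a shared PCI bus can feed.
PCI_PEAK_MB_S = 133.0        # 32-bit x 33 MHz PCI, theoretical peak
NIC_MB_S = 12.5              # ASSUMPTION: 100 Mbit/s NIC running flat out
DRIVE_MB_S = 40.0            # ASSUMPTION: sustained rate of one ATA drive


def max_full_speed_drives(bus_mb_s: float, nic_mb_s: float, drive_mb_s: float) -> int:
    """Drives on PCI ATA cards that can stream at full rate alongside the NIC."""
    return int((bus_mb_s - nic_mb_s) // drive_mb_s)


def bus_demand(drive_count: int, drive_mb_s: float, nic_mb_s: float) -> float:
    """Total bandwidth requested if every drive and the NIC are busy at once."""
    return drive_count * drive_mb_s + nic_mb_s


limit = max_full_speed_drives(PCI_PEAK_MB_S, NIC_MB_S, DRIVE_MB_S)
print(f"full-speed drives on PCI cards: {limit}")
print(f"demand with 12 extra drives: {bus_demand(12, DRIVE_MB_S, NIC_MB_S):.1f} MB/s")
```

Real throughput falls well short of the theoretical peak, so the practical limit is likely lower than this estimate suggests.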

      --

      I read the internet for the articles.
  25. Vannevar Bush by Mannerism · · Score: 3, Informative

    Technologists have promised the digital library for decades. In 1945, Vannevar Bush, who was technology adviser to several US presidents, wrote an article in The Atlantic magazine outlining how computers might one day augment libraries.

    Those who find this subject interesting, but who may not be familiar with Vannevar Bush's work, might want to read the paper to which Brewster Kahle refers.

  26. Re:A lot of internet information is crap... by aengblom · · Score: 3, Insightful

    I think that storing everything on computers will make historians' jobs MUCH less difficult but a lot less fun.

    I think it's more that it will be different people. Understanding most of history is constrained by the lack of data about that time. Our age is precisely the opposite. We try to save EVERYTHING we can possibly afford, because we know that crap will be valuable to many people later on. For next century's historians, it will be about data sampling and extracting the gold nuggets from all the crap we have saved.

    It will be the folks who built Google. Not the current type of folks.

    That said. It's better to have too much than too little.

    --


    So close and yet so far from the world's perfect ID number
  27. Re:Vaguely uncomfortable by Maul · · Score: 3, Interesting

    I disagree completely.

    If you put something on the web, you have put it up for the world to see. The whole point of putting information on the web is making that information available to lots of people.

    What the Internet Archive is doing is no different than libraries storing old copies of newspapers and magazines. With an increasing amount of things being published online, we need an archive of those things.

    Years from now archives of web pages will be quite useful for those doing research on the events of today.

    Say you are a student in the year 2050 and are doing a report on the "history of the web." Wouldn't it be nice to have copies of the web pages from the 1990s to show what the "early" web looked like?

    --

    "You spoony bard!" -Tellah

  28. The Wayback machine is a lie by corebreech · · Score: 5, Insightful

    Try accessing news stories immediately prior to and after the September 11 attack and you'll see just how valuable this website is... or rather, isn't.

    I have also personally run a website which contained fairly controversial material (based on this story) that I saw listed on their website and then removed shortly thereafter. Tell me, why would a service like this ever have occasion to remove material once it's been archived, especially if there are *NO* copyright issues and the webmaster of the archived site never asked them to remove it?

    The answer is simple: the powers-that-be saw how dangerous it was to make all this information available to anyone on demand so they took control. It would be a great service were it allowed to operate unfettered, but the reality is quite different.

    And I'm the first to mention this here so far? You should all be modded down -1 for naiveté.

  29. Any bets.... by MDX-F1 · · Score: 5, Interesting

    on how long before a politician has to resign because of some over the top statements he/she made in a flamewar back in college? Or maybe that webpage of ethnic jokes that seemed so hilarious back in high school.

    I have a feeling we are either going to have to become way more forgiving, or we're going to be stuck with only faceless boring types with no opinions as our leaders (no wisecracks, it could be much worse than it is now).

  30. Archive architecture by yppiz · · Score: 3, Informative

    I worked on some projects with the Internet Archive from 1998 - 2000.

    The Archive's first storage device (circa 1996) was a large StorageTek tape robot with a multi-gigabyte disk cache to handle user requests for archived pages. As drives and processors became cheaper, it became more interesting to use them instead of tape. The cost penalty of using drives over tape is only 2x - 3x, with the enormous win of increased bandwidth and decreased latency (when the request queue for the bot got large, the wait time for a page could be 16 hours. With disk, it's a fraction of a second).

    The first hard-drive based Archive storage used multiple 4U and 5U 12-20 drive Linux/FreeBSD boxes with ~80G IDE drives and Promise cards.

    Drive density is greater now - you can get 200G IDE drives and 320G IDEs are on the way, so you can use regular PCs as opposed to custom or niche-market (rackable server) boxes.

    --Pat / zippy@cs.brandeis.edu