Slashdot Mirror


How the Wayback Machine Works

tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."

29 of 134 comments

  1. Google? by kenneth_martens · · Score: 4, Interesting

    It's an interesting idea, but the real problem is not storing the 100 TB of data, it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites on a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.

    Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers, and with the number growing all the time. Licensing all those with a non-open OS would be quite expensive.

  2. Successfully crashed by SilentChris · · Score: 3, Funny

    Ok, we have successfully Slashdotted the Wayback Machine. Screw history! :) Let's move on to bigger and better things.

  3. Ewwwww! by NeoTron · · Score: 3, Funny

    And I thought I'd erased all my old embarrassing HTTP handiwork....until I discovered my old website nicely archived - bleargh!

    Ah hell, may as well keep it there - it's even got my old web-based Curriculum Vitae on it too - perhaps in some way I've now been "immortalised"?? :)

    I've not touched HTML ever since those first abortive attempts I made 5 years ago, cause I realise now that I'm pretty crap at it - I'll stick to Unix admin, what I know best ;)

  4. Interesting Thoughts by nurightshu · · Score: 3, Insightful

    I was glad to see the interviewee was brutally honest about free software -- both its benefits and its drawbacks. Discussions among my friends usually degenerate into holy wars, with both sides spouting cliches at one another until we all storm off in huffs.

    Free software can save the world, I think. We just need to realize that it needs a lot more work to get there.

    --
    They that would sacrifice their .sig space for that cliched Franklin quote deserve neither.
  5. Re:Not very way back! by tom.allender · · Score: 3, Informative
    Wayback Slashdot ...only goes back to 2000?

    Wayback slashdot.org goes back to 1997...

  6. Try this instead.. by CptnHarlock · · Score: 4, Interesting
    --
    $HOME is where the .*shrc is
    -- silver_p
  7. They haven't got http://web.archive.org/ by Rentar · · Score: 5, Funny

    They don't seem to think the history of their site would be interesting: http://web.archive.org/web/*/http://web.archive.org/ redirects you to their index.html! boring!

    Now, that would really be a test for their apps. Same as if Google indexed www.google.com (entirely).
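    The URL in the comment above follows a simple pattern: the listing prefix http://web.archive.org/web/*/ followed by the original address. A minimal sketch of that pattern, where the function name is made up for illustration and nothing here is the Archive's own code:

```python
def wayback_listing_url(site_url: str) -> str:
    """Build the 'all captures' listing URL using the pattern quoted in the comment.

    Only the prefix comes from the thread; the function name is hypothetical.
    """
    return "http://web.archive.org/web/*/" + site_url


# The case from the comment: the archive looking up itself.
print(wayback_listing_url("http://web.archive.org/"))
```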

  8. Quite a lofty goal... by NOT-2-QUICK · · Score: 3, Insightful
    As per the article, Brewster Kahle states that:

    "The idea is to build a library of everything, and the opportunity is to build a great library that offers universal access to all of human knowledge."

    Not only does this sound like a rather far-fetched plot from an old Star Trek episode, but it also seems to be a physical and theoretical impossibility. Even if adequate storage space did exist for such a task (a 10 TB database would be but a small start), I do not foresee any type of technology that could ever capture new data fast enough to keep up with human innovation and creativity.

    It is a nice thought, however, and I certainly wish him all the best in his pursuits...
    --
    Beer is proof that God loves us and wants us to be happy. -- Benjamin Franklin
  9. Not the biggest DB by costas · · Score: 5, Informative

    100 TBs do not make the biggest DB ever. I am personally working on a 60-70TB ERP system that's also writeable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).

    A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.

    1. Re:Not the biggest DB by limber · · Score: 3, Interesting

      As a side example to this discussion of 'what constitutes a large database', the NOAA's National Climate Data Centre maintains a database of digital data of about a petabyte of climatological data. The Centre takes in about a quarter of a terabyte of data *daily*.
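      To put those two figures side by side, here is a back-of-the-envelope sketch. It assumes decimal units (1 PB = 1000 TB) and a constant intake rate, neither of which the comment states; the function name is made up for illustration.

```python
TB_PER_PB = 1000  # assumption: decimal units, not stated in the comment


def days_to_accumulate(total_tb: float, intake_tb_per_day: float) -> float:
    """Days needed to collect total_tb at a constant daily intake (an assumption)."""
    return total_tb / intake_tb_per_day


# About a petabyte held, about a quarter of a terabyte taken in daily.
noaa_days = days_to_accumulate(1 * TB_PER_PB, 0.25)
print(noaa_days)          # days to reach a petabyte at that rate
print(noaa_days / 365.0)  # the same span in years

# The Wayback Machine's 100 TB at the same intake rate.
print(days_to_accumulate(100, 0.25))
```

      At the quoted rate, a petabyte is roughly eleven years of intake, and the Wayback Machine's 100 TB would be a little over a year's worth.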

    2. Re:Not the biggest DB by costas · · Score: 3, Insightful

      I find the claim dubious. Bigger than what kind of database? Wal-Mart is famous for tracking every single little thing about their supply chain. Most grocers or hypermart chains do the same. I can easily see, say A&P or Tesco or Carrefour having multi-TB DBs, even petabyte DBs.

      Also, the size is not the only thing that defines a database installation: numbers of simultaneous users or concurrent transactions, read or write access, ability to rollback, quality of service standards are way more important in my book (and also for most big companies). Part of the reason DBs in that size range are rare is exactly that current technology does not scale up to those levels while maintaining rollbacks, read-write and fast user response.

      I like the Wayback machine, but to compare it to a proper database is ludicrous. EMC or Veritas will give you much more for their 100TBs of storage than 400 x86 PCs... instant backups for one and way larger MTBF.

  10. Pretty amazing ... by CDWert · · Score: 4, Funny

    I'd say it's pretty amazing, I actually was able to retrieve content I thought lost years ago.

    My sites go back to 95, and yep they're archived starting 96, this is too cool.

    I wonder how much of the government's docs that were pulled off post Sept 11 are still on this?

    A really funny note is it seems like all the p0rn is intact starting in 96, gotta archive the porn.

    But seriously, I was unaware of this, I'm gonna use this thing like hell as a sales tool if nothing else. It's also great to find certain content that's been pulled.

    --
    Sig went tro...aahemmm.....fishing........
  11. Noooooooooo !!! by morzel · · Score: 5, Funny
    Please please please please do _NOT_ google it... It was embarrassing enough when google acquired dejanews, and put the old usenet archives on-line. :-)
    I just visited some sites which I had hoped had disappeared completely from cyberspace. The only defense I've got now are the old cryptic URLs of these monstrosities... Indexing that database would be a disaster, especially with an unusual name like mine...
    (Yes, I was stupid enough to use my real name ;-)
    Damn you, wayback :p

    --
    Okay... I'll do the stupid things first, then you shy people follow.
    [Zappa]
  12. The Cost of a Terabyte by wayn3 · · Score: 3, Interesting

    You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in.
    Interesting quote. Mr. Kahle addresses something I've been wondering for a while -- are storage area networks really worth it? Or is he ignoring the costs of maintenance and manpower to keep these things afloat?
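    The arithmetic behind that question can be sketched as follows. The $300,000 per terabyte figure is the one from the quote; the cluster hardware price is not given anywhere in this thread, so the value used below is a made-up placeholder and the function names are hypothetical.

```python
EMC_DOLLARS_PER_TB = 300_000  # "maybe $300,000" per terabyte, from the quote


def emc_cost(terabytes: int) -> int:
    """Storage-only cost of the given capacity at the quoted EMC price."""
    return terabytes * EMC_DOLLARS_PER_TB


def upkeep_headroom(terabytes: int, cluster_hardware_dollars: int) -> int:
    """How much maintenance and manpower a PC cluster could absorb
    before it costs more than the EMC storage alone."""
    return emc_cost(terabytes) - cluster_hardware_dollars


print(emc_cost(100))  # 100 TB at the quoted price

# PLACEHOLDER: the cluster price is not stated in the thread; this number is invented.
placeholder_cluster_price = 5_000_000
print(upkeep_headroom(100, placeholder_cluster_price))
```

    At the quoted price, 100 TB from EMC comes to $30 million for storage alone, so whether the cluster wins depends on how much of that gap the upkeep eats.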

  13. Copyright infringement by Karma+Star · · Score: 3, Redundant

    Seeing that they cache webpages from other sites, I wonder how long it will take before another company sues them?

    Also, I wonder what their criteria will be for "submissions"? 1 month? 1 year?

    --
    Me email iz skyewalkerluke at microsoft's free email service.
    1. Re:Copyright infringement by pjones · · Score: 3, Informative
      Child! Child! They do not sue you right away -- and they can't. First they send you a cease-and-desist order and you evaluate their claim.


      But Brewster answers your question in the interview himself on the second page:


      Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?


      Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.
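      For readers who have not seen a robot exclusion, here is a minimal sketch of how a crawler might honor one, using Python's standard urllib.robotparser. The user-agent name ia_archiver and the example.com address are assumptions for illustration and do not come from the interview; this is not the Archive's actual crawler code.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt a site owner might publish to opt out.
# ASSUMPTION: "ia_archiver" as the crawler's user-agent name.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The excluded crawler is refused; a crawler with no matching rule is allowed.
print(may_archive(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html"))
print(may_archive(ROBOTS_TXT, "otherbot", "http://example.com/index.html"))
```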

      --
      Certified Black Helicopter Pilot *** Unwitting Dupe of One World Gov'ment
  14. Distributed Computing solution... by Tazzy531 · · Score: 3, Insightful

    The interview talked a little about throwing more machines on when the demand deems necessary. I wonder if it is possible to do this over the internet? I mean, I'm seeing something along the lines of SETI, where millions of people worldwide donate their unused processor power. Would it be possible to distribute the searches to remote computers over the internet in real time?

    --


    _______________________________
    "I'm not Conceited...I'm just a realist..."
  15. Government Removed Site still Available by Tazzy531 · · Score: 4, Informative

    A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:

    DC Air National Guard on Archive

    Same Page - 404

    One of the conspiracy websites that I have read was saying that combat airplanes, normally on 24 hour alert, at this base should have and could have prevented the plane from entering the restricted airspace in DC. They were saying that this site was removed because it provided evidence that somebody dropped the ball.

    --


    _______________________________
    "I'm not Conceited...I'm just a realist..."
  16. Link to various database sizes by rkgmd · · Score: 3, Informative

    http://znet.net/~schester/facts/database_sizes.html Apparently, Wal-Mart's is 24TB, and the entire www index as of 1999 was only 6TB.

  17. DBMS and model? by leandrod · · Score: 3, Interesting

    But what is the DBMS? Is the database relational? How was it modelled?

    --
    Leandro Guimarães Faria Corcete DUTRA
    DA, DBA, SysAdmin, Data Modeller
    GNU Project, Debian GNU/Lin
  18. Biggest ever? I don't think so! by Proud+Geek · · Score: 4, Funny

    I once worked on a site with a 25 year old database that was much larger.

    The ancient magnetic storage took up several warehouses. Beat that, for biggest database ever!

    --

    Even Slashdot wants to hide some things

  19. Interesting thought process by cheese_wallet · · Score: 3, Interesting

    Pretty decent read, but one thing they said got me thinking a little bit.

    They said that at Thinking Machines they built a super fast computer, but it required a new way of thinking about things in order to program it. And then they called this a mistake, because they couldn't attract any customers.

    This seems like a real problem that would lead to technological stagnation. At least from a market place point of view.

    It is kind of similar to a company making games off of pre-existing engines, like quake, instead of some new non-quake compatible engine.

    Or everybody making x86 compatible CPUs.

    It also seems that when a company does come up with some new way of doing things, they get burned, and it is the second generation of companies that pick up the torch that make the money. So nobody wants to be that first company, they are all waiting for someone else to break the ground.

    Maybe the only people/companies that come up with new stuff are the ones that are insanely rich, and won't get hurt by doing something new, or the insanely poor who have nothing to lose anyway.

    I can't help thinking that this clustering boom going on is just like what 3dfx was trying to do. The difference right now is that clustering actually *does* outperform the super fast single chip. I wonder when technological advances will change this fact.

  20. You know what is SAD? by dood · · Score: 3, Funny

    Slashdot looks exactly the same as it did 5 years ago!

    WHEN is this site going to be updated? Forget the wayback machine, if I want ancient web history I visit slashdot.

    --Dood

  21. Talk to the US government by Remus+Shepherd · · Score: 3, Informative

    You're right, the Wayback machine is not the largest collection of data -- not even the largest collection online. I work with the USGS's catalog of satellite data. They have over 300 terabytes of satellite imagery, and the collection is growing at a rate of about 1 terabyte per day.

    The USGS collection comprises multiple instruments, but Landsat 7 is a big one, contributing about 100 terabytes that's searchable online.

    Perhaps 'Largest TEXT Database' would be a better description of the Wayback Machine?

    --
    Genocide Man -- Life is funny. Death is funnier. Mass murder can be hilarious.
  22. Wisdom in his words.. by grub · · Score: 3, Interesting


    From the article:
    How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions.

    The man puts bias aside and uses various OSs in areas in which each performs well. A real, tangible project like this is worth more than any amount of drooling zealotry.

    --
    Trolling is a art,
  23. 200 transactions/second? by selan · · Score: 4, Insightful

    Having so few transactions for a database of this size probably helps them run without needing large expensive machines. Many VLDBs support thousands of transactions per second. I found a list here of top ten winners of a very large database scalability contest. The winner for peak performance was something like 20,000+ TPS.

  24. the misanthropic bitch by joshuaos · · Score: 3, Interesting
    I spent the summer on the road, and when I settled down for the cold months, I was quite sad to see that the Misanthropic Bitch appears to have vanished. Today, when I read this article, I was delighted to find that all of dear bitch's articles are archived.

    I think this is a fabulous project, and I hope it does well. However, I think that the notion of such a centralized database will begin to become unrealistic. I think peer to peer projects are the future, and I can see a day far in the future when the database layer comes down and inhabits the filesystem layer and all the databases on the internet can talk to each other, and in a sense, the net becomes a giant database that anyone can contribute to.

    Cheers, Joshua

    --

    When in danger or in doubt, run in circles, scream and shout!

  25. Their movie archive has "Hired!" by for(;;); · · Score: 3, Informative
    Hot damn! Their movie archive has a downloadable version of the short they showed on MST3K prior to "'Manos:' The Hands of Fate."


    "Ma'am, did you realize that Chevrolet has an important plan for your life?"

    --

    "Whatever happened to fair use?"
    -- Duff-Man
  26. Isn't this illegal? by russ-smith · · Score: 3, Insightful
    The majority of information being collected by Archive.org is covered by copyright law. It is up to Archive.org to get permission before they republish the information. If you look at the Archive web site they run banner ads for the Alexa toolbar. This Alexa service provides marketers with information somewhat similar to the Nielsen ratings for TV. Archive.org has received complaints about their service, contrary to the statements made in the published article. Archive.org has refused to respond in any meaningful way to these issues. Archive.org is trying to put the burden on the publisher to determine that the Archive is publishing it, find it within the Archive web site, and then provide them a notarized statement. See their FAQs at

    http://www.archive.org/exec/faqsidos/about/faqs.html?index=2 and
    http://www.archive.org/exec/faqsidos/about/faqs.html?index=26

    The claims made in these faqs are just not consistent with the law. Are they going to repost everything that was available on Napster?

    They also have some problems with their algorithm: some redirected domains fool it into associating content with a site it was never actually associated with. To try to find copyrighted works would be a nightmare. Archive.org has refused to respond to any of these issues and, in fact, are lying about it if the quotes in the article are factual.

    Russ Smith