Slashdot Mirror


Internet Archive Opens Crawler Code Under LGPL

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix , or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

2 of 186 comments (clear)

  1. Then maybe by caston · · Score: 4, Insightful

    OSDN can decide to open source source forge...

    --
    Beings aspergers AND pulling chicks... I enjoy the challenge!
  2. I probably would have done this differently... by Rahga · · Score: 4, Insightful

    Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far less updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

    To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a managable affair. IA sends marching orders out to the distributed volunteer network, each clients downloads, compares MD5 of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

    Then again, would I do this, or even continue the project if I was in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtable wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.