Slashdot Mirror


Internet Archive Opens Crawler Code Under LGPL

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

4 of 186 comments

  1. In case of /.ing... by Dave2 Wickham · · Score: 4, Informative

    The source download is available on SourceForge.

    I doubt it'll get slashdotted, but you never know...

  2. The code is pretty clean, too... by tcopeland · · Score: 4, Informative

    ...some unused variables and such-like in there, though, as reported by PMD.

  3. Unless the Archive caves in... by turambar386 · · Score: 5, Informative


    "Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."

    That is, unless the digital artifacts in question are, like Operation Clambake, opposed to rich and powerful sects. In that case, they are blocked by the Wayback Machine after the Archive caves in to DMCA notices.



  4. Important clarifications (!!!) by gojomo · · Score: 4, Informative

    Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.

    FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo, but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)

    Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.

    We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)

    We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.

    IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.

    (P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
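
For readers unfamiliar with what "collecting web resources recursively, within some defined parameters" (comment 4) means in practice, here is a minimal single-machine sketch in Python. It is not Heritrix code and does not reflect Heritrix's actual design: the `crawl` function, the `LinkExtractor` class, the `fetch` callable, and the `allowed_hosts`, `max_depth`, and `max_pages` parameters are all hypothetical names chosen for illustration. The "site" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (a hypothetical, minimal link-extractor)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, allowed_hosts, max_depth=2, max_pages=100):
    """Breadth-first crawl from `seed`, staying inside the defined scope.

    `fetch` is an assumed callable: url -> HTML string, or None on failure.
    Scope here means: host in `allowed_hosts`, link depth <= `max_depth`,
    and at most `max_pages` pages collected.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    collected = {}
    while queue and len(collected) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            in_scope = urlparse(target).hostname in allowed_hosts
            if in_scope and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return collected


# A made-up four-page site plus one out-of-scope link.
SITE = {
    "http://example.org/": '<a href="/a">A</a> <a href="http://other.example/">out</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.org/b": '<a href="/c">C</a>',
    "http://example.org/c": "end",
}

pages = crawl("http://example.org/", SITE.get, {"example.org"}, max_depth=2)
print(sorted(pages))
```

With `max_depth=2`, the sketch collects the seed, `/a`, and `/b`, skips the off-site link, and never reaches `/c`. A real archival crawler additionally has to handle politeness delays, robots.txt, non-HTML media, and persistent queues, none of which are shown here.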