Slashdot Mirror


Internet Archive Opens Crawler Code Under LGPL

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

6 of 186 comments

  1. Then maybe by caston · · Score: 4, Insightful

    OSDN can decide to open-source SourceForge...

    --
    Being aspergers AND pulling chicks... I enjoy the challenge!
  2. What about... by herrvinny · · Score: 3, Insightful

    Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.

    I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: one who dissents from an accepted belief or doctrine?

  3. I probably would have done this differently... by Rahga · · Score: 4, Insightful

    Ever since the Wayback Machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: there are far fewer updates of the archives, and the archive seems regularly unable to keep up with the client load we impose on it.

    To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network; each client downloads pages, compares MD5 hashes of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

    Then again, would I do this, or even continue the project if I were in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of spidering time is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
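
    For what it's worth, the "compare MD5, compress, send back" step from the distributed idea above could look roughly like the sketch below. Nothing here comes from Heritrix or the Internet Archive: the function names, the quorum rule, and the client labels are all made up for illustration, and the page bodies are passed in as bytes rather than fetched.

```python
import gzip
import hashlib
from collections import Counter
from typing import Dict, Optional


def md5_hex(body: bytes) -> str:
    """Hex MD5 digest of a downloaded page body."""
    return hashlib.md5(body).hexdigest()


def reconcile(submissions: Dict[str, bytes], quorum: int = 2) -> Optional[bytes]:
    """Hypothetical coordinator step.

    `submissions` maps a volunteer client's name to the page body it
    downloaded. If at least `quorum` clients report the same MD5 digest,
    return that body gzip-compressed, ready to send to the master archive.
    Otherwise return None (no agreement).
    """
    if not submissions:
        return None
    digests = {client: md5_hex(body) for client, body in submissions.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None
    # Any body whose digest matches the winning digest will do.
    body = next(b for c, b in submissions.items() if digests[c] == winner)
    return gzip.compress(body)


if __name__ == "__main__":
    pages = {
        "client-a": b"<html>hello</html>",
        "client-b": b"<html>hello</html>",
        "client-c": b"<html>hello, ad #123</html>",  # dynamic junk differs
    }
    archived = reconcile(pages)
    print("agreed" if archived is not None else "no agreement")
```

    Note that any page with per-request dynamic content would rarely reach a quorum under a scheme like this, which is the same objection as in the last paragraph above.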

  4. Re:gpl vs. lgpl? by Anonymous Coward · · Score: 2, Insightful

    This ain't OT. The guy asked what the difference was between the GPL and LGPL. The LGPL is the license the Wayback code is being placed under, and the opening of the code is the topic of discussion. Therefore, the post couldn't be any more on-topic.

    For chrissakes moderators! It says that the code is LGPL in the freakin' article HEADLINE!! We already have enough trouble with people not RTFA, an occasional someone who didn't read the submitter's post, and now we have moderators not RTFH to deal with too!!

  5. spam by krokodil · · Score: 2, Insightful

    I am afraid spammers may use this code
    to harvest web pages for email addresses.

  6. Stop giving open source movement undeserved credit by jbn-o · · Score: 2, Insightful

    Open source that handles over 300 TB of data!

    Please don't be like Mark Webbink, Red Hat's general counsel, and give the open source movement undeserved credit. Adding a license to a list of approved licenses is trivial compared to writing the license and creating a community. The Lesser General Public License (formerly the Library General Public License) was written by the Free Software Foundation well before the open source movement was formed. The LGPL was written as a compromise in order to spread free software but strategically give up the ability to preserve software freedom in derivative works.