Slashdot Mirror


Internet Archive Opens Crawler Code Under LGPL

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

10 of 186 comments

  1. gpl vs. lgpl? by Anonymous Coward · · Score: 3, Interesting

    could someone summarize the differences?

    fp?

  2. Oldest /. entry by Anonymous Coward · · Score: 5, Interesting
  3. Infamous? by BitchAss · · Score: 4, Interesting

    the infamous Wayback Machine

    Why is it infamous? I haven't heard anything bad about it.

    --
    Like sex? Read and write about it! Indecent Blogging
  4. Old slashdot news by AyeFly · · Score: 5, Interesting

    Here is a Slashdot story from the Wayback Machine that I just found.

    "IBM announces a 25 gigger

    Posted by Hemos on Wednesday November 11, @10:11AM
    from the why-i-could-put-3/4-my-cd-collection dept.
    Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
    Read More...
    64 comments"

    Just thought it was interesting to see, since we now have 200-gig HDs.

    --
    Sig- http://www.dreamhost.com/rewards.cgi?ayefly
  5. Wayback = Genealogy of AI Minds by Mentifex · · Score: 3, Interesting


    The Internet Archive serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
    Each AI Mind leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
    Robot Minds will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.

  6. Redundancy? by Anonymous Coward · · Score: 3, Interesting

    The Internet is huge. But get rid of all the redundancy and the size goes down by a huge factor. How many copies of the Linux kernel and distros are there? How many copies of Matrix Reloaded? Do an MD5 sum and store pointers in order to recreate the structure of the net, keeping only one copy of what is unique. Terabyte servers are cheap these days. Wouldn't need more than a few at the most to archive everything.
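
The "MD5 sum and store pointers" idea above can be sketched roughly as follows. This is only an illustration of content-addressed deduplication, not how the Internet Archive or Heritrix actually stores data; the class name `DedupStore`, its methods, and the example URLs are all hypothetical.

```python
import hashlib


class DedupStore:
    """Hypothetical store: one copy per unique payload, plus URL pointers."""

    def __init__(self):
        self.blobs = {}     # MD5 hex digest -> payload bytes (stored once)
        self.pointers = {}  # URL -> MD5 hex digest

    def add(self, url, payload):
        # Hash the content; identical payloads map to the same digest.
        digest = hashlib.md5(payload).hexdigest()
        # Keep the payload only if this digest has not been seen before.
        self.blobs.setdefault(digest, payload)
        # Always record the pointer so the URL structure can be rebuilt.
        self.pointers[url] = digest
        return digest

    def get(self, url):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.pointers[url]]

    def stored_bytes(self):
        # Bytes actually kept, after deduplication.
        return sum(len(blob) for blob in self.blobs.values())


# Made-up example: two mirrors of the same file and one unique page.
store = DedupStore()
kernel = b"pretend this is a kernel tarball"
store.add("http://mirror-a.example/linux.tar.gz", kernel)
store.add("http://mirror-b.example/linux.tar.gz", kernel)
store.add("http://example.org/index.html", b"<html>hello</html>")
```

Here three URLs resolve to only two stored payloads. One caveat: MD5 is no longer considered collision-resistant, so a sketch like this would more safely use a stronger hash such as SHA-256 if the content could be adversarial.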

  7. Infamous? by Anonymous Coward · · Score: 0, Interesting

    which hosts the infamous Wayback Machine has opened?

    What exactly is infamous about the Wayback Machine? I did not know it was generally hated.

  8. Re:I probably would have done this differently... by benja · · Score: 2, Interesting

    Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far fewer updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

    I think that they possibly intentionally limit their bandwidth, so that it's faster to browse the real Web than them (because they don't want to become Google cache when a site is slashdotted, for example).

    (Although they only would if the page in question is old enough... they have a policy of pages going in only 6 months after they have been spidered, probably for the same reason as above.)

  9. Re:score by corebreech · · Score: 4, Interesting

    I'll use it if you promise not to delete shit that doesn't hew to your ideology.

    That's what really sucks about the Wayback Machine.

    Ever try reading articles from the aftermath of 9/11? It's a great big hole, so much stuff has been deleted.

  10. This is not the Wayback Machine code. by InvisiBill · · Score: 2, Interesting

    A friend from another messageboard is working on this project, and just posted to let us know that he's been /.ed (which is sort of a cool thing in the geek world).
    And of course they got it all wrong. Heritrix != WayBackMachine.

    Heritrix gathers web pages (harvests)
    The WayBackMachine gives access to harvested material.

    Also Heritrix is a new web crawler meant to replace the one that IA has been using (which is owned by Alexa Internet).

    That's what he had to say about it. The post and the article both say it's the crawler, but the title states that it's the Wayback Machine. The two parts are separate though, and this is only the crawler part.