Internet Archive Opens Crawler Code Under LGPL

Posted by Cliff on Wednesday January 7, 2004 @03:40AM from the preserving-our-digital-culture dept.

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

11 of 186 comments

In case of /.ing... by Dave2+Wickham · 2004-01-07 03:41 · Score: 4, Informative

The source download is available on SourceForge.

I doubt it'll get slashdotted, but you never know...
The code is pretty clean, too... by tcopeland · 2004-01-07 03:47 · Score: 4, Informative

...some unused variables and such-like in there, though, as reported by PMD.

--
The Army reading list
This is great news by CompWerks · 2004-01-07 03:50 · Score: 2, Informative

Open source that handles over 300 TB of data!

--
If you can read this sig - the bitch fell off.
Gordon Mohr by Orasis · 2004-01-07 03:50 · Score: 3, Informative

Congrats Gojomo!

This project was written by the brains behind Bitzi and some really cool P2P stuff.

He's one of those guys who's going to be working on important stuff for years to come.
Unless the Archive caves in... by turambar386 · 2004-01-07 04:25 · Score: 5, Informative

"Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."

That is, unless the digital artifacts in question are, like Operation Clambake, opposed to rich and powerful sects, in which case they are blocked by the Wayback Machine after the Archive caves in to DMCA notices.
"Heritrix" explained by skidoo2 · 2004-01-07 04:27 · Score: 2, Informative

Sheesh. Let me put this one to bed before it snowballs into a big cloud of impenetrable Times New Roman.

I'm tempted to shout, but I won't. Don't make me shout!

"Heretrix" is a term most often seen in a genealogy context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."

In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g. lots of cash.

Now shut up already! :-)
Re:Uh? by gojomo · 2004-01-07 04:42 · Score: 2, Informative

'Inheritess' is the female form of 'inheritor' -- 'someone who inherits' (female). AKA 'heiress'.
Re:Uh? by phiala · 2004-01-07 04:49 · Score: 2, Informative

The OED online is my friend!
As a confirmed sesquipedalian and obsessive research addict, how could I overlook the opportunity to learn new words? And, of course, share my newfound knowledge with you all...
The OED would like us all to know:
heritrix, heretrix: A female heir or heritor; an heiress.
heritress: An heiress, an inheritress.
inheritress: A female inheritor; an heiress. (Less technical than inheritrix.)
inheritrix: Latinized fem. of INHERITOR
inheritess: not a word
And there you have it, courtesy of madmen and murderers. Well, one anyway, plus a whole collection of fellow logophiles.

--
I prefer to be called Evil Scientist.
Important clarifications (!!!) by gojomo · 2004-01-07 05:26 · Score: 4, Informative

Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.
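
To make "recursively, within some defined parameters" concrete, here is a minimal sketch of a scoped breadth-first crawl. It is not Heritrix code and does not reflect its architecture; the names `crawl`, `fetch`, `allowed_hosts`, `max_depth`, and `max_pages` are hypothetical, chosen only for illustration. The fetch step is passed in as a function so the crawl logic can be read (and tried) without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, allowed_hosts, max_depth=2, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: html} for fetched pages.

    `fetch(url)` is a caller-supplied function (an assumption of this
    sketch) that returns the page's HTML as a string, or None on failure.
    The "defined parameters" here are a host allow-list, a link-depth
    limit, and a cap on the number of pages collected.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

For a quick try-out, `fetch` can be the `.get` method of a dict mapping URLs to HTML strings; a real crawl would also need politeness delays, robots.txt handling, and error recovery, none of which this sketch attempts.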

FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo, but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)

Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.

We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)

We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.

IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.

(P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
Re:gpl vs. lgpl? (answered) by DonGar · 2004-01-07 08:08 · Score: 3, Informative

I'm quite certain that people will correct me (at length) if I'm wrong, but here goes.

The GPL says that you can use the source and code any way that you want, but if you release modified versions, you must release the modified source under the GPL.

The LGPL is intended for libraries that would otherwise be released under the GPL. It says that commercial and other non-GPL projects can use the library without becoming GPL, but that changes to the library itself must be released under the LGPL.

The LGPL is generally considered a lighter-weight version of the GPL, and is normally used for things like system libraries. Without the LGPL, it wouldn't be possible to (legally) write closed-source software for Linux, since the license for glibc (the standard system library) would require all apps linked against it to be GPL.

--
plus-good, double-plus-good
Re:spam by elemental23 · 2004-01-07 09:13 · Score: 2, Informative

Don't lose any sleep over it; spammers have had tools to harvest the web for e-mail addresses for years.

Insightful?

--
I like my women like my coffee... pale and bitter.