Internet Archive Opens Crawler Code Under LGPL

Posted by Cliff on Wednesday January 7, 2004 @03:40AM from the preserving-our-digital-culture dept.

ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"

49 of 186 comments

Mr peabody! by Anonymous Coward · 2004-01-07 03:40 · Score: 5, Funny

They've open sourced your wayback machine! Now you've lost the monopoly!
1. Re:Mr peabody! by Anonymous Coward · 2004-01-07 04:12 · Score: 2, Funny
  
  I don't know about you but I have no problem traveling forward in time. It is getting back that is the real trick.
gpl vs. lgpl? by Anonymous Coward · 2004-01-07 03:41 · Score: 3, Interesting

could someone summarize the differences?

fp?
1. Re:gpl vs. lgpl? by Anonymous Coward · 2004-01-07 04:17 · Score: 2, Insightful
  
  this ain't OT. The guy asked what the difference was between the GPL and LGPL. LGPL being the license the wayback code is being placed under, the opening of the code being the topic of discussion. Therefore, the post couldn't be any more on-topic.
  
  For chrissakes moderators! It says that the code is LGPL in the freakin' article HEADLINE!! We already have enough trouble with people not RTFA, an occasional someone who didn't read the submitter's post, and now we have moderators not RTFH to deal with too!!
2. Re:gpl vs. lgpl? by Anonymous Coward · 2004-01-07 04:21 · Score: 2, Funny
  
  One is communist, the other is socialist.
Cultural artifacts? by SexyKellyOsbourne · 2004-01-07 03:41 · Score: 2, Funny

You mean works of art like this?

B1FF#S K3WL H0M3 PAG3!!!
1. Re:Cultural artifacts? by Lev13than · 2004-01-07 03:50 · Score: 3, Funny
  
  What I want to know is, how do they keep it from crashing when it reaches here?
  
  --
  When you have nothing left to burn you must set yourself on fire
In case of /.ing... by Dave2 Wickham · 2004-01-07 03:41 · Score: 4, Informative

The source download is available on sourceforge.

I doubt it'll get slashdotted, but you never know...
1. Re:In case of /.ing... by Anonymous Coward · 2004-01-07 04:15 · Score: 4, Funny
  
  Don't you mean: I doubt it'll get slashdotted, but I needed the Karma.
Then maybe by caston · 2004-01-07 03:42 · Score: 4, Insightful

OSDN can decide to open source source forge...

--
Beings aspergers AND pulling chicks... I enjoy the challenge!
Oldest /. entry by Anonymous Coward · 2004-01-07 03:44 · Score: 5, Interesting

Look, ma - no trolls!! But anti-MS comments in da hizzouse!!

I much prefer the current /.
score by TedCheshireAcad · 2004-01-07 03:45 · Score: 5, Funny

Score! Now I can run my own wayback machine!

I only have a 30G hard drive though, what do you guys think, bzip should take care of it?
1. Re:score by bamf · 2004-01-07 03:49 · Score: 5, Funny
  
  If you limit yourself to only archiving the useful parts of the interweb, you should be able to fit it all on a floppy disk or two.
2. Re:score by corebreech · 2004-01-07 05:14 · Score: 4, Interesting
  
  I'll use it if you promise not to delete shit that doesn't hew to your ideology.
  
  That's what really sucks about the Wayback Machine.
  
  Ever try reading articles from the aftermath of 9/11? It's a great big hole, so much stuff has been deleted.
  
  --
  Is this truly the only Earth I can live on?
The code is pretty clean, too... by tcopeland · 2004-01-07 03:47 · Score: 4, Informative

...some unused variables and such-like in there, though, as reported by PMD.

--
The Army reading list
That sounds like a good working app. by DeKoNiNG · 2004-01-07 03:49 · Score: 5, Funny

From their FAQ: if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations, would you want to use the current software.
Undocumented limitations? That sounds like a lot of fun!

--
Troll: Large Giant, 63 hp, AC 16, Usually chaotic evil.
old torrents by kyoko21 · 2004-01-07 03:50 · Score: 3, Funny

Nothing like crawling for old, recycled, and dead torrents.
This is great news by CompWerks · 2004-01-07 03:50 · Score: 2, Informative

Open source that handles over 300tb of data!

--
If you can read this sig - the bitch fell off.
Gordon Mohr by Orasis · 2004-01-07 03:50 · Score: 3, Informative

Congrats Gojomo!

This project was written by the brains behind bitzi and some really cool P2P stuff.

He's one of those guys that's going to be working on important stuff for years to come.
What about... by herrvinny · 2004-01-07 03:51 · Score: 3, Insightful

Heritrix (sometimes spelled heretrix , or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.

I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: one who dissents from an accepted belief or doctrine.
Maaaaamories... by Dorf on Perl · 2004-01-07 03:54 · Score: 5, Funny

This is a great step forward, I welcome our archiving overlords, etc. Right now when I want to share some of my history (the good stuff, natch) with my kids, I have to dig out an old, musty shoebox full of junk. When they want to share theirs with their kids, they'll just beam a URL into my grandkids' in-skull HUDs. While in their flying cars. "Oh look, here's another stupid post to Slashdot by Grandpa..."
Infamous? by BitchAss · 2004-01-07 03:59 · Score: 4, Interesting

the infamous Wayback Machine

Why is it infamous? I haven't heard anything bad about it.

--
Like sex? Read and write about it! Indecent Blogging
1. Re:Infamous? by hey · 2004-01-07 04:06 · Score: 3, Funny
  
  Just wait 20 years when you are trying to get a CEO job and somebody produces your embarrassing old weblog.
2. Re:Infamous? by Lester67 · 2004-01-07 04:07 · Score: 3, Funny
  
  The batting cage that I frequent with the kids hates the fact their web-coupon (with no expiration date) is still stored in the Wayback.
  
  I think they might agree with "infamous". :-)
3. Re:Infamous? by marnanel · 2004-01-07 06:22 · Score: 2, Funny
  
  Beware the Ghost of Usenet^H^H^H^H^HBlog Postings Past!</gratuitous>
  
  --
  GROGGS: alive and well and living in
Heritrix? by elgrinner · 2004-01-07 04:02 · Score: 3, Funny

Sounds a bit like Asterix' grandfather.

--
But my Mom says I'm cool! -Milhouse
Uh? by Zog The Undeniable · 2004-01-07 04:02 · Score: 4, Funny

Heritrix (sometimes spelled heretrix , or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.
WTF is inheritess? I think we have recursive typos here...my head is going to explode!

--
When I am king, you will be first against the wall.
1. Re:Uh? by gojomo · 2004-01-07 04:42 · Score: 2, Informative
  
  'Inheritess' is the female form of 'inheritor' -- 'someone who inherits' (female). AKA 'heiress'.
2. Re:Uh? by phiala · 2004-01-07 04:49 · Score: 2, Informative
  
  The OED online is my friend!
  As a confirmed sesquipedalian, and obsessive research-addict, how could I overlook the opportunity to learn new words? And of course, share my newfound knowledge with you all...
  The OED would like us all to know:
  heritrix, heretrix: A female heir or heritor; an heiress.
  heritress: An heiress, an inheritress.
  inheritress: A female inheritor; an heiress. (Less technical than inheritrix.)
  inheritrix: Latinized fem. of INHERITOR
  inheritess: not a word
  And there you have it, courtesy of madmen and murderers. Well, one anyway, plus a whole collection of fellow logophiles.
  
  --
  I prefer to be called Evil Scientist.
Old slashdot news by AyeFly · 2004-01-07 04:04 · Score: 5, Interesting

here is a slashdot story from wayback i just found.

"IBM announces a 25 gigger

Posted by Hemos on Wednesday November 11, @10:11AM
from the why-i-could-put-3/4-my-cd-collection dept.
Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
Read More...
64 comments"

Just thought it was interesting to see, since we now have 200gig HDs

--
Sig- http://www.dreamhost.com/rewards.cgi?ayefly
Slashdot wayback then... by OpCode42 · 2004-01-07 04:04 · Score: 5, Funny

Just been looking at some slashdot pages from 1997... quote from the "Post your comments here!" form : "If you don't have anything worthwhile to say, don't say it. If people continue to abuse this feature, I will have to remove it."

Oh how different things could have been... ;-)

If the trolls had time machines...
I probably would have done this differently... by Rahga · 2004-01-07 04:06 · Score: 4, Insightful

Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far less updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.

To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network, each client downloads, compares MD5 of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

Then again, would I do this, or even continue the project if I was in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
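
For what it's worth, the "compare MD5 with other clients, compress, send back" step could be sketched roughly like this. It is only an illustration of the idea in this comment, not anything the Internet Archive or Heritrix actually does; the client ids, the `agree_and_pack` helper, and the quorum of two are all made up for the example.

```python
import gzip
import hashlib
from collections import Counter


def md5_hex(body):
    """Return the hex MD5 digest of a fetched page body (bytes)."""
    return hashlib.md5(body).hexdigest()


def agree_and_pack(copies, quorum=2):
    """Pick the page body that at least `quorum` clients agree on.

    `copies` maps a hypothetical client id to the bytes that client fetched.
    Returns (digest, gzip-compressed body), or None if no digest reaches quorum.
    """
    digests = {client: md5_hex(body) for client, body in copies.items()}
    if not digests:
        return None
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None
    # Any copy with the winning digest will do: as far as MD5 can tell,
    # they are the same bytes.
    body = next(b for c, b in copies.items() if digests[c] == winner)
    return winner, gzip.compress(body)


if __name__ == "__main__":
    fetched = {
        "client-a": b"<html>hello</html>",
        "client-b": b"<html>hello</html>",
        "client-c": b"<html>hello, plus a different ad banner</html>",
    }
    result = agree_and_pack(fetched)
    if result is not None:
        digest, packed = result
        print(digest, len(packed))
```

Note that the dynamically generated junk mentioned above is exactly what would make clients disagree: two fetches of the same CNN.com page rarely hash the same, so a scheme like this would probably need to hash only the stable part of the page.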
1. Re:I probably would have done this differently... by benja · 2004-01-07 05:09 · Score: 2, Interesting
  
  Ever since the wayback machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: There are far less updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.
  
  I think that they possibly intentionally limit their bandwidth, so that it's faster to browse the real Web than them (because they don't want to become Google cache when a site is slashdotted, for example).
  
  (Although they only would if the page in question is old enough... they have a policy of pages going in only 6 months after they have been spidered, probably for the same reason as above.)
Re:Heritrix by hplasm · 2004-01-07 04:06 · Score: 3, Funny

And what, pray tell, is "inheritess" ?
A Heritrix.

--
...and he grinned, like a fox eating shit out of a wire brush.
Wayback = Genealogy of AI Minds by Mentifex · 2004-01-07 04:06 · Score: 3, Interesting

The Internet Archive serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
Each AI Mind leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
Robot Minds will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.
Redundancy? by Anonymous Coward · 2004-01-07 04:11 · Score: 3, Interesting

The Internet is huge. But get rid of all the redundancy and the size goes down by a huge factor. How many copies of the Linux kernel and distros are there? How many copies of Matrix Reloaded? Do an MD5 sum and store pointers in order to recreate the structure of the net, keeping only one copy of what is unique. Terabyte servers are cheap these days. Wouldn't need more than a few at the most to archive everything.
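
A rough sketch of the "MD5 sum plus pointers" idea, assuming a toy in-memory store. The `DedupStore` class and the example URLs are invented for illustration, and nothing here says how the Archive actually stores data.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one blob per unique MD5, many URL pointers."""

    def __init__(self):
        self.blobs = {}     # md5 hex digest -> bytes, stored once
        self.pointers = {}  # url -> md5 hex digest

    def add(self, url, body):
        """Record `url`, storing `body` only if its digest is new."""
        digest = hashlib.md5(body).hexdigest()
        self.blobs.setdefault(digest, body)
        self.pointers[url] = digest
        return digest

    def get(self, url):
        """Follow the pointer for `url` back to the stored bytes."""
        return self.blobs[self.pointers[url]]

    def stats(self):
        """Return (bytes referenced by all URLs, bytes actually stored)."""
        logical = sum(len(self.blobs[d]) for d in self.pointers.values())
        stored = sum(len(b) for b in self.blobs.values())
        return logical, stored


if __name__ == "__main__":
    store = DedupStore()
    kernel = b"pretend this is a kernel tarball"
    store.add("http://mirror-one.example/linux.tar.gz", kernel)
    store.add("http://mirror-two.example/linux.tar.gz", kernel)
    store.add("http://example.org/index.html", b"<html>unique page</html>")
    print(store.stats())
```

The gap between the two numbers from `stats()` is the saving. One caveat: MD5 collisions can be constructed deliberately, so a real archive would likely want a stronger hash or a byte-for-byte check before treating two files as the same.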
Re:no articles for 4 hours on a weekday morning? by skidoo2 · 2004-01-07 04:19 · Score: 2, Funny

I was wondering the same thing. Last night I posted a cool article about weird slime on Mars, and it hasn't even been rejected yet.
Unless the Archive caves in... by turambar386 · 2004-01-07 04:25 · Score: 5, Informative

"Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."

That is, unless the digital artifacts in question are, like Operation Clambake, opposed to rich and powerful sects. In which case, they are blocked by the Wayback machine after the Archive caves in to DMCA notices.
What if there's another archive.org by British · 2004-01-07 04:27 · Score: 3, Funny

...and archive.org tries to archive it? Will it go into an infinite loop, or just have 2 copies of the interweb?
"Heritrix" explained by skidoo2 · 2004-01-07 04:27 · Score: 2, Informative

Sheesh. Let me put this one to bed before it snowballs into a big cloud of impenetrable Times New Roman.

I'm tempted to shout, but I won't. Don't make me shout!

"Heretrix" is a term most often seen in a genealogy context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."

In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g. lots of cash.

Now shut up already! :-)
finally! by badansible · 2004-01-07 04:30 · Score: 3, Funny

I will be able to look at that exciting gopher site everybody was talking about! Yes?
What will happen if... by balbord · 2004-01-07 05:15 · Score: 2, Funny

...wayback inadvertently archives itself?!?!

That reminds me... once I thought of googling for "google"... but I didn't since it, no doubt, would create a black hole or something!

--
"If I have been able to see so far, It is because I went out and bought a damn binoculars" - Ze da Esquina
Even better! by Inoshiro · 2004-01-07 05:19 · Score: 3, Funny

" Ooopsies...
Tim
Sat Dec 20 at 6:37PM EST

Guess I should read the article before I post. I was under the impression that the next release of IE4 *would* support HTML 4.0...Oh well."

Guess I should read the article before I post? What a crazy, upside-down world it was back then!

--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Important clarifications (!!!) by gojomo · 2004-01-07 05:26 · Score: 4, Informative

Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.
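
To make "collecting web resources recursively, within some defined parameters" concrete, here is a minimal sketch of a scoped breadth-first crawl. It is not Heritrix code and does not reflect its design: the `crawl` function, the `fetch_links` callback, and the host/depth/page limits are all assumptions chosen for the example, and the "web" is a small in-memory dictionary so it runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, fetch_links, allowed_hosts, max_depth=2, max_pages=100):
    """Breadth-first crawl from `seeds`, staying inside the given scope.

    `fetch_links(url)` is a hypothetical callback returning the URLs found
    on a page; a real crawler would download and parse the page here.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # collected, but its links are not followed
        for link in fetch_links(url):
            if link in seen:
                continue
            if urlparse(link).hostname not in allowed_hosts:
                continue  # outside the defined parameters
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A tiny fake web standing in for real HTTP fetches.
    web = {
        "http://a.example/": ["http://a.example/1", "http://b.example/"],
        "http://a.example/1": ["http://a.example/2", "http://a.example/"],
        "http://a.example/2": ["http://a.example/3"],
    }
    order = crawl(["http://a.example/"], lambda u: web.get(u, []),
                  allowed_hosts={"a.example"}, max_depth=2)
    print(order)
```

The `seen` set is what stops the crawl from looping when pages link back to each other, and the host and depth checks are the simplest possible version of a crawl scope.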

FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo, but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)

Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.

We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)

We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.

IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.

(P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
This is not the Wayback Machine code. by InvisiBill · 2004-01-07 05:34 · Score: 2, Interesting

A friend from another messageboard is working on this project, and just posted to let us know that he's been /.ed (which is sort of a cool thing in the geek world).
And of course they got it all wrong. Heritrix != WayBackMachine.
Heritrix gathers web pages (harvests)
The WayBackMachine gives access to harvested material.
Also Heritrix is a new web crawler meant to replace the one that IA has been using (which is owned by Alexa Internet).

That's what he had to say about it. The post and the article both say it's the crawler, but the title states that it's the Wayback Machine. The two parts are separate though, and this is only the crawler part.
spam by krokodil · 2004-01-07 06:38 · Score: 2, Insightful

I am afraid spammers may use this code
to harvest web pages for email addresses.
1. Re:spam by elemental23 · 2004-01-07 09:13 · Score: 2, Informative
  
  Don't lose any sleep over it, spammers have had tools to harvest the web for e-mail addresses for years.
  
  Insightful?
  
  --
  I like my women like my coffee... pale and bitter.
Stop giving open source movement undeserved credit by jbn-o · 2004-01-07 07:43 · Score: 2, Insightful

Open source that handles over 300tb of data!

Please don't be like Mark Webbink, Red Hat's general counsel, and give the open source movement undeserved credit. Adding a license to a list of approved licenses is trivial compared to writing the license and creating a community. The Lesser General Public License (formerly the Library General Public License) was written by the Free Software Foundation well before the open source movement was formed. The LGPL was written as a compromise in order to spread free software but strategically give up the ability to preserve software freedom in derivative works.

--
Digital Citizen
Re:gpl vs. lgpl? (answered) by DonGar · 2004-01-07 08:08 · Score: 3, Informative

I'm quite certain that people will correct me (at length) if I'm wrong, but here goes.

The GPL says that you can use source and code any way that you want, but if you release modified versions, you must release the modified source under GPL.

The LGPL is intended for libraries that are released under the GPL. It says that commercial and other non-GPL projects can use this library without becoming GPL, but that changes to the library itself must be released under the LGPL.

LGPL is generally considered a lighter weight version of the GPL, and is normally used for things like system libraries. Without the LGPL, it wouldn't be possible to (legally) write closed source software for Linux, since the license for glibc (the standard system library) would require all apps linked against it be GPL.

--
plus-good, double-plus-good