Internet Archive Opens Crawler Code Under LGPL
ramakant writes: "It looks like the Internet Archive, which hosts the infamous Wayback Machine, has opened its newest in-development crawler code under the LGPL. From the announcement: 'Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess. Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.'"
They've open sourced your wayback machine! Now you've lost the monopoly!
could someone summarize the differences?
fp?
You mean works of art like this?
B1FF#S K3WL H0M3 PAG3!!!
The source download is available on sourceforge.
I doubt it'll get slashdotted, but you never know...
...Heretics or yet another dumb Matrix reference. Or possibly both.
OSDN can decide to open source source forge...
Beings aspergers AND pulling chicks... I enjoy the challenge!
Look, ma - no trolls!! But anti-MS comments in da hizzouse!!
/.
I much prefer the current
Score! Now I can run my own wayback machine!
I only have a 30G hard drive though, what do you guys think, bzip should take care of it?
...some unused variables and such-like in there, though, as reported by PMD.
The Army reading list
From their FAQ: if you are comfortable grabbing code directly from CVS, wrestling with incomplete documentation, and running into undocumented limitations, would you want to use the current software.
Undocumented limitations? That sounds like a lot of fun!
Troll: Large Giant, 63 hp, AC 16, Usually chaotic evil.
Nothing like crawling for old, recycled, and dead torrents.
Open source that handles over 300tb of data!
If you can read this sig - the bitch fell off.
Congrats Gojomo!
This project was written by the brains behind bitzi and some really cool P2P stuff.
He's one of those guys that's going to be working on important stuff for years to come.
Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.
I know some grammar nazi is going to see this, so I might as well get it first. What about heretic: one who dissents from an accepted belief or doctrine.
Beneath this article I noticed this fortune cookie:
"Insanity is hereditary. You get it from your kids. "
> Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix / heritix / heretix / heratix) is an archaic word for inheritess.
And what, pray tell, is "inheritess" ?
This is a great step forward, I welcome our archiving overlords, etc. Right now when I want to share some of my history (the good stuff, natch) with my kids, I have to dig out an old, musty shoebox full of junk. When they want to share theirs with their kids, they'll just beam a URL into my grandkids' in-skull HUDs. While in their flying cars. "Oh look, here's another stupid post to Slashdot by Grandpa..."
the infamous Wayback Machine
Why is it infamous? I haven't heard anything bad about it.
Like sex? Read and write about it! Indecent Blogging
in english?
I think we /.'d sf.net... either that or it's conveniently not accessible right after I see it linked from slashdot.
snowulf.com
Sounds a bit like Asterix' grandfather.
But my Mom says I'm cool! -Milhouse
WTF is inheritess? I think we have recursive typos here...my head is going to explode!
When I am king, you will be first against the wall.
here is a slashdot story from wayback i just found.
"IBM announces a 25 gigger
Posted by Hemos on Wednesday November 11, @10:11AM
from the why-i-could-put-3/4-my-cd-collection dept.
Booker writes "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
Read More...
64 comments"
Just thought it was interesting to see, since we now have 200gig HDs
Sig- http://www.dreamhost.com/rewards.cgi?ayefly
Just been looking at some slashdot pages from 1997... quote from the "Post your comments here!" form : "If you don't have anything worthwhile to say, don't say it. If people continue to abuse this feature, I will have to remove it."
;-)
Oh how different things could have been...
If the trolls had time machines...
Ever since the Wayback Machine started making waves, I'd guess about 2 years ago, I've noticed 2 things: there are far fewer updates of the archives, and it seems that the archive is regularly unable to keep up with the client load we impose on it.
To be honest, I don't have a great answer for the second problem. The only thing that could help there is the passage of time and advancement of technology, really. For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network, each client downloads its pages, compares MD5 sums of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...
Then again, would I do this, or even continue the project if I were in charge? No, I wouldn't. While, ideally, every page on the internet would be in XHTML, striking a major blow against signal:noise (hey, my own page is XHTML validated, how about yours?), the vast majority of time spidering is undoubtedly wasted on re-downloading several dozen kilobytes of dynamically generated junk surrounding the content on sites such as CNN.com... While it's a noble cause, it's also a futile one.
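For what it's worth, the "compare MD5 with other clients" step of the volunteer scheme above can be sketched in a few lines of Python. This is only an illustration under assumptions: the client ids, the `agree_on_page` helper, and the quorum of 2 are all made up here and have nothing to do with how IA or Heritrix actually works.

```python
import hashlib
import zlib
from collections import Counter
from typing import Dict, Optional, Tuple


def agree_on_page(reports: Dict[str, bytes], quorum: int = 2) -> Optional[Tuple[str, bytes]]:
    """Pick the page version that the most volunteer clients fetched.

    `reports` maps a hypothetical client id to the bytes that client
    downloaded for one URL. Returns (md5_hex, zlib-compressed body) when
    at least `quorum` clients reported identical bytes, otherwise None.
    """
    if not reports:
        return None
    digests = {client: hashlib.md5(body).hexdigest() for client, body in reports.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None  # no two clients saw the same bytes
    body = next(b for c, b in reports.items() if digests[c] == digest)
    return digest, zlib.compress(body)
```

Note that a page with per-request dynamic content gives every client a different digest, so it would never reach quorum. That is the same dynamically-generated-junk problem complained about above.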
The Internet Archive serves the hidden purpose of preserving the AI source-code DNA of artificial Minds.
Each AI Mind leaves a source code trace of itself as it evolves and proliferates across the 'Net and the parsecs of nearby meatspace.
Robot Minds will be able to look up their ancestors in the Internet archive, just as we humans do. However, when the Joint Stewardship of Earth by man and cyborg has arrived in the form of the Technological Singularity, robots will be able to resurrect their AI Mind ancestors and bring them back to alife from the Internet Archive.
I wonder how long it will be till we see a new site open using the code...
Hacking the Network
http://gnomesupport.org/
The Internet is huge. But get rid of all the redundancy and the size goes down by a huge factor. How many copies of the Linux kernel and distros are there? How many copies of Matrix Reloaded? Do an MD5 sum and store pointers in order to recreate the structure of the net, keeping only one copy of what is unique. Terabyte servers are cheap these days. Wouldn't need more than a few at the most to archive everything.
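The "MD5 sum plus pointers" idea above amounts to a content-addressed store. Here is a rough Python sketch of it, assuming everything fits in memory; the `DedupStore` class and the mirror URLs in the usage note are hypothetical names for illustration only.

```python
import hashlib


class DedupStore:
    """Keep one copy of each unique body, plus a URL -> digest pointer."""

    def __init__(self):
        self.blobs = {}     # MD5 hex digest -> content bytes (stored once)
        self.pointers = {}  # URL -> MD5 hex digest

    def put(self, url: str, content: bytes) -> str:
        digest = hashlib.md5(content).hexdigest()
        self.blobs.setdefault(digest, content)  # no-op if already stored
        self.pointers[url] = digest
        return digest

    def get(self, url: str) -> bytes:
        return self.blobs[self.pointers[url]]

    def stored_bytes(self) -> int:
        return sum(len(body) for body in self.blobs.values())
```

Putting the same tarball under `http://mirror-a.example/linux.tar` and `http://mirror-b.example/linux.tar` stores it once. A copy that differs by even one byte hashes differently and is stored in full, so this only removes exact duplicates. MD5 is used because the post names it; it is not collision-resistant, so a real archive would likely want a stronger hash.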
Did Taco die or something?
which hosts the infamous Wayback Machine has opened?
What exactly is infamous about the Wayback Machine? I did not know it was generally hated.
Pedophilia too. Ugh!
entertaining fiction - but sort of unproductive.
"Did you know that flammable and inflammable mean the same thing?
Boy, did I find that out the hard way" - Woody from Cheers
I know BIFF. BIFF is my friend. SexyKellyOsbourne you are no BIFF.
(BIFF never used numbers)
Need Mercedes parts ?
Slashdot without comments would have around the same information density as a book without letters...
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
"Matrix" - The guy in the village you see in the background, and notice that there are about 30 of the same guy.
"Unix" - The guy sued by Roman IP attorneys
"Asterick" - Because no one can pronounce "Asterisk" any more.
"Xenix" - To make the Asterix comics more inclusive, they have now added a female warrior defender for the village.
Why is archive.org archived :) so many times on 18 Sept 2001? There are actually more - "Note some duplicates are not shown. See all." - and then there are about 7500 entries, mostly in the same year. I opened about 10 of them and they seem to be the same.
Is someone standing next to you, holding a gun to your head? Why are you forced to use Gnome?
It's a great (cough) offsite backup, but very frustrating when you can't get all the pieces.
"Since our crawler seeks to collect the digital artifacts of our culture for the benefit of future researchers and generations..."
That is, unless the digital artifacts in question are, like Operation Clambake, opposed to rich and powerful sects. In which case, they are blocked by the Wayback Machine after the Archive caves in to DMCA notices.
...and archive.org tries to archive it? Will it go into an infinite loop, or just have 2 copies of the interweb?
Sheesh. Let me put this one to bed before it snowballs into a big cloud of impenetrable Times New Roman.
:-)
I'm tempted to shout, but I won't. Don't make me shout!
"Heretrix" is a term most often seen in a genealogy context. It denotes a chick who is designated to inherit (or has already inherited) the estate of someone. Example sentence: "Captain Dork married Jack Dipstick's heretrix Gassy Lucy."
In most cases the word "heretrix" connotes that there was something significant about the inherited estate, e.g. lots of cash.
Now shut up already!
Which one?
I will be able to look at that exciting gopher site everybody was talking about! Yes?
Guess this solves this guy's problem.
s/K/Gn/g
How long until SCO claims that the code is theirs?
I hate sigs.
There's a huge number of open source web crawlers available already on SourceForge and elsewhere. Anyone know the advantages and disadvantages of this one over the others?
And many a geek without a RL will achieve eternal life when their personality (as expressed through pointed comments), experiences (as expressed through pointless anecdotes) and knowledge (as expressed through worthless advice) and thus their consciousness and LIVING MIND ITSELF, is painstakingly put back together by the same future race which will unfreeze the richer geeks from their cryogenic deathsleeps, from the myriad holographic shreds on the archived internet.
Think about it...
Everything you've ever said... it all came from you...
I hope the Archive does the same thing with their parallel programming system called P2.
It's a script execution environment that they use for processing the archive data.
http://www.archive.org/web/researcher/parallel.php
More as a bridge without trolls...
I went to the GNU main site to try and figure out what the LGPL was about, and no luck at all getting a coherent explanation.
:)
Wikipedia has a good explanation (below), although I am confused as to why the Wayback Machine chose this particular licence, since it seems to be specifically for software libraries. Perhaps they meant the GFDL (GNU Free Documentation License).
P.S. You're allowed to copy all the stuff you want from Wikipedia; it's copylefted with the GFDL itself!
--- Wikipedia Article on LGPL ---
http://en2.wikipedia.org/wiki/GNU_Lesser_General_Public_License
GNU Lesser General Public License
From Wikipedia, the free encyclopedia.
The GNU Lesser General Public License is a software license designed as a compromise between the GNU General Public License and simple permissive licenses such as the BSD license and the MIT License.
It places a copyleft restriction on individual source code files but does not copyleft the program as a whole. The license is useful for software libraries; it was once called the GNU Library General Public License.
Actually, the more-than-famous thing is a reference to the movie Three Amigos where they didn't understand the exact meaning of the word...
True story.
...wayback inadvertently archives itself?!?!
That reminds me... once I thought of googling for "google"... but I didn't since it, no doubt, would create a black hole or something!
"If I have been able to see so far, It is because I went out and bought a damn binoculars" - Ze da Esquina
" Ooopsies...
Tim
Sat Dec 20 at 6:37PM EST
Guess I should read the article before I post. I was under the impression that the next release of IE4 *would* support HTML 4.0...Oh well."
Guess I should read the article before I post? What a crazy, upside-down world it was back then!
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Heritrix is just a crawler for collecting web resources recursively, within some defined parameters -- it doesn't offer Internet Archive Wayback Machine (IA WM) functionality.
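To make "collecting web resources recursively, within some defined parameters" concrete, here is a toy breadth-first sketch in Python. It is not Heritrix code (Heritrix is written in Java). The `crawl` function, the `fetch_links` callback, the host allow-list, the depth limit, and the in-memory `FAKE_WEB` that stands in for real HTTP fetches are all assumptions made up for this illustration.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, fetch_links, allowed_hosts, max_depth):
    """Breadth-first crawl from `seeds`, staying on `allowed_hosts` and
    following links at most `max_depth` hops away from a seed."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # a real crawler would archive the response here
        if depth == max_depth:
            continue  # scope limit reached: do not follow links further
        for link in fetch_links(url):
            if link in seen or urlparse(link).hostname not in allowed_hosts:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Stand-in for the network: URL -> links found on that page (made-up data).
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://other.example.com/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": [],
}

visited = crawl(["http://example.org/"], lambda u: FAKE_WEB.get(u, []),
                allowed_hosts={"example.org"}, max_depth=2)
# visited == ["http://example.org/", "http://example.org/a", "http://example.org/b"]
```

The off-site link and the page three hops out are both skipped, which is all "defined parameters" means in this sketch. A real crawl would also need robots.txt handling, politeness delays, and retries.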
FYI, there is a GPL'd web access tool that's very much like the IA WM, and even surpasses it in some ways: the NWA (Nordic Web Archive) Toolset 1.0. It doesn't do crawling, but if you can coerce what you've crawled into its input format, it offers URL-based, date-based, and full-text search plus "back-in-time" viewing of an archive. (Check out their demo, but remember it's only got a small number of pages from www.nb.no, so confine your searches to things like "Norway".)
Heritrix release 0.2.0 was mainly a test of our new release procedure; we would not recommend the code for outside use yet. We use it for crawls of up to hundreds of sites, taking a week or more to complete, but it still requires expert attention to crawl well.
We intend to improve its stability and scalability until it is capable of web-scale crawls -- billions of pages -- but that requires many incremental improvements, including extension to run on networks of cooperating crawling machines -- not planned until later in the year. (Heritrix currently crawls from a single machine.)
We are eager for contributors who would like to extend Heritrix in various ways, especially ways that would make it more valuable to researchers, librarians, and archivists. Optional modules for new fetch protocols, new media format link-extractors, or on-the-fly content-analysis to help direct further crawling would all be very interesting to us.
IA currently receives almost all of its full-web collection via an agreement with Alexa Internet, who have been crawling the web for the Internet Archive since 1996.
(P.S.: Yes, 'inheritess' should be 'inheritRess'/'heiress'. Oops.)
That's what he had to say about it. The post and the article both say it's the crawler, but the title states that it's the Wayback Machine. The two parts are separate though, and this is only the crawler part.
...while there may be unique content, there's certainly not unique versions. I'm sure there's many different rips of Matrix Reloaded. First off, there's all the various screener / preview dvd / telesync / DVD releases.
Then there's all the corrupted versions (a single unnoticeable bit error = different MD5) and the different rips: Macrovision removed/not removed, inverse telecine, PAL/NTSC versions, different resizing (bicubic/bilinear/Lanczos3).
Some made using XviD, some DivX, some WMV, different versions of the codec, different settings of the codec, different audio codecs, different audio/video bitrate mix.
And even if none of that were true, you really don't seem to get how mind-bogglingly big the Internet is. I read somewhere that they estimated there was well over 100 years of commercial cinema film (screen time) produced.
Let's assume that there exists one, and only one copy of each film (in return, we'll assume that every movie exists online, which might not be entirely accurate. But it's not far from it) Say 1h45 & 700mb DivX rips (to be kind) and those are at least 350,000 Tb alone. Multiply by a factor of 8 or so for original DVDs (which are also online many places now)
Then there's the hundreds of thousands of albums around. www.allmusic.com lists 5,155,636 tracks in their database. At an average of 3mb/song (a low estimate) you're talking another 15,000 Tb.
Then there's all the CD-ROM titles (applications, games, encyclopedias, whatever), books, databases, statistics and all sorts of other data that are available online. Not to mention all the homemade content that is available online, if only to a limited audience (like e.g. on homepages and such). Even though a couple pics don't do much, they add up when millions of people do it.
So, on a guesstimate I think you'd rather need a server on the order of 1 exabyte (1,000,000 tb) rather than "a few (terabyte servers) at the most". Already my personal network (desktop+server) has 500gb+ of data alone, so if I share all that you're already past a quarter of your 2 tb server...
Kjella
Live today, because you never know what tomorrow brings
Did the math using mb, when I thought I was operating in gb. So I was off by a factor of 1000. So the correct guesstimate would be 1 petabyte (1000 Tb).
Kjella
Live today, because you never know what tomorrow brings
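For anyone who wants to check the units, the film and music figures from the estimate above can be redone in a few lines of Python. The inputs are the ones quoted in the post (100 years of screen time, 1h45 and 700 MB per film, 5,155,636 tracks at 3 MB each). Decimal units (1 TB = 1,000,000 MB) and 365.25-day years are assumptions of this sketch.

```python
MB_PER_TB = 1_000_000  # decimal units, assumed

# Films: 100 years of screen time, 1h45 (1.75 h) and 700 MB per film.
hours_of_film = 100 * 365.25 * 24
films = hours_of_film / 1.75
film_tb = films * 700 / MB_PER_TB

# Music: 5,155,636 tracks at 3 MB each.
tracks = 5_155_636
music_tb = tracks * 3 / MB_PER_TB

print(f"films: about {film_tb:.0f} TB, music: about {music_tb:.1f} TB")
```

That comes to roughly 351 TB of film and 15.5 TB of music, about a thousand times smaller than the 350,000 Tb and 15,000 Tb first posted, which matches the factor-of-1000 correction.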
I am afraid spammers may use this code
to harvest web pages for email addresses.
Maybe with the code released, I can find out why it constantly taunts me by having a cache of everything EXCEPT what I want!
-- 'The' Lord and Master Bitman On High, Master Of All
Please don't be like Mark Webbink, Red Hat's general counsel, and give the open source movement undeserved credit. Adding a license to a list of approved licenses is trivial compared to writing the license and creating a community. The Lesser General Public License (formerly the Library General Public License) was written by the Free Software Foundation well before the open source movement was formed. The LGPL was written as a compromise in order to spread free software but strategically give up the ability to preserve software freedom in derivative works.
Digital Citizen
I'm quite certain that people will correct me (at length) if I'm wrong, but here goes.
The GPL says that you can use source and code any way that you want, but if you release modified versions, you must release the modified source under GPL.
The LGPL is intended for libraries that would otherwise be released under the GPL. It says that commercial and other non-GPL projects can use this library without becoming GPL, but that changes to the library itself must be released under the LGPL.
LGPL is generally considered a lighter weight version of the GPL, and it is normally used for things like system libraries. Without the LGPL, it wouldn't be possible to (legally) write closed source software for Linux, since the license for glibc (the standard system library) would require all apps linked against it be GPL.
plus-good, double-plus-good
I shall go and tell the indestructible man that someone plans to murder him.
Since when does a 2 bedroom condo have a basement?
Caldera & SCO to merge Unix and Linux OS? Where are the lawyers?
gb/mb = grambit/millibit = 1000 grams.
So either you don't know what you're doing, or you were off by a factor of 1 kg.
"Heritrix" would not mean "inheritness" - I'm not sure that's the word you are seeking either. Either inheritEDness, or for a lot of uses, "heredity" would work better.
However, neither of these would apply to the Latin word in question. "Heritrix" is the female form of "heritor", in this case meaning "she who inherits".
Still crops up as a legal term in some places..
Nothing to do with the original post, really, but are we not all committed to the spreading of knowledge?
Good news about the Heritrix code, any way you read it!
I was wondering whether it was written in Perl, Python, C, C++, Ruby ... but no, it's Java... I hope this can run on a free VM ;-)
I've always wanted to say that.
Brewster Kahle and Alexa Internet are the real deal. This isn't some undergrad's CS-101 project, it's a tool designed from the very start to archive the entire web. And it does it on a regular basis. Even if there's a really good SourceForge project (you didn't cite any of them), Alexa's should be a first stop for anyone interested in the task.
Heretics? This will join the FreeBSD devil and "Darwin" on Objective's list as to why Open Source is the spawn of the Devil.
I first read the headline and i thought it said the Internet Archive would be archiving L/GPL code.
That would be cool actually, like a 1stop shop for all the opensource cvs servers... get to see the linux kernel from .01 to 2.6.0 and a couple thousand other applications too. Oh well, the real story is neat too.
As said above, OSDN *HAS* open sourced SourceForge. You can obtain it at the Alexandria Development Project on SourceForge. Please try to do some research prior to saying things like this. That said, it is true that like many open source projects, SourceForge can only be used for open source software development. For commercial, closed source development using the SourceForge system, try SourceForge Enterprise Edition from VA Software, the original developers of SourceForge.
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
Archaic for heiress, i.e. a female who inherits.
It turns out that there are other open source crawlers that also have been written in Java. For a comprehensive listing go here:
Crawlers in Java