The Internet Archive is participating, too. We'd accept contribution projects related to the Heritrix web crawler, Wayback access tool, or NutchWAX full-text search facility. See our Summer of Code 2006 Ideas Page.
The Star Trek:TNG episode was in 1993, called The Chase.
This same idea was advanced by a short story called We'll Return, After This Message, by AutoDesk founder John Walker, written in 1989 and published in 1993.
I also mentioned the same general idea in my 2002 OReillyNet weblog item, SETI not through telescopes but microscopes, about how rugged microscopic messages might be the only ones to survive millions of years.
HTTP did not pre-date Berners-Lee; his project made the first version in 1991/1992... which didn't become an IETF standard until 1996.
I can't comment in more detail about Alexa's bulk crawl strategy because it is only documented to the public (and us at the Internet Archive) in general terms: it is a broad survey crawl of the public web, weighted by Alexa's internal measures of site/page importance and legitimacy (which are at least partially based on the same toolbar data that drives their site rankings). While we expect to continue receiving the Alexa donations indefinitely, a growing proportion of the public archive is likely to come from other sources, including the IA's own crawling and other outside donors, in the future.
The Archive is funded by a combination of private donations from individuals and foundations (sometimes for general operations and sometimes for specific projects), and fees for services provided to our partners, who are public libraries and archives themselves. With an 11+ year history, and long partnerships with customers and funding sources, we're pretty stable in the world of technology nonprofits.
I wasn't directly involved in the Ubuntu choice, but it's been nice to have our developer desktops in close sync with cluster servers.
- Gordon @ IA
'Recall' wasn't exactly Google-like search. IIRC, in some respects it was better, with an advanced idea of related concepts, and with data on frequency of terms over time. In other respects, it was not what people would expect: there was no exact phrase matching, and certain terms that didn't become tracked concepts weren't findable at all, even though you could see the words in other indexed results.
Unfortunately, IA couldn't maintain the deployment when the developer, Anna Patterson, moved to Google. So, Recall turned out to be a short-lived experiment, grand in scale of pages indexed and novel features but not in traffic served.
Patterson did big things at Google and now has another search startup, Cuill, that's likely to do more good things for the web.
At the Internet Archive, we've also been using the open-source projects Nutch and Hadoop to offer search on smaller web collections for our partners. (A pair of such searchable partner collections for the US National Archives and Records Administration lives at webharvest.gov.) Someday we may be able to scale these up to the full 11+ year archive.
- Gordon @ IA
Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.
We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.
The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.
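To make the idea of a time-range exclusion concrete, here is a rough Python sketch. It is not the Wayback software's actual data model: the `Exclusion` class, its field names, and the example domain and dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Exclusion:
    """Hypothetical rule: hide captures under url_prefix taken in [start, end)."""
    url_prefix: str
    start: Optional[datetime] = None  # None means "from the beginning"
    end: Optional[datetime] = None    # None means "with no end date"


def is_excluded(url: str, captured: datetime, rules: List[Exclusion]) -> bool:
    """Return True if any rule covers this URL at this capture time."""
    for rule in rules:
        if not url.startswith(rule.url_prefix):
            continue
        if rule.start is not None and captured < rule.start:
            continue
        if rule.end is not None and captured >= rule.end:
            continue
        return True
    return False


# Example: a domain changed hands on 2005-06-01; only the new owner's
# period is excluded, so the earlier captures stay visible.
rules = [Exclusion("example.com/", start=datetime(2005, 6, 1))]
print(is_excluded("example.com/index.html", datetime(2003, 1, 15), rules))  # False
print(is_excluded("example.com/index.html", datetime(2006, 1, 15), rules))  # True
```

A "for all time" exclusion is just the special case where both `start` and `end` are left unset.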
Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
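For illustration only, this is how an 'Allow'-aware parser reads that kind of file, using Python's standard `urllib.robotparser`. The robots.txt text below is a hypothetical stand-in, not the actual mindchild.net file, and this is not the Archive's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not the actual mindchild.net file.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Ask an Allow-aware parser whether this agent may fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("ia_archiver", "http://example.com/page.html"))   # True
print(allowed("SomeOtherBot", "http://example.com/page.html"))  # False
```

A parser that only knows 'Disallow' has to guess what an unrecognized 'Allow' line means, and that guess is where things can go wrong.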
- Gordon @ IA
The Internet Archive's Web Archiving Blog has a post, "Confusion at The Register and Slashdot about the Wayback Machine", which addresses some of the concerns in this article and thread.
[Just a pointer; my posts here are me speaking as myself, and not for the Archive.]
- Gordon
Though anything in the SF Bay Guardian should be taken with a grain of salt, it should be noted that publication blames now-Speaker Nancy Pelosi (D-San Francisco) for the Presidio arrangement, not the Bush-41 Administration. Since the legislation was passed during the Clinton-42 administration, blaming it on either Bush is farfetched.
But the course taken wasn't unreasonable. The Presidio was already developed when it was a military base. Turning it into a traditional, naturalist national park would have required un-development, destroying pre-existing housing, buildings and roads. (Restoring it to its native grassy sand dunes would have required deforestation.) Mixing updates of the prior development with other expansion of public use and re-naturalization made sense -- and given the immense value of just the already-developed real estate, having the whole project pay its own way should have been a no-brainer.
Give a try to my web-based tool, Regex Powertoy. Its interface is all DHTML/CSS/Javascript, but requires a hidden Java (1.5) applet for the advanced and steppable regex engine.
Given that Java core, there are options for adding/removing usual Java literal escaping, which in Java code means lotsa backslashes. Not all Perl advanced features are supported.
I hadn't considered a pick for awk/sed/bash syntax limits/conversion but will consider it. Any handy reference to how their syntax differs from Perl/Java? (The thing that usu. bites me with sed is escaping of parentheses.)
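As a small illustration of the Java literal escaping mentioned above, here is a Python sketch (not the Powertoy's own code; the function names are made up). It handles only backslashes and double quotes, which is the part that produces all the extra backslashes.

```python
def to_java_literal(regex: str) -> str:
    """Wrap a raw regex as a Java string literal (backslashes and quotes escaped)."""
    escaped = regex.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def from_java_literal(literal: str) -> str:
    """Undo to_java_literal for the two simple escapes it produces."""
    body = literal[1:-1]  # drop the surrounding quotes
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(body[i + 1])  # keep the escaped character itself
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


raw = r'\d+\.\d*'
print(to_java_literal(raw))                      # "\\d+\\.\\d*"
print(from_java_literal(to_java_literal(raw)))   # \d+\.\d*
```

Escapes like `\n` or `\uXXXX` in a Java literal are deliberately not handled here.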
Great things about the Java 1.4+ regex support, from my perspective, include that (1) it's nearly as full-featured as Perl's regexes (and thus far better than Javascript's); and (2) it's usable in web browsers and via embedded applets.
Those were both key to helping me create Regex Powertoy, an interactive visual regex tester, much like others mentioned in this discussion -- but fully implemented in a browser. It's in JavaScript and DHTML, with a Java applet for the full-featured and step-controlled regex matching -- requires FF1.5+/IE6+ & Java 1.5+.
Check it out, break it (it's still got some rough edges under heavy input), let me know how it could be improved.
...as being part of the pro-net-neutrality coalition. See:
http://www.savetheinternet.com/=members
Yet a number of posts have claimed they are. Can anyone provide a reference about these orgs' official positions?
(Note: "EFF-Austin" is not the same as the San Francisco-based Electronic Frontier Foundation.)
FWIW, I'm with Bram on this issue.
Government regulatory involvement will only increase the chance of either censorship or value-destroying inflexible rules down the road. We got this far without government restrictions on the shape of bandwidth and quality-of-service commercial arrangements -- and there are as yet no actual (as opposed to theoretical-in-some-darkly-imagined-future) victims of non-neutral network practices. So the "we must regulate now to save the internet!" position looks like irrational histrionics to me.
Haven't they seen the runaway-female-assassin-robot-movie starring Gregory Hines, Eve of Destruction?
- Gordon @ IA
Agree wholeheartedly with those recommending against pure 50/50 split. Even close partners can have an acrimonious falling out; don't have a situation that risks deadlock. Beside 51/49, another possibility is 49/49/2 -- with the small share going to someone mutually respected and familiar with the business. In normal times of agreement, they're passively along for the ride, perhaps a useful advisor. If there's a risk of irreconcilable disagreements, they can mediate and if necessary cast the deciding influence. Plan for the best -- consensus and success -- but throw in a few protections against the kinds of things that do occasionally happen.
Credence adds an interesting automatic custom-to-each-user trust metric, which we don't yet have at Bitzi. (At Bitzi, you should read over other users' comments and histories to make a judgement as to whether you'd like to rely upon them.)
- Gordon @ Bitzi
Maybe they kidnap them from Japan.
See for example their history of doing the same to acquire knowledge about the outside world:
http://slate.msn.com/id/2087627/
No, there's no special encoding/padding of file size, beyond what happens for the last, often odd-sized leaf node.
Can an Indy contributor or user compare it to Gnomoradio or other prior work in this area?
But: TigerTree uses the 'tiger' 192-bit hash, specifically designed to be different from the MD4/MD5/SHA1 family, so MD5 results should not be generally applicable to tiger. (Perhaps, given the apparent 'momentum' of new discoveries, all extant hashes should be looked upon with extra wariness. But no recent results specifically impugn either tiger or TigerTree, as far as I know.)
Interesting consideration that larger files provide more 'hook points' against which isomorph ranges could be inserted. I hadn't seen that pointed out before, though I suppose it makes a certain intuitive sense...
To clarify: Kazaa's old hash, pre-2.6, hashed the first 300K with MD5, then small samples of the rest of the file with crc32. There were giant ranges of file that could be changed without affecting the early Kazaa hash at all.
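A quick Python sketch of why that matters. This is a simplified stand-in, not the real pre-2.6 Kazaa hash: it keeps only the "MD5 of the first 300K" part and leaves out the crc32 samples, whose exact positions I'm not reproducing here.

```python
import hashlib

PREFIX = 300 * 1024  # the "first 300K" from the description above


def prefix_md5(data: bytes) -> str:
    """MD5 of only the first 300K; a simplified stand-in, not Kazaa's real hash."""
    return hashlib.md5(data[:PREFIX]).hexdigest()


original = bytes(1024 * 1024)            # 1 MB of zero bytes
tampered = bytearray(original)
tampered[900 * 1024] = 0xFF              # change a byte well past the prefix
tampered = bytes(tampered)

print(prefix_md5(original) == prefix_md5(tampered))                      # True
print(hashlib.md5(original).digest() == hashlib.md5(tampered).digest())  # False
```

The prefix-only hash can't see the change at all, while a hash over the whole file does.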
Starting in 2.6, they kept that flawed hash as the first 20 bytes of 'kzhash', but then added a hash-tree based on MD5 of every 32K chunk. Much better, as long as you trust MD5 -- but anyone who's read Effugas's paper shouldn't trust MD5. (Despite the title "...harmful someday", the content of the paper suggests to me MD5 is harmful today.)
Public domain code to calculate both Kazaa hashes is available as part of the Bitcollider project. See for example for the original super-flawed hash:
http://cvs.sourceforge.net/viewcvs.py/bitcollider/bitcollider/lib/ftuuhash.c?rev=1.2&view=auto
Or for the 2.6-and-up improved (but still dependent on MD5) tree hash:
http://cvs.sourceforge.net/viewcvs.py/bitcollider/bitcollider/lib/kztree.c?rev=1.3&view=auto
Finally, a question: your last throwaway line seems to imply you think that TigerTree is in some way "more exploitable" than (the newer) kzhash.
But TigerTree uses a hash against which no MD5-like compromising results have been announced, and a similar tree calculation method with an added leaf/node discriminant suggested by professional cryptographers.
So if there's a published or unpublished weakness there you know about, can you please supply details?
...our new extinct pygmy underlords.
Even though EMule may now be the most popular client, the EDonkey network was started by US company MetaMachine, which began in San Francisco in 2000 but then moved in 2002 with its founder Jed McCaleb to New York.
Someone mentioned ED2K hashes being MD5; in fact last I checked they were a composite hash based on MD4 (!). Don't tell any of the bad guys that.
BitTorrent is good, DownhillBattle's idea of making BT easier for a larger audience is good, but their proposed technique has problems. The "Blog Torrent" site says....
"One good way to do this [avoid excluding a large portion of users] is to attach torrent files to an executable client."
Directing unsophisticated users to download custom EXEs from any random site offering big media they want would be a dangerous step backwards, encouraging a very unsafe practice that's likely to get their machines infected with various kinds of malware, sooner or later.
I'd suggest instead improving the installers of well-respected BT clients, and encouraging users to get them from well-known sites.
It loses a little in terms of instant gratification, but is instant gratification worth it if it also risks instant victimization?
Regarding the "catalog [of] every human creation in existence that can be expressed by a digital medium" -- there's already an open source, open data collaborative project to build that: Bitzi, "the free universal media catalog."
I am using Nautilus (2.4.0), and junk text files given the extensions 'mp3' and 'ogg' are identified as MP3 audio and Ogg audio, even though they clearly do not contain files of that format.
Hmm. My Gnome file manager seems to rely on file extensions, just like a lot of other programs, as a key clue to file type. It also describes anything that ends in '.ogg' as 'Ogg audio'.
A mimetype of 'application/ogg' extends the problem to another domain.
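By contrast, checking the content itself is not hard. A rough Python sketch, assuming only the well-known leading bytes ("OggS" for an Ogg page, "ID3" for an ID3v2 tag, or an MPEG audio frame sync); the function name is made up and this is not what Nautilus does.

```python
def sniff_audio(data: bytes) -> str:
    """Guess a type from leading bytes rather than from the file name."""
    if data.startswith(b"OggS"):
        return "ogg"
    if data.startswith(b"ID3"):
        return "mp3"  # ID3v2 tag at the start of the file
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"  # MPEG audio frame sync (11 set bits)
    return "unknown"


# A junk text file named 'song.ogg' is still just text.
print(sniff_audio(b"this is only junk text"))        # unknown
print(sniff_audio(b"OggS\x00\x02" + bytes(20)))      # ogg
```

The frame-sync test is loose and can misfire on arbitrary binary data, so a real tool would check more than two bytes.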