Slashdot Mirror


Inside the Internet Archives

blackbearnh writes "O'Reilly Media is running an interview with Gordon Mohr, Chief Technologist for the Internet Archive (archive.org). If you've ever wondered how pages are selected for archiving, or just how they manage such a huge quantity of data, the answers are here. The interview also touches on the problems of intellectual property in archives, archiving the Internet in a post Web 2.0 world, and the potential vulnerabilities exposed by archiving web sites that may include security exploits."

85 comments

  1. Oblig. Clark/Kubrick by Anonymous Coward · · Score: 3, Funny

    My God, it's full of ones and zeros!

    1. Re:Oblig. Clark/Kubrick by Anonymous Coward · · Score: 0

      I thought this was pretty clever... AC doesn't deserve an Offtopic for this. :)

    2. Re:Oblig. Clark/Kubrick by crapdot · · Score: 1

      At least it is not Bender's nightmare!

  2. question: by Anonymous Coward · · Score: 0, Funny

    do they archive SECOND POSTs?

  3. mutual exclusivity? by Itninja · · Score: 3, Funny

    The Interviewer: And I'm not sure I want to think about what posterity is going to think about a recording of my Twitter feed.

    If Twitter becomes so mainstream so as to be more than a 'remember when?' to posterity I will kill myself.

    --
    I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
    1. Re:mutual exclusivity? by IBBoard · · Score: 1

      Not only that, but surely it points out how stupid and pointless some of the stuff is that he must post to Twitter?

      Hopefully they've got a reasonable enough algorithm that it can pick the useful sites from the random blog crap.

    2. Re:mutual exclusivity? by GregNorc · · Score: 2, Funny

      I really like Penny Arcade's comic about Twitter.

      Twitter seems useless to me. Maybe if my friends used it I might, but for now an away message or Facebook status does the job just fine.

    3. Re:mutual exclusivity? by anti-pop-frustration · · Score: 1

      And in the unlikely event that Twitter does become more than that, thanks to the Internet Archive, we will be able to remind you of your previous commitments.

    4. Re:mutual exclusivity? by negRo_slim · · Score: 2, Funny

      >Maybe if my friends used it I might,
      That's what I always thought, but then I realize how few shits I'd give about what my friends would write... and it all becomes clear, all those sites are pointless!
      --
      On the Oregon Coast born and raised, On the beach is where I spent most of my days
    5. Re:mutual exclusivity? by Chyeld · · Score: 2, Funny

      If Twitter becomes so mainstream so as to be more than a 'remember when?' to posterity I will kill myself.
      --
      I am ten ninjas.
      Let's be realistic here, you are ten ninjas. You will be killing yourself regardless.
    6. Re:mutual exclusivity? by beckerist · · Score: 1

      Either some redtube or a klondike bar...you have to elaborate first.

    7. Re:mutual exclusivity? by Anonymous Coward · · Score: 0

      You are thinking of ten samurai. Ten ninjas kill you.

    8. Re:mutual exclusivity? by rhinokitty · · Score: 1

      Can we expand upon this thread? I have a vehement hatred of Twitter (never used it myself) but I can't quite put my finger on why. I usually like to know why I hate something, but when it comes to Twitter I am sputtering for words...vapid...inane...blonde.

    9. Re:mutual exclusivity? by Gulthek · · Score: 1

      Why waste time hating something you don't use? Especially something you don't have to use?

      Twitter is just a website that allows people to post snippets of text that other people can subscribe to. That's it.

      Do you also have an unreasonable hatred of some book genres you don't read? Movies you don't watch? Videogames you don't play? Sports you don't care about? If so, why?

    10. Re:mutual exclusivity? by ActionDesignStudios · · Score: 1

      When you chamber the bullet, can you let us all know via Twitter?

    11. Re:mutual exclusivity? by rhinokitty · · Score: 1

      Yes!! Hate, hate and pure hate!! ARGGHHH!!!

  4. but is it indexed by google? by AliasMarlowe · · Score: 4, Funny

    and does archive.org record google's cache?

    --
    Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    1. Re:but is it indexed by google? by ibwolf · · Score: 1

      They obey robots.txt so no.

  5. Damn it! by sm62704 · · Score: 1

    I'm going to have to RTFA! I keep wondering why my old Quake site the Springfield Fragfest (abandoned years ago) is in the Wayback Machine, while "Kneel" Harriot's Yello There has seemingly disappeared from the entire internet.

    The only page from Yello There I can find is one that was linked from my site (the aforementioned Quake site). That particular page was wonderfully recursive because of Yello's frame.

    --
    mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
  6. I wished archive.org stored even more stuff by jacquesm · · Score: 3, Insightful

    I keep running into bookmarks that have gone AWOL, then find that archive.org also doesn't have the pages anymore.

    Combining a bookmarking / caching service would be really handy.

    1. Re:I wished archive.org stored even more stuff by blhack · · Score: 5, Funny

      >Combining a bookmarking / caching service would be really handy.
      I heard that Lexmark makes one, it's called a "printer".
      --
      Newslily: Social News. No lolcats allowed.
    2. Re:I wished archive.org stored even more stuff by RareButSeriousSideEf · · Score: 3, Interesting
      Yeah, how exactly do pages go AWOL from archive.org? I've encountered that, plus pages suddenly acquiring META refresh tags (maybe through an external script or iframe?) that redirect to some domain squatter's site now. Extremely annoying. I'm going to have to mess around with wget to see what's in the markup, unless someone can suggest an easier way to get at such content.

      >Combining a bookmarking / caching service would be really handy.
      Furl fits that bill, doesn't it?
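
For the META refresh question in this comment, here is a minimal Python sketch of one way to spot such tags once the markup has been saved (with wget or anything else). It uses only the standard library's html.parser. The class name, function name, and sample markup are made up for illustration, and a redirect injected by an external script or iframe would not show up this way.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "refresh":
            self.refreshes.append(attr.get("content", ""))


def find_meta_refresh(html_text):
    """Return the refresh targets found in a saved page, in document order."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.refreshes


# Hypothetical saved markup, standing in for a page fetched from the archive.
sample = (
    '<html><head><META HTTP-EQUIV="refresh" '
    'CONTENT="0; url=http://squatter.example/"></head></html>'
)
print(find_meta_refresh(sample))
```

An empty list means no refresh tag in the static markup, which would point toward a script or frame as the source of the redirect.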

    3. Re:I wished archive.org stored even more stuff by jacquesm · · Score: 2, Insightful

      hehe, yes, so true, but then you can't access it electronically any more.

      I really think the bookmark + cache would be a nice thing to have without resorting to 'dead tree' format.

      But it's a good point, a printer would be an easy way to collect stuff that you really want / need to keep.

    4. Re:I wished archive.org stored even more stuff by jacquesm · · Score: 1

      Thank you!

      I'd never even heard of them.

    5. Re:I wished archive.org stored even more stuff by Alphax.au · · Score: 2, Informative

      >Combining a bookmarking / caching service would be really handy.
      WebMynd claims to do that; I haven't tried it myself though.
    6. Re:I wished archive.org stored even more stuff by Burz · · Score: 1

      >Combining a bookmarking / caching service would be really handy.
      Install the Scrapbook add-on for Firefox, which does exactly that. You can save URLs, pages, sets of linked pages, and/or selected areas simply by right-clicking and selecting 'Capture'.

      The pages are saved in a folder of your choosing, can be organized in a hierarchy and also searched and viewed using the add-on. It even has a feature to 'refresh' a saved page from its URL, or just send you to the page's original URL.

      Finally, it has a quick element editor that lets you remove page elements, add 'sticky' notes and keywords, and combine multiple pages into one.
    7. Re:I wished archive.org stored even more stuff by Anonymous Coward · · Score: 0

      >Combining a bookmarking / caching service would be really handy.
      >I heard that Lexmark makes one, it's called a "printer".
      File > Print > Print to PDF.
    8. Re:I wished archive.org stored even more stuff by reddburn · · Score: 1

      I'd love it if they would archive political sites as seen at various addresses, and make them available to the public - the RNC homepage (at least once upon a time) was completely different when logging in from different areas of the country - graphics, photos, everything custom to the geography from which one logged in.

      --
      "Those who believe in telekinetics, raise my hand" - Kurt Vonnegut, Jr.
    9. Re:I wished archive.org stored even more stuff by TTK+Ciar · · Score: 2, Informative

      New material is always being added to The Archive's web archive, and (afaik) unlike the collections archive it is never deliberately deleted. Most of what appears to be "pages going AWOL" is indexing errors. In order for newly archived stuff to become visible to the wayback machine interface, the entire web archive needs to be periodically re-indexed. Unfortunately the indexing process is error-prone, and stuff that might have been accessible before the index might disappear afterwards (and appear again after the next indexing).

      Other things that can make data temporarily unavailable are:

      * Downed servers. There are over a thousand on the web archive's end of the cluster, and if a server "only" crashes once a year, that's about three servers crashing per day on average, but it doesn't happen at a low constant rate. Traumatic events at the datacenter (like AC failures, power cycles, etc) tend to knock a bunch of hosts onto their asses at a time, and they don't always come back up when rebooted. The Archive runs at a deficit of system administration manpower, so it takes a long time (weeks, sometimes months) to get humpty-dumpty back together again.

      * robots.txt. In order to avoid getting their asses sued off, The Archive uses the live site's robots.txt to control which pages are publicly viewable. Every time you hit up a wayback machine URL, it downloads the real site's robots.txt and parses it to see if the owners have rendered the desired content unviewable. So if a site owner changes their robots.txt, archived content that was viewable yesterday might not be viewable today. When a website is abandoned, there isn't a robots.txt to download anymore, so at least entirely "lost" sites are viewable by everyone.

      * Genuinely lost data. When I worked at The Archive (a few months ago, now), most of the web archive was on "SOLO" nodes, meaning there was no on-site replication of the content. The data servers lack RAID-level redundancy as well, so if a SOLO loses a disk, and nobody copied the data off it first, and there isn't a copy tucked away on our sister sites (some in Amsterdam, some in Alexandria, Egypt), then the data is lost forever. To prevent this from happening, disks are tested hourly for a variety of symptoms (like nonzero sectors reallocated in SMART), and if a disk shows early signs of ill health, its contents get "shuffled" off onto other machines in the cluster, and the disk itself is replaced.

      But the system isn't perfect, not by a long shot, and lossage occurs. It's possible to do a lot better. Numerous people within the archive have tried to put better practices into place over the years, but for various reasons getting those practices into ... well, into practice has proven futile. Fortunately around the time I left, there was a push underway to get more of the web archive onto paired storage (so that all data was stored in duplicate on different physical machines). We can hope that moves forward.

      The last time anyone tried seriously measuring the rate of lossage, iirc it ran into the 10-20 Mbits/sec range. That's not even a slow drip against the ~1.1PB in the web archive, but loss is loss.

      Anyway, the *vast* majority of missing pages aren't really missing, they're just not indexed at the moment. The content itself is still tucked away in the cluster, and may resurface in the future.
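
The robots.txt check described above could look roughly like this. It is a sketch only, not the Archive's actual code: `is_viewable` and the sample URLs are hypothetical, the live robots.txt is passed in as text rather than downloaded, and Python's standard urllib.robotparser stands in for whatever parser the Wayback Machine really uses. The `ia_archiver` agent name is the one quoted elsewhere in this thread.

```python
from urllib.robotparser import RobotFileParser


def is_viewable(live_robots_txt, archived_url, agent="ia_archiver"):
    """Decide whether an archived page may be shown, given the site's CURRENT robots.txt.

    live_robots_txt is the text of the robots.txt fetched from the live site,
    or None when the site no longer serves one (for example, an abandoned domain).
    """
    if live_robots_txt is None:
        # No robots.txt to download anymore: nothing restricts viewing.
        return True
    parser = RobotFileParser()
    parser.parse(live_robots_txt.splitlines())
    return parser.can_fetch(agent, archived_url)


# Hypothetical live robots.txt that shuts the archive's agent out entirely.
blocking = "User-agent: ia_archiver\nDisallow: /\n"

print(is_viewable(blocking, "http://example.com/old/page.html"))
print(is_viewable(None, "http://example.com/old/page.html"))
```

Because the decision depends only on today's robots.txt, the same archived page can flip between viewable and hidden without anything in the archive changing.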

      -- TTK

    10. Re:I wished archive.org stored even more stuff by cffrost · · Score: 1

      >hehe, yes, so true, but then you can't access it electronically any more.
      I heard that Canon makes an automatic printer>scanner>shredder.
      --
      Thank you, Edward Snowden.

      "Arguments from authority are worthless." —Carl Sagan
  7. The real reason for archive.org... by VxSote · · Score: 1

    is clearly because someone wanted the world's largest porn collection!

    From TFA:
    "there's a lot of porn on the internet, so there's a lot of porn that gets collected when you're archiving the whole internet"

  8. 2008 is the year of Linux on the Archive! by flattop100 · · Score: 1
    Quick - all you naysayers, start jumping up and down!

    >JT: You mentioned that you use a lot of Open Source and in-house developed software. I assume the underlying operating system is something Open Source(y)?
    >GM: Yes, yes; what we've moved over the years from a Redhat version to a brief use of something that was a pure Debian to now using almost exclusively Ubuntu.

    Personally, I'd like to think Ubuntu is used because it's relatively easy to use, and Just Works(TM).
    1. Re:2008 is the year of Linux on the Archive! by TTK+Ciar · · Score: 2, Insightful

      The transition from Debian to Ubuntu was driven by developers' desire for more and newer features. We originally went with Debian-Stable because it was, well, stable, and did everything we needed the PetaBox to do at the time. But programmers whined and moaned that such-and-such package wasn't supported, or was too old, and claimed that this held back development of features which Brewster wanted to see made into reality.

      Brewster was never much for stability anyway, so the transition was made. It bit us several times, as Ubuntu is not as stable as Debian-Stable (which is to be expected when releases happen more often and newer software is deployed without extensive testing), but the developers were a lot happier with it. And, to be fair, while some of the problems have been substantial (like kernel bugs which interacted with the forcedeth device drivers to make servers freeze ~10% of the time when power cycled), afaik it has not contributed directly to data lossage (which is the bottom line at an archive).

      -- TTK

  9. Wayback by TheRealMindChild · · Score: 4, Informative

    While I love the Wayback Machine, a little "problem" crept in a couple of years ago that is still there... and it drives me nuts.

    At one point, I forgot to renew my domain name and a squatter snatched it up the second it was available. I have since lost the html/java applets/images/etc that I had originally there. I used to show people what it looked like via the wayback machine. But you can't do it anymore. Example: http://web.archive.org/web/*/http://www.mindchild.net

    Apparently, the current squatter put a robots.txt on that domain, and wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt. I emailed them about it, and after a couple of months, I actually got a reply pretty much saying "That is just the way it is. We are underfunded and have no time to fix it. Sorry".

    So if for some reason you don't want to have your site viewable via the wayback machine, just put up a robots.txt. It doesn't even need to contain anything.

    --

    "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    1. Re:Wayback by ibwolf · · Score: 2, Insightful

      This is an unfortunate side effect of their policies but it is very understandable that they would like to err on the side of caution.

      Should the robots.txt ever go away or change then your old stuff will become accessible again.

    2. Re:Wayback by iangoldby · · Score: 3, Insightful

      >wayback refuses to show any ARCHIVED pages where the domain CURRENTLY has a robots.txt.
      In true Raymond Chen style, think about what the world would be like if this wasn't true: If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively. That raises far more problems than the ability of a new owner to remove a previous owner's content.
    3. Re:Wayback by corsec67 · · Score: 2, Informative

      User-agent: ia_archiver
      Allow: /
      in the robots.txt that you mention (http://mindchild.net/robots.txt) is hardly not containing anything.

      But, it is interesting how they take the current robots.txt to apply to old content that used to be at that location...
      --
      If I have nothing to hide, don't search me
    4. Re:Wayback by Anonymous Coward · · Score: 0

      Perhaps you should explain your reasoning as to why there are more problems with a new owner being able to remove a previous owner's content than being unable to remove content that you made publicly accessible. I'm not saying there aren't problems, but I certainly don't see them as being as important as an archive site removing access to parts of its archives.

    5. Re:Wayback by SydShamino · · Score: 5, Insightful

      >If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively.
      I don't necessarily disagree with their policy, but this is the wrong argument for it.

      If you publish something, you lose the right to withdraw it from the public archives retrospectively. That's part of the "contract" (term used figuratively) with the public that establishes the foundation of copyright law.

      If you don't want it to appear on the Wayback Machine, you have an ability called robots.txt. That's already more than you have if you publish a book and want to keep it out of libraries. In neither case, though, do you have the right to demand or expect the content to be removed from the archive on your request.

      I see what the archive does to be a courtesy service, not something that the site owners should expect.
      --
      It doesn't hurt to be nice.
    6. Re:Wayback by RareButSeriousSideEf · · Score: 3, Insightful

      Ideally they could obey the robots.txt at the time of archiving, and simultaneously grab a snapshot of the whois record. In the future, new robots.txts would by default only take away previously archived content if the domain hadn't changed hands. This would keep squatters from killing the archive, and the original copyright owner could always actively request removal of content if s/he matched the old whois record (though this would take manpower at archive.org, which is a problem).
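
A rough sketch of the whois idea above, in Python. Everything here is hypothetical: the `Snapshot` record, the `registrant_at_capture` field (imagined as the owner name from a whois lookup stored at crawl time), and the `exclusion_applies` function are illustration only, and real whois data is messier than a single comparable string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    url: str
    # Hypothetical field: registrant recorded from whois when the page was crawled.
    registrant_at_capture: str


def exclusion_applies(snapshot, current_registrant, live_robots_excludes):
    """A new robots.txt hides an old capture only if the domain has not changed hands."""
    if not live_robots_excludes:
        # The live site does not ask for exclusion at all.
        return False
    return snapshot.registrant_at_capture == current_registrant


old_page = Snapshot("http://example.net/", registrant_at_capture="Original Owner")

# A squatter's robots.txt does not reach back to the previous owner's captures.
print(exclusion_applies(old_page, "Domain Squatter LLC", live_robots_excludes=True))
# The original owner can still pull their own material.
print(exclusion_applies(old_page, "Original Owner", live_robots_excludes=True))
```

The manual removal path for a former owner, which the comment notes would cost manpower, is left out of the sketch.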

    7. Re:Wayback by oodaloop · · Score: 2, Funny

      I'm sorry you feel that way. I, for one, welcome our robots.txt overlords.

      --
      Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
    8. Re:Wayback by iangoldby · · Score: 1

      Well, inclusion in the Wayback Machine is optional, albeit opt-out rather than opt-in. So presumably the argument is not over whether a site owner should have control over whether his content is archived in the Wayback Machine.

      But if you couldn't remove content retrospectively, then exclusion would be optional only for site owners who happen to know about the Wayback Machine and its robots.txt policy. That specifically is the thing that I would find unjustifiable.

    9. Re:Wayback by sp332 · · Score: 2

      If you put it on the internet, it is expected that you want people to see it. I usually prefer opt-in to opt-out, but this is a case where the content is ALREADY PUBLIC. In this case, any opt-out is being generous.

    10. Re:Wayback by turbidostato · · Score: 1

      "If it wasn't true, then a site owner would have no way to remove his content from the Wayback Machine retrospectively."

      Well, what the heck is the point for a Wayback Machine that refuses to way back, then?

    11. Re:Wayback by ljw1004 · · Score: 1

      Simple. Use the wayback-machine to see how the wayback-machine used to display your page before it instituted its robots.txt policy.

    12. Re:Wayback by SydShamino · · Score: 1

      It's already well established that robots.txt is required to avoid having your data indexed and archived. Perhaps mom and pop websites don't know this, but I would argue that mom and pop don't understand large portions of the copyright law, and that this is just one small part. I think at this point that any "copyright law for dummies" book, if not bought and paid for by the xxAA, would include this information in the section on web publishing.

      --
      It doesn't hurt to be nice.
    13. Re:Wayback by iangoldby · · Score: 1

      Forget legalistic arguments. The problem that I see with your position is that you are making ignorance the 'unforgivable sin'. ('Unforgivable' in the sense that once committed you can't ever correct it.)

    14. Re:Wayback by SydShamino · · Score: 1

      Again, once you publish something, you forever lose the "right" to keep it from entering the public domain, free for all use with no restrictions, at some point in the future. This is no different than your "right" to prevent a library from owning and lending a copy of your book, long after it's out of print. Neither exists, and neither should.

      That's a basic tenet of copyright. Without copyright, too many people would keep their works unpublished and hidden, preventing great art from ever becoming part of human heritage.

      Whatever archive.org does is a courtesy, nothing more. You can't take back what you publish.

      --
      It doesn't hurt to be nice.
    15. Re:Wayback by iangoldby · · Score: 1

      You might have a point, but it is nothing to do with what I just wrote.

  10. Downside of IP considerations... by CFBMoo1 · · Score: 2, Insightful

    I had a cheesy site back in college where I played around with HTML and learning the basics. I ended up making a few pages that poked fun at friends.

    I went to archive.org years later looking for them cause I remember back in the day they nabbed em and now they're all gone. The images and sounds I used were all gone.

    I wanted to recreate a page from that archive for nostalgia reasons with my old friends. Can't do it and I can't find the files anymore in my local archives.

    I was kinda disappointed but I guess it was expecting too much. I really wish there was a true and complete archive of the internet that didn't care what was there it just had it.

    --
    ~~ Behold the flying cow with a rail gun! ~~
  11. Er, what? by bigstrat2003 · · Score: 1
    They use Alexa???

    Ew.

    --
    "16MB (fuck off, MiB fascists)" - The Mighty Buzzard
  12. selection? funding? why not plain Debian? by bcrowell · · Score: 2, Insightful

    I was left with several questions that weren't addressed by the article.

    The Slashdot summary says the article explains how pages are selected for archiving, but I couldn't find anything in the article that explicitly explained that. It does say that the actual crawler is run by Alexa, which hands off the data to them, but it didn't say what the criteria were. Alexa computes various stats about web sites, so presumably they could apply some kind of minimum cut. Or do they try to index every single lame personal page, unless the owner opts out? That seems like it would require an unreasonable amount of disk space. The web also has a lot of stuff like, e.g., the kind of spam sites that try to scam Google's search/ad system; I wonder if the archive records those.

    The article didn't say a darn thing about funding. They have to run thousands of machines, so the electric bills must be formidable. Where the heck do they get their money? Is there a significant chance that their funding will dry up at some point in the future, and the whole archive will disappear?

    The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu. I love Ubuntu as a desktop distro, but it surprises me that they'd see any big advantage in using Ubuntu for their application.

    1. Re:selection? funding? why not plain Debian? by Loether · · Score: 1

      >Or do they try to index every single lame personal page, unless the owner opts out?

      I can state to an absolute certainty that they do index very lame sites.

      One of my first web pages ever. Very lame indeed!

      http://web.archive.org/web/20010404062818/http://loether.com/

      --
      TODO create witty sig.
    2. Re:selection? funding? why not plain Debian? by brotheralien · · Score: 1

      I wonder what the "3D Visuals and Animations" people at http://www.paralleldimension.co.uk/ think of their earlier incarnation's efforts: http://web.archive.org/web/20010203234000/www.paralleldimension.co.uk/home.htm I thought I'd lost this! (and 74m to beat..)

    3. Re:selection? funding? why not plain Debian? by Anonymous Coward · · Score: 0


      >Where the heck do they get their money?

      Lots of sources. Donations, grants, funds and the crawling they do for other organizations.

      >The article states that they moved from plain Debian to Ubuntu. That surprised me, and I was curious why they'd do that. E.g., if you're shopping for webhosts, it's much more common for them to offer plain Debian than Ubuntu.

      The IA aren't likely to be shopping for web hosts any time soon, given the number of servers they run themselves, and it probably makes sense to run the same OS on the servers that the developers run on their desktops, to simplify testing.
    4. Re:selection? funding? why not plain Debian? by gojomo · · Score: 2, Informative

      I can't comment in more detail about Alexa's bulk crawl strategy because it is only documented to the public (and us at the Internet Archive) in general terms: it is a broad survey crawl of the public web, weighted by Alexa's internal measures of site/page importance and legitimacy (which are at least partially based on the same toolbar data that drives their site rankings). While we expect to continue receiving the Alexa donations indefinitely, a growing proportion of the public archive is likely to come from other sources, including the IA's own crawling and other outside donors, in the future.

      The Archive is funded by a combination of private donations from individuals and foundations (sometimes for general operations and sometimes for specific projects), and fees for services provided to our partners, who are public libraries and archives themselves. With an 11+ year history, and long partnerships with customers and funding sources, we're pretty stable in the world of technology nonprofits.

      I wasn't directly involved in the Ubuntu choice, but it's been nice to have our developer desktops in close sync with cluster servers.

      - Gordon @ IA
  13. Remember Slashdot in it's Infancy? by dbarron · · Score: 5, Interesting

    Check this out....it reads like a free software update blog :)
    http://web.archive.org/web/19980113191222/http://slashdot.org/

    1. Re:Remember Slashdot in it's Infancy? by FirstTimeCaller · · Score: 1

      >Check this out....it reads like a free software update blog :)
      >http://web.archive.org/web/19980113191222/http://slashdot.org/
      WTF? Those stories are all dupes!
      --
      Wanted: witty unique signature. Must be willing to relocate.
    2. Re:Remember Slashdot in it's Infancy? by city · · Score: 1

      and don't forget the obligatory MS anti-trust update from around the world!

      --
      I am a v1ral sig. Plse c0py me and h3lp me spread. Thank y0u?
    3. Re:Remember Slashdot in it's Infancy? by raddan · · Score: 1

      Wow, that reminds me of when some days there would only be a small handful of stories on Slashdot. Now that I actually like my job, there's far more than I can read in one day.

    4. Re:Remember Slashdot in it's Infancy? by shish · · Score: 1

      Holy shit, has it really been 10 years since this?

      So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive?
      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    5. Re:Remember Slashdot in it's Infancy? by Anonymous Coward · · Score: 0

      Ah ah ah...

      From that page, the poll is interesting:
      "Netscape should GPL Mozilla (Read Editorial First!)?"

      Man... I feel old!

  14. "archiving the Internet in a post Web 2.0 world" by Anonymous+Freak · · Score: 1

    AAAARGH!1!

    I can't stand "post [xyz] world", "pre [xyz] mindset" or any such similar phrases. Go away, GO AWAY!!!!

    Really, the archive is tasked with 'saving' the internet every so often. I'm sure they'll figure out how to save AJAX stuff. And if not, then that stuff isn't really meant to be saved, now is it? (I mean, we don't need a save of Gmail, since it's account based.)

    --
    Another non-functioning site was "uncertainty.microsoft.com."
    The purpose of that site was not known.
  15. slashdot uid by chimpo13 · · Score: 1

    I tried to find a Hogan's Heroes page that I did in 1996 so I could find whatever email I used so that I could make an attempt at getting back my original Slashdot uid. No luck for me.

    I blame Internet Archive for this and ask twitter and his sock puppet army to start ranting about this horrible horrible travesty as well. The loss of my 5 digit uid is as bad as Gitmo and waterboarding combined!

    1. Re:slashdot uid by ConceptJunkie · · Score: 2, Funny

      HAW HAW!

      Sorry, I couldn't resist.

      --
      You are in a maze of twisty little passages, all alike.
  16. Squatters & robots.txt Re:Wayback by gojomo · · Score: 5, Informative

    Unfortunately, this "squatters-add-robots-restrictions" problem comes up a lot.

    We'd like to address it, and to do so there are two major issues to be tackled: (1) our current Wayback Machine software only excludes sites on a "for all time" basis; (2) short of mechanistically trusting the current domain owner, determining who has the right to exclude or restore material could be a very labor-intensive, error-prone, and liability-compounding process.

    The new open-source 'Wayback' software, which will go live for the Worldwide Wayback Machine later this year, enables time-range exclusions. (It's currently only used for many smaller collections we do for partners.) That should give us the capability to address (1). Addressing (2) will require further discussion about the proper and efficient policies -- but it's on our agenda once the technical capability for time-range exclusions is in place.

    Specifically regarding the mindchild.net site you mention, it looks like the issue is that our current retroactive-exclude robots.txt-parser doesn't understand the 'Allow' directive. (The mindchild.net/robots.txt tries to enable ia_archiver/WaybackMachine access via an 'Allow'.) That too will be fixed in the new 'Wayback' deploy (if not sooner).
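
To make the difference between "for all time" and time-range exclusions concrete, here is a small Python sketch. It is not taken from the Wayback software: the `Exclusion` record, the `is_hidden` function, the domain, and the dates are all invented for illustration. The only thing borrowed from the real system is the 14-digit YYYYMMDDhhmmss capture timestamp that appears in Wayback URLs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exclusion:
    domain: str
    start: str  # inclusive, 14-digit YYYYMMDDhhmmss
    end: str    # exclusive, same format


def is_hidden(domain, capture_ts, exclusions):
    """True if some exclusion for this domain covers the capture's timestamp.

    Fixed-width digit strings sort the same way as the times they encode,
    so plain string comparison is enough here.
    """
    return any(
        e.domain == domain and e.start <= capture_ts < e.end
        for e in exclusions
    )


# Hypothetical case: a new owner excludes the domain from 2006 onward only.
rules = [Exclusion("example.net", "20060101000000", "99991231235959")]

print(is_hidden("example.net", "20010404062818", rules))  # capture from before the range
print(is_hidden("example.net", "20070615120000", rules))  # capture inside the range
```

An "all time" exclusion is just the special case where the range spans every possible timestamp, which is why the harder problem left over is issue (2): deciding who gets to set the range.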

    - Gordon @ IA

  17. Searchable Archive. . . by Fantastic+Lad · · Score: 1
    Back in 2003, the Internet Archive guys set up a new project called, "Recall" which theoretically would allow somebody to do a Google-style search through the collected material irrespective of the data. 3D searches through the data stacks.


    This was very exciting! Seriously; you might remember the content of a page you were looking at five years ago, but can you remember its specific web address? --Especially with the turnover and abandoning of domain names, it is entirely possible to simply lose contact with mountains of data.

    So a basic search engine was a very exciting idea!

    Too bad they killed "Recall" after only a few weeks. I never got a chance to try it, (and boy, I would have made good use of it! There are still a few dozen items I'd love to find again.) I somehow didn't expect Recall to be discussed in the interview, and I was right about that. Too bad.

    Maybe Google should set up something similar; they don't trash old data, do they? I know they've got a setting which allows you to look for data up to a year old, but it's rather vague and it doesn't provide specific controls. How awesome would a non-linear search engine in an archive going back to the beginning of the web be?

    I wonder what the deal with "Recall" was, and why nobody talks about it.


    -FL

    1. Re:Searchable Archive. . . by gojomo · · Score: 3, Informative

      'Recall' wasn't exactly Google-like search. IIRC, in some respects it was better, with an advanced idea of related concepts, and with data on frequency of terms over time. In other respects, it was not what people would expect: there was no exact phrase matching, and certain terms that didn't become tracked concepts weren't findable at all, even though you could see the words in other indexed results.
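
      To give a feel for "frequency of terms over time", here is a toy sketch. It is not how Recall worked internally; the function name, the naive whitespace tokenizing, and the input shape (pairs of capture year and page text) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def term_frequency_by_year(captures):
    """Count how often each term appears per year.

    captures: iterable of (year, text) pairs.
    Returns a mapping of term -> Counter keyed by year.
    """
    freq = defaultdict(Counter)
    for year, text in captures:
        for term in text.lower().split():
            freq[term][year] += 1
    return freq


# Made-up sample input, purely illustrative.
sample = [
    (2001, "shark attack"),
    (2001, "shark news"),
    (2003, "shark"),
]
freq = term_frequency_by_year(sample)
```

      Looking up `freq["shark"]` then gives a per-year count that could be plotted as a trend line for that term.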

      Unfortunately, IA couldn't maintain the deployment when the developer, Anna Patterson, moved to Google. So, Recall turned out to be a short-lived experiment, grand in scale of pages indexed and novel features but not in traffic served.

      Patterson did big things at Google and now has another search startup, Cuill, that's likely to do more good things for the web.

      At the Internet Archive, we've also been using the open-source projects Nutch and Hadoop to offer search on smaller web collections for our partners. (A pair of such searchable partner collections for the US National Archives and Records Administration lives at webharvest.gov.) Someday we may be able to scale these up to the full 11+ year archive.

      - Gordon @ IA
    2. Re:Searchable Archive. . . by Fantastic+Lad · · Score: 1
      Thanks for the info! I hope I didn't seem too gripe-y; I appreciate that you guys are working at all on such a project as the Archive. Though I would indeed love to see it fully searchable one day! Good luck in your continued efforts.


      Cheers!


      -FL

  18. I meant 'Date' by Fantastic+Lad · · Score: 1
    Dumb typo. Perhaps not obviously, I meant in the first line of the above post, "irrespective of the date"


    Normally I let typos go; people are generally forgiving and will read around them knowing that they are just as susceptible to making errors, but in the case of those typos which don't just create a spelling mistake, but actually switch the meaning of an entire sentence, I will sometimes haul myself to the task of writing a short retraction. Just like this one.

    Cheers.


    -FL

    1. Re:I meant 'Date' by JCWDenton · · Score: 1

      You mean they recalled Recall? ...

      I too would like for Google to store old versions of sites a little longer and keep sites that have disappeared.
      Ah well, guess that isn't their purpose.

  19. I thought he worked for Intel by phozz+bare · · Score: 1

    I thought he worked for Intel and was the person behind Mohr's Law. Has he changed his interests recently?

  20. Why is so much of 2001 missing? by Anonymous Coward · · Score: 1, Insightful

    I think it is really weird that EVERY SINGLE news site on the Internet is mysteriously missing any captures from May 2001 to Sept 2001 (maybe one or two days in July are there).

    And then all of a sudden on Sept 11, ALL the news sites have multiple captures per day.

    I want to see what CNN, LA times, Washington Post, etc. had in the news on Sept 8th, 9th and 10th...

    1. Re:Why is so much of 2001 missing? by Anonymous Coward · · Score: 0

      I have seen this too!

      There is an image with a sampling of the gaps.

      It is really easy to double check the gaps asserted in the image. Just go to archive.org and put in the url, and see the gaps for yourself.
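
      If you have a list of capture dates for a URL, the gaps can also be found mechanically. A minimal sketch follows; the function name, the 30-day threshold, and the sample dates are assumptions for illustration and are not real capture data.

```python
from datetime import date


def find_gaps(capture_dates, min_gap_days=30):
    """Return (last_capture_before, first_capture_after) pairs for long gaps."""
    ordered = sorted(set(capture_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if (later - earlier).days >= min_gap_days:
            gaps.append((earlier, later))
    return gaps


# Hypothetical dates, only to show the shape of the input and output.
sample_dates = [
    date(2001, 5, 1),
    date(2001, 5, 2),
    date(2001, 9, 11),
    date(2001, 9, 12),
]
gaps = find_gaps(sample_dates)
```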

      I assume archive.org has them, just not online. Why not?

  21. NYT 9/11/2001 by mujadaddy · · Score: 1

    Shark Attacks in Florida. True Story. There was also Chandra Levy/Gary Condit stuff leading up to that.

    --
    Populus vult decipi, ergo decipiatur...
    "Force shits upon Reason's back." - Poor Richard's Almanac
  22. My Issues With Archive.org by evilviper · · Score: 1

    #1 Why aren't archived pages modified very slightly to insert a tag, so that archived images, sub-pages, and the like will be fetched from the archive, rather than linking to non-existent locations on the current server? Surely the current server operators don't like the dozens of hits from everyone who visits the archive...

    But more than that, it's a PITA to visit an archived page, and manually copy and paste every single link, one at a time. And I'm sure most people don't realize that they can even do that, and just give up on finding the content they want...

    #2 Why is the video encoding so incredibly horrible? Terrible ghosting on the low-res / low-bitrate versions, and a 30fps frame-rate like you've just stupidly deinterlaced the pulldown/telecine.

    At 320x240, and 256kbps, those movies should look great, not HORRIBLE. That, combined with the fact that you almost never provide a full resolution (around 720x480) low-bitrate version, no doubt forces a large number of people to download the HUGE (4+ GB) interlaced MPEG-2 copy of a film, rather than the (completely screwed up) 250MB copy...

    And don't complain about lack of skill, manpower, etc. I've e-mailed you, TWICE, personally volunteering to take care of everything, from getting the proper software compiled and installed (that was mentioned as a problem long ago) to writing an automatic conversion script to detect the encoding, and do the job without human intervention. I received no responses, either time.

    #3 I'm on Verizon DSL in Southern CA. Why does my (dynamic) IP address seem to be blocked the majority of the time I try to access the archives, while it's fine on the rest of the site? Are you just having capacity problems these days?

    --
    Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    1. Re:My Issues With Archive.org by belg4mit · · Score: 1

      Umm, they do this with JavaScript appended to the page.
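
      The idea behind the rewriting is simple enough to sketch. This is not the Archive's actual script (that one is JavaScript running in the page); it is a Python illustration of the same transformation, with a hypothetical function name and a placeholder example.com page, assuming the familiar web.archive.org/web/<timestamp>/<original URL> address form.

```python
from urllib.parse import urljoin

WAYBACK_PREFIX = "http://web.archive.org/web/"


def to_archive_url(link: str, page_url: str, timestamp: str) -> str:
    """Resolve link against the original page URL, then point it at the archive."""
    absolute = urljoin(page_url, link)
    return f"{WAYBACK_PREFIX}{timestamp}/{absolute}"


rewritten = to_archive_url(
    "img/logo.png",
    "http://example.com/a/index.html",
    "20010910000000",
)
```

      Relative links are resolved against the original page first, so images and sub-pages end up being requested from the archive rather than from the live server.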

      --
      Were that I say, pancakes?
    2. Re:My Issues With Archive.org by TTK+Ciar · · Score: 1

      #1: It's done with javascript, as someone else already said.

      #2: The Deriver system's video encoding was the attempt of a few people (usually one, sometimes two, occasionally three, often zero) to get something that would work at all with as much of the content as possible. If a change improved output for some items, while causing new problems for others, it tended to not be adopted. The Archive has always been bad about getting back to volunteers. It takes manpower to reply and incorporate third parties' efforts, and often we didn't have the manpower to even read the messages people were sending us.

      If you seriously want to help, though, physically show up at one of The Archive's Friday lunch meetings at 116 Sheridan Avenue in the Presidio of San Francisco (they're open to everyone), say your piece, then talk to Brewster and/or Tracey afterwards. If you can get your foot in the door, you may be allowed to contribute. It's insane that it isn't easier to give useful stuff to The Archive, but sanity was never its strong suit.

      #3: Some parts of The Archive's web interface scale better than others. The web server pool is load-balanced and self-correcting (via ipvs/KeepAlived), and the main datacenter has oodles of bandwidth, but every single goddamned hit on the site also hits the central database so that the content can be generated dynamically (sans some caching). Somehow Brewster got it into his head that this was what Web2.0 means, and that it had to be a good thing. The database is chronically overloaded, and those in charge of it more or less refuse to look at how other companies have solved this problem.
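
      For what it's worth, one common way to take read load off a central database is a short-lived cache in front of it. The sketch below is a generic illustration of that idea, not a description of anything The Archive runs; `TTLCache` and its `loader` callback are hypothetical names.

```python
import time


class TTLCache:
    """Tiny in-memory cache: reuse a loaded value until it is ttl_seconds old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic is testable
        self._store = {}             # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)          # stands in for the expensive database query
        self._store[key] = (now + self._ttl, value)
        return value
```

      With something like this, repeated hits on the same item within the time window are answered from memory and only the first one reaches the database.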

      So, yeah, expect the website to be flaky for the foreseeable future. It's nothing directed personally towards you. Just pigheadedness getting in the way of good engineering.

      -- TTK

    3. Re:My Issues With Archive.org by evilviper · · Score: 1

      #1 I hate Javascript...

      #2 I remain convinced the conversion system was simply set up by woefully unqualified individuals. I've done several such systems that are surely much more complicated. Traveling to the Bay Area in person would be prohibitive. Oh well.

      #3 Thanks for the explanation.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  23. Before you start this discussion, consider this. by Anonymous Coward · · Score: 0

    Web 2.0 will be the end of the free internet as we know it. It will be fully corporate controlled. You won't have access to anything like we have now. That being the case, it's pointless to discuss what the internet archives will be like post web 2.0. Even if it does stay around it will be as heavily censored as your post web 2.0 internet.

    Do some reading on the topic and do what you can to stop Web 2.0 now, while you can still express your (uncensored) opinions.

  24. Ironic by YourExperiment · · Score: 1

    From the site's current robots.txt: -

    User-agent: ia_archiver
    Allow: /

    Now that's irony. (Actually, is that irony? I'm always a bit worried I might get it wrong, since the whole Alanis Morissette thing.)