Slashdot Mirror


British Library To Archive One Billion UK Websites

An anonymous reader writes "The British Library is to begin archiving the entire UK web, including one billion pages from 4.8 million websites, blogs, forums and social media sites. The process will take five months, with the aim of presenting a more complete picture of news events for future generations to read and learn from."

22 of 89 comments

  1. archive.org? by denpun · · Score: 5, Interesting

    Why not work with the good folks at archive.org and their Internet wayback machine?

    Is it not a similar idea?

    The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want... but they could come to an agreement.

    1. Re:archive.org? by kaiidth · · Score: 5, Insightful

      Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others. Part of that is because funding doesn't always work that way. You can get money for claiming that you are going to do the very first über-awesome UK archive, but your chances of receiving the funding become rather lower if in the very first breath you point out that somebody else has been doing pretty much this for a decade. Another part of it is: most politicians would likely want the national heritage, such as it is (jubilee celebration tweets - please...) to be held by that nation's own national library.

      I would imagine the BL have referenced archive.org work extensively, but differentiate this project with what tits in suits like to call "a compelling USP." To put it in plain English, they'll have a neat explanation that suggests that they are totally aware of previous work in the domain whilst making sure that this project looks a) different, b) excitingly new and c) contextually, better.

    2. Re:archive.org? by ibwolf · · Score: 2

      I would imagine the BL have referenced archive.org work extensively

      They've actually worked closely with the Internet Archive for many many years. This includes commissioning IA to conduct crawls for them of government sites.

      Both the BL and IA are members of the International Internet Preservation Consortium (IIPC, see: http://netpreserve.org/). Both are very familiar with what the other is doing in this space.

      So why not let IA do all the work? There are several reasons. Part of it is that the BL is responsible for web archiving as far as British cultural heritage is concerned. Relying on a foreign entity to handle it is questionable as they would not be able to enforce any/all policies they might need on IA. You can certainly contract the IA to crawl for you, but it will be on their terms.

      However, there is also a question of redundancy. If multiple institutions, all over the world, are all engaged in web archiving, the ultimate result will be much better coverage and resilience. From my experience in dealing with the Internet Archive, this is something they support. Ever since I got involved in web archiving, 10 years ago, the Internet Archive has been a strong supporter of national libraries, archives and other interested parties doing their own web archiving.

      That is why the IIPC was formed. So we could share knowledge and pool resources where useful while each institution follows its own path in web archiving.

  2. Gotta love that management "thought" process by 93 Escort Wagon · · Score: 4, Funny

    We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.

    --
    #DeleteChrome
  3. Data Storage by Trpajzlix · · Score: 2

    How are they going to store the data? Isn't this whole library idea about storing things for future generations if there has been a war or other mass scale destruction? So when "future generations" uncover this Babylonian/British collection of knowledge hundreds of years later, they can still learn from the remains? What are they going to get from a 200-year-old hard drive, covered in dust?

    --
    A day will always be long, because 86400 won't fit into short.
    1. Re:Data Storage by 93 Escort Wagon · · Score: 4, Funny

      How are they going to store the data?

      They're planning to save disk space by just referencing the original page content inside of an iframe.

      --
      #DeleteChrome
    2. Re:Data Storage by Anonymous Coward · · Score: 4, Informative

      BL, and other memory institutions such as archives, apply a concept, called "Digital Preservation", to the stored data. This concept, based on the OAIS model, covers all stages of storage, administration, maintenance and retrieval of these "remains".

      The hardest part of web archiving is not storing the data but how to render it in 200 years. They also need to store the browser, but nowadays browsers use so many different "subrenderers", such as Flash, Java, Javascript and CSS engines and whatnot, to render a page that there is also a need to archive all those subrenderers as well.

      The best known strategy to date is to create and store emulator containers or VMs with the original software so they can be emulated in the far future.

      http://en.wikipedia.org/wiki/Open_Archival_Information_System

    3. Re:Data Storage by SternisheFan · · Score: 2

      How are they going to store the data?

      They'll use the "Cloud".

      ..., Oh, wait...

    4. Re:Data Storage by N Monkey · · Score: 3, Funny

      How are they going to store the data?

      They'll use the "Cloud".

      ..., Oh, wait...

      No problems. Plenty of those in the UK.

  4. Presumably by AliasMarlowe · · Score: 3, Insightful

    Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).

    --
    Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    1. Re:Presumably by Trpajzlix · · Score: 4, Informative

      Ehm, "everyone else". In Czech bilion = 10^12.
      The Brits use the same billion=10^9 as everyone else speaking English
      FTFY

      --
      A day will always be long, because 86400 won't fit into short.
    2. Re:Presumably by Alain Williams · · Score: 2, Insightful

      Because of the ambiguity I usually say either ''a thousand million'' or use the SI prefix Giga. So: it will be an archive of a Giga web page. Hmmm: doesn't quite trip off the tongue, unfortunately.

      Similarly with dates. What does 10/5/13 mean ? 10 May 2013 or 5 October 2013 ? I favour the first (to know why see how I spelled 'favour'), but recognising that it can be misunderstood (by those who spell differently), I would usually write dates as 10 May 2013 - no ambiguity.
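
The two readings of 10/5/13 can be shown side by side. This is only a small sketch using Python's standard `datetime` module; the variable names are made up for illustration.

```python
from datetime import datetime

ambiguous = "10/5/13"

# Day-first reading (the usual UK convention): 10 May 2013
day_first = datetime.strptime(ambiguous, "%d/%m/%y")

# Month-first reading (the usual US convention): 5 October 2013
month_first = datetime.strptime(ambiguous, "%m/%d/%y")

print(day_first.date().isoformat())    # 2013-05-10
print(month_first.date().isoformat())  # 2013-10-05
```

Same string, two dates five months apart, which is why spelling out the month (or writing year first) avoids the misunderstanding.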

    3. Re:Presumably by Joce640k · · Score: 2

      Spain uses 10^12

      --
      No sig today...
    4. Re:Presumably by Tastecicles · · Score: 3, Insightful

      I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.

      --
      Operation Guillotine is in effect.
    5. Re:Presumably by Carewolf · · Score: 2

      The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.

      No, a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now use the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.

  5. You can't just do it once... by icebike · · Score: 4, Interesting

    Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
    It's always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.

    On 9/11 I used Adobe's web site mining software that essentially captures every link on every page of a site and builds a large web replica in PDF form. All the links work within that PDF, and every page on the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boat load of copyrights).

    Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked, but even when you look for them by URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff on line, but the links often disappeared from their own web pages so that the only way to find these pages was by following links from some other site.

    The point is that a snapshot of the web does very little good unless it is part of a collection built up over time. Looking at the archives of a newspaper from June 6 1944 wouldn't give you much of an idea of the Normandy invasion, unless you had subsequent editions from days and months forward.
    But a web site isn't a newspaper with discrete editions, it is a constantly evolving thing. Archiving it today (or at any single point in time) is fairly useless, but archiving it daily is largely redundant (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
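
For what it's worth, comparing two captures is easy to sketch. The snippet below is an illustration only, not how Adobe's tool or any archive actually works: it assumes each capture is a plain dict mapping URL to page text, and the helper names and example.com URLs are hypothetical.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Hash the page text so two versions can be compared cheaply."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def compare_captures(first: dict, second: dict) -> dict:
    """Report which URLs vanished, appeared, or changed between two captures."""
    first_urls, second_urls = set(first), set(second)
    changed = {
        url
        for url in first_urls & second_urls
        if fingerprint(first[url]) != fingerprint(second[url])
    }
    return {
        "removed": first_urls - second_urls,
        "added": second_urls - first_urls,
        "changed": changed,
    }


# Made-up data standing in for two crawls taken two weeks apart.
capture_1 = {
    "http://example.com/": "front page, day one",
    "http://example.com/story-a": "early report",
    "http://example.com/story-b": "eyewitness account",
}
capture_2 = {
    "http://example.com/": "front page, two weeks on",
    "http://example.com/story-a": "early report",
    "http://example.com/story-c": "follow-up analysis",
}

diff = compare_captures(capture_1, capture_2)
print(sorted(diff["removed"]))  # ['http://example.com/story-b']
print(sorted(diff["added"]))    # ['http://example.com/story-c']
print(sorted(diff["changed"]))  # ['http://example.com/']
```

Keeping only the pages that show up as added or changed would cut the redundancy of a daily crawl, though you still have to fetch everything to find out which ones those are.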

    Why doesn't the Library simply work a deal with the Internet Archive's Wayback Machine? They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.

    It seems that libraries are about the only place that can get away with ignoring copyright these days.

    --
    Sig Battery depleted. Reverting to safe mode.
    1. Re:You can't just do it once... by El_Muerte_TDS · · Score: 2

      > (Yup, I violated a boat load of copyrights).

      So, you distributed the created PDFs? If you didn't, and it's still in your private collection, then how did you violate the right of creating copies?

    2. Re:You can't just do it once... by bumburumbi · · Score: 2

      The National Library of Iceland has had a similar program for a couple of years. The national TLD is collected three times a year and made available via the Wayback Machine. The English version of the project's page is rather terse, but according to the Icelandic version, selected pages are collected more frequently when warranted, e.g. political debates around election times. Icelandic law requires publishers to deposit copies of their work with the National Library. This includes web pages so the library doesn't have to worry about copyright.
      For a small country with few resources, co-operation with other small countries and archive.org is probably best. The task of collecting the British TLD is orders of magnitude bigger. It may well be cheaper for the British Library to pay for a system tailored to their needs rather than figure out how to make archive.org's software do what the library needs.

  6. Re:Come on morons... by SternisheFan · · Score: 2

    One of the comments from the CNN story was, "The UK web archive is actually using archive.org's software. The point is that archive.org has only got so much money, and only archives a percentage of the web. Having the BL support this is a good thing."

  7. Wow by databeam · · Score: 2

    That's going to be a lot of porn!

    --
    "Creationists make it sound as though a 'theory' is something you dreamt up after being drunk all night." -- Isaac Asimo
  8. Average Web Site by wisnoskij · · Score: 2

    So the average website contains about 1 thousand pages then? That seems like a lot...
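
Going by the figures in the summary (one billion pages, 4.8 million sites), the average works out nearer 200 than 1,000. A quick check, with variable names chosen just for this example:

```python
pages = 1_000_000_000  # "one billion pages", from the summary
sites = 4_800_000      # "4.8 million websites", from the summary

pages_per_site = pages / sites
print(round(pages_per_site))  # 208
```

That is a mean, of course; a handful of very large sites could account for most of the pages.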

    --
    Troll is not a replacement for I disagree.
  9. Assumptions and questions by Martin S. · · Score: 2

    There seem to be a few posts making incorrect assumptions and raising questions. I was involved as a technical architect on the long term preservation store aspect of this project a few years ago.

    archive.org: The BL is already cooperating with a number of other organisations doing the same thing, including archive.org, the Smithsonian, and the Scottish, French, Australian, Canadian and quite a few other National Libraries. archive.org has been an important technology spike for these but is not the whole solution.

    Preservation: The BL has a legal responsibility to preserve its archive, including this content, essentially forever, which is a significant technology challenge.

    Legal: archive.org is essentially opt in; the BL programme is a legal deposit requirement. The site content for any uk tld should be collected at least once a year. An important piece of the technology puzzle is to identify these and manage this process.
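
A rough sketch of the "identify and manage" step, purely for illustration: it assumes scope is decided by the hostname ending in .uk and that a site is due once its last capture is a year old. The function names and the `MAX_AGE` constant are hypothetical, and the real rules are surely more involved than this.

```python
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlsplit

# Assumption for this sketch: "at least once a year" means a 365-day limit.
MAX_AGE = timedelta(days=365)


def in_uk_tld(url: str) -> bool:
    """True when the URL's hostname falls under the uk top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".uk")


def due_for_crawl(url: str, last_crawled: Optional[date], today: date) -> bool:
    """A .uk site is due if it was never captured or the last capture is a year old."""
    if not in_uk_tld(url):
        return False
    return last_crawled is None or today - last_crawled >= MAX_AGE


today = date(2013, 4, 5)
print(due_for_crawl("http://www.bl.uk/", None, today))                # True
print(due_for_crawl("http://www.bl.uk/", date(2013, 1, 1), today))    # False
print(due_for_crawl("http://example.com/uk", None, today))            # False
```

Checking the parsed hostname rather than the raw string keeps a path like `/uk` on a non-UK host from being counted.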

    Scale: The last scaling I saw placed the BL archive about two orders of magnitude larger than archive.org and growing faster. The number of new websites in .uk grows faster than the awareness of archive.org. There are a lot of challenges:

    - Maintain structure and semantic context.

    - Searchable Meta Data

    - Searchable Content

    - Re-Presentation