British Library To Archive One Billion UK Websites
An anonymous reader writes "The British Library is to begin archiving the entire UK web, including one billion pages from 4.8 million websites, blogs, forums and social media sites. The process will take five months, with the aim of presenting a more complete picture of news events for future generations to read and learn from."
Why not work with the good folks at archive.org and their Wayback Machine?
Is it not a similar idea?
The Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want... but they could come to an agreement.
We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.
#DeleteChrome
How are they going to store the data?
They're planning to save disk space by just referencing the original page content inside of an iframe.
#DeleteChrome
Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
It's always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.
On 9/11 I used Adobe's web site mining software, which essentially captures every link on every page of a site and builds a large replica of the site in PDF form. All the links work within that PDF, and every page on the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boatload of copyrights.)
Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked: even when you look for them by a URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff online, but the links often disappeared from their own web pages, so the only way to find these pages was by following links from some other site.
The point is that a single snapshot of the web does very little good unless it is part of a collection of snapshots taken over time. Looking at the archives of a newspaper from June 6, 1944 wouldn't give you much of an idea of the Normandy invasion unless you had subsequent editions from the days and months that followed.
But a web site isn't a newspaper with discrete editions; it is a constantly evolving thing. Archiving it once, today or at any single point in time, is fairly useless, while archiving it daily is largely redundant (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
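For what it's worth, one common way to grab it all without storing the same story over and over is to hash each page and only keep a new copy when the hash changes. A minimal sketch in Python follows; the SnapshotStore class, its method names, and the example URL are made up for illustration, and this is not a description of how the British Library or the Internet Archive actually do it.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(body).hexdigest()


class SnapshotStore:
    """Hypothetical in-memory store that keeps a page only when it changed."""

    def __init__(self):
        self.versions = {}  # url -> list of (capture_date, digest)
        self.blobs = {}     # digest -> page bytes, stored once per distinct content

    def capture(self, url: str, capture_date: str, body: bytes) -> bool:
        """Record a crawl of url; return True if a new version was stored."""
        digest = content_digest(body)
        history = self.versions.setdefault(url, [])
        if history and history[-1][1] == digest:
            # Same bytes as the previous capture: nothing new to keep.
            return False
        self.blobs.setdefault(digest, body)
        history.append((capture_date, digest))
        return True


store = SnapshotStore()
url = "http://example.com/story"  # placeholder URL
store.capture(url, "2001-09-11", b"first version of the story")
store.capture(url, "2001-09-12", b"first version of the story")  # unchanged, skipped
store.capture(url, "2001-09-25", b"rewritten story")
print(len(store.versions[url]))
```

You still have to fetch every page on every crawl to find out whether it changed, so this saves storage, not bandwidth. A byte-for-byte hash will also treat a page as changed when only an ad or a timestamp differs.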
Why doesn't the Library simply work out a deal with the Internet Archive's Wayback Machine? They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.
It seems that libraries are about the only place that can get away with ignoring copyright these days.
Sig Battery depleted. Reverting to safe mode.
The BL and other memory institutions, such as archives, apply a concept called "Digital Preservation" to the stored data. This concept, based on the OAIS model, covers all stages of storage, administration, maintenance and retrieval of these "remains".
The hardest part of web archiving is not storing the data but working out how to render it in 200 years. They also need to store the browser, but nowadays browsers use so many different "subrenderers" (Flash, Java, JavaScript and CSS engines and whatnot) to render a page that all of those need to be archived as well.
The best-known strategy to date is to create and store emulator containers or VMs with the original software so that it can be run under emulation in the far future.
http://en.wikipedia.org/wiki/Open_Archival_Information_System
How are they going to store the data?
They'll use the "Cloud".
No problems. Plenty of those in the UK.