British Library To Archive One Billion UK Websites
An anonymous reader writes "The British Library is to begin archiving the entire UK web, including one billion pages from 4.8 million websites, blogs, forums and social media sites. The process will take five months, with the aim of presenting a more complete picture of news events for future generations to read and learn from."
I'm sure they'll want to look at low-def goatse 20 years from now.
Why not work with the good folks at archive.org and their Wayback Machine?
Is it not a similar idea?
The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want... but they could come to an agreement.
We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.
#DeleteChrome
How are they going to store the data? Isn't this whole library idea about storing things for future generations in case there has been a war or other mass-scale destruction? So when "future generations" uncover this Babylonian/British collection of knowledge hundreds of years later, they can still learn from the remains? What are they going to get from a 200-year-old hard drive, covered in dust?
A day will always be long, because 86400 won't fit into short.
Does your annoyingly snarky post have a point behind it, or are you just enjoying being a dick?
It's one billion pages, not one billion websites. Which would have been a lot of websites for a country of 63 million people.
Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
It's always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.
On 9/11 I used Adobe's web site mining software, which essentially captures every link on every page of a site and builds a large replica of the site in PDF form. All the links work within that PDF, and every page on the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boatload of copyrights.)
Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked: even when you look for them by a URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff online, but the links often disappeared from their own web pages, so that the only way to find these pages was by following links from some other site.
The point is that a snapshot of the web does very little good unless it is part of a collection over time. Looking at the archives of a newspaper from June 6, 1944 wouldn't give you much of an idea of the Normandy invasion unless you had subsequent editions from the days and months that followed.
But a web site isn't a newspaper with discrete editions. It is a constantly evolving thing, so archiving it today (or at any single point in time) is fairly useless, while archiving it daily is largely redundant (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
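To make the two-captures comparison concrete, here is a minimal sketch in Python. It assumes each capture has already been reduced to a mapping from URL to page body; the URLs and page text are made up, and `compare_snapshots` is a hypothetical helper, not anything Adobe's tool or the Wayback Machine provides.

```python
import hashlib


def digest(body: str) -> str:
    """Return a SHA-256 hex digest of a page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def compare_snapshots(first: dict, second: dict) -> dict:
    """Compare two captures, each a dict mapping URL -> page body.

    Reports URLs that vanished, URLs that are new, and URLs whose
    content differs between the two captures.
    """
    first_urls, second_urls = set(first), set(second)
    removed = sorted(first_urls - second_urls)
    added = sorted(second_urls - first_urls)
    changed = sorted(
        url
        for url in first_urls & second_urls
        if digest(first[url]) != digest(second[url])
    )
    return {"removed": removed, "added": added, "changed": changed}


# Hypothetical captures taken two weeks apart.
first = {
    "http://news.example/a": "story A, first version",
    "http://news.example/b": "story B",
    "http://news.example/c": "story C",
}
second = {
    "http://news.example/a": "story A, revised",
    "http://news.example/c": "story C",
    "http://news.example/d": "story D",
}

report = compare_snapshots(first, second)
for kind, urls in report.items():
    print(kind, urls)
```

Comparing by content hash rather than by date is the reason you end up having to grab everything: only the full page bodies tell you what changed.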
Why doesn't the Library simply work out a deal with the Internet Archive and its Wayback Machine? They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.
It seems that libraries are about the only place that can get away with ignoring copyright these days.
Sig Battery depleted. Reverting to safe mode.
They should definitely reduce the time allotted to that tea break.
Slashdot, fix the reply notifications... You won't get away with it...
It's about developing the architecture to take continuous snapshots of the web for intelligence purposes. Nothing more. Or else they would just fund the Internet Archive.
...typically British utter redundancy.
Operation Guillotine is in effect.
I believe that the National Library of Australia already does this, but there are issues around copyright for granting access to these archives. Thanks again America for the free trade agreement and all of your shitty copyright rules.
That's going to be a lot of porn!
"Creationists make it sound as though a 'theory' is something you dreamt up after being drunk all night." -- Isaac Asimo
So will they be getting legal permission to host all of this copyrighted material?
Don't all the individual websites own their own content? How does archive.org even get around this?
And what about the illegal porn, cracks, hacks, and viruses?
Troll is not a replacement for I disagree.
So the average website contains about 200 pages then? That seems like a lot...
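A quick check of the arithmetic, using only the two figures from the summary (one billion pages, 4.8 million sites). This is a plain average and says nothing about how pages are actually distributed across sites.

```python
# Figures quoted in the story summary.
pages = 1_000_000_000
sites = 4_800_000

# Mean number of pages per site.
pages_per_site = pages / sites
print(round(pages_per_site))
```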
The library or some government agency probably already has an archive of news programs, and the library already archives newspapers and magazines... and for everybody else, there's CCTV recordings.
archive.org: The BL is already cooperating with a number of other organisations to do the same thing, including archive.org, the Smithsonian, and the Scottish, French, Australian, Canadian and quite a few other national libraries. archive.org has been an important technology spike for these but is not the whole solution.
Preservation: The BL has a legal responsibility to preserve its archive, including this content, essentially forever, which is a significant technology challenge.
Legal: archive.org is essentially opt-in; the BL programme is a legal deposit requirement. The site content for any .uk TLD should be collected at least once a year. An important piece of the technology puzzle is to identify these sites and manage this process.
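As a rough illustration of the "identify and manage" step, here is a minimal Python sketch. It assumes the only scoping rule is a hostname ending in `.uk` and that "at least once a year" means 365 days; `in_uk_scope` and `is_due` are hypothetical helpers, not part of any BL or archive.org tooling, and the real scoping rules are certainly more involved than this.

```python
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlsplit


def in_uk_scope(url: str) -> bool:
    """Return True if the URL's hostname falls under the .uk TLD."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    # Drop a trailing dot from fully qualified names before matching.
    host = host.rstrip(".").lower()
    return host == "uk" or host.endswith(".uk")


def is_due(last_collected: Optional[date], today: date,
           max_age: timedelta = timedelta(days=365)) -> bool:
    """Return True if a site was never collected or its copy is too old."""
    return last_collected is None or today - last_collected >= max_age


# Hypothetical crawl log: URL -> date of last collection (None = never).
crawl_log = {
    "http://www.bl.uk/": date(2012, 4, 1),
    "http://example.co.uk/": None,
    "https://archive.org/web/": date(2012, 1, 1),
    "http://recent.example.org.uk/": date(2013, 1, 1),
}

today = date(2013, 4, 6)
to_collect = sorted(
    url for url, last in crawl_log.items()
    if in_uk_scope(url) and is_due(last, today)
)
print(to_collect)
```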
Scale: The last scaling I saw placed the BL archive about two orders of magnitude larger than archive.org and growing faster. The number of new websites in .uk grows faster than the awareness of archive.org.
There are a lot of challenges:
- Maintain structure and semantic context.
- Searchable Metadata
- Searchable Content
- Re-Presentation
The title says "One Billion UK Websites" but the first sentence of the post says "4.8 million websites." Clearly, the poster is being misleading or fraudulent. Oh timothy please be consistent with your own post.
Yet they plan to copy other people's endeavours without a thought.
five-thou is five one-thousandths of an inch.
mill would be metric.
If you want access to this then pay toward the taxes that will fund it.
Thanking you in advance,
A UK taxpayer.
The BnF (French National Library) started doing this in 2006 for a selection of .fr websites.
In 2011 they had 16.5*10^9 files.
They store content on "Petaboxes" made by the Internet Archive.
See http://www.bnf.fr/en/collections_and_services/book_press_media/a.internet_archives.html
The trolls here are just getting weirder.
I'm pretty late to this story, but let me clear up some misunderstandings for posterity's sake:
Disclosure: I've been involved in this effort for at least ten years, and I'm head of ICT for one of the UK Copyright Libraries (National Library of Wales). This story goes way back to the primary legislation passed by the UK in 2003, and we'd been working on the practicalities of this since before that legislation was passed.
* Yes, Internet Archive and others have been archiving web sites for many years. We're using their software for capturing.
* We've been collecting and archiving web sites by agreement with the web publishers for years via the UK Web archive project.
* What's different here is that secondary legislation was passed (in March) that gives the UK copyright libraries the mechanism (agreed with publishers) to extend legal deposit to digital publications, which includes websites.
* This gives the legal deposit libraries the right to add to the national legal deposit collections (the collection of all published material for the UK) digital publications, including ebooks, ejournals and websites.
* Until the 6th of April 2013, we did not have the right (under normal copyright law) to take a copy of websites without permission. Previously we had to request a written agreement from each website we archived to take a copy - obviously this does not scale very far.
* Under the new legislation, we will be taking periodic copies of the entire .uk domain and other websites in other domains which fall under the regulation (territoriality has been difficult to define, as you may imagine).
* The difference between us and the Internet Archive is intended to be that given the status as a national collection, the material that we collect is intended to be available in perpetuity. Our print collections go back centuries, and the intention is that the digital material we collect now will also be available in centuries to come. You can read about the distributed redundant storage here.
TL;DR : this is a legal thing, not a technical thing, and it's about a lot more than websites.