British Library To Archive One Billion UK Websites

Yes. For future generations. by Anonymous Coward · 2013-04-06 20:52 · Score: 0

I'm sure they'll want to look at low-def goatse 20 years from now.

archive.org? by denpun · 2013-04-06 20:58 · Score: 5, Interesting

Why not work with the good folks at archive.org and their Internet wayback machine?

Is it not a similar idea?

The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.

Re:archive.org? by denpun · 2013-04-06 21:03 · Score: 1

Was not able to access the article linked btw. (or parent site for that matter). /.ed already?
Re:archive.org? by Shimbo · 2013-04-06 21:15 · Score: 1

Report from BBC news: http://www.bbc.co.uk/news/entertainment-arts-22028738
Re:archive.org? by Anonymous Coward · 2013-04-06 21:33 · Score: 0

The Internet Wayback Machine folks could use the funding
The funding wouldn't help them unless it was more than the cost of actually undertaking the work. In which case the bit that actually helps them is more like a straight donation, so why not do the work themselves and make that donation too (assuming that they wanted to).
Re:archive.org? by kaiidth · 2013-04-06 21:57 · Score: 5, Insightful

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others. Part of that is because funding doesn't always work that way. You can get money for claiming that you are going to do the very first über-awesome UK archive, but your chances of receiving the funding become rather lower if in the very first breath you point out that somebody else has been doing pretty much this for a decade. Another part of it is: most politicians would likely want the national heritage, such as it is (jubilee celebration tweets - please...) to be held by that nation's own national library.
I would imagine the BL have referenced archive.org work extensively, but differentiate this project with what tits in suits like to call "a compelling USP." To put it in plain English, they'll have a neat explanation that suggests that they are totally aware of previous work in the domain whilst making sure that this project looks a) different, b) excitingly new and c) contextually, better.
Re:archive.org? by Anonymous Coward · 2013-04-06 23:23 · Score: 1

The British Library will probably use the same techniques as internet archive.

Some reasons:
* internet archive may go bankrupt and the material may be lost. Government libraries may have - in theory at least - more reliable funding to preserve the material.
* it is easier to do targeted crawling (of specific themes) using your own workers than 3rd party company
* there are some legal matters that may make it more "illegal" for the 3rd party to do the crawling than if the government organization does it (as specified in the law)
* some government organizations may not want to outsource their work to companies for many different reasons (lack of control etc)
* it may be cheaper to do your own generic crawls than pay IA to do it
Re:archive.org? by Anonymous Coward · 2013-04-06 23:26 · Score: 1

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.

Where 'others' also includes people who might wish to make use of the library, but are refused admission despite a research case. Whereas all UK undergraduates are automatically granted access.
Re:archive.org? by Anonymous Coward · 2013-04-07 02:29 · Score: 0

Why not work with the good folks at archive.org and their Internet wayback machine?
The actual reason is legal. The British Library is a specially designated deposit library, and so under Section 44A of the Copyright, Designs and Patents Act 1988 it is allowed to make an archival copy of anything from the internet without the copyright holder's permission. It's doubtful whether what archive.org is doing is legal under UK law, not that it cares because archive.org is based in the US.
Re:archive.org? by Anonymous Coward · 2013-04-07 02:38 · Score: 0

"The actual reason is legal."
Nah. That's just the BL's reason for involvement in the work. It doesn't stop the BL collaborating with archive.org (i.e. sharing tools, technology, developers, etc).
Re:archive.org? by 93+Escort+Wagon · 2013-04-07 05:53 · Score: 1

Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.
And you REALLY don't want to piss off their Rare Book Retrieval Unit!

--
#DeleteChrome
Re:archive.org? by ibwolf · 2013-04-07 06:05 · Score: 2

I would imagine the BL have referenced archive.org work extensively
They've actually worked closely with the Internet Archive for many many years. This includes commissioning IA to conduct crawls for them of government sites.
Both the BL and IA are members of the International Internet Preservation Consortium (IIPC, see: http://netpreserve.org/). Both are very familiar with what the other is doing in this space.
So why not let IA do all the work? There are several reasons. Part of it is that the BL is responsible for web archiving as far as British cultural heritage is concerned. Relying on a foreign entity to handle it is questionable as they would not be able to enforce any/all policies they might need on IA. You can certainly contract the IA to crawl for you, but it will be on their terms.
However, there is also a question of redundancy. If multiple institutions, all over the world, are all engaged in web archiving, the ultimate result will be much better coverage and resilience. From my experience in dealing with the Internet Archive, this is something they support. Ever since I got involved in web archiving, 10 years ago, the Internet Archive has been a strong supporter of national libraries, archives and other interested parties doing their own web archiving.
That is why the IIPC was formed. So we could share knowledge and pool resources where useful while each institution follows its own path in web archiving.
Re:archive.org? by kaiidth · 2013-04-07 07:05 · Score: 1

See, what you're saying is both sensible and unsurprising, but here's what bothers me: TFA doesn't acknowledge any of what you are saying. Instead, it suggests this is a novel activity, which seems ridiculous but happens for political reasons.
Re:archive.org? by Anonymous Coward · 2013-04-07 08:07 · Score: 0

That was my first thought. Isn't the wayback machine already doing this? I found the old original site I had made and run in 2000/2001.
Re:archive.org? by tehcyder · 2013-04-08 03:18 · Score: 1

Why not work with the good folks at archive.org and their Internet wayback machine?
Is it not a similar idea?
The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.
This is specifically for UK web sites, and the British Library is a British institution funded by the British taxpayer. Archive.org is US-based and a separate entity.

--
To have a right to do a thing is not at all the same as to be right in doing it

Gotta love that management "thought" process by 93+Escort+Wagon · 2013-04-06 21:03 · Score: 4, Funny

We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.

--
#DeleteChrome

Data Storage by Trpajzlix · 2013-04-06 21:03 · Score: 2

How are they going to store the data? Isn't this whole library idea about storing things for future generations if there has been a war or other mass scale destruction? So when "future generations" uncover this Babylonian/British collection of knowledge hundreds of years later, they can still learn from the remains? What are they going to get from a 200-year-old hard drive, covered in dust?

--
A day will always be long, because 86400 won't fit into short.

Re:Data Storage by 93+Escort+Wagon · 2013-04-06 21:08 · Score: 4, Funny

How are they going to store the data?
They're planning to save disk space by just referencing the original page content inside of an iframe.

--
#DeleteChrome
Re:Data Storage by Anonymous Coward · 2013-04-06 21:26 · Score: 4, Informative

BL, and other memory institutions such as archives, apply a concept, called "Digital Preservation", to the stored data. This concept, based on the OAIS model, covers all stages of storage, administration, maintenance and retrieval of these "remains".
The hardest part of web archiving is not storing the data but how to render it in 200 years. They also need to store the browser, but nowadays browsers use so many different "subrenderers" such as Flash, Java, Javascript and CSS engines and whatnot to render a page that there is also a need to archive all those subrenderers as well.
Best known strategy to date is to create and store emulator containers or VM's with the original software so they can be emulated in the far future.
http://en.wikipedia.org/wiki/Open_Archival_Information_System
Re:Data Storage by SternisheFan · 2013-04-06 22:03 · Score: 2

How are they going to store the data?
They'll use the "Cloud".
..., Oh, wait...
Re:Data Storage by Anonymous Coward · 2013-04-06 22:05 · Score: 0

Always impressed by the flowcharts per paragraph ratio in digital preservation. Also the number of uses of the term 'framework'. You're nothing and nobody in this field if you haven't got at least one 'framework' to your name. Digital preservation is very obviously an Augean stable. There is some excellent work in there (including the aforementioned playing with emulators), but the field doesn't half need mucking out.
Re:Data Storage by N+Monkey · 2013-04-06 22:47 · Score: 3, Funny

How are they going to store the data?
They'll use the "Cloud".
..., Oh, wait...
No problems. Plenty of those in the UK.
Re:Data Storage by Anonymous Coward · 2013-04-07 02:43 · Score: 0

On the inter-internet under .co.uk

Re: Great plan there smart guys! by Anonymous Coward · 2013-04-06 21:05 · Score: 0

Does your annoyingly snarky post have a point behind it, or are you just enjoying being a dick?

Inaccurate title! by Anonymous Coward · 2013-04-06 21:11 · Score: 0

It's one billion pages, not one billion websites. Which would have been a lot of websites for a country of 63 million people.
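For scale, a quick back-of-the-envelope check using only the two figures quoted above (one billion and 63 million). The variable names are just for illustration:

```python
# Figures taken from the comment above; nothing else is assumed.
BILLION = 10**9             # the short-scale billion used in the headline
UK_POPULATION = 63_000_000  # "a country of 63 million people"

per_person = BILLION / UK_POPULATION
print(f"{per_person:.1f} per person")
```

That works out to roughly sixteen per head, which is implausible for whole websites but not obviously so for individual pages.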

Re:Inaccurate title! by loufoque · 2013-04-06 21:16 · Score: 1

Hasn't each person created at least 10 websites in their lives?

Presumably by AliasMarlowe · 2013-04-06 21:19 · Score: 3, Insightful

Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire

Re:Presumably by Anonymous Coward · 2013-04-06 21:40 · Score: 1

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
Re:Presumably by Trpajzlix · 2013-04-06 22:28 · Score: 4, Informative

Ehm, "everyone else". In Czech bilion = 10^12.
The Brits use the same billion=10^9 as everyone else speaking english
FTFY

--
A day will always be long, because 86400 won't fit into short.
Re:Presumably by Anonymous Coward · 2013-04-06 22:36 · Score: 1

I confirm... at least in Portugal and France, 1 billion is 10^12, rather than just 10^9.
Re:Presumably by Alain+Williams · 2013-04-06 22:42 · Score: 2, Insightful

Because of the ambiguity I usually say either ''a thousand million'' or use the SI prefix Giga. So: it will be an archive of a Giga web page. Hmmm: doesn't quite trip off the tongue, unfortunately.
Similarly with dates. What does 10/5/13 mean ? 10 May 2013 or 5 October 2013 ? I favour the first (to know why see how I spelled 'favour'), but recognising that it can be misunderstood (by those who spell differently), I would usually write dates as 10 May 2013 - no ambiguity.
Re:Presumably by Joce640k · 2013-04-06 23:49 · Score: 2

Spain uses 10^12

--
No sig today...
Re:Presumably by Livius · 2013-04-07 01:12 · Score: 1

Not only that, people not speaking English use a word with a different pronunciation! And spelling! And grammatical rules!
Re:Presumably by Tastecicles · 2013-04-07 01:55 · Score: 3, Insightful

I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.

--
Operation Guillotine is in effect.
Re:Presumably by Carewolf · 2013-04-07 02:23 · Score: 2

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
Re:Presumably by Anonymous Coward · 2013-04-07 02:45 · Score: 0

I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.
YYYY/MM/DD doesn't work so well within a file name...
Re:Presumably by K.+S.+Kyosuke · 2013-04-07 03:15 · Score: 1

Just write 20130407-171547 like everybody else and be done with it.
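A minimal Python sketch of that year-first format, assuming the standard `datetime` module; the timestamp is simply the one from the example above:

```python
from datetime import datetime

# Timestamp chosen to match the example above.
stamp = datetime(2013, 4, 7, 17, 15, 47)

# Year-first, zero-padded fields with no slashes or colons:
# safe in file names, and plain text sorting is also chronological sorting.
name = stamp.strftime("%Y%m%d-%H%M%S")
print(name)

# Round trip back to a datetime to show the format is unambiguous.
assert datetime.strptime(name, "%Y%m%d-%H%M%S") == stamp
```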

--
Ezekiel 23:20
Re:Presumably by CanEHdian · 2013-04-07 04:06 · Score: 1

This is because the English didn't use the -illion and -illiard system, just kept -illion
Rest of Europe: million, milliard, billion, billiard, trillion, trilliard
England: million, billion, trillion
This is something that cannot be "fixed" other than adopting the SI system.

--
When the copyright term is "forever minus a day", live every day like it's the last.
Re:Presumably by AmiMoJo · 2013-04-07 05:11 · Score: 1

We had to send some drawings to some guys in the US a while back. Rush job just to check everything would fit. First they complained that we had post-dated it, then that our dimensions were impossibly small until it was pointed out that "mills" means "millimetres" and not 1/1000th of an inch (which surely should be 1/1200th, unless it was an attempt at metrification).

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Presumably by Anonymous Coward · 2013-04-07 05:36 · Score: 0

Works great on a Mac. Get into the 21st century, guys.
Re:Presumably by Anonymous Coward · 2013-04-07 07:14 · Score: 0

It's a little more complicated than that: British English did until recently use the "million, milliard, billion"; "million, billion, trillion" has been imported from the US within the last 20 years or so. In the early '90s the BBC were often very careful to say "thousand million", because "billion" was by then ambiguous. Since the late nineties we have "standardized" on 10^9=billion, much to the annoyance of the rest of Europe. Sorry about that chaps 8(
Re:Presumably by Tim+the+Gecko · 2013-04-07 11:45 · Score: 1
The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
I think you are a little out of date:
The Economist Pocket Style Book recommended 10^9 for "billion" back in 1986.
Re:Presumably by tehcyder · 2013-04-08 02:20 · Score: 1

Yes, it's not like the fucking summary says it's one billion web pages or anything, is it? Oh, wait...

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:Presumably by tehcyder · 2013-04-08 02:24 · Score: 1

This website is written in English. So, for example you would see 3.1415927 here rather than 3,1415927. It is silly to quibble about how conventions are different in other languages/cultures. I wouldn't go to a Russian language website and start moaning about how the alphabet is all fucked up.

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:Presumably by tehcyder · 2013-04-08 02:28 · Score: 1

The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
No one in Britain uses billion to mean 10^12 unless they are being deliberately anachronistic, and have no interest in communicating with other people. In the UK, you would say the world population was 7 billion, for instance.

--
To have a right to do a thing is not at all the same as to be right in doing it

You can't just do it once... by icebike · 2013-04-06 21:24 · Score: 4, Interesting

Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
It's always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.

On 9/11 I used Adobe's web site mining software that essentially captures every link on every page of a site and builds a large web replica in PDF form. All the links work within that PDF, and every page on the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boat load of copyrights).

Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked, but even when you look for them by URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff on line, but the links often disappeared from their own web pages so that the only way to find these pages was by following links from some other site.

The point is, that a snapshot of the web does very little good, unless it has some collection. Looking at the archives of a newspaper from June 6 1944, wouldn't give you much of an idea of the Normandy invasion, unless you had subsequent editions from days and months forward.
But a web site isn't a newspaper with discrete editions, it is a constantly evolving thing, and archiving it today (or any point in time) is fairly useless, but archiving it daily is largely redundant, (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
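One common way to square "grab it all" with "daily is largely redundant" is to fetch on every pass but only keep a page when its content has actually changed. A minimal sketch, assuming the pages have already been fetched as bytes; the `store_if_changed` helper, the in-memory dict and the example URL are hypothetical, not how any particular archive does it:

```python
import hashlib

def store_if_changed(archive, url, body, captured_at):
    """Keep a (timestamp, digest, body) snapshot only if body differs from the last one kept."""
    digest = hashlib.sha256(body).hexdigest()
    history = archive.setdefault(url, [])
    if history and history[-1][1] == digest:
        return False  # unchanged since the previous capture; nothing new stored
    history.append((captured_at, digest, body))
    return True

archive = {}
url = "http://example.com/story"  # placeholder URL
store_if_changed(archive, url, b"first draft", "2001-09-11")
store_if_changed(archive, url, b"first draft", "2001-09-12")  # duplicate, skipped
store_if_changed(archive, url, b"rewritten", "2001-09-25")
print(len(archive[url]))  # snapshots kept out of three captures
```

Note this only helps for pages you can still reach; it records nothing about pages that were removed or unlinked between passes.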

Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive. They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.

It seems that libraries are about the only place that can get away with ignoring copyright these days.

--
Sig Battery depleted. Reverting to safe mode.

Re:You can't just do it once... by El_Muerte_TDS · 2013-04-06 22:42 · Score: 2

> (Yup, I violated a boat load of copyrights).
So, you distributed the created PDFs? If you didn't, and it's still in your private collection, then how did you violate the right of creating copies?
Re:You can't just do it once... by bumburumbi · 2013-04-06 23:12 · Score: 2

The National Library of Iceland has had a similar program for a couple of years. The national TLD is collected three times a year and made available via the Wayback Machine. The English version of the project's page is rather terse, but according to the Icelandic version, selected pages are collected more frequently when warranted, e.g. political debates around election times. Icelandic law requires publishers to deposit copies of their work with the National Library. This includes web pages so the library doesn't have to worry about copyright.
For a small country with few resources, co-operation with other small countries and archive.org is probably best. The task of collecting the British TLD is orders of magnitude bigger. It may well be cheaper for the British Library to pay for a system tailored to their needs rather than figure out how to make archive.org's software do what the library needs.
Re:You can't just do it once... by Anonymous Coward · 2013-04-06 23:20 · Score: 0

Those websites are terrible, then.
The fact that you aren't linked by permalink and those links change is absolutely embarrassing for a website.
Even Facebook isn't that terrible, single comments have permalinks.
Link-breaking is bad enough as it is already during website restructures, but doing it on purpose through terrible design is rage-inducing.
Re:You can't just do it once... by Anonymous Coward · 2013-04-06 23:22 · Score: 0

I hope you did not pay for that tool, considering that you get the same functionality by using the program wget. You can get the Windows-version of the program (I assume your Adobe program is for Windows) here: http://gnuwin32.sourceforge.net/packages/wget.htm
Wget does however not generate any PDF-versions for you. It does allow you to browse the downloaded websites using your regular browser though, as if they were still on the Internet.
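For anyone wanting to try it, here is a hedged sketch of the kind of wget invocation that produces a locally browsable copy. It is written as a small Python helper so the options can be commented one by one; the `wget_mirror_command` name, the `mirror` directory and the example URL are made up for illustration, and the command is only built and printed, not run:

```python
import shlex

def wget_mirror_command(url, dest="mirror"):
    """Build (but do not run) a wget command line for an offline-browsable copy of url."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so they work from the local copy
        "--page-requisites",   # also fetch the images, CSS etc. each page needs
        "--no-parent",         # never climb above the starting directory
        f"--directory-prefix={dest}",  # where the copy is saved
        url,
    ]

command = wget_mirror_command("http://example.com/")
print(shlex.join(command))
```

Passing the list to `subprocess.run` would actually start the crawl, so be polite about which sites you point it at.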
Re:You can't just do it once... by Anonymous Coward · 2013-04-07 00:47 · Score: 0

It seems that libraries are about the only place that can get away with ignoring copyright these days.
Seeing as this programme is set up by the law that CREATES copyright, and is instituted under the aegis of the Copyright Act, I think you're missing the point if you consider this to be 'ignoring copyright'.
Re:You can't just do it once... by dkf · 2013-04-07 01:43 · Score: 1

Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive. They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.
I imagine that it will eventually happen, and that it will end up enriching the archive.org system when it does. Maybe it won't happen for a year or two, but when we're talking about long term preservation, that's not so important and the global nature of the internet makes it valuable (and logical) to globally coordinate the historical archives of it as well.

It seems that libraries are about the only place that can get away with ignoring copyright these days.
National libraries cannot ignore copyright, but they have a special position with regards to copyright law: they're explicitly empowered to retain copies for future generations whether the publishers like it or not (and whether or not they're Big Media). If you don't want it archived for future generations, don't publish it at all.

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:You can't just do it once... by Tastecicles · 2013-04-07 02:04 · Score: 1

I use Backstreet. OK it's £13 after the 30-day trial, but it's bloody handy to have a full relinking of crawled content so you can pretty much pull a website, import it into a VM, and do what you want to do there. Me? I PDF what I download using Acrobat X batch conversion then run the fulltext indexing engine. Considering it's all running on a VM it ain't half fast, even if it is currently holding an index of 6 million pages.
Oh yeah, and it runs on Linux via Wine. Not that I run it on Linux, I run it native in Win7 64-bit.

--
Operation Guillotine is in effect.

The process will take five months by hcs_$reboot · 2013-04-06 21:32 · Score: 1

They should definitely reduce the time allotted to that tea break..

--
Slashdot, fix the reply notifications... You won't get away with it...

Come on morons... by Anonymous Coward · 2013-04-06 21:43 · Score: 0

It's about developing the architecture to take continuous snapshots of the web for intelligence purposes. Nothing more. Or else they would just fund the internet archive.

Re:Come on morons... by SternisheFan · 2013-04-06 22:32 · Score: 2

One of the comments from the CNN story was, "The UK web archive is actually using archive.org's software. The point is that archive.org has only got so much money, and only archives a percentage of the web. Having the BL support this is a good thing."
Re:Come on morons... by SternisheFan · 2013-04-06 22:51 · Score: 1

I mean the BBC article, http://www.bbc.co.uk/news/entertainment-arts-22028738 ... I noticed how polite the postings by the British people are: out of the 88 comments, only 2 were moderated out.
Wandalust1956 5th April 2013 - 10:01
This is just, in essence, the 21st Century equivalent of the Mass Observation project that started in the 1930's and included the diary of a housewife from Cumbria during the 2nd World War...which was turned into a TV play by Victoria Wood. http://www.massobs.org.uk/index.htm As long as the content is "relevant" to current affairs then it could be a cultural insight to life in the 21st Century.

I'll see your Internet Archive and raise you... by Tastecicles · 2013-04-06 23:02 · Score: 1

...typically British utter redundancy.

--
Operation Guillotine is in effect.

Re:I'll see your Internet Archive and raise you... by Anonymous Coward · 2013-04-06 23:27 · Score: 0

Redundancy in this sort of thing (well, anything that matters to you really) is a good thing. Why rely on a single organisation that loses everything if it runs out of money? Not that the Internet Archive archives everything anyway.
Re:I'll see your Internet Archive and raise you... by Tastecicles · 2013-04-07 01:53 · Score: 1

my point is (and I apologise if I didn't make it obvious) that this isn't news. IA has archived the internet, and done a fairly decent job of it. The BL is off on a "Me Too!" campaign and the BBC are all over it like it's a first.

--
Operation Guillotine is in effect.
Re:I'll see your Internet Archive and raise you... by tehcyder · 2013-04-08 03:29 · Score: 1

...typically British utter redundancy.
Yeah, we're the sort of idiots who make more than one back up of important data. What's the point of that eh?
Hint: redundancy is sometimes a very, very good thing indeed.

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:I'll see your Internet Archive and raise you... by tehcyder · 2013-04-08 03:30 · Score: 1

There is still no harm in a national archiving organisation doing its job for its own country's data.

--
To have a right to do a thing is not at all the same as to be right in doing it

NLA by Anonymous Coward · 2013-04-06 23:14 · Score: 0

I believe that the National Library of Australia already does this, but there are issues around copyright for granting access to these archives. Thanks again America for the free trade agreement and all of your shitty copyright rules

Re:NLA by Anonymous Coward · 2013-04-07 02:51 · Score: 0

I believe that the National Library of Australia already does this, but there are issues around copyright for granting access to these archives. Thanks again America for the free trade agreement and all of your shitty copyright rules
You are welcome. Pray that we do not create more evil copyright rules for you to bow down to. (Bwah-hah-hah-ha) - Signed, Evil U.S. (tm)

Wow by databeam · 2013-04-07 00:32 · Score: 2

That's going to be a lot of porn!

--
"Creationists make it sound as though a 'theory' is something you dreamt up after being drunk all night." -- Isaac Asimov

Illegal Content by wisnoskij · 2013-04-07 01:32 · Score: 1

So will they be getting legal permission to host all of this copyrighted material?
Don't all the individual websites own their own content? How does archive.org even get around this?
And what about the illegal porn, cracks, hacks, and viruses?

--
Troll is not a replacement for I disagree.

Re:Illegal Content by PPH · 2013-04-07 02:36 · Score: 1

And what about the Elgin Marbles?

--
Have gnu, will travel.
Re:Illegal Content by Anonymous Coward · 2013-04-07 02:38 · Score: 0

They already have legal permission. It is a long existing legal requirement under UK copyright law that for ANY printed material published in the UK, a copy is provided free of charge to the British Library, and on request to five other "libraries of legal deposit" in Scotland, Wales, Oxford, Cambridge and Dublin (that's a library not even in the UK!).
As I understand it, this project has been waiting for suitable updates to the law to be in place before going ahead with its archiving. If you don't like it then you have the choice of not publishing publicly accessible material on the .uk domain. The law allows the libraries of deposit to make copies for archiving and preservation purposes, and to make these copies available for readers at the library; I don't think they're able to republish pages online as the Internet Wayback machine does. On the other hand archive.org does allow you to opt out via a robots.txt file.
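To illustrate the robots.txt opt-out, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content is made up, and `ia_archiver` is only assumed here as the crawler name to block; check the archive's own documentation for the user agent it actually honours:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler, allow everyone else.
# The "ia_archiver" name is an assumption made for this example.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://example.co.uk/page.html"  # placeholder URL
blocked = parser.can_fetch("ia_archiver", page)
others = parser.can_fetch("SomeOtherBot", page)
print(blocked, others)
```

A well-behaved crawler makes this check before fetching; nothing in robots.txt can force a crawler to comply.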
Re:Illegal Content by Anonymous Coward · 2013-04-07 03:04 · Score: 0

"The law allows the libraries of deposit to make copies for archiving and preservation purposes, and to make these copies available for readers at the library; I don't think they're able to republish pages online as the Internet Wayback machine does."
Hah. It's another one of the British Library's 'we'll give access to our mates and sod everybody else' things. Bugger open data, let's corner our market. Also, how useful are a billion tweets if you can only access them as a 'reader at the library'? Are researchers going to do social network analysis by hand?
Re:Illegal Content by Anonymous Coward · 2013-04-07 08:20 · Score: 0

Also, how useful are a billion tweets if you can only access them as a 'reader at the library'?
About as useless as they are under any other circumstance.
Re:Illegal Content by Anonymous Coward · 2013-04-07 08:25 · Score: 0

Hah! True that.

Average Web Site by wisnoskij · 2013-04-07 01:33 · Score: 2

So the average website contains about 1 thousand pages then? That seems like a lot...

--
Troll is not a replacement for I disagree.

Re:Average Web Site by tehcyder · 2013-04-08 03:33 · Score: 1

So the average website contains about 1 thousand pages then? That seems like a lot...
No, it doesn't. Imagine how many pages something like the BBC website has on any particular day.

--
To have a right to do a thing is not at all the same as to be right in doing it
Re:Average Web Site by wisnoskij · 2013-04-08 05:28 · Score: 1

Yes, but you would be hard pressed in my opinion to find more than a few hundred regular websites that contain around or more than 1000 pages. Add in every medium or larger sized forum and it really seems like 1000 is a lot. I think the mode (type of average) website would have something like 10, with a bunch more at the 50 range, and still quite a bit at a few hundred. But I really do not see many websites that have over 1000.
I guess news sites that keep every article they ever published in the last 100 years up would balance the scales with tens of thousands if not more, but I still find it a large number.

--
Troll is not a replacement for I disagree.

wasted effort by Anonymous Coward · 2013-04-07 03:32 · Score: 0

the library or some government agency probably already has an archive of news programs, the library already archives news papers and magazines........ and for everybody else, there's cctv recordings.

Assumptions and questions by Martin+S. · 2013-04-07 04:30 · Score: 2

There seem to be a few posts making incorrect assumptions and raising questions. I was involved as a technical architect on the long term preservation store aspect of this project a few years ago.

archive.org: The BL is already cooperating with a number of other organisations doing the same thing, including archive.org, the Smithsonian, Scottish, French, Australian, Canadian and quite a few other National Libraries. archive.org has been an important technology spike for these but is not the whole solution.

Preservation: The BL has a legal responsibility to preserve its archive, including this content, essentially forever, which is a significant technology challenge.

Legal: archive.org is essentially opt-in; the BL programme is a legal deposit requirement. The site content for any .uk TLD should be collected at least once a year. An important piece of the technology puzzle is to identify these sites and manage this process.

Scale: The last scaling estimate I saw placed the BL archive about two orders of magnitude larger than archive.org and growing faster. The number of new websites in .uk grows faster than the awareness of archive.org. There are a lot of challenges:

- Maintain structure and semantic context.

- Searchable Metadata

- Searchable Content

- Re-Presentation

Re:Assumptions and questions by Anonymous Coward · 2013-04-07 06:02 · Score: 0

"BL has a legal responsibility to preserve its archive, including this content essentially forever; which is a significant technology challenge."
In truth, the BL has a legal responsibility to preserve its archive only as long as they can afford to stay open. Quite rightly, therefore, it plays the game, resulting in news articles like this one that are transparently based on press releases originating with pressandpolicy.bl.uk in which the egos of MPs and senior managers are massaged and complications like 'archive.org have been archiving lots of UK sites for ages' are glossed over. That's fine, or if it isn't fine it is understandable, but it can be confusing. The original press release (most likely this one) isn't actually that bad since at least it focuses on legal deposit, but like most press releases it fails to preempt, acknowledge or answer the obvious questions like how exactly does this differ from archive.org? and aren't you just reinventing the wheel?
As for the challenges, meh and pshaw. It's all good stuff but how much of that is unique to this project?

Alert: Misleading/Fraudulent Title by Anonymous Coward · 2013-04-07 05:21 · Score: 0

The title says "One Billion UK Websites" but the first sentence of the post says "4.8 million websites." Clearly, the poster is being misleading or fraudulent. Oh timothy please be consistent with your own post.

Yet they plan to copy other people's endeavours without a thought.

if it were inches, it would be "thou". by Anonymous Coward · 2013-04-07 08:29 · Score: 0

five-thou is five one-thousandths of an inch.

mill would be metric.

Please pay UK taxes, then. by Anonymous Coward · 2013-04-07 08:38 · Score: 0

If you want access to this then pay toward the taxes that will fund it.

Thanking you in advance,

A UK taxpayer.

Re:Please pay UK taxes, then. by Anonymous Coward · 2013-04-07 08:50 · Score: 0

I am a UK taxpayer you insensitive clod.

French National Library does it since 2006 by aikawa · 2013-04-07 17:14 · Score: 1

The BnF (French National Library) started doing this in 2006 for a selection of .fr websites.
In 2011 they had 16.5*10^9 files.
They store content on "Petaboxes" made by the Internet Archive.

See http://www.bnf.fr/en/collections_and_services/book_press_media/a.internet_archives.html

Re:This is FLAG in Florida by Anonymous Coward · 2013-04-07 18:57 · Score: 1

The trolls here are just getting weirder.

Clarification for posterity by illtud · 2013-04-11 13:40 · Score: 1

I'm pretty late to this story, but let me clear up some misunderstandings for posterity's sake:

Disclosure: I've been involved in this effort for at least ten years, I'm head of ICT for one of the UK Copyright Libraries (National Library of Wales), and this story goes way back to the Primary Legislation passed by the UK in 2003, and we've been working on the practicalities of this since before that legislation was passed.

* Yes, Internet Archive and others have been archiving web sites for many years. We're using their software for capturing.

* We've been collecting and archiving web sites by agreement with the web publishers for years via the UK Web archive project.

* What's different here is that the secondary legislation has been passed (in March) that has given the UK copyright libraries the mechanism (agreed with publishers) to extend legal deposit to digital publications, which includes websites.

* This gives the legal deposit libraries the right to add to the national legal deposit collections (the collection of all published material for the UK) digital publications, including ebooks, ejournals and websites.

* Until the 6th of April 2013, we did not have the right (under normal copyright law) to take a copy of websites without permission. Previously we had to request a written agreement from each website we archived to take a copy - obviously this does not scale very far.

* Under the new legislation, we will be taking periodic copies of the entire .uk domain and other websites in other domains which fall under the regulation (territoriality has been difficult to define, as you may imagine).

* The difference between us and the Internet Archive is intended to be that given the status as a national collection, the material that we collect is intended to be available in perpetuity. Our print collections go back centuries, and the intention is that the digital material we collect now will also be available in centuries to come. You can read about the distributed redundant storage here.

TL;DR : this is a legal thing, not a technical thing, and it's about a lot more than websites.
