Google To Digitize, Make Available British Library's Historical Holdings
pbahra writes with part of an excellent story at the WSJ: "The British Library today announced its first partnership with Google, under which Google will digitize 250,000 items from the library's vast collection of work produced between 1700-1870. The Library, the only British institution that automatically receives a copy of every book and periodical to go on sale in the United Kingdom and Ireland, joins around 40 libraries worldwide in allowing Google to digitize part of its collection and make it freely available and searchable online, at books.google.co.uk and the British Library website, www.bl.uk. ... As well as published books, the 1700-1870 collection will also contain pamphlets and periodicals from across Europe. This was a period of political and technological turmoil, covering much of the Industrial Revolution, the French Revolution, the introduction of UK income tax and the invention of the telegraph and railway. All of these topics are covered, as are the quirkier matters of the day, such as the account, from 1775, of a stuffed hippopotamus owned by the Prince of Orange."
What will Apple and Facebook do? They can't afford a British literature gap!
Controls the future..
No doubt there'll be plenty of "ZOMG GOOGLE IS TAKING OVER" comments but this is brilliant. There's so much archived information in Britain that is supposedly public but actually costs a fortune to research as you have to travel to wherever it's stored then pay an archivist to take you into the vault and find the papers etc.
What about the Prince of Orange and a stuffed hippopotamus?
Inquiring minds want to know.
What does one do with a stuffed hippo?
Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
This is not the only British library that gets all publications. The National Library of Wales (http://www.llgc.org.uk/) also gets all publications that are published in the UK (and there is likely one in Scotland as well).
metageek
From the article:
Google are approaching it correctly this time.
Sorry, I only have "the pile would reach to the moon and back x amount" or the number of double decker buses jumped by Eddie Kidd / Evel Knievel / my mate Dave.
Considering the items involved that require you to have a readers pass, yes of course it is difficult - they are one of a kind items, often needing to be handled in specific ways and treated with extreme respect, costing millions of pounds to restore, thousands of pounds to store and cannot be replaced. They are exactly the items that need a gate keeper to look after them.
Legal deposit covers printed material; digital publications (newspapers, scholarly journals, software including games) and online material are covered by a voluntary scheme.
No, no ... in terms of cricket pitches.
Or, in multiples of 'Playing fields of Eton'
ALL items in the British Library require a Reader's Pass to view, except for the limited stock that they retain for inter-library loan.
This is regardless of their provenance or rarity.
I worked at a company that did the same for the French National Library, about fifteen to eighteen years ago. To go through your questions:
We had a mix of temps and perms, mostly temp scanner operators and perm developers.
Professionals - yes, there were clauses in the contract about how much we paid if things were damaged.
Team size? Smaller than you might think - we had about ten at its peak. Around the clock - not quite, but there were definitely early and late shifts.
We used then-flash Bell & Howell scanners with expensive document feeders to avoid ripping the papers. We used Kofax image processing cards with a staggering 1 MB of VRAM (yes - feel the power...) and super-powerful PCs too (486DX2 66 MHz). We stored the resulting TIFFs on a vast network server (a Network 3 1 GB machine called Leviathan. Inconceivably it ran out of space so we bought a second called Behemoth). The actual process was to guillotine the books and feed them through the scanners; some books would then be restitched. In the case of rare books we'd photograph them instead (and then scan the photo - this predates digital cameras).
Yes, we then OCR'd them, and the contract stipulated that x pages in 100 then had to be proof-read.
Clearly the tech is now completely outclassed, but I'd be surprised if the contract and physical side have changed much. I'm not terribly surprised to hear the British Library have taken the best part of two decades to catch up; we were talking to them at the time and they were terribly, terribly slow to see the potential in this.
Cheers,
Ian
The 18th century saw the birth of both the Industrial Age and the Age of Enlightenment. This was a time of profound change on a global scale that easily rivals the impact of our own information age.
You may ask what is the point in studying history -- who cares about the impact of steam power, for example? Here's the thing: although technology improves over time, people basically remain the same. By understanding the dislocation of farmers to factories in 1750, you can gain insight into the dislocation of national workers to global workers today.
To get access to literally every single published work from this period is going to be amazing. Bravo UK and Google!
"We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
I wonder what they will look like... If someone hasn't thought of it before, someone should start drawing up plans for futuristic libraries where instead of checking out paper books you can check out books for your Kindle or some other device... On top of that, I think it would be cool for it to look like a traditional library, but with server racks instead of bookshelves. (This probably just seems cool to me because I'm a nerd. I have a lot of friends who are 'conservative' when it comes to paper books. A lot of the English majors I know treat technology like the anti-christ.)
Calling your bluff. What state are you in?
For that to happen for free you need to declare the contents of your game system Creative Commons BY-SA which is Attribution-ShareAlike, and avoids the weird tangles regarding ad revenue vs "non commercial".
Then you have to develop the Literacy Pyramid, which is what every single copyright-clueless entity always falls into, proving that they are about the lawyers instead of the writers. The Literacy Pyramid says that you need a base of some 100 Lurkers to get about 7 Enthusiasts. But the output of Enthusiasts may not be to the standards of the Creator or the Skilled Amateur! So then you need to let 100 Enthusiasts stomp around leaving muddy tracks everywhere to get your 7 Skilled Amateurs. So every time Eric Flint whines on the Baen Free Library that "it's too expensive to digitize old works therefore they will never be republished" he's full of ...jellyBaens because it's somehow magically worth paying the lawyers afterward to sue the Enthusiasts as they stomp around.
So are you ready to do a little carpet cleaning to get your game out there?
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Strange, mine just went to the BL. Perhaps it depends upon the examining institution.
Mod this up, interesting discussion.
I'd guess the answer to 1 and 2 is "it depends." There must be rarities for which a full-on expert is required with white gloves and a wand (and in their spare time they supplement their income as street magicians.)
The proofreading is at least partly through reCAPTCHA. "Currently, we are helping to digitize old editions of the New York Times and books from Google Books." http://www.google.com/recaptcha/learnmore
"... and more and more now there are all kinds of electronic goodies available" -- Pink Floyd 1972
> I worked at a company that did the same for the French National Library, about fifteen to eighteen years ago. To go through your questions: ...
> Actual process was to guillotine the books and feed them through the scanners, some books would then be restitched. In the case of rare books we'd photograph them instead (and then scan the photo - this predates digital cameras).
I thought that Google had tech that could scan the pages of an original book and automatically compensate for any curvature. IIRC** it did something like flash a test pattern onto the page to determine how to straighten the final image.
**but it was a while ago I read this so could easily be mistaken.
So my question is, since the original material is in the public domain (copyright expired), is Google's digitized copy in the public domain as well?
[Sir Garlon] is the marvellest knight that is now living, for he destroyeth many good knights, for he goeth invisible.
> The BL blows on about adding to "our shared heritage" but the truth is that they are notoriously fickle and arbitrary about issuing Reader's Passes to actually use their collection.
It's automatic if you are doing a postgraduate degree.
> I have had my application for a pass refused as my research justification was deemed "insufficiently scholarly", even after I had spent 10 minutes being interviewed by the secretary. The average man on the street who wanders in to their London campus will be in for a rude shock.
You don't accept the possibility that your research justification might have been insufficiently scholarly?
> Even if the staff judge you to be worthy enough to view their precious possessions you have to jump through hoops just to reserve the item.
You ask the person on the information desk to reserve it for you, or you log in to the electronic catalogue (on-site or on-line), look the item up, press the "reserve" button, and select the reading room to which you want it delivered. If you consider that to be jumping through hoops then it says a lot for the academic standard you are likely to achieve.
> Whenever I finally publish the fruits of my work I will happily flout the Legal Deposit Libraries Act and refuse to provide BL a copy.
And nothing of value was lost, I suspect.
Quidnam Latine loqui modo coepi?
The British Library has just handed the copyright on a load of uncopyrighted work to Google, and Google in return gets exclusive commercial rights to the work. This is awful. And for only £6 million, by their estimate, they could have done it themselves - considering the broad range of interested parties, donations could easily raise that amount. Their effort would be far better, too, if the standards of Google's old archives are anything to go by.
This is just another example of the British "public private partnership", where one guy does an under-the-table deal with another guy to do something seemingly simple and relatively inexpensive in an unnecessarily convoluted and costly manner, ending up with a product/service far worse than it could otherwise have been.
The guilty party is the British people for allowing the government to engage in an ongoing sale of the country.
Fuck off, Google. It was OK when all you wanted to do is control the future - the future's not that interesting, if the last three decades can be extrapolated - but now you want to control the past.
This has been pointed out and proven wrong a dozen times already in the comments. Only the British Library gets one automatically; the other libraries may request a free copy.
I've been involved with a similar project in the Netherlands. We found that commercial OCR engines had a high error rate on these old documents. We ended up having each document OCR'ed twice: once by software, once by having a sweatshop in India manually type up the document. The Indians had a lower error rate than the OCR software. By combining the two sources we could achieve an error rate low enough to comply with the project spec.
The project was unusual in that the documents were an index (of the minutes of parliament meetings); this meant it was full of words without context (incl. loads of names), and part of the information was in numbers, so we couldn't use a spelling checker to increase accuracy.
Using a spelling checker on century-old documents is iffy anyway, since you need one that has the then-current vocabulary instead of modern spelling.
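The post above doesn't say how the two transcriptions were combined, so the following is only a rough sketch of one way it could be done: align the software OCR output with the hand-typed version word by word, accept whatever both agree on, and queue the disagreements for a human to check. The function names and the sample index entry are made up for illustration; only Python's standard `difflib` is used.

```python
from difflib import SequenceMatcher


def compare_transcripts(ocr_text, typed_text):
    """Align two transcriptions word by word and mark where they differ.

    Returns a list of (status, ocr_words, typed_words) tuples, where status
    is "agree" when both sources match and "review" when they do not.
    """
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    # autojunk=False so frequent tokens (page numbers, "p.") are not ignored.
    matcher = SequenceMatcher(a=ocr_words, b=typed_words, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        status = "agree" if tag == "equal" else "review"
        segments.append((status, ocr_words[i1:i2], typed_words[j1:j2]))
    return segments


def review_queue(segments):
    """Keep only the spans where the two sources disagree."""
    return [(ocr, typed) for status, ocr, typed in segments if status == "review"]


if __name__ == "__main__":
    # Hypothetical index entry: a name plus numbers, no context for a spell checker.
    ocr = "Van der Berg 1847 p. 12"
    typed = "Van den Berg 1847 p. 12"
    for ocr_words, typed_words in review_queue(compare_transcripts(ocr, typed)):
        print("check:", ocr_words, "vs", typed_words)
```

This needs no dictionary, which matters for names and numbers, but it only catches errors the two sources don't share.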
Heh, reCAPTCHA isn't exactly foolproof. There are more spambots solving them per minute than humans. So if a human gets it right but two spambots already agreed on a wrong answer, guess what the system does...
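To make that failure mode concrete, here is a toy agreement vote. It is not reCAPTCHA's actual algorithm, which isn't described in this thread; the `consensus` function and the `min_votes` threshold are assumptions for illustration only.

```python
from collections import Counter


def consensus(answers, min_votes=2):
    """Return the most common answer if at least min_votes submissions agree.

    Returns None when no answer has enough support yet.
    """
    counts = Counter(a.strip().lower() for a in answers)
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    return word if votes >= min_votes else None


if __name__ == "__main__":
    # Two bots agree on a misreading; one human types the right word.
    submissions = ["hippopotarnus", "hippopotarnus", "hippopotamus"]
    print(consensus(submissions))
```

With a plain vote like this, the two matching wrong answers win and the lone correct one is discarded.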
Hi there,
Do you hold a BL Reader Pass? Actually they're also now available to undergraduates, but since I am 20 years out of Uni that's not much help to me either.
> You don't accept the possibility that your research justification might have been insufficiently scholarly?
"A history of astro-navigation" may not be Earth-shatteringly exciting, but who are the BL to judge its merit? I had a case for research work, I showed that pamphlets they held were not available elsewhere but my application was denied for no reason other than the secretary was grumpy that day. She could provide no objective explanation.
> And nothing of value was lost, I suspect.
Exactly the attitude expressed by the BL.
The Brotherton at Leeds University is also a copyright library.
"I thought that Google had tech that could scan the pages of an original book and automatically compensate for any curvature. IIRC** it did something like flash a test pattern onto the page to determine how to straighten the final image."
We did that too - the Kofax card and driver software could take care of deskewing and it did a reasonably good job. Again, this was a while ago so I imagine things have improved but it wasn't too bad.
Cheers,
Ian
> Hi there,
> Do you hold a BL Reader Pass?
Yes.
> Actually they're also now available to undergraduates, but since I am 20 years out of Uni that's not much help to me either
They're available to anybody who can make the case for one, irrespective of study level. It's just that doing postgrad studies is one of the objective criteria that automatically makes the case.
> "A history of astro-navigation" may not be Earth-shatteringly exciting, but who are the BL to judge its merit?
They are the people appointed with the task of making that judgement.
> I had a case for research work, I showed that pamphlets they held were not available elsewhere but my application was denied for no reason other than the secretary was grumpy that day. She could provide no objective explanation.
In other words, you failed to make the case and it's somebody else's fault. There is a set of objective criteria to decide whether somebody can get a card. If you fail those tests then you get a second chance with an interview and a subjective judgement. It's meaningless to complain that she could "provide no objective explanation". You'd already failed the objective tests.
> > And nothing of value was lost, I suspect.
> Exactly the attitude expressed by the BL.
So you are still failing to make your case.
Quidnam Latine loqui modo coepi?
Here is a tip: Don't do drugs before you post rants on slashdot.
If Google really cared they would fix Android Chrome to reflow text, instead of discriminating
That's strange. The Library of Congress gives readers passes out to most anybody who applies.
I genuinely have no idea what you're talking about?
> I will happily flout the Legal Deposit Libraries Act and refuse to provide BL a copy.
What with that and your user name, you're on two strikes. Just as well you're not in the US, or the next time you crossed the road without looking properly you'd be off to prison for thirty years.
To have a right to do a thing is not at all the same as to be right in doing it
No, you're wrong..
This has already been answered by several people who either knew or could be bothered (like me) to spend ten seconds on Google.
To have a right to do a thing is not at all the same as to be right in doing it
> Do they use automated machines, scanning beds, or wands?
No, they're transcribing everything by hand using quill pens and ink, then typesetting it on proper hot metal presses, then finally photographing each page with an 11 x 14 plate camera and emailing the images one page at a time to everyone who has a gmail account.
To have a right to do a thing is not at all the same as to be right in doing it
Oh okay I think I get it - the game will be free for all to use and share upon publication, that's in the blog, issue 1. That's not much help though, I've yet to meet the OCR program that can translate my scribbles.