Human-Powered Internet Archive Book Project
Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."
How is this different from something like Google's project? And how will the copyright holders feel? Also, this could be pretty useful...
Yay, I have a sig.
Will the scans be added to the Project Gutenberg collection?
Last time I moved, it took many VERY HEAVY boxes to move all my books. Maybe I'll scan them all...
:(
Although anything useful has to be illegal...
0xB315AA8D852DCD3F3DCA578FD2E0BF88
Since they are nonprofit, why don't they join forces with a quite similar project: Project Gutenberg
That Wall Street Journal article reads less like a report than it does a nuance-to-nuance account of the author's infatuation with Liz Ridolfo.
And what the hell is with that sketch? DAVID KESMODEL, MORE LIKE DAVIDEON -DYANMITE-!!!
Project Gutenberg frequently makes use of the page scans for source material. What PG does is to run the images through OCR, proofread and post-process it. It's more useful than a stack of page images, but considerably more work.
If you look at the current books on Distributed Proofreaders, you'll see that some of them credit the Million Books Project for the page scans.
Laws do not persuade just because they threaten. --Seneca
No. You failed to RTFA. They are only scanning books pre-1923 -- out of the copyright domain -- and those that they are specifically allowed to scan by the publishers. This has the backing of a lot of big corporations (Microsoft, HP, etc), and I don't think they'd like to be caught on the wrong side of copyright law, considering their position on the whole issue.
The books are old enough to be in the public domain now. No problem.
There are lies, damned lies, and statistics.
From TFA, they are only scanning works that are out of copyright and in the public domain, so this is not the same as what Google is doing.
Getting written works off of paper and stored electronically should be a priority--bits are much easier to store, preserve, and copy for future use.
In Stanislaw Lem's science fiction book "Memoirs Found in a Bathtub", all the paper in the world gets eaten by a virus and chaos ensues. Interesting read if you've missed it, has made me paranoid about how much the world still depends on paper.
Entropy just isn't what it used to be.
Why hello, Ms. Liz Ridolfo. I'm happy to see you are into computers (at least I'll tell myself that) and you like to put your pictures online.
Please email me at superdesperateteengeek@needtogetlaid.net
I have seen a video of how Google is adding books to its database: a huge machine which can automatically turn over a page and continue to scan very rapidly. I guess that these 100+ year-old books would need more delicate handling, but scanning all those pages manually is a pain in the ass.
The good:
Old books prior to copyright laws are being scanned.
The bad:
Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.
The ugly:
The digital camera contraption costs $30,000!! There's a few scanner manufacturers left in the world and none of them have exploited this niche. Shame on them.
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
$10 per hour is too much here. You can take $2 as the commission per labour hour, leaving $8 for the worker. Since the people are very hard working they'll work 20 hours a day, making it a huge 8*20 = $160 per day. In one month one can make 160*30 = $4800. In India this is a huge amount of money. Even a top class manager doesn't get this much salary per month.
...is not a NOVEL idea.
*Ducks*
Wow, that book scanner rig is just what I've been dreaming of for years. I've been thinking about mounting a couple of glass plates at a 90 degree angle, and then I could put the open book on the apex of the glass, then photograph it with a couple of cameras underneath. This rig is just exactly what I was thinking of, but upside down and even cleverer, with a footpedal to lift the glass up and down onto the book. A very nice piece of design work.
The obvious advantage of this rig is that you don't have to open the spine 180 degrees and smash the books flat onto a single glass plane. You don't have to open the book up more than 90 degrees, so it's gentle on the spine of fragile old books. And the glass wedge is always self-centering against the spine of the book. The only way this scheme could work better is if there was a way to turn the pages automatically. But these are old and presumably valuable works, so it's safer to let paid low-wage drones do the work than risk mechanical damage.
... hey guys, I think I heard that some people were doing this already. Maybe even another group too. I think I heard it on slashdot.
I am Spartacus
Will it automatically provide full text or scanned image files for works that have gone out of copyright? And do the restrictions against scanning, storage or reproduction also lapse when copyright lapses? This would be massive. Lots of publishers just reissue old work with new copyrights attached to it.
Personally I've read lots of old science fiction from copyright lapsed works, there is some in Gutenberg, and like it quite a bit, though I'd like to find more of them.
For example I'm looking for Perry Rhodan (anything past #128) in English, which is out of print and maybe in old book stores or garage sales though I'm not in the U.S. now.
Web searching is fine but the most important part is to be able to get the works digitized. Then make freely available what is not in copyright, and make it easy to purchase what is. I bet you'll see publishers rushing towards that when they start seeing dollars rolling in.
It seems a pity to use such a manual method. This... http://www.kirtas-tech.com/ is designed to scan books, especially old and fragile books, automatically. It handles the pages even more gently than a trained person. It's not cheap, but it does around 1,000 pages per hour, and the operator just loads books in and takes them out when they're done. I looked at the company a couple of years ago (I'm a VC) and get regular updates from them. A LOT of libraries are using them now.
http://online.wsj.com/public/article_print/SB1131
And of course, direct linkage to the picture of the girl.
Because that's the only reason 90% of you would click on the link anyways
The Girl Has Nice Shoes
As an aside, cigarettes + old books = bad
"This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and didn't reply. Then she walked outside for a cigarette break, pausing along the way to rub her neck.
Not to mention that girls + cigarette smell = not terribly attractive
(but that's just my opinion)
[Fuck Beta]
o0t!
What's the legality difference between this and say regular libraries? Don't regular libraries loan material freely? What changes when it becomes electronic, it just means that the people will be able to keep them for longer or as long as they want, no? IMHO, I like the idea of doing this. It'll make doing books for school much easier knowing that there's a backup copy of it floating around somewhere on the interweb.
I'd RTFA if the black text didn't overlap a black image. IE-only web designers should be shot.
Here's a list of book scanning equipment. I've seen the one from Kirtas in action, it's fun to watch.
How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.
> bullshit
I too want to be modded Insightful!
being smart is exhausting
Google is doing a great amount of that in addition to the headline-grabbing bullshit that everyone harps about so much. [From TFReality]
slashdotter's thought process..."reading slashdot -> Liz? -> that is a female name if I'm not mistaken. *clicks the link*"
Send the books to India or Eastern Europe to be scanned for a fraction of the price. I mean really. This seems to be a serious operation - why not maximize the use of available resources? Spending $10/hr on scanning is just dumb.
apparently you were more interested in getting first post than reading tfa.
employees making about $10 an hour to manually scan volumes -- some more than a century old
I think that if they hired younger people to scan the books, it might go a little faster.
Imagine a 100 year old at this job...
"...(mumble mumble) in my day we used priests to copy books (mumble mumble) oh dear, I tore another page, darn Parkinson (mumble mumble)"
Ah, arrogance and stupidity, all in the same package. How efficient of you. -- Londo Mollari
I can't help but think of midgets in a running wheel. Is that an improvement over a "hamster-powered" book project?
The grass is always greener on the other side of the light cone.
10$ per hour for the humans, tens of thousands for the scanners. Damn you machine-overlords!
...
On the other hand, the whole project is funded by Microsoft and Yahoo, which creates the usual good (open content!) / evil (paid for with the devil's money!) dilemma.
That's enough coffee for me, I suppose...
You must be new here ;)
Interesting - I don't understand your line of thinking - interested to hear more. Is the argument that automated page turning is *cheaper* so it's a pity that the project spends a lot on labour charges (manual scanning)? Or is the argument that the automated page turning is easier on the fragile old books? I'd appreciate if you could offer more details about the technology - the company's demo video shows a vacuum device lifting pages, but both examples are with modern books. Honest question: surely the advantage here is a low labour cost method of scanning huge numbers of pages (like the telephone directory example they show). But if you have fragile books, surely the advantage of a human is that they can see that individual pages might be particularly fragile, maybe even needing support or repair to scan, while the pre-set vacuum device will plough on regardless, it won't be able to make a decision on the quality of the pages. Does it have any sensing devices built in? My experience of older books (e.g. nineteenth century) is that in some cases the paper can be very brittle.
The (Jack) Vance Integral Edition was a volunteer effort to produce a limited edition 42 volume set of the complete works of Jack Vance, restored to as close to the author's original manuscripts as possible.
(The project is complete, and an amazing success.)
The team scanned and edited many of Jack's early works for which there was no good clean manuscript. They developed software tools that would compare scans from different editions to automatically find errors. It turns out that even the best human editor still missed "scanos" (typos produced by the scanning process) that the automated tools found.
Even so, in the final books there were a handful of errors that slipped through, despite extremely careful editing by hundreds of volunteers.
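For anyone curious how comparing two scanned editions can surface "scanos", here is a minimal sketch of the general idea. It is not the Vance Integral Edition team's actual tooling, which isn't described in this thread; the function name `find_disagreements` and the sample sentences are made up for illustration. It uses Python's standard `difflib` to align the two texts word by word and report the places where they disagree, which are the spots a human proofreader would want to check first.

```python
import difflib


def find_disagreements(text_a: str, text_b: str):
    """Return (words_from_a, words_from_b) pairs where two OCR'd editions differ.

    Hypothetical helper: a word-level alignment of two scans of the same passage.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False so common words are never treated as ignorable "junk".
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return disagreements


# Invented sample text: the first "edition" contains a classic rn -> m misread.
edition_1 = "The modem starship drifted past the dying sun"
edition_2 = "The modern starship drifted past the dying sun"
suspects = find_disagreements(edition_1, edition_2)
print(suspects)
```

A disagreement only says the two scans differ, not which one is right, so a tool like this narrows down where to look instead of replacing the proofreader.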
Yep.
Too bad the comments on Digg make this place look like a scholar's retreat.
Slashdot - where whining about luck is the new way to make the world you want.
You must enjoy using ridiculously worn-out catchphrases.
I do hope they're not duplicating efforts... and I wonder whether they even know about Project Gutenberg. http://www.promo.net/pg/
mark
At the University of Toronto the Internet Archive pays $12 Canadian / hour to the scanners, and $11-$12 American in the US. The exchange rates keep changing so judging Canadian pay by US translations is a bit confusing. With experience the Archive will adapt as well, but the Archive is interested in maintaining a reasonable wage while keeping the overall cost cheaper than most commercial offerings. The reason for that is to encourage the open nature that the Archive supports.
What would be the equivalent local rate for scanners in Europe?
-brewster
Digital Librarian
The book included several copies of handwritten letters by authors, that folded out from the pages and were difficult to photograph. "This book almost killed me," Ms. Ridolfo said to her boss, Gabe Juszel, who was preoccupied with a stack of books and didn't reply. Then she walked outside for a cigarette break, pausing along the way to rub her neck.
Wow... at CA $12.00/hour I can imagine her life must be quite difficult. I grit my teeth at how I could work an entire week just to see it go "poof" at a 15 minute doctor's consultation.
(1) The library paid for the copy you're borrowing. (Or somebody paid for it, in case the book was donated to the library.) Thus the author was paid for that copy. If you read a whole copyrighted book via a Content Display Site (CDS - Google Print, Amazon Search Inside, etc.) and never buy the book, the author wasn't paid. Copyright law is about creating new copies; you're not creating a new copy when you read in a store or from a library.
(2) Browsing in a bookstore is pretty inconvenient. You can't take the copy with you to look at any time you want. (Unless you buy it! That's sort of the point.) Bookstores know that few people really read entire books in the store -- else they'd go out of business. However, reading a book from a CDS doesn't have that limitation: You can take it with you, on your laptop, etc. This is particularly critical in light of digital paper, when the digital copy is the paper copy.
(3) Library and bookstore reading isn't anywhere near free: You have to move your physical body to the bookstore to read. For one thing, you can't likely do that at 3am. (And certainly not in your pajamas.) You can't do it from your bed, couch, or desk, without getting up. You have to spend time to move your body down there, which might be 10min-30min each way; 20-60min round trip, plus say 10min to find the book, a place to sit, etc; call it 30-70min. If you value your time at say, $10/hr, that's $5-12. Then there's the cost of transportation. If the library/bookstore is three miles away, 6mi. round trip, and gas costs $2.50/gal., and you get 20mi/gal., that's another $.75. The IRS figures driving a car costs $.405/mile in repairs, wearing it out, etc., so that's about another $2.43. So you're at something like $8-15 to go read a "free" book.
Really -- if it were that free, people would do a lot more of it.
Yet reading a free copy from a CDS doesn't have those limitations. It is much closer to $0, actually and truly free. THAT's the problem.
(4) You can't pass on a "free" copy you read in the store or from the library. You have to leave the book at the bookstore (or buy it); you have to return the book to the library. Reading a book in digital form that was stolen from a CDS, you could pass that copy on to others by email, via a web page, P2P software, etc.
So, bottom line, bookstore/library reading isn't really free. CDS copies are essentially free, and that's the problem. They're too convenient to read free.
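The back-of-the-envelope figure in point (3) can be reproduced with a short sketch. All of the inputs are the numbers quoted in that point ($10/hr for your time, a 6-mile round trip, $2.50/gal gas, 20 mi/gal, $.405/mile for wear); the function name `trip_cost` is just an illustrative label, and the result is only as good as those assumptions.

```python
def trip_cost(minutes, hourly_value=10.00, round_trip_miles=6.0,
              gas_price=2.50, miles_per_gallon=20.0, wear_per_mile=0.405):
    """Estimate the dollar cost of one trip to read a "free" book."""
    time_cost = minutes / 60 * hourly_value                      # value of your time
    fuel_cost = round_trip_miles / miles_per_gallon * gas_price  # gas burned
    wear_cost = round_trip_miles * wear_per_mile                 # per-mile wear figure
    return time_cost + fuel_cost + wear_cost


low = trip_cost(30)   # quickest case: 30 minutes door to door
high = trip_cost(70)  # slowest case: 70 minutes door to door
print(f"${low:.2f} to ${high:.2f}")
```

With those inputs the range comes out a little over $8 at the low end and a little under $15 at the high end, in line with the estimate given in point (3).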
This is one of the reasons we formed the COCOA Association ( http://www.copyrightaccess.com/ ), to make more copyrighted work available. (Note, COCOA does not inhibit indexing and searching and returning text snippet search results -- just what page images can be displayed.) If you support this, please sign our petition at http://www.petitiononline.com/cocoa/petition.html -- thanks!
Dr. Andrew Burt,
Chair, The COCOA Association
I remember being put in charge (still not sure how it happened, but it did) of my HS senior class's slideshow thing for the end of the year banquet, and everyone brought in 2 or 3 pictures to be scanned for it..... It wasn't fun........ But then again, for $10/hr, it couldn't have been any worse than most other crappy jobs....
In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
If you bought a copy of any classic book that is out of copyright, and it's a literal republication of the original (not a 'modern interpretation' or new translation or anything else) then you could, I believe, scan, OCR, and distribute the resulting text. The literary work -- be it Shakespeare's, Clemens', Dickens', etc. -- is no longer protected by copyright.
You could not, however, scan the book and distribute the images of the pages. Because although the original author's text is not under copyright protection, the book itself (layout, design, etc.) could be. Also, any changes they might have made to the text (new grammar, or diction) could be, which is why you'd have to be careful. I think it would only qualify as a new protected work if the changes represented "an original work of authorship" according to 17 U.S.C. 101, but depending on the publisher they might try to sue you into bankruptcy anyway.
As long as you didn't copy the layout or any of the additional materials (critical essays, introductions) that publishers put into re-prints of classic literature, I see no reason why it would be illegal to type in and share the $2.99 Penguin Classics edition of Tom Sawyer that you can get at any Borders.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
The focuses of OCA and PG are really quite different: PG is most interested in preserving the essential information of a book (ie, its text), while OCA's interest is in preserving the form of the book (ie, its fonts, page format, coloration, even down to the yellowing of the pages). That having been said, there's a lot each can do for the other (and has!).
The Archive has archived most of PG's material, because even though the Books department of The Archive is focussed mostly on preserving books, The Archive as a whole is interested in preserving just about any information it can, and the PG data is definitely of interest.
When The Archive's Scribe software processes the book images into its various formats (jpg, djvu, pdf, flippy, et al), it OCRs the book's text. This text then becomes part of generating some of the other formats. It will be really trivial for PG to obtain this text for any book it wants to incorporate into its dataset.
qv: intlepisode00jamearch. The interesting files here are intlepisode00jamearch.txt which is just the OCR'd text, and intlepisode00jamearch_djvu.xml which is the OCR'd text with layout information (which has been useful to me in developing software which auto-corrects some OCR errors -- where the text is on the page often offers valuable hints for choosing the right heuristic for guessing the right text).
A quick side note on the differences between Google's and OCA's efforts that I haven't seen talked about much -- Google's main advantages in their bookscanning efforts are their wealth and fame, while The Archive's main advantages are experience, familiarity, and scanning technology.
Traditional book-scanning technologies are expensive and slow (which makes doing a lot of books, fast, that much more expensive, because you have to hire more people to do more books in parallel), but Google has enough money to throw at the problem that this is less of an issue. Google's fame means they can bring powerful partners onboard with a smile and a handshake, including some of the most prestigious libraries in the nation.
The Archive has been involved in scanning books and making them available online for several years now (qv The Million Books Project). This experience has shaped the processes used in the acquisition and scanning of books, as well as the technology used in their storage, indexing, and presentation. Furthermore, libraries around the world have grown familiar with The Archive over the years. That, and The Archive's good track record, make it a powerful rallying point for partnerships and alliances, and have given it more experience in facilitating such relationships. Finally, partially due to the limits of existing book-scanning solutions, and partially due to The Archive's limited budget, it has facilitated the development of two independent low-cost, reliable, high-quality book-scanning systems: The Scribe (developed in-house at The Archive) and the Kirtas Robot (developed at Kirtas, a Canadian company).
Many of the books scanned for the Million Book Project using traditional scanning methods are really lousy, sometimes to the point of being unreadable. These new scanning systems dramatically improve the quality of the end product, while equally dramatically reducing the cost-per-page. This means that more scanning systems can be purchased for more libraries (avoiding the per-library capital outlay problem), and more books can be scanned more quickly within a given budget.
Obviously, Google and OCA can benefit from co-operation, as each has a lot to offer the other. I'd be surprised if Google didn't join the OCA, eventually, if for no other reason than to gain access to the books of the >100 OCA
so that's why the bible is rewritten a thousand times? i wanna see jesus and the apostles sue the televangelists.
If your neighbour's roof is flying past your window, you know it's cyclone season.
Regarding the article, they are only scanning books from before 1923 - for which the copyright has expired. Copyright USED to expire - the whole point of both copyrights and patents was that they grant the author a SHORT period of exclusivity to encourage creation.
But that's not true anymore. Currently copyrights have an expiration date, but the expiration date has consistently gotten farther away faster than it has gotten closer.
Essentially NOTHING expires now unless somebody didn't do their paperwork.
That is, our Congress - with heavy contributions of the media lobby and sometimes with laws actually written by them - extends copyright by more years than have gone by since the last time they did it.
This is RETROACTIVELY true, in clear opposition to the intent of copyright law. (Any works already created are CLEARLY not encouraged by the extension of their copyright)
This is one of the best examples of our need to take back our government and elect officials who care about citizens. Personally I believe intensive campaign finance reform is the only solution.
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
Are you familiar with the WorldWide TV Show and Event "The Secret"
:-)
It is based on old teachings and books that were banned for many reasons.
Somebody felt threatened...for some reason!!!
I say, "let me have the access to everything"
and I am a big girl and can make up my own mind!
Pat
Perfect Sales Career
Ones and zeros remind me: now that an Indian discovered zero, why not stop using that dirty zero as well, or the decimal system, or the digits (damn Arabs brought them to Europe, that makes it even dirtier, ain't it??). Won't make much of a difference, will it?.. or wait... binary.. well, you better screw PCs and take up reading Pol Pot's autobiography.. and make sure it ain't printed in India ink ;)
7-8-9-10-0