Yahoo Competes with Google in Book Scanning

Posted by ScuttleMonkey on Monday October 3, 2005 @09:04AM from the my-literary-collection-is-bigger-than-yours dept.

UltimaGuy writes "A consortium backed by Yahoo has launched an ambitious effort to digitize classic books and technical papers and make them freely available on the Web. The company is partnering with the newly formed Open Content Alliance, which aims to offer PDF documents of books to the public at no charge. Consumers will be able to search the contents of the Open Content Alliance's database and download the entire content of any work, such as a scanned copy of a book."

18 of 193 comments

Who cares! Yahoo is a dying engine by Anonymous Coward · 2005-10-03 09:07 · Score: 0, Interesting

Yahoo services are often slow, cluttered, and riddled with annoying ads.
What do these guys know... by dada21 · 2005-10-03 09:10 · Score: 5, Interesting

...that we don't?

It seems to me that they're throwing money at an unnecessary application. Does Yahoo know something that we don't? I'd venture that they're starting with PD books to shake the bugs out of their platform so the app works well in round 2.

Round 2 (current commercial books) won't happen without a massive change to copyright law or the support of the Authors Guild.

Hmm.
Why PDF? by matr0x_x · 2005-10-03 09:11 · Score: 0, Interesting

An open source solution would be better, would it not? Ten years down the road, when everything is in PDF format, who's to stop them from charging us to view material in their format?

Whew! by op12 · 2005-10-03 09:12 · Score: 4, Interesting

I almost panicked after seeing we had gone so long without a Google-related article.

The opt-in rather than opt-out strategy is really what Google probably should have done, but it'll be interesting to see who comes out as a winner, Yahoo or Google, in all of this.
What about China? by DAldredge · 2005-10-03 09:14 · Score: 3, Interesting

Will Yahoo provide the thugs that run China with sorted or unsorted lists of the books that China's Internet users view?
Re:Project Gutenberg by harmonica · 2005-10-03 09:22 · Score: 4, Interesting

More books are a good thing. Having a scanned PDF version includes graphics as well, which are missing from Gutenberg ebooks. So I see this as a very positive development.
Sad thing about Yahoo though by totallygeek · 2005-10-03 09:26 · Score: 2, Interesting

You will be reading the contents of Moby Dick on Yahoo and in the top right it will say, "content provided by Google."

University of Calif: Yahoo OK, Gutenberg banned by dananderson · 2005-10-03 09:41 · Score: 5, Interesting

I find it funny (in an ironic way only) that the University of California is allowing its public domain books to be scanned by Yahoo. At the same time, UC libraries prohibit scanning for Project Gutenberg or other true "open" content projects unless they receive $$$$ in royalties.
I hate to see a university turn away volunteer projects while at the same time welcoming commercial interests such as Yahoo. Money talks, and I'm sure UC is being paid a lot, but libraries are supposed to be public resources too, not exclusive profit centers :-(.
Bookripper on its way? by serutan · 2005-10-03 09:56 · Score: 4, Interesting

Google maintains its scanning represents "fair use" allowed under the law because it only allows Web surfers to view excerpts from copyrighted books.

Soon after Google Mail was introduced, somebody created a SourceForge project that lets you use Google Mail as a database. How long until somebody releases a "Bookripper" app that assembles a whole book from search extracts? As I understand it Google displays two pages at a time (or wait, that's Amazon, but I bet they're similar). All you would need to know is a quote from a book's first page as a seed, and you should be able to grab the whole book by doing a series of searches using text from the second page returned by each search. The trick would be to knit the pieces together and eliminate the overlapping text. Seems almost trivial. Another possibility would be to search for random words and look for overlaps between the results, assembling them like a linear jigsaw puzzle until there are no gaps.
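The "knit the pieces together and eliminate the overlapping text" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `overlap_length` and `stitch` are hypothetical, the excerpts are assumed to arrive already in reading order, and each one is assumed to share an exact run of characters with the end of the text assembled so far. It does not query any search service, and the sample strings are made up for the demonstration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 1) -> int:
    """Return the length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the maximal overlap.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def stitch(excerpts: list[str], min_overlap: int = 5) -> str:
    """Join ordered excerpts into one text, dropping the duplicated overlap."""
    if not excerpts:
        return ""
    text = excerpts[0]
    for excerpt in excerpts[1:]:
        size = overlap_length(text, excerpt, min_overlap)
        if size == 0:
            # Assumption: a gap means the excerpts cannot be joined reliably.
            raise ValueError("no sufficient overlap between consecutive excerpts")
        text += excerpt[size:]
    return text


if __name__ == "__main__":
    # Made-up sample excerpts; each repeats the tail of the previous one.
    samples = [
        "Call me Ishmael. Some years ago",
        "Some years ago, never mind how long",
        "how long precisely, having little money",
    ]
    print(stitch(samples))
```

The `min_overlap` threshold is there because very short overlaps (a single space or letter) would match by accident. The "linear jigsaw puzzle" variant with unordered excerpts is a harder problem and is not covered by this sketch.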
Re:Google/Yahoo by crashelite · 2005-10-03 10:11 · Score: 1, Interesting

Google isn't evil, so they can't become a monopoly like M$...

--
(yes i know i suck at spelling fell free to correct my grammar and/or spellin i dont care, im still not going to change
Re:Project Gutenberg by shellbeach · 2005-10-03 10:41 · Score: 3, Interesting

Project Gutenberg is great and all, but there's something to be said for some effort made at presentation. Sometimes italics are a good thing.

It's not a great solution, but emphasis _is_ preserved in the etexts, just like that. Or occasionally like THIS ... Pity there's no consistency, but for most texts it works well enough.

Also, the fact that they are plain text, with no markup, formatting, binary code, whatever in them means that they'll always be accessible to anyone, regardless of software or platform. And that's a good thing, too!
best format? by j1m+5n0w · 2005-10-03 10:55 · Score: 2, Interesting

Actually, I prefer plain text to PDF if I'm reading from a computer (assuming the book is not illustrated), since I have more control over fonts and colors (and I have read quite a few Gutenberg books that way). However, I think the best native format (despite its general user-unfriendliness) would be LaTeX, from which text, PDF, and HTML could be generated. On the other hand, I suppose it's much easier to generate text or PDF from scanned pages than LaTeX.
Re:University of Calif: Yahoo OK, Gutenberg banne by esme · 2005-10-03 12:19 · Score: 2, Interesting

At the same time, UC libraries prohibit scanning for Project Gutenberg or other true "open" content projects unless they receive $$$$ in royalties.

do you have a source for this? do you mean that a UC library tried to stop someone from checking out books and scanning them? or do you mean that they didn't allow the gutenberg folks to set up a scanning shop inside a library? there's a huge difference between those two.

i work at a UC library, and i've certainly never heard of any policies about project gutenberg. i'm not sure what kind of arrangements yahoo made, where the scanning is going to happen, etc. but i would imagine that yahoo agreed to (at least) cover the expense and hassle of any library facilities they're going to be using. project gutenberg might not have that kind of funding.

this is all assuming that this was involving public domain books, where the only leverage that UC libraries would have would be their facilities and lending policies. if you're talking about stuff that UC owns the copyright to, then that would be another kettle of fish. it would not surprise me to learn that a campus counsel or some such wouldn't let a library give away rights to content that UC held the rights to (like a library's special collections holdings).

-esme
Re:i have heard of these "printer" inventions, yes by B4RSK · 2005-10-03 15:36 · Score: 2, Interesting

I do see your points as well, and definitely there will be demand for commercially produced books for some time to come.

However, what I described does not require any folding, and binding takes all of about 10 seconds. I've done this more than a few times and it does work out well.

I have a Brother laser printer that cost about US$300. I bought this printer for other reasons, but it is a great book printer too. (Has a duplexer, supports both PCL6 and PS3, built-in standard 10/100 LAN port. Basically it will work on any OS that supports PS or PCL6.)

Anyway, it prints duplexed pages at about 16ppm and the toner is cheap. The Windows driver also lets me easily (one click) print two pages onto one side of a sheet. The result of all this is that I can print a 300 page book perfectly in under 5 minutes using only 75 sheets of A4 paper. I then apply two of those triangular binding clips (the ones with the fold in handles), and it's done!
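The sheet count above is easy to check: two pages per side and two sides per sheet means four book pages per sheet of paper. Here is a minimal sketch of that arithmetic in Python; the function name `sheets_needed` and its parameters are made up for illustration and have nothing to do with any printer driver.

```python
import math


def sheets_needed(pages: int, pages_per_side: int = 2, duplex: bool = True) -> int:
    """Number of paper sheets needed to print `pages` book pages."""
    sides_per_sheet = 2 if duplex else 1
    pages_per_sheet = pages_per_side * sides_per_sheet
    # Round up: a partly filled last sheet still uses a whole sheet.
    return math.ceil(pages / pages_per_sheet)


if __name__ == "__main__":
    print(sheets_needed(300))  # 300 pages / 4 pages per sheet = 75 sheets
```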

Total cost of around US$1 including the clips, and total time of about 5 minutes. It's not as pretty as a bound paperback but I'm willing to trade that off for the instant availability and the ability to reprint again any time if needed.

(The fact that I live in Japan definitely plays some role in my choice. English books here are very expensive and only available from major downtown bookstores -- and even then selection is pretty limited. Ordering from Amazon Japan (or US/UK) is possible, but the shipping increases the prices and takes time. A $1 five minute book is a dream!)

--
Some people are like slinkies--basically useless but they bring a smile to your face when pushed down the stairs.
More expensive books? by Grendel+Drago · 2005-10-03 17:05 · Score: 2, Interesting

Huh? Where are you from? I worked at a research library at a large state university, and I have no idea what you're talking about. True, libraries pay extortionate rates for journal subscriptions, but when they purchase monographs, they frequently get them off the used book market, just like you or I would. It costs them extra to get it bound in a durable fashion, and to enter it into their Byzantine catalog system, but I've never, ever heard of libraries having to pay extra for books simply because they were libraries.

Also, ongoing royalties? What country does that happen in? I've never heard of such a thing.

--
Laws do not persuade just because they threaten. --Seneca
Right you are! See TEI. by Grendel+Drago · 2005-10-03 17:18 · Score: 2, Interesting

Indeed. It's bothered me for some time now that it takes a good deal of doing to make a nice LaTeX edition of the book, so that it's nontrivial to go from the eBook to a really high-quality printed page.

Luckily, someone's decided to do something about it. See PGTEI, a very verbose and flexible method for marking up literary works. The full TEI spec is gargantuan, so PGTEI is actually a dialect of a subset called TEI Lite. It's an XML markup scheme which has output filters (it uses XSLT, it seems) for plain vanilla TXT (for longevity, and on general principle), HTML and PDF. (Probably some others as well.)

You can try it out yourself. Grab some examples, and run them through the online tools.

Post-processors are very set in their ways, but as I've recently joined their ranks, I hope to use PGTEI for my first post-production job. It certainly seems more elegant than generating and tweaking multiple formats by hand.

--
Laws do not persuade just because they threaten. --Seneca
Awesome, indeed! by Grendel+Drago · 2005-10-03 17:50 · Score: 2, Interesting

I remember seeing some of Dudeney's puzzles referred to before, but I couldn't remember where. Then the book popped up on my RSS feed (it was released within the last month, I think), and indeed, it was full of fun math puzzles. Man, that was nice.

But they don't just have HTML; see various examples of files released with filetype "TEI", including PDF (through LaTeX), TXT (in a variety of encodings, i.e. Latin-1, US-ASCII and UTF-8) and HTML.

--
Laws do not persuade just because they threaten. --Seneca
University of California locks away public domain by dananderson · 2005-10-03 19:22 · Score: 2, Interesting

The source is my personal experience with the UCSD, UCI, and UCLA libraries. I assume the other UCs have the same or similar policy against digitizing books. Gutenberg is not a corporation; it's private individuals (volunteers). It's usually one guy (or gal) with a scanner, OCR software, and a little bit of time to proofread.

it would not surprise me to learn that a campus counsel or some such wouldn't let a library give away rights to content that UC held the rights to (like a library's special collections holdings)

So in other words, the public domain is locked away. The PD consists of OLD books, which are largely in special collections.

Here are some policies I dug up. In practice it's worse than the written policy, though: they tell you to write a letter explaining your needs, and then they ignore you.
- UC Irvine: royalty fees required for PD books and other nonsense
- UC San Diego: royalty fees required for PD books and other nonsense
- UC San Diego: 20-page limit policy and no scanning policy
- Can't find UCLA's policies online, but they also prohibit scanning and wanted a $200/page royalty!