Slashdot Mirror


Google Book Scanning Efforts Not Open Enough?

An anonymous reader writes to mention the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based foundation is planning on announcing a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."

113 comments

  1. Good! by SatanicPuppy · · Score: 4, Insightful

    The more the merrier!

    Ideally we could set up a few hundred digital libraries that would all hold some percentage of the catalog, so that any 5 would be able to duplicate the entire catalog. That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.

    I'd definitely like to see some not-for-profits get involved.

    --
    ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
    1. Re:Good! by funfail · · Score: 3, Funny

      RAIL: Redundant Array of Inexpensive Libraries

      Preferably the technology should be RoR.

    2. Re:Good! by s20451 · · Score: 2, Interesting

      That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.

      How do you plan to read it once you find it?

      10 year disruption -- content formats have moved on; readers are scarce
      100 year disruption -- hard drives, DVDs decay to unreadability
      1000 year disruption -- even paper decays, unless specifically preserved
      >1000 year disruption -- even if it's chiseled into a stone tablet, the language might be extinct

      --
      Toronto-area transit rider? Rate your ride.
    3. Re:Good! by geekoid · · Score: 2, Insightful

      If there was a catastrophe, then the technology would not have 'moved on'.
      I can read data from ten years ago on my home computer with no problems.

      If we have a 100 year disruption, well then we are probably throwing rocks at one another and rebuilding civilization.

      --
      The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
    4. Re:Good! by Anonymous Coward · · Score: 2, Informative

      Yeah, that's why we don't have any idea what anybody might have said (or meant) more than a thousand years ago.

    5. Re:Good! by evilviper · · Score: 3, Insightful
      10 year disruption -- content formats have moved on; readers are scarce

      I've been using computers for well more than 10 years, and ASCII is still just as readable as ever.

      Mark-up languages like HTML, XML, or RTF may die off eventually (several hundred years at least), but you can always strip the markup (either with code, or mentally by ignoring it). Plus, with the formats being so simple, and book layout being so obvious, it should take 5 minutes to write a new parser for any of them.
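      As a rough illustration of stripping the markup with code, here is a minimal sketch that leans on Python's standard html.parser module rather than a hand-written parser. The names TextOnly and strip_markup are made up for this example, and it assumes reasonably well-formed HTML; it makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and silently discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html):
    """Return the readable text of an HTML fragment, with tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


sample = "<h1>Chapter 1</h1>\n<p>Call me <b>Ishmael</b>.</p>"
plain = strip_markup(sample)
```

      The same idea carries over to XML or RTF with a different tokenizer; the point is only that the text survives once the tags are thrown away.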

      100 year disruption -- hard drives, DVDs decay to unreadability

      Both of the above would be unreadable by the standard pick-up mechanism, but manually reading it, bit-by-bit with something like an electron microscope should be possible for many, many more years after that. Just as technology has made it possible to read previously erased text on paper, so too will it be easier, in the future, to read physically decaying digital media.

      >1000 year disruption -- even if it's chiseled into a stone tablet, the language might be extinct

      It takes many thousands of years for even uncommon languages to disappear. And if they were even remotely similar to our own, they can be deciphered without any advanced knowledge. So, I'd be worried about the long-term chances of a complex language like Chinese to be preserved, but anything with Latin roots, that uses a small alphabet should do fine.

      Besides that, you can ensure the language survives by having multiple language translations, side-by-side. If any one of them is understood in the distant future, they can use it to learn all the rest. See: The Rosetta Stone.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    6. Re:Good! by d34thm0nk3y · · Score: 1

      I am sure someone will have the bright idea to upgrade the format within the next thousand years.

      As to the article I completely agree. If public libraries were undertaking this project they would have a lot more fair use wiggle room.

    7. Re:Good! by Anonymous Coward · · Score: 0
      ...in the event of a catastrophe or some kind of weird global event...

      Beatlemania is coming back again?

    8. Re:Good! by Korin43 · · Score: 2, Informative

      Not to mention that the whole "decaying medium" argument is ridiculous. If a hard drive fails, replace it. If you get something better than hard drives, copy it. It's not like big servers only keep the information in one specific place. There are usually copies.

    9. Re:Good! by bendodge · · Score: 0

      The scientists agree; papyrus is still the most reliable form of data storage known to man. It is good for a couple thousand years at least, so don't despair yet.

      --
      The government can't save you.
    10. Re:Good! by Anonymous Coward · · Score: 0

      ASCII is still just as readable as ever.

      Extremely true. Of course you fail to mention it never was readable at all to most of the people on the planet.

    11. Re:Good! by hopethisnickisnottak · · Score: 1

      1000 year disruption -- even if it's chiseled into a stone tablet, the language might be extinct

      Umm, by conservative estimates, Hebrew and Sanskrit are both more than 5000 years old. If you go by most widely accepted estimates, the oldest work in Sanskrit is more than 7000 years old. Both languages have survived.

      --
      -Shaunak
    12. Re:Good! by Scarblac · · Score: 1

      I've been using computers for well more than 10 years, and ASCII is still just as readable as ever.

      But EBCDIC is slightly harder. Besides, ASCII is only usable for a subset of human text - basically only for English. It's not really a solution.

      --
      I believe posters are recognized by their sig. So I made one.
    13. Re:Good! by Dan+Ost · · Score: 1

      Do not disregard partial solutions just because they're not 100%.

      --

      *sigh* back to work...
    14. Re:Good! by Pollardito · · Score: 1

      I expect a whole line of O'Reilly "Research on RAILs" books to show up on shelves near me, or would they just digitize those directly instead of printing?

    15. Re:Good! by moofo · · Score: 1

      Even if I understand that some knowledge is sometimes easier to explain with a specific language, I think a universal language would be very important.

      I am a French speaker, and I think English would be the best for this job. Why should we put the knowledge in several languages in the first place when there are so many good translation engines?

      --
      "I've heard nonsense, compared with which that would be as sensible as a dictionary." Through the looking glass and what
    16. Re:Good! by evilviper · · Score: 1
      Besides, ASCII is only usable for a subset of human text - basically only for English.
      ...and Spanish ...and German ...and French ...and Latin ...and many, many more.

      You're just simply wrong.

      Their languages all work just fine without the few non-ASCII characters... Accents can be approximated easily enough.
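      A minimal sketch of what approximating accents can look like in practice, assuming the text is available as Unicode to begin with. It uses Python's standard unicodedata module; the function name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate accented Latin text using only ASCII characters."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


approx = to_ascii("Se\u00f1or M\u00fcller, caf\u00e9")
```

      Letters that have no base-plus-accent decomposition, such as the German sharp s or the Scandinavian slashed o, are simply dropped by this approach, so a small hand-written substitution table would still be needed for them.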

      It's not really a solution.

      A) Yes it is.
      B) What made you confuse my post with a proposal for the universal book-digitizing system of the future?
      C) I was illustrating a point.

      Obviously, if such a system was started today, something like Unicode would be used.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  2. Just Open Source It? by lionheart1327 · · Score: 1

    Can't Google just Open Source the project?

    That way we don't have different companies and foundations duplicating each other's work, but all the results are still open and accessible to everybody.

    1. Re:Just Open Source It? by CDPatten · · Score: 1

      Sure they "could", but they won't. At the very best they will allow some APIs into their database, but then they will find a way to integrate their ads with it. Whatever the case, Google is going to be the sole owner of their project here (by the way, there is nothing wrong with that either).

      Google is just as "evil" as any other corporation; it's just that thus far they have put enough spin on what they do to skirt the label.

    2. Re:Just Open Source It? by Whiney+Mac+Fanboy · · Score: 2, Insightful

      Can't Google just Open Source the project?

      Well, the source of the code running the project wouldn't be that helpful, it's the content we're after.

      And presuming you meant Google opening the content.... well I doubt it... they want to sell ads on the content after all!

      Don't forget, Google, nice tho' they are, haven't given out code/content/etc for any of their "crown jewels".

      --
      There are shills on slashdot. Apparently, I'm one of them.
    3. Re:Just Open Source It? by SatanicPuppy · · Score: 3, Interesting

      I bet they won't.

      There is nothing sexy or secret about the methods of scanning, but they must have put an imperial frickton of money into the process...To give the fruit of that much money away would be irresponsible to their shareholders...At least until they've made their money back with it.

      --
      ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
    4. Re:Just Open Source It? by Whiney+Mac+Fanboy · · Score: 1

      Hate to reply to myself, but I should have added:

      1) Google may also have contractual obligations with copyright holders that prevent putting the content in an open format.

      2) If point 1 can be overcome & Google could see a competitive advantage over MS's book scanning effort in opening the content then perhaps they'd try it after all...

      --
      There are shills on slashdot. Apparently, I'm one of them.
    5. Re:Just Open Source It? by Jherek+Carnelian · · Score: 1, Interesting

      To give the fruit of that much money away would be irresponsible to their shareholders...At least until they've made their money back with it.

      Only if you don't expect to reap the benefits of it afterwards and that giving it away might actually be required in order to reap those benefits. You know, kinda like how google gives away search engine results and email accounts.

    6. Re:Just Open Source It? by ePhil_One · · Score: 1
      You know, kinda like how google gives away search engine results and email accounts


      Google does not give those things away for free. It exchanges them in return for subjecting you to advertising, which they in turn sell to folks who want to show you advertising.

      There's no such thing as a free lunch.

      --
      You are in a maze of twisted little posts, all alike.
    7. Re:Just Open Source It? by MS-06FZ · · Score: 2, Funny

      Hear, hear! Books want to be open! I find that when books can be open, as they should, they become much more accessible to people than if they were kept closed.

      --
      ---GEC
      I'm but the humble pupil, seeking to snatch the scratchbuilt pebble from the master's fully articulated hand
    8. Re:Just Open Source It? by Jherek+Carnelian · · Score: 0, Flamebait

      Google does not give those things away for free. It exchanges them in return for subjecting you to advertising, which they in turn sell to folks who want to show you advertising.

      Gee, thanks for the class in Google 101. You miss the point that if google did not give them to you for nothing - i.e. no other requirement on your part, they could not get money from the advertisers. Thus giving it away is actually required in order to reap those benefits.

    9. Re:Just Open Source It? by Anonymous Coward · · Score: 0

      Google doesn't want to open source it any more than they open source their search system. Think about it.

      GOOGLE = FOR PROFIT CORPORATION, which means they are in the business of capturing information about YOU to sell it to OTHERS (directly or indirectly). Try looking at Google without those rose-colored glasses for a minute.

    10. Re:Just Open Source It? by ePhil_One · · Score: 2, Interesting
      Thus giving it away is actually required in order to reap those benefits.

      Quite the opposite. If they give it away, then I can set up ePhil House o' Classic Literature and reap the benefits of that advertising in place of Google. I can show less advertising because I don't have that nasty overhead of scanning the books. Google's need is to make it available to consumers in exchange for "eyeballs" but keep it away from me. Hammer away on Google's servers and they will cut you off, I ran operations in a company that performed such meta-searches and used to be able to tell you with a high degree of precision where that line was (which we considered business intelligence and thus wouldn't tell you unless you worked there).

      And for the record there is no requirement that they give away the content to show you advertising, they choose to do so because a free service attracts more "eyeballs" than a paid service. It's up to management to decide which combination of advertising vs. subscription fees nets the most profit. Since Google best understands the "charge by advertising" model, they have a predilection for the "advertising-only supported" model.

      So your grade for Google 101 is an F

      --
      You are in a maze of twisted little posts, all alike.
    11. Re:Just Open Source It? by CleverBoy · · Score: 2, Informative

      Exactly right. All these comments about "must show ads over it" pretty much miss the point. Google's project allows you to SEARCH all the books it's scanning, and even so, it's drawn the ire of copyright holders. Imagine if they said... "Oh, yes... we're OPEN SOURCING all of our scanning results for unfettered public consumption." No judge in the world... nuff said. Open sourcing the actual methodology would not serve much purpose, although it's worthy of note that they open sourced some OCR software earlier. Very well received too. Gift horses and such, blah blah blah.

    12. Re:Just Open Source It? by Achromatic1978 · · Score: 1
      1)Google may also have contractual obligations with copyright holders that prevent putting the content in an open format.

      For the most part, the copyright holders' complaint is specifically that there is NO agreement with Google to allow them to do anything with their work, let alone redistribute it.

    13. Re:Just Open Source It? by LocoMan · · Score: 1

      There's also the question of whether they can open it at all (the content, I mean, not the system). They get away with it because of the fair use laws that (IIRC from what I read a long time ago) allow you to show a sample of a work but not the entire work (like online CD vendors do when they let you hear 15 seconds or so of each song). They can't open something they don't own the copyright to in the first place, unless they did it with public domain works only (like what Project Gutenberg was/is doing... are they still around?).

    14. Re:Just Open Source It? by Jherek+Carnelian · · Score: 1

      for the record there is no requirement that they give away the content to show you advertising, they choose to do so because a free service attracts more "eyeballs" than a paid service.

      Doh! You've just described the requirement. IF they want to maximize their return they are required to give it away.

    15. Re:Just Open Source It? by Anonymous+McCartneyf · · Score: 1

      That's not a requirement, just a strong incentive. If it was a requirement, we would not have ads in glossy magazines, nor ads on websites requiring subscriptions to glossy magazines for access.

      --
      There is a fine line between recklessness and courage... -- Paul McCartney
  3. Google's goof by bcrowell · · Score: 3, Insightful

    Google's big mistake was to try to do both PD and copyrighted books. Regardless of the legal merits (which are complicated), it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it. Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists. The whole thing is actually an abject failure, so it doesn't make me worry that Google will somehow get too powerful. Anyway, AFAIK Google doesn't claim any IP rights on their scans of PD books, so they actually don't have any control at all -- other people can take the scans and do whatever they want with them. Google is in the advertising business, not the publishing business.

    1. Re:Google's goof by DragonWriter · · Score: 3, Interesting
      Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists.


      It's on the short list "More" link on the Google search page, and results from it are brought up without special request for certain searches on the main web search engine (apparently, any with the word "book" that get hits, though I'm not certain of that.)

      That's hardly Google doing "everything it could to hide from users the fact that the service even exists".
    2. Re:Google's goof by bcrowell · · Score: 1

      It's on the short list "More" link on the Google search page...
      When the service first came online, you would just do a normal Google search, and results from books would pop up, by default. When the lawsuit happened, that stopped happening, and you had to go to books.google.com to get separate results on books. They had an easy way to let millions of people use the service, just by encountering it naturally in their search results, but they got rid of that. The result is that ordinary people have no idea it's even an option.

      and results from it are brought up without special request for certain searches on the main web search engine (apparently, any with the word "book" that get hits, though I'm not certain of that.)
      Interesting, if true, but I can't seem to confirm it by casual tests. Here are four searches:

      1. text from a copyrighted book, via a regular google search
      2. text from a copyrighted book, via google books
      3. text from a PD book, without the word "book" in the search
      4. the same PD text, with "book" in the search
      In search #4, adding "book" fails to bring up the google print stuff, at least within the first screenful of results. As far as I can tell, the results returned by books.google.com and google.com are disjoint sets.
    3. Re:Google's goof by MyNymWasTaken · · Score: 1

      Providing the available full text copies when a book is searched for is denying it exists?

      http://www.google.com/search?hl=en&q=origin+of+species&btnG=Google+Search

      Is Google also denying the existence of its Froogle service since it's listed below the 'Books' search option in 'more>>'?

    4. Re:Google's goof by bcrowell · · Score: 1

      In the search you gave, I don't get any Google Book Search hits. To get the Google Book Search hits, you need to go to this url at books.google.com.

    5. Re:Google's goof by DragonWriter · · Score: 1
      Regular Google search: Search terms 'book math'. Book results come up as a special heading after sponsored links and before regular results.

      As far as I can tell, the results returned by books.google.com and google.com are disjoint sets.


      They clearly aren't disjoint, but are instead overlapping (particularly, the book results returned by the main search engine are a proper subset of those that would be returned using the main book search page); I think this is typical of the way Google presents "OneBox" results, where it uses services other than the prime database in a more limited way than if you used the service directly.

      At any rate, that it is possible to get results from the book database without making any special effort to use it just by using the main google search engine makes it pretty far from Google actively concealing the service.
    6. Re:Google's goof by DragonWriter · · Score: 1

      I get three Google Books hits from the original search without using books.google.com, just off the main google engine. Now, Google search results aren't particularly consistent (refreshing the search will sometimes change the results, and frequently cause sponsored links and OneBox results to disappear, as will, IIRC, doing multiple different searches in rapid succession).

    7. Re:Google's goof by bcrowell · · Score: 1

      Interesting. I wonder why it's giving you different hits than me. Are you logged in to a google account?

      There's no question that they changed the normal behavior, though. I'm enrolled in Google Books as a publisher (I opted in), and they sent me e-mails announcing all these policy changes. There was a period when the results from scanned books were always mixed in with web results, and then it abruptly changed. I think they're just trying to reduce their legal exposure in this lawsuit -- if fewer people use it, then the damages are smaller if they lose.

    8. Re:Google's goof by DragonWriter · · Score: 1
      There was a period when the results from scanned books were always mixed in with web results, and then it abruptly changed.


      Since they are a different kind of result, the use of OneBox is consistent with the rest of the Google interface: if you use the web search, you get web results in the main, but if there are particularly appropriate results by some more limited algorithm in one of the other databases, you also may get a handful of those in the OneBox area immediately after the sponsored links, and before the main results.

      I think they're just trying to reduce their legal exposure in this lawsuit -- if fewer people use it, then the damages are smaller if they lose.


      I think it's far more likely that they made the Book Search a specialized search that presented its results the same way other specialized searches do through OneBox for consistency and because of the reasonable idea that people choosing to use the web search engine want web results primarily.

    9. Re:Google's goof by Anonymous Coward · · Score: 0
      Anyway, AFAIK Google doesn't claim any IP rights on their scans of PD books, ....

      If Google ever screws up and gets bought out, all bets are off.

    10. Re:Google's goof by MyNymWasTaken · · Score: 1
      Are you logged in to a google account?

      No. Here is a screen capture for you.

      book results screen capture
  4. funny. by CDPatten · · Score: 2, Interesting

    Anyone else find the irony here funny? Google is on the side of keeping this a closed circuit project and MS is part of the alliance trying to make it open.

    It's funny. Laugh.

    1. Re:funny. by guspasho · · Score: 1

      Not really ironic. Microsoft would definitely be singing a different tune if they were in Google's shoes. Take all the examples of when they have been in Google's shoes. When was the last time they open-sourced anything? That's right, they haven't.

      Embrace, extend, and extinguish.

    2. Re:funny. by Anonymous Coward · · Score: 0

      Well, actually, maybe you should take a look at MS Research labs... they open source a LOT of stuff. Hell, Slashdot has even covered a wireless project they opened up. For that matter, and despite what politically charged Slashdotters like to think, Open XML passes many of the standards bodies across the globe. It's as open as you get, with the exception that you can't change it and still call it Open XML. You can call it guspasho's super duper XML, but not Open XML. I'd say that is fair.

      all that said I agree with the premise, that MS would be doing the same thing.

  5. Google's got a long way to go . . . by cashman73 · · Score: 2, Interesting
    It's kind of sad to think that people are already worried about one corporation controlling ALL of the world's books. Let's still think about the reality of it. Google came to a handful (like 5 or so) of libraries (major ones at that), with a plan to digitize out-of-copyright books and put their content on the internet. They've got the search technology, they're trying to innovate. Now, if there were only five libraries in the entire world, yes, we could have a problem here. But in reality, there's A LOT more libraries than that. It's going to take a HUGE, MASSIVE effort by Google in order to digitize all the content of all the libraries in the world, and that will likely never happen anyways. More likely, some other libraries will probably partner with other companies in the future to digitize their content, and they'll be placed on the web. Yeah, Googlebot will probably spider that, so it will be searchable via Google. But so will the other spiders.

    It would also be pretty naive and stupid to only utilize references from one source if you're doing research. You'd want to check out multiple sources to get the full picture. Of course, there is a growing problem that is quite common nowadays among an increasing number of college students that they believe that if it's not available on the web, it doesn't exist. Such students might find themselves somewhat "enlightened" if they walked over to the library and cracked open a book or journal from, say, before 1995.

    1. Re:Google's got a long way to go . . . by bcrowell · · Score: 4, Informative
      I think you're vastly overestimating the added benefit from scanning books from more libraries after the first few:
      1. Most libraries' collections are very similar to most other libraries' collections, and the greatest overlap occurs with the books that are the most important.
      2. This is all about PD stuff, since OCA isn't proposing to do anything still in copyright. Less ephemeral works (the kind typically preserved in library collections a century later) generally all had their copyrights renewed in the U.S., so that means we're only talking about pre-1923 materials. Since Congress keeps on extending copyright terms, nothing is probably ever going to enter the public domain from 1923 on. That means we're talking about the publishing world of 1922, which was vastly smaller than today's publishing world. Amazon.com has on the order of 10^6 books. To get a feel for the size of the publishing industry in past decades, try browsing through the catalog of renewals; the number of books published was extremely small in the early 19th century.
      3. There are many books that won't be in any library's collection, simply because they weren't considered very valuable. You could digitize a thousand libraries, and never find them. Handwriting manuals from 1893. Trashy novels. Etc. In fact, there are a lot of books from the 1930's-1950's that are now PD, because they never had their copyrights renewed, but you're not going to find them in libraries' collections, and in fact it's very unlikely that anyone will ever be interested in them.
    2. Re:Google's got a long way to go . . . by webbod · · Score: 2, Interesting

      Oxford University is one of the UK copyright libraries - it has a copy of every book published in the UK and Ireland since the 1600s - it gets them by default.

    3. Re:Google's got a long way to go . . . by Anonymous+McCartneyf · · Score: 1

      Hey! I'm interested in some of that '30s through '50s material!
      I think I am, anyway. There is this library I know that had the largest selection of old sci-fi I've ever seen. Many of the books it has, I've never seen anywhere else, and I think that at least some of the sub-works are public domain. I mean, most of the books in question are in generic library covers.
      There are stories in those books that I liked, that I might want to read again. Let's not let those works disintegrate--please?
      Also, the libraries I've gone to do have trashy novels. Some of them have sections for paperbacks, or even "romance" sections. I don't read the romances, but I have read and enjoyed trashy novels from libraries. I would like the trashy novels to remain available if possible.

      --
      There is a fine line between recklessness and courage... -- Paul McCartney
    4. Re:Google's got a long way to go . . . by Baricom · · Score: 1
      Google came to a handful (like 5 or so) of libraries (major ones at that), with a plan to digitize out-of-copyright books and put their content on the internet.
      If that was all that happened, nobody would be complaining. The problem is that it wasn't only out-of-copyright books, but every book in their collection, including those clearly in copyright. What's more, they require publishers who have issues with this copyright violation to opt out, and blanket opt-outs are not accepted - the publisher has to provide a list of EVERY book they publish to get each removed.
    5. Re:Google's got a long way to go . . . by lamona · · Score: 1
      Most libraries' collections are very similar to most other libraries' collections, and the greatest overlap occurs with the books that are the most important.

      Because the original Google 5 libraries have their holdings entered into WorldCat, a statistical study was done that showed that those five libraries would account for 33% of the 32 million books in that database. It also showed that 61% of the books held by the Google 5 are uniquely held by only one library. Essentially, the holdings of libraries follow a common pattern of a short, high head followed by a very long tail. If, even with their long tails, these 5 major libraries account for only 1/3 of books that libraries have entered into WorldCat, imagine how many libraries it will take to find and digitize the long tail of that one bibliographic database.

      Less ephemeral works (the kind typically preserved in library collections a century later) generally all had their copyrights renewed in the U.S

      The rate of copyright renewal was very low. According to Lessig ("Free Culture" p. 135) "In 1973, more than 85 percent of copyright owners failed to renew their copyright." I've seen estimates that about 90% of the books published between 1923 and 1978, when renewal was abolished, were never renewed. That means that there are MANY public domain books in that time frame, only we can't easily know which ones they are. You can look them up in the renewal database, but my impression is that the database is not considered to be complete, and therefore not entirely reliable. If you find the book in the database, it was renewed. If not...

      --
      I just read /. for the amusing .sigs
  6. Why compete... by ArcherB · · Score: 1

    ...when you can copy. If Google is going to make the data freely available, why pay people to start another scanning program when you can pay people to wait for Google to finish, have them go to the Google page and simply press CTRL-A, CTRL-C and then CTRL-V into their own page? Scan complete!

    --
    There is no "I disagree" mod for a reason. Flamebait, Troll, and Overrated are not substitutes.
    1. Re:Why compete... by evil_Tak · · Score: 1

      How is it going to help for them to create a new screen(1) window and then prepare to insert a literal control character?

    2. Re:Why compete... by HAL9000_mirror · · Score: 1

      Just because Google will make the data freely available may not necessarily mean that they will let you laugh at their work and let you use it for profit in your own company.
      --Ram

  7. Haha... by Jugalator · · Score: 0

    Since when were Google "in control" for being allowed to show excerpts of a book for the advertisement of the companies allowing them to carry their books?

    --
    Beware: In C++, your friends can see your privates!
  8. Nowhere near enough by Stephen+Ma · · Score: 0

    One million dollars? Even if you focus that solely on the contents of the Library of Congress, that will be, what, five cents per book?

  9. Scanning a book is easy... by creative_Righter · · Score: 5, Insightful
    Already facing a legal challenge for alleged copyright infringement, Google Inc.'s crusade to build a digital library has triggered a philosophical debate with an alternative project promising better online access to the world's books, art and historical documents.

    Scanning a book is easy, it simply involves taking pictures. You can slice the spine off and take pictures of each page or use one of the panoply of non-destructive machines to correct the page warping effects of an open book. This is not particularly hard or expensive.

    The latest tensions revolve around Google's insistence on chaining the digital content to its Internet-leading search engine and the nine major libraries that have aligned themselves with the Mountain View-based company.

    Damn straight. The OCR process is the hardest part, so of course they wouldn't allow others access to the highly valuable text. They might have a million books "scanned" this year, but each page has to be OCRed. Most people don't decouple those operations and assume that after scanning the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know if a page of a book is OCRed correctly? You can pay a human or even a large team of humans to QA the text, but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 30,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scannable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of related and multiple editions of a book, displaying pictures in the reader that you don't have copyright to (which I think always gets glossed over in these sorts of articles), 10-digit to 13-digit ISBNs, etc. So yes, they aren't going to allow others access to the text, because producing it is hard and expensive: you can only automate so much if you want to ensure the accuracy of the text itself (I think Google does). If they opened the text up, what would stop competitors from simply adding the data to their search engines after the difficult part is over? Google does no evil, but they aren't stupid.
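
    As a rough sanity check on numbers like these, the expected count of badly OCRed pages is just the page count times the page-level error rate. A minimal sketch, assuming the accuracy figure applies per page (an assumption; OCR accuracy is more often quoted per character) and using the one million books and 300 pages per book from above. The function name is made up for illustration:

```python
def expected_bad_pages(books: int, pages_per_book: int, accuracy: float) -> int:
    """Expected number of pages with an OCR failure, treating accuracy as a per-page rate."""
    total_pages = books * pages_per_book
    return round(total_pages * (1.0 - accuracy))


if __name__ == "__main__":
    # The result is very sensitive to the assumed accuracy, so try a few values.
    for accuracy in (0.999, 0.9995, 0.9999):
        bad = expected_bad_pages(1_000_000, 300, accuracy)
        print(f"{accuracy:.2%} accurate -> about {bad:,} bad pages")
```

    Whatever accuracy you assume, the count stays far beyond what a spot-checking QA team can review by hand.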

    1. Re:Scanning a book is easy... by monopole · · Score: 1

      Scanning a book is easy, it simply involves taking pictures. You can slice the spine off and take pictures of each page or use one of the panoply of non-destructive machines to correct the page warping effects of an open book. This is not particularly hard or expensive.

      Only if the book is expendable. In the case of many pre-1920 books (i.e. out of copyright) any sane library wouldn't even let you push it flat against the glass of a flatbed scanner. Ideally you need a scanner that keeps the book from opening appreciably, with filtered illumination.

      Now you've got a real problem. How does one know if a page of a book is OCRed correctly?
      Now that's simple. Distributed proofreading. Just the sort of thing Google is good at.

    2. Re:Scanning a book is easy... by Anonymous Coward · · Score: 0

      Self - Correcting by readers. Just allow actual readers of the books as they
      are browsing it to look at the ocr'd document too, and leave comments. Books with
      comments get attention by human editor.

    3. Re:Scanning a book is easy... by Anonymous Coward · · Score: 0

      I'd rather get png file than utf-8 file, as especially old books do have a tremendous value in the layout, pictures, and characters themselves. Of course having both is the best option.

    4. Re:Scanning a book is easy... by Anonymous Coward · · Score: 0

      Now you've got a real problem. How does one know if a page of a book is OCRed correctly?
      Now that's simple. Distributed proofreading. Just the sort of thing Google is good at.

      No, the sort of thing the distributed proofreaders are good at. And they're already raiding Google's scans, as well as many other page image collections....

    5. Re:Scanning a book is easy... by SnarfQuest · · Score: 1

      As I understand it, Google just uses the raw OCR. It's usually good enough for searching, which is what they are interested in, and requires a lot less manpower than corrected OCR. If you want corrected OCR, you need to look at places like Project Gutenberg (and distributed proofreading).

      --
      Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
  10. RE: Google 'Do No Evil' ... by Super+Dave+Osbourne · · Score: 0, Troll

    is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?

  11. so, they will also campaign against copyright by speculatrix · · Score: 0, Flamebait

    ... and copyright extension then, since that is also dominating our culture now...?
    yeah, thought not. copyright enforcement is only demanded by those who can control it, and it's sheer brilliance that they turned a civil law issue into a criminal one and thus got the gov't to pay the copyright holder's costs!

    1. Re:so, they will also campaign against copyright by Anonymous Coward · · Score: 0

      Yes - they'll fight the extension of copyright. The Internet Archive filed an Amicus brief in the Eldred case (which sadly went against Eldred and the congress' new terms were left standing).

      Brewster Kahle (the founder of the archive) is now personally suing the Attorney General over 'orphan works' (represented by Larry Lessig). Some details of the ongoing case here :

      http://www.archive.org/iathreads/post-view.php?id=76756

      and

      http://cyberlaw.stanford.edu/case/kahle-v-gonzales

  12. Re: Google 'Do No Evil' ... by urbanradar · · Score: 3, Insightful

    Google 'Do No Evil' ... is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?

    Oh, do calm down... They never claimed "we do absolutely no evil whatsoever", it's more like - the founders happen to think that "evil should not be done". What's a lie about that? Also, how does inflated stock make them evil?

    And how, pray, are they supposed to survive without the adverts? Never mind the fact that Google didn't actually come up with online advertising but were pretty much the first ones to run targeted, non-offensive (as in, no flashing banners, pop-ups, etc.) ads.

    I'm no Google fanboy, although I happily use many of their services. But I don't think there's anything inherently wrong with them, and I find it somewhat sad to see this paranoid drivel modded up to +3 Insightful.

  13. Re: Google 'Do No Evil' ... by sweatyboatman · · Score: 1

    Oh damn! You really nailed Google there. They're all about making you see ads. Oh man, they're never going to live that tongue-lashing down. I bet their PR people are going nuts trying to figure out how to clean this mess up.

    Are you angry because Google suspended the SOAP API? Or are you just a grumpy troll?

    --
    It breaks my pluginses, my precious!
  14. Re:I agree by Anonymous Coward · · Score: 0

    Outer darkness time for you ;)

  15. Project Gutenburg by larry+bagina · · Score: 5, Interesting

    I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) google's project (whatever that means...).

    Project Gutenberg is open and non proprietary (ASCII text) and has been for quite a while.

    After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

    1. Re:Project Gutenburg by DragonWriter · · Score: 0
      I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) google's project (whatever that means...).


      "Open sourcing" Google's project, as others have used the term in the thread, would seem to mean providing, at least, an open API so that different collections could federate easily, and perhaps providing an Open Source implementation of some parts of that API, as well.

      Project Gutenberg is open and non proprietary (ASCII text) and has been for quite a while.


      Project Gutenberg isn't a full-text search system delivering scanned images of printed works.

      While it has some conceptual relation to the system at issue, it doesn't fill exactly the same role.
    2. Re:Project Gutenburg by evilviper · · Score: 1
      Project Gutenberg is open and non proprietary (ASCII text) and has been for quite a while.

      They focus solely on public-domain works, as opposed to fair-use of current, copyrighted works, as Google does.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  16. Google = M$ = Evil Empire by Anonymous Coward · · Score: 0

    The only thing left is for the two CEOs to have a drink and snicker about the foolish peasants known as Internet users. They control the content, access, and media.

  17. Maybe, but not yet. by Xenographic · · Score: 1

    No, I don't think that they'll hold to it forever. I suspect that once the founders are gone, things will erode until that motto will go the way of the dinosaur except for its PR function.

    That said, based on what they're *doing* (and not what they're merely saying), they're at least making a reasonable effort to live up to an ideal, and that's a hell of a lot more than I can say for any other corp.

    In other words, I'll retain some loyalty to Google so long as it shows some loyalty to us. Like I said, they'll probably let us down someday and that'll be the time to ditch them, but at the same time, it's stupid not to enjoy the good while it lasts.

  18. the books aren't going anywhere... by the+packrat · · Score: 4, Insightful

    You folks do realise that Google returns the books after they scan them so they'll still be in the libraries afterwards right? So how does this reduce their availability?

    --
    Nihil Illegitemi Carborvndvm
    1. Re:the books aren't going anywhere... by sgholt · · Score: 1

      You seem to be the first one who came to the same conclusion as me...the books themselves are the thing that needs to stay open and available. Whether Google or Project Gutenberg copies the books is not the issue.
      So that brings me to another conclusion...there must be some other reason for this...hmmm
      Who would want to limit Google?

    2. Re:the books aren't going anywhere... by callousmuppet · · Score: 1

      Not only are the books themselves not going anywhere, the libraries themselves will be getting their own copy of the archive (at least, that's the deal at least one of the libraries has made with Google, although it seems that each library has made a separate deal and they've all had to sign significant non-disclosure agreements). So it's really not as though Google will have exclusive control over even the digital form of all this material.

    3. Re:the books aren't going anywhere... by Tacvek · · Score: 1

      Nonsense. Several of the libraries have the contracts posted online. For example University of Michigan: http://www.lib.umich.edu/mdp/ (The contract is listed as "U-M Library/Google Cooperative Agreement")

      --
      Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
  19. Please do a better job, not just a bigger job by Ankh · · Score: 2, Informative

    Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in utf-8) which no OCR program today seems able to handle, so the scanning cost is increased.

    Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.

    There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail, and indeed, compare for example the Alice in Wonderland pictures on my own site with the Project Gutenberg ones. You can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.

    Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
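
    The back-of-the-envelope calculation can be sketched roughly as follows. The 4:3 sensor shape, the 9 x 6 inch page, and the 600 lines-per-inch engraving are illustrative assumptions, not measurements from any particular camera or book, and the function names are made up for this sketch:

```python
import math
from typing import Tuple


def sensor_pixels(megapixels: float, aspect_w: int = 4, aspect_h: int = 3) -> Tuple[int, int]:
    """Approximate (long side, short side) pixel counts for a sensor of the given size."""
    total = megapixels * 1_000_000
    short = math.sqrt(total * aspect_h / aspect_w)
    long_side = total / short
    return int(long_side), int(short)


def effective_dpi(px_long: int, px_short: int, page_long_in: float, page_short_in: float) -> float:
    """Resolution on the page when the whole page must fit in one frame."""
    # The tighter of the two axes limits the usable resolution.
    return min(px_long / page_long_in, px_short / page_short_in)


def nyquist_min_dpi(lines_per_inch: float) -> float:
    """Sampling theorem: at least two samples per line period (line plus gap)."""
    return 2.0 * lines_per_inch


if __name__ == "__main__":
    px_long, px_short = sensor_pixels(40)
    dpi = effective_dpi(px_long, px_short, page_long_in=9.0, page_short_in=6.0)
    needed = nyquist_min_dpi(600)  # hypothetical engraving line density
    print(f"40 MP over a 9x6 inch page: about {dpi:.0f} dpi")
    print(f"Needed for 600 lines per inch: at least {needed:.0f} dpi")
```

    Under these assumptions a single 40 megapixel frame of a whole page falls short of what the finest engraved lines would need, and a larger page or a two-page spread only makes it worse.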

    I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription??

    Liam

    --
    Live barefoot!
    free engravings/woodcuts
    1. Re:Please do a better job, not just a bigger job by bgalbrecht · · Score: 1

      In my scanning for Distributed Proofreaders http://www.pgdp.net/, I would have to say that over 90% of the books do not have illustrations or footnotes that require scanning at higher resolutions than 300 DPI. Even the ones with illustrations are probably fine with 300-600 DPI scans. For the most part, the black and white images of text pages in the Google PDF files are adequate, although the illustrations are low resolution garbage. The real problem I see with Google's work is that it's substandard, with missing illustrations, missing pages, and poorly scanned pages. They do rescan books when errors are reported, but even then, it takes a few months and the rescans are subject to error. Unless there are multiple projects independently scanning, OCRing these works, I fear that single source for these works will end up with an incomplete and somewhat unusable digital copy of these Google partner libraries.

      The second issue I have is that the full image display at both Google and the OCA/live.com (and PDF downloads of full images) is not particularly useful on low resolution displays, like PDAs, mobile phones, tablets, and dedicated ebook readers. Perhaps future generations of ebook readers will have the form factor of a paperback book with high enough resolution to view the scans made available by google and the OCA, but I don't see it happening for years.

      The last complaint I have about Google, is that with their proprietary database, it's not easy to create searches based on criteria not in their search parameters (for example, based on number of pages).

      Despite these complaints, I'm still pleased that Google decided to try to digitize the works in these libraries, if for no other reason than it got several digitization projects funded.

    2. Re:Please do a better job, not just a bigger job by Ankh · · Score: 1

      I agree with you that needing more than 200dpi is fairly rare, and if I have needed it more often it is possibly because I tend to work with older books, or books in poor condition.

      It's not only for footnotes, but also, say, to distinguish an ae ligature (æ in utf8) from an oe ligature (œ in utf-8 if it survives slashdot), or from the unligatured letters, or to distinguish a zero (0) and a letter "O", and so on. If one has the original book to hand, that's less of an issue.

      I agree that the Google bulk scanning is not good quality. In the past, Project Gutenberg produced some very low quality e-texts too. Some of them have been improved, and newer ones are to a much higher standard. It doesn't seem that long since I suggested the use of SGML to Michael Hart, and pointed out that they would then be able to capture italics and other textual variations -- but it's about 15 years ago I think, and since then HTML has become popular :-)

      600dpi is really bad for an engraving, but OK for a screened photograph if you don't mind losing some detail. It's not good for archival purposes, e.g. for studying which engraver worked on a particular image.

      You're right, though, that Google has spurred competition, and I agree that it's a good thing!

      Best,

      Liam

      --
      Live barefoot!
      free engravings/woodcuts
    3. Re:Please do a better job, not just a bigger job by TTK+Ciar · · Score: 1

      The second issue I have is that the full image display at both Google and the OCA/live.com (and PDF downloads of full images) is not particularly useful on low resolution displays, like PDAs, mobile phones, tablets, and dedicated ebook readers.

      What formats do these devices understand? The OCA's books are available in a variety of formats, including text, xml (which is just the text annotated with positional information), and high- and low-resolution jpeg. Click on the "FTP" link to the left in the details page to see all formats:

      details page
      FTP index

      It shouldn't be too difficult to write a little software that takes the xml + jpeg and combines them into a cohesive html document .. converting the xml to html would be easy, but recognizing the "garbage" where an illustration goes would be harder. Once you had code that recognized it, though, cropping and inserting the appropriate jpeg region would be trivial (the garbage's positional information is noted in the xml along with all the rest of the text).
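
      A minimal sketch of that idea, using only the Python standard library. The XML layout here (a `page` element with an `image` attribute, containing `word` elements with `x`, `y`, `w`, `h` attributes) is a made-up stand-in, not the OCA's actual schema, and the "garbage" test is a crude guess based on the share of alphanumeric characters:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample page; the real files will use a different schema.
SAMPLE = """<page image="page_0001.jpg">
  <word x="100" y="50" w="80" h="20">Alice</word>
  <word x="190" y="50" w="40" h="20">was</word>
  <word x="100" y="200" w="30" h="30">~%#</word>
  <word x="140" y="210" w="25" h="40">|^;</word>
</page>"""


def looks_like_garbage(text, min_alnum_ratio=0.5):
    """Guess that OCR output over an illustration is mostly non-alphanumeric."""
    if not text:
        return True
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) < min_alnum_ratio


def page_to_html(xml_text):
    """Return (html_fragment, crop_box); crop_box is None when no garbage is found."""
    page = ET.fromstring(xml_text)
    words, boxes = [], []
    for word in page.iter("word"):
        text = (word.text or "").strip()
        x, y, w, h = (int(word.get(k)) for k in ("x", "y", "w", "h"))
        if looks_like_garbage(text):
            boxes.append((x, y, x + w, y + h))
        else:
            words.append(html.escape(text))
    parts = ["<p>" + " ".join(words) + "</p>"]
    crop = None
    if boxes:
        # Merge every garbage box into one region covering the illustration.
        crop = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))
        src = html.escape(page.get("image") or "")
        box = ",".join(str(v) for v in crop)
        parts.append(f'<img src="{src}" data-crop="{box}">')
    return "\n".join(parts), crop


if __name__ == "__main__":
    fragment, crop = page_to_html(SAMPLE)
    print(fragment)
    print("crop box (left, top, right, bottom):", crop)
```

      This lumps all the garbage on a page into a single region and does not try to keep the illustration in reading order; the crop box would then be handed to whatever image library does the actual cropping.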

      -- TTK

  20. Doesn't really matter by crabpeople · · Score: 0

    In 25 years they will determine that Google's library is incomplete and start OCR shotgunning books down camera filled canvas chutes.

    A brief protest will be launched, but all the kids will be too busy with their new fangled wearables and feelie parks to care.

    --
    I'll just use my special getting high powers one more time...
  21. Did someone break their legs? by Anonymous Coward · · Score: 3, Insightful
    A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity...


    Did someone break their legs?

    See that big building downtown with all the books in it?

    Oh wait, get up from your desk, go outside (yes I know, it burns...), get on the bus and go downtown.

    OK, now see the big building with the strange letters "LIBRARY" on the front? OK, that's the one, go inside... see all the books?

    Now go up to the attendant at the desk and tell them your name and address and show a piece of photo ID. The nice person will give you a card that you can use to borrow books.

    What's a book? OK, it's many pages of paper bound together usually with glue and string. On each of these "pages" you will find ink (a dye) in the pattern of letters that form words and sentences and paragraphs.

    Usually, these "books" tell a story or provide organised information.

    Now go ahead, pick one out - they'll even let you take it home for a week or two so you can read it. For free!

    You can browse the stacks (a colloquialism for those big shelves with books on them) which are organised according to a system known as the Dewey Decimal System. You can use a revolutionary piece of technology known as a "card catalog" to indicate the position of the title you seek on the stacks (though many libraries have this same catalog searchable from computer terminals).

    It's revolutionary, I know. But there you have it, free information and entertainment, enough to last a lifetime, with a "less restrictive approach".

    Enjoy.
    1. Re:Did someone break their legs? by dido · · Score: 1

      But unfortunately, not all of the world has access to such wonderful libraries, and specialized research is somewhat difficult, even if your city is one that is blessed with a nice public library. Boy, I loved it when I discovered sites like this, and this, and this, collections to truly warm the heart of a math geek like me. Good luck finding even a tenth of the books and journals in those three collections in your local public library.

      --
      Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.
    2. Re:Did someone break their legs? by SL+Baur · · Score: 1

      But unfortunately, not all of the world has access to such wonderful libraries

      As you are probably well aware of. The first private primary school in Banaybanay (Banaybanay is a small municipality in eastern Mindanao) I sent my eldest step-son to had exactly one book in its "library". It was a donated high school text on Shakespeare. However they get this done, it will be a wonderful opportunity for many people.
  22. Wheres our free music and vids?!? by Anonymous Coward · · Score: 0

    Google is scaning books why arent they doing the same for music and vids so we can get them for FREE?!?

    i dont care about books thats too much work!!! i want free music and vids!!! it dont cost them nothing when its on p2pand besides they dont pay the artists nyway and copyrights are bad bcause they infring on our right to be entertaine. its a hole nu paradim and they should be LISTENING to us!!!

    its all ones and 0's and they are tryin to CHARGE us for them, can you beleve it?!?

    I am so fucking angry right now, I can barly type! Fucking greedy corps! Google shuld fiht them! They should just make ALL music and vids FREE, and FUCK the other corps that are so gredy!

    They have BILLIONS of dollars and if they arnt GREEDY, THEY do that for us!

    FUCK the RIAA! FUCK the MPAA! FUCK, FUCK, FUCK!

    I WANT TO BE ENTERTAIND FOR FREE! I DESERVE IT BECAUSE I DO!

    YOU ARE GREEDY IF YOU DONT LET ME DO IT!

    Fucking greedy corps! Its all shit anyway! I wouldnt pay for it anyhow, so I should be able to just download it for free because it isnt worth anything! I PAY my internet!

    And im not rich like those fucking bastards! I shouldnt have to pay for music and vids they already have enugh money! I should get it for FREE!

    1. Re:Wheres our free music and vids?!? by Anonymous Coward · · Score: 0

      Hehehe.

  23. money by Anonymous Coward · · Score: 1, Informative

    "And how, pray, are they supposed to survive without the adverts?"

    Don't know about you, but I would pop for a yearly subscription for a *good quality* search engine that had a toggle for "with adverts" or "no adverts" option. Not sure how much I would spend, that would depend on how good they were on filtering out link farms, etc, but some reasonable fee to have the option of no ads. And then websites might have an inducement to restrict use of ads to at least the interior pages and not the main public facing page. Ads there just suck.

    Right now I would classify the free google search with ads as being of medium quality until you get good at it with a lot of -restrict this and that word added to your query and learning wild cards and domain restrictions, etc. In fact, I wish google had one simple option on their main page, split their search bar in two by default, one side is for words/phrases you are looking for, the other side is what you want to immediately filter out. For example if you add -sale, you eliminate a lot of commercial sites. Dogsquat simple, hardly anyone does it.

        Google is good once you learn to use it, by default like most people use it though it's just a fancy yellow pages.
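
    The two-box idea above can be sketched in a few lines: one list of wanted terms, one list of unwanted terms, joined into a single query using the leading-minus exclusion operator. `build_query` is a made-up helper name, not anything Google provides:

```python
def build_query(wanted, unwanted):
    """Join wanted and unwanted terms into one search string, prefixing exclusions with '-'."""
    def quote(term):
        term = term.strip()
        # Multi-word phrases are quoted so they are matched (or excluded) as a unit.
        return f'"{term}"' if " " in term else term

    include = [quote(t) for t in wanted if t.strip()]
    exclude = ["-" + quote(t) for t in unwanted if t.strip()]
    return " ".join(include + exclude)


if __name__ == "__main__":
    print(build_query(["digital camera", "review"], ["sale", "buy now"]))
```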

  24. Re:open source? by Anonymous Coward · · Score: 0

    whoever modded the gp down is obviously a fanboi, a faggot, a rump roaster, a dicksucker, a fucktard and a bush supporter. fucking faggots ruin it for everyone else with their ass fucking aids disease.

    and if you're a fag reading this you're useless and you're a shithead. go fuck yourself.

  25. Enclose what? by McFadden · · Score: 2, Insightful
    'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,
    Yeah... because by scanning a book, Google automatically controls all of the knowledge inside it.
  26. How about: UnfoldingClassics@Home by Mathinker · · Score: 1

    > Hammer away on Google's servers and they will cut you off

    It'd be hard for them to defend against a bandwidth-limited, widely distributed effort.

    Anyone want a crack at writing "UnfoldingClassics@Home" ?

  27. Google says one thing does another by SerpentMage · · Score: 2, Interesting

    One thing that bugs the heck out of me with Google is their, "Oh we will do this because we have the rights", yet if you want to use their stuff you need EXPLICIT permissions. http://www.google.com/permissions/index.html

    " All of Google's trademarks, logos, web pages, screen shots, or other distinctive features ("Google Brand Features") are protected by applicable trademark, copyright, and other intellectual property laws. If you would like to use any of Google Brand Features on your website, in an advertisement, in an article or book, or reproduce them anywhere else, you must first receive Google's permission. We've tried to make this process as painless as possible."

    Funny, Google wants you to get permission, and they are saying there is no such thing as fair use. YET they want publishers to opt out...

    Google is hypocritical!

    --

    "You can't make a race horse of a pig"
    "No," said Samuel, "but you can make very fast pig"
  28. more credible by oohshiny · · Score: 1

    Given Microsoft's history on intellectual property, the complaints of the OCA would be a lot more credible if Microsoft weren't a part of it.

  29. what goof? by Anonymous Coward · · Score: 0

    it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it

    The number of books that are clearly out of copyright is actually quite small (most books are in a gray area), so doing just them isn't very useful.

    But more important: what is being "dragged down"? There's a lot of chest beating by people with strong interest in keeping control over printed materials and distribution channels, but nothing really substantial has happened.

  30. Preservation of languages by tjwhaynes · · Score: 1

    It takes many thousands of years for even uncommon languages to disappear. And if they were even remotely similar to our own, they can be deciphered without any advanced knowledge. So, I'd be worried about the long-term chances of a complex language like Chinese to be preserved, but anything with Latin roots, that uses a small alphabet should do fine.

    A thousand years for a language to disappear? All it takes is a generation who doesn't speak it and it might as well be considered gone. A language is often two interdependent parts - spoken and written. Often - but not always. You could take the shining example of the Canadian approach to the First Nations peoples in the last century, where students were forced to learn exclusively in English, rather than their native tongue. An entire generation suddenly loses contact with the language of their parents. That would be devastating enough for, say, French speakers. Now consider that most of the First Nations languages and dialects have no written form. Needless to say, in hindsight apologies have been made but it certainly wiped out dialects that had survived centuries until then.

    I think the corollary in IT is also important. Any physical media which is not used for a generation of technology (maybe less than 10 years) quickly becomes difficult to read as the machinery required to read it fails. Wait thirty years and it will cost you many times over to retrieve that information. The only hope for a lot of old data is to constantly move it onto the ubiquitous storage of the day, time after time. Anything missed will, sooner rather than later, be lost.

    Cheers,
    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
    1. Re:Preservation of languages by evilviper · · Score: 1
      A thousand years for a language to disappear? All it takes is a generation who doesn't speak it and it might as well be considered gone.

      Not a chance in hell of that ever happening in the real world.

      You'd have to separate every single child from their parents at birth, send them to some far away land where the old language isn't spoken at all, and make sure they never meet anyone who speaks anything else.

      Languages are handed down from parent to child, for several generations before they are forgotten, even when they are completely foreign to everyone else. When there are also numerous others that still know how to speak the language (as happens in any real scenario) it stays useful, and is passed on for many more generations.

      Any physical media which is not used for a generation of technology (maybe less than 10 years) quickly becomes difficult to read as the machinery required to read it fails.

      I think you've completely misunderstood the scenario in question.
      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  31. Digital originals available from publishers by foniksonik · · Score: 1

    Surely we can speed up this process by simply asking the publishers to make available the original digital LaTeX or SGML files for all books printed since the late 70s, right?

    Why invest hundreds of hours on scan/ocr/qa for texts which already exist in a digital format?

    --
    A fool throws a stone into a well and a thousand sages can not remove it.
    1. Re:Digital originals available from publishers by lamona · · Score: 1

      Because the publishers 1) didn't use any particular standard for their digital files 2) and they didn't keep the digital files once the book was published. The folks doing e-books for the publishers were horrified in the early part of this decade to discover that publishers had considered the digital files discardable. Today, most publishers send the book to the printers in PDF, so there is at least that file, but if you want to do any reformatting you're going to be working with a Quark file with formatting of the publisher's own devising. Yes, better than scanning, but only just.

      --
      I just read /. for the amusing .sigs
  32. Decoupling of content and medium by DrYak · · Score: 1
    100 year disruption -- hard drives, DVDs decay to unreadability


    That is, if we imagine a digital archive functioning like its plain-paper counterpart: huge underground stores with shelves full of discs.

    But if we're a little bit realistic, we should realise that in the current age of the internet and digital information, the data doesn't have to remain fixed on a specific medium. The ability to make perfect copies is basically inherent to the nature of digital data.

    The problem of preservation is no longer preserving a single old medium, but keeping a copy of the data as the storage medium is progressively upgraded.

    Think about it: every time you upgrade a hard disk in your computer, you keep your old data (you either copy your old partition or copy your files). Some of the files you have kept around out of nostalgia may come from very old computers that can't be found anymore. (On the system I'm writing on, I still have some games I programmed in BASIC when I was a kid, a long time ago. The original floppy may have rotted, but there's still a copy of the .BAS file somewhere in a folder.)
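
    The "copy it forward on every upgrade" habit described above can be sketched in a few lines. This is only an illustration under assumptions: the function names `sha256_of` and `migrate` and the example paths are made up here, and comparing SHA-256 digests is just one reasonable way to confirm that the copy on the new medium matches the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def migrate(src: Path, dst_dir: Path) -> Path:
    """Copy src into dst_dir (hypothetical 'new medium') and verify the copy.

    Raises OSError if the copy does not hash to the same value as the source.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    shutil.copy2(src, dst)  # copies the contents plus timestamps where possible
    if sha256_of(src) != sha256_of(dst):
        raise OSError(f"copy of {src} to {dst} does not match the original")
    return dst


if __name__ == "__main__":
    # Demo with throwaway directories standing in for the old and new disks.
    with tempfile.TemporaryDirectory() as tmp:
        old_disk = Path(tmp) / "old_disk"
        old_disk.mkdir()
        game = old_disk / "GAME.BAS"
        game.write_bytes(b'10 PRINT "HELLO"\n20 GOTO 10\n')
        copy = migrate(game, Path(tmp) / "new_disk")
        print("migrated to", copy)
```

    The point is only that the check is cheap: once the digest matches, the old medium can rot without the data going with it.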

    According to your argument, software could NOT be found for old vintage computers, home computers, game consoles and arcade machines, because most of the discs have rotted, the ROM boards may be broken and/or unreadable by any modern hardware, etc.
    But in reality you can google for any classic emulation site and still find disc and ROM images. Digital data is easy to copy around. The medium may have changed: data was moved from ROMs and 8" / 5.25" / 2" floppies to hard disks, then to images inside ZIP files on the internet.

    Granted, the medium itself will never again be a medium. The single biggest problem that we will face is the readers. For all this marvellous "survival through digital copy" to function, the data needs to be accessed and copied in the first place.
    Sadly, with all the DRM systems that are appearing and restricting the possibility of copying digital data, preservation will be much more difficult.

    DRM : Bringing you a new dark age.
    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  33. LOCKSS by lamona · · Score: 1

    It's called "Lots of Copies Keeps Stuff Safe" and it's even got standards and software.
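
    The general idea, as I understand it, is that several independent copies are compared and a damaged one is repaired from the ones that still agree. A toy sketch of that idea follows. It is NOT the actual LOCKSS protocol or software API; the function names and the simple strict-majority rule are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 hex digest of one replica's content."""
    return hashlib.sha256(data).hexdigest()


def repair_by_majority(replicas: dict[str, bytes]) -> dict[str, bytes]:
    """Return replicas with every copy replaced by the majority content.

    `replicas` maps a (hypothetical) library name to its copy of one file.
    Raises ValueError if no content is held by a strict majority of copies.
    """
    votes = Counter(digest(data) for data in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        raise ValueError("no strict majority; cannot tell which copy is good")
    good = next(d for d in replicas.values() if digest(d) == winner)
    return {name: good for name in replicas}


if __name__ == "__main__":
    copies = {
        "library_a": b"Call me Ishmael.",
        "library_b": b"Call me Ishmael.",
        "library_c": b"Call me Ishm\x00el.",  # simulated bit rot
    }
    repaired = repair_by_majority(copies)
    print(all(v == b"Call me Ishmael." for v in repaired.values()))
```

    With only two copies that disagree there is no way to vote, which is one argument for "lots" of copies rather than a pair.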

    --
    I just read /. for the amusing .sigs
  34. You could read this as... by TheRecklessWanderer · · Score: 1

    You could read this as.. What?? Money? We want money, give us some money. We want our share. Why can't we have our share?? Wahhhhh. We want money.

    --
    Mean what you say...say what you mean.
  35. Just How Does...? by Nom+du+Keyboard · · Score: 1

    Just how does Google scanning a book prevent anyone else from doing the same? Does Google own the only copy? I doubt that. This seems like much ado about nothing, or an outright grab to force Google to share what they put the effort into creating in the first place. And I'll bet the sharing is expected to be Free.

    --
    "It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
  36. How to open up the archive by RecycledElectrons · · Score: 0

    If Google (or Microsoft's www.Books.Live.Com) wants to open this up to us, they can do one of 2 things:

    (1) provide a complete index, possibly sortable, so I can have an easy set of links to mirror

    (2) send a backup to a company that will sell a DVD version of their collection. One company makes money selling $1 DVDs at dollar stores, and we all see 4-disc sets of John Wayne videos at Wal-Mart for $5.50. What if Microsoft or Google sent such a company a backup of their book collection once a year, and complete sets of, say, 25 DVD-9 discs (roughly 200GB) were available for $50? Bean counters can raise the price, but as long as we are free to copy them, I'm sure that universities would be willing to buy.

    Andy Out!

  37. Update existing laws by Anonymous Coward · · Score: 0
    Works don't necessarily need to just go away, and Google, Project Gutenberg, and others don't necessarily need to be the only avenue for preserving works.
    Surely we can speed up this process by simply asking the publishers to make available the original digital LaTeX or SGML files for all books printed since the late 70s, right? Why invest hundreds of hours on scan/ocr/qa for texts which already exist in a digital format?
    Legal deposit and mandatory deposit LAWS already in effect might be updated to ensure that copyright holders place works in an electronic format on deposit with national libraries...
    http://www.copyright.gov/help/faq/mandatory_deposit.html
    http://www.bl.uk/about/policies/legaldeposit.html
  38. Don't be evil by jandersen · · Score: 1

    ... a company like Google that has embraced 'Don't Be Evil' as its creed

    Now that you mention it, so have the Christian Church, the Muslims and in fact most of the other religions. As have such magnificent luminaries as George Bush and Tony Blair. Well, more or less.

    Moral: You can't trust people who try to use that kind of 'creed' as a selling point.

  39. Project Gutenberg by Anonymous+McCartneyf · · Score: 1

    http://gutenberg.org/
    Last modified this month.
    I think Project Gutenberg is still around.

    --
    There is a fine line between recklessness and courage... -- Paul McCartney