Slashdot Mirror


Open Library Project Takes Flight

Aaron Swartz today announced the launch of the new Open Library project. The goal of the project is to produce the world's greatest library on the Internet, free for anyone to use. Starting with the Internet Archive's book scanning project and organizing the insertion of new content via a wiki-type model, the project seems to be off to a great start. The demo, source code, and mailing lists were all opened up today in hopes of drawing interest from the public at large.

27 of 126 comments

  1. Awesome by CrazyJim1 · · Score: 2, Funny

    Project Gutenberg(sp) never really had a large enough selection to interest me. I would like to see how they do this new library.

    1. Re:Awesome by Creepy+Crawler · · Score: 2, Insightful

      Well, you can thank extensive copyright for that fact.

      Go Disney.

    2. Re:Awesome by illegalcortex · · Score: 3, Insightful

      Yes, I particularly enjoyed Human Genome Project, Chromosome Number 08. Some fine reading there.

      C'mon, I would be fairly disappointed with a library of 21,000 real books even if it contained only fiction from random authors from 1900-2000. Gutenberg doesn't even have that much depth.

      That's not to take anything away from them. But to make claims about it being a good selection based on "21,000 - gee that's a big number" is a bit ludicrous.

  2. In response to your question: by CaptainPatent · · Score: 3, Interesting
    FALTWSBTFA: (From a link to what should be the feature article)

    "What if there was a library which held every book? Not every book on sale, or every important book, or even every book in English, but simply every book"

    It would probably be sued for copyright infringement.
    --
    Well, back to rejecting software patent applications.
    1. Re:In response to your question: by jandrese · · Score: 4, Insightful

      I find it depressing that if someone came up with the concept of a free library system today, they would be sued out of existence by the book companies. What is perhaps one of the greatest triumphs ever for the poor uneducated masses would not stand a chance in our current legal environment.

      --

      I read the internet for the articles.
    2. Re:In response to your question: by timeOday · · Score: 2, Interesting
      Speaking of which, do you think people would be allowed to drive cars or own guns if they were invented today? I don't.

      Anyways, the good news is that libraries do exist, and aren't going away. If the electronic library is to exist, it should be pursued as an extension of existing libraries. In other words, we must ensure that electronic access to text grows out of the familiar library setting, not Napster. There are lots of ways to do this.

      For instance, current library filing systems are really just electronic card catalogues, which is quite primitive - what if whoever catalogued the book didn't think up the same keywords you did? Only by digitizing the books will we be able to use all the information retrieval algorithms that make searching the WWW so effective. This would be very useful even if users couldn't "click through" the search results to the content of the book.

      Another good argument for digitization is preservation. It just seems reckless not to have an easily duplicated archive of all published works.

      After that, I hope we could consider exemptions to copyright that allow electronic access from anywhere, for a fee. Call it "compulsory licensing" if you like, but it really just means "we won't prohibit people from accessing the information, but we will make them pay and give you the money," which sounds better and happens to be true.

    3. Re:In response to your question: by fyngyrz · · Score: 4, Insightful
      "Anyways, the good news is that libraries do exist, and aren't going away."

      No, of course not, because they're protected by copyright law, which in turn grew out of article 1, section 8 of the constitution. Just like there will never be a restriction on keeping and bearing arms... uh, oh, wait. OK then, like there will never be restrictions on speech... no, no, turns out there are plenty of those. Mmmm, ok, just like the feds can only take action on interstate commerce, because you know, that's an enumerated power they can't step outside... aw, no, they do that all the time. Well, it'll be like how they can't do searches or seizures without probable cause, oath or affirmation, and a warrant... oh... I guess that's no longer true. Well, of course they can't make ex post facto laws... except for the ones they've made, that is, you know, thinking of the children and such.

      Wait. Why is it again libraries "aren't going away?"

      --
      I've fallen off your lawn, and I can't get up.
    4. Re:In response to your question: by OldeTimeGeek · · Score: 2, Insightful
      "Wait. Why is it again libraries 'aren't going away?'"

      Aside from the already mentioned fact that not all books are digitized, it may be because Internet access is not universal, the barrier to access is still high (computers aren't free, right?), and one of the few places where you can get free access, plus a device to use it with, is of course a library.

  3. Re:Project Gutenburg by AaronSw · · Score: 5, Informative

    Hi, Aaron Swartz here. Project Gutenberg is about putting up text versions of out-of-copyright books. This project is about creating a catalog of _every_ book, with links to PG, scans, Amazon.com, PDFs, print on demand, etc. -- anything we can get our hands on. Gutenberg books are in our catalog, of course, but so are millions more.

  4. Re:wikipedia 2.0 by The+Iso · · Score: 2, Informative

    It's called Wikisource. Mod this article redundant.

    --
    "You don't need a weatherman to know which way the wind blows." - Bob Dylan
  5. Libraries don't get sued for infringement by Anonymous Coward · · Score: 2, Interesting

    Even in these litigation-happy days, physical book libraries don't get sued, and indeed they normally get direct governmental funding to continue their work.

    If an electronic library can find a way to obtain support as a literacy project, there are plenty of traditional avenues open. Suits against council literacy efforts don't go down well, at least in Europe.

  6. Not Project Gutenbeg by krelian · · Score: 4, Insightful

    Don't compare this to Project Gutenberg. This is supposed to be the "Internet Movie Database" for books (as far as I understand anyway). Anyway, I am pretty sure that a big part of this information can be filled in with calls to Amazon web services.

    1. Re:Not Project Gutenbeg by dwarfsoft · · Score: 2, Insightful

      This is exactly what I was going to say. It is basically an overview of books with links to Amazon or wherever. I foresee reviews, quotes, and links on par with IMDB. Only there is a benefit of being able to have full-text books too.

      I had a play with it and it is quite limited at the moment. I did manage to add a book, but there was minimal instruction on how to go about this, and uploading covers at the moment is not available (as far as I could determine in 5 minutes anyway).

      --
      Cheers, Chris
    2. Re:Not Project Gutenbeg by hcdejong · · Score: 2, Informative

      Except that they'll also store and supply the books themselves (scanned and/or as text), if available.

  7. Re:Project Gutenburg by Reziac · · Score: 2, Interesting

    I've been using the openlibrary.org site for a while now. I find these scanned original pages FAR more restful to the eye than any other form of electronic book. This way, I can sit down and read a complete book on the screen -- without suffering the eye fatigue that comes from reading large swaths of ordinary onscreen text. I think it has a lot to do with print fonts being designed specifically for the eye, and somewhat to do with the normal yellowing of paper that produces a less glary background.

    Also, many of these old texts, especially popular fiction from the late 1800s, have been discarded by meatspace libraries, so are otherwise pretty much unavailable -- and quite possibly in danger of being lost to the public altogether. (The first such book I picked at random to read, a late-1800s novel I'd never heard of, also proved to be a very relaxing way to spend an evening.)

    Anyway, I've been thrilled with the project, especially with the ability to download the scanned images as well as the plain text.

    --
    ~REZ~ #43301. Who'd fake being me anyway?
  8. Mod parent up by Anonymous Coward · · Score: 2, Informative

    She is correct. This is not a 'library' per se but a catalog of books, with links to PG, Amazon, B&N, etc. Most books are NOT free.

    The difference between this and other catalogs (Library of Congress, etc.) is that presumably you can customize it more.

  9. Re:Take flight? by Reziac · · Score: 2, Interesting

    Actually, you're wrong -- to "take flight" primarily means to take off, or to start a project. So the usage was correct.

    --
    ~REZ~ #43301. Who'd fake being me anyway?
  10. A Library Card Tip by Revenge_of_Solver_Ta · · Score: 2, Insightful

    This is great news, I hope it actually works. Related: I recently discovered my local library has about 50% of the books I usually buy. Why didn't I think of this earlier? Must have lost about $10K from that during the last decade. Now, if you'll excuse me, I must go check out a copy of "How to Make Your Very Own Video Game in 16 Days Using ONLY...Wordstar!"

  11. Re:IPL? by TTK+Ciar · · Score: 4, Interesting

    OpenLibrary is a lot more complete, for one: searching on "Ogorkiewicz" in IPL yielded no hits, while OL gave me several. The Archive is well-connected to various institutions like the Library of Congress and Bibliotech, and is able to pull a lot of help from these other organizations into making a more complete service.

    OpenLibrary is also a catalog of metadata, providing information for each book like physical format, publisher, ISBN#, number of pages, and so on. This metadata has a lot of holes for now, but hopefully that will change as publishers and/or people who own copies of these books fill in the blanks, much like the Internet Movie Database.

    Finally, OpenLibrary has its own staff which is dedicated to working with Internet Archive partners to make this the most complete catalog on the planet. IPL is cool (I like it!) but it does not seem to be very actively maintained.

    (disclaimer: I work for The Internet Archive, but I do not speak for it, and the OpenLibrary team is in a completely different department from mine so DO NOT treat this post as necessarily any more authoritative or correct than any other slashdot post.)

    -- TTK

  12. Re:Project Gutenburg by PMBjornerud · · Score: 4, Insightful

    "I find these scanned original pages FAR more restful to the eye than any other form of electronic book. This way, I can sit down and read a complete book on the screen -- without suffering the eye fatigue that comes from reading large swaths of ordinary onscreen text. I think it has a lot to do with print fonts being designed specifically for the eye, and somewhat to do with the normal yellowing of paper that produces a less glary background."

    This does not make sense. A scanned document will always have artifacts and imperfections from the scanning process and should by definition be harder to read. A well-sized font on a pleasant background should beat scanned text every single time.

    Your issue is more likely that there are a lot of crappily designed webpages out there.

    If you're reading "large swaths of ordinary onscreen text", do this:
    - Copy-paste it into any word processor
    - Choose a nice, big font. (Small is good for UI, not for 400-page-novels.)
    - Use a dark background. A page reflects light, a screen projects it. You do not want glaring white.
    - Use 8-10 words per line.
    - Profit! Err... less mental exhaustion, at least.

    Pay extra attention to words per line. It's a key reason onscreen text is often hard to read. Too many words per line, and you'll have a mental overhead every few seconds trying to figure out which line you just read and which is next. Basically, books do it right and you want to display onscreen text at a similar width. Scrolling is easy these days, and wide lines are a remnant from when computers required a click-and-drag to scroll.
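For anyone who would rather script this than use a word processor, here is a minimal Python sketch of the words-per-line advice. It uses the standard `textwrap` module; the function name `reflow` and the assumed average of 6 characters per word (including the space) are illustrative guesses, not measured values.

```python
import textwrap


def reflow(text: str, words_per_line: int = 9, avg_word_len: int = 6) -> str:
    """Re-wrap text so each line holds roughly `words_per_line` words."""
    # Approximate the target width in characters from the assumed
    # average word length (which counts one trailing space per word).
    width = words_per_line * avg_word_len
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = text.split("\n\n")
    wrapped = [
        # Collapse existing line breaks and runs of spaces, then wrap.
        textwrap.fill(" ".join(p.split()), width=width)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)
```

Calling `print(reflow(open("book.txt").read()))` on a plain-text book (the file name is hypothetical) would give lines of at most 54 characters with the defaults, which lands in the 8-10 word range for typical English prose.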

    Wide books and newspapers are divided into columns. There is a reason for doing this, but almost nobody seems to think about that when they display text on screens.

    Heck, even slashdot defaults to a glaring white background and text stretched all over my 1920 pixels. Go figure.
    --
    I lost my sig.
  13. Re:Gutenburg by wootest · · Score: 2, Informative

    By being a listing/index/catalog of all books with references to where to get them instead of being a site dedicated to reproducing the source material of stuff in the public domain, perhaps?

  14. Re:Project Gutenburg by Fallingcow · · Score: 3, Interesting

    What I really want are some modern, well-written footnotes and introductions to older works. Maybe throw in some good annotated maps when appropriate.

    Older books are often hard to relate to without some context, and that sort of thing is what makes or breaks many editions of the "classics", IMO. If, when shopping for books, I pick up a copy of a book that was written more than 200 years or so ago, and it has no footnotes, most of the time I won't buy it. This is doubly true of translated works.

    Wikipedia can usually stand in for an introduction, but there's nothing like footnotes to get you closer to an older text, and nothing that I know of provides that. If someone started a project to provide that kind of information for Project Gutenberg books, I'd get on board to help. Bonus points if they're also putting them in formats that don't suck (making plain text look good on the screen is a pain in the ass).

    I'd start it up myself, but alas, I am poor (college). I'd definitely help out if someone else got it going, though.

    Until someone does that, PG is practically useless to me.

    Will this project do anything like that, or do you know of anyone who's doing this?

    It seems to me that 500-1,000 really well-edited, footnoted, and formatted free books are better than 21,000 books worth of plain-text barf.

  15. Re:More of an IMDB than a library by maddskillz · · Score: 2, Informative

    That's what I was thinking. Sounds a lot like what http://www.librarything.com/ already does. Of course, they already have a big head start

  16. Kinakuta by EnsilZah · · Score: 2, Insightful

    How about placing the servers somewhere where copyright law holds no sway?
    Are there really any working data havens?

  17. Vandalism controls? by Creosote · · Score: 2, Interesting

    First thing I did on the site was pull up an entry for a book my university press publishes. It had no "Buy" option. I edited the metadata to add the ISBN-10 number for it, and voila, a Buy option.

    It then took a certain amount of self-control for me not to go into various titles dealing with George W. Bush and enter the ISBN-10 of the storybook containing "My Pet Goat". Purely as a proof of concept, you understand.

    This is simply the Wikipedia vandalism problem writ large. What controls will OpenLibrary put in place to guard against it?

  18. Some thoughts by harmonica · · Score: 4, Insightful

    I know the project is just starting, but here it goes.

    They should republish the raw data the same way Wikipedia and even IMDb do. I for one am not going to contribute to any data collection project that I can't later use myself.

    Their schema doesn't differentiate between editions. If I understand it right, that means that for the 3000 existing editions of "Tom Sawyer" released over the years, by different publishers in different countries and languages, the book's description has to be replicated for each one. That can't be good. I don't have a quick solution to this myself. Sometimes (esp. with tech books), a new edition changes content significantly compared to the previous one, sometimes they're exactly the same.
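One possible direction, sketched here in Python as an illustration only, is to split the record in two: a "work" that holds what all editions share, and an "edition" that holds what varies and may override the shared description when the content really changed. The class and field names below are assumptions made for this sketch, not Open Library's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Edition:
    """One concrete publication of a work (hypothetical fields)."""
    publisher: str
    year: int
    language: str                                # e.g. "en", "de"
    isbn: Optional[str] = None
    description_override: Optional[str] = None   # set only if content differs


@dataclass
class Work:
    """Data shared by every edition, stored once."""
    title: str
    author: str
    description: str
    editions: List[Edition] = field(default_factory=list)

    def description_for(self, edition: Edition) -> str:
        # Fall back to the shared description unless this edition
        # (say, a heavily revised tech book) carries its own.
        return edition.description_override or self.description
```

With this layout, 3000 editions of "Tom Sawyer" would share one description, while a revised second edition of a tech book could still carry its own text.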

    Collecting the cover images is a great service. However, doesn't this infringe on the publisher's copyright? Is this still fair use? What about countries like Germany without fair use laws--will German books still be OK because the data is collected in the USA (I guess)?

    Add a feature to upload book descriptions as XML. Suggest a DTD. I have a list of my book collection stored as an XML file, so have others (maybe not natively, but book collection management software usually has an export function). It should be possible to automate the process of adding book information already stored in some digital format.
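As a rough idea of what the import side could look like, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The element names (`books`, `book`, `title`, `author`, `year`) are invented for illustration, since no DTD has been proposed yet, and `parse_books` is a hypothetical helper rather than anything the site offers.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical export format; a real DTD would have to be agreed on.
SAMPLE = """\
<books>
  <book>
    <title>The Adventures of Tom Sawyer</title>
    <author>Mark Twain</author>
    <year>1876</year>
  </book>
  <book>
    <author>Unknown</author>
  </book>
</books>
"""


def parse_books(xml_text: str) -> List[Dict[str, str]]:
    """Turn a <books> document into a list of flat field dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for book in root.findall("book"):
        # Each child element becomes one field; missing text becomes "".
        record = {child.tag: (child.text or "").strip() for child in book}
        # Skip entries without a title: there is nothing to catalog them under.
        if record.get("title"):
            records.append(record)
    return records
```

Each resulting dictionary could then be handed to whatever "add a book" routine the site exposes, so that a whole exported collection goes in without retyping.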

    There should be some category system to pick from. Some may put Tom Sawyer into "Novel, USA antebellum", others into "Novel, USA 19th century".

    Somehow connect this to Wikipedia. The more prominent books have article pages. Maybe data could be retrieved from it as well. There are currently Tom Sawyer articles in 16 or so languages.

    The edit page should group items better: stuff everyone understands (year published, title) first, then those things only specialists know.

    The edit page's descriptors shouldn't be images but text which links to an explanation page for the same reason. BISAC? LCCN? UCC13? I know, I can find out what those are with a search engine, but I shouldn't have to.

    Prepare for i18n. I guess LCCN is a Library of Congress code number? Those types of libraries exist in other countries, too. Each book can have a gazillion codes. Make this another tuple in the database: (book_id, code_id, code_value) instead of (book_id, lcc_id, isbn10, isbn13, 10 other codes in the same record).
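To make the tuple layout concrete, here is a minimal sketch using Python's built-in `sqlite3`. Only the column names `book_id`, `code_id` and `code_value` come from the suggestion above; the table names, the helper functions, and the scheme name `other_national_id` are assumptions for illustration, and the code values are placeholders rather than real identifiers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE book (
    book_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
CREATE TABLE code_type (
    code_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE          -- e.g. 'isbn10', 'lccn'
);
CREATE TABLE book_code (
    book_id    INTEGER NOT NULL REFERENCES book(book_id),
    code_id    INTEGER NOT NULL REFERENCES code_type(code_id),
    code_value TEXT NOT NULL,
    PRIMARY KEY (book_id, code_id, code_value)
);
"""


def open_catalog():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_book(conn, title):
    return conn.execute("INSERT INTO book (title) VALUES (?)", (title,)).lastrowid


def add_code(conn, book_id, scheme, value):
    # A new identifier scheme is just a new row, not a new column.
    conn.execute("INSERT OR IGNORE INTO code_type (name) VALUES (?)", (scheme,))
    (code_id,) = conn.execute(
        "SELECT code_id FROM code_type WHERE name = ?", (scheme,)
    ).fetchone()
    conn.execute(
        "INSERT INTO book_code (book_id, code_id, code_value) VALUES (?, ?, ?)",
        (book_id, code_id, value),
    )


def codes_for(conn, book_id):
    """Return every (scheme, value) pair recorded for one book."""
    return conn.execute(
        "SELECT t.name, c.code_value FROM book_code c "
        "JOIN code_type t ON t.code_id = c.code_id "
        "WHERE c.book_id = ? ORDER BY t.name",
        (book_id,),
    ).fetchall()
```

The design point is that supporting another country's library code means calling `add_code` with a new scheme name, with no change to the table definitions.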

    Also i18n: store language codes with all textual columns. A description is most likely going to be Hungarian for a book published in Hungary in Hungarian.
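A minimal Python sketch of what storing a language code with each description buys you: the reader's preferred languages can be tried in order, with a fallback to whatever exists. The function name `pick_description` and the placeholder texts are assumptions for illustration.

```python
from typing import Dict, Optional, Sequence


def pick_description(
    descriptions: Dict[str, str], preferred: Sequence[str]
) -> Optional[str]:
    """Choose a description by language code, in order of preference."""
    for lang in preferred:
        if lang in descriptions:
            return descriptions[lang]
    # Nothing matched: fall back to the first stored description, if any.
    return next(iter(descriptions.values()), None)


# Descriptions keyed by language code; the texts are placeholders.
hungarian_book = {
    "hu": "(description written in Hungarian)",
    "en": "(description written in English)",
}
```

Here `pick_description(hungarian_book, ["de", "en"])` would return the English text, while a reader preferring `"hu"` would get the original Hungarian one.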

    This complicates the schema a lot. Having very few tables is tempting, but it usually doesn't work well with the real world.

    1. Re:Some thoughts by Teancum · · Score: 2, Insightful

      Here is some additional food for thought about this idea.... coming from somebody who has only given this concept just a few minutes of thought, but having dealt with this issue extensively in the past (of trying to catalog e-books):

      The kinds of skills necessary for doing actual cataloging work.... classifying and organizing knowledge... are so rare as to be a very precious jewel of a person if you ever do find somebody like that. And developing these skills is not something very easy to accomplish either. Certainly some basic tools can be developed that would make it a bit easier to climb up the steep slope of learning various cataloging techniques and understanding ontology as a discipline, but it is unusual. Most professional librarians that I have met (I'm talking people who actually work in real libraries) may have taken a college class or two about the subject, but even they seldom get into this sort of activity.

      Here is the main point about this discussion, and why this is a much harder task than is apparent: Almost all cataloging work in the USA (and the rest of the world too, BTW) is done by the national libraries (aka Library of Congress), and the thousands of other libraries largely rely upon that cataloging effort to come up with their own numbering scheme. This is especially so with the "cataloging in publication" process, where formal copyright registration assigns cataloging numbers well before the book even arrives at a typical local or even university library.

      At even a large library, those involved in the cataloging of content are usually a small team or even a single individual who has to catalog the couple dozen books that come in each year that aren't from major book publishers (often local histories that are self-published). Even then, it is hardly a full-time job and library staff like this usually have many other job duties.

      How this relates to eBooks and content on the internet is that there are many electronic resources in book-like form that are largely uncataloged. I would put it at close to 100,000 books, perhaps even more that are original "books" that have been written in the past 20 years, and are available under a free (as in beer and freedom) copyright license. The "low-hanging fruit" is the Project Gutenberg collection, but much of that surprisingly has already been cataloged in more than one form. This is because they are older books and have been cataloged years ago. While there certainly is value in preserving older documents like the PG collection, there is so much more, and in many ways more relevant explicitly because it is up to date.

      BTW, in response about the cataloging numbers, you can't simply assign a book to a single cataloging ID and expect it to work in every situation (without something incredibly complicated). Every classification system (ISBN, Library of Congress, Dewey Decimal, and about a dozen others) has its own strengths and weaknesses, and they differ from system to system. If a book has any value, it covers a unique topic, and it is these books that need a clean cataloging system, one that lets you "place" the book so that there are multiple ways of finding that content. For the hundreds of books about how to write HTML (to pick a topic that is common) they are largely the same... but my experience in trying to deal with book cataloging is that something so common like this is a rare situation, and at least 50% of all books in an e-book library are going to be something completely unique in terms of the topic covered. In short, you need the dozens of cataloging ID numbers for each book and not just a single cataloging ID number that is cross-referenced into a much larger and more complex database.