Open Library Project Takes Flight

Posted by ryuzaki0 on Monday July 16, 2007 @10:34AM from the alexandria-green-with-envy dept.

Aaron Swartz today announced the launch of the new Open Library project. The goal of the project is to produce the world's greatest library on the Internet, free for anyone to use. Starting with the Internet Archive's book scanning project, and organizing the insertion of new content via a wiki-type model, the project seems to be off to a great start. The demo, source code, and mailing lists were all opened up today in hopes of drawing interest from the public at large.

Re:Project Gutenberg by AaronSw · 2007-07-16 10:53 · Score: 5, Informative

Hi, Aaron Swartz here. Project Gutenberg is about putting up text versions of out-of-copyright books. This project is about creating a catalog of _every_ book, with links to PG, scans, Amazon.com, PDFs, print on demand, etc. -- anything we can get our hands on. Gutenberg books are in our catalog, of course, but so are millions more.
Re:In response to your question: by jandrese · 2007-07-16 10:53 · Score: 4, Insightful

I find it depressing that if someone came up with the concept of a free library system today, they would be sued out of existence by the book companies. What is perhaps one of the greatest triumphs ever for the poor uneducated masses would not stand a chance in our current legal environment.

--

I read the internet for the articles.
Not Project Gutenberg by krelian · 2007-07-16 11:19 · Score: 4, Insightful

Don't compare this to Project Gutenberg. This is supposed to be the "Internet Movie Database" for books (as far as I understand, anyway). I am pretty sure that a big part of this information can be filled in with calls to Amazon web services.
Re:IPL? by TTK+Ciar · 2007-07-16 12:04 · Score: 4, Interesting

OpenLibrary is a lot more complete, for one: searching on "Ogorkiewicz" in IPL yielded no hits, while OL gave me several. The Archive is well-connected to various institutions like the Library of Congress and Bibliotech, and is able to pull a lot of help from these other organizations into making a more complete service.

OpenLibrary is also a catalog of metadata, providing information for each book like physical format, publisher, ISBN#, number of pages, and so on. This metadata has a lot of holes for now, but hopefully that will change as publishers and/or people who own copies of these books fill in the blanks, much like the Internet Movie Database.

Finally, OpenLibrary has its own staff which is dedicated to working with Internet Archive partners to make this the most complete catalog on the planet. IPL is cool (I like it!) but it does not seem to be very actively maintained.

(disclaimer: I work for The Internet Archive, but I do not speak for it, and the OpenLibrary team is in a completely different department from mine so DO NOT treat this post as necessarily any more authoritative or correct than any other slashdot post.)

-- TTK
Re:Project Gutenberg by PMBjornerud · 2007-07-16 12:22 · Score: 4, Insightful

I find these scanned original pages FAR more restful to the eye than any other form of electronic book. This way, I can sit down and read a complete book on the screen -- without suffering the eye fatigue that comes from reading large swaths of ordinary onscreen text. I think it has a lot to do with print fonts being designed specifically for the eye, and somewhat to do with the normal yellowing of paper that produces a less glary background.

This does not make sense. A scanned document will always have artifacts and imperfections from the scanning process and should by definition be harder to read. A well-sized font on a pleasant background should beat scanned text every single time.

Your issue is more likely that there are a lot of crappily designed webpages out there.

If you're reading "large swaths of ordinary onscreen text", do this:
- Copy-paste it into any word processor
- Choose a nice, big font. (Small is good for UI, not for 400-page-novels.)
- Use a dark background. A page reflects light, a screen projects it. You do not want glaring white.
- Use 8-10 words per line.
- Profit! Err... less mental exhaustion, at least.

Pay extra attention to words per line. It's a key reason onscreen text is often hard to read. Too many words per line, and you'll have a mental overhead every few seconds trying to figure out which line you just read and which is next. Basically, books do it right and you want to display onscreen text at a similar width. Scrolling is easy these days, and wide lines are a remnant from when computers required a click-and-drag to scroll.
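
To make the words-per-line advice concrete, here is a minimal Python sketch that reflows a paragraph so no line holds more than ten words. The function name `wrap_by_words` and the default of ten are illustrative choices for this example, not taken from any particular reader or word processor.

```python
def wrap_by_words(text, max_words=10):
    """Split text into lines holding at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Take the words in consecutive chunks of max_words.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Made-up sample text: 25 placeholder words.
sample = " ".join(f"word{n}" for n in range(25))
lines = wrap_by_words(sample)
for line in lines:
    print(line)
```

Counting words rather than characters keeps the sketch close to the rule of thumb above; a real layout would more likely set a column width in the display font.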

Wide books and newspapers are divided into columns. There is a reason for doing this, but almost nobody seems to think about that when they display text on screens.

Heck, even slashdot defaults to a glaring white background and text stretched all over my 1920 pixels. Go figure.

--
I lost my sig.
Re:In response to your question: by fyngyrz · 2007-07-16 12:48 · Score: 4, Insightful

Anyways, the good news is that libraries do exist, and aren't going away.

No, of course not, because they're protected by copyright law, which in turn grew out of article 1, section 8 of the Constitution. Just like there will never be a restriction on keeping and bearing arms... uh, oh, wait. OK then, like there will never be restrictions on speech... no, no, turns out there are plenty of those. Mmmm, ok, just like the feds can only take action on interstate commerce, because you know, that's an enumerated power they can't step outside... aw, no, they do that all the time. Well, it'll be like how they can't do searches or seizures without probable cause, oath or affirmation, and a warrant... oh... I guess that's no longer true. Well, of course they can't make ex post facto laws... except for the ones they've made, that is, you know, thinking of the children and such.

Wait. Why is it again libraries "aren't going away?"

--
I've fallen off your lawn, and I can't get up.
Some thoughts by harmonica · 2007-07-16 21:24 · Score: 4, Insightful

I know the project is just starting, but here it goes.

They should republish the raw data the same way Wikipedia and even IMDb do. I for one am not going to contribute to any data collection project that I can't later use myself.

Their schema doesn't differentiate between editions. If I understand it right, that means that for the 3000 existing editions of "Tom Sawyer" released over the years, by different publishers in different countries and languages, the book's description has to be replicated for each one. That can't be good. I don't have a quick solution to this myself. Sometimes (esp. with tech books), a new edition changes content significantly compared to the previous one, sometimes they're exactly the same.
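
One common way to avoid that duplication, sketched here as an assumption and not as anything Open Library actually does, is to store the shared description once on a "work" record and let each edition override it only when its content really differs. The class names `Work` and `Edition`, the fields, and the sample publisher and years are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Edition:
    publisher: str
    year: int
    language: str
    # Set only when this edition's content differs from the work's.
    description: Optional[str] = None


@dataclass
class Work:
    title: str
    description: str
    editions: List[Edition] = field(default_factory=list)

    def description_for(self, edition: Edition) -> str:
        """Return the edition's own description, else the shared one."""
        if edition.description is not None:
            return edition.description
        return self.description


# Hypothetical data: one unchanged reprint, one revised edition.
work = Work("Tom Sawyer", "Shared description, stored once.")
reprint = Edition("Example Press", 1990, "en")
revised = Edition("Example Press", 2005, "en",
                  description="Description of the revised edition.")
work.editions += [reprint, revised]

print(work.description_for(reprint))
print(work.description_for(revised))
```

This handles both cases mentioned above: identical reprints inherit the description, while a tech book whose new edition changes significantly carries its own.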

Collecting the cover images is a great service. However, doesn't this infringe on the publisher's copyright? Is this still fair use? What about countries like Germany without fair use laws--will German books still be OK because the data is collected in the USA (I guess)?

Add a feature to upload book descriptions as XML. Suggest a DTD. I have a list of my book collection stored as an XML file, so have others (maybe not natively, but book collection management software usually has an export function). It should be possible to automate the process of adding book information already stored in some digital format.
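
As a rough idea of what such an import could look like, the Python sketch below parses a small book list with the standard library's `xml.etree.ElementTree`. The element and attribute names (`books`, `book`, `title`, `author`, `year`, `code`, `type`) are invented for this example; no DTD has been published, and the ISBN value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical export format; the second record is deliberately sparse.
SAMPLE = """<books>
  <book>
    <title>The Adventures of Tom Sawyer</title>
    <author>Mark Twain</author>
    <year>1876</year>
    <code type="isbn13">PLACEHOLDER-ISBN13</code>
  </book>
  <book>
    <title>Untitled export row</title>
  </book>
</books>"""


def parse_books(xml_text):
    """Turn the XML book list into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for book in root.findall("book"):
        year_text = book.findtext("year")
        records.append({
            "title": book.findtext("title"),
            "authors": [a.text for a in book.findall("author")],
            # Missing fields become None or empty, not errors.
            "year": int(year_text) if year_text else None,
            "codes": {c.get("type"): c.text for c in book.findall("code")},
        })
    return records


records = parse_books(SAMPLE)
for record in records:
    print(record)
```

Tolerating missing fields matters here because exports from collection management software rarely fill in every column.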

There should be some category system to pick from. Some may put Tom Sawyer into "Novel, USA antebellum", others into "Novel, USA 19th century".

Somehow connect this to Wikipedia. The more prominent books have article pages. Maybe data could be retrieved from it as well. There are currently Tom Sawyer articles in 16 or so languages.

The edit page should group items better: stuff everyone understands (year published, title) first, then those things only specialists know.

The edit page's descriptors shouldn't be images but text which links to an explanation page for the same reason. BISAC? LCCN? UCC13? I know, I can find out what those are with a search engine, but I shouldn't have to.

Prepare for i18n. I guess LCCN is a Library of Congress code number? Those types of libraries exist in other countries, too. Each book can have a gazillion codes. Make this another tuple in the database: (book_id, code_id, code_value) instead of (book_id, lcc_id, isbn10, isbn13, 10 other codes in the same record).
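
A minimal sketch of that tuple layout, using Python's built-in `sqlite3` module with an in-memory database. The table names (`book`, `code_type`, `book_code`), the helper functions, and every code value below are placeholders invented for this example, not Open Library's actual schema.

```python
import sqlite3


def build_catalog(conn):
    """Create one row per (book, code type, value), not one column per code."""
    conn.executescript("""
        CREATE TABLE book (
            book_id INTEGER PRIMARY KEY,
            title   TEXT NOT NULL
        );
        CREATE TABLE code_type (
            code_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE book_code (
            book_id    INTEGER NOT NULL REFERENCES book(book_id),
            code_id    INTEGER NOT NULL REFERENCES code_type(code_id),
            code_value TEXT NOT NULL,
            PRIMARY KEY (book_id, code_id, code_value)
        );
    """)


def add_code(conn, book_id, code_name, code_value):
    """Register the code type if it is new, then attach the value."""
    conn.execute("INSERT OR IGNORE INTO code_type (name) VALUES (?)",
                 (code_name,))
    (code_id,) = conn.execute(
        "SELECT code_id FROM code_type WHERE name = ?", (code_name,)
    ).fetchone()
    conn.execute("INSERT OR IGNORE INTO book_code VALUES (?, ?, ?)",
                 (book_id, code_id, code_value))


def codes_for(conn, book_id):
    """Return (code name, value) pairs for one book, sorted by name."""
    return conn.execute(
        """SELECT t.name, c.code_value
           FROM book_code c JOIN code_type t ON t.code_id = c.code_id
           WHERE c.book_id = ?
           ORDER BY t.name, c.code_value""",
        (book_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_catalog(conn)
conn.execute("INSERT INTO book (book_id, title) VALUES (1, 'Tom Sawyer')")
add_code(conn, 1, "isbn10", "PLACEHOLDER-ISBN10")
add_code(conn, 1, "lccn", "PLACEHOLDER-LCCN")
# A national-library code from another country needs no schema change.
add_code(conn, 1, "other_national_id", "PLACEHOLDER-ID")
print(codes_for(conn, 1))
```

The trade-off is one extra join per lookup in exchange for never having to add a column when a new kind of identifier shows up.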

Also i18n: store language codes with all textual columns. A description is most likely going to be Hungarian for a book published in Hungary in Hungarian.
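
To illustrate, here is a small Python sketch that keeps descriptions keyed by language code and picks the best available one for a reader. The function name `pick_description`, the English fallback, and the sample texts are assumptions made for this example.

```python
def pick_description(descriptions, preferred, fallback="en"):
    """Return (language, text) for the first available language, else None.

    descriptions maps a language code to the description in that language;
    preferred lists the reader's languages in order of preference.
    """
    for lang in list(preferred) + [fallback]:
        if lang in descriptions:
            return lang, descriptions[lang]
    return None


# Hypothetical record for a book published in Hungary.
descriptions = {
    "hu": "(description text in Hungarian)",
    "en": "(description text in English)",
}

print(pick_description(descriptions, ["hu"]))
print(pick_description(descriptions, ["de"]))
```

Without the stored language code there would be no way to tell the two texts apart, let alone choose between them.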

This complicates the schema a lot. Having very few tables is tempting, but it usually doesn't work well with the real world.