Proposal: Put Library of Congress' Contents Online
Mark_Uplanguage writes "The idea to scan in all materials available at the U.S. Library of Congress was presented at the Web 2.0 conference this week (as just one of many ideas presented). The proposed cost of $260 million would create a huge benefit to society (well, at least to those who can read English)."
The government has proposed recently. I would also suggest that they put in place requirements that all future material that is to be copyrighted present appropriate copies in machine readable form so this will be cheaper in the future.
If you folks want out of state donations from non-taxpayers, I'll stump up happily from Canada!
I would reserve that honor for Andrew Carnegie, who basically sold his empire for $485M and spent the rest of his life giving away all his money to good causes. Bill Gates is a far cry from that so far.
Just as a point of /. interest, what is the conversion factor between ACMs (Andrew Carnegie Millions) and BGBs (Bill Gates Billions)?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Yes, the article does say 'all 26 million books in the US Library of Congress.' I think all the books should be scanned. Imagine if a terrorist detonated a nuclear bomb there and destroyed the largest library in the world. What a loss that would be. But just because we would have a backup of the data doesn't mean they must allow full access to copyrighted works. They could release DVDs of a subset which includes only information in the Public Domain. It would be a huge boon for Project Gutenberg though each scanned work would still need to be transcribed to text.
Liberals call everyone Nazis yet they are the closest thing to it.
Creating primarily for money is shortsighted when a work has the chance to impact the larger culture. Just look at Michael Moore (ooh, isn't he ugly, but that's not the point), he's more interested in people seeing and being influenced by his movies than in getting richer off them. Enough money to be comfortable is great, but then, barriers to free movement of ideas should be relaxed.
If you must moderate, please moderate as irrelevant, not something bad, because I'm sure someone will find this interesting.
Right now, Internet2 can download the entire Library of Congress in about 20 seconds.
I'm not aware of any PIAA for publishers, but somebody is going to have a problem with this. And by the time this actually happens, I bet there will be an Internet4 that can do it all in 20ms.
Punctanym: alternate spelling of words using punctuation or numerals in place of some or all of its letters; see 'leet'
You'll find that many here (including me - and I'm one of the most conservative) find that copyright period oppressively long. Just because you wrote one useful book shouldn't entitle you to a generation of monopoly on its art and ideas. The copyright period was once much shorter, and that encouraged derivative works.
Gamingmuseum.com: Give your 3D accelerator a rest.
Can't have that, now can we?
No, we can't... it would not be fair to lots of people whose copyrights haven't yet lapsed.
But scanning the materials is _still_ a good idea. It allows for automated OCR that allows searching for text _within_ a book (like A9.com does, and as Google plans to do.) The difference is that all books published in the US could be searched.
It would also make this scenario possible:
Since this process is handled by people trained to respect copyright (i.e. the librarians), it is a win-win for everyone.
Their collections policy statement states that they only keep material specific to their very broad mission statement. This means that they will not keep a copy of a laundry list they received through the copyright office.
you must mean that whole 70 years after the author's death.
Wait, don't you mean 20 years after their death? Oh wait, yesterday it was 50 years. Today it's 70 years. Every time Mickey Mouse is up for grabs, that number gets bigger.
And of course, we're completely ignoring that grandparent was talking about the publisher who maybe fronted some money, but otherwise didn't do jack to earn this sort of perpetual power over "their" creation.
you must mean that whole 70 years after the author's death.
You must mean currently. But we all know that as soon as anything major (like Steamboat Willie) comes close to coming out of copyright, we'll see Congress extend the term of copyright yet again, thanks to 'encouragement' from Disney.
Copyright terms are nigh on infinite in fact, if not in law.
There's a Vulcan saying: "The needs of the many outweigh the needs of the few."
I would say, scupper copyrights for all volumes owned by LoC. Scan and put every volume on the internet.
Within a few years we would witness a Renaissance of sorts once again in human knowledge and education.
"Doing what i can, with what i have." ~ Burt Gummer
what makes you think all the contents of the library of congress are in english?
i am convinced that "/.ers" are homosexuals and imma make that my "sig"
A major problem with copyright law is that Congress makes sure that copyrights never lapse. Every time anything Disney is about to enter the public domain, Disney and the large companies that control all creative material get together and lobby (bribe) Congress to extend the term, ensuring nothing created after about 1930 will ever be in the public domain.
That isn't fair to mankind.
The best way so far of capturing wax recordings and the like is to run the disk under a high-resolution scanner and use a piece of software to render the image of the grooves as a waveform; this involves no physical wear of the medium. In fact, I'd think that a commercial version of this could well catch on for old-timers with large vinyl collections....
Maybe you should have a look at How the music biz can live forever, get even richer, and be loved.
The library does not have to be capable of maintaining the client, just of funding the infrastructure. The client may even be a thin client run by an external company. After all, network connectivity would probably be essential to access the Congress book database (for copyright reasons, the entire database would probably run out of a Google-like government contracted facility somewhere.)
About funding being scarce: after initial seed funding by the government, a library should easily be able to fund infrastructure in the same manner the internet funds itself.
That is because giving the local library the ability to sell any of _26 million books_ to anyone who walks in their door, on demand, with effectively zero inventory costs, adds a _HUGE_ improvement to their basic information dissemination activity which is very valuable to the library's customers. The commission on book sales would easily recoup the investment on infrastructure. A library could become a sort of a low cost competitor to Barnes and Noble. Barnes and Noble would probably do the same thing as the library, but differentiate themselves by having a better quality printer, and being able to grant 24x7 access to a bn.com server where an electronic copy of the book could be accessed by the purchaser. Ah, and the nice coffee shop, where you can read the electronic copy of your book on a nice LCD screen as it is being printed.
Other advantages:
- knowledge hidden in books would be suddenly visible and searchable
- for most people, reading a book is more natural than reading a screen
- when people buy a printed book, they retain first sale property rights
(unlike DRM'd ebook software and music licenses)
- the library could become a focal point for paper recycling efforts. For example: as part of a loyalty program, it could issue credits for old printed books that people turned in.
Disadvantages:
- paper consumption
- printer consumption
I think the only loser will be online bookstores that have no mortar component, like amazon.com
What a cool idea and, even "if" the dollar estimate is too low, who cares? $260M is chump change for our gov't.
Right now, the only way to access the stuff in LoC is to go there in person. Anyone can do it but you have to travel to WashDC and pass through security and so forth to get into the LoC public reading room. Then you have to ask the librarian to pretty-please bring you the book that you want.
Now imagine that you can access any item in the LoC by simply entering the building and using a public kiosk with a browser. LoC's software would only permit use within the copyright so that is OK. But you don't have to mess with as much security because LoC isn't handing over the physical book.
Now imagine that, from any web browser, you can access any book in the LoC for which the copyright has expired. I like that idea!
My opinion... skip the buy on the next couple of cruise missiles and digitize LoC's books instead.
Oh yeah, before I forget, LoC already has tons of seriously neat stuff online. My favorite is this collection of tons of photos from Russia. These were taken between about 1907 and 1915! I don't know about you, but I never dreamed that I would see color photos that are almost 100 years old.
Cheers,
-- Art Z.
Now imagine that, from any web browser, you can access any book in the LoC for which the copyright has expired. I like that idea!
That's the idea of Project Gutenberg. It's been around for quite some time now, and everybody is free to join their distributed proofreading network!
cpghost at Cordula's Web.
Heh. Whichever it turns out to be, the LoC, being yet another part of the federal government, will probably make it available for viewing/downloading as a single PDF file.
PDF sucks.
If a job's not worth doing, it's not worth doing right.
According to the LOC website, they have 119 million items in the library.
...so I guess we assume the rest are books and newspapers.
They tell us that there are:
4.5 million maps.
14 million 'images'
So in round numbers, let's say there are 50 million books and 50 million newspapers, periodicals, comic books, etc.
$260 million to scan all that stuff? $2.60 per book or newspaper? That seems a little unlikely. The book would have to be carried off the shelf to the scanning machine, mounted in the machine (which would clearly have to turn the pages and scan and index them 100% automatically), the title and such would probably have to be typed in manually, then the book carried back to the shelf and placed back in the correct place.
I find it hard to believe that a machine for scanning newspapers could be devised that could turn the pages automatically...but even without that, the project is still possible. At minimum wage, you'd need to pay people to scan a complete newspaper in maybe 20 minutes.
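The per-item budget and the hand-scanning labor cost can be checked with a few lines of Python. This is only a back-of-envelope sketch: the 100 million item count and the 20 minutes per newspaper are the round numbers from this comment, the $5.15/hour figure is an assumption (the US federal minimum wage at the time), and the function names are made up for illustration.

```python
# Back-of-envelope check of the scanning budget per item.
BUDGET_USD = 260_000_000        # proposed project cost
ITEMS = 100_000_000             # round number: ~50M books + ~50M periodicals
MIN_WAGE_USD_PER_HOUR = 5.15    # assumption: federal minimum wage at the time
MINUTES_PER_NEWSPAPER = 20      # guess from the comment


def per_item_budget(budget_usd, items):
    """Dollars available per item if the budget is spread evenly."""
    return budget_usd / items


def labor_cost(wage_per_hour, minutes):
    """Wage cost of one person working for the given number of minutes."""
    return wage_per_hour * minutes / 60


if __name__ == "__main__":
    budget = per_item_budget(BUDGET_USD, ITEMS)
    labor = labor_cost(MIN_WAGE_USD_PER_HOUR, MINUTES_PER_NEWSPAPER)
    print(f"budget per item:            ${budget:.2f}")
    print(f"hand-scan labor per paper:  ${labor:.2f}")
    print(f"left for everything else:   ${budget - labor:.2f}")
```

Under those assumptions, hand-scanning labor alone would take up most of the per-item budget, before shelving, indexing or storage.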
Then some significant fraction of the collection would probably be too fragile for the automatic page turning machines...the cost of hand-scanning those would be FAR more than the bulk of the books. Some books would be *so* fragile and valuable that scanning them would be a considerable expense.
Then there is the cost of the storage media. Suppose those 100 million books and newspapers had just 100 pages each on average. To get a readable image of the page you're going to need to scan at maybe 2000 x 2000 resolution. So we'll have something like 4 x 10^16 pixels, let's be generous and allow 100:1 compression ratios - and one byte per pixel. So we have about 400 terabytes. That's a lot - but to put it in context, it's less than a tenth of the amount that Google is estimated to have in their main cluster. Google spent $250 mil to buy that - so maybe only 10% or so of the LOC's budget needs to be for storage.
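The storage estimate above is easy to recheck. A minimal sketch in Python, taking the comment's own round numbers as given (100 million items, 100 pages each, 2000 x 2000 pixels per page, one byte per pixel, 100:1 compression). The function and constant names are illustrative, and decimal terabytes (10^12 bytes) are an assumption.

```python
# Rough storage estimate for page images, using the comment's round numbers.
ITEMS = 100_000_000             # books plus newspapers
PAGES_PER_ITEM = 100            # assumed average
PIXELS_PER_PAGE = 2000 * 2000   # assumed scan resolution
BYTES_PER_PIXEL = 1             # assumed 8-bit greyscale
COMPRESSION_RATIO = 100         # "generous" 100:1 compression


def storage_bytes(items, pages, pixels, bytes_per_pixel, ratio):
    """Compressed size in bytes of all scanned page images."""
    raw_bytes = items * pages * pixels * bytes_per_pixel
    return raw_bytes // ratio


def to_terabytes(n_bytes):
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return n_bytes / 10**12


if __name__ == "__main__":
    total = storage_bytes(ITEMS, PAGES_PER_ITEM, PIXELS_PER_PAGE,
                          BYTES_PER_PIXEL, COMPRESSION_RATIO)
    print(f"total pixels: {ITEMS * PAGES_PER_ITEM * PIXELS_PER_PAGE:.1e}")
    print(f"compressed:   {to_terabytes(total):.0f} TB")
```

Changing any one assumption (color scans, more pages per item, a weaker compression ratio) moves the result by a simple multiple, so the inputs are worth treating as knobs rather than facts.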
OCR'ing and indexing all that data would be an incredibly valuable thing - the extra storage is trivial and the cost can be low if you aren't in a hurry to get the project done. Just stick a few thousand PCs in a room and wait!
Dunno - $260 mil sounds like a low end estimate to me - but it seems do-able.
www.sjbaker.org
Some wild assumptions flying around here. Even if the LOC could get funding for the project, and even if the publishers did not tie it up in the courts for decades, there is still the questionable assumption everyone seems to be making -- that LOC would make the digitized texts available free. There is no reason why they should be expected to do that. In fact, they would most likely have to charge for use of copyrighted materials. The fee would necessarily include some sort of negotiated reimbursement to the copyright owner/publisher. Otherwise, publishers would just stop contributing their books to the LOC.
But, there's an even bigger fantasy involved. Does anyone really think that the right-wing protectors of our morality who are running our government would stand by and allow a government agency like the LOC to spend tax dollars scanning gothic novels, adult literature, subversive tracts, revolutionary polemics, treatises on abortion rights, non-christian religious texts, pagan and satanic epistles, books critical of the administration, etc, and making them available to the country at large? The Shrub would burst into a Burning Bush instantaneously at the idea.
The only possibility is digitization of "suitable" and "defensible" public domain items, which is already under way in piecemeal fashion.
> Barnes and Noble has a right to make money without having to compete
> with government-subsidized pseudo-businesses.
Sure they do. But I imagine Kinko's and B&N would have access to the same LoC book database, and would be able to print books for purchase too (otherwise it would be unfair to them). This just puts libraries on an even footing - libraries that don't want to sell books to the public could stay that way.
> If you want information or to read a book, go to the library.
> If you want to OWN the book, go to a bookstore.
Yup, that's the current model. Let's break down what you said:
Go to a library for this:
1. information about a book
2. to read a book
Go to a bookstore for this:
3. To own a book
One of the key reasons people use libraries is the library database, and the assurance that a book in the library database is probably "in stock" for lending out. Now if this proposal goes ahead, both Kinko's and B&N will suddenly get #1 - the best library database there can be. With #3 becoming more attractive, (book price reduction due to a larger market -- see below), one of the USPs of libraries simply isn't so anymore.
> I'd also say that the concept of really cheap books
> because of lack of physical inventory isn't guaranteed.
A book typically has a single fixed cost at the start (the authoring). After that, the more copies you sell, the more the profit.
> It certainly hasn't panned out with magazines or academic journals
That's because a journal is different from a typical book - each journal issue is like a book that comes out each month with the authoring costs paid each time, but sold to a very limited market.
Since this proposal would broaden the market immensely for both books and journals:
- size of the print run is no longer an issue
- inventory is no longer an issue
- royalties keep flowing in for longer durations
... the costs for all categories of information - both books and journals - would come down.
> There's also a big copyright issue with the whole concept of scanning in the LoC collection.
> With physical library items, only one person may have the item at a time,
> so there's no copyright issue. (No copies are being made.)
> With a digital version, multiple people can access it at one time.
Many copyright holders _want_ this to happen and are already doing this.
For instance, you can go to Amazon's A9.com site and search on Gandhi's wife:
http://a9.com/gandhi%20kasturba
(be sure to click on the books button on the left - this returns matches within a book)
Now if entire books were scanned into the LoC database, a canny person could type in the name of a book, and then "page 1", "page 2"... and so on... to essentially read the book without paying for it.
One way to secure copyright against behavior like this is by restrictions that can be imposed on both the server and the clients that are searching the book (say, the client cannot view more than 30 words surrounding each match). Amazon's restrictions seem to be that they just scan in the table of contents, not the entire book.
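To make the server-side restriction concrete, here is a minimal sketch in Python of a search that reveals only a limited window of words around each match. Everything in it is hypothetical: the `snippets` function, its `max_context` parameter and the whole-word matching rule are made up for illustration, and this is not how A9 or any real service is known to work.

```python
import re


def snippets(text, term, max_context=30):
    """Return one snippet per match of `term` in `text`.

    Each snippet shows the matched word plus at most `max_context`
    surrounding words (half before, half after), never the full text.
    """
    words = text.split()
    half = max_context // 2
    needle = term.lower()
    results = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring attached punctuation.
        if re.sub(r"\W+", "", word).lower() == needle:
            start = max(0, i - half)
            end = min(len(words), i + half + 1)
            results.append(" ".join(words[start:end]))
    return results


if __name__ == "__main__":
    page = " ".join(f"word{i}" for i in range(200)) + " Kasturba, then more text"
    for s in snippets(page, "kasturba", max_context=10):
        print(s)
```

A window like this does not by itself stop the "page 1", "page 2" trick described above. A real service would presumably also have to limit how much of any one book a single user can view across queries.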
> Finally, the upkeep cost for scanned items is huge.
Well, that scanning would only be done once, using government funds to scan it into the LoC database. The only thing a library would need would be network access to the LoC database (just like they currently do with some electronic journals and databases.)
$260 million is $1 per US citizen. A bargain if ever there was one. I suspect that this estimate is extremely low.
The hard part is, of course, proofreading. See distributed proofreading at http://www.pgdp.net/c/default.php
Let's get started on the out-of-copyright stuff NOW. Maybe by the time it is online, people will see the benefit of making everything available.
Thank You Kindly.
A couple of years ago the Harvard University Centre for Astronomy had one of its collections of technical publications scanned in order to be put online. But to make the material actually usable they had to launch a program over the net for volunteers (predominantly amateur astronomers) to view the scanned pages and enter, by hand, the necessary bibliographical information (authors, paper titles, etc), as well as to QC things (look for duplicated pages, missing pages, work out which of several scans of fold-out drawings is the best image, etc).
The scanning step was trivial (probably lots of bored students on minimum wages, getting brownie points from their professors); the INDEXING process has been going on for over 2 years now and is not yet finished.
NASA ADS at SAO: Historical scans currently in the ADS
Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"