Linux and OSS to Aid the Library of Congress
flakeman2 writes with a link to a Linux.com article about Linux's new role at the Library of Congress. The national archive of books is looking to begin an ambitious digitization project, aimed at getting some rare and crumbling documents into the public record online. These will include "Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin. According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an 'absolutely critical' role in getting the job done. The main component is Scribe, a combination of hardware and free software. 'Scribe is a book-scanning system that takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable,' says Kahle." Linux.com and Slashdot.org are both owned by OSTG.
For the past few months Microsoft has been dispatching crack teams of special operatives into the past to alter the course of American History for their benefit, in hopes of eventually transforming the United States into the New Microsoft Empire. But little do they know, a world-weary Librarian and Ex-Marine at the Library of Congress won't stand for that shit. He's put together a team of agents in hopes of reversing the damage to the timestream before it becomes irreversible. Together with Agents Linus Torvalds (Technology Specialist - Special power: x-ray glasses), Donald Trump (Logistics Specialist - Special power: nuclear fusion comb-over) and Stephen Hawking (Quantum Physics Specialist - Special power: medusa glare), he just may be the only hope for American History's future.
using namespace slashdot;
troll::post();
and got "Dear Blair, let's set so double the killer delete select all ..."
Suffice to say, they settled with Linux. The Microsoft version had psychic powers, apparently!
If you keep throwing chairs, one day you'll break windows....
more info in this interesting article http://www.nytimes.com/2007/03/10/business/yourmoney/11archive.html?ei=5090&en=9bf0874841a9d705&ex=1331182800&adxnnl=1&partner=rssuserland&emc=rss&
RaTFA (note the lowercase "a" for "all")
"the Internet Archive has migrated Scribe entirely to Linux, and Windows support has been dropped."
Seems focused on Linux to me.
ERR 411[Max number of witty sigs reached]
So many technologies have been made specifically to hold libraries of congress.
This Wiki Feeds You TV and Anime - vidwiki.org
Hooray! Linux does something useful! Let's write an article about it! Hooray! Electricity did something useful too! Why isn't there an article about how great electricity is at all this too? Insert your favorite item to rave for the next story. I realize this is a pro-Linux site and all, but come on guys, it gets unnecessary praise and accolades every time it does something. It's like a bunch of Linux geeks sitting around and mutually masturbating each other, slapping each other on the backs, and then telling each other how cool they are. Step back a moment and look at how ridiculous this all is...
this has got to cause some flying chairs in Redmond.
Arguably one of the most important repositories of information in the U.S. is about to be available via OSS software and not MS products. For all the efforts that MS put out in Mass. this has got to be a kick in the face! Just wow!
Support NYCountryLawyer RIAA vs People
The revisions to the law would not infringe freedom of speech; in fact, by allowing the free copying of works that did not further the arts or the sciences, they would limit copyright's impact upon freedom of speech. If people are really concerned about the quality of content, they should remember that eliminating the profit motive will have a substantial impact upon the amount of questionable content that is out there, including movies, music, pictures and literature. Most of the members of the RIAA and the MPAA have a total disregard for the harm their content causes to society. Let them feel some of the pain: wipe out the copyright protections on some of their more divisive content ;).
Chaos - everything, everywhere, everywhen
I am a Christian, as were the Founding Fathers, who established this country as a nation under GOD. They would be, as am I, deeply offended and disgusted to see the homosexual communist software known as Linux used in the hallowed halls of the Library of Congress.
As the article says, the OCR itself is still done with proprietary software. I wonder if Google is using Tesseract for their digitization efforts. It would be cool if the original raw scanned images could also be archived and available for download - then you could print your own copy of the book, check the OCR for errors, or even do some weird genetic algorithm thing to make a LaTeX style that typesets the text in the same format as the original book.
-- Ed Avis ed@membled.com
How many libraries of congress is that?
Does anyone have more information on this 'Scribe' system? Google fails me.
All rites reversed 2010
The leader of Microsoft's team of time-travelling commandos
This is absolutely cool. There are a lot more places where "brittle books" are lying around, waiting to be digitized and distributed to the whole world. And as the technologies used in this project are going to be refined and improved, and eventually released, everyone will benefit.
The question now is: would they accept technical contributions from the public (I mean, OS geek communities), just like other open source projects? I know a lot of people would be eager to join. How about a SETI-like system to harness the power of desktop computers around the world to help with image processing and OCR? Hey, I got 4 decent desktop computers that can contribute at least 8 hours/day each.
How much data is needed? Of course, it would be necessary to have this number in a useful system of units. Perhaps the number of Libraries of Congress of data?
"Sure there's porn and piracy on the Web but there's probably a downside too."
It would be neat if they did some other type of scanning, such as laser scanning the exterior, so that the book's heft and presence can be reproduced in the future. I've printed out crisp, high-resolution PDFs of user manuals, and having three hundred pages of printer paper binder clipped together just isn't the same as a nice, perfect-bound manual with a glossy cover.
Sure, I imagine most of the consumption in the future will be done in a digital environment, but it would be nice if future generations had the option of popping the file into whatever will pass for a replicator and getting a decent representation of a long-vanished physical object, especially since the technology exists and the incremental memory needed is fairly trivial.
Marc Siry || interactive media professional, motorcycle enthusiast ||
Eventually we will have no physical record of these writings and may someday learn from the digital copies that Benjamin Franklin, George Washington, and others had offered enthusiastic support for wiretapping and other forms of electronic surveillance.
Why bother.
But... where in the world is she?!
When the posters fear their moderators, there is tyranny; when the moderators fear the posters, there is liberty.
Not sure of the details, but wasn't Google trying to scan books into an online searchable format and got sued by publishers? Maybe Google should look at making a donation in time, effort and money to the Library of Congress and maybe get a huge tax writeoff. If it's in the possession of the Library of Congress in a searchable format online, what are the publishers going to say then? Obviously Google can't donate unauthorized copies, but they can donate software engineering, scanning services, hosting services, bandwidth and money, or can they? Would be amusing to watch how government officials line up on such a discussion.
The OCR software from Scribe is still closed source.
Posted the above without having read the New York Times article that previous poster linked which indicates Google has made some donations to such. Would still be interesting to see them come out stronger on it though.
What OCR-Engine do they use?
I'd totally buy that game for my Xbox360 or "Windows Vista" based home gaming computer.
Yeah, but how many Libraries of Congress will... eh... never mind.
bla
...has been done before. The Domesday Book was digitized by the BBC Domesday Project. Unfortunately, it ended up being a comic rather than a technical triumph. People must remember that, no matter how low-tech a method of physical data storage seems, it's more reliable in the long run than data storage relying on complex technology. I'm not against digitization by any means, of course: it could be useful as a research tool, and as an alternate method of access. It shouldn't be viewed as a long-term archival project, however. If the documents are really "disintegrating," take high-quality images of the pages and print them in good ink on acid-free archival paper (or vellum).
Project Gutenberg uses plenty of scans from American Memory to make their etexts--they do pretty much what you describe. At the lowest level, they make a plaintext copy, but they also do formatting and in-text hyperlinking: for instance, linking footnotes to their references, or index page numbers to anchors in the text. (See the HTML version of this etext to see what I mean.) Browse to a random book from this random collection, and you'll see what the LoC provides for their collections currently. As Brewster Kahle will be involved, you might want to see what projects he's done and how they're provided: a random book from the Million Book Project is available as a DjVu document, as well as (badly) OCR'd text.
Laws do not persuade just because they threaten. --Seneca
The books I've looked at have been scanned at a resolution that's more or less adequate for OCR, but isn't really adequate for reproducing fine woodcuts, and is hopeless at metal engravings. I've found from my work on fromoldbooks.org that anything less than 1200 dpi generally produces pretty poor results for images, so that, for example, you can't read the signatures of the artist and engraver, still less compare engraving styles. It would be sort of like having a paraphrase of the text instead of the actual words.
It does, of course, vary a lot depending on the style of image. Bold illustrations for children's books, for example, do better at, say, 800dpi greyscale or colour. Fine steel engravings with lines at, say, less than a tenth of a degree from horizontal (they were done by hand after all) and that come out only a couple of pixels wide even at 1200dpi just turn into gray mush with weird banding artefacts until you go to a higher resolution (I use 2400dpi). There's a widely-cited study indicating that an "ultra-high" scan resolution of 400dpi is more than sufficient, based on an extremely small sample of images.
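To put rough numbers on the resolution point, here is a minimal Python sketch of the arithmetic. It assumes an engraved line about 0.002 inch (roughly 0.05 mm) wide; that figure and the names `LINE_WIDTH_IN` and `pixels_across` are illustrative assumptions, not measurements from any particular book.

```python
# Rough arithmetic: how many pixels wide is a fine engraved line
# at a given scan resolution?
# LINE_WIDTH_IN is an assumed, illustrative value, not a measurement.
LINE_WIDTH_IN = 0.002  # assumed line width in inches (~0.05 mm)


def pixels_across(line_width_in: float, dpi: int) -> float:
    """Width of a line in pixels when scanned at the given dots per inch."""
    return line_width_in * dpi


if __name__ == "__main__":
    for dpi in (400, 800, 1200, 2400):
        width_px = pixels_across(LINE_WIDTH_IN, dpi)
        print(f"{dpi:>5} dpi: {width_px:.1f} px")
```

Under that assumption the line is narrower than one pixel at 400dpi, a little over two pixels at 1200dpi, and nearly five at 2400dpi, which is consistent with fine engravings only starting to resolve as distinct strokes at the higher settings.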
The damage that's done by poor quality digitization is that it makes it harder to justify doing a better job in the future.
Live barefoot!
free engravings/woodcuts
If I had mod points, I'd use them all to mark this +5 hilarious.
Self-referential Sigs are cool on /. these days...
54
Poor Benjamin Franklin is about to be deprived of the legitimate compensation owed him. How is the fellow supposed to make a living? He's only been dead 217 years. Surely copyright on his works ought to be retained for a while yet. Now those communist pinko linux f@9$ have really gone over the line.
Just in case you're immeasurably thick.
Perscriptio in manibus tabellariorum est.
Ahh, but if they had saved the digitised images using openly specified formats, rather than some obscure format, they would not have had too much of a problem reading the images.
This sig is intentionally blank
like this?
http://sourceforge.net/projects/xena/
Is that these documents will be made available on iTunes for $0.99 at roughly 80% digital accuracy and for $1.29 at around 92% digital accuracy. If you want 100% digital accuracy, just get a Library of Congress card, check it out, and copy it yourself.
That's true, but the media would have needed to be updated to CD-ROM (and DVD?) anyway. Remember, at the time, truecolor displays were very uncommon, and JPEG wasn't yet specified.