Slashdot Mirror


Linux and OSS to Aid the Library of Congress

flakeman2 writes with a link to a Linux.com article about Linux's new role at the Library of Congress. The national archive of books is looking to begin an ambitious digitization project, aimed at getting some rare and crumbling documents into the public record online. These will include "Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin. According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an 'absolutely critical' role in getting the job done. The main component is Scribe, a combination of hardware and free software. 'Scribe is a book-scanning system that takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable,' says Kahle." Linux.com and Slashdot.org are both owned by OSTG.

4 of 63 comments

  1. Re:Hmm... by rednuhter · · Score: 4, Informative

    RaTFA (note the lowercase "a" for "all")
    "the Internet Archive has migrated Scribe entirely to Linux, and Windows support has been dropped."
    Seems focused on Linux to me.

    --
    ERR 411[Max number of witty sigs reached]
  2. Re:Scribe? by rs232 · · Score: 2, Informative
    --
    davecb5620@gmail.com
  3. Quality as well as quantity, please by Ankh · · Score: 3, Informative

    The books I've looked at have been scanned at a resolution that's more or less adequate for OCR, but isn't really adequate for reproducing fine woodcuts, and is hopeless for metal engravings. I've found from my work on fromoldbooks.org that anything less than 1200dpi generally produces pretty poor results for images, so that, for example, you can't read the signatures of the artist and engraver, still less compare engraving styles. It would be sort of like having a paraphrase of the text instead of the actual words.

    It does, of course, vary a lot depending on the style of image. Bold illustrations for children's books, for example, do better at, say, 800dpi greyscale or colour. Fine steel engravings with lines at, say, less than a tenth of a degree from horizontal (they were done by hand after all) and that come out only a couple of pixels wide even at 1200dpi just turn into grey mush with weird banding artefacts until you go to a higher resolution (I use 2400dpi). There's a widely cited study, based on an extremely small sample of images, which concluded that an "ultra-high" scan resolution of 400dpi is more than sufficient.
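
    The arithmetic behind those figures can be sketched roughly as follows. This is only an illustration, not anything from the Scribe software: the function names are made up, and the engraved line width is an assumption derived from the "couple of pixels wide even at 1200dpi" remark above (taken here as exactly two pixels, about 0.04 mm).

```python
import math

MM_PER_INCH = 25.4

# Assumption: an engraved line that covers exactly two pixels at 1200dpi.
LINE_WIDTH_MM = 2 / 1200 * MM_PER_INCH


def pixels_across(line_width_mm: float, dpi: int) -> float:
    """Number of pixels that span a line of the given physical width."""
    return line_width_mm / MM_PER_INCH * dpi


def stairstep_run(angle_degrees: float) -> float:
    """Horizontal pixels covered before a near-horizontal line
    shifts by one pixel vertically (the source of banding)."""
    return 1.0 / math.tan(math.radians(angle_degrees))


if __name__ == "__main__":
    for dpi in (400, 800, 1200, 2400):
        width = pixels_across(LINE_WIDTH_MM, dpi)
        print(f"{dpi:>5}dpi: line is {width:.2f} pixels wide")

    run = stairstep_run(0.1)
    print(f"0.1 degree tilt: one-pixel step roughly every {run:.0f} pixels")
```

    Under that assumption the line is narrower than a single pixel at 400dpi, so it can only show up as a faint grey smear, and a tilt of a tenth of a degree makes it cross a pixel row only about once every several hundred pixels, which is one plausible source of the long bands described above.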

    The damage that's done by poor quality digitization is that it makes it harder to justify doing a better job in the future.

    --
    Live barefoot!
    free engravings/woodcuts