Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI JPEG with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
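
For a sense of scale, here is a back-of-the-envelope sketch of how much raw data the plan above produces before any compression. The page size (US Letter, 8.5 x 11 inches) and the 500-page book are assumptions for illustration, not figures from the question, and the calculation ignores row padding and file headers.

```python
# Rough uncompressed size of a scanned page at a given resolution.
# Assumed for illustration: US Letter pages and a hypothetical 500-page book.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes (no padding, no headers)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

PAGES = 500  # hypothetical book length

for label, bpp in [("1-bit line art", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]:
    per_page = raw_page_bytes(8.5, 11, 300, bpp)
    per_book = per_page * PAGES
    print(f"{label:16s} {per_page / 1e6:6.2f} MB/page  {per_book / 1e9:6.2f} GB/book")
```

The gap between the bit depths is why the choice of scan mode and compression format matters as much as the scanner itself.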

3 of 347 comments

  1. look online before you scan by cheesyfru · · Score: 5, Informative

    You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

  2. Safari is your friend by Dredd13 · · Score: 5, Informative
    If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari, which is their online book offering. It actually includes non-ORA books as well.

    Quite useful and handy.

    D

    1. Re:Safari is your friend by Wanker · · Score: 5, Informative
      I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

      I bet about half of your books are already online.

      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. JPEG also performs relatively poorly at compressing large areas of the same color (e.g. white backgrounds). [Note for the nit-pickers: both of these JPEG issues will be reduced/eliminated in JPEG2000.]

      I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder*) GIF.

      From the Project Gutenberg "Making Etexts from Paper Originals" paper: (You can bet these guys know how to scan...)

      "A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG."

      I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
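
To illustrate the point about lossless formats and large uniform areas, here is a minimal sketch using DEFLATE (the compression method PNG is built on) via Python's standard zlib module. It is not a PNG encoder, and the "page" is a made-up synthetic pattern of white rows and regular dark strokes, so the ratio it prints is far more optimistic than a real, noisy scan would give. The dimensions and stroke pattern are assumptions chosen only for illustration.

```python
import zlib

# Hypothetical small 8-bit grayscale "page": white background with bands of
# regularly spaced dark strokes standing in for lines of text.
WIDTH, HEIGHT = 600, 800

def synthetic_page():
    white_row = bytes([255]) * WIDTH
    # One "text" row: 50 px margins, alternating 3 px strokes and 5 px gaps.
    stroke = bytes([0]) * 3 + bytes([255]) * 5
    body = (stroke * ((WIDTH - 100) // len(stroke) + 1))[: WIDTH - 100]
    text_row = bytes([255]) * 50 + body + bytes([255]) * 50
    rows = []
    for y in range(HEIGHT):
        # Text bands are 10 rows tall, repeated every 20 rows, inside margins.
        in_text_line = 50 <= y < HEIGHT - 50 and (y % 20) < 10
        rows.append(text_row if in_text_line else white_row)
    return b"".join(rows)

page = synthetic_page()
packed = zlib.compress(page, 9)      # lossless DEFLATE, maximum effort
restored = zlib.decompress(packed)

print("raw bytes:       ", len(page))
print("compressed bytes:", len(packed))
print("lossless round trip:", restored == page)
```

The round trip returns every pixel exactly, which is the property that keeps text edges sharp; JPEG gives that up in exchange for smaller files on photographic content.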