Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Cut the binding off each book. Feed the loose pages into a duplex scanner with a cut-sheet feeder. Scan at 300 DPI as compressed JPEG. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"

1 of 347 comments

  1. searchable text versus scanned images by pomakis · Score: 2, Redundant
    The first question you'll want to ask yourself is whether you want the result in searchable text form or scanned image form. Searchable text is achievable with OCR (optical character recognition) software, but has at least two issues:

    • OCR software isn't perfect, so errors will occur that you'll either have to live with or correct manually. Good OCR software does some validating against a dictionary, but this doesn't help when the source is highly mathematical, etc.
    • You'll lose figures, diagrams and pictures.

    Scanned images solve these problems, but have two problems of their own:

    • They're not searchable.
    • They're bulky (perhaps 100x the size of the same pages stored as text).
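
    To put a rough number on the size gap, here is a back-of-the-envelope sketch. Every input is an assumption chosen for illustration, not a measurement: US Letter pages, 8-bit grayscale, a 20:1 JPEG compression ratio, and about 3,000 characters of text per page.

```python
def page_image_bytes(width_in, height_in, dpi, bytes_per_pixel, compression_ratio):
    """Estimate the size of one scanned page after lossy compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bytes_per_pixel / compression_ratio

# Assumed inputs (illustrative only): 8.5 x 11 inch page, 300 DPI,
# 1 byte per pixel (grayscale), 20:1 JPEG compression.
image_bytes = page_image_bytes(8.5, 11, 300, 1, 20)

# Assumed plain-text size of the same page: ~3,000 one-byte characters.
text_bytes = 3000

ratio = image_bytes / text_bytes
print(f"image: {image_bytes:,.0f} bytes, text: {text_bytes:,} bytes, ratio: {ratio:.0f}x")
```

    Under these assumptions the image comes out on the order of a hundred times larger than the text; colour scans or lighter compression would push the ratio higher.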

    Perhaps a hybrid solution exists, but I suspect such a solution will require a lot of manual intervention and tweaking, something you'll want to avoid if your goal is to digitize several books.
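
    One possible hybrid, in line with the original poster's "good enough to use as a searchable index" goal, is to keep the page images for reading and use the imperfect OCR text only to find the right page. The sketch below assumes the OCR step has already produced plain text for each page; the function names and file names are hypothetical, and no particular scanner or OCR package is implied.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(pages):
    """Map each word to the set of page-image files whose OCR text contains it.

    `pages` is a dict of {image_filename: ocr_text}; both are assumed inputs.
    """
    index = defaultdict(set)
    for image_name, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(image_name)
    return index

def search(index, query):
    """Return the page images that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR output for two scanned pages.
pages = {
    "book1_p001.jpg": "Fourier transform of a periodic signal",
    "book1_p002.jpg": "The Laplace transform and its inverse",
}
index = build_index(pages)
print(search(index, "fourier transform"))
print(search(index, "transform"))
```

    Because the reader always views the original image, OCR mistakes only cost a missed search hit, and figures and equations are never lost. The images remain as bulky as before.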