Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
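The "good enough to use as a searchable index" goal in the plan above can be sketched in a few lines. This is only an illustration under stated assumptions: the page texts, the `build_index` and `search` helpers, and the sample misread "scsl" are all hypothetical, and the OCR step itself is assumed to have already produced one text string per page.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the sorted list of page numbers it appears on.

    `pages` is a dict of {page_number: ocr_text}; the text may contain OCR errors.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}

def search(index, query):
    """Return the pages that contain every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

# Hypothetical OCR output for three pages; "scsl" imitates a misread of "SCSI".
sample_pages = {
    1: "Scanner drivers for Linux",
    2: "OCR accuracy and scanner resolution",
    3: "Fast scsl scanners",
}
sample_index = build_index(sample_pages)
print(search(sample_index, "scanner"))
```

A misread word simply goes missing from the index (a search for "scsi" finds nothing here), which is the trade-off the poster accepts by not proofreading the OCR output.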

8 of 347 comments

  1. look online before you scan by cheesyfru · · Score: 5, Informative

    You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

  2. Go To Kinko's!!!! by thedbp · · Score: 4, Informative

    Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 to 25 cents per page plus the cost of the media to copy it to, but in large volume the cost can sometimes go down to 1 cent per page.

    Call Kinko's. Ask for the Territory Representative. They'll help you out!!!

  3. Safari is your friend by Dredd13 · · Score: 5, Informative
    If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari, which is their online book offering. It includes non-ORA books as well, actually.

    Quite useful and handy.

    D

    1. Re:Safari is your friend by Wanker · · Score: 5, Informative
      I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

      I bet about half of your books are already online.

      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. JPEG also performs relatively poorly at compressing large areas of the same color (i.e., white backgrounds). [Note for the nit-pickers: both of these JPEG issues will be reduced/eliminated in JPEG2000.]

      I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder*) GIF.

      From the Project Gutenberg "Making Etexts from Paper Originals" paper (you can bet these guys know how to scan...):

      A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.
      I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
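The point about large areas of the same color can be illustrated with the standard library alone. This is a rough sketch, not a JPEG-versus-PNG benchmark: it builds a hypothetical synthetic "page" (white background, black bars standing in for text) and runs it through DEFLATE, the lossless compression method PNG is built on. The page dimensions and the bar pattern are assumptions chosen for illustration, and a real PNG encoder also applies per-row filtering that is not modeled here.

```python
import zlib

WIDTH, HEIGHT = 600, 800  # hypothetical small page, one byte per grayscale pixel

def make_page():
    """Build a synthetic page: white background with black bars as stand-in text."""
    rows = []
    for y in range(HEIGHT):
        row = bytearray([255]) * WIDTH  # one white scan line
        # Rows inside a "text line": 8 inked rows out of every 20, with margins.
        if y % 20 < 8 and 50 <= y < HEIGHT - 50:
            for x in range(50, WIDTH - 50, 12):
                row[x:x + 6] = b"\x00" * 6  # a 6-pixel black stroke
        rows.append(bytes(row))
    return b"".join(rows)

page = make_page()
compressed = zlib.compress(page, 9)   # lossless DEFLATE at maximum effort
restored = zlib.decompress(compressed)
ratio = len(page) / len(compressed)
print(f"raw: {len(page)} bytes, deflate: {len(compressed)} bytes, ratio: {ratio:.0f}x")
```

Because the compression is lossless, `restored` is byte-for-byte identical to `page`, so the hard edges of the strokes survive exactly. Real scans contain noise and will compress far less well than this idealized page.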

  4. check sane by walt-sjc · · Score: 4, Informative

    Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The database on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.

    JPEG also sucks for this. JPEG is best for full-color images like photographs. You're better off using TIFF or PNG. Most OCR software will require TIFF. I don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. TextBridge from Xerox isn't bad for the money.

  5. We do this all the time at the office...... by diorio · · Score: 4, Informative

    .....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an in-house FTP server. It can scan at 300 or 600 DPI, and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan-back feature costs extra, so not everyone spent the money on it....but about 1000 Kinko's have these in house, and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)

    --
    Ignored Since 1973
  6. Re:are you sure you want to do this? by Hallow · · Score: 4, Informative

    What he's probably looking for is something like PDF. You can leave the image on the front (i.e., it's what shows up in Acrobat Reader), and Adobe's OCR reads the document and indexes it for searches. The problem with this is that you wind up with big PDFs of poor quality.

    Where I work, we tried to turn a book that we no longer had an electronic copy of into a PDF, keeping the images up front with OCR text behind, about 300 pages altogether. Even with max compression and the lowest acceptable DPI (300, I think), the PDF came out to 95MB. It didn't help that we scanned the book page by page and generated the PDF by hand on a slow HP general consumer model scanner, either. (The initial PDF took over 120 hours to produce, with rescans and OCR'ing and everything.)

    We wound up taking the Acrobat OCR'd text (it was better than the off-the-shelf OCR package we had at the time) via the Adobe accessibility website, and fixing it up. It was a pretty big project.

    We recently hired a document imaging company to PDF a lot of smaller historical documents for us, and that has worked out well. It's kind of pricey, but we also paid them to proof the ocr behind the images, and to hand adjust the images for appearance. It's worked out rather well.
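To put the figures from the comment above in per-page terms (about 300 pages, 95MB, over 120 hours), here is a small back-of-the-envelope calculation. The page count, file size, hours, and 300 DPI come from the comment; the US Letter page size and the `raw_page_bytes` helper are assumptions added for illustration, since the book's actual dimensions are not stated.

```python
# Figures quoted in the comment above.
PAGES = 300
PDF_MEGABYTES = 95
HOURS = 120  # the comment says "over 120", so this is a lower bound

# Assumptions for illustration only: US Letter pages, uncompressed pixels.
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 8.5, 11
DPI = 300

def raw_page_bytes(bits_per_pixel, dpi=DPI):
    """Uncompressed size in bytes of one scanned page at the given bit depth."""
    pixels = round(PAGE_WIDTH_IN * dpi) * round(PAGE_HEIGHT_IN * dpi)
    return pixels * bits_per_pixel // 8

kb_per_page = PDF_MEGABYTES * 1024 / PAGES
minutes_per_page = HOURS * 60 / PAGES

print(f"PDF size per page: {kb_per_page:.0f} KB")
print(f"Effort per page: at least {minutes_per_page:.0f} minutes")
for label, bits in [("1-bit line art", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]:
    print(f"Raw {label} page at {DPI} DPI: {raw_page_bytes(bits) / 1e6:.1f} MB")
```

Under these assumptions the PDF works out to roughly 324 KB and at least 24 minutes of work per page, which goes some way toward explaining why the poster later paid a document imaging company instead.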

  7. 4DigitalBooks 900 pages/hour - or do it yourself by jukal · · Score: 4, Informative

    I do not have any experience with their products, but the solution offered by this company seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.

    Now, if I were to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is finding a cheap enough way to turn pages automatically. See also Kris Mckenzie's automatic page turner; still, the best start is this document, which is a proposal and overview on how to create an automatic page turner from pieces. The total cost is $459.
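For a rough sense of scale, the two scan rates quoted in this thread (900 pages/hour for the page-turning system, 65 pages/minute for the DC265ST copier) can be turned into total scanning time. This is a minimal sketch: the 20,000-page collection size and the `hours_to_scan` helper are hypothetical, and the rates are nominal figures that ignore cutting bindings, loading paper, rescans, and OCR.

```python
def hours_to_scan(total_pages, pages_per_hour):
    """Nominal scanning time in hours, ignoring all handling overhead."""
    return total_pages / pages_per_hour

# Nominal rates as quoted in the thread, converted to pages per hour.
RATES_PAGES_PER_HOUR = {
    "page-turning scanner (900 pages/hour)": 900,
    "sheet-fed copier (65 pages/minute)": 65 * 60,
}

COLLECTION_PAGES = 20_000  # hypothetical collection size, not from the thread

for name, rate in RATES_PAGES_PER_HOUR.items():
    print(f"{name}: {hours_to_scan(COLLECTION_PAGES, rate):.1f} hours")
```

The copier is faster on paper, but it presumably needs loose sheets, so the comparison only holds if you are willing to cut the bindings off.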