Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
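
The "good enough to use as a searchable index" part of the plan can be sketched in a few lines. This is only an illustration, not a recommendation of any particular OCR tool: it assumes the OCR step has already produced plain text for each page, and the `pages` dict, `build_index`, and `search` are hypothetical names made up for the example.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    `pages` is a hypothetical dict of page number -> OCR text for that page.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(nums) for word, nums in index.items()}

def search(index, query):
    """Return the pages that contain every word in the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Made-up sample text; page 2 contains a simulated OCR error ("l0st").
pages = {
    1: "The TCP handshake uses SYN and ACK packets.",
    2: "UDP has no handshake; packets may be l0st.",
    3: "Retransmission recovers lost TCP packets.",
}
index = build_index(pages)
print(search(index, "tcp packets"))
print(search(index, "handshake"))
print(search(index, "lost"))
```

A misread word such as "l0st" only hides that one word on that one page; everything else on the page stays findable, which is why imperfect OCR can still be usable for lookup while the scanned image remains the thing you actually read.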

10 of 347 comments

  1. look online before you scan by cheesyfru · · Score: 5, Informative

    You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

  2. Safari is your friend by Dredd13 · · Score: 5, Informative
    If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari, which is their online book offering. It includes non-ORA books as well, actually.

    Quite useful and handy.

    D

    1. Re:Safari is your friend by Wanker · · Score: 5, Informative
      I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

      I bet about half of your books are already online.

      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (e.g. white backgrounds). [Note for the nit-pickers: both of these JPEG issues will be reduced/eliminated in JPEG2000.]

      I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder*) GIF.

      From the Project Gutenberg "Making Etexts from Paper Originals" paper (you can bet these guys know how to scan...):

      A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.
      I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
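
The point about lossless formats and white backgrounds is easy to try with nothing but the standard library. PNG compresses its pixel data with DEFLATE, which is what Python's `zlib` module exposes, so the sketch below runs `zlib` over a synthetic 1-bit page. The page size, the fake "text" pattern, and the row layout are all assumptions made for illustration; a real scan contains noise and will compress less well.

```python
import zlib

WIDTH, HEIGHT = 2550, 3300        # assumed: 8.5 x 11 inch page at 300 DPI
ROW_BYTES = (WIDTH + 7) // 8      # 1 bit per pixel, 8 pixels packed per byte

WHITE_ROW = bytes([0xFF]) * ROW_BYTES

# Crude stand-in for a scanline crossing a line of text:
# short runs of black pixels (0 bits) between white gaps.
_PATTERN = bytes([0xFF, 0xC3, 0xFF, 0x0F, 0xFF, 0xFF])
TEXT_ROW = (_PATTERN * (ROW_BYTES // len(_PATTERN) + 1))[:ROW_BYTES]

def synthetic_page():
    """Build a fake page: 20 'text' rows followed by 30 white rows, repeated."""
    rows = []
    for y in range(HEIGHT):
        rows.append(TEXT_ROW if y % 50 < 20 else WHITE_ROW)
    return b"".join(rows)

page = synthetic_page()
packed = zlib.compress(page, 9)   # 9 = maximum compression level

print("raw bytes:   ", len(page))
print("packed bytes:", len(packed))
print("lossless:    ", zlib.decompress(packed) == page)
```

Decompression returns the original bytes exactly, so hard black/white edges survive untouched, and the long runs of identical white pixels are where most of the savings come from.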

  3. As Krow always says... by bdesham · · Score: 5, Funny

    You can't grep a dead tree.

    --
    Alcohol and Calculus don't mix. Don't drink and derive.
  4. 100 pounds? by NineNine · · Score: 5, Funny

    That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.

    1. Re:100 pounds? by zulux · · Score: 5, Funny

      That's it? Jesus, what are you, a 12 year old girl?

      Girl? On Slashdot?

      Woah!

      --

      Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.

    2. Re:100 pounds? by mikeage · · Score: 5, Funny

      Jesus, what are you, a 12 year old girl

      To the best of my knowledge, Jesus was not a 12 year old girl.

      --
      -- Is "Sig" copyrighted by www.sig.com?
  5. Let me get this straight... by deacon · · Score: 5, Insightful
    You are going to cut up thousands of dollars worth of your "essential" books?

    And put them into an inferior visual format you cannot read unless the computer is working and switched on?

    And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.

    All this just so you don't have to make 3 trips to move your books?

    Mmmkayyy.. (backs away slowly)

    Have you ever heard of a dolly?

  6. Re:Do you really need them? by Waffle+Iron · · Score: 5, Insightful
    Do they actually have time to read them? Or are they more for show?

    Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.

    In a way, huge bookshelves with hundreds of books were a status symbol, showing that you'd been around a while and a lot of people thought it was worthwhile to give you books. It was useful to have all of that info available, but few people actually used more than 1% of the data that was on their shelves.

    The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.

    I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.

  7. You *need* to be aware of OpenDJVu by Effugas · · Score: 5, Interesting

    Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's a significant amount of open source code for it, and more.

    It's truly a brilliant format. Go check it out.

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
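
To put the "10-30K a page" figure in perspective, here is some back-of-the-envelope arithmetic. The 10-30K range is the only number taken from the comment above; the letter-size page, the 500-page book, and reading "K" as 1000 bytes are assumptions for the sake of the example.

```python
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0   # assumed page size in inches
DPI = 300                          # scan resolution from the original plan

def raw_page_bytes(bits_per_pixel):
    """Uncompressed size of one scanned page at the given bit depth."""
    pixels = round(PAGE_W_IN * DPI) * round(PAGE_H_IN * DPI)
    return pixels * bits_per_pixel // 8

def book_size_mb(page_count, bytes_per_page):
    """Total size in megabytes (1 MB = 1,000,000 bytes)."""
    return page_count * bytes_per_page / 1_000_000

BOOK_PAGES = 500                   # assumed length of a typical tech book

print("raw 1-bit page, bytes:   ", raw_page_bytes(1))
print("raw 8-bit gray page, bytes:", raw_page_bytes(8))
print("book, raw 8-bit gray, MB: ", book_size_mb(BOOK_PAGES, raw_page_bytes(8)))
print("book at 10K/page, MB:     ", book_size_mb(BOOK_PAGES, 10_000))
print("book at 30K/page, MB:     ", book_size_mb(BOOK_PAGES, 30_000))
```

Under those assumptions a whole book lands somewhere between 5 and 15 MB at the quoted per-page sizes, against several gigabytes for the same pages stored as uncompressed grayscale.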