Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"

35 of 347 comments

  1. look online before you scan by cheesyfru · · Score: 5, Informative

    You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

    1. Re:look online before you scan by jonbrewer · · Score: 3, Informative

      O'Reilly actually sells electronic editions of their books, so please buy them! You can also subscribe and read many of their books online. Also a good idea.

      (I personally like my dead tree O'Reilly books, and will stick with them until I have a really hi-res lcd to read electronic versions with.)

  2. An easier solution. by SystemFork · · Score: 4, Funny

    Lots of college students at $5/hour.

    --
    Slogan-free since April! We pass the savings on to you!
  3. Go To Kinko's!!!! by thedbp · · Score: 4, Informative

    Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 - 25 cents per page plus the cost of the media to copy it to, but in large volume, the cost can sometimes go down to 1 cent per page.

    Call Kinko's. Ask for the Territory Representative. They'll help you out!!!

    1. Re:Go To Kinko's!!!! by Microsift · · Score: 4, Interesting

      I seriously doubt Kinko's would do this. They are ultra-paranoid about violating copyright. I imagine if you could do it at Kinko's, you'd have to do all the work yourself in the Self-Service area. I doubt they have machines like that in self-service.

      --
      My other sig is extremely clever...
  4. monkeys by blugecko · · Score: 4, Funny

    hire an infinite amount of monkeys on typewriters and... oh wait, that is for shakespeare

    --
    Lysergic Acid Diethylamide, not just chemistry, reality!
  5. Safari is your friend by Dredd13 · · Score: 5, Informative
    If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari, which is their online book offering. It also includes non-ORA books as well, actually.

    Quite useful and handy.

    D

    1. Re:Safari is your friend by Wanker · · Score: 5, Informative
      I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

      I bet about half of your books are already online.

      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. It also performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds). [Note for the nit-pickers: both of these JPEG issues will be reduced/eliminated in JPEG2000.]

      I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder*) GIF.

      From the Project Gutenberg "Making Etexts from Paper Originals" paper: (You can bet these guys know how to scan...)

      A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.
      I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.

    2. Re:Safari is your friend by Dredd13 · · Score: 4, Insightful
      That's nice, but why would he want to pay a monthly fee to rent books he already owns?

      Because there's something very nice about having access to your 30-odd book collection from home, office, conference, at a job-site, etc. etc., without dragging along 40 pounds of books with you everywhere you go.

      It's a convenience you pay for. Considering how many ORA books many people pay for (and keep current as new editions come out), the annualized cost of simply subscribing and NOT buying the dead-tree version at all is very appealing to some folks, especially if their lifestyle has them wanting ready access to the material "from lots of different places".

  6. As Krow always says... by bdesham · · Score: 5, Funny

    You can't grep a dead tree.

    --
    Alcohol and Calculus don't mix. Don't drink and derive.
  7. 100 pounds? by NineNine · · Score: 5, Funny

    That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.

    1. Re:100 pounds? by zulux · · Score: 5, Funny

      That's it? Jesus, what are you, a 12 year old girl?

      Girl? On Slashdot?

      Woah!

      --

      Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.

    2. Re:100 pounds? by mikeage · · Score: 5, Funny

      Jesus, what are you, a 12 year old girl

      To the best of my knowledge, Jesus was not a 12 year old girl.

      --
      -- Is "Sig" copyrighted by www.sig.com?
  8. Do you really need them? by alt.sex.fetish.jesus · · Score: 4, Insightful

    I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?

    Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.

    As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.

    1. Re:Do you really need them? by sphealey · · Score: 4, Insightful
      I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?
      Because once you have developed the skill of processing technical books/documentation, you can scan through them and pick up critical information rapidly - far faster than you could click through them as hypertext.

      Case in point: I recently took a position where I had to do some work with Oracle, which I had not used previously. After some skimming at B&N, I purchased 5 good texts. A lot of pages, but when you need to figure something out you can open 2 or 3 of them, mark multiple pages, and get the outline of what you need very quickly.

      sPh

    2. Re:Do you really need them? by Waffle+Iron · · Score: 5, Insightful
      Do they actually have time to read them? Or are they more for show?

      Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.

      In a way, huge bookshelves with hundreds of books were a status symbol, showing that you'd been around a while and a lot of people thought it was worthwhile to give you books. It was useful to have all of that info available, but few people actually used more than 1% of the data that was on their shelves.

      The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.

      I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.

  9. check sane by walt-sjc · · Score: 4, Informative

    Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The DB on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.

    JPEG also sucks for this. JPEG is best for full color images like photographs. Better off using TIFF or PNG. Most OCR software will require TIFF. Don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. Textbridge from Xerox isn't bad for the money.

    1. Re:check sane by josepha48 · · Score: 3, Informative
      There is gocr or jocr -> http://jocr.sourceforge.net/

      Also there are a few commercial ones. However, scanned-to-text conversion needs at least 600dpi and is only going to have about 97% accuracy.

      --

      Only 'flamers' flame!

  10. We do this all the time at the office...... by diorio · · Score: 4, Informative

    .....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an FTP server inhouse. It can scan at 300 or 600 DPI and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan back feature costs extra on the device, so not everyone spent the money on that feature....but about 1000 Kinko's have these in house and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)

    --
    Ignored Since 1973
  11. Try one of these... by matthew.thompson · · Score: 3, Interesting
    Canon DR-5020

    Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.

    Please remember to make decent backups on a long lasting medium with a high chance of recoverability. Failing that, place the loose leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.

    --
    Matt Thompson - Actuality - Insert product here.
  12. Let me get this straight... by deacon · · Score: 5, Insightful
    You are going to cut up thousands of dollars worth of your "essential" books?

    And put them into an inferior visual format you cannot read without a working computer that is switched on?

    And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.

    All this just so you don't have to make 3 trips to move your books?

    Mmmkayyy.. (backs away slowly)

    Have you ever heard of a dolly?

  13. are you sure you want to do this? by binaryDigit · · Score: 4, Insightful

    I think you may be underestimating the sheer enormity of your task. Getting sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues: what happens when one page mis-scans or misfeeds, and can you go back and insert it into its proper location?), handling front to back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use jpg might be problematic. OCR is finicky enough as it is; back when we were scanning documents we always used 300dpi tiff (using group3 or group4 lossless compression) to get the maximum accuracy rates from the ocr package we were using. And speaking of accuracy, keep in mind that OCR software that has a 97% accuracy rate means that it will flub 3 out of every 100 words; in a book that might contain tens or hundreds of thousands of words, that is a whole lot of errors. Now it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
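
    To make that arithmetic concrete, here is a rough sketch. It assumes the 97% figure is a per-word rate, as described above, and the book lengths are hypothetical round numbers rather than measurements.

```python
def expected_word_errors(word_count: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given per-word accuracy."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    return round(word_count * (1.0 - word_accuracy))


# Hypothetical book lengths, chosen only to illustrate the scale.
for words in (10_000, 100_000, 500_000):
    errors = expected_word_errors(words, 0.97)
    print(f"{words:>7} words -> about {errors} flubbed words")
```

    If the vendor's figure is really per character rather than per word, the share of damaged words would be higher still, since one bad character spoils the whole word.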

    I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

    Keep in mind that this is such a common need that, if it were pretty straightforward, much of it would be done already (perhaps someone out there has the time/hardware/software to have done some of this already?). Not to mention that, with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many of the technical books are obsolete two weeks before they hit the shelves.

    1. Re:are you sure you want to do this? by Hallow · · Score: 4, Informative

      What he's probably looking for is something like PDF. You can leave the image on the front (i.e., it's what shows up in acrobat reader), and adobe's ocr ocr's the document and indexes it for searches. The problem with this is, you wind up with big pdf's with poor quality.

      Where I work we tried to turn a book into PDF that we no longer had an electronic copy of. Keeping the images up front with ocr text behind, about 300 pages altogether. Even with max compression, and the lowest acceptable DPI (300 I think), the PDF came out to 95MB. It didn't help that we scanned the book page by page and generated the PDF by hand, on a slow hp general consumer model scanner, either. (The initial pdf took over 120hrs to produce, with rescans and ocr'ing and everything.)

      We wound up taking the acrobat ocr'd text (it was better than the off the shelf ocr package we had at the time) via the adobe accessibility website, and fixing it up. It was a pretty big project.

      We recently hired a document imaging company to PDF a lot of smaller historical documents for us, and that has worked out well. It's kind of pricey, but we also paid them to proof the ocr behind the images, and to hand adjust the images for appearance. It's worked out rather well.

  14. Re:Great by yintercept · · Score: 3, Funny

    Cool idea. You could sell special 3D glasses with an encrypted pattern that you would have to purchase to read a book. With the print on demand technologies, book sellers might create a system where people have to get a special printing of the book that fits only their encrypted readers. That way you can guarantee that only one person reads the book. You could also create a pretty good database of what people read. This would give you a good idea on who are the subversive elements in society.

  15. Somewhat on topic... Historical Papers by Embedded+Geek · · Score: 3, Interesting
    My father passed on Sunday and we were going through all the family papers. We have lots of original documents from my family during the Civil War and earlier. My sister and I were thinking of donating them to a museum, so there would be no risk of their loss should my house get damaged (there's way too many documents to fit in my fire safe).

    Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:

    What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?

    What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?

    What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)

    Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning it up in photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.

    Thanks in advance for your advice.

    --

    "Prepare for the worst - hope for the best."

    1. Re:Somewhat on topic... Historical Papers by Seanasy · · Score: 3, Informative

      If you really want to do it right, do it on film. Either pay someone or beg/borrow/steal a medium format camera and try to do it yourself. Film and archive quality prints will probably last longer than CDs and you can get good scans from the negatives if you want digital, too.

      I believe libraries use uncompressed TIFF files for digital archives.

      You might find some discussions of this on photo.net

  16. Free the monkeys! by sydb · · Score: 4, Funny

    Isn't an infinite number of computers enough?

    cat /dev/random > ebooks

    --
    Yours Sincerely, Michael.
  17. 4DigitalBooks 900 pages/hour - or do it yourself by jukal · · Score: 4, Informative

    I do not have any experience with their products, but the solution offered by this company seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.

    Now, if I was to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is to find a cheap enough way to turn pages automatically. See also Kris Mckenzie's automatic page turner. Still, the best start is this document, which is a proposal and overview on how to create an automatic page turner from pieces; the total cost is $459.

  18. Funny You should ask. by Fapestniegd · · Score: 3, Informative

    My current setup consists of:
    4 HP scanners with ADF ~$150 ea. (eBay)
    4 Sparc LXs from a property control auction $50
    one flatbed scanner for covers and bad scans. $50 (eBay again)
    Barebones System/w scsi from Compgeeks $80

    (NFS server), An Amtren Device (courtesy of the office) and away you go. I've found the best way to cut off the binders is to use a box cutter and to use your previous cuts as a guide. Several shell scripts to scan various types of books. It's amazing the page numbering schemes some publishers use. With this setup I can scan approximately 2-3 college textbooks of 1000 pgs. (grayscale) or 1 in color in a 10 hour period (including checking for bad scans; sane ain't perfect, so you better check em). Also, jpg isn't very good for OCR; I store as png, and convert a second set to jpg for web viewing. OCR under linux isn't quite there yet (unless you want to pay through the nose), so I am archiving the pngs to CD until it is. This also allows me to regenerate the jpgs if I lose a webserver disk. Add a nifty little ImageMagick web viewer and voila! eBookshelf! Oh, and a NSM CD changer is nice too, to get to the CDs nearline. You can pick these up on ebay for $200-$400.

  19. Re:While you're scanning my books... by toocoolforsocks · · Score: 3, Informative

    Actually if you sign this little buls**t form they have under the counter, they can copy whatever you want. I should know, I work there.

  20. I've done this by brad3378 · · Score: 4, Insightful
    To do it, I purchased a used HP scanner with a 50 page Automatic Document Feeder (Search for ADF on Ebay).

    I started with the easiest books. - Books that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files because the default naming convention is *01.jpg , *02.jpg , *03.jpg, etc.

    It is a problem for two reasons:

    most of my books are double sided.
    My HP scanning software for windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.

    If books contain more pages than the ADF holds, The first page scanned will still be named page 1.

    If I knew a little perl, I'd write a script to rename the files between scan batches.
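
    A rename script along those lines could look like the sketch below, written in Python rather than perl. It is only an illustration: the file names, the `page_0001.jpg` output pattern, and the idea of scanning all fronts and then flipping the stack to scan the backs are assumptions, not something the HP software dictates.

```python
from pathlib import Path


def plan_renames(scans, first_page, step=2, reverse=False, prefix="page"):
    """Map scanner output files to page-numbered names.

    scans      -- file paths in the order the scanner wrote them
    first_page -- page number of the first page in this pass
    step       -- 2 when a pass captures only odd or only even pages
    reverse    -- True if the stack was flipped, so the last page came first
    """
    ordered = list(reversed(scans)) if reverse else list(scans)
    plan = []
    for offset, src in enumerate(ordered):
        src = Path(src)
        page = first_page + offset * step
        plan.append((src, src.with_name(f"{prefix}_{page:04d}{src.suffix}")))
    return plan


def apply_renames(plan):
    """Rename files on disk, refusing to overwrite anything that exists."""
    for src, dst in plan:
        if dst.exists():
            raise FileExistsError(dst)
        src.rename(dst)


if __name__ == "__main__":
    # Dry run with hypothetical names: three sheets, fronts are pages 1, 3, 5
    # and the flipped stack yields the backs in the order 6, 4, 2.
    fronts = ["scan01.jpg", "scan02.jpg", "scan03.jpg"]
    backs = ["back01.jpg", "back02.jpg", "back03.jpg"]
    plan = plan_renames(fronts, first_page=1)
    plan += plan_renames(backs, first_page=2, reverse=True)
    for src, dst in plan:
        print(f"{src} -> {dst.name}")
```

    For the second ADF load, the same call with `first_page=101` (after a 50-sheet first batch) would continue the numbering instead of starting again at page 1. Printing the plan before calling `apply_renames` makes it easy to catch an off-by-one before any file is touched.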

    For scanning full bound textbooks, there are two main problems:

    Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.

    You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.

    Do not even consider manual scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & White scans).

    Lastly, be realistic.
    Know how much time you'll need to invest.
    Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.

  21. You *need* to be aware of OpenDJVu by Effugas · · Score: 5, Interesting

    Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's significant amounts of open source for it, and more.

    It's truly a brilliant format. Go check it out.
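
    To put the 10-30K per page figure in perspective, here is a back-of-the-envelope sketch. It assumes "K" means kilobytes and 1 MB = 1024 KB; the page counts are hypothetical, and real results will depend on the scans.

```python
def djvu_size_range_mb(pages, kb_per_page=(10, 30)):
    """Return the (low, high) estimated size in MB for a scanned book."""
    low_kb, high_kb = kb_per_page
    return pages * low_kb / 1024, pages * high_kb / 1024


# Hypothetical book lengths, for illustration only.
for pages in (300, 1000):
    low, high = djvu_size_range_mb(pages)
    print(f"{pages} pages: roughly {low:.1f}-{high:.1f} MB")
```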

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com

  22. definition of "dearth" by bcrowell · · Score: 3, Informative

    There are hundreds of them here. Very few are the kind of dopey software manuals you're referring to. Is that a "dearth?"

  23. You are insane by labradore · · Score: 4, Insightful
    Ask yourself this question:
    What is the oldest file that I have?
    and ask:
    What is the oldest useful file that I have?
    For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.

    You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.

  24. Use JBIG - not GIF by mangu · · Score: 3, Informative

    For bi-level images, the standard to use is JBIG, which comes from an ISO group similar to those that created JPEG and MPEG.

    It generates much smaller files than GIF for printed text, with none of the inconveniences of JPEG. Grey scale pictures come reasonably well, if done at 300 dpi, dithered.

    I don't know exactly why JBIG never caught on like those other standards. There don't seem to be many JBIG programs around, but, if you are handy with source code, there's jbigkit, a library for reading and writing JBIG files. I wrote my own software with that, and converted a half-ton of old magazines into a 20-pack caselogic of CD's.