Ask Slashdot: State-of-the-Art In Amateur Book Scanning?
An anonymous reader writes: I have a shelf full of books and other book-like things ranging from old to very old that I would like to turn into PDFs (or other similarly portable format), and have been on a slow-burn quest for the right hardware and method to do so on a budget. These are mostly sentimental — things handed down over generations, and they include family bibles, notebooks, and photo albums, as well as some conventional — published, bound — books from the late 19th and early 20th Century. None of them are especially valuable as antiques, as far as I know; my goals in preserving them are a) to make them available to other people in my family who are into genealogy or just nostalgia, and b) so I can read some of those old, interesting books (et cetera) without endangering them any more than it takes to scan them once. I was intrigued by the (funded, but not yet available) scanner mentioned earlier this year on Slashdot; it seems to do a lot of things right, but like any crowdfunded project, the proof is in the pudding, and the pudding hasn't yet arrived. It's also cheap, and that fits my household budget. What methods and hardware are you using to scan old documents? Any tips you have from a similar project, with regard to hardware, treatment of the materials being scanned, light sources, file formats, clean-up and editing tools, file-size-vs-resolution tradeoffs? In the end, I'm likely to err toward high-resolution scans, since they can be knocked down to size later if need be, but I'd be interested in hearing about what tradeoffs you've found to work for you.
One big question that I'd like to have answered: Is there stand-alone Free / Open Source software, or even just cheap software (I am mostly on Linux, by choice, but won't leap onto a sword to keep my Free Software purity) that makes for easy correction of the distortion introduced by camera-based imaging? If I could easily uncurl and keystone-correct pages, then a lot of input methods (even my phone) are suddenly much more attractive. My old Casio camera could do this 10 years ago, but I haven't found a free software desktop utility that lets me turn photos into nicely squared-up pages.
Scanning is stealing from GOD
UNITE with the Campaign for a Free Internet because today, our future begins with tomorrow!
If you're looking for a project, what we use at my university library to scan some of the rarest and most delicate books on the planet is definitely achievable at home. It's simply a table with interchangeable wedge-shaped foam pieces, and a rack above with two cameras pointing down. Since the book is on a V cradle, the pages lie flat. You can change the angle and position of the cameras to point squarely at the pages. There's a pedal that will snap a picture with both cameras at once, so once you've got it set up, all you need to do is flip the pages and hit the pedal. You might need to readjust if the book is particularly thick, but that's all pretty intuitive once you're used to the setup.
I would suggest you look here http://www.diybookscanner.org/...
I'm planning to do much the same thing as you myself, but I've still not decided how to do it and other things have been occupying my attention recently, so I've not kept up with developments for a year or so.
There are plenty of ideas there and suggestions for software and workflows that will do what you want.
N.B. this user is far too lazy to write a witty and intelligent sig.
Sit down with transcription software and read those books aloud. Done.
It's NOT cheap. Seems rather Windows Centric.
leather-dog muksihs
Blog: @muksihs
This method is destructive, as you are removing the pages from the book. However, it gets you an adequate scan without the need of controlled ambient light or running transformations from a photo such that the page seems to be flat. Any other method is complicated, so expect to invest a lot of time.
I collected paperback and hard cover books for almost 5 decades. I had storage bins in the attic and garage full of them. They all went to a "friends of the library" benefit sale. About two thousand books gone freeing up space and now I have more than that amount on a hard drive and about 200 on my phone. No more dead tree books and magazines for me. I can pull up something to read any time and any place. Technology is fabulous.
Is this out of your budget? Buy one, sell it on eBay when you're done. Anything else, you'll just be wasting huge amounts of your time.
"I would like to turn into PDFs (or other similarly portable format)"
What is it about PDF files which you think makes them portable? You'd be better off with PNG format.
I will admit reading a digital book is easier now that I'm 5+ decades old.
I'm not much of a grammar Nazi, but I'm seeing this error everywhere now and I'm afraid it'll become the norm. The saying is, "The proof of the pudding is in the eating," which makes a lot more sense when you think about it.
The software piece you mentioned for turning scans into nice clean rectangles exists as "unpaper". Here's one fork: https://www.flameeyes.eu/proje...
The people who have bothered to fork and improve unpaper probably did so because they did a project similar to yours, so you might ask them about other tips and resources.
As someone else said, while pdf is convenient for READING a book, it's not a particularly great format for archiving a collection of images which you may want to convert to another format later. There are several good grayscale image formats to choose from. To order those images into a cohesive document, perhaps with separate chapters, one could produce html via a tiny Perl or shell script. That would preserve the images in their native format for later conversion as needed in the future.
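A minimal sketch of that "images plus a generated HTML index" idea. The comment suggests Perl or shell; Python is swapped in here purely for illustration. The function name `build_index` and the `page-001.png` naming scheme are assumptions, not anything from the thread.

```python
from html import escape


def build_index(title, image_names):
    """Return an HTML page that shows the given page images in sorted order.

    Assumes zero-padded file names (page-001.png, page-002.png, ...) so that
    a plain alphabetical sort matches the page order.
    """
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
    ]
    for number, name in enumerate(sorted(image_names), start=1):
        safe_name = escape(name, quote=True)
        # One anchor per page, so a chapter list could link to #page-N later.
        lines.append(
            f'<p id="page-{number}"><img src="{safe_name}" alt="Page {number}"></p>'
        )
    lines += ["</body>", "</html>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_index("Family Bible", ["page-002.png", "page-001.png"]))
```

The images stay untouched in their own files; only the small index page needs regenerating if you later convert them to another format.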
Keystoning is easy to correct in Gimp. But that's going to be pretty labor intensive, and you really would want something automatic. I'd follow what others have said and buy one of the better products, like a professional scanner, and re-sell it once you're done. You can buy the ScanSnap SV600 (which everyone else seems to be recommending) for under $600 -- is that budget-friendly for you? If not, have you looked into renting such a device, or using one at a local library?
As an analogy, if you wanted to scan the old family slides, then the way to go is to buy a used Nikon pro-level slide scanner, do your stuff, and re-sell it with nearly zero loss, with the understanding that you're putting the couple of thousand dollars of purchase price at risk. I'm in the midst of doing exactly that, although given the number of slides I have to scan, I bought the scanner with the expectation that it will be a full write-off, and that's the price of not risking loss of family heirlooms by shipping the slides somewhere to have a minimum-wage flunky do the scanning.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
ImageMagick can do (most of) what you want for squaring up photos of pages. It is free/open source software. I'm not sure that I'd describe it as "easy" though: you would have to manually mark out the fixes required for each page.
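A rough sketch of what "manually marking out the fixes" could look like: you note the four corners of the page in the photo, and a small helper builds the arguments for ImageMagick's `-distort Perspective` operator. The helper names, the corner order, and the example coordinates are assumptions made for illustration; the command is shown as `convert`, which on ImageMagick 7 installs is usually `magick` instead.

```python
def perspective_points(corners, out_w, out_h):
    """Build the control-point string for ImageMagick's -distort Perspective.

    corners: four (x, y) pixel positions of the page in the photo, in the
    order top-left, top-right, bottom-right, bottom-left (an assumed convention).
    Each source corner is mapped onto the matching corner of an
    out_w x out_h rectangle.
    """
    if len(corners) != 4:
        raise ValueError("exactly four corners are required")
    targets = [(0, 0), (out_w - 1, 0), (out_w - 1, out_h - 1), (0, out_h - 1)]
    pairs = []
    for (sx, sy), (dx, dy) in zip(corners, targets):
        pairs.append(f"{sx},{sy} {dx},{dy}")
    return " ".join(pairs)


def build_command(src, dst, corners, out_w, out_h):
    """Return an argument list suitable for subprocess.run (not executed here)."""
    return [
        "convert", src,
        "-distort", "Perspective", perspective_points(corners, out_w, out_h),
        # The squared-up page now sits at the top-left; trim everything else.
        "-crop", f"{out_w}x{out_h}+0+0", "+repage",
        dst,
    ]


if __name__ == "__main__":
    corners = [(120, 80), (1900, 95), (1950, 2700), (90, 2680)]
    print(" ".join(build_command("photo.jpg", "page.png", corners, 1800, 2600)))
```

This only handles keystoning (a flat page photographed at an angle). Page curl near the binding is a different distortion and is not addressed by a four-point perspective fix.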
I am an aficionado of dead tree technology. I find reading long documents online is very tiring. That is why I prefer dead tree technology.
Dead tree technology has many benefits:
It never needs to be recharged.
It is very portable. Just toss it into your bag. No cords or power supply.
It is very easy to share with some one. Just hand the book to them.
It has a very user friendly user indexing system called "dog ear".
Simply fold a corner of a page over and you can find your place again.
It is very easy to make notes with a pen or yellow highlighter technology.
But only if it is your own book.
Character image resolution is excellent. No "jaggies" in the font.
Reading a book has a great tactile feel.
Holding it in your hands, turning the pages.
The only drawback is that it requires an external light source. Sunshine can be great to read by.
Yes, I do like reading using my "dead tree" technology. The only problem is that in a decade or two, children will be asking me about my odd hand held device. Do I really never have to charge it? How can I use it if it does not connect to the Internet? What if I have a question or want to text my friends? Do I really need a different one for each book I want to read?
Apologies for this being off topic.
RLH
I scan B&W pages of historical manuals I have (SunOS 1.1, not Solaris). What would you recommend for grayscale and why?
RLH
Agreed. I scanned a bunch of books that way, using a commercial-grade Fujitsu scanner, capable of scanning about 60ppm - both sides. I got a little over 20,000 pages in, and I had to quit, because the work was so intense. That was more than 10 years ago, and I still haven't been able to get back to it.
There's more to scanning a book than just scanning. Between preparing the book for scanning and making sure it scanned correctly, there's a lot of work involved.
Sit, Ubuntu, sit. Good dog.
The IndieGogo project is just a clone of something that already exists:
http://scanners.fcpa.fujitsu.com/scansnap11/features_sv600.html
The ScanSnap SV600 looks good, but you'll have to keep something in mind: The web page for this link says, "* Maximum document scanning thickness is 30 mm (1.18 in.)"
A maximum thickness of 1.18 inches would limit the books you can scan, unless you break apart a thick book, and scan it in sections.
Search Google Books for free downloads. They might have already done the hard work on those older books.
Seriously. Most popular books are already in electronic formats. While there may be some solid questions about whether it's legal, I think it's a perfectly ethical move. You're going from print to print, not audiobook, play, or movie version of a printed book that you own.
And scanning it is probably illegal anyway so it's not like all the extra work will protect you. Yeah, Google got away with it but they've got millions of dollars worth of lawyers who argued that their work was done for research purposes.
I often scan in music from bulky books. I find Scan Tailor (http://scantailor.org/) works pretty well. It lets you crop, unbend, despeckle etc. in a wizard-like way. The drawback is that it weirdly insists on TIFF format input and output. So you have to be handy with tools like pdf2pnm, pnmtotiff, tiffcp and tiff2pdf, etc.
Works really well apart from that.
I was asked to look at scanning file cabinets full of legal size folders with contracts (mostly NCR forms) and handwriting on the inside covers of the file folder itself. Using a bit of lumber and a square you can make a frame to position the item to be scanned and, if need be, mount the webcam so it can be run up and down a mast for very thick items. Our focal plane was about 15" up for a good legal size paper image. They had an old XP box so I used IrfanView for capture and X-mouse Button Control 2 for mapping a right click to IrfanView's batch scan mode (right mouse button remapped to = ctrl+shft+a), giving sequenced file numbers to the output. They had to reset the counter for each new contract and change the destination folder; in the end we worked in the same temp folder and moved each contract before starting the next task... books would be easier in that regard. IrfanView supports a lot of plugins, but if you make a good fixture for aligning your books below the camera, keystoning should not be an issue. I don't think I would do two pages at once at 1080, but on a small book it might be OK. The main thing is lighting: even lighting makes all the difference in that rig. If I had a budget I would have set up LEDs; we just used two desk lamps. Not counting the webcam and XP box, we were way under $50. I understand why you want to do this. Good luck however you go.
You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
I also hear that you can pay the Scanning Service close to your location to scan your books, but you will need to check on that.
Check if by any chance your books are already digitized on OpenLibrary.org
The Internet Archive Book Drive - https://openlibrary.org/bookdr...
Scanning Services - http://archive.org/scanning
Although I don't have the answer to the exact question you asked, I can point out one thing. It's easy to convert from djvu to pdf if, at some point in time, you want a pdf copy for some reason. The reverse isn't so true. If you archive it as pdf, you can't readily convert to anything else without losing information.
Overall, pdf is reasonable for viewing (right now), but not good for editing, manipulating, and archiving. Even for viewing, pdf at its heart assumes it is being printed on letter-sized paper, and that's the layout you'll always get. It doesn't flow or scale well or work well on widescreen displays. This is because pdf is essentially the Postscript printer language, zipped. It's designed for printing, not for screens of varying sizes, resolutions, and aspect ratios.
OK - open source has a really good OCR engine - tesseract.
But that is only one part - you need software that can recognize layout - differentiate pictures from text etc.
There are two approaches - put a text layer under a bitmap (searchable image) - or make a real document with fonts and pictures where needed (clear-scan). (Hopefully an ODT file.)
Even in Windows clear-scan is iffy - diagrams with text confuse the software. Clear-scan to ODT is what we want - but can't have yet.
Notes and links on this: https://wiki.xtronics.com/inde...
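For the first approach (searchable image), the tesseract command-line tool can write a PDF that keeps the page image and adds a text layer, via its `pdf` output config. A minimal sketch that only builds the command; the helper name and the file names are made up for illustration, and the language code `eng` is an assumption about the material being scanned.

```python
def searchable_pdf_command(image_path, output_base, language="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG pdf

    tesseract appends ".pdf" to the output base itself, so pass "page-001",
    not "page-001.pdf". Nothing is executed here; hand the list to
    subprocess.run once tesseract is installed.
    """
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


if __name__ == "__main__":
    print(" ".join(searchable_pdf_command("page-001.tif", "page-001")))
```

This covers only the text-under-bitmap case. It does nothing for the clear-scan style output discussed above.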
On the path to essential we all take a few detours to learn things.. one of my favorite 'sayings'.
Scan Tailor fits your original description and price range.
http://scantailor.org/
There is a GitHub site for downloading the installer, works on Windows 7 for me, but I see no limitations to prevent it from working on OSX or Linux.
The documentation isn't great, but the software is very good, quite on par with most of the BookDrive or BookScanner types of programs.
Digital Book Collecting, or Scanning or Ripping, depending on what you prefer to call the process, is basically two things:
1. Capture
2. Post processing
Capture is usually to a series of TIFF files, which are losslessly compressed image files. Sometimes people compile those directly into PDF files, but they are usually not satisfied with the size or the results.
So the "gold standard" is direct to TIFF (although direct to large JPG is kind of becoming common)
You generally want to make sure the images are scanned at around 300x300 dpi, to make really good Optical Character Recognition (OCR) possible. (ABBYY FineReader has been the gold standard for OCR for years.) Also, an image is not indexable or "Searchable", which is what people start wanting when they need to search a document.
A PDF will hold multiple TIFF images and the results of an OCR scan in a single PDF file, and it's a nice format in which you can open a document and use the built-in "Find" to skim the index and go right to a page.
A PDF can also have a fully functional Contents page and Index with clickable "hot links" to take you directly to a page. This is also almost "expected" these days, but first you need software to OCR and index it, and usually someone to make the links for you.
A "Cross Document" searcher like FileCenter by Lucion will even index multiple PDF files in a catalog and let you search between them for references. FileCenter will also work directly with Fujitsu TWAIN scanners to let you capture and OCR everything that will fit in the scanner into arbitrary folders on your computer or home NAS device. It's fairly inexpensive paperless office software (and it actually works, I use it a lot). http://www.lucion.com/
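To put rough numbers on the resolution-versus-size tradeoff, here is a small back-of-the-envelope sketch. The 6 x 9 inch page is an assumed example size, and the figures are for uncompressed pixel data only; real TIFF or JPEG files will be smaller by an amount that depends on the content and the compression used.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size before any compression.

    bits_per_pixel: 1 for bilevel black and white, 8 for grayscale,
    24 for RGB color.
    """
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (300, 600):
        for label, bits in (("bilevel", 1), ("gray", 8), ("color", 24)):
            size = uncompressed_bytes(6, 9, dpi, bits)
            print(f"{dpi} dpi {label:8s} {size / 1_000_000:7.2f} MB per page")
```

Doubling the resolution quadruples the pixel count, so a 600 dpi capture costs about four times the storage of a 300 dpi one at the same bit depth.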
For Step #1 Capture you need some type of stable camera stand and a camera to snap a picture of a document/book. If it is a loose group of pages a Scanner can work. Fujitsu usually makes the best and still supports TWAIN on their high end. They have Automatic Document Feeders (ADF) and flatbed models, and ADF+Flatbed all-in-ones. Fopydo makes some stiff plastic construction board type stands for very low cost that will support a book or documents and your cell phone for capturing images, and they are available on Amazon. Atiz makes very high end scanning "booths" which support professional DSLRs and flood lights to illuminate opposing sides of a 'V' shaped cradle with a plexiglass levitated platform for pressing the pages of a book flat before photography. They are somewhat cumbersome to use and require a permanent location dedicated to scanning. Atiz also formerly made a Canon Powershot model to take advantage of less expensive prosumer cameras for shooting images, but the Booksnap is no longer available. The Planetary or Overhead shooting tower that uses a cell phone cam or a dedicated image sensor built into the tower is becoming more popular. Fujitsu makes a high quality one, but it appears a bit slow and it's still quite expensive.
For Step #2 you will want to break it down into Prep work before the OCR, then Post work after the OCR and finally Binding or Publishing the eBook to a format of your choice. Scan Tailor, BookDrive, and others are for Prep work before the OCR, they let you adjust contrast, tease out image artifacts or correct for under/overexposure and the "bleed through" bright lights and thin pages can bring out from the opposing side of the page that was imaged. OCR requires either the freebie copy of whatev
We have ScanSnap scanners at work and one of the biggest pains is they do NOT support the TWAIN/ISIS driver standard. That means you cannot scan using any software except the ScanSnap software. And at ~$900 it is a little expensive for home use.
--
You might investigate Project Gado.
A free open source robot for taking pictures of documents without exposing them to danger.
Not sure if it has all the software you want, but there is an open source community developing for it, the Univ of Finland seems to be the hub.
http://projectgado.org/2015/07...
If you are able to build this thing, look at linearbookscanner. This would be my preferred method of digitizing but to build it is above my ability :-(.
If you have to correct for keystoning, your cameras aren't aligned well.
You want to use a mirror for alignment, as it allows you to verify that the camera is in the correct place -- a non-reflective target only ensures that you're pointed at the correct place.
The Czur has no platen, and therefore there will be distortion due to curved pages which would have to be corrected for. It also won't be able to image as well closer into the binding -- if you have to spread the book flat, you're going to end up damaging the spine.
Build it, and they will come^Hplain.
Before building something yourself, make sure you don't have access to better equipment locally. The main library here in Cleveland has what they call a Preservation Lab that has library-grade equipment available for public use.
http://cpl.org/clevdpl/
First digitize using the best solution which is easily available to you now, like a good flat bed scanner, and then look for correcting software later. So long as you have the original JPEGs/PDFs, you can continue enhancing them without putting your documents in danger.
Seems preferable to waiting for perfect hardware/software while your archive deteriorates further.