Digitizing Your Dead Trees?
smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"
"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.
What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)
Josh Woodward
Lots of college students at $5/hour.
Slogan-free since April! We pass the savings on to you!
Kinko's offers high-volume scan-to-PDF solutions. At low volume, it is usually 10 to 25 cents per page plus the cost of the media to copy it to, but in large volume the cost can sometimes go down to 1 cent per page.
Call Kinko's. Ask for the Territory Representative. They'll help you out!!!
hire an infinite number of monkeys on typewriters and... oh wait, that is for Shakespeare
Lysergic Acid Diethylamide, not just chemistry, reality!
Quite useful and handy.
D
You can't grep a dead tree.
Alcohol and Calculus don't mix. Don't drink and derive.
That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.
I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?
Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.
As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.
Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The DB on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.
JPEG also sucks for this. JPEG is best for full-color images like photographs. You're better off using TIFF or PNG. Most OCR software will require TIFF. I don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. TextBridge from Xerox isn't bad for the money.
We use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an in-house FTP server. It can scan at 300 or 600 DPI, and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan-back feature costs extra, so not everyone spent the money on it, but about 1000 Kinko's have these in house, and a Kinko's with a good DTP department might actually even know how to use the feature. (Good luck!)
Ignored Since 1973
And put them into an inferior visual format you cannot read unless the computer is working and switched on?
And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.
All this just so you don't have to make 3 trips to move your books?
Mmmkayyy.. (backs away slowly)
Have you ever heard of a dolly?
I think you may be underestimating the sheer enormity of your task: getting sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues; what happens when one page mis-scans or mis-feeds, and can you go back and insert it into its proper location?), and handling front-to-back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use JPEG might be problematic. OCR is finicky enough as it is. Back when we were scanning documents, we always used 300 DPI TIFF (with Group 3 or Group 4 lossless compression) to get the maximum accuracy rates from the OCR package we were using. And speaking of accuracy, keep in mind that OCR software with a 97% accuracy rate will flub 3 out of every 100 words. In a book that might contain tens or hundreds of thousands of words, that is a whole lot of errors. Now, it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
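To put a number on that, here is a rough back-of-the-envelope sketch in Python. It assumes the accuracy figure is per word and that errors are spread evenly, which is a simplification; the function name and the 100,000-word book are just illustrative.

```python
def expected_errors(word_count, accuracy):
    """Estimate how many words OCR gets wrong.

    Assumes `accuracy` is a per-word rate between 0 and 1 and that
    errors are spread evenly through the text (a simplification).
    """
    return round(word_count * (1.0 - accuracy))


# A hypothetical 100,000-word technical book at 97% word accuracy.
print(expected_errors(100_000, 0.97))
```

For a searchable index that may still be tolerable, since a search term only fails when the specific occurrence you need happens to be one of the misread words.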
I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).
Keep in mind that this is such a common need that if it were pretty straightforward, much of it would be done already (perhaps someone out there has had the time/hardware/software to do some of this already?). Not to mention that, with the web, much of the information contained in those books is now available online. It makes you wonder if it's really worth the time and effort, especially considering that a great many technical books are obsolete two weeks before they hit the shelves.
Isn't an infinite number of computers enough?
cat /dev/random > ebooks
Yours Sincerely, Michael.
I do not have any experience with their products, but the solution offered by this company seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.
Now, if I were to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is finding a cheap enough way to turn pages automatically. See also Kris Mckenzie's automatic page turner. Still, the best start is this document, which is a proposal and overview on how to create an automatic page turner from pieces; the total cost is $459.
I started with the easiest books: those that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files, because the default naming convention is *01.jpg, *02.jpg, *03.jpg, etc.
It is a problem for several reasons:
Most of my books are double sided.
My HP scanning software for Windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.
If a book contains more pages than the ADF holds, the first page scanned in each batch will still be named page 1.
If I knew a little perl, I'd write a script to rename the files between scan batches.
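Something along these lines might do it. This is a minimal sketch in Python rather than Perl, and it rests on a few assumptions that are not from the post above: each scan batch lands in its own folder, the scanner's trailing number reflects scan order, and the output names (`page0001.jpg` and so on) and function names are made up for illustration.

```python
import os
import re
import shutil


def plan_renames(filenames, first_page, step=2, prefix="page"):
    """Return (old_name, new_name) pairs for one scan batch.

    `first_page` is the page number of the first sheet side in the batch.
    `step=2` numbers a one-sided pass as 1,3,5,... or 2,4,6,...;
    use `step=1` if the scanner already interleaves both sides.
    """
    def scan_index(name):
        # Order by the trailing number the scanner assigned (scan07.jpg -> 7).
        match = re.search(r"(\d+)\.\w+$", name)
        return int(match.group(1)) if match else 0

    plan = []
    for offset, old in enumerate(sorted(filenames, key=scan_index)):
        ext = os.path.splitext(old)[1]
        page = first_page + offset * step
        plan.append((old, f"{prefix}{page:04d}{ext}"))
    return plan


def rename_batch(batch_dir, book_dir, first_page, step=2):
    """Move one batch of .jpg scans into `book_dir` under page-numbered names."""
    os.makedirs(book_dir, exist_ok=True)
    names = [n for n in os.listdir(batch_dir) if n.lower().endswith(".jpg")]
    for old, new in plan_renames(names, first_page, step):
        shutil.move(os.path.join(batch_dir, old), os.path.join(book_dir, new))
```

For example, after scanning the front sides of the first 50 sheets you would call `rename_batch("batch1", "mybook", first_page=1)`, then flip the stack, scan the backs, and call `rename_batch("batch2", "mybook", first_page=2)`. If flipping the stack reverses the sheet order, the back-side batch would need to be numbered in reverse, which this sketch does not handle.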
For scanning full bound textbooks, there are two main problems:
Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.
You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.
Do not even consider manually scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & white scans).
Lastly, be realistic.
Know how much time you'll need to invest.
Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.
Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300 DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's a significant amount of open source for it, and more.
It's truly a brilliant format. Go check it out.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
What is the oldest file that I have?
and ask:
What is the oldest useful file that I have?
For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.
You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.