Digitizing Your Dead Trees?
smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"
"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.
What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)
Josh Woodward
Quite useful and handy.
D
You can't grep a dead tree.
Alcohol and Calculus don't mix. Don't drink and derive.
That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.
And put them into an inferior visual format you cannot read without the computer being working and on?
And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.
All this just so you don't have to make 3 trips to move your books?
Mmmkayyy.. (backs away slowly)
Have you ever heard of a dolly?
Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.
In a way, huge bookshelves with hundreds of books was a status symbol, showing that you'd been around a while and a lot of people thought it was worthwile to give you books. It was useful to have all of that info available, but few people actually used more than 1% the data that was on their shelves.
The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.
I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.
Run, don't walk, to http://djvu.research.att.com/home.html . DJVu is a image-based competitor to PDF that is a feat of beautiful engineering -- 300DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's significant amounts of open source for it, and more.
It's truly a brilliant format. Go check it out.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com