Digitizing Your Dead Trees?
smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"
"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.
What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)
Josh Woodward
Lots of college students at $5/hour.
Slogan-free since April! We pass the savings on to you!
Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 to 25 cents per page plus the cost of the media to copy it to, but in large volume, the cost can sometimes go down to 1 cent per page.
Call Kinko's. Ask for the Territory Representative. They'll help you out!!!
hire an infinite number of monkeys on typewriters and... oh wait, that is for Shakespeare
Lysergic Acid Diethylamide, not just chemistry, reality!
Quite useful and handy.
D
You can't grep a dead tree.
Alcohol and Calculus don't mix. Don't drink and derive.
Now the booksellers will join with the entertainment industry. Next we will be seeing books that can't be scanned easily.
Remember those passkeys for computer games in the 80's that were black on maroon paper? Or some dial thingy.
That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.
Most of my technical books contain vast quantities of useful information in charts, diagrams, and illustrations... which are far more of a challenge to OCR than mere printed text.
I suspect that even were this sort of thing really possible, it's a major time investment. I have several dozen technical books I'd like to scan, each with four hundred or so pages... and I'm not sure I want to spend a week's vacation time doing it.
And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.
People are never as simple as their stereotypes. This applies equally to Christians, Muslims, and Emacs-lovers.
I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?
Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.
As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.
Check the supported-hardware list for SANE and then pick one of the fastest scanners you can afford. The database on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.
JPEG also sucks for this. JPEG is best for full color images like photographs. Better off using TIFF or PNG. Most OCR software will require TIFF. Don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. TextBridge from Xerox isn't bad for the money.
.....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an FTP server in-house. It can scan at 300 or 600 DPI and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan-back feature costs extra on the device, so not everyone spent the money on it....but about 1000 Kinko's have these in house, and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)
Ignored Since 1973
I don't know HOW many times I've looked at a tech manual (or other paper book, for that matter) trying to find something I read a while ago and thought, "I wish I could just do a text search to find the 3 or so words I remember seeing..." Sure, the index and table of contents get you part of the way there, but if the author mentions something off-hand in an 'unrelated' section of the book...
Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.
Please remember to make decent backups on a long-lasting medium with a high chance of recoverability. Failing that, place the loose-leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.
Matt Thompson - Actuality - Insert product here.
Scanned images solve these problems, but have two problems of their own:
Perhaps a hybrid solution exists, but I suspect such a solution will require a lot of manual intervention and tweaking, something you'll want to avoid if your goal is to digitize several books.
Personally, however, I still like printed manuals. Using an online manual means either reducing some windows or switching desktops. With a paper manual I can keep the screen exactly as it is. Higher resolution screens, or the use of multiple screens, are making online manuals much more useful (anyone remember what a pain in the ass it was to try and figure out something with only an online manual on a 640x480 screen?). Occasionally I still manage to fill two 1600x1200 screens with a bunch of stuff I want to keep visible while still reading the manual.
If you are anything like the computer guys I know (myself included), you'd end up printing out portions of the text whenever you wanted to read them anyway!!!
-- Adam
Yup. There is quite a lot already scanned. The best places to look are usenet (at alt.binaries.e-book, alt.binaries.e-book.technical, alt.binaries.e-books) and IRC at #bookwarez and #bookz on undernet, dalnet, and irc.nullus.net (and most likely other irc nets as well.)
You could try making a request in abeb, but the biggest selection in one place is irc. So as long as you are not scared by the interface, that is where I would look first.
O'Rielly (sp?) has many of their java books available on CD-ROM, although I only own the dead tree versions of the ones I have in that series.
On a regular basis, I haul 2188 pages' worth (I just added them up) of QUE's Using Java2 Standard Edition and Enterprise Edition between home and the office. (Speaking of which, go to the link in my .sig and buy some of my favorite books!) That's a lot of weight for two books, and I usually haul around a couple of smaller ones as well: O'Reilly's Perl book and their EJB 3rd edition.
Not only are all of these books heavy, but I have also yet to find an easy way to cart them around; they don't all fit right in any of my bags.
I want all of these books on CD-ROM, but not just CD-ROM. Half the books I have INCLUDED a CD-ROM; it just doesn't contain the text of the book. With O'Reilly, I'd buy the CD-ROM version, but I want the dead tree version too. I want to use the dead tree version; when I am working from home, I want to haul home the CDs. I don't think I should have to pay any more for it either: I bought the IP (in the property sense), and I am already paying the price for the wood slices, which includes a silver disk.
PUBLISHERS, GIVE ME THE BOOK ON THE CD TOO! I spend $100/month or so on tech books.
-Pete
Soccer Goal Plans
And put them into an inferior visual format you cannot read without the computer being working and on?
And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.
All this just so you don't have to make 3 trips to move your books?
Mmmkayyy.. (backs away slowly)
Have you ever heard of a dolly?
Schools for the blind have been doing this for years, especially with technical books. Many of my V.I. friends would remove the binding and feed the pages through a high-speed sheet feeder to a scanner. Then the books are proofed by sighted people for OCR perfection. Contact your local school and ask if they already have some of your works in PDF/JPEG/TIFF/WordPerfect (yes, lots of WordPerfect). They may be willing to give you some legal copies of your books in exchange for you converting some of the books you have that they don't into blind-readable format (which means you'd have to proof your own book for accuracy, but you're doing that anyway). Basically, you're donating your time for a good cause and benefiting yourself.
I think you may be underestimating the sheer enormity of your task. You have to get sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues: what happens when one page mis-scans or misfeeds, and can you go back and insert it into its proper location?), and handle front-to-back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use JPG might be problematic. OCR is finicky enough as it is; back when we were scanning documents we always used 300dpi TIFF (using Group 3 or Group 4 lossless compression) to get the maximum accuracy rates from the OCR package we were using. And speaking of accuracy, keep in mind that OCR software with a 97% accuracy rate will flub 3 out of every 100 words; in a book that might contain tens or hundreds of thousands of words, that is a whole lot of errors. Now it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
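To put rough numbers on the accuracy point above, here is a small back-of-the-envelope sketch. It assumes the 97% figure is a per-word rate, as the comment treats it; the function name and the sample word counts are made up for illustration.

```python
def expected_word_errors(word_count, word_accuracy):
    """Expected number of misrecognized words for a given per-word accuracy."""
    return round(word_count * (1.0 - word_accuracy))


if __name__ == "__main__":
    # Hypothetical book sizes, in words.
    for words in (50_000, 100_000, 300_000):
        errors = expected_word_errors(words, 0.97)
        print(f"{words:>7} words at 97% accuracy: about {errors} bad words")
```

Even at the small end that is well over a thousand bad words per book, which is why "good enough for a searchable index" is a more realistic goal than clean text.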
I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).
Keep in mind that this is such a common need that if it were pretty straightforward, much of it would be done already (perhaps someone out there has the time/hardware/software to have done some of this already?). Not to mention that, with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many technical books are obsolete two weeks before they hit the shelves.
I just wanna be able to look at the dollar bills on my computer instead of having to carry them with me. Is that so bad?
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i
If you really want to go through all this effort use both PDF and OCR.
OCR sucks royally for large documents, documents with images or diagrams, handwritten comments, etc. However scanning the pages to an image and then creating a PDF of the images does not care about any of that.
So, scan all of your books as images that your OCR software can process. Use the OCR output to create an index of pages. If a specific word on a specific page doesn't OCR well who cares. With typed and professionally printed books your OCR software should be about 90% accurate. Take the images and create PDF files.
Now you have your nice clean images but you still have a searchable index. BTW, when you get this done post your procedures, problems, and solutions to a web site somewhere so that you can share your experiences with the rest of the world.
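For what it's worth, the "OCR output as an index of pages" idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes you already have the OCR text for each page as a string keyed by page number, and the function names are invented here, not taken from any OCR package.

```python
import re
from collections import defaultdict


def build_index(pages):
    """pages maps page number -> OCR text; returns word -> sorted page numbers."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        # Lowercase and keep only runs of letters/digits; OCR junk just
        # becomes odd "words" that nobody will ever search for.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(nums) for word, nums in index.items()}


def search(index, *words):
    """Return the pages on which every query word was recognized."""
    hits = [set(index.get(word.lower(), ())) for word in words]
    return sorted(set.intersection(*hits)) if hits else []


if __name__ == "__main__":
    # Made-up OCR text for three pages.
    pages = {
        1: "The SCSI bus and its terminators",
        2: "SCSI scanner setup",
        3: "USB scanner setup",
    }
    index = build_index(pages)
    print(search(index, "scsi", "scanner"))
```

A word that OCR mangles on one page simply won't be found there, but you still open the clean page image once the index points you at the right spot.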
Start with google. There is a lot of technical information online, and google will find it. Not as good as those dead trees, but if you can find it and it is accurate, google is often easier than searching indexes. Best of all, dead trees are limited to the ones you own, while google is limited to whatever someone found useful to put online.
Note the last line of the above: google is limited to what someone else finds useful to put online. So if you can't find it on google, take some time to put it online for the rest of us. If/when you find yourself going back to the same few sites often, link to them from your homepage so google knows you find them useful. In other words, google is interactive, make it work for you and it will work for everyone. The internet is not a one way street.
Finally, some things are just plain easier to look up in dead tree format. I would strongly recommend you keep your books intact. Put the information you need on the web (what you can do legally), and keep the books for the rest. If you find you are not using a book anymore because all the information is on the web (including what you put there), then throw it out. My monitor is only 19 inches, not nearly enough to hold all the information I have scattered about my desk.
Tons and tons of e-texts. In multiple formats: text, pdf, lit, HTML.
Excellent resource!
Anders Borg wrote this FAQ from Project Gutenberg. Lots of field-tested advice there, such as a suggestion to scan at 300dpi or better.
ancarett, historian and zombie gamer
Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:
What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?
What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?
What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)
Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning it up in Photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.
Thanks in advance for your advice.
"Prepare for the worst - hope for the best."
Call Paul Bunyan. Cause he's a lumberjack and he's okay!
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i
Have you tried contacting the publishers directly? Or maybe the companies that created any of your software documentation? I know that some companies have PDFs of their manuals and other books, but don't make it well known. They don't usually offer them for free download, but if you prove you have a hard copy some companies will tell you how to get a PDF version. This works especially well for lost instruction manuals, which you can always get for free.
One good, but old, example is Oracle. Back in the day my company had megs of PDFs of all of Oracle's documentation. There was a main index PDF with links to basically every other possible document. I don't recall Oracle leaving them open for download on the internet. We got them on CD. But it was easy to get since they knew we were a customer.
Developers: We can use your help.
Right then. In 1993/4, this is what I did for a living. The company I worked for did quite a lot of this, and one contract in particular sticks in my mind - the digitising of all books in the French National Library.
No doubt the equipment we used has moved on in the intervening decade however. We used Bell & Howell scanners fitted with automatic document shredders. Err...feeders. Yes, automatic document feeders. Not shredders at all. No. Honest.
You see, these were high-speed scanners, and some of the books we received were quite old. Me and the other coder on the project got really quite good at doing "pit stops", or changing the rubber wheels that drove the ADF. What I'm saying is no disrespect to the scanner company - it was the quality of the paper we had to put through it that caused the hassle. Some books, like the 18th century Academie Francais records, were so thin we had to photograph them and scan the photos.
We then scaled, OCR'd, deskewed and indexed the results on decent machines - 25MHz 486SX, 4MB RAM and Kofax graphics cards. Everything was then tarred up to DAT.
Hardware moves on, but I'll bet the amount of work remains the same. Do not underestimate the preparation required, and also the amount of QA.
Oh, and don't use JPEG. Lossy compression on text? Use TIFF - the image processing industry standard.
Cheers,
Ian
You're right that he should not use JPEG for this, but for the wrong reasons. JPEG is simply the wrong format for images that are not like photographs. Specifically, JPEG is not appropriate for images with high spatial frequencies (i.e., distinct lines and shapes, and a small number of colors). Losslessly compressed formats (GIF, PNG, TIFF, etc.) are the appropriate choice for scanned text, diagrams, etc. PNG is not a replacement for JPEG.
Furthermore, if you want animations, you are overlooking the new, cool computer technology called MNG.
Isn't an infinite number of computers enough?
cat /dev/random > ebooks
Yours Sincerely, Michael.
Real men use the command shell and man() or google ;)
Seriously, most of the hard-core computer folks I know either open their copy of the ORA book on the subject, steal their neighbor's copy and flip it open, or use some form of online docs w/o printing said docs off. The only reason I've ever known anyone to print anything resembling a doc is when someone I knew had assembled a binder full of pages on tech specs for a project.
It's just a lot easier to sit at the screen arrowing up and down on the doc than it is to print it, reach over to the printer, pull it out, shuffle through it....and then eventually have to take it out with the trash. I've seen comments about paperless offices vis a vis paperless restrooms, but the fact is that for reference there really isn't a reason to print the online doc.
What is your Slash Rating?
..an index of the book on my system. Just a table with all the words and the pages on which they appear. Pretty useless without the book, since it would be practically impossible to recreate the book from it, and it would be damn convenient.
The Kruger Dunning explains most post on
I do not have any experience with their products, but the solution offered by this company seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.
Now, if I were to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is finding a cheap enough way to turn pages automatically. See also Kris Mckenzie's automatic page turner; still, the best start is this document, which is a proposal and overview on how to build an automatic page turner from parts, for a total cost of $459.
Reading over these responses I realized what it is that bugs me most about having a reference manual in PDF or some other electronic format versus having a nice book in my lap: I don't have the screen real estate for both a document reader and whatever app it is I'm using the reference for.
The endless jumping between windows gets old real fast, especially if I need to copy a code snippet out of a document (like a PDF) that won't let me select & copy text.
But if I had a second monitor right there at eye level, I could just open up the reference doc there. No more switching between windows, and no more neck strain from constantly looking down at a book in my lap and then up at the screen.
My current setup consists of:
4 HP scanners with ADF, ~$150 ea. (eBay)
4 Sparc LXs from a property control auction, $50
One flatbed scanner for covers and bad scans, $50 (eBay again)
Barebones system w/ SCSI from Compgeeks, $80 (NFS server)
An Amtren device (courtesy of the office), and away you go.
I've found the best way to cut off the bindings is to use a box cutter and to use your previous cuts as a guide. I have several shell scripts to scan various types of books. It's amazing the page numbering schemes some publishers use. With this setup I can scan approximately 2-3 college textbooks of 1000 pages each (grayscale), or 1 in color, in a 10-hour period (including checking for bad scans; SANE ain't perfect, so you better check 'em).
Also, JPG isn't very good for OCR. I store as PNG, and convert a second set to JPG for web viewing. OCR under Linux isn't quite there yet (unless you want to pay through the nose), so I am archiving the PNGs to CD until it is. This also allows me to regenerate the JPGs if I lose a webserver disk. Add a nifty little ImageMagick web viewer and voila! eBookshelf! Oh, and an NSM CD changer is nice too, to get to the CDs nearline. You can pick these up on eBay for $200-$400.
People love books in dead tree format for the most part. You don't really want to curl up with a cup of coffee and a nice monitor.
Why the hell not? Isn't that what we all do while working?
I started with the easiest books: books that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files because the default naming convention is *01.jpg, *02.jpg, *03.jpg, etc.
It is a problem for two reasons:
most of my books are double sided.
My HP scanning software for windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.
If books contain more pages than the ADF holds, The first page scanned will still be named page 1.
If I knew a little perl, I'd write a script to rename the files between scan batches.
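Since the poster wished for a rename script, here is one way it might look, in Python rather than Perl. Treat it as a sketch under assumptions: it presumes the scanner names each batch with a trailing sequence number (*01.jpg, *02.jpg, ...), that you scan all the fronts first and then flip the stack to scan the backs (so the backs arrive in reverse order), and that the `pageNNNN.jpg` output naming is acceptable. None of the function names come from the HP software.

```python
import os
import re


def plan_renames(filenames, first_page, step=2, reverse=False, width=4, ext=".jpg"):
    """Map scanner-assigned names to real page numbers.

    first_page: page number of the first sheet side in reading order
    step:       2 for one side of double-sided pages, 1 for single-sided
    reverse:    True when the stack was flipped, so the batch runs backwards
    """
    def sequence_number(name):
        match = re.search(r"(\d+)\.\w+$", name)
        return int(match.group(1)) if match else 0

    ordered = sorted(filenames, key=sequence_number)
    if reverse:
        ordered.reverse()
    return {
        old: f"page{first_page + i * step:0{width}d}{ext}"
        for i, old in enumerate(ordered)
    }


def apply_renames(directory, plan):
    """Carry out a plan; assumes new names never collide with pending old names."""
    for old, new in plan.items():
        os.rename(os.path.join(directory, old), os.path.join(directory, new))


if __name__ == "__main__":
    fronts = ["scan01.jpg", "scan02.jpg", "scan03.jpg"]
    print(plan_renames(fronts, first_page=1))                 # pages 1, 3, 5
    print(plan_renames(fronts, first_page=2, reverse=True))   # pages 6, 4, 2
```

For a book that needs several ADF loads, the next batch of fronts just starts at a later `first_page` (for example 101 after fifty sheets), which addresses the "first page scanned will still be named page 1" complaint.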
For scanning full bound textbooks, there are two main problems:
Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.
You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.
Do not even consider manually scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & white scans).
Lastly, be realistic.
Know how much time you'll need to invest.
Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.
is http://docs.rinet.ru:8080/ - I ran across this site a few years back. It almost looks like an online library for a Russian ISP's technical support staff.
They've got lots and lots of official books, all HTMLized a chapter or a section at a time. They're all a bit old or out of date, too - I know of one Perl book in particular that they have there was one edition behind what was being sold on the shelf at the time I saw it.
-----
Is Darwin an evolutionary OS?
Come to the University of Mars! Classes starting soon!
Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's a significant amount of open source code for it, and more.
It's truly a brilliant format. Go check it out.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
A mathematician, a physicist, and an engineer are asked to find the volume of a red rubber ball. The mathematician measures the diameter and calculates the ball's volume. The physicist submerges the ball in a full beaker, and measures the amount of water that spills out to get the volume. The engineer turns the ball over until he finds its serial number, then looks up the volume for that model on his Red Rubber Ball Table.
Half of the library in my office is catalogs and equipment data sheets for components. A lot of the rest is more generalized data like stress concentration factors for various object geometries and material characteristics; these are things that CANNOT be derived from theory. Only about 4 of my books (which, admittedly, I do use a great deal) are theoretical books. Physics, Advanced Math, Design of Experiments, and a Mech. Eng. Handbook. When you work with real objects, rather than just theory and pure numbers, you tend to need a lot more detailed reference materials. And I'm sure that at least one Engineer in the red rubber ball industry has himself a Red Rubber Ball Table.
Because the yellow highlighter looks like shit on my CRT.
I have outfitted my $100 Handspring Visor with a Compact Flash Springboard module and now I can carry around over 100MB of books in my shirt pocket. The darn thing is even backlit so that I can read in the dark. What's more, I can search for keywords and annotate the books to my heart's content.
What really settled it for me was when I started reading Structure and Interpretation of Computer Programs on my Visor and could do the example programs in LispME.
Needless to say I prefer my Visor over the dead tree version for any book that is text heavy.
There are hundreds of them here. Very few are the kind of dopey software manuals you're referring to. Is that a "dearth?"
Find free books.
Scanned text pages should be black and white.
Of course it won't scan this way due to shading, bits of wood chips on the pages, etc. Your image processing software can/should convert it to literally two colors-- black text + background (white). As you can imagine, this kind of "lossy" conversion cuts out a great deal of information and the file size reflects this.
Combine that with a lossless compression algorithm, which takes these huge areas of the same value and compresses them very tightly, and you have a tiny, high-contrast, easy-to-read (or OCR) image.
Now with JPEG, it "loses" information by smoothing (forgive my oversimplification of a complex mapping process). With text you *want* unsmoothed (hard) edges-- it makes things easy to read. The JPEG smoothing process results in hard to read text, so you can't use as much of it before the image degrades too badly to read.
The result, the 2-color conversion with lossless compression gives you a smaller image size for the same relative viewing quality as a JPEG. (Or the flip side, for the same image size, the 2-color image is much more readable than the JPEG.)
Try this-- take a screenshot of some text. (Only text) From the GIMP, convert it to 2 colors and save as PNG. Then save it as a high-quality JPEG and a low-quality JPEG. Check the file sizes versus the clarity of the text.
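If you don't have the GIMP handy, the first half of that experiment (threshold to two colors, then compress losslessly) can be approximated with nothing but the Python standard library. This is a rough sketch under stated assumptions: the "page" is a synthetic grayscale image with paper noise and dark bars standing in for text, zlib's DEFLATE stands in for PNG's compressor, and the 128 cutoff is arbitrary. There is no JPEG encoder in the standard library, so the JPEG side of the comparison is not reproduced here.

```python
import random
import zlib


def make_page(width=400, height=200, seed=0):
    """Synthetic grayscale scan: noisy light paper with dark bars as fake text."""
    rng = random.Random(seed)
    rows = []
    for y in range(height):
        in_text_line = (y % 20) < 8
        row = bytearray()
        for x in range(width):
            ink = in_text_line and 20 <= x < width - 20 and (x // 6) % 2 == 0
            base = 40 if ink else 220
            # Noise models paper shading and specks; stays within 0..255.
            row.append(base + rng.randint(-25, 25))
        rows.append(bytes(row))
    return rows


def to_bilevel(rows, cutoff=128):
    """Threshold to 1 bit per pixel (1 = ink), packed 8 pixels per byte."""
    packed = bytearray()
    for row in rows:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for px in chunk:
                byte = (byte << 1) | (1 if px < cutoff else 0)
            # Left-align a short final chunk when width is not a multiple of 8.
            packed.append(byte << (8 - len(chunk)))
    return bytes(packed)


def compare(rows):
    """Return (grayscale, bilevel) sizes in bytes after zlib compression."""
    gray = zlib.compress(b"".join(rows), 9)
    bilevel = zlib.compress(to_bilevel(rows), 9)
    return len(gray), len(bilevel)


if __name__ == "__main__":
    gray_size, bilevel_size = compare(make_page())
    print(f"grayscale + zlib: {gray_size} bytes")
    print(f"bilevel + zlib:   {bilevel_size} bytes")
```

The noisy grayscale data barely compresses, while the thresholded version collapses into long runs of identical bits, which is the effect the post describes.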
I have scanned several books (in my case, Atari and other classic computing books) for atariarchives.org. The process takes time, but is worth it.
A scanner with a reliable sheet feeder is essential. This doesn't necessarily mean expensive -- I've seen a lot of reasonable-looking scanners with ADFs on ebay for less than $100.
I cut the pages off the books using a single-edge razor blade -- non-ragged cuts are essential. Then I scan them into TIFF format at 300 DPI, greyscale. If I want searchable PDFs, I use OmniPage X on a Mac to create image-over-text PDF; it's quick and easy.
But most of the time, these books are for Web viewing. So I use a graphics conversion program with batch capability (GraphicConverter on the Mac) to a) increase the contrast dramatically -- near 100%; b) trim the whitespace from the edge of the images; c) scale the pages as necessary; d) scale them more to create thumbnail versions.
There are no hard-and-fast rules for choosing the final file type. Just got to balance file size and readability, and this varies from book to book. Sometimes I go with JPEG, sometimes 8-bit GIF, and sometimes 4-bit GIF. Sometimes I'll convert every page to GIF and also to JPG, then use a little script to select the smallest one for each page.
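The "little script to select the smallest one for each page" could look something like the following. This is a guess at the idea, not the poster's actual script: it assumes both conversions of a page sit in one directory and share a base name (page001.gif, page001.jpg), and the function name is made up.

```python
import os
from collections import defaultdict


def pick_smallest(directory, extensions=(".gif", ".jpg", ".jpeg")):
    """For each page base name, return the file name of the smallest variant."""
    candidates = defaultdict(list)
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext.lower() in extensions:
            size = os.path.getsize(os.path.join(directory, name))
            candidates[stem].append((size, name))
    # min() compares sizes first; equal sizes fall back to the file name.
    return {stem: min(variants)[1] for stem, variants in candidates.items()}


if __name__ == "__main__":
    for page, winner in sorted(pick_smallest(".").items()):
        print(page, "->", winner)
```

Smallest is not always most readable, so it is still worth eyeballing a few pages before deleting the losers.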
The digital representation of the "copyrighted" work as existed in a "page layout" program, using a technological means to prevent digital copying: Imaged to paper using digitally created "Plates".
By attempting to "recreate" the digital representation by using technological means to defeat the digital copy protection of a bound book, you are criminally liable to the owner of the copyright.
(Now if you were just copying this to another piece of paper, you may be ok under existing laws. But moving it to digital... Um, hands up scofflaw!)
I know exactly what you mean about the National Geographic CD-ROM set. I was very excited about having the complete archives available and was deeply disappointed in the quality of the final product.
Much of the text is completely unreadable because of over-JPEGging. (Is that a word? It is now.)
However, it did teach me to be very careful before plunking down $200+ for online books in the future. Now, I insist on a preview before I buy. (And yes, this does mean that many electronic collections don't get purchased simply because I can't find them in any libraries to view...)
And then 3 weeks after you chuck it, go "Damn, I can't read this page!" when you go to look up something and it says, "It is extremely important that you fark dnf2 gib oefll or else you will damage your hard disk."
Stick with books. There's a reason why they are popular. They work really well. Besides, the trees are already dead so you're not doing them a favor. And you'll just have to kill more trees to get more books to scan more stuff.
Wouldn't it be more productive if they divided the number of pages by the number of entrants to this sad "scanathon" and saw who finished first ? That way no work would be duplicated.
If you're going to rip off books, at least be efficient!
graspee
Put Cd in drive. Run Sad Old Easy CD Creator that came free with your cd burner, select "copy cd", select source and destination cd drive, click copy and follow on-screen prompts about changing cds over.
Just remember to search for a crack on the web too!
graspee
What is the oldest file that I have?
and ask:
What is the oldest useful file that I have?
For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.
You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.
For bi-level images, the standard to use is JBIG, comes from an ISO group similar to those that created JPEG and MPEG.
It generates much smaller files than GIF for printed text, with none of the inconveniences of JPEG. Grey scale pictures come reasonably well, if done at 300 dpi, dithered.
I don't know exactly why JBIG never caught on like those other standards. There don't seem to be many JBIG programs around, but, if you are handy with source code, there's jbigkit, a library for reading and writing JBIG files. I wrote my own software with that, and converted a half-ton of old magazines into a 20-pack Case Logic of CDs.
I am also faced with the task of converting thousands of pages from paper to text files. I suggest looking into using a high resolution digital camera in a custom docking station above a flat surface that holds the printed material. (a photo enlarger comes to mind). Then instead of waiting for the scanner carriage to pass downward over the page, you can take a snapshot of the page.
Send the image directly from the camera to the OCR program. I find that the Xerox TextBridge program can do OCR on a page almost as fast as I could turn the page were I not using a scanner to input the text. TextBridge is quite awkward to use and not very customizable for new types of applications such as this.
Using a high resolution digital camera to input OCR text is also a good way to get around the question of whether or not to cut off the binding of the book.
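How "high resolution" does the camera have to be? A back-of-the-envelope sketch in Python, assuming you want the same 300 DPI the original poster planned to scan at and that the page fills the frame exactly (both are assumptions; real setups lose some pixels to margins and lens distortion). The function name is made up for illustration.

```python
import math


def pixels_needed(width_in, height_in, dpi=300):
    """Return (width_px, height_px, megapixels) to capture a page at `dpi`.

    Assumes the page fills the camera frame edge to edge.
    """
    w = math.ceil(width_in * dpi)
    h = math.ceil(height_in * dpi)
    return w, h, (w * h) / 1_000_000


if __name__ == "__main__":
    # US Letter page, 8.5 x 11 inches
    print(pixels_needed(8.5, 11))
```

For a letter-size page that works out to 2550 x 3300 pixels, a bit over 8 megapixels, so a camera well below that will give the OCR program less to work with than a 300 DPI scan would.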
By the way, I assume that you're wishing to scan European-language text. Doing OCR on Japanese, Chinese, or Korean would, I assume, be much slower than recognizing ASCII. Does anyone know of an available program that will do OCR on Chinese?
With our friends in the middle east obsessed with blowing the shit out of us, it might be time to develop an open-source program that will do OCR on Arabic and Farsi, along with a translation program companion. Would Arabic be much more difficult to OCR because all of the phonetic symbols are joined together? I sometimes wonder about these things when I'm bumming about not having a life.
I was locked out because of their spidering filter, too. But I called up at like eight o'clock one night & someone unlocked it for me (& set it so that it wouldn't happen again).
Safari also has a very good search engine, although it's weird that they coded it in MS ASP.
The spidering filter seems intent on inhibiting the casual copier. I thought this was lame, but there's actually a certain logic to it. If you go to all the trouble to download & reassemble the books, then you've put enough work into it not to just throw the book out there on Gnutella.
At its most expensive, Safari books cost $2 per month. So I'm not impeding anyone's education, and I'd like to see this service stick around. In fact, I can save people a bundle if I get them to use it the way it's meant to be used.
The one lame thing is that O'Reilly pads their selection with multiple editions of the same book and also with books that are available for free on the openbook site--ok, that's like five books, but still... They're really starting to get a good selection now.
In college, I used a free (as in stolen beer) HTML copy of a textbook for a class, and realized at the end of the year that someone had purposefully altered the book so that a lot of information was horribly incorrect. They'd basically cut out the word "not" all through the book, and inserted it after "is" in other places. Most people would not do that, but some a-hole did. Ah, college, what a hellhole.
This sounds a lot like that PDF that was on the NYTimes (I think) where they had a list of names of people, but they were blacked out. Someone with a slow connection or something like that was able to see the names at first, and then the black squares loaded afterwards over the names.
Pretty strange stuff =P
Amen on that hypertext comment. The battle has not even begun.
Most folks aren't lawyers, but generally people have seen some texts of court opinions at one point or another. I was just going over some court documents related to the patent courts --AKA, the CAFC-- and I was struck by how computer code-like the text was. The only reason people think it's hard to read court cases, especially patent court cases, is because they're riddled with links to other cases. Since the system was developed in a book-only format in a rather rag-tag fashion, the text becomes very difficult to read because of all the notations they've used to indicate varying types of links.
In my opinion, requiring the legal system to use electronic hyperlinked texts for court opinions and other legal documents is absolutely essential to any kind of IP reform. Until judges are benefitting from hypertext in an immediate way, they're going to fail to see the urgency of advocating its use or deciding in favor of electronic formats.
Law and court documents should be readable by anyone with standard high school level English skills. The same is true for patents themselves. The core of a patent isn't the drawings. In fact, the drawings are often intentionally misleading to avoid disclosure of important information valuable to competitors. The important part of a patent is the references to other works; these are natural places for hyperlinks. I bet Bounty Quest would move a lot quicker if patents had hyperlinks.
http://www.claraocr.org/
"Clara OCR is a free (GPL) OCR for systems that support the C library and the X windows system (e.g. most flavours of Unix). The development platform of Clara OCR is 32-bit Intel running GNU/Linux.
Clara OCR is intended for large scale digitalization projects....."
Haven't tried it, but it looks good.
Jeff Knox
"A man observed by the celebrated Dutch physician Hermann Boerhaave took his meals at a table that had been cut away in a semicircle to accommodate his circumference"
:)
No kidding. I never saw him, but my Grandmother has stories about this.
(But I weigh all of 170# without any flab at all.