Large-Scale Paper-To-Digital Conversion?
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
While PDFs are pretty well supported, you'll still be storing it as raster data, so there won't be any size decrease over using an image format, such as PNG.
Are there any web-based packages for searching documents, based on OCR-extracted keywords? Obviously with messy hand-written notes, formulas, etc, OCR won't work reliably. For a similar project, I'd like to OCR the files and use the text data solely for keyword searching. Obviously not perfect, but better than just images.
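For what it's worth, the keyword-search half of this doesn't need much. Here is a minimal Python sketch of an inverted index built from whatever text the OCR step manages to extract. The page ids and the OCR text are hypothetical inputs; the OCR itself is assumed to happen elsewhere, and misread words simply won't be found.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return runs of 3 or more ASCII letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def build_index(pages):
    """Map each keyword to the sorted page ids whose OCR text contains it.

    `pages` is a dict of {page_id: ocr_text} (hypothetical inputs).
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the page ids that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

A search only returns pages where the OCR happened to get the word right, which is the "not perfect, but better than just images" trade-off.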
PNG is your friend....
Just fax the documents to a computer.
I think you're right on the money. It may be well worth taking the job to an outside agency. There are many print shops using Xerox Docutechs, which scan in many hundreds of sheets at once to print copies of documents. The scanning takes barely a second a page, and it wouldn't surprise me if the document format being stored inside the Docutech is something that can be used for this purpose.
I've had a similar job, where our school's lecturers wanted their notes in the same style so one of my jobs as admin assistant was retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books, expecting them to be retyped completely in their own style. Give an inch they'll try to take a mile - use the few hundred $$ to get it professionally scanned.
Do it the open source way.
Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.
I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.
Get some students of the professor's course to type them into LaTeX. Give them some points they'd otherwise get for homework.
a) Publication quality DVI/PS/PDF files
b) The student can deepen their knowledge of the topic
Everyone happy. Used to work like this at the university I went to. And you may be even lucky that some student typed these notes in for himself.
I found that the DjVu format produces a substantially smaller file than PDF for the same scanned image.
There is an open-source project http://djvu.sourceforge.net/ that provides code for reading DjVu docs, but I have no idea where to get a DjVu encoder.
...retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books...
Isn't that copyright infringement?
Unless, of course, they wrote the textbooks.
After you get all of it scanned and put through OCR, there will still be a ton of mistakes you'll need to correct.
Now, at this point, you'll likely start wishing that you lived in Canada (if you don't already).
The key is in volunteers, to bastardize "1984". Get a number of fairly intelligent high school kids that haven't done their 40 hours of community service (a graduation requirement).
Now, make them look at the originals and the scans, and correct all the discrepancies.
Bonus: if the kids are the nerdy types, tell them that they're learning university material for free.
They could start paying you!
Kent State just announced their FlashNotes website. I go to school there; email me at fiveonethree@yahoo.com. I would be more than happy to come down and help you sort out your options.
A bit of opinion on the project. This is not a good idea. It's one more tool that students will rely on to memorize information instead of taking time to THINK about their subjects and really LEARN the material.
The Internet Archive did or plans something like this for their scanned book project. The cost of sending scanners to (India | China | Malaysia) and paying a few cents per page in labor is less than doing the same job here.
Even half an hour is being generous, on your side. As a consultant the smallest unit of time I was even allowed to quote was four hours, meaning that the client was looking at at least a $500 bill every time I got involved (or if I was involved, each time they changed directions or wanted something done differently because they changed their mind.) Needless to say, I was allowed to stay focused on the actual project and rarely got hit with mickey mouse crap like changing the colors or fonts or rearranging the buttons on the screen because the secretary likes the word 'Yes' instead of 'Ok'.
... there is significant rework that needs to be done and the associated ramp down / ramp up time is often a big chunk.
Granted if it was something reasonable and I could do it without shifting gears (mentally) I would usually slipstream it into the work I was doing and not write it up. If it was redoing work that I had already done, or worse if I recommended doing it one way and they mandated I did it some other way and after I was knee deep in it decide to go yet another direction or even in the direction I originally suggested
Glonoinha the MebiByte Slayer
I tend to scan lots of documents, so I set up a simple Perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.
I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.
I scan to tiff and then use the convert utility (part of imagemagick) to convert to png. The resulting files typically run about 100K to 200K depending on the content.
If anyone's interested in seeing the perl script I've posted it to: www.ollies.net/scanscript.html
Steve
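For anyone who wants to try the scanimage-then-convert workflow described above, here is a rough Python sketch. It is not the Perl script mentioned in that post. The `--mode` and `--resolution` options depend on the scanner backend (the value "Lineart" is an assumption; check `scanimage --help` for your device), and the `pageNNN` file naming is made up for illustration.

```python
import subprocess
from pathlib import Path


def scan_command(resolution=300, mode="Lineart"):
    """Build a scanimage command that writes one TIFF page to stdout.

    The mode name is backend-specific; "Lineart" is an assumed value.
    """
    return ["scanimage", "--format=tiff", "--mode", mode,
            "--resolution", str(resolution)]


def convert_command(tiff_path, png_path):
    """Build the ImageMagick command that converts the TIFF to PNG."""
    return ["convert", str(tiff_path), str(png_path)]


def page_path(directory, page_number, suffix):
    """Return a zero-padded file path such as notes/page007.png."""
    return Path(directory) / f"page{page_number:03d}{suffix}"


def scan_page(directory, page_number, resolution=300, mode="Lineart"):
    """Scan one page into `directory` and return the path of the PNG."""
    tiff_path = page_path(directory, page_number, ".tif")
    png_path = page_path(directory, page_number, ".png")
    with open(tiff_path, "wb") as out:
        subprocess.run(scan_command(resolution, mode), stdout=out, check=True)
    subprocess.run(convert_command(tiff_path, png_path), check=True)
    return png_path
```

Calling `scan_page("notes", 1)` needs a working SANE setup and ImageMagick on the PATH; the command builders are kept separate so they can be checked without a scanner attached.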
Most of my profs would just scan in the handwritten notes and put them on the net. They were absolutely astonished when one prof showed off his multivariable calculus notes that looked readable, and I think more of them will be doing this. They don't seem to mind the fact it's in handwriting in a huge jpg; in fact they love it. I don't think students mind either, since they have to read the handwriting during lectures anyway, so having to read the notes won't be too bad.
The quality of the scanning is obviously important; get or borrow the best scanner you can. The point made about putting a black backing onto a flatbed scanner is important. Also important is adjusting the scanner settings so that you get minimum noise (random black dots) without degrading the stuff you want to keep.
For this sort of thing you almost certainly want to do it bi-level/B&W/one bit deep (hopefully there are no shaded pictures, but you can use screening for those), and to my knowledge nothing has been developed that compresses these images better than CCITT Group IV (fax machines use Group III). You almost certainly don't want to use grey-scale, at least not for your final images.
You should see if you can find some post-processing software; we used to use ScanFix, which would straighten the image (which makes Group IV compression a lot better) and depending on settings clean it up as well. You also need to decide upon the size of the final images; you want to scan at 200 to 300 or even 400DPI, but you don't have to have final versions at those high resolutions.
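To put rough numbers on the bi-level versus grey-scale and resolution choices, here is a small Python calculation of the uncompressed size of one scanned page. The letter-size page and the assumption that each pixel row is padded to a whole byte are mine; compressed sizes depend entirely on the content and the codec, so none are estimated here.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page.

    Assumes each pixel row is padded up to a whole number of bytes.
    """
    cols = round(width_in * dpi)
    rows = round(height_in * dpi)
    bytes_per_row = (cols * bits_per_pixel + 7) // 8  # ceiling division
    return rows * bytes_per_row


if __name__ == "__main__":
    # Letter-size page (8.5" x 11") at the resolutions mentioned above.
    for dpi in (200, 300, 400):
        bilevel = raw_page_bytes(8.5, 11, dpi, 1)
        grey = raw_page_bytes(8.5, 11, dpi, 8)
        print(f"{dpi} DPI: 1-bit {bilevel:,} bytes, 8-bit grey {grey:,} bytes")
```

At 300 DPI a letter page comes to roughly 1 MB raw at one bit per pixel and roughly 8 MB in 8-bit grey-scale, before any compression is applied.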
The standard used to be TIFF images with Group IV compression, but not every image viewer can read them, or display them well (esp. if the image needs resizing, and I doubt you can assume everyone reading these has their monitor at a high resolution).
If PDF will accept and display images compressed with Group IV compression, you're probably best off with that, since Acrobat Reader is ubiquitous and fairly easy to use.
PNG is a nice format that I use by preference for > 1 bit deep images, but a quick check of some PNG documentation says that Group IV "often" compresses a lot better than 1 bit "greyscale" PNG; it was simply not designed for document imaging. And you also want to avoid JPEG, it's a lossy (will introduce artifacts) system that also wasn't designed for bi-level images.
Hope this helps.
I see I'm not the only one with one. The nice things about it are that it's built like a tank (weighs like one too) and can handle legal size. The only downsides are that the resolution isn't as high as modern scanners, and that the sheet-feeder is bulky.
Anyway I run the output through this and a bit of OCR (doesn't have to be perfect), and store it in a Database.
Xerox bundles OCR as a software add-on. It works well when you get it all set up at your company. By the time you get back to your desk, the document is open and ready to be OCR'd with a drag and drop.
It obviously wouldn't be so convenient if he had to go to Kinkos, but they might have it set up on one of their machines. (Yeah, I doubt it, too.)
We have one of these in our office and they're great for taking stacks of workpapers from clients, scanning them in and getting rid of the originals. You can email a PDF directly to someone, or store the PDF on a server somewhere.
Charge by the hour, at least 50 dollars an hour. That way you can hire 3 students at 10 bucks an hour to do the actual work.
So, I recommend scanning to TIFF (or TIFF inside PDF). Even if you don't currently have the encoding software, you can convert to JBIG2 compression later as it becomes more and more ubiquitous in the future.
And definitely use an automated document feeder of some sort to keep from going crazy. Newer Xerox machines work pretty well for this (I use a DocumentCentre 440ST for this all the time) unless you have hundreds of thousands of pages to deal with, in which case you should either invest in industrial scanning equipment or outsource to a scanning center that does.
As other people pointed out, if you can get a couple of departments in on this, then you can more easily amortize the costs of really good equipment to do this...
One thing that I'll note is that I don't really like PDFs for this sort of stuff. If you really have a 100 page article, you're going to be looking at a 3 meg file and, perhaps, a 30 second startup time... That's fine for someone who's going to read the document from cover to cover, or print it... On the other hand, it's a pain if you only want to look at pages 37 and 38.
GrokLaw gets PDFs of court filings regularly, and I got so fed up with PDFs that I created a (semi-automated) batch system to split up the PDFs into separate PNG images and create a simple index.
You can see a sample here. Far easier to view a page or two there (IMNSHO) -- but not as easy if you just want to download and print it.
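The "simple index" part of a setup like that is easy to sketch. Below is a minimal Python example that writes a plain HTML page linking to one image per page. It assumes the PDF has already been split into PNG files by some other tool; the file names and the page layout are invented for illustration and are not the batch system described above.

```python
import html


def build_index_html(title, page_files):
    """Return a plain HTML index with one link per page image.

    `page_files` is an ordered list of image file names (hypothetical).
    """
    items = []
    for number, name in enumerate(page_files, start=1):
        href = html.escape(name, quote=True)
        items.append(f'<li><a href="{href}">Page {number}</a></li>')
    safe_title = html.escape(title)
    return (
        f"<html><head><title>{safe_title}</title></head><body>\n"
        f"<h1>{safe_title}</h1>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body></html>\n"
    )
```

Writing the returned string to `index.html` next to the images lets a reader jump straight to pages 37 and 38 without downloading the whole document.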
Before you go too far, you might want to get a good handle on how people are likely to use what you produce. Use that knowledge to decide just how you want to organize the result. You may want to make it available in two (or more) different formats. It's not that difficult to bulk convert things between different forms (at least, not if you can dual boot into Linux, or have OS X).
Use the fairly user-friendly LyX to do the LaTeX-ing.
Heck, get the academics themselves using it to prepare their notes in the first place!
They might actually thank you for introducing them to this convenient and easy document processor.
ScanSnap may be just what you need if the notes are on uniform-sized paper (e.g. A4 or letter). You need Acrobat (included) on a Windows machine, but you just set the notes on the scanner and click the mouse, and it scans 50 sheets (both sides in one pass) without human intervention and gives you an Acrobat file in a few minutes. It is small and light, so you can easily bring it into the secretary's office. The price is also reasonable ($495 with Acrobat 6.0), and it seems they are even offering a $100 rebate now.
The specified resolution is for color documents. For a b/w one, you will get a better resolution. You can obtain scan samples from a Japanese page (pdf files at the bottom).
Actually, a newer model, the fi-5110EOX, is already available in Japan, and I think that is why they are offering a rebate now. The new model has a USB 2.0 connection and a higher resolution mode (excellent) that is not possible with the fi-4110.
I don't know what the specifics of your work are, but you probably have a huge supply of untapped workpower at your fingertips.
The students who are taking these classes could easily be a source of tappable work hours.
See Project Gutenberg's proofreading site for an example of this type of effort. http://www.pgdp.net/c/default.php
If you could get the professors to offer a little bit of extra credit for proofreading or converting a page, the task could be much easier for you.
Envision this: You use an ADF to scan an entire stack of notes in order, but you don't worry about how the scanning goes on each page. Then you xerox the whole stack and place the copies in a binder in someone's office. The students are then offered 10 points extra credit per page translated from .TIF to Word/WordPerfect/Mathematica, whatever, up to three pages worth.
The points are justified since the student is in the class and learning something by carefully duplicating, analyzing, correcting, and studying the professor's notes for that class. (Can you imagine a more likely way to end up accidentally committing three pages of facts to memory?)
You can place the .TIF files in a class-accessible online folder, and accept the end result in an e-mail.
If the file isn't legible, the student can check the xeroxed copy out from the binder. Since it's just a copy, you don't need to worry about losing it.
You could skip the scanning altogether, and ask the students to return any pages they don't finish translating.
Obviously this works best for large classes where the student:pages ratio is large.
Make sure you number pages if you do anything like this.
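If you go the extra-credit route, the bookkeeping for who has which numbered pages can be a few lines of Python. This is only a sketch under my own assumptions: pages are numbered from 1, each student gets at most one block of up to three consecutive pages, and the student names are placeholders.

```python
def assign_pages(total_pages, students, max_pages=3):
    """Give each student one block of up to `max_pages` consecutive pages.

    Returns (assigned, unassigned): a dict of student -> page numbers,
    and a list of page numbers left over when students run out.
    """
    chunks = [
        list(range(start, min(start + max_pages, total_pages + 1)))
        for start in range(1, total_pages + 1, max_pages)
    ]
    assigned = dict(zip(students, chunks))
    unassigned = [page for chunk in chunks[len(students):] for page in chunk]
    return assigned, unassigned
```

The leftover list shows at a glance whether the class is large enough to cover the whole stack or whether some pages still need a volunteer.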