Large-Scale Paper-To-Digital Conversion?
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
Uh. How about telling your prof. to get stuffed and get a real secretary.
The owls are not what they seem
Just say 'No'. (If you're being told, it's a different matter, of course).
It sounds to me like a damned hard job to automate (which is the only way it's not going to be a constant drain on your time), and you're being given next-to-no resources to even come up with a creative solution. Sometimes the best answer is in fact 'No' - it forces people to re-evaluate what they're asking. It comes with the danger of being sacked if it's you that's being unreasonable, of course....
Simon.
Physicists get Hadrons!
"Ummm yeahhhh... if you could just do that..."
Faust7 is right about this one. Frankly, OCR is ok, but not great - on nice text on book-or-better paper. Handwritten notes? With equations? No. Not unless your profs have some damn fine handwriting and we all know that that is absolutely not the case.
My advice is the same as Faust7's with these additions: spend some of that money on a really nice keyboard, wrist-rest and/or maybe a nice monitor. You are going to be needing all three. If there are any leftover funds, get some really nice tea. I suggest Twinings English Breakfast or Prince of Wales, if you're going to go bagged.
Exocet Industries - Taking over the world, one computer at a
Will you please tell both of us where we can get one for a few hundred dollars, as specified in the question?
.. he's just going to have to spend some good quality time getting to know a consumer-level scanner, and let the professor know to do his notes in software initially.
I think the real answer is that this guy is S.O.L.
But the broader question is whether this is really a good idea. The result is going to be huge files, which will be messy, hard to read, and will lack an index or table of contents. Seems like a case of profs with too much ego and not enough willingness to put their own work into more useful form.
Find free books.
Maybe he *is* the cheap manual labor / unpaid intern...
I know the parent post was funny but he's thinking along the right ideas.
Take the few hundred you have to spend on equipment and spend it hiring a few temps.
A good typist should be able to type up hand written notes faster than scanning them all in and manually fixing all the mistakes.
Outsource the job to India.
Not as bad an idea as it sounds. My advice is to not waste the department's money, and your time, buying, installing, and using a sheet feed scanner. Somebody in your local area assuredly has one already that they either rent out to people in your situation, or that they use to do the work you need done.
Use the funds that the department gave you to have your local copy shop do the work. They will almost certainly do it faster than you could, and the end product will most certainly be better than what you could provide. This is the kind of thing that the people who work at copy shops do for a living.
Also PDF is a great format for this, highly portable, and so far fairly version proof. You don't have to worry about the PDF being obsolete before the professor decides to change the structure of his class.
I'd say "Sure. I'll get this done, when I can. Don't expect it to be done for at least a few weeks, maybe longer."
DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.
With the kind of volume you say you're receiving, the only way you're going to survive is to:
1. close your eyes,
2. load the documents into the feeder,
3. press 'scan'.
4. Make sure everyone knows this policy.
poorly set expectations. How did the professors get the idea that it was possible? It's not possible under the constraints that you are faced with. If money was not a limiting factor you could do this. But I'll assume money is a factor and time as well. So go back and tell them that it's possible, but it's going to cost this much to automate the process, this much if you type it in by hand, this much if someone else does it but with poorer accuracy, and so on and so forth. Put the burden on them to decide how they want to deal with this. Only then will the appropriate solution be found and chosen.
http://tinyurl.com/3t236
Why do it all one way? It sounds like a very great deal of stuff that may never be used by students. Why not try to find a prof who will cooperate with letting you see his/her webpage usage patterns?
In my experience, it is very hard to predict what students will use for any given class based on the moronic ramblings of /.ers claiming to represent all students. On the other hand, by trying different things in my classes, I've been able to find out what my students will use eagerly. Hint: It ain't the same type of thing for every class!!!!!!
I'd like to say that you're at a really shitty university that would take this kind of student-hostile course of action, but then, I checked out MIT's Open Courseware only to find that the first course I looked at, Gilbert Strang's linear algebra, was a botch job. There was a postage-stamp-sized video of Strang telling anecdotes on the first day of class that could only be appreciated by someone who'd already taken the class. So much for leveraging the web's inherent strong points!
The professional approach is to go back to them and clarify the outcome:
(a) you can scan the documents in, and they'll take X amount of space, and Y time; and this doesn't include OCR;
(b) you did a few tests (using the supplied document) and these are the results for TIFF, JPG, PDF, etc;
(c) OCR is probably infeasible (or not, do some tests) because of the nature of the documents;
Include in (a) the option of purchasing an automated document scanner, and the corresponding reduction in time.
Based upon all the above, get a clear go-ahead, and make the purchase if new equipment is authorised.
You said "where I work", so this is your job, and it's a bit poor to do as the other posters suggest and refuse to do the work. You need to make sure that the customers (the professors) understand exactly what they are getting, and give them a choice to buy into it or not, i.e. "clarify the expectations".
If you assess that it's 2 weeks' worth of work, and the professors don't disagree, then your supervisor just has to put up with it.
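For the "X amount of space, and Y time" figures in (a), a rough back-of-the-envelope sketch in Python might look like the following. Everything in it is an assumption for illustration: US letter pages, a 300 dpi scan, 90 seconds of attention per page (the middle of the 1-2 minutes the submitter quotes), and function names I made up. The sizes are uncompressed upper bounds; a bilevel compression scheme such as CCITT Group 4 inside a PDF or TIFF should come in well under them, which is exactly what the tests in (b) are for.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def estimate_job(pages, seconds_per_page, page_bytes):
    """Return (hours of operator time, total megabytes) for a stack."""
    hours = pages * seconds_per_page / 3600
    megabytes = pages * page_bytes / 1_000_000
    return hours, megabytes


# Assumed: 8.5 x 11 inch pages at 300 dpi.
bw_page = raw_page_bytes(8.5, 11, 300, 1)    # 1-bit black and white
gray_page = raw_page_bytes(8.5, 11, 300, 8)  # 8-bit grayscale

# Assumed: one 100-page stack at 90 seconds of attention per page.
hours, megabytes = estimate_job(100, 90, bw_page)
print(f"{hours:.1f} h, {megabytes:.0f} MB uncompressed (1-bit)")
```

Swap in the timings and file sizes you actually measure on a sample stack before showing any of this to the professors.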
It makes no sense at all to me to have a PDF created of handwritten notes, since most students will probably just download and print out the PDF anyway. The only advantage is that it may save a few trees, since not everyone will print them out.
It sounds like the school wants to shift the production costs (i.e., printing) to the students. This seems inefficient compared with the old way, where the instructor could go to the copy center and have the notes copied at the school's expense (I know these expenses are often passed along to the students anyway), rather than at the students' DIRECT expense: their time for downloading, then printing on their own equipment or using their own printing accounts at the computer center.
If the notes were being OCR'd and then made available online, or post-processed so that they were searchable, indexed, etc., it would be useful. Otherwise this seems like a waste of time and money.
-MS2k
The company I work at scans large amounts of documents to PDF format on a daily basis. Depending on the volume some people do, we use either a Canon DR-3060 or DR-5020 document scanner. These will scan both sides of a page simultaneously, clean up the image (despeckle and deskew) and convert them into TIF or PDF all on the fly. They're fast too. Between 20 and 50 pages per minute. Only problem is that they're expensive.
For your budget, you may be able to afford the Canon DR-2080C which goes for around $600. It has all the features of the more expensive ones, but it's meant for smaller volumes like what you're dealing with. With that, you'd be able to scan 100 pages into a pdf document in around 5 minutes.
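To put the "100 pages in around 5 minutes" figure next to the 1-2 minutes per page the submitter is seeing on a flatbed, here is a small sketch of the arithmetic. It assumes the slow end of the quoted range (20 pages per minute), that a duplex sheet-fed scanner's rate is counted in sheets, and that a flatbed needs a separate pass per side; the function names are mine.

```python
def sheetfed_minutes(sheets, sheets_per_minute):
    """Duplex sheet-fed scanner: both sides are captured in one pass."""
    return sheets / sheets_per_minute


def flatbed_minutes(sheets, minutes_per_side, double_sided=False):
    """Flatbed: every side is a separate manual scan."""
    sides = sheets * (2 if double_sided else 1)
    return sides * minutes_per_side


print(sheetfed_minutes(100, 20))                    # sheet-fed, 100 sheets
print(flatbed_minutes(100, 1))                      # flatbed, best case
print(flatbed_minutes(100, 2, double_sided=True))   # flatbed, worst case
```

This ignores jams, reloading the feeder, and any cleanup afterwards, so treat the sheet-fed number as a floor.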
I know there are Adobe archival systems that store the scanned image, along with whatever text they manage to recognize. You don't expect near 100% OCR accuracy from an old, largely handwritten sheaf of lecture notes and transparencies. But hopefully enough is recognized to be of some use.
I think what your professor wants is not a bitmapped copy of his handwritten notes or some vector curves that resembles such, but actually a typeset version of the lecture notes. If that is the case, assuming that his handwritten notes are sparse (and hopefully without diagrams, since it takes more time to mess around with them), you can definitely do a stack of 100 sheets in a week, or, as someone already suggested, hire some typists to help you out.
I once had a signature.
1. Get Dragon NaturallySpeaking.
2. Dictate the essay, albeit a bit lengthy, into it.
3. Import to Word or your favorite word processor.
4. Add any cool equations and such that you cannot dictate.
5. Publish to PDF.
Nice small file size I'm sure.
Scanning is nice, but OCR only works with fonts it can recognize. Not Professorese.
It could take you a day or so to dictate, but after you're finished, more than likely you will have a lot fewer spelling and random letter and symbol problems.
But again, this might be more work than you want to do. Why? Well, if you do it this way and make a nice clean portable document that everyone can read, you might find yourself getting more "extra work" than you wanted.
Have it professionally done, like other people here have recommended. High-end sheetfed scanners are great, but you probably can't afford one, and it wouldn't make sense as a one-time expense for this small of a job. I'm a big fan of just handing someone some money and it's magically accomplished.
Alternatively, use a digital camera and well-lit copy stand. You can improvise a copy stand with a tripod or whatever, but make sure you have a lot of light. It's a lot faster than using a scanner, and the results are acceptable if you have a good camera. The more megapixels the better - don't use the old 1.3mp one you have lying around. 3mp will technically work, but more is better. Ideally a digital SLR pointed straight down at the page, a very well-lit area (a clamp light on either side of the page works nicely), and you sitting there sipping Starbucks while you hit a cable shutter release after you flip every page. You could get a few hundred pages an hour done this way--your only limitation is how fast you can turn the pages. You'd only have to stop to transfer images to your computer, and you only have to do that often if you don't have enough memory cards. After you get all the pages into the computer, feed them into Acrobat and you're done.
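To see why megapixels matter here, you can work out the effective resolution a camera gives you on a letter-size page. This is a rough sketch: the sensor dimensions are typical values I am assuming for each megapixel class, not the specs of any particular camera, and it assumes the page fills as much of the frame as the aspect ratios allow.

```python
def effective_dpi(px_long, px_short, page_long_in=11.0, page_short_in=8.5):
    """Pixels per inch on the page when the whole page is in frame.

    The tighter of the two axes sets the limit, because the camera's
    aspect ratio rarely matches the paper's.
    """
    return min(px_long / page_long_in, px_short / page_short_in)


# Assumed typical sensor sizes (width x height in pixels).
cameras = {
    "1.3 MP": (1280, 960),
    "3 MP": (2048, 1536),
    "6 MP SLR": (3008, 2000),
}

for name, (long_px, short_px) in cameras.items():
    print(f"{name}: about {effective_dpi(long_px, short_px):.0f} dpi")
```

For comparison, desktop scanners are commonly run at 200 to 300 dpi for text, so small handwriting and subscripts in equations are the first thing to suffer on the lower-resolution cameras.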
If you don't want to use Acrobat you could make a web page with thumbnails linked to the hi-res images. Then your end-users wouldn't need to download the Acrobat reader. I love Acrobat's ubiquity but hate the file sizes and the slow start-up time.
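A thumbnail page like that is easy to generate with a script. Here is a minimal sketch in Python using only the standard library. It assumes the full-size images sit in a `pages/` directory and that matching thumbnails with the same filenames already exist in `thumbs/` (made with whatever image tool you have); the directory names and the `build_index` function are my own invention.

```python
from html import escape
from urllib.parse import quote


def build_index(page_files, title, full_dir="pages", thumb_dir="thumbs"):
    """Return an HTML page of thumbnails, each linked to its full image."""
    items = []
    for number, name in enumerate(page_files, start=1):
        url_name = escape(quote(name), quote=True)  # safe in URL and attribute
        items.append(
            f'<a href="{full_dir}/{url_name}">'
            f'<img src="{thumb_dir}/{url_name}" alt="Page {number}"></a>'
        )
    body = "\n".join(items)
    safe_title = escape(title)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n{body}\n</body></html>\n"
    )


# Hypothetical filenames for a two-page set of notes.
index_html = build_index(["p001.jpg", "p002.jpg"], "Lecture 1 & 2")
print(index_html)
```

Write the returned string to `index.html` next to the two directories and upload the lot to the course page.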
I looked into this once for a client. Agencies charge around 5c a page but that is only to scan. Add more for OCR, manual verification and/or transfer to M$ Word or what have you. I think I recall seeing 50c a page for such value-adds. Agencies are good because you don't need to buy the kit (30K and up) or watch it run (they need feeding and jam quite a lot, especially if the paper is lower quality). Agencies also make sense for shops with nil/low expectation of producing more paper in the future. Get some quotes, references and examples of their work and start with a short trial run.
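To see how those per-page rates stack up against the submitter's budget, here is a quick sketch. The job size is a guess on my part (36 sets of 100 pages, from "several dozen sets" of "often 100+" pages), as is reading "a couple hundred dollars" as $200, and I am treating the 50c figure as the all-in per-page rate. Real quotes will differ.

```python
def agency_cost_cents(pages, cents_per_page):
    """Total cost in cents at a flat per-page rate."""
    return pages * cents_per_page


# Assumed job size and budget, for illustration only.
pages = 36 * 100
budget_cents = 200 * 100

scan_only = agency_cost_cents(pages, 5)    # scan-only rate recalled above
with_extras = agency_cost_cents(pages, 50) # rate recalled for OCR etc.

for label, cost in (("scan only", scan_only), ("with OCR etc.", with_extras)):
    verdict = "within" if cost <= budget_cents else "over"
    print(f"{label}: ${cost / 100:.2f} ({verdict} budget)")
```

On these assumptions plain scanning fits the budget and anything involving OCR or retyping does not, which is a useful thing to be able to show when you go asking for quotes.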
I wish it was Friday, but I don't want to wish my life away. So I wish it was last Friday.
Frankly, I've seen professors' handwritten lecture notes, and 90% of them add nothing to the educational process. Certainly not more than a quick note saying, "Read sections 2.1, 2.2, and 2.4, paying special attention to least-squares curve fitting and finding orthonormal bases." They're generally disorganized and difficult to follow because they usually take a lot of material for granted when they write.
The mere fact that it's handwritten means that it's basically a rough draft that was hastily flung together. Send them back to him, and have him type them in and rework them until he figures they're worth recycling for next semester. The prof will save time in the long run, and the students will have something nice, clean, and organized to peruse.
You want the truthiness? You can't handle the truthiness!
Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page.
After I stopped laughing, I realized this may be a serious inquiry rather than a joke. I've assisted local government agencies in converting clear, printed, 8.5x11" text documents into searchable text / pdf documents, and the cost for these is over 10 cents a page. (Tax and mill levy records have to be verified 100% correct, as I'm sure your prof's notes need to be.) That's with volume discounting (> 500,000 pages), using nearly perfect ascii text documents, not scribbled notes.
So my advice is to get a few bids from outside contractors, then submit a realistic estimate based on the average. Hint: Given those specs, it's clear you/your management have no idea what's involved in this process. (Shows at least a modicum of IQ that you had the good sense to ask, however.) If you simply need to scan/save as pics (jpg/tiff -> pdf), you can do this yourself at reasonable cost/effort expenditure. It seems to be implied that you need OCR capabilities for handwritten text, as complicated as equations at that, so you're really pretty screwed. Even simply creating 100-200 kb jpg's & emailing them in an automated process is going to run into problems when the campus mail servers refuse to accept attachments larger than a Meg.
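The attachment problem is worse than the raw file sizes suggest, because mail attachments are normally base64-encoded, which turns every 3 bytes into 4 characters and adds a line break every 76 characters. A rough sketch of the arithmetic follows, assuming "a Meg" means 1,000,000 bytes, that the limit applies to the encoded message body, and ignoring headers; the function names are mine.

```python
import math


def base64_mime_bytes(raw_bytes, line_length=76):
    """Size of a file after base64 encoding with CRLF-terminated lines."""
    encoded = 4 * math.ceil(raw_bytes / 3)
    lines = math.ceil(encoded / line_length)
    return encoded + 2 * lines  # 2 bytes of CRLF per line


def pages_per_message(limit_bytes, raw_bytes_per_page):
    """How many equally sized page images fit under the limit."""
    return limit_bytes // base64_mime_bytes(raw_bytes_per_page)


limit = 1_000_000  # assumed meaning of "a Meg"
for page_kb in (100, 200):
    fit = pages_per_message(limit, page_kb * 1000)
    print(f"{page_kb} kB pages: {fit} per message")
```

At a handful of pages per message, a 100-page stack is dozens of emails per course, so put the files on the web server instead.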
Good luck, BWAhahahahaha!