Slashdot Mirror


Large-Scale Paper-To-Digital Conversion?

An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"

110 of 459 comments

  1. Get stuffed by October_30th · · Score: 4, Insightful

    Uh. How about telling your prof. to get stuffed and get a real secretary.

    --
    The owls are not what they seem
    1. Re:Get stuffed by Amiga+Lover · · Score: 5, Interesting

      I think you're right on the money. May be well worth taking the job to an outside agency. There are many print shops using Xerox Docutechs, which scan in many hundreds of sheets at once to print copies of documents. The scanning takes barely a second a page, and it wouldn't surprise me if the document format being stored inside the docutech is something that can be used for this purpose.

      I've had a similar job, where our school's lecturers wanted their notes in the same style so one of my jobs as admin assistant was retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books, expecting them to be retyped completely in their own style. Give an inch they'll try to take a mile - use the few hundred $$ to get it professionally scanned.

    2. Re:Get stuffed by SoSueMe · · Score: 2, Interesting

      ...retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books...

      Isn't that copyright infringement?
      Unless, of course, they wrote the textbooks.

    3. Re:Get stuffed by October_30th · · Score: 3, Insightful
      WOW! Thats *so* helpful! Just refuse to do the job your employer is paying you to do... DAMN... why didn't I think of that?

      How do you know he's getting paid to do it? Some professors have a nasty habit of getting all their nasty, menial and boring stuff done by their students who are already working on their degree projects 12 hours a day, six days a week.

      Ok, so for some reason I assumed that the poster is a student so my initial reaction was probably off. I would never assign such a menial, dead-end task to my postgrad students, nor would I have accepted such a task without objections when I was still a student.

      --
      The owls are not what they seem
    4. Re:Get stuffed by Walt+Dismal · · Score: 3, Insightful

      No, seriously, this request shows an utter lack of concern by someone who may be a professor, but is also a bad manager and possibly an idiot. Your response perhaps should be to scope out the project and toss the estimate and the funding issue back into his lap. But do not let yourself be used as slave labor.

    5. Re:Get stuffed by djplurvert · · Score: 5, Insightful

      In addition to the points already made it is not unreasonable to simply tell the prof that his/her expectations are unreasonable. Perhaps "get stuffed" is a bit over the top but I've found that employers (even professors) will listen to reasonable explanations.

      I used to have a boss that would say things like "this should only take you about five minutes". I finally told him, "nothing takes just five minutes, if I have to stop what I'm doing there is a startup/teardown cost for every task." I convinced him that there was a granularity of 1/2 hour for every random task he wanted done. The discussion was fruitful for both of us, he was more reasonable about his expectations and put a bit more thought into what he wanted to distract me from my primary task to do.

      Now, the original idea is a reasonable proposition; however, it isn't really the sort of thing that should be done for just one prof. Perhaps several departments can combine their resources to set up something that will allow this type of thing to be done in a reasonable time frame.

      plurvert

    6. Re:Get stuffed by Glonoinha · · Score: 2, Interesting

      Even half an hour is being generous, on your side. As a consultant the smallest unit of time I was even allowed to quote was four hours, meaning that the client was looking at at least a $500 bill every time I got involved (or if I was involved, each time they changed directions or wanted something done differently because they changed their mind.) Needless to say, I was allowed to stay focused on the actual project and rarely got hit with mickey mouse crap like changing the colors or fonts or rearranging the buttons on the screen because the secretary likes the word 'Yes' instead of 'Ok'.

      Granted if it was something reasonable and I could do it without shifting gears (mentally) I would usually slipstream it into the work I was doing and not write it up. If it was redoing work that I had already done, or worse if I recommended doing it one way and they mandated I did it some other way and after I was knee deep in it decide to go yet another direction or even in the direction I originally suggested ... there is significant rework that needs to be done and the associated ramp down / ramp up time is often a big chunk.

      --
      Glonoinha the MebiByte Slayer
    7. Re:Get stuffed by Adian · · Score: 5, Insightful

      On the contrary, it's your job as a professional and as an employee to keep your employers in tune with what is possible, and what is most efficient for the manhours/money involved. As an employee you are also responsible for keeping them informed of ways to save money wherever that can be done. This particular job could require hundreds of manhours to do in-house, versus paying a place that actually specializes in these services to do it. I'd guess the university either has this equipment on campus, or already has a contract with some company for something similar.
      Besides, it sounds like they are not aware of the time involved in scanning tens, let alone hundreds, of pages. It doesn't sound like they are too anxious to make it easy for him to get the job done either (not buying him new equipment, using the secretary's Win2k box after hours??).
      I've volunteered my efforts before on a simple scanning job that required hundreds of regular photos to be scanned in at relatively good quality (why else do it otherwise), and ended up taking forever. Upon informing the client of the amount of time required, they adjusted the way the job was being handled.
      I think being straight with your employers, and clients is the best approach to any situation where too much is being expected. The times I've had these instances come up, and recommended different approaches that resulted in money being saved, or manhours on a task being reduced, I saw benefit in my paycheck through raises or promotions.

      --
      Adian
    8. Re:Get stuffed by gkuz · · Score: 2, Informative
      Xerox Docutechs, which scan in many hundreds of sheets at once to print copies of documents. The scanning takes barely a second a page, and it wouldn't surprise me if the document format being stored inside the docutech is something that can be used for this purpose.

      Truly, ignorance is bliss. This is clearly written by someone who has seen a DocuTech only from a distance.

      We have three of them where I work, and I have worked very very closely with them on a number of projects. Sure, they scan quickly, but you can't get the data out of them. They are copiers, essentially.

    9. Re:Get stuffed by curator_thew · · Score: 2, Informative

      Fair use for educational purposes has narrowed. In patent law, Madey v Duke 2002 found that a university wasn't allowed a non-commercial exception because "experimentation" was "furthering the business aims of the university". Yes, this was a hugely contentious case.

    10. Re:Get stuffed by Man+Eating+Duck · · Score: 4, Informative


      I've been working with various versions of the Docutech system for about six years, and they're in use in most of the professional copy/print shops around, at least in Scandinavia. They scan full page and double sided, 600 dpi at about 1 page/sec. Newer versions also can handle full colour.

      Native document format is tiff images with a proprietary control file (structuring, positioning etc), but you can easily convert it to pdf.

      I'd guess that a professional shop will charge you about 30 cents a page if you accept the raw document files without 'touching up'. This is more than adequate if you're just going to reproduce it on paper, or even distribute the PDFs. It'll weigh in at about 100k a page for the tiff format, and a lot less for the PDFs. This is black and white, which in most cases will suffice.

      Professional equipment (as in contracting a print shop) is definitely the way to go. I know that at the University of Oslo, Norway, they have established an in-house shop that will do this type of work internally for just about cost. Maybe that's an idea to put forth to the management? Surely your university will find other uses for it than just your assignment.

      Hope this helps :)

      --
      Are you a grammar Nazi? I'm trying to improve my English; please correct my errors! :)
    11. Re:Get stuffed by eliza_effect · · Score: 3, Funny

      professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page.

      Spend the money on buying them copies of Mavis Beacon Teaches Typing.

    12. Re:Get stuffed by Paisley+Phrog · · Score: 2, Informative

      Exactly. I work at a print/copy shop, and the last color machines we had (Xerox DocuColor 3535) served as document scanners as well as copiers/printers. Place your document in the feeder, select "scan" from the menu, pick your resolution, and press the big green button. When it completes, you log into the printer's internal web services page (reachable from its own IP address) and select how you want to download your files...JPEG, TIF, or PDF. I loved the PDF feature....too bad the 3535 was absolutely horrendous at everything else for high end color.

      This feature isn't available on every Xerox color (we don't have it on our new 2045, unfortunately), so you'll have to check around a little. Check with print shops and see if their Xeroxes have a "scan to file" capability.

  2. Kinkos? by axonal · · Score: 5, Informative

    Some Kinkos have those big goliath Xerox scanners which act just like copiers. Load up a stack of papers, and it will scan the pages and store them. Not sure about PDF export/etc though.

    1. Re:Kinkos? by zenquest · · Score: 5, Informative

      We have a Xerox WorkCentre Pro 65 at my school. It can scan at around 50-60 pages per minute, and will do double-sided. It will do PDF output, too. (and email it or FTP it to you, if so configured)

      Our teachers use them for exactly the purpose described. If you don't have one of these type machines around anywhere, then definitely give Kinkos or some similar establishment a try.
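
      To put the speed difference in numbers, here is a back-of-envelope sketch. The 1-2 minutes per page comes from the original question and the 50-60 pages per minute from the figure above; the 36 sets of 100 pages each are my own assumption, since the question only says "several dozen" sets of "often 100+" pages.

```python
# Back-of-envelope scanning time, using figures quoted in this thread.
# SETS and PAGES_PER_SET are assumptions: the question only says
# "several dozen sets" of "often 100+ page" notes.
SETS = 36
PAGES_PER_SET = 100


def hours_needed(pages: int, minutes_per_page: float) -> float:
    """Total hours of scanning for `pages` at a given per-page time."""
    return pages * minutes_per_page / 60


total_pages = SETS * PAGES_PER_SET

# Desktop flatbed: 1-2 minutes of attention per page (from the question).
flatbed_low = hours_needed(total_pages, 1.0)
flatbed_high = hours_needed(total_pages, 2.0)

# Sheet-fed copier: 50-60 pages per minute (from the comment above).
copier_slow = hours_needed(total_pages, 1 / 50)
copier_fast = hours_needed(total_pages, 1 / 60)

print(f"{total_pages} pages on a flatbed: {flatbed_low:.0f}-{flatbed_high:.0f} hours")
print(f"{total_pages} pages on the copier: {copier_fast:.1f}-{copier_slow:.1f} hours")
```

      This ignores loading the feeder, jams, and double-sided passes, so treat the copier figure as a lower bound.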

    2. Re:Kinkos? by zenquest · · Score: 4, Informative

      Going to Kinkos? Yeah, it's a bit pricey, but not totally out of bounds.

      If he's at a large university, some other department might have one of these. Xerox doesn't charge for scans when you lease the machine. They only charge for how many prints or copies are made, so it would be essentially free for another department to allow him to use their machine. It doesn't even require any additional setup, since you can enter any email address into the machine and have it send the document there directly from the copier. (assuming the SMTP server has been set on the machine)

    3. Re:Kinkos? by glowurm · · Score: 2, Informative

      Confirmed. Your local Kinko's should have the resources to scan the pages to PDF in an automated fashion. Call around and talk to a couple of locations, if necessary, to get the proper terms and pricing, but it should run about US$0.25 per page. The software used at the Kinko's I'm on familiar terms with (intimately familiar - too much work done there) uses Canon hardware and software and can burn the resulting files on to CD for you (at a cost of US$9.95/cd, of course!)

      They can usually do up to tabloid size sheets (11"x17") through the feeder, and expect a turn-around time of 24 hours or so. They can also do OCR scans on the resulting files. The OCR conversion runs around US$9.95 for the first page, and US$2.50-ish for additional pages, and they won't correct the errors the OCR software makes unless you pay extra. It hurts, but if you gotta have it...

      Good luck!

    4. Re:Kinkos? by Luzumsuz+Lazim · · Score: 2, Informative

      Our department has a Konica 7165 copier, which has the scan-to-email capability. It can e-mail the scanned document as multi-page PDF or TIFF files, so you don't need to convert it to PDF page by page.

      And, use a low resolution setting (say 100dpi) for handwritten documents. It will do just fine. Pdf (depending on the driver though) compresses the image. If you use a machine something like the Konica, try to set the threshold/brightness to a level such that the empty portion of the pages will appear as plain white; this will increase the compression ratio significantly.
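
      Here is a small sketch of why the threshold matters. It is not what the copier or the PDF driver actually does: the page is synthetic, the cut-off of 128 is an assumed value, and zlib stands in for whatever compressor the device uses. It only shows that once the paper noise is forced to plain white, a general-purpose compressor has far less to encode.

```python
import random
import zlib

WIDTH, HEIGHT = 200, 200
THRESHOLD = 128  # assumed cut-off: anything brighter becomes pure white


def synthetic_page(seed: int = 0) -> list[int]:
    """Fake grayscale scan: noisy light paper with a few dark pen strokes."""
    rng = random.Random(seed)
    pixels = [rng.randint(200, 255) for _ in range(WIDTH * HEIGHT)]
    for row in range(20, HEIGHT, 40):          # horizontal "lines of writing"
        for col in range(10, WIDTH - 10):
            pixels[row * WIDTH + col] = rng.randint(0, 60)
    return pixels


def threshold(pixels: list[int], cutoff: int) -> list[int]:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p >= cutoff else 0 for p in pixels]


page = synthetic_page()
clean = threshold(page, THRESHOLD)

raw_size = len(zlib.compress(bytes(page)))
clean_size = len(zlib.compress(bytes(clean)))
print(f"noisy grayscale: {raw_size} bytes, thresholded: {clean_size} bytes")
```

      On a real scan you would tune the cut-off on the first page, so that faint pencil survives while the paper texture does not.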

      So, my recommendation is to try to find a Kinkos which has this type of machine. If you can't find one, just tell the professors that it is simply not a reasonable task that can be done in finite time.

    5. Re:Kinkos? by digitalrust · · Score: 2, Informative

      At the FedEx Kinko's where I work (dig the new name), we use the Canon ImageRunner 105 and scan directly into Acrobat. It's very convenient and pretty fast. We have control over dpi of the scan, pure B/W vs. greyscale, and minimal halftone settings. There's no company-defined pricing for this; we charge $0.25 per page, with a $10 minimum. It creates huge files though, unless you reduce the dpi of the scan.
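
      As a quick sanity check on the budget, here is that pricing as a sketch. The rate and minimum are the ones quoted above for this one location; the $200 budget is an assumed reading of "a couple hundred dollars" in the question.

```python
def scan_cost(pages: int, per_page: float = 0.25, minimum: float = 10.00) -> float:
    """Cost of a scan job at a flat per-page rate with a minimum charge."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return max(pages * per_page, minimum)


BUDGET = 200.00  # assumption: "a couple hundred dollars" from the question

print(scan_cost(100))          # one 100-page set of notes
print(scan_cost(20))           # small job, so the minimum charge applies
print(int(BUDGET / 0.25))      # pages the assumed budget covers at this rate
```

      At this rate the budget covers about eight 100-page sets, so several dozen sets would need more money or a cheaper route.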

      Another option is to look in the phone book under "litigation copying" or "legal copying". Lawyers often scan thousands of legal documents and have them indexed by keyword. Data entry people get paid to skim each document to record the keywords before the documents are actually scanned. Price quotes are based on the quality of the originals (staples, torn sheets, etc.)

  3. well... by Anonymous Coward · · Score: 5, Funny

    if I ask nicely, overnight use of the secretary's Win2k box

    Plus, if you're lucky, you could also get other after-hours favors from the secretary as well ;-)

    1. Re:well... by PsiPsiStar · · Score: 2, Funny

      Maybe. But I doubt she has a scanner too.

      --

      ___
      It's the end of my comment as I know it and I feel fine.
    2. Re:well... by Anonymous Coward · · Score: 2, Funny

      I thought you were going to get a Funny Mod from me until you failed to reference the word box.

  4. High Speed Scanner by Anonymous Coward · · Score: 2, Informative

    You need a high speed scanner. Fujitsu makes a nice one that works pretty well.

    1. Re:High Speed Scanner by Nogami_Saeko · · Score: 2, Informative

      Do NOT get that HP scanner. I have the same model, and while the hardware is just fine, HP's scanning software is garbage.

      I run paperport to store all of my bills, documents, etc. The HP scanner software simply will-not use the resolutions and options I want paperport to use (200 DPI, B&W).

      When using the sheetfeeder, the damn thing always scans in 24bit at 200DPI no matter what I try and set as a default - then I have to manually convert every page.

      Go with a different model.

      N.

      --
      "Nothing strengthens authority so much as silence." - Charles de Gaulle
  5. Simple. by jebell · · Score: 5, Funny

    Outsource the job to India.

    --
    This is my sig. There are many like it but this one is mine.
    1. Re:Simple. by GothChip · · Score: 4, Insightful

      I know the parent post was funny but he's thinking along the right lines.

      Take the few hundred you have to spend on equipment and spend it hiring a few temps.

      A good typist should be able to type up hand written notes faster than scanning them all in and manually fixing all the mistakes.

    2. Re:Simple. by pendragn · · Score: 2, Insightful

      Outsource the job to India.

      Not as bad an idea as it sounds. My advice is to not waste the department's money, and your time, buying, installing, and using a sheet feed scanner. Somebody in your local area assuredly has one already that they either rent out to people in your situation, or that they use to do the work you need done.

      Use the funds that the department gave you to have your local copy shop do the work. They will almost certainly do it faster than you could, and the end product will most certainly be better than what you could provide. This is the kind of thing that the people who work at copy shops do for a living.

      Also PDF is a great format for this, highly portable, and so far fairly version proof. You don't have to worry about the PDF being obsolete before the professor decides to change the structure of his class.

  6. HP Copiers by kevinank · · Score: 2, Informative

    The large multi-function HP Printer/Copiers will scan and e-mail a PDF of an entire stack of papers just as you would use a normal copier. I'm sure that the other manufacturers have similar features, but it is the HP equipment that we use at work.

    --
    LibBT: BitTorrent for C - small - fast - clean (Now Versio
    1. Re:HP Copiers by XaXXon · · Score: 2, Insightful

      Will you please tell both of us where we can get one for a few hundred dollars, as specified in the question?

      I think the real answer is that this guy is S.O.L. .. he's just going to have to spend some good quality time getting to know a consumer-level scanner, and let the professor know to do his notes in software initially.

    2. Re:HP Copiers by plankers · · Score: 3, Informative

      The Konica ones where I work do a similar thing -- they can email you a TIFF or a PDF of a huge stack of paper. Ours are only black & white, and will only do a fixed resolution, but a newer color copier would fix all those shortcomings. Many universities and colleges have print centers that have this type of equipment if your department doesn't.

      Worst case, you can get an HP scanner and the automatic document feeder for it. If this is going to happen a lot it should be pretty easy to justify the $500 or so for the scanner, ADF, and a copy of Acrobat.

    3. Re:HP Copiers by kevinank · · Score: 2, Informative

      The big copiers run a couple of thousand dollars, but the multi-function fax/scanner/printers from HP are in the approximate price range and are all able to scan stacks of paper rather than individual sheets. The easiest way to get one of the large printers for less than a few hundred dollars is to start calling alumni who work for HP and ask them if they'll make an equipment donation.

      --
      LibBT: BitTorrent for C - small - fast - clean (Now Versio
  7. HP Digital Sender by Guanix · · Score: 4, Informative

    The HP Digital Sender series are really great for this stuff. You feed it a stack of paper and it scans it, 15 pages per minute, and can store the PDF on a file server or you can send an email with the PDF attached directly from the network sender! It's a bit expensive, but try to look around for one, maybe the local copyshop? Guan

    1. Re:HP Digital Sender by W2k · · Score: 3, Informative

      Great product. Unfortunately, its price is listed at about 10x the "few hundred dollars" the original submitter specified in his posting.

      I've found the Canon Canoscan flatbeds do a good job of automatically scanning straight to PDF; only minimal user intervention (hit "enter") is required. There's a special mode for scanning text which enhances contrast, so messy notes and diagrams should be fine, too. The resulting PDFs are also remarkably small in size for what is essentially a huge bitmap. I've a Canon Canoscan 8000F myself; it's very fast and can do higher DPIs than most people need, and although it might be a bit out of his price range, I'm sure the cheaper models can do the same job nearly as well.

      --
      Quality, performance, value; you get only two, and you don't always get to pick.
    2. Re:HP Digital Sender by Zak3056 · · Score: 3, Informative

      The HP Digital Sender series are really great for this stuff. You feed it a stack of paper and it scans it, 15 pages per minute, and can store the PDF on a file server or you can send an email with the PDF attached directly from the network sender!

      I have one of these on my office network, and I agree that they're pretty good machines--though I have some complaints about them.

      First off, I don't believe their functionality justifies the $3100 price tag. While the feature set is good, for that kind of money, this thing should be able to OCR, and not have to rely on 3rd party software for that functionality.

      Secondly, their "scan to file server" feature requires a server side daemon to run--you can't simply drop the document to an SMB or NFS server. Further, the daemon only runs on WinNT/2k/XP systems, and you need to do a little bit of hacking to get it to run as a service, instead of opening it manually (or via startup folder) on login.

      Third, it can be DOG SLOW. In particular, when scanning multiple large jobs (particularly at higher resolution) the thing will bog down. It also can only handle a fairly small number of jobs in queue at any one time. One of our secretaries can fill its queue in short order, and have to wait about ten minutes before she can scan the next document packet. When she's trying to scan a hundred packets, this essentially becomes her main focus for a work day.

      All in all, our Toshiba copiers seem to do the same job better--of course, they have their own problems (i.e. over $20k each, with a poor user interface, and they don't do color, and don't OCR either.)

      --
      What part of "shall not be infringed" is so hard to understand?
  8. Format by bobthemuse · · Score: 2, Interesting

    While PDFs are pretty well supported, you'll still be storing it as raster data, so there won't be any size decrease over using an image format, such as PNG.

    Are there any web-based packages for searching documents, based on OCR-extracted keywords? Obviously with messy hand-written notes, formulas, etc, OCR won't work reliably. For a similar project, I'd like to OCR the files and use the text data solely for keyword searching. Obviously not perfect, but better than just images.

    PNG is your friend....

    1. Re:Format by Chuckaluphagus · · Score: 5, Informative

      I have to scan and store very high-res black-and-white images for work, and I've found that the best format to save in is TIF with a CCITT Fax 4 compression. It will only work for black-and-white files, but for a full page of text and graphics scanned at 2-color, 600 dpi, you can get a file about 100 kbyte. The image quality is superb, and it's far, far more efficient than PDF.
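
      For a sense of what that file size means, here is the arithmetic as a sketch. The letter-size page is my assumption; only the 600 dpi, the 2-color depth and the "about 100 kbyte" figure come from above.

```python
# Letter-size page is an assumption; the comment only gives dpi and file size.
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 8.5, 11.0
DPI = 600
COMPRESSED_BYTES = 100 * 1000  # "about 100 kbyte" from the comment

width_px = round(PAGE_WIDTH_IN * DPI)
height_px = round(PAGE_HEIGHT_IN * DPI)

# 2-color scan: one bit per pixel, eight pixels per byte.
raw_bytes = width_px * height_px // 8
ratio = raw_bytes / COMPRESSED_BYTES

print(f"{width_px} x {height_px} px, {raw_bytes} bytes uncompressed")
print(f"compression ratio of roughly {ratio:.0f}:1")
```

      So the uncompressed bitmap is a bit over 4 MB per page, and the Fax 4 file is roughly a fortieth of that.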

      The program I use to convert to TIF is IrfanView (http://www.irfanview.com/), a generally excellent image viewer. It's free, too, so no worries there. It offers a ton of options for compression settings for different formats, so you can try other file formats as needed.

    2. Re:Format by alannon · · Score: 2, Informative

      Storing what you describe as a PDF should be almost the same size as the TIFF you describe, except for the small overhead of the PDF wrapper. PDFs support CCITT Fax 3 & 4, as well as ZIP & run-length compression on monochrome images.

      I run a micro-publishing business which often involves scanning a lot of B&W images at high resolution. I'll agree that storing files as TIFFs makes them much easier to edit, though. Our final publishing happens as PDFs, though, and it does not bloat the size of the images.

  9. If you're being 'asked' by Space+cowboy · · Score: 4, Insightful

    Just say 'No'. (If you're being told, it's a different matter, of course).

    It sounds to me like a damned hard job to automate (which is the only way it's not going to be a constant drain on your time), and you're being given next-to-no resources to even come up with a creative solution. Sometimes the best answer is in fact 'No' - it forces people to re-evaluate what they're asking. It comes with the danger of being sacked if it's you that's being unreasonable, of course....

    Simon.

    --
    Physicists get Hadrons!
    1. Re:If you're being 'asked' by malia8888 · · Score: 4, Insightful
      I really agree with Space cowboy. My former husband was a college professor. He was very brilliant in his field, but anything out side of his narrow realm daunted him. He wanted to put pennies in our fusebox when the lights went out. He stared at a breaker box in the condo like it was the control panel of an alien spacecraft.

      Explain the enormity of this scratched note-to-finished Pdf to this educator. Use crayons, mirrors, yarn and tape if necessary to get your point across. Just be diplomatic :P

      --
      Harpo Tunnel Syndrome--my wrist feels funny.
    2. Re:If you're being 'asked' by miffo.swe · · Score: 2, Insightful

      I agree totally. Some people tend to look at an admin as someone who does magic. They don't understand that some things either cost money or take time. Perhaps it would be better to give the people writing these things a laptop in the first place. It sounds like a great waste of time to duplicate the work when it should have been given to the admin in digital format in the first place.

      --
      HTTP/1.1 400
  10. The most important thing by Timesprout · · Score: 5, Funny

    Is to first make an exact copy (by hand) of all the existing documents. It's vital to have a full backup, so that if anything goes wrong with the scanning process you can always restore the manila folders to their original filled state.

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
  11. ADF Scanners by Loiosh-de-Taltos · · Score: 5, Informative

    What I suggest and use is the HP 4C scanner. It's a SCSI-II only scanner that can be found on Ebay for under $10 usually. They also have an automatic document feeder option that can be found on Ebay. This scanner was originally designed for both Windows and Apple compatibility as well. It cannot handle 2-sided sheets.

    The scanner has four different pieces of software you can choose to use, I'd suggest Precision Scan Pro as that makes multi-document scanning easier.

    1. Re:ADF Scanners by LightForce3 · · Score: 2, Informative

      I agree, an ADF scanner is definitely the way to go on your budget. However, I'd recommend purchasing a new one instead of buying used, especially since you'll be doing high-volume work. I'd also be wary of HP scanners, as I've had bad experiences with their PrecisionScan Pro software, and have been told that in general HP software is sub-par.

    2. Re:ADF Scanners by silverhalide · · Score: 2, Informative

      I have used this setup to scan in tens of thousands of pages. All you need is Adobe Acrobat 4 or 5 (full version) and the Deskscan driver. Drop a stack in, click scan, and walk off and go do something. Come back in 5 minutes, put the pile back in to scan the back sides, click continue, you're done. Acrobat automatically interweaves the fronts and backs for you so the no-duplex thing is a non-issue. Ideal speed/quality settings: 300 dpi black and white threshold scanning. Tweak the threshold for the first page, should be good for the rest of them... Resulting files are 30-40k a page and look great when reproduced on a laser printer.
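
      The interleaving step is easy to picture. This is only a sketch of the idea, not Acrobat's actual implementation, and the page labels are made up. It assumes the flipped pile feeds the back sides last-sheet-first; if your feeder keeps sheet order, pass backs_reversed=False.

```python
def interleave(fronts: list[str], backs: list[str], backs_reversed: bool = True) -> list[str]:
    """Merge two single-sided scan passes into reading order.

    fronts: pages 1, 3, 5, ... as scanned in the first pass.
    backs:  the reverse sides as scanned in the second pass. If the pile
            was simply flipped over, they come out last-sheet-first, so
            backs_reversed=True puts them back in sheet order first.
    """
    if len(fronts) != len(backs):
        raise ValueError("both passes must contain the same number of sheets")
    if backs_reversed:
        backs = backs[::-1]
    merged = []
    for front, back in zip(fronts, backs):
        merged.extend([front, back])
    return merged


fronts = ["p1", "p3", "p5"]
backs = ["p6", "p4", "p2"]   # flipped pile: last sheet's back is scanned first
print(interleave(fronts, backs))
```

      The length check matters in practice: one double feed in either pass shifts every page after it.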

    3. Re:ADF Scanners by detritus. · · Score: 2, Informative

      I definitely second this recommendation. IMO, one of the best scanners ever made. I have a newer usb HP scanner that doesn't even come close to the speed of this thing. They just don't make bulky, well built quality scanners like the 4C anymore.

      And for the record, you aren't limited to only 4 software applications for scanning (at least in Windows, any application will work if it uses TWAIN). Perhaps you were referring to the document feeder having limited software compatibility?

      (Off topic, but amusing nonetheless if you didn't know, there's an easter egg that's quite humorous..)

  12. HP Scanjet 5550c is not what you want by GraZZ · · Score: 4, Informative

    Definitely keep clear of the Scanjet 5550c; there's a reason why it's the cheapest feed scanner out there. It will frequently jam if you a) load more than 5 sheets into the feeder or b) use any sort of paper that has been handled by human beings.

    Our Engineering Society was trying to put up an exam archive with one of them and quickly gave up and started scanning with the flatbed.

    Also the scanner has no SANE support (one of the few HP scanners that doesn't).

  13. DjVu by alienw · · Score: 3, Informative

    Acrobat sucks ass for bitmap images. It doesn't display them very well, they don't print out well, and the files are huge. DjVu is a new image format that compresses extremely well (a few kilobytes a page -- actually comparable to ASCII text). It's somewhat proprietary, but it's probably the best solution here. There are free web-based services that can compress your images. You can try some of them and see for yourself.

    1. Re:DjVu by Ed+Avis · · Score: 4, Informative

      For scanned documents, tic98 compresses even better than DjVu. It's free software and you can even read the author's PhD thesis about it.

      --
      -- Ed Avis ed@membled.com
    2. Re:DjVu by mystik · · Score: 2, Informative

      I haven't tried tic98 (mentioned lower in this thread) but I can vouch for DjVu. I routinely scan notices, bills and whatnot mailed to me, then destroy them (rather than maintain a large paper file)

      300DPI Black & White scans take about 19kb. They are quite readable, and with 300DPI information, make pretty good printouts.

      --
      Why aren't you encrypting your e-mail?
    3. Re:DjVu by jskiff · · Score: 2, Informative

      Disclaimer: I work for the company that sells the commercial version of DjVu, LizardTech.

      DjVu is licensed from AT&T labs, and has both a commercial component and an open source component called DjVuLibre. The technology works by analyzing documents, particularly scanned color documents, for hard edges. Hard edges typically indicate text, while smooth, continuous tones indicate background images. DjVu then "segments" the two types of imagery on the page into different layers and compresses them using different formats for optimal compression and quality.

      Okay, enough marketing. While it does have some warts, it's a pretty cool technology to work with. That, of course, and I'm happy to have any job these days.
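
      For the curious, here is a toy illustration of the layer idea. It is emphatically not the DjVu algorithm: the real segmenter analyzes edges, while this sketch swaps in a plain brightness cut-off (an assumed value of 100) on a made-up 3x4 grid, just to show what "mask plus smooth background" means.

```python
THRESHOLD = 100  # assumed cut-off between "ink" and "paper"


def segment(pixels: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Split a grayscale grid into a 1-bit text mask and a background layer.

    mask[y][x] is 1 where the pixel is dark enough to count as text.
    In the background layer those pixels are replaced by the average of
    the non-text pixels, leaving a smooth image that compresses well.
    """
    mask = [[1 if p < THRESHOLD else 0 for p in row] for row in pixels]
    paper = [p for row in pixels for p in row if p >= THRESHOLD]
    fill = sum(paper) // len(paper) if paper else 255
    background = [[fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(pixels, mask)]
    return mask, background


page = [
    [230, 232,  10, 231],
    [229,  12,  11, 233],
    [231, 230, 228, 232],
]
mask, background = segment(page)
print(mask)
print(background)
```

      The mask can then go to a bilevel coder and the background to a lossy continuous-tone coder, which is the split described above.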

      --
      It's "no one," not "noone." Who the hell is noone anyway?
  14. Fax machine by markprus · · Score: 5, Interesting

    Just fax the documents to a computer.

    1. Re:Fax machine by Anonymous Coward · · Score: 2, Informative

      Don't rely on faxing.

      Fax on a computer is actually TIFFg3 format, created by Sam Leffler whose name is on all the old BSD UNIX copyrights. It's a wonderful tool, and a wonderful format, and easily transformed to other needed formats by the various tools in the TIFF library still published by Sam, but faxes provide only 196 dots per inch at "fine" resolution, and that's usually not good enough for documents and hand-written notes.

      A good flatbed scanner is really your friend, for price and performance.

  15. Recruit the community by SoSueMe · · Score: 5, Interesting

    Do it the open source way.

    Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.

    I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.

  16. Easy by JensR · · Score: 5, Interesting

    Get some students of the professor's course to type them into LaTeX. Give them some points they'd otherwise get for homework.
    a) Publication quality DVI/PS/PDF files
    b) The student can deepen their knowledge of the topic
    Everyone happy. Used to work like this at the university I went to. And you may be even lucky that some student typed these notes in for himself.
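
    For anyone who hasn't seen what the typed-in version looks like, here is a minimal sketch of one handwritten line with an equation turned into LaTeX. The lecture title and the formula are invented placeholders, not taken from anyone's actual notes.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\section*{Lecture 1: Derivatives}
The derivative of $f$ at $x$ is defined as
\begin{equation}
  f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.
\end{equation}
\end{document}
```

    Running this through pdflatex gives the PDF; plain latex gives DVI, which can be converted to PS.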

  17. DjVu format is pretty good for scanned docs. by artemb · · Score: 3, Interesting

    I found that DjVu format produces substantially smaller file than PDF for the same scanned image.

    There is an open-source project http://djvu.sourceforge.net/ that provides code for reading DjVu docs, but I have no idea where to get DjVu encoder.

    1. Re:DjVu format is pretty good for scanned docs. by Anonymous Coward · · Score: 2, Informative

      Sigh... In THE VERY SAME PACKAGE! Is reading documentation contained in tarballs you download really *that* hard?!
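
      For reference, a minimal sketch of driving the encoder from Python, assuming the DjVuLibre command-line tools cjb2 (the black-and-white encoder) and djvm (the bundler) are installed and that the pages are already saved as bitonal PBM files. The function names and file names are made up for illustration.

```python
import shutil
import subprocess


def djvu_commands(page_files, output, dpi=300):
    """Return the cjb2/djvm command lines for a list of bitonal page images."""
    commands = []
    page_djvus = []
    for page in page_files:
        page_djvu = page.rsplit(".", 1)[0] + ".djvu"
        # cjb2 is DjVuLibre's encoder for black-and-white (bitonal) images.
        commands.append(["cjb2", "-dpi", str(dpi), page, page_djvu])
        page_djvus.append(page_djvu)
    # djvm -c bundles the single-page files into one multi-page document.
    commands.append(["djvm", "-c", output] + page_djvus)
    return commands


def encode(page_files, output, dpi=300):
    """Run the commands; needs the DjVuLibre tools on PATH."""
    if shutil.which("cjb2") is None or shutil.which("djvm") is None:
        raise RuntimeError("DjVuLibre tools (cjb2, djvm) not found on PATH")
    for command in djvu_commands(page_files, output, dpi):
        subprocess.run(command, check=True)
```

      Something like encode(["p1.pbm", "p2.pbm"], "notes.djvu") would then produce one multi-page file per lecture.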

  18. Re:Knee to the grindstone... by Exocet · · Score: 4, Insightful

    "Ummm yeahhhh... if you could just do that..."

    Faust7 is right about this one. Frankly, OCR is ok, but not great - on nice text on book-or-better paper. Handwritten notes? With equations? No. Not unless your profs have some damn fine handwriting and we all know that that is absolutely not the case.

    My advice is the same as Faust7's with these additions: spend some of that money on a really nice keyboard, wrist-rest and/or maybe a nice monitor. You are going to be needing all three. If there are any left over funds, get some really nice tea. I suggest Twinings English Breakfast or Prince of Wales, if you're going to go bagged.

    --
    Exocet Industries - Taking over the world, one computer at a
  19. where to look by bcrowell · · Score: 2, Insightful
    Have a look at the archives of this mailing list, which is mainly populated by Project Gutenberg folks.

    But the broader question is whether this is really a good idea. The result is going to be huge files, which will be messy, hard to read, and will lack an index or table of contents. Seems like a case of profs with too much ego and not enough willingness to put their own work into more useful form.

  20. Try making GIFs by PapayaSF · · Score: 2, Informative

    GIFs compress very well, especially with source material that's in limited colors. Try making a page into an 8-color or even 4-color GIF at about 150 dpi. The handwriting should be about as readable as the original.

    Also, if you're scanning material with copy on both sides, you might get some visible bleed-through. Try scanning such pages with a sheet of black paper between the page and the lid of the scanner, then adjust contrast to ensure white whites and black blacks.

    --
    Q: What does the "B." in Benoit B. Mandelbrot stand for? A: Benoit B. Mandelbrot
  21. Re:Outsource it by cloudmaster · · Score: 4, Insightful

    Maybe he *is* the cheap manual labor / unpaid intern...

  22. No good answers AFAIK by John+Miles · · Score: 4, Informative
    I've run into a similar problem, and have no good solutions in the general case. I'm on a mailing list for users and collectors of Tektronix test equipment (oscilloscopes, logic and spectrum analyzers, and so forth). Last year, Tektronix's legal department issued a copyright release that permits the reproduction and distribution of documentation for test equipment that they (Tek) no longer support. This was of great interest to the people on the TekScopes list, because it gave a green light to scanning and trading/selling copies of manuals. I've scanned in a few manuals for some equipment I own, and it's a huge pain in the butt any way you look at it.

    Electronic test-equipment manuals are pretty much worst-case candidates for scanning. In Tek's case, the schematic volumes often consist of hundreds of double-sided, nonstandard-sized foldout sheets (11x23" for example) with lots of fine detail that must be reproduced clearly. You can either scan the pages in segments and leave it to the reader to reassemble them, or you can take the manuals to Kinko's and have the foldout pages shrunk to 11x17" or 8.5x11" for scanning. Either way, it's a real hassle, and highlights a clear need for a "prosumer" duplex sheet-feed scanner solution.

    A few years ago you could buy scanners like this one that could handle arbitrary sheet sizes, but I haven't seen them in stores lately. These may be easier to use than flatbed scanners, assuming the precision they offer is sufficient for your application. I don't know how well they'd work on densely-printed schematics.

    Other than bitching about the state of the scanner marketplace, I don't have much to suggest. There are a few hints that will improve the quality and usability of your final document:
    • There are other formats, like DjVu, that have certain advantages over .PDF, but think carefully before using them. Will you be able to read your files 10, 20 years from now? In .PDF's case, the answer is an unequivocal 'yes' because of widespread government, military, and commercial standardization around it. I hate to see people spend hours scanning manuals in DjVu or another nonstandard format, because I'm 95% sure I won't be able to read them years down the road on a completely different platform.
    • To make the document searchable, use an OCR package like FineReader if possible... but expect to spend even more time babysitting the process.
    • Experiment with your scanner resolution settings to minimize the resulting .PDF file size. There's a big difference in size between 200 dpi and 300 dpi, and between a B&W and color scan.
    • For some mysterious, forehead-slapping reason, flatbed scanners often use glossy-white backing material in the lid. This encourages bleedthrough of text on the reverse side of double-sided material, making your scanned documents look sloppy and compress poorly. Placing a sheet of black paper, plastic, or cardboard material between your document and the scanner lid will make a big difference.
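
    To put rough numbers on the resolution point, here is a back-of-the-envelope sketch of the uncompressed size of one letter-size page. It ignores compression entirely, so the real .PDF sizes will be far smaller, but the ratios between settings carry over as a first approximation. The function name is just for illustration.

```python
def raw_scan_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


for dpi in (200, 300):
    for label, depth in (("B&W", 1), ("gray", 8), ("color", 24)):
        size_kib = raw_scan_bytes(dpi, depth) / 1024
        print(f"{dpi} dpi {label:5s}: {size_kib:9.0f} KiB")
```

    Going from 200 to 300 dpi multiplies the pixel count by 2.25, and 24-bit color carries 24 times the raw data of a 1-bit scan.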
    --
    Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.
    1. Re:No good answers AFAIK by deranged+unix+nut · · Score: 3, Informative

      Since purchasing a Canon G2 (4 megapixel) digital camera, I have discovered that it works pretty well for producing readable quality duplications of 8.5"x11" sheets of paper and whiteboard notes.

      This camera can be controlled programmatically. Automation would be needed to make it practical for a large scale, but it is much quicker than most flat-bed scanners and the quality would be okay for hand-written notes. It would be easy to take multiple overlapping pictures and leave it to software to re-assemble the images.

      (Yes, it is a goofy solution, but it works well for me as I normally have my camera handy.)

    2. Re:No good answers AFAIK by jensend · · Score: 2, Informative

      Bunk. DjVu has an open-source implementation and well-documented specs. It will thus be readable no matter what happens to LizardTech. Similarly, the main reason PDF can be counted on to be readable in the distant future is not its installed user base (that changes quickly enough to be fairly well negated as an advantage over the 10-20 year timespan you suggest), but rather that it is an open format.

      DjVu is probably the best format for the poster's needs. I had a university class where nothing was ever handed out to students in hard copy and documents were instead posted on the web; .doc was used for the kind of documents PDF is good for, while PDF was used for scanned-in (but not OCR'd) articles and so forth. This was a nightmare; the PDFs were absolutely huge, and just scrolling through them would bring a >1 GHz computer to its knees. It would even have been better to use uncompressed TIFFs.

  23. Gotta be careful though. by Faust7 · · Score: 5, Funny

    Outsource the job to India

    "No, no, not my entire job, just this one part. No, I can do the rest. No, really. No! No... please..."

  24. Large Scale Paper to Digital Conversion by felila · · Score: 4, Informative

    I do conversion for fun, at Distributed Proofreaders.

    The problem is the mixture of graphics, equations, and text.

    It's easy enough to turn a page of text into a smallish file. Get a good automatic-feed scanner ($3500 or so) and a copy of ABBYY OCR software. If the original isn't too speckly, tiny, or smudged, ABBYY will give you a 95% accurate text you can then correct. Best format to save in? Depends on what the school is going to do with the files. If they're to be posted on web sites, perhaps XHTML. If it's just for preservation, plain text (if there are no Greek characters) or XML with UTF-8.

    Equations -- well, there's supposedly a version of XML for math, but Distributed Proofreaders has ended up using TeX, as it seems to be the mathematical standard. While this would work for preservation, it wouldn't work for a web site.

    For a web site, perhaps the best way would be to intersperse text with pngs of the equations and graphics. The pngs would still take a lot more space than text, but the files would be smaller than PDF versions of the whole page.

  25. A Fujitsu scanner, SANE and Quartz Python bindings by sabi · · Score: 5, Informative
    Such as the fi-4120c is what I'd recommend. You might have to stretch your budget a bit. The cheap HP sheet feeders are very unreliable; we went through two HP 5550c's enduring constant paper jams before switching to a better (Fujitsu) scanner.

    Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac/Windows so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane.

    I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port you can embed in your own app.

    Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files:

    #!/usr/bin/python

    from CoreGraphics import *
    import math, sys

    # Rotate every page of each PDF named on the command line by a quarter turn.
    for inputPDFPath in sys.argv[1:]:
        inputProvider = CGDataProviderCreateWithFilename(inputPDFPath)
        inputPDF = CGPDFDocumentCreateWithProvider(inputProvider)
        if inputPDF is None:
            print >> sys.stderr, \
                "unable to open '%s': perhaps it is not a PDF file?" % inputPDFPath
            continue
        outputContext = CGPDFContextCreateWithFilename(
            inputPDFPath + '-rotated.pdf', None)

        for pageNumber in xrange(1, inputPDF.getNumberOfPages() + 1):
            mediaBox = inputPDF.getMediaBox(pageNumber)
            # The rotated page swaps the width and height of the original.
            rotatedBox = CGRectMake(0, 0, mediaBox.getMaxY(), mediaBox.getMaxX())
            outputContext.beginPage(rotatedBox)
            outputContext.saveGState()
            outputContext.translateCTM(0, rotatedBox.size.height)
            outputContext.rotateCTM(-math.pi/2)
            outputContext.drawPDFDocument(mediaBox, inputPDF, pageNumber)
            outputContext.restoreGState()
            outputContext.endPage()
        # Close this output file before moving on to the next input.
        outputContext.finish()

    You could also use ReportLab, but because a lot of the PDF processing code is written in Python it's somewhat slower and memory-hogging for high-volume use. (I used ReportLab on Windows for the above project, and use CoreGraphics Python bindings for my research, so I do know what I'm talking about mostly :)
  26. My dad's office by pavera · · Score: 5, Informative

    My father is an attorney,
    he has a couple of high-speed scanners from Panasonic. They cost less than a thousand dollars ($400-500) if I remember correctly, they scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (pdf, tif, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. They can scan up to 500 pages at a time.
    Here is a link:
    High Speed Scanners

  27. All you can do... by cliffiecee · · Score: 5, Insightful

    Is say "Sure. I'll get this done- when I can. Don't expect it to be done for at least a few weeks, maybe longer."

    DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.

    With the kind of volume you say you're receiving, the only way you're going to survive is to:

    1. close your eyes,
    2. load the documents into the feeder,
    3. press 'scan'.
    4. Make sure everyone knows this policy.

  28. It's not a technology problem it's a problem of by Bob+Bitchen · · Score: 2, Insightful

    poorly set expectations. How did the professors get the idea that it was possible? It's not possible under the constraints that you are faced with. If money was not a limiting factor you could do this. But I'll assume money is a factor and time as well. So go back and tell them that it's possible but it's going to cost this much to automate the process and this much if I type it in by hand and this much if someone else does it but with poorer accuracy and so on and so forth. Put the burden on them to decide how they want to deal with this. Only then will the appropriate solution be found and chosen.

    --
    http://tinyurl.com/3t236
  29. HP9200C? by MrChuck · · Score: 2, Informative
    We have one at work. You put in a pile of papers, tell it "go" and it emails a PDF of each to you. I've been struggling without a manual to reconfigure it a bit.

    Cheap? Dunno. It was just there. In any sort of volume though, the cost drops precipitously (cheaper than you doing it on a flatbed scanner!).

    Check out something like that (or indeed that) used, use it, resell it. Or new, then use/resell. Or get the school to buy it.

    If this is a continuous thing, then all the better to own.

  30. Do you work at Kent State? by NevarMore · · Score: 2, Interesting

    Kent State just announced their FlashNotes website. I go to school there; email me at fiveonethree@yahoo.com and I would be more than happy to come down and help you sort out your options.

    A bit of opinion on the project. This is not a good idea. It's one more tool that students will rely on to memorize information instead of taking time to THINK about their subjects and really LEARN the material.

  31. Some photocopiers support this by adamsc · · Score: 4, Informative

    Check whether any of the photocopiers around campus support scanning: we have a Canon ImageRunner in one of the labs which I support. It's extremely fast - ~1 second per page for a double-sided scan and the feeder is pretty robust - we have grad students who take handwritten lecture notes for an entire class and dump this stack of a couple hundred crumpled pages into the feeder and end up with a PDF a couple minutes later.

    1. Re:Some photocopiers support this by Rude+Turnip · · Score: 3, Interesting

      We have one of these in our office and they're great for taking stacks of workpapers from clients, scanning them in and getting rid of the originals. You can email a PDF directly to someone, or store the PDF on a server somewhere.

  32. Comment removed by account_deleted · · Score: 3, Insightful

    Comment removed based on user account deletion

  33. Works for me by sglow · · Score: 5, Interesting

    I tend to scan lots of documents and set up a simple Perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.

    I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.

    I scan to tiff and then use the convert utility (part of imagemagick) to convert to png. The resulting files typically run about 100K to 200K depending on the content.
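
    The scan-then-convert step can be sketched roughly like this in Python (this is not the Perl script mentioned in this post). It assumes scanimage from SANE and convert from ImageMagick are on the PATH. The --mode and --resolution options are backend-specific, and "Lineart" is only the usual name for black & white mode, so check what your own scanner's backend reports. The function names are made up.

```python
import subprocess


def scan_command(resolution=300, mode="Lineart"):
    """Build the scanimage command line for one page."""
    # --mode and --resolution depend on the scanner backend; "Lineart" is the
    # black-and-white mode name on many backends, but not all of them.
    return ["scanimage", "--format=tiff",
            "--mode", mode, "--resolution", str(resolution)]


def scan_page(png_path, resolution=300, mode="Lineart"):
    """Scan one page to TIFF, then convert it to PNG."""
    tiff_path = png_path.rsplit(".", 1)[0] + ".tif"
    # scanimage writes the image data to standard output.
    with open(tiff_path, "wb") as tiff_file:
        subprocess.run(scan_command(resolution, mode),
                       stdout=tiff_file, check=True)
    # ImageMagick's convert picks the output format from the file extension.
    subprocess.run(["convert", tiff_path, png_path], check=True)
    return png_path
```

    Calling scan_page("page001.png") once per sheet, with a prompt in between to swap pages, is about all the automation a flatbed needs.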

    If anyone's interested in seeing the perl script I've posted it to: www.ollies.net/scanscript.html

    Steve

  34. General principles for document imaging by mangastudent · · Score: 2, Interesting
    I used to develop systems to do this sort of thing ("document imaging"), so here are a few basic principles:

    The quality of the scanning is obviously important; get or borrow the best scanner you can. The point made about putting a black backing onto a flatbed scanner is important. Also important is adjusting the scanner settings so that you get minimum noise (random black dots) without degrading the stuff you want to keep.

    For this sort of thing you almost certainly want to do it bi-level/B&W/one bit deep (hopefully there are no shaded pictures, but you can use screening for those), and to my knowledge nothing has been developed that compresses these images better than CCITT Group IV (fax machines use Group III). You almost certainly don't want to use grey-scale, at least not for your final images.

    You should see if you can find some post-processing software; we used to use ScanFix, which would straighten the image (which makes Group IV compression a lot better) and depending on settings clean it up as well. You also need to decide upon the size of the final images; you want to scan at 200 to 300 or even 400DPI, but you don't have to have final versions at those high resolutions.
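
    A toy illustration of why straightening helps, as far as I understand the scheme: Group IV codes the black/white transitions on each line relative to the line above, and a transition sitting exactly under one in the previous line gets the shortest code. The sketch below is not a Group IV encoder, just a crude proxy that counts transitions with no partner directly above them. All names in it are made up.

```python
def transitions(row):
    """Column indexes where the pixel value differs from its left neighbor."""
    return [x for x in range(1, len(row)) if row[x] != row[x - 1]]


def unaligned_transitions(rows):
    """Count transitions that do not sit directly under one in the row above."""
    count = 0
    previous = set()
    for row in rows:
        current = transitions(row)
        count += sum(1 for x in current if x not in previous)
        previous = set(current)
    return count


def stroke(offsets, width=12):
    """Rows containing a 2-pixel-wide stroke starting at each given column."""
    return [[1 if off <= x < off + 2 else 0 for x in range(width)]
            for off in offsets]


straight = stroke([4, 4, 4, 4, 4, 4])   # a perfectly vertical stroke
skewed = stroke([2, 3, 4, 5, 6, 7])     # the same stroke, tilted
print(unaligned_transitions(straight), unaligned_transitions(skewed))
```

    The vertical stroke only pays for its two edges once, while the tilted one pays on every row, which is the same effect a skewed page has on ruled lines and table borders.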

    The standard used to be TIFF images with Group IV compression, but not every image viewer can read them, or display them well (esp. if the image needs resizing, and I doubt you can assume everyone reading these has their monitor at a high resolution).

    If PDF will accept and display images compressed with Group IV compression, you're probably best off with that, since Acrobat Reader is ubiquitous and fairly easy to use.

    PNG is a nice format that I use by preference for > 1 bit deep images, but a quick check of some PNG documentation says that Group IV "often" compresses a lot better than 1 bit "greyscale" PNG; it was simply not designed for document imaging. And you also want to avoid JPEG, it's a lossy (will introduce artifacts) system that also wasn't designed for bi-level images.

    Hope this helps.

  35. PDF settings by mdkemp · · Score: 2, Informative
    If you intend for people to print this stuff out, PDF is definitely the file format of choice. The size of the resulting files will largely depend on the scanning resolution and color settings you use, as well as the type of compression in the PDF.

    If the lecture notes you're scanning don't contain any grayscale or color graphics, your best bet is to scan in black-and-white mode (as opposed to color or grayscale) for smallest file size. I'd suggest scanning at 300 DPI for sharp-looking printouts. Be sure to play around with the "threshold" value (or equivalent) in your scanning software until you figure out what looks best. If it's not set to a good level, text may look too thick and blocky, or thin lines might disappear completely.
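
    What the threshold actually does can be shown with a tiny sketch, assuming 8-bit gray values where 0 is black and 255 is white. The sample values are invented, and real scanning software may define its threshold the other way around.

```python
def to_bilevel(gray_rows, threshold=128):
    """Map 8-bit gray values (0 = black, 255 = white) to 1 = ink, 0 = paper."""
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_rows]


# One scan line crossing a dark pen stroke (40), a faint pencil line (150),
# and slightly gray paper (200 to 230).
line = [[230, 40, 220, 150, 200, 225]]

for threshold in (100, 160, 210):
    print(threshold, to_bilevel(line, threshold)[0])
```

    With the threshold too low the faint pencil line disappears, and with it too high the darker patches of paper start turning into ink, which is the "thin lines vanish" versus "thick and blocky" trade-off.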

    Once you have a monochrome scan, you'll want to save in a lossless compression format that preserves the monochrome attribute of the image, such as compressed TIF, and not as JPEG. When exporting to PDF, you could experiment with both ZIP and fax (CCITT group 3/4) compression types -- both compress black-and-white images very well. If your PDF software doesn't have those options, the default should probably be good enough. Even at 300 DPI, most pages should fit into about 30K or so.

  36. University doesn't already have this service? by needacoolnickname · · Score: 3, Informative

    Most universities already have this service. The professor might not know it exists, but check the other departments to see if they have one (not the scanner - but the service at the school). It is usually somewhat intertwined with a Distance Learning center or department.

    It takes away the cost of printing lectures/notes/required readings from the departments and tacks it onto the students who now seem to pay for printing above a certain limit in the labs.

    At least this is the way at the universities I have worked at.

  37. professionally by curator_thew · · Score: 2, Insightful


    The professional approach is to go back to them and clarify the outcome:

    (a) you can scan the documents in, and they'll take X amount of space, and Y time; and this doesn't include OCR;
    (b) you did a few tests (using the supplied document) and these are the results for TIFF, JPG, PDF, etc;
    (c) OCR is probably infeasible (or not, do some tests) because of the nature of the documents;

    Include in (a) the option of purchasing an automated document scanner, and the corresponding reduction in time.

    Based upon all the above, get a clear go-ahead, and make the purchase if new equipment is authorised.

    You said "where I work": this is your job: it's a bit poor to do as the other posters suggest and refuse to do the work: you need to make sure that the customer (professors) understand exactly what they are getting, and give them a choice to buy into it or not - i.e. "clarify the expectations".

    If you assess that it's 2 weeks worth of work, and the professors don't disagree, then your supervisor just has to put up with it.

  38. Re:Xerox Scanner doesn't do OCR by zenquest · · Score: 2, Interesting

    Xerox bundles OCR as a software add-on. It works well when you get it all set up at your company. By the time you get back to your desk, the document is open and ready to be OCR'd with a drag and drop.

    It obviously wouldn't be so convenient if he had to go to Kinkos, but they might have it set up on one of their machines. (Yeah, I doubt it, too.)

  39. PDF Settings by AvitarX · · Score: 2, Informative

    I've been reading a few minutes and nobody seems to address your settings etc.

    You should scan in grey-scale or, if there is high enough contrast (pen notes, not pencil), in black and white. Grey-scale with JPEG medium or even low compression is going to be much smaller than the defaults. Pure black and white with Group 4 compression will be even better. At work we scan pages at 300 DPI that way and get 20 to 30 k files (I think, haven't done it for a while).

    Also typically images for web viewing of even text are scanned at 72 dpi (all the scholarly journals at my university). This can make things hard to read but really shrinks the file (about 1/16th the size of 300 dpi).

    Also if the scanner is set low res pure black and white it will scan a lot faster, but still be pretty slow.

    The other option is to pay someone to do it. If you have all of the stuff ready at once and give the pros a week or so to do it when they aren't busy you can probably get as low as 50 cents a page.

    Blah blah, I lost my train of thought 2 paragraphs ago

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  40. docutech is the way to go... by capsteve · · Score: 4, Informative
    being in the prepress industry, i see more and more traditional printing going the way of xerography. of the competitors in the field, xerox probably has the best system with the docutech series... you may want to consider kinko's which is an authorized user/vendor of the docutech system.

    on a side note, if the professors are utilizing a lot of additional material which might include handwritten information, you might consider encouraging them to transcribe that material (hopefully you're not the TA that has to do the transcription) into a digital form, be it text or WORD. this'll definitely help in reducing the size of your files.

    also consider looking into adobe's pdf service, if you're overwhelmed with just organizing the material itself. probably not so kosher to suggest it on /. but it could be something the school already has an agreement with adobe for (taking into account the units of acrobat the school itself might be using). i know it's not rolling your own, but sometimes using an "out of the box" solution to get things up and running so you can explore other solutions has its merit as well...

    --
    three can keep a secret, if two are dead - benjamin franklin
  41. PDF of handwritten notes is DUMB!!! by madstork2000 · · Score: 4, Insightful

    It makes no sense at all to me to have a PDF created of handwritten notes, since most students will probably just download and print out the PDF anyway. The only advantage is that it may save a few trees, since not everyone will print them out.

    It sounds like the school wants to shift the production costs (i.e. printing) to the students. This seems inefficient compared with the old way, where the instructor could go to the copy center and have the notes copied at the school's expense (I know these expenses are often passed along to the students anyway), rather than at the students' DIRECT expense: their time for downloading, then printing out on their own equipment or using their own printing accounts at the computer center.

    If the notes were being OCR'd and then made available on-line, or post-processed so that they were searchable, indexed, etc., it would be useful. Otherwise this seems like a waste of time and money.

    -MS2k

  42. Get a Canon Document Scanner by spizm · · Score: 3, Insightful

    The company I work at scans large amounts of documents to PDF format on a daily basis. Depending on the volume some people do, we use either a Canon DR-3060 or DR-5020 document scanner. These will scan both sides of a page simultaneously, clean up the image (despeckle and deskew) and convert them into TIF or PDF all on the fly. They're fast too. Between 20 and 50 pages per minute. Only problem is that they're expensive.

    For your budget, you may be able to afford the Canon DR-2080C which goes for around $600. It has all the features of the more expensive ones, but it's meant for smaller volumes like what you're dealing with. With that, you'd be able to scan 100 pages into a pdf document in around 5 minutes.

  43. OR by geekoid · · Score: 3, Interesting

    charge by the hour, at least 50 dollars an hour. That way you can hire 3 students at 10 bucks an hour to do the actual work.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  44. JBIG2 inside PDF by jab · · Score: 2, Interesting
    Actually, the newer PDF specifications and newer PDF viewers (like the totally excellent xpdf utility, and oh yeah, Adobe acroread 5 and onwards) all support JBIG2 compression. JBIG2 is a token based compression technology giving roughly similar file size and image quality compared to DjVu, but with the advantage that everyone and their uncle can deal with the PDF file format.

    So, I recommend scanning to TIFF (or TIFF inside PDF). Even if you don't currently have the encoding software, you can convert to JBIG2 compression later as it becomes more and more ubiquitous in the future.

    And definitely use an automated document feeder of some sort to keep from going crazy. Newer Xerox machines work pretty well for this (I use a DocumentCentre 440ST for this all the time) unless you have hundreds of thousands of pages to deal with, in which case you should either invest in industrial scanning equipment or outsource to a scanning center that does.

  45. Re:Xerox Scanner doesn't do OCR by timeOday · · Score: 2, Insightful
    Yes, it makes a PDF of all the pages, but each page is just a picture. There's no way to search for text in the result.
    There is no way you're going to solve that problem with one person and a couple hundred dollars.

    I know there are Adobe archival systems that store the scanned image, along with whatever text they manage to recognize. You don't expect near 100% OCR accuracy from an old, largely handwritten sheaf of lecture notes and transparencies. But hopefully enough is recognized to be of some use.

  46. ask your professor to be precise ... by pikine · · Score: 3, Insightful

    I think what your professor wants is not a bitmapped copy of his handwritten notes or some vector curves that resemble such, but actually a typeset version of the lecture notes. If that is the case, assuming that his handwritten notes are sparse (and hopefully without diagrams, since it takes more time to mess around with them), you can definitely do a stack of 100 sheets in a week, or, as someone already suggested, hire some typists to help you out.

    --
    I once had a signature.
  47. Depends how good you want to do it by Danh · · Score: 4, Informative

    If you want to do a good job, you have to type it, in LaTeX. It's the only way to get something nice and something the professors will be able to enhance in future.

    If a digitized copy of the manuscripts will do for you, you can go the scan -> image enhancement -> OCR -> save to PDF way.

    For scanning, you already got a lot of good comments how to automatise the scanning of dozens of scripts. If you lack these possibilities also a SCSI or USB desktop scanner should do the job (it's definitely less than 1 min per page), so you scan a script in 2 hours. No need to bother to outsource the job to India. Probably you can scan B/W and don't need greyscale or colors. I would scan handwritten scripts at 200 DPI and save the whole pictures in front of the OCRed text, so the user doesn't see the OCRed text and can only use it for selecting and copy&paste. It would be too much work to correct the OCRed text here. For machine written text I would use 300 dpi or more for better OCRing.

    As image enhancement you only need to be able to automatically orient the page so that the text is horizontal. I don't remember if Acrobat does it, but for this job I would anyhow get a good OCR program.

    As OCR program I recommend FineReader, but also Omnipage is ok. FineReader does better OCR than Omnipage and Acrobat. It also saves better to PDF (with retaining all of the paragraph structure) than Omnipage.

    If you keep the image before the OCRed text in the PDF you can expect files of 10MB for 100 pages for B/W scan at 200 dpi. OCRing of machine written text has become incredibly accurate, so you can do real OCR there and throw away the bitmap picture. This of course gives much nicer output (and smaller filesize), but you need to spend a lot of time correcting the text. Here the best OCR program really pays off (you probably have a lot of words which are not in a dict, need custom dicts (does Acrobat have them?),...). A program with a single flaw (e.g. one that recognizes your formula as text, or code as paragraph text,...) will let you waste a lot of time correcting it on every second page.

  48. Re:Xerox Scanner doesn't do OCR by dougmc · · Score: 2, Insightful
    Xerox bundles OCR as a software add-on. It works well when you get it all set up at your company. By the time you get back to your desk, the document is open and ready to be OCR'd with a drag and drop.
    The original question said that the notes were handwritten. Has anybody had any sort of success whatsoever in reading handwriting with OCR? (Not that I'm aware of.)
  49. Large Scale OCR and/or PDF conversion by kingsy · · Score: 2, Informative

    The answer is simple. You are at a university. MOST modern photocopiers do inbuilt pdf conversion or OCR'ing to network drives or email. Find one of them.

  50. 8100C by MushMouth · · Score: 2, Informative

    They no longer make it but they can be found on ebay for a few hundred bucks and no I am not selling this one or one at all.

    1. Re:8100C by larkost · · Score: 2, Informative

      I use an 8100c on a daily basis, and it is a good little machine, but there are a few gotchas:

      The color tiffs use a deprecated form of tiff that was rescinded from the standard as unworkable. On top of that, the version they use does not work with any variant of libTiff. Basically you are stuck working with a few windows programs... and graphics converter on the Mac. Photoshop will sometimes even choke on them.

      When we try and scan yellow documents the scanner will occasionally freeze up. It seems to happen sooner in tiff mode, later in PDF mode... but eventually it just freezes and has to be rebooted. Reloading firmware does nothing to abate this.

      Oh... and it does mean that the scanner has to feed a windows box, as I have not found a means of attaching it to anything else.

  51. Re:Xerox Scanner doesn't do OCR by jayminer · · Score: 2, Informative

    This claims handwriting recognition. (Although it requires some sort of structure in the OCR'd page, I think.)

  52. either outsource or use a digital camera by misanthrope101 · · Score: 2, Insightful
    I've used a flatbed for this type of thing, and it works, but it takes forever and it's frustrating. It isn't hard, mind you, but time-consuming and mind-numbing. The first 30 pages are easy and then you get really really sick of it. If you do scan it yourself, you don't need more than 200dpi or so, and you can save as high-quality jpeg. This isn't artwork, and there is no need for perfection. Acrobat will accept any image file. I'd scan with a standalone image program (I use ACDSee and it works well) and then feed the images into Acrobat. But as far as a recommendation...

    Have it professionally done, like other people here have recommended. High-end sheetfed scanners are great, but you probably can't afford one, and it wouldn't make sense as a one-time expense for this small of a job. I'm a big fan of just handing someone some money and it's magically accomplished.

    Alternatively, use a digital camera and well-lit copy stand. You can improvise a copy stand with a tripod or whatever, but make sure you have a lot of light. It's a lot faster than using a scanner, and the results are acceptable if you have a good camera. The more megapixels the better - don't use the old 1.3mp one you have lying around. 3mp will technically work, but more is better. Ideally a digital SLR pointed straight down at the page, a very well-lit area (a clamp light on either side of the page works nicely), and you sitting there sipping Starbucks while you hit a cable shutter release after you flip every page. You could get a few hundred pages an hour done this way--your only limitation is how fast you can turn the pages. You'd only have to stop to transfer images to your computer, and you only have to do that often if you don't have enough memory cards. After you get all the pages into the computer, feed them into Acrobat and you're done.
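
    As a rough check on the megapixel advice above, here is a minimal sketch that works out the effective dpi when a page fills the camera frame. The letter page size and the frame dimensions per megapixel class are assumptions picked for illustration, not figures from any particular camera.

```python
def effective_dpi(sensor_px, page_in):
    """Return the dpi a camera achieves when the page fills the frame."""
    long_px, short_px = max(sensor_px), min(sensor_px)
    long_in, short_in = max(page_in), min(page_in)
    # The tighter of the two axes limits the usable resolution.
    return min(long_px / long_in, short_px / short_in)


LETTER = (8.5, 11.0)  # inches; assumed page size

# Assumed typical frame sizes for each megapixel class (hypothetical cameras).
CAMERAS = {
    "1.3mp": (1280, 960),
    "3mp": (2048, 1536),
    "6mp": (3008, 2000),
}

if __name__ == "__main__":
    for name, px in CAMERAS.items():
        print(f"{name}: about {effective_dpi(px, LETTER):.0f} dpi on a letter page")
```

    Under these assumptions a 3mp frame comes out a little under the 200dpi suggested for flatbed scans, which fits the "technically work, but more is better" remark. Real results will be somewhat lower, since the page rarely fills the frame exactly.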

    If you don't want to use acrobat you could make a web-page with thumbnails linked to the hi-res images. Then your end-users wouldn't need to download the Acrobat reader. I love Acrobat's ubiquity but hate the file sizes and the slow start-up time.

  53. How I Do It by MSInsight · · Score: 3, Informative

    I scan and upload various land use and financial documents for a county and its townships to the internet on a shoe-string budget - actually, no budget - all volunteer, public service for fellow citizens. This is my prescription:

    Stay with your current flat-bed scanner. Do not waste money on a sheet-fed scanner. You do not have nearly enough money for a high-end Fujitsu or Bell & Howell sheet-fed scanner which will reliably get the job done without mechanically screwing up. The pros use high-end scanners because they never screw up and they go fast. Cheap sheet-fed scanners miss sheets or jam up too often to trust them with anything. Make a sign-up sheet for work-study or volunteer students in your academic department to sit down at your computer and scanner and scan the documents into the computer. Give them free pops and gummy bears (slur it so it sounds like "rum & beers") or something similar which won't transfer from fingers to documents. Just take a few minutes to set them up and show them what to do. Keep it simple. Let those empty minds waiting to be filled with knowledge (and beer) do the time consuming zombie work. You should focus your attention on how to put the files on the website.

    The scan file format I use is Portable Network Graphics format or PNG format. On average, it compresses black and white graphics 20-25 percent smaller than the widely used GIF format. PNG format is also supported to a basic enough level to be displayed using MS Internet Explorer, Netscape, Mozilla, and other internet browsers.

    I use free Xsane scanning software on a linux system to scan the documents. Xsane can be set to scan in line-art mode, also known as black and white mode. This software can also be set to save files directly to disk in PNG format and automatically change the file names using numerical iteration, i.e., file-01.png, file-02.png, file-03.png, etc. without the need for human intervention to change the file name each time. I use a 100 dpi scan resolution setting because documents do not need to look ultra-smooth; they just have to be legible. Anything beyond that is a waste of hard drive space. Using this resolution also means I do not have to spend time embedding the graphic file in html code to constrain its width so it can be viewed on the average 15", 800x600 resolution monitor. I just insert weblinks to the individual, one-page graphic files: "Page 1, 2, 3, 4, ...", with each page number hyperlinked to a corresponding graphic file. Your graphic files will run 15-25kb each. The use of PDF graphics format is a waste of time and space unless a professor gives you a MS Word file of their lecture notes which you can convert directly into a PDF file with embedded text. That is the only case in which I would use PDF over PNG. Good luck.
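
    The "Page 1, 2, 3, 4, ..." link list described above could be generated rather than typed by hand. This is only a sketch: the `file-01.png` naming follows the Xsane numbering mentioned in the comment, while the function names and the page title are made up for illustration.

```python
from html import escape


def page_links(prefix, page_count, ext="png"):
    """Build 'Page 1, 2, 3, ...' with each number linked to its scan file."""
    width = max(2, len(str(page_count)))  # file-01.png style zero padding
    links = []
    for n in range(1, page_count + 1):
        name = f"{prefix}-{n:0{width}d}.{ext}"
        links.append(f'<a href="{escape(name)}">{n}</a>')
    return "Page " + ", ".join(links)


def index_html(title, prefix, page_count):
    """Wrap the link list in a bare-bones HTML page."""
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body><h1>" + escape(title) + "</h1>\n"
        "<p>" + page_links(prefix, page_count) + "</p>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(index_html("Lecture notes", "file", 3))
```

    Redirect the output to index.html in the same directory as the scans and each page number links straight to its image.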

  54. Mass Scanning by carldot67 · · Score: 2, Insightful

    I looked into this once for a client. Agencies charge around 5c a page but that is only to scan. Add more for OCR, manual verification and/or transfer to M$ Word or what-have-you. I think I recall seeing 50c a page for such value-adds. Agencies are good because you don't need to buy the kit (30K and up) or watch it run (they need feeding and jam quite a lot, especially if the paper is lower quality). Agencies also make sense for shops with nil/low expectation of producing more paper in the future. Get some quotes, references and examples of their work and start with a short trial run.

    --
    I wish at was Friday, but I dont want to wish my life away. So I wish it was last Friday.
  55. I'm archiving stuff at my university by adrew · · Score: 4, Informative

    We've undertaken a pretty large archiving job at my university. We're scanning every page of every newspaper we've ever printed (started in 1927) up to the time we have digital archives starting around 1993 or so. We're also scanning about 80 300 page yearbooks. Hopefully this can offer you some help or suggestions.

    We have a dual-processor G4 and an Epson 1640XL large-format FireWire scanner with the optional auto document feeder. It's probably a bit out of your budget ($2899 + ~$1200 for the ADF) but it's awesome. It can scan at up to 1600dpi and the ADF can automatically duplex and scan both sides of the page. We're using OmniPage Pro X for OCR software.

    Right now we're more concerned with scanning the documents and getting them online, so we haven't started OCR'ing everything yet. But the ADF is awesome. It can scan both sides of all 300+ pages of a yearbook automatically in about 2 1/2 hours.

    The newspapers are a bit different. They're getting a bit fragile in their old age so we have to manually scan them. We scan them at 300dpi in full color, so the 12x18 pages are around 50MB per page. But the scanner takes less than a minute per page. It's impressive.
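
    For anyone wanting to sanity-check file sizes like the one above before buying disk, the uncompressed size of a scan is just pixels times bytes per pixel. A quick sketch, assuming 24-bit color and no compression (the second example uses a letter page, which is also an assumption):

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # 12x18 inch newspaper page, 300 dpi, 24-bit color.
    color_page = raw_scan_bytes(12, 18, 300, 24)
    print(f"newspaper page: {color_page / 2**20:.1f} MiB uncompressed")

    # For contrast: a letter page scanned 1-bit black and white at 200 dpi.
    bw_page = raw_scan_bytes(8.5, 11, 200, 1)
    print(f"b/w letter page: {bw_page / 1024:.0f} KiB uncompressed")
```

    The color figure lands in the same ballpark as the ~50MB per page quoted, and the comparison shows why black-and-white scanning is so much cheaper to store for plain notes.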

    We use Photoshop's web gallery feature to generate the image galleries. Pretty simple really. Let me know if you have any questions.

  56. I do this all the time by s.a.m · · Score: 2, Informative

    Hopefully you'll get to read this one and hopefully it won't get modded down to oblivion.

    Yes there are scanners out there that can work for you. I have a Canon DR-5020 which we just feed it a ton of paper and come back in a few and it's done. It can scan VERY quickly. PDF format would work just fine as well. It's the best option especially since it's hand written notes as well.

    If this is a requirement which is going to be on-going then you will have to pony up the money and spend a few thousand. If you're not ready to do that, you may be in luck. Some places will lease it out to you and with that few hundred bucks I'm sure you can easily get a hold of one for about a week or 2.

    Look up for people who do Document Imaging, and you should find a lot of business that come up. If you're in the washington dc area then maybe I can help you out quite a bit.

  57. Re:Get stuffed - outsource to india by remolacha · · Score: 3, Informative

    we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs - and had a lot of luck sending them to firms in india, either by sending the documents snailmail or scanning with a sheet feeder and ftp'ing. the firm we got the best results from was called suntec, suntecindia.com I believe. I know outsourcing is a touchy subject these days, but they were all set up for this, we weren't, and their prices were quite good.

  58. Re:Xerox Scanner doesn't do OCR by Anonymous Coward · · Score: 2, Funny



    Well I hope someone develops something soon, I've been unable to read my own handwriting since 1995.

  59. Kinkos isn't worth it (probably). by darkonc · · Score: 2, Interesting
    Somebody else noted that Kinko's would probably charge $.30/page. That's $30/100 pages. If you can manage to set up a sheet-feeding scanner such that you can do one page/30 seconds you would be getting cheaper results by paying that person $30/hour to do the same job.
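
    The arithmetic behind that comparison, as a small sketch. It only counts labor at the quoted wage and scan rate; the cost of the scanner itself is left out, so treat it as a lower bound on the in-house price.

```python
def in_house_cost_per_page(hourly_wage, seconds_per_page):
    """Labor cost per page for someone feeding the scanner."""
    pages_per_hour = 3600 / seconds_per_page
    return hourly_wage / pages_per_hour


KINKOS_PER_PAGE = 0.30  # the per-page price quoted in the thread

if __name__ == "__main__":
    own = in_house_cost_per_page(30.0, 30)
    print(f"in-house: ${own:.2f}/page vs Kinko's: ${KINKOS_PER_PAGE:.2f}/page")
```

    At one page per 30 seconds that is 120 pages an hour, so the break-even wage against $.30/page works out to $36/hour.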

    As other people pointed out, if you can get a couple of departments in on this, then you can more easily amortize the costs of really good equipment to do this...

    One thing that I'll note is that I don't really like PDFs for this sort of stuff. If you really have a 100 page article, you're going to be looking at a 3 meg file and, perhaps, a 30 second startup time... That's fine for someone who's going to read the document from cover to cover, or print it... On the other hand, it's a pain if you only want to look at pages 37 and 38.

    GrokLaw gets PDFs of court filings regularly, and I got so fed up with PDF's that I created a (semi-automated) batch system to split up the PDF's into separate PNG images and create a simple index.

    You can see a sample here. Far easier to view a page or two there (IMNSHO) -- but not as easy if you just want to download and print it.
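
    The split-a-PDF-into-PNGs step could be sketched roughly like this. To be clear, this is not the batch system described above; it assumes the `pdftoppm` tool from poppler-utils is installed, and the function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def split_command(pdf_path, out_dir, dpi=100):
    """Build the pdftoppm command that writes one PNG per PDF page."""
    root = Path(out_dir) / Path(pdf_path).stem
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(root)]


def split_pdf(pdf_path, out_dir, dpi=100):
    """Run the split and return the PNG files produced, sorted by name."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (poppler-utils) is not installed")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(split_command(pdf_path, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob(Path(pdf_path).stem + "-*.png"))
```

    Calling `split_pdf("filing.pdf", "pages")` (hypothetical file names) would leave one PNG per page in `pages/`, ready to be linked from a simple index page.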

    Before you go too far, you might want to get a good handle on how people are likely to use what you produce -- Use that knowledge to decide just how you want to organize the result. You may want to make it available in two (or more) different formats. It's not that difficult to bulk convert things between different forms (at least, not if you can dual boot into Linux, or have OS/X).

    --
    Sometimes boldness is in fashion. Sometimes only the brave will be bold.
  60. LyX for LaTeX!! by IronBlade · · Score: 2, Interesting
    Get some students of the professor's course to type them into LaTeX.

    Use the fairly user-friendly LyX to do the LaTeX-ing.
    Heck, get the academics themselves using it to prepare their notes in the first place!
    They might actually thank you for introducing them to this convenient and easy document processor.

    --
    Important info:
    http://www.lifeaftertheoilcrash.net
    http://dieoff.org/synopsis.htm
    http://www.peakoil.net
  61. Don't bother by An+Onerous+Coward · · Score: 2, Insightful

    Frankly, I've seen professors' handwritten lecture notes, and 90% of them add nothing to the educational process. Certainly not more than a quick note saying, "Read sections 2.1, 2.2, and 2.4, paying special attention to least-squares curve fitting and finding orthonormal bases." They're generally disorganized and difficult to follow because they usually take a lot of material for granted when they write.

    The mere fact that it's handwritten means that it's basically a rough draft that was hastily flung together. Send them back to him, and have him type them in and rework them until he figures they're worth recycling for next semester. The prof will save time in the long run, and the students will have something nice, clean, and organized to peruse.

    --

    You want the truthiness? You can't handle the truthiness!

  62. Spring for tiff conversion and ftp storage by georgeha · · Score: 2, Informative

    or just get the ftp storage and a DocuJob Converter, which converts DocuTech jobs to TIFF or PS, or just use a DigiPath instead.

  63. Alternatives by RebornData · · Score: 2, Informative

    Hardware for image acquisition:
    Check to see if the department copy machine has scan functions... most built in the past few years do, even if they aren't used in most places for that. You'll get a decent sheet feeder and way faster scanning than most desktop sheet-fed scanners.

    If you have to buy something and have to go *really* cheap, you could get a multi-function print / scan / fax thing. Most will handle legal size, because they're not actually moving the sheet fed paper onto the flatbed glass... the image element stays stationary while the paper goes by. But, of course, you get what you pay for... expect to spend time dealing with paperjams and skipped pages. However, it should be faster than hand-feeding a flatbed.

    Software:
    I mention this simply because nobody else has (that I've found): Scansoft Omnipage Pro is designed for highly repetitive, batch-oriented OCR. It has options for doing automated or hand-tweaked "area recognition" (separating text from graphics) and has the best proofreading UI I've seen... it flags "low confidence" recognitions automatically, and displays both its best dictionary guesses and the actual scanned words. Not sure it will help much with hand-written work, but for printed material it works well.

    Format: Your primary concern when looking for a destination file format should be longevity... will the files be readable 5 years from now? I've seen a number of people recommending highly efficient but obscure compression schemes, which are a terrible idea if you want the data to stick around. Saving a few bits doesn't do you much good if you can't figure out what they mean. I recommend that people scan to two formats, just for safety (Omnipage can do this automatically).

    -R

  64. ScanSnap by takasuz · · Score: 2, Interesting

    ScanSnap may be just what you need if the notes are on a uniform-sized paper (e.g. A4 or letter). You need Acrobat (included) on a Windows machine, but you just set the notes on the scanner and click a mouse then it scans 50 sheets (both sides in one-pass) without human intervention and gives you an Acrobat file in a few minutes. It is small and weighs light so you can easily bring it into the secretary's office. The price is also reasonable ($495 with Acrobat 6.0), and it seems they are even offering a $100 rebate now.

    The specified resolution is for a colored documents. For a b/w one, you will get a better resolution. You can obtain scan samples from a Japanese page (pdf files at the bottom).

    Actually, a newer model, fi-5110EOX, is already available in Japan, and I think that is why they are offering a rebate now. The new model has a USB 2.0 connection and a higher resolution mode (excellent) that is not possible with fi-4110.

  65. Distributed proofreading? by cwm9 · · Score: 2, Interesting

    I don't know what the specifics of your work are, but you probably have a huge supply of untapped workpower at your fingertips.

    The students who are taking these classes could easily be a source of tappable work hours.

    See Project Gutenberg's proofreading site for an example of this type of effort. http://www.pgdp.net/c/default.php

    If you could get the professors to offer a little bit of extra credit for proofreading or converting a page, the task could be much easier for you.

    Envision this: You use an ADF to scan an entire stack of notes in order, but you don't worry about how the scanning goes on each page. Then you xerox the whole stack and place the copies in a binder in someone's office. The students are then offered 10 points extra credit per page translated from .TIF to word/wordperfect/Mathematica, whatever, up to three pages' worth.

    The points are justified since the student is in the class and learning something by carefully duplicating, analyzing, correcting, and studying the professor's notes for that class. (Can you imagine a more likely way to end up accidentally committing three pages of facts to memory?)

    You can place the .TIF files in a class-accessible online folder, and accept the end result in an e-mail.

    If the file isn't legible, the student can check the xeroxed copy out from the binder. Since it's just a copy, you don't need to worry about losing it.

    You could skip the scanning altogether, and ask the students to return any pages they don't finish translating.

    Obviously this works best for large classes where the student:pages ratio is large.

    Make sure you number pages if you do anything like this.

  66. Ricoh Aficios, Ancient Fujitsus, and OmniPage Pro by BigBlockMopar · · Score: 4, Informative

    we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs

    We had to do this, too. For a Court, which requires the reasons, decisions, etc. to be publicly available online.

    *Thousands* of documents, hundreds of pages each. The responsible department got me, as the IT guy, to set it up for them (after they'd already bought the stuff to do it).

    Basically, a couple of Ricoh Aficio series copier/scanners, a couple of ancient Fujitsu sheet-feed scanners, and a bunch of students sitting all day in front of computers running OmniPage Pro.

    The Ricohs were great on paper - fast, networked, etc. but their scanner drivers were poor (reminded me of bad CD-ROM drivers - "Copywrite 1995 Behavior Tech Computer. All right reverse." [sic,sic,sic]), and their service (contract) involved having to call the Ricoh guy because the scanner portions randomly wouldn't appear on the network, then wait for him to appear while at least one of the students sat idle. 2 stars out of 5.

    Ancient Fujitsu scanners, black and white only, don't remember the model number, required proprietary SCSI cards, no support under Windows NT/XP/2K. These were commercial-grade super-expensive scanners when new (about 1990). Installed Windows 95 on a bunch of relics with ISA slots for the SCSI cards and let 'er rip. Scanning was fast, feed was reliable like a good-quality photocopier or fax machine. Only issue was requirement for an old computer running an old OS; better overall than Ricohs - 4 stars out of 5.

    OmniPage Pro 12 - reading was *excellent*, far better than anything else I've ever seen. Handled French and English, simple monochrome diagrams, etc. with only very small occasional formatting problems. Print to a PDF using Acrobat on the file server. Only real problem was stability, frequently locking up and losing the scan and OCR on page 99 of a 104 page document. 2 stars out of 5, being punitive because of frustration.

    As they got to be more proficient with OPP, and as OPP's dictionaries filled up, we were able to add more and more computers and scanners, so that they were running around, tossing files into the scanners, stapling scanned documents back together, and occasionally rebooting one of the Windows 95 workstations. Peak was 15 computers and scanners.

    Task took 3 students 3 months full-time.

    --
    Fire and Meat. Yummy.
  67. Prof. Clueless, PhD by im+a+fucking+coward · · Score: 2, Insightful

    Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page.

    After I stopped laughing, I realized this may be a serious inquiry rather than a joke. I've assisted local government agencies in converting clear, printed, 8.5x11" text documents into searchable text / pdf documents, and the cost for these is over 10 cents a page. (Tax and mill levy records have to be verified 100% correct, as I'm sure your prof's notes need to be.) That's with volume discounting (> 500,000 pages), using nearly perfect ascii text documents, not scribbled notes.

    So my advice is to get a few bids from outside contractors, then submit a realistic estimate based on the average. Hint: Given those spec's, it's clear you/your management have no idea what's involved in this process. (Shows at least a modicum of IQ that you had the good sense to ask, however.) If you simply need to scan/save as pics (jpg/tiff -> pdf), you can do this yourself at reasonable cost/effort expenditure. Seems to be implied that you need OCR capabilities for handwritten text, as complicated as equations at that, so you're really pretty screwed. Even simply creating 100-200 kb jpg's & emailing them in an automated process is going to run into problems when the campus mail servers refuse to accept attachments larger than a Meg.

    Good luck, BWAhahahahaha!