Slashdot Mirror


Large-Scale Paper-To-Digital Conversion?

An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"

18 of 459 comments

  1. Kinkos? by axonal · · Score: 5, Informative

    Some Kinkos have those big goliath Xerox scanners which act just like copiers. Load up a stack of papers, and it will scan all the pages in. Not sure about PDF export/etc though.

    1. Re:Kinkos? by zenquest · · Score: 5, Informative

      We have a Xerox WorkCentre Pro 65 at my school. It can scan at around 50-60 pages per minute, and will do double-sided. It will do PDF output, too. (and email it or FTP it to you, if so configured)

      Our teachers use them for exactly the purpose described. If you don't have one of these types of machines around anywhere, then definitely give Kinkos or some similar establishment a try.

  2. well... by Anonymous Coward · · Score: 5, Funny

    if I ask nicely, overnight use of the secretary's Win2k box

    Plus, if you're lucky, you could also get other after-hours favors from the secretary as well ;-)

  3. Simple. by jebell · · Score: 5, Funny

    Outsource the job to India.

    --
    This is my sig. There are many like it but this one is mine.
  4. The most important thing by Timesprout · · Score: 5, Funny

    Is to first make an exact copy (by hand) of all the existing documents. It's vital to have a full backup so that, if anything goes wrong with the scanning process, you can always restore the manila folders to their original filled state.

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
  5. ADF Scanners by Loiosh-de-Taltos · · Score: 5, Informative

    What I suggest and use is the HP 4C scanner. It's a SCSI-II only scanner that can usually be found on eBay for under $10. It also has an automatic document feeder option that can be found on eBay. This scanner was originally designed for both Windows and Apple compatibility as well. It cannot handle 2-sided sheets.

    The scanner has four different pieces of software you can choose from; I'd suggest Precision Scan Pro, as that makes multi-document scanning easier.

  6. Fax machine by markprus · · Score: 5, Interesting

    Just fax the documents to a computer.

  7. Re:Get stuffed by Amiga+Lover · · Score: 5, Interesting

    I think you're right on the money. May be well worth taking the job to an outside agency. There are many print shops using Xerox Docutechs, which scan in many hundreds of sheets at once to print copies of documents. The scanning takes barely a second a page, and it wouldn't surprise me if the document format being stored inside the Docutech is something that can be used for this purpose.

    I've had a similar job, where our school's lecturers wanted their notes in the same style so one of my jobs as admin assistant was retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books, expecting them to be retyped completely in their own style. Give an inch they'll try to take a mile - use the few hundred $$ to get it professionally scanned.

  8. Recruit the community by SoSueMe · · Score: 5, Interesting

    Do it the open source way.

    Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.

    I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.

  9. Easy by JensR · · Score: 5, Interesting

    Get some students of the professor's course to type them into LaTeX. Give them some points they'd otherwise get for homework.
    a) Publication quality DVI/PS/PDF files
    b) The student can deepen their knowledge of the topic
    Everyone happy. Used to work like this at the university I went to. And you may even be lucky enough that some student has already typed these notes in for himself.

  10. Gotta be careful though. by Faust7 · · Score: 5, Funny

    Outsource the job to India

    "No, no, not my entire job, just this one part. No, I can do the rest. No, really. No! No... please..."

  11. A Fujitsu scanner, SANE and Quartz Python bindings by sabi · · Score: 5, Informative
    Such as the fi-4120c is what I'd recommend. You might have to stretch your budget a bit. The cheap HP sheet feeders are very unreliable; we went through two HP 5550c's enduring constant paper jams before switching to a better (Fujitsu) scanner.

    Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac/Windows so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane.

    I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port you can embed in your own app.

    Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files:

    #!/usr/bin/python
    # Python 2 script; uses the CoreGraphics bindings shipped with Mac OS X.

    from CoreGraphics import *
    import math, sys

    for inputPDFPath in sys.argv[1:]:
        inputProvider = CGDataProviderCreateWithFilename(inputPDFPath)
        inputPDF = CGPDFDocumentCreateWithProvider(inputProvider)
        if inputPDF is None:
            print >> sys.stderr, \
                "unable to open '%s': perhaps it is not a PDF file?" % inputPDFPath
            continue
        outputContext = CGPDFContextCreateWithFilename(
            inputPDFPath + '-rotated.pdf', None)

        for pageNumber in xrange(1, inputPDF.getNumberOfPages() + 1):
            mediaBox = inputPDF.getMediaBox(pageNumber)
            # The rotated page swaps the width and height of the original.
            rotatedBox = CGRectMake(0, 0, mediaBox.getMaxY(), mediaBox.getMaxX())
            outputContext.beginPage(rotatedBox)
            outputContext.saveGState()
            # Shift the origin to the top edge, then rotate 90 degrees clockwise.
            outputContext.translateCTM(0, rotatedBox.size.height)
            outputContext.rotateCTM(-math.pi/2)
            outputContext.drawPDFDocument(mediaBox, inputPDF, pageNumber)
            outputContext.restoreGState()
            outputContext.endPage()
        outputContext.finish()

    You could also use ReportLab, but because a lot of the PDF processing code is written in Python it's somewhat slower and memory-hogging for high-volume use. (I used ReportLab on Windows for the above project, and use CoreGraphics Python bindings for my research, so I do know what I'm talking about mostly :)
  12. Re:Get stuffed by djplurvert · · Score: 5, Insightful

    In addition to the points already made it is not unreasonable to simply tell the prof that his/her expectations are unreasonable. Perhaps "get stuffed" is a bit over the top but I've found that employers (even professors) will listen to reasonable explanations.

    I used to have a boss that would say things like "this should only take you about five minutes". I finally told him, "nothing takes just five minutes; if I have to stop what I'm doing there is a startup/teardown cost for every task." I convinced him that there was a granularity of 1/2 hour for every random task he wanted done. The discussion was fruitful for both of us: he was more reasonable about his expectations and put a bit more thought into what he wanted to pull me away from my primary task to do.

    Now, the original idea is a reasonable proposition; however, it isn't really the sort of thing that should be done for just one prof. Perhaps several departments can combine their resources to set up something that will allow this type of thing to be done in a reasonable time frame.

    plurvert

  13. My dad's office by pavera · · Score: 5, Informative

    My father is an attorney, and he has a couple of high-speed scanners from Panasonic. They cost less than a thousand dollars ($400-500, if I remember correctly), they scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (PDF, TIFF, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking about files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. The scanners can take up to 500 pages at a time.
    Here is a link:
    High Speed Scanners
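
    For a sense of scale, the figures quoted above (20 ppm, 500-sheet batches, a 10,000-page file) can be turned into a rough time estimate. This is only a sketch: it counts feed time alone and ignores reloading, jams, and saving, and the function names are made up for illustration.

```python
import math

def scan_time_minutes(pages, pages_per_minute):
    """Pure feed time in minutes, ignoring reloads and paper jams."""
    return pages / pages_per_minute

def batches_needed(pages, feeder_capacity):
    """Number of times the feeder has to be loaded."""
    return math.ceil(pages / feeder_capacity)

case_file_pages = 10000
print(scan_time_minutes(case_file_pages, 20))   # sheet-fed scanner at 20 ppm
print(batches_needed(case_file_pages, 500))     # 500 sheets per load

# For comparison: a 100-page stack on a flatbed at 1 to 2 minutes per page.
print(scan_time_minutes(100, 1), scan_time_minutes(100, 0.5))
```

    Under those assumptions a 10,000-page file is roughly 500 minutes of feeding spread over 20 loads, while a single 100-page stack on the flatbed described in the question takes 100 to 200 minutes of hands-on time.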

  14. All you can do... by cliffiecee · · Score: 5, Insightful

    Is say "Sure. I'll get this done- when I can. Don't expect it to be done for at least a few weeks, maybe longer."

    DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.

    With the kind of volume you say you're receiving, the only way you're going to survive is to:

    1. close your eyes,
    2. load the documents into the feeder,
    3. press 'scan'.
    4. Make sure everyone knows this policy.

  15. Re:Format by Chuckaluphagus · · Score: 5, Informative

    I have to scan and store very high-res black-and-white images for work, and I've found that the best format to save in is TIF with a CCITT Fax 4 compression. It will only work for black-and-white files, but for a full page of text and graphics scanned at 2-color, 600 dpi, you can get a file about 100 kbyte. The image quality is superb, and it's far, far more efficient than PDF.

    The program I use to convert to TIF is IrfanView (http://www.irfanview.com/), a generally excellent image viewer. It's free, too, so no worries there. It offers a ton of options for compression settings for different formats, so you can try other file formats as needed.
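
    To put the 100 kbyte figure in perspective, here is a back-of-the-envelope calculation of the uncompressed size of such a scan. It assumes a US Letter page (8.5 x 11 inches), which the post does not specify, and 1 bit per pixel for a 2-color image; the function name is hypothetical.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of pixels is padded up to a whole number of bytes.
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px

raw = raw_bilevel_bytes(8.5, 11, 600)   # assumed US Letter at 600 dpi
compressed = 100 * 1000                  # the "about 100 kbyte" quoted above
print(raw, "bytes uncompressed")
print("compression ratio roughly %.0f:1" % (raw / compressed))
```

    Under those assumptions the raw bitmap is a little over 4 MB, so a 100 kbyte file corresponds to a compression ratio of roughly 40:1.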

  16. Works for me by sglow · · Score: 5, Interesting

    I tend to scan lots of documents, and I set up a simple Perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.

    I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.

    I scan to TIFF and then use the convert utility (part of ImageMagick) to convert to PNG. The resulting files typically run about 100K to 200K depending on the content.

    If anyone's interested in seeing the Perl script, I've posted it to: www.ollies.net/scanscript.html
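
    The linked Perl script is not reproduced here. As a rough illustration of the same scan-then-convert workflow, here is a minimal Python sketch. The function names, the default 300 dpi resolution, and the "Lineart" mode name are assumptions: the --mode and --resolution options and their allowed values depend on the SANE backend, so check `scanimage --help` for your scanner.

```python
import subprocess

def scan_command(resolution=300, mode="Lineart"):
    """Build a scanimage command line for one black-and-white page.

    scanimage writes the image data to standard output.
    """
    return ["scanimage", "--format=tiff",
            "--mode", mode, "--resolution", str(resolution)]

def convert_command(tiff_path, png_path):
    """Build the ImageMagick command that turns the TIFF into a PNG."""
    return ["convert", tiff_path, png_path]

def scan_page(basename):
    """Scan one page to basename.tiff, then convert it to basename.png."""
    tiff_path = basename + ".tiff"
    png_path = basename + ".png"
    with open(tiff_path, "wb") as out:
        subprocess.run(scan_command(), stdout=out, check=True)
    subprocess.run(convert_command(tiff_path, png_path), check=True)
    return png_path
```

    Calling scan_page("lecture01-page001") from the target directory would leave both the TIFF and the PNG there; it needs SANE and ImageMagick installed and a scanner attached.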

    Steve

  17. Re:Get stuffed by Adian · · Score: 5, Insightful

    On the contrary, it's your job as a professional and as an employee to keep your employers in tune with what is possible, and what is most efficient for the manhours/money involved. As an employee you are also responsible for keeping your employers informed of ways to save money, if there is a place where this can be done. This particular job could require hundreds of manhours to do in-house, versus paying a place that actually specializes in these services to do it. I'd guess the university either has this equipment on campus, or already has a contract with some company for something similar.
    Besides, it sounds like they are not aware of the time involved in scanning tens, let alone hundreds, of pages. It doesn't sound like they are too anxious to make it easy for him to get the job done either (not buying him new equipment, using the secretary's Win2k box after hours??).
    I've volunteered my efforts before on a simple scanning job that required hundreds of regular photos to be scanned in at relatively good quality (why else do it otherwise), and ended up taking forever. Upon informing the client of the amount of time required, they adjusted the way the job was being handled.
    I think being straight with your employers, and clients is the best approach to any situation where too much is being expected. The times I've had these instances come up, and recommended different approaches that resulted in money being saved, or manhours on a task being reduced, I saw benefit in my paycheck through raises or promotions.

    --
    Adian