Slashdot Mirror


Large-Scale Paper-To-Digital Conversion?

An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"

6 of 459 comments

  1. Kinkos? by axonal · · Score: 5, Informative

    Some Kinkos have those big goliath Xerox scanners which act just like copiers. Load up a stack of papers, and it will scan the pages and load them up. Not sure about PDF export/etc though.

    1. Re:Kinkos? by zenquest · · Score: 5, Informative

      We have a Xerox WorkCentre Pro 65 at my school. It can scan at around 50-60 pages per minute, and will do double-sided. It will do PDF output, too. (and email it or FTP it to you, if so configured)

      Our teachers use them for exactly the purpose described. If you don't have one of these machines around anywhere, then definitely give Kinkos or some similar establishment a try.

  2. ADF Scanners by Loiosh-de-Taltos · · Score: 5, Informative

    What I suggest and use is the HP 4C scanner. It's a SCSI-II only scanner that can usually be found on eBay for under $10. An automatic document feeder option for it can also be found on eBay. This scanner was originally designed for both Windows and Apple compatibility. It cannot handle 2-sided sheets.

    The scanner has four different pieces of software you can choose from; I'd suggest Precision Scan Pro, as that makes multi-document scanning easier.

  3. A Fujitsu scanner, SANE and Quartz Python bindings by sabi · · Score: 5, Informative

    Something such as the Fujitsu fi-4120c is what I'd recommend. You might have to stretch your budget a bit. The cheap HP sheet feeders are very unreliable; we went through two HP 5550c's, enduring constant paper jams, before switching to a better (Fujitsu) scanner.

    Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac/Windows so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane.
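
    As a rough illustration of scripting a sheet-fed scan through SANE, here is a minimal Python 3 sketch that builds a command line for the scanimage tool that ships with SANE. The device string is a made-up placeholder (run scanimage -L to see what your setup actually reports), and the --mode and --source values are backend-specific, so "Lineart" and "ADF" are assumptions that may need changing for a given scanner.

```python
import subprocess


def build_scanimage_command(device=None, resolution=300, mode="Lineart",
                            source="ADF", batch_pattern="page%03d.tif"):
    """Build a scanimage argument list for a batch scan from a feeder.

    The mode and source values are assumptions: each SANE backend
    defines its own names for these options.
    """
    cmd = ["scanimage"]
    if device:
        # For a scanner shared by saned this is a "net:..." device name
        cmd += ["-d", device]
    cmd += [
        "--format=tiff",
        "--batch=" + batch_pattern,   # one numbered file per page
        "--resolution", str(resolution),
        "--mode", mode,
        "--source", source,
    ]
    return cmd


def run_batch_scan(cmd):
    """Run the scan; needs SANE installed and a scanner attached."""
    subprocess.run(cmd, check=True)


# Hypothetical device name, shown only to illustrate the shape of the call
command = build_scanimage_command(device="net:scanhost:fujitsu")
print(" ".join(command))
```

    The sketch only prints the command; calling run_batch_scan(command) would start the actual scan and write one TIFF per page into the current directory.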

    I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port you can embed in your own app.

    Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files (sorry, Slashdot destroys Python indentation):

    #!/usr/bin/python
    # Rotates each PDF named on the command line and writes the result
    # next to the input file. Uses Apple's CoreGraphics Python bindings.

    from CoreGraphics import *
    import math, sys

    for inputPDFPath in sys.argv[1:]:
        inputProvider = CGDataProviderCreateWithFilename(inputPDFPath)
        inputPDF = CGPDFDocumentCreateWithProvider(inputProvider)
        if inputPDF is None:
            print >> sys.stderr, \
                "unable to open '%s': perhaps it is not a PDF file?" % inputPDFPath
            continue
        outputContext = CGPDFContextCreateWithFilename(
            inputPDFPath + '-rotated.pdf', None)

        for pageNumber in xrange(1, inputPDF.getNumberOfPages() + 1):
            mediaBox = inputPDF.getMediaBox(pageNumber)
            # The rotated page swaps the original width and height
            rotatedBox = CGRectMake(0, 0, mediaBox.getMaxY(), mediaBox.getMaxX())
            outputContext.beginPage(rotatedBox)
            outputContext.saveGState()
            # Move the origin to the top of the new page, then rotate by -90 degrees
            outputContext.translateCTM(0, rotatedBox.size.height)
            outputContext.rotateCTM(-math.pi/2)
            outputContext.drawPDFDocument(mediaBox, inputPDF, pageNumber)
            outputContext.restoreGState()
            outputContext.endPage()
        outputContext.finish()

    You could also use ReportLab, but because a lot of its PDF processing code is written in Python, it's somewhat slower and more memory-hungry for high-volume use. (I used ReportLab on Windows for the above project, and use the CoreGraphics Python bindings for my research, so I mostly do know what I'm talking about :)

  4. My dad's office by pavera · · Score: 5, Informative

    My father is an attorney, and he has a couple of high-speed scanners from Panasonic. They cost less than a thousand dollars ($400-500, if I remember correctly), scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (PDF, TIFF, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. They can scan up to 500 pages at a time.
    Here is a link:
    High Speed Scanners

  5. Re:Format by Chuckaluphagus · · Score: 5, Informative

    I have to scan and store very high-res black-and-white images for work, and I've found that the best format to save in is TIFF with CCITT Fax 4 (Group 4) compression. It will only work for black-and-white files, but for a full page of text and graphics scanned at 2-color, 600 dpi, you can get a file of about 100 kB. The image quality is superb, and it's far, far more efficient than PDF.
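
    To put that 100 kB figure in perspective, here is a quick back-of-the-envelope calculation in Python. The page size is an assumption (US Letter, 8.5 x 11 inches); the 600 dpi, 1 bit per pixel, and roughly 100 kB numbers come from the comment above.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of pixels is padded to a whole number of bytes
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


# Assumed page size: US Letter, 8.5 x 11 inches
raw = raw_bilevel_bytes(8.5, 11, 600)
compressed = 100 * 1000  # the "about 100 kB" quoted above

print("uncompressed: %d bytes" % raw)
print("approximate compression ratio: %.0f:1" % (raw / compressed))
```

    Under those assumptions the raw bitmap is a little over 4 MB per page, so a 100 kB file is roughly a 40:1 reduction.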

    The program I use to convert to TIFF is IrfanView (http://www.irfanview.com/), a generally excellent image viewer. It's free, too, so no worries there. It offers a ton of options for compression settings for different formats, so you can try other file formats as needed.