Large-Scale Paper-To-Digital Conversion?

Get stuffed by October_30th · 2004-05-23 06:28 · Score: 4, Insightful

Uh. How about telling your prof. to get stuffed and get a real secretary.

--
The owls are not what they seem

Re:Get stuffed by Amiga+Lover · 2004-05-23 06:35 · Score: 5, Interesting

I think you're right on the money. May be well worth taking the job to an outside agency. There are many print shops using Xerox Docutechs, which scan in many hundreds of sheets at once to print copies of documents. The scanning takes barely a second a page, and it wouldn't surprise me if the document format being stored inside the docutech is something that can be used for this purpose.

I've had a similar job, where our school's lecturers wanted their notes in the same style so one of my jobs as admin assistant was retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books, expecting them to be retyped completely in their own style. Give an inch they'll try to take a mile - use the few hundred $$ to get it professionally scanned.
Re:Get stuffed by djplurvert · 2004-05-23 06:49 · Score: 5, Insightful

In addition to the points already made it is not unreasonable to simply tell the prof that his/her expectations are unreasonable. Perhaps "get stuffed" is a bit over the top but I've found that employers (even professors) will listen to reasonable explanations.

I used to have a boss that would say things like "this should only take you about five minutes". I finally told him, "nothing takes just five minutes, if I have to stop what I'm doing there is a startup/teardown cost for every task." I convinced him that there was a granularity of 1/2 hour for every random task he wanted done. The discussion was fruitful for both of us, he was more reasonable about his expectations and put a bit more thought into what he wanted to distract me from my primary task to do.

Now, the original idea is a reasonable proposition; however, it isn't really the sort of thing that should be done for just one prof. Perhaps several departments can combine their resources to set up something that will allow this type of thing to be done in a reasonable time frame.

plurvert
Re:Get stuffed by Adian · 2004-05-23 07:11 · Score: 5, Insightful

On the contrary, it's your job as a professional and as an employee to keep your employers in tune with what is possible, and what is most efficient for the manhours/money involved. As an employee you are also responsible for keeping your employers informed of ways to actually save money, if there is a place this can be done. This particular job could require hundreds of manhours to do in-house, versus paying a place that actually specializes in these services to do it. I'd guess the university either has this equipment on campus, or already has contracts with some company for something similar.
Besides, it sounds like they are not aware of the time involved in scanning tens, let alone hundreds, of pages. It doesn't sound like they are too anxious to make it easy for him to get the job done either (not buying him new equipment, using the secretary's Win2k box after hours??).
I've volunteered my efforts before on a simple scanning job that required hundreds of regular photos to be scanned in at relatively good quality (why else do it otherwise), and it ended up taking forever. Upon informing the client of the amount of time required, they adjusted the way the job was being handled.
I think being straight with your employers, and clients is the best approach to any situation where too much is being expected. The times I've had these instances come up, and recommended different approaches that resulted in money being saved, or manhours on a task being reduced, I saw benefit in my paycheck through raises or promotions.

--
Adian
Re:Get stuffed by Man+Eating+Duck · 2004-05-23 09:11 · Score: 4, Informative

I've been working with various versions of the Docutech system for about six years, and they're in use in most of the professional copy/print shops around, at least in Scandinavia. They scan full page and double sided, 600 dpi at about 1 page/sec. Newer versions also can handle full colour.

Native document format is tiff images with a proprietary control file (structuring, positioning etc), but you can easily convert it to pdf.

I'd guess that a professional shop will charge you about 30 cents a page if you accept the raw document files without 'touching up'. This is more than adequate if you're just going to reproduce it on paper, or even distribute the PDFs. It'll weigh in at about 100k a page for the tiff format, and a lot less for the PDFs. This is black and white, which in most cases will suffice.
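
A minimal sketch of the arithmetic behind those figures, for anyone who wants to put a number in front of the prof. The 30 cents and 100 KB per page are the ballpark guesses from the paragraph above; the 5,000-page job size is a purely hypothetical number chosen for illustration.

```python
# Rough budget estimate for an outsourced scanning job.
# Per-page figures are the ballpark guesses quoted above;
# the page count used in the example is hypothetical.
COST_PER_PAGE_USD = 0.30   # print-shop guess, raw files without touch-up
TIFF_KB_PER_PAGE = 100     # black-and-white TIFF, approximate


def estimate_job(pages):
    """Return (cost in dollars, TIFF storage in megabytes) for `pages` pages."""
    cost = pages * COST_PER_PAGE_USD
    storage_mb = pages * TIFF_KB_PER_PAGE / 1000.0
    return cost, storage_mb


if __name__ == "__main__":
    cost, storage_mb = estimate_job(5000)
    print("5000 pages: about $%.0f and %.0f MB of TIFFs" % (cost, storage_mb))
```

Swap in a real page count and a real quote from the shop; the point is only that both cost and storage scale linearly with pages.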

Professional equipment (as in contracting a print shop) is definitely the way to go. I know that at the University of Oslo, Norway, they have established an in-house shop that will do this type of work internally for just about cost. Maybe that's an idea to put forth to the management? Surely your university will find other uses for it than just your assignment.

Hope this helps :)

--
Are you a grammar Nazi? I'm trying to improve my English; please correct my errors! :)

Kinkos? by axonal · 2004-05-23 06:30 · Score: 5, Informative

Some Kinkos have those big goliath Xerox scanners which act just like copiers. Load up a stack of papers, and it will scan the pages and store them. Not sure about PDF export/etc though.

Re:Kinkos? by zenquest · 2004-05-23 06:37 · Score: 5, Informative

We have a Xerox WorkCentre Pro 65 at my school. It can scan at around 50-60 pages per minute, and will do double-sided. It will do PDF output, too. (and email it or FTP it to you, if so configured)

Our teachers use them for exactly the purpose described. If you don't have one of these type machines around anywhere, then definitely give Kinkos or some similar establishment a try.
Re:Kinkos? by zenquest · 2004-05-23 07:38 · Score: 4, Informative

Going to Kinkos? Yeah, it's a bit pricey, but not totally out of bounds.

If he's at a large university, some other department might have one of these. Xerox doesn't charge for scans when you lease the machine. They only charge for how many prints or copies are made, so it would be essentially free for another department to allow him to use their machine. It doesn't even require any additional setup, since you can enter any email address into the machine and have it send the document there directly from the copier (assuming the SMTP server has been set on the machine).

well... by Anonymous Coward · 2004-05-23 06:30 · Score: 5, Funny

if I ask nicely, overnight use of the secretary's Win2k box

Plus, if you're lucky, you could also get other after-hours favors from the secretary as well ;-)

Simple. by jebell · 2004-05-23 06:31 · Score: 5, Funny

Outsource the job to India.

--
This is my sig. There are many like it but this one is mine.

Re:Simple. by GothChip · 2004-05-23 06:48 · Score: 4, Insightful

I know the parent post was funny but he's thinking along the right lines.

Take the few hundred you have to spend on equipment and spend it hiring a few temps.

A good typist should be able to type up hand written notes faster than scanning them all in and manually fixing all the mistakes.

HP Digital Sender by Guanix · 2004-05-23 06:31 · Score: 4, Informative

The HP Digital Sender series are really great for this stuff. You feed it a stack of paper and it scans it, 15 pages per minute, and can store the PDF on a file server or you can send an email with the PDF attached directly from the network sender! It's a bit expensive, but try to look around for one, maybe the local copyshop? Guan

If you're being 'asked' by Space+cowboy · 2004-05-23 06:31 · Score: 4, Insightful

Just say 'No'. (If you're being told, it's a different matter, of course).

It sounds to me like a damned hard job to automate (which is the only way it's not going to be a constant drain on your time), and you're being given next-to-no resources to even come up with a creative solution. Sometimes the best answer is in fact 'No' - it forces people to re-evaluate what they're asking. It comes with the danger of being sacked if it's you that's being unreasonable, of course....

Simon.

--
Physicists get Hadrons!

Re:If you're being 'asked' by malia8888 · 2004-05-23 06:48 · Score: 4, Insightful

I really agree with Space cowboy. My former husband was a college professor. He was very brilliant in his field, but anything outside of his narrow realm daunted him. He wanted to put pennies in our fusebox when the lights went out. He stared at a breaker box in the condo like it was the control panel of an alien spacecraft.
Explain the enormity of this scratched-note-to-finished-PDF job to this educator. Use crayons, mirrors, yarn and tape if necessary to get your point across. Just be diplomatic :P

--
Harpo Tunnel Syndrome--my wrist feels funny.

The most important thing by Timesprout · 2004-05-23 06:31 · Score: 5, Funny

Is to first make an exact copy (by hand) of all the existing documents. It's vital to have a full backup, so that if anything goes wrong with the scanning process you can always restore the manila folders to their original filled state.

--
Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
What truth?
There is no dupe

ADF Scanners by Loiosh-de-Taltos · 2004-05-23 06:32 · Score: 5, Informative

What I suggest and use is the HP 4C scanner. It's a SCSI-II only scanner that can be found on Ebay for under $10 usually. They also have an automatic document feeder option that can be found on Ebay. This scanner was originally designed for both Windows and Apple compatibility as well. It cannot handle 2-sided sheets.

The scanner has four different pieces of software you can choose to use, I'd suggest Precision Scan Pro as that makes multi-document scanning easier.

HP Scanjet 5550c is not what you want by GraZZ · 2004-05-23 06:33 · Score: 4, Informative

Definitely keep clear of the Scanjet 5550c; there's a reason why it's the cheapest feed scanner out there. It will frequently jam if you a) load more than 5 sheets into the feeder or b) use any sort of paper that has been handled by human beings.

Our Engineering Society was trying to put up an exam archive with one of them and quickly gave up and started scanning with the flatbed.

Also the scanner has no SANE support (one of the few HP scanners that doesn't).

Fax machine by markprus · 2004-05-23 06:34 · Score: 5, Interesting

Just fax the documents to a computer.

Recruit the community by SoSueMe · 2004-05-23 06:35 · Score: 5, Interesting

Do it the open source way.

Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.

I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.

Easy by JensR · 2004-05-23 06:35 · Score: 5, Interesting

Get some students of the professor's course to type them into LaTeX. Give them some points they'd otherwise get for homework.
a) Publication quality DVI/PS/PDF files
b) The student can deepen their knowledge of the topic
Everyone happy. Used to work like this at the university I went to. And you may be even lucky that some student typed these notes in for himself.

Re:Knee to the grindstone... by Exocet · 2004-05-23 06:36 · Score: 4, Insightful

"Ummm yeahhhh... if you could just do that..."

Faust7 is right about this one. Frankly, OCR is ok, but not great - on nice text on book-or-better paper. Handwritten notes? With equations? No. Not unless your profs have some damn fine handwriting and we all know that that is absolutely not the case.

My advice is the same as Faust7's with these additions: spend some of that money on a really nice keyboard, wrist-rest and/or maybe a nice monitor. You are going to be needing all three. If there are any leftover funds, get some really nice tea. I suggest Twinings English Breakfast or Prince of Wales, if you're going to go bagged.

--
Exocet Industries - Taking over the world, one computer at a

Re:DjVu by Ed+Avis · 2004-05-23 06:37 · Score: 4, Informative

For scanned documents, tic98 compresses even better than DjVu. It's free software and you can even read the author's PhD thesis about it.

--
-- Ed Avis ed@membled.com

Re:Outsource it by cloudmaster · 2004-05-23 06:41 · Score: 4, Insightful

Maybe he *is* the cheap manual labor / unpaid intern...

No good answers AFAIK by John+Miles · 2004-05-23 06:41 · Score: 4, Informative

I've run into a similar problem, and have no good solutions in the general case. I'm on a mailing list for users and collectors of Tektronix test equipment (oscilloscopes, logic and spectrum analyzers, and so forth). Last year, Tektronix's legal department issued a copyright release that permits the reproduction and distribution of documentation for test equipment that they (Tek) no longer support. This was of great interest to the people on the TekScopes list, because it gave a green light to scanning and trading/selling copies of manuals. I've scanned in a few manuals for some equipment I own, and it's a huge pain in the butt any way you look at it.

Electronic test-equipment manuals are pretty much worst-case candidates for scanning. In Tek's case, the schematic volumes often consist of hundreds of double-sided, nonstandard-sized foldout sheets (11x23" for example) with lots of fine detail that must be reproduced clearly. You can either scan the pages in segments and leave it to the reader to reassemble them, or you can take the manuals to Kinko's and have the foldout pages shrunk to 11x17" or 8.5x11" for scanning. Either way, it's a real hassle, and highlights a clear need for a "prosumer" duplex sheet-feed scanner solution.

A few years ago you could buy scanners like this one that could handle arbitrary sheet sizes, but I haven't seen them in stores lately. These may be easier to use than flatbed scanners, assuming the precision they offer is sufficient for your application. I don't know how well they'd work on densely-printed schematics.

Other than bitching about the state of the scanner marketplace, I don't have much to suggest. There are a few hints that will improve the quality and usability of your final document:

There are other formats, like DjVu, that have certain advantages over .PDF, but think carefully before using them. Will you be able to read your files 10, 20 years from now? In .PDF's case, the answer is an unequivocal 'yes' because of widespread government, military, and commercial standardization around it. I hate to see people spend hours scanning manuals in DjVu or another nonstandard format, because I'm 95% sure I won't be able to read them years down the road on a completely different platform.
To make the document searchable, use an OCR package like FineReader if possible... but expect to spend even more time babysitting the process.
Experiment with your scanner resolution settings to minimize the resulting .PDF file size. There's a big difference in size between 200 dpi and 300 dpi, and between a B&W and color scan.
For some mysterious, forehead-slapping reason, flatbed scanners often use glossy-white backing material in the lid. This encourages bleedthrough of text on the reverse side of double-sided material, making your scanned documents look sloppy and compress poorly. Placing a sheet of black paper, plastic, or cardboard material between your document and the scanner lid will make a big difference.
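
To put rough numbers on the resolution hint above, here is a small sketch of how the amount of raw (uncompressed) scan data scales with resolution and color depth. The bit depths (1, 8 and 24 bits per pixel) are the usual values for B&W, grayscale and color scans and are my assumption, not something measured on a particular scanner; compressed .PDF sizes will not track these ratios exactly.

```python
def relative_raw_size(dpi, bits_per_pixel, base_dpi=200, base_bits=1):
    """Raw size of a scan relative to a 200 dpi, 1-bit (B&W) scan of the same page.

    Pixel count grows with the square of the resolution, and each pixel
    costs `bits_per_pixel` bits before compression.
    """
    return (dpi / float(base_dpi)) ** 2 * (bits_per_pixel / float(base_bits))


if __name__ == "__main__":
    for dpi in (200, 300):
        for label, bits in (("B&W", 1), ("grayscale", 8), ("color", 24)):
            factor = relative_raw_size(dpi, bits)
            print("%d dpi %-9s: %6.2fx the raw data" % (dpi, label, factor))
```

Going from 200 to 300 dpi alone more than doubles the raw data, before any change in color mode.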

--
Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.

Gotta be careful though. by Faust7 · 2004-05-23 06:42 · Score: 5, Funny

Outsource the job to India

"No, no, not my entire job, just this one part. No, I can do the rest. No, really. No! No... please..."

--
The coolest voice ever.

Large Scale Paper to Digital Conversion by felila · 2004-05-23 06:45 · Score: 4, Informative

I do conversion for fun, at Distributed Proofreaders.

The problem is the mixture of graphics, equations, and text.

It's easy enough to turn a page of text into a smallish file. Get a good automatic-feed scanner ($3500 or so) and a copy of ABBYY OCR software. If the original isn't too speckly, tiny, or smudged, ABBYY will give you a 95% accurate text you can then correct. Best format to save in? Depends on what the school is going to do with the files. If they're to be posted on web sites, perhaps XHTML. If it's just for preservation, plain text (if there are no Greek characters) or XML with UTF-8.
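
One way to make the "plain text or UTF-8" decision mechanical is to check the corrected OCR output for characters that plain ASCII cannot hold. This is only an illustrative sketch: the function names are made up for the example, and it assumes the text is already available as a Python string.

```python
import unicodedata


def contains_greek(text):
    """True if any character's Unicode name marks it as a Greek letter or symbol."""
    return any("GREEK" in unicodedata.name(ch, "") for ch in text)


def suggest_format(text):
    """Suggest plain ASCII text when that is lossless, otherwise UTF-8."""
    try:
        text.encode("ascii")
        return "plain text (ASCII)"
    except UnicodeEncodeError:
        return "UTF-8 (e.g. XML with UTF-8)"


if __name__ == "__main__":
    for sample in ("E = mc^2", "\u0394x \u2265 0"):
        print(repr(sample), contains_greek(sample), suggest_format(sample))
```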

Equations -- well, there's supposedly a version of XML for math, but Distributed Proofreaders has ended up using TeX, as it seems to be the mathematical standard. While this would work for preservation, it wouldn't work for a web site.

For a web site, perhaps the best way would be to intersperse text with pngs of the equations and graphics. The pngs would still take a lot more space than text, but the files would be smaller than PDF versions of the whole page.

A Fujitsu scanner, SANE and Quartz Python bindings by sabi · 2004-05-23 06:46 · Score: 5, Informative

Such as the fi-4120c is what I'd recommend. You might have to stretch your budget a bit. The cheap HP sheet feeders are very unreliable; we went through two HP 5550c's enduring constant paper jams before switching to a better (Fujitsu) scanner.

Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac/Windows so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane.

I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port you can embed in your own app.

Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files (sorry, Slashdot destroys Python indentation):

#!/usr/bin/python
from CoreGraphics import *
import math, sys

for inputPDFPath in sys.argv[1:]:
    inputProvider = CGDataProviderCreateWithFilename(inputPDFPath)
    inputPDF = CGPDFDocumentCreateWithProvider(inputProvider)
    if inputPDF is None:
        print >> sys.stderr, \
            "unable to open '%s': perhaps is not a PDF file?" % inputPDFPath
        continue
    outputContext = CGPDFContextCreateWithFilename(
        inputPDFPath + '-rotated.pdf', None)
    for pageNumber in xrange(1, inputPDF.getNumberOfPages() + 1):
        mediaBox = inputPDF.getMediaBox(pageNumber)
        rotatedBox = CGRectMake(0, 0, mediaBox.getMaxY(), mediaBox.getMaxX())
        outputContext.beginPage(rotatedBox)
        outputContext.saveGState()
        outputContext.translateCTM(0, rotatedBox.size.height)
        outputContext.rotateCTM(-math.pi/2)
        outputContext.drawPDFDocument(mediaBox, inputPDF, pageNumber)
        outputContext.restoreGState()
        outputContext.endPage()
    outputContext.finish()

You could also use ReportLab, but because a lot of the PDF processing code is written in Python it's somewhat slower and memory-hogging for high-volume use. (I used ReportLab on Windows for the above project, and use CoreGraphics Python bindings for my research, so I do know what I'm talking about mostly :)

My dad's office by pavera · 2004-05-23 06:49 · Score: 5, Informative

My father is an attorney. He has a couple of high-speed scanners from Panasonic. They cost less than a thousand dollars ($400-500, if I remember correctly), they scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (PDF, TIF, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. They can scan up to 500 pages at a time.
Here is a link:
High Speed Scanners
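
For a feel of what those numbers mean in practice, here is a back-of-the-envelope sketch. It uses the figures quoted above (about 20 ppm, up to 500 pages per load, a 10,000-page file) and assumes the feeder runs at full speed with no jams or reloading time, which is optimistic.

```python
import math


def scan_time(pages, pages_per_minute, batch_size):
    """Return (hours of pure scanning, number of feeder loads)."""
    hours = pages / float(pages_per_minute) / 60.0
    loads = int(math.ceil(pages / float(batch_size)))
    return hours, loads


if __name__ == "__main__":
    hours, loads = scan_time(10000, 20, 500)
    print("10,000 pages: about %.1f hours of scanning in %d loads" % (hours, loads))
```

So even with a proper sheet-fed scanner, a single large file is roughly a working day of feeding paper.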

All you can do... by cliffiecee · 2004-05-23 06:51 · Score: 5, Insightful

Is say "Sure. I'll get this done- when I can. Don't expect it to be done for at least a few weeks, maybe longer."

DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.

With the kind of volume you say you're receiving, the only way you're going to survive is to:

1. close your eyes,
2. load the documents into the feeder,
3. press 'scan'.
4. Make sure everyone knows this policy.

Re:Format by Chuckaluphagus · 2004-05-23 06:52 · Score: 5, Informative

I have to scan and store very high-res black-and-white images for work, and I've found that the best format to save in is TIF with a CCITT Fax 4 compression. It will only work for black-and-white files, but for a full page of text and graphics scanned at 2-color, 600 dpi, you can get a file about 100 kbyte. The image quality is superb, and it's far, far more efficient than PDF.
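
To see how much work the Fax 4 compression is doing, compare that 100 kbyte figure with the raw size of the bitmap. The sketch below assumes a US Letter page (8.5 x 11 inches), which is my assumption since the page size isn't stated above, and ignores per-row padding and file headers.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Approximate uncompressed size in bytes of a 1-bit-per-pixel scan."""
    pixels = int(round(width_in * dpi)) * int(round(height_in * dpi))
    return pixels // 8  # 8 pixels per byte at 2-color depth


def compression_ratio(raw_bytes, compressed_bytes):
    """How many times smaller the compressed file is than the raw bitmap."""
    return raw_bytes / float(compressed_bytes)


if __name__ == "__main__":
    raw = raw_bilevel_bytes(8.5, 11, 600)
    ratio = compression_ratio(raw, 100 * 1000)
    print("raw: %.1f MB, ratio about %.0f:1" % (raw / 1e6, ratio))
```

The actual ratio depends heavily on page content: mostly-white text pages compress far better than dense graphics or speckled scans.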

The program I use to convert to TIF is IrfanView (http://www.irfanview.com/), a generally excellent image viewer. It's free, too, so no worries there. It offers a ton of options for compression settings for different formats, so you can try other file formats as needed.

Some photocopiers support this by adamsc · 2004-05-23 07:05 · Score: 4, Informative

Check whether any of the photocopiers around campus support scanning: we have a Canon ImageRunner in one of the labs which I support. It's extremely fast - ~1 second per page for a double-sided scan and the feeder is pretty robust - we have grad students who take handwritten lecture notes for an entire class and dump this stack of a couple hundred crumpled pages into the feeder and end up with a PDF a couple minutes later.

Works for me by sglow · 2004-05-23 07:07 · Score: 5, Interesting

I tend to scan lots of documents and set up a simple Perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.

I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.

I scan to tiff and then use the convert utility (part of imagemagick) to convert to png. The resulting files typically run about 100K to 200K depending on the content.

If anyone's interested in seeing the perl script I've posted it to: www.ollies.net/scanscript.html
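
For readers who would rather not use Perl, the same scan-to-TIFF-then-PNG flow can be sketched in Python. This is not the script linked above, just an illustration of the idea. `scanimage --format=tiff` writes the image to standard output; the `--mode` and `--resolution` options and the "Lineart" value depend on the scanner backend, so treat them as assumptions and check `scanimage --help` for your device. The function names are made up for the example.

```python
import subprocess
import sys


def scan_command(resolution=300, mode="Lineart"):
    """Build the scanimage command line (backend-dependent options assumed)."""
    return ["scanimage", "--format=tiff",
            "--mode", mode, "--resolution", str(resolution)]


def convert_command(tiff_path, png_path):
    """Build the ImageMagick command that converts the TIFF to PNG."""
    return ["convert", tiff_path, png_path]


def scan_page(basename):
    """Scan one page to <basename>.tiff, then convert it to <basename>.png."""
    tiff_path = basename + ".tiff"
    png_path = basename + ".png"
    with open(tiff_path, "wb") as out:
        # scanimage writes the image data to stdout, so redirect it to a file
        subprocess.check_call(scan_command(), stdout=out)
    subprocess.check_call(convert_command(tiff_path, png_path))
    return png_path


if __name__ == "__main__":
    # Usage: python scanpage.py page001
    print(scan_page(sys.argv[1]))
```

Building the command lists in separate functions keeps the hardware-dependent part (the actual scan) easy to swap out or dry-run.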

Steve

docutech is the way to go... by capsteve · 2004-05-23 07:45 · Score: 4, Informative

being in the prepress industry, i see more and more traditional printing going the way of xerography. of the competitors in the field, xerox probably has the best system with the docutech series... you may want to consider kinko's which is an authorized user/vendor of the docutech system.

on a side note, if the professors are utilizing a lot of additional material which might include handwritten information, you might consider encouraging them to transcribe that material (hopefully you're not the TA that has to do the transcription) into a digital form, be it text or WORD. this'll definitely help in reducing the size of your files.

also consider looking into adobe's pdf service, if you're overwhelmed with just organizing the material itself. probably not so kosher to suggest it on /. but it could be something the school already has an agreement with adobe for (taking into account the units of acrobat the school itself might be using). i know it's not rolling your own, but sometimes using an "out of the box" solution to get things up and running so you can explore other solutions has its merit as well...

--
three can keep a secret, if two are dead - benjamin franklin

PDF of handwritten notes is DUMB!!! by madstork2000 · 2004-05-23 07:45 · Score: 4, Insightful

It makes no sense at all to me to have a PDF created of handwritten notes, since most students will probably just download and print out the PDF anyway. The only advantage is that it may save a few trees, since not everyone will print them out.

It sounds like the school wants to shift the production costs (i.e. printing) to the students. This seems inefficient compared with the old way, where the instructor could go to the copy center and have the notes copied at the school's expense (I know these expenses are often passed along to the students anyway), rather than at the students' DIRECT expense: their time for downloading, then printing out on their own equipment or using their own printing accounts at the computer center.

If the notes were being OCR'd and then made available online, or post-processed so that they were searchable, indexed, etc., it would be useful. Otherwise this seems like a waste of time and money.

-MS2k

Depends how good you want to do it by Danh · 2004-05-23 08:16 · Score: 4, Informative

If you want to do a good job, you have to type it, in LaTeX. It's the only way to get something nice and something the professors will be able to enhance in future.

If a digitized copy of the manuscripts will do for you, you can go the scan -> image enhancement -> OCR -> save to PDF way.

For scanning, you already got a lot of good comments on how to automatise the scanning of dozens of scripts. If you lack these possibilities, a SCSI or USB desktop scanner should also do the job (it's definitely less than 1 min per page), so you scan a script in 2 hours. No need to bother to outsource the job to India. Probably you can scan B/W and don't need greyscale or colors. I would scan handwritten scripts at 200 DPI and save the whole pictures in front of the OCRed text, so the user doesn't see the OCRed text and can only use it for selecting and copy&paste. It would be too much work to correct the OCRed text here. For machine written text I would use 300 dpi or more for better OCRing.

As image enhancement you only need to be able to automatically orient the page so that the text is horizontal. I don't remember if Acrobat does it, but for this job I would anyhow get a good OCR program.

As OCR program I recommend FineReader, but also Omnipage is ok. FineReader does better OCR than Omnipage and Acrobat. It also saves better to PDF (with retaining all of the paragraph structure) than Omnipage.

If you keep the image before the OCRed text in the PDF you can expect files of 10MB for 100 pages for B/W scan at 200 dpi. OCRing of machine written text has become incredibly accurate, so you can do real OCR there and throw away the bitmap picture. This of course gives much nicer output (and smaller filesize), but you need to spend a lot of time correcting the text. Here the best OCR program really pays off (you probably have a lot of words which are not in a dict, need custom dicts (does Acrobat have them?),...). A program with a single flaw (e.g. one that recognizes your formula as text, or code as paragraph text,...) will let you waste a lot of time correcting it on every second page.

I'm archiving stuff at my university by adrew · 2004-05-23 09:12 · Score: 4, Informative

We've undertaken a pretty large archiving job at my university. We're scanning every page of every newspaper we've ever printed (started in 1927) up to the time we have digital archives starting around 1993 or so. We're also scanning about 80 yearbooks of roughly 300 pages each. Hopefully this can offer you some help or suggestions.

We have a dual-processor G4 and an Epson 1640XL large-format FireWire scanner with the optional auto document feeder. It's probably a bit out of your budget ($2899 + ~$1200 for the ADF) but it's awesome. It can scan at up to 1600dpi and the ADF can automatically duplex and scan both sides of the page. We're using OmniPage Pro X for OCR software.

Right now we're more concerned with scanning the documents and getting them online, so we haven't started OCR'ing everything yet. But the ADF is awesome. It can scan both sides of all 300+ pages of a yearbook automatically in about 2 1/2 hours.

The newspapers are a bit different. They're getting a bit fragile in their old age so we have to manually scan them. We scan them at 300dpi in full color, so the 12x18 pages are around 50MB per page. But the scanner takes less than a minute per page. It's impressive.

We use Photoshop's web gallery feature to generate the image galleries. Pretty simple really. Let me know if you have any questions.

Ricoh Aficios, Ancient Fujitsus, and OmniPage Pro by BigBlockMopar · 2004-05-23 17:32 · Score: 4, Informative

we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs

We had to do this, too. For a Court, which requires the reasons, decisions, etc. to be publicly available online.

*Thousands* of documents, hundreds of pages each. The responsible department got me, as the IT guy, to set it up for them (after they'd already bought the stuff to do it).

Basically, a couple of Ricoh Aficio series copier/scanners, a couple of ancient Fujitsu sheet-feed scanners, and a bunch of students sitting all day in front of computers running OmniPage Pro.

The Ricohs were great on paper - fast, networked, etc. but their scanner drivers were poor (reminded me of bad CD-ROM drivers - "Copywrite 1995 Behavior Tech Computer. All right reverse." [sic,sic,sic]), and their service (contract) involved having to call the Ricoh guy because the scanner portions randomly wouldn't appear on the network, then wait for him to appear while at least one of the students sat idle. 2 stars out of 5.

Ancient Fujitsu scanners, black and white only, don't remember the model number, required proprietary SCSI cards, no support under Windows NT/XP/2K. These were commercial-grade super-expensive scanners when new (about 1990). Installed Windows 95 on a bunch of relics with ISA slots for the SCSI cards and let 'er rip. Scanning was fast, feed was reliable like a good-quality photocopier or fax machine. Only issue was requirement for an old computer running an old OS; better overall than Ricohs - 4 stars out of 5.

OmniPage Pro 12 - reading was *excellent*, far better than anything else I've ever seen. Handled French and English, simple monochrome diagrams, etc. with only very small occasional formatting problems. Print to a PDF using Acrobat on the file server. Only real problem was stability, frequently locking up and losing the scan and OCR on page 99 of a 104 page document. 2 stars out of 5, being punitive because of frustration.

As they got to be more proficient with OPP, and as OPP's dictionaries filled up, we were able to add more and more computers and scanners, so that they were running around, tossing files into the scanners, stapling scanned documents back together, and occasionally rebooting one of the Windows 95 workstations. Peak was 15 computers and scanners.

Task took 3 students 3 months full-time.

--
Fire and Meat. Yummy.
