Large-Scale Paper-To-Digital Conversion?
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
Uh. How about telling your prof. to get stuffed and get a real secretary.
The owls are not what they seem
Some Kinkos have those big goliath Xerox scanners which act just like copiers. Load up a stack of papers, and it will scan the pages and load them up. Not sure about PDF export/etc though.
Flex your fingers, crack your knuckles, and get some eyedrops... because you're going to be doing a lot of typing.
The coolest voice ever.
if I ask nicely, overnight use of the secretary's Win2k box
;-)
Plus, if you're lucky, you could also get other after-hours favors from the secretary as well
You need a high speed scanner. Fujitsu makes a nice one that works pretty well.
Outsource the job to India.
This is my sig. There are many like it but this one is mine.
The large multi-function HP Printer/Copiers will scan and e-mail a PDF of an entire stack of papers just as you would use a normal copier. I'm sure that the other manufacturers have similar features, but it is the HP equipment that we use at work.
LibBT: BitTorrent for C - small - fast - clean (Now Versio
overnight use of the secretary's box ...
The HP Digital Sender series are really great for this stuff. You feed it a stack of paper and it scans it, 15 pages per minute, and can store the PDF on a file server or you can send an email with the PDF attached directly from the network sender! It's a bit expensive, but try to look around for one, maybe the local copyshop? Guan
While PDFs are pretty well supported, you'll still be storing it as raster data, so there won't be any size decrease over using an image format, such as PNG.
Are there any web-based packages for searching documents, based on OCR-extracted keywords? Obviously with messy hand-written notes, formulas, etc, OCR won't work reliably. For a similar project, I'd like to OCR the files and use the text data solely for keyword searching. Obviously not perfect, but better than just images.
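For what it's worth, the keyword-search half of this is not much code once some OCR tool has produced text. Here is a minimal sketch of an inverted index, assuming you already have one string of (possibly error-ridden) OCR text per scanned page. The file names and sample text are made up for illustration, and the OCR step itself is not shown.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical OCR output, keyed by scanned image file name.
# Note the deliberate OCR error ("symrnetric") on the first page.
ocr_pages = {
    "lecture01_p1.png": "Eigenvalues and eigenvectors of a symrnetric matrix",
    "lecture01_p2.png": "Proof: the matrix is diagonalizable",
    "lecture02_p1.png": "Fourier series of a periodic function",
}
index = build_index(ocr_pages)
print(search(index, "matrix"))
print(search(index, "symmetric"))
```

The second query finds nothing because the OCR misread the word, which is exactly the "not perfect, but better than just images" trade-off: students still get the page images, and the index only helps them find the right page.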
PNG is your friend....
This whole problem could be eliminated if these papers were put into PDF as soon as they are created. That said; I would explore solutions from the legal profession - they have a lot of things that do this.
Humor from a Genetically Molested Mind
Just say 'No'. (If you're being told, it's a different matter, of course).
It sounds to me like a damned hard job to automate (which is the only way it's not going to be a constant drain on your time), and you're being given next-to-no resources to even come up with a creative solution. Sometimes the best answer is in fact 'No' - it forces people to re-evaluate what they're asking. It comes with the danger of being sacked if it's you that's being unreasonable, of course....
Simon.
Physicists get Hadrons!
Is to first make an exact copy (by hand) of all the existing documents. It's vital to have a full backup, so that if anything goes wrong with the scanning process you can always restore the manila folders to their original filled state.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
See if the department can afford an HP Digital Sender. While they're quite pricy, they'll feed, scan, and email you a PDF.
http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/15179-64175-64404-12126-64404-25324.html
/ \
\ / ASCII ribbon campaign for peace
x
/ \
What I suggest and use is the HP 4C scanner. It's a SCSI-II only scanner that can usually be found on eBay for under $10. There is also an automatic document feeder option for it that can be found on eBay. This scanner was originally designed for both Windows and Apple compatibility as well. It cannot handle 2-sided sheets.
The scanner has four different pieces of software you can choose to use, I'd suggest Precision Scan Pro as that makes multi-document scanning easier.
Just hand back half of the stack, then do the half you kept up in latex.
This is a job made for outsourcing to India.
Change your major?
Hey, it's a thought.
I had to do something similar with about a thousand or so pages, except they were all separate files. I would concentrate on doing everything one step at a time. What I mean by that is scan all the pages into your computer and then begin making them into PDF files or whatever format you prefer. On my scanner it took about a minute per page, so my main problem was just not having anything to do while it was scanning. Don't worry about this; use the time to do something else, such as reading a book, or have another computer next to you to surf the web or play games on.
I am digitizing my lecture notes with LaTeX. Takes some time, but results in perfect output quality and small file sizes. Needless to say I am not using any wimpy wysiwyg-stuff to produce the graphics, that's what the picture-environment was made for.
Definitely keep clear of the Scanjet 5550c; there's a reason why it's the cheapest feed scanner out there. It will frequently jam if you a) load more than 5 sheets into the feeder or b) use any sort of paper that has been handled by human beings.
Our Engineering Society was trying to put up an exam archive with one of them and quickly gave up and started scanning with the flatbed.
Also, the scanner has no SANE support (one of the few HP scanners that doesn't).
Acrobat sucks ass for bitmap images. It doesn't display them very well, they don't print out well, and the files are huge. DjVu is a new image format that compresses extremely well (a few kilobytes a page -- actually comparable to ASCII text). It's somewhat proprietary, but it's probably the best solution here. There are free web-based services that can compress your images. You can try some of them and see for yourself.
Wank off for a bit
At my uni the course-capture guys started with the scanning approach. Of course it didn't work out, since it is impossible.
Eventually they rigged up a system of sticking a cheap video camera in the class, and giving the prof a chalkboard capable of printing whatever he wrote on it. That would just get converted to a PDF (no conversion or OCR), and the taped course converted to an MPEG1.
If you can't handle the load, look at alternative solutions like the above. YMMV.
Just fax the documents to a computer.
Mongofax it to yourself. Will come to an inbox near you as an email with pdf attachment. No need for a scanner. Works as fast as your fax can chew through your docs.
The scanner might be acting like it's scanning a multi-colored photograph when reading in a hand-drawn lecture frame. Try to see if the scanner can simplify colors, or if your PDF maker can do it. Put another way, instead of inputting 64000 possible resultant colors, use 16 or some other low number, as typical lecture slides (and pens) only use a small number of colors (typically black, red, blue, green, and orange).
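To make the color-reduction idea concrete, here is a rough sketch in plain Python that snaps every pixel to the nearest entry in a small fixed palette. The palette values are my own guesses at "paper plus the usual pen colors", not anything a particular scanner or PDF tool uses, and a real image library would do this far faster.

```python
# Assumed palette: white paper plus the ink colors mentioned above.
PALETTE = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
    "green": (0, 128, 0),
    "orange": (255, 165, 0),
}


def nearest_color(pixel):
    """Return the name of the palette color closest to an (r, g, b) pixel,
    using squared Euclidean distance in RGB space."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def quantize(pixels):
    """Replace every pixel with its nearest palette color."""
    return [PALETTE[nearest_color(p)] for p in pixels]


# Slightly yellowed paper, dark ink, and a reddish pen stroke.
scanned = [(250, 248, 240), (30, 25, 40), (200, 30, 40)]
print(quantize(scanned))
```

With six colors each pixel needs only 3 bits instead of 16 or 24, and the long runs of identical "paper" pixels compress very well afterwards.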
Do it the open source way.
Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.
I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.
Get some students of the professor's course to type them into LaTeX. Give them some points they'd otherwise get for homework.
a) Publication quality DVI/PS/PDF files
b) The student can deepen their knowledge of the topic
Everyone happy. Used to work like this at the university I went to. And you may be even lucky that some student typed these notes in for himself.
I found that DjVu format produces substantially smaller file than PDF for the same scanned image.
There is an open-source project http://djvu.sourceforge.net/ that provides code for reading DjVu docs, but I have no idea where to get a DjVu encoder.
I put "continuous feed scanner" in froogle, sorted by price, and found one for around $400. You can do it 25 pages at a time with this. (Microtek X12USL 2400x1200dpi 42bit).
Where I used to work, we digitized 4-5 million documents per month. But these were mostly printed copies.
We had a set of high-speed sheet-fed scanners; each scan would then be checked and linked to a database. The documents in most cases were shipped to a vault.
In America we are imprisoned by our fear of them.
This looks like a job for cheap manual labor. Try India. Or an unpaid intern.
Don't you dare moderate this as a troll. You know as well as I do that this is probably the only viable solution.
Bret
The cure for cancer is coming: Reovirus
Get one of those Canon scanner/copier/printer thingies..
They can scan direct to PDF at an amazing rate of feed using the standard sheet feed.
Since it has dual purposes, you might con them into one, shared among a couple of departments...
---- Booth was a patriot ----
We have a 332 ST at work and recently added the scanning software to it; it can export PDF's (image PDF's no OCR stuff) straight to a FTP site. pretty nice. Of course this seems like you'd already have to have a documentCentre to begin with.
I'm sorry to hear of your trouble. I offer prayers for you and your professors.
But the broader question is whether this is really a good idea. The result is going to be huge files, which will be messy, hard to read, and will lack an index or table of contents. Seems like a case of profs with too much ego and not enough willingness to put their own work into more useful form.
Find free books.
Use the department funds to sign up for an account at interpage.net, which will allow you to fax stuff off to yourself and receive it as an email attachment. Then use the fax machine in the office to run everything through.
That takes care of the scanning part; cataloging, organizing, etc. will take a lot more time.
You may be able to persuade some professors to fax you the stuff themselves, saving you a bit of time.
We use a Canon DR-3050 at work to do about 5,000 pages/week. It scans at 20 PPM, and you can put in a batch of about 75-100 pages and say 'go' and not worry about it. It's a $4,000 scanner, but it works really well for continuous processing.
As for formats, if it has handwritten stuff on it, you probably won't be able to OCR it and just store that. PDF image files are a pain, but so are lots of individual TIFs. Your students probably won't have a smart image viewer that can thumb through multiple pages of a multi-image TIF file, but if the profs can mandate they download a free one somewhere, that'd probably be the way to go... even less proprietary than Adobe's PDF.
Be very, very careful what you put into that head, because you will never, ever get it out. -Thomas Cardinal Wolsey
It's not as bad as it seems.
At work, we have several multifunction printers / copiers / faxes / scanners. These things are huge, and take reams of paper at a time for input, and don't take too long. Besides, it's completely automated (you might just have to import the resulting images into PDF, which can be done easily). I've used it in the past to scan in my notes and worksheets the professor's handed out. It makes storage a lot easier.
Someone already suggested Kinko's. Yes, they might have it. Also, I've seen some smaller copying places in Newark have similar devices. So it's common enough that you can find it easily.
If a friend or contact doesn't have access to such a device, then I'd suggest paying a copy shop to do it for you. I doubt it would be that expensive, and you can bill the school for it.
The problem isn't that hard to solve (unless you want to try to do it in your apartment). But it's a good thing to bring up on slashdot, as many people might learn about this in case they need to do it in the future.
Basically you walk over to a Xerox copier with a sheet feeder attached and using a cover sheet created in flowport, scan in your documents into Docushare. They are stored as fairly high quality PDFs. The Docushare software also does an OCR on the files and then makes them text searchable.
Although not perfect, it is by far the best solution I have seen. It sounds like you do not have the funds to implement this at your school (the price of the Xerox copier and dedicated docushare server) but if you only have a limited number of these documents, then you would not need to have the infrastructure and perhaps Xerox would do this for you. Xerox has many offices in major cities.
One option would be to use Xerox's Flowport. You would have to check what is available to you locally - but I can tell you that making a PDF with a Xerox copier and Flowport of 100 pages is a few minutes of work.
Also, try looking at what others are doing in the university setting.
I just bought a HP ScanJet 6250C with ADF on ebay for 100 euros. I have not tried it yet but it scans all pages in the feeder (25?) after a press on the button. Some multifunctionals (fax, printer and scanner in one thing) have a feeder too and are much cheaper than a scanner with an ADF.
History matters..
GIFs compress very well, especially with source material that's in limited colors. Try making a page into an 8-color or even 4-color GIF at about 150 dpi. The handwriting should be about as readable as the original.
Also, if you're scanning material with copy on both sides, you might get some visible bleed-through. Try scanning such pages with a sheet of black paper between the page and the lid of the scanner, then adjust contrast to ensure white whites and black blacks.
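The "white whites and black blacks" adjustment is basically a levels stretch. A minimal sketch on 8-bit grayscale values, assuming faint bleed-through shows up as light gray; the two cutoff values are illustrative guesses, so tune them against a real scan:

```python
def stretch_levels(gray, black_point=60, white_point=200):
    """Rescale 0-255 gray values so anything at or below black_point becomes 0
    (solid ink) and anything at or above white_point becomes 255 (clean paper).
    Values in between are stretched linearly. Both cutoffs are assumptions."""
    span = white_point - black_point
    out = []
    for value in gray:
        if value <= black_point:
            out.append(0)
        elif value >= white_point:
            out.append(255)
        else:
            out.append((value - black_point) * 255 // span)
    return out


# Dark ink, borderline ink, mid gray, faint bleed-through, near-white paper.
print(stretch_levels([30, 60, 130, 210, 240]))
```

Anything lighter than the white point, which is where show-through from the reverse side usually lands, is pushed to pure white before you reduce the page to a handful of colors.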
Q: What does the "B." in Benoit B. Mandelbrot stand for? A: Benoit B. Mandelbrot
After you get all of it scanned in and put through OCR, there will still be a ton of mistakes you'll need to correct.
now, at this point, you'll likely start wishing that you live in Canada (if you already don't).
The key is in volunteers, to bastardize "1984". Get a number of fairly intelligent high school kids that haven't done their 40 hours of community service (a graduation requirement).
Now, make them look at the originals, the scanned, and correct all the discrepancies
bonus: if the kids are the nerdy types, tell them that they're learning university material for free.
they could start paying you!
The only handwritten stuff I saw professors use were in math/statistics classes and math-heavy engineering classes. Survey class professors lecture and test the same stuff every year. Go with October 30's advice.
The only thing new in this world is the history that you don't know.[Harry Truman]
Use a digital camera and save as jpg. It's a lot faster than scanning, and the quality is just as good.
i have a fujitsu scanpartner fi-4120c desktop scanner. only offers a page feeder, though, no scan bed, so you will need everything to be loose pages.
very fast, and will do both sides in one pass, if you are working with double-sided pages. at 200x200 resolution (you might need higher, ymmv) and scanning double sided pages, i get something like 3 seconds per page (counting one double-sided page as two pages). for software i am just using the included scanner driver and twain software and adobe acrobat.
cdw has it here, i'm sure it can be had for cheaper. i got mine for $800 i think. a little more expensive, but the speed is well worth it in time savings.
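To put numbers on "the speed is well worth it": a quick back-of-the-envelope comparison for a 100-page stack, using the 1-2 minutes per page from the original question and the roughly 3 seconds per page quoted above. It only counts scanning time, not jams, reloads, or file cleanup.

```python
def scan_minutes(pages, seconds_per_page):
    """Total scanning time in minutes for a stack of pages."""
    return pages * seconds_per_page / 60


stack = 100  # pages in one professor's stack, per the original question

flatbed_best = scan_minutes(stack, 60)    # 1 minute of attention per page
flatbed_worst = scan_minutes(stack, 120)  # 2 minutes of attention per page
sheet_fed = scan_minutes(stack, 3)        # about 3 seconds per page

print(f"flatbed:   {flatbed_best:.0f}-{flatbed_worst:.0f} minutes")
print(f"sheet-fed: {sheet_fed:.0f} minutes")
```

Multiply that by "several dozen sets" and the flatbed route is a week or two of babysitting a scanner, versus a few hours for the sheet-fed one.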
Electronic test-equipment manuals are pretty much worst-case candidates for scanning. In Tek's case, the schematic volumes often consist of hundreds of double-sided, nonstandard-sized foldout sheets (11x23" for example) with lots of fine detail that must be reproduced clearly. You can either scan the pages in segments and leave it to the reader to reassemble them, or you can take the manuals to Kinko's and have the foldout pages shrunk to 11x17" or 8.5x11" for scanning. Either way, it's a real hassle, and highlights a clear need for a "prosumer" duplex sheet-feed scanner solution.
A few years ago you could buy scanners like this one that could handle arbitrary sheet sizes, but I haven't seen them in stores lately. These may be easier to use than flatbed scanners, assuming the precision they offer is sufficient for your application. I don't know how well they'd work on densely-printed schematics.
Other than bitching about the state of the scanner marketplace, I don't have much to suggest. There are a few hints that will improve the quality and usability of your final document:
Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.
Just pay some little kids (younger siblings?) like 3 bucks each to type it up.
Outsource the job to India
"No, no, not my entire job, just this one part. No, I can do the rest. No, really. No! No... please..."
The coolest voice ever.
I don't understand why, but most people don't realize that most new copy machines are also PRINTERS and DIGITAL SCANNERS. I always find it funny when companies purchase fax machines/scanners/copy machine/printers when they really only need one device.
If you can find access to a digital copier at your university somewhere, you can just put the whole stack of paper in the sheet feed and it should be able to scan every page double sided and put it on a network drive somewhere.
It might take awhile to figure out how to set this up, but it's infinitely easier than trying to scan each page by hand using a crummy consumer scanner.
Don't put messy handwritten notes on the web. It's very unprofessional and looks rubbish. Ask for student volunteers to transcribe the notes into LaTeX, then use HTML/PDF conversions for the web.
It'll take longer, but it will be worth the effort, especially when it comes to maintaining the notes in the future.
'nuff said ;-)
At your budget, I'd get a digital camera (Nikon Coolpix on e-bay, for example), shoot the pages, and put the pages together with acrobat as pictures. Spies can shoot at speed, and I expect 3 secs/page might be a realistic guess.
I do conversion for fun, at Distributed Proofreaders.
The problem is the mixture of graphics, equations, and text.
It's easy enough to turn a page of text into a smallish file. Get a good automatic-feed scanner ($3500 or so) and a copy of ABBYY OCR software. If the original isn't too speckly, tiny, or smudged, ABBYY will give you a 95% accurate text you can then correct. Best format to save in? Depends on what the school is going to do with the files. If they're to be posted on web sites, perhaps XHTML. If it's just for preservation, plain text (if there are no Greek characters) or XML with UTF-8.
Equations -- well, there's supposedly a version of XML for math, but Distributed Proofreaders has ended up using TeX, as it seems to be the mathematical standard. While this would work for preservation, it wouldn't work for a web site.
For a web site, perhaps the best way would be to intersperse text with pngs of the equations and graphics. The pngs would still take a lot more space than text, but the files would be smaller than PDF versions of the whole page.
At work, I set up a document scanning function for our BAR system (Business Approval Request)--everything that's submitted must include documentation, which is often a paper quote or invoice.
We bought an HP Scanjet with sheet feeder for about $200 (sorry, don't remember the exact model), and use Paperport to scan the documents to a network folder named for the person requesting the scan (the executive assistant does it). We save in 300 dpi TIFF files in 1 bit color (B+W), which are small (8.5" x 11" comes out around 50K), and extremely clear and legible, and can be printed out again at almost the same quality. The scanning is pretty fast, and it includes batches. The only slow part is that PaperPort (which comes with the scanner) scans to MAX files, which need to be saved as TIFFs.
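For a sense of why 1-bit scanning matters so much, here is the raw (uncompressed) size of a letter-size page at 300 dpi for the three common bit depths. This is plain arithmetic; the 50K figure above is what the poster reports after TIFF compression, and I'm not assuming any particular compression scheme.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


for label, bits in [("1-bit B+W", 1), ("8-bit gray", 8), ("24-bit color", 24)]:
    size = raw_page_bytes(8.5, 11, 300, bits)
    print(f"{label:12s} {size:>12,d} bytes")
```

A 1-bit page starts at about 1 MB raw, so the reported 50K is roughly 20 to 1 compression, while the same page in 24-bit color starts at about 25 MB before any compression at all.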
Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac/Windows so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane.
I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port you can embed in your own app.
Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files (sorry, Slashdot destroys Python indentation):
You could also use ReportLab, but because a lot of the PDF processing code is written in Python it's somewhat slower and memory-hogging for high-volume use. (I used ReportLab on Windows for the above project, and use CoreGraphics Python bindings for my research, so I do know what I'm talking about, mostly.)
Chances are she's a plump, old, matronly, bespectacled hausfrau.
But not for $200. You can get a Canon DR-2080C off of eBay for $630, and it can accept both USB 2 and SCSI-II interfaces.
The HP one you picked looked OK, but the feeder looks a little chintzy.
We have a Panasonic at work, and use it to scan in design packages. It's something like the model KV-S7065C. Don't be fooled by the 'low volume' tag: we routinely make 100 page PDFs out of it (high volume = insurance office), even though it will take a few min. Thing works great. Highly recommended. The Panasonic comes with software that allows you to save all as a single file, break into xxx page long files (where you get to pick xxx), and many other features.
My favorite is that it makes it easy to create pdf's with changes in page size / resolution. Our packages are mostly design calcs (8.5x11, 300dpi) with a few drawings (11x17, 600dpi), and it works slick.
We used to send out ~5-10 fedex packages a week, but now we just scan and email. Saves so much money, time, and they can get packages right away.
A good way to keep down on the cost is to get a B&W scanner - you probably don't need color anyway, and it keeps the file size way down.
I think I need a new sig here.
For that price, a digital camera on a fixed mount might be easier than a scanner. Lay out the sheet, take a shot, lather, rinse, repeat. Generate a PDF using imagemagick/ghostview.
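The "generate a PDF" step can be scripted. A sketch that shells out to ImageMagick's `convert`, which accepts a list of input images followed by an output file and writes one PDF page per image. It assumes ImageMagick is installed and on the PATH (newer releases name the command `magick`), and the file names are hypothetical.

```python
import subprocess


def convert_command(image_paths, pdf_path):
    """Build the ImageMagick command line: inputs first, output PDF last."""
    if not image_paths:
        raise ValueError("need at least one page image")
    return ["convert", *image_paths, pdf_path]


def make_pdf(image_paths, pdf_path):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(convert_command(image_paths, pdf_path), check=True)


# Zero-padded names keep the pages in shooting order when sorted.
pages = sorted(["page002.jpg", "page001.jpg", "page003.jpg"])
print(" ".join(convert_command(pages, "lecture01.pdf")))
```

Call `make_pdf(pages, "lecture01.pdf")` once the shots are on disk. Camera JPEGs are large, so you would probably want to downsample or reduce colors first.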
My father is an attorney,
he has a couple of high speed scanners from Panasonic. They cost less than a thousand dollars ($400-500, if I remember correctly), they scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (PDF, TIF, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. They can scan up to 500 pages at a time.
Here is a link:
High Speed Scanners
I worked on a similar project in the past, where I had to PDF a lot of paper-based documents.
:(
A nice ADF scanner will save your sanity. We had a newer ScanJet, resembling the 5550c, where you couldn't feed too much at once, and it would jam up, We later got a hold of an older HP Network ScanJet that worked like a champ. If I could remember the model numbers, I'd give them to you.
That said, from the sounds of your situation, outsourcing would be the best solution. They already have the high-end scanners and the high-end software to work with your documents (e.g. Acrobat Capture), and all you'll have to worry about is giving them the documents and picking up the CDs with the PDFs on them. I don't remember what it cost us, but I'd wager that the overall value was superior.
Good luck!
Don't waste your money buying a scanner. Teach the professor to use M$ Power Point or OO Impress. Those slides can easily be web published.
Fax the documents to something like an efax account. Most universities have a heavy-duty fax machine lying around somewhere. Or you could just give it to the secretary and say, "Hey, the prof. asked me to give you these, fax them to this number." Then, in comes your fax already converted to an electronic format. Most of the free fax services only allow you to receive a few faxes per month, but you could always just sign up for one of the better ones and then cancel.
I'd say "Sure. I'll get this done, when I can. Don't expect it to be done for at least a few weeks, maybe longer."
DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.
With the kind of volume you say you're receiving, the only way you're going to survive is to:
1. close your eyes,
2. load the documents into the feeder,
3. press 'scan'.
4. Make sure everyone knows this policy.
Get the Dr. to use a PC in the first place. That way all you need to do is clean up the material. Sketches, flowcharts, and the like can be entered using the appropriate tool.
Sigs are nice guns
http://www.xerox.com/go/xrx/equipment/product_details.jsp?Xcntry=USA&Xlang=en_US&prodID=DigiPath&cat=Product+Taxonomy%2fProduction+Workflow%2fFreeFlow+Digital+Workflow
While I'm not suggesting you buy it, find a local service provider that has one. If your school is large enough they may have something like this already. These are the type of scanners that drive Xerox's 120-180ppm printers; they are lightning quick (60 double-sided pages a minute) and surprisingly good quality.
Poorly set expectations. How did the professors get the idea that it was possible? It's not possible under the constraints that you are faced with. If money were not a limiting factor you could do this. But I'll assume money is a factor, and time as well. So go back and tell them that it's possible, but it's going to cost this much to automate the process, and this much if I type it in by hand, and this much if someone else does it but with poorer accuracy, and so on and so forth. Put the burden on them to decide how they want to deal with this. Only then will the appropriate solution be found and chosen.
http://tinyurl.com/3t236
I've heard the US military has this problem with the millions of documents taken from various government offices in Iraq. there's no easy way to get the information from them except to use fast scanners to put them into pdf. then, you just have to hope they can find a translator to look at a random document and hope it has some valuable info in it.
if someone came up with a google-like crawling engine that would OCR all the pages and put them into a searchable database, it would make their jobs a whole lot easier.
another poster suggested outsourcing to india. but if someone in the US developed the above-mentioned product, the USG would probably pay the 1000% premium to avoid outsourcing, and you'd be rich. ah, if only i were a programmer...
Cheap? Dunno. It was just there. In any sort of volume though, the cost drops precipitously (cheaper than you doing it on a flatbed scanner!).
Check out something like that (or indeed that) used, use it, resell it. Or new, then use/resell. Or get the school to buy it.
If this is a continuous thing, then all the better to own.
I just bought this one for work and it seems nice, so far. It's relatively inexpensive (around $200) and has a 50 sheet capacity document feeder. Of course, if you want newer and faster (and 2.5 times the price), you can always go for the 9450 PDF model, which as the name suggests can export directly to PDF files.
I work at the Academic Support office at a university. Much of what I do is scanning textbooks for visually impaired students, and I've recently started using Adobe Acrobat 6.0 Standard for some books. After a semi-scientific study, I found that scanning in black and white (that's 1-bit pure B&W, not 8-bit grayscale or whatever) and using Acrobat's adaptive compression gives good results with a small file size. Of course, this is usually with printed text, so YMMV.
The scanner I use is an HP ScanJet 7400C, and while the scanner is OK, the software has some major flaws that require workarounds. However, this is a fairly old scanner with old software (last updated in 2001, I think), so more recent versions may be improved.
Someone else suggested a high speed scanner from Fujitsu. I don't have any experience with these, but in addition to being very fast, they are very expensive and may require you to buy additional hardware (some of them use a SCSI interface instead of USB).
I'd suggest spending the money you have on a mid- or high-end consumer scanner with a good Automatic Document Feeder.
If you've got more questions, I'd be happy to answer them as best I can. Feel free to reply here or send me an email. If you do email, be sure to put "Slashdot" in the subject line.
~~LF
Kent State just announced their FlashNotes website. I go to school there; email me at fiveonethree@yahoo.com. I would be more than happy to come down and help you sort out your options.
A bit of opinion on the project. This is not a good idea. It's one more tool that students will rely on to memorize information instead of taking time to THINK about their subjects and really LEARN the material.
It takes up to 50 pages at a time, scans both sides in one pass at up to 20 sheets per minute, and produces a PDF file containing the original graphic image of the page plus (optionally) a OCR'ed text version for doing searches.
You can set it up to start scanning whenever pages are added, and then just refill it as you're doing other things (I've scanned in several books this way).
100 pages is really not that much.
There are entire companies in the Middle East with hundreds of employees typing in or scanning paper documents for REAL large scale conversion jobs like 10000+ pages. The employees are paid about $0.10 per hour. I read an article about such a company once and they mentioned Lockheed-Martin as one of their customers; apparently they had a *huge* amount of specs on paper that they needed to digitize.
Just my 12 minutes...
Acrobat Capture 3.0 is the way to go. I think Adobe makes really, really crappy software in the Acrobat line of products, but Capture gets the job done, and can create pretty small PDFs if you get the settings right. The other OCR'ing PDF creation alternatives, when you're creating hundreds of pages, are MUCH more expensive.
You can't do this project practically for a couple hundred dollars: You need a duplexing auto-feed scanner and those are not cheap. For a project I manage, we knew we were going to need to turn tens of thousands of pages of paper into PDF and we dropped the bomb on a Ricoh 450de duplexing document scanner: it does 55 pages per minute, both sides. This scanner has been trouble free and heavily used for 4 years; I cannot recommend it highly enough.
You really don't want to do this on a scanner without an autofeeder, and if documents change from single to double sided you REALLY don't want to scan them on a simplex scanner: it'll take you forever, and be more error-prone in terms of screwing up the order of or omitting pages.
You have a few options, assuming you can't get the scratch to get the stuff to do this project right in-house:
We have a modified need for the same thing -- scanning meeting notes, customer diagrams, and so on, all to PDF. The documents are usually 5-10 pages. We've been happy with a Visioneer Paperport 9450, which comes with a document feeder ($500 including a full copy of Acrobat).
This is a troll but a previous post is considered +5 teh fny?!?!?
Why do it all one way? It sounds like a very great deal of stuff that may never be used by students. Why not try to find a prof who will cooperate with letting you see his/her webpage usage patterns?
In my experience, it is very hard to predict what students will use for any given class based on the moronic ramblings of /.ers claiming to represent all students. On the other hand, by trying different things in my classes, I've been able to find out what my students will use eagerly. Hint: It ain't the same type of thing for every class!!!!!!
I'd like to say that you're at a really shitty university that would take this kind of student-hostile course of action, but then, I checked out MIT's Open Courseware only to find that the first course I looked at, Gilbert Strang's linear algebra, was a botch job. There was a postage-stamp-sized video of Strang telling anecdotes on the first day of class that could only be appreciated by someone who'd already taken the class. So much for leveraging the web's inherent strong points!
Write it all into latex. You'll be happy afterwards.
Messy text doesn't go well with OCR. You might get some of it through, but you still have to proofread it thoroughly and you may miss some stupid lookalikes.
Diagrams probably require drawing them by hand. Either plot the diagrams with matlab/octave or approximate them with bezier curves with your favorite figure-drawing editor.
Having the professors scan their own shit? Where I work, there's no way a professor would ever consider asking such an impossible thing. They would either scan it themselves or have their secretaries do it (FYI, every department in our hospital has one or more flatbed scanners including some automated ones). I mean, this is real donkey work for which you are likely to be too highly trained and too expensive. Again - no way.
----- One learns to itch where one can scratch.
I'm surprised no one has mentioned Acrobat Capture, which is designed for exactly your scenario. The JBIG2 plugin can make really small PDFs from scanned documents. The downside is that it's not cheap.
A number of years ago, the LARGE (read beltway bandit) contracting firm I was working for landed a private contract with a major insurance firm. Said firm had been NAILed in a class action lawsuit, and as part of the resultant consent decree, had to digitize ALL of its paper policies and contracts, going back years. This averaged over 90 front-and-back pages per customer, and there were millions of customers. They had (originally) about 90 days to get it done.
This insurance company custom built several scan assembly lines, which used automated (Xerox IIRC) scanners and document handlers, as well as lots of custom software (that we customized).
This was more than seven years ago, so I would be surprised if the core technology isn't available at Kinkos or maybe even somewhere within your own university. Ask around, and if it's not there, call the local Xerox rep and ask to have one of the devices out on a demo. Whether the uni buys one or not, you can probably get YOUR work done, and make YOUR prof happy.
Check whether any of the photocopiers around campus support scanning: we have a Canon ImageRunner in one of the labs which I support. It's extremely fast - ~1 second per page for a double-sided scan and the feeder is pretty robust - we have grad students who take handwritten lecture notes for an entire class and dump this stack of a couple hundred crumpled pages into the feeder and end up with a PDF a couple minutes later.
Somewhere in the department (or at least in the university) there must be a high capacity digital copier. If it isn't already installed you may have to beg to get the scanner support enabled, but copiers are built to work with huge quantities of paper and they'll turn your outlines into PDFs and either email them to you or save them somewhere for you to retrieve in a matter of minutes.
If you can get away with it (which seems unlikely), you should just make JPGs of the pages and put them in .cbr files.
It would be much easier than scanning or typing the stuff up, and there's a good free viewer for windows.
I'm amazed that you can OCR handwritten pages at all -- that's incredible. I had no idea the technology was that good.
I tend to scan lots of documents and set up a simple perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.
I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.
I scan to tiff and then use the convert utility (part of imagemagick) to convert to png. The resulting files typically run about 100K to 200K depending on the content.
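For anyone who wants to try the same workflow without the perl script, here is a rough Python sketch of the scan-to-tiff-then-png idea. It is not the script linked in this post; the function names and the page_NNN file naming are made up for illustration, and the --mode and --resolution options (and mode names like "Lineart") depend on your scanner's SANE backend, so check `scanimage --help` for your device first.

```python
import subprocess
from pathlib import Path

def scan_command(mode="Lineart", resolution=300):
    # scanimage writes the image to stdout; mode names are backend-dependent (assumption).
    return ["scanimage", "--format=tiff", "--mode", mode, "--resolution", str(resolution)]

def convert_command(tiff_path, png_path):
    # ImageMagick's convert picks the output format from the file extension.
    return ["convert", str(tiff_path), str(png_path)]

def page_paths(directory, page_number):
    # Hypothetical naming scheme: page_001.tiff / page_001.png, so files sort in page order.
    stem = Path(directory) / f"page_{page_number:03d}"
    return stem.with_suffix(".tiff"), stem.with_suffix(".png")

def scan_page(directory, page_number):
    # Scan one page to TIFF, then convert it to PNG. Needs a working scanner.
    tiff_path, png_path = page_paths(directory, page_number)
    with open(tiff_path, "wb") as out:
        subprocess.run(scan_command(), stdout=out, check=True)
    subprocess.run(convert_command(tiff_path, png_path), check=True)
    return png_path
```

Calling scan_page("notes", 1), then scan_page("notes", 2) and so on after swapping sheets would give one PNG per page in the chosen directory.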
If anyone's interested in seeing the perl script I've posted it to: www.ollies.net/scanscript.html
Steve
By the way, you shouldn't need to do any OCR with these files. I do use OCR (or what Acrobat 6.0 Standard calls "Paper Capture") for my scanning, but only because that allows the PDF to be read aloud, which gives greater accessibility to visually impaired students.
Besides, even the best OCR packages (we have ABBYY Fine Reader, supposedly very good) will do a poor job with handwritten text, and no OCR package that I know of will correctly do formulas.
~~LF
When I started work 18 years ago, we still had no word processors and had to write our notes. Secretaries would type them, get things wrong, we'd have to redraft them and they'd do them again. Of course, any later changes would have to go back to them.
I don't even think about handwriting now. It's just terribly wasteful.
The only handwritten things I see nowadays are things like compliments on report approvals, where someone is trying to add a personal touch (and birthday cards).
A couple years ago I bought a sheet-feeder scanner at Fry's for $29. In addition to regular paper, it could also handle business cards. Unfortunately it got stolen out of my office, and I couldn't find a cheap replacement; I'm now using a flatbed scanner.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
Most universities have excessive duplication of information storage because their departments are like Chinese fiefdoms.
Public affairs, publications, admissions, business services. Ask around. One of these probably has what you're looking for or knows somebody else at another nearby college/University that does.
You may be just a phonecall/fax/fedex package away.
Game: Player 'Donald J Trump' now has AI skill level 'experimental'.
I would suggest hiring a typist or two. Often you can get 7 cents a page (side) of typed document in digital format, although offering up to 50 cents per page is the real way to go. Hire a school kid or the like who types fast and well and bam, you're set. The added benefit of helping someone out is just cool.
...closer to $2000. Caveats: this assumes you want to produce an image file, no OCR, black and white; otherwise, you're on your own. I worked for a company that scanned vehicle loan documents for customer service call centers. They used a Kodak 2500 (not sure they make it any more) to scan up to 50 ppm, double-sided. We normally used about 100 dpi for images, which were perfectly legible but kept the image sizes down (about 120kb for a single sided 8.5 x 28 inch sheet). This data was imported into a workstation (old Win98 box with a SCSI card) running a software package called Paperflow. The images could then be indexed by hand and exported to a file server, and index information moved to a database (we were using something that worked with SQL, but I can't remember what). This is only worth the effort and cost if it becomes an ongoing project -- all future notes also get scanned.
Option 2: Outsource, but it might not be necessary to go to India. We bought all of our equipment from a company called Mackin Imaging (www.mackinimaging.com), and they do stuff like this for schools, banks, insurance companies, etc. all the time. I have no idea what they'd charge for a one-off project, but it will be done the way you want, indexed, no dropped pages, etc.
Hope this helps.
Speaking as such a student, I really hate that kind of PDF. They can be megabytes in size, sometimes a megabyte per page, and they're usually not worth my time or effort to download, and they're difficult to read. Get them to type (or LaTeX) their lecture notes. Offer to convert those to PDF yourself. Don't scan them. Don't encourage them to generate more handwritten PDFs. If you really must, then don't do PDFs, but use the most compressed image format you can find.
I think that the actual results of this would be less than stellar there, James Bond. Try it sometime, it actually doesn't work.
Swing and a miss. Just keep posting that and some day it will be on-topic.
As much as I hate to say it, I really hate having class notes that aren't in PowerPoint. PowerPoint allows me to search it very easily. It also has the added benefit that the professor can use the slides in next year's class, which helps me to concentrate on the lectures. It's a really big pain, but I don't think there is any way around typing them by hand...
In some fields, typing really is difficult, because you need to draw pictures, so scanning is probably appropriate. But in many fields, most of the material is text, and they ought to be typing it anyway :-) So if they're the type that can be motivated this way, give them a choice of ugly scans (8-bit color, 300dpi) or else submitting their typed notes, and give them a friendly interface for uploading their typed notes (if you can support web, email, and also drag&drop, that increases the chances that they'll use it.)
Bill Stewart
I assume you're scanning most of the documents in grayscale. This sounds obvious, but almost no one does it: reduce the colour depth! I find that most grayscale documents are still perfectly legible at 16 colours (or even 8 depending on how clear the original was). You only need 4 bits for 16 colours instead of 8 bits for 256. Virtually any compressed file format will be able to take advantage of this - you'll get an immediate 2x reduction in size. And you'll probably get even more than that, because the reduced-colour image will compress better too. When I have a lot of pages to scan, I set up a macro that does these three steps:
- increase contrast and brightness (generally clears up any blotches in the background)
- auto-equalize (to restore black/white balance, since increasing brightness usually makes the text lighter too)
- reduce to 16 (or 8) colours (without dithering)
This means you need a graphics program that can perform these steps. I use Corel PhotoPAINT only because that's what I'm used to, but any half-decent graphics program (commercial or free) should be able to do those steps.
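To make the colour-reduction step concrete, here is a minimal Python sketch of mapping 8-bit grey values to 16 levels without dithering and packing two pixels into each byte. The function names are made up for illustration, and it covers only that last step, not the contrast or equalize steps; any real graphics program does all of this for you.

```python
def quantize(pixels, levels=16):
    # Map 8-bit grey values (0-255) to level indices 0..levels-1, no dithering.
    return [p * levels // 256 for p in pixels]

def pack_4bit(indices):
    # Pack two 4-bit indices into each byte; pad an odd-length list with 0.
    if len(indices) % 2:
        indices = indices + [0]
    return bytes((indices[i] << 4) | indices[i + 1]
                 for i in range(0, len(indices), 2))

grey = [0, 15, 16, 255]          # four 8-bit pixels = 4 bytes
levels = quantize(grey)          # [0, 0, 1, 15]
packed = pack_4bit(levels)       # 2 bytes: half the raw size
```

That is where the immediate 2x comes from; any further gain depends on how well the compressor handles the flatter image, which this sketch does not measure.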
The quality of the scanning is obviously important; get or borrow the best scanner you can. The point made about putting a black backing onto a flatbed scanner is important. Also important is adjusting the scanner settings so that you get minimum noise (random black dots) without degrading the stuff you want to keep.
For this sort of thing you almost certainly want to do it bi-level/B&W/one bit deep (hopefully there are no shaded pictures, but you can use screening for those), and to my knowledge nothing has been developed that compresses these images better than CCITT Group IV (fax machines use Group III). You almost certainly don't want to use grey-scale, at least not for your final images.
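A quick way to see why one-bit images compress so well is to look at the runs of identical pixels in a scan line. The Python sketch below only illustrates that run-length idea; it is not Group IV itself, which as far as I know also codes each line relative to the line above it and uses variable-length codes for the runs. The function name is made up for the example.

```python
from itertools import groupby

def run_lengths(row):
    # Return (pixel_value, run_length) pairs for one row of 0/1 pixels.
    return [(value, len(list(group))) for value, group in groupby(row)]

# 32 pixels of mostly white paper (0) with a 3-pixel pen stroke (1).
row = [0] * 20 + [1] * 3 + [0] * 9
print(run_lengths(row))   # [(0, 20), (1, 3), (0, 9)]
```

Thirty-two pixels collapse into three runs, and a skewed or speckled page breaks those long runs up, which is one reason cleaning up the image first helps.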
You should see if you can find some post-processing software; we used to use ScanFix, which would straighten the image (which makes Group IV compression a lot better) and depending on settings clean it up as well. You also need to decide upon the size of the final images; you want to scan at 200 to 300 or even 400DPI, but you don't have to have final versions at those high resolutions.
The standard used to be TIFF images with Group IV compression, but not every image viewer can read them, or display them well (esp. if the image needs resizing, and I doubt you can assume everyone reading these has their monitor at a high resolution).
If PDF will accept and display images compressed with Group IV compression, you're probably best off with that, since Acrobat Reader is ubiquitous and fairly easy to use.
PNG is a nice format that I use by preference for > 1 bit deep images, but a quick check of some PNG documentation says that Group IV "often" compresses a lot better than 1 bit "greyscale" PNG; it was simply not designed for document imaging. And you also want to avoid JPEG, it's a lossy (will introduce artifacts) system that also wasn't designed for bi-level images.
Hope this helps.
If the lecture notes you're scanning don't contain any grayscale or color graphics, your best bet is to scan in black-and-white mode (as opposed to color or grayscale) for smallest file size. I'd suggest scanning at 300 DPI for sharp-looking printouts. Be sure to play around with the "threshold" value (or equivalent) in your scanning software until you figure out what looks best. If it's not set to a good level, text may look too thick and blocky, or thin lines might disappear completely.
Once you have a monochrome scan, you'll want to save in a lossless compression format that preserves the monochrome attribute of the image, such as compressed TIF, and not as JPEG. When exporting to PDF, you could experiment with both ZIP and fax (CCITT group 3/4) compression types -- both compress black-and-white images very well. If your PDF software doesn't have those options, the default should probably be good enough. Even at 300 DPI, most pages should fit into about 30K or so.
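For a feel of what the threshold does and how much the compression is buying you, here is a small Python sketch. The cutoff of 128 and the US Letter page size (8.5 x 11 inches) are assumptions for the example, not values from any particular scanner software, and the function names are made up.

```python
def threshold(pixels, cutoff=128):
    # 1 = ink (darker than the cutoff), 0 = paper.
    return [1 if p < cutoff else 0 for p in pixels]

def raw_bilevel_bytes(width_in, height_in, dpi):
    # Uncompressed size of a 1-bit image, each row padded to whole bytes.
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px

# A faint pencil line (grey value 100) survives a cutoff of 128 but not 90.
print(threshold([100, 240]))             # [1, 0]
print(threshold([100, 240], cutoff=90))  # [0, 0]

# Letter-size page at 300 DPI, before any compression.
print(raw_bilevel_bytes(8.5, 11, 300))   # 1052700 bytes, roughly 1 MB
```

So a page that ends up around 30K has been compressed by a factor of about 35 from the raw one-bit image.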
Most universities already have this service. The professor might not know it exists, but check the other departments to see if they have one (not the scanner - but the service at the school). It is usually somewhat intertwined with a Distance Learning center or department.
It takes away the cost of printing lectures/notes/required readings from the departments and tacks it onto the students who now seem to pay for printing above a certain limit in the labs.
At least this is the way at the universities I have worked at.
I've recently been asked to do a similar task. I spent about a week writing a custom software application. My resulting PDFs are approximately 50-100K per page, based on page content. The PDFs themselves will contain approximately 50-75 pages per "set". Our organization will be using this solution with a Fujitsu 4340C Scanner. We're looking at thousands of pages per month. So far, everything seems to be working well. While we used the Fujitsu scanner, any TWAIN compliant scanner should suffice. If you are able to do the custom development in a Windows environment, I'd be happy to share my experiences and the tools I utilized in the project.
The company I work for developed an invoice archiving system for small to medium companies.
Without going into the details, we researched several scanning solutions and the best price/quality machine we found was the Canon DR-2080C.
It's a double-sided, single-pass, color scanner designed for archiving documents. You can load it with 15-20 pages a go, set it to scan all documents to PDF and have it automatically deskew (which is really nice if you're going to OCR the documents afterward). The only issue we've had with it was with a Dell system that wouldn't recognize it no matter what (Dell's folks are working on it, I'm told).
There is, however, no Linux driver available.
In another industry, programmers, no matter how smart they are, should not create 100's of pages of code that is not distributable, readable, searchable or re-usable. If they do then they are not doing what they are being paid to do.
Give your professors a copy of Open Office and have them redo the work in a format that can be read, indexed, searched and distributed.
Can I bum a sig?
Yes, it makes a PDF of all the pages, but each page is just a picture. There's no way to search for text in the result. Also, graphics are much larger than text.
You might want to contact MIT and ask around, since they were/are doing a lot of what you need. Check out their MIT OpenCourseWare.
:)
Maybe you can also convince your professors to use their notes - then it's just a simple wget job for you.
Leonid Mamtchenkov
The big annoyance of image files of any type is that they make it hard to cut&paste text, but if you're working from raw bitmaps anyway, you don't lose out by using GIF instead of PDF to package the pictures. (PDFs that are created from text make it possible to retrieve the text, at least with newer PDF versions, but you don't have text to retrieve.) So also try running the thing through an OCR to extract anything you can, but don't expect much.
Bill Stewart
I see I'm not the only one with one. The nice things about it are that it's built like a tank (weighs like one too) and can handle legal size. The only downsides are that the resolution isn't as high as modern scanners, and that sheet-feeder is bulky.
Anyway I run the output through this and a bit of OCR (doesn't have to be perfect), and store it in a Database.
I once had the job of doing basically the same thing as the asker was asked to do, though I was to scan 176 pages in total.
I ended up getting a login on an old dual PPro 200, running NT4, with an old HP SCSI scanner. Scanned in at 300 dpi, 256 color greyscale. Each page took around 10-15 seconds, each save and page swap took 15-20 seconds (I used the kodakimg app that came with NT, saving to compressed tiff). I initially tried the same thing with a USB scanner, but each page scan took 1-2 minutes. To hell with that.
After the 176 pages (88 pages double-sided), my arms and back were sore, but it wasn't too bad. Thankfully, I got them in two sets (115 and 61 pages), and it was even easier.
I did a scripted resize with photoshop to fit each image on its own page (though you could probably do the same thing with the 'resize image' powertoy for XP), generated a single web page that contained all of the images, loaded the web page (it took a few minutes), and used 'print to pdf' which is available if you have Acrobat installed.
If the asker only has one class, and it's only 1-200 pages/week, 2-3 hours to do it all isn't that bad. No OCR, but I've seen so many crappy conversions it is hard for me to trust them.
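The timings above are easy to turn into a back-of-envelope estimate. This Python sketch just multiplies them out for the 176 pages; the function name is made up, and the figures cover scanning and saving only, not the resize or print-to-PDF steps.

```python
def total_minutes(pages, seconds_per_page):
    # Total hands-on time in minutes for a stack of pages.
    return pages * seconds_per_page / 60

pages = 176

# SCSI scanner: 10-15 s per scan plus 15-20 s per save and page swap.
scsi_best = total_minutes(pages, 10 + 15)    # about 73 minutes
scsi_worst = total_minutes(pages, 15 + 20)   # about 103 minutes

# USB scanner: 1-2 minutes per scan, before any saving or page swapping.
usb_best = total_minutes(pages, 60)          # 176 minutes
usb_worst = total_minutes(pages, 120)        # 352 minutes
```

So the slower scanner would have roughly doubled to tripled the job even before counting the time to save each page.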
The professional approach is to go back to them and clarify the outcome:
(a) you can scan the documents in, and they'll take X amount of space, and Y time; and this doesn't include OCR;
(b) you did a few tests (using the supplied document) and these are the results for TIFF, JPG, PDF, etc;
(c) OCR is probably infeasible (or not, do some tests) because of the nature of the documents;
Include in (a) the option of purchasing an automated document scanner, and the corresponding reduction in time.
Based upon all the above, get a clear go-ahead, and make the purchase if new equipment is authorised.
You said "where I work": this is your job: it's a bit poor to do as the other posters suggest and refuse to do the work: you need to make sure that the customer (professors) understand exactly what they are getting, and give them a choice to buy into it or not - i.e. "clarify the expectations".
If you assess that it's 2 weeks' worth of work, and the professors don't disagree, then your supervisor just has to put up with it.
with the Scan to PC Option Package. I can set 50+ pages on the sheet feeder and hit scan to PC and it will make a single PDF in about 5 minutes on the Server. (For note, we use a Mac OS X Panther server, which Kyocera does support.) It is helping us go from paper to electronic quite well. The 5530 is a Copier with network abilities, and the scanner adds a second network interface and address. It can also scan to e-mail. This system works quite nicely. We also used Fujitsu ADF 11x17 scanners; I believe they were the 1100's, it's been quite a while. Those could do 22 ppm but had a dedicated PC, and the Kofax card got quite hot, and the one we had only worked in Win 3.11. There are quite a few systems out there that are good at that, but the prices are going to be rather hefty.
Kosh: "Understanding is a 3 edged sword, your side, their side, the Truth."
Your research skills astound me in their nonexistence. If you are typical of today's college student, then I fear not for my job security. For less than $100 you can get a Lexmark(x125) at freakin Office Depot that will sheetfeed scan as well as color print and fax.
Now you're lazy but, I'm smart and lazy so I'd just go to Kinkos, give them the stack to process and present the receipt to the professor for reimbursement. I would also be surprised if your school doesn't already have the facilities to perform your needed task.
One of the things you will learn in your life is that usually your problem has already been solved multiple times by multiple people, and the least bit of effort on the internet will generally provide myriad examples of these solutions. Though, I can't believe this problem actually made it to slashdot. Must be a slow news day.
we use this every day at my office to automate / consolidate information from various investors and then send them out via pdf. It works pretty well, but it's not too fast..
Here in Germany we do hire living people to do this. It's a bit more than just scanning and making PDFs, you should prepare the courseware with a little more respect than that, or the future students will hate you, and for a very good reason.
Is manual work so impossible to pay for these days, or should everything be as dirt cheap as possible?
HP is nice for many things, but the xerox docuscan is really the solution for you. Just get the most expensive one you can afford (more money = faster) and go for it. MIT press classics recently did a bigger project converting all their books to PDF for reprint - too bad you don't have those funds http://www.xeroxscanners.com/default.asp?pageid=100
Quick suggestions:
1. Get the profs to do it in a digital format, _any_ digital format that can be scrolled -- ideally something which can incorporate diagrams and equations, but really whatever the default IT word processor is, or if individual, work with them -- because the result, from the experts themselves in a searchable "hands on" format, will simply be an order of magnitude better than anything you can scan in and attempt to make searchable.
2. Look for local scanning firms. We just finished a 15,000 page run for a client via a local legal dbms firm here in town (Seattle) for .10 a page, in PDF format at 600 DPI. We spent almost as much "coding" (in the paralegal sense) the documents into a sharepoint list.
3. Scan and bear it. You are right, unless you have been very lucky in choice of scanners and the people involved didn't wrinkle the sheets too much, it is going to require an attentive human monitor.
Why not outsource the job to India. For a few hundred dollars you should be able to hire a few people to type everything for you.
There are 10 kinds of people. Those who understand binary and those who don't
Find yourself a Xerox Document Centre that you can borrow for a few minutes. I can place a stack of sheets in the tray, select a destination folder, and hit 'go.' A few minutes later, there is a .pdf file sitting on a network share. I have used this for everything from digitising entire books (cut the binding and stack the sheets up) to small documents such as a CV; it is fast and effortless.
+++++++
"Look, dear, it's a crazy hairy scary man!"
True, but you still need some kind of document management system, else all you have is a pile of scanned images. Where's bill so and so, sent August of 2002? Also some documents will need to be kept in paper form for legal reasons.
I've been reading for a few minutes and nobody seems to address your settings etc.
You should scan in grey-scale or, if there is high enough contrast (pen notes, not pencil), in black and white. Grey-scale with JPEG at a medium or even low setting is going to be much smaller than the defaults. A pure black and white with Group 4 compression will be even better. At work we scan pages at 300 DPI that way and get 20 to 30 k files (I think, haven't done it for a while).
Also typically images for web viewing of even text are scanned at 72 dpi (all the scholarly journals at my university). This can make things hard to read but really shrinks the file (about 1/16th the size of 300 dpi).
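The 1/16th figure comes from the fact that pixel count scales with the square of the resolution. Here is a small Python check (the function name is made up); note that it compares raw pixel counts only, and compressed file sizes will not follow the ratio exactly.

```python
def pixel_ratio(low_dpi, high_dpi):
    # Fraction of pixels left when the same page is scanned at the lower resolution.
    return (low_dpi / high_dpi) ** 2

print(pixel_ratio(75, 300))   # 0.0625, exactly 1/16
ratio_72 = pixel_ratio(72, 300)
print(round(ratio_72, 4))     # 0.0576, a bit under 1/17
```

So 72 dpi is actually slightly smaller than 1/16th of the 300 dpi pixel count; 75 dpi is the resolution that gives exactly 1/16th.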
Also if the scanner is set low res pure black and white it will scan a lot faster, but still be pretty slow.
The other option is to pay someone to do it. If you have all of the stuff ready at once and give the pros a week or so to do it when they aren't busy you can probably get as low as 50 cents a page.
Blah blah, I lost my train of thought 2 paragraphs ago
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
its all about a hp network scanner ... i have been using one for quite some time now scanning in magazines and dnd books etc. create multipart pdfs, puts them on a network share etc.
members are seeing something, your seeing an ad
on a side note, if the professors are utilizing a lot of additional material which might include handwritten information, you might consider encouraging them to transcribe that material (hopefully you're not the TA that has to do the transcription) into a digital form, be it text or WORD. this'll definitely help in reducing the size of your files.
also consider looking into adobe's pdf service, if you're overwhelmed with just organizing the material itself. probably not so kosher to suggest it on /. but it could be something the school already has an agreement with adobe for (taking into account the units of acrobat the school itself might be using). i know it's not rolling your own, but sometimes using an "out of the box" solution to get things up and running so you can explore other solutions has its merit as well...
three can keep a secret, if two are dead - benjamin franklin
It makes no sense at all to me to have a PDF created of handwritten notes, since most students will probably just download and print out the PDF anyway. The only advantage is that it may save a few trees, since not everyone will print them out.
It sounds like the school wants to shift the production costs (i.e. printing) to the students. This seems inefficient compared with the old way, where the instructor could go to the copy center and have the notes copied at the school's expense (I know these expenses are often passed along to the students anyway), rather than at the students' DIRECT expense of their time for downloading, then printing out on their own equipment or using their own printing accounts at the computer center.
If the notes were being OCR'd and then made available on-line, or post-processed in such a fashion that they were searchable, indexed, etc., it would be useful. Otherwise this seems like a waste of time and money.
-MS2k
Scanned-in images of handwritten stuff that has been produced by clever lecturer types are usually very difficult to read. The Engineering department where I study puts all its outline solutions in single-page .PDFs produced from scanned images. It takes ages to print them out one at a time and the quality is so bad they are almost not worth having.
Getting someone to spend some time converting the documents to LaTeX makes for very easy to read and editable output. Yes, it would take ages, but it can be updated and errors corrected. It also produces something that is much, much more useful to me, the student.
Ian
The company I work at scans large amounts of documents to PDF format on a daily basis. Depending on the volume some people do, we use either a Canon DR-3060 or DR-5020 document scanner. These will scan both sides of a page simultaneously, clean up the image (despeckle and deskew) and convert them into TIF or PDF all on the fly. They're fast too. Between 20 and 50 pages per minute. Only problem is that they're expensive.
For your budget, you may be able to afford the Canon DR-2080C which goes for around $600. It has all the features of the more expensive ones, but it's meant for smaller volumes like what you're dealing with. With that, you'd be able to scan 100 pages into a pdf document in around 5 minutes.
charge by the hour, at least 50 dollars an hour. That way you can hire 3 students at 10 bucks an hour to do the actual work.
The Kruger Dunning explains most post on
So, I recommend scanning to TIFF (or TIFF inside PDF). Even if you don't currently have the encoding software, you can convert to JBIG2 compression later as it becomes more and more ubiquitous in the future.
And definitely use a automated document feeder of some sort to keep from going crazy. Newer Xerox machines work pretty well for this (I use a DocumentCentre 440ST for this all the time) unless you have hundreds of thousands of pages to deal with, in which case you should either invest in industrial scanning equipment or outsource to a scanning center that does.
Well, if it was compatible with Mac OS X I would think the ScanSnap would be the thing to get. I've been wanting something just like this myself and I use an iBook, but unfortunately this is one of those reasons why people say "windows is better for business".
Anyway, if you could find someone with a PC I think the ScanSnap would be what you're looking for. It's $300-$400, scans directly to PDF, scans both sides at the same time, sheet fed, etc. Here's the URL... http://www.scansnap.com
SOMEBODY PLEASE WRITE SOME MAC OS X SOFTWARE FOR THIS BEAUTIFUL ADF SCANNER!
The script is here.
The parent's link didn't work because freecache only caches files larger than 5 MB, while that is ~1 KB.
Is there any way to use sane (or any other Linux scanning software) to scan over a network? I've got my printer shared, and I access it via Samba. Is there an equivalent setup for scanners, perhaps using DaemonTools?
I think what your professor wants is not a bitmapped copy of his handwritten notes or some vector curves that resembles such, but actually a typeset version of the lecture notes. If that is the case, assuming that his handwritten notes are sparse (and hopefully without diagrams, since it takes more time to mess around with them), you can definitely do a stack of 100 sheets in a week, or, as someone already suggested, hire some typists to help you out.
I once had a signature.
Michael isn't "a little" anything.
Sponsored by Intel Corporation, I run one of the Grand Challenge teams, Team Overbot. We have a vehicle (a modified six wheel drive Polaris Ranger), a shop in Redwood City, funding, equipment, and people. We're well along; the vehicle has most of its actuators and some of the sensors working, and about a third of the software is running. We're one of the five DARPA-accepted teams.
Many of us are Stanford alumni or students, but this is not a Stanford project.
Our basic technical approach is to build a rugged, reliable vehicle with conservative control strategies. Others may be faster, but we expect they'll get into trouble at high speed. Our top speed is 40MPH. The real problem with the Grand Challenge is not going fast on the easy parts; it's getting through the hard parts.
The 6WD chassis we're using is one of the most bump-tolerant platforms around. It can go over railroad ties at top speed without problems and without going airborne. The center of gravity is low. The front and mid axles have independent suspension; the rear axle is a swing arm. This simplifies low-level vehicle control. All wheels can be driven, although at higher speeds, we will switch from 6WD to 4WD.
We have five computers on board. Three are small PC/104 machines, and two are Pentium 4 machines. All run QNX (the OS for when it has to work.) All are industrial-strength ruggedized units. The actuators are all servomotors driven by industrial microcontrollers. All this hardware is off-the-shelf industrial control gear.
Sensors include LIDAR, doppler RADAR, sonars, cameras, INS, GPS, etc. Some of them are used in unusual ways. That's all I'll say about that.
The pathfinding strategy is indeed borrowed from video game technology. It's more structured than Brooks-type behavior based robotics, and it's less structured than Latombe-type planning. There are three layers of control; the top one we call the "back seat driver", because it has only advisory authority over the "driver".
We have road map and topo data onboard, but it's used more as a hint than as rigid guidance. We take the waypoints DARPA gives us (on a CD, at 0430 hrs the morning of the race) and load it in. There's no offline preplanning. Wouldn't help in the real world.
If nobody wins this year, which is quite likely, we'll be back next year with a faster vehicle.
Post questions and I'll answer them here.
John Fagogle
Team Fuckbot
If you want to do a good job, you have to type it, in LaTeX. It's the only way to get something nice and something the professors will be able to enhance in future.
If a digitized copy of the manuscripts will do for you, you can go the scan -> image enhancement -> OCR -> save to PDF way.
For scanning, you already got a lot of good comments on how to automate the scanning of dozens of scripts. If you lack these possibilities, a SCSI or USB desktop scanner should also do the job (it's definitely less than 1 min per page), so you can scan a script in 2 hours. No need to bother to outsource the job to India. You can probably scan B/W and don't need greyscale or color. I would scan handwritten scripts at 200 DPI and save the whole page images in front of the OCRed text, so the user doesn't see the OCRed text and can only use it for selecting and copy&paste. It would be too much work to correct the OCRed text here. For machine-written text I would use 300 DPI or more for better OCRing.
For image enhancement you only need to be able to automatically orient the page so that the text is horizontal. I don't remember if Acrobat does it, but for this job I would get a good OCR program anyhow.
For the OCR program I recommend FineReader, but Omnipage is also OK. FineReader does better OCR than Omnipage and Acrobat. It also saves better to PDF (retaining all of the paragraph structure) than Omnipage.
If you keep the image in front of the OCRed text in the PDF you can expect files of 10MB for 100 pages for a B/W scan at 200 DPI. OCRing of machine-written text has become incredibly accurate, so you can do real OCR there and throw away the bitmap picture. This of course gives much nicer output (and smaller file size), but you need to spend a lot of time correcting the text. Here the best OCR program really pays off (you probably have a lot of words which are not in a dictionary, need custom dictionaries (does Acrobat have them?), ...). A program with a single flaw (e.g. one that recognizes your formula as text, or code as paragraph text, ...) will make you waste a lot of time correcting it on every second page.
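As a rough sanity check on that 10MB figure, here is a small sketch of the arithmetic. The US Letter page size (8.5 x 11 in) is my assumption, not something given above, and the function name is made up for illustration.

```python
# Rough size estimate for 1-bit (B/W) scans.
# Assumption: US Letter pages, 8.5 x 11 inches.
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0

def raw_bw_bytes(dpi, pages=1):
    """Uncompressed size in bytes of 1-bit scans at the given resolution."""
    width_px = round(PAGE_W_IN * dpi)
    height_px = round(PAGE_H_IN * dpi)
    # One bit per pixel, eight pixels per byte.
    return width_px * height_px * pages // 8

raw = raw_bw_bytes(200, pages=100)
print(raw)                # uncompressed bytes for 100 pages
print(raw / 10_000_000)   # compression ratio implied by a 10 MB file
```

Under those assumptions the raw bitmaps come to about 47 MB, so a 10MB PDF would need only about 4.7:1 compression, which does not seem unreasonable for dense handwriting.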
If you're going to be doing this often I suggest that you look at Ascent Capture (http://www.kofax.com/) with one of the supported high speed scanners. I support about 10-15 users on Fujitsu M-4097 scanners which will do about 50 pages per minute (duplex simplex is 26) and they're quite reliable. My guess is that they're about $8000 (US) though, I'm not generally involved in the money side of things.
The answer is simple. You are at a university. MOST modern photocopiers do inbuilt pdf conversion or OCR'ing to network drives or email. Find one of them.
Some company (I believe it was called D-Info) did that a couple of years ago: they were the first to offer a phone book for all of Germany on a CD-ROM. They took the info from the hard-copy phone books (which are, or rather used to be, a public record and thus not copyrighted) -- but they were not allowed to scan it, because layout and such *was* copyrighted (which was made rather clear by the Deutsche Telekom, or, back then, the Deutsche Post).
So, what did they do? They had a couple of hundred Chinese (mostly) women typing away for some months. It seemed to have worked quite well, until the Telekom started to release their own CD-ROM.
I have discovered a truly remarkable sig which this 120 chars is too small to contain.
1. Get Dragon Naturally speaking.
2. Dictate the essay, albeit a bit lengthy, into it.
3. Import to Word or your favorite word processor.
4. Add any cool equations and such that you cannot dictate.
5. Publish to PDF.
Nice small file size I'm sure.
Scanning is nice, but it only works with fonts it can recognize. Not Professorese.
It could take you a day or so to dictate, but after you're finished, more than likely you will have a lot fewer spelling and random letter and symbol problems.
But again, this might be more work than you want to do. Why? Well, if you do it this way, make a nice clean portable document that everyone can read, you might find yourself getting more "extra work" than you wanted.
They no longer make it but they can be found on ebay for a few hundred bucks and no I am not selling this one or one at all.
No one has mentioned what I consider to be the most practical solution. Using a physical fax machine, a virtual efax account, and Adobe Acrobat, you can convert mass paper docs into digital images.
Algorithm below...
0. Sign up for an e-fax account
1. Find a standard cheap fax machine with auto-feeder
2. Send the documents to 'yourself' via efax. Phone/PSTN -> digital image
3. receive the document as a fax image in your email
4. Install Adobe Acrobat, so you can print as PDF
5. Using efax's client, print the faxes as a PDF
Done
I work for a company that develops Document Imaging & Delivery software & enterprise solutions for the mortgage industry.
We recommend the use of Fujitsu M4099D or Kodak i810 scanners for large scale scanning jobs.
However you probably don't need the 50,000+ pages per day scan rates that we spec systems for.
For you I think I would recommend the Fujitsu 4750 or Kodak i80 scanners.
Now if you are interested in our software, contact me off-forum and I'll put you in touch with our sales guy. ( dchubb@virpack.com )
a) Use a service to scan, organize, store
b) Use an undergrad student to scan, organize, store
c) Use a grad student to scan, organize, store
d) Use a post-doc to scan, organize, store
I guess it has to do with the goals of the project. It looks to me like the current approach is low effort for low value. I assume that after scanning they will have to attach some meta-data to the files and perhaps have them reviewed for legibility and make corrections, which brings it to medium effort for medium value.
I figure if you are going to do medium effort anyway you might as well shoot for high value. I see it as medium effort for someone to transcribe their own notes of a subject they understand completely and it returns high value.
Perhaps it can't realistically be done with existing work but if the people paying the professors want to get more out of their investment then they might introduce it as part of the process and include it as part of the deliverables as it were. I understand that universities are as political if not more so than corporations and that what I have suggested may not always work either place.
Can I bum a sig?
Have it professionally done, like other people here have recommended. High-end sheetfed scanners are great, but you probably can't afford one, and it wouldn't make sense as a one-time expense for this small of a job. I'm a big fan of just handing someone some money and it's magically accomplished.
Alternatively, use a digital camera and well-lit copy stand. You can improvise a copy stand with a tripod or whatever, but make sure you have a lot of light. It's a lot faster than using a scanner, and the results are acceptable if you have a good camera. The more megapixels the better - don't use the old 1.3mp one you have lying around. 3mp will technically work, but more is better. Ideally a digital SLR pointed straight down at the page, a very well-lit area (a clamp light on either side of the page works nicely), and you sitting there sipping Starbucks while you hit a cable shutter release after you flip every page. You could get a few hundred pages an hour done this way--your only limitation is how fast you can turn the pages. You'd only have to stop to transfer images to your computer, and you only have to do that often if you don't have enough memory cards. After you get all the pages into the computer, feed them into Acrobat and you're done.
If you don't want to use Acrobat you could make a web page with thumbnails linked to the hi-res images. Then your end-users wouldn't need to download the Acrobat reader. I love Acrobat's ubiquity but hate the file sizes and the slow start-up time.
I scan and upload various land use and financial documents for a county and its townships to the internet on a shoe-string budget - actually, no budget - all volunteer, public service for fellow citizens. This is my prescription:
Stay with your current flat-bed scanner. Do not waste money on a sheet-fed scanner. You do not have nearly enough money for a high-end Fujitsu or Bell & Howell sheet-fed scanner which will reliably get the job done without mechanically screwing up. The pros use high-end scanners because they never screw up and they go fast. Cheap sheet-fed scanners miss sheets or jam up too often to trust them with anything. Make a sign-up sheet for work-study or volunteer students in your academic department to sit down at your computer and scanner and scan the documents into the computer. Give them free pops and gummy bears (slur it so it sounds like "rum & beers") or something similar which won't transfer from fingers to documents. Just take a few minutes to set them up and show them what to do. Keep it simple. Let those empty minds waiting to be filled with knowledge (and beer) do the time consuming zombie work. You should focus your attention on how to put the files on the website.
The scan file format I use is Portable Network Graphics format or PNG format. On average, it compresses black and white graphics 20-25 percent smaller than the widely used GIF format. PNG format is also supported to a basic enough level to be displayed using MS Internet Explorer, Netscape, Mozilla, and other internet browsers.
I use free Xsane scanning software on a linux system to scan the documents. Xsane can be set to scan in line-art mode, also known as black and white mode. This software can also be set to save files directly to disk in PNG format and automatically change the file names using numerical iteration, i.e., file-01.png, file-02.png, file-03.png, etc. without the need for human intervention to change the file name each time. I use a 100 dpi scan resolution setting because documents do not need to look ultra-smooth; they just have to be legible. Anything beyond that is a waste of hard drive space. Using this resolution also means I do not have to spend time embedding the graphic file in html code to constrain its width so it can be viewed on the average 15", 800x600 resolution monitor. I just insert weblinks to the individual, one-page graphic files: "Page 1, 2, 3, 4, ...", with each page number hyperlinked to a corresponding graphic file. Your graphic files will run 15-25kb each. The use of PDF graphics format is a waste of time and space unless a professor gives you a MS Word file of their lecture notes which you can convert directly into a PDF file with embedded text. That is the only case in which I would use PDF over PNG. Good luck.
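For anyone who would rather generate the "Page 1, 2, 3, 4, ..." links than type them, here is a minimal sketch. It assumes the files follow the file-01.png, file-02.png naming mentioned in this post; the function name and the exact HTML are illustrative, not an actual script from this setup.

```python
def page_index(prefix, page_count, ext="png"):
    """Return an HTML fragment 'Page 1, 2, 3, ...' with each number linked to its scan."""
    links = [
        # Two-digit zero padding matches names like file-01.png.
        '<a href="{0}-{1:02d}.{2}">{1}</a>'.format(prefix, n, ext)
        for n in range(1, page_count + 1)
    ]
    return "Page " + ", ".join(links)

print(page_index("file", 3))
```

The fragment can be pasted into whatever course page already exists; nothing else about the page needs to change.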
I work for a scanning service, we do hundreds of boxes of paper at a time. I can only speak to the really high volume scanners, and Kodak is the best, hands down. Most of them are actually two scanners in one, one for the front and one for the back. An i830 will do about 80 ppm both sides at 200 dpi. That's a $64K scanner. I think the Slashdotters gave you a pretty good list of less expensive models. If it's all clean stacks of mostly white paper a less expensive scanner will do well. More expensive scanners are not only faster, but recognize contrast better (even black writing on pink paper) and feed torn, curled or irregular paper better.
Equally important is good scanning software that will break up your images into the files you want. Software can read bar coded sheets of paper to start new documents, or to index them in a database. Legal firms use software to match a database to the Bates stamps that ID files submitted in discovery. Software can remove the black border around the page, de-skew the image, or de-speckle the image.
As for PDF, it's a great choice compared to a lot of formats. And with Acrobat, you can do bulk conversion of multi-page tiffs to PDF in one pass.
A service bureau will do the job for less than 10 cents a page.
My throw-away email account is bugmenot@fastmail.us in case of questions.
Hope this helps
There's a little-known program that's pretty good for this job called Acrobat Capture - it uses ISIS-compatible scanners.
Since it's handwritten notes and would be hard to OCR anyway (and the files tend to be huge), how about using a digital camera and taking snapshots of these pages?
my blog
It sounds like they want something where they just feed a couple of scraps into a machine and everything is typed up for them (no secretary needed). Use one of the profs who seems cool as a test pilot and see if you can get him into the tech spirit and get everything input directly on a touch screen / stylus setup on his desk??
;) That way there's no money involved and the profs are working for you. They will like your idea!!
This would also show him up front what the system can and can't recognize, so it can be fixed immediately, before he decides to "submit" the page into the main file system / backup.
It would take a little more time if you want to do it right and write a program from scratch (get marks for it?), but a good idea would be to get both sides to come to a happy medium (prof vs computer): make it so it weans him off his exaggeratedly bad penmanship, and put in a couple of hours designing a decent UI for formulas that is variable-friendly (lambda, pi, etc.) and will understand on which side of the divisor everything belongs. It would even be kinda cool if it could solve the problems too, as it went along.
Because I am not sure what program you are in, this may or may not be feasible.
An alternative would be to ask your profs if they are buddies with the math/comp sci profs, who could donate some eager brains for possible standing within the course.
Good Luck
A loop, by its nature, continues. If that didn't make sense, start reading this sentence again.
How about a 4 megapixel digicam, 1 click and it's 'scanned' very fast, and easy!
Two inexpensive ideas that come to mind:
1. Find the best fax machine in the establishment. They usually have sheet-feeders and are fairly quick.
Fax the document to your OSX-enabled powerbook. You should then have every page in individual .tiff files. Write an applescript to automate GraphicConverter to batch-process the images.
2. (This is a method I've used, and it works fairly well, although it requires some manual labor).
Find a music stand, a tripod, and a digital camera. Aim the camera at the music stand, and take pictures of each side of the pages as needed. Depending on the writing, you can take the pictures at one megapixel.
Again, Applescript, GraphicConverter to adjust the whites and convert the files...
HTH
If you could scan & post (with lossless compression eg. GIF) a couple of vaguely typical pages then we could all try our favorite compression software on it & get some idea of how much storage could be saved.
I looked into this once for a client. Agencies charge around 5c a page but that is only to scan. Add more for OCR, manual verification and/or transfer to M$ Word or what-have-you. I think I recall seeing 50c a page for such value-adds. Agencies are good because you don't need to buy the kit (30K and up) or watch it run (they need feeding and jam quite a lot, especially if the paper is lower quality). Agencies also make sense for shops with nil/low expectation of producing more paper in the future. Get some quotes, references and examples of their work and start with a short trial run.
I wish at was Friday, but I dont want to wish my life away. So I wish it was last Friday.
PJ & Co. at Groklaw are faced with an easier problem and the best solution they have is to OCR what they can and then have individual volunteers fix the stuff that the OCR process misses. I say they have an easier problem because they are getting "published" court documents that have been scanned in as graphics. For that matter, you could also technically do what the court is doing and simply scan the notes as graphics and publish them that way as PDFs. That is, don't even try to convert them to text.
They that can give up essential liberty to obtain a little temporary safety deserve neither safety nor liberty.
Ben
We've undertaken a pretty large archiving job at my university. We're scanning every page of every newspaper we've ever printed (started in 1927) up to the time we have digital archives starting around 1993 or so. We're also scanning about 80 300 page yearbooks. Hopefully this can offer you some help or suggestions.
We have a dual-processor G4 and an Epson 1640XL large-format FireWire scanner with the optional auto document feeder. It's probably a bit out of your budget ($2899 + ~$1200 for the ADF) but it's awesome. It can scan at up to 1600dpi and the ADF can automatically duplex and scan both sides of the page. We're using OmniPage Pro X for OCR software.
Right now we're more concerned with scanning the documents and getting them online, so we haven't started OCR'ing everything yet. But the ADF is awesome. It can scan both sides of all 300+ pages of a yearbook automatically in about 2 1/2 hours.
The newspapers are a bit different. They're getting a bit fragile in their old age so we have to manually scan them. We scan them at 300dpi in full color, so the 12x18 pages are around 50MB per page. But the scanner takes less than a minute per page. It's impressive.
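For anyone wondering where a number like 50MB per page comes from, here is a quick check of the arithmetic. It assumes uncompressed 24-bit RGB (3 bytes per pixel), which is my guess at what "full color" means here; the function name is made up.

```python
def raw_color_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed scan size in bytes, assuming 24-bit RGB by default."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel

size = raw_color_bytes(12, 18, 300)
print(size)           # bytes for one 12x18 inch page at 300 dpi
print(size / 2**20)   # the same figure in MiB
```

That works out to roughly 58 million bytes per page before any compression, which is in the same ballpark as the figure above.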
We use Photoshop's web gallery feature to generate the image galleries. Pretty simple really. Let me know if you have any questions.
Seems a reasonable comment to me. should never have been modded down.
Get an efax account and fax them to yourself. They'll arrive as multi-page TIFFs. Quality is very good. Sign up for a free efax account to test: efax.com
Hopefully you'll get to read this one and hopefully it won't get modded down to oblivion.
Yes there are scanners out there that can work for you. I have a Canon DR-5020 which we just feed it a ton of paper and come back in a few and it's done. It can scan VERY quickly. PDF format would work just fine as well. It's the best option especially since it's hand written notes as well.
If this is a requirement which is going to be on-going then you will have to pony up the money and spend a few thousand. If you're not ready to do that, you may be in luck. Some places will lease it out to you and with that few hundred bucks I'm sure you can easily get a hold of one for about a week or 2.
Look up people who do Document Imaging, and you should find a lot of businesses that come up. If you're in the Washington DC area then maybe I can help you out quite a bit.
we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs - and had a lot of luck sending them to firms in india, either by sending the documents snailmail or scanning with a sheet feeder and ftp'ing. the firm we got the best results from was called suntec, suntecindia.com I believe. I know outsourcing is a touchy subject these days, but they were all set up for this, we weren't, and their prices were quite good.
I've had to do this exact same thing for lawyers, and had to advise them on how to do it for documents and briefs that were hundreds of pages long.
:)
Basically, they have form fed scanners that can handle hundreds of pages in just a few minutes. If you use these scanners in conjunction with a good text recognition software (like Textbridge for example), you can convert them into plain text docs.
HOWEVER... they must exist in some typed up version first. It cannot recognize handwriting. It CAN recognize typed script. But nowadays, most of those typed documents are done in word processors and do not need to be converted.
Save some time and teach them how to use a word processor. Might I suggest Open Office?
This is my sig. There are many like it but this one is mine.
I work with this equipment (Xerox DigiPath and Docutech) every day, along with a lot of other digital printing software/copiers/printers.
Xerox's DigiPath can scan in all those documents and create DigiPath TIFFs (which GhostScript handles quite nicely), PDFs (very high quality), and regular PostScript files.
DigiPath contains a program called Scan & Make Ready. All of those documents can be stored, converted, printed, whatever.
I currently work with thousands of jobs that have been converted to this format, it is ideal for black and white document storage.
Find a digital print shop in your area, get a quote for the conversion, it should be reasonable.
Another poster suggested just FAXing the documents; this too is a great FREE way. Using a fax and a fax server linked to GhostScript, you'd be able to accomplish the same results.
Using the DigiPath, you can make changes to the pages, and other things.
I do not work for Xerox. I work for a digital printing company that uses Xerox color and black/white production equipment. I have scanned and converted many documents in my time, doing the same thing you want to.
Good Luck!
This desktop scanner is very fast to scan and fast to xfer (ala USB2) -- i recommend it.
I do conversion services on projects of 5,000,000+ pages for government, medical and financial industries. What you are describing is not "large scale" conversion. Working two hours to scan 100 pages for an instructor is not too bad with a flat bed, but if you have the budget to purchase a new scanner I think the one you cited will work fine. We use Fujitsu and Bell & Howell, but those are for production environments with 40,000+ pages a day.
An idiot above suggested you pay to have them professionally scanned. That is a bad idea, the cost would probably exceed that of a new scanner.
A lot of people don't like to download adobe's software so you should provide the documents in two formats. Stick with PDF and also do GIF.
I work in the library at the University that I go to. We have something called "E-Reserves" where a professor can submit a bunch of documents that they want available for students to download and view on their computer. We recently set up a really neat system consisting of a sheet-fed scanner, a piece of software from Doculex called "Gobe" in combination with some scripts that we wrote. Here's how it works:
1) Professor submits stack of documents
2) A person at the library makes sure all the copyright stuff is in order.
3) For each document, a piece of software that's part of Gobe is used to create a "cover sheet"
4) The documents are stacked on top of each other with a cover sheet on top of each
5) The stack is placed in a sheet-fed scanner
6) Hit go
7) ???
8) It's on the web in 10 minutes!
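Our scripts aren't something I can post, but the cover-sheet idea in steps 3-5 boils down to something like the sketch below. The data layout is hypothetical: each scanned page is a (cover_name, image) pair, where cover_name is None for an ordinary page and a file name for a cover sheet.

```python
def split_on_cover_sheets(pages):
    """Group a scanned stack into documents, starting a new one at each cover sheet.

    `pages` is a list of (cover_name, image) tuples; cover_name is None for
    ordinary pages. The tuple layout and names are hypothetical.
    """
    documents = {}
    current = None
    for cover_name, image in pages:
        if cover_name is not None:
            # A cover sheet starts a new document and is not itself kept.
            current = cover_name
            documents[current] = []
        elif current is not None:
            # An ordinary page joins the document opened by the last cover sheet.
            documents[current].append(image)
        # Pages before the first cover sheet are ignored in this sketch.
    return documents

stack = [("week1.pdf", None), (None, "p1"), (None, "p2"),
         ("week2.pdf", None), (None, "p3")]
print(split_on_cover_sheets(stack))
```

Each group would then be written out as one multi-page file; recognizing the cover sheet in the first place (bar code, patch code, whatever the software uses) is the part this sketch skips.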
I am PDFing lecture notes now. At my scientific publishing organization, we regularly scan our old journal articles, creating huge PDFs. But these are already well indexed in our databases and our website.
But lecture notes are teaching materials, which many people are investing a lot of time in using, more than the journal articles. These notes should be done right.
Lecture notes should have Tables of Contents, indexes, legibility, bookmarks, and so on. If teachers are teaching with them, the articles should be defined and linked from the ToCs and indexes. They should be typed, so files are smaller and legible. The math should be LaTeX or scanned and placed. All the artwork, equations, animations, and related files should be embedded in the PDF.
I needed to scan up to 50-page documents to PDF and found that the cheapest way to do it, and the most automatic, was a small Oki fax, with a fax-to-email feature. It's quick, self-sufficient (no PC needed), efficient (small PDF size) and Just Works. The cost was $1500 AU from a small reseller. If you can justify that for other jobs as well it's a good solution.
Outsource to India and the Philippines for less than $1/page
I should add, multi-page documents are placed into a single Adobe PDF document because the cover sheet separates the multi-page documents, and the cover sheet provides info such as what to name the file and where to put it. It's all very slick! No OCR though. :(
First, call MIT. Their Opencourseware is the largest such project, and certainly they have a mountain of useful information to pass along.
Second, don't do any cleanup yourself. If they can't give you electronic text (PDF, Word, etc.) then give them nothing but a PDF scan. If they don't want to take the time, have them pull the funds for transcription out of their budget.
If you have to scan, don't desktop scan. I have a small office with an automated copier/scanner. My unit probably has a total of 10 of these machines. They're cheap. Go find one you can use. They're 10x faster than a desktop machine and you can post-process the PDF.
If you have to transcribe, hire work study students. Where I am, they'd cost you about $3/hr. If you hire students in your programs, they'll learn something along the way.
Finally, if you are going to the effort, do yourself a favor and invest in a CMS (some are very inexpensive) and put the time in to semantically code your work. That way when they want to convert it to xyz or to change the presentation (how will you handle students with disabilities?) you can do it without too much effort.
Did you happen to check the University's library? They may have a large volume scanner that you can use. It's worth a shot, and they may even let you use it for free.
University libraries tend to have high-volume document scanners left over from specific projects that they may not use on a daily basis. They may have used it to, say, convert all of the theses they have on file to a digital form, or for a digital periodical archive or something like that.
I've been using that exact model HP scanner to scan a large number of documents into PDF's. For its price, it works pretty well. It takes about 10 - 15 seconds per page to scan. I've had a few jams, and I've found that they happen more after it's been running for a while, so when it starts jamming I take a break and let it cool down. The sheet feeder has a capacity of 35 pages, and filling it to capacity doesn't cause any jamming (more than normal).
Even though I'm not doing character recognition, I'm using OmniPage Pro 14. It has a batch mode to automate scanning, and it can also handle double sided pages by scanning all one side, then prompting you to put the paper back in the sheet-feeder upside down. OmniPage Pro can also read PDF files, so if I want to do OCR on the files I have scanned I can load the PDF's back into OmniPage at a later date.
If you can, try scanning the documents as black and white instead of grey scale. They'll be much smaller. I think I'm averaging about 40k per page for black and white scanned documents.
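To put those numbers together for a typical 100-page stack, here is a back-of-the-envelope sketch using my figures (10-15 seconds per page, a 35-sheet feeder, about 40k per page). The function name is made up, and it ignores reload time, jams and cool-down breaks.

```python
import math

def batch_estimate(pages, sec_per_page, feeder_capacity=35, kb_per_page=40):
    """Return (feeder loads, scan minutes, total size in KB) for one stack."""
    loads = math.ceil(pages / feeder_capacity)   # how many times the feeder gets filled
    minutes = pages * sec_per_page / 60          # pure scanning time only
    return loads, minutes, pages * kb_per_page

for sec in (10, 15):
    print(batch_estimate(100, sec))
```

So a 100-page set means three feeder loads, somewhere around 17 to 25 minutes of scanning, and a file of about 4 MB in black and white.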
Perhaps the option which is more labor intensive (but not for you) is to tell the prof's to assign a few pages to each student in the course and ask that the students typeset their section. If there are difficulties about assigning that sort of thing, you can always make it optional and have students volunteer. Most probably will. Make sure everyone is using the same format - eg. latex with standard packages - and then just combine everything.
Sure, it's a lot of work. But at the end you'll have a beautiful set of notes, and they'll be easy to edit the next time the class is taught. It will also save you the trouble of either trying to get image->text converters to handle equations or the file size hit that comes with encoding everything as images.
I've seen this work beautifully in a single grad level technical course. Might be a lot more difficult in a general-ed class, where students have less invested in the material and may not even understand the text they're typesetting.
A very cheap, funky, but workable alternative is to use a fax machine and a faxmodem. Chances are there's a machine somewhere in the department which will take a stack of hundreds of pages. Just send them all to a computer and convert them to the compressed format of your choice. Our campus has a centralized system that does this. The result is an ugly, low-resolution mess, but it does work. The quality won't be anything close to what a professional typesetter would produce, but it has the advantage of being both free and ongoing.
- Munpfazy (rather thinks it ought to be the prof's job to typeset their own damn notes... but can see why that might not be easy to argue.)
Get a document feeder scanner. Lots of multifunction machines that do fax have these. I don't have the model number handy, but we have a MF machine at my office that supports scans from the doc-feeder. I scanned in ~60 pages with it one day last week. Just loaded them up, set the DPI, and went to lunch. When I got back, there was a pretty reasonably sized PDF.
As other people pointed out, if you can get a couple of departments in on this, then you can more easily amortize the costs of really good equipment to do this...
One thing that I'll note is that I don't really like PDFs for this sort of stuff. If you really have a 100 page article, you're going to be looking at a 3 meg file and, perhaps, a 30 second startup time... That's fine for someone who's going to read the document from cover to cover, or print it... On the other hand, it's a pain if you only want to look at pages 37 and 38.
GrokLaw gets PDFs of court filings regularly, and I got so fed up with PDF's that I created a (semi-automated) batch system to split up the PDF's into separate PNG images and create a simple index.
You can see a sample here. Far easier to view a page or two there (IMNSHO) -- but not as easy if you just want to download and print it.
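This is not the batch system itself, just a minimal sketch of the splitting step for anyone who wants to try the same thing. It swaps in Ghostscript for the rendering, assumes the `gs` binary is installed, and the file names and function names are made up.

```python
import subprocess

def ghostscript_split_command(pdf_path, out_prefix, dpi=100):
    """Build a Ghostscript command that renders each PDF page to its own PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",                               # 24-bit colour PNG output
        "-r{}".format(dpi),                              # rendering resolution in dpi
        "-sOutputFile={}-%03d.png".format(out_prefix),   # %03d becomes the page number
        pdf_path,
    ]

def split_pdf(pdf_path, out_prefix, dpi=100):
    """Run the command; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ghostscript_split_command(pdf_path, out_prefix, dpi), check=True)

print(" ".join(ghostscript_split_command("filing.pdf", "page")))
```

Calling split_pdf("filing.pdf", "page") would leave page-001.png, page-002.png and so on in the current directory, and a simple index is then just a list of links to those files.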
Before you go too far, you might want to get a good handle on how people are likely to use what you produce -- Use that knowledge to decide just how you want to organize the result. You may want to make it available in two (or more) different formats. It's not that difficult to bulk convert things between different forms (at least, not if you can dual boot into Linux, or have OS/X).
Sometimes boldness is in fashion. Sometimes only the brave will be bold.
Try to fax it with a fax machine to a computer with a fax-modem.
The Canon 2080c does double sided well - you can set it to discard pages with less than a given % of black, so you can toss in a mixture of single & double sided pages & the software tosses out the blanks. I've been happy with them for the larger scanning jobs; they can output multipage tifs or pdfs, rotate the image, etc. They're closer to $500-600 though, I think...
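The blank-page trick is easy to reason about if you think of each scanned side as a 1-bit bitmap. The sketch below is only an illustration of the idea, not Canon's software: the page is assumed to be a list of rows of 0 (white) and 1 (black), and the 0.5% threshold is an arbitrary guess you would have to tune against real scans.

```python
def black_fraction(page):
    """Fraction of black pixels in a 1-bit page (rows of 0 = white, 1 = black)."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in page)
    return black / total


def drop_blank_pages(pages, threshold=0.005):
    """Keep only pages whose black coverage is at least `threshold`.

    The default of 0.5% is an assumption for illustration; dust and
    show-through from the other side need some tolerance, but faint
    pencil notes argue for keeping the threshold low.
    """
    return [page for page in pages if black_fraction(page) >= threshold]
```

With handwritten notes it is probably safer to err toward keeping a few blank sides than toward silently dropping a lightly written one.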
As some of you know, I hit the road over a year ago, but my wife couldn't bear to leave her favorite recipes behind, so we set our digital camera up on a tripod, with some rulers stuck to the table to stop sheets from mis-aligning and just photographed the lot. It works a treat!
Use the fairly user-friendly LyX to do the LaTeX-ing.
Heck, get the academics themselves using it to prepare their notes in the first place!
They might actually thank you for introducing them to this convenient and easy document processor.
Important info:
http://www.lifeaftertheoilcrash.net
http://dieoff.org/synopsis.htm
http://www.peakoil.net
I used to work for Epson, and we had high-end scanners ($1000+) that would take SCSI, Firewire, Ethernet, or Parallel Port if you wanted. I'd personally prefer Ethernet, because you can put the machine anywhere. Firewire's nice for the local machine.
Everyone seems to be attacking this with technological solutions. I say return the papers to the profs in question and tell them you will be glad to put up anything that is submitted in digital format. (Read: not on paper)
The time-suck that this would represent would be enormous and IMHO any director would understand why undertaking this project in this fashion would be ridiculous.
I think this guy needs a scanner like the Xerox DocuMate 510. It's only $350 and pretty much does what he needs. At 10 pages per minute it's definitely not a speed demon, but it doesn't really sound like he's going to be scanning tens of thousands of documents anyway. The only pain is going to be dealing with those double-sided pages.
I think this scanner plus a summer temp should be enough to get all those prof notes scanned and organized. Definitely not an unrealistic request.
Lecture notes on photocopy paper sounds like a way for the prof to make even more cash off the student. I'd ask him for more money for the troubles. He is going to make his crappy ass hand written notes into a supplement book and make students buy it for like 15, 20 bucks. I think the prof should spend some quality time typing his crap up if he wants to make money off it. I mean, some lecture notes are good, but when you're in lecture class, don't you take notes?
As for mass scanning, I would definitely take it to the pros.
Mark
I know your target audience can't view them, so they aren't an end solution, but JPEG2000 would be very appropriate for storing the initial scanned images. It performs exceptionally well for compressing things like handwriting (lots of distinct changes) that aren't 'sharp' enough for great PNG/GIF compression.
As long as you're not expecting to OCR the professor's handwriting you should be able to do this task fairly quickly. Under an hour after hardware is configured. This Microtek scanner http://tinyurl.com/ys5dk is a good value. I have had terrible experiences http://tinyurl.com/2zsor with HP document feeders. If you have access to Microsoft Office XP Professional or later you could use the Microsoft Office Document Scanning program to easily scan all pages into one large multi-page TIFF document. In turn, the TIFF document can be processed into PDF files with the free Paper Capture plugin for Adobe Acrobat (not sure if the paper capture plugin is available on Mac). There are probably Open Source software tools available to do the above mentioned process, anyone care to chime in? Anyone care to chime in and suggest Mac software to accomplish the same? Scanning at 1 bit 300dpi would be ideal (for speed and final doc size) but if the professor's notes are in varying shades you may be required to scan at 8bit. I know this is easy because last month I used my Epson Perfection 2400 scanner with my 1.8GHz Windows/Office XP laptop to scan 130 pages in 50 minutes. The result was a multi-page TIFF file that was easily converted to PDF. This was accomplished without an automatic document feeder. I also was using a USB 2.0 connection which I believe helped speed things up. Good luck.
1. make sure you use a good sheet feeder.
2. that's all.
The optimal choice of compression is not important. Does it matter if one format takes twice as much place as another format? 100 pages will fit on a CD.
Scanning time : if a few thousand * 20 seconds is acceptable, then i wouldn't bother too much about that either.
OCR: only consider it for indexing typed pages, if at all.
I don't know if they're still doing it though.
But, for labor intensive tasks, just outsource it.
It's not a classified / top secret document anyway, right?
You can work on more productive things.
my 2 cents
I once did almost exactly the job you have to do. I put all the paper into a sheet-fed scanner (a friend of mine in "another" department had one), got JPEGs at 300dpi resolution and burned them all onto a CD. Then I ran them all through OCR on my PC and finally through translation software (Promt). Later on I would eyeball the translation and correct it manually, but you do not need this step at all as you do not need any translation.
The whole setup worked just fine for me. Well, if I had no friend with a scanner solution I would probably just buy one myself and use document management software. My favorite is FineReader; a Macintosh version is also available.
PDF is good if you want to package the images as books, but I believe JPEGs can be processed on almost any system. We actually used these HP digital senders, but not that much.
...a stunned silence fell upon the hall.
As a student I can honestly say even when they're all scanned and everything they're still rubbish. My advice is to take those hundred dollars and get someone to type them for you..or do it yourself..once you get to know latex it's pretty quick for laying out equations and stuff, only scan in diagrams etc and include them in your latex document in eps format.
For $5/hour I'd happily translate some notes into latex; maybe you could approach the students on the course to see if they'd be willing to do the donkey work.
Seriously, the last thing the students need is for their professor to think he's fulfilled the task of putting the course notes on the web when all that's up is some scans of barely legible scrawls.
I have discovered a truly remarkable sig which this post is too small to contain.
Frankly, I've seen professors' handwritten lecture notes, and 90% of them add nothing to the educational process. Certainly not more than a quick note saying, "Read sections 2.1, 2.2, and 2.4, paying special attention to least-squares curve fitting and finding orthonormal bases." They're generally disorganized and difficult to follow because they usually take a lot of material for granted when they write.
The mere fact that it's handwritten means that it's basically a rough draft that was hastily flung together. Send them back to him, and have him type them in and rework them until he figures they're worth recycling for next semester. The prof will save time in the long run, and the students will have something nice, clean, and organized to peruse.
You want the truthiness? You can't handle the truthiness!
Dig out the campus directory and look under "secretary pool" for someone that can type in the messy text at speeds that may only be 1/2 the speed of a cheap scanner. They will most likely type it into Word (or maybe WordPerfect) but then if you convert it to LaTeX, you can add in all the nice formulas and figures in a way that they can be properly maintained.
Sorry, but 100+ pages isn't large scale. I work for a litigation support company, and we regularly get 10,000+ pages a week to scan, OCR, and load into a database. And then we also hire people to read through them all and do sorting and filtering. And yes, there are scanners that do what you want, but they cost a bundle. Good luck.
or just get the ftp storage and a DocuJob Converter, which converts DocuTech jobs to TIFF or PS, or just use a DigiPath instead.
Holy smokes, it would seem that almost no one, or no one in fact, knows what the hell they are talking about. Half the messages in here should have PaperPort in the subject line, but they don't. PaperPort has since 1998, and before that presumably, been the best paperless office solution. PC Magazine just named it the best document handling system again in their June 8, 2004 issue. Just get an expensive ADF from Visioneer, HP, Brother, or anyone else that provides integration (no, not TWAIN compliant drivers, but a scanner whose buttons and software will actually integrate with PaperPort 9. Make sure it says PaperPort 9, not 7, and not 8, that won't work very well with 9.). If you have to go with PaperPort 7 or 8 then I'd recommend downloading PDFCreator, an open source application from SF.net. Cheers, Christian Blackburn
I have a Brother MFC-8420; it's a $400 laser printer/sheet-fed copier/scanner. It comes with PaperPort software (which you need to upgrade for PDF capability). I use it to scan journal articles, notes and bills and it does a great job.
You absolutely aren't going to find a solution under $500 that doesn't turn you into a 10c/hour slave.
+--------------------- You idiot! I told you we were facing the wrong way!
Hardware for image acquisition:
Check to see if the department copy machine has scan functions... most built in the past few years do, even if they aren't used in most places for that. You'll get a decent sheet feeder and way faster scanning than most desktop sheet-fed scanners.
If you have to buy something and have to go *really* cheap, you could get a multi-function print / scan / fax thing. Most will handle legal size, because they're not actually moving the sheet fed paper onto the flatbed glass... the image element stays stationary while the paper goes by. But, of course, you get what you pay for... expect to spend time dealing with paperjams and skipped pages. However, it should be faster than hand-feeding a flatbed.
Software:
I mention this simply because nobody else has (that I've found): Scansoft Omnipage Pro is designed for highly repetitive, batch-oriented OCR. It has options for doing automated or hand-tweaked "area recognition" (separating text from graphics) and has the best proofreading UI I've seen... it flags "low confidence" recognitions automatically, and displays both its best dictionary guesses and the actual scanned words. Not sure it will help much with hand-written work, but for printed material it works well.
Format: Your primary concern when looking for a destination file format should be longevity... will the files be readable 5 years from now? I've seen a number of people recommending highly efficient but obscure compression schemes, which are a terrible idea if you want the data to stick around. Saving a few bits doesn't do you much good if you can't figure out what they mean. I recommend that people scan to two formats, just for safety (Omnipage can do this automatically).
-R
This scanner by xerox looks promising. Note it does not do duplex (two-sided scans).
http://www.pcmag.co.uk/Products/Hardware/1145964
for example. But HP used to make one too, I couldn't find it just now. It was pretty cool. You could set it up in a central area and let everyone get to it. Scan in your document, and the scanner would send you the results as an E-MAIL attachment. This technology REALLY should have replaced faxing by now.
Anyway, if you make the process easy enough, maybe those lazy professors will do it for themselves. They will for a while at least, 'till the new-toy effect wears off.
... if this job is "scan it straight to PDF", then the result will be huge, really eat bandwidth, and not be very useful to the students. It'll take forever to load.
On the other hand, if you want something fast, accurate, easy to use, and useful, then you have a job similar to what I and two others did -- at $15-$25 per page.
http://www.brookscole.com/cgi-brookscole/course_products_bc.pl?fid=M20b&product_isbn_issn=0534408427&discipline_number=13
Of course, when we first started jobs like this, the publisher specified MS Word 5.1a for Mac; and it took us 1/2 an hour per page ($11-$15/pg). Then they wanted it in HTML, so they specified MS Word 98. That jumped our time per page to 1.5 hours, and at $15 per page, we lost around $7000.
Then we changed it to Quark + Acrobat, with pieces available in Word (but no final prepublishing in Word), and that took us an hour per page at $25 per page. At that rate, we still went broke, but barely finished our contract, saved the publisher ~$100k by reducing the page count, and made an excellent study guide.
However, as of right now, we said that our next bid would have to be significantly higher ($70k-$100k), and the publisher decided they want to try someone else.
But you are right about the college professors not realizing what they were asking. That hour per page not only included layout, graphics, equations, and formatting. It included approximately 2-3 complete rewrites on the text, chapter after chapter, and sometimes I had to suggest the final wording, myself.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
The PDF capture thing doesn't work very well actually. You get those ugly documents that are a mixture of quasi-recognized text and bitmaps.
The documents look really ugly, and the faithfulness to the original is not so good.
DjVu manages to produce a fast-rendering, totally faithful image (with hidden searchable OCRed text) in smaller files than PDF's mixed format.
Go with DjVu, it's open source.
The files are about 5 times smaller than with PDF for black and white 300dpi scans, and 10 or 20 times smaller for color scans (nothing even comes close to DjVu for high-res document scans).
DjVu is open source (the decoders and viewers at least). There are open source compressors, but they are not very good for scanned docs. You are better off using the free conversion server (see http://any2djvu.djvuzone.org ), or the commercial app from LizardTech (there is a free download version).
-- Anonycous Moward.
That's a PERSONAL scanner. You need something like this
I'm telling you don't go for the cheap route for something like this. I've worked at several companies that generate and scan in thousands of invoices per person per day. They used some heavy leased Bell+Howell scanners with software called Documentum which provided a browser frontend to the invoices. Similar to what google does to PDF files. You could search text (even handwritten IIRC) and display the documents in your browser and print them.
I agree with the other posters, you don't want to try this with consumer products, just too many pages. Xerox is the way to go. I've used their DigiPath solution with their professional scanner and the main advantage is the some 100 page/minute speed, 600dpi, and it never jams or misses a sheet. Once it is scanned in, you can just export it to PDF using their software. It's the standard for big companies and you should be able to find them at Kinko's or at a professional print shop.
Fujitsu has one of those scanners with autofeed.
I saw one at work, but never used it.
Asking for the impossible is easy, delivering the impossible for just under 200 dollars is slightly more difficult.
You need to start by answering a simple question, probably together with your boss (or even letting him answer it for you): who works for whom?
If the students work for the teachers, then you can publish huge and illegible scans and let the students work them out.
If the teachers work for the students, then the teachers should deliver cleanly typed and formatted electronic documents, which you can turn into neat .pdfs for the students without any scanner at all.
If you work for the school, then the school should provide you with whatever means it reasonably takes, money- and timewise, to process the work you have, even if that means buying industrial scanners and exorbitantly-priced software for handwriting recognition, or sitting there for weeks, typing out the hand-written papers.
My point is: everybody wants to offload all their responsibilities on the admin, but that's surely not a reason for the admin to go along with that. If they want you to do the impossible, they should also pay you accordingly. Do they?
However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas?
- You do need more than a couple hundred dollars, but certainly not 10x--so maybe you can talk them up a bit.
- I have one of these on my desk connected to my Mac (see note) and am sure you would be pleased with its performance. CDW sells it for $999, and I've seen it offered by one of their partners (Scantastik if I recall correctly) for under $800.
- I think that this less expensive scanner might be just fine for what you want to do. CDW has it for $480, and the Fuji web page mentions a $100 rebate.
- The problem with the hugeness of PDFs relates to the graphics file format. You can embed graphics in PDF using more than 1 format, and much software defaults to JPEG. What you want for typed or handwritten pages (no color diagrams or photos) is 1-bit TIFF with CCITT Group 4 compression. That will easily get you back down to < 100K per page, often 20-30k per page at 300dpi.
Note: the fi-4120c does not come with a Mac driver; I wrote my own and it's not yet complete, thus not fit for distribution. In fact, you'll find that the kind of scanner you want is generally not supported on the Mac at all. So you definitely need to check into borrowing that Windows box.
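To put numbers on the file-size point above, here is a back-of-the-envelope sketch of what one letter-size page weighs before any compression at the bit depths people have mentioned. It is plain arithmetic, not a measurement of any particular scanner or PDF tool; row padding and file headers are ignored, and the function name is made up.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (no padding, no headers)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


LETTER = (8.5, 11.0)  # inches

for label, bits in (("1-bit", 1), ("8-bit gray", 8), ("24-bit color", 24)):
    size = raw_page_bytes(*LETTER, dpi=300, bits_per_pixel=bits)
    print("%-12s %10d bytes" % (label, size))
```

At 300dpi that works out to 1,051,875 bytes for 1-bit, 8,415,000 for 8-bit gray and 25,245,000 for 24-bit color. So the 20-30k per page quoted for Group 4 corresponds to roughly 35:1 compression on top of already starting 24 times smaller than a color scan, which is why the choice of bit depth matters more than the choice of container.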
Listen to this guy! The HP line 5500/7400/8200 is the cheapest with automatic document feeders, but I would not recommend it for daily usage.
Fujitsu's 4120c has a nice little ADF. You won't have a flatbed, but that shouldn't matter since you already have one. It's a bit pricier but it'll save you headaches. You might also want to look for the older Fujitsu 3091.
For personal level scanning we use Visioneer 9450's with Acrobat 6.0 Professional - I have found that Acrobat 6 makes files smaller if you are willing to sacrifice backwards compatibility. These work pretty well when people want to do maybe 5 batches of 5-20 sheets per day. A little spendy for many users, which leads me to our next method.
Each division has a Xerox DocuCenter 425 (DocuCenters are smaller units - not behemoths like the print/copy center class DocuTechs). These units have scan-to-email capability at approx 25-30 pages a minute, including double-sided capacity. They work quite well, allowing employees to walk up, select their name, drop their document in the feeder and hit the start button. 30 sec to a minute later it's in their inbox.
Next for the copy center in the department we have a Kodak i60 scanner (I think it scans 60 sheets a minute - could be wrong though). This one scans both sides at the same time, unlike the xerox which sucks it through a second time for the second side. It comes with Kodak Capture software which does a great job at processing jobs including blank page removal which is quite helpful if you have a set of documents that have material on some of the back sides of pages but not all. This works really well as we have a stand alone computer dedicated to this task.
Next in the copy center we have a Xerox DocuCenter Pro 75 which does essentially the same thing except it drops the PDFs or TIFFs directly to a novell or SMB share.
I hope this helps people some...costs are as follows (a rough idea), the visioneer about $200-250, the xerox 425 about $8000-10000 (also serves as network printer and copier), the Kodak i60 is about $1600-2100, and the WCP-75 is about $35000 give or take.
He/she could get a used Palm on eBay for less than what you could buy a scanner for. Sure, the handwriting recog doesn't work for everyone, but it's a step in the right direction.
Your prof prolly already has a computer. Can't the thoughts be typed by the professor? That sure would make more sense for everyone.
Maybe your prof needs a new computer. How about one of those newfangled tabletPC thingies? Again, there's the handwriting recog problem.
Let's play Four Horsemen of the Apocalypse. I'll be Pestilence.
I had to scan in a photocopy of an old book written by a prof. I used ABBYY FineReader 5.0 and some kind of Canon fax/printer/scanner solution. The Canon probably wasn't worth more than $200-400; they had it on hand and I borrowed it. It didn't hold the whole book at once (300+ pages, all ugly photocopies), so I did small stacks at a time, and I would scan more pages while editing the scans from the previous stack. Photocopies are fugly, so you have to remove all the little marks and such, and some of the text and equations came out bad, so I had to kind of paint the document as well. For the most part I would just format the text and run the OCR on that, and keep the graphs and equations as simple black and white images.
It took a while, but I was about to be out of a job, so I kind of dragged my feet on it. If you have nice source documents, it shouldn't take too long to do this and the OCR software is pretty good nowadays (unlike when I tried it in the early 90's on a grayscale scanner).
Fine reader is real sweet and worth the money. Someone had a similar question on slashdot and it was recommended, and I used it when I had to do this, so I'll pay it forward and recommend it for you.
-- Having a Creationist Museum is like having an Atheist place of worship
My office is actually working on a large scale paper-to-digital conversion. So far we have nearly 300,000 sheets scanned but we are far from being finished. Scanning the paper is only the first part of the equation. Since you are scanning a hundred pages max I'd recommend a plain scanner with an ADF capacity of at least 30 pages.
Our office is using our high-speed copiers to scan-to-tiff as fast as 120ppm on the newest units. Documents are separated by a dark sheet with a large X on it. A scanrouter server then takes these files and drops them in a folder where someone, using a thumbnail view, identifies the first page and classifies it by person and document type (depo, condensed depo, cv, etc...). They then select the entire document (looking for the next large X) and drop it onto a custom application I wrote.
This program will rename/move these files into an image share under the format //#/00000001.tif where # is the next available folder number beginning with 1. The program uses an auto-complete box for the person's name so we reduce the number of misspellings, and the available doctypes can be modified using an ini file.
Next, we run another custom program that will crawl ALL folders in that image share looking for any .TIF file that does not have an identically named .TXT file. More on this later. Each file it finds matching this criterion is placed in a database.
Finally, another program plugs into ScanSoft's OmniPage Pro 14 COM abilities, pulls files from this database, recognizes them, and places the output TXT file next to the accompanying TIF. The benefit of using a database is that we can unleash a large group of machines on recognizing these pages and start/stop them as needed. We figured that when we're done, we'll need approximately 30 days of computing time on a P4/2.8Ghz to finish off 250,000 files (for those counting, that's about 10 seconds a page).
Now here's where it all comes together: we use an application called Summation which will import a .dii file (which is a specially formatted text file describing each folder, generated by yet another app) and allow users to search for text by person, type, anything, and read/print out the corresponding page.
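The crawl step in this workflow (find every .TIF that has no identically named .TXT beside it) is simple enough to sketch. This is not the poster's program, which isn't shown: it returns a plain list instead of filling a database, and the function names and folder layout are assumptions. The second helper just redoes the poster's computing-time arithmetic.

```python
from pathlib import Path


def has_text(tif):
    """True if an OCR output file with the same base name sits next to the image."""
    return any(tif.with_suffix(ext).exists() for ext in (".txt", ".TXT"))


def find_unrecognized(image_root):
    """Return every .tif/.TIF under image_root that has not been OCRed yet."""
    pending = []
    for path in sorted(Path(image_root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".tif" and not has_text(path):
            pending.append(path)
    return pending


def ocr_days(files, seconds_per_file, machines=1):
    """Wall-clock days needed if the work is spread evenly over `machines`."""
    return files * seconds_per_file / machines / 86400
```

ocr_days(250000, 10) comes out at about 29 days on a single machine, which matches the "approximately 30 days" figure. Because the missing .TXT file is itself the to-do marker, the crawl can be rerun at any time and only picks up what is still outstanding.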
Thats copyright plain and simple. Be careful the FBI might bump copyright up to 4th under music piracy on their agenda.
http://www.xerox.com/go/xrx/equipment/product_details.jsp?tab=Overview&prodID=DocuShare&Xcntry=USA&Xlang=en_US
Unless your PowerBook has USB2, stay away from anything but Firewire for the interface. HP has notoriously bad scanner software for Macs; Canon is much better. And it's got to be single-sided sheetfed (doublesided = no end of trouble). As prep, copy double-sided material to single sheets. Scan at 100-150 dpi resolution and use Acrobat 5 or 6 to make the pdfs. If you're good at scripting (or ask a Uni script wiz) you can string it all together with AppleScript. If mastered you will become the University PDF Producer! Don't underestimate that title - doing what you intend to is unique, extremely useful and will be highly valued!
ScanSnap may be just what you need if the notes are on a uniform-sized paper (e.g. A4 or letter). You need Acrobat (included) on a Windows machine, but you just set the notes on the scanner and click a mouse then it scans 50 sheets (both sides in one-pass) without human intervention and gives you an Acrobat file in a few minutes. It is small and weighs light so you can easily bring it into the secretary's office. The price is also reasonable ($495 with Acrobat 6.0), and it seems they are even offering a $100 rebate now.
The specified resolution is for color documents. For a b/w one, you will get a better resolution. You can obtain scan samples from a Japanese page (pdf files at the bottom).
Actually, a newer model, fi-5110EOX, is already available in Japan, and I think that is why they are offering a rebate now. The new model has a usb2.0 connection and a higher resolution mode (excellent) that is not possible with fi-4110.
The HP 4C scanner also works with Impressario, the printing/scanning software that ships with IRIX 6.5. I used to have this exact scanner running on my Silicon Graphics Indy.
I don't know what the specifics of your work are, but you probably have a huge supply of untapped workpower at your fingertips.
The students who are taking these classes could easily be a source of tappable work hours.
See Project Gutenberg's proofreading site for an example of this type of effort. http://www.pgdp.net/c/default.php
If you could get the professors to offer a little bit of extra credit for proofreading or converting a page, the task could be much easier for you.
Envision this: You use an ADF to scan an entire stack of notes in order, but you don't worry about how the scanning goes on each page. Then you xerox the whole stack and place the copies in a binder in someone's office. The students are then offered 10 points extra credit per page translated from .TIF to word/wordperfect/Mathematica, whatever, up to three pages worth.
The points are justified since the student is in the class and learning something by carefully duplicating, analyzing, correcting, and studying the professor's notes for that class. (Can you imagine a more likely way to end up accidentally committing three pages of facts to memory?)
You can place the .TIF files in a class-accessible online folder, and accept the end result in an e-mail.
If the file isn't legible, the student can check the xeroxed copy out from the binder. Since it's just a copy, you don't need to worry about losing it.
You could skip the scanning altogether, and ask the students to return any pages they don't finish translating.
Obviously this works best for large classes where the student:pages ratio is large.
Make sure you number pages if you do anything like this.
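If you do number the pages, handing them out is a small bookkeeping problem. Here is a rough sketch of one way to do it, assuming each student gets a single run of consecutive pages, capped at the three pages mentioned above and worth the suggested 10 points each. The function names and the one-run-per-student rule are my own choices for illustration.

```python
def assign_pages(page_numbers, students, max_pages=3):
    """Give each student one run of up to `max_pages` consecutive pages.

    Returns (assignments, unassigned): a dict of student -> list of pages,
    plus whatever pages are left over once every student has a share.
    """
    assignments = {}
    remaining = list(page_numbers)
    for student in students:
        if not remaining:
            break
        assignments[student] = remaining[:max_pages]
        remaining = remaining[max_pages:]
    return assignments, remaining


def extra_credit(assignments, points_per_page=10):
    """Points each student earns if they finish every page they were given."""
    return {student: points_per_page * len(pages)
            for student, pages in assignments.items()}
```

The leftover list makes the student:pages ratio concrete: with a 100-page stack and a 20-student class, 40 pages are still unassigned after everyone has taken their three.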
we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs
We had to do this, too. For a Court, which requires the reasons, decisions, etc. to be publicly available online.
*Thousands* of documents, hundreds of pages each. The responsible department got me, as the IT guy, to set it up for them (after they'd already bought the stuff to do it).
Basically, a couple of Ricoh Aficio series copier/scanners, a couple of ancient Fujitsu sheet-feed scanners, and a bunch of students sitting all day in front of computers running OmniPage Pro.
The Ricohs were great on paper - fast, networked, etc. but their scanner drivers were poor (reminded me of bad CD-ROM drivers - "Copywrite 1995 Behavior Tech Computer. All right reverse." [sic,sic,sic]), and their service (contract) involved having to call the Ricoh guy because the scanner portions randomly wouldn't appear on the network, then wait for him to appear while at least one of the students sat idle. 2 stars out of 5.
Ancient Fujitsu scanners, black and white only, don't remember the model number, required proprietary SCSI cards, no support under Windows NT/XP/2K. These were commercial-grade super-expensive scanners when new (about 1990). Installed Windows 95 on a bunch of relics with ISA slots for the SCSI cards and let 'er rip. Scanning was fast, feed was reliable like a good-quality photocopier or fax machine. Only issue was requirement for an old computer running an old OS; better overall than Ricohs - 4 stars out 5.
OmniPage Pro 12 - reading was *excellent*, far better than anything else I've ever seen. Handled French and English, simple monochrome diagrams, etc. with only very small occasional formatting problems. Print to a PDF using Acrobat on the file server. Only real problem was stability, frequently locking up and losing the scan and OCR on page 99 of a 104 page document. 2 stars out of 5, being punitive because of frustration.
As they got to be more proficient with OPP, and as OPP's dictionaries filled up, we were able to add more and more computers and scanners, so that they were running around, tossing files into the scanners, stapling scanned documents back together, and occasionally rebooting one of the Windows 95 workstations. Peak was 15 computers and scanners.
Task took 3 students 3 months full-time.
Fire and Meat. Yummy.
Besides slipping in an obligatory Futurama quote, I'm here to enlighten you. A great many universities/educational institutions are making course materials available online. This is often done as a pdf on a website - whether the site is password protected for current students only is a non-issue here.
Why pdf? So it looks the same if the student views it on a mac, windows or linux pc. Why online? Because thats the zeitgeist! Everything should be available online!
Why not printed? Because a university's primary occupation is not printing. (It's providing administration officers with jobs)
A lecturer's/professor's job is not waiting in line at the printshop. (it's raking in the research funding and spending it on fast cars and the unibar)
Why do the students have to print it out? They don't, they can read them on a screen. Don't have a computer? Tough luck. Get access to one quick. My university would not accept handwritten anything. They provided a plethora of labs, and access times around the clock.
The fact that these are handwritten notes that need to be pdf'd (scanned, ocr'd, whatever) is the real problem. Enforce some discipline on the teaching staff, make them learn to use (office presentation tool of choice). Most of them allow saving as pdf/whatever (depending on plugin).
Yay me!
Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page.
After I stopped laughing, I realized this may be a serious inquiry rather than a joke. I've assisted local government agencies in converting clear, printed, 8.5x11" text documents into searchable text / pdf documents, and the cost for these is over 10 cents a page. (Tax and mill levy records have to be verified 100% correct, as I'm sure your prof's notes need to be.) That's with volume discounting (> 500,000 pages), using nearly perfect ascii text documents, not scribbled notes.
So my advice is to get a few bids from outside contractors, then submit a realistic estimate based on the average. Hint: Given those specs, it's clear you/your management have no idea what's involved in this process. (Shows at least a modicum of IQ that you had the good sense to ask, however.) If you simply need to scan/save as pics (jpg/tiff -> pdf), you can do this yourself at reasonable cost/effort expenditure. But it seems to be implied that you need OCR capabilities for handwritten text, and for something as complicated as equations at that, so you're really pretty screwed. Even simply creating 100-200 kb jpgs & emailing them in an automated process is going to run into problems when the campus mail servers refuse to accept attachments larger than a Meg.
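To put rough numbers on that attachment problem, here is a minimal sketch. It assumes a 1 MB per-message limit and standard base64 encoding of attachments, which grows binary data by about a third; the function names and sample sizes are made up for illustration, and real mail servers may count size differently.

```python
import math

def pages_per_message(page_kb: float, limit_kb: float = 1024.0) -> int:
    """How many scanned pages of page_kb each fit in one message.

    Base64 turns every 3 bytes into 4 characters, so an attachment
    takes roughly 4/3 of its on-disk size once encoded (line breaks
    and headers add a little more, which this ignores).
    """
    encoded_kb = page_kb * 4 / 3
    return int(limit_kb // encoded_kb)

def messages_needed(total_pages: int, page_kb: float, limit_kb: float = 1024.0) -> int:
    """Number of messages needed to send total_pages under the limit."""
    per_message = pages_per_message(page_kb, limit_kb)
    if per_message == 0:
        raise ValueError("a single page already exceeds the limit")
    return math.ceil(total_pages / per_message)

if __name__ == "__main__":
    # A 100-page stack at an assumed 150 KB per JPEG page.
    print(pages_per_message(150), "pages per message")
    print(messages_needed(100, 150), "messages for 100 pages")
```

So even at the low end of that JPEG size range, a single stack of notes turns into a long string of emails, which is a decent argument for dropping files into a shared folder instead.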
Good luck, BWAhahahahaha!
and let nature take its course. "The dog ate your homework."
Maybe try reading the notes into voice recognition software program.
My company has a new requirement to scan about 30,000 contracts per year (1-4 pages per contract). I've been looking at these digital senders, and have been pretty impressed. In a recent development at HP, the HP Digital Sender 9100c has been imported into the HP4101 MFP, which is more capable and is cheaper to boot. Biggest difference: the 9100c does 15ppm at 300dpi, the HP4101 MFP does 25ppm at 600dpi. With an additional software package called DSS 3.0, the HP machine can scan your documents, convert them to TIFF or PDF, drop them into a folder on your server according to configurable instructions from the control panel, and email or fax them as well. I am arranging a demo of the HP 4101 this week at my company, so I do not have any hands-on experience yet, but I have spoken with an IT director who uses them both (9100c and 4101), and he is very impressed with the HP4101, and it's cheaper. I think you can lease them for under $100, if you give a 4-year commitment. I don't know if that works for your department, but if it does, you could get a lot of bang for your buck. Of course, I haven't had the demo yet, and I am paying attention to reports here of slow scan times, etc., and I will be sure to include that in my demo testing. Also the bit about getting a daemon running on the server. Thanks for the tips, all!
I've used Fujitsu scanners for these things. Not the small ones (Scanpartner etc), but the big ones M3096 and M3097. There are duplex capable versions. Watch out to get a SCSI device and not a video interface model for which you need a special interface board from Kofax or Xionics. These scanners are made for large volumes.
Your only realistic option is to take the documents to an outside agency. Take them to kinkos and have them scanned in. Take the digital copy back and upload it.
You should present to your boss that the allocated budget for your project is enough for a one-time job. In other words, there aren't any revisions: no professors get to modify the documents afterwards. That in and of itself greatly reduces the value of the project.
To do this kind of job correctly, you need a high speed scanner from either Kodak, Fujitsu or Panasonic, some software that will scan the documents in (easy to find or to make), and time. The scanner and the time are your big costs. A couple hundred just isn't enough to do it more than once.
I would really recommend documenting what you can do for that amount and showing it to the boss. I imagine the powers that be will change their minds and realize that would be a problem for each professor.
On the other side you could push for the profs giving you the documents digitally; it is after all a college. One could reasonably expect the profs to type their own material, or have their TAs type it for them.
Just my 2 cents....
Either way good luck, sounds like you need it.
#### ## Laroue ####
As much as I find HP all in one devices to be crashy pieces of crap, they are cheap, and most can do ADF scanning for under $300. You'll have to handhold the machine to get it working, but the ADF and scanner on it are actually rather solid units. (then again, what could you possibly screw up in an ADF and a scanner?!?)
I also have to agree with others: you MUST set it up as your policy that you will only put the documents into the scanner, press scan and walk away, nothing more; or else your productivity in other areas will be consumed by this project (unless you can get more money for this project to hire students)
-Millions of Monkeys, Millions of typewriters, 6 hours of sorting through faeces encrusted pages to find: This post
At my university (Univ. of Helsinki), the students at the Maths department typeset the handwritten lecture notes as part of their LaTeX course. The lecture notes are divided into small stacks of perhaps 40 pages each, and the students form teams and divide the pages among them to do the job.
Everyone benefits: the students get LaTeX experience and curriculum units, and the professional staff needs only to proofread the results.
- Ismo
Do as the lecturers at my school do: take a picture of the paper with a digicam, import the image as a gif, and put it in an MS Word document. Only ~20MB / doc-file. The dial-up users love it! ;)
Hey! That's my sig you're smoking there!
sounds like we have a similar job.
We have an HP scanner with a paper feeder: I feed the papers in, press one button, and the scanning starts automatically. Of course, you can set the scanning to low resolution, which results in a smaller pdf file.
But that only scans one side. You can turn the pages over and scan the other sides of the papers, then in PDF pro you can rearrange the page order very easily.
It's not a hard job if you have a decent HP scanner (or any brand) and a pdf professional.
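The reordering step after a two-pass scan is mechanical enough to script. Below is a minimal sketch of the page-order logic only; it assumes you flipped the whole stack between passes, so the back sides arrive last sheet first (check how your own feeder behaves). The function name and the page labels are hypothetical.

```python
def interleave_duplex(fronts, backs_as_scanned):
    """Merge two single-sided scan passes into reading order.

    fronts: pages 1, 3, 5, ... in the order they were scanned.
    backs_as_scanned: the even pages as they came out of the second
    pass. Flipping the whole stack puts the last sheet on top, so the
    backs arrive last-first and are reversed here.
    """
    if len(fronts) != len(backs_as_scanned):
        raise ValueError("both passes must cover the same number of sheets")
    backs = list(reversed(backs_as_scanned))
    merged = []
    for front, back in zip(fronts, backs):
        merged.extend([front, back])
    return merged

if __name__ == "__main__":
    first_pass = ["p1", "p3", "p5"]
    second_pass = ["p6", "p4", "p2"]  # stack flipped, so reversed
    print(interleave_duplex(first_pass, second_pass))
```

The resulting list is the order to feed into whatever tool assembles the final PDF. Sheets that are blank on the back still need a placeholder in the second pass, or everything after them shifts by one.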
Ask these guys. They seem to know how to go about this business (or at least claim to do so)..
First of all, did you say HANDWRITING? If so, do you expect the documents to be just pictures, not recognized text? Anyway, you will need a printer with an ADF (automatic document feeder). If you want recognition of printed text, you can use FineReader, a highly recommended Russian OCR package. Handwriting may be saved as PDF, or, if size matters, TIFF (group 4, if I remember right: black-and-white TIFF that uses about 30kb per page). As for automatic scanning, you could use that feature in FineReader, or search the internet for a free utility that does it. It is a matter of an hour to write such a utility for this special purpose. In my case I did everything with an HP ("AllInOne") SJ 3300 with ADF and FineReader 7. Just reload the tray with new papers from time to time (at high precision it takes about one minute per page). Seems like your professor wants to buy a horse for a hamster price, as such a printer, ADF and software cost more than $300, maybe 3 times more.
When bidding on a complicated matter: under-promise and over-deliver.
outsource it to India!
I bought a Brother MFC 8820D sheet-feed scanner for work. It reduced my workload by 2/3 when copying large amounts of legal documents. The "send to email" feature is nice, but a bit pointless for large scans. Once scanned, the files are saved as PDF and multi-page TIFFs.
If you go down this route, you should check your multi-page scans before saving them. Acrobat has a random buffering problem, which causes some pages to be placed in the wrong order.
I looked around a lot for exactly this - I've decided to try and run a "paperless house" (boy did I have a big shredding session) because I'm going to be a new New Age traveller and sell up and travel - but I want all my documents available.
I found that the Brother MFC9880 with a network card will do the job cheaply. It is a fax/scanner/copier/printer and has a sheet feeder. It converts to multi-page TIFF (for B&W) or JPEG (for color) and will email the job to different email accounts on an SMTP server (you can set this up via a web interface).
It's a lot cheaper than Xerox. A bit of an awkward UI, but then most product UIs are designed by engineers with Aspergers anyway.
It doesn't jam much, either. I just wish it had a built-in shredder (my shredder motor overheated after 30 minutes so I had to resort to the end-of-the-3rd-Reich-style burning of documents).
Sorry for the "me too" but I would totally endorse this recommendation. We were advised to get one from our sister company. Although I was a bit skeptical at first, it soon became apparent that it was a tremendous time saver, particularly compared to the laborious manual alternative. We got the 50 page sheet feeder (would consider that a 'must') model and it was great. Same size as a small fax machine, dead simple to use. Integrated with our Exchange address book too.
We never bothered pushing the model to explore further functionality (e.g. I proposed we looked at programming it to scan documents to save output TIFFs into a central folder, which we could then use best-of-breed OCR software to convert to text) but the potential was clear.
Aegilops
great. scan it to a TIFF to post process..or direct to a PDF. but what about accessibility???
If those PDFs can't be read out by a screen reader then you've just slapped yourself into a LOT of trouble. When a blind person wants access to those documents you'll soon learn about disability rights!
I've been trying to find out how to do this myself for ages!!! I used to archive handwritten documents and sketches using a Visioneer PaperPort Vx sheet-feed scanner on a Windows 95 laptop years ago. I could manage to save at least a file cabinet drawer worth of pages onto a single CD-R. The setup worked great, and was even portable so I could travel with it! It scanned pages pretty quickly.
The kind of medium I was scanning could cause problems. Sheets of pad paper, paper bound notebooks, and even hard-bound notebooks that I took apart would usually have remaining bits of binding glue that would cause a paper jam. I would have to pull the page from the other end of the scanner to help it avoid jamming. Since then I've switched to using spiral-bound unruled notebooks with covers solid enough to keep the corners of the pages from curling due to wear and tear. The spiral binding insured that I didn't have to deal with binding glue jams. Crisp flat pages also prevented jams due to curled corners.
I scanned them in at 300 dpi in black and white using the text enhanced mode so that the contrast was adjusted automatically for better compression. Without this, the blank areas of a scanned page would be perceived as having some shade, and the scanned image would have some pixel dithering to represent the shade. This would cause difficulty for the compression algorithm and result in a large file size. With the text enhanced mode, the blank areas were perceived as being absolutely white, which would maximise the efficiency of the compression algorithm. This would result in much smaller file sizes. At first, I used the PaperPort software's ".MAX" proprietary file format, but I ended up converting them to LZW-compressed TIFFs so that I could open the documents on computers not equipped with PaperPort software.
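The effect of a clean white background on compression is easy to see with a toy experiment. This is only a sketch: the "page" is synthetic, the pixel ranges are invented, and it uses zlib (Deflate) from the Python standard library as a stand-in for the LZW compression in the TIFFs, so the exact numbers will differ from a real scan. It also keeps one byte per pixel rather than packing to 1 bit.

```python
import random
import zlib

WIDTH, HEIGHT = 200, 200

def synthetic_page(seed: int = 0) -> bytes:
    """Fake 8-bit grayscale scan: off-white noisy paper with a few dark lines."""
    rng = random.Random(seed)
    rows = []
    for y in range(HEIGHT):
        is_ink_row = y % 20 == 10  # one dark "pen stroke" every 20 rows
        row = bytearray()
        for x in range(WIDTH):
            if is_ink_row and 20 <= x < 180:
                row.append(rng.randint(0, 40))     # ink, slightly uneven
            else:
                row.append(rng.randint(225, 255))  # paper, rarely pure white
        rows.append(bytes(row))
    return b"".join(rows)

def threshold(pixels: bytes, cutoff: int = 128) -> bytes:
    """Force every pixel to pure black (0) or pure white (255)."""
    return bytes(255 if p >= cutoff else 0 for p in pixels)

def compressed_size(pixels: bytes) -> int:
    """Size in bytes after Deflate at the highest compression level."""
    return len(zlib.compress(pixels, 9))

page = synthetic_page()
noisy_size = compressed_size(page)
clean_size = compressed_size(threshold(page))
print(f"noisy background: {noisy_size} bytes, thresholded: {clean_size} bytes")
```

The noisy background is close to random data, which no general-purpose compressor can shrink much, while the thresholded page is long runs of identical values. That is the whole trick behind the text enhanced mode.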
If the papers you need to scan are crisp uncurled pages without residual binding glue like that you find on pads, scanning will be a breeze. You can use a scanner with an automatic document feeder, because you won't have to worry about paper jams. Otherwise, you will have to scan each page manually. The Visioneer Strobe XP 450 PDF looks like a good one for this. If they do have curls or glue but are all of a uniform size, a flatbed would be your best bet, because you wouldn't have to worry about jams and would have to only manually set the cropping size just once. If the papers vary in size a great deal (say if you were scanning in a bunch of receipts of different lengths and widths) a sheet-feed scanner would be better because they crop the pages automatically, although you would have to worry about jams. At least the Visioneer ones do. There is another sheet-feed scanner for the Mac called the TravelScan 464M, but I don't have any experience with it, so I don't know if it automatically crops.
I eventually decided that I would like to try scanning in greyscale, because although black and white was fine for printed text, I felt that it wasn't clear enough for handwriting and sketches. I knew that the file sizes would be larger, so I decided I would need to burn them onto DVD. I bought the first laptop to burn DVDs immediately when it first came out, which was the PowerBook with SuperDrive. To my disappointment, I found that Visioneer dropped support for the Mac when OS X was introduced, so I couldn't use their scanners. I got a legacy Visioneer Strobe Pro scanner on eBay, ordered the Mac OS 9 installation disk from Visioneer, and I tried installing the PaperPort software for System 9, wit
I think the white paper is so that (semi)transparent materials will come out white. Imagine that the background was black, and you had to manually photoshop all the black out of a taped together/irregular/smaller than scan size image.
I could be wrong of course, but the white paper helps me, and I can't think of any other reason.
Fax machines typically have excellent sheet-feeders. Take your stack of papers, and fax them to a PC with fax software installed. This will create TIFF files. Then, "print" the files to PDFs.
Nothing for 6-digit uids?
go here : http://www.GNLServices.com
Located in Ohio...can handle out of state jobs.
I have a similar task at a law firm. It's a pain, but having a high-speed scanner is a God-send. I use a Canon DR-5020. It's rather old, and only black and white, but it hardly ever clogs up, and is really quite fast. I can only imagine how much faster the newer models are. Scanning 100 pages at a time is nothing for this scanner (the more the merrier - there is nothing more annoying than scanning a big box of papers in which every 10 sheets or so is stapled together). Of course, I imagine the price tag is a bit hefty, but these scanners are quite easy to use. I rarely need to stop scanning due to paper jams, etc. In addition, it scans directly into Adobe Acrobat without a problem whatsoever. I scan at 300dpi b&w. I typically have around 5,000 - 10,000 pages in a document, usually ranging from 300MB to 500MB (we then burn them to CDs for storage - one CD for around 10-20 boxes full of paper, so it's quite a space saver, eh?). The last document I did was 311MB (PDF) with 6,870 pages.
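For anyone wondering what those figures mean per page, here is the arithmetic as a small sketch. It takes the 311MB / 6,870 page figures above at face value and assumes 1 MB = 1024 KB; the function names are made up, and the 100-page stack is just the size mentioned in the original question.

```python
def kb_per_page(total_mb: float, pages: int) -> float:
    """Average size of one page in KB, given a total in MB."""
    return total_mb * 1024 / pages

def stack_size_mb(pages: int, page_kb: float) -> float:
    """Projected size in MB of a stack scanned at the same average."""
    return pages * page_kb / 1024

per_page = kb_per_page(311, 6870)
print(f"about {per_page:.1f} KB per page")
print(f"a 100-page stack would be about {stack_size_mb(100, per_page):.1f} MB")
```

That works out to somewhere in the mid-40s of KB per page at 300dpi b&w, so a 100-page set of lecture notes lands at a few MB, assuming handwritten pages compress about as well as these legal documents do (they may not).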
Check out Canon's high-speed scanners here.
If you need a color high-speed scanner, may I suggest the DR-9080C.
It claims to do "90 pages-per-minute (black-and-white or grayscale) and 50 pages-per-minute (color)."
-Ares
You want an HP Digital sender. It will do color and grayscale scanning of one and two sided documents. It then converts the scanned documents to a PDF file and emails it to wherever you'd like. I use this thing daily at work in order to scan handwritten notes to post on my website. The only downside is these things start at about three thousand dollars.
While we're all giving suggestions here's mine -
I do this at my job quite a bit...I recommend capturing the pages at 300dpi grayscale and then converting/saving them as bitonal tiffs. That way Acrobat (6.0 for sure, not certain about earlier versions) can automatically apply Group4 compression to them (compression used by faxes). This will reduce your filesize tremendously. Converting the grayscale scans to bitonal is fairly simple...
in Photoshop: first run auto contrast under the Image menu, then Image>Mode>Bitmap (options: 50% threshold, output resolution = 300 dpi).
If you feel the quality is too low try capturing at a higher resolution (but still output to 300 in your bitonal conversion).
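If you don't have Photoshop, the same two steps are simple to describe in code. This is a pure-Python sketch of the idea on a tiny made-up grid of gray values, not what Photoshop does internally: the contrast stretch here is a plain min/max stretch with no clipping, and the function names are my own.

```python
def auto_contrast(gray):
    """Linearly stretch values so the darkest pixel becomes 0 and the lightest 255."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in gray]  # flat image, nothing to stretch
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]

def to_bitonal(gray, threshold=0.5):
    """Return rows of 0 (black) / 1 (white) using a fixed threshold."""
    cutoff = threshold * 255
    return [[1 if p >= cutoff else 0 for p in row] for row in gray]

# Faint pencil (140) on grayish paper (200): invented sample values.
faint = [[200, 140, 200]]

print(to_bitonal(faint))                 # threshold alone
print(to_bitonal(auto_contrast(faint)))  # stretch first, then threshold
```

With the threshold alone, the pencil mark is lighter than 50% gray and vanishes into white. Stretching the contrast first pushes it down to black, which is why the auto contrast step comes before the bitmap conversion.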
If you don't have that much to do this is perfectly practical to do yourself on a flatbed (don't forget black construction paper for the double sided scans). If you have more than a hundred, I'd recommend outsourcing to a service provider.
*begin shameless sales pitch*
Coincidentally, I'm a digital imaging project coordinator so anyone feel free to send me a PM if you have any work of this nature you'd like to farm out.
*end shameless sales pitch*
Many comments are covering the hardware and software for a project such as the original question posed, but I'm curious: does anyone use this approach to digitize their general office documentation? How many IT managers out there actually scan in all the invoices, work orders, support requests, purchase orders, etc. into a digital database? Do you think it would even be worthwhile?
I used to scan/ocr typewritten hard-copy manuscripts for a technical publisher. I would get a 500+ page manuscript on Friday and have to deliver ASCII text files for all of it by Sunday evening.
I used a Hewlett Packard 4C scanner with 50 page doc feeder.
It wasn't too bad - drop in 50 pages, hit scan... go do something for 30 to 50 minutes, come back, repeat. (You aren't really asking about OCR, so I'll skip the gruesome details)
Of all the scanners in the consumer market ($1000 or under ish), past and present, that I've tried I've always liked the HP 4C and its replacement series (6200c or somesuch?). They would handle about 1 page every 30 to 50 seconds with the doc feeder at the resolution I was using.
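For planning purposes, those per-page times scale up like this. A back-of-envelope sketch only: it uses the 30 to 50 seconds per page and the 50-page feeder from above, ignores jams and reload time unless you pass a value in, and the function name and the 500-page example are just for illustration.

```python
import math

def scan_time_minutes(pages, seconds_per_page, feeder_capacity=50, reload_seconds=0):
    """Return (total minutes, number of feeder loads) for a job."""
    loads = math.ceil(pages / feeder_capacity)
    total_seconds = pages * seconds_per_page + loads * reload_seconds
    return total_seconds / 60, loads

for secs in (30, 50):
    minutes, loads = scan_time_minutes(500, secs)
    print(f"{secs} s/page: {minutes / 60:.1f} hours over {loads} feeder loads")
```

So a 500-page manuscript is roughly an afternoon of machine time, but only ten short visits to the scanner, which is the real appeal of a document feeder.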
The Digital Sorceress
...or gimme your lecture recordings.
Since professors love to hear themselves talk, perhaps they recorded or will record their lecture notes. Play these through a voice-text conversion application.
Yes, yes, yes
Even if it's not LaTeX (the tool doesn't REALLY matter), this is the way to go. It gets the students engaged with the material, they organize and format it in a way that works for them, and the job gets done. It's a strategy that benefits a lot of people.
This is a common task for law firms. Typically in the discovery process you get tons of documents, and it is becoming more and more common to send those out to a vendor to get imaged as 300dpi Group 4 TIFFs. There should be some vendors in your area that can do this and provide OCR services (IF it's machine-generated text; handwriting doesn't OCR). Look for firms advertising litigation support. It's also common to number each page for future reference; the vendor can do this and it can make it a lot easier to find things. iPro is a popular scanning software, and Ricohs are popular scanners. Tell the vendor you just want .pdfs; that should be easy for them.
Have you looked into OCR software? You may have to do some cleaning up afterward, but the files would be much smaller because they would be characters instead of being basically pictures. I don't know if the software would be able to work well with the messy handwriting, but I would give it a shot. Look here for some more information.
I hate sigs.
How about taking a digital photo of each page? You might need a macro lens, or take the photo from further away with zoom. Obviously a web cam won't give you enough resolution though.
(1) Switch on TV, scanner, and get some kind of distraction from the mind numbing endeavour.
(2) SoftSnow.biz - Probably not as good for handwritten notes as it should be, but is very highly recommended amongst people I know who scan entire books (200 pages average)
Alternatively, take a look at project gutenberg's distributed proofing tools.
(3) HTML Tidy
(4) Reproof XHTML manually / convert to other formats as needed.
The slowest part is the manual proofreading.
See if there is a local scanning sub-contractor that can handle that work en masse.
These guys have large, very fast machines that could scan all of your documents in just a few minutes, and I am sure the charge would be minimal.
Best of Luck.
--x
You've described a fairly average problem. Here at Penn State, our Electronic Reserves department scans course reserves using HP scanners with document feeders, Acrobat, and photocopies of pages. They build low res PDFs, tweaked for the express purpose of being displayed on the web. As I recall, they scan in grayscale at 200 dpi for text, and set Acrobat to reduce page size, use maximum page compression, etc.
I work in the Preservation department and we use the Xerox Digipath system, which has high-speed black and white scanners and can build PDFs from them. Fast but highly expensive ($50,000 for the entire system) and the individual image files are stored in the Xerox proprietary file format.
If you can get a Fujitsu scanner with your department funds, I would HIGHLY recommend that. They don't curl pages as they pull the documents through, and can get decent scanning speeds. Some models also scan both sides of the page at once. I would definitely recommend using web display settings in Acrobat when saving the files to PDF. I would, however, recommend using 300 dpi when scanning, just so you get decent images going into Acrobat. It's also the national guideline recommended by the Digital Library Federation for scanning anything in grayscale.
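To see what the dpi and color-depth choices cost before Acrobat compresses anything, here is a quick sketch of the raw image size per page. It assumes US letter paper (8.5 x 11 inches) and rows padded to whole bytes; the function name is hypothetical, and compressed PDFs will of course be far smaller than these figures.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8  # pad each row to whole bytes
    return row_bytes * height_px

settings = [
    ("200 dpi grayscale", 200, 8),
    ("300 dpi grayscale", 300, 8),
    ("300 dpi black and white", 300, 1),
    ("300 dpi 24-bit color", 300, 24),
]
for label, dpi, bits in settings:
    size_mb = raw_scan_bytes(8.5, 11, dpi, bits) / 1_000_000
    print(f"{label}: {size_mb:.1f} MB per page before compression")
```

Going from 200 to 300 dpi more than doubles the raw data (it scales with the square of the resolution), and grayscale is eight times the size of black and white, which is why the compression and downsampling settings in Acrobat matter so much for anything meant to be downloaded.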
I would also brief the professors who want these materials online that they're going to have to accept delays in getting those images up and loss of image quality. 100s of pages do not go up in a few hours, all cleaned up and OCRed. I wouldn't even bother with cleaning and OCR. The pages are being scanned so people can read them, not search through them. If the professors want more out of this, they should kick in money for it.
And back up all of the PDFs to some media (CD, hard drive, etc.). A worthwhile investment so the pages don't have to be scanned a second or third time.
how bout telling your professor to learn how to type? buy a laptop for him and the savings will add up.
My Gawd WTF...