Preserving Old Research Notes and Documents?
twistedcubic asks: "I have several thousand 8.5 x 11 inch dead tree pages of notes and research that takes up too much storage space. I would like to have all these notes scanned into PDF files (for example) so I can recycle the pages and reclaim storage space. Does anyone know of a store that provides this service, or an inexpensive machine that will do the job in a reasonable amount of time?"
"I have several thousand PDF files taking up too much disk storage space. I would like to have all these files printed on to 8.5 x 11 inch dead tree pages of notes so I can delete the files, empty the recycle bin and reclaim storage space. Does anyone know of a store that provides this service, or an inexpensive machine that will do the job in a reasonable amount of time?"
For future reference, I suggest a printer.
--BladeMelbourne
A 10-year-old nephew and a scanner.
Even if you could scan all of them, are you going to just leave them named
untitled, untitled-1, untitled-2... untitled-3000
or are you going to rename all of them and organize them in some way? You probably won't find a solution that won't take a lot of time and work.
"Scientists have proof without certainty; Creationists have certainty without proof" -Ashley Montagu
Filing cabinet.
/^([Ss]ame [Bb]at (time, |channel.)){2}$/
Okay, I don't know of a 'mass scanner' sort of device where you can dump a bunch of paper into it and it'll automatically handle it. But I can tell you that my aunt has a scanner with a feeder that would accommodate one sheet at a time. She had it set up so she'd just feed the paper through, push a button, and it'd scan it for her and save it somewhere.
Unfortunately, I'm having a terrible time remembering the brand of it. I also don't know if they're even made these days. It's not a great solution to your problem, but I imagine it'd be a bit easier than using a flatbed scanner.
Apologies, this isn't that helpful a post. I'm just hoping I can spark a memory or two in somebody who knows the answer and can post it.
"Derp de derp."
I would try to convert the pages to some sort of text format to allow searching...
google.com
ADF (Automatic Document Feeder) scanners are fairly pricey (good ones are in the US$400 - US$1000 range), but you can get a cheapie Brother MFC-3240C All-In-One (C$140) that has a 20-page document feeder and then get a slave (e.g. some grad student) to feed in your pages for you.
My Brother MFC-2340C scanner comes with the PaperPort application, which generates PDFs and supports double-sided scanning even though the scanner doesn't support it. (You just flip over the whole stack once you've scanned one side, and start scanning the other side. PaperPort knows how to automatically reconcile the pages.)
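For the curious, the flip-the-stack trick boils down to interleaving two lists. I don't know how PaperPort does it internally; this is just a minimal sketch, assuming that flipping the whole stack means the back sides come through in reverse order. The function name and the page labels are made up for illustration.

```python
def interleave_duplex(fronts, backs_as_scanned):
    """Merge two single-sided passes into reading order.

    fronts:           front sides in scan order (pages 1, 3, 5, ...)
    backs_as_scanned: back sides in scan order after flipping the whole
                      stack, so the LAST sheet's back arrives first
                      (assumption about how the stack is flipped)
    """
    if len(fronts) != len(backs_as_scanned):
        raise ValueError("both passes must contain the same number of scans")
    # Undo the reversal caused by flipping the stack.
    backs = list(reversed(backs_as_scanned))
    merged = []
    for front, back in zip(fronts, backs):
        merged.extend([front, back])
    return merged


if __name__ == "__main__":
    print(interleave_duplex(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
```

If a sheet double-feeds in either pass the two lists end up different lengths, which is why the sketch refuses to merge rather than silently misaligning every page after the jam.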
If you have Acrobat Professional, you can do a Paper Capture(TM) which is basically doing an OCR on the PDF and then storing the recognized words as "keywords" so that the PDF is searchable via Spotlight or other indexing mechanisms.
A document scanner is indeed a very useful piece of equipment -- I use it to scan notes and scrap paper containing rough ideas, often with lots of mathematics. Sometimes writing stuff on paper is just easier than typing in LaTeX...
The eminent computer scientist Edsger Dijkstra also liked to write stuff using pen and paper. His digitized works, called EWDs (after his initials, Edsger Wybe Dijkstra) are available here:
http://www.cs.utexas.edu/users/EWD/
There are tons of companies that specialize in electronic document scanning & OCR, usually for the legal industry. Probably costs $0.05 to $0.10 a page, but you might be able to cut a deal as an individual rather than a law firm.
Are the notes graphics-heavy (i.e., scientific/engineering)?
If not, give it to a typing service. Once you show them how much "stuff" you have, I'm sure they'll give you a discount. They might even agree to use OpenOffice2 (because it handles huge documents well, the files are small, and it has an excellent PDF exporter).
You'd still have to scan in the pictures/drawing/graphs, and place them appropriately, which will take time.
Also, there are firms that specialize in digitizing paper documents (mostly forms and regularized documents for businesses). Depending on the amount of handwriting & graphics, it might not be appropriate, though.
All in all, no matter how you do it, the project will
"I don't know, therefore Aliens" Wafflebox1
There are companies that will do this for you. For example, IMC in WV (http://www.imcwv.com/). They can scan it all to PDF using the image as what you see in the PDF backed up with the OCR'd text. That way the document is somewhat searchable, but you always see the exact scan of the doc when you look at the PDF.
I'm better, because I'm bigger
Disclaimer: I used to work for this company as a coop student.
I would contact PRG Schultz as they have done this for large clients in the past. They have a program called imDex which is pretty slick. Basically, it's a searchable, cross-indexable database, so you'll have OCR'd text, along with TIFFs or PDFs of the documents. If you would like more information, let me know.
The problem is then you have to come up with a safe long term way to store digital data.
Clue:
There isn't one.
The best thing to do is NOT convert the paper to digitized format. Find some space instead, and store the paper. Your data will be much safer.
Many libraries will have reader-printers that for a small fee (eg, $0.20/page?) you can print a copy.
Most of the expense with fiche is the production of the silver halide original; diazo copies are relatively cheap. If it's really important to you, have a copy made and lock the original film in a safe deposit box (or at least offsite)
Isn't that what children are for?
Surely one of your kids has screwed up and needs the responsibility of justifying his food and shelter. Why not just make Jr. scan for a few hours whenever he screws up and stays out late, steals the car, etc.? Hell, kids are taking up valuable processor cycles for hours as it is. Time to show them that computers are for more than games, www., and pr0n.
The patience taught by scanning even the first hundred pages or so will be priceless. The look on their face when they find out they have to convert and title them should be savored as a rare delicacy. The defeat at learning that they will do it or lose recreational computer time will be better than palladium.
*Repent!Quit Your Job!Slack Off!The World Ends Tomorrow and You May Die!
I've helped set up something like this. The best small scale solution would be to get a good flatbed scanner with an automatic document feeder (ADF). You can get decent HP scanners for about $400-700.
Once you have the scanner, you can set up a few scanning profiles that automatically set resolution, color depth, black & white threshold, etc. Then scan the notes into Adobe as images. If you scan them in as monochrome images at 100-150dpi you can get fairly small files that are very readable on screen and as printouts.
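To put rough numbers on "fairly small", here is a back-of-the-envelope sketch of the uncompressed size of one letter-size page at a given resolution and bit depth. The function name is mine, and real files will be smaller than this because PDF and TIFF compress monochrome scans; treat the figures as an upper bound, not as what the scanner software will actually write.

```python
def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page.

    bits_per_pixel: 1 for monochrome, 8 for greyscale, 24 for colour.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    for dpi in (100, 150):
        print(dpi, "dpi monochrome:", raw_page_bytes(dpi), "bytes")
    print("150 dpi 24-bit colour:", raw_page_bytes(150, bits_per_pixel=24), "bytes")
```

At 150dpi a monochrome page is about 1/24th the raw size of the same page in 24-bit colour, which is the whole argument for scanning text notes as black & white.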
Finally get an RA or student labor to feed the documents in to Adobe and save them in separate files. The ADF lets you do 25-100 sheets at a time, so the helper can start a scan and surf the web or something until the batch is done.
N.B.: Having a flatbed scanner lets you handle odd sized sheets of paper or delicate stuff. Although you can scan and OCR the documents, OCR is probably going to screw things up a bit and you probably don't want to try to read through the documents to catch and correct the OCR errors. Also if you have any math, diagrams, or handwriting in the notes, the OCR program will probably produce unusable junk.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
Not sure anyone like Kinko's does this. If they do, the price for several thousand pages will almost certainly be greater than the cost of buying an auto-feed scanner.
I assume if you've collected that much research, you work for a university or some sort of research institution. My undergrad college of 1,100 students had like three of these, including one that was part of some ginormous Xerox do-everything-and-then-collate-and-bind-it-(if-you're-printing,-of-course) machine that was sitting out so that anyone with a campus ID could use it. Maybe it's time to talk to the powers that be about buying some equipment?
I'm doing the same thing, scanning piles of my old college papers using an automatic document feeder. I bought an HP 8250 because it does duplexing, so I can just tell it to scan both sides of the sheet. I have lots of mixed double-sided/single-sided documents (like notebooks), so it's a lot easier to just scan everything double-sided and go through it in Acrobat and just delete the blank pages. Plus, with the duplexing feeder, I can cut the bindings off old books, drop the whole stack in the feeder, and scan the whole thing. But I haven't quite decided to destroy my old textbooks like that yet.
The HP 8250 software was just updated for MacOS X 10.4, which makes me really happy since I bought it just before 10.4 shipped, and they updated it promptly. It works well for bulk scanning on a Mac, and it was pretty hard to find a good Mac ADF duplex scanner. It also does 35mm slides, but it would probably be better to get some better software for that job, something like Silverfast SE.
Anyway, lots of my documents are handwritten (and many in Japanese too), so OCR isn't workable. I don't really need machine-readable documents to do text search, but I could always use Acrobat for OCR on some documents. I think there's a way to keep the graphic image intact while the searchable OCR text is on an invisible layer, but I haven't quite figured out that Acrobat function yet.
The ideal solution would probably be an HP Digital Sender. It's about the size of a laser printer, except it's kind of a laser printer in reverse. You load a document into it, it scans the whole thing at 30-40 pages per minute, turns it into PDF, and then sends the PDF across ethernet via SMTP to wherever you want. Works with any OS, obviously. It's sold as an alternative to fax.
The problem is that they're about $2500 each (MSRP $3200), because they're a niche item. Shame really, because if they'd dropped in price the way laser printers have, they could have made fax a thing of the past.
As it is, I spend ages screwing around with a flatbed scanner, like every other poor sod trying to solve his personal filing problems.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Print them in a book and wait for Google Print to scan them all for you.
Easy: Release all your research papers under the GNU Free Documentation Licence and post them on Wikipedia. If they delete them as original research, post them to my wiki where we keep everything.
The ideal way to do this would be to find a digital copier with an automatic document feeder, like a Ricoh Aficio or a Canon ImageRunner, that you have access to. They generally have a function to scan instead of copy, and they scan at very high speeds (even duplex scanning!). The data is often retrievable over a simple LAN - I've even seen some that support TWAIN over LAN.
If you don't know where you are going, you will wind up somewhere else.
Go buy a modem, and grab an old fax machine, then fax the documents to yourself. You should be able to fax a decent number of pages at a time and can walk away and leave it running. These will be saved as multi-page TIFFs which, while not PDFs and not searchable, at least solve part of your problem.
RandomAndInteresting.com defending the world from stupidity since 1979
Try a legal copyist.
"N.B.: Having a flatbed scanner lets you handle odd sized sheets of paper or delicate stuff. Although you can scan and ocr the documents, ocr is probably going to screw things up a bit and you probably don't want to try to read through the documents to catch and correct the ocr errors. Also if you have any math, diagrams, or handwriting in the notes, the ocr program will probably produce unusable junk."
ABBYY Finereader does an excellent job. On more difficult pages I had to do some tweaking, but part of that was my inexperience with the product.
And yes it can handle diagrams, pictures, etc.
Thousands of pages? That is VERY little. One piece of paper is ~0.1mm thick. 10,000 pages would take up only 1 m, which is only one or two drawers in a filing cabinet.
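The arithmetic checks out, for what it's worth. A quick sketch using the ~0.1mm figure above; the 60 cm drawer depth is purely my assumption, so adjust it for your own cabinet. Integer micrometres are used to keep the numbers exact.

```python
SHEET_THICKNESS_UM = 100      # ~0.1 mm per sheet, figure from the post
DRAWER_DEPTH_UM = 600_000     # 60 cm usable depth per drawer (assumption)


def stack_height_m(pages):
    """Height in metres of a stack of `pages` sheets."""
    return pages * SHEET_THICKNESS_UM / 1_000_000


def drawers_needed(pages):
    """Whole drawers needed, rounding up."""
    height_um = pages * SHEET_THICKNESS_UM
    return -(-height_um // DRAWER_DEPTH_UM)   # ceiling division


if __name__ == "__main__":
    print(stack_height_m(10_000), "m,", drawers_needed(10_000), "drawers")
```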
The problems with flatbeds are that they are often slow & the ADF jams quite a bit. A nice scanner dedicated to documents is what you want.
I am extremely happy with my Canon DR-2080C. Note: It is the only piece of hardware I've bought knowing that it won't work with Linux. I ran Windows SPECIFICALLY to use this document scanner. It looks like it has been discontinued & the DR-2050C is the model to get now. Looks like it does larger documents, which is nice. These do duplex scans in one pass, so you can get about 40 sides (so 20 2-sided pages) per minute. These will probably set you back ~$650 new.
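To get a feel for what 20 sheets per minute means for a pile of several thousand pages, here is a rough estimate. It only counts time the feeder is running; removing staples, reloading the tray, and clearing jams are not included, and the 3000-sheet figure is just an example.

```python
import math


def scan_minutes(sheets, sheets_per_minute=20):
    """Minutes of pure feeder time, rounded up to whole minutes.

    A one-pass duplex scanner captures both sides of each sheet, so
    20 sheets per minute corresponds to 40 sides per minute.
    """
    return math.ceil(sheets / sheets_per_minute)


if __name__ == "__main__":
    sheets = 3000
    print(sheets, "sheets:", scan_minutes(sheets), "minutes of feeding")
```

So the scanner itself is a few hours of work for a pile that size; the handling around it is where the day goes.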
If you have more money to spend, there are even better document scanners available.
Unless you need the capability to grep the documents, there's little point in digitizing old notes. Digitization carries a number of risks, anyway, not the least of which is that in a few decades (and by "a few" I mean one or two) you may find the information unreadable by any still-functioning hardware. Then again, you could just upload it to "the Internet" and let various system administrators guarantee its perpetuity.
A frank question you have to ask, though, is how important it is to preserve this information. A strong test is how often anyone has needed to refer to these old notes in the intervening years. It's difficult to say this about the output of one's labor, but it may well be that it truly serves no further purpose and what you really need to do is bypass the scanner, go directly to the recycling center and bid it all farewell.
This is not my sandwich.
Even if there is a MAJOR change in the spec, there are open source viewers & you won't be stuck out in the cold. This is why a lot of places DO use PDF for archiving. If they don't think they'll be stuck out in the cold, why do you?
This is neither here, nor there.
Authors get to decide whether a document is protected or not. Patches for the free viewers are available to remove DRM if you have accidentally added it.
Can you elaborate?
I dare-say that PDF has much broader adoption than laserdiscs ever had. Certainly, ANY software format (deprecated or not) is more accessible than dead removable media formats. However, you can always find a reader & dump the files onto a new medium.
As far as medium goes, I agree that magnetic medium makes the most sense. Put it on the hard drive of at least one networked machine & back it up to tape. Hard drives also die, but there is no excuse not to make the data accessible & the files can always be recovered from backup.
I also have a Brother MFC and it's the best investment we've made. Actually, Dell sells an MFC at a much cheaper price, but we had to give it up because the scanbed doesn't support legal size paper (but the ADF does). Dell's MFC also comes with PaperPort. You can probably purchase from Dell with some back to school bargains (check the discount deal sites like techbargains, xpbargains, fatwallet, etc.). Even the cheapest laser MFC is a network printer (although scanning requires USB connection)--which means you can connect both via ethernet and USB, the ethernet is for printing for all the other systems on your LAN and the USB is for your Windows box. I also second the suggestion about Adobe Acrobat, it's the best piece of software out there, even though it's a Windows only piece of software.
Linux at home
To scan and store hand-written notes it might be better to use DJVU format http://djvulibre.djvuzone.org/. You can find free readers for almost every platform (including Zaurus!) and filesize is very small despite the good quality.
You can also convert to/from PDF and PS using a free (non-gpl but open source license) gs driver from AT&T.
This is what we do at work. We spent about $5000 on the setup, but remember that this is an enterprise where we scan about 125,000 pages to .pdf a month. It is probably possible for about $500 or so, for what you are looking at (oh, and some programming).
;)
First, you'll need a low-volume scanner. (Check the duty cycle to make sure it can handle your bookshelf of papers.) Then, you'll need something to convert the images to pdf. If you have any programming experience, write a quick app that uses ImageMagick (http://www.imagemagick.org/) to convert from tiff to pdf. Put each binding in its own folder, and pretend the "untitled1.pdf" says "page1.pdf".
If you want to get fancier, have the front end app rename the untitled1.tiff to whatever you'd like. Also, you can embed extra information into the pdf by using metadata and the Adobe XMP SDK (free download from Adobe). Make the metadata like:
TITLE="My Book"
AUTHOR="Bart Simpson"
etc.
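The "quick app" for the convert-and-rename step could look something like the sketch below. It only builds the ImageMagick command line; nothing is executed until you pass the result to subprocess. Assumptions on my part: the scanner names its files untitledN.tiff, the ImageMagick `convert` tool is on your PATH (newer installs call it `magick`), and the helper names are invented for illustration. It does not touch the XMP metadata part.

```python
import re
from pathlib import Path


def page_name(scan_name):
    """Map a scanner file name like 'untitled12.tiff' to 'page12.pdf'."""
    match = re.fullmatch(r"untitled-?(\d+)\.tiff?", scan_name)
    if match is None:
        raise ValueError(f"unexpected scan name: {scan_name}")
    return f"page{int(match.group(1))}.pdf"


def convert_command(scan_path, tool="convert"):
    """Build the ImageMagick command that turns one TIFF into one PDF.

    The PDF is written next to the TIFF, so each binding stays in
    its own folder.
    """
    scan_path = Path(scan_path)
    target = scan_path.with_name(page_name(scan_path.name))
    return [tool, str(scan_path), str(target)]


if __name__ == "__main__":
    # To actually run it: subprocess.run(convert_command(p), check=True)
    print(convert_command("untitled1.tiff"))
```

Keeping the command building separate from the execution makes it easy to do a dry run over a whole folder and eyeball the names before anything gets converted.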
Are these notes preserved "just in case" they are needed?
Do you actually REFER to the notes every now and then?
Do you need text or just scanned-images?
Do the advantages of having them outweigh the advantages of destruction? Remember, if you destroy it then it can't come back and haunt you in a lawsuit. But then again, it can't help you either. Caution - before you destroy anything make sure you have an official data-retention policy, and stick with that policy. Otherwise, destroying data CAN be seen as a sign that you have something to hide.
Once you've answered these questions, you can decide among your options for each document
- destruction
- file in archives, probably off-site, if necessary secure against fire or other disaster
- microfilm or microfiche
- scan images
- scan and "95% accuracy" OCR
- scan and 99.99% accuracy OCR with human verification.
If you scan or micro-photo-copy, you have to decide what to do with the originals - keep on site, archive off site, or destroy.
If you scan, you have to have a plan to copy the data to new media and new file formats as old ones become obsolete. If you have any 8-inch floppies or obsolete-format computer files lying around, you know the problem I'm talking about.
You should also set an "expiration date" on all documents. If a document has to be preserved until, say, 2010, it's OK to convert to digital and destroy the original, since 5 years from now it's almost certain you'll still be able to read it. If you'll need it in 2100 however, I'd recommend keeping the paper copy or at least a microfilm copy.
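The expiration-date rule can be written down as a tiny decision function. This is only a sketch of the reasoning above: the 5-year "safe horizon" is taken from the 2010 example, and where exactly to draw the line between 5 years and 95 years is a judgment call that the parameter leaves to you. The function name and return strings are made up.

```python
def archive_recommendation(keep_until_year, current_year, safe_horizon_years=5):
    """Suggest how to keep a document based on how long it must survive."""
    years_needed = keep_until_year - current_year
    if years_needed <= 0:
        return "expired: destroy per your retention policy"
    if years_needed <= safe_horizon_years:
        # Short enough that today's file formats and media should still be readable.
        return "digitize and destroy the original"
    return "keep the paper or a microfilm copy"


if __name__ == "__main__":
    print(archive_recommendation(2010, 2005))
    print(archive_recommendation(2100, 2005))
```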
One more thing - if you need to preserve color information, color microfilm may age a lot faster than black and white, causing color shifts. This is probably okay for line drawings, charts, and such but not for photographs.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
This is an anti-Adobe troll. As has been pointed out, PDF is -not- a proprietary format.
Legal businesses and accounting departments use this stuff regularly.
Have you googled for it? There might be a SourceForge FOSS project along those lines.
"Enjoy what you're doing! If it becomes drudgery, you're doing it wrong!" - Jim Butterfield
Those handwritten notebooks are legal documents proving you did that work when you dated it. That's why engineers keep notebooks. If they are digital, do they keep their status as legal documents? This is an important question to me, so I hope someone knows :D
slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
Ask the experts at the Einstein Papers Project. They have been doing this for quite some time.
Xerox makes several high speed grey scale document scanners.
If you search them out you can find services that utilize these machines. Typically they OCR the pages side by side with a raw image scan. So even if the OCR is only so-so you can search the documents and read the original handwritten pages. These services are most utilized by law firms that are given crates of documents for legal cases.
It's not cheap but it's very fast and convenient.
(Of course, you will still need to spend lots of time scanning, naming and classifying those pages. The ADF and 10yo nephew suggested in another post might be useful for that.)
DjVu offers very compact representation without the need to OCR the document (I've converted a 13 megs scanned PDF into a 600K DjVu which was much faster and easier to read), and optionally a "hidden text layer" if you want to OCR it to make it searchable.
"I'm never quite so stupid as when I'm being smart" (Linus van Pelt)
If you really need to keep them, throw them in boxes and put them in document storage somewhere. Then, on the off chance you might need them for patent disputes, etc., you can hire someone for $8/hr to go thru them.
ACHTUNG! Das computermachine ist nicht fuer gefingerpoken und mittengrabben. Ist nicht fuer gewerken bei das dumpkopfen.
Does this really need to be done quickly? If not, you could do it yourself with just one to three pages a day, which should be very manageable. This would save you the money of paying someone and it would give you the chance to quality check each page as you go.
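Worth running the numbers before committing to the slow route, though. A quick sketch, taking 3000 pages as a stand-in for "several thousand" (my assumption, the original question gives no exact count):

```python
import math


def days_to_finish(total_pages, pages_per_day):
    """Whole days needed at a steady daily pace, rounded up."""
    return math.ceil(total_pages / pages_per_day)


if __name__ == "__main__":
    total = 3000  # assumed size of the pile
    for pace in (1, 3):
        days = days_to_finish(total, pace)
        print(f"{pace} page(s)/day: {days} days (about {days / 365:.1f} years)")
```

At three pages a day that pile takes 1000 days, so this pace only works if you really are in no hurry.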
Unless you plan on using OCR, these documents could also be saved in tiff, png, or jpeg formats. Personally, I would consider a format that allows for the embedding of keywords into the file, so that searching will be easier later on.
Good luck.
1. Get a scanner with a document feeder.
2. Get software to scan to PDF format.
3. Get Google Desktop Search, which will index the contents of PDFs, or get an Apple Mac with Mac OS X 10.4 (Tiger) and Spotlight will index your PDFs. If you have a Mac, you may be able to scan to PDF without needing Adobe Acrobat.
Don't know about scanner services, but check around and you might find someone who can scan the documents to PDF and give you a DVD-R or CD-R's with the files. Kinko's? Print Shop?
We have high-end Kodak scanners that are unbelievably quick (24ppm) and scan both sides of a page at the same time. Of course, they cost $15,000.00 USD. So that's not practical for most budgets.
In the last two weeks, I have done broadly the same thing. I work at a hospital with a large Canon photocopier (ir5020i). This has an auto-document feeder like any self-respecting copier would.
It is also a network device to scan / print. I took in my computer (Mac mini), plugged into the ethernet port and (after 20-30 minutes of fiddling) was away.
So.. make friends with your local big company (a hospital would be good - you can make a small donation).
Bear in mind though that it took me pretty much all (working) day and I only had the equivalent of about 6 reams of paper (3000 sheets). Thank goodness it was a public holiday!
To save time, go through meticulously beforehand with a staple remover. To separate the sheets, place them vertically and blow down onto them to get air between them. HTH.
I work at a place that does this kind of stuff. I can say though that storage space is cheaper than you would think and in even the medium term a better idea.
If you really need to be able to access it, though, something like that should cost between 10 and 20 cents a page (in that quantity), depending on the standards for accuracy and on feedability. (If it is 100-page documents in 3-ring binders with no staples, clean edges, and no Post-its, expect it to cost a lot less than 2-page documents covered with stickies and stapled.)
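For budgeting, the per-page range turns into a total like this. A minimal sketch using the 10 to 20 cent figures above; the page count is an example, and a real quote will depend on the condition of the paper as described.

```python
def outsourcing_cost_range(pages, low_cents=10, high_cents=20):
    """Return (low, high) total cost in dollars for a scanning service."""
    # Work in integer cents, convert to dollars at the end.
    return (pages * low_cents / 100, pages * high_cents / 100)


if __name__ == "__main__":
    low, high = outsourcing_cost_range(3000)
    print(f"3000 pages: ${low:.2f} to ${high:.2f}")
```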
I am not plugging my place of employment; it should be easy to find somewhere local.
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
I explored offering this service as part of my consulting practice a few years back. Turned out hardware and software weren't the limitations, even if I could use a simple ADF scanner with reasonably priced temp staff. The limitation was the time to organize it into usable categories, make the file name match the type of content, etc. I'm intrigued by some of the new software such as PaperPort (http://www.scansoft.com/paperport/standard/) that apparently makes the documents searchable, minimizing the need to title and folder each scanned image, but I'm sure it's relying on some kind of OCR and I'm not sure how kind that would be to handwritten docs. As long as it allows a simple interface to tag each scanned image, you're gold.