Slashdot Mirror


Preserving Old Research Notes and Documents?

twistedcubic asks: "I have several thousand 8.5 x 11 inch dead tree pages of notes and research that take up too much storage space. I would like to have all these notes scanned into PDF files (for example) so I can recycle the pages and reclaim storage space. Does anyone know of a store that provides this service, or an inexpensive machine that will do the job in a reasonable amount of time?"

30 of 101 comments

  1. In a few months time... by Anonymous Coward · · Score: 5, Funny
    In a few months time... coming as a duplicated story post...

    "I have several thousand PDF files taking up too much disk storage space. I would like to have all these files printed on to 8.5 x 11 inch dead tree pages of notes so I can delete the files, empty the recycle bin and reclaim storage space. Does anyone know of a store that provides this service, or an inexpensive machine that will do the job in a reasonable amount of time?"

    For future reference, I suggest a printer.

    --BladeMelbourne

    1. Re:In a few months time... by sribe · · Score: 2, Informative

      Check out the Fujitsu ScanSnap. It's their lowest-end document scanner, but it's still faster than all the slow consumer-level junk, and it comes with a version of Acrobat that will OCR the images and put the text in a "hidden" layer for searching.

  2. Easy by ptaff · · Score: 3, Insightful

    10-year-old nephew and a scanner.

  3. Dee, dee, dee... by atomic-penguin · · Score: 2, Insightful

    Filing cabinet.

    --
    /^([Ss]ame [Bb]at (time, |channel.)){2}$/
  4. searchable db? by Bluntzilla · · Score: 2, Interesting

    I would try to convert the pages to some sort of text format to allow searching...

  5. Re:Not the ideal solution, but a start.. by NanoGator · · Score: 3, Informative

    Sorry to reply to my own post, but I felt bad about the unhelpfulness of my previous comment. I headed over to Visioneer's site (www.visioneer.com) and found a few scanners that handle like 25 pages at a time. The more you spend, the faster it scans. Sorry, I cannot personally recommend a scanner in particular. Never had one like this.

    Good luck!

    --
    "Derp de derp."
  6. Buy a scanner with an ADF by zhiwenchong · · Score: 3, Insightful

    ADF (Automatic Document Feeder) scanners are fairly pricey (good ones are in the US$400 - US$1000 range), but you can get a cheapie Brother MFC-3240C All-In-One (C$140) that has a 20-page document feeder and then get a slave (e.g. some grad student) to feed in your pages for you.

    My Brother MFC-3240C scanner comes with the PaperPort application, which generates PDFs and supports double-sided scanning even though the scanner doesn't support it. (You just flip over the whole stack once you've scanned one side, and start scanning the other side. PaperPort knows how to automatically reconcile the pages.)
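    For the curious, the page reconciliation could look roughly like the sketch below. This is a guess at the general idea, not how PaperPort actually does it. It assumes that flipping the whole stack makes the back sides come out in reverse order, and the function name merge_duplex is made up for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_duplex(fronts: Sequence[T], backs_as_scanned: Sequence[T]) -> List[T]:
    """Interleave two single-sided scan passes into reading order.

    Assumption: the stack was flipped as a whole before the second pass,
    so the back of the LAST sheet was scanned first.
    """
    if len(fronts) != len(backs_as_scanned):
        raise ValueError("front and back passes must have the same page count")

    # Undo the reversal caused by flipping the stack.
    backs = list(reversed(backs_as_scanned))

    merged: List[T] = []
    for front, back in zip(fronts, backs):
        merged.append(front)
        merged.append(back)
    return merged


if __name__ == "__main__":
    # Three sheets: fronts are pages 1, 3, 5; backs arrive as 6, 4, 2.
    print(merge_duplex(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
```

    If a sheet double-feeds in either pass, the counts won't match and the sketch refuses to merge rather than silently misordering everything.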

    If you have Acrobat Professional, you can do a Paper Capture(TM) which is basically doing an OCR on the PDF and then storing the recognized words as "keywords" so that the PDF is searchable via Spotlight or other indexing mechanisms.

    A document scanner is indeed a very useful piece of equipment -- I use it to scan notes and scrap paper containing rough ideas, often with lots of mathematics. Sometimes writing stuff on paper is just easier than typing in LaTeX...

    The eminent computer scientist Edsger Dijkstra also liked to write stuff using pen and paper. His digitized works, called EWDs (after his initials, Edsger Wybe Dijkstra) are available here:
    http://www.cs.utexas.edu/users/EWD/

    1. Re:Buy a scanner with an ADF by jd · · Score: 2, Insightful
      That would be the best method, but I would seriously question the wisdom of PDF files. Although they represent documents fairly well, the format is too proprietary and too variable to be safe. You want the baseline documents to be in a format you can read at ANY time in the future, not just three weeks down the road.


      The merger of Adobe and Macromedia, the constant toying with DRM schemes, the allowing of unsafe code in current Adobe formats, etc., make format choice as vital as scanner choice.


      A good example of this was the use of LaserDiscs for the 1980s survey of Britain to commemorate the Domesday Book. The Domesday Project is now unusable on anything but a very small number of machines, because they weren't adequately careful.


      Oh, and disks are also an important decision. Do NOT go with Blu-Ray or HD-DVDs, because these formats are fighting a battle to the death. One will win and whoever uses the other will end up with media no future computer will be able to read.


      It is interesting to note that Papyrus documents with iron oxide inks have proven the most durable of all written media. More modern papers are designed (quite intentionally) to fail in a fraction of the time, as are modern inks. Durability is expensive, and cheap sells.


      The same is true of electronic and optical media. The "silver" alumin(i)um CDs are much less durable than the "gold" disks, but both will fail in the space of decades even if kept well. If kept poorly, the surface will not just scratch, it'll peel off within a few months. (I know from experience.)


      In comparison, the old magnetic "core" memories were pretty much guaranteed to hold data for a century or two.


      Assuming you don't want to keep re-copying the notes, you want to pick formats and media that meet the sort of timescale they'll potentially remain important - plus 10%. Where a note may be of historical usefulness (and nobody can really predict those in advance), you want to pick a format and a medium that is as durable as you can reasonably afford to invest in.


      Even where the notes are relatively trivial, YOU may want to read them later, and virtually no format in existence today has lasted for very long in comparison to a human lifespan. Indeed, computers themselves have not existed for long, in comparison to a human lifespan.


      I pity those scientists who may still have important logs on 8" floppies or drum hard drives. They're not going to find it easy to retrieve the data now, even if the data is still there to retrieve. And whilst the CIA probably has forensics to read ancient magnetic storage systems with decayed data, I doubt they'd loan the machines to careless researchers, even if the researchers had the sorts of money you'd need to hire such equipment and the data was valuable enough for them to spend the money.


      In other words, don't digitize (or file) for the sake of doing so. Think about when you would want the information and pick a technology that you can be confident will exist THEN (and preferably now as well).

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    2. Re:Buy a scanner with an ADF by fm6 · · Score: 3, Insightful
      That would be the best method, but I would seriously question the wisdom of PDF files. Although they represent documents fairly well, the format is too proprietary and too variable to be safe. You want the baseline documents to be in a format you can read at ANY time in the future, not just three weeks down the road.

      I'm not a big fan of PDF, at least as it's commonly used. (It's essential for prepress applications, but it's most commonly used for online document sharing, an application for which it sucks.) So I hate to disagree with a fellow PDF-hater. But your arguments against using it are nonsense.

      Technically, yes, PDF is a proprietary format, but it's a well-documented, widely licensed one. Really, it's just PostScript with a few organizational elements. Both PostScript and PDF have many third-party implementations, including one that's available under the GPL.

      The merger of Adobe and Macromedia, the constant toying with DRM schemes, the allowing of unsafe code in current Adobe formats, etc., make format choice as vital as scanner choice.

      I don't see what the merger with Macromedia has to do with anything. DRM would be an issue if Adobe was the only source for PDF software -- but it's not.

      A good example of this was the use of LaserDiscs for the 1980s survey of Britain to commemorate the Domesday Book. The Domesday Project is now unusable on anything but a very small number of machines, because they weren't adequately careful.

      Hindsight is all very well -- but what format would you have chosen? Floppy disks would have been too expensive, CDs didn't exist yet. If it had been up to me, I would have chosen 9-track mag tape -- and I would have been wrong. (I still have a 9-track tape containing a backup of my student files, and no way to read it!) In any case, that mistake had to do with a choice of hardware. It's a lot easier to recreate old software than old hardware.

      I'll skip past all your other hardware examples (papyrus???) and skip to...

      In other words, don't digitize (or file) for the sake of doing so....

      What, you think this is some kind of whim? If these documents are at all important, he has to bring them online. As long as they exist only in dead tree form, they are awkward to access, expensive to store, and run the risk of being lost in day-to-day use, to say nothing of the odd natural disaster.
    3. Re:Buy a scanner with an ADF by pipingguy · · Score: 2, Interesting


      If you have Acrobat Professional, you can do a Paper Capture(TM) which is basically doing an OCR on the PDF and then storing the recognized words as "keywords" so that the PDF is searchable via Spotlight or other indexing mechanisms.

      Maybe I'm mistaken, but doesn't Google index PDFs? If that's the case, you can just upload it to a website and wait for it to be crawled for later searching.

      That doesn't really help with the scanning problem though. Parent's solution of slave usage might be best.

    4. Re:Buy a scanner with an ADF by sribe · · Score: 4, Insightful

      That would be the best method, but I would seriously question the wisdom of PDF files. Although they represent documents fairly well, the format is too proprietary and too variable to be safe. You want the baseline documents to be in a format you can read at ANY time in the future, not just three weeks down the road.

      Bull. PDF is completely open and is not going away. To get the specs you merely have to download them for free from Adobe's web site. There are multiple open-source implementations of PDF readers. Although Adobe is adding features all the time, the basic format that would be used for storing scanned images has been stable and forward-compatible for years and years. There are multiple court systems which have designated PDF as the format for filing, storing, and archiving court records. There is work on an official national standard for long-term archiving of records in PDF format. (PDF/A specifies things like: the PDF must embed the fonts used, and so on, to ensure that it will be portable across OSes and decades.)

      ...the constant toying with DRM schemes...

      A flaming example of a red herring. Your scanner software is not going to create a PDF with any DRM unless you tell it to. And some future version of your PDF reader is not going to suddenly refuse to read non-DRM'd files.

      The "silver" alumin(i)um CDs are much less durable than the "gold" disks, but both will fail in the space of decades even if kept well.

      Most "gold" CDs are merely "silver" CDs with a gold-colored label on the top. It's not even clear that the gold vs aluminum reflective layer is a real issue. But the dye type does matter, hugely.

  7. OCR probably not the way to go by Nutria · · Score: 4, Insightful
    OCR is no match for eyeballs. You'd spend so much time editing it for slight errors, it wouldn't be worth your time.

    Are the notes graphics-heavy (i.e., scientific/engineering)?

    If not, give it to a typing service. Once you show them how much "stuff" you have, I'm sure they'll give you a discount. They might even agree to use OpenOffice2 (because it handles huge documents well, the files are small, and it has an excellent PDF exporter).

    You'd still have to scan in the pictures/drawing/graphs, and place them appropriately, which will take time.

    Also, there are firms that specialize in digitizing paper documents (mostly forms and regularized documents for businesses). Depending on the amount of handwriting & graphics, it might not be appropriate, though.

    All in all, no matter how you do it, the project will
    • take a long time
    • cost a lot of money
    --
    "I don't know, therefore Aliens" Wafflebox1
  8. Scan to PDF with OCR behind the image by fatboy-fitz · · Score: 2, Informative

    There are companies that will do this for you. For example, IMC in WV (http://www.imcwv.com/). They can scan it all to PDF using the image as what you see in the PDF backed up with the OCR'd text. That way the document is somewhat searchable, but you always see the exact scan of the doc when you look at the PDF.

    --
    I'm better, because I'm bigger
  9. The unorthodox method by UnapprovedThought · · Score: 2, Funny
    1. Climb to the top of a tall building
    2. Find the side that is closest to the parking lot
    3. Shake all of the pages out
    4. Have an assistant below shoo away potential meddlers
    5. Pull out your 12Mpixel camera
    6. Take several pictures as the papers flip end-over-end
    7. ??? (do some really amazing 3-D stuff with GIMP)
    8. Convert pictures to PDF
  10. imDex by cstew · · Score: 3, Informative

    Disclaimer: I used to work for this company as a co-op student.

    I would contact PRG Schultz as they have done this for large clients in the past. They have a program called imDex which is pretty slick. Basically, it's a searchable, cross-indexable database, so you'll have OCR'd text, along with TIFFs or PDFs of the documents. If you would like more information, let me know.

  11. What are you going to store them on? by the+eric+conspiracy · · Score: 2, Informative

    The problem is then you have to come up with a safe long term way to store digital data.

    Clue:

    There isn't one.

    The best thing to do is NOT convert the paper to digitized format. Find some space instead, and store the paper. Your data will be much safer.

    1. Re:What are you going to store them on? by Hydroksyde · · Score: 2, Insightful

      You can easily make backups of data on a computer. You could put multiple copies in many places, all around the country or even all around the world. But paper has this annoying habit of losing data easily when it is burned or made wet, and there goes your only copy. If the world trade centre were full of paper, the disaster would have had a much greater impact economically.

    2. Re:What are you going to store them on? by aminorex · · Score: 3, Informative

      Not unless the notebooks in question were made of acid-free archival paper. I've seen cheap paper falling apart in 5 years, irrecoverable in 10. Phase-change media, like CD-RW, will easily outlast my children.

      --
      -I like my women like I like my tea: green-
    3. Re:What are you going to store them on? by cfavader · · Score: 3, Interesting

      The fact of the matter is, documents on paper are not nearly as available as electronic copies. Hell, you could let thousands of people read all those documents at once for just a tiny amount of money in bandwidth costs (unless you have a university host it for free, which I'm sure they will). For most of us, this accessibility is easily worth keeping a backup of the data, even if it also requires us to store it on new media as time goes on (i.e. switch from floppies to CD-Rs to DVD-Rs to whatever every 5-10 years).

    4. Re:What are you going to store them on? by jhoger · · Score: 2, Insightful

      How do you store paper in a long term way without copying it? Clue: there isn't one.

      You have to copy EVERYTHING to new media eventually. You need to have a plan, and you need to execute it. Simple as that. Paper will disintegrate, and yes, hardware will become obsolete. You just need to progress to the stone in the river before the current one is submerged.

      But which is easier/cheaper to propagate to new media and make backup copies of? Digital data in open, documented, implemented formats, or paper? Which is cheaper and easier to store?

      There's also the argument that computers become obsolete. Well, yeah... but I think you would have a hard time finding many computers in the last 25 years that don't have a software emulator around. All you need to do is archive an, ideally, open source emulation of the machine that implements the software, and fire it up to transfer the stuff to the next machine when it becomes necessary.

      The only real impediment to the survival of data is that it becomes uninteresting and is therefore not actively maintained.
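      As a rough illustration of the "copy everything to new media" plan, here is a minimal sketch that records a SHA-256 checksum for every file before a migration and checks the copy afterwards. The function names and the manifest layout are assumptions made up for this example; only the hashlib and pathlib calls come from the standard library.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_problems(old_manifest: Dict[str, str], new_root: Path) -> List[str]:
    """List files that are missing or altered on the new medium."""
    new_manifest = build_manifest(new_root)
    problems: List[str] = []
    for name in sorted(old_manifest):
        if name not in new_manifest:
            problems.append(f"missing: {name}")
        elif new_manifest[name] != old_manifest[name]:
            problems.append(f"changed: {name}")
    return problems
```

      Build the manifest on the old medium, copy, then run find_problems against the new one. An empty list means every file arrived intact, and only then is it safe to retire the old copy.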

  12. Go low tech? by andreMA · · Score: 2, Informative
    If you just want to have it to refer to very infrequently and (possibly) print a page, look into having it filmed as microfiche. Viewers are fairly cheap and in a pinch a strong lens (loupe, possibly) will do.

    Many libraries will have reader-printers that for a small fee (eg, $0.20/page?) you can print a copy.

    Most of the expense with fiche is the production of the silver halide original; diazo copies are relatively cheap. If it's really important to you, have a copy made and lock the original film in a safe deposit box (or at least offsite)

  13. simple by urdine · · Score: 2, Funny

    Print them in a book and wait for Google Print to scan them all for you.

  14. hylafax by np_bernstein · · Score: 4, Interesting

    Go buy a modem and grab an old fax machine, then fax the documents to yourself. You should be able to fax a decent number of pages at a time, and you can walk away and leave it running. The pages will be saved as multi-page TIFFs, which, while neither PDFs nor searchable, at least solve part of your problem.

    --
    RandomAndInteresting.com defending the world from stupidity since 1979
  15. A store by XCorvis · · Score: 2

    Try a legal copyist.

  16. Re:Microfilm! by theonetruekeebler · · Score: 2, Insightful
    How are you planning to store microfilm for a century

    In a drawer or filing cabinet.

    and what are guarantees that it'll actually stay preserved for that long?

    Wet-film microfilm has an estimated survivability of 500 years in ideal conditions and a minimum of 100 years in any reasonable conditions. To my knowledge this exceeds the lifetime of any digital medium.

    It's fairly trivial to store redundant copies of your digital files, even in multiple locations worldwide. The costs are minimal too.

    It's fairly trivial to store redundant copies of your microfilm, even in multiple locations worldwide. The costs are minimal too.

    --
    This is not my sandwich.
  17. Re:Lots of Work by justforaday · · Score: 2

    You don't want to spend money on physical storage, yet you're asking about a service that will do the job of scanning for you? Here's a hint: for the cost of hiring someone to do this job for you, you can rent a small room at a self-store place for 15-20 years.

    --
    I'll turn into a supernova and burn up everything. Well I'll turn into a black little hole and you'll turn into string.
  18. It's possible... by NemoX · · Score: 2

    This is what we do at work. We spent about $5000 on the setup, but remember that this is an enterprise where we scan about 125,000 pages to .pdf a month. It is probably possible for about $500 or so for what you are looking at (oh, and some programming).

    First, you'll need a low-volume scanner. (Check the duty cycle to make sure it can handle your bookshelf of papers.) Then, you'll need something to convert the images to PDF. If you have any programming experience, write a quick app that uses ImageMagick (http://www.imagemagick.org/) to convert from TIFF to PDF. Put each binding in its own folder, and pretend the "untitled1.pdf" says "page1.pdf" ;)
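    A quick app along those lines might look something like this sketch. It assumes ImageMagick's convert command is on the PATH (newer installs may call it magick instead), and the function names and the default output name binding.pdf are invented for the example.

```python
import re
import subprocess
from pathlib import Path
from typing import List


def natural_key(path: Path) -> list:
    """Sort key that puts untitled2.tiff before untitled10.tiff."""
    return [
        int(token) if token.isdigit() else token.lower()
        for token in re.split(r"(\d+)", path.name)
    ]


def build_convert_command(folder: Path, output_name: str = "binding.pdf") -> List[str]:
    """Build the ImageMagick command that merges a folder of TIFFs into one PDF."""
    pages = sorted(
        (p for p in folder.iterdir() if p.suffix.lower() in {".tif", ".tiff"}),
        key=natural_key,
    )
    if not pages:
        raise ValueError(f"no TIFF pages found in {folder}")
    # `convert page1.tiff page2.tiff ... out.pdf` writes a multi-page PDF.
    return ["convert", *(str(p) for p in pages), str(folder / output_name)]


def convert_folder(folder: Path) -> None:
    """Run the conversion for one binding; raises if ImageMagick fails."""
    subprocess.run(build_convert_command(folder), check=True)
```

    Calling convert_folder once per binding folder gives one PDF per binding. The natural sort matters because a plain alphabetical sort would put page 10 ahead of page 2.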

    If you want to get fancier, have the front-end app rename untitled1.tiff to whatever you'd like. Also, you can embed extra information into the PDF by using metadata and the Adobe XMP SDK (free download from Adobe). Make the metadata like:
    TITLE="My Book"
    AUTHOR="Bart Simpson"
    etc.
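
    To feed fields like these into a script, one option is to keep them in a small text file next to each binding and parse it. The sketch below only reads KEY="value" lines into a dictionary; it does not touch the XMP SDK, and the function name parse_metadata and the idea of a sidecar file are assumptions for illustration.

```python
import re
from typing import Dict

# Matches lines such as TITLE="My Book"; anything else is ignored.
_FIELD_LINE = re.compile(r'^\s*([A-Za-z_]\w*)\s*=\s*"(.*)"\s*$')


def parse_metadata(text: str) -> Dict[str, str]:
    """Collect KEY="value" lines into a dict with lower-cased keys."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = _FIELD_LINE.match(line)
        if match:
            fields[match.group(1).lower()] = match.group(2)
    return fields


if __name__ == "__main__":
    sample = 'TITLE="My Book"\nAUTHOR="Bart Simpson"\n'
    print(parse_metadata(sample))
```

    Whatever tool ends up writing the metadata into the PDF can then take the dictionary as input.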

  19. Re:Maybe you should try djvulibre by twistedcubic · · Score: 2, Informative

    Dude! I already found a $100 scanner that does the job and works in Linux (HP officejet 4215). It scans really fast. My only problem up till now was that PDF rendering was too slow. But then I compared the results to DJVU... Wow! The DJVU files render incredibly fast! Thanks!

  20. DjVu, not PDF by TeXMaster · · Score: 2, Informative
    There is a file format which is specifically created for this kind of stuff, and it's called DjVu. There is a free (as in open source) reference library, and proprietary tools by LizardTech.

    (Of course, you will still need to spend lots of time scanning, naming and classifying those pages. The ADF and 10yo nephew suggested in another post might be useful for that.)

    DjVu offers very compact representation without the need to OCR the document (I've converted a 13 megs scanned PDF into a 600K DjVu which was much faster and easier to read), and optionally a "hidden text layer" if you want to OCR it to make it searchable.

    --
    "I'm never quite so stupid as when I'm being smart" (Linus van Pelt)
  21. Re:Microfilm! by theonetruekeebler · · Score: 2, Insightful
    I have not "tried to argue" that copying 100 microfilms costs the same as copying 100 sets of bits. That's inane. What I have argued is that if this data is important enough to preserve for a century, it should be archived to a non-digital medium. And after the initial transfer, the cost of duplicating a master film is...

    Ah, fuck it. I'm tired of doing your research for you. You log in as an AC, then expect a legitimate user to Google "lifetime of microfilm" and "cost of microfilm transfer" because you're too sorry to educate yourself. I no longer see any benefit in changing the relationship between my knowledge and your ignorance.

    The only reason to use Slashdot as an Anonymous Coward is if you would be fired, arrested, or sued for your post.

    --
    This is not my sandwich.