Slashdot Mirror


Building a Searchable Literature Archive With Keywords?

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."

44 of 211 comments

  1. Document Management Software and OCR by eldavojohn · · Score: 5, Informative
    I think what you are looking for is something called "document management" software. As far as FOSS goes, KnowledgeTree offers a community version that might be up your alley. They have an online demo if you're interested. There's also Alfresco, but I haven't tried either of these.

    From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

    ... where he could archive the PDFs and scanned documents and be able to search by keywords?

    So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.

    Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

    --
    My work here is dung.
    1. Re:Document Management Software and OCR by qoncept · · Score: 4, Funny

      If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

      Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.

      --
      Whale
    2. Re:Document Management Software and OCR by Red+Flayer · · Score: 2, Interesting

      I think what you are looking for is something called "document management" software.

      Ugh... for a second there, I thought Clippy started posting to Slashdot. Would you like help with that?

      I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.

      My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.

      For an indexer, I've heard good things about MPS, and a friend did a similar project to yours with Yaz/Zebra, but he was working with a library, there may have been a special reason for that.

      --
      "Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
    3. Re:Document Management Software and OCR by Shadow+Wrought · · Score: 3, Insightful

      OCR certainly requires work if you need it to be completely accurate. In practice, speaking as a paralegal who's overseen the OCR'ing of millions of pages, it's just not a reasonable expectation. If you can supplement it with coding, in this case keyword tags, date, author, publication and title would build a pretty strong database. If he's looking to do that already, then whatever OCR you get is gravy. Some is better than none.

      --
      If brevity is the soul of wit, then how does one explain Twitter?
    4. Re:Document Management Software and OCR by shaitand · · Score: 2, Insightful

      That depends, if all he needs the OCR for is to build a searchable keyword index then the error rate can be quite high and still get good results. The results of a query should point to the original PDF, not the result of the OCR.
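
      A rough sketch of that idea in Python, assuming the OCR text for each PDF has already been extracted somehow (the file names and sample text below are made up for illustration). It builds a word index that points back at the original PDFs and uses fuzzy matching from the standard library so a query still hits words the OCR slightly mangled:

```python
import difflib
import re
from collections import defaultdict


def build_index(ocr_text_by_pdf):
    """Map each lowercase word to the set of PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(pdf_path)
    return index


def search(index, keyword, cutoff=0.8):
    """Return the original PDFs matching keyword, tolerating small OCR misreadings."""
    close = difflib.get_close_matches(keyword.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for word in close:
        hits |= index[word]
    return sorted(hits)


# Hypothetical sample data: OCR output keyed by the original PDF it came from.
ocr_text_by_pdf = {
    "binder03/smith1998.pdf": "Growth of carbon nanotubes by chemical vapor deposition",
    "binder17/jones2004.pdf": "Carbon nanotnbes as field emitters",  # OCR misread a 'u' as 'n'
    "binder21/lee2001.pdf": "Quantum dots in photovoltaic cells",
}
index = build_index(ocr_text_by_pdf)
print(search(index, "nanotubes"))
```

      The cutoff of 0.8 is a guess; lowering it catches worse OCR at the price of more false hits. Either way the query returns paths to the original PDFs, never the OCR text itself.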

    5. Re:Document Management Software and OCR by burki · · Score: 3, Informative

      For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/

      Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/) is used to generate the searchable PDF files.
      (http://sourceforge.net/forum/forum.php?forum_id=868471)

    6. Re:Document Management Software and OCR by digitalunity · · Score: 2, Interesting

      Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.

      You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDF's until you find what you need.
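
      For what it's worth, here is a minimal sketch of such an electronic card index using Python's built-in sqlite3 module. The table layout, function names and sample entries are my own assumptions, not any existing tool. Each printout gets a serial number, any number of tags, and a query returns the documents that carry all the requested tags:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the catalogue
conn.executescript("""
    CREATE TABLE documents (serial INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags (serial INTEGER NOT NULL REFERENCES documents(serial),
                       tag TEXT NOT NULL,
                       UNIQUE (serial, tag));
""")


def add_document(serial, title, tags):
    """Record one printout under its serial number with any number of tags."""
    conn.execute("INSERT INTO documents VALUES (?, ?)", (serial, title))
    conn.executemany("INSERT INTO tags VALUES (?, ?)",
                     [(serial, t.lower()) for t in tags])


def find_by_tags(*tags):
    """Return (serial, title) for documents carrying every one of the given tags."""
    wanted = sorted({t.lower() for t in tags})
    marks = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""SELECT d.serial, d.title FROM documents d
            JOIN tags t ON t.serial = d.serial
            WHERE t.tag IN ({marks})
            GROUP BY d.serial, d.title
            HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY d.serial""",
        (*wanted, len(wanted)),
    )
    return rows.fetchall()


# Made-up entries; the serial number is whatever gets written on the printout.
add_document(1, "CVD growth of nanotubes", ["nanotubes", "CVD", "synthesis"])
add_document(2, "AFM imaging of nanotubes", ["nanotubes", "AFM"])
add_document(3, "CVD of thin films", ["CVD", "thin films"])
print(find_by_tags("nanotubes", "cvd"))
```

      The serial number returned by the query is what you then go and pull from the binder.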

      --
      You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
    7. Re:Document Management Software and OCR by snspdaarf · · Score: 2, Insightful

      And, since what the Boomers and Sooners did was about 100 years ago, slamming someone for their Slashdot nickname helps with their current problem how? The Chickasaw side of my family has owned the same land since before statehood, and the German side was in on the Land Run, and we don't get all fired up about what the people were doing back then. The past is past. Learn to relax a little.

      --
      Why, without your clothes, you're naked, Miss Dudley!
    8. Re:Document Management Software and OCR by NoobixCube · · Score: 2, Funny

      What he needs is a bunch of undergrads or interns to painstakingly transcribe and proofread every scrap and napkin of text!

      --
      Admit it. You post strawman arguments as AC so you get modded Insightful for refuting them, rather than Troll
    9. Re:Document Management Software and OCR by electrons_are_brave · · Score: 5, Insightful

      As an ex-librarian, I can give you a professional's answer. You need a professional. But if that's not possible, then what you are aiming for is a dream, and a huge data entry task to boot. And you will be creating a system that he will never be able to maintain. Aim lower. Ask him: does he want to keep the paper copies or move them all onto computer? Not both.

      If he wants to keep the paper, it's simple. Weed, weed, weed. 60% of what anyone holds is rubbish, and if it's available online (and I mean in a proper source, not a disappearing link) he'll find it when he needs it. (I'm thinking he can't be using much of it given the difficulty of finding it.) So that will leave you with about 20 three-rings out of the hundreds. Number each document and put them in a filing cabinet by MAIN SUBJECT. If you want to spend your life typing then, by all means, use incite, the word referencing system or some simple library freeware to create a db with author, title, journal etc and main subject (or maybe two).

      If he wants them all digital, same deal. Scan the ones that aren't there. Forget any sort of magic software that will catalogue for you, you crazy dreamer. The best you can do is use incite or some other referencing software to search for and make a record of the ones that have the record available online. And then type the rest in.

      Personally, he sounds like a hoarder, so he will probably resist both suggestions. If this is the case then sort the folders into main subject, type a list (bib reference) and stick it to the front of each. At least that will cut down on his search time, but again, it's a lot of typing.

    10. Re:Document Management Software and OCR by electrons_are_brave · · Score: 2, Insightful

      BTW - my above answer is based on the assumption that he has no money to spend on getting this done.

    11. Re:Document Management Software and OCR by RicRoc · · Score: 2, Informative

      Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
      I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!

      --
      Who?
    12. Re:Document Management Software and OCR by Miseph · · Score: 2, Insightful

      It sounds like "going paperless" is exactly what he wants. It so happens that his office deals in research materials, but that doesn't really change the objective: stop using unwieldy and resource-intensive paper documents in favor of highly indexed digital ones.

      Out of curiosity, have you considered a wikiesque system... autolinking titles and keywords between articles and some sort of glossary (but not the "let's have everybody able to edit it because everything is an opinion and all opinions are equally invalid" communal happy horseshit)? It might not be any more useful to the professor as a research tool, but it could certainly be useful as an educational tool, particularly for undergrads who might not remember every term and principle right off the top of their heads or know off-hand what else they should be looking at to help grok what they're reading.

      --
      Try not to take me more seriously than I take myself.
  2. fox? by SnarfQuest · · Score: 4, Funny

    I'm trying to help drag a professor I work with into the 20th century

    Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?

    --
    Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
    1. Re:fox? by fuzzyfuzzyfungus · · Score: 4, Funny

      PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?

    2. Re:fox? by Hognoxious · · Score: 2, Funny

      Maybe wait for the 22nd. If we're lucky, by then it won't suck. But you may still have to wait for the Hurd port.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  3. Try Papers by matt4077 · · Score: 3, Informative

    Papers is a Mac application that does exactly what you need, and does it very well. Unfortunately it's not web-based and it's Mac only, but you can probably find out there what the right terms to google for are.

  4. Reference management software by ckthorp · · Score: 2, Insightful

    I've had good luck with JabRef, which uses a BibTeX database on the backend, so it integrates very well with LaTeX.

    1. Re:Reference management software by eggy78 · · Score: 2, Interesting

      I never really enjoyed using JabRef, but have had pretty good luck with Aigaion... a little more setup but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTex, etc. although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and upload one or more attachments (the original paper) in whatever format you want. Technically it's an annotated bibliography that supports attachments, but it is pretty solid. One thing to note: We are still using a 1.3.x version of it; we haven't been brave enough or had the time to try the 2.x releases.

    2. Re:Reference management software by joe+155 · · Score: 2, Interesting

      I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive bibtex database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it and then put them in the abstract section of Jabref so that when you use the search function for terms it searches through all the relevant text in the article for what you work on. I've also tried to put down some keywords which are related just to make sure that they're linked with the article. Then if I want to know everything I've ever read on, say, political corruption it's just a search away.

      If you wanted to add papers you've not digitized your notes for then you could put in the references and just a few quick keywords. Papers you don't have you can search through google scholar to find them. It works OK.

      I've also been impressed with Papers for OSX, but Jabref can move systems really easily and is GPL.

      --
      *''I can't believe it's not a hyperlink.''
    3. Re:Reference management software by cahkaylahlee · · Score: 2, Informative

      You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTex formatting. It isn't always 100% accurate, but it does save a lot of data entry time.

  5. Zotero by k2enemy · · Score: 2, Informative

    Zotero might be useful.

    1. Re:Zotero by hnwombat · · Score: 2, Informative

      I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.

      To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$ as little as possible.

      The only real downsides are copying between computers and citation formats.

      Copying is actually easier than it is with the other reference managers I've tried (yeah, I'm talking about you, refworks (bleaargh!) and end note (urrrp!)). You may have to do it more than you would with others, but it's easy to do when you need to. You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.

      There are a lot of citation formats currently available, they just don't happen to be the ones I need. However, there's one close, and the system is designed to be extensible; it's not *that* hard to add your own styles. As soon as I get a round tuit I'll be adding styles for the journals I'll be submitting to, and contribute those back to the project.

      Like I said, really sweet, and free.

    2. Re:Zotero by yes+it+is · · Score: 2, Informative

      Thirded. I've also built my own phrase index on top of zotero using Perl and OpenCalais.

  6. Cheap scanner, expensive OCR software by MartinSchou · · Score: 4, Insightful

    Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Office Jet Pro L7xxx series does this. The Pro L7680 is 200 US$ at Newegg.com.

    Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math heavy, mixed in with images, graphs, tables and personal notes. I don't know any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly transform it all into easily searchable PDF documents.

    That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.

    A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
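
    A very rough sketch of that kind of term counting, in Python. The list of technical terms, the sample text and the threshold of three mentions are all assumptions picked for illustration; real documents would need a proper vocabulary and probably some handling of plurals:

```python
import re
from collections import Counter


def term_counts(text, vocabulary):
    """Count how often each known technical term occurs in text (case-insensitive)."""
    lowered = text.lower()
    counts = Counter()
    for term in vocabulary:
        # Whole-phrase match only, so plurals and other variants are not counted.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def main_topics(text, vocabulary, min_mentions=3):
    """Terms mentioned at least min_mentions times, most frequent first."""
    counts = term_counts(text, vocabulary)
    return [term for term, n in counts.most_common() if n >= min_mentions]


# Hypothetical vocabulary and document text.
vocabulary = ["Casimir effect", "quantum dot", "nanotube"]
paper = (
    "We measure the Casimir effect between gold plates. "
    "The Casimir effect scales with distance. "
    "Unlike a quantum dot, the Casimir effect needs no confinement."
)
print(term_counts(paper, vocabulary))
print(main_topics(paper, vocabulary))
```

    A single passing mention (the quantum dot here) stays below the threshold, so only the repeatedly discussed term ends up as a category.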

    I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.

    Test that page against various OCR software, see what they reproduce as the output. Pick the one that's the best result.
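
    One way to make "the best result" less subjective, sketched in Python: type the test page in by hand once, then score each program's output against it. The engine names and sample strings below are placeholders, and the similarity ratio from the standard difflib module is only a crude stand-in for a proper character error rate:

```python
import difflib


def normalize(text):
    """Collapse runs of whitespace so line breaks do not count as errors."""
    return " ".join(text.split())


def ocr_score(reference, ocr_output):
    """Similarity between 0 and 1 of OCR output against a hand-typed reference."""
    matcher = difflib.SequenceMatcher(
        None, normalize(reference), normalize(ocr_output), autojunk=False
    )
    return matcher.ratio()


def best_engine(reference, outputs):
    """Name of the engine whose output is closest to the reference."""
    return max(outputs, key=lambda name: ocr_score(reference, outputs[name]))


# Placeholder data: one hand-typed sentence and two hypothetical OCR results.
reference = "The Casimir force between two plates scales as 1/d^4."
outputs = {
    "engine_a": "The Casirnir force between two p1ates scales as 1/d^4.",
    "engine_b": "The Casimir force between  two plates\nscales as 1/d^4.",
}
for name, text in outputs.items():
    print(name, round(ocr_score(reference, text), 3))
print("best:", best_engine(reference, outputs))
```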

    And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.

  7. Personal Document Management by steveha · · Score: 3, Interesting

    I am hoping that someone will make a nice personal document management package as free software.

    If you use Windows, you can buy this:

    http://www.nuance.com/paperport/

    The basic features would be:

    • Scan in a document (group multiple pages into a single PDF)
    • Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
    • OCR the documents and provide an index to allow searching
    • Provide a really convenient photocopier feature (scan+print)
    • Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
    • Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.

    In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.

    steveha

    --
    lf(1): it's like ls(1) but sorts filenames by extension, tersely
  8. Summation by Anonymous Coward · · Score: 2, Interesting

    Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good of a search algo as something like Google, as it is purely Boolean...but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR searchable PDFs, throw them into a folder, and set up Google Desktop to only search that folder...you could then essentially "google" the contents of all those PDFs.

  9. Quick and dirty solution by oldhack · · Score: 2, Informative

    Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.

    --
    Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    1. Re:Quick and dirty solution by pete-classic · · Score: 2, Funny

      Maybe you're unfamiliar with three ring binders.

      They're archaic devices used to store non-electronic paper-based documents. You can ask your granddad about them.

      I'm beginning to think these kids today don't realize that the desktop metaphor is . . . a metaphor!

      -Peter

    2. Re:Quick and dirty solution by oldhack · · Score: 3, Funny

      I am a granddad, you insensitive clod.

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
  10. Comment removed by account_deleted · · Score: 4, Interesting

    Comment removed based on user account deletion

  11. Re:Is the material copyrighted? by shaitand · · Score: 2, Insightful

    Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.

    The parent said he is copying parts of texts, not entire books.

  12. Papers for Mac OS X and iPhone by 200_success · · Score: 2, Insightful

    For Mac OS X, try Papers. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.

  13. DevonThink on a Mac by Autumnmist · · Score: 2, Insightful

    If your professor uses a Mac, consider Devonthink by DevonTechnologies.
    http://www.devon-technologies.com/products/devonthink/index.html

    For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...

    I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.

    --
    --- "Many of the truths we cling to depend greatly on our own point of view." ~ Ben Kenobi, 'Return of the Jedi'
  14. So, what I think you're asking for is... by Basilius · · Score: 4, Informative

    ...something like this:

    1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.

    2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.

    3. Full-text search isn't as important (but would be useful if available).

    If that's the case, I'm thinking Alfresco might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.

    I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.

    I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.

    And, if a tool like that exists, could someone point me to it, please?

  15. wow.. by way2trivial · · Score: 3, Insightful

    2 years? ago I bought for my small business a Fujitsu fi-5120C duplex scanner--it came with Adobe Acrobat
    I scan every bill, correspondence, notice, and everything to pdf- then I throw it the hell away.

    the version of acrobat included does OCR-I open acrobat, choose create pdf from scanner, and scan away.
    I can mix a scan job up between B&W & color or duplex or simplex within one job
    I can open an existing PDF and append to it

    I save everything to an infrant nas box.

    I can go to windows search, type in 1179.21 (actually did this one once)
    set to look INSIDE the files of that directory and get results that include
    a soda delivery notice, a soda invoice, and my bank statement where I paid it off

    they have other model scanners that combine sheetfed+flatbed...

    here is a beauty
    http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html

    --
    every day http://en.wikipedia.org/wiki/Special:Random
  16. Suggestion by vondo · · Score: 3, Insightful

    I wrote and maintain a project to do this:

    http://sourceforge.net/projects/docdb-v/

    "DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."

    It's intended for collaborations, but groups from 5 to 500 use it.

  17. Digital Archive software by who's+got+my+nicknam · · Score: 2, Insightful

    What you are looking for is a proper archiving application. I suggest ICA-AtoM. Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA-AtoM supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.

    --
    "Apparatus dignosco occultus, satis non supernus."
  18. Why Not To by DynaSoar · · Score: 4, Insightful

    There are at least two reasons the professor's method is beneficial:

    1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.

    2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.

    Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.

    --
    "I may be synthetic, but I'm not stupid." -- Bishop 341-B
  19. Zotero, Mendeley by pesho · · Score: 2, Informative
    You should try Zotero or Mendeley.

    Zotero is a Firefox extension that can grab research papers directly from the journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.

    Mendeley is a standalone application. I haven't tried it yet, but it seems to have very similar functionality.

  20. Re:Is the material copyrighted? by shaitand · · Score: 2, Interesting

    According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.

    107
    Permits the "fair use" of an owner's work without permission - for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.

    108
    Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan - all without permission

    109
    Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.

    110
    Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.

    121
    Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities

    http://www.copyright.gov/title17/92chap1.html#107

    The copyright act section 107. This section lists many cases of fair use but gives 4 primary criteria for courts to consider. The first is the purpose of the work and makes it clear that non-profit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).

  21. I wrote a few articles about this by nbauman · · Score: 2, Insightful

    I wrote a few articles about this for Law Office Computing magazine
    http://www.nasw.org/users/nbauman/txtsrch.htm
    http://www.nasw.org/users/nbauman/lawdb.htm
    http://www.nasw.org/users/nbauman/discover.htm
    It was a long time ago, the software and hardware has changed, but the concepts are still the same, and the costs are a lot less.

    Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is PubMed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.

    Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
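
    To make that concrete, here is a small Python sketch of such a record: one field for the looseleaf binder, one for the tab, a few bibliographic fields, and keywords that are checked against a controlled vocabulary. The field names, the vocabulary and the sample entries are all invented for illustration; the real list of keywords has to come from the researcher:

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; the real list would come from the professor.
VOCABULARY = {"nanotubes", "thin films", "afm", "synthesis"}


@dataclass
class Record:
    binder: str      # label on the looseleaf binder
    tab: str         # tab within that binder
    author: str
    year: int
    citation: str
    keywords: set = field(default_factory=set)

    def __post_init__(self):
        # Normalize keywords, then refuse anything outside the controlled vocabulary.
        self.keywords = {k.strip().lower() for k in self.keywords}
        unknown = self.keywords - VOCABULARY
        if unknown:
            raise ValueError(f"not in controlled vocabulary: {sorted(unknown)}")


def search(records, **criteria):
    """Return records matching every criterion; 'keyword' tests keyword membership."""
    keyword = criteria.pop("keyword", None)
    hits = []
    for r in records:
        if any(getattr(r, name) != value for name, value in criteria.items()):
            continue
        if keyword is not None and keyword.lower() not in r.keywords:
            continue
        hits.append(r)
    return hits


# Made-up entries with placeholder citations.
records = [
    Record("Binder 12", "AFM", "Jones", 2004, "Example Journal 2, 10", {"AFM", "nanotubes"}),
    Record("Binder 03", "Growth", "Smith", 1998, "Example Journal 1, 1", {"synthesis", "nanotubes"}),
]
for r in search(records, keyword="nanotubes", year=2004):
    print(r.binder, r.tab, r.author, r.citation)
```

    Rejecting keywords that are not on the list is what keeps "nanotubes" and "nano tubes" from quietly becoming two different tags.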

    I assume he doesn't have the PDFs any more. That would have made it a lot easier.

    It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.

    He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.

    Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.

    Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.

    You might start by estimating the number of pages and documents you have.
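    Once you have those counts, the arithmetic is simple. Here is a rough sketch using the figures above as defaults (about 20 hours per thousand records of manual entry, up to 1 minute a page on a cheap scanner). The function names are made up, and the rates are guesses you should replace with your own timings from a small test batch.

```python
def entry_hours(records, hours_per_thousand=20.0):
    """Estimated hours of manual field entry for the given number of records."""
    if records < 0:
        raise ValueError("records must be non-negative")
    return records * hours_per_thousand / 1000.0


def scan_hours(pages, pages_per_minute=1.0):
    """Estimated hours of scanning at a steady pages-per-minute rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute / 60.0


if __name__ == "__main__":
    # Hypothetical archive size; substitute your own estimate.
    records, pages = 1000, 6000
    print(f"data entry: about {entry_hours(records):.0f} hours")
    print(f"scanning:   about {scan_hours(pages):.0f} hours at 1 page/minute")
```

    With ten times as many records, the entry estimate alone grows to about 200 hours, which is the point where skipping the scanning step starts to look attractive.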

    But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.

    If anybody knows of up-to-date articles on this subject, I'd love to know the citation.

  22. The bigger problem is not OCR, it's which scanner. by Futurepower(R) · · Score: 2, Informative

    We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.

    The big attraction of the fi-6130 is its speed: 40 pages per minute.

    If you are interested, I suggest you download the manual. (PDF)

    The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.

    The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.

    The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.

    It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed matters with a flatbed scanner too, but less, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together without any logical connection, since apparently the 6230 has two imaging elements.

    Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.

    Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.

  23. Re:The bigger problem is not OCR, it's which scann by Angstroman · · Score: 2, Informative

    I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handling group separator sheets. I highly recommend it. I have seen no evidence of "marketing drone foolishness."