Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).
... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.
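A minimal sketch of that "getting a doc into the system" step, assuming the Tesseract command-line tool is installed and the scans are saved as TIFF files. The function names and folder layout are made up for illustration; the only real interface used is the `tesseract <image> <outputbase> -l <lang> pdf` invocation, which asks Tesseract for a PDF with a searchable text layer.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build the Tesseract call that writes a searchable PDF.

    Tesseract appends ".pdf" to out_base on its own.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_folder(scan_dir: Path, out_dir: Path) -> list[Path]:
    """OCR every .tif in scan_dir and return the PDF paths Tesseract should write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for image in sorted(scan_dir.glob("*.tif")):
        out_base = out_dir / image.stem
        # check=True raises if Tesseract exits with an error for this page.
        subprocess.run(tesseract_command(image, out_base), check=True)
        produced.append(out_dir / (image.stem + ".pdf"))
    return produced
```

Something like `ocr_folder(Path("scans"), Path("searchable"))` would then be the one manual step between the scanner and whatever archive you pick.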
Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
My work here is dung.
I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
Papers is Mac software that does exactly what you need, and does it very well. Unfortunately it's not web-based and it's Mac-only, but looking at it will probably tell you the right terms to google for.
Fleur de Sel
Most high-end consumer all-in-one printers come with an automatic document feeder (ADF) capable of handling most types of paper, as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Officejet Pro L7xxx series does this. The Pro L7680 is US$200 at Newegg.com.
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math-heavy, mixed in with images, graphs, tables and personal notes. I don't know of any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
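A rough sketch of the term-counting idea, assuming the text has already been pulled out of the PDF and that you supply your own list of technical terms. The names `mentions` and `suggest_tags`, and the cutoff of three mentions, are illustrative choices and not anything taken from an existing tool.

```python
from __future__ import annotations

import re


def mentions(text: str, term: str) -> int:
    """Count whole-phrase, case-insensitive occurrences of term in text."""
    # Collapse whitespace so a phrase split across a line break still matches.
    flat = " ".join(text.lower().split())
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, flat))


def suggest_tags(text: str, vocabulary: list[str], min_mentions: int = 3) -> list[str]:
    """Return the vocabulary terms the document mentions at least min_mentions times."""
    return [term for term in vocabulary if mentions(text, term) >= min_mentions]


if __name__ == "__main__":
    sample = (
        "We measure the Casimir effect between gold plates. The Casimir\n"
        "effect scales with separation. Unlike a nanotube probe, the\n"
        "Casimir effect here dominates."
    )
    print(suggest_tags(sample, ["Casimir effect", "nanotube"]))
```

A plain count like this only separates a passing mention from a recurring topic. Weighing citations of known articles, as suggested above, would need more than this.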
I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image format (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software and see what each one produces as output. Pick the one that gives the best result.
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
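One way to put a number on "the best result" in the experiment above, assuming you type up the test page by hand as a reference and save each OCR package's output as plain text. This is only a sketch: the similarity measure is the standard library's `difflib.SequenceMatcher` ratio, swapped in as a stand-in for a proper character error rate, and the engine names in the example are placeholders.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Collapse runs of whitespace so line-wrapping differences are ignored."""
    return " ".join(text.split())


def ocr_similarity(reference: str, ocr_output: str) -> float:
    """Score from 0.0 to 1.0; 1.0 means the OCR text equals the reference."""
    matcher = SequenceMatcher(None, normalise(reference), normalise(ocr_output), autojunk=False)
    return matcher.ratio()


def rank_engines(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Sort OCR engines from best to worst match against the reference."""
    scores = {name: ocr_similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "The Casimir force between two plates"
    outputs = {
        "engine_a": "The Casimir force between two plates",
        "engine_b": "The Cas1mir f0rce betwecn tvvo plates",
    }
    for name, score in rank_engines(reference, outputs):
        print(f"{name}: {score:.3f}")
```

A single page is a small sample, so it is probably worth repeating this on a few pages with equations and tables before buying anything.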
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/
The basic features would be:
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco might be what you're looking for. It's a multi-platform, open source, Java-based content repository that supports document tagging (and loads, loads more). It's relatively easy to use right out of the box, and it has a CIFS interface, so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
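For what it's worth, the core of the tool described above (point it at a root folder, tag files, search by any combination of tags and/or filename) fits in a few dozen lines. The following is a bare sketch using Python's built-in `sqlite3` module; the schema and every function name are invented for illustration and do not come from any existing package.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS tags (
    doc_id INTEGER NOT NULL REFERENCES docs(id),
    tag TEXT NOT NULL,
    UNIQUE (doc_id, tag)
);
"""


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn: sqlite3.Connection, path: str) -> None:
    """Register one file; adding the same path twice is a no-op."""
    conn.execute("INSERT OR IGNORE INTO docs (path) VALUES (?)", (str(path),))
    conn.commit()


def import_folder(conn: sqlite3.Connection, root: str) -> None:
    """Register every PDF found anywhere under root."""
    for path in sorted(Path(root).rglob("*.pdf")):
        add_document(conn, str(path))


def tag(conn: sqlite3.Connection, path: str, *tags: str) -> None:
    """Attach any number of tags (stored lower-case) to a registered file."""
    row = conn.execute("SELECT id FROM docs WHERE path = ?", (str(path),)).fetchone()
    if row is None:
        raise KeyError(path)
    conn.executemany(
        "INSERT OR IGNORE INTO tags (doc_id, tag) VALUES (?, ?)",
        [(row[0], t.lower()) for t in tags],
    )
    conn.commit()


def search(conn: sqlite3.Connection, *tags: str, name_contains: str = "") -> list[str]:
    """Return paths carrying ALL given tags whose path contains name_contains."""
    wanted = sorted({t.lower() for t in tags})
    sql, params = "SELECT d.path FROM docs d", []
    if wanted:
        marks = ",".join("?" * len(wanted))
        sql += f" JOIN tags t ON t.doc_id = d.id AND t.tag IN ({marks})"
        params += wanted
    # Note: % and _ inside name_contains act as LIKE wildcards in this sketch.
    sql += " WHERE d.path LIKE ? GROUP BY d.id"
    params.append(f"%{name_contains}%")
    if wanted:
        # A document matches only if every requested tag was found on it.
        sql += " HAVING COUNT(DISTINCT t.tag) = ?"
        params.append(len(wanted))
    sql += " ORDER BY d.path"
    return [row[0] for row in conn.execute(sql, params)]
```

Usage would be along the lines of `conn = open_index("library.db")`, `import_folder(conn, "/home/prof/papers")`, then `tag(...)` and `search(conn, "casimir", "mems")`. A clickable list of hits and a web front end are left out entirely.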
2 years? ago I bought a Fujitsu fi-5120C duplex scanner for my small business--it came with Adobe Acrobat
I scan every bill, correspondence, notice, and everything to PDF- then I throw it the hell away.
the version of Acrobat included does OCR-I open Acrobat, choose create PDF from scanner, and scan away.
I can mix a scan job up between B&W & color or duplex or simplex within one job
I can open an existing PDF and append to it
I save everything to an Infrant NAS box.
I can go to Windows search, type in 1179.21 (actually did this one once)
set to look INSIDE the files of that directory and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off
they have other model scanners that combine sheetfed+flatbed...
here is a beauty
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html
every day http://en.wikipedia.org/wiki/Special:Random
I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
I am a granddad, you insensitive clod.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B