Building a Searchable Literature Archive With Keywords?

Posted by timothy on Wednesday April 8, 2009 @08:20AM from the must-be-in-here-somewhere dept.

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."

16 of 211 comments

Document Management Software and OCR by eldavojohn · 2009-04-08 08:21 · Score: 5, Informative

I think what you are looking for is something called "document management" software. As far as FOSS goes, KnowledgeTree offers a community version that might be up your alley. They have an online demo if you're interested. There's also Alfresco, but I haven't tried either of these.

From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.

Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

--
My work here is dung.
1. Re:Document Management Software and OCR by qoncept · 2009-04-08 08:26 · Score: 4, Funny
  
  If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
  Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.
  
  --
  Whale
2. Re:Document Management Software and OCR by Shadow+Wrought · 2009-04-08 08:39 · Score: 3, Insightful
  
  OCR certainly requires work if you need it to be completely accurate. In practice, speaking as a paralegal who's overseen the OCR'ing of millions of pages, complete accuracy is just not a reasonable expectation. If you can supplement it with coding (in this case keyword tags, date, author, publication and title), you would build a pretty strong database. If he's looking to do that already, then whatever OCR you get is gravy. Some is better than none.
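
For what the "coding" part might look like in practice, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the function names (make_catalog, add_doc, find_by_tags) are made up for illustration and are not taken from any of the packages mentioned in this thread.

```python
import sqlite3


def make_catalog(path=":memory:"):
    """Create a tiny catalog: one row per document, many tags per document."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS docs (
            id INTEGER PRIMARY KEY,
            title TEXT, author TEXT, publication TEXT, pub_date TEXT,
            file_path TEXT
        );
        CREATE TABLE IF NOT EXISTS tags (
            doc_id INTEGER REFERENCES docs(id),
            tag TEXT,
            UNIQUE (doc_id, tag)
        );
    """)
    return db


def add_doc(db, title, author, publication, pub_date, file_path, tags):
    """Insert one document plus its keyword tags; return the new row id."""
    cur = db.execute(
        "INSERT INTO docs (title, author, publication, pub_date, file_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, author, publication, pub_date, file_path),
    )
    doc_id = cur.lastrowid
    db.executemany(
        "INSERT OR IGNORE INTO tags (doc_id, tag) VALUES (?, ?)",
        [(doc_id, t.lower()) for t in tags],
    )
    db.commit()
    return doc_id


def find_by_tags(db, tags):
    """Return titles of documents that carry ALL of the given tags."""
    wanted = sorted({t.lower() for t in tags})
    if not wanted:
        return []
    marks = ",".join("?" * len(wanted))
    rows = db.execute(
        "SELECT d.title FROM docs d JOIN tags t ON t.doc_id = d.id "
        f"WHERE t.tag IN ({marks}) "
        "GROUP BY d.id HAVING COUNT(DISTINCT t.tag) = ? "
        "ORDER BY d.title",
        (*wanted, len(wanted)),
    ).fetchall()
    return [row[0] for row in rows]
```

Because tags live in their own table, one article can sit under as many headings as needed, which is exactly what the binders can't do. Any OCR text could later go into an extra column without changing the tag lookup.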
  
  --
  If brevity is the soul of wit, then how does one explain Twitter?
3. Re:Document Management Software and OCR by burki · 2009-04-08 09:06 · Score: 3, Informative
  
  For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/
  Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/) is used to generate the searchable PDF files.
  (http://sourceforge.net/forum/forum.php?forum_id=868471)
4. Re:Document Management Software and OCR by electrons_are_brave · 2009-04-08 12:25 · Score: 5, Insightful
  
As an ex-librarian, I can give you a professional's answer. You need a professional. But - if that's not possible, then what you are aiming for is a dream, and a huge data entry task to boot. And you will be creating a system that he will never be able to maintain. Aim lower. Ask him - does he want to keep the paper copies or move them all onto computer? Not both.
If he wants to keep the paper - it's simple. Weed, weed, weed. 60% of what anyone holds is rubbish, and if it's available online (and I mean in a proper source, not a disappearing link) he'll find it when he needs it. (I'm thinking he can't be using much of it given the difficulty of finding it.) So that will leave you with about 20 three-rings out of the hundreds. Number each document and put them in a filing cabinet by MAIN SUBJECT. If you want to spend your life typing then, by all means, use incite, the word referencing system or some simple library freeware to create a db with author, title, journal etc. and main subject (or maybe two).
If he wants them all digital - same deal. Scan the ones that aren't there. Forget any sort of magic software that will catalogue for you, you crazy dreamer. The best you can do is use incite or some other referencing software to search for and make a record of the ones that have the record available online. And then type the rest in.
Personally, he sounds like a hoarder, so he will probably resist both suggestions. If this is the case, then sort the folders into main subject, type a list (bib reference) and stick it to the front of each. At least that will cut down on his search time - but again, it's a lot of typing.
fox? by SnarfQuest · 2009-04-08 08:26 · Score: 4, Funny

I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?

--
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
1. Re:fox? by fuzzyfuzzyfungus · 2009-04-08 08:34 · Score: 4, Funny
  
  PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?
Try Papers by matt4077 · 2009-04-08 08:26 · Score: 3, Informative

Papers is Mac software that does exactly what you need, and does it very well. Unfortunately it's Mac-only and not web-based, but looking at it will probably tell you the right terms to google for.

--
Fleur de Sel
Cheap scanner, expensive OCR software by MartinSchou · 2009-04-08 08:35 · Score: 4, Insightful

Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer level HP printers, and the Office Jet Pro L7xxx series does this. The Pro L7680 is 200 US$ at Newegg.com
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math heavy, mixed in with images, graphs, tables and personal notes. I don't know any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images and neatly turn it all into easily searchable pdf-documents.
That brings you from paper to searchable pdf-files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
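
A rough sketch of that term-counting idea in Python, assuming the OCR'd text is already available as a string. The term list and the min_hits threshold are arbitrary placeholders that you would have to pick and tune yourself; a bare count cutoff is only a crude stand-in for real topic detection.

```python
import re
from collections import Counter


def term_counts(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term."""
    lowered = text.lower()
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def likely_topics(text, terms, min_hits=3):
    """Keep only terms mentioned at least min_hits times, most frequent first.

    min_hits=3 is an assumed cutoff: a term used once in a passing example
    is dropped, a term the paper keeps coming back to is kept.
    """
    counts = term_counts(text, terms)
    return [term for term, n in counts.most_common() if n >= min_hits]
```

Feeding each document's text through likely_topics with the professor's own vocabulary would give a first-pass list of subject headings, which someone who knows the field can then correct by hand.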
I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software and see what each one produces as output. Pick the one that gives the best result.
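
One way to make "best result" less of a gut call: type up the test page by hand once, then score each engine's output against that reference. The sketch below uses difflib from the Python standard library. The engine names are whatever labels you choose, and running the OCR tools themselves is left out, so it assumes you already have each engine's output as text.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so layout differences don't count."""
    return " ".join(text.lower().split())


def ocr_score(reference, ocr_output):
    """Similarity between hand-typed reference and OCR text, from 0.0 to 1.0."""
    matcher = SequenceMatcher(
        None, normalize(reference), normalize(ocr_output), autojunk=False
    )
    return matcher.ratio()


def best_engine(reference, outputs):
    """outputs maps an engine label to its text; return (label, score) of the best."""
    scores = {name: ocr_score(reference, text) for name, text in outputs.items()}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

The score is a plain character-level similarity, so it says nothing about how well equations or tables survived. Treat it as a way to rank engines on body text only, and try more than one page before deciding.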
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
Personal Document Management by steveha · 2009-04-08 08:38 · Score: 3, Interesting
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/
The basic features would be:
- Scan in a document (group multiple pages into a single PDF)
- Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
- OCR the documents and provide an index to allow searching
- Provide a really convenient photocopier feature (scan+print)
- Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
- Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
--
lf(1): it's like ls(1) but sorts filenames by extension, tersely
So, what I think you're asking for is... by Basilius · 2009-04-08 09:00 · Score: 4, Informative

...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco might be what you're looking for. Multi-platform, open source, Java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
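
For illustration only, the core of such a tool is small. Here is a sketch in Python of the "point it at a root folder, tag, then search" loop. The TagIndex class and its methods are invented for this example, the tags live only in memory, and there is no user interface, so this shows the idea and is not a usable product.

```python
import os


class TagIndex:
    """Map each file (path relative to the root folder) to a set of tags."""

    def __init__(self):
        self.tags = {}

    def add(self, path):
        """Register one file with no tags yet (existing tags are kept)."""
        self.tags.setdefault(path, set())

    def ingest(self, root):
        """Walk the root folder and register every file found under it."""
        for folder, _dirs, files in os.walk(root):
            for name in files:
                self.add(os.path.relpath(os.path.join(folder, name), root))

    def tag(self, path, *labels):
        """Attach one or more case-insensitive tags to a registered file."""
        self.tags[path].update(label.lower() for label in labels)

    def search(self, tags=(), name_contains=""):
        """Files that have ALL the given tags and whose filename contains the text."""
        wanted = {t.lower() for t in tags}
        needle = name_contains.lower()
        return sorted(
            path
            for path, have in self.tags.items()
            if wanted <= have and needle in os.path.basename(path).lower()
        )
```

A real version would need to save the tags somewhere (a sidecar file or a small database) and cope with files being renamed or moved, which is probably where most of the actual work lies.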
wow.. by way2trivial · 2009-04-08 09:01 · Score: 3, Insightful

2 years? ago I bought for my small business a Fujitsu fi-5120C duplex scanner--it came with adobe acrobat
I scan every bill, correspondence, notice, and everything to pdf- then I throw it the hell away.
the version of acrobat included does OCR-I open acrobat, choose create pdf from scanner, and scan away.
I can mix a scan job up between B&W & color or duplex or simplex within one job
I can open an existing PDF and append to it
I save everything to an Infrant NAS box.
I can go to windows search, type in 1179.21 (actually did this one once)
set to look INSIDE the files of that directory and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off
they have other model scanners that combine sheetfed+flatbed...
here is a beauty
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html

--
every day http://en.wikipedia.org/wiki/Special:Random
Suggestion by vondo · 2009-04-08 09:11 · Score: 3, Insightful

I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
Re:Quick and dirty solution by oldhack · 2009-04-08 09:22 · Score: 3, Funny

I am a granddad, you insensitive clod.

--
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
Why Not To by DynaSoar · 2009-04-08 09:52 · Score: 4, Insightful

There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.

--
"I may be synthetic, but I'm not stupid." -- Bishop 341-B