Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).
... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.
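For the image-only scans, the OCR step could be scripted roughly like this. It is only a sketch: it assumes the Tesseract command-line tool is installed and on the PATH, that the build is recent enough to support the `pdf` output config, and that each page is a TIFF. The function names and file names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the Tesseract CLI call that writes a searchable PDF next to the image."""
    image = Path(image)
    output_base = image.with_suffix("")  # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image, lang="eng"):
    """Run Tesseract on one scanned page and return the path of the PDF it should create."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(tesseract_command(image, lang), check=True)
    return Path(image).with_suffix(".pdf")


if __name__ == "__main__":
    # Hypothetical file name; only the command is printed here, nothing is executed.
    print(tesseract_command("page001.tif"))
```

Keeping the command construction separate from the call makes it easy to check what would be run before pointing it at a whole binder's worth of scans.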
Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?
http://owl.anytimecomm.com/
we use this at my office. Works well for us.
Papers is Mac software that does exactly what you need, and does it very well. Unfortunately it's Mac-only and not web-based, but looking at it will probably tell you the right terms to Google for.
I've had good luck with JabRef, which uses a BibTeX database on the backend, so it integrates very well with LaTeX.
...but you can try library management software. A good place to start is
http://ask.slashdot.org/article.pl?sid=06/03/22/1320207
and
http://slashdot.org/article.pl?sid=07/12/11/1756247
DSpace ? http://www.dspace.org/
There is also the issue of making copies of any copyrighted material. Unless you have obtained permission to do so from the copyright holder (usually for a fee) you could find yourself in a whole lot of, very expensive, trouble for copyright infringement.
It may be worth looking at Beagle: http://beagle-project.org/ - it's Linux only though.
Zotero might be useful.
Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper, as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Officejet Pro L7xxx series does this. The Pro L7680 is US$200 at Newegg.com.
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the material in the kind of documents you're talking about is going to be math-heavy, mixed in with images, graphs, tables and personal notes. I don't know of any OCR software that will transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times each is used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
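The term-counting idea above can be sketched in a few lines. This is a rough illustration, not a real classifier: the vocabulary and the threshold of three mentions are assumptions, and only exact word matches are counted (so "nanotubes" would not match "nanotube").

```python
import re
from collections import Counter

# Hypothetical vocabulary; a real one would come from the professor's field.
TECHNICAL_TERMS = {"casimir", "nanotube", "graphene", "lithography"}


def term_counts(text, vocabulary=TECHNICAL_TERMS):
    """Count how often each vocabulary term occurs in the text (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in vocabulary)


def suggest_tags(text, min_count=3, vocabulary=TECHNICAL_TERMS):
    """Suggest terms used at least min_count times, most frequent first."""
    counts = term_counts(text, vocabulary)
    return [term for term, n in counts.most_common() if n >= min_count]


sample = (
    "We grow a nanotube forest. Each nanotube is measured, and the nanotube "
    "diameter is compared with a Casimir force estimate."
)
print(suggest_tags(sample))
```

Here the single passing mention of the Casimir force is counted but falls below the threshold, so only "nanotube" is suggested as a tag.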
I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR; beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software and see what each reproduces as output. Pick the one that gives the best result.
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/
The basic features would be:
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. The search is not nearly as good as something like Google's, as it is purely Boolean, but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR-searchable PDFs, throw them into a folder, and set up Google Desktop to search only that folder; you could then essentially "google" the contents of all those PDFs.
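To make "purely Boolean" concrete, here is a small sketch of AND / OR / NOT search over extracted text. It is not how Summation or Concordance work internally (I have no idea about that); the document names and text are invented, and the text is assumed to have already come out of the OCR step.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(name)
    return index


def boolean_search(index, all_docs, must=(), any_of=(), must_not=()):
    """AND the 'must' terms, OR the 'any_of' terms, then drop 'must_not' hits."""
    hits = set(all_docs)
    for term in must:
        hits &= index.get(term.lower(), set())
    if any_of:
        either = set()
        for term in any_of:
            either |= index.get(term.lower(), set())
        hits &= either
    for term in must_not:
        hits -= index.get(term.lower(), set())
    return sorted(hits)


# Invented sample data standing in for OCR output.
docs = {
    "smith2004.pdf": "Carbon nanotube growth by chemical vapor deposition",
    "lee2007.pdf": "Graphene transistors and nanotube interconnects",
    "letter_1998.pdf": "Correspondence about lithography equipment",
}
index = build_index(docs)
print(boolean_search(index, docs, must=["nanotube"], must_not=["graphene"]))
```

A Boolean search like this returns every match unranked, which is exactly why it feels weaker than a ranked search engine once the collection gets large.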
Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.
There are other tools, Strigi comes to mind, but that was too unstable for me. I do not know about commercial apps doing this - there are probably some, but I am a Linux user so I need not apply there... Then there are document management systems, but I think that is overkill for your needs.
Somebody will try to tell you Alfresco is the solution. Give it a shot, but I haven't met anybody who has actually been able to use the open source version in production. The commercial version is nice though and there is a 30 day trial.
Apache Solr is built on their Lucene project and does the web interface search part of what you want. There are VM images online that you can download and deploy. I don't know what you should use to do the tagging part of the project.
If only GOOGLE had a way to search your DESKTOP, that would be perfect.
For Mac OS X, try Papers. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.
Not trying to sound like a fanboi... However, I have hundreds of data sheets for various microprocessors, ICs, power supplies, embedded APIs, 5 years' worth of emails, etc. Spotlight indexes them all beautifully, and access is very quick, only a few seconds to pull up all references. I believe Spotlight will even index network attached storage, although I could be wrong.
Check out http://www.citeulike.org/ Does pretty much what you are asking for. You put in the details of papers, and assign keyword tags. You can also look at other people's libraries and so on.
OCR is pretty nasty stuff and it doesn't work very well at all. It's worth saying that the OCR results should probably only be used to generate your index and keywords.
Actually accessing the document should show the original PDF, not the error-riddled OCR scan of it.
I think he's actually a crewman aboard the SSBN Oklahoma.
(jk, AFAIK there is no SSBN Oklahoma)
If your professor uses a Mac, consider DEVONthink by DEVONtechnologies.
http://www.devon-technologies.com/products/devonthink/index.html
For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...
I have DEVONthink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DEVONthink.
Bibliography Tools in the Context of WWW and LATEX
Looks like that covers your needs.
CC.
http://www.pdfhacks.com/
(disclaimer: not affiliated, just a user)
There are tools to index (kw_index) as well as a web based interface to a pdf collection (pdfportal).
OCR of your scanned PDFs is the enemy here. But as suggested, Tesseract, or Google's continuation of it, works pretty well.
Here is a sample script from a set of tools I was experimenting with to index PDFs (all open source with Windows binaries available):
All from pdfhacks, GnuWin32 and Ghostscript.
Take a look at http://www.naa.gov.au/records-management/secure-and-store/e-preservation/at-NAA/software.aspx XENA and DPR, which were developed as an archiving solution by the National Archives of Australia but are now open source, and fully open standards:
If on a Mac, here are two you may consider (neither has a web interface).
Skim is open source and is a PDF reader and note-taker for OS X.
http://skim-app.sourceforge.net/
Yep is not open source, but will scan, tag and search PDFs ("like iTunes for PDFs").
http://www.ironicsoftware.com/yep/
http://en.wikipedia.org/wiki/Greenstone_(software)
-- From Greenstone's Web Site --
About Greenstone:
Greenstone is a suite of software for building and distributing digital library collections.
It provides a new way of organizing information and publishing it on the Internet or on CD-ROM.
Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.
It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read the Greenstone Factsheet for more information.
The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.
Digital libraries are radically reforming how information is disseminated and acquired in UNESCO's partner communities and institutions in the fields of education, science and culture around the world, and particularly in developing countries.
We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book 'How to build a digital library', authored by two of the group's members.
Shameless Plug:
I would highly recommend the Fujitsu ScanSnap 510 (or 510M if you're on a Mac). It ain't free and it ain't open source, but it comes with everything you need to scan in large quantities of documents, name them, put them in the folders you want, and create OCR text-backed PDFs, so you keep your original files and have "searchable" backed text. It does double-sided scanning at about 15 pages per minute (my real-world estimate).
I just bought the Mac version and have managed to reduce two packed drawers of a file cabinet down to just a few documents of which I wanted to keep the originals. Plus, with them being text backed (per a previous post) I can use Spotlight to search for them.
My next plan is to scan in my old Engineering notes.
Fujitsu is coming out with the 1500, but I don't know much more than it's supposed to be improved. The 510 is fantastic, though. Check out the reviews on Amazon:
http://www.amazon.com/Fujitsu-ScanSnap-S510-Sheet-fed-Scanner/dp/B000RUOW66/
Included with the scanner is Adobe Acrobat in addition to ABBYY FineReader OCR software.
No Linux software that I'm aware of, but once you have the files in PDF format you can use them to your liking. They aren't particularly cheap at $450, but I've been very happy with the device's utility.
I had a HP All-in-One as well, but not having a double-sided scanner made it a pain to use.
...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
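For what it's worth, the core of such a tool (many tags per file, search by any combination of tags) is small. The sketch below uses SQLite from the Python standard library; the table layout, function names and file names are all invented for illustration, and it leaves out the "point it at a root folder" part and any user interface.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create (or open) a small catalog with a many-to-many file/tag table."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS tags (filename TEXT, tag TEXT, "
               "UNIQUE(filename, tag))")
    return db


def tag(db, filename, *tags):
    """Attach any number of tags to one file; duplicates are ignored."""
    db.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)",
                   [(filename, t.lower()) for t in tags])
    db.commit()


def find(db, *tags):
    """Return the files that carry every one of the given tags."""
    wanted = sorted({t.lower() for t in tags})
    if not wanted:
        return []
    placeholders = ",".join("?" * len(wanted))
    rows = db.execute(
        f"SELECT filename FROM tags WHERE tag IN ({placeholders}) "
        "GROUP BY filename HAVING COUNT(DISTINCT tag) = ? ORDER BY filename",
        wanted + [len(wanted)],
    )
    return [row[0] for row in rows]


db = open_catalog()
tag(db, "smith2004.pdf", "nanotubes", "CVD", "methods")
tag(db, "lee2007.pdf", "graphene", "methods")
print(find(db, "methods"))
print(find(db, "methods", "cvd"))
```

Because a file can appear in the table once per tag, the same article can sit under as many "headings" as needed, which is the thing the binders cannot do.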
About 2 years ago I bought a Fujitsu fi-5120C duplex scanner for my small business. It came with Adobe Acrobat.
I scan every bill, piece of correspondence, notice, and everything else to PDF, then I throw it the hell away.
The included version of Acrobat does OCR: I open Acrobat, choose Create PDF from Scanner, and scan away.
I can mix B&W and color, or duplex and simplex, within one scan job.
I can open an existing PDF and append to it.
I save everything to an Infrant NAS box.
I can go to Windows search, type in 1179.21 (I actually did this one once),
set it to look INSIDE the files of that directory, and get results that include
a soda delivery notice, a soda invoice, and the bank statement where I paid it off.
They have other scanner models that combine sheetfed + flatbed...
Here is a beauty:
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html
It's a nice Java web app. We use it at the Institute for Clean and Secure Energy (ICSE), and it does a great job.
Tellico for KDE might be a suitable solution. I use it extensively as a collection manager.
If you are running Mac OS X, you can quickly accomplish this very thing with a piece of software called "Yep!" It will track all of your PDFs and allow you to tag them. You can do previews, groups, etc. It will sort by date, etc. Very intuitive, very fast, easy to use.
You can download it from www [dot] yepthat [dot] com/yep/index.html
It's relatively inexpensive at $34 USD.
Sooner - There is a community of researchers who work in the nanotech field and collaborate through nanohub.org. I am not in the field, so I'm not sure how helpful it will be, but it's billed as "A resource for nanoscience and technology, the nanoHUB was created by the NSF-funded Network for Computational Nanotechnology."
This community is probably a much better place to ask the question than slashdot, IMHO. :-)
JR
I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
As a young academic I can vouch for this being a problem that is looking for a good solution. Olivia Judson talked about this issue in the NY Times a few months back (December 16, 2008, Defeating Bedlam). Folks who spend a lot of time with the literature need a version of EndNote or RefMan that stores the bloody PDF along with the citation info; storing the PDF might have taken a prohibitive amount of memory in the past, but these days memory is cheap. The program must also be able to search within the PDF; assigning keywords yourself is for chumps. "Papers" and "Yep" look good, but what about all of us who don't have the luxury of working on a Mac?
Our institution uses something called ePrints - I'm not sure if it's entirely what you're looking for but it does support different Subjects (headings?) and you can upload the documents using it.
If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org, the APS journals and ideally other journals), I would be very interested (even as a paying customer).
What we need is a "Wikidata" project that would catalog every book, paper, recording, movie, etc. There are a few attempts that I know of, such as openlibrary.org and the wikidata proposals on Wikimedia, but nothing that I know of has reached critical mass. Such a system would be free as in freedom, would include abstracts and item location info, would allow users to create their own sub-database of items to search, etc. Something that would be a harbinger of death to Google.
In healthcare there is a company called Laserfiche that does exactly what you are asking for. It's not free, but maybe there is a similar FOSS.
DekiWiki is a wiki that will index attachments (using Lucene) although I am not sure to the extent you'll need. It would be worth looking into also since it IS free.
I have used both and both work well. I hope that leads you into the right direction.
After you get past the easy part, which is the scanning / OCR / selection and installation of doc management software, training users, etc., you'll reach the hard part: Developing controlled vocabularies based on the ontologies specific to your domain's metadata.
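A toy version of a controlled vocabulary shows why this is the hard part: somebody has to decide which variants collapse into which preferred term. The terms below are invented examples, not a real nanotechnology thesaurus, and the function name is hypothetical.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
VOCABULARY = {
    "carbon nanotube": {"cnt", "nanotube", "carbon nanotubes"},
    "atomic force microscopy": {"afm", "atomic force microscope"},
}

# Invert it once so that any variant can be looked up directly.
PREFERRED = {
    variant: preferred
    for preferred, variants in VOCABULARY.items()
    for variant in variants | {preferred}
}


def normalize_tags(raw_tags):
    """Map free-form tags to preferred terms; return (known, unknown) lists."""
    known, unknown = set(), set()
    for raw in raw_tags:
        key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
        if key in PREFERRED:
            known.add(PREFERRED[key])
        else:
            unknown.add(key)
    return sorted(known), sorted(unknown)


print(normalize_tags(["CNT", "AFM", "Nanotube", "quantum dots"]))
```

The unknown list is the useful by-product: it is the queue of terms that someone with domain knowledge still has to place in the vocabulary.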
Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?
What you are looking for is a proper archiving application. I suggest ICA-AtoM. Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA-AtoM supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.
This researcher should learn to talk to his local librarians. Many universities have a bibliography management system e.g. Refworks, that would be a lightweight solution. And many of the articles he has in print are quite likely now already properly digitized and available by PDF through his university library. If he's a proper researcher, he should care about more than what he has in his binder. There are likely more recent articles that reference those articles, building on that knowledge. Which he's missing. He can chat with a librarian online, or try the 20th century version of communication and make an appointment to talk in person.
Any solution that doesn't provide full text searching is less likely to be useful unless the exact, specific query from each and every user can be mandated.
I've lived through "Document Management Guy" (actually a team of them, some with PhDs and publications) claims that keywords stored in document metadata were all that was needed. I called BS based on my years and years of DMS experience.
If an end user can't find a document, then the document doesn't exist, period. The document is useless unless the purpose is to have the document, but not have the document found. Images of text aren't generally useful without adding significant metadata based on how users will search for a document. IT people don't think like end users, so ask the end users what search terms they would use to find a few sample documents.
I've been away from Documentum, FileAid, Docushare and Sharepoint for a few years, but last time I used Sharepoint, the full text search results were worthless. I knew about a document - MS-Word, no less. Searches for a few specific keywords failed to locate it. Yes, it was in a collection that was indexed.
About 6 months ago, the company I work at implemented the OSS version of Alfresco. We're ok with it, but need to upgrade to v3.x to get a much better GUI. We did trial the beta v3, but it wasn't ready for use at the time and had a few flaws with version control. Those are all fixed now.
Once you OCR all your paper (FineReader is not bad), and full-text index your PDFs (Beagle for Linux, MOSS for Windows), you'll still have a problem with narrowing down a keyword search. Try Amplify http://www.hapax.com/amplify.php on the title/abstract/methods page of each document and maybe you'll get useful metadata.
I agree that for the immediate use listed there are unlikely to be any copyright violations. But if someone were to make a good collection for their lab, that perhaps then became popular in the department, it would start running into copyright gray areas. For example the university discontinues subscribing to a journal, but articles remain available on a broad intranet system. Normally if you already had a copy of the article that's legit, but now a new student has access to articles that were only available before they showed up. Or articles are scanned from copyright-legit sources and made available to a large audience, but not as large as the whole web. My guess is systems like this will be tolerated as long as they aren't very good. And when they become good, they'll be tolerated because everything else is not as good.
There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.
If you've got some money, get yourself a Google Mini for $3K or so and a scanner. The base Google mini will index up to 50,000 documents, and supports PDF as well.
http://www.google.com/enterprise/mini/fileformats.html
I would expect any modern photocopier to scan to PDF (while 150 dpi is okay to look at, the OCR is better at 300 dpi), then Adobe Professional does the OCR (my uni has a site license).
I actually bought a nifty tablet/pen thingy recently, and now I can write notes directly on the PDF too, in my own handwriting. I love it.
Kike, not kyke.
Yes, I just grammar Nazi'd the race Nazi (Nazi Nazi?).
Zotero is a Firefox extension that can grab research papers directly from journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server, and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.
Mendeley is a stand-alone application. I haven't tried it yet, but it seems to have very similar functionality.
I use BibDesk to organize my research. It's not perfect for that as its basic use is to cite and all that but you can actually import pdfs and tag them, i.e. put them into smart folders. It does have the collaborative approach to organizing data in a flat structure similar to that of delicious.
Zotero is what you want. Integrates smoothly into a research workflow. Great for managing research materials of all kinds. Powerful search and tagging features. Adding sources is quick and easy and it works hand in glove with lots of research databases. Also interoperates with Word or OpenOffice to manage citations and bibliographies.
www.zotero.org
As an alternative to off-the-shelf software you could create a series of HTML pages, put them online and let Google index them for you. Create a separate HTML page for each scanned document with the desired keywords and a link to the document. Create an index page with a link to each of these HTML pages. Link to that page on your home page or blog or any other page that you know Google scans. Wait a few days and search your site via Google. Build up a two-column table in your favorite spreadsheet application with the file name in one column and the keywords in the other. Export as CSV, and with a little coding in the programming language of your choice, you can generate the whole set of HTML in no time. Cheap, free and easy!
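The "little coding" could look something like this in Python. It is a sketch under assumptions: the CSV has exactly two columns (file name, keywords) and no header row, the page layout is invented, and the pages are only built in memory here rather than written to disk.

```python
import csv
import html
import io


def page_for(filename, keywords):
    """Return a tiny HTML page holding the keywords and a link to the scan."""
    name = html.escape(filename, quote=True)
    words = html.escape(keywords, quote=True)
    return (
        "<html><head>"
        f"<title>{name}</title>"
        f'<meta name="keywords" content="{words}">'
        "</head><body>"
        f"<p>{words}</p>"
        f'<p><a href="{name}">{name}</a></p>'
        "</body></html>"
    )


def build_site(csv_text):
    """Turn 'filename,keywords' rows into {page_name: html}, plus an index page."""
    pages = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        filename, keywords = row
        pages[filename + ".html"] = page_for(filename, keywords)
    links = "".join(
        f'<li><a href="{html.escape(p, quote=True)}">{html.escape(p)}</a></li>'
        for p in sorted(pages)
    )
    pages["index.html"] = f"<html><body><ul>{links}</ul></body></html>"
    return pages


# Invented sample rows, as a spreadsheet export might produce them.
rows = 'smith2004.pdf,"nanotubes, CVD, growth"\nlee2007.pdf,"graphene, transistors"\n'
site = build_site(rows)
print(sorted(site))
```

Writing the result out is then one loop over `site.items()`, saving each page under its name next to the scanned files before uploading the folder.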
I am currently using Zotero (http://www.zotero.org/) to organize all my articles and citations. It is open source, developed by George Mason University. The software works as an add-on to Firefox, and automatically downloads the citation or the PDF of the article. The citation can then be tagged with various labels and all the words in the article are searchable. I haven't used the tagging feature much, but the software has already proven invaluable in research and paper writing.
I'm a biology grad student and have been dealing with some similar issues. I've ended up using an online app (CiteULike) that has a great tagging interface and uses a bookmarklet for posting from journal sites, ISI, and PubMed.
It also has a great BibTeX export/import feature. Since I'm using LaTeX for my dissertation I'm slowly migrating to a BSD-licensed Mac program called BibDesk. Its tagging interface could use a little work though.
I've tried Zotero, and heard good things about Mendeley and Papers, but none of them have worked as well for me.
I've used Aigaion for managing all of my documents and references in the course of my Ph.D. I now recommend it to all of my grad students.
The website calls it a "Web based bibliography management software"
From the site:
"Both for individual researchers as for research groups or projects, it is of major importance to organize the literature one has read. A well organized bibliography is a powerful instrument. It speeds up the search for publications one has already read and supports the user in structuring information. Aigaion provides a bibliography management software environment that supports a user in just this: Organizing and managing a complete bibliography, from small bibliographies to bibliographies for a complete research department."
What's better than having a program on your desktop that can search through your files? How about an online database of your files that you can access from any internet-connected computer in the world (www.refbase.net)?
1a. Windows Desktop Search with added "IFilters", or
1b. Google Desktop Search
I recommend (1a), amazingly, because once you've located and installed all the third-party IFilters - including one(s) for PDF files - WDS will be able to index and make searchable MANY more files than GDS (in my case, about THREE TIMES as many). If the original PDFs from which so much of the binder material was printed are still available, then your effort with the following is greatly reduced.
2. Good major-manufacturer scanner with ADF.
I haven't kept up with scanners in recent years, so I'll leave it to you or someone else to make specific recommendations. It may be important to stick with well-known brands for purposes of compatibility with the scanning/OCR software (3).
3. Forget Adobe: buy the latest version of OmniPage Pro. Just like Adobe, it can OCR text and pump it into a PDF while "fronting" it with an image of the original page, for sake of complex layouts and possibility of future OCR corrections.
No need to worry about complex database systems to store all the stuff; just create a storage directory (or hierarchy, if there are tens of thousands of files). When you're done, you'll have a library of PDF files that have been fully indexed by a desktop search engine, such that any snippet of text in a document can be used to locate it.
I started Labmeeting with just this problem in mind.
First, we focus only on the biomedical and related spaces right now. Eventually, we might expand into Nanotech and CS, but we are helping out PubMed users first, for the most part.
We let you upload lots of papers, index them for you, provide a great interface for searching and annotating them. We have tens of thousands of bioscientists on the site with private paper collections.
I know this won't necessarily help your professor right now since we mostly focus on biology, but I'll let you know if we ever expand.
As you are at a university, you probably have access to a copy of Adobe Acrobat; I have found that it has an alright OCR for scanned PDFs. You may also want to look at Scribd: it is not open source, but it is free and searchable.
I believe that there have been a few people working on developing a 'searchable literature archive with keywords' for a while now... they call these strange people 'librarians'.
Have a look at Xinco http://www.xinco.org/. Java-based, simple to set up on Tomcat - MySQL or other alternatives. Not as bloated as Alfresco http://www.alfresco.com/ when it comes to smaller projects. Good functionality and the installation documentation will get you up and running quickly. A serious lack of further documentation can be problematic. Referencing the source code helps. The client interface was definitely designed by a programmer, not artistic but functional.
The University of Oklahoma participates in JSTOR:
http://www.jstor.org/
They also appear to be EBSCO participants:
http://search.ebscohost.com/
I'm pretty sure "the 20th century" is right there already, if you can drag him to the library.
Note: This isn't going to work for people not affiliated with an institution. Both of these services make paper journal content available online for subscription fees paid by the institution (or business), so unless you are in the "big boy clique", you're not going to have access to them unless you pay through the nose.
-- Terry
More seriously though - I am a little surprised that a professor cannot simply work with people at his current organisation that are hired specifically to catalogue and conveniently store (mostly digital) literary information.
Sure PhD students are nice, cheap slaves - but how hard is it to acquire a copy of endnote or reference manager from the library and ask how to export their preprepared metadata and thesaurus keywords into his install?
Since you only have one person that needs to access the files I would just use Desktop Search. Personally I like Google Desktop ( http://desktop.google.com/ ) & Copernic Desktop Search ( http://www.copernic.com/en/products/desktop-search/index.html ). Here is an article reviewing some of them - http://lifehacker.com/400365/five-best-desktop-search-applications.
The main thing that you need to do is OCR the documents when you scan them in (you can convert non-OCR PDFs into OCR PDFs, but I don't know of anything that can search them before you put text in them). On Linux, the two main ones that I know of are Tracker and Beagle (http://www.linux.com/feature/143259).
I know these are not all open source and don't all have web interfaces, but they are really easy to use. You just put the files in folders and point the desktop search at them. Great for someone that doesn't know a lot about computers.
I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm
http://www.nasw.org/users/nbauman/lawdb.htm
http://www.nasw.org/users/nbauman/discover.htm
It was a long time ago, and the software and hardware have changed, but the concepts are still the same, and the costs are a lot less.
Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is Pubmed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.
Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
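A minimal sketch of that layout, assuming SQLite via Python's standard sqlite3 module. The table names, column names and sample record are all invented for illustration; the point is only that the binder and tab become ordinary fields, while keywords live in their own table so each term exists once and a document can carry several.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id      INTEGER PRIMARY KEY,
    binder  TEXT NOT NULL,   -- label on the looseleaf binder
    tab     TEXT NOT NULL,   -- tab inside that binder
    author  TEXT,
    journal TEXT,
    year    INTEGER
);
CREATE TABLE keywords (
    id   INTEGER PRIMARY KEY,
    term TEXT NOT NULL UNIQUE   -- controlled vocabulary: no duplicate terms
);
CREATE TABLE document_keywords (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    keyword_id  INTEGER NOT NULL REFERENCES keywords(id),
    PRIMARY KEY (document_id, keyword_id)
);
"""

def open_archive(path=":memory:"):
    """Create the (hypothetical) archive schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_document(conn, binder, tab, author, journal, year, terms):
    """Insert one record and attach any number of keywords to it."""
    cur = conn.execute(
        "INSERT INTO documents (binder, tab, author, journal, year) "
        "VALUES (?, ?, ?, ?, ?)",
        (binder, tab, author, journal, year))
    doc_id = cur.lastrowid
    for term in terms:
        term = term.strip().lower()   # normalise so 'ZnO' and 'zno' don't duplicate
        conn.execute("INSERT OR IGNORE INTO keywords (term) VALUES (?)", (term,))
        (kw_id,) = conn.execute(
            "SELECT id FROM keywords WHERE term = ?", (term,)).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO document_keywords VALUES (?, ?)", (doc_id, kw_id))
    return doc_id

def find_by_keyword(conn, term):
    """Return (binder, tab, author, year) for every document tagged with term."""
    rows = conn.execute(
        """SELECT d.binder, d.tab, d.author, d.year
           FROM documents d
           JOIN document_keywords dk ON dk.document_id = d.id
           JOIN keywords k ON k.id = dk.keyword_id
           WHERE k.term = ?
           ORDER BY d.id""",
        (term.strip().lower(),))
    return rows.fetchall()

if __name__ == "__main__":
    conn = open_archive()
    add_document(conn, "Binder 12", "Tab 3", "Example Author",
                 "Example Journal", 2007, ["nanotubes", "synthesis"])
    print(find_by_keyword(conn, "synthesis"))
```

The search result tells him which binder and tab to pull, so this works even if nothing is ever scanned.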
I assume he doesn't have the PDFs any more. That would have made it a lot easier.
It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.
He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.
Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.
Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.
You might start by estimating the number of pages and documents you have.
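A rough back-of-the-envelope sketch of that estimate. The only figures taken from this post are "up to 1 minute a page" for a cheap scanner and "about 20 hours" per thousand records of manual entry; the page and record totals below are placeholders to replace with your own counts.

```python
def scan_hours(pages, pages_per_minute):
    """Hours of scanner time for `pages` at a given throughput."""
    return pages / pages_per_minute / 60

def data_entry_hours(records, hours_per_thousand=20):
    """Hours of manual field entry, at roughly 20 hours per 1,000 records."""
    return records * hours_per_thousand / 1000

if __name__ == "__main__":
    pages = 20_000     # hypothetical: total pages across all binders
    records = 2_000    # hypothetical: number of documents to catalogue
    print(f"cheap scanner at 1 page/min: {scan_hours(pages, 1):.0f} hours")
    print(f"manual field entry:          {data_entry_hours(records):.0f} hours")
```

Running it with real counts makes it easier to decide whether full scanning is worth it or whether entering fields alone is enough.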
But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.
If anybody knows of up-to-date articles on this subject, I'd love to know the citation.
I didn't read the article and I'm not sure exactly what you want, but take a look at Bookends from Sonny Software.
Saved my butt, made life easy.
http://www.sonnysoftware.com/
Reference Management and Bibliography Software for Mac OS X
When they came for the communists, I said "He's next door. Take him away. Goddam commies."
DocMGR is what you're after. It's a web app that takes submitted documents (PDF, Office, etc.), OCRs and indexes them, and allows stuff to be searched. www.docmgr.org
is the best Scan/OCR app you can buy with many nice features. The first is it's reasonably priced. Second is the fact that it can import PDF files directly and OCR/index them, and it handles almost every language on the planet. Definitely worth looking into, as the school may actually have a school license, which means you don't have to buy anything.
Mod me up/Mod me down: I wont frown as I've no crown
Give a call and see if their software does what you want; if they haven't messed it up since I last touched it, it should do document archiving, scanning, OCR, search, tags, categorizations, and whatever custom database fields you want to throw at it.
http://datagenix.com/
The most obvious solution would be to use scholar.google.com and for any paper that you find that isn't online already look it up in your personal collection.
We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.
The big attraction of the fi-6130 is its speed: 40 pages per minute.
If you are interested, I suggest you download the manual. (PDF)
The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.
The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.
The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.
It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed is important with a flatbed scanner, but not as important, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together, without any logical connection, since apparently the 6230 has two imaging elements.
Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.
Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.
Before I return to work: if he's already getting the PDFs somehow, save them to a directory, and use Google Desktop to search through them for the keywords. It will search through most common document types.
not only is time travel possible, it's irrelevant.
Loving the fact that the guy didn't know what century it is!
Here's what I've used for my own documents: 1) convert the PDFs to TIFFs with ImageMagick, 2) OCR the TIFFs with Tesseract, 3) index the text with Xapian. The OCR step won't get all the text right, but it will get about 90% or better, which is good enough for indexing and searching with Xapian. For me that's been a pretty good solution.
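A sketch of how steps 1 and 2 might be driven from a script, assuming the ImageMagick `convert` and `tesseract` command-line tools are on the PATH. The 300 dpi density, the file naming pattern and the function names are my own choices, not the poster's; step 3 (feeding the resulting .txt files to Xapian) is left out.

```python
import subprocess
from pathlib import Path

def pdf_to_tiff_cmd(pdf, out_dir, dpi=300):
    """ImageMagick command that renders each PDF page to a numbered TIFF."""
    pattern = Path(out_dir) / (Path(pdf).stem + "-%03d.tif")
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]

def ocr_cmd(tiff):
    """Tesseract command; it writes the recognised text to <output base>.txt."""
    tiff = Path(tiff)
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]

def ocr_pdf(pdf, out_dir):
    """Run steps 1 and 2 and return the text files that are ready for indexing."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_tiff_cmd(pdf, out_dir), check=True)
    text_files = []
    for tiff in sorted(out_dir.glob(Path(pdf).stem + "-*.tif")):
        subprocess.run(ocr_cmd(tiff), check=True)
        text_files.append(tiff.with_suffix(".txt"))
    return text_files
```

Keeping the command construction separate from the running makes it easy to print the commands first and try them by hand on one document before letting the script loose on a whole directory.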
Seriously. This is what we get paid to do. There's far too much to communicate on a forum, and if the SNR here is typical, you'll get awful, unrelated and just plain wrong advice.
If you can't hire, see if your school has a library science program and look for a good intern.
Failing that, read the Polar Bear book (Rosenfeld & Morville, pub: O'Reilly) yourself and follow the threads to resources particular to your problem.
Tactical help: populate the keywords, title, subject properties in your PDFs and Office docs. If you populate in Office and make a PDF, the properties come along. They're in File>Properties... Filling them out will help any search engine that can consume binary docs make sense of your content.
And there are bulk scan-to-OCR packages out there. Funnel into PDF and populate the properties.
Worth saying again: populate the properties.
Drupal + Modules (CCK + Filefield + Taxonomy + Views) Taxonomy provides categorization of content. You can have multiple vocabularies (sets of terms). You can assign one or many terms to each piece of content (or document in this case).
My father runs a small business and has to track a bunch of paperwork for each client, so I got him a cheap LED lit flatbed scanner, but like everyone else, he discovered that it was too slow to manually scan in each page, even if the scanner itself was quite fast.
He eventually figured out that the fastest scanning technique is not to use a scanner at all, but a digital camera. He made a rig with a marked out area the size of an A4 sheet of paper, and then he attached a camera mount so that the camera would be facing down, pre-aligned to photograph the entire sheet. I've seen it in action, he can easily do a page per second: he just places the next page on the platform with one hand, and presses the shutter button with the other hand.
The resolution is more than good enough for OCR, and most cameras have better depth-of-field than scanners, so more of the page is in focus, even near bindings and staples.
I suppose you meant "drag a professor into the 21st century".
I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handling group separator sheets. I highly recommend it. I have seen no evidence of "marketing drone foolishness."
Evernote (Evernote.com) might be too small for your needs, and it's not open source, but it:
- Has OCR
- Is very cheap ($6 a month for the pro version, free for the light version)
- Recognizes handwriting
- Accepts tags
- Has a web interface (and a desktop client)
Its only limitation is the 500 MB monthly upload cap. Since you have hundreds of files to get through, you will go over the cap if you upload all at once. But since scanning those things is going to take you ages, you might be fine. Also, if your boss is still collecting paper, he's probably pretty old-school. Evernote is dead simple to use.
I couldn't tell you how many times I've gone to use a phonebook or reference manual and tried to flip through to the search page.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
If you use pdftotext or some such utility from the adobe SDK then you could use a simple shell script to take the filename of all the pdf files in the current directory and create an index of keywords and counts per file and ultimately mix them into another index. I would recommend using sqlite - a nice little database - without a database you will most likely run short on memory when sorting and searching.
I'm short on time otherwise I would write a more verbose example. For keywords you use something similar to a huffman encoding or a trie. Instead of every letter you use a file containing word counts or relations per file.
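#!/bin/bash
# Write a header file in /tmp for every PDF in the current directory.
for i in *.pdf; do
    [ -e "$i" ] || continue              # no PDFs here: the glob stays literal
    file="${i%.pdf}"; output="/tmp/${file}.txt"
    printf '%s' "$file" > "$output"      # base name, no trailing newline
    printf '\001%s\n' "$i" >> "$output"  # \001 separator, then the PDF name
done
# EOF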
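#!/usr/bin/perl -w ... etc.
use strict;
foreach my $pdf (@ARGV) {
    print $pdf, "\n";
    # pdftotext sends the extracted text to stdout when the output file is "-"
    open (my $in, "-|", "/usr/X11/bin/pdftotext", $pdf, "-")
        or die "$!";
    while (<$in>) {
        s/\t/ /g;       # tabs to spaces
        s/\s+/ /g;      # collapse runs of whitespace
        s/ /\n/g;       # one word per line
        next if m/^\s*$/;
        print "$_"; # append to file (>>$ARGV[x])
        ## re-read and do a word count
    }
    close ($in);
}
# EOF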
Save the word counts in the data file and rescan it. The interesting part of the project would be word parts, suffixes and prefixes, and also bringing the meta-information from the PDF into the mix. It sounds like a nice project. I'm guessing you'll run out of memory before you run out of information, so use a cutoff value of some low number, or, if you have the processing power, try to build a backward-chaining (Prolog-type) system with a simple DCG grammar. This would allow searching and matching of similar phrases or parts of speech. The computer science of it all is using a `pdftotext'-like program as an input stream. A text by Norvig, Graham, or (for a more statistical approach) Patrick Winston (excellent author) would cover the fun algorithms of dependency-related searches and backward chaining with grammars, etc. I don't think hashtables will work for anything more than initial counts, so use a b-tree in the filesystem, or better yet a ramdisk if you have the memory. For example (I'm a Linux goof):
mount -vt tmpfs tmpfs -o size=1024M,nr_inodes=2m /{mount-point}/
I've worked on similar projects for some time - I get caught up in the rule system and negation retraction. I really don't have an end goal so these little exploratory programs lead to much fun. I have a perl program I have been using to scan images and pdf-files. It's based on Image::ExifTool. I understand that perl is not the coolest thing since ruby on a turnpike or whatever but this module kicks a$$ and perl works well, if not best, with regexps (IMHO).
To think you would be creating the `ultimate grep'! What a wonderful addition to anyone's life :) I would also check out ispell - it seems to have some nice rules for suffixes and the such. Post something similar and I'll post the perl script that I've been using (at request - it's not that cool or useful outside of itself or for ideas) ...
Have fun...
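For anyone who wants the sqlite idea from this post in runnable form, here is a small sketch in Python instead of perl. It assumes the text has already been pulled out of each PDF (for example with pdftotext); the table layout, the tokenising rule and all function names are made up for illustration. It stores one row per (file, word) with a count, so the sorting and searching happen in the database and not in memory.

```python
import re
import sqlite3
from collections import Counter

WORD = re.compile(r"[a-z0-9]+")   # crude tokeniser: runs of letters and digits

def count_words(text):
    """Lower-case the text and count each word."""
    return Counter(WORD.findall(text.lower()))

def open_index(path=":memory:"):
    """Open (or create) the index; pass a file path to keep it on disk."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS word_counts (
                        file TEXT NOT NULL,
                        word TEXT NOT NULL,
                        n    INTEGER NOT NULL,
                        PRIMARY KEY (file, word))""")
    return conn

def index_text(conn, filename, text):
    """Replace any earlier entries for filename with fresh word counts."""
    conn.execute("DELETE FROM word_counts WHERE file = ?", (filename,))
    conn.executemany(
        "INSERT INTO word_counts (file, word, n) VALUES (?, ?, ?)",
        [(filename, w, n) for w, n in count_words(text).items()])
    conn.commit()

def search(conn, word):
    """Files containing word, most frequent first."""
    rows = conn.execute(
        "SELECT file, n FROM word_counts WHERE word = ? ORDER BY n DESC, file",
        (word.lower(),))
    return rows.fetchall()
```

Suffix and prefix handling, or the grammar-based matching described above, would sit on top of this; the sketch only covers the initial counts.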
http://jssindex.sourceforge.net/
JSSindex can index a collection of PDF, DjVu, PostScript or HTML documents, and generate a self-contained set of HTML and JavaScript files that allow full-text search of the collection using a web browser. Since the search engine runs in the browser (in JavaScript) there is no need for a server. The code is platform independent.
The JSSindex script is written in Lush, which runs on Linux and Mac.
From the website:
JSS is a simple search engine designed for CDROM or Web-based document collections. The documents to be indexed can be in HTML, PostScript (.ps and .ps.gz), PDF, and DjVu. The main feature of JSS is that the query engine and the index are entirely in JavaScript, and therefore require no other software than a JavaScript-enabled Web browser.
What is the advantage? If you are distributing a collection of document on CD-ROM, you can provide platform-independent full-text search without asking your users to install any software on their machine. If you publish a collection of documents on the web, you don't need to install any server-side scripts: search queries run entirely in the user's web browser.
TikiWiki allows word searching of uploaded files (batch loading from a file directory is supported). You'd need to convert the images to a suitable format (can a PDF hold a page image and text from OCR?), and a command-line filter which extracts text from the file for indexing. By default only the first 8K is stored, but you have the source code. Assorted command-line filters can be defined, so future PDFs can be stored directly.
I use JabRef. http://jabref.sourceforge.net/ It's not a web interface, but it provides keyword searching, user-defined groups, local file storage as well as links to web versions, and everything starts out with a full citation information (which can be "unpublished", "personal communication", etc.). You didn't request full-text searching, but if you do have full-text pdfs, any of the OS-based file search programs should handle it.
My JabRef / bibtex database is well over 1500 articles, and I have NEVER regretted scanning / downloading over 22 shelf-feet of binders and folders.
Use JabRef (http://jabref.sf.net) to store the references in a BibTeX database, and set up the links in JabRef for each article to point to the appropriate pdf, jpeg, zip or other document.
And don't forget to use DOI references (http://en.wikipedia.org/wiki/Digital_object_identifier) to point to the online abstract of the article. Very useful.
There is an OS X application written specifically for this kind of scenario: DEVONthink. http://www.stevenberlinjohnson.com/movabletype/archives/000231.html It has the ABBYY OCR engine built in, a web server and an extremely smart search filter, which is able to find related documents based on metrics like keyword frequency.
Aigaion - A Web based bibliography management software
http://www.aigaion.nl/
It speeds up the search for publications one has already read and supports the user in structuring information. Aigaion provides a bibliography management software environment that supports a user in just this: Organizing and managing a complete bibliography, from small bibliographies to bibliographies for a complete research department.
This might be an overkill for just one guy's papers.
But you might wanna take a look at
http://www.dspace.org/
DSpace is an open-source project for preserving various kinds of digital assets (images, documents, audio, etc.). It is used by many university libraries throughout the world. It has a fairly large community.
The downside: you need to know how to install and configure it, as it requires a web server, database, servlet engine, etc. All are available for free, but you may need to spend some time installing and configuring them.
The interface is via the web, so it's fairly straightforward. You can find live examples of who is using DSpace on this website: http://www.dspace.org/index.php/DSpace-Repositories/Repositories-Alphabetical.html
In CERN DS (with certainly a focus on high-energy physics) my papers are shown only up to 2006; so this database appears useless for me.
Here's what I did to rid myself of my entire bookcase full of ring binders: I sat down and looked them up one at a time on the net (if it's there, Google will find it). If there's a PDF available, pick that, otherwise convert it to PDF yourself (for uniformity, ease of viewing and printing, and for future-safety) after downloading the .ps/.ppt/.doc or whatever, unless it's a plain .txt or possibly .html file.
Make sure to rename the files sensibly (no "oopsla1998xyz.pdf"). I use a straightforward "MainAuthor(s) (et al) - TitleOfPaper", cut down to a reasonable length. Note that it's better to omit parts of the title than to start tossing in abbreviations - they will just get in the way of search and readability.
Then throw the dead-tree version in the bin (wonderful feeling) and pick up the next; you'll quickly get down to just 1-2 minutes per paper. And don't hesitate to discard all those papers that aren't really worth keeping, to speed things up even more.
Place the files in a simple hierarchy of directories (one level, no more), named after main subject - and don't be too concerned about getting that exactly right: a dozen broad subject names is way easier to handle than a hundred specific ones.
When it comes to searching, just leave it to the operating system! The built-in search in Ubuntu/Fedora/MacOS/Windows/whatever is good enough these days (just don't disable the indexing service...), otherwise install Google Desktop or similar if you need even more power. Keywords already in the articles can be searched for just as any other words in the content. Keywords not already present: forget it - you're not going to do that manually, and it's not worth it.
I now have about 4 ring binders left, with material I couldn't find online (that was worth keeping); a few really special ones I scanned to PDF myself. It usually takes me all of 10 seconds to find any paper I'm looking for on my computer: browse to "papers" and do a search. And it's all very easy to maintain: for any new paper, make sure it's a PDF, rename the file, and drop it in a suitable subdirectory depending on main subject - done.
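The renaming convention from this post is easy to automate if you ever have the author and title to hand. A small sketch, with my own assumptions clearly on top of what the post says: the 80-character limit, the set of characters stripped from titles, and the function name are all arbitrary choices. Long titles are cut at a word boundary, never abbreviated, in line with the advice above.

```python
import re

def paper_filename(authors, title, max_len=80):
    """Build 'MainAuthor et al - Title.pdf', trimming the title at a word boundary."""
    if not authors:
        raise ValueError("at least one author is required")
    who = authors[0] if len(authors) == 1 else f"{authors[0]} et al"
    # Drop characters that are awkward or illegal in file names on common systems.
    clean = re.sub(r'[\\/:*?"<>|]+', " ", title)
    clean = " ".join(clean.split())
    # Space left for the title once the author, separator and extension are counted.
    room = max(1, max_len - len(who) - len(" - ") - len(".pdf"))
    if len(clean) > room:
        cut = clean[:room + 1]
        # Prefer to drop whole trailing words; fall back to a hard cut.
        clean = cut.rsplit(" ", 1)[0] if " " in cut else clean[:room]
        clean = clean.rstrip()
    return f"{who} - {clean}.pdf"

if __name__ == "__main__":
    print(paper_filename(["Smith", "Jones"], "Growth of ZnO Nanowires: A Review"))
```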
We use the Index service in Windows to index pdf files. Acrobat Reader (or just the index plugin from adobe) is required to allow the index server to index the contents of the pdf files. You can then use the windows find program or write a simple web front end to query the index for any word or term present in the files or the file properties. Lots of examples on the net. Good luck.
Refbase is built directly for scientific literature management, is web based, open source, and does contain keywords amongst a range of other search options too.
It might be overkill for a single individual, but can be extremely effective if a whole department is wanting to share their literature resources.
I work with electronic medical records and we have found Fujitsu scanners to be top-notch. Fast, reliable, and generally affordable. We've also used some of the larger production scanners from Kodak and Bowe Bell and Howell. They are solid scanners, but are more expensive and haven't taken the beating we tend to give the Fujitsus.
We just replaced an old 3097 with the 6130 and are waiting to hear how it holds up. Note, most of these scanners are deployed to remote scan areas in the hospital where it is the responsibility of the users to handle maintenance, which means these scanners aren't cleaned and don't have rollers replaced. These older models have gone years with almost no maintenance.
I think what you are looking for is something called "document management" software.
... where he could archive the PDFs and scanned documents and be able to search by keywords?
I agree with the OCR requirement, but if he just needs to search the resulting PDFs, wouldn't DocSearcher do the job for him? I've found it trivial to set up and run and it's certainly helped me keep track of docs etc.
Too bad the OSS community has no real answers.
this is something i submitted a week or so ago:
"I'm looking for software that can help my company manage information in documents that may be in pdf, doc or web form. I work for a biotech company with 15 people, and we have large numbers of documents that range from very technical scientific publications (usually pdf) to company reports like 10-Ks, to web pages to newspaper articles to pictures. We use these documents to review and stay current with the scientific literature; to learn about what competitors are doing, gain market information (who is selling how much of what), generate publicity for our products ,and so forth.
We currently use the windows file tree as our organizer, which creates several problems: I can't put one file into multiple bins; I can't use keywords to search; I can't organize files into groups.
What I would like (I think) to do is organize the information by keywords and subjects; associate groups of files into binders, and create summaries for the binders (e.g., I might have 5 files that go together, and my own summary of what the five files mean); add sticky notes to anything at any time (actually, I would like keywords and stickies [comments in Adobe Acrobat] to be the same: words in stickies are keywords, and keywords show up in the sticky); add URLs and web pages directly from the browser; have a function that mimics or is compatible with a package like EndNote or ProCite or Papyrus or RefCite (formats bibliographies in Word docs).
I'm not even sure what the solution looks like, but it needs to be cheap (http://www.ncbi.nlm.nih.gov/sites/entrez. This has a lot of features that scientists need, such as keyword search returns a list of articles that can be viewed by abstract."
this is a problem that comes up a lot, for a lot of people
I've tried a lot of the solutions, like Zotero, and they just don't cut it for one person, much less if you need to share the info among a small group of people.
There is a fabulous market for someone who wants to write this software
The main problem, which I don't think anyone has addressed, is that free information has a price: a human can only remember so much. So the glut of free PDF/web info is actually bad, because you lose sight of the important stuff. This used to be done for you by your paid monthly journal subscriptions: if you are in nanoscience, you might get Nano Letters from the American Chemical Society, and the editors do the weeding out for you.
The other problem is: how does one do natural language queries?
Of the available answers, most are owned by a de facto monopoly, Thomson Reuters; RefMan is probably the best.
Surely there must be someone who makes a PDF library database front end better than the collection feature in Adobe Acrobat.
I realize that Slashdot is going to take the technology solution as the only one, and in this case it's probably the right way to go, but ...
People have managed documents and information like this for centuries and it worked rather well, perhaps you should stop being lazy and learn how to use traditional reference materials as you're going to need this skill for a few more years anyway.
Those skills are still useful today. Just because Google can index and allow you to find words in the documents it knows about doesn't mean that it can help you figure out what you're looking for. If you have no traditional reference skills, Google becomes a lot less effective. This of course isn't specific to Google, all search engines in the world won't help you if you can't figure out what you're looking for.
I ran into this issue also, since I have tons of PDFs and sometimes it can take a while to find that paper you remember that mentioned HylD or ZnO. I use the search client Copernic http://www.copernic.com/ . It has a serious advantage over Google Desktop since it gives you this handy little preview pane, which is useful when sorting through results. Since I carry everything around on a hard drive, I just have the program set to index that drive (which is set to always have the same drive letter). As for versions of the program, they kind of went in a bad direction with the 2.x releases and I kept using 1.6/1.7 for a while, but I recently started using the current release, 3.x, and it works like a champ. Good luck.
Good grief, forget your FOSS ideology, scan them to PDF, OCR them using Acrobat Pro (the education price is ridiculously cheap), store them on a Mac, and use the built-in Spotlight to search them.
Why not use something like Jabref? Easy to manage the references?
Apologies in advance for suggesting a Microsoft solution...
If you can get the documents into a searchable form (using OCR, as has been outlined by many other posts already), there is always Microsoft Search Server 2008 Express - it is free, and will index anything that it can read the contents of - just dump the files into a share, and the search will index the contents.
It's like a more complex version of the search 4.0 engine that Outlook 2007 uses to index the contents of a mailbox, but it has a web front-end for searching.
We tried it at work, decided not to deploy it because we had a bunch of really specific requirements that it didn't suit - turned out my workplace needed a "proper" doc management system, but have just shelled out good money for SharePoint licenses to do this.
I hate SharePoint, but that's OT.
For the simple set-up you've described, search server express might work OK.
I'm surprised nobody has mentioned Calibre, which was also featured on Lifehacker sometime back.
It is based on PyQT (as well as dateutil, mechanize, lxml, BeautifulSoup) . They even have a CoverFlow like interface which is pretty good. I suppose it is usable on Win, Lin and Mac.
You have to provide a login/password to librarything (or a few other alternatives) and you can then search and tag for the book's metadata and cover images from these sources automagically.
I personally also use them to archive my PDF's that I download from the internet, tag them, specify authors and other metadata (incidentally, most of the papers that people create from latex do not have any metadata).
I see the developers pushing out a release every week, so it is under pretty active development. I don't know if there is a plan to integrate any indexing features in it, but I suppose the developers are open to it.
Google desktop searches through your computer's files to find keywords inside the files themselves. If he saves all his documents he finds online to there, he should be able to do keyword searches in those documents.
Also, if the pdfs are ocr'd then he could search via that as well.
I have always liked the basic idea behind CiteSeer.
"CiteSeerx is a scientific literature digital library and search engine that focuses primarily on the literature in computer and information science. CiteSeerx aims to improve the dissemination of scientific literature and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness in the access of scientific and scholarly knowledge.
Rather than creating just another digital library, CiteSeerx attempts to provide resources such as algorithms, data, metadata, services, techniques, and software that can be used to promote other digital libraries. CiteSeerx has developed new methods and algorithms to index PostScript and PDF research articles on the Web. ..."
The basic issue for you would be that, as currently implemented, it was made to focus on the computer and information sciences.
http://citeseerx.ist.psu.edu/about/site
In the short term, this may not be valuable for you. In the long term, I think this can be the basis for most or any academic (or even non-academic) research literature.
The easiest solution is RefWorks, an online citation manager. You can automatically import articles from online databases or create your own reference entries, with space to add any kind of article information or user-specified metadata, plus you can attach PDFs directly to the database entry. The database is stored online by RefWorks and is searchable from anywhere via a web browser. Many universities already have site licenses for this system, so check with your university librarians. Otherwise, check out their website for further details. The Microsoft Office plug-in for the manager, "Write-and-Cite III", works with Microsoft Word and the database to automatically generate reference lists and citations formatted to the style of almost every major and minor academic journal in most disciplines. The whole database is searchable and may be organized by project. You can also automatically import any article or abstract from Google Scholar or other academic databases like JSTOR, ProQuest, etc.
Being involved in CS research myself,
I think this is a very interesting problem!
I notice that many people still have these big piles of paper, even when they are in their twenties.
I suggest the following:
* Hire a student to look up all the printed papers in Google Scholar or some other database.
* Throw away the paper copies of the ones you found online.
* Save all results as PDF (no HTML, no TXT),
with the title as the filename.
* Ignore articles you cannot find - no scanning.
* Forget FOSS and install Copernic Desktop Search.
It works really well.
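If you go the "title as the filename" route, article titles often contain characters that file systems dislike (colons, slashes, question marks). A minimal sketch in Python of one way to clean them up; the function name title_to_filename and the 120-character limit are my own assumptions, not part of any tool mentioned here:

```python
import re
import unicodedata


def title_to_filename(title, max_length=120):
    """Turn an article title into a safe PDF filename (hypothetical helper)."""
    # Decompose accented characters and drop anything that is not plain ASCII.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    text = re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")
    # Truncate long titles; fall back to a placeholder if nothing is left.
    text = text[:max_length].rstrip("_") or "untitled"
    return text + ".pdf"


print(title_to_filename("Carbon Nanotubes: Synthesis & Properties"))
```

Dropping non-ASCII characters is a blunt choice that loses information in non-English titles; keep them if your file system and search tool handle Unicode well.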
Now the problem is also that your professor wants to make notes on his papers. To really use the PDFs, you need him to buy a tablet notebook, on which he can write on and annotate PDFs.
One of my colleagues has one, and it seems that e-paper is finally arriving.
As a last piece of advice, teach your professor to organize his articles in directories, so that each one can be searched individually. When I write a paper, I do my search for relevant literature in one particular directory.
And do not focus too much on clever archiving strategies with keywords and such; they are not worth the effort.
In order to properly create a searchable hierarchical index, you may be interested in constructing an ontology, which is a description of your subject matter in terms of some broad categories. Those broad categories then branch out into logical subject areas. Many databases support hierarchical structures which match well with the way an ontology works. Once the ontology is constructed, with keywords representing the categories, you index each document on those keywords. Then your system can browse the hierarchy or zero in on a particular term. In linguistics, ontologies are used to construct meaning trees of words as a starting point for determining the meaning and intent of some written text. Perhaps some of the commercial packages discussed can do this, but this is what I would look for in a product if I were faced with your task.
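To make the idea concrete, here is a rough sketch in Python of a keyword hierarchy where a document can sit under several categories and a search on a broad term also returns everything filed under its narrower terms. The class name OntologyIndex, its methods, and the sample categories and filenames are all made up for illustration; a real system would more likely keep this in a database.

```python
from collections import defaultdict


class OntologyIndex:
    """Toy hierarchical keyword index (illustrative names, in-memory only)."""

    def __init__(self):
        self.children = defaultdict(set)  # category -> narrower categories
        self.docs = defaultdict(set)      # category -> documents tagged with it

    def add_category(self, name, parent=None):
        self.children[name]  # touching the key registers the category
        if parent is not None:
            self.children[parent].add(name)

    def tag(self, doc, *categories):
        # One document may be filed under any number of categories.
        for category in categories:
            if category not in self.children:
                raise KeyError(f"unknown category: {category}")
            self.docs[category].add(doc)

    def search(self, category):
        # Walk the category and all of its descendants, collecting documents.
        found, seen, stack = set(), set(), [category]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            found |= self.docs.get(current, set())
            stack.extend(self.children.get(current, ()))
        return sorted(found)


index = OntologyIndex()
index.add_category("nanotechnology")
index.add_category("materials", parent="nanotechnology")
index.add_category("carbon nanotubes", parent="materials")
index.add_category("methods", parent="nanotechnology")
index.add_category("electron microscopy", parent="methods")

index.tag("smith_2008.pdf", "carbon nanotubes", "electron microscopy")
index.tag("lee_2007.pdf", "materials")

print(index.search("materials"))  # both papers: one tagged directly, one via a subcategory
print(index.search("methods"))    # only the paper tagged "electron microscopy"
```

The point of the hierarchy is that the professor can tag narrowly ("carbon nanotubes") and still find the paper when browsing broadly ("materials"), which is exactly what a single binder heading cannot do.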
http://www.mendeley.com/
Free (though, as far as I know, not open source) software meant for cataloging academic research papers, with a web backup/archive that can be shared with others.
InftyReader is a program that specializes in doing OCR on scientific documents and mathematical formulas. It saves documents in a variety of formats including LaTeX and MathML.
Two unfortunate things about it: 1) it's a Windows binary, and 2) it costs $900 USD for 2 concurrent-use licenses. It was free until they licensed a conventional OCR engine to better handle the text (its non-math recognition was pretty bad before).