Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).
... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.
Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?
http://owl.anytimecomm.com/
we use this at my office. Works well for us.
Papers is Mac software that does exactly what you need, and does it very well. Unfortunately it's not web-based and it's Mac only, but looking at it will probably tell you the right terms to google for.
I've had good luck with JabRef, which uses a BibTeX database on the backend so it integrates very well with LaTeX.
...but you can try library management software. A good place to start is
http://ask.slashdot.org/article.pl?sid=06/03/22/1320207
and
http://slashdot.org/article.pl?sid=07/12/11/1756247
DSpace ? http://www.dspace.org/
It may be worth looking at Beagle: http://beagle-project.org/ - it's Linux only though.
Zotero might be useful.
Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Office Jet Pro L7xxx series does this. The Pro L7680 is 200 US$ at Newegg.com
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math heavy, mixed in with images, graphs, tables and personal notes. I don't know of any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
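For what it's worth, the term-counting idea can be tried in a few lines of Python. This is only a rough sketch: the term list, the threshold of 3 mentions, and the function names are all made up for illustration, and real OCR text would need more cleanup than this.

```python
import re
from collections import Counter

def term_counts(text, terms):
    """Count whole-phrase, case-insensitive occurrences of each term in text."""
    lowered = text.lower()
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts

def suggest_tags(text, terms, min_count=3):
    """Suggest only the terms mentioned at least min_count times (assumed cutoff)."""
    counts = term_counts(text, terms)
    return sorted(term for term, n in counts.items() if n >= min_count)

# Hypothetical snippets standing in for OCR output of two papers.
passing = ("We use the Casimir effect as one example. The rest covers "
           "nanotube growth, nanotube yield, and nanotube purity.")
focused = ("The Casimir effect matters here. We measure the Casimir effect "
           "between plates and model the Casimir effect numerically.")
terms = ["Casimir effect", "nanotube"]

print(suggest_tags(passing, terms))
print(suggest_tags(focused, terms))
```

A paper that mentions a term once in passing falls under the cutoff and does not get the tag; one that keeps coming back to it does. Whether 3 is the right cutoff is something you'd have to tune against the professor's own binders.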
I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to a non-compressed image format (e.g. TIFF) at a decent resolution. We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software, see what they reproduce as the output. Pick the one that's the best result.
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/
The basic features would be:
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good of a search algo as something like Google, as it is purely Boolean...but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR searchable PDFs, throw them into a folder, and set up Google Desktop to only search that folder...you could then essentially "google" the contents of all those PDFs.
Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.
There are other tools; Strigi comes to mind, but that was too unstable for me. I do not know about commercial apps doing this - there are probably some, but I am a Linux user so I need not apply there... Then there are document management systems, but I think that is overkill for your needs.
If only GOOGLE had a way to search your DESKTOP, that would be perfect.
Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.
The parent said he is copying parts of texts, not entire books.
I seem to remember something about "educational use" in Section 107 of the Copyright Act....
For Mac OS X, try Papers. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.
Not trying to sound like a fanboi... However, I have hundreds of data sheets for various microprocessors, ICs, power supplies, embedded APIs, 5 years' worth of emails, etc. Spotlight indexes them all beautifully, and access is very quick, only a few seconds to pull up all references. I believe Spotlight will even index network attached storage, although I could be wrong.
Check out http://www.citeulike.org/ Does pretty much what you are asking for. You put in the details of papers, and assign keyword tags. You can also look at other people's libraries and so on.
OCR is pretty nasty stuff and it doesn't work very well at all. It's probably worth saying that the OCR results should probably only be used to generate your index and keywords.
Actually accessing the document should show the original PDF, not the error riddled OCR scan of it.
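To make that concrete, here's a minimal sketch of the "OCR text is only for the index" idea in Python. It assumes you already have the OCR text for each PDF as a plain string; the file paths and function names are invented for the example. The search returns the path of the original PDF, never the OCR text itself.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(ocr_text_by_pdf):
    """Map each word found in the (possibly error-ridden) OCR text to the PDFs containing it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in WORD.findall(text.lower()):
            index[word].add(pdf_path)
    return index

def search(index, query):
    """Return the original PDF paths whose OCR text contains every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR output keyed by the scanned PDF it came from.
docs = {
    "binder12/smith1998.pdf": "Carbon nanotube growth by CVD",
    "binder03/letter.pdf": "Dear colleague, the CVD reactor arrived",
}
index = build_index(docs)
print(search(index, "nanotube cvd"))
```

OCR mistakes then only cost you a missed hit now and then; whatever the user opens is always the clean scan.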
If your professor uses a Mac, consider Devonthink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html
For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...
I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.
Bibliography Tools in the Context of WWW and LATEX
Looks like that covers your needs.
CC.
If on a Mac, here's two you may consider (neither have a web interface).
Skim is open source and is a PDF reader and note-taker for OS X.
http://skim-app.sourceforge.net/
Yep is not open source, but will scan, tag and search PDFs ("like iTunes for PDFs").
http://www.ironicsoftware.com/yep/
http://en.wikipedia.org/wiki/Greenstone_(software)
-- From Greenstone's Web Site --
About Greenstone:
Greenstone is a suite of software for building and distributing digital library collections.
It provides a new way of organizing information and publishing it on the Internet or on CD-ROM.
Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.
It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read the Greenstone Factsheet for more information.
The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.
Digital libraries are radically reforming how information is disseminated and acquired in UNESCO's partner communities and institutions in the fields of education, science and culture around the world, and particularly in developing countries.
We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book 'How to build a digital library', authored by two of the group's members.
Shameless Plug:
I would highly recommend the Fujitsu ScanSnap 510 (or 510M if you're on a Mac). It ain't free and it ain't open source, but it comes with everything you need to scan in large quantities of documents, name them, put them in the folders you want, and create OCR text-backed PDFs, so you keep your original files and have "searchable" backing text. It does double-sided scanning at about 15 pages per minute (my real-world estimate).
I just bought the Mac version and have managed to reduce two packed drawers of a file cabinet down to just a few documents of which I wanted to keep the originals. Plus, with them being text backed (per a previous post) I can use Spotlight to search for them.
My next plan is to scan in my old Engineering notes.
Fujitsu is coming out with the 1500, but I don't know much more than it's supposed to be improved. The 510 is fantastic, though. Check out the reviews on Amazon:
http://www.amazon.com/Fujitsu-ScanSnap-S510-Sheet-fed-Scanner/dp/B000RUOW66/
Included with the scanner is Adobe Acrobat in addition to ABBYY FineReader OCR software.
No Linux software that I'm aware of, but once you have the files in PDF format you can use them to your liking. They aren't particularly cheap at $450, but I've been very happy with the device's utility.
I had a HP All-in-One as well, but not having a double-sided scanner made it a pain to use.
...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
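The core of such a tool is small enough to sketch in Python. Everything here is hypothetical (the function names, the in-memory dict of tags, the example paths); a real tool would need a UI and somewhere to persist the tags.

```python
import os

def scan(root):
    """Return every file under root as a path relative to root, sorted."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            found.append(os.path.relpath(os.path.join(folder, name), root))
    return sorted(found)

def find(tags_by_file, wanted_tags=(), name_part=""):
    """Return files that carry ALL wanted tags and whose filename contains name_part."""
    wanted = {t.lower() for t in wanted_tags}
    hits = []
    for path, tags in tags_by_file.items():
        have = {t.lower() for t in tags}
        if wanted <= have and name_part.lower() in os.path.basename(path).lower():
            hits.append(path)
    return sorted(hits)

# Hypothetical tags attached by hand after pointing scan() at a root folder.
tags_by_file = {
    "papers/smith1998.pdf": {"nanotube", "CVD"},
    "papers/jones2003.pdf": {"nanotube", "AFM"},
    "letters/smith-grant.pdf": {"correspondence"},
}
print(find(tags_by_file, ["nanotube"]))
print(find(tags_by_file, [], name_part="smith"))
```

Keywords, filenames, or both narrow the list, which is about all the "simple tool" described above would have to do before handing full-text search off to a desktop indexer.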
About 2 years ago I bought a Fujitsu fi-5120C duplex scanner for my small business. It came with Adobe Acrobat.
I scan every bill, correspondence, notice, and everything else to PDF, then I throw the paper the hell away.
The version of Acrobat included does OCR: I open Acrobat, choose create PDF from scanner, and scan away.
I can mix B&W and color, or duplex and simplex, within one scan job.
I can open an existing PDF and append to it.
I save everything to an Infrant NAS box.
I can go to Windows search, type in 1179.21 (actually did this one once),
set it to look INSIDE the files of that directory, and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off.
they have other model scanners that combine sheetfed+flatbed...
here is a beauty
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html
Tellico for KDE might be a suitable solution. I use it extensively as a collection manager.
I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
Our institution uses something called ePrints - I'm not sure if it's entirely what you're looking for but it does support different Subjects (headings?) and you can upload the documents using it.
If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org, the APS journals and ideally other journals), I would be very interested (even as a paying customer).
Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?
What you are looking for is a proper archiving application. I suggest ICA-AtoM. Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA-AtoM supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.
This researcher should learn to talk to his local librarians. Many universities have a bibliography management system e.g. Refworks, that would be a lightweight solution. And many of the articles he has in print are quite likely now already properly digitized and available by PDF through his university library. If he's a proper researcher, he should care about more than what he has in his binder. There are likely more recent articles that reference those articles, building on that knowledge. Which he's missing. He can chat with a librarian online, or try the 20th century version of communication and make an appointment to talk in person.
Copying excerpts for educational use in a classroom setting is actually an explicitly protected fair use case.
This is not a classroom setting, this is a research setting. Very different.
Though it may be covered under other criteria of fair use, the educational purposes exemption from copyright does not apply.
Once you OCR all your paper (FineReader is not bad), and full-text index your PDFs (Beagle for Linux, MOSS for Windows), you'll still have a problem with narrowing down a keyword search. Try Amplify http://www.hapax.com/amplify.php on the title/abstract/methods page of each document and maybe you'll get useful metadata.
I agree that for the immediate use listed there is unlikely to be any copyright violations. But if someone were to make a good collection for their lab, that perhaps then became popular in the department, it would start running into copyright gray areas. For example the university discontinues subscribing to a journal, but articles remain available on a broad intranet system. Normally if you already had a copy of the article that's legit, but now a new student has access to articles that were only available before they showed up. Or articles are scanned from copyright-legit sources and made available to a large audience, but not as large as the whole web. My guess is systems like this will be tolerated as long as they aren't very good. And when they become good, they'll be tolerated because everything else is not as good.
There's at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.
I would expect any modern photocopier to scan to pdf (while 150dpi is okay to look at, the OCR is better at 300dpi), then Adobe Professional does the OCR (my uni has a site license).
I actually bought a nifty tablet/pen thingy recently, and now I can write notes directly on the pdf too, in my own handwriting. I love it.
Zotero is a Firefox extension that can grab research papers directly from journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.
Mendeley is a stand-alone application. I haven't tried it yet, but it seems to have very similar functionality.
I use BibDesk to organize my research. It's not perfect for that as its basic use is to cite and all that but you can actually import pdfs and tag them, i.e. put them into smart folders. It does have the collaborative approach to organizing data in a flat structure similar to that of delicious.
Zotero is what you want. Integrates smoothly into a research workflow. Great for managing research materials of all kinds. Powerful search and tagging features. Adding sources is quick and easy and it works hand in glove with lots of research databases. Also interoperates with Word or OpenOffice to manage citations and bibliographies.
www.zotero.org
As an alternative to off-the-shelf software you could create a series of HTML pages, put them online and let Google index them for you. Create a separate HTML page for each scanned document with the desired keywords and a link to the document. Create an index page with a link to each of these HTML pages. Link to that index page from your home page or blog or any other page that you know Google scans. Wait a few days and search your site via Google. To generate the pages, build up a two-column table in your favorite spreadsheet application with the file name in one column and the keywords in the other. Export as CSV, and with a little coding in the programming language of your choice, you can generate the whole set of HTML in no time. Cheap, free and easy!
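The "little coding" part might look something like this in Python. It's a sketch under a few assumptions: the CSV has exactly two columns (file name, keywords) with no header row, each page is named after its document, and the function names are made up.

```python
import csv
import html
import io
import os
from urllib.parse import quote

def build_pages(csv_text):
    """Return {page_name: html_text}: one page per document plus an index.html."""
    pages, links = {}, []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        filename, keywords = row[0].strip(), row[1].strip()
        page_name = filename.rsplit(".", 1)[0] + ".html"
        title = html.escape(filename)
        pages[page_name] = (
            f"<html><head><title>{title}</title></head><body>"
            f"<p>Keywords: {html.escape(keywords)}</p>"
            f'<p><a href="{quote(filename)}">{title}</a></p>'
            "</body></html>"
        )
        links.append(f'<li><a href="{quote(page_name)}">{title}</a></li>')
    pages["index.html"] = "<html><body><ul>" + "".join(links) + "</ul></body></html>"
    return pages

def write_pages(pages, out_dir):
    """Write the generated pages into out_dir, creating it if needed."""
    os.makedirs(out_dir, exist_ok=True)
    for name, text in pages.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as fh:
            fh.write(text)

# Hypothetical two-row export from the spreadsheet.
sample = 'smith1998.pdf,"nanotube, CVD"\nletter-2001.pdf,correspondence\n'
pages = build_pages(sample)
print(sorted(pages))
```

Call write_pages(pages, "site") and upload the folder next to the scanned files; the relative links assume the PDFs sit in the same directory as the generated pages.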
What's better than having a program on your desktop that can search through your files, how about an online database of your files that you can access from any internet connected computer in the world (www.refbase.net).
1a. Windows Desktop Search with added "IFilters", or
1b. Google Desktop Search
I recommend (1a), amazingly, because once you've located and installed all the third-party IFilters - including one(s) for PDF files - WDS will be able to index and make searchable MANY more files than GDS (in my case, about THREE TIMES as many). If the original PDFs from which so much of the binder material was printed are still available, then your effort with the following is greatly reduced.
2. Good major-manufacturer scanner with ADF.
I haven't kept up with scanners in recent years, so I'll leave it to you or someone else to make specific recommendations. It may be important to stick with well-known brands for purposes of compatibility with the scanning/OCR software (3).
3. Forget Adobe: buy the latest version of OmniPage Pro. Just like Adobe, it can OCR text and pump it into a PDF while "fronting" it with an image of the original page, for sake of complex layouts and possibility of future OCR corrections.
No need to worry about complex database systems to store all the stuff; just create a storage directory (or hierarchy, if there are tens of thousands of files). When you're done, you'll have a library of PDF files that have been fully indexed by a desktop search engine, such that any snippet of text in a document can be used to locate it.
I started Labmeeting with just this problem in mind.
First, we focus only on the biomedical and related spaces right now. Eventually, we might expand into Nanotech and CS, but we are helping out PubMed users first, for the most part.
We let you upload lots of papers, index them for you, provide a great interface for searching and annotating them. We have tens of thousands of bioscientists on the site with private paper collections.
I know this won't necessarily help your professor right now since we mostly focus on biology, but I'll let you know if we ever expand.
As you are at a university, you probably have access to a copy of Adobe Acrobat. I have found that it has alright OCR for scanned PDFs. You may also want to look at Scribd; it is not open source, but it is free and searchable.
The University of Oklahoma participates in JSTOR:
http://www.jstor.org/
They also appear to be EBSCO participants:
http://search.ebscohost.com/
I'm pretty sure "the 20th century" is right there already, if you can drag him to the library.
Note: This isn't going to work for people not affiliated with an institution. Both of these services make paper journal content available online for subscription fees paid by the institution (or business), so unless you are in the "big boy clique", you're not going to have access to them, unless you pay through the nose.
-- Terry
Since you only have one person that needs to access the files I would just use Desktop Search. Personally I like Google Desktop ( http://desktop.google.com/ ) & Copernic Desktop Search ( http://www.copernic.com/en/products/desktop-search/index.html ). Here is an article reviewing some of them - http://lifehacker.com/400365/five-best-desktop-search-applications.
The main thing that you need to do is OCR the documents when you scan them in (you can convert non-OCR PDFs into OCR PDFs, but I don't know of anything that can search them before you put text in them). On Linux the two main ones that I know of are Tracker and Beagle (http://www.linux.com/feature/143259).
I know these are not all open source and don't all have web interfaces, but they are really easy to use. You just put the files in folders and point the desktop search at them. Great for someone that doesn't know a lot about computers.
According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.
107: Permits the "fair use" of an owner's work without permission, for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
108: Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan, all without permission.
109: Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.
110: Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.
121: Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities.
http://www.copyright.gov/title17/92chap1.html#107
The copyright act section 107. This section lists many cases of fair use but gives 4 primary criteria for courts to consider. The first is the purpose of the work and makes it clear that non-profit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).
I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm
http://www.nasw.org/users/nbauman/lawdb.htm
http://www.nasw.org/users/nbauman/discover.htm
It was a long time ago and the software and hardware have changed, but the concepts are still the same, and the costs are a lot less.
Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is Pubmed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.
Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
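As a rough illustration of that kind of record, here's a sketch using SQLite from Python. The table and column names are my own invention, not from any particular product; the point is just one field for the looseleaf, one for the tab, the extra "goodies", and a keyword table where each term exists only once.

```python
import sqlite3

SCHEMA = """
CREATE TABLE article (
    id INTEGER PRIMARY KEY,
    binder TEXT NOT NULL,      -- which looseleaf the paper copy lives in
    tab TEXT NOT NULL,         -- tab within that looseleaf
    author TEXT,
    year INTEGER,
    citation TEXT
);
CREATE TABLE keyword (
    id INTEGER PRIMARY KEY,
    term TEXT NOT NULL UNIQUE COLLATE NOCASE  -- controlled vocabulary, no duplicates
);
CREATE TABLE article_keyword (
    article_id INTEGER NOT NULL REFERENCES article(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    PRIMARY KEY (article_id, keyword_id)
);
"""

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def add_article(db, binder, tab, author, year, citation, keywords):
    """Insert one record and link it to any number of keywords."""
    cur = db.execute(
        "INSERT INTO article (binder, tab, author, year, citation) VALUES (?, ?, ?, ?, ?)",
        (binder, tab, author, year, citation))
    article_id = cur.lastrowid
    for term in keywords:
        db.execute("INSERT OR IGNORE INTO keyword (term) VALUES (?)", (term,))
        (keyword_id,) = db.execute(
            "SELECT id FROM keyword WHERE term = ?", (term,)).fetchone()
        db.execute("INSERT OR IGNORE INTO article_keyword VALUES (?, ?)",
                   (article_id, keyword_id))
    return article_id

def find_by_keywords(db, keywords):
    """Return (binder, tab, citation) for articles carrying ALL the given keywords."""
    wanted = sorted({k.lower() for k in keywords})
    if not wanted:
        return []
    marks = ",".join("?" * len(wanted))
    rows = db.execute(
        f"""SELECT a.binder, a.tab, a.citation
            FROM article a
            JOIN article_keyword ak ON ak.article_id = a.id
            JOIN keyword k ON k.id = ak.keyword_id
            WHERE k.term IN ({marks})
            GROUP BY a.id
            HAVING COUNT(DISTINCT k.id) = ?
            ORDER BY a.binder, a.tab""",
        (*wanted, len(wanted)))
    return rows.fetchall()

# Placeholder records; the binder names and citations are not real.
db = open_db()
add_article(db, "Binder 12", "Growth", "Author A", 1998, "placeholder citation 1", ["nanotube", "CVD"])
add_article(db, "Binder 12", "Imaging", "Author B", 2003, "placeholder citation 2", ["nanotube", "AFM"])
print(find_by_keywords(db, ["nanotube", "cvd"]))
```

The search result tells him which looseleaf and tab to pull off the shelf, so this works whether or not anything ever gets scanned.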
I assume he doesn't have the PDFs any more. That would have made it a lot easier.
It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.
He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.
Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.
Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.
You might start by estimating the number of pages and documents you have.
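Once you have those counts, a back-of-the-envelope estimate is easy to script. The defaults below are just my figures from above (about 20 hours per thousand records, up to 1 minute a page on a cheap scanner); the function name and the example counts are made up, so plug in your own.

```python
def hours_needed(records, pages, seconds_per_record=72, seconds_per_page=60):
    """Return (data entry hours, scanning hours) for the given workload.

    72 s per record matches roughly 20 hours per 1,000 records;
    60 s per page is the slow end of a consumer-grade scanner.
    """
    entry_hours = records * seconds_per_record / 3600
    scan_hours = pages * seconds_per_page / 3600
    return entry_hours, scan_hours

# Hypothetical workload: 1,000 records totalling 8,000 pages.
entry, scanning = hours_needed(1000, 8000)
print(f"data entry: {entry:.0f} h, scanning: {scanning:.0f} h")
```

If the scanning number comes out several times the data entry number, that is a good argument either for borrowing a faster scanner or for skipping the scanning altogether.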
But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.
If anybody knows of up-to-date articles on this subject, I'd love to know the citation.
I didn't read the article and I'm not sure exactly what you want, but take a look at Bookends from Sonny Software.
Saved my butt, made life easy.
http://www.sonnysoftware.com/
Reference Management and Bibliography Software for Mac OS X
DocMGR is what you're after. Its a web app that takes submitted documents (PDF, Office, etc), OCRs and indexes them, and allows stuff to be searched. www.docmgr.org
is the best Scan/OCR app you can buy with many nice features. The first is that it's reasonably priced. Second is the fact that it can import PDF files directly and OCR/index them, and it handles almost every language on the planet. Definitely worth looking into, as the school may actually have a site license, which means you don't have to buy anything.
Give a call and see if their software does what you want; if they haven't messed it up since I last touched it, it should do document archiving, scanning, OCR, search, tags, categorizations, and whatever custom database fields you want to throw at it.
http://datagenix.com/
We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.
The big attraction of the fi-6130 is its speed: 40 pages per minute.
If you are interested, I suggest you download the manual. (PDF)
The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.
The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.
The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.
It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed is important with a flatbed scanner, but not as important, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together, without any logical connection, since apparently the 6230 has two imaging elements.
Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.
Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.
Before I return to work: if he's already getting the PDFs somehow, save them to a directory, and use Google Desktop to search through them for the keywords. It will search through most common document types.
not only is time travel possible, it's irrelevant.
Loving the fact that the guy didn't know what century it is!
Here's what I've used for my own documents: 1) Convert the PDFs to TIFFs with ImageMagick, 2) OCR the TIFFs with Tesseract, 3) Index the text with Xapian. The OCR step won't get all the text right, but it will get about 90% or better, which is good enough for indexing and searching with Xapian. For me that's been a pretty good solution.
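The three steps above can be roughed out in Python along these lines. This is only a sketch under some assumptions: the `convert` and `tesseract` command-line tools are installed and on the PATH (newer ImageMagick releases call the binary `magick`, and PDF input needs Ghostscript), 300 dpi is a guess at a reasonable OCR resolution, and the little in-memory word index is a toy stand-in for Xapian, not its API. All function names are made up.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

def pdf_to_tiff_cmd(pdf, out_dir, dpi=300):
    """ImageMagick command that renders each PDF page to its own TIFF."""
    pattern = str(Path(out_dir) / (Path(pdf).stem + "-%03d.tiff"))
    return ["convert", "-density", str(dpi), str(pdf), pattern]

def ocr_cmd(tiff):
    """Tesseract command; it writes the text to <output base>.txt."""
    return ["tesseract", str(tiff), str(Path(tiff).with_suffix(""))]

def ocr_pdf(pdf, out_dir):
    """Run steps 1 and 2 and return the recognized text of the whole PDF."""
    subprocess.run(pdf_to_tiff_cmd(pdf, out_dir), check=True)
    pages = sorted(Path(out_dir).glob(Path(pdf).stem + "-*.tiff"))
    for tiff in pages:
        subprocess.run(ocr_cmd(tiff), check=True)
    return "\n".join(
        p.with_suffix(".txt").read_text(errors="replace") for p in pages)

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(texts):
    """Step 3, toy version: map each word to the documents containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*[index.get(w, set()) for w in words])
```

Something like build_index({pdf.name: ocr_pdf(pdf, "ocr-out") for pdf in Path("papers").glob("*.pdf")}) would tie it together, with "papers" and "ocr-out" being whatever directories you use. A real engine like Xapian adds stemming and ranking, which matters once the OCR errors start piling up.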
Seriously. This is what we get paid to do. There's far too much to communicate on a forum, and if the SNR here is typical, you'll get awful, unrelated and just plain wrong advice.
If you can't hire, see if your school has a library science program and look for a good intern.
Failing that, read the Polar Bear book (Rosenfeld & Morville, pub: O'Reilly) yourself and follow the threads to resources particular to your problem.
Tactical help: populate the keywords, title, and subject properties in your PDFs and Office docs. If you populate them in Office and make a PDF, the properties come along. They're in File>Properties... Filling them out will help any search engine that can consume binary docs make sense of your content.
And there are bulk scan-to-OCR packages out there. Funnel into PDF and populate the properties.
Worth saying again: populate the properties.
Drupal + Modules (CCK + Filefield + Taxonomy + Views) Taxonomy provides categorization of content. You can have multiple vocabularies (sets of terms). You can assign one or many terms to each piece of content (or document in this case).
My father runs a small business and has to track a bunch of paperwork for each client, so I got him a cheap LED lit flatbed scanner, but like everyone else, he discovered that it was too slow to manually scan in each page, even if the scanner itself was quite fast.
He eventually figured out that the fastest scanning technique is not to use a scanner at all, but a digital camera. He made a rig with a marked out area the size of an A4 sheet of paper, and then he attached a camera mount so that the camera would be facing down, pre-aligned to photograph the entire sheet. I've seen it in action, he can easily do a page per second: he just places the next page on the platform with one hand, and presses the shutter button with the other hand.
The resolution is more than good enough for OCR, and most cameras have better depth-of-field than scanners, so more of the page is in focus, even near bindings and staples.
I suppose you meant "drag a professor into the 21st century".
I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handling group separator sheets. I highly recommend it. I have seen no evidence of "marketing drone foolishness."
Evernote (Evernote.com) might be too small for your needs, and it's not open source, but it:
- Has OCR
- Is very cheap ($6 a month for the pro version, free for the light version)
- Recognizes handwriting
- Accepts tags
- Has a web interface (and a desktop client)
Its only limitation is the 500 MB monthly upload cap. Since you have hundreds of files to get through, you will go over the cap if you upload all at once. But since scanning those things is going to take you ages, you might be fine. Also, if your boss is still collecting paper, he's probably pretty old-school. Evernote is dead simple to use.
I couldn't tell you how many times I've gone to use a phonebook or reference manual and tried to flip through to the search page.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
There is an OS X application written specifically for this kind of scenario: DEVONthink. http://www.stevenberlinjohnson.com/movabletype/archives/000231.html It has the ABBYY OCR engine built in, a web server, and an extremely smart search filter, which is able to find related documents based on metrics like keyword frequency.
In CERN DS (with certainly a focus on high-energy physics) my papers are shown only up to 2006; so this database appears useless for me.
I work with electronic medical records and we have found Fujitsu scanners to be top-notch. Fast, reliable, and generally affordable. We've also used some of the larger production scanners from Kodak and Bowe Bell and Howell. They are solid scanners, but are more expensive and haven't taken the beating we tend to give the Fujitsus.
We just replaced an old 3097 with the 6130 and are waiting to hear how it holds up. Note, most of these scanners are deployed to remote scan areas in the hospital where it is the responsibility of the users to handle maintenance, which means these scanners aren't cleaned and don't have rollers replaced. These older models have gone years with almost no maintenance.
I think what you are looking for is something called "document management" software.
... where he could archive the PDFs and scanned documents and be able to search by keywords?
I agree with the OCR requirement, but if he just needs to search the resulting PDFs, wouldn't DocSearcher do the job for him? I've found it trivial to set up and run and it's certainly helped me keep track of docs etc.
Too bad the OSS community has no real answers.
this is something i submitted a week or so ago:
"I'm looking for software that can help my company manage information in documents that may be in pdf, doc or web form. I work for a biotech company with 15 people, and we have large numbers of documents that range from very technical scientific publications (usually pdf) to company reports like 10-Ks, to web pages to newspaper articles to pictures. We use these documents to review and stay current with the scientific literature; to learn about what competitors are doing; to gain market information (who is selling how much of what); to generate publicity for our products; and so forth.
We currently use the windows file tree as our organizer, which creates several problems: I can't put one file into multiple bins; I can't use keywords to search; I can't organize files into groups.
What I would like (I think) to do is organize the information by keywords and subjects; associate groups of files into binders, and create summaries for the binders (e.g., I might have 5 files that go together, and my own summary of what the five files mean); add sticky notes to anything at any time (actually, I would like keywords and stickies [comments in Adobe Acrobat] to be the same: words in stickies are keywords, and keywords show up in the stickies); add URLs and web pages directly from the browser; and have a function that mimics or is compatible with a package like EndNote or ProCite or Papyrus or refcite (formats bibliographies in Word docs).
I'm not even sure what the solution looks like, but it needs to be cheap (http://www.ncbi.nlm.nih.gov/sites/entrez. This has a lot of features that scientists need, such as keyword search returns a list of articles that can be viewed by abstract."
this is a problem that comes up a lot, for a lot of people
I've tried a lot of the solutions, like Zotero, and they just don't cut it for one person, much less if you need to share the info among a small group of people.
There is a fabulous market for someone who wants to write this software
The main problem, which I don't think anyone has addressed, is that free information has a price: a human can only remember so much. So the glut of free pdf/web info is actually bad, because you lose sight of the important stuff. This used to be done for you by your $ monthly journal subscriptions: if you are in nanoscience, you might get Nano Letters from the American Chemical Society, and the editors do the weeding out for you.
The other problem is how does one do natural language queries?
Of the available answers, most are owned by a de facto monopoly, Thomson Reuters; RefMan is probably the best.
Surely there must be someone who makes a PDF library database front end better than the collection feature in Adobe Acrobat.
I realize that Slashdot is going to take the technology solution as the only one, and in this case it's probably the right way to go, but ...
People have managed documents and information like this for centuries and it worked rather well. Perhaps you should stop being lazy and learn how to use traditional reference materials, as you're going to need this skill for a few more years anyway.
Those skills are still useful today. Just because Google can index and allow you to find words in the documents it knows about doesn't mean that it can help you figure out what you're looking for. If you have no traditional reference skills, Google becomes a lot less effective. This of course isn't specific to Google, all search engines in the world won't help you if you can't figure out what you're looking for.
I ran into this issue also, since I have tons of PDFs and sometimes it can take a while to find that paper you remember that mentioned HylD or ZnO. I use the search client Copernic http://www.copernic.com/ . It has a serious advantage over Google Desktop since it gives you a handy little preview pane, which is useful when sorting through results. Since I carry everything around on a hard drive, I just have the program set to index that drive (which is set to always have the same drive letter). As for versions of the program, they kind of went in a bad direction with the 2.x releases and I kept using 1.6/1.7 for a while, but I recently started using the current release, 3.x, and it works like a champ. Good luck.
Good grief, forget your FOSS ideology, scan them to PDF, OCR them using Acrobat Pro (the education price is ridiculously cheap), store them on a Mac, and use the built-in Spotlight to search them.
Why not use something like JabRef? Easy to manage the references?
I'm surprised nobody has mentioned Calibre, which was also featured on Lifehacker sometime back.
It is based on PyQt (as well as dateutil, mechanize, lxml, BeautifulSoup). They even have a CoverFlow-like interface which is pretty good. I suppose it is usable on Win, Lin and Mac.
You have to provide a login/password to LibraryThing (or a few other alternatives) and you can then search and tag for the book's metadata and cover images from these sources automagically.
I personally also use it to archive the PDFs that I download from the internet, tag them, and specify authors and other metadata (incidentally, most of the papers that people create from LaTeX do not have any metadata).
I see the developers pushing out a release every week, so it is under pretty active development. I don't know if there is a plan to integrate any indexing features in it, but I suppose the developers are open to it.
Google Desktop searches through your computer's files to find keywords inside the files themselves. If he saves all the documents he finds online to that computer, he should be able to do keyword searches in those documents.
Also, if the pdfs are ocr'd then he could search via that as well.
I have always liked the basic idea behind CiteSeer.
"CiteSeerx is a scientific literature digital library and search engine that focuses primarily on the literature in computer and information science. CiteSeerx aims to improve the dissemination of scientific literature and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness in the access of scientific and scholarly knowledge.
Rather than creating just another digital library, CiteSeerx attempts to provide resources such as algorithms, data, metadata, services, techniques, and software that can be used to promote other digital libraries. CiteSeerx has developed new methods and algorithms to index PostScript and PDF research articles on the Web. ..."
The basic issue for you would be that, as currently implemented, it was made to focus on computer and information sciences.
http://citeseerx.ist.psu.edu/about/site
In the short term, this may not be valuable for you. In the long term, I think this can be the basis for most or any academic (or even non-academic) research literature.
The easiest solution is RefWorks, an online citation manager. You can automatically import articles from online databases or create your own reference entries with space to add any kind of article information or user-specified metadata, plus you can attach PDFs directly to the database entry. The database is stored online by RefWorks and is searchable from anywhere via a web browser. Many universities already have site licenses for this system, so check with your university librarians. Otherwise, check out their website for further details. The Microsoft Office plug-in for the manager, Write-N-Cite III, works with Microsoft Word and the database to automatically generate reference lists and citations formatted to the style of almost every major and minor academic journal in most disciplines. The whole database is searchable and may be organized by project. You can also automatically import any article or abstract from Google Scholar or other academic databases like JSTOR, ProQuest, etc.
In order to properly create a hierarchical index which is searchable, you may be interested in constructing an ontology, which is a description of your subject matter in terms of some broad categories. Those broad categories then branch out into logical subject areas. Many databases support hierarchical structures which match well with the way an ontology works. Once the ontology is constructed, which consists of keywords which represent the categories, you index the document on those keywords. Then your system can browse the hierarchy or zero in on a particular term. In linguistics ontologies are used to construct meaning trees of words as a starting point into determining the meaning and intent of some written text. Perhaps some of the commercial packages discussed can do this, but this is what I would look for in a product if I was faced with your task.
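To make the ontology idea above a bit more concrete, here is a small Python sketch. The class and the example terms are invented for illustration and are not taken from any of the packages mentioned in this thread. Categories form a tree, documents are indexed on the terms, and a lookup on a broad category also returns whatever was filed under its narrower terms.

```python
from collections import defaultdict

class Ontology:
    """A tiny category tree with documents indexed on its terms."""

    def __init__(self):
        self.children = defaultdict(set)   # term -> narrower terms
        self.docs = defaultdict(set)       # term -> documents tagged with it

    def add_term(self, term, parent=None):
        """Register a term, optionally as a narrower term of `parent`."""
        self.children[term]  # creates the entry even if it has no children
        if parent is not None:
            self.children[parent].add(term)

    def tag(self, doc, *terms):
        """Index one document on any number of terms."""
        for term in terms:
            self.docs[term].add(doc)

    def descendants(self, term):
        """The term itself plus every narrower term beneath it."""
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.children.get(current, ()))
        return seen

    def find(self, term):
        """Documents tagged with the term or with anything below it."""
        found = set()
        for t in self.descendants(term):
            found |= self.docs.get(t, set())
        return found
```

For example, with "ZnO" filed under "oxides" under "materials", a paper tagged only "ZnO" still turns up when you browse "materials", and you can zero in on it by asking for "ZnO" directly.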
InftyReader is a program that specializes in doing OCR on scientific documents and mathematical formulas. It saves documents in a variety of formats including LaTeX and MathML.
Two unfortunate things about it: 1) it's a Windows binary 2) it costs $900USD for 2 concurrent use licenses. It was free until they licensed a conventional OCR engine to better handle the text (its non-math recognition was pretty bad before).