Building a Searchable Literature Archive With Keywords?

Document Management Software and OCR by eldavojohn · 2009-04-08 08:21 · Score: 5, Informative

I think what you are looking for is something called "document management" software. As far as FOSS goes, KnowledgeTree offers a community version that might be up your alley. They have an online demo if you're interested. There's also Alfresco, but I haven't tried either of these.

From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

... where he could archive the PDFs and scanned documents and be able to search by keywords?

So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR or Tesseract if this is the case.
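A minimal sketch of that intake step, assuming Tesseract is installed and that each scanned page has already been exported as an image file (the directory and file names are made up for illustration). It relies only on Tesseract's basic command-line form, `tesseract <image> <output base>`, which writes `<output base>.txt`:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def tesseract_command(image: Path, out_dir: Path) -> List[str]:
    """Build the command line: tesseract <image> <output base>.

    Tesseract appends ".txt" to the output base on its own.
    """
    return ["tesseract", str(image), str(out_dir / image.stem)]


def ocr_image(image: Path, out_dir: Path) -> Optional[Path]:
    """OCR one page image; return the text file path, or None if
    the tesseract binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(tesseract_command(image, out_dir), check=True)
    return out_dir / (image.stem + ".txt")


if __name__ == "__main__":
    # Hypothetical layout: page images in ./scans, OCR text in ./text
    for page in sorted(Path("scans").glob("*.png")):
        print(page, "->", ocr_image(page, Path("text")))
```

The text files are only raw material for a search index; the scanned original stays the document of record.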

Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

--
My work here is dung.

Re:Document Management Software and OCR by CyberLord+Seven · 2009-04-08 08:26 · Score: 1

I wish I had mod-points.
Very well done and informative on all counts.

--
We have always been at war with Eurasia!
Re:Document Management Software and OCR by qoncept · 2009-04-08 08:26 · Score: 4, Funny

If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.

--
Whale
Re:Document Management Software and OCR by WillKemp · 2009-04-08 08:27 · Score: 1

Look into GOCR or Tesseract if this is the case.
Unless you're really lucky with the images, OCR requires a lot of work correcting errors. It would probably be less work to just be able to add searchable tags to whatever system is used to store the PDFs and leave them as images.
Re:Document Management Software and OCR by Red+Flayer · 2009-04-08 08:33 · Score: 2, Interesting

I think what you are looking for is something called "document management" software.
Ugh... for a second there, I thought Clippy started posting to Slashdot. Would you like help with that?

I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.

My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.

For an indexer, I've heard good things about MPS, and a friend did a similar project to yours with Yaz/Zebra, but he was working with a library, so there may have been a special reason for that.

--
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
Re:Document Management Software and OCR by Shadow+Wrought · 2009-04-08 08:39 · Score: 3, Insightful

OCR certainly requires work if you need it to be completely accurate. In practice, speaking as a paralegal who's overseen the OCR'ing of millions of pages, it's just not a reasonable expectation. If you can supplement it with coding, in this case keyword tags, date, author, publication and title would build a pretty strong database. If he's looking to do that already, then whatever OCR you get is gravy. Some is better than none.

--
If brevity is the soul of wit, then how does one explain Twitter?
Re:Document Management Software and OCR by shaitand · 2009-04-08 08:53 · Score: 2, Insightful

That depends. If all he needs the OCR for is to build a searchable keyword index, then the error rate can be quite high and still give good results. The results of a query should point to the original PDF, not the result of the OCR.
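To make that concrete, here is a rough sketch of a keyword index built from noisy OCR text, where each hit is the path of the original PDF rather than the OCR output. The file names and sample text are invented, and the tokenizer (lowercase runs of three or more letters) is just one simple choice:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase words of 3+ letters; digits and OCR junk split words."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def build_index(ocr_text_by_pdf):
    """Map each word to the set of PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in tokenize(text):
            index[word].add(pdf_path)
    return index


def search(index, query):
    """Return the PDFs containing every word of the query, sorted by path."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical OCR output, with typical misreads ("Thc", "stee1")
    docs = {
        "binder03/smith1987.pdf": "Thc corrosion of stee1 in seawater; corrosion rates",
        "binder07/jones1992.pdf": "Polymer coatings and corrosion inhibitors",
    }
    index = build_index(docs)
    print(search(index, "corrosion"))
    print(search(index, "corrosion seawater"))
```

A misread word such as "stee1" is lost to the index, but a term that matters to a paper usually appears several times, so one clean occurrence is enough to find the document.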
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-08 08:54 · Score: 0

This is actually My line of business. Yes Document Management Software is exactly what he wants. Those hundreds of binders can be scanned in and indexed within a reasonable amount of time. In general we save companies hundreds of man hours a week with our software (shameless plug - www.docuware.com)
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-08 08:58 · Score: 0

Adobe Acrobat Standard version has OCR built in. Yes, a chore, but there you go.
Re:Document Management Software and OCR by digitalunity · 2009-04-08 09:03 · Score: 1

OCR, in my experience, is also crap with equations and technical literature in general. Its linguistic fuzzy matching changes technical words it doesn't recognize into similar words it does, on the basis that well shit it might have just not scanned very well.
Not much way around this.

--
You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
Re:Document Management Software and OCR by burki · 2009-04-08 09:06 · Score: 3, Informative

For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/
Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/) is used to generate the searchable PDF files.
(http://sourceforge.net/forum/forum.php?forum_id=868471)
Re:Document Management Software and OCR by digitalunity · 2009-04-08 09:06 · Score: 2, Interesting

Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.
You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDF's until you find what you need.
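A minimal sketch of such an electronic card index, using SQLite from the Python standard library. The table and column names are my own choices, not anything prescribed above; the serial number is whatever gets written on the printout:

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog: one row per document, one per tag."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " serial INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " serial INTEGER NOT NULL REFERENCES documents(serial),"
        " tag TEXT NOT NULL, UNIQUE (serial, tag))"
    )
    return db


def add_document(db, serial, title, tags):
    """Record a document under its serial number with lowercase tags."""
    db.execute("INSERT INTO documents VALUES (?, ?)", (serial, title))
    db.executemany(
        "INSERT OR IGNORE INTO tags VALUES (?, ?)",
        [(serial, tag.lower()) for tag in tags],
    )
    db.commit()


def find_by_tag(db, tag):
    """Return (serial, title) pairs carrying the tag, ordered by serial."""
    rows = db.execute(
        "SELECT d.serial, d.title FROM documents d"
        " JOIN tags t ON t.serial = d.serial"
        " WHERE t.tag = ? ORDER BY d.serial",
        (tag.lower(),),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_catalog()
    # Hypothetical entries for illustration
    add_document(db, 1, "Corrosion of steel in seawater", ["corrosion", "steel"])
    add_document(db, 2, "Polymer coatings", ["Corrosion", "polymers"])
    print(find_by_tag(db, "corrosion"))
```

Passing a file name instead of `":memory:"` keeps the catalog between sessions.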

--
You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
Re:Document Management Software and OCR by snspdaarf · 2009-04-08 09:16 · Score: 2, Insightful

And, since what the Boomers and Sooners did was about 100 years ago, slamming someone for their Slashdot nickname helps with their current problem how? The Chickasaw side of my family has owned the same land since before statehood, and the German side was in on the Land Run, and we don't get all fired up about what the people were doing back then. The past is past. Learn to relax a little.

--
Why, without your clothes, you're naked, Miss Dudley!
Re:Document Management Software and OCR by Hognoxious · 2009-04-08 09:42 · Score: 1

Wouldn't it be even less work to get somebody else (TAs or some other form of free labour that needs to be on the prof's good side) to do it?

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-08 09:57 · Score: 0

Another suggestion: Sphinx Search.
Re:Document Management Software and OCR by NoobixCube · 2009-04-08 10:08 · Score: 2, Funny

What he needs is a bunch of undergrads or interns to painstakingly transcribe and proofread every scrap and napkin of text!

--
Admit it. You post strawman arguments as AC so you get modded Insightful for refuting them, rather than Troll
Re:Document Management Software and OCR by mungewell · 2009-04-08 10:23 · Score: 1

We are just starting to use Knowledge Tree, and it does have a 'Tag Field' where you can associate searchable keywords with every/any document contained in the system. It also supports the concept of linking documents, so you can manually add specific links between documents.
If 'you' are just looking to index the text contained in a series of PDFs, why not just use one of the many desktop search engines?
Mungewell.
Re:Document Management Software and OCR by Shadow+Wrought · 2009-04-08 10:51 · Score: 1

Very much so. And, if it's photocopies of photocopies, then you start running into other OCR issues as well. Like anything else it's just a tool which is sometimes helpful, but has its limitations. Depending on the volume though I'd think it would be easy enough even in Access or the OO equivalent to create a coding template with the Binder number and tab (assuming it's already so organized) and then the author, title, date, publication, and key words.

I'm sure there's an undergrad somewhere who'd like a semester long "internship!"

--
If brevity is the soul of wit, then how does one explain Twitter?
Re:Document Management Software and OCR by electrons_are_brave · 2009-04-08 12:25 · Score: 5, Insightful

As an ex-librarian, I can give you a professional's answer. You need a professional. But - if that's not possible, then what you are aiming for is a dream, and a huge data entry task to boot. And you will be creating a system that he will never be able to maintain. Aim lower. Ask him - does he want to keep the paper copies or move them all onto computer. Not both.
If he wants to keep the paper - it's simple. Weed weed weed. 60% of what anyone holds is rubbish, and if's available online (and I mean in a proper source not a dissapearing link) he'll find it when he needs it. (I'm thinking he can't be using much of it given the difficulty of finding it). So that will leave you with about 20 three-rings out of the hundreds. Number each document, put them in a filing cabinet by MAIN SUBJECT. If you want to spend your life typing then, by all means, use incite, the word referencing system or some simple library freeware to create a db with author, title, journal etc and main subject (or maybe two).
If he wants them all digital - same deal. Scan the ones that aren't there. Forget any sort of magic software that will catalogue for you, you crazy dreamer. The best you can do is use incite or some other referencing software to search for and make a record of the ones that have the record available on line. And then type the rest in.
Personally, he sounds like a hoarder, so he will probably resist both suggestions. If this is the case then sort the folders into main subject and type a list (bib reference) and stick it to the front of each. At least that will cut down on his search time - but again, it's a lot of typing.
Re:Document Management Software and OCR by electrons_are_brave · 2009-04-08 12:30 · Score: 2, Insightful

BTW - my above answer is based on the assumption that he has no money to spend on getting this done.
Re:Document Management Software and OCR by RicRoc · 2009-04-08 12:32 · Score: 2, Informative

Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!

--
Who?
Re:Document Management Software and OCR by gbutler69 · 2009-04-08 12:39 · Score: 1

I'm in the beginning of deploying "Alfresco" for our company Intra-Net/Document Management. It seems to be highly featureful and I would think it is exactly what you need to fill the bill.

--
Over-the-top Response Guy! Giving "Over-the-Top Responses" since 1970.
Re:Document Management Software and OCR by Miseph · 2009-04-08 14:13 · Score: 2, Insightful

It sounds like "going paperless" is exactly what he wants. It so happens that his office deals in research materials, but that doesn't really change the objective: stop using unwieldy and resource-intensive paper documents in favor of highly indexed digital ones.
Out of curiosity, have you considered a wikiesque system... autolinking titles and keywords between articles and some sort of glossary (but not the "let's have everybody able to edit it because everything is an opinion and all opinions are equally invalid" communal happy horseshit)? It might not be any more useful to the professor as a research tool, but it could certainly be useful as an educational tool, particularly for undergrads who might not remember every term and principle right off the top of their heads or know off-hand what else they should be looking at to help grok what they're reading.

--
Try not to take me more seriously than I take myself.
Re:Document Management Software and OCR by YourExperiment · 2009-04-08 21:22 · Score: 1

He wasn't criticising the guy for the actions of his ancestors, he was criticising his choice of username, which he believes glorifies the actions of those people. It's sorta like picking a username like "AdolfHitler666". Except that this guy might just be a Battlestar Galactica fan with a predilection for oriental chicks.
Re:Document Management Software and OCR by snspdaarf · 2009-04-09 00:10 · Score: 1

That's the point. It's a user name. One based (probably) on the words to a college fight song rather than the actions of two small groups of people in the past. Why beat him up about it?
Nice Godwin, but I would venture to say there are very few college fight songs that reference Hitler, or Satan, so it's not really a good parallel.

--
Why, without your clothes, you're naked, Miss Dudley!
Re:Document Management Software and OCR by YourExperiment · 2009-04-09 00:32 · Score: 1

Not much point me trying to Godwin a thread when you go and reply anyway though. :)
Re:Document Management Software and OCR by Sooner+Boomer · 2009-04-09 01:09 · Score: 1

You've made the same comments that a lot of other people have. In the university, and especially in the sciences and engineering fields, there are two "types" of professors: those that are teachers and do research, and those that only do research. The teachers are the ones that are on tenure track and will become a "line item" or continuously funded position. The professors that only do research are dependent on grant monies for their salaries and are "soft money" or non-line item funded. They may or may not have grad students, and the grad students they DO have are focused on the research.

--
Chaos maximizes locally around me.
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-09 01:47 · Score: 0

Its linguistic fuzzy matching changes technical words it doesn't recognize into similar words it does, on the basis that well shit it might have just not scanned very well. Not much way around this.
Actually, very easy way around this. Set your OCR software so it doesn't do the replacement automatically. I've used several packages, and Omnipage is one of the most accurate. Any of the top contenders will not replace automatically and will allow you to go through the suspected errors manually if you want to. That said, you're still right about equations. They all suck for equations and for whatever reason that market is being ignored so far. Omnipage doesn't even try to do anything sane with equations or any math notation for that matter.
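The "flag it, don't replace it" behaviour can be sketched in a few lines. This is not how Omnipage or any other package works internally; it is only an illustration using `difflib.get_close_matches` from the Python standard library, with a made-up vocabulary standing in for a real dictionary of technical terms:

```python
import difflib
import re


def flag_suspects(ocr_text, vocabulary, max_suggestions=3):
    """List words not found in the vocabulary, with close matches.

    The OCR text itself is never modified; a person decides later
    whether any suggestion is right.
    """
    known = sorted(w.lower() for w in vocabulary)
    suspects = []
    for word in re.findall(r"[^\W_]+", ocr_text.lower()):
        if word not in known:
            suggestions = difflib.get_close_matches(word, known, n=max_suggestions)
            suspects.append((word, suggestions))
    return suspects


if __name__ == "__main__":
    # Hypothetical vocabulary and OCR line ("1igand" is a misread "ligand")
    vocab = {"the", "ligand", "binds", "legend", "receptor"}
    text = "The 1igand binds the receptor"
    for word, suggestions in flag_suspects(text, vocab):
        print(word, "->", suggestions)
    print(text)  # unchanged
```

Equations would still come out as noise; this only addresses ordinary technical words being silently "corrected".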
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-09 01:51 · Score: 1, Informative

And as a current reference librarian at a major research institution I'll add my two cents.
The journal articles that probably make up the bulk of his printed pdfs are already online, and professionally indexed, in the databases that his institution subscribes to or in Open Access repositories. He could use zotero to store a link to the original in that database and keep a local and fulltext indexed copy on his harddrive. And you won't have to worry about OCR. Depending on the field even many of the book chapters may be available as ebooks that could be handled the same way.
He could of course use repository software like DSpace or Fedora to create his own repository but why bother when the end result is ultimately redundant. Not to mention the fact that he won't be the copyright owner for much of what he archives so sharing the fruit of all this work could have legal complications.
And besides, if he's like most faculty, the contents of that binder are probably out-of-date, no matter how beloved they are.
Re:Document Management Software and OCR by Anonymous Coward · 2009-04-09 02:20 · Score: 0

Sphincter Search
Re:Document Management Software and OCR by jmeece · 2009-04-09 09:30 · Score: 1

He could also keep track of which documents have been OCRed and which have not. That way, if the OCR text doesn't make sense, they can go to the hard copy. Get all the easy stuff first; it will help, but it sounds like some things may still be manual. Incremental progress is a good thing though.
Re:Document Management Software and OCR by pz · 2009-04-10 03:06 · Score: 1

As an ex-librarian, I can give you a professional's answer. You need a professional.
... if's [sic] available online (and I mean in a proper source not a dissapearing link) ...
Personally, he sounds like a hoarder, so he will probably resist both suggestions.
So, as an ex-librarian, your position is that the only people who are qualified to keep libraries are librarians? Libraries are repositories of knowledge. There's nothing at all that requires a professional librarian in the collection and maintenance of knowledge. Perhaps you'd care to reconsider and rephrase your opinions?
Some references are only available online in a tenuous fashion. Some are not at all. Circumstances change: relying on your university to always have a subscription to a lesser journal is foolhardy. Relying on being at the same university for your entire career, or that all universities you might work for have the same electronic journal subscriptions, is naive. Speaking personally, some of my most important references are photocopied from rare, dusty journals that are unlikely to ever be available online, and in no possible future am I going to give up the paper copy because recreating it should the electronic version I've created fail would be far, far too costly. How much space does a researcher's library actually take up? A dozen shelf-feet? Two? Advising someone to get rid of the paper copies of their knowledge base -- copies that if treated even half-reasonably will last many decades -- and rely exclusively on an electronic version that requires substantial, active maintenance, monetary outlay and personnel to last the same time, is idiotic.
Scan, yes. Index, yes. Possibly archive paper versions to well-marked boxes, yes. Spend a lot of effort doing so, yes. Throw away the originals? A clear mistake based on misguided principles. That strategy can work if the originals are not that important, are inherently not relevant after a brief while (like bills), but for academic papers, it does not apply.

--

Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Re:Document Management Software and OCR by PrebleNY · 2009-04-10 04:00 · Score: 1

Koha is a great project (another great open project is Evergreen)... but last I checked it doesn't really do document management. Outside of the librarian/information professional communities there seems to be a lack of understanding about information categorization and classification. There are fairly large differences in how you handle the storage, metadata and retrieval options for something that could 'circulate' in a traditional library and a component piece (article/paper). Traditionally there have been specialized resources for source level objects (books/journals) and there were specialized resources for digging deeper and locating chapters/articles/papers/abstracts within source objects. I commonly see researchers checking a library catalog for a known article title, when the OPAC only includes journal title data. Maybe this is something new with the blurring of publishing on the web? There has been a lot of work on federated searching as a solution to this... but in most implementations my experience has been that patrons lose the ability to use the individual resources to fullest effect and get a results list that leads to further confusion. In any case, Koha would be great for setting up a lending library, and could be used as a framework for managing the metadata (title, author, year) with a link to a file... but in terms of actually having searchable text/OCR and additional tools it would be a poor choice (something like LibraryThing would be simpler if you weren't worried about checkin/checkout, patron records, late fees etc).

fox? by SnarfQuest · 2009-04-08 08:26 · Score: 4, Funny

I'm trying to help drag a professor I work with into the 20th century

Maybe after that, you should try to bring him into the 21st century. You know, the one where PDF's exist?

--
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.

Re:fox? by fuzzyfuzzyfungus · 2009-04-08 08:34 · Score: 4, Funny

PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?
Re:fox? by myrrdyn · 2009-04-08 09:11 · Score: 1

PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?
No, but it was just in time for the eternal September coming...

--
Elen sìla lùmenn' omentielvo
Re:fox? by Hognoxious · 2009-04-08 09:47 · Score: 2, Funny

Maybe wait for the 22nd. If we're lucky, by then it won't suck. But you may still have to wait for the Hurd port.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Re:fox? by Richard+Steiner · 2009-04-09 06:13 · Score: 1

No, but it was just in time for the eternal September coming...
Oh, God... Don't remind me... AOL lusers on USENET??!? OMGWTF???!

--
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.

OWL by LWATCDR · 2009-04-08 08:26 · Score: 1

http://owl.anytimecomm.com/
we use this at my office. Works well for us.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

Try Papers by matt4077 · 2009-04-08 08:26 · Score: 3, Informative

Papers is a Mac application that does exactly what you need, and does it very well. Unfortunately it's not web-based and is Mac only, but you can probably find out there what the right terms to google for are.

--
Fleur de Sel

Re:Try Papers by Anonymous Coward · 2009-04-08 09:35 · Score: 0

I wish Ask Slashdot made it mandatory to say what OS the user is running.
I can't seem to throw a cat without hitting a friendly piece of OSX software that does this. I use Journaler. Not often, but it's doing the trick for indexing any kind of saved document I throw at it. Good software should be keyword indexing any OCR text for you when imported. Author didn't mention what the professor is using to create the PDFs, but Adobe Acrobat Standard comes with "not awful" OCR that needs a lot of attention for proofreading.
With FOSS OCR you have Tesseract, OCRopus, Clara... Can't remember what the one I used on Windows was, but it was great at the time. (~2000)

I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored.
This is what Digital Asset Management is all about, so be sure to add that DAM criteria to broaden your search. ;)
Re:Try Papers by ridgecritter · 2009-04-08 16:38 · Score: 1

I concur. Papers works great for me, with 9,021 (tonight's count) references easily searchable, accessible, and useful! You can simply type the title of those PDFs your prof has into Google Scholar and download them direct to Papers. No OCR hassle. Papers picks up reference metadata (authors, journal, etc.) automagically. Best knowledge organization tool I've found.

Reference management software by ckthorp · 2009-04-08 08:26 · Score: 2, Insightful

I've had good luck with JabRef which uses a BibTex database on the backend so it integrates very well with LaTeX.

Re:Reference management software by eggy78 · 2009-04-08 09:06 · Score: 2, Interesting

I never really enjoyed using JabRef, but have had pretty good luck with Aigaion... a little more setup but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTex, etc. although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and upload one or more attachments (the original paper) in whatever format you want. Technically it's an annotated bibliography that supports attachments, but it is pretty solid. One thing to note: We are still using a 1.3.x version of it; we haven't been brave enough or had the time to try the 2.x releases.
Re:Reference management software by joe+155 · 2009-04-08 09:34 · Score: 2, Interesting

I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive bibtex database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it and then put them in the abstract section of Jabref so that when you use the search function for terms it searches through all the relevant text in the article for what you work on. I've also tried to put down some keywords which are related just to make sure that they're linked with the article. Then if I want to know everything I've ever read on, say, political corruption it's just a search away.

If you wanted to add papers you've not digitized your notes for then you could put in the references and just a few quick keywords. Papers you don't have you can search through google scholar to find them. It works OK.

I've also been impressed with Papers for OSX, but Jabref can move systems really easily and is GPL.
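For anyone who hasn't seen what that looks like on disk, here is a small sketch that writes one BibTeX entry with the reading notes in the abstract field and a few keywords, plus a plain substring search over those two fields. The entry itself is invented, and the search is only a stand-in for illustration, not JabRef's own:

```python
def to_bibtex(key, fields, entry_type="article"):
    """Format one BibTeX entry; fields is a dict of name -> value."""
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


def find_entries(entries, term):
    """Return keys whose abstract (notes) or keywords mention the term."""
    term = term.lower()
    return [
        key
        for key, fields in entries.items()
        if term in fields.get("abstract", "").lower()
        or term in fields.get("keywords", "").lower()
    ]


if __name__ == "__main__":
    # Hypothetical entry; the abstract field holds my own notes
    entries = {
        "doe2005": {
            "author": "Doe, Jane",
            "title": "Patronage and Party Finance",
            "year": "2005",
            "keywords": "political corruption, clientelism",
            "abstract": "(p.12) Argues that party finance rules shape patronage.",
        }
    }
    print(to_bibtex("doe2005", entries["doe2005"]))
    print(find_entries(entries, "corruption"))
```

Because the database is a plain text file, it moves between machines as easily as any other document.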

--
*''I can't believe it's not a hyperlink.''
Re:Reference management software by cahkaylahlee · 2009-04-08 15:03 · Score: 2, Informative

You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTex formatting. It isn't always 100% accurate, but it does save a lot of data entry time.
Re:Reference management software by rhsanborn · 2009-04-08 23:23 · Score: 1

If you wouldn't mind satiating my curiosity. In what format do you take notes? Do you keep the original copy of the paper? Do you link, in some way, a specific point or note to a specific place in the paper?

I'm just digging into my masters and haven't had the best text processing skills, so I'm interested in how others do it.
Re:Reference management software by Anonymous Coward · 2009-04-09 04:21 · Score: 0

Two other great, graphical front-ends for bibtex are BibDesk and Papers (both on MacOS).
Re:Reference management software by joe+155 · 2009-04-09 05:25 · Score: 1

My note taking system is based on standard word-processing which creates three copies.

I write the notes straight into, say, Word (it's just what we have at the uni), then print them and then copy them into Jabref. This does probably create more copies than I need but I don't really need the space on my USB drive and I get free printing, so it's not too bad for me but it would work just as well with fewer copies.

I do keep the PDFs but I don't really go back to them after I've read through them. In terms of how I note I do something along the lines of putting the title of the article and the author and year at the top and then do something like:

(p.155)
"item 1 shares the largest pairwise coefficient with item 7"

Or I would do:

(pp.155-7)
Notes the importance of item 7 on item 1...

This allows me to go back to the notes and then get usable quotes which I can use directly, or cite in the usual way (which is what you'd have to do in the second example anyway). This does create a lot of notes, but they don't really get unmanageable and because you've read the article it makes it a lot easier to go back through notes you've made.

Moreover in Jabref it searches through the whole lot, so it's really easy to search through it etc. (using regex no less). I'd definitely recommend this way of doing things to anyone who is hoping to stay in academia because it builds up a great library which is easily searchable and it shows you what you've read (as well as being easy to use with LaTeX). It would have been better if I'd known about this at the start of my BA.
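As a rough illustration of why the page-tagged format pays off, the sketch below splits notes written in the "(p.155)" / "(pp.155-7)" style shown above into (pages, text) pairs and runs a regex over them. It is a standalone example under that one formatting assumption and has nothing to do with how Jabref searches internally:

```python
import re

# A line holding only a page marker such as "(p.155)" or "(pp.155-7)"
PAGE_LINE = re.compile(r"^\(pp?\.\s*(\d+(?:-\d+)?)\)$")


def parse_notes(raw):
    """Split raw notes into (pages, text) pairs, one per page marker."""
    notes, pages, lines = [], None, []
    for line in raw.splitlines():
        line = line.strip()
        marker = PAGE_LINE.match(line)
        if marker:
            if pages is not None and lines:
                notes.append((pages, " ".join(lines)))
            pages, lines = marker.group(1), []
        elif line and pages is not None:
            lines.append(line)
    if pages is not None and lines:
        notes.append((pages, " ".join(lines)))
    return notes


def search_notes(notes, pattern):
    """Return the notes whose text matches the regex, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(pages, text) for pages, text in notes if rx.search(text)]


if __name__ == "__main__":
    raw = """
    (p.155)
    "item 1 shares the largest pairwise coefficient with item 7"

    (pp.155-7)
    Notes the importance of item 7 on item 1...
    """
    notes = parse_notes(raw)
    for pages, text in search_notes(notes, r"pairwise\s+coef"):
        print(pages, text)
```

Each hit comes back with its page reference attached, so a quote can be cited without reopening the PDF.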

--
*''I can't believe it's not a hyperlink.''

I don't know if it is what you want... by blue_goddess · 2009-04-08 08:30 · Score: 1

...but you can try library management software. Good point to start is
http://ask.slashdot.org/article.pl?sid=06/03/22/1320207
and
http://slashdot.org/article.pl?sid=07/12/11/1756247

--
As a computer, I find your faith in technology amusing.

DSpace? by leenks · 2009-04-08 08:30 · Score: 1

DSpace ? http://www.dspace.org/

Is the material copyrighted? by bihoy · 2009-04-08 08:32 · Score: 0

There is also the issue of making copies of any copyrighted material. Unless you have obtained permission to do so from the copyright holder (usually for a fee) you could find yourself in a whole lot of, very expensive, trouble for copyright infringement.

Re:Is the material copyrighted? by shaitand · 2009-04-08 08:47 · Score: 2, Insightful

Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.
The parent said he is copying parts of texts, not entire books.
Re:Is the material copyrighted? by Fallen+Kell · 2009-04-08 08:47 · Score: 1

I seem to remember something about "educational use" in Section 107 of the Copyright Act....

--
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
Re:Is the material copyrighted? by Anonymous Coward · 2009-04-08 08:53 · Score: 0

The electronic versions of these documents will be no more encumbered than the paper versions he already has. He makes no mention of distributing these documents to the public via the web, just to the person who's already using them. He's not selling them. Since the paper versions that he has are apparently printed PDF files, he's not even actually making new copies... he's just reorganizing the digital copies he already has.
There are no copyright issues here.
Re:Is the material copyrighted? by Red+Flayer · 2009-04-08 09:36 · Score: 1

Copying excerpts for educational use in a classroom setting is actually an explicitly protected fair use case.

This is not a classroom setting, this is a research setting. Very different.

Though it may be covered under other criteria of fair use, the educational purposes exemption from copyright does not apply.

--
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
Re:Is the material copyrighted? by shaitand · 2009-04-08 11:40 · Score: 2, Interesting

According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.
107
Permits the "fair use" of an owner's work without permission - for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
108
Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan - all without permission
109
Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.
110
Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.
121
Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities
http://www.copyright.gov/title17/92chap1.html#107
The copyright act section 107. This section lists many cases of fair use but gives 4 primary criteria for courts to consider. The first is the purpose of the work and makes it clear that non-profit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).

Beagle by WillKemp · 2009-04-08 08:34 · Score: 1

It may be worth looking at Beagle: http://beagle-project.org/ - it's Linux only though.

Zotero by k2enemy · 2009-04-08 08:35 · Score: 2, Informative

Zotero might be useful.

Re:Zotero by hnwombat · 2009-04-08 09:36 · Score: 2, Informative

I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.
To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$ as little as possible.
The only real downsides are copying between computers and citation formats.
Copying is actually easier than it is with the other reference managers I've tried (yeah, I'm talking about you, refworks (bleaargh!) and end note (urrrp!)). You may have to do it more than you would with others, but it's easy to do when you need to. You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.
There are a lot of citation formats currently available, they just don't happen to be the ones I need. However, there's one close, and the system is designed to be extensible; it's not *that* hard to add your own styles. As soon as I get a round tuit I'll be adding styles for the journals I'll be submitting to, and contribute those back to the project.
Like I said, really sweet, and free.
Re:Zotero by yes+it+is · 2009-04-08 11:48 · Score: 2, Informative

Thirded. I've also built my own phrase index on top of zotero using Perl and OpenCalais.
Re:Zotero by Anonymous Coward · 2009-04-08 13:08 · Score: 0

Zotero is a GREAT document / bibliographic management program. It even looks inside the PDFs and does full text indexing. Think of it as an iTunes for papers.
Re:Zotero by langelgjm · 2009-04-08 13:17 · Score: 1

You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.
If you're brave, you can try the beta, which includes syncing functionality. I currently use it to sync between my XP desktop and my OS X laptop, and I haven't had any problems so far (though I make a point of it to back up the database regularly in case something goes wrong).
The Word integration for Office 2007 is fantastic; for Office 2008 on the Mac, you need Leopard, and it's slightly clunkier, but works.

--
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
Re:Zotero by edward2020 · 2009-04-08 16:18 · Score: 1

I'll even fourth it.

--
Don't worry about the mule, just load the wagon.
Re:Zotero by Trepidity · 2009-04-08 17:47 · Score: 1

The main thing that turns me off about Zotero is the poor browsing interface. Why can't I click on an author's name to get all papers in my database by that author? Why can't I click on a journal to get all papers in that journal? Why do I always have to go through a damn Advanced Search to do any of this?
As a result, I use Aigaion instead. It's web-based, though, so you'll need somewhere you can run it (I've got a small VPS that it's on).

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Re:Zotero by Reece_Arnott · 2009-04-08 20:45 · Score: 1

I'll second Zotero as well. I also use Zotero (I'm also a doctoral student and use it to handle my research). As for copying between computers, I've never had a problem. I just set it up to store everything under a specific folder rather than the default one under the Firefox profile, and copy that folder between computers.
As for citation formats, I did have a few problems until I figured the easiest thing to do was to leave everything as defaults. I was exporting to Bibtex format and importing it into LyX and wanting to change the style slightly. I could have created a zotero xml based export style but it was easier to leave it as was as all I wanted was a small change to the default one.
That brings up the one thing I would like changed/added: use or import normal Bibtex and/or Endnote styles rather than creating another way of doing the same thing.
There may be another solution that fits your needs if all you want is to keep track of a lot of files, but if you also make notes on them, tag them, say which is related to which, and even create snapshots of webpages etc., it's just great. I'm also using it as a simple store for various piles of open source code I may want to find, use, or refer to later.

Cheap scanner, expensive OCR software by MartinSchou · 2009-04-08 08:35 · Score: 4, Insightful

Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer level HP printers, and the Office Jet Pro L7xxx series does this. The Pro L7680 is 200 US$ at Newegg.com.

Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math heavy, mixed in with images, graphs, tables and personal notes. I don't know any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly transform it all into easily searchable PDF documents.

That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.

A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
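
A minimal sketch of the term-counting idea above, assuming the OCR step has already produced plain text. The term list, the function names and the threshold of three mentions are made up for illustration; a real list would have to come from the professor's own categories.

```python
import re

# Hypothetical watch list of technical terms (an assumption, not from the thread).
TERMS = ["casimir effect", "quantum dot", "nanotube"]

def term_counts(text, terms=TERMS):
    """Count case-insensitive mentions of each term in OCR'd text."""
    # Collapse whitespace so a term split across a line break still matches.
    flat = re.sub(r"\s+", " ", text.lower())
    counts = {}
    for term in terms:
        # Optional trailing "s" so simple plurals are counted too.
        pattern = r"\b" + re.escape(term) + r"s?\b"
        counts[term] = len(re.findall(pattern, flat))
    return counts

def main_topics(counts, min_hits=3):
    """Keep only terms mentioned often enough to look like a real topic."""
    return [term for term, n in counts.items() if n >= min_hits]
```

A paper that mentions a term once stays below the threshold and is not filed under it; one that keeps coming back to the term is. The threshold itself is a guess and would need tuning against real documents.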

I'd suggest you start out with an experiment. Take a "typical" page from the binders, scan it to a non-compressed image at a decent resolution (e.g. TIFF). We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
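
For a rough idea of what "non-compressed at 300 dpi" costs in disk space, here is a back-of-the-envelope calculation. It assumes a US Letter page (8.5 x 11 inches), which the thread does not specify, and it ignores file headers and row padding.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of an uncompressed scan, ignoring headers and padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed page size: US Letter, 8.5 x 11 inches.
colour = raw_bytes(8.5, 11, 300, 24)   # 25,245,000 bytes, about 24 MiB
grey = raw_bytes(8.5, 11, 300, 8)      # 8,415,000 bytes
bilevel = raw_bytes(8.5, 11, 300, 1)   # 1,051,875 bytes, about 1 MiB
```

Doubling the resolution to 600 dpi quadruples each figure, so it is worth multiplying the per-page number by the page count of the binders before choosing scan settings.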

Test that page against various OCR software and see what each one produces as output. Pick the one that gives the best result.

And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.

Re:Cheap scanner, expensive OCR software by serviscope_minor · 2009-04-08 20:33 · Score: 1

scan it to a non-compressed image at a decent resolution (e.g. TIFF)
Why noncompressed? Not all compression is lossy. The G4 lossless compression for binary images in TIFF is excellent, and artifact free since it's lossless.

--
SJW n. One who posts facts.

Personal Document Management by steveha · 2009-04-08 08:38 · Score: 3, Interesting

I am hoping that someone will make a nice personal document management package as free software.

If you use Windows, you can buy this:

http://www.nuance.com/paperport/

The basic features would be:

Scan in a document (group multiple pages into a single PDF)
Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
OCR the documents and provide an index to allow searching
Provide a really convenient photocopier feature (scan+print)
Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.

In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely

Re:Personal Document Management by Anonymous Coward · 2009-04-08 09:12 · Score: 0

The software should also do auto-cropping. Put a magazine clipping on the flatbed, hit the "Scan!" button, and it not only scans it in, but it notices that the clipping is only 1/3 of the scanner surface and crops it down minimally, automatically.
It should also have a "scan image" mode where it saves the scan as a JPEG instead of a PDF, doesn't OCR, etc.
Re:Personal Document Management by Anonymous Coward · 2009-04-09 05:06 · Score: 0

Knowledge tree (mentioned above) does all that.

Summation by Anonymous Coward · 2009-04-08 08:40 · Score: 2, Interesting

Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good of a search algo as something like google, as it is purely Boolean...but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR searchable PDFs, throw them into a folder, and setup google desktop to only search that folder...you could then essentially "google" the contents of all those PDFs.

Quick and dirty solution by oldhack · 2009-04-08 08:40 · Score: 2, Informative

Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.

--
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.

Re:Quick and dirty solution by pete-classic · 2009-04-08 09:15 · Score: 2, Funny

Maybe you're unfamiliar with three ring binders.
They're archaic devices used to store non-electronic paper-based documents. You can ask your granddad about them.
I'm beginning to think these kids today don't realize that the desktop metaphor is . . . a metaphor!
-Peter
Re:Quick and dirty solution by oldhack · 2009-04-08 09:22 · Score: 3, Funny

I am a granddad, you insensitive clod.

--
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.

Beagle or Google Desktop by janoc · 2009-04-08 08:41 · Score: 1

I am using Beagle and/or Google Desktop for exactly this task. Both are able to index PDFs and search them. Unfortunately, they will not deal with PDFs directly from scanner (large images), you need to process those with OCR first. I believe that both Beagle and Google Desktop are able to search the metadata too, so even for image documents you can still search authors and titles if you are diligent and fill them in when the document is scanned. This needs a bit of discipline and insight into how the data are actually stored, but if you are willing to invest the time, it works pretty well.

There are other tools, Strigi comes to mind, but that was too unstable for me. I do not know about commercial apps doing this - there are probably some, but I am a Linux user so I need not to apply there... Then there are document management systems, but I think that is an overkill for your needs.

Solr by Anonymous Coward · 2009-04-08 08:42 · Score: 0

Somebody will try to tell you Alfresco is the solution. Give it a shot, but I haven't met anybody who has actually been able to use the open source version in production. The commercial version is nice though and there is a 30 day trial.

Apache Solr is built on their Lucene project and does the web interface search part of what you want. There are VM images online that you can download and deploy. I don't know what you should use to do the tagging part of the project.

Comment removed by account_deleted · 2009-04-08 08:45 · Score: 4, Interesting

Comment removed based on user account deletion

(Let me google that for you)^2 by clinko · 2009-04-08 08:45 · Score: 1, Offtopic

If only GOOGLE had a way to search your DESKTOP, that would be perfect.

Re:(Let me google that for you)^2 by Anonymous Coward · 2009-04-08 11:55 · Score: 0

Google Desktop does not have the ability to archive the printouts as PDFs nor to assign arbitrary keywords to papers. Both of those are non-trivial tasks. The submitter is looking for a suggestion on which brand of car to get, not which brand of car polish.

Papers for Mac OS X and iPhone by 200_success · 2009-04-08 08:48 · Score: 2, Insightful

For Mac OS X, try Papers. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.

Apple Spotlight by troylanes · 2009-04-08 08:48 · Score: 1

Not trying to sound like a fanboi... However, I have hundreds of data sheets for various microprocessors, IC's, power supplies, embedded API's, 5 years worth of emails, etc. Spotlight indexes them all beautifully, and access is very quick, only a few seconds to pull up all references. I believe spotlight will even index network attached storage although I could be wrong.

Re:Apple Spotlight by tobiah · 2009-04-08 09:18 · Score: 1

Yup, I've got a crude filing system with hundreds of papers that works great because of Spotlight. I don't bother with OCR for older docs, just punch in some keywords in the file description section. Network drive indexing works if the drive is formatted in HFS+.

--
"The ability to delude yourself may be an important survival tool" - Jane Wagner -

Citeulike by badger17 · 2009-04-08 08:49 · Score: 1

Check out http://www.citeulike.org/ Does pretty much what you are asking for. You put in the details of papers, and assign keyword tags. You can also look at other people's libraries and so on.

Re:CiteULike by aveng0 · 2009-04-08 10:26 · Score: 1

Labmeeting.com is a relatively new site (compared to Citeulike) for life-science researchers. You can upload all your PDFs and it will automatically determine the associated Pubmed records. You can add papers to multiple folders and it supports fulltext searching (assuming you have uploaded a PDF). It is free for academic users.

You can also read your PDFs from anywhere (through Scribd, embedded into the site).

Check out a sample paper page

You can also import/export citations from/to a bunch of formats (BibTex/RIS/Endnote).

There are a whole bunch of other interesting features on Labmeeting that I didn't mention here, so just check it out.

OCR is awful by shaitand · 2009-04-08 08:50 · Score: 1

OCR is pretty nasty stuff and it doesn't work very well at all. It's probably worth saying that the OCR results should probably only be used to generate your index and keywords.

Actually accessing the document should show the original PDF, not the error riddled OCR scan of it.
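
A small sketch of that split, under the assumption that the OCR text for each scan is already available as a string: the OCR output only feeds a word index, and every hit points back at the untouched original PDF. The paths and function names are hypothetical.

```python
import re
from collections import defaultdict

def build_index(ocr_text_by_pdf):
    """Map each lowercase word to the set of original PDF paths containing it."""
    index = defaultdict(set)
    for pdf_path, ocr_text in ocr_text_by_pdf.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(pdf_path)
    return index

def lookup(index, word):
    """Return the original PDFs to open for a word, never the OCR text itself."""
    return sorted(index.get(word.lower(), set()))
```

An OCR mistake such as "Qvantum" for "Quantum" then costs a missed search hit at worst; the reader never sees the mangled text, only the scanned page.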

sooner boomer by Anonymous Coward · 2009-04-08 08:50 · Score: 0

I think he's actually a crewman aboard the SSBN Oklahoma.
(jk, afaik there is no SSBN Oklahoma)

DevonThink on a Mac by Autumnmist · 2009-04-08 08:51 · Score: 2, Insightful

If your professor uses a Mac, consider Devonthink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html

For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...

I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.

--
--- "Many of the truths we cling to depend greatly on our own point of view." ~ Ben Kenobi, 'Return of the Jedi'

Re:DevonThink on a Mac by bkk_diesel · 2009-04-08 16:36 · Score: 1

I would certainly second this.

I've been using DevonThink Pro for about 6 months now, and it is an excellent way to get piles of paperwork under control - in fact I would be bold enough to say that for many users it's probably a compelling reason to buy a Mac.

Don't let the "beta" status of v2 scare you away, the software is solid and will do exactly what you want it to do.

Definitely get an auto-feed scanner though - with any significant pile of paper your flatbed just won't cut it.

Since you guys are doing research as well, you should look into DevonAgent as well. I haven't used it much, but the small amount of time I spent playing with it showed it had some interesting capabilities.
Re:DevonThink on a Mac by pnevin · 2009-04-08 22:26 · Score: 1

Seriously, this.
I've basically done what is being proposed with the original question, but with multiple folders of legal research documents. Devonthink Pro and a Fujitsu ScanSnap has made transferring the docs to searchable PDFs (and then finding them later) remarkably straightforward.
Re:DevonThink on a Mac by Anonymous Coward · 2009-04-11 15:09 · Score: 0

Seconding this, it's not open source, but it's fantastic. All of my DEVONthink databases total 28 GB of various information, a lot of it in PDF format. Their AI engine will help with the OP's organizational needs.

Almost recent diss on the subject by foobsr · 2009-04-08 08:52 · Score: 1

Bibliography Tools in the Context of WWW and LATEX

Looks like that covers your needs.

CC.

--
TaijiQuan (Huang, 5 loosenings)

pdfhacks by Anonymous Coward · 2009-04-08 08:53 · Score: 0

http://www.pdfhacks.com/

(disclaimer: not affiliated, just a user)

There are tools to index (kw_index) as well as a web based interface to a pdf collection (pdfportal).

OCR of your scanned pdfs is the enemy here. But as suggested, tesseract or google's continuation of it works pretty well.

here is a sample script from a set of tools I was experimenting with to index pdfs (all open source with windows binaries available):

pdftk example.pdf dump_data output example.data.txt
pdftotext example.pdf example.txt
kw_catcher 1000 keywords_only example.txt > example.keywords.txt
page_refs example.txt example.keywords.txt example.data.txt > example.pagerefs.txt
enscript --columns 2 --font "Times-Roman@10" --header "|INDEX" --header-font "Times-Bold@14" --margins 54:54:36:54 --word-wrap --output example.index.ps example.pagerefs.txt
ps2pdf example.index.ps example.index.pdf

All from pdfhacks, GnuWin32 and Ghostscript.
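
To show the kind of keyword-to-page table the middle of that script produces, here is a rough Python sketch. It is not how the pdfhacks page_refs tool is implemented (I have not read its source); it only relies on pdftotext separating pages with a form feed character, and the function name is made up.

```python
def keyword_pages(text, keywords):
    """Map each keyword to the 1-based page numbers on which it appears.

    Assumes `text` is pdftotext output, where pages are separated by
    form feed characters.
    """
    pages = text.split("\f")
    refs = {}
    for keyword in keywords:
        needle = keyword.lower()
        hits = [number for number, page in enumerate(pages, start=1)
                if needle in page.lower()]
        if hits:
            refs[keyword] = hits
    return refs
```

This is plain substring matching, so "index" also matches "indexing"; a real back-of-book index would want something stricter.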

Re:pdfhacks by Anonymous Coward · 2009-04-08 09:14 · Score: 0

Note, you can also use pdftk to create a package of pdfs (attach_files parameter) which can then be searched in later versions of the free acrobat reader. (without an index though, I could find no free solution to index pdfs the way Acrobat pro does)
e.g.:

pdftk coverpage.pdf attach_files path-to-folder-full-of-pdfs\*.pdf output example-package.pdf dont_ask

Again, only the text content of the pdf package will be searchable.

Xena and DPR by Anonymous Coward · 2009-04-08 08:55 · Score: 0

Take a look at XENA and DPR (http://www.naa.gov.au/records-management/secure-and-store/e-preservation/at-NAA/software.aspx), which were developed as an archiving solution by the National Archives of Australia but are now open source and fully open standards.

Mac: Skim and Yep by koick · 2009-04-08 08:55 · Score: 1

If on a Mac, here's two you may consider (neither have a web interface).

Skim is open source and is a PDF reader and note-taker for OS X.

http://skim-app.sourceforge.net/

Yep is not open source, but will scan, tag and search PDFs ("like iTunes for PDFs").

http://www.ironicsoftware.com/yep/

Try 'Green Stone', a _digital library_ system. by Eyeballs · 2009-04-08 08:58 · Score: 1

http://en.wikipedia.org/wiki/Greenstone_(software)

-- From Greenstone's Web Site --
About Greenstone:
Greenstone is a suite of software for building and distributing digital library collections.

It provides a new way of organizing information and publishing it on the Internet or on CD-ROM.

Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.

It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read the Greenstone Factsheet for more information.

The aim of the Greenstone software is to empower users, particularly in universities, libraries, and other public service institutions, to build their own digital libraries.

Digital libraries are radically reforming how information is disseminated and acquired in UNESCO's partner communities and institutions in the fields of education, science and culture around the world, and particularly in developing countries.

We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book 'How to build a digital library', authored by two of the group's members.

Re:Try 'Green Stone', a _digital library_ system. by Anonymous Coward · 2009-04-08 12:02 · Score: 0

http://wiki.greenstone.org/wiki/index.php/Greenstone_FAQ
Sick bastards, first time I have ever seen that. Very unprofessional and a major strike against their credibility right off the bat.
I doubt I'll look any further into the product now.
How cruel is that? "Um, Ya, here's our FAQs... oh you want answers too!? We only have Frequently Asked Questions - no Frequently Answered Questions here!"
And the irony of using a wiki to deliver such a cruel joke is not lost on me.

Fujitsu ScanSnap by heynonnynonny · 2009-04-08 08:59 · Score: 1

Shameless Plug:

I would highly recommend the Fujitsu ScanSnap 510 (or 510M if you're on a Mac). It ain't free and it ain't open source, but it comes with everything you need to scan in large quantities of documents, name them, put them in the folders you want, and create OCR text backed PDFs, so you keep your original files and have "searchable" backed text. It does double-sided scanning at about 15 pages per minute (my real-world estimate).

I just bought the Mac version and have managed to reduce two packed drawers of a file cabinet down to just a few documents of which I wanted to keep the originals. Plus, with them being text backed (per a previous post) I can use Spotlight to search for them.

My next plan is to scan in my old Engineering notes.

Fujitsu is coming out with the 1500, but I don't know much more than it's supposed to be improved. The 510 is fantastic, though. Check out the reviews on Amazon:
http://www.amazon.com/Fujitsu-ScanSnap-S510-Sheet-fed-Scanner/dp/B000RUOW66/

Included with the scanner is Adobe Acrobat in addition to ABBYY FineReader OCR software.

No Linux software that I'm aware of, but once you have the files in PDF format you can use them to your liking. They aren't particularly cheap at $450, but I've been very happy with the device's utility.

I had a HP All-in-One as well, but not having a double-sided scanner made it a pain to use.

So, what I think you're asking for is... by Basilius · 2009-04-08 09:00 · Score: 4, Informative

...something like this:

1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.

2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.

3. Full-text search isn't as important (but would be useful if available).

If that's the case, I'm thinking Alfresco might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.

I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.

I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.

And, if a tool like that exists, could someone point me to it, please?
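
For what it's worth, the core of such a tool is small. The sketch below is only an illustration of the idea (walk a root folder, attach tags, search by tag combination and/or filename); the class and method names are invented, nothing is persisted to disk, and it is not modelled on any existing product.

```python
import os

class TagStore:
    """In-memory map from file path to a set of lowercase tags."""

    def __init__(self):
        self.tags_by_file = {}

    def scan(self, root):
        """Register every file under `root`, initially with no tags."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                self.tags_by_file.setdefault(os.path.join(dirpath, name), set())

    def tag(self, path, *tags):
        """Attach one or more tags to a file."""
        self.tags_by_file.setdefault(path, set()).update(t.lower() for t in tags)

    def search(self, tags=(), name_contains=""):
        """Files carrying ALL the given tags whose name contains the substring."""
        wanted = {t.lower() for t in tags}
        needle = name_contains.lower()
        return sorted(
            path for path, have in self.tags_by_file.items()
            if wanted <= have and needle in os.path.basename(path).lower()
        )
```

Requiring every tag to match (set inclusion) is one reading of "some arbitrary combination of these keywords"; OR and NOT queries would need a little more work, and a real tool would save the tags to a file or database.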

Re:So, what I think you're asking for is... by datababe72 · 2009-04-08 10:35 · Score: 1

There's a little company called Arctus that does something like what you asked for (tagging a directory structure), and some more. Its aimed at pharma/biotech researchers, but from what I've seen of them they'd be happy to customize for you, for a price.
Re:So, what I think you're asking for is... by Sooner+Boomer · 2009-04-08 11:40 · Score: 1

Thanks. I looked at the web page and went down several levels looking at the features of the document management software. I'm put off somewhat by the fact that they don't list any prices. Instead, you have to contact them and they quote you a price for your bespoke application. I'll keep this in mind, and recommend it as an alternative, but for now I'm going to keep looking.

--
Chaos maximizes locally around me.
Re:So, what I think you're asking for is... by Basilius · 2009-04-08 12:12 · Score: 1

That's the enterprise version of Alfresco. Check out the Labs version. Free, open source version.
Re:So, what I think you're asking for is... by fsiefken · 2009-04-09 01:30 · Score: 1

For tagging files:
Leap: http://www.yepsoftware.com/leap/index.html
DEVONthink: has an import files and folders option, tag support, and you get advanced search (boolean etc.) as well: http://www.devon-technologies.com/products/devonthink/devonthink2.html

wow.. by way2trivial · 2009-04-08 09:01 · Score: 3, Insightful

2 years? ago I bought for my small business a Fujitsu fi-5120C duplex scanner--it came with Adobe Acrobat
I scan every bill, correspondence, notice, and everything to pdf- then I throw it the hell away.

the version of acrobat included does OCR-I open acrobat, choose create pdf from scanner, and scan away.
I can mix a scan job up between B&W & color or duplex or simplex within one job
I can open an existing PDF and append to it

I save everything to an infrant nas box.

I can go to windows search, type in 1179.21 (actually did this one once)
set to look INSIDE the files of that directory and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off

they have other model scanners that combine sheetfed+flatbed...

here is a beauty
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html

--
every day http://en.wikipedia.org/wiki/Special:Random

just use DSpace by Anonymous Coward · 2009-04-08 09:06 · Score: 0

It's a nice Java web app. We use it at the Institute for Clean and Secure Energy (ICSE), and it does a great job.

Tellico by seyyah · 2009-04-08 09:06 · Score: 1

Tellico for KDE might be a suitable solution. I use it extensively as a collection manager.

Use Yep! by Anonymous Coward · 2009-04-08 09:07 · Score: 0

If you are running Mac OS X, you can quickly accomplish this very thing with a piece of software called "Yep!" It will track all of your pdf's and allow you to tag them. You can do previews, groups, etc. It will sort by date, etc. Very intuitive, very fast, easy to use.

You can download it from www [dot] yepthat [dot] com/yep/index.html

It's relatively inexpensive at $34USD.

Find out what his colleagues use - nanohub.org by Anonymous Coward · 2009-04-08 09:07 · Score: 0

Sooner - There is a community of researchers who work in the nanotech field and collaborate through nanohub.org. I am not in the field, so I'm not sure how helpful it will be, but it's billed as "A resource for nanoscience and technology, the nanoHUB was created by the NSF-funded Network for Computational Nanotechnology."

This community is probably a much better place to ask the question than slashdot, IMHO. :-)

JR

Suggestion by vondo · 2009-04-08 09:11 · Score: 3, Insightful

I wrote and maintain a project to do this:

http://sourceforge.net/projects/docdb-v/

"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."

It's intended for collaborations, but groups from 5 to 500 use it.

Re:Suggestion by CarpetShark · 2009-04-08 10:33 · Score: 1

The description of your project does not suggest either full-text searching, or PDF capabilities. I've been looking for something that can index all of my ebooks and do full-text searches that bring up the pages (or better, chapters) that are most relevant.
Re:Suggestion by vondo · 2009-04-08 10:39 · Score: 1

It's agnostic as to the document type (PDF/HTML/Word) whatever and can use plugins (beagle, etc) to do full-text search. Generally we find that proper meta-data to describe the document (title, abstract, topics, keywords) is much more useful than a full text search. But yeah, it won't do what you mention.
Re:Suggestion by Sooner+Boomer · 2009-04-08 11:47 · Score: 1

Thanks for the recommendation. I see from the web page that some of the dependencies are Perl and MySQL, and it can use a web front-end. Apache? I also see it runs on Linux - Debian or Ubuntu Server (or something else?)?

--
Chaos maximizes locally around me.
Re:Suggestion by vondo · 2009-04-08 12:18 · Score: 1

Right, MySQL, Perl, Apache. Runs on RedHat, Ubuntu, Mac OS X. Anything Unix like.
Re:Suggestion by CarpetShark · 2009-04-08 14:11 · Score: 1

That's probably true for cases when content is added by people who wrote the content or understand it well. My interest is in building a library of topics I DON'T know well though. For instance, I want to have ebooks available for random searches which turn up techniques and algorithms I'm unaware of, yet may prove relevant to my work, as and when I find some particular problem worth researching. It's very difficult to catalog that sort of on-demand knowledge ahead of time. Additionally, I don't care if an ebook is about accountancy. If it happens to have some nice sorting algorithm in Appendix C, and I search for sorting algorithms, then I'd like that to come up.

Defeating Bedlam part 2 by Anonymous Coward · 2009-04-08 09:14 · Score: 0

As a young academic I can vouch for this being a problem that is looking for a good solution. Olivia Judson talked about this issue in the NY Times a few months back (December 16, 2008, Defeating Bedlam). Folks who spend a lot of time with the literature need a version of EndNotes or RefMag that stores the bloody PDF along with the citation info; storing the PDF might have taken a prohibitive amount of memory in the past but these days memory is cheap. The program must also be able to search within the PDF; assigning keywords yourself is for chumps. "Papers" and "Yep" look good, but what about all of us who don't have the luxury of working on a Mac?

ePrints by Demoriel · 2009-04-08 09:15 · Score: 1

Our institution uses something called ePrints - I'm not sure if it's entirely what you're looking for but it does support different Subjects (headings?) and you can upload the documents using it.

I, Librarian seems pretty close by nniillss · 2009-04-08 09:21 · Score: 1

What the submitter needs (and I also need) is an organizer for scientific papers with an interface for standard fields such as authors, journal, title, doi, http links etc. I, Librarian seems to fulfill this need; unfortunately with direct interfaces (for retrieving pdf and meta information at the same time) only with pubmed.

If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org, the APS journals and ideally other journals), I would be very interested (even as a paying customer).

Re:I, Librarian seems pretty close by Anonymous Coward · 2009-04-08 09:48 · Score: 0

Do you or other physicists use:
NASA ADS:
http://adsabs.harvard.edu/abstract_service.html
or CERN DS:
http://cdsweb.cern.ch/
Do you think it would be useful to integrate these databases with I, Librarian as PubMed and PubMed Central? Let me know.

Wikidata by Anonymous Coward · 2009-04-08 09:21 · Score: 0

What we need is a "Wikidata" project that would catalog every book, paper, recording, movie, etc. There are a few attempts that I know of such as openlibrary.org and wikidata proposals on wikimedia, but nothing that I know of that has reached critical mass. Such a system would be free as in freedom, and include abstracts, location item info, would allow users to create their own sub-database of items to search, etc. Something that would be a harbinger of death to Google.

Look into FOSD Medical DMR or EMR systems by Anonymous Coward · 2009-04-08 09:23 · Score: 0

In healthcare there is a company called Laserfiche that does exactly what you are asking for. It's not free, but maybe there is a similar FOSS.

DekiWiki is a wiki that will index attachments (using Lucene) although I am not sure to the extent you'll need. It would be worth looking into also since it IS free.

I have used both and both work well. I hope that leads you into the right direction.

You need controlled vocabularies of your keywords by Anonymous Coward · 2009-04-08 09:29 · Score: 0

After you get past the easy part, which is the scanning / OCR / selection and installation of doc management software, training users, etc., you'll reach the hard part: Developing controlled vocabularies based on the ontologies specific to your domain's metadata.
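
A controlled vocabulary can be as simple as a table of preferred terms plus the variants that map onto them. Here is a minimal Python sketch of that idea; the terms are invented for illustration (they are not from any real thesaurus), and a real vocabulary would come from the researcher and a librarian, not from code.

```python
# Hypothetical vocabulary: preferred term -> accepted variants.
CONTROLLED_VOCABULARY = {
    "carbon nanotubes": {"cnt", "cnts", "nanotube", "nanotubes", "carbon nanotube"},
    "atomic force microscopy": {"afm", "atomic-force microscopy"},
}


def build_lookup(vocabulary):
    """Map every variant (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def normalize_tags(raw_tags, lookup):
    """Return (accepted preferred terms, rejected raw tags), keeping input order."""
    accepted, rejected = [], []
    for tag in raw_tags:
        key = " ".join(tag.lower().split())  # ignore case and stray whitespace
        if key in lookup:
            if lookup[key] not in accepted:
                accepted.append(lookup[key])
        else:
            rejected.append(tag)  # needs a human decision: new term or typo?
    return accepted, rejected
```

The rejected list is the useful part: every tag that lands there is a prompt to either extend the vocabulary or correct the entry, which is where the real work is.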

Talk to your school librarian. by phallstrom · 2009-04-08 09:30 · Score: 1

Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?

Digital Archive software by who's+got+my+nicknam · 2009-04-08 09:33 · Score: 2, Insightful

What you are looking for is a proper archiving application. I suggest ICA-AtoM. Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA-AtoM supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.

--
"Apparatus dignosco occultus, satis non supernus."

Ask your librarian by danthelibrarian · 2009-04-08 09:35 · Score: 1

This researcher should learn to talk to his local librarians. Many universities have a bibliography management system e.g. Refworks, that would be a lightweight solution. And many of the articles he has in print are quite likely now already properly digitized and available by PDF through his university library. If he's a proper researcher, he should care about more than what he has in his binder. There are likely more recent articles that reference those articles, building on that knowledge. Which he's missing. He can chat with a librarian online, or try the 20th century version of communication and make an appointment to talk in person.

Keywords aren't the total solution by Anonymous Coward · 2009-04-08 09:36 · Score: 0

Any solution that doesn't provide full text searching is less likely to be useful unless the exact, specific query from each and every user can be mandated.

I've lived thru "Document Management Guy" (actually a team of them, some with PhDs and publications) claims that keywords stored in document metadata were all that was needed. I called BS based on my years and years of DMS experience.

If an end user can't find a document, then the document doesn't exist, period. The document is useless unless the purpose is to have the document, but not have the document found. Images of text aren't generally useful without adding significant metadata based on how users will search for a document. IT people don't think like end users, so ask them what search terms they would use to find a few sample documents.

I've been away from Documentum, FileAid, Docushare and Sharepoint for a few years, but last time I used Sharepoint, the full text search results were worthless. I knew about a document - MS-Word, no less. Searches for a few specific keywords failed to locate it. Yes, it was in a collection that was indexed.

About 6 months ago, the company I work at implemented the OSS version of Alfresco. We're ok with it, but need to upgrade to v3.x to get a much better GUI. We did trial the beta v3, but it wasn't ready for use at the time and had a few flaws with version control. Those are all fixed now.

Amplify to autotag by atcat · 2009-04-08 09:38 · Score: 1

Once you OCR all your paper (FineReader is not bad), and full-text index your PDFs (Beagle for Linux, MOSS for Windows), you'll still have a problem with narrowing down a keyword search. Try Amplify http://www.hapax.com/amplify.php on the title/abstract/methods page of each document and maybe you'll get useful metadata.

Success = Copyright Problems by tobiah · 2009-04-08 09:46 · Score: 1

I agree that for the immediate use listed there are unlikely to be any copyright violations. But if someone were to make a good collection for their lab, that perhaps then became popular in the department, it would start running into copyright gray areas. For example the university discontinues subscribing to a journal, but articles remain available on a broad intranet system. Normally if you already had a copy of the article that's legit, but now a new student has access to articles that were only available before they showed up. Or articles are scanned from copyright-legit sources and made available to a large audience, but not as large as the whole web. My guess is systems like this will be tolerated as long as they aren't very good. And when they become good, they'll be tolerated because everything else is not as good.

--
"The ability to delude yourself may be an important survival tool" - Jane Wagner -

Re:Success = Copyright Problems by imidan · 2009-04-08 10:19 · Score: 1

You're right, of course, that such a system could run into problems if its use became more widespread. It seems like one option is to keep the content restricted--can't just add in any electronic resource without a thorough understanding of its copyright terms--like certain linux distributions only including unencumbered code, or (in theory) Wikipedia only including unencumbered images.
Or, go to the effort of keeping track of the copyright terms and encumbrances. This, obviously, is way beyond the scope of their project, but it's a service that academic libraries ought to be offering: document management services that make the users and the institution capable of, at the least, demonstrating a good faith effort to obey licensing terms, and, ideally, avoiding any infringement-style problems altogether.
Libraries are looking for ways to stay relevant in the digital age, and document management (including cataloging, indexing, ownership, tracking, search, etc.) is something that they've been doing forever.

Why Not To by DynaSoar · 2009-04-08 09:52 · Score: 4, Insightful

There are at least two reasons the professor's method is beneficial:

1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.

2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.

Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.

--
"I may be synthetic, but I'm not stupid." -- Bishop 341-B

Re:Why Not To by BitZtream · 2009-04-09 01:23 · Score: 1

Thank you for this post.
I love Google, and I'm only 30 so I'm not an old school engineer or anything, but I still have a printed copy of almost every reference manual I have ever used.
I don't really know why I prefer the dead tree format as much as I do, and I still usually keep a searchable copy available at all times as well, as there are times when I know what I'm looking for and can jump straight to it.
Your first point sheds some light on my habits actually, perhaps subconsciously I knew this. Perhaps that isn't the reason I like it at all, but it does make sense.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

Got any money? by Anonymous Coward · 2009-04-08 09:52 · Score: 0

If you've got some money, get yourself a Google Mini for $3K or so and a scanner. The base Google mini will index up to 50,000 documents, and supports PDF as well.

http://www.google.com/enterprise/mini/fileformats.html

EndNote X by daisybelle · 2009-04-08 09:54 · Score: 1

I'm a linguist, and I use EndNote X for storing all my papers (and books now actually too). It makes reference lists in my papers without me doing anything, but more importantly, it stores the reference/paper itself (which for me are mostly pdfs, with some Word and some html documents) with the record. There are fields in EndNote for notes, keywords, all that jazz, which are very searchable.

I would expect any modern photocopier to scan to pdf (while 150dpi is okay to look at, the OCR is better at 300dpi), then Adobe Professional does the OCR (my uni has a site license).

I actually bought a nifty tablet/pen thingy recently, and now I can write notes directly on the pdf too, in my own handwriting. I love it.

--
"You only get ONE LIFE." Richard Rahl, Faith of the Fallen - Terry Goodkind

Re:Keywords by pwfffff · 2009-04-08 09:56 · Score: 0, Flamebait

Kike, not kyke.

Yes, I just grammar Nazi'd the race Nazi (Nazi Nazi?).

Zotero, Mendeley by pesho · 2009-04-08 09:56 · Score: 2, Informative

You should try Zotero or Mendeley.

Zotero is a Firefox extension that can grab research papers directly from the journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.

Mendeley is a standalone application. I haven't tried it yet, but it seems to have very similar functionality.

Re:Zotero, Mendeley by the+plant+doctor · 2009-04-08 17:00 · Score: 1

I've been using Mendeley lately. They're making progress, and are asking for feedback on improvement. That was my first thought when I read this as well though, Zotero or Mendeley.

BibDesk by Anonymous Coward · 2009-04-08 10:09 · Score: 1, Informative

I use BibDesk to organize my research. It's not perfect for that as its basic use is to cite and all that but you can actually import pdfs and tag them, i.e. put them into smart folders. It does have the collaborative approach to organizing data in a flat structure similar to that of delicious.

Zotero by Volfied · 2009-04-08 10:10 · Score: 1

Zotero is what you want. Integrates smoothly into a research workflow. Great for managing research materials of all kinds. Powerful search and tagging features. Adding sources is quick and easy and it works hand in glove with lots of research databases. Also interoperates with Word or OpenOffice to manage citations and bibliographies.

www.zotero.org

Homebrew Solution by mraiser · 2009-04-08 10:12 · Score: 1

As an alternative to off-the-shelf software you could create a series of html pages, put them online and let Google index them for you. Create a separate html page for each scanned document with the desired keywords and a link to the document. Create an index page with a link to each of these html pages. link to that page on your home page or blog or any other page that you know Google scans. Wait a few days and search your site via Google. Build up a two column table in your favorite spreadsheet application with the file name in one column and the keywords in the other. Export as csv, and with a little coding in the programming language of your choice, you can generate the whole set of html in no time. Cheap free and easy!
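
The "little coding" step might look something like this in Python. It is only a sketch: it assumes the exported CSV has exactly two columns (file name, then keywords) with no header row, and the function names `build_pages` and `write_site` are made up for this example.

```python
import csv
import html
import io
from pathlib import Path


def build_pages(csv_text: str) -> dict[str, str]:
    """Turn 'file name, keywords' rows into {page name: HTML source}."""
    pages: dict[str, str] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        filename, keywords = row[0].strip(), row[1].strip()
        name, words = html.escape(filename), html.escape(keywords)
        # One page per scanned document: keywords plus a link to the file.
        pages[Path(filename).stem + ".html"] = (
            f"<html><head><title>{name}</title>"
            f'<meta name="keywords" content="{words}"></head>'
            f"<body><h1>{name}</h1><p>Keywords: {words}</p>"
            f'<p><a href="{name}">Scanned document</a></p></body></html>'
        )
    # The index page links to every per-document page.
    items = "".join(
        f'<li><a href="{html.escape(p)}">{html.escape(p)}</a></li>' for p in pages
    )
    pages["index.html"] = f"<html><body><ul>{items}</ul></body></html>"
    return pages


def write_site(pages: dict[str, str], out_dir: Path) -> None:
    """Write the generated pages into out_dir, creating it if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for page_name, source in pages.items():
        (out_dir / page_name).write_text(source, encoding="utf-8")
```

Calling `write_site(build_pages(Path("keywords.csv").read_text(encoding="utf-8")), Path("site"))` would produce the whole set in one go, with `keywords.csv` being whatever you named the spreadsheet export.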

Suggestion in using Zotero by Anonymous Coward · 2009-04-08 10:13 · Score: 0

I am currently using Zotero (http://www.zotero.org/) to organize all my articles and citations. It is open source, developed by George Mason University. The software works as an add-on to Firefox, and automatically downloads the citation or the PDF of the article. The citation can then be tagged with various labels and all the words in the article are searchable. I haven't used the tagging feature much, but the software has already proven invaluable in research and paper writing.

CiteULike by Anonymous Coward · 2009-04-08 10:13 · Score: 0

I'm a biology grad student and have been dealing with some similar issues. I've ended up using an online app (CiteULike) that has a great tagging interface and uses a bookmarklet for posting from journal sites, ISI, and PubMed.

It also has a great bibtex export/import feature. Since I'm using LaTeX for my dissertation I'm slowly migrating to a BSD licensed Mac program called BibDesk. Its tagging interface could use a little work though.

I've tried Zotero, and heard good things about Mendeley and Papers, but none of them have worked as well for me.

Try Aigaion by Anonymous Coward · 2009-04-08 10:21 · Score: 0

I've used Aigaion for managing all of my documents and references in the course of my Ph.D. I now recommend it to all of my grad students.

The website calls it a "Web based bibliography management software"

From the site:

"Both for individual researchers as for research groups or projects, it is of major importance to organize the literature one has read. A well organized bibliography is a powerful instrument. It speeds up the search for publications one has already read and supports the user in structuring information. Aigaion provides a bibliography management software environment that supports a user in just this: Organizing and managing a complete bibliography, from small bibliographies to bibliographies for a complete research department."

Try a server based solution like RefBase by jjh37997 · 2009-04-08 10:25 · Score: 1

What's better than having a program on your desktop that can search through your files? How about an online database of your files that you can access from any internet-connected computer in the world (www.refbase.net).

Re:Try a server based solution like RefBase by mspohr · 2009-04-08 21:29 · Score: 1

I've set up RefBase (www.refbase.net) for several sites. It works great and will do just what he needs. It also has the ability to generate standard format citations.

--
I don't read your sig. Why are you reading mine?

SIMPLE solution, but not FOSS by macraig · 2009-04-08 10:40 · Score: 1

1a. Windows Desktop Search with added "IFilters", or
1b. Google Desktop Search

I recommend (1a), amazingly, because once you've located and installed all the third-party IFilters - including one(s) for PDF files - WDS will be able to index and make searchable MANY more files than GDS (in my case, about THREE TIMES as many). If the original PDFs from which so much of the binder material was printed are still available, then your effort with the following is greatly reduced.

2. Good major-manufacturer scanner with ADF.

I haven't kept up with scanners in recent years, so I'll leave it to you or someone else to make specific recommendations. It may be important to stick with well-known brands for purposes of compatibility with the scanning/OCR software (3).

3. Forget Adobe: buy the latest version of OmniPage Pro. Just like Adobe, it can OCR text and pump it into a PDF while "fronting" it with an image of the original page, for sake of complex layouts and possibility of future OCR corrections.

No need to worry about complex database systems to store all the stuff; just create a storage directory (or hierarchy, if there are tens of thousands of files). When you're done, you'll have a library of PDF files that have been fully indexed by a desktop search engine, such that any snippet of text in a document can be used to locate it.

Web-based Solution for Biomedical professors and by lJlolel · 2009-04-08 10:42 · Score: 1

I started Labmeeting with just this problem in mind.

First, we focus only on the biomedical and related spaces right now. Eventually, we might expand into Nanotech and CS, but we are helping out PubMed users first, for the most part.

We let you upload lots of papers, index them for you, provide a great interface for searching and annotating them. We have tens of thousands of bioscientists on the site with private paper collections.

I know this won't necessarily help your professor right now since we mostly focus on biology, but I'll let you know if we ever expand.

OCR and Scribd by the_denman · 2009-04-08 10:57 · Score: 1

As you are at a University, you probably have access to a copy of Adobe Acrobat; I have found that it has an alright OCR for scanned PDFs. Also you may want to look at using Scribd; it is not open source but is free and searchable.

Building a Searchable Literature Archive With Keyw by Anonymous Coward · 2009-04-08 11:03 · Score: 0

I believe that there have been a few people working on developing a 'searchable literature archive with keywords' for a while now... they call these strange people 'librarians'.

Try Xinco by Anonymous Coward · 2009-04-08 11:10 · Score: 0

Have a look at Xinco http://www.xinco.org/. Java based, simple to set up on Tomcat - MySQL or other alternatives. Not as bloated as Alfresco http://www.alfresco.com/ when it comes to smaller projects. Good functionality and the installation documentation will get you up and running quickly. A serious lack of further documentation can be problematic. Referencing the source code helps. The client interface was definitely designed by a programmer, not artistic but functional.

The University of Oklahoma participates in JSTOR by tlambert · 2009-04-08 11:19 · Score: 1

The University of Oklahoma participates in JSTOR:

http://www.jstor.org/

They also appear to be EBSCO participants:

http://search.ebscohost.com/

I'm pretty sure "the 20th century" is right there already, if you can drag him to the library.

Note: This isn't going to work for people not affiliated with an institution. Both of these services make paper journal content available online for subscription fees paid by the institution (or business), so unless you are in the "big boy clique", you're not going to have access to them, unless you pay through the nose.

-- Terry

Re:Building a Searchable Literature Archive With K by Anonymous Coward · 2009-04-08 11:23 · Score: 0

More seriously though - I am a little surprised that a professor cannot simply work with people at his current organisation that are hired specifically to catalogue and conveniently store (mostly digital) literary information.

Sure PhD students are nice, cheap slaves - but how hard is it to acquire a copy of endnote or reference manager from the library and ask how to export their preprepared metadata and thesaurus keywords into his install?

Desktop Search? by natmsincome.com · 2009-04-08 11:35 · Score: 1

Since you only have one person that needs to access the files I would just use Desktop Search. Personally I like Google Desktop ( http://desktop.google.com/ ) & Copernic Desktop Search ( http://www.copernic.com/en/products/desktop-search/index.html ). Here is an article reviewing some of them - http://lifehacker.com/400365/five-best-desktop-search-applications.

The main thing that you need to do is OCR the documents when you scan them in (You can convert non-OCR PDFs into OCR PDFs but I don't know anything that can search them before you put text in them). On Linux the two main ones that I know of are Tracker and Beagle (http://www.linux.com/feature/143259).

I know these are not all open source or have web interfaces but they are really easy to use. You just put the files in folders and point the desktop search at them. Great for someone that doesn't know a lot about computers.

I wrote a few articles about this by nbauman · 2009-04-08 11:44 · Score: 2, Insightful

I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm
http://www.nasw.org/users/nbauman/lawdb.htm
http://www.nasw.org/users/nbauman/discover.htm
It was a long time ago, the software and hardware has changed, but the concepts are still the same, and the costs are a lot less.

Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is Pubmed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.

Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
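
To make the field layout concrete, here is a rough sketch using Python's built-in sqlite3 module. The table and column names are my own guesses at a reasonable layout, not anything from the articles above; the point is only that the binder and tab become ordinary fields, and keywords live in their own table so each term exists once.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id INTEGER PRIMARY KEY,
    binder TEXT NOT NULL,           -- which looseleaf binder
    tab TEXT NOT NULL,              -- which tab inside the binder
    title TEXT,
    author TEXT,
    journal_citation TEXT,
    pub_date TEXT
);
CREATE TABLE IF NOT EXISTS keyword (
    id INTEGER PRIMARY KEY,
    term TEXT NOT NULL UNIQUE       -- controlled vocabulary: one row per term
);
CREATE TABLE IF NOT EXISTS document_keyword (
    document_id INTEGER NOT NULL REFERENCES document(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    PRIMARY KEY (document_id, keyword_id)
);
"""


def open_catalog(path=":memory:"):
    """Open (or create) the catalog database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, binder, tab, title, author, citation, pub_date, keywords):
    """Insert one record and link it to its keywords; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO document (binder, tab, title, author, journal_citation, pub_date)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (binder, tab, title, author, citation, pub_date),
    )
    doc_id = cur.lastrowid
    for term in keywords:
        conn.execute("INSERT OR IGNORE INTO keyword (term) VALUES (?)", (term,))
        kw_id = conn.execute(
            "SELECT id FROM keyword WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO document_keyword VALUES (?, ?)", (doc_id, kw_id)
        )
    return doc_id


def find_by_keyword(conn, term):
    """Return (binder, tab, title) for every document tagged with term."""
    return conn.execute(
        "SELECT d.binder, d.tab, d.title FROM document d"
        " JOIN document_keyword dk ON dk.document_id = d.id"
        " JOIN keyword k ON k.id = dk.keyword_id"
        " WHERE k.term = ? ORDER BY d.binder, d.tab",
        (term,),
    ).fetchall()
```

A search then answers with "Binder 3, Tab 12", which sends him straight back to the paper copy he already trusts.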

I assume he doesn't have the PDFs any more. That would have made it a lot easier.

It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.

He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.

Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.

Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.

You might start by estimating the number of pages and documents you have.
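
Once you have those counts, the back-of-envelope math is easy to script. This Python sketch uses my figures from above (about 20 hours per thousand records, and up to a minute per page on a cheap scanner) as defaults; the document and page counts you pass in are whatever you actually measure.

```python
def entry_seconds_per_record(hours: float = 20, records: int = 1000) -> float:
    """Seconds of typing per record implied by 'about 20 hours per thousand'."""
    return hours * 3600 / records


def estimate_hours(documents: int, pages: int,
                   scan_seconds_per_page: float = 60.0,
                   entry_seconds: float = 72.0) -> dict[str, float]:
    """Rough labour estimate; the defaults are worst-case consumer-scanner numbers."""
    return {
        "data_entry_hours": documents * entry_seconds / 3600,
        "scanning_hours": pages * scan_seconds_per_page / 3600,
    }
```

For a made-up collection of 1,000 documents averaging six pages each, `estimate_hours(1000, 6000)` gives 20 hours of data entry against 100 hours of scanning, which is the kind of ratio that makes the faster scanner (or skipping the scan) worth thinking about.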

But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.

If anybody knows of up-to-date articles on this subject, I'd love to know the citation.

Bookends by martinX · 2009-04-08 11:47 · Score: 1

I didn't read the article and I'm not sure exactly what you want, but take a look at Bookends from Sonny Software.

Saved my butt, made life easy.

http://www.sonnysoftware.com/
Reference Management and Bibliography Software for Mac OS X

--
When they came for the communists, I said "He's next door. Take him away. Goddam commies."

DocMGR by ianh1981 · 2009-04-08 11:53 · Score: 1

DocMGR is what you're after. It's a web app that takes submitted documents (PDF, Office, etc), OCRs and indexes them, and allows stuff to be searched. www.docmgr.org

ReadIris Pro by fast+turtle · 2009-04-08 11:54 · Score: 1

is the best Scan/OCR app you can buy with many nice features. The first is it's reasonably priced. Second is the fact that it can import PDF files directly and OCR/Index them and it handles almost every language on the planet. Definitely worth looking into as the school may actually have a school license, which means you don't have to buy anything.

--
Mod me up/Mod me down: I wont frown as I've no crown

I used to work for these guys... by Ibiwan · 2009-04-08 12:00 · Score: 1

Give a call and see if their software does what you want; if they haven't messed it up since I last touched it, it should do document archiving, scanning, OCR, search, tags, categorizations, and whatever custom database fields you want to throw at it.
http://datagenix.com/

--
-- //no comment

Google Scholar by Anonymous Coward · 2009-04-08 12:09 · Score: 0

The most obvious solution would be to use scholar.google.com and for any paper that you find that isn't online already look it up in your personal collection.

The bigger problem is not OCR, it's which scanner. by Futurepower(R) · 2009-04-08 12:49 · Score: 2, Informative

We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.

The big attraction of the fi-6130 is its speed: 40 pages per minute.

If you are interested, I suggest you download the manual. (PDF)

The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.

The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.

The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.

It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed is important with a flatbed scanner, but not as important, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together, without any logical connection, since apparently the 6230 has two imaging elements.

Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.

Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.

a quick thought by mistahkurtz · 2009-04-08 12:54 · Score: 1

before i return to work: if he's already getting the PDFs somehow, save them to a directory, and use google desktop to search through them for the keywords. It will search through most common documents.

--
not only is time travel possible, it's irrelevant.

Lovin' it by justice83 · 2009-04-08 13:08 · Score: 1

Loving the fact that the guy didn't know what century it is!

Re:Lovin' it by Sooner+Boomer · 2009-04-09 00:57 · Score: 1

I know very well what century it is. I was using exaggeration to emphasize how far the professor needs to come to be effective in the current technological environment.

--
Chaos maximizes locally around me.

Quick and dirty solution by Nightshade · 2009-04-08 13:55 · Score: 1

Here's what I've used for my own documents: 1) Convert the PDFs to TIFFs with ImageMagick, 2) OCR the TIFFs with Tesseract, 3) Index the text with Xapian. The OCR step won't get all the text right, but it will get about 90% or better, which is good enough for indexing and searching with Xapian. For me that's been a pretty good solution.
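
Steps 1 and 2 can be glued together with a few lines of Python, roughly like this. Treat it as a sketch under assumptions: it expects the `convert` (ImageMagick) and `tesseract` commands to be on the PATH, the 300 dpi density is a guess at a sensible default, and newer ImageMagick installs may want `magick` in place of `convert`. Step 3 (feeding the text files to Xapian) is not shown.

```python
import subprocess
from pathlib import Path


def convert_command(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """ImageMagick call that renders each PDF page to a numbered TIFF."""
    pattern = out_dir / f"{pdf.stem}-%03d.tif"
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]


def tesseract_command(tiff: Path) -> list[str]:
    """Tesseract call; it writes <output base>.txt next to the image."""
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]


def ocr_pdf(pdf: Path, out_dir: Path) -> list[Path]:
    """Run both steps for one PDF and return the text files to index."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_command(pdf, out_dir), check=True)
    text_files = []
    for tiff in sorted(out_dir.glob(f"{pdf.stem}-*.tif")):
        subprocess.run(tesseract_command(tiff), check=True)
        text_files.append(tiff.with_suffix(".txt"))
    return text_files
```

Keeping one text file per page means the index can later point at a page rather than just at a whole document.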

Hire an Information Architect. by jddj · 2009-04-08 14:11 · Score: 1

Seriously. This is what we get paid to do. There's far too much to communicate on a forum, and if the SNR here is typical, you'll get awful, unrelated and just plain wrong advice.

If you can't hire, see if your school has a library science program and look for a good intern.

Failing that, read the Polar Bear book (Rosenfeld & Morville, pub: O'Reilly) yourself and follow the threads to resources particular to your problem.

Tactical help: populate the keywords, title, subject properties in your PDFs and Office docs. If you populate in Office and make a PDF, the properties come along. They're in File>Properties... Filling them out will help any search engine that can consume binary docs make sense of your content.

And there are bulk scan-to-OCR packages out there. Funnel into PDF and populate the properties.

Worth saying again: populate the properties.

Drupal by pingers · 2009-04-08 14:12 · Score: 1

Drupal + Modules (CCK + Filefield + Taxonomy + Views). Taxonomy provides categorization of content. You can have multiple vocabularies (sets of terms). You can assign one or many terms to each piece of content (or document in this case).

My father's small business had a similar problem by bertok · 2009-04-08 14:15 · Score: 1

My father runs a small business and has to track a bunch of paperwork for each client, so I got him a cheap LED lit flatbed scanner, but like everyone else, he discovered that it was too slow to manually scan in each page, even if the scanner itself was quite fast.

He eventually figured out that the fastest scanning technique is not to use a scanner at all, but a digital camera. He made a rig with a marked out area the size of an A4 sheet of paper, and then he attached a camera mount so that the camera would be facing down, pre-aligned to photograph the entire sheet. I've seen it in action, he can easily do a page per second: he just places the next page on the platform with one hand, and presses the shutter button with the other hand.

The resolution is more than good enough for OCR, and most cameras have better depth-of-field than scanners, so more of the page is in focus, even near bindings and staples.

20th Century? by weakpawns · 2009-04-08 14:36 · Score: 1

I suppose you meant "drag a professor into the 21st century".

Re:20th Century? by Anonymous Coward · 2009-04-08 21:37 · Score: 0

I suppose you meant "drag a professor into the 21st century".

You, sir, assume too much about professors.

Re:20th Century? by Sooner+Boomer · 2009-04-09 01:00 · Score: 1

"I suppose you meant "drag a professor into the 21st century"."
No wp, I meant I'm trying to get him to use technology that was available in the '90s. If I can do this, there is hope for progress into *this* century.

--
Chaos maximizes locally around me.

Re:The bigger problem is not OCR, it's which scann by Angstroman · 2009-04-08 14:50 · Score: 2, Informative

I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handling group separator sheets. I highly recommend it. I have seen no evidence of "marketing drone foolishness."

Evernote? by baffledexpert · 2009-04-08 15:06 · Score: 1

Evernote (Evernote.com) might be too small for your needs, and it's not open source, but it:
- Has OCR
- Is very cheap ($6 a month for the pro version, free for the light version)
- Recognizes handwriting
- Accepts tags
- Has a web interface (and a desktop client)

Its only limitation is the 500 MB monthly upload cap. Since you have hundreds of files to get through, you will go over the cap if you upload all at once. But since scanning those things is going to take you ages, you might be fine. Also, if your boss is still collecting paper, he's probably pretty old-school. Evernote is dead simple to use.
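
To get a feel for what the upload cap means in practice, here is a back-of-the-envelope calculation. The file count and average file size are invented numbers for illustration; only the 500 MB monthly cap comes from the post.

```python
import math


def months_to_upload(total_mb, monthly_cap_mb=500):
    """Whole months needed to upload total_mb under a monthly cap."""
    return math.ceil(total_mb / monthly_cap_mb)


# Hypothetical collection: 300 scanned papers at about 5 MB each.
total_mb = 300 * 5
print(months_to_upload(total_mb))
```

If scanning itself takes longer than that, the cap is unlikely to be the bottleneck.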

Re:Evernote? by mjhorn · 2009-04-09 08:46 · Score: 1

This was my first thought as well. In my experience the OCR is very good. I've taken very fuzzy pics of a business card using a camera phone, thrown the image into Evernote, and never had a problem pulling it up with the search. Best of all, the OCR is all automatic as you load the images in. The ability to access the database anywhere also seems like it could be beneficial, as he'd have access to his collection anywhere he has internet access. I will agree, though, that perhaps Evernote was not designed to be efficient for such a massive collection of documents. I think it's at least worth checking out.

doh! by Joebert · 2009-04-08 15:09 · Score: 1

I couldn't tell you how many times I've gone to use a phonebook or reference manual and tried to flip through to the search page.

--
Wanna fight ? Bend over, stick your head up your ass, and fight for air.

DCG grammars by Anonymous Coward · 2009-04-08 15:10 · Score: 0

If you use pdftotext or some such utility from the adobe SDK then you could use a simple shell script to take the filename of all the pdf files in the current directory and create an index of keywords and counts per file and ultimately mix them into another index. I would recommend using SQLite - a nice little database - since without a database you will most likely run short on memory when sorting and searching.

I'm short on time, otherwise I would write a more verbose example. For keywords you use something similar to a Huffman encoding or a trie. Instead of every letter you use a file containing word counts or relations per file.

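#!/bin/bash
# For each PDF in the current directory, start an index file in /tmp that
# holds the base name and, after a \001 separator, the full file name.
shopt -s nullglob   # do nothing if there are no PDFs
for i in *.pdf; do
    file=${i%.pdf}
    output="/tmp/${file}.txt"
    printf '%s' "$file" > "$output"
    printf '\001%s\n' "$i" >> "$output"
done
# EOF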

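#!/usr/bin/perl -w
use strict;
# Print each PDF's name, then its text one word per line.
foreach my $pdf (@ARGV) {
    print $pdf, "\n";
    # Path to pdftotext as in the original post; adjust for your system.
    open(my $in, '-|', '/usr/X11/bin/pdftotext', $pdf, '-')
        or die "$!";
    while (my $line = <$in>) {
        next if $line =~ m/^\s*$/;    # skip blank lines
        $line =~ s/^\s+|\s+$//g;      # trim leading/trailing whitespace
        $line =~ s/\s+/\n/g;          # tabs and spaces become newlines
        print "$line\n"; # append to file (>>$ARGV[x])
        ## re-read and do a word count ... etc.
    }
    close($in);
}
# EOF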
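
The SQLite suggestion above might look something like the following. It is written in Python rather than shell or Perl just to keep it short, and the table layout and function names are my own assumptions, not the poster's design. The text is assumed to have already been extracted by something like pdftotext.

```python
import re
import sqlite3
from collections import Counter


def open_index(path=":memory:"):
    """Create the (hypothetical) word-count table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "file TEXT, word TEXT, n INTEGER, PRIMARY KEY (file, word))"
    )
    return db


def add_document(db, filename, text):
    """Store per-file counts for every word in the extracted text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    db.executemany(
        "INSERT OR REPLACE INTO counts (file, word, n) VALUES (?, ?, ?)",
        [(filename, word, n) for word, n in counts.items()],
    )
    db.commit()


def search(db, word):
    """Return (file, count) pairs for a word, most frequent first."""
    rows = db.execute(
        "SELECT file, n FROM counts WHERE word = ? ORDER BY n DESC, file",
        (word.lower(),),
    )
    return rows.fetchall()


db = open_index()
add_document(db, "a.pdf", "Tries and Huffman codes; tries again")
add_document(db, "b.pdf", "A short note on tries")
print(search(db, "tries"))
```

Passing a file path instead of ":memory:" keeps the index on disk, so sorting and searching are left to the database rather than to RAM.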

Save the word counts in the data file and rescan it. The interesting part of the project would be word parts (suffixes and prefixes) and also folding the meta-information from the PDF into the mix. It sounds like a nice project. I'm guessing you'll run out of memory before you run out of information, so use a cutoff value of some low number or, if you have the processing power, try to build a backward-chaining (Prolog-type) system with a simple DCG grammar. This would allow searching and matching of similar phrases or parts of speech. The computer science of it all is using a `pdftotext'-like program as an input stream. Look for a text by Norvig, Graham, or (for a more statistical approach) Patrick Winston (excellent author). The `texts' would cover the fun algorithms of dependency-related searches, backward chaining with grammars, etc. I don't think hashtables will work for anything more than initial counts, so use a B-tree in the filesystem, or better yet on a ramdisk if you have the memory. Such as (I'm a Linux goof):

mount -v -t tmpfs -o size=1024M,nr_inodes=2m tmpfs /{mount-point}/

I've worked on similar projects for some time - I get caught up in the rule system and negation retraction. I really don't have an end goal, so these little exploratory programs lead to much fun. I have a Perl program I have been using to scan images and PDF files. It's based on Image::ExifTool. I understand that Perl is not the coolest thing since ruby on a turnpike or whatever, but this module kicks a$$ and Perl works well, if not best, with regexps (IMHO).

To think you would be creating the `ultimate grep'! What a wonderful addition to anyone's life :) I would also check out ispell - it seems to have some nice rules for suffixes and such. Post something similar and I'll post the Perl script that I've been using (on request - it's not that cool or useful outside of itself or for ideas) ...

Have fun...

JSSindex: Javascript Search Engine by Anonymous Coward · 2009-04-08 15:58 · Score: 0

http://jssindex.sourceforge.net/

JSSindex can index a collection of PDF, DjVu, PostScript or HTML documents, and generate a self-contained set of HTML and JavaScript files that allow full-text search of the collection using a web browser. Since the search engine runs in the browser (in JavaScript) there is no need for a server. The code is platform independent.
The JSSindex script is written in Lush, which runs on Linux and Mac.

From the website:

JSS is a simple search engine designed for CDROM or Web-based document collections. The documents to be indexed can be in HTML, PostScript (.ps and .ps.gz), PDF, and DjVu. The main feature of JSS is that the query engine and the index are entirely in JavaScript, and therefore require no other software than a JavaScript-enabled Web browser.

What is the advantage? If you are distributing a collection of document on CD-ROM, you can provide platform-independent full-text search without asking your users to install any software on their machine. If you publish a collection of documents on the web, you don't need to install any server-side scripts: search queries run entirely in the user's web browser.

TikiWiki? by Anonymous Coward · 2009-04-08 15:59 · Score: 0

TikiWiki allows word searching of uploaded files (batch loading from a file directory is supported). You'd need to convert the images to a suitable format (can a PDF hold a page image and text from OCR?), and a command-line filter which extracts text from the file for indexing. By default only the first 8K is stored, but you have the source code. Assorted command-line filters can be defined, so future PDFs can be stored directly.

Use JabRef / bibtex by Anonymous Coward · 2009-04-08 16:19 · Score: 0

I use JabRef. http://jabref.sourceforge.net/ It's not a web interface, but it provides keyword searching, user-defined groups, local file storage as well as links to web versions, and everything starts out with full citation information (which can be "unpublished", "personal communication", etc.). You didn't request full-text searching, but if you do have full-text PDFs, any of the OS-based file search programs should handle it.

My JabRef / bibtex database is well over 1500 articles, and I have NEVER regretted scanning / downloading over 22 shelf-feet of binders and folders.

Jabref by Anonymous Coward · 2009-04-08 16:37 · Score: 0

Use JabRef (http://jabref.sf.net) to store the references in a BibTeX database, and set up the links in JabRef for each article to point to the appropriate pdf, jpeg, zip or other document.

And don't forget to use DOI references (http://en.wikipedia.org/wiki/Digital_object_identifier) to point to the online abstract of the article. Very useful.
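
For anyone who hasn't seen one, this is roughly what such a BibTeX record looks like. The little helper below just formats a dictionary as an entry. Every value is made up for illustration, and the plain path in the `file` field is an assumption on my part, so check what your JabRef version actually writes there.

```python
def bibtex_entry(entry_type, key, fields):
    """Format one BibTeX entry from a dict of field names to values."""
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Every value below is invented for illustration.
entry = bibtex_entry("article", "doe2009example", {
    "author": "Doe, Jane",
    "title": "An Example Paper",
    "year": "2009",
    "keywords": "archiving, ocr",
    "doi": "10.1000/example",
    "file": "papers/doe2009example.pdf",
})
print(entry)
```

The `keywords` and `doi` fields are what make keyword search and the link to the online abstract possible.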

DEVONthink by fsiefken · 2009-04-08 18:08 · Score: 1

There is an OS X application written specifically for these kinds of scenarios: DEVONthink. http://www.stevenberlinjohnson.com/movabletype/archives/000231.html It has the ABBYY OCR engine built in, a web server, and an extremely smart search filter, which is able to find related documents based on metrics like keyword frequency.

Aigaion by Anonymous Coward · 2009-04-08 19:14 · Score: 0

Aigaion - A Web based bibliography management software
http://www.aigaion.nl/
It speeds up the search for publications one has already read and supports the user in structuring information. Aigaion provides a bibliography management software environment that supports a user in just this: Organizing and managing a complete bibliography, from small bibliographies to bibliographies for a complete research department.

PHP Bibtex database manager by Ardeaem · 2009-04-08 19:34 · Score: 1

Assuming you use LaTeX, PHP BibTeX database manager might be a good option. I use it, and it is quite handy if you want to share the database among several researchers.

Link

DSpace.org by Anonymous Coward · 2009-04-08 19:40 · Score: 0

This might be overkill for just one guy's papers.
But you might wanna take a look at
http://www.dspace.org/
DSpace is an open-source project for preserving various kinds of digital assets (images, documents, audio, etc). It is used by many university libraries throughout the world. It has a fairly large community.

The downside: you need to know how to install and configure it, as it requires a web server, database, servlet engine, etc. All are available for free, but you may need to spend some time installing and configuring them.

The interface is via the web, so it's fairly straightforward. You can find live examples of who is using DSpace on this website: http://www.dspace.org/index.php/DSpace-Repositories/Repositories-Alphabetical.html

Integration with NASA ADS by nniillss · 2009-04-08 20:07 · Score: 1

No, I (theoretical solid state physicist) didn't know about NASA ADS, but it seems to cover most of the relevant literature (including arXiv and APS journals). So yes, an integration into I, Librarian would be great.

In CERN DS (with certainly a focus on high-energy physics) my papers are shown only up to 2006; so this database appears useless for me.

The easy way by Anonymous Coward · 2009-04-08 20:23 · Score: 0

Here's what I did to rid myself of my entire bookcase full of ring binders: I sat down and looked them up one at a time on the net (if it's there, Google will find it). If there's a PDF available, pick that, otherwise convert it to PDF yourself (for uniformity, ease of viewing and printing, and for future-safety) after downloading the .ps/.ppt/.doc or whatever, unless it's a plain .txt or possibly .html file.

Make sure to rename the files sensibly (no "oopsla1998xyz.pdf"). I use a straightforward "MainAuthor(s) (et al) - TitleOfPaper", cut down to a reasonable length. Note that it's better to omit parts of the title than to start tossing in abbreviations - they will just get in the way of search and readability.
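
If you'd rather not type the names by hand, the naming scheme could be automated along these lines. This is a sketch only: the 80-character limit and the set of characters stripped from titles are my own assumptions about what counts as "reasonable".

```python
import re


def paper_filename(authors, title, max_len=80):
    """Build 'MainAuthor et al - Title.pdf', trimmed to max_len characters."""
    lead = authors[0] + (" et al" if len(authors) > 1 else "")
    # Drop characters that are awkward in file names on common systems.
    clean = re.sub(r'[\\/:*?"<>|]', "", title)
    clean = re.sub(r"\s+", " ", clean).strip()
    stem = "%s - %s" % (lead, clean)
    if len(stem) > max_len:
        # Cut at a word boundary so the title is shortened, not abbreviated.
        stem = stem[:max_len].rsplit(" ", 1)[0]
    return stem + ".pdf"


print(paper_filename(["Knuth"], "Literate Programming"))
print(paper_filename(["Doe", "Roe"], "What: a title?"))
```

Long titles lose whole words from the end rather than being abbreviated, which keeps them searchable.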

Then throw the dead-tree version in the bin (wonderful feeling) and pick up the next; you'll quickly get down to just 1-2 minutes per paper. And don't hesitate to discard all those papers that aren't really worth keeping, to speed things up even more.

Place the files in a simple hierarchy of directories (one level, no more), named after main subject - and don't be too concerned about getting that exactly right: a dozen broad subject names is way easier to handle than a hundred specific ones.

When it comes to searching, just leave it to the operating system! The built-in search in Ubuntu/Fedora/MacOS/Windows/whatever is good enough these days (just don't disable the indexing service...), otherwise install Google Desktop or similar if you need even more power. Keywords already in the articles can be searched for just as any other words in the content. Keywords not already present: forget it - you're not going to do that manually, and it's not worth it.

I now have about 4 ring binders left, with material I couldn't find online (that was worth keeping); a few really special ones I scanned to PDF myself. It usually takes me all of 10 seconds to find any paper I'm looking for on my computer: browse to "papers" and do a search. And it's all very easy to maintain: for any new paper, make sure it's a PDF, rename the file, and drop it in a suitable subdirectory depending on main subject - done.

Windows Indexing service by Anonymous Coward · 2009-04-08 21:00 · Score: 0

We use the Indexing Service in Windows to index PDF files. Acrobat Reader (or just the index plugin from Adobe) is required to allow the index server to index the contents of the PDF files. You can then use the Windows find program or write a simple web front end to query the index for any word or term present in the files or the file properties. Lots of examples on the net. Good luck.

Refbase by Anonymous Coward · 2009-04-08 22:51 · Score: 0

Refbase is built directly for scientific literature management, is web based, open source, and does contain keywords amongst a range of other search options too.
It might be overkill for a single individual, but can be extremely effective if a whole department is wanting to share their literature resources.

Re:The bigger problem is not OCR, it's which scann by rhsanborn · 2009-04-08 23:19 · Score: 1

I work with electronic medical records and we have found Fujitsu scanners to be top-notch. Fast, reliable, and generally affordable. We've also used some of the larger production scanners from Kodak and Bowe Bell and Howell. They are solid scanners, but are more expensive and haven't taken the beating we tend to give the Fujitsus.

We just replaced an old 3097 with the 6130 and are waiting to hear how it holds up. Note, most of these scanners are deployed to remote scan areas in the hospital where it is the responsibility of the users to handle maintenance, which means these scanners aren't cleaned and don't have rollers replaced. These older models have gone years with almost no maintenance.

OCR, yes, but does he need management by N+Monkey · 2009-04-09 00:38 · Score: 1

I think what you are looking for is something called "document management" software.

... where he could archive the PDFs and scanned documents and be able to search by keywords?

I agree with the OCR requirement, but if he just needs to search the resulting PDFs, wouldn't DocSearcher do the job for him? I've found it trivial to set up and run and it's certainly helped me keep track of docs etc.

Fabulous question by cinnamon+colbert · 2009-04-09 00:46 · Score: 1

Too bad the OSS community has no real answers.
This is something I submitted a week or so ago:

"I'm looking for software that can help my company manage information in documents that may be in pdf, doc or web form. I work for a biotech company with 15 people, and we have large numbers of documents that range from very technical scientific publications (usually pdf) to company reports like 10-Ks, to web pages to newspaper articles to pictures. We use these documents to review and stay current with the scientific literature; to learn about what competitors are doing; to gain market information (who is selling how much of what); to generate publicity for our products; and so forth.
We currently use the Windows file tree as our organizer, which creates several problems: I can't put one file into multiple bins; I can't use keywords to search; I can't organize files into groups.
What I would like (I think) to do is organize the information by keywords and subjects; associate groups of files into binders, and create summaries for the binders (e.g., I might have 5 files that go together, and my own summary of what the five files mean); add sticky notes to anything at any time (actually, I would like keywords and stickies [comments in Adobe Acrobat] to be the same: words in stickies are keywords, and keywords show up in the stickies); add URLs and webpages directly from the browser; have a function that mimics or is compatible with a package like EndNote or ProCite or Papyrus or refcite (formats bibliographies in Word docs).
I'm not even sure what the solution looks like, but it needs to be cheap (http://www.ncbi.nlm.nih.gov/sites/entrez. This has a lot of features that scientists need, such as keyword search returns a list of articles that can be viewed by abstract."

This is a problem that comes up a lot, for a lot of people.
I've tried a lot of the solutions, like Zotero, and they just don't cut it for one person, much less if you need to share the info among a small group of people.
There is a fabulous market for someone who wants to write this software.

The main problem, which I don't think anyone has addressed, is that free information has a price - a human can only remember so much. So the glut of free pdf/web info is actually bad, because you lose sight of the important stuff. This used to be done for you with your $ monthly journal subscriptions - if you are in nanoscience, you might get Nano Letters from the American Chemical Society, and the editors do the weeding out for you.

The other problem is how does one do natural language queries?

Of the available answers, most are owned by a de facto monopoly, Thomson Reuters; RefMan is probably the best.

Surely there must be someone who makes a PDF library database front end better than the collection feature in Adobe Acrobat.

Perhaps you are part of the problem. by BitZtream · 2009-04-09 01:11 · Score: 1

I realize that Slashdot is going to take the technology solution as the only one, and in this case it's probably the right way to go, but ...

People have managed documents and information like this for centuries and it worked rather well. Perhaps you should stop being lazy and learn how to use traditional reference materials, as you're going to need this skill for a few more years anyway.

Those skills are still useful today. Just because Google can index and allow you to find words in the documents it knows about doesn't mean that it can help you figure out what you're looking for. If you have no traditional reference skills, Google becomes a lot less effective. This of course isn't specific to Google, all search engines in the world won't help you if you can't figure out what you're looking for.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

As a grad student, this is what I do... by froghunter · 2009-04-09 01:19 · Score: 1

I ran into this issue also, since I have tons of PDFs and sometimes it can take a while to find that paper you remember that mentioned HylD or ZnO. I use the search client Copernic http://www.copernic.com/ . It has a serious advantage over Google Desktop since it gives you this handy little preview pane, which is useful when sorting through results. Since I carry everything around on a hard drive, I just have the program set to index that drive (which is set to always have the same drive letter). As for versions of the program, they kind of went in a bad direction with the 2.x releases and I kept using 1.6/1.7 for a while, but I recently started using the current release, 3.x, and it works like a champ. Good luck.

Do it the easy & obvious way by sribe · 2009-04-09 01:36 · Score: 1

Good grief, forget your FOSS ideology, scan them to PDF, OCR them using Acrobat Pro (the education price is ridiculously cheap), store them on a Mac, and use the built-in Spotlight to search them.

Reference Management by tutak · 2009-04-09 02:10 · Score: 1

Why not use something like JabRef? It makes the references easy to manage.

I'm cringing at the likely replies to this, but... by Anonymous Coward · 2009-04-09 02:26 · Score: 0

Apologies in advance for suggesting a Microsoft solution...

If you can get the documents into a searchable form (using OCR, as has been outlined by many other posts already), there is always Microsoft Search Server 2008 Express - it is free, and will index anything that it can read the contents of - just dump the files into a share, and the search will index the contents.

It's like a more complex version of the search 4.0 engine that Outlook 2007 uses to index the contents of a mailbox, but it has a web front-end for searching.

We tried it at work, decided not to deploy it because we had a bunch of really specific requirements that it didn't suit - turned out my workplace needed a "proper" doc management system, but have just shelled out good money for SharePoint licenses to do this.

I hate SharePoint, but that's OT.

For the simple set-up you've described, search server express might work OK.

Calibre by sandGorgons · 2009-04-09 02:33 · Score: 1

I'm surprised nobody has mentioned Calibre, which was also featured on Lifehacker some time back.
It is based on PyQt (as well as dateutil, mechanize, lxml, BeautifulSoup). They even have a Cover Flow-like interface which is pretty good. I suppose it is usable on Win, Lin and Mac.
You have to provide a login/password to LibraryThing (or a few other alternatives) and you can then search for and tag the book's metadata and cover images from these sources automagically.
I personally also use it to archive the PDFs that I download from the internet, tag them, and specify authors and other metadata (incidentally, most of the papers that people create from LaTeX do not have any metadata).
I see the developers pushing out a release every week, so it is under pretty active development. I don't know if there is a plan to integrate any indexing features in it, but I suppose the developers are open to it.

Google Desktop by Malenx · 2009-04-09 03:32 · Score: 1

Google Desktop searches through your computer's files to find keywords inside the files themselves. If he saves all the documents he finds online to that computer, he should be able to do keyword searches in those documents.

Also, if the PDFs are OCR'd then he could search those as well.

Citeseer by homboe · 2009-04-09 04:18 · Score: 1

I have always liked the basic idea around Citeseer.

"CiteSeerx is a scientific literature digital library and search engine that focuses primarily on the literature in computer and information science. CiteSeerx aims to improve the dissemination of scientific literature and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness in the access of scientific and scholarly knowledge.

Rather than creating just another digital library, CiteSeerx attempts to provide resources such as algorithms, data, metadata, services, techniques, and software that can be used to promote other digital libraries. CiteSeerx has developed new methods and algorithms to index PostScript and PDF research articles on the Web. ..."

The basic issue for you would be that, as currently implemented, it was made to focus on computer and information sciences.

http://citeseerx.ist.psu.edu/about/site

In the short term, this may not be valuable for you. In the long term, I think this can be the basis for most or any academic (or even non-academic) research literature.

Online Citation Manager: Refworks by bored_grad · 2009-04-09 05:19 · Score: 1

The easiest solution is RefWorks, an online citation manager. You can automatically import articles from online databases or create your own reference entries with space to add any kind of article information or user-specified metadata, plus you can attach PDFs directly to the database entry. The database is stored online by RefWorks and is searchable from anywhere via a web browser. Many universities already have site licenses for this system, so check with your university librarians. Otherwise, check out their website for further details. The Microsoft Office plug-in for the manager, "Write-N-Cite III", works with Microsoft Word and the database to automatically generate reference lists and citations formatted to the style of almost every major and minor academic journal in most disciplines. The whole database is searchable and may be organized by project. You can also automatically import any article or abstract from Google Scholar or other academic databases like JSTOR, ProQuest, etc.

Keep it very simple by Anonymous Coward · 2009-04-09 06:31 · Score: 0

Being involved in CS research myself,
I think this is a very interesting problem!
I notice that many people still have these big piles of paper, even when they are in their twenties.

I suggest the following:
* Hire a student to look up all the printed papers in Google Scholar or some other database.
* Throw away the paper copies of the ones you found online.
* Save all results as PDF (no html, no txt), with the title as the filename.
* Ignore articles you cannot find - no scanning.
* Forget FOSS and install Copernic Desktop Search. It works really great.

Now the problem is also that your professor wants to make notes on his papers. To really use the PDFs, you need him to buy a tablet notebook on which he can write on and annotate PDFs.
One of my colleagues has one, and it seems that e-paper is finally arriving.

As a last piece of advice, teach your professor to organize his articles in directories; you can then search each one individually. When I write a paper, I do a literature search in one particular directory of relevant literature.
And do not focus too much on clever archiving strategies with keywords and such; they are not worth the effort.

Searchable index with keywords - ontology needed by FindItByMe · 2009-04-09 19:31 · Score: 1

In order to properly create a hierarchical index which is searchable, you may be interested in constructing an ontology, which is a description of your subject matter in terms of some broad categories. Those broad categories then branch out into logical subject areas. Many databases support hierarchical structures which match well with the way an ontology works. Once the ontology is constructed, which consists of keywords which represent the categories, you index the document on those keywords. Then your system can browse the hierarchy or zero in on a particular term. In linguistics ontologies are used to construct meaning trees of words as a starting point into determining the meaning and intent of some written text. Perhaps some of the commercial packages discussed can do this, but this is what I would look for in a product if I was faced with your task.
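
A toy version of the idea, to make it concrete. The category names and the parent-pointer layout are invented for illustration, and a real system would more likely keep this in a database table than in a Python dict.

```python
from collections import defaultdict

# A tiny, made-up ontology: each category maps to its parent (None = root).
PARENT = {
    "physics": None,
    "solid state": "physics",
    "semiconductors": "solid state",
    "optics": "physics",
}


def descendants(category):
    """Return the category plus everything below it in the hierarchy."""
    found = {category}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


class OntologyIndex:
    def __init__(self):
        self.docs = defaultdict(set)  # category -> document names

    def add(self, doc, category):
        """File a document under one category of the ontology."""
        if category not in PARENT:
            raise KeyError("unknown category: %s" % category)
        self.docs[category].add(doc)

    def find(self, category):
        """Documents filed under the category or any of its descendants."""
        hits = set()
        for cat in descendants(category):
            hits |= self.docs.get(cat, set())
        return sorted(hits)


idx = OntologyIndex()
idx.add("zno.pdf", "semiconductors")
idx.add("lens.pdf", "optics")
print(idx.find("physics"))
print(idx.find("solid state"))
```

Searching a broad category picks up everything filed under its narrower terms, while a narrow term zeroes in on just its own documents.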

I think this may satisfy your need pretty well by Anonymous Coward · 2009-04-10 07:59 · Score: 0

http://www.mendeley.com/
opensource software meant for cataloging academic research papers, with a web backup/archive that can be shared with others

Better OCR for math by extra88 · 2009-04-18 10:22 · Score: 1

InftyReader is a program that specializes in doing OCR on scientific documents and mathematical formulas. It saves documents in a variety of formats including LaTeX and MathML.

Two unfortunate things about it: 1) it's a Windows binary 2) it costs $900USD for 2 concurrent use licenses. It was free until they licensed a conventional OCR engine to better handle the text (its non-math recognition was pretty bad before).

Slashdot Mirror

211 comments