A Grep-like Utility That Works on More than Text?
Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file types? Search engines like ht://Dig can accomplish part of this task, but currently it doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?
strings filename | grep pattern
The problem I see isn't searching compressed or tarballed files, where the text inside is still largely plaintext. It's just a different file format, but it's still bytes, and I've seen supergrep programs in the past for both Linux and Windows that do this (try Tucows or SourceForge before posting on Slashdot in search of freeware and shareware). The problem I see is searching *encrypted* files, especially ones with different keys. Now that would be *hard*.
Useful programs include wvWare, rtf2htm, pdftotext, and yes, even strings.
Beagle will probably meet at least some of these goals. Beagle also aims to index things like your gaim-logs, browsing history, and email as well as your files. This is closely related to the dashboard project. Both projects can be retrieved from Gnome CVS.
http://www.nat.org/beagle/
http://www.nat.org/dashboard/
This would make a great shell script project. You could use file to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
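Something along these lines, as a rough sketch rather than a finished tool. The function names to_text and typegrep are made up for illustration, the list of types is deliberately short, and the PDF branch assumes pdftotext (from xpdf) is installed. Anything unrecognized falls back to strings.

```shell
#!/bin/sh
# to_text FILE: write a plain-text rendering of FILE to stdout.
# (Hypothetical helper; extend the case list with your own converters.)
to_text() {
    case "$(file -b --mime-type -- "$1")" in
        text/*)
            cat -- "$1" ;;
        application/gzip|application/x-gzip)
            gzip -dc -- "$1" ;;
        application/pdf)
            # Assumes pdftotext is installed; "-" sends the text to stdout.
            pdftotext "$1" - ;;
        *)
            # Unknown type: pull out whatever printable strings exist.
            strings -- "$1" ;;
    esac
}

# typegrep PATTERN FILE...: print the name of each file whose text
# rendering matches PATTERN, like "grep -l" across mixed file types.
# Returns 0 if at least one file matched, 1 otherwise.
typegrep() {
    pattern=$1
    shift
    found=1
    for f in "$@"; do
        [ -f "$f" ] || continue
        if to_text "$f" 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

With those two functions sourced, "typegrep foo *" behaves like "grep -l foo *" but looks inside gzipped files and PDFs as well. Word documents and the rest would each need their own branch calling a suitable converter.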
Namazu has multiple 'filters' that allow you to index .txt, .pdf, .doc, and various other file formats. It also has some grep-like command-line utilities that allow you to 'grep' for text.
Oh, and it has a Perl front end for Apache and IIS (win32 installer available from here: www.namazu.org/windows).
Caveat: as with any full-text search engine, the files should be reindexed on a regular basis, but this is worth it, as the searches are very fast even with gigs of documents.
enjoy
Seth Nickell has a blog entry discussing solutions in this area, including Gnome Storage, WinFS, Dashboard, Medusa, Spotlight, and Beagle.
It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.
You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and Word documents.
A little bit of work to set up, but it's a nifty bit of software.
Because it involves the user explicitly using the right utility to decode each file type for every single file.
So you don't get the nice simple case the user is looking for like:
grep foo *
or even something cool like
find . -name "$pattern" -exec egrep -l -e "$1" {} \;
Grep is often used to hunt through large amounts of docs because you know something has that text, but you can't remember what.
What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.
Cheers
Get doodle
or simply do a
#apt-get install doodle
Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003.
Christian Jones