A Grep-like Utility That Works on More than Text?
Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file-types? Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?
strings filename | grep
The problem I see isn't searching compressed or tarballed files, where the text inside is still largely plaintext. It's just a different file format, but it's still bytes, and I've seen supergrep programs in the past for both Linux and Windows that do this (try Tucows or SourceForge before posting on Slashdot in search of freeware and shareware). The problem I see is searching *encrypted* files, especially ones with different keys. Now that would be *hard*.
SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
file+[GLUE] and optionally, grep. GNU File is a nice utility to determine file types. From there, you could use some sort of glue language, such as Python or Perl, to perform the necessary actions when certain filetypes are found: decompress, index, etc. I'd be surprised if there wasn't a plugin style grep utility already out there.
assert(expired(knowledge));
Beagle
Useful programs include wvWare, rtf2htm, pdftotext, and yes, even strings.
Beagle will probably meet at least some of these goals. Beagle also aims to index things like your gaim-logs, browsing history, and email as well as your files. This is closely related to the dashboard project. Both projects can be retrieved from Gnome CVS.
http://www.nat.org/beagle/
http://www.nat.org/dashboard/
What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc.
Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?
catdoc Blah.doc | grep foo
zcat compressed.txt.gz | grep foo
apt-cache show package | grep foo
pdftotext Blah.pdf - | grep foo
etc...
The same way everything else works!
1) Be had a half-decent version years ago.
2) Apple will have a reasonably robust version out soon.
3) Microsoft will have a more frustrating knock-off of Apple's version a few years later.
4) Four competing, incompatible open-source projects will copy the Apple and Microsoft implementations. When one of those companies sends a cease and desist letter to an open-source project that has shamelessly ripped off its trademarked name, Linux zealots will complain about how "intellectual property laws stifle our innovation"!
What I'm listening to now on Pandora...
I'm sure there are utilities to convert most of the formats to text. MS Word to text? No problem, there's a utility to handle that. PDF? Of course.
Basically, what you might do is something like this:
1. Figure out what kind of file it is (using 'file')
2. Select the correct converter based on file
3. grep
4. Profit? -- not quite sure about this one.
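The steps above can be sketched as a few shell functions. This is only a rough sketch under some assumptions: the names converter_for, to_text and grep_any are made up for illustration, the type-to-converter table is deliberately tiny, and it assumes file, pdftotext and catdoc are installed. Anything unrecognized falls back to strings.

```shell
#!/bin/sh
# Sketch: search files of several types by converting each one to text first.
# All function names here are hypothetical.

# Step 2: map a MIME type (as printed by `file --mime-type -b`) to a converter.
converter_for() {
    case "$1" in
        text/*)                              echo cat ;;
        application/gzip|application/x-gzip) echo gunzip ;;
        application/pdf)                     echo pdftotext ;;
        application/msword)                  echo catdoc ;;
        *)                                   echo strings ;;
    esac
}

# Steps 1 and 2: detect the type of file $1 and print its text on stdout.
to_text() {
    case "$(converter_for "$(file --mime-type -b "$1")")" in
        cat)       cat "$1" ;;
        gunzip)    gzip -dc "$1" ;;
        pdftotext) pdftotext "$1" - ;;   # "-" sends the text to stdout
        catdoc)    catdoc "$1" ;;
        *)         strings "$1" ;;
    esac
}

# Step 3: grep_any PATTERN FILE... prints the names of files whose text matches.
grep_any() {
    pattern=$1; shift
    for f in "$@"; do
        to_text "$f" 2>/dev/null | grep -q -e "$pattern" && echo "$f"
    done
}
```

Usage would be something like `grep_any foo *` after sourcing the file. Adding a format means adding one line to each case statement.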
This would make a great shell script project. You could use file to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
Namazu has multiple 'filters' that allow you to index .txt, .pdf, .doc and various other file formats. It also has some grep-like command line utilities that allow you to 'grep' for text.
Oh, and it has a Perl front end for Apache and IIS (win32 installer available from here: www.namazu.org/windows).
caveat: As with any full text search engine, the files should be reindexed on a regular basis, but this is worth it as the searches are very fast even with gigs of documents.
enjoy
You mean a personal slave^H^H^H^H^H^H^H^H^H^H^H^H^H^Hhapless grad student?
--TheOrangeSquid Is it any wonder things seem so awry? We swim in a sea of confusion and don't have to think to survive
Because strings will do things like: uncompress ASCII text from tarballs, recognize unicode font files, work with XML document formats to ignore markup and focus on content, search within images for strings, etc, etc.
The person is asking about a true file contents indexing scheme, where you have a database of file name, meta type, and keywords culled from inside the document -- something that'd work with PDF, JPG w/ EXIF, pictures of text, OpenOffice files, XML files, HTML files, etc!
Strings does none of that.
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
But, down the road, maybe ReiserFS4 will do the trick.
You are being MICROattacked, from various angles, in a SOFT manner.
It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.
You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and word documents.
A little bit of work to set up, but it's a nifty bit of software.
This isn't such a bad idea, as stupid ideas go.
If the user was stupid enough to share their entire drive, and allow Google to index it, it could find most things in most file types.
With more support coming all the time.
- http://www.milkme.co.uk
Because it involves the user explicitly using the right utility to decode each file type for every single file.
So you don't get the nice simple case the user is looking for like:
grep foo *
or even something cool like
find . -name "$pattern" -exec egrep -l -e "$1" {} \;
Grep is often used to hunt through large amounts of docs because you know something has that text, but you can't remember what.
What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.
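Something along those lines might look like the sketch below. It is only an illustration: to_text and rgrep_any are made-up names, and to keep it short the only format it decodes is gzip (detected with gzip -t rather than file). Everything else is read as-is, so real use would need more branches in to_text. File names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: a `grep -l` over a directory tree that decodes files before searching.
# Function names are hypothetical.

# Print file $1 as text. Only gzip is decoded here; other files pass through.
to_text() {
    if gzip -t "$1" 2>/dev/null; then
        gzip -dc "$1"
    else
        cat "$1"
    fi
}

# rgrep_any PATTERN [DIR]: list files under DIR (default ".") whose
# decoded contents match PATTERN.
rgrep_any() {
    find "${2:-.}" -type f | while IFS= read -r f; do
        if to_text "$f" | grep -q -e "$1"; then
            printf '%s\n' "$f"
        fi
    done
}
```

With that sourced, `rgrep_any foo .` behaves like the find example but also looks inside gzipped files.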
Cheers
Lost at C:>. Found at C.
Get doodle
or simply do a
# apt-get install doodle
Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003.
Christian Jones
Medicine. Mathematics. Mediocrity.
I think the fscking puppy is the WORST thing about XP. Why on earth would I click 'search' in my 'file manager' if I wasn't searching for a FILE? I despise any 'improvement' to Windows that makes me use my mouse any more than I already do. If Microsoft knew what they were doing the search would START with files, with an option to do all the other crap on the sidelines.
Apple has a really good search tool, it's simple but lets you string together conditions intuitively. KDE seems pretty decent too.
"Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails
Google has a search appliance that is capable of searching files on a local network. This may be more than what you're looking for, but if this is a serious enterprise application, this seems just right. It's like having your own personal Google.
There's no sig like SIGSEG
http://www.robertames.com/index.sh.txt
Call it BSD-licensed by author, and sharing back is encouraged (of course). I downloaded all my mail from my Yahoo account (2 weeks before they upped it to 100 MB) and stuffed it in a directory so I could search it better. I have to agree that namazu rocks. :^)
Major operations are "create, update" for working with the index, then "search, list, file" for searching. "Search" does the Google thing, "list" will list out the filenames (e.g. for further processing), and "file" will dump the contents of each search result (e.g. for further grepping).
$ ~/bin/index.sh search something
....snip...
3. Fwd: Re: get-edid... "something special happened" :^) (score: 18)
/home/rames/email/test/Sent/Fwd_ Re_ get-edid___ _something special happened_ _^).eml (3,335 bytes)
Author: Robert Ames
Date: Wed, 5 Jun 2002 21:04:39 -0700 (PDT)
Branden Had a minor buglet that I forwarded to upstream regarding "read-edid" (probably
a hardware database out of date thing, no worries). Anyway, I mentioned to him that I
was confused at first by
$ ~/bin/index.sh list something
/home/rames/email/test/Subscriptions/Signature Confirmation - Petition to ignore the _Rename _The Two Towers_ to Something Less Offensive Petition_ - 191.eml
/home/rames/email/test/Inbox/Being Twenty-Something.eml
/home/rames/email/test/Sent/Fwd_ Re_ get-edid___ _something special happened_ _^).eml
...now mod me up! Where's mah karma? :^)
--Robert
Does anyone know of a Win32 version of grep which handles Unicode files correctly?
At the moment I have to convert to ASCII first, which, while not that big a problem, is an extra step.
Here's a snapshot of the source for Glimpse 3.0, packaged up from my system as I don't have the original tarfile anymore.
--
I don't want to rule the world... I just want to be in charge of mayonnaise.
Try POPsearch (http://www.popsearch.net/)
You could add support for many file formats to your script by using some of the many PostScript converters that already exist. Pipe their output to an application that extracts plaintext from the PostScript file, then pipe to grep.
..Kazaa