A Grep-like Utility That Works on More than Text?

Posted by Cliff on Thursday September 2, 2004 @06:45AM from the substring-searches-on-steroids dept.

Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file types? Search engines like ht://Dig can accomplish part of this task; however, it currently doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use?

How, indeed! by Otter · 2004-09-02 07:02 · Score: 4, Funny

The same way everything else works!

1) Be had a half-decent version years ago.

2) Apple will have a reasonably robust version out soon.

3) Microsoft will have a more frustrating knock-off of Apple's version a few years later.

4) Four competing, incompatible open-source projects will copy the Apple and Microsoft implementations. When one of those companies sends a cease and desist letter to an open-source project that has shamelessly ripped off its trademarked name, Linux zealots will complain about how "intellectual property laws stifle our innovation"!

--
What I'm listening to now on Pandora...
Use a pipe and utilities by Matt+Perry · 2004-09-02 07:06 · Score: 4, Informative

The Unix shell and pipe are your friends:

grep text
    grep text filename
compressed tarballs
    tar zOxf filename | grep text
OO.o document
    unzip -p filename | grep text
mime-encoded files
    mimedecode filename | grep text
Evil Microsoft documents
    strings filename | grep text
PDF files
    strings filename | grep text

This would make a great shell script project. You could use file(1) to detect the type of each file, then run the matching filter and grep its output. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
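
For what it's worth, a minimal sketch of that script might look like the following. The names anygrep and to_text are made up for illustration, and the MIME strings are assumptions about what `file -b --mime-type` typically reports (they vary between versions of file). mimedecode is left out, and file names starting with "-" would need a ./ prefix.

```shell
#!/bin/sh
# Sketch only: "anygrep" and "to_text" are hypothetical names, and the
# MIME types below are assumed values that differ between file(1) versions.

# to_text FILE: print a best-effort plain-text rendering of FILE on stdout.
to_text() {
    mime=$(file -b --mime-type "$1") || return 1
    case $mime in
        text/*)
            cat "$1" ;;
        application/gzip|application/x-gzip)
            # Try it as a compressed tarball first, else as a plain gzip file.
            tar zOxf "$1" 2>/dev/null || gzip -dc "$1" ;;
        application/zip|application/vnd.oasis.opendocument.*|application/vnd.sun.xml.*)
            # OO.o documents are zip archives of XML files.
            unzip -p "$1" ;;
        *)
            # Unknown or binary formats: fall back to printable strings.
            strings "$1" ;;
    esac
}

# anygrep PATTERN FILE...: print the name of every FILE whose text matches.
# Returns 0 if at least one file matched, 1 otherwise (like grep).
anygrep() {
    pattern=$1
    shift
    status=1
    for f in "$@"; do
        if to_text "$f" | grep -q -e "$pattern"; then
            printf '%s\n' "$f"
            status=0
        fi
    done
    return $status
}
```

With those functions loaded, something like `anygrep text *.tgz *.sxw *.pdf` would list the files that match. The pattern is a regular expression, same as with plain grep.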

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
1. Re:Use a pipe and utilities by bunnyman · 2004-09-02 09:33 · Score: 4, Informative
  
  less(1) already does this! Check out the $LESSOPEN variable on your Linux system; it points to a shell script that detects what type of file you are viewing and runs a filter on it to extract plain text.
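
  A rough sketch of reusing that filter for grep, assuming $LESSOPEN is in the "input pipe" form (a leading "|" followed by a command containing a %s placeholder for the file name). The function name lessopen_text is made up here. Like less itself, it falls back to the original file when the filter prints nothing.

```shell
# Sketch only: "lessopen_text" is a hypothetical name, and this assumes
# $LESSOPEN looks like "|command %s" (the input-pipe form, with a %s).

# lessopen_text FILE: print FILE as text via the $LESSOPEN filter, if any.
lessopen_text() {
    out=
    case ${LESSOPEN-} in
        \|*)
            cmd=${LESSOPEN#\|}      # drop the "|" that marks an input pipe
            cmd=${cmd#\|}           # some setups use "||"
            cmd=${cmd#-}            # optional "-" (also filter standard input)
            before=${cmd%%"%s"*}    # command text before the %s placeholder
            after=${cmd#*"%s"}      # command text after it
            # Substitute the quoted file name for %s and run the filter.
            out=$(eval "${before}\"\$1\"${after}" 2>/dev/null) || :
            ;;
    esac
    if [ -n "$out" ]; then
        printf '%s\n' "$out"
    else
        # No filter, or the filter produced nothing: use the file as-is.
        cat "$1"
    fi
}
```

  Then `lessopen_text filename | grep text` searches whatever less would have shown you.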
Re:Beagle by HRbnjR · 2004-09-02 07:34 · Score: 4, Informative

Seth Nickell has a blog entry discussing solutions in this area, including Gnome Storage, WinFS, Dashboard, Medusa, Spotlight, and Beagle.
Haven't tried it, but SWISH sounds good by chjones · 2004-09-02 10:02 · Score: 4, Informative

Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003.

--
Christian Jones
Medicine. Mathematics. Mediocrity.
1. Re:Haven't tried it, but SWISH sounds good by bobv-pillars-net · 2004-09-02 13:15 · Score: 4, Informative
  
  Actually, the last time I needed to index a large group of files (for disaster recovery purposes; the files in question were all from the lost+found directory) I used Swish++.
  And yes, it works VERY well: it is orders of magnitude faster than ht://Dig, features incremental reindexing, and can be configured to auto-convert various filetypes to text before indexing. I'd say it's exactly what the poster ordered.
  
  --
  The Web is like Usenet, but
  the elephants are untrained.