A Grep-like Utility That Works on More than Text?

Posted by Cliff on Thursday September 2, 2004 @06:45AM from the substring-searches-on-steroids dept.

Nutria writes "This article got me thinking: What's a poor Unix-using guy to do, when he needs to grep text, compressed tarballs, OO.o documents, Debian archives, mime-encoded files, Evil Microsoft documents, PDF files, compressed AbiWord files, etc." Is there an extensible searching program for Unix that can handle a variety of different file-types? Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata). If you had to perform a substring search on a set of documents of different types, what tools would you use to accomplish this task?

Ever heard of 'strings'? by szyzyg · 2004-09-02 06:46 · Score: 1, Informative

strings filename | grep pattern
1. Re:Ever heard of 'strings'? by JabberWokky · 2004-09-02 07:03 · Score: 2, Informative
  
  Realistically, this should operate much like the text filters in less, i.e. a set of filters that convert files to text for grepping.
  A more generic solution would be to have less stop using those filters and instead use strings, which would identify the file type and then subcall strings.text.plain or strings.text.html or otherwise the strings.mime.type. The fallback would be the normal strings behaviour. That way, anything that expected text (a la grep, less, etc) can have a flag to filter through strings to get the 'clean' feed:
  grep -riw --strings "gold" *
  Or just pipe it through:
  strings mytext.html |grep -iw "gold"
  That latter case could be handled by a strings.text.html that consists of a simple script calling lynx --dump.
  --
  Evan
  
  --
  "$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
Got to be better than XP's puppy dog, but by Marxist+Hacker+42 · 2004-09-02 06:50 · Score: 2, Informative

The problem I see isn't searching compressed or tarballed files, where the text inside is still largely plaintext. It's just a different file format, but it's still bytes, and I've seen supergrep programs in the past for both Linux and Windows that do this (try Tucows or SourceForge before posting on Slashdot in search of freeware and shareware). The problem I see is searching *encrypted* files, especially ones with different keys. Now that would be *hard*.

--
SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
ht://Dig doesn't do this? by dougmc · 2004-09-02 06:55 · Score: 2, Informative

Search engines like ht://Dig can accomplish part of this task, however currently it doesn't index the whole file (just portions of the metadata).
Um, ht://Dig WILL do it. As will any of the other search engine packages out there, if you can feed them the text that they work on. It's generally not difficult (on a *nix platform) to convert all of the formats you mentioned into straight text or html and feed those to your favorite search engine platforms.
Useful programs include wvWare, rtf2htm, pdftotext, and yes, even strings.
Beagle by Markusis · 2004-09-02 06:57 · Score: 3, Informative

Beagle will probably meet at least some of these goals. Beagle also aims to index things like your gaim-logs, browsing history, and email as well as your files. This is closely related to the dashboard project. Both projects can be retrieved from Gnome CVS.

http://www.nat.org/beagle/
http://www.nat.org/dashboard/
Use a pipe and utilities by Matt+Perry · 2004-09-02 07:06 · Score: 4, Informative

The Unix shell and pipe are your friends:
Plain text:
    grep text filename
Compressed tarballs:
    tar zOxf filename | grep text
OO.o documents:
    unzip -p filename | grep text
MIME-encoded files:
    mimedecode filename | grep text
Evil Microsoft documents:
    strings filename | grep text
PDF files:
    strings filename | grep text
This would make a great shell script project. You could use file to detect the type and then filter and grep it appropriately. This sounds useful enough that I'll probably write this script this weekend. Thanks for the idea.
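
For what it's worth, the "use file to detect the type, then filter" idea might look roughly like the sketch below. It is not anyone's finished script: the function name to_text is made up, it assumes a file(1) recent enough to support --mime-type, and it assumes gzip, unzip, strings and pdftotext are installed. The optional second argument only exists so the type can be forced by hand.

```shell
#!/bin/sh
# to_text FILE [MIMETYPE]  (hypothetical helper, not an existing tool)
# Write a best-effort plain-text rendering of FILE to stdout.
# MIMETYPE is optional; when omitted it is detected with file(1).
to_text() {
    mime=${2:-$(file -b --mime-type "$1")}
    case "$mime" in
        text/*)
            cat "$1" ;;
        application/gzip|application/x-gzip)
            # Covers .tar.gz as well: tar stores member data uncompressed,
            # so strings can pick the text out of the decompressed stream.
            gzip -dc "$1" | strings ;;
        application/zip|application/vnd.sun.xml.*|application/vnd.oasis.opendocument.*)
            # OO.o documents are zip archives of XML.
            unzip -p "$1" ;;
        application/pdf)
            # Assumes pdftotext is installed; "-" sends the text to stdout.
            pdftotext "$1" - ;;
        *)
            # Fallback: whatever printable runs strings can find.
            strings "$1" ;;
    esac
}
```

With that defined, a search becomes something like `to_text report.pdf | grep -i gold`. New formats are one more case branch each.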

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
1. Re:Use a pipe and utilities by robbkidd · 2004-09-02 07:43 · Score: 3, Informative
  PDF files[?]
  strings filename | grep text
  
  I'm guessing you've never tried that search before. PDF stores the meat of a document in compressed data streams. strings would return a bunch of font names, headers and compressed garbage.
  
  There are a few other tools available, at various stages of stability:
  
  PDFtoHTML
  
  PDFsearch
  
  CPAN PDF modules
2. Re:Use a pipe and utilities by bunnyman · 2004-09-02 09:33 · Score: 4, Informative
  
  less(1) already does this! Check out the $LESSOPEN variable on your Linux system; it points to a shell script that detects what type of file you are viewing and runs a filter on it to get plain text from it.
3. Re:Use a pipe and utilities by Spoing · 2004-09-03 01:17 · Score: 2, Informative
  less(1) already does this! Check out the $LESSOPEN variable on your Linux system; it points to a shell script that detects what type of file you are viewing and runs a filter on it to get plain text from it.
  
  *blink*
  $ less firefox-i686-linux-gtk2+xft.tar.gz
  drwxrwxrwx cltbld/cltbld 0 2004-08-24 11:26:07 firefox/
  -rwxr-xr-x cltbld/cltbld 30869 1999-10-05 22:14:51 firefox/LICENSE
  -rwxr-xr-x cltbld/cltbld 177 2004-08-09 16:01:23 firefox/README.txt
  drwxrwxrwx cltbld/cltbld 0 2004-08-24 11:26:07 firefox/chrome/
  drwxrwxr-x cltbld/cltbld 0 2004-08-24 10:25:42 firefox/chrome/comm/
  
  (Spoing does a happy dance.)
  OK, here's one for you... tab completion in Bash for commands.
  An example:
  $ ls 1 [tab] [tab]
  1file.txt 110012.tar.gz
  $ gunzip 1 [tab] [changes to...]
  $ gunzip 110012.tar.gz
  
  It works for other commands also, and it's programmable!
  --
  A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Full text search engine? by bentfork · 2004-09-02 07:18 · Score: 3, Informative

I like the Catfish: Namazu (namazu.org, or the Google cache of it). That's "catfish" in Japanese.
It has multiple 'filters' that allow you to index .txt, .pdf, .doc and various other file formats. It also has some grep-like command-line utilities that allow you to 'grep' for text.
Oh, and it has a Perl front end for Apache and IIS. (A win32 installer is available from www.namazu.org/windows.)
Caveat: as with any full-text search engine, the files should be reindexed on a regular basis, but this is worth it, as the searches are very fast even with gigs of documents.
Enjoy.
Re:Beagle by HRbnjR · 2004-09-02 07:34 · Score: 4, Informative

Seth Nickell has a blog entry discussing solutions in this area, including Gnome Storage, WinFS, Dashboard, Medusa, Spotlight, and Beagle.
Penetrator by cbcbcb · 2004-09-02 08:15 · Score: 3, Informative

I've just implemented a Word/PDF/text search system using Penetrator.
It acts as a web search engine or a grep-like tool. You can either build an index or configure it to search on demand.
You can add filters to deal with any file type you like. I use xpdf and wvWare to handle PDFs and Word documents.
A little bit of work to set up, but it's a nifty bit of software.
1. Re:Penetrator by GrumpySimon · 2004-09-02 09:43 · Score: 2, Informative
  
  This looks very cool (and isn't a goatse link!), I'm just setting it up now. Trying to get html2text to kill things like font tags.
  
  Also, your default pdftotext setting seems to barf on files with spaces in their names. I changed the %s on that line to '%s' and this seems to work.
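
A small sketch of why the quotes matter. It assumes the filter line is a template whose %s is replaced by the file name before the result is handed to a shell; how Penetrator really does the substitution is a guess here, and count_args and run_template are made-up helpers standing in for pdftotext and the search tool.

```shell
#!/bin/sh
# Hypothetical stand-in for pdftotext: prints how many arguments it received.
count_args() { echo "$#"; }

# Assumed behaviour of the filter template: %s is replaced by the file
# name and the resulting string is run as a shell command line.
run_template() { eval "$(printf "$1" "$2")"; }

name='annual report.pdf'

run_template "count_args %s"   "$name"   # prints 2: the name was split at the space
run_template "count_args '%s'" "$name"   # prints 1: the quotes keep it in one piece
```

Single quotes in the template still break on a file name that itself contains a single quote, so this is a fix for spaces, not for every odd name.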
  
  Cheers,
  Simon
  
  --
  henry -- the human evolution news relay
Half a solution .... by gstoddart · 2004-09-02 09:43 · Score: 3, Informative

Um, why not pipe the output of your favorite program that interacts with the file type you're interested in to grep? Isn't that the "poor unix guy" way?

Because it involves the user explicitly using the right utility to decode each file type for every single file.

So you don't get the nice simple case the user is looking for like:

grep foo *

or even something cool like

find . -name "$pattern" -exec egrep -l -e "$1" {} \;

Grep is often used to hunt through large amounts of docs because you know something has that text, but you can't remember what.

What the guy is asking for is something that would take my second example, figure out the file type, and then do it automatically for you.

Cheers

--
Lost at C:>. Found at C.
1. Re:Half a solution .... by cortana · 2004-09-02 11:49 · Score: 2, Informative
  
  This is what the less package is for. Specifically lesspipe(1).
  
  find . -type f -exec lesspipe {} \; | grep whatever
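
One catch with a pipeline like that is that grep sees a single merged stream, so a hit doesn't tell you which file it came from. A rough sketch of a variant that prints the matching file names instead, in the spirit of grep -l, follows. The name grep_filtered is made up, and the filter is passed in as an argument so that lesspipe or any other file-to-text command can be plugged in.

```shell
#!/bin/sh
# grep_filtered PATTERN FILTER DIR  (hypothetical helper)
# Print the name of every regular file under DIR whose FILTER output
# contains PATTERN. FILTER is any command that takes a file name and
# writes text to stdout.
grep_filtered() {
    find "$3" -type f -exec sh -c '
        pattern=$1
        filter=$2
        shift 2
        for f in "$@"; do
            # Run the filter on one file at a time so a hit can be
            # attributed to that file.
            if "$filter" "$f" 2>/dev/null | grep -q -e "$pattern"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$1" "$2" {} +
}
```

Usage would be along the lines of `grep_filtered whatever lesspipe .`. As far as I know, some lesspipe versions print nothing at all for files they have no filter for (plain text included), so it may be worth wrapping the filter in something that falls back to cat.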
doodle by tenco · 2004-09-02 09:51 · Score: 2, Informative

Get doodle
or simply do a
# apt-get install doodle
Haven't tried it, but SWISH sounds good by chjones · 2004-09-02 10:02 · Score: 4, Informative

Haven't actually tried it out (everything I write seems to be text or TeX), but I remembered reading this article a while back: "How to Index Anything", Linux J., July 1 2003.

--
Christian Jones
Medicine. Mathematics. Mediocrity.
1. Re:Haven't tried it, but SWISH sounds good by bobv-pillars-net · 2004-09-02 13:15 · Score: 4, Informative
  
  Actually, the last time I needed to index a large group of files (for disaster recovery purposes; the files in question were all from the lost+found directory) I used Swish++.
  And yes, it works VERY well, orders of magnitude faster than ht://Dig, features incremental reindexing, and can be configured to auto-convert various filetypes to text before indexing. I'd say it's exactly what the poster ordered.
  
  --
  The Web is like Usenet, but
  the elephants are untrained.