Researchers Expanding Diff, Grep Unix Tools

Posted by timothy on Thursday December 8, 2011 @06:44AM from the now-with-raisins dept.

itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."

Strange names by gnasher719 · 2011-12-08 06:46 · Score: 4, Funny

Space characters in the name of a Unix command line tool are asking for trouble.
1. Re:Strange names by realyendor · 2011-12-08 06:51 · Score: 4, Insightful
  
  I expect those are just the spoken names and that the commands will still be single words, similar to:
  "GNU awk" -> gawk
  "enhanced grep" -> egrep
2. Re:Strange names by dougmc · 2011-12-08 06:57 · Score: 2
  
  and I really should spend a few more seconds thinking about what I'm responding to. Obviously gawk and egrep are existing tools, given as examples, not proposed names for these new tools.
3. Re:Strange names by rwa2 · 2011-12-08 06:57 · Score: 2
  
  Yay, a tools thread!
  I am liking meld (python-based visual diff)
  But I suppose they have a different concept of hierarchical diff than diffing/merging two directory structures.
4. Re:Strange names by adonoman · 2011-12-08 07:12 · Score: 3, Interesting
  
  But having to use quotes every time you call a command is a sure way to make sure your command is never used.
  Would you rather type this:
  ./"Context-Free Grep" ...
  or this:
  ./cfgrep ..
5. Re:Strange names by ivoras · 2011-12-08 07:15 · Score: 2
  
  But of course, "eegrep" isn't :)
  (enhanced enhanced grep)
  
  --
  -- Sig down
6. Re:Strange names by ripler · 2011-12-08 07:24 · Score: 4, Funny
  
  Next thing you know we'll have CSIgrep. (enhance enhance enhance grep)
7. Re:Strange names by Longjmp · 2011-12-08 07:31 · Score: 4, Insightful
  
  Definitely
  I mean, where would we end up if unix commands actually gave a hint what they are doing ;-)
  As a unix novice, if I wanted to search for something, my first choice of course would be grep
  Also if I wanted help on something, the first word that jumps to my mind would be man
  
  heh.
  
  --
  There are fewer illiterates than people who can't read.
8. Re:Strange names by mytec · 2011-12-08 07:31 · Score: 4, Informative
  
  According to this paper, they are called bgrep and bdiff.
9. Re:Strange names by EdIII · 2011-12-08 07:36 · Score: 3, Informative
  
  and I really should spend a few more seconds thinking about what I'm responding to
  That's not what Slashdot is about........
10. Re:Strange names by Anne+Thwacks · 2011-12-08 07:42 · Score: 4, Funny
  
  CSIgrep would take 30 mins to get the result! (With ad breaks)
  
  --
  Sent from my ASR33 using ASCII
11. Re:Strange names by iluvcapra · 2011-12-08 07:48 · Score: 4, Insightful
  
  If you don't like a tool's name, export an alias.
  It's not about typing commands as much as it's about making these work:
  
  $ find . -name "*.txt" | xargs wc
  $ for file in $*; do mv $file old/$file; done
  
  Versus these:
  
  $ find . -name "*.txt" -print0 | xargs -0 wc
  $ for file in "$@"; do mv "$file" "old/$file"; done
  
  A lot of scripts you run into are just broken because of braindead assumptions.
  
  --
  Don't blame me, I voted for Baltar.
12. Re:Strange names by gangien · 2011-12-08 07:52 · Score: 2
  
  in scripts, i pretty much quote everything. seems to be the way to avoid problems. of course, i'm not a sysadmin by trade, so maybe it's bad for some reason or something.
  when at the prompt i hit tab.
  We'd probably avoid a lot of problems if people weren't too lazy to type a few extra characters.
13. Re:Strange names by toadlife · 2011-12-08 07:53 · Score: 4, Funny
  
  "I have only been able to come up with one algorithm for creating Unix command names: think of a good English word to describe what you want to do, then think of an obscure near- or partial-synonym, throw away all the vowels, arbitrarily shorten what's left, and then, finally, as a sop to the literate programmer, maybe reinsert one of the missing vowels."
  Rachel Padman
  
  --
  I don't always use unix-like operating systems; but when I do, I prefer FreeBSD.
14. Re:Strange names by mfnickster · 2011-12-08 08:36 · Score: 4, Insightful
  There's nothing that says the name of the tool and the command you type must be the same
  Very true. Unix programmers seem to follow these rules:
  
  delete any spaces in the name
  
  delete any vowels in the name
  
  delete any superfluous consonants
  
  chuck the entire thing and just abbreviate it to the first letter of each word in the name
  
  So these tools will likely be run as "ctxtfrgrp" and "hierdiff" or just "cfgrep" and "hdiff"
  --
  "Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."
15. Re:Strange names by urdak · 2011-12-08 09:21 · Score: 2
  
  Like 'cat' for concatenate, or vi for what exactly?
  "vi" is short for "visual".
  First there was "ed", the, you guessed it, "editor". But "ed" was a real pain to use, because you wouldn't see what you were actually editing (if you ever used ed, you'd know what I mean). So the "visual" editor "vi" was invented.
16. Re:Strange names by jd · 2011-12-08 09:44 · Score: 5, Funny
  
  You have to figure in two's complement notation. If it's sufficiently counter-intuitive, the sign bit flips over and it becomes totally intuitive.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
17. Re:Strange names by jejones · 2011-12-08 10:02 · Score: 3, Interesting
  
  Alas, history and lots of shell scripts have probably made existing command names unchangeable. History in this case goes back to the time people got RSI from ASR-33 Teletypes and didn't want to have to type very much, and to names that make sense only if you know other programs (in ed, "g/re/p" prints all lines containing the specified regular expression, hence the name "grep").
  That said, we programmers are users of programming languages as much as Joe Sixpack is a user of the desktop, and surely we deserve good design as much as they do, so we can get things done rather than taking perverse pride in mastering needlessly ghastly syntax.
18. Re:Strange names by morgauxo · 2011-12-08 10:07 · Score: 4, Insightful
  
  GP was a joke I am sure.
  
  As to yours though.. I wouldn't want spaces in my commands. How do you tell where the command ends and the arguments begin?
  
  As for man... man is the MANual. That's not that bad is it? Ok, help might be a little better but it's not a big deal unless you are very closed minded. It's really a history thing. Man wasn't just somebody's idea of a help command. Unix (or Unics as it was called back then) originally actually had a manual. As in dead trees paper! It got big. Real big. One day Dennis Ritchie accidentally dropped a copy and killed his dog. Flattened the poor girl like a pancake. After that he decided it needed to be digital. Man is a digital copy of that original dog killing book plus decades of additions and updates. Thus it is man(ual).
  
  Now should manual have been "manual" or maybe the real whole title "Unix Programmers Manual"? It might be easier to remember. 5 years after you learned that command and you are still typing it 5 times a day would you still appreciate the ease of using real whole English words? Are you that abc? (abreviationally challenged) Or do you just really love typing. Is your r/l name Mavis Beacon?
  
  That's how a lot of Unix commands are, they make plenty of sense with history. I'm sure grep and the others all have their own stories. Well.. not all. How much of a story does it take to come to ls is a lazy way to type list? Oh, yah, you are AbC. Sorry about that.
  
  Yes, the history of decades old programming decisions isn't really something you want to learn to use an OS (or any other software). But what's the alternative? Throw everything out x number of years and start over? It sounds great when you are a hopeless newbie but once you actually learn something do you want to do it all over again every 10 years just to make it easy for the next batch of basement kiddies? Your clock is ticking too you know! Now get off my lwn!!!! (lawn)
  
  P.S. Ok, Ok, I made up the dog part of the story. But it COULD have happened! The rest was real. Actually, I don't KNOW that it didn't happen... hmm....
19. Re:Strange names by Anomalyst · 2011-12-08 10:16 · Score: 2
  
  Don't forget the 'n' prefix to indicate the previous flavour is deprecated.
  
  --
  There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.
20. Re:Strange names by 93+Escort+Wagon · 2011-12-08 12:31 · Score: 3, Funny
  
  Just wait until Microsoft sees your post and we'll have eeegrep.
  No, I expect they'd call it grep#. And when Apple forks their own version, it'll be objective grep.
  
  --
  #DeleteChrome
21. Re:Strange names by rk · 2011-12-08 12:51 · Score: 4, Funny
  
  Unix is user-friendly, it's just picky about who its friends are.
22. Re:Strange names by GrandTeddyBearOfDoom · 2011-12-08 13:25 · Score: 2
  
  I remember when I first installed Linux in 1995, which came on a cover CD with essentially no instructions. I had to reinstall two or three times and watch carefully the list of packages installed to get an idea of what to type. It took me a while to find my way around the /bin and /usr/bin stuff. It took a week, including getting X up and working, and the confirmation dialog in the openlookalike file manager ("do you want to Remove") before I guessed that rm, and not del or era, was the command to delete a file. I guessed man correctly, having seen that the package man was installed and the installer screen indicated that this was the manual. But I was determined to play with this new toy, and few users today will try so hard. What fun.
  
  --
  -- The Grand Teddy Bear has Spoken: "Windows 8 Source Code Available NOW! more disgusting than your pr..."
23. Re:Strange names by jbolden · 2011-12-08 13:30 · Score: 2
  
  People were using terminals that were as slow as 110 baud. No one wanted to type extra characters.
24. Re:Strange names by smellotron · 2011-12-08 18:03 · Score: 2
  
  "cat" was a really lousy choice.
  
  The distinguished artist sees "cat" as an excellent choice—a palette for the creative file-namer, a mad-lib left incomplete!
  At least, that's how I justify log files named dog and crap.
25. Re:Strange names by smellotron · 2011-12-08 18:09 · Score: 2
  
  How much can you improve a 100 line program that does nothing but concatenate streams?
  
  Make it a shell built-in and chide the user if only a single input was used (e.g. cat file | grep blah).
awk? by realyendor · 2011-12-08 06:49 · Score: 2

Done! It's called "awk". Just set the RS and FS fields as appropriate. :P
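
For anyone who hasn't played with RS: the idea is to stop treating a line as the record. Here is a rough sketch of the same trick in Python rather than awk (the function name `grep_records` and the sample data are made up for illustration). It treats blank-line-separated blocks as records and returns whole blocks that match, which is roughly what awk's paragraph mode gives you.

```python
import re

def grep_records(text, pattern, record_sep="\n\n"):
    """Return every record (block) in text that contains a match for pattern."""
    regex = re.compile(pattern)
    # Split on the record separator, much as awk does with RS.
    records = [r for r in text.split(record_sep) if r.strip()]
    return [r for r in records if regex.search(r)]

# Hypothetical config-like input: one block per host.
sample = (
    "host alpha\n  port 22\n\n"
    "host beta\n  port 8080\n\n"
    "host gamma\n  port 22\n"
)

for record in grep_records(sample, r"port 22"):
    print(record.strip())
    print("--")
```

This only works while the records are flat. Once blocks nest inside each other, a fixed separator is no longer enough.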
Interesting... by DangerOnTheRanger · 2011-12-08 06:51 · Score: 3, Interesting

With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.

--
My blog
Re:How's it compare to Meld? by Anonymous Coward · 2011-12-08 07:06 · Score: 3, Insightful

It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page would have been a bit more useful.
Link to one of their papers on these tools by treerex · 2011-12-08 07:07 · Score: 4, Informative

http://www.cs.dartmouth.edu/reports/TR2011-705.pdf
Re:bad, wrong and stupid by interval1066 · 2011-12-08 07:08 · Score: 2

Do we really need to improve on something that works already? A grep that handles binary formats might be nice, but I think I'd rather see this spun off into some kind of new tool or two, like an "extended" grep and diff, maybe. Maybe they're doing that.

--
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
Re:DOE?????? by iced_tea · 2011-12-08 07:10 · Score: 3, Interesting

They have HUGE amounts of data kicking around from various simulations/experiments.

Check out the wikipedia article for supercomputers, and you'll see DOE mentioned.

Tools like this could help with analysis and finding certain data sets. IIRC, regex are already used in DNA sequencing. There is probably a similar application and use for tools like this with their data.
Re:Follow the money...? by Tanktalus · 2011-12-08 07:13 · Score: 5, Insightful

Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).
If there's sufficient configurability, you could ask for the entire block that given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much more valuable.
So, my question is, why aren't more IT-heavy corporations/government departments involved?
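
To make the "entire block" idea concrete, here is a minimal sketch in Python. It is not how the Dartmouth tools work (the thread doesn't say); it is just naive brace counting, and the function name `enclosing_block` is made up. It also assumes braces never appear inside strings or comments, which a real grammar-aware tool would have to handle.

```python
def enclosing_block(source, needle):
    """Return the innermost {...} block of source that contains needle, or None."""
    pos = source.find(needle)
    if pos == -1:
        return None
    # Walk left from the match to find the nearest unmatched opening brace.
    depth = 0
    start = None
    for i in range(pos - 1, -1, -1):
        ch = source[i]
        if ch == "}":
            depth += 1
        elif ch == "{":
            if depth == 0:
                start = i
                break
            depth -= 1
    if start is None:
        return None
    # Walk right from that brace to its matching close.
    depth = 0
    for j in range(start, len(source)):
        if source[j] == "{":
            depth += 1
        elif source[j] == "}":
            depth -= 1
            if depth == 0:
                return source[start:j + 1]
    return None

code = "int f() {\n  if (x) {\n    call_me();\n  }\n  other();\n}\n"
print(enclosing_block(code, "call_me"))
```

Searching for `other` instead returns the whole body of `f`, since that is the innermost block containing it.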
RTFA? by DragonWriter · 2011-12-08 07:19 · Score: 4, Informative

funded in part by Google and the U.S. Energy Department
I wonder what's the interest of these two in this.
FTFA:
Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
Re:Follow the money...? by Doc+Ruby · 2011-12-08 07:22 · Score: 2

Vast amounts of OS SW has been funded by the government. BSD was developed by UC Berkeley, which is largely funded by Pentagon contracts.
And the Internet.
Meanwhile, the vast majority of open source projects never get past the opening statement.
You clearly don't know what it takes to accomplish a project like this one. What have you ever done, that gives you some standing to announce that this Usenix project is a load of crap?

--
--
make install -not war
Ooooh! by gstoddart · 2011-12-08 07:25 · Score: 3, Interesting

As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar.
That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.
Grep and diff that can be made aware of the larger structure of documents potentially have a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.
Sounds really cool.
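
A toy version of what grammar-driven matching could look like, sketched in Python. The grammar here is invented for illustration (items are either `NAME;` or `NAME { items }`), and none of these function names come from the paper. The point is only that a small recursive-descent parser gives you a tree, and a search over the tree can report where in the structure a name occurs.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][\w.-]*|[{};])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_items(tokens, i):
    """items := (NAME ';' | NAME '{' items '}')*  Returns (tree, next index)."""
    items = []
    while i < len(tokens) and tokens[i] != "}":
        name = tokens[i]
        if i + 1 < len(tokens) and tokens[i + 1] == "{":
            children, i = parse_items(tokens, i + 2)  # recurse into the block
            if i >= len(tokens) or tokens[i] != "}":
                raise SyntaxError("missing closing brace")
            items.append((name, children))
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1] == ";":
            items.append((name, None))
            i += 2
        else:
            raise SyntaxError(f"expected '{{' or ';' after {name!r}")
    return items, i

def parse(text):
    tokens = tokenize(text)
    tree, i = parse_items(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("unmatched closing brace")
    return tree

def find(tree, target, path=()):
    """Yield the full path to every item named target, at any nesting depth."""
    for name, children in tree:
        here = path + (name,)
        if name == target:
            yield "/".join(here)
        if children:
            yield from find(children, target, here)

config = "server { name; listen { port; name; } } client { name; }"
for hit in find(parse(config), "name"):
    print(hit)
```

A plain grep for `name` would give three indistinguishable hits; here each one comes back with its position in the hierarchy.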

--
Lost at C:>. Found at C.
Re:How's it compare to Meld? by Compaqt · 2011-12-08 07:26 · Score: 2

Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").
Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click to install is neat and handy (even a lot of old Linux hands don't even know about it).
Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.

--
I'm not a lawyer, but I play one on the Internet. Blog
Microsoft Ad by lucm · 2011-12-08 07:38 · Score: 3, Interesting

I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.

--
lucm, indeed.
They should call it... by goombah99 · 2011-12-08 07:50 · Score: 3, Insightful

perl. Isn't this exactly why perl was invented?

--
Some drink at the fountain of knowledge. Others just gargle.
Re:bad, wrong and stupid by gstoddart · 2011-12-08 07:51 · Score: 2

Do we really need to improve on something that works already?

This would work, but better. No, I'm not being flippant.
If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other stuff that also happen to be similar strings but are unrelated.
I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match these strings in-context would be of value.
Of course, someone is going to need to write a corresponding context-free sed (and maybe awk as well) to go along with the grep. But there's actually a lot of places where this would be a huge improvement in terms of certain kinds of automation.
Use of a context-free grammar also lets this be insensitive to whitespace and newlines, so it would work on "prettified" HTML or stuff that's all formatted haphazardly. This is basically how those things are parsed now ... the grammar rules define the structure, and don't need it to be all perfectly laid out in order to be able to handle it.
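
For the XML case you can already approximate this with a parser and a path. A minimal sketch using Python's standard `xml.etree.ElementTree` (the helper name `grep_path` and the sample documents are made up). It matches text only at `server-name/name` under the root, and gives the same answer whether the file is on one line or pretty-printed.

```python
import xml.etree.ElementTree as ET

def grep_path(xml_text, path, needle):
    """Return the text of elements at path (relative to the root) that contain needle."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for elem in root.findall(path):
        text = (elem.text or "").strip()
        if needle in text:
            hits.append(text)
    return hits

compact = (
    "<config-root><server-name><name>alpha</name></server-name>"
    "<alias><name>alpha-old</name></alias></config-root>"
)
pretty = """
<config-root>
  <server-name>
    <name>
      alpha
    </name>
  </server-name>
  <alias>
    <name>alpha-old</name>
  </alias>
</config-root>
"""

print(grep_path(compact, "server-name/name", "alpha"))
print(grep_path(pretty, "server-name/name", "alpha"))
```

The `alias/name` element also contains "alpha" but is never reported, because it sits at a different path.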

--
Lost at C:>. Found at C.
Re:Follow the money...? by bobaferret · 2011-12-08 07:59 · Score: 2

So weird. I spent the last 6 months writing some Java libraries that do exactly this. There were some similar things out there, but they weren't licensed appropriately for my uses, or were WAY too expensive. Writing a hierarchical diff engine is the most complex thing I've ever done; hell, writing an efficient pure diff engine is insane itself. You have to identify blocks/structure. Then you have to diff the structures, then you have to diff the content in the structures. Once all of that is said and done, you have to find a way to represent the differences using the recognized structures. And from my point of view half the reason was to be able to represent ONLY the changes so that I'd have a nice size savings on a constantly changing tree. You also have to choose a format that allows you to roll back to a previous diff given the initial state or final state. There are also a large number of trade-offs that have to be made, including window size etc. You can't do a diff across a massive amount of data w/o a massive amount of processing power and memory. So you effectively have to diff independent streams against each other that have similar sized sliding windows on each stream. /rant Good stuff though, just funny to read about, and difficult to do.
I don't have an answer to your question, but I wrote my software to deal with IT problems, because diff and grep just weren't good enough, and no one seems to do it for free.
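
The "diff the structures, then diff the content" steps can be sketched in a few lines if you assume the structure has already been parsed into nested dicts with unique keys. That assumption removes most of the hard parts described above (no block detection, no sliding windows, no move detection), so treat this Python as an illustration of the output shape only; the name `tree_diff` and the sample data are made up.

```python
def tree_diff(old, new, path=""):
    """Yield (kind, path, old_value, new_value) for each difference between two nested dicts."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Structure level: compare the sets of keys first.
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}/{key}"
            if key not in new:
                yield ("removed", child, old[key], None)
            elif key not in old:
                yield ("added", child, None, new[key])
            else:
                # Present on both sides: descend and compare the contents.
                yield from tree_diff(old[key], new[key], child)
    elif old != new:
        yield ("changed", path, old, new)

old = {"router": {"hostname": "edge1",
                  "interfaces": {"eth0": "10.0.0.1", "eth1": "10.0.1.1"}}}
new = {"router": {"hostname": "edge1",
                  "interfaces": {"eth0": "10.0.0.2", "eth2": "10.0.2.1"}}}

for change in tree_diff(old, new):
    print(change)
```

Only the changed paths come back, which is the "represent ONLY the changes" property; unchanged subtrees such as `hostname` produce nothing.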
Terrible idea by deblau · 2011-12-08 08:03 · Score: 4, Insightful

This violates so many rules of the Unix philosophy that I don't even know where to begin...
FTFA:

Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.

If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT off the shelf. It's already customizable to whatever data format you're working with.
FTFA:

With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
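
The two views can be put side by side. A small sketch in Python using the standard `difflib` module: one function counts line-level edits, the other compares blank-line-separated blocks as an unordered set. Treating order as irrelevant is exactly the domain-specific assumption in question, so it is only valid for formats where block order really has no meaning. Function names and sample data are made up.

```python
import difflib

def line_changes(old, new):
    """Count added and removed lines the way a line-oriented diff would report them."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

def block_changes(old, new):
    """Compare blank-line-separated blocks as an unordered collection (an assumption)."""
    old_blocks = {b.strip() for b in old.split("\n\n") if b.strip()}
    new_blocks = {b.strip() for b in new.split("\n\n") if b.strip()}
    return sorted(old_blocks - new_blocks), sorted(new_blocks - old_blocks)

# Same two blocks, swapped.
old = "interface eth0\n  mtu 1500\n\ninterface eth1\n  mtu 9000\n"
new = "interface eth1\n  mtu 9000\n\ninterface eth0\n  mtu 1500\n"

print("line-level edits:", line_changes(old, new))
print("block-level removed/added:", block_changes(old, new))
```

The line-level count is nonzero while the block-level comparison reports nothing removed or added. Both are correct answers to different questions.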

--
This post expresses my opinion, not that of my employer. And yes, IAAL.
1. Re:Terrible idea by Tetsujin · 2011-12-08 10:43 · Score: 2
  This violates so many rules of the Unix philosophy that I don't even know where to begin...
  I'll take this on. It's a subject that is of particular interest to me.
  First of all, you have to consider whether it even matters that a tool violates "rules" of the "Unix philosophy". I mean, seriously, why assume that some system design ideas cooked up 30-40 years ago are necessarily the One True Path? Because "those who do not understand Unix are doomed to reinvent it poorly"? What if the designers in question do understand Unix? Or what if <gasp> they might actually have some ideas that surpass those of Doug McIlroy, ESR, K & R, and so on?
  Second, how does one account for tools like Perl? By many accounts it is one of the greatest Unix tools ever created. By combining the functionality and syntax of several useful tools, incorporating a rich regexp syntax, and binding it together with a general-purpose programming language, it can be a very versatile and effective tool. But it runs afoul of various "rules" as well: (I will use a star to mark the rules I don't particularly agree with)
  
  "Write programs that do one thing and do it well" (Doug McIlroy's summary of the philosophy, first clause)
  "Clarity is better than cleverness" (ESR, second rule* - I think there are times when it's worth having a compact notation with a difficult learning curve.)
  "Design programs to be connected to other programs." (ESR, third rule - I would argue that Perl encompasses as much functionality as it can to avoid having to connect to other programs - to avoid outside dependencies, to eliminate the problem of communicating with other processes, and to stabilize and simplify the interface to that functionality.)
  "Design for simplicity: add complexity only where you must" (ESR, fifth rule... Though it could be argued that this is exactly how the design of Perl evolved.)
  "Programmer time is expensive; conserve it in preference to machine time." (ESR 13th rule - Perl runs afoul of this if you accept the idea that Perl code is particularly hard to maintain. A language with a clearer syntax would, presumably, conserve programmer time.)
  "Use shell scripts to increase leverage and portability." (Gancarz, 7th rule. I would argue that Perl scripting exists largely as a way to avoid solving problems in the shell language.)
  Perl's biggest "violation", which it shares with other scripting languages, is that first one: "do one thing and do it well." Perl, Python, etc. are perfectly capable of doing a fork/exec or popen or loading a .so or whatever - but generally if there's a piece of functionality that people want to have in those languages, they re-implement it as a native library for those languages. Why do we accept so blatant a violation of what may be rightly considered the Unix philosophy? Because it works. It's useful. So a better question, then, is why is it that violating such an important "rule" is apparently necessary to create such a useful tool? There are various reasons: First, any reliance on an outside program is a maintenance issue. If your script is written for GNU find, for instance, and you move it to a system that has some other implementation of find, it may not work. Things can change from revision to revision as well. Second, it actually makes it easier to access the functionality, since you don't have to deal with writing out a stream of values and/or reading back a stream of results - when you call a Perl module, everything is neatly packaged into a (usually) synchronous call/result function interface, and presented as native Perl data.
  Perl could be a contentious example - but I chose it because to me, it and other scripting languages are examples of people bypassing the shell environment, rather than augmenting it. I would go so far a
  --
  Bow-ties are cool.
the perl man page by goombah99 · 2011-12-08 08:34 · Score: 2

From the header of 1988 perl man page:
Submitted-by: Larry Wall
Posting-number: Volume 13, Issue 1
Archive-name: perl/part01
[ Perl is kind of designed to make awk and sed semi-obsolete. This posting
will include the first 10 patches after the main source. The following
description is lifted from Larry's manpage. --r$ ]
Perl is a interpreted language optimized for scanning arbitrary text
files, extracting information from those text files, and printing
reports based on that information. It's also a good language for many
system management tasks. The language is intended to be practical
(easy to use, efficient, complete) rather than beautiful (tiny,
elegant, minimal). It combines (in the author's opinion, anyway) some
of the best features of C, sed, awk, and sh, so people familiar with
those languages should have little difficulty with it. (Language
historians will also note some vestiges of csh, Pascal, and even
BASIC-PLUS.) Expression syntax corresponds quite closely to C
expression syntax. If you have a problem that would ordinarily use sed
or awk or sh, but it exceeds their capabilities or must run a little
faster, and you don't want to write the silly thing in C, then perl may
be for you. There are also translators to turn your sed and awk
scripts into perl scripts.

--
Some drink at the fountain of knowledge. Others just gargle.
Structural Regular Expressions by vAltyR · 2011-12-08 08:42 · Score: 2

This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.
Re:Follow the money...? by bobaferret · 2011-12-08 08:45 · Score: 2

LOL, and that, my friend, is the hard part. It cost me $4000 in legal fees to make sure they are not owned by the company I work for, and 6 weeks of work. I'm leaning towards an AGPL/open core model. I just see so many people NOT happy with open core stuff. Also, I didn't get a grant from Google or the D.O.E. And these are just small, yet integral, parts of a larger system that I don't really want to give away yet. Hell, deciding on licensing is harder than coding sometimes. Gotta feed the family you know, while at the same time pay back the OSS world for all of the great stuff that I use every day for free. How to do both is a hard ethical question. It's easy to say just consult, or write a book. It's much harder to actually _do_ these things. Hell, it's hard enough just to open up your code to the world's criticisms. The only thing I know at this point is that it's not doing me or anyone else any good just sitting on it.
Re:Follow the money...? by bobaferret · 2011-12-08 09:41 · Score: 3, Interesting

I wouldn't call it a cancer. But it's definitely useful if you don't ever want commercial companies to use your code in public. It matches up well with the open core model. Commercial people will only use it if you can give them a differently licensed copy of the code. Apache, MIT, and BSD are great if you truly want to give your code away and don't care what people do with it behind closed doors. AGPL is nice to make sure people always give back. LGPL and GPL are nice if you only want them to give back if they change it. Should people pay, and how much, is an age-old question. I have to balance the cost of support and development vs. the cost of the product. The more I lean on the community the less I can charge and the more exposure I get. While in the other direction I get more money, but have to spend more of it. And there is no one size fits all solution to any of this.
Perl by wdef · 2011-12-08 09:55 · Score: 2

Perl can context grep any ****ing thing any which way from Sunday. Much easier and more powerful than awk.
Re:Object grep by jd · 2011-12-08 10:12 · Score: 2

XML is ok, but there are many data formats that could really use a diff/grep utility that could make sense of them. HDF5 and NetCDF are nice in the scientific community, for example. Computer graphics geeks might find intelligent diff/grep tools for the Renderman format to be useful. Office users might want to know if two documents are genuinely different or were compressed differently. Hell, it would be incredibly useful if they could diff a MS Office file and LibreOffice file in their native formats to see if they were logically the same even if syntactically represented differently.
I'm sure that's the kind of thinking on the Google side. If you can equate two files (even if they aren't absolutely identical when in file form) and search in a file-format-independent way, then you can eliminate duplicate indexing and boost searching. An obvious place for that would be Google Docs, where the internal format used for a file isn't necessarily the format used by you on your machine.
A truly universal tool's only relationship to XML would be to use XML to define how different file formats worked. This would mean you could have a dictionary of file formats and object representations, without having to link to a billion libraries or having to stuff the utilities full of different kinds of parser. You'd have a single parser that would really be no different from the modern diff and grep, plus a layer in front that used the file format descriptions to convert the inputs into a usable representation.
If you wanted to stick to the Unix convention, and make this capability universal to all tools, you'd have a single file decompiler utility that used the dictionary to turn any stdin/file input into a standardized output and a single file compiler utility that could take the output of something like diff or grep and convert the representation back into a format that's meaningful with respect to the original file format. Hey presto, any problems Google or the DoE have are solved without having to alter any specific tool or create any compatibility issue.
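
A scaled-down illustration of the "decompile to a standard representation, then use the ordinary tool" pipeline, in Python. JSON stands in for the arbitrary file format here, and `json.dumps` with sorted keys stands in for the hypothetical decompiler; there is no format dictionary in this sketch, and the function names are made up. The diff step itself is the stock line-based `difflib`.

```python
import difflib
import json

def canonical_lines(json_text):
    """'Decompile' step: decode the input and re-emit it in one fixed layout."""
    return json.dumps(json.loads(json_text), sort_keys=True, indent=2).splitlines()

def logical_diff(a_text, b_text):
    """Run an ordinary line diff over the canonical forms instead of the raw bytes."""
    return list(difflib.unified_diff(
        canonical_lines(a_text), canonical_lines(b_text), "a", "b", lineterm=""))

a = '{"name": "report", "pages": 3}'
b = '{\n  "pages": 3,\n      "name": "report"\n}'   # same content, different layout
c = '{"name": "report", "pages": 4}'                # one real change

print(logical_diff(a, b))
for line in logical_diff(a, c):
    print(line)
```

The first comparison comes back empty even though the raw files differ byte for byte; the second shows only the changed `pages` value.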

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)