Researchers Expanding Diff, Grep Unix Tools
itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."
Putting space characters in the name of a Unix command line tool is asking for trouble.
A nice GUI diff for Linux. (Has 3-way).
I'm not a lawyer, but I play one on the Internet. Blog
Done! It's called "awk". Just set the RS and FS variables as appropriate. :P
What's the relevance of this work to DOE? Shouldn't DOD be the funding agency? Or does DOE simply have more money than they know what to do with?
I wonder what's the interest of these two in this.
-dZ.
Carol vs. Ghost
With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.
My blog
I would have wished for a download link ..
Hey don't blame me, IANAB
http://www.cs.dartmouth.edu/reports/TR2011-705.pdf
Do we really need to improve on something that works already? A grep that handles binary formats might be nice, but I think I'd rather see this spun off into some kind of new tool or two, like an "extended" grep and diff, maybe. Maybe they're doing that.
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
The grep is "in design process" and the diff is "not released yet". There should also be plenty of alternative tools to these two, some of which pursue the same goal (e.g. mailgrep). I'm all for improving these two venerable tools, but the announcement looks a bit off in timing or scale.
It's a new program. They're not replacing grep. They're not going to break into your house and apt-get remove grep. If the data you need to grep is broken into lines, keep using grep. If you'd rather manually sort through data that's not broken neatly into lines, feel free to do that. Personally, this has the potential to be a huge help for me, though it depends a lot on what's required to make the necessary library for a given data type.
There used to be a utility, sgrep, for searching SGML/XML.
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
I'd like a grep tool that could scan XML data for instances of objects (according to some XSD or DTD), and take object state values as arguments to search objects for.
If it could scan objects in memory I'd love that better, but XML seems the only likely candidate for a format that a universal tool would parse.
--
make install -not war
FTFA:
Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar.
That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.
Grep and diff that can be made aware of the larger structure of documents potentially have a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.
Sounds really cool.
Lost at C:>. Found at C.
I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.
lucm, indeed.
Many years ago, I abused the capabilities of Flex (the fast lexical analyzer generator) to instrument students' C++ code. I was actually adding reference-counting code to check for leaks (part of that assignment's grading rubric). I just had to parse the code into a nested tree of { } bracing and adjoining text, and then pattern-match on that tree to find class and method definition boundaries, where I inserted code. I think it only broke on one out of about fifty submissions, where I had to intervene and instrument the code by hand instead.
Something like this could be done to handle XML since it is essentially little regular languages embedded in a well-formed tree of angle brackets and quoted strings. But I wouldn't bother with this, since XSLT and XPath exist for your problem...
I haven't read the original paper for this slashdot discussion, but the idea of a grep-like and sed-like tool that could use context-free grammars rather than regular expressions is very interesting to me. The hard part will be making it concise enough to use from the command-line rather than an edit/compile sort of parser-generator experience.
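For what it's worth, the "nested tree of { } bracing and adjoining text" idea above can be sketched in a few lines of Python. This is only an illustration, not the original Flex code: the names parse_braces and max_depth are made up here, and the sketch assumes braces never appear inside string literals or comments, which real C++ code would violate.

```python
def parse_braces(text):
    """Parse text into a nested list: str for plain text, list for a { } block."""
    root = []
    stack = [root]          # stack[-1] is the block currently being filled
    buf = []                # characters of the current run of plain text

    def flush():
        if buf:
            stack[-1].append("".join(buf))
            buf.clear()

    for ch in text:
        if ch == "{":
            flush()
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif ch == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            flush()
            stack.pop()
        else:
            buf.append(ch)
    if len(stack) != 1:
        raise ValueError("unbalanced '{'")
    flush()
    return root


def max_depth(tree):
    """Nesting depth of the deepest block; 0 if there are no braces at all."""
    return max((1 + max_depth(n) for n in tree if isinstance(n, list)), default=0)


tree = parse_braces("a{b{c}d}e")
print(tree)             # nested lists mirror the brace structure
print(max_depth(tree))
```

Pattern matching on the resulting tree (e.g. "a text node ending in a class name, followed by a block") is then ordinary list processing, which is the part a regular expression alone cannot do for arbitrary depth.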
Are you sure?
They can probably do it remotely on most OSes anyway. Quick - make friends with Theo.
Sent from my ASR33 using ASCII
perl. Isn't this exactly why perl was invented?
Some drink at the fountain of knowledge. Others just gargle.
This would work, but better. No, I'm not being flippant.
If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other stuff that also happens to be similar strings but is unrelated.
I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match these strings in-context would be of value.
Of course, someone is going to need to write a corresponding context-free sed (and maybe awk as well) to go along with the grep. But there's actually a lot of places where this would be a huge improvement in terms of certain kinds of automation.
Use of a context-free grammar also lets this be insensitive to whitespace and newlines, so it would work on "prettified" HTML or stuff that's all formatted haphazardly. This is basically how those things are parsed now ... the grammar rules define the structure, and don't need it to be all perfectly laid out in order to be able to handle it.
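A rough Python sketch of the config-root:server-name:name idea, using the standard library's XML parser. The function name grep_path, the colon-separated path syntax, and the sample document are assumptions made up for illustration; this is not how the Dartmouth tools work as far as I know.

```python
import re
import xml.etree.ElementTree as ET


def grep_path(xml_text, path, pattern):
    """Return the stripped text of elements at `path` whose text matches `pattern`.

    `path` is a colon-separated list of tag names starting at the root element.
    """
    root = ET.fromstring(xml_text)
    tags = path.split(":")
    if root.tag != tags[0]:
        return []
    nodes = [root]
    for tag in tags[1:]:
        # Descend one level, keeping only children with the wanted tag.
        nodes = [child for node in nodes for child in node if child.tag == tag]
    rx = re.compile(pattern)
    return [n.text.strip() for n in nodes if n.text and rx.search(n.text)]


SAMPLE = (
    "<config-root>\n"
    "  <server-name>\n"
    "    <name>\n"
    "        alpha01\n"
    "    </name>\n"
    "  </server-name>\n"
    "  <comment><name>alpha-legacy</name></comment>\n"
    "</config-root>\n"
)

print(grep_path(SAMPLE, "config-root:server-name:name", "alpha"))
```

The "alpha-legacy" string under comment is not reported, and the odd line breaks inside name don't matter, because the match is made on the parsed structure rather than on lines.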
Lost at C:>. Found at C.
There's a version-control system called darcs (written by the son of a colleague of mine) that incorporates some interesting ideas along these lines. For example, say you have a program with 100,000 lines of code, and there's a function in it called Foo, which is called thousands and thousands of times. You want to change the name of the function to Bar. In a traditional diff-based system, this results in thousands of differences. Darcs is supposed to be able to handle changes like this and recognize that it's only *one* change. It's also supposed to be able to handle the case where programmer A makes this change and checks it in, and then programmer B, who has simultaneously been doing lots of other work on the code, checks in his own changes -- with the old name for the function.
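To make the Foo-to-Bar example concrete, here is a toy Python sketch of treating a rename as one change instead of thousands of line edits. It is not darcs's actual algorithm or patch format; apply_rename and is_single_rename are hypothetical names, and "identifier" is approximated with regex word boundaries.

```python
import re


def apply_rename(source, old_name, new_name):
    """Replace every whole-word occurrence of old_name with new_name."""
    pattern = r"\b" + re.escape(old_name) + r"\b"
    # A function is used as the replacement so new_name is inserted literally.
    return re.sub(pattern, lambda m: new_name, source)


def is_single_rename(before, after, old_name, new_name):
    """True if `after` is exactly `before` with that one rename applied."""
    return apply_rename(before, old_name, new_name) == after


before = "Foo(1); x = Foo(2) + Foobar();"
after = "Bar(1); x = Bar(2) + Foobar();"
print(is_single_rename(before, after, "Foo", "Bar"))

# Programmer B's work still uses the old name; replaying the rename fixes it up.
b_work = "y = Foo(3);"
print(apply_rename(b_work, "Foo", "Bar"))
```

Storing the change as the pair ("Foo", "Bar") rather than as a line diff is what lets it be replayed on top of B's later work.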
Find free books.
People have been trying to adapt line-oriented regular expressions to handle other sorts of data since at least the 1980s. Structural regular expressions were introduced with the Plan 9 system, but never seem to have caught on elsewhere.
It certainly would be nice to have tools that readily handle multi-line data, rather than forcing everything to fit into a line oriented format. It would be wonderful to be able to fix up indentation in version controlled files without making the history unreadable, for example.
PCRE has recursive patterns (available as pcregrep) and .NET has balancing groups, also allowing grep-like operations involving context-free grammars. For XML data, there are various XML query languages that allow wonderfully complex queries over XML structures. There are also refactoring tools that allow syntax-aware searches across source files.
For diff, the situation is a bit more complicated. There are XML-based diff tools, programming language syntax aware diff tools, and complex edit distance based diff tools already. It seems difficult to come up with something more generic. Let's say you want to diff programming language source files in languages for which there are no diff tools. What good is a context free diff tool going to be? You'd need to specify the entire grammar for the language.
I think the most useful way these people could spend their time and money would be to port PCRE-style recursive patterns and .NET like balancing groups to more UNIX regular expression libraries (foremost, Python).
This violates so many rules of the Unix philosophy that I don't even know where to begin...
FTFA:
Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.
If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT off the shelf. It's already customizable to whatever data format you're working with.
FTFA:
With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
This post expresses my opinion, not that of my employer. And yes, IAAL.
This man has a point -- these government-sponsored dumbification programs have obviously already worked on him. You could be next.
Why do we need to write another perl?
From the header of 1988 perl man page:
Submitted-by: Larry Wall
Posting-number: Volume 13, Issue 1
Archive-name: perl/part01
[ Perl is kind of designed to make awk and sed semi-obsolete. This posting
will include the first 10 patches after the main source. The following
description is lifted from Larry's manpage. --r$ ]
Perl is a interpreted language optimized for scanning arbitrary text
files, extracting information from those text files, and printing
reports based on that information. It's also a good language for many
system management tasks. The language is intended to be practical
(easy to use, efficient, complete) rather than beautiful (tiny,
elegant, minimal). It combines (in the author's opinion, anyway) some
of the best features of C, sed, awk, and sh, so people familiar with
those languages should have little difficulty with it. (Language
historians will also note some vestiges of csh, Pascal, and even
BASIC-PLUS.) Expression syntax corresponds quite closely to C
expression syntax. If you have a problem that would ordinarily use sed
or awk or sh, but it exceeds their capabilities or must run a little
faster, and you don't want to write the silly thing in C, then perl may
be for you. There are also translators to turn your sed and awk
scripts into perl scripts.
Some drink at the fountain of knowledge. Others just gargle.
This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.
They should call it... perl. Isn't this exactly why perl was invented?
Perl could do this - with the right libraries. But that's the real value they're adding here. They created tools that operate on files with knowledge of the structure of those files. So for instance a "diff" between two XML files with identical contents but differences in formatting could report that the files are identical... Or if you had some file structure that defined a directed-graph structure, a format meant to be edited in-place (and which therefore might sometimes have holes in it where data was removed - or which might have data presented in a different order depending on the sequence of operations used to store it) - the "diff" tool would decode the files, examining the data structure they're meant to represent - and show the differences in that.
Obviously it could be done in Perl - but it wouldn't be a one-liner unless you had those libraries which translate the particular file format into the desired level of abstraction.
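As a small illustration of the "identical contents, different formatting" case, here is a Python sketch that compares two XML documents by structure instead of by line. The helper names normalize and same_xml are invented for this example, and it makes simplifying assumptions: leading and trailing whitespace in element text is treated as insignificant, and text that follows a closing tag (tail text) is ignored entirely.

```python
import xml.etree.ElementTree as ET


def normalize(elem):
    """Reduce an element to a nested tuple that ignores layout differences."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),   # attribute order is not significant
        (elem.text or "").strip(),            # assumption: surrounding whitespace is layout
        tuple(normalize(child) for child in elem),
    )


def same_xml(a, b):
    """True if the two XML strings describe the same tree under normalize()."""
    return normalize(ET.fromstring(a)) == normalize(ET.fromstring(b))


compact = '<a x="1" y="2"><b>hi</b></a>'
pretty = '<a y="2"  x="1">\n  <b> hi </b>\n</a>'
print(same_xml(compact, pretty))
print(same_xml(compact, '<a x="1" y="2"><b>bye</b></a>'))
```

A line-based diff would flag every line of these two files; the structural comparison only reacts when the content actually changes.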
Bow-ties are cool.
I know I'll be modded down
Dude, the only part of your post that I find objectionable is this assumption that you're going to be crucified for posting your thoughts. I know that there are some people on Slashdot who are pretty predictably triggered to shout down certain opinions - just don't assume that everyone here is like that, OK?
I think there's a lot to like about Powershell, and part of me will always be a bit jealous that Windows got a shell with those kinds of capabilities before Linux did. It does indeed seem that what they describe bgrep and bdiff doing could be accomplished in Powershell. I've never been too clear on some of the particulars of how that would be done, though. As I understand it, you can search/filter either XML data streams, or a sequence of .NET objects. Would the way to accomplish this in .NET, then, be to have a cmdlet that opens the source file and passes its contents through as .NET objects? It would be a bit less compact than having the special type handling right in the "find" or "filter" command but it does lend a certain clarity to things, too...
Bow-ties are cool.
So what? Maybe people want a non-proprietary solution that works on more than one OS.
If there are such people, and it's not just me, I'd love to oblige them. :) I really need to get crackin'...
Bow-ties are cool.
Why do we need to write another perl?
Is it really "writing another perl"? The meat of these tools (which, I think, aren't yet implemented?) is that they filter and compare parsed data structures - and provide plug-in hooks so people can insert parsers for additional data types. Certainly this could be done as a Perl library - and doing so might have some advantages over creating new tools with their own plug-in mechanism. But implementing bgrep and bdiff is nowhere close to "writing another perl".
Bow-ties are cool.
I'm also working on a text processing tool that deals with blocks of data, and it is already here:
http://www.nongnu.org/txr
Suppose you signal the nesting level by indentation, as most programmers today do.
If you add a condition around some code, then for example 3 lines might be indented, resulting in 5 lines being altered instead of the 2 which actually have changed.
For this, the proposed improved grep and diff might be good, or at least better than the current state of diff. Okay, maybe I'm ignoring the -b flag, but the -b flag might be a problem if you code in Whitespace or so ;-)
http://en.wikipedia.org/wiki/Whitespace_(programming_language)
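The 5-versus-2 count above is easy to reproduce with Python's difflib. This is just a sketch of the effect: changed_lines is a made-up helper, and stripping leading whitespace with lstrip() is an assumption chosen for the demo, not a claim about what any diff flag does exactly.

```python
import difflib


def changed_lines(old, new, ignore_indent=False):
    """Count lines that fall outside the matching blocks of a line diff."""
    a = old.splitlines()
    b = new.splitlines()
    if ignore_indent:
        # Assumption for this sketch: leading whitespace carries no meaning.
        a = [line.lstrip() for line in a]
        b = [line.lstrip() for line in b]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


OLD = "a();\nb();\nc();\n"
NEW = "if (ok) {\n    a();\n    b();\n    c();\n}\n"

print(changed_lines(OLD, NEW))                      # every line looks different
print(changed_lines(OLD, NEW, ignore_indent=True))  # only the two brace lines
```

Of course, this only works because indentation is redundant with the braces here; in an indentation-sensitive language the stripped comparison would hide real changes.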
The appropriate way to deal with this would be to convert all program code to an intermediate language, including comments and, if available, assertions, and to check only this code into the versioning system.
On check out, the code would be transformed according to either an agreed-on formatting or even a different formatting for everyone.
tags:programming languages versioning systems patents prior art
Hey don't blame me, IANAB
Right, but what is there not already a parser for in CPAN? And if you are handy with perl, what kind of comparison is difficult?
Perl can context grep any ****ing thing any which way from Sunday. Much easier and more powerful than awk.
Just in time for RedHat's move to binary logging...
Right, but what is there not already a parser for in CPAN? And if you are handy with perl, what kind of comparison is difficult?
I couldn't say, honestly. :) So write a Perl script that recognizes the input file types, chooses the correct module, implements some kind of matching rule syntax, and performs the comparison with whatever module you chose in step 2, and a plugin system so people can add more file types without modifying your script, and yes, you've pretty much got bgrep.
I think at that point you're beyond "Perl's capabilities" and into the realm of "capabilities of things you can implement in Perl".
Bow-ties are cool.
So you would rather convert your data into XML instead of having a tool do it for you? That's pretty much the point of this, having a tool to do the work for you. Maybe it will even work by converting it to XML and using XSLT. But the data definitions will help everyone who uses it instead of everyone rolling their own.
FTFA
I have a binary file problem right now, I built a parser to convert it to text, and I can see the differences easily that way. Many files are exact duplicates in a different format (like a JPEG saved as a GIF and BMP), and many others are only slightly different (think French or Spanish converted to the English alphabet where the accents are lost). Weeding through the files is a lot easier, and if I could define the format and have the tool do it for me I would not have had to "roll my own"
If your format definition says ordering is important, like for a programming language, that would be 10 edits. I can think of piles of examples, po translation files would be one, where the order doesn't matter. If someone sorts the list to be able to compare if one file is missing phrases, I don't care, I only want to see what's new and different. The format definition would say that order is not important.
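Here is a minimal Python sketch of an order-insensitive comparison along those lines. It assumes the translation file has already been parsed into (phrase, translation) pairs; it does not read real po files, and unordered_diff is a hypothetical name.

```python
def unordered_diff(old_entries, new_entries):
    """Compare two lists of (key, value) pairs while ignoring their order.

    Returns three sorted lists of keys: added, removed, changed.
    """
    old = dict(old_entries)
    new = dict(new_entries)
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


old = [("Open", "Ouvrir"), ("Save", "Enregistrer"), ("Quit", "Quitter")]
# Same file after someone sorted it, fixed one phrase and added another.
new = [("Close", "Fermer"), ("Open", "Ouvrir"), ("Quit", "Sortir"), ("Save", "Enregistrer")]

added, removed, changed = unordered_diff(old, new)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

For a format where order does matter, like program source, the same entries would have to be compared as a sequence instead, which is exactly the choice the format definition would make.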
the one letter per word algorithm, e.g First Unix Command Creator, and its obviously improved successor bj.
@ http://www.usenix.org/publications/multimedia/ and http://www.youtube.com/user/USENIXAssociation?feature=watch
Recipes for USA bankrupt - http://tinypaste.com/0d66f dd = dollar deluge (printed in the infinity)
...And when Apple forks their own version, it'll be objective grep.
iGrep.
There is an already existing option in the AIX version of grep.
-p[Separator] Displays the entire paragraph containing matched lines. Paragraphs are delimited by paragraph separators, as specified by the Separator parameter, which are patterns in the same form as the search pattern. Lines containing the paragraph separators are used only as separators; they are never included in the output. The default paragraph separator is a blank line.
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds2/grep.htm
By default, -p separates the stanzas and outputs them separated by lines made of a single repeated character. (Very useful.) I miss that option in Linux and other Unix flavors.
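For those on systems without it, the paragraph behaviour is simple to approximate in Python. This sketch only covers the default case described above (paragraphs separated by blank lines); grep_paragraphs is a made-up name, and it makes no attempt to reproduce the AIX output format or the custom Separator parameter.

```python
import re


def grep_paragraphs(text, pattern):
    """Return every blank-line-delimited paragraph that contains a match."""
    rx = re.compile(pattern)
    # A separator is a newline, optional whitespace, and another newline.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p for p in paragraphs if rx.search(p)]


STANZAS = (
    "host a\n"
    "  ip 10.0.0.1\n"
    "\n"
    "host b\n"
    "  ip 10.0.0.2\n"
)

for stanza in grep_paragraphs(STANZAS, r"10\.0\.0\.2"):
    print(stanza)
    print("-" * 20)
```

The whole "host b" stanza comes back, not just the matching ip line, which is the point of the option.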
VladTepes
For anyone actually looking for the poster information, it can be found here: http://www.cs.dartmouth.edu/~gweave01/grepDiff/index.html