Researchers Expanding Diff, Grep Unix Tools
itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."
Space characters in the name of a Unix command line tool is asking for trouble.
http://www.cs.dartmouth.edu/reports/TR2011-705.pdf
Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).
If there's sufficient configurability, you could ask for the entire block that given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much valuable.
So, my question is, why aren't more IT-heavy corporations/government departments not involved?
FTFA:
Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
This violates so many rules of the Unix philosophy that I don't even know where to begin...
FTFA:
Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.
If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT off the shelf. It's already customizable to whatever data format you're working with.
FTFA:
With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where dif reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
This post expresses my opinion, not that of my employer. And yes, IAAL.