Researchers Expanding Diff, Grep Unix Tools
itwbennett writes "At the Usenix Large Installation System Administration (LISA) conference being held this week in Boston, two Dartmouth computer scientists presented variants of the grep and diff Unix command line utilities that can handle more complex types of data. The new programs, called Context-Free Grep and Hierarchical Diff, will provide the ability to parse blocks of data rather than single lines. The research has been funded in part by Google and the U.S. Energy Department."
Putting space characters in the name of a Unix command line tool is asking for trouble.
Done! It's called "awk". Just set the RS and FS variables as appropriate. :P
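For anyone who hasn't used that awk trick: it boils down to splitting the input on a record separator and matching whole records instead of lines. A rough Python sketch of the same idea, where grep_records and the sample data are made up for illustration and are not part of awk or of the Dartmouth tools:

```python
import re

def grep_records(text, pattern, rs="\n\n"):
    """Return every record (text split on the separator rs) containing a regex match."""
    regex = re.compile(pattern)
    records = [r for r in text.split(rs) if r.strip()]
    return [r for r in records if regex.search(r)]

# Hypothetical blank-line-separated config, similar in spirit to awk with RS="".
SAMPLE = (
    "host alpha\n"
    "  port 22\n"
    "\n"
    "host beta\n"
    "  port 8080\n"
)

if __name__ == "__main__":
    # Prints the whole "host beta" record, not just the matching line.
    for record in grep_records(SAMPLE, r"port 8080"):
        print(record)
```

This only works when a fixed separator marks the block boundaries; it does nothing for nesting.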
With these tools, you could make grep and diff work with binary files in a meaningful way - very useful at times. I bet you could even adapt the "Context-Free Grep" into a sort of packet sniffer with enough work. I'd sure like to try these new programs sometime.
It is surprising that Slashdot even let you post a deb: url, as the filter usually seems to destroy most non-http(s) links. However, not everyone uses a Debian-based distro, and not everyone tries some random package (even from the repository) before reading a little about it, so posting the home page would have been a bit more useful.
http://www.cs.dartmouth.edu/reports/TR2011-705.pdf
Do we really need to improve on something that works already? A grep that handles binary formats might be nice, but I think I'd rather see this spun off into some kind of new tool or two, like an "extended" grep and diff, maybe. Maybe they're doing that.
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
They have HUGE amounts of data kicking around from various simulations/experiments.
Check out the wikipedia article for supercomputers, and you'll see DOE mentioned.
Tools like this could help with analysis and finding certain data sets. IIRC, regex are already used in DNA sequencing. There is probably a similar application and use for tools like this with their data.
Context-free grep/diff can be used to search for data/changes in arbitrary non-line-record-based files. Such as XML, HTML, JSON, SQLite databases, other databases, Apache configs, and many other pieces of data. Heck, even most programming languages are not line-based, but statement terminated/separated. Imagine being able to grep for a function name, and getting its entire prototype/usage even when it spans multiple lines (very common in standard glibc headers). And, depending on the plugin's capabilities, you could grep for a function name as a function name and not get back any usage of that text as a variable or embedded in a string, or a comment (skip commented-out calls!).
If there's sufficient configurability, you could ask for the entire block that given text is in, and such a grep would be able to display everything in the corresponding {...}. Makes grep that much more valuable.
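To make the "entire block" idea concrete, here is a minimal sketch that returns the innermost {...} around a match. It assumes plain brace counting is good enough, so it will be fooled by braces inside strings or comments; a real tool would need a grammar for the language. The function name enclosing_block and the sample source are hypothetical.

```python
def enclosing_block(text, needle):
    """Return the innermost {...} block containing the first occurrence of
    needle, or None if needle is absent or not inside any braces."""
    pos = text.find(needle)
    if pos == -1:
        return None
    # Walk left to find the unmatched '{' that opens the enclosing block.
    depth = 0
    start = None
    for i in range(pos - 1, -1, -1):
        if text[i] == "}":
            depth += 1
        elif text[i] == "{":
            if depth == 0:
                start = i
                break
            depth -= 1
    if start is None:
        return None
    # Walk right from that brace to its matching '}'.
    depth = 0
    for j in range(start, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return text[start:j + 1]
    return None

# Hypothetical C-like input.
SOURCE = "int f() { if (x) { y(); } return 0; }\nint g() { z(); }\n"

if __name__ == "__main__":
    print(enclosing_block(SOURCE, "y()"))
    print(enclosing_block(SOURCE, "return"))
```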
So, my question is, why aren't more IT-heavy corporations/government departments involved?
FTFA:
Google's interest in this technology springs from the company's efforts in cloud computing, where it must automate operations across a wide range of networking gear, Weaver said. The DOE foresees that this sort of software could play a vital role in smart grids, in which millions of energy consuming end-devices would have connectivity of some sort. The software would help "make sense of all the log files and the configurations of the power control networks," Weaver said.
Vast amounts of open source software have been funded by the government. BSD was developed by UC Berkeley, which is largely funded by Pentagon contracts.
And the Internet.
Meanwhile, the vast majority of open source projects never get past the opening statement.
You clearly don't know what it takes to accomplish a project like this one. What have you ever done, that gives you some standing to announce that this Usenix project is a load of crap?
--
make install -not war
As soon as I see "Context-Free Grep", I immediately think of a Context Free Grammar.
That basically implies we can have much more sophisticated rules that match other structural elements the way a language compiler does. Which means that in theory you could do greps that take into account structures a little more complex than just a flat file.
A grep and diff that can be made aware of the larger structure of documents potentially have a lot of uses. Anybody who has had to maintain structured config files across multiple environments has likely wished for this before.
Sounds really cool.
Lost at C:>. Found at C.
Yeah, I usually post a disclaimer ("for Debian/Ubuntu/Mint" -- now "Debian/Mint/Ubuntu").
Second, yes, /. does allow that, and I hope they continue to do so, because deb:// and click to install is neat and handy (even a lot of old Linux hands don't even know about it).
Finally, (as you mentioned) it's not a link to download software, but rather install software from the repositories, so there's that level of security.
I'm not a lawyer, but I play one on the Internet.
I know I'll be modded down, but I have to say it: what they describe is already available in Powershell, where objects can be piped in search/filter functions.
lucm, indeed.
perl. Isn't this exactly why perl was invented?
Some drink at the fountain of knowledge. Others just gargle.
This would work, but better. No, I'm not being flippant.
If you have structured data (say XML), you could target hierarchies like config-root:server-name:name. That way, if the text inside "name" is only being looked for in that one field, you won't hit a bunch of other stuff that also happens to contain similar strings but is unrelated.
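A quick sketch of what that path-targeted search could look like, using Python's standard xml.etree.ElementTree. The config document, the helper name grep_path and the slash-separated path syntax are assumptions for illustration; nothing here reflects how the Dartmouth tools actually address nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical config where the same string shows up in two unrelated fields.
CONFIG = """
<config-root>
  <server-name>
    <name>alpha</name>
  </server-name>
  <backup>
    <name>alpha-mirror</name>
  </backup>
</config-root>
""".strip()

def grep_path(xml_text, path, needle):
    """Return the text of elements at `path` (relative to the root) containing needle."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path) if el.text and needle in el.text]

if __name__ == "__main__":
    # Only the name under server-name is considered.
    print(grep_path(CONFIG, "server-name/name", "alpha"))
    # Searching every <name> anywhere behaves more like a plain grep.
    print(grep_path(CONFIG, ".//name", "alpha"))
```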
I'm sure you'd still have your regular grep/diff utilities, but there's definitely places where being able to match these strings in-context would be of value.
Of course, someone is going to need to write a corresponding context-free sed (and maybe awk as well) to go along with the grep. But there's actually a lot of places where this would be a huge improvement in terms of certain kinds of automation.
Use of a context-free grammar also lets this be insensitive to whitespace and newlines, so it would work on "prettified" HTML or stuff that's all formatted haphazardly. This is basically how those things are parsed now ... the grammar rules define the structure, and don't need it to be all perfectly laid out in order to be able to handle it.
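Here is a small illustration of the whitespace point, assuming the markup is well-formed enough to go through an XML parser (real-world HTML often is not). The helper names shape and same_structure are made up, and the sketch ignores text that follows a closing tag.

```python
import xml.etree.ElementTree as ET

def shape(element):
    """Reduce an element to (tag, attributes, stripped text, child shapes).
    Tail text after a closing tag is ignored in this sketch."""
    text = (element.text or "").strip()
    return (element.tag, sorted(element.attrib.items()), text,
            [shape(child) for child in element])

def same_structure(doc_a, doc_b):
    """True if both documents parse to the same tree, ignoring layout whitespace."""
    return shape(ET.fromstring(doc_a)) == shape(ET.fromstring(doc_b))

COMPACT = "<ul><li>one</li><li>two</li></ul>"
PRETTY = """<ul>
  <li>one</li>
  <li>
    two
  </li>
</ul>"""

if __name__ == "__main__":
    print(same_structure(COMPACT, PRETTY))
```

A line-based diff would flag almost every line of these two inputs, while the parsed trees compare equal.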
Lost at C:>. Found at C.
So weird. I spent the last 6 months writing some Java libraries that do exactly this. There were some similar things out there, but they weren't licensed appropriately for my uses, or were WAY too expensive. Writing a hierarchical diff engine is the most complex thing I've ever done; hell, writing an efficient pure diff engine is insane in itself. You have to identify blocks/structure, then you have to diff the structures, then you have to diff the content in the structures. Once all of that is said and done, then you have to find a way to represent the differences using the recognized structures. And from my point of view half the reason was to be able to represent ONLY the changes so that I'd have a nice size savings, on a constantly changing tree. You also have to choose a format that allows you to roll back to a previous diff given the initial state or final state. There are also a large number of trade-offs that have to be made, including window size etc. You can't do a diff across a massive amount of data w/o a massive amount of processing power and memory. So you effectively have to diff independent streams against each other that have similar sized sliding windows on each stream. /rant Good stuff though, just funny to read about, and difficult to do.
I don't have an answer to your question, but I wrote my software to deal with IT problems, because diff and grep just weren't good enough, and no one seems to do it for free.
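For readers who want a feel for the steps described here (diff the structure first, then the content inside it, then report differences in terms of that structure), a toy version over already-parsed nested dicts might look like the sketch below. It is nowhere near a real engine: no sliding windows, no move detection, and it assumes string keys. The name tree_diff and the sample data are hypothetical.

```python
def tree_diff(old, new, path=()):
    """Yield (path, kind, old_value, new_value) for each difference
    between two nested dicts, recursing into matching sub-dicts."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = path + (key,)
            if key not in new:
                yield (child, "removed", old[key], None)
            elif key not in old:
                yield (child, "added", None, new[key])
            else:
                yield from tree_diff(old[key], new[key], child)
    elif old != new:
        yield (path, "changed", old, new)

OLD = {"server": {"name": "alpha", "port": 22}, "debug": True}
NEW = {"server": {"name": "alpha", "port": 2222, "tls": True}}

if __name__ == "__main__":
    for change in tree_diff(OLD, NEW):
        print(change)
```

Recording both the old and the new value in each entry is one way to let the same change list be replayed forward or backward, at the cost of a larger diff.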
This violates so many rules of the Unix philosophy that I don't even know where to begin...
FTFA:
Grep has issues with data blocks as well. "With regular expressions, you don't really have the ability to extract things that are nested arbitrarily deep," Weaver said.
If your data structures are so complex that diff/grep won't cut it, they should probably be massaged into XML, in which case you can use XSLT off the shelf. It's already customizable to whatever data format you're working with.
FTFA:
With [operational data in block-like data structures], a tool such as diff "can be too low-level," Weaver said. "Diff doesn't really pay attention to the structure of the language you are trying to tell differences between." He has seen cases where diff reports that 10 changes have been made to a file, when in fact only two changes have been made, and the remaining data has simply been shifted around.
No, 10 changes have been made. The fact that only two substantive changes have been made based on 10 edits is a subjective determination. That is, unless you want to detect that moving a block of code or data from one place to another in a file has no actual effect, in which case good luck because that's a domain-specific hard problem.
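The disagreement can be shown with a small sketch using Python's standard difflib. It assumes, and this is exactly the domain-specific judgment in question, that blocks are separated by blank lines and that their order carries no meaning. The helper names and the sample config are made up.

```python
import difflib

def line_changes(a, b):
    """Count added and removed lines as a line-based diff reports them."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

def block_changes(a, b, sep="\n\n"):
    """Compare inputs as unordered sets of blocks: (removed, added).
    Assumes block order is irrelevant, which is not true of every format."""
    blocks_a = {blk.strip() for blk in a.split(sep) if blk.strip()}
    blocks_b = {blk.strip() for blk in b.split(sep) if blk.strip()}
    return sorted(blocks_a - blocks_b), sorted(blocks_b - blocks_a)

BEFORE = ("interface eth0\n  mtu 1500\n\n"
          "interface eth1\n  mtu 1500\n\n"
          "interface eth2\n  mtu 9000\n")
# Same blocks reordered, with one real edit (eth1 mtu).
AFTER = ("interface eth2\n  mtu 9000\n\n"
         "interface eth0\n  mtu 1500\n\n"
         "interface eth1\n  mtu 1400\n")

if __name__ == "__main__":
    print("line-level changes:", line_changes(BEFORE, AFTER))
    print("block-level changes:", block_changes(BEFORE, AFTER))
```

The line count includes the shuffled blocks, while the block comparison reports only the edited eth1 block. Whether the reordering matters is left to whoever picks the comparison.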
This post expresses my opinion, not that of my employer. And yes, IAAL.
From the header of 1988 perl man page:
Submitted-by: Larry Wall
Posting-number: Volume 13, Issue 1
Archive-name: perl/part01
[ Perl is kind of designed to make awk and sed semi-obsolete. This posting
will include the first 10 patches after the main source. The following
description is lifted from Larry's manpage. --r$ ]
Perl is a interpreted language optimized for scanning arbitrary text
files, extracting information from those text files, and printing
reports based on that information. It's also a good language for many
system management tasks. The language is intended to be practical
(easy to use, efficient, complete) rather than beautiful (tiny,
elegant, minimal). It combines (in the author's opinion, anyway) some
of the best features of C, sed, awk, and sh, so people familiar with
those languages should have little difficulty with it. (Language
historians will also note some vestiges of csh, Pascal, and even
BASIC-PLUS.) Expression syntax corresponds quite closely to C
expression syntax. If you have a problem that would ordinarily use sed
or awk or sh, but it exceeds their capabilities or must run a little
faster, and you don't want to write the silly thing in C, then perl may
be for you. There are also translators to turn your sed and awk
scripts into perl scripts.
Some drink at the fountain of knowledge. Others just gargle.
This reminds me of a paper Rob Pike wrote a while back addressing this problem. His solution was a generalization of regular expressions, which he termed Structural Regular Expressions. I'm not sure how these stack up against context-free grammars, but it's an interesting approach that seems at least fairly similar to the Dartmouth work. In any case, I didn't see it as a reference, so I thought I'd mention it.
LOL and that, my friend, is the hard part. It cost me $4000 in legal fees to make sure they are not owned by the company I work for, and 6 weeks of work. I'm leaning towards an AGPL/open core model. I just see so many people NOT happy with open core stuff. Also, I didn't get a grant from Google or the D.O.E. And these are just small, yet integral, parts of a larger system that I don't really want to give away yet. Hell, deciding on licensing is harder than coding sometimes. Gotta feed the family you know, while at the same time pay back the OSS world for all of the great stuff that I use every day for free. How to do both is a hard ethical question. It's easy to say just consult, or write a book. It's much harder to actually _do_ these things. Hell, it's hard enough just to open up your code to the world's criticisms. The only thing I know at this point is that it's not doing me or anyone else any good just sitting on it.
I wouldn't call it a cancer. But it's definitely useful if you don't ever want commercial companies to use your code in public. It matches up well with the open core model. Commercial people will only use it if you can give them a differently licensed copy of the code. Apache, MIT, and BSD are great if you truly want to give your code away and don't care what people do with it behind closed doors. AGPL is nice to make sure people always give back. LGPL and GPL are nice if you only want them to give back if they change it. Should people pay, and how much, is an age-old question. I have to balance the cost of support and development vs. the cost of the product. The more I lean on the community, the less I can charge and the more exposure I get. While in the other direction I get more money, but have to spend more of it. And there is no one-size-fits-all solution to any of this.
Perl can context grep any ****ing thing any which way from Sunday. Much easier and more powerful than awk.
XML is ok, but there are many data formats that could really use a diff/grep utility that could make sense of them. HDF5 and NetCDF are nice in the scientific community, for example. Computer graphics geeks might find intelligent diff/grep tools for the Renderman format to be useful. Office users might want to know if two documents are genuinely different or were compressed differently. Hell, it would be incredibly useful if they could diff a MS Office file and LibreOffice file in their native formats to see if they were logically the same even if syntactically represented differently.
I'm sure that's the kind of thinking on the Google side. If you can equate two files (even if they aren't absolutely identical when in file form) and search in a file-format-independent way, then you can eliminate duplicate indexing and boost searching. An obvious place for that would be Google Docs, where the internal format used for a file isn't necessarily the format used by you on your machine.
A truly universal tool's only relationship to XML would be to use XML to define how different file formats worked. This would mean you could have a dictionary of file formats and object representations, without having to link to a billion libraries or having to stuff the utilities full of different kinds of parser. You'd have a single parser that would really be no different from the modern diff and grep, plus a layer in front that used the file format descriptions to convert the inputs into a usable representation.
If you wanted to stick to the Unix convention, and make this capability universal to all tools, you'd have a single file decompiler utility that used the dictionary to turn any stdin/file input into a standardized output and a single file compiler utility that could take the output of something like diff or grep and convert the representation back into a format that's meaningful with respect to the original file format. Hey presto, any problems Google or the DoE have are solved without having to alter any specific tool or create any compatibility issue.
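As a very rough illustration of the decompiler/compiler pair, the sketch below flattens a parsed JSON object into one "path, tab, value" line per leaf, which ordinary grep and diff can handle, and rebuilds the object from those lines. It assumes a non-empty top-level object and treats lists as opaque leaf values; the names decompile and compile_lines and the line format are invented for this example. The format dictionary described above is not modelled at all.

```python
import json

def decompile(document):
    """Flatten nested dicts into sorted 'path<TAB>value' lines.
    Lists and empty dicts are emitted whole as leaf values."""
    lines = []

    def walk(node, path):
        if isinstance(node, dict) and node:
            for key in sorted(node):
                walk(node[key], path + [key])
        else:
            # json.dumps never emits a raw tab, so the separator is unambiguous.
            lines.append(json.dumps(path) + "\t" + json.dumps(node))

    walk(document, [])
    return lines

def compile_lines(lines):
    """Rebuild the nested dict from lines produced by decompile()."""
    root = {}
    for line in lines:
        raw_path, raw_value = line.split("\t", 1)
        path = json.loads(raw_path)
        node = root
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = json.loads(raw_value)
    return root

DOC = {"server": {"name": "alpha", "ports": [22, 443]}, "debug": False}

if __name__ == "__main__":
    for line in decompile(DOC):
        print(line)
```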
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)