The Double Life of Memory Exposed With Automata Processor
An anonymous reader writes "As Nicole Hemsoth over at HPCwire reports 'In a nutshell, the Automata processor is a programmable silicon device that lends itself to handing high speed search and analysis across massive, complex, unstructured data. As an alternate processing engine for targeted areas, it taps into the inner parallelism inherent to memory to provide a robust and absolutely remarkable, if early benchmarks are to be believed, option for certain types of processing.'"
Basically, the chip is designed solely to process Nondeterministic Finite Automata and can explore all valid paths of an NFA in parallel, hiding the whole O(n^2) complexity thing. Micron has a stash of technical documents including a paper covering the design and development of the chip. Imagine how fast you can process regexes now.
In the real world (as opposed to the Big O world), the running time is waaayyyyyy faster.
We wrote a program to translate ~10,000 non-trivial Perl regular expressions to DFAs (specifically, into Ragel syntax). Actually, only about a quarter could be completely translated, because of various Perl extensions. The rest had to be programmatically analyzed for sub-expressions which could be turned into filters--only if the filters matched did the PCRE actually run.
Achieved a 20x (yes, 20x!) improvement in throughput compared to Perl or PCRE, and very nearly that compared to RE2 (RE2 doesn't scale well wrt memory when you try to union more than a few complex expressions together). However, as you pointed out, the state machine was enormous--over 1GB compiled, and over 2GB in source code. It was so big that neither GCC nor clang could compile it within an afternoon because their array and structure handling code has some algorithmic deficiencies. So we had to come up with some tricks there, too.
You would think the DFA approach would run afoul of memory bandwidth and cache trashing at that scale. And it probably does, although there appears to be a significant degree of locality involved for most inputs. But regardless it's clearly better than running NFAs.
Tangent:
I was once on a conference call with Intel engineers who were hawking some fancy substring matching code. I think they implemented Aho–Corasick, utilizing new Xeon vector operations. Their goal was obviously to sell more expensive hardware by selling optimized software at a nominal cost and tying you to their platform. Their numbers were nice, but Ragel was still performance competitive, could scale better (in terms of number of patterns), is FOSS, and has an ingenius design which makes it trivial to integrate into your application without having to deal with an API that breaks your threading or I/O model.
I'm sure the Automata processor would do well if it can be productized properly. But there's plenty of room for improvement in the way people do regular expressions now.