Slashdot Mirror


Regex Golf, xkcd, and Peter Norvig

mikejuk writes "A recent xkcd strip has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly. Code Golf is a reasonably well-known sport of trying to write an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create the shortest regular expression that accepts the strings in one list and rejects the strings in a second list. This started Peter Norvig, the well-known computer scientist and director of research at Google, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem? The result is an NP-hard problem that needs AI-like techniques to get an approximate answer. To find out more, read the complete description, including Python code, on Peter Norvig's blog. It ends with this challenge: 'I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this.'"
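
To make the problem statement concrete, here is a minimal Python sketch of a regex golf checker plus a greedy, set-cover-style heuristic. It is an illustration only and should not be read as the code from Norvig's blog: the function names, the candidate parts (anchored whole words and substrings of up to four characters), and the scoring weight of 4 are all assumptions made for this example.

```python
import re

def verify(pattern, winners, losers):
    """True if pattern matches every winner and no loser (re.search semantics)."""
    return (all(re.search(pattern, w) for w in winners)
            and not any(re.search(pattern, l) for l in losers))

def candidate_parts(winners, losers):
    """Anchored whole words plus short substrings of winners that match no loser."""
    parts = {"^" + re.escape(w) + "$" for w in winners}
    for w in winners:
        for n in range(1, 5):                      # substring lengths 1..4 (arbitrary cap)
            for i in range(len(w) - n + 1):
                parts.add(re.escape(w[i:i + n]))
    return {p for p in parts if not any(re.search(p, l) for l in losers)}

def greedy_golf(winners, losers):
    """Repeatedly pick the part that covers many remaining winners and is short."""
    remaining = set(winners)
    if remaining & set(losers):
        raise ValueError("a string cannot be both a winner and a loser")
    parts = candidate_parts(remaining, losers)
    chosen = []
    while remaining:
        # Only consider parts that still cover something, so each round makes progress.
        covering = [p for p in parts if any(re.search(p, w) for w in remaining)]
        best = max(
            covering,
            key=lambda p: (4 * sum(bool(re.search(p, w)) for w in remaining) - len(p), p),
        )
        chosen.append(best)
        remaining = {w for w in remaining if not re.search(best, w)}
    return "|".join(chosen)

if __name__ == "__main__":
    winners = ["washington", "adams", "jefferson", "madison"]
    losers = ["clinton", "pinckney", "king", "burr"]
    pattern = greedy_golf(winners, losers)
    print(pattern, verify(pattern, winners, losers))
```

The greedy loop always terminates because every winner has its own anchored whole-word part, but the result is only an approximation: it will usually be valid without being the shortest possible expression.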

5 of 172 comments

  1. FWIW, the Regex Golf game by Amorymeltzer · · Score: 5, Interesting

    http://regex.alf.nu/

    Some favor trickiness, some favor just listing possibilities, but it's fun. I'm at 3651.

    --
    I live in constant fear of the Coming of the Red Spiders.
  2. RegExps by ledow · · Score: 5, Interesting

    Regexps are a programming language unto themselves.

    I'm currently doing some temp IT work for schools while my promised job becomes available, and it's eye-opening. The web filtering is all regexp-based, but nobody understands how it works.

    They just copy/paste an example and change the parts of the URL that they can see to match the one they want. They barely bother to test the impact beyond checking that the site they need becomes "unfiltered" or "filtered" as necessary (i.e. no thought for knock-on effects on other sites with similar names). Let's not even mention the use of "." without the escape character when they mean a literal period (but, obviously, it means "any character" in a regexp).
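
The unescaped "." problem is easy to demonstrate. The sketch below uses Python's `re` module as a stand-in; the school's filter and its regexp dialect are not named in the comment, so the `is_filtered` helper and the URLs are hypothetical.

```python
import re

def is_filtered(pattern, url):
    """Return True if the (hypothetical) filter pattern matches anywhere in url."""
    return re.search(pattern, url) is not None

loose = r"www.example.com"      # "." means "any character"
strict = r"www\.example\.com"   # "\." means a literal period

lookalike = "http://wwwxexample.com/"
print(is_filtered(loose, lookalike))    # True: the first "." matched the "x"
print(is_filtered(strict, lookalike))   # False
print(re.escape("www.example.com"))     # www\.example\.com
```

In Python, `re.escape` does the escaping for you when a literal hostname is pasted into a pattern.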

    I talked to them about changing their template regexp because, from the start, I could see that it wasn't really up to the job, and I was met with, if not opposition, then at least apathy about the problem.

    Until someone brought an iPad into the helpdesk where a site that was supposed to be unfiltered was filtered - because nobody had considered what happens if you use "http://example.com" instead of "http://www.example.com". I was the one to spot it, and tell them that it's because their regexp was very basic.

    The good thing was, the other tech on the team was young and keen to learn and I was able to give them a quick rundown of regexps and we crafted an alternative template for them to use that would take account of the situation without, for instance, the blocking of "microsoft.com" affecting "antimicrosoft.com".
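
The comment does not show the replacement template, so the following is only a guess at what such a template could look like, written for Python's `re` module. The `domain_pattern` helper is a hypothetical name. It allows any number of subdomain labels, which covers both "example.com" and "www.example.com", and requires the host to end right after the domain.

```python
import re

def domain_pattern(domain):
    """Build a pattern for http(s) URLs whose host is domain or a subdomain of it."""
    return re.compile(
        r"^https?://"            # scheme
        r"(?:[a-z0-9-]+\.)*"     # zero or more subdomain labels, each ending in a dot
        + re.escape(domain)      # the domain itself, with "." escaped
        + r"(?:[:/?#]|$)",       # host must end here: port, path, query, fragment, or end
        re.IGNORECASE,
    )

microsoft = domain_pattern("microsoft.com")
for url in [
    "http://microsoft.com",
    "http://www.microsoft.com/en-us",
    "http://antimicrosoft.com",
    "http://microsoft.com.evil.org/",
]:
    print(url, bool(microsoft.search(url)))
```

Only the first two URLs match. Because each optional label must end in a dot, "antimicrosoft.com" cannot be read as a subdomain of "microsoft.com".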

    But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

    1. Re:RegExps by Anonymous Coward · · Score: 5, Funny

      But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

      You will be truly amazed by the number of people who copy a not-working example...

  3. Re: Regex this by ShanghaiBill · · Score: 5, Funny

    Some of us programmers have families, a life, don't watch anime, and aren't basement dwelling nerds.

    Umm ... I just spent the last hour playing regex golf with my wife and kids.

  4. This problem has been studied for decades by DogPhilosopher · · Score: 5, Informative

    There's a field called Grammar Induction, and the problem of learning regular languages, aka regular inference, can be considered a subfield. People have been working on this since the '50s. Applications include learning DTDs for XML/wrapper induction, and all kinds of problems in bioinformatics and natural language processing.

    There's a strong link with the graph coloring problem, see
    http://www.cs.ru.nl/~sicco/papers/alt12.pdf

    In this field, the focus is generally on learning FSAs, but these can easily be transformed into regexps. There's work on learning regexps directly, see
    http://www.informatik.uni-trier.de/~fernau/papers/Fer05c.pdf

    Enjoy.
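
As a small illustration of the remark that FSAs can easily be transformed into regexps, the sketch below builds the simplest possible acceptor from positive examples, a prefix tree (trie), and converts it to a regular expression. This is not the method of either linked paper: real regular-inference algorithms go further and merge states in order to generalize, whereas this acceptor matches exactly the sample words. All names are made up for the example.

```python
import re

def build_trie(words):
    """Prefix tree acceptor: nested dicts, with "" marking an accepting state."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn the acyclic automaton into a regex by recursing over its branches."""
    alternatives = []
    optional = False
    for ch in sorted(node):
        if ch == "":
            optional = True      # a word may end at this state
        else:
            alternatives.append(re.escape(ch) + trie_to_regex(node[ch]))
    if not alternatives:
        return ""
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    group = "(?:" + "|".join(alternatives) + ")"
    return group + "?" if optional else group

words = ["cat", "car", "cart", "dog"]
pattern = trie_to_regex(build_trie(words))
print(pattern)
print(all(re.fullmatch(pattern, w) for w in words))
```

Shared prefixes are factored out, so the expression is usually shorter than a plain alternation of the words, but it never accepts a string outside the sample.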