Slashdot Mirror


Regex Golf, xkcd, and Peter Norvig

mikejuk writes "A recent xkcd strip has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly. Code Golf is a reasonably well known sport of trying to write an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create a regular expression that accepts the strings in one list and rejects the strings in a second list. This started Peter Norvig, the well-known computer scientist and director of research at Google, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem? The result is an NP hard problem that needs AI-like techniques to get an approximate answer. To find out more, read the complete description, including Python code, on Peter Norvig's blog. It ends with this challenge: 'I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this.'"

34 of 172 comments (clear)

  1. FWIW, the Regex Golf game by Amorymeltzer · · Score: 5, Interesting

    http://regex.alf.nu/

    Some favor trickiness, some favor just listing possibilities, but it's fun. I'm at 3651.

    --
    I live in constant fear of the Coming of the Red Spiders.
    1. Re:FWIW, the Regex Golf game by Garridan · · Score: 3, Informative

      Or, if you want to try your hand at meta regex golf there's a place for that, too.

  2. Is there a regex... by Hognoxious · · Score: 3, Funny

    ... that can find frost psits?

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  3. Meta-meta-meta-meta.. by Travis+Mansbridge · · Score: 2

    Now build a robot that can design a program to write a regex expression to solve the xkcd problem.

  4. RegExps by ledow · · Score: 5, Interesting

    Regexp's are a programming language unto themselves.

    I'm currently doing some temp IT work for schools while my promised job becomes available and it's eye-opening. The web-filtering is all reg-exp based but nobody understands how it works.

    They just copy/paste an example and change the parts of the URL that they can see to match the one they want. They barely bother to test the impact, past the site they need becoming "unfiltered" or "filtered" as necessary (i.e. no implication of knock-on effects on other sites with similar names). Let's not even mention the use of "." without the escape character for them to mean a literal period (but, obviously, it means "any character" in a regexp).

    I talked to them about changing their template regexp because, from the start, I could see that it wasn't really up to the job and just met if not opposition then at least apathy about the problem.

    Until someone brought an iPad into the helpdesk where a site that was supposed to be unfiltered was filtered - because nobody had considered what happens if you use "http://example.com" instead of "http://www.example.com". I was the one to spot it, and tell them that it's because their regexp was very basic.

    The good thing was, the other tech on the team was young and keen to learn and I was able to give them a quick rundown of regexps and we crafted an alternative template for them to use that would take account of the situation without, for instance, the blocking of "microsoft.com" affecting "antimicrosoft.com".

    But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

    1. Re:RegExps by Hognoxious · · Score: 3, Funny

      Regexp's are a programming language unto themselves.

      It would appear that apostrophes are too.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:RegExps by Anonymous Coward · · Score: 4, Insightful

      ...and then there's the people who think they understand regexes, but try to use them to decide context-free grammars.

    3. Re:RegExps by Anonymous Coward · · Score: 5, Funny

      But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

      You will be truly amazed by the number of people who copy a not-working example...

    4. Re:RegExps by Anonymous Coward · · Score: 2, Informative

      If regular expressions are a programming language, then they are not a very powerful one. The languages they recognize are the so-called regular languages, which are the least expressive category in the Chomsky hierarchy of languages (note the difference between the language of regular expressions and the languages recognized by regular expressions). See the Wikipedia articles on regular languages and the Chomsky hierarchy for details.

    5. Re: RegExps by LordSnooty · · Score: 2

      I don't know, apostrophes can be used to indicate missing letters and it helps to highlight that regexp isn't really a word (yet).

    6. Re: RegExps by Anonymous Coward · · Score: 2, Insightful

      Calling backslashes "whacks" disqualifies you from being involved in such discussions.

    7. Re:RegExps by Redmancometh · · Score: 2

      A $40/hr G-code monkey would be underpaid.

  5. ioccc 2013 US president matching code by hankwang · · Score: 4, Interesting

    The International Obfuscated C Code contest had a winning entry that could flag the names of US presidents as republican or democrat.

    main(int riguing,char**acters){puts(1[acters-~!(*(int*)1[acters]%4796%275%riguing)]);}

    Quoting: "This one-line C program accepts as a first command-line argument the last name of any of the last 31 US Presidents (from Franklin Pierce onwards), in lower case, and prints out their political affiliation. Use "republican" as the 2nd command-line argument, and "democrat" as the 3rd (or equivalent strings of your choice)."

    De-obfuscated, it is a boolean expression acting on a string s,

    (*(int*)s)%4796%275%4

    I wonder whether you can make a regexp that is shorter than this and accomplishes the same thing.

    1. Re:ioccc 2013 US president matching code by hankwang · · Score: 2
      Pardon, I forgot the "not" operator. De-obfuscated, it is a boolean expression acting on a string s,

      !((*(int*)s)%4796%275%4)

    2. Re:ioccc 2013 US president matching code by hankwang · · Score: 4, Informative

      It takes the first four bytes of the president's name, converts them to int; then applies four modulo operations (%4796 %275 %4). How the author figured out that those four operations would do the job, who knows? Maybe brute-forced the parameter space.

    3. Re:ioccc 2013 US president matching code by thogard · · Score: 2

      I think the subtletyâZ the objector had was that arrays and pointers are slightly different which is true In this context, an array is a pointer with potentially compiler allocated backing memory for the data while the pointer might not. A pointer will also have an address while the pointer used in array definitions won't have an address. Old compilers used to treat them identically but then again they used to treat pointers as integers as well. Modern compilers tend to know enough about the CPUs and have built in array checks that they do work slightly differently.

  6. Re:Regex this by Anonymous Coward · · Score: 2, Insightful

    That other people have hobbies different than you is going to be quite common. If you get this angry every time you run into someone with a different hobby then I fear for your long term mental and physical health.

    Take deep breaths. It is going to be ok.

  7. Hard AI by Okian+Warrior · · Score: 4, Insightful

    Regex me a list of folks that have time to sit around fucking off their life-time in order to write a regex to work on the XKCD "problem", and folks that don't.

    The field of study known as "AI" has been stagnant, for about 50 years now. One of the field's many problems is the lack of a good definition for intelligence.

    Despite lacking a definition, we have working examples intelligent systems in the real world - humans.

    Humans are very good at partitioning sets by descriptive differences, and they discover these descriptive rules largely by themselves.

    We don't know what intelligence is yet, but if we keep looking at problems and trying to figure out the human approach, eventually we'll have enough contrasts and similarity to partition sets based on differences in intelligence.

    In other words, the more problems we solve, the more data we can use to formulate rules that define intelligence.

    That's a pretty important and useful goal.

    (And belaboring the obvious: If we had even simple AI constructs we could automate much of out work force, freeing us up for more leisurely pursuits. Whether this leads to a post-scarcity utopia or unemployment/welfare apocalypse depends on your political affiliation.)

    1. Re:Hard AI by Eythian · · Score: 2

      That's not true. The field hasn't been stagnant at all. The problem is you're missing most of the field in your assumption of what it is.

      AI is not all about making "computer programs that can think" (a.k.a "Strong AI"), it's more about creating systems that can adapt to their environment in order to improve their chance of success. If their environment is getting lists of names and they adjust, without the direct input of a person, to better match them according to some scoring system, that is AI.

      It can use statistical methods, evolutionary characteristics, or just adjusting variables according to how wrong they currently are in order to try to match some problem. I'm not even touching symbolic AI here, because I don't understand it nearly so well.

      Strong AI is part of the field sure, but certainly not all of it as you imply. It is a very large field.

    2. Re:Hard AI by Antonovich · · Score: 4, Interesting

      You're both right. The term AI originally had much broader scope because (computer) people hadn't tried to implement it yet and realised that it was going to be *very* hard to do. Now people seem to use the terms Strong AI or Artificial General Intelligence for trying to get the kind of non-domain-specific "intelligence" humans (are supposed to) have, and "simple AI" just means tech that can adapt within a highly restricted domain (like search, navigation, etc.). The problem, of course, is that until we get an even remotely satisfactory definition of what "human intelligence" is, we aren't going to get very far. Unfortunately, it turns out that attempting to specify exactly what intelligence is ends up involving linguists, psychologists, biologists (neuroscientists, etc.) and worst of all, philosophers. Basically, no one agrees on anything and even when there are areas of agreement, each person seems to have deal-breaker elements that no one else agrees on. It gets messy, and quick. Worse still, it turns out that developmental/cultural factors (so development of an individual in a particular socio-cultural context) radically colours what kinds of definitions of intelligence people come up with, and the sorts of solutions it makes sense to think about (representationism vs enaction/actional theories, etc.). For example, a lot of work has been done on the psychological and cultural/historical effects of literacy (and numeracy) and show that someone growing up in a society in the Western Tradition (this is a specific term) that learns to read/write with an alphabet will often strongly prefer a particular view of the higher mental functions (language, thinking, "intelligence", etc.)(see John Goody, Roy Harris, David R Olson, and many more). If we want to create the tech then we need to go beyond that but it's incredibly hard to go beyond your own intellectual baggage. Researchers like Pei Wang have a bit of an advantage coming from a non-Western Tradition culture and he has some great thoughts on the different solution spaces that opens up. I like the way he uses levels (perspectives?) to look at this and the problem more generally. He talks about a Western bias for creating theories looking for "truth" (which has strong links to representationalist thinking) that he wants to avoid. It's certainly not impossible for someone from the Western Tradition (like Goertzel) to think otherwise but it requires significant amounts of meta-reflection - I have a theory that "works" but does everyone think it works? What kinds of people think it works, and what kind doesn't? Do people from all cultural traditions agree? What would my theory sound like to someone from the middle ages? Ancient Greece? Why would I come up with such a theory? How does it fit in to my view on other things (politics, religion, etc.)? If you scratch the surface you quickly realise that there is a vast world of stuff that affects even what we think it's interesting to think about. It's incredibly hard for us to not just assume that the Western Tradition is simply more advanced, purer and better. We are surrounded from birth with a moral framework (the UN, judeo-christian value system, etc.) and an educational system that promotes a Platonic view of pure forms and pure knowledge ("mathematics is the language of the universe", etc.). This sort of thing can even go completely out of whack, and result in the kind of stuff we saw with the Nazis. So if there is considerable debate on what it means to know something, then defining intelligence is not going to be a simple affair! I personally adhere to the school of thought that thinks that the best way to move forward is simply to re-create a human-like machine that acts/reacts in a way that most people would call "human". Almost everyone agrees humans are intelligent, so if it acts the same way then most of us will agree it's intelligent. This (almost) completely avoids the sort of shit we have to deal with that I mention above. Turns out it's far from trivial though...

  8. whitelisting with regexp by v1 · · Score: 3, Funny

    There's really no better way to scan a log file for odd log entries than to write a big regexp that filters out whitelisted entries. Lets you find log entries you're NOT expecting. (and occasionally, log entries that not even the developers are expecting)

    Editing them of course is a royal pain, (not to mention debugging!) so I usually write a script to compose the regexp. I just checked on one of mine, and it composes a 17,000+ character single-liner that scans my wired.log file.

    I've got a smaller one that keeps an eye on secure.log for anomalies.

    --
    I work for the Department of Redundancy Department.
  9. Re: Regex this by ShanghaiBill · · Score: 5, Funny

    Some of us programmers have families, a life, don't watch anime, and aren't basement dwelling nerds.

    Umm ... I just spent the last hour playing regex golf with my wife and kids.

  10. This problem has been studied for decades by DogPhilosopher · · Score: 5, Informative

    There's a field called Grammar Induction, and the problem of learning regular languages, aka regular inference, can be considered a subfield. People have been working on this since the '50s. Applications include learning DTDs for XML/wrapper induction, and all kinds of problems in bioinformatics and natural language processing.

    There's a strong link with the graph coloring problem, see
    http://www.cs.ru.nl/~sicco/papers/alt12.pdf

    In this field, the focus is generally on learning FSAs, but these can easily be transformed into regexps. There's work on learning regexps directly, see
    http://www.informatik.uni-trier.de/~fernau/papers/Fer05c.pdf

    Enjoy.

  11. Re:Regex this by alienmole · · Score: 2

    Why does this upset you? Is it the recognition that people much more intelligent than you are spending time on problems whose significance you don't understand, and attacking them makes you feel better?

  12. Re:Terrifying by bunratty · · Score: 2

    The algorithm is an AI algorithm. It's using hueristics to attempt to minimize some function, as many AI algorithms do.

    --
    What a fool believes, he sees, no wise man has the power to reason away.
  13. Re:Unless Pyhon has changed recently. by bunratty · · Score: 3, Informative

    Python is very similar to Perl, and also has most of the characteristics that make Lisp a good language to program AI algorithms.

    --
    What a fool believes, he sees, no wise man has the power to reason away.
  14. Please, no, don't filter all users out by Xaedalus · · Score: 4, Interesting

    As a junior cadet code monkey of a user, I come to /. precisely because I know my limitations when it comes to understanding tech, coding, and hard science. Despite the -1's, trolls, and certainty-addicted neckbeards (or, because of them) I've learned a lot about how intrinsically cool it is to code, the artistic side of coding, the wonders of science (and how it rivals religious experiences for appreciating one's place in the universe), and I've improved my debating skills on here. /. is one of the net benefits of my life, and I consider it a necessity. Chances are good, I'm not the only user who thinks that.

    --
    Here's to hot beer, cold women, and Glaswegian kisses for all.
  15. could make google more "googlely" by LifesABeach · · Score: 2

    sowing only web pages for delta airlines; and not delta faucets? that's "googlely."

  16. Re:The Motion Picture by lgw · · Score: 3, Funny

    We fans know the first one is really "Star Trek: The Tone Poem".

    --
    Socialism: a lie told by totalitarians and believed by fools.
  17. Re:Terrifying by simplypeachy · · Score: 2

    Well now you've just ruined my terror. I'll have to go read up on the up-coming UK legal hilarity to restore it.

  18. Re:Obligatory XKCD by RussR42 · · Score: 2

    I believe this is what you were looking for.

  19. ^https?://([a-z0-9\-]+\.)*foo\.com(/|$) by raymorris · · Score: 4, Interesting

    For matching URLs from a domain, here's a regex we came up with that covers some special cases. Hopefully Slashdot doesn't mangle it too badly.

    https?://([a-z0-9\-]+\.)*foo\.com(/|$)

    That covers:
    https as well as http
    "subdomains" like maps.google.com as well as www.google.com and google.com
    It's not fooled by google.com.hacker.ru

  20. Re:Unless Pyhon has changed recently. by phantomfive · · Score: 2

    Usually easier to use the language you know, rather than learn a whole new language for a project.

    --
    "First they came for the slanderers and i said nothing."
  21. Is the problem actually NP-complete? by Shaterri · · Score: 3, Interesting

    My reading of Norvig's blog post is that he suggests his specific approach (stapling together short regexps with ORs) requires solving the NP-Complete Set Cover problem, but he doesn't actually say anything about whether the core problem (match everything in list A and nothing in list B) does; it's conceivable that e.g. some sort of dynamical programming approach could do the job more efficiently than Norvig's algorithm does. Does anyone know whether the root problem (to avoid having to do the optimization, just phrase it as 'is there a separating regexp for the sets A and B of length less than k?') is specifically known to be NP-complete?