Regex Golf, xkcd, and Peter Norvig
mikejuk writes "A recent xkcd strip has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly. Code Golf is a reasonably well known sport of trying to write an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create a regular expression that accepts the strings in one list and rejects the strings in a second list. This started Peter Norvig, the well-known computer scientist and director of research at Google, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem? The result is an NP hard problem that needs AI-like techniques to get an approximate answer. To find out more, read the complete description, including Python code, on Peter Norvig's blog. It ends with this challenge: 'I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this.'"
http://regex.alf.nu/
Some favor trickiness, some favor just listing possibilities, but it's fun. I'm at 3651.
I live in constant fear of the Coming of the Red Spiders.
... that can find frost psits?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Now build a robot that can design a program to write a regex expression to solve the xkcd problem.
Why is The Motion Picture not considered the subtitle for the first Star Trek movie?
Regexp's are a programming language unto themselves.
I'm currently doing some temp IT work for schools while my promised job becomes available and it's eye-opening. The web-filtering is all reg-exp based but nobody understands how it works.
They just copy/paste an example and change the parts of the URL that they can see to match the one they want. They barely bother to test the impact, past the site they need becoming "unfiltered" or "filtered" as necessary (i.e. no implication of knock-on effects on other sites with similar names). Let's not even mention the use of "." without the escape character for them to mean a literal period (but, obviously, it means "any character" in a regexp).
I talked to them about changing their template regexp because, from the start, I could see that it wasn't really up to the job and just met if not opposition then at least apathy about the problem.
Until someone brought an iPad into the helpdesk where a site that was supposed to be unfiltered was filtered - because nobody had considered what happens if you use "http://example.com" instead of "http://www.example.com". I was the one to spot it, and tell them that it's because their regexp was very basic.
The good thing was, the other tech on the team was young and keen to learn and I was able to give them a quick rundown of regexps and we crafted an alternative template for them to use that would take account of the situation without, for instance, the blocking of "microsoft.com" affecting "antimicrosoft.com".
But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.
The International Obfuscated C Code contest had a winning entry that could flag the names of US presidents as republican or democrat.
Quoting: "This one-line C program accepts as a first command-line argument the last name of any of the last 31 US Presidents (from Franklin Pierce onwards), in lower case, and prints out their political affiliation. Use "republican" as the 2nd command-line argument, and "democrat" as the 3rd (or equivalent strings of your choice)."
De-obfuscated, it is a boolean expression acting on a string s,
I wonder whether you can make a regexp that is shorter than this and accomplishes the same thing.
Avantslash: low-bandwidth mobile slashdot.
"Find it interesting"? I find it fucking terrifying. There's no way I'll be following that link. That's in bat country.
That other people have hobbies different than you is going to be quite common. If you get this angry every time you run into someone with a different hobby then I fear for your long term mental and physical health.
Take deep breaths. It is going to be ok.
I'd have been extremely surprised if solving for the shortest regex golf pattern for a pair of lists weren't NP hard. And greedy approaches are fairly obvious.
The point is, that's what makes it analogous to golf. The optimal solution is your hole in one. Some greedy algorithm solution is your par. Those aren't the interesting areas, they're just the end points.
For those who couldn't work it out for the,selves, here are the rules of regex golf:
How many characters you use in your regex is your score
Lower is better
There's a par calculated for each hole, and totaled for the course, which serves as a frame of reference for performance
My only political goal is to see to it that no political party achieves its goals.
Every single human should be working towards curing cancer and getting in to space and solving the mysterious universe that we live in!
Fuck people that have fun and hobbies!
Regex me a list of folks that have time to sit around fucking off their life-time in order to write a regex to work on the XKCD "problem", and folks that don't.
The field of study known as "AI" has been stagnant, for about 50 years now. One of the field's many problems is the lack of a good definition for intelligence.
Despite lacking a definition, we have working examples intelligent systems in the real world - humans.
Humans are very good at partitioning sets by descriptive differences, and they discover these descriptive rules largely by themselves.
We don't know what intelligence is yet, but if we keep looking at problems and trying to figure out the human approach, eventually we'll have enough contrasts and similarity to partition sets based on differences in intelligence.
In other words, the more problems we solve, the more data we can use to formulate rules that define intelligence.
That's a pretty important and useful goal.
(And belaboring the obvious: If we had even simple AI constructs we could automate much of out work force, freeing us up for more leisurely pursuits. Whether this leads to a post-scarcity utopia or unemployment/welfare apocalypse depends on your political affiliation.)
Think of it this way: people who do things like this aren't out in public, adding to traffic on the highway (or possibly wandering out into the street aimlessly, getting hit by cars and making huge traffic backups). See, that's a plus for you.
Or think of it another way: other people have no obligation to live their lives in any way that you may or may not approve of. After all, what's a bigger waste of time, writing regex golf code, or bitching about other people writing regex code?
He works at Google on AI, and this problem is both amusing and interesting. Having two monitors is totally normal.
Slashdot really has gone to the dogs, hasn't it. Would be great if there were some sort of technical test one could take here; the results of which could be used to filter out all the - for want of a better word - users.
Yeah, right. Look where you are.
There's really no better way to scan a log file for odd log entries than to write a big regexp that filters out whitelisted entries. Lets you find log entries you're NOT expecting. (and occasionally, log entries that not even the developers are expecting)
Editing them of course is a royal pain, (not to mention debugging!) so I usually write a script to compose the regexp. I just checked on one of mine, and it composes a 17,000+ character single-liner that scans my wired.log file.
I've got a smaller one that keeps an eye on secure.log for anomalies.
I work for the Department of Redundancy Department.
Some of us programmers have families, a life, don't watch anime, and aren't basement dwelling nerds.
Umm ... I just spent the last hour playing regex golf with my wife and kids.
There's a field called Grammar Induction, and the problem of learning regular languages, aka regular inference, can be considered a subfield. People have been working on this since the '50s. Applications include learning DTDs for XML/wrapper induction, and all kinds of problems in bioinformatics and natural language processing.
There's a strong link with the graph coloring problem, see
http://www.cs.ru.nl/~sicco/papers/alt12.pdf
In this field, the focus is generally on learning FSAs, but these can easily be transformed into regexps. There's work on learning regexps directly, see
http://www.informatik.uni-trier.de/~fernau/papers/Fer05c.pdf
Enjoy.
Be sure to allow cookies, else the site won't work.
Cookies are an essential part of websites nowadays, dontcha' know...
Why does this upset you? Is it the recognition that people much more intelligent than you are spending time on problems whose significance you don't understand, and attacking them makes you feel better?
Take deep breaths. It is going to be ok.
This is what I really can't stand in a teacher. Don't lie to your student. Don't sugar coat the hard truth. It's not going to be okay. He's going to die. Dead; no longer living. It may take time. It might be not that painful, but there's a fair chance that it's going to be awful. Just tell him the truth now. Don't say
He will have to face it:
It might seem to him to be okay to be angry, about people with different hobbies, but if it causes him stress, make him take up smoking, or worse, restart smoking after giving up, and then die of a disease caused by his lack of tolerance, trust me he's not going to thank you for hiding the truth.
It's a special case of automatic code optimisation: Since regular expressions should be reduce-able to logical gates, it stands to reason a minimized equivalent boolean function could be decompiled into a regex.
On it's own, having compilers\interpreters that optimize regexes is beneficial. But how about you turn the table over and ask yourself this, "Can I design a high, expressive, programming language that could be fully optimised without sacrificing human readability and productivity?" As far as I can tell, this is the holy grail of system research, or what's left of it nowadays...
Sure, it's NP-hard, but it's also nonrecursive via Rice's theorem.
Calling this NP-hard is like calling the Andromeda Galaxy "bigger than the average rhino".
Yeah, fuck people who have time to do useless things, like regex golf, and posting to Slashdot, and..... wait... what?
It's called exercising our 'little grey cells'. Beats watching TV.
And no, I didn't use two monitors. I used pencil and paper.
Ok, I get the need for optimization of coding habits. But as far as I understand this article however, it's about solving the XKCD "problem" - how to write a formula that separates elected presidents from unelected presidents. If I'm wrong, then I'm sorry. I read a bit of the article, and determined that it was a shot at being nifty - trying to get a bunch of brogramers to 'stroke out the biggest cock, in the shortest amount of time' sorta thing.
Politics; n. : A religion whereby man is god.
Python code? I would expect Perl or similar if parsing was your emphasis, or Lisp or some AI language if AI was your emphasis, but not a standard block-structured language.
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
Maybe if you and BringsApples didn't spend so much time posting to Slashdot, you would have more time to do something constructive and potentially educational. And if you didn't spend so much mental effort worrying about what others do to your stereotype and what you think you can accomplish by complaining about it on the internet, then maybe you could come up with something fun to do with your family that accomplishes the same thing, so you don't have to act like family time is mutually exclusive with free time.
As a junior cadet code monkey of a user, I come to /. precisely because I know my limitations when it comes to understanding tech, coding, and hard science. Despite the -1's, trolls, and certainty-addicted neckbeards (or, because of them) I've learned a lot about how intrinsically cool it is to code, the artistic side of coding, the wonders of science (and how it rivals religious experiences for appreciating one's place in the universe), and I've improved my debating skills on here. /. is one of the net benefits of my life, and I consider it a necessity. Chances are good, I'm not the only user who thinks that.
Here's to hot beer, cold women, and Glaswegian kisses for all.
sowing only web pages for delta airlines; and not delta faucets? that's "googlely."
and I just closed a multi-billion-dollar business deal over a round of regex golf at a very exclusive regex golf resort in the Bahamas.
Obligatory XKCD.
Oh, never mind.
There are sorts of "problems" that may seem a waste. When Einstein was pondering Relativity, I'm sure it would have been thought of in the same manner. The point is, sometimes great things arise from "trivial" pursuits. This could also be construed as a implementation of a search algorithm that can take a relatively broad search term and produce a more generalized search pattern, i.e. now you cannot specify give me presidential candidates A, B, C, D, E, F according to whether or not they won an election.
This may seem trivial, but it is this sort of logic/AI that may lead to computers in the future answers questions like we see in Star Trek.
So why did you "go to school for coding" if you clearly don't like it?
Well, I certainly didn't mean to attack anyone, I just didn't really know that my comments would hurt feelings. And perhaps you're right, maybe I don't understand why 'sorting a list in the quickest way possible, being made into a game' is useful. Can you enlighten me?
Politics; n. : A religion whereby man is god.
I believe this is what you were looking for.
"Coding" is what replaceable, uncreative idiots do. It's better to solve an interesting problem, for whatever definition of interesting is to the solver, than it to be a brainless drone making another fucking login page. Why don't you try to actually make something instead of letting the potential (that you probably don't have, to be honest) waste away? If you can't do that, then at least leave the actual programmers and computer scientist to practice the art while you die having accomplish nothing but having your code thrown in the trash a single support cycle after you get laid off.
I think all of IT have gone that way. It has become diluted by people who just want to solve their problem at hand as sloppily as possible and get home to watch TV.
For matching URLs from a domain, here's a regex we came up with that covers some special cases. Hopefully Slashdot doesn't mangle it too badly.
https?://([a-z0-9\-]+\.)*foo\.com(/|$)
That covers:
https as well as http
"subdomains" like maps.google.com as well as www.google.com and google.com
It's not fooled by google.com.hacker.ru
Oh, solving problems. See I thought that this was just about playing some programing-related game, via shortening algorithms. Hell, if we're talking about solving problems, let's get to world hunger and water filtration. I'm working on the world hunger bit. You wanna work with me, or do the water filtration bit?
Politics; n. : A religion whereby man is god.
if a_1=nopaper
if a_2=paper
//problem with this solution though is that it tends to be gender specific,
//and that is the problem with two compared variables in a statement it can easily lead to a 3turd condition.
This message was not sent from an iPhone because Peter Sellers really was a deviated prevert without a dime for the call
Anyway, let me direct your attention to the qualifier that I have attached that bit:
for whatever definition of interesting is to the solver
My reading of Norvig's blog post is that he suggests his specific approach (stapling together short regexps with ORs) requires solving the NP-Complete Set Cover problem, but he doesn't actually say anything about whether the core problem (match everything in list A and nothing in list B) does; it's conceivable that e.g. some sort of dynamical programming approach could do the job more efficiently than Norvig's algorithm does. Does anyone know whether the root problem (to avoid having to do the optimization, just phrase it as 'is there a separating regexp for the sets A and B of length less than k?') is specifically known to be NP-complete?
At least your family game night will give the kiddos useful skills for the future!
It's probably not very useful for a simple problem such as this, but imagine the space station being in a situation where every second counts to protect the life of the crew, such as a rupture that vents oxygen. It may have a n-length list of commands it can run, each with certain percentage chance of working in a particular situation and may need to run a quick simulation to guess which one has the greatest chance of saving the life of the crew. It would probably use some combo of min-max as well as quicksorting to find the one command with the greatest probability of survival in a minimal amount of time.
Agree fully. At first I thought *-search (pick your flavor) was just an odd mathematical quirk when it was introduced, until I was told it is used in the GPS system you have in your car. You start to see how each incremental step leads to ever greater things.
Turing complete languages don't have to be non-regular either. The tape for a Turing machine is obviously regular. How a language is structured and what it's capable of doing are pretty unrelated.
Have you ever noticed how everyone else in the world is not you? You might find it enlightening to think about that.
systemd is Roko's Basilisk.
I might be missing something here, but the list of winners and the list of losers in US presidential elections both contain Richard Nixon. How can a regexp match ALL the winners and NONE of the losers in that case?
I bought a touring machine once, but it hasn't come back yet.
This started Peter Norvig, the well-known computer scientist, director of research at Google and wearer of brightly colored shirts, thinking about the problem.
Now I got to get me some brightly colored shirts.
I am not your blowing wind, I am the lightning.
No, I don't particularly find those interesting and don't really care all that much
You don't eat or drink? I see your point, but how is coming up with better ways to feed the multitudes with limited land and resources not a puzzle? See, the thing that I cannot stand about this article is that it is going to attract the attention of some really smart people (maybe you are one of these) and take your time, where if you would direct your gigantic brain and all of it's magnificent abilities (and no, I'm not being sarcastic here) to problems that aim to make life better for all, then your own life would also be made better. If you cannot understand how that works (life better for me is life better for you), then make it into a puzzle, or whatever you have to do to make it interesting for you. Otherwise, if you don't understand how life functions, then your intelligence is temporary.
Politics; n. : A religion whereby man is god.
Do you play any games at all? Video, board, general games with your kids? How about sports? Baseball, basketball? Any of them. I'm sure there is something you do in your "life-time" that could be considered "not useful".
Because people who are encouraged to create new solutions for fun may comes up with something that someone getting whipped into coding may not.
More solutions, more creative ones, can mean the Terminators will be 20% faster at face-recognition and may not be so inaccurate at shooting.
Kids learn well by playing games, machines may learn to learn faster out of humans playing games.
It is the IT vs CS dilemma.
Aren't there any solutions posted anywhere? This is a good exercise if you want to spend time puzzling out the more arcane features of regular expression evaluators. Some of these aren't fully working solutions.
Abba is better as a negation of (.)(.)\2\1 , but I couldn't quickly figure out how to do it. Triples looks like it's supposed to make use of an old math trick to tell if a number is divisible by 3. Just add all the digits together to get another number. Repeat on the new number until a single digit is left, and if that digit is 0,3,6, or 9, then the number is divisble by 3. Will take me too much time to figure out how to program that into a regular expression. I cheated on Glob. Obviously it's supposed to sort the full matches from the partial matches. Balance is one of those exercises in being contrary, making a tool do what everyone says it's not meant to do. Just copied the matching string for the 2 Long count puzzles. 3012 points.
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
From Norvig's blog:
To avoid a contradiction and achieve Randall's intent, eliminate all winners from the set of losers:
In [293]: losers = losers - winners
The code on Norvig's blog is pretty interesting.
This one was worth my coffee break time today.
I might be missing something here, but the list of winners and the list of losers in US presidential elections both contain Richard Nixon. How can a regexp match ALL the winners and NONE of the losers in that case?
The methods to distribute food and to purify water are largely solved. If it weren't, then the developed world would not be able to support such large populations. It's a distribution and education problem in areas so afflicted. If you want, you can probably buy some tractors and water filters right now and have them shipped to poor villages. The techniques needed to make fertilizer or purify water by boiling are well known. What is lacking is the industrial capability to manufacture enough and the supporting infrastructure, and the unwillingness of foreign entities to spend money on it and hostility by groups that benefit from the status quo (be it the warlords that will seize the goods, the arms dealers that want the conflict, or anyone else who will accept bloodshed and misery in exchange for money). A political issue that no amount of "world changing" technology is going to fix. Not interesting.
And besides that, tinkering around with regex to do neat things, while clever, doesn't mean that that cleverness will transfer to other tasks. Would you trust your rocket scientist to give you brain surgery just because both tasks are considered intelligently prestigious enough to be in a popular idiom proclaiming their brilliance?
If you want, you can probably buy some tractors and water filters right now and have them shipped to poor villages.
The length at which I'd like to point out the intricacies of growing food and filtering water in areas of the world that are heavily populated and dry, far exceed my want to type, and I'm sure your want to read. Let's just say that sending over a tractor and a filter would be as close to solving the problem that they're having as sending over a scalpel and air-mask to your house, when you need heart surgery.
The techniques needed to make fertilizer or purify water by boiling are well known.
Are they? If they're so well-known, let's hear your take on it. How do you make fertilizer? How do you 'boil' 50,000 gallons of water? Because if you ever need food or water (God forbid) to the point that these people need it, then that's going to be your main objective of the day, not regex.
And let me remind you that if we cannot pull together as a people, then what we see in the areas where dictators rule food/water supply, we'll see here one day as well. $85 BILLION are printed each month. One day, someone is going to have to pay that back. It may not be you or I, and it may not be our kids, but one day the whole idea of simply going to the store and buying whatever you need is either going to be to expensive, or not an option. Humanity will need people like you (and whatever you teach your kids) to solve problems that we'll face.
...tinkering around with regex to do neat things, while clever, doesn't mean that that cleverness will transfer to other tasks.
Exactly. That's why it's important to train your way of thinking to be in line with the rest of the natural world, and not some way to do some man-made confined way of thinking. This is exactly why I think it's a waste of great human potential.
Politics; n. : A religion whereby man is god.
That's true, the regex is used with the non-case-sensitive flag. We use it with mod_rewrite, so it would be[NC] something like:
RewriteCond %{HTTP_REFERER} https?://([a-z0-9\-]+\.)*foo\.com(/|$) [NC]
In for example Perl the regex would be used within //i (or another delimiter for clarity).
It reads
that are TRUE. I have an elegant solution, but it requires Perl and more space than Slashdot allows for its comments.
"There is no god but allah" - well, they got it half right.