(Useful) Stupid Regex Tricks?

Posted by ScuttleMonkey on Monday November 10, 2008 @02:17AM from the hope-you-like-reading-lots-of-random-characters dept.

careysb writes to mention that, in the same vein as '*nix tricks' and 'VIM tricks', it would be nice to see a thread on regular expressions and the programs that use them. What amazingly cool tricks have people discovered with respect to regular expressions in everyday life as a developer or power user?

New Slashdot Section by Frankie70 · 2008-11-10 02:24 · Score: 5, Interesting

Maybe we should have a new section for "Useful Stupid Tricks" on Slashdot.
Regex Support by Extremus · 2008-11-10 02:30 · Score: 2, Interesting

I have used regex in the past, mainly for keeping long SQL scripts. The problem is the lack of full regex support in most editors. IMO the best (for Windows, at least) is EditPad Pro.
is it an rfc-822 compliant e-mail address? by Anonymous Coward · 2008-11-10 02:32 · Score: 3, Interesting

please validate using the rfc and not your sketchy interpretation of an e-mail address. /.*@.*\..*/ will not cut it.
Try instead
([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x22([^\\x0d\\x22\\x5c\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x22)(\\x2e([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x22([^\\x0d\\x22\\x5c\\x80-\\xff]|\\x5c\\x00-\\x7f)*\\x22))*\\x40([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x5b([^\\x0d\\x5b-\\x5d\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x5d)(\\x2e([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x5b([^\\x0d\\x5b-\\x5d\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x5d))*
See the original at http://www.iamcal.com/publish/articles/php/parsing_email/
1. Re:is it an rfc-822 compliant e-mail address? by Ken+D · 2008-11-10 04:26 · Score: 3, Interesting
  
  The problem is that email addresses are not suitable for regex based validation.
  There are too many legacy formats, too many variations, that are legal addresses.
  Why, back in the old days, you could send mail to things like "bob%example.com@example.org", which would shoot the email off to example.org, whose mail server would then shoot the email off to example.com. It was a way to hand-route your email around a broken network link. Throw in a few UUCP hops, maybe getting final delivery to a BITNET-connected system. Ah, those were the days!
Re:How about by Anonymous Coward · 2008-11-10 02:34 · Score: 5, Interesting

I actually like these. Nice little highly enriched concentrations of geekery on a single page. Think how long it might take to round up the sort of stuff that appears here by Googling.
Turing word: insipid
In a sentence: You find this page insipid but I find it inspiring.
99 Bottles of Beer on the wall by Pahalial · 2008-11-10 02:35 · Score: 2, Interesting

Saw this one recently, by Andrew Savige. He did use a Perl module to generate the regex itself, but even so!

http://search.cpan.org/dist/Acme-EyeDrops/lib/Acme/EyeDrops.pm#99_Bottles_of_Beer

(I would quote the final result but /. won't allow that many "junk" characters.. let's hope that doesn't cripple this entire discussion.)

--
Stuff.
Match a library call number by Gulthek · 2008-11-10 02:41 · Score: 4, Interesting

Here's a chunk of perl script I wrote (years ago) that determines if $text matches any of the styles of library call number that I've ever encountered.
Slashcode is interestingly interpreting my formatting, but you should get the gist.
$text =~ /
    ^[A-Z]+   # starts with at least one capital letter
    \s?       # followed by an optional space
    \d+       # followed by one or more digits
/x
or $text =~ /
    ^\d+      # starts with one or more digits
    \.        # followed by a single decimal
/x
or $text =~ /
    \d+       # starts with one or more digits
    \s        # and a space
/x
or $text =~ /
    Thesis    # starts with "Thesis"
    .+        # with one or more characters of any kind
    \d{4}     # then four numbers - year
    \s+       # separated by at least one space
    [A-Z]+    # from one or more capital letters
    \d+       # followed by one or more numbers
/xi           # case ignored here in case we run into THESIS or thesis
or $text =~ /
    \d+       # starts with one or more digits
    \-        # connected with a dash
    \d+       # to one or more following digits
/x
or $text =~ /
    \d+       # starts with one or more digits
              # followed by a space
    [A-Z]*    # followed by zero or more capital letters
    \d+       # followed by one or more digits
/x
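For anyone who would rather try this outside Perl, here is a rough Python sketch of the same approach. `re.VERBOSE` stands in for Perl's `/x`. Only three of the six patterns are ported, and the name `looks_like_call_number` and the sample strings are made up for illustration.

```python
import re

# Three of the patterns from the Perl above; re.VERBOSE allows the same
# whitespace-and-comments layout that /x does in Perl.
CALL_NUMBER_PATTERNS = [
    re.compile(r"""
        ^[A-Z]+   # starts with at least one capital letter
        \s?       # followed by an optional space
        \d+       # followed by one or more digits
    """, re.VERBOSE),
    re.compile(r"""
        ^\d+      # starts with one or more digits
        \.        # followed by a single decimal point
    """, re.VERBOSE),
    re.compile(r"""
        \d+       # one or more digits
        -         # connected with a dash
        \d+       # to one or more following digits
    """, re.VERBOSE),
]


def looks_like_call_number(text):
    """Return True if any of the ported patterns is found in text."""
    return any(p.search(text) for p in CALL_NUMBER_PATTERNS)


for sample in ["QA76.73", "510.5", "12-34", "hello world"]:
    print(sample, looks_like_call_number(sample))
```

Keeping the patterns in a list makes it easy to add the next odd call-number style without growing one giant alternation.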
Remove trailing whitespace by cerberusss · 2008-11-10 02:56 · Score: 3, Interesting

To remove trailing whitespace from a text file (vim regex; I don't know if the \s will work in other regex dialects):
:%s/\s\+$//e
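Outside vim, the same cleanup might look like this in Python. This is a sketch, not the poster's method: the function name is hypothetical, and it uses `[ \t]` rather than `\s` because Python's `\s` also matches newlines, which would let the pattern swallow blank lines.

```python
import re


def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of every line in text."""
    # With re.MULTILINE, $ matches at the end of each line, not just
    # at the end of the whole string.
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


print(repr(strip_trailing_whitespace("a  \n\nb\t\n")))
```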

--
8 of 13 people found this answer helpful. Did you?
Be lazy! by subreality · 2008-11-10 03:03 · Score: 4, Interesting

OK, you asked for stupid tricks, but this one's just plain lazy.
Between bash and grep, there are quite a lot of special characters that you have to escape... Or just ignore with dots!
/I.do.this.frequently..(even.with.parenthases).,.because.sometimes.my....backslash..key.is.tired/
A couple neat things happened: The extra dot after frequently is matching an inline paren. The paren in the PATTERN right next to it starts the mark of an atom, closed by its brother. The comma is because I put one outside the paren (here represented as the dot to the left of the comma) as is my style. Also note the literal backslash, just before you see the word backslash in hidden parenthesis.
Why not add quotes to match the spaces easily? I get a word or two in, and I find I naturally switch to using dots. These are throwaways for single tries through grep. For production code, I hone in carefully on the parts that I'm dead sure I can anchor to, escaped by any means needed, before carefully choosing my atom to match as tightly as possible, so it'll error out if my data has gone wrong.
Even in a simple case like this, half the fun is in explaining it. :)
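To make the trade-off concrete, here is a small Python sketch comparing the lazy dots with a properly escaped pattern. The sample line and the helper name `matches` are invented for the example; `re.escape` is the standard-library way to do the backslashing for you.

```python
import re

line = r"I do this frequently (even with parentheses), because my \ key is tired"

# The lazy way: every awkward character becomes a dot.
lazy = re.compile(r"frequently..even.with.parentheses.,")

# The careful way: let re.escape() add the backslashes.
strict = re.compile(re.escape("frequently (even with parentheses),"))


def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return pattern.search(text) is not None


print(matches(lazy, line), matches(strict, line))

# The cost of laziness: dots also accept text that merely has the right shape.
lookalike = "frequentlyXXevenXwithXparenthesesX,"
print(matches(lazy, lookalike), matches(strict, lookalike))
```

For a throwaway grep the false positives rarely matter; for production code the escaped version is the one that errors out when the data goes wrong.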
email validation... by Ramley · 2008-11-10 03:08 · Score: 2, Interesting

This was always useful when appropriate:
/^[\w.|-]+@(?:[\w.|-]{2,63}\.)+[a-z]{2,6}$/
It matches most everyday email addresses (a loose approximation of RFC 5322, not a full implementation), although it doesn't take into account an IP address as the domain (user@192.168.1.2).
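A quick way to see where a pattern like this draws the line is to run a few addresses through it. The Python sketch below uses the regex exactly as posted; the function name `is_plausible_email` and the test addresses are made up for illustration.

```python
import re

# The pattern from the post, unchanged apart from Python string quoting.
EMAIL = re.compile(r"^[\w.|-]+@(?:[\w.|-]{2,63}\.)+[a-z]{2,6}$")


def is_plausible_email(address):
    """Return True if address matches the posted pattern."""
    return EMAIL.match(address) is not None


for address in [
    "user@example.com",             # ordinary address
    "user@192.168.1.2",             # IP address as domain
    "bob%example.com@example.org",  # old-style percent routing
    "user@x.com",                   # one-letter domain label
]:
    print(address, is_plausible_email(address))
```

Only the first address passes: the pattern wants a lowercase alphabetic TLD, no `%` in the local part, and at least two characters per domain label, so it is a reasonable filter for typos rather than a standards check.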
some that I've used ... by ianare · 2008-11-10 03:11 · Score: 4, Interesting

SSN
^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$

US phone with or without parentheses
^\([0-9]{3}\)\s?[0-9]{3}(-|\s)?[0-9]{4}$|^[0-9]{3}-?[0-9]{3}-?[0-9]{4}$

ISO Date (19th to 21st century only)
^((18|19|20)\d\d)-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])$
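Here is a short Python sketch that exercises these three patterns. It assumes the parentheses in the phone pattern are meant as escaped literals, `\(` and `\)`. The helper `is_real_iso_date` is hypothetical and is there to show one limit of the date regex: it checks the shape of the string, not the calendar.

```python
import re
from datetime import date

SSN = re.compile(
    r"^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$"
)
US_PHONE = re.compile(
    r"^\([0-9]{3}\)\s?[0-9]{3}(-|\s)?[0-9]{4}$|^[0-9]{3}-?[0-9]{3}-?[0-9]{4}$"
)
ISO_DATE = re.compile(
    r"^((18|19|20)\d\d)-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])$"
)


def is_real_iso_date(text):
    """Regex check first, then let datetime reject impossible dates."""
    m = ISO_DATE.match(text)
    if not m:
        return False
    try:
        # group 1 is the year, group 3 the month, group 4 the day
        date(int(m.group(1)), int(m.group(3)), int(m.group(4)))
    except ValueError:
        return False
    return True


# The \3 backreference forces both SSN separators to be the same character.
print(bool(SSN.match("123-45-6789")), bool(SSN.match("123-45 6789")))
print(bool(US_PHONE.match("(555) 123-4567")), bool(US_PHONE.match("555-123-4567")))
# February 31st has the right shape but is not a real date.
print(bool(ISO_DATE.match("2008-02-31")), is_real_iso_date("2008-02-31"))
```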
Not a trick, but a question. by Janek+Kozicki · 2008-11-10 03:13 · Score: 2, Interesting

A friend and I were wondering one day whether it's possible with a regex to select a pattern that occurs two or more times in a single line, separated by arbitrary characters. For example, I want to select only lines in which the same match of the pattern "[FB][ot]o" occurs exactly two times (in the example below, . is any character, for clarity):

...Foo... - is not selected
...Foo...Bto... - is not selected
...Bto...Bto... - is selected
A normal /[FB][ot]o.*[FB][ot]o/ would select the second and third cases. But I only want the third case. The first occurrence would define my pattern, and the second occurrence must exactly match it. Magic stuff like this is not working: /\([FB][ot]o\).*\1/ although that seems to be the closest description of what we wanted.
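For what it's worth, a capture group plus a backreference does behave this way in Python (and in Perl), where the group is written with bare parentheses; whether the parentheses need backslashes depends on the dialect. A minimal sketch with a made-up function name follows. Note that it selects lines where the same text occurs at least twice, not exactly twice.

```python
import re

# ([FB][ot]o) captures whatever actually matched; \1 then demands the
# very same text again later in the line.
REPEATED = re.compile(r"([FB][ot]o).*\1")


def has_repeated_match(line):
    """Return True if the same [FB][ot]o text appears twice in line."""
    return REPEATED.search(line) is not None


for line in ["...Foo...", "...Foo...Bto...", "...Bto...Bto..."]:
    print(line, has_repeated_match(line))
```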

--
# #\ @ ? Colonize Mars #
Re:Here's One for Slashdot Stories! by Talderas · 2008-11-10 03:21 · Score: 2, Interesting

You can permanently cloak zerg units that can burrow if you control an arbiter. By burrowing the zerg unit just as it enters the arbiter's cloaking field radius, the zerg will become permanently cloaked.

--
"Lack of speed can be overcome. In the worst case by patience." --Znork
Re:IP and Hardware addresses by nschubach · 2008-11-10 03:25 · Score: 4, Interesting

There's a really cool little "real time" regex analyzer written in Flex (if you're not one of those scared to death by Flash content):
http://gskinner.com/RegExr/
Maybe you can monkey your way into "regexing" the a out of apple :p

--
Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
Validating credit card numbers by hansamurai · 2008-11-10 03:47 · Score: 2, Interesting

Does anyone know if the Luhn Algorithm can be implemented in regex only?
http://en.wikipedia.org/wiki/Luhn_algorithm
(sorry if I double post this... I swear I posted it 10 minutes ago)
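In principle the check only needs a running sum modulo 10, so a finite automaton could do it, but the resulting pattern would be unwieldy. A more practical split is to let a regex handle the formatting and plain arithmetic handle the checksum. A sketch in Python, where the function name `luhn_valid` and the accepted separators (spaces and dashes) are assumptions for the example:

```python
import re


def luhn_valid(number):
    """Return True if number passes the Luhn checksum."""
    # Regex part: drop common separators, then insist on digits only.
    digits = re.sub(r"[ -]", "", number)
    if not re.fullmatch(r"\d+", digits):
        return False

    # Arithmetic part: from the right, double every second digit and
    # subtract 9 from any result above 9, then sum everything.
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```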

--
Reviewing just the first hour of video games.
Re:How about by Bandman · 2008-11-10 03:53 · Score: 4, Interesting

I like it, but I've got a bookmark folder called "Slash-doc" where I store useful threads that contain a lot of information.
I've got a lot of threads bookmarked.
Best Practices for Process Documentation
How would you make a distributed Office system
Quality Open Source / Calendar / Messaging Systems
and some others.
Some of the information in the threads is out of date, but the ideas are useful and interesting to read. I need to go back through Ask Slashdot and get the more recent threads that seem to act as references

--
Check out my sysadmin blog!
Useful parsing configs in bash by Bandman · 2008-11-10 04:02 · Score: 2, Interesting

I have to parse files with bash sometimes, and I use these:
^# = line with a leading comment
^$ = empty line
They're simple, but they usually work. You can make them a lot more bulletproof by allowing for whitespace between the characters, but they seem to work as they are.
cat httpd.conf | grep -v \^\# | grep -v \^\$ | less
makes httpd.conf a lot more readable.
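The whitespace-tolerant version mentioned above can be folded into a single pattern. Here is a sketch in Python; the function name and the sample config text are invented, and the roughly equivalent grep call would be `grep -Ev '^[[:space:]]*(#|$)' httpd.conf`.

```python
import re

# A line is noise if, after optional leading whitespace, it is either
# a comment or empty.
NOISE = re.compile(r"^\s*(#|$)")


def strip_comments(text):
    """Return the lines of text that are neither blank nor comments."""
    return [line for line in text.splitlines() if not NOISE.match(line)]


sample = "# comment\n\n  # indented comment\nListen 80\n   \nServerName x # trailing\n"
print(strip_comments(sample))
```

Trailing comments on a real directive are left alone, which is usually what you want when skimming a config file.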

--
Check out my sysadmin blog!
Re:IP and Hardware addresses by josecanuc · 2008-11-10 04:19 · Score: 2, Interesting

There's quite a few. I mostly lurk, occasionally post when the topic is something I know well or if I have a snarky comment.
valid utf-8 string by Danny+Rathjens · 2008-11-10 04:26 · Score: 2, Interesting

Here is the crazy regex to detect a valid UTF-8 string. :)
/^(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$/x
This can crash perl if the string being checked is too big. :D So it's usually better to just let perl attempt to decode anything non-ascii as utf8 and see if it fails or not. (And hope all the utf8 parsing exploits have been fixed :)
eval { $param = decode( 'utf8', $param, Encode::FB_CROAK) if $param =~ /[^\x00-\x7E]/ };
$param = decode( 'iso-8859-1', $param, Encode::FB_CROAK) if $@; # utf8 decode of non-ascii text failed so treat as latin1
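The same two approaches can be put side by side in Python as a rough sketch: the byte-level regex transcribed from the post, and the "just try to decode it" route. The function names are made up. One difference worth knowing is that the regex's ASCII class allows only tab, LF and CR among the control characters, while a plain decode accepts any of them.

```python
import re

# Bytes pattern transcribed from the Perl regex; re.VERBOSE mirrors /x.
UTF8 = re.compile(rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*
""", re.VERBOSE)


def is_utf8_regex(data):
    """Check bytes against the pattern above."""
    return UTF8.fullmatch(data) is not None


def is_utf8_decode(data):
    """Check bytes by attempting a strict decode."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for data in [b"h\xc3\xa9llo", b"\xe9", b"\xc0\xaf", b"\xed\xa0\x80", b"\x00"]:
    print(data, is_utf8_regex(data), is_utf8_decode(data))
```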
Re:IP and Hardware addresses by LordKronos · 2008-11-10 04:38 · Score: 2, Interesting

Oh, wow, you are right. Using 0177.0.0.1 in firefox gets you to localhost, as does 0x7f.0.0.1
Nice catch.
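The octal and hex forms work because classic address parsers (in the `inet_aton` tradition) read a leading `0x` as hex and a leading `0` as octal, which is one reason a digits-and-dots regex alone can be misleading. A minimal Python sketch of that rule follows; `parse_part` and `normalize_ipv4` are hypothetical helpers, and only the four-part dotted form is handled.

```python
def parse_part(part):
    """Parse one dotted-quad component the way inet_aton-style parsers do."""
    lowered = part.lower()
    if lowered.startswith("0x"):
        return int(lowered, 16)      # 0x7f -> 127
    if len(part) > 1 and part.startswith("0"):
        return int(part, 8)          # 0177 -> 127
    return int(part, 10)


def normalize_ipv4(address):
    """Return address rewritten as plain decimal octets."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated parts")
    octets = [parse_part(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError("octet out of range")
    return ".".join(str(o) for o in octets)


print(normalize_ipv4("0177.0.0.1"))
print(normalize_ipv4("0x7f.0.0.1"))
```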
Re:IP and Hardware addresses by josecanuc · 2008-11-10 05:06 · Score: 3, Interesting

Folks who think a low ID means an old person: get real. Slashdot hasn't been around forever. It started in 1997. Accounts were added later.
Folks with a low ID just happened to register within the few months following the addition of accounts. Must have been 1998 or 1999. I was in college at the time. I'm currently not yet 30 years old. Is that old to you?
Re:Here's One for Slashdot Stories! by Eponymous+Bastard · 2008-11-10 06:46 · Score: 2, Interesting

On fastest:
Stasis your opponents' fleet with your arbiter and start a 15 second countdown. On zero, your teammate nukes the stasis. Wait 30 seconds for the nuke to come down right on your open-mouthed opponents' fleet.
To add insult to injury, if you manage to stasis both their and your ships, you can recall them out right before the nuke hits.
Re:IP and Hardware addresses by LordKronos · 2008-11-10 07:44 · Score: 2, Interesting

Yes:
http://search.cpan.org/~abigail/Regexp-Common-2.122/lib/Regexp/Common/profanity.pm