(Useful) Stupid Regex Tricks?
careysb writes to mention that in the same vein as '*nix tricks' and 'VIM tricks', it would be nice to see one on regular expressions and the programs that use them. "What amazingly cool tricks have people discovered with respect to regular expressions in everyday life as a developer or power user?"
To filter a string to make sure it's a valid IP address, this regexp is quite useful:
/^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$/
And this one for MAC addresses:
/^[0-9a-fA-F]{2}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}$/
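For anyone who wants to try the two patterns above outside Perl, here is a minimal Python sketch. The names is_ipv4 and is_mac are made up for illustration, the MAC pattern is shortened with a repeat count, and re.fullmatch stands in for the ^ and $ anchors (an assumption about intent: it also rejects a trailing newline).

```python
import re

# One octet, 0-255, with no leading zeros (same alternation as the posted regex).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

# Six colon-separated pairs of hex digits.
MAC_RE = re.compile(r"[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5}")


def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted quad whose octets are all 0-255."""
    return IPV4_RE.fullmatch(text) is not None


def is_mac(text: str) -> bool:
    """Return True if text looks like a colon-separated MAC address."""
    return MAC_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("192.168.1.1", "256.1.1.1", "00:1A:2b:3C:4d:5E"):
        print(candidate, is_ipv4(candidate), is_mac(candidate))
```

If only validation is needed, Python's standard ipaddress module is an alternative to the hand-written IPv4 pattern.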
(Useful) Stupid * Tricks
Yes sir, that will guarantee a front page story. You better head back to the drawing board if it doesn't fit that pattern. Next week: (Useful) Stupid Starcraft Tricks.
Maybe we should have a new section for "Useful Stupid Tricks" on Slashdot.
You see yourself in digg.com. You are likely to be eaten by a grue.
-- Even though I walk through the valley of darkness and death, my PowerMac G4 Will Not Crash!!!
I have used regex in the past, mainly for maintaining long SQL scripts. The problem is the lack of full regex support in most editors. IMO the best (for Windows, at least) is EditPad Pro.
Stupid (Useful) Ask Slashdot tricks?
I'm not sure whether these are legitimate, or just an "I don't know what the hell I'm doing, so let's see if I can get someone else to show me how to do my job, under the guise of sharing information."
I'd like to say the former, but my cynicism is making me lean to the latter.....
"City hall" in German is "Rathaus" Kinda explains a few things......
Beautiful regexp that validates RFC 822 addresses: Mail-RFC822-Address.html
Unselfish actions pay back better
MS Office does support regexp. While not as good as Perl regex, it is very helpful.
Link to a .bas add-on for regexp in Excel, which helped me a lot:
http://www.tmehta.com/regexp/using_functions.htm
Don't forget to add the library reference {Tools->References->Microsoft VBScript Regular Expressions 5.5}
please validate using the rfc and not your sketchy interpretation of an e-mail address. /.*@.*\..*/ will not cut it.
Try instead
([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x22([^\\x0d\\x22\\x5c\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x22)(\\x2e([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x22([^\\x0d\\x22\\x5c\\x80-\\xff]|\\x5c\\x00-\\x7f)*\\x22))*\\x40([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x5b([^\\x0d\\x5b-\\x5d\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x5d)(\\x2e([^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+|\\x5b([^\\x0d\\x5b-\\x5d\\x80-\\xff]|\\x5c[\\x00-\\x7f])*\\x5d))*
See the original at http://www.iamcal.com/publish/articles/php/parsing_email/
Saw this one recently, by Andrew Savige. He did use a Perl module to generate the regex itself, but even so!
http://search.cpan.org/dist/Acme-EyeDrops/lib/Acme/EyeDrops.pm#99_Bottles_of_Beer
(I would quote the final result but /. won't allow that many "junk" characters... let's hope that doesn't cripple this entire discussion.)
Stuff.
Why couldn't Bill try out his regular expressions?
His mom wouldn't let him play with matches.
Here's a chunk of perl script I wrote (years ago) that determines if $text matches any of the styles of library call number that I've ever encountered.
Slashcode is interestingly interpreting my formatting, but you should get the gist.
$text =~ /
    ^[A-Z]+   # starts with at least one capital letter
    \s?       # followed by an optional space
    \d+       # followed by one or more digits
/x
or $text =~ /
    ^\d+      # starts with one or more digits
    \.        # followed by a single decimal
/x
or $text =~ /
    \d+       # starts with one or more digits
    \s        # and a space
/x
or $text =~ /
    Thesis    # starts with "Thesis"
    \d{4}     # then four numbers - year
    \s+       # separated by at least one space
    [A-Z]+    # from one or more capital letters
    \d+       # followed by one or more numbers
/x
or $text =~ /
    \d+       # starts with one or more digits
    \-        # connected with a dash
    \d+       # to one or more following digits
/x
or $text =~ /
    \d+       # starts with one or more digits
    \s        # followed by a space
    [A-Z]*    # followed by zero or more capital letters
    \d+       # followed by one or more digits
/x;
I've never found regexes to be useful at all. I prefer to write my own parsers from scratch in assembly language, or conway's game of life, if I'm feeling m/(ambitious|artistic|autistic|masochistic)/.
But even an artist gets lazy sometimes.
This regex matches a number: integer or float, scientific notation or plain, plus or minus...
[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+\b)(?:[eE][-+]?[0-9]+\b)?
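A quick way to see what the number pattern above picks up is to run it over a sample string. This is a minimal Python sketch; the function name find_numbers and the sample text are made up for illustration, and the pattern is copied as posted.

```python
import re

# The number pattern from the post: optional sign, integer or decimal part,
# optional exponent.
NUMBER_RE = re.compile(
    r"[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+\b)(?:[eE][-+]?[0-9]+\b)?"
)


def find_numbers(text: str) -> list:
    """Return every substring of text that the pattern treats as a number."""
    return NUMBER_RE.findall(text)


if __name__ == "__main__":
    sample = "x = -3.14e10, y = .5 and z = 42"
    for token in find_numbers(sample):
        # float() accepts every form this pattern is meant to match.
        print(token, float(token))
```

Because the pattern has only non-capturing groups, findall returns the whole matched strings rather than group contents.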
Colorless green Cthulhu waits dreaming furiously.
use Regexp::Common qw(URI net);
$text_with_urls =~ m/$RE{URI}/;
$text_with_ips =~ m/$RE{net}{IPv4}/;
Build it, and they will come^Hplain.
stuff that matters
understand the concept?
if not, try going to this site, it looks like it might be more your speed
buhbye
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
8 of 13 people found this answer helpful. Did you?
I wonder why such FAQs are still posted on a site like Slashdot. We now have a great repository for exactly this kind of questions:
http://stackoverflow.com/questions/tagged?tagnames=regex&sort=votes&pagesize=15
on the daily WTF: http://thedailywtf.com/Articles/Now-I-Have-Two-Hundred-Problems.aspx enjoy!
Cal Henderson's routine is the best RFC compliant regex I have ever found to verify an email address:
http://code.iamcal.com/php/rfc822/
OK, you asked for stupid tricks, but this one's just plain lazy.
Between bash and grep, there are quite a lot of special characters that you have to escape... Or just ignore with dots!
/I.do.this.frequently..(even.with.parenthases).,.because.sometimes.my....backslash..key.is.tired/
A couple of neat things happened: The extra dot after frequently is matching an inline paren. The paren in the PATTERN right next to it starts the mark of an atom, closed by its brother. The comma is because I put one outside the paren (here represented as the dot to the left of the comma) as is my style. Also note the literal backslash, just before you see the word backslash in hidden parentheses.
Why not add quotes to match the spaces easily? I get a word or two in, and I find I naturally switch to using dots. These are throwaways for single tries through grep. For production code, I hone in carefully on the parts that I'm dead sure I can anchor to, escaped by any means needed, before carefully choosing my atom to match as tightly as possible, so it'll error out if my data has gone wrong.
Even in a simple case like this, half the fun is in explaining it. :)
my $re;
$re = qr/
    \{ (?:
        (?> [^{}]+ )   # non-braces
        |
        (??{ $re })    # nested block of braces
    )* \}
/x;
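Python's standard re module has no counterpart to Perl's (??{ }) recursion, so the sketch below swaps in a different technique: a plain depth counter that walks the string once. It is not a translation of the regex, only a way to get at the same balanced-brace blocks. The name find_brace_blocks is hypothetical, and unlike the regex it skips any block whose opening brace is never closed.

```python
from typing import List


def find_brace_blocks(text: str) -> List[str]:
    """Return each top-level {...} block in text, nested braces included."""
    blocks = []
    depth = 0      # how many braces are currently open
    start = 0      # index of the brace that opened the current top-level block
    for index, char in enumerate(text):
        if char == "{":
            if depth == 0:
                start = index
            depth += 1
        elif char == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                blocks.append(text[start:index + 1])
    return blocks


if __name__ == "__main__":
    print(find_brace_blocks("a {b {c} d} e {f}"))
```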
factor 966971: 966971
This was always useful when appropriate: /^[\w.|-]+@(?:[\w.|-]{2,63}\.)+[a-z]{2,6}$/
Validates an email address (RFC 5322), although it does not take into account an IP address as the domain (user@192.168.1.2)
SSN
^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$
US phone with or without parentheses
^\([0-9]{3}\)\s?[0-9]{3}(-|\s)?[0-9]{4}$|^[0-9]{3}-?[0-9]{3}-?[0-9]{4}$
ISO Date (19th to 21st century only)
^((18|19|20)\d\d)-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])$
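One thing worth keeping in mind with the ISO date pattern above: it checks the shape of the string, not the calendar, so a day like February 31st passes. A minimal Python sketch of pairing the regex with a real date parse follows; the function names are made up for illustration, and re.ASCII is added on the assumption that only ASCII digits should count.

```python
import re
from datetime import date

# The posted ISO date pattern; fullmatch replaces the ^ and $ anchors.
ISO_DATE_RE = re.compile(
    r"((18|19|20)\d\d)-(0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-9]|3[01])",
    re.ASCII,
)


def looks_like_iso_date(text: str) -> bool:
    """Shape check only: YYYY-MM-DD with a year from 1800 to 2099."""
    return ISO_DATE_RE.fullmatch(text) is not None


def is_real_iso_date(text: str) -> bool:
    """Shape check plus a calendar check via the standard library."""
    if not looks_like_iso_date(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("1999-12-31", "2023-02-31", "1799-01-01"):
        print(candidate, looks_like_iso_date(candidate), is_real_iso_date(candidate))
```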
That's also one of my favorites. Python has this feature too.
Colorless green Cthulhu waits dreaming furiously.
A friend and I were wondering one day whether it's possible with regex to select a pattern which occurs two or more times in a single line, separated by arbitrary characters. For example, I want to select only lines in which the same text matching "[FB][ot]o" occurs exactly two times (in the example below . is any character, for clarity):
...Foo... - is not selected
...Foo...Bto... - is not selected
...Bto...Bto... - is selected
A normal /[FB][ot]o.*[FB][ot]o/ would select the second and third cases. But I only want the third case. The first occurrence would define my pattern, and the second occurrence must exactly match it. Magic stuff like this is not working: /\([FB][ot]o\).*\1/ although that seems to be the closest description of what we wanted.
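One likely reason the last attempt fails: in Perl-style syntax (Perl, PCRE, Python) a backslashed parenthesis is a literal character and plain ( ) form the capture group, whereas in POSIX basic regular expressions, as used by grep and sed without -E, it is the other way around. A minimal Python sketch with unescaped parentheses follows; the function name is hypothetical, and note that it checks for "at least twice", not "exactly twice".

```python
import re

# Group 1 captures whichever variant matched first; \1 then requires
# that exact same text again later in the line.
REPEAT_RE = re.compile(r"([FB][ot]o).*\1")


def has_repeated_token(line: str) -> bool:
    """True if the same [FB][ot]o text occurs at least twice in line."""
    return REPEAT_RE.search(line) is not None


if __name__ == "__main__":
    for line in ("...Foo...", "...Foo...Bto...", "...Bto...Bto..."):
        print(line, has_repeated_token(line))
```

With grep, the pattern as originally written should behave the same way, since there the backslashed parentheses are the grouping form.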
#
#\ @ ? Colonize Mars
#
While I'm not providing any specific trick per se, on topic are a few useful links:
http://www.regular-expressions.info/ - this one is handy for regex info particularly in Javascript which I use so infrequently I need to know how to match, capture, substitute, etc.
http://perldoc.perl.org/perlre.html - plenty of regex info there which is Perl specific, but of course extends to many other similar implementations
I only post comments when someone on the internet is wrong.
There are no Stupid Starcraft Tricks.
My Starcraft 2 Blog
This is easy once you re-word your definition of a word: in your case, a word starts with a capital followed by a run of non-capital letters.
The regex: /[A-Z][a-z]*/
Will match the first of such words in a string. (It will also match single-character words; change * to + if you don't want that.) Make sure your matching is case-sensitive for this to work. Many regex engines will have an abbreviation for [A-Z] and [a-z] you can use instead.
To get the second of such words in string: /[A-Z][a-z]*[^A-Z]*([A-Z][a-z]*)/
The second word will be in the first sub-match (\1). The [^A-Z]* will gobble up everything between the last letter of the first word and the start of the second word. If there is no second word, this match will fail.
Repeat the first part of this (everything up to the open parenthesis) to get the third word, fourth word, etc. Rather than repeating that part of the expression, you can use parentheses and counts (usually {n,m}) for this in most engines.
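The "parentheses and counts" idea can be sketched in Python as follows. The function name nth_capitalized_word and the sample sentence are made up for illustration; the pattern is the one described above, with the skipped prefix wrapped in a non-capturing group and repeated n - 1 times.

```python
import re
from typing import Optional

WORD = r"[A-Z][a-z]*"   # a capital followed by a run of lowercase letters


def nth_capitalized_word(text: str, n: int) -> Optional[str]:
    """Return the n-th capitalized word in text (1-based), or None."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Skip n - 1 words (each followed by any non-capitals), capture the next.
    pattern = rf"(?:{WORD}[^A-Z]*){{{n - 1}}}({WORD})"
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sentence = "the Quick brown Fox jumps over Lazy dogs"
    for position in (1, 2, 3, 4):
        print(position, nth_capitalized_word(sentence, position))
```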
Does anyone know if the Luhn Algorithm can be implemented in regex only?
http://en.wikipedia.org/wiki/Luhn_algorithm
(sorry if I double post this... I swear I posted it 10 minutes ago)
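Whether or not a pure regex is feasible, a common split is to let the regex handle only the format and do the checksum arithmetic in ordinary code. A minimal Python sketch under that assumption follows; the function name luhn_valid is made up for illustration, and allowing spaces and dashes as separators is also an assumption.

```python
import re


def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum."""
    digits = re.sub(r"[ -]", "", number)       # drop spaces and dashes
    if not re.fullmatch(r"[0-9]+", digits):    # regex handles format only
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9                     # same as summing the two digits
        total += value
    return total % 10 == 0


if __name__ == "__main__":
    for candidate in ("79927398713", "79927398710", "4111 1111 1111 1111"):
        print(candidate, luhn_valid(candidate))
```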
Reviewing just the first hour of video games.
#$%^&*(&^%{{}}{/\/\||```
(No, that's not a regex at all. And no, I don't even have a single girlfriend.)
8 of 13 people found this answer helpful. Did you?
I have to parse files with bash sometimes, and I use these:
^# = line with a leading comment
^$ = empty line
They're simple, but usually work. You can make them a lot more bulletproof by also checking for blanks between the characters, but they seem to work.
cat httpd.conf | grep -v \^\# | grep -v \^\$ | less
makes httpd.conf a lot more readable.
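The "blank checking" mentioned above can be sketched like this in Python, as a stand-in for the grep pipeline rather than a replacement for it. The function name and the sample lines are made up for illustration; the single pattern allows leading whitespace before a # and treats whitespace-only lines as empty.

```python
import re

# Optional whitespace, then either a comment marker or the end of the line.
SKIP_RE = re.compile(r"^\s*(#|$)")


def strip_comments_and_blanks(lines):
    """Keep only lines that are neither comments nor blank."""
    return [line for line in lines if not SKIP_RE.match(line)]


if __name__ == "__main__":
    sample = [
        "# main config\n",
        "\n",
        "Listen 80\n",
        "   # indented comment\n",
        "   \n",
        "ServerName example.com  # trailing comments are kept\n",
    ]
    for line in strip_comments_and_blanks(sample):
        print(line, end="")
```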
Check out my sysadmin blog!
Bad filename character for Windows (if it matches, the filename is invalid):
E-mail (use case insensitive):
GUID (use case insensitive):
IP on local private network:
Removes .NET named capture syntax so that a .NET Regex string can be used elsewhere (such as Javascript) (replace with nothing):
Flame away about how horrible it is that I missed some edge case that even nobody on Slashdot has ever heard of, but they work well for me and hopefully for you too.
Now, if you actually find a common case that I missed, I would appreciate the help...
Peter predicted that you would "deliberately forget" creation 2000 years ago...
You are a great candidate for the Useless Use of Cat award... especially endearing is your making a comment on the few commands your line uses :D
Here is the crazy regex to detect a valid UTF-8 string. :)
/^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x
This can crash perl if the string being checked is too big.
So it's usually better to just let perl attempt to decode anything non-ASCII as UTF-8 and see if it fails or not. (And hope all the UTF-8 parsing exploits have been fixed. :D)
use Encode qw(decode);
eval { $param = decode( 'utf8', $param, Encode::FB_CROAK) if $param =~ /[^\x00-\x7E]/ };
$param = decode( 'iso-8859-1', $param, Encode::FB_CROAK) if $@; # utf8 decode of non-ascii text failed so treat as latin1
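The same "try UTF-8 first, fall back to Latin-1" idea looks like this in Python, as a minimal sketch. The function name decode_param is made up for illustration; it relies on Python's built-in UTF-8 decoder being strict by default and raising UnicodeDecodeError on invalid input.

```python
def decode_param(raw: bytes) -> str:
    """Decode raw as UTF-8, or as ISO-8859-1 if it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is valid ISO-8859-1, so this branch cannot fail.
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    for raw in (b"plain", "caf\u00e9".encode("utf-8"), b"caf\xe9"):
        print(repr(raw), decode_param(raw))
```

As with the Perl version, the fallback is a guess: bytes that are not valid UTF-8 are simply assumed to be Latin-1.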
Dear slashdot editors,
slashdot.org is not stackoverflow.com.
The articles and discussions here are not searchable in a sane way. Your recent attempts to mimic stackoverflow are just a waste of everybody's time because all those little tidbits that people post get lost in the internet noise immediately.
We know you're a bit desperate for traffic these days. But this is not the way to go.
That was a new one on me; I hadn't encountered that award before. Would something like:
<REPORT_NAME sed 's/[^a-z0-9,.-]//gi' > REPORT.out
be preferable in this instance?
It is a solemn thought: dead, the noblest man's meat is inferior to pork.
I came up with a Regex that can be used to match literally anything (yes, anything!). It is, therefore, the most flexible regex ever concocted. Here it is:
.*
Your regex doesn't allow + signs in the name part.
Nor, I would suspect, would it handle quoted strings, e.g. "Jeremy P"@example.com is technically a valid RFC 822 address.
And having just looked up the RFC 5322 spec which you quote, I see there are more cases you fail to take account of, e.g.
Jeremy P <jeremyp@example.com>
Also, what makes you think upper case in domain names is invalid? jeremyp@example.COM fails validation.
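The objections in this reply are easy to check by running the pattern posted earlier in the thread against the addresses mentioned here. A minimal Python sketch follows; the helper name matches is hypothetical, and the ignore_case switch is only there to show the effect of a case-insensitive flag.

```python
import re

# The e-mail pattern posted earlier in the thread, copied without the slashes.
POSTED_PATTERN = r"^[\w.|-]+@(?:[\w.|-]{2,63}\.)+[a-z]{2,6}$"


def matches(address: str, ignore_case: bool = False) -> bool:
    """True if address is accepted by the posted pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(POSTED_PATTERN, address, flags) is not None


if __name__ == "__main__":
    for address in (
        "jeremyp@example.com",
        "jeremy+p@example.com",
        '"Jeremy P"@example.com',
        "jeremyp@example.COM",
    ):
        print(address, matches(address), matches(address, ignore_case=True))
```

The upper-case domain is only accepted once the match is made case-insensitive; the + sign and the quoted local part are rejected either way.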
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski
>> Standing on head makes smile of frown, but rest of face also upside down.
IMHO, this is exactly the way that Slashdot should be going. Threads like this are interesting, add to the reservoirs of internet knowledge, and have the highest quality to noise ratios.
I (and I suspect many others) read Slashdot not for the latest +5 funny comment (though those can be fun to read) but to read the opinions of brilliant minds. And when those minds start trading secrets... Everyone wins.
cause it's an interesting discussion of a common (mis)understanding. did you know the RFC specifies leading-zero-for-octal and leading-0x-for-hex? i knew those were commonly used conventions in some places but didn't know that included IP addresses.
if the mods do their job, the posts correcting the GP's mistaken understanding will also score high marks.