Joe’s Notes on Perl #5 – Regular Expressions (Regex)

Coming up is something that could be considered infuriating but very useful, if not the whole point of Perl.

In effect, regular expressions (regexes) scan a scalar looking for patterns of text using specific rules.

I’m not going to do too many examples, because regexes vary so much from one job to the next.

Regexes hurt

This will be you in no time! – XKCD



 

General syntax: $string =~ /pattern/

Here are some examples of how regexes work:

Matches are done in a case sensitive manner unless you say otherwise, which I’ll cover later.
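As a rough sketch of the general syntax in action (the sample string and variable names here are made up for illustration):

```perl
use strict;
use warnings;

# Sample text, assumed purely for illustration
my $string = "Copius cats can carefully creep";

# =~ binds the scalar to the pattern; the match is true or false
my $has_cats = ($string =~ /cats/) ? 1 : 0;   # lowercase "cats" is in the string
my $has_Cats = ($string =~ /Cats/) ? 1 : 0;   # capital C is not, matching is case sensitive

print "cats: $has_cats, Cats: $has_Cats\n";   # prints "cats: 1, Cats: 0"
```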

Square brackets define a character class, which can cover ranges of letters or numbers.

  • [A-Z] will find any uppercase letter from A to Z
  • [a-zA-Z] will find any letter from A to Z regardless of case
  • [1-9] will find any single digit from 1 to 9
  • [A-GQLX-Z] will find the letters A to G, Q, L, and X to Z (no commas or spaces, as they would count as characters to match)
  • Scalars work in place of text, as they are interpolated into the pattern
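A quick sketch of character classes, using made-up test strings. Note that a class is written without commas or spaces between its members, since those would themselves be treated as characters to match.

```perl
use strict;
use warnings;

my $upper  = ("Q" =~ /[A-Z]/)       ? 1 : 0;   # 1: Q is an uppercase letter
my $lower  = ("q" =~ /[A-Z]/)       ? 1 : 0;   # 0: lowercase is outside A-Z
my $either = ("q" =~ /[a-zA-Z]/)    ? 1 : 0;   # 1: both ranges listed
my $zero   = ("0" =~ /[1-9]/)       ? 1 : 0;   # 0: zero is outside 1-9
my $mixed  = ("L" =~ /[A-GQLX-Z]/)  ? 1 : 0;   # 1: L is listed on its own

# A scalar is interpolated into the pattern
my $name  = "Socks";
my $found = ("Socks the cat" =~ /$name/) ? 1 : 0;   # 1

print "$upper $lower $either $zero $mixed $found\n";   # prints "1 0 1 0 1 1"
```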

To find special characters, escape them with a backslash:

  • \^
  • \$
  • \\
  • \.
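Here is a small sketch of why escaping matters, with invented sample strings. An unescaped “.” is a wildcard, so forgetting the backslash quietly matches more than you meant.

```perl
use strict;
use warnings;

my $price = 'Cost: $5.00';   # single quotes keep the $ literal in the string

my $exact  = ($price =~ /\$5\.00/) ? 1 : 0;   # 1: literal dollar sign and literal dot
my $loose  = ("5x00" =~ /5.00/)    ? 1 : 0;   # 1: unescaped . matches the x
my $strict = ("5x00" =~ /5\.00/)   ? 1 : 0;   # 0: \. only matches a real dot

my $path  = 'C:\temp';
my $slash = ($path =~ /\\temp/)    ? 1 : 0;   # 1: \\ matches one literal backslash

print "$exact $loose $strict $slash\n";   # prints "1 1 0 1"
```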

So as you can see, you can precisely configure what to search for. However, Perl makes things even easier with these shorthands:

  • \w for any word character (letter, digit or underscore)
  • \d for any digit
  • \s for any whitespace character
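A short sketch of the shorthands, each tested against a single made-up character. Each one matches exactly one character of its kind.

```perl
use strict;
use warnings;

my $digit      = ("7"  =~ /\d/) ? 1 : 0;   # 1: 7 is a digit
my $not_digit  = ("x"  =~ /\d/) ? 1 : 0;   # 0: x is not
my $underscore = ("_"  =~ /\w/) ? 1 : 0;   # 1: underscore counts as a word character
my $hyphen     = ("-"  =~ /\w/) ? 1 : 0;   # 0: hyphen does not
my $tab        = ("\t" =~ /\s/) ? 1 : 0;   # 1: a tab is whitespace

print "$digit $not_digit $underscore $hyphen $tab\n";   # prints "1 0 1 0 1"
```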

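To alternate between matches, use vertical bars. E.g.:

  • Socks|Poppy|Milly will look for “Socks”, “Poppy” or “Milly”
  • (Socks Poppy)|Milly will look for “Socks Poppy” or “Milly”
  • (Socks|Poppy) Milly will look for “Socks Milly” or “Poppy Milly” (spaces in a pattern are matched literally, so don’t pad the bar with them)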
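A sketch of alternation with and without grouping. The list of names (including “Rex”) is invented for the example.

```perl
use strict;
use warnings;

my @names = ("Socks", "Poppy", "Milly", "Rex");

# Keep only the names that match one of the alternatives
my @cats = grep { /Socks|Poppy|Milly/ } @names;

# Brackets limit how far the bar reaches
my $poppy_milly = ("Poppy Milly" =~ /(Socks|Poppy) Milly/) ? 1 : 0;   # 1
my $rex_milly   = ("Rex Milly"   =~ /(Socks|Poppy) Milly/) ? 1 : 0;   # 0: Rex is not an option
my $pair        = ("Socks Poppy" =~ /(Socks Poppy)|Milly/) ? 1 : 0;   # 1: first alternative matches

print "@cats\n";   # prints "Socks Poppy Milly"
```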

To look for repeating strings:

  • (Socks){4} will look for “SocksSocksSocksSocks”
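A minimal sketch of the repeat count, checking that three copies are not enough:

```perl
use strict;
use warnings;

my $four  = ("SocksSocksSocksSocks" =~ /(Socks){4}/) ? 1 : 0;   # 1: four copies in a row
my $three = ("SocksSocksSocks"      =~ /(Socks){4}/) ? 1 : 0;   # 0: only three copies

print "$four $three\n";   # prints "1 0"
```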

 

Negation

Say you want to match anything that is not a particular character, word character, digit or whitespace:

  • [^b] matches any character except b
  • \W matches any non-word character
  • \D matches any non-digit
  • \S matches any non-whitespace character
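A sketch of the negated forms on made-up strings. Each one still has to match a character; it just has to be a character outside the named set.

```perl
use strict;
use warnings;

my $only_b    = ("bbb"   =~ /[^b]/) ? 1 : 0;   # 0: every character is a b
my $has_other = ("bab"   =~ /[^b]/) ? 1 : 0;   # 1: the a is not a b
my $non_word  = ("cats!" =~ /\W/)   ? 1 : 0;   # 1: ! is not a word character
my $non_digit = ("12345" =~ /\D/)   ? 1 : 0;   # 0: nothing but digits
my $non_space = ("   "   =~ /\S/)   ? 1 : 0;   # 0: nothing but spaces

print "$only_b $has_other $non_word $non_digit $non_space\n";   # prints "0 1 1 0 0"
```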

 

Operators

When doing searches, controlling the scope of your search is vital so that you don’t match stuff that is irrelevant.

Tack these operators on to a character, shorthand (like \d) or class to modify the search.

  • “.” On its own, matches any single character (except a newline)
  • “+” Matches 1 or more
  • “*” Matches 0 or more
  • “?” Matches 0 or 1, i.e. optionally
 

Anchors

  • “^” Match at the beginning of the string (or of each line when using /m)
  • “$” Match at the end of the string (or of each line when using /m)
  • \b defines a word boundary, e.g. /\bapples\b/
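A sketch of the anchors, using a made-up string where “apples” shows up both as a word and inside another word:

```perl
use strict;
use warnings;

my $string = "apples and pineapples";

my $at_start  = ($string =~ /^apples/) ? 1 : 0;   # 1: the string starts with apples
my $at_end    = ($string =~ /apples$/) ? 1 : 0;   # 1: and ends with apples (inside pineapples)
my $and_start = ($string =~ /^and/)    ? 1 : 0;   # 0: "and" is not at the start

# \b needs a word boundary on both sides
my $inside = ("pineapples"   =~ /\bapples\b/) ? 1 : 0;   # 0: no boundary between e and a
my $whole  = ("green apples" =~ /\bapples\b/) ? 1 : 0;   # 1: apples stands alone

print "$at_start $at_end $and_start $inside $whole\n";   # prints "1 1 0 0 1"
```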

 

Capturing Data back from a match

Say you want a certain portion of the data back from a match:

#1: This will match the first word: \w is a word character and + means at least once. $1 holds the text captured by the first set of brackets (in this example “Copius”). This could then be stored in a scalar or whatever your needs be.

#2: Will match the first word (\w+) and then the second word (\w+). $2 is the second set of brackets and returns “cats”.
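The original snippets are not shown here, so this is a sketch of what the two captures could look like, assuming the string is “Copius cats can carefully creep” and the words are separated by single spaces:

```perl
use strict;
use warnings;

# Assumed sample string
my $string = "Copius cats can carefully creep";

my ($first, $second);
if ($string =~ /(\w+) (\w+)/) {
    $first  = $1;   # text captured by the first set of brackets
    $second = $2;   # text captured by the second set of brackets
}

print "$first / $second\n";   # prints "Copius / cats"
```

Always check that the match succeeded before reading $1 and $2, otherwise they may still hold values from an earlier match.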

 

Substitutions – s/pattern/replacement/

Regexes are great for finding text and replacing it with something else.

The above will print Copius dogs can carefully creep
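A sketch of a substitution that would give that output, assuming the starting string is “Copius cats can carefully creep”:

```perl
use strict;
use warnings;

# Assumed starting string
my $string = "Copius cats can carefully creep";

# s/pattern/replacement/ changes the scalar in place
$string =~ s/cats/dogs/;

print "$string\n";   # prints "Copius dogs can carefully creep"
```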

Works with any of the above as well:

Will print QopQus Qats Qan Qarefully Qreep
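The pattern that produced this is not shown, so the one below is a guess: starting from “Copius cats can carefully creep”, a class of c and i with the /g and /i switches (both covered further down) happens to give the same result.

```perl
use strict;
use warnings;

# Assumed starting string
my $string = "Copius cats can carefully creep";

# Replace every c or i, in either case, with Q (the pattern is a guess)
$string =~ s/[ci]/Q/gi;

print "$string\n";   # prints "QopQus Qats Qan Qarefully Qreep"
```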

Swap text around by capturing:

Prints Copius can cats carefully creep
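One way to get that swap, sketched under the assumption that the starting string is “Copius cats can carefully creep”: capture both words, then put them back in the opposite order.

```perl
use strict;
use warnings;

# Assumed starting string
my $string = "Copius cats can carefully creep";

# $1 is "cats" and $2 is "can"; the replacement writes them back reversed
$string =~ s/(cats) (can)/$2 $1/;

print "$string\n";   # prints "Copius can cats carefully creep"
```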

 

Regex Switches

Regexes can be unwieldy sometimes; restrain them with switches.

 

/i – Case Insensitive

Case insensitive search: $string =~ /pattern/i

This will match cat however it is capitalised, e.g. “cat”, “Cat” or “CAT”.
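A sketch comparing a match with and without /i, on a made-up list of words:

```perl
use strict;
use warnings;

my @sightings = ("cat", "Cat", "CAT", "dog");

my @any_case   = grep { /cat/i } @sightings;   # ignores case
my @exact_case = grep { /cat/  } @sightings;   # lowercase only

print "@any_case\n";     # prints "cat Cat CAT"
print "@exact_case\n";   # prints "cat"
```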

 

/g – Global

Probably the most useful switch. It looks along the entire string for every match instead of stopping at the first one.
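A sketch of the difference /g makes, assuming the same sample string as before. In list context a /g match hands back every match at once.

```perl
use strict;
use warnings;

# Assumed sample string
my $string = "Copius cats can carefully creep";

# Every word starting with a lowercase c
my @c_words = $string =~ /\bc\w+/g;

# Substituting on copies: first match only, then all matches
(my $first_only = $string) =~ s/c/k/;
(my $all        = $string) =~ s/c/k/g;

print "@c_words\n";      # prints "cats can carefully creep"
print "$first_only\n";   # prints "Copius kats can carefully creep"
print "$all\n";          # prints "Copius kats kan karefully kreep"
```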

 

/e

When substituting, /e makes Perl evaluate the replacement as an expression instead of treating it as plain text. You could push to an array or do a summation.

Will return Copiuscatscancarefullycreep (the array)
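The original snippet is missing, so this is only a sketch of how /e could fill an array like that, assuming the string is “Copius cats can carefully creep”. The second half (the string and the doubling are invented) shows /e doing arithmetic in the replacement.

```perl
use strict;
use warnings;

# Assumed sample string
my $string = "Copius cats can carefully creep";
my @words;

# Work on a copy; the replacement is code, here a push of each word.
# The copy ends up holding push's return values, which we ignore.
(my $scratch = $string) =~ s/(\w+)/push(@words, $1)/ge;

print @words, "\n";   # prints "Copiuscatscancarefullycreep"

# Arithmetic in the replacement: double every number
my $pets = "3 cats and 4 dogs";
$pets =~ s/(\d+)/$1 * 2/ge;

print "$pets\n";   # prints "6 cats and 8 dogs"
```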

 

/m – Multiline

If you have a string that has \n newline characters in it, use /m so that ^ and $ match at the start and end of every line, not just of the whole string. If doing a global search use /mg.
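A sketch of what /m changes, using an invented three-line string and grabbing the first word of each line:

```perl
use strict;
use warnings;

my $text = "cats creep\ndogs dig\ncats nap";

# Without the m switch, ^ only matches at the very start of the string
my @plain = $text =~ /^(\w+)/g;

# With it, ^ also matches just after each newline
my @multi = $text =~ /^(\w+)/mg;

print "@plain\n";   # prints "cats"
print "@multi\n";   # prints "cats dogs cats"
```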

 

/s – Single line

If you have a multi line string, /s treats it as a single line: “.” will match newline characters too, so a pattern can run across line breaks.
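A sketch of /s on a made-up two-line string. With the switch, “.” is allowed to match the newline, so the pattern can span both lines.

```perl
use strict;
use warnings;

my $text = "cats\ncreep";

my $without = ($text =~ /cats.creep/)  ? 1 : 0;   # 0: . will not match the newline
my $with    = ($text =~ /cats.creep/s) ? 1 : 0;   # 1: with the s switch it will

print "$without $with\n";   # prints "0 1"
```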

 

/x

/x will let you spread a regex over several lines and embed whitespace and comments in it, great for testing and for explaining complicated patterns.
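A sketch of a commented pattern, assuming the same sample string as earlier. Under /x, plain spaces in the pattern are ignored, so the gap between words has to be written as \s.

```perl
use strict;
use warnings;

# Assumed sample string
my $string = "Copius cats can carefully creep";

my ($first, $second) = $string =~ /
    (\w+)    # first word, captured
    \s+      # the gap between the words
    (\w+)    # second word, captured
/x;

print "$first / $second\n";   # prints "Copius / cats"
```

Keep the pattern delimiter (here a forward slash) out of the comments, as Perl would read it as the end of the regex.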

 

Right, so you’ve got that yeah? Me neither. Just fiddle around with Perl and get to know (and hate) Regexes.

Enjoy!
