Regular expressions are a tool, not a problem

What are regular expressions?

Regular expressions are a way to describe a pattern to match a piece of text.

So if you want to find “similar” words in a text, spot badly formatted phone numbers, or replace the “same kind” of (X)HTML element regardless of its attributes, regular expressions are the way to go. A regular expression is also referred to as a “regex” or “regexp”.

Some quick examples:

  • \w*st will match best, 1st, worst, st and any other word ending with st, since \w stands for any alpha-numeric character, and * means that it repeats zero or more times.
  • [a-zA-Z]+st will match any word that has at least three letters and ends with st, but not 1st or a bare st, since only small and capital letters are matched and + means the letter repeats one or more times.
  • [AB]\w+ will match the words that start with a capital A or B.
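The three quick examples above can be tried out in a few lines of Python. This is only a minimal sketch: the sample sentence is made up for illustration, and Python's re module is just one of many regex flavors.

```python
import re

# Made-up sample text for illustration
text = "best 1st worst st Alice Bob carrot"

# \w*st: zero or more word characters followed by "st"
ending_st = re.findall(r"\w*st", text)
print(ending_st)      # ['best', '1st', 'worst', 'st']

# [a-zA-Z]+st: at least one letter before "st", so "1st" and a bare "st" are skipped
letters_st = re.findall(r"[a-zA-Z]+st", text)
print(letters_st)     # ['best', 'worst']

# [AB]\w+: a capital A or B followed by one or more word characters
starts_ab = re.findall(r"[AB]\w+", text)
print(starts_ab)      # ['Alice', 'Bob']
```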

There are special characters: a dot “.” matches any character, “\d” matches a digit, “\w” matches an alphanumeric character, and so on. Certain characters have to be escaped: “.” has to become “\.” if you really want to match a dot, because an unescaped dot matches any character. More info on Wikipedia.

As an example, let’s say you want to extract all references to the 1980s in a book. In a standard search, you would search for 198, and then have to stop and copy/paste each occurrence. With regular expressions you can search for 198\d and get all the years back, because \d means any digit, so the full year is matched.

What’s that, you want to put it all in a time machine and make it look like it happened in the 70s instead? No problem, search for 198(\d) and replace with 197$1. The parentheses capture the last digit and $1 puts it back, so 1980 becomes 1970, 1981 becomes 1971, and so on. Good old days.
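Here is a small Python sketch of the same search and replace. The sentence is invented for the example, and note that Python writes the captured group as \g<1> (or \1) where the replacement above uses $1.

```python
import re

# Invented sample sentence
text = "It began in 1980, peaked in 1984 and ended in 1989."

# 198\d finds every full year of the decade
years = re.findall(r"198\d", text)
print(years)        # ['1980', '1984', '1989']

# 198(\d) captures the last digit; \g<1> puts it back after "197"
seventies = re.sub(r"198(\d)", r"197\g<1>", text)
print(seventies)    # It began in 1970, peaked in 1974 and ended in 1979.
```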

Attitude

If you have ever asked another fellow developer to check out a regexp, you have probably encountered this famous quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinski.

I have to say that I don’t agree completely. Regular expressions are a good tool when used appropriately, just like other programming tools and patterns. But, of course, when used to solve the wrong kind of problem, you can easily run into even bigger problems.

One more problem is that regular expressions are not consistent across languages and text editors. While most of them rely on the same syntax, there are some specifics. So you have Perl (PHP, Java), Microsoft, JavaScript, and even VisualStudio has its own syntax, while most text editors will do only greedy matching.

About tools and examples

As I mostly work in a .NET environment, I can highly recommend Expresso. You can use it to find matches and do search and replace, and it also gives a nice regex analysis and explanation, which can be extremely useful, especially at the start. It also provides a solid library of examples for dates, URLs, Zip codes, etc.

If you don’t have a favorite tool, go download this one; it requires registration, but it’s free. I have also seen people using Regulator (free), and RegexBuddy (commercial).

When writing a new regex, I usually start with a really simple one that matches a part, and then slowly extend it until it matches everything properly.

If you would like to see the examples in action (and if you are reading this, I guess you do), take the examples below and copy/paste them into Expresso to see the breakdown and matches.

For searching, go to the “Test Mode” tab, paste the regular expression into the appropriate field and the example text into the “Sample Text” field. Then click the “Run Match” (Play) button and have a look at the breakdown and the matches. You can expand the Analyzer tree to see the breakdown, and then click on the search results to mark them in the sample text.

To replace, go to the “Design Mode” tab, do the same as above, and additionally paste the replacement into the “Replacement String” field. You’ll also notice a bunch of tabs below that give great insight into what you can use in a regular expression.

If you like learning by example as much as I do, that’s the way to go. Alter the regex and play around.

Examples

Simple

Phone number starting with a “+” and 10 or more digits

This is the kind of thing where regular expressions excel.

\+\d{10,}

Matches: +1234567890, +0987654321, +1234567890123…

Does not match: +123, +123456789, 1234567890, +123456_890…

Phone number starting with a “+” and 10 or more digits allowing whitespace

\+\s*(\d(?:\s)*){10,}

Matches: +1234567890, +0987654321, +1234567890123, +123456789 0123, + 1 2 3 4 5 6 7 8 9 0 1 2 3…

Does not match: +123, +123456789, 1234567890, +123456_890…
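Both phone number patterns can be checked against the sample values with a short Python sketch. The helper name is_phone is made up for this example, and using fullmatch (the whole string has to fit the pattern) is an assumption of the sketch, not something the patterns themselves require.

```python
import re

STRICT = re.compile(r"\+\d{10,}")               # "+" and 10 or more digits
LOOSE = re.compile(r"\+\s*(\d(?:\s)*){10,}")    # same, but whitespace is allowed

def is_phone(candidate, pattern=STRICT):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return pattern.fullmatch(candidate) is not None

print(is_phone("+1234567890"))                         # True
print(is_phone("+123456789"))                          # False, only 9 digits
print(is_phone("+123456789 0123"))                     # False, strict pattern
print(is_phone("+123456789 0123", LOOSE))              # True
print(is_phone("+ 1 2 3 4 5 6 7 8 9 0 1 2 3", LOOSE))  # True
```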

Numbers formatted with leading zeros, 00000001 – 00099999

Matching numbers by their value is not the strong suit of regular expressions, but as you can see it can be achieved easily in simple cases.

0{3}\d{5}

Matches: 00000001, 00000012, 00000123, 00001234, 00012345, 00099999…

Does not match: 1, 0001, 00100000, 00999999…
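A quick Python sketch of this check, using the pattern 0{3}\d{5} on whole strings. The helper names are invented for the example. The second helper compares by numeric value instead, which also catches 00000000, a value that fits the pattern's shape but falls outside the stated range.

```python
import re

PADDED = re.compile(r"0{3}\d{5}")

def matches_shape(value):
    """Hypothetical helper: three zeros followed by five digits, nothing else."""
    return PADDED.fullmatch(value) is not None

def in_range_by_value(value):
    """Hypothetical helper: eight ASCII digits, then a numeric range check."""
    if re.fullmatch(r"[0-9]{8}", value) is None:
        return False
    return 1 <= int(value) <= 99999

print(matches_shape("00012345"))      # True
print(matches_shape("00100000"))      # False
print(matches_shape("00000000"))      # True, the shape fits
print(in_range_by_value("00000000"))  # False, the value is below 1
```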

Find five and six letter words

(?<=(?:\s|\G|\A))[a-zA-Z]{5,6}(?=(?:\s|\Z|,|\.|\?|!))

Matches: Alarm, AMIGO, donut, abacus, WAlkER…

Does not match: A1arm, AMIG0, 12345, 123456, ABBA…
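The pattern above relies on .NET features such as \G and a variable-width lookbehind, which Python's re module does not accept as far as I know. The sketch below therefore swaps in a simplified Python pattern with the same intent: the word must be preceded by whitespace or the start of the text and followed by whitespace, punctuation, or the end. The sample text is made up.

```python
import re

# Simplified stand-in for the .NET pattern, not a literal port:
# (?<!\S)  no non-whitespace character right before the word
# (?=...)  whitespace, one of , . ? ! or the end of the text right after it
WORD_5_6 = re.compile(r"(?<!\S)[a-zA-Z]{5,6}(?=[\s,.?!]|\Z)")

text = "Alarm AMIGO donut, abacus WAlkER! A1arm AMIG0 12345 123456 ABBA elephant"
words = WORD_5_6.findall(text)
print(words)   # ['Alarm', 'AMIGO', 'donut', 'abacus', 'WAlkER']
```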

Remove extra empty lines

All the extra empty lines, or lines containing whitespace will be removed. That is, lines that contain only whitespace and line breaks will be replaced by the first line break found.

([\n\f])\s*([\n\f\r\v][\s\t]*)+(?<!.)

replace with $1
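For comparison, here is a deliberately simplified Python sketch of the same idea. It does not port the pattern above. It uses a shorter expression, \n\s*\n, that collapses any run of empty or whitespace-only lines into a single line break. The sample text is invented.

```python
import re

# Invented sample with empty and whitespace-only lines
text = "first\n\n   \n\nsecond\nthird\n\n"

# A line break, any whitespace (including more line breaks), then a line break
# is replaced by one line break
cleaned = re.sub(r"\n\s*\n", "\n", text)
print(repr(cleaned))   # 'first\nsecond\nthird\n'
```

Because the match has to end on a line break, indentation at the start of the next non-empty line is left alone.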

Check CSV for a value

If you would like to test this one, take each CSV line without the quotes and use only that as the sample text.

(?:^|,)xyz(?:$|,)

Matches: “xyz”, “xyz,yy”, “yyy,xyz,xyy”, “123,am,xyz”…

Does not match: “1,2,3”, “x,y,z”, “xy,z”, “1 xyz”…
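A short Python sketch of the CSV check, with an invented helper name. The last line shows a plain split as a cross-check, which is often simpler when the line has no quoting or escaping.

```python
import re

HAS_XYZ = re.compile(r"(?:^|,)xyz(?:$|,)")

def contains_xyz(line):
    """Hypothetical helper: True if one comma-separated value is exactly xyz."""
    return HAS_XYZ.search(line) is not None

print(contains_xyz("yyy,xyz,xyy"))   # True
print(contains_xyz("xy,z"))          # False
print(contains_xyz("1 xyz"))         # False

# Cross-check without a regex
print("xyz" in "yyy,xyz,xyy".split(","))   # True
```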

Check if a URL contains a parameter named lang (or language, or change_lang) with a value g_, e_, or h_

(?<=^|&)(lang|language|change_lang)=(g_|e_|h_)(?:$|&)
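A Python sketch of this check follows. Two things are assumptions of the sketch: the pattern is applied to the query string only (taken with urllib.parse.urlsplit), and the lookbehind is swapped for a non-capturing group, since Python's re module wants fixed-width lookbehinds. The example.com URLs and the helper name are made up.

```python
import re
from urllib.parse import urlsplit

# Non-capturing group instead of a lookbehind (adaptation for Python's re)
LANG_PARAM = re.compile(r"(?:^|&)(lang|language|change_lang)=(g_|e_|h_)(?:$|&)")

def has_lang_param(url):
    """Hypothetical helper: look for the parameter in the query string only."""
    query = urlsplit(url).query
    return LANG_PARAM.search(query) is not None

print(has_lang_param("http://example.com/page?id=5&lang=e_"))         # True
print(has_lang_param("http://example.com/page?change_lang=g_&id=1"))  # True
print(has_lang_param("http://example.com/page?lang=en"))              # False
print(has_lang_param("http://example.com/page?xlang=h_"))             # False
```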

HTML

Remove spacer gif

If you are thinking this comes out of 2002, you are right. ;)

<[\w\=\d\s\-/\._,"]*src="baza_files/spacer.gif"[\w\=\d\s\-/\._,"]*>

Word html cleanup

Sadly, I still use this one from time to time. There is a better way to do it using Tidy, but I’ll leave that for another post.

(?s)( class=\w+(?=([^<]*>)))|(<!--\[if.*?<!\[endif\]-->)|(<!\[if !\w+\]>)|(<!\[endif\]>)|(<o:p>[^<]*</o:p>)|(<span[^>]*>)|(</span>)|(font-family:[^>]*[;'])|(font-size:[^>]*[;'])(?-s)

Html gallery table cleanup

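Replacing this with an empty string will clean all elements except “img”. Note that [^(img)] is a character class, not a negated word, so it actually skips any tag whose first character is i, m, g, or a parenthesis: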

<(?<tag>/*\!*[^(img)][\w\"\=\d\s\%\#\-,/_\.]*)>

Then you can wrap each image in a div by searching for:

(<tag></*\!*[\w"\=\d\s\%\#\-,/_\.]*>)

And replacing it with:

<div class="img">$1</div>

Check for malicious characters in user input

[^<>`~!/@\#}$%:;)(_^{&*=|'+]+

Please note that this should only be used as a first line of defense, not a final solution to malicious input.
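In Python, the pattern can be used as a whitelist-style check: the input passes only if the whole string consists of characters outside the listed set. This is a minimal sketch with an invented helper name, and treating an empty string as a failure is simply a consequence of the + quantifier.

```python
import re

# One or more characters, none of which is in the "suspicious" set
SAFE_CHARS = re.compile(r"[^<>`~!/@\#}$%:;)(_^{&*=|'+]+")

def looks_safe(value):
    """Hypothetical helper: True if no listed character appears in the input."""
    return SAFE_CHARS.fullmatch(value) is not None

print(looks_safe("John Smith"))                 # True
print(looks_safe("<script>alert(1)</script>"))  # False
print(looks_safe("name=value"))                 # False
```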

Extract the URL from anchor element href attribute (<a href=”…”>)

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']
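To actually pull the URL out, the Python sketch below wraps the URL part of the pattern in a named group called url. That group is an addition for this example, and the HTML snippet is made up. Since \w does not cover characters such as hyphens, URLs containing them will not match this pattern.

```python
import re

# Same pattern as above, with the URL part wrapped in a named group (an addition)
HREF = re.compile(
    r"href=[\"\']"
    r"(?P<url>(http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?)"
    r"[\"\']"
)

# Made-up HTML for illustration
html = """<a href="http://example.com/docs/page.html">Docs</a>
<a href='/about'>About</a>
<a href="index.php?id=3&lang=e_">Home</a>"""

urls = [m.group("url") for m in HREF.finditer(html)]
print(urls)
# ['http://example.com/docs/page.html', '/about', 'index.php?id=3&lang=e_']
```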

Find any HTML element

<\s?\/?[^\s>]+(\s+[^"'=]+(=("[^"]*")|('[^\']*')|([^\s"'>]*))?)*\s*\/?>

C#

These come from the days of VisualStudio 2003 (yeah, hard to believe). So they are here more as an illustration of what can be achieved.

Generate properties from private members

Search for:

\s*(?:private\s)?(.*)\sm([^\s]*)(\s.*)?;

Replace with:

public $1 $2
{
	get { return this.m$2; }
	set { this.m$2 = value; }
}
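Outside of an editor, the same transformation can be sketched in Python. The field declaration is a made-up input, and Python writes the groups as \g<1> and \g<2> where the replacement above uses $1 and $2.

```python
import re

FIELD = re.compile(r"\s*(?:private\s)?(.*)\sm([^\s]*)(\s.*)?;")

# \g<1> is the type, \g<2> is the member name without the "m" prefix
TEMPLATE = r"""public \g<1> \g<2>
{
    get { return this.m\g<2>; }
    set { this.m\g<2> = value; }
}"""

# Made-up C# field declaration
generated = FIELD.sub(TEMPLATE, "private string mName;")
print(generated)
```

For the input above this prints a public string property called Name that reads and writes this.mName.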

Generate properties from CSV (method parameters)

Search for:

(.*?)\s(.*?)(?:,\s*|$)

Replace with:

$1 _$2;
public $1 $2
{
	get { return _$2; }
	set { this._$2 = value; }
}
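The same idea as a Python sketch, run over a made-up parameter list. Each "type name" pair is matched in turn and expanded into a backing field plus a property. As before, \g<1> and \g<2> stand in for $1 and $2.

```python
import re

PARAM = re.compile(r"(.*?)\s(.*?)(?:,\s*|$)")

# \g<1> is the type, \g<2> is the parameter name
TEMPLATE = r"""\g<1> _\g<2>;
public \g<1> \g<2>
{
    get { return _\g<2>; }
    set { this._\g<2> = value; }
}
"""

# Made-up parameter list
generated = PARAM.sub(TEMPLATE, "string name, int age")
print(generated)
```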

In conclusion

As you can see, regular expressions can be used to solve problems without creating new ones or causing a headache, as long as you take care to use them when appropriate. To wrap up, here’s my list of appropriate and unsuitable use cases.

When to use regular expressions

  • Find similar words and capitalize them.
  • Change order of words.
  • Change name of a variable in a program.
  • String structure validation.
  • Date / number structure validation.
  • Simple parsing.
  • Change or replace XHTML element.

When to avoid regular expressions

  • Date and number value or range validation (use cast and then validate).
  • Parentheses open/close consistency (use parsing and count the parentheses).
  • Grammar validation (use appropriate tools).
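To illustrate the split between the two lists, here is a small Python sketch. The regex only checks the structure of a date, the standard library validates its value, and a plain counter handles the parentheses. The function names and the YYYY-MM-DD format are assumptions made for the example.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")   # structure only

def is_valid_date(text):
    """Hypothetical helper: regex for the shape, datetime for the value."""
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def parens_balanced(text):
    """Hypothetical helper: count parentheses instead of using a regex."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a closing parenthesis came too early
                return False
    return depth == 0

print(is_valid_date("2024-02-29"))   # True, 2024 is a leap year
print(is_valid_date("2023-02-29"))   # False, right shape but not a real date
print(parens_balanced("(a(b)c)"))    # True
print(parens_balanced(")("))         # False
```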

Have fun!

