Regular Expressions in .NET are cool

Sun, Aug 31, 2008 17-minute read

We have all faced the problem of finding a specific part of a larger string and using that part for further processing.
It can be pattern matching across many lines, for instance when you have to load a text file into an object structure, or it can be as simple as finding a particular range of text in another string.

The obvious way to find those bits of strings is to use string.IndexOf and string.Substring. Those methods on the string class are nice and very fast if the string is simple and you know exactly how it is formatted. But as soon as the string you are searching gets a little more complicated, you end up writing many lines of code that easily become error prone and can be hard to extend unless you really do things correctly from the beginning.

In some situations string.IndexOf and string.Substring are the correct choice, but in this blog post I will show you that with regular expressions you can end up with code that has far fewer lines, is easier to understand, is very easy to extend, and in some situations is even faster.

Admittedly, regular expressions have a steep learning curve, but once you get into them, you will be amazed at what you can do.

I have created some tasks that I will solve in this blog post, both with regular expressions and with the more traditional approach using string.IndexOf and string.Substring. The reason for doing both is to show you the difference in lines of code, readability and, not least, performance between the two solutions.

The tasks are:

Create parser code that can parse a VCARD string into a simple object structure.
Create code that can extract the telephone number from a VCARD string

The first task is to create a parser that can parse a VCARD into an object. I have created the VCARD below, which will be used as an example VCARD.

BEGIN:VCARD
FN;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:Bj=C3=B8rn Bouet Smith
TEL:+4512345678
X-IRMC-URL:http://blog.smithfamily.dk
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:This is a note=
 With multiple lines=
 Of text
END:VCARD

You might not be familiar with the VCARD format, but it is nothing more than a string representation of a contact. You can read more about VCARD at this website.

The VCARD format is used all over. Most if not all mobile phones use the VCARD format when synchronizing their contacts to a server, or to Outlook for that matter.

Anyway, on to the code. I have created a Contact class that will be the resulting object the parser creates from the string representation. The class looks like this:

/// <summary>
/// Sample object that contains 4 specific fields and a collection of unspecified fields
/// </summary>
public class Contact
{
    public Contact()
    {
        OtherFields = new List<KeyValuePair<string, string>>();
    }
 
    /// <summary>
    /// Gets or sets the name.
    /// </summary>
    /// <value>The name.</value>
    public string Name
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the telephone.
    /// </summary>
    /// <value>The telephone.</value>
    public string Telephone
    {
        get;
        set;
    }
    /// <summary>
    /// Gets or sets the note.
    /// </summary>
    /// <value>The note.</value>
    public string Note
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the URL.
    /// </summary>
    /// <value>The URL.</value>
    public string Url
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the other fields this contact contains
    /// </summary>
    /// <value>The other fields.</value>
    public List<KeyValuePair<string,string>> OtherFields
    {
        get;
        set;
    }
 
 
}

So basically all we have to do is parse that string into this simple object. Sounds pretty easy, right? :)

Well, naturally it is not rocket science, but it is not that easy when you look at the VCARD. It is not enough to split the string by line and parse each line separately, because the value of a field can span multiple lines. Our parser has to take that into account, and naturally it must also parse all unknown fields into the OtherFields property of the contact.
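To make the multi-line problem concrete, here is a minimal sketch of what a plain line split produces for the sample VCARD. The names LineSplitSketch, SampleVCard and SplitNaively are made up for this illustration and are not part of the solutions in this post.

```csharp
using System;

public static class LineSplitSketch
{
    // The sample VCARD from above, written with explicit \r\n line endings.
    public const string SampleVCard =
        "BEGIN:VCARD\r\n" +
        "FN;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:Bj=C3=B8rn Bouet Smith\r\n" +
        "TEL:+4512345678\r\n" +
        "X-IRMC-URL:http://blog.smithfamily.dk\r\n" +
        "NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:This is a note=\r\n" +
        " With multiple lines=\r\n" +
        " Of text\r\n" +
        "END:VCARD\r\n";

    // Splits on \r\n only, without joining continuation lines (hypothetical helper).
    public static string[] SplitNaively(string vcard)
    {
        return vcard.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling LineSplitSketch.SplitNaively(LineSplitSketch.SampleVCard) gives eight strings for only six properties. Two of them, " With multiple lines=" and " Of text", are fragments of the NOTE value rather than properties of their own.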

I have made one assumption in my code: that you have a QuotedPrintable decoder available.

I have made a few methods that both examples will use and that rely on that decoder. The methods are as follows:

/// <summary>
/// Gets the field contents doing any necessary quoted printable decoding
/// </summary>
/// <param name="contents">The contents.</param>
/// <param name="charsetStr">The charset as a string</param>
/// <param name="encodingStr">The encoding as a string</param>
/// <returns>The decoded contents or the original contents if no decoding is necessary</returns>
private string GetFieldContents(string contents, string charsetStr, string encodingStr)
{
    bool mustDecode = !string.IsNullOrEmpty(encodingStr);
    bool haveCharset = !string.IsNullOrEmpty(charsetStr);
 
    if (mustDecode)
    {
        if (haveCharset)
        {
            return DecodeQuotedPrintable(contents, Encoding.GetEncoding(charsetStr));
        }
        else
        {
            return DecodeQuotedPrintable(contents);
        }
    }
    return contents;
 
}
/// <summary>
/// Decodes the quoted printable string
/// </summary>
/// <param name="contents">The contents.</param>
/// <param name="encoding">The encoding.</param>
/// <returns></returns>
public string DecodeQuotedPrintable(string contents, Encoding encoding)
{
    //Assumes that you have a method that can decode quoted printable taking encoding into account
    //There is plenty of free code available on the net, and to include one here would be out of
    //scope for this blog post
    return contents;
}
/// <summary>
/// Decodes the quoted printable string
/// </summary>
/// <param name="contents">The contents.</param>
/// <returns></returns>
public string DecodeQuotedPrintable(string contents)
{
    //Assumes that you have a method that can decode quoted printable using system default encoding
    //There is plenty of free code available on the net, and to include one here would be out of
    //scope for this blog post
    return contents;
}

Basically, these methods decode the contents if needed and otherwise just return the string. This is where my code is missing the QuotedPrintable decoder: including a complete decoder in this blog post would be out of scope and would move the focus away from what is important :)
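If you do not have a decoder at hand, the following is a minimal sketch of what one could look like. It is not the decoder I use and it is not a complete implementation of the format: it only handles =XX hex escapes and soft line breaks written as =\r\n, and the name QuotedPrintableSketch is made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedPrintableSketch
{
    // Decodes =XX escapes and removes soft line breaks (=\r\n).
    public static string Decode(string contents, Encoding encoding)
    {
        StringBuilder result = new StringBuilder(contents.Length);
        // Escaped bytes are collected here so multi-byte characters decode as a unit.
        List<byte> pending = new List<byte>();

        int i = 0;
        while (i < contents.Length)
        {
            char c = contents[i];
            bool hasTwoMore = i + 2 < contents.Length;

            if (c == '=' && hasTwoMore && contents[i + 1] == '\r' && contents[i + 2] == '\n')
            {
                // Soft line break: skip it, the value continues on the next line.
                i += 3;
            }
            else if (c == '=' && hasTwoMore
                && Uri.IsHexDigit(contents[i + 1]) && Uri.IsHexDigit(contents[i + 2]))
            {
                // Escaped byte such as =C3.
                pending.Add(Convert.ToByte(contents.Substring(i + 1, 2), 16));
                i += 3;
            }
            else
            {
                // Literal character: first emit any bytes collected so far.
                Flush(result, pending, encoding);
                result.Append(c);
                i++;
            }
        }

        Flush(result, pending, encoding);
        return result.ToString();
    }

    private static void Flush(StringBuilder result, List<byte> pending, Encoding encoding)
    {
        if (pending.Count > 0)
        {
            result.Append(encoding.GetString(pending.ToArray()));
            pending.Clear();
        }
    }
}
```

With this sketch, the body of DecodeQuotedPrintable(contents, encoding) could simply call QuotedPrintableSketch.Decode(contents, encoding). The leading space on each continuation line of the sample NOTE is kept as part of the value.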

The first thing you need to do when using regular expressions is to create your Regex object. For this particular job I have created the following regular expression:

private static readonly Regex rStatic = new Regex(@"^(?<FIELDNAME>[\w-]{1,})
(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*))){0,2}
:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))",
    RegexOptions.ExplicitCapture |
    RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.Multiline | RegexOptions.Compiled);

I will explain each part of the regular expression, but I will not go into all the details of how to create regular expressions, the syntax and so forth. For that you should consult the .NET documentation, and perhaps use one of the tools out there that help you create and test regular expressions. I use a tool called Expresso from the company Ultrapico. I have used it since 2003, and it is extremely good. You can download it from this website.

The Regex object I have created contains two parts: the regular expression and some options. The option RegexOptions.Compiled must be used with care. If you specify that option, the .NET framework compiles the regular expression to code each time the Regex object is created. That is expensive, and creating such objects repeatedly can leak memory, so create the Regex object once as a static class variable, which I have done in the above example.
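As a minimal sketch of the "create once, reuse" pattern: the class below keeps a compiled Regex in a static readonly field, so it is constructed only once no matter how often the method is called. The class name, the method and the simple pattern are made up for this illustration and are not the expression used in this post.

```csharp
using System.Text.RegularExpressions;

public static class CompiledRegexSketch
{
    // Constructed (and compiled) once, when the class is first used.
    // Hypothetical pattern: a field name at the start of a line, followed by ; or :
    private static readonly Regex FieldNameRegex = new Regex(@"^[\w-]+(?=[;:])",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Every call reuses the same Regex instance instead of creating a new one.
    public static int CountFields(string vcard)
    {
        return FieldNameRegex.Matches(vcard).Count;
    }
}
```

The thing to avoid is writing new Regex(pattern, RegexOptions.Compiled) inside a method that is called many times, since each call would then pay the compilation cost again.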

Regex	Description
^	The first part of the regular expression simply states that matching should start at the beginning of a line.
(?<FIELDNAME>[\w-]{1,})	This part captures one or more word characters (letters, digits, underscore) or hyphens into the capture group FIELDNAME. At least one character is required.
(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*))){0,2}	This part starts with a non-capturing group, indicated by ?:. I need it because I want to allow up to two instances of the entire part, and to do that I have to enclose the whole section in a group. The first element inside it is (?:;?), another non-capturing group, which says that the character ; might be present. You specify "might be present" with ?, which is the same as {0,1} but shorter, so you should use that. The next element is a non-capturing group with two alternatives: either the text ENCODING= followed by a capturing group called ENC, which captures all characters except ; and :, or the text CHARSET= followed by a capturing group called CHARSET, which also captures all characters except ; and :. The entire part may occur up to two times, or not at all, as indicated by {0,2}.
:(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))	The last part of the regex starts with the : separator, followed by two alternatives wrapped in a non-capturing group. The first alternative is a capturing group called CONTENT that captures one or more lines ending in a = sign, plus all characters on the line that follows them. The line pattern must match at least once, but there is no upper limit. The other alternative is a capturing group also called CONTENT, which matches content that fits on a single line: all characters except the line break characters \r\n.
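To see what the groups actually capture, here is a small sketch that runs the same pattern over a VCARD string and lists the groups of every match. The class name VCardRegexDemo and the method Describe are made up for this illustration. The pattern is split into three string pieces only to keep the lines short, and RegexOptions.Compiled is left out because it does not affect what is matched.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VCardRegexDemo
{
    // Same three parts as the expression above, concatenated into one pattern.
    private static readonly Regex FieldRegex = new Regex(
        @"^(?<FIELDNAME>[\w-]{1,})" +
        @"(?:(?:;?)(?:ENCODING=(?<ENC>[^:;]*)|CHARSET=(?<CHARSET>[^:;]*))){0,2}" +
        @":(?:(?<CONTENT>(?:[^\r\n]*=\r\n){1,}[^\r\n]*)|(?<CONTENT>[^\r\n]*))",
        RegexOptions.ExplicitCapture |
        RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace |
        RegexOptions.Multiline);

    // Returns one "FIELDNAME|ENC|CHARSET|CONTENT" string per matched property.
    public static List<string> Describe(string vcard)
    {
        List<string> result = new List<string>();
        foreach (Match match in FieldRegex.Matches(vcard))
        {
            result.Add(string.Format("{0}|{1}|{2}|{3}",
                match.Groups["FIELDNAME"].Value,
                match.Groups["ENC"].Value,
                match.Groups["CHARSET"].Value,
                match.Groups["CONTENT"].Value));
        }
        return result;
    }
}
```

For the sample VCARD (with \r\n line endings) this gives six entries, one per property. The FN entry carries QUOTED-PRINTABLE and UTF-8 in the ENC and CHARSET groups, and the NOTE entry holds all three physical lines in its CONTENT group, including the = signs and line breaks.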

Now that the Regex is in place and explained, let me show you the code that uses the regular expression to solve the task we created above:

/// <summary>
/// Returns a contact, parsing the VCARD string using regular expressions
/// </summary>
/// <param name="contents">The contents.</param>
/// <returns></returns>
public Contact GetContactRegex(string contents)
{
    //Create new instance of a Contact
    Contact contact = new Contact();
 
    //Match the contents with the regular expression
    MatchCollection matches = rStatic.Matches(contents);
 
    //Iterate over each match
    foreach (Match match in matches)
    {
        //Assign values from the match group we created in the regular expressions
        string fieldName = match.Groups["FIELDNAME"].Value;
        string fieldValue = match.Groups["CONTENT"].Value;
        string charSetStr = match.Groups["CHARSET"].Value;
        string encodingStr = match.Groups["ENC"].Value;
 
        //Assign values to the contact object from the values of the capture groups
        switch (fieldName)
        {
            case "FN":
                //name
                contact.Name = GetFieldContents(fieldValue, charSetStr, encodingStr);
                break;
            case "TEL":
                //telephone
                contact.Telephone = GetFieldContents(fieldValue, charSetStr, encodingStr);
                break;
            case "X-IRMC-URL":
                //url
                contact.Url = GetFieldContents(fieldValue, charSetStr, encodingStr);
                break;
            case "NOTE":
                contact.Note = GetFieldContents(fieldValue, charSetStr, encodingStr);
                break;
            default:
                //All other fields just add them to the other fields collection
                contact.OtherFields.Add(new KeyValuePair<string, string>(fieldName, GetFieldContents(fieldValue, charSetStr, encodingStr)));
                break;
        }
 
    }
    return contact;
}

As you can see, the C# code is very easy to read, and if the format of the VCARD changes, all you have to do is change the regular expression. You don't have to change the logic of the C# code.

Let's move on to solving the same task using C# code only, i.e. no regular expressions, but using the same support methods, i.e. GetFieldContents.

I have created the following solution, which might not be perfect but gets the job done.

I have created a class that will represent a single property in the VCARD:

/// <summary>
/// Class that represents a single property in the VCARD
/// </summary>
public class Line
{
    /// <summary>
    /// Gets or sets the name of the field.
    /// </summary>
    /// <value>The name of the field.</value>
    public string FieldName
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the charset.
    /// </summary>
    /// <value>The charset.</value>
    public string Charset
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the encoding.
    /// </summary>
    /// <value>The encoding.</value>
    public string Encoding
    {
        get;
        set;
    }
 
    /// <summary>
    /// Gets or sets the contents.
    /// </summary>
    /// <value>The contents.</value>
    public string Contents
    {
        get;
        set;
    }
}

And two methods that allow me to parse a single property line, represented as a string, into a Line object:

/// <summary>
/// Parses a string into a Line object
/// </summary>
/// <param name="lineString">The line string.</param>
/// <returns></returns>
private Line GetLine(string lineString)
{
    Line line = new Line();
 
    if (lineString.Contains("CHARSET="))
    {
        line.Charset = GetParameterValue(lineString, "CHARSET=");
    }
    if (lineString.Contains("ENCODING="))
    {
        line.Encoding = GetParameterValue(lineString, "ENCODING=");
    }
    int firstSeperator = lineString.IndexOfAny(new char[] { ';', ':' });
 
    line.FieldName = lineString.Substring(0, firstSeperator);
 
 
    int contentStart = lineString.IndexOf(":") + 1;
    line.Contents = lineString.Substring(contentStart).Trim();
 
 
    return line;
}
 
/// <summary>
/// Gets the parameter value
/// </summary>
/// <param name="contents">The contents.</param>
/// <param name="parameter">The parameter.</param>
/// <returns></returns>
private string GetParameterValue(string contents, string parameter)
{
    int paramStart = contents.IndexOf(parameter) + parameter.Length;
    if (paramStart == parameter.Length - 1)
    {
        //Not found
        return null;
    }
    int paramEnd = contents.IndexOfAny(new char[] { ';', ':' }, paramStart);
    if (paramEnd == -1)
    {
        //Not found, so return the rest of the string
        return contents.Substring(paramStart);
    }
    return contents.Substring(paramStart, paramEnd - paramStart);
}

The method that returns the Contact based on the VCARD string is as follows:

/// <summary>
/// Returns a contact object using no regular expressions
/// </summary>
/// <param name="contents">The contents.</param>
/// <returns></returns>
public Contact GetContactRegular(string contents)
{
    //Create new instance of a contact
    Contact contact = new Contact();
 
    //Split all lines into a string array
    string[] lines = contents.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
 
    //Create a StringBuilder that will hold each property of the VCARD as we build it from the lines array
    StringBuilder currentLine = new StringBuilder(100);
 
    //bool value to indicate whether or not the current line belongs together with the next line
    bool addNextLine = false;
    //Create Collections of Line objects
    List<Line> allLines = new List<Line>();
 
    //Iterate over each string in the lines array, parsing them into a Line object
    foreach (string line in lines)
    {
        //Check whether or not the current line belongs together with the next one
        addNextLine = line.EndsWith("=");
 
        currentLine.AppendLine(line);
        if (!addNextLine)
        {
            //If line does not belong together with the next one, Parse the string into a Line object
            allLines.Add(GetLine(currentLine.ToString()));
            currentLine = new StringBuilder(100);
        }
 
    }
 
    foreach (Line l in allLines)
    {
        switch (l.FieldName)
        {
            case "FN":
                //name
                contact.Name = GetFieldContents(l.Contents, l.Charset, l.Encoding);
                break;
            case "TEL":
                //telephone
                contact.Telephone = GetFieldContents(l.Contents, l.Charset, l.Encoding);
                break;
            case "X-IRMC-URL":
                //url
                contact.Url = GetFieldContents(l.Contents, l.Charset, l.Encoding);
                break;
            case "NOTE":
                contact.Note = GetFieldContents(l.Contents, l.Charset, l.Encoding);
                break;
            default:
                //All other fields just add them to the other fields collection
                contact.OtherFields.Add(new KeyValuePair<string, string>(l.FieldName, GetFieldContents(l.Contents, l.Charset, l.Encoding)));
                break;
 
        }
    }
 
    return contact;
}

Okay, first impression: that is a whole lot of code to reach the same result as the method that uses regular expressions. In comparison, the regular expression code consists of only 43 lines plus the regular expression, while the alternative code without regular expressions is a whopping 138 lines. That is more than three times the amount of code that might contain bugs, needs proper unit testing, and so on. So if you ask me, I would rather maintain 43 lines of code than 138 :)

Regular expressions can be faster at runtime than plain C# string code, but in this example they are not. I ran the two methods 1 million times each, and the numbers are as follows:

Method	Milliseconds
GetContactRegular	24625
GetContactRegex	45546.875
GetContactRegex (No RegexOptions.Compiled)	105453.125

As you can clearly see, the traditional way of doing things is considerably faster in this particular example, almost twice as fast as the regular expression solution. That should not make you rule out the Regex solution, since easily maintainable and easily extensible code is sometimes worth more than speed. You can also see that RegexOptions.Compiled more than doubles the speed compared with the non-compiled regular expression, so whenever you use regular expressions, consider RegexOptions.Compiled. Just remember to make the Regex static so it is only created once.

Okay, this task clearly shows that regular expressions can give you code that is easier to read, smaller to manage and easier to extend, but in this particular example they lose on speed.
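For reference, a timing loop like the one behind these numbers can be sketched with System.Diagnostics.Stopwatch. This is not my exact test harness, only a minimal illustration; the class TimingSketch and the method MeasureMilliseconds are names made up here.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs the action the given number of times and returns the elapsed milliseconds.
    public static double MeasureMilliseconds(Action action, int iterations)
    {
        // One untimed warm-up call, so JIT compilation and the one-time
        // construction of static Regex fields are not part of the measurement.
        action();

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }
}
```

A call could look like TimingSketch.MeasureMilliseconds(() => parser.GetContactRegex(vcard), 1000000), where parser is assumed to be an instance of the class holding the methods above and vcard is the sample string. Absolute numbers will differ from machine to machine, so only the ratio between the methods is meaningful.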

Let's move on to the next task: returning only the telephone number from the VCARD. We will use the same VCARD to test:

BEGIN:VCARD
FN;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:Bj=C3=B8rn Bouet Smith
TEL:+4512345678
X-IRMC-URL:http://blog.smithfamily.dk
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:This is a note=
 With multiple lines=
 Of text
END:VCARD

Let's start with the Regex solution. For that purpose I have created the following Regex:

private static readonly Regex rSimple = new Regex(@"^(?:TEL):(?<TEL>[^\r\n]*)",
    RegexOptions.ExplicitCapture |
    RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.Multiline | RegexOptions.Compiled);

This regular expression is very simple: for a line that begins with TEL, it grabs the text after the : sign up to the end of the line.

The accompanying C# code is:

string tel = rSimple.Match(contact).Groups["TEL"].Value;

That's simple :) - you cannot get it that easily with C# code only.

Let's do the C# solution as well:

int indexStart = contact.IndexOf("TEL:")+4;
int indexEnd = contact.IndexOf("\r\n", indexStart);
string tel = contact.Substring(indexStart, indexEnd - indexStart);

Again, it's not bad: three lines of code and you have the telephone number. But what happens if the TEL property line is not that simple? What if it looks like this:

TEL;CELL;HOME:+4512345678

Then we would have to revise our C# code so that it skips the parameters when they are present. With the regular expression, adding just a few characters makes it handle the optional parameters. If we add ([^:]*) to our regular expression, we end up with:

^(?:TEL)([^:]*):(?<TEL>[^\r\n]*)

Then our regular expression still does the job, and we don't have to change the accompanying C# code at all. In contrast, to make our C# code handle parameters we would need to find the first index of the : character after TEL and then do a substring from that point. Nothing hard, but more error prone.

As for speed: in this case, where the regular expression is so simple, it is actually faster than the C# code.

I have again run the same code 1 million times, and the results are:
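Here is a small sketch that puts the two approaches side by side for a TEL line with optional parameters. The regular expression is the extended one from above. The IndexOf variant is only one possible way to write it, and the names TelephoneSketch, WithRegex and WithIndexOf are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TelephoneSketch
{
    private static readonly Regex TelRegex = new Regex(@"^(?:TEL)([^:]*):(?<TEL>[^\r\n]*)",
        RegexOptions.ExplicitCapture |
        RegexOptions.IgnoreCase |
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Regex variant: the optional parameters are skipped by ([^:]*).
    public static string WithRegex(string vcard)
    {
        Match match = TelRegex.Match(vcard);
        return match.Success ? match.Groups["TEL"].Value : null;
    }

    // IndexOf variant: assumes \r\n line endings and that TEL is not the first line.
    public static string WithIndexOf(string vcard)
    {
        int telStart = vcard.IndexOf("\r\nTEL", StringComparison.Ordinal);
        if (telStart == -1)
        {
            return null;
        }
        // Skip any parameters by jumping to the first : after TEL.
        int colon = vcard.IndexOf(':', telStart);
        if (colon == -1)
        {
            return null;
        }
        int lineEnd = vcard.IndexOf("\r\n", colon, StringComparison.Ordinal);
        if (lineEnd == -1)
        {
            lineEnd = vcard.Length;
        }
        return vcard.Substring(colon + 1, lineEnd - colon - 1);
    }
}
```

Both methods return +4512345678 for a TEL:+4512345678 line as well as for TEL;CELL;HOME:+4512345678. The difference is that the regular expression needed only a few extra characters, while the IndexOf variant needed an extra search and several guards against -1.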

Method	Milliseconds
C# code	4125
Regular expression	2984.375

So you see, this time the regular expression wins the speed test, taking a little over a quarter less time, and it also wins the extensibility and maintainability test if you ask me :)

The lesson to be learned from this blog post is that regular expressions are your friend and can help you write code faster, with a lot fewer lines of C# code. In some situations, when the regular expression is very simple, they even give you a decent performance gain.

When doing string searches, regular expressions can help you immensely, and they are not nearly as dangerous or hard to learn as many people think.

Another bonus of regular expressions is that you can do very advanced pattern matching and replacement. Let's say you wanted to reformat all phone numbers in the VCARDs to a particular format; that would be possible using the same regular expression as above. Let's say you wanted to add a prefix to all the phone numbers, for example 0, to dial an outside line. That is possible with the following line of code:

string prefixed = rSimple.Replace(contact, "TEL:0,${TEL}");

Which simply states that, using the same regular expression, each match is replaced with TEL: followed by 0, and the contents of the TEL group :)
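As a self-contained sketch of that replacement, with the class name DialPrefixSketch and the method AddOutsideLinePrefix made up for this illustration:

```csharp
using System.Text.RegularExpressions;

public static class DialPrefixSketch
{
    // Same pattern as rSimple above, written as a verbatim string.
    private static readonly Regex TelRegex = new Regex(@"^(?:TEL):(?<TEL>[^\r\n]*)",
        RegexOptions.ExplicitCapture |
        RegexOptions.IgnoreCase |
        RegexOptions.Multiline);

    // Returns a new string; the input VCARD itself is not modified.
    public static string AddOutsideLinePrefix(string vcard)
    {
        // ${TEL} is replaced with whatever the TEL group captured.
        return TelRegex.Replace(vcard, "TEL:0,${TEL}");
    }
}
```

For a VCARD containing the line TEL:+4512345678, the returned string contains TEL:0,+4512345678 instead, and all other lines are left untouched. Note that Replace returns the result rather than changing the input, so the return value has to be kept.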

Simple, right? :)

I hope that if you have never used regular expressions before you will consider them now, and that if you have used them before I have given you further reason to keep doing so :)
