Regular Expressions in C#

Special characters: ^ $ \ . * + ? ( ) [ ] { } |

The characters ^ and $ are called anchors:

  • ^ matches the beginning of the string
  • $ matches the end of the string, i.e. the position after the last character (in .NET also the position just before a final newline)
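
A minimal sketch of how the anchors behave in .NET. The class and method names (AnchorDemo, CountWordLines) are hypothetical and only serve the illustration. By default ^ and $ anchor to the whole string; with RegexOptions.Multiline they also match at the beginning and end of every line.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AnchorDemo
{
    // Counts the matches of ^\w+$, i.e. spans that consist of a single word
    // and are delimited by the two anchors.
    public static int CountWordLines(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+$", options).Count;
    }
}
```

For the input "one\ntwo\nthree", CountWordLines returns 0 with RegexOptions.None (the newlines inside the string cannot be matched by \w) and 3 with RegexOptions.Multiline (each line is matched separately).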

Parts of a regex can be repeated:

  • * matches the preceding part zero or more times; * is equivalent to {0,}
  • + matches the preceding part one or more times; + is equivalent to {1,}
  • ? matches the preceding part zero or one time; ? is equivalent to {0,1}

{…} represents a bounded repeat:

  • a{n} matches 'a' repeated exactly n times
  • a{n,} matches 'a' repeated n or more times
  • a{n,m} matches 'a' repeated between n and m times inclusive
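
The repeat operators can be tried out with a small sketch like the one below. RepeatDemo and FirstMatch are hypothetical names chosen for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RepeatDemo
{
    // Returns the text of the first (leftmost) match of the pattern,
    // or an empty string when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

With the input "fooood", the pattern o{2} yields "oo", o{2,} and o+ yield "oooo", and o{1,3} yields "ooo". The pattern o* yields an empty string: zero repetitions are allowed, so the match already succeeds at the very first position, before the 'f'.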

Other simple patterns:

  • . - any single character except the newline character
  • \d - any digit
  • \w - any word character (a letter, a digit or an underscore)
  • \s - any whitespace character
  • \S - any character that isn't whitespace
  • \b - a word boundary
  • \B - any position that isn't a word boundary

The repeats above (*, +, ?, {…}) are greedy: each one matches as many repetitions as it can. To make a repeat non-greedy (lazy), append ? to it, i.e. *?, +?, ??, {…}?.

RegEx               Matching strings
a*b                 b, ab, aab, aaab, etc.
a+b                 ab, aab, aaab, etc.
a?b                 b, ab
do(es)?             do, does
o{2}                oo
o{2,}               oo, ooo, oooo, etc.
o{1,3}              o, oo, ooo
[adg]               'a' or 'd' or 'g'
[a-z]               any character from 'a' to 'z'
B[iu]rma            Birma or Burma
.*                  any number of characters other than newline
\w*                 any number of word characters (letters, digits, underscores)
[^1-6]              any character except the digits from 1 to 6
"[^"\r\n]*"         any single-line string enclosed in double quotes
\b(in|out)\b        the word 'in' or the word 'out'
\bxxx\b.*\byyy\b    the word 'xxx' followed, later on the same line, by the word 'yyy'
\ba\w*\b            words that start with the letter 'a'
\b\w{5,6}\b         words of five or six characters
\b\d{4,5}\b         a 4- or 5-digit number
^\w*                the first word in a line or in the text
^test               the string 'test' at the beginning of a line or of the text
^51|^52             the string '51' or '52' at the beginning of a line or of the text
^a{3,4}$            the string 'aaa' or 'aaaa' if it is the entire line or text
^test$              the string 'test' if it is the entire line or text
test$               the string 'test' at the end of a line or of the text
[.?!]               the punctuation at the end of a sentence; inside [...] the characters '.' and '?' lose their special meanings
[\d]{1,7}           a number of 1 to 7 digits
(\d+|)              a number or nothing (the empty string)
(\d+|\**|)          a number, a run of asterisks, or nothing

Note: ^ and $ match at the beginning and end of each line only when RegexOptions.Multiline is set; otherwise they anchor to the whole text.
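
A few rows of the table can be checked with a sketch like this one. TableDemo and FindAll are hypothetical names; the helper simply collects every match of a pattern.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TableDemo
{
    // Returns the text of every match of the pattern, in order of appearance.
    public static List<string> FindAll(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            result.Add(m.Value);
        return result;
    }
}
```

FindAll("log in, then point out the inner outline", @"\b(in|out)\b") returns "in" and "out" only: the word boundaries reject the occurrences inside "point", "inner" and "outline". FindAll("a1 22 333 4444 55555 666666", @"\b\d{4,5}\b") returns "4444" and "55555".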

Simple examples:

using System.Text.RegularExpressions;
...
// Check if all characters are numeric.
Regex reg = new Regex(@"^\d+$");
bool b1 = reg.Match("3451").Success; // true
bool b2 = reg.Match("34a1").Success; // false
 
// Check if a string has letters a,A,b,B.
bool b3 = Regex.IsMatch("acd", "a|b", RegexOptions.IgnoreCase); // true
bool b4 = Regex.IsMatch("eBd", "a|b", RegexOptions.IgnoreCase); // true
bool b5 = Regex.IsMatch("efg", "a|b", RegexOptions.IgnoreCase); // false
 
// Modify a string.
string s1 = Regex.Replace("aib", "a|b", "X"); // XiX

Example: Extract email addresses from a string:

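using System.Collections.Generic;
using System.Text.RegularExpressions;
...
public List<string> GetEmails(string str)
{
    // Local part, '@', domain, then a top-level domain of 2 to 4 letters.
    const string Pattern = @"[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}";
    List<string> emails = new List<string>();

    // IgnoreCase lets the A-Z ranges match lowercase letters as well.
    Regex reg = new Regex(Pattern, RegexOptions.IgnoreCase);
    MatchCollection matches = reg.Matches(str);
    foreach (Match m in matches)
        emails.Add(m.Value);

    return emails;
}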
...
string emailsStr = "leon@micro.com john@ Mel@kata.ca Phil@@lego.com";
List<string> emails = GetEmails(emailsStr); // emails = { "leon@micro.com", "Mel@kata.ca" }

Example: Use named groups to match patterns:

string str = @"Leon's GDI+ GDI++ +GDU Mike'ss Kale's Chloe''s";
Regex re = new Regex(@"((?<GroupPlusSign>\w+\+)|(?<GroupApostrophe>\w+'s))(?=(\s|$))", RegexOptions.None);
MatchCollection matches = re.Matches(str); // 3 matches: Leon's GDI+ Kale's
 
int i = 1;
foreach (Match m in matches)
{
    Console.WriteLine($"Match #{i}: " +
        $"GroupPlusSign={m.Groups["GroupPlusSign"].Value}, " +
        $"GroupApostrophe={m.Groups["GroupApostrophe"].Value}");
    ++i;
}

Output:

Match #1: GroupPlusSign=, GroupApostrophe=Leon's
Match #2: GroupPlusSign=GDI+, GroupApostrophe=
Match #3: GroupPlusSign=, GroupApostrophe=Kale's

Greediness

Example: “hello out there, how are you”

Pattern  Description                                                                     Matching substring
h.*o     an 'h', then as many arbitrary characters as possible (even 'o'), then an 'o'  “hello out there, how are yo”

Make * ungreedy:

Pattern  Description                                                                                Matching substring
h.*?o    an 'h', then as few arbitrary characters as possible, up to the first occurrence of 'o'  “hello”
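
The two patterns above can be compared side by side with a short sketch. GreedyDemo and its two methods are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class GreedyDemo
{
    // Greedy: .* first runs to the end of the line, then backs off to the last 'o'.
    public static string Greedy(string input)
    {
        return Regex.Match(input, "h.*o").Value;
    }

    // Lazy: .*? grows one character at a time and stops at the first 'o'.
    public static string Lazy(string input)
    {
        return Regex.Match(input, "h.*?o").Value;
    }
}
```

For "hello out there, how are you", Greedy returns "hello out there, how are yo" and Lazy returns "hello".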

Backreferences and Named Groups

Capturing groups save the text matched by a subexpression. Backreferences (\1, \k<name>) reuse that text later in the same pattern; $1, $2, ... reuse it in a replacement string.

Pattern                        Description
(exp)                          match exp and capture it in an automatically numbered group
(?<name>exp)                   match exp and capture it in a named group
(?:exp)                        match exp, but do not capture it
\b(\w+)\b\s*\b\1\b             match repeated words; uses the automatically numbered group #1 (\w+)
\b(?<Word>\w+)\b\s*\k<Word>\b  match repeated words; uses the named group 'Word'
(\w+)\s*=\s*(.*?)\s*$          name=value pairs; name is in $1, value is in $2 (note: *? makes * ungreedy)
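
As a sketch of how two of these patterns might be used from C#, the hypothetical class BackrefDemo below removes repeated words with the backreference \1 and reads a name=value pair from numbered groups. The method names are assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class BackrefDemo
{
    // Replaces a word that is immediately repeated ("is is") with a single copy.
    // \1 in the pattern refers back to group 1; $1 in the replacement inserts it.
    public static string RemoveRepeatedWords(string text)
    {
        return Regex.Replace(text, @"\b(\w+)\b\s*\b\1\b", "$1");
    }

    // Splits "name = value" into its two parts; returns false when the line does not match.
    public static bool TryParsePair(string line, out string name, out string value)
    {
        Match m = Regex.Match(line, @"(\w+)\s*=\s*(.*?)\s*$");
        name = m.Groups[1].Value;   // empty string when there is no match
        value = m.Groups[2].Value;  // lazy .*? leaves trailing whitespace to \s*
        return m.Success;
    }
}
```

RemoveRepeatedWords("this is is a test test") returns "this is a test". TryParsePair("  timeout =  30 sec  ", out name, out value) returns true with name "timeout" and value "30 sec".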

Example: “Today is Monday the 18th.”

Pattern       Description
[0-9]+th      '18th' is matched
(?:[0-9]+)th  '18th' is matched; the (?: ) group does not capture anything
([0-9]+)th    '18th' is matched and '18' is captured in $1
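
The difference between the last two patterns shows up when group 1 is read back. A minimal sketch, with GroupDemo and its methods as hypothetical names:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class GroupDemo
{
    // The whole matched text.
    public static string Whole(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // The text captured by group 1; an empty string when the pattern
    // has no such group or the group did not take part in the match.
    public static string Group1(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

For "Today is Monday the 18th.", Whole returns "18th" for both ([0-9]+)th and (?:[0-9]+)th, while Group1 returns "18" for the capturing version and an empty string for the non-capturing one.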

Named groups - .NET syntax:

string str = @"leon's GDI+ GDI++ +GDU Hello9's hi'ss ho''s";
Regex re = new Regex(@"((?<GroupPlusSign>\w+\+)|(?<GroupApostropheS>\w+'s))(?=(\s|$))", RegexOptions.None);
MatchCollection matches = re.Matches(str);
foreach (Match m in matches)
{
    string word1 = m.Groups["GroupPlusSign"].Value; // matches GDI+
    string word2 = m.Groups["GroupApostropheS"].Value; // matches leon's and Hello9's
}

Common Patterns

Name                                     Pattern   Examples / Comments
Email-1                                  \w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3}
Email-2                                  [\w\.-]+(\+[\w-]*)?@([\w-]+\.)+[\w-]+
Phone-1                                  ([+]|)(([0-9]+)([-|\s]|))*   1-222-345 345-564321, 33211, 34-23-67
Phone-2                                  ([+]|)([0-9]|)(\s|\-|)[\d]{3}(\s|\-|)[\d]{3}(\s|\-|)[\d]{4}   +1-111-222-3333, 453 678 9900
Phone-3                                  ^([\d]{3}|[\d]{3}-[\d]{1,3}|[\d]{3}-[\d]{1,3}-[\d]{1,4})$
Phone-4                                  [\d]{3}[-|\s]?[\d]{3}[-|\s]?[\d]{4}   905-234-3422
Phone-5                                  (1|\+1)?[- .]?(\([0-9]\d{2}\)|[0-9]\d{2})[- .]?\d{3}[- .]?\d{4}
Canadian postal code                     [a-zA-Z][0-9][a-zA-Z](\s|\-|)[0-9][a-zA-Z][0-9]
Money                                    ^\$?\d{1,3}((,?\d{3})*(\.\d{2})?|(\.?\d{3})*(,\d{2})?)$
Domain name                              [a-z0-9][-a-z0-9]*(\.[-a-z0-9]+)*\.[a-z]{2,6}
IP address-1                             \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
IP address-2                             ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)   Note: a regex cannot enforce N<256 arithmetically, so the range is spelled out as alternatives
IP address-3                             (\d{1,3}\.){3}\d{1,3}   allows numbers to be greater than 255
Number [0-255]                           ^(?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d))   non-capturing groups
Number [0-255]                           ^(25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)   capturing group (no optimization)
An identifier in a programming language  [A-Za-z_][A-Za-z0-9_]*
C-style hexadecimal number               0[xX][A-Fa-f0-9]+
A sequence of digits                     ^\d+$   mandatory, i.e. at least one digit is required
A sequence of digits                     ^\d*$   optional, i.e. an empty string is allowed
Padding spaces                           ^\s+|\s+$
An HTML tag                              <[A-Za-z][A-Za-z0-9]*>
A generic tag                            <[^>]+>   greedy '+' and a negated character class
A generic tag                            <.+?>   slower: the lazy '+?' is used instead of the greedy '+'
A number between 1000 and 9999           \b[1-9][0-9]{3}\b
A number between 100 and 99999           \b[1-9][0-9]{2,4}\b
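
Two of the patterns above, sketched as C# helpers. CommonPatternDemo and its members are hypothetical names; the IP pattern is "IP address-1" with the \b boundaries swapped for ^ and $ (a change made for this example) so that it validates a string instead of searching inside it.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class CommonPatternDemo
{
    // "IP address-1" from the table, anchored with ^ and $ instead of \b.
    private static readonly Regex Ipv4 = new Regex(
        @"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$");

    public static bool IsIpv4(string s)
    {
        return Ipv4.IsMatch(s);
    }

    // "Padding spaces" from the table: removes leading and trailing whitespace.
    public static string TrimPadding(string s)
    {
        return Regex.Replace(s, @"^\s+|\s+$", "");
    }
}
```

IsIpv4("192.168.0.1") returns true and IsIpv4("256.1.1.1") returns false. TrimPadding("  hi there \t") returns "hi there"; the space between the two words is kept because it touches neither anchor.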

URL (IP):

^(http\://|https\://|ftp\://|)((([a-z_0-9\-]+)+(([\:]?)+([a-z_0-9\-]+))?)(\@+)?)?(((((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))))|((([a-z0-9\-])+\.)+([a-z]{2}\.[a-z]{2}|[a-z]{2,4})))(([\:])(([1-9]{1}[0-9]{1,3})|([1-5]{1}[0-9]{2,4})|(6[0-5]{2}[0-3][0-6])))?$

URL (port and IP allowed):

^(((ht|f)tp(s?))\://)?((([a-zA-Z0-9_\-]{2,}\.)+[a-zA-Z]{2,})|((?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)(?(\.?\d)\.)){4}))(:[a-zA-Z0-9]+)?(/[a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~]*)?$

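Date (m-d-yyyy, leading zeros optional; February 29 only in years whose last two digits are divisible by 4):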

^((((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|([1-2][0-9]))))\-\d{2}(([02468][048])|([13579][26])))|(((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|(1[0-9])|(2[0-8]))))\-\d{2}(([02468][1235679])|([13579][01345789]))))$

Time:

^(\s(((0?[1-9])|(1[0-9])|(2[0-3])|(0)|(00))\:([0-5][0-9])(|(\s)|(\:([0-5][0-9])))))?$

General Info

Regular expressions can be used for string-related operations such as:

  • Validation: Check if an input string is well-formed.
  • Parsing: Extract information from an input string.
  • Transformation: Search substrings and replace them with a new substring.
  • Iteration: Search all occurrences of a substring.
  • Tokenization: Split a string into substrings.
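
Each of the five operations maps onto a static method of the Regex class. The sketch below is one possible illustration; OperationsDemo, its method names and the patterns are assumptions chosen for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class OperationsDemo
{
    // Validation: does the input consist of digits only?
    public static bool IsNumber(string s)
    {
        return Regex.IsMatch(s, @"^\d+$");
    }

    // Parsing: extract the first standalone 4-digit number.
    public static string ExtractYear(string s)
    {
        return Regex.Match(s, @"\b(\d{4})\b").Groups[1].Value;
    }

    // Transformation: replace every run of whitespace with a single space.
    public static string CollapseSpaces(string s)
    {
        return Regex.Replace(s, @"\s+", " ");
    }

    // Iteration: visit all occurrences (here they are only counted).
    public static int CountWords(string s)
    {
        return Regex.Matches(s, @"\b\w+\b").Count;
    }

    // Tokenization: split on commas and drop the whitespace around them.
    public static string[] SplitList(string s)
    {
        return Regex.Split(s, @"\s*,\s*");
    }
}
```

For instance, ExtractYear("Last modified: 2019/01/31") returns "2019", CollapseSpaces("a   b \t c") returns "a b c", and SplitList("a , b,c") returns the three items "a", "b" and "c".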
notes/csharp/regex.txt · Last modified: 2019/01/31 by leszek