Regular Expressions in C#

Special characters: ^ $ \ . * + ? ( ) [ ] { } |

The characters ^ and $ are called anchors:

  • ^ matches the beginning of the string
  • $ matches the end of the string, i.e. the position after the last character (or just before a trailing newline)
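
A minimal sketch of how the anchors behave, assuming a two-line sample string. RegexOptions.Multiline is not mentioned in the notes above; it is the standard .NET option that lets ^ and $ also match at line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "first line\nsecond line";

// Without options, ^ and $ anchor to the whole string.
bool startDefault = Regex.IsMatch(text, @"^second");   // false: "second" is not at the start of the text
bool endDefault   = Regex.IsMatch(text, @"line$");     // true: the text ends with "line"

// With RegexOptions.Multiline, ^ and $ also match at every line break.
bool startMulti = Regex.IsMatch(text, @"^second", RegexOptions.Multiline);    // true
int lineEnds = Regex.Matches(text, @"line$", RegexOptions.Multiline).Count;   // 2

Console.WriteLine($"{startDefault} {endDefault} {startMulti} {lineEnds}");    // False True True 2
```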

Parts of a regex can be repeated:

  • * matches the preceding part zero or more times; * is equivalent to {0,}
  • + matches the preceding part one or more times; + is equivalent to {1,}
  • ? matches the preceding part zero or one time; ? is equivalent to {0,1}

{…} represents a bounded repeat:

  • a{n} matches 'a' repeated exactly n times
  • a{n,} matches 'a' repeated n or more times
  • a{n,m} matches 'a' repeated between n and m times inclusive
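
The repeat counts can be checked with a short sketch. The patterns are wrapped in ^ and $ (an illustrative choice) so that the whole input has to match and only the number of repetitions decides the result.

```csharp
using System;
using System.Text.RegularExpressions;

bool exact    = Regex.IsMatch("aaa",  @"^a{3}$");    // true: exactly three
bool tooMany  = Regex.IsMatch("aaaa", @"^a{3}$");    // false: four is not exactly three
bool atLeast  = Regex.IsMatch("aaaa", @"^a{3,}$");   // true: three or more
bool inRange  = Regex.IsMatch("aa",   @"^a{1,3}$");  // true: between one and three
bool starZero = Regex.IsMatch("",     @"^a*$");      // true: * behaves like {0,}
bool plusZero = Regex.IsMatch("",     @"^a+$");      // false: + behaves like {1,}

Console.WriteLine($"{exact} {tooMany} {atLeast} {inRange} {starZero} {plusZero}");
```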

Other simple patterns:

  • . - any single character except the newline character
  • \s - any whitespace character
  • \S - any character that isn't a whitespace
  • \w - any word character (letter, digit, or underscore)
  • \d - any decimal digit
  • \b - a word boundary
  • \B - any position that isn't a word boundary
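
A small sketch of the word-boundary and whitespace patterns. The sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "cat concat cat's category";

// \bcat\b needs a word boundary on both sides: "cat" and the "cat" in "cat's".
int wholeWords = Regex.Matches(text, @"\bcat\b").Count;     // 2

// \Bcat\b needs a non-boundary in front: only the tail of "concat".
int wordTails = Regex.Matches(text, @"\Bcat\b").Count;      // 1

// \S+ matches each run of non-whitespace; \s+ matches the gaps between runs.
int chunks = Regex.Matches(text, @"\S+").Count;             // 4
string squeezed = Regex.Replace("a  b\t\tc", @"\s+", " ");  // "a b c"

Console.WriteLine($"{wholeWords} {wordTails} {chunks} {squeezed}");
```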

The repeats *, +, ? and {…} are greedy: they match as many characters as possible. To make a repeat non-greedy (lazy), add ? after it, e.g. *?, +?, ??, {…}?.
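
The difference between greedy and non-greedy repeats is easiest to see on the same input. A minimal sketch, assuming a short HTML-like sample string:

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<b>bold</b>";

// Greedy: .+ runs to the end and backs off only as far as the last '>'.
string greedy = Regex.Match(html, @"<.+>").Value;            // "<b>bold</b>"

// Non-greedy: .+? stops at the first '>' that lets the match succeed.
string lazy = Regex.Match(html, @"<.+?>").Value;             // "<b>"

// The same applies to bounded repeats.
string greedyCount = Regex.Match("aaaa", @"a{2,3}").Value;   // "aaa"
string lazyCount   = Regex.Match("aaaa", @"a{2,3}?").Value;  // "aa"

Console.WriteLine($"{greedy} {lazy} {greedyCount} {lazyCount}");
```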

RegEx               Matching strings
a*b                 b, ab, aab, aaab, etc.
a+b                 ab, aab, aaab, etc.
a?b                 b, ab
do(es)?             do, does
o{2}                oo
o{2,}               oo, ooo, oooo, etc.
o{1,3}              o, oo, ooo
[adg]               'a' or 'd' or 'g'
[a-z]               any character from 'a' to 'z'
B[iu]rma            Birma or Burma
.*                  any number of characters other than newline (including none)
\w*                 any number of word characters (letters, digits, underscore)
[^1-6]              any character except the digits from 1 to 6
"[^"\r\n]*"         any string enclosed in double quotes on a single line
\b(in|out)\b        the word 'in' or 'out'
\bxxx\b.*\byyy\b    the word 'xxx' followed later on the same line by the word 'yyy'
\ba\w*\b            words that start with the letter 'a'
\b\w{5,6}\b         five- or six-letter words
\b\d{4,5}\b         4- or 5-digit number
^\w*                the first word in a line or in the text
^test               'test' at the start of a line or of the text
^51|^52             '51' or '52' at the start of a line or of the text
^a{3,4}$            'aaa' or 'aaaa' if it is the entire line or text
^test$              'test' if it is the entire line or text
test$               'test' at the end of a line or of the text
[.?!]               the punctuation at the end of a sentence; '.' and '?' lose their special meanings inside [ ]
[\d]{1,7}           a number of 1 to 7 digits
(\d+|)              a number or empty
(\d+|\**|)          a number, or asterisks, or empty

Note: ^ and $ match at line breaks only when RegexOptions.Multiline is set; by default they match at the start and end of the whole text.
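
A few of the patterns above can be tried out with a short sketch. Find is a hypothetical helper, not part of the notes; it joins all matches with '|' so the result is easy to read. The input strings are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of the pattern, joined with '|'.
string Find(string input, string pattern) =>
    string.Join("|", Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

string words   = Find("apple kiwi banana strawberry", @"\b\w{5,6}\b");  // "apple|banana"
string numbers = Find("12 1234 12345 123456", @"\b\d{4,5}\b");          // "1234|12345"
string inOut   = Find("in into out about", @"\b(in|out)\b");            // "in|out"
string burma   = Find("Birma Burma Barma", "B[iu]rma");                 // "Birma|Burma"

Console.WriteLine($"{words} {numbers} {inOut} {burma}");
```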

Simple examples:

using System.Text.RegularExpressions;
...
// Check if all characters are numeric.
Regex reg = new Regex(@"^\d+$");
bool b1 = reg.Match("3451").Success; // true
bool b2 = reg.Match("34a1").Success; // false
 
// Check if a string has letters a,A,b,B.
bool b3 = Regex.IsMatch("acd", "a|b", RegexOptions.IgnoreCase); // true
bool b4 = Regex.IsMatch("eBd", "a|b", RegexOptions.IgnoreCase); // true
bool b5 = Regex.IsMatch("efg", "a|b", RegexOptions.IgnoreCase); // false
 
// Modify a string.
string s1 = Regex.Replace("aib", "a|b", "X"); // XiX

Example: Extract email addresses from a string:

using System.Collections.Generic;
using System.Text.RegularExpressions;
...
public List<string> GetEmails(string str)
{
    const string Pattern = @"[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}";
    List<string> emails = new List<string>();
 
    Regex reg = new Regex(Pattern, RegexOptions.IgnoreCase);
    MatchCollection matches = reg.Matches(str);
    foreach (Match m in matches)
        emails.Add(m.Value);
 
    return emails;
}
...
string emailsStr = "leon@micro.com john@ Mel@kata.ca Phil@@lego.com";
List<string> emails = GetEmails(emailsStr); // emails = { "leon@micro.com", "Mel@kata.ca" }

Example: Use named groups to match patterns:

string str = @"Leon's GDI+ GDI++ +GDU Mike'ss Kale's Chloe''s";
Regex re = new Regex(@"((?<GroupPlusSign>\w+\+)|(?<GroupApostrophe>\w+'s))(?=(\s|$))", RegexOptions.None);
MatchCollection matches = re.Matches(str); // 3 matches: Leon's GDI+ Kale's
 
int i = 1;
foreach (Match m in matches)
{
    Console.WriteLine($"Match #{i}: " +
        $"GroupPlusSign={m.Groups["GroupPlusSign"].Value}, " +
        $"GroupApostrophe={m.Groups["GroupApostrophe"].Value}");
    ++i;
}

Output:

Match #1: GroupPlusSign=, GroupApostrophe=Leon's
Match #2: GroupPlusSign=GDI+, GroupApostrophe=
Match #3: GroupPlusSign=, GroupApostrophe=Kale's
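
To find out which alternative produced a match without comparing against empty strings, the Success property of each group can be checked. A minimal sketch reusing the string and pattern from the example above; the labels "plus" and "apostrophe" are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string str = @"Leon's GDI+ GDI++ +GDU Mike'ss Kale's Chloe''s";
Regex re = new Regex(@"((?<GroupPlusSign>\w+\+)|(?<GroupApostrophe>\w+'s))(?=(\s|$))");

List<string> kinds = new List<string>();
foreach (Match m in re.Matches(str))
{
    // Group.Success is true only for the group that took part in this match.
    string kind = m.Groups["GroupPlusSign"].Success ? "plus" : "apostrophe";
    kinds.Add($"{m.Value}:{kind}");
}

Console.WriteLine(string.Join(", ", kinds));   // Leon's:apostrophe, GDI+:plus, Kale's:apostrophe
```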

Common Patterns

Name                                     Pattern   Examples / Comments
Email-1                                  \w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3}
Email-2                                  [\w\.-]+(\+[\w-]*)?@([\w-]+\.)+[\w-]+
Phone-1                                  ([+]|)(([0-9]+)([-\s]|))*   1-222-345 345-564321, 33211, 34-23-67
Phone-2                                  ([+]|)([0-9]|)(\s|\-|)[\d]{3}(\s|\-|)[\d]{3}(\s|\-|)[\d]{4}   +1-111-222-3333, 453 678 9900
Phone-3                                  ^([\d]{3}|[\d]{3}-[\d]{1,3}|[\d]{3}-[\d]{1,3}-[\d]{1,4})$
Phone-4                                  [\d]{3}[-\s]?[\d]{3}[-\s]?[\d]{4}   905-234-3422
Phone-5                                  (1|\+1)?[- .]?(\([0-9]\d{2}\)|[0-9]\d{2})[- .]?\d{3}[- .]?\d{4}
Canadian postal code                     [a-zA-Z][0-9][a-zA-Z](\s|\-|)[0-9][a-zA-Z][0-9]
Money                                    ^\$?\d{1,3}((,?\d{3})*(\.\d{2})?|(\.?\d{3})*(,\d{2})?)$
Domain name                              [a-z0-9][-a-z0-9]*(\.[-a-z0-9]+)*\.[a-z]{2,6}
IP address-1                             \b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
IP address-2                             ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)   a regex cannot compare numbers, so the range 0-255 is spelled out digit by digit
IP address-3                             (\d{1,3}\.){3}\d{1,3}   allows numbers to be greater than 255
Number [0-255]                           ^(?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d))   uses non-capturing groups
Number [0-255]                           ^(25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)   same, with a capturing group (no optimization)
An identifier in a programming language  [A-Za-z_][A-Za-z0-9_]*
C-style hexadecimal number               0[xX][A-Fa-f0-9]+
A sequence of digits                     ^\d+$   mandatory, i.e. at least one digit is required
A sequence of digits                     ^\d*$   optional, i.e. an empty string is allowed
Padding spaces                           ^\s+|\s+$   leading or trailing whitespace
An HTML tag                              <[A-Za-z][A-Za-z0-9]*>   an opening tag without attributes
A generic tag                            <[^>]+>   greedy '+' and a negated character class
A generic tag                            <.+?>   slower: '+' is lazy instead of greedy
A number between 1000 and 9999           \b[1-9][0-9]{3}\b
A number between 100 and 99999           \b[1-9][0-9]{2,4}\b
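
Most of the patterns above are not anchored, so they match anywhere inside a longer string. For validation, one option is to wrap a pattern in ^(?:…)$. The sketch below assumes this approach; IsWhole and the constant names are hypothetical, and the patterns are copied from the table.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the pattern must cover the entire input.
bool IsWhole(string input, string pattern) =>
    Regex.IsMatch(input, "^(?:" + pattern + ")$");

const string PostalCode   = @"[a-zA-Z][0-9][a-zA-Z](\s|\-|)[0-9][a-zA-Z][0-9]";
const string HexNumber    = @"0[xX][A-Fa-f0-9]+";
const string Number0To255 = @"25[0-5]|2[0-4]\d|[01]\d\d|\d?\d";

bool postalSpace  = IsWhole("K1A 0B1", PostalCode);     // true
bool postalPlain  = IsWhole("K1A0B1", PostalCode);      // true
bool postalShort  = IsWhole("K1A 0B", PostalCode);      // false: last digit missing
bool hexOk        = IsWhole("0x1F", HexNumber);         // true
bool hexEmpty     = IsWhole("0x", HexNumber);           // false: no digits after the prefix
bool byteMax      = IsWhole("255", Number0To255);       // true
bool byteOverflow = IsWhole("256", Number0To255);       // false

// Unanchored, the same pattern also finds a hex number inside other text.
bool hexInside = Regex.IsMatch("see 0x1F here", HexNumber);   // true

Console.WriteLine($"{postalSpace} {hexOk} {byteMax} {byteOverflow} {hexInside}");
```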

URL (IP):

^(http\://|https\://|ftp\://|)((([a-z_0-9\-]+)+(([\:]?)+([a-z_0-9\-]+))?)(\@+)?)?(((((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))))|((([a-z0-9\-])+\.)+([a-z]{2}\.[a-z]{2}|[a-z]{2,4})))(([\:])(([1-9]{1}[0-9]{1,3})|([1-5]{1}[0-9]{2,4})|(6[0-5]{2}[0-3][0-6])))?$

URL (port and IP allowed):

^(((ht|f)tp(s?))\://)?((([a-zA-Z0-9_\-]{2,}\.)+[a-zA-Z]{2,})|((?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)(?(\.?\d)\.)){4}))(:[a-zA-Z0-9]+)?(/[a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~]*)?$

Date (MM-DD-YYYY; February 29 is accepted only when the last two digits of the year are divisible by 4):

^((((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|([1-2][0-9]))))\-\d{2}(([02468][048])|([13579][26])))|(((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|(1[0-9])|(2[0-8]))))\-\d{2}(([02468][1235679])|([13579][01345789]))))$
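
A short sketch that exercises the date pattern above, copied verbatim into a C# verbatim string. The sample dates are made up. The last check shows a limitation: the pattern looks only at the last two digits of the year, so it accepts 02-29-1900 even though 1900 was not a leap year.

```csharp
using System;
using System.Text.RegularExpressions;

Regex date = new Regex(@"^((((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|([1-2][0-9]))))\-\d{2}(([02468][048])|([13579][26])))|(((((0?[13578])|(1[02]))\-((0?[1-9])|([1-2][0-9])|(3[01])))|(((0?[469])|(11))\-((0?[1-9])|([1-2][0-9])|(30)))|(0?2\-((0?[1-9])|(1[0-9])|(2[0-8]))))\-\d{2}(([02468][1235679])|([13579][01345789]))))$");

bool leapDay     = date.IsMatch("02-29-2020");   // true: 2020 ends in 20, divisible by 4
bool noLeapDay   = date.IsMatch("02-29-2019");   // false: 2019 is not a leap year
bool april31     = date.IsMatch("04-31-2021");   // false: April has 30 days
bool newYearsEve = date.IsMatch("12-31-1999");   // true
bool year1900    = date.IsMatch("02-29-1900");   // true, although 1900 was not a leap year

Console.WriteLine($"{leapDay} {noLeapDay} {april31} {newYearsEve} {year1900}");
```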

Time (24-hour H:MM or H:MM:SS, preceded by one whitespace character; the empty string also matches):

^(\s(((0?[1-9])|(1[0-9])|(2[0-3])|(0)|(00))\:([0-5][0-9])(|(\s)|(\:([0-5][0-9])))))?$

General Info

Regular expressions can be used for string-related operations such as:

  • Validation: Check if an input string is well-formed.
  • Parsing: Extract information from an input string.
  • Transformation: Search substrings and replace them with a new substring.
  • Iteration: Search all occurrences of a substring.
  • Tokenization: Split a string into substrings.
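
Each of these operations maps onto a method of the Regex class. A minimal sketch, assuming a made-up input string of name=number pairs:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "width=10, height=25";

// Validation: is the whole string a comma-separated list of name=number pairs?
bool valid = Regex.IsMatch(input, @"^\w+=\d+(,\s*\w+=\d+)*$");   // true

// Parsing: extract the name and the value of the first pair.
Match first = Regex.Match(input, @"(\w+)=(\d+)");
string name = first.Groups[1].Value;             // "width"
int value = int.Parse(first.Groups[2].Value);    // 10

// Transformation: replace every number with a placeholder.
string masked = Regex.Replace(input, @"\d+", "#");   // "width=#, height=#"

// Iteration: visit all occurrences of a pair.
int pairs = Regex.Matches(input, @"\w+=\d+").Count;  // 2

// Tokenization: split on commas followed by optional whitespace.
string[] tokens = Regex.Split(input, @",\s*");       // { "width=10", "height=25" }

Console.WriteLine($"{valid} {name} {value} {masked} {pairs} {tokens.Length}");
```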
notes/csharp/regex.txt · Last modified: 2018/12/03 by leszek