Learn Regular Expression (Regex)

 

What are Regular Expressions?

Regular Expressions are a powerful pattern matching language that is part of many modern programming languages. Regular Expressions allow you to apply a pattern to an input string and return a list of the matches within the text. Regular expressions also allow text to be replaced using replacement patterns. It is a very powerful version of find and replace.

There are two parts to learning Regular Expressions:

  • learning the Regex syntax
  • learning how to work with Regex in your programming language

This article introduces you to the Regular Expression syntax. After learning the syntax for Regular Expressions you can use it in many different languages, as the syntax is fairly similar between languages.

Microsoft's .NET Framework contains a set of classes for working with Regular Expressions in the System.Text.RegularExpressions namespace.

Download the Regular Expression Designer

When learning Regular Expressions, it helps to have a tool that you can use to test Regex patterns. Rad Software has a Free Regular Expression Tool available for download that will help as you go through the article.

The basics - Finding text

Regular Expressions are similar to find and replace in that ordinary characters match themselves. If I want to match the word "went" the Regular Expression pattern would be "went".

Text:    Anna Jones and a friend went to lunch
Regex:   went
Matches:
went

The following are special characters when working with Regular Expressions. They will be discussed throughout the article.

. $ ^ { [ ( | ) * + ? \

Matching any character with dot

The full stop or period character (.) is known as dot. It is a wildcard that will match any character except a new line (\n). For example, the following Regular Expression matches the 'a' character followed by any two characters.

Text:    abc def ant cow
Regex:   a..
Matches:
abc
ant

If the Singleline option is enabled, a dot matches any character including the new line character.
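The dot behaviour described above can be tried with a minimal C# sketch. It assumes the .NET Regex class from System.Text.RegularExpressions; the class name DotDemo and the helper FindAll are illustrative names, not part of .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: returns the text of every match of pattern in input.
    public static List<string> FindAll(string input, string pattern, RegexOptions options)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            values.Add(m.Value);
        }
        return values;
    }

    public static void Main()
    {
        // 'a' followed by any two characters: finds abc and ant.
        Console.WriteLine(string.Join(", ", FindAll("abc def ant cow", "a..", RegexOptions.None)));

        // By default the dot stops at \n, so a.b finds nothing in "a\nb" ...
        Console.WriteLine(FindAll("a\nb", "a.b", RegexOptions.None).Count);       // 0
        // ... but with Singleline the dot also matches the new line.
        Console.WriteLine(FindAll("a\nb", "a.b", RegexOptions.Singleline).Count); // 1
    }
}
```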

Matching word characters

Backslash and a lowercase 'w' (\w) is a character class that will match any word character (a letter, a digit or the underscore). The following Regular Expression matches 'a' followed by two word characters.

Text:    abc anaconda ant cow apple
Regex:   a\w\w
Matches:
abc
ana
ant
app

Backslash and an uppercase 'W' (\W) will match any non-word character.

Matching white-space

White-space can be matched using \s (backslash and 's'). The following Regular Expression matches the letter 'a' followed by two word characters then a white space character.

Text:    "abc anaconda ant"
Regex:   a\w\w\s
Matches:
"abc "

Note that ant was not matched as it is not followed by a white space character.

White-space is defined as the space character, new line (\n), form feed (\f), carriage return (\r), tab (\t) and vertical tab (\v). Be careful using \s as it can lead to unexpected behaviour by matching line breaks (\n and \r). Sometimes it is better to explicitly specify the characters to match instead of using \s, e.g. to match Tab and Space use [\t\x20].
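The difference between \s and an explicit set of characters can be sketched as follows. This is only an illustration: WhitespaceDemo and CountMatches are made-up names, and the explicit set is written here as [\t\x20] (tab and space).

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical helper: counts how many times pattern matches in input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "one two\tthree\r\nfour";

        // \s matches the space, the tab, \r and \n: 4 matches.
        Console.WriteLine(CountMatches(text, @"\s"));

        // The explicit set matches only the tab and the space: 2 matches.
        Console.WriteLine(CountMatches(text, @"[\t\x20]"));
    }
}
```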

Matching digits

The digits zero to nine can be matched using \d (backslash and lowercase 'd'). For example, the following Regular Expression matches any three digits in a row.

Text:    123 12 843 8472
Regex:   \d\d\d
Matches:
123
843
847

Matching sets of single characters

The square brackets are used to specify a set of single characters to match. Any single character within the set will match. For example, the following Regular Expression matches any three characters where the first character is either 'd' or 'a'.

Text:    abc def ant cow
Regex:   [da]..
Matches:
abc
def
ant

The caret (^) can be added to the start of the set of characters to specify that none of the characters in the character set should be matched. The following Regular Expression matches any three characters where the first character is not 'd' and not 'a'.

Text:    abc def ant cow
Regex:   [^da]..
Matches:
"bc "
"ef "
"nt "
"cow"

Matching ranges of characters

Ranges of characters can be matched using the hyphen (-). The following Regular Expression matches any three characters where the second character is either 'a', 'b', 'c' or 'd'.

Text:    abc pen nda uml
Regex:   .[a-d].
Matches:
abc
nda

Ranges of characters can also be combined together. The following Regular Expression matches any of the characters from 'a' to 'z' or any digit from '0' to '9' followed by two word characters.

Text:    abc no 0aa i8i
Regex:   [a-z0-9]\w\w
Matches:
abc
0aa
i8i

The character class at the start of the pattern could be written more simply as [a-z\d].
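The character set and range examples in the last two sections can be reproduced with a short C# sketch. CharacterSetDemo and JoinMatches are illustrative names; the matches are joined with '|' so that matches containing spaces stay visible.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CharacterSetDemo
{
    // Hypothetical helper: joins the text of every match with '|'.
    public static string JoinMatches(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return string.Join("|", values);
    }

    public static void Main()
    {
        // First character must be 'd' or 'a'.
        Console.WriteLine(JoinMatches("abc def ant cow", "[da].."));
        // First character must be anything except 'd' or 'a'.
        Console.WriteLine(JoinMatches("abc def ant cow", "[^da].."));
        // Two ways of writing "a lowercase letter or a digit" for this input.
        Console.WriteLine(JoinMatches("abc no 0aa i8i", @"[a-z0-9]\w\w"));
        Console.WriteLine(JoinMatches("abc no 0aa i8i", @"[a-z\d]\w\w"));
    }
}
```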

Specifying the number of times to match with Quantifiers

Quantifiers let you specify the number of times that an expression must match. The most frequently used quantifiers are the asterisk character (*) and the plus sign (+). Note that the asterisk (*) is usually called the star when talking about Regular Expressions.

Matching zero or more times with star (*)

The star tells the Regular Expression to match the character, group, or character class that immediately precedes it zero or more times. This means that the character, group, or character class is optional: it can be matched but it does not have to match. The following Regular Expression matches the character 'a' followed by zero or more word characters.

Text:    Anna Jones and a friend owned an anaconda
Regex:   a\w*
Options: IgnoreCase
Matches:
Anna
and
a
an
anaconda

Matching one or more times with plus (+)

The plus sign tells the Regular Expression to match the character, group, or character class that immediately precedes it one or more times. This means that the character, group, or character class must be found at least once. After it is found once it will be matched again if it follows the first match. The following Regular Expression matches the character 'a' followed by at least one word character.

Text:    Anna Jones and a friend owned an anaconda
Regex:   a\w+
Options: IgnoreCase
Matches:
Anna
and
an
anaconda

Note that "a" was not matched as it is not followed by any word characters.

Matching zero or one times with question mark (?)

To specify an optional match use the question mark (?). The question mark matches zero or one times. The following Regular Expression matches the character 'a' optionally followed by a single 'n'.

Text:    Anna Jones and a friend owned an anaconda
Regex:   an?
Options: IgnoreCase
Matches:
An
a
an
a
an
an
a
a

Specifying the number of matches

An exact number of matches for a character, group, or character class can be specified with the curly brackets ({n}). To specify only a minimum, add a comma after the number ({n,}). The following Regular Expression matches the character 'a' followed by exactly two 'n' characters. There must be two 'n' characters for a match to occur.

Text:    Anna Jones and Anne owned an anaconda
Regex:   an{2}
Options: IgnoreCase
Matches:
Ann
Ann

A range of matches can be specified by curly brackets with two numbers inside ({n,m}). The first number (n) is the minimum number of matches required, the second (m) is the maximum number of matches permitted. This Regular Expression matches the character 'a' followed by a minimum of two 'n' characters and a maximum of three 'n' characters.

Text:    Anna and Anne lunched with an anaconda annnnnex
Regex:   an{2,3}
Options: IgnoreCase
Matches:
Ann
Ann
annn

The Regex stops matching after the maximum number of matches has been found, so only the first three 'n' characters of "annnnnex" are included in the match.
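The quantifier examples in this section can be checked with a small C# sketch. QuantifierDemo and JoinMatches are illustrative names, and the helper always applies RegexOptions.IgnoreCase to mirror the Options line of the examples.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: case-insensitive search that returns the matches
    // joined with commas.
    public static string JoinMatches(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
        {
            values.Add(m.Value);
        }
        return string.Join(",", values);
    }

    public static void Main()
    {
        string text = "Anna Jones and a friend owned an anaconda";
        Console.WriteLine(JoinMatches(text, @"a\w*")); // zero or more word characters
        Console.WriteLine(JoinMatches(text, @"a\w+")); // one or more word characters
        Console.WriteLine(JoinMatches(text, "an?"));   // 'a' with an optional 'n'

        string text2 = "Anna and Anne lunched with an anaconda annnnnex";
        Console.WriteLine(JoinMatches(text2, "an{2}"));   // exactly two 'n'
        Console.WriteLine(JoinMatches(text2, "an{2,3}")); // two or three 'n'
        Console.WriteLine(JoinMatches(text2, "an{2,}"));  // two or more 'n'
    }
}
```

Because quantifiers are greedy by default, an{2,3} takes three 'n' characters from "annnnnex" when three are available, and an{2,} takes all five.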

Matching the start and end of a string

To specify that a match must occur at the beginning of a string use the caret character (^). For example, the following Regular Expression pattern matches the character 'a' only at the beginning of the string.

Text:    an anaconda ate Anna Jones
Regex:   ^a
Matches:
"a" at index 0 (the first character)

The pattern above only matches the a in "an".

Note that the caret (^) has different behaviour when used inside the square brackets.

If the Multiline option is on, the caret (^) will match the beginning of each line in a multiline string rather than only the start of the string.

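To specify that a match must occur at the end of a string use the dollar character ($). If the Multiline option is on then the pattern will match at the end of each line in a multiline string. This Regular Expression pattern matches the word at the end of each line in a multiline string.

Text:    "an anaconda
ate Anna
Jones"
Regex:   \w+$
Options: Multiline, IgnoreCase
Matches:
anaconda
Anna
Jones

These matches assume that the lines are separated by a new line (\n) only. In .NET the dollar ($) matches before \n but not before \r, so if the lines end with \r\n only Jones is matched.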
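A short C# sketch of the anchors follows. AnchorDemo and LastWords are illustrative names. The sketch also shows how the line-break style changes the result of \w+$ in .NET, where $ matches before \n but not before \r.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: returns the words found by \w+$ with the
    // Multiline option, joined with commas.
    public static string LastWords(string input)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\w+$", RegexOptions.Multiline))
        {
            values.Add(m.Value);
        }
        return string.Join(",", values);
    }

    public static void Main()
    {
        // ^a matches only the 'a' at the very start of the string (index 0).
        Match first = Regex.Match("an anaconda ate Anna Jones", "^a");
        Console.WriteLine(first.Success + " at index " + first.Index);

        // Lines separated by \n: the last word of every line is found.
        Console.WriteLine(LastWords("an anaconda\nate Anna\nJones"));

        // Lines separated by \r\n: $ does not match before \r, so only the
        // last line produces a match.
        Console.WriteLine(LastWords("an anaconda\r\nate Anna\r\nJones"));
    }
}
```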

Microsoft have an online reference for Regex in .NET: Regular Expression Syntax on MSDN

C# Regular Expression (Regex) Examples in .NET

More Advanced Regular Expression Syntax

This article continues from Learn Regular Expression (Regex) syntax with C# and .NET and covers character escapes, match grouping, some C# code examples, matching boundaries and RegexOptions.

Matching special characters with character escapes

Special characters such as Tab and carriage return are matched using character escapes. The syntax is similar to C and C#. The common character escapes are listed below.

Special Character   Description
\t                  Matches a tab
\r                  Matches a carriage return
\n                  Matches a new line
\u0020              Matches a Unicode character using hexadecimal representation. Exactly four digits must be specified.


In this example, the Regular Expression pattern matches one or more word characters followed by a carriage return then a new line.

Text:    "an anaconda ate
Anna Jones"
Regex:   \w+\r\n
Matches:
"ate\r\n"

Depending on your operating system you might have to combine the \r and \n character escapes to create the correct new line sequence for your platform. For Microsoft Windows systems you should generally use \r\n which is a carriage return then line feed (CRLF). To simply match the end of a line or string use the dollar sign ($).

Match Grouping

Groups perform a few different functions. They allow the quantifiers (such as plus and star) to be applied to sections of the match instead of just individual characters.

A group is specified by the round brackets ( and ). If you want to match the round bracket characters you must use the escape character before the bracket e.g. \( or \).

This regex matches 'http://' optionally followed by 'www.', then starts a group that matches one or more of any character that is not a full stop/period (.), closes the group and then matches '.com'.

Text:    http://www.yahoo.com/index.html and http://yahoo.com
Regex:   http://(www\.)?([^\.]+)\.com
Matches:
http://www.yahoo.com
http://yahoo.com

The question mark after the group (www\.) applies to the whole group making it optional.

An example in C#

The regular expression classes are in the System.Text.RegularExpressions namespace.

using System;
using System.Text.RegularExpressions;

The Regex class represents a regular expression. A regular expression pattern must be specified when creating a Regex object. The pattern cannot be changed.

Regex exp = new Regex(
    @"http://(www\.)?([^\.]+)\.com",
    RegexOptions.IgnoreCase);
string InputText = "http://www.yahoo.com/";

The MatchCollection class stores a list of successful matches found by applying the regular expression pattern to an input string.

MatchCollection MatchList = exp.Matches(InputText);
Match FirstMatch = MatchList[0];
Console.WriteLine(FirstMatch.Value);

The Group class represents a group within the regex pattern. Each Match object has a Groups collection.

Group GroupCurrent;
// Groups[0] is the whole match, so start at 1 to visit only the capture groups.
for (int i = 1; i < FirstMatch.Groups.Count; i++)
{
    GroupCurrent = FirstMatch.Groups[i];

The Success property on the group can be used to check if the Group matched or not.

    if (GroupCurrent.Success)
    {
        Console.WriteLine("\tMatched:" + GroupCurrent.Value);
    }
    else
    {
        Console.WriteLine("\tGroup didn't match");
    }
}

Groups within a Match can be referenced by number or by name (see below).

if (MatchList.Count > 0)
{
    // Group 1 is the optional (www\.) group of the first match.
    if (MatchList[0].Groups[1].Success)
    {
        Console.WriteLine("Group 1 matched");
    }
}

Groups also allow sections of the match to be used in replacement patterns when using Regex.Replace().
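As a rough illustration of a replacement pattern, the sketch below replaces each matched address with the text captured by its second group. ReplaceDemo and SiteNamesOnly are illustrative names. In a .NET replacement pattern, $2 refers to group number 2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces each matched address with the text captured by group 2
    // (the site name). $2 in the replacement pattern refers to that group.
    public static string SiteNamesOnly(string input)
    {
        return Regex.Replace(input, @"http://(www\.)?([^\.]+)\.com", "$2");
    }

    public static void Main()
    {
        string text = "http://www.yahoo.com/index.html and http://yahoo.com";
        // Prints: yahoo/index.html and yahoo
        Console.WriteLine(SiteNamesOnly(text));
    }
}
```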

 

Named Groups

Groups can be named to allow easier identification with the following syntax.

(?<NameOfGroup>expression)
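A minimal sketch of reading a named group from C# follows. The group name "site" and the names NamedGroupDemo and SiteName are chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // "site" is a group name chosen for this example.
    private const string Pattern = @"http://(www\.)?(?<site>[^\.]+)\.com";

    // Returns the text captured by the "site" group, or an empty string
    // when the input does not match.
    public static string SiteName(string input)
    {
        Match m = Regex.Match(input, Pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["site"].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(SiteName("http://www.yahoo.com/")); // yahoo
        Console.WriteLine(SiteName("http://yahoo.com"));      // yahoo
    }
}
```

Looking a group up by name keeps the code working if more groups are later added to the pattern, whereas numeric indexes can shift.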

Matching boundaries between words

To match a boundary between a word character (\w) and a non-word character (\W) use \b. The match will occur at the first or last character in words separated by any nonalphanumeric characters. For example, the following Regular Expression matches one or more word characters followed by a word boundary followed by a hyphen (-) followed by another word boundary followed by one or more word characters.

Text:    Anna Jones and John William-Scott went to lunch- with an anaconda
Regex:   \w+\b-\b\w+
Options: IgnoreCase
Matches:
William-Scott

Use \B to specify that a match must not occur on a \b boundary.
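The boundary escapes can be compared in a short C# sketch. BoundaryDemo and Count are illustrative names, and the second input string is made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Hypothetical helper: counts how many times pattern matches in input.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "Anna Jones and John William-Scott went to lunch- with an anaconda";
        // Only the hyphen with a word character on both sides qualifies.
        Console.WriteLine(Regex.Match(text, @"\w+\b-\b\w+").Value); // William-Scott

        string words = "an anaconda ate";
        Console.WriteLine(Count(words, @"\ba")); // 3: 'a' at the start of a word
        Console.WriteLine(Count(words, @"\Ba")); // 2: 'a' inside or at the end of a word
    }
}
```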

 

Regular Expression Options

Regular Expression Options can be used in the constructor for the Regex class.

RegexOptions.None - Specifies that no options are set.

RegexOptions.IgnoreCase - Specifies case-insensitive matching.

RegexOptions.Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

RegexOptions.Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).

RegexOptions.ExplicitCapture - Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>...).

RegexOptions.IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with the hash sign (#).

RegexOptions.Compiled - Specifies that the regular expression is compiled to an assembly. The regular expression will be faster to match but it takes more time to compile initially. This option (although tempting) should only be used when the expression will be used many times, e.g. in a foreach loop.

RegexOptions.ECMAScript - Enables ECMAScript-compliant behavior for the expression. This flag can be used only in conjunction with the IgnoreCase, Multiline, and Compiled flags. The use of this flag with any other flags results in an exception.

RegexOptions.RightToLeft - Specifies that the search will be from right to left instead of from left to right.
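Options can be combined when constructing a Regex. The sketch below assumes the URL pattern used earlier in the article and combines IgnoreCase with IgnorePatternWhitespace so that the pattern can carry comments. OptionsDemo and SiteName are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // With IgnorePatternWhitespace the pattern can be spread over several
    // lines, and everything from # to the end of a line is a comment.
    private const string Pattern = @"
        http://      # scheme
        (www\.)?     # optional www. prefix
        ([^\.]+)     # site name: everything up to the next full stop
        \.com        # literal .com
    ";

    // RegexOptions is a flags enumeration, so values are combined with |.
    private static readonly Regex Exp = new Regex(
        Pattern,
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns the text captured by group 2, or an empty string if no match.
    public static string SiteName(string input)
    {
        Match m = Exp.Match(input);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(SiteName("HTTP://WWW.Yahoo.COM/index.html")); // Yahoo
    }
}
```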

Reposted from: https://www.cnblogs.com/techworld/archive/2008/05/09/1189191.html
