Powershell regular expression.

http://powershell.com/cs/blogs/ebook/archive/2009/03/30/chapter-13-text-and-regular-expressions.aspx

Chapter 13. Text and Regular Expressions

PowerShell distinguishes sharply between text in single quotation marks and text in double quotation marks. PowerShell won't modify text wrapped in single quotation marks, but it does inspect text in double quotation marks and may modify it by inserting variable contents automatically. Enclosing text in double quotation marks is the simplest way to combine results with descriptions.

The formatting operator -f, one of many specialized string operators, offers more options. For example, you can use -f to output text column-by-column and to set it flush. Other string commands are also important. They can replace selected text, change case, and much more.

Pattern recognition adds a layer of complexity because it uses wildcard characters to match patterns. In simple cases, you can use the same wildcards that you use in the file system. Substantially more powerful, but also more complex, are regular expressions.

Defining Text

If you'd like to save text in a variable or output it, delimit it with quotation marks. Use single quotation marks if you want the text to be stored exactly as you typed it:

$text = 'This text may also contain $env:windir `: $(2+2)'
This text may also contain $env:windir `: $(2+2)

Text will have an entirely different character when you wrap it in (conventional) double quotation marks because enclosed special characters will be evaluated:

$text = "This text may also contain $env:windir `: $(2+2)"
This text may also contain C:\Windows: 4

Special Characters in Text

If text is enclosed in double quotation marks, PowerShell will look for particular special characters in it. Two special characters are important in this regard: "$" and the special backtick character, "`".

Resolving Variables

If PowerShell encounters one of the variables from Chapter 3, it will replace the variable with its value:

$windir = "The Windows directory is here: $env:windir"
$windir
The Windows directory is here: C:\Windows

This also applies to subexpressions in the form $(...), which calculate their value themselves:

$result = "One CD has the capacity of $(720MB / 1.44MB) diskettes."
$result
One CD has the capacity of 500 diskettes.

Inserting Special Characters

The peculiar backtick character, "`", has two tasks. First, if you type it before characters that are particularly important for PowerShell, such as "$" or quotation marks, PowerShell will interpret the character following the backtick as a normal text character. You could output quotation marks in text like this:

"This text includes `"quotation marks`""
This text includes "quotation marks"

If one of the letters listed in Table 13.1 follows the backtick character, PowerShell will insert special characters:

$text = "This text consists of`ntwo lines."
$text
This text consists of
two lines.

Escape Sequence   Special Character
`n                New line
`r                Carriage return
`t                Tabulator
`a                Alarm
`b                Backspace
`'                Single quotation mark
`"                Double quotation mark
`0                Null
``                Backtick character

Table 13.1: Special characters and "escape" sequences for text

"Here-Strings": Acquiring Text of Several Lines

Using "here-strings" is the best way to acquire long text consisting of several lines or many special characters. Here-strings get their name because they let you enter text exactly the way you want to store it in a variable, much like in a text editor. A here-string begins with the characters @" and ends with the characters "@, which must stand at the beginning of a line. Note once again that PowerShell automatically resolves text enclosed in @" and "@: variables are replaced with their values and backtick sequences are evaluated. If you use single quotation marks instead (@' and '@), the text will remain exactly the way you typed it:

$text = @"
Here-Strings can easily stretch over several lines and may also include
"quotation marks". Nevertheless, here, too, variables are replaced with
their values: $env:windir, and subexpressions like $(2+2) are likewise replaced
with their result. The text will be concluded only if you terminate the
here-string with the termination symbol "@.
"@

$text
Here-Strings can easily stretch over several lines and may also include
"quotation marks". Nevertheless, here, too, variables are replaced with
their values: C:\Windows, and subexpressions like 4 are likewise replaced
with their result. The text will be concluded only if you terminate the
here-string with the termination symbol "@.
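
For comparison, here is a minimal sketch of the single-quoted form next to the double-quoted one. The variable names $literal and $expanded are made up for this illustration:

```powershell
# Single-quoted here-string: nothing is resolved, the text is stored literally.
$literal = @'
Windows directory: $env:windir
Sum: $(2+2)
'@

# Double-quoted here-string: variables and subexpressions are resolved.
$expanded = @"
Sum: $(2+2)
"@

$literal    # still contains $env:windir and $(2+2) as typed
$expanded   # Sum: 4
```

The single-quoted form is the safer choice for text that contains many "$" or backtick characters that are meant literally, such as script fragments or regular expressions.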

Communicating with the User

If you'd like to request users to input text, use Read-Host:

$text = Read-Host "Your entry"
Your entry: Hello world!
$text
Hello world!

Text acquired by Read-Host behaves like text enclosed in single quotation marks: special characters and variables are not resolved. If you want to resolve the contents of a text variable later on, that is, have the variables and special characters in it replaced, call the ExpandString() method yourself. PowerShell uses this method internally when you specify text in double quotation marks:

# Query and output text entry by user:
$text = Read-Host "Your entry"
Your entry: $env:windir
$text
$env:windir

# Treat entered text as if it were in double quotation marks:
$ExecutionContext.InvokeCommand.ExpandString($text)
C:\Windows
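
A small sketch of the same call with fixed input may make the behavior easier to verify. The variables $typed and $resolved are hypothetical names, and the single-quoted text merely stands in for something a user typed at the Read-Host prompt:

```powershell
# Text in single quotes stands in for a Read-Host entry: nothing is resolved yet.
$typed = 'Two plus two is $(2+2)'

# ExpandString() returns the resolved text; $typed itself is left unchanged.
$resolved = $ExecutionContext.InvokeCommand.ExpandString($typed)

$typed      # Two plus two is $(2+2)
$resolved   # Two plus two is 4
```

Because ExpandString() evaluates subexpressions, it runs whatever code the text contains. It is therefore best reserved for input that you trust.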

If you'd like to use Read-Host to acquire sensitive data such as passwords, use the -AsSecureString parameter. The screen entries will be masked by asterisks, and the result will be a so-called SecureString. To be able to work on the encrypted SecureString as normal text, you must convert it to plain text first:

$secret = Read-Host -AsSecureString "Password"
Password: *************
$secret
System.Security.SecureString
[Runtime.InteropServices.Marshal]::PtrToStringAuto(
  [Runtime.InteropServices.Marshal]::SecureStringToBSTR($secret))
strictly confidential
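
The conversion above leaves a plain-text copy of the password in unmanaged memory. The following sketch shows one way to release that copy again. It is an illustration under assumptions, not the chapter's own method: it builds the SecureString with ConvertTo-SecureString instead of prompting, swaps in PtrToStringBSTR() for PtrToStringAuto(), and the variable names are made up.

```powershell
# Build a SecureString without prompting (stand-in for Read-Host -AsSecureString).
$secure = ConvertTo-SecureString 'strictly confidential' -AsPlainText -Force

# Copy the content into unmanaged memory, read it back, then wipe that copy.
$bstr = [Runtime.InteropServices.Marshal]::SecureStringToBSTR($secure)
try {
    $plain = [Runtime.InteropServices.Marshal]::PtrToStringBSTR($bstr)
}
finally {
    # ZeroFreeBSTR overwrites the unmanaged buffer with zeros and frees it.
    [Runtime.InteropServices.Marshal]::ZeroFreeBSTR($bstr)
}

$plain
```

The try/finally pair makes sure the buffer is wiped even if reading the text fails. The managed string in $plain still holds the password, so it should not be kept longer than necessary.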

Querying User Name and Password

If you'd like to authenticate a user, for example by asking for a name and password, use Get-Credential. This cmdlet uses the secure dialog boxes that are integrated into Windows to request the user name and password:

Get-Credential -Credential "Your name?"
UserName      Password
--------      --------
\Your name    System.Security.SecureString

The result is an object with two properties: the given user name is in UserName, and the encrypted password is in Password as an instance of SecureString.

Figure 13.1: Querying user passwords using the integrated secure dialog box

Normally, Get-Credential is used if logon data are actually needed, such as to run a program under a particular user name:

$logon = Get-Credential
$startinfo = New-Object System.Diagnostics.ProcessStartInfo
$startinfo.UserName = $logon.UserName
$startinfo.Password = $logon.Password
$startinfo.FileName = "$env:windir\regedit.exe"
$startinfo.UseShellExecute = $false
[System.Diagnostics.Process]::Start($startinfo)

However, the user context that created the SecureString can turn it back into readable text at any time, as was the case for Read-Host. For this reason, you can also use Get-Credential to query sensitive information that you then process in plain text:

$logon = Get-Credential
[Runtime.InteropServices.Marshal]::PtrToStringAuto(
  [Runtime.InteropServices.Marshal]::SecureStringToBSTR($logon.Password))
MySecretPassword

Using Special Text Commands

Often, results need to be properly output and provided with descriptions. The simplest approach doesn't require any special commands: insert the result as a variable or sub-expression directly into text and make sure that text is enclosed in double quotation marks.

# Embedding a subexpression in text:
"One CD has the capacity of $(720MB / 1.44MB) diskettes."
One CD has the capacity of 500 diskettes.

# Embedding a variable in text:
$result = 720MB / 1.44MB
"One CD has the capacity of $result diskettes."
One CD has the capacity of 500 diskettes.

More options are offered by special text commands that PowerShell furnishes from three different areas:

  • String operators: PowerShell includes a number of string operators for general text tasks, which you can use to replace text and to compare text (Table 13.2).
  • Dynamic methods: the String data type, which saves text, includes its own set of text statements that you can use to search through, dismantle, reassemble, and modify text in diverse ways (Table 13.6).
  • Static methods: finally, the String .NET class includes static methods that are not bound to any particular text.

String Operators

The -f format operator is the most important PowerShell string operator. You'll soon be using it to format numeric values for easier reading:

"{0} diskettes per CD" -f (720mb /1.44mb)
500 diskettes per CD

All operators function in basically the same way: they anticipate data from the left and the right that they can link together. For example, you can use -replace to substitute parts of the string for other parts:

"Hello Carl" -replace "Carl", "Eddie"
Hello Eddie

There are three implementations of the -replace operator; many other string operators also have three implementations. Its basic version is case insensitive. If you'd like to distinguish between lowercase and uppercase, use the version beginning with "c" (for case-sensitive):

# No replacement because the comparison is case sensitive this time:
"Hello Carl" -creplace "carl", "eddie"
Hello Carl

The third type begins with "i" (for insensitive) and is case insensitive. This means that this version is actually superfluous because it works the same way as -replace. The third version is merely demonstrative: if you use -ireplace instead of -replace, you'll make clear that you expressly do not want to distinguish between uppercase and lowercase.

Operator                Description                                                                    Example
*                       Repeats a string                                                               "=" * 20
+                       Combines two string parts                                                      "Hello " + "World"
-replace, -ireplace     Substitutes a string; case insensitive                                         "Hello Carl" -replace "Carl", "Eddie"
-creplace               Substitutes a string; case sensitive                                           "Hello Carl" -creplace "carl", "eddie"
-eq, -ieq               Verifies equality; case insensitive                                            "Carl" -eq "carl"
-ceq                    Verifies equality; case sensitive                                              "Carl" -ceq "carl"
-like, -ilike           Verifies whether a string matches a wildcard pattern; case insensitive         "Carl" -like "*AR*"
-clike                  Verifies whether a string matches a wildcard pattern; case sensitive           "Carl" -clike "*AR*"
-notlike, -inotlike     Verifies whether a string does not match a wildcard pattern; case insensitive  "Carl" -notlike "*AR*"
-cnotlike               Verifies whether a string does not match a wildcard pattern; case sensitive    "Carl" -cnotlike "*AR*"
-match, -imatch         Verifies whether a pattern is in a string; case insensitive                    "Hello" -match "[ao]"
-cmatch                 Verifies whether a pattern is in a string; case sensitive                      "Hello" -cmatch "[ao]"
-notmatch, -inotmatch   Verifies whether a pattern is not in a string; case insensitive                "Hello" -notmatch "[ao]"
-cnotmatch              Verifies whether a pattern is not in a string; case sensitive                  "Hello" -cnotmatch "[ao]"

Table 13.2: Operators used for handling string
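
One detail is easy to miss when using -replace: its first argument is interpreted as a regular expression, not as plain text. The sketch below illustrates the difference with made-up variable names; [regex]::Escape() and the Replace() method are used here as two possible ways to get a literal replacement.

```powershell
# The first argument of -replace is a regular expression, so "." means "any character".
$naive = 'a.b.c' -replace '.', '-'

# Escaping the pattern makes -replace treat it as literal text.
$escaped = 'a.b.c' -replace [regex]::Escape('.'), '-'

# The String method Replace() always works on literal text (and is case sensitive).
$method = 'a.b.c'.Replace('.', '-')

$naive     # -----
$escaped   # a-b-c
$method    # a-b-c
```

For simple names such as "Carl" the distinction does not matter, but it becomes important as soon as the search text contains characters like ".", "$", "(" or "\".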

Formatting String

The format operator -f formats a string. It expects a string with wildcards such as {0} on its left side and, on its right side, the values that are to be inserted into the string in place of the wildcards:

"{0} diskettes per CD" -f (720mb /1.44mb)
500 diskettes per CD

The right side must supply a value for every wildcard used in the string on the left side. If you want to calculate a value on the spot, put the calculation in parentheses. As is generally true in PowerShell, parentheses ensure that the enclosed statement is evaluated first and separately, and that its result is then processed in place of the parentheses. Without parentheses, -f would report an error:

"{0} diskettes per CD" -f 720mb /1.44mb
Bad numeric constant: 754974720 diskettes per CD.
At line:1 char:33
+ "{0} diskettes per CD" -f 720mb/1 <<<< .44mb

You may use as many wildcards as you wish. The number in the braces is the zero-based position of the value on the right side that will be inserted at that point:

"{0} {3} at {2}MB fit into one CD at {1}MB" `
-f (720mb /1.44mb), 720, 1.44, "diskettes"
500 diskettes at 1.44MB fit into one CD at 720MB

Setting Numeric Formats

The formatting operator -f can insert values into text as well as format the values. Every wildcard used has the following formal structure: {index[,alignment][:format]}:

  • Index: This number indicates which value is to be used for this wildcard. For example, you could use several wildcards with the same index if you want to output one and the same value several times, or in various display formats. The index number is the only obligatory specification. The other two specifications are optional.
  • Alignment: Positive or negative numbers can be specified that determine whether the value is right justified (positive number) or left justified (negative number). The number states the desired width. If the value is wider than the specified width, the specified width will be ignored. However, if the value is narrower than the specified width, the width will be filled with blank characters. This allows columns to be set flush.
  • Format: The value can be formatted in very different ways. Here you can use the relevant format name to specify the format you wish. You'll find an overview of available formats below.
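
The three parts can be combined in a single wildcard. The following sketch is only an illustration: it calls the static [string]::Format() method with the invariant culture so that the separators do not depend on regional settings, and the variable names and the surrounding square brackets are made up to make the padding visible.

```powershell
# Invariant culture keeps the separators predictable regardless of regional settings.
$inv = [Globalization.CultureInfo]::InvariantCulture

# Index 0, width 12 (positive: right-aligned), format N2.
$right = [string]::Format($inv, '[{0,12:N2}]', 1234.5)

# Same value with a negative width: left-aligned.
$left = [string]::Format($inv, '[{0,-12:N2}]', 1234.5)

# The same index may be used more than once with different formats.
$twice = [string]::Format($inv, '{0:N0} is 0x{0:X} in hexadecimal', 255)

$right   # [    1,234.50]
$left    # [1,234.50    ]
$twice   # 255 is 0xFF in hexadecimal
```

With the -f operator the same wildcards work unchanged, but the output then follows the regional settings of the current session.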

Unlike most of PowerShell, format specifiers are case sensitive. You can see how large the differences can be when you format dates:

# Formatting with a small letter d:
"Date: {0:d}" -f (Get-Date)
Date: 08/28/2007
# Formatting with a large letter D:
"Date: {0:D}" -f (Get-Date)
Date: Tuesday, August 28, 2007

(Results shown for $value = 1000000)

Symbol  Type                        Call                     Result
#       Digit placeholder           "{0:(#).##}" -f $value   (1000000)
%       Percentage                  "{0:0%}" -f $value       100000000%
,       Thousands separator         "{0:0,0}" -f $value      1,000,000
,.      Integral multiple of 1,000  "{0:0,.}" -f $value      1000
.       Decimal point               "{0:0.0}" -f $value      1000000.0
0       0 placeholder               "{0:00.0000}" -f $value  1000000.0000
c       Currency                    "{0:c}" -f $value        1,000,000.00 €
d       Decimal                     "{0:d}" -f $value        1000000
e       Scientific notation         "{0:e}" -f $value        1.000000e+006
e       Exponent wildcard           "{0:00e+0}" -f $value    10e+5
f       Fixed point                 "{0:f}" -f $value        1000000.00
g       General                     "{0:g}" -f $value        1000000
n       Thousands separator         "{0:n}" -f $value        1,000,000.00
x       Hexadecimal                 "0x{0:x4}" -f $value     0xf4240

Table 13.3: Formatting numbers

Using the formats in Table 13.3, you can format numbers quickly and comfortably. No need for you to squint your eyes any longer trying to decipher whether a number is a million or 10 million:

10000000000
"{0:N0}" -f 10000000000
10,000,000,000

There's also a very wide range of time and date formats. The relevant formats are listed in Table 13.4 and their operation is shown in the following lines:

$date = Get-Date
foreach ($format in "d", "D", "f", "F", "g", "G", "m", "r", "s", "t", "T", `
  "u", "U", "y", "dddd, MMMM dd yyyy", "M/yy", "dd-MM-yy") {
  "DATE with $format : {0}" -f $date.ToString($format) }
DATE with d : 10/15/2007
DATE with D : Monday, 15 October, 2007
DATE with f : Monday, 15 October, 2007 02:17 PM
DATE with F : Monday, 15 October, 2007 02:17:02 PM
DATE with g : 10/15/2007 02:17
DATE with G : 10/15/2007 02:17:02
DATE with m : October 15
DATE with r : Mon, 15 Oct 2007 02:17:02 GMT
DATE with s : 2007-10-15T02:17:02
DATE with t : 02:17 PM
DATE with T : 02:17:02 PM
DATE with u : 2007-10-15 02:17:02Z
DATE with U : Monday, 15 October, 2007 00:17:02
DATE with y : October, 2007
DATE with dddd, MMMM dd yyyy : Monday, October 15 2007
DATE with M/yy : 10/07
DATE with dd-MM-yy : 15-10-07

Symbol  Type                                  Call               Result
d       Short date format                     "{0:d}" -f $value  09/07/2007
D       Long date format                      "{0:D}" -f $value  Friday, September 7, 2007
t       Short time format                     "{0:t}" -f $value  10:53 AM
T       Long time format                      "{0:T}" -f $value  10:53:56 AM
f       Full date and time (short)            "{0:f}" -f $value  Friday, September 7, 2007 10:53 AM
F       Full date and time (long)             "{0:F}" -f $value  Friday, September 7, 2007 10:53:56 AM
g       Standard date (short)                 "{0:g}" -f $value  09/07/2007 10:53 AM
G       Standard date (long)                  "{0:G}" -f $value  09/07/2007 10:53:56 AM
M       Month and day                         "{0:M}" -f $value  September 07
r       RFC1123 date format                   "{0:r}" -f $value  Fri, 07 Sep 2007 10:53:56 GMT
s       Sortable date format                  "{0:s}" -f $value  2007-09-07T10:53:56
u       Universally sortable date format      "{0:u}" -f $value  2007-09-07 10:53:56Z
U       Universally sortable GMT date format  "{0:U}" -f $value  Friday, September 7, 2007 08:53:56
Y       Year/month format pattern             "{0:Y}" -f $value  September 2007

Table 13.4: Formatting date values

If you want to find out which types support formatting of this kind, you need only look for .NET types whose ToString() method accepts a format string:

[AppDomain]::CurrentDomain.GetAssemblies() | ForEach-Object {
  $_.GetExportedTypes() | Where-Object { !$_.IsSubclassOf([System.Enum]) }
} | ForEach-Object {
  $methods = $_.GetMethods() | Where-Object { $_.Name -eq "ToString" } | ForEach-Object { "$_" }
  if ($methods -eq "System.String ToString(System.String)") {
    $_.FullName
  }
}
System.Enum
System.DateTime
System.Byte
System.Convert
System.Decimal
System.Double
System.Guid
System.Int16
System.Int32
System.Int64
System.IntPtr
System.SByte
System.Single
System.UInt16
System.UInt32
System.UInt64
Microsoft.PowerShell.Commands.MatchInfo

For example, among the supported data types is the "globally unique identifier" System.Guid. Because you'll frequently need GUIDs, which are unique worldwide, here's a brief example showing how to create and format a GUID:

$guid = [Guid]::NewGuid()
foreach ($format in "N", "D", "B", "P") {
  "GUID with $format : {0}" -f $guid.ToString($format) }
GUID with N : 0c4d2c4c8af84d198b698e57c1aee780
GUID with D : 0c4d2c4c-8af8-4d19-8b69-8e57c1aee780
GUID with B : {0c4d2c4c-8af8-4d19-8b69-8e57c1aee780}
GUID with P : (0c4d2c4c-8af8-4d19-8b69-8e57c1aee780)

Symbol  Type                              Call                  Result
dd      Day of month                      "{0:dd}" -f $value    07
ddd     Abbreviated name of day           "{0:ddd}" -f $value   Fri
dddd    Full name of day                  "{0:dddd}" -f $value  Friday
gg      Era                               "{0:gg}" -f $value    A.D.
hh      Hours from 01 to 12               "{0:hh}" -f $value    10
HH      Hours from 00 to 23               "{0:HH}" -f $value    10
mm      Minute                            "{0:mm}" -f $value    53
MM      Month                             "{0:MM}" -f $value    09
MMM     Abbreviated month name            "{0:MMM}" -f $value   Sep
MMMM    Full month name                   "{0:MMMM}" -f $value  September
ss      Second                            "{0:ss}" -f $value    56
tt      AM or PM                          "{0:tt}" -f $value    AM
yy      Year in two digits                "{0:yy}" -f $value    07
yyyy    Year in four digits               "{0:yyyy}" -f $value  2007
zz      Time zone including leading zero  "{0:zz}" -f $value    +02
zzz     Time zone in hours and minutes    "{0:zzz}" -f $value   +02:00

Table 13.5: Customized date value formats
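
The building blocks from Table 13.5 can be combined freely. The sketch below uses a fixed date instead of Get-Date and passes the invariant culture to ToString() so that day and month names do not depend on regional settings. Both choices, as well as the variable names, are assumptions made for the sake of a reproducible illustration.

```powershell
# A fixed date instead of Get-Date, so the result is reproducible.
$date = [datetime]'2007-09-07T10:53:56'
$inv  = [Globalization.CultureInfo]::InvariantCulture

# The standard format "s" (sortable) does not depend on regional settings.
$sortable = $date.ToString('s')

# A custom format string assembled from the elements in Table 13.5.
$custom = $date.ToString('dddd, dd MMM yyyy HH:mm', $inv)

$sortable   # 2007-09-07T10:53:56
$custom     # Friday, 07 Sep 2007 10:53
```

The sortable format is a reasonable default for file names and log entries because such text sorts chronologically.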

Outputting Values in Tabular Form: Fixed Width

To display the output of several lines in a fixed-width font and align them one below the other, each column of the output must have a fixed width. The formatting operator -f can set outputs to a fixed width.

In the following example, Dir returns a directory listing, from which a subsequent loop outputs file names and file sizes. Because file names and sizes vary, the result is ragged right and hard to read:

dir | ForEach-Object { "$($_.name) = $($_.Length) Bytes" }
history.csv = 307 Bytes
info.txt = 8562 Bytes
layout.lxy = 1280 Bytes
list.txt = 164186 Bytes
p1.nrproj = 5808 Bytes
ping.bat = 116 Bytes
SilentlyContinue = 0 Bytes

The following result with fixed column widths is far more legible. To set widths, add a comma to the sequential number of the wildcard and after it specify the number of characters available to the wildcard. Positive numbers will set values to right alignment, negative numbers to left alignment:

dir | ForEach-Object { "{0,-20} = {1,10} Bytes" -f $_.Name, $_.Length }
history.csv          =        307 Bytes
info.txt             =       8562 Bytes
layout.lxy           =       1280 Bytes
list.txt             =     164186 Bytes
p1.nrproj            =       5808 Bytes
ping.bat             =        116 Bytes
SilentlyContinue     =          0 Bytes

String Object Methods

You know from Chapter 6 that PowerShell stores everything in objects and that every object contains a set of instructions known as methods. Text is stored in a String object, which includes a number of useful commands for working with text. For example, to ascertain the file extension of a file name, use LastIndexOf() to determine the position of the last "." character, and then use Substring() to extract text starting from the position:

$path = "c:\test\Example.bat"
$path.Substring($path.LastIndexOf(".") + 1)
bat

Another approach uses the dot as separator and Split() to split up the path into an array. The result is that the last element of the array (-1 index number) will include the file extension:

$path.Split(".")[-1]
bat
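
Both approaches assume that the path actually contains a dot. The sketch below shows what happens otherwise and, as one possible alternative, uses the .NET method [System.IO.Path]::GetExtension(). The sample paths and variable names are invented for illustration.

```powershell
$path = 'c:\test\Example.bat'
$noExtension = 'c:\test\README'

# Substring/LastIndexOf as in the text: fine when a dot is present ...
$viaSubstring = $path.Substring($path.LastIndexOf('.') + 1)

# ... but without a dot LastIndexOf() returns -1, so the whole string comes back.
$surprise = $noExtension.Substring($noExtension.LastIndexOf('.') + 1)

# The .NET Path class handles both cases; note that it keeps the leading dot.
$viaPath = [System.IO.Path]::GetExtension($path)
$nothing = [System.IO.Path]::GetExtension($noExtension)

$viaSubstring   # bat
$surprise       # c:\test\README
$viaPath        # .bat
```

For the path without an extension, GetExtension() returns an empty string, which is usually easier to test for than an unexpected full path.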

Table 13.6 provides an overview of the methods that a String object includes.

Function            Description                                                                                          Example
CompareTo()         Compares one string to another                                                                       ("Hello").CompareTo("Hello")
Contains()          Returns "True" if a specified comparison string is in a string or if the comparison string is empty  ("Hello").Contains("ll")
CopyTo()            Copies part of a string into a character array                                                       $a = ("Hello World").ToCharArray(); ("User!").CopyTo(0, $a, 6, 5); $a
EndsWith()          Tests whether the string ends with a specified string                                                ("Hello").EndsWith("lo")
Equals()            Tests whether one string is identical to another string                                              ("Hello").Equals($a)
IndexOf()           Returns the index of the first occurrence of a comparison string                                     ("Hello").IndexOf("l")
IndexOfAny()        Returns the index of the first occurrence of any character in a comparison string                    ("Hello").IndexOfAny("loe")
Insert()            Inserts new string at a specified index in an existing string                                        ("Hello World").Insert(6, "brave ")
GetEnumerator()     Retrieves a new object that can enumerate all characters of a string                                 ("Hello").GetEnumerator()
LastIndexOf()       Finds the index of the last occurrence of a specified character                                      ("Hello").LastIndexOf("l")
LastIndexOfAny()    Finds the index of the last occurrence of any character of a specified string                        ("Hello").LastIndexOfAny("loe")
PadLeft()           Pads a string to a specified length and adds blank characters to the left (right-aligned string)     ("Hello").PadLeft(10)
PadRight()          Pads a string to a specified length and adds blank characters to the right (left-aligned string)     ("Hello").PadRight(10) + "World!"
Remove()            Removes any requested number of characters starting from a specified position                        ("Hello World").Remove(5,6)
Replace()           Replaces a character with another character                                                          ("Hello World").Replace("l", "x")
Split()             Converts a string with specified splitting points into an array                                      ("Hello World").Split("l")
StartsWith()        Tests whether a string begins with a specified string                                                ("Hello World").StartsWith("He")
Substring()         Extracts characters from a string                                                                    ("Hello World").Substring(4, 3)
ToCharArray()       Converts a string into a character array                                                             ("Hello World").ToCharArray()
ToLower()           Converts a string to lowercase                                                                       ("Hello World").ToLower()
ToLowerInvariant()  Converts a string to lowercase using casing rules of the invariant culture                           ("Hello World").ToLowerInvariant()
ToUpper()           Converts a string to uppercase                                                                       ("Hello World").ToUpper()
ToUpperInvariant()  Converts a string to uppercase using casing rules of the invariant culture                           ("Hello World").ToUpperInvariant()
Trim()              Removes blank characters to the right and left                                                       (" Hello ").Trim() + "World"
TrimEnd()           Removes blank characters to the right                                                                (" Hello ").TrimEnd() + "World"
TrimStart()         Removes blank characters to the left                                                                 (" Hello ").TrimStart() + "World"
Chars()             Provides a character at the specified position                                                       ("Hello").Chars(0)

Table 13.6: The methods of a string object

Analyzing Methods: Split() as Example

You already know in detail from Chapter 6 how to use Get-Member to find out which methods an object contains and how to invoke them. Just as a quick refresher, let's look again at an example of the Split() method to see how it works.

("something" | Get-Member Split).Definition
System.String[] Split(Params Char[] separator), System.String[] Split(
Char[] separator, Int32 count), System.String[] Split(Char[] separator,
StringSplitOptions options), System.String[] Split(Char[] separator,
Int32 count, StringSplitOptions options), System.String[] Split(String[]
separator, StringSplitOptions options), System.String[] Split(String[]
separator, Int32 count, StringSplitOptions options)

The definition is output, but it isn't very easy to read. Because Definition is itself a String object, you can use methods from Table 13.6, including Replace(), to insert a line break where appropriate. That makes the result much more understandable:

("something" | Get-Member Split).Definition.Replace("), ", ")`n")
System.String[] Split(Params Char[] separator)
System.String[] Split(Char[] separator, Int32 count)
System.String[] Split(Char[] separator, StringSplitOptions options)
System.String[] Split(Char[] separator, Int32 count,
StringSplitOptions options)
System.String[] Split(String[] separator, StringSplitOptions options)
System.String[] Split(String[] separator, Int32 count,
StringSplitOptions options)

There are six different ways to invoke Split(). In simple cases, you call Split() with only one argument. Split() then expects a character array and uses every single character in it as a possible separator. That's important because it means that you may use several separators at once:

"a,b;c,d;e;f".Split(",;")
a
b
c
d
e
f
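
The example relies on PowerShell converting the string ",;" into a character array. Which Split() signature is chosen for a plain string argument may depend on the .NET version underneath, since newer versions add signatures that accept a single string. A sketch that states the intent explicitly, using a [char[]] cast and a made-up variable name, could look like this:

```powershell
# Casting to [char[]] states explicitly that each character is its own separator.
$parts = 'a,b;c,d;e;f'.Split([char[]]',;')

$parts.Count       # 6
$parts -join '|'   # a|b|c|d|e|f
```

The cast costs a few extra characters but selects the character-array signature regardless of how the automatic conversion would otherwise be resolved.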

If the splitting separator itself consists of several characters, then it has got to be a string and not a single Char character. There are only two signatures that meet this condition:

System.String[] Split(String[] separator,
StringSplitOptions options)
System.String[] Split(String[] separator, Int32 count,
StringSplitOptions options)

To use a particular signature, you must pass arguments of exactly the data types that the signature requires. If you want to use the first signature, the first argument must be of the String[] type and the second argument of the StringSplitOptions type. The simplest way to meet this requirement is to assign the arguments to strongly typed variables first. Create each variable with exactly the type that the signature requires:

# Create a variable of the [StringSplitOptions] type:
[StringSplitOptions] $option = "None"

# Create a variable of the String[] type:
[string[]] $separator = ",;"
# Invoke Split with the desired signature and use a two-character separator:
("a,b;c,;d,e;f,;g").Split($separator, $option)
a,b;c
d,e;f
g

Split() in fact now uses a separator consisting of several characters. It splits the string only at the points where it finds precisely the characters that were specified. There does remain the question of how do you know it is necessary to assign the value "None" to the StringSplitOptions data type. The simple answer is: you don't know and it isn't necessary to know. If you assign a value to an unknown data type that can't handle the value, the data type will automatically notify you of all valid values:

[StringSplitOptions] $option = "werner wallbach"
Cannot convert value "werner wallbach" to type
"System.StringSplitOptions" due to invalid enumeration
values. Specify one of the following enumeration values
and try again. The possible enumeration values are
"None, RemoveEmptyEntries".
At line:1 char:28
+ [StringSplitOptions]$option <<<< = "werner wallbach"

The names of the valid values usually make their purpose clear. For example, what does RemoveEmptyEntries accomplish? If Split() runs into several separators in a row, empty array elements are the consequence. RemoveEmptyEntries discards such empty entries. You could use it to remove redundant blank characters from a text:

[StringSplitOptions] $option = "RemoveEmptyEntries"
"This  text has   too much    whitespace".Split(" ", $option)
This
text
has
too
much
whitespace

Now all you need is just a method that can convert the elements of an array back into text. The method is called Join(); it is not in a String object but in the String class.

Using String Class Commands

Chapter 6 clearly defined the distinction between classes and objects (or instances). Just to refresh your memory: every String object is derived from the String class. Both include diverse methods. You can see the methods of the class when you press (Tab) after the following instruction, which activates AutoComplete:

[String]::(Tab)

Get-Member will return a list of all methods. This time, specify the -Static parameter in addition:

"sometext" | Get-Member -Static -MemberType Method

You've already used static methods. In reality, the -f format operator corresponds to the Format() static method, and that's why the following two statements work in exactly the same way:

# Using a format operator:
"Hex value of 180 is &h{0:X}" -f 180
Hex value of 180 is &hB4

# The static method Format has the same result:
[string]::Format("Hex value of 180 is &h{0:X}", 180)
Hex value of 180 is &hB4

The Format() static method is very important but is usually ignored because -f is much easier to handle. But you wouldn't be able to do without two other static methods: Join() and Concat().

Join(): Changing Arrays to Text

Join() is the counterpart of Split() discussed above. Join() assembles an array of string elements into a string. It enables you to complete the above example and to make a function that removes superfluous white space characters from the string:

function RemoveSpace([string] $text) {
  $private:array = $text.Split(" ",
    [StringSplitOptions]::RemoveEmptyEntries)
  [string]::Join(" ", $array)
}

RemoveSpace "Hello,  this   text has    too much  whitespace."
Hello, this text has too much whitespace.
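
The same clean-up can also be sketched with the -split and -join operators, which are available from PowerShell 2.0 on. This is an alternative to the Split()/Join() pair described above, not a replacement for it, and the function name Remove-ExtraSpace is hypothetical:

```powershell
function Remove-ExtraSpace([string] $Text) {
    # Unary -split breaks the text at runs of white space and skips empty entries;
    # -join glues the words back together with a single blank.
    (-split $Text) -join ' '
}

Remove-ExtraSpace 'Hello,   this  text has     too much   whitespace.'
```

Unlike Split(" ", ...), the unary -split operator treats tabs and line breaks as separators as well, which may or may not be what you want.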

Concat(): Assembling a String Out of Several Parts

Concat() assembles a string out of several separate parts. At first glance, it works like the "+" operator:

"Hello" + " " + "World!"
Hello World!

But note that the "+" operator always acts strangely when the first value isn't a string:

# Everything will be fine if the first value is string:
"Today is " + (Get-Date)
Today is 08/29/2007 11:02:24

# If the first value is not text, errors may result:
(Get-Date) + " is a great date!"
Cannot convert argument "1", with value: " is a great date!",
for "op_Addition" to type "System.TimeSpan": "Cannot convert
value " is a great date!" to type "System.TimeSpan". Error:
"Input string was not in a correct format.""
At line:1 char:13
+ (Get-Date) + <<<< " is a great date!

If the first value of the calculation is a string, all other values will be put into the string form and assembled as requested into a complete string. If the first value is not a string—in the example, it was a date value—all the other values will be changed to this type. That's just what causes an error, because it is impossible to change "is a great date!" to a date value. For this reason, the "+" operator is an unreliable tool for assembling a string.

Concat() causes fewer problems: it turns everything you pass to the method into a string. When converting, Concat() also takes your current regional settings into account; on a system set to U.S. English, for example, it produces U.S. date and time formats:

[string]::Concat("Today is ", (Get-Date))
Today is 8/29/2007 11:06:00 AM

[string]::Concat((Get-Date), " is a great date!")
8/29/2007 11:06:24 AM is a great date!
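
Two further ways to avoid the problem with "+" are sketched below: casting the first value to [string], or embedding the value in a double-quoted string. The fixed date and the variable names are assumptions chosen so that the example is reproducible.

```powershell
# A fixed date keeps the sketch reproducible.
$date = [datetime]'2007-08-29T11:06:24'

# Casting the first operand to [string] makes "+" concatenate instead of calculate.
$viaCast = [string]$date + ' is a great date!'

# A double-quoted string with the variable embedded gives the same text.
$viaQuotes = "$date is a great date!"

$viaCast
$viaCast -eq $viaQuotes
```

Both variants use PowerShell's own conversion to text, so the date format may differ from what Concat() returns under your regional settings.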

Simple Pattern Recognition

Recognizing patterns is a frequent task that is necessary for verifying user entries, such as to determine whether a user has given a valid network ID or valid e-mail address. Useful and effective pattern recognition requires wildcard characters that represent a certain number and type of characters.

A simple form of wildcards was invented for the file system many years ago and it still works today. In fact, you've doubtlessly used it before in one form or another:

# List all files in the current directory that
# have the txt file extension:
Dir *.txt

# List all files in the Windows directory that
# begin with "n" or "w":
dir $env:windir\[nw]*.*

# List all files whose file extensions begin with
# "t" and which are exactly 3 characters long:
Dir *.t??

# List all files whose base name ends in one of
# the letters from "e" to "z":
dir *[e-z].*
WildcardDescriptionExample
*Any number of any character (including no characters at all)Dir *.txt
?Exactly one of any charactersDir *.??t
[xyz]One of specified charactersDir [abc]*.*
[x-z]One of the characters in the specified areaDir *[p-z].*

Table 13.7: Using simple placeholders

The placeholders in Table 13.7 not only work in the file system, but also in conjunction with string operators like -like and -notlike. This makes child's play of pattern recognition. For example, if you want to verify whether a user has given a valid IP address, you could do so in the following way:

$ip = Read-Host "IP address"
If ($ip -like "*.*.*.*") { "valid" } Else { "invalid" }

If you want to verify whether a valid e-mail address is in a variable, you could check the pattern in the following way:

$email = "tobias.weltner@powershell.de"
$email -like "*.*@*.*"

However, such wildcards only reveal the worst errors and are not very exact:

# Wildcards are appropriate only for very simple pattern
# recognition and leave room for erroneous entries:
$ip = "300.werner.6666."
If ($ip -like "*.*.*.*") { "valid" } Else { "invalid" }
valid

# The following invalid e-mail address was not identified as invalid:
$email = ".@."
$email -like "*.*@*.*"
True
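
The placeholders from Table 13.7 can be tried out directly against strings with -like. The following is a small sketch with made-up file names; it only illustrates how each wildcard behaves and is not part of the original examples.

```powershell
# -like accepts the same wildcards as the file system (Table 13.7).
$checks = @(
    ('report.txt' -like '*.t??'),   # extension starts with "t" and has three characters
    ('notes.doc'  -like '[nw]*'),   # name begins with "n" or "w"
    ('readme'     -like '*[e-z]'),  # name ends in a letter from "e" to "z"
    ('readme'     -like '*[a-d]')   # last letter "e" lies outside the range "a" to "d"
)

$checks   # True, True, True, False
```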

Regular Expressions

If you need more accurate pattern recognition, use regular expressions. Regular expressions offer many more wildcard characters; for this reason, they can describe patterns in much greater detail. For the very same reason, however, regular expressions are also much more complicated.

Describing Patterns

Using the regular expression elements listed in Table 13.11, you can describe patterns with much greater precision. These elements are grouped into three categories:

  • Char: The Char represents a single character and a collection of Char objects represents a string.
  • Quantifier: Allows you to determine how often a character or a string occurs in a pattern.
  • Anchor: Allows you to determine whether a pattern is a separate word or must be at the beginning or end of a string.

The pattern represented by a regular expression may consist of four different character types:

  • Literal characters like "abc" that exactly match the "abc" string.
  • Masked or "escaped" characters: characters with special meanings in regular expressions are understood as literal characters when preceded by "\". For example, "\[test\]" looks for the "[test]" string. The following characters have special meanings and for this reason must be masked if used literally: ". ^ $ * + ? { [ ] \ | ( )".
  • Predefined wildcard characters that represent a particular character category and work like placeholders. For example, "\d" represents any digit from 0 to 9.
  • Custom wildcard characters: They consist of square brackets, within which the characters are specified that the wildcard represents. If you want to use any character except for the specified characters, use "^" as the first character in the square brackets. For example, the placeholder "[^f-h]" stands for all characters except for "f", "g", and "h".

Element  Description
-------  -----------
.        Exactly one character of any kind except for a line break (equivalent to [^\n])
[^abc]   All characters except for those specified in brackets
[^a-z]   All characters except for those in the range specified in the brackets
[abc]    One of the characters specified in brackets
[a-z]    Any character in the range indicated in brackets
\a       Bell alarm (ASCII 7)
\c       Any character allowed in an XML name (XML Schema regular expressions; in .NET, \c must be followed by a letter, as in the next row)
\cA-\cZ  Control+A to Control+Z, equivalent to ASCII 1 to ASCII 26
\d       A digit (equivalent to [0-9])
\D       Any character except for digits
\e       Escape (ASCII 27)
\f       Form feed (ASCII 12)
\n       New line
\r       Carriage return
\s       Any whitespace character like a blank character, tab, or line break
\S       Any character except for a blank character, tab, or line break
\t       Tab character
\uFFFF   Unicode character with the hexadecimal code FFFF. For example, the Euro symbol has the code 20AC
\v       Vertical tab (ASCII 11)
\w       Letter, digit, or underline
\W       Any character except for letters, digits, and underlines
\xnn     Particular character, where nn specifies the hexadecimal ASCII code
.*       Any number of any character (including no characters at all)

Table 13.8: Placeholders for characters

Quantifiers

Every wildcard listed in Table 13.8 represents exactly one character. Using quantifiers, you can determine more precisely how many characters are represented. For example, "\d{1,3}" stands for a digit occurring one to three times, that is, a one-to-three-digit number.

Element  Description
-------  -----------
*        Preceding expression is not matched or matched once or several times (matches as much as possible)
*?       Preceding expression is not matched or matched once or several times (matches as little as possible)
.*       Any number of any character (including no characters at all)
?        Preceding expression is not matched or matched once (matches as much as possible)
??       Preceding expression is not matched or matched once (matches as little as possible)
{n,}     n or more matches
{n,m}    Between n and m matches, inclusive
{n}      Exactly n matches
+        Preceding expression is matched once or several times

Table 13.9: Quantifiers for patterns
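
The quantifiers in Table 13.9 can be checked with a few short comparisons. The sketch below uses made-up test strings and variable names; the "^" and "$" anchors (explained in the next section) only ensure that the whole string has to fit the pattern.

```powershell
# "*" allows zero or more occurrences, "+" requires at least one.
$starEmpty = 'a'    -match '^ab*$'       # True:  no "b" is acceptable for "*"
$plusEmpty = 'a'    -match '^ab+$'       # False: "+" needs at least one "b"
$inRange   = 'abb'  -match '^ab{1,2}$'   # True:  two "b" lie within {1,2}
$tooMany   = 'abbb' -match '^ab{1,2}$'   # False: three "b" exceed {1,2}

# Greedy versus lazy: same input, different length of the match.
$null   = 'aaa' -match 'a+'
$greedy = $matches[0]                    # aaa
$null   = 'aaa' -match 'a+?'
$lazy   = $matches[0]                    # a
```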

Anchors

Anchors determine whether a pattern has to be at the beginning or ending of a string, or on a word boundary. For example, the regular expression "\b\d{1,3}" finds numbers of up to three digits only if these turn up separately in a string. The number "123" in the string "Bart123" would not be found.

Element  Description
-------  -----------
$        Matches at end of a string (\Z is less ambiguous for multi-line texts)
\A       Matches at beginning of a string, including multi-line texts
\b       Matches on word boundary (first or last characters in words)
\B       Must not match on word boundary
\Z       Must match at end of string, including multi-line texts
^        Must match at beginning of a string (\A is less ambiguous for multi-line texts)

Table 13.10: Anchor boundaries
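
The effect of the anchors in Table 13.10 is easiest to see side by side. The following sketch repeats the "Bart123" case from the text and adds a few made-up strings; the variable names are illustrative only.

```powershell
# \b requires a word boundary in front of the digits.
$glued    = 'Bart123'  -match '\b\d{1,3}'   # False: "t" and "1" are both word characters
$separate = 'Bart 123' -match '\b\d{1,3}'   # True:  the blank creates a word boundary
$found    = $matches[0]                     # 123

# ^ ties the pattern to the beginning of the string.
$anywhere  = 'abc' -match 'b'               # True:  "b" occurs somewhere in the string
$atStart   = 'abc' -match '^b'              # False: the string does not begin with "b"
```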

Recognizing IP Addresses

Patterns such as an IP address can be described much more precisely by regular expressions than by simple wildcard characters. Usually, you would use a combination of characters and quantifiers to specify which characters may occur in a string and how often:

$ip = "10.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
True

$ip = "a.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
False

$ip = "1000.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
False

The pattern is described here as four numbers (char: \d) of between one and three digits (using the quantifier {1,3}), separated by dots and anchored on word boundaries (using the anchor \b), meaning that the address must not be directly preceded or followed by further letters, digits, or underlines. The check is far from perfect, since it does not verify whether the numbers really do lie in the permitted range from 0 to 255.

# There still are entries incorrectly identified as valid IP addresses:
$ip = "300.400.500.999"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
True
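
One possible way to close this gap without a longer regular expression is to let the pattern capture the four numbers and then compare each of them with 255. The function name Test-IPv4Range is hypothetical and chosen for this sketch only; it is not a built-in cmdlet and does not appear in the original text.

```powershell
# Hypothetical helper: pattern check first, numeric range check second.
function Test-IPv4Range {
    param([string]$Address)

    # Step 1: four groups of one to three digits, anchored to the whole string.
    if (-not ($Address -match '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$')) {
        return $false
    }

    # Step 2: each captured number (groups 1 to 4) must not exceed 255.
    foreach ($index in 1..4) {
        if ([int]$matches[$index] -gt 255) { return $false }
    }
    return $true
}

Test-IPv4Range '10.10.10.10'        # True
Test-IPv4Range '300.400.500.999'    # False
```

The sketch trades a compact pattern for an extra loop, which may be easier to read than a single expression that encodes the number range.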

Validating E-Mail Addresses

If you'd like to verify whether a user has given a valid e-mail address, use the following regular expression:

$email = "test@somewhere.com"
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
True

$email = ".@."
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
False

Whenever you look for an expression that occurs as a single "word" in text, delimit your regular expression by word boundaries (anchor: \b). The regular expression will then find only passages that are not directly preceded or followed by further letters, digits, or underlines, for example because white space such as blank characters, tabs, or line breaks surrounds them.

The regular expression subsequently specifies which characters may be included in an e-mail address. Permissible characters are in square brackets and consist of "ranges" (for example, "A-Z0-9") and single characters (such as "._%+-"). The "+" behind the square brackets is a quantifier and means that at least one of the given characters must be present; any number of further characters from the set may follow.

Following this is "@" and, after it, text again drawn from a set of permitted characters ([A-Z0-9.-]). A dot (\.) in the e-mail address follows. This dot is introduced with a "\" character because the dot has a different meaning in regular expressions if it isn't within square brackets. The backslash ensures that the regular expression understands the dot behind it literally.

After the dot is the domain identifier, which may consist solely of letters ([A-Z]). A quantifier ({2,4}) again follows the square brackets. It specifies that the domain identifier may consist of at least two and at most four of the given characters.

However, this regular expression still has one flaw. While it does verify whether a valid e-mail address is in the text somewhere, there could be other text before or after it:

$email = "Email please to test@somewhere.com and reply!"
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
True

Because of "\b", the regular expression searches for the pattern anywhere in the text and only takes word boundaries into account. If you prefer to check whether the entire text corresponds to an authentic e-mail address, use the anchors for the text beginning ("^") and text ending ("$") instead of word boundaries:

$email -match "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
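
The difference between word boundaries and the "^" and "$" anchors shows up when both kinds of input are tested against the anchored pattern. This is a brief sketch that reuses the sample texts from above; the variable names are illustrative.

```powershell
# Anchored pattern: the entire string must be an e-mail address.
$pattern = '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

$onlyAddress = 'test@somewhere.com' -match $pattern                             # True
$embedded    = 'Email please to test@somewhere.com and reply!' -match $pattern  # False
```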

Simultaneous Searches for Different Terms

Sometimes, search terms are ambiguous because there may be several ways to write them. You can use the "?" quantifier to mark parts of the search term as optional. In simple cases, put a "?" after an optional character. Then the character in front of "?" may, but doesn't have to, turn up in the search term:

"color" -match "colou?r"
True
"colour" -match "colou?r"
True

The "?" character here doesn't stand for exactly one arbitrary character, as you might expect after using simple wildcards. For regular expressions, "?" is a quantifier and always specifies how often the character or expression in front of it may occur. In the example, therefore, "u?" ensures that the letter "u" may, but need not, be in the specified location in the pattern. Other quantifiers are "*" (the preceding element may occur any number of times, including not at all) and "+" (the preceding element must occur at least once).

If you prefer to mark more than one character as optional, put the characters in a sub-expression, which is placed in parentheses. The following example recognizes both the month designator "Nov" and "November":

"Nov" -match "\bNov(ember)?\b"
True

"November" -match "\bNov(ember)?\b"
True

If you'd rather use several alternative search terms, use the OR character "|":

"Bob and Ted" -match "Alice|Bob"
True

And if you want to mix alternative search terms with fixed text, use sub-expressions again:

# finds "and Bob":
"Peter and Bob" -match "and (Bob|Willy)"
True




# does not find "and Bob":
"Bob and Peter" -match "and (Bob|Willy)"
False

Case Sensitivity

In keeping with customary PowerShell practice, the -match operator is case insensitive. Use the -cmatch operator as an alternative if you'd prefer case sensitivity:

# -match is case insensitive:
"hello" -match "heLLO"
True

# -cmatch is case sensitive:
"hello" -cmatch "heLLO"
False

If you want case sensitivity in only some pattern segments, use -match. Also, specify in your regular expression which text segments are case sensitive and which are insensitive. Anything following the "(?i)" construct is case insensitive. Conversely, anything following "(?-i)" is case sensitive. This explains why the word "test" in the below example is recognized only if its last two characters are lowercase, while case sensitivity has no importance for the first two characters:

"TEst" -match "(?i)te(?-i)st"
True

"TEST" -match "(?i)te(?-i)st"
False

If you use a .NET Framework RegEx object instead of -match, the RegEx object will distinguish between uppercase and lowercase by default, behaving like -cmatch. If you prefer case insensitivity, either use the above construct to specify an option in your regular expression or use "IgnoreCase" to tell the RegEx object your preference:

[regex]::Matches("test", "TEST", "IgnoreCase")
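
The three variants just described can be compared with the static IsMatch() method of the RegEx class, which simply reports whether a pattern was found. This is a minimal sketch; the variable names are illustrative.

```powershell
# The RegEx class is case sensitive unless told otherwise.
$default  = [regex]::IsMatch('test', 'TEST')                 # False
$byOption = [regex]::IsMatch('test', 'TEST', 'IgnoreCase')   # True: option passed as argument
$inline   = [regex]::IsMatch('test', '(?i)TEST')             # True: option inside the pattern
```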

Element  Description                                                                                                        Category
-------  -----------                                                                                                        --------
(xyz)    Sub-expression
|        Alternation construct                                                                                              Selection
\        When followed by a character, the character is not recognized as a formatting character but as a literal character  Escape
x?       Changes the x quantifier into a "lazy" quantifier                                                                  Option
(?xyz)   Activates or deactivates special modes, among others, case sensitivity                                             Option
x+       Turns the x quantifier into a "greedy" quantifier                                                                  Option
?:       Does not store the result of the sub-expression (no back reference)                                                Reference
?<name>  Specifies name for back references                                                                                 Reference

Table 13.11: Regular expression elements

Of course, a regular expression can perform any number of detailed checks, such as verifying whether numbers in an IP address lie within the permissible range from 0 to 255. The problem is that this makes regular expressions long and hard to understand. Fortunately, you generally won't need to invest much time in learning complex regular expressions like the ones coming up. It's enough to know which regular expression to use for a particular pattern. Regular expressions for nearly all standard patterns can be downloaded from the Internet. In the following example, we'll look more closely at a complex regular expression that evidently is entirely made up of the conventional elements listed in Table 13.11:

$ip = "300.400.500.999"
$ip -match "\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)" + `
"{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
False

The expression validates only expressions running into word boundaries (the anchor is \b). The following sub-expression defines every single number:

(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

The construct ?: is optional and enhances speed. After it come three alternatively permitted number formats separated by the alternation construct "|". 25[0-5] is a number from 250 through 255. 2[0-4][0-9] is a number from 200 through 249. Finally, [01]?[0-9][0-9]? is a number from 0-9 or 00-99 or 100-199. The quantifier "?" makes the preceding pattern optional. The result is that the sub-expression describes numbers from 0 through 255. An IP address consists of four such numbers. A dot always follows the first three numbers. For this reason, the following expression combines the definition of the number with a dot:

(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}

A dot (\.) is appended to the number. This construct is supposed to be present three times ({3}). When the fourth number is also appended, the regular expression is complete. You have learned to create sub-expressions (by using parentheses) and how to iterate sub-expressions (by indicating the number of iterations in braces after the sub-expression), so you should now be able to shorten the first used IP address regular expression:

$ip = "10.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
True

$ip -match "\b(?:\d{1,3}\.){3}\d{1,3}\b"
True
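
The long range-checking expression becomes easier to read when the sub-expression for a single number is stored in a variable and the complete pattern is assembled from it. This is a sketch of one possible way to do that; the variable names $octet and $pattern are illustrative and not part of the original example.

```powershell
# Sub-expression for one number from 0 through 255 (single quotes: nothing is expanded).
$octet = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three times "number plus dot", then the fourth number, all between word boundaries.
# Double quotes let PowerShell insert $octet; the backslashes stay untouched.
$pattern = "\b(?:$octet\.){3}$octet\b"

$inRange    = '10.10.10.10'     -match $pattern   # True
$outOfRange = '300.400.500.999' -match $pattern   # False
```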

Finding Information in Text

Regular expressions can recognize patterns. They can also filter out data corresponding to certain patterns from text. As such, regular expressions are excellent tools for parsing raw data. For example, if you want to extract an e-mail address from a letter, use the same regular expression as the one above that identifies e-mail addresses. Afterwards, look in the $matches variable to see which results were returned. The $matches variable is created automatically when you use the -match operator (or one of its siblings, like -cmatch).

$matches is a hash table (Chapter 4), so you can either output the entire hash table or access single elements in it by using their names, which you must specify in square brackets:

$rawtext = "If it interests you, my e-mail address is tobias@powershell.com."

# Simple pattern recognition:
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
True

# Reading data matching the pattern from raw text:
$matches
Name Value
---- -----
0 tobias@powershell.com

$matches[0]
tobias@powershell.com

Does that also work for more than one e-mail address in text? Unfortunately, it doesn't do so right away. The -match operator looks only for the first matching expression. So, if you want to find more than one occurrence of a pattern in raw text, you have to switch over to the RegEx object underlying the -match operator and use it directly.

In one essential respect, the RegEx object behaves unlike the -match operator. Case sensitivity is the default for the RegEx object, but not for -match. For this reason, you must put the "(?i)" option in front of the regular expression to make sure the expression is evaluated without taking case into account.

# A raw text contains several e-mail addresses. -match finds the first one only:
$rawtext = "test@test.com sent an e-mail that was forwarded to spam@muell.de."
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
True

$matches
Name Value
---- -----
0 test@test.com

# A RegEx object can find every occurrence of a pattern but is case sensitive by default:
$regex = [regex] "(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$regex.Matches($rawtext)
Groups : {test@test.com}
Success : True
Captures : {test@test.com}
Index : 0
Length : 13
Value : test@test.com

Groups : {spam@muell.de}
Success : True
Captures : {spam@muell.de}
Index : 51
Length : 13
Value : spam@muell.de


# Limit result to e-mail addresses:
$regex.Matches($rawtext) | Select-Object -Property Value
Value
-----
test@test.com
spam@muell.de

# Continue processing e-mail addresses:
$regex.Matches($rawtext) | ForEach-Object { "found: $($_.Value)" }
found: test@test.com
found: spam@muell.de

Searching for Several Keywords

You can use the alternation construct "|" to search for a group of keywords, and then find out which keyword was actually found in the string:

"Set a=1" -match "Get|GetValue|Set|SetValue"
True


$matches
Name Value
---- -----
0 Set

$matches tells you which keyword actually occurs in the string. But note the order of keywords in your regular expression—it's crucial because the first matching keyword is the one selected. In this example, the result would be incorrect:

"SetValue a=1" -match "Get|GetValue|Set|SetValue"
True


$matches[0]
Set

Either change the order of keywords so that longer keywords are checked before shorter ones …:

"SetValue a=1" -match "GetValue|Get|SetValue|Set"
True


$matches[0]
SetValue

... or make sure that your regular expression is precisely formulated, and remember that you're actually searching for single words. Insert word boundaries into your regular expression so that the order of the keywords no longer plays a role:

"SetValue a=1" -match "\b(Get|GetValue|Set|SetValue)\b"
True


$matches[0]
SetValue

It's true here, too, that -match finds only the first match. If your raw text has several occurrences of the keyword, use a RegEx object again:

$regex = [regex] "\b(Get|GetValue|Set|SetValue)\b"
$regex.Matches("Set a=1; GetValue a; SetValue b=12")
Groups : {Set, Set}
Success : True
Captures : {Set}
Index : 0
Length : 3
Value : Set

Groups : {GetValue, GetValue}
Success : True
Captures : {GetValue}
Index : 9
Length : 8
Value : GetValue

Groups : {SetValue, SetValue}
Success : True
Captures : {SetValue}
Index : 21
Length : 8
Value : SetValue

Forming Groups

A raw text line is often a heaping trove of useful data. You can use parentheses to collect this data in sub-expressions so that it can be evaluated separately later. The basic principle is that all the data that you want to find in a pattern should be wrapped in parentheses because $matches will return the results of these sub-expressions as independent elements. For example, if a text line contains a date first, then text, and if both are separated by tabs, you could describe the pattern like this:

# Defining pattern: two character strings separated by a tab
$pattern = "(.*)\t(.*)"

# Generate example line with tab character
$line = "12/01/2009`tDescription"

# Use regular expression to parse line:
$line -match $pattern
True

# Show result:
$matches
Name Value
---- -----
2 Description
1 12/01/2009
0 12/01/2009 Description

$matches[1]
12/01/2009

$matches[2]
Description

When you use sub-expressions, $matches will contain the entire searched pattern in the first element, named "0". Sub-expressions defined in parentheses follow in additional elements. To make them easier to read and understand, you can assign sub-expressions their own names and later use the names to call results. To assign a name to a sub-expression, type ?<Name> as the first statement inside the parentheses:

# Assign subexpressions their own names:
$pattern = "(?<Date>.*)\t(?<Text>.*)"

# Generate example line with tab character:
$line = "12/01/2009`tDescription"

# Use a regular expression to parse line:
$line -match $pattern
True

# Show result:
$matches
Name Value
---- -----
Text Description
Date 12/01/2009
0 12/01/2009 Description


$matches.Date
12/01/2009

$matches.Text
Description

Each result retrieved by $matches for each sub-expression naturally requires storage space. If you don't need the results, discard them to increase the speed of your regular expression. To do so, type "?:" as the first statement in your sub-expression:

# Don't return a result for the second subexpression:
$pattern = "(?<Date>.*)\t(?:.*)"

# Generate example line with tab character:
$line = "12/01/2009`tDescription"

# Use a regular expression to parse line:
$line -match $pattern
True

# No more results will be returned for the second subexpression:
$matches
Name Value
---- -----
Date 12/01/2009
0 12/01/2009 Description
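
Named sub-expressions lend themselves to turning raw text lines into objects. The following sketch assumes PowerShell 3.0 or later (for [pscustomobject]) and uses made-up sample lines; the variable names $lines and $records are illustrative only.

```powershell
# Pattern with named groups, as in the examples above.
$pattern = '(?<Date>.*)\t(?<Text>.*)'

# Two sample lines (made up for this sketch); `t inserts a tab character.
$lines = "12/01/2009`tDescription", "12/02/2009`tSecond entry"

$records = @(
    foreach ($line in $lines) {
        if ($line -match $pattern) {
            # Copy the named results into an object with matching properties.
            [pscustomobject]@{ Date = $matches.Date; Text = $matches.Text }
        }
    }
)

$records
```

Each object then carries a Date and a Text property, so the parsed data can be sorted or filtered with the usual pipeline commands.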

Further Use of Sub-Expressions

With the help of results from each sub-expression, you can create surprisingly flexible regular expressions. For example, how could you define a Web site HTML tag as a pattern? A tag always has the same structure: <tagname [parameter]>...</tagname>. This means that a pattern for one particular strictly predefined HTML tag can be found quickly:

"<body background=1>contents</body>" -match "<body\b[^>]*>(.*?)</body>"
True

$matches[1]
contents

The pattern begins with the fixed text "<body". A word boundary (\b) follows, and then any number of characters except ">" ([^>]*), which covers any parameters of the tag. The concluding ">" follows and then the contents of the body tag, which may consist of any number of any characters (.*?). This expression, enclosed in parentheses, is a sub-expression and will be returned later as a result in $matches so that you'll know what is inside the body tag. The concluding part of the tag follows in the form of fixed text ("</body>").

This regular expression works fine for body tags, but not for other tags. Does this mean that a regular expression has to be defined for every HTML tag? Naturally not. There's a simpler solution. The problem is that the name of the tag in the regular expression occurs twice, once initially ("<body...>") and once terminally ("</body>"). If the regular expression is supposed to be able to process any tags, then it would have to be able to find out the name of the tag automatically and use it in both locations. How to accomplish that? Like this:

"<body background=2>Contents</body>" -match "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"
True


$matches
Name Value
---- -----
2 Contents
1 body
0 <body background=2>Contents</body>

This regular expression no longer contains a strictly predefined tag name and works for any tags matching the pattern. How does that work? The initial tag in parentheses is defined as a sub-expression, more specifically as a word that begins with a letter and that can consist of any additional alphanumeric characters.

([A-Z][A-Z0-9]*)

The name of the tag found here must subsequently be repeated in the terminal part. Here you'll find "</\1>". "\1" refers to the result of the first sub-expression. The first sub-expression evaluated the tag name and so this name is used automatically for the terminal part.

The following RegEx object could directly return the contents of any HTML tag:

$regexTag = [regex] "(?i)<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"
$result = $regexTag.Matches("<button>Press here</button>")
$result[0].Groups[2].Value + " is in tag " + $result[0].Groups[1].Value
Press here is in tag button
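
Back references inside a pattern are not limited to HTML tags. As a further illustration, which is not part of the original text, the sketch below uses "\1" to find a word that was accidentally typed twice in a row; the sample sentence and variable names are made up.

```powershell
# (\w+) captures a word; \1 demands that exactly the same text follows again.
$hasRepeat = 'This is is a test' -match '\b(\w+)\s+\1\b'   # True
$repeated  = $matches[1]                                   # is

# No word is doubled here, so the pattern does not match:
$clean = 'This is a test' -match '\b(\w+)\s+\1\b'          # False
```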

Greedy or Lazy? Detailed or Concise Results...

Readers who have paid careful attention may wonder why the contents of the HTML tag were defined by ".*?" and not simply by ".*". After all, ".*" should suffice so that an arbitrary character (char: ".") can turn up any number of times (quantifier: "*"). At first glance, the difference between ".*" and ".*?" is not easy to recognize, but a short example should make it clear.

Assume that you would like to evaluate month specifications in a logging file, but the months are not all specified in the same way. Sometimes you use the short form, other times the long form of the month name is used. As you've seen, that's no problem for regular expressions, because sub-expressions allow parts of a keyword to be declared optional:

"Feb" -match "Feb(ruary)?"
True

$matches[0]
Feb

"February" -match "Feb(ruary)?"
True

$matches[0]
February

In both cases, the regular expression recognizes the month, but returns different results in $matches. By default, the regular expression is "greedy" and wants to achieve a match in as much detail as possible. If the text is "February," then the expression will search for a match starting with "Feb" and then continue searching "greedily" to check whether even more characters match the pattern. If they do, the entire (detailed) text is reported back.

However, if your main concern is just standardizing the names of months, you would probably prefer getting back the shortest common text. That's exactly what the "??" quantifier does: in contrast to the plain "?" quantifier, it is "lazy." As soon as the pattern is matched, it returns the match without checking whether the optional characters would also fit.

"Feb" -match "Feb(ruary)??"
True

$matches[0]
Feb

"February" -match "Feb(ruary)??"
True

$matches[0]
Feb

Just what is the connection between the "??" quantifier of this example and the "*?" of the preceding example? In reality, "*?" is not a self-contained quantifier: the appended "?" just turns a normally "greedy" quantifier into a "lazy" one. This means you can use "?" to force the quantifier "*" to be "lazy" and to return the shortest possible result. That's exactly what happened with our regular expressions for HTML tags. You can see how important this is when you use the greedy quantifier "*" instead of "*?": it will attempt to retrieve a result in as much detail as possible. That can go wrong:

# The greedy quantifier * returns results in as much detail as possible:
"<body background=1>Contents</body></body>" -match "<body\b[^>]*>(.*)</body>"
True

$matches[1]
Contents</body>

# The right quantifier is *?, the lazy one, which returns results that
# are as short as possible:
"<body background=1>Contents</body></body>" -match "<body\b[^>]*>(.*?)</body>"
True

$matches[1]
Contents

According to the definition of the regular expression, any characters are allowed inside the tag. Moreover, the entire expression must end with "</body>". If "</body>" is also inside the tag, the following will happen: the greedy quantifier ("*"), coming across the first "</body>", will at first assume that the pattern is already completely matched. But because it is greedy, it will continue to look and will discover the second "</body>" that also fits the pattern. The result is that it will take both "</body>" specifications into account, allocate one to the contents of the tag, and use the other as the conclusion of the tag.

In this example, it would be better to use the lazy quantifier ("*?"), which notices when it encounters the first "</body>" that the pattern is already correctly matched and consequently doesn't go to the trouble of continuing to search. It will ignore the second "</body>" and use the first to conclude the tag.

Finding String Segments

Entire books have been written about the uses of regular expressions. That's why it would go beyond the scope of this book to discuss more details. However, our last example, which locates text segments, shows how you can use the elements listed in Table 13.11 to easily harvest surprising search results. Given two words, the regular expression below retrieves the text segment that begins with the first word and ends with the second, provided that at least one and no more than six other words lie between them:

"Find word segments from start to end" -match "\bstart\W+(?:\w+\W+){1,6}?end\b"
True
$matches
Name Value
---- -----
0 start to end

Replacing a String

You already know how to replace a string because you were already introduced to the -replace operator. Simply tell the operator what term you want to replace in a string and the task is done:

"Hello, Ralph" -replace "Ralph", "Martina"
Hello, Martina

But simple replacement isn't always sufficient, so you need to use regular expressions for replacements. Some of the following interesting examples show how that could be useful.

Perhaps you'd like to replace several different terms in a string with one other term. Without regular expressions, you'd have to replace each term separately. With regular expressions, you can use the alternation operator, "|", instead:

"Mr. Miller and Mrs. Meyer" -replace "(Mr.|Mrs.)", "Our client"
Our client Miller and Our client Meyer

You can type any term in parentheses and use the "|" symbol to separate them. All the terms will be replaced with the replacement string you specify.

Using Back References

This last example replaces specified keywords anywhere in a string. Often, that's sufficient, but sometimes you don't want to replace a keyword everywhere it occurs but only when it occurs in a certain context. In such cases, the context must be defined in some way in the pattern. How could you change the regular expression so that it replaces only the names Miller and Meyer? Like this:

"Mr. Miller, Mrs. Meyer and Mr. Werner" `
-replace "(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client"
Our client, Our client and Mr. Werner

The result looks a little peculiar, but the pattern you're looking for was correctly identified. The only replacements were Mr. or Mrs. Miller and Mr. or Mrs. Meyer. The term "Mr. Werner" wasn't replaced. Unfortunately, the result also shows that it doesn't make any sense here to replace the entire pattern. At least the name of the person should be retained. Is that possible?

This is where the back referencing you've already seen comes into play. Whenever you use parentheses in your regular expression, the result inside the parentheses is evaluated separately, and you can use these separate results in your replacement string. The first sub-expression always reports whether a "Mr." or a "Mrs." was found in the string. The second sub-expression returns the name of the person. The terms "$1" and "$2" provide you the sub-expressions in the replacement string (the number is consequently a sequential number; you could also use "$3" and so on for additional sub-expressions).

"Mr. Miller, Mrs. Meyer and Mr. Werner" `
-replace "(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client $2"
Our client , Our client and Mr. Werner

Strangely enough, at first the back references don't seem to work. The cause can be found quickly: "$1" and "$2" look like PowerShell variables, but in reality they are regular terms of the -replace operator. As a result, if you put the replacement string inside double quotation marks, PowerShell will replace "$2" with the PowerShell variable $2, which is normally empty. So that replacement with back references works, consequently, you must either put the replacement string inside single quotation marks or add a backtick to the "$" special character so that PowerShell won't recognize it as its own variable and replace it:

# Replacement text must be inside single quotation marks
# so that PowerShell does not expand $2 as a variable:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", 'Our client $2'
Our client Miller, Our client Meyer and Mr. Werner

# Alternatively, $ can also be masked by `$:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client `$2"
Our client Miller, Our client Meyer and Mr. Werner
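
As a complementary sketch (not part of the original example), the groups can also be given names, which avoids counting parentheses. The group names "title" and "name" below are assumptions chosen purely for illustration. The pattern also escapes the dots so that "." matches only a literal period rather than any character.

```powershell
# Sketch: named groups instead of numbered back references.
# The group names "title" and "name" are illustrative assumptions.
$text = 'Mr. Miller, Mrs. Meyer and Mr. Werner'
$pattern = '(?<title>Mr\.|Mrs\.)\s*(?<name>Miller|Meyer)'

# Single quotation marks keep PowerShell from expanding ${name} itself;
# the regex engine resolves it to the text captured by the named group.
$result = $text -replace $pattern, 'Our client ${name}'
$result
```

Because the replacement string is in single quotation marks, no backtick is needed in front of the "$".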

Putting Characters First at Line Beginnings

Replacements can also be made at multiple places in text that spans several lines. For example, when you respond to an e-mail, the text of the old e-mail is usually quoted in your new e-mail and marked with ">" at the beginning of each line. Regular expressions can do the marking.

However, to accomplish this, you need to know a little more about "multi-line" mode. Normally, this mode is turned off, and the "^" anchor represents the text beginning and "$" the text ending. For these two anchors to refer to the beginning and ending of each line in a text of several lines, multi-line mode must be turned on with the "(?m)" statement. Only then will -replace substitute the pattern in every single line. Once multi-line mode is turned on, the anchors "^" and "\A", as well as "$" and "\Z", no longer behave alike. "\A" will continue to indicate the text beginning, while "^" will mark the line beginning; "\Z" will continue to indicate the text ending, while "$" will mark the line ending.

# Using Here-String to create a text of several lines:
$text = @"
Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
"@
$text
Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.

# Normally, -replace doesn't work in multiline mode.
# For this reason, only the first line is replaced:
$text -replace "^", "> "
> Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.

# If you turn on multiline mode, replacement will work in every line:
$text -replace "(?m)^", "> "
> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That's why I would put a ">" before every line.

# The same can also be accomplished by using a RegEx object,
# where the multiline option must be specified:
[regex]::Replace($text, "^", "> ", `
[Text.RegularExpressions.RegexOptions]::Multiline)
> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That's why I would put a ">" before every line.

# In multiline mode, \A stands for the text beginning
# and ^ for the line beginning:
[regex]::Replace($text, "\A", "> ", `
[Text.RegularExpressions.RegexOptions]::Multiline)
> Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
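
The difference between "^" and "\A" in multi-line mode can be checked with a compact sketch. The sample text and variable names are assumptions for illustration; the lines are joined with explicit `n characters rather than a Here-String so the result does not depend on the line endings of a script file.

```powershell
# Sample text with explicit line feeds (illustrative content).
$text = "first line`nsecond line`nthird line"

# (?m)^ matches at the beginning of every line.
$everyLine = $text -replace '(?m)^', '> '

# \A matches only at the very beginning of the text, even with (?m).
$firstOnly = $text -replace '(?m)\A', '> '

# Count the line beginnings that (?m)^ finds.
$lineStarts = [regex]::Matches($text, '(?m)^').Count

$everyLine
$firstOnly
$lineStarts
```

Note that in .NET regular expressions "$" in multi-line mode matches just before a line feed, so text with Windows-style line endings keeps its carriage return in front of that position.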

Removing Superfluous White Space

Regular expressions can perform routine tasks as well, such as removing superfluous white space. The pattern describes a white-space character (char: "\s") that occurs at least twice (quantifier: "{2,}"). Each such run is replaced with a single blank character.

"Too   many   blank   characters" -replace "\s{2,}", " "
Too many blank characters
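
One detail worth keeping in mind: "\s" matches tabs and line breaks as well as blanks. The following sketch, with an invented sample string, contrasts "\s{2,}" with a pattern that collapses only runs of the blank character itself.

```powershell
# Sample with runs of blanks and a run of two tabs (illustrative content).
$messy = "Too    many`t`tblank   characters"

# \s{2,} collapses any run of two or more white-space characters,
# including the two tabs.
$collapsed = $messy -replace '\s{2,}', ' '

# ' {2,}' collapses only runs of blanks and leaves the tabs alone.
$blanksOnly = $messy -replace ' {2,}', ' '

$collapsed
$blanksOnly
```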

Finding and Removing Doubled Words

How is it possible to find and remove doubled words in text? Here, you can use back referencing again. The pattern could be described as follows:

"\b(\w+)(\s+\1){1,}\b"

The pattern starts at a word boundary (anchor: "\b"). It then captures one word (the character "\w" and quantifier "+"). White space follows (the character "\s" and quantifier "+"), and then the back reference "\1", which stands for the word just captured. This group, the white space and the repeated word, must occur at least once (quantifier "{1,}", that is, one or any number of repetitions). A closing "\b" ensures that the last repetition ends at a word boundary. The entire match is then replaced with the first back reference, that is, the first located word.

# Find and remove doubled words in a text:
"This this this is a test" -replace "\b(\w+)(\s+\1){1,}\b", '$1'
This is a test
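
The example works on "This this this" because -replace ignores case, so the back reference "\1" also matches "this" after "This" was captured. A short sketch, using the same sentence, shows how the case-sensitive variant -creplace behaves differently; the variable names are illustrative.

```powershell
$text = 'This this this is a test'
$pattern = '\b(\w+)(\s+\1){1,}\b'

# Case-insensitive (default): "This this this" counts as one repeated word.
$ignoreCase = $text -replace $pattern, '$1'

# Case-sensitive: "This" and "this" differ, so only "this this" is collapsed.
$matchCase = $text -creplace $pattern, '$1'

$ignoreCase
$matchCase
```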

Summary

Text is demarcated either by single or double quotation marks. If you use double quotation marks, PowerShell will replace PowerShell variables and special characters in text. Text enclosed in single quotation marks will remain unchanged. The same is true for characters in text marked with the backtick character, which can be used to insert special characters in the text (Table 13.1).

The user can query text directly through the Read-Host cmdlet. Lengthier text, text of several lines, can also be inputted through Here-Strings, which are begun with @"(Enter) and ended with "@(Enter).

By using the format operator -f, you can output formatted text. This gives you the option to display text in different ways or to set fixed widths to output text in aligned columns (Table 13.3 through Table 13.5). Along with the formatting operator, PowerShell has a number of further string operators you can use to validate patterns or to replace a string (Table 13.2). Most of these operators are also available in two special forms, which are either case-insensitive (preceded by "i") or case-sensitive (preceded by "c").
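
As a brief reminder of both points, here is a minimal sketch with invented sample values: a format string with column widths, followed by the case-insensitive and case-sensitive forms of the -like operator.

```powershell
# {index,width}: a negative width left-aligns, a positive width right-aligns.
$row = '{0,-10}|{1,6}' -f 'Name', 42

# The "i" prefix forces a case-insensitive comparison,
# the "c" prefix a case-sensitive one.
$insensitive = 'PowerShell' -ilike 'powershell'
$sensitive = 'PowerShell' -clike 'powershell'

$row
$insensitive
$sensitive
```

Here the first column is padded to ten characters on the right and the number is right-aligned in a column six characters wide.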

PowerShell stores text in string objects, which contain dynamic methods to work on the stored text. You can use these methods by typing a dot after the string object (or the variable in which the text is stored) and then activating auto complete (Table 13.6). Along with the dynamic methods that always refer to text stored in a string object, there are also static methods that are provided directly by the string data type; you call them by prefixing the method name with "[string]::".
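
The distinction can be sketched in a few lines. The sample values are assumptions for illustration; ToUpper() and Substring() are called on a string object, while Join() and IsNullOrEmpty() are called on the string type itself.

```powershell
$text = 'PowerShell'

# Dynamic methods operate on the text stored in the object.
$upper = $text.ToUpper()
$start = $text.Substring(0, 5)

# Static methods are called on the [string] type, not on a particular string.
$joined = [string]::Join(', ', @('a', 'b', 'c'))
$isEmpty = [string]::IsNullOrEmpty('')

$upper
$start
$joined
$isEmpty
```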

The simplest way to describe patterns is to use the simple wildcards in Table 13.7. This allows you to check whether text is recognized in a particular pattern. However, simple wildcards are appropriate tools only for rudimentary pattern recognition. Moreover, simple wildcards can only recognize patterns; they cannot extract data from them. A far more sophisticated tool is regular expressions. They consist of the diverse elements listed in Table 13.11, consisting basically of the categories "character," "quantifier," and "anchor." Regular expressions describe any complex pattern and can be used along with the operators -match or -replace. Use the .NET object [regex] if you want to be very specific and utilize advanced functionality of regular expressions.

The -match operator reports whether the string contains the pattern you're looking for and subsequently retrieves the contents of the pattern in the $matches variable. This means that you can use -match not only to recognize patterns, but also to parse unstructured data directly. The -replace operator searches for a pattern and replaces it with an alternative string. Both operators also support back references, whose use was explained in detail in several chapter examples.
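
A compact sketch of parsing with -match follows. The sample line and the group names "title" and "name" are assumptions made up for illustration; $matches holds the entire match under key 0 and each named group under its name.

```powershell
# Illustrative input line.
$line = 'Mrs. Meyer called at 10:45'

# -match returns $true or $false and fills $matches on success.
$found = $line -match '(?<title>Mr\.|Mrs\.)\s*(?<name>\w+)'

if ($found) {
    $whole = $matches[0]        # the entire matched text
    $title = $matches['title']  # content of the group named "title"
    $name = $matches['name']    # content of the group named "name"
    "$title / $name / $whole"
}
```

$matches is only updated when a match succeeds, so it is safest to read it inside the if block that tests the result of -match.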


Posted  Mar 30 2009, 08:04 AM by  ps1
