http://powershell.com/cs/blogs/ebook/archive/2009/03/30/chapter-13-text-and-regular-expressions.aspx
Chapter 13. Text and Regular Expressions
PowerShell distinguishes sharply between text in single quotation marks and text in double quotation marks. PowerShell won't modify text wrapped in single quotation marks, but it does inspect text in double quotation marks and may modify it by inserting variable contents automatically. Enclosing text in double quotation marks is the simplest way to combine results and descriptions.
The formatting operator -f, one of many specialized string operators, offers more options. For example, you can use -f to output text column-by-column and to set it flush. Other string commands are also important. They can replace selected text, change case, and much more.
Pattern recognition adds a layer of complexity because it uses wildcard characters to match patterns. In simple cases, you can use the same wildcards that you use in the file system. Substantially more powerful, but also more complex, are regular expressions.
Topics Covered:
- Defining Text
- Using Special Text Commands
- Simple Pattern Recognition
- Regular Expressions
- Describing Patterns
- Simultaneous Searches for Different Terms
- Case Sensitivity
- Finding Information in Text
- Searching for Several Keywords
- Forming Groups
- Further Use of Sub-Expressions
- Greedy or Lazy? Detailed or Concise Results...
- Finding String Segments
- Replacing a String
- Using Back References
- Putting Characters First at Line Beginnings
- Removing Superfluous White Space
- Finding and Removing Doubled Words
- Summary
Defining Text
Use quotation marks to delimit text if you'd like to save it in a variable or output it. Use single quotation marks if you want the text to be stored exactly (literally) the way you typed it:
Text will have an entirely different character when you wrap it in (conventional) double quotation marks because enclosed special characters will be evaluated:
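The difference between the two kinds of quotation marks can be sketched as follows. This is a minimal illustration; the variable names are chosen for this example and do not come from the original text.

```powershell
# Hypothetical variable used only for this illustration
$name = 'World'

# Single quotation marks: the text is stored literally
$literal = 'Hello $name'

# Double quotation marks: the variable is replaced by its value
$expanded = "Hello $name"

$literal    # Hello $name
$expanded   # Hello World
```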
Special Characters in Text
If text is enclosed in double quotation marks, PowerShell will look for particular special characters in it. Two special characters are important in this regard: "$" and the special backtick character, "`".
Resolving Variables
If PowerShell encounters a variable, such as one of those from Chapter 3, it will replace the variable with its value:
$windir
This also applies to direct variables, which calculate their value themselves:
$result
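As a hedged sketch of how variables and sub-expressions are resolved inside double quotation marks: the names below are assumptions made for this example. Note that a property access such as .Length is only evaluated when it is wrapped in $( ).

```powershell
# Hypothetical value used only for this illustration
$item = 'CD'

# Only the variable is resolved; ".Length" stays plain text
$wrong = "$item.Length"

# The sub-expression $( ) is evaluated first, then inserted
$right = "$($item.Length)"
$sum = "2 + 2 = $(2 + 2)"

$wrong   # CD.Length
$right   # 2
$sum     # 2 + 2 = 4
```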
Inserting Special Characters
The peculiar backtick character, "`", has two tasks: if you type it before characters that are particularly important for PowerShell, such as "$" or quotation marks, PowerShell will interpret the characters following the backtick as normal text characters. You could output quotation marks in text like this:
If one of the letters listed in Table 13.1 follows the backtick character, PowerShell will insert special characters:
two lines!
Escape Sequence | Special Characters |
`n | New line |
`r | Carriage return |
`t | Tabulator |
`a | Alarm |
`b | Backspace |
`' | Single quotation mark |
`" | Double quotation mark |
`0 | Null |
`` | Backtick character |
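The two tasks of the backtick can be tried out with a short sketch. The variable names are illustrative only.

```powershell
# Task 1: the backtick masks characters that are special for PowerShell
$quoted = "He said: `"Hello`""
$price = "This costs `$5"

# Task 2: the backtick followed by a letter inserts a special character
$twoLines = "one`ntwo"   # `n = new line
$tabbed = "a`tb"         # `t = tabulator

$quoted     # He said: "Hello"
$price      # This costs $5
```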
"Here-Strings": Acquiring Text of Several Lines
Using "here-strings" is the best way to acquire long text consisting of several lines or many special characters. "Here-strings" are called by this name because they enable you to acquire text exactly the way you want to store it in a text variable, much like a text editor. A here-string begins with the characters @" and ends with the characters "@. Note here once again that PowerShell will automatically resolve (insert variable values and evaluate backtick characters in) text enclosed by @" and "@. If you use single quotation marks instead (@' and '@), the text will remain exactly the way you typed it:
Here-Strings can easily stretch over several lines and may also include
"quotation marks". Nevertheless, here, too, variables are replaced with
their values: C:\Windows, and subexpressions like 4 are likewise replaced
with their result. The text will be concluded only if you terminate the
here-string with the termination symbol "@.
"@
$text
"quotation marks". Nevertheless, here, too, variables are replaced with
their values: C:\Windows, and subexpressions like 4 are likewise replaced
with their result. The text will be concluded only if you terminate the
here-string with the termination symbol "@.
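A compact sketch comparing both here-string variants may help. The variable $name is an assumption for this example; the opening and closing markers each stand on their own line.

```powershell
# Hypothetical variable used only for this illustration
$name = 'World'

# Double-quoted here-string: variables and sub-expressions are resolved
$expanded = @"
Hello "$name",
2 + 2 = $(2 + 2)
"@

# Single-quoted here-string: the text stays exactly as typed
$literal = @'
Hello "$name",
2 + 2 = $(2 + 2)
'@

$expanded
$literal
```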
Communicating with the User
If you'd like to request users to input text, use Read-Host:
Your entry: Hello world !
$text
Text acquired by Read-Host behaves like text enclosed in single quotation marks. Consequently, special characters and variables are not resolved. Call the ExpandString() method manually if you want to resolve the contents of a text variable later on, that is, have the variables and special characters in it replaced. PowerShell normally uses this method internally whenever you specify text in double quotation marks:
$text = Read-Host "Your entry"
Your entry: $env:windir
$text
# Treat entered text as if it were in double quotation marks:
$ExecutionContext.InvokeCommand.ExpandString($text)
$text
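Because Read-Host needs interactive input, the same effect can be sketched with a literal string instead. The variables $name and $template are assumptions for this example.

```powershell
# Hypothetical values used only for this illustration
$name = 'World'

# Stored literally, just like text returned by Read-Host
$template = 'Hello $name'

# Resolve the text afterwards, as if it were in double quotation marks
$expanded = $ExecutionContext.InvokeCommand.ExpandString($template)

$template   # Hello $name
$expanded   # Hello World
```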
If you'd like to use Read-Host to acquire sensitive data such as passwords, use the -asSecureString parameter. The screen entries will be masked by asterisks. The result will be a so-called SecureString. To be able to work on the encrypted SecureString as a normal text entry, it must be changed to plain text first:
$pwd = Read-Host -asSecureString "Password"
[Runtime.InteropServices.Marshal]::`
PtrToStringAuto([Runtime.InteropServices.Marshal]::`
SecureStringToBSTR($pwd))
Querying User Name and Password
If you'd like to authenticate a user, for example by querying a name and password, use Get-Credential. This cmdlet uses the secure dialog boxes that are integrated into Windows to request user name and password:
-------- --------
/Your name System.Security.SecureString
The result is an object having two properties: the given user name is in UserName and the encrypted password is in Password as an instance of SecureString:
Normally, Get-Credential is used if logon data are actually needed, such as to run a program under a particular user name:
$logon = Get-Credential
$startinfo = New-Object System.Diagnostics.ProcessStartInfo
$startinfo.UserName = $logon.UserName
$startinfo.Password = $logon.Password
$startinfo.FileName = "$env:windir\regedit.exe"
$startinfo.UseShellExecute = $false
[System.Diagnostics.Process]::Start($startinfo)
However, the user context that creates the SecureString can turn it into readable text whenever you wish, as was the case for Read-Host. For this reason, you can also use Get-Credential to query sensitive information that you can work on subsequently in plain text:
[Runtime.InteropServices.Marshal]::`
PtrToStringAuto([Runtime.InteropServices.Marshal]::`
SecureStringToBSTR($logon.Password))
Using Special Text Commands
Often, results need to be properly output and provided with descriptions. The simplest approach doesn't require any special commands: insert the result as a variable or sub-expression directly into text and make sure that text is enclosed in double quotation marks.
"One CD has the capacity of $(720MB / 1.44MB) diskettes."
# Embedding a variable in text:
$result = 720MB / 1.44MB
"One CD has the capacity of $result diskettes."
More options are offered by special text commands that PowerShell furnishes from three different areas:
- String operators: PowerShell includes a number of string operators for general text tasks, which you can use to replace text and to compare text (Table 13.2).
- Dynamic methods: the String data type, which saves text, includes its own set of text statements that you can use to search through, dismantle, reassemble, and modify text in diverse ways (Table 13.6).
- Static methods: finally, the String .NET class includes static methods bound to no particular text.
String Operators
The -f format operator is the most important PowerShell string operator. You'll soon be using it to format numeric values for easier reading:
All operators function in basically the same way: they expect data on the left and on the right that they combine. For example, you can use -replace to replace parts of a string with other text:
There are three implementations of the -replace operator; many other string operators also have three implementations. Its basic version is case insensitive. If you'd like to distinguish between lowercase and uppercase, use the version beginning with "c" (for case-sensitive):
"Hello Carl" -creplace "carl", "eddie"
The third type begins with "i" (for insensitive) and is case insensitive. This means that this version is actually superfluous because it works the same way as -replace. The third version is merely demonstrative: if you use -ireplace instead of -replace, you'll make clear that you expressly do not want to distinguish between uppercase and lowercase.
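The three variants can be compared side by side in a short sketch. The variable names are illustrative. Keep in mind that -replace interprets the search term as a regular expression, which does not matter for a plain word like the one used here.

```powershell
# Basic version: case insensitive, so "carl" matches "Carl"
$insensitive = "Hello Carl" -replace "carl", "Eddie"

# Case-sensitive version: "carl" does not match "Carl", the text stays unchanged
$sensitive = "Hello Carl" -creplace "carl", "Eddie"

# Explicitly case-insensitive version: same result as -replace
$explicit = "Hello Carl" -ireplace "carl", "Eddie"

$insensitive   # Hello Eddie
$sensitive     # Hello Carl
$explicit      # Hello Eddie
```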
Operator | Description | Example |
* | Repeats a string | "=" * 20 |
+ | Combines two string parts | "Hello " + "World" |
-replace, -ireplace | Substitutes a string; case insensitive | "Hello Carl" -replace "Carl", "Eddie" |
-creplace | Substitutes a string; case sensitive | "Hello Carl" -creplace "carl", "eddie" |
-eq, -ieq | Verifies equality; case insensitive | "Carl" -eq "carl" |
-ceq | Verifies equality; case sensitive | "Carl" -ceq "carl" |
-like, -ilike | Verifies whether a string is included in another string (wildcards are permitted; case insensitive) | "Carl" -like "*AR*" |
-clike | Verifies whether a string is included in another string (wildcards are permitted; case sensitive) | "Carl" -clike "*AR*" |
-notlike, -inotlike | Verifies whether a string is not included in another string (wildcards are permitted; case insensitive) | "Carl" -notlike "*AR*" |
-cnotlike | Verifies whether a string is not included in another string (wildcards are permitted; case sensitive) | "Carl" -cnotlike "*AR*" |
-match, -imatch | Verifies whether a pattern is in a string; case insensitive | "Hello" -match "[ao]" |
-cmatch | Verifies whether a pattern is in a string; case sensitive | "Hello" -cmatch "[ao]" |
-notmatch, -inotmatch | Verifies whether a pattern is not in a string; case insensitive | "Hello" -notmatch "[ao]" |
-cnotmatch | Verifies whether a pattern is not in a string; case sensitive | "Hello" -cnotmatch "[ao]" |
Formatting String
The format operator -f takes a string containing wildcards (placeholders in braces) on its left side and, on its right side, the values that are to be inserted into the string in place of the wildcards:
The right side must supply a value for every wildcard used in the string on the left side. If you want to calculate a value on the spot, put the calculation in parentheses. As is generally true in PowerShell, the parentheses ensure that the enclosed statement is evaluated first and separately and that its result is then processed in place of the parentheses. Without parentheses, -f would report an error:
At line:1 char:33
+ "{0} diskettes per CD" -f 720mb/1 <<<< .44mb
You may use as many wildcard characters as you wish. The number in the braces states which value will appear later in the wildcard and in which order:
-f (720mb /1.44mb), 720, 1.44, "diskettes"
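How index numbers pick values from the right side can be sketched like this. The numbers and variable names are chosen for this example only.

```powershell
# The calculation is in parentheses so that it is evaluated before -f
$product = "{0} x {1} = {2}" -f 6, 7, (6 * 7)

# The index decides which value is used, independent of its position in the text;
# the same index may appear several times
$swapped = "{1}, {0}, {1}" -f "first", "second"

$product   # 6 x 7 = 42
$swapped   # second, first, second
```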
Setting Numeric Formats
The formatting operator -f can insert values into text as well as format the values. Every wildcard used has the following formal structure: {index[,alignment][:format]}:
- Index: This number indicates which value is to be used for this wildcard. For example, you could use several wildcards with the same index if you want to output one and the same value several times, or in various display formats. The index number is the only obligatory specification. The other two specifications are voluntary.
- Alignment: Positive or negative numbers can be specified that determine whether the value is right justified (positive number) or left justified (negative number). The number states the desired width. If the value is wider than the specified width, the specified width will be ignored. However, if the value is narrower than the specified width, the width will be filled with blank characters. This allows columns to be set flush.
- Format: The value can be formatted in very different ways. Here you can use the relevant format name to specify the format you wish. You'll find an overview of available formats below.
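The three parts of a wildcard can be tried out in a minimal sketch. The values are chosen so that the result does not depend on regional settings; the variable names are illustrative.

```powershell
# Alignment only: left justified in 8 characters, right justified in 6
$row = "{0,-8}|{1,6}|" -f "ab", 42

# Format only: hexadecimal with at least 4 digits
$hex = "{0:X4}" -f 255

# Alignment and format combined
$both = "{0,6:X2}|" -f 255

$row    # ab      |    42|
$hex    # 00FF
$both   #     FF|
```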
Unlike most of PowerShell, format specifiers are case sensitive. You can see how large the differences can be when you format dates:
"Date: {0:d}" -f (Get-Date)
"Date: {0:D}" -f (Get-Date)
Symbol | Type | Call | Result |
# | Digit placeholder | "{0:(#).##}" -f $value | (1000000) |
% | Percentage | "{0:0%}" -f $value | 100000000% |
, | Thousands separator | "{0:0,0}" -f $value | 1,000,000 |
,. | Integral multiple of 1,000 | "{0:0,.} " -f $value | 1000 |
. | Decimal point | "{0:0.0}" -f $value | 1000000.0 |
0 | 0 placeholder | "{0:00.0000}" -f $value | 1000000.0000 |
c | Currency | "{0:c}" -f $value | 1,000,000.00 € |
d | Decimal | "{0:d}" -f $value | 1000000 |
e | Scientific notation | "{0:e}" -f $value | 1.000000e+006 |
e | Exponent wildcard | "{0:00e+0}" -f $value | 10e+5 |
f | Fixed point | "{0:f}" -f $value | 1000000.00 |
g | General | "{0:g}" -f $value | 1000000 |
n | Thousands separator | "{0:n}" -f $value | 1,000,000.00 |
x | Hexadecimal | "0x{0:x4}" -f $value | 0xf4240 |
Using the formats in Table 13.3, you can format numbers quickly and comfortably. No need for you to squint your eyes any longer trying to decipher whether a number is a million or 10 million:
"{0:N0}" -f 10000000000
10,000,000,000
There's also a very wide range of time and date formats. The relevant formats are listed in Table 13.4 and their operation is shown in the following lines:
$date = Get-Date
Foreach ($format in "d", "D", "f", "F", "g", "G", "m", "r", "s", "t", "T", `
"u", "U", "y", "dddd, MMMM dd yyyy", "M/yy", "dd-MM-yy") {
"DATE with $format : {0}" -f $date.ToString($format) }
DATE with D : Monday, 15 October, 2007
DATE with f : Monday, 15 October, 2007 02:17 PM
DATE with F : Monday, 15 October, 2007 02:17:02 PM
DATE with g : 10/15/2007 02:17
DATE with G : 10/15/2007 02:17:02
DATE with m : October 15
DATE with r : Mon, 15 Oct 2007 02:17:02 GMT
DATE with s : 2007-10-15T02:17:02
DATE with t : 02:17 PM
DATE with T : 02:17:02 PM
DATE with u : 2007-10-15 02:17:02Z
DATE with U : Monday, 15 October, 2007 00:17:02
DATE with y : October, 2007
DATE with dddd, MMMM dd yyyy : Monday, October 15 2007
DATE with M/yy : 10/07
DATE with dd-MM-yy : 15-10-07
Symbol | Type | Call | Result |
d | Short date format | "{0:d}" -f $value | 09/07/2007 |
D | Long date format | "{0:D}" -f $value | Friday, September 7, 2007 |
t | Short time format | "{0:t}" -f $value | 10:53 AM |
T | Long time format | "{0:T}" -f $value | 10:53:56 AM |
f | Full date and time (short) | "{0:f}" -f $value | Friday, September 7, 2007 10:53 AM |
F | Full date and time (long) | "{0:F}" -f $value | Friday, September 7, 2007 10:53:56 AM |
g | Standard date (short) | "{0:g}" -f $value | 09/07/2007 10:53 AM |
G | Standard date (long) | "{0:G}" -f $value | 09/07/2007 10:53:56 AM |
M | Month and day | "{0:M}" -f $value | September 07 |
r | RFC1123 date format | "{0:r}" -f $value | Fri, 07 Sep 2007 10:53:56 GMT |
s | Sortable date format | "{0:s}" -f $value | 2007-09-07T10:53:56 |
u | Universally sortable date format | "{0:u}" -f $value | 2007-09-07 10:53:56Z |
U | Full date and time in universal time (UTC) | "{0:U}" -f $value | Friday, September 7, 2007 08:53:56 |
Y | Year/month format pattern | "{0:Y}" -f $value | September 2007 |
If you want to find out which types support formatting options, you need only look for .NET types whose ToString() method accepts a format string:
[AppDomain]::CurrentDomain.GetAssemblies() | ForEach-Object {
$_.GetExportedTypes() | Where-Object { !$_.IsSubclassOf([System.Enum]) }
} | ForEach-Object {
$Methods = $_.GetMethods() | Where-Object { $_.Name -eq "ToString" } | ForEach-Object { "$_" }
If ($Methods -eq "System.String ToString(System.String)") {
$_.FullName
}
}
System.DateTime
System.Byte
System.Convert
System.Decimal
System.Double
System.Guid
System.Int16
System.Int32
System.Int64
System.IntPtr
System.SByte
System.Single
System.UInt16
System.UInt32
System.UInt64
Microsoft.PowerShell.Commands.MatchInfo
For example, among the supported data types is the "globally unique identifier" System.Guid. Because you'll frequently need GUIDs, which are unique worldwide, here's a brief example showing how to create and format a GUID:
$GUID = [GUID]::NewGuid()
Foreach ($format in "N", "D", "B", "P") {
"GUID with $format : {0}" -f $GUID.ToString($format)}
GUID with D : 0c4d2c4c-8af8-4d19-8b69-8e57c1aee780
GUID with B : {0c4d2c4c-8af8-4d19-8b69-8e57c1aee780}
GUID with P : (0c4d2c4c-8af8-4d19-8b69-8e57c1aee780)
Symbol | Type | Call | Result |
dd | Day of month | "{0:dd}" -f $value | 07 |
ddd | Abbreviated name of day | "{0:ddd}" -f $value | Fri |
dddd | Full name of day | "{0:dddd}" -f $value | Friday |
gg | Era | "{0:gg}" -f $value | A. D. |
hh | Hours from 01 to 12 | "{0:hh}" -f $value | 10 |
HH | Hours from 0 to 23 | "{0:HH}" -f $value | 10 |
mm | Minute | "{0:mm}" -f $value | 53 |
MM | Month | "{0:MM}" -f $value | 09 |
MMM | Abbreviated month name | "{0:MMM}" -f $value | Sep |
MMMM | Full month name | "{0:MMMM}" -f $value | September |
ss | Second | "{0:ss}" -f $value | 56 |
tt | AM or PM | "{0:tt}" -f $value | |
yy | Year in two digits | "{0:yy}" -f $value | 07 |
yyyy | Year in four digits | "{0:yyyy}" -f $value | 2007 |
zz | Time zone including leading zero | "{0:zz}" -f $value | +02 |
zzz | Time zone in hours and minutes | "{0:zzz}" -f $value | +02:00 |
Outputting Values in Tabular Form: Fixed Width
To display the output of several lines in a fixed-width font and align them one below the other, each column of the output must have a fixed width. A formatting operator can set outputs to a fixed width.
In the following example, Dir returns a directory listing, from which a subsequent loop outputs file names and file sizes. Because file names and sizes vary, the result is ragged right and hard to read:
info.txt = 8562 Bytes
layout.lxy = 1280 Bytes
list.txt = 164186 Bytes
p1.nrproj = 5808 Bytes
ping.bat = 116 Bytes
SilentlyContinue = 0 Bytes
The following result with fixed column widths is far more legible. To set widths, add a comma to the sequential number of the wildcard and after it specify the number of characters available to the wildcard. Positive numbers will set values to right alignment, negative numbers to left alignment:
info.txt             =       8562 Bytes
layout.lxy           =       1280 Bytes
list.txt             =     164186 Bytes
p1.nrproj            =       5808 Bytes
ping.bat             =        116 Bytes
SilentlyContinue     =          0 Bytes
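The technique can be sketched without a real directory listing. The sample objects below are assumptions that stand in for the output of Dir, and the column widths (12 and 8) are chosen for this example.

```powershell
# Hypothetical sample data standing in for the objects returned by Dir
$files = @(
    [pscustomobject]@{ Name = 'info.txt'; Length = 8562 },
    [pscustomobject]@{ Name = 'list.txt'; Length = 164186 }
)

# {0,-12}: name left justified in 12 characters
# {1,8}:   size right justified in 8 characters
$lines = $files | ForEach-Object {
    "{0,-12} = {1,8} Bytes" -f $_.Name, $_.Length
}

$lines
```

Because every wildcard has a fixed width, all output lines have the same length and the columns line up in a fixed-width font.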
String Object Methods
You know from Chapter 6 that PowerShell stores everything in objects and that every object contains a set of instructions known as methods. Text is stored in a String object, which includes a number of useful commands for working with text. For example, to ascertain the file extension of a file name, use LastIndexOf() to determine the position of the last "." character, and then use Substring() to extract text starting from the position:
$path.Substring($path.LastIndexOf(".") + 1)
Another approach uses the dot as separator and Split() to split up the path into an array. The result is that the last element of the array (-1 index number) will include the file extension:
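Both approaches can be compared in a short sketch. The path is a made-up example and is not checked against the file system.

```powershell
# Hypothetical path used only for this illustration
$path = 'C:\test\Example.bat'

# Approach 1: find the last dot and extract everything after it
$ext1 = $path.Substring($path.LastIndexOf('.') + 1)

# Approach 2: split at every dot and take the last array element
$ext2 = $path.Split('.')[-1]

$ext1   # bat
$ext2   # bat
```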
Table 13.6 provides an overview of the methods that a String object includes.
Function | Description | Example |
CompareTo() | Compares one string to another | ("Hello").CompareTo("Hello") |
Contains() | Returns "True" if a specified comparison string is in a string or if the comparison string is empty | ("Hello").Contains("ll") |
CopyTo() | Copies part of a string to a character array | $a = ("Hello World").toCharArray(); ("User!").CopyTo(0, $a, 6, 5); $a |
EndsWith() | Tests whether the string ends with a specified string | ("Hello").EndsWith("lo") |
Equals() | Tests whether one string is identical to another string | ("Hello").Equals($a) |
IndexOf() | Returns the index of the first occurrence of a comparison string | ("Hello").IndexOf("l") |
IndexOfAny() | Returns the index of the first occurrence of any character in a comparison string | ("Hello").IndexOfAny("loe") |
Insert() | Inserts new string at a specified index in an existing string | ("Hello World").Insert(6, "brave ") |
GetEnumerator() | Retrieves a new object that can enumerate all characters of a string | ("Hello").GetEnumerator() |
LastIndexOf() | Finds the index of the last occurrence of a specified character | ("Hello").LastIndexOf("l") |
LastIndexOfAny() | Finds the index of the last occurrence of any character of a specified string | ("Hello").LastIndexOfAny("loe") |
PadLeft() | Pads a string to a specified length and adds blank characters to the left (right-aligned string) | ("Hello").PadLeft(10) |
PadRight() | Pads string to a specified length and adds blank characters to the right (left-aligned string) | ("Hello").PadRight(10) + "World!" |
Remove() | Removes any requested number of characters starting from a specified position | ("Hello World").Remove(5,6) |
Replace() | Replaces a character with another character | ("Hello World").Replace("l", "x") |
Split() | Converts a string with specified splitting points into an array | ("Hello World").Split("l") |
StartsWith() | Tests whether a string begins with a specified string | ("Hello World").StartsWith("He") |
Substring() | Extracts characters from a string | ("Hello World").Substring(4, 3) |
ToCharArray() | Converts a string into a character array | ("Hello World").toCharArray() |
ToLower() | Converts a string to lowercase | ("Hello World").toLower() |
ToLowerInvariant() | Converts a string to lowercase using casing rules of the invariant language | ("Hello World").toLowerInvariant() |
ToUpper() | Converts a string to uppercase | ("Hello World").toUpper() |
ToUpperInvariant() | Converts a string to uppercase using casing rules of the invariant language | ("Hello World").ToUpperInvariant() |
Trim() | Removes blank characters to the right and left | (" Hello ").Trim() + "World" |
TrimEnd() | Removes blank characters to the right | (" Hello ").TrimEnd() + "World" |
TrimStart() | Removes blank characters to the left | (" Hello ").TrimStart() + "World" |
Chars() | Provides a character at the specified position | ("Hello").Chars(0) |
Analyzing Methods: Split() as Example
You already know in detail from Chapter 6 how to use Get-Member to find out which methods an object contains and how to invoke them. Just as a quick refresher, let's look again at an example of the Split() method to see how it works.
Char[] separator, Int32 count), System.String[] Split(Char[] separator,
StringSplitOptions options), System.String[] Split(Char[] separator,
Int32 count, StringSplitOptions options), System.String[] Split(String[]
separator, StringSplitOptions options), System.String[] Split(String[]
separator, Int32 count, StringSplitOptions options)
Definition gets output, but it isn't very easy to read. Because Definition is also a string object, you can use methods from Table 13.6, including Replace(), to insert a line break where appropriate. That makes the result much more understandable:
System.String[] Split(Char[] separator, Int32 count)
System.String[] Split(Char[] separator, StringSplitOptions options)
System.String[] Split(Char[] separator, Int32 count,
StringSplitOptions options)
System.String[] Split(String[] separator, StringSplitOptions options)
System.String[] Split(String[] separator, Int32 count,
StringSplitOptions options)
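One way to produce such a listing is sketched below. It assumes that the individual signatures in Definition are separated by "), ", as in the output above; the number and exact wording of the signatures depend on the installed .NET version.

```powershell
# Ask Get-Member for the Split() method of any string
$definition = ('some text' | Get-Member -Name Split).Definition

# Insert a line break after every closing parenthesis that ends a signature
$readable = $definition.Replace('), ', ")`n")

# One array element per signature
$overloads = $readable.Split("`n")

$readable
```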
There are six different ways to invoke Split(). In simple cases, you might use Split() with only one argument. Split() then expects a character array and uses every single character as a possible separator. That's important because it means that you may use several separators at once:
b
c
d
e
f
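A minimal sketch of splitting at several separators at once follows. The explicit [char[]] cast is an addition of this example: it makes sure that the character-array signature is selected, because newer PowerShell versions may otherwise treat a multi-character string argument as one single separator.

```powershell
# Both "," and ";" are accepted as separators
$parts = 'a,b;c,d;e;f'.Split([char[]]',;')

$parts         # a, b, c, d, e, f (one element per line)
$parts.Count   # 6
```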
If the splitting separator itself consists of several characters, then it must be a string and not a single Char character. There are only two signatures that meet this condition:
System.String[] Split(String[] separator, StringSplitOptions options)
System.String[] Split(String[] separator, Int32 count,
StringSplitOptions options)
To use a particular signature, you must pass arguments of exactly the data types that it expects. If you want to use the first signature, the first argument must be of the String[] type and the second argument of the StringSplitOptions type. The simplest way for you to meet this requirement is by assigning arguments first to a strongly typed variable. Create the variable with exactly the type that the signature requires:
[StringSplitOptions] $option = "None"
# Create a variable of the String[] type:
[string[]] $separator = ",;"
# Invoke Split with the desired signature and use a two-character separator:
("a,b;c,;d,e;f,;g").Split($separator, $option)
d,e;f
g
Split() in fact now uses a separator consisting of several characters. It splits the string only at the points where it finds precisely the characters that were specified. That leaves the question of how you know that "None" is a valid value for the StringSplitOptions data type. The simple answer is: you don't need to know. If you assign a value that the data type can't handle, the error message will automatically list all valid values:
"System.StringSplitOptions" due to invalid enumeration
values. Specify one of the following enumeration values
and try again. The possible enumeration values are
"None, RemoveEmptyEntries".
At line:1 char:28
+ [StringSplitOptions]$option <<<< = "werner wallbach"
By now, the purpose of the valid values should be clear from their names. For example, what does RemoveEmptyEntries accomplish? If Split() runs into several separators following each other, empty array elements will be the consequence. RemoveEmptyEntries removes such empty entries. You could use it to remove redundant blank characters from a text:
[StringSplitOptions] $option = "RemoveEmptyEntries"
"This text has too much whitespace".Split(" ", $option)
text
has
too
much
whitespace
Now all you need is just a method that can convert the elements of an array back into text. The method is called Join(); it is not in a String object but in the String class.
Using String Class Commands
Chapter 6 clearly defined the distinction between classes and objects (or instances). Just to refresh your memory: every String object is derived from the String class. Both include diverse methods. You can see these methods at work when you press (Tab) after the following instruction, which activates AutoComplete:
Get-Member will return a list of all methods. This time, specify the -Static parameter in addition:
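Listing the static methods can be sketched as follows. Piping the type itself to Get-Member is one possible way to do it, and the call to Join() at the end is only a small illustration of invoking a static method.

```powershell
# List the names of the static methods of the String class
$staticMethods = [string] | Get-Member -Static -MemberType Method |
    ForEach-Object { $_.Name }

$staticMethods

# Static methods are invoked on the class, not on a particular text
$joined = [string]::Join('-', @('a', 'b', 'c'))
$joined   # a-b-c
```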
You've already used static methods. In reality, the -f format operator corresponds to the Format() static method, and that's why the following two statements work in exactly the same way:
"Hex value of 180 is &h{0:X}" -f 180
# The static method Format has the same result:
[string]::Format("Hex value of 180 is &h{0:X}", 180)
The Format() static method is very important but is usually ignored because -f is much easier to handle. But you wouldn't be able to do without two other static methods: Join() and Concat().
Join(): Changing Arrays to Text
Join() is the counterpart of Split() discussed above. Join() assembles an array of string elements into a string. It enables you to complete the above example and to make a function that removes superfluous white space characters from the string:
function RemoveSpace([string]$text) {
$private:array = $text.Split(" ", `
[StringSplitOptions]::RemoveEmptyEntries)
[string]::Join(" ", $array)
}
RemoveSpace "Hello, this text has too much whitespace."
Concat(): Assembling a String Out of Several Parts
Concat() assembles a string out of several separate parts. At first glance, it works like the "+" operator:
But note that the "+" operator behaves unexpectedly when the first value isn't a string:
"Today is " + (Get-Date)
# If the first value is not text, errors may result:
(Get-Date) + " is a great date!"
for "op_Addition" to type "System.TimeSpan": "Cannot convert
value " is a great date!" to type "System.TimeSpan". Error:
"Input string was not in a correct format.""
At line:1 char:13
+ (Get-Date) + <<<< " is a great date!
If the first value of the calculation is a string, all other values will be put into the string form and assembled as requested into a complete string. If the first value is not a string—in the example, it was a date value—all the other values will be changed to this type. That's just what causes an error, because it is impossible to change "is a great date!" to a date value. For this reason, the "+" operator is an unreliable tool for assembling a string.
Concat() causes fewer problems: it turns everything you pass to it into a string. When converting, Concat() calls each value's ToString() method, which takes your current regional settings into account; with U.S. English settings, for example, you get U.S. date and time formats:
[string]::Concat((Get-Date), " is a great date!")
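The difference between "+" and Concat() can be reproduced without dates in a small sketch. The number 42 and the variable names are assumptions for this example.

```powershell
# Concat() converts every argument to a string, whatever the first value is
$joined = [string]::Concat(42, ' is the answer')

# "+" takes its cue from the first value: PowerShell tries to convert
# the text to a number, which raises an error
$failed = $false
try {
    $null = 42 + ' is the answer'
}
catch {
    $failed = $true
}

$joined   # 42 is the answer
$failed   # True
```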
Simple Pattern Recognition
Recognizing patterns is a frequent task that is necessary for verifying user entries, such as to determine whether a user has given a valid network ID or valid e-mail address. Useful and effective pattern recognition requires wildcard characters that represent a certain number and type of characters.
A simple form of wildcards was invented for the file system many years ago and it still works today. In fact, you've doubtlessly used it before in one form or another:
# List all files in the current directory that
# have the txt file extension:
Dir *.txt
# List all files in the Windows directory that
# begin with "n" or "w":
dir $env:windir\[nw]*.*
# List all files whose file extensions begin with
# "t" and which are exactly 3 characters long:
Dir *.t??
# List all files that end in one of the letters
# from "e" to "z"
dir *[e-z].*
Wildcard | Description | Example |
* | Any number of any character (including no characters at all) | Dir *.txt |
? | Exactly one of any characters | Dir *.??t |
[xyz] | One of specified characters | Dir [abc]*.* |
[x-z] | One of the characters in the specified area | Dir *[p-z].* |
The placeholders in Table 13.7 not only work in the file system, but also in conjunction with string operators like -like and -notlike. This makes child's play of pattern recognition. For example, if you want to verify whether a user has given a valid IP address, you could do so in the following way:
If ( $ip -like "*.*.*.*") { "valid" } Else { "invalid" }
If you want to verify whether a valid e-mail address is in a variable, you could check the pattern in the following way:
$email -like "*.*@*.*"
However, such wildcards only reveal the worst errors and are not very exact:
# Wildcards permit only coarse pattern recognition and leave room for erroneous entries:
$ip = "300.werner.6666."
If ( $ip -like "*.*.*.*") { "valid" } Else { "invalid" }
# The following invalid e-mail address was not identified as false:
$email = ".@."
$email -like "*.*@*.*"
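The weakness of simple wildcards can be verified directly. In this sketch, the first two test strings are the ones used above and the third is an additional assumption.

```powershell
# Accepted although it is clearly no IP address
$ipCheck = '300.werner.6666.' -like '*.*.*.*'

# Accepted although it is clearly no e-mail address
$mailCheck = '.@.' -like '*.*@*.*'

# Rejected only because the three dots are missing entirely
$noDots = '192168' -like '*.*.*.*'

$ipCheck     # True
$mailCheck   # True
$noDots      # False
```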
Regular Expressions
Use regular expressions for more accurate pattern recognition if you require it. Regular expressions offer many more wildcard characters; for this reason, they can describe patterns in much greater detail. For the very same reason, however, regular expressions are also much more complicated.
Describing Patterns
Using the regular expression elements listed in Table 13.11, you can describe patterns with much greater precision. These elements are grouped into three categories:
- Char: The Char represents a single character and a collection of Char objects represents a string.
- Quantifier: Allows you to determine how often a character or a string occurs in a pattern.
- Anchor: Allows you to determine whether a pattern is a separate word or must be at the beginning or end of a sentence.
The pattern represented by a regular expression may consist of four different character types:
- Literal characters like "abc" that exactly match the "abc" string.
- Masked or "escaped" characters: characters with special meanings in regular expressions are understood as literal characters when preceded by "\". For example, "\[test\]" looks for the "[test]" string. The following characters have special meanings and for this reason must be masked if used literally: ". ^ $ * + ? { [ ] \ | ( )".
- Predefined wildcard characters that represent a particular character category and work like placeholders. For example, "\d" represents any digit from 0 to 9.
- Custom wildcard characters: They consist of square brackets, within which the characters are specified that the wildcard represents. If you want to use any character except for the specified characters, use "^" as the first character in the square brackets. For example, the placeholder "[^f-h]" stands for all characters except for "f", "g", and "h".
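The four character types can be tried out directly with -match. The following is a minimal sketch; the sample strings and variable names are made up for illustration, and [regex]::Escape() is the .NET helper that masks special characters for you:

```powershell
# Literal characters match themselves:
$literal = "xabcx" -match "abc"                  # True

# Escaped characters: \[ and \] are taken literally
$escaped = "a [test] b" -match "\[test\]"        # True

# [regex]::Escape() masks the special characters of a string automatically
$auto = "a [test] b" -match [regex]::Escape("[test]")

# Predefined wildcard: \d stands for one digit
$digit = "Room 7" -match "\d"                    # True

# Custom wildcard: [^f-h] stands for any character except f, g, and h
$custom = "g" -match "[^f-h]"                    # False
```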
Element | Description |
. | Exactly one character of any kind except for a line break (equivalent to [^\n]) |
[^abc] | All characters except for those specified in brackets |
[^a-z] | All characters except for those in the range specified in the brackets |
[abc] | One of the characters specified in brackets |
[a-z] | Any character in the range indicated in brackets |
\a | Bell alarm (ASCII 7) |
\c | Any character allowed in an XML name |
\cA-\cZ | Control+A to Control+Z, equivalent to ASCII 1 to ASCII 26 |
\d | A digit (equivalent to [0-9]) |
\D | Any character except for digits |
\e | Escape (ASCII 27) |
\f | Form feed (ASCII 12) |
\n | New line |
\r | Carriage return |
\s | Any whitespace character like a blank character, tab, or line break |
\S | Any character except for a blank character, tab, or line break |
\t | Tab character |
\uFFFF | Unicode character with the hexadecimal code FFFF. For example, the Euro symbol has the code 20AC |
\v | Vertical tab (ASCII 11) |
\w | Letter, digit, or underline |
\W | Any character except for letters, digits, and underlines |
\xnn | Particular character, where nn specifies the hexadecimal ASCII code |
.* | Any number of any character (including no characters at all) |
Quantifiers
Every wildcard listed in Table 13.8 is represented by exactly one character. Using quantifiers, you can determine more precisely how many characters each wildcard represents. For example, "\d{1,3}" stands for a digit occurring one to three times, that is, a one-to-three digit number.
Element | Description |
* | Preceding expression is not matched or matched once or several times (matches as much as possible) |
*? | Preceding expression is not matched or matched once or several times (matches as little as possible) |
.* | Any number of any character (including no characters at all) |
? | Preceding expression is not matched or matched once (matches as much as possible) |
?? | Preceding expression is not matched or matched once (matches as little as possible) |
{n,} | n or more matches |
{n,m} | Inclusive matches between n and m |
{n} | Exactly n matches |
+ | Preceding expression is matched once or several times (matches as much as possible) |
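A short sketch of how some of these quantifiers behave. The test strings are invented for illustration, and the patterns are wrapped in the "^" and "$" anchors so that the whole string has to fit the pattern:

```powershell
# {1,3}: between one and three digits
$oneToThree = "123"  -match "^\d{1,3}$"     # True
$tooMany    = "1234" -match "^\d{1,3}$"     # False

# *: the "b" may be missing or occur any number of times
$star = "ac" -match "^ab*c$"                # True

# +: the "b" must occur at least once
$plusMany = "abbbc" -match "^ab+c$"         # True
$plusNone = "ac"    -match "^ab+c$"         # False
```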
Anchors
Anchors determine whether a pattern has to be at the beginning or end of a string or on a word boundary. For example, the regular expression "\b\d{1,3}" finds numbers of up to three digits only if they begin at a word boundary in a string. The number "123" in the string "Bart123" would not be found.
Elements | Description |
$ | Matches at end of a string (\Z is less ambiguous for multi-line texts) |
\A | Matches at beginning of a string, including multi-line texts |
\b | Matches on word boundary (first or last characters in words) |
\B | Must not match on word boundary |
\Z | Must match at end of string, including multi-line texts |
^ | Must match at beginning of a string (\A is less ambiguous for multi-line texts) |
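The effect of anchors is easiest to see side by side. This is a small sketch with made-up strings:

```powershell
# \b: "123" does not begin at a word boundary in "Bart123"
$inWord   = "Bart123"  -match "\b\d{1,3}"   # False
$separate = "Bart 123" -match "\b\d{1,3}"   # True

# ^ and $: the entire string must consist of digits
$partial = "abc 123" -match "^\d+$"         # False
$whole   = "123"     -match "^\d+$"         # True
```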
Recognizing IP Addresses
Patterns such as an IP address can be described much more precisely by regular expressions than by simple wildcard characters. Usually, you would use a combination of characters and quantifiers to specify which characters may occur in a string and how often:
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
$ip = "a.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
$ip = "1000.10.10.10"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
The pattern is described here as four numbers (char: \d) of one to three digits each (using the quantifier {1,3}), separated by dots and anchored on word boundaries (using the anchor \b). This means that no letter, digit, or underline may directly precede or follow the address; typically it is set off by white space, punctuation, or the beginning or end of the text. The check is far from perfect because it does not verify whether the numbers really lie in the permitted range from 0 to 255.
$ip = "300.400.500.999"
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
Validating E-Mail Addresses
If you'd like to verify whether a user has given a valid e-mail address, use the following regular expression:
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$email = ".@."
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
Whenever you look for an expression that occurs as a single "word" in text, delimit your regular expression by word boundaries (anchor: \b). The regular expression will then find only passages that are demarcated from the rest of the text, for example by white space like blank characters, tabs, or line breaks, or by punctuation.
The regular expression subsequently specifies which characters may be included in an e-mail address. Permissible characters are in square brackets and consist of "ranges" (for example, "A-Z0-9") and single characters (such as "._%+-"). The "+" behind the square brackets is a quantifier and means that at least one of the given characters must be present. However, you can also stipulate as many more characters as you wish.
Following this is "@" and, after it, text made up of the same kind of characters as those in front of "@". A dot (\.) in the e-mail address follows. This dot is introduced with a "\" character because the dot has a different meaning in regular expressions if it isn't within square brackets. The backslash ensures that the regular expression understands the dot behind it literally.
After the dot is the domain identifier, which may consist solely of letters ([A-Z]). A quantifier ({2,4}) again follows the square brackets. It specifies that the domain identifier may consist of at least two and at most four of the given characters.
However, this regular expression still has one flaw. While it does verify whether a valid e-mail address is in the text somewhere, there could be another text before or after it:
$email -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
Because of "\b", your regular expression searches for the pattern anywhere in the text and takes only word boundaries into account. If you prefer to check whether the entire text corresponds to an authentic e-mail address, use the anchors for the string beginning ("^") and the string ending ("$") instead of word boundaries.
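The difference between the two kinds of anchors can be sketched as follows. The sample strings and variable names are assumptions chosen for illustration:

```powershell
$wordPattern  = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$wholePattern = "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

$embedded = "Email please to test@test.com and reply!"
$exact    = "test@test.com"

# Word boundaries accept an address anywhere in the text:
$foundInText = $embedded -match $wordPattern    # True

# ^ and $ accept only text that consists of nothing but the address:
$textIsAddress  = $embedded -match $wholePattern   # False
$exactIsAddress = $exact    -match $wholePattern   # True
```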
Simultaneous Searches for Different Terms
Sometimes, search terms are ambiguous because there may be several ways to write them. You can use the "?" quantifier to mark parts of the search term as optional. In simple cases, put a "?" after an optional character. Then the character in front of "?" may, but doesn't have to, turn up in the search term:
"color" -match "colou?r"
True
"colour" -match "colou?r"
True
The "?" character here doesn't represent any character at all, as you might expect after using simple wildcards. For regular expressions, "?" is a quantifier and always specifies how often the character or expression in front of it may occur. In the example, therefore, "u?" ensures that the letter "u" may, but need not, be in the specified location in the pattern. Other quantifiers are "*" (the preceding expression may occur any number of times, including not at all) and "+" (the preceding expression must occur at least once).
If you prefer to mark more than one character as optional, put the characters in a sub-expression, which is placed in parentheses. The following example recognizes both the month designators "Nov" and "November":
"November" -match "\bNov(ember)?\b"
If you'd rather use several alternative search terms, use the OR character "|":
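A minimal sketch of the alternation construct; the names used here are invented for illustration:

```powershell
# The pattern matches if either "Alice" or "Bob" occurs in the text:
$hasName = "Bob and Ted" -match "Alice|Bob"    # True
$noName  = "Ted"         -match "Alice|Bob"    # False
```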
And if you want to mix alternative search terms with fixed text, use sub-expressions again:
"Peter and Bob" -match "and (Bob|Willy)"
# does not find "and Bob":
"Bob and Peter" -match "and (Bob|Willy)"
Case Sensitivity
In keeping with customary PowerShell practice, the -match operator is case insensitive. Use the -cmatch operator as an alternative if you'd prefer case sensitivity:
"hello" -match "heLLO"
# -cmatch is case sensitive:
"hello" -cmatch "heLLO"
If you want case sensitivity in only some pattern segments, use -match. Also, specify in your regular expression which text segments are case sensitive and which are insensitive. Anything following the "(?i)" construct is case insensitive. Conversely, anything following "(?-i)" is case sensitive. This explains why the word "test" in the below example is recognized only if its last two characters are lowercase, while case sensitivity has no importance for the first two characters:
"TEST" -match "(?i)te(?-i)st"
If you use a .NET Framework RegEx object instead of -match, the RegEx object is case sensitive by default and so behaves like -cmatch. If you prefer case insensitivity, either use the above construct to specify the option in your regular expression or pass the "IgnoreCase" option to tell the RegEx object your preference:
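Both approaches might look like this. It is only a sketch: the variable names are arbitrary, and New-Object is used here to call the .NET Regex constructor that accepts a pattern and a RegexOptions value:

```powershell
# A RegEx object is case sensitive by default:
$sensitive = [regex]"test"
$r1 = $sensitive.IsMatch("TEST")       # False

# Option 1: switch on case insensitivity inside the pattern
$inline = [regex]"(?i)test"
$r2 = $inline.IsMatch("TEST")          # True

# Option 2: pass the IgnoreCase option to the constructor
$options  = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
$byOption = New-Object System.Text.RegularExpressions.Regex -ArgumentList "test", $options
$r3 = $byOption.IsMatch("TEST")        # True
```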
Element | Description | Category |
(xyz) | Sub-expression | |
| | Alternation construct | Selection |
\ | When followed by a character, the character is not recognized as a formatting character but as a literal character | Escape |
x? | Changes the x quantifier into a "lazy" quantifier | Option |
(?xyz) | Activates or deactivates special modes, among others, case sensitivity | Option |
x+ | Turns the x quantifier into a "greedy" quantifier | Option |
?: | Sub-expression is not captured, so no separate result is stored for it | Reference |
?<name> | Specifies name for back references | Reference |
Of course, a regular expression can perform any number of detailed checks, such as verifying whether numbers in an IP address lie within the permissible range from 0 to 255. The problem is that this makes regular expressions long and hard to understand. Fortunately, you generally won't need to invest much time in learning complex regular expressions like the ones coming up. It's enough to know which regular expression to use for a particular pattern. Regular expressions for nearly all standard patterns can be downloaded from the Internet. In the following example, we'll look more closely at a complex regular expression that evidently is entirely made up of the conventional elements listed in Table 13.11:
$ip -match "\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)" + `
"{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
The expression validates only expressions running into word boundaries (the anchor is \b). The following sub-expression defines every single number:
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
The construct ?: is optional and enhances speed. After it come three alternatively permitted number formats separated by the alternation construct "|". 25[0-5] is a number from 250 through 255. 2[0-4][0-9] is a number from 200 through 249. Finally, [01]?[0-9][0-9]? is a number from 0-9 or 00-99 or 100-199. The quantifier "?" ensures that the preceding pattern may, but need not, be included. The result is that the sub-expression describes numbers from 0 through 255. An IP address consists of four such numbers. A dot always follows the first three numbers. For this reason, the following expression includes a definition of the number:
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
A dot (\.) is appended to the number. This construct is supposed to be present three times ({3}). When the fourth number is also appended, the regular expression is complete. You have learned to create sub-expressions (by using parentheses) and how to iterate sub-expressions (by indicating the number of iterations in braces after the sub-expression), so you should now be able to shorten the first used IP address regular expression:
$ip -match "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
$ip -match "\b(?:\d{1,3}\.){3}\d{1,3}\b"
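To keep the range-checking expression readable, it can be assembled from a variable that holds the definition of a single number. The following is a sketch under that assumption; the variable names $octet and $ipPattern and the test addresses are made up:

```powershell
# One number from 0 through 255:
$octet = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three numbers followed by a dot, then a fourth number, on word boundaries:
$ipPattern = '\b(?:' + $octet + '\.){3}' + $octet + '\b'

$valid    = "192.168.1.1"     -match $ipPattern    # True
$tooLarge = "300.400.500.999" -match $ipPattern    # False
$edgeCase = "256.1.1.1"       -match $ipPattern    # False
```

Single quotation marks are used for the pattern pieces so that PowerShell leaves all special characters untouched.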
Finding Information in Text
Regular expressions can recognize patterns. They can also filter out data corresponding to certain patterns from text. As such, regular expressions are excellent tools for parsing raw data. For example, use the same regular expression as the one above to identify e-mail addresses if you want to extract an e-mail address from a letter. Afterwards, look in the $matches variable to see which results were returned. The $matches variable is created automatically when you use the -match operator (or one of its siblings, like -cmatch).
$matches is a hash table (Chapter 4), so you can either output the entire hash table or access single elements in it by using their names, which you must specify in square brackets:
# Simple pattern recognition:
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
# Reading data matching the pattern from raw text:
$matches
---- -----
0 tobias@powershell.com
$matches[0]
Does that also work for more than one e-mail address in a text? Unfortunately, not right away. The -match operator looks only for the first match. So, if you want to find more than one occurrence of a pattern in raw text, you have to switch over to the RegEx object underlying the -match operator and use it directly.
In one essential respect, the RegEx object behaves unlike the -match operator. Case sensitivity is the default for the RegEx object, but not for -match. For this reason, you must put the "(?i)" option in front of the regular expression to eliminate confusion, making sure the expression is evaluated without taking case sensitivity into account.
$rawtext = "test@test.com sent an e-mail that was forwarded to spam@muell.de."
$rawtext -match "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$matches
---- -----
0 test@test.com
# A RegEx object can find any pattern but is case sensitive by default:
$regex = [regex]"(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
$regex.Matches($rawtext)
Groups : {test@test.com}
Success : True
Captures : {test@test.com}
Index : 4
Length : 13
Value : test@test.com
Groups : {spam@muell.de}
Success : True
Captures : {spam@muell.de}
Index : 42
Length : 13
Value : spam@muell.de
# Limit result to e-mail addresses:
$regex.Matches($rawtext) | Select-Object -Property Value
-----
test@test.com
spam@muell.de
# Continue processing e-mail addresses:
$regex.Matches($rawtext) | ForEach-Object { "found: $($_.Value)" }
found: test@test.com
found: spam@muell.de
Searching for Several Keywords
You can use the alternation construct "|" to search for a group of keywords, and then find out which keyword was actually found in the string:
$matches
---- -----
0 Set
$matches tells you which keyword actually occurs in the string. But note the order of keywords in your regular expression—it's crucial because the first matching keyword is the one selected. In this example, the result would be incorrect:
$matches[0]
Either change the order of keywords so that longer keywords are checked before shorter ones …:
$matches[0]
... or make sure that your regular expression is precisely formulated, and remember that you're actually searching for single words. Insert word boundaries into your regular expression so that sequential order no longer plays a role:
$matches[0]
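The effect of keyword order can be reproduced with a short sketch. The keywords "Set" and "SetValue" and the sample text are assumptions chosen to fit the description above:

```powershell
$text = "SetValue b=12"

# Shorter keyword first: the match stops at "Set"
$null = $text -match "Set|SetValue"
$shortFirst = $matches[0]                 # Set

# Longer keyword first: the complete keyword is found
$null = $text -match "SetValue|Set"
$longFirst = $matches[0]                  # SetValue

# Word boundaries make the order irrelevant
$null = $text -match "\b(Set|SetValue)\b"
$bounded = $matches[0]                    # SetValue
```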
It's true here, too, that -match finds only the first match. If your raw text has several occurrences of the keyword, use a RegEx object again:
$regex.Matches("Set a=1; GetValue a; SetValue b=12")
Success : True
Captures : {Set}
Index : 0
Length : 3
Value : Set
Groups : {GetValue, GetValue}
Success : True
Captures : {GetValue}
Index : 9
Length : 8
Value : GetValue
Groups : {SetValue, SetValue}
Success : True
Captures : {SetValue}
Index : 21
Length : 8
Value : SetValue
Forming Groups
A raw text line is often a heaping trove of useful data. You can use parentheses to collect this data in sub-expressions so that it can be evaluated separately later. The basic principle is that all the data that you want to find in a pattern should be wrapped in parentheses because $matches will return the results of these sub-expressions as independent elements. For example, if a text line contains a date first, then text, and if both are separated by tabs, you could describe the pattern like this:
$pattern = "(.*)\t(.*)"
# Generate example line with tab character
$line = "12/01/2009`tDescription"
# Use regular expression to parse line:
$line -match $pattern
# Show result:
$matches
---- -----
2 Description
1 12/01/2009
0 12/01/2009 Description
$matches[1]
$matches[2]
When you use sub-expressions, $matches will contain the entire searched pattern in the first array element named "0". Sub-expressions defined in parentheses follow in additional elements. To make them easier to read and understand, you can assign sub-expressions their own names and later use the names to call results. To assign names to a sub-expression, type ?<Name> in parentheses for the first statement:
$pattern = "(?<Date>.*)\t(?<Text>.*)"
# Generate example line with tab character:
$line = "12/01/2009`tDescription"
# Use a regular expression to parse line:
$line -match $pattern
# Show result:
$matches
---- -----
Text Description
Date 12/01/2009
0 12/01/2009 Description
$matches.Date
$matches.Text
Each result retrieved by $matches for each sub-expression naturally requires storage space. If you don't need the results, discard them to increase the speed of your regular expression. To do so, type "?:" as the first statement in your sub-expression:
$pattern = "(?<Date>.*)\t(?:.*)"
# Generate example line with tab character:
$line = "12/01/2009`tDescription"
# Use a regular expression to parse line:
$line -match $pattern
# No more results will be returned for the second subexpression:
$matches
---- -----
Date 12/01/2009
0 12/01/2009 Description
Further Use of Sub-Expressions
With the help of results from each sub-expression, you can create surprisingly flexible regular expressions. For example, how could you define a Web site HTML tag as a pattern? A tag always has the same structure: <tagname [parameter]>...</tagname>. This means that a pattern for one particular strictly predefined HTML tag can be found quickly:
$matches[1]
The pattern begins with the fixed text "<body". A word boundary (\b) follows and, after it, any number of additional characters with the exception of ">" ([^>]*). The concluding ">" follows and then the contents of the body tag, which may consist of any number of any characters (.*?). The expression, enclosed in parentheses, is a sub-expression and will be returned later as a result in $matches so that you'll know what is inside the body tag. The concluding part of the tag follows in the form of fixed text ("</body>").
This regular expression works fine for body tags, but not for other tags. Does this mean that a regular expression has to be defined for every HTML tag? Naturally not. There's a simpler solution. The problem is that the name of the tag in the regular expression occurs twice, once initially ("<body...>") and once terminally ("</body>"). If the regular expression is supposed to be able to process any tags, then it would have to be able to find out the name of the tag automatically and use it in both locations. How to accomplish that? Like this:
$matches
---- -----
2 Contents
1 body
0 <body background=2>Contents</body>
This regular expression no longer contains a strictly predefined tag name and works for any tags matching the pattern. How does that work? The initial tag in parentheses is defined as a sub-expression, more specifically as a word that begins with a letter and that can consist of any additional alphanumeric characters.
The name of the tag revealed here must subsequently be repeated in the terminal part. Here you'll find "</\1>". "\1" refers to the result of the first sub-expression. The first sub-expression evaluated the tag name and so this name is used automatically for the terminal part.
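A pattern of this kind might look as follows. This is a sketch that follows the description above, not necessarily the exact expression used originally; the variable names are arbitrary:

```powershell
# Sub-expression 1: tag name (a letter followed by letters or digits)
# Sub-expression 2: tag contents (lazy)
# \1 repeats whatever sub-expression 1 matched
$tagPattern = "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"

$found    = "<body background=2>Contents</body>" -match $tagPattern
$tagName  = $matches[1]     # body
$contents = $matches[2]     # Contents
```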
The following RegEx object could directly return the contents of any HTML tag:
$result = $regexTag.Matches("<button>Press here</button>")
$result[0].Groups[2].Value + " is in tag " + $result[0].Groups[1].Value
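For completeness, here is a self-contained sketch of such a RegEx object. The pattern assigned to $regexTag is an assumption based on the description above; because a RegEx object is case sensitive by default, "(?i)" is put in front of it:

```powershell
# Hypothetical definition of $regexTag:
$regexTag = [regex]"(?i)<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"

$result  = $regexTag.Matches("<button>Press here</button>")
$message = $result[0].Groups[2].Value + " is in tag " + $result[0].Groups[1].Value
$message      # Press here is in tag button
```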
Greedy or Lazy? Detailed or Concise Results...
Readers who have paid careful attention may wonder why the contents of the HTML tag were defined by ".*?" and not simply by ".*". After all, ".*" should suffice so that an arbitrary character (char: ".") can turn up any number of times (quantifier: "*"). At first glance, the difference between ".*" and ".*?" is not easy to recognize, but a short example should make it clear.
Assume that you would like to evaluate month specifications in a logging file, but the months are not all specified in the same way. Sometimes you use the short form, other times the long form of the month name is used. As you've seen, that's no problem for regular expressions, because sub-expressions allow parts of a keyword to be declared optional:
$matches[0]
"February" -match "Feb(ruary)?"
$matches[0]
In both cases, the regular expression recognizes the month, but returns different results in $matches. By default, the regular expression is "greedy" and wants to achieve a match in as much detail as possible. If the text is "February," then the expression will search for a match starting with "Feb" and then continue searching "greedily" to check whether even more characters match the pattern. If they do, the entire (detailed) text is reported back.
However, if your main concern is just standardizing the names of months, you would probably prefer getting back the shortest common text. That's exactly what the "??" quantifier does, which in contrast to the default greedy behavior is "lazy." As soon as it recognizes a pattern, it returns it without checking whether additional characters might match the pattern optionally.
$matches[0]
"February" -match "Feb(ruary)??"
$matches[0]
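The two quantifiers can be compared directly. A minimal sketch with arbitrary variable names:

```powershell
# Greedy: the optional part is included whenever it is present
$null = "February" -match "Feb(ruary)?"
$greedy = $matches[0]        # February

# Lazy: the optional part is left out whenever the rest of the pattern allows it
$null = "February" -match "Feb(ruary)??"
$lazy = $matches[0]          # Feb
```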
Just what is the connection between the "??" quantifier of this example and the "*?" of the preceding example? In reality, "*?" is not a self-contained quantifier. The appended "?" just turns a normally "greedy" quantifier into a "lazy" quantifier. This means you could use "?" to force the quantifier "*" to be "lazy" and to return the shortest possible result. That's exactly what happened with our regular expressions for HTML tags. You can see how important this is when you use the greedy quantifier "*" instead of "*?": it will attempt to retrieve a result in as much detail as possible. That can go wrong:
"<body background=1>Contents</body></body>" -match "<body\b[^>]*>(.*)</body>"
$matches[1]
# The right quantifier is *?, the lazy one, which returns results that
# are as short as possible
"<body background=1>Contents</body></body>" -match "<body\b[^>]*>(.*?)</body>"
$matches[1]
According to the definition of the regular expression, any characters are allowed inside the tag. Moreover, the entire expression must end with "</body>". If "</body>" is also inside the tag, the following will happen: the greedy quantifier ("*"), coming across the first "</body>", will at first assume that the pattern is already completely matched. But because it is greedy, it will continue to look and will discover the second "</body>" that also fits the pattern. The result is that it will take both "</body>" specifications into account, allocate one to the contents of the tag, and use the other as the conclusion of the tag.
In this example, it would be better to use the lazy quantifier ("*?"), which notices when it encounters the first "</body>" that the pattern is already correctly matched and consequently doesn't go to the trouble of continuing to search. It will ignore the second "</body>" and use the first to conclude the tag.
Finding String Segments
Entire books have been written about the uses of regular expressions. That's why it would go beyond the scope of this book to discuss more details. However, our last example, which locates text segments, shows how you can use the elements listed in Table 13.11 to harvest surprising search results. If you specify two words, the regular expression retrieves the text segment from the first word to the second, provided that at least one and at most six other words lie between them:
True
$matches[0]
---- -----
0 start to end
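One way to write such a pattern is sketched below. The pattern, the words "start" and "end", and the sample sentences are assumptions for illustration and not necessarily the original expression:

```powershell
# "start", then one to six words (each followed by separators), then "end"
$segmentPattern = "\bstart\W+(?:\w+\W+){1,6}?end\b"

$found   = "Find word segments from start to end" -match $segmentPattern
$segment = $matches[0]                                 # start to end

# No word between the two words: no match
$adjacent = "start end" -match $segmentPattern         # False
```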
Replacing a String
You already know how to replace a string because you were already introduced to the -replace operator. Simply tell the operator what term you want to replace in a string and the task is done:
But simple replacement isn't always sufficient, so you need to use regular expressions for replacements. Some of the following interesting examples show how that could be useful.
Perhaps you'd like to replace several different terms in a string with one other term. Without regular expressions, you'd have to replace each term separately. Or use instead the alternation operator, "|", with regular expressions:
You can type any term in parentheses and use the "|" symbol to separate them. All the terms will be replaced with the replacement string you specify.
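Both kinds of replacement can be sketched like this. The sample strings are assumptions; the second one is borrowed from the back-reference examples in the next section:

```powershell
# Simple replacement of one term:
$simple = "Hello Carl" -replace "Carl", "Eddie"
$simple       # Hello Eddie

# Replacing several alternative terms in one step:
$multi = "Mr. Miller, Mrs. Meyer and Mr. Werner" -replace "(Mr\.|Mrs\.)", "Our client"
$multi        # Our client Miller, Our client Meyer and Our client Werner
```

The dots are masked with "\" here so that they are matched literally rather than as "any character."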
Using Back References
This last example replaces specified keywords anywhere in a string. Often, that's sufficient, but sometimes you don't want to replace a keyword everywhere it occurs but only when it occurs in a certain context. In such cases, the context must be defined in some way in the pattern. How could you change the regular expression so that it replaces only the names Miller and Meyer? Like this:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client"
The result looks a little peculiar, but the pattern you're looking for was correctly identified. The only replacements were Mr. or Mrs. Miller and Mr. or Mrs. Meyer. The term "Mr. Werner" wasn't replaced. Unfortunately, the result also shows that it doesn't make any sense here to replace the entire pattern. At least the name of the person should be retained. Is that possible?
This is where the back referencing you've already seen comes into play. Whenever you use parentheses in your regular expression, the result inside the parentheses is evaluated separately, and you can use these separate results in your replacement string. The first sub-expression always reports whether a "Mr." or a "Mrs." was found in the string. The second sub-expression returns the name of the person. The terms "$1" and "$2" provide you the sub-expressions in the replacement string (the number is consequently a sequential number; you could also use "$3" and so on for additional sub-expressions).
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client $2"
Strangely enough, at first the back references don't seem to work. The cause can be found quickly: "$1" and "$2" look like PowerShell variables, but in reality they are regular-expression terms of the -replace operator. As a result, if you put the replacement string inside double quotation marks, PowerShell will replace "$2" with the PowerShell variable $2, which is normally empty. For replacement with back references to work, you must either put the replacement string inside single quotation marks or add a backtick to the "$" special character so that PowerShell won't recognize it as its own variable and replace it:
# Put the replacement text in single quotation marks
# so that PowerShell does not expand it as the PS variable $2:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", 'Our client $2'
# Alternatively, $ can also be masked by `$:
"Mr. Miller, Mrs. Meyer and Mr. Werner" -replace `
"(Mr.|Mrs.)\s*(Miller|Meyer)", "Our client `$2"
Putting Characters First at Line Beginnings
Replacements can also be made in multiple places in text of several lines. For example, when you respond to an e-mail, usually the text of the old e-mail is quoted in your new e-mail and marked with ">" at the beginning of each line. Regular expressions can do the marking.
However, to accomplish this, you need to know a little more about "multi-line" mode. Normally, this mode is turned off, and the "^" anchor represents the text beginning and the "$" anchor the text ending. For these two anchors to refer to the beginning and ending of each line in a text of several lines, multi-line mode must be turned on with the "(?m)" statement. Only then will -replace substitute the pattern in every single line. Once multi-line mode is turned on, the anchors "^" and "\A", as well as "$" and "\Z", behave differently from each other. "\A" will continue to indicate the text beginning, while "^" will mark the line beginning; "\Z" will indicate the text ending, while "$" will mark the line ending.
$text = @"
Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
"@
$text
Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
# Normally, -replace doesn't work in multiline mode.
# For this reason, only the first line is replaced:
$text -replace "^", "> "
> Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
# If you turn on multiline mode, replacement will work in every line:
$text -replace "(?m)^", "> "
> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That's why I would put a ">" before every line.
# The same can also be accomplished by using a RegEx object,
# where the multiline option must be specified:
[regex]::Replace($text, "^", "> ", `
[Text.RegularExpressions.RegExOptions]::Multiline)
> Here is a little text.
> I want to attach this text to an e-mail as a quote.
> That's why I would put a ">" before every line.
# In multiline mode, \A stands for the text beginning
# and ^ for the line beginning:
[regex]::Replace($text, "\A", "> ", `
[Text.RegularExpressions.RegExOptions]::Multiline)
> Here is a little text.
I want to attach this text to an e-mail as a quote.
That's why I would put a ">" before every line.
Removing Superfluous White Space
Regular expressions can perform routine tasks as well, such as remove superfluous white space. The pattern describes a blank character (char: "/s") that occurs at least twice (quantifier: "{2,}"). That is replaced with a normal blank character.
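A minimal sketch of this replacement; the sample sentence is made up for illustration:

```powershell
# Two or more whitespace characters in a row are replaced with one blank:
$clean = "Too   many   blank characters" -replace "\s{2,}", " "
$clean        # Too many blank characters
```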
Finding and Removing Doubled Words
How is it possible to find and remove doubled words in text? Here, you can use back referencing again. The pattern could be described as follows:
The pattern searched for is a whole word (anchor: "\b"). It consists of one or more word characters (the character "\w" and quantifier "+") and is captured as the first sub-expression. White space follows (the character "\s" and quantifier "+") and then the same word again (back reference "\1"). This group, the white space and the repeated word, must occur at least once (at least one and any number of iterations of the word, quantifier "{1,}"). The entire pattern is then replaced with the first back reference, that is, the first located word.
"This this this is a test" -replace "\b(\w+)(\s+\1){1,}\b", '$1'
Summary
Text is demarcated either by single or double quotation marks. If you use double quotation marks, PowerShell will replace PowerShell variables and special characters in text. Text enclosed in single quotation marks will remain unchanged. The same is true for characters in text marked with the backtick character, which can be used to insert special characters in the text (Table 13.1).
The user can query text directly through the Read-Host cmdlet. Lengthier text, text of several lines, can also be inputted through Here-Strings, which are begun with @"(Enter) and ended with "@(Enter).
By using the format operator -f, you can output formatted text. This gives you the option to display text in different ways or to set fixed widths to output text in aligned columns (Table 13.3 through Table 13.5). Along with the formatting operator, PowerShell has a number of further string operators you can use to validate patterns or to replace a string (Table 13.2). Most of these operators are also available in two special forms, which are either case-insensitive (preceded by "i") or case-sensitive (preceded by "c").
PowerShell stores text in string objects, which contain dynamic methods to work on the stored text. You can use these methods by typing a dot after the string object (or the variable in which the text is stored) and then activating auto complete (Table 13.6). Along with the dynamic methods that always refer to text stored in a string object, there are also static methods that are provided directly by the string data type by qualifying the string object with "[string]::".
The simplest way to describe patterns is to use the simple wildcards in Table 13.7. This allows you to check whether text is recognized in a particular pattern. However, simple wildcards are appropriate tools only for rudimentary pattern recognition. Moreover, simple wildcards can only recognize patterns; they cannot extract data from them. A far more sophisticated tool is regular expressions. They consist of the diverse elements listed in Table 13.11, consisting basically of the categories "character," "quantifier," and "anchor." Regular expressions describe any complex pattern and can be used along with the operators -match or -replace. Use the .NET object [regex] if you want to be very specific and utilize advanced functionality of regular expressions.
The -match operator reports whether the string contains the pattern you're looking for and subsequently retrieves the contents of the pattern in the $matches variable. This means that you can use -match not only to recognize patterns, but also to parse unstructured data directly. The -replace operator searches for a pattern and replaces it with an alternative string. Both operators also support back references, whose use was explained in detail in several chapter examples.
Posted Mar 30 2009, 08:04 AM by ps1