String Operations in Shell

t0nsha

These notes are partially based on lecture notes by Professor Nikolai Bezroukov at FDU.

String operators allow you to manipulate the contents of a variable without resorting to AWK or Perl. Modern shells such as bash 3.x or ksh93 support most of the standard string manipulation functions, but in a very idiosyncratic way. Standard functions such as length, index, and substr are all available. Strings can be concatenated by juxtaposition and by using double-quoted strings. You can ensure that variables exist (i.e., are defined and have non-null values), set default values for variables, and catch errors that result from variables not being set. You can also perform basic pattern matching. The basic string operations available in bash, ksh93 and similar shells are described below.

Introduction

String operators in shell use a curly-brace syntax that is unique among programming languages. In shell, any variable can be written as ${name_of_the_variable} instead of $name_of_the_variable. This notation is most often used to keep a variable name from merging with the string that follows it. Here is an example in which it separates the variable $var from the string "_string":

$ export var='test' 
$ echo ${var}_string # var is a variable that uses syntax ${var} and its value will be substituted
test_string
$ echo $var_string # var_string is a variable that doesn't exist, so echo doesn't print anything

In the Korn shell (ksh88) this notation was extended to allow expressions inside the curly braces, for example ${var=moo}. Each operation is encoded with a special symbol or a pair of symbols (a "digram" such as :- or :=). Any argument the operator needs is placed after the operation symbol. The notation was later extended in ksh93 and adopted by bash and other shells.

This "ksh-originated" group of operators is probably the most widely used group of string-handling operators, so it makes sense to learn them, if only to be able to modify old scripts. Bash 3.0 and later also has the =~ operator, which matches against extended regular expressions similar to Perl's; it can be used instead in many cases and is preferable in new scripts. Let's say we need to establish whether the variable $x looks like a social security number:

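if [[ $x =~ ^[0-9]{3}-[0-9]{2}-[0-9]{4}$ ]]   # ^ and $ anchor the match to the whole value
then
	echo "valid SSN format"          # process SSN here
else
	echo "not a valid SSN" >&2       # print error message
fi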

The ksh-originated operators can also test for the existence of variables and allow substitution of default values under certain conditions.

Note: The colon (:) in each of these operators is actually optional. If the colon is omitted, then change "exists and isn't null" to "exists" in each definition, i.e., the operator tests for existence only.
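The effect of the colon can be seen in a minimal sketch. The variable names empty and missing are illustrative only and do not come from the original notes.

```shell
#!/bin/bash
# 'empty' is set but null; 'missing' is not set at all (illustrative names).
empty=""
unset missing

with_colon_empty=${empty:-default}        # colon form: a null value counts as absent
without_colon_empty=${empty-default}      # no colon: the variable exists, so its empty value is used
with_colon_missing=${missing:-default}    # unset variable: the default is used
without_colon_missing=${missing-default}  # unset variable: the default is used

echo "with colon:    empty='$with_colon_empty' missing='$with_colon_missing'"
echo "without colon: empty='$without_colon_empty' missing='$without_colon_missing'"
```

Only the set-but-empty case differs: the colon form substitutes the default, while the form without the colon keeps the empty value.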

Bash and ksh also provide some (limited) regular expression functionality in the form of pattern-matching operators.

Variable substitution

The notation introduced in ksh88 was, and still is, very idiosyncratic. The examples below assume that the variable var has the value "this is a test" (as set by the statement export var="this is a test").

Operator #: deletes the shortest possible match from the left
```
echo ${var#t*is}
is a test
```
Operator ##: deletes the longest possible match from the left
```
echo ${var##t*is}
a test
```
Operator %: deletes the shortest possible match from the right
```
echo ${var%t*st}
this is a
```

Operator %%: deletes the longest possible match from the right

echo ${var%%t*st} # returns empty string as the first word is matched

Although the # and % operators look arbitrary, they have a convenient mnemonic if you use a US keyboard. The # key is to the left of the $ key and operates from the left, while % is to its right and is usually written to the right of a value, as in 100%. In addition, the C preprocessor uses # as a prefix for its directives (#define, #include).

Implementation of classic string operations in shell

Despite the shell's deficiencies in this area and idiosyncrasies preserved from the 1970s, most classic string operations can be implemented in shell. You can define functions that behave almost exactly like their counterparts in Perl or another "more normal" language. Where shell facilities are not enough, you can use AWK or Perl. It's actually sad that AWK was not integrated into the shell.

Length Operator

There are several ways to get the length of a string.

The simplest one is ${#varname}, which returns the length of the value of the variable as a character string. For example, if filename has the value fred.c, then ${#filename} would have the value 6.
The second is to use the external expr utility (the length keyword is a GNU extension), for example
```
expr length "$string"
```
or, portably,
```
expr "$string" : '.*'
```

Additional example from Advanced Bash-Scripting Guide

stringZ=abcABC123ABCabc

echo ${#stringZ}                 # 15
echo `expr length $stringZ`      # 15
echo `expr "$stringZ" : '.*'`    # 15

A more complex example: here is a function that validates that a string does not exceed a given maximum length. It requires two parameters, the string itself and the maximum length allowed.

check_length() 
# check_length 
# to call: check_length string max_length_of_string 
{ 
	# check we have the right params 
	if (( $# != 2 )) ; then 
	   echo "check_length need two parameters: a string and max_length" 
	   return 1 
	fi 
	if (( ${#1} > $2 )) ; then 
	   return 1 
	fi 
	return 0 
}

You could call the function check_length like this:

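#!/usr/bin/bash
# test_name 
# assumes the check_length function shown above is defined earlier in this script
while : 
do 
  echo -n "Enter customer name :" 
  read -r NAME 
  check_length "$NAME" 10 && break   # call the function directly, not inside [ ]
  echo "The string $NAME is longer than 10 characters"    
done

echo "$NAME"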

Determining the Length of Matching Substring at Beginning of String

This capability of the expr utility is rarely used, but it is sometimes useful:

expr match "$string" "$substring"

where:

$string is any variable or literal string.
$substring is a regular expression.

my_regex=abcABC123ABCabc
#       |------|

echo `expr match "$my_regex" 'abc[A-Z]*.2'`   # 8
echo `expr "$my_regex" : 'abc[A-Z]*.2'`       # 8
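The same length can be obtained without calling expr. This is a sketch that assumes bash 3.0 or later and relies on the =~ operator and the BASH_REMATCH array described later in these notes. The variable names re and match_len are illustrative.

```shell
#!/bin/bash
my_regex=abcABC123ABCabc

# expr anchors its pattern at the start of the string, so the ^ is added explicitly.
# Keeping the regex in a variable avoids quoting surprises inside [[ ]].
re='^abc[A-Z]*.2'

if [[ $my_regex =~ $re ]]; then
    match_len=${#BASH_REMATCH[0]}   # length of the matched text
else
    match_len=0                     # expr also prints 0 when nothing matches
fi
echo "$match_len"                   # 8
```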

Index

The index function returns the position, counting from one, of the first character in string that also appears in the given set of characters, or 0 if none of them is found.

expr index "$string" "$substring"

The result is the numerical position in $string of the first character that matches any character in $substring.

stringZ=abcABC123ABCabc
echo `expr index "$stringZ" C12`             # 6
                                             # C position.

echo `expr index "$stringZ" c`              # 3
# 'c' (in #3 position)

This is a close equivalent of strchr() in C (or of strpbrk() when $substring contains more than one character).
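Because expr index looks for any single character of its second argument, it does not locate a whole substring. A substring search can be sketched with the %% operator covered later in these notes. The function name str_index is hypothetical and not part of any shell.

```shell
#!/bin/bash
# str_index (hypothetical helper): print the 1-based position of the first
# occurrence of substring $2 in string $1, or 0 if it does not occur.
str_index() {
    local string=$1 substring=$2
    # Remove the longest suffix that starts with the substring;
    # what is left is everything before its first occurrence.
    local prefix=${string%%"$substring"*}
    if [[ $prefix == "$string" ]]; then
        echo 0                          # nothing was removed: no match
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

stringZ=abcABC123ABCabc
str_index "$stringZ" C12    # 6
str_index "$stringZ" 123    # 7
str_index "$stringZ" xyz    # 0
```

Quoting "$substring" inside the expansion makes the shell treat it as literal text rather than as a pattern.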

Substr

The substring function is available in shell as a parameter expansion of the form ${param:offset[:length]}.

If offset evaluates to a number less than zero, it counts back from the end of the string held in the variable $param.

Notes:

This operator uses zero-based indexing.
When you specify a negative offset as a numeric literal with a minus sign in front, unexpected things can happen. Consider
```
a=12345678
echo ${a:-4}
```
intending to print the last four characters of $a. The problem is that ${param:-word} already has a special meaning in shell: it substitutes the word after the minus sign if param is undefined or null. Since a is set here, the whole value 12345678 is printed. To use a negative offset that begins with a minus sign, separate the minus sign from the colon with a space.
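A short sketch of the difference, assuming bash. The variable names on the left-hand side are illustrative.

```shell
#!/bin/bash
a=12345678

last_four_space=${a: -4}    # space between ':' and '-': substring, last four characters
last_four_paren=${a:(-4)}   # parentheses around the offset work as well
default_form=${a:-4}        # ':-' default operator: a is set, so its whole value is used

echo "$last_four_space $last_four_paren $default_form"   # 5678 5678 12345678
```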

${string:position}

Extracts substring from $string at $position.

If the $string parameter is "*" or "@", then this extracts the positional parameters, starting at $position.

${string:position:length}

Extracts $length characters of substring from $string at $position.

stringZ=abcABC123ABCabc
#       0123456789.....
#       0-based indexing.

echo ${stringZ:0}                            # abcABC123ABCabc
echo ${stringZ:1}                            # bcABC123ABCabc
echo ${stringZ:7}                            # 23ABCabc

echo ${stringZ:7:3}                          # 23A
                                             # Three characters of substring.

If the $string parameter is "*" or "@", then this extracts a maximum of $length positional parameters, starting at $position.

echo ${*:2}          # Echoes second and following positional parameters.
echo ${@:2}          # Same as above.

echo ${*:2:3}        # Echoes three positional parameters, starting at second.

expr substr "$string" $position $length

Extracts $length characters from $string starting at $position.

The first character has index one.

stringZ=abcABC123ABCabc
#       123456789......
#       1-based indexing.

echo `expr substr $stringZ 1 2`              # ab
echo `expr substr $stringZ 4 3`              # ABC

Search and Replace

You can search for and replace a substring in a variable using ksh93 syntax:

alpha='This is a test string in which the word "test" is replaced.' 
beta="${alpha/test/replace}"

The variable beta now contains an edited version of the original string in which the first occurrence of the word "test" has been replaced by "replace". To replace all occurrences, not just the first, use this syntax:

beta="${alpha//test/replace}"

Note the double "//" symbol.

Here is an example in which we replace one string with another in a multi-line block of text:

list="cricket frog cat dog" 
# each "\n\" is a literal \n (for echo -e) followed by a line continuation
poem="I wanna be a x\n\
A x is what I'd love to be\n\
If I became a x\n\
How happy I would be.\n"
for critter in $list; do
   echo -e "${poem//x/$critter}"
done
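Bash and ksh93 also allow the pattern to be anchored to the start or the end of the value, and allow an empty replacement. The following sketch collects the variants side by side. The file name is an invented sample value.

```shell
#!/bin/bash
file="test_report_test.txt"     # sample value, chosen for illustration

first=${file/test/demo}         # first occurrence only
all=${file//test/demo}          # every occurrence
at_start=${file/#test/demo}     # only if the value starts with "test"
at_end=${file/%.txt/.log}       # only if the value ends with ".txt"
removed=${file//_/}             # an empty replacement deletes the matches

printf '%s\n' "$first" "$all" "$at_start" "$at_end" "$removed"
```

The search text is a shell pattern (wildcards), not a regular expression, so the dot in .txt matches only a literal dot.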

Concatenation

Strings can be concatenated by juxtaposition and using double quoted strings. For example

PATH="$PATH:/usr/games"

A double-quoted string in shell is almost identical to a double-quoted string in Perl: all variables in it are expanded. The minor difference is the treatment of escaped characters. If you want backslash escapes interpreted as well, you can use $'string':

#!/bin/bash

# String expansion.Introduced with version 2 of Bash.

#  Strings of the form $'xxx' have the standard escaped characters interpreted. 

echo $'Ringing bell 3 times \a \a \a'
     # May only ring once with certain terminals.
echo $'Three form feeds \f \f \f'
echo $'10 newlines \n\n\n\n\n\n\n\n\n\n'
echo $'\102\141\163\150'   # Bash
                           # Octal equivalent of characters.

exit 0

In bash-3.1, a string append operator (+=) was added:

PATH+=":$HOME/bin"   # use $HOME: a tilde inside quotes is not expanded
echo "$PATH"

Trimming from left and right

Using the wildcard character (?), you can easily imitate the Perl chop function, which cuts the last character off a string and returns the rest:

test="~/bin/"
trimmed_last=${test%?}
trimmed_first=${test#?}
echo "original='$test', trimmed_first='$trimmed_first', trimmed_last='$trimmed_last'"

The first character of a string can also be obtained with printf:

printf -v char "%c" "$source"

Conditional chopping, as done by the Perl chomp function or the REXX function trim, can be implemented with a while loop, for example:

function trim
{
   local target=$1
   while : # this is an infinite loop
   do
   case $target in
      ' '*) target=${target#?} ;; ## if $target begins with a space remove it
      *' ') target=${target%?} ;; ## if $target ends with a space remove it
      *) break ;; # no more leading or trailing spaces, so exit the loop
   esac
   done
   printf '%s\n' "$target" # print the result; return accepts only a numeric status
}

A more Perl-style method to trim trailing blanks would be

spaces=${source_var##*[! ]} ## get the trailing blanks in var $spaces

trimmed_var=${source_var%"$spaces"} ## remove them from the end of the string

The same trick can be used for removing leading spaces.
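A sketch of both directions combined, with the roles of # and % mirrored for the leading blanks. The sample value and the helper variable names are illustrative.

```shell
#!/bin/bash
source_var="   padded value   "     # sample value, chosen for illustration

# Leading blanks: drop the longest suffix that starts with a non-space character;
# what remains is the run of blanks at the front.
leading=${source_var%%[! ]*}
no_leading=${source_var#"$leading"}

# Trailing blanks: drop the longest prefix that ends with a non-space character;
# what remains is the run of blanks at the end.
trailing=${no_leading##*[! ]}
trimmed_var=${no_leading%"$trailing"}

echo "[$trimmed_var]"               # [padded value]
```

Quoting "$leading" and "$trailing" makes the shell treat the blanks as literal text rather than as a pattern.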

Assignment of default value for undefined variables

The operator ${var:-bar} is useful for supplying a default value for a variable. It works in the following way: if $var exists and is not null, return $var. If it doesn't exist or is null, return bar.

Example:

$ export var=""
$ echo ${var:-one}
one
$ echo $var

More complex example:

sort -nr $1 | head -${2:-10}

A typical use is to check whether arguments were passed to the script and, if not, to fall back to default values:

#!/bin/bash 
# the defaults are left unquoted so that the tilde is expanded
export FROM=${1:-~root/.profile}
export TO=${2:-~my/.profile}
cp -p "$FROM" "$TO"

A further variation sets the variable if it is not defined. This is done with the operator ${var:=bar}.

It works as follows: if $var exists and is not null, return $var. If it doesn't exist or is null, set $var to bar and return bar.

Example:

$ export var=""
$ echo ${var:=one}
one

Results:

$ echo $var
one
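The introduction mentions catching errors that result from variables not being set, but the notes do not show an operator for it. The following sketch uses the standard ${var:?message} expansion for that purpose. The function name require_name is hypothetical.

```shell
#!/bin/bash
# require_name (hypothetical helper): succeed only if the first argument
# is set and non-null.
require_name() {
    # A failed :? expansion makes a non-interactive shell exit, so the check
    # runs in a subshell; its error message is discarded here.
    ( : "${1:?name is required}" ) 2>/dev/null
}

if require_name "fred"; then status_set=ok;   else status_set=error;   fi
if require_name "";     then status_empty=ok; else status_empty=error; fi

echo "set: $status_set, empty: $status_empty"   # set: ok, empty: error
```

Used directly in a script, without the subshell, the same expansion prints the message to standard error and stops the script.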

Pattern Matching

There are two types of pattern matching in shell:

Old ksh-style pattern matching, which uses very idiosyncratic regular expressions with prefix positioning of metasymbols.
Classic, Perl-style pattern matching with the =~ operator, implemented since bash 3.0 (released in 2004) and described in the next section.

Unless you need to modify old scripts it does not make sense to use old ksh-style regex in bash.

Perl-style regular expressions

(partially borrowed from Bash Regular Expressions | Linux Journal)

Since version 3 (released in 2004), bash has implemented extended regular expressions, which are mostly compatible with Perl regular expressions. They are also called POSIX regular expressions because they are defined in IEEE POSIX 1003.2 (which you should read and understand to use their full power). Extended regular expressions are also used in egrep, so they are well known to most system administrators. Note that Perl regular expressions are equivalent to extended regular expressions with a few additional features:

Perl supports noncapturing parentheses, as described in “Noncapturing Parentheses.”
The order of multiple options within parentheses can be important when substrings come into play, as described in “Grouping Operators.”
Perl allows you to include a literal square bracket anywhere within a character class by preceding it with a backslash, as described in “Quoting Special Characters.”
Perl adds a number of additional switches that are equivalent to certain special characters and character classes. These are described in “Character Class Shortcuts.”
Perl supports a broader range of modifiers. These are described in “Using Modifiers.”

Predefined Character Classes

Extended regular expressions support a set of predefined character classes. When used between brackets, these define commonly used sets of characters. The POSIX character classes implemented in extended regular expressions include:

[:alnum:] - all alphanumeric characters (a-z, A-Z, and 0-9).
[:alpha:] - all alphabetic characters (a-z, A-Z).
[:blank:] - all whitespace within a line (spaces or tabs).
[:cntrl:] - all control characters (ASCII 0-31 and 127).
[:digit:] - all digits (0-9).
[:graph:] - all alphanumeric or punctuation characters.
[:lower:] - all lowercase letters (a-z).
[:print:] - all printable characters (the opposite of [:cntrl:]; the same as [:graph:] plus the space character).
[:punct:] - all punctuation characters.
[:space:] - all whitespace characters (space, tab, newline, carriage return, form feed, and vertical tab).
[:upper:] - all uppercase letters (A-Z).
[:xdigit:] - all hexadecimal digits (0-9, a-f, A-F).

Quantifiers and grouping operators are by and large the same as in Perl:

Extended regex	Perl regex
`a+`	`a+`
`a?`	`a?`
`a|b`	`a|b`
`(expression1)`	`(expression1)`
`{m,n}`	`{m,n}`
`{,n}`	`{,n}`
`{m,}`	`{m,}`
`{m}`	`{m}`

The =~ operator returns 0 (success) if the regular expression matches the string; otherwise it returns 1 (failure).

In addition to simple matching, bash regular expressions support sub-patterns surrounded by parentheses for capturing parts of the match. The matches are assigned to the array variable BASH_REMATCH. The entire match is assigned to BASH_REMATCH[0], the first sub-pattern to BASH_REMATCH[1], and so on.

The following example script takes a regular expression as its first argument and one or more strings to match against. It then cycles through the strings and outputs the results of the match process:

#!/bin/bash

if [[ $# -lt 2 ]]; then
    echo "Usage: $0 PATTERN STRINGS..."
    exit 1
fi
regex=$1
shift
echo "regex: $regex"
echo

while [[ $1 ]]
do
    if [[ $1 =~ $regex ]]; then
        echo "$1 matches"
        i=1
        n=${#BASH_REMATCH[*]}
        while [[ $i -lt $n ]]
        do
            echo "  capture[$i]: ${BASH_REMATCH[$i]}"
            let i++
        done
    else
        echo "$1 does not match"
    fi
    shift
done

Assuming the script is saved in "bashre.sh", the following sample shows its output:

  # bash bashre.sh 'aa(b{2,3}[xyz])cc' aabbxcc aabbcc
  regex: aa(b{2,3}[xyz])cc

  aabbxcc matches
    capture[1]: bbx
  aabbcc does not match
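Captures are most useful for pulling fields out of a value in one step. The following is a minimal sketch. The date string and the variable names are invented for illustration.

```shell
#!/bin/bash
stamp="2021-04-30"                        # sample input, chosen for illustration
re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'   # three capture groups: year, month, day

if [[ $stamp =~ $re ]]; then
    year=${BASH_REMATCH[1]}
    month=${BASH_REMATCH[2]}
    day=${BASH_REMATCH[3]}
    echo "year=$year month=$month day=$day"
else
    echo "no match" >&2
fi
```

Storing the pattern in a variable and expanding it unquoted on the right-hand side of =~ sidesteps quoting problems; a quoted pattern would be matched as a literal string.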

KSH Pattern-matching Operators

Pattern-matching operators were introduced in ksh88 in a very idiosyncratic way. The notation differs from that used by Perl or by utilities such as grep. That's a shame, but that's how it is. Life is not perfect. The operators are hard to remember, but there is a handy mnemonic: # matches the front because number signs precede numbers; % matches the rear because percent signs follow numbers.

There are two kinds of pattern matching available: matching from the left and matching from the right.

The operators, with their functions and an example, are shown in the following table:

Operator	Meaning	Example
`${var#t*is}`	Deletes the shortest possible match from the left: If the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest.	`export var="this is a test"` `echo ${var#t*is}` `is a test`
`${var##t*is}`	Deletes the longest possible match from the left: If the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest.	`export var="this is a test"` `echo ${var##t*is}` `a test`
`${var%t*st}`	Deletes the shortest possible match from the right: If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.	`export var="this is a test"` `echo ${var%t*st}` `this is a`
`${var%%t*st}`	Deletes the longest possible match from the right: If the pattern matches the end of the variable's value, delete the longest part that matches and return the rest.	`export var="this is a test"` `echo ${var%%t*st}` (prints an empty line)

While the # and % identifiers may not seem obvious, they have another convenient mnemonic. The # key is to the left of the $ key on a US keyboard and operates from the left. The % key is to the right of the $ key and operates from the right.

These operators can be used to do a variety of things. For example, the following script changes the extension of all .html files to .htm.

#!/bin/bash
# quickly convert html filenames for use on a DOS-like system
# only handles file extensions, not filenames

for i in *.html; do
  if [ -f "${i%l}" ]; then
    echo "${i%l} already exists"
  else
    mv "$i" "${i%l}"
  fi
done

The classic use for pattern-matching operators is stripping off components of pathnames, such as directory prefixes and filename suffixes. With that in mind, here is an example that shows how all of the operators work. Assume that the variable path has the value /home/billr/mem/long.file.name; then:

Expression         	  Result
${path##/*/}                       long.file.name
${path#/*/}              billr/mem/long.file.name
$path              /home/billr/mem/long.file.name
${path%.*}         /home/billr/mem/long.file
${path%%.*}        /home/billr/mem/long
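The same operators can stand in for the basename and dirname utilities. This is a sketch using the path from the table; the variable names dir, file, ext and stem are illustrative.

```shell
#!/bin/bash
path=/home/billr/mem/long.file.name

dir=${path%/*}      # like dirname:  /home/billr/mem
file=${path##*/}    # like basename: long.file.name
ext=${file##*.}     # text after the last dot: name
stem=${file%.*}     # file name without the last extension: long.file

printf '%s\n' "$dir" "$file" "$ext" "$stem"
```

Taking the extension from $file rather than from $path avoids picking up a dot that belongs to a directory name.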

Operator #: ${var#t*is} deletes the shortest possible match from the left

Example:

$ export var="this is a test"

$ echo ${var#t*is}

is a test

Operator ##: ${var##t*is} deletes the longest possible match from the left

Example:

$ export var="this is a test"

$ echo ${var##t*is}

a test

Operator %: ${var%t*st} Function: deletes the shortest possible match from the right

Example:

$ export var="this is a test" 
$ echo ${var%t*st} 
this is a

The same operator removes the trailing l in the loop that renames .html files to .htm:

for i in *.html; do 
   if [ -f "${i%l}" ]; then  
      echo "${i%l} already exists" 
   else  
      mv "$i" "${i%l}" 
   fi  
done

Operator %%: ${var%%t*st} deletes the longest possible match from the right

Example:

$ export var="this is a test" 
$ echo ${var%%t*st}

Ksh-style regular expressions

A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain.

Operator	Meaning
`*(exp)`	0 or more occurrences of `exp`
`+(exp)`	1 or more occurrences of `exp`
`?(exp)`	0 or 1 occurrences of `exp`
`@(exp1|exp2|...)`	`exp1` or `exp2` or...
`!(exp)`	Anything that doesn't match `exp`

Expression	Matches
`x`	`x`
`*(x)`	Null string, `x`, `xx`, `xxx`, ...
`+(x)`	`x`, `xx`, `xxx`, ...
`?(x)`	Null string, `x`
`!(x)`	Any string except `x`
`@(x)`	`x` (see below)

The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren't familiar with these, skip to the section entitled "Pattern-matching Operators."

shell basic regex vs awk/egrep regular expressions

Shell	egrep/awk	Meaning
`*(exp)`	`exp*`	0 or more occurrences of `exp`
`+(exp)`	`exp+`	1 or more occurrences of `exp`
`?(exp)`	`exp?`	0 or 1 occurrences of `exp`
`@(exp1|exp2|...)`	`exp1|exp2|...`	`exp1` or `exp2` or...
`!(exp)`	(none)	Anything that doesn't match `exp`

These equivalents are close but not quite exact. Actually, an exp within any of the Korn shell operators can be a series of exp1|exp2|... alternates. But because the shell would interpret an expression like dave|fred|bob as a pipeline of commands, you must use @(dave|fred|bob) for alternates.

For example:

@(dave|fred|bob) matches dave, fred, or bob.
*(dave|fred|bob) means "0 or more occurrences of dave, fred, or bob". This expression matches strings like the null string, dave, davedave, fred, bobfred, bobbobdavefredbobfred, etc.
+(dave|fred|bob) matches any of the above except the null string.
?(dave|fred|bob) matches the null string, dave, fred, or bob.
!(dave|fred|bob) matches anything except dave, fred, or bob.

It is worth re-emphasizing that shell regular expressions can still contain standard shell wildcards. Thus, the shell wildcard ? (match any single character) is the equivalent of . in egrep or awk, and the shell's character set operator [...] is the same as in those utilities. For example, the expression +([0-9]) matches a number, i.e., one or more digits. The shell wildcard character * is equivalent to the shell regular expression *(?).
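In bash these ksh-style operators are available in case patterns once the extglob shell option is turned on. The following sketch applies the expressions above. The function name classify and its labels are hypothetical.

```shell
#!/bin/bash
shopt -s extglob    # enable ksh-style extended patterns in bash

# classify (hypothetical helper): label a string using extended patterns.
classify() {
    case $1 in
        +([0-9]))          echo number ;;   # one or more digits
        @(dave|fred|bob))  echo name ;;     # exactly one of the alternates
        +(dave|fred|bob))  echo names ;;    # two or more alternates in a row
        *)                 echo other ;;
    esac
}

classify 12345      # number
classify fred       # name
classify bobfred    # names
classify x9         # other
```

The shopt line has to run before the function is defined, because bash needs the option enabled when it parses the patterns.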

A few egrep and awk regexp operators do not have equivalents in the Korn shell. These include: