Bash String Processing (Compared with Java) - 21. Regular Expression Matching on Strings

codingstandards


In Java

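Regular expression search

The String.matches method

boolean matches(String regex)

Tells whether or not this string matches the given regular expression. The whole string must match, not just a part of it.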

String str = "123456";
String re = "\\d+";
if (str.matches(re)) {
    // do something
}

The Pattern and Matcher classes

String str = "abc efg ABC";
String re = "a|f"; // matches a or f
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(str);
boolean rs = m.find();

If str contains a match for re, rs is true; otherwise it is false. To ignore case when searching, write Pattern p = Pattern.compile(re, Pattern.CASE_INSENSITIVE);

Regular expression extraction

String re = ".+\\\\(.+)$"; // \\\\ is one literal backslash in the regex
String str = "c:\\dir1\\dir2\\name.txt";
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(str);
boolean rs = m.find();
for (int i = 1; i <= m.groupCount(); i++) {
    System.out.println(m.group(i));
}

The code above prints name.txt. The extracted strings are available through m.group(i), where i runs from 1 to m.groupCount().

Regular expression splitting

String re = "::";
Pattern p = Pattern.compile(re);
String[] r = p.split("xd::abc::cde");

After this runs, r is {"xd","abc","cde"}. There is an even simpler way to split:

String str="xd::abc::cde";
String[] r = str.split("::");

Regular expression replacement (deletion)

String re = "a+"; // one or more a
Pattern p = Pattern.compile(re);
Matcher m = p.matcher("aaabbced a ccdeaa");
String s = m.replaceAll("A");

The result is "Abbced A ccdeA".

If the replacement is the empty string, the matched text is deleted, for example:

String re = "a+"; // one or more a
Pattern p = Pattern.compile(re);
Matcher m = p.matcher("aaabbced a ccdeaa");
String s = m.replaceAll("");

The result is "bbced  ccde" (the two spaces around the removed "a" both remain).

String.replaceAll and String.replaceFirst are simple ways to perform regular expression replacement (or deletion). String.replace, however, does not use regular expressions.

From the JavaDoc for class String:

String replace(char oldChar, char newChar)
    Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
String replace(CharSequence target, CharSequence replacement)
    Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
String replaceAll(String regex, String replacement)
    Replaces each substring of this string that matches the given regular expression with the given replacement.
String replaceFirst(String regex, String replacement)
    Replaces the first substring of this string that matches the given regular expression with the given replacement.

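Common regular expression metacharacters in Java

. matches any character (by default, except line terminators)
? the preceding element occurs 0 or 1 times
+ the preceding element occurs 1 or more times
* the preceding element occurs 0 or more times
{n} the preceding element occurs exactly n times
{n,} the preceding element occurs n or more times
{n,m} the preceding element occurs between n and m times
\d same as [0-9], a digit
\D same as [^0-9], a non-digit
\s same as [ \t\n\x0B\f\r], a whitespace character
\S same as [^\s], a non-whitespace character
\w same as [a-zA-Z_0-9], a word character (letter, digit, or underscore)
\W same as [^\w], a non-word character
^ the beginning of the input (the beginning of each line in MULTILINE mode)
$ the end of the input (the end of each line in MULTILINE mode)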

In Bash

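For an overview of regular expressions on Linux, see http://codingstandards.iteye.com/blog/1195592

Bash support for regular expressions

Bash has had built-in support for regular expression matching since version 3; the operator is =~.

[[ "$STR" =~ $REGEX ]]

Note: leave $REGEX unquoted. Since Bash 3.2, a quoted right-hand side is matched as a literal string rather than as a regular expression.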
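The effect of quoting the right-hand side of =~ can be seen in a minimal sketch like the one below. It assumes Bash 3.2 or later with default compatibility settings; the sample values of STR and REGEX are illustrative and not from the original article.

```shell
#!/bin/bash
# Hypothetical sample values chosen for this illustration.
STR="123456"
REGEX='^[0-9]+$'

# Unquoted right-hand side: $REGEX is treated as an extended regular expression.
if [[ "$STR" =~ $REGEX ]]; then
    unquoted="matched"
else
    unquoted="not matched"
fi

# Quoted right-hand side: in Bash 3.2 and later the text is matched literally,
# so this only succeeds if STR contains the characters ^[0-9]+$ themselves.
if [[ "$STR" =~ "$REGEX" ]]; then
    quoted="matched"
else
    quoted="not matched"
fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Storing the pattern in a variable and expanding it unquoted also avoids having to escape regex characters for the shell parser.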

From man bash:

[[ expression ]]
       An additional binary operator, =~, is available, with the same precedence as == and !=. When it is
       used, the string to the right of the operator is considered an extended regular expression and matched
       accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise.
       If the regular expression is syntactically incorrect, the conditional expression’s return value is 2.
       If the shell option nocasematch is enabled, the match is performed without regard to the case of
       alphabetic characters. Substrings matched by parenthesized subexpressions within the regular expression
       are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of
       the string matching the entire regular expression. The element of BASH_REMATCH with index n is the
       portion of the string matching the nth parenthesized subexpression.

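In Bash, the binary operator =~ performs extended regular expression matching. The return value is 0 if the string matches, 1 if it does not, and 2 if the regular expression is syntactically incorrect. Matching is case-sensitive unless the shell option nocasematch is enabled. The text matched by each parenthesized subexpression is saved in the array BASH_REMATCH: ${BASH_REMATCH[0]} is the part of the string matching the whole expression, ${BASH_REMATCH[1]} is the part matching the first subexpression, and so on.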
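As a rough Bash counterpart to the Java extraction and CASE_INSENSITIVE examples, the sketch below uses BASH_REMATCH and the nocasematch option. The path is the one from the Java example; the variable names (whole, name, nocase, re) are chosen here for illustration only.

```shell
#!/bin/bash
# Sample path taken from the Java example; single quotes keep the backslashes intact.
STR='c:\dir1\dir2\name.txt'
# In an extended regular expression, \\ stands for one literal backslash.
REGEX='.+\\(.+)$'

if [[ "$STR" =~ $REGEX ]]; then
    whole=${BASH_REMATCH[0]}   # text matched by the whole expression
    name=${BASH_REMATCH[1]}    # text matched by the first parenthesized group
    echo "whole match: $whole"
    echo "group 1:     $name"
fi

# Case-insensitive matching, comparable to Pattern.CASE_INSENSITIVE in Java.
re='^hello$'
shopt -s nocasematch
if [[ "HELLO" =~ $re ]]; then
    nocase="matched"
else
    nocase="not matched"
fi
shopt -u nocasematch   # restore the default case-sensitive behavior
echo "nocasematch: $nocase"
```

Because .+ is greedy, it consumes everything up to the last backslash, so the first group holds only the file name.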

The following script, from http://www.linuxjournal.com/content/bash-regular-expressions, nicely demonstrates the built-in regular expression matching introduced in Bash 3.0.

#!/bin/bash

if [[ $# -lt 2 ]]; then
    echo "Usage: $0 PATTERN STRINGS..."
    exit 1
fi
regex=$1
shift
echo "regex: $regex"
echo

while [[ $1 ]]
do
    if [[ $1 =~ $regex ]]; then
        echo "$1 matches"
        i=1
        n=${#BASH_REMATCH[*]}
        while [[ $i -lt $n ]]
        do
            echo "  capture[$i]: ${BASH_REMATCH[$i]}"
            let i++
        done
    else
        echo "$1 does not match"
    fi
    shift
done

[root@jfht ~]# ./bashre.sh 'aa(b{2,3}[xyz])cc' aabbxcc aabbcc
regex: aa(b{2,3}[xyz])cc

aabbxcc matches
  capture[1]: bbx
aabbcc does not match
[root@jfht ~]#

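Regular expression matching with the grep/egrep commands

Using Basic RE

Form 1: echo "$STR" | grep -q "$REGEX"

Form 2: grep -q "$REGEX" <<<"$STR"

Using Extended RE

Form 3: echo "$STR" | egrep -q "$REGEX"

Form 4: egrep -q "$REGEX" <<<"$STR"

Note: the -q option suppresses the output of grep/egrep so that the match can be tested through the exit status; an exit status of 0 means a match was found.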
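The exit-status test described above can be wrapped in a small function. This is only a sketch: matches_ere is a hypothetical helper name, and grep -E is used in place of egrep (the two are equivalent according to the grep manual).

```shell
#!/bin/bash
# matches_ere is a hypothetical helper, not part of grep or Bash.
# It returns 0 when the string contains a match for the extended regular expression.
matches_ere() {
    local str=$1 regex=$2
    # -E selects extended regular expressions (same as egrep),
    # -q suppresses output, -e protects patterns that begin with "-".
    printf '%s\n' "$str" | grep -Eq -e "$regex"
}

if matches_ere "abc efg ABC" "a|f"; then
    echo "matched"
else
    echo "not matched"
fi
```

Using printf instead of echo avoids surprises when the string itself starts with a dash or contains backslashes.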

From man grep:

Egrep is the same as grep -E.

       -E, --extended-regexp
              Interpret PATTERN as an extended regular expression (see below).

       -e PATTERN, --regexp=PATTERN
              Use PATTERN as the pattern; useful to protect patterns beginning with -.

       -q, --quiet, --silent
              Quiet; do not write anything to standard output. Exit immediately with zero status if any match is
              found, even if an error was detected. Also see the -s or --no-messages option.

Matching a mobile phone number. The pattern is 1[3458][0-9]{9} as an extended RE, or 1[3458][0-9]\{9\} as a basic RE.

[root@jfht ~]# echo "13012345678" | egrep '1[3458][0-9]{9}'
13012345678
[root@jfht ~]# echo "13012345678" | grep '1[3458][0-9]{9}'
[root@jfht ~]# echo "13012345678" | grep '1[3458][0-9]\{9\}'
13012345678
[root@jfht ~]#

STR="13024184301"
REGEX="1[3458][0-9]{9}"
if echo "$STR" | egrep -q "$REGEX"; then
    echo "matched"
else
    echo "not matched"
fi

[root@jfht ~]# STR="13024184301"
[root@jfht ~]# REGEX="1[3458][0-9]{9}"
[root@jfht ~]# if echo "$STR" | egrep -q "$REGEX"; then
> echo "matched"
> else
> echo "not matched"
> fi
matched
[root@jfht ~]#
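The pattern above is not anchored, so it also succeeds when the phone number is only part of a longer string. If the whole string has to be a phone number, one option is to add ^ and $, as in the following sketch (is_mobile is a hypothetical helper name, and the extra test strings are made up for illustration):

```shell
#!/bin/bash
# Same pattern as above, with ^ and $ added so the whole line must match.
REGEX='^1[3458][0-9]{9}$'

# is_mobile is a hypothetical helper used only for this illustration.
is_mobile() {
    printf '%s\n' "$1" | grep -Eq -e "$REGEX"
}

for s in 13024184301 130241843012 a13024184301; do
    if is_mobile "$s"; then
        echo "$s: matched"
    else
        echo "$s: not matched"
    fi
done
```

Only the first string passes; the second has one digit too many and the third has a leading letter.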

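Regular expression matching with expr match

expr match "$STR" "$REGEX"

expr "$STR" : "$REGEX"

Both forms print the number of characters matched by the regular expression (a basic RE). The match is anchored at the beginning of the string. The match keyword is a GNU form; the colon form is the portable one.

From man expr: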

       STRING : REGEXP
              anchored pattern match of REGEXP in STRING

       match STRING REGEXP
              same as STRING : REGEXP

[root@jfht ~]# STR=Hello
[root@jfht ~]# REGEX=He
[root@jfht ~]# expr "$STR" : "$REGEX"
2

[root@jfht ~]# REGEX=".*[aeiou]"
[root@jfht ~]# expr "$STR" : "$REGEX"
5

Note: matching is greedy!

[root@jfht ~]# REGEX=ll
[root@jfht ~]# expr "$STR" : "$REGEX"
0
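The result 0 for "ll" follows from the implicit anchoring: expr only looks for a match that starts at the first character. A small sketch of the difference, using the same STR value (the variable names len_plain and len_prefixed are illustrative):

```shell
#!/bin/bash
STR=Hello

# The pattern is implicitly anchored at the start, so "ll" alone matches nothing.
# expr exits with status 1 when the result is 0, hence the "|| true".
len_plain=$(expr "$STR" : 'll' || true)

# Prefixing .* lets the match start at the first character and run through "ll".
len_prefixed=$(expr "$STR" : '.*ll' || true)

echo "ll   -> $len_plain"
echo ".*ll -> $len_prefixed"
```

The second call reports 4 because the match covers "Hell", counted from the start of the string.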

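In addition, expr match can extract a substring according to a regular expression.

expr match "$STR" ".*\($SUB\).*"

expr "$STR" : ".*\($SUB\).*"

Note that, unlike the forms above, the result is the substring captured by \( \), not the length of the match.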

[root@jfht ~]# STR="某某是2009年进公司的"

The string means "So-and-so joined the company in 2009". The goal is to extract the number from it; the attempts follow.
[root@jfht ~]# SUB="[0-9]+"
[root@jfht ~]# expr "$STR" : ".*\($SUB\).*"

[root@jfht ~]# SUB="[0-9]\+"
[root@jfht ~]# expr "$STR" : ".*\($SUB\).*"
9
[root@jfht ~]# SUB="[0-9]*"
[root@jfht ~]# expr "$STR" : ".*\($SUB\).*"

[root@jfht ~]# SUB="[0-9]\*"
[root@jfht ~]# expr "$STR" : ".*\($SUB\).*"