Java Regular Expressions

Harold Gao

Welcome to my blog: https://thenorthsea.cn/

Link to this article: https://blog.csdn.net/weixin_40255793/article/details/81158267

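(Note: every pattern below is shown in its reading form. Inside a Java string literal each \ must itself be escaped, so \d is written "\\d". A literal backslash is \\ in reading form and "\\\\" in a string literal.)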

Writing Regular Expressions

Metacharacters

([{\^-=$!|]})?*+.

There are two ways to make a metacharacter be treated as an ordinary character:

\<metacharacter>
\Q<metacharacter>\E
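As a minimal sketch of the two quoting styles, the following compares an unescaped pattern with its escaped forms. The class name QuoteDemo and the helper fullMatch are made up for this illustration; Pattern.quote is the standard library method that produces a quoted literal pattern for an arbitrary string.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Illustrative helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped: '.' matches any character, so "1x5" matches as well.
        System.out.println(fullMatch("1.5", "1x5"));                // true
        // Backslash escape: the string literal "1\\.5" is the regex 1\.5
        System.out.println(fullMatch("1\\.5", "1x5"));              // false
        System.out.println(fullMatch("1\\.5", "1.5"));              // true
        // \Q...\E quotes everything between the two markers.
        System.out.println(fullMatch("\\Q1.5\\E", "1.5"));          // true
        // Pattern.quote quotes an arbitrary literal string.
        System.out.println(fullMatch(Pattern.quote("a+b"), "a+b")); // true
        System.out.println(fullMatch("a+b", "a+b"));                // false
    }
}
```

Pattern.quote is usually the safer choice when the literal text comes from user input, since no metacharacter can be overlooked.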

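Metacharacter	Description
`()`	Marks the start and end of a subexpression (group)
`[]`	Marks the start and end of a bracket expression (character class)
`\`	Escapes the next character
`^`	Matches the beginning of the input (of each line in MULTILINE mode); negates a bracket expression when it appears first inside it
`$`	Matches the end of the input (of each line in MULTILINE mode)
`-`	Denotes a range between the characters on either side, inside a bracket expression
`=`	Used in lookaround constructs such as `(?=X)` and `(?<=X)`
`!`	Used in negative lookaround constructs such as `(?!X)` and `(?<!X)`
`|`	Alternation: matches either of two alternatives
`?`	The preceding subexpression may occur 0 or 1 times
`*`	The preceding subexpression may occur 0 or more times
`+`	The preceding subexpression may occur 1 or more times
`.`	Matches any character except line terminators such as `\n` (unless DOTALL mode is enabled)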

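Escape sequences:

Sequence	Description
`\0mnn`	The character with octal value 0mnn
`\xhh`	The character with hexadecimal value 0xhh
`\uhhhh`	The Unicode character with hexadecimal value 0xhhhh

\uhhhh names the Unicode character to match. A code point can also be written as \x{h...h}, which can be built in Java with "\\x{" + Integer.toHexString(codePoint) + "}". For example, \u6771 matches the single Chinese character “東”.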

ASCII characters can be written as follows:

Hexadecimal	Octal	Meaning
`\x20`	`\040`	Space
`\x23`	`\043`	#
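The escape forms can be checked with a short program. This is only a sketch: EscapeDemo and fullMatch are illustrative names, and the input character 東 is written as the Java source escape "\u6771" so that the example does not depend on the file encoding.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Illustrative helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Space is decimal 32: hexadecimal 20, octal 40.
        System.out.println(fullMatch("\\x20", " "));  // true
        System.out.println(fullMatch("\\040", " "));  // true
        // '#' is decimal 35: hexadecimal 23, octal 43.
        System.out.println(fullMatch("\\x23", "#"));  // true
        System.out.println(fullMatch("\\043", "#"));  // true
        // Here the regex engine, not the Java compiler, interprets \u6771.
        System.out.println(fullMatch("\\u6771", "\u6771")); // true
        // The \x{h...h} form built from a code point.
        int codePoint = "\u6771".codePointAt(0);
        String regex = "\\x{" + Integer.toHexString(codePoint) + "}";
        System.out.println(regex);                       // \x{6771}
        System.out.println(fullMatch(regex, "\u6771"));  // true
    }
}
```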

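Matching a Single Character

Predefined Character Classes

Pattern	Characters matched
`.`	Any character
`\d`	Digit: `[0-9]`
`\D`	Non-digit: `[^0-9]`
`\s`	Whitespace: `[ \t\n\x0B\f\r]`, that is space, tab `\t`, newline `\n`, vertical tab `\x0B`, form feed `\f`, and carriage return `\r`
`\S`	Non-whitespace: `[^\s]`
`\w`	Word character: `[a-zA-Z_0-9]`
`\W`	Non-word character: `[^\w]`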

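Bracket Expressions

A single character can be matched against a set written as [<set>], for example [a-z&&[^bc]], where:

- denotes a range
^ marks the set as negated
&& takes the intersection
[a-z[A-Z]] takes the union
&& combined with [^<set to subtract>] gives the effect of subtraction.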
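The set operations can be tried out one character at a time. In this sketch the class name CharClassDemo and the helper fullMatch are illustrative only; each pattern is a single bracket expression, so the input must be exactly one character.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Illustrative helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Subtraction: a through z, except b and c.
        System.out.println(fullMatch("[a-z&&[^bc]]", "a")); // true
        System.out.println(fullMatch("[a-z&&[^bc]]", "b")); // false
        // Union: a through d, or m through p.
        System.out.println(fullMatch("[a-d[m-p]]", "n"));   // true
        System.out.println(fullMatch("[a-d[m-p]]", "f"));   // false
        // Intersection: only d, e, or f.
        System.out.println(fullMatch("[a-z&&[def]]", "e")); // true
        System.out.println(fullMatch("[a-z&&[def]]", "a")); // false
        // Negation: anything except a, b, or c.
        System.out.println(fullMatch("[^abc]", "d"));       // true
    }
}
```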

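Quantifiers

Quantifiers come in three modes:

Greedy (default): keeps matching for as long as it can, and gives characters back only when the rest of the expression fails
Reluctant: written with the suffix ?, matches as few characters as possible
Possessive: written with the suffix +, matches as many characters as possible and never gives any back

Greedy (default)	Reluctant	Possessive	Meaning
X?	X??	X?+	0 or 1 times
X*	X*?	X*+	0 or more times
X+	X+?	X++	1 or more times
X{n}	X{n}?	X{n}+	Exactly n times
X{n,}	X{n,}?	X{n,}+	n or more times
X{n,m}	X{n,m}?	X{n,m}+	Between n and m times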

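Zero-Length Matches

The ? and * quantifiers in the first two rows permit zero-length matches. Such a match is recognizable because its start and end indices are equal. After a zero-length match the end index equals the start index (last == first in Matcher's fields); if the next search started from that same position it would loop forever, so Matcher's find method includes this check: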

int nextSearchIndex = last;
if (nextSearchIndex == first)
    nextSearchIndex++;

Zero-length match examples:

Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
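The last transcript can be reproduced without the interactive harness. The following is a small sketch; ZeroLengthDemo and findAll are names chosen for this example, and each match is recorded as text@start-end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
    // Illustrative helper: collects every match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Zero-length matches appear as entries with an empty text part.
        System.out.println(findAll("a*", "ababaaaab"));
        // [a@0-1, @1-1, a@2-3, @3-3, aaaa@4-8, @8-8, @9-9]
        System.out.println(findAll("a?", "a"));
        // [a@0-1, @1-1]
    }
}
```

The loop terminates because find advances the search position by one after each zero-length match.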

Besides applying to the single character on its left, a quantifier can also be attached to a character class or to a group.
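To make the scope of a quantifier concrete, this sketch counts matches for a quantifier applied to a single character, a group, and a character class. QuantifierScopeDemo and countMatches are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierScopeDemo {
    // Illustrative helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String dogs = "dogdogdogdogdogdog";
        // {3} applies only to 'g', so the pattern needs the text "doggg".
        System.out.println(countMatches("dog{3}", dogs));   // 0
        // {3} applies to the whole group: "dogdogdog", found twice.
        System.out.println(countMatches("(dog){3}", dogs)); // 2
        // {3} applies to the class: any three of a, b, c in a row.
        System.out.println(countMatches("[abc]{3}", "abccabaaaccbbbc")); // 5
    }
}
```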

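Differences Between Greedy, Reluctant, and Possessive Quantifiers

A greedy quantifier first consumes the longest string it can match, and then the engine checks whether the whole expression matches. If it does not, the quantifier gives back its last character and the check is repeated, down to the minimum the quantifier allows. For example, when .*foo is matched against the input xfooxxxxxxfoo, .* first consumes the entire string; the match fails at that point because nothing is left for the foo that follows .*. The greedy quantifier therefore backs off one character at a time until .* holds only xfooxxxxxx, at which point the match succeeds.

A reluctant quantifier first consumes the shortest string it can match, and then the engine checks whether the whole expression matches. If it does not, the quantifier reluctantly consumes one more character and the check is repeated, up to the maximum the quantifier allows. For example, when .*?foo is matched against xfooxxxxxxfoo, .*? first consumes nothing; the match fails because the foo after .*? does not match at that position. The quantifier then consumes characters one at a time, and once it has consumed x the match succeeds. Matching then resumes from the end of that match.

A possessive quantifier first consumes the longest string it can match, and then the engine checks whether the whole expression matches. If it does not, the quantifier cannot give characters back, and the attempt fails immediately. For example, when .*+foo is matched against xfooxxxxxxfoo, .*+ first consumes the entire string; nothing is left for foo, and because a possessive quantifier never releases what it has consumed, the result is no match.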

Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

More possessive quantifier examples:

Enter your regex: a*+foo
Enter input string to search: bbaaafoocc
I found the text "aaafoo" starting at index 2 and ending at index 8.

Enter your regex: a*+afoo
Enter input string to search: bbaaafoocc
No match found.
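The transcripts above can also be checked programmatically. In this sketch, QuantifierModeDemo and matchedTexts are hypothetical names; the helper simply returns the text of every match that find reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModeDemo {
    // Illustrative helper: returns the text of every match, in order.
    static List<String> matchedTexts(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: one match spanning the whole input.
        System.out.println(matchedTexts(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: stops at the first foo, then continues.
        System.out.println(matchedTexts(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ keeps the whole input, so foo can never match.
        System.out.println(matchedTexts(".*+foo", input)); // []
        // Possessive a*+ takes every 'a', leaving none for the literal 'a'.
        System.out.println(matchedTexts("a*+foo", "bbaaafoocc"));  // [aaafoo]
        System.out.println(matchedTexts("a*+afoo", "bbaaafoocc")); // []
    }
}
```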

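Capturing Groups: Parenthesized Expressions

A group treats several characters as a single unit. Groups are created by placing characters inside parentheses; the portion of the input that a group matches is saved in memory and can be retrieved later.

Group numbers are assigned by counting opening parentheses from left to right. For example, the expression ((A)(B(C))) has 4 groups:

((A)(B(C)))
(A)
(B(C))
(C)

The groupCount method of Matcher returns the number of groups, which is 4 for the example above. There is also a special group, group 0, which stands for the whole expression and is not included in groupCount. Groups that begin with (?, such as (?:X), are non-capturing groups: they are not captured and do not count toward the total. Named groups of the form (?<name>X) are the exception and are capturing.

Group numbers can be used in later operations. For example, Matcher has these instance methods:

public int start(int group): start index of the given group
public int end(int group): end index of the given group
public String group(int group): text captured by the given group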
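A short sketch of group numbering follows. GroupDemo and groups are names invented for this example; the helper returns groups 0 through groupCount() when the whole input matches, and an empty list otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Illustrative helper: groups 0..groupCount() of a full match.
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 0 is the whole match; 1 to 4 follow the opening parentheses.
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, ABC, A, BC, C]
        // (?:A) is non-capturing, so (B) becomes group 1.
        System.out.println(groups("(?:A)(B)", "AB"));     // [AB, B]

        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (m.matches()) {
            System.out.println(m.groupCount());             // 4
            System.out.println(m.start(3) + "-" + m.end(3)); // 1-3
        }
    }
}
```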

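A backreference refers, later in the same expression, to the text captured by an earlier group, and is written \<group number>. In the example below, (\d\d) matches two digits and forms a capturing group, and \1 requires those same two digits to appear again.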

Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.

Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.
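The same checks in code, plus one common use of backreferences: spotting an accidentally doubled word. This is a sketch; BackreferenceDemo and hasRepeatedWord are hypothetical names, and the doubled-word pattern is one possible formulation rather than a canonical one.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Illustrative helper: true if some word is immediately repeated.
    static boolean hasRepeatedWord(String text) {
        // (\w+) captures a word; \1 demands the same text again.
        return Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1212")); // true
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1234")); // false
        System.out.println(hasRepeatedWord("this is is a test"));   // true
        System.out.println(hasRepeatedWord("this is a test"));      // false
    }
}
```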

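Boundary Matchers

The matches described so far can occur anywhere in the input string: at the start, at the end, or in the middle. Boundary matchers pin down more precisely where a match may occur.

Boundary matcher	Description
`^`	Beginning of a line
`$`	End of a line
`\b`	Word boundary
`\B`	Non-word boundary
`\A`	Beginning of the input
`\G`	End of the previous match
`\Z`	End of the input, except for the final line terminator, if any
`\z`	End of the input

For example, \bdog\b matches dog as a whole word, \bdog\B matches dog as the prefix of a longer word, and \Bdog\b matches dog as the suffix of a longer word.

In the example below, the second dog is not matched because it does not start where the previous match ended.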

Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
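A few of the boundary matchers can be compared side by side by counting matches. BoundaryDemo and countMatches are illustrative names used only in this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Illustrative helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Whole word versus word prefix.
        System.out.println(countMatches("\\bdog\\b", "The dog plays"));    // 1
        System.out.println(countMatches("\\bdog\\b", "The doggie plays")); // 0
        System.out.println(countMatches("\\bdog\\B", "The doggie plays")); // 1
        // \G only allows a match that starts where the previous one ended.
        System.out.println(countMatches("dog", "dog dog"));                // 2
        System.out.println(countMatches("\\Gdog", "dog dog"));             // 1
        // \Z tolerates a final line terminator, \z does not.
        System.out.println(countMatches("dog\\Z", "dog\n"));               // 1
        System.out.println(countMatches("dog\\z", "dog\n"));               // 0
    }
}
```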

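Using Regular Expressions

Compile the regular expression into a Pattern object, which is the in-memory form of the compiled expression.
Create a Matcher object from the Pattern; the state involved in performing a match is kept in the Matcher.

Pattern is immutable and thread-safe. Matcher holds mutable matching state and is not safe for use by several threads at once.

Compiling with the Pattern Class

Obtaining a Pattern Object

The static method Pattern.compile takes the regular expression to match and creates a Pattern object. It can also take a set of flags that enable special matching modes.

Flag	Description
Pattern.CASE_INSENSITIVE	Case-insensitive matching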

Pattern pattern = Pattern.compile("[a-z]$", Pattern.CASE_INSENSITIVE);

Besides passing a flag argument to compile, a flag can also be embedded in the regular expression itself.

Flag	Equivalent embedded expression
Pattern.CASE_INSENSITIVE	(?i)

Enter your regex: (?i)foo
Enter input string to search: FOOfooFoOfoO
I found the text "FOO" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "FoO" starting at index 6 and ending at index 9.
I found the text "foO" starting at index 9 and ending at index 12.
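The two ways of enabling case-insensitive matching should give the same result. This sketch compares them against the plain pattern; FlagDemo and countMatches are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Illustrative helper: counts matches of a compiled pattern.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "FOOfooFoOfoO";
        // No flag: only the exact lowercase "foo" matches.
        System.out.println(countMatches(Pattern.compile("foo"), input)); // 1
        // Flag passed to compile.
        System.out.println(countMatches(
                Pattern.compile("foo", Pattern.CASE_INSENSITIVE), input)); // 4
        // Flag embedded in the expression.
        System.out.println(countMatches(Pattern.compile("(?i)foo"), input)); // 4
    }
}
```

Several flags can be combined with the bitwise OR operator, for example Pattern.CASE_INSENSITIVE | Pattern.MULTILINE.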

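Obtaining a Matcher Object

The most commonly used Pattern method is matcher(input), which returns a Matcher. For a one-off match against the whole input, the static method can be used directly: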

Pattern.matches("\\d","1"); // true

Splitting

split(input): String[]: splits the input string around matches of the regular expression.

Pattern p = Pattern.compile("\\d");
String[] items = p.split("one9two4three7four1five");
for(String s : items) {
    System.out.print(s + " "); // one two three ...
}

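When there is no match, the original string is returned as the only element of the array.

If the first match starts and ends at index 0 (a zero-length match), it is ignored. If the first match starts at index 0 but ends at a later index (a non-zero-length match), the first element of the array is an empty string. This is the behavior since Java 8. Zero-length matches elsewhere in the input are allowed.

split(CharSequence input, int limit): String[]

When limit > 0, the returned array has at most limit elements and the pattern is applied at most limit - 1 times. Once limit - 1 matches have been used, everything from the end index of the last match onward is stored as the final element of the array.
When limit <= 0, the array length is unbounded and the pattern is applied as many times as possible.
When limit == 0, trailing empty strings are also discarded, as this fragment of the implementation shows: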

if (limit == 0)
    while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
        resultSize--;

Example:

```java
String REGEX = "a*";
String input = "abaaef";
Pattern p = Pattern.compile(REGEX);
// A negative limit keeps trailing empty strings.
String[] items = p.split(input, -2);
for (String s : items) {
    System.out.print("\"" + s + "\" "); // "" "" "b" "" "e" "f" ""
}
```
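To see the three limit cases next to each other, here is a small sketch using the input boo:and:foo. SplitLimitDemo and its split helper are illustrative names wrapping Pattern.split.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Illustrative helper: compiles the regex and splits with a limit.
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        // limit > 0: at most 2 elements, the rest stays unsplit.
        System.out.println(Arrays.toString(split(":", input, 2)));  // [boo, and:foo]
        // limit < 0: no cap, trailing empty strings are kept.
        System.out.println(Arrays.toString(split("o", input, -1))); // [b, , :and:f, , ]
        // limit == 0: no cap, trailing empty strings are dropped.
        System.out.println(Arrays.toString(split("o", input, 0)));  // [b, , :and:f]
        // A non-zero-length match at index 0 yields a leading empty string.
        System.out.println(Arrays.toString(split(":", ":a:b", 0))); // [, a, b]
    }
}
```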

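Matching with the Matcher Class

Testing for a match:

matches(): boolean: the strictest; tests whether the entire string matches
lookingAt(): boolean: tests whether a prefix of the string matches, starting at the beginning
find(): boolean: the most lenient; tests whether a match exists anywhere in the string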

Pattern p = Pattern.compile("b");
System.out.println(p.matcher("abaaef").lookingAt()); // false
System.out.println(p.matcher("baaef").lookingAt()); // true

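Extracting the matched substring:

group(): String // returns the substring of the previous match; same as group(0)

Getting indices:

Method	Description
`public int start()`	Start index of the previous match
`public int end()`	End index of the previous match (the offset after its last character)

Resetting:

reset(CharSequence) // discards the previous match state and sets a new input sequence
reset() // discards the previous match state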
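The methods above can be exercised together in one short program. This is a sketch only: MatcherDemo and check are made-up names, and check uses a fresh Matcher for each test so that the three results do not influence one another.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    // Illustrative helper: returns {matches, lookingAt, find} for the input.
    static boolean[] check(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        return new boolean[] {
            p.matcher(input).matches(),
            p.matcher(input).lookingAt(),
            p.matcher(input).find()
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(check("\\d+", "123abc"))); // [false, true, true]
        System.out.println(Arrays.toString(check("\\d+", "abc123"))); // [false, false, true]

        Matcher m = Pattern.compile("\\d+").matcher("123abc456");
        while (m.find()) {
            // Prints "123 0 3" and then "456 6 9".
            System.out.println(m.group() + " " + m.start() + " " + m.end());
        }
        // Reuse the same Matcher on a new input sequence.
        m.reset("7 8");
        System.out.println(m.find() + " " + m.group()); // true 7
    }
}
```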

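A Simpler Route: String Methods

The String class provides several methods for working with regular expressions:

matches(regex): boolean // tests whether the string matches the given regex
replaceAll(regex, replacement): String // replaces every substring that matches regex
replaceFirst(regex, replacement): String // replaces the first substring that matches regex
split(regex): String[] // splits the string into substrings, using regex as the delimiter

In older JDKs such as JDK 8, String's replace method is itself implemented with Pattern and Matcher, compiling the target with the LITERAL flag so that it is not interpreted as a regular expression (later JDK versions may implement it differently):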

// java.lang.String
public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
            this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
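A brief sketch of the String methods in use, including the difference between the literal replace and the regex-based replaceAll. StringRegexDemo and maskDigits are hypothetical names introduced for this example.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Illustrative helper: replaces every run of digits with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d+", "#");
    }

    public static void main(String[] args) {
        String s = "a1b22c333";
        System.out.println("2021-02-08".matches("\\d{4}-\\d{2}-\\d{2}")); // true
        System.out.println(maskDigits(s));                     // a#b#c#
        System.out.println(s.replaceFirst("\\d+", "#"));       // a#b22c333
        System.out.println(Arrays.toString(s.split("\\d+")));  // [a, b, c]
        // replace treats its target literally; replaceAll treats it as a regex.
        System.out.println("a.b.c".replace(".", "-"));         // a-b-c
        System.out.println("a.b.c".replaceAll(".", "-"));      // -----
    }
}
```

These convenience methods compile the pattern on each call, so for a pattern that is applied many times it is generally preferable to compile a Pattern once and reuse it.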

Test Harness

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {

    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern =
            Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher =
            pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}