Regular Expressions

Latest recommended article published on 2023-04-28 15:12:06

hulamua
Category: Algorithm Classification Summary. Tags: regular expressions
1 Concepts

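A regular expression describes a set of strings based on characteristics shared by every string in the set. Regular expressions can be used to search, edit, or manipulate text and data. (This tutorial covers the regular expression syntax supported by the java.util.regex API.)
The java.util.regex package consists mainly of three classes: Pattern, Matcher, and PatternSyntaxException.
- A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructors. To create a pattern, call one of its public static compile methods, which returns a Pattern object. These methods accept a regular expression as the first argument. The first part of this tutorial teaches the required syntax.
- A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like Pattern, Matcher defines no public constructors; you obtain a Matcher by calling the matcher method on a Pattern object.
- A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.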
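
The way the three classes fit together can be sketched as follows. This is a minimal illustration, not part of the original tutorial; the class name RegexBasics and the helper names firstMatch and isValid are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexBasics {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // static factory; Pattern has no public constructor
        Matcher matcher = pattern.matcher(input);  // a Matcher is always obtained from a Pattern
        return matcher.find() ? matcher.group() : null;
    }

    // Returns true if regex compiles, false if it contains a syntax error.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {       // unchecked, so catching it is optional
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("foo", "xfooxxxxxxfoo")); // foo
        System.out.println(isValid("[abc"));                     // false (unclosed character class)
    }
}
```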

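The metacharacters supported by this API are: ([{\^-$|}])?*+.
The characters !, @, and # have no special meaning.
There are two ways to force a metacharacter to be treated as an ordinary character:
1. Precede the metacharacter with a backslash (\).
2. Enclose it between \Q (start of quote) and \E (end of quote). \Q and \E can be placed anywhere in the expression, provided that \Q comes first.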
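
Both escaping techniques can be tried out with a short sketch like the one below. The class and method names are hypothetical; the literal "1+1" is just a convenient string containing the metacharacter +. Note that in Java source code each regex backslash must itself be written as two backslashes.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if input contains the literal text 1+1; the + is escaped with a backslash.
    static boolean containsWithBackslash(String input) {
        return Pattern.compile("1\\+1").matcher(input).find();
    }

    // Same check, with the whole literal quoted between \Q and \E.
    static boolean containsWithQuote(String input) {
        return Pattern.compile("\\Q1+1\\E").matcher(input).find();
    }

    // Unescaped, + is a quantifier: the regex 1+1 means "one or more 1, then another 1".
    static boolean containsUnescaped(String input) {
        return Pattern.compile("1+1").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWithBackslash("1+1=2")); // true
        System.out.println(containsWithQuote("1+1=2"));     // true
        System.out.println(containsUnescaped("1+1=2"));     // false
        System.out.println(containsUnescaped("111"));       // true
    }
}
```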

2 Character Classes

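[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z, or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
Note: The word "class" in "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed in square brackets; it specifies the characters that will successfully match a single character of the input string.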

2.1 Simple Classes

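The most basic form of a character class places a set of characters between a pair of square brackets. For example, the regular expression [bcr]at matches "bat", "cat", or "rat".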

2.2 Ranges

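Sometimes you will want to define a character class that covers a range of values, such as the letters "a through h" or the digits "1 through 5". To specify a range, insert the - metacharacter between the first and last character to be matched, as in [1-5] or [a-h]. You can also place different ranges side by side within one class to widen the set of possible matches, for example [a-zA-Z].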

2.3 Unions

To create a union, nest one character class inside another. For example, [0-4[6-8]] matches the digits 0, 1, 2, 3, 4, 6, 7, and 8.

2.4 Intersections

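To match only the characters common to all nested classes, use &&. For example, [0-9&&[345]] matches only 3, 4, and 5, the characters that both classes share.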

2.5 Subtraction

Subtraction negates one or more nested character classes. For example, [0-9&&[^345]] matches every digit from 0 to 9 except 3, 4, and 5.
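
The character class forms described in this section can be checked with a small sketch such as the following. The class name CharClassDemo and the helper matchesWhole are assumptions made for the illustration; Pattern.matches tests whether the entire input matches the expression.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole input matches regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("[bcr]at", "cat"));     // true  (simple class)
        System.out.println(matchesWhole("[bcr]at", "hat"));     // false
        System.out.println(matchesWhole("[a-h]", "e"));         // true  (range)
        System.out.println(matchesWhole("[0-4[6-8]]", "7"));    // true  (union)
        System.out.println(matchesWhole("[0-4[6-8]]", "5"));    // false
        System.out.println(matchesWhole("[0-9&&[345]]", "4"));  // true  (intersection)
        System.out.println(matchesWhole("[0-9&&[345]]", "6"));  // false
        System.out.println(matchesWhole("[0-9&&[^345]]", "6")); // true  (subtraction)
        System.out.println(matchesWhole("[0-9&&[^345]]", "4")); // false
    }
}
```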

3 Predefined Character Classes

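The Pattern API includes a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions.
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w] (for example, !)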
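
One way to see what each predefined class matches is to replace every match with a marker character. The sketch below does that; the class name PredefinedDemo, the helper mask, and the sample input are all made up for illustration.

```java
import java.util.regex.Pattern;

class PredefinedDemo {
    // Replaces every match of regex in input with '#', making visible what the class matches.
    static String mask(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        String input = "a1 b2!";
        System.out.println(mask("\\d", input)); // a# b#!
        System.out.println(mask("\\D", input)); // #1##2#
        System.out.println(mask("\\s", input)); // a1#b2!
        System.out.println(mask("\\w", input)); // ## ##!
        System.out.println(mask("\\W", input)); // a1#b2#
    }
}
```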

4 Quantifiers

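Greedy, reluctant, and possessive quantifiers specify how many times an expression X must occur for a match.
Quantifiers let you specify the number of occurrences to match. For convenience, the three kinds described in the Pattern API specification are presented together below: greedy (the expression followed by ?, *, or +), reluctant (the greedy form followed by ?), and possessive (the greedy form followed by +). At first glance the quantifiers X?, X??, and X?+ appear to do exactly the same thing, since each matches X once or not at all, but there are subtle differences between them.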

Greedy	Reluctant	Possessive	Meaning
X?	X??	X?+	X, once or not at all
X*	X*?	X*+	X, zero or more times
X+	X+?	X++	X, one or more times
X{n}	X{n}?	X{n}+	X, exactly n times
X{n,}	X{n,}?	X{n,}+	X, at least n times
X{n,m}	X{n,m}?	X{n,m}+	X, at least n but not more than m times

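To illustrate, consider the input string xfooxxxxxxfoo.
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first consumes the entire input string. At that point the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed. The matcher therefore backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search ends.
The second example is reluctant, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, the matcher is forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. The test harness continues the process until the input string is exhausted, and finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case the entire input string is consumed by .*+, leaving nothing to satisfy the "foo" at the end of the expression.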
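
The three transcripts above come from an interactive test harness that is not reproduced here. A rough stand-in, assuming a hypothetical helper findAll that records each match as text@start-end, might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every match of regex in input as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAll(".*foo", input));  // greedy:     [xfooxxxxxxfoo@0-13]
        System.out.println(findAll(".*?foo", input)); // reluctant:  [xfoo@0-4, xxxxxxfoo@4-13]
        System.out.println(findAll(".*+foo", input)); // possessive: []
    }
}
```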

4.1 Zero-Length Matches

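When the input is the empty string, the expressions a? and a* both match successfully, because each allows zero occurrences of the letter. The empty input string has no length, so the test simply matches nothing at index 0. Matches of this kind are called zero-length matches. A zero-length match can occur in several situations: on an empty input string, at the beginning of the input string, after the last character of the input string, or between any two characters of the input string. Zero-length matches are easy to spot because they start and end at the same index.
Zero-length match examples: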

Enter your regex: a?  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  

Enter your regex: a*  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  

Enter your regex: a+  
Enter input string to search: a  
I found the text "a" starting at index 0 and ending at index 1.

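All three quantifiers look for the letter "a", but the first two also find a zero-length match at index 1, that is, after the last character of the input string.
Changing the input string to five "a" characters in a row gives the following results: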

Enter your regex: a?  
Enter input string to search: aaaaa  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "a" starting at index 1 and ending at index 2.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "a" starting at index 3 and ending at index 4.  
I found the text "a" starting at index 4 and ending at index 5.  
I found the text "" starting at index 5 and ending at index 5.  

Enter your regex: a*  
Enter input string to search: aaaaa  
I found the text "aaaaa" starting at index 0 and ending at index 5.  
I found the text "" starting at index 5 and ending at index 5.  

Enter your regex: a+  
Enter input string to search: aaaaa  
I found the text "aaaaa" starting at index 0 and ending at index 5.

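The expression a? finds an individual match for each character, since it matches when "a" appears zero or one times. The expression a* finds two separate matches: all of the letters "a" in the first match, then the zero-length match after the last character at index 5. Finally, a+ matches all occurrences of the letter "a" and ignores the presence of "nothing" at the last index.
When "b" characters interrupt the run of "a" characters, as in the input ababaaaab, the results are: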

Enter your regex: a?  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "" starting at index 3 and ending at index 3.  
I found the text "a" starting at index 4 and ending at index 5.  
I found the text "a" starting at index 5 and ending at index 6.  
I found the text "a" starting at index 6 and ending at index 7.  
I found the text "a" starting at index 7 and ending at index 8.  
I found the text "" starting at index 8 and ending at index 8.  
I found the text "" starting at index 9 and ending at index 9.  

Enter your regex: a*  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "" starting at index 1 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "" starting at index 3 and ending at index 3.  
I found the text "aaaa" starting at index 4 and ending at index 8.  
I found the text "" starting at index 8 and ending at index 8.  
I found the text "" starting at index 9 and ending at index 9.  

Enter your regex: a+  
Enter input string to search: ababaaaab  
I found the text "a" starting at index 0 and ending at index 1.  
I found the text "a" starting at index 2 and ending at index 3.  
I found the text "aaaa" starting at index 4 and ending at index 8.
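
Since a zero-length match starts and ends at the same index, it can be detected by comparing Matcher.start() with Matcher.end(). The following sketch collects those indices; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
    // Returns the indices at which regex produced a zero-length match in input.
    static List<Integer> zeroLengthMatchIndices(String regex, String input) {
        List<Integer> indices = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) { // same start and end: nothing was consumed
                indices.add(matcher.start());
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(zeroLengthMatchIndices("a*", "ababaaaab")); // [1, 3, 8, 9]
        System.out.println(zeroLengthMatchIndices("a+", "ababaaaab")); // []
        System.out.println(zeroLengthMatchIndices("a?", ""));          // [0]
    }
}
```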

5 Boundary Matchers

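Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
To match the expression on a non-word boundary, use \B instead: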

Enter your regex: \bdog\B  
Enter input string to search: The dog plays in the yard.  
No match found.  
Enter your regex: \bdog\B  
Enter input string to search: The doggie plays in the yard.  
I found the text "dog" starting at index 4 and ending at index 7.

To require that a match occur only at the end of the previous match, use \G:

Enter your regex: dog  
Enter input string to search: dog dog  
I found the text "dog" starting at index 0 and ending at index 3.  
I found the text "dog" starting at index 4 and ending at index 7.  

Enter your regex: \Gdog  
Enter input string to search: dog dog  
I found the text "dog" starting at index 0 and ending at index 3.
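
The \B and \G transcripts can be reproduced by counting matches. This is only a sketch; the class name BoundaryDemo and the helper countMatches are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts how many times regex matches in input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \B requires that "dog" is NOT followed by a word boundary.
        System.out.println(countMatches("\\bdog\\B", "The dog plays in the yard."));    // 0
        System.out.println(countMatches("\\bdog\\B", "The doggie plays in the yard.")); // 1
        // \G only allows a match that starts where the previous match ended.
        System.out.println(countMatches("dog", "dog dog"));    // 2
        System.out.println(countMatches("\\Gdog", "dog dog")); // 1
    }
}
```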

6 Examples

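// These snippets assume: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Check whether a string starts with "Java" and ends with anything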
    Pattern p1 = Pattern.compile("^Java.*");
    Matcher matcher = p1.matcher("Java不是人");
    boolean b = matcher.matches();
    System.out.println(b);
    // Split a string on multiple delimiters (runs of commas or vertical bars)
    Pattern pattern = Pattern.compile("[,|]+");
    String[] strs = pattern.split("Java Hello World java,Hello,,World|Sun");
    for(int i = 0; i < strs.length; i++) {
        System.out.println(strs[i]);
    }

Output:
true
Java Hello World java
Hello
World
Sun

// Text replacement: replace only the first match of the pattern
    Pattern pattern = Pattern.compile("正则表达式");
    Matcher matcher = pattern.matcher("正则表达式 Hello World,正则表达式 Hello World");
    System.out.println(matcher.replaceFirst("Java"));
    Output:
    Java Hello World,正则表达式 Hello World

// Text replacement: replace all matches
    Pattern pattern = Pattern.compile("正则表达式");
    Matcher matcher = pattern.matcher("正则表达式 Hello World,正则表达式 Hello World");
    System.out.println(matcher.replaceAll("Java"));
    Output:
    Java Hello World,Java Hello World
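
// Text replacement: build the result step by step with appendReplacement and appendTail
    Pattern pattern = Pattern.compile("正则表达式");
    Matcher matcher = pattern.matcher("正则表达式 Hello World,正则表达式 Hello World ");
    StringBuffer sbr = new StringBuffer();
    while (matcher.find()) {
        // Appends the text before the match, then the replacement
        matcher.appendReplacement(sbr, "Java");
    }
    // Appends whatever follows the last match
    matcher.appendTail(sbr);
    System.out.println(sbr.toString());
    Output:
    Java Hello World,Java Hello World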

    // Validate an email address
    // (in Java source, each regex backslash is written as \\)
    String str="ceponline@yahoo.com.cn";
    Pattern pattern = Pattern.compile("[\\w\\.\\-]+@([\\w\\-]+\\.)+[\\w\\-]+",Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    System.out.println(matcher.matches());
    Output: true

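Questions
2. Q: Consider the string "foo". What is its starting index? What is its ending index? Explain what these numbers mean.
A: Each character in the string resides in its own cell. Index positions point between cells. The string "foo" starts at index 0 and ends at index 3, even though the characters themselves occupy only cells 0, 1, and 2.
3. Q: What is the difference between an ordinary character and a metacharacter? Give an example of each.
A: An ordinary character in a regular expression matches itself. A metacharacter is a special character that affects the way a pattern is matched. The letter A is an ordinary character. The punctuation mark . is a metacharacter that matches any single character.
8. Q: Consider the regular expression (dog){3}. Identify its two subexpressions. What string does the expression match?
A: The expression consists of the capturing group (dog) followed by the greedy quantifier {3}. It matches the string "dogdogdog".
Exercise Answers
1. Exercise: Use a backreference to write an expression that matches a person's name only if that person's first name and last name are the same.
Solution: ([A-Z][a-zA-Z]*)\s\1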
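
The exercise solution can be tried directly in Java. In this sketch the class name, the constant SAME_NAME, and the helper hasSameFirstAndLastName are invented for illustration; matches() is used so that the whole input has to be a doubled name.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Group 1 captures the first name; \1 then requires exactly the same text again.
    private static final Pattern SAME_NAME = Pattern.compile("([A-Z][a-zA-Z]*)\\s\\1");

    static boolean hasSameFirstAndLastName(String fullName) {
        return SAME_NAME.matcher(fullName).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSameFirstAndLastName("Bob Bob"));   // true
        System.out.println(hasSameFirstAndLastName("Bob Alice")); // false
    }
}
```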

example:
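1. Match lines in which a word is followed by zero or one punctuation character and any number of spaces, and then by the same word again, for example: cart cart, long,long ago, ha!ha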

(\<.*\>).?( )*\1

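(\<.*\>) matches a word of any length. This is the first subexpression.
.? matches zero or one punctuation character. Because what precedes the dot is a word, the "." here is intended to match punctuation (strictly speaking, "." matches any single character).
( )* matches zero or more spaces. This is the second subexpression.
\1 refers to the text matched by the first subexpression. (\2 refers to the second subexpression.)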
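
The expression above uses egrep syntax; java.util.regex has no \< and \> anchors. A rough Java adaptation, which is not the original author's expression, could use \b for word boundaries, \w+ for the word, and the POSIX class \p{Punct} for the optional punctuation. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class RepeatedWordDemo {
    // word, optional single punctuation character, optional whitespace, same word again
    private static final Pattern REPEATED =
            Pattern.compile("\\b(\\w+)\\b\\p{Punct}?\\s*\\1\\b");

    static boolean hasRepeatedWord(String line) {
        return REPEATED.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedWord("cart cart"));     // true
        System.out.println(hasRepeatedWord("long,long ago")); // true
        System.out.println(hasRepeatedWord("ha!ha"));         // true
        System.out.println(hasRepeatedWord("cart horse"));    // false
    }
}
```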

egrep '[a-z]at' /usr/share/dict/words

Returns:
Akhmatova
Akhmatova's
Alcatraz
Alcatraz's

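egrep '\<[a-z]at\>' /usr/share/dict/words

bat
bat's
cat
cat's
eat
fat
A "word" is a string delimited on both sides by non-word characters. A non-word character is any character other than a letter, digit, or underscore.
In the first line, bat is delimited by the start and the end of the line, which fits the definition of a word, so it matches. The bat in "a#$bat" is delimited by punctuation and the end of the line, so it matches as well.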
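3. Match all lines in the file that start with an uppercase letter and end with a lowercase t

egrep "^[[:upper:]].*t$" words
^ and $ match the start and the end of a line, respectively.
^a[a-z]t$ matches every line that starts with a, ends with t, and has exactly one lowercase letter between the a and the t.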
