Mastering Java Regular Expressions

Article link: https://blog.csdn.net/zuo_huai/article/details/84318398

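1. What regular expressions are for

Regular expressions can handle almost any string operation: searching, validating, extracting, replacing, and splitting text.

2. Basic usage of regular expressions (I)

1) Character classes

A character class is one of the most basic regex structures. It specifies which characters may appear at a given position and is written as [...], with the allowed characters listed inside the square brackets. For example, the pattern below accepts either e or a at two positions, so it matches the misspelling "seperate" as well as the correct "separate":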

		String regex2 = "sep[ea]r[ea]te";
		String str2 = "seperate";
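To make the effect of the two [ea] classes concrete, here is a minimal sketch that tests a few spellings against the pattern above. The class name CharClassDemo and the helper isAccepted are illustrative names, not part of the original article.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Each [ea] accepts exactly one character: either 'e' or 'a'.
    private static final Pattern SEPARATE = Pattern.compile("sep[ea]r[ea]te");

    // True when the whole word is one of the four spellings the pattern allows.
    static boolean isAccepted(String word) {
        return SEPARATE.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] { "seperate", "separate", "seporate" }) {
            System.out.println(word + " -> " + isAccepted(word));
        }
    }
}
```

Only "seporate" is rejected, because o is listed in neither character class.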

2) The hyphen

A hyphen expresses a range, which keeps the description short:

[0123456789] = [0-9]

[0-789] = [0-9]

[0123456789abcdef] = [0-9a-f]

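Note: inside a character class, a hyphen denotes a range only when it sits between two characters. At the very start of the class it stands for the literal hyphen "-" itself.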
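The following sketch checks both roles of the hyphen: building a range and standing for itself. HyphenRangeDemo and its matches helper are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

class HyphenRangeDemo {
    // True when the whole input is matched by the given character class.
    static boolean matches(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[0-9a-f]", "c")); // 'c' lies inside the range a-f
        System.out.println(matches("[0-9a-f]", "g")); // 'g' lies outside both ranges
        System.out.println(matches("[0-9a-f]", "-")); // here the hyphens only build ranges
        System.out.println(matches("[-0-9]", "-"));   // a leading hyphen is a literal '-'
    }
}
```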

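3) Negated character classes

A negated character class specifies the characters that must not appear at a given position. It is written as [^...], with the disallowed characters listed inside the square brackets. A negated class still has to match exactly one character; it cannot match the empty string. It may contain one or more hyphen ranges. For example: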

"[^0-5]";

"[^0-5e-f]"

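4) Shorthand character classes

For commonly used character classes, regular expressions provide shorthands:

\d = [0-9]

\D = [^0-9]

\w = [a-zA-Z0-9_]

\W = [^a-zA-Z0-9_]

\s matches a whitespace character (carriage return, line feed, tab, space; in Java also form feed and vertical tab)

\S matches a non-whitespace character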
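As a quick check of these shorthands under Java's default settings, the sketch below prints a small truth table. It also shows that \w accepts uppercase letters and the underscore, and that a negated class such as \D or [^0-9] never matches the empty string. ShorthandDemo and matches are illustrative names.

```java
import java.util.regex.Pattern;

class ShorthandDemo {
    // True when the whole input is matched by the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] inputs = { "7", "a", "Z", "_", "-", "\t", "" };
        String[] classes = { "\\d", "\\D", "[^0-9]", "\\w", "\\W", "\\s", "\\S" };
        for (String cls : classes) {
            // One row per class, one true/false column per input.
            StringBuilder row = new StringBuilder(cls + ":");
            for (String in : inputs) {
                row.append(' ').append(matches(cls, in));
            }
            System.out.println(row);
        }
    }
}
```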

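5) Special characters

The dot "." is a special shorthand that matches almost any character. \. stands for a literal dot, and inside a character class [.] also matches only a literal dot. Note: by definition the dot does not match a line terminator, except under a specific match mode (single-line mode, described later).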

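3. Basic usage of regular expressions (II)

1) Quantifiers

A quantifier limits how many times the preceding element may occur:

* the preceding element occurs zero or more times

+ the preceding element occurs at least once

? the preceding element occurs at most once (zero or one time)

2) Interval quantifiers

An interval quantifier states the number of occurrences explicitly.

Forms:

{min,max} at least min and at most max times

{min,} at least min times

{number} exactly number times

The three basic quantifiers are shorthands for intervals:

* = {0,}

+ = {1,}

? = {0,1}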
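A minimal sketch of the quantifiers in action, assuming whole-string matching via Pattern.matches. The sample patterns (colou?r, \d{2,4}) and the class name QuantifierDemo are chosen only for illustration.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the whole input is matched by the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ? makes the 'u' optional.
        System.out.println(matches("colou?r", "color") + " " + matches("colou?r", "colour"));
        // {2,4} accepts two, three or four digits.
        for (String s : new String[] { "1", "12", "1234", "12345" }) {
            System.out.println(s + " -> " + matches("\\d{2,4}", s));
        }
        // * and {0,} are interchangeable, and both accept the empty string.
        System.out.println(matches("a*", "") + " " + matches("a{0,}", ""));
        // + requires at least one occurrence.
        System.out.println(matches("a+", ""));
    }
}
```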

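3) The limits of quantifiers

A quantifier applies only to the single character or character class before it. To repeat a whole string, wrap the string in parentheses (...) and put the quantifier after the closing parenthesis.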
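The difference is easy to see by comparing ab+ with (ab)+. The sketch below uses the illustrative class name GroupQuantifierDemo.

```java
import java.util.regex.Pattern;

class GroupQuantifierDemo {
    // True when the whole input is matched by the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without parentheses, + repeats only the 'b'.
        System.out.println(matches("ab+", "abbb") + " " + matches("ab+", "abab"));
        // With parentheses, + repeats the whole string "ab".
        System.out.println(matches("(ab)+", "abbb") + " " + matches("(ab)+", "abab"));
    }
}
```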

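4) Use of parentheses: alternation

A character class can only describe the single character allowed at a position; it cannot describe the strings allowed there. Alternation expresses which strings may appear at a position.

Forms:

(...|...), with the alternative strings on either side of the vertical bar

(...|...|...), for more than two alternatives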
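A short sketch of alternation, using a made-up weekday pattern; AlternationDemo and isListedDay are hypothetical names.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // The parentheses limit the alternatives to the prefix; "day" is shared.
    private static final Pattern DAY = Pattern.compile("(Mon|Tues|Wednes)day");

    // True when the whole input is one of the three listed weekdays.
    static boolean isListedDay(String s) {
        return DAY.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "Monday", "Tuesday", "Wednesday", "Friday" }) {
            System.out.println(s + " -> " + isListedDay(s));
        }
    }
}
```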

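Example: formatting a number with thousands separators. The code is as follows:

public class BeautifyNumber {
	public static void main(String[] args) {
		String[] numbers = new String[] { "1234567890", "123456", "123" };
		for (String number : numbers) {
			System.out.println(number + " after formatting: " + beautifyNumber(number));
		}
	}

	// Inserts a comma at every position that has a digit on its left
	// and a multiple of three digits up to a word boundary on its right.
	public static String beautifyNumber(String s) {
		return s.replaceAll("(?<=\\d)(?=((\\d{3})+\\b))", ",");
	}
}

The output is:

1234567890 after formatting: 1,234,567,890
123456 after formatting: 123,456
123 after formatting: 123

The expression is not analyzed here; it is easier to understand after lookaround has been covered.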

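5) Use of parentheses: capturing groups

Purpose: the text matched by the subexpression inside the parentheses is stored in the match result, where it can be accessed after matching completes.

Form:

Ordinary parentheses (...)

Note: every pair of ordinary parentheses creates a group. Capturing groups are numbered by their opening parentheses from left to right, and the same rule applies to nested parentheses. For example:

In the expression ((A)(B(C))) there are four such groups: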

1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)

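Group zero always stands for the entire expression.

If a capturing group is followed by a quantifier, the group in the match result holds the text matched by the subexpression's last iteration.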
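Both rules, numbering by opening parenthesis and "last iteration wins", can be observed with Matcher.group. This is a minimal sketch; CaptureGroupDemo and its groups helper are illustrative names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Returns groups 0..groupCount() for a full match, or null if there is none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Index 0 is the whole match; 1..4 follow the opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // The quantified group keeps only the digit from its last iteration.
        System.out.println(Arrays.toString(groups("(\\d)+", "123")));
    }
}
```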

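6) Non-capturing parentheses

If the expression is complex or the text to process is long, capturing groups can reduce efficiency.

Purpose: group part of an expression only, without storing the matched text in the result.

Form: (?:...)

Not every language supports this syntax, and it makes expressions harder to read, so consider non-capturing parentheses mainly when efficiency becomes a serious problem.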
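A non-capturing group still groups, but it does not take up a group number. The sketch below compares (ab)+(c) with (?:ab)+(c); NonCapturingDemo and its helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups declared in the pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text of group 1 after a full match, or null when the input does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+(c)") + " vs " + groupCount("(?:ab)+(c)"));
        // With (?:...), the 'c' group moves up to number 1.
        System.out.println(firstGroup("(ab)+(c)", "ababc"));
        System.out.println(firstGroup("(?:ab)+(c)", "ababc"));
    }
}
```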

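7) Use of parentheses: back-references

Purpose: at a later point in the expression, match again the exact text that an earlier capturing group matched.
Form: \1 refers to group 1, \2 to group 2, and so on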
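A common use of back-references is spotting an accidentally doubled word. The sketch below is one way to do it, assuming words are runs of \w characters separated by whitespace; BackReferenceDemo, firstRepeatedWord and collapseRepeats are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // \1 must match exactly the text that group 1 (\w+) just captured.
    private static final Pattern REPEAT = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the first word that is immediately repeated, or null if none is.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEAT.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Replaces each doubled word with a single copy; $1 is group 1 in the replacement.
    static String collapseRepeats(String text) {
        return REPEAT.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "this is is a test";
        System.out.println(firstRepeatedWord(text));
        System.out.println(collapseRepeats(text));
    }
}
```

The leading \b matters: without it, the "is" at the end of "this" followed by " is" would also count as a repetition.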

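4. Basic usage of regular expressions (III)

1) Anchors

An anchor constrains the position at which a match may occur.

Form: \b, the word-boundary anchor

For example: \bcat\b

Notes: 1) \b marks a word boundary: one side must be a word character and the other side must be a non-word character or the start or end of the string. 2) Word characters are English letters and digits (plus the underscore); do not rely on \b to separate English words from adjacent Chinese text. In the run below, Chinese characters behave like word characters, so there is no boundary between 文 and c. This depends on the JDK version: JDK 19 and later treat \b as ASCII-only by default, so the fifth sentence may be reported as found there. 3) Non-word characters are punctuation marks and whitespace.

For example: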

String[] strings = new String[] {
				"This sentence contain word cat",
				"This sentence contain word \"cat\"",
				"This sentence contain word vacation",
				"This sentence contain word \"cate\"",
				"中文cat字符",
				"中文cat0",
		};
		String regex = "\\bcat\\b";
		for(String str : strings) {
			System.out.println("Checking sentence:\t" + str);
			Pattern p = Pattern.compile(regex);
			Matcher m = p.matcher(str);
			if(m.find()) {
				System.out.println("Found word \"cat\"!");
			}
			else {
				System.out.println("Can not found word \"cat\"!");
			}
		}

Output:

Checking sentence:	This sentence contain word cat
Found word "cat"!
Checking sentence:	This sentence contain word "cat"
Found word "cat"!
Checking sentence:	This sentence contain word vacation
Can not found word "cat"!
Checking sentence:	This sentence contain word "cate"
Can not found word "cat"!
Checking sentence:	中文cat字符
Can not found word "cat"!
Checking sentence:	中文cat0
Can not found word "cat"!

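^ (caret) matches the start of a line (its behavior can change under different match modes)

$ matches the end of a line (its behavior can change under different match modes)

\A matches the start of the whole string

\Z matches the end of the whole string (a final line terminator, if present, is allowed after it)

Sample code: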

		String[] strings = new String[] { "start ", " start  ", " end ", " end" };
		String[] regexes = new String[] { "^start", "\\Astart", "end$", "end\\Z"};
		for (String str : strings) {
			for (String regex : regexes) {
				Pattern p = Pattern.compile(regex);
				Matcher m = p.matcher(str);
				if(m.find()) {
					System.out.println("\"" + str
							+ "\" can be matched with regex \"" + regex
							+ "\"");
				}
				else {
					System.out.println("\"" + str
							+ "\" can not be matched with regex \"" + regex
							+ "\"");
				}
			}
			System.out.println("");
		}

Output:

"start " can be matched with regex "^start"
"start " can be matched with regex "\Astart"
"start " can not be matched with regex "end$"
"start " can not be matched with regex "end\Z"

" start  " can not be matched with regex "^start"
" start  " can not be matched with regex "\Astart"
" start  " can not be matched with regex "end$"
" start  " can not be matched with regex "end\Z"

" end " can not be matched with regex "^start"
" end " can not be matched with regex "\Astart"
" end " can not be matched with regex "end$"
" end " can not be matched with regex "end\Z"

" end" can not be matched with regex "^start"
" end" can not be matched with regex "\Astart"
" end" can be matched with regex "end$"
" end" can be matched with regex "end\Z"

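2) Lookaround

Anchors alone cannot express every condition on a position.

Purpose: lookaround applies a subexpression to test a position.

Forms:

(?=subexpression) positive lookahead: the text to the right must be matchable by the subexpression

(?!subexpression) negative lookahead: the text to the right must not be matchable by the subexpression

(?<=subexpression) positive lookbehind: the text to the left must be matchable by the subexpression

(?<!subexpression) negative lookbehind: the text to the left must not be matchable by the subexpression

For example, lookahead (positive and negative):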

String[] strings = new String[] { "Jeff", "Jeffrey", "Jefferson"};
		String[] regexes = new String[] { "Jeff", "Jeff(?=rey)", "Jeff(?!rey)"};
		for (String regex : regexes) {
			for (String str : strings) {
				Pattern p = Pattern.compile(regex);
				Matcher m = p.matcher(str);
				if(m.find()) {
					System.out.println("\"" + str
							+ "\" can be matched with regex \"" + regex
							+ "\"");
				}
				else {
					System.out.println("\"" + str
							+ "\" can not be matched with regex \"" + regex
							+ "\"");
				}
			}
			System.out.println("");
		}

Output:

"Jeff" can be matched with regex "Jeff"
"Jeffrey" can be matched with regex "Jeff"
"Jefferson" can be matched with regex "Jeff"

"Jeff" can not be matched with regex "Jeff(?=rey)"
"Jeffrey" can be matched with regex "Jeff(?=rey)"
"Jefferson" can not be matched with regex "Jeff(?=rey)"

"Jeff" can be matched with regex "Jeff(?!rey)"
"Jeffrey" can not be matched with regex "Jeff(?!rey)"
"Jefferson" can be matched with regex "Jeff(?!rey)"

Lookbehind (positive and negative):

String[] strings = new String[] {"see", "bee", "tee"};
		String[] regexes = new String[] { "(?<=s)ee", "(?<!s)ee"};
		for (String regex : regexes) {
			for (String str : strings) {
				Pattern p = Pattern.compile(regex);
				Matcher m = p.matcher(str);
				if(m.find()) {
					System.out.println("\"" + str
							+ "\" can be matched with regex \"" + regex
							+ "\"");
				}
				else {
					System.out.println("\"" + str
							+ "\" can not be matched with regex \"" + regex
							+ "\"");
				}
			}
		}

Output:

"see" can be matched with regex "(?<=s)ee"
"bee" can not be matched with regex "(?<=s)ee"
"tee" can not be matched with regex "(?<=s)ee"
"see" can not be matched with regex "(?<!s)ee"
"bee" can be matched with regex "(?<!s)ee"
"tee" can be matched with regex "(?<!s)ee"

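Notes on using lookaround: 1) A lookaround is only a boolean test on a position; the text matched by its subexpression is not included in the overall match. 2) Lookbehind restricts its subexpression, and the restriction differs by language:

Perl, Python: the subexpression in a lookbehind must have a fixed length

PHP, Java: the subexpression in a lookbehind need not have a fixed length, but it must have an upper bound, so quantifiers such as * and + cannot be used

.NET: the subexpression in a lookbehind is not restricted at all

With lookaround covered, the regular expression from the earlier example, which turns 123456789 into 123,456,789, can now be fully understood.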
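To see why the expression works, it helps to list the positions it matches. (?<=\d) requires a digit on the left, and (?=(\d{3})+\b) requires one or more complete groups of three digits reaching a word boundary on the right. Both are zero-width, so replaceAll inserts a comma at each such position without consuming any digit. The sketch below collects those positions; CommaPositionsDemo and commaPositions are hypothetical names, and the redundant outer parentheses of the original lookahead are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaPositionsDemo {
    // Same conditions as in the number-formatting example.
    private static final Pattern GAP = Pattern.compile("(?<=\\d)(?=(\\d{3})+\\b)");

    // Returns the string indexes at which a comma would be inserted.
    static List<Integer> commaPositions(String digits) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = GAP.matcher(digits);
        while (m.find()) {
            positions.add(m.start()); // every match is empty, so start() == end()
        }
        return positions;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "1234567", "1234567890", "123" }) {
            System.out.println(s + " -> " + commaPositions(s));
        }
    }
}
```

For "1234567" the only qualifying positions are index 1 (six digits follow) and index 4 (three digits follow); at every other index the digits to the right do not form whole groups of three.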

5. Basic usage of regular expressions (IV)

1) Match modes

Purpose: a match mode changes the matching rules of certain constructs.

Forms:

I: Case Insensitive

S: SingleLine(dot All)

M: MultiLine

X: Comment

2) Case-insensitive mode

String regex = "ABC";

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

String str = "abc";
		String regex = "ABC";
		Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE);//
		Matcher m = p.matcher(str);
		if(m.find()) {		
			System.out.println("\"" + str
					+ "\" can be matched with regex \"" + regex
					+ "\"");
		} else {
			System.out.println("\"" + str
					+ "\" can not be matched with regex \"" + regex
					+ "\"");
		}

Output:

"abc" can be matched with regex "ABC"

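3) Single-line mode (dot-all mode)

String regex = "<a href=([^>]+)>.*</a>";

Pattern p = Pattern.compile(regex, Pattern.DOTALL);

By default the dot cannot match a line terminator; when the pattern is compiled with Pattern.DOTALL as above, it can.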
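A side-by-side comparison makes the effect of the flag visible. This sketch reuses the article's regex and sample string and compiles the pattern once without flags and once with Pattern.DOTALL; the class name DotAllDemo and the helper found are illustrative.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    static final String HTML = "<a href=www.itcast.net>\nITCAST\n</a>";
    static final String REGEX = "<a href=([^>]+)>.*</a>";

    // Compiles REGEX with the given flags (0 means none) and searches HTML.
    static boolean found(int flags) {
        return Pattern.compile(REGEX, flags).matcher(HTML).find();
    }

    public static void main(String[] args) {
        System.out.println("default: " + found(0));              // .* stops at the first \n
        System.out.println("DOTALL:  " + found(Pattern.DOTALL)); // .* may cross line breaks
    }
}
```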

String str = "<a href=www.itcast.net>\nITCAST\n</a>";
		String regex = "<a href=([^>]+)>.*</a>";
		Pattern p = Pattern.compile(regex);//
		Matcher m = p.matcher(str);
		if(m.find()) {		
			System.out.println("\"" + str
					+ "\" can be matched with regex \"" + regex
					+ "\"");
		} else {
			System.out.println("\"" + str
					+ "\" can not be matched with regex \"" + regex
					+ "\"");
		}

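Output (this run compiles the pattern without Pattern.DOTALL, so .* cannot cross the \n and no match is found):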

"<a href=www.itcast.net>
ITCAST
</a>" can not be matched with regex "<a href=([^>]+)>.*</a>"

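4) Multi-line mode

Multi-line mode changes how ^ (caret) and $ match: they can then also match at the start and end of each line inside the string.

\A and \Z are not affected.

String regex = "^ITCAST$";

Pattern p = Pattern.compile(regex, Pattern.MULTILINE);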

String str = "<a href=www.itcast.net>\nITCAST\n</a>";
		String regex = "^ITCAST$";
		Pattern p = Pattern.compile(regex,Pattern.MULTILINE);//
		Matcher m = p.matcher(str);
		if (m.find()) {
			System.out.println("\"" + str + "\" can be matched with regex \""
					+ regex + "\"");
		} else {
			System.out.println("\"" + str
					+ "\" can not be matched with regex \"" + regex + "\"");
		}

Output:

"<a href=www.itcast.net>
ITCAST
</a>" can be matched with regex "^ITCAST$"

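5) Comments mode

Purpose: comments mode allows comments inside the regular expression.

A comment starts with # and ends at the line terminator (or at the end of the expression).

In this mode, whitespace inside the regular expression is ignored.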

String str = "webmaster@itcast.net";
		String regex = "webmaster #username\n" + "@" + "itcast.net #hostname";
		Pattern p = Pattern.compile(regex, Pattern.COMMENTS);//
		Matcher m = p.matcher(str);
		if (m.find()) {
			System.out.println("\"" + str + "\" can be matched with regex \""
					+ regex + "\"");
		} else {
			System.out.println("\"" + str
					+ "\" can not be matched with regex \"" + regex + "\"");
		}

Output:

"webmaster@itcast.net" can be matched with regex "webmaster #username
@itcast.net #hostname"

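6) Combining modes

Purpose: use several modes at the same time.

Form: when compiling the regular expression, join the mode flags with the vertical bar "|" (bitwise OR).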

String regex = "<a href=([^>]+)>.*</a>";

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
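The sketch below checks that both flags are really needed for an upper-case, multi-line variant of the earlier sample string. The input text, the class name CombinedFlagsDemo and the helper found are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class CombinedFlagsDemo {
    // Upper-case tag names plus line breaks: needs both i and s to match.
    static final String HTML = "<A HREF=www.itcast.net>\nITCAST\n</A>";
    static final String REGEX = "<a href=([^>]+)>.*</a>";

    // Compiles REGEX with the given flags and searches HTML.
    static boolean found(int flags) {
        return Pattern.compile(REGEX, flags).matcher(HTML).find();
    }

    public static void main(String[] args) {
        System.out.println("i only: " + found(Pattern.CASE_INSENSITIVE));
        System.out.println("s only: " + found(Pattern.DOTALL));
        System.out.println("i | s:  " + found(Pattern.CASE_INSENSITIVE | Pattern.DOTALL));
    }
}
```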

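7) Scope of a mode

Purpose: control precisely which part of the expression a mode applies to.

Form: enable modes inside the expression with (?ismx) and disable them with (?-ismx)

For example:

String regex = "(?is)<a href=([^>]+)>.*</a>"; // enables the i and s modes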
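Because an inline flag takes effect from the point where it is written, a mode can be switched on for one part of an expression and off for the rest. A minimal sketch with a made-up pattern; InlineFlagDemo is an illustrative name.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // "abc" is matched case-insensitively, "DEF" case-sensitively.
    static final String REGEX = "(?i)abc(?-i)DEF";

    // True when the whole input is matched by REGEX.
    static boolean matches(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abcDEF", "ABCDEF", "abcdef", "ABCdef" }) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```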

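8) Mode conflicts

String regex = "(?-i)ABC";

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

When the flags passed to compile and the modes written in the expression disagree, the modes in the regular expression take precedence.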
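This can be verified directly: with (?-i) in the expression, the CASE_INSENSITIVE flag passed to compile no longer applies to the text after it. FlagConflictDemo and found are hypothetical names used for this sketch.

```java
import java.util.regex.Pattern;

class FlagConflictDemo {
    // Always passes CASE_INSENSITIVE to compile; the regex may override it inline.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("ABC", "abc"));       // flag applies: case is ignored
        System.out.println(found("(?-i)ABC", "abc"));  // inline (?-i) switches the flag off
        System.out.println(found("(?-i)ABC", "ABC"));  // exact case still matches
    }
}
```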

This article is still to be expanded.