12. A Summary of the Matcher Class in Regular Expressions

heruozhi

Category column: Java Basics (Java基础学习)

Article link: https://blog.csdn.net/heruozhi/article/details/44538723

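This post does not introduce regular expressions themselves. Just remember to add

import java.util.regex.*;

The focus is a summary of how Pattern and Matcher behave.

Start with this code: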

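import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		// Two 11-digit phone numbers back to back
		String str = "1588344570615883445706";
		Matcher m = Pattern.compile("(((13\\d)|(15\\d))\\d{8})+?").matcher(str);
		while(m.find())
		{
			// end(1) is the end of capture group 1; it is -1 when group 1 took no part in the match
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}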

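Pattern.compile is a static method. According to the API documentation it has two overloads: compile(String regex) and compile(String regex, int flags). flags is a bit mask of match flags, each of which is a static int constant on Pattern (for example Pattern.CASE_INSENSITIVE or Pattern.MULTILINE), and several flags can be combined with bitwise OR. compile returns a Pattern object; Pattern has no public constructor. Pattern has other methods, but they are not the focus here. A Pattern object holds the compiled form of the regex and can create Matcher objects, and Matcher is the focus of this post.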
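As a small illustration of the two-argument overload, the sketch below compiles the same regex with and without flags. The class name CompileFlagsDemo and the helper contains are made up for this example; only Pattern.compile and the two flag constants come from the JDK.

```java
import java.util.regex.Pattern;

public class CompileFlagsDemo {
    // Hypothetical helper: true if regex, compiled with the given flags, is found anywhere in input.
    static boolean contains(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // No flags (0): matching is case-sensitive.
        System.out.println(contains("java", 0, "I like JAVA"));                        // false
        System.out.println(contains("java", Pattern.CASE_INSENSITIVE, "I like JAVA")); // true
        // Flags are combined with bitwise OR.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(contains("^java$", flags, "first\nJava\nlast"));            // true
    }
}
```

With MULTILINE set, ^ and $ also match at line boundaries, which is why the last call finds "Java" on the middle line.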

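Matcher has no public constructor either; an instance can only be obtained through a Pattern object's matcher(CharSequence input) method. Why can Pattern call Matcher's constructor? The constructor is package-private, and both classes live in the java.util.regex package, so Pattern has access to it while outside code does not.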
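A minimal sketch of the usual pattern: compile once, then ask the Pattern for a new Matcher per input. The names MatcherFactoryDemo, PHONE and isPhone are hypothetical, and the regex 1[35]\\d{9} is a shorter way of writing the phone-number expression used in this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherFactoryDemo {
    // Compile once; a Pattern can hand out any number of Matcher objects.
    static final Pattern PHONE = Pattern.compile("1[35]\\d{9}");

    // matcher() accepts any CharSequence, not only String.
    static boolean isPhone(CharSequence input) {
        Matcher m = PHONE.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhone("15883445706"));                    // true
        System.out.println(isPhone(new StringBuilder("13912345678"))); // true
        System.out.println(isPhone("1588344570"));                     // false: only 10 digits
    }
}
```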

A few notes on Matcher, grouped by topic:

1. Matching

The main matching methods are find(), lookingAt() and matches(). Each returns a boolean that tells whether a match was found.

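find(): the documentation says: Attempts to find the next subsequence of the input sequence that matches the pattern. In other words, it looks for the "next" subsequence of the input that matches the pattern. When a match is found, the matcher records it and returns true, and group() then returns the matched text. Once find() has scanned to the end of the input it returns false, and further calls keep returning false. find(int start) resets the matcher and searches again from the given index; reset() on its own restarts from the beginning.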
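The sketch below walks through that life cycle: find() until it is exhausted, then find(int) and reset() to start over. The class name FindRestartDemo and the helper firstStartFrom are made up for illustration, and \\d{11} stands in for the phone-number pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindRestartDemo {
    // Hypothetical helper: start index of the first match at or after 'from', or -1 if there is none.
    static int firstStartFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\d{11}").matcher("1588344570615883445706");
        while (m.find()) {
            System.out.println("match at " + m.start()); // 0, then 11
        }
        System.out.println(m.find());    // false: the input is exhausted
        System.out.println(m.find(11));  // true: resets, then searches from index 11
        System.out.println(m.start());   // 11
        m.reset();                       // back to the beginning
        System.out.println(m.find() ? m.start() : -1); // 0
    }
}
```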

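In the code at the top, ((13\\d)|(15\\d))\\d{8} means "starts with 13x or 15x, followed by eight more x", where x is a digit. It could equally be written ((13)|(15))\\d{9}. The whole expression is wrapped in parentheses and given the quantifier "+?". The plus sign means the preceding expression occurs one or more times. The question mark selects reluctant mode: match as few characters as possible. With nothing after the quantifier, find() therefore stops after exactly one occurrence of the pattern.

The first call to find() matches the first 11 characters, and m.group() returns them as a String. The second call matches the last 11 characters, and group() returns that result. The matcher is now at the end of the string, so the third call to find() returns false and the loop ends.

So the program prints: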

start at : 0;end at :11 output : 15883445706
start at : 11;end at :22 output : 15883445706

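lookingAt():

This method only tries to match at the beginning of the string; the whole string does not have to match. It returns true if a prefix matches and false otherwise. So if the program above used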

while(m.lookingAt())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}

it would loop forever: check the beginning, match, print, check the beginning, match, print, and so on.
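Since lookingAt() always restarts at the beginning of the input, it fits an if rather than a while. A minimal sketch, with the hypothetical names LookingAtDemo and matchedPrefix, and 1[35]\\d{9} as a shorter form of the phone-number regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookingAtDemo {
    // Hypothetical helper: the prefix of input that matches regex, or null if the input does not start with a match.
    static String matchedPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // 'if', not 'while': every call to lookingAt() starts again at the beginning
        if (m.lookingAt()) {
            return m.group();
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchedPrefix("1[35]\\d{9}", "1588344570615883445706")); // 15883445706
        System.out.println(matchedPrefix("1[35]\\d{9}", "x15883445706"));           // null
    }
}
```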

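matches():

This method tries to match the entire string against the pattern. Modify the program as follows (the quantifier is now a greedy "+", and the loop still uses find()):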

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "1588344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})+").matcher(str);		
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

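As you can see, the quantifier at the end of the regex is just "+", meaning the expression occurs one or more times. With nothing after the plus sign, greedy mode is the default: take as much as allowed. "More times" has no upper bound, so the match is extended until nothing more can be matched.

The result is that both adjacent phone numbers are matched in one go, and the output is: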

start at : 0;end at :22 output : 1588344570615883445706
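To put the two quantifiers side by side, the sketch below collects every find() result for the reluctant and the greedy version of the same regex. GreedyVsReluctantDemo and findAll are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctantDemo {
    // Hypothetical helper: the text of every successive find() match.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "1588344570615883445706";
        // Reluctant: stops after one occurrence, so find() succeeds twice.
        System.out.println(findAll("(((13)|(15))\\d{9})+?", str)); // [15883445706, 15883445706]
        // Greedy: takes both occurrences in a single match.
        System.out.println(findAll("(((13)|(15))\\d{9})+", str));  // [1588344570615883445706]
    }
}
```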

2. Where the Matcher is positioned after each match attempt:

The behavior of find(), using the following code as an example:

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "1588344570615883d445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})?").matcher(str); // zero or one time, greedy mode
//		System.out.println(m.groupCount());
//	        m.matches();
//		System.out.println(m.start(0));
//		System.out.println(m.start(1));
//		System.out.println(m.start(2));
//		System.out.println(m.start(3));
//              m.lookingAt();		
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

The result is:

start at : 0;end at :11 output : 15883445706
start at : 11;end at :-1 output :
start at : 12;end at :-1 output :
start at : 13;end at :-1 output :
start at : 14;end at :-1 output :
start at : 15;end at :-1 output :
start at : 16;end at :-1 output :
start at : 17;end at :-1 output :
start at : 18;end at :-1 output :
start at : 19;end at :-1 output :
start at : 20;end at :-1 output :
start at : 21;end at :-1 output :
start at : 22;end at :-1 output :
start at : 23;end at :-1 output :

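After the first valid number is matched, no further "one occurrence" match exists. At each later position the engine first tries one occurrence, fails, and falls back to "zero occurrences", which is an empty match (end(1) is -1 because group 1 took no part in it). After an empty match, find() moves forward one character before searching again, which is why every position from 11 to 23 is reported.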
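When a pattern can match the empty string, one common way to keep only the real hits is to compare start() and end(). The sketch below is one possible approach, not the only one; SkipEmptyMatchesDemo, countAllMatches and nonEmptyMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipEmptyMatchesDemo {
    // Counts every successful find(), empty matches included.
    static int countAllMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Keeps only matches that consumed at least one character.
    static List<String> nonEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            if (m.end() > m.start()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = "(((13)|(15))\\d{9})?";
        String str = "1588344570615883d445706";
        System.out.println(countAllMatches(regex, str)); // 14: one number plus 13 empty matches
        System.out.println(nonEmptyMatches(regex, str)); // [15883445706]
    }
}
```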

The behavior of lookingAt(), using the following code as an example:

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "158d8344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})?").matcher(str);	
//		System.out.println(m.groupCount());
//	        m.matches();
//		System.out.println(m.start(0));
//		System.out.println(m.start(1));
//		System.out.println(m.start(2));
//		System.out.println(m.start(3));
		m.lookingAt();		
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

The result is:

start at : 1;end at :-1 output :
start at : 2;end at :-1 output :
start at : 3;end at :-1 output :
start at : 4;end at :-1 output :
start at : 5;end at :-1 output :
start at : 6;end at :-1 output :
start at : 7;end at :-1 output :
start at : 8;end at :-1 output :
start at : 9;end at :-1 output :
start at : 10;end at :-1 output :
start at : 11;end at :-1 output :
start at : 12;end at :23 output : 15883445706
start at : 23;end at :-1 output :
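lookingAt() does succeed at the beginning, but only with an empty match: one occurrence fails at the letter d, so the engine falls back to "zero occurrences" at index 0. Because that match is empty, the next search starts one character further on, so find() begins matching "zero occurrences" at index 1.

If lookingAt() succeeds with a non-empty match, the next find() starts at the position right after that match, just as it would after a successful find().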
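A short sketch of that carry-over, using the simpler regex \\d+ so the positions are easy to follow. The names LookingAtThenFindDemo and findAfterLookingAt are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookingAtThenFindDemo {
    // Calls lookingAt(), optionally reset(), then find(); returns what find() matched, or null.
    static String findAfterLookingAt(String regex, String input, boolean resetFirst) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.lookingAt();
        if (resetFirst) {
            m.reset(); // forget the lookingAt() match and go back to index 0
        }
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // lookingAt() matched "123", so find() continues after it.
        System.out.println(findAfterLookingAt("\\d+", "123abc456", false)); // 456
        // After reset(), find() starts from the beginning again.
        System.out.println(findAfterLookingAt("\\d+", "123abc456", true));  // 123
    }
}
```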

The behavior of matches() is more surprising. Consider the following code:

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "158d8344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})??").matcher(str);	
		m.matches();
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

The output is:

start at : 3;end at :-1 output :
start at : 4;end at :-1 output :
start at : 5;end at :-1 output :
start at : 6;end at :-1 output :
start at : 7;end at :-1 output :
start at : 8;end at :-1 output :
start at : 9;end at :-1 output :
start at : 10;end at :-1 output :
start at : 11;end at :-1 output :
start at : 12;end at :-1 output :
start at : 13;end at :-1 output :
start at : 14;end at :-1 output :
start at : 15;end at :-1 output :
start at : 16;end at :-1 output :
start at : 17;end at :-1 output :
start at : 18;end at :-1 output :
start at : 19;end at :-1 output :
start at : 20;end at :-1 output :
start at : 21;end at :-1 output :
start at : 22;end at :-1 output :
start at : 23;end at :-1 output :

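In this run, matches() failed at index 3 (the letter d), the matcher's position appears to have stayed at the failing index, and find() then started matching "zero occurrences" from index 3. The Javadoc only describes where find() resumes after a successful match, so the position after a failed matches() is best treated as an implementation detail; call m.reset() first if find() must start from the beginning.

Now look at this code: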

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "1588344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})+?").matcher(str); // one or more times, reluctant mode
		m.matches();
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

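Nothing is printed. Although the quantifier is +?, that is, one or more times in reluctant mode, matches() still consumed the entire string, so find() has nothing left to search. Likewise, with the quantifier *? ("zero or more times, reluctant mode"), matches() also consumes the entire string.

This shows that reluctant mode does not shorten what matches() accepts. A reluctant quantifier only tries fewer repetitions first; matches() succeeds only if the whole string is consumed, so the engine keeps adding repetitions until the end of the input is reached.

Now use ??, that is, "zero or one time, reluctant mode", with the following code: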
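The difference is easy to see by running the same reluctant regex through find() and through matches(). This is a minimal sketch; ReluctantMatchesDemo, firstFind and wholeMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantMatchesDemo {
    // Text of the first find() match, or null.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Text matched by matches(), or null if the whole input does not match.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = "(((13)|(15))\\d{9})+?";
        String str = "1588344570615883445706";
        // find() is free to stop early, so the reluctant quantifier settles for one occurrence.
        System.out.println(firstFind(regex, str));  // 15883445706
        // matches() must reach the end of the input, so a second occurrence is added.
        System.out.println(wholeMatch(regex, str)); // 1588344570615883445706
    }
}
```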

package c71;

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "1588344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})??").matcher(str);	
		System.out.println(m.matches());
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

The result is:

false
start at : 11;end at :-1 output :
start at : 12;end at :-1 output :
start at : 13;end at :-1 output :
start at : 14;end at :-1 output :
start at : 15;end at :-1 output :
start at : 16;end at :-1 output :
start at : 17;end at :-1 output :
start at : 18;end at :-1 output :
start at : 19;end at :-1 output :
start at : 20;end at :-1 output :
start at : 21;end at :-1 output :
start at : 22;end at :-1 output :

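One way to picture what matches() does here: it checks the first phone number, which is valid (position at 10), and is about to continue (position moves to 11), but the quantifier allows only one occurrence. It reacts as if it had hit an error: the position stays where it is and matches() returns false. (This is the behavior observed in this run, not something the Javadoc documents.)

3. Matching "zero times":

If the quantifier allows zero occurrences, for example (((13)|(15))\\d{9})*, where the asterisk means zero or more times and greedy mode applies by default, the output is: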
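If the code should not depend on where a failed matches() leaves the matcher, one option is to call reset() before find(). A minimal sketch, with the hypothetical names ResetAfterMatchesDemo and firstFindStartAfterReset:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResetAfterMatchesDemo {
    // Runs matches() (result ignored), resets the matcher, then returns the start of the first find(), or -1.
    static int firstFindStartAfterReset(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.matches(); // may succeed or fail; either way the matcher's state changes
        m.reset();   // discard that state so find() starts at index 0
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // matches() fails here because of the x, yet find() still starts at 0 after reset().
        System.out.println(firstFindStartAfterReset("\\d{11}", "15883445706x15883445706"));              // 0
        // Same idea with the ?? pattern from the post: the first (empty) match is at 0.
        System.out.println(firstFindStartAfterReset("(((13)|(15))\\d{9})??", "1588344570615883445706")); // 0
    }
}
```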

start at : 0;end at :22 output : 1588344570615883445706
start at : 22;end at :-1 output :

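That is, after the "many times" match is complete, one more "zero times" match is found at the very end. This empty match starts at index 22, which equals str.length(): it is the end-of-input position, not a character (Java strings have no \0 terminator).

With (((13)|(15))\\d{9})?, where a single question mark means zero or one time in greedy mode, the result is: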

start at : 0;end at :11 output : 15883445706
start at : 11;end at :22 output : 15883445706
start at : 22;end at :-1 output :

Again, after two "one time" matches, there is one "zero times" match.

Change the code to

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "15883445706df15883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})?").matcher(str);		
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

The result is:

start at : 0;end at :11 output : 15883445706
start at : 11;end at :-1 output :
start at : 12;end at :-1 output :
start at : 13;end at :24 output : 15883445706
start at : 24;end at :-1 output :
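This shows that if "zero times" is allowed, find() succeeds with an empty match at every position where the pattern cannot match, including the end-of-input position (index str.length(); Java strings have no \0 terminator). Similarly, if the first character does not match, lookingAt() also succeeds with a "zero times" match at index 0.

It is easy to predict what happens when a quantifier that allows zero occurrences is combined with reluctant mode. Take this code: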
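To check where the empty matches land, the sketch below records the start index of each one and compares the last index with the string length. EndOfInputMatchDemo and emptyMatchStarts are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndOfInputMatchDemo {
    // Hypothetical helper: start index of every empty (zero-length) match.
    static List<Integer> emptyMatchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String str = "15883445706df15883445706";
        System.out.println(str.length());                                    // 24
        // The last empty match sits at index 24 == str.length(), past the final character.
        System.out.println(emptyMatchStarts("(((13)|(15))\\d{9})?", str));   // [11, 12, 24]
    }
}
```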

import java.util.regex.*;

public class RegExPrac {
	public static void main(String[] args)
	{
		String str = "1588344570615883445706";
		Matcher m = Pattern.compile("(((13)|(15))\\d{9})*?").matcher(str);		
		while(m.find())
		{
			System.out.println("start at : " + m.start() + ";end at :"
		                           + m.end(1) + " output : " + m.group());
		}
	}
}

*? means zero or more times in reluctant mode, so it settles for "zero times". The result is:

start at : 0;end at :-1 output :
start at : 1;end at :-1 output :
start at : 2;end at :-1 output :
start at : 3;end at :-1 output :
start at : 4;end at :-1 output :
start at : 5;end at :-1 output :
start at : 6;end at :-1 output :
start at : 7;end at :-1 output :
start at : 8;end at :-1 output :
start at : 9;end at :-1 output :
start at : 10;end at :-1 output :
start at : 11;end at :-1 output :
start at : 12;end at :-1 output :
start at : 13;end at :-1 output :
start at : 14;end at :-1 output :
start at : 15;end at :-1 output :
start at : 16;end at :-1 output :
start at : 17;end at :-1 output :
start at : 18;end at :-1 output :
start at : 19;end at :-1 output :
start at : 20;end at :-1 output :
start at : 21;end at :-1 output :
start at : 22;end at :-1 output :

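No phone number is returned at all; instead there is one "zero times" match at every position.

So "+?" looks like the best control here: one or more times in reluctant mode, which means exactly once. With find(), that is the same as leaving the quantifier off the group altogether.

4. Output

Output is done with group(), mentioned above, which returns the text of the current match. One of the three matching methods must first be called and return true so that there is a result; otherwise group() throws java.lang.IllegalStateException.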
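A small sketch of that rule: calling group() on a fresh Matcher throws, while guarding it with the boolean result is safe. GroupStateDemo, groupThrowsBeforeMatch and groupOrNull are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupStateDemo {
    // True if group() throws IllegalStateException when no match has been attempted yet.
    static boolean groupThrowsBeforeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group(); // no find(), lookingAt() or matches() has been called
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Safe pattern: only call group() after a matching method returned true.
    static String groupOrNull(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(groupThrowsBeforeMatch("1[35]\\d{9}", "15883445706")); // true
        System.out.println(groupOrNull("1[35]\\d{9}", "15883445706"));            // 15883445706
        System.out.println(groupOrNull("1[35]\\d{9}", "no digits here"));         // null
    }
}
```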

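start() and end() return the start and end positions of the current match in the source string. The end position is the index one past the last matched character, as is customary in Java.

group, start and end all have overloads that take a group number: group(int group), start(int group) and end(int group). The argument is the index of a capturing group, and it is valid after any successful find(), lookingAt() or matches(), not only after matches(). Groups are numbered by counting opening parentheses from left to right, and group 0 is the whole match. In (((13)|(15))\\d{9}), group 1 is the full number, group 2 is the 13-or-15 prefix, group 3 is (13) and group 4 is (15). A group that took no part in the match yields null from group(n) and -1 from start(n) and end(n), which is why end(1) prints -1 for the empty matches above.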
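To see how group numbers map onto the parentheses, the sketch below lists every group of the first match for the phone-number regex without a trailing quantifier. GroupNumberDemo and groups are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberDemo {
    // Returns group(0)..group(groupCount()) of the first match; an entry is null if that group took no part.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(((13)|(15))\\d{9})";
        // Opening parentheses, left to right: 1 = whole number, 2 = prefix, 3 = (13), 4 = (15)
        System.out.println(groups(regex, "15883445706")); // [15883445706, 15883445706, 15, null, 15]

        Matcher m = Pattern.compile(regex).matcher("15883445706");
        if (m.find()) {
            System.out.println(m.groupCount()); // 4 (group 0 is not counted)
            System.out.println(m.start(3));     // -1: the (13) branch was not used
            System.out.println(m.end(4));       // 2: the (15) branch matched indices 0..1
        }
    }
}
```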
