Regular Expressions

TimeMagician

Category: JavaSE Basics. Tags: regular expressions

Article link: https://blog.csdn.net/TimeMagician/article/details/79616934

The Pattern and Matcher Classes

Main reference: "我爱学Java之Pattern和Matcher用法" (I Love Learning Java: Using Pattern and Matcher); its code examples are used here directly.

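The Pattern and Matcher classes in the java.util.regex package make up Java's regular expression support.
A regular expression, that is, a string of characters with special meaning, must first be compiled into an instance of the Pattern class. That Pattern object then creates a Matcher instance through its matcher() method, and the Matcher performs the matching against the target string based on the compiled expression. Several Matcher instances can share one Pattern object.
Let's start with a simple example: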

Pattern pattern = Pattern.compile("Java");
String test1 = "Java";
Matcher matcher = pattern.matcher(test1);
System.out.println(matcher.matches()); // prints true

This example shows that Pattern and Matcher work together. The sections below cover each class in turn.

The Pattern Class

Commonly used methods:

Modifier and Type	Method	Description
static Pattern	compile(String regex)	Compiles the given regular expression into a pattern.
static Pattern	compile(String regex, int flags)	Compiles the given regular expression into a pattern with the given flags.
Matcher	matcher(CharSequence input)	Creates a matcher that will match the given input against this pattern.
static boolean	matches(String regex, CharSequence input)	Compiles the given regular expression and attempts to match the given input against it.
String[]	split(CharSequence input)	Splits the given input sequence around matches of this pattern.
String[]	split(CharSequence input, int limit)	Splits the given input sequence around matches of this pattern.

Creating a Pattern with the Factory Methods

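Start with the two compile() methods. A Pattern object represents a compiled regular expression, in other words a matching pattern. It is created through one of two static factory methods: compile(String regex) and compile(String regex, int flags), where regex is the regular expression and flags is an optional bit mask of match flags (for example, Pattern.CASE_INSENSITIVE for case-insensitive matching).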

Pattern pattern = Pattern.compile("Java");
System.out.println(pattern.pattern()); // prints the regular expression this pattern was compiled from: Java
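
The two-argument compile() is not demonstrated above, so here is a minimal sketch of it. The class name CompileFlagsDemo and the helper matchesIgnoringCase are made up for illustration; Pattern.CASE_INSENSITIVE is the flag mentioned in the text.

```java
import java.util.regex.Pattern;

public class CompileFlagsDemo {
    // Hypothetical helper: true when the whole input equals "java", ignoring case.
    static boolean matchesIgnoringCase(String input) {
        Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without flags, matching is case-sensitive.
        Pattern strict = Pattern.compile("java");
        System.out.println(strict.matcher("JAVA").matches()); // false

        // With CASE_INSENSITIVE, "JAVA" and "Java" both match.
        System.out.println(matchesIgnoringCase("JAVA")); // true
        System.out.println(matchesIgnoringCase("Java")); // true
    }
}
```

Several flags can be combined with the bitwise OR operator, for example Pattern.CASE_INSENSITIVE | Pattern.MULTILINE.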

Using the Pattern Class

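Pattern offers two methods for direct use: matches() and split().
matches(String regex, CharSequence input) matches the entire input against the regular expression and returns only a boolean. It is a shortcut that skips creating a Matcher yourself, but the drawback is that nothing else can be done with the result, and the expression is recompiled on every call. Usage: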

String test1 = "Java";
String test2 = "Java123456";

System.out.println(Pattern.matches("Java",test1)); // prints true
System.out.println(Pattern.matches("Java",test2)); // prints false
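
To make the "shortcut" concrete, the sketch below puts the one-step call next to the compile-then-match form it abbreviates. The class and method names (MatchesShortcutDemo, oneShot, precompiled) are illustrative only.

```java
import java.util.regex.Pattern;

public class MatchesShortcutDemo {
    // Compiled once and reused for every call to precompiled().
    private static final Pattern JAVA = Pattern.compile("Java");

    // One-step form: compiles the expression each time it is called.
    static boolean oneShot(String input) {
        return Pattern.matches("Java", input);
    }

    // Equivalent long form using the shared, precompiled Pattern.
    static boolean precompiled(String input) {
        return JAVA.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Java", "Java123456"}) {
            // Both forms give the same answer for the same input.
            System.out.println(s + ": " + oneShot(s) + " / " + precompiled(s));
        }
    }
}
```

When the same expression is applied many times, keeping a compiled Pattern in a constant avoids repeating the compilation.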

split(CharSequence input) works like the split method of the String class:

Pattern pattern = Pattern.compile("Java");
String test="123Java456Java789Java";
String[] result = pattern.split(test);
for(String s : result)
    System.out.println(s);

//-------------- Output ------------
123
456
789

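The split(CharSequence input, int limit) method deserves attention. With a positive limit n, the pattern is applied at most n - 1 times, so the result has at most n elements and the last element holds the rest of the input. When limit is larger than the number of pieces the input can yield, or is negative, the number of pieces is unlimited and trailing empty strings are kept. limit = 0 is equivalent to split(CharSequence input): the number of pieces is unlimited, but trailing empty strings are discarded. Example: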

Pattern pattern = Pattern.compile("Java");
String test = "123Java456Java789Java";

String[] result = pattern.split(test,2);
for(String s : result)
    System.out.println(s);

result = pattern.split(test,10);
System.out.println(result.length);

result = pattern.split(test,-2);
System.out.println(result.length);

result = pattern.split(test,0);
System.out.println(result.length);

//-------------------
123
456Java789Java
4
4
3

Connecting Pattern and Matcher

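The two classes are connected through matcher(CharSequence input). The Matcher class supports capturing groups and repeated matching over the same input, so richer matching operations require using Pattern and Matcher together: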

Pattern pattern = Pattern.compile("Java");
String test = "123Java456Java789Java";
Matcher matcher = pattern.matcher(test);
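
As noted earlier, several Matcher instances can share one Pattern. The following is a small sketch of that idea; the class name SharedPatternDemo and the helper countMatches are hypothetical, and it relies on the find() method described in the Matcher section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedPatternDemo {
    // Hypothetical helper: counts how often the pattern occurs in the input.
    // Each call creates its own Matcher from the shared Pattern.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Compile once, then match against different inputs.
        Pattern pattern = Pattern.compile("Java");
        System.out.println(countMatches(pattern, "123Java456Java789Java")); // 3
        System.out.println(countMatches(pattern, "no match here"));         // 0
    }
}
```

A Pattern is immutable and can be shared freely, while each Matcher holds the state of one search, so every input gets its own Matcher.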

The Matcher Class

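Main references: "Java 正则表达式" (Java Regular Expressions); "JAVA 中正则表达式的应用（二）" (Applying Regular Expressions in Java, Part 2)
The Matcher class acts as a bridge between the regular expression (held by the Pattern) and the text being searched.
A Matcher instance searches a target string for matches of an existing pattern (the regular expression compiled into a given Pattern). All input to a Matcher is supplied through the CharSequence interface, which lets it match against data from many kinds of sources.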

Modifier and Type	Method	Description
boolean	matches()	Attempts to match the entire region against the pattern.
boolean	lookingAt()	Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
boolean	find()	Attempts to find the next subsequence of the input sequence that matches the pattern.
boolean	find(int start)	Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
int	start()	Returns the start index of the previous match.
int	start(int group)	Returns the start index of the subsequence captured by the given group during the previous match operation.
int	end()	Returns the offset after the last character matched.
int	end(int group)	Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
String	replaceAll(String replacement)	Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
String	replaceFirst(String replacement)	Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
Matcher	appendReplacement(StringBuilder sb, String replacement)	Implements a non-terminal append-and-replace step.
StringBuilder	appendTail(StringBuilder sb)	Implements a terminal append-and-replace step.

The methods of Matcher fall into three groups: search methods, index methods, and replacement methods.

Search Methods

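matches(), lookingAt(), and find() are the search methods. Each returns a boolean indicating whether a match was found.
matches()
Whole-input matching: the entire search text must match the regular expression; otherwise it returns false.
lookingAt()
Prefix matching: the match must start at the beginning of the search text, but it does not have to consume all of it. If the text does not start with a match, it returns false.
The following example shows the difference: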

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static final String INPUT2 = "ooooofoooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;
    private static Matcher matcher2;

    public static void main( String args[] ){
       pattern = Pattern.compile(REGEX);
       matcher = pattern.matcher(INPUT);
       matcher2 = pattern.matcher(INPUT2);

       System.out.println("Current REGEX is: "+REGEX);
       System.out.println("Current INPUT is: "+INPUT);
       System.out.println("Current INPUT2 is: "+INPUT2);


       System.out.println("lookingAt(): "+matcher.lookingAt());
       System.out.println("matches(): "+matcher.matches());
       System.out.println("lookingAt(): "+matcher2.lookingAt());
   }
}


//-------------------- Output
Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
Current INPUT2 is: ooooofoooooooooooo
lookingAt(): true
matches(): false
lookingAt(): false

find()
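This method handles multiple matches: when the search text contains several substrings that match the regular expression, find() locates them one at a time. An overload, find(int start), resets the matcher and begins searching at the given character index. find() stops as soon as it finds a match, at which point the index methods can be called on that match; each later call resumes where the previous match ended. It is typically called in a while loop.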
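
The find(int start) overload has no example in this article, so here is a minimal sketch. FindFromIndexDemo and matchStarts are names invented for this illustration; the input string is the one used in the earlier split examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindFromIndexDemo {
    // Hypothetical helper: collects the start index of every match
    // located at or after position `from`.
    static List<Integer> matchStarts(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        // find(int) resets the matcher and searches from `from`;
        // the plain find() calls then continue after each match.
        if (m.find(from)) {
            do {
                starts.add(m.start());
            } while (m.find());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "123Java456Java789Java";
        System.out.println(matchStarts("Java", input, 0)); // [3, 10, 17]
        System.out.println(matchStarts("Java", input, 4)); // [10, 17]
    }
}
```

Note that find(int start) throws IndexOutOfBoundsException when start is negative or greater than the input length.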

Index Methods

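The index methods start() and end() are usually called after a successful search method. start() returns the index of the first character of the match, and end() returns the index just past its last character. Usage: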

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT =
                                    "cat cat cat cattie cat";

    public static void main( String args[] ){
       Pattern p = Pattern.compile(REGEX);
       Matcher m = p.matcher(INPUT); // obtain a Matcher object
       int count = 0;

       while(m.find()) {
         count++;
         System.out.println("Match number "+count);
         System.out.println("start(): "+m.start());
         System.out.println("end(): "+m.end());
      }
   }
}


//------------------
Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22

Capturing Groups in Regular Expressions

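A group is a part of a regular expression enclosed in parentheses, and it can be referenced by its group number. Group 0 is the whole expression, group 1 is the group opened by the first left parenthesis, and so on. For example, in A(B(C))D, group 0 is ABCD, group 1 is BC, and group 2 is C.
Grouping treats several characters as a single unit; a group is created by enclosing those characters in parentheses.
Call the groupCount() method of a Matcher to see how many groups the expression has. It returns an int: the number of capturing groups in the matcher's pattern.
The special group 0 (group(0)) always stands for the whole expression and is not included in the value returned by groupCount().
For a Matcher object m, m.group(int x) returns, as a String, the text captured by group x in the most recent match (for example, the latest find()).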

This leads to start(int group) and end(int group), which return the start and end indices of a specific group.
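
The A(B(C))D example above can be checked with a short sketch. GroupDemo and describe are names made up for this illustration, and the input "xxABCDxx" is an assumption chosen so that the group indices are not all zero-based.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: for the first match, lists each group's number,
    // captured text, and [start, end) index range, one group per line.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        if (m.find()) {
            // groupCount() excludes group 0, so loop with <= to include it.
            for (int g = 0; g <= m.groupCount(); g++) {
                out.append(g).append(": ").append(m.group(g))
                   .append(" [").append(m.start(g)).append(", ")
                   .append(m.end(g)).append(")\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("A(B(C))D", "xxABCDxx"));
        // 0: ABCD [2, 6)
        // 1: BC [3, 5)
        // 2: C [4, 5)
    }
}
```

Here groupCount() returns 2, while the loop prints three lines because group 0 is handled in addition to the two capturing groups.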

Replacement Methods

replaceAll(String replacement)
replaceFirst(String replacement)
appendReplacement(StringBuilder sb, String replacement)
appendTail(StringBuilder sb)
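make up the replacement methods of the Matcher class. replaceAll() and replaceFirst() reset the matcher and scan the input themselves, so they need no prior search call. appendReplacement() and appendTail() deserve a closer look: appendReplacement() must be called after a successful match, typically inside a find() loop. Both accept either a StringBuffer or, since Java 9, a StringBuilder.
appendReplacement(StringBuffer sb, String replacement) replaces the current match with the given string and appends to the StringBuffer the text between the end of the previous match and the current match, followed by the replacement. appendTail(StringBuffer sb) appends whatever remains of the input after the last match. An example makes this easier to follow: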

// This example replaces "Kelvin" with "XXX" in a sentence
import java.util.regex.*;
public class Test{
   public static void main(String[] args)
                        throws Exception {
       // Create a Pattern by compiling the simple regular expression "Kelvin"
       Pattern p = Pattern.compile("Kelvin");
       // Use the Pattern's matcher() method to create a Matcher
       Matcher m = p.matcher("Kelvin Li and Kelvin Chan are both working in " +
           "Kelvin Chen's KelvinSoftShop company");
       StringBuffer sb = new StringBuffer();
       int i=0;
       // Use find() to look for the first match
       boolean result = m.find();
       // Loop over every "Kelvin", replace it, and append the result to sb
       while(result) {
           i++;
           m.appendReplacement(sb, "XXX");
           System.out.println("sb after match " + i + ": " + sb);
           // Look for the next match
           result = m.find();
       }
       // Finally, call appendTail() to append the text after the last match to sb
       m.appendTail(sb);
       System.out.println("sb after m.appendTail(sb): " +
           sb.toString());
   }
}


//------------------------ Output ------------
//sb after match 1: XXX
//sb after match 2: XXX Li and XXX
//sb after match 3: XXX Li and XXX Chan are both working in XXX
//sb after match 4: XXX Li and XXX Chan are both working in XXX Chen's XXX
//sb after m.appendTail(sb): XXX Li and XXX Chan are both working in XXX Chen's XXXSoftShop company
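
For comparison, replaceAll() and replaceFirst() do the whole job in one call. The sketch below is only an illustration; ReplaceDemo and its two helpers are hypothetical names, and the input is a shortened version of the sentence above.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    private static final Pattern KELVIN = Pattern.compile("Kelvin");

    // Replaces every occurrence of "Kelvin" with "XXX".
    static String replaceEvery(String input) {
        return KELVIN.matcher(input).replaceAll("XXX");
    }

    // Replaces only the first occurrence of "Kelvin" with "XXX".
    static String replaceOnlyFirst(String input) {
        return KELVIN.matcher(input).replaceFirst("XXX");
    }

    public static void main(String[] args) {
        String input = "Kelvin Li and Kelvin Chan";
        System.out.println(replaceEvery(input));     // XXX Li and XXX Chan
        System.out.println(replaceOnlyFirst(input)); // XXX Li and Kelvin Chan
    }
}
```

In the replacement string, $ and \ have special meaning (for example, $1 refers to group 1); Matcher.quoteReplacement() can be used when the replacement should be taken literally.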

Regular Expression Syntax

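Regular expressions have their own syntax rules; see the official documentation or search online for the details. One point worth mentioning here: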

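In other languages, \\ means: insert a plain (literal) backslash into the regular expression, with no special meaning attached.
In Java, \\ means: insert a regular-expression backslash, so the character that follows has special meaning.
So where a single backslash \ is enough to escape in other languages, a Java string literal needs two. Put simply, two backslashes in Java source stand for one backslash in the regular expression. That is why the expression for a single digit is written \\d, and the expression for a literal backslash is written \\\\.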

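The reason is that a Java regular expression is stored in a string literal ("regex"). The Java compiler processes the literal's escape sequences first, so \\ in source becomes a single \ in the string, and only that single backslash reaches the regular expression compiler.
For details, see: "理解 Java 正则表达式怪异的 \ 和 \\，让您见怪不怪" (Understanding the Odd \ and \\ in Java Regular Expressions)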
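
The two levels of escaping can be seen in a short sketch. BackslashDemo and its helper methods are names invented for this illustration.

```java
import java.util.regex.Pattern;

public class BackslashDemo {
    // The Java literal "\\d" is the two-character string \d,
    // which the regex engine reads as "one digit".
    static boolean isSingleDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // The Java literal "\\\\" is the two-character string \\,
    // which the regex engine reads as "one literal backslash".
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println("\\d".length());          // 2
        System.out.println(isSingleDigit("7"));      // true
        System.out.println("\\\\".length());         // 2
        // The input "\\" is itself a Java literal for a single backslash.
        System.out.println(isSingleBackslash("\\")); // true
    }
}
```

The length() calls show what the Java compiler leaves in the string before the regular expression engine ever sees it.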