Understanding Java: Using Pattern and Regular Expressions

Latest recommended article published on 2024-04-11 10:59:46

lidf2007

Category: CoreJava. Tag: java

Article link: https://blog.csdn.net/lidf2007/article/details/84524787

1. Overview

Regular Expressions

String processing: character matching, searching, and replacing.

2. JDK support

Relevant JDK classes: java.lang.String, java.util.regex.Pattern, java.util.regex.Matcher

Summary of regular-expression constructs

Construct	Matches

Characters
x	The character x
`\\`	The backslash character
`\0`n	The character with octal value `0`n (0 `<=` n `<=` 7)
`\0`nn	The character with octal value `0`nn (0 `<=` n `<=` 7)
`\0`mnn	The character with octal value `0`mnn (0 `<=` m `<=` 3, 0 `<=` n `<=` 7)
`\x`hh	The character with hexadecimal value `0x`hh
`\u`hhhh	The character with hexadecimal value `0x`hhhh
`\t`	The tab character (`'\u0009'`)
`\n`	The newline (line feed) character (`'\u000A'`)
`\r`	The carriage-return character (`'\u000D'`)
`\f`	The form-feed character (`'\u000C'`)
`\a`	The alert (bell) character (`'\u0007'`)
`\e`	The escape character (`'\u001B'`)
`\c`x	The control character corresponding to x

Character classes
`[abc]`	`a`, `b`, or `c` (simple class)
`[^abc]`	Any character except `a`, `b`, or `c` (negation)
`[a-zA-Z]`	`a` through `z` or `A` through `Z`, inclusive (range)
`[a-d[m-p]]`	`a` through `d`, or `m` through `p`: `[a-dm-p]` (union)
`[a-z&&[def]]`	`d`, `e`, or `f` (intersection)
`[a-z&&[^bc]]`	`a` through `z`, except for `b` and `c`: `[ad-z]` (subtraction)
`[a-z&&[^m-p]]`	`a` through `z`, and not `m` through `p`: `[a-lq-z]` (subtraction)

Predefined character classes
`.`	Any character (may or may not match line terminators)
`\d`	A digit: `[0-9]`
`\D`	A non-digit: `[^0-9]`
`\s`	A whitespace character: `[ \t\n\x0B\f\r]`
`\S`	A non-whitespace character: `[^\s]`
`\w`	A word character: `[a-zA-Z_0-9]`
`\W`	A non-word character: `[^\w]`

POSIX character classes (US-ASCII only)
`\p{Lower}`	A lower-case alphabetic character: `[a-z]`
`\p{Upper}`	An upper-case alphabetic character: `[A-Z]`
`\p{ASCII}`	All ASCII: `[\x00-\x7F]`
`\p{Alpha}`	An alphabetic character: `[\p{Lower}\p{Upper}]`
`\p{Digit}`	A decimal digit: `[0-9]`
`\p{Alnum}`	An alphanumeric character: `[\p{Alpha}\p{Digit}]`
`\p{Punct}`	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
`\p{Graph}`	A visible character: `[\p{Alnum}\p{Punct}]`
`\p{Print}`	A printable character: `[\p{Graph}\x20]`
`\p{Blank}`	A space or a tab: `[ \t]`
`\p{Cntrl}`	A control character: `[\x00-\x1F\x7F]`
`\p{XDigit}`	A hexadecimal digit: `[0-9a-fA-F]`
`\p{Space}`	A whitespace character: `[ \t\n\x0B\f\r]`

java.lang.Character classes (simple java character type)
`\p{javaLowerCase}`	Equivalent to java.lang.Character.isLowerCase()
`\p{javaUpperCase}`	Equivalent to java.lang.Character.isUpperCase()
`\p{javaWhitespace}`	Equivalent to java.lang.Character.isWhitespace()
`\p{javaMirrored}`	Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories
`\p{InGreek}`	A character in the Greek block (simple block)
`\p{Lu}`	An uppercase letter (simple category)
`\p{Sc}`	A currency symbol
`\P{InGreek}`	Any character except one in the Greek block (negation)
`[\p{L}&&[^\p{Lu}]]`	Any letter except an uppercase letter (subtraction)

Boundary matchers
`^`	The beginning of a line
`$`	The end of a line
`\b`	A word boundary
`\B`	A non-word boundary
`\A`	The beginning of the input
`\G`	The end of the previous match
`\Z`	The end of the input but for the final terminator, if any
`\z`	The end of the input

Greedy quantifiers
X`?`	X, once or not at all
X`*`	X, zero or more times
X`+`	X, one or more times
X`{`n`}`	X, exactly n times
X`{`n`,}`	X, at least n times
X`{`n`,`m`}`	X, at least n but not more than m times

Reluctant quantifiers
X`??`	X, once or not at all
X`*?`	X, zero or more times
X`+?`	X, one or more times
X`{`n`}?`	X, exactly n times
X`{`n`,}?`	X, at least n times
X`{`n`,`m`}?`	X, at least n but not more than m times

Possessive quantifiers
X`?+`	X, once or not at all
X`*+`	X, zero or more times
X`++`	X, one or more times
X`{`n`}+`	X, exactly n times
X`{`n`,}+`	X, at least n times
X`{`n`,`m`}+`	X, at least n but not more than m times

Logical operators
XY	X followed by Y
X`|`Y	Either X or Y
`(`X`)`	X, as a capturing group

Back references
`\`n	Whatever the nth capturing group matched

Quotation
`\`	Nothing, but quotes the following character
`\Q`	Nothing, but quotes all characters until `\E`
`\E`	Nothing, but ends quoting started by `\Q`

Special constructs (non-capturing)
`(?:`X`)`	X, as a non-capturing group
`(?idmsux-idmsux)`	Nothing, but turns match flags i d m s u x on - off
`(?idmsux-idmsux:`X`)`	X, as a non-capturing group with the given flags i d m s u x on - off
`(?=`X`)`	X, via zero-width positive lookahead
`(?!`X`)`	X, via zero-width negative lookahead
`(?<=`X`)`	X, via zero-width positive lookbehind
`(?<!`X`)`	X, via zero-width negative lookbehind
`(?>`X`)`	X, as an independent, non-capturing group
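
A few of the constructs in the table are easier to grasp when run. The following is a minimal sketch; the class name, method names, and sample strings are made up for illustration and are not part of the JDK. It exercises a character-class intersection, a back reference, and a zero-width positive lookahead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConstructSampler {
    // Character-class intersection: [a-z&&[^bc]] means a through z, except b and c.
    static boolean isLowerExceptBc(String s) {
        return Pattern.matches("[a-z&&[^bc]]+", s);
    }

    // Back reference: \1 must repeat exactly what group 1 matched.
    static boolean hasDoubledWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    // Zero-width positive lookahead: digits that are followed by "px",
    // without "px" becoming part of the match. Returns null if nothing matches.
    static String digitsBeforePx(String s) {
        Matcher m = Pattern.compile("\\d+(?=px)").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isLowerExceptBc("adz"));          // true
        System.out.println(isLowerExceptBc("abc"));          // false
        System.out.println(hasDoubledWord("it is is here")); // true
        System.out.println(digitsBeforePx("width: 120px"));  // 120
    }
}
```

Note that every backslash in the table has to be doubled inside a Java string literal, which is why `\d` is written `"\\d"` in the code.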

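Greedy, reluctant, and possessive quantifiers

The reluctant and possessive forms are the greedy quantifiers with an extra ? or + appended. They allow the same number of repetitions; what differs is how the matcher searches for a match.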

A greedy quantifier starts by looking at the entire string for a match. If no match is found, it eliminates the last character in the string and tries again. If a match is still not found, the last character is again discarded and the process repeats until a match is found or the string is left with no characters. All the quantifiers discussed to this point have been greedy.

A reluctant quantifier starts by looking at the first character in the string for a match. If that character alone isn't enough, it reads in the next character, forming a string of two characters. If still no match is found, a reluctant quantifier continues to add characters from the string until either a match is found or the entire string is checked without a match. Reluctant quantifiers work in reverse of greedy quantifiers.

A possessive quantifier only tries to match against the entire string: it takes as much of the input as it can and never gives any of it back. If that doesn't produce a match, no further attempt is made. Possessive quantifiers are, in a manner of speaking, a one-shot deal.

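Greedy quantifiers are called "greedy" because they force the matcher to read in (or eat) the entire input string before attempting the first match. If that first attempt (against the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.

Reluctant quantifiers take the opposite approach: they start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.

Finally, possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.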
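
The difference between the three kinds is easiest to see by running the same input through `.*foo`, `.*?foo`, and `.*+foo`. This is a small sketch; the class name, the helper `findAll`, and the input string are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every match of regex in input, in the order find() reports them.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* eats everything, then backs off until "foo" fits at the end.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time, so the first "foo" ends the match.
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ eats everything and never backs off, so "foo" can never match.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```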

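The Matcher class:
To use the Matcher class, the most important concept to understand is the group. In a regular expression, a pair of parentheses () defines a group. Because one regular expression can contain many groups, let's first look at how groups are delimited and how they map to group indexes.

Here is a small example to illustrate:

\w(\d\d)(\w+)

This regular expression has three groups, counting the whole expression as group 0:
The whole expression \w(\d\d)(\w+) is group 0: group(0)
(\d\d) is group 1: group(1)
(\w+) is group 2: group(2)

Now take a string that matches this regular expression, x99SuperJava:
group(0) is the part matched by the whole expression: x99SuperJava
group(1) is the part matched by group 1, (\d\d): 99
group(2) is the part matched by group 2, (\w+): SuperJava

Let's write a program to verify this: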

package edu.jlu.fuliang;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
	public static void main(String[] args) {
		String regex = "\\w(\\d\\d)(\\w+)";
		String candidate = "x99SuperJava";
		
		Pattern p = Pattern.compile(regex);
		Matcher matcher = p.matcher(candidate);
		if(matcher.find()){
			int gc = matcher.groupCount();
			for(int i = 0; i <= gc; i++)
				System.out.println("group " + i + " :" + matcher.group(i));
		}
	}
}

Output:

group 0 :x99SuperJava
group 1 :99
group 2 :SuperJava

Now let's look at the methods the Matcher class provides:
public Pattern pattern()
Returns the Pattern object that created this Matcher.

Here is a small example that shows this:

import java.util.regex.*;
public class MatcherPatternExample{
  public static void main(String args[]){
      test();
  }
  public static void test(){
     Pattern p = Pattern.compile("\\d");
     Matcher m1 = p.matcher("55");
     Matcher m2 = p.matcher("fdshfdgdfh");
     System.out.println(m1.pattern() == m2.pattern());
     //prints true: both matchers were created from the same Pattern
  }
}

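public Matcher reset()
Resets the Matcher to its initial state.

public Matcher reset(CharSequence input)
Resets the Matcher and sets input as the new character sequence to match against. The effect is the same as creating a new Matcher, except that the existing object is reused.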
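
A short sketch of both reset variants, assuming a one-digit pattern and two made-up inputs; the class name and the helper `next` are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherResetDemo {
    // Returns the next match, or null when find() fails.
    static String next(Matcher m) {
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\d").matcher("a1b2");
        System.out.println(next(m)); // 1
        System.out.println(next(m)); // 2

        m.reset();                   // same input, search position back at the start
        System.out.println(next(m)); // 1

        m.reset("x7");               // same Matcher object, new input
        System.out.println(next(m)); // 7
    }
}
```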

public int start()
Returns the start index, within the whole input string, of the substring the Matcher matched.
Here is a small example:

import java.util.regex.*;
public class MatcherStartExample{
  public static void main(String args[]){
      test();
  }
  public static void test(){
     //create a Matcher and use the Matcher.start() method
     String candidateString = "My name is Bond. James Bond.";
     //carets placed under the 'B' of each "Bond" (indexes 11 and 23)
     String matchHelper[] =
      {"           ^","                       ^"};
     Pattern p = Pattern.compile("Bond");
     Matcher matcher = p.matcher(candidateString);
     //Find the starting point of the first 'Bond'
      matcher.find();
      int startIndex = matcher.start();
      System.out.println(candidateString);
      System.out.println(matchHelper[0] + startIndex);
     //Find the starting point of the second 'Bond'
      matcher.find();
      int nextIndex = matcher.start();
      System.out.println(candidateString);
      System.out.println(matchHelper[1] + nextIndex);
  }
}

Output:
My name is Bond. James Bond.
           ^11
My name is Bond. James Bond.
                       ^23

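public int start(int group)
Lets you specify the subgroup you are interested in and returns the start index of that subgroup's match.

public int end()
The counterpart of start(): returns the offset after the last character matched.
start and end are often used together to extract the matched substring.

public int end(int group)
The counterpart of public int start(int group): returns the offset after the last character of the subsequence captured by the given group during the previous match operation, that is, the index of the subgroup's last matched character plus one.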
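
To make the offsets concrete, here is a small sketch that reads all four values for the first match of B(ond). The class name and the helper `firstOffsets` are illustrative, and the helper assumes the regex contains at least one capturing group.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherOffsetsDemo {
    // Returns {start(), end(), start(1), end(1)} for the first match, or null if none.
    // Assumes regex has at least one capturing group.
    static int[] firstOffsets(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(), m.end(), m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String input = "My name is Bond. James Bond.";
        int[] o = firstOffsets("B(ond)", input);
        System.out.println(Arrays.toString(o));          // [11, 15, 12, 15]
        // end is exclusive, so substring(start, end) gives back the matched text.
        System.out.println(input.substring(o[0], o[1])); // Bond
        System.out.println(input.substring(o[2], o[3])); // ond
    }
}
```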

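public String group()
Returns the input subsequence matched by the previous match operation.
This is a convenient shortcut: it is equivalent to calling start and end and then calling substring(start, end) on the input string.
Here is a small example: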

import java.util.regex.*;
public class MatcherGroupExample{
  public static void main(String args[]){
      test();
  }
  public static void test(){
      //create a Pattern
      Pattern p = Pattern.compile("Bond");
      //create a Matcher and use the Matcher.group() method
      String candidateString = "My name is Bond. James Bond.";
      Matcher matcher = p.matcher(candidateString);
      //extract the group
      matcher.find();
      System.out.println(matcher.group());
  }
}

public String group(int group)
Returns the part of the input string matched by the specified group.
Because these two methods are used so often, here is another small example:

import java.util.regex.*;
public class MatcherGroupParamExample{
  public static void main(String args[]){
      test();
  }
  public static void test(){
     //create a Pattern with one capturing group
      Pattern p = Pattern.compile("B(ond)");
     //create a Matcher and use the Matcher.group(int) method
     String candidateString = "My name is Bond. James Bond.";
     Matcher matcher = p.matcher(candidateString);
     //Find groups 0 and 1 of the first match
      matcher.find();
      String group_0 = matcher.group(0);
      String group_1 = matcher.group(1);
      System.out.println("Group 0 " + group_0);
      System.out.println("Group 1 " + group_1);
      System.out.println(candidateString);
     //Find groups 0 and 1 of the second match
      matcher.find();
      group_0 = matcher.group(0);
      group_1 = matcher.group(1);
      System.out.println("Group 0 " + group_0);
      System.out.println("Group 1 " + group_1);
      System.out.println(candidateString);
  }
}

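public int groupCount()

Returns the number of capturing groups in this matcher's pattern. Group 0, the whole match, is not included in this count.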
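
A quick sketch of what groupCount reports, reusing the \w(\d\d)(\w+) pattern from earlier. The class and helper names are made up for illustration. The second half shows a related detail: a group that exists in the pattern but did not take part in the match returns null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // groupCount depends only on the pattern, so no match attempt is needed.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Two pairs of parentheses, so 2; group 0 is not counted.
        System.out.println(countGroups("\\w(\\d\\d)(\\w+)")); // 2

        // Both groups are counted even though only one can participate in a match.
        Matcher alt = Pattern.compile("(a)|(b)").matcher("b");
        System.out.println(alt.matches());    // true
        System.out.println(alt.groupCount()); // 2
        System.out.println(alt.group(1));     // null
        System.out.println(alt.group(2));     // b
    }
}
```

This is why loops over groups run from 0 through groupCount() inclusive, as in the RegexTest program above.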


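public boolean matches()

Attempts to match the entire region against the pattern. The whole input string must match the regular expression.

This differs from find, which searches the input string for a matching substring.

public boolean find()

find searches the input for the next substring that matches. The usual way to use find is:

 while(matcher.find()){
    //for the current match, use group(), appendReplacement(), etc. to inspect or replace it
 }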

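public boolean find(int start)
Resets the matcher and then searches starting at the specified start index of the input string.

public boolean lookingAt()
Essentially a less strict version of matches: it attempts to match the input sequence against the pattern starting at the beginning of the region, but does not require the whole region to match.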
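
The four matching methods can be compared side by side. The sketch below uses a fresh Matcher for each call so that one call's state does not affect the next; the class name, helper names, and inputs are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModesDemo {
    // matches(): the whole input must match.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the match must start at the beginning, but may stop early.
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // find(): the match may be anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // find(int): start index of the first match at or after 'from', or -1 if none.
    // 'from' must lie between 0 and input.length(), otherwise find(int) throws.
    static int findFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(whole("\\d+", "123"));       // true
        System.out.println(whole("\\d+", "123abc"));    // false
        System.out.println(prefix("\\d+", "123abc"));   // true
        System.out.println(prefix("\\d+", "abc123"));   // false
        System.out.println(anywhere("\\d+", "abc123")); // true
        System.out.println(findFrom("Bond", "My name is Bond. James Bond.", 12)); // 23
    }
}
```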

public Matcher appendReplacement(StringBuffer sb, String replacement)
Suppose you want to replace Bond with Smith in My name is Bond. James Bond. I would like a martini.

StringBuffer sb = new StringBuffer();
String replacement = "Smith";
Pattern pattern = Pattern.compile("Bond");
Matcher matcher = pattern.matcher("My name is Bond. James Bond. I would like a martini.");
while(matcher.find()){
  matcher.appendReplacement(sb,replacement);//after the loop, sb holds: My name is Smith. James Smith
}

The Matcher object keeps track of the append position, which is why we can call appendReplacement repeatedly to replace every match.

public StringBuffer appendTail(StringBuffer sb)
Simply appends the unmatched tail of the input to the StringBuffer. Add one more statement at the end of the previous example:
matcher.appendTail(sb);
and the result becomes My name is Smith. James Smith. I would like a martini.
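
Where the appendReplacement and appendTail pair pays off is when the replacement has to be computed from each match. The following is a sketch under that assumption: the class and method names are made up, and upper-casing is just a stand-in for any per-match computation. Matcher.quoteReplacement is used so that a `$` or `\` in the computed text is not interpreted as a group reference.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Replaces every match of regex in input with its upper-case form.
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group().toUpperCase(Locale.ROOT);
            // Appends the text between the previous match and this one, then the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        // Appends whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "My name is Bond. James Bond. I would like a martini.";
        System.out.println(upperCaseMatches("Bond", input));
        // My name is BOND. James BOND. I would like a martini.
    }
}
```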

public String replaceAll(String replacement)
A more convenient method: if we want to replace every match, we can simply call replaceAll. It is a shortcut for:

while(matcher.find()){
  matcher.appendReplacement(sb,replacement);//after the loop, sb holds: My name is Smith. James Smith
}
matcher.appendTail(sb);

public String replaceFirst(String replacement)

The counterpart of replaceAll and easy to understand: it replaces only the first match.
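
Both methods side by side, as a small sketch with illustrative class and helper names. The last call also shows that the replacement string may refer to captured groups with $1, $2, and so on.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    // Replaces every match of regex in input.
    static String replaceAllIn(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Replaces only the first match of regex in input.
    static String replaceFirstIn(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String input = "My name is Bond. James Bond. I would like a martini.";
        System.out.println(replaceAllIn("Bond", input, "Smith"));
        // My name is Smith. James Smith. I would like a martini.
        System.out.println(replaceFirstIn("Bond", input, "Smith"));
        // My name is Smith. James Bond. I would like a martini.

        // $1 and $2 in the replacement refer to the captured groups.
        System.out.println(replaceAllIn("(\\w+) (\\w+)", "James Bond", "$2 $1"));
        // Bond James
    }
}
```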
