1.简单了解
Regular Expressions
字符串处理:字符匹配、查找、替换
2.JDK支持
相关jdk类:java.lang.String,java.util.regex.Pattern,java.util.regex.Matcher
Summary of regular-expression constructs
Construct | Matches |
---|---|
Characters | |
x | The character x |
\\ | The backslash character |
\0n | The character with octal value 0n (0 <= n <= 7) |
\0nn | The character with octal value 0nn (0 <= n <= 7) |
\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
\xhh | The character with hexadecimal value 0xhh |
\uhhhh | The character with hexadecimal value 0xhhhh |
\t | The tab character ('\u0009') |
\n | The newline (line feed) character ('\u000A') |
\r | The carriage-return character ('\u000D') |
\f | The form-feed character ('\u000C') |
\a | The alert (bell) character ('\u0007') |
\e | The escape character ('\u001B') |
\cx | The control character corresponding to x |
Character classes | |
[abc] | a, b, or c (simple class) |
[^abc] | Any character except a, b, or c (negation) |
[a-zA-Z] | a through z or A through Z, inclusive (range) |
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) |
[a-z&&[def]] | d, e, or f (intersection) |
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) |
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z](subtraction) |
Predefined character classes | |
. | Any character (may or may not match line terminators) |
\d | A digit: [0-9] |
\D | A non-digit: [^0-9] |
\s | A whitespace character: [ \t\n\x0B\f\r] |
\S | A non-whitespace character: [^\s] |
\w | A word character: [a-zA-Z_0-9] 单词字符 |
\W | A non-word character: [^\w] |
POSIX character classes (US-ASCII only) | |
\p{Lower} | A lower-case alphabetic character: [a-z] |
\p{Upper} | An upper-case alphabetic character:[A-Z] |
\p{ASCII} | All ASCII:[\x00-\x7F] |
\p{Alpha} | An alphabetic character:[\p{Lower}\p{Upper}] |
\p{Digit} | A decimal digit: [0-9] |
\p{Alnum} | An alphanumeric character:[\p{Alpha}\p{Digit}] |
\p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
\p{Graph} | A visible character: [\p{Alnum}\p{Punct}] |
\p{Print} | A printable character: [\p{Graph}\x20] |
\p{Blank} | A space or a tab: [ \t] |
\p{Cntrl} | A control character: [\x00-\x1F\x7F] |
\p{XDigit} | A hexadecimal digit: [0-9a-fA-F] |
\p{Space} | A whitespace character: [ \t\n\x0B\f\r] |
java.lang.Character classes (simple java character type) | |
\p{javaLowerCase} | Equivalent to java.lang.Character.isLowerCase() |
\p{javaUpperCase} | Equivalent to java.lang.Character.isUpperCase() |
\p{javaWhitespace} | Equivalent to java.lang.Character.isWhitespace() |
\p{javaMirrored} | Equivalent to java.lang.Character.isMirrored() |
Classes for Unicode blocks and categories | |
\p{InGreek} | A character in the Greek block (simple block) |
\p{Lu} | An uppercase letter (simple category) |
\p{Sc} | A currency symbol |
\P{InGreek} | Any character except one in the Greek block (negation) |
[\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
Boundary matchers | |
^ | The beginning of a line |
$ | The end of a line |
\b | A word boundary |
\B | A non-word boundary |
\A | The beginning of the input |
\G | The end of the previous match |
\Z | The end of the input but for the final terminator, if any |
\z | The end of the input |
Greedy quantifiers | |
X? | X, once or not at all |
X* | X, zero or more times |
X+ | X, one or more times |
X{n} | X, exactly n times |
X{n,} | X, at least n times |
X{n,m} | X, at least n but not more than m times |
Reluctant quantifiers | |
X?? | X, once or not at all |
X*? | X, zero or more times |
X+? | X, one or more times |
X{n}? | X, exactly n times |
X{n,}? | X, at least n times |
X{n,m}? | X, at least n but not more than m times |
Possessive quantifiers | |
X?+ | X, once or not at all |
X*+ | X, zero or more times |
X++ | X, one or more times |
X{n}+ | X, exactly n times |
X{n,}+ | X, at least n times |
X{n,m}+ | X, at least n but not more than m times |
Logical operators | |
XY | X followed by Y |
X|Y | Either X or Y |
(X) | X, as a capturing group |
Back references | |
\n | Whatever the nth capturing group matched |
Quotation | |
\ | Nothing, but quotes the following character |
\Q | Nothing, but quotes all characters until \E |
\E | Nothing, but ends quoting started by \Q |
Special constructs (non-capturing) | |
(?:X) | X, as a non-capturing group |
(?idmsux-idmsux) | Nothing, but turns match flags i d m s u x on - off |
(?idmsux-idmsux:X) | X, as a non-capturing group with the given flags i d m s u x on - off |
(?=X) | X, via zero-width positive lookahead |
(?!X) | X, via zero-width negative lookahead |
(?<=X) | X, via zero-width positive lookbehind |
(?<!X) | X, via zero-width negative lookbehind |
(?>X) | X, as an independent, non-capturing group |
greedy quantifiers and reluctant quantifiers and possessive quantifiers
贪婪量词,饥饿量词、占有量词,多加了?号和+号,表示的意思是一样的。匹配不同
A greedy quantifier starts by looking at the entire string for a match. If no match is found, it eliminates
the last character in the string and tries again. If a match is still not found, the last character is again
discarded and the process repeats until a match is found or the string is left with no characters. All the
quantifiers discussed to this point have been greedy.
A reluctant quantifier starts by looking at the first character in the string for a match. If that character
alone isn’t enough, it reads in the next character, forming a string of two characters. If still no match isfound, a reluctant quantifier continues to add characters from the string until either a match is found or
the entire string is checked without a match. Reluctant quantifiers work in reverse of greedy quantifiers.
A Possessive quantifier only tries to match against the entire string. If the entire string doesn’t produce a match, no further attempt is made. Possessive quantifiers are, in a manner of speaking, a one-shot deal.
贪婪量词之所以称之为"贪婪的",是由于它们强迫匹配器读入(或者称之为吃掉)整个输入的字符串,来优先尝试第一次匹配,如果第一次尝试匹配(对整个输入的字符串)失败,匹配器会通过回退整个字符串的一个字符再一次进行尝试,不断的进行处理直到找到一个匹配,或者左边没有更多的字符用来回退了。赖于在表达式中使用的量词,最终它将尝试地靠着1或0个字符的匹配。
但是,勉强量词采用相反的路径:从输入字符串的开始处开始,因此每次勉强地吞噬一个字符来寻找匹配,最终它们尝试整个输入的字符串。
最后,侵占量词始终是吞掉整个输入的字符串,尝试着一次(仅有一次)匹配。不像贪婪量词那样,侵占量词绝不会回退,即使这样是允许全部的匹配成功。
Matcher类:
\w(\d\d)(\w+)
这个正则表达式有三个组: 整个\w(\d\d)(\w+) 是第0组 group(0) (\d\d)是第1组 group(1) (\w+)是第2组 group(2) 我们看看和正则表达式匹配的一个字符串x99SuperJava, group(0)是匹配整个表达式的字符串的那部分x99SuperJava group(1)是第1组(\d\d)匹配的部分:99 group(2)是第二组(\w+)匹配的那部分SuperJava |
<iframe id="aswift_1" style="position: absolute; top: 0px; left: 0px;" name="aswift_1" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="336" height="280"></iframe>
|
下面我们写一个程序来验证一下:
package edu.jlu.fuliang;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
String regex = "\\w(\\d\\d)(\\w+)";
String candidate = "x99SuperJava";
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(candidate);
if(matcher.find()){
int gc = matcher.groupCount();
for(int i = 0; i <= gc; i++)
System.out.println("group " + i + " :" + matcher.group(i));
}
}
}
输出结果:
group 1 :99
group 2 :SuperJava
下面我们看看Matcher类提供的方法:
public Pattern pattern()
这个方法返回了,创建Matcher的那个pattern对象。
下面我们看看一个小例子来说明这个结果
import java.util.regex.*;
public class MatcherPatternExample{
public static void main(String args[]){
test();
}
public static void test(){
Pattern p = Pattern.compile("\\d");
Matcher m1 = p.matcher("55");
Matcher m2 = p.matcher("fdshfdgdfh");
System.out.println(m1.pattern() == m2.pattern());
//return true
}
}
public Matcher reset()
这个方法将Matcher的状态重新设置为最初的状态。
public Matcher reset(CharSequence input)
重新设置Matcher的状态,并且将候选字符序列设置为input后进行Matcher, 这个方法和重新创建一个Matcher一样,只是这样可以重用以前的对象。
public int start()
这个方法返回了,Matcher所匹配的字符串在整个字符串的的开始下标:
下面我们看看一个小例子
public class MatcherStartExample{
public static void main(String args[]){
test();
}
public static void test(){
//create a Matcher and use the Matcher.start() method
String candidateString = "My name is Bond. James Bond.";
String matchHelper[] =
{" ^"," ^"};
Pattern p = Pattern.compile("Bond");
Matcher matcher = p.matcher(candidateString);
//Find the starting point of the first 'Bond'
matcher.find();
int startIndex = matcher.start();
System.out.println(candidateString);
System.out.println(matchHelper[0] + startIndex);
//Find the starting point of the second 'Bond'
matcher.find();
int nextIndex = matcher.start();
System.out.println(candidateString);
System.out.println(matchHelper[1] + nextIndex);
}
输出结果:
My name is Bond. James Bond.
^11
My name is Bond. James Bond.
^23
public int start(int group)
这个方法可以指定你感兴趣的sub group,然后返回sup group匹配的开始位置。
public int end()
这个和start()对应,返回在以前的匹配操作期间,由给定组所捕获子序列的最后字符之后的偏移量。
其实start和end经常是一起配合使用来返回匹配的子字符串。
public int end(int group)
和public int start(int group)对应,返回在sup group匹配的子字符串最后一个字符在整个字符串下标加一
public String group()
返回由以前匹配操作所匹配的输入子序列。
这个方法提供了强大而方便的工具,他可以等同使用start和end,然后对字符串作substring(start,end)操作。
看看下面一个小例子:
import java.util.regex.*;
public class MatcherGroupExample{
public static void main(String args[]){
test();
}
public static void test(){
//create a Pattern
Pattern p = Pattern.compile("Bond");
//create a Matcher and use the Matcher.group() method
String candidateString = "My name is Bond. James Bond.";
Matcher matcher = p.matcher(candidateString);
//extract the group
matcher.find();
System.out.println(matcher.group());
}
}
public String group(int group)
这个方法提供了强大而方便的工具,可以得到指定的group所匹配的输入字符串
因为这两个方法经常使用,同样我们看一个小例子:
import java.util.regex.*;
public class MatcherGroupParamExample{
public static void main(String args[]){
test();
}
public static void test(){
//create a Pattern
Pattern p = Pattern.compile("B(ond)");
//create a Matcher and use the Matcher.group(int) method
String candidateString = "My name is Bond. James Bond.";
//create a helpful index for the sake of output
Matcher matcher = p.matcher(candidateString);
//Find group number 0 of the first find
matcher.find();
String group_0 = matcher.group(0);
String group_1 = matcher.group(1);
System.out.println("Group 0 " + group_0);
System.out.println("Group 1 " + group_1);
System.out.println(candidateString);
//Find group number 1 of the second find
matcher.find();
group_0 = matcher.group(0);
group_1 = matcher.group(1);
System.out.println("Group 0 " + group_0);
System.out.println("Group 1 " + group_1);
System.out.println(candidateString);
}
}
public int groupCount()
这个方法返回了,正则表达式的匹配的组数。
public boolean matches()
尝试将整个区域与模式匹配。这个要求整个输入字符串都要和正则表达式匹配。
和find不同, find是会在整个输入字符串查找匹配的子字符串。
public boolean find()
find会在整个输入中寻找是否有匹配的子字符串,一般我们使用find的流程:
while(matcher.find()){
//在匹配的区域,使用group,replace等进行查看和替换操作
}
public boolean find(int start)
从输入字符串指定的start位置开始查找。
public boolean lookingAt()
基本上是matches更松约束的一个方法,尝试将从区域开头开始的输入序列与该模式匹配
public Matcher appendReplacement (StringBuffer sb, String replacement)
你想把My name is Bond. James Bond. I would like a martini中的Bond换成Smith
StringBuffer sb = new StringBuffer();
String replacement = "Smith";
Pattern pattern = Pattern.compile("Bond");
Matcher matcher =pattern.matcher("My name is Bond. James Bond. I would like a martini.");
while(matcher.find()){
matcher.appendReplacement(sb,replacement);//结果是My name is Smith. James Smith
}
Matcher对象会维护追加的位置,所以我们才能不断地使用appendReplacement来替换所有的匹配。
public StringBuffer appendTail(StringBuffer sb)
这个方法简单的把为匹配的结尾追加到StringBuffer中。在上一个例子的最后再加上一句:
matcher.appendTail(sb);
结果就会成为My name is Smith. James Smith. I would like a martini.
public String replaceAll(String replacement)
这个是一个更方便的方法,如果我们想替换所有的匹配的话,我们可以简单的使用replaceAll就ok了。
是:
while(matcher.find()){
matcher.appendReplacement(sb,replacement);//结果是My name is Smith. James Smith
}
matcher.appendTail(sb);
的更便捷的方法。
public String replaceFirst(String replacement)
这个与replaceAll想对应很容易理解,就是只替换第一个匹配的。