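A few days ago a colleague asked me why a regular expression was not matching any data, even though it checked out fine in a regex testing tool. I was stumped at first: I had not touched regexes for so long that I had forgotten how the API classes are meant to be used. My first reaction was to step into the source code to find the cause, and reading it left me a little more confused. The problem did get solved, but it made me curious about how Java parses and processes regular expressions.
A simple regular expression example: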
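import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test; // JUnit 4 assumed; the original does not say which version

public class RegexDemoTest { // hypothetical class name
    @Test
    public void testDemo() {
        String str = "this is my test 52, i want find all number";
        Pattern patter = Pattern.compile("\\d+");
        Matcher matcher = patter.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group()); // prints 52
        }
    }
}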
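The example above uses a regular expression to extract every integer from a string and prints each one. What happens under the hood in a program like this?
First, Pattern.compile creates a new Pattern object: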
public static Pattern compile(String regex) {
    return new Pattern(regex, 0);
}

private Pattern(String p, int f) {
    pattern = p;
    flags = f;
    // Reset group index count
    capturingGroupCount = 1;
    localCount = 0;
    if (pattern.length() > 0) {
        compile();
    } else {
        root = new Start(lastAccept);
        matchRoot = lastAccept;
    }
}
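As you can see, the constructor of Pattern is private, so a Pattern object can only be created through the compile methods. The parameter f holds the match flags: a bit mask that may include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ, UNIX_LINES, LITERAL and COMMENTS. The single-argument compile shown above passes 0, meaning no flags are set.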
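To make the bit mask concrete, here is a small sketch. It relies on the public two-argument Pattern.compile(String, int) and on Pattern.flags(); the class name FlagsDemo and its helpers are my own and exist only for illustration.

```java
import java.util.regex.Pattern;

public class FlagsDemo {
    // Flags are combined with bitwise OR; this int is what ends up in the
    // private constructor's parameter f.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;

    /** Hypothetical helper: true if the given flag bit is set on the pattern. */
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    /** Hypothetical helper: compiles with the given flags and searches the input. */
    static boolean findWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "first\nHELLO\nlast";
        // With both flags, ^ and $ match at line boundaries and case is ignored.
        System.out.println(findWithFlags("^hello$", FLAGS, input)); // true
        // With no flags, ^ only matches at the very start of the input.
        System.out.println(findWithFlags("^hello$", 0, input));     // false

        Pattern p = Pattern.compile("^hello$", FLAGS);
        System.out.println(hasFlag(p, Pattern.MULTILINE)); // true
        System.out.println(hasFlag(p, Pattern.DOTALL));    // false
    }
}
```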
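The constructor initializes the basic fields of the class. Because the pattern in the unit test above is not empty, execution goes straight into compile(). What does that method do?
The relevant part of the compile method is: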
    if (! has(LITERAL))
        RemoveQEQuoting();
    // Allocate all temporary objects here.
    buffer = new int[32];
    groupNodes = new GroupHead[10];
    if (has(LITERAL)) {
        // Literal pattern handling
        matchRoot = newSlice(temp, patternLength, hasSupplementary);
        matchRoot.next = lastAccept;
    } else {
        // Start recursive descent parsing
        matchRoot = expr(lastAccept);
        // Check extra pattern characters
        if (patternLength != cursor) {
            if (peek() == ')') {
                throw error("Unmatched closing ')'");
            } else {
                throw error("Unexpected internal error");
            }
        }
    }
    // Peephole optimization
    if (matchRoot instanceof Slice) {
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }
    // Release temporary storage
    temp = null;
    buffer = null;
    groupNodes = null;
    patternLength = 0;
    compiled = true;
}
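The key step is this one:
// Start recursive descent parsing
matchRoot = expr(lastAccept);
It starts a recursive descent parse of the supplied pattern, for example \\d+. The parsed result is wrapped in Node objects, and the head of that node chain is assigned to matchRoot.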
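The compile listing above throws through error(...) when the parser stops before the end of the pattern. From the outside this surfaces as a PatternSyntaxException. A minimal sketch of how to observe it (the class and helper names are mine; the exact description text depends on the JDK version, so I do not hard-code it):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxErrorDemo {
    /** Hypothetical helper: the parser's error description, or null if the regex compiles. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Compiles without error, so the helper returns null.
        System.out.println(compileError("\\d+"));
        // The parser stops at ')' with pattern text left over.
        System.out.println(compileError("a)"));
        // A quantifier with nothing in front of it to repeat.
        System.out.println(compileError("+a"));
    }
}
```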
The source of expr is:
private Node expr(Node end) {
    Node prev = null;
    Node firstTail = null;
    Node branchConn = null;
    for (;;) {
        Node node = sequence(end); // this is where the Node objects are actually built
        Node nodeTail = root;      //double return
        if (prev == null) {
            prev = node;
            firstTail = nodeTail;
        } else {
            // Branch
            if (branchConn == null) {
                branchConn = new BranchConn();
                branchConn.next = end;
            }
            if (node == end) {
                // if the node returned from sequence() is "end"
                // we have an empty expr, set a null atom into
                // the branch to indicate to go "next" directly.
                node = null;
            } else {
                // the "tail.next" of each atom goes to branchConn
                nodeTail.next = branchConn;
            }
            if (prev instanceof Branch) {
                ((Branch)prev).add(node);
            } else {
                if (prev == end) {
                    prev = null;
                } else {
                    // replace the "end" with "branchConn" at its tail.next
                    // when put the "prev" into the branch as the first atom.
                    firstTail.next = branchConn;
                }
                prev = new Branch(prev, node, branchConn);
            }
        }
        // check whether the pattern continues with a '|' character
        if (peek() != '|') {
            return prev;
        }
        next();
    }
}
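Reading the source, if the pattern contains no alternation, expr simply does prev = node; return prev; and returns the node chain as assembled. If the pattern contains a | character, expr assembles the alternatives into a hierarchy and returns a Branch node.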
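The comment in expr about an "empty expr" setting a null atom is easier to follow with a concrete pattern. In the sketch below (class and helper names are mine), the group in a(b|)c has an empty second alternative, so the group is allowed to match nothing at all:

```java
import java.util.regex.Pattern;

public class EmptyAlternativeDemo {
    /** Hypothetical helper: true if the whole input matches the regex. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // No '|': the chain from sequence() is returned as is.
        System.out.println(fullMatch("cat", "cat"));     // true
        // One '|': both alternatives are tried.
        System.out.println(fullMatch("cat|dog", "dog")); // true
        // Empty second alternative: "b" is optional.
        System.out.println(fullMatch("a(b|)c", "abc"));  // true
        System.out.println(fullMatch("a(b|)c", "ac"));   // true
        System.out.println(fullMatch("a(b|)c", "abbc")); // false
    }
}
```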
The sequence method:
private Node sequence(Node end) {
    Node head = null;
    Node tail = null;
    Node node = null;
LOOP:
    for (;;) {
        int ch = peek();
        switch (ch) {
        case '(':
            // Because group handles its own closure,
            // we need to treat it differently
            node = group0();
            // Check for comment or flag group
            if (node == null)
                continue;
            if (head == null)
                head = node;
            else
                tail.next = node;
            // Double return: Tail was returned in root
            tail = root;
            continue;
        case '[':
            node = clazz(true);
            break;
        case '\\':
            ch = nextEscaped();
            if (ch == 'p' || ch == 'P') {
                boolean oneLetter = true;
                boolean comp = (ch == 'P');
                ch = next(); // Consume { if present
                if (ch != '{') {
                    unread();
                } else {
                    oneLetter = false;
                }
                node = family(oneLetter).maybeComplement(comp);
            } else {
                unread();
                node = atom();
            }
            break;
        case '^':
            next();
            if (has(MULTILINE)) {
                if (has(UNIX_LINES))
                    node = new UnixCaret();
                else
                    node = new Caret();
            } else {
                node = new Begin();
            }
            break;
        case '$':
            next();
            if (has(UNIX_LINES))
                node = new UnixDollar(has(MULTILINE));
            else
                node = new Dollar(has(MULTILINE));
            break;
        case '.':
            next();
            if (has(DOTALL)) {
                node = new All();
            } else {
                if (has(UNIX_LINES))
                    node = new UnixDot();
                else {
                    node = new Dot();
                }
            }
            break;
        case '|':
        case ')':
            break LOOP;
        case ']': // Now interpreting dangling ] and } as literals
        case '}':
            node = atom();
            break;
        case '?':
        case '*':
        case '+':
            next();
            throw error("Dangling meta character '" + ((char)ch) + "'");
        case 0:
            if (cursor >= patternLength) {
                break LOOP;
            }
            // Fall through
        default:
            node = atom();
            break;
        }
        node = closure(node);
        if (head == null) {
            head = tail = node;
        } else {
            tail.next = node;
            tail = node;
        }
    }
    if (head == null) {
        return end;
    }
    tail.next = end;
    root = tail; //double return
    return head;
}
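It is easy to see from this code that sequence builds a different kind of Node for each pattern construct it meets, links those nodes one after another through their next fields, and returns the head of the chain.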
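The head/tail linking in sequence can be modelled in a few lines. What follows is a deliberately simplified toy of my own, not the JDK's classes: Node, Accept, CharNode and chain are hypothetical names, and match returns the end index instead of recording state in a Matcher. It only shows how each node checks its own part and then hands over to next.

```java
import java.util.function.IntPredicate;

public class NodeChainSketch {
    /** Toy stand-in for a pattern node: check one thing, then delegate to next. */
    static abstract class Node {
        Node next;
        /** Returns the index just past the match starting at i, or -1 on failure. */
        abstract int match(CharSequence seq, int i);
    }

    /** Terminal node: reaching it means everything before it matched. */
    static final class Accept extends Node {
        int match(CharSequence seq, int i) { return i; }
    }

    /** Matches exactly one character that satisfies the predicate. */
    static final class CharNode extends Node {
        final IntPredicate test;
        CharNode(IntPredicate test) { this.test = test; }
        int match(CharSequence seq, int i) {
            if (i < seq.length() && test.test(seq.charAt(i)))
                return next.match(seq, i + 1);
            return -1;
        }
    }

    /** Links atoms like sequence() does: first node is head, later ones go on tail.next. */
    static Node chain(Node end, Node... atoms) {
        Node head = null, tail = null;
        for (Node n : atoms) {
            if (head == null) {
                head = tail = n;
            } else {
                tail.next = n;
                tail = n;
            }
        }
        if (head == null) return end; // empty sequence
        tail.next = end;
        return head;
    }

    /** Toy equivalent of the pattern "a\\d" matched at index 0. */
    static int matchLetterADigit(String input) {
        Node root = chain(new Accept(),
                new CharNode(c -> c == 'a'),
                new CharNode(Character::isDigit));
        return root.match(input, 0);
    }

    public static void main(String[] args) {
        System.out.println(matchLetterADigit("a7")); // 2
        System.out.println(matchLetterADigit("ab")); // -1
    }
}
```

The real nodes carry far more (quantifiers, groups, backtracking state), but the chain-of-responsibility shape is the same.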
Back in expr, the check
if (peek() != '|')
is true when no | follows, in which case the Node is returned directly. Otherwise a Branch object is assembled. Branch is defined as:
static final class Branch extends Node {
    Node[] atoms = new Node[2];
    int size = 2;
    Node conn;
    Branch(Node first, Node second, Node branchConn) {
        conn = branchConn;
        atoms[0] = first;
        atoms[1] = second;
    }

    void add(Node node) {
        if (size >= atoms.length) {
            Node[] tmp = new Node[atoms.length*2];
            System.arraycopy(atoms, 0, tmp, 0, atoms.length);
            atoms = tmp;
        }
        atoms[size++] = node;
    }

    boolean match(Matcher matcher, int i, CharSequence seq) {
        for (int n = 0; n < size; n++) {
            if (atoms[n] == null) {
                if (conn.next.match(matcher, i, seq))
                    return true;
            } else if (atoms[n].match(matcher, i, seq)) {
                return true;
            }
        }
        return false;
    }

    boolean study(TreeInfo info) {
        int minL = info.minLength;
        int maxL = info.maxLength;
        boolean maxV = info.maxValid;
        int minL2 = Integer.MAX_VALUE; //arbitrary large enough num
        int maxL2 = -1;
        for (int n = 0; n < size; n++) {
            info.reset();
            if (atoms[n] != null)
                atoms[n].study(info);
            minL2 = Math.min(minL2, info.minLength);
            maxL2 = Math.max(maxL2, info.maxLength);
            maxV = (maxV & info.maxValid);
        }
        minL += minL2;
        maxL += maxL2;
        info.reset();
        conn.next.study(info);
        info.minLength += minL;
        info.maxLength += maxL;
        info.maxValid &= maxV;
        info.deterministic = false;
        return false;
    }
}
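This class holds a Node[] atoms field whose default array size is 2. When the pattern contains more than one | character, expr calls
if (prev instanceof Branch) {
    ((Branch)prev).add(node);
}
to append each further alternative; add first checks the array size and doubles the array if it is already full.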
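One visible consequence of Branch.match looping over atoms from index 0 is that alternatives are tried in the order they were written, and the first one that leads to a match wins, even if a later one would match more text. A small sketch (class and helper names are mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    /** Hypothetical helper: text of the first match, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "a" is listed first and succeeds, so "ab" is never tried.
        System.out.println(firstMatch("a|ab", "ab"));   // a
        // Swapping the alternatives changes the result.
        System.out.println(firstMatch("ab|a", "ab"));   // ab
        // Four alternatives: going by the listing above, the third and
        // fourth are appended through Branch.add.
        System.out.println(firstMatch("x|y|z|w", "w")); // w
    }
}
```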
That completes the execution of
Pattern patter = Pattern.compile("\\d+");
So what does the next step,
Matcher matcher = patter.matcher(str);
do?
public Matcher matcher(CharSequence input) {
    if (!compiled) {
        synchronized(this) {
            if (!compiled)
                compile();
        }
    }
    Matcher m = new Matcher(this, input);
    return m;
}
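compiled defaults to false, but it is set to true once Pattern.compile has run (meaning the pattern has already been processed), so the lazy-compile branch is skipped here. The method then creates a Matcher object and passes it the Pattern and the string to be matched, which the Matcher stores in its own fields.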
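Since matcher() only allocates a new Matcher around an already compiled Pattern, the usual practice is to compile once and create a matcher per input. A minimal sketch of that usage (the class, constant and helper names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReusePatternDemo {
    // Compiled once; every Matcher created from it shares the same node graph.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Hypothetical helper: all runs of digits in the input, in order. */
    static List<String> numbersIn(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input); // per-input match state only
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbersIn("a1b22"));     // [1, 22]
        System.out.println(numbersIn("no digits")); // []
    }
}
```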
The Matcher constructors:
Matcher() {
}

/**
 * All matchers have the state used by Pattern during a match.
 */
Matcher(Pattern parent, CharSequence text) {
    this.parentPattern = parent;
    this.text = text;
    // Allocate state storage
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[parent.localCount];
    // Put fields into initial states
    reset();
}
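Neither constructor is public: both are package-private, so classes in other packages cannot create a Matcher directly with new Matcher(...).
The Matcher constructor calls reset, whose source is: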
public Matcher reset() {
    first = -1;
    last = 0;
    oldLast = -1;
    for(int i=0; i<groups.length; i++)
        groups[i] = -1;
    for(int i=0; i<locals.length; i++)
        locals[i] = -1;
    lastAppendPosition = 0;
    from = 0;
    to = getTextLength();
    return this;
}
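reset puts the matcher back into its initial state, including the region of the input to be searched: from is the index at which searching starts and to is the index at which it ends. Only the text between from and to is searched, and after reset that is the whole input.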
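The from and to fields can be changed from outside through the public Matcher.region(int, int) method, which makes their role easy to see. A small sketch (class and helper names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionDemo {
    /** Hypothetical helper: all matches found inside [from, to) of the input. */
    static List<String> findInRegion(String regex, String input, int from, int to) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(from, to); // limits searching to this slice of the input
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "11 22 33";
        // Default region is the whole input: from = 0, to = 8.
        System.out.println(findInRegion("\\d+", input, 0, input.length())); // [11, 22, 33]
        // Region 3..5 covers only "22".
        System.out.println(findInRegion("\\d+", input, 3, 5));              // [22]

        Matcher m = Pattern.compile("\\d+").matcher(input);
        m.region(3, 5);
        m.reset(); // back to the default region
        System.out.println(m.regionStart() + ".." + m.regionEnd());         // 0..8
    }
}
```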
With the Matcher in hand, the next step is to apply the compiled pattern to locate each match in the string and print it:
while(matcher.find()){
    System.out.println(matcher.group());
}
The source of matcher.find() is:
public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;
    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;
    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}
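In this method, nextSearchIndex is the index at which the next search should start. It normally continues from last, the end of the previous match; if the previous match was empty (last == first), it is advanced by one so the same empty match is not found again. search(nextSearchIndex) then looks for a match from that position: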
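The nextSearchIndex++ step is what keeps a pattern that can match the empty string from looping forever. A quick way to watch it (class and helper names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    /** Hypothetical helper: start index of every match reported by find(). */
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "x*" matches the empty string at every position of "ab",
        // including the position after the last character.
        System.out.println(matchStarts("x*", "ab"));  // [0, 1, 2]
        // Non-empty matches continue from the end of the previous one.
        System.out.println(matchStarts("\\d+", "1 22 333")); // [0, 2, 5]
    }
}
```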
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
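The search method calls
parentPattern.root.match(this, from, text)
where parentPattern.root is the Node graph that Pattern assembled from the regular expression. Calling match on it locates the match in the input string, records the start and end offsets in the Matcher's groups array, and returns true. After that, matcher.group(), which is equivalent to group(0), reads the recorded offsets from groups and returns the corresponding substring: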
public String group(int group) {
    if (first < 0)
        throw new IllegalStateException("No match found");
    if (group < 0 || group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
        return null;
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
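The three branches of group(int) can each be triggered from ordinary code. The sketch below (class and helper names are mine) shows group 0 versus numbered groups, the offsets that sit behind them via start and end, the null returned for a group that did not take part in the match, and the exception thrown before any match has been found:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    /** Hypothetical helper: group(0)..group(n) of the first match, or null if none. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) return null;
        String[] out = new String[m.groupCount() + 1];
        for (int g = 0; g < out.length; g++) {
            out[g] = m.group(g);
        }
        return out;
    }

    /** Hypothetical helper: true if group() fails because find() was never called. */
    static boolean groupBeforeFindThrows() {
        try {
            Pattern.compile("\\d+").matcher("52").group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("(\\d+)-(\\d+)").matcher("range 10-20");
        if (m.find()) {
            System.out.println(m.group());   // 10-20 (same as group(0))
            System.out.println(m.group(1) + " at " + m.start(1) + ".." + m.end(1)); // 10 at 6..8
            System.out.println(m.group(2) + " at " + m.start(2) + ".." + m.end(2)); // 20 at 9..11
        }
        // Group 1 did not participate, so its offsets stay -1 and group(1) is null.
        System.out.println(String.join(",", java.util.Arrays.toString(groups("(a)|(b)", "b")))); // [b, null, b]
        System.out.println(groupBeforeFindThrows()); // true
    }
}
```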
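And that is the overall flow of how Java processes a regular expression. How each kind of pattern construct is handled by its own Node type is not covered here.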