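Java Regular Expressions
Overview
Java's regular expression engine is NFA-based (backtracking). Pattern and Matcher are the two most important classes in it.
Pattern
Pattern is the compiled representation of a regular expression. It mainly holds a tree of Node objects together with the character-handling methods used to build that tree.
Main fields
Flags (modifiers)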
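/*
 * Regular expression flag values. Instead of being passed as arguments, they
 * can also be written as inline flags. For example, p1 and p2 are equivalent:
 * Pattern p1 = Pattern.compile("abc", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
 * Pattern p2 = Pattern.compile("(?im)abc", 0);
 */
/**
 * Enables Unix lines mode. In this mode only '\n' is recognized as a line
 * terminator.
 * Unix lines mode can also be enabled via the embedded flag expression (?d).
 */
public static final int UNIX_LINES = 0x01;
/**
 * Enables case-insensitive matching.
 * By default, case-insensitive matching assumes that only characters in the
 * US-ASCII charset are being matched. Unicode-aware case-insensitive matching
 * can be enabled by specifying the UNICODE_CASE flag together with this flag.
 * Can also be enabled via the embedded flag expression (?i).
 */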
public static final int CASE_INSENSITIVE = 0x02;
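/**
 * Permits whitespace and comments in the pattern.
 * In this mode, whitespace is ignored, and embedded comments starting with #
 * are ignored until the end of a line.
 * Can also be enabled via the embedded flag expression (?x).
 */
public static final int COMMENTS = 0x04;
/**
 * Enables multiline mode.
 * In multiline mode the expressions ^ and $ match just after or just before,
 * respectively, a line terminator or the end of the input sequence. By
 * default these expressions only match at the beginning and the end of the
 * entire input sequence.
 * Can also be enabled via the embedded flag expression (?m).
 */
public static final int MULTILINE = 0x08;
/**
 * Enables literal parsing of the pattern.
 *
 * When this flag is specified, the pattern string is treated as a sequence of
 * literal characters. Metacharacters and escape sequences in the pattern have
 * no special meaning. CASE_INSENSITIVE and UNICODE_CASE keep their effect on
 * matching when combined with this flag; the other flags become superfluous.
 * There is no embedded flag character for enabling literal parsing.
 */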
public static final int LITERAL = 0x10;
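/**
 * Enables dotall mode, also known as single-line mode.
 * In dotall mode, the expression . matches any character, including a line
 * terminator. By default this expression does not match line terminators.
 * Dotall mode can also be enabled via the embedded flag expression (?s).
 * (The s is a mnemonic for "single-line" mode, which is what this is called
 * in Perl.)
 */
public static final int DOTALL = 0x20;
/**
 * Enables Unicode-aware case folding.
 * When this flag is specified, case-insensitive matching, when enabled by the
 * CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
 * Standard. Specifying this flag may impose a performance penalty.
 */
public static final int UNICODE_CASE = 0x40;
/**
 * Enables canonical equivalence.
 *
 * When this flag is specified, two characters are considered to match if, and
 * only if, their full canonical decompositions match. For example, the
 * expression "a\u030A" matches the string "\u00E5" when this flag is
 * specified. By default, matching does not take canonical equivalence into
 * account. There is no embedded flag character for enabling canonical
 * equivalence. Specifying this flag may impose a performance penalty.
 */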
public static final int CANON_EQ = 0x80;
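/**
 * Enables the Unicode version of predefined character classes and POSIX
 * character classes.
 * This flag implies UNICODE_CASE.
 * Specifying this flag may impose a performance penalty.
 */
public static final int UNICODE_CHARACTER_CLASS = 0x100;
/**
 * Contains all possible flags.
 * Its complement is ANDed with a caller-supplied flag value; a non-zero
 * result means the value contains bits that are not valid Java regex flags.
 */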
private static final int ALL_FLAGS = CASE_INSENSITIVE | MULTILINE |
DOTALL | UNICODE_CASE | CANON_EQ | UNIX_LINES | LITERAL |
UNICODE_CHARACTER_CLASS | COMMENTS;
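The equivalence between flag arguments and inline flags described in the first comment can be checked directly. The following is a minimal sketch; the class name FlagDemo, the helper found, and the sample input are illustrative assumptions and not part of the JDK source.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Flags passed as an argument to Pattern.compile
    static final Pattern P1 =
            Pattern.compile("^abc$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    // The same two flags written inline as (?im)
    static final Pattern P2 = Pattern.compile("(?im)^abc$", 0);
    // No flags at all, for comparison
    static final Pattern PLAIN = Pattern.compile("^abc$");

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    // With LITERAL the dot is an ordinary character, so "a.c" no longer matches "abc"
    static boolean literalDotMatchesAbc() {
        return Pattern.compile("a.c", Pattern.LITERAL).matcher("abc").matches();
    }

    public static void main(String[] args) {
        String input = "xyz\nABC\n123";
        System.out.println(found(P1, input));       // true
        System.out.println(found(P2, input));       // true
        System.out.println(found(PLAIN, input));    // false
        System.out.println(literalDotMatchesAbc()); // false
    }
}
```

Both flagged patterns find the middle line "ABC" because MULTILINE lets ^ and $ match at line boundaries and CASE_INSENSITIVE ignores the ASCII case difference.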
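Other fields
// Only pattern and flags are serialized; after deserialization the pattern
// has to be compiled again.
/**
 * The original regular-expression pattern string.
 */
private String pattern;
/**
 * The original pattern flags.
 *
 * @serial
 */
private int flags;
/**
 * The temporary pattern flags used during compiling. The flags might be
 * turned on and off by embedded flag expressions.
 */
private transient int flags0;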
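/**
 * Boolean indicating this Pattern is compiled; this is necessary in order
 * to lazily compile deserialized Patterns.
 */
private transient volatile boolean compiled;
/**
 * The normalized pattern string.
 * NFC ("canonical composition") combines several simple characters into one
 * composed character; "canonical equivalence" means equivalence in both
 * appearance and meaning.
 * NFD ("canonical decomposition") splits a composed character into its
 * simple characters. With CANON_EQ the pattern is normalized and a pure
 * group is built to match the canonical equivalents of a character (see
 * Unicode normalization).
 * For example, Ǒ (O with caron) can be written as the single code point
 * \u01D1 or as O (\u004F) followed by the combining caron (\u030C). Without
 * CANON_EQ only the form written in the pattern matches; with CANON_EQ the
 * two forms are treated as equivalent.
 */
private transient String normalizedPattern;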
/**
 * The starting point of state machine for the find operation.
 * This allows a match to start anywhere in the input.
 */
transient Node root;
/**
 * The root of object tree for a match operation. The pattern is
 * matched at the beginning. This may include a find that uses
 * BnM or a First node.
 */
transient Node matchRoot;
/**
 * Temporary storage used by parsing pattern slice.
 */
transient int[] buffer;
/**
 * A temporary storage used for predicate for double return.
 * It holds the character predicate that a parsing method hands back in
 * addition to its regular return value.
 */
transient CharPredicate predicate;
/**
 * Map the "name" of the "named capturing group" to its group id
 * node.
 */
transient volatile Map<String, Integer> namedGroups;
/**
 * Temporary storage used while parsing group references.
 * It holds the GroupHead nodes of the groups parsed so far.
 */
transient GroupHead[] groupNodes;
/**
 * Temporary storage used to store the top level closure nodes
 * (top-level greedy nodes such as .* and .+).
 */
transient List<Node> topClosureNodes;
/**
 * The number of top greedy closure nodes in this Pattern. Used by
 * matchers to allocate storage needed for a IntHashSet to keep
 * the beginning pos {@code i} of all failed match.
 */
transient int localTCNCount;
/*
 * Turn off the stop-exponential-backtracking optimization if
 * there is a group ref in the pattern.
 */
transient boolean hasGroupRef;
/**
 * Temporary null terminated code point array used by pattern
 * compiling.
 */
private transient int[] temp;
/**
 * The number of capturing groups in this Pattern. Used by
 * matchers to allocate storage needed to perform a match.
 */
transient int capturingGroupCount;
/**
 * The local variable count used by parsing tree. Used by matchers
 * to allocate storage needed to perform a match.
 * createGroup executes int localIndex = localCount++, so every group,
 * capturing or not, gets its own zero-based slot and localIndex is one less
 * than the updated localCount. The slot holds the group's start position
 * during a match, which is what GroupTail later copies into groups[].
 */
transient int localCount;
/**
 * Index into the pattern string that keeps track of how much has
 * been parsed.
 */
private transient int cursor;
/**
 * Holds the length of the pattern string.
 */
private transient int patternLength;
/**
 * If the Start node might possibly match supplementary or
 * surrogate code points. It is set to true during compiling if
 * (1) There is supplementary or surrogate code point in pattern,
 * or (2) There is complement node of a "family" CharProperty
 */
private transient boolean hasSupplementary;
Constructor
private Pattern(String p, int f) {
    // Reject flag values that contain bits outside ALL_FLAGS
    if ((f & ~ALL_FLAGS) != 0) {
        throw new IllegalArgumentException("Unknown flag 0x"
                                           + Integer.toHexString(f));
    }
    // Record the pattern string and the flags that were passed in
    pattern = p;
    flags = f;
    // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
    if ((flags & UNICODE_CHARACTER_CLASS) != 0)
        flags |= UNICODE_CASE;
    // 'flags' for compiling
    flags0 = flags;
    // Reset group index count
    capturingGroupCount = 1;
    localCount = 0;
    localTCNCount = 0;
    if (!pattern.isEmpty()) {
        try {
            compile();
        } catch (StackOverflowError soe) {
            throw error("Stack overflow during pattern compilation");
        }
    } else {
        // Empty pattern: the find root is a Start node that leads straight
        // to lastAccept, and the match root is lastAccept itself
        root = new Start(lastAccept);
        matchRoot = lastAccept;
    }
}
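The flag check at the top of the constructor is plain bit arithmetic, so it can be reproduced outside the JDK. The sketch below rebuilds the mask from the public Pattern constants; the class FlagCheckDemo and its helper names are hypothetical, and the real ALL_FLAGS field is private.

```java
import java.util.regex.Pattern;

class FlagCheckDemo {
    // Rebuilt from the public constants; mirrors the private ALL_FLAGS field
    static final int ALL_FLAGS = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE
            | Pattern.DOTALL | Pattern.UNICODE_CASE | Pattern.CANON_EQ
            | Pattern.UNIX_LINES | Pattern.LITERAL
            | Pattern.UNICODE_CHARACTER_CLASS | Pattern.COMMENTS;

    // True when f contains only bits that belong to a defined flag,
    // i.e. when the constructor's (f & ~ALL_FLAGS) != 0 test would not fire
    static boolean isKnown(int f) {
        return (f & ~ALL_FLAGS) == 0;
    }

    // The empty pattern takes the else branch and still matches empty input
    static boolean emptyPatternMatchesEmptyInput() {
        return Pattern.compile("").matcher("").matches();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(ALL_FLAGS));             // 1ff
        System.out.println(isKnown(Pattern.DOTALL | Pattern.COMMENTS)); // true
        System.out.println(isKnown(0x1000));                            // false
        System.out.println(emptyPatternMatchesEmptyInput());            // true
    }
}
```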
Compilation
private void compile() {
    // Handle canonical equivalences
    if (has(CANON_EQ) && !has(LITERAL)) {
        // CANON_EQ is on: convert the pattern to its normalized form
        normalizedPattern = normalize(pattern);
    } else {
        // Otherwise the pattern string is used unchanged
        normalizedPattern = pattern;
    }
    patternLength = normalizedPattern.length();
    // Copy pattern to int array for convenience
    // Use double zero to terminate pattern
    // The array stores code points rather than chars, because a single char
    // may be only half of a surrogate pair and not a complete character.
    temp = new int[patternLength + 2];
    // Set when the pattern contains supplementary characters (outside plane 0)
    hasSupplementary = false;
    int c, count = 0;
    // Convert all chars into code points
    for (int x = 0; x < patternLength; x += Character.charCount(c)) {
        c = normalizedPattern.codePointAt(x);
        if (isSupplementary(c)) {
            hasSupplementary = true;
        }
        temp[count++] = c;
    }
    patternLength = count;   // patternLength now in code points
    // Unless LITERAL is set, preprocess \Q...\E sequences: metacharacters
    // between \Q and \E are escaped and the \Q and \E markers are removed
    if (! has(LITERAL))
        RemoveQEQuoting();
    // Allocate all temporary objects here.
    buffer = new int[32];
    // GroupHead nodes, used while parsing group references
    groupNodes = new GroupHead[10];
    // Map from group name to group index
    namedGroups = null;
    // Top-level closure (greedy) nodes
    topClosureNodes = new ArrayList<>(10);
    if (has(LITERAL)) {
        // Literal pattern handling: build a Slice node that matches the
        // pattern text verbatim
        matchRoot = newSlice(temp, patternLength, hasSupplementary);
        // The slice is followed directly by the accepting node
        matchRoot.next = lastAccept;
    } else {
        // Start recursive descent parsing. expr() adds branch nodes for
        // alternations and may be called recursively for sub-expressions
        // that contain alternations themselves.
        matchRoot = expr(lastAccept);
        // Check extra pattern characters
        if (patternLength != cursor) {
            if (peek() == ')') {
                throw error("Unmatched closing ')'");
            } else {
                throw error("Unexpected internal error");
            }
        }
    }
    // Peephole optimization
    if (matchRoot instanceof Slice) {
        // A pure literal pattern can be searched with Boyer-Moore (BnM)
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }
    // Optimize the greedy Loop to prevent exponential backtracking, IF there
    // is no group ref in this pattern. With a non-negative localTCNCount value,
    // the greedy type Loop, Curly will skip the backtracking for any starting
    // position "i" that failed in the past.
    if (!hasGroupRef) {
        for (Node node : topClosureNodes) {
            if (node instanceof Loop) {
                // non-deterministic-greedy-group
                ((Loop)node).posIndex = localTCNCount++;
            }
        }
    }
    // Release temporary storage
    temp = null;
    buffer = null;
    groupNodes = null;
    patternLength = 0;
    compiled = true;
    topClosureNodes = null;
}
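Two steps of compile() are easy to try in isolation: the char-to-code-point conversion and the \Q...\E preprocessing. This is a small sketch under the assumption that a plain String stands in for normalizedPattern; CodePointDemo and its method names are illustrative only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CodePointDemo {
    // Same loop shape as in compile(): advance by Character.charCount(c)
    // so that a surrogate pair is stored as one code point
    static int[] toCodePoints(String s) {
        int[] temp = new int[s.length() + 2]; // room for the two terminating zeros
        int c, count = 0;
        for (int x = 0; x < s.length(); x += Character.charCount(c)) {
            c = s.codePointAt(x);
            temp[count++] = c;
        }
        return Arrays.copyOf(temp, count);    // keep only the code points
    }

    // \Q...\E makes the dot a literal character
    static boolean quotedMatches(String input) {
        return Pattern.compile("\\Qa.b\\E").matcher(input).matches();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";          // 'a', U+1F600, 'b'
        System.out.println(s.length());                // 4 chars
        System.out.println(toCodePoints(s).length);    // 3 code points
        System.out.println(quotedMatches("a.b"));      // true
        System.out.println(quotedMatches("axb"));      // false
    }
}
```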
Canonical equivalence
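As a minimal illustration of what CANON_EQ changes, the sketch below uses the pair of strings given in the flag's own comment ("a\u030A" and "\u00E5"). The class CanonEqDemo and its method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CanonEqDemo {
    static final String DECOMPOSED = "a\u030A"; // 'a' followed by combining ring above
    static final String COMPOSED = "\u00E5";    // precomposed a-with-ring

    // Does the decomposed pattern match the composed input under these flags?
    static boolean matches(int flags) {
        return Pattern.compile(DECOMPOSED, flags).matcher(COMPOSED).matches();
    }

    static boolean withoutCanonEq() {
        return matches(0);
    }

    static boolean withCanonEq() {
        return matches(Pattern.CANON_EQ);
    }

    public static void main(String[] args) {
        System.out.println(withoutCanonEq()); // false: code points differ
        System.out.println(withCanonEq());    // true: canonical decompositions match
    }
}
```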
Node
sequence
Parses the sequences between alternations.
private Node sequence(Node end) {
    Node head = null;
    Node tail = null;
    Node node;
    LOOP:
    for (;;) {
        // Read the element of temp at cursor without moving the cursor
        int ch = peek();
        switch (ch) {
        case '(':
            // Because group handles its own closure,
            // we need to treat it differently
            node = group0();
            // Check for comment or flag group
            if (node == null)
                continue;
            if (head == null)
                head = node;
            else
                tail.next = node;
            // Double return: Tail was returned in root
            tail = root;
            continue;
        case '[':
            if (has(CANON_EQ) && !has(LITERAL))
                node = new NFCCharProperty(clazz(true));
            else
                node = newCharProperty(clazz(true));
            break;
        case '\\':
            // Backslash escape: skip the backslash and read the escaped
            // character, advancing the cursor
            ch = nextEscaped();
            // \p starts a Unicode property class, \P its complement
            if (ch == 'p' || ch == 'P') {
                // Assume a single-letter property name until a '{' is seen
                boolean oneLetter = true;
                // True for the complemented form \P
                boolean comp = (ch == 'P');
                ch = next(); // Consume { if present
                if (ch != '{') {
                    // No brace: single-letter property name such as \pL,
                    // so push the character back
                    unread();
                } else {
                    // The name is enclosed in braces, e.g. \p{Lu}
                    oneLetter = false;
                }
                // node = newCharProperty(family(oneLetter, comp));
                if (has(CANON_EQ) && !has(LITERAL))
                    // With canonical equivalence: parse the Unicode family
                    // and wrap it in an NFC-aware node
                    node = new NFCCharProperty(family(oneLetter, comp));
                else
                    // Parse the Unicode family and return its node
                    node = newCharProperty(family(oneLetter, comp));
            } else {
                // Not a property class: step back and parse an atom
                unread();
                node = atom();
            }
            break;
        case '^':
            next();
            if (has(MULTILINE)) {
                if (has(UNIX_LINES))
                    node = new UnixCaret();
                else
                    node = new Caret();
            } else {
                node = new Begin();
            }
            break;
        case '$':
            next();
            if (has(UNIX_LINES))
                node = new UnixDollar(has(MULTILINE));
            else
                node = new Dollar(has(MULTILINE));
            break;
        case '.':
            next();
            if (has(DOTALL)) {
                node = new CharProperty(ALL());
            } else {
                if (has(UNIX_LINES)) {
                    node = new CharProperty(UNIXDOT());
                } else {
                    node = new CharProperty(DOT());
                }
            }
            break;
        case '|':
        case ')':
            break LOOP;
        case ']': // Now interpreting dangling ] and } as literals
        case '}':
            node = atom();
            break;
        case '?':
        case '*':
        case '+':
            next();
            throw error("Dangling meta character '" + ((char)ch) + "'");
        case 0:
            if (cursor >= patternLength) {
                break LOOP;
            }
            // Fall through
        default:
            node = atom();
            break;
        }
        // closure() processes repetition: if the next character peeked is a
        // quantifier, new nodes are appended to handle the repetition. The
        // previous node may be a single node or a group, so it can be a
        // chain of nodes.
        node = closure(node);
        /* save the top dot-greedy nodes (.*, .+) as well
        if (node instanceof GreedyCharProperty &&
            ((GreedyCharProperty)node).cp instanceof Dot) {
            topClosureNodes.add(node);
        }
        */
        if (head == null) {
            head = tail = node;
        } else {
            tail.next = node;
            tail = node;
        }
    }
    if (head == null) {
        return end;
    }
    tail.next = end;
    root = tail;      //double return
    return head;
}
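Several branches of sequence() are visible from the outside through the exceptions they raise. The following sketch probes them with Pattern.compile; the class SequenceDemo and the helper compileError are hypothetical names used only here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SequenceDemo {
    // Returns the syntax error description, or null when the pattern compiles
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // A quantifier with nothing before it hits the '?', '*', '+' case
        System.out.println(compileError("*a"));
        // ')' ends the sequence; compile() then reports the leftover character
        System.out.println(compileError("a)"));
        // Dangling ']' and '}' are treated as literals, so these compile
        System.out.println(compileError("a]"));
        System.out.println(compileError("a}"));
    }
}
```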
Matcher
Matcher is the abstraction of a match operation that applies a pattern to a character sequence.
Fields
/**
 * The Pattern object that created this Matcher.
 */
Pattern parentPattern;
/**
 * The storage used by groups. They may contain invalid values if
 * a group was skipped during the matching.
 * Each group owns two slots: its start index and its end index.
 */
int[] groups;
/**
 * The range within the sequence that is to be matched. Anchors
 * will match at these "hard" boundaries. Changing the region
 * changes these values.
 */
int from, to;
/**
 * Lookbehind uses this value to ensure that the subexpression
 * match ends at the point where the lookbehind was encountered.
 */
int lookbehindTo;
/**
 * The original string being matched.
 */
CharSequence text;
/**
 * Matcher state used by the last node. NOANCHOR is used when a
 * match does not have to consume all of the input. ENDANCHOR is
 * the mode used for matching all the input.
 */
static final int ENDANCHOR = 1;
static final int NOANCHOR = 0;
int acceptMode = NOANCHOR;
/**
 * The range of string that last matched the pattern. If the last
 * match failed then first is -1; last initially holds 0 then it
 * holds the index of the end of the last match (which is where the
 * next search starts).
 */
int first = -1, last = 0;
/**
 * The end index of what matched in the last match operation.
 */
int oldLast = -1;
/**
 * The index of the last position appended in a substitution.
 */
int lastAppendPosition = 0;
/**
 * Storage used by nodes to tell what repetition they are on in
 * a pattern, and where groups begin. The nodes themselves are
 * stateless, so they rely on this field to hold state during a match.
 */
int[] locals;
/**
 * Storage used by top greedy Loop node to store a specific hash set to
 * keep the beginning index of the failed repetition match. The nodes
 * themselves are stateless, so they rely on this field to hold state
 * during a match.
 */
IntHashSet[] localsPos;
/**
 * Boolean indicating whether or not more input could change
 * the results of the last match.
 *
 * If hitEnd is true, and a match was found, then more input
 * might cause a different match to be found.
 * If hitEnd is true and a match was not found, then more
 * input could cause a match to be found.
 * If hitEnd is false and a match was found, then more input
 * will not change the match.
 * If hitEnd is false and a match was not found, then more
 * input will not cause a match to be found.
 */
boolean hitEnd;
/**
 * Boolean indicating whether or not more input could change
 * a positive match into a negative one.
 *
 * If requireEnd is true, and a match was found, then more
 * input could cause the match to be lost.
 * If requireEnd is false and a match was found, then more
 * input might change the match but the match won't be lost.
 * If a match was not found, then requireEnd has no meaning.
 */
boolean requireEnd;
/**
 * If transparentBounds is true then the boundaries of this
 * matcher's region are transparent to lookahead, lookbehind,
 * and boundary matching constructs that try to see beyond them.
 */
boolean transparentBounds = false;
/**
 * If anchoringBounds is true then the boundaries of this
 * matcher's region match anchors such as ^ and $.
 */
boolean anchoringBounds = true;
/**
 * Number of times this matcher's state has been modified
 */
int modCount;
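The hitEnd, requireEnd, and transparentBounds fields are exposed through public Matcher methods, so their meaning can be observed on small inputs. This is a sketch only; MatcherStateDemo, probe, and lookbehindSees are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    // Runs find() once and reports {found, hitEnd, requireEnd}
    static boolean[] probe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean found = m.find();
        return new boolean[] { found, m.hitEnd(), m.requireEnd() };
    }

    // A lookbehind at the region start can only see the 'a' before the
    // region when the bounds are transparent
    static boolean lookbehindSees(boolean transparent) {
        Matcher m = Pattern.compile("(?<=a)b").matcher("ab");
        m.region(1, 2).useTransparentBounds(transparent);
        return m.find();
    }

    public static void main(String[] args) {
        // No match, but more input ("abc") could produce one: hitEnd is true
        System.out.println(java.util.Arrays.toString(probe("abc", "ab")));
        // Match found without touching the end of input: hitEnd is false
        System.out.println(java.util.Arrays.toString(probe("a", "ab")));
        // Match depends on the end of input: requireEnd is true
        System.out.println(java.util.Arrays.toString(probe("a$", "a")));
        System.out.println(lookbehindSees(false)); // false
        System.out.println(lookbehindSees(true));  // true
    }
}
```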
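Constructor
Matcher(Pattern parent, CharSequence text) {
    // The pattern that created this matcher
    this.parentPattern = parent;
    // The input to be matched
    this.text = text;
    // Allocate state storage
    // Reserve room for at least 10 groups, even if the pattern has fewer
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    // Two slots per group: start index and end index
    groups = new int[parentGroupCount * 2];
    // One slot per local variable counted in localCount (group starts and
    // repetition counters)
    locals = new int[parent.localCount];
    // One hash set per top-level greedy loop, recording the start positions
    // of repetitions that already failed
    localsPos = new IntHashSet[parent.localTCNCount];
    // Put fields into initial states
    reset();
}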
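Key methods
reset
Resetting a matcher discards all of its explicit state information and sets its append position to zero. The matcher's region is set to the default region, which is its entire character sequence. The anchoring and transparency of this matcher's region boundaries are unaffected.
public Matcher reset() {
    first = -1;
    last = 0;
    oldLast = -1;
    // Set every entry of groups and of locals to -1
    for(int i=0; i<groups.length; i++)
        groups[i] = -1;
    for(int i=0; i<locals.length; i++)
        locals[i] = -1;
    for (int i = 0; i < localsPos.length; i++) {
        if (localsPos[i] != null)
            localsPos[i].clear();
    }
    lastAppendPosition = 0;
    from = 0;
    to = getTextLength();
    modCount++;
    return this;
}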
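The effect of reset() on the region and on the search position can be seen with a short experiment. The sketch below assumes a hypothetical class ResetDemo; only public Matcher methods are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Returns {start inside region, start after reset, regionStart, regionEnd}
    static int[] run() {
        Matcher m = Pattern.compile("\\d").matcher("a1b2");
        m.region(2, 4);
        m.find();                  // only "2" at index 3 lies inside the region
        int inRegion = m.start();
        m.reset();                 // region is the whole input again, state is discarded
        m.find();                  // now "1" at index 1 is found first
        return new int[] { inRegion, m.start(), m.regionStart(), m.regionEnd() };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run())); // [3, 1, 0, 4]
    }
}
```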
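find
Attempts to find the next subsequence of the input sequence that matches the pattern. This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match. If the match succeeds, more information can be obtained via the start, end, and group methods. Returns true if, and only if, a subsequence of the input sequence matches this matcher's pattern.
public boolean find() {
    // The next search starts where the previous match ended
    int nextSearchIndex = last;
    // After an empty match (last == first) move forward by one position
    if (nextSearchIndex == first)
        nextSearchIndex++;
    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;
    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    // Otherwise run the search from that index
    return search(nextSearchIndex);
}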
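The check nextSearchIndex == first matters for patterns that can match the empty string, because without the increment the same empty match would be reported forever. A minimal sketch, with FindDemo and spans as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects "start-end" for every match that find() reports
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Empty match at 0, then "aa" at 1..3, then an empty match at the end
        System.out.println(spans("a*", "baa")); // [0-0, 1-3, 3-3]
    }
}
```

After the empty match at index 0, last and first are both 0, so the next search starts at 1. After the final empty match at index 3 the next start would be 4, which is beyond to, and find() returns false.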
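search
Initiates a search to find the pattern within the given bounds. The groups are filled with default values and the match of the root of the state machine is called. The state machine holds the state of the match as it proceeds in this matcher. Matcher.from is not set here, because it is the "hard" boundary of the start of the search, which is what anchors refer to. The from parameter is the "soft" boundary of the start of the search: the regex tries to match at that index, but ^ will not match there. Subsequent calls to the search methods start at a new "soft" boundary, which is the end of the previous match.
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    // Initialize every groups[i] to -1
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    for (int i = 0; i < localsPos.length; i++) {
        if (localsPos[i] != null)
            localsPos[i].clear();
    }
    acceptMode = NOANCHOR;
    // Start matching at the root node
    boolean result = parentPattern.root.match(this, from, text);
    // On failure, reset first
    if (!result)
        this.first = -1;
    // Remember the current last as oldLast
    this.oldLast = this.last;
    // Count the state modification
    this.modCount++;
    return result;
}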
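The difference between the "soft" start index and the "hard" boundary Matcher.from shows up with the ^ anchor. The sketch below uses find(int), which resets the matcher and searches from the given index, and region(int, int), which moves the hard boundary; BoundaryDemo and its method names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Soft boundary: the search starts at index 1 but from is still 0,
    // so ^ cannot match at index 1
    static boolean softStart() {
        return Pattern.compile("^b").matcher("ab").find(1);
    }

    // Hard boundary: region(1, 2) sets from to 1, so ^ matches at index 1
    static boolean hardStart() {
        Matcher m = Pattern.compile("^b").matcher("ab");
        return m.region(1, 2).find();
    }

    // With anchoring bounds switched off, ^ ignores the region start again
    static boolean hardStartWithoutAnchoring() {
        Matcher m = Pattern.compile("^b").matcher("ab");
        return m.region(1, 2).useAnchoringBounds(false).find();
    }

    public static void main(String[] args) {
        System.out.println(softStart());                 // false
        System.out.println(hardStart());                 // true
        System.out.println(hardStartWithoutAnchoring()); // false
    }
}
```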
Creating groups: createGroup
Every group is created through createGroup, which builds a GroupHead and a GroupTail. Each call increments localCount (the number of local slots), so the group's localIndex is always one less than the updated localCount.
private Node createGroup(boolean anonymous) {
    // Slot in Matcher.locals that will hold this group's start position
    // in the input during a match
    int localIndex = localCount++;
    int groupIndex = 0;
    // Matcher.groups stores the start and end index of every capturing
    // group. GroupTail doubles the group number
    // (groupIndex = groupCount + groupCount), so group n keeps its start at
    // the even index 2n and its end at the odd index 2n + 1. This is the
    // even/odd lookup that group() relies on.
    if (!anonymous)
        // Capturing group: take the next group number. Numbers start at 1
        // because capturingGroupCount is initialized to 1; group 0 is
        // always reserved for the whole match.
        groupIndex = capturingGroupCount++;
    // Create the group head
    GroupHead head = new GroupHead(localIndex);
    // Create the group tail, which writes the group's start and end
    // positions into groups[]
    root = new GroupTail(localIndex, groupIndex);
    // for debug/print only, head.match does NOT need the "tail" info
    head.tail = (GroupTail)root;
    // Record the head of a capturing group whose number is below 10.
    // groupNodes is therefore either all null (no capturing group) or has
    // groupNodes[0] == null and the GroupHead of group n at index n.
    if (!anonymous && groupIndex < 10)
        groupNodes[groupIndex] = head;
    return head;
}
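The numbering rule in createGroup (only non-anonymous groups consume a group number, and numbering starts at 1) is visible through the public API. A small sketch, where GroupNumberDemo and the sample pattern are made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns {groupCount, group(0), group(1), group(2), group("last")}
    static String[] run() {
        // (?:b) is anonymous (non-capturing), so (?<last>c) becomes group 2
        Matcher m = Pattern.compile("(a)(?:b)(?<last>c)").matcher("abc");
        if (!m.matches()) {
            throw new IllegalStateException("pattern should match");
        }
        return new String[] {
            String.valueOf(m.groupCount()), m.group(0), m.group(1),
            m.group(2), m.group("last")
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", run())); // 2,abc,a,c,c
    }
}
```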
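match
Every node type overrides the match method of Node, so the behavior depends on the node.
Start.match
boolean match(Matcher matcher, int i, CharSequence seq) {
    if (i > matcher.to - minLength) {
        matcher.hitEnd = true;
        return false;
    }
    // guard is the region end minus minLength, where minLength is the
    // smallest number of characters any successful match must consume.
    // For \d+(\d|\w) it is 2: the alternation inside the group counts once.
    // A start position beyond guard cannot fit a match, so guard bounds the
    // number of start positions that are tried before the match fails.
    int guard = matcher.to - minLength;
    for (; i <= guard; i++) {
        // Every attempt begins at the node that follows Start
        if (next.match(matcher, i, seq)) {
            // This is the Start node, so the values stored here are the
            // start and end of the whole match; that is why they live in
            // groups[0] and groups[1]. The other groups are stored by
            // their GroupTail nodes.
            matcher.first = i;
            matcher.groups[0] = matcher.first;
            matcher.groups[1] = matcher.last;
            return true;
        }
    }
    matcher.hitEnd = true;
    return false;
}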
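The role of guard can be modeled without any regex machinery. The sketch below is a deliberately simplified stand-in for Start.match that handles only a literal: String.startsWith plays the part of next.match, and minLength is the literal's length. StartScanDemo and firstMatch are hypothetical names, and this is not how the JDK implements literal search.

```java
class StartScanDemo {
    // Tries each start position i from 0 up to guard = to - minLength;
    // positions after guard cannot hold a match of minLength characters
    static int firstMatch(String text, String literal) {
        int minLength = literal.length();
        int guard = text.length() - minLength;
        for (int i = 0; i <= guard; i++) {
            if (text.startsWith(literal, i)) { // stands in for next.match(matcher, i, seq)
                return i;
            }
        }
        return -1; // the real node also sets matcher.hitEnd = true here
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("abcd", "cd")); // 2
        System.out.println(firstMatch("ab", "xyz"));  // -1, guard is negative
    }
}
```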
GroupHead.match
boolean match(Matcher matcher, int i, CharSequence seq) {
    // Save the value currently stored in this group's local slot
    int save = matcher.locals[localIndex];
    // Store the current position as the start of this group
    matcher.locals[localIndex] = i;
    // Continue with the node that follows the GroupHead
    boolean ret = next.match(matcher, i, seq);
    // The recursive match has returned: restore the saved slot value
    matcher.locals[localIndex] = save;
    return ret;
}
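Slice.match
Matches a literal slice of the pattern against the input. The literal text inside a group is a slice as well.
boolean match(Matcher matcher, int i, CharSequence seq) {
    // buffer holds the literal slice, e.g. 456 for the pattern 3(456)
    int[] buf = buffer;
    int len = buf.length;
    // Compare the slice character by character; fail on the first mismatch
    for (int j=0; j<len; j++) {
        if ((i+j) >= matcher.to) {
            matcher.hitEnd = true;
            return false;
        }
        if (buf[j] != seq.charAt(i+j))
            return false;
    }
    // The whole slice matched: continue with the next node
    return next.match(matcher, i+len, seq);
}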
GroupTail.match
GroupTail handles the setting of group beginning and ending locations when groups are successfully matched. It must also be able to unset groups that have to be backed off of. A GroupTail node is also used when a previous group is referenced; in that case no group information needs to be set. The constructor doubles the group number (groupIndex = groupCount + groupCount) so that groupIndex points at the group's start slot in Matcher.groups.
GroupTail(int localCount, int groupCount) {
    localIndex = localCount;
    // The start index is stored at the even position
    groupIndex = groupCount + groupCount;
}
boolean match(Matcher matcher, int i, CharSequence seq) {
    // Read this group's start position from its local slot
    int tmp = matcher.locals[localIndex];
    if (tmp >= 0) { // This is the normal group case.
        // Save the group so we can unset it if it
        // backs off of a match.
        int groupStart = matcher.groups[groupIndex];
        int groupEnd = matcher.groups[groupIndex+1];
        // Store the group start at the even position 2 * group number
        matcher.groups[groupIndex] = tmp;
        // Store the group end right after it, at the odd position
        matcher.groups[groupIndex+1] = i;
        // Hand control to the next node; succeed if it matches
        if (next.match(matcher, i, seq)) {
            return true;
        }
        // The rest of the pattern failed: restore the saved values
        matcher.groups[groupIndex] = groupStart;
        matcher.groups[groupIndex+1] = groupEnd;
        return false;
    } else {
        // This is a group reference case. We don't need to save any
        // group info because it isn't really a group.
        matcher.last = i;
        return true;
    }
}
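The save-and-restore logic in GroupTail.match is what makes a group's final value reflect the attempt that succeeded rather than the first attempt. The sketch below picks a pattern where the greedy group must back off once; BacktrackDemo and groupOne are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackDemo {
    // Returns group(1) after a full match, or null when there is no match
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // (a*) first grabs "aa", but then "ab" cannot match, so the group
        // is unset, the quantifier gives back one 'a', and the group is set again
        System.out.println(groupOne("(a*)ab", "aab"));   // a
        // Same idea with digits: (\d+) gives back one digit for the trailing \d
        System.out.println(groupOne("(\\d+)\\d", "123")); // 12
    }
}
```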
group
Returns the text captured by the group with the given number. group(0) returns the match of the entire expression.
public String group(int group) {
    if (first < 0)
        throw new IllegalStateException("No match found");
    if (group < 0 || group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
        return null;
    // The even slot holds the start index, the odd slot the end index
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
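The three outcomes of group(int) (an exception before any match, null for a group that did not participate, and the captured text otherwise) can be observed as follows. This is a sketch with assumed names GroupLookupDemo, describe, and throwsBeforeMatch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLookupDemo {
    // Returns "group(1),group(2),start(1),start(2),end(2)" for (a)|(b) on "b"
    static String describe() {
        Matcher m = Pattern.compile("(a)|(b)").matcher("b");
        m.find();
        // Group 1 did not take part in the match, so its slots stay at -1
        return m.group(1) + "," + m.group(2) + ","
                + m.start(1) + "," + m.start(2) + "," + m.end(2);
    }

    // group() before any successful match fails the first < 0 check
    static boolean throwsBeforeMatch() {
        try {
            Pattern.compile("a").matcher("a").group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());          // null,b,-1,0,1
        System.out.println(throwsBeforeMatch()); // true
    }
}
```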