Regular Expressions (Part 2)

success112

Column: java重大关键点 | Tags: regular expressions, front end, back end

Article link: https://blog.csdn.net/success112/article/details/121249065


Overview

Java's regular expression implementation is based on an NFA engine. Pattern and Matcher are its two most important classes.

Pattern

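A Pattern is the compiled representation of a regular expression. It mainly holds the tree of Node objects built from the expression, together with the character-handling methods used while parsing.

Main fields

Flags (modifiers)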

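    /*
     * Regular expression modifier (flag) values. Instead of being passed as
     * arguments, they can also be written as inline modifiers.
     * For example, p1 and p2 below are equivalent:
     *   Pattern p1 = Pattern.compile("abc", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
     *   Pattern p2 = Pattern.compile("(?im)abc", 0);
     */

    /**
     * Enables Unix lines mode. In this mode only \n is recognized as a
     * line terminator.
     * Unix lines mode can also be enabled via the embedded flag
     * expression (?d).
     */
    public static final int UNIX_LINES = 0x01;

    /**
     * Enables case-insensitive matching.
     * By default, case-insensitive matching assumes that only characters in
     * the US-ASCII charset are being matched. Unicode-aware case-insensitive
     * matching can be enabled by specifying UNICODE_CASE together with this flag.
     * Can also be enabled via (?i).
     */
    public static final int CASE_INSENSITIVE = 0x02;

    /**
     * Permits whitespace and comments in the pattern.
     * In this mode whitespace is ignored, and embedded comments starting
     * with # are ignored until the end of the line.
     * Can also be enabled via (?x).
     */
    public static final int COMMENTS = 0x04;

    /**
     * Enables multiline mode.
     * In multiline mode the expressions ^ and $ match just after or just
     * before, respectively, a line terminator or the end of the input
     * sequence. By default these expressions only match at the beginning
     * and the end of the entire input sequence.
     * Can also be enabled via (?m).
     */
    public static final int MULTILINE = 0x08;

    /**
     * Enables literal parsing of the pattern.
     * When this flag is specified, the input string that specifies the
     * pattern is treated as a sequence of literal characters. Metacharacters
     * and escape sequences in it have no special meaning. CASE_INSENSITIVE
     * and UNICODE_CASE keep their effect on matching when combined with this
     * flag; the other flags become superfluous.
     * There is no embedded flag for enabling literal parsing.
     */
    public static final int LITERAL = 0x10;

    /**
     * Enables dotall mode, also known as single-line mode.
     * In dotall mode the expression . matches any character, including a
     * line terminator. By default . does not match line terminators.
     * Dotall mode can also be enabled via the embedded flag expression (?s).
     * (The s is a mnemonic for "single-line" mode, which is what this is
     * called in Perl.)
     */
    public static final int DOTALL = 0x20;

    /**
     * Enables Unicode-aware case folding.
     * When this flag is specified, case-insensitive matching, when enabled
     * by CASE_INSENSITIVE, is done in a manner consistent with the Unicode
     * Standard. Specifying this flag may impose a performance penalty.
     */
    public static final int UNICODE_CASE = 0x40;

    /**
     * Enables canonical equivalence.
     * When this flag is specified, two characters are considered to match
     * if, and only if, their full canonical decompositions match. For
     * example, the expression "a\u030A" will match the string "\u00E5" when
     * this flag is specified. By default, matching does not take canonical
     * equivalence into account. There is no embedded flag for enabling
     * canonical equivalence.
     * Specifying this flag may impose a performance penalty.
     */
    public static final int CANON_EQ = 0x80;

    /**
     * Enables the Unicode version of predefined character classes and
     * POSIX character classes.
     * This flag implies UNICODE_CASE.
     * Specifying this flag may impose a performance penalty.
     */
    public static final int UNICODE_CHARACTER_CLASS = 0x100;

    /**
     * Contains every valid flag.
     * Its complement is ANDed with a supplied flag value; a nonzero result
     * means the value contains a bit that is not a modifier Java regex
     * recognizes.
     */
    private static final int ALL_FLAGS = CASE_INSENSITIVE | MULTILINE |
            DOTALL | UNICODE_CASE | CANON_EQ | UNIX_LINES | LITERAL |
            UNICODE_CHARACTER_CLASS | COMMENTS;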
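
The equivalence of flag arguments and inline modifiers described in the comment above can be checked with a small sketch. The class name FlagDemo and the pattern ^abc$ are illustrative choices, not taken from the JDK source; the anchors are added only so that MULTILINE has a visible effect.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Flags passed as a bit mask to Pattern.compile.
    static final Pattern BY_ARGUMENT =
            Pattern.compile("^abc$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // The same two flags written inline at the start of the pattern.
    static final Pattern BY_INLINE = Pattern.compile("(?im)^abc$");

    // No flags at all, for comparison.
    static final Pattern NO_FLAGS = Pattern.compile("^abc$");

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "x\nABC\ny";
        System.out.println(found(BY_ARGUMENT, input));
        System.out.println(found(BY_INLINE, input));
        System.out.println(found(NO_FLAGS, input));
    }
}
```

Both flagged patterns find the middle line ABC, while the unflagged pattern does not, because without MULTILINE the ^ anchor only matches at the start of the input.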

Other fields

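    // Only pattern and flags are serialized. After deserialization the
    // pattern has to be compiled again.
    /**
     * The original regular-expression pattern string.
     */
    private String pattern;

    /**
     * The original pattern flags.
     *
     * @serial
     */
    private int flags;

    /**
     * The temporary pattern flags used during compiling. The flags might be
     * turned on and off by embedded flag expressions.
     */
    private transient int flags0;

    /**
     * Boolean indicating this Pattern is compiled; this is necessary
     * in order to lazily compile deserialized Patterns.
     */
    private transient volatile boolean compiled;

    /**
     * The normalized pattern string.
     * NFC (the default normalization form) is canonical composition: it
     * combines sequences of simple characters into composed characters.
     * "Canonical equivalence" means equivalence in both appearance and meaning.
     * NFD is canonical decomposition: it splits a composed character into
     * the simple characters it is made of. A pure group is then built to
     * match the canonical equivalents (see Unicode normalization).
     * Example: O with caron can be written as the single character U+01D1
     * or as O (U+004F) followed by the combining caron (U+030C). Without
     * CANON_EQ only the form written in the pattern matches; with CANON_EQ
     * the two forms are treated as equivalent.
     */
    private transient String normalizedPattern;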

    /**
     * The starting point of state machine for the find operation.
     * This allows a match to start anywhere in the input.
     */
    transient Node root;

    /**
     * The root of object tree for a match operation. The pattern is
     * matched at the beginning. This may include a find that uses
     * BnM or a First node.
     */
    transient Node matchRoot;

    /**
     * Temporary storage used by parsing pattern slice.
     */
    transient int[] buffer;

    /**
     * A temporary storage used for predicate for double return.
     */
    transient CharPredicate predicate;

    /**
     * Map the "name" of the "named capturing group" to its group id
     * node.
     */
    transient volatile Map<String, Integer> namedGroups;

    /**
     * Temporary storage used while parsing group references.
     */
    transient GroupHead[] groupNodes;

    /**
     * Temporary storage used to store the top level closure nodes
     * (top-level greedy nodes such as .* and .+).
     */
    transient List<Node> topClosureNodes;

    /**
     * The number of top greedy closure nodes in this Pattern. Used by
     * matchers to allocate storage needed for a IntHashSet to keep
     * the beginning pos {@code i} of all failed match.
     */
    transient int localTCNCount;

    /*
     * Turn off the stop-exponential-backtracking optimization if
     * there is a group ref in the pattern.
     */
    transient boolean hasGroupRef;

    /**
     * Temporary null terminated code point array used by pattern
     * compiling.
     */
    private transient int[] temp;

    /**
     * The number of capturing groups in this Pattern. Used by
     * matchers to allocate storage needed to perform a match.
     */
    transient int capturingGroupCount;

    /**
     * The local variable count used by parsing tree. Used by matchers
     * to allocate storage needed to perform a match.
     * createGroup executes int localIndex = localCount++, so every group
     * it creates (capturing or not) adds 1 to localCount and receives the
     * zero-based slot localIndex. These slots are essential when the
     * groups[] array is filled in, because GroupTail reads the start
     * position of its group from locals[localIndex].
     */
    transient int localCount;

    /**
     * Index into the pattern string that keeps track of how much has
     * been parsed.
     */
    private transient int cursor;

    /**
     * Holds the length of the pattern string.
     */
    private transient int patternLength;

    /**
     * If the Start node might possibly match supplementary or
     * surrogate code points. It is set to true during compiling if
     * (1) There is supplementary or surrogate code point in pattern,
     * or (2) There is complement node of a "family" CharProperty
     */
    private transient boolean hasSupplementary;

Constructor

private Pattern(String p, int f) {
        // Reject flag values that contain unknown bits
        if ((f & ~ALL_FLAGS) != 0) {
            throw new IllegalArgumentException("Unknown flag 0x"
                                               + Integer.toHexString(f));
        }
        // Record the pattern string and the flags (modifiers) passed in
        pattern = p;
        flags = f;

        // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
        if ((flags & UNICODE_CHARACTER_CLASS) != 0)
            flags |= UNICODE_CASE;

        // 'flags' for compiling
        flags0 = flags;

        // Reset group index count
        capturingGroupCount = 1;
        localCount = 0;
        localTCNCount = 0;

        if (!pattern.isEmpty()) {
            try {
                compile();
            } catch (StackOverflowError soe) {
                throw error("Stack overflow during pattern compilation");
            }
        } else {
            // Empty pattern: the Start node points directly at the final
            // accept node, and the match root is that accept node as well
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }
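
Two branches of this constructor are observable through the public API: UNICODE_CHARACTER_CLASS switches on UNICODE_CASE, and an empty pattern compiles to a tree that accepts immediately. The sketch below illustrates both; ConstructorDemo and its helper are hypothetical names, and the characters U+00E9 and U+00C9 are simply one convenient non-ASCII case pair.

```java
import java.util.regex.Pattern;

class ConstructorDemo {
    // Compiles regex with the given flags and tests it against the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute, lower case
        String upper = "\u00c9"; // E with acute, upper case

        // ASCII-only case folding: the non-ASCII pair is not folded.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper));

        // UNICODE_CHARACTER_CLASS implies UNICODE_CASE, so the pair now folds.
        System.out.println(matches(lower,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS, upper));

        // The empty pattern matches only the empty input as a whole.
        System.out.println(matches("", 0, ""));
        System.out.println(matches("", 0, "a"));
    }
}
```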

Compilation

 private void compile() {
        // Handle canonical equivalences
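        if (has(CANON_EQ) && !has(LITERAL)) {
            // Canonical equivalence is on: convert the pattern string to its
            // normalized form (Unicode canonical decomposition)
            normalizedPattern = normalize(pattern);
        } else {
            // Otherwise leave the pattern string unchanged
            normalizedPattern = pattern;
        }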
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        // The array stores the code point of every character in the pattern.
        // A single char does not necessarily represent a whole character (it
        // may be only one half of a surrogate pair), so char is an
        // inconvenient storage type here.
        temp = new int[patternLength + 2];
        // Becomes true when the pattern contains a supplementary character
        // (a character outside plane 0, the Basic Multilingual Plane)
        hasSupplementary = false;
        int c, count = 0;
        // Convert all chars into code points
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) {
                hasSupplementary = true;
            }
            // Store each code point in temp
            temp[count++] = c;
        }
        patternLength = count;   // patternLength now in code points
        // Unless the pattern is literal, preprocess \Q...\E sequences:
        // non-word characters between \Q and \E get a backslash in front of
        // them, and the \Q and \E markers themselves are removed
        if (! has(LITERAL))
            RemoveQEQuoting();

        // Allocate all temporary objects here.
        buffer = new int[32];
        // Array of group head nodes
        groupNodes = new GroupHead[10];
        // Map from group name to group index
        namedGroups = null;
        // Storage for the top-level greedy closure nodes
        topClosureNodes = new ArrayList<>(10);

        if (has(LITERAL)) {
            // Literal pattern handling: build a slice node that matches the
            // pattern text character by character
            matchRoot = newSlice(temp, patternLength, hasSupplementary);
            // The node after it is the final accept node (end of match)
            matchRoot.next = lastAccept;
        } else {
            // Start recursive descent parsing
            // The expression is parsed with branch nodes added for
            // alternations. expr may be called recursively to parse
            // subexpressions that may contain alternations.
            matchRoot = expr(lastAccept);
            // Check extra pattern characters
            // Leftover, unparsed pattern characters raise an error
            if (patternLength != cursor) {
                if (peek() == ')') {
                    throw error("Unmatched closing ')'");
                } else {
                    throw error("Unexpected internal error");
                }
            }
        }

        // Peephole optimization
        if (matchRoot instanceof Slice) {
            // The pattern is a literal slice: search with the Boyer-Moore (BnM) algorithm
            root = BnM.optimize(matchRoot);
            if (root == matchRoot) {
                root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
            }
        } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
            root = matchRoot;
        } else {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }

        // Optimize the greedy Loop to prevent exponential backtracking, IF there
        // is no group ref in this pattern. With a non-negative localTCNCount value,
        // the greedy type Loop, Curly will skip the backtracking for any starting
        // position "i" that failed in the past.
        if (!hasGroupRef) {
            for (Node node : topClosureNodes) {
                if (node instanceof Loop) {
                    // non-deterministic-greedy-group
                    ((Loop)node).posIndex = localTCNCount++;
                }
            }
        }

        // Release temporary storage
        temp = null;
        buffer = null;
        groupNodes = null;
        patternLength = 0;
        compiled = true;
        topClosureNodes = null;
    }
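
The literal branch, the \Q...\E preprocessing, and the leftover-character check in compile() all have effects that can be seen from outside. The following is a minimal sketch using only the public API; CompileDemo and its helper methods are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileDemo {
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    // Returns true when the regex compiles, false when it is rejected.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern literal = Pattern.compile("a.c", Pattern.LITERAL);
        Pattern quoted = Pattern.compile("\\Qa.c\\E");
        Pattern plain = Pattern.compile("a.c");

        // With LITERAL or \Q...\E the dot is an ordinary character.
        System.out.println(matches(literal, "a.c") + " " + matches(literal, "abc"));
        System.out.println(matches(quoted, "a.c") + " " + matches(quoted, "abc"));
        // Without quoting the dot is a metacharacter.
        System.out.println(matches(plain, "abc"));

        // A stray ')' is left over after parsing and is reported as an error.
        System.out.println(compiles("ab)"));
    }
}
```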

Canonical equivalence
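
As a rough illustration of canonical equivalence, the sketch below reuses the example from the CANON_EQ comment: the pattern is written as a followed by the combining ring U+030A, and the input is the precomposed character U+00E5. CanonEqDemo is a hypothetical name; the behavior shown is the one the flag's documentation describes.

```java
import java.util.regex.Pattern;

class CanonEqDemo {
    static final String DECOMPOSED = "a\u030A"; // a + combining ring above
    static final String COMPOSED = "\u00E5";    // precomposed a with ring

    static boolean matches(int flags) {
        return Pattern.compile(DECOMPOSED, flags).matcher(COMPOSED).matches();
    }

    public static void main(String[] args) {
        // Without CANON_EQ the two spellings are different code point sequences.
        System.out.println(matches(0));
        // With CANON_EQ they are compared by canonical decomposition.
        System.out.println(matches(Pattern.CANON_EQ));
    }
}
```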

Node

sequence

Parses the sequences between alternations.

private Node sequence(Node end) {
        Node head = null;
        Node tail = null;
        Node node;
    LOOP:
        for (;;) {
            // Read the element at cursor from temp without advancing the cursor
            int ch = peek();
            switch (ch) {
            case '(':
                // Because group handles its own closure,
                // we need to treat it differently
                node = group0();
                // Check for comment or flag group
                if (node == null)
                    continue;
                if (head == null)
                    head = node;
                else
                    tail.next = node;
                // Double return: Tail was returned in root
                tail = root;
                continue;
            case '[':
                if (has(CANON_EQ) && !has(LITERAL))
                    node = new NFCCharProperty(clazz(true));
                else
                    node = newCharProperty(clazz(true));
                break;
            case '\\':
                // Backslash escape: skip to the escaped character and
                // advance the cursor by one
                ch = nextEscaped();
                // Unicode property class (\p) or its complement (\P)?
                if (ch == 'p' || ch == 'P') {
                    // Assume a single-letter property name such as \pL
                    boolean oneLetter = true;
                    // \P is the complemented form
                    boolean comp = (ch == 'P');
                    ch = next(); // Consume { if present
                    if (ch != '{') {
                        // No '{': the name is a single letter (for example
                        // \pL), so push back the character just consumed
                        unread();
                    } else {
                        // '{' present: the name may have several letters,
                        // as in \p{Lu}
                        oneLetter = false;
                    }
                    // node = newCharProperty(family(oneLetter, comp));
                    // Canonical equivalence is on and the pattern is not literal
                    if (has(CANON_EQ) && !has(LITERAL))
                        // Parse the Unicode family and wrap it for NFC matching
                        node = new NFCCharProperty(family(oneLetter, comp));
                    else
                        // Parse the Unicode family and return the node for it
                        node = newCharProperty(family(oneLetter, comp));
                } else {
                    // Not a Unicode property: step back and create an atom node
                    unread();
                    node = atom();
                }
                break;
            case '^':
                next();
                if (has(MULTILINE)) {
                    if (has(UNIX_LINES))
                        node = new UnixCaret();
                    else
                        node = new Caret();
                } else {
                    node = new Begin();
                }
                break;
            case '$':
                next();
                if (has(UNIX_LINES))
                    node = new UnixDollar(has(MULTILINE));
                else
                    node = new Dollar(has(MULTILINE));
                break;
            case '.':
                next();
                if (has(DOTALL)) {
                    node = new CharProperty(ALL());
                } else {
                    if (has(UNIX_LINES)) {
                        node = new CharProperty(UNIXDOT());
                    } else {
                        node = new CharProperty(DOT());
                    }
                }
                break;
            case '|':
            case ')':
                break LOOP;
            case ']': // Now interpreting dangling ] and } as literals
            case '}':
                node = atom();
                break;
            case '?':
            case '*':
            case '+':
                next();
                throw error("Dangling meta character '" + ((char)ch) + "'");
            case 0:
                if (cursor >= patternLength) {
                    break LOOP;
                }
                // Fall through
            default:
                node = atom();
                break;
            }
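            // Process repetition: if the next character peeked is a
            // quantifier, a new node must be appended to handle the
            // repetition. The previous node can be a single node or a group,
            // so it may be a chain of nodes.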
            node = closure(node);
            /* save the top dot-greedy nodes (.*, .+) as well
            if (node instanceof GreedyCharProperty &&
                ((GreedyCharProperty)node).cp instanceof Dot) {
                topClosureNodes.add(node);
            }
            */
            if (head == null) {
                head = tail = node;
            } else {
                tail.next = node;
                tail = node;
            }
        }
        if (head == null) {
            return end;
        }
        tail.next = end;
        root = tail;      //double return
        return head;
    }
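
Several cases of the switch in sequence() can be observed through Pattern.compile: a quantifier with nothing before it is rejected, a dangling ] or } is taken literally, the dot depends on DOTALL and UNIX_LINES, and \p accepts both the one-letter and the braced form. The sketch below is illustrative only; SequenceDemo and its helpers are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SequenceDemo {
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // for example "Dangling meta character '*'"
        }
    }

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // case '?', '*', '+': a quantifier needs something to repeat.
        System.out.println(compiles("*a") + " " + compiles("a*"));

        // case ']' and '}': dangling closers are literals.
        System.out.println(matches("a]", 0, "a]") + " " + matches("a}", 0, "a}"));

        // case '.': line terminators are excluded unless DOTALL is set;
        // with UNIX_LINES only \n counts as a line terminator.
        System.out.println(matches(".", 0, "\n"));
        System.out.println(matches(".", Pattern.DOTALL, "\n"));
        System.out.println(matches(".", 0, "\r"));
        System.out.println(matches(".", Pattern.UNIX_LINES, "\r"));

        // case '\\': one-letter and braced Unicode property names.
        System.out.println(matches("\\pL", 0, "x") + " " + matches("\\p{Lu}", 0, "X"));
    }
}
```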

Matcher

A Matcher is the engine that performs match operations on a character sequence by interpreting a Pattern.

Fields


    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups (a start index and an end index for each
     * group). They may contain invalid values if a group was skipped
     * during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are
     * stateless, so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Storage used by top greedy Loop node to store a specific hash set to
     * keep the beginning index of the failed repetition match. The nodes
     * themselves are stateless, so they rely on this field to hold state
     * during a match.
     */
    IntHashSet[] localsPos;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * Number of times this matcher's state has been modified
     */
    int modCount;
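
The four hitEnd cases and the requireEnd rule above are easier to follow with concrete inputs. The sketch below probes both values after a single find() call; EndStateDemo and probe are hypothetical names, and the patterns are arbitrary small examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndStateDemo {
    // Returns { found, hitEnd, requireEnd } after one find() call.
    static boolean[] probe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean found = m.find();
        return new boolean[] { found, m.hitEnd(), m.requireEnd() };
    }

    public static void main(String[] args) {
        // $ matched because the input ended: more input could lose the match.
        System.out.println(java.util.Arrays.toString(probe("abc$", "abc")));
        // Match found well before the end: more input changes nothing.
        System.out.println(java.util.Arrays.toString(probe("abc", "abcdef")));
        // Greedy a+ ran into the end: more input could lengthen the match.
        System.out.println(java.util.Arrays.toString(probe("a+", "aaa")));
        // Input too short: more input could still produce a match.
        System.out.println(java.util.Arrays.toString(probe("abcd", "ab")));
    }
}
```

In the last case requireEnd carries no meaning, since no match was found.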

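Constructor

Matcher(Pattern parent, CharSequence text) {
        // The pattern that creates this matcher
        this.parentPattern = parent;
        // The character sequence to be matched
        this.text = text;

        // Allocate state storage
        // Use at least 10 group slots even when the pattern has fewer
        // capturing groups
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        // Two ints per group: the start index and the end index
        groups = new int[parentGroupCount * 2];
        // One slot per local variable counted by Pattern.localCount
        // (group start positions and repetition state)
        locals = new int[parent.localCount];
        // One slot per top-level greedy Loop (Pattern.localTCNCount), each
        // recording the start indices of failed repetition matches
        localsPos = new IntHashSet[parent.localTCNCount];

        // Put fields into initial states
        reset();
    }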

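Key methods

reset

Resetting a matcher discards all of its explicit state information and sets its append position to zero. The matcher's region is set to the default region, which is its entire character sequence. The anchoring and transparency of this matcher's region boundaries are unaffected.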

public Matcher reset() {
        first = -1;
        last = 0;
        oldLast = -1;
        // Set every entry of groups and of locals (group start indices) to -1
        for(int i=0; i<groups.length; i++)
            groups[i] = -1;
        for(int i=0; i<locals.length; i++)
            locals[i] = -1;
        for (int i = 0; i < localsPos.length; i++) {
            if (localsPos[i] != null)
                localsPos[i].clear();
        }
        lastAppendPosition = 0;
        from = 0;
        to = getTextLength();
        modCount++;
        return this;
    }
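
The effect of reset() can be seen by searching, resetting, and searching again. The following sketch assumes nothing beyond the public Matcher API; ResetDemo and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Finds twice, resets, finds again: the third result equals the first.
    static String[] findResetFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();
        String first = m.group();
        m.find();
        String second = m.group();
        m.reset();
        m.find();
        String afterReset = m.group();
        return new String[] { first, second, afterReset };
    }

    // After reset() there is no current match, so group() must fail.
    static boolean matchStateCleared(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();
        m.reset();
        try {
            m.group();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // reset() restores the default region: the entire input.
    static int[] regionAfterReset(String input) {
        Matcher m = Pattern.compile("x").matcher(input);
        m.region(1, 2);
        m.reset();
        return new int[] { m.regionStart(), m.regionEnd() };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", findResetFind("\\d+", "10 20 30")));
        System.out.println(matchStateCleared("\\d+", "10 20 30"));
        System.out.println(java.util.Arrays.toString(regionAfterReset("abcd")));
    }
}
```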

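find

Attempts to find the next subsequence of the input sequence that matches the pattern. This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match. If the match succeeds, more information can be obtained via the start, end, and group methods. Returns true if, and only if, a subsequence of the input sequence matches this matcher's pattern.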

public boolean find() {
        // The next search starts where the previous match ended. If that
        // match was empty (last == first), move one position forward so the
        // same empty match is not found again.
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        // Continue with the next search
        return search(nextSearchIndex);
    }
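
The nextSearchIndex++ step for empty matches is the reason a pattern such as a* does not loop forever on the same position. A small sketch that lists every match span makes this visible; FindDemo and spans are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects "start-end" for every successive find() call.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // Empty match at 0, "aa" at 1..3, empty matches at 3 and at 4.
        System.out.println(spans("a*", "baab")); // [0-0, 1-3, 3-3, 4-4]
    }
}
```

After the empty match at index 3 the search restarts at index 4, and after the empty match at index 4 the next start index 5 lies beyond the region, so find() returns false.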

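search

Initiates a search to find a Pattern within the given bounds. The groups are filled with default values and the match of the root of the state machine is called. The state machine holds the state of the match as it proceeds in this matcher. Matcher.from is not set here, because it is the "hard" boundary of the start of the search, which anchors will match at. The from parameter is the "soft" boundary of the start of the search, meaning that the regex tries to match at that index but ^ won't match there. Subsequent calls to the search methods start at a new "soft" boundary, which is the end of the previous match.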

boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from        = from < 0 ? 0 : from;
        this.first  = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        // Initialize every groups[i] to -1
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        for (int i = 0; i < localsPos.length; i++) {
            if (localsPos[i] != null)
                localsPos[i].clear();
        }
        acceptMode = NOANCHOR;
        // Start matching from the root node
        boolean result = parentPattern.root.match(this, from, text);
        // On failure, reset first
        if (!result)
            this.first = -1;
        // Remember the current last as oldLast
        this.oldLast = this.last;
        // Count this state modification
        this.modCount++;
        return result;
    }
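
The difference between the "hard" boundary (the region start) and the "soft" boundary (the index a search starts from) shows up with the ^ anchor. The sketch below is one way to observe it through the public API; BoundsDemo and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundsDemo {
    static final Pattern P = Pattern.compile("^abc");
    static final String INPUT = "xxabc";

    // Region start is a hard boundary: ^ matches at index 2.
    static boolean hardBoundary() {
        Matcher m = P.matcher(INPUT);
        m.region(2, 5);
        return m.find();
    }

    // find(int) only moves the soft boundary: ^ still refers to index 0.
    static boolean softBoundary() {
        return P.matcher(INPUT).find(2);
    }

    // With anchoring bounds off, the region start no longer satisfies ^.
    static boolean regionWithoutAnchoringBounds() {
        Matcher m = P.matcher(INPUT);
        m.region(2, 5);
        m.useAnchoringBounds(false);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(hardBoundary());
        System.out.println(softBoundary());
        System.out.println(regionWithoutAnchoringBounds());
    }
}
```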

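Group creation: createGroup

Every group is created through createGroup, which builds a GroupHead and a GroupTail. Each call increments localCount, and the localIndex of the new group is the value localCount had before the increment.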

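private Node createGroup(boolean anonymous) {
        // Slot in Matcher.locals where the start position of this group
        // will be stored during matching
        int localIndex = localCount++;
        int groupIndex = 0;
        // For a non-anonymous (capturing) group the capturing-group count is
        // incremented and the group receives the previous count as its index.
        // Matcher.groups stores the start and end positions by that index:
        // the start of group n lives at groups[2n] and its end at
        // groups[2n + 1] (GroupTail computes groupIndex = groupCount +
        // groupCount), one even slot and one odd slot. The group() method
        // relies on this even/odd layout to read a group back.
        if (!anonymous)
            // Capturing groups are numbered from 1, because
            // capturingGroupCount starts at 1. Index 0 is always reserved
            // for the match of the whole pattern.
            groupIndex = capturingGroupCount++;
        // Create the group head
        GroupHead head = new GroupHead(localIndex);
        // Create the group tail, which writes the start and end positions
        // of the group into the groups[] array
        root = new GroupTail(localIndex, groupIndex);

        // for debug/print only, head.match does NOT need the "tail" info
        head.tail = (GroupTail)root;
        // For a capturing group with groupIndex below 10, record its head in
        // groupNodes. The array is either entirely null (no capturing
        // groups) or has groupNodes[0] == null with the GroupHead of each
        // group stored at the other positions.
        if (!anonymous && groupIndex < 10)
            groupNodes[groupIndex] = head;
        return head;
    }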
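
The numbering rule in createGroup, where only non-anonymous groups receive a group index, is visible from groupCount() and group(int). The following sketch uses a made-up pattern with one non-capturing and one named group; GroupNumberingDemo and the group name tail are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // (a) is group 1, (?:b) gets no number, (?<tail>c) is group 2.
    static final Pattern P = Pattern.compile("(a)(?:b)(?<tail>c)");

    static Matcher matched() {
        Matcher m = P.matcher("abc");
        if (!m.matches()) {
            throw new IllegalStateException("expected a match");
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = matched();
        System.out.println(m.groupCount());   // capturing groups only
        System.out.println(m.group(0));       // the whole match
        System.out.println(m.group(1));
        System.out.println(m.group(2));
        System.out.println(m.group("tail"));  // same group, by name
    }
}
```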

match

Every node type overrides the match method of Node, so the matching logic differs from node to node.

match∈Start

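boolean match(Matcher matcher, int i, CharSequence seq) {
            if (i > matcher.to - minLength) {
                matcher.hitEnd = true;
                return false;
            }
            // The guard matters: it is the region end minus minLength, where
            // minLength is the smallest number of characters the pattern
            // needs for a successful match. For \d+(\d|\w) the minimum is 2,
            // because a group with branches counts only once. The guard is
            // therefore the last start position worth trying; beyond it too
            // little input remains and the search fails.
            int guard = matcher.to - minLength;
            for (; i <= guard; i++) {
                // Every attempt starts from the node after the Start node
                if (next.match(matcher, i, seq)) {
                    // This is the Start node, so what gets stored here is
                    // the start and end of the match of the whole pattern,
                    // which is why they live in groups[0] and groups[1].
                    // The other groups are stored by their GroupTail nodes.
                    matcher.first = i;
                    matcher.groups[0] = matcher.first;
                    matcher.groups[1] = matcher.last;
                    return true;
                }
            }
            matcher.hitEnd = true;
            return false;
        }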

match∈GroupHead

boolean match(Matcher matcher, int i, CharSequence seq) {
            // Save the value currently stored in this group's locals slot
            int save = matcher.locals[localIndex];
            // Store the start index of the group in that slot
            matcher.locals[localIndex] = i;
            // Hand over to the node after the GroupHead
            boolean ret = next.match(matcher, i, seq);
            // The recursive match has finished: restore the saved value
            matcher.locals[localIndex] = save;
            return ret;
        }

match∈Slice

Matches a fragment of the sequence literally. The content of a group can be such a fragment too.

boolean match(Matcher matcher, int i, CharSequence seq) {
            // buf holds the literal fragment, for example the 456 in 3(456)
            int[] buf = buffer;
            int len = buf.length;
            // Compare the fragment character by character and return false
            // at the first mismatch
            for (int j=0; j<len; j++) {
                if ((i+j) >= matcher.to) {
                    matcher.hitEnd = true;
                    return false;
                }
                if (buf[j] != seq.charAt(i+j))
                    return false;
            }
            // The fragment matched: continue with the next node
            return next.match(matcher, i+len, seq);
        }
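
The Start scan with its guard and the Slice comparison can be imitated outside the JDK in a few lines. The sketch below is a simplified stand-in, not the JDK implementation: ScanSketch, matchSlice, and scan are hypothetical names, the "next node" is reduced to returning the end index, and it assumes the literal contains only BMP characters.

```java
class ScanSketch {
    /**
     * Slice-like step: compares buf against seq starting at i.
     * Returns the index just after the fragment, or -1 on failure.
     */
    static int matchSlice(int[] buf, CharSequence seq, int i, int to) {
        for (int j = 0; j < buf.length; j++) {
            if (i + j >= to) {
                return -1; // ran past the end of the region
            }
            if (buf[j] != seq.charAt(i + j)) {
                return -1; // first mismatch ends the attempt
            }
        }
        return i + buf.length;
    }

    /**
     * Start-like scan: tries every start position up to the guard.
     * Returns { start, end } of the first match, or null if none exists.
     */
    static int[] scan(String literal, CharSequence seq) {
        int[] buf = literal.codePoints().toArray();
        int to = seq.length();
        int guard = to - buf.length; // last start position worth trying
        for (int i = 0; i <= guard; i++) {
            int end = matchSlice(buf, seq, i, to);
            if (end >= 0) {
                return new int[] { i, end };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(scan("456", "123456")));
        System.out.println(scan("456", "45") == null);
    }
}
```

When the input is shorter than the literal the guard is negative, so the loop body never runs, which mirrors the early exit at the top of the Start node's match method.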

match∈GroupTail

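GroupTail sets the start and end positions of a group when the group matches successfully. It must also be able to unset groups that have to be backed off of. GroupTail nodes are also used when a previous group is referenced, and in that case no group information needs to be set. Note that groupIndex = groupCount + groupCount.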

 GroupTail(int localCount, int groupCount) {
            localIndex = localCount;
            // The start position is stored at the even slot 2 * groupCount
            groupIndex = groupCount + groupCount;
        }
 boolean match(Matcher matcher, int i, CharSequence seq) {
            // Fetch the start position of this group
            int tmp = matcher.locals[localIndex];
            if (tmp >= 0) { // This is the normal group case.
                // Save the group so we can unset it if it
                // backs off of a match.
                int groupStart = matcher.groups[groupIndex];
                int groupEnd = matcher.groups[groupIndex+1];

                // Store the group start at the even slot
                // (groupIndex = groupCount + groupCount)
                matcher.groups[groupIndex] = tmp;
                // Store the group end at the following odd slot
                matcher.groups[groupIndex+1] = i;
                // Hand over to the next node; return true if it matches
                if (next.match(matcher, i, seq)) {
                    return true;
                }
                // The rest of the pattern failed: restore the saved values
                matcher.groups[groupIndex] = groupStart;
                matcher.groups[groupIndex+1] = groupEnd;
                return false;
            } else {
                // This is a group reference case. We don't need to save any
                // group info because it isn't really a group.
                matcher.last = i;
                return true;
            }
        }
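
The restore step in GroupTail can be observed with an alternation whose first branch captures something and then fails. In the sketch below the pattern (?:(a)x|ay) is a made-up example and BacktrackGroupDemo is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackGroupDemo {
    // Branch 1 captures "a" and then needs "x"; branch 2 is the literal "ay".
    static final Pattern P = Pattern.compile("(?:(a)x|ay)");

    // Returns group 1 after a full match, or the text "no match".
    static String groupOne(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : "no match";
    }

    public static void main(String[] args) {
        // Branch 1 succeeds, so group 1 keeps its value.
        System.out.println(groupOne("ax"));
        // Branch 1 sets group 1, fails on 'x', and the value is undone
        // before branch 2 matches. Group 1 is null.
        System.out.println(groupOne("ay"));
    }
}
```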

group

Returns the text captured by the group with the given number. group(0) returns the match of the whole expression.

public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        // The even slot holds the start index, the odd slot the end index
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }
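
The three outcomes of group(int), namely an exception, null, or a substring, can each be triggered with a short example. This is a sketch using only the public API; GroupAccessDemo and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupAccessDemo {
    static Matcher found(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("expected a match");
        }
        return m;
    }

    // Calling group() before any match attempt: first < 0.
    static boolean throwsBeforeMatch() {
        try {
            Pattern.compile("a").matcher("a").group();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Asking for a group number above groupCount().
    static boolean throwsForUnknownGroup() {
        try {
            found("(a)", "a").group(2);
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        Matcher m = found("(\\d+)-(\\d+)", "10-20");
        System.out.println(m.group(0) + " " + m.group(1) + " " + m.group(2));
        // start(n) and end(n) expose the values kept at groups[2n], groups[2n+1].
        System.out.println(m.start(2) + " " + m.end(2));
        // A group that did not take part in the match yields null.
        System.out.println(found("(a)?b", "b").group(1));
        System.out.println(throwsBeforeMatch() + " " + throwsForUnknownGroup());
    }
}
```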
