Java 正则表达式源码解析

使用方法

    Pattern p = Pattern.compile("a*");
    Matcher m = p.matcher("aaaa");
    if (m.find()) {
        System.out.println(m.group());
        System.out.println(m.groupCount());
    }else {
        System.out.println("not find");
    }

分析Pattern.compile ()方法

    public static Pattern compile(String regex) {
    	// 只是调用了私有的构造方法
        return new Pattern(regex, 0);
    }


    private Pattern(String p, int f) {
        pattern = p;
        flags = f;

        // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
        if ((flags & UNICODE_CHARACTER_CLASS) != 0)
            flags |= UNICODE_CASE;

        // Reset group index count
        capturingGroupCount = 1;
        localCount = 0;

        if (pattern.length() > 0) {
            //正则表达式的长度一般都是大于0的,会走compile 方法
            compile();
        } else {
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }

compile 方法:

compile 方法解析完正则表达式之后,会得到一个root Node节点。匹配的时候,会调用root Node 的match 方法,rootNode 依次调用next 的match 方法,在match 的过程中,记录匹配的开始结束位置,最终得到结果。我们看下这个方法:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        // Convert all chars into code points
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) {
                hasSupplementary = true;
            }
            temp[count++] = c;
        }

        patternLength = count;   // patternLength now in code points

        if (! has(LITERAL))
            RemoveQEQuoting();

        // Allocate all temporary objects here.
        buffer = new int[32];
        // 其实 groupNodes  这个变量没啥用,真正存储的是在Matcher 类的 groups 变量里面
        groupNodes = new GroupHead[10];
        namedGroups = null;

        if (has(LITERAL)) {
            // Literal pattern handling
            matchRoot = newSlice(temp, patternLength, hasSupplementary);
            matchRoot.next = lastAccept;
        } else {
            // Start recursive descent parsing   这里开始解析正则表达式
            matchRoot = expr(lastAccept);
            // Check extra pattern characters
            if (patternLength != cursor) {
                if (peek() == ')') {
                    throw error("Unmatched closing ')'");
                } else {
                    throw error("Unexpected internal error");
                }
            }
        }

        // Peephole optimization
        if (matchRoot instanceof Slice) {
            root = BnM.optimize(matchRoot);
            if (root == matchRoot) {
                root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
            }
        } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
            root = matchRoot;
        } else {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }

        // Release temporary storage
        temp = null;
        buffer = null;
        groupNodes = null;
        patternLength = 0;
        compiled = true;
    }

里面调用了expr 方法:

这个方法里面有一个for 循环,用来解决| 的逻辑与 的关系。比如a*|b*, 就会执行这里的for 循环。
这个方法是可以用来解析子表达式的,比如() 里面的正则表达式。

这个方法会传递一个尾巴节点,也就是最后一个节点进去。因为每个表达式都要有结束的节点,都要有一个root结点。
当解析() 的时候,要传递一个尾巴到expr() 参数里面。
这个类里面的root 表示当前的最后一个node.遇到() 会重置。

    /**
     *  The following methods handle the main parsing. They are sorted
     *  according to their precedence order, the lowest one first.
     */

    /**
     * The expression is parsed with branch nodes added for alternations.
     * This may be called recursively to parse sub expressions that may
     * contain alternations.
     */

    private Node expr(Node end) {
        Node prev = null;
        Node firstTail = null;
        Branch branch = null;
        Node branchConn = null;

        for (;;) {
            Node node = sequence(end);
            Node nodeTail = root;      //double return
            if (prev == null) {
                prev = node;
                firstTail = nodeTail;
            } else {
                // Branch
                if (branchConn == null) {
                    branchConn = new BranchConn();
                    branchConn.next = end;
                }
                if (node == end) {
                    // if the node returned from sequence() is "end"
                    // we have an empty expr, set a null atom into
                    // the branch to indicate to go "next" directly.
                    node = null;
                } else {
                    // the "tail.next" of each atom goes to branchConn
                    nodeTail.next = branchConn;
                }
                if (prev == branch) {
                    branch.add(node);
                } else {
                    if (prev == end) {
                        prev = null;
                    } else {
                        // replace the "end" with "branchConn" at its tail.next
                        // when put the "prev" into the branch as the first atom.
                        firstTail.next = branchConn;
                    }
                    prev = branch = new Branch(prev, node, branchConn);
                }
            }
            //如果不是 | ,那么就直接返回了。
            if (peek() != '|') {
                return prev;
            }
            next();
        }
    }

里面又会调用sequence方法

这个方法里面是具体的,某个字符代表什么意思的具体方法。比如\p{Lower} 就会在这里面进行解析,生成具体的Node.
包括\p、\w、\s、\d 等。

用来解析某个具体的正则表达式的分支。比如 a|b ,a 和 b 是两个分支。
该方法返回的值是头Node.


    /**
     * Parsing of sequences between alternations.
     */
    private Node sequence(Node end) {
        Node head = null;
        Node tail = null;
        Node node = null;
    LOOP:
        for (;;) {
            int ch = peek();
            switch (ch) {
            case '(':
                // Because group handles its own closure,
                // we need to treat it differently
                node = group0();
                // Check for comment or flag group
                if (node == null)
                    continue;
                // 如果没有头节点,那么当前的就赋值为头节点    
                if (head == null)
                    head = node;
                else
                //如果有的话,上次的尾巴节点的下一个是当前解析的节点
                    tail.next = node;
                // Double return: Tail was returned in root
                tail = root;
                continue;
            case '[':
            // 如果是[],直接分配一个区间的node
                node = clazz(true);
                break;
            case '\\':
                ch = nextEscaped();
                if (ch == 'p' || ch == 'P') {
                    boolean oneLetter = true;
                    boolean comp = (ch == 'P');
                    ch = next(); // Consume { if present
                    if (ch != '{') {
                        unread();
                    } else {
                        oneLetter = false;
                    }
                    node = family(oneLetter, comp);
                } else {
                    unread();
                    node = atom();
                }
                break;
            case '^':
                next();
                if (has(MULTILINE)) {
                    if (has(UNIX_LINES))
                        node = new UnixCaret();
                    else
                        node = new Caret();
                } else {
                    node = new Begin();
                }
                break;
            case '$':
                next();
                if (has(UNIX_LINES))
                    node = new UnixDollar(has(MULTILINE));
                else
                    node = new Dollar(has(MULTILINE));
                break;
            case '.':
                next();
                if (has(DOTALL)) {
                    node = new All();
                } else {
                    if (has(UNIX_LINES))
                        node = new UnixDot();
                    else {
                        node = new Dot();
                    }
                }
                break;
            case '|':
            case ')':
                break LOOP;
            case ']': // Now interpreting dangling ] and } as literals
            case '}':
                node = atom();
                break;
            case '?':
            case '*':
            case '+':
                next();
                throw error("Dangling meta character '" + ((char)ch) + "'");
            case 0:
                if (cursor >= patternLength) {
                    break LOOP;
                }
                // Fall through
            default:
                node = atom();
                break;
            }
			// closure 里面会对a*这种node 进行处理
            node = closure(node);

            if (head == null) {
                head = tail = node;
            } else {
                tail.next = node;
                tail = node;
            }
        }
        if (head == null) {
            return end;
        }
        tail.next = end;
        // 其实返回的是两个值,方法返回的是头node, 尾巴node 是赋值到root 里面供后面使用。
        root = tail;      //double return
        // 最终返回头node
        return head;
    }

里面会执行atom 方法,分派单字符Node 或者普通字符切片的Node 。

atom 方法用来分配单个字符,或者字符切片的Node. 比如a*。
只有字符开始的部分才会进来,比如a*,a+,如果是没有任何字符,上来是个*,在sequence 方法里面就会报错了。

    /**
     * Parse and add a new Single or Slice.
     */
    private Node atom() {
    	// 记录当前遍历了几个字符,用来解决abc*,这种,需要把ab 放到字符切片里面
        int first = 0;
        // 记录上一个索引,用来解决ab* 这种字符问题。
        //如果是ab*,那么游标要往前返回到上一个字符。
        int prev = -1;
        boolean hasSupplementary = false;
        int ch = peek();
        for (;;) {
            switch (ch) {
            case '*':
            case '+':
            case '?':
            case '{':
                if (first > 1) {
                    cursor = prev;    // Unwind one character
                    first--;
                }
                //遇到* + 就是一个最小的字符切片,需要结束往后面的循环
                break;
            case '$':
            case '.':
            case '^':
            case '(':
            case '[':
            case '|':
            case ')':
            // | 也是一个最小的字符切片,需要结束往后面的循环
                break;
            case '\\':
                ch = nextEscaped();
                if (ch == 'p' || ch == 'P') { // Property
                    if (first > 0) { // Slice is waiting; handle it first
                        unread();
                        break;
                    } else { // No slice; just return the family node
                        boolean comp = (ch == 'P');
                        boolean oneLetter = true;
                        ch = next(); // Consume { if present
                        if (ch != '{')
                            unread();
                        else
                            oneLetter = false;
                        return family(oneLetter, comp);
                    }
                }
                unread();
                prev = cursor;
                ch = escape(false, first == 0, false);
                if (ch >= 0) {
                    append(ch, first);
                    first++;
                    if (isSupplementary(ch)) {
                        hasSupplementary = true;
                    }
                    ch = peek();
                    continue;
                } else if (first == 0) {
                    return root;
                }
                // Unwind meta escape sequence
                cursor = prev;
                break;
            case 0:
                if (cursor >= patternLength) {
                    break;
                }
                // Fall through
            default:
                prev = cursor;
                append(ch, first);
                first++;
                if (isSupplementary(ch)) {
                    hasSupplementary = true;
                }
                ch = next();
                continue;
            }
            break;
        }
        if (first == 1) {
            return newSingle(buffer[0]);
        } else {
            return newSlice(buffer, first, hasSupplementary);
        }
    }

closure 用来处理a*,a+ 这种情况

    /**
     * Processes repetition. If the next character peeked is a quantifier
     * then new nodes must be appended to handle the repetition.
     * Prev could be a single or a group, so it could be a chain of nodes.
     */
    private Node closure(Node prev) {
        Node atom;
        int ch = peek();
        switch (ch) {
        case '?':
            ch = next();
            if (ch == '?') {
                next();
                return new Ques(prev, LAZY);
            } else if (ch == '+') {
                next();
                return new Ques(prev, POSSESSIVE);
            }
            return new Ques(prev, GREEDY);
        case '*':
            ch = next();
            if (ch == '?') {
                next();
                return new Curly(prev, 0, MAX_REPS, LAZY);
            } else if (ch == '+') {
                next();
                return new Curly(prev, 0, MAX_REPS, POSSESSIVE);
            }
            return new Curly(prev, 0, MAX_REPS, GREEDY);
        case '+':
            ch = next();
            if (ch == '?') {
                next();
                return new Curly(prev, 1, MAX_REPS, LAZY);
            } else if (ch == '+') {
                next();
                return new Curly(prev, 1, MAX_REPS, POSSESSIVE);
            }
            return new Curly(prev, 1, MAX_REPS, GREEDY);
        case '{':
            ch = temp[cursor+1];
            if (ASCII.isDigit(ch)) {
                skip();
                int cmin = 0;
                do {
                    cmin = cmin * 10 + (ch - '0');
                } while (ASCII.isDigit(ch = read()));
                int cmax = cmin;
                if (ch == ',') {
                    ch = read();
                    cmax = MAX_REPS;
                    if (ch != '}') {
                        cmax = 0;
                        while (ASCII.isDigit(ch)) {
                            cmax = cmax * 10 + (ch - '0');
                            ch = read();
                        }
                    }
                }
                if (ch != '}')
                    throw error("Unclosed counted closure");
                if (((cmin) | (cmax) | (cmax - cmin)) < 0)
                    throw error("Illegal repetition range");
                Curly curly;
                ch = peek();
                if (ch == '?') {
                    next();
                    curly = new Curly(prev, cmin, cmax, LAZY);
                } else if (ch == '+') {
                    next();
                    curly = new Curly(prev, cmin, cmax, POSSESSIVE);
                } else {
                    curly = new Curly(prev, cmin, cmax, GREEDY);
                }
                return curly;
            } else {
                throw error("Illegal repetition");
            }
        default:
            return prev;
        }
    }

正则表达式关键是把正则表达式解析成一连串的Node.

常见的Node 如:

Start node 是正则匹配开始的Node

   static class Start extends Node {
        int minLength;
        Start(Node node) {
            this.next = node;
            TreeInfo info = new TreeInfo();
            // study 会遍历每一个节点,把每个节点的最小长度加起来,得到一个整体的最小长度
            next.study(info);
            minLength = info.minLength;
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            if (i > matcher.to - minLength) {
                matcher.hitEnd = true;
                return false;
            }
            // 如果当前的位置到末尾的长度,小于最小长度,那么就不需要匹配了,肯定不匹配。
            // 所以这里有一个guard (中文:警卫)
            int guard = matcher.to - minLength;
            // 这个for 循环是一个一个字符,往后进行匹配。
           //用于解决这种情况:正则是:a*,字符串是baaa.第一个字符不匹配,要第二个开始匹配。
            for (; i <= guard; i++) {
                if (next.match(matcher, i, seq)) {
                    matcher.first = i;
                    // 这里 matcher.groups[0] 表示第1组 开始匹配的索引,因为第一组是整个匹配,所以赋值为 matcher.first    
                    matcher.groups[0] = matcher.first;
                    //这里 matcher.groups[1] 表示第1组 结束的索引,因为第一组是整个匹配,所以赋值为 matcher.last
                    matcher.groups[1] = matcher.last;
                    return true;
                }
            }
            matcher.hitEnd = true;
            return false;
        }
        boolean study(TreeInfo info) {
            next.study(info);
            info.maxValid = false;
            info.deterministic = false;
            return false;
        }
    }

SliceNode // 简单的连续字符匹配,比如aabbcc

    /**
     * Base class for all Slice nodes
     */
    static class SliceNode extends Node {
    // 所有的简单字符串匹配其实都是一个字符数据,用buffer 存放
        int[] buffer;
        SliceNode(int[] buf) {
            buffer = buf;
        }
        boolean study(TreeInfo info) {
        //最小长度要加上数组的长度,对于字符类型的节点,最小的长度就是数组的长度
            info.minLength += buffer.length;
            info.maxLength += buffer.length;
            return next.study(info);
        }
    }

    static final class Slice extends SliceNode {
        Slice(int[] buf) {
            super(buf);
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            int[] buf = buffer;
            int len = buf.length;
            for (int j=0; j<len; j++) {
            	// 如果长度大于等于字符串的长度,那么就结束啦。
                if ((i+j) >= matcher.to) {
                    matcher.hitEnd = true;
                    return false;
                }
                //如果有任何一个字符和字符类型的node 节点不对应,那么不匹配
                if (buf[j] != seq.charAt(i+j))
                    return false;
            }
            // 如果所有的都匹配,继续匹配下一个节点
            return next.match(matcher, i+len, seq);
        }
    }

单一字符匹配Node:

    /**
     * Optimization -- matches a given BMP character
	 single character matcher.
     */
    static final class Single extends BmpCharProperty {
        final int c;
        Single(int c) { this.c = c; }
        // 常见的如a*,a+ 这种会分析称Single 节点。
        boolean isSatisfiedBy(int ch) {
            return ch == c;
        }
    }

Curly 括号类型的节点,a* 是一个特殊的括号类型节点来处理。

    /**
     * Handles the curly-brace style repetition with a specified minimum and
     * maximum occurrences. The * quantifier is handled as a special case.
     * This class handles the three types.
     */
    static final class Curly extends Node {
        Node atom;
        int type;
        int cmin;
        int cmax;

        Curly(Node node, int cmin, int cmax, int type) {
            this.atom = node;
            this.type = type;
            this.cmin = cmin;
            this.cmax = cmax;
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            int j;
            for (j = 0; j < cmin; j++) {
                if (atom.match(matcher, i, seq)) {
                    i = matcher.last;
                    continue;
                }
                return false;
            }
            if (type == GREEDY)
                return match0(matcher, i, j, seq);
            else if (type == LAZY)
                return match1(matcher, i, j, seq);
            else
                return match2(matcher, i, j, seq);
        }
        // Greedy match.
        // i is the index to start matching at
        // j is the number of atoms that have matched
        boolean match0(Matcher matcher, int i, int j, CharSequence seq) {
            if (j >= cmax) {
                // We have matched the maximum... continue with the rest of
                // the regular expression
                return next.match(matcher, i, seq);
            }
            int backLimit = j;
            while (atom.match(matcher, i, seq)) {
                // k is the length of this match
                int k = matcher.last - i;
                if (k == 0) // Zero length match
                    break;
                // Move up index and number matched
                i = matcher.last;
                j++;
                // We are greedy so match as many as we can
                while (j < cmax) {
                    if (!atom.match(matcher, i, seq))
                        break;
                    if (i + k != matcher.last) {
                        if (match0(matcher, matcher.last, j+1, seq))
                            return true;
                        break;
                    }
                    i += k;
                    j++;
                }
                // Handle backing off if match fails
                while (j >= backLimit) {
                   if (next.match(matcher, i, seq))
                        return true;
                    i -= k;
                    j--;
                }
                return false;
            }
            return next.match(matcher, i, seq);
        }
        // Reluctant match. At this point, the minimum has been satisfied.
        // i is the index to start matching at
        // j is the number of atoms that have matched
        boolean match1(Matcher matcher, int i, int j, CharSequence seq) {
            for (;;) {
                // Try finishing match without consuming any more
                if (next.match(matcher, i, seq))
                    return true;
                // At the maximum, no match found
                if (j >= cmax)
                    return false;
                // Okay, must try one more atom
                if (!atom.match(matcher, i, seq))
                    return false;
                // If we haven't moved forward then must break out
                if (i == matcher.last)
                    return false;
                // Move up index and number matched
                i = matcher.last;
                j++;
            }
        }
        boolean match2(Matcher matcher, int i, int j, CharSequence seq) {
            for (; j < cmax; j++) {
                if (!atom.match(matcher, i, seq))
                    break;
                if (i == matcher.last)
                    break;
                i = matcher.last;
            }
            return next.match(matcher, i, seq);
        }
        boolean study(TreeInfo info) {
            // Save original info
            int minL = info.minLength;
            int maxL = info.maxLength;
            boolean maxV = info.maxValid;
            boolean detm = info.deterministic;
            info.reset();

            atom.study(info);

            int temp = info.minLength * cmin + minL;
            if (temp < minL) {
                temp = 0xFFFFFFF; // arbitrary large number
            }
            info.minLength = temp;

            if (maxV & info.maxValid) {
                temp = info.maxLength * cmax + maxL;
                info.maxLength = temp;
                if (temp < maxL) {
                    info.maxValid = false;
                }
            } else {
                info.maxValid = false;
            }

            if (info.deterministic && cmin == cmax)
                info.deterministic = detm;
            else
                info.deterministic = false;
            return next.study(info);
        }
    }

group类型的Node

    /**
     * Create group head and tail nodes using double return. If the group is
     * created with anonymous true then it is a pure group and should not
     * affect group counting.
     *  创建一个group 类型的node
     */
    private Node createGroup(boolean anonymous) {
    	// localindex 表示的是除去全部匹配的第0组,当前是第几组
        int localIndex = localCount++;
        int groupIndex = 0;
        if (!anonymous)
        // capturingGroupCount 表示当前是总的第几组,如果是localindex 是0 ,那么这个就是1,因为默认有一个全部匹配的组,是第0组
            groupIndex = capturingGroupCount++;
        GroupHead head = new GroupHead(localIndex);
        //这里要把GroupTail 置为root 是因为后面要调用expr() 方法里面要传递一个结束的节点
        root = new GroupTail(localIndex, groupIndex);
        if (!anonymous && groupIndex < 10)
            groupNodes[groupIndex] = head;
        return head;
    }

    /**
     * The GroupHead saves the location where the group begins in the locals
     * and restores them when the match is done.
     *
     * The matchRef is used when a reference to this group is accessed later
     * in the expression. The locals will have a negative value in them to
     * indicate that we do not want to unset the group if the reference
     * doesn't match.
     */
    static final class GroupHead extends Node {
        int localIndex;
        GroupHead(int localCount) {
            localIndex = localCount;
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            int save = matcher.locals[localIndex];
            matcher.locals[localIndex] = i;
            boolean ret = next.match(matcher, i, seq);
            matcher.locals[localIndex] = save;
            return ret;
        }
        boolean matchRef(Matcher matcher, int i, CharSequence seq) {
            int save = matcher.locals[localIndex];
            matcher.locals[localIndex] = ~i; // HACK
            boolean ret = next.match(matcher, i, seq);
            matcher.locals[localIndex] = save;
            return ret;
        }
    }

    /**
     * The GroupTail handles the setting of group beginning and ending
     * locations when groups are successfully matched. It must also be able to
     * unset groups that have to be backed off of.
     *
     * The GroupTail node is also used when a previous group is referenced,
     * and in that case no group information needs to be set.
     */
    static final class GroupTail extends Node {
        int localIndex;
        int groupIndex;
        GroupTail(int localCount, int groupCount) {
            localIndex = localCount;
            //groupIndex 是两倍的groupCount  是因为每一组占用数组的两个位置,表示开始和结束。
            // 所以,groupIndex  表示该组在数组里面的开始位置
            groupIndex = groupCount + groupCount;
        }
        boolean match(Matcher matcher, int i, CharSequence seq) {
            int tmp = matcher.locals[localIndex];
            if (tmp >= 0) { // This is the normal group case.
                // Save the group so we can unset it if it
                // backs off of a match.
                int groupStart = matcher.groups[groupIndex];
                int groupEnd = matcher.groups[groupIndex+1];

                matcher.groups[groupIndex] = tmp;
                matcher.groups[groupIndex+1] = i;
                if (next.match(matcher, i, seq)) {
                    return true;
                }
                matcher.groups[groupIndex] = groupStart;
                matcher.groups[groupIndex+1] = groupEnd;
                return false;
            } else {
                // This is a group reference case. We don't need to save any
                // group info because it isn't really a group.
                matcher.last = i;
                return true;
            }
        }
    }

我们看下m.find() 方法:

    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

search 方法

    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from        = from < 0 ? 0 : from;
        this.first  = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        // 调用root  的match 方法,root  的match 方法会调用下一个node 的match 方法。
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }
   groups 保存了所有的匹配到的组,以每两个为一个单位,第一个位置标示匹配开始的位置,第二个位置标示结束的位置。
      比如
	matcher.groups = {int[20]@485} 
 0 = 0
 1 = 2
 2 = 0
 3 = 1
 4 = 1
 5 = 2
 6 = -1
 7 = -1
0  和 1  表示第0组,
2 和 3  表示第二组,
4 和 5 表示第三组。
-1 表示没有组了,匹配结束。


    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
		//groups 保存了所有的匹配到的组,以每两个为一个单位,第一个位置标示匹配开始的位置,第二个位置标示结束的位置。
		// 所以,返回是的时候 getSubSequence(groups[group * 2], groups[group * 2 + 1])
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }

学习到的技巧:

1.别人写的都是可以扩展的,可以扩展有两个,一个是接口,一个是抽象类(基类);
Node 节点被里面继承了很多,支持了很多的功能。
2.其实所有的正则表达式的用法,都是可以在源码里面看到的。比如: \p{Lower} 表示的是a-z,因为源码里面这样写的:

        defRange("Lower", 'a', 'z');     // Lower-case alphabetic
        defRange("Print", 0x20, 0x7E);   // Printable characters

问题:

1.为什么temp 要以两个0 结束呢?
2.buffer 的作用是什么?
用来存放普通字符数组,当有需要的时候,从里面取。
3.那些flag 都是用来干什么的?
4. 是怎么实现的分组功能?
是用一个groups 存在每一组的开始结束索引,取每一组的信息的时候,拿到索引就可以拿到匹配组的值。

总结:

Node 类抽象的特别好,每一个Node 类都有肯定会有两个方法,第一匹配,当前Node 匹配的最小长度。

   /**
     * The following classes are the building components of the object
     * tree that represents a compiled regular expression. The object tree
     * is made of individual elements that handle constructs in the Pattern.
     * Each type of object knows how to match its equivalent construct with
     * the match() method.
     */

    /**
     * Base class for all node classes. Subclasses should override the match()
     * method as appropriate. This class is an accepting node, so its match()
     * always returns true.
     */
    static class Node extends Object {
        Node next;
        Node() {
            next = Pattern.accept;
        }
        /**
         * This method implements the classic accept node.
         */
        boolean match(Matcher matcher, int i, CharSequence seq) {
            matcher.last = i;
            matcher.groups[0] = matcher.first;
            matcher.groups[1] = matcher.last;
            return true;
        }
        /**
         * This method is good for all zero length assertions.
         */
        boolean study(TreeInfo info) {
            if (next != null) {
                return next.study(info);
            } else {
                return info.deterministic;
            }
        }
    }

前言:本资源来自于javaeye,原资源链接地址:http://www.javaeye.com/topic/67398 原文如下: 以前写了一个java的正规表达式的java工具类,分享一下,有用到的欢迎下载使用。 如果你有常用的定义好的,且测试通过的正规表达式,欢迎跟贴,也让我享用一下 . 类中用到了 jakarta-oro-2.0.jar 包,请大家自己在 apache网站下下载 在这是junit测试单元类我就不提交了,在main()方法中有几个小测试,有兴趣自己玩吧. 这个工具类目前主要有25种正规表达式(有些不常用,但那时才仔细深入的研究了一下正规,写上瘾了,就当时能想到的都写了): 1.匹配图象; 2 匹配email地址; 3 匹配匹配并提取url ; 4 匹配并提取http ; 5.匹配日期 6 匹配电话; 7 匹配身份证 8 匹配邮编代码 9. 不包括特殊字符的匹配 (字符串中不包括符号 数学次方号^ 单引号' 双引号" 分号; 逗号, 帽号: 数学减号- 右尖括号> 左尖括号 0) 12 匹配正整数 13 匹配非正整数(负整数 + 0) 14 匹配负整数; 15. 匹配整数 ; 16 匹配非负浮点数(正浮点数 + 0) 17. 匹配正浮点数 18 匹配非正浮点数(负浮点数 + 0) 19 匹配负浮点数; 20 .匹配浮点数; 21. 匹配由26个英文字母组成的字符串; 22. 匹配由26个英文字母的大写组成的字符串 23 匹配由26个英文字母的小写组成的字符串 24 匹配由数字和26个英文字母组成的字符串; 25 匹配由数字、26个英文字母或者下划线组成的字符串; java源码: /* * Created on 2005-4-15 * * Summary of regular-expression constructs 正则表达式结构简介: * Construct Matches * Characters 字符: * x The character x x 字符 x * \\ The backslash character \\ 反斜杠 * \0n The character with octal value 0n (0 <= n <= 7) \0n 十进制数 (0 <= n <= 7) * \0nn The character with octal value 0nn (0 <= n <= 7) \0nn 十进制数 0nn (0 <= n <= 7) * \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) \0mnn 十进制数 0mnn (0 <= m <= 3, 0 <= n <= 7) * \xhh The character with hexadecimal value 0xhh \xhh 十六进制数 0x
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值