Regex

ChuckLin

Category: Reading the SourceCode

Article link: https://blog.csdn.net/weixin_38927996/article/details/87523017


Regex’s Pattern and Matcher

Pattern: the compiled regular expression used to match against strings

I. By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence.

II. If MULTILINE mode is activated, then ^ matches at the beginning of input and after any line terminator except at the end of input. In MULTILINE mode, $ matches just before a line terminator or the end of the input sequence.
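
The difference between the default and MULTILINE behavior of ^ and $ can be sketched as follows. This is only an illustration: the class name MultilineDemo and the helper countMatches are hypothetical names introduced here, not part of the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: counts how often the regex is found under the given flags.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree";
        // Default mode: ^ and $ only anchor at the ends of the whole input.
        System.out.println(countMatches("^\\w+$", 0, input));
        // MULTILINE: ^ and $ also anchor around each line terminator.
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, input));
    }
}
```

In default mode the three-line input as a whole is not a single word, so nothing is found; in MULTILINE mode each line is matched separately.
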
III. Capturing groups are numbered by counting their opening parentheses from left to right. Group zero always stands for the entire expression. For example, in ((A)(B(C))) the groups are numbered: 1 is ((A)(B(C))), 2 is (A), 3 is (B(C)), 4 is (C).

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

A capturing group can also be assigned a “name”. The first character of the name must be a letter. A named capturing group is still numbered.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, then its previously captured value, if any, will be retained if the second evaluation fails.

Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing groups.
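
The group rules above can be tried out with a small sketch. The class GroupDemo and its helper methods are hypothetical names used only for this illustration; the regexes are examples chosen here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns groups 0..n of a full match joined by '|', or "" if no match.
    static String numberedGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(m.group(0));
        for (int i = 1; i <= m.groupCount(); i++) {
            sb.append('|').append(m.group(i));
        }
        return sb.toString();
    }

    // A named group can be read by name and is still reachable by its number.
    static boolean namedEqualsNumbered(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.matches() && m.group("year").equals(m.group(1));
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(numberedGroups("((A)(B(C)))", "ABC"));
        // A quantified group keeps the subsequence it matched most recently.
        System.out.println(numberedGroups("(a|b)+", "ab"));
        // \1 is a back reference to whatever group 1 captured.
        System.out.println(Pattern.matches("(\\w)\\1", "aa"));
        System.out.println(namedEqualsNumbered("2019-02"));
    }
}
```
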
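IV. In the compile method public static Pattern compile(String regex, int flags), the different flags values select different matching modes.
The default compile method public static Pattern compile(String regex) passes flags = 0, i.e. the default mode.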

		Code listing:
			public static Pattern compile(String regex) {
		        return new Pattern(regex, 0);
		    }
			
			public static Pattern compile(String regex, int flags) {
		        return new Pattern(regex, flags);
		    }

Flag values:

UNIX_LINES = 0x01;
In this mode, only the ‘\n’ line terminator is recognized in the behavior of ., ^, and $.

CASE_INSENSITIVE = 0x02;
Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

UNICODE_CASE = 0x40;
When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

COMMENTS = 0x04;
In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.

MULTILINE = 0x08;
In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.

LITERAL = 0x10;
When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.

DOTALL = 0x20;
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

CANON_EQ = 0x80;
When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match. The expression “a\u030A”, for example, will match the string “\u00E5” when this flag is specified. By default, matching does not take canonical equivalence into account.

UNICODE_CHARACTER_CLASS = 0x100;
The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U). The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
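
A minimal sketch of how these flags change matching, assuming nothing beyond the public Pattern constants listed above. The class FlagsDemo and its matches helper are hypothetical names for this illustration. Flags are bit masks, so several modes can be combined with the | operator.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Hypothetical helper: full match of input against regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds US-ASCII letters.
        System.out.println(matches("hello", Pattern.CASE_INSENSITIVE, "HELLO"));
        System.out.println(matches("\u00e9", Pattern.CASE_INSENSITIVE, "\u00c9"));
        // Adding UNICODE_CASE makes the folding Unicode-aware.
        int unicodeInsensitive = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches("\u00e9", unicodeInsensitive, "\u00c9"));
        // DOTALL lets '.' match a line terminator.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
        // LITERAL treats the whole pattern as plain characters.
        System.out.println(matches("a.b", Pattern.LITERAL, "axb"));
        // COMMENTS ignores whitespace and '#' comments inside the pattern.
        System.out.println(matches("a b # trailing comment", Pattern.COMMENTS, "ab"));
    }
}
```
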
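V. The RemoveQEQuoting() method removes the \Q...\E quoting markers from the pattern and instead backslash-escapes the non-alphanumeric characters inside the quoted span, in the spirit of Perl's quotemeta: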
- quotemeta:
  Returns the value of EXPR with all non-“word” characters backslashed. (That is, all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.) This is the internal function implementing the \Q escape in double-quoted strings.
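
RemoveQEQuoting() itself is a private helper, so it cannot be called directly. Its visible effect can be observed through the public API, as in the sketch below: Pattern.quote wraps a string in \Q...\E, and the quoted span is then matched literally. The class QuoteDemo and the helper literalMatch are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: treats 'literal' as plain text and tests it against the whole input.
    static boolean literalMatch(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Pattern.quote wraps the text in \Q...\E.
        System.out.println(Pattern.quote("a.b"));
        // Inside the quoted span '.' no longer means "any character".
        System.out.println(literalMatch("a.b", "a.b"));
        System.out.println(literalMatch("a.b", "axb"));
        // \Q...\E can also be written by hand for part of a pattern.
        System.out.println(Pattern.matches("\\Q1+1\\E=\\d", "1+1=2"));
    }
}
```
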

Matcher: performs match operations on an input using a Pattern and holds the results

I. A matcher is created from a pattern by invoking the pattern’s matcher method.
II. The matches method attempts to match the entire input sequence against the pattern (whole-input match).
```
		public boolean matches() {
	        return match(from, ENDANCHOR);
	    }

		boolean match(int from, int anchor) {
	        this.hitEnd = false;
	        this.requireEnd = false;
	        from        = from < 0 ? 0 : from;
	        this.first  = from;
	        this.oldLast = oldLast < 0 ? from : oldLast;
	        for (int i = 0; i < groups.length; i++)
	            groups[i] = -1;
	        acceptMode = anchor;
	        boolean result = parentPattern.matchRoot.match(this, from, text);
	        if (!result)
	            this.first = -1;
	        this.oldLast = this.last;
	        return result;
	    }				
```
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern; unlike matches, it does not require the whole input to be matched (prefix match).
```
		public boolean lookingAt() {
	        return match(from, NOANCHOR);
	    }
```
The find method scans the input sequence looking for the next subsequence that matches the pattern.
```
		public boolean find(int start) {
	        int limit = getTextLength();
	        if ((start < 0) || (start > limit))
	            throw new IndexOutOfBoundsException("Illegal start index");
	        reset();
	        return search(start);
	    }
```
Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
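
The three match operations can be compared side by side with a short sketch. The class MatchKindsDemo and the helper kinds are hypothetical names; the pattern \d+ is just an example chosen here.

```java
import java.util.regex.Pattern;

class MatchKindsDemo {
    // Hypothetical helper: returns "matches,lookingAt,find" for the given regex and input.
    static String kinds(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        // A fresh matcher per call, so the three operations do not influence each other.
        boolean whole = p.matcher(input).matches();
        boolean prefix = p.matcher(input).lookingAt();
        boolean anywhere = p.matcher(input).find();
        return whole + "," + prefix + "," + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(kinds("\\d+", "123"));     // digits only
        System.out.println(kinds("\\d+", "123abc"));  // digits at the start
        System.out.println(kinds("\\d+", "abc123"));  // digits only later in the input
    }
}
```
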

The explicit state of a matcher is recomputed by every match operation. The implicit state of a matcher includes the input character sequence as well as the append position. The main fields are:
- int[] groups
  The storage used by groups. They may contain invalid values if a group was skipped during the matching.
- int from, to
  The range within the sequence that is to be matched. Anchors will match at these “hard” boundaries. Changing the region changes these values.
- int lookbehindTo
  Lookbehind uses this value to ensure that the subexpression match ends at the point where the lookbehind was encountered.
- int NOANCHOR = 0; int acceptMode = NOANCHOR;
  Used when a match does not have to consume all of the input.
- int ENDANCHOR = 1
  The mode used for matching all the input.
- int first
  The start index of the last match. If the last match failed then first is -1.
- int last
  Initially holds 0; then it holds the index of the end of the last match (which is where the next search starts).
- int oldLast = -1
  The end index of what matched in the last match operation.
- int[] locals
  Storage used by nodes to tell what repetition they are on in a pattern, and where groups begin. The nodes themselves are stateless, so they rely on this field to hold state during a match.
- boolean hitEnd
  Boolean indicating whether or not more input could change the results of the last match.
  If hitEnd is true, and a match was found, then more input might cause a different match to be found.
  If hitEnd is true and a match was not found, then more input could cause a match to be found.
  If hitEnd is false and a match was found, then more input will not change the match.
  If hitEnd is false and a match was not found, then more input will not cause a match to be found.
- boolean requireEnd
  Boolean indicating whether or not more input could change a positive match into a negative one(be lost).
  If requireEnd is true, and a match was found, then more input could cause the match to be lost.
  If requireEnd is false and a match was found, then more input might change the match but the match won’t be lost.
  If a match was not found, then requireEnd has no meaning.
- boolean transparentBounds = false
  If transparentBounds is true then the boundaries of this matcher’s region are transparent to lookahead, lookbehind, and boundary matching constructs that try to see beyond them.
- boolean anchoringBounds = true
  If anchoringBounds is true then the boundaries of this matcher’s region match anchors such as ^ and $.
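
The hitEnd and requireEnd fields are exposed through the public hitEnd() and requireEnd() methods, so the cases described above can be observed with a small sketch. The class HitEndDemo and the helper probe are hypothetical names, and the patterns are examples chosen here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitEndDemo {
    // Hypothetical helper: runs find() once and reports "found,hitEnd,requireEnd".
    static String probe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean found = m.find();
        return found + "," + m.hitEnd() + "," + m.requireEnd();
    }

    public static void main(String[] args) {
        // No match, but the search ran into the end of input: more input ("o") could produce a match.
        System.out.println(probe("foo", "fo"));
        // Match found without touching the end of input: more input cannot change it.
        System.out.println(probe("a", "ab"));
        // Match depends on '$' at the end of input: more input could make the match be lost.
        System.out.println(probe("a$", "a"));
    }
}
```
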
III. Instances of this class are not safe for use by multiple concurrent threads.

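IV. Matcher initialization: the constructor stores the pattern and the target input in the Matcher object, and also allocates the fields that hold group and state information. One detail worth noting is that the groups array is allocated with twice as many slots as there are groups, because it stores both the start and the end position of every group, as can be inferred from the start(int group) and end(int group) methods.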

			Matcher(Pattern parent, CharSequence text) {
		        this.parentPattern = parent;
		        this.text = text;
		
		        // Allocate state storage
		        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
		        groups = new int[parentGroupCount * 2];
		        locals = new int[parent.localCount];
		
		        // Put fields into initial states
		        reset();
		    }

			public int start(int group) {
		        if (first < 0)
		            throw new IllegalStateException("No match available");
		        if (group < 0 || group > groupCount())
		            throw new IndexOutOfBoundsException("No group " + group);
		        return groups[group * 2];
		    }	

			public int end(int group) {
		        if (first < 0)
		            throw new IllegalStateException("No match available");
		        if (group < 0 || group > groupCount())
		            throw new IndexOutOfBoundsException("No group " + group);
		        return groups[group * 2 + 1];
		    }
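
The start/end pairs stored in the groups array can be read back through the public API, as in this sketch. The class GroupBoundsDemo and the helper bounds are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupBoundsDemo {
    // Hypothetical helper: returns "start-end" of the given group after the first find(), or "none".
    static String bounds(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "none";
        }
        return m.start(group) + "-" + m.end(group);
    }

    public static void main(String[] args) {
        String regex = "(\\d{4})-(\\d{2})";
        System.out.println(bounds(regex, "2019-02", 0)); // the whole match
        System.out.println(bounds(regex, "2019-02", 1)); // the year group
        System.out.println(bounds(regex, "2019-02", 2)); // the month group
        // A group that did not take part in the match reports -1 for both indices.
        System.out.println(bounds("(a)?b", "b", 1));
    }
}
```
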

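V. When a new Matcher is created, reset() is called to put its state fields into their initial values. Calling reset() discards the results of any previous match: every entry in groups and locals is set to -1, the append position goes back to 0, and the region is reset to the whole input. reset(CharSequence input) first installs the new character sequence and then performs the same reset, so the previous match results are discarded there as well.

		// Code listing: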
			public Matcher reset() {
		        first = -1;
		        last = 0;
		        oldLast = -1;
				// Discard the results of the previous match
		        for(int i=0; i<groups.length; i++)
		            groups[i] = -1;
		        for(int i=0; i<locals.length; i++)
		            locals[i] = -1;
		        lastAppendPosition = 0;
		        from = 0;
		        to = getTextLength();
		        return this;
		    }

			public Matcher reset(CharSequence input) {
		        text = input;
		        return reset();
		    }
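
The effect of reset() and reset(CharSequence) can be checked with a short sketch. The class ResetDemo and its helpers are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // After two find() calls and a reset(), the next find() starts from the beginning again.
    static String firstAfterReset(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();
        m.find();
        m.reset();
        return m.find() ? m.group() : null;
    }

    // reset() discards the previous match, so group() has nothing to report.
    static boolean groupThrowsAfterReset(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();
        m.reset();
        try {
            m.group();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // reset(CharSequence) swaps the input, so one matcher can be reused for several inputs.
    static String firstWithNewInput(String regex, String oldInput, String newInput) {
        Matcher m = Pattern.compile(regex).matcher(oldInput);
        m.find();
        m.reset(newInput);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstAfterReset("\\d+", "12 34"));
        System.out.println(groupThrowsAfterReset("\\d+", "12 34"));
        System.out.println(firstWithNewInput("\\d+", "12 34", "56"));
    }
}
```
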

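VI. The usePattern(Pattern newPattern) method of Matcher installs a new Pattern object in place of the old one. The group information of the previous match is lost (groups and locals are reallocated and filled with -1), but the matcher's position in the input and its last append position are left unchanged.

		Code listing: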
	
			public Matcher usePattern(Pattern newPattern) {
		        if (newPattern == null)
		            throw new IllegalArgumentException("Pattern cannot be null");
		        parentPattern = newPattern;
		
		        // Reallocate state storage
		        int parentGroupCount = Math.max(newPattern.capturingGroupCount, 10);
		        groups = new int[parentGroupCount * 2];
		        locals = new int[newPattern.localCount];
				// Clear the group information of the previous match
		        for (int i = 0; i < groups.length; i++)
		            groups[i] = -1;
		        for (int i = 0; i < locals.length; i++)
		            locals[i] = -1;
		        return this;
		    }
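
Because usePattern keeps the matcher's position, it can be used to switch patterns partway through an input. The sketch below assumes a hypothetical class UsePatternDemo and helper digitsThenLetters; the two patterns are examples chosen here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsePatternDemo {
    // Finds a run of digits, then switches pattern and continues searching after that run.
    static String digitsThenLetters(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        String digits = m.find() ? m.group() : null;
        // The position in the input is kept; only the group information is cleared.
        m.usePattern(Pattern.compile("[a-z]+"));
        String letters = m.find() ? m.group() : null;
        return digits + "," + letters;
    }

    public static void main(String[] args) {
        // The letters found are the ones after the digits, not "abc" at the start.
        System.out.println(digitsThenLetters("abc123def"));
    }
}
```
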

Test code for quoteReplacement:

			String s="adsfjl\n\\\\dsflj\\kjld$dfkjls\n";
			System.out.print(s);
			System.out.print(Matcher.quoteReplacement(s));

			Output: adsfjl
				 \\dsflj\kjld$dfkjls-----------the actual contents of s
				 adsfjl
				 \\\\dsflj\\kjld\$dfkjls-------s with the escape characters added

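VII. The quoteReplacement(String s) method prefixes every '\' and '$' character in the given String with the escape character '\', so that the String is treated as a purely literal replacement with no special characters.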

		public static String quoteReplacement(String s) {
	        if ((s.indexOf('\\') == -1) && (s.indexOf('$') == -1))
	            return s;
	        StringBuilder sb = new StringBuilder();
	        for (int i=0; i<s.length(); i++) {
	            char c = s.charAt(i);
	            if (c == '\\' || c == '$') {
	                sb.append('\\');
	            }
	            sb.append(c);
	        }
	        return sb.toString();
	    }
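
A typical use of quoteReplacement is to insert text that contains '$' or '\' through replaceAll. Without quoting, "$5" in a replacement would be read as a reference to group 5. The class QuoteReplacementDemo and the helper fillIn are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // Hypothetical helper: replaces every "X" in the template with the value, taken literally.
    static String fillIn(String template, String value) {
        return template.replaceAll("X", Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(fillIn("cost: X", "$5"));
        System.out.println(fillIn("path: X", "C:\\temp"));
    }
}
```
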

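The appendReplacement(StringBuffer sb, String replacement) method works on the current match of the Matcher's text field. It appends to sb the input between the last append position and the start of the match, followed by replacement with its group references ($n or ${name}) resolved, and then records the end of the match as the new append position.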

		 public Matcher appendReplacement(StringBuffer sb, String replacement) {

       			// If no match, return error
		        if (first < 0)
		            throw new IllegalStateException("No match available");
		
		        // Process substitution string to replace group references with groups
		        int cursor = 0;
		        StringBuilder result = new StringBuilder();
		
				// Start parsing replacement
		        while (cursor < replacement.length()) {
		            char nextChar = replacement.charAt(cursor);
					// Take the character after the escape character literally
		            if (nextChar == '\\') {
		                cursor++;
		                if (cursor == replacement.length())
		                    throw new IllegalArgumentException(
		                        "character to be escaped is missing");
		                nextChar = replacement.charAt(cursor);
		                result.append(nextChar);
		                cursor++;
					// '$' + groupNumber or '$' + '{' + groupName + '}' refers to a captured group; the code below parses such a group reference
		            } else if (nextChar == '$') {
		                cursor++;
		                // Throw IAE if this "$" is the last character in replacement
		                if (cursor == replacement.length())
		                   throw new IllegalArgumentException(
		                        "Illegal group reference: group index is missing");
		                nextChar = replacement.charAt(cursor);
		                int refNum = -1;
		                if (nextChar == '{') {
		                    cursor++;
		                    StringBuilder gsb = new StringBuilder();
		                    while (cursor < replacement.length()) {
		                        nextChar = replacement.charAt(cursor);
		                        if (ASCII.isLower(nextChar) ||
		                            ASCII.isUpper(nextChar) ||
		                            ASCII.isDigit(nextChar)) {
		                            gsb.append(nextChar);
		                            cursor++;
		                        } else {
		                            break;
		                        }
		                    }
		                    if (gsb.length() == 0)
		                        throw new IllegalArgumentException(
		                            "named capturing group has 0 length name");
		                    if (nextChar != '}')
		                        throw new IllegalArgumentException(
		                            "named capturing group is missing trailing '}'");
		                    String gname = gsb.toString();
		                    if (ASCII.isDigit(gname.charAt(0)))
		                        throw new IllegalArgumentException(
		                            "capturing group name {" + gname +
		                            "} starts with digit character");
		                    if (!parentPattern.namedGroups().containsKey(gname))
		                        throw new IllegalArgumentException(
		                            "No group with name {" + gname + "}");
		                    refNum = parentPattern.namedGroups().get(gname);
		                    cursor++;
		                } else {
		                    // The first number is always a group
		                    refNum = (int)nextChar - '0';
		                    if ((refNum < 0)||(refNum > 9))
		                        throw new IllegalArgumentException(
		                            "Illegal group reference");
		                    cursor++;
		                    // Capture the largest legal group string
		                    boolean done = false;
		                    while (!done) {
		                        if (cursor >= replacement.length()) {
		                            break;
		                        }
		                        int nextDigit = replacement.charAt(cursor) - '0';
		                        if ((nextDigit < 0)||(nextDigit > 9)) { // not a number
		                            break;
		                        }
		                        int newRefNum = (refNum * 10) + nextDigit;
		                        if (groupCount() < newRefNum) {
		                            done = true;
		                        } else {
		                            refNum = newRefNum;
		                            cursor++;
		                        }
		                    }
		                }
		                // Append group
		                if (start(refNum) != -1 && end(refNum) != -1)
		                    result.append(text, start(refNum), end(refNum));
					// Append characters that have no special meaning as they are
		            } else {
		                result.append(nextChar);
		                cursor++;
		            }
		        }
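		        // Append the text between the last append position and the start of this match
		        sb.append(text, lastAppendPosition, first);
		        // Append the processed replacement string
		        sb.append(result);
				// Record the end of this match as the new append position
		        lastAppendPosition = last;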
		        return this;
		    }

		
		The public StringBuffer appendTail(StringBuffer sb) method appends everything that remains of the input after the last append position to sb.

				public StringBuffer appendTail(StringBuffer sb) {
			        sb.append(text, lastAppendPosition, getTextLength());
			        return sb;
			    }
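
appendReplacement and appendTail are normally used together in a find() loop. A minimal sketch, assuming a hypothetical class AppendDemo and helper swapAroundAt; the pattern and replacement are examples chosen here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendDemo {
    // Rewrites every "left@right" as "right:left" and keeps all other text unchanged.
    static String swapAroundAt(String input) {
        Matcher m = Pattern.compile("(\\w+)@(\\w+)").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Appends the text before the match, then the replacement with $1 and $2 resolved.
            m.appendReplacement(sb, "$2:$1");
        }
        // Appends whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(swapAroundAt("alice@home and bob@work."));
    }
}
```
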

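VIII. The public Matcher region(int start, int end) method sets the range of the input in which the Matcher searches, while public Matcher useTransparentBounds(boolean b) controls whether lookahead, lookbehind, and boundary-matching constructs may look at input outside the region. Note that region() calls reset() first, so it also discards the previous match.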

			public Matcher region(int start, int end) {
		        if ((start < 0) || (start > getTextLength()))
		            throw new IndexOutOfBoundsException("start");
		        if ((end < 0) || (end > getTextLength()))
		            throw new IndexOutOfBoundsException("end");
		        if (start > end)
		            throw new IndexOutOfBoundsException("start > end");
		        reset();
		        from = start;
		        to = end;
		        return this;
		    }

			public Matcher useTransparentBounds(boolean b) {
		        transparentBounds = b;
		        return this;
		    }
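
The interaction between a region and its bounds can be sketched as follows. The class RegionDemo and the helper findInRegion are hypothetical names; useAnchoringBounds is the public setter for the anchoringBounds field described earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Hypothetical helper: searches input[start, end) with the given bounds settings.
    static boolean findInRegion(String regex, String input, int start, int end,
                                boolean transparent, boolean anchoring) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.region(start, end);               // resets the matcher, then sets from/to
        m.useTransparentBounds(transparent);
        m.useAnchoringBounds(anchoring);
        return m.find();
    }

    public static void main(String[] args) {
        // Opaque bounds (the default): the lookbehind cannot see "foo" before the region.
        System.out.println(findInRegion("(?<=foo)bar", "foobar", 3, 6, false, true));
        // Transparent bounds: the lookbehind may look outside the region.
        System.out.println(findInRegion("(?<=foo)bar", "foobar", 3, 6, true, true));
        // Anchoring bounds (the default): ^ matches at the start of the region.
        System.out.println(findInRegion("^bar", "foobar", 3, 6, false, true));
        // Without anchoring bounds, ^ only matches at the start of the whole input.
        System.out.println(findInRegion("^bar", "foobar", 3, 6, false, false));
    }
}
```
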
