Java Regular Expression Escaping

程序员猪佩琪

Article link: https://blog.csdn.net/liuwenjie517333813/article/details/68060835

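    While learning Java regular expressions, I ran into three questions.
1. The relationship between a Java string and a regex pattern string was unclear to me.
2. Regular expressions have capturing groups, and a captured group can be used in string replacement, but I did not understand how appendReplacement(StringBuffer sb, String replacement) works.
3. Why do the "\" and "$" characters in replacement need to be escaped before calling appendReplacement(StringBuffer sb, String replacement)?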

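    Java regular expressions mostly operate on strings. As I understand it, three notions of "string" are involved:
     First, the code string: the string literal as written in the .java file.
     Second, the in-memory string: the form the code string takes in memory.
     Third, the regex pattern string: the pattern that Pattern produces by compiling the in-memory string.
     For example, to represent the string "\$" in Java code, you write: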


          String code = "\\$";
          System.out.println("code: " + code);

     Output: code: \$

     In Java source, "\" is the escape character, so the first "\" escapes the one that follows it. In memory, code is therefore the two-character string "\$".
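     The difference between the code string and the in-memory string can be checked directly. The following is a minimal sketch; the class name StringLayers and the helper codeString() are made up for illustration and are not part of the original example.

```java
public class StringLayers {
    // Hypothetical helper: the source literal "\\$" (three characters between
    // the quotes) is compiled into a two-character in-memory string.
    static String codeString() {
        return "\\$";
    }

    public static void main(String[] args) {
        String code = codeString();
        System.out.println("code: " + code);             // code: \$
        System.out.println("length: " + code.length());  // length: 2
        // The two in-memory characters are a backslash and a dollar sign.
        System.out.println(code.charAt(0) == '\\');      // true
        System.out.println(code.charAt(1) == '$');       // true
    }
}
```

     The length of 2 confirms that the doubled backslash exists only in the source file.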
     Now suppose we want a regular expression that matches the string "\$". The code is:


          // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
          String code = "\\$";                  // in memory: \$
          System.out.println("code: " + code);
          String patternString = "\\\\\\$";     // in memory: \\\$
          Pattern pattern = Pattern.compile(patternString);
          Matcher matcher = pattern.matcher(code);
          while (matcher.find()) {
               System.out.println("matcher:" + matcher.group());
          }

     Output: code: \$
               matcher:\$
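    Why is the Java code string for the pattern "\\\\\\$"? In memory, code is \$, that is, a backslash followed by a dollar sign, and the regular expression has to match both characters literally. In a regex, "\" is the escape character, so a literal backslash must be written as \\. "$" is also special in a regex (it is the end-of-line anchor, not a whitespace character), so a literal dollar sign must be written as \$. The pattern string in memory, before compilation, is therefore \\\$. Writing the in-memory \\\$ as a Java string literal doubles each of its three backslashes, which gives "\\\\\\$".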
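    The two escaping layers can be verified with a short program. This is a sketch only; the class name PatternLayers, the constant PATTERN_LITERAL, and the helper matchesBackslashDollar are hypothetical names introduced here. It also shows Pattern.quote, a standard method that builds a literal-matching pattern so the regex-level escaping does not have to be written by hand.

```java
import java.util.regex.Pattern;

public class PatternLayers {
    // Six backslashes in the source literal become three in memory,
    // so the in-memory pattern is the four characters \\\$
    static final String PATTERN_LITERAL = "\\\\\\$";

    // Hypothetical helper: does the regex match the whole two-character string \$ ?
    static boolean matchesBackslashDollar(String regex) {
        return Pattern.compile(regex).matcher("\\$").matches();
    }

    public static void main(String[] args) {
        System.out.println("in-memory pattern: " + PATTERN_LITERAL);  // in-memory pattern: \\\$
        System.out.println("length: " + PATTERN_LITERAL.length());    // length: 4
        System.out.println(matchesBackslashDollar(PATTERN_LITERAL));  // true
        // Pattern.quote wraps its argument so every character is matched literally.
        System.out.println(matchesBackslashDollar(Pattern.quote("\\$")));  // true
    }
}
```

    With Pattern.quote only the Java-literal layer of escaping remains, which is often easier to read than six backslashes.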

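    Capturing groups are used a lot with regular expressions, and replacing text based on a captured group calls for appendReplacement(StringBuffer sb, String replacement). The method treats its second argument specially: any "\" or "$" in replacement that is meant literally has to be escaped. Why? The source code of appendReplacement shows the reason: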


      public Matcher appendReplacement(StringBuffer sb, String replacement) {

        // If no match, return error
        if (first < 0)
            throw new IllegalStateException("No match available");

        // Process substitution string to replace group references with groups
        int cursor = 0;
        String s = replacement;
        StringBuffer result = new StringBuffer();

        while (cursor < replacement.length()) {
            char nextChar = replacement.charAt(cursor);
            if (nextChar == '\\') {
                cursor++;
                nextChar = replacement.charAt(cursor);
                result.append(nextChar);
                cursor++;
            } else if (nextChar == '$') {
                // Skip past $
                cursor++;

                // The first number is always a group
                int refNum = (int)replacement.charAt(cursor) - '0';
                if ((refNum < 0)||(refNum > 9))
                    throw new IllegalArgumentException(
                        "Illegal group reference");
                cursor++;

                // Capture the largest legal group string
                boolean done = false;
                while (!done) {
                    if (cursor >= replacement.length()) {
                        break;
                    }
                    int nextDigit = replacement.charAt(cursor) - '0';
                    if ((nextDigit < 0)||(nextDigit > 9)) { // not a number
                        break;
                    }
                    int newRefNum = (refNum * 10) + nextDigit;
                    if (groupCount() < newRefNum) {
                        done = true;
                    } else {
                        refNum = newRefNum;
                        cursor++;
                    }
                }

                // Append group
                if (group(refNum) != null)
                    result.append(group(refNum));
            } else {
                result.append(nextChar);
                cursor++;
            }
        }

        // Append the intervening text
        sb.append(getSubSequence(lastAppendPosition, first));
        // Append the match substitution
        sb.append(result.toString());

        lastAppendPosition = last;
        return this;
    }

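    During replacement, the method walks through replacement one character at a time. If the in-memory character is '\', that character is not copied into the result buffer; the character after it is copied verbatim instead. If the character is '$', the method expects digits to follow and treats them as a group reference, that is, it appends the text of the matched group to result; if the next character is not a digit, it throws IllegalArgumentException. So '\' and '$' act as keywords with special meaning inside replacement. Before calling this method with text that should be inserted literally, it is best to convert every "\" in replacement to "\\" and every "$" to "\$".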
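    The three cases described above (group reference, escaped character, fully literal text) can be seen side by side in a small sketch. The class name ReplacementDemo and the helper replaceEach are hypothetical. Matcher.quoteReplacement is a standard JDK method that performs the "\" and "$" escaping for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplacementDemo {
    // Hypothetical helper: replaces every match of regex in input using the
    // standard appendReplacement / appendTail loop.
    static String replaceEach(String input, String regex, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(sb, replacement);
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // $1 is read as a reference to capturing group 1.
        System.out.println(replaceEach("Y123Y432", "(Y)", "<$1>"));  // <Y>123<Y>432
        // In memory the replacement is \$ : the backslash makes the dollar literal.
        System.out.println(replaceEach("Y123Y432", "(Y)", "\\$"));   // $123$432
        // quoteReplacement escapes both \ and $, so the text $1\ is inserted as-is.
        String literal = Matcher.quoteReplacement("$1\\");
        System.out.println(replaceEach("Y123Y432", "(Y)", literal)); // $1\123$1\432
    }
}
```

    Using Matcher.quoteReplacement avoids hand-written escaping whenever the replacement text comes from outside the program.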
    Now let's try it: replace each "Y" in the string "Y123Y432" with "$".


          String patternString = "(Y)";
          String code = "Y123Y432";
          //System.out.println("code:" + code);
          String replacement = "$";
          replacement = replacement.replaceAll("\\$", "\\\\\\$");
          Pattern pattern = Pattern.compile(patternString);
          Matcher matcher = pattern.matcher(code);
          StringBuffer sb = new StringBuffer();
          while (matcher.find()) {
               matcher.appendReplacement(sb, replacement);
          }
          matcher.appendTail(sb);
          System.out.println(sb.toString());

Output: $123$432
As for why the second argument to replaceAll is "\\\\\\$": the source shows that replaceAll also ends up calling appendReplacement in the Matcher class, so the same analysis applies.
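The same behavior can be observed through String.replaceAll itself. The sketch below is illustrative only; the class name ReplaceAllDemo and the helpers escapeDollar and isRejected are hypothetical. It compares the manual escape used above with Matcher.quoteReplacement, and checks that an unescaped "$" replacement is rejected with a runtime exception (the exact exception type is not asserted here, since it may differ between JDK versions).

```java
import java.util.regex.Matcher;

public class ReplaceAllDemo {
    // The manual escape from the example above: rewrites every $ as \$.
    // Note that it leaves backslashes untouched.
    static String escapeDollar(String replacement) {
        return replacement.replaceAll("\\$", "\\\\\\$");
    }

    // Hypothetical helper: reports whether String.replaceAll rejects the replacement.
    static boolean isRejected(String replacement) {
        try {
            "Y123Y432".replaceAll("Y", replacement);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A bare $ is read as the start of a group reference with no group number.
        System.out.println(isRejected("$"));                                // true
        System.out.println("Y123Y432".replaceAll("Y", escapeDollar("$")));  // $123$432
        System.out.println(
            "Y123Y432".replaceAll("Y", Matcher.quoteReplacement("$")));     // $123$432
    }
}
```

For a replacement that contains only "$", the manual escape and Matcher.quoteReplacement produce the same in-memory string \$. The library method additionally escapes backslashes.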
