Converting a String That Mixes Unicode Escapes and Ordinary Characters into a Plain String

ErbaoLiu

Article link: https://blog.csdn.net/L_15156024189/article/details/120092206

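I searched online for solutions to this problem. Most of them are fairly complicated or contain bugs that are hard to anticipate. This post takes a regex match-and-replace approach instead, which keeps the code short.

Requirement: convert a string that mixes Unicode escapes (a backslash, the letter u, and four hex digits) with ordinary characters into a plain string.

For example, take the following mixed string: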

|s2\u005c/\u0001/\u0024|we

Splitting the string, in order, into runs of ordinary characters and Unicode escapes gives:

|s2

\u005c

/

\u0001

/

\u0024

|we
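The split above can be reproduced with a short sketch. The class name `MixedStringSplitter` and the method `split` are hypothetical names chosen for illustration, and the pattern is assumed to be the same one the conversion code uses later in this post. The sketch only shows the segmentation, including the `/` separators between escapes. It does not convert anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original code.
class MixedStringSplitter {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    // Splits the input into plain-text runs and six-character escapes, in order.
    static List<String> split(String mixed) {
        List<String> parts = new ArrayList<>();
        Matcher m = ESCAPE.matcher(mixed);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(mixed.substring(last, m.start())); // plain run before the escape
            }
            parts.add(m.group()); // the escape itself, still unconverted
            last = m.end();
        }
        if (last < mixed.length()) {
            parts.add(mixed.substring(last)); // trailing plain run
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints one segment per line.
        for (String part : split("|s2\\u005c/\\u0001/\\u0024|we")) {
            System.out.println(part);
        }
    }
}
```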

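The Unicode escapes correspond to the following plain characters:

Unicode escape	Plain character
\u005c	\
\u0001	^A (^A stands for the invisible control character that \u0001 represents)
\u0024	$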
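To check the mapping in the table, here is a minimal sketch that decodes a single escape by parsing its four hex digits. `EscapeDecodeDemo` and `decode` are hypothetical names, and the input is assumed to be a well-formed six-character escape. The results are printed as integers because the character with code 1 is invisible.

```java
// Hypothetical demo class for illustration; not part of the original code.
class EscapeDecodeDemo {
    // Decodes one six-character escape (backslash, u, four hex digits) into a char.
    // Assumes the input is well formed; no validation is done here.
    static char decode(String escape) {
        // Skip the two-character prefix and parse the remaining hex digits.
        return (char) Integer.parseInt(escape.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println((int) decode("\\u005c")); // 92, the backslash
        System.out.println((int) decode("\\u0001")); // 1, an invisible control character
        System.out.println((int) decode("\\u0024")); // 36, the dollar sign
    }
}
```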

The expected plain string after conversion is:

Mixed string	Plain string
|s2\u005c/\u0001/\u0024|we	|s2\/^A/$|we

The conversion code is as follows:
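Because the converted string contains an invisible control character, printing it directly makes the result hard to verify by eye. The sketch below renders control characters in the same ^A style used in this post. `CaretNotation` and `toVisible` are hypothetical names, and only characters below 0x20 are handled. It is a display aid under those assumptions, not part of the conversion itself.

```java
// Hypothetical display helper for illustration; not part of the original code.
class CaretNotation {
    // Renders control characters below 0x20 as a caret plus a letter (code 1 becomes ^A).
    static String toVisible(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x20) {
                sb.append('^').append((char) (ch + 64)); // 1 + 64 = 65, which is 'A'
            } else {
                sb.append(ch); // visible characters pass through unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The plain string from the table, built with the real control character.
        String plain = "|s2\\/" + (char) 1 + "/$|we";
        System.out.println(toVisible(plain));
    }
}
```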

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by leboop on 2021/9/2.
 */
public class UnicodeTest {
    private static final Pattern PATTERN = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    public static void main(String[] args) throws Exception {
        String mixedStr = "|s2\\u005c/\\u0001/\\u0024|we";
        String plainStr = mixedUnicode2Plain(mixedStr);
        System.out.println(plainStr);
    }

    private static String mixedUnicode2Plain(String mixedStr) {
        if (mixedStr == null) {
            return null;
        }
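        Matcher matcher = PATTERN.matcher(mixedStr);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Drop the two-character prefix and parse the four hex digits as a char.
            char ch = (char) Integer.parseInt(matcher.group().substring(2), 16);
            // 36 is the dollar sign and 92 is the backslash. Both are special in
            // replacement strings, so replace the match with "" and append the
            // character to the buffer directly.
            if (ch == 36 || ch == 92) {
                matcher.appendReplacement(sb, "");
                sb.append(ch);
                continue;
            }
            matcher.appendReplacement(sb, String.valueOf(ch));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(sb);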

        return sb.toString();
    }


}

Some notes:

(1) Consider the following code fragment:

            if (ch == 36 || ch == 92) {
                matcher.appendReplacement(sb, "");
                sb.append(ch);
                continue;
            }

Without this block, running the program fails with the following exception:

Exception in thread "main" java.lang.IllegalArgumentException: character to be escaped is missing
   at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
   at UnicodeTest.mixedUnicode2Plain(UnicodeTest.java:29)
   at UnicodeTest.main(UnicodeTest.java:12)

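The cause lies in the appendReplacement method:

appendReplacement(StringBuffer sb, String replacement)

When the replacement argument used for a matched Unicode escape is a lone backslash \, the method throws the "character to be escaped is missing" exception shown above. When it is a lone dollar sign $, it instead throws: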

throw new IllegalArgumentException( "Illegal group reference: group index is missing")
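Both failures can be reproduced in isolation with a small sketch. `ReplacementPitfallDemo` and `isRejected` are hypothetical names, and the one-character pattern is only there to produce a match. The exact message text may differ between JDK versions, so the sketch prints whatever message the running JDK reports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the original code.
class ReplacementPitfallDemo {
    // Returns true if appendReplacement rejects the given replacement string.
    static boolean isRejected(String replacement) {
        Matcher m = Pattern.compile("x").matcher("x");
        m.find(); // appendReplacement needs a successful match first
        try {
            m.appendReplacement(new StringBuffer(), replacement);
            return false;
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected("\\")); // a lone backslash
        System.out.println(isRejected("$"));  // a lone dollar sign
        System.out.println(isRejected("A"));  // an ordinary character
    }
}
```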

The Javadoc comment in the source describes the backslash \ and the dollar sign $ as follows:

* <p> Note that backslashes (<tt>\</tt>) and dollar signs (<tt>$</tt>) in
* the replacement string may cause the results to be different than if it
* were being treated as a literal replacement string. Dollar signs may be
* treated as references to captured subsequences as described above, and
* backslashes are used to escape literal characters in the replacement
* string.

The code in this post therefore first replaces the Unicode escape for a backslash \ or a dollar sign $ with an empty string, and then appends the character itself to the StringBuffer sb.
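A possible alternative, not used in the original code, is to pass the decoded character through `Matcher.quoteReplacement`, which escapes backslashes and dollar signs so that `appendReplacement` treats them literally. The sketch below assumes the same pattern and method name as the code above. The class name `QuoteReplacementVariant` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant for illustration; the original post special-cases the two characters instead.
class QuoteReplacementVariant {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern PATTERN = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    static String mixedUnicode2Plain(String mixedStr) {
        if (mixedStr == null) {
            return null;
        }
        Matcher matcher = PATTERN.matcher(mixedStr);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            char ch = (char) Integer.parseInt(matcher.group().substring(2), 16);
            // quoteReplacement escapes backslashes and dollar signs, so every
            // decoded character can go through the same code path.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String plain = mixedUnicode2Plain("|s2\\u005c/\\u0001/\\u0024|we");
        // Compare against the expected string, since one character is invisible.
        System.out.println(plain.equals("|s2\\/" + (char) 1 + "/$|we"));
    }
}
```

This removes the special case for character codes 36 and 92 at the cost of one extra call per match. Which form reads better is a matter of taste.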