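I looked online for solutions to this problem; most are fairly complicated or have unpredictable bugs. This article implements the conversion with regex match-and-replace, which keeps the code short.
Requirement: convert a string that mixes Unicode escape sequences with plain characters into a plain string.
For example, take the following mixed string: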
|s2\u005c/\u0001/\u0024|we
Splitting this string, in order, into plain-character segments and Unicode escapes gives:
|s2 | \u005c | / | \u0001 | / | \u0024 | |we |
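The split above can be reproduced with a short sketch. It is only an illustration of the idea, not part of the conversion code: the class name SplitSketch and the method split are hypothetical, and the pattern is assumed to be the same one used later in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    // A literal backslash, 'u', then exactly four hex digits.
    private static final Pattern PATTERN = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    // Returns the plain segments and the escape sequences in their original order.
    static List<String> split(String mixedStr) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(mixedStr);
        int last = 0; // end index of the previous match
        while (matcher.find()) {
            if (matcher.start() > last) {
                // Plain text between the previous match and this one.
                parts.add(mixedStr.substring(last, matcher.start()));
            }
            parts.add(matcher.group());
            last = matcher.end();
        }
        if (last < mixedStr.length()) {
            // Trailing plain text after the final match.
            parts.add(mixedStr.substring(last));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints the seven segments listed above.
        System.out.println(split("|s2\\u005c/\\u0001/\\u0024|we"));
    }
}
```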
The plain characters corresponding to the Unicode escapes are:
Unicode escape | Plain character |
---|---|
\u005c | \ |
\u0001 | ^A (^A stands for the invisible control character that \u0001 maps to) |
\u0024 | $ |
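The mapping in the table comes down to parsing the four hex digits after the prefix as a number and casting it to char. A minimal sketch of that single step follows; the class name DecodeSketch and the method decode are hypothetical, and the input is assumed to be exactly one well-formed six-character escape.

```java
class DecodeSketch {
    // Decodes one escape (backslash, 'u', four hex digits) into a char.
    // Assumes the argument is well formed; no validation is done here.
    static char decode(String escape) {
        // Skip the two-character prefix and parse the rest as base 16.
        return (char) Integer.parseInt(escape.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println((int) decode("\\u005c")); // 92, the backslash
        System.out.println((int) decode("\\u0001")); // 1, the ^A control character
        System.out.println((int) decode("\\u0024")); // 36, the dollar sign
    }
}
```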
The desired plain string after conversion is:
Mixed string | Plain string |
---|---|
|s2\u005c/\u0001/\u0024|we | |s2\/^A/$|we |
The conversion code is as follows:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by leboop on 2021/9/2.
 */
public class UnicodeTest {
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static final Pattern PATTERN = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    public static void main(String[] args) throws Exception {
        String mixedStr = "|s2\\u005c/\\u0001/\\u0024|we";
        String plainStr = mixedUnicode2Plain(mixedStr);
        System.out.println(plainStr);
    }

    private static String mixedUnicode2Plain(String mixedStr) {
        if (mixedStr == null) {
            return null;
        }
        Matcher matcher = PATTERN.matcher(mixedStr);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Drop the two-character prefix and parse the four hex digits as a char.
            char ch = (char) Integer.parseInt(matcher.group().substring(2), 16);
            // 36 is '$' and 92 is the backslash: both are special in a replacement
            // string, so replace the match with "" and append the character directly.
            if (ch == 36 || ch == 92) {
                matcher.appendReplacement(sb, "");
                sb.append(ch);
                continue;
            }
            matcher.appendReplacement(sb, String.valueOf(ch));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}
Some notes:
(1) Consider the following snippet:
if (ch == 36 || ch == 92) {
    matcher.appendReplacement(sb, "");
    sb.append(ch);
    continue;
}
Without this snippet, the program fails with the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: character to be escaped is missing
    at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
    at UnicodeTest.mixedUnicode2Plain(UnicodeTest.java:29)
    at UnicodeTest.main(UnicodeTest.java:12)
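The cause lies in the appendReplacement method:
appendReplacement(StringBuffer sb, String replacement)
When the replacement argument is a lone backslash \ or a lone dollar sign $, using it to replace the matched Unicode escape throws an exception. A lone backslash produces the "character to be escaped is missing" message shown in the stack trace above; a lone dollar sign is rejected as follows:
throw new IllegalArgumentException("Illegal group reference: group index is missing")
The source code comment explains the role of the backslash \ and the dollar sign $ as follows: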
 * <p> Note that backslashes (<tt>\</tt>) and dollar signs (<tt>$</tt>) in
 * the replacement string may cause the results to be different than if it
 * were being treated as a literal replacement string. Dollar signs may be
 * treated as references to captured subsequences as described above, and
 * backslashes are used to escape literal characters in the replacement
 * string.
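The behaviour described in that comment can be checked in isolation. The sketch below is a hypothetical helper (ReplacementErrorSketch and tryReplace are not from the article) that passes a replacement string to appendReplacement unchanged and reports whether an IllegalArgumentException was thrown. It assumes JDK 8 or later, and prints the messages rather than relying on their exact wording.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementErrorSketch {
    // Uses `replacement` verbatim; returns the exception message,
    // or null if the replacement was accepted.
    static String tryReplace(String replacement) {
        Matcher matcher = Pattern.compile("x").matcher("axb");
        StringBuffer sb = new StringBuffer();
        try {
            while (matcher.find()) {
                matcher.appendReplacement(sb, replacement);
            }
            matcher.appendTail(sb);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("backslash: " + tryReplace("\\"));
        System.out.println("dollar:    " + tryReplace("$"));
        System.out.println("hyphen:    " + tryReplace("-")); // ordinary character, accepted
    }
}
```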
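So the approach in this article's code is to first replace the Unicode escape for the backslash \ or the dollar sign $ with an empty string, and then append the backslash or dollar sign directly to sb, the StringBuffer.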
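As a possible alternative to special-casing 36 and 92, the standard library offers Matcher.quoteReplacement, which escapes backslashes and dollar signs so that a string is treated as a literal replacement. The sketch below is a variant under that assumption, not the code this article uses; the class name QuoteReplacementSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementSketch {
    private static final Pattern PATTERN = Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    static String mixedUnicode2Plain(String mixedStr) {
        if (mixedStr == null) {
            return null;
        }
        Matcher matcher = PATTERN.matcher(mixedStr);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            char ch = (char) Integer.parseInt(matcher.group().substring(2), 16);
            // quoteReplacement escapes the backslash and '$', so every decoded
            // character is inserted literally and no special case is needed.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mixedUnicode2Plain("|s2\\u005c/\\u0001/\\u0024|we"));
    }
}
```

Both versions should produce the same result for the example string; this one simply hands the escaping to the library.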