Invisible characters encountered when importing Excel data in Java

Article link: https://blog.csdn.net/SkydivingWang/article/details/100118182

During an Excel import I ran into an interesting problem: a regular expression that should have matched did not. Here is the symptom:

Pattern p = Pattern.compile("^\\d{9}$");
String callPhone1 = "645401367‬";
Matcher m1 = p.matcher(callPhone1);
System.out.println(m1.matches());

The output is:

false

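Why does the match fail? The following program runs the same pattern against a second string that looks identical on screen: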

Pattern p = Pattern.compile("^\\d{9}$");
String callPhone1 = "645401367‬";
String callPhone2 = "645401367";
Matcher m1 = p.matcher(callPhone1);
Matcher m2 = p.matcher(callPhone2);
System.out.println(m1.matches());
System.out.println(m2.matches());

The result is:

false
true

So the problem must lie in the first string: callPhone1 is not what it appears to be. Next, print the length of both strings:

Pattern p = Pattern.compile("^\\d{9}$");
String callPhone1 = "645401367‬";
String callPhone2 = "645401367";
Matcher m1 = p.matcher(callPhone1);
Matcher m2 = p.matcher(callPhone2);
System.out.println(m1.matches());
System.out.println(m2.matches());
System.out.println(callPhone1.length());
System.out.println(callPhone2.length());

The result is:

false
true
10
9

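What does this tell us? callPhone1 is clearly one character longer than callPhone2. Which character is the extra one? Iterating over the string's character array shows: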

String callPhone1 = "645401367‬";
char[] chars = callPhone1.toCharArray(); // convert once rather than on every access
for (int i = 0; i < chars.length; i++) {
	System.out.println(i + ":" + chars[i]);
}
char nine = chars[9]; // the extra character at index 9
System.out.println("Unicode code point of this character: " + Integer.toHexString(nine).toUpperCase());

The output is:

0:6
1:4
2:5
3:4
4:0
5:1
6:3
7:6
8:7
9:‬
Unicode code point of this character: 202C

So the extra character is \u202C.
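
Printing an invisible character directly shows nothing useful, so when debugging it can help to render such characters as escapes. The following is a minimal sketch; the class name InvisibleCharDump and the method escapeNonAscii are hypothetical names chosen for illustration, not part of the original program.

```java
final class InvisibleCharDump {
    /** Returns s with every char outside printable ASCII rewritten as a backslash-u escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c <= 0x7E) {
                sb.append(c); // printable ASCII is kept as-is
            } else {
                sb.append(String.format("\\u%04X", (int) c)); // everything else becomes visible
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The trailing character is written as an escape so it is visible in the source.
        System.out.println(escapeNonAscii("645401367\u202C"));
    }
}
```

Logging the escaped form of a rejected value makes this kind of problem visible immediately, without a separate character-by-character loop.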

So where does this character come from?

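Looking it up shows that this code point lies in the range 2000-206F, which is the General Punctuation block (see the Unicode character list). Note that General Punctuation is the name of the block, not the meaning of the character: U+202C itself is POP DIRECTIONAL FORMATTING, a bidirectional text formatting control that has no visible glyph.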
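
The same lookup can be done from Java itself. The sketch below uses the standard java.lang.Character methods to ask for the block, the name, and the general category of the character; the class name CharClassification and the helper isFormatChar are hypothetical.

```java
final class CharClassification {
    /** True if c belongs to the Unicode general category Cf (format controls). */
    static boolean isFormatChar(char c) {
        return Character.getType(c) == Character.FORMAT;
    }

    public static void main(String[] args) {
        char c = '\u202C';
        System.out.println(Character.UnicodeBlock.of(c)); // the block the code point belongs to
        System.out.println(Character.getName(c));         // the character's own Unicode name
        System.out.println(isFormatChar(c));              // whether it is a format control
    }
}
```

Checking the general category rather than a fixed list of code points should also catch related controls such as U+202D, though which characters a given Excel file contains is something to verify against real data.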

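Why would an Excel file contain such a character? It was evidently introduced while the spreadsheet was being edited. Further reading shows that when a phone number has been pasted into Excel, useless Unicode characters are often added before and after it, so a length check comes out 1 or 2 too long. Since we are importing phone numbers here, one fix is to keep only the characters whose code value is between 48 and 57 (the digits 0 to 9); \u202c and \u202d, which are 8236 and 8237 in decimal, are then never extracted (see "On redundant empty Unicode characters being added automatically when extracting phone numbers from an Excel import").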
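
A minimal sketch of that clean-up step, assuming the cell value has already been read as a String. The class name PhoneCleaner and both method names are hypothetical; the second method is an alternative not taken from the referenced article, which removes only format characters (Unicode category Cf) instead of everything that is not a digit.

```java
import java.util.regex.Pattern;

final class PhoneCleaner {
    private static final Pattern NINE_DIGITS = Pattern.compile("^\\d{9}$");

    /** Keeps only the characters whose code value is between 48 ('0') and 57 ('9'). */
    static String keepAsciiDigits(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c >= 48 && c <= 57) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Alternative: removes only format characters (category Cf) and keeps everything else. */
    static String stripFormatChars(String raw) {
        return raw.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        // A pasted value wrapped in U+202D and U+202C, written as escapes so they are visible.
        String pasted = "\u202D645401367\u202C";
        System.out.println(pasted.length());
        System.out.println(NINE_DIGITS.matcher(pasted).matches());
        System.out.println(NINE_DIGITS.matcher(keepAsciiDigits(pasted)).matches());
        System.out.println(NINE_DIGITS.matcher(stripFormatChars(pasted)).matches());
    }
}
```

The two approaches differ in strictness: keepAsciiDigits also discards letters, spaces, and symbols, so a malformed value may be silently turned into something that passes validation, while stripFormatChars removes only the invisible controls and leaves other bad input for the regular expression to reject.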

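References:

On redundant empty Unicode characters being added automatically when extracting phone numbers from an Excel import

The difference between Unicode and ASCII character encodings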