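Problems:
1. dom4j garbled-text error: "Invalid byte 1 of 1-byte UTF-8 sequence"
2. org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1b) was found in the CDATA section
3. dom4j fails to read XML containing Unicode: 0x1b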
Solutions:
Keep only the valid characters
// Keep only the characters that XML 1.0 allows.
public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return ""; // Null or empty input.
    StringBuilder out = new StringBuilder(in.length()); // Holds the output.
    // Iterate by code point rather than by char. A char never exceeds 0xFFFF,
    // so a char-based loop would drop every supplementary character
    // (0x10000-0x10FFFF), which Java stores as a surrogate pair.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i); // The current code point.
        i += Character.charCount(current);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}
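As a quick check of the filtering approach above, the following sketch parses a small document containing 0x1b, first raw and then after filtering. It uses the JDK's built-in DOM parser instead of dom4j so that it runs without extra dependencies. The class name `StripDemo`, the helper `parses`, and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StripDemo {
    // XML 1.0 character filter, iterating by code point.
    static String strip(String in) {
        if (in == null || in.isEmpty()) return "";
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) out.appendCodePoint(cp);
        }
        return out.toString();
    }

    // Returns true if the JDK parser accepts the text as well-formed XML.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // Not well-formed, e.g. an invalid XML character.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<a><![CDATA[x\u001by]]></a>"; // Contains ESC (0x1b).
        System.out.println("raw parses:      " + parses(raw));
        System.out.println("filtered parses: " + parses(strip(raw)));
    }
}
```

The filter has to run on the text before it reaches the parser; once parsing has failed there is nothing left to clean.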
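Remove invalid characters with a regular expression
// Filter out invalid characters.
// Note: this regular expression is not exhaustive. It removes only the
// control characters in these ranges:
// 0x00 - 0x08
// 0x0b - 0x0c
// 0x0e - 0x1f
// It leaves 0xFFFE, 0xFFFF, and unpaired surrogates in place.
public static String stripNonValidXMLChars(String str) {
    if (str == null || "".equals(str)) {
        return str;
    }
    return str.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
}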
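The pattern above covers only the C0 control range. If the remaining invalid code points (unpaired surrogates, 0xFFFE, 0xFFFF) should be removed as well, one option is a negated character class that lists the valid XML 1.0 ranges instead. This is a sketch only; the class name `XmlRegexFilter` and the method name are hypothetical.

```java
import java.util.regex.Pattern;

class XmlRegexFilter {
    // Matches any code point outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    static String stripInvalidXmlChars(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        return INVALID_XML_CHARS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // ESC, NUL and U+FFFF are invalid; the emoji (U+1F600) is valid.
        String emoji = new String(Character.toChars(0x1F600));
        String dirty = "ok" + (char) 0x1B + (char) 0x00 + " " + (char) 0xFFFF + emoji;
        System.out.println(stripInvalidXmlChars(dirty).equals("ok " + emoji));
    }
}
```

Compiling the pattern once into a constant avoids recompiling it on every call, which `String.replaceAll` would do.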
Both methods above take a String. The input can also be a File object, as in the following:
// Requires java.io.* and java.nio.file.Files; "log" is the class's logger.
// Assumes the file is UTF-8 or another ASCII-compatible encoding, where
// bytes below 0x80 always represent ASCII characters.
// Returns null if the file cannot be read.
private static ByteArrayInputStream filterXmlSpecialChars(File file) {
    try {
        byte[] data = Files.readAllBytes(file.toPath());
        ByteArrayOutputStream out = new ByteArrayOutputStream(data.length);
        for (byte b : data) {
            // Java bytes are signed (-128..127), so convert to 0..255 first.
            // Comparing a raw byte against 0x20 would discard every byte of a
            // multi-byte sequence, and a single byte can never reach the
            // 0xD7FF, 0xE000, or 0x10000 ranges.
            int curr = b & 0xFF;
            // Drop only the control bytes that XML 1.0 forbids.
            if ((curr == 0x9) ||
                (curr == 0xA) ||
                (curr == 0xD) ||
                (curr >= 0x20))
                out.write(curr);
        }
        return new ByteArrayInputStream(out.toByteArray());
    } catch (IOException e) {
        // Also covers the case where the file does not exist.
        log.error(e.getMessage(), e);
    }
    return null;
}
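The method above works on raw bytes, so it cannot tell which code point a byte belongs to. An alternative sketch decodes the file with an explicit charset first and then filters by code point, returning a Reader for the parser to consume. The class name `XmlFileFilter`, the method names, and the choice of UTF-8 in the demo are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class XmlFileFilter {
    // Decodes the whole file with the given charset, drops every code point
    // that XML 1.0 does not allow, and exposes the result as a Reader.
    static Reader openFiltered(Path file, Charset charset) throws IOException {
        String text = new String(Files.readAllBytes(file), charset);
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlFileFilter::isValidXmlChar)
            .forEach(out::appendCodePoint);
        return new StringReader(out.toString());
    }

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Small helper for the demo: drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = reader.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".xml");
        try {
            // "caf\u00e9" contains a two-byte UTF-8 character; 0x1b is invalid.
            String content = "<a>caf\u00e9" + (char) 0x1B + "</a>";
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            try (Reader reader = openFiltered(tmp, StandardCharsets.UTF_8)) {
                System.out.println(readAll(reader));
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```

This reads the whole file into memory, like the byte-based version. With dom4j, the returned Reader could be handed to `SAXReader.read(Reader)`; note that a parser reading from a Reader ignores the encoding declared in the XML prolog, so the charset passed in must match the file's real encoding.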
Reference: https://blog.csdn.net/yan3013216087/article/details/81450658