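At work I recently had to parse a very large XML file (over 1 GB), and while processing it I also ran into a parse failure (org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x19) was found in the element content of the document).
Here is how I handled both problems, along with the code.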
// Imports needed: org.dom4j.*, org.dom4j.io.SAXReader,
// org.xml.sax.ErrorHandler, org.xml.sax.SAXParseException, java.io.*
try {
    SAXReader saxReader = new SAXReader();
    // Handle each matching element as soon as it has been parsed,
    // instead of building the whole document in memory first.
    saxReader.addHandler("/list/XXXX", new ElementHandler() {
        public void onStart(ElementPath path) {
            // do nothing here...
        }
        public void onEnd(ElementPath path) {
            // process a ROW element
            Element row = path.getCurrent();
            Document document = row.getDocument();
            System.out.println(document.asXML());
            // Detach the processed row so the in-memory tree stays small.
            row.detach();
        }
    });
    final File file = new File(getFileName(language, isProduct));
    // Log parser warnings and errors together with the file name.
    saxReader.setErrorHandler(new ErrorHandler() {
        public void error(SAXParseException e) {
            System.out.println("file:" + file.getName() + " ERROR: " + e);
        }
        public void fatalError(SAXParseException e) {
            System.out.println("file:" + file.getName() + " FATAL: " + e);
        }
        public void warning(SAXParseException e) {
            System.out.println("file:" + file.getName() + " WARNING: " + e);
        }
    });
    // Note: this InputStreamReader decodes with the JVM's default charset.
    InputStreamReader source = new InputStreamReader(new FileInputStream(file));
    saxReader.read(source);
} catch (DocumentException e) {
    logger.error("error", e);
    return;
} catch (FileNotFoundException e) {
    logger.error("error", e);
    return;
}
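If the XML file contains invisible characters that are not valid in XML, dom4j throws an exception while parsing the file (An invalid XML character (Unicode: 0x19) and so on). Ideally the XML tooling that produces the file guarantees this never happens; if such characters do show up in a file, the following method can filter them out.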
public static String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return ""; // Nothing to filter.
    }
    StringBuilder out = new StringBuilder(in.length()); // Holds the output.
    // Iterate by code point rather than by char, so that supplementary
    // characters (U+10000..U+10FFFF, stored as surrogate pairs) are kept
    // and only unpaired surrogates are dropped.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        // Keep only the characters allowed by the XML 1.0 Char production.
        if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    return out.toString();
}
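The method above works on a String, but a file of more than 1 GB is not something you want to load into a String before parsing. One possible way to apply the same rule while streaming is a small filtering Reader, sketched below. `XmlCharFilterReader`, `isAllowed` and `filterString` are hypothetical names used only for illustration; they are not part of dom4j or the JDK. As a simplification, the sketch lets surrogate chars through unchecked instead of validating pairs across buffer boundaries.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: drops characters that XML 1.0 does not allow
// while the data is being read, so the input never has to fit in memory.
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // Same ranges as the String-based method; surrogates are passed
    // through unchecked (an assumption made to keep the sketch short).
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isAllowed((char) c));
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            int kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was filtered out; read the next one.
        }
    }

    // Small demo helper: runs a String through the filter.
    static String filterString(String s) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Assuming the file is UTF-8 encoded, this could be wired into the parsing code as `saxReader.read(new XmlCharFilterReader(new InputStreamReader(new FileInputStream(file), "UTF-8")));`, since `SAXReader.read` also accepts a `Reader`.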
Reposted from: http://www.ilehao.com/blog/2012/10/28/parse-big-xml-file/