Common problems when reading Word files (formats covered: doc, docx, rtf)
- Exceptions
- java.lang.IllegalArgumentException: The document is really a RTF file
- java.lang.IllegalArgumentException: The document is really a OOXML file
- java.lang.IllegalArgumentException: The document is really a HTML file
- java.lang.IndexOutOfBoundsException: Block xxx not found
- org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
- java.io.IOException: Zip bomb detected!
Exceptions
java.lang.IllegalArgumentException: The document is really a RTF file
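Cause:
The file has a .doc extension, but its content is actually in RTF format.
Solution:
1. Change the file extension from .doc to .rtf (or copy the file to a new .rtf file).
2. Read the content as RTF.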
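import java.io.FileInputStream;
import java.io.InputStream;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
String rtfFileAbsPath = "C:\\Users\\cjyou\\Desktop\\a306dedb-3c65-47b5-9c02-d87a61d1ffe2.rtf";
RTFEditorKit rtf = new RTFEditorKit();
DefaultStyledDocument styledDoc = new DefaultStyledDocument();
InputStream rtfin = new FileInputStream(rtfFileAbsPath);
// Parse the RTF stream into the styled document, starting at offset 0
rtf.read(rtfin, styledDoc, 0);
// Re-encode the extracted text as ISO-8859-1 bytes, then decode them with the platform default charset
String content = new String(styledDoc.getText(0, styledDoc.getLength()).getBytes("ISO8859_1"));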
java.lang.IllegalArgumentException: The document is really a OOXML file
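Cause:
When a file is read as doc, its header is checked to determine whether it is a genuine doc file.
1. This error is thrown when a docx file has had its extension changed to .doc.
Solution:
1. Change the file extension from .doc to .docx (or copy the file to a new .docx file).
2. Read the content as docx.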
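import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
String fileAbsPath = "C:\\Users\\cjyou\\Desktop\\a306dedb-3c65-47b5-9c02-d87a61d1ffe2.docx";
InputStream in = new FileInputStream(fileAbsPath);
XWPFDocument docx = new XWPFDocument(in);
XWPFWordExtractor extractor = new XWPFWordExtractor(docx);
String content = extractor.getText();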
java.lang.IllegalArgumentException: The document is really a HTML file
Cause:
The Word file is actually an HTML file.
Solution:
Read it directly with a FileInputStream.
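As a rough sketch of the approach above, the file can be read as plain text and the markup stripped afterwards. The class name HtmlDocReader, the regex-based tag stripping, and the caller-supplied charset are assumptions made for illustration; a real HTML parser would handle scripts, styles, and entities more reliably.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of POI.
final class HtmlDocReader {
    /** Reads the whole file through a FileInputStream and decodes it with the given charset. */
    static String readRaw(String path, Charset charset) throws IOException {
        try (InputStream in = new FileInputStream(path);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        }
    }

    /** Crude tag removal: replaces every <...> run with a space, then collapses whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

A caller would pass the charset the file was saved in (for example the one declared in the HTML meta tag) to readRaw and then hand the result to stripTags.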
java.lang.IndexOutOfBoundsException: Block xxx not found
Cause:
The file is corrupted.
org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
Cause:
This error is thrown when XWPFDocument is used to read a doc file.
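Several of the errors above come from a mismatch between the reader that was chosen and the actual content of the file. One way to avoid them is to inspect the first bytes of the file before picking a reader. The sketch below is a minimal version under stated assumptions: the names WordFormatSniffer and Format are hypothetical, the HTML check is a simple heuristic that ignores a byte order mark, and a ZIP signature only suggests that the file may be OOXML, since every ZIP archive starts the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of POI.
final class WordFormatSniffer {
    enum Format { OLE2_DOC, ZIP_MAYBE_DOCX, RTF, HTML, UNKNOWN }

    // Signature of OLE2 compound files (binary .doc).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of ZIP archives (.docx is a ZIP container).
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };

    /** Guesses the format from the leading bytes of a file. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2)) return Format.OLE2_DOC;
        if (startsWith(head, ZIP)) return Format.ZIP_MAYBE_DOCX;
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.startsWith("{\\rtf")) return Format.RTF;
        String lower = text.trim().toLowerCase(Locale.ROOT);
        if (lower.startsWith("<html") || lower.startsWith("<!doctype html")) {
            return Format.HTML;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

In use, the caller would read the first few dozen bytes of the file, call sniff, and then choose the doc, docx, RTF, or plain-text reading path accordingly. POI also ships its own FileMagic helper for this kind of detection, which may be preferable when it is available in the version in use.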
java.io.IOException: Zip bomb detected!
Error message:
java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 212101, Raw/compressed size: 2108, ratio: 0.009939
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: word/fonts/font2.odttf
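Cause:
What is a zip bomb? It is an archive containing small files with highly repetitive or deeply recursive content that take up an enormous amount of space when decompressed, which can crash the system.
When POI reads a Word file, it checks the ratio of compressed size to uncompressed size, so that a file cannot crash the system by consuming excessive space.
POI's default minimum ratio is 0.01. When an entry's ratio falls below this value (in the message above, 0.009939 is below the default 0.01), the error above is thrown.
Solution:
Set a smaller minimum ratio in code.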
ZipSecureFile.setMinInflateRatio(0.001);
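To make the check concrete, the arithmetic from the error message above can be reproduced by hand. The sketch below is not POI's implementation; InflateRatioCheck and its method names are hypothetical, and it only mirrors the comparison described in the message: compressed size divided by uncompressed size, compared against the minimum ratio.

```java
// Hypothetical helper that mirrors the comparison described in the error message.
final class InflateRatioCheck {
    /** Ratio of compressed size to uncompressed size, as reported in the message. */
    static double ratio(long compressedSize, long uncompressedSize) {
        return (double) compressedSize / uncompressedSize;
    }

    /** True when the ratio falls below the given minimum, i.e. the entry would be rejected. */
    static boolean wouldBeRejected(long compressedSize, long uncompressedSize,
                                   double minInflateRatio) {
        return ratio(compressedSize, uncompressedSize) < minInflateRatio;
    }
}
```

With the sizes from the message (2108 compressed, 212101 uncompressed), the ratio is about 0.009939, which is rejected at a minimum of 0.01 and accepted at 0.001. Since a lower minimum weakens the protection, it is probably best to lower it only as far as the files in question require.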