[JAVA] Common problems when reading Word files (formats covered: doc, docx, rtf)

尤成军军军

Last modified 2023-05-15 20:05:06

Category: JAVA. Tags: java, poi, Word reading, doc, docx

First published 2022-06-28 17:35:44

Article link: https://blog.csdn.net/qq_31083947/article/details/125506709

Exceptions

java.lang.IllegalArgumentException: The document is really a RTF file

Cause:
The file has a .doc extension, but its content is actually in RTF format.

Solution:
1. Rename the file from .doc to .rtf (or copy it to a new .rtf file)
2. Read the content as RTF

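import java.io.FileInputStream;
import java.io.InputStream;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

String rtfFileAbsPath = "C:\\Users\\cjyou\\Desktop\\a306dedb-3c65-47b5-9c02-d87a61d1ffe2.rtf";
RTFEditorKit rtf = new RTFEditorKit();
DefaultStyledDocument styledDoc = new DefaultStyledDocument();
InputStream rtfin = new FileInputStream(rtfFileAbsPath);
rtf.read(rtfin, styledDoc, 0); // pass the stream opened above (the original referenced an undefined "in")
// Turn the extracted text back into ISO-8859-1 bytes, then decode them with the platform default charset
String content = new String(styledDoc.getText(0, styledDoc.getLength()).getBytes("ISO8859_1"));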

java.lang.IllegalArgumentException: The document is really a OOXML file

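Cause:
When reading a doc file, POI checks the file header to verify that it really is a doc file.
1. This error is thrown when a docx file has had its extension changed to .doc

Solution:
1. Rename the file from .doc to .docx (or copy it to a new .docx file)
2. Read the content as docx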

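import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

String fileAbsPath = "C:\\Users\\cjyou\\Desktop\\a306dedb-3c65-47b5-9c02-d87a61d1ffe2.docx";
InputStream in = new FileInputStream(fileAbsPath);
XWPFDocument docx = new XWPFDocument(in);
XWPFWordExtractor extractor = new XWPFWordExtractor(docx);
String content = extractor.getText();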
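
The "The document is really a ... file" errors all mean the extension does not match the real format. Instead of renaming files by hand, the format can be guessed from the first bytes of the file. Below is a minimal sketch using only the JDK. The class `WordFormatSniffer` and its `Format` enum are hypothetical names made up for this example; the signatures checked are the well-known OLE2 header, the ZIP `PK` header and the `{\rtf` prefix, and the HTML check is only a rough heuristic. POI also ships its own `FileMagic` class for this kind of detection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper: guesses the real format of a "Word" file from its first bytes. */
final class WordFormatSniffer {

    enum Format { DOC, DOCX, RTF, HTML, UNKNOWN }

    // OLE2 compound file signature, used by binary .doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header (bytes 'P' 'K' 03 04). A docx is a zip package,
    // but so is any other zip, so DOCX here only means "zip-based".
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return Format.DOC;
        }
        if (startsWith(head, ZIP)) {
            return Format.DOCX;
        }
        // Text-based formats: ignore leading whitespace and letter case.
        String text = new String(head, StandardCharsets.ISO_8859_1).trim().toLowerCase(Locale.ROOT);
        if (text.startsWith("{\\rtf")) {
            return Format.RTF;
        }
        if (text.startsWith("<html") || text.startsWith("<!doctype html")) {
            return Format.HTML;
        }
        return Format.UNKNOWN;
    }

    static Format sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[512];
            int n = in.read(buf); // a single read is enough for a sketch
            return sniff(n <= 0 ? new byte[0] : Arrays.copyOf(buf, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, the caller can pick the RTF, docx or plain-stream reader based on `WordFormatSniffer.sniff(path)` rather than on the file name.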

java.lang.IllegalArgumentException: The document is really a HTML file

Cause:
The Word file is actually an HTML file.

Solution:
Read it directly with a FileInputStream.
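
Reading the file through a plain stream could look like the sketch below. The class name `HtmlWordReader` and the method `readAll` are hypothetical, and the charset is an assumption the caller has to supply (the source does not say which encoding these files use). Note that the result is the raw HTML markup, tags included.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

/** Hypothetical helper: reads an HTML file disguised as .doc into a String. */
final class HtmlWordReader {

    static String readAll(String fileAbsPath, Charset charset) throws IOException {
        try (InputStream in = new FileInputStream(fileAbsPath)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            // Copy the whole stream into memory, then decode it once.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        }
    }
}
```

A call such as `HtmlWordReader.readAll(path, StandardCharsets.UTF_8)` returns the markup; extracting only the visible text would need an HTML parser on top, which is not shown here.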

java.lang.IndexOutOfBoundsException: Block xxx not found

Cause:
The file is corrupted.

org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file

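Cause:
This error is thrown when XWPFDocument is used to read a doc file. XWPFDocument only handles the zip-based docx format; the binary doc format is read with HWPFDocument.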
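
To avoid this exception, a file can be checked before it is handed to XWPFDocument. The following is a rough sketch using only `java.util.zip`; the class `DocxPackageCheck` is a hypothetical name, and the two entry names are the conventional parts found in Word-produced docx files, not a full OOXML validation.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

/** Hypothetical pre-check: does this file look like a docx package? */
final class DocxPackageCheck {

    static boolean looksLikeDocx(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            // A docx is a zip that conventionally contains these two parts.
            return zip.getEntry("[Content_Types].xml") != null
                    && zip.getEntry("word/document.xml") != null;
        } catch (ZipException e) {
            // Not a zip archive at all, for example a binary doc file.
            return false;
        }
    }
}
```

If `looksLikeDocx` returns false, the file is better routed to the doc or RTF reading path.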

java.io.IOException: Zip bomb detected!

Full error message:

java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 212101, Raw/compressed size: 2108, ratio: 0.009939
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: word/fonts/font2.odttf

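Cause:
What is a zip bomb? It is a small archive packed with highly repetitive or recursively nested data that takes up an enormous amount of space when decompressed, which can crash the system.
When POI reads a Word file, it checks the compression ratio of each entry to keep decompression from exhausting resources and crashing the system.
POI's default minimum ratio (compressed size divided by uncompressed size) is 0.01. When an entry falls below this value (in the message above, 0.009939 is below the default 0.01), the error above is thrown.

Solution:
Set a smaller minimum ratio in code before opening the document. Lowering the limit weakens the protection, so only do this for files you trust.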

import org.apache.poi.openxml4j.util.ZipSecureFile;
ZipSecureFile.setMinInflateRatio(0.001);
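
To see which entries of a docx would trip the limit before changing it, the ratio from the error message can be recomputed offline. This is only an approximation under stated assumptions: the class `InflateRatioReport` is a hypothetical name, and it reads the sizes recorded in the zip's central directory, whereas POI performs its check while actually decompressing the data.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: lists zip entries whose inflate ratio is below a limit. */
final class InflateRatioReport {

    /** Compressed size divided by uncompressed size, as printed in the error message. */
    static double ratio(long compressedSize, long uncompressedSize) {
        return (double) compressedSize / uncompressedSize;
    }

    static List<String> entriesBelow(Path zipFile, double minRatio) throws IOException {
        List<String> suspicious = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long raw = entry.getCompressedSize();
                long expanded = entry.getSize();
                // Skip directories, empty entries and entries with unknown sizes (-1).
                if (raw > 0 && expanded > 0 && ratio(raw, expanded) < minRatio) {
                    suspicious.add(entry.getName());
                }
            }
        }
        return suspicious;
    }
}
```

For the numbers in the error above, `ratio(2108, 212101)` is about 0.009939, just under the default 0.01, so a limit slightly below that value would already let the file through.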
