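Java: Extracting Document Content
Introduction
Apache POI is a set of Java APIs for reading and writing files based on the OOXML (Office Open XML) standard and the OLE2 standard. In other words, POI can read and write any file that follows either standard, not just the familiar office application files.
Official Apache POI website.
Reference article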
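The two container formats named in the introduction can be told apart by their first bytes: an OLE2 compound file starts with the signature `D0 CF 11 E0 A1 B1 1A E1`, while an OOXML package is a ZIP archive and normally starts with `PK\3\4`. The following is a minimal JDK-only sketch of that check. The class name `ContainerSniffer` and its methods are hypothetical, and a ZIP signature only shows that the file is a ZIP archive, not that it is a valid OOXML document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature (legacy .doc/.xls/.ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature "PK\3\4" (OOXML packages are ZIP archives)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file from its leading bytes. */
    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Container sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
        }
        return sniff(Arrays.copyOf(head, filled));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A check like this is useful when a file extension cannot be trusted, for example an uploaded `.doc` that is really a renamed `.docx`.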
pom
<!-- Utility toolkit -->
<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.5.2</version>
</dependency>
<!-- Excel tools -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>${xerces.version}</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.office.free</artifactId>
    <version>3.9.0</version>
</dependency>
Word
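While browsing the jar's source code, I found two classes that can handle documents with the .doc extension: Word6Extractor and WordExtractor. WordExtractor worked correctly in my tests. Word6Extractor appears to be for older Word documents (Word 6 / Word 95); I have not tested it yet.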
.doc
WordExtractor
Word6Extractor
try {
    File tempFile = null;
    ...
    // Open the OLE2 container and load it as a binary Word document
    HWPFDocument doc = new HWPFDocument(new POIFSFileSystem(tempFile));
    WordExtractor extractor = new WordExtractor(doc);
    // Text split by paragraph
    String[] paragraphText = extractor.getParagraphText();
    // Full document text as a single string
    text = extractor.getText();
} catch (IOException e) {
    e.printStackTrace();
}
.docx
// WordUtil and Word07Writer come from hutool; getDoc() exposes the underlying XWPFDocument
Word07Writer writer = WordUtil.getWriter(tempFile);
XWPFWordExtractor extractor = new XWPFWordExtractor(writer.getDoc());
String wordText = extractor.getText();
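For context on what is being extracted here: a .docx file is an OOXML package, that is, a ZIP archive whose main body lives in the entry `word/document.xml`, with text stored in `w:t` elements inside `w:p` paragraphs. The sketch below reads that entry with only the JDK. It illustrates the file format and is not how XWPFWordExtractor is implemented; it ignores headers, footers, footnotes, and text boxes. The class `DocxTextSketch` and its methods are hypothetical, and `minimalPackage` builds a bare test archive (no XML escaping) rather than a file Word would open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DocxTextSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text, one line per w:p paragraph. */
    static String extractText(byte[] docxBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(readFully(zip));
                }
            }
            return ""; // no main document part found
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Not a readable .docx package", e);
        }
    }

    private static String textOf(byte[] xml)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid XXE on untrusted files
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        StringBuilder out = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a ZIP holding only word/document.xml, for trying out extractText. */
    static byte[] minimalPackage(String... paragraphs) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>")
                .append("<w:document xmlns:w=\"").append(W_NS).append("\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

For real documents the POI extractor remains the better choice, since it also handles the parts this sketch skips.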
txt
String javaEncode = EncodingDetect.getJavaEncode(tempFile);
String fileText = FileUtil.readString(tempFile, javaEncode);
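EncodingDetect is not among the dependencies listed in the pom above, so it is presumably a helper class supplied separately. As a much narrower stand-in, the following JDK-only sketch picks a charset from the byte-order mark (BOM) and otherwise uses a caller-supplied fallback. The class `BomCharsetSketch` and its methods are hypothetical. It does not attempt statistical detection, so a BOM-less GBK or UTF-8 file is decoded with whatever fallback is passed in, and UTF-32 BOMs are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCharsetSketch {

    /** Picks a charset from the BOM, or returns the fallback when there is none. */
    static Charset detect(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    /** Decodes the bytes with the detected charset, dropping the BOM itself. */
    static String decode(byte[] data, Charset fallback) {
        Charset charset = detect(data, fallback);
        int skip = bomLength(data);
        return new String(data, skip, data.length - skip, charset);
    }

    private static int bomLength(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return 3;
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) return 2;
        return 0;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The bytes could come from `java.nio.file.Files.readAllBytes(tempFile.toPath())`; the fallback would typically be the encoding most common in the files being processed.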
PDF
See another blog post on parsing PDFs in Java to get their content (java解析pdf获取pdf中内容信息).
Using spire.office.free
PdfDocument pdf = new PdfDocument();
pdf.loadFromFile(tempFile.getAbsolutePath());
PdfPageCollection pages = pdf.getPages();
if (pages != null && pages.getCount() > 0) {
    int count = pages.getCount();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < count; i++) {
        PdfPageBase pageBase = pages.get(i);
        String pageContent = pageBase.extractText(false);
        builder.append(pageContent).append("\r\n");
    }
    text = builder.toString();
}
- Spire.PDF for Java tutorial (Chinese)
Excel
See: a detailed guide to using POI (the DOM and SAX approaches) and its shortcomings.
Exporting Word documents from a template with FreeMarker in Spring Boot
An approach to generating complex Word documents in Spring Boot: build the template in Word itself