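Reading a docx file from the web into memory with Java
Note: I have not tested whether doc files can be read directly this way. If you need that, you can adapt the code yourself; for doc files the change is not difficult.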
Strategy
Get the download link of the docx file from the web, then use POI to read the docx file behind that link through a stream.
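Before POI gets involved, the "into memory" part of this strategy can be sketched with the JDK alone: open the link and copy its stream into a byte array. This is only a minimal sketch and not the code used later in this post; the class name `StreamBuffer` and the methods `readAll` and `download` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper (not part of POI): copies a remote file into memory.
final class StreamBuffer {

    // Reads the stream to its end and returns everything as one byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Opens the link and buffers its content; the stream is closed afterwards.
    static byte[] download(String link) throws IOException {
        try (InputStream in = new URL(link).openStream()) {
            return readAll(in);
        }
    }
}
```

The resulting bytes can be wrapped in a `ByteArrayInputStream` and handed to POI wherever an `InputStream` is expected, which also makes it possible to read the same content more than once.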
Implementation
Add the POI dependencies
<!-- POI core -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- POI OOXML support (docx) -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- POI OOXML schemas -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad -->
<!-- POI scratchpad (needed for the old doc format) -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>4.1.0</version>
</dependency>
Fetching and reading
Core code:
InputStream in = conn.getInputStream();
OPCPackage opcPackage = OPCPackage.open(in);
POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
result = extractor.getText();
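Here conn is the connection to the download link. The InputStream obtained from it is passed to POI's OPCPackage.open(), which produces the argument that XWPFWordExtractor() needs, and getText() then returns the text.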
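The core code takes `conn` as given. One way to obtain the stream with a few basic safeguards is sketched below. The class name `Links`, the method `openStream`, and the 10-second timeouts are illustrative assumptions, not something this post prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: opens the download link and returns its body stream.
final class Links {

    static InputStream openStream(String link) throws IOException {
        URLConnection conn = new URL(link).openConnection();
        conn.setConnectTimeout(10_000); // assumed value: give up connecting after 10 s
        conn.setReadTimeout(10_000);    // assumed value: give up if a read blocks for 10 s
        if (conn instanceof HttpURLConnection) {
            int status = ((HttpURLConnection) conn).getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                // Otherwise an error page could be handed to POI as if it were a document
                throw new IOException("Unexpected HTTP status " + status + " for " + link);
            }
        }
        return conn.getInputStream();
    }
}
```

The caller is responsible for closing the returned stream, for example with try-with-resources.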
Full source:
// Imports needed by getInfo (package names as of POI 4.1.0); put them at the top of the enclosing class file
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.ooxml.extractor.POIXMLTextExtractor;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

public static String getInfo(String path) {
    URLConnection conn;
    try {
        // MalformedURLException is an IOException, so one catch covers both steps
        conn = new URL(path).openConnection();
    } catch (IOException e1) {
        e1.printStackTrace();
        return ""; // nothing to read without a connection
    }
    String result = "";
    // First decide by extension whether the file is doc or docx
    try {
        if (path.endsWith(".doc")) {
            // path is a URL, so read from the connection rather than from a local FileInputStream
            // (this doc branch is untested, see the note at the top)
            try (InputStream is = conn.getInputStream();
                 WordExtractor extractor = new WordExtractor(is)) {
                result = extractor.getText();
                // Print all the text of the Word document
                System.out.println(result);
                System.out.println("=================1=================");
                System.out.println("==================2================");
                // Print the footer text
                System.out.println("Footer: " + extractor.getFooterText());
                // Print the document metadata, such as the author and the modification time
                System.out.println(extractor.getMetadataTextExtractor().getText());
                System.out.println("===============5===================");
                // Print the text of each paragraph
                String[] paraTexts = extractor.getParagraphText();
                for (int i = 0; i < paraTexts.length; i++) {
                    System.out.println("Paragraph " + (i + 1) + " : " + paraTexts[i]);
                }
                // Print the text assembled from the document's text pieces
                System.out.println(extractor.getTextFromPieces());
                System.out.println("=============6=====================");
                // Print the metadata extractor object itself (its toString output)
                System.out.println(extractor.getMetadataTextExtractor());
                System.out.println("===============7===================");
                // Print the endnote text
                System.out.println(extractor.getEndnoteText());
                System.out.println("===============8===================");
            }
        } else if (path.endsWith(".docx")) {
            // Get the stream from the connection and let POI open it as an OPC package
            try (InputStream in = conn.getInputStream();
                 POIXMLTextExtractor extractor = new XWPFWordExtractor(OPCPackage.open(in))) {
                result = extractor.getText();
            }
        } else {
            System.out.println("This file is not a Word file");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
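The key part is the docx branch. The solutions I found online read a local file, whereas here the InputStream from the connection is passed directly to OPCPackage's open method.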
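The `endsWith` check in `getInfo` relies on the link ending in the file extension, which download links do not always do, for example when a query string follows the file name. A possible alternative is to look at the first bytes of the content: a docx file is a ZIP package and a doc file is an OLE2 compound document, and each container starts with a fixed signature. The sketch below uses only the JDK. The class `WordSniffer` and its members are hypothetical names, and the check only identifies the container, so other Office formats stored in the same container would match as well.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper: guesses the container format from the leading bytes.
final class WordSniffer {

    enum Kind { DOCX_LIKE_ZIP, DOC_LIKE_OLE2, UNKNOWN }

    // ZIP local file header signature ("PK\3\4") and OLE2 compound file signature
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Peeks at up to 8 bytes and rewinds, so the caller can still parse the stream.
    static Kind sniff(BufferedInputStream in) throws IOException {
        in.mark(8);
        byte[] head = new byte[8];
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n == -1) {
                break; // stream is shorter than 8 bytes
            }
            len += n;
        }
        in.reset();
        if (startsWith(head, len, OLE2)) return Kind.DOC_LIKE_OLE2;
        if (startsWith(head, len, ZIP)) return Kind.DOCX_LIKE_ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int len, int[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

To use it, wrap the connection's stream in a `BufferedInputStream`, call `sniff`, and pass the same wrapped stream on to the matching extractor. POI also has its own `FileMagic` class in `org.apache.poi.poifs.filesystem` for this kind of check, which may be preferable in real code.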