Exporting the Plain-Text Content of a PDF Document with iText in Java

Latest recommended article published on 2023-10-19 14:32:33

薛定谔之死猫

Category: Hello World. Tags: documents, java, string, jar, tools, import

Article link: https://blog.csdn.net/mscf/article/details/6957061

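iText is a widely used library for working with PDF documents in Java, and it shows up in many open-source projects. This post uses the iText API to export the text of a PDF document as plain text. Plenty of tools already support this operation, but it is a good first step with the library, and it is also the most common requirement when reading PDF files.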

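First, download the iText package from http://sourceforge.net/projects/itext/. The latest version at the time of writing is 5.1.2, and the full package name is iText-5.1.2.zip. Unzipping it yields a set of jar files; the one we need is itextpdf-5.1.2.jar. Once a Java compile and runtime environment is set up locally, write the following sample code: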

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PDFReader {

    /**
     * Prints the text of a sample PDF file to standard output.
     *
     * @param args command-line arguments (unused)
     * @throws IOException if the PDF file cannot be read
     */
    public static void main(String[] args) throws IOException {
        System.out.print(getPdfFileText("E:\\test\\plugindoc.pdf"));
    }

    /**
     * Extracts the text of every page of the given PDF file.
     */
    public static String getPdfFileText(String fileName) throws IOException {
        PdfReader reader = new PdfReader(fileName);
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            StringBuilder buff = new StringBuilder();
            // iText page numbers start at 1
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                TextExtractionStrategy strategy = parser.processContent(i,
                        new SimpleTextExtractionStrategy());
                buff.append(strategy.getResultantText());
            }
            return buff.toString();
        } finally {
            reader.close(); // release the underlying file
        }
    }

}

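The code above reads a PDF file from the local disk and writes the result to standard output. The text extraction is done by a static method, which the main method calls before printing the return value to standard output. Compile the source file with javac, adding the jar mentioned above to the classpath during compilation, and use the same classpath setting when running the program.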
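
The example above only prints the extracted text to standard output. If the goal is an actual plain-text file, one possible follow-up is to write the returned string to disk with an explicit encoding. The sketch below is an assumption of mine and not part of iText or the original program: the class name TextFileExporter, the method saveText, and the default file name output.txt are all hypothetical, and it uses only the standard java.nio.file API (Java 7 or later). In the real program, the string passed to saveText would be the return value of getPdfFileText.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of iText): saves already-extracted text as a UTF-8 file.
class TextFileExporter {

    // Writes the given text to outputPath, replacing any existing file.
    static void saveText(String text, String outputPath) throws IOException {
        Files.write(Paths.get(outputPath), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text; in the article's program this would be the
        // string returned by PDFReader.getPdfFileText(...).
        String text = "Sample text extracted from a PDF\n";
        // Assumed default file name, used only when no argument is given.
        String outputPath = args.length > 0 ? args[0] : "output.txt";
        saveText(text, outputPath);
        System.out.println("Wrote " + text.length() + " characters to " + outputPath);
    }
}
```

Writing with an explicit UTF-8 encoding avoids depending on the platform default encoding, which matters when the PDF contains non-ASCII text such as Chinese.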