Reading text from a PDF file in Eclipse

小资质

Tags: eclipse class pdf

Article link: https://blog.csdn.net/k_1075659958/article/details/80088845

One of the most common tools for extracting text from PDF files is PDFBox.

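import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

/**
 * PDF (Portable Document Format) is an electronic file format developed by Adobe.
 * It is independent of the operating system, so the same file can be used on
 * Windows, Unix, Mac OS and other systems. A PDF file packages text, fonts,
 * formatting, colors, and device- and resolution-independent graphics and images
 * into a single file. Extracting the text therefore requires parsing the file
 * format, and fortunately a number of tools already do this.
 *
 * This class uses the API provided by PDFBox to extract the text from a PDF file.
 * The calls used here match the PDFBox 1.x API; PDFBox 2.x moved PDFTextStripper
 * to org.apache.pdfbox.text and changed how PDFParser is constructed.
 */
public class PdfParser {

    public static void main(String[] args) throws Exception {
        // Source PDF to read
        FileInputStream fis = new FileInputStream("F:\\Working\\船舶工业标准体系(2012年版).pdf");
        // Text file that receives the extracted text
        BufferedWriter writer = new BufferedWriter(new FileWriter("F:\\Working\\pdf_change.txt"));
        PDDocument document = null;
        try {
            // Parse the PDF into an in-memory document
            PDFParser p = new PDFParser(fis);
            p.parse();
            document = p.getPDDocument();
            // Extract the text of all pages as a single string
            PDFTextStripper ts = new PDFTextStripper();
            String s = ts.getText(document);
            // Save the text and echo it to the console
            writer.write(s);
            System.out.println(s);
        } finally {
            // Release the document and both streams even if parsing fails
            if (document != null) {
                document.close();
            }
            fis.close();
            writer.close();
        }
    }
}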
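
The listing above saves the extracted text through FileWriter, which uses the platform default character encoding. For a Chinese document this can produce unreadable output on some systems. The following is only a minimal sketch of one way to write the text as UTF-8 using the JDK alone; the class name Utf8TextWriter, the method writeUtf8 and the relative output path are hypothetical and not part of PDFBox or the original code, and the sample string simply stands in for the text returned by PDFTextStripper.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox or the original post.
class Utf8TextWriter {

    // Writes the text to the target file as UTF-8, replacing any existing content.
    static void writeUtf8(Path target, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by PDFTextStripper.getText(...).
        String extracted = "船舶工业标准体系(2012年版)";
        // Assumed output location; adjust to wherever the text should be saved.
        Path out = Paths.get("pdf_change.txt");
        writeUtf8(out, extracted);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the main method of the listing above, a call such as writeUtf8(Paths.get("F:\\Working\\pdf_change.txt"), s) could take the place of the FileWriter-based writer if a fixed encoding is wanted.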