PDFBox conversion: using the open-source PDFBox package to batch-convert PDF files to TXT files

Latest recommended article published on 2023-07-18 21:36:23

风景无限之


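A couple of days ago, a classmate was struggling to convert more than a thousand PDF reports into TXT documents and asked me to write a program to automate the job. I came across the open-source package PDFBox online, looked into it out of curiosity, read a number of posts about it, and extended the code from one of them until it finally solved my classmate's problem. I am posting the result here in the hope that it helps anyone facing the same problem.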

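Download the pdfbox and fontbox JAR files.

Create a new project in Eclipse and import the two JARs, pdfbox and fontbox. Test code can be pasted directly from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html and http://blog.csdn.net/dengjianqiang001/article/details/3960305. After a few fixes (including changing the project encoding to UTF-8 and importing the correct packages) it runs as-is, provided you also supply a PDF file to convert.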
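Since encoding came up as one of the things that needed fixing, it can help to make the encoding of the output text explicit rather than depend on project or platform settings. The following is a minimal JDK-only sketch of writing extracted text as UTF-8. The class name `Utf8TextWriter` and the method `writeUtf8` are hypothetical names chosen for illustration; they are not part of PDFBox or of the code in the linked posts.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8TextWriter {
    // Writes text to the given file as UTF-8, regardless of the platform default charset.
    public static void writeUtf8(File target, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("pdfbox-demo", ".txt");
        // Two CJK characters written as Unicode escapes so the source file stays ASCII
        String sample = "\u62a5\u544a 2012";
        writeUtf8(out, sample);
        // Read the raw bytes back and decode them as UTF-8 to check the round trip
        String back = new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + back.equals(sample));
        out.delete();
    }
}
```

The same idea applies to any writer used for the converted text: name the charset at the point where the file is opened.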

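To convert PDFs to TXT in bulk, I made small changes to the code from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html. The program reads the PDFs from an input/ directory under the working directory and writes the .txt files to output/. The code is as follows: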

package test;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

// These imports target the PDFBox 1.x API; in PDFBox 2.x PDFTextStripper
// moved to the org.apache.pdfbox.text package.
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from a PDF document located in the input/ directory
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File("input/" + fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        // Clear references left over from the previous file in the batch
        cosDoc = null;
        pdDoc = null;
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document on success as well as on failure
            try {
                if (pdDoc != null)
                    pdDoc.close(); // also closes the underlying COSDocument
                else if (cosDoc != null)
                    cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            // Write as UTF-8 explicitly instead of relying on the platform default encoding
            PrintWriter pw = new PrintWriter(fileName, "UTF-8");
            pw.print(pdfText);
            pw.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
        System.out.println("Done.");
    }

    // Extracts text from every PDF in input/ and writes it to output/ as a .txt file
    public static void main(String args[]) {
        File input = new File("input");
        if (input.isDirectory()) {
            // Make sure the output directory exists before writing to it
            new File("output").mkdirs();
            String[] fileList = input.list();
            PDFTextParser ptp = new PDFTextParser();
            for (String f : fileList) {
                // Skip non-PDF entries; the name handling below assumes a ".pdf" suffix
                if (!f.toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                String pdfTxt = ptp.pdftoText(f);
                if (pdfTxt == null) {
                    System.out.println("PDF to Text Conversion failed.");
                } else {
                    String outTxtName = f.substring(0, f.length() - 4) + ".txt";
                    ptp.writeTexttoFile(pdfTxt, "output/" + outTxtName);
                }
            }
        }
    }

}
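The batch loop derives each output name by dropping the last four characters of the input name, which assumes a `.pdf` suffix. Below is a minimal JDK-only sketch of a more defensive version of that step: it selects PDFs by extension and swaps the extension explicitly. The class `PdfBatchNames`, the filter `PDF_ONLY`, and the method `toTxtName` are hypothetical helpers for illustration, not part of PDFBox or of the program above.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

public class PdfBatchNames {
    // Accepts file names ending in ".pdf", ignoring case.
    public static final FilenameFilter PDF_ONLY = new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
        }
    };

    // Maps "report.pdf" to "report.txt"; a name without a ".pdf" suffix just gets ".txt" appended.
    public static String toTxtName(String pdfName) {
        if (pdfName.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return pdfName.substring(0, pdfName.length() - ".pdf".length()) + ".txt";
        }
        return pdfName + ".txt";
    }

    public static void main(String[] args) {
        File input = new File("input");
        // list() returns null when input/ is missing or cannot be read
        String[] pdfs = input.list(PDF_ONLY);
        if (pdfs == null) {
            System.out.println("No readable input directory found.");
            return;
        }
        for (String name : pdfs) {
            System.out.println(name + " -> output/" + toTxtName(name));
        }
    }
}
```

Running `main` only prints the planned mapping, so it can be used as a dry run before converting a large batch.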

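The conversion of my classmate's 1,000-plus PDFs went smoothly. During the run, a message like the following sometimes appears (it is logged at INFO level, shown as 信息 under a Chinese locale):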

十一月 12, 2012 9:22:12 下午 org.apache.pdfbox.util.PDFStreamEngine processOperator

信息: unsupported/disabled operation: EI

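It does not affect the result, though, and I have not yet looked into a fix. Separately, I ran into a missing bcprov-jdk15on-147.jar (the Bouncy Castle provider); downloading that JAR from its project website and adding it to the project solves the problem.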
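The timestamp-and-level layout of the `unsupported/disabled operation: EI` message shown above resembles the default `java.util.logging` output. That suggests, as an assumption the original post does not confirm, that PDFBox's log messages end up in `java.util.logging` in this setup. Under that assumption, one possible way to hide such INFO messages is to raise the level of the `org.apache.pdfbox` logger before starting the conversion. The class name `QuietPdfBoxLogs` and the method `silenceInfo` are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxLogs {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a level
    // set on an otherwise unreferenced logger can be lost to garbage collection.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // Raises the threshold so that INFO messages from org.apache.pdfbox.* are dropped.
    public static void silenceInfo() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        silenceInfo();
        // A child logger with no level of its own inherits the level set on its parent
        Logger engine = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        System.out.println("INFO enabled: " + engine.isLoggable(Level.INFO));
        System.out.println("WARNING enabled: " + engine.isLoggable(Level.WARNING));
    }
}
```

If the project routes logging through a different backend (for example log4j), this sketch has no effect and the level would need to be set in that backend's configuration instead.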

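PDFBox does a good job on well-formatted PDF documents (papers, official notices, financial reports, and other documents with a regular layout). Results are mediocre for less regular PDFs, such as ones exported from PPT slides or ones containing many images and unusual symbols.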
