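A couple of days ago, a classmate was stuck trying to convert more than a thousand PDF reports into txt files and asked me to write a program to automate it. I came across the open-source package pdfbox online, looked into it out of curiosity, read a good number of posts, and, building on other people's code, added and changed a few things until my classmate's problem was finally solved. I'm posting it here in the hope that it helps anyone with the same headache.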
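Download the pdfbox and fontbox jar packages.
Create a new project in Eclipse and import the two jars, pdfbox and fontbox. Test code can be pasted straight from
http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html and
http://blog.csdn.net/dengjianqiang001/article/details/3960305. After a couple of fixes (setting the project encoding to UTF-8 and importing the correct packages) it runs as-is, provided, of course, that you give it a PDF to work on.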
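Before pasting any test code, it can help to confirm that the jars really are on the project's classpath. The following is only a rough sketch: `ClasspathCheck` is a hypothetical helper of my own, not part of pdfbox, and the three class names are assumptions about what the pdfbox 1.x, fontbox 1.x and Bouncy Castle (bcprov) jars contain.

```java
// Hypothetical helper (not part of pdfbox): reports which expected classes can be loaded.
final class ClasspathCheck {
    // One representative class per jar. These names are assumptions based on the
    // package layout of pdfbox 1.x, fontbox 1.x and Bouncy Castle (bcprov).
    static final String[] REQUIRED = {
        "org.apache.pdfbox.pdmodel.PDDocument",
        "org.apache.fontbox.cmap.CMap",
        "org.bouncycastle.jce.provider.BouncyCastleProvider"
    };

    // Returns true if the named class can be found, without running its static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : REQUIRED) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it from inside the same Eclipse project; any line marked MISSING points to a jar that still needs to be added to the build path.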
To convert PDFs to txt in batch, I made a few small changes to the code from
http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html, as follows:
package test;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from a PDF document under input/; returns null on failure
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File("input/" + fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        // Reset per-call state so a failure never touches a previous file's documents
        cosDoc = null;
        pdDoc = null;
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Always release the document, on success as well as on failure;
            // closing the PDDocument also closes its underlying COSDocument
            try {
                if (pdDoc != null)
                    pdDoc.close();
                else if (cosDoc != null)
                    cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        PrintWriter pw = null;
        try {
            // Write as UTF-8 explicitly rather than relying on the platform default
            pw = new PrintWriter(fileName, "UTF-8");
            pw.print(pdfText);
            System.out.println("Done.");
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        } finally {
            // Close the writer even if printing failed
            if (pw != null)
                pw.close();
        }
    }

    // Converts every PDF in input/ to a text file of the same name in output/
    public static void main(String args[]) {
        File input = new File("input");
        if (input.isDirectory()) {
            // PrintWriter does not create missing directories, so make sure output/ exists
            new File("output").mkdirs();
            String[] fileList = input.list();
            PDFTextParser ptp = new PDFTextParser();
            for (String f : fileList) {
                // Skip anything that is not a .pdf; the name trimming below relies on that suffix
                if (!f.toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                String pdfTxt = ptp.pdftoText(f);
                if (pdfTxt == null) {
                    System.out.println("PDF to Text Conversion failed.");
                } else {
                    String outTxtName = f.substring(0, f.length() - 4) + ".txt";
                    ptp.writeTexttoFile(pdfTxt, "output/" + outTxtName);
                }
            }
        }
    }
}
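The loop in main builds each output name by cutting the last four characters off the input name, which only works when every name ends in a four-character suffix such as ".pdf". As a rough sketch of an alternative, the helper below lets `File.list` do the filtering and derives the name from the last dot instead. `PdfBatchNames`, `PDF_ONLY` and `toTxtName` are names I made up for illustration; only the JDK is used.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper for the batch loop; uses the JDK only, no pdfbox calls.
final class PdfBatchNames {
    // Accept only names ending in ".pdf", ignoring case.
    static final FilenameFilter PDF_ONLY = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
        }
    };

    // Replace everything after the last dot with "txt";
    // names without a usable dot simply get ".txt" appended.
    static String toTxtName(String pdfName) {
        int dot = pdfName.lastIndexOf('.');
        String base = (dot > 0) ? pdfName.substring(0, dot) : pdfName;
        return base + ".txt";
    }

    public static void main(String[] args) {
        File input = new File("input");
        String[] names = input.list(PDF_ONLY); // null if "input" is not a readable directory
        if (names == null) {
            System.out.println("No input directory found.");
            return;
        }
        for (String name : names) {
            System.out.println(name + " -> output/" + toTxtName(name));
        }
    }
}
```

In the converter, `input.list(PdfBatchNames.PDF_ONLY)` would take the place of `input.list()` and `toTxtName(f)` the place of the `substring` call.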
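With this I converted all 1,000-plus PDFs for my classmate without trouble. Now and then the run prints a warning-like message:
Nov 12, 2012 9:22:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
It is logged at INFO level and does not affect the result, so I have not looked into a fix yet. I also ran into a missing bcprov-jdk15on-147.jar at one point; downloading that jar from its project website and importing it solves the problem.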
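If the message is just noise, one possible way to hide it is to raise the log level for the pdfbox packages. This is a sketch under an assumption I have not verified for every setup: that pdfbox's logging ends up in java.util.logging, which the format of the message suggests. If Log4j or another backend is on the classpath, the level would have to be set there instead. `QuietPdfBoxLogs` is a made-up name.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: raises the java.util.logging level for everything under org.apache.pdfbox.
final class QuietPdfBoxLogs {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a level
    // set on an otherwise unreferenced logger can be lost after garbage collection.
    static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // After this call, INFO messages from child loggers such as
    // org.apache.pdfbox.util.PDFStreamEngine are dropped; WARNING and above still pass.
    static void silenceInfoMessages() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        silenceInfoMessages();
        System.out.println("INFO loggable: " + PDFBOX_LOGGER.isLoggable(Level.INFO));
    }
}
```

Calling `QuietPdfBoxLogs.silenceInfoMessages()` once at the start of main, before any PDF is parsed, should be enough under that assumption.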
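pdfbox works well on properly formatted PDFs (papers, notices, financial reports and other documents with a regular layout). On less regular ones, such as PDFs exported from PPT slides or ones full of images and odd symbols, the results are only so-so.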