Reading Text from PDF Files with PDFBox

Latest recommended article published on 2024-03-28 10:30:58

daning106


Tags: Java Blog

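I explained why I wrote this code in [url=../../../blog/164931]this post[/url]. There are already plenty of articles online about reading PDF files in Java; this one simply records my own practice for future reference. Of the many libraries that can read and write PDF, I use PDFBox 0.7.3 here. PDFBox is an open-source library for working with PDF files.
First, download [url=http://www.pdfbox.org/]the latest version of PDFBox[/url] and unpack it. For convenience, assume the unpacked directory is $PDFBox_HOME.
Add $PDFBox_HOME/lib/PDFBox-0.7.3.jar to the classpath. If the compiler reports that a class cannot be found, a dependency is probably missing; in that case, add all the jar files under $PDFBox_HOME/external to the classpath as well.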
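To confirm that the jar really is on the classpath before debugging anything else, a small check along these lines may help. This is only a sketch: the class name ClasspathCheck and the method isOnClasspath are made up for illustration, and the only PDFBox-specific assumption is the class name org.pdfbox.pdmodel.PDDocument, taken from the imports of the reader code in this post.

```java
class ClasspathCheck {
	/** Returns true if the named class can be loaded from the current classpath. */
	static boolean isOnClasspath(String className) {
		try {
			// initialize=false: only locate and load the class, do not run its static initializers
			Class.forName(className, false, ClasspathCheck.class.getClassLoader());
			return true;
		} catch (ClassNotFoundException e) {
			// no class file with this name on the classpath
			return false;
		} catch (LinkageError e) {
			// the class file was found but could not be linked (for example a missing superclass)
			return false;
		}
	}

	public static void main(String[] args) {
		// class name taken from the imports of the reader code (PDFBox 0.7.3 package layout)
		String name = "org.pdfbox.pdmodel.PDDocument";
		System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found on classpath"));
	}
}
```

Run it with the same classpath you use for compiling. If it reports the class as not found, the jar path is most likely wrong.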
The code that reads the text is as follows:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class SimplePDFReader {
	/**
	 * Simply reads all the text from a PDF file.
	 * You have to deal with the format of the output text yourself.
	 * 2008-2-25
	 * @param pdfFilePath path of the PDF file
	 * @return all text in the PDF file, or null if the file could not be read
	 */
	public static String getTextFromPDF(String pdfFilePath) {
		String result = null;
		FileInputStream is = null;
		PDDocument document = null;
		try {
			is = new FileInputStream(pdfFilePath);
			PDFParser parser = new PDFParser(is);
			parser.parse();
			document = parser.getPDDocument();
			PDFTextStripper stripper = new PDFTextStripper();
			result = stripper.getText(document);
		} catch (FileNotFoundException e) {
			// the file does not exist or cannot be opened; result stays null
			e.printStackTrace();
		} catch (IOException e) {
			// parsing or text extraction failed; result stays null
			e.printStackTrace();
		} finally {
			// close the document before the stream it was parsed from
			if (document != null) {
				try {
					document.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if (is != null) {
				try {
					is.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
		return result;
	}
}

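Once you have the PDF's text, pick out the part you want according to the format of your files (I was looking for the article title, which in my files happened to be the first line of the extracted text), then use Java's File API to rename the file.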
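Picking out "the first line" is worth a little care, because the extracted text may start with blank lines or use Windows line endings. The helper below is a minimal sketch of one way to do it, assuming the title is the first non-blank line. The names TitleExtractor and firstNonBlankLine are made up for illustration and are not part of PDFBox.

```java
class TitleExtractor {
	/**
	 * Returns the first non-blank line of the text, trimmed,
	 * or null if the text is null or contains only whitespace.
	 */
	static String firstNonBlankLine(String text) {
		if (text == null) {
			return null;
		}
		// split on \r\n, \n or a lone \r
		String[] lines = text.split("\\r?\\n|\\r");
		for (String line : lines) {
			String trimmed = line.trim();
			if (trimmed.length() > 0) {
				return trimmed;
			}
		}
		return null;
	}
}
```

A null return means there is no usable title, so the caller can skip the rename instead of producing an empty file name.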
The renaming code is as follows:

import java.io.File;
import java.io.FilenameFilter;

public class PaperNameMender {

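	public static void changePaperName(String filePath) {
		// Get the PDF text with SimplePDFReader (null if the file could not be read)
		String ts = SimplePDFReader.getTextFromPDF(filePath);
		if (ts == null || ts.trim().length() == 0) {
			return;
		}
		// Take the first non-blank line; the text may contain no line break at all
		ts = ts.trim();
		int lineEnd = ts.indexOf('\n');
		String result = (lineEnd == -1) ? ts : ts.substring(0, lineEnd);
		// Find the position of the last dot (.) in the original file name
		int index = filePath.lastIndexOf('.');
		if (index == -1) {
			index = filePath.length();
		}
		// Build the new file name
		String newFilename = filePath.substring(0, index) + " " +
				result.trim() + ".pdf";
		File originalFile = new File(filePath);
		// Rename the file; renameTo returns false if the rename failed
		if (!originalFile.renameTo(new File(newFilename))) {
			System.err.println("Could not rename " + filePath);
		}
	}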
}
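Two things the renaming code above does not cover are titles that contain characters not allowed in file names (such as : or ?) and running the rename over a whole directory. The sketch below shows one possible way to handle both with the standard library only. The class PaperRenameHelper and its methods are hypothetical names, and the set of replaced characters is an assumption based on what common file systems reject, not an exhaustive rule.

```java
import java.io.File;
import java.io.FilenameFilter;

class PaperRenameHelper {
	/** Lists the entries of dir whose names end with ".pdf" (case-insensitive). */
	static File[] listPdfFiles(File dir) {
		File[] files = dir.listFiles(new FilenameFilter() {
			public boolean accept(File parent, String name) {
				return name.toLowerCase().endsWith(".pdf");
			}
		});
		// listFiles returns null if dir is not a directory or cannot be read
		return files == null ? new File[0] : files;
	}

	/** Replaces characters that common file systems do not allow in file names. */
	static String sanitize(String title) {
		// assumed set: \ / : * ? " < > | and control characters
		return title.trim().replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_");
	}

	/** Builds "<original name without extension> <title>.pdf" in the same directory. */
	static File buildTarget(File original, String title) {
		String name = original.getName();
		// look for the dot in the file name only, not in the directory part
		int dot = name.lastIndexOf('.');
		String base = (dot == -1) ? name : name.substring(0, dot);
		return new File(original.getParentFile(), base + " " + sanitize(title) + ".pdf");
	}
}
```

With these helpers, a batch run would loop over listPdfFiles(dir), read each title with SimplePDFReader, and pass the result of buildTarget to renameTo. Searching for the dot in the file name rather than in the full path avoids cutting the path in the wrong place when a directory name contains a dot.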
