Extracting author email addresses from a batch of PDF papers with Apache PDFBox

chenxupro

Tags: pdfbox

Article link: https://blog.csdn.net/chenxupro/article/details/8796238

PDFBox is an open-source Java tool for working with PDF documents; with it, reading and analyzing PDF documents is straightforward. Project home page: http://pdfbox.apache.org/

I am using version 1.6.0 here and downloaded four jars: fontbox-1.6.0.jar, jempbox-1.6.0.jar, pdfbox-1.6.0.jar, and pdfbox-app-1.6.0.jar.

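The idea is simple: put all the papers whose email addresses need extracting into one folder, iterate over the files, match each file's text against a regular expression, and write every matched email address to a text file.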
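The traversal step can be sketched with the standard library alone. One thing to watch for: if the output text file is written into the same folder as the papers, a plain directory listing will return it too, and a PDF parser will most likely fail on it. The sketch below keeps only regular files whose names end in .pdf. The names PdfFolderSketch, isPdfName, and listPdfFiles are hypothetical and exist only for this illustration.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

public class PdfFolderSketch {
    // True if the file name ends in ".pdf", ignoring case.
    static boolean isPdfName(String name) {
        return name.toLowerCase().endsWith(".pdf");
    }

    // Returns the regular files directly inside dir whose names end in ".pdf",
    // sorted by path. Returns an empty array if dir cannot be listed.
    static File[] listPdfFiles(File dir) {
        File[] files = dir.listFiles(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                return isPdfName(name) && new File(parent, name).isFile();
            }
        });
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files);
        return files;
    }

    public static void main(String[] args) {
        // Folder to scan; defaults to the current directory (an assumption for the demo).
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : listPdfFiles(dir)) {
            System.out.println(pdf.getName());
        }
    }
}
```

Filtering by extension is only a heuristic: a file can be named .pdf without being a valid PDF, so the parsing step still needs to handle failures.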

The core code is as follows:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class GetEmailFromPdfsMain 
{
	// Loose pattern that finds addresses anywhere in the text.
	// A stricter alternative that only matches a whole string:
	// ^[\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)+$
	private static final Pattern FIND_EMAIL_PATTERN = Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

	public static void main(String a[]) throws IOException 
	{
		File PDFDir = new File("C:\\pdfs");
		BufferedWriter writer = new BufferedWriter(new FileWriter("C:\\pdfs\\AuthorEmails.txt"));
		try
		{
			if(PDFDir.isDirectory())
			{
				File[] PDFFiles = PDFDir.listFiles(); 
				for (File PDFFile:PDFFiles)
				{
					// Skip non-PDF files, including AuthorEmails.txt, which sits in the same folder
					if(PDFFile.isFile() && PDFFile.getName().toLowerCase().endsWith(".pdf"))
					{
						FileInputStream fis = new FileInputStream(PDFFile);
						PDDocument doc = null;
						String s;
						try
						{
							PDFParser p = new PDFParser(fis);
							p.parse();
							doc = p.getPDDocument();
							PDFTextStripper ts = new PDFTextStripper();
							// Only extract text from the first page
							ts.setStartPage(1);
							ts.setEndPage(1);
							s = ts.getText(doc);
						}
						finally
						{
							// Release the parsed document as well as the input stream
							if (doc != null)
							{
								doc.close();
							}
							fis.close();
						}

						Matcher m = FIND_EMAIL_PATTERN.matcher(s);
						while (m.find()) 
						{
							writer.write(m.group());
							writer.newLine();
							System.out.println(m.group());
						}
					}
				}
			}
		}
		finally
		{
			writer.close();
		}
	}
}
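
To see what the regular expression in the code above actually picks up, it can be tried on a plain string without PDFBox. This is a minimal sketch: the class name EmailRegexSketch, the method findEmails, and the sample text with its placeholder addresses are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexSketch {
    // Same expression as in the code above: word characters, dots and hyphens,
    // then '@', then a domain that ends in a dot followed by word characters.
    static final Pattern EMAIL = Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    // Collects every substring of text that the pattern matches, in order of appearance.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up first-page text; the addresses are placeholders.
        String page = "Alice Example (alice@example.edu), Bob Sample <bob.sample@cs.example.org>.";
        for (String email : findEmails(page)) {
            System.out.println(email);
        }
    }
}
```

Surrounding brackets and the trailing full stop are left out of the matches because those characters are not in the character class. Grouped notations that some papers use, such as {alice,bob}@example.edu, are not matched at all, since the character right before the @ is a closing brace.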