How to elegantly extract the contents of a PDF
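Keep the following in mind when extracting text with this method:
1. If the PDF contains hidden data, that data is extracted too.
2. Text whose font color matches the page background color is still extracted.
3. Text whose font color matches its own background (highlight) color is still extracted.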
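import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

// Assumes PDFBox 2.x and an SLF4J-style logger named `log` on the enclosing class.
public static String getPdfText(String pathStr) {
    String text = "";
    // try-with-resources closes the document and avoids the NullPointerException
    // that document.close() would throw if load() had failed.
    try (PDDocument document = PDDocument.load(new File(pathStr))) {
        // Text content
        PDFTextStripper stripper = new PDFTextStripper();
        // Output the text sorted by its position on the page
        stripper.setSortByPosition(true);
        log.info("Extracting text from {}", pathStr);
        text = stripper.getText(document);
    } catch (InvalidPasswordException e) {
        // The PDF is encrypted and cannot be opened without a password.
        log.warn("Password-protected PDF {}: {}", pathStr, e.getMessage());
    } catch (IOException e) {
        log.warn("Failed to read or close PDF {}: {}", pathStr, e.getMessage());
    }
    // Empty string when extraction failed.
    return text;
}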
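A caller of getPdfText may want a rough signal of how garbled the returned text is before trusting it. The following is only a sketch under stated assumptions, not part of the original code: the class name GarbleCheck, the methods garbleRatio and needsOcr, and the threshold parameter are all hypothetical. It treats U+FFFD replacement characters, private-use, unassigned, surrogate, and non-whitespace control code points as suspicious, which is a heuristic and will not catch every kind of garbling.

```java
// Hypothetical helper (not from the original post): estimates how garbled
// extracted text is, using only the Java standard library.
final class GarbleCheck {
    private GarbleCheck() {}

    /** Returns the fraction of non-whitespace code points that look like extraction damage. */
    static double garbleRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces, tabs, and line breaks are not counted
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                      // Unicode replacement character
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE // unpaired surrogate
                    || Character.isISOControl(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Assumed policy: blank text or a ratio above the threshold suggests trying OCR. */
    static boolean needsOcr(String text, double threshold) {
        return text == null || text.trim().isEmpty() || garbleRatio(text) > threshold;
    }
}
```

For example, a caller could evaluate GarbleCheck.needsOcr(getPdfText(path), 0.1) and run OCR only when it returns true. The 0.1 threshold is an illustrative value that would need tuning on real documents.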
Minimizing garbled text
Run OCR on the document and store the recognized data correctly.
public static List<File> fetchPdfText(String ocrFolder, String path, double zoom, File sourceFile, PdfProcessLog processLog
, boolean flag, CustomContext customContext) throws Exception {
FileInputStream fis =