Aspose Pdf: Extracting Text the Graceful Way

程序媛-承哥

Last modified 2022-07-14 15:24:07

First published 2020-12-08 16:08:42

Article link: https://blog.csdn.net/qq_22368681/article/details/110875376

How to extract the content of a PDF gracefully

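Keep the following in mind when extracting text with this method:

1. If the PDF contains hidden data, that data is extracted too;

2. Text is extracted even when the background color and the font color are the same;

3. Text is extracted even when the font color and the font's own background color are the same.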

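    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Uses the PDFBox 2.x API; `log` is the enclosing class's logger
    // (presumably SLF4J-style, given the {} placeholders).
    public static String getPdfText(String pathStr) {
        log.info("Extracting text from {}", pathStr);
        // try-with-resources closes the document even if loading or extraction
        // fails, and avoids calling close() on a null document
        try (PDDocument document = PDDocument.load(new File(pathStr))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Output text sorted by its position on the page
            stripper.setSortByPosition(true);
            return stripper.getText(document);
        } catch (InvalidPasswordException e) {
            // Encrypted PDF that cannot be opened without a password
            log.info("Password-protected PDF {}: {}", pathStr, e.getMessage());
        } catch (IOException e) {
            log.info("Failed to read PDF {}: {}", pathStr, e.getMessage());
        }
        // On any failure, return an empty string as the original code did
        return "";
    }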

Minimizing the rate of garbled text

Run OCR on the pages so that the data is recognized correctly, then store it.
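
The article does not show how to decide when directly extracted text is too garbled to trust. As a rough sketch only, one possible heuristic is to measure the share of characters that typically signal extraction damage (the Unicode replacement character, private-use code points, unassigned code points, lone surrogates, stray control characters) and fall back to OCR above some threshold. The class name `GarbledTextCheck`, both method names, and the threshold value are hypothetical and are not taken from the original code.

```java
// Hypothetical helper: not part of the original article, PDFBox, or Aspose.
final class GarbledTextCheck {

    private GarbledTextCheck() {}

    /** Fraction of non-whitespace code points that look like extraction damage. */
    static double garbledRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout whitespace is not counted either way
            }
            total++;
            int type = Character.getType(cp);
            boolean damaged = cp == 0xFFFD            // Unicode replacement character
                    || type == Character.PRIVATE_USE  // glyph without a real Unicode mapping
                    || type == Character.UNASSIGNED   // code point not defined in Unicode
                    || type == Character.SURROGATE    // lone surrogate
                    || type == Character.CONTROL;     // control character other than whitespace
            if (damaged) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /**
     * Assumption: text with no visible characters (for example a scanned page)
     * or with a garbled ratio above the threshold should go through OCR.
     */
    static boolean needsOcr(String text, double threshold) {
        if (text == null || text.trim().isEmpty()) {
            return true;
        }
        return garbledRatio(text) > threshold;
    }
}
```

A caller could run `getPdfText` first and only send the file to OCR when `GarbledTextCheck.needsOcr(text, 0.1)` returns true; the 0.1 threshold is an arbitrary starting point that would need tuning against real documents.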


    public static List<File> fetchPdfText(String ocrFolder, String path, double zoom, File sourceFile, PdfProcessLog processLog
            , boolean flag, CustomContext customContext) throws Exception {
        FileInputStream fis =
