Reading text from PDF files on Linux and storing it in a database

Latest recommended article published on 2024-04-30 17:02:38

一条小斑码

Tags: linux unix server java

Article link: https://blog.csdn.net/weixin_44520267/article/details/125313291

Project scenario:

Reading text from PDF files on a Linux system and storing it in a database

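Problem description

Note: on Linux, the extracted text cannot be split on the \r\n line break

For example: after the PDF text is read, it has to be split on the \r\n line break so that each line can be processed. However, \r\n is the Windows line ending. Linux uses \n, and PDFTextStripper separates lines with the platform's line separator by default, so on Linux the text contains no \r\n and the split leaves it in one piece.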
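The article does not show how the problem was resolved. One common approach is to split on a pattern that matches any line break instead of a hard-coded \r\n. The sketch below assumes Java 8 or later (for the `\R` regex matcher); the class name `LineSplitter` and the method `splitLines` are hypothetical names chosen for illustration, and the choice to trim lines and skip blank ones is an assumption about what the downstream processing wants.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitter {

    /**
     * Splits text into trimmed, non-blank lines regardless of whether the
     * line breaks are "\r\n" (Windows), "\n" (Linux) or a lone "\r".
     */
    public static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return lines;
        }
        // \R matches any line break and treats "\r\n" as a single break
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The same logical content with Windows and Linux line endings
        System.out.println(splitLines("name: A\r\nage: 1"));
        System.out.println(splitLines("name: A\nage: 1"));
    }
}
```

Because the pattern does not depend on the operating system, the same code gives the same list of lines whether the service runs on Windows or Linux.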

Key code for reading the PDF file

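        // Imports needed: java.io.File, java.io.IOException,
        // org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper

        // Convert the uploaded MultipartFile to a File on disk
        File file1 = ImageUtil.multipartFileToFile(file);
        // text will hold all the text extracted from the PDF document
        String text;
        // try-with-resources closes the PDDocument even if extraction fails
        try (PDDocument document = PDDocument.load(file1)) {
            // PDFTextStripper extracts the text content of the document
            PDFTextStripper pdfStripper = new PDFTextStripper();
            text = pdfStripper.getText(document);
        } catch (IOException e) {
            throw new InsertException(e.getMessage());
        } finally {
            // Delete the file on disk whether or not extraction succeeded
            file1.delete();
        }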

    /**
     * Converts a MultipartFile to a File.
     *
     * @param file the uploaded file
     * @return the File written to disk, or null if the upload is empty
     */
    public static File multipartFileToFile(MultipartFile file) {
        Assert.notNull(file, "ID card must not be empty!");
        if (file.getSize() <= 0) {
            return null;
        }
        File toFile = new File(file.getOriginalFilename());
        // try-with-resources closes the stream even if the copy fails
        try (InputStream ins = file.getInputStream()) {
            // inputStreamToFile is a helper that is not shown in this article
            inputStreamToFile(ins, toFile);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return toFile;
    }
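The method above writes the upload into the working directory under the client-supplied file name, so two uploads with the same name can overwrite each other. As a possible alternative (not the author's method), the stream could be copied into a uniquely named temporary file. This is a minimal sketch using only the JDK; `TempFileCopy`, `copyToTempFile` and the `upload-` prefix are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileCopy {

    /** Copies the stream into a new, uniquely named temp file and returns its path. */
    public static Path copyToTempFile(InputStream in, String suffix) throws IOException {
        // createTempFile creates an empty file with a unique name in the system temp directory
        Path temp = Files.createTempFile("upload-", suffix);
        try {
            // REPLACE_EXISTING is required because createTempFile already created the file
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails
            Files.deleteIfExists(temp);
            throw e;
        }
        return temp;
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for the uploaded stream in this demo
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        Path temp = copyToTempFile(new ByteArrayInputStream(data), ".pdf");
        try {
            System.out.println(Files.size(temp)); // size in bytes of the copied data
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

In the upload code, the stream would come from `file.getInputStream()`, and `temp.toFile()` yields a `File` that can be passed to `PDDocument.load`.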

Dependencies

<!-- Read PDF -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>fontbox</artifactId>
            <version>2.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>jempbox</artifactId>
            <version>1.8.11</version>
        </dependency>

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>xmpbox</artifactId>
            <version>2.0.0</version>
        </dependency>