tess4J OCR on a PDF fails with "Error: End-of-File, expected line"

1: Cause of the error

The file is downloaded from MinIO as a byte stream, and that stream is passed to PDDocument.load(in), which throws the following exception:

Caused by: java.io.IOException: Error: End-of-File, expected line
	at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119) ~[pdfbox-2.0.8.jar:2.0.8]
	at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2442) ~[pdfbox-2.0.8.jar:2.0.8]
	at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2425) ~[pdfbox-2.0.8.jar:2.0.8]
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:233) ~[pdfbox-2.0.8.jar:2.0.8]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1145) ~[pdfbox-2.0.8.jar:2.0.8]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1042) ~[pdfbox-2.0.8.jar:2.0.8]
	at com.es.canal.SysFileHandler.ocrPdf(SysFileHandler.java:173) ~[classes/:na]
	... 9 common frames omitted

1.1 Environment and dependencies

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.6.7</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>


    <dependencies>

        <!-- minio -->
        <!-- Exclude the okhttp version pulled in by minio and add a newer okhttp dependency explicitly -->
        <dependency>
            <groupId>io.minio</groupId>
            <artifactId>minio</artifactId>
            <version>8.5.2</version>
            <exclusions>
                <exclusion>
                    <groupId>com.squareup.okhttp3</groupId>
                    <artifactId>okhttp</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>okhttp</artifactId>
            <version>4.9.0</version>
        </dependency>


        <!-- PDF document processing -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.8</version>
        </dependency>

        <!-- ocr-->
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>4.1.1</version>
        </dependency>

        <dependency>
            <groupId>top.javatool</groupId>
            <artifactId>canal-spring-boot-starter</artifactId>
            <version>1.2.1-RELEASE</version>
        </dependency>
        <!-- elasticsearch -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
        <!-- IOUtils -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.8.0</version>
        </dependency>

        <!-- JSON parser and generator -->
        <dependency>
            <groupId>com.alibaba.fastjson2</groupId>
            <artifactId>fastjson2</artifactId>
            <version>2.0.7</version>
        </dependency>

        <!--web-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>

1.2 Example of the failing code

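public void a(){
    // Download the file from MinIO
    GetObjectResponse in = minioClient.getObject(
            GetObjectArgs.builder().bucket(sysFile.getBucketName()).object(sysFile.getObjectName()).build());
    String base64;
    // Problem: this reads the stream all the way to its end
    byte[] bytes = IOUtils.toByteArray(in);

    // Parse the PDF for image rendering
    // Fails here: the stream is already exhausted, so no PDF header can be read
    PDDocument pdf = PDDocument.load(in);
    PDFRenderer renderer = new PDFRenderer(pdf);
}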

2: Fix

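Move IOUtils.toByteArray(in) below PDDocument.load(in), so that nothing reads from the stream before it is passed to PDDocument.load(in). IOUtils.toByteArray(in) reads the stream to its end, so a PDDocument.load(in) that runs afterwards finds no data left and fails while parsing the PDF header.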

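public void a(){
    // Download the file from MinIO
    GetObjectResponse in = minioClient.getObject(
            GetObjectArgs.builder().bucket(sysFile.getBucketName()).object(sysFile.getObjectName()).build());

    String base64;

    // Parse the PDF for image rendering; the stream is still untouched at this point
    PDDocument pdf = PDDocument.load(in);
    PDFRenderer renderer = new PDFRenderer(pdf);
    // Caution: load(in) has already read the stream, so this call most likely returns an empty array
    byte[] bytes = IOUtils.toByteArray(in);
}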

Other notes

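The underlying cause is most likely not an unclosed stream but an exhausted one: IOUtils.toByteArray(in) reads the stream to its end, and a one-pass InputStream such as GetObjectResponse cannot be rewound, so whatever reads it next receives no bytes.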
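
If both the raw bytes (for example for Base64 encoding) and the parsed PDF are needed, one option is to read the stream once into a byte[] and give each consumer its own view of those bytes. The sketch below uses only the JDK so that it runs on its own. The class name StreamReuseDemo and the helper readFully are made up for illustration, and a ByteArrayInputStream stands in for the MinIO response. In the real code the buffered bytes could be handed to PDFBox via PDDocument.load(byte[]), which PDFBox 2.0.x provides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original project.
class StreamReuseDemo {

    // Reads everything that is left in the stream (Java 8 compatible).
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the MinIO download (assumption: any one-pass InputStream behaves the same way).
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII));

        byte[] bytes = readFully(in);   // the first read drains the stream
        byte[] again = readFully(in);   // a second read finds nothing left

        System.out.println("first read:  " + bytes.length + " bytes");
        System.out.println("second read: " + again.length + " bytes");

        // Each consumer gets a fresh stream over the buffered bytes.
        InputStream fresh = new ByteArrayInputStream(bytes);
        System.out.println("fresh copy:  " + readFully(fresh).length + " bytes");

        // In the real code, with PDFBox on the classpath:
        // PDDocument pdf = PDDocument.load(bytes);
    }
}
```

Buffering the whole file trades memory for simplicity, which is usually acceptable for documents of moderate size. For very large PDFs, writing the download to a temporary file and loading from that file may be a better fit.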
