Converting Image-Based PDFs to Text Documents

Latest recommended article published on 2022-06-14 15:53:13

Author: 「已注销」 (account deactivated)


Category: Common Tools & Features

Article link: https://blog.csdn.net/qq_40918961/article/details/111225538


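Basic idea

A text-extraction tool cannot convert a scanned PDF directly to text, because every page of a scanned PDF is an image. Instead, read the whole PDF, render it page by page into images, and then run OCR on each image to recognize the text.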
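The flow described above can be sketched as follows. This is only an illustration: the class name `ScannedPdfPipeline`, the method `recognizeAll`, and the `ocr` function parameter are hypothetical names that do not come from the original article, and no particular OCR library is assumed. The page images are expected to have been rendered beforehand by a PDF library.

```java
import java.awt.image.BufferedImage;
import java.util.List;
import java.util.function.Function;

/** Sketch of the "render pages, then OCR each page" flow (hypothetical names). */
class ScannedPdfPipeline {

    /**
     * Runs the given OCR function over every page image, in order,
     * and joins the per-page results with a blank line between pages.
     *
     * @param pageImages one image per PDF page, already rendered
     * @param ocr        hypothetical OCR step: image in, recognized text out
     * @return the recognized text of all pages
     */
    static String recognizeAll(List<BufferedImage> pageImages,
                               Function<BufferedImage, String> ocr) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < pageImages.size(); i++) {
            if (i > 0) {
                text.append("\n\n"); // separate pages with a blank line
            }
            text.append(ocr.apply(pageImages.get(i)));
        }
        return text.toString();
    }
}
```

Any OCR engine can be plugged in by passing a function from `BufferedImage` to `String`, which also makes the flow easy to try out with a stub before a real engine is wired in.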

The third-party PDF library PDFBox

Dependency:

        <!-- PDFBox: PDF parsing -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.1</version>
        </dependency>

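This library supports basic read and write operations on PDF files.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import java.awt.image.BufferedImage;
import java.io.File;

File file = new File(PdfFilePath);
// Load the PDF; the document can then be read and modified
PDDocument pdDocument = PDDocument.load(file);
int pages = pdDocument.getNumberOfPages(); // number of pages in the PDF
// PDFRenderer renders pages so they can be read as images
PDFRenderer renderer = new PDFRenderer(pdDocument);
// i is the zero-based page index; dpi is the rendering resolution (96 is a common screen value)
BufferedImage image = renderer.renderImageWithDPI(i, dpi);
pdDocument.close(); // release the document once rendering is done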
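To get a feel for what the `dpi` argument means: PDF page sizes are measured in points, with 72 points per inch, so a page rendered at a given DPI is scaled by `dpi / 72`. The helper below is a small sketch of that arithmetic. `RenderSize` and `pixels` are hypothetical names, the page size of 595 x 842 points is only an approximation of A4, and the rounding (floor) is an assumption that may differ slightly from what PDFBox does internally.

```java
/** Sketch: estimate the pixel size of a rendered PDF page (hypothetical helper). */
class RenderSize {

    /** PDF user space uses 72 points per inch. */
    static final float POINTS_PER_INCH = 72f;

    /**
     * Converts a length in PDF points to pixels at the given DPI.
     * Rounding down is an assumption, not a documented guarantee.
     */
    static int pixels(float points, float dpi) {
        return (int) Math.max(Math.floor(points * dpi / POINTS_PER_INCH), 1);
    }

    public static void main(String[] args) {
        // Roughly A4: 595 x 842 points
        System.out.println(pixels(595f, 96f) + " x " + pixels(842f, 96f));   // 793 x 1122
        System.out.println(pixels(595f, 300f) + " x " + pixels(842f, 300f)); // 2479 x 3508
    }
}
```

Going from 96 to 300 DPI roughly triples each dimension, so the image holds about ten times as many pixels. That is why higher DPI values give OCR more detail to work with but make rendering slower and more memory-hungry.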

Reading text from a PDF

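    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    /**
     * Reads the text of a PDF and writes it to the given target file.
     * @param pdfPath absolute path of the PDF file
     * @param targetPath absolute path of the target file
     */
    public static void readTextFromPdf(String pdfPath, String targetPath) {
        // Check that the PDF file exists
        File pdfFile = new File(pdfPath);
        if (!pdfFile.exists()) {
            System.out.println("PDF file not found: " + pdfPath);
            return;
        }

        File targetFile = new File(targetPath);

        // try-with-resources closes the document even if extraction fails
        try (PDDocument pdDocument = PDDocument.load(pdfFile)) {
            // Number of pages in the document
            int pages = pdDocument.getNumberOfPages();
            // Configure the stripper and extract the text of all pages
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);
            stripper.setStartPage(1);
            stripper.setEndPage(pages);
            String content = stripper.getText(pdDocument);

            // Report whether the target file will be overwritten or created
            if (targetFile.exists()) {
                System.out.println("Target file already exists and will be overwritten");
            } else {
                System.out.println("Target file does not exist and will be created");
            }

            // Write the text; closing the writer flushes it to disk
            try (FileWriter fileWriter = new FileWriter(targetFile)) {
                fileWriter.write(content);
            }
            // Only report success when nothing above threw
            System.out.println("File written successfully");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }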
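When `readTextFromPdf` is run on a scanned PDF, the extracted text is typically empty or nearly so, because the pages contain only images. One possible way to decide whether to fall back to OCR is to count the non-whitespace characters in the extracted text. The sketch below is a heuristic and not part of the original article; `ExtractedTextCheck`, `looksImageOnly`, and the `minCharsPerPage` threshold are hypothetical.

```java
/** Sketch: guess whether a PDF is image-only from its extracted text (hypothetical helper). */
class ExtractedTextCheck {

    /**
     * Returns true when the extracted text holds fewer non-whitespace
     * characters than pageCount * minCharsPerPage, which suggests that
     * the pages are scanned images and need OCR.
     *
     * @param extractedText   text returned by the text extractor, may be null
     * @param pageCount       number of pages in the PDF
     * @param minCharsPerPage assumed minimum for a page with real text
     */
    static boolean looksImageOnly(String extractedText, int pageCount, int minCharsPerPage) {
        if (extractedText == null) {
            return true;
        }
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < (long) pageCount * minCharsPerPage;
    }
}
```

The threshold has to be tuned per document type: a PDF that mixes text pages with scanned pages can still pass this check while some of its pages yield no text.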

Converting a PDF to images

    /**
     * Converts a PDF file to PNG/JPEG images.
     * @param PdfFilePath full path of the PDF file
     * @param dstImgFolder folder in which the images are stored
     * @param dpi higher values give sharper images but slower conversion; screens typically default to 96 dpi
     */
    public static void pdf2Image(String PdfFilePath,
                                     String dstImgFolder,
                                     int dpi) {
        File file = new File(PdfFilePath);
        PDDocument pdDocument;
        try {
            // Get the PDF file name and its parent directory
            String imgPDFPath = file.getParent();
            int dot = file.getName().lastIndexOf('.');
            // Derive the image file name
            String imagePDFName = file.getName().substring(0
