Finding Keyword Coordinates in a PDF Template

劉叁尐

Tags: xpdf java

Article link: https://blog.csdn.net/qq_43021813/article/details/122367388

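During development you often get tasks that fill values into a template. The usual approach is to put placeholders in the template and write text at those spots. Another approach is to find a keyword, get its position in the document, and write the value at an offset from those coordinates. Today I'll cover the second approach.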

As always, we start with the interface:

    /**
     * Find keyword coordinates in a PDF template.
     */
    @GetMapping(value = "/testNewPdf")
    @ResponseBody
    void testNewPdf();

Whatever else changes, the interface stays the same. The implementation just delegates:

    @Override
    public void testNewPdf() {
        testCreatePDF();
    }

One more level of delegation:

    /**
     * Entry point for the PDF tests.
     *
     * @Description: Purpose: PDF-related tests
     * @Author: LXT
     * @Date: 2022/1/7 14:11
     */
    public void testCreatePDF() {
        testQueryKeywordLastAddText();
    }

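Now the real logic begins. Here is the code first:

    /**
     * Finds a keyword in the PDF, then adds text after it.
     *
     * @Description: Purpose: find a keyword in a PDF and add text after it
     * @Author: LXT
     * @Date: 2022/1/7 14:12
     */
    private void testQueryKeywordLastAddText() {
        // Path of the PDF template (it could also come from somewhere else)
        String initPdf = "D:\\Users\\User\\Desktop\\XXXX.pdf";
        logger.info("PDF path after initialization: " + initPdf);
        // The keyword to search for ("日期：" means "Date:")
        String keyword = "日期：";
        // Call the utility class to look up the keyword's position
        float[] floats = QueryKeywordPositionUtil.queryKeywordPosition(initPdf, keyword);
        logger.info("Keyword position--" + JSON.toJSONString(floats));
        // queryKeywordPosition returns null when the keyword is not found on page 1
        if (floats == null) {
            logger.warn("Keyword not found, nothing to add");
            return;
        }
        // The text to add
        String text = "2021 年 01 月 07 日";
        // Call the utility class to write the text
        String endPath = KeywordLastAddTextUtil.keywordLastAddText(initPdf, floats, text);
        logger.info("Path of the PDF with text added after the keyword--" + endPath);
        logger.info("Finished adding text after the keyword");
    }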

The code is short and fairly easy to follow.
Now for the important part: the first utility class, which finds the keyword.

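// Imports assumed for iText 5 (com.itextpdf:itextpdf), SLF4J and fastjson;
// the original post does not show them.
import com.alibaba.fastjson.JSON;
import com.itextpdf.awt.geom.Rectangle2D; // java.awt.geom.Rectangle2D in older iText 5 releases
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/**
 * Utility class that finds the coordinates of a keyword.
 *
 * @ClassName: QueryKeywordPositionUtil
 * @Author: LXT
 * @Date: 2022/1/6 16:11
 */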
public class QueryKeywordPositionUtil {

    private static final Logger logger = LoggerFactory.getLogger(QueryKeywordPositionUtil.class);

    /**
     * Finds the coordinates of a keyword.
     *
     * @Author: LXT
     * @Date: 2022/1/7 14:40
     * @param pdfPath path of the PDF
     * @param keyword keyword to look for
     * @return float[]: [0] page number, [1] x, [2] y; null if the keyword is not on page 1
     */
    public static float[] queryKeywordPosition(String pdfPath, String keyword) {
        try {
            // 1. Read the whole PDF into a byte array. Files.readAllBytes (java.nio.file.Files)
            //    reads the complete file; a single InputStream.read(byte[]) call may return early.
            byte[] pdfData = Files.readAllBytes(new File(pdfPath).toPath());
            logger.info("Finished reading PDF---" + pdfPath);
            // 2. Find every occurrence of the keyword.
            //    Each list element is one match: [0] page number, [1] x, [2] y.
            List<float[]> positions = findKeywordPostions(pdfData, keyword);
            logger.info("Keyword " + keyword + " positions---" + JSON.toJSONString(positions));
            System.out.println("total:" + positions.size());
            for (float[] position : positions) {
                System.out.print("pageNum: " + (int) position[0]);
                System.out.print("\tx: " + position[1]);
                System.out.println("\ty: " + position[2]);
                // Only the first page matters here, so return the first match on page 1
                if ((int) position[0] == 1) {
                    logger.info("Keyword " + keyword + " is on the first page " + JSON.toJSONString(position));
                    return position;
                }
            }
        } catch (Exception e) {
            logger.error("查找关键字位置坐标异常", e);
        }
        return null;
    }


    /**
     * Finds the positions of all occurrences of a keyword.
     *
     * @param pdfData the PDF file read into a byte array
     * @param keyword keyword to look for
     * @return List<float[]>: float[0] pageNum, float[1] x, float[2] y
     * @throws IOException if the PDF cannot be parsed
     */
    private static List<float[]> findKeywordPostions(byte[] pdfData, String keyword) throws IOException {
        List<float[]> result = new ArrayList<>();
        List<PdfPageContentPositions> pdfPageContentPositions = getPdfContentPostionsList(pdfData);
        for (PdfPageContentPositions pdfPageContentPosition : pdfPageContentPositions) {
            List<float[]> charPositions = findPositions(keyword, pdfPageContentPosition);
            if (charPositions == null || charPositions.size() < 1) {
                continue;
            }
            result.addAll(charPositions);
        }
        return result;
    }

    /**
     * Collects the text content and per-character positions of every page.
     *
     * @param pdfData the PDF file as a byte array
     * @return one PdfPageContentPositions per page
     * @Author: LXT
     * @Date: 2022/1/7 10:37
     */
    private static List<PdfPageContentPositions> getPdfContentPostionsList(byte[] pdfData) throws IOException {
        PdfReader reader = new PdfReader(pdfData);
        List<PdfPageContentPositions> result = new ArrayList<>();
        try {
            int pages = reader.getNumberOfPages();
            for (int pageNum = 1; pageNum <= pages; pageNum++) {
                float width = reader.getPageSize(pageNum).getWidth();
                float height = reader.getPageSize(pageNum).getHeight();
                PdfRenderListener pdfRenderListener = new PdfRenderListener(pageNum, width, height);
                // Parse the page's content stream; the listener records each character and its position
                PdfContentStreamProcessor processor = new PdfContentStreamProcessor(pdfRenderListener);
                PdfDictionary pageDic = reader.getPageN(pageNum);
                PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
                processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNum), resourcesDic);
                String content = pdfRenderListener.getContent();
                List<CharPosition> charPositions = pdfRenderListener.getcharPositions();
                List<float[]> positionsList = new ArrayList<>();
                for (CharPosition charPosition : charPositions) {
                    float[] positions = new float[]{charPosition.getPageNum(), charPosition.getX(), charPosition.getY()};
                    positionsList.add(positions);
                }
                PdfPageContentPositions pdfPageContentPositions = new PdfPageContentPositions();
                pdfPageContentPositions.setContent(content);
                pdfPageContentPositions.setPostions(positionsList);
                result.add(pdfPageContentPositions);
            }
        } finally {
            // Close the reader whether or not parsing succeeded
            reader.close();
        }
        return result;
    }


    private static List<float[]> findPositions(String keyword, PdfPageContentPositions pdfPageContentPositions) {
        List<float[]> result = new ArrayList<>();
        String content = pdfPageContentPositions.getContent();
        List<float[]> charPositions = pdfPageContentPositions.getPositions();
        for (int pos = 0; pos < content.length(); ) {
            int positionIndex = content.indexOf(keyword, pos);
            if (positionIndex == -1) {
                break;
            }
            float[] postions = charPositions.get(positionIndex);
            result.add(postions);
            pos = positionIndex + 1;
        }
        return result;
    }


    private static class PdfPageContentPositions {

        private String content;

        private List<float[]> positions;

        public String getContent() {
            return content;
        }

        public void setContent(String content) {
            this.content = content;
        }

        public List<float[]> getPositions() {
            return positions;
        }

        public void setPostions(List<float[]> positions) {
            this.positions = positions;
        }
    }


    private static class PdfRenderListener implements RenderListener {

        private int pageNum;

        private float pageWidth;

        private float pageHeight;

        private StringBuilder contentBuilder = new StringBuilder();

        private List<CharPosition> charPositions = new ArrayList<>();

        public PdfRenderListener(int pageNum, float pageWidth, float pageHeight) {
            this.pageNum = pageNum;
            this.pageWidth = pageWidth;
            this.pageHeight = pageHeight;
        }

        public void beginTextBlock() {
        }

        public void renderText(TextRenderInfo renderInfo) {
            List<TextRenderInfo> characterRenderInfos = renderInfo.getCharacterRenderInfos();
            for (TextRenderInfo textRenderInfo : characterRenderInfos) {
                String word = textRenderInfo.getText();
                if (word == null || word.isEmpty()) {
                    // Skip glyphs without text so that content and charPositions stay index-aligned
                    continue;
                }
                if (word.length() > 1) {
                    word = word.substring(word.length() - 1);
                }
                Rectangle2D.Float rectangle = textRenderInfo.getAscentLine().getBoundingRectange();
                float x = (float) rectangle.getX();
                float y = (float) rectangle.getY();
                // Position as a fraction of the page width and height (y measured from the top);
                // computed here but not used below
                float xPercent = Math.round(x / pageWidth * 10000) / 10000f;
                float yPercent = Math.round((1 - y / pageHeight) * 10000) / 10000f;
                charPositions.add(new CharPosition(pageNum, x, y));
                contentBuilder.append(word);
            }
        }

        public void endTextBlock() {
        }

        public void renderImage(ImageRenderInfo renderInfo) {
        }

        public String getContent() {
            return contentBuilder.toString();
        }

        public List<CharPosition> getcharPositions() {
            return charPositions;
        }
    }

    private static class CharPosition {

        private int pageNum = 0;

        private float x = 0;

        private float y = 0;

        public CharPosition(int pageNum, float x, float y) {
            this.pageNum = pageNum;
            this.x = x;
            this.y = y;
        }

        public int getPageNum() {
            return pageNum;
        }

        public float getX() {
            return x;
        }

        public float getY() {
            return y;
        }

        @Override
        public String toString() {
            return "[pageNum=" + this.pageNum + ",x=" + this.x + ",y=" + this.y + "]";
        }
    }

}

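The entry point is queryKeywordPosition, which takes the PDF path and the keyword.
It reads the PDF and collects the position of every occurrence of the keyword in the document; a keyword that appears several times yields several positions.

I only need to work on the first page, so the method returns the first match found on page 1.
As for how the lookup works: the render listener records each character on a page together with its coordinates, the characters are joined into one string, and the index of each keyword match in that string is used to look up the coordinates of the keyword's first character.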
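
The matching step in findPositions can be tried out without iText. The following is only a minimal sketch under a few assumptions: the class name KeywordMatchSketch, the method findAll, and the sample coordinates are made up for illustration, and it assumes that the characters arrive in reading order so that index i of the string lines up with entry i of the position list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of QueryKeywordPositionUtil.
final class KeywordMatchSketch {

    /**
     * Returns the position of the first character of every occurrence of keyword.
     * positions.get(i) must describe content.charAt(i) as {page, x, y}.
     */
    static List<float[]> findAll(String content, List<float[]> positions, String keyword) {
        List<float[]> hits = new ArrayList<>();
        if (keyword.isEmpty()) {
            return hits; // an empty keyword would match at every index
        }
        int from = 0;
        while (true) {
            int index = content.indexOf(keyword, from);
            if (index == -1) {
                break;
            }
            hits.add(positions.get(index));
            from = index + 1; // continue just past the start of this match
        }
        return hits;
    }

    public static void main(String[] args) {
        String content = "Name: Bob Date: x Date:";
        List<float[]> positions = new ArrayList<>();
        for (int i = 0; i < content.length(); i++) {
            // Made-up coordinates: every character is 10 units wide, all on page 1
            positions.add(new float[]{1f, i * 10f, 700f});
        }
        for (float[] hit : findAll(content, positions, "Date:")) {
            System.out.println("page=" + (int) hit[0] + " x=" + hit[1] + " y=" + hit[2]);
        }
    }
}
```

Because only the index of the first character is looked up, the returned coordinates mark where the keyword starts, not where it ends. That is why the text is later written at an offset.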

Back in testQueryKeywordLastAddText, that single position is returned,
and the text is then added after the keyword.
For that, a second utility class is needed:

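// Imports assumed for iText 5 and SLF4J; the original post does not show them.
// ClientUtil is a project helper that is not shown in the post either.
import com.itextpdf.text.Element;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.channels.FileChannel;

/**
 * Utility class that adds text after a keyword.
 *
 * @ClassName: KeywordLastAddTextUtil
 * @Author: LXT
 * @Date: 2022/1/7 14:23
 */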
public class KeywordLastAddTextUtil {

    private static final Logger logger = LoggerFactory.getLogger(KeywordLastAddTextUtil.class);

    /**
     * Adds text after a keyword.
     *
     * @param templatePath path of the PDF file
     * @param floatNum     keyword position: [0] page number, [1] x, [2] y
     * @param text         text to add
     * @return String path of the new PDF, or null on failure
     * @Author: LXT
     * @Date: 2022/1/7 14:47
     */
    public static String keywordLastAddText(String templatePath, float[] floatNum, String text) {
        logger.info("pdf--" + templatePath + "--adding text---" + text);
        try {
            // Output path: XXXX.pdf -> XXXXaddText.pdf
            String newFileName = templatePath.replace(".pdf","addText.pdf");
            byte[] pdfBty = ClientUtil.readFileToByteArray(new File(templatePath)); // PDF as a byte array
            ClientUtil.writeByteArrayToFile(new File(newFileName), pdfBty);
            logger.info("Output PDF path--" + newFileName);
            // Copy the bytes back to the template path (the content is unchanged)
            fileChannelCopy(new File(newFileName), new File(templatePath));
            PdfReader reader = new PdfReader(templatePath);
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(newFileName));
            PdfContentByte overContent = stamper.getOverContent(1);
            // Add the text. STSong-Light / UniGB-UCS2-H needs the iText Asian font pack on the classpath
            BaseFont font = BaseFont.createFont("STSong-Light", "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
            overContent.beginText();
            overContent.setFontAndSize(font, 10);
            overContent.setTextMatrix(200, 200);
            // x grows from left to right
            // y grows from bottom to top
            // Example keyword position: x: 80.04 	y: 614.26263
            // With font size 10, shift right by three characters: x + 30
            // With font size 10, shift down by one character: y - 10
            // floatNum[1] is x, floatNum[2] is y
            float xx = floatNum[1] + 30f;
            float yy = floatNum[2] - 10f;
            overContent.showTextAligned(Element.ALIGN_LEFT, text, xx, yy, 0);
            overContent.endText();
            stamper.close();
            reader.close();
            return newFileName;
        } catch (Exception e) {
            logger.error("Failed to add text after the keyword", e);
            return null;
        }
    }

    /**
     * Copies a file through file channels.
     *
     * @param sources source file
     * @param dest    destination file
     * @Author: LXT
     * @Date: 2022/1/7 14:41
     */
    private static void fileChannelCopy(File sources, File dest) {
        // try-with-resources closes both streams (and their channels) even if the copy fails
        try (FileInputStream inputStream = new FileInputStream(sources);
             FileOutputStream outputStream = new FileOutputStream(dest)) {
            FileChannel fileChannelin = inputStream.getChannel();   // channel of the source file
            FileChannel fileChannelout = outputStream.getChannel(); // channel of the destination file
            // Read from the in channel and write into the out channel
            fileChannelin.transferTo(0, fileChannelin.size(), fileChannelout);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


}

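This one is also easy to follow.

The inputs are the PDF path, the keyword coordinates, and the text to add.
The PDF is copied first, and the result is written to a new file, so that the source file is not damaged.
Then the font settings are applied. Other content, such as files or images, can be placed at the same spot too, but I won't go through those here.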
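
The copy-first idea can be sketched with the JDK alone. This is not the code from the utility class above, just one possible way to do it under stated assumptions: the names TemplateCopySketch, outputPathFor, and copyTemplate are hypothetical, and the "addText" suffix simply mirrors the file name used in the post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

// Hypothetical helper for illustration; not part of KeywordLastAddTextUtil.
final class TemplateCopySketch {

    /** Derives the output path next to the template, e.g. "XXXX.pdf" -> "XXXXaddText.pdf". */
    static Path outputPathFor(Path template, String suffix) {
        String name = template.getFileName().toString();
        boolean hasPdfExtension = name.toLowerCase(Locale.ROOT).endsWith(".pdf");
        // Strip only the trailing extension, so a ".pdf" elsewhere in the path is left alone
        String base = hasPdfExtension ? name.substring(0, name.length() - 4) : name;
        return template.resolveSibling(base + suffix + ".pdf");
    }

    /** Copies the template to the derived output path and returns that path. */
    static Path copyTemplate(Path template, String suffix) throws IOException {
        Path target = outputPathFor(template, suffix);
        return Files.copy(template, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Uses a throwaway file in a temp directory instead of a real template
        Path dir = Files.createTempDirectory("template-copy");
        Path template = dir.resolve("XXXX.pdf");
        Files.write(template, new byte[]{1, 2, 3});
        System.out.println(copyTemplate(template, "addText"));
    }
}
```

Unlike String.replace on the whole path, this touches only the file name's extension, which avoids surprises when a directory name happens to contain ".pdf".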

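PdfContentByte overContent = stamper.getOverContent(1);
This selects the page on which the text is drawn (here page 1).

The x and y comments in the code should speak for themselves; send me a private message if anything is unclear.

The last argument of showTextAligned, rotation, is the rotation angle in degrees.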
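
The offset arithmetic from the code comments can be written as a small helper. This is a sketch only: TextOffsetSketch and offsetAfterKeyword are hypothetical names, and the rule "one character is about one font size wide" is the same rough assumption the comments make for full-width characters at font size 10.

```java
// Hypothetical helper for illustration; the names do not come from the original classes.
final class TextOffsetSketch {

    /**
     * @param keywordPos  {page, x, y} of the keyword's first character
     * @param fontSize    font size; each full-width character is assumed to be about this wide
     * @param charsToSkip number of characters to move to the right (the keyword's length)
     * @return {x, y} at which the new text should start
     */
    static float[] offsetAfterKeyword(float[] keywordPos, float fontSize, int charsToSkip) {
        float x = keywordPos[1] + charsToSkip * fontSize; // x grows from left to right
        float y = keywordPos[2] - fontSize;               // y grows from bottom to top, so subtracting moves down
        return new float[]{x, y};
    }

    public static void main(String[] args) {
        // Sample position taken from the comment in keywordLastAddText
        float[] target = offsetAfterKeyword(new float[]{1f, 80.04f, 614.26263f}, 10f, 3);
        System.out.println("x=" + target[0] + " y=" + target[1]);
    }
}
```

The downward shift is presumably needed because the recorded y comes from the character's ascent line (its top edge), while the new text is placed by its baseline. For keywords of other lengths or other font sizes, the two offsets have to be adjusted accordingly.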

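Once the text has been written to the new PDF, the process is complete and the path of the new PDF is returned.

That's it: given a PDF, find the coordinates of a keyword and add text after it. Anything beyond that is left for you to explore.

Here is a screenshot of the PDF template:

[Screenshot: first page of the PDF template]
This is the first page. Text has to be added after the date label, and because its position is not fixed, the only option is to search for the keyword.
Below is the final result:

[Screenshot: the PDF after the text was added]

That's the end of the article. Writing these up takes effort, so please remember to like it.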
