Converting a PDF from Traditional to Simplified Chinese While Keeping the Text Copyable

jay4195

Published 2024-05-17 08:58:09

Article link: https://blog.csdn.net/jay4195/article/details/138991462

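For work I need to read documents written in Traditional Chinese, such as Hong Kong IPO prospectuses. Reading large amounts of Traditional Chinese text is slow for me, and copying text out of these documents is inconvenient. Converting Traditional to Simplified Chinese usually means going through a Word document, but PDF-to-Word conversion is slow and is typically a paid feature. So I decided to do the conversion in Java instead.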

1. Maven dependencies

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.24</version>
        </dependency>

        <!-- Simplified/Traditional Chinese conversion -->
        <!-- https://mvnrepository.com/artifact/com.github.houbb/opencc4j -->
        <dependency>
            <groupId>com.github.houbb</groupId>
            <artifactId>opencc4j</artifactId>
            <version>1.8.1</version>
        </dependency>

PDFBox reads the PDF document, and opencc4j performs the Traditional-to-Simplified conversion.
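
To make the conversion step concrete, here is a toy sketch of table-based character substitution. The class `ToySimplifier` and its four-entry table are invented for this illustration only. It is a stand-in, not how opencc4j is implemented: opencc4j ships full dictionaries, and the code in this article calls it through `ZhConverterUtil.toSimple`.

```java
import java.util.Map;

final class ToySimplifier {
    // Hypothetical four-entry table; a real converter ships a full dictionary.
    private static final Map<Character, Character> TABLE = Map.of(
            '書', '书',
            '國', '国',
            '說', '说',
            '債', '债');

    /** Replaces every character found in TABLE and leaves all others unchanged. */
    static String toSimple(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }
}
```

With this table, `ToySimplifier.toSimple("招股說明書")` returns `"招股说明书"`; characters that are identical in both scripts pass through untouched.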

2. Core code

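The code is written for JDK 21. The `instanceof` pattern matching it uses requires Java 16 or later, so rewrite those parts if you target an older Java version.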

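    // Imports used by the method below. The `log` field is assumed to come from
    // a logging annotation (for example Lombok's @Slf4j) on the enclosing class.
    import com.github.houbb.opencc4j.util.ZhConverterUtil;
    import lombok.SneakyThrows;
    import org.apache.pdfbox.cos.COSArray;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSString;
    import org.apache.pdfbox.pdfparser.PDFStreamParser;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.pdmodel.font.PDFont;
    import org.apache.pdfbox.pdmodel.font.PDType0Font;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    @Override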
    @SneakyThrows
    public void simplify(InputStream is, OutputStream os) {
        try (PDDocument pd = PDDocument.load(is)) {
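            // Name and token index of the font selected by the most recent font operand
            COSName fontName = null;
            int fontId = 0;
            // Simplified Chinese font for the rewritten text; false embeds the full font instead of a subset
            PDFont targetFont = PDType0Font.load(pd, new FileInputStream("C:\\Windows\\Fonts\\STSONG.TTF"), false);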
            for (PDPage page : pd.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                COSName targetCosName = page.getResources().add(targetFont);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                Map<COSName, PDFont> fontMap = new HashMap<>();
                for (COSName name : page.getResources().getFontNames()) {
                    PDFont font = page.getResources().getFont(name);
                    fontMap.put(name, font);
                }
                for (int j = 0; j < tokens.size(); j++) {
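                    // Current token from the page's content stream
                    Object next = tokens.get(j);
                    if (next instanceof COSName nextFont) {
                        // Content streams also contain non-font names (graphics states, XObjects),
                        // so only names that resolve to a font resource are tracked
                        PDFont resolved = page.getResources().getFont(nextFont);
                        if (resolved != null) {
                            fontId = j;
                            fontName = nextFont;
                            fontMap.put(fontName, resolved);
                        }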
                    } else if (next instanceof COSString previous) {
                        PDFont sourceFont = fontMap.get(fontName);
                        if (sourceFont == null) {
                            continue;
                        }
                        try (InputStream in = new ByteArrayInputStream(previous.getBytes())) {
                            // Decode the string one character code at a time with the current font
                            StringBuilder sb = new StringBuilder();
                            while (in.available() > 0) {
                                String unicode = sourceFont.toUnicode(sourceFont.readCode(in));
                                if (unicode != null) { // codes without a Unicode mapping are dropped
                                    sb.append(unicode);
                                }
                            }
                            String simplified = ZhConverterUtil.toSimple(sb.toString());
                            try {
                                // Re-encode with the target font and point the font operand at it
                                previous.setValue(targetFont.encode(simplified));
                                tokens.set(fontId, targetCosName);
                            } catch (Exception e) {
                                log.error("Failed to re-encode text", e);
                            }
                        }
                    } else if (next instanceof COSArray previous) {
                        // A text array mixes strings with kerning numbers.
                        // Concatenate the bytes of every string element.
                        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                        for (int k = 0; k < previous.size(); k++) {
                            if (previous.getObject(k) instanceof COSString cosString) {
                                buffer.writeBytes(cosString.getBytes());
                            }
                        }
                        byte[] pstring = buffer.toByteArray();
                        PDFont sourceFont = fontMap.get(fontName);
                        // Leave arrays without text (such as dash patterns) and text with an unknown font untouched
                        if (pstring.length == 0 || sourceFont == null) {
                            continue;
                        }
                        try (InputStream in = new ByteArrayInputStream(pstring)) {
                            StringBuilder sb = new StringBuilder();
                            while (in.available() > 0) {
                                String unicode = sourceFont.toUnicode(sourceFont.readCode(in));
                                if (unicode != null) { // codes without a Unicode mapping are dropped
                                    sb.append(unicode);
                                }
                            }
                            String simplified = ZhConverterUtil.toSimple(sb.toString());
                            try {
                                // Encode first so the array is only modified when encoding succeeds;
                                // the kerning numbers are discarded along with the old strings
                                byte[] encoded = targetFont.encode(simplified);
                                previous.clear();
                                previous.add(new COSString(encoded));
                                tokens.set(fontId, targetCosName);
                            } catch (Exception e) {
                                log.error("Failed to re-encode text array", e);
                            }
                        }
                    }
                }
                // Write the modified tokens back as the page's new content stream
                PDStream updatedStream = new PDStream(pd);
                try (OutputStream out = updatedStream.createOutputStream()) {
                    new ContentStreamWriter(out).writeTokens(tokens);
                }
                page.setContents(updatedStream);
            }
            pd.save(os);
        }
    }
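
As a usage illustration, the sketch below wraps a `simplify(InputStream, OutputStream)` method so it can be called with file paths. The interface `PdfSimplifier`, the class `SimplifyFiles`, and the `-simplified.pdf` naming scheme are assumptions made for this example; the article does not show the interface that the method above overrides.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical interface with the same signature as the simplify method above. */
interface PdfSimplifier {
    void simplify(InputStream is, OutputStream os);
}

final class SimplifyFiles {
    /** Derives "report-simplified.pdf" from "report.pdf", next to the source file. */
    static Path outputPathFor(Path source) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        return source.resolveSibling(stem + "-simplified.pdf");
    }

    /** Opens both files, runs the converter, and closes the streams afterwards. */
    static Path convert(PdfSimplifier simplifier, Path source) throws IOException {
        Path target = outputPathFor(source);
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            simplifier.simplify(in, out);
        }
        return target;
    }
}
```

Writing to a sibling file instead of overwriting the source keeps the original document available if some text fails to re-encode.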

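The code above is adapted from snippets circulating online. After some debugging of my own, it solves two problems. First, text set in a Traditional Chinese font is mapped onto a Simplified Chinese font. Second, the font has to be loaded with the right arguments: the third argument of `PDType0Font.load` is `embedSubset`, and passing `false` embeds the complete font. Otherwise the font is not embedded in the document, and text copied out of a PDF reader comes out garbled.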

PDFont targetFont = PDType0Font.load(pd, new FileInputStream("C:\\Windows\\Fonts\\STSONG.TTF"), false);

Also note that PDFBox could only load fonts with the .ttf extension here, so make sure the font file you pick is a .ttf file.
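
Because the font path in the code is hard-coded to a Windows location, a small check before loading can give a clearer error than a failure deep inside the conversion. The following is a minimal sketch using only the JDK; the class `FontLocator` and its method names are hypothetical and not part of PDFBox.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class FontLocator {
    /** True when the path is an existing regular file whose name ends in ".ttf" (any case). */
    static boolean isUsableTtf(Path path) {
        Path fileName = path.getFileName();
        String name = fileName == null ? "" : fileName.toString();
        return name.toLowerCase(Locale.ROOT).endsWith(".ttf") && Files.isRegularFile(path);
    }

    /** Returns the first candidate that passes isUsableTtf, if there is one. */
    static Optional<Path> firstUsable(List<Path> candidates) {
        return candidates.stream().filter(FontLocator::isUsableTtf).findFirst();
    }
}
```

The candidate list could start with `C:\Windows\Fonts\STSONG.TTF` and fall back to a font file shipped with the application. The resulting path can then be opened with `Files.newInputStream` and passed to `PDType0Font.load` as in the line above.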

[Screenshots: the document before conversion and after conversion]
