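For work I need to read documents written in Traditional Chinese, such as Hong Kong IPO prospectuses. Reading large amounts of Traditional Chinese is slow for me, and copying text out of these files is inconvenient. Converting Traditional to Simplified Chinese usually means going through a Word document, but PDF-to-Word conversion takes a long time and is generally a paid feature. So I decided to do the conversion with Java code instead.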
1. Maven dependencies
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.24</version>
</dependency>
<!-- Simplified/Traditional Chinese conversion -->
<!-- https://mvnrepository.com/artifact/com.github.houbb/opencc4j -->
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>opencc4j</artifactId>
    <version>1.8.1</version>
</dependency>
pdfbox reads the PDF document, and opencc4j converts the extracted text from Traditional to Simplified Chinese.
2. Core code
The code targets JDK 21; adapt the syntax to your own Java version as needed.
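As a hint for that adaptation: the listing below relies on pattern matching for `instanceof`, which is available from Java 16 onward. Here is a minimal sketch of the same kind of check written both ways. It uses plain `Object` values instead of PDFBox tokens, and the class and method names are made up for illustration.

```java
import java.util.List;

class InstanceofStyles {
    // Java 16+: the pattern variable is bound when the type test succeeds
    static String describeModern(Object token) {
        if (token instanceof String s) {
            return "string:" + s.length();
        } else if (token instanceof List<?> list) {
            return "list:" + list.size();
        }
        return "other";
    }

    // Older Java: test the type first, then cast explicitly
    static String describeClassic(Object token) {
        if (token instanceof String) {
            String s = (String) token;
            return "string:" + s.length();
        } else if (token instanceof List) {
            List<?> list = (List<?>) token;
            return "list:" + list.size();
        }
        return "other";
    }
}
```

Both methods return the same result for the same input, so each `instanceof X name` in the listing can be rewritten as a test followed by a cast.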
// Imports needed by the enclosing class. The log field is assumed to come from Lombok's @Slf4j on that class.
import com.github.houbb.opencc4j.util.ZhConverterUtil;
import lombok.SneakyThrows;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Override
@SneakyThrows
public void simplify(InputStream is, OutputStream os) {
    try (PDDocument pd = PDDocument.load(is)) {
        // Target font containing Simplified Chinese glyphs; embedSubset = false embeds the whole font
        PDFont targetFont;
        try (InputStream fontStream = new FileInputStream("C:\\Windows\\Fonts\\STSONG.TTF")) {
            targetFont = PDType0Font.load(pd, fontStream, false);
        }
        COSName fontName = null; // resource name of the font most recently seen in the stream
        int fontId = 0;          // token index of that font name
        for (PDPage page : pd.getPages()) {
            PDFStreamParser parser = new PDFStreamParser(page);
            COSName targetCosName = page.getResources().add(targetFont);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            Map<COSName, PDFont> fontMap = new HashMap<>();
            for (COSName name : page.getResources().getFontNames()) {
                fontMap.put(name, page.getResources().getFont(name));
            }
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof COSName name) {
                    // Track only names that resolve to a font; other names (images, graphics states) are ignored
                    if (fontMap.get(name) != null) {
                        fontId = j;
                        fontName = name;
                    }
                } else if (next instanceof COSString string) {
                    PDFont font = fontMap.get(fontName);
                    if (font == null) {
                        continue;
                    }
                    String simplified = ZhConverterUtil.toSimple(decode(font, string.getBytes()));
                    try {
                        // Overwrite the string with the target-font encoding and switch the font operand
                        string.setValue(targetFont.encode(simplified));
                        tokens.set(fontId, targetCosName);
                    } catch (Exception e) {
                        log.error("Failed to re-encode text string with the target font", e);
                    }
                } else if (next instanceof COSArray array) {
                    // Concatenate the bytes of every string in the array
                    ByteArrayOutputStream joined = new ByteArrayOutputStream();
                    for (int k = 0; k < array.size(); k++) {
                        if (array.getObject(k) instanceof COSString part) {
                            joined.writeBytes(part.getBytes());
                        }
                    }
                    PDFont font = fontMap.get(fontName);
                    // Skip arrays without text, such as dash patterns
                    if (font == null || joined.size() == 0) {
                        continue;
                    }
                    String simplified = ZhConverterUtil.toSimple(decode(font, joined.toByteArray()));
                    try {
                        COSString merged = new COSString(targetFont.encode(simplified));
                        // Replace the array contents with one string, only after encoding succeeded
                        array.clear();
                        array.add(merged);
                        tokens.set(fontId, targetCosName);
                    } catch (Exception e) {
                        log.error("Failed to re-encode text array with the target font", e);
                    }
                }
            }
            // Write the modified tokens back as the page's new content stream
            PDStream updatedStream = new PDStream(pd);
            try (OutputStream out = updatedStream.createOutputStream()) {
                new ContentStreamWriter(out).writeTokens(tokens);
            }
            page.setContents(updatedStream);
        }
        pd.save(os);
    }
}

// Decodes PDF string bytes into Unicode text using the font they were encoded with
private static String decode(PDFont font, byte[] bytes) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (InputStream in = new ByteArrayInputStream(bytes)) {
        while (in.available() > 0) {
            String unicode = font.toUnicode(font.readCode(in));
            if (unicode != null) { // codes without a Unicode mapping are skipped
                sb.append(unicode);
            }
        }
    }
    return sb.toString();
}
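The `simplify(InputStream, OutputStream)` method above works on streams, so a caller still has to open and close the files. A minimal sketch of such a caller follows, assuming a hypothetical `PdfSimplifier` interface with the same method signature; the interface and the `SimplifyFiles` helper are illustrative names, not part of PDFBox or opencc4j.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical interface mirroring the simplify(InputStream, OutputStream) method
interface PdfSimplifier {
    void simplify(InputStream is, OutputStream os) throws IOException;
}

class SimplifyFiles {
    // Opens both files, runs the converter, and closes the streams even if conversion fails
    static void convert(PdfSimplifier simplifier, Path source, Path target) throws IOException {
        try (InputStream is = Files.newInputStream(source);
             OutputStream os = Files.newOutputStream(target)) {
            simplifier.simplify(is, os);
        }
    }
}
```

Keeping the file handling outside the converter also makes it easy to test the plumbing with a stand-in implementation that just copies bytes.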
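The code above is adapted from snippets circulating online. After some debugging of my own it addresses two problems. First, text set in a Traditional Chinese font is switched to a font that has the Simplified glyphs. Second, pay attention to the arguments when loading the font (the third one, embedSubset, is false here); otherwise the font is not embedded in the document, and text copied out of a PDF reader comes out garbled.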
PDFont targetFont = PDType0Font.load(pd, new FileInputStream("C:\\Windows\\Fonts\\STSONG.TTF"), false);
Also note that pdfbox can only load font files with the .ttf extension here.
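The hard-coded path `C:\\Windows\\Fonts\\STSONG.TTF` only works on machines where that file exists. The sketch below shows one way to pick the first usable font from a list of candidate paths before handing it to `PDType0Font.load`, using the .ttf restriction mentioned above. `FontLocator` and its method are hypothetical helpers, and any candidate paths beyond the one in the listing are assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

class FontLocator {
    // Returns the first candidate that is an existing regular file whose name ends with .ttf (case-insensitive)
    static Optional<Path> firstUsableTtf(List<Path> candidates) {
        return candidates.stream()
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".ttf"))
                .findFirst();
    }
}
```

An empty result can then be reported as a clear "font not found" error instead of failing later inside the PDF code.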
Before conversion
After conversion