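I was recently assigned a task: take a PDF file, pick out the content that needs to be replaced so the file can serve as a shared template, and do the editing in Java.
A few problems came up:
1) Styled text in a PDF is hard to change by hand. Recommended tool: Adobe Acrobat Pro DC, see http://jingyan.baidu.com/article/e6c8503c7b1ab1e54f1a1819.html#
2) The replacement code in Java is as follows.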
// Imports for the PDFBox 1.8.x API used below (getAllPages, PDFOperator and
// COSVisitorException no longer exist in PDFBox 2.x).
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public static void editPDF(String sourceFile, String destinationFile, Map<String, String> chars, String encoding) {
    try {
        PDDocument helloDocument = PDDocument.load(new File(sourceFile));
        try {
            List pages = helloDocument.getDocumentCatalog().getAllPages();
            for (int i = 0; i < pages.size(); i++) {
                PDPage page = (PDPage) pages.get(i);
                PDStream contents = page.getContents();
                if (contents == null) {
                    continue; // page has no content stream
                }
                PDFStreamParser parser = new PDFStreamParser(contents.getStream());
                parser.parse();
                List<Object> tokens = parser.getTokens();
                // Start at 1: an operator's operand is the token just before it.
                for (int j = 1; j < tokens.size(); j++) {
                    if (!(tokens.get(j) instanceof PDFOperator)) {
                        continue;
                    }
                    // Tj and TJ are the two operators that display strings in a PDF.
                    // Tj takes one string; TJ takes an array of strings and spacing numbers.
                    // The operator name is not checked here: any string operand is rewritten.
                    Object operand = tokens.get(j - 1);
                    if (operand instanceof COSString) {
                        replaceText((COSString) operand, chars, encoding);
                    } else if (operand instanceof COSArray) {
                        COSArray previousArray = (COSArray) operand;
                        for (int k = 0; k < previousArray.size(); k++) {
                            Object arrElement = previousArray.getObject(k);
                            if (arrElement instanceof COSString) {
                                replaceText((COSString) arrElement, chars, encoding);
                            }
                        }
                    }
                }
                // Now that the tokens are updated, replace the page content stream.
                PDStream updatedStream = new PDStream(helloDocument);
                OutputStream out = updatedStream.createOutputStream();
                ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
                tokenWriter.writeTokens(tokens);
                out.close();
                page.setContents(updatedStream);
            }
            // Save once, after every page has been rewritten.
            helloDocument.save(destinationFile);
        } finally {
            helloDocument.close();
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (COSVisitorException e) {
        e.printStackTrace();
    }
}

// Applies every key/value pair in chars to one PDF string and writes the result back.
private static void replaceText(COSString cosString, Map<String, String> chars, String encoding) throws IOException {
    String original = cosString.getString();
    String string = original;
    for (String key : chars.keySet()) {
        if (string.indexOf(key) < 0) {
            // Debug aid: print strings that contain "$" but not this key.
            if (string.indexOf("$") >= 0) {
                System.out.println(string);
            }
            continue;
        }
        string = string.replace(key, chars.get(key));
    }
    if (string.equals(original)) {
        return; // leave untouched strings byte-for-byte as they were
    }
    cosString.reset();
    cosString.append(string.getBytes(encoding));
}
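The Map<String, String> chars above is only there because I have many strings to replace; each key is the text to look for and each value is its replacement.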
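To make the role of the map concrete, here is a minimal sketch of the same per-string replacement on a plain Java string, without PDFBox. The class name `PlaceholderDemo`, the `${name}` and `${date}` markers, and the sample values are all hypothetical; use whatever markers your own template contains.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PlaceholderDemo {

    /** Applies every key-to-value pair to the text, in map order. */
    public static String fill(String text, Map<String, String> chars) {
        for (Map.Entry<String, String> entry : chars.entrySet()) {
            text = text.replace(entry.getKey(), entry.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        // Hypothetical placeholders and values, for illustration only.
        Map<String, String> chars = new LinkedHashMap<>();
        chars.put("${name}", "Alice");
        chars.put("${date}", "2016-01-01");

        String filled = fill("Dear ${name}, signed on ${date}", chars);
        System.out.println(filled);

        // editPDF writes each replaced string back as bytes in the chosen encoding.
        byte[] bytes = filled.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length + " bytes");
    }
}
```

One caveat: a key is only found if it sits inside a single string in the page content. A PDF producer may split one visible word across several strings in a TJ array, so short, distinctive markers tend to be safer.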
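3) The most important point: a PDF may display text in a font that is not installed on your own system, in which case Java reads that text as garbled characters. Solutions:
Use a PDF editing tool to change the unrecognized font to one that exists on the system (Java may still fail to recognize it, but the few basic fonts are recognized).
Or download the font from the internet and install it on the system.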
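Before and after installing a font, it can help to check whether the JVM actually sees it. The sketch below uses the standard `java.awt.GraphicsEnvironment` API; the class name `FontCheck` and the default family name `SimSun` are assumptions for illustration. It only reports what the JVM sees and does not guarantee that the PDF library resolves the font the same way.

```java
import java.awt.GraphicsEnvironment;
import java.util.Arrays;

public class FontCheck {

    /** Returns true if the family name appears in the list, ignoring case. */
    public static boolean containsFamily(String[] families, String name) {
        return Arrays.stream(families).anyMatch(f -> f.equalsIgnoreCase(name));
    }

    public static void main(String[] args) {
        // Font families visible to this JVM on this machine.
        String[] installed = GraphicsEnvironment
                .getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();

        // Hypothetical default; pass the family name reported by your PDF tool.
        String wanted = args.length > 0 ? args[0] : "SimSun";
        System.out.println(wanted + " installed: " + containsFamily(installed, wanted));
    }
}
```

Run it with the font family name as the first argument. The name to pass is the one shown in the PDF editing tool's font properties.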