Java remove PDF text: How can I remove all images/drawings from a PDF file and keep only the text, in Java?...

I have a PDF file that is the output of an OCR processor. The processor recognizes the image and adds the text to the PDF, but in the end it places a low-quality image in the file instead of the original one (I have no idea why anyone would do that, but they do).

So I would like to take this PDF, remove the image stream, and leave the text alone, so that I can import it (using iText's page-importing feature) into a PDF I'm creating myself with the real image.

And before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't at the same position as in the original.

I'd rather have this done in Java, but if another tool can do it better, just let me know. Removing only the images would also be fine; I can live with a PDF that still has the drawings in it.

Solution

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

import java.io.IOException;

// Written against the PDFBox 1.x API (the org.apache.pdfbox.exceptions package).
public class Main {

    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Decrypt with an empty password if the file is encrypted.
            if (document.isEncrypted()) {
                document.decrypt("");
            }
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                // Clear the image entries of the resources that apply to this page.
                PDResources resources = page.findResources();
                if (resources != null) {
                    resources.getImages().clear();
                }
            }
            document.save("strippedOfImages.pdf");
        } finally {
            // Release the document's resources even if processing fails.
            document.close();
        }
    }
}

It's supposed to remove all types of images (PNG, JPEG, ...).
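Clearing the image entries from the resources does not touch the page content stream itself, where an image XObject is painted by a `/Name Do` operator. If those leftover invocations cause trouble, they can be filtered out as well. The following is only a minimal sketch of that idea using plain Java strings, not part of the answer above: the class `ContentStreamImageStripper`, the method `stripImageInvocations`, and the resource name `Im0` are hypothetical, and the input is assumed to be an already decoded content stream. It copies literal strings through untouched but ignores inline images (`BI ... ID ... EI`) and comments, so real code would more likely rely on a proper parser such as PDFBox's `PDFStreamParser`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ContentStreamImageStripper {

    // PDF delimiter characters; any other non-whitespace character is "regular".
    private static final String DELIMITERS = "()<>[]{}/%";

    private static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && DELIMITERS.indexOf(c) < 0;
    }

    // Returns the index just past the literal string that starts with '(' at 'start'.
    private static int skipLiteralString(String content, int start) {
        int depth = 0;
        int i = start;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (c == '\\') {
                i += 2; // skip the backslash and the escaped character
                continue;
            }
            if (c == '(') depth++;
            if (c == ')' && --depth == 0) return i + 1;
            i++;
        }
        return content.length(); // unbalanced string: treat the rest as string data
    }

    // Removes "/Name Do" for every name in imageNames (hypothetical helper).
    static String stripImageInvocations(String content, Set<String> imageNames) {
        StringBuilder out = new StringBuilder();
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                // Copy literal strings verbatim so text such as "(Do)" survives.
                int end = skipLiteralString(content, i);
                out.append(content, i, end);
                i = end;
            } else if (c == '/') {
                int nameEnd = i + 1;
                while (nameEnd < n && isRegular(content.charAt(nameEnd))) nameEnd++;
                String name = content.substring(i + 1, nameEnd);
                int op = nameEnd;
                while (op < n && Character.isWhitespace(content.charAt(op))) op++;
                boolean isDo = op > nameEnd
                        && content.startsWith("Do", op)
                        && (op + 2 == n || !isRegular(content.charAt(op + 2)));
                if (isDo && imageNames.contains(name)) {
                    i = op + 2; // drop both the name operand and the Do operator
                } else {
                    out.append(content, i, nameEnd);
                    i = nameEnd;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String content = "q 100 0 0 100 0 0 cm /Im0 Do Q BT /F1 12 Tf (Do /Im0 Do) Tj ET";
        Set<String> images = new HashSet<>(Arrays.asList("Im0"));
        System.out.println(stripImageInvocations(content, images));
    }
}
```

In this sample the `/Im0 Do` between `q` and `Q` is dropped, while the text block between `BT` and `ET`, including the `Do` inside the parenthesized string, is left as it was. The empty `q ... cm Q` pair that remains paints nothing.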
