Removing headers and footers when printing in Java: how to remove headers and footers from a PDF file using iText in Java

I am using the iText PDF library to convert a PDF to text.

Below is my Java code for converting a PDF to a text file.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {

    /** The original PDF that will be parsed. */
    public static final String pdfFileName = "jdbc_tutorial.pdf";

    /** The resulting text file. */
    public static final String RESULT = "preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        // try-with-resources flushes and closes the writer even if parsing fails
        try (PrintWriter out = new PrintWriter(new FileOutputStream(txt))) {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                // Use a fresh strategy per page so text does not accumulate across pages
                TextExtractionStrategy strategy =
                        parser.processContent(i, new SimpleTextExtractionStrategy());
                String pageText = strategy.getResultantText();
                out.println(pageText);
                System.out.println(pageText);
            }
        } finally {
            reader.close();
        }
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new PdfConverter().parsePdf(pdfFileName, RESULT);
    }
}

The above code extracts the PDF's text correctly. But my requirement is to ignore the headers and footers and extract only the body content of the PDF file.

Solution

If the headers and footers in your PDF were created properly, they are marked as artifacts; if not, they are just ordinary text placed at the position of a header or footer. If they are marked as artifacts, you can extract the content using the ParseTaggedPdf example. If ParseTaggedPdf doesn't work, you can use the ExtractPageContentArea example instead, which limits extraction to a given area of the page. Both examples are worth checking.
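To make the area-based idea concrete, here is a minimal, library-independent sketch: text whose position falls inside the header or footer band of the page is dropped, and everything else is kept. The class `ContentAreaFilter`, the `Chunk` type, and the band heights are hypothetical names and values chosen for illustration; they are not part of iText. In iText 5 itself, the ExtractPageContentArea example achieves the same effect, as far as I know, by wrapping an extraction strategy in a `FilteredTextRenderListener` with a `RegionTextRenderFilter` built from a `Rectangle`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-independent sketch of area-based filtering; all names here are hypothetical. */
class ContentAreaFilter {

    /** A piece of text plus the y-coordinate it was drawn at (PDF user space, origin at the bottom-left). */
    static final class Chunk {
        final String text;
        final float y;

        Chunk(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Keeps only the chunks that lie between the footer band (bottom of the page)
     * and the header band (top of the page). The band heights are assumptions that
     * have to be measured for each document layout.
     */
    static List<String> keepBody(List<Chunk> chunks, float pageHeight,
                                 float headerHeight, float footerHeight) {
        float lower = footerHeight;              // everything below this is footer
        float upper = pageHeight - headerHeight; // everything above this is header
        List<String> body = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.y >= lower && c.y <= upper) {
                body.add(c.text);
            }
        }
        return body;
    }

    public static void main(String[] args) {
        // Made-up chunks on an A4 page (842 points high)
        List<Chunk> page = Arrays.asList(
                new Chunk("JDBC Tutorial", 810f),              // inside the header band
                new Chunk("Chapter 1: Getting started", 700f),
                new Chunk("A connection is opened with ...", 680f),
                new Chunk("Page 1", 30f));                     // inside the footer band
        System.out.println(keepBody(page, 842f, 60f, 60f));
    }
}
```

The obvious limitation is that the band heights are fixed: this only works when headers and footers sit at the same position on every page, which is why the answer says the result depends on the file.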

The above solution is general, and how well it works depends on the file. If you really need an alternative, you can use Apache libraries such as PDFBox or Tika, or others such as PDFTextStream. Note that this alternative won't help if you have to stay with iText and can't move to another library. In PDFBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the Javadoc or some examples if you need to know how to use them.
