I am using the PDF iText library to convert PDF to text.
Below is my code to convert PDF to text file using Java.
public class PdfConverter {
/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";
/**
* Parses a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
System.out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws IOException
*/
public static void main(String[] args) throws IOException {
new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}
The above code works for extracting PDF to text. But my requirement is to ignore header and footer and extract only content from PDF file.
解决方案
Because your pdf has headers and footers, it would be marked as artifacts(if not its just a text or content placed at the position of a header or footer). If its marked as artifacts, you can extract it using ParseTaggedPdf. You can also make use of ExtractPageContentArea if ParseTaggedPdf doesn't work. You can check for a few examples related to it.
The above solution is general and depends on the file. If you really need an alternate solution, you can use apache API's like PdfBox, tika and others like PDFTextStream. The solution which i'm giving below wont work if you have to persist with iText and can't move on to other libraries. In PdfBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the JavaDoc or some examples if you need to know how to use it.