Last year I wrote a Java application that uses PDFBox to extract the raw text from some PDF files, and I now need to port it to C++.
What is the best C++ alternative to PDFBox for this?
I'll give an example in case it helps:
With PDFBox, using that file, each line read on page 2 and most of page 3 comes out with all of the row's data on a single line, separated by spaces, instead of being kept in a grid as it is in the PDF.
So the first relevant line on page 2 would look like this:
FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615
or something like that. There are minor changes in the order in which the values appear, but I don't care about that as long as similar lines produce consistent output, since I just parse them and put the values I need into different variables.
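To make the parsing step concrete, here is a minimal C++ sketch of splitting one such line into whitespace-separated fields. The function names `splitFields` and `parseDecimalComma` are made up for illustration, and which position holds which value is not stated above, so the mapping from fields to variables is left to the caller.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one extracted line into whitespace-separated tokens.
std::vector<std::string> splitFields(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string token;
    while (in >> token) {
        fields.push_back(token);
    }
    return fields;
}

// Hypothetical helper: convert a number written with a decimal comma
// (for example "179,63") to a double. Assumes the default "C" locale,
// in which std::stod expects '.' as the decimal separator.
double parseDecimalComma(std::string text) {
    std::replace(text.begin(), text.end(), ',', '.');
    return std::stod(text);
}
```

Because the split is purely on whitespace, a name such as "Tremblay, Gérard" ends up in two tokens, so the caller would need to rejoin them if the full name is needed.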
So, knowing all of that, is there a library that I can use in a C++ program to get similar results?
Edit: After trying the code from sacredFaith's link, http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file, I get strange output like this for the example file I mentioned earlier:
The parts I actually need are in the weird characters at the beginning. In Adobe Acrobat Reader X, using Save As... Text (accessible), I get the following result:
That is approximately what I get in Java using PDFBox, and it is what I want as output in C++.
Solution
Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.
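One low-effort way to try this from a C++ program is to run Xpdf's `pdftotext` command-line tool and capture its output, rather than linking against the Xpdf sources. The sketch below assumes `pdftotext` is installed and on the PATH and that the platform provides POSIX `popen`. The helper names are made up for illustration, and the choice of `-layout` is a guess at what keeps each table row on one line (`-raw` is the other mode worth trying).

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: build a pdftotext command that writes the text of
// pages firstPage..lastPage to stdout ("-" as the output file name).
// The path is only wrapped in double quotes; it is NOT escaped for the
// shell, so this assumes a trusted path without quotes or '$' in it.
std::string buildPdftotextCommand(const std::string& pdfPath,
                                  int firstPage, int lastPage) {
    return "pdftotext -f " + std::to_string(firstPage) +
           " -l " + std::to_string(lastPage) +
           " -layout \"" + pdfPath + "\" -";
}

// Hypothetical helper: run a shell command and return everything it
// printed to stdout. Uses POSIX popen/pclose.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("could not start: " + command);
    }
    std::string output;
    char buffer[4096];
    std::size_t n = 0;
    while ((n = std::fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}
```

Calling `runAndCapture(buildPdftotextCommand("report.pdf", 2, 3))`, where `report.pdf` is a placeholder file name, would return the text of pages 2 and 3 as one string, which can then be split into lines and parsed as before.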