Searched quite a bit but as I couldn't find a solution for this kind of problem, hence posting a clear question on the same. Most answers cover image/text extraction which are comparatively easier.
I've a requirement of extracting tables and graphs as text (csv) and images respectively from PDFs.
Can anyone help me with an efficient python 3.6 code to solve the same?
Till now I could achieve extracting jpgs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain jpgs, hence my code fails badly in achieving that.
Example, I want to extract table from page 11 and graphs from page 12 as image or something which is feasible from the below given link. How to go about it?
解决方案
For extracting tables you can use camelot
Here is an article about it.