pdf-table-extract
- ONLY works with text-based PDFs whose tables are bounded by perfectly straight ruled lines
- Installed via `python setup.py install`; the entry point is `pdftableextract.scripts:main`
- Pipeline: find the lines, work out the grid and cells, then run `pdftotext` on each cell
- Debug images can be viewed with `checkcrop`, `checklines`, `checkdivs`, and `checkcells`
- The script requires numpy and poppler (`pdftoppm` and `pdftotext`)
- Analyses a page in a PDF looking for well delineated table cells, and extracts the text in each cell.
- Outputs include JSON, XML, and CSV lists of cell locations, shapes, and contents, and CSV and HTML versions of the tables.
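The per-cell step above can be sketched as building one `pdftotext` crop invocation per grid cell. A minimal stdlib sketch; `cell_text_cmd` is a hypothetical helper (not part of the package), the file name and coordinates are made up, and actually running the command requires poppler:

```python
# Sketch of pdf-table-extract's per-cell step: each detected cell's bounding
# box becomes a pdftotext crop command (-x/-y/-W/-H and -f/-l are real
# poppler flags). cell_text_cmd is an illustrative helper, not the package's API.
def cell_text_cmd(pdf_path, page, x, y, w, h):
    """Build a pdftotext invocation that extracts the text of one cell."""
    return ["pdftotext",
            "-f", str(page), "-l", str(page),  # restrict to a single page
            "-x", str(x), "-y", str(y),        # top-left corner of the cell
            "-W", str(w), "-H", str(h),        # crop-area width and height
            pdf_path, "-"]                     # "-" sends the text to stdout

cmd = cell_text_cmd("report.pdf", 1, 100, 200, 80, 20)
print(" ".join(cmd))
# → pdftotext -f 1 -l 1 -x 100 -y 200 -W 80 -H 20 report.pdf -
```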
table-extract
- MAINLY based on Tesseract: parses the hOCR/HTML output file generated by `tesseract` to figure out the tables and their layout
- A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract
- `preprocess.sh`
  - Script for prepping a PDF for table extraction.
  - Converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract (assumes a local installation of tesseract-ocr).
  - Also runs each page through `annotate.py` to assist in debugging.
- `pdf2hocr` == `preprocess.sh` + `process.sh`
- The core function is still `tesseract`; all the other scripts just parse the resultant hOCR/HTML file it generates
- `do_extract.py`
  - a separate process that does the actual extraction work after `preprocess.sh` + `process.sh` or `pdf2hocr`
- main script is `table_extractor.py`
  - it's basically an XML parser that walks these hOCR classes:
    - `div.ocr_page`
    - `div.ocr_carea`
    - `p.ocr_par`
    - `span.ocr_line`
    - `span.ocrx_word`
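A minimal stdlib sketch of what that parsing amounts to: walk the hOCR class hierarchy and collect each `ocrx_word` together with the bounding box stored in its `title` attribute. The hOCR snippet is a made-up toy example, not real `tesseract` output:

```python
# Stdlib sketch of the hOCR walk: collect each ocrx_word and the bbox kept
# in its title attribute ("bbox x0 y0 x1 y1"). The snippet is a toy example.
from html.parser import HTMLParser

HOCR = """
<div class='ocr_page'><div class='ocr_carea'><p class='ocr_par'>
 <span class='ocr_line' title='bbox 30 40 200 60'>
  <span class='ocrx_word' title='bbox 30 40 80 60'>Total</span>
  <span class='ocrx_word' title='bbox 90 40 140 60'>42</span>
 </span>
</p></div></div>
"""

class WordCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_word = False
        self.words = []                      # (text, (x0, y0, x1, y1)) pairs

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "ocrx_word":
            self.in_word = True
            # bbox is the 4 integers after the "bbox" keyword in title
            self.bbox = tuple(int(v) for v in a["title"].split()[1:5])

    def handle_data(self, data):
        if self.in_word and data.strip():
            self.words.append((data.strip(), self.bbox))

    def handle_endtag(self, tag):
        self.in_word = False             # a word span contains only its text

parser = WordCollector()
parser.feed(HOCR)
print(parser.words)
# → [('Total', (30, 40, 80, 60)), ('42', (90, 40, 140, 60))]
```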
- call chain: `area_summary` → `summarize_document` → `process_page`
- structures
  - `pages`: list type
  - each page in `pages` is a dict with keys `page_no`, `soup`, `page`, `areas`, `lines`
  - `pages[i]['areas']` is the collection of `carea` in page `i`
  - each `area` is a dict with 20 keys, e.g. `lines`, `line_heights`, `words`, `area`, `word_distances`, `words_per_line`, `word_height_avg`, etc.
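The per-area statistics can be illustrated with a small sketch. The key names come from the notes above, but `area_summary` here is a toy reimplementation; the actual computation in `do_extract.py` produces 20 keys and may differ:

```python
# Illustrative per-area statistics: given an area's lines of (word, bbox)
# tuples, derive a few of the keys listed above. The real do_extract.py
# computes 20 keys; this toy version only shows the idea.
def area_summary(lines):
    """lines: list of lines, each a list of (word, (x0, y0, x1, y1))."""
    words = [w for line in lines for w in line]
    heights = [y1 - y0 for _, (x0, y0, x1, y1) in words]
    return {
        "lines": len(lines),
        "words": len(words),
        "words_per_line": len(words) / len(lines) if lines else 0.0,
        "word_height_avg": sum(heights) / len(heights) if heights else 0.0,
    }

summary = area_summary([
    [("Total", (30, 40, 80, 60)), ("42", (90, 40, 140, 60))],
    [("Subtotal", (30, 70, 120, 90))],
])
print(summary)
# → {'lines': 2, 'words': 3, 'words_per_line': 1.5, 'word_height_avg': 20.0}
```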
HazyResearch/pdftotree
- Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
- This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
- Then uses Tabula to extract the contents of the tables
- Tabula only works on text-based PDFs, not scanned documents.
HazyResearch/TreeStructure
- Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
- This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
- Evaluation code is provided to compute recall, precision, and F1 score at the character level (ground-truth data required)
pdfminer/pdfminer.six
- PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
- PDFMiner comes with two handy tools: `pdf2txt.py` and `dumppdf.py`.
  - `pdf2txt.py` extracts text contents from a PDF file. It extracts all the text that is rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion.
  - `dumppdf.py` dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (e.g. images).