pdf-table-extract
- ONLY works with text-based PDFs whose tables are bounded by perfectly straight ruled lines
- Installed via `python setup.py install`; the entry point is `pdftableextract.scripts:main`
- Pipeline: find the lines, work out the grid and cells, then run `pdftotext` on each cell
- Debug images can be viewed with `checkcrop`, `checklines`, `checkdivs`, and `checkcells`
- The script requires numpy and poppler (`pdftoppm` and `pdftotext`)
- Analyses a page in a PDF looking for well delineated table cells, and extracts the text in each cell.
- Outputs include JSON, XML, and CSV lists of cell locations, shapes, and contents, and CSV and HTML versions of the tables.
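The per-cell step above can be sketched as building one `pdftotext` crop invocation per grid cell. A minimal stdlib sketch; `cell_text_cmd` is a hypothetical helper (not part of the package), the file name and coordinates are made up, and actually running the command requires poppler:

```python
# Sketch of pdf-table-extract's per-cell step: each detected cell's bounding
# box becomes a pdftotext crop command (-x/-y/-W/-H and -f/-l are real
# poppler flags). cell_text_cmd is an illustrative helper, not the package's API.
def cell_text_cmd(pdf_path, page, x, y, w, h):
    """Build a pdftotext invocation that extracts the text of one cell."""
    return ["pdftotext",
            "-f", str(page), "-l", str(page),  # restrict to a single page
            "-x", str(x), "-y", str(y),        # top-left corner of the cell
            "-W", str(w), "-H", str(h),        # crop-area width and height
            pdf_path, "-"]                     # "-" sends the text to stdout

cmd = cell_text_cmd("report.pdf", 1, 100, 200, 80, 20)
print(" ".join(cmd))
# → pdftotext -f 1 -l 1 -x 100 -y 200 -W 80 -H 20 report.pdf -
```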
table-extract
- MAINLY based on Tesseract: parses the hOCR/HTML output file generated by `tesseract` to figure out the tables and their layout
- A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract
- `preprocess.sh`
  - Script for prepping a PDF for table extraction.
  - Converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract (assumes a local installation of tesseract-ocr).
  - Also runs each page through `annotate.py` to assist in debugging.
- `pdf2hocr` == `preprocess.sh` + `process.sh`
- The core function is still `tesseract`; all the other scripts just parse the resultant hOCR/HTML file it generates
- `do_extract.py`
  - a separate process that does the actual extraction work after `preprocess.sh` + `process.sh` or `pdf2hocr`
- main script is `table_extractor.py`
  - it's basically an XML parser that walks these hOCR classes:
    - `div.ocr_page`
    - `div.ocr_carea`
    - `p.ocr_par`
    - `span.ocr_line`
    - `span.ocrx_word`
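A minimal stdlib sketch of what that parsing amounts to: walk the hOCR class hierarchy and collect each `ocrx_word` together with the bounding box stored in its `title` attribute. The hOCR snippet is a made-up toy example, not real `tesseract` output:

```python
# Stdlib sketch of the hOCR walk: collect each ocrx_word and the bbox kept
# in its title attribute ("bbox x0 y0 x1 y1"). The snippet is a toy example.
from html.parser import HTMLParser

HOCR = """
<div class='ocr_page'><div class='ocr_carea'><p class='ocr_par'>
 <span class='ocr_line' title='bbox 30 40 200 60'>
  <span class='ocrx_word' title='bbox 30 40 80 60'>Total</span>
  <span class='ocrx_word' title='bbox 90 40 140 60'>42</span>
 </span>
</p></div></div>
"""

class WordCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_word = False
        self.words = []                      # (text, (x0, y0, x1, y1)) pairs

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "ocrx_word":
            self.in_word = True
            # bbox is the 4 integers after the "bbox" keyword in title
            self.bbox = tuple(int(v) for v in a["title"].split()[1:5])

    def handle_data(self, data):
        if self.in_word and data.strip():
            self.words.append((data.strip(), self.bbox))

    def handle_endtag(self, tag):
        self.in_word = False             # a word span contains only its text

parser = WordCollector()
parser.feed(HOCR)
print(parser.words)
# → [('Total', (30, 40, 80, 60)), ('42', (90, 40, 140, 60))]
```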
- call chain: `area_summary` → `summarize_document` → `process_page`
- structures
  - `pages`: list type
  - each page in `pages` is a dict with keys `page_no`, `soup`, `page`, `areas`, `lines`
  - `pages[i]['areas']` is the collection of `carea` in page `i`
  - each `area` is a dict with 20 keys, e.g. `lines`, `line_heights`, `words`, `area`, `word_distances`, `words_per_line`, `word_height_avg`, etc.
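The per-area statistics can be illustrated with a small sketch. The key names come from the notes above, but `area_summary` here is a toy reimplementation; the actual computation in `do_extract.py` produces 20 keys and may differ:

```python
# Illustrative per-area statistics: given an area's lines of (word, bbox)
# tuples, derive a few of the keys listed above. The real do_extract.py
# computes 20 keys; this toy version only shows the idea.
def area_summary(lines):
    """lines: list of lines, each a list of (word, (x0, y0, x1, y1))."""
    words = [w for line in lines for w in line]
    heights = [y1 - y0 for _, (x0, y0, x1, y1) in words]
    return {
        "lines": len(lines),
        "words": len(words),
        "words_per_line": len(words) / len(lines) if lines else 0.0,
        "word_height_avg": sum(heights) / len(heights) if heights else 0.0,
    }

summary = area_summary([
    [("Total", (30, 40, 80, 60)), ("42", (90, 40, 140, 60))],
    [("Subtotal", (30, 70, 120, 90))],
])
print(summary)
# → {'lines': 2, 'words': 3, 'words_per_line': 1.5, 'word_height_avg': 20.0}
```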
HazyResearch/pdftotree
- Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
- This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
- Then uses Tabula to extract the contents of the tables
- Tabula only works on text-based PDFs, not scanned documents.
HazyResearch/TreeStructure
- Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
- This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
- Evaluation code is provided to compute recall, precision, and F1 score at the character level (ground-truth data required)
pdfminer/pdfminer.six
- PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
- PDFMiner comes with two handy tools: `pdf2txt.py` and `dumppdf.py`.
  - `pdf2txt.py` extracts text contents from a PDF file. It extracts all the text that is rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion.
  - `dumppdf.py` dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (e.g. images).