PDF to [images, tables, texts, etc.]

pdf-table-extract

  • Works ONLY on text-based PDFs whose tables are bounded by perfectly straight ruling lines
  • Install with python setup.py install; the entry point is pdftableextract.scripts:main
  • Approach: find the lines, work out the grid and cells, then run pdftotext on each cell
  • Debug images can be viewed via checkcrop, checklines, checkdivs, and checkcells
  • The script requires numpy and poppler (pdftoppm and pdftotext)
  • Analyses a page in a PDF looking for well-delineated table cells, and extracts the text in each cell.
  • Outputs include JSON, XML, and CSV lists of cell locations, shapes, and contents, and CSV and HTML versions of the tables.
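
The "find lines, work out grid and cells" step can be illustrated with a small pure-Python sketch. This is not the code of pdf-table-extract: the names find_cells and ruling_positions are hypothetical, and the sketch assumes the page is already a 0/1 bitmap in which every ruling line spans the full width or height of the cropped table.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # 1 = dark pixel, 0 = light pixel


def ruling_positions(profile: List[int], full: int) -> List[int]:
    """Return the centre index of each run where the profile equals `full`."""
    positions, start = [], None
    for i, value in enumerate(profile + [0]):  # sentinel closes a trailing run
        if value == full and start is None:
            start = i
        elif value != full and start is not None:
            positions.append((start + i - 1) // 2)
            start = None
    return positions


def find_cells(bitmap: Bitmap) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes between adjacent ruling lines."""
    height, width = len(bitmap), len(bitmap[0])
    row_sums = [sum(row) for row in bitmap]
    col_sums = [sum(row[x] for row in bitmap) for x in range(width)]
    ys = ruling_positions(row_sums, width)   # horizontal lines fill a whole row
    xs = ruling_positions(col_sums, height)  # vertical lines fill a whole column
    return [
        (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(len(ys) - 1)
        for c in range(len(xs) - 1)
    ]
```

Each returned box would then be the region handed to pdftotext. The sketch also shows why the tool is strict about straight lines: a ruling line that is skewed or broken never fills a whole row or column, so it is not detected.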

table-extract

  • MAINLY based on Tesseract: it parses the hOCR/HTML output file produced by Tesseract to work out the tables and their layout
  • A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract
  • preprocess.sh
    • Script for prepping a PDF for table extraction.
    • Converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract (assumes a local installation of tesseract-ocr).
    • Also runs each page through annotate.py to assist in debugging.
  • pdf2hocr == preprocess.sh + process.sh
  • The core functionality is still Tesseract. All the other scripts parse the resulting hOCR or HTML file generated by Tesseract
  • do_extract.py
    • a separate process that does the actual extraction work after preprocess.sh + process.sh or pdf2hocr
    • the main script is table_extractor.py
    • it’s basically an XML parser that walks the hOCR element hierarchy:
      • div.ocr_page
        • div.ocr_carea
          • p.ocr_par
            • span.ocr_line
              • span.ocrx_word
  • area_summary -> summarize_document -> process_page
  • structures
    • pages: a list
    • each page in pages is a dict with the keys page_no, soup, page, areas, lines
    • pages[i]['areas'] is the collection of careas on page i
    • each area is a dict with 20 keys, e.g. lines, line_heights, words, area, word_distances, words_per_line, word_height_avg, etc.
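
To make the hOCR hierarchy concrete, here is a minimal sketch that groups words by their enclosing ocr_carea using only the standard library. It is not table_extractor.py (which, judging by the soup key above, keeps a parsed soup per page). The names HocrAreas, parse_bbox and parse_hocr are hypothetical, and the resulting dicts hold only bbox and words, far fewer than the 20 keys mentioned above. The sketch relies on hOCR storing each element's bounding box in its title attribute as "bbox x0 y0 x1 y1".

```python
from html.parser import HTMLParser
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}  # never get an end tag


def parse_bbox(title):
    """Pull (x0, y0, x1, y1) out of an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(v) for v in match.groups()) if match else None


class HocrAreas(HTMLParser):
    """Group the words of an hOCR document by their enclosing ocr_carea."""

    def __init__(self):
        super().__init__()
        self.areas = []    # one dict per div.ocr_carea
        self._open = []    # per open tag: True if it is a span.ocrx_word
        self._word = None  # the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        bbox = parse_bbox(attrs.get("title"))
        if "ocr_carea" in classes:
            self.areas.append({"bbox": bbox, "words": []})
        # Words outside any carea are ignored in this sketch.
        is_word = "ocrx_word" in classes and bool(self.areas)
        self._open.append(is_word)
        if is_word:
            self._word = {"text": "", "bbox": bbox}

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        if self._open.pop() and self._word is not None:
            self._word["text"] = self._word["text"].strip()
            self.areas[-1]["words"].append(self._word)
            self._word = None

    def handle_data(self, data):
        if self._word is not None:
            self._word["text"] += data


def parse_hocr(hocr_text):
    """Return a list of {"bbox": ..., "words": [...]} dicts, one per carea."""
    parser = HocrAreas()
    parser.feed(hocr_text)
    parser.close()
    return parser.areas
```

Calling parse_hocr on the text of an .hocr file returns one dict per content area. Statistics such as words_per_line or word_height_avg could then be derived from the word boxes.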

HazyResearch/pdftotree

  • Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
  • This project uses the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
  • Tabula is then used to extract the contents of the tables
    • Tabula only works on text-based PDFs, not scanned documents.

HazyResearch/TreeStructure

  • Detect and extract document elements such as tables, figures, headers using ML techniques by training on sample documents.
  • This project uses the table-extraction tool (https://github.com/xiao-cheng/table-extraction).
  • Evaluation code is provided to compute recall, precision, and F1 score at the character level (ground-truth data required)
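
As a rough illustration of character-level scoring, the sketch below computes precision, recall, and F1 over two sets of character identifiers. How TreeStructure actually identifies and matches characters is not described here, so the set-based matching and the name char_prf are assumptions rather than the project's evaluation code.

```python
def char_prf(predicted, truth):
    """Precision, recall and F1 over two collections of character identifiers.

    Assumption: each character is represented by a hashable id, for example
    a (page, index) tuple for every character inside a detected table.
    """
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)  # characters found in both sets
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For example, if the detector marks 4 characters as table content, the ground truth has 6, and 2 are shared, precision is 2/4 and recall is 2/6.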

pdfminer/pdfminer.six

  • PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines.
  • PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.
    • pdf2txt.py extracts the text contents of a PDF file. It extracts all text that is rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion.
    • dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it’s also possible to extract some meaningful contents (e.g. images).
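
Besides the command-line tools, pdfminer.six can be used as a library to get text together with its location. The following is a minimal sketch, assuming pdfminer.six is installed. The function name iter_text_boxes is hypothetical, while extract_pages and LTTextContainer come from pdfminer.six itself.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text block in the PDF."""
    # Imported lazily so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left
                yield page_number, element.bbox, element.get_text().strip()
```

Iterating over iter_text_boxes("sample.pdf") (a placeholder file name) gives one tuple per text block. As with pdf2txt.py, text that exists only as an image yields nothing.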