Extracting HTML tables with R: reading tables from PDF files in R

Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from PDF files in R. From the extracted plain text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata or on pay-walled search engines.

The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installing pdftools

On Windows and Mac the binary packages can be installed directly from CRAN:

install.packages("pdftools")

Installation on Linux requires the poppler development library. On Debian/Ubuntu:

sudo apt-get install libpoppler-cpp-dev

On Fedora or CentOS:

sudo yum install poppler-cpp-devel

If you want to install the package from source on Mac OS X, you first need to install poppler with Homebrew:

brew install poppler

That’s it.

Getting started

The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain-text version of the text on that page.

library(pdftools)

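# download the example paper (assumed here to be arXiv paper 1403.2805)

download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")

# extract the text, one string per page

txt <- pdf_text("1403.2805.pdf")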

# first page text

cat(txt[1])

# second page text

cat(txt[2])
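To tie this back to the search use case from the introduction, here is a minimal sketch of finding which pages mention a term. The helper name pages_mentioning is hypothetical and not part of pdftools; it works on any character vector with one string per page, such as the txt returned by pdf_text. The sample pages are made up so that the block runs without a PDF.

```r
# Hypothetical helper (not part of pdftools): return the page numbers
# whose text mentions `term`, ignoring case.
pages_mentioning <- function(pages, term) {
  # fixed = TRUE treats `term` as a literal string, not a regex
  hits <- grepl(tolower(term), tolower(pages), fixed = TRUE)
  which(hits)
}

# Made-up stand-in for the output of pdf_text(): one string per page
sample_pages <- c(
  "Introduction\nJSON is a lightweight data format.",
  "Methods\nWe map R data frames to JSON arrays.",
  "Acknowledgements"
)

pages_mentioning(sample_pages, "json")
```

With a real document you would pass txt instead of sample_pages. Literal matching is used here so that terms containing characters like "." or "(" do not need escaping.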

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which PDF readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents

toc <- pdf_toc("1403.2805.pdf")

# Show as JSON

jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
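If you prefer a plain outline over JSON, the nested list can be flattened into indented lines. The sketch below assumes the table of contents is a nested list in which each entry has a title and a children element, with an empty title on the root; the helper flatten_toc is hypothetical, and the sample list is made up so that the block runs without a PDF.

```r
# Hypothetical helper (not part of pdftools): walk a nested list of
# title/children entries and return one indented line per section.
flatten_toc <- function(node, depth = 0) {
  lines <- character(0)
  # The root entry is assumed to have an empty title, so skip blank titles
  if (!is.null(node$title) && nzchar(node$title)) {
    lines <- paste0(strrep("  ", depth), node$title)
    depth <- depth + 1
  }
  # Recurse into the children, one indentation level deeper
  for (child in node$children) {
    lines <- c(lines, flatten_toc(child, depth))
  }
  lines
}

# Made-up table of contents in the assumed title/children shape
sample_toc <- list(
  title = "",
  children = list(
    list(title = "1 Introduction", children = list(
      list(title = "1.1 Background", children = list())
    )),
    list(title = "2 Methods", children = list())
  )
)

cat(flatten_toc(sample_toc), sep = "\n")
```

If your toc object has the same shape, flatten_toc(toc) should give a similar outline for the real paper.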

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc

info <- pdf_info("1403.2805.pdf")

# Table with fonts

fonts <- pdf_fonts("1403.2805.pdf")

Bonus feature: rendering pdf

A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array

bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image

png::writePNG(bitmap, "page.png")

jpeg::writeJPEG(bitmap, "page.jpeg")

webp::write_webp(bitmap, "page.webp")

This feature now works on all platforms.

Limitations

Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and has no notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.

# text of a PDF that contains tables; "tables.pdf" is a placeholder path

txt <- pdf_text("tables.pdf")

cat(txt[18])

cat(txt[19])

The pdftools package usually does a decent job of retaining the positioning of table elements when converting from PDF to text. However, the output still depends heavily on the formatting of the original PDF table, which makes it very difficult to write a generic table extractor. With a little creativity you might still be able to parse the table data from the text output of a given paper.
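As an illustration of that kind of creativity, here is a minimal sketch that parses one simple table from page text. It assumes the columns are separated by runs of two or more spaces and that no cell contains such a run, which will not hold for every PDF. The page text is made up so that the block runs without a PDF; with a real document you would start from one element of txt.

```r
# Made-up page text standing in for one element of pdf_text() output
page <- paste(
  "Table 1: Example measurements",
  "Name      Count    Mean",
  "alpha        12    3.50",
  "beta          7    2.25",
  sep = "\n"
)

# Split the page into lines and drop the caption line
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
rows <- page_lines[-1]

# Treat runs of two or more spaces as column separators
cells <- strsplit(trimws(rows), " {2,}")

# First row holds the header, the remaining rows hold the data
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]

# Everything is parsed as text, so convert the numeric columns
tab$Count <- as.integer(tab$Count)
tab$Mean <- as.numeric(tab$Mean)

print(tab)
```

Splitting on two or more spaces rather than one keeps multi-word cells together, but the rule is specific to this layout and would need adjusting for tables with empty cells or wrapped lines.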

Jeroen Ooms joins team rOpenSci!

A message from the team: We are happy to announce that Jeroen Ooms has joined the rOpenSci crew! Jeroen is a prolific programmer and author of numerous widely used packages. At rOpenSci, he will continue to work on developing awesome packages and infrastructural software for improving the scientific tool chain.
