pdfbox: Create, Manipulate and Extract Data from PDF Files

pdfbox

Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper)

Description

I came across this thread (https://twitter.com/derekwillis/status/922138080043241473) and it looks like some misguided folks are going to help promote the use of PDF documents as a legit way to disseminate data, which means that we’re likely to see more evil orgs and government agencies try to use PDFs to hide data.

PDFs are barely useful as publication holders these days, let alone as data sources.

Apache PDFBox is a project that provides a comprehensive suite of tools to do things with and to PDF documents.

The aim here is to fill in any gaps in pdftools, since poppler may not try to accommodate all the stupidity that we’re now likely to see.

What’s Inside The Tin

The ability to extract URI annotations

The following functions are implemented:

extract_uris: Extract URI annotations from a PDF document

extract_text: Extract text from a PDF document

pdf_info: Retrieve PDF Metadata

Installation

devtools::install_github("hrbrmstr/pdfboxjars")

devtools::install_github("hrbrmstr/pdfbox")

Usage

library(pdfbox)

# current version

packageVersion("pdfbox")

## [1] '0.3.0'

PDF Info

pdf_info(
  system.file(
    "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"
  )
) -> info

dplyr::glimpse(info)

## Observations: 1

## Variables: 7

## $ title "Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice"

## $ subject ""

## $ author ""

## $ creation_date "2015-08-21T11:06:23-04:00[GMT-04:00]"

## $ modification_date "2015-08-21T11:08:05-04:00[GMT-04:00]"

## $ producer "pdfTeX-1.40.14"

## $ keywords ""
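The date fields above come back as strings with a bracketed zone suffix. If you need them as date-times, here is a minimal base R sketch. It assumes every value has the same shape as the one shown (ISO-8601 timestamp, numeric UTC offset, trailing `[...]` zone id); the variable names are illustrative and not part of the package.

```r
# Timestamp string copied from the pdf_info() output above
raw_date <- "2015-08-21T11:06:23-04:00[GMT-04:00]"

# Drop the trailing bracketed zone id, e.g. "[GMT-04:00]"
no_zone <- sub("\\[.*\\]$", "", raw_date)

# Remove the colon inside the UTC offset so "%z" can read it ("-04:00" -> "-0400")
compact <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", no_zone)

# Parse into POSIXct; the offset is applied and the result is held in UTC
created <- as.POSIXct(compact, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

print(format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

Values that do not match this shape would likely parse to NA, so check for missing values before relying on the result.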

Extract URI Annotations

extract_uris(
  system.file("extdata","imperfect-forward-secrecy-ccs15.pdf", package="pdfbox")
)

## # A tibble: 33 x 3

## page uri text

##

## 1 1 https://weakdh.org WeakDH.org.

## 2 6 www.fbi.gov www.fbi.gov.

## 3 12 http://cr.yp.to/factorization/smoothparts-20040510.pdf http://cr.yp.to/factorization/smoothpar…

## 4 12 http://caramel.loria.fr/p180.txt http://caramel.loria.fr/p180.txt.

## 5 12 http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf http://www.hyperelliptic.org/tanja/

## 6 12 http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf SHARCS/talks06/thorsten.pdf.

## 7 13 https://www.olcf.ornl.gov/titan https://www.olcf.ornl.gov/titan.

## 8 13 http://www.spiegel.de/international/germany/inside-the-nsa-s-war-on-i… http://www.spiegel.de/international/ger…

## 9 13 http://www.spiegel.de/international/germany/inside-the-nsa-s-war-on-i… inside-the-nsa-s-war-on-internet-securi…

## 10 13 http://www.sagemath.org http://www.sagemath.org.

## # … with 23 more rows
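In the output above, a link that wraps across lines shows up as several rows with the same page and uri but different text fragments (rows 5 and 6, for example). One way to get one row per link is sketched below in base R. The data frame is a hand-built stand-in with the same three columns, not a real call to extract_uris(), and joining fragments with an empty string is an assumption that suits wrapped URLs but may not suit wrapped prose.

```r
# Stand-in for a few rows of the extract_uris() result (same column names)
uris <- data.frame(
  page = c(1L, 12L, 12L, 13L),
  uri = c(
    "https://weakdh.org",
    "http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf",
    "http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf",
    "http://www.sagemath.org"
  ),
  text = c(
    "WeakDH.org.",
    "http://www.hyperelliptic.org/tanja/",
    "SHARCS/talks06/thorsten.pdf.",
    "http://www.sagemath.org."
  ),
  stringsAsFactors = FALSE
)

# Collapse rows that share a page and URI, joining text fragments in row order
unique_uris <- aggregate(
  text ~ page + uri,
  data = uris,
  FUN = function(x) paste(x, collapse = "")
)

# Restore page order, since aggregate() sorts by the grouping columns
unique_uris <- unique_uris[order(unique_uris$page), ]

print(unique_uris)
```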

Extract text

extract_text(
  system.file(
    "extdata", "imperfect-forward-secrecy-ccs15.pdf", package="pdfbox"
  )
) -> pg_df

dplyr::glimpse(pg_df)

## Observations: 13

## Variables: 2

## $ page 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

## $ text "Imperfect Forward Secrecy:\nHow Diffie-Hellman Fails in Practice\nDavid Adrian¶ Karthikeyan Bhargavan∗ …
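Since each row holds a whole page of text with embedded newlines, a common next step is to break it into one row per line. The sketch below uses base R on a small stand-in data frame with the same page and text columns; the second page's text is a made-up placeholder, and the name lines_df is illustrative only.

```r
# Stand-in for the extract_text() result: one row per page
pg_df <- data.frame(
  page = 1:2,
  text = c(
    "Imperfect Forward Secrecy:\nHow Diffie-Hellman Fails in Practice",
    "placeholder text for page two\nwith a second line"
  ),
  stringsAsFactors = FALSE
)

# Split each page's text on newlines; the result is a list with one vector per page
page_lines <- strsplit(pg_df$text, "\n", fixed = TRUE)

# Flatten to one row per line, repeating the page number for each of its lines
lines_df <- data.frame(
  page = rep(pg_df$page, lengths(page_lines)),
  line = unlist(page_lines),
  stringsAsFactors = FALSE
)

print(lines_df)
```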

pdfbox Metrics

Lang    # Files   (%)   LoC   (%)   Blank lines   (%)   # Lines   (%)
Java          3  0.18   352  0.57            89  0.51        23  0.15
R            10  0.59   132  0.21            47  0.27        77  0.50
XML           1  0.06    69  0.11             0  0.00         0  0.00
Rmd           1  0.06    27  0.04            31  0.18        52  0.34
Maven         1  0.06    27  0.04             3  0.02         1  0.01
make          1  0.06    10  0.02             5  0.03         1  0.01

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
