Python regular expressions explained (PDF): parsing a PDF file with regular expressions in Python

I am trying to parse some object elements from a PDF file using Python's re module. My goal is to parse each PDF object using a regular expression.

A PDF object example is the following:

1 0 obj

<<

/Type /Catalog

/Pages 2 0 R

>>

endobj

2 0 obj

<<

/Type /Pages

/Kids [ 3 0 R ]

/Count 1

>>

endobj

...

When I use "\d+\s\d+\sobj[\s,\S]*endobj" it doesn't work: it keeps matching until the last endobj is found. How can I modify the regular expression to parse each object separately (in other words, the part from 1 0 obj up to the first endobj)?

Solution

If you rely on regex alone, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regex can't handle recursive structures, at least not with Python's re module.
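
For the simple case in the question, the immediate problem is that `*` is greedy, so the match runs to the last endobj. A non-greedy `*?` stops at the first one. (The comma in `[\s,\S]` is redundant, since the class already matches any character.) The following is only a sketch: the names OBJ_RE, sample and objects are illustrative, and it assumes the word endobj never appears inside a string or a stream, which a real PDF does not guarantee.

```python
import re

# Non-greedy body: stop at the first "endobj" after each "N G obj" header.
# Assumption: "endobj" never occurs inside a string or stream payload.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)

sample = b"""1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
"""

# Map (object number, generation number) -> raw object body.
objects = {
    (int(m.group(1)), int(m.group(2))): m.group(3).strip()
    for m in OBJ_RE.finditer(sample)
}

for key, body in objects.items():
    print(key, body)
```

The pattern is compiled from bytes rather than str because PDF files are binary; open the file with mode "rb" before matching.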

A PDF file is a tree of objects and streams:

Dictionaries: << (name value)* >>

Lists: [ (value)* ]

Names: / (regular char)*

Strings: ( (char)* )

Hex strings: < (hexchar)* >

Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)

Booleans: true | false

References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R

Whitespace and comments are ignored in most places.

Comments start with % and run until the end of the line.
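
Regex is still useful for the flat part of this grammar, the tokens. The nesting can then be handled by an ordinary recursive function. Below is a minimal sketch of that split, not a complete PDF parser: TOKEN_RE, tokenize and parse_value are hypothetical names, literal strings are assumed to contain no nested or escaped parentheses, and references are returned as plain tuples rather than resolved.

```python
import re

# One named alternative per token kind; "<<" must be tried before "<".
TOKEN_RE = re.compile(rb"""
    (?P<skip>\s+|%[^\r\n]*)              # whitespace and comments
  | (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<list_open>\[) | (?P<list_close>\])
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\([^()]*\))               # simplified: no nesting or escapes
  | (?P<hex><[0-9A-Fa-f\s]*>)
  | (?P<ref>\d+\s+\d+\s+R\b)             # must be tried before plain numbers
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<bool>true\b|false\b)
""", re.VERBOSE)


def tokenize(data):
    """Yield (kind, raw_bytes) pairs, dropping whitespace and comments."""
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if m is None:
            raise ValueError(f"unexpected byte at offset {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            yield m.lastgroup, m.group()


def parse_value(tokens, token=None):
    """Parse one value from a token iterator, recursing into containers."""
    kind, raw = token or next(tokens)
    if kind == "dict_open":
        result = {}
        for kind, key in tokens:
            if kind == "dict_close":
                return result
            if kind != "name":
                raise ValueError("dictionary key must be a name")
            result[key.decode("latin-1")] = parse_value(tokens)
        raise ValueError("unterminated dictionary")
    if kind == "list_open":
        items = []
        for item_token in tokens:
            if item_token[0] == "list_close":
                return items
            items.append(parse_value(tokens, item_token))
        raise ValueError("unterminated list")
    text = raw.decode("latin-1")
    if kind == "ref":
        number, generation, _ = text.split()
        return ("ref", int(number), int(generation))
    if kind == "number":
        return float(text) if "." in text else int(text)
    if kind == "bool":
        return text == "true"
    if kind == "name":
        return text
    if kind in ("string", "hex"):
        return raw
    raise ValueError(f"unexpected token {raw!r}")


sample = b"<< /Type /Pages /Kids [ 3 0 R ] /Count 1 % a comment\n >>"
parsed = parse_value(tokenize(sample))
print(parsed)
```

Passing the generator itself (not a list) to parse_value matters: the recursive calls and the for loops all advance the same iterator, so each token is consumed exactly once.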

Indirect objects are specified as:

1 0 obj

(any object)

endobj

This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:

1 0 obj

<<

/Length 22

>>

stream

(22 bytes of raw data)

endstream

endobj
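
Streams are the main reason a search for endobj is unreliable: the raw data may contain any bytes, including the text endobj or endstream. A safer approach is to read exactly /Length bytes after the stream keyword. The sketch below assumes /Length is a direct integer (it can also be an indirect reference such as 5 0 R, which this code rejects) and uses the hypothetical helper name read_stream.

```python
import re

# A direct integer /Length, not followed by "G R" (an indirect reference).
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)")
# The "stream" keyword followed by its end-of-line; \b excludes "endstream".
STREAM_RE = re.compile(rb"\bstream\r?\n")


def read_stream(data, obj_start):
    """Return the raw stream bytes of the object starting at obj_start."""
    length_match = LENGTH_RE.search(data, obj_start)
    keyword = STREAM_RE.search(data, obj_start)
    if length_match is None or keyword is None:
        raise ValueError("no stream with a direct /Length found")
    start = keyword.end()
    return data[start:start + int(length_match.group(1))]


# The payload deliberately contains "endobj" to show it is not a delimiter.
payload = b"endobj is just data here"
pdf = (
    b"1 0 obj\n<<\n/Length %d\n>>\nstream\n" % len(payload)
    + payload
    + b"\nendstream\nendobj\n"
)

print(read_stream(pdf, 0))
```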

A PDF file looks something like this:

%PDF-1.4

%ÿÿÿÿ

1 0 obj

<< /Author (MizardX) >>

endobj

2 0 obj

<<

/Type /Catalog

% more required keys

>>

endobj

% many more indirect objects, one after another

xref

0 3

0000000000 65535 f

0000000015 00000 n

0000000054 00000 n

trailer

<<

/Info 1 0 R

/Root 2 0 R

% ... more required keys

>>

startxref

225

%%EOF

The root of the object tree is the trailer dictionary. Every object is referenced, directly or indirectly, from this dictionary.
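
Because the trailer sits at the end of the file, a reader normally starts there rather than scanning from the top. Here is a rough sketch of that first step, assuming a classic trailer keyword (files that use cross-reference streams, introduced in PDF 1.5, have none) and a flat trailer dictionary. The function names find_startxref and trailer_refs are made up for this example.

```python
import re


def find_startxref(data):
    """Return the byte offset recorded after the last 'startxref' keyword."""
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", data))
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1].group(1))


def trailer_refs(data):
    """Return {key: (object number, generation)} for references in the
    last trailer dictionary. Assumes the dictionary has no nested dicts."""
    start = data.rfind(b"trailer")
    if start < 0:
        raise ValueError("no trailer keyword found")
    end = data.index(b">>", start)
    return {
        m.group(1).decode("ascii"): (int(m.group(2)), int(m.group(3)))
        for m in re.finditer(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R\b", data[start:end])
    }


tail = b"trailer\n<<\n/Info 1 0 R\n/Root 2 0 R\n>>\nstartxref\n225\n%%EOF\n"

print(find_startxref(tail))
print(trailer_refs(tail))
```

From /Root you would then look up object 2 0 through the cross-reference table and walk the tree from there.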

There is a lot more complexity hidden inside the streams, but it does not affect the file structure.

The full specification can be found at Adobe's website.
