Adding a Table of Contents (Bookmarks) to a PDF with Python

Cosmos Tan

Last modified: 2022-07-19 19:59:14

First published: 2022-07-19 19:55:07

Article link: https://blog.csdn.net/tanqy1997/article/details/125875949

I. Installing the required libraries

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pypdf2
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer3k

1. The PyPDF family

PyPDF2, PyPDF3, and PyPDF4 are mainly used to manipulate PDF files: merging, splitting, and rotating pages.

A typical test script:

from PyPDF2 import PdfReader  # named PdfFileReader before PyPDF2 3.0

pdfpath = 'D:/A_myfile/deep learning  by shu008.pdf'

with open(pdfpath, 'rb') as f:
    pdf = PdfReader(f)
    information = pdf.metadata        # getDocumentInfo() in older releases
    number_of_pages = len(pdf.pages)  # getNumPages() in older releases

    # Print the document metadata together with the page count
    txt = f'''Information about {pdfpath}:
    Author : {information.author},
    Creator : {information.creator},
    Producer : {information.producer},
    Subject : {information.subject},
    Title : {information.title},
    Number of pages : {number_of_pages}
    '''
    print(txt)

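2. pdfplumber

pdfplumber exposes detailed information about every text character, rectangle, and line on each page of a PDF. It also supports table extraction and visual debugging.

A typical test script: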

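import pandas as pd
import pdfplumber

path = 'D:/A_myfile/deep learning  by shu008.pdf'  # same sample file as above

with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        # Other ways to read the current page:
        # print(page.extract_text())    # text only; table rows are merged into plain lines
        # print(page.extract_words())   # words together with their coordinates
        # print(page.extract_tables())  # tables as lists of rows, without coordinates
        # print(page.chars)             # individual characters (not words) with coordinates

        for t in page.extract_tables():
            # Each table is a nested list; a DataFrame is easier to inspect and analyse
            df = pd.DataFrame(t[1:], columns=t[0])
            print(df)
        # Only the first page is used for this test
        break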
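
If pandas is not available, the nested lists returned by extract_tables() can also be handled with plain Python. The following is only a minimal sketch: the helper name table_to_records and the sample data are made up for illustration, and it assumes that the first row is the header and that empty cells arrive as None.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts keyed by header."""
    if not table:
        return []
    # Assumption: empty cells arrive as None; normalise them to ''
    clean = [[(cell or '').strip() for cell in row] for row in table]
    header, rows = clean[0], clean[1:]
    return [dict(zip(header, row)) for row in rows]


# Hypothetical sample, shaped like one element of page.extract_tables()
sample = [
    ['Chapter', 'Page'],
    ['Introduction', '1'],
    ['Linear Algebra', None],
]
records = table_to_records(sample)
print(records)
```

Each record can then be filtered or printed without any extra dependency.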

Source: the jsvine/pdfplumber repository on GitHub. According to the project README, it works best on machine-generated, rather than scanned, PDFs, and it is built on pdfminer.six.

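3. pdfminer3k

pdfminer is a Python 2 library; pdfminer3k is its Python 3 counterpart. It is more cumbersome to use than pdfplumber, so it is not used in this article for now.

import pdfminer
dir(pdfminer)   # list the names the package exposes
pdfminer?       # IPython/Jupyter only: show the docstring
pdfminer??      # IPython/Jupyter only: show the source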

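II. Main code

1. Extracting the relevant information with pdfplumber

The code that collects the heading text and the page numbers has to be adapted to each PDF. In this sample file, the pages of interest contain the string 'Chen Gong', and the text that precedes it on the page is used as the bookmark title.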

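import pdfplumber

pdfpath = 'D:/A_myfile/deep learning  by shu008.pdf'
marker = 'Chen Gong'  # string that follows each heading in this particular PDF

cata_list = []  # (heading text, 1-based page number) pairs
with pdfplumber.open(pdfpath) as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ''  # guard against pages with no extractable text
        if marker in text:
            # 1. Heading: everything on the page that precedes the marker
            title = text.partition(marker)[0]
            # 2. Page number: pdfplumber numbers pages starting from 1
            cata_list.append((title, page.page_number))
print(cata_list)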
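
The matching logic can be tried out without a PDF by running it on plain strings. This is a sketch under assumptions: collect_headings and the sample page texts are invented for illustration, and a page with no text is represented by None.

```python
def collect_headings(page_texts, marker):
    """Return (heading, 1-based page number) for every page whose text contains marker."""
    entries = []
    for page_number, text in enumerate(page_texts, start=1):
        text = text or ''  # treat a page without text as an empty string
        before, found, _ = text.partition(marker)
        if found:  # partition() returns an empty separator when marker is absent
            entries.append((before.strip(), page_number))
    return entries


# Hypothetical page texts standing in for page.extract_text() results
demo_pages = [
    'Preface\nSome front matter',
    'Chapter 1 Introduction\nChen Gong\nBody text ...',
    None,
    'Chapter 2 Linear Algebra\nChen Gong\nMore text ...',
]
demo_entries = collect_headings(demo_pages, 'Chen Gong')
print(demo_entries)
```

Keeping the marker as a parameter makes it easier to adapt the same loop to a PDF that uses a different string.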

2. Adding the bookmarks (addBookmark)

from PyPDF2 import PdfReader, PdfWriter  # PdfFileReader / PdfFileWriter before PyPDF2 3.0

pdfpath = 'D:/A_myfile/deep learning  by shu008.pdf'
pdf_in = PdfReader(pdfpath)
pdf_out = PdfWriter()

# Copy every page of the input into the output document
for page in pdf_in.pages:
    pdf_out.add_page(page)  # addPage() in older releases

# Add one top-level bookmark per entry collected in cata_list
for title, page_num in cata_list:
    page_name = title.strip()        # bookmark title, without the trailing newline
    page_index = int(page_num) - 1   # bookmarks use 0-based page indices
    pdf_out.add_outline_item(page_name, page_index, None)  # addBookmark() in older releases

outpath = 'D:/A_myfile/shu008目录版.pdf'
with open(outpath, 'wb') as fout:
    pdf_out.write(fout)
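
Before the entries are written as bookmarks, it can help to check them, since an empty title or an out-of-range page index is easy to miss. The sketch below is one possible way to do that in plain Python; to_outline_entries and the sample values are hypothetical, and the page count would come from the input PDF.

```python
def to_outline_entries(cata_list, page_count):
    """Convert (title, 1-based page) pairs into clean (title, 0-based index) pairs."""
    entries = []
    for title, page_num in cata_list:
        index = int(page_num) - 1            # bookmark targets are 0-based
        name = ' '.join(str(title).split())  # collapse newlines and repeated spaces
        # Skip entries with an empty title or a page outside the document
        if name and 0 <= index < page_count:
            entries.append((name, index))
    return entries


# Hypothetical input: one usable entry, one empty title, one page out of range
demo = [('Chapter 1\nIntroduction\n', '2'), ('', 3), ('Appendix', 99)]
checked = to_outline_entries(demo, page_count=10)
print(checked)
```

The resulting pairs can be passed to the bookmark loop in place of the raw list.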