June 8: Common ways to process PDF and Word documents in Python

Latest recommended article published 2025-02-06 14:21:11

Hali_Botebie


Category: python笔记 (Python notes)

Article link: https://blog.csdn.net/djfjkj52/article/details/91347013


PyPDF2

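PyPDF2 is a Python module for working with PDF files (it does not handle Word documents; python-docx, covered later, does that). Import it before use.

To open a PDF document, the steps are:

Open the file with open() in binary mode and keep the returned file object in a variable, then pass that variable to PdfFileReader to create a PdfFileReader object. The methods and attributes of that object are then used to work with the PDF.

Using a PdfFileReader object: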

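import PyPDF2  # PyPDF2 < 3.0 API; 3.x renames these to PdfReader, len(reader.pages), reader.pages[i] and extract_text()

# Open in binary mode; the with block closes the file afterwards
with open('meetingminutes.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)        # number of pages, e.g. 19
    pageObj = pdfReader.getPage(0)   # first page (pages are 0-based)
    print(pageObj.extractText())     # text extracted from that page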

Reference: http://copyfuture.com/blogs-details/2f4702509dd5431f7cee8208e768086a

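python-docx usage notes

Python 3 is used here because the documents contain Chinese text, and Python 3 has fewer encoding problems than Python 2.

Install python-docx with: pip3 install python-docx
If that fails, try: easy_install python-docx

A docx document is organized in three levels:

A Document object represents the whole document.
A Document holds a list of Paragraph objects, each representing one paragraph.
A Paragraph holds a list of Run objects. Text in Word is more than a string: it also has a font size, color, typeface, and other attributes, which together make up its style. A Run is a stretch of text that shares one style.
Starting a new Run lets the text take on a new style.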

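Basic operations

Reference: http://python-docx.readthedocs.io/en/latest/
Reference: http://ywtail.github.io/2017/06/30/python-docx使用记录/

The basic operations are opening a document, writing content into it, and saving it. A short example:

from docx import Document
doc = Document()            # no argument: create a new blank document; pass the path of an existing .docx file to open that file instead
doc.add_heading('Hello')    # add a heading
doc.add_paragraph('word')   # add a paragraph
doc.save('test.docx')       # save; the target path argument is required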

A Document exposes the following collections:

doc.paragraphs    # paragraphs
doc.tables        # tables
doc.sections      # sections
doc.styles        # styles
doc.inline_shapes # inline shapes, and so on

The example from http://python-docx.readthedocs.io/en/latest/ is as follows:

from docx import Document
from docx.shared import Inches

document = Document()

document.add_heading('Document Title', 0)

p = document.add_paragraph('A plain paragraph having some ')
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True

document.add_heading('Heading, level 1', level=1)
# Current python-docx releases look styles up by name, written with spaces
document.add_paragraph('Intense quote', style='Intense Quote')

document.add_paragraph(
    'first item in unordered list', style='List Bullet'
)
document.add_paragraph(
    'first item in ordered list', style='List Number'
)

# monty-truth.png must exist in the working directory
document.add_picture('monty-truth.png', width=Inches(1.25))

# Sample rows (qty, id, desc) standing in for the undefined recordset
records = (
    (3, '101', 'Spam'),
    (7, '422', 'Eggs'),
    (4, '631', 'Spam, spam, eggs, and spam'),
)

table = document.add_table(rows=1, cols=3)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Qty'
hdr_cells[1].text = 'Id'
hdr_cells[2].text = 'Desc'
for qty, item_id, desc in records:
    row_cells = table.add_row().cells
    row_cells[0].text = str(qty)
    row_cells[1].text = item_id
    row_cells[2].text = desc

document.add_page_break()

document.save('demo.docx')

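Reading and writing headings

Background: the headings of one document had to be copied into another, but they were scattered throughout the text and copying them by hand was tedious, so docx was used instead.

Open the document, get all of its paragraphs (headings are paragraphs too), and print each paragraph's style to see which levels the headings of interest use.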

import docx
doc = docx.Document('filename.docx')  # open the document

ps = doc.paragraphs
for p in ps:
    print(p.style.name)  # style name, e.g. 'Heading 1' or 'Normal'

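The output shows that in this document (filename.docx) the heading styles are Heading 1, Heading 2, and Heading 3 (other documents may use different styles). Match these headings through p.style.name and store each heading's text and level in the list re for later use.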

re = []  # (text, level) pairs; note that this name shadows the standard re module
for p in ps:
    if p.style.name == 'Heading 1':
        re.append((p.text, 1))
    elif p.style.name == 'Heading 2':
        re.append((p.text, 2))
    elif p.style.name == 'Heading 3':
        re.append((p.text, 3))

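With the heading texts and levels collected, "unzip" the re list with titles,titledes=zip(*re): the heading texts end up in titles and the levels in titledes (re must be non-empty for the unpacking to work). Then write the headings into a new document.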

titles, titledes = zip(*re)  # split the pairs into heading texts and levels
newdoc = docx.Document()
for title, level in zip(titles, titledes):
    newdoc.add_heading(title, level=level)
newdoc.save('newfile.docx')
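The three hard-coded comparisons can be generalized by parsing the level out of the style name. The sketch below uses only the standard library and assumes the built-in English style names of the form "Heading N"; heading_level and collect_headings are hypothetical helper names, not python-docx functions.

```python
import re as regex  # aliased so it does not clash with the list named re above

HEADING = regex.compile(r"^Heading ([1-9])$")


def heading_level(style_name):
    """Return the level for 'Heading N' style names, or None for anything else."""
    match = HEADING.match(style_name)
    return int(match.group(1)) if match else None


def collect_headings(named_texts):
    """Turn (style_name, text) pairs into (titles, levels); empty input is safe."""
    pairs = [(text, heading_level(name)) for name, text in named_texts]
    pairs = [(text, level) for text, level in pairs if level is not None]
    if not pairs:
        return (), ()          # zip(*[]) cannot be unpacked into two names
    titles, levels = zip(*pairs)
    return titles, levels


sample = [("Heading 1", "Intro"), ("Normal", "body text"), ("Heading 2", "Details")]
print(collect_headings(sample))
```

With python-docx, the input could be built as [(p.style.name, p.text) for p in doc.paragraphs], and the two returned tuples take the place of titles and titledes.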

Getting table contents

Background: the contents of the second and third columns of every table in a document are needed.

Open the document

import docx
doc = docx.Document('filename.docx')  # open the document

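doc.tables returns the tables in the document. The rows, columns, and cell objects are useful when traversing a table.

A Table object has two properties, rows and columns, which behave like a list of Row objects and a list of Column objects. Iterating, taking the length, and similar list operations therefore work on them as well.

Cell is another frequently used object. There are five ways to get Cell objects:

Call the Table object's cell(row, col) method. The top-left cell is at 0, 0.
Call the Table object's row_cells(row_index) method to get a list of all the cells in one row, ordered by column.
Given a Row object, read Row.cells to get all the cells of that row, ordered by column.
Call the Table object's column_cells(column_index) method to get a list of all the cells in one column, ordered by row.
Given a Column object, read Column.cells to get all the cells of that column, ordered by row.

To visit every Cell, either iterate over the rows (table.rows) and then over the cells of each row, or iterate over the columns (table.columns) and then over the cells of each column.

The most commonly used Cell property is text. Assigning to it sets the cell's content; reading it returns the cell's content.

For example: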

for table in doc.tables:         # every table in the document
    for row in table.rows:       # every row of the table
        t1 = row.cells[1].text   # text of the 2nd column (indices start at 0)
        t2 = row.cells[2].text   # text of the 3rd column

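After extracting the table data, store it in a pandas DataFrame (df below) and save it as a CSV file. If Chinese text comes out garbled, add encoding='gb2312' as the last argument:
df.to_csv('filename.csv',index=False,encoding='gb2312')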
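When pandas is not available, the same export can be done with the standard csv module. This is a sketch of that alternative, not the approach used above; save_rows and load_rows are hypothetical helper names, and gb2312 is kept as the default only to match the call above.

```python
import csv


def save_rows(rows, path, encoding="gb2312"):
    """Write rows (sequences of strings) to a CSV file in the given encoding."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding=encoding) as fh:
        csv.writer(fh).writerows(rows)


def load_rows(path, encoding="gb2312"):
    """Read the CSV back as a list of lists of strings."""
    with open(path, newline="", encoding=encoding) as fh:
        return list(csv.reader(fh))
```

The rows could come from the table loop, for example [(row.cells[1].text, row.cells[2].text) for table in doc.tables for row in table.rows]. Characters outside the GB2312 set raise UnicodeEncodeError, in which case a wider encoding such as utf-8-sig may be a better fit.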

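Creating tables

The first two arguments of Document.add_table set the number of rows and columns, and the third sets the table style. The style can also be read and set through the table's style property. To set it, use the style's English name directly, for example 'Table Grid'. Reading it returns a Style object. According to the referenced notes this object can be reused across documents, presumably as long as the target document defines the same style. Its name is available through the Style.name property.

The following creates a table with 6 rows and 2 columns; cells can be filled through table.cell(i,j).text.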

doc = docx.Document()
table = doc.add_table(rows=6, cols=2, style='Table Grid')  # solid borders
table.cell(0, 0).text = '编号'  # "No."
table.cell(1, 0).text = '位置'  # "Location"

The columns of the table created above all have the same width; setting the column widths makes it look better.

from docx.shared import Inches
for t in doc.tables:
    for row in t.rows:
        row.cells[0].width=Inches(1)
        row.cells[1].width=Inches(5)

References

http://python-docx.readthedocs.io/en/latest/
http://www.itwendao.com/article/detail/172784.html (Reading Word documents with Python: python-docx)
http://yshblog.com/blog/40 (Reading and writing docx files with Python)
http://www.ctolib.com/topics-57923.html (Working with tables: reading and writing Office documents with Python, part 3)

Functions for setting Word properties from Python

In Office 2007 the VB editor cannot be opened directly; press Alt + F11 to open it.

import win32com.client      # COM client module from pywin32 (Windows only)
WordApp = win32com.client.Dispatch("Word.Application") # start the Word application object
WordApp.Visible = True      # show the Word application window

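1. Creating a Word document

doc = WordApp.Documents.Add()     # create a new blank document
doc = WordApp.Documents.Open(r"d:\2011专业考试计划.doc") # or open an existing document
doc.SaveAs(r"d:\2011专业考试计划.doc")  # save the document
doc.Close(-1)      # save, then close; doc.Close() or doc.Close(0) closes directly without saving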

2. Page setup

doc.PageSetup.PaperSize = 7     # paper size: A3=6, A4=7
doc.PageSetup.PageWidth = 21*28.35    # set the page width directly (21 cm); this overrides the PaperSize setting
doc.PageSetup.PageHeight = 29.7*28.35        # set the page height directly (29.7 cm)
doc.PageSetup.Orientation = 1                # page orientation: portrait=0, landscape=1
doc.PageSetup.TopMargin = 3*28.35           # top margin = 3 cm; 1cm=28.35pt
doc.PageSetup.BottomMargin = 3*28.35         # bottom margin = 3 cm
doc.PageSetup.LeftMargin = 2.5*28.35         # left margin = 2.5 cm
doc.PageSetup.RightMargin = 2.5*28.35        # right margin = 2.5 cm
doc.PageSetup.TextColumns.SetCount(2)        # lay the page out in 2 text columns
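The factor 28.35 used above is the number of points per centimetre (72 points per inch divided by 2.54 cm per inch). A small standard-library sketch makes the conversion explicit; cm_to_pt and a4_page are hypothetical names introduced here.

```python
POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def cm_to_pt(cm):
    """Convert centimetres to points (1 cm is about 28.35 pt)."""
    return cm * POINTS_PER_INCH / CM_PER_INCH


# A4 page with the margins used above, expressed in points
a4_page = {
    "PageWidth": cm_to_pt(21),
    "PageHeight": cm_to_pt(29.7),
    "TopMargin": cm_to_pt(3),
    "BottomMargin": cm_to_pt(3),
    "LeftMargin": cm_to_pt(2.5),
    "RightMargin": cm_to_pt(2.5),
}

for name, value in a4_page.items():
    print(f"{name} = {value:.2f}")
```

The dictionary keys match the PageSetup property names, so on a machine with Word the values could be assigned to doc.PageSetup one by one in the same way as the listing above.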

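3. Formatting

sel = WordApp.Selection       # get the Selection object
sel.InsertBreak(8)                # insert a break: column break=8, page break=7
sel.Font.Name = "黑体"                 # font name (SimHei)
sel.Font.Size = 24                     # font size
sel.Font.Bold = True                  # bold
sel.Font.Italic = True                 # italic
sel.Font.Underline = True              # underline
sel.ParagraphFormat.LineSpacing = 2*12   # line spacing in points; 1 line = 12 pt
sel.ParagraphFormat.Alignment = 1      # paragraph alignment: 0=left, 1=center, 2=right
sel.TypeText("XXXX")       # insert text
sel.TypeParagraph()       # insert a paragraph break (blank line)
Note: a changed ParagraphFormat setting only takes effect again after TypeParagraph() has been called!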

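4. Inserting a picture

pic = sel.InlineShapes.AddPicture(jpgPathName) # insert a picture (jpgPathName is the image path); inline by default
pic.WrapFormat.Type = 0           # text wrapping: 0=square, 1=tight, 3=in front of text, 5=behind text
pic.Borders.OutsideLineStyle = 1          # border on all 4 sides of the picture, 1=solid line
pic.Borders.OutsideLineWidth = 8          # border width; the dialog values map to 2,4,6,8,12,18,24,36,48
pic.Borders(-1).LineStyle = 1             # -1=top border, -2=left, -3=bottom, -4=right
pic.Borders(-1).LineWidth = 8             # one of 2,4,6,8,12,18,24,36,48
Note: a picture inserted through InlineShapes behaves like an inserted character (inline), whereas a picture inserted through Shapes floats by default.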

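5. Inserting a table

tab = doc.Tables.Add(sel.Range, 16, 2)  # add a table with 16 rows and 2 columns
tab.Style = "网格型"       # localized name of the Table Grid style; shows the table borders
tab.Columns(1).SetWidth(5*28.35, 0)   # width of column 1 (5 cm); 1cm=28.35pt
tab.Columns(2).SetWidth(9*28.35, 0)   # width of column 2 (9 cm)
tab.Rows.Alignment = 1                    # table alignment: 0=left, 1=center, 2=right
tab.Cell(1,1).Range.Text = "xxx"    # fill in content; note that Excel uses wSheet.Cells(i,j) instead
sel.MoveDown(5, 16)       # move down 16 lines; 5 = move in units of lines
Note: after inserting a table with n rows, call MoveDown(5,n) to move past the table before doing anything else, otherwise an error is raised!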

6. Using styles

for stl in doc.Styles:
    print(stl.NameLocal)   # print the name of every style in the document

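Where to find detailed help for win32com, win32con, and win32gui

I needed win32com today but could not find its documentation; after some Googling I found the answer on 哲思 (Zhesi):

The installation package includes a .chm help file, which is installed under the Lib\site-packages directory.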

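python-2.7: reading custom document properties from an MS Word file with Python

How can the document properties of an MS Word 2010 document be read with Python?
By document properties I mean the ones that can be added or modified under File -> Info -> Properties -> Advanced Properties (in MS Word 2010).

I am using Python 2.7 on 64-bit Windows 7 with the matching pywin32 (win32com) build to access the doc file...

I found the CustomProperty object, whose Value and Name members look like what I need (http://msdn.microsoft.com/en-us/library/bb257518%28v=office.12%29.aspx)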

import win32com.client as win32

file = r'C:\path\to\document.docx'  # hypothetical path; replace with the document to inspect
word = win32.Dispatch("Word.Application")
word.Visible = 0                    # keep Word hidden
doc = word.Documents.Open(file)
try:
    csp = doc.CustomDocumentProperties('property_you_want_to_know').value
    print('property is %s' % csp)

except Exception as e:              # e.g. the property does not exist
    print('\n\n', e)

doc.Saved = False
doc.Save()
doc.Close()

word.Quit()