python-docx basics: batch processing docx files with Python

Latest recommended article published 2024-07-20 09:55:03

Magician_liu


Article link: https://blog.csdn.net/Magician_liu/article/details/140416163


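Working with Word documents in Python: python-docx

python-docx official documentation

Install python-docx (note: it can only handle .docx files, not .doc files; convert a .doc file to .docx first)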

pip install python-docx

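1. Introduction

Before using the library, it helps to understand the structure of a docx file, which makes the code that follows easier to follow.

The Word documents we use every day contain paragraphs and tables. A paragraph can contain text with different formatting, and within a paragraph each stretch of text with its own formatting becomes a separate run.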

[Image: structure of a docx document]

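Document: a Word document object. Unlike the Worksheet concept in VBA, a Document is independent: opening different Word documents yields different Document objects that do not affect one another.

Paragraph: a paragraph. A Word document is made up of multiple paragraphs. Pressing Enter in the document starts a new paragraph, while Shift + Enter inserts a line break without starting a new paragraph.

Run: a run. Each paragraph is made up of one or more runs; a run is a stretch of consecutive text within a paragraph that shares the same formatting, so every paragraph object has a list of runs.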

An example of how a paragraph is divided into runs:
[Image: a paragraph divided into runs]

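Commonly used imports

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # paragraph alignment (centered, justified, etc.)
from docx.enum.text import WD_TAB_ALIGNMENT, WD_TAB_LEADER  # tab stop alignment and leader characters
from docx.shared import Pt  # lengths in points (font size, indentation, etc.)
from docx.shared import RGBColor  # font color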

2. Using the python-docx library

2.1 Reading docx files

The docx file I prepared looks like this:

[Image: screenshot of the sample docx file]

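2.1.1 Reading paragraph text

First, get the paragraphs

1. document.paragraphs returns a list containing an instance of each paragraph; it can be indexed, sliced, and iterated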

# Import the Document class
from docx import Document
# Load the docx file and return a document object, doc
doc = Document('./magician.docx')
# doc.paragraphs returns the document's paragraphs as a list, which can be indexed
print(doc.paragraphs)  # list of Paragraph objects
print(doc.paragraphs[0])
print(doc.paragraphs[0:2])

Output:

[<docx.text.paragraph.Paragraph object at 0x000002F19B3ED3C0>, <docx.text.paragraph.Paragraph object at 0x000002F19B3ED570>, <docx.text.paragraph.Paragraph object at 0x000002F19B3ED450>, <docx.text.paragraph.Paragraph object at 0x000002F19B3ED630>]
<docx.text.paragraph.Paragraph object at 0x000002F19B3ED570>
[<docx.text.paragraph.Paragraph object at 0x000002F19B3ED450>, <docx.text.paragraph.Paragraph object at 0x000002F19B3ED630>]

2. paragraph.text returns the text content of that paragraph

from docx import Document
doc = Document('./magician.docx')
paras = doc.paragraphs
for para in paras:  # iterate over the paragraph objects; .text gives each paragraph's text
    print(para.text)

Output:

Document: 是一个 Word 文档 对象，不同于 VBA 中 Worksheet 的概念，Document 是独立的，打开不同的 Word 文档，就会有不同的 Document 对象，相互之间没有影响
Paragraph: 是段落，一个 Word 文档由多个段落组成，当在文档中输入一个回车键，就会成为新的段落，输入 shift + 回车，不会分段
Run: 表示一个节段，一个段落由多个节段组成，一个段落中具有相同样式的连续文本，组成一个节段，所以一个段落对象有Run列表

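3. A paragraph can contain text blocks with different formatting; paragraph.runs returns the list of these run objects

Runs and text: paragraph.runs returns a list containing every run, which can be indexed, sliced, and iterated; run.text returns the text of that run.

Example:

# Read the text of each run to check how the paragraph is divided
from docx import Document
doc = Document('./magician.docx')
paras = doc.paragraphs
runs = paras[0].runs  # look at the runs of the first paragraph
for run in runs:
    print(run.text)

Output: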

Document: # first run
是一个 Word 文档 对象，不同于 VBA 中 Worksheet 的概念，   # second run
Document 是独立的，打开不同的 Word 文档，就会有不同的 Document 对象，相互之间没有影响  # third run

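The first paragraph of the docx file looks like this (verified: the runs really are split wherever the formatting changes):

[Image: first paragraph of the sample docx file]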

2.1.2 Reading tables in a Word document

I created the following table in Word:
[Image: the sample table in Word]

from docx import Document
doc = Document('./magician.docx')
tables = doc.tables
print(tables)  # [<docx.table.Table object at 0x000001DF405FD300>]: one Table object found
table = tables[0]  # take the Table object out of the list

print('按行读取：---------------------------------------------------')  # label: "read by row"
# Read row by row
for row in table.rows:
    for cell in row.cells:
        print(cell.text, cell.paragraphs)
print('按列读取：--------------------------------------------------')  # label: "read by column"
# Read column by column
for col in table.columns:
    for cell in col.cells:
        print(cell.text, cell.paragraphs)

Output:

按行读取：---------------------
姓名 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
年龄 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
性别 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
年级 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
户籍 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
政治面貌 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
存款 [<docx.text.paragraph.Paragraph object at 0x000001F957A09DB0>]
Magician [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
18 [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
M [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
22 [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
江西 [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
团员 [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
1000 [<docx.text.paragraph.Paragraph object at 0x000001F957A09E70>]
Xiaoming [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
20 [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
M [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
23 [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
福建 [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
群众 [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
2000 [<docx.text.paragraph.Paragraph object at 0x000001F957A097E0>]
小红 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
22 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
F [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
20 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
上海 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
党员 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
10000 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A530>]
按列读取：---------------------
姓名 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A620>]
Magician [<docx.text.paragraph.Paragraph object at 0x000001F957A0A620>]
Xiaoming [<docx.text.paragraph.Paragraph object at 0x000001F957A0A620>]
小红 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A620>]
年龄 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AD40>]
18 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AD40>]
20 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AD40>]
22 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AD40>]
性别 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AE00>]
M [<docx.text.paragraph.Paragraph object at 0x000001F957A0AE00>]
M [<docx.text.paragraph.Paragraph object at 0x000001F957A0AE00>]
F [<docx.text.paragraph.Paragraph object at 0x000001F957A0AE00>]
年级 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AEC0>]
22 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AEC0>]
23 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AEC0>]
20 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AEC0>]
户籍 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AF80>]
江西 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AF80>]
福建 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AF80>]
上海 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AF80>]
政治面貌 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A650>]
团员 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A650>]
群众 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A650>]
党员 [<docx.text.paragraph.Paragraph object at 0x000001F957A0A650>]
存款 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AAA0>]
1000 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AAA0>]
2000 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AAA0>]
10000 [<docx.text.paragraph.Paragraph object at 0x000001F957A0AAA0>]

Observation: each cell in the table also has paragraphs, which matches the diagram at the beginning!
Exercise: count how many times the keyword 段落 ("paragraph") appears in this docx file

# Exercise: count keyword occurrences
from docx import Document
doc = Document('./magician.docx')
num = 0
# Search the body paragraphs first
for para in doc.paragraphs:
    count = para.text.count('段落')
    num += count
# Then search the tables
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            count = cell.text.count('段落')
            num += count
print(num)  # 6

2.1.3 Reading text by rule

para.style.name gives the style name of each paragraph, and that name can be used to read content selectively

1. Reading paragraph style names:

from docx import Document
doc = Document('./magician.docx')
for para in doc.paragraphs:
    print('Style name:', para.style.name)
    print(para.text)

Output (every paragraph in my Word document is body text; there are no headings):

Style name: Normal
Document: 是一个 Word 文档 对象，不同于 VBA 中 Worksheet 的概念，Document 是独立的，打开不同的 Word 文档，就会有不同的 Document 对象，相互之间没有影响
Style name: Normal
Paragraph: 是段落，一个 Word 文档由多个段落组成，当在文档中输入一个回车键，就会成为新的段落，输入 shift + 回车，不会分段
Style name: Normal
Run: 表示一个节段，一个段落由多个节段组成，一个段落中具有相同样式的连续文本，组成一个节段，所以一个段落对象有Run列表
Style name: Normal


The following snippets read paragraphs according to their style:

2. Reading level-1 headings

from docx import Document
doc = Document('./magician.docx')
for para in doc.paragraphs:
    if para.style.name == 'Heading 1':
        print(para.text)

3. Reading level-2 headings

from docx import Document
doc = Document('./magician.docx')
for para in doc.paragraphs:
    if para.style.name == 'Heading 2':
        print(para.text)

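4. Reading all headings [using a regular expression]

import re
from docx import Document
doc = Document('./magician.docx')
for para in doc.paragraphs:
    # Raw string so that \d is passed to the regex engine unchanged
    temp = re.match(r'Heading \d+', para.style.name)
    if temp:
        print(para.text)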

5. Reading body text

from docx import Document
doc = Document('./magician.docx')
for para in doc.paragraphs:
    if para.style.name == 'Normal':
        print(para.text)

2.2 Writing docx files

2.2.1 Writing text

1. Adding a heading and a page break

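from docx import Document
# Calling Document() without a path creates an empty Document object,
# which can be written to disk with doc.save('path')
doc = Document()
# Add a heading
doc.add_heading('Level 1 heading', level=1)
# Add a page break (start a new page)
doc.add_page_break()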

2. Adding a paragraph and runs with formatting

from docx.shared import RGBColor
para = doc.add_paragraph('I am body text; the text after me will be formatted')
para.add_run('a bold run').bold = True
para.add_run('a plain run')
para.add_run('an italic run').italic = True
# Set the font color
para.add_run('I am red text').font.color.rgb = RGBColor(255, 0, 0)

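3. Inserting a paragraph at a specified position:

# Insert a paragraph at a specified position
doc = Document('dancer.docx')
para2 = doc.paragraphs[1]
# Insert before the current second paragraph
para2.insert_paragraph_before('This is the newly added second paragraph')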

Note: don't forget to save once you are done!

# Save the file
doc.save('new.docx')

2.2.2 Writing table data

1. Creating a 2×2 table

from docx import Document
doc = Document()
table = doc.add_table(rows=2, cols=2, style='Light Shading Accent 6')
cell = table.cell(1, 1)
cell.text = 'I am cell text'
# Access cells through a table row
row0 = table.rows[0]
row0.cells[0].text = 'Hello'
row0.cells[1].text = 'magician'
# Access cells through a table column
col0 = table.columns[0]
col0.cells[1].text = 'first column, second row'
# Add a row
row = table.add_row()
doc.save('test.docx')

The generated docx file looks like this:

[Image: the generated table in test.docx]

2. Adding table data in bulk:

from docx import Document
items = (
    (6, '2024', 'AI'),
    (3, '2022', 'Data analysis'),
    (2, '1028', 'Java'),
)
document = Document()
# Add a table with one row and three columns
table = document.add_table(1, 3)
# Fill in the header row
heading_cells = table.rows[0].cells
heading_cells[0].text = 'Quantity'
heading_cells[1].text = 'id'
heading_cells[2].text = 'Course'
# Fill the table with data, one new row per item
for item in items:
    cells = table.add_row().cells
    cells[0].text = str(item[0])
    cells[1].text = item[1]
    cells[2].text = item[2]
# Add a second table
table2 = document.add_table(rows=2, cols=2, style='Light Shading Accent 3')
table2.rows[0].cells[0].text = 'Newly added table'

document.save('test1.docx')

The result looks like this:

[Image: the generated tables in test1.docx]

3. Styles

Styles can be applied at the level of the whole document, a paragraph, or a run; the more specific the level, the higher its priority

python-docx offers a rich set of style options, which can be listed as follows

3.1 Document-level styles

from docx import Document
from docx.enum.style import WD_STYLE_TYPE
# WD_STYLE_TYPE defines four style types:
# WD_STYLE_TYPE.CHARACTER  character (font) styles
# WD_STYLE_TYPE.LIST       list styles
# WD_STYLE_TYPE.PARAGRAPH  paragraph styles
# WD_STYLE_TYPE.TABLE      table styles
doc = Document('./magician.docx')
for style in doc.styles:
    if style.type == WD_STYLE_TYPE.CHARACTER:
        print('CHARACTER', style.name)
    if style.type == WD_STYLE_TYPE.LIST:
        print('LIST', style.name)
    if style.type == WD_STYLE_TYPE.PARAGRAPH:
        print('PARAGRAPH', style.name)
    if style.type == WD_STYLE_TYPE.TABLE:
        print('TABLE', style.name)

Output:

PARAGRAPH Normal
CHARACTER Default Paragraph Font
TABLE Normal Table
TABLE Table Grid


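3.2 Paragraph styles

Paragraph styles cover alignment, list style, line spacing, indentation, background color and so on. They can be set when the paragraph is added or afterwards:

from docx import Document
document = Document()
# Add a paragraph and give it the bulleted list style
document.add_paragraph('I am a bulleted list paragraph', style='List Bullet')

# After adding a paragraph, set the style through its style attribute
paragraph = document.add_paragraph('I am a bulleted list paragraph too')
paragraph.style = 'List Bullet'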

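3.3 Text styles

As the python-docx document structure diagram earlier shows, content with different formatting inside a paragraph is split into multiple runs, and text formatting is set on the run

Setting bold/italic/color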

# Set bold/italic/color
from docx import Document
from docx.shared import Pt, RGBColor
doc = Document()
paragraph = doc.add_paragraph('Add a paragraph')
# Add a run and make its text bold
run = paragraph.add_run('Add a run')
run.bold = True
run.font.bold = True  # bold (same effect as run.bold)
run.font.italic = True  # italic
run.font.underline = True  # underline
run.font.strike = True  # strikethrough
run.font.shadow = True  # shadow
run.font.size = Pt(24)  # font size in points
run.font.color.rgb = RGBColor(255, 0, 0)  # color
run.font.name = 'Arial'  # font name
doc.save('test.docx')