Building a List of Figures and Tables for a Word Document with Python

Latest recommended article published on 2022-06-05 15:30:53

一路走来516


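Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_42515781/article/details/113640277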


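Problem description

One of our written projects was up for final acceptance review, and we were told the formatting was not up to standard. On closer inspection, the list of figures and the list of tables were missing. The Word document is more than two hundred pages long, so adding these lists by hand would be a daunting amount of work. That is when I thought of programming.

Attempts

I tried Word's built-in VBA, found it too clumsy (really, my own skills were not up to it), and gave up.

I also tried docx, a Python library for reading Word files, with the code below.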

from docx import Document

document = Document('demo.docx')  # open the file demo.docx

for paragraph in document.paragraphs:
    # A caption line contains '表' (table) and a hyphenated number, but no comma
    if '表' in paragraph.text and '-' in paragraph.text and '，' not in paragraph.text:
        print(paragraph.text)  # print the text of each matching paragraph

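However, this approach only reads the table captions and cannot tell where each table is located. A list that perfunctory would probably get my hard work sent back again, which would be frustrating. So I decided to put in a little more effort and extract the page positions as well.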
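The filter in the loop above can be read as a small predicate on a line of text. The following is a minimal sketch of that heuristic, not the author's code: the function name looks_like_caption and the keyword parameter are illustrative, and the extra punctuation marks are the ones excluded in the stricter PDF version later in this article.

```python
def looks_like_caption(text, keyword="表"):
    """Heuristic caption test.

    A caption such as "表3-1 实验参数" contains the keyword and a hyphenated
    number, but none of the punctuation marks found in running sentences.
    """
    sentence_marks = ("，", "。", "；")
    if keyword not in text or "-" not in text:
        return False
    return not any(mark in text for mark in sentence_marks)


if __name__ == "__main__":
    for line in ["表3-1 实验参数", "如表3-1所示，误差较小"]:
        print(line, looks_like_caption(line))
```

Passing keyword="图" would apply the same test to figure captions, assuming figures are numbered in the same hyphenated style. Being a heuristic, it will miss captions numbered without a hyphen and may match body text that happens to contain no punctuation.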

Approach

Convert the docx file to a PDF file, then read the PDF with Python to obtain the page number of each caption.
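The article does not say how the conversion was done; exporting to PDF from Word itself is the obvious route. For a scripted alternative, here is a hedged sketch that calls LibreOffice in headless mode. It assumes the soffice executable is installed and on the PATH, and the helper names build_convert_command and convert_to_pdf are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir=".", soffice="soffice"):
    """Build the LibreOffice command line that exports a document to PDF."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir="."):
    """Run the conversion and return the expected path of the PDF."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

One caveat: LibreOffice may paginate a document differently from Word, so page numbers taken from such a PDF can be off. If the list has to match the Word layout exactly, exporting the PDF from Word is the safer choice.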

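Environment

Language: Python 3.6+

Library: pdfminer (the imports below appear to follow the older pdfminer3k package layout; pdfminer.six exposes a different API)

Method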

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams

f_text = open('pdfoutput.txt', 'w', encoding='utf-8')  # output file for the generated list
fp = open('demo.pdf', 'rb')  # demo.pdf, converted from the Word document

# Create a PDF parser
parser = PDFParser(fp)
# Create a PDF document object to hold the document structure
document = PDFDocument()
# Connect the parser and the document object
parser.set_document(document)
document.set_parser(parser)
# Initialize the document (a password can be supplied here)
document.initialize()

# Check whether the file allows text extraction
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager to hold shared resources
    rsrcmgr = PDFResourceManager()
    # Set the layout analysis parameters
    laparams = LAParams()
    # Create a PDF device object
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process each page
    for page in document.get_pages():
        # Parse the page
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                text = x.get_text()
                # Caption lines contain '表' and '-', but no sentence punctuation
                if '表' in text and '-' in text and '，' not in text and '。' not in text and '；' not in text:
                    # pageid is the page's sequence number in the PDF, the same
                    # value the original code sliced out of str(layout)
                    Page = str(layout.pageid)
                    f_text.write(text.replace("\n", "") + '........' + Page + '\n')

f_text.close()
fp.close()

It works great!
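For readers on the maintained pdfminer.six package, the same idea can be sketched with its high-level extract_pages function. This is an illustration under stated assumptions, not the author's method: the helper names format_entry, collect_entries and iter_pdf_text_blocks are hypothetical, and the caption test is passed in as a predicate so that any filter can be used. The page number here is the page's position in the PDF, which can differ from the printed number if the document has unnumbered front matter.

```python
def format_entry(caption, page, leader="........"):
    """Format one line of the list: caption, dot leader, page number."""
    return f"{caption}{leader}{page}"


def collect_entries(pages, is_caption):
    """Build list lines from (page_number, text_blocks) pairs.

    is_caption is any function that takes a string and returns True
    when the string looks like a caption.
    """
    entries = []
    for page_number, blocks in pages:
        for block in blocks:
            caption = block.replace("\n", "").strip()
            if is_caption(caption):
                entries.append(format_entry(caption, page_number))
    return entries


def iter_pdf_text_blocks(pdf_path):
    """Yield (page_number, text_blocks) for each page of a PDF.

    pdfminer.six is imported lazily so that the helpers above can be
    used and tested without it installed.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        blocks = [el.get_text() for el in layout if isinstance(el, LTTextContainer)]
        yield page_number, blocks
```

A possible call, reusing the filter from the code above as the predicate, would be collect_entries(iter_pdf_text_blocks('demo.pdf'), lambda t: '表' in t and '-' in t and '，' not in t), with the returned lines then written to pdfoutput.txt.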
