Uploading Documents (Extracting Text Information)

yufei707

Tags: python

Article link: https://blog.csdn.net/yufei707/article/details/85060009

Extracting a docx document

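from zipfile import ZipFile
from bs4 import BeautifulSoup

document = ZipFile(root_file)  # root_file: path to the uploaded .docx
xml = document.read("word/document.xml")
# An HTML parser lowercases tag names, hence the lowercase comparisons below
root = BeautifulSoup(xml.decode("utf-8"), "html.parser")  # parse the document
texts = root.find_all()  # find_all("w:p") would return only the paragraph tags
i = 0
lista = []
for text in texts:
    if text.name == "w:t":  # text content of the docx
        lista.append(text.text)
    elif text.name == "pic:cnvpr":  # image information of the docx
        i += 1  # an image was found here; read the matching file under media or process it otherwise
        lista.append(i)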
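
As a point of comparison, the same text extraction can probably be done with the standard library alone. The following is a minimal sketch, assuming the usual WordprocessingML namespace; the function names `docx_paragraph_texts` and `make_sample_docx` are made up for illustration, and the sample archive contains only `word/document.xml`, so it is not a complete docx file.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# Main WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(docx_file):
    """Return one string per <w:p> paragraph, joining the text of its <w:t> elements."""
    with ZipFile(docx_file) as archive:  # the archive is closed automatically
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        paragraphs.append("".join(t.text or "" for t in p.iter("{%s}t" % W_NS)))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml (illustration only)."""
    xml = (
        '<w:document xmlns:w="' + W_NS + '"><w:body>'
        '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_paragraph_texts(make_sample_docx()))
```

Because the XML parser keeps the original case and namespaces, there is no need to compare against lowercased tag names here, and the `with` block takes care of closing the archive.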

Text styles can also be retrieved (with python-docx)

from docx import Document

doc_file = Document(root_file)  # open the same file with python-docx
html = ""
for paragraph in doc_file.paragraphs:
    if paragraph.text:
        if paragraph.alignment:
            # str() yields something like "CENTER (1)"; keep only the name (alignment style)
            text_align = str(paragraph.alignment).split('(')[0].strip().lower()
        else:
            text_align = ''
        text = ''
        for run in paragraph.runs:
            if run.bold and run.italic:  # bold italic
                text += "<strong><em>%s</em></strong>" % run.text
            elif run.italic and not run.bold:  # italic
                text += "<em>%s</em>" % run.text
            elif run.bold and not run.italic:  # bold
                text += "<strong>%s</strong>" % run.text
            else:
                text += run.text
        # emit the opening tag once per paragraph, after all runs have been collected
        html += "<p style=\"text-align: %s; font-size: 10px;\">" % text_align + text
    else:
        html += "<p>"
    html += "</p>"
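
One thing the conversion above does not handle is text containing characters such as `<` or `&`, which would end up in the HTML unescaped. The sketch below shows one possible way to factor the run-to-HTML step into small functions that escape the text first. The names `run_to_html` and `paragraph_to_html` are hypothetical, and the runs are passed as plain `(text, bold, italic)` tuples so the example runs without python-docx.

```python
from html import escape


def run_to_html(text, bold=False, italic=False):
    """Wrap one run's text in <strong>/<em> according to its flags."""
    result = escape(text)  # so "<" or "&" in the document cannot break the markup
    if italic:
        result = "<em>%s</em>" % result
    if bold:
        result = "<strong>%s</strong>" % result
    return result


def paragraph_to_html(runs, text_align=""):
    """Build one <p> element; runs is an iterable of (text, bold, italic) tuples."""
    body = "".join(run_to_html(text, bold, italic) for text, bold, italic in runs)
    style = "font-size: 10px;"
    if text_align:
        style = "text-align: %s; %s" % (text_align, style)
    return '<p style="%s">%s</p>' % (style, body)


if __name__ == "__main__":
    print(paragraph_to_html([("a<b", True, False), ("c", True, True)], "center"))
```

With python-docx, the tuples could be produced with something like `[(r.text, r.bold, r.italic) for r in paragraph.runs]`; a flag of `None` (style inherited) is simply treated as false here.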

Getting the text inside tables

for table in doc_file.tables:
    for row in table.rows:
        for cell in row.cells:
            if cell.text:
                html += "<p style=\"font-size: 10px;\">" + cell.text
            else:
                html += "<p>"
            html += "</p>"

Finally

document.close()  # an essential step: close the ZipFile opened at the start

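The remaining problem is image ordering when the same image appears more than once: the images come back in the order of the files under media in the zip, and that order is not guaranteed to match their order in the document. For example, if the first image in the document was the fourth one inserted, it sits in fourth position under media.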

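Looking at document.xml, the pic:cNvPr element under <w:drawing> carries an id, and when two ids are equal the images are the same, so the image id can be read out and compared.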

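It then turned out that this id is only present when the document was edited locally and then uploaded; documents copied over from elsewhere do not have it when parsed, so this method is incomplete as well.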

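After trying all sorts of approaches, one more piece of data finally turned up: <a:blip r:embed="rId1"> under pic:blipFill. Take the r:embed value and use it to write out the image: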

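import os
import time
from datetime import datetime

# get_random_string and MEDIA_ROOT appear to come from Django (django.utils.crypto / settings)
from django.utils.crypto import get_random_string
from docx import Document

doc = Document(root_file)
# rid is the r:embed value (e.g. "rId1") taken from <a:blip>
img = doc.part.related_parts[rid]
# build the file name from a timestamp plus a random 4-digit code
code = get_random_string(length=4, allowed_chars='0123456789')
string = "/" + str(int(time.time())) + code + ".jpg"
# year/month subfolder
file = str(datetime.now().year) + "/" + str(datetime.now().month)
# create the folder if it does not exist yet, then write the image
os.makedirs(MEDIA_ROOT + "/%s" % file, exist_ok=True)
with open(MEDIA_ROOT + "/%s%s" % (file, string), 'wb') as f:
    f.write(img.blob)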
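
To make the r:embed idea concrete, here is a rough standard-library sketch of how the ids could be collected in document order and resolved to files under media through word/_rels/document.xml.rels. It is an alternative illustration, not the python-docx route used above. The function names are invented for this example, the sample XML is heavily simplified (a real `<w:drawing>` has more nesting), and the code assumes relationship targets are relative to the word/ folder.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_image_rids(document_xml):
    """Return the r:embed value of every <a:blip>, in document order."""
    root = ET.fromstring(document_xml)
    rids = []
    for blip in root.iter("{%s}blip" % A_NS):
        rid = blip.get("{%s}embed" % R_NS)
        if rid is not None:
            rids.append(rid)
    return rids


def relationship_targets(rels_xml):
    """Map each relationship Id (e.g. "rId4") to its Target path."""
    root = ET.fromstring(rels_xml)
    return {rel.get("Id"): rel.get("Target")
            for rel in root.iter("{%s}Relationship" % REL_NS)}


def images_in_document_order(docx_file):
    """Return (rid, target, image_bytes) tuples in the order images appear."""
    with ZipFile(docx_file) as archive:
        rids = embedded_image_rids(archive.read("word/document.xml"))
        targets = relationship_targets(archive.read("word/_rels/document.xml.rels"))
        # Assumption: targets are relative to word/, e.g. "media/image1.png"
        return [(rid, targets[rid], archive.read("word/" + targets[rid]))
                for rid in rids]


def _drawing(rid):
    """Simplified paragraph holding one picture (illustration only)."""
    return ('<w:p><w:r><w:drawing><pic:pic><pic:blipFill>'
            '<a:blip r:embed="%s"/></pic:blipFill></pic:pic></w:drawing></w:r></w:p>' % rid)


def make_sample_docx():
    """Build an in-memory archive where document order differs from media order."""
    document_xml = (
        '<w:document'
        ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
        ' xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"'
        ' xmlns:a="' + A_NS + '" xmlns:r="' + R_NS + '"><w:body>'
        + _drawing("rId5") + _drawing("rId4") + _drawing("rId5")
        + '</w:body></w:document>'
    )
    image_type = R_NS + "/image"
    rels_xml = (
        '<Relationships xmlns="' + REL_NS + '">'
        '<Relationship Id="rId4" Type="' + image_type + '" Target="media/image1.png"/>'
        '<Relationship Id="rId5" Type="' + image_type + '" Target="media/image2.png"/>'
        '</Relationships>'
    )
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
        archive.writestr("word/_rels/document.xml.rels", rels_xml)
        archive.writestr("word/media/image1.png", b"first-inserted")
        archive.writestr("word/media/image2.png", b"second-inserted")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for rid, target, data in images_in_document_order(make_sample_docx()):
        print(rid, target, len(data))
```

In the sample, the image stored second under media appears first in the document and is used twice, which is the situation described above; walking the `<a:blip>` elements gives the document order regardless of how the files are arranged in the zip.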

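For .doc files, most approaches rely on pywin32, but that package cannot be installed for Python on Linux. Installing antiword on Ubuntu makes it possible to read the .doc this way: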

import subprocess

# forumfile is the uploaded file record; MEDIA_ROOT is the upload directory
try:
    output = subprocess.check_output(['/root/bin/antiword', MEDIA_ROOT + "/" + forumfile.file.name])
    result = output.decode()
except Exception as e:
    print(e)

However, when the document contains a table, the output looks like this:
|1今天 |2 |3 |4 |5 |
|2 |2[pic] |[pic]2 |2 |2 |
and an image can only be output as
[pic]
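
If the pipe-delimited rows are all that is available, they can at least be split back into cells. The following is a small sketch, assuming every table row starts and ends with `|` as in the output shown above; `parse_antiword_table` is a made-up helper name, not part of antiword.

```python
def parse_antiword_table(text):
    """Split antiword-style "|cell |cell |" lines into lists of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Only lines fully wrapped in pipes are treated as table rows
        if line.startswith("|") and line.endswith("|"):
            cells = [cell.strip() for cell in line[1:-1].split("|")]
            rows.append(cells)
    return rows


if __name__ == "__main__":
    sample = "|1今天 |2 |3 |4 |5 |\n|2 |2[pic] |[pic]2 |2 |2 |"
    for row in parse_antiword_table(sample):
        print(row)
```

The `[pic]` markers stay in the cell text, so the position of an image within a table is still visible even though the image itself is not recovered.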

No other approach for now; still learning.
