Finding identical content in two Word documents with Python


蹦跶的小羊羔


Article link: https://blog.csdn.net/yql_617540298/article/details/108522075


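This article shows how to use the python-docx library to find content that appears in both of two docx files. The text is split into short clauses, the clauses are compared to find matches, and the article provides example source code along with fixes for problems encountered.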


Reference: https://blog.csdn.net/weixin_43145361/article/details/103798581

Reference: https://zhidao.baidu.com/question/326711580304676805.html

Reference: https://blog.csdn.net/weixin_43245453/article/details/108335331

Reference: https://python-docx.readthedocs.io/en/latest/index.html

Reference: https://blog.csdn.net/weixin_42378365/article/details/85017115

1. Using the python-docx library

https://python-docx.readthedocs.io/en/latest/index.html

2. Comparison rules

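The basic idea is to compare the documents clause by clause, so the text is split at punctuation marks such as ，。？！. Splitting produces many short fragments. To make matches easy to locate, this article first splits each document by its original paragraphs and then splits every paragraph into clauses at the punctuation. A Word document therefore becomes [[paragraph 1], [paragraph 2], [paragraph 3], ..., [paragraph n]], and each paragraph becomes [[clause 1], [clause 2], [clause 3], ..., [clause m]].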
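The nested structure described above can be sketched as follows. This is only an illustration, assuming the paragraphs have already been read into a list of strings; the names `CLAUSE_DELIMITERS`, `split_clauses`, and `build_structure` are hypothetical and do not come from the article's source code.

```python
import re

# Punctuation treated as clause boundaries. This set is an assumption based on
# the rule above: ASCII and Chinese commas, full stops, question marks and
# exclamation marks.
CLAUSE_DELIMITERS = r"[,.?!，。？！]"


def split_clauses(paragraph):
    """Split one paragraph into clauses, dropping empty fragments."""
    parts = re.split(CLAUSE_DELIMITERS, paragraph)
    return [p.strip() for p in parts if p.strip()]


def build_structure(paragraphs):
    """Turn a list of paragraph strings into a list of clause lists."""
    return [split_clauses(p) for p in paragraphs]


doc = build_structure(["今天天气很好，我们去公园。你去吗？", "Hello, world!"])
print(doc)
```

Note that splitting on the ASCII full stop also breaks up decimals such as 3.14, so the delimiter set may need adjusting for documents that contain numbers.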

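The comparison loops over the paragraphs of the two documents, compares them pairwise, and prints a result whenever a match is found.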
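A minimal sketch of this pairwise loop is shown below, assuming each document is already a list of paragraphs and each paragraph is a list of clause strings. The function name `find_common_clauses` is hypothetical, and exact string equality is an assumption here; the article's own code may match clauses differently.

```python
def find_common_clauses(doc_a, doc_b):
    """Return (paragraph index in A, paragraph index in B, clause) per shared clause."""
    matches = []
    for i, para_a in enumerate(doc_a):
        for j, para_b in enumerate(doc_b):
            # A set makes the membership test cheap for long paragraphs.
            clauses_b = set(para_b)
            for clause in para_a:
                if clause in clauses_b:
                    matches.append((i, j, clause))
    return matches


doc_a = [["今天天气很好", "我们去公园"], ["明天下雨"]]
doc_b = [["我们去公园", "带上雨伞"], ["明天下雨", "今天天气很好"]]
for i, j, clause in find_common_clauses(doc_a, doc_b):
    print(f"A paragraph {i + 1} / B paragraph {j + 1}: {clause}")
```

Reporting the paragraph indices alongside each clause is what makes a match easy to locate in the original documents.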

3. Source code

# coding=utf-8

from docx import Document
import re, sys, datetime


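def getText(wordname):
    """Return the text of every paragraph in the given docx file as a list."""
    d = Document(wordname)
    texts = []
    for para in d.paragraphs:
        texts.append(para.text)
    return texts

def is_Chinese(word):
    """Return True if the string contains at least one Chinese character."""
    for ch in word:
        # U+4E00 to U+9FFF is the CJK Unified Ideographs block.
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False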

def msplit(s, seperators=r',|\.|\?|，|。|？|！'):
    re