Best way to read the text of a .doc file page by page in Python

再看我把你喝掉

Last modified 2022-04-02 15:23:58

Category: Notes. Tags: python

First published 2022-04-02 15:23:13

Article link: https://blog.csdn.net/woaidianqian/article/details/123921182

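import os
import win32com.client
import pdfplumber

WD_FORMAT_PDF = 17  # Word's wdFormatPDF constant

def CountPages(Filepath):
    # Word resolves relative paths against its own working directory,
    # so pass absolute paths for both the input and the output file.
    in_file = os.path.abspath(Filepath)
    out_file = os.path.abspath("out.pdf")
    # Start a single Word instance over COM (Windows with Word installed).
    word = win32com.client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(in_file)
        try:
            doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()
    finally:
        word.Quit()
    with pdfplumber.open(out_file) as pdf:
        count = 0
        for page in pdf.pages:
            # Guard against pages with no extractable text.
            out = page.extract_text() or ""
            # A page containing both markers ("申请号:" = application number,
            # "审 查 意 见 通 知 书" = office action notice) starts a new notice,
            # so the page count restarts there.
            if "申请号:" in out and "审 查 意 见 通 知 书" in out:
                count = 0
            count += 1
            print(out)
        print(count, "pages")
        return count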
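
The counting rule in the function above (restart the count whenever a page contains both marker strings) can be pulled out into a small pure function, which makes it easy to check without Word or a PDF on hand. This is only a sketch: the name `count_pages_in_last_section` and the `markers` parameter are hypothetical and not part of the original code.

```python
def count_pages_in_last_section(page_texts, markers=("申请号:", "审 查 意 见 通 知 书")):
    """Count pages from the last page containing every marker to the end.

    If no page contains all markers, every page is counted.
    """
    count = 0
    for text in page_texts:
        text = text or ""  # treat pages with no extractable text as empty
        if all(marker in text for marker in markers):
            count = 0  # a header page starts a new section
        count += 1
    return count
```

With pdfplumber it could be fed directly from the page loop, for example `count_pages_in_last_section(page.extract_text() for page in pdf.pages)`.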

Most Python libraries for Word documents can only handle .docx files, and their stability is questionable.

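So consider converting the .doc file directly to PDF instead. Because Word itself does the conversion, the PDF reproduces the document's pages 1:1: each PDF page holds exactly the content of the corresponding Word page, with no reflow or distortion.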

Then read the content of each page with pdfplumber. Don't use PyPDF2 here: the text it extracted came out garbled. pdfplumber is by far the better choice.

With this approach, text can be extracted precisely according to Word's own page boundaries.
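
Once the PDF exists, fetching the text of one specific page by its page number is a simple index lookup. The sketch below assumes the PDF has already been produced by Word; `text_of_page` and `read_pdf_page` are hypothetical helper names, and only `pdfplumber.open`, `pdf.pages` and `page.extract_text()` come from pdfplumber itself.

```python
def text_of_page(pages, page_number):
    """Return the text of a 1-based page number from a sequence of page objects."""
    if not 1 <= page_number <= len(pages):
        raise IndexError(f"page {page_number} is outside 1..{len(pages)}")
    # extract_text() may yield nothing for pages without a text layer
    return pages[page_number - 1].extract_text() or ""


def read_pdf_page(pdf_path, page_number):
    """Open a PDF with pdfplumber and return the text of one page."""
    import pdfplumber  # imported lazily so text_of_page works without it

    with pdfplumber.open(pdf_path) as pdf:
        return text_of_page(pdf.pages, page_number)
```

For example, `read_pdf_page("out.pdf", 3)` would return the text of the third page of the converted document, which corresponds to the third page in Word.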
