【Parsing Text from PDF Files with Python】

LaiSec

Published 2023-10-16 14:11:07



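import io
import re

import PyPDF2
import requests

# URL the PDF is downloaded from
url = 'https://www.baidu.com/link?url=29OahkXD4qEv8Yg4mqN6qrrDmISTcpLOjOZJ08fdu15qVLM74jSTCXmCnHGjx2lXOeM4CrWWxB6Y1ya8mtVfXMlxJgFvZxKZiitNWS2AEn7IlfaTRgsluZqHRH4bfNmcWSpMBeISAZUnQja6sibTlq&wd=&eqid=f7fcca8700022d400000000664dd9230'

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
print(type(response.content))  # <class 'bytes'>


# PDF
# The reader needs a file-like object, so wrap the downloaded bytes in BytesIO.
# For a local file, use open('example.pdf', 'rb') in place of the BytesIO object.
with io.BytesIO(response.content) as pdfFile:
    print(type(pdfFile))
    # PdfFileReader, getPage and extractText were removed in PyPDF2 3.0;
    # PdfReader, pages and extract_text are their replacements.
    pdfReader = PyPDF2.PdfReader(pdfFile)
    pdfText = pdfReader.pages[0].extract_text()
    # Use regular expressions to remove newlines and extra whitespace
    pdfText = re.sub(r'\n', '', pdfText)
    pdfText = re.sub(r'\s+', ' ', pdfText)

print(pdfText)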
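
The script above reads only the first page. As a rough sketch, assuming PyPDF2 2.x or later (where `PdfReader`, `pages` and `extract_text` are available), the same steps can be wrapped in two small helpers that cover every page. The names `normalize_whitespace` and `extract_all_pages` are hypothetical and do not come from the original script.

```python
import io
import re


def normalize_whitespace(text: str) -> str:
    """Apply the same cleanup as the script: drop newlines, then collapse whitespace."""
    text = re.sub(r'\n', '', text)
    return re.sub(r'\s+', ' ', text)


def extract_all_pages(pdf_bytes: bytes) -> list[str]:
    """Return the cleaned text of each page of a PDF given as raw bytes."""
    # Imported here so normalize_whitespace can be used without PyPDF2 installed.
    import PyPDF2

    reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
    # Fall back to '' in case a page yields no text (for example, a scanned image).
    return [normalize_whitespace(page.extract_text() or '') for page in reader.pages]
```

With the download from the script, `extract_all_pages(response.content)` would give one string per page. Note that removing `\n` outright joins the last word of one line to the first word of the next, which suits Chinese text but may need a space for English text.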