Parsing PDF files with PyPDF2 in Python 3 and matching data with regular expressions

零度愿望

Category: Python. Tags: pdf, re

Article link: https://blog.csdn.net/qq_42336573/article/details/83537812

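    import re

    import PyPDF2

    # Uses the PdfReader API of recent PyPDF2 releases. The older names
    # (PdfFileReader, getNumPages, getPage, extractText) are deprecated
    # and no longer work in PyPDF2 3.0.
    line_list = []
    with open('xxx.pdf', mode='rb') as pdf_file:
        read_pdf = PyPDF2.PdfReader(pdf_file)
        # Get the total number of pages in the PDF
        number_of_pages = len(read_pdf.pages)
        # print('total_page: ', number_of_pages)
        # Loop over every page
        for i in range(number_of_pages):
            # Read the text content of this page
            page_content = read_pdf.pages[i].extract_text()
            # Split the page text on whitespace and append the tokens
            # to the list that accumulates all pages
            line_list += page_content.split()
    # The PDF file is closed automatically when the with block exits

    # Rejoin the tokens with single spaces so the regex sees uniform spacing
    line_buf = ' '.join(line_list)
    # print(line_buf)
    # Match the first and second columns, e.g. 000069.sz and 100.
    # Capturing both in one pattern keeps each code paired with the number
    # that follows it; the dot is escaped so it only matches a literal '.'.
    pairs = re.findall(r'([0-9]{6,}\.[a-z]{2,})\s([0-9][0-9,]*)', line_buf)
    # print(pairs)
    # Build the dict: upper-cased code -> integer without thousands separators
    results = {code.upper(): int(num.replace(',', '')) for code, num in pairs}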
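
The column-matching step can be tried without a PDF. The following is a minimal sketch run on a made-up string: the sample text and the helper name `parse_holdings` are assumptions for illustration, not part of the original script. It captures the code and the number in a single pattern, so a code with no number after it is skipped instead of shifting the pairing.

```python
import re

# Hypothetical sample standing in for line_buf; real PDF text will differ.
sample = "000069.sz 1,200 600519.sh 300 note 000001.sz 45,000"

# One pattern, two groups: the code (digits, literal dot, letters) and the
# number that follows it after whitespace.
PATTERN = re.compile(r'([0-9]{6,}\.[a-z]{2,})\s+([0-9][0-9,]*)')


def parse_holdings(text):
    """Return {CODE: int} for every code/number pair found in text."""
    return {
        code.upper(): int(number.replace(',', ''))
        for code, number in PATTERN.findall(text)
    }


results = parse_holdings(sample)
print(results)
```

Whether a single whitespace-separated pattern fits depends on how the text comes out of the PDF, so it is worth printing `line_buf` first and adjusting the pattern to what is actually there.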

Other uses of regular expressions:

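    import re
    import time

    # line_buf holds the extracted PDF text; filename is the output path
    # and must be defined by the caller.
    with open(filename, "w") as fp:
        # Text after "StockDescription:", in the form letters-letters
        fp.write(re.search(r'(StockDescription:)([a-zA-Z]+-[a-zA-Z]+)', line_buf).group(2) + ',')
        # Trade date in %d%B%Y form (day, full month name, year), rewritten as YYYYMMDD
        trade_date = re.search(r'(TradeDate:)([0-9]+[a-zA-Z]+[0-9]+)', line_buf).group(2)
        fp.write(time.strftime('%Y%m%d', time.strptime(trade_date, '%d%B%Y')) + ',')
        # Price: skip the upper-case prefix after "Price:" and drop thousands separators
        fp.write(re.search(r'(Price:[A-Z]+)([0-9.,]+)', line_buf).group(2).replace(',', '') + ',')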
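
The `re.search(...).group(2)` calls above raise `AttributeError` when a label is missing from the text, because `re.search` returns `None`. Below is a minimal sketch of the same extraction with that case handled. The sample string, its values, and the helper name `extract_field` are hypothetical and chosen only to fit the patterns.

```python
import re
import time

# Hypothetical sample text; the labels follow the patterns used above.
sample = "StockDescription:ABC-DEF TradeDate:26October2018 Price:HKD1,234.50"


def extract_field(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


description = extract_field(r'StockDescription:([a-zA-Z]+-[a-zA-Z]+)', sample)
raw_date = extract_field(r'TradeDate:([0-9]+[a-zA-Z]+[0-9]+)', sample)
raw_price = extract_field(r'Price:[A-Z]+([0-9.,]+)', sample)

# %d%B%Y parses day, full month name, and four-digit year.
# %B depends on the locale; English month names are assumed here.
trade_date = time.strftime('%Y%m%d', time.strptime(raw_date, '%d%B%Y'))
price = raw_price.replace(',', '')

print(','.join([description, trade_date, price]))
```

With real input, each field should be checked for `None` before it is converted or written, since any of the three labels may be absent.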