Extracting Tables from PDFs (Listed-Company Financial Statements)

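This article describes a way to extract listed-company financial statements from PDF files with Python, with the goal of storing second-level account data in a database for automated analysis. The code has not been tidied up, but it runs. The extracted tables still need further processing before they can be saved to a database.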

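The code is a bit messy, and I really haven't had time to tidy it up, but it runs without problems. If you find it useful and run into any issues while using it, feel free to contact me.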

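I wrote this code to pull the second-level line items of the statements out of listed companies' financial reports and save them to a database, which makes automated analysis easier. For now the extracted tables are still rather rough, and I haven't worked out how to post-process and clean them so that they can finally be saved to the database.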
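As one possible shape for that post-processing step, here is a minimal sketch that uses only the standard library. Everything in it is an assumption for illustration: the helper names `parse_amount` and `save_rows`, the table name `statement_items`, and the idea that each extracted row is a list of strings with the item name first and an amount second. It is not part of the original script.

```python
import sqlite3
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Turn a cell such as '1,234.56' or '(78.90)' into a Decimal.

    Returns None when the cell is empty, a dash, or not a number.
    """
    cleaned = text.strip().replace(',', '').replace(' ', '')
    if cleaned in ('', '-', '--'):
        return None
    # Accounting style: a value in parentheses is negative.
    negative = cleaned.startswith('(') and cleaned.endswith(')')
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


def save_rows(conn, report_name, rows):
    """Store (item, amount) pairs in a hypothetical `statement_items` table.

    Rows without a numeric second cell (headers, notes) are skipped.
    Returns the number of rows actually inserted.
    """
    conn.execute(
        'CREATE TABLE IF NOT EXISTS statement_items '
        '(report TEXT, item TEXT, amount TEXT)'
    )
    kept = 0
    for row in rows:
        if len(row) < 2:
            continue
        amount = parse_amount(row[1])
        if amount is None:
            continue
        # Amounts are stored as text so no decimal precision is lost.
        conn.execute(
            'INSERT INTO statement_items VALUES (?, ?, ?)',
            (report_name, row[0].strip(), str(amount)),
        )
        kept += 1
    conn.commit()
    return kept
```

Real statements will need more than this (multi-column periods, merged cells, full-width punctuation), so treat it as a starting point rather than a finished cleaner.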

# pip install python-docx pdf2docx openpyxl
import docx
from openpyxl import Workbook

# Source directory and file name of one report. Note: the name has no extension.
class FileInfo(object):
    def __init__(self, dir, name):
        self.Dir = dir      # Directory
        self.Name = name    # file name

# file list which will be converted to destination format
files = [
    FileInfo(r'C:\data\FinancialAnalysis\financial-Statement\2022',r'300760_迈瑞医疗_2022年半年度报告'),
    FileInfo(r'C:\data\FinancialAnalysis\financial-Statement\2021', r'300760_迈瑞医疗_2021年年度报告')
]

# The following variables were only used during development; they no longer look necessary.
#src_dir = r'C:\data\FinancialAnalysis\financial-Statement\2019'
dest_dir = r'C:\data\FinancialAnalysis\financial-Statement'
file_name = r'300760_迈瑞医疗_2019年年度报告'
if file_name.endswith('.docx'):
    file_name = file_name.replace('.docx', '')
if file_name.endswith('.PDF'):
    file_name = file_name.replace('.PDF', '')

#full_pdf_file_name = rf'{src_dir}\{file_name}.PDF'
#full_docx_file_name = rf'{dest_dir}\{file_name}.docx'
#full_excel_file_name = rf'{dest_dir}\{file_name}.xlsx'



def convert_pdf_to_docx(full_pdf_file_name, full_docx_file_name):
    '''
    Convert a PDF file to Word (.docx).
    :param full_pdf_file_name: full PDF file name
    :param full_docx_file_name: full Word file name
    :return: None. The converted file is saved as `full_docx_file_name`.
    '''
    from pdf2docx import Converter  # requires `pip install pdf2docx`
    cv = Converter(full_pdf_file_name)
    cv.convert(full_docx_file_name)
    cv.close()
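
The commented-out path templates above suggest how the full file names are put together from a directory and an extension-less name. A small sketch of that step follows; `build_paths` is a hypothetical helper (not in the original script), and `PureWindowsPath` is used only so the Windows-style paths join the same way on any operating system.

```python
from pathlib import PureWindowsPath


def build_paths(src_dir, dest_dir, name):
    """Return the (pdf, docx, xlsx) full file names for one report.

    `name` is the file name without extension, as in FileInfo.Name.
    """
    pdf = PureWindowsPath(src_dir) / f'{name}.PDF'
    docx_path = PureWindowsPath(dest_dir) / f'{name}.docx'
    xlsx = PureWindowsPath(dest_dir) / f'{name}.xlsx'
    return str(pdf), str(docx_path), str(xlsx)


# Possible use with the `files` list and `dest_dir` defined earlier:
#
# for info in files:
#     pdf, docx_path, xlsx = build_paths(info.Dir, dest_dir, info.Name)
#     convert_pdf_to_docx(pdf, docx_path)
```
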

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Note: this function was copied from a snippet on GitHub (the URL has been lost).
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
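
To make "document order" concrete, here is a standard-library sketch of the same idea that reads `word/document.xml` directly instead of going through python-docx. It is a swapped-in illustration, not how the script above works: `iter_blocks_raw` and `_text` are hypothetical names, and the sketch ignores section properties, content controls, and the structure of nested tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag form.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def _text(element):
    """Concatenate the text of every w:t run below `element`."""
    return ''.join(t.text or '' for t in element.iter(W + 't'))


def iter_blocks_raw(docx_file):
    """Yield ('paragraph', text) or ('table', rows) in document order.

    `docx_file` is a path or a binary file-like object; a .docx file
    is a zip archive whose main part is word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    body = root.find(W + 'body')
    for child in body:
        if child.tag == W + 'p':
            yield 'paragraph', _text(child)
        elif child.tag == W + 'tbl':
            rows = [[_text(tc) for tc in tr.findall(W + 'tc')]
                    for tr in child.findall(W + 'tr')]
            yield 'table', rows
```

Walking the body's direct children is what keeps each table next to the paragraph that precedes it, which is often the statement title in a financial report.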


def extract_all_tables_from_docx(full_docx_file_name, full_excel_file_name):
    '''
    Extract all tables from a Word (.docx) file and save them to an Excel file.

    :param full_docx_file_name: full Word file name
    :param full_excel_file_name: full Excel file name
    :return: None. The result is saved as `full_excel_file_name`.
    '''
    excluding =