Extracting Tables from PDFs (Listed-Company Financial Statements)

Article link: https://blog.csdn.net/herhun_chen/article/details/127007949

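This article presents Python code for extracting listed-company financial statements from PDF files, with the aim of storing the second-level line items in a database for automated analysis. The code has not been tidied up, but it runs. The extracted tables still need further processing before they can be saved to a database.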


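The code is a bit messy and I really have not had time to tidy it up, but it runs without problems. If it is useful to you and you run into any issues while using it, feel free to contact me.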

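The purpose of this code is to pull the second-level line items out of listed companies' financial reports and save them to a database, so that the analysis can be automated. For now the extracted tables are still rough, and I have not yet worked out how to clean them up in a second pass so that they can finally be saved to the database.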
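The second-pass cleanup is still an open question in this post. Purely as a rough sketch of one possible shape for it, and not part of the original code: the snippet below assumes each extracted row is a list of cell strings whose first cell is the line-item name and whose second cell is an amount. The table name `statement_items`, its columns, and the helpers `parse_amount` and `save_rows` are all hypothetical names invented for this illustration, and the sample figures are made up.

```python
import sqlite3


def parse_amount(text):
    """Convert a cell such as '1,234.50' or '(1,234.50)' to a float.

    Returns None when the cell is empty or is not a number
    (for example a column header).
    """
    cleaned = text.strip().replace(',', '').replace(' ', '')
    if not cleaned:
        return None
    # Assumption: amounts wrapped in parentheses are negative.
    negative = cleaned.startswith('(') and cleaned.endswith(')')
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value


def save_rows(conn, report, rows):
    """Insert the usable rows into a hypothetical `statement_items` table.

    Rows with fewer than two cells, an empty item name, or a
    non-numeric amount are skipped. Returns the number of rows kept.
    """
    conn.execute(
        'CREATE TABLE IF NOT EXISTS statement_items ('
        'report TEXT, item TEXT, amount REAL)'
    )
    kept = 0
    for row in rows:
        if len(row) < 2:
            continue
        item = row[0].strip()
        amount = parse_amount(row[1])
        if not item or amount is None:
            continue  # header rows, blank rows, free-text notes
        conn.execute(
            'INSERT INTO statement_items VALUES (?, ?, ?)',
            (report, item, amount),
        )
        kept += 1
    conn.commit()
    return kept


if __name__ == '__main__':
    # Made-up sample rows, only to show the call shape.
    sample = [['项目', '期末余额'], ['货币资金', '1,234.50'], ['其他', '(100.00)']]
    connection = sqlite3.connect(':memory:')
    print(save_rows(connection, 'demo_report', sample))
```

Real reports will need more than this (merged cells, multi-column periods, units stated in a table caption), so the filter here is only a starting point for deciding which rows are worth keeping.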

# pip install python-docx openpyxl pdf2docx
import docx
from openpyxl import Workbook

# Source directory and file name (the name has no extension)
class FileInfo(object):
    def __init__(self, dir, name):
        self.Dir = dir      # Directory
        self.Name = name    # file name

# file list which will be converted to destination format
files = [
    FileInfo(r'C:\data\FinancialAnalysis\financial-Statement\2022', r'300760_迈瑞医疗_2022年半年度报告'),
    FileInfo(r'C:\data\FinancialAnalysis\financial-Statement\2021', r'300760_迈瑞医疗_2021年年度报告')
]

# The following variables were only used during development; they no longer appear to be needed
#src_dir = r'C:\data\FinancialAnalysis\financial-Statement\2019'
dest_dir = r'C:\data\FinancialAnalysis\financial-Statement'
file_name = r'300760_迈瑞医疗_2019年年度报告'
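# Strip a trailing extension if present (case-insensitive).
# Slicing removes only the suffix, whereas str.replace would also remove
# the same text if it appeared in the middle of the name.
for ext in ('.docx', '.pdf'):
    if file_name.lower().endswith(ext):
        file_name = file_name[:-len(ext)]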

#full_pdf_file_name = rf'{src_dir}\{file_name}.PDF'
#full_docx_file_name = rf'{dest_dir}\{file_name}.docx'
#full_excel_file_name = rf'{dest_dir}\{file_name}.xlsx'



def convert_pdf_to_docx(full_pdf_file_name, full_docx_file_name):
    '''
    Convert a PDF file to Word (.docx).
    :param full_pdf_file_name: full path of the source PDF file
    :param full_docx_file_name: full path of the Word file to write
    :return: None. The converted file is saved as `full_docx_file_name`.
    '''
    from pdf2docx import Converter
    cv = Converter(full_pdf_file_name)
    try:
        cv.convert(full_docx_file_name)
    finally:
        cv.close()  # release the PDF even if the conversion fails

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Note: this function was copied from somewhere on GitHub (the URL is lost).
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError(f"unsupported parent type: {type(parent).__name__}")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


def extract_all_tables_from_docx(full_docx_file_name, full_excel_file_name):
    '''
    Extract all tables from a Word (.docx) file and save them to an Excel workbook

    :param full_docx_file_name: full word file name
    :param full_excel_file_name: full excel file name
    :return: Result is saved with argument `full_excel_file_name`
    '''
    excluding =