10.4.3 Document Parsing
(1) Define the method parse_document, which takes a document as input, parses its content, removes the HTML markup, and saves the result to the parsed_documents collection in MongoDB.
import re
import traceback
from bs4 import BeautifulSoup

# mongodb, company_from_cik, identify_table_of_contents,
# get_sections_using_hrefs, get_sections_using_strings and the
# list_*/default_* constants are defined earlier in this chapter.

def parse_document(doc):
    url = doc["_id"]
    form_type = doc["form_type"]
    filing_date = doc["filing_date"]
    sections = {}
    cik = doc["cik"]
    html = doc["html"]
    # Supported form types are 10-K, 10-K/A, 10-Q, 10-Q/A, 8-K
    if form_type in ["10-K", "10-K/A"]:
        include_forms = ["10-K", "10-K/A"]
        list_items = list_10k_items
        default_sections = default_10k_sections
    elif form_type == "10-Q":
        include_forms = ["10-Q"]
        list_items = list_10q_items
        default_sections = default_10q_sections
    elif form_type == "8-K":
        include_forms = ["8-K"]
        list_items = None
        default_sections = default_8k_sections
    else:
        print(f"return because form_type {form_type} is not valid")
        return
    if form_type not in include_forms:
        print(f"return because form_type != {form_type}")
        return
    company_info = company_from_cik(cik)
    # no cik in cik_map
    if company_info is None:
        print("return because company info None")
        return
    print(f"form type: \t\t{form_type}")
    print(company_info)
    soup = BeautifulSoup(html, features="html.parser")
    if soup.body is None:
        print("return because soup.body None")
        return
    table_of_contents = identify_table_of_contents(soup, list_items)
    if table_of_contents:
        sections = get_sections_using_hrefs(soup, table_of_contents)
        if len(sections) == 0:
            sections = get_sections_using_strings(soup, table_of_contents,
                                                  default_sections)
    result = {"_id": url, "cik": cik, "form_type": form_type,
              "filing_date": filing_date, "sections": {}}
    for s in sections:
        section = sections[s]
        if 'text' in section:
            # Collapse newlines and runs of spaces in the section text
            text = section['text']
            text = re.sub('\n', ' ', text)
            text = re.sub(' +', ' ', text)
            result["sections"][section["title"]] = {
                "text": text,
                "link": section.get("link"),
            }
    try:
        mongodb.upsert_document("parsed_documents", result)
    except Exception:
        traceback.print_exc()
    print(result.keys())
    print(result["sections"].keys())
The code above is implemented in the following steps:
- Based on the document's form type (10-K, 10-Q, 8-K, etc.), select the corresponding processing parameters and default section structure.
- Parse the document's HTML content into a Soup object using the BeautifulSoup library.
- Identify the table of contents in the document to locate its sections.
- Extract the content of each section, either by following the links in the table of contents or by searching for the default section titles.
- Clean up the extracted content and store it in a dictionary.
- Save the parsed result in the parsed_documents collection of the MongoDB database.
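The whitespace cleanup applied to each extracted section can be exercised in isolation. The helper name normalize_text below is illustrative, not part of the original code; it applies the same two re.sub calls used in the section loop:

```python
import re

def normalize_text(text):
    # Collapse newlines and repeated spaces into single spaces,
    # mirroring the cleanup applied to each extracted section.
    text = re.sub('\n', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()

raw = "Item 1.\nBusiness\n\nAlphabet  is a collection   of businesses."
print(normalize_text(raw))
# → Item 1. Business Alphabet is a collection of businesses.
```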
(2) Define the method find_auditor(doc), which looks up auditor information in a given document. It takes a document as input and searches its text content to identify the auditor.
def find_auditor(doc):
    try:
        soup = BeautifulSoup(doc["html"], features="html.parser")
        body = unidecode(soup.body.get_text(separator=" "))
        body = re.sub('\n', ' ', body)
        body = re.sub(' +', ' ', body)
        # Scan every "s/" signature marker in the document body
        start_sig = body.find('s/')
        while start_sig != -1:
            # Inspect the 200 characters following the marker
            auditor_candidate = body[start_sig:start_sig + 200]
            if 'auditor since' in auditor_candidate.lower():
                pattern = r"s/.+auditor since.*?\d{4}"
                try:
                    match = re.findall(pattern, auditor_candidate)[0]
                    return match.replace("s/", "").strip()
                except IndexError:
                    pass
            start_sig = body.find('s/', start_sig + 1)
    except Exception as e:
        print(e)
    print("NO AUDITOR FOUND")
    return ""
(3) The following code parses Google's 10-K filing. It first fetches the document with the given URL from MongoDB and passes it to the parse_document method. parse_document converts the filing into text and extracts its individual sections, such as Business, Risk Factors, and Management's Discussion and Analysis. The parsed content is then stored in MongoDB's parsed_documents collection for later analysis and use.
filing_url = 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm'
doc = mongodb.get_collection("documents").find({"_id":filing_url}).next()
parse_document(doc)
Executing this code produces the following output:
form type: 10-K
cik 0001652044
name Alphabet Inc.
ticker GOOGL
exchange Nasdaq
Name: 2, dtype: object
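mongodb.upsert_document is a project helper defined earlier in the chapter; with pymongo it typically amounts to a replace_one call with upsert=True. Its replace-or-insert semantics, keyed on _id, can be illustrated with an in-memory stand-in (the collections dict below is purely illustrative):

```python
# In-memory stand-in for an upsert keyed on _id; with pymongo this
# corresponds to collection.replace_one({"_id": ...}, doc, upsert=True).
collections = {}

def upsert_document(collection_name, doc):
    # Replace the document with the same _id, or insert it if absent.
    coll = collections.setdefault(collection_name, {})
    coll[doc["_id"]] = doc

upsert_document("parsed_documents",
                {"_id": "url-1", "form_type": "10-K", "sections": {}})
# Re-parsing the same filing overwrites the stored document
upsert_document("parsed_documents",
                {"_id": "url-1", "form_type": "10-K/A", "sections": {}})
print(len(collections["parsed_documents"]))  # → 1
```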