10.4.3 Document Parsing
(1) Define the method parse_document, which takes a document as input, parses its content, removes the HTML markup, and saves the result to the parsed_documents collection in MongoDB.
import re
import traceback
from bs4 import BeautifulSoup

# mongodb, company_from_cik, identify_table_of_contents,
# get_sections_using_hrefs, get_sections_using_strings and the
# list_*/default_* constants are defined earlier in this chapter.

def parse_document(doc):
    url = doc["_id"]
    form_type = doc["form_type"]
    filing_date = doc["filing_date"]
    sections = {}
    cik = doc["cik"]
    html = doc["html"]
    # Supported form types are 10-K, 10-K/A, 10-Q, 10-Q/A, 8-K
    if form_type in ["10-K", "10-K/A"]:
        include_forms = ["10-K", "10-K/A"]
        list_items = list_10k_items
        default_sections = default_10k_sections
    elif form_type == "10-Q":
        include_forms = ["10-Q"]
        list_items = list_10q_items
        default_sections = default_10q_sections
    elif form_type == "8-K":
        include_forms = ["8-K"]
        list_items = None
        default_sections = default_8k_sections
    else:
        print(f"return because form_type {form_type} is not valid")
        return
    if form_type not in include_forms:
        print(f"return because form_type != {form_type}")
        return
    company_info = company_from_cik(cik)
    # no cik in cik_map
    if company_info is None:
        print("return because company info None")
        return
    print(f"form type: \t\t{form_type}")
    print(company_info)
    soup = BeautifulSoup(html, features="html.parser")
    if soup.body is None:
        print("return because soup.body None")
        return
    table_of_contents = identify_table_of_contents(soup, list_items)
    if table_of_contents:
        sections = get_sections_using_hrefs(soup, table_of_contents)
        if len(sections) == 0:
            sections = get_sections_using_strings(soup, table_of_contents,
                                                  default_sections)
    result = {"_id": url, "cik": cik, "form_type": form_type,
              "filing_date": filing_date, "sections": {}}
    for s in sections:
        section = sections[s]
        if 'text' in section:
            # Collapse newlines and runs of spaces in the section text
            text = section['text']
            text = re.sub('\n', ' ', text)
            text = re.sub(' +', ' ', text)
            result["sections"][section["title"]] = {
                "text": text,
                "link": section.get("link"),
            }
    try:
        mongodb.upsert_document("parsed_documents", result)
    except Exception:
        traceback.print_exc()
    print(result.keys())
    print(result["sections"].keys())
The code above is implemented in the following steps:
- Based on the document's form type (10-K, 10-Q, 8-K, etc.), select the corresponding processing parameters and default section structure.
- Parse the document's HTML content into a Soup object using the BeautifulSoup library.
- Identify the table of contents in the document to locate its sections.
- Extract the content of each section, either by following the links in the table of contents or by searching for the default section titles.
- Clean up the extracted content and store it in a dictionary.
- Save the parsed result in the parsed_documents collection of the MongoDB database.
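The whitespace cleanup applied to each extracted section can be exercised in isolation. The helper name normalize_text below is illustrative, not part of the original code; it applies the same two re.sub calls used in the section loop:

```python
import re

def normalize_text(text):
    # Collapse newlines and repeated spaces into single spaces,
    # mirroring the cleanup applied to each extracted section.
    text = re.sub('\n', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()

raw = "Item 1.\nBusiness\n\nAlphabet  is a collection   of businesses."
print(normalize_text(raw))
# → Item 1. Business Alphabet is a collection of businesses.
```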
(2) Define the method find_auditor(doc), which looks up auditor information in a given document. It takes a document as input and searches its text content to identify the auditor.
def find_auditor(doc):
    try:
        soup = BeautifulSoup(doc["html"], features="html.parser")
        body = unidecode(soup.body.get_text(separator=" "))
        body = re.sub('\n', ' ', body)
        body = re.sub(' +', ' ', body)
        # Scan every "s/" signature marker in the document body
        start_sig = body.find('s/')
        while start_sig != -1:
            # Inspect the 200 characters following the marker
            auditor_candidate = body[start_sig:start_sig + 200]
            if 'auditor since' in auditor_candidate.lower():
                pattern = r"s/.+auditor since.*?\d{4}"
                try:
                    match = re.findall(pattern, auditor_candidate)[0]
                    return match.replace("s/", "").strip()
                except IndexError:
                    pass
            start_sig = body.find('s/', start_sig + 1)
    except Exception as e:
        print(e)
    print("NO AUDITOR FOUND")
    return ""
(3) The following code parses Google's 10-K filing. It first fetches the document with the given URL from MongoDB and passes it to the parse_document method. parse_document converts the filing into text and extracts its individual sections, such as Business, Risk Factors, and Management's Discussion and Analysis. The parsed content is then stored in MongoDB's parsed_documents collection for later analysis and use.
filing_url = 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm'
doc = mongodb.get_collection("documents").find({"_id":filing_url}).next()
parse_document(doc)
Executing this code produces the following output:
form type: 10-K
cik 0001652044
name Alphabet Inc.
ticker GOOGL
exchange Nasdaq
Name: 2, dtype: object
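mongodb.upsert_document is a project helper defined earlier in the chapter; with pymongo it typically amounts to a replace_one call with upsert=True. Its replace-or-insert semantics, keyed on _id, can be illustrated with an in-memory stand-in (the collections dict below is purely illustrative):

```python
# In-memory stand-in for an upsert keyed on _id; with pymongo this
# corresponds to collection.replace_one({"_id": ...}, doc, upsert=True).
collections = {}

def upsert_document(collection_name, doc):
    # Replace the document with the same _id, or insert it if absent.
    coll = collections.setdefault(collection_name, {})
    coll[doc["_id"]] = doc

upsert_document("parsed_documents",
                {"_id": "url-1", "form_type": "10-K", "sections": {}})
# Re-parsing the same filing overwrites the stored document
upsert_document("parsed_documents",
                {"_id": "url-1", "form_type": "10-K/A", "sections": {}})
print(len(collections["parsed_documents"]))  # → 1
```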