Using Regular Expressions to Search SEC 10K Filings

Regular expressions, or “regex”, are text matching patterns that are used for searching text. In the case of SEC 10K filings, regex can greatly assist the search process.

This is because 10K filings contain inconsistencies from year to year and from company to company. This makes it difficult to identify and extract particular sections of text. Regex offers versatile and powerful text search capabilities that can be used for matching almost any pattern of text.

I recently encountered this issue when trying to extract a section of text from a 10K filing. In this article, I show how I found a workable solution using regex in Python. By using carefully selected pattern-matching techniques, regex helped me extract the particular section of text that I was after.

My solution won’t work for all 10K filings however, as the inconsistencies between filings can vary quite a bit and are difficult to know in advance. But it’ll work in many cases.

More importantly, my solution can help you if you’re encountering a similar issue. It can provide a starting point which you can modify as needed for the particular text that you wish to search.

If you’re working with 10K filings and are looking for ideas on how to identify and extract text, or if you simply want to learn more about using regex, then this article is for you!

SEC 10K filings and where to find them

SEC 10K filings are produced annually by all publicly traded companies in the US. They contain plenty of useful information including details of each company’s history, structure, personnel, financial circumstances and operations.

The filings are available from the EDGAR database on the SEC website. EDGAR, or the ‘Electronic Data Gathering, Analysis and Retrieval’ system, offers easy access to all public company filings.

EDGAR is huge, with around 3,000 filings processed each day and with over 40,000 new filers each year. Best of all, EDGAR is free to access and offers a comprehensive and reliable source of information on US public companies.

Individual filings can be obtained directly through EDGAR’s online portal. Searches for company information can be done manually, through third-party APIs or by accessing documents through their URL links — this is the approach that we’ll use.

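To make the URL approach concrete, here’s a minimal sketch of how a filing’s URL can be assembled from a company’s CIK number and the filing’s accession number. The values are Tesla’s, taken from the filing used later in this article, and the path layout is an assumption inferred from that one example rather than documented EDGAR behaviour:

# Sketch: composing an EDGAR document URL from a CIK and an accession number
# (values are Tesla's, from the filing used later on; the path layout is
# inferred from that single example)
cik = '1318605' # Tesla's CIK number on EDGAR
accession = '0001564590-20-004475' # Accession number of the Dec 2019 10-K
url = ('https://www.sec.gov/Archives/edgar/data/' + cik + '/'
       + accession.replace('-', '') + '/' + accession + '.txt')
print(url)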

The structure (and challenges) of 10K filings

10K filings tend to be large and unwieldy given the scope of information that they contain. They’re also challenging to work with due to the inconsistencies inherent between individual filings and companies. This makes it hard to search through and identify text within them.

10K filings have a number of standardized sections such as ‘Business’, ‘Risk Factors’ and ‘Selected Financial Data’. One section of particular interest is ‘Management’s Discussion and Analysis of Financial Condition’, or the ‘MD&A’. This contains information about a company’s progress, its financial circumstances and its future plans.

We’ll be extracting the MD&A in this article.

Each of the sections in a 10K filing has a number associated with it, called an ‘Item number’. The MD&A is Item 7.

Appearing just after the MD&A is another section called ‘Quantitative and Qualitative Disclosures About Market Risk’. This is Item 7A. It’s often quite short but also contains useful information, so we’ll extract this as well.

Unfortunately, the length and location of Items 7 and 7A can vary from filing to filing and between companies. When searching through the text of a 10K filing, there’s no straightforward way of knowing exactly where the contents of Item 7 or 7A will be.

Consider, for example, looking for Item 7 in a 10K filing. The first time a reference to “Item 7” appears is typically in the table of contents. The second time may be the section content that we’re after, but not always. There are often references to Item 7 throughout the filing, in commentary or disclosures or even footnotes, and we won’t know in advance exactly how many times “Item 7” appears and which reference relates to the section content that we wish to extract.

We’ll therefore need to be creative in order to find Items 7 and 7A, and we’ll use regex to help us.

What is regex?

Regex, or ‘regular expressions’, refers to sequences of characters that define a search pattern. These patterns are used to match against text in documents that are being searched. In our case, the documents being searched are the 10K filings.

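As a quick illustration (separate from the extraction code later in this article), here’s what compiling and applying a simple pattern looks like with Python’s built-in re module:

import re

# Match 'Item 7' or 'Item 8' followed by a full stop or a space
pattern = re.compile(r'item\s[78][\.\s]', re.IGNORECASE)

sample = "See Item 7. Management's Discussion and Analysis..."
match = pattern.search(sample)
print(match.group()) # Item 7.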

Regex isn’t new, having first been developed in the 1950s. It has since been used extensively for pattern matching in applications involving text editing and lexical analysis.

Today, regex is available in a number of coding languages and text editors. It has a rich feature set which allows for powerful pattern matching and is a versatile tool for text preprocessing in natural language processing (NLP) applications.

Strictly speaking, regex is intended for use in ‘regular languages’, of which HTML is not one. Nevertheless, in our case we use regex to search through HTML texts of the 10K filings due to the inconsistencies that they contain. These inconsistencies make any sort of text search challenging, and the powerful features of regex will help us find what we need.

I assume a basic familiarity with regex in this article. If you’d like to learn more about regex or refresh your understanding, here is an excellent resource.

The search pattern

How do we find exactly where Items 7 and 7A appear in a 10K document?

If we’re manually perusing an XML version of the document, we’d simply go to the table of contents and click on the hyperlink. But if we’re automating the process, it isn’t so straightforward.

The key is to look for patterns (text sequences) that identify the particular section of the filing that we’re searching for. In our case, we want to know where the Item 7/7A section starts and ends.

As mentioned, the second occurrence of “Item 7” in a 10K filing may not refer to the section content. So, what else can we look for to give us a better chance of finding Item 7 (the MD&A)?

A pattern that I’ve found to work in many cases is as follows:

“Item 7”, followed immediately by a full stop or space, followed soon after by the title of the MD&A section

This sequence of text is quite specific to the MD&A section content, rather than mere reference to it elsewhere in the 10K filing. We’ll refer to it as the Item 7 Search Pattern.

To find where the MD&A section ends, we simply look for where the following section starts. Since we’re including Item 7A in our extraction, this means we need to find where Item 8 starts (which immediately follows Item 7A in 10K filings). We use a similar approach to that for finding Item 7 and look for a unique sequence of text that identifies the start of Item 8.

For Item 8, the sequence of text that we’ll be looking for is:

“Item 8”, followed immediately by “.” or “ ”, followed soon after by the title of the section, i.e. “Financial Statements and Supplementary Data”

We’ll call this the Item 8 Search Pattern.

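To make the two Search Patterns concrete before diving into the full code, here’s a small sketch of one-pass regex versions of them. These are only illustrative: in the raw filing text the item number and the section title are often separated by HTML markup, which is why the code later in this article matches the pieces separately and then joins adjacent matches.

import re

# Illustrative one-pass versions of the Item 7 and Item 8 Search Patterns,
# for clean text only (the main code handles markup between the pieces)
item7_start = re.compile(r'item\s7[\.\s].{0,60}?discussion\s+and\s+analysis',
                         re.IGNORECASE | re.DOTALL)
item8_start = re.compile(r'item\s8[\.\s].{0,60}?financial\s+statements\s+and\s+supplementary\s+data',
                         re.IGNORECASE | re.DOTALL)

sample = "Item 7. Management's Discussion and Analysis of Financial Condition"
print(bool(item7_start.search(sample))) # True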

I’ve tested this approach on the most recent 5 years of 10K filings for Tesla, Apple, GM, Mastercard and Microsoft. It works in all of these cases. I’ve been able to identify exactly where the MD&A section starts and ends (including Item 7A) and extract it successfully.

As discussed, 10K filings can vary quite a bit so this approach may not work for other companies or filing years. If you find it isn’t working, simply adjust the sequence of text that you’re looking for so that it’s unique in the filing document that you’re searching through.

If you want to find a different section of the 10K filing, such as a different Item number, simply identify the unique sequence of text for that particular Item number.

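As a starting point for that, here’s a hypothetical helper (not part of the original solution) that builds a pattern in the same shape as the ‘matches_item7’ and ‘matches_item8’ patterns used in the code below, for any item number and the first word of its title:

import re

def item_pattern(item_number, first_title_word):
    # Hypothetical helper: builds a pattern like r'item\s7\.discussion\s[a-z]*'
    # from an item number and the first (lower-case) word of the section title
    return re.compile(r'item\s' + re.escape(item_number) + r'\.'
                      + re.escape(first_title_word) + r'\s[a-z]*')

matches_item1a = item_pattern('1a', 'risk') # e.g. for Item 1A, 'Risk Factors'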

Now that we know what we’re looking for, let’s dive into the code!

The code

I’ve used Python (v3.7.7) to write the code, so you may need to adjust your code if you’re using a different version of Python.

I found this YouTube video and this github resource to be helpful in writing my code, and I’ve adopted some elements from both of them.

Here’s the code:

#################################################################################
### Code for Searching (Extracting Sections from) SEC 10K Filings Using Regex ###
#################################################################################


# Import libraries
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd


# Define URL for the specific 10K filing
URL_text = r'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/0001564590-20-004475.txt' # Tesla 10K Dec 2019


# Grab the response
response = requests.get(URL_text)


# Parse the response (the XML flag works better than HTML for 10Ks)
soup = BeautifulSoup(response.content, 'lxml')


for filing_document in soup.find_all('document'): # The document tags contain the various components of the total 10K filing pack
    
    # The 'type' tag contains the document type
    document_type = filing_document.type.find(text=True, recursive=False).strip()
    
    if document_type == "10-K": # Once the 10K text body is found
        
        # Grab and store the 10K text body
        TenKtext = filing_document.find('text').extract().text
        
        # Set up the regex pattern
        matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|'
                             r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                             r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)
                                             
        matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
        
        # Set columns in the dataframe
        matches_array.columns = ['SearchTerm', 'Start']
        
        # Get the number of rows in the dataframe
        Rows = matches_array['SearchTerm'].count()
           
        # Create a new column in 'matches_array' called 'Selection' and add adjacent 'SearchTerm' (i and i+1 rows) text concatenated
        count = 0 # Counter to help with row location and iteration
        while count < (Rows-1): # Can only iterate to the second last row
            matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
            count += 1
        
        # Set up 'Item 7/8 Search Pattern' regex patterns
        matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
        matches_item8 = re.compile(r'(item\s8\.(consolidated\sfinancial|financial)\s[a-z]*)')
            
        # Lists to store the locations of Item 7/8 Search Pattern matches
        Start_Loc = []
        End_Loc = []
            
        # Find and store the locations of Item 7/8 Search Pattern matches
        count = 0 # Set up counter
        
        while count < (Rows-1): # Can only iterate to the second last row
            
            # Match Item 7 Search Pattern
            if re.match(matches_item7, matches_array.at[count,'Selection']):
                # Column 1 = 'Start' column in 'matches_array'
                Start_Loc.append(matches_array.iloc[count,1]) # Store in list => Item 7 will be the starting location (column '1' = 'Start' column)
            
            # Match Item 8 Search Pattern
            if re.match(matches_item8, matches_array.at[count,'Selection']):
                End_Loc.append(matches_array.iloc[count,1])
            
            count += 1


        # Extract section of text and store in 'TenKItem7'
        TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]
        
        # Clean newly extracted text
        TenKItem7 = TenKItem7.strip() # Remove starting/ending white spaces
        TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
        TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, if you're on Windows)
        TenKItem7 = TenKItem7.replace('&nbsp;', ' ') # Replace "&nbsp;" (a special character for space in HTML) with space
        TenKItem7 = TenKItem7.replace('&#160;', ' ') # Replace "&#160;" (a special character for space in HTML) with space
        while '  ' in TenKItem7:
            TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces


        # Print first 500 characters of newly extracted text
        print(TenKItem7[:500])

I’ll step through and explain the code below — let’s get started!

1. Import libraries

In addition to regex, we use ‘requests’ to grab the 10K filing, ‘Beautiful Soup’ to do basic parsing and ‘pandas’ to do some data manipulation.

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

2. Grab and parse the 10K filing

We’ll look at Tesla’s 2019 10K filing (released in early 2020) in this example. You can find the URL for this and other filings on the EDGAR database.

# Define URL for the specific 10K filing
URL_text = r'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/0001564590-20-004475.txt' # Tesla 10K Dec 2019

# Grab the response
response = requests.get(URL_text)

# Parse the response (the XML flag works better than HTML for 10Ks)
soup = BeautifulSoup(response.content, 'lxml')
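One practical caveat: EDGAR’s fair-access policy asks automated tools to identify themselves, and at the time of writing sec.gov often rejects requests that carry no User-Agent header (typically with a 403 error). If the request above fails, declaring yourself along these lines usually helps (the name and email below are placeholders):

# EDGAR asks automated tools to declare who they are; values are placeholders
headers = {'User-Agent': 'Sample Company admin@example.com'}
response = requests.get(URL_text, headers=headers)
response.raise_for_status() # Fail fast on a 403 or other HTTP error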

3. Loop through the 10K filing to find its text body

10K filings contain various document types, including charts and exhibits, but we only want the text body of the filing.

for filing_document in soup.find_all('document'): # The document tags contain the components of the full filing pack

    # The 'type' tag contains the document type
    document_type = filing_document.type.find(text=True, recursive=False).strip()

4. Working with the 10K text body

The 10K text body has a document type ‘10-K’, and this will appear only once in a given filing (based on post-2009 filing structures). Once this is found, grab it and store it — I use a variable called ‘TenKtext’ for this.

    if document_type == "10-K": # Once the 10K text body is found

        # Grab and store the 10K text body
        TenKtext = filing_document.find('text').extract().text

5. Set up the regex pattern — Stage 1 of the search process

We use a 2-stage process to implement the search.

The first stage is the following regex pattern (4 parts):

item\s(7[\.\s]|8[\.\s])

discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition

(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata

re.IGNORECASE

Does it look confusing? Unfortunately regex patterns can be hard to decipher at first glance. Let’s go through it by considering the segments of text that we’re trying to match from the Item 7 and Item 8 Search Patterns:

  • “Item 7”, followed (immediately) by “.” (full stop) or “ ” (space), to identify the start of the section of text that we’re after, and similarly for “Item 8” to identify the end of the section. The portion of the regex pattern which matches these is: item\s(7[\.\s]|8[\.\s])

  • “Discussion and Analysis of (Consolidated) Financial Condition”, which is the title of Item 7 (the MD&A), noting that we include “Consolidated” as an optional word, since it appears in the MD&A title for some filings. This is matched by: discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition

  • “(Consolidated) Financial Statements and Supplementary Data”, which is the title of Item 8, and again noting the optional inclusion of “Consolidated”. This is matched by: (consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata

  • Finally, we wish to capture all upper and lower case versions of the above segments of text as the capitalization approach varies between filings. We do this by including the ‘re.IGNORECASE’ flag.

This regex pattern will match all occurrences of the above segments of text in the 10K filing.

matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

Once we find the regex matches, we store the results in a pandas dataframe — I call this ‘matches_array’.

I set up the dataframe with two columns, the first with the matched segment of text, the second with its starting position in the 10K text body. I label these columns ‘SearchTerm’ and ‘Start’ respectively.

I also calculate and store the number of rows in the dataframe — we’ll need this later.

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])

# Set columns in the dataframe
matches_array.columns = ['SearchTerm', 'Start']

# Get the number of rows in the dataframe
Rows = matches_array['SearchTerm'].count()

In our example of the 2019 Tesla 10K, the regex matches are as follows (which are stored in the ‘matches_array’ dataframe):

    SearchTerm                                       Start
0   Item 7.                                          61,103
1   Discussion and Analysis of Financial Condition   61,128
2   Item 8.                                          61,293
3   Financial Statements and Supplementary Data      61,305
4   Discussion and Analysis of Financial Condition   220,720
5   ITEM 7.                                          223,520
6   DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION   223,542
7   Item 7.                                          223,944
8   Discussion and Analysis of Financial Condition   223,965
9   Discussion and Analysis of Financial Condition   298,251
10  ITEM 8.                                          314,729
11  FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA      314,738
12  Item 8                                           572,118

There are 13 matches of the regex pattern (numbered 0 to 12) in the 10K filing. The first match is “Item 7” which appears at position 61,103 in the 10K text body. The second match is “Discussion and Analysis of Financial Condition” at position 61,128, and so on.

6. Set up the pattern sequence and find the extraction locations — Stage 2 of the search process

The second stage of the search process begins by forming the Item 7 and Item 8 Search Patterns from the matches in ‘matches_array’. We do this by joining pairs of matches in the order that they appear in ‘matches_array’, i.e. concatenating the text of adjacent (i.e. rows i and i+1) matches in the ‘SearchTerm’ column of ‘matches_array’. We store the results in a new column called ‘Selection’.

Why do we concatenate adjacent rows/matches? In the Item 7/8 Search Patterns, we’re looking for the titles of the Item 7/8 sections appearing ‘soon after’ the item numbers. This means that the title matches will be the next match after the item number matches in ‘matches_array’. Hence, by concatenating adjacent rows/matches, we’re creating the full Search Pattern sequences (where they exist).

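As an aside, the same concatenation can be written without an explicit loop; a pandas-idiomatic sketch that should be equivalent to the while loop shown further below:

# Equivalent to the while loop below: shift 'SearchTerm' up one row and
# concatenate it with itself (the last row becomes NaN, as noted later)
matches_array['Selection'] = (matches_array['SearchTerm']
                              + matches_array['SearchTerm'].shift(-1)).str.lower()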

Next, we find the second occurrence of each of the Item 7 and Item 8 Search Patterns in the ‘Selection’ column of ‘matches_array’. This will indicate the start and end points respectively (ie. the numbers in the ‘Start’ column) for the section of text that we wish to extract.

Note that, although the term “Item 7” (or “Item 8”) may occur in various places in the 10K filing, the whole of the Item 7 Search Pattern and Item 8 Search Pattern occur in a more predictable manner.

For the 10K filings that I’ve successfully tested, the Item 7 and Item 8 Search Patterns occur at least twice in each filing. The first occurrence is in the table of contents and the second occurrence is the section content that we wish to extract.

We exploit this feature in our search process. This is why we want the second occurrences of Item 7 and Item 8 Search Patterns. They mark the position points (ie. the start and end points respectively) of the section of text we wish to extract.

# Create a new column in 'matches_array' called 'Selection' and add adjacent 'SearchTerm' (i and i+1 rows) text concatenated
count = 0 # Counter to help with row location and iteration
while count < (Rows-1): # Can only iterate to the second last row
    matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
    count += 1

In our 2019 Tesla 10K example, the output from ‘matches_array’ for the ‘Start’ column and the newly created ‘Selection’ column is:

Start    Selection
61,103   item 7.discussion and analysis of financial co…
61,128   discussion and analysis of financial condition…
61,293   item 8.financial statements and supplementary …
61,305   financial statements and supplementary datadis…
220,720  discussion and analysis of financial condition…
223,520  item 7.discussion and analysis of financial co…
223,542  discussion and analysis of financial condition…
223,944  item 7.discussion and analysis of financial co…
223,965  discussion and analysis of financial condition…
298,251  discussion and analysis of financial condition…
314,729  item 8.financial statements and supplementary …
314,738  financial statements and supplementary dataite…
572,118  NaN

The output shows the concatenated text from adjacent matches (rows i and i+1) in ‘matches_array’.

Note that the last entry is null (‘NaN’), since there is no i+1 row available when the i row is the last row in ‘matches_array’ (we only count up to ‘Rows - 1’ in the ‘while’ loop in the above section of code).

We now need to identify which of the concatenated text entries in the ‘Selection’ column match the Item 7 and Item 8 Search Patterns. We do this by using the following regex patterns:

item\s7\.discussion\s[a-z]*

item\s8\.(consolidated\sfinancial|financial)\s[a-z]*

We set up list variables, which we call ‘Start_Loc’ and ‘End_Loc’, to store the Item 7 and Item 8 Search Pattern matches respectively. We then select the second item in each of these lists as the start and end positions of the section of text that we wish to extract.

# Set up 'Item 7/8 Search Pattern' regex patterns
matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
matches_item8 = re.compile(r'(item\s8\.(consolidated\sfinancial|financial)\s[a-z]*)')

# Lists to store the locations of Item 7/8 Search Pattern matches
Start_Loc = []
End_Loc = []

# Find and store the locations of Item 7/8 Search Pattern matches
count = 0 # Set up counter
while count < (Rows-1): # Can only iterate to the second last row

    # Match Item 7 Search Pattern
    if re.match(matches_item7, matches_array.at[count,'Selection']):
        # Column 1 = 'Start' column in 'matches_array'
        Start_Loc.append(matches_array.iloc[count,1])

    # Match Item 8 Search Pattern
    if re.match(matches_item8, matches_array.at[count,'Selection']):
        End_Loc.append(matches_array.iloc[count,1])

    count += 1

In our 2019 Tesla 10K example, the above code will find our Item 7 and Item 8 Search Patterns as follows:

found Item 7 Search Pattern at: 61,103
found Item 8 Search Pattern at: 61,293
found Item 7 Search Pattern at: 223,520
found Item 7 Search Pattern at: 223,944
found Item 8 Search Pattern at: 314,729

The ‘Start_Loc’ and ‘End_Loc’ list variables will be as follows:

[61103, 223520, 223944]
[61293, 314729]

The second numbers in each of the list variables will mark the positions of the text we wish to extract. So, in our 2019 Tesla 10K example, the start position will be 223,520 (the second entry in the ‘Start_Loc’ list) and the end position will be 314,729 (the second entry in the ‘End_Loc’ list).

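Since this indexing assumes that each Search Pattern matched at least twice (once in the table of contents and once at the section body), it’s worth guarding against filings with a different layout so that they fail loudly instead of extracting the wrong span; a minimal sketch:

# Guard: each Search Pattern should match at least twice
# (table of contents + section body)
if len(Start_Loc) < 2 or len(End_Loc) < 2:
    raise ValueError('Item 7/8 Search Patterns not found twice; '
                     'adjust the patterns for this filing')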

7. Extract and clean the section of text

We can now extract the section of text that we’re after from our 10K filing, i.e. Items 7 and 7A, as follows:

TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]

We store the extracted section of text in a new variable called ‘TenKItem7’.

It helps to clean up the newly extracted text, which we can do with the following code:

TenKItem7 = TenKItem7.strip() # Remove start/end white space
TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n with space
TenKItem7 = TenKItem7.replace('\r', '') # Remove \r
TenKItem7 = TenKItem7.replace('&nbsp;', ' ') # "&nbsp;" => space
TenKItem7 = TenKItem7.replace('&#160;', ' ') # "&#160;" => space
while '  ' in TenKItem7:
    TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces

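If you prefer, the whitespace steps above can be collapsed into a single regex substitution; a compact alternative sketch:

# Compact alternative: normalize the HTML spaces first, then collapse any
# run of whitespace (including \n and \r) into a single space
TenKItem7 = TenKItem7.replace('&nbsp;', ' ').replace('&#160;', ' ')
TenKItem7 = re.sub(r'\s+', ' ', TenKItem7).strip()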
The output

That’s it! We’ve now identified, extracted and cleaned our section of text… so, what does it look like?

For our 2019 Tesla 10K example, the first ~500 characters of our freshly extracted text is as follows:

ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS The following discussion and analysis should be read in conjunction with the consolidated financial statements and the related notes included elsewhere in this Annual Report on Form 10-K. For discussion related to changes in financial condition and the results of operations for fiscal year 2017-related items, refer to Part II, Item 7. Management’s Discussion and Analysis of Financial Condition and …

We’ve done it! We’ve successfully extracted Item 7 (and 7A) from our 10K filing and can now use this for analysis or natural language processing applications.

Conclusion

SEC 10K filings are produced annually by all publicly traded companies in the US. They contain lots of useful information for investors or anyone interested in the affairs of those companies.

Unfortunately, 10K filings tend to be large and unwieldy and they contain inconsistencies between individual filings and companies. Searching through and extracting sections of text from them can be challenging.

I recently encountered this issue when working with SEC 10K filings. Fortunately, I found a solution using the powerful text matching features of regular expressions (regex). Using regex, I searched and extracted the section of text that I wanted from a 10K filing.

Regex is available as versatile and effective text pattern matching packages in various coding languages and text editors. It can be used in many applications of text analytics and NLP preprocessing.

I’ve implemented my regex search process in Python, and the resulting code can serve as a useful starting point, or an illustrative use case, for regex applications.

If you’re interested in using regex for searching through SEC 10K filings, then I hope that this article is helpful for you!

Translated from: https://towardsdatascience.com/using-regular-expressions-to-search-sec-10k-filings-b5493a42c784
