02 API access to market data
This text covers several options for accessing market data via API using Python, including:
1. pandas datareader: data sources built into the pandas library, plus the pandas-datareader library, which provides a standard interface to the API endpoints of various data providers.
2. yfinance: a library for downloading a variety of data from Yahoo! Finance that works around the deprecation of the historical data API by scraping the website.
3. LOBSTER tick data: order book data from LOBSTER, an online limit order book data tool that aims to provide easy-to-use, high-quality limit order book data.
4. Quandl: a service that makes its free and premium data available through a very straightforward API.
5. zipline & Quantopian: an introduction to the zipline backtesting library and how to access stock price data while running a backtest.
6. How to work with fundamental data: the SEC requires US issuers, that is, listed companies and securities, including mutual funds, to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), along with various other regulatory filings. These filings are available through the EDGAR system, and XBRL, a free, open, and global standard for the electronic representation and exchange of business reports, has made them increasingly easy to process. Fundamental data can be tracked and accessed via the EDGAR Public Dissemination Service (PDS) electronic feeds, RSS feeds, FTP, and the financial statement (and notes) datasets.
7. Other fundamental data sources: the compilation of macro resources by the Yale Law School, Capital IQ, Compustat, MSCI Barra, Quantitative Services Group, and others.
8. Techniques for efficient data storage with pandas.
There are several options to access market data via API using Python.
pandas datareader
The notebook 01_pandas_datareader_demo presents a few sources built into the pandas library.
- The pandas library enables access to data displayed on websites using the read_html function.
- The related pandas-datareader library provides access to the API endpoints of various data providers through a standard interface.
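As a minimal sketch of the read_html pattern, the snippet below parses an inline HTML table (the tickers and prices are made up for illustration); in practice you would pass a URL to the same function:

```python
from io import StringIO

import pandas as pd

# A tiny HTML table standing in for a web page that displays market data;
# the symbols and prices below are invented for this example.
html = StringIO("""
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>AAPL</td><td>150.0</td></tr>
  <tr><td>MSFT</td><td>300.0</td></tr>
</table>
""")

try:
    # read_html returns a list with one DataFrame per <table> found
    tables = pd.read_html(html)
    df = tables[0]
    print(df)
except ImportError:  # read_html needs an HTML parser such as lxml or bs4
    df = None
```

Note that read_html delegates parsing to an optional dependency (lxml, or beautifulsoup4 with html5lib), so one of these must be installed.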
yfinance: Yahoo! Finance market and fundamental data
The notebook yfinance_demo shows how to use yfinance to download a variety of data from Yahoo! Finance. The library works around the deprecation of the historical data API by scraping data from the website in a reliable, efficient way with a Pythonic API.
LOBSTER tick data
The notebook 03_lobster_itch_data demonstrates the use of order book data made available by LOBSTER (Limit Order Book System - The Efficient Reconstructor), an online limit order book data tool that aims to provide easy-to-use, high-quality limit order book data.
Since 2013, LOBSTER has acted as a data provider for the academic community, giving access to reconstructed limit order book data for the entire universe of NASDAQ-traded stocks. More recently, it started offering a commercial service.
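As a hedged illustration, the sketch below parses a few synthetic rows laid out like a LOBSTER message file; the column layout (time, event type, order id, size, price in dollars times 10,000, direction) follows LOBSTER's published sample format, but the values are invented, and you should verify the layout against the LOBSTER documentation:

```python
from io import StringIO

import pandas as pd

# Synthetic rows mimicking a LOBSTER message file: no header, one event per
# row. Event types include 1 (submission), 4 (execution), 3 (deletion).
# All values below are made up for illustration.
sample = StringIO(
    "34200.18960767,1,11885113,21,21322500,1\n"
    "34200.18960768,4,11885113,21,21322500,1\n"
    "34200.18960769,3,11885113,0,21322500,1\n"
)

columns = ["time", "type", "order_id", "size", "price", "direction"]
messages = pd.read_csv(sample, names=columns)

# LOBSTER stores prices as integers in units of $1/10,000
messages["price"] = messages["price"] / 10_000
print(messages)
```

The same read_csv pattern applies to the companion order book files, which hold one bid/ask snapshot per message row.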
Quandl
The notebook 03_quandl_demo shows how Quandl uses a very straightforward API to make its free and premium data available. See documentation for more details.
zipline & Quantopian
The notebook zipline_data briefly introduces the backtesting library zipline that we will use throughout this book and shows how to access stock price data while running a backtest. For installation, please refer to the instructions here.
How to work with Fundamental data
The Securities and Exchange Commission (SEC) requires US issuers, that is, listed companies and securities, including mutual funds, to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), in addition to various other regulatory filing requirements.
Since the early 1990s, the SEC made these filings available through its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. They constitute the primary data source for the fundamental analysis of equity and other securities, such as corporate credit, where the value depends on the business prospects and financial health of the issuer.
Automated processing using XBRL markup
Automated analysis of regulatory filings has become much easier since the SEC introduced XBRL, a free, open, and global standard for the electronic representation and exchange of business reports. XBRL is based on XML; it relies on taxonomies that define the meaning of the elements of a report and map to tags that highlight the corresponding information in the electronic version of the report. One such taxonomy represents the US Generally Accepted Accounting Principles (GAAP).
The SEC introduced voluntary XBRL filings in 2005 in response to accounting scandals before mandating this format for all filers as of 2009, and it continues to expand mandatory coverage to other regulatory filings. The SEC maintains a website that lists the current taxonomies that shape the content of different filings and can be used to extract specific items.
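To make the tagging idea concrete, here is a minimal sketch that parses a toy XBRL-like fragment with the standard library; the namespace URI, concept names, and values are made up and far simpler than a real instance document:

```python
import xml.etree.ElementTree as ET

# A toy fragment mimicking an XBRL instance document: each fact is tagged
# with a taxonomy concept (here, invented us-gaap-style names). Real filings
# carry many contexts, units, and footnote links omitted here.
doc = """
<xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <us-gaap:Revenues contextRef="FY2023" unitRef="usd">1000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="usd">150000</us-gaap:NetIncomeLoss>
</xbrl>
"""

root = ET.fromstring(doc)

# Strip the namespace from each tag and map concept -> numeric value
facts = {el.tag.split("}")[1]: float(el.text) for el in root}
print(facts)
```

The taxonomy is what gives each tag its meaning; production code would resolve concepts against the official US GAAP taxonomy rather than hard-coding them.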
There are several avenues to track and access fundamental data reported to the SEC:
- As part of the EDGAR Public Dissemination Service (PDS), electronic feeds of accepted filings are available for a fee.
- The SEC updates RSS feeds every 10 minutes, which list structured disclosure submissions.
- There are public index files for the retrieval of all filings through FTP for automated processing.
- The financial statement (and notes) datasets contain parsed XBRL data from all financial statements and the accompanying notes.
The SEC also publishes log files containing the internet search traffic for EDGAR filings through SEC.gov, albeit with a six-month delay.
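As a sketch of the index-file route, the helper below builds the quarterly master index paths; the URL layout shown (year/quarter/master.idx under the full-index directory) is an assumption based on EDGAR's public archive structure, so verify it against the current EDGAR documentation before relying on it:

```python
# Assumed layout of EDGAR's public full-index directory; each quarterly
# master.idx file lists every accepted filing for that quarter.
BASE = "https://www.sec.gov/Archives/edgar/full-index"

def master_index_urls(start_year, end_year):
    """Yield one master.idx URL per quarter in [start_year, end_year]."""
    for year in range(start_year, end_year + 1):
        for quarter in range(1, 5):
            yield f"{BASE}/{year}/QTR{quarter}/master.idx"

urls = list(master_index_urls(2019, 2020))
print(urls[0])
```

Downloading these files programmatically also requires a descriptive User-Agent header, per the SEC's fair-access guidelines.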
Building a fundamental data time series
The scope of the data in the Financial Statement and Notes datasets consists of numeric data extracted from the primary financial statements (Balance sheet, income statement, cash flows, changes in equity, and comprehensive income) and footnotes on those statements. The data is available as early as 2009.
The folder 03_sec_edgar contains the notebook edgar_xbrl to download and parse EDGAR data in XBRL format, and create fundamental metrics like the P/E ratio by combining financial statement and price data.
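The combination step can be sketched as follows, with hypothetical EPS and price figures (all numbers invented); the merge_asof pattern aligns each price observation with the most recently reported fundamental value before computing the ratio:

```python
import pandas as pd

# Hypothetical quarterly EPS (as would come from parsed XBRL filings) and
# daily close prices; both frames must be sorted by date for merge_asof.
eps = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-31", "2023-06-30"]),
    "eps": [1.25, 1.50],
})
prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-15", "2023-08-15"]),
    "close": [50.0, 63.0],
})

# Align each price with the latest prior EPS, then compute the P/E ratio
merged = pd.merge_asof(prices, eps, on="date")
merged["pe"] = merged["close"] / merged["eps"]
print(merged)
```

A point-in-time join like this avoids look-ahead bias: each price only sees fundamentals that had already been reported on that date.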
Other fundamental data sources
- Compilation of macro resources by the Yale Law School
- Capital IQ
- Compustat
- MSCI Barra
- Northfield Information Services
- Quantitative Services Group
Efficient data storage with pandas
The notebook storage_benchmark compares the main storage formats for efficiency and performance.
In particular, it compares:
- CSV: Comma-separated, standard flat text file format.
- HDF5: Hierarchical data format, developed initially at the National Center for Supercomputing Applications, is a fast and scalable storage format for numerical data, available in pandas using the PyTables library.
- Parquet: A binary, columnar storage format, part of the Apache Hadoop ecosystem, that provides efficient data compression and encoding and has been developed by Cloudera and Twitter. It is available for pandas through the pyarrow library, led by Wes McKinney, the original author of pandas.
It uses a test DataFrame that can be configured to contain numerical or text data, or both. For the HDF5 library, we test both the fixed and table formats. The table format allows for queries and can be appended to.
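A minimal sketch of such a benchmark, using a tiny stand-in DataFrame and simple wall-clock timing rather than the notebook's %%timeit; parquet is attempted only if a supporting engine (pyarrow or fastparquet) is installed, and absolute timings will of course differ from the notebook's larger test data:

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in for the notebook's test DataFrame: numerical columns plus
# one text column, so both data-type scenarios are represented.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "a": rng.standard_normal(10_000),
    "b": rng.standard_normal(10_000),
    "text": rng.choice(["buy", "sell", "hold"], size=10_000),
})

def benchmark(write, read, path):
    """Time a write/read round trip and report the file size in bytes."""
    start = time.perf_counter()
    write(path)
    write_s = time.perf_counter() - start
    start = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - start
    return write_s, read_s, Path(path).stat().st_size

with tempfile.TemporaryDirectory() as tmp:
    results = {"csv": benchmark(df.to_csv, pd.read_csv, f"{tmp}/test.csv")}
    try:  # parquet needs pyarrow or fastparquet
        results["parquet"] = benchmark(
            df.to_parquet, pd.read_parquet, f"{tmp}/test.parquet"
        )
    except ImportError:
        pass

for fmt, (w, r, size) in results.items():
    print(f"{fmt}: write {w:.3f}s, read {r:.3f}s, {size:,} bytes")
```

Extending this to HDF5 follows the same pattern with df.to_hdf / pd.read_hdf, given a PyTables installation.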
Test Results
In short, the results are:
- For purely numerical data, the HDF5 format performs best, and its table format also shares with CSV the smallest memory footprint at 1.6 GB. The fixed format uses twice as much space, while the parquet format uses 2 GB.
- For a mix of numerical and text data, parquet is significantly faster, while HDF5 retains its read advantage over CSV.
The notebook illustrates how to configure, test, and collect the timings using the %%timeit cell magic, and at the same time demonstrates the related pandas commands required to use these storage formats.