How to process AlgoSeek intraday NASDAQ 100 data
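This text describes how to process Algoseek's NASDAQ 100 minute-bar trade data. The data covers trade and quote information for 2015-2017 and can be downloaded from Algoseek's website. The notebook algoseek_minute_data contains the code to extract and combine the data, which Chapter 12 uses to develop a gradient boosting model that predicts one-minute returns for an intraday trading strategy.
The text also covers Algoseek's Trade & Quote Minute Bar data, whose fields are derived from trades and quotes. They include the date, ticker, time, open, high, and low prices, among others. Some of the fields are based on changes to the NBBO (National Best Bid Offer).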
You can download a sample of Algoseek’s NASDAQ100 Minute Bar data with Trade & Quote information for 2015-2017 from Algoseek’s website here. The notebook algoseek_minute_data contains the code to extract and combine the data that we will use in Chapter 12 to develop a Gradient Boosting model that predicts one-minute returns for an intraday trading strategy.
Unzip the directory, rename it 1min_taq, and move it into a new nasdaq100 folder in the data directory. It contains around 5GB worth of NASDAQ 100 minute bar data in trade-and-quote format. See documentation for details on the definition of the numerous fields.
The following information is from the Algoseek Trade & Quote Minute Bar data linked above.
Trade & Quote Minute Bar Fields
The Quote fields are based on changes to the NBBO (National Best Bid Offer) from the top-of-book price and size from each of the exchanges.
The enhanced Trade & Quote bar fields include the following fields:
- Field: Name of Field.
- Q / T: Field based on Quotes or Trades
- Type: Field format
- No Value: Value of field when there is no value or data.
- Note: “Never” means field should always have a value EXCEPT for the first bar of the day.
- Description: Description of the field.
id | Field | Q/T | Type | No Value | Description |
---|---|---|---|---|---|
1 | Date | | YYYYMMDD | Never | Trade Date |
2 | Ticker | | String | Never | Ticker Symbol |
3 | TimeBarStart | | HHMM HHMMSS HHMMSSMMM | Never | For minute bars: HHMM. For second bars: HHMMSS. Examples: one-second bar 130302 is from time greater than 130301 to 130302; one-minute bar 1104 is from time greater than 1103 to 1104. |
4 | OpenBarTime | Q | HHMMSSMMM | Never | Open Time of the Bar, for example one minute: 11:03:00.000 |
5 | OpenBidPrice | Q | Number | Never | NBBO Bid Price as of bar Open |
6 | OpenBidSize | Q | Number | Never | Total Size from all Exchanges with OpenBidPrice |
7 | OpenAskPrice | Q | Number | Never | NBBO Ask Price as of bar Open |
8 | OpenAskSize | Q | Number | Never | Total Size from all Exchange with OpenAskPrice |
9 | FirstTradeTime | T | HHMMSSMMM | Blank | Time of first Trade |
10 | FirstTradePrice | T | Number | Blank | Price of first Trade |
11 | FirstTradeSize | T | Number | Blank | Number of shares of first trade |
12 | HighBidTime | Q | HHMMSSMMM | Never | Time of highest NBBO Bid Price |
13 | HighBidPrice | Q | Number | Never | Highest NBBO Bid Price |
14 | HighBidSize | Q | Number | Never | Total Size from all Exchanges with HighBidPrice |
15 | AskPriceAtHighBidPrice | Q | Number | Never | Ask Price at time of Highest Bid Price |
16 | AskSizeAtHighBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtHighBidPrice |
17 | HighTradeTime | T | HHMMSSMMM | Blank | Time of Highest Trade |
18 | HighTradePrice | T | Number | Blank | Price of highest Trade |
19 | HighTradeSize | T | Number | Blank | Number of shares of highest trade |
20 | LowBidTime | Q | HHMMSSMMM | Never | Time of lowest Bid |
21 | LowBidPrice | Q | Number | Never | Lowest NBBO Bid price of bar. |
22 | LowBidSize | Q | Number | Never | Total Size from all Exchanges with LowBidPrice |
23 | AskPriceAtLowBidPrice | Q | Number | Never | Ask Price at lowest Bid price |
24 | AskSizeAtLowBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtLowBidPrice |
25 | LowTradeTime | T | HHMMSSMMM | Blank | Time of lowest Trade |
26 | LowTradePrice | T | Number | Blank | Price of lowest Trade |
27 | LowTradeSize | T | Number | Blank | Number of shares of lowest trade |
28 | CloseBarTime | Q | HHMMSSMMM | Never | Close Time of the Bar, for example one minute: 11:03:59.999 |
29 | CloseBidPrice | Q | Number | Never | NBBO Bid Price at bar Close |
30 | CloseBidSize | Q | Number | Never | Total Size from all Exchange with CloseBidPrice |
31 | CloseAskPrice | Q | Number | Never | NBBO Ask Price at bar Close |
32 | CloseAskSize | Q | Number | Never | Total Size from all Exchange with CloseAskPrice |
33 | LastTradeTime | T | HHMMSSMMM | Blank | Time of last Trade |
34 | LastTradePrice | T | Number | Blank | Price of last Trade |
35 | LastTradeSize | T | Number | Blank | Number of shares of last trade |
36 | MinSpread | Q | Number | Never | Minimum Bid-Ask spread size. This may be 0 if the market was crossed during the bar. If negative spread due to back quote, make it 0. |
37 | MaxSpread | Q | Number | Never | Maximum Bid-Ask spread in bar |
38 | CancelSize | T | Number | 0 | Total shares canceled. Default=blank |
39 | VolumeWeightPrice | T | Number | Blank | Trade volume-weighted average price: ((Trade1Shares × Trade1Price) + (Trade2Shares × Trade2Price) + …) / TotalShares. Note: Blank if no trades. |
40 | NBBOQuoteCount | Q | Number | 0 | Number of Bid and Ask NBBO quotes during bar period. |
41 | TradeAtBid | Q,T | Number | 0 | Sum of trade volume that occurred at or below the bid (a trade reported/printed late can be below current bid). |
42 | TradeAtBidMid | Q,T | Number | 0 | Sum of trade volume that occurred between the bid and the mid-point: (Trade Price > NBBO Bid ) & (Trade Price < NBBO Mid ) |
43 | TradeAtMid | Q,T | Number | 0 | Sum of trade volume that occurred at mid. TradePrice = NBBO MidPoint |
44 | TradeAtMidAsk | Q,T | Number | 0 | Sum of trade volume that occurred between the mid and ask: (Trade Price > NBBO Mid) & (Trade Price < NBBO Ask) |
45 | TradeAtAsk | Q,T | Number | 0 | Sum of trade volume that occurred at or above the Ask. |
46 | TradeAtCrossOrLocked | Q,T | Number | 0 | Sum of trade volume for bar when national best bid/offer is locked or crossed. Locked is Bid = Ask; Crossed is Bid > Ask. |
47 | Volume | T | Number | 0 | Total number of shares traded |
48 | TotalTrades | T | Number | 0 | Total number of trades |
49 | FinraVolume | T | Number | 0 | Number of shares traded that are reported by FINRA. Trades reported by FINRA are from broker-dealer internalization, dark pools, Over-The-Counter, etc. FINRA trades represent volume that is hidden or not publicly available to trade. |
50 | UptickVolume | T | Integer | 0 | Total number of shares traded with upticks during bar. An uptick = ( trade price > last trade price ) |
51 | DowntickVolume | T | Integer | 0 | Total number of shares traded with downticks during bar. A downtick = ( trade price < last trade price ) |
52 | RepeatUptickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was up during bar. Repeat uptick = ( trade price == last trade price ) & (last tick direction == up ) |
53 | RepeatDowntickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was down during bar. Repeat downtick = ( trade price == last trade price ) & (last tick direction == down ) |
54 | UnknownVolume | T | Integer | 0 | When the first trade of the day takes place, the tick direction is “unknown” as there is no previous Trade to compare it to. This field is the volume of the first trade after 4am and acts as an initiation value for the tick volume directions. In future this bar will be renamed to UnkownTickDirectionVolume . |
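The definitions of VolumeWeightPrice and the tick-direction volume fields can be made concrete with a small sketch. The `Trade` class, the function names, and the trade values below are made up for illustration and are not part of the Algoseek data or the notebook. The sketch also assumes that a repeated price seen before any direction is known counts as unknown, which the field descriptions do not state.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    """Hypothetical container for a single trade inside one bar."""
    price: float
    shares: int


def vwap(trades):
    """Volume-weighted average price; None mirrors the 'Blank' no-value case."""
    total_shares = sum(t.shares for t in trades)
    if total_shares == 0:
        return None
    return sum(t.price * t.shares for t in trades) / total_shares


def tick_volumes(trades):
    """Split traded shares by tick direction relative to the previous trade."""
    vol = {"uptick": 0, "downtick": 0, "repeat_uptick": 0,
           "repeat_downtick": 0, "unknown": 0}
    last_price = None
    last_direction = None
    for t in trades:
        if last_price is None:
            # First trade: nothing to compare against
            vol["unknown"] += t.shares
        elif t.price > last_price:
            vol["uptick"] += t.shares
            last_direction = "up"
        elif t.price < last_price:
            vol["downtick"] += t.shares
            last_direction = "down"
        elif last_direction == "up":
            vol["repeat_uptick"] += t.shares
        elif last_direction == "down":
            vol["repeat_downtick"] += t.shares
        else:
            # Assumption: repeated price before any direction is known
            vol["unknown"] += t.shares
        last_price = t.price
    return vol


# Synthetic trades, not real market data
trades = [Trade(10.00, 100), Trade(10.01, 200), Trade(10.01, 50),
          Trade(10.00, 100), Trade(10.00, 300)]
bar_vwap = vwap(trades)
bar_ticks = tick_volumes(trades)
print(bar_vwap, bar_ticks)
```

In the actual data these quantities come precomputed per bar; the sketch only shows how the per-trade rules aggregate.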
Notes
Empty Fields
An empty field has no value and is “Blank”, for example FirstTradeTime when there are no trades during the bar period.
The field Volume, measuring the total number of shares traded in the bar, will be 0 if there are no Trades (see the No Value column above for each field).
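As a rough illustration of how these conventions surface in pandas, the sketch below reads a two-row CSV snippet. The CSV layout with a header row is an assumption, the values are synthetic, and only a handful of the fields above are included. Blank trade fields become NaN, while Volume stays 0.

```python
import io

import pandas as pd

# Synthetic two-bar sample; the second bar has no trades
sample = io.StringIO(
    "Date,Ticker,TimeBarStart,FirstTradePrice,Volume\n"
    "20150102,AAPL,0930,111.39,1500\n"
    "20150102,AAPL,0931,,0\n"
)

# Read Date and TimeBarStart as strings so leading zeros survive
bars = pd.read_csv(sample, dtype={"Date": str, "TimeBarStart": str})

# Combine the YYYYMMDD date and HHMM bar label into one timestamp
bars["timestamp"] = pd.to_datetime(
    bars["Date"] + bars["TimeBarStart"], format="%Y%m%d%H%M"
)

# Bars without trades: blank (NaN) trade fields and zero volume
no_trades = bars["FirstTradePrice"].isna() & (bars["Volume"] == 0)
print(bars[["timestamp", "FirstTradePrice", "Volume"]])
print(no_trades.tolist())
```

Keeping the blanks as NaN rather than filling them with 0 avoids treating a bar without trades as a bar with a zero price.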
No Bid/Ask/Trade OHLC
During a bar timeframe there may not be a change in the NBBO or an actual Trade.
For example, there can be a bar with OHLC Bid/Ask but no Trade OHLC.
Single Event
For bars with only one trade, one NBBO bid, or one NBBO ask, the Open/High/Low/Close price, size, and time will be the same.
AskPriceAtHighBidPrice, AskSizeAtHighBidPrice, AskPriceAtLowBidPrice, AskSizeAtLowBidPrice Fields
To provide consistent Bid/Ask prices at a point in time while showing the low/high Bid/Ask for the bar, AlgoSeek uses the low/high Bid and the corresponding Ask at that price.
FAQ
Why are Trade Prices often inside the Bid Price to Ask Price range?
The Low/High Bid/Ask is the low and high NBBO price for the bar range.
Very often a Trade may not occur at these prices, as the price may only last a few seconds, or executions are crossed at the mid-point due to hidden order types that execute at mid-point or as price improvement over the current Bid/Ask.
How to get exchange tradable shares?
To get the exchange tradable volume in a bar, subtract FinraVolume from Volume.
Volume is the total number of shares traded. FinraVolume is the total number of shares traded that are reported as executions by FINRA.
When a trade is done that is off the listed exchanges, it must be reported to FINRA by the brokerage firm or dark pool. Examples include:
- internal crosses by broker dealer
- over-the-counter block trades, and
- dark pool executions.
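A minimal pandas sketch of the exchange-tradable volume calculation, that is, Volume minus FinraVolume. The `ExchangeVolume` and `FinraShare` column names are hypothetical, and the numbers are synthetic.

```python
import pandas as pd

# Synthetic bars: total volume and the FINRA-reported (off-exchange) part
bars = pd.DataFrame({
    "Volume": [1500, 0, 800],
    "FinraVolume": [400, 0, 800],
})

# Shares that traded on the listed exchanges
bars["ExchangeVolume"] = bars["Volume"] - bars["FinraVolume"]

# Off-exchange share of volume; left as NaN for bars without trades
bars["FinraShare"] = (bars["FinraVolume"] / bars["Volume"]).where(bars["Volume"] > 0)
print(bars)
```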
02 API access to market data
There are several options to access market data via API using Python.
pandas datareader
The notebook 01_pandas_datareader_demo presents a few sources built into the pandas library.
- The pandas library enables access to data displayed on websites using the read_html function.
- The related pandas-datareader library provides access to the API endpoints of various data providers through a standard interface.
yfinance: Yahoo! Finance market and fundamental data
The notebook yfinance_demo shows how to use yfinance to download a variety of data from Yahoo! Finance. The library works around the deprecation of the historical data API by scraping data from the website in a reliable, efficient way with a Pythonic API.
LOBSTER tick data
The notebook 03_lobster_itch_data demonstrates the use of order book data made available by LOBSTER (Limit Order Book System - The Efficient Reconstructor), an online limit order book data tool that aims to provide easy-to-use, high-quality limit order book data.
Since 2013 LOBSTER acts as a data provider for the academic community, giving access to reconstructed limit order book data for the entire universe of NASDAQ traded stocks. More recently, it started offering a commercial service.
Quandl
The notebook 03_quandl_demo shows how Quandl uses a very straightforward API to make its free and premium data available. See documentation for more details.
zipline & Quantopian
The notebook zipline_data briefly introduces the backtesting library zipline that we will use throughout this book and shows how to access stock price data while running a backtest. For installation, please refer to the instructions here.
How to work with Fundamental data
The Securities and Exchange Commission (SEC) requires US issuers, that is, listed companies and securities, including mutual funds, to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), in addition to various other regulatory filing requirements.
Since the early 1990s, the SEC made these filings available through its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. They constitute the primary data source for the fundamental analysis of equity and other securities, such as corporate credit, where the value depends on the business prospects and financial health of the issuer.
Automated processing using XBRL markup
Automated analysis of regulatory filings has become much easier since the SEC introduced XBRL, a free, open, and global standard for the electronic representation and exchange of business reports. XBRL is based on XML; it relies on taxonomies that define the meaning of the elements of a report and map to tags that highlight the corresponding information in the electronic version of the report. One such taxonomy represents the US Generally Accepted Accounting Principles (GAAP).
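To illustrate what tagged facts in an XML-based report look like, the following sketch parses a tiny hand-written fragment with Python's standard library. The fragment is heavily simplified and not a valid filing: the values, the context ID `FY2023`, and the taxonomy namespace year are illustrative assumptions, and real instance documents also define contexts and units that are omitted here.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up XBRL-style fragment (not a real filing)
doc = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <us-gaap:Revenues contextRef="FY2023" unitRef="USD">1000000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="USD">150000000</us-gaap:NetIncomeLoss>
</xbrli:xbrl>"""

root = ET.fromstring(doc)

# ElementTree expands prefixes to '{namespace}tag'
US_GAAP = "{http://fasb.org/us-gaap/2023}"

facts = {}
for element in root:
    if element.tag.startswith(US_GAAP):
        # The taxonomy tag names the reported item; attributes give period and unit
        name = element.tag[len(US_GAAP):]
        facts[name] = (element.get("contextRef"), element.get("unitRef"), int(element.text))

print(facts)
```

The taxonomy supplies the meaning of each tag, such as `Revenues`, so the same lookup works across filers that use the same taxonomy version.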
The SEC introduced voluntary XBRL filings in 2005 in response to accounting scandals, has required this format for all filers since 2009, and continues to expand the mandatory coverage to other regulatory filings. The SEC maintains a website that lists the current taxonomies that shape the content of different filings and can be used to extract specific items.
There are several avenues to track and access fundamental data reported to the SEC:
- As part of the EDGAR Public Dissemination Service (PDS), electronic feeds of accepted filings are available for a fee.
- The SEC updates RSS feeds every 10 minutes, which list structured disclosure submissions.
- There are public index files for the retrieval of all filings through FTP for automated processing.
- The financial statement (and notes) datasets contain parsed XBRL data from all financial statements and the accompanying notes.
The SEC also publishes log files containing the internet search traffic for EDGAR filings through SEC.gov, albeit with a six-month delay.
Building a fundamental data time series
The scope of the data in the Financial Statement and Notes datasets consists of numeric data extracted from the primary financial statements (Balance sheet, income statement, cash flows, changes in equity, and comprehensive income) and footnotes on those statements. The data is available as early as 2009.
The folder 03_sec_edgar contains the notebook edgar_xbrl to download and parse EDGAR data in XBRL format, and create fundamental metrics like the P/E ratio by combining financial statement and price data.
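One way to combine the two data sets is sketched below. It is not the notebook's code: the column names and numbers are invented, and it assumes a trailing-twelve-month EPS built from four quarterly values. `pd.merge_asof` attaches to each price date the most recent figure filed on or before that date, so earnings are not used before they were public.

```python
import pandas as pd

# Synthetic quarterly earnings per share with their filing dates
eps = pd.DataFrame({
    "filed": pd.to_datetime(["2016-01-27", "2016-04-27", "2016-07-27", "2016-10-26"]),
    "eps": [1.0, 0.5, 0.75, 0.75],
})
# Trailing twelve months: sum of the latest four quarters
eps["eps_ttm"] = eps["eps"].rolling(4).sum()

# Synthetic closing prices
prices = pd.DataFrame({
    "date": pd.to_datetime(["2016-10-25", "2016-10-27", "2016-11-15"]),
    "close": [60.0, 66.0, 63.0],
})

# Backward as-of join: last filing on or before each price date
merged = pd.merge_asof(
    prices, eps[["filed", "eps_ttm"]], left_on="date", right_on="filed"
)
merged["pe"] = merged["close"] / merged["eps_ttm"]
print(merged)
```

The first price date precedes the fourth filing, so no full trailing year is available and the ratio stays missing.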
Other fundamental data sources
- Compilation of macro resources by the Yale Law School
- Capital IQ
- Compustat
- MSCI Barra
- Northfield Information Services
- Quantitative Services Group
Efficient data storage with pandas
The notebook storage_benchmark compares the main storage formats for efficiency and performance.
In particular, it compares:
- CSV: Comma-separated, standard flat text file format.
- HDF5: Hierarchical data format, developed initially at the National Center for Supercomputing Applications (NCSA), is a fast and scalable storage format for numerical data, available in pandas using the PyTables library.
- Parquet: A binary, columnar storage format, part of the Apache Hadoop ecosystem, that provides efficient data compression and encoding and has been developed by Cloudera and Twitter. It is available for pandas through the pyarrow library, led by Wes McKinney, the original author of pandas.
It uses a test DataFrame that can be configured to contain numerical or text data, or both. For the HDF5 library, we test both the fixed and table format. The table format allows for queries and can be appended to.
Test Results
In short, the results are:
- For purely numerical data, the HDF5 format performs best, and the table format also shares with CSV the smallest memory footprint at 1.6 GB. The fixed format uses twice as much space, and the parquet format uses 2 GB.
- For a mix of numerical and text data, parquet is significantly faster, and HDF5 uses its advantage on reading relative to CSV.
The notebook illustrates how to configure, test, and collect the timing using the %%timeit cell magic. It also demonstrates the usage of the related pandas commands required to use these storage formats.
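Outside a notebook, a similar comparison can be sketched with the standard `timeit` module. This is not the notebook's benchmark: the DataFrame is much smaller, each operation runs only once, and formats whose optional dependency (pyarrow or PyTables) is missing are skipped. The file names and the shape of the `results` table are arbitrary choices.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

# Small numerical test frame; the notebook uses a far larger one
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(rng.standard_normal((10_000, 5)), columns=list("abcde"))

# One (write, read) pair of pandas calls per storage format
formats = {
    "csv": (lambda p: df.to_csv(p, index=False),
            lambda p: pd.read_csv(p)),
    "parquet": (lambda p: df.to_parquet(p),
                lambda p: pd.read_parquet(p)),
    "hdf_fixed": (lambda p: df.to_hdf(p, key="df", format="fixed"),
                  lambda p: pd.read_hdf(p, key="df")),
    "hdf_table": (lambda p: df.to_hdf(p, key="df", format="table"),
                  lambda p: pd.read_hdf(p, key="df")),
}

rows = []
with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        path = os.path.join(tmp, f"test.{name}")
        try:
            write_s = timeit.timeit(lambda: write(path), number=1)
            read_s = timeit.timeit(lambda: read(path), number=1)
        except ImportError:
            # pyarrow or PyTables is not installed; skip this format
            continue
        rows.append({
            "format": name,
            "write_s": write_s,
            "read_s": read_s,
            "size_kb": os.path.getsize(path) / 1024,
        })

results = pd.DataFrame(rows)
print(results)
```

Single-run timings on a frame this small are noisy, so the numbers only indicate how the measurement is set up, not which format is faster.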