How to process AlgoSeek intraday NASDAQ 100 data

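This article explains how to process AlgoSeek's NASDAQ 100 minute-bar trading data. The data contains trade and quote information for 2015-2017 and can be downloaded from AlgoSeek's website. The notebook algoseek_minute_data contains the code to extract and combine the data, which is used in Chapter 12 to develop a gradient boosting model that predicts one-minute returns for an intraday trading strategy.

The article also covers AlgoSeek's Trade & Quote Minute Bar data, whose fields are derived from trades and quotes. They include the date, ticker, time, open, high, and low prices, among others. Some of the fields are based on changes to the NBBO (National Best Bid Offer).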

You can download a sample of AlgoSeek's NASDAQ 100 Minute Bar data with Trade & Quote information for 2015-2017 from AlgoSeek's website. The notebook algoseek_minute_data contains the code to extract and combine the data that we will use in Chapter 12 to develop a Gradient Boosting model that predicts one-minute returns for an intraday trading strategy.

Unzip the directory, rename it 1min_taq, and move it into a new nasdaq100 folder in the data directory. It contains around 5GB worth of NASDAQ 100 minute bar data in trade-and-quote format. See documentation for details on the definition of the numerous fields.
The following information is from the Algoseek Trade & Quote Minute Bar data linked above.

Trade & Quote Minute Bar Fields

The Quote fields are based on changes to the NBBO (National Best Bid Offer) from the top-of-book price and size from each of the exchanges.

The enhanced Trade & Quote bar fields are described using the following columns:

  • Field: Name of Field.
  • Q / T: Field based on Quotes or Trades
  • Type: Field format
  • No Value: Value of field when there is no value or data.
    • Note: “Never” means field should always have a value EXCEPT for the first bar of the day.
  • Description: Description of the field.
| id | Field | Q/T | Type | No Value | Description |
|---|---|---|---|---|---|
| 1 | Date | | YYYYMMDD | Never | Trade Date |
| 2 | Ticker | | String | Never | Ticker Symbol |
| 3 | TimeBarStart | | HHMM, HHMMSS, or HHMMSSMMM | Never | For minute bars: HHMM. For second bars: HHMMSS. Examples: one second bar 130302 is from time greater than 130301 to 130302; one minute bar 1104 is from time greater than 1103 to 1104. |
| 4 | OpenBarTime | Q | HHMMSSMMM | Never | Open Time of the Bar, for example one minute: 11:03:00.000 |
| 5 | OpenBidPrice | Q | Number | Never | NBBO Bid Price as of bar Open |
| 6 | OpenBidSize | Q | Number | Never | Total Size from all Exchanges with OpenBidPrice |
| 7 | OpenAskPrice | Q | Number | Never | NBBO Ask Price as of bar Open |
| 8 | OpenAskSize | Q | Number | Never | Total Size from all Exchanges with OpenAskPrice |
| 9 | FirstTradeTime | T | HHMMSSMMM | Blank | Time of first Trade |
| 10 | FirstTradePrice | T | Number | Blank | Price of first Trade |
| 11 | FirstTradeSize | T | Number | Blank | Number of shares of first trade |
| 12 | HighBidTime | Q | HHMMSSMMM | Never | Time of highest NBBO Bid Price |
| 13 | HighBidPrice | Q | Number | Never | Highest NBBO Bid Price |
| 14 | HighBidSize | Q | Number | Never | Total Size from all Exchanges with HighBidPrice |
| 15 | AskPriceAtHighBidPrice | Q | Number | Never | Ask Price at time of Highest Bid Price |
| 16 | AskSizeAtHighBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtHighBidPrice |
| 17 | HighTradeTime | T | HHMMSSMMM | Blank | Time of Highest Trade |
| 18 | HighTradePrice | T | Number | Blank | Price of highest Trade |
| 19 | HighTradeSize | T | Number | Blank | Number of shares of highest trade |
| 20 | LowBidTime | Q | HHMMSSMMM | Never | Time of lowest Bid |
| 21 | LowBidPrice | Q | Number | Never | Lowest NBBO Bid price of bar |
| 22 | LowBidSize | Q | Number | Never | Total Size from all Exchanges with LowBidPrice |
| 23 | AskPriceAtLowBidPrice | Q | Number | Never | Ask Price at lowest Bid price |
| 24 | AskSizeAtLowBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtLowBidPrice |
| 25 | LowTradeTime | T | HHMMSSMMM | Blank | Time of lowest Trade |
| 26 | LowTradePrice | T | Number | Blank | Price of lowest Trade |
| 27 | LowTradeSize | T | Number | Blank | Number of shares of lowest trade |
| 28 | CloseBarTime | Q | HHMMSSMMM | Never | Close Time of the Bar, for example one minute: 11:03:59.999 |
| 29 | CloseBidPrice | Q | Number | Never | NBBO Bid Price at bar Close |
| 30 | CloseBidSize | Q | Number | Never | Total Size from all Exchanges with CloseBidPrice |
| 31 | CloseAskPrice | Q | Number | Never | NBBO Ask Price at bar Close |
| 32 | CloseAskSize | Q | Number | Never | Total Size from all Exchanges with CloseAskPrice |
| 33 | LastTradeTime | T | HHMMSSMMM | Blank | Time of last Trade |
| 34 | LastTradePrice | T | Number | Blank | Price of last Trade |
| 35 | LastTradeSize | T | Number | Blank | Number of shares of last trade |
| 36 | MinSpread | Q | Number | Never | Minimum Bid-Ask spread size. This may be 0 if the market was crossed during the bar. A negative spread caused by a back quote is set to 0. |
| 37 | MaxSpread | Q | Number | Never | Maximum Bid-Ask spread in bar |
| 38 | CancelSize | T | Number | 0 | Total shares canceled. Default=blank |
| 39 | VolumeWeightPrice | T | Number | Blank | Trade Volume weighted average price: Sum((Trade1Shares*Price)+(Trade2Shares*Price)+…)/TotalShares. Note: Blank if no trades. |
| 40 | NBBOQuoteCount | Q | Number | 0 | Number of Bid and Ask NBBO quotes during bar period |
| 41 | TradeAtBid | Q,T | Number | 0 | Sum of trade volume that occurred at or below the bid (a trade reported/printed late can be below current bid) |
| 42 | TradeAtBidMid | Q,T | Number | 0 | Sum of trade volume that occurred between the bid and the mid-point: (Trade Price > NBBO Bid) & (Trade Price < NBBO Mid) |
| 43 | TradeAtMid | Q,T | Number | 0 | Sum of trade volume that occurred at mid: Trade Price = NBBO MidPoint |
| 44 | TradeAtMidAsk | Q,T | Number | 0 | Sum of trade volume that occurred between the mid and ask: (Trade Price > NBBO Mid) & (Trade Price < NBBO Ask) |
| 45 | TradeAtAsk | Q,T | Number | 0 | Sum of trade volume that occurred at or above the Ask |
| 46 | TradeAtCrossOrLocked | Q,T | Number | 0 | Sum of trade volume for bar when national best bid/offer is locked or crossed. Locked is Bid = Ask; Crossed is Bid > Ask. |
| 47 | Volume | T | Number | 0 | Total number of shares traded |
| 48 | TotalTrades | T | Number | 0 | Total number of trades |
| 49 | FinraVolume | T | Number | 0 | Number of shares traded that are reported by FINRA. Trades reported by FINRA are from broker-dealer internalization, dark pools, Over-The-Counter, etc. FINRA trades represent volume that is hidden or not publicly available to trade. |
| 50 | UptickVolume | T | Integer | 0 | Total number of shares traded with upticks during bar. An uptick = (trade price > last trade price) |
| 51 | DowntickVolume | T | Integer | 0 | Total number of shares traded with downticks during bar. A downtick = (trade price < last trade price) |
| 52 | RepeatUptickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was up during bar. Repeat uptick = (trade price == last trade price) & (last tick direction == up) |
| 53 | RepeatDowntickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was down during bar. Repeat downtick = (trade price == last trade price) & (last tick direction == down) |
| 54 | UnknownVolume | T | Integer | 0 | When the first trade of the day takes place, the tick direction is “unknown” as there is no previous Trade to compare it to. This field is the volume of the first trade after 4am and acts as an initiation value for the tick volume directions. In future this field will be renamed to UnkownTickDirectionVolume. |
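
The tick-direction definitions for fields 50 to 54 can be illustrated with a minimal sketch. It assumes the trades of one day arrive as a time-ordered list of (price, size) tuples; the function name and input layout are hypothetical and not part of AlgoSeek's tooling. How a repeated price is counted before any price change has occurred is also an assumption here (it is added to UnknownVolume).

```python
def tick_direction_volumes(trades):
    """Sum trade sizes by tick direction for time-ordered (price, size) tuples."""
    volumes = {
        "UptickVolume": 0,
        "DowntickVolume": 0,
        "RepeatUptickVolume": 0,
        "RepeatDowntickVolume": 0,
        "UnknownVolume": 0,
    }
    last_price = None      # price of the previous trade
    last_direction = None  # "up" or "down" once a price change has occurred
    for price, size in trades:
        if last_price is None:
            # First trade of the day: no previous trade to compare with
            volumes["UnknownVolume"] += size
        elif price > last_price:
            last_direction = "up"
            volumes["UptickVolume"] += size
        elif price < last_price:
            last_direction = "down"
            volumes["DowntickVolume"] += size
        elif last_direction == "up":
            volumes["RepeatUptickVolume"] += size
        elif last_direction == "down":
            volumes["RepeatDowntickVolume"] += size
        else:
            # Assumption: repeated price before any price change stays "unknown"
            volumes["UnknownVolume"] += size
        last_price = price
    return volumes


# Made-up trades for illustration only
trades = [(10.00, 100), (10.01, 200), (10.01, 50), (10.00, 300), (10.00, 25)]
result = tick_direction_volumes(trades)
```

In the actual data these volumes are reported per bar, while the comparison price carries over from the previous trade, so a per-bar version would keep `last_price` and `last_direction` across bars.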

Notes

Empty Fields

An empty field has no value and is “Blank”; for example, FirstTradeTime is blank when there are no trades during the bar period.
The field Volume, which measures the total number of shares traded in the bar, will be 0 if there are no trades (see the No Value column above for each field).
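
When reading the files, the difference between a blank field and a 0 matters. The following is a minimal sketch using only the standard library; the two rows and the reduced column set are made up for illustration, and the helper name is hypothetical.

```python
import csv
import io

# Two made-up rows with a small subset of the columns described above
sample = io.StringIO(
    "Date,Ticker,TimeBarStart,FirstTradePrice,Volume\n"
    "20150102,AAPL,0931,111.39,1500\n"
    "20150102,AAPL,0932,,0\n"
)


def parse_optional_float(value):
    """Return None for a blank field, otherwise the value as a float."""
    return float(value) if value != "" else None


bars = []
for row in csv.DictReader(sample):
    bars.append({
        "TimeBarStart": row["TimeBarStart"],
        # Trade price fields are blank when the bar has no trades
        "FirstTradePrice": parse_optional_float(row["FirstTradePrice"]),
        # Volume is 0, not blank, when the bar has no trades
        "Volume": int(row["Volume"]),
    })
```

Keeping blanks as missing values rather than converting them to 0 avoids treating a bar without trades as a bar that traded at a price of zero.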

No Bid/Ask/Trade OHLC

During a bar timeframe there may not be a change in the NBBO or an actual Trade.
For example, there can be a bar with OHLC Bid/Ask but no Trade OHLC.

Single Event

For bars with only one trade, one NBBO bid, or one NBBO ask, the Open/High/Low/Close price, size, and time will be the same.
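
The single-event case can be checked with a small sketch that summarizes the trades of one bar, including the volume-weighted price from the field table. The function name, the tuple layout (time, price, size), and the sample values are assumptions for illustration.

```python
def summarize_trades(trades):
    """Summarize one bar's trades, given as time-ordered (time, price, size) tuples."""
    if not trades:
        # No trades: price fields are blank (None here), counters are 0
        return {
            "FirstTradePrice": None, "HighTradePrice": None,
            "LowTradePrice": None, "LastTradePrice": None,
            "VolumeWeightPrice": None, "Volume": 0, "TotalTrades": 0,
        }
    volume = sum(size for _, _, size in trades)
    return {
        "FirstTradePrice": trades[0][1],
        "HighTradePrice": max(price for _, price, _ in trades),
        "LowTradePrice": min(price for _, price, _ in trades),
        "LastTradePrice": trades[-1][1],
        # Sum of shares * price over all trades, divided by total shares
        "VolumeWeightPrice": sum(price * size for _, price, size in trades) / volume,
        "Volume": volume,
        "TotalTrades": len(trades),
    }


# One trade: first, high, low, and last price coincide
single = summarize_trades([("110315250", 25.5, 100)])
# Two trades: (10.0 * 100 + 11.0 * 300) / 400 = 10.75
several = summarize_trades([("110301000", 10.0, 100), ("110330000", 11.0, 300)])
```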

AskPriceAtHighBidPrice, AskSizeAtHighBidPrice, AskPriceAtLowBidPrice, AskSizeAtLowBidPrice Fields

To provide consistent Bid/Ask prices at a point in time while showing the low/high Bid/Ask for the bar, AlgoSeek uses the low/high Bid and the corresponding Ask at that price.
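
A minimal sketch of this pairing, assuming the NBBO updates of one bar are available as time-ordered (time, bid, ask) tuples. The function name and the quotes are hypothetical, and when several quotes share the extreme bid the sketch simply takes the first one, which the documentation does not specify.

```python
def bid_extremes(quotes):
    """Return the high/low bid of a bar together with the ask quoted at that moment."""
    high = max(quotes, key=lambda quote: quote[1])  # quote with the highest bid
    low = min(quotes, key=lambda quote: quote[1])   # quote with the lowest bid
    return {
        "HighBidTime": high[0],
        "HighBidPrice": high[1],
        "AskPriceAtHighBidPrice": high[2],
        "LowBidTime": low[0],
        "LowBidPrice": low[1],
        "AskPriceAtLowBidPrice": low[2],
    }


# Made-up NBBO updates: (time, bid, ask)
quotes = [
    ("110301000", 10.00, 10.02),
    ("110320000", 10.03, 10.05),
    ("110345000", 9.98, 10.01),
]
extremes = bid_extremes(quotes)
```

Taking the highest bid and the highest ask independently could combine prices from different moments; pairing each bid extreme with its own ask keeps the two sides consistent.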

FAQ

Why are Trade Prices often inside the Bid Price to Ask Price range?

The Low/High Bid/Ask is the low and high NBBO price for the bar range.
Very often, a trade does not occur at these prices: the price may only last a few seconds, or executions may be crossed at the mid-point by hidden order types or as price improvement over the current Bid/Ask.
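
To make the position of a trade relative to the NBBO concrete, here is a sketch that assigns a trade to one of the TradeAt... buckets from the field table. The function name is hypothetical, and checking the locked/crossed case first (so that the buckets are mutually exclusive) is an assumption rather than documented behavior. Prices are `Decimal` values so that the comparison with the mid-point is exact.

```python
from decimal import Decimal


def trade_location(price, bid, ask):
    """Classify a trade price against the prevailing NBBO bid and ask."""
    if bid >= ask:
        # Locked (bid == ask) or crossed (bid > ask) market
        return "TradeAtCrossOrLocked"
    mid = (bid + ask) / 2
    if price <= bid:
        return "TradeAtBid"
    if price < mid:
        return "TradeAtBidMid"
    if price == mid:
        return "TradeAtMid"
    if price < ask:
        return "TradeAtMidAsk"
    return "TradeAtAsk"


bid, ask = Decimal("10.01"), Decimal("10.03")
# Made-up trades: (price, size)
trades = [
    (Decimal("10.01"), 100),
    (Decimal("10.015"), 50),
    (Decimal("10.02"), 200),
    (Decimal("10.03"), 75),
]
volume_by_location = {}
for price, size in trades:
    bucket = trade_location(price, bid, ask)
    volume_by_location[bucket] = volume_by_location.get(bucket, 0) + size
```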

How to get exchange tradable shares?

To get the exchange tradable volume in a bar, subtract FinraVolume from Volume.

  • Volume is the total number of shares traded.
  • FinraVolume is the total number of shares traded that are reported as executions by FINRA.

When a trade is done that is off the listed exchanges, it must be reported to FINRA by the brokerage firm or dark pool. Examples include:

  • internal crosses by broker dealer
  • over-the-counter block trades, and
  • dark pool executions.
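
As a quick illustration, the exchange-tradable volume is the total Volume minus the FinraVolume of the same bar. The bars below are made up, and the helper name is hypothetical.

```python
# Made-up bars with only the fields needed for the calculation
bars = [
    {"TimeBarStart": "0931", "Volume": 12000, "FinraVolume": 4500},
    {"TimeBarStart": "0932", "Volume": 0, "FinraVolume": 0},
]


def exchange_volume(bar):
    """Shares traded on the listed exchanges: total volume minus FINRA-reported volume."""
    return bar["Volume"] - bar["FinraVolume"]


exchange_volumes = [exchange_volume(bar) for bar in bars]
```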

02 API access to market data

There are several options to access market data via API using Python.

pandas datareader

The notebook 01_pandas_datareader_demo presents a few sources built into the pandas library.

  • The pandas library enables access to data displayed on websites using the read_html function.
  • The related pandas-datareader library provides access to the API endpoints of various data providers through a standard interface.

yfinance: Yahoo! Finance market and fundamental data

The notebook yfinance_demo shows how to use yfinance to download a variety of data from Yahoo! Finance. The library works around the deprecation of the historical data API by scraping data from the website in a reliable, efficient way with a Pythonic API.

LOBSTER tick data

The notebook 03_lobster_itch_data demonstrates the use of order book data made available by LOBSTER (Limit Order Book System - The Efficient Reconstructor), an online limit order book data tool that aims to provide easy-to-use, high-quality limit order book data.

Since 2013, LOBSTER has acted as a data provider for the academic community, giving access to reconstructed limit order book data for the entire universe of NASDAQ traded stocks. More recently, it started offering a commercial service.

Quandl

The notebook 03_quandl_demo shows how Quandl uses a very straightforward API to make its free and premium data available. See documentation for more details.

zipline & Quantopian

The notebook zipline_data briefly introduces the backtesting library zipline that we will use throughout this book and shows how to access stock price data while running a backtest. For installation, please refer to the installation instructions.

How to work with Fundamental data

The Securities and Exchange Commission (SEC) requires US issuers, that is, listed companies and securities, including mutual funds, to file three quarterly financial statements (Form 10-Q) and one annual report (Form 10-K), in addition to various other regulatory filing requirements.

Since the early 1990s, the SEC has made these filings available through its Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. They constitute the primary data source for the fundamental analysis of equity and other securities, such as corporate credit, where the value depends on the business prospects and financial health of the issuer.

Automated processing using XBRL markup

Automated analysis of regulatory filings has become much easier since the SEC introduced XBRL, a free, open, and global standard for the electronic representation and exchange of business reports. XBRL is based on XML; it relies on taxonomies that define the meaning of the elements of a report and map to tags that highlight the corresponding information in the electronic version of the report. One such taxonomy represents the US Generally Accepted Accounting Principles (GAAP).
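
To show what tagged facts look like in practice, the following sketch parses a heavily simplified XBRL-like fragment with the standard library. The fragment, its values, and the `http://example.com/us-gaap` namespace URI are made up for illustration; real filings declare many more namespaces, contexts, and units.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical instance document with two tagged facts
XBRL = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://example.com/us-gaap">
  <us-gaap:Revenues contextRef="FY2016" unitRef="USD">1000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2016" unitRef="USD">250000</us-gaap:NetIncomeLoss>
</xbrl>"""

GAAP_NAMESPACE = "http://example.com/us-gaap"  # placeholder, not the real taxonomy URI

root = ET.fromstring(XBRL)
facts = {}
for element in root:
    # ElementTree exposes qualified tags as "{namespace}LocalName"
    namespace, _, tag = element.tag[1:].partition("}")
    if namespace == GAAP_NAMESPACE:
        facts[tag] = (element.get("contextRef"), int(element.text))
```

The taxonomy supplies the meaning of a tag such as `Revenues`, while the `contextRef` attribute ties the value to a reporting period or entity defined elsewhere in the filing.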

The SEC introduced voluntary XBRL filings in 2005 in response to accounting scandals, has required this format for all filers since 2009, and continues to expand the mandatory coverage to other regulatory filings. The SEC maintains a website that lists the current taxonomies that shape the content of different filings and can be used to extract specific items.

There are several avenues to track and access fundamental data reported to the SEC:

  • As part of the EDGAR Public Dissemination Service (PDS), electronic feeds of accepted filings are available for a fee.
  • The SEC updates RSS feeds every 10 minutes, which list structured disclosure submissions.
  • There are public index files for the retrieval of all filings through FTP for automated processing.
  • The financial statement (and notes) datasets contain parsed XBRL data from all financial statements and the accompanying notes.

The SEC also publishes log files containing the internet search traffic for EDGAR filings through SEC.gov, albeit with a six-month delay.

Building a fundamental data time series

The scope of the data in the Financial Statement and Notes datasets consists of numeric data extracted from the primary financial statements (Balance sheet, income statement, cash flows, changes in equity, and comprehensive income) and footnotes on those statements. The data is available as early as 2009.

The folder 03_sec_edgar contains the notebook edgar_xbrl to download and parse EDGAR data in XBRL format, and create fundamental metrics like the P/E ratio by combining financial statement and price data.
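
As a rough illustration of combining statement and price data, the sketch below computes a price/earnings ratio from a share price and quarterly earnings per share. Using the sum of the last four quarters is one common convention and an assumption here, not necessarily what the notebook does; the function name and all numbers are made up.

```python
def trailing_pe(price, quarterly_eps):
    """Price divided by the sum of the four most recent quarterly EPS values."""
    if len(quarterly_eps) < 4:
        raise ValueError("need at least four quarters of EPS")
    trailing_eps = sum(quarterly_eps[-4:])
    # A P/E ratio is not meaningful for zero or negative earnings
    return price / trailing_eps if trailing_eps > 0 else None


# Made-up inputs: five quarters of EPS (oldest first) and a share price
quarterly_eps = [0.75, 1.0, 1.5, 2.0, 0.5]
pe_ratio = trailing_pe(100.0, quarterly_eps)
```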

Other fundamental data sources

Efficient data storage with pandas

The notebook storage_benchmark compares the main storage formats for efficiency and performance.

In particular, it compares:

  • CSV: Comma-separated, standard flat text file format.
  • HDF5: Hierarchical data format, developed initially at the National Center for Supercomputing Applications, is a fast and scalable storage format for numerical data, available in pandas using the PyTables library.
  • Parquet: A binary, columnar storage format, part of the Apache Hadoop ecosystem, that provides efficient data compression and encoding and has been developed by Cloudera and Twitter. It is available for pandas through the pyarrow library, led by Wes McKinney, the original author of pandas.

It uses a test DataFrame that can be configured to contain numerical or text data, or both. For the HDF5 library, we test both the fixed and table format. The table format allows for queries and can be appended to.
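
A comparison along these lines could be sketched as follows. This is not the notebook's code: it uses `time.perf_counter` instead of the %%timeit cell magic, a small random DataFrame, and hypothetical names throughout. Formats whose optional dependency (PyTables for HDF5, a Parquet engine such as pyarrow) is not installed are skipped.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark_writes(df, directory):
    """Time one write per storage format and record the resulting file size."""
    writers = {
        "csv": lambda path: df.to_csv(path),
        "hdf_fixed": lambda path: df.to_hdf(path, key="data", format="fixed"),
        "hdf_table": lambda path: df.to_hdf(path, key="data", format="table"),
        "parquet": lambda path: df.to_parquet(path),
    }
    results = {}
    for name, write in writers.items():
        path = os.path.join(directory, f"test.{name}")
        start = time.perf_counter()
        try:
            write(path)
        except ImportError:
            continue  # optional dependency for this format is missing
        results[name] = {
            "write_seconds": time.perf_counter() - start,
            "size_bytes": os.path.getsize(path),
        }
    return results


# Small numerical test frame; the notebook's frame is configurable and much larger
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))

with tempfile.TemporaryDirectory() as directory:
    results = benchmark_writes(df, directory)
```

A single timed write is noisy, so repeated runs (as %%timeit does) give more reliable numbers; read timings can be collected the same way with `pd.read_csv`, `pd.read_hdf`, and `pd.read_parquet`.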

Test Results

In short, the results are:

  • For purely numerical data, the HDF5 format performs best, and the table format also shares with CSV the smallest memory footprint at 1.6 GB. The fixed format uses twice as much space, and the parquet format uses 2 GB.
  • For a mix of numerical and text data, parquet is significantly faster, and HDF5 uses its advantage on reading relative to CSV.

The notebook illustrates how to configure, test, and collect the timing using the %%timeit cell magic. It also demonstrates the usage of the related pandas commands required to use these storage formats.
