Quantitative Investing: How to Transform Data into Factors

This post shows how to transform data into factors. It uses Python libraries such as pandas, statsmodels, and matplotlib to work with time series data. First, it loads stock prices from the Quandl dataset and converts them to monthly returns. It then computes historical returns, normalizes them, and winsorizes them at the [1%, 99%] percentiles. Finally, it builds six compounded monthly return factors for lookback periods ranging from 1 to 12 months. These factors can be used to develop and test quantitative investment strategies.
The walkthrough relies on many pandas functions and methods, such as resample(), stack(), clip(), and swaplevel(). It uses the statsmodels RollingOLS class for rolling linear regressions, and matplotlib and seaborn for visualization. It also references Quandl, an online platform that provides financial and economic data.
Building on a conceptual understanding of the key factor categories, their rationale, and popular metrics, a key task is to design new factors that better capture the risks embodied by the return drivers laid out previously, or to identify new return drivers altogether.

In either case, it will be important to compare the performance of innovative factors to that of known factors to identify incremental signal gains.

We create the dataset here and store it in our data folder to facilitate reuse in later chapters.

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

from datetime import datetime
import pandas as pd
import pandas_datareader.data as web

# replaces pyfinance.ols.PandasRollingOLS (no longer maintained)
from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
idx = pd.IndexSlice

The assets.h5 store can be generated using the create_datasets notebook in the data directory in the root of this repo; it contains instructions for downloading the dataset used below. We load the Quandl stock price dataset covering the US equity markets 2000-18, use pd.IndexSlice to perform a slice operation on the pd.MultiIndex, select the adjusted close price, and unstack the ticker level to pivot the DataFrame to wide format, with tickers in the columns and timestamps in the rows:

DATA_STORE = '../data/assets.h5'
START = 2000
END = 2018
with pd.HDFStore(DATA_STORE) as store:
    prices = (store['quandl/wiki/prices']
              .loc[idx[str(START):str(END), :], 'adj_close']
              .unstack('ticker'))
    stocks = store['us_equities/stocks'].loc[:, ['marketcap', 'ipoyear', 'sector']]
prices.info()
stocks.info()

Keep data with stock info

Remove duplicate stocks and align the index names for later joins.

stocks = stocks[~stocks.index.duplicated()]
stocks.index.name = 'ticker'
shared = prices.columns.intersection(stocks.index)
stocks = stocks.loc[shared, :]
stocks.info()
prices = prices.loc[:, shared]
prices.info()
assert prices.shape[1] == stocks.shape[0]

Create monthly return series

To reduce training time and experiment with strategies for longer time horizons, we convert the business-daily data to month-end frequency using the available adjusted close price:

monthly_prices = prices.resample('M').last()  # 'M' = month-end frequency ('ME' in pandas >= 2.2)

To capture time series dynamics that reflect, for example, momentum patterns, we compute historical returns using the .pct_change(periods) method, that is, returns over the various monthly horizons identified by the lags.

We then convert the wide result back to long format with the .stack() method, use .pipe() to apply the .clip() method to the resulting DataFrame, and winsorize returns at the [1%, 99%] levels; that is, we cap outliers at these percentiles.

Finally, we normalize returns using the geometric average. After using .swaplevel() to change the order of the MultiIndex levels, we obtain compounded monthly returns for six periods ranging from 1 to 12 months:

monthly_prices.info()
outlier_cutoff = 0.01
data = pd.DataFrame()
lags = [1, 2, 3, 6, 9, 12]
for lag in lags:
    data[f'return_{lag}m'] = (monthly_prices
                              .pct_change(lag)  # cumulative return over `lag` months
                              .stack()          # wide -> long with (date, ticker) index
                              .pipe(lambda x: x.clip(lower=x.quantile(outlier_cutoff),
                                                     upper=x.quantile(1 - outlier_cutoff)))
                              .add(1)
                              .pow(1 / lag)     # geometric average monthly return
                              .sub(1))
data = data.swaplevel().dropna()
data.info()
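
As a sanity check on the normalization, here is a minimal, self-contained sketch (with a made-up 12-month return) showing that .add(1).pow(1/lag).sub(1) computes the geometric monthly average, which compounds back to the raw return:

import numpy as np

raw_12m = 0.268  # hypothetical raw cumulative return over 12 months: +26.8%

# geometric monthly average, as computed by .add(1).pow(1/12).sub(1)
monthly_equiv = (1 + raw_12m) ** (1 / 12) - 1
print(round(monthly_equiv, 4))  # roughly 0.02, i.e. about 2% per month

# compounding the monthly equivalent over 12 months recovers the raw return
assert np.isclose((1 + monthly_equiv) ** 12 - 1, raw_12m)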

Drop stocks with fewer than 10 years of returns

min_obs = 120
nobs = data.groupby(level='ticker').size()
keep = nobs[nobs>min_obs].index

data = data.loc[idx[keep,:], :]
data.info()
data.describe()
# cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.clustermap(data.corr('spearman'), annot=True, center=0, cmap='Blues');
data.index.get_level_values('ticker').nunique()

Rolling Factor Betas

We will introduce the Fama-French data to estimate the exposure of assets to common risk factors using linear regression. We can access the historical factor returns using pandas-datareader and estimate historical exposures using the RollingOLS rolling linear regression functionality in the statsmodels library. Specifically, we use the Fama-French research factors to estimate each stock's exposure to the five factors: market risk, size, value, operating profitability, and investment.

factors = ['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']
factor_data = web.DataReader('F-F_Research_Data_5_Factors_2x3', 'famafrench',
                             start='2000')[0].drop('RF', axis=1)
factor_data.index = factor_data.index.to_timestamp()
factor_data = factor_data.resample('M').last().div(100)  # convert percent to decimal returns
factor_data.index.name = 'date'
factor_data.info()
factor_data = factor_data.join(data['return_1m']).sort_index()
factor_data.info()
T = 24  # rolling window of 24 months
betas = (factor_data.groupby(level='ticker',
                             group_keys=False)
         .apply(lambda x: RollingOLS(endog=x.return_1m,
                                     exog=sm.add_constant(x.drop('return_1m', axis=1)),
                                     window=min(T, x.shape[0] - 1))
                .fit(params_only=True)
                .params
                .drop('const', axis=1)))
betas.describe().join(betas.sum(1).describe().to_frame('total'))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.clustermap(betas.corr(), annot=True, cmap=cmap, center=0);
data = (data
        .join(betas
              .groupby(level='ticker')
              .shift()))  # use prior-period betas to avoid lookahead bias
data.info()
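
For readers new to RollingOLS, here is a minimal, self-contained sketch (on synthetic data, not the dataset above) of the mechanics: each row of .params holds the coefficients estimated over the trailing window ending on that date:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

rng = np.random.default_rng(42)
dates = pd.date_range('2020-01-31', periods=60, freq='M')
factor = pd.Series(rng.normal(0, 0.05, 60), index=dates, name='factor')
# synthetic asset return with a true beta of 1.5 plus noise
asset = (1.5 * factor + rng.normal(0, 0.01, 60)).rename('asset')

exog = sm.add_constant(factor)  # columns: const, factor
rolling_betas = RollingOLS(endog=asset, exog=exog, window=24).fit(params_only=True).params
print(rolling_betas.tail())  # estimated const and beta; rows before the 24th are NaN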

Impute mean for missing factor betas

data.loc[:, factors] = data.groupby('ticker')[factors].apply(lambda x: x.fillna(x.mean()))
data.info()

Momentum factors

We can use these results to compute momentum factors based on the difference between returns over longer periods and the most recent monthly return, as well as the difference between the 3- and 12-month returns, as follows:

for lag in [2, 3, 6, 9, 12]:
    data[f'momentum_{lag}'] = data[f'return_{lag}m'].sub(data.return_1m)
data['momentum_3_12'] = data.return_12m.sub(data.return_3m)

Date Indicators

dates = data.index.get_level_values('date')
data['year'] = dates.year
data['month'] = dates.month

Lagged returns

To use lagged values as input variables or features associated with the current observations, we use the .shift() method to move historical returns up to the current period:

for t in range(1, 7):
    data[f'return_1m_t-{t}'] = data.groupby(level='ticker').return_1m.shift(t)
data.info()
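
A quick aside on why the shift is applied per ticker: a plain .shift() on the stacked frame would leak the last observation of one ticker into the first row of the next. A minimal sketch with made-up tickers and returns:

import pandas as pd

mini_idx = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.date_range('2020-01-31', periods=3, freq='M')],
    names=['ticker', 'date'])
s = pd.Series([.01, .02, .03, .10, .20, .30], index=mini_idx, name='return_1m')

print(s.shift(1))                          # BBB's first row wrongly receives AAA's 0.03
print(s.groupby(level='ticker').shift(1))  # BBB correctly starts with NaN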

Target: Holding Period Returns

Similarly, to compute returns for various holding periods, we use the normalized period returns computed previously and shift them back to align them with the current financial features:

for t in [1, 2, 3, 6, 12]:
    # shift(-t) aligns the future t-month return with the current row's features
    data[f'target_{t}m'] = data.groupby(level='ticker')[f'return_{t}m'].shift(-t)
cols = ['target_1m',
        'target_2m',
        'target_3m', 
        'return_1m',
        'return_2m',
        'return_3m',
        'return_1m_t-1',
        'return_1m_t-2',
        'return_1m_t-3']

data[cols].dropna().sort_index().head(10)
data.info()

Create age proxy

We use quintiles of IPO year as a proxy for company age.

data = (data
        .join(pd.qcut(stocks.ipoyear, q=5, labels=list(range(1, 6)))
              .astype(float)
              .fillna(0)  # stocks without an IPO year get age bucket 0
              .astype(int)
              .to_frame('age')))
data.age = data.age.fillna(-1)

Create dynamic size proxy

We use the marketcap information from the NASDAQ ticker info to create a size proxy.
Market cap information reflects current prices, so we create an adjustment factor that scales the values down to the lower historical prices of each individual stock:

size_factor = (monthly_prices
               .loc[data.index.get_level_values('date').unique(),
                    data.index.get_level_values('ticker').unique()]
               .sort_index(ascending=False)  # newest month first
               .pct_change()
               .fillna(0)
               .add(1)
               .cumprod())  # ratio of each month's price to the latest price
size_factor.info()
msize = (size_factor
         .mul(stocks
              .loc[size_factor.columns, 'marketcap'])).dropna(axis=1, how='all')
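
To see why the reversed cumulative product works, consider a minimal sketch with made-up numbers: a stock whose price doubled from 10 to 20 and whose current market cap is 200 should be assigned a historical cap of 100:

import pandas as pd

px = pd.Series({'2018-01-31': 10.0, '2018-02-28': 20.0})
px.index = pd.to_datetime(px.index)

adj = (px
       .sort_index(ascending=False)  # newest month first
       .pct_change()
       .fillna(0)
       .add(1)
       .cumprod())  # 1.0 for the latest month, 0.5 one month earlier

current_cap = 200.0
print(adj.mul(current_cap).sort_index())  # 2018-01-31: 100.0, 2018-02-28: 200.0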

Create Size indicator as deciles per period

Compute size deciles per month:

data['msize'] = (msize
                 .apply(lambda x: pd.qcut(x, q=10, labels=list(range(1, 11)))
                        .astype(int), axis=1)
                 .stack()
                 .swaplevel())
data.msize = data.msize.fillna(-1)

Combine data

data = data.join(stocks[['sector']])
data.sector = data.sector.fillna('Unknown')
data.info()

Store data

We will use the data again in several later chapters.

with pd.HDFStore(DATA_STORE) as store:
    store.put('engineered_features', data.sort_index().loc[idx[:, :datetime(2018, 3, 1)], :])
    print(store.info())

Create Dummy variables

For most models, we need to encode categorical variables as 'dummies' (one-hot encoding):

dummy_data = pd.get_dummies(data,
                            columns=['year', 'month', 'msize', 'age', 'sector'],
                            prefix=['year', 'month', 'msize', 'age', ''],
                            prefix_sep=['_', '_', '_', '_', ''])
# strip the '.0' suffix that float-derived categories leave in the column names
dummy_data = dummy_data.rename(columns={c: c.replace('.0', '') for c in dummy_data.columns})
dummy_data.info()
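
As a minimal illustration (with hypothetical values) of what this encoding produces, each category value becomes its own indicator column:

import pandas as pd

df = pd.DataFrame({'sector': ['Tech', 'Energy', 'Tech'], 'msize': [1, 10, 5]})
print(pd.get_dummies(df, columns=['msize', 'sector'],
                     prefix=['msize', ''], prefix_sep=['_', '']))
# indicator columns: msize_1, msize_5, msize_10, Energy, Tech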