Author: He Baisheng, School of Economics and Management, Harbin Institute of Technology (Weihai), Quantitative Finance track
The MLAT series records coursework for an on-campus class in blog form. My assigned task is 12gradient_boosting; this post follows the previous one (create dataset) and applies several boosting methods to that dataset.
Imports and Settings
import sys, os
import warnings
from time import time
from itertools import product
import joblib
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
# needed for HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.inspection import partial_dependence, plot_partial_dependence
from sklearn.metrics import roc_auc_score
Other Settings
results_path = Path(r'E:/machine learning for algorithmic trading','results', 'baseline')
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")
idx = pd.IndexSlice
np.random.seed(42)
DATA_STORE = r'E:/machine learning for algorithmic trading/wiki.h5'
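results_path only names a directory that may not exist yet; a small guard (my addition, not in the original notebook) creates it up front so later saves do not fail.
# create the results directory if it is missing (defensive addition)
if not results_path.exists():
    results_path.mkdir(parents=True)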
Prepare Data
The dataset used here is the one built in the previous post; in the original book's GitHub repository it belongs to Chapter 4.
def get_data(start='2000', end='2018', task='classification', holding_period=1, dropna=False):
    idx = pd.IndexSlice
    target = f'target_{holding_period}m'
    with pd.HDFStore(DATA_STORE) as store:
        df = store['engineered_features']
    if start is not None and end is not None:
        df = df.loc[idx[:, start: end], :]
    if dropna:
        df = df.dropna()
    y = (df[target] > 0).astype(int)
    # y = 1 where the target return is positive, 0 otherwise; this direction
    # label is what the classification methods predict
    X = df.drop([c for c in df.columns if c.startswith('target')], axis=1)
    return y, X
1. startswith() checks whether a string begins with 'target'; optional beg and end arguments restrict the matching range, with defaults beg = 0 and end = len(string).
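A toy illustration of those arguments (my example, not from the original post):
s = 'target_1m'
s.startswith('target')         # True
s.startswith('get', 3)         # True: matching starts at index 3
s.startswith('target', 0, 3)   # False: only 'tar' is inspected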
Factorize Categories
2. factorize encodes sector labels as integers (e.g., construction becomes 0, manufacturing 1). pd.factorize also accepts a sort argument for an ordered encoding; without it, codes are assigned in order of first appearance.
# pd.factorize returns a tuple (codes, uniques)
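A minimal sketch of the difference, on made-up data:
codes, uniques = pd.factorize(['construction', 'manufacturing', 'construction'])
# codes -> array([0, 1, 0]); uniques -> array(['construction', 'manufacturing'])
codes_sorted, _ = pd.factorize(['b', 'a', 'b'], sort=True)
# codes_sorted -> array([1, 0, 1]): codes follow the sorted order of uniques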
cat_cols = ['year', 'month', 'age', 'msize', 'sector']
def factorize_cats(df, cats=['sector']):
    cat_cols = ['year', 'month', 'age', 'msize'] + cats
    for cat in cats:
        df[cat] = pd.factorize(df[cat])[0]
    df.loc[:, cat_cols] = df.loc[:, cat_cols].fillna(-1).astype(int)
    return df
One-Hot Encoding
3. One-hot encoding converts categorical variables into a form that machine learning algorithms can use directly. get_dummies returns a DataFrame of indicator columns (dense by default; pass sparse=True for a sparse representation).
Note (see the sketch after these points):
# when using get_dummies, guard against multicollinearity (the dummy-variable trap): for a gender column, one dummy must be dropped, because knowing one column determines the other
# besides one-hot encoding and factorize, the mapping function map() can also encode categories
# check whether the categories carry an ordering: red vs. yellow does not, but age does
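A small sketch of the dummy-variable trap and its fix (toy data, my example):
demo = pd.DataFrame({'gender': ['m', 'f', 'm']})
pd.get_dummies(demo)                   # two columns: gender_f and gender_m
pd.get_dummies(demo, drop_first=True)  # gender_m alone carries all information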
def get_one_hot_data(df, cols=cat_cols[:-1]):
    df = pd.get_dummies(df,
                        columns=cols + ['sector'],
                        # the columns argument selects which columns to encode
                        prefix=cols + [''],
                        prefix_sep=['_'] * len(cols) + ['']
                        # get_dummies has default prefixes; these overrides keep
                        # the sector dummies from carrying a 'sector_' prefix
                        )
    return df.rename(columns={c: c.replace('.0', '') for c in df.columns})
Get Holdout Set
The holdout set is used to estimate the generalization error after cross-validation.
def get_holdout_set(target, features, period=6):
    idx = pd.IndexSlice
    label = target.name
    dates = np.sort(target.index.get_level_values('date').unique())
    cv_start, cv_end = dates[0], dates[-period - 2]
    holdout_start, holdout_end = dates[-period - 1], dates[-1]
    # most of the data is used for cross-validation; the last period + 1
    # (here seven) dates are held out as the test set
    df = features.join(target.to_frame())
    train = df.loc[idx[:, cv_start: cv_end], :]
    y_train, X_train = train[label], train.drop(label, axis=1)
    test = df.loc[idx[:, holdout_start: holdout_end], :]
    y_test, X_test = test[label], test.drop(label, axis=1)
    return y_train, X_train, y_test, X_test
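A usage sketch (my example; it relies on y_clean and X_dummies_clean created in the Load Data block just below):
y_train, X_train, y_test, X_test = get_holdout_set(target=y_clean,
                                                   features=X_dummies_clean)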
Load Data
y, features = get_data()
X_dummies = get_one_hot_data(features)
X_factors = factorize_cats(features)
y_clean, features_clean = get_data(dropna=True)
X_dummies_clean = get_one_hot_data(features_clean)
X_factors_clean = factorize_cats(features_clean)
# the (clean) variants drop all rows where lagged returns are NaN
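A quick sanity check (illustrative, not part of the original post) that the two encodings cover the same rows:
# one-hot encoding expands each category into indicator columns, while
# factorizing keeps a single integer column, so only the column counts differ
print(X_dummies_clean.shape, X_factors_clean.shape)
assert len(X_dummies_clean) == len(X_factors_clean)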
Cross-Validation Setup
Cross-validation is conceptually simple; a 12-fold split over time is used here.
class OneStepTimeSeriesSplit:
    """Generates tuples of train_idx, test_idx pairs
    Assumes the index contains a level labeled 'date'"""

    def __init__(self, n_splits=3, test_period_length=1, shuffle=False):
        self.n_splits = n_splits
        self.test_period_length = test_period_length
        self.shuffle = shuffle

    @staticmethod
    def chunks(l, n):
        # yield successive n-sized chunks of l
        for i in range(0, len(l), n):
            yield l[i:i + n]

    def split(self, X, y=None, groups=None):
        unique_dates = (X.index
                        .get_level_values('date')
                        .unique()
                        .sort_values(ascending=False)
                        [:self.n_splits * self.test_period_length])
        dates = X.reset_index()[['date']]
        for test_date in self.chunks(unique_dates, self.test_period_length):
            train_idx = dates[dates.date < min(test_date)].index
            test_idx = dates[dates.date.isin(test_date)].index
            if self.shuffle:
                # shuffle a copy of the index; calling np.random.shuffle on a
                # temporary list would leave train_idx unchanged
                train_idx = pd.Index(np.random.permutation(train_idx))
            yield train_idx, test_idx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
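To show how this splitter plugs into scikit-learn, here is a minimal sketch (my example; the 12 splits follow the text above, while the LightGBM estimator and roc_auc scoring are my own choices):
cv = OneStepTimeSeriesSplit(n_splits=12, test_period_length=1)
scores = cross_validate(estimator=LGBMClassifier(),
                        X=X_factors_clean,
                        y=y_clean,
                        cv=cv,
                        scoring='roc_auc',
                        n_jobs=-1)
print(scores['test_score'].mean())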