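Predicting Stock Returns with Machine Learning (Part 1): The Random Forest Model
Preface
This article uses Python to organize stock data for all US-listed companies from 1927 to 2020. Using historical returns and trading volume as inputs, it predicts stock returns with machine learning methods such as random forests, support vector machines, and neural networks. A portfolio built from the best-performing model earns an average annual return above 20%.
1. Importing libraries and data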
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import matplotlib.pyplot as plt
from pprint import pprint
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
file2 = "crsp_msf_all.csv"
data = pd.read_csv(file2,parse_dates=["date"], index_col="date")
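The data come from the CRSP database. The dataset contains many stock-level fields; this article uses only the stock identifier (PERMNO), the return (RET), and the trading volume (VOL).
2. Processing the data and computing feature variables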
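Since only three fields are used, the load can be limited to those columns. The following is a minimal sketch, assuming the CSV has columns named `date`, `PERMNO`, `RET`, and `VOL`; the in-memory sample, the extra `PRC` column, and all values are invented for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for crsp_msf_all.csv; every value here is made up.
sample_csv = io.StringIO(
    "date,PERMNO,PRC,RET,VOL\n"
    "2020-01-31,10001,10.0,0.01,100\n"
    "2020-02-28,10001,10.5,C,120\n"
    "2020-01-31,10002,20.0,-0.02,300\n"
)

# usecols reads only the listed columns, which lowers memory use on a large file.
sample = pd.read_csv(
    sample_csv,
    usecols=["date", "PERMNO", "RET", "VOL"],
    parse_dates=["date"],
    index_col="date",
)
print(sample.columns.tolist())
```

With the real file, the same `usecols` argument could be passed to the `pd.read_csv` call above, provided the column names match.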
vol = data["VOL"]
ret = data[["PERMNO", "RET", "VOL"]]
# RET contains non-numeric letter codes ('C', 'B'); treat them as missing values
ret = ret.replace('C', np.nan).replace('B', np.nan)
ret = ret.dropna()
ret["RET"] = ret["RET"].astype(float)
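# R0 ... R24 hold the returns lagged by 1 to 25 periods
predictorsname = ["R" + str(i) for i in range(25)]
# Convert RET to float first so the lagged columns are numeric rather than strings
data["RET"] = data["RET"].replace('C', np.nan).replace('B', np.nan).astype(float)
# Compute historical returns
for i in range(25):
    data[predictorsname[i]] = data.groupby('PERMNO')['RET'].shift(i + 1)
# R-1 is the next period's return
data["R-1"] = data.groupby('PERMNO')['RET'].shift(-1)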
predictorsname.append("VOL")
# Assemble the modelling table; .copy() avoids assigning into a view of data
obs = data[["PERMNO", "VOL", "R-1", "RET"] + predictorsname[:-1]].copy()
obs = obs.replace('C', np.nan).replace('B', np.nan)
# Keep only rows where every lag and the next-period return are available
obs = obs.dropna()
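To make the lag construction concrete, here is a small sketch on a toy panel of two hypothetical stocks with invented returns. It shows that `groupby('PERMNO').shift` never carries one stock's return over into another stock's rows, and that a letter code in RET propagates as a missing value. Note that `pd.to_numeric(..., errors="coerce")` is used here in place of the `replace` calls above; it is an assumption of this sketch and turns any non-numeric entry into NaN, not just 'C' and 'B'.

```python
import pandas as pd

# Toy panel: two hypothetical stocks, four periods each (invented values).
toy = pd.DataFrame({
    "PERMNO": [1, 1, 1, 1, 2, 2, 2, 2],
    "RET": ["0.01", "C", "0.03", "0.04", "0.10", "0.20", "0.30", "0.40"],
})

# Non-numeric codes such as 'C' become NaN; everything else becomes float.
toy["RET"] = pd.to_numeric(toy["RET"], errors="coerce")

# Lag and lead are computed within each stock, as in the article's loop.
toy["R0"] = toy.groupby("PERMNO")["RET"].shift(1)    # previous period's return
toy["R-1"] = toy.groupby("PERMNO")["RET"].shift(-1)  # next period's return

# Rows missing the current return, the lag, or the lead are dropped.
complete = toy.dropna()
print(complete)
```

Only the two middle rows of the second stock survive `dropna()` here, which illustrates why requiring 25 lags removes each stock's early history from the sample.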
## Normalization
def regularit(df):
newDataFrame = pd.DataFrame