Machine Learning for Finance in Python
Preparing data and a linear model
Explore the data with some EDA
Any time we begin a machine learning (ML) project, we need to first do some exploratory data analysis (EDA) to familiarize ourselves with the data. This includes things like raw data plots, histograms, and more.
I typically begin with raw data plots and histograms. This allows us to understand our data’s distributions. If a distribution is normal, we can use parametric statistics.
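Beyond eyeballing a histogram, a formal normality check can back up the "looks normal" judgment. A minimal sketch, assuming scipy is available; the returns here are synthetic, not the real LNG data:

```python
import numpy as np
from scipy import stats

# Synthetic daily returns for illustration only (not the real LNG data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.01, size=500)

# D'Agostino-Pearson test; the null hypothesis is that the sample is normal
stat, p_value = stats.normaltest(returns)
# A large p-value means we cannot reject normality, so parametric
# statistics are reasonable to apply
```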
There are two stocks loaded for you into pandas DataFrames: lng_df and spy_df (LNG and SPY). We’ll use the closing prices and eventually volume as inputs to ML algorithms.
print(lng_df.head()) # examine the DataFrames
print(spy_df.head()) # examine the SPY DataFrame
# Plot the Adj_Close columns for SPY and LNG
spy_df['Adj_Close'].plot(label='SPY', legend=True)
lng_df['Adj_Close'].plot(label='LNG', legend=True, secondary_y=True)
plt.show() # show the plot
plt.clf() # clear the plot space
# Histogram of the daily price change percent of Adj_Close for LNG
lng_df['Adj_Close'].pct_change(1).plot.hist(bins=50)
plt.xlabel('adjusted close 1-day percent change')
plt.show()
大致符合正态.日差异比较小.
Correlations
Correlations are nice to check out before building machine learning models, because we can see which features correlate to the target most strongly. Pearson’s correlation coefficient is often used, which only detects linear relationships. It’s commonly assumed our data is normally distributed, which we can “eyeball” from histograms. Highly correlated variables have a Pearson correlation coefficient near 1 (positively correlated) or -1 (negatively correlated). A value near 0 means the two variables are not linearly correlated.
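As a minimal illustration of the Pearson coefficient (with made-up numbers), an exact linear relationship yields a coefficient of 1:

```python
import pandas as pd

# Toy data with an exact linear relationship: y = 2x
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 6.0, 8.0, 10.0]})
corr = df.corr()  # Pearson correlation by default
# corr.loc['x', 'y'] is 1.0 because the relationship is perfectly linear
```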
If we use the same time periods for previous price changes and future price changes, we can see if the stock price is mean-reverting (bounces around) or trend-following (goes up if it has been going up recently).
# Create 5-day % changes of Adj_Close for the current day, and 5 days in the future
lng_df['5d_future_close'] = lng_df['Adj_Close'].shift(-5)
lng_df['5d_close_future_pct'] = lng_df['5d_future_close'].pct_change(5)
lng_df['5d_close_pct'] = lng_df['Adj_Close'].pct_change(5)
# Calculate the correlation matrix between the 5d close percentage changes (current and future)
corr = lng_df[['5d_close_pct', '5d_close_future_pct']].corr()
print(corr)
# Scatter the current 5-day percent change vs the future 5-day percent change
plt.scatter(lng_df['5d_close_pct'], lng_df['5d_close_future_pct'])
plt.show()
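To see how shift(-5) followed by pct_change(5) lines up into a look-ahead target, here is a toy sketch on a synthetic price series (the numbers are made up). Note that the pct_change route loses the first 5 rows compared with computing the change directly:

```python
import pandas as pd

prices = pd.Series([100.0 + i for i in range(11)])
future_close = prices.shift(-5)          # the close 5 rows ahead
future_pct = future_close.pct_change(5)  # same pattern as in the notes
direct = prices.shift(-5) / prices - 1   # % change from now to 5 rows ahead
# At row 5 both give (110 - 105) / 105, about 0.0476
```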
Data transforms, features and targets
Create moving average and RSI features
The simplest indicator is the moving average (MA); the RSI is also commonly used.
MA
A moving average (MA) is a technical indicator that averages security prices (or an index) over a fixed period using statistical methods, then connects the averages across time into a single line used to observe the trend of price movements. MA theory is one of the most widely applied technical tools today: it helps traders confirm the current trend, anticipate emerging trends, and spot overextended trends that are about to reverse.
Commonly used periods are 5, 10, 30, 60, 120, and 240 days. The 5- and 10-day lines are short-term MAs used as references for short-term trading, known as daily MA indicators; the 30- and 60-day lines are medium-term MAs, known as quarterly MA indicators; the 120- and 240-day lines are long-term MAs, known as yearly MA indicators.
Calculation: N-day moving average = (sum of the last N closing prices) / N
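If talib is not available, the same N-day average can be sketched with pandas' rolling mean (the prices here are made up):

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
ma3 = prices.rolling(window=3).mean()  # 3-day simple moving average
# The third value is (10 + 11 + 12) / 3 = 11.0; the first two are NaN
# because a full 3-day window is not yet available
```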
Weighted moving average
The weighting reflects the view that, within a moving average, the most recent closing prices have the greatest influence on future price movements, so they are assigned larger weights.
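A minimal sketch of a linearly weighted moving average, where the most recent close gets the largest weight (toy prices, linear weights 1..n as one common convention):

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
n = 3
# Linear weights 1..n: the most recent close in each window gets weight n
weights = np.arange(1, n + 1, dtype=float)
wma = prices.rolling(n).apply(lambda x: np.dot(x, weights) / weights.sum(),
                              raw=True)
# Third value: (10*1 + 11*2 + 12*3) / 6 = 68 / 6, about 11.33
```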
RSI
The Relative Strength Index (RSI) is a technical curve built from the ratio of the sum of price gains to the total of gains and losses over a given period, and it reflects the strength of the market over that period. Welles Wilder first applied it to futures trading; traders later found that, among the many charting techniques, the theory and practice of the RSI are especially well suited to short-term stock investing, so it came to be used to measure and analyze stock rises and falls. The indicator is typically drawn as three lines (RSIs of different periods) reflecting the strength of the price trend, giving investors a basis for trading, and it is well suited to short-term swing trades.
Mathematical principle
In simple terms, the RSI uses a numerical calculation to gauge the balance of power between buyers and sellers. Imagine 100 people facing one commodity: if more than 50 want to buy and bid against each other, the price must rise; conversely, if more than 50 scramble to sell, the price naturally falls.
RSI theory holds that however sharply the price rises or falls, the index varies only between 0 and 100, and that RSI values mostly stay between 30 and 70. Readings of 80 or even 90 are usually taken to mean the market is overbought, after which the price should pull back and correct; a drop below 30 is taken as oversold, and the price is expected to rebound.
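The formula behind this is RSI = 100 − 100 / (1 + RS), where RS is the average gain divided by the average loss over the period. A rough pandas sketch using simple averages (talib's RSI uses Wilder's smoothing, so its values will differ slightly):

```python
import pandas as pd

def rsi(prices, n=14):
    # Simple-average RSI sketch; talib.RSI uses Wilder's smoothing instead
    delta = prices.diff()
    gain = delta.clip(lower=0).rolling(n).mean()   # average up-move
    loss = (-delta.clip(upper=0)).rolling(n).mean()  # average down-move
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Monotonically rising prices: no down days, so losses are 0 and RSI hits 100
prices = pd.Series(range(1, 31), dtype=float)
values = rsi(prices)
```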
feature_names = ['5d_close_pct'] # a list of the feature names for later
# Create moving averages and rsi for timeperiods of 14, 30, 50, and 200
for n in [14, 30, 50, 200]:
    # Create the moving average indicator and divide by Adj_Close
    lng_df['ma' + str(n)] = talib.SMA(lng_df['Adj_Close'].values,
                                      timeperiod=n) / lng_df['Adj_Close']
    # Create the RSI indicator
    lng_df['rsi' + str(n)] = talib.RSI(lng_df['Adj_Close'].values, timeperiod=n)
    # Add rsi and moving average to the feature name list
    feature_names = feature_names + ['ma' + str(n), 'rsi' + str(n)]
print(feature_names)
Create features and targets
# Drop all na values
lng_df = lng_df.dropna()
# Create features and targets
# use feature_names for features; 5d_close_future_pct for targets
features = lng_df[feature_names]
targets = lng_df['5d_close_future_pct']
# Create DataFrame from target column and feature columns
feat_targ_df = lng_df[['5d_close_future_pct'] + feature_names]
# Calculate correlation matrix
corr = feat_targ_df.corr()
print(corr)
Check the correlations
Before we fit our first machine learning model, let’s look at the correlations between features and targets. Ideally we want large (near 1 or -1) correlations between features and targets. Examining correlations can help us tweak features to maximize correlation (for example, altering the timeperiod argument in the talib functions). It can also help us remove features that aren’t correlated to the target.
To easily plot a correlation matrix, we can use seaborn’s heatmap() function. This takes a correlation matrix as the first argument, and has many other options. Check out the annot option – this will help us turn on annotations.
# Plot heatmap of correlation matrix
sns.heatmap(corr, annot=True)
plt.yticks(rotation=0); plt.xticks(rotation=90) # fix ticklabel directions
plt.tight_layout() # fits plot area to the plot, "tightly"
plt.show() # show the plot
plt.clf() # clear the plot area
# Create a scatter plot of the most highly correlated variable with the target
plt.scatter(lng_df['ma200'], lng_df['5d_close_future_pct'])
plt.show()
Linear modeling
Create train and test features
# Import the statsmodels library with the alias sm
import statsmodels.api as sm
# Add a constant to the features
linear_features = sm.add_constant(features)
# Create a size for the training set that is 85% of the total number of samples
train_size = int(0.85 * features.shape[0])
train_features = linear_features[:train_size]
train_targets = targets[:train_size]
test_features = linear_features[train_size:]
test_targets = targets[train_size:]
print(linear_features.shape, train_features.shape, test_features.shape)
Fit a linear model
# Create the linear model and complete the least squares fit
model = sm.OLS(train_targets, train_features)
results = model.fit() # fit the model
print(results.summary())
# examine pvalues
# Features with p <= 0.05 are typically considered significantly different from 0
print(results.pvalues)
# Make predictions from our model for train and test sets
train_predictions = results.predict(train_features)
test_predictions = results.predict(test_features)
ma14 1.317652e-01
rsi14 4.119023e-10
ma30 2.870964e-01
rsi30 1.315491e-11
ma50 6.542888e-08
rsi50 1.598367e-12
ma200 1.087610e-02
rsi200 2.559536e-11
dtype: float64
Most features are significant (p ≤ 0.05) and are useful for predicting the price; note, however, that ma14 and ma30 have p-values above 0.05 and so are not significant by this criterion.
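A sketch of filtering features by that threshold; the p-values here are hand-copied from the output above, whereas in practice you would pass results.pvalues directly:

```python
import pandas as pd

# Hand-copied subset of the p-values printed above
pvalues = pd.Series({'ma14': 1.317652e-01, 'rsi14': 4.119023e-10,
                     'ma30': 2.870964e-01, 'rsi30': 1.315491e-11})
significant = pvalues[pvalues <= 0.05].index.tolist()
# ma14 and ma30 do not pass the 0.05 threshold, so only the RSI columns remain
```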
Evaluate our results
# Scatter the predictions vs the targets with 80% transparency
plt.scatter(train_predictions, train_targets, alpha=0.2, color='b', label='train')
plt.scatter(test_predictions, test_targets, alpha=0.2, color='r', label='test')
# Plot the perfect prediction line
xmin, xmax = plt.xlim()
plt.plot(np.arange(xmin, xmax, 0.01), np.arange(xmin, xmax, 0.01), c='k')
# Set the axis labels and show the plot
plt.xlabel('predictions')
plt.ylabel('actual')
plt.legend() # show the legend
plt.show()
However, the linear model’s predictions are not good; we need to move on to more sophisticated models.