Stock Price Prediction with Linear Regression

Machine Learning for Finance in Python

Preparing data and a linear model

Explore the data with some EDA

Any time we begin a machine learning (ML) project, we need to first do some exploratory data analysis (EDA) to familiarize ourselves with the data. This includes things like raw data plots, histograms, and more.
I typically begin with raw data plots and histograms. These allow us to understand our data's distributions. If a variable is approximately normally distributed, we can use parametric statistics with it.
There are two stocks loaded for you into pandas DataFrames: lng_df and spy_df (LNG and SPY). We’ll use the closing prices and eventually volume as inputs to ML algorithms.

print(lng_df.head())  # examine the DataFrames
print(spy_df.head())  # examine the SPY DataFrame

# Plot the Adj_Close columns for SPY and LNG
spy_df['Adj_Close'].plot(label='SPY', legend=True)
lng_df['Adj_Close'].plot(label='LNG', legend=True, secondary_y=True)
plt.show()  # show the plot
plt.clf()  # clear the plot space

# Histogram of the daily price change percent of Adj_Close for LNG
lng_df['Adj_Close'].pct_change(1).plot.hist(bins=50)
plt.xlabel('adjusted close 1-day percent change')
plt.show()

[Figure: adjusted close prices of SPY and LNG]
[Figure: histogram of LNG 1-day percent changes] The distribution is roughly normal, and the daily changes are small.
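As a minimal sketch of checking this numerically (assuming scipy is available, which the course itself does not require), a normality test can complement the histogram before we rely on parametric statistics:

from scipy import stats

# D'Agostino-Pearson normality test on the 1-day percent changes
daily_pct = lng_df['Adj_Close'].pct_change(1).dropna()
stat, p = stats.normaltest(daily_pct)
print(p)  # a small p-value means we would reject normality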

Correlations

Correlations are nice to check out before building machine learning models, because we can see which features correlate to the target most strongly. Pearson’s correlation coefficient is often used, which only detects linear relationships. It’s commonly assumed our data is normally distributed, which we can “eyeball” from histograms. Highly correlated variables have a Pearson correlation coefficient near 1 (positively correlated) or -1 (negatively correlated). A value near 0 means the two variables are not linearly correlated.
If we use the same time periods for previous price changes and future price changes, we can see if the stock price is mean-reverting (bounces around) or trend-following (goes up if it has been going up recently).

# Create 5-day % changes of Adj_Close for the current day, and 5 days in the future
lng_df['5d_future_close'] = lng_df['Adj_Close'].shift(-5)
lng_df['5d_close_future_pct'] = lng_df['5d_future_close'].pct_change(5)
lng_df['5d_close_pct'] = lng_df['Adj_Close'].pct_change(5)
# Calculate the correlation matrix between the 5d close percentage changes (current and future)
corr = lng_df[['5d_close_pct', '5d_close_future_pct']].corr()
print(corr)
# Scatter the current 5-day percent change vs the future 5-day percent change
plt.scatter(lng_df['5d_close_pct'], lng_df['5d_close_future_pct'])
plt.show()

[Figure: scatter of the current vs. future 5-day percent change]

Data transforms, features and targets

Create moving average and RSI features

The simplest indicator is the moving average (MA); the RSI is another commonly used one.

MA

A moving average (MA) is a technical indicator that uses statistical averaging: the security price (or index) over a fixed period is averaged, and the averages for successive periods are connected into a single line used to observe the trend of the price. Moving-average theory is one of the most widely used technical tools today; it helps traders confirm the current trend, judge an emerging trend, and spot an over-extended trend that is about to reverse.
Commonly used periods are 5, 10, 30, 60, 120, and 240 days. The 5-day and 10-day lines are short-term moving averages used as references for short-term trading (often called daily lines); the 30-day and 60-day lines are medium-term (quarterly) indicators; and the 120-day and 240-day lines are long-term (yearly) indicators. Moving averages are usually examined across several of these horizons at once.
Calculation: the N-day moving average = (sum of the closing prices over the last N days) / N.
Weighted moving average
The weighting reflects the assumption that, within the averaging window, the most recent closing prices have the greatest influence on future price movements, so they are given larger weights.
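As a minimal sketch of these two formulas (assuming close is a pandas Series of adjusted closing prices; talib.SMA used below is the library equivalent of the simple version):

import numpy as np
import pandas as pd

def simple_ma(close, n):
    # N-day simple MA: sum of the last N closes divided by N
    return close.rolling(window=n).mean()

def weighted_ma(close, n):
    # Linearly weighted MA: the newest close gets weight n, the oldest gets weight 1
    weights = np.arange(1, n + 1)
    return close.rolling(window=n).apply(
        lambda x: np.dot(x, weights) / weights.sum(), raw=True)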

RSI

The Relative Strength Index (RSI) is a technical curve built from the ratio, over a given period, of the sum of upward price moves to the sum of all upward and downward moves; it reflects how strong the market has been over that period. It was first applied to futures trading by Welles Wilder. People later found that, among the many charting techniques, the theory and practice of this strength indicator are especially well suited to short-term stock trading, so it came to be used to measure and analyze stock rises and falls. The indicator is typically drawn as three lines (for different periods) that reflect the strength of the price trend, giving investors a basis for trading; it is particularly suitable for short-term swing trades.

Mathematical idea
In simple terms, the RSI quantifies the balance of power between buyers and sellers. For example, if 100 people face one commodity and more than 50 of them want to buy and bid against each other, the price will rise; conversely, if more than 50 are scrambling to sell, the price will naturally fall.
RSI theory holds that however sharply the price rises or falls, the RSI always stays between 0 and 100. Based on its usual distribution, it mostly moves between 30 and 70; readings of 80 (or even 90) usually mean the market is overbought and the price will tend to fall back and correct, while readings below 30 mean the market is oversold and the price is likely to rebound.
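A minimal sketch of this calculation (using a simple rolling average of gains and losses rather than the Wilder smoothing used by talib.RSI, so the values will differ slightly; close is again assumed to be a pandas Series of closing prices):

def simple_rsi(close, n=14):
    delta = close.diff()
    gains = delta.clip(lower=0)        # upward moves
    losses = -delta.clip(upper=0)      # downward moves, as positive numbers
    avg_gain = gains.rolling(window=n).mean()
    avg_loss = losses.rolling(window=n).mean()
    rs = avg_gain / avg_loss           # relative strength
    return 100 - 100 / (1 + rs)        # bounded between 0 and 100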

feature_names = ['5d_close_pct']  # a list of the feature names for later

# Create moving averages and rsi for timeperiods of 14, 30, 50, and 200
for n in [14, 30, 50, 200]:

    # Create the moving average indicator and divide by Adj_Close
    lng_df['ma' + str(n)] = talib.SMA(lng_df['Adj_Close'].values,
                              timeperiod=n) / lng_df['Adj_Close']
    # Create the RSI indicator
    lng_df['rsi' + str(n)] = talib.RSI(lng_df['Adj_Close'].values, timeperiod=n)
    
    # Add rsi and moving average to the feature name list
    feature_names = feature_names + ['ma' + str(n), 'rsi' + str(n)]
    
print(feature_names)

Create features and targets


# Drop all na values
lng_df = lng_df.dropna()

# Create features and targets
# use feature_names for features; 5d_close_future_pct for targets
features = lng_df[feature_names]
targets = lng_df['5d_close_future_pct']

# Create DataFrame from target column and feature columns
feat_targ_df = lng_df[['5d_close_future_pct'] + feature_names]

# Calculate correlation matrix
corr = feat_targ_df.corr()
print(corr)

Check the correlations

Before we fit our first machine learning model, let’s look at the correlations between features and targets. Ideally we want large (near 1 or -1) correlations between features and targets. Examining correlations can help us tweak features to maximize correlation (for example, altering the timeperiod argument in the talib functions). It can also help us remove features that aren’t correlated to the target.

To easily plot a correlation matrix, we can use seaborn’s heatmap() function. This takes a correlation matrix as the first argument, and has many other options. Check out the annot option – this will help us turn on annotations.

# Plot heatmap of correlation matrix
sns.heatmap(corr, annot=True)
plt.yticks(rotation=0); plt.xticks(rotation=90)  # fix ticklabel directions
plt.tight_layout()  # fits plot area to the plot, "tightly"
plt.show()  # show the plot
plt.clf()  # clear the plot area

# Create a scatter plot of the most highly correlated variable with the target
plt.scatter(lng_df['ma200'], lng_df['5d_close_future_pct'])
plt.show()

[Figure: heatmap of the correlation matrix; scatter of ma200 vs. 5d_close_future_pct]

Linear modeling

Create train and test features

# Import the statsmodels library with the alias sm
import statsmodels.api as sm

# Add a constant to the features
linear_features = sm.add_constant(features)

# Create a size for the training set that is 85% of the total number of samples
train_size = int(0.85 * features.shape[0])
train_features = linear_features[:train_size]
train_targets = targets[:train_size]
test_features = linear_features[train_size:]
test_targets = targets[train_size:]
print(linear_features.shape, train_features.shape, test_features.shape)

Fit a linear model

# Create the linear model and complete the least squares fit
model = sm.OLS(train_targets, train_features)
results = model.fit()  # fit the model
print(results.summary())

# examine pvalues
# Features with p <= 0.05 are typically considered significantly different from 0
print(results.pvalues)

# Make predictions from our model for train and test sets
train_predictions = results.predict(train_features)
test_predictions = results.predict(test_features)

ma14 1.317652e-01
rsi14 4.119023e-10
ma30 2.870964e-01
rsi30 1.315491e-11
ma50 6.542888e-08
rsi50 1.598367e-12
ma200 1.087610e-02
rsi200 2.559536e-11
dtype: float64
Most of the features have p-values at or below 0.05 (ma14 and ma30 are the exceptions), so they can be considered significant and used to help predict the price.

Evaluate our results

# Scatter the predictions vs the targets with 80% transparency
plt.scatter(train_predictions, train_targets, alpha=0.2, color='b', label='train')
plt.scatter(test_predictions, test_targets, alpha=0.2, color='r', label='test')

# Plot the perfect prediction line
xmin, xmax = plt.xlim()
plt.plot(np.arange(xmin, xmax, 0.01), np.arange(xmin, xmax, 0.01), c='k')

# Set the axis labels and show the plot
plt.xlabel('predictions')
plt.ylabel('actual')
plt.legend()  # show the legend
plt.show()

[Figure: scatter of predictions vs. actual values for the train and test sets]
However, the linear model's predictions are not very good; we will need more sophisticated models to do better.
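As a minimal sketch of quantifying this (the in-sample R² comes straight from the statsmodels results; the out-of-sample score assumes scikit-learn is available, which the course itself does not require):

# In-sample R^2 reported by statsmodels
print(results.rsquared)

# Out-of-sample R^2 on the held-out test set
from sklearn.metrics import r2_score
print(r2_score(test_targets, test_predictions))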
