Stock Price Trend Prediction

Latest recommended article published on 2024-05-09 19:46:27

干炒牛河


Column: Machine Learning Cases. Tags: machine learning, python, artificial intelligence

Article link: https://blog.csdn.net/qq_53201790/article/details/129216255


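This article introduces the basic principles of linear regression and of support vector regression (SVR), which builds on support vector machines, and provides Python implementations of both models covering data preprocessing, training, prediction, and evaluation. The code examples fetch stock data with the Quandl library, predict from features derived from the adjusted prices, such as HL_PCT and PCT_change, and print each model's predictions and score.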


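Linear regression

Algorithm principle

Regression predicts new data from existing data, for example the trend of a stock price. This section covers simple linear regression; many other linear regression algorithms extend this standard form.
Linear regression assumes that a straight line describes the relationship in the data reasonably accurately, so that when a new data point arrives, the line yields a predicted value for it.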
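
To make the idea of "fitting a straight line and reading a prediction off it" concrete, here is a minimal sketch using NumPy's `np.polyfit`. The toy data and the variable names (`x`, `y`, `x_new`) are hypothetical and are not taken from the stock example later in the article.

```python
import numpy as np

# Toy data (hypothetical): y is roughly 2 * x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit y ~ slope * x + intercept by ordinary least squares (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict a value for a new, unseen x.
x_new = 5.0
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```

The stock example below does the same thing with several input features instead of one, using scikit-learn's `LinearRegression`.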

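Support vector regression

Algorithm principle
SVR was proposed as a branch of SVM. Both use training samples to find a function g(x). The difference is that a support vector machine solves a classification problem: it seeks the optimal hyperplane (the function g(x)) that separates two classes of sample points as widely as possible, and its criterion is the maximum margin, that is, the largest gap between the boundary hyperplanes H1 and H2.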


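Support vector regression instead seeks a linear regression function y = g(x) that fits all the sample points. The optimal hyperplane it looks for is not the one that separates two classes most widely, but the one from which the sample points deviate least in total. In the standard formulation, deviations smaller than a tolerance ε are ignored.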
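
The "deviation from the hyperplane" that SVR penalizes is usually measured with the ε-insensitive loss. The sketch below shows only that loss term, not the full SVR optimization (which also includes a regularization term); the function name `epsilon_insensitive_loss` and the sample values are hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Mean epsilon-insensitive loss: errors within +/- epsilon cost nothing."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Only the part of each error that exceeds epsilon is penalized.
    return float(np.mean(np.maximum(0.0, residual - epsilon)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])  # absolute errors: 0.05, 0.5, 1.0
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(loss)
```

The first prediction lies inside the ε tube and contributes nothing; the other two contribute only the amount by which they exceed ε.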


Model evaluation method

[Image missing in the source: the evaluation method was given only as a picture.]
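
Both code listings in this article evaluate their models with `score`, which for scikit-learn regressors returns the coefficient of determination R^2. Assuming that is the metric this section refers to, the following sketch computes it by hand as 1 - SS_res / SS_tot. The helper name `r2_score_manual` and the sample values are hypothetical.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

A value of 1.0 means a perfect fit, 0.0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than that.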

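Code implementation

1. Linear regression

# coding=utf-8
# Linear regression is typically used for problems of the form "estimate the unknown parameters of a formula from known samples"
# Getting the data
# Stock data features: Open, High, Low, Close, Volume
# and their adjusted counterparts: Adj. Open, Adj. High, Adj. Low, Adj. Close, Adj. Volume
# Data preprocessing
# Adjusted (ex-rights) data reflect the underlying behaviour better, so the adjusted columns are used as the main features
# Two derived features: HL_PCT (percentage by which the high exceeds the close) and PCT_change (percentage change from open to close)
# Independent variables: Adj. Close, HL_PCT, PCT_change, Adj. Volume
# Dependent variable: Adj. Close, forecast_out days ahead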

import math

import numpy as np
import quandl
from sklearn import preprocessing

df = quandl.get('WIKI/GOOGL')
# df = quandl.get('WIKI/AAPL')
# print(df)

# Name of the column to predict
forecast_col = 'Adj. Close'
# Number of days to forecast: 1% of the length of the data
forecast_out = int(math.ceil(0.01 * len(df)))
# Keep only the following columns (copy to avoid chained-assignment warnings)
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']].copy()
# Build two new columns
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
# Features actually used
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']].copy()
# Fill missing values with -99999
df.fillna(-99999, inplace=True)
# label is the value to predict: Adj. Close shifted up by forecast_out rows (1% of the data)
df['label'] = df[forecast_col].shift(-forecast_out)
# Build X and y for the model, and X_lately for forecasting
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
# The last forecast_out rows have no label; they are the inputs used for forecasting
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
# Drop the rows whose label is empty
df.dropna(inplace=True)
y = np.array(df['label'])

from sklearn import model_selection
from sklearn.linear_model import LinearRegression

# Split X and y into training and test sets: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
# Create the linear regression object
clf = LinearRegression(n_jobs=-1)
# Train
clf.fit(X_train, y_train)
# Evaluate on the test data; score returns the coefficient of determination R^2, not a classification accuracy
accuracy = clf.score(X_test, y_test)
# Forecast
foreca_set = clf.predict(X_lately)
print(foreca_set, accuracy)

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style

# Change the matplotlib style
style.use('ggplot')
# Step between forecast rows: one calendar day
one_day = pd.Timedelta(days=1)
# New Forecast column in df to hold the predictions
df['Forecast'] = np.nan
# Date index of the last row of df (the last labeled row, because of the dropna above)
last_date = df.iloc[-1].name
next_date = last_date + one_day
# Append one row to df per prediction
for i in foreca_set:
    # [np.nan for _ in range(len(df.columns) - 1)] fills every column except Forecast
    # and [i] is the Forecast value
    # together they form the new row, appended to df under the next date
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
    next_date += one_day
# Plot
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
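
The label construction in the listing above, `shift(-forecast_out)`, is easier to follow on a tiny table. This sketch uses a hypothetical five-row frame named `toy` with `forecast_out = 2`; only the column names mirror the listing.

```python
import pandas as pd

# Toy closing prices (hypothetical values) indexed by consecutive days.
toy = pd.DataFrame(
    {"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
forecast_out = 2

# Each row's label is the closing price forecast_out rows later.
toy["label"] = toy["Adj. Close"].shift(-forecast_out)

# The last forecast_out rows have no label: they are the inputs to forecast from.
unlabeled = toy[toy["label"].isna()]
# The remaining rows pair today's features with a known future price for training.
labeled = toy.dropna()
print(toy)
```

In the listing, `X_lately` corresponds to the features of `unlabeled`, and `X`, `y` correspond to `labeled`.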

2. Support vector regression

# coding=utf-8
import numpy as np
import quandl
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Get the stock data
df = quandl.get("WIKI/GOOGL")
# Take a look at the data
# print(df.head())

# Alternative way of reading the training data, left commented out:
# stock_code = '601318.XSHG'
# start_date = '2016-02-05'
# end_date = '2017-02-07'
# df = get_price(stock_code, start_date, end_date, frequency='daily',skip_paused=False,fq='pre',fields=['open','high','low','close','money'])

# Keep only the following columns (copy to avoid chained-assignment warnings)
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']].copy()
# Build two new columns
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
# Features actually used
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

# From here on, only Adj. Close is used as the input feature
df = df[['Adj. Close']].copy()
print(df.head())


# Number of days 'n' to predict into the future
forecast_out = 30  # n = 30 days
# Create the target (dependent variable) column: Adj. Close shifted 'n' rows up
df['Prediction'] = df['Adj. Close'].shift(-forecast_out)
# Print the new data set
print(df.tail())


### Create the independent data set (X) ###
# Convert the dataframe to a numpy array
X = np.array(df.drop(columns=['Prediction']))

# Remove the last 'n' rows
X = X[:-forecast_out]
# print(X)


### Create the dependent data set (y) ###
# Convert the dataframe to a numpy array (all of the values, including the NaNs)
y = np.array(df['Prediction'])
# Get all of the y values except the last 'n' rows
y = y[:-forecast_out]
# print(y)


# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and train the Support Vector Machine (Regressor)
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_rbf.fit(x_train, y_train)

# Testing Model: Score returns the coefficient of determination R^2 of the prediction.
# The best possible score is 1.0
svm_confidence = svr_rbf.score(x_test, y_test)
print("svm confidence: ", svm_confidence)


# Create and train the Linear Regression  Model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)

# Testing Model: Score returns the coefficient of determination R^2 of the prediction.
# The best possible score is 1.0
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)


# Set x_forecast to the last 'n' rows of the Adj. Close column
x_forecast = np.array(df.drop(columns=['Prediction']))[-forecast_out:]
# print(x_forecast)

# Print linear regression model predictions for the next 'n' days
lr_prediction = lr.predict(x_forecast)
print("linear regression model predictions for the next n days:\n", lr_prediction)
# Save the numpy array as CSV
np.savetxt("lr_prediction.csv", lr_prediction, delimiter=',')
# Values plotted below: the forecast for the next 'n' days, not predictions on the shuffled test set
foreca_set = lr_prediction

# Print support vector regressor model predictions for the next 'n' days
svm_prediction = svr_rbf.predict(x_forecast)
print("support vector regressor model predictions for the next n days:\n", svm_prediction)
# Save the numpy array as CSV
np.savetxt("svm_prediction.csv", svm_prediction, delimiter=',')

# Actual adjusted close prices for the last 'n' days (the inputs to the forecast)
z = np.array(df['Adj. Close'])
print("Actual stock price for the last n days:\n", z[-forecast_out:])
np.savetxt("z.csv", z[-forecast_out:], delimiter=',')

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style

style.use('ggplot')
# Step between forecast rows: one calendar day
one_day = pd.Timedelta(days=1)
# New Forecast column in df to hold the predictions
df['Forecast'] = np.nan
# Date index of the last row of df
last_date = df.iloc[-1].name
next_date = last_date + one_day
# Append one row to df per prediction
for i in foreca_set:
    # [np.nan for _ in range(len(df.columns) - 1)] fills every column except Forecast
    # and [i] is the Forecast value
    # together they form the new row, appended to df under the next date
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
    next_date += one_day
# Plot
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
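
Stock exchanges do not trade on weekends, so forecast rows indexed by consecutive calendar days will land on some non-trading dates. One possible alternative, sketched here under that assumption, is to build the forecast index with pandas' `bdate_range`, which skips Saturdays and Sundays (it does not know about exchange holidays). The dates and forecast values are hypothetical.

```python
import pandas as pd

# Hypothetical last date in the price data: Friday, 3 January 2020.
last_date = pd.Timestamp("2020-01-03")
# Hypothetical model output for the next three trading days.
forecast_values = [101.0, 102.5, 103.2]

# Business-day index starting the day after last_date; weekends are skipped.
future_dates = pd.bdate_range(
    start=last_date + pd.Timedelta(days=1),
    periods=len(forecast_values),
)

# Forecast series aligned to those dates, ready to plot next to the price history.
forecast = pd.Series(forecast_values, index=future_dates, name="Forecast")
print(forecast)
```

Building the whole index at once also avoids appending rows to the DataFrame one at a time inside a loop.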