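1. Project overview
Data analysis and forecasting now inform many investment decisions. This article walks through a Python-based stock prediction and analysis system that applies machine learning to historical stock data, forecasts future price movements, and presents the results to support investors' decisions.
1.1 Background
Stock markets are highly uncertain. Traditional technical and fundamental analysis rely heavily on human judgment, which makes them subjective and slow. Machine learning makes it practical to mine large volumes of historical data for recurring patterns. This project builds an end-to-end stock prediction system that integrates data collection, preprocessing, feature engineering, model training and evaluation, and visualization of predictions, giving investment decisions a data-driven basis.
1.2 Goals
- Build a complete pipeline for collecting and preprocessing stock data
- Implement several machine learning models for stock price prediction
- Provide intuitive data visualization and analysis tools
- Offer a user-friendly interface for investors
- Evaluate the predictive performance of the models and report the best-performing one
1.3 Technology stack
- Programming language: Python 3.8+
- Data processing: Pandas, NumPy
- Machine learning frameworks: Scikit-learn, TensorFlow, Keras
- Deep learning models: LSTM, GRU, Transformer
- Data visualization: Matplotlib, Seaborn, Plotly
- Web interface: Flask, Streamlit
- Data storage: SQLite, MongoDB
- API clients: yfinance, alpha_vantage
2. System architecture
The system is modular and consists of the following core components:
2.1 Architecture diagram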
+------------------------+    +------------------------+    +------------------------+
|                        |    |                        |    |                        |
|    Data collection     |    |   Data preprocessing   |    |  Feature engineering   |
|                        |    |                        |    |                        |
+------------------------+    +------------------------+    +------------------------+
            |                             |                             |
            v                             v                             v
+------------------------+    +------------------------+    +------------------------+
|                        |    |                        |    |                        |
|     Model training     | <- |   Feature selection    | <- |      Data storage      |
|                        |    |                        |    |                        |
+------------------------+    +------------------------+    +------------------------+
            |                             ^                             ^
            v                             |                             |
+------------------------+    +------------------------+    +------------------------+
|                        |    |                        |    |                        |
| Prediction evaluation  | -> |  Result visualization  | -> |     User interface     |
|                        |    |                        |    |                        |
+------------------------+    +------------------------+    +------------------------+
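2.2 Module responsibilities
- Data collection: fetches historical stock data (prices, trading volume, financial indicators) from the supported data sources
- Data preprocessing: cleans, normalizes, and denoises the raw data
- Feature engineering: builds the features the models need, including technical indicators and statistical features
- Data storage: persists the processed data in a database for later analysis
- Feature selection: selects the most predictive subset of features
- Model training: implements several machine learning algorithms and trains the prediction models
- Prediction evaluation: evaluates model performance and generates predictions
- Result visualization: presents predictions as charts
- User interface: provides a friendly interface for running the system and viewing results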
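3. Data collection and preprocessing
3.1 Data sources
The system supports several data sources:
- Public APIs:
  - Yahoo Finance (yfinance)
  - Alpha Vantage
  - Quandl
  - Tushare (for the Chinese stock market)
- CSV import: users can upload CSV files in a custom format
- Database import: data can be loaded from SQLite, MongoDB, and other databases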
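The CSV import path can be sketched as follows. This is a minimal illustration rather than the system's actual loader: the function name `load_price_csv`, the `Date` column name, and the required column set are assumptions chosen to match the OHLCV column names used elsewhere in this article.

```python
import io

import pandas as pd

# Assumed schema: the OHLCV column names used throughout this article.
REQUIRED_COLUMNS = ["Open", "High", "Low", "Close", "Volume"]


def load_price_csv(source, date_col="Date"):
    """Load an OHLCV CSV into a DataFrame indexed by ascending date.

    `source` may be a file path or any file-like object.
    Raises ValueError if a required column is missing.
    """
    df = pd.read_csv(source, parse_dates=[date_col], index_col=date_col)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    # Sort chronologically and keep the last row for any duplicated date.
    df = df.sort_index()
    df = df[~df.index.duplicated(keep="last")]
    return df


# Usage with an in-memory CSV (rows deliberately out of order).
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Volume\n"
    "2024-01-03,11.0,12.0,10.5,11.5,1200\n"
    "2024-01-02,10.0,11.0,9.5,10.5,1000\n"
)
prices = load_price_csv(sample_csv)
print(prices)
```

Validating the schema at load time keeps later stages, which index columns such as `Close` and `Volume` by name, from failing far away from the actual cause.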
3.2 Data collection implementation
The following example fetches stock data with the yfinance library:
import yfinance as yf
from datetime import datetime


class StockDataCollector:
    def __init__(self):
        self.data = None

    def collect_data(self, ticker, start_date, end_date=None, interval='1d'):
        """
        Fetch historical stock data from Yahoo Finance.

        Args:
            ticker (str): Stock symbol, e.g. 'AAPL' or 'MSFT'
            start_date (str): Start date in 'YYYY-MM-DD' format
            end_date (str): End date in 'YYYY-MM-DD' format; defaults to today
            interval (str): Bar interval: '1d' (daily), '1wk' (weekly), '1mo' (monthly)

        Returns:
            pandas.DataFrame: Historical data, or None if nothing could be fetched
        """
        if end_date is None:
            end_date = datetime.now().strftime('%Y-%m-%d')
        try:
            stock = yf.Ticker(ticker)
            data = stock.history(start=start_date, end=end_date, interval=interval)
        except Exception as e:
            print(f"Error while fetching data: {e}")
            return None
        # An unknown ticker or an empty date range typically yields an empty frame, not an exception
        if data.empty:
            print(f"No data returned for {ticker}")
            return None
        self.data = data
        print(f"Fetched {ticker} history from {start_date} to {end_date}")
        return self.data

    def save_to_csv(self, file_path):
        """Save the collected data to a CSV file."""
        if self.data is not None:
            self.data.to_csv(file_path)
            print(f"Data saved to {file_path}")
        else:
            print("No data to save")

    def get_stock_info(self, ticker):
        """Fetch basic information about the stock."""
        try:
            stock = yf.Ticker(ticker)
            return stock.info
        except Exception as e:
            print(f"Error while fetching stock info: {e}")
            return None
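3.3 Data preprocessing
Raw stock data often contains missing values and outliers, so it needs to be preprocessed:
class StockDataPreprocessor:
    def __init__(self, data=None):
        self.data = data

    def load_data(self, data):
        """Load the data to process."""
        self.data = data
        return self

    def handle_missing_values(self, method='ffill'):
        """Handle missing values: forward fill, backward fill, drop rows, or fill with column means."""
        if self.data is None:
            print("No data to process")
            return self
        if method == 'ffill':
            # fillna(method=...) is deprecated in recent pandas; call ffill()/bfill() directly
            self.data = self.data.ffill()
        elif method == 'bfill':
            # Backward fill copies later values into earlier rows, so it uses future information
            self.data = self.data.bfill()
        elif method == 'drop':
            self.data = self.data.dropna()
        elif method == 'mean':
            # numeric_only avoids errors when the frame also holds non-numeric columns
            self.data = self.data.fillna(self.data.mean(numeric_only=True))
        return self

    def remove_outliers(self, columns, method='zscore', threshold=3):
        """Drop rows whose value lies more than `threshold` standard deviations from the column mean."""
        if self.data is None:
            print("No data to process")
            return self
        if method == 'zscore':
            for col in columns:
                if col in self.data.columns:
                    mean = self.data[col].mean()
                    std = self.data[col].std()
                    self.data = self.data[(self.data[col] - mean).abs() <= threshold * std]
        return self

    def normalize_data(self, columns, method='minmax'):
        """Scale columns with min-max or z-score normalization.

        Statistics are computed over all loaded rows; when the output feeds a
        model, compute them on the training split only to avoid look-ahead bias.
        """
        if self.data is None:
            print("No data to process")
            return self
        # Work on a copy so that assigning columns never writes into a filtered view
        self.data = self.data.copy()
        for col in columns:
            if col not in self.data.columns:
                continue
            if method == 'minmax':
                min_val = self.data[col].min()
                span = self.data[col].max() - min_val
                # A constant column has zero range; map it to 0 instead of dividing by zero
                self.data[col] = 0.0 if span == 0 else (self.data[col] - min_val) / span
            elif method == 'zscore':
                mean = self.data[col].mean()
                std = self.data[col].std()
                self.data[col] = 0.0 if std == 0 else (self.data[col] - mean) / std
        return self

    def get_processed_data(self):
        """Return the processed data."""
        return self.data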
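`normalize_data` computes its statistics over every loaded row. When the scaled data is later split into training and test sets, the test period then influences the scaling of the training period. A minimal sketch of the usual alternative, fitting on the training rows and reusing those parameters for the test rows, is shown below; the helper names `fit_minmax` and `apply_minmax` are hypothetical and not part of the class above.

```python
import pandas as pd


def fit_minmax(train, columns):
    """Return per-column (min, max) computed on the training rows only."""
    return {col: (train[col].min(), train[col].max()) for col in columns}


def apply_minmax(df, params):
    """Scale columns with previously fitted (min, max) pairs."""
    out = df.copy()
    for col, (lo, hi) in params.items():
        span = hi - lo
        # A constant column has zero range; map it to 0.0 instead of dividing by zero.
        out[col] = 0.0 if span == 0 else (df[col] - lo) / span
    return out


prices = pd.DataFrame({"Close": [10.0, 12.0, 14.0, 16.0, 20.0]})
split = int(len(prices) * 0.8)  # first 80% of rows, in time order
train, test = prices.iloc[:split], prices.iloc[split:]

params = fit_minmax(train, ["Close"])
train_scaled = apply_minmax(train, params)
test_scaled = apply_minmax(test, params)
# Scaled test values can exceed 1.0 when test prices leave the training range.
print(test_scaled)
```

Values outside [0, 1] in the test set are expected with this approach and are the honest consequence of not looking ahead.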
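4. Feature engineering
Feature engineering largely determines how well a machine learning model performs. For stock prediction, useful features have to be derived from the raw price data.
4.1 Technical indicators
Technical indicators are standard tools in stock analysis that summarize price trend, momentum, and volatility. The TechnicalIndicators class in this section computes the most common ones with pandas.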
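The `add_rsi` method in this section averages gains and losses with a simple rolling mean. Many charting tools instead use Wilder's smoothing, an exponential average with alpha = 1 / period. The sketch below shows that variant for comparison. The function name `rsi_wilder` is hypothetical, and the exponential average is seeded from the first observation rather than from an initial simple average, so early values can differ slightly from other implementations.

```python
import pandas as pd


def rsi_wilder(close, period=14):
    """RSI using Wilder's smoothing (an EMA with alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)    # magnitude of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss         # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)


rising = pd.Series(range(1, 31), dtype=float)
falling = pd.Series(range(30, 0, -1), dtype=float)
print(rsi_wilder(rising).iloc[-1], rsi_wilder(falling).iloc[-1])
```

A strictly rising series has no losses, so the ratio is infinite and the RSI saturates at 100; a strictly falling series saturates at 0.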
```python
import numpy as np
import pandas as pd


class TechnicalIndicators:
    def __init__(self, data=None):
        self.data = data

    def load_data(self, data):
        """Load the price data."""
        self.data = data
        return self

    def add_moving_averages(self, periods=(5, 10, 20, 50, 200)):
        """Add simple moving averages of the close."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return self
        for period in periods:
            self.data[f'MA_{period}'] = self.data['Close'].rolling(window=period).mean()
        return self

    def add_exponential_moving_averages(self, periods=(5, 10, 20, 50, 200)):
        """Add exponential moving averages of the close."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return self
        for period in periods:
            self.data[f'EMA_{period}'] = self.data['Close'].ewm(span=period, adjust=False).mean()
        return self

    def add_rsi(self, periods=(14,)):
        """Add the Relative Strength Index (RSI), averaging gains and losses with a simple rolling mean."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return self
        for period in periods:
            delta = self.data['Close'].diff()
            gain = delta.where(delta > 0, 0)
            loss = -delta.where(delta < 0, 0)
            avg_gain = gain.rolling(window=period).mean()
            avg_loss = loss.rolling(window=period).mean()
            rs = avg_gain / avg_loss
            self.data[f'RSI_{period}'] = 100 - (100 / (1 + rs))
        return self

    def add_macd(self, fast_period=12, slow_period=26, signal_period=9):
        """Add MACD, its signal line, and the histogram."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return self
        ema_fast = self.data['Close'].ewm(span=fast_period, adjust=False).mean()
        ema_slow = self.data['Close'].ewm(span=slow_period, adjust=False).mean()
        self.data['MACD'] = ema_fast - ema_slow
        self.data['MACD_Signal'] = self.data['MACD'].ewm(span=signal_period, adjust=False).mean()
        self.data['MACD_Hist'] = self.data['MACD'] - self.data['MACD_Signal']
        return self

    def add_bollinger_bands(self, period=20, std_dev=2):
        """Add Bollinger Bands (middle band plus and minus std_dev rolling standard deviations)."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return self
        middle = self.data['Close'].rolling(window=period).mean()
        std = self.data['Close'].rolling(window=period).std()
        self.data[f'BB_Middle_{period}'] = middle
        self.data[f'BB_Std_{period}'] = std
        self.data[f'BB_Upper_{period}'] = middle + std_dev * std
        self.data[f'BB_Lower_{period}'] = middle - std_dev * std
        return self

    def add_atr(self, period=14):
        """Add the Average True Range (ATR) as a rolling mean of the true range."""
        if self.data is None or not all(col in self.data.columns for col in ['High', 'Low', 'Close']):
            print("Data is missing required price columns")
            return self
        prev_close = self.data['Close'].shift()
        high_low = self.data['High'] - self.data['Low']
        high_close = (self.data['High'] - prev_close).abs()
        low_close = (self.data['Low'] - prev_close).abs()
        true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
        self.data[f'ATR_{period}'] = true_range.rolling(window=period).mean()
        return self

    def add_stochastic_oscillator(self, k_period=14, d_period=3):
        """Add the stochastic oscillator (%K and %D)."""
        if self.data is None or not all(col in self.data.columns for col in ['High', 'Low', 'Close']):
            print("Data is missing required price columns")
            return self
        low_min = self.data['Low'].rolling(window=k_period).min()
        high_max = self.data['High'].rolling(window=k_period).max()
        # A flat window has zero range; use NaN there instead of dividing by zero
        price_range = (high_max - low_min).replace(0, np.nan)
        self.data['%K'] = 100 * (self.data['Close'] - low_min) / price_range
        self.data['%D'] = self.data['%K'].rolling(window=d_period).mean()
        return self

    def add_obv(self):
        """Add On-Balance Volume (OBV)."""
        if self.data is None or not all(col in self.data.columns for col in ['Close', 'Volume']):
            print("Data is missing the Close or Volume column")
            return self
        # +1 on up days, -1 on down days, 0 when unchanged (and on the first row)
        direction = np.sign(self.data['Close'].diff()).fillna(0)
        self.data['OBV'] = (direction * self.data['Volume']).cumsum()
        return self

    def get_data_with_indicators(self):
        """Return the data with the indicator columns added."""
        return self.data
```
4.2 Feature selection
Stock data can carry a large number of features, and not all of them help prediction. Feature selection can improve model performance and reduce overfitting:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor


class FeatureSelector:
    def __init__(self, data=None):
        self.data = data
        self.selected_features = None

    def load_data(self, data):
        """Load the data."""
        self.data = data
        return self

    def prepare_data(self, target_col='Close', lag_periods=(1, 2, 3, 5, 10)):
        """Build the target and lagged features; return (X, y)."""
        if self.data is None:
            print("No data to process")
            return None, None
        data = self.data.copy()
        # Fix the column list up front so that only the original columns are lagged;
        # iterating over data.columns inside the loop would also lag earlier lag columns
        base_columns = [col for col in data.columns if col != 'Target']
        # Target: the next day's closing price
        data['Target'] = data[target_col].shift(-1)
        for lag in lag_periods:
            for col in base_columns:
                data[f'{col}_Lag_{lag}'] = data[col].shift(lag)
        # Drop rows made incomplete by shifting or by indicator warm-up periods
        self.data = data.dropna()
        X = self.data.drop(['Target'], axis=1)
        y = self.data['Target']
        return X, y

    def select_k_best(self, X, y, k=10):
        """Select the k features with the highest univariate F-statistic."""
        selector = SelectKBest(score_func=f_regression, k=k)
        selector.fit(X, y)
        cols = selector.get_support(indices=True)
        self.selected_features = X.columns[cols].tolist()
        return X[self.selected_features], self.selected_features

    def select_with_rfe(self, X, y, n_features=10):
        """Select features with recursive feature elimination (RFE)."""
        estimator = RandomForestRegressor(n_estimators=100, random_state=42)
        selector = RFE(estimator, n_features_to_select=n_features)
        selector.fit(X, y)
        cols = selector.get_support(indices=True)
        self.selected_features = X.columns[cols].tolist()
        return X[self.selected_features], self.selected_features

    def select_with_random_forest(self, X, y, threshold=0.01):
        """Select features whose random forest importance exceeds the threshold."""
        rf = RandomForestRegressor(n_estimators=100, random_state=42)
        rf.fit(X, y)
        importances = rf.feature_importances_
        # Most important first
        indices = np.argsort(importances)[::-1]
        self.selected_features = [X.columns[i] for i in indices if importances[i] > threshold]
        return X[self.selected_features], self.selected_features
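The selectors above score features on whatever `X` and `y` they receive. If they are given the full history, the test period leaks into the choice of features. The sketch below fits `SelectKBest` on the training rows only and then applies the same column choice to the test rows. The data is synthetic, and the feature names `f0` to `f4` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
# Synthetic target: depends on f0 and f3 only, plus a little noise.
y = 3.0 * X["f0"] - 2.0 * X["f3"] + rng.normal(scale=0.1, size=n)

split = int(n * 0.8)  # chronological split: first 80% for training
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train = y.iloc[:split]

# Fit the selector on training rows only, then reuse its column choice.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()].tolist()
X_train_sel, X_test_sel = X_train[selected], X_test[selected]
print(selected)
```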
5. Model implementation
The system implements several models for stock price prediction, covering both traditional machine learning and deep learning.
5.1 Data preparation
Before training, the data is split into a training set and a test set:
import numpy as np
from sklearn.model_selection import train_test_split as sk_train_test_split


class DataPreparation:
    def __init__(self, X=None, y=None):
        self.X = X
        self.y = y
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    def load_data(self, X, y):
        """Load features and target."""
        self.X = X
        self.y = y
        return self

    def train_test_split(self, test_size=0.2, random_state=42):
        """Split into training and test sets without shuffling, preserving time order."""
        if self.X is None or self.y is None:
            print("No data to split")
            return self
        # shuffle=False keeps chronological order; random_state has no effect without shuffling
        self.X_train, self.X_test, self.y_train, self.y_test = sk_train_test_split(
            self.X, self.y, test_size=test_size, shuffle=False
        )
        return self

    def time_series_split(self, test_size=0.2):
        """Split chronologically: the earliest rows train, the latest rows test."""
        if self.X is None or self.y is None:
            print("No data to split")
            return self
        # Index of the first test row
        test_index = int(len(self.X) * (1 - test_size))
        self.X_train = self.X.iloc[:test_index]
        self.X_test = self.X.iloc[test_index:]
        self.y_train = self.y.iloc[:test_index]
        self.y_test = self.y.iloc[test_index:]
        return self

    def prepare_lstm_data(self, time_steps=60, train_ratio=0.8):
        """Build sliding windows of shape (samples, time_steps, features) for an LSTM."""
        if self.X is None or self.y is None:
            print("No data to process")
            return None, None, None, None
        X_values = self.X.values
        y_values = self.y.values
        X_lstm, y_lstm = [], []
        # Each window ends at row i - 1 and is paired with that row's target. This assumes
        # y is already aligned with X row by row, as produced by FeatureSelector.prepare_data
        for i in range(time_steps, len(X_values) + 1):
            X_lstm.append(X_values[i - time_steps:i])
            y_lstm.append(y_values[i - 1])
        X_lstm, y_lstm = np.array(X_lstm), np.array(y_lstm)
        # Chronological split
        train_size = int(len(X_lstm) * train_ratio)
        X_train, X_test = X_lstm[:train_size], X_lstm[train_size:]
        y_train, y_test = y_lstm[:train_size], y_lstm[train_size:]
        return X_train, X_test, y_train, y_test

    def get_train_test_data(self):
        """Return the training and test sets produced by the last split."""
        return self.X_train, self.X_test, self.y_train, self.y_test
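The two ideas in this section, sliding windows for sequence models and splits that respect time order, can be shown without pandas or a deep learning framework. The sketch below is standalone. The function names are hypothetical, and the pairing convention (each window is labelled with the target stored on its last row) is an assumption that has to match how the target column was built.

```python
import numpy as np


def make_windows(features, target, time_steps):
    """Build (samples, time_steps, n_features) windows.

    Each window is paired with the target stored on its last row,
    so windows and labels stay aligned.
    """
    features = np.asarray(features)
    target = np.asarray(target)
    X, y = [], []
    for end in range(time_steps, len(features) + 1):
        X.append(features[end - time_steps:end])
        y.append(target[end - 1])
    return np.array(X), np.array(y)


def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    fold = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold
        yield np.arange(train_end), np.arange(train_end, train_end + fold)


features = np.arange(20, dtype=float).reshape(10, 2)  # 10 rows, 2 features
target = np.arange(10, dtype=float) + 100             # target[i] belongs to row i
X, y = make_windows(features, target, time_steps=3)
print(X.shape, y.shape)
splits = list(walk_forward_splits(len(X), n_folds=2, min_train=4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Every test fold lies strictly after its training fold, which mimics how the model would be used in practice. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window scheme.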
5.2 Traditional machine learning models
Several traditional machine learning models are implemented for stock price prediction:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import joblib


class TraditionalModels:
    def __init__(self):
        self.models = {}
        self.best_model = None          # name of the model with the lowest RMSE so far
        self.best_score = float('inf')

    def train_linear_regression(self, X_train, y_train):
        """Train a linear regression model."""
        model = LinearRegression()
        model.fit(X_train, y_train)
        self.models['LinearRegression'] = model
        return model

    def train_ridge_regression(self, X_train, y_train, alpha=1.0):
        """Train a ridge regression model."""
        model = Ridge(alpha=alpha)
        model.fit(X_train, y_train)
        self.models['Ridge'] = model
        return model

    def train_lasso_regression(self, X_train, y_train, alpha=0.1):
        """Train a Lasso regression model."""
        model = Lasso(alpha=alpha)
        model.fit(X_train, y_train)
        self.models['Lasso'] = model
        return model

    def train_random_forest(self, X_train, y_train, n_estimators=100, max_depth=None):
        """Train a random forest model."""
        model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        model.fit(X_train, y_train)
        self.models['RandomForest'] = model
        return model

    def train_gradient_boosting(self, X_train, y_train, n_estimators=100, learning_rate=0.1):
        """Train a gradient boosting model."""
        model = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=learning_rate, random_state=42)
        model.fit(X_train, y_train)
        self.models['GradientBoosting'] = model
        return model

    def train_svr(self, X_train, y_train, kernel='rbf', C=1.0, epsilon=0.1):
        """Train a support vector regression model (SVR is sensitive to feature scale, so scale inputs first)."""
        model = SVR(kernel=kernel, C=C, epsilon=epsilon)
        model.fit(X_train, y_train)
        self.models['SVR'] = model
        return model

    def train_all_models(self, X_train, y_train):
        """Train every model with its default hyperparameters."""
        self.train_linear_regression(X_train, y_train)
        self.train_ridge_regression(X_train, y_train)
        self.train_lasso_regression(X_train, y_train)
        self.train_random_forest(X_train, y_train)
        self.train_gradient_boosting(X_train, y_train)
        self.train_svr(X_train, y_train)
        return self.models

    def evaluate_model(self, model, X_test, y_test):
        """Compute MSE, RMSE, MAE, and R2 on the test set."""
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        return {
            'MSE': mse,
            'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }

    def evaluate_all_models(self, X_test, y_test):
        """Evaluate every trained model and remember the one with the lowest RMSE."""
        results = {}
        for name, model in self.models.items():
            results[name] = self.evaluate_model(model, X_test, y_test)
            if results[name]['RMSE'] < self.best_score:
                self.best_score = results[name]['RMSE']
                self.best_model = name
        return results

    def save_model(self, model_name, file_path):
        """Save a trained model to disk."""
        if model_name in self.models:
            joblib.dump(self.models[model_name], file_path)
            print(f"Model saved to {file_path}")
        else:
            print(f"Model {model_name} does not exist")

    def load_model(self, model_name, file_path):
        """Load a model from disk and register it under model_name."""
        try:
            model = joblib.load(file_path)
            self.models[model_name] = model
            print(f"Model loaded from {file_path}")
            return model
        except Exception as e:
            print(f"Error while loading model: {e}")
            return None

    def get_best_model(self):
        """Return (model, name) for the best model, or (None, None) before any evaluation."""
        if self.best_model is None:
            print("Models have not been evaluated yet")
            return None, None
        return self.models[self.best_model], self.best_model
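RMSE on its own says little for price series, because consecutive closes are highly correlated. A common sanity check, added here as a suggestion rather than as part of the original system, is to compare each model against the naive persistence forecast "tomorrow's close equals today's close". The helper names below are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def persistence_rmse(close):
    """RMSE of the naive forecast: each close is predicted by the previous close."""
    close = np.asarray(close, float)
    return rmse(close[1:], close[:-1])


def skill_vs_persistence(model_rmse, baseline_rmse):
    """Positive values mean the model beats the naive baseline."""
    return 1.0 - model_rmse / baseline_rmse


close = [100.0, 102.0, 101.0, 105.0, 107.0]  # illustrative test-period closes
baseline = persistence_rmse(close)
skill = skill_vs_persistence(model_rmse=2.0, baseline_rmse=baseline)
print(baseline, skill)
```

A model whose skill score is near zero or negative is not adding information beyond yesterday's price, however high its R2 looks.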
5.3 Deep learning models
For time series data, recurrent networks such as LSTM and GRU can model dependencies across time steps, which makes them a natural candidate for price sequences, although they do not always outperform simpler models:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, GRU, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam


class DeepLearningModels:
    def __init__(self):
        self.models = {}
        self.best_model = None
        self.best_score = float('inf')
        self.scalers = {}

    def preprocess_data(self, X_train, X_test, y_train, y_test, feature_range=(0, 1)):
        """Min-max scale features and target; the scalers are fitted on the training data only."""
        X_train, X_test = np.asarray(X_train, dtype=float), np.asarray(X_test, dtype=float)
        n_features = X_train.shape[-1]
        # MinMaxScaler expects 2-D input, so flatten 3-D LSTM windows and restore the shape afterwards
        X_scaler = MinMaxScaler(feature_range=feature_range)
        X_train_scaled = X_scaler.fit_transform(X_train.reshape(-1, n_features)).reshape(X_train.shape)
        X_test_scaled = X_scaler.transform(X_test.reshape(-1, n_features)).reshape(X_test.shape)
        # np.asarray accepts both pandas Series and NumPy arrays
        y_scaler = MinMaxScaler(feature_range=feature_range)
        y_train_scaled = y_scaler.fit_transform(np.asarray(y_train, dtype=float).reshape(-1, 1))
        y_test_scaled = y_scaler.transform(np.asarray(y_test, dtype=float).reshape(-1, 1))
        # Keep the scalers so predictions can be mapped back to prices
        self.scalers['X'] = X_scaler
        self.scalers['y'] = y_scaler
        return X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled

    def reshape_data_for_lstm(self, X_train, X_test):
        """Reshape 2-D data to [samples, time_steps, features] with a single time step per sample."""
        X_train_reshaped = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
        X_test_reshaped = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
        return X_train_reshaped, X_test_reshaped

    def build_lstm_model(self, input_shape, units=50, dropout=0.2):
        """Build a two-layer LSTM model; input_shape is (time_steps, n_features)."""
        model = Sequential()
        model.add(Input(shape=input_shape))
        model.add(LSTM(units=units, return_sequences=True))
        model.add(Dropout(dropout))
        model.add(LSTM(units=units, return_sequences=False))
        model.add(Dropout(dropout))
        model.add(Dense(units=25))
        model.add(Dense(units=1))
        model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
        return model

    def build_gru_model(self, input_shape, units=50, dropout=0.2):
        """Build a two-layer GRU model."""
        model = Sequential()
        model.add(Input(shape=input_shape))
        model.add(GRU(units=units, return_sequences=True))
        model.add(Dropout(dropout))
        model.add(GRU(units=units, return_sequences=False))
        model.add(Dropout(dropout))
        model.add(Dense(units=25))
        model.add(Dense(units=1))
        model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
        return model

    def build_bidirectional_lstm_model(self, input_shape, units=50, dropout=0.2):
        """Build a two-layer bidirectional LSTM model."""
        model = Sequential()
        model.add(Input(shape=input_shape))
        model.add(Bidirectional(LSTM(units=units, return_sequences=True)))
        model.add(Dropout(dropout))
        model.add(Bidirectional(LSTM(units=units, return_sequences=False)))
        model.add(Dropout(dropout))
        model.add(Dense(units=25))
        model.add(Dense(units=1))
        model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
        return model

    def train_model(self, model, X_train, y_train, X_val=None, y_val=None, epochs=100, batch_size=32, model_name=None):
        """Train a model with early stopping on the validation loss."""
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
        ]
        if model_name:
            callbacks.append(ModelCheckpoint(f'{model_name}.h5', save_best_only=True))
        # Without an explicit validation set, hold out the last 20% of the training data
        if X_val is None or y_val is None:
            validation_split = 0.2
            validation_data = None
        else:
            validation_split = 0.0
            validation_data = (X_val, y_val)
        history = model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=validation_split,
            validation_data=validation_data,
            callbacks=callbacks,
            verbose=1
        )
        if model_name:
            self.models[model_name] = model
        return model, history

    def evaluate_model(self, model, X_test, y_test):
        """Compute MSE, RMSE, MAE, and R2, in original price units when a target scaler is stored."""
        # Force column vectors so that y_test - y_pred never broadcasts to an (n, n) matrix
        y_pred = np.asarray(model.predict(X_test)).reshape(-1, 1)
        y_test = np.asarray(y_test, dtype=float).reshape(-1, 1)
        # Undo the target scaling if preprocess_data was used
        if 'y' in self.scalers:
            y_test = self.scalers['y'].inverse_transform(y_test)
            y_pred = self.scalers['y'].inverse_transform(y_pred)
        mse = np.mean(np.square(y_test - y_pred))
        rmse = np.sqrt(mse)
        mae = np.mean(np.abs(y_test - y_pred))
        # R2 = 1 - residual sum of squares / total sum of squares
        ss_tot = np.sum(np.square(y_test - np.mean(y_test)))
        ss_res = np.sum(np.square(y_test - y_pred))
        r2 = 1 - (ss_res / ss_tot)
        return {
            'MSE': mse,
            'RMSE': rmse,
            'MAE': mae,
            'R2': r2
        }

    def predict_future(self, model, last_sequence, n_steps=30, scaler=None, target_index=0):
        """Forecast n_steps ahead by feeding each prediction back into the input window.

        last_sequence has shape (1, time_steps, n_features). The target is assumed to be
        feature `target_index` and to share the scaling of the model output.
        """
        predictions = []
        current_sequence = np.array(last_sequence, dtype=float)
        for _ in range(n_steps):
            current_pred = float(model.predict(current_sequence, verbose=0)[0, 0])
            predictions.append(current_pred)
            # Slide the window: drop the oldest step and append a new one. The new step copies
            # the latest features and overwrites only the target; np.roll alone would wrap the
            # oldest step's features around to the end of the window
            next_step = current_sequence[0, -1].copy()
            next_step[target_index] = current_pred
            current_sequence = np.concatenate(
                [current_sequence[:, 1:], next_step[None, None, :]], axis=1
            )
        # Map predictions back to prices if a scaler is given
        if scaler is not None:
            predictions = scaler.inverse_transform(np.array(predictions).reshape(-1, 1))
        return predictions

    def save_model(self, model_name, file_path):
        """Save a trained model to disk."""
        if model_name in self.models:
            self.models[model_name].save(file_path)
            print(f"Model saved to {file_path}")
        else:
            print(f"Model {model_name} does not exist")

    def load_model(self, model_name, file_path):
        """Load a model from disk and register it under model_name."""
        try:
            model = load_model(file_path)
            self.models[model_name] = model
            print(f"Model loaded from {file_path}")
            return model
        except Exception as e:
            print(f"Error while loading model: {e}")
            return None

    def plot_training_history(self, history, title="Training history"):
        """Plot training and validation loss per epoch."""
        plt.figure(figsize=(12, 6))
        plt.plot(history.history['loss'], label='Training loss')
        plt.plot(history.history['val_loss'], label='Validation loss')
        plt.title(title)
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True)
        plt.show()
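The recursive forecasting idea behind `predict_future` can be checked without TensorFlow by swapping the network for any callable. In the sketch below, `predict_fn` and `drift_model` are hypothetical stand-ins, and carrying the non-target features forward unchanged is a simplifying assumption that becomes less realistic as the horizon grows.

```python
import numpy as np


def recursive_forecast(predict_fn, last_window, n_steps, target_idx=0):
    """Roll a (time_steps, n_features) window forward n_steps times.

    predict_fn: hypothetical callable mapping a window to one scalar forecast.
    Non-target features of each new row are copied from the previous row.
    """
    window = np.array(last_window, dtype=float)
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_fn(window))
        forecasts.append(next_value)
        new_row = window[-1].copy()       # carry forward the latest features
        new_row[target_idx] = next_value  # overwrite only the target column
        window = np.vstack([window[1:], new_row])
    return np.array(forecasts)


def drift_model(window):
    """Stand-in model: last close plus the average step over the window."""
    closes = window[:, 0]
    return closes[-1] + (closes[-1] - closes[0]) / (len(closes) - 1)


last_window = np.array([[10.0, 500.0], [11.0, 520.0], [12.0, 510.0]])
forecast = recursive_forecast(drift_model, last_window, n_steps=3)
print(forecast)
```

With a trained Keras model, `predict_fn` could be a small wrapper such as `lambda w: model.predict(w[np.newaxis, ...], verbose=0)[0, 0]`. Errors compound with each step, so long horizons should be read with caution.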
5.4 Ensemble models
Combining the predictions of several models averages out some of their individual errors and often, though not always, improves accuracy:
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


class EnsembleModel:
    def __init__(self):
        self.models = {}
        self.ensemble_model = None

    def add_model(self, name, model):
        """Add a model to the ensemble."""
        self.models[name] = model
        return self

    def create_voting_ensemble(self, weights=None):
        """Create a VotingRegressor over the added scikit-learn models."""
        if not self.models:
            print("No models to combine")
            return None
        estimators = list(self.models.items())
        self.ensemble_model = VotingRegressor(estimators=estimators, weights=weights)
        return self.ensemble_model

    def train_ensemble(self, X_train, y_train):
        """Fit the voting ensemble (VotingRegressor refits a clone of each estimator)."""
        if self.ensemble_model is None:
            print("Create the ensemble first")
            return None
        self.ensemble_model.fit(X_train, y_train)
        return self.ensemble_model

    def weighted_average_prediction(self, X, weights=None):
        """Combine the predictions of the already trained models by weighted average."""
        if not self.models:
            print("No models to combine")
            return None
        # ravel() flattens (n, 1) outputs, e.g. from Keras models, to shape (n,)
        predictions = np.array([np.ravel(model.predict(X)) for model in self.models.values()])
        if weights is None:
            # Equal weights by default
            weights = np.ones(len(self.models)) / len(self.models)
        else:
            # Normalize so the weights sum to 1
            weights = np.array(weights) / np.sum(weights)
        return np.sum(predictions * weights.reshape(-1, 1), axis=0)

    def evaluate_ensemble(self, X_test, y_test):
        """Evaluate the voting ensemble."""
        if self.ensemble_model is None:
            print("Create the ensemble first")
            return None
        y_pred = self.ensemble_model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        return {
            'MSE': mse,
            'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }

    def evaluate_weighted_ensemble(self, X_test, y_test, weights=None):
        """Evaluate the weighted-average ensemble."""
        y_pred = self.weighted_average_prediction(X_test, weights)
        mse = mean_squared_error(y_test, y_pred)
        return {
            'MSE': mse,
            'RMSE': np.sqrt(mse),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred)
        }
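`weighted_average_prediction` leaves the choice of weights to the caller. One simple heuristic, shown here as an assumption rather than as the system's method, is to weight each model by the inverse of its RMSE on a validation set, so that more accurate models count for more. The function names and the RMSE values are illustrative.

```python
import numpy as np


def inverse_rmse_weights(rmse_by_model):
    """Map {model_name: validation RMSE} to weights that sum to 1."""
    names = list(rmse_by_model)
    inv = np.array([1.0 / rmse_by_model[name] for name in names])
    return dict(zip(names, inv / inv.sum()))


def blend(predictions, weights):
    """Weighted average of {model_name: 1-D prediction array}."""
    return sum(weights[name] * np.ravel(pred) for name, pred in predictions.items())


val_rmse = {"Ridge": 2.0, "RandomForest": 1.0}  # illustrative validation errors
weights = inverse_rmse_weights(val_rmse)
preds = {
    "Ridge": np.array([10.0, 20.0]),
    "RandomForest": np.array([13.0, 23.0]),
}
blended = blend(preds, weights)
print(weights, blended)
```

The weights should come from a validation period that is separate from the final test period. Otherwise the ensemble's reported test error is optimistic.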
6. Data visualization
Visualization is an essential part of the system: it makes the raw data, technical indicators, and predictions easy to inspect.
6.1 Raw data visualization
Matplotlib and Plotly are used to visualize the raw stock data:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import plotly.graph_objects as go
import pandas as pd
from datetime import timedelta  # pd and timedelta are used by PredictionVisualizer in section 6.2


class StockDataVisualizer:
    def __init__(self, data=None):
        self.data = data

    def load_data(self, data):
        """Load the data."""
        self.data = data
        return self

    def plot_stock_price(self, title="Stock price trend", figsize=(12, 6)):
        """Plot the closing price with Matplotlib."""
        if self.data is None or 'Close' not in self.data.columns:
            print("Data has no Close column")
            return None
        plt.figure(figsize=figsize)
        plt.plot(self.data.index, self.data['Close'], label='Close')
        # One tick per month, formatted as a date
        plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_ohlc(self, title="Stock OHLC chart", figsize=(12, 6)):
        """Plot an OHLC chart with Matplotlib bars."""
        if self.data is None or not all(col in self.data.columns for col in ['Open', 'High', 'Low', 'Close']):
            print("Data is missing required price columns")
            return None
        fig, ax = plt.subplots(figsize=figsize)
        # Body width in days; wicks use a fifth of it
        width = 0.6
        up = self.data[self.data['Close'] >= self.data['Open']]
        down = self.data[self.data['Close'] < self.data['Open']]
        # Rising days (green): body, upper wick, lower wick
        ax.bar(up.index, up['Close'] - up['Open'], width, bottom=up['Open'], color='g')
        ax.bar(up.index, up['High'] - up['Close'], width / 5, bottom=up['Close'], color='g')
        ax.bar(up.index, up['Open'] - up['Low'], width / 5, bottom=up['Low'], color='g')
        # Falling days (red): body, upper wick, lower wick
        ax.bar(down.index, down['Open'] - down['Close'], width, bottom=down['Close'], color='r')
        ax.bar(down.index, down['High'] - down['Open'], width / 5, bottom=down['Open'], color='r')
        ax.bar(down.index, down['Close'] - down['Low'], width / 5, bottom=down['Low'], color='r')
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        ax.xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_candlestick_plotly(self, title="Candlestick chart"):
        """Plot an interactive candlestick chart with Plotly."""
        if self.data is None or not all(col in self.data.columns for col in ['Open', 'High', 'Low', 'Close']):
            print("Data is missing required price columns")
            return None
        fig = go.Figure(data=[go.Candlestick(
            x=self.data.index,
            open=self.data['Open'],
            high=self.data['High'],
            low=self.data['Low'],
            close=self.data['Close'],
            name='Candlestick'
        )])
        # Overlay 5-day and 20-day moving averages when there is enough history
        if len(self.data) >= 20:
            fig.add_trace(go.Scatter(
                x=self.data.index,
                y=self.data['Close'].rolling(window=5).mean(),
                line=dict(color='blue', width=1),
                name='5-day moving average'
            ))
            fig.add_trace(go.Scatter(
                x=self.data.index,
                y=self.data['Close'].rolling(window=20).mean(),
                line=dict(color='orange', width=1),
                name='20-day moving average'
            ))
        fig.update_layout(
            title=title,
            xaxis_title='Date',
            yaxis_title='Price',
            xaxis_rangeslider_visible=False,
            template='plotly_white'
        )
        return fig

    def plot_volume(self, title="Volume analysis", figsize=(12, 6)):
        """Plot trading volume as bars."""
        if self.data is None or 'Volume' not in self.data.columns:
            print("Data has no Volume column")
            return None
        plt.figure(figsize=figsize)
        if 'Close' in self.data.columns:
            # Green when the close is above the previous day's close, red otherwise
            prev_close = self.data['Close'].shift(1)
            colors = ['g' if close > prev else 'r' for close, prev in zip(self.data['Close'], prev_close)]
        else:
            colors = 'b'
        # The label guarantees that plt.legend() always has an entry to show
        plt.bar(self.data.index, self.data['Volume'], color=colors, alpha=0.8, label='Volume')
        if len(self.data) >= 20:
            plt.plot(self.data.index, self.data['Volume'].rolling(window=20).mean(), color='orange', label='20-day average volume')
        plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Volume')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_technical_indicators(self, indicators, title="Technical indicators", figsize=(12, 8)):
        """Plot the given indicator columns together with the closing price."""
        if self.data is None:
            print("No data to plot")
            return None
        # Every requested indicator must already be a column
        for indicator in indicators:
            if indicator not in self.data.columns:
                print(f"Indicator {indicator} does not exist")
                return None
        fig, ax = plt.subplots(figsize=figsize)
        if 'Close' in self.data.columns:
            ax.plot(self.data.index, self.data['Close'], label='Close', color='black')
        colors = ['blue', 'green', 'red', 'purple', 'orange', 'brown', 'pink', 'gray', 'olive', 'cyan']
        for i, indicator in enumerate(indicators):
            ax.plot(self.data.index, self.data[indicator], label=indicator, color=colors[i % len(colors)])
        ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        ax.xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Value')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt
6.2 Prediction visualization
Plotting the model's predictions makes their quality easy to judge at a glance:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import timedelta


class PredictionVisualizer:
    def __init__(self, actual_data=None, predicted_data=None):
        self.actual_data = actual_data
        self.predicted_data = predicted_data

    def load_data(self, actual_data, predicted_data):
        """Load actual and predicted values."""
        self.actual_data = actual_data
        self.predicted_data = predicted_data
        return self

    def plot_predictions(self, title="Stock price predictions", figsize=(12, 6)):
        """Plot predicted values against actual values."""
        if self.actual_data is None or self.predicted_data is None:
            print("Data is incomplete")
            return None
        plt.figure(figsize=figsize)
        plt.plot(self.actual_data.index, self.actual_data, label='Actual', color='blue')
        # Use the predictions' own index when it matches; otherwise align them to the actual index
        if isinstance(self.predicted_data, pd.Series) and self.predicted_data.index.equals(self.actual_data.index):
            plt.plot(self.predicted_data.index, self.predicted_data, label='Predicted', color='red', linestyle='--')
        else:
            plt.plot(self.actual_data.index, self.predicted_data, label='Predicted', color='red', linestyle='--')
        plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_future_predictions(self, historical_data, future_predictions, prediction_dates=None, title="Future stock price forecast", figsize=(12, 6)):
        """Plot historical data followed by the forecast."""
        if historical_data is None or future_predictions is None:
            print("Data is incomplete")
            return None
        plt.figure(figsize=figsize)
        plt.plot(historical_data.index, historical_data, label='Historical data', color='blue')
        # Generate forecast dates if none were given
        if prediction_dates is None:
            last_date = historical_data.index[-1]
            if isinstance(last_date, pd.Timestamp):
                # Consecutive calendar days, weekends included; pass prediction_dates to use trading days
                prediction_dates = [last_date + timedelta(days=i + 1) for i in range(len(future_predictions))]
            else:
                prediction_dates = range(len(historical_data), len(historical_data) + len(future_predictions))
        plt.plot(prediction_dates, future_predictions, label='Forecast', color='red', linestyle='--')
        # Vertical line marking the end of the historical data
        plt.axvline(x=historical_data.index[-1], color='green', linestyle='-', label='Current date')
        # Date formatting applies only to a datetime index
        if isinstance(historical_data.index[0], pd.Timestamp):
            plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
            plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_model_comparison(self, actual_data, predictions_dict, title="Model comparison", figsize=(12, 6)):
        """Plot the predictions of several models against the actual values."""
        if actual_data is None or not predictions_dict:
            print("Data is incomplete")
            return None
        plt.figure(figsize=figsize)
        plt.plot(actual_data.index, actual_data, label='Actual', color='black', linewidth=2)
        colors = ['red', 'blue', 'green', 'purple', 'orange', 'brown', 'pink', 'gray']
        for i, (model_name, predictions) in enumerate(predictions_dict.items()):
            plt.plot(actual_data.index, predictions, label=f'{model_name} prediction', color=colors[i % len(colors)], linestyle='--')
        plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
        plt.title(title)
        plt.xlabel('Date')
        plt.ylabel('Price')
        plt.legend()
        plt.grid(True)
        plt.xticks(rotation=45)
        plt.tight_layout()
        return plt

    def plot_error_distribution(self, actual_data, predicted_data, title="Prediction error distribution", figsize=(12, 6)):
        """Plot a histogram of prediction errors (actual minus predicted)."""
        if actual_data is None or predicted_data is None:
            print("Data is incomplete")
            return None
        # Flatten both inputs so a Series and an (n, 1) array subtract element by element
        errors = np.ravel(np.asarray(actual_data)) - np.ravel(np.asarray(predicted_data))
        plt.figure(figsize=figsize)
        plt.hist(errors, bins=30, alpha=0.7, color='blue')
        plt.title(title)
        plt.xlabel('Prediction error')
        plt.ylabel('Frequency')
        plt.grid(True)
        plt.tight_layout()
        return plt
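When `prediction_dates` is not supplied, `plot_future_predictions` places the forecast on consecutive calendar days, which puts points on weekends when markets are closed. A caller can pass business days instead. The sketch below builds them with pandas; the helper name `future_business_days` is hypothetical, and the result skips weekends only, not exchange holidays.

```python
import pandas as pd


def future_business_days(last_date, n_steps):
    """Return the n_steps business days (Monday to Friday) that follow last_date."""
    start = pd.Timestamp(last_date) + pd.offsets.BDay(1)
    return pd.bdate_range(start=start, periods=n_steps)


dates = future_business_days("2024-01-05", 3)  # 2024-01-05 is a Friday
print(list(dates.strftime("%Y-%m-%d")))
```

The resulting index can be passed as `prediction_dates`, so the forecast line continues on the next trading session rather than on the following Saturday.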