Machine Learning: Predicting Titanic Survivors
First we need to preprocess the training and test sets: encode the non-numeric columns and fill in the missing values.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
# 1. Handle missing values
# Fill Age with the median
age_imputer = SimpleImputer(strategy='median')
train_df['Age'] = age_imputer.fit_transform(train_df[['Age']]).ravel()
test_df['Age'] = age_imputer.transform(test_df[['Age']]).ravel()
# Fill Embarked with the mode
embarked_imputer = SimpleImputer(strategy='most_frequent')
train_df['Embarked'] = embarked_imputer.fit_transform(train_df[['Embarked']]).ravel()
test_df['Embarked'] = embarked_imputer.transform(test_df[['Embarked']]).ravel()
# Fill Fare with the mean (only the test set has missing fares; fit the imputer
# on the training fares so test-set statistics do not leak into it)
fare_imputer = SimpleImputer(strategy='mean')
fare_imputer.fit(train_df[['Fare']])
test_df['Fare'] = fare_imputer.transform(test_df[['Fare']]).ravel()
# 2. Encode categorical variables
label_encoders = {}
categorical_cols = ['Sex', 'Embarked']
for col in categorical_cols:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col])
    label_encoders[col] = le
# 3. Drop columns that are not needed
drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']
train_df = train_df.drop(drop_cols, axis=1)
test_passenger_ids = test_df['PassengerId']
test_df = test_df.drop(drop_cols, axis=1)
# Save the processed data
train_df.to_csv('processed_train.csv', index=False)
test_df.to_csv('processed_test.csv', index=False)
print("Preprocessing complete!")
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
Preprocessing complete!
Training set shape: (891, 8)
Test set shape: (418, 7)
# ... existing code ...
# Print the first 5 rows before saving the processed data
print("\nFirst 5 rows of the processed training set:")
print(train_df.head())
print("\nFirst 5 rows of the processed test set:")
print(test_df.head())
# Save the processed data
train_df.to_csv('processed_train.csv', index=False)
test_df.to_csv('processed_test.csv', index=False)
print("\nPreprocessing complete!")
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
First 5 rows of the processed training set:
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    1  22.0      1      0   7.2500         2
1         1       1    0  38.0      1      0  71.2833         0
2         1       3    0  26.0      0      0   7.9250         2
3         1       1    0  35.0      1      0  53.1000         2
4         0       3    1  35.0      0      0   8.0500         2
First 5 rows of the processed test set:
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0       3    1  34.5      0      0   7.8292         1
1       3    0  47.0      1      0   7.0000         2
2       2    1  62.0      0      0   9.6875         1
3       3    1  27.0      0      0   8.6625         2
4       3    0  22.0      1      1  12.2875         2
Preprocessing complete!
Training set shape: (891, 8)
Test set shape: (418, 7)
Here we pick three models to train. Specifically, we:
- use three typical models: random forest, support vector machine (SVM), and logistic regression;
- tune each model's hyperparameters with both grid search and Bayesian optimization;
- evaluate each model on a held-out validation set;
- report each model's best parameters and evaluation metrics (accuracy, precision, recall, and F1 score).
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')
# Prepare the data
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
X_test = test_df
# Split off a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the three models
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}
# Grid-search parameter spaces
param_grids = {
    'Random Forest': {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 5, 10]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'Logistic Regression': {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
}
# Bayesian-optimization search spaces
bayes_spaces = {
    'Random Forest': {
        'n_estimators': Integer(100, 300),
        'max_depth': Integer(3, 20),
        'min_samples_split': Integer(2, 10)
    },
    'SVM': {
        'C': Real(0.1, 10, prior='log-uniform'),
        'kernel': Categorical(['linear', 'rbf']),
        'gamma': Categorical(['scale', 'auto'])
    },
    'Logistic Regression': {
        'C': Real(0.1, 10, prior='log-uniform'),
        'penalty': Categorical(['l1', 'l2']),
        'solver': Categorical(['liblinear'])
    }
}
# Train and evaluate the models
results = []
for name, model in models.items():
    print(f"\nTraining and tuning the {name} model...")
    # Grid search
    grid_search = GridSearchCV(model, param_grids[name], cv=5, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    grid_val_pred = grid_search.predict(X_val)
    # Bayesian optimization
    bayes_search = BayesSearchCV(model, bayes_spaces[name], cv=5, n_jobs=-1, n_iter=20)
    bayes_search.fit(X_train, y_train)
    bayes_val_pred = bayes_search.predict(X_val)
    # Evaluation metrics
    grid_metrics = {
        'accuracy': accuracy_score(y_val, grid_val_pred),
        'precision': precision_score(y_val, grid_val_pred),
        'recall': recall_score(y_val, grid_val_pred),
        'f1': f1_score(y_val, grid_val_pred)
    }
    bayes_metrics = {
        'accuracy': accuracy_score(y_val, bayes_val_pred),
        'precision': precision_score(y_val, bayes_val_pred),
        'recall': recall_score(y_val, bayes_val_pred),
        'f1': f1_score(y_val, bayes_val_pred)
    }
    results.append({
        'model': name,
        'grid_search': {
            'best_params': grid_search.best_params_,
            'metrics': grid_metrics
        },
        'bayes_optimization': {
            'best_params': bayes_search.best_params_,
            'metrics': bayes_metrics
        }
    })
# Print the results
print("\nModel evaluation results:")
for result in results:
    print(f"\n{result['model']} model:")
    print("Grid search best params:", result['grid_search']['best_params'])
    print("Grid search metrics:", result['grid_search']['metrics'])
    print("Bayesian optimization best params:", result['bayes_optimization']['best_params'])
    print("Bayesian optimization metrics:", result['bayes_optimization']['metrics'])
Training and tuning the Random Forest model...
Training and tuning the SVM model...
Training and tuning the Logistic Regression model...
Model evaluation results:
Random Forest model:
Grid search best params: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Grid search metrics: {'accuracy': 0.8156424581005587, 'precision': 0.8360655737704918, 'recall': 0.6891891891891891, 'f1': 0.7555555555555555}
Bayesian optimization best params: OrderedDict([('max_depth', 6), ('min_samples_split', 10), ('n_estimators', 300)])
Bayesian optimization metrics: {'accuracy': 0.8156424581005587, 'precision': 0.8360655737704918, 'recall': 0.6891891891891891, 'f1': 0.7555555555555555}
SVM model:
Grid search best params: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Grid search metrics: {'accuracy': 0.7821229050279329, 'precision': 0.7536231884057971, 'recall': 0.7027027027027027, 'f1': 0.7272727272727273}
Bayesian optimization best params: OrderedDict([('C', 9.624644666938327), ('gamma', 'scale'), ('kernel', 'linear')])
Bayesian optimization metrics: {'accuracy': 0.7821229050279329, 'precision': 0.7536231884057971, 'recall': 0.7027027027027027, 'f1': 0.7272727272727273}
Logistic Regression model:
Grid search best params: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Grid search metrics: {'accuracy': 0.7932960893854749, 'precision': 0.7681159420289855, 'recall': 0.7162162162162162, 'f1': 0.7412587412587412}
Bayesian optimization best params: OrderedDict([('C', 3.3547501361695042), ('penalty', 'l2'), ('solver', 'liblinear')])
Bayesian optimization metrics: {'accuracy': 0.7932960893854749, 'precision': 0.7681159420289855, 'recall': 0.7162162162162162, 'f1': 0.7412587412587412}
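To compare the three models at a glance, the validation accuracies printed above can be collected into one small table (the numbers below are copied from that run; both tuning methods happened to land on models with identical validation scores):

```python
import pandas as pd

# Validation accuracies copied from the run above
summary = pd.DataFrame({
    'model': ['Random Forest', 'SVM', 'Logistic Regression'],
    'grid_accuracy': [0.8156, 0.7821, 0.7933],
    'bayes_accuracy': [0.8156, 0.7821, 0.7933],
}).set_index('model')

# Sort best-first: the random forest leads on this validation split
print(summary.sort_values('bayes_accuracy', ascending=False))
```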
We can also engineer new features, for example by clustering the samples with KMeans and adding the cluster label as a new feature before training (here we only train the random forest model):
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')
# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df.copy()
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 1. Use the elbow method to pick the number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_train_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
# 2. Cluster with k=3, chosen from the elbow plot
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X_train_scaled)
# 3. Generate the new feature
train_df['Cluster'] = kmeans.predict(X_train_scaled)
test_df['Cluster'] = kmeans.predict(X_test_scaled)
# 4. Inspect the cluster centers, mapped back to the original feature scale
cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                               columns=X_train.columns)
print("Cluster center profiles:")
print(cluster_centers)
# 5. Save the datasets with the new feature
train_df.to_csv('train_with_clusters.csv', index=False)
test_df.to_csv('test_with_clusters.csv', index=False)
print("Feature engineering complete! New cluster feature generated.")
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical
# Load the data with the cluster feature
train_df = pd.read_csv('train_with_clusters.csv')
test_df = pd.read_csv('test_with_clusters.csv')
# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df
# Define the random forest model
rf = RandomForestClassifier(random_state=42)
# Bayesian-optimization search space
param_space = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 20),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 5),
    'max_features': Categorical(['sqrt', 'log2']),
    'bootstrap': Categorical([True, False])
}
# Create the Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=rf,
    search_spaces=param_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)
# Run the Bayesian optimization
print("Running Bayesian optimization...")
bayes_search.fit(X_train, y_train)
# Report the best parameters
print("\nBest parameters:")
print(bayes_search.best_params_)
# Predict with the best model
best_model = bayes_search.best_estimator_
y_pred = best_model.predict(X_train)
# Evaluate the model (note: on the training set itself)
print("\nTraining-set evaluation:")
print(f"Accuracy: {accuracy_score(y_train, y_pred):.4f}")
print(f"Precision: {precision_score(y_train, y_pred):.4f}")
print(f"Recall: {recall_score(y_train, y_pred):.4f}")
print(f"F1 score: {f1_score(y_train, y_pred):.4f}")
# Save the predictions
test_df['Survived_Pred'] = best_model.predict(X_test)
test_df.to_csv('test_predictions.csv', index=False)
print("\nPredictions saved to test_predictions.csv")
Running Bayesian optimization...
Best parameters:
OrderedDict([('bootstrap', False), ('max_depth', 14), ('max_features', 'log2'), ('min_samples_leaf', 2), ('min_samples_split', 10), ('n_estimators', 305)])
Training-set evaluation:
Accuracy: 0.9125
Precision: 0.9314
Recall: 0.8333
F1 score: 0.8796
Predictions saved to test_predictions.csv
These numbers look much higher, but note that they were computed on the training set itself rather than on a held-out split, so they overstate how well the model generalizes. The cluster feature still looks promising, but to confirm the improvement the model should also be scored on a validation set, as in the previous section.
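The gap between training-set and validation-set scores is easy to demonstrate. The sketch below uses synthetic data from make_classification as a stand-in for train_with_clusters.csv (an assumption, since the real file is not loaded here), just to show how much a default random forest's training accuracy overstates its validation accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Titanic features; in practice load
# train_with_clusters.csv and split it exactly as in the earlier section
X, y = make_classification(n_samples=891, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_val, model.predict(X_val))

# The gap between the two numbers is the optimism of training-set metrics
print(f"train accuracy = {train_acc:.3f}, validation accuracy = {val_acc:.3f}")
```

An unconstrained random forest typically memorizes its training data almost perfectly, so only the validation number is informative.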
We can also apply SVD dimensionality reduction to discard the less important directions in the feature space before training:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')
# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df.copy()
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 1. Reduce dimensionality with SVD
# Choose the number of components to keep (3 here)
svd = TruncatedSVD(n_components=3, random_state=42)
X_train_svd = svd.fit_transform(X_train_scaled)
X_test_svd = svd.transform(X_test_scaled)
# 2. Inspect the effect of the reduction
print("Original number of features:", X_train_scaled.shape[1])
print("Number of features after reduction:", X_train_svd.shape[1])
print("Fraction of variance retained:", np.sum(svd.explained_variance_ratio_))
# 3. Put the reduced data into DataFrames
train_svd_df = pd.DataFrame(X_train_svd, columns=[f'SVD_{i}' for i in range(X_train_svd.shape[1])])
train_svd_df['Survived'] = y_train.values
test_svd_df = pd.DataFrame(X_test_svd, columns=[f'SVD_{i}' for i in range(X_test_svd.shape[1])])
# 4. Save the reduced datasets
train_svd_df.to_csv('train_svd.csv', index=False)
test_svd_df.to_csv('test_svd.csv', index=False)
print("SVD reduction complete! New feature datasets generated.")
Original number of features: 7
Number of features after reduction: 3
Fraction of variance retained: 0.6524096636110657
SVD reduction complete! New feature datasets generated.
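Rather than fixing n_components=3 up front, one can fit a single larger SVD and then pick the smallest number of components that reaches a target variance fraction. A sketch on synthetic standardized data (the real Titanic curve will differ, but the procedure carries over):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Synthetic 7-feature matrix standing in for the scaled Titanic features
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(891, 7)))

# Fit the largest SVD TruncatedSVD allows (n_features - 1 components),
# then read off the cumulative explained-variance curve
svd = TruncatedSVD(n_components=6, random_state=42).fit(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest k whose cumulative variance reaches a chosen threshold (60% here,
# close to the 65% that the 3-component model above retained)
k = int(np.searchsorted(cumulative, 0.60)) + 1
print("cumulative variance:", np.round(cumulative, 3))
print("components needed for 60% variance:", k)
```

On the real data this replaces a hard-coded component count with an explicit variance budget.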
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical
# Load the SVD-reduced data
train_df = pd.read_csv('train_svd.csv')
test_df = pd.read_csv('test_svd.csv')
# Prepare the data
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
X_test = test_df
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the random forest model
rf = RandomForestClassifier(random_state=42)
# Bayesian-optimization search space
param_space = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 20),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 5),
    'max_features': Categorical(['sqrt', 'log2']),
    'bootstrap': Categorical([True, False])
}
# Create the Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=rf,
    search_spaces=param_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)
# Run the Bayesian optimization
print("Running Bayesian optimization...")
bayes_search.fit(X_train, y_train)
# Report the best parameters
print("\nBest parameters:")
print(bayes_search.best_params_)
# Predict with the best model
best_model = bayes_search.best_estimator_
y_pred_train = best_model.predict(X_train)
y_pred_val = best_model.predict(X_val)
# Evaluate the model
print("\nTraining-set evaluation:")
print(f"Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Precision: {precision_score(y_train, y_pred_train):.4f}")
print(f"Recall: {recall_score(y_train, y_pred_train):.4f}")
print(f"F1 score: {f1_score(y_train, y_pred_train):.4f}")
print("\nValidation-set evaluation:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"Precision: {precision_score(y_val, y_pred_val):.4f}")
print(f"Recall: {recall_score(y_val, y_pred_val):.4f}")
print(f"F1 score: {f1_score(y_val, y_pred_val):.4f}")
# Print the confusion matrix
print("\nValidation-set confusion matrix:")
print(confusion_matrix(y_val, y_pred_val))
# Predict on the test set
test_df['Survived_Pred'] = best_model.predict(X_test)
test_df.to_csv('test_predictions.csv', index=False)
print("\nTest-set predictions saved to test_predictions.csv")
Running Bayesian optimization...
Best parameters:
OrderedDict([('bootstrap', True), ('max_depth', 8), ('max_features', 'log2'), ('min_samples_leaf', 5), ('min_samples_split', 10), ('n_estimators', 100)])
Training-set evaluation:
Accuracy: 0.8553
Precision: 0.8287
Recall: 0.7761
F1 score: 0.8015
Validation-set evaluation:
Accuracy: 0.7877
Precision: 0.7432
Recall: 0.7432
F1 score: 0.7432
Validation-set confusion matrix:
[[86 19]
 [19 55]]
Test-set predictions saved to test_predictions.csv
This validation accuracy (about 0.79) is roughly on par with the earlier models trained on all seven features (0.78 to 0.82), so compressing the feature space to three components loses little predictive power and the dimensionality reduction is also reasonably successful. It cannot be compared directly with the clustered-feature run, since that one was only scored on the training set.
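One loose end: test_predictions.csv does not contain PassengerId, which Kaggle's submission format requires. The PassengerId column saved as test_passenger_ids during preprocessing can be joined back onto the predictions; a sketch with hypothetical ids and predictions (in practice, reuse the real test_passenger_ids and best_model.predict(X_test)):

```python
import pandas as pd

# Hypothetical values; substitute the test_passenger_ids Series saved during
# preprocessing and the predictions from the chosen best_model
test_passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
preds = [0, 1, 0]

# Kaggle expects exactly two columns: PassengerId and Survived
submission = pd.DataFrame({'PassengerId': test_passenger_ids, 'Survived': preds})
submission.to_csv('submission.csv', index=False)
print(submission)
```

The resulting submission.csv can be uploaded directly to the competition page.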