Day 21 Check-in

@浙大疏锦行

Machine Learning: Titanic Survivor Prediction

First we preprocess the training and test sets: encode the non-numeric columns and fill in the missing values.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# 1. Handle missing values
# Fill Age with the median (fit on train, apply to both sets)
age_imputer = SimpleImputer(strategy='median')
train_df['Age'] = age_imputer.fit_transform(train_df[['Age']]).ravel()
test_df['Age'] = age_imputer.transform(test_df[['Age']]).ravel()

# Fill Embarked with the mode
embarked_imputer = SimpleImputer(strategy='most_frequent')
train_df['Embarked'] = embarked_imputer.fit_transform(train_df[['Embarked']]).ravel()
test_df['Embarked'] = embarked_imputer.transform(test_df[['Embarked']]).ravel()

# Fill Fare with the mean; Fare is only missing in the test set, but we fit on
# the training set to stay consistent with the other imputers
fare_imputer = SimpleImputer(strategy='mean')
fare_imputer.fit(train_df[['Fare']])
test_df['Fare'] = fare_imputer.transform(test_df[['Fare']]).ravel()

# 2. Encode the categorical variables
label_encoders = {}
categorical_cols = ['Sex', 'Embarked']

for col in categorical_cols:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col])
    # transform() raises on categories unseen during fit; safe here since
    # Sex and Embarked take the same values in both sets
    test_df[col] = le.transform(test_df[col])
    label_encoders[col] = le

# 3. Drop columns we will not use
drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']
train_df = train_df.drop(drop_cols, axis=1)
test_passenger_ids = test_df['PassengerId']
test_df = test_df.drop(drop_cols, axis=1)

# Preview the first 5 rows before saving
print("\nFirst 5 rows of the processed training set:")
print(train_df.head())
print("\nFirst 5 rows of the processed test set:")
print(test_df.head())

# Save the processed data
train_df.to_csv('processed_train.csv', index=False)
test_df.to_csv('processed_test.csv', index=False)

print("\nPreprocessing complete!")
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
First 5 rows of the processed training set:
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    1  22.0      1      0   7.2500         2
1         1       1    0  38.0      1      0  71.2833         0
2         1       3    0  26.0      0      0   7.9250         2
3         1       1    0  35.0      1      0  53.1000         2
4         0       3    1  35.0      0      0   8.0500         2

First 5 rows of the processed test set:
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0       3    1  34.5      0      0   7.8292         1
1       3    0  47.0      1      0   7.0000         2
2       2    1  62.0      0      0   9.6875         1
3       3    1  27.0      0      0   8.6625         2
4       3    0  22.0      1      1  12.2875         2

Preprocessing complete!
Training set shape: (891, 8)
Test set shape: (418, 7)
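The label_encoders dict built above is worth keeping around to interpret the encoded columns. A minimal sketch (assuming the variables from the preprocessing script are still in scope) that recovers the original categories; since LabelEncoder sorts classes alphabetically, the expected mappings are Sex {0: 'female', 1: 'male'} and Embarked {0: 'C', 1: 'Q', 2: 'S'}, which matches the preview above:

# Map encoded values back to the original labels
for col, le in label_encoders.items():
    mapping = dict(zip(le.transform(le.classes_), le.classes_))
    print(col, mapping)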

Next we train three models. Specifically, we:

- use three typical models: random forest, support vector machine, and logistic regression
- tune each model's hyperparameters with both grid search and Bayesian optimization
- evaluate each model on a held-out validation set
- report each model's best parameters and evaluation metrics (accuracy, precision, recall, and F1 score)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')

# Prepare the data
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
X_test = test_df

# Hold out a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the three models
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    # Note: SVMs are scale-sensitive; the features here are unscaled
    'SVM': SVC(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}

# Grid-search parameter spaces
param_grids = {
    'Random Forest': {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 5, 10]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto']
    },
    'Logistic Regression': {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
}

# Bayesian-optimization search spaces
bayes_spaces = {
    'Random Forest': {
        'n_estimators': Integer(100, 300),
        'max_depth': Integer(3, 20),
        'min_samples_split': Integer(2, 10)
    },
    'SVM': {
        'C': Real(0.1, 10, prior='log-uniform'),
        'kernel': Categorical(['linear', 'rbf']),
        'gamma': Categorical(['scale', 'auto'])
    },
    'Logistic Regression': {
        'C': Real(0.1, 10, prior='log-uniform'),
        'penalty': Categorical(['l1', 'l2']),
        'solver': Categorical(['liblinear'])
    }
}

# Train and evaluate the models
results = []
for name, model in models.items():
    print(f"\nTraining and tuning the {name} model...")
    
    # Grid search
    grid_search = GridSearchCV(model, param_grids[name], cv=5, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    grid_val_pred = grid_search.predict(X_val)
    
    # Bayesian optimization
    bayes_search = BayesSearchCV(model, bayes_spaces[name], cv=5, n_jobs=-1, n_iter=20)
    bayes_search.fit(X_train, y_train)
    bayes_val_pred = bayes_search.predict(X_val)
    
    # Validation metrics
    grid_metrics = {
        'accuracy': accuracy_score(y_val, grid_val_pred),
        'precision': precision_score(y_val, grid_val_pred),
        'recall': recall_score(y_val, grid_val_pred),
        'f1': f1_score(y_val, grid_val_pred)
    }
    
    bayes_metrics = {
        'accuracy': accuracy_score(y_val, bayes_val_pred),
        'precision': precision_score(y_val, bayes_val_pred),
        'recall': recall_score(y_val, bayes_val_pred),
        'f1': f1_score(y_val, bayes_val_pred)
    }
    
    results.append({
        'model': name,
        'grid_search': {
            'best_params': grid_search.best_params_,
            'metrics': grid_metrics
        },
        'bayes_optimization': {
            'best_params': bayes_search.best_params_,
            'metrics': bayes_metrics
        }
    })

# Print the results
print("\nModel evaluation results:")
for result in results:
    print(f"\n{result['model']} model:")
    print("Grid search best parameters:", result['grid_search']['best_params'])
    print("Grid search metrics:", result['grid_search']['metrics'])
    print("Bayesian optimization best parameters:", result['bayes_optimization']['best_params'])
    print("Bayesian optimization metrics:", result['bayes_optimization']['metrics'])
Training and tuning the Random Forest model...

Training and tuning the SVM model...

Training and tuning the Logistic Regression model...

Model evaluation results:

Random Forest model:
Grid search best parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Grid search metrics: {'accuracy': 0.8156424581005587, 'precision': 0.8360655737704918, 'recall': 0.6891891891891891, 'f1': 0.7555555555555555}
Bayesian optimization best parameters: OrderedDict([('max_depth', 6), ('min_samples_split', 10), ('n_estimators', 300)])
Bayesian optimization metrics: {'accuracy': 0.8156424581005587, 'precision': 0.8360655737704918, 'recall': 0.6891891891891891, 'f1': 0.7555555555555555}

SVM model:
Grid search best parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Grid search metrics: {'accuracy': 0.7821229050279329, 'precision': 0.7536231884057971, 'recall': 0.7027027027027027, 'f1': 0.7272727272727273}
Bayesian optimization best parameters: OrderedDict([('C', 9.624644666938327), ('gamma', 'scale'), ('kernel', 'linear')])
Bayesian optimization metrics: {'accuracy': 0.7821229050279329, 'precision': 0.7536231884057971, 'recall': 0.7027027027027027, 'f1': 0.7272727272727273}

Logistic Regression model:
Grid search best parameters: {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Grid search metrics: {'accuracy': 0.7932960893854749, 'precision': 0.7681159420289855, 'recall': 0.7162162162162162, 'f1': 0.7412587412587412}
Bayesian optimization best parameters: OrderedDict([('C', 3.3547501361695042), ('penalty', 'l2'), ('solver', 'liblinear')])
Bayesian optimization metrics: {'accuracy': 0.7932960893854749, 'precision': 0.7681159420289855, 'recall': 0.7162162162162162, 'f1': 0.7412587412587412}
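On these validation scores the grid-searched random forest is the strongest of the three. As a minimal sketch (not part of the original run) of how one might turn it into a Kaggle-style submission, refitting the best configuration on the full training set and recovering PassengerId from the raw test.csv (the test_passenger_ids variable saved during preprocessing would work equally well):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')
X, y = train_df.drop('Survived', axis=1), train_df['Survived']

# Refit the best grid-search configuration on the full training set
best_rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                 min_samples_split=2, random_state=42)
best_rf.fit(X, y)

# PassengerId was dropped during preprocessing, so read it back from the raw file
submission = pd.DataFrame({
    'PassengerId': pd.read_csv('test.csv')['PassengerId'],
    'Survived': best_rf.predict(test_df)
})
submission.to_csv('submission.csv', index=False)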

We can also do some feature engineering, for example clustering with KMeans to generate a new feature and then retraining on it (here we only train the random forest):

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')

# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df.copy()

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Use the elbow method to pick the number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_train_scaled)
    wcss.append(kmeans.inertia_)
    
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
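# (Optional cross-check, not in the original run: silhouette scores give a
# second opinion on the choice of k; higher is better. A minimal sketch.)
from sklearn.metrics import silhouette_score
for k in range(2, 11):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X_train_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_train_scaled, labels):.3f}")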

# 2. Cluster with k=3, chosen from the elbow plot
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X_train_scaled)

# 3. Generate the new feature
train_df['Cluster'] = kmeans.predict(X_train_scaled)
test_df['Cluster'] = kmeans.predict(X_test_scaled)

# 4. Inspect the cluster centers in the original feature scale
cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                              columns=X_train.columns)
print("Cluster center profiles:")
print(cluster_centers)

# 5. Save the datasets with the new feature
train_df.to_csv('train_with_clusters.csv', index=False)
test_df.to_csv('test_with_clusters.csv', index=False)

print("Feature engineering complete! The new cluster feature has been generated.")
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
Next we train a random forest on the data with the new Cluster feature, again tuning it with Bayesian optimization:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical

# Load the data with the cluster feature
train_df = pd.read_csv('train_with_clusters.csv')
test_df = pd.read_csv('test_with_clusters.csv')

# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df

# Define the random forest model
rf = RandomForestClassifier(random_state=42)

# Define the Bayesian-optimization search space
param_space = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 20),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 5),
    'max_features': Categorical(['sqrt', 'log2']),
    'bootstrap': Categorical([True, False])
}

# Create the Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=rf,
    search_spaces=param_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)

# Run the Bayesian optimization
print("Running Bayesian optimization...")
bayes_search.fit(X_train, y_train)

# Print the best parameters
print("\nBest parameters:")
print(bayes_search.best_params_)

# Predict with the best model
best_model = bayes_search.best_estimator_
y_pred = best_model.predict(X_train)

# Evaluate the model. Note: these are training-set metrics, so they will be
# optimistic and are not directly comparable to the validation scores above.
print("\nTraining set evaluation:")
print(f"Accuracy: {accuracy_score(y_train, y_pred):.4f}")
print(f"Precision: {precision_score(y_train, y_pred):.4f}")
print(f"Recall: {recall_score(y_train, y_pred):.4f}")
print(f"F1 score: {f1_score(y_train, y_pred):.4f}")

# Save the model's test-set predictions
test_df['Survived_Pred'] = best_model.predict(X_test)
test_df.to_csv('test_predictions.csv', index=False)

print("\nPredictions saved to test_predictions.csv")
Running Bayesian optimization...

Best parameters:
OrderedDict([('bootstrap', False), ('max_depth', 14), ('max_features', 'log2'), ('min_samples_leaf', 2), ('min_samples_split', 10), ('n_estimators', 305)])

Training set evaluation:
Accuracy: 0.9125
Precision: 0.9314
Recall: 0.8333
F1 score: 0.8796

Predictions saved to test_predictions.csv

The scores look much higher here, but keep in mind they were computed on the training set, so they are inflated by overfitting and not directly comparable to the validation scores above; a fair judgment of the cluster feature needs held-out or cross-validated metrics.
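A minimal sketch (not in the original) of a fairer check: cross-validate the tuned configuration on the cluster-augmented training data, which gives scores comparable to the earlier validation results:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train_df = pd.read_csv('train_with_clusters.csv')
X, y = train_df.drop('Survived', axis=1), train_df['Survived']

# The best configuration reported by the Bayesian search above
rf = RandomForestClassifier(bootstrap=False, max_depth=14, max_features='log2',
                            min_samples_leaf=2, min_samples_split=10,
                            n_estimators=305, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"5-fold CV accuracy: {scores.mean():.4f} ± {scores.std():.4f}")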

We can also reduce dimensionality with SVD, discarding the less informative directions before training:

import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Load the preprocessed data
train_df = pd.read_csv('processed_train.csv')
test_df = pd.read_csv('processed_test.csv')

# Prepare the data
X_train = train_df.drop('Survived', axis=1)
y_train = train_df['Survived']
X_test = test_df.copy()

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Reduce dimensionality with SVD
# Choose the number of components to keep (here, 3)
svd = TruncatedSVD(n_components=3, random_state=42)
X_train_svd = svd.fit_transform(X_train_scaled)
X_test_svd = svd.transform(X_test_scaled)

# 2. Inspect the effect of the reduction
print("Original number of features:", X_train_scaled.shape[1])
print("Number of features after reduction:", X_train_svd.shape[1])
print("Explained variance ratio retained:", np.sum(svd.explained_variance_ratio_))

# 3. Wrap the reduced data in DataFrames
train_svd_df = pd.DataFrame(X_train_svd, columns=[f'SVD_{i}' for i in range(X_train_svd.shape[1])])
train_svd_df['Survived'] = y_train.values

test_svd_df = pd.DataFrame(X_test_svd, columns=[f'SVD_{i}' for i in range(X_test_svd.shape[1])])

# 4. Save the reduced datasets
train_svd_df.to_csv('train_svd.csv', index=False)
test_svd_df.to_csv('test_svd.csv', index=False)

print("SVD reduction complete! The new feature datasets have been generated.")
Original number of features: 7
Number of features after reduction: 3
Explained variance ratio retained: 0.6524096636110657
SVD reduction complete! The new feature datasets have been generated.
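Before fixing n_components=3, one might inspect the cumulative explained variance to decide how many components to keep. A minimal sketch (not in the original run), reusing X_train_scaled from the script above:

from sklearn.decomposition import TruncatedSVD
import numpy as np

# TruncatedSVD requires n_components < n_features, so at most 6 of the 7 here
svd_full = TruncatedSVD(n_components=6, random_state=42)
svd_full.fit(X_train_scaled)
cumulative = np.cumsum(svd_full.explained_variance_ratio_)
for i, c in enumerate(cumulative, start=1):
    print(f"{i} components: cumulative explained variance {c:.3f}")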
Then we train a random forest on the SVD features, again with Bayesian optimization:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical

# Load the SVD-reduced data
train_df = pd.read_csv('train_svd.csv')
test_df = pd.read_csv('test_svd.csv')

# Prepare the data
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
X_test = test_df

# Split off a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the random forest model
rf = RandomForestClassifier(random_state=42)

# Define the Bayesian-optimization search space
param_space = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(3, 20),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 5),
    'max_features': Categorical(['sqrt', 'log2']),
    'bootstrap': Categorical([True, False])
}

# Create the Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=rf,
    search_spaces=param_space,
    n_iter=50,
    cv=5,
    n_jobs=-1,
    random_state=42
)

# Run the Bayesian optimization
print("Running Bayesian optimization...")
bayes_search.fit(X_train, y_train)

# Print the best parameters
print("\nBest parameters:")
print(bayes_search.best_params_)

# Predict with the best model
best_model = bayes_search.best_estimator_
y_pred_train = best_model.predict(X_train)
y_pred_val = best_model.predict(X_val)

# Evaluate the model
print("\nTraining set evaluation:")
print(f"Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Precision: {precision_score(y_train, y_pred_train):.4f}")
print(f"Recall: {recall_score(y_train, y_pred_train):.4f}")
print(f"F1 score: {f1_score(y_train, y_pred_train):.4f}")

print("\nValidation set evaluation:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
print(f"Precision: {precision_score(y_val, y_pred_val):.4f}")
print(f"Recall: {recall_score(y_val, y_pred_val):.4f}")
print(f"F1 score: {f1_score(y_val, y_pred_val):.4f}")

# Print the confusion matrix
print("\nValidation set confusion matrix:")
print(confusion_matrix(y_val, y_pred_val))

# Predict on the test set
test_df['Survived_Pred'] = best_model.predict(X_test)
test_df.to_csv('test_predictions.csv', index=False)

print("\nTest set predictions saved to test_predictions.csv")
Running Bayesian optimization...

Best parameters:
OrderedDict([('bootstrap', True), ('max_depth', 8), ('max_features', 'log2'), ('min_samples_leaf', 5), ('min_samples_split', 10), ('n_estimators', 100)])

Training set evaluation:
Accuracy: 0.8553
Precision: 0.8287
Recall: 0.7761
F1 score: 0.8015

Validation set evaluation:
Accuracy: 0.7877
Precision: 0.7432
Recall: 0.7432
F1 score: 0.7432

Validation set confusion matrix:
[[86 19]
 [19 55]]

Test set predictions saved to test_predictions.csv
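As a quick sanity check, these validation metrics follow directly from the confusion matrix (TN=86, FP=19, FN=19, TP=55): accuracy = (86 + 55) / 179 ≈ 0.7877, and since FP = FN, precision = recall = 55 / (55 + 19) ≈ 0.7432, which forces F1 to the same value.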

This result is slightly worse than the clustered-feature run (though that comparison was against training-set scores), and about on par with the original models that used all features, so the dimensionality reduction also holds up reasonably well despite keeping only ~65% of the variance.
