[Kaggle Competition] Binary Prediction of Poisonous Mushrooms

Latest recommended article published on 2024-10-01 16:06:06

时雨h

Categories: Databases, kaggle. Tags: machine learning, big data, artificial intelligence

Article link: https://blog.csdn.net/shaozheng0503/article/details/141471750

Binary Prediction of Poisonous Mushrooms

The resources and links collected here are reference materials and high-scoring solutions from Kaggle binary classification competitions. The key points are summarized below:

Playground Season 4 Episode 8

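Main competition of interest: Binary Classification with a Bank Churn Dataset.
Dataset: reorganized and republished for reference.
Popular solutions:

- LightGBM and CatBoost models (score 0.8945).
- XGBoost and random forest models.
- Neural network classification models.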

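Other related competitions and resources

Binary Prediction of Smoker Status using Bio-Signals

- EDA and feature engineering.
- XGBoost model.

Binary Classification with a Software Defects Dataset

- EDA and modeling.

Binary Classification of Machine Failures

- EDA, ensemble learning, ML pipeline, SHAP analysis.

Binary Classification with a Tabular Kidney Stone Prediction Dataset

- Comparison of multiple models.

Featured competitions

- American Express - Default Prediction
  - Feature engineering and LightGBM models.
- Home Credit Default Risk
  - Complete EDA and feature importance analysis.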

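Competition metric: Matthews correlation coefficient

Definition: a measure of the quality of a binary classifier's output.
Resources:

- The Wikipedia page on the phi coefficient.
- A Voxco blog post on the Matthews correlation coefficient.
- A paper on the use of the Matthews correlation coefficient in biological data mining.
- The scikit-learn documentation on the Matthews correlation coefficient.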
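
To make the metric concrete, here is a minimal sketch that computes the Matthews correlation coefficient directly from confusion-matrix counts, using the standard formula MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The helper names `confusion_counts` and `mcc` and the toy labels are illustrative, and returning 0.0 when the denominator is zero is a convention assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Toy example: 1 = poisonous, 0 = edible
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(mcc(y_true, y_pred), 4))  # 0.4667
```

A value of 1 means perfect agreement, 0 is no better than chance, and -1 means every prediction is inverted.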

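Step 1: Load the data

import pandas as pd

# Load the training data
train_data = pd.read_csv('train.csv')

# Show the first few rows to understand the data structure
print(train_data.head())

# Show basic information about the data (info() prints directly and returns None)
train_data.info()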

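Step 2: Data exploration and visualization

In this step we explore the data in more depth and use visualization to better understand the feature distributions and how the features relate to the target.

import math
import matplotlib.pyplot as plt
import seaborn as sns

# Count the mushrooms in each class
print(train_data['class'].value_counts())

# Plot the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='class', data=train_data)
plt.title('Distribution of Mushroom Classes')
plt.show()

# Plot each categorical feature against the target.
# The target itself and the numeric columns (including 'id') are excluded, and the
# grid is sized from the number of features instead of a fixed 5x5.
cat_cols = [c for c in train_data.select_dtypes(include=['object']).columns if c != 'class']
n_cols = 5
n_rows = math.ceil(len(cat_cols) / n_cols)
fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 4 * n_rows), squeeze=False)
axs = axs.flatten()
for ax, col in zip(axs, cat_cols):
    sns.countplot(x=col, hue='class', data=train_data, ax=ax)
    ax.set_title(f'Distribution of {col} by Class')
for ax in axs[len(cat_cols):]:
    ax.set_visible(False)  # hide unused panels
plt.tight_layout()
plt.show()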

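Step 3: Data preprocessing

Next we preprocess the data, including feature encoding and any other necessary transformations.

from sklearn.preprocessing import LabelEncoder

# Encode the categorical features.
# One encoder is kept per column: a single shared LabelEncoder would only
# remember the categories of the last column it was fitted on.
encoders = {}

# Loop over all non-numeric columns (this includes the target 'class': 'e' -> 0, 'p' -> 1)
for col in train_data.select_dtypes(include=['object']).columns:
    encoders[col] = LabelEncoder()
    # Missing values are treated as their own category
    train_data[col] = encoders[col].fit_transform(train_data[col].fillna('missing').astype(str))

# Inspect the encoded data
print(train_data.head())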
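
The sketch below shows, in plain Python, why each categorical column needs its own mapping and how categories that only appear at prediction time can be handled. The helper names `fit_encoder` and `encode`, the toy columns and values, and the choice of -1 for unseen categories are illustrative assumptions, not part of scikit-learn.

```python
def fit_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping, unseen=-1):
    """Apply a fitted mapping; categories absent from it get the `unseen` code."""
    return [mapping.get(v, unseen) for v in values]


# Toy data: two categorical columns with different category sets
train = {"cap-shape": ["x", "f", "x", "b"], "habitat": ["d", "g", "d", "l"]}
test = {"cap-shape": ["f", "s"], "habitat": ["g", "d"]}

# Fit one mapping per column on the training data only
encoders = {col: fit_encoder(vals) for col, vals in train.items()}

# Reuse the training mappings for the test data
encoded_test = {col: encode(vals, encoders[col]) for col, vals in test.items()}

print(encoders["cap-shape"])  # {'b': 0, 'f': 1, 'x': 2}
print(encoded_test)           # {'cap-shape': [1, -1], 'habitat': [1, 0]}
```

The category "s" never occurs in the training column, so it receives -1 instead of raising an error, and the two columns keep separate mappings even though both start at 0.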

Step 4: Build the models

In this step we build and train LightGBM and CatBoost models.

import lightgbm as lgb
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

# Separate features and target ('id' is a row identifier, not a feature)
X = train_data.drop(columns=['id', 'class'])
y = train_data['class']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1
}

# Create the LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train the LightGBM model.
# Early stopping is passed as a callback; the early_stopping_rounds keyword
# of lgb.train() was removed in LightGBM 4.0.
lgb_model = lgb.train(lgb_params, lgb_train, num_boost_round=1000, valid_sets=[lgb_val],
                      callbacks=[lgb.early_stopping(stopping_rounds=100)])

# Define the CatBoost parameters
cb_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'learning_rate': 0.05,
    'depth': 6,
    'l2_leaf_reg': 10,
    'bootstrap_type': 'Bayesian',
    'bagging_temperature': 0.2,
    'random_seed': 42,
    'allow_writing_files': False
}

# Create the CatBoost datasets
cb_train = Pool(X_train, y_train)
cb_val = Pool(X_val, y_val)

# Train the CatBoost model
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(cb_train, eval_set=cb_val, use_best_model=True, verbose=False)

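Step 5: Predict on the test data

Finally, we use the trained models to predict on the test data and format the predictions as required for a Kaggle submission.

# Load the test data
test_data = pd.read_csv('test.csv')

# Keep the test IDs for the submission file
test_ids = test_data['id']

# Encode the test data the same way as the training data.
# One encoder per column is refitted on the raw training file, because a single shared
# LabelEncoder only remembers the last column it was fitted on. This assumes missing
# training values were encoded as the category 'missing'; unseen categories become -1.
raw_train = pd.read_csv('train.csv')
for col in test_data.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder().fit(raw_train[col].fillna('missing').astype(str))
    mapping = {label: code for code, label in enumerate(encoder.classes_)}
    test_data[col] = test_data[col].fillna('missing').astype(str).map(mapping).fillna(-1).astype(int)

# Use exactly the feature columns the models were trained on
test_features = test_data[X.columns]

# Predict the probability of the positive class ('p')
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
# CatBoost's predict() returns class labels, so predict_proba() is used for probabilities
predictions_cb = cb_model.predict_proba(test_features)[:, 1]

# Convert the probabilities to binary labels
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)
predictions_cb_binary = (predictions_cb > 0.5).astype(int)

# Build the submission (this uses the LightGBM predictions)
submission_df = pd.DataFrame({'id': test_ids, 'class': predictions_lgb_binary})
submission_df['class'] = submission_df['class'].map({1: 'p', 0: 'e'})

# Save the predictions
submission_df.to_csv('submission.csv', index=False)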
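
Because the competition is scored with the Matthews correlation coefficient rather than accuracy, a 0.5 cut-off is not necessarily the best one. The following is a minimal sketch, assuming validation labels and predicted probabilities are available as plain lists, that scans candidate thresholds and keeps the one with the highest MCC. The function names, the candidate grid, and the toy numbers are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (0.0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return the (threshold, MCC) pair with the highest MCC; ties keep the first."""
    best_t, best_score = None, -2.0
    for t in candidates:
        preds = [1 if p > t else 0 for p in probs]
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation set: true labels and predicted probabilities of class 1
y_val = [0, 0, 0, 1, 1, 1]
val_probs = [0.12, 0.22, 0.31, 0.42, 0.47, 0.91]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, score = best_threshold(y_val, val_probs, candidates)
print(threshold, score)  # 0.35 1.0
```

In this toy case a threshold of 0.5 would misclassify two positives, while 0.35 separates the classes perfectly. The threshold should be chosen on validation data only and then applied unchanged to the test predictions.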

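The complete code example below solves the "Binary Prediction of Poisonous Mushrooms" problem. It uses two models, LightGBM and CatBoost, and follows the steps discussed above.

First, install the required libraries if you have not already done so:

pip install pandas numpy scikit-learn lightgbm catboost matplotlib seaborn plotly

Here is the complete code example: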

import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Preprocessing: encode categorical columns with one LabelEncoder per column.
# Each encoder is fitted on the training data only and then reused for the test data,
# so both sets share the same category-to-integer mapping.
def fit_encoders(data):
    encoders = {}
    for col in data.select_dtypes(include=['object']).columns:
        encoders[col] = LabelEncoder().fit(data[col].fillna('missing').astype(str))
    return encoders

def preprocess_data(data, encoders):
    data = data.copy()
    for col, encoder in encoders.items():
        if col not in data.columns:
            continue  # e.g. the target column is absent from the test data
        mapping = {label: code for code, label in enumerate(encoder.classes_)}
        values = data[col].fillna('missing').astype(str)
        # Categories not seen during training are mapped to -1
        data[col] = values.map(mapping).fillna(-1).astype(int)
    return data

encoders = fit_encoders(train_data)

# Preprocess the training data ('class' becomes 'e' -> 0, 'p' -> 1)
train_data = preprocess_data(train_data, encoders)

# Preprocess the test data with the same encoders
test_data = preprocess_data(test_data, encoders)

# Split the data ('id' is a row identifier, not a feature)
X = train_data.drop(columns=['id', 'class'])
y = train_data['class']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1
}

# Create the LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train the LightGBM model (early stopping is a callback; the
# early_stopping_rounds keyword of lgb.train() was removed in LightGBM 4.0)
lgb_model = lgb.train(lgb_params, lgb_train, num_boost_round=1000, valid_sets=[lgb_val],
                      callbacks=[lgb.early_stopping(stopping_rounds=100)])

# Define the CatBoost parameters
cb_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'learning_rate': 0.05,
    'depth': 6,
    'l2_leaf_reg': 10,
    'bootstrap_type': 'Bayesian',
    'bagging_temperature': 0.2,
    'random_seed': 42,
    'allow_writing_files': False
}

# Train the CatBoost model
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True, verbose=False)

# Prepare the test data with the same feature columns used for training
test_ids = test_data['id']
test_features = test_data[X.columns]

# Predict with LightGBM (probability of class 1)
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)

# Predict with CatBoost (predict() returns labels, so predict_proba() is used)
predictions_cb = cb_model.predict_proba(test_features)[:, 1]
predictions_cb_binary = (predictions_cb > 0.5).astype(int)

# Evaluate both models on the validation set
val_pred_lgb = (lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration) > 0.5).astype(int)
val_pred_cb = (cb_model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
mcc_lgb = matthews_corrcoef(y_val, val_pred_lgb)
mcc_cb = matthews_corrcoef(y_val, val_pred_cb)

print("LightGBM Matthews Correlation Coefficient: ", mcc_lgb)
print("CatBoost Matthews Correlation Coefficient: ", mcc_cb)

# Build the submission (this uses the LightGBM predictions)
submission_df = pd.DataFrame({'id': test_ids, 'class': predictions_lgb_binary})
submission_df['class'] = submission_df['class'].map({1: 'p', 0: 'e'})

# Save the predictions
submission_df.to_csv('submission.csv', index=False)

# Plot LightGBM feature importance
def plot_feature_importance(model, title):
    fig, ax = plt.subplots(figsize=(12, 8))
    lgb.plot_importance(model, max_num_features=20, importance_type='gain', ax=ax)
    ax.set_title(title)
    plt.show()

plot_feature_importance(lgb_model, 'LightGBM Feature Importance')

# Plot CatBoost feature importance.
# plot_feature_importances is not a CatBoostClassifier method, so the values
# returned by get_feature_importance() are plotted with pandas instead.
cb_importance = pd.Series(cb_model.get_feature_importance(), index=X_train.columns)
cb_importance.nlargest(20).sort_values().plot.barh(figsize=(12, 8), title='CatBoost Feature Importance')
plt.show()

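This code does the following:

Imports the required libraries.
Loads the training and test data.
Preprocesses the data, including encoding the categorical features.
Splits the data into training and validation sets.
Defines and trains the LightGBM and CatBoost models.
Predicts on the test data.
Evaluates the models with the Matthews correlation coefficient.
Formats the predictions and saves them as a CSV file for submission.
Plots feature importance.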

Reference

Binary Classification with a Bank Churn Dataset

Playground Series - Season 4, Episode 1

Samvel Kocharyan · 17th in this Competition · Posted 7 months ago

17th Place Solution| AutoML + Unicorn's pollen + Lack of sleep

Context

S4E1 Playground "Binary Classification with a Bank Churn Dataset".

Business context: https://www.kaggle.com/competitions/playground-series-s4e1/overview
Data context: https://www.kaggle.com/competitions/playground-series-s4e1/data

Overview of the approach

Our final submission was a combination of an AutoGluon 3-level stack we called "Frankenstein II" and a set of averages from our previous models and some public notebooks.

Final submission was trained on the reduced set of features we got from OpenFE. Features were eliminated by BorutaSHAP and RFECV. Final model used 103 features.

Detail of the Submissions

We selected 2 submissions:

WeightedEnsemble_L3 0.89372 Public | 0.89637 Private | 0.898947 CV
Winning solution 0.90106 Private | 0.89687 Public. We got it from averaging 0.89673 and 0.89565 in last hours of the competition.

Frankenstein II schema

What worked for us?

Feature generation - 470 and Feature Elimination - 103
Data-Centric Approach (CleanLab)
Relabeling
AutoGluon 1.0.1 (thanks to @innixma)
BorutaSHAP framework and Sklearn - RFECV
Ideas published by @paddykb, @thomasmeiner and respected community
Merging, Stacking, Ensembling, Averaging
Tons of experiments. Mainly for educative purposes
🔥 Kaggle Alchemists Secret Society named after Akka från Kebnekajse
🦄 Unicorn's pollen

What didn't work for us this time?

PCA / ICA
Standalone Boosting models
TabPFN
Surnames features
Original dataset

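The post above outlines a solution for the Kaggle competition "Playground Series - Season 4, Episode 1", in which participants build a binary classification model to predict bank customer churn. Its main parts are as follows:

Business context

Goal: predict whether a customer will leave the bank.
Data: basic customer information and account details.

Data context

Dataset: contains multiple features used to predict whether a customer will leave the bank.
Features: include, but are not limited to, age, gender, geography, account balance, and product holdings.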

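Solution overview

Final submission: a 3-level AutoGluon stack combined with averages of earlier models and some public notebooks.
Feature engineering: the features came from OpenFE and were reduced with BorutaSHAP and RFECV; the post lists 470 generated features and 103 in the final model.
Model training: AutoGluon was used for training, on the feature set selected with BorutaSHAP and RFECV.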
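
Selectors such as Boruta and RFECV each report a boolean mask over the feature columns. One way to combine them, which is an assumption made here and not something the post spells out, is to keep only the features that both selectors accept. The helper name `select_common`, the feature names, and the masks below are toy values for illustration.

```python
def select_common(feature_names, mask_a, mask_b):
    """Return the features whose entry is True in both boolean masks."""
    if not (len(feature_names) == len(mask_a) == len(mask_b)):
        raise ValueError("each mask must have one entry per feature")
    return [name for name, a, b in zip(feature_names, mask_a, mask_b) if a and b]


features = ["age", "gender", "geography", "balance"]
boruta_mask = [True, True, False, True]   # hypothetical result of the first selector
rfecv_mask = [True, False, False, True]   # hypothetical result of the second selector

print(select_common(features, boruta_mask, rfecv_mask))  # ['age', 'balance']
```

Taking the intersection is the stricter choice; taking the union instead would keep any feature that at least one selector accepts.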

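Key techniques

Feature generation and elimination: new features were generated, then pruned with BorutaSHAP and RFECV.
Data cleaning: a data-centric approach using CleanLab.
Label correction: relabeling was applied.
AutoGluon: automated machine learning with AutoGluon 1.0.1.
Ensembling: merging, stacking, ensembling, and averaging were used to combine models.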
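
Averaging is the simplest of these ensembling techniques. Here is a minimal sketch of such a blend, assuming each submission is a mapping from row id to predicted probability. The ids, the probabilities, and the helper name `average_submissions` are made up for illustration and are not taken from the post.

```python
def average_submissions(subs, weights=None):
    """Weighted average of several submissions keyed by row id."""
    if weights is None:
        weights = [1.0 / len(subs)] * len(subs)  # plain mean by default
    if len(weights) != len(subs):
        raise ValueError("one weight per submission is required")
    ids = subs[0].keys()
    if any(s.keys() != ids for s in subs):
        raise ValueError("all submissions must cover the same ids")
    return {i: sum(w * s[i] for w, s in zip(weights, subs)) for i in ids}


sub_a = {1: 0.2, 2: 0.8}  # hypothetical predictions of model A
sub_b = {1: 0.4, 2: 0.6}  # hypothetical predictions of model B

blend = average_submissions([sub_a, sub_b])
print({i: round(p, 3) for i, p in blend.items()})  # {1: 0.3, 2: 0.7}
```

Passing explicit `weights` (summing to 1) lets a stronger model contribute more than a weaker one.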

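Approaches that did not work

PCA/ICA: principal component analysis and independent component analysis did not improve model performance.
Standalone boosting models: boosting models used on their own performed poorly.
TabPFN: a neural network approach for tabular data that brought no notable gain in this competition.
Surname features: using customer surnames as features did not improve performance.
Original dataset: listed in the post as not helping; no further detail is given.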

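Implementation code

Given the complexity of the solution above and the techniques involved, here is a simplified example showing how to use AutoGluon for automated machine learning together with feature selection: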

import pandas as pd
from autogluon.tabular import TabularPredictor
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Data paths
train_path = 'train.csv'
test_path = 'test.csv'

# 'target' is a placeholder for the label column; replace it with the label column of your dataset
LABEL = 'target'

# Load the data
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

# Data preprocessing
# ...
# (the selectors below assume all feature columns are numeric after this step)

# Candidate features: everything except the label and the row identifier
feature_cols = [c for c in train_data.columns if c not in (LABEL, 'id')]
X = train_data[feature_cols]
y = train_data[LABEL]

# Feature selection with Boruta.
# Note: BorutaPy (random-forest importances) is used here as a stand-in; it is not BorutaSHAP.
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)
feat_selector.fit(X.values, y.values)  # BorutaPy expects NumPy arrays

# Feature selection with RFECV
rfecv = RFECV(estimator=RandomForestClassifier(), step=1, cv=StratifiedKFold(5),
              scoring='accuracy', verbose=2)
rfecv.fit(X, y)

# Keep the features accepted by both selectors.
# support_ is a boolean mask over the columns, so the masks are combined element-wise.
selected_mask = feat_selector.support_ & rfecv.support_
selected_features = [col for col, keep in zip(feature_cols, selected_mask) if keep]
train_data_selected = train_data[selected_features + [LABEL]]
test_data_selected = test_data[selected_features]

# Automated machine learning with AutoGluon
predictor = TabularPredictor(label=LABEL, problem_type='binary').fit(
    train_data=train_data_selected, presets='best_quality', time_limit=1200)

# Predict
predictions = predictor.predict(test_data_selected)

# Save the predictions
submission = pd.DataFrame({'id': test_data['id'], LABEL: predictions})
submission.to_csv('submission.csv', index=False)