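Modeling Study on a Hair Loss Dataset

Project Overview

As people age, hair loss becomes a health concern for many. Hair fullness affects appearance and is also closely related to an individual's overall health.
This dataset brings together factors that may contribute to hair loss, including genetics, hormonal changes, medical conditions, medications and treatments, nutritional deficiencies, and psychological stress.
This work tries three encoding schemes and four modeling methods, and takes care to avoid data leakage so that the modeling process is more rigorous and complete.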

Importing the Required Packages

In [10]:

import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder, BinaryEncoder, BackwardDifferenceEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

1. Data Preview

In [18]:

df = pd.read_csv("/home/mw/input/hair2961/Predict Hair Fall.csv")
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Id                         999 non-null    int64 
 1   Genetics                   999 non-null    object
 2   Hormonal Changes           999 non-null    object
 3   Medical Conditions         999 non-null    object
 4   Medications & Treatments   999 non-null    object
 5   Nutritional Deficiencies   999 non-null    object
 6   Stress                     999 non-null    object
 7   Age                        999 non-null    int64 
 8   Poor Hair Care Habits      999 non-null    object
 9   Environmental Factors      999 non-null    object
 10  Smoking                    999 non-null    object
 11  Weight Loss                999 non-null    object
 12  Hair Loss                  999 non-null    int64 
dtypes: int64(3), object(10)
memory usage: 101.6+ KB

Out[18]:

   Id      Genetics  Hormonal Changes  Medical Conditions  Medications & Treatments  Nutritional Deficiencies  Stress    Age  Poor Hair Care Habits  Environmental Factors  Smoking  Weight Loss  Hair Loss
0  133992  Yes       No                No Data             No Data                   Magnesium deficiency      Moderate  19   Yes                    Yes                    No       No           0
1  148393  No        No                Eczema              Antibiotics               Magnesium deficiency      High      43   Yes                    Yes                    No       No           0
2  155074  No        No                Dermatosis          Antifungal Cream          Protein deficiency        Moderate  26   Yes                    Yes                    No       Yes          0
3  118261  Yes       Yes               Ringworm            Antibiotics               Biotin Deficiency         Moderate  46   Yes                    Yes                    No       No           0
4  111915  No        No                Psoriasis           Accutane                  Iron deficiency           Moderate  30   No                     Yes                    Yes      No           1

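The dataset contains 999 records, and info() reports no null values, although the preview shows that some columns use the string "No Data" as a placeholder. Most features are categorical (dtype object), so they must be encoded as numeric values before modeling.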
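As a quick check of which columns need encoding, the sketch below lists the non-numeric columns and counts their distinct values. The small frame is an assumption for illustration: it copies only a few columns from the five preview rows, so the counts say nothing about the full dataset. On the real data the same two calls would be applied to df.

```python
import pandas as pd

# A few columns of the five preview rows shown above (illustration only).
preview = pd.DataFrame({
    "Genetics": ["Yes", "No", "No", "Yes", "No"],
    "Medical Conditions": ["No Data", "Eczema", "Dermatosis", "Ringworm", "Psoriasis"],
    "Stress": ["Moderate", "High", "Moderate", "Moderate", "Moderate"],
    "Age": [19, 43, 26, 46, 30],
    "Hair Loss": [0, 0, 0, 0, 1],
})

# Every non-numeric column is treated as categorical.
categorical_cols = preview.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column.
cardinality = preview[categorical_cols].nunique()
print(cardinality)
```

Columns with many distinct values are the ones where the choice of encoder changes the feature count the most.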

2. Data Preprocessing

2.1 Dropping the ID Column

In [3]:

df = df.drop(df.columns[0],axis=1)

The ID column is only an identifier and carries no predictive information, so it is dropped.

2.2 Splitting the Data

In [4]:

x = df.iloc[:,:-1]
y = df.iloc[:,-1]
x_train,x_temp,y_train,y_temp = train_test_split(x,y,test_size=0.3,random_state=42)
x_val,x_test,y_val,y_test = train_test_split(x_temp,y_temp,test_size=1/3,random_state=42)

In [5]:

print("Training set shape:", x_train.shape, y_train.shape)
print("Validation set shape:", x_val.shape, y_val.shape)
print("Test set shape:", x_test.shape, y_test.shape)
Training set shape: (699, 11) (699,)
Validation set shape: (200, 11) (200,)
Test set shape: (100, 11) (100,)

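To prevent data leakage, the dataset is split before any other processing: roughly 70% goes to training, and the remaining 30% is divided 2:1 into validation and test sets. Feature encoding and model selection never use the test set.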
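One way to make the "fit on training data only" rule hard to break is to wrap the encoder and the model in a single pipeline. The sketch below is an illustration under stated assumptions, not the code used in this article: it uses scikit-learn's own OneHotEncoder instead of the category_encoders one, the toy values are made up, and "Severe" is a hypothetical category that appears only in the validation rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Toy data: column names mirror the dataset, values are invented.
train = pd.DataFrame({
    "Stress": ["Low", "High", "Moderate", "Low", "High", "Moderate"],
    "Smoking": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "Age": [19, 43, 26, 46, 30, 35],
})
y_train_toy = [0, 1, 0, 1, 1, 0]
val = pd.DataFrame({
    "Stress": ["Low", "Severe"],   # "Severe" never occurs in the training rows
    "Smoking": ["No", "Yes"],
    "Age": [22, 40],
})

categorical = ["Stress", "Smoking"]
preprocess = ColumnTransformer(
    # handle_unknown="ignore" encodes unseen categories as all zeros
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # Age is passed through unchanged
)
pipe = Pipeline([("encode", preprocess), ("svm", SVC(gamma="scale"))])

# fit() learns the category list from the training rows only
pipe.fit(train, y_train_toy)

# predict() reuses the fitted encoder; validation rows never influence it
val_pred = pipe.predict(val)
learned_stress = list(pipe.named_steps["encode"].named_transformers_["onehot"].categories_[0])
print(learned_stress, val_pred)
```

Because the encoder lives inside the pipeline, calling fit on the training split cannot accidentally see validation or test rows.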

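3. Selecting an Encoding Scheme

Three common encoders for categorical variables were chosen:
1. OneHotEncoder maps each category to a vector of 0s and 1s, with one column per category.
2. BinaryEncoder converts each category to the digits of a binary number. With n unique categories, binary encoding produces only about log2(n) columns instead of n.
3. BackwardDifferenceEncoder compares the mean of the dependent variable at one level with its mean at the previous level.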
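To see how the three encoders lead to different feature counts, the arithmetic can be sketched in plain Python. Everything specific here is an assumption: the category counts per column are hypothetical values chosen to agree with the shapes printed by the encoding cell in this section, the binary width assumes category codes run from 1 to n, and the backward-difference total assumes one shared intercept column (as the FutureWarning in that cell's output suggests).

```python
# Hypothetical category counts for the ten categorical columns
# (not read from the data; chosen to be consistent with the printed shapes).
cardinalities = [2] * 6 + [3] + [11] * 3
numeric_columns = 1  # Age is passed through unchanged


def one_hot_width(n):
    # One column per category.
    return n


def binary_width(n):
    # Assumption: codes run from 1 to n, so n.bit_length() binary digits are needed.
    return n.bit_length()


def backward_difference_width(n):
    # One contrast column per adjacent pair of levels.
    return n - 1


one_hot_total = sum(one_hot_width(n) for n in cardinalities) + numeric_columns
binary_total = sum(binary_width(n) for n in cardinalities) + numeric_columns
# Assumption: a single intercept column is added on top of the contrasts.
backward_total = sum(backward_difference_width(n) for n in cardinalities) + numeric_columns + 1

print(one_hot_total, binary_total, backward_total)
```

Under these assumptions the totals come out as 49, 27 and 40 columns, which is why binary encoding is attractive when a column has many categories.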

In [6]:

def encode_categorical_features(x_train, x_val):
    # Create the three encoders
    encoder1 = OneHotEncoder()
    encoder2 = BinaryEncoder()
    encoder3 = BackwardDifferenceEncoder()

    # Fit each encoder on the training set only, then encode the training set
    x_train_encoded1 = encoder1.fit_transform(x_train)
    x_train_encoded2 = encoder2.fit_transform(x_train)
    x_train_encoded3 = encoder3.fit_transform(x_train)

    # Apply the same fitted encoders to the validation set
    x_val_encoded1 = encoder1.transform(x_val)
    x_val_encoded2 = encoder2.transform(x_val)
    x_val_encoded3 = encoder3.transform(x_val)

    return x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3

# Encode the data with the function above
x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3 = encode_categorical_features(x_train, x_val)

# Print the shape of each encoded training set
print("Training set shape with one-hot encoding (1):", x_train_encoded1.shape)
print("Training set shape with binary encoding (2):", x_train_encoded2.shape)
print("Training set shape with backward difference encoding (3):", x_train_encoded3.shape)
Training set shape with one-hot encoding (1): (699, 49)
Training set shape with binary encoding (2): (699, 27)
Training set shape with backward difference encoding (3): (699, 40)
/opt/conda/lib/python3.6/site-packages/category_encoders/base_contrast_encoder.py:127: FutureWarning: Intercept column might not be added anymore in future releases (c.f. issue #370)
  category=FutureWarning)

4. Selecting a Machine Learning Model

Four common classification models were chosen:
1. Logistic Regression
2. Support Vector Machine
3. Random Forest
4. XGBoost

In [7]:

def train_and_evaluate_models(x_train, y_train, x_val, y_val):
    # Initialize the models with their default settings
    models = {
        "Logistic Regression": LogisticRegression(),
        "Support Vector Machine": SVC(),
        "Random Forest": RandomForestClassifier(),
        "XGBoost": XGBClassifier()
    }

    # Train and evaluate each model
    results = {}
    for name, model in models.items():
        # Train the model
        model.fit(x_train, y_train)

        # Predict on the validation set
        y_pred_val = model.predict(x_val)

        # Compute the accuracy
        accuracy = accuracy_score(y_val, y_pred_val)

        # Store the result
        results[name] = accuracy

    return results

# Train and evaluate the models on each encoding
results_encoder1 = train_and_evaluate_models(x_train_encoded1, y_train, x_val_encoded1, y_val)
results_encoder2 = train_and_evaluate_models(x_train_encoded2, y_train, x_val_encoded2, y_val)
results_encoder3 = train_and_evaluate_models(x_train_encoded3, y_train, x_val_encoded3, y_val)

# Print the results
print("Results for encoding 1:")
for name, accuracy in results_encoder1.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

print("\nResults for encoding 2:")
for name, accuracy in results_encoder2.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

print("\nResults for encoding 3:")
for name, accuracy in results_encoder3.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

# Model names, in plotting order
models = ["Logistic Regression", "Support Vector Machine", "Random Forest", "XGBoost"]

# Accuracies for encoding 1
accuracy_encoder1 = [results_encoder1[model] for model in models]

# Accuracies for encoding 2
accuracy_encoder2 = [results_encoder2[model] for model in models]

# Accuracies for encoding 3
accuracy_encoder3 = [results_encoder3[model] for model in models]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Results for encoding 1:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5550
Model: Random Forest, validation accuracy: 0.4950
Model: XGBoost, validation accuracy: 0.4350

Results for encoding 2:
Model: Logistic Regression, validation accuracy: 0.4850
Model: Support Vector Machine, validation accuracy: 0.5200
Model: Random Forest, validation accuracy: 0.5250
Model: XGBoost, validation accuracy: 0.5050

Results for encoding 3:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5450
Model: Random Forest, validation accuracy: 0.5300
Model: XGBoost, validation accuracy: 0.4950

5. Comparing Encoding Schemes and Modeling Methods

In [15]:

# Set up the figure
plt.figure(figsize=(10, 6))
# Draw the grouped bar chart
bar_width = 0.2
index = range(len(models))
bar1 = plt.bar(index, accuracy_encoder1, bar_width, label='Encoding 1')
bar2 = plt.bar([i + bar_width for i in index], accuracy_encoder2, bar_width, label='Encoding 2')
bar3 = plt.bar([i + 2 * bar_width for i in index], accuracy_encoder3, bar_width, label='Encoding 3')

# Add axis labels and the title
plt.xlabel('Model', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy of each encoding scheme across models', fontsize=14)
plt.xticks([i + bar_width for i in index], models)
plt.legend(loc='lower right')
# Annotate each bar with its accuracy
for bar in [bar1, bar2, bar3]:
    for rect in bar:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, f'{height:.2f}', ha='center', va='bottom')
# Show the figure
plt.tight_layout()
plt.show()

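Comparing the results, Support Vector Machine with OneHotEncoder achieves the highest validation accuracy, 0.555. Note that this metric is classification accuracy, not R2. One-hot encoding and the support vector machine are therefore selected for the final evaluation on the test set.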
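The same selection can be made programmatically rather than by reading the chart. The sketch below copies the validation accuracies printed in section 4 into a nested dictionary and picks the highest one. The dictionary layout and the variable names are assumptions for illustration; the encoder names follow the order encoder1, encoder2, encoder3 used in the encoding function.

```python
# Validation accuracies copied from the printed results in section 4.
validation_accuracy = {
    "OneHotEncoder": {
        "Logistic Regression": 0.4750,
        "Support Vector Machine": 0.5550,
        "Random Forest": 0.4950,
        "XGBoost": 0.4350,
    },
    "BinaryEncoder": {
        "Logistic Regression": 0.4850,
        "Support Vector Machine": 0.5200,
        "Random Forest": 0.5250,
        "XGBoost": 0.5050,
    },
    "BackwardDifferenceEncoder": {
        "Logistic Regression": 0.4750,
        "Support Vector Machine": 0.5450,
        "Random Forest": 0.5300,
        "XGBoost": 0.4950,
    },
}

# Flatten to (encoder, model, accuracy) triples and keep the best one.
best_encoder, best_model, best_accuracy = max(
    (
        (encoder, model, accuracy)
        for encoder, scores in validation_accuracy.items()
        for model, accuracy in scores.items()
    ),
    key=lambda triple: triple[2],
)
print(best_encoder, best_model, best_accuracy)
```

Selecting on validation accuracy keeps the test set untouched until the final evaluation.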

6. Modeling with the Best Method

In [17]:

# Encode the training and validation sets with one-hot encoding
encoder1 = OneHotEncoder()
x_train_encoded1 = encoder1.fit_transform(x_train)
x_val_encoded1 = encoder1.transform(x_val)

# Initialize the support vector machine model
SVM_model = SVC()

# Train the model on the training set
SVM_model.fit(x_train_encoded1, y_train)

# Predict on the validation set
y_pred_val = SVM_model.predict(x_val_encoded1)

# Compute the validation accuracy
accuracy_val = accuracy_score(y_val, y_pred_val)
print(f"Validation accuracy: {accuracy_val:.4f}")

# Evaluate the final model on the test set
x_test_encoded1 = encoder1.transform(x_test)
y_pred_test = SVM_model.predict(x_test_encoded1)

# Compute the test accuracy
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f"Test accuracy: {accuracy_test:.4f}")
Validation accuracy: 0.5550
Test accuracy: 0.5500
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

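The final accuracy on the test set is 0.55.
This article covers only a few encoders and models. Different encoders also change the dimensionality of the final dataset, so other combinations might predict better.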
