Modeling Study on a Hair Loss Dataset

Latest recommended article published on 2024-05-13 07:00:00

暴躁的秋秋

Article tags: machine learning, python, artificial intelligence

Article link: https://blog.csdn.net/m0_67431719/article/details/136210871

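Project Overview

Hair loss becomes a health concern for many people as they age. How full a person's hair is affects not only appearance but is also closely tied to overall health.
This dataset gathers factors that may contribute to hair loss, including genetics, hormonal changes, medical conditions, medications and treatments, nutritional deficiencies, and psychological stress.
This work tries three encoding schemes and four modeling methods, and takes care to avoid data leakage so that the modeling process is more standard and complete.

1. Data Preview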

In [18]:

import pandas as pd

df = pd.read_csv("/home/mw/input/hair2961/Predict Hair Fall.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Id                         999 non-null    int64 
 1   Genetics                   999 non-null    object
 2   Hormonal Changes           999 non-null    object
 3   Medical Conditions         999 non-null    object
 4   Medications & Treatments   999 non-null    object
 5   Nutritional Deficiencies   999 non-null    object
 6   Stress                     999 non-null    object
 7   Age                        999 non-null    int64 
 8   Poor Hair Care Habits      999 non-null    object
 9   Environmental Factors      999 non-null    object
 10  Smoking                    999 non-null    object
 11  Weight Loss                999 non-null    object
 12  Hair Loss                  999 non-null    int64 
dtypes: int64(3), object(10)
memory usage: 101.6+ KB

Out[18]:

	Id	Genetics	Hormonal Changes	Medical Conditions	Medications & Treatments	Nutritional Deficiencies	Stress	Age	Poor Hair Care Habits	Environmental Factors	Smoking	Weight Loss	Hair Loss
0	133992	Yes	No	No Data	No Data	Magnesium deficiency	Moderate	19	Yes	Yes	No	No	0
1	148393	No	No	Eczema	Antibiotics	Magnesium deficiency	High	43	Yes	Yes	No	No	0
2	155074	No	No	Dermatosis	Antifungal Cream	Protein deficiency	Moderate	26	Yes	Yes	No	Yes	0
3	118261	Yes	Yes	Ringworm	Antibiotics	Biotin Deficiency	Moderate	46	Yes	Yes	No	No	0
4	111915	No	No	Psoriasis	Accutane	Iron deficiency	Moderate	30	No	Yes	Yes	No	1

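The dataset contains 999 records with no missing values, although the preview shows that some fields hold the placeholder string "No Data". Most features are categorical and must be encoded as numeric variables before modeling.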
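A quick way to confirm these observations is to count, per column, how many distinct values appear and how often the placeholder occurs. The sketch below uses only the standard library on a tiny sample copied from three columns of the preview rows; the helper name `summarize_columns` is a hypothetical choice and is not part of the original notebook. With pandas, the same checks would normally be `df.nunique()` and `(df == "No Data").sum()`.

```python
import csv
import io
from collections import defaultdict

# Tiny illustrative sample: three columns from the five preview rows.
SAMPLE_CSV = """Genetics,Medical Conditions,Stress
Yes,No Data,Moderate
No,Eczema,High
No,Dermatosis,Moderate
Yes,Ringworm,Moderate
No,Psoriasis,Moderate
"""

def summarize_columns(text, placeholder="No Data"):
    """Return {column: (distinct_value_count, placeholder_count)}."""
    distinct = defaultdict(set)
    placeholders = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            distinct[column].add(value)
            if value == placeholder:
                placeholders[column] += 1
    return {c: (len(v), placeholders[c]) for c, v in distinct.items()}

summary = summarize_columns(SAMPLE_CSV)
for column, (n_distinct, n_placeholder) in summary.items():
    print(f"{column}: {n_distinct} distinct values, {n_placeholder} placeholder entries")
```

A column with no true nulls can still carry placeholders like this, so it may be worth deciding whether "No Data" should be treated as its own category or as a missing value.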

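2. Data Preprocessing

2.1 Dropping the ID column

In [3]:

df = df.drop(df.columns[0], axis=1)  # the first column is Id

The ID is an irrelevant feature and contributes nothing to modeling.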

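2.2 Data Split

In [4]:

from sklearn.model_selection import train_test_split
# Features are every column except the last; the target is the last column (Hair Loss)
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Hold out 30% of the rows, then divide that part into validation (2/3) and test (1/3)
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.3, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=1/3, random_state=42)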

In [5]:

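print("Training set size:", x_train.shape, y_train.shape)
print("Validation set size:", x_val.shape, y_val.shape)
print("Test set size:", x_test.shape, y_test.shape)

Training set size: (699, 11) (699,)
Validation set size: (200, 11) (200,)
Test set size: (100, 11) (100,)

To prevent data leakage, the dataset is split first. Neither feature encoding nor model selection uses the test set.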
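The 699/200/100 sizes follow from how the two fractional splits round. The sketch below reproduces only that arithmetic, assuming each split rounds the held-out part up and gives the remaining rows to the other part, which matches the sizes the notebook printed. `three_way_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def three_way_sizes(n_rows, holdout_fraction=0.3, test_share_of_holdout=1 / 3):
    """Return (train, validation, test) row counts for two chained splits.

    Assumption: each split rounds the held-out part up with ceil and gives
    the remaining rows to the other part.
    """
    n_holdout = math.ceil(holdout_fraction * n_rows)
    n_train = n_rows - n_holdout
    n_test = math.ceil(test_share_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(three_way_sizes(999))
```

Because the second split is applied to the 300 held-out rows, `test_size=1/3` corresponds to roughly 10% of the full dataset, not one third of it.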

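3. Choosing an Encoding Scheme

Three common encodings for categorical variables were selected:
1. OneHotEncoder maps each category to a vector of 0s and 1s.
2. BinaryEncoder converts categories into binary digits. With n unique categories, binary encoding produces only about log2(n) features.
3. BackwardDifferenceEncoder compares the mean of the dependent variable at one level with the mean of the dependent variable at the previous level.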
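To make the three schemes concrete, here is a minimal pure-Python sketch of what each one produces for a single feature. It is not the `category_encoders` implementation: the helper names are hypothetical, the binary variant assumes categories are numbered from 1 in order of first appearance, the backward-difference matrix is the textbook contrast coding, and the sample values are illustrative.

```python
from fractions import Fraction

def one_hot(values):
    """One column per distinct category, in order of first appearance."""
    categories = list(dict.fromkeys(values))
    return [[int(v == c) for c in categories] for v in values]

def binary(values):
    """Write each category's 1-based ordinal code in base 2."""
    categories = list(dict.fromkeys(values))
    width = len(categories).bit_length()  # enough bits for the largest code
    codes = {c: i + 1 for i, c in enumerate(categories)}
    return [[(codes[v] >> bit) & 1 for bit in reversed(range(width))] for v in values]

def backward_difference_matrix(k):
    """Contrast matrix with k rows (levels) and k - 1 columns.

    Column j compares the mean of level j + 1 with the mean of level j.
    """
    return [
        [Fraction(j, k) if level > j else Fraction(j - k, k) for j in range(1, k)]
        for level in range(1, k + 1)
    ]

# Illustrative values for one categorical feature.
stress = ["Moderate", "High", "Moderate", "Low"]
print(one_hot(stress))
print(binary(stress))
print(backward_difference_matrix(3))
```

In this sketch one-hot needs one column per category, the binary variant needs only as many columns as the bit length of the category count, and backward-difference coding needs one column fewer than the number of levels.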

In [6]:

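# Assumed source of the encoders: the category_encoders package (see the warning path in the output)
from category_encoders import OneHotEncoder, BinaryEncoder, BackwardDifferenceEncoder

def encode_categorical_features(x_train, x_val):
    # Create the three encoders
    encoder1 = OneHotEncoder()
    encoder2 = BinaryEncoder()
    encoder3 = BackwardDifferenceEncoder()

    # Fit each encoder on the training set only, then encode it
    x_train_encoded1 = encoder1.fit_transform(x_train)
    x_train_encoded2 = encoder2.fit_transform(x_train)
    x_train_encoded3 = encoder3.fit_transform(x_train)

    # Apply the already fitted encoders to the validation set
    x_val_encoded1 = encoder1.transform(x_val)
    x_val_encoded2 = encoder2.transform(x_val)
    x_val_encoded3 = encoder3.transform(x_val)

    return x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3

# Encode the data with the function
x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3 = encode_categorical_features(x_train, x_val)

# Print the size of each encoded training set
print("Training set size with one-hot encoding (1):", x_train_encoded1.shape)
print("Training set size with binary encoding (2):", x_train_encoded2.shape)
print("Training set size with backward difference encoding (3):", x_train_encoded3.shape)

Training set size with one-hot encoding (1): (699, 49)
Training set size with binary encoding (2): (699, 27)
Training set size with backward difference encoding (3): (699, 40)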

/opt/conda/lib/python3.6/site-packages/category_encoders/base_contrast_encoder.py:127: FutureWarning: Intercept column might not be added anymore in future releases (c.f. issue #370)
  category=FutureWarning)
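The three widths (49, 27 and 40 columns) can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: the per-column category counts are hypothetical (the source does not list them, and the three multi-valued columns are guessed at 11 values each), binary encoding is assumed to use the bit length of the category count, and backward-difference encoding is assumed to use k - 1 columns per feature plus the single intercept column mentioned in the warning. The numeric Age column is assumed to pass through unchanged.

```python
# Hypothetical category counts for the ten categorical columns.
# All counts are assumptions for illustration; the source does not list them.
CATEGORY_COUNTS = {
    "Genetics": 2,
    "Hormonal Changes": 2,
    "Medical Conditions": 11,
    "Medications & Treatments": 11,
    "Nutritional Deficiencies": 11,
    "Stress": 3,
    "Poor Hair Care Habits": 2,
    "Environmental Factors": 2,
    "Smoking": 2,
    "Weight Loss": 2,
}
NUMERIC_COLUMNS = 1  # Age

def encoded_widths(counts, numeric=NUMERIC_COLUMNS):
    """Return the estimated (one_hot, binary, backward_difference) column counts."""
    one_hot = sum(counts.values()) + numeric
    binary = sum(k.bit_length() for k in counts.values()) + numeric
    backward = sum(k - 1 for k in counts.values()) + numeric + 1  # +1 intercept
    return one_hot, binary, backward

print(encoded_widths(CATEGORY_COUNTS))
```

Under these assumed counts the estimate agrees with the printed shapes, but other combinations of counts could produce the same totals, so this is a plausibility check and not a reconstruction of the data.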

4. Choosing a Machine Learning Model

Four common classification models were selected:
1. Logistic Regression
2. Support Vector Machine
3. Random Forest
4. XGBoost
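Accuracy is the yardstick used to compare these models, and it is easier to read next to a trivial reference point. The sketch below is an optional addition, not part of the original notebook: it computes accuracy by hand and the score of a baseline that always predicts the most frequent training label. The helper names are hypothetical and the label lists are made-up toy data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_val):
    """Accuracy on y_val of always predicting the most common label in y_train."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_val, [most_common_label] * len(y_val))

# Toy labels, for illustration only.
y_train_toy = [0, 1, 1, 0, 1, 1]
y_val_toy = [1, 0, 1, 1]
y_pred_toy = [1, 1, 1, 0]

print(accuracy(y_val_toy, y_pred_toy))
print(majority_baseline(y_train_toy, y_val_toy))
```

A model that cannot beat this kind of baseline on the validation set is probably not extracting much signal from the features.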

In [7]:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def train_and_evaluate_models(x_train, y_train, x_val, y_val):
    # Initialize the models with default hyperparameters
    models = {
        "Logistic Regression": LogisticRegression(),
        "Support Vector Machine": SVC(),
        "Random Forest": RandomForestClassifier(),
        "XGBoost": XGBClassifier()
    }

    # Train and evaluate each model
    results = {}
    for name, model in models.items():
        # Fit the model on the training set
        model.fit(x_train, y_train)

        # Predict on the validation set
        y_pred_val = model.predict(x_val)

        # Compute the accuracy
        accuracy = accuracy_score(y_val, y_pred_val)

        # Store the result
        results[name] = accuracy

    return results

# Train and evaluate the models on each encoding
results_encoder1 = train_and_evaluate_models(x_train_encoded1, y_train, x_val_encoded1, y_val)
results_encoder2 = train_and_evaluate_models(x_train_encoded2, y_train, x_val_encoded2, y_val)
results_encoder3 = train_and_evaluate_models(x_train_encoded3, y_train, x_val_encoded3, y_val)

# Print the results
print("Results for encoding 1:")
for name, accuracy in results_encoder1.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

print("\nResults for encoding 2:")
for name, accuracy in results_encoder2.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

print("\nResults for encoding 3:")
for name, accuracy in results_encoder3.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")

# Model names
models = ["Logistic Regression", "Support Vector Machine", "Random Forest", "XGBoost"]

# Accuracies for encoding 1
accuracy_encoder1 = [results_encoder1[model] for model in models]

# Accuracies for encoding 2
accuracy_encoder2 = [results_encoder2[model] for model in models]

# Accuracies for encoding 3
accuracy_encoder3 = [results_encoder3[model] for model in models]

/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Results for encoding 1:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5550
Model: Random Forest, validation accuracy: 0.4950
Model: XGBoost, validation accuracy: 0.4350

Results for encoding 2:
Model: Logistic Regression, validation accuracy: 0.4850
Model: Support Vector Machine, validation accuracy: 0.5200
Model: Random Forest, validation accuracy: 0.5250
Model: XGBoost, validation accuracy: 0.5050

Results for encoding 3:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5450
Model: Random Forest, validation accuracy: 0.5300
Model: XGBoost, validation accuracy: 0.4950

5. Comparing Encoding Schemes and Modeling Methods

In [15]:

import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(10, 6))
# Draw the grouped bar chart
bar_width = 0.2
index = range(len(models))
bar1 = plt.bar(index, accuracy_encoder1, bar_width, label='Encoding 1')
bar2 = plt.bar([i + bar_width for i in index], accuracy_encoder2, bar_width, label='Encoding 2')
bar3 = plt.bar([i + 2 * bar_width for i in index], accuracy_encoder3, bar_width, label='Encoding 3')

# Add labels and title
plt.xlabel('Model', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy of each encoding scheme across models', fontsize=14)
plt.xticks([i + bar_width for i in index], models)
plt.legend(loc='lower right')
# Annotate each bar with its accuracy
for bar in [bar1, bar2, bar3]:
    for rect in bar:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, f'{height:.2f}', ha='center', va='bottom')
# Show the figure
plt.tight_layout()
plt.show()

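Comparing the results, Support Vector Machine with OneHotEncoder achieves the highest validation accuracy, 0.555 (about 0.56); note that this figure is an accuracy, not an R2 score. All twelve validation accuracies lie between 0.435 and 0.555, so the gaps between combinations are small. One-hot encoding and the support vector machine are therefore selected for evaluation on the test set.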
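The selection can also be made programmatically instead of by reading the chart. The sketch below copies the validation accuracies printed in section 4 into a nested dictionary and picks the highest one; the dictionary layout and the name `best_combination` are illustrative choices, not part of the original notebook.

```python
# Validation accuracies as printed in section 4 (encodings 1, 2 and 3).
VALIDATION_ACCURACY = {
    "OneHotEncoder": {
        "Logistic Regression": 0.4750,
        "Support Vector Machine": 0.5550,
        "Random Forest": 0.4950,
        "XGBoost": 0.4350,
    },
    "BinaryEncoder": {
        "Logistic Regression": 0.4850,
        "Support Vector Machine": 0.5200,
        "Random Forest": 0.5250,
        "XGBoost": 0.5050,
    },
    "BackwardDifferenceEncoder": {
        "Logistic Regression": 0.4750,
        "Support Vector Machine": 0.5450,
        "Random Forest": 0.5300,
        "XGBoost": 0.4950,
    },
}

def best_combination(scores):
    """Return the (encoder, model, accuracy) triple with the highest accuracy."""
    return max(
        (
            (encoder, model, accuracy)
            for encoder, by_model in scores.items()
            for model, accuracy in by_model.items()
        ),
        key=lambda item: item[2],
    )

encoder_name, model_name, best_accuracy = best_combination(VALIDATION_ACCURACY)
print(encoder_name, model_name, best_accuracy)
```

Keeping the scores in one structure also makes it easy to add further encoders or models later without rewriting the comparison.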

6. Modeling with the Best Method

In [17]:

# Encode the training and validation sets with one-hot encoding
encoder1 = OneHotEncoder()
x_train_encoded1 = encoder1.fit_transform(x_train)
x_val_encoded1 = encoder1.transform(x_val)

# Initialize the support vector machine model
SVM_model = SVC()

# Train the model on the training set
SVM_model.fit(x_train_encoded1, y_train)

# Predict on the validation set
y_pred_val = SVM_model.predict(x_val_encoded1)

# Compute the validation accuracy
accuracy_val = accuracy_score(y_val, y_pred_val)
print(f"Accuracy on the validation set: {accuracy_val:.4f}")

# Evaluate the final model on the test set, reusing the encoder fitted on the training set
x_test_encoded1 = encoder1.transform(x_test)
y_pred_test = SVM_model.predict(x_test_encoded1)

# Compute the test accuracy
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f"Accuracy on the test set: {accuracy_test:.4f}")

Accuracy on the validation set: 0.5550
Accuracy on the test set: 0.5500

/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

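The final accuracy on the test set is 0.55.
This article tests only a few encoding and modeling methods. Different encodings also change the dimensionality of the final dataset, so other combinations may yield better predictions.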
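With only 200 validation rows and 100 test rows, a reported accuracy carries noticeable sampling uncertainty. As a rough guide, and assuming each prediction is an independent trial, the binomial standard error is sqrt(p(1 - p) / n). The helper below is a hypothetical illustration of that formula, not part of the original analysis.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_samples rows."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Accuracies and set sizes reported in this article.
for label, acc, n in [("validation", 0.555, 200), ("test", 0.55, 100)]:
    low, high = accuracy_interval(acc, n)
    print(f"{label}: {acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Both intervals include 0.5, so if the two classes are roughly balanced (which the source does not state), these scores are hard to distinguish from guessing, and differences of a few points between encoder and model combinations should be read with caution.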
