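Project Overview
Hair loss becomes a health concern for many people as they age. How full a person's hair is affects not only appearance but is also closely tied to overall health.
This dataset brings together factors that may contribute to hair loss, including genetics, hormonal changes, medical conditions, medications and treatments, nutritional deficiencies, and psychological stress.
This work tries three encoding schemes and four modeling methods, and takes care to avoid data leakage so that the modeling process is more rigorous and complete.
Importing the required packages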
In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder, BinaryEncoder, BackwardDifferenceEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
1. Data preview
In [18]:
df = pd.read_csv("/home/mw/input/hair2961/Predict Hair Fall.csv")
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Id                        999 non-null    int64
 1   Genetics                  999 non-null    object
 2   Hormonal Changes          999 non-null    object
 3   Medical Conditions        999 non-null    object
 4   Medications & Treatments  999 non-null    object
 5   Nutritional Deficiencies  999 non-null    object
 6   Stress                    999 non-null    object
 7   Age                       999 non-null    int64
 8   Poor Hair Care Habits     999 non-null    object
 9   Environmental Factors     999 non-null    object
 10  Smoking                   999 non-null    object
 11  Weight Loss               999 non-null    object
 12  Hair Loss                 999 non-null    int64
dtypes: int64(3), object(10)
memory usage: 101.6+ KB
Out[18]:
| | Id | Genetics | Hormonal Changes | Medical Conditions | Medications & Treatments | Nutritional Deficiencies | Stress | Age | Poor Hair Care Habits | Environmental Factors | Smoking | Weight Loss | Hair Loss |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 133992 | Yes | No | No Data | No Data | Magnesium deficiency | Moderate | 19 | Yes | Yes | No | No | 0 |
1 | 148393 | No | No | Eczema | Antibiotics | Magnesium deficiency | High | 43 | Yes | Yes | No | No | 0 |
2 | 155074 | No | No | Dermatosis | Antifungal Cream | Protein deficiency | Moderate | 26 | Yes | Yes | No | Yes | 0 |
3 | 118261 | Yes | Yes | Ringworm | Antibiotics | Biotin Deficiency | Moderate | 46 | Yes | Yes | No | No | 0 |
4 | 111915 | No | No | Psoriasis | Accutane | Iron deficiency | Moderate | 30 | No | Yes | Yes | No | 1 |
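The dataset contains 999 records with no null values, although some fields hold the placeholder string "No Data" (for example, Medical Conditions in row 0). Most features are categorical and must be converted to numeric form before modeling.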
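A null check such as `df.info()` does not flag placeholder strings. The sketch below counts them by hand, using only the five Medical Conditions values visible in the preview table; the `PLACEHOLDER` constant and the `count_placeholders` helper are illustrative names, not part of the original notebook.

```python
from collections import Counter

# Medical Conditions values copied from the five preview rows.
medical_conditions = ["No Data", "Eczema", "Dermatosis", "Ringworm", "Psoriasis"]

# Assumption: "No Data" is the only placeholder used for missing entries.
PLACEHOLDER = "No Data"


def count_placeholders(values, placeholder=PLACEHOLDER):
    """Return how many entries equal the placeholder string."""
    return Counter(values)[placeholder]


n_missing = count_placeholders(medical_conditions)
n_distinct = len(set(medical_conditions))
print(f"placeholder entries: {n_missing}, distinct values: {n_distinct}")
```

The same count could be run per column on the full dataset to decide whether "No Data" should be kept as its own category or treated as missing.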
2. Data preprocessing
2.1 Dropping the Id column
In [3]:
df = df.drop(df.columns[0],axis=1)
The Id column is an arbitrary identifier and contributes nothing to modeling.
2.2 Splitting the data
In [4]:
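# Separate the features from the target (the last column)
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Hold out 30% of the data, then split that part 2:1 into validation and test sets
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.3, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=1/3, random_state=42)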
In [5]:
print("Training set size:", x_train.shape, y_train.shape)
print("Validation set size:", x_val.shape, y_val.shape)
print("Test set size:", x_test.shape, y_test.shape)
Training set size: (699, 11) (699,)
Validation set size: (200, 11) (200,)
Test set size: (100, 11) (100,)
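To prevent data leakage, the dataset is split before anything else is done. Neither feature encoding nor model selection uses the test set.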
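The sizes of the two-stage split can be checked with a little arithmetic. The sketch below assumes scikit-learn's rounding rule for a fractional `test_size` (the held-out count is rounded up and the remainder stays in the other part); `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is ceil(test_size * n_samples),
    and the rest of the samples stay in the first part.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_train, n_temp = split_sizes(999, 0.3)     # first split: train vs. temp
n_val, n_test = split_sizes(n_temp, 1 / 3)  # second split: validation vs. test
print(n_train, n_val, n_test)
```

Under that assumption the result is a roughly 70/20/10 split, in line with the 699, 200, and 100 rows reported by the notebook.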
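3. Selecting an encoding scheme
Three common encodings for categorical variables were selected:
1. OneHotEncoder maps each category to a vector of 0s and 1s.
2. BinaryEncoder converts each category to a binary number. With n unique categories, binary encoding produces only about log2(n) features.
3. BackwardDifferenceEncoder compares the mean of the dependent variable at one level with its mean at the previous level.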
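To make the difference in width between the first two encodings concrete, here is a minimal pure-Python sketch. It is not the `category_encoders` implementation: the helpers `binary_encode` and `one_hot_encode` are hypothetical, and numbering the categories from 0 in sorted order is an assumption (a real encoder may number them differently and use more columns). The sample values are the Nutritional Deficiencies entries shown in the data preview.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width list of 0/1 digits.

    Assumption: categories are numbered 0..n-1 in sorted order.
    """
    levels = sorted(set(categories))
    width = max(1, (len(levels) - 1).bit_length())  # ceil(log2(n)) for n >= 2
    table = {}
    for code, level in enumerate(levels):
        bits = format(code, f"0{width}b")  # zero-padded binary string
        table[level] = [int(b) for b in bits]
    return table


def one_hot_encode(categories):
    """Map each distinct category to a vector with a single 1."""
    levels = sorted(set(categories))
    return {
        level: [int(i == j) for j in range(len(levels))]
        for i, level in enumerate(levels)
    }


deficiencies = [
    "Magnesium deficiency",
    "Magnesium deficiency",
    "Protein deficiency",
    "Biotin Deficiency",
    "Iron deficiency",
]
binary_table = binary_encode(deficiencies)
one_hot_table = one_hot_encode(deficiencies)
for level in sorted(binary_table):
    print(level, binary_table[level], one_hot_table[level])
```

With four distinct values, this sketch needs two binary columns where one-hot encoding needs four, which is why binary encoding yields a narrower feature matrix.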
In [6]:
def encode_categorical_features(x_train, x_val):
    # Create the three encoders
    encoder1 = OneHotEncoder()
    encoder2 = BinaryEncoder()
    encoder3 = BackwardDifferenceEncoder()
    # Fit each encoder on the training set and encode it
    x_train_encoded1 = encoder1.fit_transform(x_train)
    x_train_encoded2 = encoder2.fit_transform(x_train)
    x_train_encoded3 = encoder3.fit_transform(x_train)
    # Apply the same fitted encoders to the validation set
    x_val_encoded1 = encoder1.transform(x_val)
    x_val_encoded2 = encoder2.transform(x_val)
    x_val_encoded3 = encoder3.transform(x_val)
    return x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3

# Encode the training and validation sets
x_train_encoded1, x_train_encoded2, x_train_encoded3, x_val_encoded1, x_val_encoded2, x_val_encoded3 = encode_categorical_features(x_train, x_val)
# Print the size of each encoded training set
print("Training set size with one-hot encoding (1):", x_train_encoded1.shape)
print("Training set size with binary encoding (2):", x_train_encoded2.shape)
print("Training set size with backward difference encoding (3):", x_train_encoded3.shape)
Training set size with one-hot encoding (1): (699, 49)
Training set size with binary encoding (2): (699, 27)
Training set size with backward difference encoding (3): (699, 40)
/opt/conda/lib/python3.6/site-packages/category_encoders/base_contrast_encoder.py:127: FutureWarning: Intercept column might not be added anymore in future releases (c.f. issue #370)
  category=FutureWarning)
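The one-hot and backward difference column counts reported above can be cross-checked against each other. The sketch below assumes that the ten object-typed features are the only encoded columns, that Age passes through as a single numeric column, and that the backward difference encoder adds the one intercept column mentioned in the warning. `backward_difference_matrix` is a hypothetical helper that builds the textbook contrast matrix, not a `category_encoders` function.

```python
from fractions import Fraction


def backward_difference_matrix(k):
    """Contrast matrix for k ordered levels: k rows, k - 1 columns.

    Column j (1-based) compares level j + 1 with level j: rows for
    levels 1..j hold -(k - j)/k and rows for levels j + 1..k hold j/k.
    """
    return [
        [Fraction(-(k - j), k) if level <= j else Fraction(j, k)
         for j in range(1, k)]
        for level in range(1, k + 1)
    ]


n_categorical = 10    # object-typed columns reported by df.info()
n_numeric = 1         # Age, assumed to pass through unchanged
one_hot_columns = 49  # reported by the notebook

# One-hot uses one column per level, so this is the total number of levels.
total_levels = one_hot_columns - n_numeric

# Backward difference uses k - 1 columns per feature, plus one intercept.
backward_columns = (total_levels - n_categorical) + n_numeric + 1
print(backward_columns)

# Contrast matrix for a three-level feature.
for row in backward_difference_matrix(3):
    print([str(value) for value in row])
```

Under these assumptions the arithmetic gives 40 columns, which agrees with the backward difference shape reported by the notebook.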
4. Selecting a machine learning model
Four common classification models were selected:
1.Logistic Regression
2.Support Vector Machine
3.Random Forest
4.XGBoost
In [7]:
def train_and_evaluate_models(x_train, y_train, x_val, y_val):
    # Initialize the models with their default hyperparameters
    models = {
        "Logistic Regression": LogisticRegression(),
        "Support Vector Machine": SVC(),
        "Random Forest": RandomForestClassifier(),
        "XGBoost": XGBClassifier()
    }
    # Train and evaluate each model
    results = {}
    for name, model in models.items():
        # Fit on the training set
        model.fit(x_train, y_train)
        # Predict on the validation set
        y_pred_val = model.predict(x_val)
        # Compute and store the validation accuracy
        accuracy = accuracy_score(y_val, y_pred_val)
        results[name] = accuracy
    return results

# Train and evaluate the models for each encoding
results_encoder1 = train_and_evaluate_models(x_train_encoded1, y_train, x_val_encoded1, y_val)
results_encoder2 = train_and_evaluate_models(x_train_encoded2, y_train, x_val_encoded2, y_val)
results_encoder3 = train_and_evaluate_models(x_train_encoded3, y_train, x_val_encoded3, y_val)
# Print the results
print("Results for encoding 1:")
for name, accuracy in results_encoder1.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")
print("\nResults for encoding 2:")
for name, accuracy in results_encoder2.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")
print("\nResults for encoding 3:")
for name, accuracy in results_encoder3.items():
    print(f"Model: {name}, validation accuracy: {accuracy:.4f}")
# Model names, in plotting order
models = ["Logistic Regression", "Support Vector Machine", "Random Forest", "XGBoost"]
# Accuracies per model for each encoding
accuracy_encoder1 = [results_encoder1[model] for model in models]
accuracy_encoder2 = [results_encoder2[model] for model in models]
accuracy_encoder3 = [results_encoder3[model] for model in models]
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Results for encoding 1:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5550
Model: Random Forest, validation accuracy: 0.4950
Model: XGBoost, validation accuracy: 0.4350

Results for encoding 2:
Model: Logistic Regression, validation accuracy: 0.4850
Model: Support Vector Machine, validation accuracy: 0.5200
Model: Random Forest, validation accuracy: 0.5250
Model: XGBoost, validation accuracy: 0.5050

Results for encoding 3:
Model: Logistic Regression, validation accuracy: 0.4750
Model: Support Vector Machine, validation accuracy: 0.5450
Model: Random Forest, validation accuracy: 0.5300
Model: XGBoost, validation accuracy: 0.4950
5. Comparing encodings and models
In [15]:
# Set up the figure
plt.figure(figsize=(10, 6))
# Draw one group of bars per model, one bar per encoding
bar_width = 0.2
index = range(len(models))
bar1 = plt.bar(index, accuracy_encoder1, bar_width, label='Encoding 1')
bar2 = plt.bar([i + bar_width for i in index], accuracy_encoder2, bar_width, label='Encoding 2')
bar3 = plt.bar([i + 2 * bar_width for i in index], accuracy_encoder3, bar_width, label='Encoding 3')
# Add axis labels, title, ticks, and legend
plt.xlabel('Model', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Accuracy of each encoding across models', fontsize=14)
plt.xticks([i + bar_width for i in index], models)
plt.legend(loc='lower right')
# Annotate each bar with its accuracy
for bar in [bar1, bar2, bar3]:
    for rect in bar:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height, f'{height:.2f}', ha='center', va='bottom')
# Show the figure
plt.tight_layout()
plt.show()
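Comparing the results shows that Support Vector Machine with OneHotEncoder achieves the highest validation accuracy, 0.555 (shown as 0.56 on the chart). One-hot encoding and the support vector machine are therefore selected for evaluation on the test set.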
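The scores compared here come from `accuracy_score`, that is, the fraction of validation labels predicted correctly. It can help to read such a score against a majority-class baseline. The sketch below uses made-up toy labels, not the dataset's, and the helper names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels for illustration only (not taken from the dataset).
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"majority baseline: {majority_baseline(y_true):.2f}")
```

Applying `majority_baseline` to `y_val` would show how much of the 0.555 validation accuracy is attributable to the model rather than to the class distribution.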
6. Modeling with the best method
In [17]:
# Encode the training and validation sets with one-hot encoding
encoder1 = OneHotEncoder()
x_train_encoded1 = encoder1.fit_transform(x_train)
x_val_encoded1 = encoder1.transform(x_val)
# Initialize the support vector machine model
SVM_model = SVC()
# Train the model on the training set
SVM_model.fit(x_train_encoded1, y_train)
# Predict on the validation set
y_pred_val = SVM_model.predict(x_val_encoded1)
# Compute the validation accuracy
accuracy_val = accuracy_score(y_val, y_pred_val)
print(f"Validation accuracy: {accuracy_val:.4f}")
# Evaluate the final model on the test set
x_test_encoded1 = encoder1.transform(x_test)
y_pred_test = SVM_model.predict(x_test_encoded1)
# Compute the test accuracy
accuracy_test = accuracy_score(y_test, y_pred_test)
print(f"Test accuracy: {accuracy_test:.4f}")
Validation accuracy: 0.5550
Test accuracy: 0.5500
/opt/conda/lib/python3.6/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
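The final accuracy on the test set is 0.55.
This work tests only a few encodings and models. Different encodings also change the dimensionality of the resulting dataset, so other combinations might yield better predictive performance.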