A Detailed Explanation of Machine Learning Model Code

Latest recommended article published on 2024-07-26 00:18:44

EaSoNgo111

Article tags: machine learning, artificial intelligence

Article link: https://blog.csdn.net/easongo111/article/details/138668828

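# Feature engineering
# data is assumed to be a pandas DataFrame that contains a 'target' column
from sklearn.model_selection import train_test_split

features = data.drop('target', axis=1)
labels = data['target']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.3, random_state=1)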

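data.drop('target', axis=1) separates the features from the data: it removes the column named 'target' and keeps every other column as a feature.
data['target'] retrieves the target variable (the labels) and stores it in the variable labels.
train_test_split splits the features and labels into a training set and a test set. The parameter test_size=0.3 puts 30% of the data into the test set, and random_state=1 fixes the random seed so that the split is identical on every run. The resulting training features, test features, training labels, and test labels are stored in train_features, test_features, train_labels, and test_labels.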
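To make the roles of test_size and random_state concrete, here is a minimal hand-rolled sketch using only the standard library. It is an illustration of the idea, not scikit-learn's actual implementation; the helper name split_indices and the rounding rule are assumptions made for this example.

```python
import random

def split_indices(n_samples, test_size=0.3, seed=1):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 7 3
```

Calling split_indices(10) again returns exactly the same two lists, which is what a fixed random_state buys you: experiments stay comparable from one run to the next.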

Handling an imbalanced dataset

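from imblearn.over_sampling import SMOTE

smo_ratio = 5  # largest majority-to-minority ratio tolerated before oversampling
label_counts = train_labels.value_counts()
majo_num = label_counts[0]  # label 0 is the majority class
mino_num = label_counts[1]  # label 1 is the minority class
imbalanced_ratio = float(majo_num) / float(mino_num)
if imbalanced_ratio > smo_ratio:
    # Oversample label 1 until it reaches majo_num / smo_ratio samples
    smo = SMOTE(sampling_strategy={1: int(float(majo_num) / float(smo_ratio))})
    train_features, train_labels = smo.fit_resample(train_features, train_labels)
    print('num of label 1 after smote:%d' % sum(train_labels))
    print('num of label 0 after smote:%d' % (len(train_labels) - sum(train_labels)))

The code first counts the training samples with label 0 and label 1 and stores the counts in majo_num and mino_num. Label 0 is the majority class here and label 1 is the minority class.
It then computes the imbalance ratio, the number of label 0 samples divided by the number of label 1 samples, and stores it in imbalanced_ratio.
If the imbalance ratio exceeds the preset threshold smo_ratio, the following steps run:
- SMOTE oversamples the minority class (label 1) until its sample count equals the majority count divided by smo_ratio. The sampling_strategy parameter sets this target count for label 1.
- fit_resample returns the resampled features and labels, which replace the original training set.
- The numbers of label 1 and label 0 samples after SMOTE are printed.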
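The decision logic above can be checked without any resampling library. The following sketch only computes whether oversampling would be triggered and which target count would be passed as sampling_strategy; the helper name smote_target is hypothetical, and returning None when there are no minority samples is an assumption of this example rather than behaviour of the original code.

```python
from collections import Counter

def smote_target(labels, max_ratio=5):
    """Return the sampling_strategy dict for label 1, or None if no resampling is needed."""
    counts = Counter(labels)
    n_major, n_minor = counts[0], counts[1]
    # Assumption: with no minority samples there is nothing to oversample from
    if n_minor == 0 or n_major / n_minor <= max_ratio:
        return None
    # Raise the minority class to (majority count / max_ratio) samples
    return {1: n_major // max_ratio}

print(smote_target([0] * 100 + [1] * 10))  # {1: 20}
print(smote_target([0] * 40 + [1] * 10))   # None
```

With 100 negatives and 10 positives the ratio is 10, so label 1 would be raised to 20 samples, leaving a 5:1 ratio. With 40 negatives the ratio is 4 and the training set is left untouched.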

Data transformation and standardization

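from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Data transformation
# to_dict yields one dict per row; pandas 2.x requires the full name 'records'
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
test_features = dvec.transform(test_features.to_dict(orient='records'))
# Standardization
ss = StandardScaler()
train_features = ss.fit_transform(train_features)
test_features = ss.transform(test_features)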

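DictVectorizer: a tool that converts features stored as dictionaries into a sparse or dense matrix. Here a DictVectorizer object dvec is created with sparse=False, so the output is a dense array. The training features train_features are converted into a list of per-row dictionaries and passed to fit_transform(), which learns the feature columns and produces the matrix. The test features test_features are processed the same way but with transform(), because test data must reuse the mapping learned from the training data.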
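As a rough illustration of what this conversion does, the sketch below one-hot encodes string values as "name=value" columns and passes numbers through, learning the columns from the training records only. It is a simplified stand-in written for this article, not the library's code; the function names fit_vocabulary and to_matrix are hypothetical.

```python
def fit_vocabulary(records):
    """Collect the column names seen in the training records, in sorted order."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)

def to_matrix(records, vocabulary):
    """Turn records into dense rows; columns unseen during fitting are ignored."""
    index = {name: i for i, name in enumerate(vocabulary)}
    rows = []
    for rec in records:
        row = [0.0] * len(vocabulary)
        for key, value in rec.items():
            is_text = isinstance(value, str)
            col = index.get(f"{key}={value}" if is_text else key)
            if col is not None:
                row[col] = 1.0 if is_text else float(value)
        rows.append(row)
    return rows

train_records = [{"age": 30, "city": "A"}, {"age": 40, "city": "B"}]
test_records = [{"age": 25, "city": "C"}]
vocab = fit_vocabulary(train_records)
print(vocab)                           # ['age', 'city=A', 'city=B']
print(to_matrix(test_records, vocab))  # [[25.0, 0.0, 0.0]]
```

The test record's city "C" never appeared in training, so it gets no column. This is why the test set goes through transform() only: its columns have to line up with the ones the model was trained on.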
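StandardScaler: a feature scaling tool that standardizes each feature by subtracting its mean and scaling it to unit variance. Here a StandardScaler object ss is created, and fit_transform() standardizes the training features train_features so that each column has mean 0 and variance 1. The test features test_features then go through transform(), which reuses the mean and standard deviation learned from the training set so that the test data is scaled consistently with the training data.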
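The fit-on-train, transform-on-test pattern can be shown on a single column with the standard library. This is a minimal sketch under the assumption of one numeric feature with non-zero spread; fit_scaler and apply_scaler are hypothetical helper names. It uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize a column with statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_scaler(train)                  # statistics come from train only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)    # test reuses the train statistics
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, and that is expected: recomputing the statistics on the test set would leak information and put the two sets on different scales.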

Training and predicting with a random forest classifier, with grid search for hyperparameter tuning

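from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param_grid = {'n_estimators': range(8, 10)}  # candidate values: 8 and 9
clf = GridSearchCV(rf, param_grid=param_grid, cv=5)
clf.fit(train_features, train_labels)
prediction = clf.predict(test_features)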

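rf = RandomForestClassifier(): creates a random forest classifier object rf. A random forest is an ensemble learning method made up of many decision trees; it combines the trees' outputs by voting or averaging to improve predictive accuracy.
param_grid = {'n_estimators': range(8, 10)}: defines a parameter grid containing the hyperparameter to tune, 'n_estimators', which is the number of trees in the forest. In this grid 'n_estimators' takes the values 8 and 9, because range excludes the upper bound 10.
clf = GridSearchCV(rf, param_grid=param_grid, cv=5): creates a grid search object clf that runs a cross-validated search over the given parameter grid. rf is the classifier to use, param_grid is the grid to search, and cv=5 requests 5-fold cross-validation.
clf.fit(train_features, train_labels): fits the grid search object clf on the training set. Every parameter combination in the grid is trained and scored with cross-validation, and by default the best combination is then refit on the whole training set.
prediction = clf.predict(test_features): predicts labels for the test set test_features with the fitted clf, which delegates to the best model found, and stores the result in prediction.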
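The search procedure itself is a pair of nested loops: for every candidate value, score it on each fold and keep the candidate with the best mean score. The sketch below shows that skeleton only. It uses plain contiguous folds as a simplification, and toy_score is a made-up stand-in for "train on the training fold, score on the validation fold", so its numbers carry no meaning; all function names here are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def grid_search(candidates, score_fn, n_samples, k=5):
    """Return the candidate with the highest mean validation score."""
    mean_scores = {}
    for cand in candidates:
        scores = [score_fn(cand, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_scores[cand] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def toy_score(n_estimators, train_idx, val_idx):
    # Placeholder only: a real scorer would fit a model on train_idx
    # and evaluate it on val_idx. The values below are invented.
    return 0.9 if n_estimators == 9 else 0.8

best, mean_scores = grid_search(range(8, 10), toy_score, n_samples=10)
folds = list(k_fold_indices(10, k=5))
```

With two candidates and five folds, the model is trained ten times during the search, plus once more when the winner is refit. That cost grows quickly as the grid gets larger.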

Prediction & result export

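# Prediction & result export
from collections import Counter
# Set the decision threshold
threshold = 0.15
# Apply the model
# apply_features is assumed to have gone through the same dvec and ss transforms as the test features
apply_prediction = clf.predict_proba(apply_features)[:, 1]
apply_pred_new = (apply_prediction >= threshold).astype(int)
apply_data['predictions'] = apply_pred_new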

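apply_prediction = clf.predict_proba(apply_features)[:, 1]: uses the trained model clf to predict on the standardized feature data apply_features. predict_proba returns the probability of each class for every sample; column 1 is kept, which is the probability of the positive class (class 1).
apply_pred_new = (apply_prediction >= threshold).astype(int): turns the predicted probabilities into binary labels using the threshold. Probabilities greater than or equal to the threshold become 1, those below it become 0, and the result is cast to integers. A threshold of 0.15 is well below the usual 0.5, so more samples are labelled positive.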
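The effect of the threshold is easy to see on a few hand-picked probabilities. This is a minimal sketch in plain Python; the helper name apply_threshold and the sample values are made up for illustration.

```python
def apply_threshold(probabilities, threshold=0.15):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]

probs = [0.05, 0.15, 0.40, 0.90]
print(apply_threshold(probs))       # [0, 1, 1, 1]
print(apply_threshold(probs, 0.5))  # [0, 0, 0, 1]
```

Lowering the threshold from 0.5 to 0.15 turns two more samples into positives. That typically raises recall at the cost of precision, a trade that often makes sense when positives are rare and missing one is expensive.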

Evaluation metrics: accuracy, recall, precision, and F1 score

# Model evaluation
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

y_pred_proba = clf.predict_proba(test_features)[:, 1]
y_pred_new = (y_pred_proba >= threshold).astype(int)
accuracy_new = accuracy_score(test_labels, y_pred_new)
recall_new = recall_score(test_labels, y_pred_new)
precision_new = precision_score(test_labels, y_pred_new)
f1_new = f1_score(test_labels, y_pred_new)
# Rows of cm are true labels, columns are predicted labels
cm = confusion_matrix(test_labels, y_pred_new)
tp = cm[1, 1]
fn = cm[1, 0]
fp = cm[0, 1]
tn = cm[0, 0]

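y_pred_proba = clf.predict_proba(test_features)[:, 1]: uses the trained model clf to predict on the test features test_features. predict_proba returns the probability of each class for every sample; column 1 is kept, which is the probability of the positive class (class 1).
y_pred_new = (y_pred_proba >= threshold).astype(int): turns the predicted probabilities into binary labels using the same threshold. Probabilities greater than or equal to the threshold become 1, those below it become 0, and the result is cast to integers.
accuracy_new = accuracy_score(test_labels, y_pred_new): computes accuracy on the test set, the share of all samples that the model predicts correctly.
recall_new = recall_score(test_labels, y_pred_new): computes recall on the test set, the share of actual positives that the model identifies, TP / (TP + FN).
precision_new = precision_score(test_labels, y_pred_new): computes precision on the test set, the share of predicted positives that are actually positive, TP / (TP + FP).
f1_new = f1_score(test_labels, y_pred_new): computes the F1 score on the test set, the harmonic mean of precision and recall.
cm = confusion_matrix(test_labels, y_pred_new): computes the confusion matrix on the test set, which holds the counts of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). Rows correspond to true labels and columns to predicted labels.
tp = cm[1, 1], fn = cm[1, 0], fp = cm[0, 1], tn = cm[0, 0]: extract the individual entries of the confusion matrix for later analysis and reporting.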
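All four metrics follow directly from the confusion matrix counts, which the following standard-library sketch makes explicit. The helper names and the tiny label lists are made up for illustration, and the code assumes at least one actual positive and one predicted positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Derive accuracy, recall, precision, and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy, recall, precision, f1 = metrics(tp, fn, fp, tn)
print(tp, fn, fp, tn)  # 2 1 1 4
print(accuracy)        # 0.75
```

Here recall and precision both come out at 2/3, so F1 is 2/3 as well, while accuracy is 0.75. On imbalanced data accuracy tends to look better than the other three because the many true negatives dominate it.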