Machine Learning in Practice (Logistic Regression and Binary Classification + Grid Search Cross-Validation)

__LazyCat__

Published 2023-02-02 01:10:56


Article link: https://blog.csdn.net/m0_51547083/article/details/128842025


Using supervised learning, a model trained on a given dataset can predict whether a tumor is benign. The original dataset can be downloaded from:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

Workflow

The dataset description on the official site gives the data format as follows:

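(1) 699 samples with 11 columns: the first column is an id used for lookup, the next 9 columns are tumor-related medical features, and the last column is a numeric code for the tumor class (2 for benign, 4 for malignant).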

(2) There are 16 missing values, marked with "?".

Broadly, the steps are:

Handle the missing values
Standardize the features
Predict with logistic regression

A small tip: to see untruncated output in the console, set the pandas display options first:

import pandas as pd

# Show all columns and rows instead of truncating the output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

After looking at the data description on the official site, the columns can be given feature names:

# Load the dataset and attach column names
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
               'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
               'Normal Nucleoli', 'Mitoses', 'Class']
url = r"https://archive.ics.uci.edu/ml/machine-learning-databases/" \
      r"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
data = pd.read_csv(url, names=column_name)

Next come the missing values. The approach here is the simple and direct one: drop every row that has missing data:

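import numpy as np
# Drop rows with missing data ('?' marks a missing value)
data = data.replace(to_replace='?', value=np.nan)
data.dropna(inplace=True)
data = data.astype(int)  # columns that contained '?' were read as strings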
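
As an alternative sketch of the same cleaning step, `read_csv` can be told to treat "?" as missing through its `na_values` parameter, so the affected column is parsed as numeric from the start. The three rows below are made-up rows in the dataset's layout (an assumption for illustration, not real records), read from an in-memory string instead of the URL.

```python
import io

import pandas as pd

column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
               'Uniformity of Cell Shape', 'Marginal Adhesion',
               'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
               'Normal Nucleoli', 'Mitoses', 'Class']

# Hypothetical rows in the same 11-column layout; the second one has a missing value
raw = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,8,4,5,1,2,?,7,3,1,4\n"
    "1000003,3,1,1,1,2,2,3,1,1,2\n"
)

# na_values="?" makes pandas parse "?" as NaN while reading
sample = pd.read_csv(raw, names=column_name, na_values="?")

# Count missing values per column before dropping anything
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop the incomplete rows, as in the article
cleaned = sample.dropna()
print(cleaned.shape)
```

Counting the missing values first is a cheap check that the number of dropped rows matches what the dataset description promises.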

Although the features appear to be on similar scales, it is still best to standardize them.
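Split the dataset first and standardize afterwards, so that the scaler is fitted on the training set only and the test set is transformed with the training statistics: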

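from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the dataset: drop the id column, use the last column as the label
x, y = data.iloc[:, 1:-1], data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=114514, test_size=0.3)

# Standardize: fit on the training set, reuse its statistics for the test set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)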
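
To make the standardization step concrete, the sketch below checks on a small made-up array that `StandardScaler` subtracts each column's training mean and divides by its training standard deviation, and that the test data is transformed with those same training statistics. The arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 training samples and 1 test sample, 2 features each
toy_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 40.0]])
toy_test = np.array([[3.0, 20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from training data
test_scaled = scaler.transform(toy_test)        # reuses the training mean and std

# Manual computation: z = (x - mean) / std, with the population std (ddof=0)
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
manual_train = (toy_train - mean) / std
manual_test = (toy_test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

The test sample here equals the training mean in both columns, so it is mapped to zeros; a test sample far from the training mean would keep a large standardized value.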

Next comes training:

from sklearn.linear_model import LogisticRegression

# Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

The test set can then be used to measure the model's accuracy:

# Model performance: test accuracy, then the learned weights and bias
print(estimator.score(x_test, y_test))
print(estimator.coef_, estimator.intercept_)

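The result is fairly clear: the test accuracy is as high as 95%. (Disease prediction is normally evaluated with precision or recall rather than accuracy; these are not covered in detail here.)
The second line of output shows the weight $w_{i}$ of each feature and the bias $b$.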
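
As a rough illustration of what these weights mean, a fitted binary logistic regression computes $z = w \cdot x + b$ and passes it through the sigmoid function to get the probability of the positive class. The sketch below checks this against `predict_proba` on synthetic data from `make_classification` (a stand-in chosen for illustration, not the breast cancer dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 9 features, standing in for the real data
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear score z = w . x + b from the learned weights and bias
z = X @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid turns the score into the probability of the positive class
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of model.classes_[1]
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

Because the features were standardized in the workflow above, a larger absolute weight there roughly indicates a feature with more influence on the prediction.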
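
Since precision and recall are mentioned only in passing, here is a minimal sketch of how they could be computed with scikit-learn. The label arrays are made up for illustration; they use the dataset's class coding (2 for benign, 4 for malignant) and treat malignant as the positive class through `pos_label=4`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 = malignant (positive class), 2 = benign
y_true = [4, 4, 4, 4, 2, 2, 2, 2, 2, 2]
y_pred = [4, 4, 4, 2, 4, 2, 2, 2, 2, 2]

# Precision: of the samples predicted malignant, how many really are
precision = precision_score(y_true, y_pred, pos_label=4)
# Recall: of the truly malignant samples, how many were found
recall = recall_score(y_true, y_pred, pos_label=4)

print(precision, recall)  # 0.75 0.75
```

In the workflow above, the same calls would take `y_test` and `estimator.predict(x_test)`. For cancer screening, recall on the malignant class usually matters most, since a missed malignant case is the costlier error.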

Going Further

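Grid search with cross-validation can be used to try to improve the model's accuracy. It evaluates every candidate value of the hyperparameters you want to test, and for each candidate it repeatedly splits the training data into training and validation folds, rotating which fold is held out for validation.
The procedure is the same as above, except that you create a grid search cross-validation object, set your hyperparameter candidates, and then train and predict: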

from sklearn.model_selection import GridSearchCV

# Set up grid search with 10-fold cross-validation over C
estimator = LogisticRegression()
param = {"C": [0.1, 0.5, 1]}
gc = GridSearchCV(estimator, param_grid=param, cv=10)
gc.fit(x_train, y_train)

# Model performance: test accuracy, best model and parameters, full CV results
print(gc.score(x_test, y_test))
print(gc.best_estimator_, gc.best_params_)
print(gc.cv_results_)
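
The raw `cv_results_` dictionary is hard to read when printed directly. One option, sketched below on synthetic data from `make_classification` (a stand-in for the prepared training set, not the real data), is to load it into a pandas DataFrame and look at the mean validation score for each candidate `C`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for x_train, y_train
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

search = GridSearchCV(LogisticRegression(), param_grid={"C": [0.1, 0.5, 1]}, cv=10)
search.fit(X, y)

# One row per candidate; keep only the columns that summarize each candidate
results = pd.DataFrame(search.cv_results_)
summary = results[["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))

# best_score_ is the mean cross-validated score of the best candidate
print(search.best_params_, search.best_score_)
```

Note that `mean_test_score` here refers to the validation folds inside the training data, so it is a different number from `gc.score(x_test, y_test)` on the held-out test set.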