Machine Learning in Practice 2.1.1.1: Linear Classifiers

早起的鸟儿有虫吃h

Column: Python Machine Learning in Practice: From Zero to Kaggle Competitions (Python机器学习及实践从零开始通往Kaggle竞赛之路)

Article link: https://blog.csdn.net/u013383813/article/details/78311037

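Linear classifiers

This post uses two linear classifiers from sklearn, LogisticRegression and SGDClassifier, to classify breast tumors as benign or malignant.

Book pages: p35-p43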

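Code 13: data preprocessing

1. Read the CSV data: (1) create the list of column names; (2) read the data with pandas.read_csv (from a URL or a local file).

2. Handle missing data (drop the affected rows).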

# Code 13: preprocessing the benign/malignant breast cancer tumor data

import pandas as pd
import numpy as np

# Create the list of column names
column_names = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
                'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
# Read the data with pandas.read_csv (a local copy is used here; read_csv also accepts a URL)
data = pd.read_csv(r'D:\MainPart\Waytokaggle\data\breast-cancer-wisconsin.data',names = column_names)
# Replace '?' with the standard missing-value marker
data = data.replace(to_replace = '?',value = np.nan)
# Drop every row that contains a missing value
data = data.dropna(how = 'any')
# Show the number of rows and columns of data
data.shape
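The two cleaning steps above can be tried on a tiny in-memory table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the real data file. It also shows one optional extra step that is not in the original code, converting the column that contained '?' back to a numeric type with pd.to_numeric, since pandas reads such a column as text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data set: '?' marks a missing value.
toy = pd.DataFrame({
    'Clump Thickness': [5, 3, 8, 1],
    'Bare Nuclei': ['1', '?', '10', '2'],
    'Class': [2, 2, 4, 2],
})

# Same two steps as in Code 13: '?' -> NaN, then drop incomplete rows.
clean = toy.replace(to_replace='?', value=np.nan).dropna(how='any')

# Optional extra step (an assumption, not in the original code):
# the column held strings because of '?', so convert it to numbers.
clean = clean.assign(**{'Bare Nuclei': pd.to_numeric(clean['Bare Nuclei'])})

print(clean.shape)   # one of the four rows has been dropped
print(clean.dtypes)
```

On the real file the same pattern removes every sample that has a '?' in any column.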

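Code 14: splitting the data

1. Split the data into a training set and a test set.

2. Check the size and class distribution of the training and test sets.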

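# Code 14: preparing the benign/malignant breast cancer tumor training and test data

# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Randomly sample 25% of the data for testing; the remaining 75% forms the training set
x_train,x_test,y_train,y_test = train_test_split(data[column_names[1:10]],data[column_names[10]],test_size = 0.25,random_state = 33)

# Check the size and class distribution of the training and test sets
print(y_train.value_counts())
print(y_test.value_counts())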
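To see what test_size = 0.25 does to the set sizes, here is a small sketch on made-up data. The arrays `X` and `y` are hypothetical stand-ins, not the tumor data. It also passes stratify=y, which is not used in the original code; it is included only to show one way of keeping the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 2 features,
# labels 2 (benign) and 4 (malignant) in a 60/40 ratio.
X = np.arange(200).reshape(100, 2)
y = np.array([2] * 60 + [4] * 40)

# stratify=y is an optional addition (not in the original code):
# it preserves the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=33, stratify=y)

print(len(X_tr), len(X_te))                  # 75 25
print(np.unique(y_te, return_counts=True))   # counts per class in the test set
```

Without stratify the split sizes are the same, but the class counts in each subset depend on the random draw.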

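Code 15: classification with the sklearn linear classifiers LogisticRegression and SGDClassifier

1. Standardize the data so that each dimension has mean 0 and variance 1.

2. For both LogisticRegression and SGDClassifier: (1) initialize the model; (2) train the model parameters with fit; (3) predict with the trained model using predict and save the results.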
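Standardization replaces each value x in a column with (x - mean) / std. The sketch below checks this by hand against StandardScaler on a few made-up numbers (the `train` and `test` arrays are hypothetical). It assumes the common convention of learning the mean and standard deviation from the training set only and reusing them for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values; the second feature has a much larger scale than the first.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

ss = StandardScaler()
train_scaled = ss.fit_transform(train)  # learns the per-column mean and std from train
test_scaled = ss.transform(test)        # reuses the training mean and std

# Manual version of the same formula (population std, ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))  # True
print(test_scaled)
```

After scaling, both training columns have mean 0 and variance 1, so neither feature dominates merely because of its units.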

# Code 15: using linear classification models to predict benign/malignant tumors

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# Standardize the data so that each dimension has mean 0 and variance 1,
# so that the prediction is not dominated by features with large values
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
# Reuse the mean and variance learned from the training set; do not refit on the test set
x_test = ss.transform(x_test)

# Initialize LogisticRegression and SGDClassifier
lr = LogisticRegression()
sgdc = SGDClassifier()

# Call the fit function of LogisticRegression to train the model parameters
lr.fit(x_train,y_train)
# Predict x_test with the trained model lr; the result is stored in lr_y_predict
lr_y_predict = lr.predict(x_test)

# Call the fit function of SGDClassifier to train the model parameters
sgdc.fit(x_train,y_train)
# Predict x_test with the trained model sgdc; the result is stored in sgdc_y_predict
sgdc_y_predict = sgdc.predict(x_test)
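Both models are linear classifiers: each computes a score w·x + b and predicts one class when the score is positive and the other when it is not. The sketch below reproduces predict by hand from the fitted coef_ and intercept_. It runs on synthetic data from make_classification, not on the tumor data, and random_state=0 is an arbitrary choice made only so the run is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Synthetic two-class data (a stand-in, not the tumor data set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

results = {}
for model in (LogisticRegression(), SGDClassifier(random_state=0)):
    model.fit(X, y)
    # Linear score for every sample: w . x + b
    scores = X @ model.coef_.ravel() + model.intercept_[0]
    # Positive score -> second class in classes_, otherwise the first
    manual = model.classes_[(scores > 0).astype(int)]
    results[type(model).__name__] = (
        np.allclose(scores, model.decision_function(X)),
        np.array_equal(manual, model.predict(X)),
    )

print(results)
```

The two models differ in how they find w and b: LogisticRegression solves for them with a batch optimizer, while SGDClassifier updates them sample by sample with stochastic gradient descent.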

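Code 16: performance analysis

1. Use each model's built-in score function to get its accuracy on the test set.

2. Use classification_report to get the precision, recall, and F1-score for each class.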

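# Code 16: performance analysis of the linear classifiers on the benign/malignant tumor prediction task

from sklearn.metrics import classification_report

# Use the built-in score function of the logistic regression model to get its accuracy on the test set
print('Accuracy of LR Classifier:',lr.score(x_test,y_test))

# Use classification_report to get the other three metrics (precision, recall, F1-score) for LogisticRegression
print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))

# Use the built-in score function of the SGD (stochastic gradient descent) model to get its accuracy on the test set
print('Accuracy of SGD classifier:',sgdc.score(x_test,y_test))

# Use classification_report to get the other three metrics (precision, recall, F1-score) for SGDClassifier
print(classification_report(y_test,sgdc_y_predict,target_names = ['Benign','Malignant']))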
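The precision and recall printed by classification_report are derived from the confusion matrix, which sklearn provides separately as confusion_matrix. The sketch below works through a small example; the labels in `y_true` and `y_pred` are made up for illustration and are not results from the models above. Malignant (label 4) is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 2, 4, 4, 4, 2]
y_pred = [2, 2, 4, 2, 4, 4, 2, 2]

# Rows are true classes, columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted malignant cases, how many really are
recall = tp / (tp + fn)     # of the truly malignant cases, how many were found

print(cm)                   # [[4 1]
                            #  [1 2]]
print(precision, recall)
print(precision_score(y_true, y_pred, pos_label=4),
      recall_score(y_true, y_pred, pos_label=4))
```

In a tumor-screening setting, recall on the malignant class is often the figure to watch, because a false negative means a missed malignant case.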
