Linear Classifiers: Tumor Prediction

cicilover

Category: machine learning. Tags: linear classifier, LogisticRegression, SGDClassifier, LR

Article link: https://blog.csdn.net/cicilover/article/details/77259857

Tumor prediction dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

Note: in what follows, samples with missing values are simply dropped for now.
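As a small illustration of this drop-the-row policy, the sketch below uses a made-up three-row frame in which one row carries the `?` marker that the raw file uses for missing entries. Only the column names come from the dataset; the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names come from the real dataset.
toy = pd.DataFrame({
    'Clump Thickness': [5, 3, 8],
    'Bare Nuclei': ['1', '?', '10'],  # '?' marks a missing value in the raw file
    'Class': [2, 2, 4],
})

# Turn the '?' marker into NaN, then drop every row that contains a NaN.
cleaned = toy.replace('?', np.nan).dropna(how='any')

# The column was read as strings because of the '?'; convert it to integers.
cleaned = cleaned.astype({'Bare Nuclei': int})

print(cleaned.shape)
print(cleaned['Bare Nuclei'].tolist())
```

Dropping rows is the simplest policy; it is reasonable here only because few samples are affected.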

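This post classifies the dataset with LogisticRegression and with a stochastic gradient descent classifier (SGDClassifier), then reports performance statistics for the predictions.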
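To make the idea of stochastic gradient descent concrete, here is a minimal hand-written sketch of logistic regression trained one sample at a time. It is not what scikit-learn does internally. The function name `sgd_logistic`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=50, seed=0):
    """Hypothetical helper: fit weights w and bias b with per-sample updates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):               # visit samples in random order
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))   # predicted P(y = 1)
            grad = p - y[i]                             # d(log loss) / d(logit)
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

# Toy, linearly separable data: label 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = sgd_logistic(X, y)
pred = ((X @ w + b) > 0).astype(int)  # positive logit -> class 1
print(pred)
```

Each update uses the gradient from a single sample, which is what distinguishes the stochastic variant from batch gradient descent.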

Python source:

# coding=utf-8
import pandas as pd
import numpy as np
# -------------
# use train_test_split to split the data
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
# -------------
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
# -------------
from sklearn.metrics import classification_report


# -------------download data
# create the list of column names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

# use pandas.read_csv to read the data from the internet
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names=column_names)

# replace '?' with the standard missing-value representation
data = data.replace(to_replace='?', value=np.nan)
# drop every row that has a missing value in one or more columns
data = data.dropna(how='any')
# print the number of samples and the number of columns
print(data.shape)

# -------------prepare training and testing data
# randomly select 25% of the data for testing and 75% for training
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:10]], data[column_names[10]], test_size=0.25, random_state=33)
# check the class counts in the training data
print(y_train.value_counts())
# check the class counts in the testing data
print(y_test.value_counts())

# -------------use linear classification models to make predictions
# standardize the data so that every feature has mean 0 and variance 1;
# this keeps features with large values from dominating the result
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# initialize LogisticRegression and SGDClassifier
lr = LogisticRegression()
sgdc = SGDClassifier()

# call fit on LogisticRegression to train the model parameters
lr.fit(X_train, y_train)
# use the trained model lr to predict on X_test and store the result in lr_y_predict
lr_y_predict = lr.predict(X_test)

# call fit on SGDClassifier to train the model parameters
sgdc.fit(X_train, y_train)
# use the trained model sgdc to predict on X_test and store the result in sgdc_y_predict
sgdc_y_predict = sgdc.predict(X_test)

# -------------performance analysis
# use the score function provided by the LR model to get the accuracy
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
# report the other three metrics (precision, recall, F1)
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

# use the score function provided by the SGD model to get the accuracy
print('Accuracy of SGD Classifier:', sgdc.score(X_test, y_test))
# report the other three metrics (precision, recall, F1)
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))
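The standardization step in the listing can be checked in isolation. The sketch below uses synthetic numbers (assumed values, not the tumor data) with two features on very different scales. The scaler is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (assumed values, not the tumor data).
X_train = np.column_stack([rng.normal(5, 2, 200), rng.normal(1000, 300, 200)])
X_test = np.column_stack([rng.normal(5, 2, 50), rng.normal(1000, 300, 50)])

ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)  # learn mean and std from the training set only
X_test_std = ss.transform(X_test)        # reuse the training statistics on the test set

print(X_train_std.mean(axis=0).round(6))  # approximately 0 for each column
print(X_train_std.std(axis=0).round(6))   # approximately 1 for each column
print(X_test_std.mean(axis=0).round(2))   # close to 0, but generally not exactly 0
```

Calling `fit_transform` on the test set instead would leak test-set statistics into preprocessing, which is why the listing calls only `transform` there.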

Result:

(683, 11)
2    344
4    168
Name: Class, dtype: int64
2    100
4     71
Name: Class, dtype: int64
Accuracy of LR Classifier: 0.988304093567
             precision    recall  f1-score   support

     Benign       0.99      0.99      0.99       100
  Malignant       0.99      0.99      0.99        71

avg / total       0.99      0.99      0.99       171

Accuracy of SGD Classifier: 0.982456140351
             precision    recall  f1-score   support

     Benign       1.00      0.97      0.98       100
  Malignant       0.96      1.00      0.98        71

avg / total       0.98      0.98      0.98       171
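
The reported SGD figures can be cross-checked by hand. An accuracy of 0.982456140351 equals 168/171, so the classifier made three errors on the 171 test samples. Combined with the per-class recalls, this is consistent with three benign samples predicted as malignant and no malignant sample missed. The counts below are inferred from the printed report; they are an assumption, not output from an actual run.

```python
# Counts inferred from the SGD report (an assumption, not program output).
# Rows = true class, columns = predicted class, order = (Benign, Malignant).
tp_benign, benign_as_malignant = 97, 3
malignant_as_benign, tp_malignant = 0, 71

total = tp_benign + benign_as_malignant + malignant_as_benign + tp_malignant
accuracy = (tp_benign + tp_malignant) / total

precision_benign = tp_benign / (tp_benign + malignant_as_benign)
recall_benign = tp_benign / (tp_benign + benign_as_malignant)
precision_malignant = tp_malignant / (tp_malignant + benign_as_malignant)
recall_malignant = tp_malignant / (tp_malignant + malignant_as_benign)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(accuracy, 6))
print(round(precision_benign, 2), round(recall_benign, 2),
      round(f1(precision_benign, recall_benign), 2))
print(round(precision_malignant, 2), round(recall_malignant, 2),
      round(f1(precision_malignant, recall_malignant), 2))
```

For tumor screening the malignant recall matters most, since a missed malignant case is costlier than a false alarm.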