System environment
Dataset
Code
"""
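Linear classification of benign/malignant breast tumor data
Model comparison:
LogisticRegression (longer training time, slightly higher performance)
SGDClassifier (shorter training time, slightly lower performance)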
"""
import pandas as pd
import numpy as np
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
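# Load the raw data file and assign the column names defined above.
data = pd.read_csv("breast-cancer-wisconsin.txt", names=column_names)
# Missing values are marked with '?'; convert them to NaN, then drop every row that has any.
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna(how='any')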
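# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split
# Columns 1-9 are the features and column 10 ('Class') is the label; hold out 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:10]], data[column_names[10]], test_size=0.25, random_state=33)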
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
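# Standardize each feature to zero mean and unit variance.
# The scaler is fitted on the training set only and then applied to the test set.
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)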
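# Both models use their default settings; note that SGDClassifier defaults to hinge loss (a linear SVM).
lr = LogisticRegression()
sgdc = SGDClassifier()
# Train each model on the training set and predict labels for the test set.
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)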
from sklearn.metrics import classification_report
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))
print('Accuracy of SGD Classifier:', sgdc.score(X_test, y_test))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))
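Comparison of the linear classification models
- LogisticRegression (longer training time, slightly higher performance)
- SGDClassifier (shorter training time, slightly lower performance)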
![](https://img-blog.csdn.net/20170430190157671?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvbnMyMjUwMjI1/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
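The time/accuracy trade-off noted above can be checked directly by timing each `fit` call. The snippet below is only a rough sketch, not the measurement behind the figure: it assumes synthetic data from `make_classification` as a stand-in for the tumor data set (so it runs without the data file), and the helper name `compare_linear_classifiers` is made up for illustration. Timings on data this small are noisy, so the ordering may vary from run to run.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_linear_classifiers(X, y, random_state=33):
    """Return {model name: (fit seconds, test accuracy)} for both classifiers."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state)

    # Fit the scaler on the training split only, as in the script above.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    models = {
        'LogisticRegression': LogisticRegression(),
        'SGDClassifier': SGDClassifier(random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        results[name] = (elapsed, model.score(X_test, y_test))
    return results


# Synthetic stand-in data (an assumption): 9 features, like the tumor data set.
X, y = make_classification(n_samples=2000, n_features=9, n_informative=5,
                           random_state=0)
results = compare_linear_classifiers(X, y)
for name, (seconds, accuracy) in results.items():
    print(f'{name}: fit in {seconds:.4f} s, test accuracy {accuracy:.3f}')
```

To run the same comparison on the real data, pass `data[column_names[1:10]]` and `data[column_names[10]]` in place of the synthetic `X` and `y`.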