Classification Models
Metrics for evaluating classification models: accuracy, F-measure, precision, recall
Logistic Regression
- The most classic and widely used model in risk assessment
- Usually solves binary classification, i.e., the prediction target y takes values in {1, -1}
From linear regression to logistic regression
- Problem with linear regression: y is a continuous real value rather than a discrete label. Solution: apply the logistic (sigmoid) function to map the continuous output into (0, 1)
- When the input is very large the function output approaches 1, and when it is very small the output approaches 0; σ(0) = 0.5
- The logistic function maps any real number into (0, 1); in logistic regression this value is interpreted as the probability that a sample belongs to the positive class y = 1. Writing σ(z) = 1/(1 + exp(−z)):
- Probability that sample x_i is positive: p(y_i = 1 | x_i) = σ(wᵀx_i) = 1/(1 + exp(−wᵀx_i))
- Probability that sample x_i is negative: p(y_i = −1 | x_i) = 1 − σ(wᵀx_i) = 1/(1 + exp(wᵀx_i))
- The two cases can be unified as: p(y_i | x_i) = 1/(1 + exp(−y_i wᵀx_i))
- The logistic function thus links the binary prediction target to the original input
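The mapping above can be checked with a minimal sketch (plain NumPy, independent of the practice code later in these notes):

```python
import numpy as np

# Logistic (sigmoid) function: maps any real number into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```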
Parameter estimation
- Binary classification [logistic function]
- Training set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}; estimate the parameters w by maximum likelihood. Likelihood function: L(w) = ∏_{i=1}^{n} p(y_i | x_i) = ∏_{i=1}^{n} 1/(1 + exp(−y_i wᵀx_i))
- Taking the negative logarithm of L(w) gives the negative log-likelihood of the training set: NLL(w) = ∑_{i=1}^{n} log(1 + exp(−y_i wᵀx_i))
- The objective is min NLL(w) [gradient descent]: w ← w − η∇NLL(w), where η is the learning rate and ∇NLL(w) is the gradient of the objective with respect to the parameters w
- Write σ(z) = 1/(1 + exp(−z))
- Because dσ(z)/dz = σ(z)(1 − σ(z)), differentiating each term log(1 + exp(−y_i wᵀx_i)) with respect to w gives −y_i x_i (1 − σ(y_i wᵀx_i))
- So ∇NLL(w) = −∑_{i=1}^{n} y_i x_i (1 − σ(y_i wᵀx_i))
- w ← w + η ∑_{i=1}^{n} y_i x_i (1 − σ(y_i wᵀx_i))
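The update rule above can be sketched end to end; the synthetic data, learning rate, and iteration count below are assumptions chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of NLL(w) = sum_i log(1 + exp(-y_i * w.x_i)):
# grad = -sum_i y_i * x_i * sigmoid(-y_i * w.x_i)
def nll_grad(w, X, y):
    s = sigmoid(-y * (X @ w))              # shape (n,)
    return -(X * (y * s)[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = np.where(X @ true_w > 0, 1, -1)        # synthetic labels in {1, -1}

w = np.zeros(2)
eta = 0.01                                  # learning rate
for _ in range(300):
    w = w - eta * nll_grad(w, X, y)

pred = np.where(X @ w > 0, 1, -1)
print("train accuracy:", (pred == y).mean())
```

Since the toy data is linearly separable, the learned w should classify nearly all training samples correctly.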
- Multi-class classification
- Replace the logistic function with the softmax function
- Training set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where each sample may belong to one of C classes, i.e., y_i ∈ {1, 2, …, C}. Binary logistic regression has a single d-dimensional parameter vector w, but in the multi-class case each class c needs its own parameter vector w_c, so the parameters are W = {w_1, w_2, …, w_C}
- The conditional probability is p(y_i = c | x_i) = exp(w_cᵀx_i) / ∑_{c'=1}^{C} exp(w_{c'}ᵀx_i)
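A minimal sketch of the softmax probability; the parameter matrix W (one row w_c per class) and the sample x are made up for illustration:

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    e = np.exp(scores - scores.max())
    return e / e.sum()

# W: one weight vector w_c per class (C x d); x: a single sample (d,)
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 0.5])

p = softmax(W @ x)      # p(y = c | x) for each of the C = 3 classes
print(p, p.sum())       # probabilities sum to 1
```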
K-Nearest Neighbors (KNN)
- The most classic and simplest supervised learning method; suitable when there is little or no prior knowledge about the data distribution; can solve both classification and regression problems
- Simple principle: to classify a test sample, first scan the training set and find the k training samples most similar to it, then determine the test sample's class by a vote over the classes of these k samples [whichever class is most frequent among the k samples is predicted; visualized in 2D, draw a circle centered at the test sample and predict the class that dominates inside it]. The vote can also be weighted by each of the k samples' similarity to the test sample
- Procedure:
- 1. Choose k and the distance metric;
- 2. Find the k training samples most similar to the test sample;
- 3. Determine the test sample's class by a vote over the classes of these k nearest samples
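The three steps above can be sketched from scratch (Euclidean distance, majority vote); the toy data is invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Euclidean distance from the test sample to every training sample
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2. indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # 3. majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # 1
```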
- Core problems
- 1. How to find a test sample's "neighbors", i.e., how to compute the distance or similarity between samples
- 2. How to choose k
- 3. How to predict quickly when the training set is large or high-dimensional
- When the sample data can be represented as multi-dimensional continuous values, the Euclidean distance is a natural choice
- Choosing k
- Tune it empirically by trying different values (e.g., via cross-validation)
- Improving prediction performance
- The time complexity of a KNN prediction is O(n × d) for n training samples of dimension d; index structures such as k-d trees are commonly used to speed this up
- Pros: simple and practical, easy to implement; for binary classification, as the training set grows to infinity, the generalization error of KNN is bounded above by twice the Bayes error rate; insensitive to outliers [noise robustness]
- Cons: computationally inefficient; prone to overfitting when the training set is small
- Besides classification, KNN can also solve regression problems:
- Predicted value for sample x: a (distance-weighted) average of the k nearest neighbors' target values, e.g., ŷ(x) = ∑_i w_i y_i / ∑_i w_i with weights w_i = 1/d(x, x_i), where d(x_i, x_j) is the distance between samples x_i and x_j
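A sketch of distance-weighted KNN regression with inverse-distance weights; the 1-D toy data is invented for illustration:

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3, eps=1e-8):
    # distance from the test point to every training sample
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # weights inversely proportional to distance (eps avoids division by zero)
    w = 1.0 / (dists[nearest] + eps)
    # weighted average of the k nearest targets
    return float((w * y_train[nearest]).sum() / w.sum())

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X_train, y_train, np.array([2.1]), k=3))  # close to 2
```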
Decision Trees
- Core question: how to automatically build a decision tree from training data
Building a decision tree
- Start from the root node and choose a feature for it; then choose a split point for that feature and split the node accordingly
- A node stops splitting when its samples all belong to one class (classification) or its variance is small (regression)
- Core problems: choosing the feature at each node and the split point for that feature
- Impurity: measures how evenly the classes are distributed among the samples falling into the current node [the goal of splitting: make the class distribution less even, i.e., reduce impurity]; the lower the impurity the better
- Choose the feature and split point according to how much they reduce impurity
- Node impurity measures: Gini index, information entropy, misclassification rate
- Gini index: Gini(t) = 1 − ∑_c p(c|t)², where p(c|t) is the fraction of samples of class c at node t
- When the samples are uniformly distributed over the C classes, the Gini index is 1 − 1/C: high impurity
- When all samples belong to one class, the Gini index is 0: low impurity
- When a node t is split into K child nodes, the Gini index of the split is Gini_split = ∑_{k=1}^{K} (n_k/n) Gini(k), where n_k is the number of samples in child k and n the number at node t
- Choose the feature whose split maximizes Gini(t) − Gini_split
- Information entropy: measures the uncertainty of information, Entropy(t) = −∑_c p(c|t) log₂ p(c|t)
- When the samples are uniformly distributed over the C classes, the entropy is log₂C: high impurity
- When all samples belong to one class, the entropy is 0: low impurity
- Information gain: the drop in entropy from before the node split to after it
- Drawback: favors splitting into many small child nodes (each with few samples), which easily overfits
- Information gain ratio: information gain divided by the split information, GainRatio = InfoGain / SplitInfo
- SplitInfo = −∑_k (n_k/n) log₂(n_k/n) captures the distribution of sample counts over the child nodes
- Misclassification rate
- The fraction of samples misclassified when predicting every sample at the current node as the node's majority class: Error(t) = 1 − max_c p(c|t)
- When the samples are uniformly distributed over the C classes, the misclassification rate is 1 − 1/C: maximal impurity
- When all samples belong to one class, the misclassification rate is 0: minimal impurity
- Node impurity
- For binary classification, with p the fraction of positive samples:
- Gini: 2p(1 − p)
- Entropy: −p log₂p − (1 − p) log₂(1 − p)
- Misclassification rate: 1 − max(p, 1 − p)
- Entropy is the largest of the three, so it penalizes impurity most strongly
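The three binary impurity measures above can be compared directly; a small sketch:

```python
import numpy as np

# Node impurity for binary classification; p = fraction of positive samples
def gini(p):
    return 2 * p * (1 - p)

def entropy(p):
    if p in (0.0, 1.0):   # a pure node has zero entropy
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def misclass(p):
    return 1 - max(p, 1 - p)

# Entropy is the largest of the three at every impure p
for p in (0.5, 0.9, 1.0):
    print(p, gini(p), entropy(p), misclass(p))
```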
Pruning a decision tree
- A tree that splits too finely becomes overly complex and overfits [e.g., if every leaf contains a single sample, the training error is 0, but the tree is very likely to overfit]
Naive Bayes
Practice code
KNN
#K-nearest neighbors
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

##Integer-encode the diagnosis column
def integer_coding(diagnosis):
    diagnosis_dict = {"B": 0, "M": 1}
    diagnosis = diagnosis.map(diagnosis_dict)
    print(diagnosis)
    return diagnosis

##Min-max normalization
def min_max_normalize(x):
    return (x - x.min())/(x.max() - x.min())

##Split into training and test sets
def partition(dataset, y):
    dataset_minmax_train, dataset_minmax_test,\
    dataset_train_labels, dataset_test_labels \
        = train_test_split(dataset, y, test_size=0.3, random_state=0)
    print(dataset_train_labels.value_counts()/len(dataset_train_labels))
    print(dataset_test_labels.value_counts()/len(dataset_test_labels))
    return dataset_minmax_train, dataset_minmax_test,\
           dataset_train_labels, dataset_test_labels
##Model training and evaluation
def knn(minmax_train, train_labels, minmax_test, n_neighbors, test_labels):
    #train
    knn_model = KNeighborsClassifier(n_neighbors)
    knn_model.fit(minmax_train, train_labels)
    test_pred = knn_model.predict(minmax_test)
    #test
    accuracy = metrics.accuracy_score(test_labels, test_pred)
    confusion_mat = metrics.confusion_matrix(test_labels, test_pred)
    print("Classification report:\n", metrics.classification_report(test_labels, test_pred))
    print("k = ", n_neighbors)
    print("\tAccuracy: ", '%.2f'%(accuracy*100) + "%")
    #confusion_matrix rows are true labels, columns are predictions:
    #[0,1] = benign predicted malignant (false positive), [1,0] = malignant predicted benign (false negative)
    print("\tFalse positives:", confusion_mat[0, 1])
    print("\tFalse negatives:", confusion_mat[1, 0])
    return test_pred
##Z-score standardization
def Z_Score(dataset):
    dataset_zscore = pd.DataFrame(preprocessing.scale(dataset),\
                                  columns=dataset.columns)
    print(dataset_zscore.head(5))
    print(dataset_zscore.area_mean.mean())
    print(dataset_zscore.area_mean.std())
    return dataset_zscore
if __name__ == "__main__":
    breast_cancer = pd.read_csv('wisc_bc_data.csv',\
                                engine='python')
    print(breast_cancer.shape)
    print(breast_cancer.head(10))
    del breast_cancer["id"]
    print(breast_cancer.diagnosis.value_counts())
    print(breast_cancer.diagnosis.value_counts()/len(breast_cancer))
    breast_cancer["diagnosis"] = integer_coding(breast_cancer["diagnosis"])
    print(breast_cancer[["radius_mean", "area_mean", "smoothness_mean"]].describe())
    ##Min_Max
    print('\n')
    print("-------------------------------Min_Max-------------------------------")
    print('\n')
    #Normalize the features with min-max
    for col in breast_cancer.columns[1:31]:
        breast_cancer[col] = min_max_normalize(breast_cancer[col])
    print(breast_cancer.iloc[:, 1:].describe())  #check the normalization
    #Split the dataset
    y = breast_cancer['diagnosis']
    del breast_cancer['diagnosis']
    breast_cancer_minmax_train, breast_cancer_minmax_test,\
    breast_cancer_train_labels, breast_cancer_test_labels = partition(breast_cancer, y)
    print(breast_cancer_minmax_train)
    #knn
    k_list = (1, 3, 5, 7, 9, 11, 15, 21, 27)
    for k in k_list:
        knn(breast_cancer_minmax_train, breast_cancer_train_labels, \
            breast_cancer_minmax_test, k, breast_cancer_test_labels)
    ##Z-Score
    print('\n')
    print("-------------------------------Z-Score-------------------------------")
    print('\n')
    #Standardize the features with z-score
    breast_cancer_zscore = Z_Score(breast_cancer)
    #Split the dataset
    breast_cancer_zscore_train, breast_cancer_zscore_test,\
    breast_cancer_zscore_train_labels, breast_cancer_zscore_test_labels = partition(breast_cancer_zscore, y)
    #knn
    k_list = (1, 3, 5, 7, 9, 11, 15, 21, 27)
    for k in k_list:
        knn(breast_cancer_zscore_train, breast_cancer_zscore_train_labels,\
            breast_cancer_zscore_test, k, breast_cancer_zscore_test_labels)
Naive Bayes
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

##Integer-encode the diagnosis column
def integer_coding(diagnosis):
    diagnosis_dict = {"B": 0, "M": 1}
    diagnosis = diagnosis.map(diagnosis_dict)
    print('!!', diagnosis)
    return diagnosis

##Min-max normalization
def min_max_normalize(x):
    return (x - x.min())/(x.max() - x.min())

##Split into training and test sets
def partition(dataset, y):
    dataset_minmax_train, dataset_minmax_test,\
    dataset_train_labels, dataset_test_labels \
        = train_test_split(dataset, y, test_size=0.3, random_state=0)
    print(dataset_train_labels.value_counts()/len(dataset_train_labels))
    print(dataset_test_labels.value_counts()/len(dataset_test_labels))
    return dataset_minmax_train, dataset_minmax_test,\
           dataset_train_labels, dataset_test_labels
def createNBClassifier(train_matrix, train_label):
    ## number of training samples
    num = len(train_matrix)
    ## number of features (columns) in train_matrix
    num_col = len(train_matrix[0])
    ## Accumulate per-feature totals for each class, i.e. [num(x_1|c=1), num(x_2|c=1), ...]; likewise for c = 0
    ## Initialize every count to 1 (Laplace smoothing) to avoid zero probabilities
    p1_num = np.ones(num_col)
    p0_num = np.ones(num_col)
    ## overall totals for the malignant and benign classes
    p1_vec_denom = 0.
    p0_vec_denom = 0.
    for index in range(num):
        if train_label[index] == 1:
            p1_num = p1_num + np.array(train_matrix[index])
            p1_vec_denom = p1_vec_denom + np.sum(train_matrix[index])
        else:
            p0_num = p0_num + np.array(train_matrix[index])
            p0_vec_denom = p0_vec_denom + np.sum(train_matrix[index])
    ####From the counts gathered over the training set,
    ##compute [p(x_1|c=1), p(x_2|c=1), ...] and [p(x_1|c=0), p(x_2|c=0), ...]
    ####Take logs to avoid underflow; add 2 to the denominator to match the counts initialized to 1
    p1_vec = np.log(p1_num/(p1_vec_denom + 2))
    p0_vec = np.log(p0_num/(p0_vec_denom + 2))
    ## prior probability of breast cancer
    pc1 = np.sum(train_label)/float(num)
    ## prior probability of being healthy
    pc0 = 1 - pc1
    ##return the results
    return p1_vec, p0_vec, pc1, pc0
def predict(test_matrix, p1_vec, p0_vec, pc1, pc0):
    ## predicted-label list, initialized to 0
    pred_label = [0 for i in range(len(test_matrix))]
    ## iterate over every sample in the test set
    for index, record in enumerate(test_matrix):
        ## convert the list to an ndarray
        record_array = np.array(record)
        ## by the rule log(p(c|x)) ∝ log(p(x_1|c)p(x_2|c)...p(x_k|c)p(c)),
        ## compute the (log) probability that the test sample is malignant
        p1 = np.dot(record_array, p1_vec) + np.log(pc1)
        ## and the (log) probability that it is healthy
        p0 = np.dot(record_array, p0_vec) + np.log(pc0)
        ## pick the predicted label
        if p1 > p0:
            ##set the predicted label at position index to 1
            pred_label[index] = 1
        else:
            continue
    ## return the results
    return pred_label
if __name__ == "__main__":
    breast_cancer = pd.read_csv('D:\\study\\3\\数据科学\\数据集\\实验二\\乳腺癌诊断数据集\\wisc_bc_data.csv',\
                                engine='python')
    print(breast_cancer.shape)
    print(breast_cancer.head(10))
    del breast_cancer["id"]
    print(breast_cancer.diagnosis.value_counts())
    print(breast_cancer.diagnosis.value_counts()/len(breast_cancer))
    breast_cancer["diagnosis"] = integer_coding(breast_cancer["diagnosis"])
    print(breast_cancer[["radius_mean", "area_mean", "smoothness_mean"]].describe())
    ##Min_Max
    print('\n')
    print("-------------------------------Min_Max-------------------------------")
    print('\n')
    #Normalize the features with min-max
    for col in breast_cancer.columns[1:31]:
        breast_cancer[col] = min_max_normalize(breast_cancer[col])
    print(breast_cancer.iloc[:, 1:].describe(), '\n', '\n')  #check the normalization
    #Split the dataset
    y = breast_cancer['diagnosis']
    del breast_cancer['diagnosis']
    breast_cancer_minmax_train, breast_cancer_minmax_test,\
    breast_cancer_train_labels, breast_cancer_test_labels = partition(breast_cancer, y)
    #print(breast_cancer_minmax_train)
    breast_cancer_minmax_train = breast_cancer_minmax_train.values
    breast_cancer_minmax_test = breast_cancer_minmax_test.values
    breast_cancer_train_labels = breast_cancer_train_labels.values
    breast_cancer_test_labels = breast_cancer_test_labels.values
    p1_vec, p0_vec, pc1, pc0 = createNBClassifier(breast_cancer_minmax_train, breast_cancer_train_labels)
    ## predict labels on the test set
    pred_label = predict(breast_cancer_minmax_test, p1_vec, p0_vec, pc1, pc0)
    #print(pred_label)
    ## element-wise difference between predicted and true labels; diff is an ndarray
    diff = np.array(pred_label) - np.array(breast_cancer_test_labels)
    #print(diff)
    ## error rate: count every nonzero difference (abs per element, so errors in both directions don't cancel)
    error_rate = np.sum(np.abs(diff))/float(len(diff))
    ## print the error rate and accuracy
    print("Error rate:", error_rate)
    print("Accuracy:", (1 - error_rate))