Dataset Introduction
Each record in the dataset has two features plus a label column; the label is 0 or 1, making this a binary classification dataset. There are 100 records in total: 40 with label 0 and 60 with label 1. We take the first 10 rows as the testing set and the remaining 90 rows as the training set.
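For reference, each row of reg_data.txt is comma-separated in the order feature1, feature2, label (this follows from the read_table call below; the sample values here are invented):

# Hypothetical rows of reg_data.txt (values made up for illustration):
# 2.13,1.47,1
# 0.85,3.02,0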
Python Implementation
Loading and Preparing the Dataset
Imports
#!coding:utf-8
import math
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
start = time.time()
Defining the Functions
def load_data():
    df = pd.read_table('reg_data.txt',
                       names=['feature1', 'feature2', 'label'],
                       sep=',',
                       encoding='utf-8')
    df1 = df[df['label'] == 0]
    df2 = df[df['label'] == 1]
    print("This is an imbalanced data set which has\n" +
          str(len(df1)) + " negative samples and " + str(len(df2)) + " positive samples\n")
    return df, df1, df2
# Divide the dataset into a training set and a testing set
def prepare_dataset(dt):
    train = dt[10:100]
    test = dt[0:10]
    tr = np.array(train)
    te = np.array(test)
    return tr, te
The load_data() function loads the .txt dataset into a pandas DataFrame.
The prepare_dataset() function splits the loaded DataFrame into a training set and a testing set, then converts both to NumPy arrays before returning them.
Let's take a look at the data:
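A minimal way to preview the loaded data in code (my addition, in place of the original screenshot):

data, data_neg, data_pos = load_data()
print(data.head())                    # first five rows: feature1, feature2, label
print(data['label'].value_counts())  # 60 positive vs. 40 negative samples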
The Sigmoid Function
The formula for the sigmoid function:
y=\frac{1}{1+e^{-z}}
Code:
# Sigmoid function
def sigmoid(a):
    b = 1/(1 + np.exp(-a))
    return b

arr = np.linspace(-10, 10, 50)
vis = sigmoid(arr)
plt.title('Sigmoid Function')
plt.plot(arr, vis)  # plot against the inputs so the x-axis runs from -10 to 10
plt.show()
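One caveat (my addition, not part of the original code): np.exp(-a) overflows for large negative inputs and NumPy emits a RuntimeWarning, even though the final value is still correct. A minimal sketch of a safer variant, assuming float64 inputs:

# Numerically safer sigmoid: clip the argument so np.exp stays in range
def sigmoid_stable(a):
    a = np.clip(a, -500, 500)  # exp(500) is still representable in float64
    return 1/(1 + np.exp(-a))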
Output:
Optimizing Weights and Bias with Gradient Descent
Take the partial derivative of the loss function:
\frac{\partial L(\omega)}{\partial \omega_j}=\sum_{i=1}^{m}\left(\frac{\exp(\omega^T x_i)}{1+\exp(\omega^T x_i)}\,x_{ij}-y_i x_{ij}\right)
w_j=w_j-\eta\,\Delta w
To reduce computational cost and speed up the program, the update is written in matrix form, with the sigmoid applied elementwise (this is what the code below actually computes):

\Delta w=X^T(\sigma(Xw)-y)
Code:
def gradient_des_revisedbyrain(tr):
    # Learning rate
    lr = 0.1
    # Maximum number of iterations
    iteration_max = 700000
    m, n = tr.shape
    tr_label = tr[:, 2].reshape(m, 1)
    # Initialize the weight vector (weights plus bias) with ones
    result = np.ones((1, n))
    one = np.ones((m, 1))
    beta = np.c_[np.mat(tr[:, 0:2]), one]
    w_all = []
    # Iterations
    # Matrix form -- Δw = X^T * (sigmoid(X * w) - y)
    for i in range(iteration_max):
        h = sigmoid(np.dot(beta, result.T))
        err = h - tr_label
        result = result - lr * np.dot(err.T, beta)/len(tr_label)
        w_all.append(result)
    return result, w_all
Appending a column on the right of the beta matrix augments each sample, x => (x; 1), which simplifies obtaining the bias value b: the appended column is all ones, and since the weight vector is initialized with np.ones, the bias value also starts at 1.
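A tiny illustration of the augmentation, with made-up values (np.c_ is the same call used in the code above):

# Hypothetical two-sample feature matrix, augmented with a ones column
X = np.array([[0.2, 0.7],
              [0.5, 0.1]])
X_aug = np.c_[X, np.ones((X.shape[0], 1))]
# X_aug:
# [[0.2 0.7 1. ]
#  [0.5 0.1 1. ]]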
Translating the body of the for loop above into mathematical form:
h=sigmoid(\beta \cdot result^T)
err=h-tr\_label
result=result-\frac{lr \times err^T \cdot \beta}{len(tr\_label)}
The result of every iteration is appended to the list w_all, which will come in handy later when plotting the decision boundary.
Computing Accuracy
The function takes a feature matrix, the corresponding labels, the weight vector, and a threshold. The threshold is part of the input so that the P-R and ROC curves can be drawn later.
def accuracy(te, te_label, w, threshold):
    one = np.ones((len(te), 1))
    mat = np.c_[te, one]
    y = sigmoid(np.dot(mat, w.T))
    m, n = np.shape(y)
    count = 0
    flag = []
    for i in range(m):
        if y[i, 0] > threshold:
            flag.append(1)
        else:
            flag.append(0)
        if te_label[i] == flag[i]:
            count += 1
    acc = count/len(flag)
    return acc, flag
The input te contains only the two feature columns, not the label column. As in the gradient descent function, don't forget to augment it with a column of ones, mat = (te, 1). The predicted values are then y = sigmoid(mat · w^T).
If y > threshold, the sample is classified as positive with label 1; if y < threshold, it is classified as negative with label 0.
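The loop above can also be vectorized; this is a sketch of mine with the same logic, not the original implementation:

# Vectorized equivalent of accuracy(): threshold all predictions at once
def accuracy_vec(te, te_label, w, threshold):
    mat = np.c_[te, np.ones((len(te), 1))]
    y = np.asarray(sigmoid(np.dot(mat, w.T))).ravel()
    flag = (y > threshold).astype(int)  # 1 = positive, 0 = negative
    acc = float(np.mean(flag == np.asarray(te_label).ravel()))
    return acc, flag.tolist()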
Printing the Results
def print_results(w, b, lb, acc):
    print('Weight vector is\n', w[0, 0:2])
    print('Bias is\n', b)
    print('Accuracy on testing set: %f' % acc)
The output includes the weight vector, the bias, and the accuracy on the testing set.
Main Function
if __name__ == '__main__':
    data, data_label0, data_label1 = load_data()
    # Normalize
    for i in ['feature1', 'feature2']:
        data[i] = (data[i] - data[i].min()) / (data[i].max() - data[i].min())
    training_set, testing_set = prepare_dataset(data)
    w_b, w_all = gradient_des_revisedbyrain(training_set)
    # w_b is the final weight vector and bias value
    # w_all holds the weights and bias of every iteration from 0 to the maximum
    weight = w_b[0:2]
    bias = w_b[0, 2]
    label_all = []
    acc_train = []
    label_alltrain = []
    label_alltrain_measure = []
    # Accuracy on the training set
    for i in range(0, 2000):
        # Fix the threshold at 0.5 and vary the weight vector
        acc_rate_train, label_train = accuracy(training_set[:, 0:2], training_set[:, 2], w_all[4*i], 0.5)
        acc_train.append(acc_rate_train)
        label_alltrain.append(label_train)
    # Different thresholds
    ran = np.arange(0, 1, 0.01)
    for i in ran:
        # Set threshold
        threshold = i
        # Accuracy and labels on the testing set
        acc_rate, label = accuracy(testing_set[:, 0:2], testing_set[:, 2], w_b, threshold)
        label_all.append(label)
        # Accuracy and labels on the training set
        acc_rate_train_measure, label_train_measure = accuracy(training_set[:, 0:2], training_set[:, 2], w_b, threshold)
        label_alltrain_measure.append(label_train_measure)
    # Accuracy and labels on the testing set when the threshold is 0.5
    final_acc_rate, final_label = accuracy(testing_set[:, 0:2], testing_set[:, 2], w_b, 0.5)
    print_results(weight, bias, final_label, final_acc_rate)
    plt.plot(acc_train, c='r')
    plt.xlabel('Iterations')
    plt.ylabel('Accuracy')
    plt.title('Accuracy changes with Iterations')
    plt.show()
First for loop in the main function: min-max normalization of the two features.
Second for loop: accuracy on the training set over the iterations. The weights and biases from the early iterations (w_all[4*i] for i in 0..1999, i.e. every fourth snapshot out of the first 8,000 of the 700,000 total iterations configured in the gradient descent function; this count must stay well below 700,000) classify the entire training set, and the resulting accuracies are recorded.
label_alltrain holds the training-set classification labels produced by all 2,000 of these snapshots.
Third for loop: sweep the threshold over (0, 1) in steps of 0.01 while keeping the input weight vector fixed at the final optimized one, recording the accuracy at each threshold. This prepares the data for the P-R and ROC curves used in the performance measurement.
label_alltrain_measure holds all the training-set classification labels obtained as the threshold varies from 0 to 1.
After 300 iterations:
After 1,000 iterations:
Classification accuracy on the training set rises as the iterations accumulate; with a learning rate of 0.1, about 300 iterations are enough to reach roughly 90% training accuracy, after which the accuracy barely changes with further iterations.
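Given that plateau, an early-stopping check could replace most of the 700,000 iterations. A minimal sketch of my own for the loop inside gradient_des_revisedbyrain (the tolerance value is an assumption):

# Stop iterating once the update step becomes negligible
tol = 1e-6
for i in range(iteration_max):
    h = sigmoid(np.dot(beta, result.T))
    step = lr * np.dot((h - tr_label).T, beta)/len(tr_label)
    result = result - step
    w_all.append(result)
    if np.max(np.abs(step)) < tol:
        break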
Plotting
df1_norm = data[data['label'] == 0]
df2_norm = data[data['label'] == 1]
x = np.arange(0, 1, 0.1)
for i in range(0, 1000, 10):
    accu = acc_train[i]
    plt.scatter(df1_norm.feature1[:], df1_norm.feature2[:],
                c='c', marker='o', label='label 0')
    plt.scatter(df2_norm.feature1[:], df2_norm.feature2[:],
                c='k', marker='o', label='label 1')
    y = (-float(w_all[i][0, 0]) * x - float(w_all[i][0, 2]))/float(w_all[i][0, 1])
    plt.plot(x, y, c='r')
    plt.legend()
    plt.title('Accuracy = ' + str(accu))
    plt.savefig(r'./pic2/' + str(i) + '.png', dpi=300)
    plt.show()
Since this figure draws the classification boundary on the dataset, it uses the acc_train values obtained earlier as the accuracy shown in each title, and the snapshots in w_all as the weights of the boundary being drawn.
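For reference, the boundary line in the code comes from setting the sigmoid argument to zero, i.e. the decision point where sigmoid(z) = 0.5:

w_1 x_1+w_2 x_2+b=0 \;\Rightarrow\; x_2=\frac{-w_1 x_1-b}{w_2}

which is exactly the expression y = (-w_all[i][0,0]*x - w_all[i][0,2])/w_all[i][0,1] in the loop above.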
The collected images look like this:
Performance Measurement
First compute the counts according to the following table:
| Ground truth | Predicted positive (label=1) | Predicted negative (label=0) |
| --- | --- | --- |
| Positive (label=1) | True Positive (TP) | False Negative (FN) |
| Negative (label=0) | False Positive (FP) | True Negative (TN) |
def performance_measure(lb, truelabel):
    tp = np.zeros((len(lb), 1), dtype=int)
    fp = np.zeros((len(lb), 1), dtype=int)
    tn = np.zeros((len(lb), 1), dtype=int)
    fn = np.zeros((len(lb), 1), dtype=int)
    for i in range(len(lb)):
        # iterate over every sample rather than a hard-coded 10, so the
        # 90-sample training set is counted fully as well
        for j in range(len(truelabel)):
            if truelabel[j] == 1 and lb[i][j] == 1:
                tp[i] += 1  # True positive
            elif truelabel[j] == 1 and lb[i][j] == 0:
                fn[i] += 1  # False negative
            elif truelabel[j] == 0 and lb[i][j] == 1:
                fp[i] += 1  # False positive
            elif truelabel[j] == 0 and lb[i][j] == 0:
                tn[i] += 1  # True negative
    return tp, fn, fp, tn
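The same four counts can be obtained with boolean masks; this is my own vectorized sketch of the identical logic for a single prediction vector:

# Confusion-matrix counts for one vector of predicted labels
def confusion_counts(pred, truelabel):
    pred = np.asarray(pred)
    true = np.asarray(truelabel)
    tp = int(np.sum((true == 1) & (pred == 1)))  # true positives
    fn = int(np.sum((true == 1) & (pred == 0)))  # false negatives
    fp = int(np.sum((true == 0) & (pred == 1)))  # false positives
    tn = int(np.sum((true == 0) & (pred == 0)))  # true negatives
    return tp, fn, fp, tn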
P-R Curve
Formulas:
P=\frac{TP}{TP+FP}
R=\frac{TP}{TP+FN}
Code:
# 1. P-R curve (Precision and Recall)
# true_pos, false_neg, false_pos, true_neg = performance_measure(label_all, final_label)
true_pos, false_neg, false_pos, true_neg = performance_measure(label_alltrain_measure, training_set[:, 2])
# Suppress divide-by-zero warnings; the resulting NaNs are set to 0 below
np.seterr(divide='ignore', invalid='ignore')
precision = true_pos/(true_pos + false_pos)
recall = true_pos/(true_pos + false_neg)
precision[np.isnan(precision)] = 0
# Where precision is 0 (no positive predictions), plot it as 1
precs_draw = np.zeros(len(precision))
for i in range(1, len(precision)):
    if precision[i] != 0:
        precs_draw[i] = precision[i]
    else:
        precs_draw[i] = 1
# Add points to complete the plot
recall[len(recall)-2] = 0.5
recall[len(recall)-1] = 0
# Draw the curve
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    # recall, precision, "-o",
    recall, precs_draw, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[0, 1.05], ylim=[0, 1.05], title='P-R(Precision-Recall) Curve',
       ylabel='Precision', xlabel='Recall')
plt.show()
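As a sanity check (my addition; it assumes scikit-learn is installed, which the original code does not use), sklearn.metrics.precision_recall_curve can produce the same kind of curve directly from the sigmoid scores:

from sklearn.metrics import precision_recall_curve
# Sigmoid scores on the training set with the final weights
mat_tr = np.c_[training_set[:, 0:2], np.ones((len(training_set), 1))]
scores = np.asarray(sigmoid(np.dot(mat_tr, w_b.T))).ravel()
prec_sk, rec_sk, thr_sk = precision_recall_curve(training_set[:, 2], scores)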
ROC Curve and AUC
Formulas:
TPR=\frac{TP}{TP+FN}
FPR=\frac{FP}{FP+TN}
Code:
# 2. AUC curve(Area Under Curve)
# The area between the ROC curve and the coordinate axis
TPR = true_pos/(true_pos + false_neg)
FPR = false_pos/(true_neg + false_pos)
# Add point to draw plot
TPR[len(TPR)-2] = 0.5
TPR[len(TPR)-1] = 0
# Draw scatter
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    FPR, TPR, "-o",
    markersize=8,
    linewidth=4,
    markerfacecolor="b",
    markeredgecolor="b",
    markeredgewidth=1
)
ax.set(xlim=[-0.05, 1.05], ylim=[0, 1.05], title='ROC Curve & AUC(Area Under Curve)',
       ylabel='TPR(True Positive Rate)', xlabel='FPR(False Positive Rate)')
plt.show()
# Calculate AUC with the trapezoidal rule
temp = 0
for i in range(len(FPR)-1):
    temp = temp + (FPR[i] - FPR[i+1]) * (TPR[i] + TPR[i+1])
AUC = float(0.5*temp)
print('AUC is %.2f' % AUC)
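Equivalently (my addition), the loop can be replaced with np.trapz; the sign flips because FPR decreases as the threshold grows:

# One-line AUC via NumPy's trapezoidal integration (same result as the loop)
auc_trapz = float(-np.trapz(TPR.ravel(), FPR.ravel()))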
Output of the AUC value:
AUC is 0.92
Timing:
end = time.time()
t = end - start
print('Run time: %.2f s' %t)
Output: Run time: 38.25 s
The code has been uploaded to GitHub; dataset and repository: https://github.com/RU0804/Binary_Logisticregression_imbalanceddata