Assignment Description
Given the training set spam_train.csv, the task is to predict from each ID's attribute values whether that ID is a Winner or a Loser (i.e. whether income exceeds 50K). This is a classic binary classification problem.
About the training set:
(1) A CSV file of 4000 rows x 59 columns;
(2) The 4000 rows correspond to 4000 individuals, with IDs numbered from 1 to 4000;
(3) Of the 59 columns, the first is the ID and the last is the classification result, i.e. the label (0 or 1); the 57 columns in between are that individual's 57 attribute values;
(4) Dataset: https://pan.baidu.com/s/1mG7ndtlT4jWYHH9V-Rj_5g, extraction code: hwzf
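Before the full program, the data layout and the train/validation split used below can be sanity-checked on a synthetic array of the same shape (a sketch with random stand-in data; the real file comes from the link above):

```python
import numpy as np

# Synthetic stand-in for spam_train.csv: 4000 rows x 59 columns
# (column 0 = ID, columns 1..57 = features, column 58 = label).
rng = np.random.default_rng(0)
array = rng.random((4000, 59))
array[:, -1] = (array[:, -1] > 0.5).astype(float)  # fake binary label

x = array[:, 1:-1]   # drop the ID and label columns -> 57 features
y = array[:, -1]

# Mean-normalize the two widely-ranging feature columns, as the code below does.
x[:, -1] = x[:, -1] / np.mean(x[:, -1])
x[:, -2] = x[:, -2] / np.mean(x[:, -2])

# First 3500 rows for training, the remaining 500 for validation.
x_train, x_val = x[:3500, :], x[3500:, :]
y_train, y_val = y[:3500], y[3500:]
print(x_train.shape, x_val.shape)  # (3500, 57) (500, 57)
```

Note that after dividing a column by its own mean, that column's mean becomes exactly 1, which is what "concentrating" the spread-out columns amounts to here.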
The code:
import pandas as pd
import numpy as np

def main():
    df = pd.read_csv('spam_train.csv')
    df = df.fillna(0)
    array = np.array(df, dtype=float)
    # The first column is the ID and the last column is the label,
    # so neither belongs in the feature matrix.
    x = array[:, 1:-1]
    # The last two feature columns have very spread-out values,
    # so scale each by its mean to concentrate it.
    x[:, -1] = x[:, -1] / np.mean(x[:, -1])
    x[:, -2] = x[:, -2] / np.mean(x[:, -2])
    # y is the label
    y = array[:, -1]
    # First 3500 samples for training, the remaining 500 for validation.
    x_train, x_val = x[0:3500, :], x[3500:4000, :]
    y_train, y_val = y[0:3500], y[3500:4000]
    # Start training
    weights, bias = train(x_train, y_train, 200)
    return x_val, y_val, weights, bias

def train(x_train, y_train, epoch):
    # num: number of samples; col: number of features
    num = x_train.shape[0]
    col = x_train.shape[1]
    # Initialize weights and bias
    weights = np.zeros(col)
    bias = 0
    # Regularization rate and learning rate
    reg_rate = 0.0001
    learning_rate = 1
    # Running sums of squared gradients for the Adagrad update
    w_sum = np.zeros(col)
    b_sum = 0
    eps = 1e-8  # guards against division by zero in the Adagrad step
    for i in range(epoch):
        b_g = 0
        w_g = np.zeros(col)
        for j in range(num):
            y_pre = weights.dot(x_train[j, :]) + bias
            sig = 1 / (1 + np.exp(-y_pre))
            # Gradient of the cross-entropy loss; the L2 penalty
            # contributes +reg_rate * weights to the weight gradient.
            b_g += -(y_train[j] - sig)
            w_g += -(y_train[j] - sig) * x_train[j, :] + reg_rate * weights
        b_sum += b_g ** 2
        w_sum += w_g ** 2
        # Adagrad update of the weights and bias
        weights -= learning_rate / (np.sqrt(w_sum) + eps) * w_g
        bias -= learning_rate / (np.sqrt(b_sum) + eps) * b_g
        # Report the training accuracy every 10 epochs
        if i % 10 == 0:
            acc = 0
            for j in range(num):
                y_pre = weights.dot(x_train[j, :]) + bias
                # sig is the probability that this sample belongs to class 1
                sig = 1 / (1 + np.exp(-y_pre))
                # Predict 1 when the probability is at least 0.5, else 0
                pred = 1 if sig >= 0.5 else 0
                # Count correct predictions
                if pred == y_train[j]:
                    acc += 1
            print('after {} epochs, the acc on train data is:'.format(i), acc / num)
    return weights, bias

def check(x_val, y_val, weights, bias):
    num = x_val.shape[0]
    acc = 0
    for j in range(num):
        y_pre = weights.dot(x_val[j, :]) + bias
        sig = 1 / (1 + np.exp(-y_pre))
        pred = 1 if sig >= 0.5 else 0
        if pred == y_val[j]:
            acc += 1
    print('the acc on validation data is:', acc / num)

if __name__ == '__main__':
    x_val, y_val, weights, bias = main()
    # Evaluate on the held-out validation set
    check(x_val, y_val, weights, bias)
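The per-sample j/k loops above can also be vectorized with NumPy; a minimal sketch of one full-batch gradient computation (same regularized cross-entropy gradient; `grad_step` is a hypothetical helper name, not part of the program above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_step(x, y, weights, bias, reg_rate=0.0001):
    """Full-batch gradient of the L2-regularized cross-entropy loss."""
    sig = sigmoid(x.dot(weights) + bias)          # predictions for all samples at once
    w_g = -(y - sig).dot(x) + reg_rate * weights  # weight gradient
    b_g = -np.sum(y - sig)                        # bias gradient
    return w_g, b_g

# Tiny smoke test on made-up data: with zero weights every prediction is 0.5.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
w_g, b_g = grad_step(x, y, np.zeros(2), 0.0)
print(w_g, b_g)  # [1. 1.] 0.0
```

On a 3500 x 57 training matrix this removes two levels of Python looping per epoch, which is the main cost of the implementation above.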
Training results:
after 0 epochs, the acc on train data is: 0.6674285714285715
after 10 epochs, the acc on train data is: 0.9202857142857143
after 20 epochs, the acc on train data is: 0.9168571428571428
after 30 epochs, the acc on train data is: 0.9211428571428572
after 40 epochs, the acc on train data is: 0.9265714285714286
after 50 epochs, the acc on train data is: 0.9251428571428572
after 60 epochs, the acc on train data is: 0.9228571428571428
after 70 epochs, the acc on train data is: 0.9245714285714286
after 80 epochs, the acc on train data is: 0.9237142857142857
after 90 epochs, the acc on train data is: 0.9242857142857143
after 100 epochs, the acc on train data is: 0.9225714285714286
after 110 epochs, the acc on train data is: 0.922
after 120 epochs, the acc on train data is: 0.9237142857142857
after 130 epochs, the acc on train data is: 0.9228571428571428
after 140 epochs, the acc on train data is: 0.9185714285714286
after 150 epochs, the acc on train data is: 0.9245714285714286
after 160 epochs, the acc on train data is: 0.9288571428571428
after 170 epochs, the acc on train data is: 0.9225714285714286
after 180 epochs, the acc on train data is: 0.9234285714285714
after 190 epochs, the acc on train data is: 0.9225714285714286
the acc on validation data is: 0.938
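For reference, the update the training loop implements can be written out. Per epoch, the full-batch gradients of the regularized cross-entropy loss are (using the code's symbols: $\sigma$ the sigmoid, $\lambda$ = reg_rate, $\eta$ = learning_rate, $N$ the number of training samples):

$$g_w = \sum_{j=1}^{N}\Bigl[-\bigl(y_j - \sigma(w^\top x_j + b)\bigr)\,x_j + \lambda w\Bigr],\qquad
g_b = -\sum_{j=1}^{N}\bigl(y_j - \sigma(w^\top x_j + b)\bigr)$$

and the Adagrad step divides by the root of the running sum of squared gradients (elementwise):

$$G_w \mathrel{+}= g_w^{2},\qquad w \leftarrow w - \frac{\eta}{\sqrt{G_w}}\,g_w,\qquad
G_b \mathrel{+}= g_b^{2},\qquad b \leftarrow b - \frac{\eta}{\sqrt{G_b}}\,g_b.$$

This per-parameter scaling is why a raw learning rate of 1 is workable here: Adagrad shrinks the effective step for coordinates whose gradients have historically been large.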