Logistic Regression Example: Predicting Income with Logistic Regression

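Assignment description

Given the training set spam_train.csv, use the attribute values of each ID to decide whether the corresponding person is a Winner or a Loser (that is, whether their income exceeds 50K). This is a typical binary classification problem.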

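Training set description

(1) A CSV file of 4000 rows x 59 columns;

(2) The 4000 rows correspond to 4000 people, with IDs numbered from 1 to 4001;

(3) Of the 59 columns, the first is the person's ID, the last is the classification result, i.e. the label (0 or 1), and the 57 columns in between are that person's 57 attribute values;

(4) Dataset location: https://pan.baidu.com/s/1mG7ndtlT4jWYHH9V-Rj_5g, extraction code: hwzf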
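The column layout described above can be sketched as follows. This is only an illustration: the helper name split_columns and the tiny synthetic table are made up here, since the real spam_train.csv is not reproduced in this post. The post lists 4000 rows but IDs running up to 4001; one possible explanation (an assumption, not something the post confirms) is that the file has no header row, in which case pd.read_csv with its default settings would use the first data line as column names. Passing header=None avoids that.

```python
import io

import numpy as np
import pandas as pd


def split_columns(df):
    """Split a table laid out as [ID, features..., label] into its parts."""
    values = df.to_numpy(dtype=float)
    ids = values[:, 0]          # first column: ID
    features = values[:, 1:-1]  # middle columns: attribute values
    labels = values[:, -1]      # last column: 0/1 label
    return ids, features, labels


# Synthetic stand-in for spam_train.csv: 5 rows, 59 columns, no header row.
rng = np.random.default_rng(0)
fake = np.hstack([
    np.arange(1, 6).reshape(-1, 1),    # ID column
    rng.random((5, 57)),               # 57 attribute columns
    rng.integers(0, 2, size=(5, 1)),   # label column
])
csv_text = "\n".join(",".join(str(v) for v in row) for row in fake)

# header=None keeps the first line as data.
df = pd.read_csv(io.StringIO(csv_text), header=None).fillna(0)
ids, features, labels = split_columns(df)
print(features.shape, labels.shape)

# With the default header handling the first line becomes column names,
# so the resulting table has one row fewer.
rows_with_default_header = len(pd.read_csv(io.StringIO(csv_text)))
print(rows_with_default_header)
```

If the real file does have a header row, the default pd.read_csv call used in the code below is the right one and header=None should not be passed.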

The code is as follows:
import pandas as pd
import numpy as np
def main():
    df = pd.read_csv('spam_train.csv')
    df = df.fillna(0)
    array = np.array(df)
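    # The first column is the ID and the last column is the label, so neither is used as a feature
    x = array[:,1:-1]
    # The last two feature columns are much more spread out than the rest, so rescale each by its mean
    x[:,-1] = x[:,-1]/np.mean(x[:,-1])
    x[:,-2] = x[:,-2]/np.mean(x[:,-2])
    # y is the label
    y = array[:,-1]
    # 3500 samples for training, 500 held out as the test (validation) set
    x_train,x_val = x[0:3500,:],x[3500:4000,:]
    y_train,y_val = y[0:3500],y[3500:4000]
    # start training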
    weights,bias = train(x_train,y_train,200)
    return x_val,y_val,weights,bias
def train(x_train,y_train,epoch):
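    # num is the number of samples, col is the number of features
    num = x_train.shape[0]
    col = x_train.shape[1]
    # initialize the weights
    weights = np.zeros(col)
    # initialize the bias
    bias = 0
    # regularization rate
    reg_rate = 0.0001
    # learning rate
    learning_rate = 1
    # running sums of squared gradients for the weights and the bias, used by Adagrad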
    w_sum = np.zeros(col)
    b_sum = 0
    for i in range(epoch):
        b_g = 0
        w_g = np.zeros(col)
        for j in range(num):
            y_pre = weights.dot(x_train[j,:])+bias
            sig = 1/(1+np.exp(-y_pre))
            b_g += y_train[j] - sig
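            for k in range(col):
                # The L2 term is subtracted here because w_g is negated after the loop,
                # which turns it into the +reg_rate*weights term of the loss gradient
                w_g[k] += (y_train[j]-sig)*x_train[j,k]-reg_rate * weights[k]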
        b_g = -b_g
        w_g = - w_g
        # b_g /= num
        # w_g /= num
        b_sum += b_g**2
        w_sum += w_g**2
        # Adagrad update of the weights and the bias
        weights -= learning_rate/(w_sum**0.5)*w_g
        bias -= learning_rate/(b_sum**0.5)*b_g
        # report the training accuracy every 10 epochs
        if i%10 == 0:
            acc = 0
            res = np.zeros(num)
            for j in range(num):
                y_pre = weights.dot(x_train[j,:])+bias
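                # sig is the predicted probability that this sample belongs to class 1
                sig = 1/(1+np.exp(-y_pre))
                # predict 1 if the probability is at least 0.5, otherwise 0
                if sig >= 0.5:
                    res[j] = 1
                else:
                    res[j] = 0
                # check whether the prediction matches the true label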
                if res[j] == y_train[j]:
                    acc += 1
            print('after {} epochs, the acc on train data is:'.format(i), acc / num)
    return weights,bias
def check(x_val,y_val,weights,bias):
    num = x_val.shape[0]
    acc = 0
    res = np.zeros(num)
    for j in range(num):
        y_pre = weights.dot(x_val[j, :]) + bias
        sig = 1 / (1 + np.exp(-y_pre))
        if sig >= 0.5:
            res[j] = 1
        else:
            res[j] = 0
        if res[j] == y_val[j]:
            acc += 1
    # main() trains for 200 epochs, and x_val/y_val are the held-out samples
    print('after {} epochs, the acc on validation data is:'.format(200), acc / num)

if __name__ == '__main__':
    x_val,y_val,weights,bias = main()
    # evaluate the trained model on the held-out samples
    check(x_val,y_val,weights,bias)
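
The nested loops in train() visit every sample and every feature in pure Python, which is slow for 3500 x 57 inputs. As a rough sketch, the same kind of Adagrad update can be written with NumPy matrix products. The names sigmoid, train_vectorized and accuracy are made up for this illustration. The eps term and the clipping inside sigmoid are additions that guard against division by zero and overflow. The L2 penalty is applied once per epoch as the gradient of (reg_rate/2)*||w||^2, which is an assumption about the intended regularizer and not the exact scaling used above. The demo data is synthetic, not the assignment's dataset.

```python
import numpy as np


def sigmoid(z):
    # Clip the input so np.exp cannot overflow for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def train_vectorized(x, y, epochs, learning_rate=1.0, reg_rate=0.0001, eps=1e-8):
    """Full-batch logistic regression with an Adagrad step size."""
    num, col = x.shape
    weights = np.zeros(col)
    bias = 0.0
    w_sum = np.zeros(col)   # running sum of squared weight gradients
    b_sum = 0.0             # running sum of squared bias gradients
    for _ in range(epochs):
        err = sigmoid(x @ weights + bias) - y     # shape (num,)
        w_g = x.T @ err + reg_rate * weights      # cross-entropy gradient plus L2 term
        b_g = err.sum()
        w_sum += w_g ** 2
        b_sum += b_g ** 2
        weights -= learning_rate / (np.sqrt(w_sum) + eps) * w_g
        bias -= learning_rate / (np.sqrt(b_sum) + eps) * b_g
    return weights, bias


def accuracy(x, y, weights, bias):
    # Fraction of samples whose thresholded prediction matches the 0/1 label.
    return np.mean((sigmoid(x @ weights + bias) >= 0.5) == (y == 1))


# Synthetic, linearly separable demo data.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = (x_demo[:, 0] + 0.5 * x_demo[:, 1] > 0).astype(float)
w_demo, b_demo = train_vectorized(x_demo, y_demo, epochs=100)
acc_demo = accuracy(x_demo, y_demo, w_demo, b_demo)
print(acc_demo)
```

Because the whole batch goes through x @ weights in one call, the accuracy check that train() performs every 10 epochs also reduces to the single accuracy() call.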

Training results:

after 0 epochs, the acc on train data is: 0.6674285714285715
after 10 epochs, the acc on train data is: 0.9202857142857143
after 20 epochs, the acc on train data is: 0.9168571428571428
after 30 epochs, the acc on train data is: 0.9211428571428572
after 40 epochs, the acc on train data is: 0.9265714285714286
after 50 epochs, the acc on train data is: 0.9251428571428572
after 60 epochs, the acc on train data is: 0.9228571428571428
after 70 epochs, the acc on train data is: 0.9245714285714286
after 80 epochs, the acc on train data is: 0.9237142857142857
after 90 epochs, the acc on train data is: 0.9242857142857143
after 100 epochs, the acc on train data is: 0.9225714285714286
after 110 epochs, the acc on train data is: 0.922
after 120 epochs, the acc on train data is: 0.9237142857142857
after 130 epochs, the acc on train data is: 0.9228571428571428
after 140 epochs, the acc on train data is: 0.9185714285714286
after 150 epochs, the acc on train data is: 0.9245714285714286
after 160 epochs, the acc on train data is: 0.9288571428571428
after 170 epochs, the acc on train data is: 0.9225714285714286
after 180 epochs, the acc on train data is: 0.9234285714285714
after 190 epochs, the acc on train data is: 0.9225714285714286
after 2000 epochs, the acc on train data is: 0.938

The last line is printed by check(), so 0.938 is the accuracy on the 500 held-out samples after the 200 training epochs run by main(), not a training accuracy after 2000 epochs.
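
Accuracy alone does not show which class the remaining errors fall in. A minimal sketch of counting true/false positives and negatives for 0/1 labels is given here. The function name confusion_counts is made up for this illustration, and the labels are toy values, not predictions from the model above.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for arrays of 0/1 labels."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, fp, tn, fn


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, tn, fn, accuracy)
```

To apply this to the model above, the res array built inside check() would be passed as y_pred and y_val as y_true.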
