Assignment Description
Given the training set spam_train.csv, the task is to predict from each ID's attribute values whether that ID is a Winner or a Loser (i.e. whether income exceeds 50K). This is a classic binary classification problem.
About the training set:
(1) A CSV file of 4000 rows x 59 columns;
(2) The 4000 rows correspond to 4000 individuals, with IDs numbered from 1 to 4000;
(3) Of the 59 columns, the first is the ID and the last is the classification result, i.e. the label (0 or 1); the 57 columns in between are that individual's 57 attribute values;
(4) Dataset: https://pan.baidu.com/s/1mG7ndtlT4jWYHH9V-Rj_5g, extraction code: hwzf
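Before the full program, the data layout and the train/validation split used below can be sanity-checked on a synthetic array of the same shape (a sketch with random stand-in data; the real file comes from the link above):

```python
import numpy as np

# Synthetic stand-in for spam_train.csv: 4000 rows x 59 columns
# (column 0 = ID, columns 1..57 = features, column 58 = label).
rng = np.random.default_rng(0)
array = rng.random((4000, 59))
array[:, -1] = (array[:, -1] > 0.5).astype(float)  # fake binary label

x = array[:, 1:-1]   # drop the ID and label columns -> 57 features
y = array[:, -1]

# Mean-normalize the two widely-ranging feature columns, as the code below does.
x[:, -1] = x[:, -1] / np.mean(x[:, -1])
x[:, -2] = x[:, -2] / np.mean(x[:, -2])

# First 3500 rows for training, the remaining 500 for validation.
x_train, x_val = x[:3500, :], x[3500:, :]
y_train, y_val = y[:3500], y[3500:]
print(x_train.shape, x_val.shape)  # (3500, 57) (500, 57)
```

Note that after dividing a column by its own mean, that column's mean becomes exactly 1, which is what "concentrating" the spread-out columns amounts to here.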
The code:
import pandas as pd
import numpy as np

def main():
    df = pd.read_csv('spam_train.csv')
    df = df.fillna(0)
    array = np.array(df, dtype=float)
    # The first column is the ID and the last column is the label,
    # so neither belongs in the feature matrix.
    x = array[:, 1:-1]
    # The last two feature columns have very spread-out values,
    # so scale each by its mean to concentrate it.
    x[:, -1] = x[:, -1] / np.mean(x[:, -1])
    x[:, -2] = x[:, -2] / np.mean(x[:, -2])
    # y is the label
    y = array[:, -1]
    # First 3500 samples for training, the remaining 500 for validation.
    x_train, x_val = x[0:3500, :], x[3500:4000, :]
    y_train, y_val = y[0:3500], y[3500:4000]
    # Start training
    weights, bias = train(x_train, y_train, 200)
    return x_val, y_val, weights, bias

def train(x_train, y_train, epoch):
    # num: number of samples; col: number of features
    num = x_train.shape[0]
    col = x_train.shape[1]
    # Initialize weights and bias
    weights = np.zeros(col)
    bias = 0
    # Regularization rate and learning rate
    reg_rate = 0.0001
    learning_rate = 1
    # Running sums of squared gradients for the Adagrad update
    w_sum = np.zeros(col)
    b_sum = 0
    eps = 1e-8  # guards against division by zero in the Adagrad step
    for i in range(epoch):
        b_g = 0
        w_g = np.zeros(col)
        for j in range(num):
            y_pre = weights.dot(x_train[j, :]) + bias
            sig = 1 / (1 + np.exp(-y_pre))
            # Gradient of the cross-entropy loss; the L2 penalty
            # contributes +reg_rate * weights to the weight gradient.
            b_g += -(y_train[j] - sig)
            w_g += -(y_train[j] - sig) * x_train[j, :] + reg_rate * weights
        b_sum += b_g ** 2
        w_sum += w_g ** 2
        # Adagrad update of the weights and bias
        weights -= learning_rate / (np.sqrt(w_sum) + eps) * w_g
        bias -= learning_rate / (np.sqrt(b_sum) + eps) * b_g
        # Report the training accuracy every 10 epochs
        if i % 10 == 0:
            acc = 0
            for j in range(num):
                y_pre = weights.dot(x_train[j, :]) + bias
                # sig is the probability that this sample belongs to class 1
                sig = 1 / (1 + np.exp(-y_pre))
                # Predict 1 when the probability is at least 0.5, else 0
                pred = 1 if sig >= 0.5 else 0
                # Count correct predictions
                if pred == y_train[j]:
                    acc += 1
            print('after {} epochs, the acc on train data is:'.format(i), acc / num)
    return weights, bias

def check(x_val, y_val, weights, bias):
    num = x_val.shape[0]
    acc = 0
    for j in range(num):
        y_pre = weights.dot(x_val[j, :]) + bias
        sig = 1 / (1 + np.exp(-y_pre))
        pred = 1 if sig >= 0.5 else 0
        if pred == y_val[j]:
            acc += 1
    print('the acc on validation data is:', acc / num)

if __name__ == '__main__':
    x_val, y_val, weights, bias = main()
    # Evaluate on the held-out validation set
    check(x_val, y_val, weights, bias)
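The per-sample j/k loops above can also be vectorized with NumPy; a minimal sketch of one full-batch gradient computation (same regularized cross-entropy gradient; `grad_step` is a hypothetical helper name, not part of the program above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_step(x, y, weights, bias, reg_rate=0.0001):
    """Full-batch gradient of the L2-regularized cross-entropy loss."""
    sig = sigmoid(x.dot(weights) + bias)          # predictions for all samples at once
    w_g = -(y - sig).dot(x) + reg_rate * weights  # weight gradient
    b_g = -np.sum(y - sig)                        # bias gradient
    return w_g, b_g

# Tiny smoke test on made-up data: with zero weights every prediction is 0.5.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
w_g, b_g = grad_step(x, y, np.zeros(2), 0.0)
print(w_g, b_g)  # [1. 1.] 0.0
```

On a 3500 x 57 training matrix this removes two levels of Python looping per epoch, which is the main cost of the implementation above.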
Training results:
after 0 epochs, the acc on train data is: 0.6674285714285715
after 10 epochs, the acc on train data is: 0.9202857142857143
after 20 epochs, the acc on train data is: 0.9168571428571428
after 30 epochs, the acc on train data is: 0.9211428571428572
after 40 epochs, the acc on train data is: 0.9265714285714286
after 50 epochs, the acc on train data is: 0.9251428571428572
after 60 epochs, the acc on train data is: 0.9228571428571428
after 70 epochs, the acc on train data is: 0.9245714285714286
after 80 epochs, the acc on train data is: 0.9237142857142857
after 90 epochs, the acc on train data is: 0.9242857142857143
after 100 epochs, the acc on train data is: 0.9225714285714286
after 110 epochs, the acc on train data is: 0.922
after 120 epochs, the acc on train data is: 0.9237142857142857
after 130 epochs, the acc on train data is: 0.9228571428571428
after 140 epochs, the acc on train data is: 0.9185714285714286
after 150 epochs, the acc on train data is: 0.9245714285714286
after 160 epochs, the acc on train data is: 0.9288571428571428
after 170 epochs, the acc on train data is: 0.9225714285714286
after 180 epochs, the acc on train data is: 0.9234285714285714
after 190 epochs, the acc on train data is: 0.9225714285714286
the acc on validation data is: 0.938
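For reference, the update the training loop implements can be written out. Per epoch, the full-batch gradients of the regularized cross-entropy loss are (using the code's symbols: $\sigma$ the sigmoid, $\lambda$ = reg_rate, $\eta$ = learning_rate, $N$ the number of training samples):

$$g_w = \sum_{j=1}^{N}\Bigl[-\bigl(y_j - \sigma(w^\top x_j + b)\bigr)\,x_j + \lambda w\Bigr],\qquad
g_b = -\sum_{j=1}^{N}\bigl(y_j - \sigma(w^\top x_j + b)\bigr)$$

and the Adagrad step divides by the root of the running sum of squared gradients (elementwise):

$$G_w \mathrel{+}= g_w^{2},\qquad w \leftarrow w - \frac{\eta}{\sqrt{G_w}}\,g_w,\qquad
G_b \mathrel{+}= g_b^{2},\qquad b \leftarrow b - \frac{\eta}{\sqrt{G_b}}\,g_b.$$

This per-parameter scaling is why a raw learning rate of 1 is workable here: Adagrad shrinks the effective step for coordinates whose gradients have historically been large.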