Linear Regression: Logistic Regression in Practice

Latest recommended article published 2024-06-27 02:07:07

chaoping315

Article tags: Logistic, regression classification

Article link: https://blog.csdn.net/chaoping315/article/details/81699998

Logit Regression

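Earlier we covered linear regression. In linear regression, the dependent and independent variables are both continuous, and the relationship between them is linear. When we use logistic regression for a classification problem, the class labels are necessarily discrete. If the dependent variable can be transformed into a continuous value, the independent variables may be linearly related to that transformed value, and linear regression can then be applied to the classification problem. Understanding this transformation is the key to understanding how linear regression is used for classification.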

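Consider a binary classification problem first. When Y takes the values 0 and 1, the expectation of Y equals the probability that the event occurs, and that probability lies in the interval [0,1]. So are the independent variables linearly related to the probability of the event? Not directly: a linear function of x is unbounded, while a probability must stay within [0,1], so some transformation is needed.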
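To make the question above concrete, here is a minimal sketch that fits an ordinary least-squares line directly to 0/1 labels. The toy data (`x`, `y`) is hypothetical and chosen only for illustration; it is not the article's dataset. The point is that the fitted line leaves the [0,1] range, so it cannot be read as a probability.

```python
import numpy as np

# Hypothetical toy data: one feature, binary labels (0 = no event, 1 = event).
x = np.arange(8, dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line y ~ slope * x + intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)
y_line = slope * x + intercept

print("fitted values:", np.round(y_line, 3))
print("min = %.3f, max = %.3f" % (y_line.min(), y_line.max()))
```

On this toy data the smallest fitted value is below 0 and the largest is above 1, which is why a transformation of the target is introduced before fitting a linear model.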

Logit regression functions

For logistic regression, we first state a few formulas without proof, and then interpret them afterwards.

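The logistic (sigmoid) function:

$g(z) = \frac{1}{1+e^{-z}} \\ h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$

Assume y follows a binomial (Bernoulli) distribution:

$P(y=1|x,\theta) = h_\theta(x) \\ P(y=0|x,\theta) = 1-h_\theta(x)$

The odds of an event are the ratio of the probability that it occurs to the probability that it does not occur. Taking the logarithm of the odds gives the log-odds (logit):

$\log\frac{p}{1-p} = \log \frac{h_\theta(x)}{1-h_\theta(x)} = \log\left(\frac{\frac{1}{1+e^{-\theta^T x}}}{1-\frac{1}{1+e^{-\theta^T x}}}\right) = \log e^{\theta^T x} = \theta^T x$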
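The identity above can be checked numerically. The following is a minimal sketch: the parameter vector `theta` and the sample matrix `X` are made-up values used only for illustration, and `sigmoid` and `log_odds` are helper names introduced here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Logit: log of the ratio p / (1 - p)."""
    return np.log(p / (1.0 - p))

# Hypothetical parameters and samples (first column is a constant 1 for the bias).
theta = np.array([0.5, -1.2, 2.0])
X = np.array([[1.0, 0.3, -0.7],
              [1.0, 2.0, 0.1],
              [1.0, -1.5, 1.2]])

z = X @ theta          # linear score theta^T x for each sample
p = sigmoid(z)         # P(y = 1 | x, theta), always strictly between 0 and 1

print("linear score:", z)
print("probability :", p)
print("log-odds    :", log_odds(p))  # matches the linear score up to rounding
```

The sigmoid maps the unbounded linear score into (0, 1), and the logit maps it back, so the two functions are inverses of each other.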

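The formula above shows that the log-odds is linear in x, so by choosing a suitable threshold we can turn the continuous output into a classification. Conversely, a classification problem can be viewed in terms of odds: the logarithm of the ratio between the event occurring and not occurring is a linear problem.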
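The thresholding step can be sketched as follows. This is an illustrative example, not the article's code: `predict_label` is a hypothetical helper and the scores in `z` are made up. It shows that thresholding the probability at t is the same as thresholding the linear score at log(t / (1 - t)), which is 0 for t = 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(z, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return (sigmoid(z) >= threshold).astype(int)

# Hypothetical linear scores theta^T x for five samples.
z = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])

# Threshold 0.5 on the probability is the same as threshold 0 on the score.
labels_prob = predict_label(z, threshold=0.5)
labels_score = (z >= 0.0).astype(int)

# A stricter threshold t moves the cut on the score to log(t / (1 - t)).
t = 0.7
labels_strict = predict_label(z, threshold=t)
labels_strict_score = (z >= np.log(t / (1.0 - t))).astype(int)

print(labels_prob, labels_score)
print(labels_strict, labels_strict_score)
```

Because the decision boundary is a threshold on a linear function of x, logistic regression is a linear classifier even though its output is a probability.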

Logistic regression in practice: the code

# -*- coding:utf-8 -*-

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


if __name__ == "__main__":
    data = pd.read_excel("shouru50w.xlsx")
    # Encode the categorical columns (income level, sex) as integer codes.
    data['收入水平'] = pd.Categorical(data['收入水平']).codes
    data['性别'] = pd.Categorical(data['性别']).codes
    # x = data[['年龄', '受教育时间', '性别', '资产净增', '资产损失', '一周工作时间']]
    # y = data[['收入水平']]
    # The first 6 columns are the features; the remaining column is the label.
    x, y = np.split(data.values.astype('float64'), (6,), axis=1)

    # Standardize, expand to degree-1 polynomial features, then fit the classifier.
    lr = Pipeline([('sc', StandardScaler()),
                   ('poly', PolynomialFeatures(degree=1, interaction_only=False)),
                   ('clf', LogisticRegression())])
    # Fit on the full data set first and evaluate on the same data.
    lr.fit(x, y.ravel())
    y_hat = lr.predict(x)
    y_hat_prob = lr.predict_proba(x)
    np.set_printoptions(suppress=True)
    # print('y_hat = \n', y_hat)
    # print('y_hat_prob = \n', y_hat_prob)
    print('accuracy: %.2f%%' % (100 * np.mean(y_hat == y.ravel())))
    # R2 is a regression metric; it is kept as in the original,
    # but accuracy is the more natural measure for a classifier.
    print("R2 %s" % r2_score(y, y_hat))

    # Refit on a training split and evaluate on both splits.
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

    lr.fit(x_train, y_train.ravel())
    y_train_hat = lr.predict(x_train)
    y_train_hat_prob = lr.predict_proba(x_train)
    print('train accuracy: %.2f%%' % (100 * np.mean(y_train_hat == y_train.ravel())))
    print("R2 %s" % r2_score(y_train, y_train_hat))

    y_test_hat = lr.predict(x_test)
    y_test_hat_prob = lr.predict_proba(x_test)
    print('test accuracy: %.2f%%' % (100 * np.mean(y_test_hat == y_test.ravel())))
    print("R2 %s" % r2_score(y_test, y_test_hat))
    # Learned weights theta and bias of the logistic regression step.
    print(lr.named_steps['clf'].coef_)
    print(lr.named_steps['clf'].intercept_)
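To connect the printed `coef_` and `intercept_` back to the formulas, here is a small sketch on synthetic data. The data, the random seed, and the label rule are assumptions made for illustration because the spreadsheet is not included here. It suggests that, for a binary problem, applying the sigmoid to the linear score built from `coef_` and `intercept_` reproduces what `predict_proba` returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data: two features and a noisy linear labeling rule.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score theta^T x + b assembled by hand from the fitted parameters.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
p_sklearn = clf.predict_proba(X)[:, 1]

print("max abs difference:", np.max(np.abs(p_manual - p_sklearn)))
print("accuracy: %.2f%%" % (100 * clf.score(X, y)))
```

In other words, each coefficient is the change in the log-odds of the positive class per unit change in the corresponding (here unscaled) feature.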

Data: