分类---逻辑回归（二分类）

少年心不定

已于 2022-10-25 11:56:55 修改

阅读量1.1k

点赞数 1

分类专栏：机器学习文章标签： 1024程序员节

于 2022-10-24 17:45:42 首次发布

本文链接：https://blog.csdn.net/m0_67531118/article/details/127496353

版权

机器学习专栏收录该内容

6 篇文章 1 订阅

订阅专栏

逻辑回归的基本原理：逻辑回归预测的是概率，需要求解的是如何选取参数c $_{i}$ 和b可以使得所有样本预测正确的可能性最大。逻辑回归算法需要找到分类概率P(y=1)与输入向量X的直接关系，然后通过比较概率值来判断类别。

逻辑回归有个基本假设，即数据的分布符合伯努利分布，也就是正类的概率与负类的概率之和为1，如抛硬币，正反面概率之和为1.在样本具有若干属性值为X的前提下，样本被分类为正类（y=1）的概率为:

P(y=1|X)

而样本为负类的概率为：

P(y=0|X) = 1-P(y=1|X)

概率是指事件发生的可能性与不发生的可能性的比值。定义一个odd(x)为X的概率，这个概率的取值为0到正无穷，值越大，说明发生的可能性越大。odd(x)的表达式：

odd(x) = P(y=1|X)/P(y=0|X) = p/1-p

两边取自然对数就得到Logistic变换，将odd(x)的自然对数成为logit函数，

logit(p) = ln(odd(x)) = P(y=1|X)/P(y=0|X) = ln( $p/1-p$ )--为线性回归所预测的假设函数:

假设函数：

p = 1/1+ $e^{-z}$ （z代表ln(odd(x)）p代表是y的概率）

逻辑回归的损失函数：

损失函数可以定义为负的最大似然函数，损失值越小，似然函数值就越大，这些模型参数也就越能导致样本的观测值。

J( $\theta$ ) = -ln(L( $\theta$ )) = - $\Sigma$ (y $_{i}$ ln(h(x $_{i}$ ))+(1-y $_{i}$ )ln(1-h(x $_{i}$ )))

代码：

读取数据，划分数据集，处理数据

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv(r"D:\data\Social_Network_Ads.csv")
X = data.iloc[:,[2,3]]
y = data.iloc[:,4]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.25,random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

训练模型，模型评估（混淆矩阵）：

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

计算精确度与敏感度：

from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)

二分类可视化结果：

plt.figure()
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()