[Machine Learning] 3: Logistic Regression

qq_43507078

Last modified: 2024-09-08 15:54:28

Category: My Machine Learning. Tags: machine learning, regression, artificial intelligence

First published: 2024-09-04 23:43:50

Article link: https://blog.csdn.net/qq_43507078/article/details/141905812

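Logistic regression is a classification method and a log-linear model.
It is mainly used for binary classification problems.

Linear regression is ineffective for some classification problems, whereas logistic regression handles them. A linear classifier separates the classes with a hyperplane; logistic regression instead determines the most likely class of an instance, that is, the probability that it belongs to each class.

It does this by passing the linear regression output $w \cdot x + b$ through the sigmoid function to turn it into a probability.

The closely related maximum entropy approach builds the model with the fewest assumptions by maximizing entropy, and from that infers the probability distribution that best fits the data.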

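Preliminaries

Sigmoid function (logistic function)

Its range is (0, 1), which means it maps any real value into the interval (0, 1); it can be seen as a kind of squashing or normalization.

This function matters a lot: neural networks and many other models use it.

Pros: smooth and easy to differentiate
Cons: relatively expensive to compute, because it needs an exponential and a division

$$f(x) = \frac{1}{1 + e^{-x}}$$

When x is large and positive, f(x) approaches 1
When x is large and negative, f(x) approaches 0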
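To make the limiting behavior concrete, here is a minimal NumPy sketch of the sigmoid. The split into two branches is an implementation choice of this sketch (not something the article prescribes); it keeps `np.exp` from overflowing for inputs of large magnitude.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), evaluated without overflow."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the direct formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

values = sigmoid(np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0]))
print(values)
```

In floating point the two extreme inputs saturate to exactly 0 and 1, even though the mathematical range is the open interval (0, 1).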

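Likelihood function

Plenty of articles cover this already, so here is just the intuition.

Probability: I know the model parameters, and I want the probability of a particular observation x (the probability that event x occurs): P(x | parameters)
Likelihood: the event x has already happened, and (supposing the parameters could take several values) I want to know how plausible each set of parameters is: L(parameters | x). In other words, I have the observation and want to find the parameters. Note that the likelihood is a function of the parameters, not a probability distribution over them.

L(parameters | x) = P(x | parameters)

The observation x can have any dimension here.

With only the observations and no other description, we cannot know the exact parameters. This is where maximum likelihood estimation comes in: we take the parameters to be the set under which the observation x is most probable (the one with the largest likelihood).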
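As an illustration of maximum likelihood estimation, the sketch below uses a made-up observation (10 coin flips) and a one-parameter Bernoulli model; the data and the grid of candidate values are assumptions chosen only for this example. It evaluates L(p | x) = P(x | p) for each candidate p and picks the largest.

```python
import numpy as np

# Hypothetical observation: 10 coin flips, 1 = heads, 0 = tails (7 heads).
x = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

def likelihood(p, x):
    """L(p | x) = P(x | p) for independent Bernoulli(p) flips."""
    return np.prod(p ** x * (1 - p) ** (1 - x))

# Evaluate the likelihood on a grid of candidate parameter values.
candidates = np.linspace(0.01, 0.99, 99)
values = np.array([likelihood(p, x) for p in candidates])
p_hat = candidates[np.argmax(values)]

print(f"maximum likelihood estimate: {p_hat:.2f}")
print(f"fraction of heads observed:  {x.mean():.2f}")
```

For this model the maximizer coincides with the observed fraction of heads, which is why the two printed numbers agree.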

More on this in the model section below.

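I. Logistic regression

Logistic distribution

Let X be a continuous random variable. X follows the logistic distribution if its distribution function is (this is the logistic function):
$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x - \mu)/\gamma}}$$

Density function:
$$f(x) = F'(x) = \frac{e^{-(x - \mu)/\gamma}}{\gamma \left(1 + e^{-(x - \mu)/\gamma}\right)^2}$$

Here $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter.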
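A quick numerical check can confirm that the density is the derivative of the distribution function. The sketch assumes the standard parameterization of the logistic distribution with location $\mu$ and shape $\gamma > 0$; the specific values of `mu` and `gamma` are arbitrary.

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) of the logistic distribution."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = F'(x) of the logistic distribution."""
    e = np.exp(-(x - mu) / gamma)
    return e / (gamma * (1.0 + e) ** 2)

mu, gamma = 1.0, 2.0          # illustrative values only
x = np.linspace(-10.0, 12.0, 2201)
h = 1e-5

# Central finite difference of F, compared against the closed-form density.
numeric_pdf = (logistic_cdf(x + h, mu, gamma) - logistic_cdf(x - h, mu, gamma)) / (2 * h)
max_err = np.max(np.abs(numeric_pdf - logistic_pdf(x, mu, gamma)))

print("largest difference between F' and f:", max_err)
print("F(mu) =", logistic_cdf(mu, mu, gamma))
```

The curve F is symmetric about the point (mu, 1/2), and the density peaks at x = mu with height 1/(4 gamma).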

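Binary classification (binomial logistic regression)

The model describes the relationship between input and output with conditional probabilities:

$$P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$$
$$P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}$$

The input is x (a value of the random variable X, the instance to classify), and the output Y is 0 or 1.
w and b are the model parameters.

Because the sigmoid curve crosses the y-axis at 0.5 (f(0) = 0.5), the decision threshold is usually 0.5: a probability above 0.5 is treated as class 1, and below 0.5 as class 0.

In effect, I hand an instance to the model and ask it to compute the (conditional) probabilities of the first class (Y=0) and the second class (Y=1); whichever probability is larger decides the class.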
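The prediction step described above can be sketched as follows. The parameter values `w`, `b` and the instance `x` are hypothetical numbers picked for illustration; `predict_proba` here is a helper defined in this sketch, not the scikit-learn method of the same name.

```python
import numpy as np

def predict_proba(x, w, b):
    """Return (P(Y=0 | x), P(Y=1 | x)) for the binomial logistic regression model."""
    z = np.dot(w, x) + b
    p1 = 1.0 / (1.0 + np.exp(-z))   # same value as exp(z) / (1 + exp(z))
    return 1.0 - p1, p1

# Hypothetical parameters and instance, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 0.5])

p0, p1 = predict_proba(x, w, b)
label = int(p1 > 0.5)               # threshold at 0.5

print(f"P(Y=0|x) = {p0:.3f}, P(Y=1|x) = {p1:.3f}, predicted class = {label}")
```

Here $w \cdot x + b = 2$, so the instance lies on the positive side of the boundary $w \cdot x + b = 0$ and is assigned to class 1.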

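Odds: probability that the event occurs / probability that it does not = p/(1-p)
Log odds (logit): log(p/(1-p))

Intuitively, the larger the odds, the more likely the event is to occur.

So, in the classification problem, the log odds of the class Y=1 are
$$\log \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = w \cdot x + b$$

The result is a linear function of the input x, so we are back at a linear model.
Note that the log belongs on the left-hand side: it is the log odds, not the odds themselves, that equal $w \cdot x + b$.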
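The identity between the log odds and the linear part can be checked numerically. This is a small sketch with hypothetical weights, bias, and inputs; it only verifies that taking the log of p/(1-p) recovers $w \cdot x + b$.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
X = np.array([[1.0, 0.5], [-1.0, 2.0], [0.0, 0.0]])

z = X @ w + b                    # linear part w.x + b for each row
p = 1.0 / (1.0 + np.exp(-z))     # P(Y=1 | x)
odds = p / (1.0 - p)
log_odds = np.log(odds)

print("w.x + b :", z)
print("log odds:", log_odds)
```

The two printed arrays agree up to rounding, while the odds themselves equal exp(w.x + b).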

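Parameter estimation (w, b) and the loss function

A model has parameters, and we need to know which parameter values make the model perform best.
Write $\pi(x_i) = P(Y=1 \mid x_i)$ and absorb b into w by appending a constant 1 to x. The likelihood function and its logarithm are
$$L(w) = \prod_{i=1}^{N} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$$
$$\log L(w) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (1 - y_i) \log (1 - \pi(x_i)) \right]$$
This log-likelihood function plays a role similar to a loss function.

$\pi(x_i)$ is the predicted probability that $x_i$ belongs to class 1. If the prediction is wrong, the probability assigned to the true label is small, and the likelihood comes out smaller than it would with the right parameters. So to find the best parameters we maximize the likelihood.

The estimated parameters are the values that maximize the likelihood, which is again an optimization problem, solved with gradient descent and the like (strictly speaking it is gradient ascent here: the update adds the step to the parameters).

To minimize instead, just use $-\log L$. Averaged over the N samples, this is the cross-entropy loss:

$$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (1 - y_i) \log (1 - \pi(x_i)) \right]$$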
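Below is a minimal sketch of gradient ascent on the log-likelihood, under assumptions that are not in the article: the data are simulated from hypothetical parameters `true_w`, the bias is handled by a constant column of ones, and the step size is a fixed value chosen small enough for this data. The gradient of the log-likelihood with respect to w is $\sum_i (y_i - \pi(x_i))\, x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.hstack([X, np.ones((200, 1))])     # constant column: the last weight plays the role of b
true_w = np.array([1.5, -2.0, 0.3])       # hypothetical parameters used only to simulate labels
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def log_likelihood(w, X, y):
    z = X @ w
    # y*log(pi) + (1-y)*log(1-pi) simplifies to y*z - log(1 + exp(z)).
    return np.sum(y * z - np.logaddexp(0.0, z))

def gradient(w, X, y):
    pi = 1 / (1 + np.exp(-X @ w))         # predicted P(Y=1 | x_i)
    return X.T @ (y - pi)

w = np.zeros(3)
lr = 0.5 / len(y)                         # fixed step size (an assumption of this sketch)
history = [log_likelihood(w, X, y)]
for _ in range(500):
    w = w + lr * gradient(w, X, y)        # ascent: add the step to the parameters
    history.append(log_likelihood(w, X, y))

print("estimated w:", w)
print("log-likelihood: start", history[0], "end", history[-1])
```

Flipping the sign of both the objective and the gradient turns this into gradient descent on $-\log L$, which produces exactly the same sequence of parameters.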

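On its own, logistic regression cannot represent interactions between different features; any cross terms have to be added to the input by hand as extra features.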
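A small experiment illustrates the point about feature interactions. The XOR-like data set below is synthetic and chosen for this sketch: the label depends only on the sign of the product of the two features, so no linear boundary in the raw features can separate the classes, while appending the product as a third feature makes the problem linear.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
# XOR-like labels: class 1 when the two features have the same sign.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Plain logistic regression on the raw features.
plain = LogisticRegression().fit(X, y)
acc_plain = plain.score(X, y)

# Same model with the product x1 * x2 appended as a hand-made cross feature.
X_cross = np.column_stack([X, X[:, 0] * X[:, 1]])
crossed = LogisticRegression().fit(X_cross, y)
acc_cross = crossed.score(X_cross, y)

print(f"training accuracy, raw features:       {acc_plain:.2f}")
print(f"training accuracy, with cross feature: {acc_cross:.2f}")
```

The first model stays close to chance level, while the second separates the classes almost perfectly.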

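Regularization

About regularization:
it simply adds a regularization term to the loss function
$$J_{\text{reg}}(w) = J(w) + \lambda R(w)$$
where $J(w)$ is the unregularized loss, $\lambda \ge 0$ sets the strength, and $R(w)$ is for example $\lVert w \rVert_1$ (L1) or $\tfrac{1}{2}\lVert w \rVert_2^2$ (L2).

The derivatives change accordingly, but the approach stays the same.

In scikit-learn, regularization can be applied by setting the penalty parameter:
penalty='l1' corresponds to L1 regularization (as in Lasso); it needs a solver that supports it, such as 'liblinear' or 'saga'.
penalty='l2' corresponds to L2 regularization (as in Ridge); this is the default.
penalty='elasticnet' uses Elastic Net; it needs solver='saga' and an l1_ratio value.
The regularization strength is controlled by the parameter C, which is the inverse of the strength: the smaller C is, the stronger the regularization.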

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate simulated data for binary classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model with L2 regularization
model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
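To see the effect of C and of the penalty type, the following sketch fits three models on the same kind of simulated data as above. The particular values C=0.01 and C=100 are arbitrary choices for illustration, and the `penalty` argument is used as described in the text; newer scikit-learn releases may emit a deprecation warning for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)

# L2 penalty with strong (small C) and weak (large C) regularization.
strong = LogisticRegression(penalty='l2', C=0.01, solver='liblinear').fit(X, y)
weak = LogisticRegression(penalty='l2', C=100.0, solver='liblinear').fit(X, y)

# L1 penalty at the same strong setting, to compare sparsity.
sparse = LogisticRegression(penalty='l1', C=0.01, solver='liblinear').fit(X, y)

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
zeros_l1 = int(np.sum(sparse.coef_ == 0))
zeros_l2 = int(np.sum(strong.coef_ == 0))

print(f"L2 weight norm: C=0.01 -> {norm_strong:.3f}, C=100 -> {norm_weak:.3f}")
print(f"coefficients exactly zero: L1 -> {zeros_l1}, L2 -> {zeros_l2}")
```

A smaller C shrinks the weights more, and the L1 penalty additionally drives some coefficients exactly to zero, which L2 generally does not.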

Code

logistic

1

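# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset (iris has three classes, so this is multiclass logistic regression)
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the logistic regression model
# (max_iter raised from the default of 100 so the solver has room to converge on unscaled features)
logreg = LogisticRegression(max_iter=200)

# Train the model
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

# Print the model coefficients and intercepts
print("Model coefficients:", logreg.coef_)
print("Model intercepts:", logreg.intercept_)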

2

import numpy as np
import matplotlib.pyplot as plt

# Generate simulated data; the true boundary x1 + x2 = 0 passes through the origin,
# so this example gets by without a bias term
np.random.seed(0)
num_samples = 100
X = np.random.randn(num_samples, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the loss function (average negative log-likelihood, i.e. cross-entropy)
def loss_function(theta, X, y):
    m = len(y)
    h = sigmoid(X.dot(theta))
    # Clip the probabilities so that log(0) cannot occur when h saturates
    h = np.clip(h, 1e-12, 1 - 1e-12)
    cost = (-1/m) * (y.T.dot(np.log(h)) + (1 - y).T.dot(np.log(1 - h)))
    return cost

# Update the parameters with gradient descent
def gradient_descent(theta, X, y, learning_rate, num_iterations):
    m = len(y)
    costs = []
    for _ in range(num_iterations):
        h = sigmoid(X.dot(theta))
        gradient = (1/m) * X.T.dot(h - y)
        theta = theta - learning_rate * gradient
        cost = loss_function(theta, X, y)
        costs.append(cost)
    return theta, costs

# Initialize the parameters
theta = np.zeros(X.shape[1])
learning_rate = 0.1
num_iterations = 1000

# Train the model
theta, costs = gradient_descent(theta, X, y, learning_rate, num_iterations)

# Plot the loss against the iteration number
plt.plot(range(num_iterations), costs)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Cost Function over Iterations')
plt.show()

# Predict on new data
new_data = np.array([[1, 2], [-1, -2]])
probabilities = sigmoid(new_data.dot(theta))
predictions = (probabilities > 0.5).astype(int)

print("Predictions for new data:", predictions)

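3 Transaction type prediction

A credit card transaction dataset contains the features of each transaction and whether the transaction is fraudulent. The goal is to train a logistic regression model that predicts whether a new transaction is fraudulent. The code below uses simulated data in place of a real dataset; the labels are drawn independently of the features, so it demonstrates the workflow only and the model has no real signal to learn.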

# Import the required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate simulated credit card transaction data
np.random.seed(0)
n_samples = 1000
n_features = 10

# Simulated transaction features (X) and fraud labels (y)
X = np.random.rand(n_samples, n_features)
y = np.random.choice([0, 1], size=n_samples, p=[0.95, 0.05])  # 95% of transactions are normal, 5% are fraudulent

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
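When reading the accuracy and confusion matrix for data this imbalanced, it may help to compare against a trivial baseline. The sketch below uses hand-made labels (not the output of the model above) to show what the metrics look like for a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made labels: 95 normal transactions (0) and 5 fraudulent ones (1).
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # fraction of fraud cases that were caught
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(f"Accuracy: {acc:.2f}")
print(f"Recall on the fraud class: {rec:.2f}")
print(cm)
```

The accuracy matches the 95% share of normal transactions even though no fraud case is detected, so the second row of the confusion matrix says more about a fraud model than the accuracy does.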