Machine Learning: Logistic Regression


withoutp


Article link: https://blog.csdn.net/withoutp/article/details/139292653


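1. Introduction to Logistic Regression

Logistic regression is a widely used statistical learning method, mainly applied to classification tasks. Despite the word "regression" in its name, it is not used to solve regression problems. Its core idea is to pass a linear combination of the input features through a specific function, the "logistic function" or "sigmoid function", and to build a classification model on the resulting value.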

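2. The Sigmoid Function

a. Formula

The sigmoid function accepts any real input z in (-∞, +∞) and maps it to an output in the open interval (0, 1). Any real value can therefore be turned into a probability, and a class is assigned by comparing that probability with a threshold (0.5 in the experiment below). The formula is: \[ \sigma(z) = \frac{1}{1 + e^{-z}} \] The function is widely used in machine learning and deep learning and is an important building block for classification models.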

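b. Input to the sigmoid function

The prediction function h(x) gives the probability of the positive class; the probability of the negative class is then 1 - h(x). For binary classification, 1 denotes the positive class and 0 the negative class. The formula is: \[ h(x) = \sigma(\theta^T x) \] Because y takes only the values 0 and 1, both cases can be written as a single expression for the predicted probability of y: \[ P(y \mid x; \theta) = h(x)^{y} (1 - h(x))^{1 - y} \]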

c. Sigmoid function code

```python
import numpy as np

def sigmoid(z):
    # Map any real value (or NumPy array) to the interval (0, 1)
    return 1 / (1 + np.exp(-z))
```
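For inputs with a large magnitude, `np.exp(-z)` in the function above can overflow and emit a runtime warning (for example at z = -1000), even though the final result is still correct. The following is a minimal sketch of one common workaround, not part of the original article: the name `stable_sigmoid` and the branch-by-sign approach are assumptions introduced here for illustration.

```python
import numpy as np

def stable_sigmoid(z):
    # Work on an array of at least one dimension so boolean masks can be used
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(values)
```

Both branches compute the same mathematical function; they differ only in which exponential is evaluated.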

3. Gradient Ascent

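a. Likelihood function

The likelihood function gives the probability of the observed class labels as a function of the model parameters; the parameters are estimated by maximizing it. To simplify the computation, we take the logarithm of the likelihood, which turns the product over samples into a sum and yields the log-likelihood function.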
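As a small illustration of why the logarithm is also convenient numerically (a sketch with made-up probabilities, not data from the experiment): multiplying many per-sample probabilities quickly underflows to zero in floating point, whereas the sum of their logarithms stays well within range.

```python
import numpy as np

# Hypothetical per-sample probabilities: 2000 samples, each assigned 0.5
p = np.full(2000, 0.5)

likelihood = np.prod(p)                  # 0.5**2000 underflows to 0.0 in float64
log_likelihood_sum = np.sum(np.log(p))   # 2000 * log(0.5), about -1386.29

print(likelihood)
print(log_likelihood_sum)
```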

The log-likelihood, averaged over the m samples, is: \[ \ell(\theta) = \frac{1}{m} \sum_{i=1}^{m} [y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i))] \]

Log-likelihood code

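```python
def log_likelihood(X, y, theta):
    m = len(y)  # number of samples
    h = sigmoid(np.dot(X, theta))  # predicted probability of the positive class
    # Average log-likelihood over all m samples
    return (1 / m) * (np.dot(y, np.log(h)) + np.dot(1 - y, np.log(1 - h)))
```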

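Code explanation: X is the feature matrix with one row per sample, y is the vector of 0/1 labels, m is the number of samples, theta is the parameter vector, and h holds the predicted probabilities computed by the sigmoid function.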
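A quick sanity check of the function above (a sketch with a made-up four-sample dataset; `X_toy` and `y_toy` are hypothetical): with theta set to zero every prediction is 0.5, so the average log-likelihood must equal log(0.5), regardless of the labels.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    return (1 / m) * (np.dot(y, np.log(h)) + np.dot(1 - y, np.log(1 - h)))

# Hypothetical toy data: a bias column of ones plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

ll_zero = log_likelihood(X_toy, y_toy, np.zeros(2))
print(ll_zero, np.log(0.5))  # both values are approximately -0.6931

# Parameters that point towards the positive samples give a higher value
ll_better = log_likelihood(X_toy, y_toy, np.array([0.0, 1.0]))
print(ll_better > ll_zero)
```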

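b. Gradient descent

To find the maximum of the likelihood function we use gradient ascent; to find the minimum of a loss function we use gradient descent. The loss function here is the negative of the log-likelihood, so maximizing the log-likelihood and minimizing the loss lead to the same parameters.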
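Since the loss is the negative of the log-likelihood, a gradient ascent step on the log-likelihood and a gradient descent step on the loss coincide. The sketch below checks this numerically on a made-up dataset with an arbitrary starting point; all names are illustrative and not taken from the original experiment.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def grad_log_likelihood(X, y, theta):
    # Gradient of the average log-likelihood: (1/m) * X^T (y - h)
    return X.T @ (y - sigmoid(X @ theta)) / len(y)

def grad_loss(X, y, theta):
    # The loss is the negative log-likelihood, so its gradient flips sign
    return -grad_log_likelihood(X, y, theta)

# Hypothetical toy data and starting point
X_toy = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y_toy = np.array([0, 0, 1])
theta = np.array([0.1, -0.2])
alpha = 0.1

theta_ascent = theta + alpha * grad_log_likelihood(X_toy, y_toy, theta)
theta_descent = theta - alpha * grad_loss(X_toy, y_toy, theta)
print(np.allclose(theta_ascent, theta_descent))
```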

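c. Learning rate

The learning rate strongly affects both how fast the model converges and how accurate the result is. If it is too large, the parameter updates oscillate and the model may fail to converge; if it is too small, convergence is slow and the parameters may still be far from the optimum when the iterations run out. In stochastic gradient ascent the learning rate is varied during training, and adjusting it can reduce the oscillation.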

Parameter update formula: \[ \theta_j := \theta_j + \alpha \frac{1}{m} \sum_{i=1}^{m} (y_i - h(x_i)) x_{ij} \]
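The term multiplied by alpha in this formula is the partial derivative of the averaged log-likelihood with respect to theta_j. As a hedged sanity check (a sketch on made-up data; the helper names and the step size `eps` are assumptions), the analytic gradient can be compared against a central finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, theta):
    h = sigmoid(X @ theta)
    return np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(X, y, theta):
    # (1/m) * sum_i (y_i - h(x_i)) * x_ij, computed for every j at once
    return X.T @ (y - sigmoid(X @ theta)) / len(y)

def numeric_gradient(X, y, theta, eps=1e-6):
    # Central differences: perturb one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (log_likelihood(X, y, theta + step)
                   - log_likelihood(X, y, theta - step)) / (2 * eps)
    return grad

# Hypothetical toy data: bias column plus two features
X_toy = np.array([[1.0, 0.5, 2.0], [1.0, 1.5, 0.3],
                  [1.0, 3.0, 1.0], [1.0, 2.2, 2.5]])
y_toy = np.array([0, 1, 1, 0])
theta = np.array([0.2, -0.4, 0.1])

g_analytic = analytic_gradient(X_toy, y_toy, theta)
g_numeric = numeric_gradient(X_toy, y_toy, theta)
print(np.max(np.abs(g_analytic - g_numeric)))
```

The printed difference should be tiny compared with the gradient itself, which supports the sign and the 1/m factor in the update rule.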

Gradient ascent code

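```python
def gradient_ascent(X, y, iterations, alpha):
    m, n = X.shape
    theta = np.zeros(n)  # start from all-zero parameters
    for _ in range(iterations):
        h = sigmoid(np.dot(X, theta))  # predictions with the current theta
        for j in range(n):
            # Move theta_j along the gradient of the average log-likelihood
            theta[j] += alpha * np.sum((y - h) * X[:, j]) / m
    return theta
```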

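Code explanation: theta is initialized to zero and then updated over many iterations. In each iteration the gradient at the current theta is computed, and theta is adjusted along it to increase the log-likelihood.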
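The learning-rate discussion above mentions stochastic gradient ascent with a changing learning rate. The original article gives no code for it, so the following is only a sketch under stated assumptions: the function name `stochastic_gradient_ascent`, the decay schedule `alpha0 / (1 + t * decay)`, the default values, and the fixed random seed are all choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def stochastic_gradient_ascent(X, y, epochs=50, alpha0=0.5, decay=0.01, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    t = 0  # counts individual updates
    for _ in range(epochs):
        for i in rng.permutation(m):           # visit samples in random order
            alpha = alpha0 / (1 + t * decay)   # assumed decay schedule
            h_i = sigmoid(np.dot(X[i], theta))
            theta += alpha * (y[i] - h_i) * X[i]  # update from a single sample
            t += 1
    return theta

# Hypothetical toy data: bias column plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

theta_sgd = stochastic_gradient_ascent(X_toy, y_toy)
predictions = (sigmoid(X_toy @ theta_sgd) >= 0.5).astype(int)
print(theta_sgd, predictions)
```

Unlike the batch version, each update here uses one sample, so early steps are noisy; shrinking alpha over time damps that noise.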

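4. Experiment

a. Obtaining and preparing the dataset

We use "Sepal.Length" and "Sepal.Width" from the iris dataset as features and keep two species, setosa and versicolor. For convenience, versicolor is labeled 0 and setosa is labeled 1. The number of iterations is set to 1000 and the learning rate alpha to 0.1, the values passed to gradient_ascent in section c.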

Data processing code

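```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the space-separated training file (replace the placeholder path with your own)
data = pd.read_csv(r"C:\path\to\your\dataset.txt", sep=' ')

X = data.iloc[:, :2].values    # first two columns: Sepal.Length, Sepal.Width
y1 = data.iloc[:, -1].values   # last column: species name
y = np.array([1 if label == 'setosa' else 0 for label in y1])  # setosa -> 1, versicolor -> 0
X = np.c_[np.ones(X.shape[0]), X]  # prepend a column of ones for the intercept term
```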
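The article does not show the data file itself. As an illustration only, the sketch below assumes a whitespace-separated text file with a header row and the column names used later in the test code ("Sepal.Length", "Sepal.Width", "Species"); the in-memory sample `sample_text` stands in for the real file and contains just a handful of rows.

```python
import io

import numpy as np
import pandas as pd

# Assumed file layout (a few iris rows; the real file has more rows
# and may have more columns between the features and the species)
sample_text = """Sepal.Length Sepal.Width Species
5.1 3.5 setosa
4.9 3.0 setosa
7.0 3.2 versicolor
6.4 3.2 versicolor
"""

data = pd.read_csv(io.StringIO(sample_text), sep=r"\s+")

X = data.iloc[:, :2].values  # the two sepal features
labels = data.iloc[:, -1].values
y = np.array([1 if label == "setosa" else 0 for label in labels])
X = np.c_[np.ones(X.shape[0]), X]  # add the intercept column

print(X.shape, y)
```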

b. Scatter plot

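```python
# Font with CJK glyphs; only needed when plot text contains Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']

# Columns 1 and 2 hold the features; column 0 is the intercept
plt.scatter(X[:, 1], X[:, 2], c=y, cmap='viridis')
plt.xlabel('Sepal.Length')
plt.ylabel('Sepal.Width')
plt.title('Scatter plot')
plt.show()
```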

c. Logistic regression curve (decision boundary)

```python
theta = gradient_ascent(X, y, 1000, 0.1)  # 1000 iterations, learning rate 0.1

# The boundary is where theta[0] + theta[1]*x1 + theta[2]*x2 = 0; solve for x2
xi = np.linspace(np.min(X[:, 1]), np.max(X[:, 1]), 100)
yi = -(theta[0] + theta[1] * xi) / theta[2]

plt.scatter(X[:, 1], X[:, 2], c=y, cmap='viridis')
plt.plot(xi, yi, "r-", label='Logistic regression curve')
plt.xlabel('Sepal.Length')
plt.ylabel('Sepal.Width')
plt.title('Binary classification on the iris dataset')
plt.legend()
plt.show()
```
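The line plotted above follows from the classification rule: the predicted probability is exactly 0.5 when theta[0] + theta[1]*x1 + theta[2]*x2 = 0, and solving for x2 gives the expression used for `yi`. The sketch below checks this with a made-up parameter vector (`theta_demo` is hypothetical, not the result of the experiment).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta_demo = np.array([-1.0, -2.0, 4.0])  # hypothetical parameters

x1 = np.linspace(4.0, 7.0, 5)
x2 = -(theta_demo[0] + theta_demo[1] * x1) / theta_demo[2]  # boundary line

# Every point on the line has a predicted probability of 0.5
points_on_line = np.c_[np.ones_like(x1), x1, x2]
probs_on_line = sigmoid(points_on_line @ theta_demo)
print(probs_on_line)

# Points above the line (theta_demo[2] > 0 here) fall on the positive side
points_above = np.c_[np.ones_like(x1), x1, x2 + 0.5]
probs_above = sigmoid(points_above @ theta_demo)
print(probs_above > 0.5)
```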

d. Classification with logistic regression

```python
# Raw strings keep the backslashes in the path and the regex separator intact
test_data = pd.read_csv(r"C:\path\to\your\test_data.txt", sep=r'\s+')

X_test = test_data[["Sepal.Length", "Sepal.Width"]].values
y_test = test_data["Species"].values
y_test = np.array([1 if label == 'setosa' else 0 for label in y_test])
X_test = np.c_[np.ones(X_test.shape[0]), X_test]  # add the intercept column

correct_predictions = 0
for i in range(len(y_test)):
    # Predict class 1 when the estimated probability is at least 0.5
    prediction = sigmoid(np.dot(X_test[i], theta)) >= 0.5
    if prediction == y_test[i]:
        correct_predictions += 1

accuracy = correct_predictions / len(y_test)
print(f"Accuracy: {accuracy}")
```
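The per-sample loop above can also be written in vectorized form. The sketch below is an illustration on made-up values: `theta_demo`, `X_demo` and `y_demo` are hypothetical stand-ins for the fitted parameters and the test set, and the helper name `predict` is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    # Class 1 when the estimated probability reaches the threshold
    return (sigmoid(X @ theta) >= threshold).astype(int)

# Hypothetical parameters and test samples (intercept column already added)
theta_demo = np.array([-1.0, -2.0, 4.0])
X_demo = np.array([
    [1.0, 5.0, 3.5],
    [1.0, 4.8, 3.1],
    [1.0, 6.5, 2.8],
    [1.0, 6.0, 3.4],
])
y_demo = np.array([1, 1, 0, 0])

y_pred = predict(X_demo, theta_demo)
accuracy = np.mean(y_pred == y_demo)  # fraction of correct predictions
print(y_pred, accuracy)
```

With these made-up numbers the last sample is misclassified, so the accuracy is 0.75.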

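5. Issues Encountered in the Experiment

1. The learning rate has to be chosen appropriately; a value that is too large or too small hurts the accuracy of the model.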

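2. Logistic regression adds a sigmoid function on top of linear regression. The output of linear regression is an unbounded real value, which does not map naturally to class labels, whereas logistic regression produces a probability in (0, 1) that can be thresholded to separate the classes.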

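3. Overfitting or underfitting may occur in the experiment and can be addressed with techniques such as cross-validation and regularization.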
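As an illustration of the regularization mentioned in point 3, the sketch below adds an L2 penalty to batch gradient ascent. It is not the article's implementation: the function name `gradient_ascent_l2`, the penalty strength `lam`, and the choice to leave the intercept unpenalized are assumptions made for this example, and the data is made up.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_ascent_l2(X, y, iterations, alpha, lam):
    # Maximizes the average log-likelihood minus (lam / (2m)) * ||theta[1:]||^2
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h) / m
        penalty = (lam / m) * theta
        penalty[0] = 0.0  # assumption: the intercept is not penalized
        theta += alpha * (grad - penalty)
    return theta

# Hypothetical, perfectly separable toy data: bias column plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

theta_plain = gradient_ascent_l2(X_toy, y_toy, 2000, 0.1, lam=0.0)
theta_reg = gradient_ascent_l2(X_toy, y_toy, 2000, 0.1, lam=1.0)
print(theta_plain, theta_reg)
```

On separable data like this, the unpenalized weight keeps growing as the number of iterations increases, while the penalized weight settles at a finite value.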

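6. Summary

Advantages of logistic regression:
- Simple and efficient, suitable for large datasets.
- The output can be interpreted as a probability, which makes results easy to explain.
- Works well on linearly separable problems.

Disadvantages:
- Limited ability to fit nonlinear data.
- Sensitive to correlation between features.
- Prone to overfitting in high-dimensional feature spaces, so feature selection or regularization is needed.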
