Week 2 Logistic regression (逻辑回归)

Article link: https://blog.csdn.net/weixin_42515443/article/details/80813370

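This post introduces logistic regression for binary classification, using student loan applications as an example of how a logistic regression model can decide whether a student gets a loan. It discusses data generation, the sigmoid function, normalization, and the role of the loss function and gradient descent in determining the model parameters. Finally, an example shows how logistic regression finds the classification boundary and makes predictions.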

Disclaimer: This note was written by “修罗_GUAN” (pseudonym) and has not been subjected to the usual scrutiny. It was first posted on the CSDN blog. The author would like to thank Rover (named 阿易 in the Chinese version of this disclaimer) for his support and help; he also contributed to this note.

————————————— start from here ——————————————-

Background:

Let's consider a binary classification problem. Suppose students apply for a student loan from a financial institution, and the institution decides whether to approve each application based on two factors: the student's credit score and his/her monthly income. We use $X1$ to represent the student's monthly income and $X2$ to represent the student's credit score, and $Y \in \{1,0\}$ to denote the outcome of the application. $Y_i=1$ means that the institution approves the application of student $i$, and $Y_i=0$ means that the application is rejected.

Imagine that 200 students apply for the student loan. Some of them will get the loan, whereas others will be rejected, for example because of a poor credit score or a low monthly income.

To model this problem, we can use logistic regression to classify the students into those who get the loan and those who do not. $X1$ and $X2$ are column vectors of length 200. $Y$ is also a column vector of length 200, and each of its elements is either 1 or 0.

First of all, we generate the data for X1, X2, and Y using random functions.

(Note that the way X1, X2, and Y are generated is somewhat arbitrary, since we do not have actual data from a financial institution.)

# Add this command so that plots are rendered inline
# Otherwise, the plot may not show up in a Jupyter notebook (e.g. under Anaconda)
%matplotlib inline

# Import necessary packages
import matplotlib.pyplot as plt  # Package used for plotting the figure
import numpy as np  # Package for numerical computation

# Define the length of the X1 vector, which is identical to the length of X2 and Y
# In total, 200 students are applying for the student loan
number_observation = 200

# Generate X1 and X2
X1 = np.linspace(1600, 2500, number_observation)  # Assume the students' monthly income ranges from 1600 to 2500
X2 = X1/2.5 + np.random.normal(0, 100, number_observation)  # Assume the credit score is a noisy linear function of income

# Generate Y vector
Y = np.zeros(number_observation)
# indicator represents whether the financial institution approves the student's application or not
# indicator < 0 means the application is rejected.
# indicator >= 0 means the application is approved.
# A noise term is added because I don't want the data to be linearly separable
indicator = (X1 - np.mean(X1))/np.std(X1) + (X2 - np.mean(X2))/np.std(X2) + np.random.normal(0, 0.25, number_observation)
Y[indicator < 0] = 0
Y[indicator >= 0] = 1

# Visualize the data points
plt.figure(figsize=(16,8))
plt.scatter(X1[indicator < 0], X2[indicator < 0], color='b', marker='o', label='Rejected')
plt.scatter(X1[indicator >= 0], X2[indicator >= 0], color='r', marker='*', label='Approved')
font = {'family': 'Times New Roman', 'weight': 'normal', 'size': 16}
plt.xlabel('X1', font)
plt.ylabel('X2', font)
plt.xticks(fontsize=16, fontname='Times New Roman')
plt.yticks(fontsize=16, fontname='Times New Roman')
plt.title('The distribution of X1 and X2', font)
plt.legend(fontsize=16)
plt.show()

[Figure: The distribution of X1 and X2]
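Because the noise comes from a random number generator, every run of the code above produces a different sample. The following is a minimal sketch of one way to make the generation reproducible. It assumes NumPy's `default_rng` with an arbitrary seed; the helper name `generate_loan_data` and the seed value are illustrative and not part of the original code.

```python
import numpy as np


def generate_loan_data(n=200, seed=0):
    """Hypothetical helper: same recipe as above, but with a fixed seed."""
    rng = np.random.default_rng(seed)          # seed value is an arbitrary assumption
    X1 = np.linspace(1600, 2500, n)            # monthly income
    X2 = X1 / 2.5 + rng.normal(0, 100, n)      # noisy credit score
    z1 = (X1 - X1.mean()) / X1.std()
    z2 = (X2 - X2.mean()) / X2.std()
    indicator = z1 + z2 + rng.normal(0, 0.25, n)
    Y = (indicator >= 0).astype(float)         # 1 = approved, 0 = rejected
    return X1, X2, Y


X1, X2, Y = generate_loan_data()
print("approved:", int(Y.sum()), "rejected:", int((1 - Y).sum()))
```

Calling `generate_loan_data()` twice with the same seed returns identical arrays, which makes later results easier to compare between runs.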

Later we will apply the sigmoid function to these data points. The sigmoid function involves the exponential function. Since X1 is large (its maximum is 2500), we may run into numerical overflow when evaluating it. To avoid the overflow problem, we normalize X1 and X2 by subtracting the mean and dividing by the standard deviation.
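The overflow can be seen directly with a small experiment. This is only a sketch: it assumes a coefficient of 1 on the income, and the function name `naive_sigmoid` is illustrative.

```python
import numpy as np


def naive_sigmoid(z):
    """Sigmoid written literally as e^z / (1 + e^z)."""
    return np.exp(z) / (1.0 + np.exp(z))


raw_income = np.float64(2500.0)    # largest raw X1 value
scaled_income = np.float64(1.7)    # roughly the largest X1 after normalization

with np.errstate(over="ignore", invalid="ignore"):  # silence the overflow warnings
    print(naive_sigmoid(raw_income))   # exp(2500) overflows to inf, and inf/inf is nan
print(naive_sigmoid(scaled_income))    # a finite value between 0 and 1
```

In double precision `np.exp` overflows once its argument exceeds roughly 709, so raw incomes in the thousands are far outside the safe range, while normalized values of order 1 are not.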

# Normalize X1 and X2
new_X1 = (X1 - np.mean(X1))/np.std(X1)
new_X2 = (X2 - np.mean(X2))/np.std(X2)

# Visualize the data points
plt.figure(figsize=(16,8))
plt.scatter(new_X1[indicator < 0], new_X2[indicator < 0], color='b', marker='o', label='Rejected')
plt.scatter(new_X1[indicator >= 0], new_X2[indicator >= 0], color='r', marker='*' , label='Approved')
font = {'family': 'Times New Roman', 'weight': 'normal', 'size': 16}
plt.xlabel('new X1', font)
plt.ylabel('new X2', font)
plt.xticks(fontsize=16, fontname='Times New Roman')
plt.yticks(fontsize=16, fontname='Times New Roman')
plt.title('The distribution of new X1 and X2', font)
plt.legend(fontsize=16)
plt.show()

[Figure: The distribution of new X1 and X2]

As we can see from the figure, normalization only changes the numerical values on the two axes. It does not change the shape of the distribution: the relative layout of the data points is still the same.
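This can also be checked numerically. The sketch below regenerates X1 and X2 with an arbitrary fixed seed (an assumption made only so the example is self-contained) and confirms that each normalized variable has mean 0 and standard deviation 1, while the correlation between the two variables is unchanged. The helper name `standardize` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed, for illustration only
X1 = np.linspace(1600, 2500, 200)
X2 = X1 / 2.5 + rng.normal(0, 100, 200)


def standardize(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - np.mean(x)) / np.std(x)


new_X1, new_X2 = standardize(X1), standardize(X2)

# Each normalized variable has mean ~0 and standard deviation ~1 ...
print(np.mean(new_X1), np.std(new_X1))
# ... while the correlation between the two variables is unchanged,
# which is why the scatter plot keeps the same shape.
print(np.corrcoef(X1, X2)[0, 1], np.corrcoef(new_X1, new_X2)[0, 1])
```

Normalization is a shift followed by a positive rescaling of each axis, so the ordering of the points along each axis and their linear correlation are preserved.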

In the figure, the red stars represent students whose loan applications are approved, whereas the blue dots represent students whose loan applications are declined.

Although the data points are generated in an arbitrary way, they are consistent with common sense. A student with a higher monthly income and a higher credit score is likely to have the application approved. In contrast, a student with a low income and a low credit score is likely to have the application rejected.

Now we use the sigmoid function to quantify the probability that a single student gets the loan. For student $i$ with features $X_1^i$ and $X_2^i$ and coefficients $\beta_1$ and $\beta_2$, this probability is:

$h^i(X_1, X_2) = \frac{e^{\beta_1 X_1^i + \beta_2 X_2^i}}{1 + e^{\beta_1 X_1^i + \beta_2 X_2^i}}$
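The formula above can be sketched in code as follows. The function names `sigmoid` and `approval_probability` are illustrative, and the coefficient values are placeholders, since the fitted $\beta_1$ and $\beta_2$ have not been determined at this point. The sketch relies on the identity $e^z/(1+e^z) = 1/(1+e^{-z})$ and picks whichever form avoids a large positive exponent.

```python
import numpy as np


def sigmoid(z):
    """Numerically stable sigmoid: e^z / (1 + e^z) = 1 / (1 + e^(-z))."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe when z is non-negative
    exp_z = np.exp(z[~pos])                    # safe when z is negative
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def approval_probability(x1, x2, beta1, beta2):
    """h(X1, X2) for normalized inputs and given coefficients."""
    return sigmoid(beta1 * np.asarray(x1) + beta2 * np.asarray(x2))


# Placeholder coefficients, chosen only to illustrate the shape of the function.
print(approval_probability([-1.5, 0.0, 1.5], [-1.0, 0.0, 1.0], beta1=2.0, beta2=2.0))
```

With positive coefficients, a student below average on both features gets a probability close to 0, an average student gets exactly 0.5, and a student above average on both gets a probability close to 1.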