Week 2: Logistic regression

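Summary (generated automatically by CSDN): This post introduces logistic regression for binary classification. Using student loan applications as the running example, it explains how a logistic regression model decides whether a student gets a loan, and discusses data generation, the sigmoid function, normalization, and the role of the loss function and gradient descent in determining the model parameters. Finally, it shows by example how logistic regression finds the classification boundary and makes predictions.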

Disclaimer: This note was written by “修罗_GUAN” (pseudonym) and has not been subjected to the usual scrutiny. It was first posted on the CSDN blog. The author thanks Rover (referred to as 阿易 in the Chinese version of this note), who also contributed to it, for his support and help.

————————————— start from here ——————————————-

Background:

Let's consider a binary classification problem. Suppose students apply for a student loan from a financial institution, and the institution decides whether to approve each application based on two factors: the student's credit score and the student's monthly income. We use $X_1$ to represent the monthly income and $X_2$ to represent the credit score, and we use $Y \in \{1, 0\}$ to denote the result of the application. $Y^i = 1$ means that the institution approves the student's application, and $Y^i = 0$ means that the application is rejected.

Imagine that 200 students apply for the loan. Some of them will get the loan, whereas others will be rejected, for example because of a poor credit score or a low monthly income.

To model this problem, we can use logistic regression to separate the students who get the loan from those who do not. $X_1$ and $X_2$ are column vectors of length 200. $Y$ is also a column vector of length 200, and each of its elements is either 1 or 0.

First, we generate the data for X1, X2, and Y using random functions.

(Note that the way X1, X2, and Y are generated is somewhat arbitrary, because we do not have actual data from a financial institution.)

# Jupyter/IPython magic command so that plots are rendered inline
# Otherwise, the plot may not show up in Anaconda
%matplotlib inline

# Import necessary packages
import matplotlib.pyplot as plt  # Package used for plotting the figure
import numpy as np  # Package relating to math

# Define the length of the X1 vector, which is identical to the length of X2 and Y
# In total, 200 students are applying for the student loan
number_observation = 200

# Generate X1 and X2
X1 = np.linspace(1600, 2500, number_observation)  # Assume the students' monthly income ranges from 1600 to 2500
X2 = X1/2.5 + np.random.normal(0, 100, number_observation)  # Assume the credit score follows this function

# Generate Y vector
Y = np.zeros(number_observation)
# indicator represents whether the financial institution approves the student's application or not
# indicator < 0 means the application is rejected.
# indicator >= 0 means the application is approved.
# A noise term is added because I don't want the data to be linearly separable
indicator = (X1 - np.mean(X1))/np.std(X1) + (X2 - np.mean(X2))/np.std(X2) + np.random.normal(0, 0.25, number_observation)
Y[indicator < 0] = 0
Y[indicator >= 0] = 1

# Visualize the data points
plt.figure(figsize=(16,8))
plt.scatter(X1[indicator < 0], X2[indicator < 0], color='b', marker='o', label='Rejected')
plt.scatter(X1[indicator >= 0], X2[indicator >= 0], color='r', marker='*' , label='Approved')
font = {'family': 'Times New Roman', 'weight': 'normal', 'size': 16}
plt.xlabel('X1', font)
plt.ylabel('X2', font)
plt.xticks(fontsize=16, fontname='Times New Roman')
plt.yticks(fontsize=16, fontname='Times New Roman')
plt.title('The distribution of X1 and X2', font)
plt.legend(fontsize=16)
plt.show()

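[Figure: scatter plot of X1 (monthly income) versus X2 (credit score); red stars mark approved applications, blue dots mark rejected ones.]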

Later we will apply the sigmoid function to these data points. The sigmoid function involves the exponential function with base $e$. Because X1 is large (its maximum is 2500), evaluating the exponential directly can overflow in floating-point arithmetic. To avoid this problem, we normalize X1 and X2 to zero mean and unit standard deviation.
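To make the overflow concern concrete, here is a minimal sketch. The coefficient `beta_1 = 1.0`, the two sample incomes, and the helper name `naive_sigmoid` are assumptions chosen only for illustration; they are not part of the original note.

```python
import numpy as np

def naive_sigmoid(z):
    """Sigmoid written directly as e^z / (1 + e^z)."""
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical values for illustration: a coefficient of 1 and two incomes
beta_1 = 1.0
raw_income = np.array([1600.0, 2500.0])

# Silence NumPy's overflow / invalid-value warnings for this demonstration
with np.errstate(over="ignore", invalid="ignore"):
    raw_result = naive_sigmoid(beta_1 * raw_income)

# Standardize the same values (zero mean, unit standard deviation)
scaled_income = (raw_income - raw_income.mean()) / raw_income.std()
scaled_result = naive_sigmoid(beta_1 * scaled_income)

print(raw_result)     # exp overflows to inf, and inf / inf gives nan
print(scaled_result)  # finite probabilities between 0 and 1
```

With the raw incomes the exponential exceeds the float64 range, while the standardized values stay well inside it.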

# Normalize X1 and X2
new_X1 = (X1 - np.mean(X1))/np.std(X1)
new_X2 = (X2 - np.mean(X2))/np.std(X2)

# Visualize the data points
plt.figure(figsize=(16,8))
plt.scatter(new_X1[indicator < 0], new_X2[indicator < 0], color='b', marker='o', label='Rejected')
plt.scatter(new_X1[indicator >= 0], new_X2[indicator >= 0], color='r', marker='*' , label='Approved')
font = {'family': 'Times New Roman', 'weight': 'normal', 'size': 16}
plt.xlabel('new X1', font)
plt.ylabel('new X2', font)
plt.xticks(fontsize=16, fontname='Times New Roman')
plt.yticks(fontsize=16, fontname='Times New Roman')
plt.title('The distribution of new X1 and X2', font)
plt.legend(fontsize=16)
plt.show()

[Figure: the same scatter plot after normalization, with new X1 and new X2 on the axes.]

As the figure shows, normalization only rescales the values on the two axes. It does not change the distribution of X1 and X2: the layout of the data points stays the same.
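This claim can also be checked numerically rather than by eye. The sketch below regenerates data with the same recipe as above, using a fixed seed and `np.random.default_rng` (both are assumptions made for reproducibility, not the note's original setup), and compares a few summary quantities before and after normalization.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Same recipe as in the note: income on a grid, credit score = income / 2.5 + noise
X1 = np.linspace(1600, 2500, 200)
X2 = X1 / 2.5 + rng.normal(0, 100, 200)

new_X1 = (X1 - np.mean(X1)) / np.std(X1)
new_X2 = (X2 - np.mean(X2)) / np.std(X2)

# After normalization each feature has mean 0 and standard deviation 1
print(new_X1.mean(), new_X1.std())
print(new_X2.mean(), new_X2.std())

# The ranking of the points is preserved: sorting by X2 also sorts new_X2
order_preserved = bool(np.all(np.diff(new_X2[np.argsort(X2)]) >= 0))

# The linear relationship between the two features is unchanged
corr_before = np.corrcoef(X1, X2)[0, 1]
corr_after = np.corrcoef(new_X1, new_X2)[0, 1]
print(order_preserved, corr_before, corr_after)
```

Because normalization shifts and rescales each axis separately, the ordering of the points and their correlation are unaffected; only the tick values change.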

In the figure, red stars represent students whose loan applications are approved, whereas blue dots represent students whose loan applications are declined.

Although the data points are generated in an arbitrary way, they are consistent with common sense. A student with a higher monthly income and a higher credit score tends to have the application approved. In contrast, a student with a low income and a low credit score tends to have the application rejected.

Now we use the sigmoid function below to quantify the probability that a single student gets the loan:

$$h^i(X_1, X_2) = \frac{e^{\beta_1 X_1^i + \beta_2 X_2^i}}{1 + e^{\beta_1 X_1^i + \beta_2 X_2^i}}$$

where the superscript $i$ denotes the $i$-th student.
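As an illustration of the formula, the sketch below evaluates it for three students. The coefficients `beta_1 = beta_2 = 1.0`, the sample feature values, and the function names `sigmoid` and `approval_probability` are hypothetical and chosen only for the example. The sigmoid is written in an algebraically equivalent form that never exponentiates a positive number.

```python
import numpy as np

def sigmoid(z):
    """Return e^z / (1 + e^z), evaluated in a numerically stable way."""
    z = np.asarray(z, dtype=float)
    # e^z / (1 + e^z) equals 1 / (1 + e^(-z)); only exp(-|z|) is ever computed
    exp_neg_abs = np.exp(-np.abs(z))
    return np.where(z >= 0,
                    1.0 / (1.0 + exp_neg_abs),
                    exp_neg_abs / (1.0 + exp_neg_abs))

def approval_probability(x1, x2, beta_1, beta_2):
    """h^i = sigmoid(beta_1 * X_1^i + beta_2 * X_2^i) for every student i."""
    return sigmoid(beta_1 * x1 + beta_2 * x2)

# Hypothetical coefficients and three students with normalized features
beta_1, beta_2 = 1.0, 1.0
x1 = np.array([-1.5, 0.0, 1.5])  # normalized monthly income
x2 = np.array([-1.0, 0.0, 1.0])  # normalized credit score

probs = approval_probability(x1, x2, beta_1, beta_2)
print(probs)  # approximately [0.076, 0.5, 0.924]
```

A student at the average of both features gets a probability of exactly 0.5 under these assumed coefficients, and the probability increases as income and credit score increase.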

The question is then how to determine the two parameters $\beta_1$ and $\beta_2$.
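One common way to determine them is to minimize the average negative log-likelihood (log loss) with gradient descent. The following is a minimal sketch under stated assumptions, not necessarily the author's implementation: the seed, the step size `learning_rate`, the iteration count `n_steps`, and the helper names are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 200

# Regenerate and normalize data with the same recipe as in the note
X1 = np.linspace(1600, 2500, n)
X2 = X1 / 2.5 + rng.normal(0, 100, n)
new_X1 = (X1 - X1.mean()) / X1.std()
new_X2 = (X2 - X2.mean()) / X2.std()
indicator = new_X1 + new_X2 + rng.normal(0, 0.25, n)
Y = (indicator >= 0).astype(float)

# Feature matrix of shape (200, 2); no intercept, matching the formula above
X = np.column_stack([new_X1, new_X2])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta):
    # Clip probabilities so that log(0) never occurs
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

beta = np.zeros(2)     # start from beta_1 = beta_2 = 0
learning_rate = 0.1    # hypothetical step size
n_steps = 2000         # hypothetical number of iterations

initial_loss = log_loss(beta)
for _ in range(n_steps):
    p = sigmoid(X @ beta)
    gradient = X.T @ (p - Y) / n   # gradient of the average log loss
    beta -= learning_rate * gradient
final_loss = log_loss(beta)

# Predict approval when the estimated probability is at least 0.5
accuracy = np.mean((sigmoid(X @ beta) >= 0.5) == (Y == 1))
print(beta, initial_loss, final_loss, accuracy)
```

With these settings the loss should fall from its starting value of $\ln 2$, and both fitted coefficients should come out positive, in line with the observation that higher income and higher credit score make approval more likely.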
