Regression(4)-------Logistic Regression & Regularization

最新推荐文章于 2019-06-29 12:18:40 发布

jingruixiaozhuang

最新推荐文章于 2019-06-29 12:18:40 发布

阅读量1.6k

点赞数

分类专栏： Deep Learning

本文链接：https://blog.csdn.net/u012796064/article/details/17054191

版权

Deep Learning 专栏收录该内容

26 篇文章 2 订阅

订阅专栏

玉玺在手，天下我有！

Logistic Regression

=========================

(一)、Classification

（二）、Hypothesis Representation

（三）、Decision Boundary

（四）、Cost Function

（五）、Simplified Cost Function and Gradient Descent

（六）、Parameter Optimization in Matlab

（七）、Multiclass classification : One-vs-all

The problem of overfitting and how to solve it

=========================

（八）、The problem of overfitting

（九）、Cost Function

（十）、Regularized Linear Regression （解决过拟合问题）

（十一）、Regularized Logistic Regression （解决过拟合问题）

（十二）。经典小总结

本章主要讲述逻辑回归和Regularization解决过拟合的问题，非常非常重要，是机器学习中非常常用的回归工具，下面分别进行两部分的讲解。

第一部分：Logistic Regression

/*************（一）~（二）、Classification / Hypothesis Representation***********/

假设随Tumor Size变化，预测病人的肿瘤是恶性（malignant）还是良性（benign）的情况。

给出8个数据如下：

假设进行linear regression得到的hypothesis线性方程如上图中粉线所示，则可以确定一个threshold:0.5进行predict

y=1, if h(x)>=0.5

y=0, if h(x)<0.5

即malignant=0.5的点投影下来，其右边的点预测y=1;左边预测y=0；则能够很好地进行分类。

那么，如果数据集是这样的呢？

这种情况下，假设linear regression预测为蓝线，那么由0.5的boundary得到的线性方程中，不能很好地进行分类。因为不满足

y=1, h(x)>0.5

y=0, h(x)<=0.5

这时，我们引入logistic regression model：

所谓Sigmoid function或Logistic function就是这样一个函数g(z)见上图所示

当z>=0时，g(z)>=0.5；当z<0时，g(z)<0.5

由下图中公式知，给定了数据x和参数θ，y=0和y=1的概率和=1

/*****************************（三）、decision boundary**************************/

所谓Decision Boundary就是能够将所有数据点进行很好地分类的h(x)边界。

如下图所示，假设形如h(x)=g(θ0+θ1x1+θ2x2)的hypothesis参数θ=[-3,1,1]T, 则有

predict Y=1, if -3+x1+x2>=0

predict Y=0, if -3+x1+x2<0

刚好能够将图中所示数据集进行很好地分类

Another Example:

answer:

除了线性boundary还有非线性decision boundaries，比如

下图中，进行分类的decision boundary就是一个半径为1的圆，如图所示：

/********************（四）~（五）Simplified cost function and gradient descent<非常重要>*******************/

该部分讲述简化的logistic regression系统中how to implement gradient descents for logistic regression.

假设我们的数据点中y只会取0和1, 对于一个logistic regression model系统，有，那么cost function定义如下：

由于y只会取0,1，那么就可以写成

不信的话可以把y=0,y=1分别代入，可以发现这个J（θ）和上面的Cost(hθ(x),y)是一样的(*^__^*) ，那么剩下的工作就是求能最小化 J(θ)的θ了~

在第一章中我们已经讲了如何应用Gradient Descent, 也就是下图Repeat中的部分，将θ中所有维同时进行更新，而J(θ)的导数可以由下面的式子求得，结果如下图手写所示：

现在将其带入Repeat中：

这是我们惊奇的发现，它和第一章中我们得到的公式是一样滴~

也就是说，下图中所示，不管h(x)的表达式是线性的还是logistic regression model, 都能得到如下的参数更新过程。

那么如何用vectorization来做呢？换言之，我们不要用for循环一个个更新θj，而用一个矩阵乘法同时更新整个θ。也就是解决下面这个问题：

上面的公式给出了参数矩阵θ的更新，那么下面再问个问题，第二讲中说了如何判断学习率α大小是否合适，那么在logistic regression系统中怎么评判呢？

Q：Suppose you are running gradient descent to fit a logistic regression model with parameter θ∈Rn+1 . Which of the following is a reasonable way to make sure the learning rate α is set properly and that gradient descent is running correctly?

A：

/*************（六）、Parameter Optimization in Matlab***********/

这部分内容将对logistic regression 做一些优化措施，使得能够更快地进行参数梯度下降。本段实现了matlab下用梯度方法计算最优参数的过程。

首先声明，除了gradient descent 方法之外，我们还有很多方法可以使用，如下图所示，左边是另外三种方法，右边是这三种方法共同的优缺点，无需选择学习率α，更快，但是更复杂。

也就是matlab中已经帮我们实现好了一些优化参数θ的方法，那么这里我们需要完成的事情只是写好cost function,并告诉系统，要用哪个方法进行最优化参数。比如我们用‘GradObj’， Use the GradObj option to specify that FUN also returns a second output argument G that is the partial derivatives of the function df/dX, at the point X.

如上图所示，给定了参数θ，我们需要给出cost Function. 其中，

jVal 是 cost function 的表示，比如设有两个点（1,0,5）和（0,1,5）进行回归，那么就设方程为hθ(x)=θ1x1+θ2x2;
则有costfunction J(θ)： jVal=(theta(1)-5)^2+(theta(2)-5)^2;

在每次迭代中，按照gradient descent的方法更新参数θ：θ(i)-=gradient(i),其中gradient(i)是J(θ)对θi求导的函数式，在此例中就有gradient(1)=2*(theta(1)-5), gradient(2)=2*(theta(2)-5)。如下面代码所示：

函数costFunction, 定义jVal=J(θ)和对两个θ的gradient：

[cpp]view plaincopy 
   
 function [ jVal,gradient ] = costFunction( theta )  
 %COSTFUNCTION Summary of this function goes here  
 %   Detailed explanation goes here  
   
 jVal= (theta(1)-5)^2+(theta(2)-5)^2;  
   
 gradient = zeros(2,1);  
 %code to compute derivative to theta  
 gradient(1) = 2 * (theta(1)-5);  
 gradient(2) = 2 * (theta(2)-5);  
   
 end  

编写函数Gradient_descent，进行参数优化

[cpp]view plaincopy 
   
 function [optTheta,functionVal,exitFlag]=Gradient_descent( )  
 %GRADIENT_DESCENT Summary of this function goes here  
 %   Detailed explanation goes here  
   
  options = optimset('GradObj','on','MaxIter',100);  
  initialTheta = zeros(2,1)  
  [optTheta,functionVal,exitFlag] = fminunc(@costFunction,initialTheta,options);  
     
 end  

matlab主窗口中调用，得到优化厚的参数(θ1,θ2)=(5,5),即hθ(x)=θ1x1+θ2x2=5*x1+5*x2

[cpp]view plaincopy 
   
  [optTheta,functionVal,exitFlag] = Gradient_descent()  
   
 initialTheta =  
   
      0  
      0  
   
 Local minimum found.  
   
 Optimization completed because the size of the gradient is less than  
 the default value of the function tolerance.  
   
 <stopping criteria details>  
   
 optTheta =  
   
      5  
      5  
   
 functionVal =  
   
      0  
   
 exitFlag =  
   
      1

最后得到的结果显示出优化参数optTheta=[5,5], functionVal = costFunction(迭代后) = 0

/*****************************（七）、Multi-class Classification One-vs-all**************************/

所谓one-vs-all method就是将binary分类的方法应用到多类分类中。

比如我想分成K类，那么就将其中一类作为positive，另（k-1）合起来作为negative，这样进行K个h(θ)的参数优化，每次得到的一个hθ(x)是指给定θ和x，它属于positive的类的概率。

按照上面这种方法，给定一个输入向量x，获得最大hθ(x)的类就是x所分到的类。

第二部分：The problem of overfitting and how to solve it

/************（八）、The problem of overfitting***********/

The Problem of overfitting:

overfitting就是过拟合，如下图中最右边的那幅图。对于以上讲述的两类（logistic regression和linear regression）都有overfitting的问题，下面分别用两幅图进行解释：

<Linear Regression>:

<logistic regression>:

怎样解决过拟合问题呢？两个方法：

1. 减少feature个数（人工定义留多少个feature、算法选取这些feature）

2. 规格化（留下所有的feature，但对于部分feature定义其parameter非常小）

下面我们将对regularization进行详细的讲解。

对于linear regression model, 我们的问题是最小化

$MSE(f)=\frac{1}{n}(y_{i}-f(x_{i})^2)$

写作矩阵表示即

$for\:problem\:\: Y=aX,\\ J(a)=\sum_{\overrightarrow{x}\epsilon X}(a^T \overrightarrow{x}-y)^2)\\ X = [x_{1},x_{2},...,x_{n}],\:\:Y = [y_{1},y_{2},...,y_{n}]\\$

i.e. the loss function can be written as

$J(a)=(Y-X^Ta)^T(Y-X^Ta)$

there we can get:

$a=(XX^T)^{-1}XY$

After regularization, however,we have:

$a=(XX^T+\lambda I)^{-1}XY$

/************（九）、Cost Function***********/

对于Regularization，方法如下，定义cost function中θ3，θ4的parameter非常大，那么最小化cost function后就有非常小的θ3,θ4了。

写作公式如下，在cost function中加入θ1~θn的惩罚项：

这里要注意λ的设置，见下面这个题目：

A:λ很大会导致所有θ≈0

下面呢，我们分linear regression 和 logistic regression分别进行regularization步骤.

/************（十）、Regularized Linear Regression***********/

<Linear regression>:

首先看一下，按照上面的cost function的公式，如何应用gradient descent进行参数更新。

对于θ0，没有惩罚项，更新公式跟原来一样

对于其他θj，J(θ)对其求导后还要加上一项(λ/m)*θj，见下图：

如果不使用梯度下降法（gradient descent+regularization），而是用矩阵计算（normal equation）来求θ，也就求使J(θ)min的θ，令J(θ)对θj求导的所有导数等于0，有公式如下：

而且已经证明，上面公式中括号内的东西是可逆的。

/************（十一）、Regularized Logistic Regression***********/

<Logistic regression>:

前面已经讲过Logisitic Regression的cost function和overfitting的情况，如下图中所示:

和linear regression一样，我们给J(θ)加入关于θ的惩罚项来抑制过拟合：

用Gradient Descent的方法，令J(θ)对θj求导都等于0，得到

这里我们发现，其实和线性回归的θ更新方法是一样的。

When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?

和上面matlab中调用那个例子相似，我们可以定义logistic regression的cost function如下所示：

图中，jval表示cost function 表达式，其中最后一项是参数θ的惩罚项；下面是对各θj求导的梯度，其中θ0没有在惩罚项中，因此gradient不变，θ1~θn分别多了一项(λ/m)*θj；

至此，regularization可以解决linear和logistic的overfitting regression问题了~

最后做一下小总结(经典）：

Logistic regression:

　　在logistic regression问题中，logistic函数表达式如下：

　　这样做的好处是可以把输出结果压缩到0~1之间。而在logistic回归问题中的损失函数与线性回归中的损失函数不同，这里定义的为：

　　如果采用牛顿法来求解回归方程中的参数，则参数的迭代公式为：

　　其中一阶导函数和hessian矩阵表达式如下：

　　当然了，在编程的时候为了避免使用for循环，而应该直接使用这些公式的矢量表达式（具体的见程序内容）。

% Exercise 4 -- Logistic Regression

clear all; close all; clc

x = load('ex4x.dat'); 
y = load('ex4y.dat');

[m, n] = size(x);

% Add intercept term to x
x = [ones(m, 1), x]; 

% Plot the training data
% Use different markers for positives and negatives
figure
pos = find(y); neg = find(y == 0);%find是找到的一个向量，其结果是find函数括号值为真时的值的编号
plot(x(pos, 2), x(pos,3), '+')
hold on
plot(x(neg, 2), x(neg, 3), 'o')
hold on
xlabel('Exam 1 score')
ylabel('Exam 2 score')


% Initialize fitting parameters
theta = zeros(n+1, 1);

% Define the sigmoid function
g = inline('1.0 ./ (1.0 + exp(-z))'); 

% Newton's method
MAX_ITR = 7;
J = zeros(MAX_ITR, 1);

for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z);%转换成logistic函数
    
    % Calculate gradient and hessian.
    % The formulas below are equivalent to the summation formulas
    % given in the lecture videos.
    grad = (1/m).*x' * (h-y);%梯度的矢量表示法
    H = (1/m).*x' * diag(h) * diag(1-h) * x;%hessian矩阵的矢量表示法
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h));%损失函数的矢量表示法
    
    theta = theta - H\grad;%是这样子的吗？
end
% Display theta
theta

% Calculate the probability that a student with
% Score 20 on exam 1 and score 80 on exam 2 
% will not be admitted
prob = 1 - g([1, 20, 80]*theta)

%画出分界面
% Plot Newton's method result
% Only need 2 points to define a line, so choose two endpoints
plot_x = [min(x(:,2))-2,  max(x(:,2))+2];
% Calculate the decision boundary line
plot_y = (-1./theta(3)).*(theta(2).*plot_x +theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
hold off

% Plot J
figure
plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
xlabel('Iteration'); ylabel('J')
% Display J
J

regularized linear regression:

此时的模型表达式如下所示：

　　模型中包含了规则项的损失函数如下：

　　模型的normal equation求解为：

　　程序中主要测试lambda=0,1,10这3个参数对最终结果的影响。

clc,clear
%加载数据
x = load('ex5Linx.dat');
y = load('ex5Liny.dat');

%显示原始数据
plot(x,y,'o','MarkerEdgeColor','b','MarkerFaceColor','r')

%将特征值变成训练样本矩阵
x = [ones(length(x),1) x x.^2 x.^3 x.^4 x.^5];
[m n] = size(x);
n = n -1;

%计算参数sidta，并且绘制出拟合曲线
rm = diag([0;ones(n,1)]);%lamda后面的矩阵
lamda = [0 1 10]';
colortype = {'g','b','r'};
sida = zeros(n+1,3);
xrange = linspace(min(x(:,2)),max(x(:,2)))';
hold on;
for i = 1:3
    sida(:,i) = inv(x'*x+lamda(i).*rm)*x'*y;%计算参数sida
    norm_sida = norm(sida)
    yrange = [ones(size(xrange)) xrange xrange.^2 xrange.^3,...
        xrange.^4 xrange.^5]*sida(:,i);
    plot(xrange',yrange,char(colortype(i)))
    hold on
end
legend('traning data', '\lambda=0', '\lambda=1','\lambda=10')%注意转义字符的使用方法
hold off

regularized logistic regression:

　在logistic回归中，其表达式为：

　　在此问题中，将特征x映射到一个28维的空间中，其x向量映射后为：

　　此时加入了规则项后的系统的损失函数为：

　　对应的牛顿法参数更新方程为：

　　其中：

　　公式中的一些宏观说明（直接截的原网页）：

%载入数据
clc,clear,close all;
x = load('ex5Logx.dat');
y = load('ex5Logy.dat');

%画出数据的分布图
plot(x(find(y),1),x(find(y),2),'o','MarkerFaceColor','b')
hold on;
plot(x(find(y==0),1),x(find(y==0),2),'r+')
legend('y=1','y=0')

% Add polynomial features to x by 
% calling the feature mapping function
% provided in separate m-file
x = map_feature(x(:,1), x(:,2));

[m, n] = size(x);

% Initialize fitting parameters
theta = zeros(n, 1);

% Define the sigmoid function
g = inline('1.0 ./ (1.0 + exp(-z))'); 

% setup for Newton's method
MAX_ITR = 15;
J = zeros(MAX_ITR, 1);

% Lambda is the regularization parameter
lambda = 1;%lambda=0,1,10，修改这个地方，运行3次可以得到3种结果。

% Newton's Method
for i = 1:MAX_ITR
    % Calculate the hypothesis function
    z = x * theta;
    h = g(z);
    
    % Calculate J (for testing convergence)
    J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h))+ ...
    (lambda/(2*m))*norm(theta([2:end]))^2;
    
    % Calculate gradient and hessian.
    G = (lambda/m).*theta; G(1) = 0; % extra term for gradient
    L = (lambda/m).*eye(n); L(1) = 0;% extra term for Hessian
    grad = ((1/m).*x' * (h-y)) + G;
    H = ((1/m).*x' * diag(h) * diag(1-h) * x) + L;
    
    % Here is the actual update
    theta = theta - H\grad;
  
end
% Show J to determine if algorithm has converged
J
% display the norm of our parameters
norm_theta = norm(theta) 

% Plot the results 
% We will evaluate theta*x over a 
% grid of features and plot the contour 
% where theta*x equals zero

% Here is the grid range
u = linspace(-1, 1.5, 200);
v = linspace(-1, 1.5, 200);

z = zeros(length(u), length(v));
% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        z(i,j) = map_feature(u(i), v(j))*theta;%这里绘制的并不是损失函数与迭代次数之间的曲线，而是线性变换后的值
    end
end
z = z'; % important to transpose z before calling contour

% Plot z = 0
% Notice you need to specify the range [0, 0]
contour(u, v, z, [0, 0], 'LineWidth', 2)%在z上画出为0值时的界面，因为为0时刚好概率为0.5，符合要求
legend('y = 1', 'y = 0', 'Decision boundary')
title(sprintf('\\lambda = %g', lambda), 'FontSize', 14)


hold off

% Uncomment to plot J
% figure
% plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
% xlabel('Iteration'); ylabel('J')

jingruixiaozhuang

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Regression(4)-------Logistic Regression & Regularization

本讲内容：Logistic Regression=========================(一)、Classification（二）、Hypothesis Representation（三）、Decision Boundary（四）、Cost Function（五）、Simplified Cost Function and Gradi
复制链接

扫一扫

专栏目录