[ML of Andrew Ng] Week 3: Logistic Regression and Regularization



Logistic Regression

Classification

$y \in \{0, 1\}$
0: "Negative Class", 1: "Positive Class"

Linear Regression: $h_\theta(x)$ can be $> 1$ or $< 0$
Logistic Regression (Classification): $0 \le h_\theta(x) \le 1$

Hypothesis Representation

Now we want $0 \le h_\theta(x) \le 1$. In linear regression, $h_\theta(x) = \theta^T x$; here we pass $\theta^T x$ through the sigmoid function (i.e. the logistic function), so:

$$h_\theta(x) = g(\theta^T x)$$

$$g(z) = \frac{1}{1 + e^{-z}}$$

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$

DEFINITION
$h_\theta(x) = P(y = 1 \mid x; \theta)$: the probability that $y = 1$, given $x$, parameterized by $\theta$.
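
A minimal Octave sketch of the hypothesis (this `sigmoid` helper is my own illustration, not part of the original notes; the code later in this post assumes such a helper is available):

function g = sigmoid(z)
  % logistic function, applied element-wise, so z may be a scalar, vector or matrix
  g = 1 ./ (1 + exp(-z));
end

% With each row of X holding one example (including a leading column of ones for the
% intercept), h = sigmoid(X * theta) gives the estimated P(y = 1 | x; theta) for
% every example at once.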

Decision boundary

If $h_\theta(x) \ge 0.5$, predict $y = 1$;
if $h_\theta(x) < 0.5$, predict $y = 0$.

DEFINITION
The decision boundary is the line that separates the region where the hypothesis predicts $y = 1$ from the region where it predicts $y = 0$.
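
For example, since $g(z) \ge 0.5$ exactly when $z = \theta^T x \ge 0$: with $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ and $\theta = [-3, 1, 1]^T$, the hypothesis predicts $y = 1$ whenever $-3 + x_1 + x_2 \ge 0$, i.e. $x_1 + x_2 \ge 3$, so the line $x_1 + x_2 = 3$ is the decision boundary.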

Cost function

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases}-\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0\end{cases}$$
This captures the intuition that if $h_\theta(x) = 0$ but $y = 1$, we penalize the learning algorithm with a very large cost.

Simplified cost function and gradient descent

$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$$

Want $\min_\theta J(\theta)$:
Repeat: $\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$ (simultaneously update all $\theta_j$)
The algorithm looks identical to linear regression! But the two have different hypotheses $h_\theta(x)$.
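
A minimal Octave sketch of this update, vectorized over all examples (`alpha`, `num_iters` and an initial `theta` are assumed to be given; `sigmoid` is the helper sketched earlier):

m = length(y);                                  % number of training examples
for iter = 1:num_iters
  h = sigmoid(X * theta);                       % m x 1 vector of predictions h_theta(x^(i))
  theta = theta - (alpha/m) * (X' * (h - y));   % simultaneous update of all theta_j
end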

Advanced optimization

Given $\theta$, we have code that can compute:

- $J(\theta)$
- $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j = 0, 1, \ldots, n$)

Optimization algorithms:

- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages of the latter three:

- No need to manually pick $\alpha$
- Often faster than gradient descent

Disadvantages:

- More complex

In Matlab/Octave:

function [jVal, gradient] = costFunction(theta, X, y)
  % jVal: the value of J(theta); gradient: the vector of partial derivatives
  jVal = ...
  gradient = ...
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);   % one parameter per column of X (intercept included)
[optTheta, functionVal, exitFlag] ...
    = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
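
For reference, one way the two placeholders above might be filled in for (unregularized) logistic regression, using the cost and gradient from the previous section and the `sigmoid` sketch from earlier:

function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);                                          % number of training examples
  h = sigmoid(X * theta);                                 % h_theta(x) for every example
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % J(theta)
  gradient = (1/m) * (X' * (h - y));                      % vector of partial derivatives
end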

Multi-class classification: One-vs-all

Training: train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes

$$\max_i h_\theta^{(i)}(x)$$
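
A minimal Octave sketch of this scheme, reusing the `costFunction` and `sigmoid` sketches above (`num_labels` is the number of classes, and class labels are assumed to be 1..num_labels):

% Train one binary classifier per class; row i of all_theta holds theta for class i
all_theta = zeros(num_labels, size(X, 2));
options = optimset('GradObj', 'on', 'MaxIter', 100);
for i = 1:num_labels
  initial_theta = zeros(size(X, 2), 1);
  % double(y == i) turns the multi-class labels into binary labels for class i
  all_theta(i, :) = fminunc(@(t) costFunction(t, X, double(y == i)), initial_theta, options)';
end

% Prediction: for each example, pick the class with the highest estimated probability
[~, prediction] = max(sigmoid(X * all_theta'), [], 2);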


Regularization

The problem of overfitting

Overfitting: if we have too many features, the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$) but fail to generalize to new examples (e.g., fail to predict prices on new examples).

(Example figures omitted.)

Options:

  1. Reduce the number of features.

    • Manually select which features to keep.
    • Model selection algorithm (later in the course).
  2. Regularization.

    • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
    • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Cost function

Small values for the parameters $\theta_0, \theta_1, \ldots, \theta_n$:
- "Simpler" hypothesis
- Less prone to overfitting

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
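
As an Octave sketch of this regularized cost for linear regression (assuming `theta(1)` holds $\theta_0$, which is left out of the penalty):

m = length(y);
h = X * theta;                 % linear-regression hypothesis
J = (1/(2*m)) * (sum((h - y).^2) + lambda * sum(theta(2:end).^2));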

Regularized linear regression

Gradient descent

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \qquad (j = 1, 2, \ldots, n)$$

($\theta_0$ is updated exactly as in unregularized linear regression, without the $(1 - \alpha\frac{\lambda}{m})$ shrinkage factor.)

Normal equation

If $\lambda > 0$, the matrix in parentheses below is invertible even if $X^T X$ is singular:

$$\theta = \left(X^T X + \lambda\begin{bmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{bmatrix}\right)^{-1}X^T y$$
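
In Octave, the regularized normal equation can be sketched as follows (`n` is the number of features, so X has n+1 columns including the column of ones):

L = eye(n + 1);        % (n+1) x (n+1) identity matrix ...
L(1, 1) = 0;           % ... except that theta_0 is not regularized
theta = pinv(X' * X + lambda * L) * X' * y;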

Regularized logistic regression

Cost function

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Gradient descent

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1, 2, \ldots, n)$$

Advanced optimization

Change the cost function to include the regularization term:

function [J, grad] = costFunctionReg(theta, X, y, lambda)
% Regularized logistic regression cost J and gradient grad.

m = length(y); % number of training examples

h = sigmoid(X*theta);          % predictions h_theta(x) for all examples

% Regularized cost; theta(1) (i.e. theta_0) is excluded from the penalty
J = 1/m * (-y'*log(h) - (1-y)'*log(1-h)) ...
    + lambda/(2*m) * sum((theta(2:end)).^2);

% Regularized gradient; remove the penalty term from the theta_0 component
grad = 1/m * X' * (h - y) + lambda/m*theta;
grad(1) = grad(1) - lambda/m*theta(1);
end

% Set options
options = optimset('GradObj', 'on', 'MaxIter', 400);

% Optimize (note that fminunc is called outside costFunctionReg)
initial_theta = zeros(size(X, 2), 1);
[theta, J, exit_flag] = ...
    fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
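
Once `theta` has been learned, predictions follow the 0.5 threshold from the decision-boundary section; a minimal sketch:

% Predict y = 1 wherever the estimated probability reaches the 0.5 threshold
p = sigmoid(X * theta) >= 0.5;
fprintf('Train accuracy: %f\n', mean(double(p == y)) * 100);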