Week 3: Logistic Regression and Regularization
Logistic Regression
Classification
$y \in \{0, 1\}$
0: "Negative Class", 1: "Positive Class"
Linear regression: $h_\theta(x)$ can be $> 1$ or $< 0$.
Logistic regression (classification): $0 \le h_\theta(x) \le 1$.
Hypothesis Representation
We now want $0 \le h_\theta(x) \le 1$. In linear regression, $h_\theta(x) = \theta^T x$; for logistic regression we instead pass $\theta^T x$ through the sigmoid function (i.e. logistic function):

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
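Since the code later in these notes calls a `sigmoid` helper, here is a minimal Octave/MATLAB sketch of it:

```matlab
function g = sigmoid(z)
  % Logistic function 1 / (1 + e^-z), applied element-wise,
  % so z may be a scalar, a vector, or a matrix.
  g = 1 ./ (1 + exp(-z));
end
```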
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$
DEFINITION
$h_\theta(x) = P(y = 1 \mid x; \theta)$: the probability that $y = 1$, given $x$, parameterized by $\theta$. Since $y \in \{0, 1\}$, the complementary probability is $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$.
Decision boundary
If $h_\theta(x) \ge 0.5$, predict $y = 1$.
If $h_\theta(x) < 0.5$, predict $y = 0$.
DEFINITION
The decision boundary is the line that separates the region where the hypothesis predicts $y = 1$ from the region where it predicts $y = 0$.
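Because $g(z) \ge 0.5$ exactly when $z \ge 0$, the prediction rule reduces to checking the sign of $\theta^T x$. A minimal sketch, assuming `X` is an m x (n+1) design matrix whose first column is all ones:

```matlab
% Predict y = 1 wherever theta' * x >= 0 (equivalently, sigmoid >= 0.5).
predictions = (X * theta >= 0);  % m x 1 vector of 0/1 labels
```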
Cost function
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
This captures the intuition that if $h_\theta(x) = 0$ but $y = 1$, we penalize the learning algorithm by a very large cost: $-\log(h_\theta(x)) \to \infty$ as $h_\theta(x) \to 0$.
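A per-example sketch of this piecewise cost, assuming a single training example `x` (an (n+1) x 1 vector with a leading 1) and its label `y`:

```matlab
% Piecewise logistic cost for one training example.
h = sigmoid(theta' * x);  % scalar hypothesis value
if y == 1
  cost = -log(h);      % blows up as h -> 0
else
  cost = -log(1 - h);  % blows up as h -> 1
end
```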
Simplified cost function and gradient descent
Want $\min_\theta J(\theta)$, where the simplified cost function combines the two cases:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Gradient descent:

Repeat: $\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (simultaneously update all $\theta_j$)
The algorithm looks identical to linear regression! But the two use different hypotheses $h_\theta(x)$.
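A vectorized sketch of the loop, with illustrative names `alpha` (learning rate) and `num_iters`:

```matlab
% Batch gradient descent for logistic regression.
for iter = 1:num_iters
  h = sigmoid(X * theta);          % m x 1 vector of predictions
  grad = (1/m) * X' * (h - y);     % gradient of J(theta)
  theta = theta - alpha * grad;    % simultaneous update of all theta_j
end
```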
Advanced optimization
Given $\theta$, we have code that can compute:
- $J(\theta)$
- $\frac{\partial}{\partial \theta_j} J(\theta)$ (for $j = 0, 1, \ldots, n$)
Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Advantages of the last three:
- No need to manually pick $\alpha$
- Often faster than gradient descent

Disadvantage:
- More complex
In MATLAB/Octave:

```matlab
function [jVal, gradient] = costFunction(theta, X, y)
  % jVal: the value of J(theta); gradient: its partial derivatives
  jVal = ...
  gradient = ...
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(n+1, 1);  % one entry per parameter, including theta_0
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```
Multi-class classification: One-vs-all
Train: train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
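A minimal prediction sketch, assuming the $K$ trained parameter vectors are stacked as rows of a hypothetical `all_theta` matrix (K x (n+1)):

```matlab
% Evaluate all K classifiers on one input x ((n+1) x 1 with a leading 1)
% and pick the class whose classifier outputs the highest probability.
[~, prediction] = max(sigmoid(all_theta * x));
```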
Regularization
The problem of overfitting
Overfitting: if we have too many features, the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$) but fail to generalize to new examples (e.g., fail to predict prices on new examples).
Options:
Reduce the number of features.
- Manually select which features to keep.
- Model selection algorithm (later in the course).

Regularization.
- Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
- Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Cost function
Penalize large parameters by adding a regularization term:

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

Small values for the parameters $\theta_1, \ldots, \theta_n$ give:
- a "simpler" hypothesis
- less prone to overfitting
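A sketch of this cost in Octave/MATLAB, where `theta(1)` holds the unpenalized intercept $\theta_0$:

```matlab
% Regularized squared-error cost; the intercept theta(1) is not penalized.
h = X * theta;
J = (1/(2*m)) * sum((h - y).^2) + (lambda/(2*m)) * sum(theta(2:end).^2);
```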
Regularized linear regression
Gradient descent

Repeat (keeping $\theta_0$ unregularized):

$$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 1, \ldots, n$$

Normal equation

$$\theta = \left( X^T X + \lambda \, \mathrm{diag}(0, 1, \ldots, 1) \right)^{-1} X^T y$$

If $\lambda > 0$, the regularized matrix $X^T X + \lambda \, \mathrm{diag}(0, 1, \ldots, 1)$ is invertible even when $X^T X$ itself is not (e.g., when $m \le n$).
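A sketch of the regularized normal equation, where `n` is the number of features:

```matlab
% Identity matrix with the top-left entry zeroed out,
% so theta_0 is excluded from regularization.
L = eye(n + 1);
L(1, 1) = 0;
theta = (X' * X + lambda * L) \ (X' * y);  % solve for the regularized theta
```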
Regularized logistic regression
Cost function

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Gradient descent

The updates look the same as for regularized linear regression above, but with the logistic hypothesis $h_\theta(x) = g(\theta^T x)$.
Advanced optimization
Change the costFunction to include the regularization term:
```matlab
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y); % number of training examples

  % Cost: cross-entropy term plus the regularization term,
  % which skips the intercept theta(1).
  J = 1/m * (-y'*log(sigmoid(X*theta)) - (1-y)'*log(1-sigmoid(X*theta))) ...
      + lambda/(2*m) * sum((theta(2:end)).^2);

  % Gradient: regularize every component, then subtract the
  % regularization back out of the intercept component.
  grad = 1/m * X' * (sigmoid(X*theta) - y) + lambda/m * theta;
  grad(1) = grad(1) - lambda/m * theta(1);
end
```

The optimization call lives in the calling script rather than inside the function:

```matlab
% Set options and optimize
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, J, exit_flag] = ...
    fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
```