These notes correspond to Week 3 of the Coursera course: the binary classification problem.
Classification is not actually a linear function.
Classification and Representation
Hypothesis Representation
Sigmoid Function (also called the Logistic Function)
$$h_\theta(x) = g(\theta^T x), \quad z = \theta^T x, \quad g(z) = \frac{1}{1 + e^{-z}}$$
The Sigmoid Function constrains the output to the range (0, 1). The plot of $g(z)$ is an S-shaped curve rising from 0 to 1, crossing 0.5 at $z = 0$.
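As a quick illustration, here is a minimal NumPy sketch of the function (the helper name `sigmoid` is my own choice):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5; large positive z approaches 1, large negative z approaches 0
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # [~4.5e-05, 0.5, ~0.99995]
```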
$h_\theta(x)$ will give us the probability that our output is 1.
- Some basic discrete probability identities:

$$h_\theta(x) = P(y=1 \mid x; \theta) = 1 - P(y=0 \mid x; \theta)$$
$$P(y=0 \mid x; \theta) + P(y=1 \mid x; \theta) = 1$$
Decision Boundary
- We translate the output of the hypothesis function as follows:

$$h_\theta(x) \geq 0.5 \rightarrow y = 1$$
$$h_\theta(x) < 0.5 \rightarrow y = 0$$

- From these statements, and since $g(z) \geq 0.5$ exactly when $z \geq 0$, we can now say:

$$\theta^T x \geq 0 \Rightarrow y = 1$$
$$\theta^T x < 0 \Rightarrow y = 0$$
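A minimal sketch of this decision rule, assuming a design matrix `X` whose first column is all ones (the intercept term $x_0 = 1$); the helper name `predict` is my own:

```python
import numpy as np

def predict(theta, X):
    """Predict y = 1 exactly when h_theta(x) >= 0.5, i.e. when theta^T x >= 0."""
    # Equivalent to sigmoid(X @ theta) >= 0.5, without computing the exponential.
    return (X @ theta >= 0).astype(int)
```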
Logistic Regression Model
Cost function for the logistic regression hypothesis
- To keep the cost function convex for gradient descent, it should take this form:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$
- where

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

- This gives the desired behavior:

$$\mathrm{Cost}(h_\theta(x), y) = 0 \text{ if } h_\theta(x) = y$$
$$\mathrm{Cost}(h_\theta(x), y) \to \infty \text{ if } y = 0 \text{ and } h_\theta(x) \to 1$$
$$\mathrm{Cost}(h_\theta(x), y) \to \infty \text{ if } y = 1 \text{ and } h_\theta(x) \to 0$$
Simplified Cost Function and Gradient Descent
We can compress the cost function's two conditional cases into one:
$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

The entire cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
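A vectorized sketch of $J(\theta)$, under the same design-matrix assumptions as above (the helper name `cost` is my own):

```python
import numpy as np

def cost(theta, X, y):
    """Vectorized logistic regression cost J(theta).

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) vector of 0/1 labels
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))          # h_theta(x^(i)) for every i
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```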
Gradient Descent
The general form of gradient descent; we take partial derivatives in order to minimize $J(\theta)$:
$$\text{Repeat} \ \{ \ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \ \}$$

Using calculus,

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right]$$

we get:

$$\text{Repeat} \ \{ \ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \ \}$$
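A minimal batch gradient descent sketch of this update rule (the hyperparameter defaults are arbitrary illustrations, not course-prescribed values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Simultaneously update every theta_j with the rule above."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # predictions for all m examples
        grad = X.T @ (h - y) / m                # (1/m) * sum_i (h - y) * x_j, per j
        theta -= alpha * grad
    return theta
```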
Multiclass Classification: One-vs-all
- When $y$ can take more than 2 discrete values (classes), run logistic regression for each class separately (one-vs-all).
- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
- To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$; see the sketch below.
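A minimal sketch of one-vs-all, reusing the `gradient_descent` helper from the previous snippet; the function names and the dictionary-of-parameters layout are my own choices:

```python
import numpy as np

def one_vs_all(X, y, classes, alpha=0.1, iters=1000):
    """Train one binary classifier per class; class c treats (y == c) as positive."""
    return {c: gradient_descent(X, (y == c).astype(float), alpha, iters)
            for c in classes}

def predict_class(thetas, x):
    """Pick the class whose classifier assigns the highest probability."""
    # sigmoid is monotonic, so comparing theta^T x directly gives the same argmax
    return max(thetas, key=lambda c: x @ thetas[c])
```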
Solving the Problem of Overfitting
The Problem of Overfitting
There are two main options to address the issue of overfitting:
- Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm (studied later in the course).
- Regularization:
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
- Regularization works well when we have a lot of slightly useful features.
Cost Function
- We can regularize all of our theta parameters in a single summation:

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

The $\lambda$, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
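A sketch of this regularized cost for linear regression, following the usual convention that the bias term $\theta_0$ is not penalized (the helper name `regularized_cost` is my own):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost; theta[0] (the bias) is not penalized."""
    m = len(y)
    err = X @ theta - y
    return (err @ err + lam * (theta[1:] @ theta[1:])) / (2 * m)
```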
Regularized Linear Regression
Gradient Descent
$$\text{Repeat} \ \{$$
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \qquad j \in \{1, 2, \ldots, n\}$$
$$\}$$

Normal Equation
$$\theta = (X^T X + \lambda \cdot L)^{-1} X^T y \quad \text{where} \quad L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$

- $L$ is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension $(n+1) \times (n+1)$.
- Recall that if $m \leq n$, then $X^T X$ is non-invertible. However, when we add the term $\lambda \cdot L$, then $X^T X + \lambda \cdot L$ becomes invertible.
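A direct NumPy sketch of the regularized normal equation, using `np.linalg.solve` rather than an explicit matrix inverse (the numerically preferable choice); the helper name is my own:

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Solve theta = (X^T X + lambda * L)^(-1) X^T y with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0  # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```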
Summary
Here I consolidate the two methods above and fill in the derivations the course leaves out.
Logistic Regression Model
$h_\theta(x)$ is the hypothesis function.
Note the distinction between the hypothesis function's predictions and the actual data.
Cost Function
Looking back at $h_\theta(x)$ above: the cost function measures the gap between the labels given in the training set and the values the current model computes. Naturally, the smaller the gap the better, so we differentiate in order to minimize it.
Gradient Descent
- Original formula

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

- Taking the derivative

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right]$$

- Result

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
Here is the derivation of $\frac{\partial}{\partial \theta_j} J(\theta)$.
First, compute the derivative $h'_\theta(x)$ with respect to $\theta$:
$$h'_\theta(x) = \left( \frac{1}{1 + e^{-\theta x}} \right)' = \frac{e^{-\theta x}}{(1 + e^{-\theta x})^2}\, x = \frac{(1 + e^{-\theta x}) - 1}{(1 + e^{-\theta x})^2}\, x = \left[ \frac{1}{1 + e^{-\theta x}} - \frac{1}{(1 + e^{-\theta x})^2} \right] x = h_\theta(x)\,(1 - h_\theta(x))\, x$$

Now derive $\frac{\partial}{\partial \theta_j} J(\theta)$. Substituting $h'$ into the derivative of the cost:

$$\frac{\partial}{\partial \theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)})\,(1 - h_\theta(x^{(i)}))\, x_j^{(i)} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} (1 - h_\theta(x^{(i)})) - (1 - y^{(i)})\, h_\theta(x^{(i)}) \right] x_j^{(i)}$$

That is, since $y(1 - h) - (1 - y)h = y - h$:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$$
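To sanity-check this derivation numerically, one can compare the analytic gradient against a two-sided finite-difference approximation. A minimal sketch, reusing the `cost` and `sigmoid` helpers from earlier (the helper name `numerical_gradient` is my own):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    """Approximate each partial derivative by (J(theta+eps) - J(theta-eps)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```

If the derivation is right, `numerical_gradient(lambda t: cost(t, X, y), theta)` should agree closely with the analytic `X.T @ (sigmoid(X @ theta) - y) / len(y)`.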
Solving the Problem of Overfitting
Everything else stays the same; only slight modifications are needed:
- Cost Function

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
- Gradient Descent
$$\text{Repeat} \ \{$$
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \qquad j \in \{1, 2, \ldots, n\}$$
$$\}$$
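A sketch of one simultaneous update of this rule (the helper name `regularized_gradient_step` is my own; note that $\theta_0$ receives no $\lambda$ term):

```python
import numpy as np

def regularized_gradient_step(theta, X, y, alpha, lam):
    """One simultaneous gradient descent update for regularized logistic regression."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]  # regularize every theta_j except theta_0
    return theta - alpha * grad
```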
That's all.