【cs229】Andrew Ng Machine Learning - 1/2
0. Course Plan
Coursera course page
Related links in this series:
【cs229】Andrew Ng Machine Learning - 1/2
The course runs 10 weeks in total. The goal is to master the fundamentals in depth; these notes are not a re-translation of the lectures, but a record of my own questions and my attempts to answer them as I go.
What do you want to get out of this course? If it is anything newer, such as CV or deep learning, no, it is not here. This is Machine Learning, the classic curriculum. Note that other traditional algorithms from Li Hang's book, such as random forests, are not covered either.
If I had to position it: it is the bridge to deep learning and its theoretical foundation.
1. Introduction
2. Linear Regression
2.1 hypothesis
For historical reasons, the function that maps the input space to the output space is called the hypothesis.
2.2 objective function
This is a different concept from the loss function itself:
the objective is to find the parameters that minimize the loss function $J(\theta_0,\theta_1)$.
2.3 contour figure
Points on the same contour line share the same function value.
2.4 Gradient Descent
Gradient descent updates all parameters simultaneously.
2.4.1 convex function
- bowl-shaped function
- A perturbation of the initial value can make the algorithm settle into a different local minimum.
2.4.2 learning rate
In plain gradient descent, $\theta_i=\theta_i-\alpha\frac{\partial}{\partial\theta_i}J(\theta_0,...,\theta_n)$, so the correction is large where the partial derivative / slope is large (steep regions) and small where it is small (flat regions). Is that reasonable?
2.4.3 Batch Gradient Descent
Update the parameters once per batch of data.
2.5 Linear Algebra
In this course, matrices (usually denoted by uppercase letters) and vectors (usually lowercase letters) are indexed from 1 (1-indexed).
Vector: an $n\times1$ matrix.
Note: the four 1s appended here (one per sample in the example) make the dimensions match up; this is also why $\theta_0$ is called the bias.
The concept of a bias in a Neural Network is similar:
Modern hardware such as SIMD units, GPUs, and TPUs can compute matrix multiplication efficiently.
Matrix multiplication is associative but not commutative.
Identity Matrix: the identity/unit matrix $I$, satisfying $AI=IA=A$.
A matrix that has no inverse is called singular or degenerate (rank-deficient).
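A quick numerical spot-check of these properties (a minimal NumPy sketch; the course itself uses Octave, and the matrices below are made up for illustration):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
C = np.array([[2., 0.], [0., 3.]])
I = np.eye(2)  # identity matrix

# Associative: (AB)C == A(BC)
print(np.allclose((A @ B) @ C, A @ (B @ C)))          # True

# Not commutative in general: AB != BA
print(np.allclose(A @ B, B @ A))                      # False

# Identity: AI == IA == A
print(np.allclose(A @ I, A), np.allclose(I @ A, A))   # True True

# A singular (non-invertible) matrix has rank < n; use pinv instead of inv
S = np.array([[1., 2.], [2., 4.]])                    # rank 1, singular
print(np.linalg.matrix_rank(S))                       # 1
print(np.linalg.pinv(S))                              # pseudo-inverse still works
```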
2.6 Multivariate Linear Regression
$$h_\varTheta(x)=\varTheta^Tx,\quad x\in R^{n+1}$$

$$\varTheta_i := \varTheta_i-\alpha\frac{1}{m}\sum_{j=1}^{m}\left(h_\varTheta(x^{(j)})-y^{(j)}\right)\cdot x_i^{(j)}$$
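A minimal vectorized sketch of this batch update in NumPy (the data and hyper-parameters below are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend the bias column of 1s
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(n + 1)
alpha = 0.1
for _ in range(1000):
    grad = (X.T @ (X @ theta - y)) / m   # (1/m) * X^T (X theta - y)
    theta -= alpha * grad                # simultaneous update of all theta_i

print(theta)  # should be close to true_theta
```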
2.6.1 feature scaling
When the features have very different scales, convergence can be slow, so it is worth normalizing the data.
Subtract the mean and divide by the range:
$$x_i\gets\frac{x_i-x_\mu}{x_{max}-x_{min}}$$
This maps the data roughly into $[-1,1]$. The denominator can also be replaced by the standard deviation, which standardizes the data to zero mean and unit variance (in the spirit of the $N(0,1)$ standard normal distribution).
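A small sketch of both variants, mean normalization by the range and z-score standardization (the feature matrix below is a made-up example):

```python
import numpy as np

X = np.array([[2104., 3.], [1600., 3.], [2400., 4.], [1416., 2.]])  # e.g. size, #bedrooms

# Mean normalization: (x - mean) / (max - min)  -> roughly within [-1, 1]
X_range = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: (x - mean) / std  -> zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_range)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 per column
```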
2.6.2 Polynomial Regression
By substituting features, e.g. $x=z$, $x=z^2$, $x=z^3$, or $x=\sqrt z$, a polynomial regression problem can be reduced to a linear regression problem in the new features.
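A sketch of the substitution trick: build new features from powers of a single variable $z$ and then solve an ordinary linear least-squares problem on them (synthetic data, solved here with `numpy.linalg.lstsq` for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(0.5, 3.0, size=50)
y = 1.0 + 2.0 * z - 0.5 * z**2 + 0.05 * rng.normal(size=50)

# Substitute x1 = z, x2 = z^2 (could also use z**3 or sqrt(z)) -> linear regression
X = np.column_stack([np.ones_like(z), z, z**2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [1.0, 2.0, -0.5]
```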
2.6.3 Computing Parameters Analytically
Each row of the design matrix $X$ is a sample $(x^{(i)})^T$, so the dimensions of $X$ are:
$$\begin{aligned} X&\dashrightarrow m\times(n+1)\\ X\theta&=y\\ X^T(X\theta)&=X^Ty\\ (X^TX)^{-1}(X^TX)\theta&=(X^TX)^{-1}X^Ty=\theta \end{aligned}$$
There is another way to derive it:
$$\begin{aligned} J(\theta)&=\frac{1}{2m}\Big[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2\Big]\\ \frac{\partial}{\partial\theta_j}J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}){x_j}^{(i)}\\ \frac{\partial}{\partial\theta}J(\theta)&=\frac{1}{m}X^T(X\theta-y)\\ &\dArr\text{ set it to }0\\ X^TX\theta&=X^Ty\\ (X^TX)^{-1}X^TX\theta&=(X^TX)^{-1}X^Ty=\theta \end{aligned}$$
In most cases $(X^TX)$ is invertible; even when it is not, software such as Octave will compute an approximate inverse, pinv (the pseudo-inverse), instead of inv. A short sketch follows the list below.
Cases where it is not invertible include:
- too few samples, $m < n+1$
- redundancy (linear dependence) among the n features
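The sketch mentioned above: the closed-form solution computed with the pseudo-inverse, NumPy's `pinv` playing the role of Octave's `pinv` (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # m x (n+1) design matrix
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=m)

# theta = (X^T X)^{-1} X^T y, but pinv is safer when X^T X is singular
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # close to [1, 2, -3]
```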
3. Logistic Regression
- Logistic regression should really be called "Logistic Classification"; the name is a historical accident.
- Linear regression has drawbacks for classification; for example, adding a new sample that carries no information can still change the model.
- Logistic regression guarantees the output range $0 \leqslant h_{\theta}(x) \leqslant 1$.
- Feature scaling applies to logistic regression as well.
3.1 sigmoid function
sigmoid function = logistic function
$$\begin{aligned} h_{\theta}(x)&=g(\theta^Tx)=P(y=1\mid x;\theta)\\ g(z)&=\frac{1}{1+e^{-z}}\\ g'(z)&=(-1)\times(1+e^{-z})^{-2}\times(-1)\times(e^{-z})\\ &=\frac{e^{-z}}{(1+e^{-z})^{2}}=\frac{1+e^{-z}-1}{(1+e^{-z})^{2}}=\frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})^2}\\ &=\frac{1}{1+e^{-z}}\Big(1-\frac{1}{1+e^{-z}}\Big)\\ &=g(z)(1-g(z)) \end{aligned}$$
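A quick numerical check of the identity $g'(z)=g(z)(1-g(z))$, comparing the analytic form against a central finite difference (a sketch, not from the course):

```python
import numpy as np

def g(z):
    """Sigmoid / logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = g(z) * (1 - g(z))
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # central finite difference
print(np.max(np.abs(analytic - numeric)))          # ~1e-11, the identity holds
```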
3.2 Decision Boundary
The decision boundary is where $h_\theta(x)=g(\theta^Tx)=0.5$, i.e. $\theta^Tx=0$.
From the shape of the sigmoid, $P\geq0.5$ is equivalent to $\theta^Tx\geq0$.
3.3 Cost function
The cost function used in linear regression is:
$$\begin{aligned} cost(h_\theta(x),y)&=\frac{1}{2}(h_\theta(x)-y)^2\\ J(\theta)&=\frac{1}{m}\sum_{j=1}^{m}cost(h_\theta(x),y)=\frac{1}{m}\sum_{j=1}^{m}\frac{1}{2}(h_{\theta}(x)-y)^2 \end{aligned}$$
If we kept the same form, the cost function for logistic regression would be:
$$\begin{aligned} cost(h_\theta(x),y)&=\frac{1}{2}(h_\theta(x)-y)^2\\ &=\frac{1}{2}\Big(\frac{1}{1+e^{-\theta^Tx}}-y\Big)^2\\ J(\theta)&=\frac{1}{m}\sum_{j=1}^{m}\frac{1}{2}(g(\theta^Tx)-y)^2 \end{aligned}$$
Because of the sigmoid, this $J(\theta)$ is not convex (non-convex) and has many local optima.
Minimizing the loss is equivalent to maximizing the likelihood, and the probability is:
$$P(y\mid x;\theta)={h_\theta(x)}^y{(1-h_\theta(x))}^{1-y}$$
The corresponding likelihood function is:
$$L(\theta)=\prod_{j=1}^{m}P(y^{(j)}\mid x^{(j)};\theta)$$
Equivalently, maximize the following (valid because the probabilities are non-negative):
$$\begin{aligned} \log(L(\theta))&=\log\Big(\prod_{j=1}^{m}P(y^{(j)}\mid x^{(j)};\theta)\Big)\\ &=\log\Big(\prod_{j=1}^{m}{h_\theta(x^{(j)})}^{y^{(j)}}{(1-h_\theta(x^{(j)}))}^{1-y^{(j)}}\Big)\\ &=\sum_{j=1}^{m}\Big[y^{(j)}\log(h_\theta(x^{(j)}))+(1-y^{(j)})\log(1-h_\theta(x^{(j)}))\Big]\\ &=-mJ(\theta) \end{aligned}$$
So maximizing the likelihood is equivalent to minimizing the loss:
$$\begin{aligned} J(\theta)&=\frac{1}{m}\sum_{j=1}^{m}\Big[-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\Big]\\ cost(h_\theta(x),y)&=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\\ &=\begin{cases} -\log(h_\theta(x)) &\text{if } y=1 \\ -\log(1-h_\theta(x)) &\text{if } y=0 \end{cases} \end{aligned}$$
Better still, this loss function is convex.
This loss is exactly the cross-entropy loss. Specifically:
Given two distributions $p(x)$ and $q(x)$, their cross-entropy is:
$$H(p,q)=-\sum_x p(x)\log(q(x))$$
The smaller the cross-entropy, the closer the two distributions. Here $p(x)$ is the true distribution and $q(x)$ is the estimated distribution parameterized by $\theta$.
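A sketch that evaluates this cross-entropy cost $J(\theta)$ directly from the formula (the tiny `eps` guarding the logs is my addition, not part of the notes; `X`, `y`, `theta` are toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """J(theta) = (1/m) * sum(-y*log(h) - (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h + eps) - (1 - y) * np.log(1 - h + eps))

# toy example
X = np.array([[1., 0.5], [1., -1.0], [1., 2.0]])   # first column is the bias feature
y = np.array([1., 0., 1.])
theta = np.array([0.0, 1.0])
print(logistic_cost(theta, X, y))
```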
$$\begin{aligned} g(z)&=\frac{1}{1+e^{-z}}\\ \frac{\partial}{\partial z}g(z)&=g(z)(1-g(z))\\ h_\theta(x)&=g(\theta^Tx)\\ \frac{\partial}{\partial \theta}h_\theta(x)&=h_\theta(x)(1-h_\theta(x))x\\ J(\theta)&=\frac{1}{m}\sum_{j=1}^{m}\Big[-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\Big]\\ \frac{\partial}{\partial \theta}J(\theta)&=-\frac{1}{m}\sum_{j=1}^{m}\Big[y\frac{1}{h_\theta(x)}\frac{\partial}{\partial \theta}h_\theta(x)+(1-y)\frac{1}{1-h_\theta(x)}(-1)\frac{\partial}{\partial \theta}h_\theta(x)\Big]\\ &=-\frac{1}{m}\sum_{j=1}^{m}\Big[y\frac{1}{h_\theta(x)}h_\theta(x)(1-h_\theta(x))x+(y-1)\frac{1}{1-h_\theta(x)}h_\theta(x)(1-h_\theta(x))x\Big]\\ &=\frac{1}{m}\sum_{j=1}^{m}(h_\theta(x)-y)x \quad\text{(sample superscripts omitted for brevity)}\\ \theta&:=\theta-\alpha\frac{\partial}{\partial \theta}J(\theta)\\ &=\theta-\alpha\frac{1}{m}\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})x^{(j)} \end{aligned}$$
Coincidentally, the final update rule for $\theta$ in logistic regression looks exactly like the one for linear regression; the difference lies entirely in $h_\theta(x)$ itself.
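Putting the update rule together, a minimal gradient-descent sketch for logistic regression on synthetic data (names, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=m)).astype(float)  # Bernoulli labels

theta = np.zeros(3)
alpha = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ theta) - y) / m   # same form as linear regression, different h
    theta -= alpha * grad

print(theta)  # roughly recovers true_theta
```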
3.4 Optimization algorithm
- batch/mini-batch/stochastic gradient descent
- conjugate gradient
- BFGS
- L-BFGS
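The course demonstrates these with Octave's `fminunc`. As an assumed Python analogue, `scipy.optimize.minimize` can run BFGS or L-BFGS given the cost and its gradient; SciPy is my choice here, not something the notes prescribe:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # clip to avoid log(0) during aggressive line-search steps
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# small non-separable toy dataset (bias column plus one feature)
X = np.array([[1., 0.5], [1., -1.0], [1., 2.0], [1., -0.3]])
y = np.array([1., 0., 0., 1.])

# BFGS (or method="L-BFGS-B") chooses the step size itself; no learning rate needed
res = minimize(cost, x0=np.zeros(2), args=(X, y), jac=grad, method="BFGS")
print(res.x, res.fun)
```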
3.5 Multiclass Classification
Split a K-class problem into K binary classification problems (one-vs-all); the K trained classifiers produce K scores, and the class with the highest score is taken as the prediction.
Does it really have to be this cumbersome?
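A sketch of the one-vs-all scheme on top of a plain logistic-regression trainer (everything here, including the helper names, is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, iters=2000):
    """Plain batch gradient descent for binary logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all(X, labels, K):
    # One binary classifier per class: class k vs. the rest
    return np.array([train_logistic(X, (labels == k).astype(float)) for k in range(K)])

def predict(Theta, X):
    # K scores per sample; pick the class with the highest score
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# toy 3-class data in 2D (plus bias column)
rng = np.random.default_rng(4)
centers = np.array([[0., 0.], [3., 0.], [0., 3.]])
pts = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])
labels = np.repeat(np.arange(3), 30)
X = np.hstack([np.ones((90, 1)), pts])

Theta = one_vs_all(X, labels, K=3)
print((predict(Theta, X) == labels).mean())  # training accuracy, close to 1.0
```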
3.6 Overfitting & Regularization
Solutions:
- reduce the number of features
- regularization (shrink the magnitudes of the parameters $\theta_j$)
Does regularization affect convexity?
It does not:
https://blog.csdn.net/yyxyuxueYang/article/details/81534965
The theory behind regularization: small $\theta$ values give
- a simpler hypothesis
  (e.g. adding a $\lambda{\theta_3}^2$ term to the cost function forces $\theta_3$ to be small at convergence)
  (if $\lambda$ is too large, the $\theta$ values become too small at convergence and the model underfits)
- a model that is less prone to overfitting
With the regularization term added, the Normal Equation solution for linear regression becomes:
$$\begin{aligned} J(\theta)&=\frac{1}{2m}\Big[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n}{\theta_j}^2\Big]\quad(\theta_0\text{ is not penalized})\\ &\dArr\text{ for }j\geq1\\ \frac{\partial}{\partial\theta_j}J(\theta)&=\Big[\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}){x_j}^{(i)}\Big]+\frac{\lambda}{m}\theta_j\\ \frac{\partial}{\partial\theta}J(\theta)&=\frac{1}{m}X^T(X\theta-y)+\frac{\lambda}{m}\theta\\ &\dArr\text{ set it to }0\\ X^TX\theta+\lambda\theta&=X^Ty\\ (X^TX+\lambda I)^{-1}(X^TX+\lambda I)\theta&=(X^TX+\lambda I)^{-1}X^Ty=\theta\\ &\dArr\text{ including }\theta_0\\ \theta&=\Big(X^TX+\lambda\begin{bmatrix} 0 & 0 \\ 0 & I_{n\times n} \end{bmatrix}\Big)^{-1}X^Ty \end{aligned}$$
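A sketch of this regularized normal equation, with the $(n+1)\times(n+1)$ penalty matrix whose top-left entry is 0 so that $\theta_0$ is not penalized (synthetic data; `lam` is an arbitrary value):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = X @ rng.normal(size=n + 1) + 0.1 * rng.normal(size=m)

lam = 1.0
L = np.eye(n + 1)
L[0, 0] = 0.0                      # do not penalize theta_0

# theta = (X^T X + lambda * L)^{-1} X^T y
theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
print(theta)
```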
With the regularization term added, the gradient descent update for logistic regression becomes:
$$\begin{aligned} J(\theta)&=\Big[\frac{1}{m}\sum_{i=1}^{m}-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\Big]+\frac{\lambda}{2m}\sum_{j=1}^{n}{\theta_j}^2\\ &\dArr\text{ for }j\geq1\\ \frac{\partial}{\partial \theta_j}J(\theta)&=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}){x_j}^{(i)}+\frac{\lambda}{m}\theta_j\\ &\dArr\\ \theta_j&:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\\ &=\theta_j-\alpha\Big[\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}){x_j}^{(i)}+\frac{\lambda}{m}\theta_j\Big]\\ &=\Big(1-\alpha\frac{\lambda}{m}\Big)\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}){x_j}^{(i)} \end{aligned}$$
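And a matching sketch of the regularized update for logistic regression, where the $(1-\alpha\lambda/m)$ shrinkage is applied to every $\theta_j$ except $\theta_0$ (synthetic data; `alpha`, `lam`, and the iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 3))])
y = (sigmoid(X @ np.array([0.5, 1.5, -2.0, 0.0])) > rng.uniform(size=m)).astype(float)

theta = np.zeros(4)
alpha, lam = 0.5, 1.0
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ theta) - y) / m    # unregularized part
    reg = (lam / m) * theta
    reg[0] = 0.0                                  # theta_0 is not penalized
    theta -= alpha * (grad + reg)                 # == (1 - alpha*lam/m)*theta_j - alpha*grad_j

print(theta)  # regularization shrinks theta_1..theta_3 toward 0
```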