Learning problems in which the target variable is discrete are called classification.
Generative Learning Algorithms
In a generative model, we first model the class-conditional probability density $p(x\mid y)$ and the prior $p(y)$. This step can be done by applying MLE to the joint distribution:

$$
l(\theta) = \log\prod\limits_i p(x^{(i)}, y^{(i)})
$$
We then use Bayes' rule to compute the posterior:

$$
p(y\mid x) = \frac{p(x\mid y)p(y)}{p(x)} \propto p(x\mid y)p(y)
$$
In general, prediction only needs to minimize the error rate, which gives the decision rule

$$
h(x) = \arg\max\limits_y p(x\mid y)p(y)
$$
Sometimes certain incorrect predictions incur a much larger loss than others. To reduce the loss caused by prediction errors, we can construct a loss table $\lambda(h(x), y)$ giving the loss incurred when the true class is $y$ but the predicted class is $h(x)$, and define the conditional expected loss

$$
R(\alpha\mid x) = \sum\limits_{i} \lambda(\alpha, i)\,p(x\mid i)\,p(i)
$$
The prediction can then be written as

$$
h(x) = \arg\min\limits_{\alpha} R(\alpha\mid x)
$$
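A minimal sketch of this risk-minimizing decision rule, assuming the classes are $1,\dots,k$, `post` holds the unnormalized posteriors $p(x\mid i)p(i)$, and `Lambda` is the $k\times k$ loss table (all names here are illustrative, not from the original notes):

% Pick the action minimizing the conditional expected loss R(a|x).
% post:   k x 1 vector, post(i) = p(x|i) * p(i)
% Lambda: k x k loss table, Lambda(a, i) = loss of predicting a when the truth is i
function a = predict_min_risk(post, Lambda)
    R = Lambda * post(:);   % R(a) = sum_i Lambda(a, i) * p(x|i) * p(i)
    [~, a] = min(R);        % choose the least-risky class
end

With the 0-1 loss `Lambda = 1 - eye(k)`, this reduces to the error-rate rule $h(x) = \arg\max_y p(x\mid y)p(y)$ above.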
Gaussian Discriminant Analysis
For a binary classification problem, assume

$$
\begin{array}{rcl} y &\sim& Bern(\phi)\\ x\mid y = 1 &\sim& MVN(\vec\mu_1, \Sigma)\\ x\mid y = 0 &\sim& MVN(\vec\mu_0, \Sigma) \end{array}
$$
MLE gives

$$
\begin{array}{rcl} \phi &=& \frac{1}{m}\sum\limits_{i=1}^m y^{(i)}\\ \vec\mu_0 &=& \frac{\sum\limits_{i=1}^m\left(1-y^{(i)}\right)x^{(i)}} {\sum\limits_{i=1}^m\left(1-y^{(i)}\right)}\\ \vec\mu_1 &=& \frac{\sum\limits_{i=1}^m y^{(i)}x^{(i)}} {\sum\limits_{i=1}^m y^{(i)}}\\ \Sigma &=& \frac{1}{m}\sum\limits_{i=1}^m\left(x^{(i)}-\mu_{y^{(i)}}\right)\left(x^{(i)}-\mu_{y^{(i)}}\right)^T \end{array}
$$
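A minimal sketch of these estimates, assuming `X` is an $m\times n$ feature matrix and `y` an $m\times 1$ vector of 0/1 labels (variable names are illustrative):

% GDA maximum-likelihood estimates (sketch).
% X: m x n feature matrix, y: m x 1 labels in {0, 1}.
m     = size(X, 1);
phi   = mean(y);                              % p(y = 1)
mu0   = sum(X(y == 0, :), 1)' / sum(y == 0);  % class-0 mean, n x 1
mu1   = sum(X(y == 1, :), 1)' / sum(y == 1);  % class-1 mean, n x 1
Mu    = (1 - y) * mu0' + y * mu1';            % m x n, row i holds mu_{y^(i)}
Sigma = (X - Mu)' * (X - Mu) / m;             % shared covariance, n x n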
The GDA posterior can be written in the form

$$
p(y=1\mid x;\phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1+\exp(-\theta_{\phi, \Sigma, \mu_0, \mu_1}^Tx)}
$$
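For reference (a standard identity, not spelled out in the original notes), the exponent can be written with an explicit intercept $\theta_0$, which the expression above absorbs into $x$:

$$
\theta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad \theta_0 = \frac{1}{2}\left(\mu_0^T\Sigma^{-1}\mu_0 - \mu_1^T\Sigma^{-1}\mu_1\right) + \log\frac{\phi}{1-\phi}
$$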
Although this has the same form as logistic regression, in general the two do not learn the same decision boundary. In fact, any exponential-family class-conditional distribution $x\mid y \sim ExpFamily(\eta)$ leads to a posterior of this logistic form. Logistic regression is therefore more robust, since it makes weaker assumptions about the model; GDA assumes the data are Gaussian and, when that assumption holds, converges faster (needs less data).
Naive Bayes Classifier
If the components $x_i$ of the feature vector are conditionally independent given the class, i.e. the NB assumption holds:

$$
p(x_1, \dots, x_n\mid y) = \prod\limits_{i=1}^n p(x_i\mid y)
$$
then a naive Bayes classifier can be used. Consider the following problem: decide whether an email is spam, assuming we have a dictionary $V$ that maps words to the integers $\{1, \dots, n\}$.
Take the feature $x\in \{0,1\}^n$, whose $i$-th component indicates whether the $i$-th word of the dictionary appears in the email. Applying MLE gives the parameters

$$
\begin{array}{rcccl} \phi_{i|y=1} &\equiv& p(x_i=1|y=1) &=& \frac{\sum\limits_{j=1}^mx_i^{(j)}y^{(j)}}{\sum\limits_{j=1}^my^{(j)}}\\ \phi_{i|y=0} &\equiv& p(x_i=1|y=0) &=& \frac{\sum\limits_{j=1}^mx_i^{(j)}(1-y^{(j)})}{\sum\limits_{j=1}^m(1-y^{(j)})}\\ \phi_y &\equiv& p(y=1) &=& \frac{\sum\limits_{j=1}^my^{(j)}}{m} \end{array}
$$
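A minimal sketch of these estimates and the corresponding prediction, assuming `X` is an $m\times n$ binary matrix with `X(j, i) = 1` iff word $i$ appears in email $j$, and `y` holds the $m$ spam labels (names are illustrative):

% Multivariate Bernoulli naive Bayes MLE (sketch, no smoothing yet).
phi_y = mean(y);                       % p(y = 1)
phi1  = (X' * y)       / sum(y);       % n x 1, phi_{i|y=1}
phi0  = (X' * (1 - y)) / sum(1 - y);   % n x 1, phi_{i|y=0}
% Unnormalized log-posteriors for a new binary feature vector x (n x 1):
lp1 = sum(x .* log(phi1) + (1 - x) .* log(1 - phi1)) + log(phi_y);
lp0 = sum(x .* log(phi0) + (1 - x) .* log(1 - phi0)) + log(1 - phi_y);
is_spam = lp1 > lp0;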
The model above is called the multivariate Bernoulli event model, because the feature $x$ is a multivariate Bernoulli variable. Another possible model is the multinomial event model, in which the feature $x \in \{1,\dots,n\}^l$ is the email transcribed into dictionary indices, where $l$ is the number of words in the email. Applying MLE gives the parameters
$$
\begin{array}{rcccl} \phi_{k|y=1} &\equiv& p(x_j=k|y=1) &=& \frac{\sum\limits_{i=1}^m \sum\limits_{j=1}^{l_i}y^{(i)}I\{x_j^{(i)}=k\}}{\sum\limits_{i=1}^my^{(i)}l_i}\\ \phi_{k|y=0} &\equiv& p(x_j=k|y=0) &=& \frac{\sum\limits_{i=1}^m\sum\limits_{j=1}^{l_i}(1-y^{(i)})I\{x_j^{(i)}=k\}}{\sum\limits_{i=1}^m(1-y^{(i)})l_i}\\ \phi_y &\equiv& p(y=1) &=& \frac{\sum\limits_{i=1}^my^{(i)}}{m} \end{array}
$$

where $k$ indexes dictionary words, $i$ indexes training emails, and $j$ indexes word positions within an email.
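A sketch of these estimates, assuming the training emails are stored as a cell array `docs`, where `docs{i}` is the vector of word indices (each in $\{1,\dots,n\}$) of email $i$ (this layout is hypothetical):

% Multinomial event model MLE (sketch, no smoothing yet).
num1 = zeros(n, 1); den1 = 0;   % word counts / total length over spam emails
num0 = zeros(n, 1); den0 = 0;   % same over non-spam emails
for i = 1:numel(docs)
    c = accumarray(docs{i}(:), 1, [n 1]);   % word-count vector for email i
    if y(i) == 1
        num1 = num1 + c;  den1 = den1 + numel(docs{i});
    else
        num0 = num0 + c;  den0 = den0 + numel(docs{i});
    end
end
phi1  = num1 / den1;    % phi_{k|y=1}
phi0  = num0 / den0;    % phi_{k|y=0}
phi_y = mean(y);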
A problem with the algorithm above is that if some word never appears in any spam training example, then a test email containing that word will never be classified as spam. This can be fixed with Laplace smoothing, whose idea is to give every word a default, nonzero probability of occurrence. Suppose the MLE for a random variable $z \sim Multinomial(\phi_1, \phi_2, \dots, \phi_{k-1})$ yields a count ratio $\phi_j = p / q$. Since $z$ takes one of $k$ values, each value should get a default probability of $1/k$, so the smoothed estimate becomes

$$
\phi_j = \frac{p+1}{q+k}
$$
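In the sketches above this amounts to adding 1 to every count and adding the number of possible values to every denominator: $k = n$ for the multinomial event model and $k = 2$ for the Bernoulli features (variable names carried over from the earlier, hypothetical sketches):

% Laplace-smoothed versions of the earlier estimates.
phi1 = (num1 + 1) / (den1 + n);          % multinomial model, k = n words
phi0 = (num0 + 1) / (den0 + n);
phi1_bern = (X' * y + 1) / (sum(y) + 2); % Bernoulli model, k = 2 values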
Note that the algorithm above only works when $x$ is a discrete random variable. To apply naive Bayes to a continuous random variable, we can first discretize it into a discrete one.
Discriminative Learning Algorithms
In a discriminative model, we model the posterior $p(y\mid x)$ directly and use $p(y\mid x) = 0.5$ as the decision boundary.
Softmax Regression
For a $k$-class decision problem, assume the target variable

$$
y\mid x;\theta \sim Multinomial(1, \vec\phi)
$$
where the per-class probabilities are

$$
\vec\phi = (\phi_1, \phi_2, \dots, \phi_{k-1})
$$
$\phi_k$ is usually not treated as a parameter, because normalization requires

$$
\phi_k = 1-\sum\limits_{i=1}^{k-1}\phi_i
$$
We now want to show that the multinomial distribution belongs to the exponential family. In order to have

$$
h(x) = E[T(y)] = \left[\begin{array}{c} \phi_1\\ \phi_2\\ \vdots\\ \phi_{k-1}\\ \end{array}\right]
$$
we construct

$$
T(y) = \left[\begin{array}{c} I\{y = 1\}\\ I\{y = 2\}\\ \vdots\\ I\{y = k-1\}\\ \end{array}\right]
$$
Then observe that

$$
I\{y = k\} = 1 - \sum\limits_{i = 1}^{k - 1} T_i(y)
$$
Since the multinomial distribution satisfies

$$
\begin{array}{rcl} p(y;\phi) &=& \prod\limits_{i=1}^k\phi_i^{I\{y = i\}}\\ &=& \exp\left(\sum\limits_{i = 1}^k I\{y = i\}\log\phi_i\right)\\ &=& \exp\left(\sum\limits_{i = 1}^{k - 1} T_i(y)\log\phi_i + \left(1 - \sum\limits_{i = 1}^{k - 1} T_i(y)\right)\log\phi_k\right)\\ &=& \exp\left(\sum\limits_{i = 1}^{k - 1} T_i(y)\log\frac{\phi_i}{\phi_k} + \log\phi_k\right) \end{array}
$$
the exponential-family parameters are

$$
\begin{array}{rcl} \eta &=& \left[\begin{array}{c}\log(\phi_1/\phi_k)\\\log(\phi_2/\phi_k)\\\vdots\\\log(\phi_{k-1}/\phi_k)\\\end{array}\right]\\ a(\eta) &=& -\log(\phi_k)\\ b(y) &=& 1 \end{array}
$$
Defining $\eta_k \equiv \log\frac{\phi_k}{\phi_k} = 0$, we get $\phi_i = \phi_k e^{\eta_i}$; summing over all $k$ classes and using $\sum_i\phi_i = 1$ gives $\phi_k = 1/\sum_{j=1}^k e^{\eta_j}$, so

$$
h(x) = \left[\begin{array}{c} \frac{\exp{\eta_1}}{\sum\limits_{j=1}^k\exp{\eta_j}}\\ \frac{\exp{\eta_2}}{\sum\limits_{j=1}^k\exp{\eta_j}}\\ \vdots\\ \frac{\exp{\eta_{k-1}}}{\sum\limits_{j=1}^k\exp{\eta_j}}\\ \end{array}\right] = \left[\begin{array}{c} \frac{\exp{\theta_1^Tx}}{\sum\limits_{j=1}^k\exp{\theta_j^Tx}}\\ \frac{\exp{\theta_2^Tx}}{\sum\limits_{j=1}^k\exp{\theta_j^Tx}}\\ \vdots\\ \frac{\exp{\theta_{k-1}^Tx}}{\sum\limits_{j=1}^k\exp{\theta_j^Tx}}\\ \end{array}\right]
$$
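A small sketch of this hypothesis, assuming `Theta` is an $n\times(k-1)$ matrix whose $j$-th column is $\theta_j$ and `x` an $n\times 1$ feature vector (names are illustrative):

% Softmax hypothesis h(x) (sketch). Column j of Theta is theta_j, j = 1..k-1;
% theta_k is implicitly 0, matching eta_k = 0 above.
eta = [Theta' * x; 0];           % k x 1 vector of eta_j = theta_j' * x
eta = eta - max(eta);            % shift for numerical stability (probabilities unchanged)
p   = exp(eta) / sum(exp(eta));  % k x 1 class probabilities
h   = p(1:end-1);                % first k-1 entries, as in the formula above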
Logistic Regression
For a binary classification problem, assume the target variable

$$
y\mid x;\theta \sim Bern(p)
$$
This is the special case of softmax regression with $k = 2$, from which we obtain

$$
h(x) = \frac{1}{1+e^{-\theta^Tx}}
$$
A function of this form is called the logistic or sigmoid function,

$$
g(z) = \frac{1}{1+e^{-z}}
$$
so the model is called logistic regression. Applying MLE gives

$$
\begin{array}{rcl} L(\theta) &=& \prod\limits_{i = 1}^m g\left(\theta^Tx^{(i)}\right)^{y^{(i)}} \left(1 - g\left(\theta^Tx^{(i)}\right)\right)^{1 - y^{(i)}}\\ l(\theta) &=& \sum\limits_{i = 1}^m y^{(i)}\ln g\left(\theta^Tx^{(i)}\right) + \left(1 - y^{(i)}\right)\ln\left(1 - g\left(\theta^Tx^{(i)}\right)\right) \end{array}
$$
Using the symmetry of the sigmoid function,

$$
g(-z) = 1 - g(z)
$$
the log-likelihood simplifies to

$$
\begin{array}{rcl} l(\theta) &=& \sum\limits_{i = 1}^m y^{(i)}\ln g\left(\theta^Tx^{(i)}\right) + \left(1 - y^{(i)}\right)\ln g\left(-\theta^Tx^{(i)}\right)\\ &=& \sum\limits_{i = 1}^m y^{(i)}\theta^Tx^{(i)} + \ln g\left(-\theta^Tx^{(i)}\right)\\ \end{array}
$$
Since the derivative of the sigmoid satisfies

$$
g'(z) = g(z)(1-g(z))
$$
we have

$$
\begin{array}{rcl} \nabla_\theta l &=& \sum\limits_{i = 1}^m y^{(i)}x^{(i)} + \frac{1}{g\left(-\theta^Tx^{(i)}\right)} \cdot g\left(-\theta^Tx^{(i)}\right) \left(1 - g\left(-\theta^Tx^{(i)}\right)\right)\left(-x^{(i)}\right)\\ &=& \sum\limits_{i = 1}^m y^{(i)}x^{(i)} - g\left(\theta^Tx^{(i)}\right)x^{(i)}\\ &=& \sum\limits_{i = 1}^m \left(y^{(i)} - g\left(\theta^Tx^{(i)}\right)\right)x^{(i)} \end{array}
$$
We can then solve for $\theta$ by gradient ascent on $l(\theta)$ (equivalently, gradient descent on the negative log-likelihood).
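A minimal unregularized sketch of this update (the Appendix below instead minimizes a regularized cost with fminunc); `X` is an $m\times n$ design matrix whose first column is all ones, `y` is $m\times 1$, and `alpha` is a hypothetical learning rate:

% Batch gradient ascent on l(theta) for logistic regression (sketch).
g     = @(z) 1 ./ (1 + exp(-z));      % sigmoid
theta = zeros(size(X, 2), 1);
alpha = 0.1;                          % learning rate (illustrative value)
for iter = 1:1000
    grad  = X' * (y - g(X * theta));  % nabla_theta l, as derived above
    theta = theta + alpha * grad;     % ascend the log-likelihood
end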
Appendix
Logistic Regression
main.m
%% Initialization
clear ; close all; clc
%% Load Data
% The first two columns contains the X values and the third column
% contains the label (y).
data = load('ex2data2.txt');
X = data(:, [1, 2]); y = data(:, 3);
% Note that mapFeature also adds a column of ones for us, so the intercept
% term is handled
X = mapFeature(X(:,1), X(:,2));
%% Regularization and Accuracies
initial_theta = zeros(size(X, 2), 1);
lambda = 1;
% Set Options
options = optimset('GradObj', 'on', 'MaxIter', 400);
% Optimize
[theta, J, exit_flag] = ...
fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
%% Plot Boundary
% Plot Data
figure; hold on;
pos = (y == 1);
neg = (y == 0);
plot(X(pos, 2), X(pos, 3), 'k+');
plot(X(neg, 2), X(neg, 3), 'ko');
hold on
% Here is the grid range
u = linspace(-1, 1.5, 50);
v = linspace(-1, 1.5, 50);
z = zeros(length(u), length(v));
% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        z(i,j) = mapFeature(u(i), v(j))*theta;
    end
end
z = z'; % important to transpose z before calling contour
% Plot z = 0
% Notice you need to specify the range [0, 0]
contour(u, v, z, [0, 0], 'LineWidth', 2)
title(sprintf('lambda = %g', lambda))
% Labels and Legend
xlabel('Microchip Test 1')
ylabel('Microchip Test 2')
legend('y = 1', 'y = 0', 'Decision boundary')
hold off;
%% Compute accuracy on our training set
p = double(logsig(X * theta) >= 0.5);
fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);
fprintf('Expected accuracy (with lambda = 1): 83.1 (approx)\n');
costFunctionReg.m
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
% J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
% You should set J to the cost.
% Compute the partial derivatives and set grad to the partial
% derivatives of the cost w.r.t. each parameter in theta
J = -1 / m * (sum(y .* log(logsig(X * theta)) + (1 - y) .* log(1 - logsig(X * theta))) - lambda / 2 * norm(theta(2:end))^2);
grad = 1 / m * (X' * (logsig(X * theta) - y) + lambda * [0; theta(2:end)]);
% =============================================================
end
mapFeature.m
function out = mapFeature(X1, X2)
% MAPFEATURE Feature mapping function to polynomial features
%
% MAPFEATURE(X1, X2) maps the two input features
% to quadratic features used in the regularization exercise.
%
% Returns a new feature array with more features, comprising of
% X1, X2, X1.^2, X2.^2, X1*X2, X1*X2.^2, etc..
%
% Inputs X1, X2 must be the same size
%
degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end
end