ML: Maximum Likelihood Estimation (MLE)
Probability density (mass) function: the function that describes, for each value a random variable can take, the probability associated with that value.
Probability: given a known distribution, compute how likely a sample is.
Likelihood: given observed samples, find the parameters under which the data are most probable.
Likelihood function: $\mathcal{L}(\mu, \sigma \mid X)=\prod_{i=1}^{N} P\left(x_{i} \mid \mu, \sigma\right)$
Log-likelihood function: $\log \mathcal{L}(\mu, \sigma \mid X)=\sum_{i=1}^{N} \log P\left(x_{i} \mid \mu, \sigma\right)$
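As a quick illustration, here is a minimal sketch of maximizing a Gaussian log-likelihood numerically; `scipy.stats.norm.logpdf` supplies $\log P(x_i \mid \mu, \sigma)$, and the grid search is purely illustrative (a proper optimizer, or the closed-form Gaussian MLE, would be used in practice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # observed samples X

def neg_log_likelihood(mu, sigma, x):
    # -log L(mu, sigma | X) = -sum_i log P(x_i | mu, sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Grid search over candidate parameters (illustration only)
mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
nll = np.array([[neg_log_likelihood(m, s, data) for s in sigmas] for m in mus])
i, j = np.unravel_index(nll.argmin(), nll.shape)

print(mus[i], sigmas[j])        # grid-search MLE
print(data.mean(), data.std())  # closed-form Gaussian MLE, for comparison
```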
Loss function: $J(\theta)=-\sum_{i}^{m}\left[y_{i} \log \hat{y}_{i}+\left(1-y_{i}\right) \log \left(1-\hat{y}_{i}\right)\right]$. We need the derivative of $J(\theta)$ with respect to each $\theta_{j}$, where $\hat{Y}=\frac{1}{1+e^{-\theta^{T} X}}$.
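For concreteness, a minimal NumPy sketch of this loss (the names `theta`, `X`, `y` are placeholders, and the clipping epsilon is a numerical-safety detail that the formula above does not mention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y, eps=1e-12):
    # y_hat_i = 1 / (1 + exp(-theta^T x_i)) for each row x_i of X
    y_hat = sigmoid(X @ theta)
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    # J(theta) = -sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```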
Using $\frac{d}{d x} \log _{a} f(x)=\frac{f^{\prime}(x)}{f(x) \ln a}$, substitute $\hat{Y}=\frac{1}{1+e^{-\theta^{T} X}}$ into $\log \hat{Y}$:
$$\frac{\partial}{\partial \theta_{j}} \log \hat{Y}=\frac{\partial}{\partial \theta_{j}} \log \left(\frac{1}{1+e^{-\theta^{T} x}}\right)=\frac{\partial}{\partial \theta_{j}}\left(\log 1-\log \left(1+e^{-\theta^{T} x}\right)\right)$$
$$\frac{\partial}{\partial \theta_{j}} \log \hat{Y}=\frac{\partial}{\partial \theta_{j}}\left(-\log \left(1+e^{-\theta^{T} x}\right)\right)=-\frac{1}{1+e^{-\theta^{T} x}} \cdot e^{-\theta^{T} x} \cdot\left(-x_{j}\right)=\left(1-\frac{1}{1+e^{-\theta^{T} x}}\right) x_{j}$$
$$\frac{\partial}{\partial \theta_{j}} \log (1-\hat{Y})=\frac{\partial}{\partial \theta_{j}} \log \left(\frac{e^{-\theta^{T} x}}{1+e^{-\theta^{T} x}}\right)=\frac{\partial}{\partial \theta_{j}}\left(-\theta^{T} x-\log \left(1+e^{-\theta^{T} x}\right)\right)$$
$$\frac{\partial}{\partial \theta_{j}} \log (1-\hat{Y})=-x_{j}+x_{j}\left(1-\frac{1}{1+e^{-\theta^{T} x}}\right)=-\frac{1}{1+e^{-\theta^{T} x}} x_{j}$$
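Both identities can be sanity-checked with central finite differences; the sketch below uses arbitrary values for $\theta$ and $x$ and compares the numerical derivatives against $\left(1-\hat{y}\right) x_{j}$ and $-\hat{y} x_{j}$:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
theta = np.array([0.3, -0.7, 1.2])   # arbitrary test values
x = np.array([1.0, 2.0, -0.5])
j, h = 1, 1e-6

def log_yhat(t):     return np.log(sigmoid(t @ x))       # log(y_hat)
def log_one_minus(t): return np.log(1 - sigmoid(t @ x))  # log(1 - y_hat)

tp = theta.copy(); tp[j] += h
tm = theta.copy(); tm[j] -= h

# d/d theta_j log(y_hat) should equal (1 - y_hat) * x_j
print((log_yhat(tp) - log_yhat(tm)) / (2 * h), (1 - sigmoid(theta @ x)) * x[j])
# d/d theta_j log(1 - y_hat) should equal -y_hat * x_j
print((log_one_minus(tp) - log_one_minus(tm)) / (2 * h), -sigmoid(theta @ x) * x[j])
```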
Combining the two: $\frac{\partial}{\partial \theta_{j}} J(\theta)=-\sum_{i}^{m}\left[y_{i} x_{i j}\left(1-\frac{1}{1+e^{-\theta^{T} x_{i}}}\right)-\left(1-y_{i}\right) x_{i j} \frac{1}{1+e^{-\theta^{T} x_{i}}}\right]$
where $i$ indexes the data points and $j$ indexes the features; the input $X$ can be written as:
$$X=\left[\begin{array}{lll}x_{i=1, j=1} & x_{i=2, j=1} & x_{i=3, j=1} \\ x_{i=1, j=2} & x_{i=2, j=2} & x_{i=3, j=2} \\ x_{i=1, j=3} & x_{i=2, j=3} & x_{i=3, j=3}\end{array}\right]$$
For example, for a batch of images, $x_{i j}$ is the $j$-th pixel of the $i$-th image.
Expanding and simplifying: $\frac{\partial}{\partial \theta_{j}} J(\theta)=\sum_{i}^{m}\left(\frac{1}{1+e^{-\theta^{T} x_{i}}}-y_{i}\right) x_{i j}=\sum_{i}^{m}\left(\hat{y}_{i}-y_{i}\right) x_{i j}$, where $\hat{y}_{i}=\frac{1}{1+e^{-\theta^{T} x_{i}}}$.
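In vectorized form this gradient is just $X^{T}(\hat{y}-y)$ over the batch; a minimal sketch, assuming the rows of `X` are data points (i.e. the transpose of the layout drawn above):

```python
import numpy as np

def gradient(theta, X, y):
    # X: shape (m, n), row i is data point x_i; y: shape (m,), labels in {0, 1}
    y_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    # dJ/d theta_j = sum_i (y_hat_i - y_i) * x_ij, i.e. X^T (y_hat - y)
    return X.T @ (y_hat - y)
```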
Recall the vector forms of $\theta$ and $X$: $\theta^{T}=\left[\begin{array}{lll}\text{bias} & \theta_{1} & \theta_{2}\end{array}\right]$ and $X=\left[\begin{array}{c}1 \\ x_{1} \\ x_{2}\end{array}\right]$
Since the $bias$ in $\theta$ pairs with the constant $1$ in $X$, we get $\frac{\partial}{\partial bias} J(\theta)=\sum_{i}^{m}\left(\hat{y}_{i}-y_{i}\right)$.
Choose a learning rate $\eta$ and iterate the following updates until convergence:
$$\theta_{j} \leftarrow \theta_{j}-\eta \frac{\partial}{\partial \theta_{j}} J(\theta)$$
$$bias \leftarrow bias-\eta \frac{\partial}{\partial bias} J(\theta)$$
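Putting the updates together, a minimal end-to-end sketch on synthetic data (the learning rate $\eta$ and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.normal(size=(m, n))                       # rows are data points
true_theta, true_bias = np.array([2.0, -1.0]), 0.5
y = (rng.random(m) < sigmoid(X @ true_theta + true_bias)).astype(float)

theta, bias, eta = np.zeros(n), 0.0, 0.001
for _ in range(5000):                             # iterate until (roughly) converged
    y_hat = sigmoid(X @ theta + bias)
    theta = theta - eta * (X.T @ (y_hat - y))     # theta_j <- theta_j - eta * dJ/d theta_j
    bias = bias - eta * np.sum(y_hat - y)         # bias <- bias - eta * dJ/d bias

print(theta, bias)  # roughly recovers true_theta and true_bias
```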