1 The Least Squares Method (Least Squares)
A mathematical method for computing the optimal solution directly, in closed form.
$$\sum_{j=1}^n X_{ij}\beta_j=y_i,\quad(i=1,2,\dots,m),\qquad X\beta=y$$
$$X=\left[\begin{matrix} X_{11} & X_{12} & \dots & X_{1n}\\ X_{21} & X_{22} & \dots & X_{2n}\\ X_{31} & X_{32} & \dots & X_{3n}\\ \vdots & \vdots & \ddots & \vdots\\ X_{m1} & X_{m2} & \dots & X_{mn} \end{matrix}\right],\quad \beta=\left[\begin{matrix} \beta_{1}\\ \beta_{2}\\ \beta_{3}\\ \vdots\\ \beta_{n} \end{matrix}\right],\quad y=\left[\begin{matrix} y_{1}\\ y_{2}\\ y_{3}\\ \vdots\\ y_{m} \end{matrix}\right]$$
$$\hat\beta=\arg\min_\beta S(\beta),\qquad S(\beta)=\sum_{i=1}^m\Big|y_i-\sum_{j=1}^n X_{ij}\beta_j\Big|^2=\|y-X\beta\|^2$$
Derivation:
$$\|y-X\beta\|^2=(y-X\beta)^T(y-X\beta)=(y^T-\beta^TX^T)(y-X\beta)=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta$$
Here $y^TX\beta$ and $\beta^TX^Ty$ are both scalars and are each other's transpose, so they are equal, which gives
$$\|y-X\beta\|^2=y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta.$$
Setting the gradient to zero:
$$\frac{\partial S}{\partial \beta}=\frac{\partial \|y-X\beta\|^2}{\partial \beta}=\frac{\partial (\beta^TX^TX\beta)}{\partial \beta}-2X^Ty=0$$
Aside (1): the product rule for differentiating with respect to a vector, $\frac{d(U^TV)}{dx}=\frac{d(U^T)}{dx}V+\frac{d(V^T)}{dx}U$.
Aside (2): for a square matrix $B$, applying (1) with $U=x$, $V=Bx$ gives $\frac{d(x^TBx)}{dx}=Bx+B^Tx=(B+B^T)x$.
Therefore, since $X^TX$ is symmetric, $\frac{\partial (\beta^TX^TX\beta)}{\partial \beta}=X^TX\beta+X^TX\beta=2X^TX\beta$, so $2X^TX\beta-2X^Ty=0$, and hence $\hat\beta=(X^TX)^{-1}X^Ty$.
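The closed-form solution above can be checked numerically. The sketch below uses synthetic, noiseless data (an assumption for the example), and solves the normal equations $X^TX\beta=X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, which is the numerically preferred route.

```python
import numpy as np

# Illustrative sketch: recover beta from the normal equations
# beta_hat = (X^T X)^{-1} X^T y, using synthetic noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # m = 50 samples, n = 3 features
true_beta = np.array([1.0, -2.0, 0.5])  # illustrative ground truth
y = X @ true_beta                       # noiseless, so recovery is exact

# Solve X^T X beta = X^T y instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, true_beta))  # True
```

With noisy data the recovery is no longer exact, but $\hat\beta$ remains the minimizer of $\|y-X\beta\|^2$.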
Geometric interpretation:
Probabilistic interpretation:
Assume the error between the true value and the estimate follows a normal distribution. Then the conditional density satisfies:
$$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}$$
To determine $\theta$, we use maximum likelihood estimation; this is what lets us connect maximizing the likelihood with minimizing the loss function.
$$L(\theta)=L(\theta;X,y)=p(y|X;\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}$$
Minimizing the loss function, i.e., the sum of squared residuals appearing in the exponent, is exactly what maximizes the likelihood.
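This equivalence can be verified numerically: the Gaussian negative log-likelihood is an affine function of the sum of squared residuals, so the two objectives always rank any pair of parameter vectors the same way. The data, noise level, and candidate parameters below are illustrative assumptions.

```python
import numpy as np

# Sketch: Gaussian NLL = const + (sum of squared residuals) / (2 sigma^2),
# so both objectives are minimized by the same theta.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
theta_true = np.array([0.7, -1.3])          # illustrative ground truth
y = x @ theta_true + 0.1 * rng.normal(size=100)
sigma = 0.1

def neg_log_likelihood(t):
    r = y - x @ t
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

def squared_loss(t):
    return np.sum((y - x @ t) ** 2)

t1, t2 = np.array([0.7, -1.3]), np.array([0.5, -1.0])
# The orderings agree: lower squared loss <=> lower NLL.
print((squared_loss(t1) < squared_loss(t2))
      == (neg_log_likelihood(t1) < neg_log_likelihood(t2)))  # True
```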
2 Logistic Regression
2.1 Sigmoid Function
The logistic model:
$$P(Y=1|x)=\frac{e^{w\cdot x}}{1+e^{w\cdot x}}$$
$$P(Y=0|x)=\frac{1}{1+e^{w\cdot x}}$$
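A minimal sketch of the two class probabilities, with illustrative values for $w$ and $x$; by construction the two expressions sum to one.

```python
import numpy as np

# Sketch of the logistic model's two class probabilities.
def p_y1(w, x):
    z = np.dot(w, x)
    return np.exp(z) / (1 + np.exp(z))   # P(Y=1|x)

def p_y0(w, x):
    z = np.dot(w, x)
    return 1 / (1 + np.exp(z))           # P(Y=0|x)

w = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])    # illustrative input
print(p_y1(w, x) + p_y0(w, x))  # 1.0 — the two probabilities sum to one
```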
The odds of an event are the ratio of the probability that the event occurs to the probability that it does not. The log-odds (logit) is
$$\mathrm{logit}(p)=\log\frac{p}{1-p}.$$
For the logistic model, the log-odds is the linear function $w\cdot x$. Viewed the other way around: the linear function $w\cdot x$ that classifies $x$ is converted into a probability by the logistic model.
2.2 Parameter Estimation for Logistic Regression
For the data set
$$T=\{(x^{(1)},y^{(1)}),\dots,(x^{(N)},y^{(N)})\},\quad y\in\{0,1\},$$
the parameters can be estimated by maximum likelihood, which yields the logistic model.
$$P(Y=1|x)=\pi(x),\qquad P(Y=0|x)=1-\pi(x)$$
The likelihood function is therefore:
$$\prod_{i=1}^N[\pi(x^{(i)})]^{y^{(i)}}[1-\pi(x^{(i)})]^{1-y^{(i)}}$$
The log-likelihood is:
$$L(w)=\sum_{i=1}^N\left[y_i\log\pi(x^{(i)})+(1-y_i)\log(1-\pi(x^{(i)}))\right]=\sum_{i=1}^N\left[y_i\log\frac{\pi(x^{(i)})}{1-\pi(x^{(i)})}+\log(1-\pi(x^{(i)}))\right]$$
$$\dots=\sum_{i=1}^N\left[y_i(w\cdot x^{(i)})-\log(1+e^{w\cdot x^{(i)}})\right]$$
Maximizing $L(w)$ then gives the estimate $\hat w$.
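In practice the maximizer has no closed form, so $L(w)$ is maximized iteratively. The sketch below uses plain gradient ascent on the log-likelihood (whose gradient is $\sum_i (y_i-\pi(x^{(i)}))x^{(i)}$); the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Sketch: gradient ascent on the log-likelihood L(w) of logistic regression.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -1.0])                       # illustrative ground truth
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))     # pi(x^{(i)}) for every sample
    w += lr * X.T @ (y - p) / len(y)  # averaged gradient of L(w)

print(w)  # should land near w_true (up to sampling noise)
```

Because the labels are sampled, the recovered $w$ matches $w_{\text{true}}$ only approximately, but its signs and rough magnitudes agree.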
2.3 Multinomial Logistic Regression (Multi-nominal Logistic Regression)
A generalization of logistic regression to multi-class models.
Suppose the discrete random variable $Y$ takes values in $\{1,2,\dots,K\}$. Then the multinomial logistic regression model is:
$$P(Y=k|x)=\frac{e^{w_k\cdot x}}{1+\sum_{k=1}^{K-1}e^{w_k\cdot x}},\quad k\in\{1,2,\dots,K-1\}$$
$$P(Y=K|x)=\frac{1}{1+\sum_{k=1}^{K-1}e^{w_k\cdot x}},\quad x\in\mathbb{R}^{n+1},\ w_k\in\mathbb{R}^{n+1}$$
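These $K$ expressions share one denominator, so the class probabilities sum to one. A minimal sketch, with illustrative weights (class $K$ acts as the reference class with weight zero):

```python
import numpy as np

# Sketch of multinomial logistic probabilities; rows of W are w_1..w_{K-1}.
def multinomial_probs(W, x):
    scores = np.exp(W @ x)                    # e^{w_k . x}, k = 1..K-1
    denom = 1 + scores.sum()                  # shared denominator
    return np.append(scores / denom, 1 / denom)  # last entry is P(Y=K|x)

W = np.array([[0.2, -0.5],    # illustrative weights, K = 3 classes
              [1.0,  0.3]])   # n+1 = 2 dimensions
x = np.array([1.0, 2.0])      # illustrative input
p = multinomial_probs(W, x)
print(np.isclose(p.sum(), 1.0))  # True — the K probabilities sum to one
```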