A Brief Overview of Common Models and Distances in Machine Learning

This post briefly introduces several models and distance measures commonly used in machine learning.

Common Models

Linear Model
$$f_{\theta}(x)=\sum_{j=1}^{b}\theta_{j}\phi_{j}(x)=\theta^{T}\phi(x)$$
where $\phi(x)$ is the basis-function vector and $\theta$ is the parameter vector, so there are $b$ basis functions.
The basis functions can take different forms, such as the polynomial basis:
$$\phi(x)=(1,x,x^{2},\cdots,x^{b-1})^{T}$$
or the trigonometric basis:
$$\phi(x)=(1,\sin x,\cos x,\sin 2x,\cos 2x,\cdots,\sin mx,\cos mx)^{T}$$

$x$ may be a vector rather than a scalar. Note that the basis functions of a linear model are fixed in advance and do not depend on the training set.
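As a minimal sketch (the basis size $b=4$, the toy target function, and the noise level are illustrative assumptions), a polynomial-basis linear model can be fit by ordinary least squares:

import numpy as np

# f_theta(x) = theta^T phi(x) with the polynomial basis phi(x) = (1, x, x^2, x^3)^T
def phi(x, b=4):
    return np.array([x ** j for j in range(b)])

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=20)

Phi = np.stack([phi(x) for x in x_train])              # design matrix, one row phi(x_i) per sample
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)  # least-squares estimate of theta
print(theta @ phi(0.5))                                # prediction f_theta(0.5)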

Multiplicative Model
Multi-dimensional basis functions are obtained by taking products of one-dimensional basis functions:
$$f_{\theta}(x)=\sum_{j_{1}=1}^{b'}\cdots\sum_{j_{d}=1}^{b'}\theta_{j_{1},\cdots,j_{d}}\phi_{j_{1}}(x^{(1)})\cdots\phi_{j_{d}}(x^{(d)})$$
The number of parameters is $(b')^{d}$, which grows exponentially with the input dimension $d$: an obvious curse of dimensionality.

Additive Model
$$f_{\theta}(x)=\sum_{k=1}^{d}\sum_{j=1}^{b'}\theta_{k,j}\phi_{j}(x^{(k)})$$
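The additive model needs only $d\,b'$ parameters. The contrast with the multiplicative model is easiest to see by counting parameters; the sketch below uses illustrative values $d=10$ and $b'=5$:

# Parameter counts of the multiplicative vs. the additive model (illustrative d, b').
d, b_prime = 10, 5
multiplicative_params = b_prime ** d   # one theta_{j1,...,jd} per index combination
additive_params = d * b_prime          # one theta_{k,j} per (dimension, basis) pair
print(multiplicative_params, additive_params)   # 9765625 vs. 50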

Kernel Model
Kernel functions are binary (two-argument) functions that depend on the training set. A kernel model is defined as a linear combination of kernel functions:
$$f_{\theta}(x)=\sum_{j=1}^{n}\theta_{j}K(x,x_{j})$$
There are many types of kernel functions, but the Gaussian kernel is by far the most popular:
$$K(x,c)=\exp\left(-\frac{\|x-c\|^{2}}{2h^{2}}\right)$$
A kernel model places a kernel on each training sample $x_{i}$ and learns its height $\theta_{i}$. It can therefore approximate the target function only in the neighborhood of the training samples, regardless of the dimension of $x_{i}$.
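A minimal sketch of a Gaussian kernel model, assuming an illustrative bandwidth $h$ and a toy one-dimensional training set, with $\theta$ fitted by least squares:

import numpy as np

def gauss_kernel(x, c, h=0.3):
    return np.exp(-np.sum((x - c) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, (20, 1))
y_train = np.sin(np.pi * x_train[:, 0])

K = np.array([[gauss_kernel(xi, xj) for xj in x_train] for xi in x_train])  # K[i, j] = K(x_i, x_j)
theta, *_ = np.linalg.lstsq(K, y_train, rcond=None)

x_new = np.array([0.5])
k_new = np.array([gauss_kernel(x_new, xj) for xj in x_train])
print(k_new @ theta)   # f_theta(x_new)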

Hierarchical Model
Hierarchical models are nonlinear models:
$$f_{\theta}(x)=\sum_{j=1}^{b}\alpha_{j}\phi(x,\beta_{j})$$
Here $\theta=(\alpha^{T},\beta_{1}^{T},\cdots,\beta_{b}^{T})^{T}$, and the $\phi(x,\beta_{j})$ are basis functions. Two typical choices are:
Sigmoid function (used in artificial neural networks)
$$\phi(x,\beta)=\frac{1}{1+\exp(-x^{T}\omega-\gamma)},\quad \beta=(\omega^{T},\gamma)^{T}$$
Gaussian function
$$\phi(x,\beta)=\exp\left(-\frac{\|x-c\|^{2}}{2h^{2}}\right),\quad \beta=(c^{T},h)^{T}$$
Note that $\theta$ and $f_{\theta}$ are not in one-to-one correspondence: different parameter vectors can produce the same function.
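As a minimal sketch (the input dimension, number of basis functions, weights, and input are illustrative), a hierarchical model with sigmoid basis functions is simply a one-hidden-layer network evaluated as follows:

import numpy as np

# f_theta(x) = sum_j alpha_j * sigmoid(x^T omega_j + gamma_j)
def sigmoid_basis(x, omega, gamma):
    return 1.0 / (1.0 + np.exp(-(x @ omega) - gamma))

rng = np.random.default_rng(0)
d, b = 3, 4                       # input dimension and number of basis functions
alpha = rng.normal(size=b)        # heights alpha_j
omega = rng.normal(size=(d, b))   # one omega_j per basis function (columns)
gamma = rng.normal(size=b)

x = np.array([0.2, -0.5, 1.0])
print(alpha @ sigmoid_basis(x, omega, gamma))   # f_theta(x)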

Gaussian Distribution
In the univariate case, $x\in(-\infty,\infty)$, and the parameters of the Gaussian distribution are the mean $\mu\in(-\infty,\infty)$ and the variance $\sigma^{2}>0$. The probability density function is defined as follows:
$$p(x\mid\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\}$$
$$\mu=\mathbb{E}[x],\qquad \sigma^{2}=\mathrm{var}[x]$$

In the multivariate case, consider a $d$-dimensional vector $x$; the mean $\mu$ is then a $d$-dimensional vector as well, while the covariance becomes a positive-definite $d\times d$ matrix $\Sigma$.
$$p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^{d}\det(\Sigma)}}\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}$$
$$\mu=\mathbb{E}[x],\qquad \Sigma=\mathrm{cov}[x]=\mathbb{E}\left[(x-\mu)(x-\mu)^{T}\right]$$
That is,
$$\Sigma_{ij}=\mathbb{E}[(x_{i}-\mu_{i})(x_{j}-\mu_{j})]=\mathbb{E}[(x_{j}-\mu_{j})(x_{i}-\mu_{i})]=\Sigma_{ji}$$
The sign of the covariance helps to determine the relationship between two components:

  • If $x_{j}$ is large when $x_{i}$ is large, then $(x_{j}-\mu_{j})(x_{i}-\mu_{i})$ will tend to be positive;
  • If $x_{j}$ is small when $x_{i}$ is large, then $(x_{j}-\mu_{j})(x_{i}-\mu_{i})$ will tend to be negative.

Since the covariance is not scale-independent (i.e. it depends on the measurement units), we define the correlation coefficient:
$$\rho(x_{j},x_{k})=\rho_{jk}=\frac{S_{jk}}{\sqrt{S_{jj}S_{kk}}},\qquad S_{jk}=\sum_{n=1}^{N}(x_{j}^{(n)}-\mu_{j})(x_{k}^{(n)}-\mu_{k})$$
which satisfies $-1\leq\rho\leq 1$, and

  • $\rho(x,y)=+1$ if $y=ax+b$ with $a>0$;
  • $\rho(x,y)=-1$ if $y=ax+b$ with $a<0$.

The mean vector $\mu$ and the covariance matrix $\Sigma$ can easily be estimated by maximizing the likelihood of the training data; the estimates are given by:
$$\hat{\mu}=\frac{1}{N}\sum_{n=1}^{N}x^{(n)},\qquad \hat{\Sigma}=\frac{1}{N}\sum_{n=1}^{N}(x^{(n)}-\hat{\mu})(x^{(n)}-\hat{\mu})^{T}$$
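A minimal sketch of these maximum-likelihood estimates on synthetic data (the true mean and covariance used to generate the data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=1000)   # N x d sample matrix

mu_hat = X.mean(axis=0)                       # (1/N) * sum_n x^(n)
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)    # MLE divides by N, not N - 1

print(mu_hat)
print(sigma_hat)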

Common Distances

Metrics

If $x_{1},x_{2}\in\mathbb{R}^{n}$, then:
Minkowski Distance
$$d_{12}=\sqrt[p]{\sum_{k=1}^{n}|x_{1k}-x_{2k}|^{p}},\quad p>0$$
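For example, scipy's distance module provides this metric directly (the vectors below are illustrative):

import numpy as np
from scipy.spatial import distance

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(distance.minkowski(x1, x2, p=3))   # p = 3
print(distance.minkowski(x1, x2, p=1))   # p = 1 gives the Manhattan distance
print(distance.minkowski(x1, x2, p=2))   # p = 2 gives the Euclidean distance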

Euclidean Distance
$L_{2}$ norm
$$d_{12}=\sqrt{\sum_{k=1}^{n}(x_{1k}-x_{2k})^{2}}=\sqrt{(x_{1}-x_{2})^{T}(x_{1}-x_{2})}$$
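For instance, with numpy (the vectors are illustrative):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(np.sqrt(np.sum((x1 - x2) ** 2)))  # explicit formula
print(np.linalg.norm(x1 - x2))          # equivalent: L2 norm of the difference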

Standardized (Weighted) Euclidean Distance
$$d_{12}=\sqrt{\sum_{k=1}^{n}\left(\frac{x_{1k}-x_{2k}}{S_{k}}\right)^{2}}$$
where $S_{k}$ is the standard deviation of the $k$-th component over the data.

import numpy as np

# Standardized Euclidean distance between the two samples (rows of vectormat).
vectormat = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
sk = np.std(vectormat, axis=0)                 # per-component standard deviation S_k
normv12 = (vectormat[0] - vectormat[1]) / sk   # standardized component-wise difference
print(np.sqrt(np.sum(normv12 ** 2)))

Manhattan Distance
$L_{1}$ norm
$$d_{12}=\sum_{k=1}^{n}|x_{1k}-x_{2k}|$$
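For example:

import numpy as np

x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 7])
print(np.sum(np.abs(x1 - x2)))   # Manhattan (L1) distance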

Chebyshev Distance
$L_{\infty}$ norm
$$d_{12}=\max_{i}|x_{1i}-x_{2i}|$$

import numpy as np

# Chebyshev distance: the maximum absolute component-wise difference.
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 7])
print(np.abs(vector1 - vector2).max())

Cosine Similarity
$$\cos\theta=\frac{\sum_{k=1}^{n}x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n}x_{1k}^{2}}\sqrt{\sum_{k=1}^{n}x_{2k}^{2}}}$$
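For example:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))   # cos(theta)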

Hamming Distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ; in other words, it is the minimum number of substitutions required to change one string into the other (from Wikipedia).

import numpy as np

# Hamming distance: number of positions where the two binary vectors differ.
matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
print(np.count_nonzero(matV[0] != matV[1]))

Jaccard Similarity Coefficient
Given two sets $A$ and $B$, the Jaccard similarity coefficient is defined as
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
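With Python sets (the two sets are illustrative):

A = {1, 2, 3, 4}
B = {3, 4, 5}
print(len(A & B) / len(A | B))   # Jaccard similarity = 2 / 5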

Jaccard Distance
$$J_{\delta}(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}$$

import numpy as np
import scipy.spatial.distance as dist

# Jaccard distance between the two binary vectors (rows of matV).
matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
print(dist.pdist(matV, 'jaccard'))

Mahalanobis Distance
Given $m$ sample vectors $X_{1},\dots,X_{m}$ with mean $\mu$ and covariance matrix $S$, the Mahalanobis distance between a sample vector $X$ and $\mu$ is defined as
$$D(X)=\sqrt{(X-\mu)^{T}S^{-1}(X-\mu)}$$
and that between two sample vectors $X_{i}$ and $X_{j}$ is
$$D(X_{i},X_{j})=\sqrt{(X_{i}-X_{j})^{T}S^{-1}(X_{i}-X_{j})}$$
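A minimal sketch on synthetic data (the sample matrix is illustrative); the covariance matrix is estimated from the samples and inverted once:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # m = 100 samples in R^3
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)               # sample covariance matrix
S_inv = np.linalg.inv(S)

x = X[0]
print(np.sqrt((x - mu) @ S_inv @ (x - mu)))   # Mahalanobis distance of x from mu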

Pearson's Correlation Coefficient
Assume that a sample has two attributes, $u$ and $v$. Pearson's correlation coefficient serves as an indicator of how strongly the two attributes are correlated.

Suppose that the vectors $u=[u_{1},u_{2},\dots,u_{n}]^{T}$ and $v$ have the same length. Take the averages of $u$ and $v$:
$$\bar{u}=\mathrm{avg}(u),\qquad \bar{v}=\mathrm{avg}(v)$$
Define
$$\Delta u=\begin{pmatrix}u_{1}-\bar{u}\\ u_{2}-\bar{u}\\ \vdots\\ u_{n}-\bar{u}\end{pmatrix},\qquad \Delta v=\begin{pmatrix}v_{1}-\bar{v}\\ v_{2}-\bar{v}\\ \vdots\\ v_{n}-\bar{v}\end{pmatrix}$$
Then Pearson's correlation coefficient is
$$\mathrm{corr}(u,v)=\frac{\Delta u^{T}\Delta v}{\sqrt{(\Delta u^{T}\Delta u)(\Delta v^{T}\Delta v)}}$$
A coefficient of larger absolute value indicates a stronger linear correlation between the two attributes.
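For example, the formula above agrees with numpy's built-in corrcoef (the vectors are illustrative):

import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

du, dv = u - u.mean(), v - v.mean()
print(du @ dv / np.sqrt((du @ du) * (dv @ dv)))  # formula above
print(np.corrcoef(u, v)[0, 1])                   # numpy built-in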

Cross Entropy
Cross entropy is a common loss function in machine learning. Recall that entropy is defined as
$$\mathrm{entropy}=-\sum_{x}p(x)\log p(x)$$
or, in the continuous case,
$$\mathrm{entropy}=-\int p(x)\log p(x)\,\mathrm{d}x$$
These are the standard definitions; entropy can be defined in various forms as long as its characteristic properties are satisfied. Now assume two Bernoulli-distributed random variables with distributions $P$ and $Q$; their cross entropy is
$$H(P,Q)=-P(0)\log Q(0)-(1-P(0))\log(1-Q(0))$$
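A minimal sketch for the Bernoulli case (the probabilities P(0) and Q(0) are illustrative):

import numpy as np

p0, q0 = 0.7, 0.4   # P(0) and Q(0)
print(-p0 * np.log(q0) - (1 - p0) * np.log(1 - q0))   # H(P, Q)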

Kullback-Leibler (KL) Divergence
The KL divergence measures how one probability distribution diverges from another:
$$KL(p\|q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$$
or, in the continuous case,
$$KL(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x$$
The KL divergence is always non-negative but not symmetric.
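A minimal sketch for discrete distributions (p and q are illustrative); scipy.stats.entropy(p, q) computes the same quantity:

import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(np.sum(p * np.log(p / q)))   # KL(p || q) from the formula above
print(entropy(p, q))               # scipy equivalent (natural log by default)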
