A Brief Overview of Common Models and Distances in Machine Learning

This post briefly introduces several models and distance measures commonly used in machine learning.

Common Models

Linear Model
$$f_{\theta}(x)=\sum_{j=1}^{b}\theta_{j}\phi_{j}(x)=\theta^{T}\phi(x)$$
where $\phi(x)$ is the basis-function vector and $\theta$ is the parameter vector, so there are $b$ basis functions.
The basis functions can take different forms, such as the polynomial basis:
$$\phi(x)=(1,x,x^{2},\cdots,x^{b-1})^{T}$$
or the trigonometric basis:
$$\phi(x)=(1,\sin x,\cos x,\sin 2x,\cos 2x,\cdots,\sin mx,\cos mx)^{T}$$

$x$ may be a vector rather than a scalar. Note that the basis functions of a linear model are fixed in advance and do not depend on the training set.
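As a minimal sketch (the basis size $b=4$, the toy target function, and the noise level are illustrative assumptions), a polynomial-basis linear model can be fit by ordinary least squares:

import numpy as np

# f_theta(x) = theta^T phi(x) with the polynomial basis phi(x) = (1, x, x^2, x^3)^T
def phi(x, b=4):
    return np.array([x ** j for j in range(b)])

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=20)

Phi = np.stack([phi(x) for x in x_train])              # design matrix, one row phi(x_i) per sample
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)  # least-squares estimate of theta
print(theta @ phi(0.5))                                # prediction f_theta(0.5)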

Multiplicative Model
Multi-dimensional basis functions are obtained by taking products of one-dimensional basis functions:
$$f_{\theta}(x)=\sum_{j_{1}=1}^{b'}\cdots\sum_{j_{d}=1}^{b'}\theta_{j_{1},\cdots,j_{d}}\phi_{j_{1}}(x^{(1)})\cdots\phi_{j_{d}}(x^{(d)})$$
The number of parameters is $(b')^{d}$, which grows exponentially with the input dimension $d$: an obvious curse of dimensionality.

Additive Model
$$f_{\theta}(x)=\sum_{k=1}^{d}\sum_{j=1}^{b'}\theta_{k,j}\phi_{j}(x^{(k)})$$
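The additive model needs only $d\,b'$ parameters. The contrast with the multiplicative model is easiest to see by counting parameters; the sketch below uses illustrative values $d=10$ and $b'=5$:

# Parameter counts of the multiplicative vs. the additive model (illustrative d, b').
d, b_prime = 10, 5
multiplicative_params = b_prime ** d   # one theta_{j1,...,jd} per index combination
additive_params = d * b_prime          # one theta_{k,j} per (dimension, basis) pair
print(multiplicative_params, additive_params)   # 9765625 vs. 50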

Kernel Model
Kernel functions are binary (two-argument) functions that depend on the training set. A kernel model is defined as a linear combination of kernel functions:
$$f_{\theta}(x)=\sum_{j=1}^{n}\theta_{j}K(x,x_{j})$$
There are many types of kernel functions, but the Gaussian kernel is by far the most popular:
$$K(x,c)=\exp\left(-\frac{\|x-c\|^{2}}{2h^{2}}\right)$$
A kernel model places a kernel on each training sample $x_{i}$ and learns its height $\theta_{i}$. It can therefore approximate the target function only in the neighborhood of the training samples, regardless of the dimension of $x_{i}$.
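A minimal sketch of a Gaussian kernel model, assuming an illustrative bandwidth $h$ and a toy one-dimensional training set, with $\theta$ fitted by least squares:

import numpy as np

def gauss_kernel(x, c, h=0.3):
    return np.exp(-np.sum((x - c) ** 2) / (2 * h ** 2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, (20, 1))
y_train = np.sin(np.pi * x_train[:, 0])

K = np.array([[gauss_kernel(xi, xj) for xj in x_train] for xi in x_train])  # K[i, j] = K(x_i, x_j)
theta, *_ = np.linalg.lstsq(K, y_train, rcond=None)

x_new = np.array([0.5])
k_new = np.array([gauss_kernel(x_new, xj) for xj in x_train])
print(k_new @ theta)   # f_theta(x_new)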

Hierarchical Model
Hierarchical models are nonlinear models:
$$f_{\theta}(x)=\sum_{j=1}^{b}\alpha_{j}\phi(x,\beta_{j})$$
Here $\theta=(\alpha^{T},\beta_{1}^{T},\cdots,\beta_{b}^{T})^{T}$, and the $\phi(x,\beta_{j})$ are basis functions. Two typical choices are:
Sigmoid function (used in artificial neural networks)
$$\phi(x,\beta)=\frac{1}{1+\exp(-x^{T}\omega-\gamma)},\quad \beta=(\omega^{T},\gamma)^{T}$$
Gaussian function
$$\phi(x,\beta)=\exp\left(-\frac{\|x-c\|^{2}}{2h^{2}}\right),\quad \beta=(c^{T},h)^{T}$$
Note that $\theta$ and $f_{\theta}$ are not in one-to-one correspondence: different parameter vectors can produce the same function.
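As a minimal sketch (the input dimension, number of basis functions, weights, and input are illustrative), a hierarchical model with sigmoid basis functions is simply a one-hidden-layer network evaluated as follows:

import numpy as np

# f_theta(x) = sum_j alpha_j * sigmoid(x^T omega_j + gamma_j)
def sigmoid_basis(x, omega, gamma):
    return 1.0 / (1.0 + np.exp(-(x @ omega) - gamma))

rng = np.random.default_rng(0)
d, b = 3, 4                       # input dimension and number of basis functions
alpha = rng.normal(size=b)        # heights alpha_j
omega = rng.normal(size=(d, b))   # one omega_j per basis function (columns)
gamma = rng.normal(size=b)

x = np.array([0.2, -0.5, 1.0])
print(alpha @ sigmoid_basis(x, omega, gamma))   # f_theta(x)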

Gaussian Distribution
In the univariate case, $x\in(-\infty,\infty)$, and the parameters of the Gaussian distribution are the mean $\mu\in(-\infty,\infty)$ and the variance $\sigma^{2}>0$. The probability density function is defined as follows:
$$p(x\mid\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\}$$
$$\mu=\mathbb{E}[x],\qquad \sigma^{2}=\mathrm{var}[x]$$

In the multivariate case, consider a $d$-dimensional vector $x$; the mean $\mu$ is then a $d$-dimensional vector as well, while the covariance becomes a positive-definite $d\times d$ matrix $\Sigma$.
$$p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^{d}\det(\Sigma)}}\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}$$
$$\mu=\mathbb{E}[x],\qquad \Sigma=\mathrm{cov}[x]=\mathbb{E}\left[(x-\mu)(x-\mu)^{T}\right]$$
That is,
$$\Sigma_{ij}=\mathbb{E}[(x_{i}-\mu_{i})(x_{j}-\mu_{j})]=\mathbb{E}[(x_{j}-\mu_{j})(x_{i}-\mu_{i})]=\Sigma_{ji}$$
The sign of the covariance helps to determine the relationship between two components:

  • If $x_{j}$ is large when $x_{i}$ is large, then $(x_{j}-\mu_{j})(x_{i}-\mu_{i})$ will tend to be positive;
  • If $x_{j}$ is small when $x_{i}$ is large, then $(x_{j}-\mu_{j})(x_{i}-\mu_{i})$ will tend to be negative.

Since the covariance is not scale-independent (i.e. it depends on the measurement units), we define the correlation coefficient:
$$\rho(x_{j},x_{k})=\rho_{jk}=\frac{S_{jk}}{\sqrt{S_{jj}S_{kk}}},\qquad S_{jk}=\sum_{n=1}^{N}(x_{j}^{(n)}-\mu_{j})(x_{k}^{(n)}-\mu_{k})$$
which satisfies $-1\leq\rho\leq 1$, and

  • $\rho(x,y)=+1$ if $y=ax+b$ with $a>0$;
  • $\rho(x,y)=-1$ if $y=ax+b$ with $a<0$.

The mean vector $\mu$ and the covariance matrix $\Sigma$ can easily be estimated by maximizing the likelihood of the training data; the estimates are given by:
$$\hat{\mu}=\frac{1}{N}\sum_{n=1}^{N}x^{(n)},\qquad \hat{\Sigma}=\frac{1}{N}\sum_{n=1}^{N}(x^{(n)}-\hat{\mu})(x^{(n)}-\hat{\mu})^{T}$$
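A minimal sketch of these maximum-likelihood estimates on synthetic data (the true mean and covariance used to generate the data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=1000)   # N x d sample matrix

mu_hat = X.mean(axis=0)                       # (1/N) * sum_n x^(n)
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)    # MLE divides by N, not N - 1

print(mu_hat)
print(sigma_hat)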

Common Distances

Metrics

If $x_{1},x_{2}\in\mathbb{R}^{n}$, then:
Minkowski Distance
$$d_{12}=\sqrt[p]{\sum_{k=1}^{n}|x_{1k}-x_{2k}|^{p}},\quad p>0$$
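For example, scipy's distance module provides this metric directly (the vectors below are illustrative):

import numpy as np
from scipy.spatial import distance

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(distance.minkowski(x1, x2, p=3))   # p = 3
print(distance.minkowski(x1, x2, p=1))   # p = 1 gives the Manhattan distance
print(distance.minkowski(x1, x2, p=2))   # p = 2 gives the Euclidean distance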

Euclidean Distance
$L_{2}$ norm
$$d_{12}=\sqrt{\sum_{k=1}^{n}(x_{1k}-x_{2k})^{2}}=\sqrt{(x_{1}-x_{2})^{T}(x_{1}-x_{2})}$$
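For instance, with numpy (the vectors are illustrative):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(np.sqrt(np.sum((x1 - x2) ** 2)))  # explicit formula
print(np.linalg.norm(x1 - x2))          # equivalent: L2 norm of the difference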

Standardized (Weighted) Euclidean Distance
$$d_{12}=\sqrt{\sum_{k=1}^{n}\left(\frac{x_{1k}-x_{2k}}{S_{k}}\right)^{2}}$$
where $S_{k}$ is the standard deviation of the $k$-th component over the data.

import numpy as np

# Standardized Euclidean distance between the two samples (rows of vectormat).
vectormat = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
sk = np.std(vectormat, axis=0)                 # per-component standard deviation S_k
normv12 = (vectormat[0] - vectormat[1]) / sk   # standardized component-wise difference
print(np.sqrt(np.sum(normv12 ** 2)))

Manhattan Distance
$L_{1}$ norm
$$d_{12}=\sum_{k=1}^{n}|x_{1k}-x_{2k}|$$
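For example:

import numpy as np

x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 7])
print(np.sum(np.abs(x1 - x2)))   # Manhattan (L1) distance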

Chebyshev Distance
$L_{\infty}$ norm
$$d_{12}=\max_{i}|x_{1i}-x_{2i}|$$

import numpy as np

# Chebyshev distance: the maximum absolute component-wise difference.
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 7])
print(np.abs(vector1 - vector2).max())

Cosine Similarity
$$\cos\theta=\frac{\sum_{k=1}^{n}x_{1k}x_{2k}}{\sqrt{\sum_{k=1}^{n}x_{1k}^{2}}\sqrt{\sum_{k=1}^{n}x_{2k}^{2}}}$$
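For example:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 7.0])
print(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))   # cos(theta)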

Hamming Distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ; in other words, it is the minimum number of substitutions required to change one string into the other (from Wikipedia).

import numpy as np

# Hamming distance: number of positions where the two binary vectors differ.
matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
print(np.count_nonzero(matV[0] != matV[1]))

Jaccard Similarity Coefficient
Given two sets $A$ and $B$, the Jaccard similarity coefficient is defined as
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
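With Python sets (the two sets are illustrative):

A = {1, 2, 3, 4}
B = {3, 4, 5}
print(len(A & B) / len(A | B))   # Jaccard similarity = 2 / 5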

Jaccard Distance
$$J_{\delta}(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}$$

import numpy as np
import scipy.spatial.distance as dist

# Jaccard distance between the two binary vectors (rows of matV).
matV = np.array([[1, 1, 0, 1, 0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0, 0, 1, 1, 1]])
print(dist.pdist(matV, 'jaccard'))

Mahalanobis Distance
Given $m$ sample vectors $X_{1},\dots,X_{m}$ with mean $\mu$ and covariance matrix $S$, the Mahalanobis distance between a sample vector $X$ and $\mu$ is defined as
$$D(X)=\sqrt{(X-\mu)^{T}S^{-1}(X-\mu)}$$
and that between two sample vectors $X_{i}$ and $X_{j}$ is
$$D(X_{i},X_{j})=\sqrt{(X_{i}-X_{j})^{T}S^{-1}(X_{i}-X_{j})}$$
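A minimal sketch on synthetic data (the sample matrix is illustrative); the covariance matrix is estimated from the samples and inverted once:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # m = 100 samples in R^3
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)               # sample covariance matrix
S_inv = np.linalg.inv(S)

x = X[0]
print(np.sqrt((x - mu) @ S_inv @ (x - mu)))   # Mahalanobis distance of x from mu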

Pearson's Correlation Coefficient
Assume that a sample has two attributes, $u$ and $v$. Pearson's correlation coefficient serves as an indicator of how strongly the two attributes are correlated.

Suppose that the vectors $u=[u_{1},u_{2},\dots,u_{n}]^{T}$ and $v$ have the same length. Take the averages of $u$ and $v$:
$$\bar{u}=\mathrm{avg}(u),\qquad \bar{v}=\mathrm{avg}(v)$$
Define
$$\Delta u=\begin{pmatrix}u_{1}-\bar{u}\\ u_{2}-\bar{u}\\ \vdots\\ u_{n}-\bar{u}\end{pmatrix},\qquad \Delta v=\begin{pmatrix}v_{1}-\bar{v}\\ v_{2}-\bar{v}\\ \vdots\\ v_{n}-\bar{v}\end{pmatrix}$$
Then Pearson's correlation coefficient is
$$\mathrm{corr}(u,v)=\frac{\Delta u^{T}\Delta v}{\sqrt{(\Delta u^{T}\Delta u)(\Delta v^{T}\Delta v)}}$$
A coefficient of larger absolute value indicates a stronger linear correlation between the two attributes.
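For example, the formula above agrees with numpy's built-in corrcoef (the vectors are illustrative):

import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

du, dv = u - u.mean(), v - v.mean()
print(du @ dv / np.sqrt((du @ du) * (dv @ dv)))  # formula above
print(np.corrcoef(u, v)[0, 1])                   # numpy built-in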

Cross Entropy
Cross entropy is a common loss function in machine learning. Recall that entropy is defined as
$$\mathrm{entropy}=-\sum_{x}p(x)\log p(x)$$
or, in the continuous case,
$$\mathrm{entropy}=-\int p(x)\log p(x)\,\mathrm{d}x$$
These are the standard definitions; entropy can be defined in various forms as long as its characteristic properties are satisfied. Now assume two Bernoulli-distributed random variables with distributions $P$ and $Q$; their cross entropy is
$$H(P,Q)=-P(0)\log Q(0)-(1-P(0))\log(1-Q(0))$$
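A minimal sketch for the Bernoulli case (the probabilities P(0) and Q(0) are illustrative):

import numpy as np

p0, q0 = 0.7, 0.4   # P(0) and Q(0)
print(-p0 * np.log(q0) - (1 - p0) * np.log(1 - q0))   # H(P, Q)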

Kullback-Leibler (KL) Divergence
The KL divergence measures how one probability distribution diverges from another:
$$KL(p\|q)=\sum_{x}p(x)\log\frac{p(x)}{q(x)}$$
or, in the continuous case,
$$KL(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x$$
The KL divergence is always non-negative but not symmetric.
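A minimal sketch for discrete distributions (p and q are illustrative); scipy.stats.entropy(p, q) computes the same quantity:

import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(np.sum(p * np.log(p / q)))   # KL(p || q) from the formula above
print(entropy(p, q))               # scipy equivalent (natural log by default)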
