Mathematical Foundations of Machine Learning: Probability Theory

Random Variable

A random variable $X$ on a sample space $\Omega$ is a function $X:\Omega\rightarrow \mathbb{R}$
that assigns to each sample point $\omega \in \Omega$ a real number $X(\omega)$.

Let $a$ be any number in the range of a random variable $X$. Then the set of sample points $\omega$ with $X(\omega) = a$ is an event in the sample space. We usually abbreviate this event to “X = a”.

The distribution of a discrete random variable $X$ is the collection of values
$\{(a,P[X = a])\}$, where $a$ ranges over the set of all possible values taken by $X$.

The probability mass function (p.m.f.) is $P[X=a] = p(a)$.

Expectations and Variance

$$
\begin{aligned}
E(X) &= \sum_{x\in X(\Omega)} x\,p(x) \\
Var(X) &= E[(X - E(X))^2] = E(X^2) - E^2(X)
\end{aligned}
$$
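As a small illustration of the two formulas above, the following Python sketch computes $E(X)$ and $Var(X)$ for a fair six-sided die. The die and the helper names `expectation` and `variance` are assumptions chosen for this example; they do not come from the original text.

```python
from fractions import Fraction

# Hypothetical example: p.m.f. of a fair six-sided die, p(x) = 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x)."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) = E(X^2) - E(X)^2."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: x * x) - mean ** 2


mean = expectation(pmf)
var = variance(pmf)
# The definition E[(X - E(X))^2] should give the same value.
var_by_definition = expectation(pmf, lambda x: (x - mean) ** 2)
```

Exact fractions are used so that the two expressions for the variance can be compared without rounding error.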

Distribution

X ∼ Bernoulli(p)

X is one if a coin with heads probability $p$ comes up heads, and zero otherwise.

$$
\begin{aligned}
P[X=i] &= \begin{cases} p, & i=1 \\ 1-p, & i=0 \end{cases} \\
E(X) &= p \\
Var(X) &= p(1-p)
\end{aligned}
$$

X ∼ Binomial(n, p)

X is the number of successes in $n$ independent Bernoulli(p) trials.

$$
\begin{aligned}
P[X=i] &= \binom{n}{i}p^i(1-p)^{n-i}, \qquad i = 0,1,\dots,n \\
E(X) &= np \\
Var(X) &= np(1-p)
\end{aligned}
$$
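The binomial formulas can be checked numerically. The sketch below evaluates the p.m.f. for every value in the support and compares the resulting mean and variance with $np$ and $np(1-p)$. The parameters `n = 10` and `p = 0.3` and the function name `binomial_pmf` are illustrative assumptions.

```python
from math import comb


def binomial_pmf(i, n, p):
    """P[X = i] for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


n, p = 10, 0.3  # hypothetical parameters chosen for illustration
probs = [binomial_pmf(i, n, p) for i in range(n + 1)]

total = sum(probs)  # should be close to 1
mean = sum(i * q for i, q in enumerate(probs))  # should be close to n * p
var = sum((i - mean) ** 2 * q for i, q in enumerate(probs))  # close to n * p * (1 - p)
```

Setting `n = 1` recovers the Bernoulli(p) case.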

X ∼ Geometric(p)

For $0 < p \leq 1$, X is the number of independent Bernoulli(p) trials up to and including the first success.

$$
\begin{aligned}
P[X = i] &= (1-p)^{i-1}p, \qquad i = 1,2,3,\dots \\
E(X) &= \frac{1}{p} \\
Var(X) &= \frac{1-p}{p^2}
\end{aligned}
$$
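One way to build intuition for the geometric distribution is to simulate it as repeated coin flips that stop at the first success. The sketch below does this with Python's standard `random` module; the seed, the sample size, `p = 0.25`, and the helper name `sample_geometric` are all assumptions made for the example.

```python
import random


def sample_geometric(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # a draw below p counts as a success
        trials += 1
    return trials


rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.25
samples = [sample_geometric(p, rng) for _ in range(200_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

With this many samples, `sample_mean` and `sample_var` should land near $1/p = 4$ and $(1-p)/p^2 = 12$, up to sampling noise.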

X ∼ Poisson($\lambda$)

For $\lambda > 0$,

$$
\begin{aligned}
P[X = i] &= \frac{\lambda^i}{i!}e^{-\lambda}, \qquad i = 0,1,2,\dots \\
E(X) &= Var(X) = \lambda
\end{aligned}
$$

Continuous Variable

A probability density function (PDF) for a real-valued random variable $X$ is a function $f_X:\mathbb{R}\rightarrow \mathbb{R}$ satisfying

  1. $f_X(x) \geq 0$ for all $x \in \mathbb{R}$.
  2. $\int_{-\infty}^{\infty}f_X(x)dx = 1$.

Then the distribution of $X$ is given by $P[a\leq X\leq b] = \int_a^b f_X(x)dx$.

The cumulative distribution function (CDF) is $F_X(x) = P(X\leq x) = \int_{-\infty}^x f_X(z)dz$.

The expectation is $E(X) = \int_{-\infty}^{\infty}x f_X(x)dx$.

Distribution

X ∼ Uniform(a, b)

$$
f(x) = \begin{cases} \frac{1}{b-a}, & a\leq x\leq b \\ 0, & \text{otherwise} \end{cases}
$$

X ∼ Exponential(λ)

For $\lambda > 0$,

$$
f(x) = \begin{cases} \lambda e^{-\lambda x}, & x\geq 0 \\ 0, & \text{otherwise} \end{cases}
$$
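The PDF conditions and the expectation integral can be checked numerically for the exponential density. The sketch below uses a simple midpoint rule; the rate `lam = 2.0`, the truncation point `upper = 40.0`, and the helper `integrate` are assumptions for this example rather than anything prescribed by the text.

```python
from math import exp


def exponential_pdf(x, lam):
    """Density of Exponential(lam): lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def integrate(f, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


lam = 2.0
upper = 40.0  # the tail beyond this point is of order exp(-80) and is ignored

total = integrate(lambda x: exponential_pdf(x, lam), 0.0, upper)
mean = integrate(lambda x: x * exponential_pdf(x, lam), 0.0, upper)
# CDF at 1: integral of the density from 0 to 1, compared with 1 - exp(-lam)
cdf_at_1 = integrate(lambda x: exponential_pdf(x, lam), 0.0, 1.0)
```

Here `total` approximates 1, `mean` approximates $1/\lambda$, and `cdf_at_1` approximates $1 - e^{-\lambda}$, the closed form obtained by integrating the density.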

X ∼ Normal($\mu,\sigma^2$)

Also known as the Gaussian distribution.

$$
f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
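As a sanity check, the density formula can be typed in directly and compared against `statistics.NormalDist` from the Python standard library. Note that `NormalDist` takes the standard deviation $\sigma$ as its second argument, whereas the notation above is parameterized by the variance $\sigma^2$. The values of `mu`, `sigma`, and the test points are arbitrary assumptions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2), written exactly as in the formula."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)


mu, sigma = 1.0, 2.0  # hypothetical parameters; sigma is the standard deviation
reference = NormalDist(mu, sigma)

points = [-3.0, 0.0, 1.0, 2.5, 6.0]
max_gap = max(abs(normal_pdf(x, mu, sigma) - reference.pdf(x)) for x in points)
```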


Multiple Variables

Joint distribution

Discrete Joint distribution

For random variables $X_1,...,X_n$, the joint
distribution is written $p(X_1,...,X_n)$.

Let $\{X_i\}_{i\in I}$ be a collection of random variables indexed by $I$, which may be infinite. Then the $\{X_i\}$ are independent if for every finite subset of indices $i_1,..., i_k \in I$ we have $p(X_{i_1},..., X_{i_k}) = \prod_{j=1}^k p(X_{i_j})$.

The marginal distribution is $p(X) = \sum_y p(X,y)$.
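For a finite joint distribution, marginalization and the independence condition can both be checked by direct summation. In the sketch below the joint table `joint` and the helper `marginal` are made up for illustration; the table happens to factorize, so the independence test succeeds.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint p.m.f. p(x, y) over two binary variables.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}


def marginal(joint, axis):
    """Sum the joint p.m.f. over every coordinate except `axis`."""
    out = defaultdict(Fraction)
    for outcome, prob in joint.items():
        out[outcome[axis]] += prob
    return dict(out)


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Independence: the joint equals the product of the marginals everywhere.
independent = all(joint[(x, y)] == p_x[x] * p_y[y] for (x, y) in joint)
```

Changing any single entry of `joint` (and renormalizing) would generally break the factorization and make `independent` false.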

Continuous Joint distribution

For continuous variables, the joint CDF is $F_{XY}(x,y) = P(X\leq x, Y\leq y)$.

The marginal CDF is $F_X(x) = \lim_{y\rightarrow \infty}F_{XY}(x,y)$.

If $F_{XY}(x, y)$ is everywhere differentiable in both $x$ and $y$, we can define the joint PDF $f_{XY}(x,y) = \frac{\partial^2 F_{XY}(x,y)}{\partial x\partial y}$.

The marginal PDF is $f_X(x) = \int_{-\infty}^{\infty}f_{XY}(x,y)dy$.

Conditional Distributions

The conditional PDF is $f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}$.

Covariance

$$
Cov(X,Y) = E[(X-E(X))(Y-E(Y))] = E(XY) - E(X)E(Y)
$$

The correlation is $\rho(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} \in [-1,1]$.

To see the bound, set $Z = (X-E(X))t + (Y-E(Y))$ for a real number $t$. Then

$$
\begin{aligned}
E(Z^2) &= Var(X)t^2 + 2Cov(X,Y)t + Var(Y) \geq 0 \quad \text{for all } t \\
\Rightarrow \quad & 4Cov^2(X,Y) - 4Var(X)Var(Y) \leq 0
\end{aligned}
$$

because a quadratic in $t$ that is never negative has a non-positive discriminant.

We can define an inner product $\langle X,Y\rangle := E(XY)$ and view the correlation as the cosine of the angle between the centered variables: $\cos\theta = \frac{\langle X-E(X),Y-E(Y)\rangle}{\sqrt{\langle X-E(X),X-E(X)\rangle\langle Y-E(Y),Y-E(Y)\rangle}} = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$

If $\rho^2 = 1$, then for any real $a$,

$$
\begin{aligned}
Var(Y-aX) &= Var(Y) + a^2Var(X) - 2aCov(X,Y) \\
&= Var(Y) + a^2Var(X) - 2a\rho \sqrt{Var(X)Var(Y)} \\
&= (a\sqrt{Var(X)} - \rho \sqrt{Var(Y)})^2
\end{aligned}
$$

where the last step uses $\rho^2 = 1$. If we choose $a = \frac{\rho \sqrt{Var(Y)}}{\sqrt{Var(X)}}$, then $Var(Y-aX) = 0$, so $Y - aX$ is constant with probability 1, that is, $Y = aX + b$ for some constant $b$.

If ρ = 0 \rho = 0 ρ=0, we say that X and Y are uncorrelated, but that doesn’t mean they are independent.

For example, if X ∼ Uniform(−1, 1) and $Y = X^2$, then one can show that X and Y are uncorrelated, even though they are not
independent.
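A short verification of this example, using the density $f_X(x) = \frac12$ on $[-1, 1]$: odd moments of X vanish by symmetry, so

$$
\begin{aligned}
E(X) &= \int_{-1}^{1} \frac{x}{2}dx = 0 \\
E(XY) &= E(X^3) = \int_{-1}^{1} \frac{x^3}{2}dx = 0 \\
Cov(X,Y) &= E(XY) - E(X)E(Y) = 0
\end{aligned}
$$

The variables are nevertheless dependent. The events below are chosen only for illustration:

$$
P[X > \tfrac12] = \tfrac14, \qquad P[Y > \tfrac14] = P[|X| > \tfrac12] = \tfrac12, \qquad P[X > \tfrac12,\, Y > \tfrac14] = P[X > \tfrac12] = \tfrac14 \neq \tfrac14 \cdot \tfrac12
$$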

For a random vector $X = (X_1,...,X_n)^T$, we define the covariance matrix $\Sigma$. It is symmetric and positive semi-definite (for any $x$, $x^T\Sigma x \geq 0$).

$$
\Sigma = E[(X-E(X))(X-E(X))^T] =
\begin{bmatrix}
Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n)\\
Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n)\\
\vdots & \vdots & \ddots & \vdots\\
Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n)
\end{bmatrix}
$$
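To make the matrix concrete, the sketch below builds the covariance matrix of a tiny two-dimensional data set in plain Python and evaluates the quadratic form $x^T\Sigma x$ for a few vectors. The data points are invented, and the data set is treated as the entire population (division by `n`), which is an assumption of this example.

```python
# Hypothetical data: four equally likely outcomes of a random vector (X1, X2).
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.0)]
n = len(data)
d = len(data[0])

means = [sum(row[j] for row in data) / n for j in range(d)]

# sigma[j][k] = average of (X_j - E(X_j)) * (X_k - E(X_k)) over the outcomes.
sigma = [
    [sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / n
     for k in range(d)]
    for j in range(d)
]


def quadratic_form(matrix, v):
    """Compute v^T M v for a square nested-list matrix M."""
    size = len(v)
    return sum(v[j] * matrix[j][k] * v[k] for j in range(size) for k in range(size))


test_vectors = [(1, 0), (1, -1), (-2, 1), (0.3, -0.7)]
quadratic_values = [quadratic_form(sigma, v) for v in test_vectors]
```

The diagonal of `sigma` holds the variances, the off-diagonal entries are equal by symmetry, and every value in `quadratic_values` is non-negative, as positive semi-definiteness requires.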

Independence

Two random variables X and Y are independent if $F_{XY}(x, y) = F_X(x)F_Y(y)$ for all values of x and y. Equivalently,

  • For discrete random variables, $p(x,y) = p(x)p(y)$
  • For continuous variables, $f_{XY}(x, y) = f_X(x)f_Y(y)$

The Gaussian distribution

$$
p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}}\exp\left(-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right)
$$
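The following sketch evaluates this density for $d = 2$ in plain Python and checks one consequence of the formula: when $\Sigma$ is diagonal, the joint density equals the product of two univariate normal densities, which is the factorization condition for independence given above. The parameter values and function names are assumptions for illustration, and the 2x2 inverse is written out by hand to keep the example dependency-free.

```python
from math import exp, pi, sqrt


def gaussian_pdf_2d(x, mu, cov):
    """Density of a 2-dimensional Gaussian; cov is a 2x2 nested list."""
    (a, b), (c, d2) = cov
    det = a * d2 - b * c
    # Inverse of a 2x2 matrix via the adjugate formula.
    inv = [[d2 / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(diff[j] * inv[j][k] * diff[k] for j in range(2) for k in range(2))
    return exp(-0.5 * quad) / sqrt((2 * pi) ** 2 * det)


def normal_pdf(x, mu, var):
    """Univariate Normal(mu, var) density, parameterized by the variance."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


mu = [1.0, -2.0]                 # hypothetical mean vector
cov = [[4.0, 0.0], [0.0, 0.25]]  # hypothetical diagonal covariance matrix
point = [0.5, -1.5]

joint_density = gaussian_pdf_2d(point, mu, cov)
product_density = normal_pdf(point[0], mu[0], 4.0) * normal_pdf(point[1], mu[1], 0.25)
```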

Estimation of Parameters

Maximum Likelihood estimation

We make some assumptions about our problem by prescribing a parametric model, then we fit the parameters of the model to the data. How do we choose the values of the parameters?

A common way to fit parameters is maximum likelihood estimation(MLE).

Suppose we have random variables $X_1, ..., X_n$ and corresponding observations $x_1, ..., x_n$. Then the likelihood function is

$$
L(\theta) = p(x_1, ..., x_n;\theta)
$$

We assume $X_1, ..., X_n$ are independent. Then

$$
\begin{aligned}
L(\theta) &= \prod_{i=1}^n p(x_i;\theta)\\
\log L(\theta) &= \sum_{i=1}^n\log p(x_i;\theta) \\
\theta_{mle} &= \arg\max_\theta L(\theta)
\end{aligned}
$$
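As a worked instance, take independent Bernoulli(p) observations with $k$ ones among $n$ values. The log-likelihood is $\log L(p) = k\log p + (n-k)\log(1-p)$, and setting its derivative to zero gives $p_{mle} = k/n$. The sketch below confirms this with a brute-force grid search; the observations and the grid resolution are assumptions made for the example, not a recommended fitting procedure.

```python
from math import log

# Hypothetical observations of n = 10 coin flips (1 = heads, 0 = tails).
observations = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def log_likelihood(p, xs):
    """log L(p) = sum_i log p(x_i; p) for independent Bernoulli(p) data."""
    return sum(log(p) if x == 1 else log(1 - p) for x in xs)


# Search a grid over the open interval (0, 1); the endpoints are excluded
# because log(0) is undefined.
grid = [k / 1000 for k in range(1, 1000)]
p_mle_grid = max(grid, key=lambda p: log_likelihood(p, observations))

# Closed-form maximizer: the fraction of ones in the sample.
p_closed_form = sum(observations) / len(observations)
```

Maximizing the log-likelihood rather than the likelihood itself gives the same maximizer, since the logarithm is increasing, and avoids multiplying many small probabilities.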

Information Theory

Entropy

Information: How uncertain we are of the outcome of random experiments

Self-information: $i(x) = -\log_2 p(x)$

Entropy:

$$
\begin{aligned}
H(X) &= E[i(X)] = -\sum_{x\in X(\Omega)}p(x)\log_2p(x) \\
H(X,Y) &= -\sum_{y\in Y(\Omega)}\sum_{x\in X(\Omega)}p(x,y)\log_2p(x,y) \\
H(X|Y) &= -\sum_{y\in Y(\Omega)}\sum_{x\in X(\Omega)}p(x,y)\log_2p(x|y) \\
H(X,Y) &= H(X) + H(Y|X) = H(Y) + H(X|Y)
\end{aligned}
$$
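The chain rule $H(X,Y) = H(X) + H(Y|X)$ can be checked on a small example. The joint table below is invented for illustration, and the helper name `entropy` is an assumption; probabilities are powers of two so the entropies come out as simple numbers.

```python
from math import log2

# Hypothetical joint p.m.f. p(x, y) over two binary variables.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}


def entropy(dist):
    """H = -sum p log2 p over a dict of probabilities (zero terms skipped)."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Marginal p(x) obtained by summing the joint over y.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x), with p(y|x) = p(x, y) / p(x).
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

h_x = entropy(p_x)
h_xy = entropy(joint)
```

Here `h_xy` is 1.75 bits, and it matches `h_x + h_y_given_x` up to floating-point rounding.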

Relative Entropy

Jensen’s inequality: Let X be a random variable and f(x) be a convex function, then

$$
E(f(X))\geq f(E(X))
$$

Conversely, if f(x) is a concave function,

$$
E(f(X))\leq f(E(X))
$$

Relative Entropy (Kullback-Leibler divergence):

$$
D(p\|q) = \sum_{x\in X(\Omega)}p(x)\log\frac{p(x)}{q(x)} = - E_p\left[\log\frac{q(x)}{p(x)}\right] \geq -\log E_p\left[\frac{q(x)}{p(x)}\right] \geq 0
$$

The first inequality is Jensen’s inequality applied to the concave function $\log$. The second holds because $E_p\left[\frac{q(x)}{p(x)}\right]$ is the sum of $q(x)$ over the points where $p(x) > 0$, which is at most 1.
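A minimal sketch of the divergence for finite distributions, assuming $q(x) > 0$ wherever $p(x) > 0$. The two distributions are made up; they are chosen so that the example also shows that the divergence is not symmetric in its arguments.

```python
from math import log


def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)).

    Assumes q[x] > 0 for every x with p[x] > 0; terms with p[x] == 0 are skipped.
    """
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)


# Hypothetical distributions over the same three outcomes.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.8, "b": 0.1, "c": 0.1}

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)  # a distribution has zero divergence from itself
```

Both `d_pq` and `d_qp` are positive, but they differ, so the divergence is not a distance in the metric sense.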

Connection to MLE

Take the loss function $\text{loss}(x) = -\log q(x)$. Its expected value under $p$ is

$$
\text{Risk}(q) = -E_p[\log q(x)] = D(p\|q) + \text{Risk}(p)
$$

Thus $\arg\min_q D(p\|q) = \arg\min_q \text{Risk}(q)$, because $\text{Risk}(p)$ does not depend on $q$.

We call $\text{Risk}(q)$ the cross entropy $CE(p,q)$.
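The decomposition of the cross entropy into the divergence plus the entropy of $p$ can be verified numerically. In the sketch below, `p` plays the role of the data distribution and `candidates` is a hypothetical list of models $q$; all values and helper names are assumptions made for the example.

```python
from math import log


def cross_entropy(p, q):
    """Risk(q) = -E_p[log q(x)], assuming q[x] > 0 wherever p[x] > 0."""
    return -sum(px * log(q[x]) for x, px in p.items() if px > 0)


def kl_divergence(p, q):
    """D(p || q) = E_p[log(p(x) / q(x))] under the same assumption."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical data distribution
candidates = [                         # hypothetical candidate models q
    {"a": 0.8, "b": 0.1, "c": 0.1},
    {"a": 0.4, "b": 0.3, "c": 0.3},
    {"a": 0.5, "b": 0.25, "c": 0.25},
]

risk_p = cross_entropy(p, p)  # Risk(p), the entropy of p (natural log)

# Risk(q) - (D(p || q) + Risk(p)) should be zero for every candidate.
gaps = [cross_entropy(p, q) - (kl_divergence(p, q) + risk_p) for q in candidates]

# Both criteria pick the same candidate, since they differ by a constant.
best_by_risk = min(candidates, key=lambda q: cross_entropy(p, q))
best_by_kl = min(candidates, key=lambda q: kl_divergence(p, q))
```

Among these candidates the minimizer is the one equal to `p`, where the divergence is zero and the cross entropy reduces to the entropy of `p`.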
