Random Variable
A random variable $X$ on a sample space $\Omega$ is a function $X:\Omega\rightarrow R$ that assigns to each sample point $\omega \in \Omega$ a real number $X(\omega)$.
Let $a$ be any number in the range of a random variable $X$. Then $X(\omega) = a$ defines an event in the sample space, which we usually abbreviate to "$X = a$".
The distribution of a discrete random variable $X$ is the collection of values $\{(a, P[X = a])\}$, where $a$ ranges over the set of all possible values taken by $X$.
The probability mass function (p.m.f.) is $P[X = a] = p(a)$.
Expectation and Variance
$$E(X) = \sum_{x\in X(\Omega)} x\,p(x) \qquad Var(X) = E[(X - E(X))^2] = E(X^2) - E^2(X)$$
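As a sanity check of these formulas, here is a minimal sketch in plain Python (the p.m.f. values are made up for illustration) that computes $E(X)$ and $Var(X)$ from a tabulated p.m.f.

```python
# Hypothetical p.m.f.: value -> probability (must sum to 1)
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

# E(X) = sum_x x * p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E(X^2) - E(X)^2
second_moment = sum(x**2 * p for x, p in pmf.items())
variance = second_moment - mean**2

print(mean, variance)  # 3.0 1.0
```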
Discrete Distributions
X ∼ Bernoulli(p)
$X$ is 1 if a coin with heads probability $p$ comes up heads, and 0 otherwise.
$$P[X=i]= \begin{cases} p, & i=1 \\ 1-p, & i=0 \end{cases} \qquad E(X)=p \qquad Var(X)=p(1-p)$$
X ∼ Binomial(n, p)
The number of heads in $n$ independent Bernoulli($p$) trials.
$$P[X=i] = \binom{n}{i} p^i(1-p)^{n-i} \qquad i = 0,1,\dots,n \qquad E(X) = np \qquad Var(X) = np(1-p)$$
X ∼ Geometric(p)
For $0 < p \leq 1$, the number of Bernoulli($p$) trials up to and including the first success.
$$P[X = i]= (1-p)^{i-1}p \qquad i = 1,2,3,\dots \qquad E(X) = \frac{1}{p} \qquad Var(X) = \frac{1-p}{p^2}$$
X ∼ Poisson($\lambda$), for $\lambda > 0$
$$P[X = i]= \frac{\lambda^i}{i!}e^{-\lambda} \qquad i = 0,1,2,\dots \qquad E(X) = Var(X) = \lambda$$
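The formulas above can be checked against a library; the sketch below assumes SciPy is available and uses `scipy.stats`, which tabulates the same p.m.f.s, means, and variances.

```python
from scipy import stats

n, p, lam = 10, 0.3, 2.0  # example parameters

# Binomial(n, p): E(X) = np, Var(X) = np(1-p)
print(stats.binom.pmf(3, n, p), stats.binom.mean(n, p), stats.binom.var(n, p))

# Geometric(p), counting trials up to the first success: E(X) = 1/p, Var(X) = (1-p)/p^2
print(stats.geom.pmf(2, p), stats.geom.mean(p), stats.geom.var(p))

# Poisson(lambda): E(X) = Var(X) = lambda
print(stats.poisson.pmf(1, lam), stats.poisson.mean(lam), stats.poisson.var(lam))
```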
Continuous Random Variables
A probability density function (PDF) for a real-valued random variable $X$ is a function $f_X:R\rightarrow R$ satisfying
- $f_X(x) \geq 0$ for all $x \in R$.
- $\int_{-\infty}^{\infty}f_X(x)\,dx = 1$
Then the distribution of $X$ is given by $P[a\leq X\leq b] = \int_a^b f_X(x)\,dx$.
The cumulative distribution function (CDF) is $F_X(x) = P(X\leq x) = \int_{-\infty}^x f_X(z)\,dz$.
The expectation is $E(X) = \int_{-\infty}^{\infty}x f_X(x)\,dx$.
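As a sketch (assuming SciPy), the defining properties and the expectation integral can be checked numerically for a hypothetical density $f(x) = 2x$ on $[0, 1]$:

```python
from scipy.integrate import quad

# Hypothetical density: f(x) = 2x on [0, 1], and 0 elsewhere
f = lambda x: 2 * x

total, _ = quad(f, 0, 1)                  # integral of the PDF over its support: 1.0
prob, _ = quad(f, 0.5, 1.0)               # P[0.5 <= X <= 1] = 0.75
mean, _ = quad(lambda x: x * f(x), 0, 1)  # E(X) = 2/3

print(total, prob, mean)
```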
Continuous Distributions
X ∼ Uniform(a, b)
$$f(x) = \begin{cases} \frac{1}{b-a}, & a\leq x\leq b \\ 0, & \text{otherwise} \end{cases}$$
X ∼ Exponential(λ)
For $\lambda > 0$,
$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x\geq 0 \\ 0, & \text{otherwise} \end{cases}$$
X ∼ Normal($\mu, \sigma^2$)
Also known as the Gaussian distribution
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
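A brief sketch (assuming SciPy) evaluating these densities; note that `scipy.stats.expon` parameterizes the exponential by `scale = 1/lambda`.

```python
from scipy import stats

print(stats.uniform.pdf(0.3, loc=0, scale=1))  # Uniform(0, 1): density 1 on [0, 1]
print(stats.expon.pdf(1.0, scale=1 / 2.0))     # Exponential(lambda = 2)
print(stats.norm.pdf(0.0, loc=0, scale=1))     # Normal(0, 1): 1/sqrt(2*pi) ~ 0.3989

# CDFs follow the same pattern, e.g. P[X <= 1] for the standard normal
print(stats.norm.cdf(1.0))                     # ~0.8413
```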
Multiple Variables
Joint distribution
Discrete Joint Distribution
For some random variables $X_1,\dots,X_n$, the joint distribution is written $p(X_1,\dots,X_n)$.
Let $\{X_i\}_{i\in I}$ be a collection of random variables indexed by $I$, which may be infinite. Then $\{X_i\}$ are independent if for every finite subset of indices $i_1,\dots,i_k \in I$ we have $p(X_{i_1},\dots,X_{i_k}) = \prod_{j=1}^k p(X_{i_j})$.
Marginal distribution: $p(X) = \sum_y p(X, y)$
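For a finite discrete joint distribution, marginalization is just summing a table over one axis; a minimal sketch with NumPy (the joint table is made up):

```python
import numpy as np

# Hypothetical joint p.m.f. p(x, y): rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1)   # marginal p(X) = sum_y p(X, y) -> [0.3, 0.7]
p_y = p_xy.sum(axis=0)   # marginal p(Y) = sum_x p(x, Y) -> [0.4, 0.6]

print(p_x, p_y)
```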
Continuous Joint distribution
For continuous variables, the joint CDF is $F_{XY}(x,y) = P(X\leq x, Y\leq y)$
and the marginal CDF is $F_X(x) = \lim_{y\rightarrow \infty}F_{XY}(x,y)$.
In the case that $F_{XY}(x, y)$ is everywhere differentiable in both $x$ and $y$, we can define the joint PDF $f_{XY}(x,y) = \frac{\partial^2 F_{XY}(x,y)}{\partial x\,\partial y}$
Marginal PDF: $f_X(x) = \int_{-\infty}^{\infty}f_{XY}(x,y)\,dy$
Conditional Distributions
The conditional PDF is $f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}$
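The discrete case works the same way: divide the joint table by the marginal. A minimal NumPy sketch (the joint table is made up):

```python
import numpy as np

p_xy = np.array([[0.10, 0.20],    # hypothetical joint p(x, y)
                 [0.30, 0.40]])
p_x = p_xy.sum(axis=1)

# p(y | x) = p(x, y) / p(x); each row of the result sums to 1
p_y_given_x = p_xy / p_x[:, None]
print(p_y_given_x)
```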
Covariance
$$Cov(X,Y) = E[(X-E(X))(Y-E(Y))] = E(XY) - E(X)E(Y)$$
The correlation is $\rho(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} \in [-1,1]$.
To see why $|\rho| \leq 1$, set $Z = (X-E(X))t + (Y-E(Y))$. For every $t$,
$$E(Z^2) = Var(X)t^2 + 2Cov(X,Y)t + Var(Y) \geq 0 \;\Rightarrow\; 4Cov^2(X,Y) - 4Var(X)Var(Y) \leq 0,$$
since a quadratic in $t$ that is nonnegative everywhere has a nonpositive discriminant.
We can define an inner product $\langle X,Y\rangle := E(XY)$ and view the correlation as the cosine of the angle between the centered variables: $\cos\theta = \frac{\langle X-E(X),\,Y-E(Y)\rangle}{\sqrt{\langle X-E(X),X-E(X)\rangle\,\langle Y-E(Y),Y-E(Y)\rangle}} = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$.
If $\rho^2 = 1$,
$$\begin{aligned} Var(Y-aX) &= Var(Y) + a^2Var(X) - 2aCov(X,Y) \\ &= Var(Y) + a^2Var(X) - 2a\rho \sqrt{Var(X)Var(Y)} \\ &= \left(a\sqrt{Var(X)} - \rho \sqrt{Var(Y)}\right)^2 \end{aligned}$$
If we choose $a = \frac{\rho \sqrt{Var(Y)}}{\sqrt{Var(X)}}$, then $Var(Y-aX) = 0$, so $Y = aX + c$ almost surely for some constant $c$.
If $\rho = 0$, we say that $X$ and $Y$ are uncorrelated, but that does not mean they are independent.
For example, if $X \sim$ Uniform$(-1, 1)$ and $Y = X^2$, then one can show that $X$ and $Y$ are uncorrelated, even though they are not independent.
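A quick numerical check of this example (a sketch assuming NumPy): the sample covariance of $X$ and $X^2$ is close to zero even though $Y$ is a deterministic function of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)   # X ~ Uniform(-1, 1)
y = x**2                                 # Y = X^2

# Cov(X, Y) = E(X^3) - E(X)E(X^2) = 0 for this symmetric distribution
print(np.cov(x, y)[0, 1])                # ~0
```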
For multiple random variables, we define the covariance matrix. It is symmetric and positive semi-definite (for any $x$, $x^T\Sigma x \geq 0$):
$$\Sigma = E[(X-E(X))(X-E(X))^T] = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n)\\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n)\\ \vdots & \vdots & \ddots & \vdots\\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n) \end{bmatrix}$$
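A short sketch (assuming NumPy, with made-up data) that estimates a covariance matrix from samples and checks the two properties:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 3))             # 5000 samples of a 3-dimensional vector
data[:, 2] = data[:, 0] + 0.5 * data[:, 1]    # make the coordinates correlated

sigma = np.cov(data, rowvar=False)            # columns are variables, rows are observations

# Symmetric, and positive semi-definite (all eigenvalues >= 0 up to rounding)
print(np.allclose(sigma, sigma.T))
print(np.all(np.linalg.eigvalsh(sigma) >= -1e-10))
```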
Independence
Two random variables $X$ and $Y$ are independent if $F_{XY}(x, y) = F_X(x)F_Y(y)$ for all values of $x$ and $y$. Equivalently,
- For discrete random variables, $p(x,y) = p(x)p(y)$
- For continuous random variables, $f_{XY}(x, y) = f_X(x)f_Y(y)$
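A sketch (assuming NumPy) checking the discrete condition $p(x,y) = p(x)p(y)$ for a made-up joint table:

```python
import numpy as np

p_xy = np.array([[0.12, 0.18],    # hypothetical joint p.m.f.
                 [0.28, 0.42]])
p_x = p_xy.sum(axis=1)            # [0.3, 0.7]
p_y = p_xy.sum(axis=0)            # [0.4, 0.6]

# Independent iff the joint factorizes into the outer product of the marginals
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True for this table
```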
The Gaussian distribution
$$p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
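The sketch below (assuming NumPy/SciPy) evaluates this density directly from the formula and compares it with `scipy.stats.multivariate_normal`; the mean, covariance, and query point are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([0.3, -0.2])

# Density written out from the formula above
d = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

# Same value from SciPy
print(manual, multivariate_normal(mean=mu, cov=sigma).pdf(x))
```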
Estimation of Parameters
Maximum Likelihood estimation
We make some assumptions about our problem by prescribing a parametric model, then we fit the parameters of the model to the data. How do we choose the values of the parameters?
A common way to fit parameters is maximum likelihood estimation (MLE).
Suppose we have random variables $X_1, \dots, X_n$ and corresponding observations $x_1, \dots, x_n$. Then the likelihood function is
$$L(\theta) = p(x_1, \dots, x_n;\theta)$$
We assume $X_1, \dots, X_n$ are independent. Then
$$L(\theta) = \prod_{i=1}^n p(x_i;\theta), \qquad \log L(\theta) = \sum_{i=1}^n\log p(x_i;\theta), \qquad \theta_{mle} = \arg\max_\theta L(\theta)$$
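As a sketch (assuming NumPy/SciPy, with simulated data), here is MLE for a Poisson rate: the numerical maximizer of the log-likelihood should match the closed-form answer, which for the Poisson model is the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=1000)                 # observations x_1, ..., x_n

# Negative log-likelihood: -sum_i log p(x_i; lambda)
nll = lambda lam: -np.sum(stats.poisson.logpmf(x, lam))

res = minimize_scalar(nll, bounds=(0.01, 10.0), method="bounded")
print(res.x, x.mean())                              # both close to 3.0
```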
Information Theory
Entropy
Information measures how uncertain we are about the outcome of a random experiment.
Self-information: $i(x) = -\log_2 p(x)$
Entropy:
$$\begin{aligned} H(X) &= E[i(x)] = -\sum_{x\in X(\Omega)}p(x)\log_2 p(x) \\ H(X,Y) &= -\sum_{y\in Y(\Omega)}\sum_{x\in X(\Omega)}p(x,y)\log_2 p(x,y) \\ H(X|Y) &= -\sum_{y\in Y(\Omega)}\sum_{x\in X(\Omega)}p(x,y)\log_2 p(x|y) \\ H(X,Y) &= H(X) + H(Y|X) = H(Y) + H(X|Y) \end{aligned}$$
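A minimal sketch in NumPy (with a made-up joint p.m.f.) computing these quantities and checking the chain rule $H(X,Y) = H(Y) + H(X|Y)$:

```python
import numpy as np

# Hypothetical joint p.m.f. p(x, y): rows index x, columns index y (all entries > 0)
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H = lambda p: -np.sum(p * np.log2(p))   # entropy of any (joint or marginal) p.m.f.

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)
H_X_given_Y = H_XY - H_Y                # chain rule: H(X, Y) = H(Y) + H(X | Y)

print(H_X, H_Y, H_XY, H_X_given_Y)
```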
Relative Entropy
Jensen’s inequality: Let X be a random variable and f(x) be a convex function, then
$$E(f(X))\geq f(E(X))$$
If instead $f(x)$ is a concave function, then
$$E(f(X))\leq f(E(X))$$
Relative Entropy (Kullback-Leibler divergence):
$$D(p(x)\,||\,q(x)) = \sum_{x\in X(\Omega)}p(x)\log\frac{p(x)}{q(x)} = -E_p\left[\log\frac{q(x)}{p(x)}\right] \geq -\log E_p\left[\frac{q(x)}{p(x)}\right] \geq 0$$
The first inequality is Jensen's inequality applied to the convex function $-\log$.
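A short sketch (assuming NumPy, with two made-up distributions on the same support) computing the KL divergence; note that it is nonnegative and not symmetric in $p$ and $q$.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl_pq = np.sum(p * np.log(p / q))   # D(p || q) >= 0, equal to 0 only when p == q
kl_qp = np.sum(q * np.log(q / p))   # generally different from D(p || q)

print(kl_pq, kl_qp)
```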
Connection to MLE
Set the loss function $\text{loss}(x) = -\log q(x)$; its expected value under $p$ is
$$\text{Risk}(q) = -E_p[\log q(x)] = D(p\,||\,q) + \text{Risk}(p)$$
Thus, $\arg\min_q D(p\,||\,q) = \arg\min_q \text{Risk}(q)$.
Here, $\text{Risk}(q)$ is called the cross entropy $CE(p, q)$.
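A small sketch (assuming NumPy) of this connection: the empirical cross entropy of a categorical model on observed samples is, up to a factor of $n$, the negative log-likelihood, so minimizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.choice(3, size=1000, p=[0.5, 0.3, 0.2])   # observations from a categorical p

q = np.array([0.4, 0.4, 0.2])                     # a candidate model q

neg_log_lik = -np.sum(np.log(q[x]))               # -sum_i log q(x_i)
cross_entropy = neg_log_lik / len(x)              # empirical estimate of -E_p[log q(x)]
print(neg_log_lik, cross_entropy)
```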