Possible sources of uncertainty
- Inherent stochasticity in the system being modeled.
- Incomplete observability.
- Incomplete modeling.
Concepts
Note: in these notes, $\text{x}$ denotes a random variable and $x$ one of its values.
- probability==> degree of belief
- frequentist probability==> directly related to the rates at which events occur.
- Bayesian probability==> related to qualitative levels of certainty
- random variable==> a variable that can take on different values randomly.
- discrete: has a finite or countably infinite number of states
- continuous: is associated with a real value.
- probability distribution==> a description of how likely a random variable or set of random variables is to take on each of its possible states.
- probability mass function (PMF)==> a probability distribution over discrete variables
- a PMF maps from a state of a random variable to the probability of that random variable taking on that state.
- notation: $P(\text{x}=x)$, or $\text{x}\sim P(\text{x})$
- the domain of $P$ must be the set of all possible states of $\text{x}$
- $\forall x\in\text{x},0\leq P(x)\leq 1$.
- $\sum\nolimits_{x\in\text{x}}P(x)=1$.
- joint probability distribution==> a probability distribution over many variables
- $P(\text{x}=x,\text{y}=y)$ or $P(x,y)$
- probability density function (PDF)==> a probability distribution over continuous random variables
- the domain of $p$ must be the set of all possible states of $\text{x}$.
- $\forall x\in\text{x},p(x)\geq0$.
- $\int p(x)dx=1$.
- example: the uniform distribution $u(x;a,b)$, where $b>a$. For all $x\notin[a,b]$, $u(x;a,b)=0$; within $[a,b]$, $u(x;a,b)=\frac{1}{b-a}$. This is written $\text{x}\sim U(a,b)$.
- Marginal Probability
- the probability distribution over a subset of the variables.
- For discrete random variables, given $P(\text{x},\text{y})$, find $P(\text{x})$ with the sum rule: $\forall x\in\text{x},P(\text{x}=x)=\sum\limits_yP(\text{x}=x,\text{y}=y)$.
- For continuous variables, $p(x)=\int p(x,y)dy$
- Conditional Probability
- $P(\text{y}=y\mid\text{x}=x)=\frac{P(\text{y}=y,\text{x}=x)}{P(\text{x}=x)}$, defined only when $P(\text{x}=x)>0$.
- intervention query==> computes the consequences of an action (the domain of causal modeling); this is not the same as computing a conditional probability.
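The sum rule and the definition of conditional probability can be tried out on a small table. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables; the numbers and the helper names `marginal_x` and `conditional_y_given_x` are hypothetical and only serve the illustration.

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
# Keys are (x, y) pairs; the four probabilities sum to 1.
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}


def marginal_x(joint):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x


def conditional_y_given_x(joint, x):
    """P(y | x) = P(y, x) / P(x); only defined when P(x) > 0."""
    p_x = marginal_x(joint)[x]
    if p_x == 0.0:
        raise ValueError("cannot condition on an event with probability zero")
    return {y: p / p_x for (xx, y), p in joint.items() if xx == x}


p_x = marginal_x(joint)
p_y_given_x0 = conditional_y_given_x(joint, 0)
print(p_x)
print(p_y_given_x0)
```

Conditioning renormalizes one row of the joint table by its marginal, so the conditional distribution sums to 1 again.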
- The Chain Rule of Conditional Probabilities
- $P(\text{x}^{(1)},\cdots,\text{x}^{(n)})=P(\text{x}^{(1)})\prod_{i=2}^nP(\text{x}^{(i)}\mid\text{x}^{(1)},\cdots,\text{x}^{(i-1)})$
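As a small worked case, the chain rule for three variables follows from applying the definition of conditional probability twice. The variable names $\text{a},\text{b},\text{c}$ are chosen here only for illustration.

```latex
\begin{aligned}
P(\text{a},\text{b},\text{c}) &= P(\text{a}\mid\text{b},\text{c})\,P(\text{b},\text{c}) \\
P(\text{b},\text{c}) &= P(\text{b}\mid\text{c})\,P(\text{c}) \\
P(\text{a},\text{b},\text{c}) &= P(\text{a}\mid\text{b},\text{c})\,P(\text{b}\mid\text{c})\,P(\text{c})
\end{aligned}
```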
- Independence:
- $\forall x\in\text{x},y\in\text{y},\ p(\text{x}=x,\text{y}=y)=p(\text{x}=x)p(\text{y}=y)$
- Shorthand: $\text{x}\perp\text{y}$
- Conditional Independence:
- $\forall x\in\text{x},y\in\text{y},z\in\text{z},\ p(\text{x}=x,\text{y}=y\mid\text{z}=z)=p(\text{x}=x\mid\text{z}=z)p(\text{y}=y\mid\text{z}=z)$
- Shorthand: $\text{x}\perp\text{y}\mid\text{z}$
- Expectation
- For discrete variables, $\mathbb{E}_{\text{x}\sim P}[f(x)]=\sum\limits_xP(x)f(x)$.
- For continuous variables, $\mathbb{E}_{\text{x}\sim p}[f(x)]=\int p(x)f(x)dx$.
- linear: $\mathbb{E}_{\text{x}}[\alpha f(x)+\beta g(x)]=\alpha\mathbb{E}_{\text{x}}[f(x)]+\beta\mathbb{E}_{\text{x}}[g(x)]$, when $\alpha$ and $\beta$ do not depend on $x$.
- Variance
- $\text{Var}(f(x))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])^2\big]$
- the square root of the variance is known as the standard deviation.
- Covariance
- $\text{Cov}(f(x),g(y))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])\big]$
- measures how much two values are linearly related to each other, as well as the scale of these variables.
- high absolute value: the values change a lot and are both far from their respective means at the same time.
- positive: both variables tend to take relatively high values at the same time.
- negative: one variable tends to be high when the other is low, and vice versa.
- relationship between covariance and independence: independence==> zero covariance; zero covariance =/=> independence, because covariance only captures linear dependence.
- covariance matrix:
- For a random vector $\mathbf{x}\in\mathbb{R}^n$, the covariance matrix is an $n\times n$ matrix.
- $\text{Cov}(\mathbf{x})_{i,j}=\text{Cov}(\text{x}_i,\text{x}_j)$; the diagonal elements of the covariance matrix give the variances: $\text{Cov}(\text{x}_i,\text{x}_i)=\text{Var}(\text{x}_i)$.
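The definitions of expectation, variance and covariance translate directly into sums over a PMF. The sketch below assumes a hypothetical example in which $\text{x}$ is uniform on $\{-1,0,1\}$ and $\text{y}=\text{x}^2$, so that both $f$ and $g$ can be written as functions of the same variable. It illustrates that zero covariance does not imply independence: $\text{y}$ is completely determined by $\text{x}$, yet their covariance is zero.

```python
# Hypothetical example: x is uniform on {-1, 0, 1} and y = x**2.
pmf_x = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}


def expectation(pmf, f):
    """E[f(x)] = sum over x of P(x) f(x)."""
    return sum(p * f(x) for x, p in pmf.items())


def variance(pmf, f):
    """Var(f(x)) = E[(f(x) - E[f(x)])^2]."""
    mean = expectation(pmf, f)
    return expectation(pmf, lambda x: (f(x) - mean) ** 2)


def covariance(pmf, f, g):
    """Cov(f(x), g(x)) = E[(f(x) - E[f(x)]) (g(x) - E[g(x)])]."""
    mean_f = expectation(pmf, f)
    mean_g = expectation(pmf, g)
    return expectation(pmf, lambda x: (f(x) - mean_f) * (g(x) - mean_g))


def identity(x):
    return x


def square(x):
    return x ** 2


mean_x = expectation(pmf_x, identity)
var_x = variance(pmf_x, identity)
cov_xy = covariance(pmf_x, identity, square)  # zero, although y depends on x
print(mean_x, var_x, cov_xy)
```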
Common Probability Distributions
- Bernoulli Distribution
- a distribution over a single binary random variable.
- a single parameter $\phi\in[0,1]$ gives the probability of the random variable being equal to 1.
- $P(\text{x}=1)=\phi$.
- $P(\text{x}=0)=1-\phi$.
- $P(\text{x}=x)=\phi^x(1-\phi)^{1-x}$
- $\mathbb{E}_{\text{x}}[\text{x}]=\phi$
- $\text{Var}_{\text{x}}(\text{x})=\phi(1-\phi)$
- Multinoulli Distribution
- a distribution over a single discrete variable with $k$ different states, where $k$ is finite.
- a vector parameter $\vec{p}\in[0,1]^{k-1}$, where $p_i$ gives the probability of the $i$-th state.
- the $k$-th state's probability: $1-\bold{1}^T\vec{p}$.
- constraint: $\bold{1}^T\vec{p}\leq1$
- Gaussian Distribution
- $\mathcal{N}(x;\mu,\sigma^2)=\sqrt{\frac{1}{2\pi\sigma^2}}\exp\big(-\frac{1}{2\sigma^2}(x-\mu)^2\big)$
- Two parameters: $\mu\in\mathbb{R}$ and $\sigma\in(0,\infty)$
- $\mu$ gives the coordinate of the central peak
- $\mathbb{E}[\text{x}]=\mu$
- $\sigma$ is the standard deviation of the distribution
- $\sigma^2$ is the variance
- when the PDF has to be evaluated often, it is more efficient to parametrize it with the precision (inverse variance) $\beta\in(0,\infty)$:
- $\mathcal{N}(x;\mu,\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp\big(-\frac{1}{2}\beta(x-\mu)^2\big)$
- reasons the normal distribution is a good default choice:
- many distributions we wish to model are truly close to being normal distributions.
- Central limit theorem: the sum of many independent random variables is approximately normally distributed.
- out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
- generalizes to $\mathbb{R}^n$: multivariate normal distribution.
- a positive definite symmetric matrix parameter $\bold{\Sigma}$
- $\mathcal{N}(\vec{x};\vec{\mu},\bold{\Sigma})=\sqrt{\frac{1}{(2\pi)^n\det(\bold{\Sigma})}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\Sigma}^{-1}(\vec{x}-\vec{\mu})\big)$, where $\vec{\mu}$, now vector-valued, is the mean of the distribution and $\bold{\Sigma}$ is the covariance matrix of the distribution.
- use a precision matrix $\bold{\beta}$:
- $\mathcal{N}(\vec{x};\vec{\mu},\bold{\beta}^{-1})=\sqrt{\frac{\det(\bold{\beta})}{(2\pi)^n}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\beta}(\vec{x}-\vec{\mu})\big)$
- isotropic Gaussian distribution: covariance matrix is a scalar times the identity matrix.
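The two univariate parametrizations of the normal density (variance $\sigma^2$ and precision $\beta=1/\sigma^2$) can be compared numerically. This is a minimal sketch using only the standard library; the function names `normal_pdf` and `normal_pdf_precision`, the test point, and the midpoint-rule integration are illustrative choices, not part of the original notes.

```python
import math


def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2), parametrized by the variance sigma2 > 0."""
    return math.sqrt(1.0 / (2.0 * math.pi * sigma2)) * math.exp(
        -((x - mu) ** 2) / (2.0 * sigma2)
    )


def normal_pdf_precision(x, mu, beta):
    """N(x; mu, beta^{-1}), parametrized by the precision beta > 0."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)


# Both forms agree when beta = 1 / sigma2.
by_variance = normal_pdf(1.5, 0.5, 4.0)
by_precision = normal_pdf_precision(1.5, 0.5, 1.0 / 4.0)

# Midpoint-rule check that the standard normal density integrates to about 1.
step = 0.001
total = step * sum(
    normal_pdf(-10.0 + (i + 0.5) * step, 0.0, 1.0) for i in range(20000)
)
print(by_variance, by_precision, total)
```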
- Exponential Distribution
- a distribution with a sharp point at $x=0$
- $p(x;\lambda)=\lambda\bold{1}_{x\geq0}\exp(-\lambda x)$, where the indicator function $\bold{1}_{x\geq0}$ assigns probability zero to all negative values of $x$.
- Laplace Distribution
- places a sharp peak of probability mass at an arbitrary point $\mu$.
- $\text{Laplace}(x;\mu,\gamma)=\frac{1}{2\gamma}\exp\big(-\frac{|x-\mu|}{\gamma}\big)$
- Dirac Distribution
- $p(x)=\delta(x-\mu)$: all of the mass is concentrated at the single point $\mu$.
- Empirical Distribution
- $\hat{p}(\vec{x})=\frac{1}{m}\sum\limits_{i=1}^m\delta(\vec{x}-\vec{x}^{(i)})$, which puts probability mass $\frac{1}{m}$ on each of the $m$ points $\vec{x}^{(1)},\cdots,\vec{x}^{(m)}$.
- the Dirac delta distribution is only needed to define the empirical distribution over continuous variables.
- For discrete variables, an empirical distribution can be conceptualized as a multinoulli distribution.
- Mixture Distributions
- made up of several component distributions
- $P(\text{x})=\sum\limits_iP(\text{c}=i)P(\text{x}\mid\text{c}=i)$, where $P(\text{c})$ is the multinoulli distribution over component identities (a simple strategy for combining distributions).
- a latent variable is a random variable that we cannot observe directly; here the component identity $\text{c}$ is latent.
- Gaussian mixture model: a universal approximator of densities
- prior probability: $\alpha_i=P(\text{c}=i)$, the belief about $\text{c}$ before observing $\vec{x}$
- posterior probability: $P(\text{c}\mid\vec{x})$, computed after observing $\vec{x}$
- any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.
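Sampling from a mixture can be done in two steps: first draw the component identity $\text{c}$ from the multinoulli prior, then draw $\text{x}$ from the chosen component. The sketch below assumes a hypothetical two-component, one-dimensional Gaussian mixture. All weights, means and standard deviations are made up, and `sample_mixture` and `posterior` are illustrative helper names.

```python
import math
import random

# Hypothetical two-component Gaussian mixture (all numbers are made up).
weights = [0.3, 0.7]   # prior probabilities alpha_i = P(c = i)
means = [-2.0, 3.0]
stddevs = [0.5, 1.0]


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def sample_mixture(rng):
    """Draw c from the prior P(c), then x from the component P(x | c)."""
    c = rng.choices(range(len(weights)), weights=weights)[0]
    x = rng.gauss(means[c], stddevs[c])
    return c, x


def posterior(x):
    """P(c | x), proportional to P(c) P(x | c)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stddevs)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_mixture(rng) for _ in range(20000)]
frac_c1 = sum(1 for c, _ in samples if c == 1) / len(samples)
mean_x = sum(x for _, x in samples) / len(samples)
post = posterior(-2.0)  # a point near the first component's mean
print(frac_c1, mean_x, post)
```

The fraction of samples from component 1 should be close to its prior weight, while the posterior at a point near the first mean puts almost all of its mass on component 0.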
Useful Properties of Common Functions
- logistic sigmoid
- $\sigma(x)=\frac{1}{1+\exp(-x)}$
- commonly used to produce the $\phi$ parameter of a Bernoulli distribution, because its range is $(0,1)$
- properties:
- $\sigma(x)=\frac{\exp(x)}{\exp(x)+\exp(0)}$
- $\frac{d}{dx}\sigma(x)=\sigma(x)(1-\sigma(x))$
- $1-\sigma(x)=\sigma(-x)$
- $\forall x\in(0,1),\sigma^{-1}(x)=\log\big(\frac{x}{1-x}\big)$
- softplus
- $\zeta(x)=\log(1+\exp(x))$
- can be used to produce the $\beta$ or $\sigma$ parameter of a normal distribution, because its range is $(0,\infty)$
- properties:
- $\forall x>0,\zeta^{-1}(x)=\log(\exp(x)-1)$
- $\zeta(x)-\zeta(-x)=x$
- properties relating the sigmoid and the softplus:
- $\log\sigma(x)=-\zeta(-x)$
- $\frac{d}{dx}\zeta(x)=\sigma(x)$
- $\zeta(x)=\int^x_{-\infty}\sigma(y)dy$
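The identities above are easy to check numerically. The following sketch implements both functions with the standard library. The overflow-avoiding branches are an implementation choice made here (an assumption, not something stated in the notes), and the derivative is checked with a central finite difference.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """log(1 + exp(x)), rewritten as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


h = 1e-5  # step for the finite-difference derivative check
for x in [-3.0, -0.5, 0.0, 2.0]:
    # 1 - sigma(x) = sigma(-x)
    assert abs((1.0 - sigmoid(x)) - sigmoid(-x)) < 1e-12
    # zeta(x) - zeta(-x) = x
    assert abs((softplus(x) - softplus(-x)) - x) < 1e-12
    # log sigma(x) = -zeta(-x)
    assert abs(math.log(sigmoid(x)) + softplus(-x)) < 1e-12
    # d/dx zeta(x) = sigma(x), checked numerically
    slope = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
    assert abs(slope - sigmoid(x)) < 1e-6
```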
Bayes’ Rule
- $P(\text{x}\mid\text{y})=\frac{P(\text{x})P(\text{y}\mid\text{x})}{P(\text{y})}$, where $P(\text{y})=\sum_xP(\text{y}\mid x)P(x)$
- it follows directly from the definition of conditional probability.
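A small numeric case shows how the rule combines a prior with a likelihood. The sketch below assumes a hypothetical binary variable $\text{x}$ with prior $P(\text{x}=1)=0.01$ and an observation $\text{y}=1$ whose likelihood is $0.95$ when $\text{x}=1$ and $0.05$ when $\text{x}=0$; all numbers are invented for illustration.

```python
# Hypothetical prior P(x) and likelihood P(y = 1 | x) for a binary x.
prior = {0: 0.99, 1: 0.01}
likelihood_y1 = {0: 0.05, 1: 0.95}


def posterior_given_y1(prior, likelihood):
    """Bayes' rule: P(x | y) = P(x) P(y | x) / P(y), with P(y) = sum_x P(y | x) P(x)."""
    evidence = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / evidence for x in prior}


post = posterior_given_y1(prior, likelihood_y1)
print(post)
```

Even with a likelihood of 0.95, the posterior $P(\text{x}=1\mid\text{y}=1)$ stays well below one half in this example, because the prior is small.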
Technical Details of Continuous Variables
- Measure theory
- purpose: measure theory is useful for describing theorems that apply to most points in $\mathbb{R}^n$ but do not apply to some corner cases.
- measure zero: a rigorous way of describing that a set of points is negligibly small
- almost everywhere: a property that holds almost everywhere holds throughout all of space except on a set of measure zero. Some important results in probability theory hold for all discrete values but only hold "almost everywhere" for continuous values.
- change of variables: for $\vec{y}=g(\vec{x})$ with $g$ invertible, continuous and differentiable, in one dimension $p_x(x)=p_y(g(x))\left|\frac{\partial g(x)}{\partial x}\right|$
- For higher dimensions, $p_x(\vec{x})=p_y(g(\vec{x}))\left|\det\left(\frac{\partial g(\vec{x})}{\partial\vec{x}}\right)\right|$
- Jacobian matrix: the matrix of partial derivatives inside the determinant, $J_{i,j}=\frac{\partial g_i(\vec{x})}{\partial x_j}=\frac{\partial y_i}{\partial x_j}$
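The derivative factor matters because a density must integrate to 1 after the transformation. A minimal sketch, assuming the hypothetical case $\text{x}\sim U(0,1)$ and $y=g(x)=x/2$: the naive guess $p_y(y)=p_x(g^{-1}(y))$ integrates to about $\frac{1}{2}$, while including $\left|\frac{d g^{-1}(y)}{dy}\right|=2$ restores a total of 1. The helper names and the midpoint-rule integration are illustrative choices.

```python
def p_x(x):
    """Density of x ~ U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def g(x):
    return x / 2.0


def g_inverse(y):
    return 2.0 * y


def p_y_naive(y):
    """Wrong: ignores how g stretches or compresses space."""
    return p_x(g_inverse(y))


def p_y_corrected(y):
    """p_y(y) = p_x(g^{-1}(y)) |d g^{-1}(y) / dy|, and the derivative is 2 here."""
    return p_x(g_inverse(y)) * abs(2.0)


def integrate(f, lo, hi, n=30000):
    """Midpoint rule on [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


naive_total = integrate(p_y_naive, -1.0, 2.0)
corrected_total = integrate(p_y_corrected, -1.0, 2.0)

# The one-dimensional formula from the notes, p_x(x) = p_y(g(x)) |g'(x)|, with g'(x) = 1/2.
check = p_y_corrected(g(0.3)) * 0.5
print(naive_total, corrected_total, check)
```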
- Information theory
- information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes.
- quantify information:
- Likely events should have low information content
- Less likely events should have higher information content
- Independent events should have additive information
- Self-information of an event $\text{x}=x$
- $I(x)=-\log P(x)$; with the natural logarithm, the unit is the nat.
- Shannon entropy
- $H(\text{x})=\mathbb{E}_{\text{x}\sim P}[I(x)]=-\mathbb{E}_{\text{x}\sim P}[\log P(x)]$
- the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
- when $\text{x}$ is continuous, the Shannon entropy is known as the differential entropy.
- Kullback-Leibler (KL) divergence
- $D_{\text{KL}}(P\|Q)=\mathbb{E}_{\text{x}\sim P}\Big[\log\frac{P(x)}{Q(x)}\Big]=\mathbb{E}_{\text{x}\sim P}[\log P(x)-\log Q(x)]$
- KL divergence is non-negative, and it is 0 if and only if $P$ and $Q$ are the same distribution for discrete variables, or equal "almost everywhere" for continuous variables.
- it is not symmetric: for some $P$ and $Q$, $D_{\text{KL}}(P\|Q)\neq D_{\text{KL}}(Q\|P)$
- Cross-entropy
- $H(P,Q)=H(P)+D_{\text{KL}}(P\|Q)$
- namely, $H(P,Q)=-\mathbb{E}_{\text{x}\sim P}\log Q(x)$
- Note: by convention, $0\log 0=\lim_{x\rightarrow0}x\log x=0$
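For discrete distributions these quantities are short sums. The sketch below assumes two hypothetical distributions `p` and `q` over two states, uses the natural logarithm (so results are in nats), and applies the $0\log 0=0$ convention by skipping zero-probability terms of the first argument.

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log P(x), skipping terms with P(x) = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); same assumption on Q as above."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical distributions over two states.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(entropy(p), kl_pq, kl_qp, cross_entropy(p, q))
```

In this example the two directions of the KL divergence differ, and the cross-entropy equals the entropy plus the KL divergence up to floating-point error.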
- Structured Probabilistic Models (graphical models)
- we represent the factorization of a probability distribution with a graph
- Directed
- use graphs with directed edges; they represent factorizations into conditional probability distributions
- $p(\bold{x})=\prod\limits_ip(\text{x}_i\mid Pa_{\mathcal{G}}(\text{x}_i))$, where $Pa_{\mathcal{G}}(\text{x}_i)$ denotes the parents of $\text{x}_i$ in the graph $\mathcal{G}$; each factor is the conditional distribution over $\text{x}_i$ given its parents.
- Undirected
- use graphs with undirected edges; they represent factorizations into a set of functions, which are not probability distributions of any kind.
- $p(\bold{x})=\frac{1}{Z}\prod_i\phi^{(i)}(\mathcal{C}^{(i)})$, where $\mathcal{C}^{(i)}$ is a set of nodes that are all connected to each other in $\mathcal{G}$, $\phi^{(i)}(\mathcal{C}^{(i)})$ is a non-negative factor, which is not a distribution function, and $Z$ is the normalizing constant.
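The directed factorization can be made concrete with a small chain $\text{a}\rightarrow\text{b}\rightarrow\text{c}$ of binary variables, for which $p(\text{a},\text{b},\text{c})=p(\text{a})\,p(\text{b}\mid\text{a})\,p(\text{c}\mid\text{b})$. The sketch below is only an illustration: the graph and every table entry are hypothetical.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain a -> b -> c.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [b][c]


def joint(a, b, c):
    """p(a, b, c) = p(a) p(b | a) p(c | b): one factor per node given its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


# The product of normalized conditionals is itself normalized.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(joint(0, 0, 0), total)
```

The three tables need 1 + 2 + 2 = 5 free parameters, whereas a full joint table over three binary variables needs 7.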