C3-Probability and Information Theory

Possible sources of uncertainty

  • Inherent stochasticity in the system being modeled.
  • Incomplete observability.
  • Incomplete modeling.

Concepts

Note: In these notes, $\text{x}$ denotes a random variable and $x$ denotes one of its values.

  • probability==> degree of belief
  • frequentist probability==> directly related to the rates at which events occur.
  • Bayesian probability==> related to qualitative levels of certainty
  • random variable==> a variable that can take on different values randomly.
    • discrete: has a finite or countably infinite number of states
    • continuous: is associated with a real value.
  • probability distribution==> a description of how likely a random variable or set of random variables is to take on each of its possible states.
    • probability mass function (PMF)==> a probability distribution over discrete variables
      • A PMF maps from a state of a random variable to the probability of that random variable taking on that state.
      • $P(\text{x}=x)$ or $\text{x}\sim P(\text{x})$
      • the domain of $P$ must be the set of all possible states of $\text{x}$.
      • $\forall x\in\text{x},\ 0\leq P(x)\leq 1$.
      • $\sum_{x\in\text{x}}P(x)=1$.
    • joint probability distribution==> a probability distribution over many variables
      • $P(\text{x}=x,\text{y}=y)$ or $P(x,y)$
    • probability density function (PDF)==> a probability distribution over continuous random variables
      • the domain of $p$ must be the set of all possible states of $\text{x}$.
      • $\forall x\in\text{x},\ p(x)\geq 0$. Note that $p(x)\leq 1$ is not required.
      • $\int p(x)dx=1$.
      • Example (uniform distribution): $u(x;a,b)$, where $b>a$. For all $x\notin[a,b]$, $u(x;a,b)=0$; within $[a,b]$, $u(x;a,b)=\frac{1}{b-a}$. We write $\text{x}\sim U(a,b)$.
  • Marginal Probability
    • The probability distribution over a subset of the variables.
    • For discrete random variables, given $P(\text{x},\text{y})$, find $P(\text{x})$ with the sum rule: $\forall x\in\text{x},\ P(\text{x}=x)=\sum_y P(\text{x}=x,\text{y}=y)$.
    • For continuous variables, $p(x)=\int p(x,y)dy$.
  • Conditional Probability
    • $P(\text{y}=y\mid\text{x}=x)=\frac{P(\text{y}=y,\text{x}=x)}{P(\text{x}=x)}$, defined only when $P(\text{x}=x)>0$.
    • intervention query==> computing the consequences of an action. This is the domain of causal modeling and should not be confused with conditional probability.
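The sum rule and the definition of conditional probability can be tried out on a small joint table. The following is a minimal sketch; the joint distribution and the helper names `marginal_x` and `conditional_y_given_x` are made up for illustration and are not part of the original notes.

```python
# Hypothetical joint distribution P(x, y) with x in {0, 1} and y in {"a", "b"}.
joint = {
    (0, "a"): 0.1, (0, "b"): 0.3,
    (1, "a"): 0.2, (1, "b"): 0.4,
}

def marginal_x(joint):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x

def conditional_y_given_x(joint, x):
    """P(y | x) = P(y, x) / P(x); only defined when P(x) > 0."""
    p_x = marginal_x(joint)[x]
    if p_x == 0:
        raise ValueError("P(x) is zero, so the conditional is undefined")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_x = marginal_x(joint)                         # marginal over x
p_y_given_x0 = conditional_y_given_x(joint, 0)  # distribution of y given x = 0
```

Each conditional distribution obtained this way sums to 1 over `y`, which is a quick sanity check when building such tables by hand.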
  • The Chain Rule of Conditional Probabilities
    • $P(\text{x}^{(1)},\cdots,\text{x}^{(n)})=P(\text{x}^{(1)})\prod_{i=2}^n P(\text{x}^{(i)}\mid\text{x}^{(1)},\cdots,\text{x}^{(i-1)})$
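As a worked instance for three variables (the names $\text{a}$, $\text{b}$, $\text{c}$ are chosen here only for illustration), applying the definition of conditional probability twice gives the chain-rule factorization:

```latex
\begin{aligned}
P(\text{a},\text{b},\text{c}) &= P(\text{a}\mid\text{b},\text{c})\,P(\text{b},\text{c})\\
P(\text{b},\text{c}) &= P(\text{b}\mid\text{c})\,P(\text{c})\\
P(\text{a},\text{b},\text{c}) &= P(\text{a}\mid\text{b},\text{c})\,P(\text{b}\mid\text{c})\,P(\text{c})
\end{aligned}
```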
  • Independence:
    • $\forall x\in\text{x},y\in\text{y},\ p(\text{x}=x,\text{y}=y)=p(\text{x}=x)p(\text{y}=y)$
    • Shorthand: $\text{x}\perp\text{y}$
  • Conditional Independence:
    • $\forall x\in\text{x},y\in\text{y},z\in\text{z},\ p(\text{x}=x,\text{y}=y\mid\text{z}=z)=p(\text{x}=x\mid\text{z}=z)p(\text{y}=y\mid\text{z}=z)$
    • Shorthand: $\text{x}\perp\text{y}\mid\text{z}$
  • Expectation
    • For discrete variables, $\mathbb{E}_{\text{x}\sim P}[f(x)]=\sum_x P(x)f(x)$.
    • For continuous variables, $\mathbb{E}_{\text{x}\sim p}[f(x)]=\int p(x)f(x)dx$.
    • Linearity: $\mathbb{E}_{\text{x}}[\alpha f(x)+\beta g(x)]=\alpha\mathbb{E}_{\text{x}}[f(x)]+\beta\mathbb{E}_{\text{x}}[g(x)]$, when $\alpha$ and $\beta$ do not depend on $x$.
  • Variance
    • $\text{Var}(f(x))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])^2\big]$
    • the square root of the variance is known as the standard deviation.
  • Covariance
    • $\text{Cov}(f(x),g(y))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])\big]$
    • gives some sense of how much two values are linearly related to each other, as well as the scale of these variables.
    • high absolute value:
      • the values change very much
      • and are both far from their respective means at the same time
    • positive: both variables tend to take on relatively high values simultaneously.
    • negative: one variable tends to be high when the other is low, and vice versa.
    • relationship between covariance and independence: independence==> zero covariance; zero covariance does not imply independence, because covariance only captures linear relationships.
    • covariance matrix:
      • For a random vector $\mathbf{x}\in\mathbb{R}^n$, the covariance matrix is an $n\times n$ matrix.
      • $\text{Cov}(\mathbf{x})_{i,j}=\text{Cov}(\text{x}_i,\text{x}_j)$; the diagonal elements of the covariance matrix give the variances: $\text{Cov}(\text{x}_i,\text{x}_i)=\text{Var}(\text{x}_i)$.
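The definitions of expectation, variance and covariance above can be checked on a tiny discrete example. This sketch assumes a made-up variable x that is uniform on {-1, 0, 1} and sets y = x², so y is completely determined by x; it illustrates that zero covariance does not imply independence. All function names are illustrative.

```python
# Hypothetical PMF: x is uniform on {-1, 0, 1}; y = x**2 depends fully on x.
pmf_x = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}

def expectation(pmf, f):
    """E[f(x)] = sum over x of P(x) f(x)."""
    return sum(p * f(x) for x, p in pmf.items())

def variance(pmf, f):
    """Var(f(x)) = E[(f(x) - E[f(x)])^2]."""
    mean = expectation(pmf, f)
    return expectation(pmf, lambda x: (f(x) - mean) ** 2)

def covariance(pmf, f, g):
    """Cov(f(x), g(x)) for two functions of the same underlying variable."""
    mean_f, mean_g = expectation(pmf, f), expectation(pmf, g)
    return expectation(pmf, lambda x: (f(x) - mean_f) * (g(x) - mean_g))

def identity(x):
    return x

def square(x):
    return x * x

var_x = variance(pmf_x, identity)
cov_xy = covariance(pmf_x, identity, square)  # covariance of x and y = x**2

# Independence would require P(x=1, y=1) == P(x=1) * P(y=1).
p_joint = 1 / 3                  # x = 1 forces y = 1
p_product = (1 / 3) * (2 / 3)    # P(x=1) * P(y=1)
```

Here `cov_xy` is zero (up to rounding) while `p_joint` and `p_product` differ, so x and y are dependent even though they are uncorrelated.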

Common Probability Distributions

  • Bernoulli Distribution
    • a distribution over a single binary random variable.
    • a single parameter $\phi\in[0,1]$ gives the probability of the random variable being equal to 1.
      • $P(\text{x}=1)=\phi$.
      • $P(\text{x}=0)=1-\phi$.
      • $P(\text{x}=x)=\phi^x(1-\phi)^{1-x}$
      • $\mathbb{E}_{\text{x}}[\text{x}]=\phi$
      • $\text{Var}_{\text{x}}(\text{x})=\phi(1-\phi)$
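The closed-form mean and variance of the Bernoulli distribution can be confirmed directly from its PMF. A minimal sketch, with `phi = 0.3` picked arbitrarily for illustration:

```python
def bernoulli_pmf(x, phi):
    """P(x = x) = phi**x * (1 - phi)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable takes values 0 or 1")
    return phi ** x * (1 - phi) ** (1 - x)

phi = 0.3  # illustrative value, not from the notes

# Expectation and variance computed from the PMF by direct summation.
mean = sum(x * bernoulli_pmf(x, phi) for x in (0, 1))
var = sum((x - mean) ** 2 * bernoulli_pmf(x, phi) for x in (0, 1))
```

The summed values agree with $\phi$ and $\phi(1-\phi)$ respectively.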
  • Multinoulli Distribution
    • a distribution over a single discrete variable with $k$ different states, where $k$ is finite.
    • a vector parameter $\vec{p}\in[0,1]^{k-1}$, where $p_i$ gives the probability of the $i$-th state.
    • the $k$-th state's probability: $1-\bold{1}^T\vec{p}$.
    • constraint: $\bold{1}^T\vec{p}\leq 1$
  • Gaussian Distribution
    • $\mathcal{N}(x;\mu,\sigma^2)=\sqrt{\frac{1}{2\pi\sigma^2}}\exp\big(-\frac{1}{2\sigma^2}(x-\mu)^2\big)$
      Fig3.1
    • Two parameters $\mu\in\mathbb{R}$ and $\sigma\in(0,\infty)$
      • $\mu$ gives the coordinate of the central peak
      • $\mathbb{E}[\text{x}]=\mu$
      • $\sigma$ is the standard deviation of the distribution
      • $\sigma^2$ is the variance
    • to evaluate the PDF more efficiently, parameterize it with the precision (inverse variance) $\beta\in(0,\infty)$:
      • $\mathcal{N}(x;\mu,\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp\big(-\frac{1}{2}\beta(x-\mu)^2\big)$
    • reasons the normal distribution is a good default choice
      • many distributions we wish to model are truly close to being normal distributions.
      • Central limit theorem: the sum of many independent random variables is approximately normally distributed.
      • out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
    • generalizes to $\mathbb{R}^n$: multivariate normal distribution.
      • a positive definite symmetric matrix parameter $\bold{\Sigma}$
      • $\mathcal{N}(\vec{x};\vec{\mu},\bold{\Sigma})=\sqrt{\frac{1}{(2\pi)^n\det(\bold{\Sigma})}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\Sigma}^{-1}(\vec{x}-\vec{\mu})\big)$, where the vector $\vec{\mu}$ is the mean of the distribution and $\bold{\Sigma}$ is the covariance matrix of the distribution.
      • use a precision matrix $\bold{\beta}$:
        • $\mathcal{N}(\vec{x};\vec{\mu},\bold{\beta}^{-1})=\sqrt{\frac{\det(\bold{\beta})}{(2\pi)^n}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\beta}(\vec{x}-\vec{\mu})\big)$
      • isotropic Gaussian distribution: the covariance matrix is a scalar times the identity matrix.
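The variance and precision parameterizations describe the same density, which is easy to confirm numerically. The sketch below uses only the standard library; the function names are illustrative, and the isotropic case relies on the fact that with covariance $\sigma^2 I$ the multivariate density factorizes into a product of univariate densities.

```python
import math

def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2), parameterized by the variance."""
    return math.sqrt(1.0 / (2.0 * math.pi * sigma2)) * math.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """N(x; mu, beta^{-1}), parameterized by the precision beta = 1 / sigma^2."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

def isotropic_normal_pdf(xs, mus, sigma2):
    """Multivariate normal with covariance sigma2 * I, as a product over coordinates."""
    density = 1.0
    for x, mu in zip(xs, mus):
        density *= normal_pdf(x, mu, sigma2)
    return density

# Same density value from both parameterizations (variance 4.0 <=> precision 0.25).
same_a = normal_pdf(1.3, 0.5, 4.0)
same_b = normal_pdf_precision(1.3, 0.5, 0.25)
```

The precision form avoids inverting the variance on every evaluation, which is the efficiency argument made above.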
  • Exponential Distribution
    • a probability distribution with a sharp point at $x=0$
    • $p(x;\lambda)=\lambda\bold{1}_{x\geq 0}\exp(-\lambda x)$, where the indicator function $\bold{1}_{x\geq 0}$ assigns probability zero to all negative values of $x$.
  • Laplace Distribution
    • places a sharp peak of probability mass at an arbitrary point $\mu$.
    • $\text{Laplace}(x;\mu,\gamma)=\frac{1}{2\gamma}\exp\big(-\frac{|x-\mu|}{\gamma}\big)$
  • Dirac Distribution
    • $p(x)=\delta(x-\mu)$, where the Dirac delta function $\delta(x)$ is zero everywhere except at 0, yet integrates to 1.
  • Empirical Distribution
    • $\hat{p}(\vec{x})=\frac{1}{m}\sum_{i=1}^m\delta(\vec{x}-\vec{x}^{(i)})$, which puts probability mass $\frac{1}{m}$ on each of the $m$ points $\vec{x}^{(1)},\cdots,\vec{x}^{(m)}$.
    • The Dirac delta distribution is only needed to define the empirical distribution over continuous variables.
    • For discrete variables, an empirical distribution can be conceptualized as a multinoulli distribution.
  • Mixture Distributions
    • made up of several component distributions
    • $P(\text{x})=\sum_i P(\text{c}=i)P(\text{x}\mid\text{c}=i)$, where $P(\text{c})$ is the multinoulli distribution over component identities (a simple strategy for combining distributions).
    • a latent variable is a random variable that we cannot observe directly; the component identity $\text{c}$ is an example.
    • Gaussian mixture model: the components $p(\vec{x}\mid\text{c}=i)$ are Gaussians; a universal approximator of densities
      • prior probability: $\alpha_i=P(\text{c}=i)$
      • posterior probability: $P(\text{c}\mid\vec{x})$
      • any smooth density can be approximated to any specific, non-zero amount of error by a Gaussian mixture model with enough components.
        Fig3.2
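To make the roles of the prior $\alpha_i$ and the posterior $P(\text{c}\mid x)$ concrete, here is a minimal one-dimensional sketch of a Gaussian mixture. The two components and their parameters are invented for illustration; the posterior is obtained by normalizing prior times likelihood.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Hypothetical two-component mixture: (prior alpha_i, mean, variance).
components = [(0.3, -2.0, 1.0), (0.7, 3.0, 0.5)]

def mixture_pdf(x):
    """p(x) = sum_i P(c = i) p(x | c = i)."""
    return sum(alpha * normal_pdf(x, mu, s2) for alpha, mu, s2 in components)

def posterior(x):
    """P(c = i | x): prior times likelihood, normalized by p(x)."""
    weighted = [alpha * normal_pdf(x, mu, s2) for alpha, mu, s2 in components]
    total = sum(weighted)
    return [w / total for w in weighted]

post_near_first = posterior(-2.0)  # a point at the first component's mean
```

A point close to one component's mean gets a posterior concentrated on that component, even when its prior weight is the smaller one.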

Useful Properties of Common Functions

  • logistic sigmoid
    • $\sigma(x)=\frac{1}{1+\exp(-x)}$
    • commonly used to produce the $\phi$ parameter of a Bernoulli distribution, because its range is $(0,1)$
      Fig3.3
    • properties:
      • $\sigma(x)=\frac{\exp(x)}{\exp(x)+\exp(0)}$
      • $\frac{d}{dx}\sigma(x)=\sigma(x)(1-\sigma(x))$
      • $1-\sigma(x)=\sigma(-x)$
      • $\forall x\in(0,1),\ \sigma^{-1}(x)=\log\big(\frac{x}{1-x}\big)$
  • softplus
    • $\zeta(x)=\log(1+\exp(x))$
    • useful for producing the $\beta$ or $\sigma$ parameter of a normal distribution, because its range is $(0,\infty)$
      Fig3.4
    • properties:
      • $\forall x>0,\ \zeta^{-1}(x)=\log(\exp(x)-1)$
      • $\zeta(x)-\zeta(-x)=x$
  • properties relating the sigmoid and the softplus:
    • $\log\sigma(x)=-\zeta(-x)$
    • $\frac{d}{dx}\zeta(x)=\sigma(x)$
    • $\zeta(x)=\int_{-\infty}^x\sigma(y)dy$
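The identities listed above can be spot-checked numerically. This is a sketch using the standard `math` module; the overflow-avoiding rewrites inside `sigmoid` and `softplus` are a common implementation choice assumed here, not something stated in the notes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid; branches on the sign of x to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    """log(1 + exp(x)), computed as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logit(p):
    """Inverse of the sigmoid for p in (0, 1)."""
    return math.log(p / (1.0 - p))

def softplus_inverse(y):
    """Inverse of the softplus for y > 0: log(exp(y) - 1)."""
    return math.log(math.expm1(y))

def max_identity_error(points):
    """Largest absolute violation of the listed identities over the given points."""
    h = 1e-5
    worst = 0.0
    for x in points:
        errors = [
            1.0 - sigmoid(x) - sigmoid(-x),                                # 1 - s(x) = s(-x)
            softplus(x) - softplus(-x) - x,                                # z(x) - z(-x) = x
            math.log(sigmoid(x)) + softplus(-x),                           # log s(x) = -z(-x)
            (softplus(x + h) - softplus(x - h)) / (2.0 * h) - sigmoid(x),  # z'(x) = s(x)
            logit(sigmoid(x)) - x,                                         # inverse sigmoid
            softplus_inverse(softplus(x)) - x,                             # inverse softplus
        ]
        worst = max(worst, max(abs(e) for e in errors))
    return worst

worst_error = max_identity_error([-3.0, -0.5, 0.0, 1.2, 4.0])
```

The derivative identity is checked with a central finite difference, so its error is limited by the step size `h` rather than by the identity itself.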

Bayes’ Rule

  • $P(\text{x}\mid\text{y})=\frac{P(\text{x})P(\text{y}\mid\text{x})}{P(\text{y})}$, where $P(\text{y})=\sum_x P(\text{y}\mid x)P(x)$
  • follows directly from the definition of conditional probability.
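Bayes' rule can be illustrated with a small discrete example. The numbers below (a 1% prior and the test accuracies) are hypothetical and chosen only to show the computation; `bayes_posterior` is an illustrative helper name.

```python
def bayes_posterior(prior, likelihood, y):
    """Return P(x | y) for every x, given P(x) and P(y | x).

    prior: dict mapping x -> P(x)
    likelihood: dict mapping x -> dict mapping y -> P(y | x)
    """
    # P(y) = sum over x of P(y | x) P(x)
    evidence = sum(likelihood[x][y] * prior[x] for x in prior)
    return {x: prior[x] * likelihood[x][y] / evidence for x in prior}

# Hypothetical numbers, for illustration only.
prior = {"sick": 0.01, "healthy": 0.99}
likelihood = {
    "sick": {"positive": 0.95, "negative": 0.05},
    "healthy": {"positive": 0.10, "negative": 0.90},
}

post = bayes_posterior(prior, likelihood, "positive")
```

With these made-up numbers the posterior probability of "sick" after a positive result stays below 10%, because the small prior dominates.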

Technical Details of Continuous Variables

  • Measure theory
    • purpose: measure theory provides a rigorous way of describing theorems that apply to most points in $\mathbb{R}^n$ but do not apply to some corner cases.
    • measure zero: a rigorous way of describing that a set of points is negligibly small.
    • almost everywhere: a property that holds almost everywhere holds throughout all of space except on a set of measure zero. Some important results in probability theory hold for all discrete values but only hold “almost everywhere” for continuous values.
    • Change of variables: for $\vec{x}$ and $\vec{y}$ with $\vec{y}=g(\vec{x})$, where $g$ is invertible, continuous and differentiable, the scalar case is $p_x(x)=p_y(g(x))\left|\frac{\partial g(x)}{\partial x}\right|$.
    • For higher dimensions, $p_x(\vec{x})=p_y(g(\vec{x}))\left|\det\left(\frac{\partial g(\vec{x})}{\partial\vec{x}}\right)\right|$
    • Jacobian matrix of $g$: $J_{i,j}=\frac{\partial y_i}{\partial x_j}$, where $\vec{y}=g(\vec{x})$.
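To see why the derivative factor is needed, consider a scalar example with $\text{x}\sim U(0,1)$ and $y=g(x)=x/2$ (chosen here for illustration). Applying the scalar change-of-variables formula:

```latex
\begin{aligned}
p_x(x) &= 1 \quad \text{for } x\in[0,1]\\
y = g(x) &= \tfrac{x}{2}, \qquad \left|\tfrac{\partial g(x)}{\partial x}\right| = \tfrac{1}{2}\\
p_x(x) = p_y(g(x))\left|\tfrac{\partial g(x)}{\partial x}\right|
  &\;\Rightarrow\; p_y\!\left(\tfrac{x}{2}\right) = 2 \quad \text{for } \tfrac{x}{2}\in\left[0,\tfrac{1}{2}\right]\\
\int_0^{1/2} p_y(y)\,dy &= 2\cdot\tfrac{1}{2} = 1
\end{aligned}
```

Without the factor, one would get $p_y(y)=p_x(2y)=1$ on $[0,\frac{1}{2}]$, which integrates to $\frac{1}{2}$ and is therefore not a valid density.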
  • Information theory
    • information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes.
    • quantifying information, based on three intuitions:
      • Likely events should have low information content
      • Less likely events should have higher information content
      • Independent events should have additive information
    • Self-information of an event $\text{x}=x$
      • $I(x)=-\log P(x)$, measured in nats when $\log$ is the natural logarithm.
    • Shannon entropy
      • $H(\text{x})=\mathbb{E}_{\text{x}\sim P}[I(x)]=-\mathbb{E}_{\text{x}\sim P}[\log P(x)]$
      • the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
      • when $\text{x}$ is continuous, the Shannon entropy is known as the differential entropy.
        Fig3.5
    • Kullback-Leibler (KL) divergence
      • $D_{\text{KL}}(P\|Q)=\mathbb{E}_{\text{x}\sim P}\big[\log\frac{P(x)}{Q(x)}\big]=\mathbb{E}_{\text{x}\sim P}[\log P(x)-\log Q(x)]$
      • The KL divergence is non-negative, and is 0 if and only if $P$ and $Q$ are the same distribution for discrete variables, or equal “almost everywhere” for continuous variables.
      • it is not symmetric: for some $P$ and $Q$, $D_{\text{KL}}(P\|Q)\neq D_{\text{KL}}(Q\|P)$
        Fig3.6
    • Cross-entropy
      • $H(P,Q)=H(P)+D_{\text{KL}}(P\|Q)$
      • namely, $H(P,Q)=-\mathbb{E}_{\text{x}\sim P}\log Q(x)$
    • Note: by convention, $0\log 0=\lim_{x\rightarrow 0}x\log x=0$
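The three quantities above, together with the $0\log 0=0$ convention, can be computed for discrete distributions as follows. This is a sketch with made-up distributions `p` and `q` given as probability lists over the same states; returning infinity when $Q(x)=0$ but $P(x)>0$ is the usual convention and is an assumption not stated in the notes.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats; terms with P(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 log 0 = 0
        if qx == 0:
            return math.inf  # P puts mass where Q has none
        total += px * (math.log(px) - math.log(qx))
    return total

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log(qx)
    return total

# Hypothetical distributions over three states.
p = [0.5, 0.5, 0.0]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)  # finite
kl_qp = kl_divergence(q, p)  # infinite, since p assigns zero mass to the third state
```

The pair `kl_pq` and `kl_qp` shows the asymmetry of the KL divergence, and `cross_entropy(p, q)` equals `entropy(p) + kl_pq` up to rounding.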
  • Structured Probabilistic Models (graphical models)
    • we represent the factorization of a probability distribution with a graph
    • Directed
      • use graphs with directed edges, and represent factorizations into conditional probability distributions
      • $p(\bold{x})=\prod_i p(\text{x}_i\mid Pa_{\mathcal{G}}(\text{x}_i))$, where $Pa_{\mathcal{G}}(\text{x}_i)$ denotes the parents of $\text{x}_i$ in the graph $\mathcal{G}$; each factor is the conditional distribution over $\text{x}_i$ given its parents.
        Fig3.7
    • Undirected
      • use graphs with undirected edges, and represent factorizations into a set of functions, which are usually not probability distributions of any kind.
      • $p(\bold{x})=\frac{1}{Z}\prod_i\phi^{(i)}(\mathcal{C}^{(i)})$, where $\mathcal{C}^{(i)}$ is a clique, a set of nodes that are all connected to each other in $\mathcal{G}$; $\phi^{(i)}(\mathcal{C}^{(i)})$ is a non-negative factor, which is not a probability distribution; and $Z$ is the normalizing constant, defined as the sum or integral over all states of the product of the factors.
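The role of the normalizing constant $Z$ in an undirected model can be seen by brute-force enumeration on a tiny example. The sketch below assumes a hypothetical model over three binary variables a, b, c with cliques {a, b} and {b, c}; the factor values are arbitrary.

```python
from itertools import product

# Hypothetical non-negative factors; they are not normalized distributions.
def phi_ab(a, b):
    return 3.0 if a == b else 1.0

def phi_bc(b, c):
    return 2.0 if b == c else 1.0

def unnormalized(a, b, c):
    """Product of the clique factors for one joint state."""
    return phi_ab(a, b) * phi_bc(b, c)

# Z sums the product of factors over every joint state.
states = list(product([0, 1], repeat=3))
Z = sum(unnormalized(a, b, c) for a, b, c in states)

def p(a, b, c):
    """p(a, b, c) = (1 / Z) * phi_ab(a, b) * phi_bc(b, c)."""
    return unnormalized(a, b, c) / Z

total_probability = sum(p(a, b, c) for a, b, c in states)
```

Enumeration is only feasible for very small models, since the number of joint states grows exponentially with the number of variables.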