C3-Probability and Information Theory

issory

Published 2018-10-20 19:06:56

Category: Deep Learning Note of Book

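Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.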

Article link: https://blog.csdn.net/u011310345/article/details/83215950

Possible sources of uncertainty

Inherent stochasticity in the system being modeled.
Incomplete observability.
Incomplete modeling.

Concepts

Note: In these notes, $\text{x}$ denotes a random variable and $x$ denotes one of its values.

probability==> degree of belief
frequentist probability==> directly related to the rates at which events occur.
Bayesian probability==> related to qualitative levels of certainty
random variable==> a variable that can take on different values randomly.
- discrete: has a finite or countably infinite number of states
- continuous: is associated with a real value.
probability distribution==> a description of how likely a random variable or set of random variables is to take on each of its possible states.
- probability mass function(PMF)==> a probability distribution over discrete variable
  - PMF maps from a state of random variable to the probability of that random variable taking on that state.
  - $P(\text{x}=x)$ or $\text{x}\sim P(\text{x})$
  - the domain of $P$ must be the set of all possible states of $\text{x}$
  - $\forall x\in \text{x},0\leq P(x)\leq 1$ .
  - $\sum\nolimits_{x\in\text{x}}P(x)=1$ .
- joint probability distribution==> a probability distribution over many variables
  - $P(\text{x}=x,\text{y}=y)$ or $P (x, y)$
- probability density function(PDF)==> a probability distribution over continuous random variable
  - the domain of $p$ must be the set of all possible states of $\text{x}$ .
  - $\forall x\in\text{x},p(x)\geq0$ .
  - $\int p(x)dx=1$ .
  - example (uniform distribution): $u (x; a, b)$ , where $b > a$ . For all $x\notin[a,b]$ , $u (x; a, b) = 0$ ; within $[a, b]$ , $u(x;a,b)=\frac{1}{b-a}$ . This is written $\text{x}\sim U(a,b)$ .
Marginal Probability
- The probability distribution over a subset of the variables in a joint distribution.
- For discrete random variables, given $P(\text{x},\text{y})$ , find $P(\text{x})$ with the sum rule: $\forall x\in\text{x},P(\text{x}=x)=\sum\limits_yP(\text{x}=x,\text{y}=y)$ .
- For continuous variables, $p(x)=\int p(x,y)dy$
Conditional Probability
- $P(\text{y}=y|\text{x}=x)=\frac{P(\text{y}=y,\text{x}=x)}{P(\text{x}=x)}$
- intervention query==> computing the consequences of an action. This belongs to the domain of causal modeling and should not be confused with conditional probability.
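
The sum rule and the definition of conditional probability can be checked on a small table. The sketch below is illustrative only: the joint distribution `joint` and the helper names `marginal_x` and `conditional_y_given_x` are made up for this example and do not come from the book.

```python
# Hypothetical joint distribution P(x, y) over x in {0, 1} and y in {"a", "b"}.
joint = {
    (0, "a"): 0.1,
    (0, "b"): 0.3,
    (1, "a"): 0.2,
    (1, "b"): 0.4,
}

def marginal_x(joint):
    """Sum rule: P(x) = sum_y P(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return p_x

def conditional_y_given_x(joint, x):
    """P(y | x) = P(x, y) / P(x); only defined when P(x) > 0."""
    p_x = marginal_x(joint)[x]
    if p_x == 0:
        raise ValueError("P(x) is zero, so P(y | x) is undefined")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_x = marginal_x(joint)
p_y_given_x0 = conditional_y_given_x(joint, 0)
print(p_x)
print(p_y_given_x0)
```

With these numbers, $P(\text{x}=0)$ is about 0.4 and $P(\text{y}=a|\text{x}=0)$ is about 0.25, up to floating-point rounding.
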
The Chain Rule of Conditional Probabilities
- $P(\text{x}^{(1)},\cdots,\text{x}^{(n)})=P(\text{x}^{(1)})\prod_{i=2}^nP(\text{x}^{(i)}|\text{x}^{(1)},\cdots,\text{x}^{(i-1)})$
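
As a worked instance of the chain rule for three variables (the names $\text{a},\text{b},\text{c}$ are chosen here only for illustration), applying the definition of conditional probability twice gives:

$$
\begin{aligned}
P(\text{a},\text{b},\text{c}) &= P(\text{c}\mid\text{a},\text{b})\,P(\text{a},\text{b})\\
P(\text{a},\text{b}) &= P(\text{b}\mid\text{a})\,P(\text{a})\\
P(\text{a},\text{b},\text{c}) &= P(\text{a})\,P(\text{b}\mid\text{a})\,P(\text{c}\mid\text{a},\text{b})
\end{aligned}
$$
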
Independence:
- $\forall x\in\text{x},y\in\text{y},p(\text{x}=x,\text{y}=y)=p(\text{x}=x)p(\text{y}=y)$
- Shorthand: $\text{x}\perp\text{y}$
Conditional Independence:
- $\forall x\in\text{x},y\in\text{y},z\in\text{z},p(\text{x}=x,\text{y}=y|\text{z}=z)=p(\text{x}=x|\text{z}=z)p(\text{y}=y|\text{z}=z)$
- Shorthand: $\text{x}\perp\text{y}|\text{z}$
Expectation
- For discrete variables, $\mathbb{E}_{\text{x}\sim P}[f(x)]=\sum\limits_xP(x)f(x)$ .
- For continuous variables, $\mathbb{E}_{\text{x}\sim p}[f(x)]=\int p(x)f(x)dx$ .
- linear: $\mathbb{E}_{\text{x}}[\alpha f(x)+\beta g(x)]=\alpha\mathbb{E}_{\text{x}}[f(x)]+\beta\mathbb{E}_{\text{x}}[g(x)]$
Variance
- $\text{Var}(f(x))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])^2\big]$
- the square root of the variance is known as the standard deviation.
Covariance
- $\text{Cov}(f(x),g(y))=\mathbb{E}[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])]$
- measures how much two values are linearly related to each other, as well as the scale of these variables.
- high absolute value:
  - the values change a lot
  - and are both far from their respective means at the same time
- positive: both variables tend to take on relatively high values simultaneously
- negative: one variable tends to be high when the other is low.
- relationship between covariance and independence: independence implies zero covariance, but zero covariance does not imply independence (covariance only rules out linear dependence).
- covariance matrix:
  - For a random vector $\mathbf{x}\in \mathbb{R}^n$
  - $\text{Cov}(\mathbf{x})_{i,j}=\text{Cov}(\text{x}_i,\text{x}_j)$ , the diagonal elements of the covariance: $\text{Cov}(\text{x}_i,\text{x}_i)=\text{Var}(\text{x}_i)$ .
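
A small discrete example can make the point that zero covariance does not imply independence. This is a sketch under assumed numbers: $\text{x}$ is taken to be uniform on $\{-1,0,1\}$ and $\text{y}=\text{x}^2$, and the helper names are hypothetical.

```python
# x is uniform on {-1, 0, 1} and y = x**2, so y is fully determined by x.
states = [-1, 0, 1]
prob = {x: 1 / 3 for x in states}

def expectation(f):
    """E[f(x)] = sum_x P(x) f(x)."""
    return sum(prob[x] * f(x) for x in states)

def variance(f):
    """Var(f(x)) = E[(f(x) - E[f(x)])^2]."""
    mean = expectation(f)
    return expectation(lambda x: (f(x) - mean) ** 2)

def covariance(f, g):
    """Cov(f(x), g(x)) = E[(f(x) - E[f(x)]) (g(x) - E[g(x)])]."""
    mean_f, mean_g = expectation(f), expectation(g)
    return expectation(lambda x: (f(x) - mean_f) * (g(x) - mean_g))

def identity(x):
    return x

def square(x):
    return x ** 2

var_x = variance(identity)
cov_xy = covariance(identity, square)
print(var_x, cov_xy)
```

Here the covariance between $\text{x}$ and $\text{y}=\text{x}^2$ comes out as zero (up to rounding) even though $\text{y}$ is a deterministic function of $\text{x}$, because the dependence is not linear.
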

Common Probability Distributions

Bernoulli Distribution
- a distribution over a single binary random variable.
- a single parameter $\phi\in[0,1]$ gives the probability of the random variable being equal to 1.
  - $P(\text{x}=1)=\phi$ .
  - $P(\text{x}=0)=1-\phi$ .
  - $P(\text{x}=x)=\phi^x(1-\phi)^{1-x}$
  - $\mathbb{E}_{\text{x}}[\text{x}]=\phi$
  - $\text{Var}_{\text{x}}(\text{x})=\phi(1-\phi)$
Multinoulli Distribution
- a distribution over a single discrete variable with $k$ different states, where $k$ is finite.
- a vector parameter $\vec{p}\in[0,1]^{k-1}$ , where $p_i$ gives the probability of the i-th state.
- the k-th state's probability: $1-\bold{1}^T\vec{p}$ .
- constraint: $\bold{1}^T\vec{p}\leq1$
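
To illustrate the $k-1$ parameterization, here is a minimal sketch that expands the parameter vector into all $k$ state probabilities and draws samples. The function names and the parameter values are assumptions made for this example; states are indexed from 0.

```python
import random

def full_probabilities(p):
    """Expand the k-1 parameters p into the probabilities of all k states."""
    total = sum(p)
    if any(p_i < 0.0 for p_i in p) or total > 1.0:
        raise ValueError("need p_i >= 0 and sum(p) <= 1")
    # The last state receives the remaining probability mass 1 - 1^T p.
    return list(p) + [1.0 - total]

def sample_multinoulli(p, rng):
    """Draw one state index in {0, ..., k-1}."""
    probs = full_probabilities(p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
p = [0.2, 0.3]  # hypothetical parameters, so k = 3
probs = full_probabilities(p)
draws = [sample_multinoulli(p, rng) for _ in range(10_000)]
frequencies = [draws.count(state) / len(draws) for state in range(len(probs))]
print(probs, frequencies)
```

The empirical frequencies should land close to the expanded probabilities, with the last state taking the leftover mass.
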
Gaussian Distribution
- $\mathcal{N}(x;\mu,\sigma^2)=\sqrt{\frac{1}{2\pi\sigma^2}}\exp\big(-\frac{1}{2\sigma^2}(x-\mu)^2\big)$
- Two parameters $\mu\in\mathbb{R}$ and $\sigma\in(0,\infty)$
  - $\mu$ gives the coordinate of the central peak
  - $\mathbb{E}[\text{x}]=\mu$
  - $\sigma$ is the standard deviation of the distribution
  - $\sigma^2$ is the variance
- when the PDF must be evaluated often, it is more efficient to parameterize it by the precision (inverse variance) $\beta\in(0,\infty)$
  - $\mathcal{N}(x;\mu,\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp\big(-\frac{1}{2}\beta(x-\mu)^2\big)$
- reasons why the normal distribution is a good default choice
  - many distributions we wish to model are truly close to being normal distributions.
  - Central limit theorem: the sum of many independent random variables is approximately normally distributed.
  - out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
- generalizes to $\mathbb{R}^n$ : multivariate normal distribution.
  - a positive definite symmetric matrix parameter $\bold{\Sigma}$
  - $\mathcal{N}(\vec{x};\vec{\mu},\bold{\Sigma})=\sqrt{\frac{1}{(2\pi)^n\det(\bold{\Sigma})}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\Sigma}^{-1}(\vec{x}-\vec{\mu})\big)$ , where the vector $\vec{\mu}$ is the mean of the distribution and $\bold{\Sigma}$ is its covariance matrix.
  - use a precision matrix $\bold{\beta}$ :
    - $\mathcal{N}(\vec{x};\vec{\mu},\bold{\beta}^{-1})=\sqrt{\frac{\det(\bold{\beta})}{(2\pi)^n}}\exp\big(-\frac{1}{2}(\vec{x}-\vec{\mu})^T\bold{\beta}(\vec{x}-\vec{\mu})\big)$
  - isotropic Gaussian distribution: covariance matrix is a scalar times the identity matrix.
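
The variance and precision parameterizations of the univariate Gaussian describe the same density. The sketch below (function names are hypothetical) evaluates both forms and does a crude numerical check that the density integrates to about 1.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density N(x; mu, sigma^2), parameterized by the variance sigma^2."""
    return math.sqrt(1.0 / (2.0 * math.pi * sigma2)) * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """Density N(x; mu, beta^{-1}), parameterized by the precision beta = 1 / sigma^2."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

# Same point, same distribution, two parameterizations (sigma^2 = 4, beta = 0.25).
a = normal_pdf(0.3, 1.0, 4.0)
b = normal_pdf_precision(0.3, 1.0, 0.25)

# Crude Riemann-sum check over [-10, 10] for the standard normal.
step = 0.001
total = sum(normal_pdf(-10.0 + i * step, 0.0, 1.0) * step for i in range(20001))
print(a, b, total)
```

The precision form avoids a division inside the exponent, which is the efficiency argument for using $\beta$ when the density is evaluated many times.
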
Exponential Distribution
- a probability distribution with a sharp point at $x = 0$
- $p(x;\lambda)=\lambda\bold{1}_{x\geq0}\exp(-\lambda x)$ , where $\bold{1}_{x\geq0}$ is to assign probability zero to all negative values of $x$ .
Laplace Distribution
- place a sharp peak of probability mass at an arbitrary point $\mu$ .
- $\text{Laplace}(x;\mu,\gamma)=\frac{1}{2\gamma}\exp(-\frac{|x-\mu|}{\gamma})$
Dirac Distribution
- $p(x)=\delta(x-\mu)$
Empirical Distribution
- $\hat{p}(\vec{x})=\frac{1}{m}\sum\limits_{i=1}^m\delta(\vec{x}-\vec{x}^{(i)})$
- The Dirac delta distribution is only needed to define the empirical distribution over continuous variables
- For discrete variables, an empirical distribution can be conceptualized as a multinoulli distribution.
Mixture Distributions
- made up of several component distributions
- $P(\text{x})=\sum\limits_iP(\text{c}=i)P(\text{x}|\text{c}=i)$ , where $P(\text{c})$ is the multinoulli distribution over component identities. This is a simple strategy for combining distributions.
- a latent variable is a random variable that we cannot observe directly; the component identity $\text{c}$ is an example.
- Gaussian mixture model: a universal approximator of densities
  - prior probability: $\alpha_i=P(\text{c}=i)$
  - posterior probability: $P(\text{c}|\vec{x})$
  - any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.
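
The roles of the prior $\alpha_i$, the latent component $\text{c}$, and the posterior $P(\text{c}|x)$ can be sketched for a one-dimensional Gaussian mixture. The two components and all function names below are assumptions made for illustration.

```python
import math
import random

# Hypothetical two-component mixture: (prior alpha_i, mean, standard deviation).
components = [(0.3, -2.0, 0.5), (0.7, 1.0, 1.0)]

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def sample(rng):
    """Draw the latent component c from the prior, then x from that component."""
    weights = [alpha for alpha, _, _ in components]
    _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mu, sigma)

def density(x):
    """p(x) = sum_i P(c = i) p(x | c = i)."""
    return sum(alpha * normal_pdf(x, mu, sigma) for alpha, mu, sigma in components)

def posterior(x):
    """P(c = i | x) for every component i, via Bayes' rule."""
    p_x = density(x)
    return [alpha * normal_pdf(x, mu, sigma) / p_x for alpha, mu, sigma in components]

rng = random.Random(0)
draws = [sample(rng) for _ in range(5)]
post = posterior(-2.0)
print(draws)
print(post)
```

At $x=-2$ the posterior puts almost all of its mass on the first component, even though that component has the smaller prior, because the observation sits at its mean.
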

Useful Properties of Common Functions

logistic sigmoid
- $\sigma(x)=\frac{1}{1+\exp{(-x)}}$
- produce the $\phi$ parameter of a Bernoulli distribution
- properties:
  - $\sigma(x)=\frac{\exp(x)}{\exp(x)+\exp(0)}$
  - $\frac{d}{dx}\sigma(x)=\sigma(x)(1-\sigma(x))$
  - $1-\sigma(x)=\sigma(-x)$
  - $\forall x\in (0,1),\sigma^{-1}(x)=\log(\frac{x}{1-x})$
softplus
- $\zeta(x)=\log(1+\exp(x))$
- produce the $\beta$ or $\sigma$ parameter of a normal distribution
- properties:
  - $\forall x>0,\zeta^{-1}(x)=\log(\exp(x)-1)$
  - $\zeta(x)-\zeta(-x)=x$
properties relating sigmoid and softplus:
- $\log\sigma(x)=-\zeta(-x)$
- $\frac{d}{dx}\zeta(x)=\sigma(x)$
- $\zeta(x)=\int^x_{-\infty}\sigma(y)dy$
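
The identities above are easy to check numerically. The sketch below uses overflow-safe formulations of both functions; these particular formulations are a common implementation choice assumed here, not something stated in the notes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    """softplus(x) = log(1 + exp(x)), computed as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

x = 1.7
h = 1e-6
# Each entry is the absolute error of one identity at the test point x.
errors = {
    "1 - sigma(x) = sigma(-x)": abs((1 - sigmoid(x)) - sigmoid(-x)),
    "log sigma(x) = -zeta(-x)": abs(math.log(sigmoid(x)) + softplus(-x)),
    "zeta(x) - zeta(-x) = x": abs(softplus(x) - softplus(-x) - x),
    "d/dx zeta(x) = sigma(x)": abs((softplus(x + h) - softplus(x - h)) / (2 * h) - sigmoid(x)),
    "d/dx sigma(x) = sigma(x)(1 - sigma(x))": abs(
        (sigmoid(x + h) - sigmoid(x - h)) / (2 * h) - sigmoid(x) * (1 - sigmoid(x))
    ),
}
for name, err in errors.items():
    print(name, err)
```

The derivative identities are checked with central finite differences, so their errors are small but not at machine precision.
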

Bayes’ Rule

$P(\text{x}|\text{y})=\frac{P(\text{x})P(\text{y}|\text{x})}{P(\text{y})},P(\text{y})=\sum_xP(\text{y}|x)P(x)$
Bayes' rule is derived directly from the definition of conditional probability.
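
As a worked example of Bayes' rule, consider a diagnostic test. All numbers below (1% prevalence, 95% sensitivity, 10% false-positive rate) and the function name `posterior` are hypothetical and chosen only to show the computation.

```python
def posterior(prior, likelihood):
    """Return P(x | y) for every x, given P(x) and P(y | x) for one observed y."""
    # P(y) = sum_x P(y | x) P(x)
    evidence = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / evidence for x in prior}

prior = {"ill": 0.01, "healthy": 0.99}               # P(x)
likelihood_positive = {"ill": 0.95, "healthy": 0.10}  # P(y = positive | x)

post = posterior(prior, likelihood_positive)
print(post)
```

With these assumed numbers, the probability of being ill given a positive test is only about 0.088, because the low prior outweighs the fairly accurate test.
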

Technical Details of Continuous Variables

Measure theory
- purposes: measure theory is more useful for describing theorems that apply to most points in $\mathbb{R}^n$ but do not apply to some corner cases.
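- measure zero: a rigorous way of describing that a set of points is negligibly small
- almost everywhere: a property that holds almost everywhere holds throughout all of space except on a set of measure zero. Some important results in probability theory hold for all discrete values but only hold "almost everywhere" for continuous values
- change of variables: for random variables $\bold{x}$ and $\bold{y}$ with $\vec{y}=g(\vec{x})$ , where $g$ is invertible, continuous, and differentiable, the scalar case gives $p_x(x)=p_y(g(x))\left|\frac{\partial g(x)}{\partial x}\right|$
- For higher dimensions, $p_x(\vec{x})=p_y(g(\vec{x}))\left|\det\left(\frac{\partial g(\vec{x})}{\partial \vec{x}}\right)\right|$
- Jacobian matrix of $g$ used in the determinant above: $J_{i,j}=\frac{\partial y_i}{\partial x_j}=\frac{\partial g_i(\vec{x})}{\partial x_j}$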
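
The scalar change-of-variables formula can be checked on a simple case. The sketch below assumes $\text{x}\sim U(0,1)$ and $g(x)=2x$; these choices and the function names are illustrative only.

```python
import random

def p_x(x):
    """Density of x ~ U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g(x):
    """y = g(x), an invertible and differentiable map."""
    return 2.0 * x

def g_prime(x):
    """Derivative of g, constant for this linear map."""
    return 2.0

def p_y(y):
    """Density of y = g(x): p_y(y) = p_x(g^{-1}(y)) / |g'(g^{-1}(y))|."""
    x = y / 2.0
    return p_x(x) / abs(g_prime(x))

# The formula p_x(x) = p_y(g(x)) |g'(x)| should hold at any x.
x = 0.3
lhs = p_x(x)
rhs = p_y(g(x)) * abs(g_prime(x))

# Monte Carlo check: P(0.5 <= y <= 1.0) should be near p_y * width = 0.5 * 0.5.
rng = random.Random(0)
samples = [g(rng.random()) for _ in range(100_000)]
fraction = sum(0.5 <= y <= 1.0 for y in samples) / len(samples)
print(lhs, rhs, fraction)
```

Stretching the interval by a factor of 2 halves the density, which is exactly what the $|g'(x)|$ factor accounts for; omitting it would give a $p_y$ that integrates to 2.
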

Information Theory

- information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes.
- quantify information:
  - Likely events should have low information content
  - Less likely events should have higher information content
  - Independent events should have additive information
- Self-information of an event $\text{x}=x$
  - $I(x)=-\log(P(x))$
- Shannon entropy
  - $H(\text{x})=\mathbb{E}_{\text{x}\sim P}[I(x)]=-\mathbb{E}_{\text{x}\sim P}[\log P(x)]$
  - the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
  - when $\text{x}$ is continuous, the Shannon entropy is known as the differential entropy.
- Kullback-Leibler (KL) divergence
  - $D_{\text{KL}}(P||Q)=\mathbb{E}_{\text{x}\sim P}[\log\frac{P(x)}{Q(x)}]=\mathbb{E}_{\text{x}\sim P}[\log P(x)-\log Q(x)]$
  - KL divergence is non-negative, and is 0 if and only if $P$ and $Q$ are the same distribution for discrete variables, or equal "almost everywhere" for continuous variables.
  - for some $P$ and $Q$ , $D_{\text{KL}}(P||Q)\neq D_{\text{KL}}(Q||P)$
- Cross-entropy
  - $H(P,Q)=H(P)+D_{\text{KL}}(P||Q)$
  - namely, $H(P,Q)=-\mathbb{E}_{\text{x}\sim P}\log Q(x)$
- Note: $0\log 0=\lim_{x\rightarrow0}x\log x=0$
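
For discrete distributions, the quantities above reduce to short sums. The sketch below uses natural logarithms (so results are in nats) and two made-up distributions `p` and `q`; it is meant to illustrate the identity $H(P,Q)=H(P)+D_{\text{KL}}(P||Q)$ and the asymmetry of the KL divergence.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), with 0 log 0 treated as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); assumes Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three states.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

h_p = entropy(p)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
h_pq = cross_entropy(p, q)
print(h_p, kl_pq, kl_qp, h_pq)
```

Since $H(P)$ does not depend on $Q$, minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence.
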

Structured Probabilistic Models (graphical models)

- we represent the factorization of a probability distribution with a graph
- Directed
  - use graphs with directed edges, represent factorizations into conditional probability distributions
  - $p(\bold{x})=\prod\limits_ip(\text{x}_i|Pa_{\mathcal{G}}(\text{x}_i))$ , where $Pa_{\mathcal{G}}(\text{x}_i)$ denotes the parents of $\text{x}_i$ in the graph $\mathcal{G}$ ; each factor is the conditional distribution of $\text{x}_i$ given its parents.
- Undirected
  - use graphs with undirected edges, represent factorizations into a set of functions, which are not probability distributions of any kind.
  - $p(\bold{x})=\frac{1}{Z}\prod_i\phi^{(i)}(\mathcal{C}^{(i)})$ , where $\mathcal{C}^{(i)}$ is a set of nodes that are all connected to each other in $\mathcal{G}$ and $\phi^{(i)}(\mathcal{C}^{(i)})$ is a factor, which is not a distribution function.
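
Both factorizations can be written out for three binary variables. The sketch below assumes a chain $\text{a}\rightarrow\text{b}\rightarrow\text{c}$ for the directed case and cliques $\{\text{a},\text{b}\}$ and $\{\text{b},\text{c}\}$ for the undirected case; every table and factor value is invented for illustration.

```python
from itertools import product

# Directed model a -> b -> c: p(a, b, c) = p(a) p(b | a) p(c | b).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed as [a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed as [b][c]

def p_directed(a, b, c):
    """Product of one conditional distribution per node."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Undirected model with cliques {a, b} and {b, c}: factors are non-negative
# scores, not probabilities, so a normalizing constant Z is required.
def phi_ab(a, b):
    return 2.0 if a == b else 1.0

def phi_bc(b, c):
    return 3.0 if b == c else 1.0

all_states = list(product([0, 1], repeat=3))
Z = sum(phi_ab(a, b) * phi_bc(b, c) for a, b, c in all_states)

def p_undirected(a, b, c):
    """Normalized product of clique factors."""
    return phi_ab(a, b) * phi_bc(b, c) / Z

total_directed = sum(p_directed(*s) for s in all_states)
total_undirected = sum(p_undirected(*s) for s in all_states)
print(total_directed, Z, total_undirected)
```

The directed product sums to 1 without any extra normalization because each factor is already a conditional distribution, whereas the undirected product only becomes a distribution after dividing by $Z$.
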
