Machine Learning Theory - L1 Concentration Inequalities
This is the first lecture of the course Machine Learning Theory (AI603). It introduces some basic but useful concentration inequalities.
What are concentration inequalities?
Concentration inequalities provide bounds on how much a random variable deviates from some value (typically its expected value); in other words, they quantify how tightly the random variable is concentrated.
Why do we need them?
A classification algorithm has an error probability $\epsilon$, i.e., it misclassifies a randomly sampled data point with probability $\epsilon$. What is the probability that the algorithm misclassifies more than $200\epsilon$ data points among $100$ data points?
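A quick way to get intuition for this question is to simulate it. Below is a minimal Monte Carlo sketch (Python/NumPy; the value $\epsilon=0.05$ and all names are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Monte Carlo estimate of the motivating question: with n = 100 data points,
# each misclassified independently with probability eps, how often does the
# number of misclassified points exceed 200*eps (twice its expectation)?
rng = np.random.default_rng(0)
n, eps, trials = 100, 0.05, 100_000   # illustrative values

errors = rng.binomial(n=n, p=eps, size=trials)   # misclassification counts
estimate = np.mean(errors > 200 * eps)           # empirical tail probability
print(f"P(errors > {200 * eps:.0f}) ~ {estimate:.4f}")
```

The concentration inequalities below answer such questions analytically, without simulation.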
1 Markov Inequality
Theorem: If $X$ is a non-negative random variable,
$$\mathbb P(X\ge a)\le\frac{\mathbb E[X]}{a},\qquad\forall a>0.$$
Proof:
Let $f$ be the probability density function (PDF) of $X$.
$$\begin{aligned}\mathbb E[X]&=\int^\infty_0 xf(x)\,dx\\&\ge\int^\infty_a xf(x)\,dx\\&\ge a\int^\infty_a f(x)\,dx\\&=a\,\mathbb P(X\ge a)\end{aligned}$$
By rearranging the terms,
$$\mathbb P(X\ge a)\le\frac{\mathbb E[X]}{a}.$$
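For instance, applied to the motivating question from the introduction: let $N$ be the number of misclassified points among the $100$ samples, so that $\mathbb E[N]=100\epsilon$. Markov's inequality gives
$$\mathbb P(N\ge 200\epsilon)\le\frac{\mathbb E[N]}{200\epsilon}=\frac{100\epsilon}{200\epsilon}=\frac12,$$
which is rather loose; the Chernoff-type bounds below are exponentially sharper.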
A simple trick to turn an arbitrary random variable into a non-negative one: apply a non-negative, strictly increasing function such as $\phi(x)=e^{\theta x}$ with $\theta>0$.
Then, applying Markov's inequality yields the Chernoff bound:
$$\mathbb P(X\ge a)=\mathbb P(e^{X}\ge e^{a})=\mathbb P(e^{\theta X}\ge e^{\theta a})\le\frac{\mathbb E[e^{\theta X}]}{e^{\theta a}}\qquad\forall a,\ \theta>0.$$
When $X$ is the sum of independent random variables $X_1,\dots,X_n$,
$$\mathbb P(X\ge a)\le\inf_{\theta>0}\frac{\mathbb E[e^{\theta X}]}{e^{\theta a}}=\inf_{\theta>0}\frac{\prod^n_{i=1}\mathbb E[e^{\theta X_i}]}{e^{\theta a}}\qquad\forall a>0.$$
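As a small illustration, the infimum over $\theta$ can be approximated numerically. The sketch below (the helper `chernoff_bound`, the grid search, and the parameter values are illustrative assumptions, not part of the lecture) evaluates the bound for a sum of i.i.d. Bernoulli$(p)$ variables:

```python
import numpy as np

def chernoff_bound(n, p, a, thetas=np.linspace(1e-4, 10, 10_000)):
    """Grid-search approximation of inf_{theta>0} E[e^{theta X}] / e^{theta a}
    for X = X_1 + ... + X_n with X_i i.i.d. Bernoulli(p)."""
    # Work in log space: log(bound) = -theta*a + n*log(p*e^theta + 1 - p).
    log_bound = -thetas * a + n * np.log(p * np.exp(thetas) + (1 - p))
    return np.exp(log_bound.min())

# Illustrative example: n = 100 fair coins (p = 0.5), tail threshold a = 70.
print(chernoff_bound(n=100, p=0.5, a=70))
```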
2 Chernoff-Hoeffding
Theorem: Let $X_1,\dots,X_n$ be i.i.d. Bernoulli random variables with parameter $p$, i.e.,
$$f_X(x)=p^x(1-p)^{1-x}=\begin{cases}p&\text{if }x=1,\\1-p&\text{if }x=0.\end{cases}$$
When $0<p\le q$,
$$\mathbb P\Big(\sum^n_{i=1}X_i\ge nq\Big)\le e^{-nD(q\|p)},$$
where $D(q\|p)=q\log\frac{q}{p}+(1-q)\log\frac{1-q}{1-p}$ is the KL divergence between Bernoulli distributions with parameters $q$ and $p$. When $0<q\le p$,
$$\mathbb P\Big(\sum^n_{i=1}X_i\le nq\Big)\le e^{-nD(q\|p)}.$$
Proof:
From the Markov inequality, for all $\theta>0$,
$$\begin{aligned}\mathbb P\Big(\sum^n_{i=1}X_i\ge nq\Big)&\le\frac{\mathbb E[e^{\theta\sum^n_{i=1}X_i}]}{e^{\theta nq}}\\&=\frac{\prod^n_{i=1}\mathbb E[e^{\theta X_i}]}{e^{\theta nq}}\\&=\frac{\prod^n_{i=1}(pe^\theta+1-p)}{e^{\theta nq}}\\&=\exp(-\theta nq)(pe^{\theta}+1-p)^n.\end{aligned}$$
To minimize this bound over $\theta$, let $\phi(\theta)=\ln\big(\exp(-\theta nq)(pe^{\theta}+1-p)^n\big)$.
$$\begin{aligned}\phi(\theta)&=-\theta nq+n\ln(pe^\theta+1-p),\\\phi'(\theta)&=-nq+\frac{npe^\theta}{pe^\theta+1-p}=0\\\Longrightarrow\quad e^\theta&=\frac{q(1-p)}{p(1-q)}.\end{aligned}$$
Thus, with $\theta=\log\frac{q(1-p)}{p(1-q)}$ (which is positive since $q\ge p$), we have $pe^\theta+1-p=\frac{1-p}{1-q}$ and $e^{-\theta nq}=\big(\frac{p(1-q)}{q(1-p)}\big)^{nq}$, so
$$\mathbb P\Big(\sum^n_{i=1}X_i\ge nq\Big)\le\Big(\frac{p}{q}\Big)^{nq}\Big(\frac{1-p}{1-q}\Big)^{n(1-q)}=e^{-nD(q\|p)}.$$
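A hedged numerical check of this closed-form bound against the exact binomial tail (the helper `kl_bernoulli`, the use of `scipy.stats.binom`, and the parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import binom

def kl_bernoulli(q, p):
    """KL divergence D(q || p) between Bernoulli(q) and Bernoulli(p)."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

# Illustrative numbers: n = 100 samples with mean p = 0.3, threshold q = 0.5.
n, p, q = 100, 0.3, 0.5
bound = np.exp(-n * kl_bernoulli(q, p))
exact = binom.sf(np.ceil(n * q) - 1, n, p)   # exact P(sum >= n*q)
print(f"Chernoff-Hoeffding bound: {bound:.3e},  exact tail: {exact:.3e}")
```

The bound always lies above the exact tail probability, and the exponent $-nD(q\|p)$ captures the right decay rate up to factors polynomial in $n$.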
3 Hoeffding’s inequality
Lemma: Let $X\in[a,b]$ be a random variable. Then, for all $\theta>0$,
$$\mathbb E[e^{\theta(X-\mathbb E[X])}]\le\exp\Big(\frac{\theta^2(b-a)^2}{8}\Big).$$
Theorem: Let $X_1,\dots,X_n$ be independent random variables, where $X_i$ is bounded in $[a_i,b_i]$ for all $i$. Then
$$\mathbb P\Big(\frac1n\sum^n_{i=1}(X_i-\mathbb E[X_i])\ge t\Big)\le\exp\Big(-\frac{2n^2t^2}{\sum^n_{i=1}(b_i-a_i)^2}\Big).$$
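Before the proof, a minimal empirical sanity check of the theorem (assuming Uniform$[0,1]$ samples, so $a_i=0$, $b_i=1$ and the bound reduces to $\exp(-2nt^2)$; all values are illustrative):

```python
import numpy as np

# Empirical check of Hoeffding's inequality for i.i.d. Uniform[0, 1] samples.
rng = np.random.default_rng(0)
n, t, trials = 50, 0.1, 100_000   # illustrative values

samples = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = samples.mean(axis=1) - 0.5      # (1/n) * sum_i (X_i - E[X_i])
empirical = np.mean(deviations >= t)         # empirical tail probability
bound = np.exp(-2 * n * t ** 2)              # Hoeffding bound with b_i - a_i = 1
print(f"empirical tail: {empirical:.4f},  Hoeffding bound: {bound:.4f}")
```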
Proof:
Let $Y=X-\mathbb E[X]$.