Convex Optimization Reading Notes (6)

Chapter 7: Statistical estimation

7.1 Parametric distribution estimation

7.1.1 Maximum likelihood estimation

Define the log-likelihood function, denoted $l$:
$$l(x)=\log p_x(y)$$
A widely used method, called maximum likelihood (ML) estimation, is to estimate $x$ as
$$\hat{x}_{\rm ml}=\arg\max_x p_x(y)=\arg\max_x l(x)$$
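For concreteness (this example is not in the notes), suppose the measurements follow a linear model $y = Ax + v$ with IID Gaussian noise; then $l(x)$ is a negatively scaled sum of squares plus a constant, and the ML estimate is the least-squares solution. A minimal numpy sketch, with $A$, the true $x$, and the noise level all made up for illustration:

```python
import numpy as np

# Hypothetical linear Gaussian model y = A x + v, v_i ~ N(0, sigma^2) IID.
# Then l(x) = -(1 / (2 sigma^2)) * ||A x - y||_2^2 + const, so maximizing
# the log-likelihood is the same as solving a least-squares problem.
rng = np.random.default_rng(0)
n, m, sigma = 3, 50, 0.1
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + sigma * rng.standard_normal(m)

x_ml, *_ = np.linalg.lstsq(A, y, rcond=None)
print("ML (least-squares) estimate:", x_ml)
```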

7.1.2 Maximum a posteriori probability estimation

Maximum a posteriori probability (MAP) estimation can be considered a Bayesian version of maximum likelihood estimation, with a prior probability density on the underlying parameter $x$. We assume that $x$ (the vector to be estimated) and $y$ (the observation) are random variables with a joint probability density $p(x,y)$.
The prior density of $x$ is given by
$$p_x(x)=\int p(x,y)\,dy$$
Similarly,
$$p_y(y)=\int p(x,y)\,dx$$
The conditional density of $y$, given $x$, is given by
$$p_{y\mid x}(x,y)=\frac{p(x,y)}{p_x(x)}$$
In the MAP estimation method, our estimate of $x$, given the observation $y$, is given by
$$\begin{aligned} \hat{x}_{\rm map} &= \arg\max_x p_{x\mid y}(x,y)\\ &= \arg\max_x p_{y\mid x}(x,y)\,p_x(x)\\ &= \arg\max_x p(x,y) \end{aligned}$$
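Continuing the same made-up Gaussian linear model from above, adding a Gaussian prior $x \sim \mathcal{N}(0, \tau^2 I)$ turns the MAP problem into Tikhonov-regularized (ridge) least squares. A sketch under those assumptions:

```python
import numpy as np

# Same hypothetical model y = A x + v with v_i ~ N(0, sigma^2), plus a prior
# x ~ N(0, tau^2 I).  Maximizing log p(x, y) = log p_{y|x} + log p_x gives
#   minimize  (1 / sigma^2) ||A x - y||^2 + (1 / tau^2) ||x||^2,
# whose solution is the regularized normal equation below.
rng = np.random.default_rng(0)
n, m, sigma, tau = 3, 50, 0.1, 1.0
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + sigma * rng.standard_normal(m)

lam = (sigma / tau) ** 2  # relative weight of the prior term
x_map = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
print("MAP (ridge) estimate:", x_map)
```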

7.2 Nonparametric distribution estimation

7.3 Optimal detector design and hypothesis testing

Suppose $X$ is a random variable with values in $\{1,\dots,n\}$, with a distribution that depends on a parameter $\theta \in \{1,\dots,m\}$. The distributions of $X$, for the $m$ possible values of $\theta$, can be represented by a matrix $P \in \mathbf{R}^{n\times m}$, with elements
$$p_{kj}=\mathbf{prob}(X=k\mid \theta=j)$$
The $j$th column of $P$ gives the probability distribution associated with the parameter value $\theta = j$. The $m$ values of $\theta$ are called hypotheses, and guessing which hypothesis is correct is called hypothesis testing.

7.3.1 Deterministic and randomized detectors

A (deterministic) estimator or detector is a function $\psi$ from $\{1,\dots,n\}$ (the set of possible observed values) into $\{1,\dots,m\}$ (the set of hypotheses).
A randomized detector of $\theta$ is a random variable $\hat{\theta} \in \{1,\dots,m\}$. A randomized detector can be defined in terms of a matrix $T\in\mathbf{R}^{m\times n}$ with
$$t_{ik} = \mathbf{prob}(\hat{\theta}=i\mid X=k)$$

7.3.2 Detection probability matrix

For the randomized detector defined by the matrix $T$, we define the detection probability matrix as $D = TP$. We have
$$D_{ij}=(TP)_{ij}=\mathbf{prob}(\hat{\theta}=i\mid \theta=j)$$
so $D_{ij}$ is the probability of guessing $\hat{\theta} = i$, when in fact $\theta = j$.
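A tiny numerical illustration (the matrices are made up): with $n = 3$ outcomes and $m = 2$ hypotheses, encode a deterministic maximum-likelihood detector as a 0/1 matrix $T$ and compute $D = TP$:

```python
import numpy as np

# Column j of P is the distribution of X under hypothesis theta = j.
P = np.array([[0.70, 0.10],
              [0.20, 0.10],
              [0.10, 0.80]])
n, m = P.shape

# Deterministic ML detector: for each outcome k, guess the hypothesis with
# the largest likelihood P[k, j]; encoded as a 0/1 matrix T of shape (m, n).
T = np.zeros((m, n))
T[np.argmax(P, axis=1), np.arange(n)] = 1.0

D = T @ P  # D[i, j] = prob(theta_hat = i | theta = j)
print(D)   # diagonal: detection probabilities, off-diagonal: error probabilities
```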

7.3.3 Optimal detector design

7.3.4 Multicriterion formulation and scalarization

The optimal detector design problem can be considered a multicriterion problem, with the $m(m-1)$ objectives given by the off-diagonal entries of $D$, which are the probabilities of the different types of detection error:
$$\begin{aligned} {\rm minimize\ (w.r.t.\ } \mathbf{R}^{m(m-1)}_+) \quad & D_{ij},\ \ i,j=1,\dots,m,\ \ i\ne j \\ {\rm subject\ to} \quad & t_k\succeq0,\ \ \mathbf{1}^T t_k=1,\ \ k=1,\dots,n \end{aligned}$$
Here $t_k$ denotes the $k$th column of $T$.
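Scalarizing with a positive weight on every off-diagonal entry of $D$ gives a linear program in the entries of $T$. A hedged sketch using the cvxpy modeling package (cvxpy is not mentioned in the notes), with the same made-up $P$ as above and equal weights on both error types:

```python
import cvxpy as cp
import numpy as np

P = np.array([[0.70, 0.10],
              [0.20, 0.10],
              [0.10, 0.80]])
n, m = P.shape
W = np.ones((m, m)) - np.eye(m)         # equal weight on every off-diagonal D_ij

T = cp.Variable((m, n), nonneg=True)    # column t_k is a distribution over guesses
D = T @ P                               # detection probability matrix
constraints = [cp.sum(T, axis=0) == 1]  # 1^T t_k = 1 for each outcome k
cp.Problem(cp.Minimize(cp.sum(cp.multiply(W, D))), constraints).solve()
print(np.round(D.value, 3))
```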

7.3.5 Binary hypothesis testing

As an illustration, we consider the special case $m = 2$, which is called binary hypothesis testing.

7.3.6 Robust detectors

We define the worst-case detection probability matrix $D^{\rm wc}$ as
$$D^{\rm wc}_{ij}=\sup_{P\in\mathcal{P}}D_{ij},\ \ i,j=1,\dots,m,\ \ i\neq j$$
and
$$D^{\rm wc}_{ii}=\inf_{P \in \mathcal{P}}D_{ii},\ \ i=1,\dots,m$$

7.4 Chebyshev and Chernoff bounds

7.4.1 Chebyshev bounds

Chebyshev bounds give an upper bound on the probability of a set based on known expected values of certain functions. If $X$ is a random variable on $\mathbf{R}$ with $\mathbb{E}X=\mu$ and $\mathbb{E}(X-\mu)^2 =\sigma^2$, then we have $\mathbf{prob}(|X-\mu|\geq 1)\leq\sigma^2$, no matter what the distribution of $X$ is.
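A quick Monte Carlo sanity check of this bound, with an arbitrarily chosen exponential distribution (my own example, not from the book):

```python
import numpy as np

# Check prob(|X - mu| >= 1) <= sigma^2 for an exponential distribution with
# scale 0.5 (so mu = 0.5 and sigma^2 = 0.25), by simple simulation.
rng = np.random.default_rng(0)
X = rng.exponential(scale=0.5, size=1_000_000)
mu, sigma2 = X.mean(), X.var()
empirical = np.mean(np.abs(X - mu) >= 1)
print(f"prob(|X - mu| >= 1) ~ {empirical:.4f}  <=  sigma^2 ~ {sigma2:.4f}")
```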

7.4.2 Chernoff bounds

Let $X$ be a random variable on $\mathbf{R}$. The Chernoff bound states that
$$\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0}\mathbb{E}\,e^{\lambda(X-u)}$$
which can be expressed as
$$\log\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0} \{-\lambda u +\log \mathbb{E}\, e^{\lambda X} \}$$
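As a worked special case (not in the notes): for a standard Gaussian $X$, $\mathbb{E}\,e^{\lambda X} = e^{\lambda^2/2}$, and for $u \ge 0$ the infimum over $\lambda \ge 0$ is attained at $\lambda = u$, giving $\mathbf{prob}(X \ge u) \le e^{-u^2/2}$. A short comparison against the exact Gaussian tail:

```python
import numpy as np
from scipy.stats import norm

# Chernoff bound for X ~ N(0, 1): prob(X >= u) <= exp(-u^2 / 2) for u >= 0.
for u in [1.0, 2.0, 3.0]:
    bound = np.exp(-u**2 / 2)
    exact = norm.sf(u)  # exact upper tail probability
    print(f"u = {u}:  Chernoff bound {bound:.4e},  exact tail {exact:.4e}")
```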

7.4.3 Example

7.5 Experiment design

We consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or experiments
$$y_i=a_i^T x+w_i,\ \ i=1,\dots,m$$
where $w_i$ is measurement noise. The associated estimation error $e = \hat{x} - x$ (for the least-squares estimate $\hat{x}$, with IID unit-variance noise) has zero mean and covariance matrix
$$E=\mathbb{E}\,ee^T=\Bigl(\sum_{i=1}^{m}a_ia_i^T\Bigr)^{-1}$$
We suppose that the vectors $a_1,\dots,a_m$, which characterize the measurements, can be chosen among $p$ possible test vectors $v_1,\dots,v_p \in \mathbf{R}^n$. The goal of experiment design is to choose the vectors $a_i$, from among the possible choices, so that the error covariance $E$ is small (in some sense).
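For a fixed choice of the $a_i$ (here simply drawn at random for illustration, with unit noise variance assumed), the error covariance is a direct computation:

```python
import numpy as np

# Error covariance E = (sum_i a_i a_i^T)^{-1} for a made-up set of
# measurement vectors a_i (rows of A), assuming unit-variance noise.
rng = np.random.default_rng(0)
n, m = 3, 20
A = rng.standard_normal((m, n))
E = np.linalg.inv(A.T @ A)
print("error covariance E:\n", np.round(E, 4))
```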

7.5.1 The relaxed experiment design problem

In the case when $m$ is large compared to $n$, however, a good approximate solution can be found by ignoring, or relaxing, the constraint that the $m_i$ are integers. Here $m_i$ denotes the number of experiments for which the test vector $v_i$ is used, and $\lambda_i = m_i/m$ is the fraction of the total experiments allocated to $v_i$. The relaxed experiment design problem is
$$\begin{aligned} {\rm minimize\ (w.r.t.\ } \mathbf{S}^{n}_+) \quad & E=\frac{1}{m}\Bigl(\sum_{i=1}^p\lambda_iv_iv_i^T\Bigr)^{-1} \\ {\rm subject\ to} \quad & \lambda\succeq0,\ \ \mathbf{1}^T\lambda=1 \end{aligned}$$

7.5.2 Scalarizations

$D$-optimal design

The most widely used scalarization is called $D$-optimal design, in which we minimize the determinant of the error covariance matrix $E$:
$$\begin{aligned} {\rm minimize} \quad & \log \det \Bigl(\sum_{i=1}^p\lambda_iv_iv_i^T\Bigr)^{-1} \\ {\rm subject\ to} \quad & \lambda\succeq0,\ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
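A hedged cvxpy sketch of this problem (cvxpy and the random test vectors are my own additions); the objective is written equivalently as $-\log\det\bigl(\sum_i \lambda_i v_i v_i^T\bigr)$:

```python
import cvxpy as cp
import numpy as np

# Relaxed D-optimal design: minimize -log det(sum_i lambda_i v_i v_i^T)
# over the probability simplex, for made-up test vectors v_i.
rng = np.random.default_rng(0)
n, p = 3, 10
V = rng.standard_normal((p, n))   # rows are the candidate test vectors v_i

lam = cp.Variable(p, nonneg=True)
G = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
cp.Problem(cp.Minimize(-cp.log_det(G)), [cp.sum(lam) == 1]).solve()
print("D-optimal weights:", np.round(lam.value, 3))
```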

$E$-optimal design

In $E$-optimal design, we minimize the norm of the error covariance matrix, i.e., the maximum eigenvalue of $E$. The $E$-optimal experiment design problem can be cast as an SDP:
$$\begin{aligned} {\rm maximize} \quad & t \\ {\rm subject\ to} \quad & \sum_{i=1}^p\lambda_iv_iv_i^T \succeq tI \\ & \lambda\succeq0,\ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
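The same setup in cvxpy form (again an illustrative sketch with made-up test vectors, not the book's code), using the semidefinite constraint directly:

```python
import cvxpy as cp
import numpy as np

# Relaxed E-optimal design as an SDP: maximize t subject to
# sum_i lambda_i v_i v_i^T >= t I (in the PSD sense), lambda on the simplex.
rng = np.random.default_rng(0)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
t = cp.Variable()
G = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
constraints = [G - t * np.eye(n) >> 0, cp.sum(lam) == 1]
cp.Problem(cp.Maximize(t), constraints).solve()
print("E-optimal weights:", np.round(lam.value, 3))
```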

$A$-optimal design

In $A$-optimal experiment design, we minimize $\mathbf{tr}\,E$, the trace of the covariance matrix:
$$\begin{aligned} {\rm minimize} \quad & \mathbf{tr} \Bigl(\sum_{i=1}^p\lambda_iv_iv_i^T\Bigr)^{-1} \\ {\rm subject\ to} \quad & \lambda\succeq0,\ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
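A corresponding cvxpy sketch (same made-up test vectors); here $\mathbf{tr}(G^{-1})$ is expressed with the matrix_frac atom, since matrix_frac$(I, G) = \mathbf{tr}(I^T G^{-1} I) = \mathbf{tr}(G^{-1})$:

```python
import cvxpy as cp
import numpy as np

# Relaxed A-optimal design: minimize tr((sum_i lambda_i v_i v_i^T)^{-1}),
# written via matrix_frac(I, G) = tr(G^{-1}).
rng = np.random.default_rng(0)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
G = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
cp.Problem(cp.Minimize(cp.matrix_frac(np.eye(n), G)), [cp.sum(lam) == 1]).solve()
print("A-optimal weights:", np.round(lam.value, 3))
```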
