Maximum Likelihood Estimation

Statement of the Problem

Suppose we have a random sample X1, X2, ..., Xn whose assumed probability distribution depends on some unknown parameter θ. Our primary goal here will be to find a point estimator u(X1, X2, ..., Xn), such that u(x1, x2, ..., xn) is a "good" point estimate of θ, where x1, x2, ..., xn are the observed values of the random sample. For example, if we plan to take a random sample X1, X2, ..., Xn for which the Xi are assumed to be normally distributed with mean μ and variance σ2, then our goal will be to find a good estimate of μ, say, using the data x1, x2, ..., xn that we obtained from our specific random sample.

The Basic Idea

It seems reasonable that a good estimate of the unknown parameter θ would be the value of θ that maximizes the probability, errrr... that is, the likelihood... of getting the data we observed. (So, do you see from where the name "maximum likelihood" comes?) So, that is, in a nutshell, the idea behind the method of maximum likelihood estimation. But how would we implement the method in practice? Well, suppose we have a random sample X1, X2, ..., Xn for which the probability density (or mass) function of each Xi is f(xi; θ). Then, the joint probability mass (or density) function of X1, X2, ..., Xn, which we'll (not so arbitrarily) call L(θ), is:

$$L(\theta) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = f(x_1;\theta)\cdot f(x_2;\theta)\cdots f(x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)$$

The first equality is of course just the definition of the joint probability mass function. The second equality comes from the fact that we have a random sample, which implies by definition that the Xi are independent. And, the last equality just uses the shorthand mathematical notation of a product of indexed terms. Now, in light of the basic idea of maximum likelihood estimation, one reasonable way to proceed is to treat the "likelihood function" L(θ) as a function of θ, and find the value of θ that maximizes it.
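In code, this definition is nothing more than a product over the sample. The following is a minimal Python sketch under my own assumptions (not part of the original text): the function names and the exponential model used for illustration are purely hypothetical choices.

```python
import numpy as np

def likelihood(theta, x, f):
    """L(theta) = product of f(x_i; theta) over the i.i.d. sample x."""
    return float(np.prod([f(xi, theta) for xi in x]))

# Hypothetical per-observation density: an exponential model with rate theta
def exp_pdf(xi, theta):
    return theta * np.exp(-theta * xi)

x = [0.5, 1.2, 0.3, 2.1]
print(likelihood(0.5, x, exp_pdf))  # L(0.5)
print(likelihood(1.0, x, exp_pdf))  # L(1.0)
```

Maximum likelihood estimation then amounts to searching for the value of theta at which this function is largest.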

Is this still sounding like too much abstract gibberish? Let's take a look at an example to see if we can make it a bit more concrete.

Example

Suppose we have a random sample X1, X2, ..., Xn where:

  • Xi = 0 if a randomly selected student does not own a sports car, and
  • Xi = 1 if a randomly selected student does own a sports car. 

Assuming that the Xi are independent Bernoulli random variables with unknown parameter p, find the maximum likelihood estimator of p, the proportion of students who own a sports car.

Solution. If the Xi are independent Bernoulli random variables with unknown parameter p, then the probability mass function of each Xi is:

$$f(x_i; p) = p^{x_i}(1-p)^{1-x_i}$$

for xi = 0 or 1 and 0 < p < 1. Therefore, the likelihood function L(p) is, by definition:

$$L(p) = \prod_{i=1}^{n} f(x_i; p) = p^{x_1}(1-p)^{1-x_1}\times p^{x_2}(1-p)^{1-x_2}\times\cdots\times p^{x_n}(1-p)^{1-x_n}$$

for 0 < p < 1. Simplifying, by summing up the exponents, we get:

$$L(p) = p^{\sum x_i}(1-p)^{\,n-\sum x_i}$$

Now, in order to implement the method of maximum likelihood, we need to find the p that maximizes the likelihood L(p). We need to put on our calculus hats now, since in order to maximize the function, we are going to need to differentiate the likelihood function with respect to p. In doing so, we'll use a "trick" that often makes the differentiation a bit easier.  Note that the natural logarithm is an increasing function of x:

[Figure: graph of the natural logarithm, an increasing function of x]

That is, if x1 < x2, then f(x1) < f(x2). That means that the value of p that maximizes the natural logarithm of the likelihood function ln(L(p)) is also the value of p that maximizes the likelihood function L(p). So, the "trick" is to take the derivative of ln(L(p)) (with respect to p) rather than taking the derivative of L(p). Again, doing so often makes the differentiation much easier. (By the way, throughout the remainder of this course, I will use either ln(L(p)) or log(L(p)) to denote the natural logarithm of the likelihood function.) 
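To see the "trick" in action numerically, here is a small sketch of my own (using a made-up 0/1 sample, not data from the text) showing that L(p) and ln(L(p)) peak at the same value of p:

```python
import numpy as np

# Hypothetical 0/1 sample; L(p) = p^sum(x) * (1-p)^(n - sum(x)), as derived above
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])
s, n = x.sum(), x.size

p = np.linspace(0.001, 0.999, 9999)
L = p**s * (1 - p)**(n - s)
logL = s * np.log(p) + (n - s) * np.log(1 - p)

# Both curves are maximized at the same grid point (here, near s/n = 0.625)
print(p[np.argmax(L)], p[np.argmax(logL)])
```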

In this case, the natural logarithm of the likelihood function is:

$$\log L(p) = \left(\sum x_i\right)\log(p) + \left(n - \sum x_i\right)\log(1-p)$$

Now, taking the derivative of the log likelihood, and setting to 0, we get:

$$\frac{\partial \log L(p)}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$$

Now, multiplying through by p(1−p), we get:

$$\left(\sum x_i\right)(1-p) - \left(n - \sum x_i\right)p = 0$$

Upon distributing, we see that two of the resulting terms cancel each other out:

$$\sum x_i - p\sum x_i - np + p\sum x_i = 0$$

leaving us with:

$$\sum x_i - np = 0$$

Now, all we have to do is solve for p. In doing so, you'll want to make sure that you always put a hat ("^") on the parameter, in this case p, to indicate it is an estimate:

$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$$

or, alternatively, an estimator:

$$\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Oh, and we should technically verify that we indeed did obtain a maximum. We can do that by verifying that the second derivative of the log likelihood with respect to p is negative. It is, but you might want to do the work to convince yourself!
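If you prefer to check the algebra numerically, here is a hedged sketch of my own (using scipy.optimize and a made-up 0/1 sample) that minimizes the negative log likelihood and compares the result to the closed-form answer, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 0])  # hypothetical sports-car indicators

def neg_log_L(p):
    # negative of log L(p) = sum(x)*log(p) + (n - sum(x))*log(1 - p)
    return -(x.sum() * np.log(p) + (x.size - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_L, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical maximizer:", res.x)    # approximately 0.6
print("closed-form x-bar  :", x.mean()) # 6/10 = 0.6
```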

Now, with that example behind us, let us take a look at formal definitions of the terms (1) likelihood function, (2) maximum likelihood estimators, and (3) maximum likelihood estimates.

Definition. Let X1, X2, ..., Xn be a random sample from a distribution that depends on one or more unknown parameters θ1, θ2, ..., θm with probability density (or mass) function f(xi; θ1, θ2, ..., θm). Suppose that (θ1, θ2, ..., θm) is restricted to a given parameter space Ω. Then:

(1) When regarded as a function of θ1, θ2, ..., θm, the joint probability density (or mass) function of X1, X2, ..., Xn:

$$L(\theta_1, \theta_2, \ldots, \theta_m) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$

((θ1, θ2, ..., θm) in Ω) is called the likelihood function.

(2) If:

$$\bigl[u_1(x_1, x_2, \ldots, x_n),\; u_2(x_1, x_2, \ldots, x_n),\; \ldots,\; u_m(x_1, x_2, \ldots, x_n)\bigr]$$

is the m-tuple that maximizes the likelihood function, then:

$$\hat{\theta}_i = u_i(X_1, X_2, \ldots, X_n)$$

is the maximum likelihood estimator of θi, for i = 1, 2, ..., m.

(3) The corresponding observed values of the statistics in (2), namely:

$$\bigl[u_1(x_1, x_2, \ldots, x_n),\; u_2(x_1, x_2, \ldots, x_n),\; \ldots,\; u_m(x_1, x_2, \ldots, x_n)\bigr]$$

are called the maximum likelihood estimates of θi, for i = 1, 2, ..., m.
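In code, this definition translates directly into a numerical search over the m-tuple (θ1, θ2, ..., θm). The following is only a sketch under my own assumptions (a user-supplied per-observation log density and a Nelder-Mead search via scipy), not part of the original text; the helper name and the example model are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def mle(log_f, x, theta0):
    """Return the m-tuple theta-hat that maximizes sum_i log f(x_i; theta)."""
    neg_log_L = lambda theta: -sum(log_f(xi, theta) for xi in x)
    return minimize(neg_log_L, theta0, method="Nelder-Mead").x

# Hypothetical two-parameter normal model, theta = (mu, sigma)
def normal_log_pdf(xi, theta):
    mu, sigma = theta
    if sigma <= 0:  # keep the search inside the parameter space
        return -np.inf
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (xi - mu) ** 2 / (2 * sigma ** 2)

x = [4.2, 5.1, 3.8, 4.9, 5.5]
print(mle(normal_log_pdf, x, theta0=[np.mean(x), np.std(x)]))
```

Given the closed-form results derived in the last example on this page, the printed pair should land near (x-bar, the square root of the divide-by-n variance).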

Example

Suppose the weights of randomly selected American female college students are normally distributed with unknown mean μ and standard deviation σ. A random sample of 10 American female college students yielded the following weights (in pounds):

115   122   130   127   149   160   152   138  149   180    

Based on the definitions given above, identify the likelihood function and the maximum likelihood estimator of μ, the mean weight of all American female college students. Using the given sample, find a maximum likelihood estimate of μ as well.

Solution. The probability density function of Xi is:

$$f(x_i;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]$$

for −∞ < xi < ∞. The parameter space is Ω = {(μ, σ): −∞ < μ < ∞ and 0 < σ < ∞}. Therefore, (you might want to convince yourself that) the likelihood function is:

$$L(\mu,\sigma) = \sigma^{-n}(2\pi)^{-n/2}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right]$$

for −∞ < μ < ∞ and 0 < σ < ∞. It can be shown (we'll do so in the next example!), upon maximizing the likelihood function with respect to μ, that the maximum likelihood estimator of μ is:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$

Based on the given sample, a maximum likelihood estimate of μ is:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}(115 + \cdots + 180) = 142.2$$

pounds. Note that the only difference between the formulas for the maximum likelihood estimator and the maximum likelihood estimate is that:

  • the estimator is defined using capital letters (to denote that its value is random), and
  • the estimate is defined using lowercase letters (to denote that its value is fixed and based on an obtained sample)
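A quick sanity check in Python (my own addition, assuming numpy) reproduces the closed-form estimate from the data above and confirms that the likelihood, viewed as a function of μ, peaks at the sample mean:

```python
import numpy as np

# The ten observed weights (in pounds) from the example
weights = np.array([115, 122, 130, 127, 149, 160, 152, 138, 149, 180])

# Maximum likelihood estimate of mu: the sample mean
print(weights.mean())  # 142.2

# Grid check: for any fixed sigma, the log likelihood in mu is maximized at x-bar,
# since only the -sum((x_i - mu)^2) term depends on mu
mu = np.linspace(100, 200, 100001)
logL_mu_part = -np.sum((weights[:, None] - mu) ** 2, axis=0)
print(mu[np.argmax(logL_mu_part)])  # approximately 142.2
```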

Okay, so now we have the formal definitions out of the way. The first example on this page involved a joint probability mass function that depends on only one parameter, namely p, the proportion of successes. Now, let's take a look at an example that involves a joint probability density function that depends on two parameters.

Example

Let X1, X2, ..., Xn be a random sample from a normal distribution with unknown mean μ and variance σ2. Find maximum likelihood estimators of the mean μ and variance σ2.

Solution. In finding the estimators, the first thing we'll do is write the probability density function as a function of θ1 = μ and θ2 = σ2:

$$f(x_i;\theta_1,\theta_2) = \frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\!\left[-\frac{(x_i-\theta_1)^2}{2\theta_2}\right]$$

for −∞ < θ1 < ∞ and 0 < θ2 < ∞. We do this so as not to cause confusion when taking the derivative of the likelihood with respect to σ2. Now, that makes the likelihood function:

$$L(\theta_1,\theta_2) = \prod_{i=1}^{n} f(x_i;\theta_1,\theta_2) = \theta_2^{-n/2}(2\pi)^{-n/2}\exp\!\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n}(x_i-\theta_1)^2\right]$$

and therefore the log of the likelihood function:

$$\log L(\theta_1,\theta_2) = -\frac{n}{2}\log\theta_2 - \frac{n}{2}\log(2\pi) - \frac{\sum(x_i-\theta_1)^2}{2\theta_2}$$

Now, upon taking the partial derivative of the log likelihood with respect to θ1, and setting to 0, we see that a few things cancel each other out, leaving us with:

$$\frac{\partial \log L(\theta_1,\theta_2)}{\partial \theta_1} = \frac{\sum(x_i-\theta_1)}{\theta_2} = 0$$

Now, multiplying through by θ2, and distributing the summation, we get:

$$\sum x_i - n\theta_1 = 0$$

Now, solving for θ1, and putting on its hat, we have shown that the maximum likelihood estimate of θ1 is:

$$\hat{\theta}_1 = \hat{\mu} = \frac{\sum x_i}{n} = \bar{x}$$

Now for θ2. Taking the partial derivative of the log likelihood with respect to θ2, and setting to 0, we get:

$$\frac{\partial \log L(\theta_1,\theta_2)}{\partial \theta_2} = -\frac{n}{2\theta_2} + \frac{\sum(x_i-\theta_1)^2}{2\theta_2^2} = 0$$

Multiplying through by 2θ2², we get:

$$-n\theta_2 + \sum(x_i-\theta_1)^2 = 0$$

And, solving for θ2, and putting on its hat, we have shown that the maximum likelihood estimate of θ2 is:

$$\hat{\theta}_2 = \hat{\sigma}^2 = \frac{\sum(x_i-\bar{x})^2}{n}$$

(I'll again leave it to you to verify, in each case, that the second partial derivative of the log likelihood is negative, and therefore that we did indeed find maxima.) In summary, we have shown that the maximum likelihood estimators of μ and variance σ2 for the normal model are:

$$\hat{\mu} = \frac{\sum X_i}{n} = \bar{X} \qquad\text{and}\qquad \hat{\sigma}^2 = \frac{\sum(X_i-\bar{X})^2}{n}$$

respectively. 
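As a small numerical companion (my own sketch, assuming numpy and a simulated normal sample, not data from the text), both estimators can be computed directly; note that np.var with its default ddof=0 divides by n, matching the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # hypothetical normal sample

mu_hat = x.mean()               # MLE of mu: sum(X_i) / n
sigma2_hat = np.var(x, ddof=0)  # MLE of sigma^2: divides by n
s2 = np.var(x, ddof=1)          # sample variance S^2 (divides by n - 1), discussed next

print(mu_hat, sigma2_hat, s2)
```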

Note that the maximum likelihood estimator of σ2 for the normal model is not the sample variance S2. They are, in fact, competing estimators. So how do we know which estimator we should use for σ2? Well, one way is to choose the estimator that is "unbiased." Let's go learn about unbiased estimators now.


from: https://onlinecourses.science.psu.edu/stat414/node/191
