An introduction to kernel density estimation

Original source: http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/
This talk is divided into three parts: the first is on histograms – how to construct them and their properties. Next are kernel density estimators – how they are a generalisation of and improvement over histograms. Finally, we look at how to choose the most appropriate, 'nice' kernels so that we extract all the important features of the data.

A histogram is the simplest non-parametric density estimator and the one that is most frequently encountered. To construct a histogram, we divide the interval covered by the data values into equal sub-intervals, known as 'bins'. Every time a data value falls into a particular sub-interval, a block of size 1 by the binwidth is placed on top of it. When we construct a histogram, we need to consider two main points: the size of the bins (the binwidth) and the end points of the bins.
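As a sketch of this construction, the block-stacking description above amounts to counting points per bin and scaling by 1/(n × binwidth) so that the total area is 1. The toy data here are illustrative, not the seminar's dataset, and the origin is assumed to lie at or below the smallest value:

```python
import numpy as np

# A minimal histogram density estimator: equal-width bins starting from a
# chosen origin. Counts are scaled by 1/(n * binwidth) so that the
# histogram integrates to 1, i.e. it is a density estimate.
def histogram_density(data, origin, binwidth):
    data = np.asarray(data, dtype=float)
    n = len(data)
    # Bin edges of equal width, starting from the origin and covering the data.
    k = int(np.ceil((data.max() - origin) / binwidth))
    edges = origin + binwidth * np.arange(k + 1)
    counts, _ = np.histogram(data, bins=edges)
    heights = counts / (n * binwidth)
    return edges, heights

data = [2.2, 2.4, 2.6, 3.1, 3.3, 3.6]  # illustrative values only
edges, heights = histogram_density(data, origin=0.0, binwidth=0.5)
print(np.sum(heights) * 0.5)  # total area under the histogram: 1.0
```

Each bar's height is "number of blocks in that bin" divided by n times the binwidth, which is exactly the stacked-blocks picture rescaled to integrate to one.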

The data are (the log of) wing spans of aircraft built from 1956 to 1984. (The complete dataset can be found in Bowman & Azzalini (1997), Applied Smoothing Techniques for Data Analysis. We use a subset of it, namely observations 2, 22, 42, 62, 82, 102, 122, 142, 162, 182, 202 and 222. We only use a subset because otherwise some plots become too crowded, so this is for display purposes only.) The data points are represented by crosses on the x-axis.

If we choose breaks at 0 and 0.5 and a binwidth of 0.5, then our histogram looks like the one on top. According to this histogram, the density appears to be unimodal and skewed to the right. The choice of end points has a particularly marked effect on the shape of a histogram. For example, if we use the same binwidth but with the end points shifted up to 0.25 and 0.75, then our histogram looks like the one below. We now have a completely different estimate of the density – it now appears to be bimodal.
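The end-point effect is easy to reproduce numerically: the same data, binned with the same binwidth but with edges starting at 0 versus 0.25, land in different bins and give differently shaped histograms. (The toy data below are illustrative, not the seminar's dataset.)

```python
import numpy as np

# Same data, same binwidth of 0.5, but bin edges shifted by 0.25:
# the bin counts - and hence the histogram's shape - change completely.
data = np.array([2.2, 2.4, 2.6, 3.1, 3.3, 3.6])

counts_a, _ = np.histogram(data, bins=np.arange(0.0, 4.5, 0.5))
counts_b, _ = np.histogram(data, bins=np.arange(0.25, 4.75, 0.5))
print(counts_a)  # [0 0 0 0 2 1 2 1]
print(counts_b)  # [0 0 0 1 2 1 2 0]
```

Nothing about the data changed between the two calls; only the arbitrary placement of the bin edges did.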

[Figure: histograms of the aircraft data with the two choices of bin end points]

We have illustrated the properties of histograms with these two examples: they are

• not smooth
• depend on end points of bins
• depend on width of bins.
[Figure: blocks centred at each data point and their sum, the box kernel density estimate]
We can alleviate the first two problems by using kernel density estimators. To remove the dependence on the end points of the bins, we centre each of the blocks at each data point rather than fixing the end points of the blocks.

In the above ‘histogram’, we place a block of width 1/2 and height 1/6 (the dotted boxes) at each data point – there are 12 data points, so each block has area 1/12 – and then add them up. This density estimate (the solid curve) is less blocky than either of the histograms, as we are starting to extract some of the finer structure. It suggests that the density is bimodal.
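A small sketch of this "block at each data point" construction: a box of width h and area 1/n centred at each observation, summed over the n points. (The toy data are illustrative, not the seminar's 12 observations.)

```python
import numpy as np

# Box (uniform) kernel density estimate: each data point contributes a
# block of height 1/(n*h) wherever |x - x_i| <= h/2, so each block has
# area 1/n and the sum integrates to 1.
def box_kde(x, data, h):
    data = np.asarray(data, dtype=float)
    n = len(data)
    inside = np.abs(x[:, None] - data[None, :]) <= h / 2
    return inside.sum(axis=1) / (n * h)

data = np.array([2.2, 2.4, 2.6, 3.1, 3.3, 3.6])  # illustrative values only
grid = np.linspace(1.5, 4.5, 3001)
f = box_kde(grid, data, h=0.5)
dx = grid[1] - grid[0]
print(round(float(f.sum() * dx), 2))  # the estimate integrates to roughly 1
```

Unlike a histogram, this estimate has no bin origin to choose: the blocks follow the data points themselves.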

This is known as a box kernel density estimate – it is still discontinuous, as we have used a discontinuous kernel as our building block. If we use a smooth kernel as our building block, then we will have a smooth density estimate. Thus we can eliminate the first problem with histograms as well. Unfortunately, we still can't remove the dependence on the bandwidth (which is the equivalent of a histogram's binwidth).

It's important to choose the most appropriate bandwidth, as a value that is too small or too large is not useful. If we use a normal (Gaussian) kernel with a bandwidth (standard deviation) of 0.1 – so each curve has area 1/12 under it – then the kernel density estimate is said to be undersmoothed, as the bandwidth is too small; see the figure below. It appears that there are four modes in this density – some of these are surely artefacts of the data. We can try to eliminate these artefacts by increasing the bandwidth of the normal kernels to 0.5. We obtain a much flatter estimate with only one mode. This estimate is said to be oversmoothed, as we have chosen a bandwidth that is too large and have obscured most of the structure of the data.
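Swapping the box for a normal bump gives the smooth estimate described above; the only remaining knob is the bandwidth h. A minimal sketch (again with illustrative toy data, not the seminar's dataset):

```python
import numpy as np

# Gaussian kernel density estimate: a normal bump of standard deviation h
# (the bandwidth) and area 1/n centred at each data point.
def gaussian_kde(x, data, h):
    data = np.asarray(data, dtype=float)
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

data = np.array([2.2, 2.4, 2.6, 3.1, 3.3, 3.6])  # illustrative values only
grid = np.linspace(0.0, 6.0, 1201)
dx = grid[1] - grid[0]
for h in (0.1, 0.5):  # too small (undersmoothed) vs too large (oversmoothed)
    f = gaussian_kde(grid, data, h)
    print(h, round(float(f.sum() * dx), 2))  # each estimate integrates to about 1
```

With h = 0.1 the estimate is spiky and follows individual points; with h = 0.5 it is flattened out – the same under/oversmoothing trade-off as in the figures.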

[Figure: undersmoothed (bandwidth 0.1) and oversmoothed (bandwidth 0.5) kernel density estimates]
So how do we choose the optimal bandwidth? A common way is to use the bandwidth that minimises an optimality criterion (a function of the bandwidth), the AMISE, or Asymptotic Mean Integrated Squared Error. Then optimal bandwidth = argmin AMISE, i.e. the optimal bandwidth is the argument that minimises the AMISE.
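The author's plug-in selector is not reproduced here, but a common simple stand-in is the normal-reference ("rule of thumb") bandwidth, which minimises the AMISE under the assumption that the true density is normal: h = (4 / (3n))^(1/5) · σ, roughly 1.06 σ n^(-1/5). A sketch, on the same illustrative toy data as above:

```python
import numpy as np

# Normal-reference bandwidth: the AMISE-optimal h for a Gaussian kernel
# if the true density were normal with standard deviation sigma.
# This is a simple stand-in, not the plug-in selector the talk refers to.
def normal_reference_bandwidth(data):
    data = np.asarray(data, dtype=float)
    n = len(data)
    sigma = data.std(ddof=1)  # sample standard deviation
    return (4.0 / (3.0 * n)) ** 0.2 * sigma

data = np.array([2.2, 2.4, 2.6, 3.1, 3.3, 3.6])  # illustrative values only
print(round(normal_reference_bandwidth(data), 2))  # 0.41
```

More refined selectors replace the normality assumption by estimating the unknown terms of the AMISE from the data, which is exactly the "estimate of an asymptotic approximation" point made below.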

In general, the AMISE still depends on the true underlying density (which of course we don't have!) and so we need to estimate the AMISE from our data as well. This means that the chosen bandwidth is an estimate of an asymptotic approximation. It now sounds as if it's too far removed from the true optimal value, but it turns out that this particular choice of bandwidth recovers all the important features whilst maintaining smoothness.

The optimal value of the bandwidth for our dataset is about 0.25. The optimally smoothed kernel density estimate has two modes. As these are the logs of aircraft wing spans, this means that there was a group of smaller, lighter planes built, clustered around 2.5 (which is about 12 m), whereas the larger planes, perhaps using jet engines since these were used on a commercial scale from about the 1960s, are grouped around 3.5 (about 33 m).
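The figures of about 12 m and about 33 m come from undoing the log transform, a one-line check:

```python
import math

# The modes sit on the log scale; exponentiating recovers the wing spans.
print(round(math.exp(2.5), 1))  # 12.2, i.e. about 12 m
print(round(math.exp(3.5), 1))  # 33.1, i.e. about 33 m
```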

[Figure: the optimally smoothed kernel density estimate, with bandwidth about 0.25]

The properties of kernel density estimators are, as compared to histograms:

• smooth
• no end points
• depend on bandwidth.

This has been a quick introduction to kernel density estimation. The current state of research is that most of the issues concerning one-dimensional problems have been resolved. The next stage is to extend these ideas to the multi-dimensional case, where much less research has been done. This is because the orientation of multi-dimensional kernels has a large effect on the resulting density estimate (which has no counterpart in one-dimensional kernels). I am currently looking for reliable methods of bandwidth selection for multivariate kernels. Some progress that I have made in plug-in methods is here; however, that page is more technical and uses equations!

These notes are an edited version of a seminar given by Tarn Duong on 24 May 2001 as part of the Weatherburn Lecture Series for the Department of Mathematics and Statistics at the University of Western Australia. Please feel free to contact the author at tarn(dot)duong(at)gmail(dot)com if you have any questions. Tarn's web page contains more details of his research into kernel smoothing methods.


