[Paper Translation] Mean Shift: A Robust Approach toward Feature Space Analysis

Title: Mean Shift: A Robust Approach toward Feature Space Analysis
Source: Mean Shift: A Robust Approach toward Feature Space Analysis
Translated by: BDML@CQUT Lab

Mean Shift: A Robust Approach toward Feature Space Analysis

Dorin Comaniciu^1        Peter Meer^2

^1 Siemens Corporate Research
755 College Road East, Princeton, NJ 08540
comanici@scr.siemens.com

^2 Electrical and Computer Engineering Department
Rutgers University
94 Brett Road, Piscataway, NJ 08854-8058
meer@caip.rutgers.edu

Abstract

A general nonparametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure, the mean shift. We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and thus its utility in detecting the modes of the density. The equivalence of the mean shift procedure to the Nadaraya–Watson estimator from kernel regression and the robust M-estimators of location is also established. Algorithms for two low-level vision tasks, discontinuity preserving smoothing and image segmentation, are described as applications. In these algorithms the only user-set parameter is the resolution of the analysis, and either gray level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

Keywords: mean shift; clustering; image segmentation; image smoothing; feature space; low-level vision


1 Introduction

Low-level computer vision tasks are misleadingly difficult. Incorrect results can be easily obtained since the employed techniques often rely upon the user correctly guessing the values for the tuning parameters. To improve performance the execution of low-level tasks should be task driven, i.e., supported by independent high level information. This approach, however, requires that first the low-level stage provides a reliable enough representation of the input, and that the feature extraction process is controlled only by very few tuning parameters corresponding to intuitive measures in the input domain.

Feature space based analysis of images is a paradigm which can achieve the above stated goals. A feature space is a mapping of the input obtained through the processing of the data in small subsets at a time. For each subset a parametric representation of the feature of interest is obtained and the result is mapped into a point in the multidimensional space of the parameter. After the entire input is processed, significant features correspond to denser regions in the feature space, i.e., to clusters, and the goal of the analysis is the delineation of these clusters.

The nature of the feature space is application dependent. The subsets employed in the mapping can range from individual pixels, as in the color space representation of an image, to a set of quasi-randomly chosen data points, as in the probabilistic Hough transform. Both the advantage and the disadvantage of the feature space paradigm arise from the global nature of the derived representation of the input. On one hand, all the evidence for the presence of a significant feature is pooled together, providing an excellent tolerance to a noise level which may render local decisions unreliable. On the other hand, features with lesser support in the feature space may not be detected in spite of being salient for the task to be executed. This disadvantage, however, can be largely avoided by either augmenting the feature space with additional (spatial) parameters from the input domain, or by robust post processing of the input domain guided by the results of the feature space analysis.

Analysis of the feature space is application independent. While there are a plethora of published clustering techniques, most of them are not adequate to analyze feature spaces derived from real data. Methods which rely upon a priori knowledge of the number of clusters present (including those which use optimization of a global criterion to find this number), as well as methods which implicitly assume the same shape (most often elliptical) for all the clusters in the space, are not able to handle the complexity of a real feature space. For a recent survey of such methods see [29, Sec.8].

In Figure 1 a typical example is shown. The color image in Figure 1a is mapped into the three-dimensional Luv color space (to be discussed in Section 4). There is a continuous transition between the clusters arising from the dominant colors, and a decomposition of the space into elliptical tiles will introduce severe artifacts. Enforcing a Gaussian mixture model over such data is doomed to fail, e.g., [49], and even the use of a robust approach with contaminated Gaussian densities [67] cannot be satisfactory for such complex cases. Note also that the mixture models require the number of clusters as a parameter which raises its own challenges. For example, the method described in [45] proposes several different ways to determine this number.

Arbitrarily structured feature spaces can be analyzed only by nonparametric methods since these methods do not have embedded assumptions. Numerous nonparametric clustering methods were described in the literature and they can be classified into two large classes: hierarchical clustering and density estimation. Hierarchical clustering techniques either aggregate or divide the data based on some proximity measure. See [28, Sec.3.2] for a survey of hierarchical clustering methods. The hierarchical methods tend to be computationally expensive and the definition of a meaningful stopping criterion for the fusion (or division) of the data is not straightforward.

The rationale behind the density estimation based nonparametric clustering approach is that the feature space can be regarded as the empirical probability density function (p.d.f.) of the represented parameter. Dense regions in the feature space thus correspond to local maxima of the p.d.f., that is, to the modes of the unknown density. Once the location of a mode is determined, the cluster associated with it is delineated based on the local structure of the feature space [25, 60, 63].

Our approach to mode detection and clustering is based on the mean shift procedure, proposed in 1975 by Fukunaga and Hostetler [21] and largely forgotten till Cheng’s paper [7] rekindled the interest in it. In spite of its excellent qualities, the mean shift procedure does not seem to be known in the statistical literature. While the book [54, Sec.6.2.2] discusses [21], the advantages of employing a mean shift type procedure in density estimation were only recently rediscovered [8].

As will be proven in the sequel a computational module based on the mean shift procedure is an extremely versatile tool for feature space analysis and can provide reliable solutions for many vision tasks. In Section 2 the mean shift procedure is defined and its properties are analyzed. In Section 3 the procedure is used as the computational module for robust feature space analysis and implementational issues are discussed. In Section 4 the feature space analysis technique is applied to two low level vision tasks: discontinuity preserving filtering and image segmentation. Both algorithms can have as input either gray level or color images and the only parameter to be tuned by the user is the resolution of the analysis. The applicability of the mean shift procedure is not restricted to the presented examples. In Section 5 other applications are mentioned and the procedure is put into a more general context.

Figure 1: Example of a feature space. (a) A 400×276 color image. (b) Corresponding L*u*v color space with 110,400 data points.


2 The Mean Shift Procedure

Kernel density estimation (known as the Parzen window technique in the pattern recognition literature [17, Sec.4.3]) is the most popular density estimation method. Given $n$ data points $X_i$, $i=1,\dots,n$ in the $d$-dimensional space $R^d$, the multivariate kernel density estimator with kernel $K(X)$ and a symmetric positive definite $d\times d$ bandwidth matrix $H$, computed in the point $X$, is given by
$$\hat{f}(X)=\frac{1}{n}\sum^n_{i=1}K_H(X-X_i)\tag{1}$$
where
$$K_H(X)=|H|^{-1/2}K(H^{-1/2}X)\tag{2}$$
The $d$-variate kernel $K(X)$ is a bounded function with compact support satisfying [62, p.95]
$$\int_{R^d}K(X)\,dX=1\qquad \lim_{\|X\|\to\infty}\|X\|^dK(X)=0$$
$$\int_{R^d}XK(X)\,dX=0\qquad \int_{R^d}XX^TK(X)\,dX=c_KI\tag{3}$$
where $c_K$ is a constant. The multivariate kernel can be generated from a symmetric univariate kernel $K_1(x)$ in two different ways
$$K^P(X)=\prod^d_{i=1}K_1(x_i)\qquad K^S(X)=a_{k,d}K_1(\|X\|)\tag{4}$$
where $K^P(X)$ is obtained from the product of the univariate kernels, and $K^S(X)$ from rotating $K_1(x)$ in $R^d$, i.e., $K^S(X)$ is radially symmetric. The constant $a^{-1}_{k,d}=\int_{R^d}K_1(\|X\|)\,dX$ assures that $K^S(X)$ integrates to one, though this condition can be relaxed in our context. Either type of multivariate kernel obeys (3), but for our purposes the radially symmetric kernels are often more suitable.

We are interested only in a special class of radially symmetric kernels satisfying
$$K(X)=c_{k,d}\,k(\|X\|^2)\tag{5}$$
in which case it suffices to define the function $k(x)$, called the profile of the kernel, only for $x\geq 0$. The normalization constant $c_{k,d}$, which makes $K(X)$ integrate to one, is assumed strictly positive.

Using a fully parameterized $H$ increases the complexity of the estimation [62, p.106] and in practice the bandwidth matrix $H$ is chosen either as diagonal, $H=\mathrm{diag}[h_1^2,\dots,h_d^2]$, or proportional to the identity matrix, $H=h^2I$. The clear advantage of the latter case is that only one bandwidth parameter $h>0$ must be provided; however, as can be seen from (2), the validity of a Euclidean metric for the feature space should then be confirmed first. Employing only one bandwidth parameter, the kernel density estimator (1) becomes the well-known expression
$$\hat{f}(X)=\frac{1}{nh^d}\sum^n_{i=1}K\left(\frac{X-X_i}{h}\right)\tag{6}$$
The quality of a kernel density estimator is measured by the mean of the square error between the density and its estimate, integrated over the domain of definition. In practice, however, only an asymptotic approximation of this measure (denoted AMISE) can be computed. Under the asymptotics the number of data points $n\to\infty$ while the bandwidth $h\to 0$ at a rate slower than $n^{-1}$. For both types of multivariate kernels the AMISE measure is minimized by the Epanechnikov kernel [51, p.139], [62, p.104] having the profile
$$k_E(x)=\begin{cases}1-x & 0\leq x \leq 1 \\ 0 & x>1\end{cases}\tag{7}$$
which yields the radially symmetric kernel
$$K_E(X)=\begin{cases}\frac{1}{2}c^{-1}_d(d+2)(1-\|X\|^2) & \|X\|\leq 1 \\ 0 & \text{otherwise}\end{cases}\tag{8}$$
where $c_d$ is the volume of the unit $d$-dimensional sphere. Note that the Epanechnikov profile is not differentiable at the boundary. The profile
$$k_N(x)=\exp\left(-\frac{1}{2}x\right)\qquad x\geq0\tag{9}$$
yields the multivariate normal kernel
$$K_N(X)=(2\pi)^{-d/2}\exp\left(-\frac{1}{2}\|X\|^2\right)\tag{10}$$
for both types of composition (4). The normal kernel is often symmetrically truncated to have a kernel with finite support.

While these two kernels will suffice for most applications we are interested in, all the results presented below are valid for arbitrary kernels within the conditions to be stated. Employing the profile notation the density estimator (6) can be rewritten as
$$\hat{f}_{h,K}(X)=\frac{c_{k,d}}{nh^d}\sum^n_{i=1}k\left(\left\|\frac{X-X_i}{h}\right\|^2\right)\tag{11}$$
The first step in the analysis of a feature space with the underlying density $f(X)$ is to find the modes of this density. The modes are located among the zeros of the gradient, $\nabla f(X)=0$, and the mean shift procedure is an elegant way to locate these zeros without estimating the density.
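To make (7) and (11) concrete, here is a minimal Python sketch (NumPy assumed; the function names `epanechnikov_profile` and `kde` are ours, not from the paper) that evaluates the density estimate at a query point:

```python
import numpy as np

def epanechnikov_profile(u):
    """Profile k_E(u) from (7): 1 - u on [0, 1], zero beyond."""
    return np.where((u >= 0) & (u <= 1), 1.0 - u, 0.0)

def kde(x, X, h, profile=epanechnikov_profile):
    """Evaluate the estimate (11) at x, up to the constant c_{k,d}.

    x : (d,) query point; X : (n, d) data matrix; h : bandwidth.
    """
    n, d = X.shape
    u = np.sum(((x - X) / h) ** 2, axis=1)   # ||(x - X_i)/h||^2 for every i
    return profile(u).sum() / (n * h ** d)   # multiply by c_{k,d} to normalize
```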


2.1 Density Gradient Estimation

The density gradient estimator is obtained as the gradient of the density estimator by exploiting the linearity of (11)
$$\hat{\nabla}f_{h,K}(X)\equiv\nabla\hat f_{h,K}(X)=\frac{2c_{k,d}}{nh^{d+2}}\sum^n_{i=1}(X-X_i)k'\left(\left\|\frac{X-X_i}{h}\right\|^2\right)\tag{12}$$
We define the function
$$g(x)=-k'(x)\tag{13}$$
assuming that the derivative of the kernel profile $k$ exists for all $x\in[0,\infty)$, except for a finite set of points. Using now $g(x)$ for profile, the kernel $G(X)$ is defined as
$$G(X)=c_{g,d}\,g(\|X\|^2)\tag{14}$$
where $c_{g,d}$ is the corresponding normalization constant. The kernel $K(X)$ was called the shadow of $G(X)$ in [7], in a slightly different context. Note that the Epanechnikov kernel is the shadow of the uniform kernel, i.e., the $d$-dimensional unit sphere, while the normal kernel and its shadow have the same expression.

Introducing $g(x)$ into (12) yields
$$\hat{\nabla}f_{h,K}(X)=\frac{2c_{k,d}}{nh^{d+2}}\sum^n_{i=1}(X_i-X)g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)
=\frac{2c_{k,d}}{nh^{d+2}}\left[\sum^n_{i=1}g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)\right]\left[\frac{\sum^n_{i=1}X_ig\left(\left\|\frac{X-X_i}{h}\right\|^2\right)}{\sum^n_{i=1}g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)}-X\right]\tag{15}$$
where $\sum^n_{i=1}g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)$ is assumed to be a positive number. This condition is easy to satisfy for all the profiles met in practice. Both terms of the product in (15) have special significance. From (11) the first term is proportional to the density estimate at $X$ computed with the kernel $G$
$$\hat{f}_{h,G}(X)=\frac{c_{g,d}}{nh^{d}}\sum^n_{i=1}g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)\tag{16}$$
The second term is the mean shift
$$\pmb{m}_{h,G}(X)=\frac{\sum^n_{i=1}X_ig\left(\left\|\frac{X-X_i}{h}\right\|^2\right)}{\sum^n_{i=1}g\left(\left\|\frac{X-X_i}{h}\right\|^2\right)}-X\tag{17}$$
i.e., the difference between the weighted mean, using the kernel $G$ for weights, and $X$, the center of the kernel (window). From (16) and (17) the expression (15) becomes
$$\hat{\nabla}f_{h,K}(X)=\hat f_{h,G}(X)\frac{2c_{k,d}}{h^2c_{g,d}}\pmb{m}_{h,G}(X)\tag{18}$$
yielding
$$\pmb{m}_{h,G}(X)=\frac{1}{2}h^2c\,\frac{\hat{\nabla}f_{h,K}(X)}{\hat{f}_{h,G}(X)}\tag{19}$$
The expression (19) shows that at location $X$ the mean shift vector computed with kernel $G$ is proportional to the normalized density gradient estimate obtained with kernel $K$. The normalization is by the density estimate in $X$ computed with the kernel $G$. The mean shift vector thus always points toward the direction of maximum increase in the density. This is a more general formulation of the property first remarked by Fukunaga and Hostetler [20, p.535], [21], and also discussed in [7].

The relation captured in (19) is intuitive: the local mean is shifted toward the region in which the majority of the points reside. Since the mean shift vector is aligned with the local gradient estimate, it can define a path leading to a stationary point of the estimated density. The modes of the density are such stationary points. The mean shift procedure, obtained by successive

– computation of the mean shift vector $\pmb{m}_{h,G}(X)$,

– translation of the kernel (window) $G(X)$ by $\pmb{m}_{h,G}(X)$,

is guaranteed to converge at a nearby point where the estimate (11) has zero gradient, as will be shown in the next section. The presence of the normalization by the density estimate is a desirable feature. The regions of low density values are of no interest for the feature space analysis, and in such regions the mean shift steps are large. Similarly, near local maxima the steps are small, and the analysis more refined. The mean shift procedure thus is an adaptive gradient ascent method.
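A minimal sketch of this two-step procedure, assuming the uniform kernel for $G$ (so $g$ contributes 0/1 weights) and a small threshold on the magnitude of the mean shift vector as the stopping rule discussed at the end of Section 2.2; all names are illustrative:

```python
import numpy as np

def mean_shift_point(y, X, h, eps=1e-3, max_iter=500):
    """Iterate (20): move y to the weighted mean of the points inside
    its window until the mean shift vector (17) is below eps."""
    for _ in range(max_iter):
        w = np.sum(((y - X) / h) ** 2, axis=1) <= 1.0  # uniform g: 0/1 weights
        if not np.any(w):
            break                                      # empty window; stop
        y_next = X[w].mean(axis=0)                     # weighted mean, eq. (20)
        if np.linalg.norm(y_next - y) < eps:           # |m_{h,G}(y)| small enough
            return y_next
        y = y_next
    return y
```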


2.2 Sufficient Condition for Convergence

Denote by $\{Y_j\}_{j=1,2,\dots}$ the sequence of successive locations of the kernel $G$, where from (17)
$$Y_{j+1}=\frac{\sum^n_{i=1}X_ig\left(\left\|\frac{Y_j-X_i}{h}\right\|^2\right)}{\sum^n_{i=1}g\left(\left\|\frac{Y_j-X_i}{h}\right\|^2\right)}\qquad j=1,2,\dots\tag{20}$$
is the weighted mean at $Y_j$ computed with kernel $G$ and $Y_1$ is the center of the initial position of the kernel. The corresponding sequence of density estimates computed with kernel $K$, $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$, is given by
$$\hat{f}_{h,K}(j)=\hat{f}_{h,K}(Y_j)\qquad j=1,2,\dots\tag{21}$$
As stated by the following theorem, a kernel $K$ that obeys some mild conditions suffices for the convergence of the sequences $\{Y_j\}_{j=1,2,\dots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$.

Theorem 1: If the kernel $K$ has a convex and monotonically decreasing profile, the sequences $\{Y_j\}_{j=1,2,\dots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$ converge, and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$ is also monotonically increasing.

The proof is given in the Appendix. The theorem generalizes the result derived differently in [13], where $K$ was the Epanechnikov kernel and $G$ the uniform kernel. The theorem remains valid when each data point $X_i$ is associated with a nonnegative weight $w_i$. An example of nonconvergence when the kernel $K$ is not convex is shown in [10, p.16].

The convergence property of the mean shift was also discussed in [7, Sec.IV]. (Note, however, that almost all the discussion there is concerned with the "blurring" process in which the input is recursively modified after each mean shift step.) The convergence of the procedure as defined in this paper was attributed in [7] to the gradient ascent nature of (19). However, as shown in [4, Sec.1.2], moving in the direction of the local gradient guarantees convergence only for infinitesimal steps. The step size of a gradient based algorithm is crucial for the overall performance. If the step size is too large, the algorithm will diverge, while if the step size is too small, the rate of convergence may be very slow. A number of costly procedures have been developed for step size selection [4, p.24]. The guaranteed convergence (as shown by Theorem 1) is due to the adaptive magnitude of the mean shift vector, which also eliminates the need for additional procedures to choose the adequate step sizes. This is a major advantage over the traditional gradient based methods.

For discrete data, the number of steps to convergence depends on the employed kernel. When $G$ is the uniform kernel, convergence is achieved in a finite number of steps, since the number of locations generating distinct mean values is finite. However, when the kernel $G$ imposes a weighting on the data points (according to the distance from its center), the mean shift procedure is infinitely convergent. The practical way to stop the iterations is to set a lower bound for the magnitude of the mean shift vector.


2.3 Mean Shift Based Mode Detection

Let us denote by $Y_c$ and $\hat{f}_{h,K}(Y_c)$ the convergence points of the sequences $\{Y_j\}_{j=1,2,\dots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$, respectively. The implications of Theorem 1 are the following.

First, the magnitude of the mean shift vector converges to zero. Indeed, from (17) and (20) the $j$-th mean shift vector is
$$\pmb{m}_{h,G}(Y_j)=Y_{j+1}-Y_j\tag{22}$$
and, at the limit, $\pmb{m}_{h,G}(Y_c)=Y_{c}-Y_c=0$. In other words, the gradient of the density estimate (11) computed at $Y_c$ is zero
$$\nabla \hat{f}_{h,K}(Y_c)=0,\tag{23}$$
due to (19). Hence, $Y_c$ is a stationary point of $\hat{f}_{h,K}$.

Second, since $\{\hat{f}_{h,K}(j)\}_{j=1,2,\dots}$ is monotonically increasing, the mean shift iterations satisfy the conditions required by the Capture Theorem [4, p.45], which states that the trajectories of such gradient methods are attracted by local maxima if they are unique (within a small neighborhood) stationary points. That is, once $Y_j$ gets sufficiently close to a mode of $\hat{f}_{h,K}$, it converges to it. The set of all locations that converge to the same mode defines the basin of attraction of that mode.

The theoretical observations from above suggest a practical algorithm for mode detection:

– run the mean shift procedure to find the stationary points of $\hat{f}_{h,K}$,

– prune these points by retaining only the local maxima.

The local maxima points are defined according to the Capture Theorem, as unique stationary points within some small open sphere. This property can be tested by perturbing each stationary point by a random vector of small norm, and letting the mean shift procedure converge again. Should the point of convergence be unchanged (up to a tolerance), the point is a local maximum.
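Under the same assumptions, this two-step algorithm might be sketched as follows, reusing the hypothetical `mean_shift_point` and `kde` from above; the perturbation test follows the description here, and the bandwidth-distance fusion of candidates anticipates Section 3:

```python
import numpy as np

def detect_modes(starts, X, h, tol=None, rng=np.random.default_rng(0)):
    """Run mean shift from each start, keep perturbation-stable points,
    then fuse candidates closer than the bandwidth h."""
    tol = tol if tol is not None else 0.1 * h
    candidates = []
    for y0 in starts:
        yc = mean_shift_point(np.array(y0, dtype=float), X, h)
        # perturb by a random vector of small norm and re-converge
        yp = mean_shift_point(yc + rng.normal(scale=0.1 * h, size=yc.shape), X, h)
        if np.linalg.norm(yp - yc) < tol:      # unchanged: a local maximum
            candidates.append(yc)
    modes = []
    for m in sorted(candidates, key=lambda m: -kde(m, X, h)):
        if all(np.linalg.norm(m - f) >= h for f in modes):
            modes.append(m)                    # keep the highest-density one
    return modes
```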


2.4 Smooth Trajectory Property

The mean shift procedure employing a normal kernel has an interesting property. Its path toward the mode follows a smooth trajectory, the angle between two consecutive mean shift vectors being always less than 90 degrees.

Using the normal kernel (10), the $j$-th mean shift vector is given by
$$\pmb{m}_{h,N}(Y_j)=Y_{j+1}-Y_{j}=\frac{\sum^n_{i=1}X_i\exp\left(-\left\|\frac{Y_j-X_i}{h}\right\|^2\right)}{\sum^n_{i=1}\exp\left(-\left\|\frac{Y_j-X_i}{h}\right\|^2\right)}-Y_j.\tag{24}$$
The following theorem holds true for all $j=1,2,\dots$, according to the proof given in the Appendix.

Theorem 2: The cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed, i.e.,
$$\frac{\pmb{m}_{h,N}(Y_j)^\top\pmb{m}_{h,N}(Y_{j+1})}{\left\|\pmb{m}_{h,N}(Y_j)\right\|\left\|\pmb{m}_{h,N}(Y_{j+1})\right\|}>0.\tag{25}$$
As a consequence of Theorem 2 the normal kernel appears to be the optimal one for the mean shift procedure. The smooth trajectory of the mean shift procedure is in contrast with the standard steepest ascent method [4, p.21] (local gradient evaluation followed by line maximization) whose convergence rate on surfaces with deep narrow valleys is slow due to its zigzagging trajectory.

In practice, the convergence of the mean shift procedure based on the normal kernel requires a large number of steps, as was discussed at the end of Section 2.2. Therefore, in most of our experiments we have used the uniform kernel, for which the convergence is finite, and not the normal kernel. Note, however, that the quality of the results almost always improves when the normal kernel is employed.
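Theorem 2 can be probed numerically with a sketch like the one below (illustrative names; normal-kernel weights as in (24)), which records the cosine (25) between consecutive shift vectors; all returned values should be positive:

```python
import numpy as np

def normal_kernel_cosines(y, X, h, eps=1e-5, max_iter=200):
    """Run mean shift with the normal kernel from y and return the cosines
    between consecutive shift vectors m_{h,N}(Y_j), cf. (24)-(25)."""
    shifts = []
    for _ in range(max_iter):
        w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1))  # normal weights
        y_next = (w[:, None] * X).sum(axis=0) / w.sum()  # weighted mean
        m = y_next - y                                   # shift vector (24)
        if np.linalg.norm(m) < eps:
            break
        shifts.append(m)
        y = y_next
    return [float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(shifts, shifts[1:])]
```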


2.5 Relation to Kernel Regression

Important insight can be gained when relation (19) is obtained by approaching the problem differently. Considering the univariate case suffices for this purpose.

Kernel regression is a nonparametric method to estimate complex trends from noisy data. See [62, Chap.5] for an introduction to the topic, and [24] for a more in-depth treatment. Let the $n$ measured data points be $(X_i,Z_i)$ and assume that the values $X_i$ are the outcomes of a random variable $x$ with probability density function $f(x)$, $x_i=X_i$, $i=1,\dots,n$, while the relation between $Z_i$ and $X_i$ is
$$Z_i=m(X_i)+\epsilon_i\qquad i=1,\dots,n\tag{26}$$
where $m(x)$ is called the regression function, and $\epsilon_i$ is an independently distributed, zero-mean error, $E[\epsilon_i]=0$.

A natural way to estimate the regression function is by locally fitting a degree $p$ polynomial to the data. For a window centered at $x$ the polynomial coefficients then can be obtained by weighted least squares, the weights being computed from a symmetric function $g(x)$. The size of the window is controlled by the parameter $h$, $g_h(x)=h^{-1}g(x/h)$. The simplest case is that of fitting a constant to the data in the window, i.e., $p=0$. It can be shown, [24, Sec.3.1], [62, Sec.5.2], that the estimated constant is the value of the Nadaraya–Watson estimator
$$\hat{m}(x;h)=\frac{\sum^n_{i=1}g_h(x-X_i)Z_i}{\sum^n_{i=1}g_h(x-X_i)}\tag{27}$$
introduced in the statistical literature 35 years ago. The asymptotic conditional bias of the estimator has the expression [24, p.109], [62, p.125],
$$E\left[\left(\hat{m}(x;h)-m(x)\right)\mid X_1,\dots,X_n\right]\approx h^2\frac{m''(x)f(x)+2m'(x)f'(x)}{2f(x)}\mu_2[g]\tag{28}$$
where $\mu_2[g]=\int u^2g(u)\,du$. Defining $m(x)=x$ reduces the Nadaraya–Watson estimator to (20) (in the univariate case), while (28) becomes
$$E\left[(\hat{x}-x)\mid X_1,\dots,X_n\right]\approx h^2\frac{f'(x)}{f(x)}\,\mu_2[g]\tag{29}$$
which is similar to (19). The mean shift procedure thus exploits to its advantage the inherent bias of the zero-order kernel regression.

The connection to the kernel regression literature opens many interesting issues; however, most of these are more of theoretical than practical importance.
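For concreteness, a univariate Nadaraya–Watson estimate (27) can be sketched as follows (the Gaussian choice for the weight function is our assumption, not prescribed by the text); passing `Z = X` reproduces one mean shift step (20) in the univariate case:

```python
import numpy as np

def nadaraya_watson(x, X, Z, h):
    """Zero-order kernel regression estimate m_hat(x; h) from (27).

    x : query location; X, Z : (n,) samples; h : bandwidth.
    """
    g = np.exp(-0.5 * ((x - X) / h) ** 2)   # symmetric weights g_h(x - X_i)
    return (g * Z).sum() / g.sum()
```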


2.6 Relation to Location M-estimators

The M-estimators are a family of robust techniques which can handle data in the presence of severe contaminations, i.e., outliers. See [26], [32] for introductory surveys. In our context only the problem of location estimation has to be considered.

Given the data $X_i$, $i=1,\dots,n$, and the scale $h$, we define $\hat{\pmb\theta}$, the location estimator, as
$$\hat{\pmb\theta}=\arg\min_{\theta}J(\pmb\theta)=\arg\min_{\theta}\sum^n_{i=1}\rho\left(\left\|\frac{\pmb\theta-X_i}{h}\right\|^2\right)\tag{30}$$
where $\rho(u)$ is a symmetric, nonnegative valued function, with a unique minimum at the origin and nondecreasing for $u\geq0$. The estimator is obtained from the normal equations
$$\nabla_{\theta}J(\hat{\pmb\theta})=2h^{-2}\sum^n_{i=1}(\hat{\pmb\theta}-X_i)w\left(\left\|\frac{\hat{\pmb\theta}-X_i}{h}\right\|^2\right)=0\tag{31}$$
where $w(u)=\frac{d\rho(u)}{du}$. Therefore the iterations to find the location M-estimate are based on
$$\hat{\pmb\theta}=\frac{\sum^n_{i=1}X_iw\left(\left\|\frac{\hat{\pmb\theta}-X_i}{h}\right\|^2\right)}{\sum^n_{i=1}w\left(\left\|\frac{\hat{\pmb\theta}-X_i}{h}\right\|^2\right)}\tag{32}$$
which is identical to (20) when $w(u)\equiv g(u)$. Taking into account (13), the minimization (30) becomes
$$\hat{\pmb\theta}=\arg\max_{\theta}\sum^n_{i=1}k\left(\left\|\frac{\pmb\theta-X_i}{h}\right\|^2\right)\tag{33}$$
which can also be interpreted as
$$\hat{\pmb\theta}=\arg\max_{\theta}\hat{f}_{h,K}(\pmb\theta\mid X_1,\dots,X_n).\tag{34}$$
That is, the location estimator is the mode of the density estimated with the kernel $K$ from the available data. Note that the convexity of the $k(x)$ profile, the sufficient condition for the convergence of the mean shift procedure (Section 2.2), is in accordance with the requirements to be satisfied by the objective function $\rho(u)$.

The relation between location M-estimators and kernel density estimation is not well investigated in the statistical literature; only [9] discusses it, in the context of an edge preserving smoothing technique.
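The fixed-point iteration (32) is a standard iteratively reweighted scheme; a sketch under our assumptions (illustrative names, an exponential weight standing in for $w\equiv g$) makes the identity with the mean shift update explicit:

```python
import numpy as np

def location_m_estimate(X, h, w=lambda u: np.exp(-0.5 * u), eps=1e-6,
                        max_iter=500):
    """Iterate (32) from the sample mean until the estimate stabilizes.
    w(u) plays the role of g(u); the default here is our illustrative choice.
    Each update is exactly one mean shift step (20)."""
    theta = X.mean(axis=0)
    for _ in range(max_iter):
        u = np.sum(((theta - X) / h) ** 2, axis=1)       # ||(theta - X_i)/h||^2
        wi = w(u)
        theta_next = (wi[:, None] * X).sum(axis=0) / wi.sum()
        if np.linalg.norm(theta_next - theta) < eps:
            break
        theta = theta_next
    return theta_next
```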


3 Robust Analysis of Feature Spaces

Multimodality and arbitrarily shaped clusters are the defining properties of a real feature space. The ability of the mean shift procedure to move toward the mode (peak) of the hill on which it was initiated makes it the ideal computational module to analyze such spaces. To detect all the significant modes, the basic algorithm given in Section 2.3 should be run multiple times (evolving in principle in parallel) with initializations that cover the entire feature space.

Before the analysis is performed, two important (and somewhat related) issues should be addressed: the metric of the feature space and the shape of the kernel. The mapping from the input domain into a feature space often associates a non-Euclidean metric to the space. The problem of color representation will be discussed in Section 4, but the employed parameterization has to be carefully examined even in a simple case like the Hough space of lines, e.g., [48], [61].

The presence of a Mahalanobis metric can be accommodated by an adequate choice of the bandwidth matrix (2). In practice, however, it is preferable to have assured that the metric of the feature space is Euclidean and thus the bandwidth matrix is controlled by a single parameter, $H=h^2I$. To be able to use the same kernel size for all the mean shift procedures in the feature space, the necessary condition is that local density variations near a significant mode are not as large as the entire support of a significant mode somewhere else.

The starting points of the mean shift procedures should be chosen to have the entire feature space (except the very sparse regions) tessellated by the kernels (windows). Regular tessellations are not required. As the windows evolve toward the modes, almost all the data points are visited and thus all the information captured in the feature space is exploited. Note that the convergence to a given mode may yield slightly different locations, due to the threshold that terminates the iterations. Similarly, on flat plateaus the value of the gradient is close to zero and the mean shift procedure could stop.

These artifacts are easy to eliminate through postprocessing. Mode candidates at a distance less than the kernel bandwidth are fused, the one corresponding to the highest density being chosen. The global structure of the feature space can be confirmed by measuring the significance of the valleys defined along a cut through the density in the direction determined by two modes.

The delineation of the clusters is a natural outcome of the mode seeking process. After convergence the basin of attraction of a mode, i.e., the data points visited by all the mean shift procedures converging to that mode, automatically delineates a cluster of arbitrary shape. Close to the boundaries, where a data point could have been visited by several diverging procedures, majority logic can be employed. It is important to notice that in computer vision most often we are not dealing with an abstract clustering problem. The input domain almost always provides an independent test for the validity of local decisions in the feature space. That is, while it is less likely that one can recover from a severe clustering error, allocation of a few uncertain data points can be reliably supported by input domain information.

The multimodal feature space analysis technique was discussed in detail in [12]. It was shown experimentally that for a synthetic, bimodal normal distribution the technique achieves a classification error similar to the optimal Bayesian classifier. The behavior of this feature space analysis technique is illustrated in Figure 2. A two-dimensional data set of 110,400 points (Figure 2a) is decomposed into 7 clusters represented with different colors in Figure 2b. A total of 159 mean shift procedures with uniform kernel were employed. Their trajectories are shown in Figure 2c, overlapped over the density estimate computed with the Epanechnikov kernel. The pruning of the mode candidates produced seven peaks. Observe that some of the trajectories are prematurely stopped by local plateaus.
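A sketch of the delineation step under the conventions above (reusing the hypothetical `mean_shift_point`; taking the fusion radius equal to the bandwidth is our reading of the text): every point is moved to its mode, and points whose convergence locations fall within the bandwidth of the same fused mode receive the same label.

```python
import numpy as np

def cluster(X, h):
    """Label each data point by the basin of attraction of its mode."""
    converged = np.array([mean_shift_point(x.copy(), X, h) for x in X])
    modes, labels = [], np.empty(len(X), dtype=int)
    for i, y in enumerate(converged):
        for j, m in enumerate(modes):
            if np.linalg.norm(y - m) < h:   # fuse within one bandwidth
                labels[i] = j
                break
        else:
            modes.append(y)                 # a new mode (arbitrary-shape cluster)
            labels[i] = len(modes) - 1
    return labels, modes
```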

Figure 2: Example of a 2D feature space. (a) Two-dimensional data set of 110,400 points representing the first two components of the L*u*v space shown in Figure 1b. (b) Decomposition obtained by running 159 mean shift procedures with different initializations. (c) Trajectories of the mean shift procedures drawn over the Epanechnikov density estimate computed for the same data set. The peaks retained for the final classification are marked with red dots.

3.1 Bandwidth Selection

The influence of the bandwidth parameter $h$ was assessed empirically in [12] through a simple image segmentation task. In a more rigorous approach, however, four different techniques for bandwidth selection can be considered.

  • The first one has a statistical motivation. The optimal bandwidth associated with the kernel density estimator (6) is defined as the bandwidth that achieves the best compromise between the bias and variance of the estimator, over all $x\in R^d$, i.e., minimizes AMISE. In the multivariate case, the resulting bandwidth formula [54, p.85], [62, p.99] is of little practical use, since it depends on the Laplacian of the unknown density being estimated, and its performance is not well understood [62, p.108]. For the univariate case a reliable method for bandwidth selection is the plug-in rule [53], which was proven to be superior to least squares cross-validation and biased cross-validation [42], [55, p.46]. Its only assumption is the smoothness of the underlying density.
  • The second bandwidth selection technique is related to the stability of the decomposition. The bandwidth is taken as the center of the largest operating range over which the same number of clusters are obtained for the given data [20, p.541].
  • For the third technique, the best bandwidth maximizes an objective function that expresses the quality of the decomposition (i.e., the index of cluster validity). The objective function typically compares the inter- versus intra-cluster variability [30, 28] or evaluates the isolation and connectivity of the delineated clusters [43].
  • Finally, since in most of the cases the decomposition is task dependent, top-down information provided by the user or by an upper-level module can be used to control the kernel bandwidth.

We present in [15] a detailed analysis of the bandwidth selection problem. To solve the difficulties generated by the narrow peaks and the tails of the underlying density, two locally adaptive solutions are proposed. One is nonparametric, being based on a newly defined adaptive mean shift procedure, which exploits the plug-in rule and the sample point density estimator. The other is semiparametric, imposing a local structure on the data to extract reliable scale information. We show that the local bandwidth should maximize the magnitude of the normalized mean shift vector. The adaptation of the bandwidth provides superior results when compared to the fixed bandwidth procedure. For more details, see [15].

3.1 带宽选择

带宽参数 h h h 的影响是通过简单的图像分割任务在[12]中凭经验评估的。但是,在更严格的方法中,可以考虑四种不同的带宽选择技术。

  • 第一种技术具有统计动机。与核密度估计器(6)相关联的最优带宽被定义为在所有 x ∈ R^d 上在估计器的偏差与方差之间达到最佳折衷的带宽,即最小化AMISE的带宽。在多元情况下,所得的带宽公式[54, p.85],[62, p.99]几乎没有实际用途,因为它依赖于被估计的未知密度的拉普拉斯算子,而且其性能尚未被很好地理解[62, p.108]。对于单变量情况,一种可靠的带宽选择方法是插入法则(plug-in rule)[53],它被证明优于最小二乘交叉验证和有偏交叉验证[42],[55, p.46]。其唯一假设是底层密度的平滑性。
  • 第二种带宽选择技术与分解的稳定性有关:带宽取为对给定数据能得到相同聚类数的最大操作范围的中心[20, p.541](本节末尾给出一个示意代码)。
  • 对于第三种技术,最佳带宽使表示分解质量的目标函数(即聚类有效性的指标)最大化。目标函数通常会比较集群间差异和集群内差异[30,28],或评估所描绘集群的隔离性和连通性[43]。
  • 最后,由于在大多数情况下分解取决于任务本身,因此用户或上层模块提供的自上而下的信息可用于控制内核带宽。

我们在[15]中给出了带宽选择问题的详细分析。为了解决底层密度的窄峰和尾部所带来的困难,提出了两种局部自适应解决方案。一种是非参数的,基于新定义的自适应均值偏移过程,利用了插入法则和样本点密度估计器。另一种是半参数的,在数据上施加局部结构以提取可靠的尺度信息。我们表明,局部带宽应使归一化均值偏移向量的幅值最大化。与固定带宽过程相比,带宽的自适应提供了更优的结果。更多细节参见[15]。
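The second technique above (decomposition stability) lends itself to a direct sketch. The code below is only an illustration under our own assumptions: it uses scikit-learn's MeanShift as the clustering engine and returns the center of the largest bandwidth run over which the cluster count stays constant.

```python
import numpy as np
from sklearn.cluster import MeanShift

def stable_bandwidth(X, bandwidths):
    """Pick the center of the largest operating range of bandwidths
    over which the same number of clusters is obtained for X."""
    counts = [len(np.unique(MeanShift(bandwidth=h).fit(X).labels_))
              for h in bandwidths]
    best_len, best_start, start = 0, 0, 0
    for i in range(1, len(counts) + 1):
        # a run ends when the count changes or the list is exhausted
        if i == len(counts) or counts[i] != counts[start]:
            if i - start > best_len:
                best_len, best_start = i - start, start
            start = i
    return bandwidths[best_start + best_len // 2]

# usage sketch: h = stable_bandwidth(X, np.linspace(0.5, 5.0, 10))
```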

3.2 Implementation Issues

An efficient computation of the mean shift procedure first requires the resampling of the input data with a regular grid. This is a standard technique in the context of density estimation which leads to a binned estimator [62, Appendix D]. The procedure is similar to defining a histogram where linear interpolation is used to compute the weights associated with the grid points. Further reduction in the computation time is achieved by employing algorithms for multidimensional range searching [52, p.373] used to find the data points falling in the neighborhood of a given kernel. For the efficient Euclidean distance computation we used the improved absolute error inequality criterion, derived in [39].

3.2 实现问题

要想高效地计算均值偏移过程,首先需要用规则网格对输入数据进行重采样。这是密度估计中的一种标准技术,由此得到合并估计器(binned estimator)[62,附录D]。该过程类似于定义直方图,其中使用线性插值来计算与网格点相关联的权重。通过采用多维范围搜索算法[52, p.373]来寻找落入给定核邻域内的数据点,可进一步缩短计算时间。为了高效地计算欧氏距离,我们使用了源自[39]的改进绝对误差不等式准则。
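The linear-interpolation binning mentioned above is easy to make concrete. Below is a minimal 1D sketch under our own assumptions (samples are taken to lie inside an equally spaced grid; the function name is illustrative): each sample contributes to its two neighboring grid points with weights proportional to proximity.

```python
import numpy as np

def linear_binning_1d(x, grid):
    """Linearly binned counts of samples x on an equally spaced grid,
    as used by a binned density estimator."""
    counts = np.zeros(len(grid))
    step = grid[1] - grid[0]
    left = np.clip(((x - grid[0]) / step).astype(int), 0, len(grid) - 2)
    frac = (x - grid[left]) / step          # 0 at left node, 1 at right node
    np.add.at(counts, left, 1.0 - frac)     # weight to the left grid point
    np.add.at(counts, left + 1, frac)       # weight to the right grid point
    return counts
```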

4 Applications

The feature space analysis technique introduced in the previous section is application independent and thus can be used to develop vision algorithms for a wide variety of tasks. Two somewhat related applications are discussed in the sequel: discontinuity preserving smoothing and image segmentation. The versatility of the feature space analysis enables the design of algorithms in which the user controls performance through a single parameter, the resolution of the analysis (i.e., bandwidth of the kernel). Since the control parameter has clear physical meaning, the new algorithms can be easily integrated into systems performing more complex tasks. Furthermore, both gray level and color images are processed with the same algorithm, in the former case the feature space containing two degenerate dimensions that have no effect on the mean shift procedure.

Before proceeding to develop the new algorithms, the issue of the employed color space has to be settled. To obtain a meaningful segmentation, perceived color differences should correspond to Euclidean distances in the color space chosen to represent the features (pixels). An Euclidean metric, however, is not guaranteed for a color space [65, Secs.6.5.2; 8.4]. The spaces L*u*v and L*a*b were especially designed to best approximate perceptually uniform color spaces. In both cases L*, the lightness (relative brightness) coordinate, is defined the same way; the two spaces differ only through the chromaticity coordinates. The dependence of all three coordinates on the traditional RGB color values is nonlinear. See [46, Sec.3.5] for a readily accessible source for the conversion formulae. The metric of perceptually uniform color spaces is discussed in the context of feature representation for image segmentation in [16]. In practice there is no clear advantage between using L*u*v or L*a*b; in the proposed algorithms we employed L*u*v, motivated by a linear mapping property [65, p.166].

Our first image segmentation algorithm was a straightforward application of the feature space analysis technique to an L*u*v representation of the color image [11]. The modularity of the segmentation algorithm enabled its integration by other groups to a large variety of applications like image retrieval [1], face tracking [6], object based video coding for MPEG-4 [22], shape detection and recognition [33], and texture analysis [47], to mention only a few. However, since the feature space analysis can be applied unchanged to moderately higher dimensional spaces (see Section 5) we subsequently also incorporated the spatial coordinates of a pixel into its feature space representation. This joint domain representation is employed in the two algorithms described here.

An image is typically represented as a two-dimensional lattice of p-dimensional vectors (pixels), where p = 1 in the gray level case, 3 for color images, and p > 3 in the multispectral case. The space of the lattice is known as the spatial domain while the gray level, color, or spectral information is represented in the range domain. For both domains an Euclidean metric is assumed. When the location and range vectors are concatenated in the joint spatial-range domain of dimension d = p + 2, their different nature has to be compensated by proper normalization. Thus, the multivariate kernel is defined as the product of two radially symmetric kernels and the Euclidean metric allows a single bandwidth parameter for each domain
K_{h_s,h_r}(X)=\frac{C}{h^2_s h^p_r}\,k\left(\left\| \frac{X^s}{h_s} \right\|^2\right) k\left(\left\| \frac{X^r}{h_r} \right\|^2\right) \tag{35}
where X^s is the spatial part, X^r is the range part of a feature vector, k(x) the common profile used in both domains, h_s and h_r the employed kernel bandwidths, and C the corresponding normalization constant. In practice an Epanechnikov, or a (truncated) normal kernel always provides satisfactory performance, so the user only has to set the bandwidth parameter h = (h_s, h_r), which by controlling the size of the kernel determines the resolution of the mode detection.

4 应用

上一节介绍的特征空间分析技术与具体应用无关,因此可用于为各种任务开发视觉算法。后文将讨论两个相关的应用:保持不连续性的平滑(discontinuity preserving smoothing)和图像分割。特征空间分析的多功能性使得所设计的算法只需用户通过单个参数,即分析的分辨率(核带宽),来控制性能。由于该控制参数具有明确的物理含义,新算法可以方便地集成到执行更复杂任务的系统中。此外,灰度图像和彩色图像都用同一算法处理;在前一种情况下,特征空间包含两个退化的维度,它们对均值偏移过程没有影响。

在继续开发新算法之前,必须确定所采用的色彩空间。为了获得有意义的分割,感知到的色差应对应于所选特征(像素)表示色彩空间中的欧几里得距离。但是,色彩空间并不保证具有欧几里得度量[65, Secs.6.5.2; 8.4]。L*u*v和L*a*b空间是专门为尽可能逼近感知均匀的色彩空间而设计的。在这两种空间中,亮度(相对明度)坐标L*的定义方式相同,二者仅在色度坐标上不同。三个坐标对传统RGB颜色值的依赖都是非线性的。转换公式可参见[46, Sec.3.5]。[16]在图像分割的特征表示背景下讨论了感知均匀色彩空间的度量。实践中,使用L*u*v或L*a*b并无明显优劣;出于一个线性映射性质的考虑[65, p.166],所提出的算法采用了L*u*v。

我们的第一个图像分割算法是将特征空间分析技术直接应用于彩色图像的L*u*v表示[11]。分割算法的模块化使其被其他研究组集成到多种应用中,例如图像检索[1]、人脸跟踪[6]、面向MPEG-4的基于对象的视频编码[22]、形状检测与识别[33]以及纹理分析[47],仅举数例。但是,由于特征空间分析可以不加改动地应用于维数稍高的空间(参见第5节),我们随后也将像素的空间坐标并入其特征空间表示。此处描述的两种算法均采用这种联合域表示。

图像通常表示为p维向量(像素)的二维点阵,其中灰度图像 p = 1,彩色图像 p = 3,多光谱图像 p > 3。点阵所在的空间称为空间域,而灰度、颜色或光谱信息则在范围域中表示。两个域都假设采用欧几里得度量。在 d = p + 2 维的空间-范围联合域中串联位置向量和范围向量时,需要通过适当的归一化来补偿它们的不同性质。因此,多元核被定义为两个径向对称核的乘积,欧几里得度量允许每个域各有一个带宽参数
K_{h_s,h_r}(X)=\frac{C}{h^2_s h^p_r}\,k\left(\left\| \frac{X^s}{h_s} \right\|^2\right) k\left(\left\| \frac{X^r}{h_r} \right\|^2\right) \tag{35}
其中 X^s 是特征向量的空间部分,X^r 是其范围部分,k(x) 是两个域共用的廓线,h_s 和 h_r 是所用的核带宽,C 是相应的归一化常数。实践中,Epanechnikov核或(截断的)正态核总能提供令人满意的性能,因此用户只需设置带宽参数 h = (h_s, h_r),它通过控制核的大小决定模式检测的分辨率。
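To make Eq. (35) concrete, here is a small NumPy sketch (our own illustration, with a hypothetical function name) that evaluates the unnormalized product-kernel weight of one pixel relative to a kernel center, using an Epanechnikov profile k(u) = 1 − u for u ≤ 1 and 0 otherwise in both domains.

```python
import numpy as np

def joint_kernel_weight(x, center, hs, hr):
    """Unnormalized weight under the product kernel of Eq. (35).
    x, center: d = p + 2 vectors, spatial part first (2 entries),
    range part last (p entries)."""
    k = lambda u: np.maximum(1.0 - u, 0.0)          # Epanechnikov profile
    us = np.sum(((x[:2] - center[:2]) / hs) ** 2)   # ||X^s / h_s||^2
    ur = np.sum(((x[2:] - center[2:]) / hr) ** 2)   # ||X^r / h_r||^2
    return k(us) * k(ur)

# usage sketch for a gray level pixel (row, col, intensity):
# w = joint_kernel_weight(np.array([10., 12., 95.]),
#                         np.array([ 8., 11., 90.]), hs=8.0, hr=4.0)
```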

4.1 Discontinuity Preserving Smoothing

Smoothing through replacing the pixel in the center of a window by the (weighted) average of the pixels in the window indiscriminately blurs the image, removing not only the noise but also salient information. Discontinuity preserving smoothing techniques, on the other hand, adaptively reduce the amount of smoothing near abrupt changes in the local structure, i.e., edges.

There are a large variety of approaches to achieve this goal, from adaptive Wiener filtering [31], to implementing isotropic [50] and anisotropic [44] local diffusion processes, a topic which recently received renewed interest [19, 37, 56]. The diffusion based techniques, however, do not have a straightforward stopping criterion and after a sufficiently large number of iterations, the processed image collapses into a flat surface. The connection between anisotropic diffusion and M-estimators is analyzed in [5].

A recently proposed noniterative discontinuity preserving smoothing technique is the bilateral filtering [59]. The relation between bilateral filtering and diffusion based techniques was analyzed in [3]. The bilateral filters also work in the joint spatial-range domain. The data is independently weighted in the two domains and the center pixel is computed as the weighted average of the window. The fundamental difference between the bilateral filtering and the mean shift based smoothing algorithm is in the use of local information.

Mean Shift Filtering

Let x_i and z_i, i = 1, …, n, be the d-dimensional input and filtered image pixels in the joint spatial-range domain. For each pixel

  1. Initialize j = 1 and y_{i,1} = x_i.
  2. Compute y_{i,j+1} according to (20) until convergence, y = y_{i,c}.
  3. Assign z_i = (x_i^s, y_{i,c}^r).

The superscripts s and r denote the spatial and range components of a vector, respectively. The assignment specifies that the filtered data at the spatial location x_i^s will have the range component of the point of convergence y_{i,c}^r.

The kernel (window) in the mean shift procedure moves in the direction of the maximum increase in the joint density gradient, while the bilateral filtering uses a fixed, static window. In the image smoothed by mean shift filtering, information beyond the individual windows is also taken into account.

An important connection between filtering in the joint domain and robust M-estimation should be mentioned. The improved performance of the generalized M-estimators (GM, or bounded influence estimators) is due to the presence of a second weight function which offsets the influence of leverage points, i.e., outliers in the input domain [32, Sec.8E]. A similar (at least in spirit) twofold weighting is employed in the bilateral and mean shift based filterings, which is the main reason for their excellent smoothing performance.

Mean shift filtering with uniform kernel having (h_s, h_r) = (8, 4) has been applied to the often used 256 × 256 gray level cameraman image (Figure 3a), the result being shown in Figure 3b. The regions containing the grass field have been almost completely smoothed while details such as the tripod and the buildings in the background were preserved. The processing required fractions of a second on a standard PC (600 MHz Pentium III) using an optimized C++ implementation of the algorithm. On the average 3.06 iterations were necessary until the filtered value of a pixel was defined, i.e., its mean shift procedure converged.

To better visualize the filtering process, the 40 × 20 window marked in Figure 3a is represented in three dimensions in Figure 4a. Note that the data was reflected over the horizontal axis of the window for a more informative display. In Figure 4b the mean shift paths associated with every other pixel (in both directions) from the plateau and the line are shown. Note that convergence points (black dots) are situated in the center of the plateau, away from the discontinuities delineating it. Similarly, the mean shift trajectories on the line remain on it. As a result, the filtered data (Figure 4c) shows clean quasi-homogeneous regions.

The physical interpretation of the mean shift based filtering is easy to see by examining Figure 4a, which in fact displays the three dimensions of the joint domain of a gray level image. Take a pixel on the line. The uniform kernel defines a parallelepiped centered on this pixel, and the computation of the mean shift vector takes into account only those pixels which have both their spatial coordinates and gray level values inside the parallelepiped. Thus, if the parallelepiped is not too large, only pixels on the line are averaged and the new location of the window is guaranteed to remain on it.

A second filtering example is shown in Figure 5. The 512 × 512 color image baboon was processed with mean shift filters employing normal kernels defined using various spatial and range resolutions, (h_s, h_r) = (8 ÷ 32, 4 ÷ 16). While the texture of the fur has been removed, the details of the eyes and the whiskers remained crisp (up to a certain resolution). One can see that the spatial bandwidth has a distinct effect on the output when compared to the range (color) bandwidth. Only features with large spatial support are represented in the filtered image when h_s increases. On the other hand, only features with high color contrast survive when h_r is large. Similar behavior was also reported for the bilateral filter [59, Figure 3].

4.1 保持不连续性的平滑

用窗口内像素的(加权)平均值替换窗口中心像素的平滑处理,会不加区分地模糊图像,不仅去除了噪声,也去除了显著信息。而保持不连续性的平滑技术则自适应地减少局部结构突变处(即边缘)附近的平滑量。

从自适应维纳滤波[31]到实现各向同性[50]和各向异性[44]局部扩散过程,有各种各样的方法可以实现这一目标,最近这个话题受到了新的关注[19,37,56]。但是,基于扩散的技术没有直接的停止标准,并且在经过足够多的迭代之后,处理后的图像会塌陷成平坦的表面。在[5]中分析了各向异性扩散和M估计量之间的联系。

最近提出的一种非迭代的保持不连续性平滑技术是双边滤波[59]。[3]分析了双边滤波与基于扩散的技术之间的关系。双边滤波器同样工作在空间-范围联合域中:数据在两个域中被独立加权,中心像素被计算为窗口的加权平均。双边滤波与基于均值偏移的平滑算法之间的根本区别在于对局部信息的使用方式。

均值偏移滤波

x i x_i xi z i z_i zi , i = 1 , … , n i=1,\dots,n i=1,,n, 表示联合空间范围域中的d维输入和滤波后的图像像素。对于每个像素

  1. 初始化 j = 1 j=1 j=1 , y i , 1 = x i y_{i,1}=x_i yi,1=xi
  2. 根据(20)计算 y i , j + 1 y_{i,j+1} yi,j+1 直到收敛于 y = y i , c y=y_{i,c} y=yi,c
  3. z i = ( x i s , y i , c r ) z_i=(x^s_i,y^r_{i,c}) zi=(xis,yi,cr)

上标 s s s r r r 分别表示向量的空间和范围分量。该分配指定在空间位置 x i s x^s_i xis 处的滤波数据将具有会聚点 y i , c r y^r_{i,c} yi,cr 的范围分量。
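A didactic NumPy sketch of the filtering procedure above for a gray level image follows; it is not the paper's optimized C++ implementation. We assume a uniform kernel and, for simplicity, an axis-aligned window stands in for the radially symmetric kernel; the brute-force neighbor search makes it O(n²) per image and suitable only for small inputs.

```python
import numpy as np

def mean_shift_filter(img, hs, hr, max_iter=100, eps=1e-3):
    """Discontinuity preserving smoothing of a 2D gray level image:
    for each pixel, iterate the sample mean of the joint (row, col,
    gray) neighbors within (hs, hr) until the shift is small, then
    keep the converged range (gray) value."""
    rows, cols = img.shape
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    X = np.stack([rr.ravel(), cc.ravel(), img.ravel()], axis=1).astype(float)
    out = np.empty(len(X))
    for i, x in enumerate(X):
        y = x.copy()
        for _ in range(max_iter):
            near = ((np.abs(X[:, 0] - y[0]) <= hs) &
                    (np.abs(X[:, 1] - y[1]) <= hs) &
                    (np.abs(X[:, 2] - y[2]) <= hr))
            y_new = X[near].mean(axis=0)     # sample mean = mean shift step
            done = np.linalg.norm(y_new - y) < eps
            y = y_new
            if done:
                break
        out[i] = y[2]                        # z_i = (x_i^s, y_{i,c}^r)
    return out.reshape(rows, cols)
```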

均值偏移过程中的核(窗口)沿联合密度梯度最大增加的方向移动,而双边滤波使用固定的静态窗口。因此,经均值偏移滤波平滑的图像还考虑了单个窗口之外的信息。

需要指出联合域滤波与鲁棒M估计之间的一个重要联系。广义M估计器(GM,或有界影响估计器)性能的提升源于第二个权重函数的存在,它抵消了杠杆点(即输入域中的异常值)的影响[32, Sec.8E]。双边滤波和基于均值偏移的滤波中采用了类似(至少在精神上)的双重加权,这是它们平滑性能优异的主要原因。

将 (h_s, h_r) = (8, 4) 的均匀核均值偏移滤波应用于常用的 256×256 灰度摄影师(cameraman)图像(图3a),结果如图3b所示。包含草地的区域几乎被完全平滑,而三脚架和背景中的建筑物等细节得以保留。使用该算法优化的C++实现,在标准PC(600MHz Pentium III)上处理仅需零点几秒。平均需要3.06次迭代才能确定一个像素的滤波值,即其均值偏移过程收敛。


图3:摄影师图像。(a)原图。(b)均值偏移滤波,(h_s, h_r) = (8, 4)。

为了更好地展示滤波过程,图3a中标记的 40×20 窗口在图4a中以三维形式表示。注意,为了显示更多信息,数据沿窗口的水平轴作了反转。图4b显示了平台和直线上每隔一个像素(两个方向上)所对应的均值偏移路径。注意,收敛点(黑点)位于平台中心,远离勾勒平台的不连续处。同样,直线上的均值偏移轨迹始终保持在直线上。因此,滤波后的数据(图4c)呈现出干净的准均匀区域。

通过观察图4a很容易看出基于均值偏移滤波的物理解释,该图实际上显示的是灰度图像联合域的三个维度。取直线上的一个像素。均匀核定义了一个以该像素为中心的平行六面体,而均值偏移向量的计算只考虑空间坐标和灰度值都落在该平行六面体内的像素。因此,只要平行六面体不太大,就只对直线上的像素取平均,从而保证窗口的新位置仍在直线上。


图4:基于均值偏移的灰度数据滤波与分割的可视化。(a)输入。(b)平台和直线上像素的均值偏移路径,黑点为收敛点。(c)滤波结果,(h_s, h_r) = (8, 4)。(d)分割结果。

第二个滤波示例如图5所示。512×512 彩色图像狒狒(baboon)用采用正态核的均值偏移滤波器处理,核由多种空间和范围分辨率 (h_s, h_r) = (8÷32, 4÷16) 定义。虽然皮毛的纹理被去除了,但眼睛和胡须的细节(在一定分辨率内)仍然清晰。可以看出,与范围(颜色)带宽相比,空间带宽对输出有截然不同的影响:当 h_s 增大时,滤波后的图像中只保留空间支持大的特征;而当 h_r 较大时,只有颜色对比度高的特征得以保留。双边滤波器也有类似的表现[59,图3]。


图5:狒狒。原始图像和滤波图像。

4.2 Image Segmentation

Image segmentation, decomposition of a gray level or color image into homogeneous tiles, is arguably the most important low-level vision task. Homogeneity is usually defined as similarity in pixel values, i.e., a piecewise constant model is enforced over the image. From the diversity of image segmentation methods proposed in the literature we will mention only some whose basic processing relies on the joint domain. In each case a vector field is defined over the sampling lattice of the image.

The attraction force field defined in [57] is computed at each pixel as a vector sum of pairwise affinities between the current pixel and all other pixels, with similarity measured in both spatial and range domains. The region boundaries are then identified as loci where the force vectors diverge. It is interesting to note that for a given pixel, the magnitude and orientation of the force field are similar to those of the joint domain mean shift vector computed at that pixel and projected into the spatial domain. However, in contrast to [57] the mean shift procedure moves in the direction of this vector, away from the boundaries.

The edge flow in [34] is obtained at each location for a given set of directions as the magnitude of the gradient of a smoothed image. The boundaries are detected at image locations which encounter two opposite directions of flow. The quantization of the edge flow direction, however, may introduce artifacts. Recall that the direction of the mean shift is dictated solely by the data.

The mean shift procedure based image segmentation is a straightforward extension of the discontinuity preserving smoothing algorithm. Each pixel is associated with a significant mode of the joint domain density located in its neighborhood, after nearby modes were pruned as in the generic feature space analysis technique (Section 3).

Mean Shift Segmentation

Let x_i and z_i, i = 1, …, n, be the d-dimensional input and filtered image pixels in the joint spatial-range domain, and L_i the label of the i-th pixel in the segmented image.

  1. Run the mean shift filtering procedure for the image and store all the information about the d-dimensional convergence point in z_i, i.e., z_i = y_{i,c}.
  2. Delineate in the joint domain the clusters {C_p}_{p=1…m} by grouping together all z_i which are closer than h_s in the spatial domain and h_r in the range domain, i.e., concatenate the basins of attraction of the corresponding convergence points.
  3. For each i = 1, …, n, assign L_i = {p | z_i ∈ C_p}.
  4. Optional: Eliminate spatial regions containing less than M pixels.

The cluster delineation step can be refined according to a priori information, and thus physics-based segmentation algorithms, e.g., [2, 35], can be incorporated. Since this process is performed on region adjacency graphs, hierarchical techniques like [36] can provide significant speed-up. The effect of the cluster delineation step is shown in Figure 4d. Note the fusion into larger homogeneous regions of the result of filtering shown in Figure 4c. The segmentation step does not add a significant overhead to the filtering process.

The region representation used by the mean shift segmentation is similar to the blob representation employed in [64]. However, while the blob has a parametric description (multivariate Gaussians in both spatial and color domain), the partition generated by the mean shift is characterized by a nonparametric model. An image region is defined by all the pixels associated with the same mode in the joint domain.

In [43] a nonparametric clustering method is described in which after kernel density estimation with a small bandwidth the clusters are delineated through concatenation of the detected modes’ neighborhoods. The merging process is based on two intuitive measures capturing the variations in the local density. Being a hierarchical clustering technique the method is computationally expensive, it takes several minutes in MATLAB to analyze a 2000 pixel subsample of the feature space. The method is not recommended to be used in the joint domain since the measures employed in the merging process become ineffective. Comparing the results for arbitrarily shaped synthetic data [43, Figure 6] with a similarly challenging example processed with the mean shift method [12, Figure 1], shows that the use of a hierarchical approach can be successfully avoided in the nonparametric clustering paradigm.

All the segmentation experiments were performed using uniform kernels. The improvement due to joint space analysis can be seen in Figure 6 where the 256 × 256 gray level image MIT was processed with (h_s, h_r, M) = (8, 7, 20). A number of 225 homogeneous regions were identified in fractions of a second, most of them delineating semantically meaningful regions like walls, sky, steps, inscription on the building, etc. Compare the results with the segmentation obtained by one-dimensional clustering of the gray level values in [11, Figure 4] or by using a Gibbs random fields based approach [40, Figure 7].

The joint domain segmentation of the color 256 × 256 room image presented in Figure 7 is also satisfactory. Compare this result with the segmentation presented in [38, Figures 3e and 5c] obtained by recursive thresholding. In both these examples, one can notice that regions in which a small gradient of illumination exists (like the sky in the MIT image or the carpet in the room image) were delineated as a single region. Thus, the joint domain mean shift based segmentation succeeds in overcoming the inherent limitations of methods based only on gray-level or color clustering, which typically oversegment small gradient regions.

The segmentation with (h_s, h_r, M) = (16, 7, 40) of the 512 × 512 color image lake is shown in Figure 8. Compare this result with that of the multiscale approach in [57, Figure 11]. Finally, one can compare the contours of the color image hand, segmented with (h_s, h_r, M) = (16, 19, 40) and presented in Figure 9, with those from [66, Figure 15] obtained through a complex global optimization, and from [41, Figure 4a] obtained with geodesic active contours.

The segmentation is not very sensitive to the choice of the resolution parameters h_s and h_r. Note that all 256 × 256 images used the same h_s = 8, corresponding to a 17 × 17 spatial window, while all 512 × 512 images used h_s = 16, corresponding to a 31 × 31 window. The range parameter h_r and the smallest significant feature size M control the number of regions in the segmented image. The more an image deviates from the assumed piecewise constant model, the larger the values that have to be used for h_r and M to discard the effect of small local variations in the feature space. For example, the heavily textured background in the hand image is compensated by using h_r = 19 and M = 40, values which are much larger than those used for the room image (h_r = 5, M = 20) since the latter better obeys the model. As with any low level vision algorithm, the quality of the segmentation output can be assessed only in the context of the whole vision task, and thus the resolution parameters should be chosen according to that criterion. An important advantage of mean shift based segmentation is its modularity, which makes the control of segmentation output very simple.

Other segmentation examples, in which the original image has the region boundaries superposed, are shown in Figure 10, while Figure 11 compares original and labeled images.

As a potential application of the segmentation, we return to the cameraman image. Figure 12a shows the reconstructed image after the regions corresponding to the sky and grass were manually replaced with white. The mean shift segmentation has been applied with (h_s, h_r, M) = (8, 4, 10). Observe the preservation of the details, which suggests that the algorithm can also be used for image editing, as shown in Figure 12b.

The (unoptimized) JAVA code for the discontinuity preserving smoothing and image segmentation algorithms integrated into a single system with graphical interface is available at http://www.caip.rutgers.edu/riul/research/code.html

4.2 图像分割

图像分割,即将灰度或彩色图像分解为同质区域,可以说是最重要的底层视觉任务。同质性通常定义为像素值的相似性,也就是在图像上施加分段常数模型。在文献提出的众多图像分割方法中,这里只提及一些基本处理依赖于联合域的方法。在每种情况下,都在图像的采样点阵上定义一个向量场。

[57]中定义的吸引力场在每个像素处被计算为当前像素与所有其他像素之间成对亲和度的向量和,相似性在空间域和范围域中同时度量。区域边界随后被识别为力向量发散的轨迹。值得注意的是,对于给定像素,力场的幅值和方向与在该像素处计算并投影到空间域的联合域均值偏移向量相似。然而,与[57]相反,均值偏移过程沿该向量的方向移动,即远离边界。

对于给定的一组方向,[34]中的边缘流在每个位置作为平滑图像梯度的幅值得到。边界在遇到两个相反流动方向的图像位置处被检测出来。然而,边缘流方向的量化可能引入伪影。回想一下,均值偏移的方向完全由数据决定。

基于均值偏移过程的图像分割是保持不连续性平滑算法的直接扩展。按照一般的特征空间分析技术(第3节)对邻近模式进行修剪之后,每个像素被关联到位于其邻域内的联合域密度的一个显著模式。

均值偏移分割

x i x_i xi z i z_i zi , i = 1 , … , n i = 1,\dots,n i=1,,n,为联合空间范围域中的 d d d 维输入滤波图像像素, L i L_i Li 为分割图像中第 i i i 个像素的标签。

  1. 对图像运行均值偏移滤波程序,将 d d d 维收敛点的所有信息存储在 z i z_i zi中,即, z i = y i , c z_i = y_{i,c} zi=yi,c.
  2. 在联合域中描绘聚类 { C p } p = 1 … m \{C_p\}_{p=1\dots m} {Cp}p=1m ,将所有 h s h_s hs 更近的 z i z_i zi 空间域和 h r h_r hr 范围域分组在一起,即将相应收敛点的吸引域连接起来。
  3. 对于每个 i = 1 , . . . , n i=1,...,n i=1,...,n, 令 L i = { p ∣ z i ∈ C p } L_i=\{p|z_i\in C_p\} Li={pziCp}.
  4. 可选:消除包含少于 M M M 个像素的空间区域。
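Step 2 above can be sketched directly. The following NumPy illustration (our own, with a brute-force O(n²) union-find; names are hypothetical) links any two convergence points closer than h_s spatially and h_r in range, then returns one label per pixel.

```python
import numpy as np

def group_convergence_points(Z, hs, hr):
    """Group converged points Z (n x d, first two columns spatial,
    remaining columns range) into clusters, returning labels L_i."""
    n = len(Z)
    parent = np.arange(n)

    def find(a):                       # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if (np.linalg.norm(Z[i, :2] - Z[j, :2]) < hs and
                    np.linalg.norm(Z[i, 2:] - Z[j, 2:]) < hr):
                parent[find(i)] = find(j)   # concatenate basins of attraction

    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```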

可以根据先验信息细化聚类描述步骤,从而引入基于物理的分割算法,如[2,35]。由于这个过程是在区域邻接图上执行的,像[36]这样的层次技术可以提供显著的加速。聚类描述步骤的效果如图4d所示。注意,如图4c所示,滤波结果融合到更大的均匀区域。分割步骤不会给过滤过程增加显著的开销。

均值偏移分割所用的区域表示类似于[64]中采用的斑点(blob)表示。不过,斑点具有参数化描述(空间域和颜色域中的多元高斯),而均值偏移产生的划分由非参数模型刻画:一个图像区域由联合域中与同一模式相关联的所有像素定义。

[43]描述了一种非参数聚类方法,它在以小带宽进行核密度估计之后,通过连接检测到的模式的邻域来描绘聚类。合并过程基于两个捕捉局部密度变化的直观度量。作为一种层次聚类技术,该方法计算代价很高:在MATLAB中分析特征空间的2000像素子样本需要几分钟。由于合并过程所采用的度量会失效,不建议在联合域中使用该方法。将任意形状合成数据的结果[43,图6]与用均值偏移方法处理的类似挑战性示例[12,图1]进行比较可以看出,在非参数聚类范式中可以成功地避免使用层次方法。

所有分割实验均使用均匀核。联合空间分析带来的改进可见于图6,其中 256×256 灰度图像MIT以 (h_s, h_r, M) = (8, 7, 20) 处理。在不到一秒的时间内识别出225个同质区域,其中大多数勾勒出语义上有意义的区域,如墙壁、天空、台阶、建筑物上的铭文等。可将该结果与[11,图4]中对灰度值进行一维聚类得到的分割,或[40,图7]中基于吉布斯随机场方法得到的分割进行比较。


图6:MIT图像。(a)原图。(b)以 (h_s, h_r, M) = (8, 7, 20) 分割。(c)区域边界。

图7中所示 256×256 彩色房间图像的联合域分割同样令人满意。可将该结果与[38,图3e和5c]中通过递归阈值化获得的分割进行比较。在这两个示例中都可以注意到,存在较小照明梯度的区域(如MIT图像中的天空或房间图像中的地毯)被描绘为单个区域。因此,基于联合域均值偏移的分割成功克服了仅基于灰度或颜色聚类方法的固有局限,后者通常会过度分割小梯度区域。


图7:房间图像。(a)原图。(b)以 (h_s, h_r, M) = (8, 5, 20) 得到的区域边界,绘制在输入图像上。

图8显示了 512×512 彩色图像湖(lake)以 (h_s, h_r, M) = (16, 7, 40) 得到的分割。可将该结果与[57,图11]中的多尺度方法进行比较。最后,可以将图9中所示彩色图像手(hand)以 (h_s, h_r, M) = (16, 19, 40) 得到的轮廓,与[66,图15]中通过复杂全局优化获得的轮廓,以及[41,图4a]中用测地线活动轮廓获得的轮廓进行比较。

图8:湖图像。(a)原图。(b)以 (h_s, h_r, M) = (16, 7, 40) 分割。

分割对分辨率参数 h_s 和 h_r 的选择不是很敏感。请注意,所有 256×256 图像都使用相同的 h_s = 8,对应于 17×17 的空间窗口;而所有 512×512 图像使用 h_s = 16,对应于 31×31 的窗口。范围参数 h_r 和最小显著特征尺寸 M 控制分割图像中的区域数量。图像偏离假定的分段常数模型越多,就必须为 h_r 和 M 使用越大的值,以消除特征空间中小的局部变化的影响。例如,手图像中纹理严重的背景通过使用 h_r = 19 和 M = 40 来补偿,这些值远大于房间图像所用的值(h_r = 5, M = 20),因为后者更符合该模型。与任何底层视觉算法一样,分割输出的质量只能在整个视觉任务的背景下评估,因此分辨率参数应依据该准则来选择。基于均值偏移的分割的一个重要优点是其模块化,这使得对分割输出的控制非常简单。


图9:手图像。(a)原图。(b)以 (h_s, h_r, M) = (16, 19, 40) 得到的区域边界,绘制在输入图像上。

其他分割示例见图10,其中区域边界叠加在原图上;图11则对原图与标记图像进行了比较。

图10:风景图像。所有区域边界均以 (h_s, h_r, M) = (8, 7, 100) 获得,并绘制在原图上。

作为分割的一个潜在应用,我们回到摄影师图像。图12a显示了将对应天空和草地的区域手工替换为白色后的重建图像。均值偏移分割采用 (h_s, h_r, M) = (8, 4, 10)。观察细节的保留情况,这表明该算法也可用于图像编辑,如图12b所示。


图11:以 (h_s, h_r, M) = (8, 7, 20) 得到的其他分割示例。左:原图。右:分割图像。


图12:摄影师图像。(a)以 (h_s, h_r, M) = (8, 4, 10) 分割,并在消除代表天空和草地的区域后重建。(b)有监督的纹理插入。

集成于带图形界面的单一系统中的保持不连续性平滑和图像分割算法的(未优化)JAVA代码可在 http://www.caip.rutgers.edu/riul/research/code.html 获得。

5 Discussion

The mean shift based feature space analysis technique introduced in this paper is a general tool which is not restricted to the two applications discussed here. Since the quality of the output is controlled only by the kernel bandwidth, i.e., the resolution of the analysis, the technique should be also easily integrable into complex vision systems where the control is relinquished to a closed loop process. Additional insights on the bandwidth selection can be obtained by testing the stability of the mean shift direction across the different bandwidths, as investigated in [57] in the case of the force field. The nonparametric toolbox developed in this paper is suitable for a large variety of computer vision tasks where parametric models are less adequate, for example, modeling the background in visual surveillance [18].

The complete solution toward autonomous image segmentation is to combine a bandwidth selection technique (like the ones discussed in Section 3.1) with top-down task related high level information. In this case each mean shift process is associated with a kernel best suited to the local structure of the joint domain. Several interesting theoretical issues have to be addressed though, before the benefits of such a data driven approach can be fully exploited. We are currently investigating these issues.

The ability of the mean shift procedure to be attracted by the modes (local maxima) of an underlying density function can be exploited in an optimization framework. Cheng [7] already discusses a simple example. However, by introducing adequate objective functions the optimization problem can acquire physical meaning in the context of a computer vision task. For example, in [14], by defining the distance between the distributions of the model and a candidate of the target, non-rigid objects were tracked in an image sequence under severe distortions. The distance was defined at every pixel in the region of interest of the new frame, and the mean shift procedure was used to find the mode of this measure nearest to the previous location of the target.

The above mentioned tracking algorithm can be regarded as an example of computer vision techniques which are based on in situ optimization. Under this paradigm the solution is obtained by using the input domain to define the optimization problem. The in situ optimization is a very powerful method. In [23] and [58] each input data point was associated with a local field (voting kernel) to produce a more dense structure from where the sought information (salient features, the hyperplane representing the fundamental matrix) can be reliably extracted.

The mean shift procedure is not computationally expensive. Careful C++ implementation of the tracking algorithm allowed real time (30 frames/second) processing of the video stream. While it is not clear if the segmentation algorithm described in this paper can be made so fast, given the quality of the region boundaries it provides, it can be used to support edge detection without significant overhead in time.

Kernel density estimation in particular and nonparametric techniques in general do not scale well with the dimension of the space. This is mostly due to the empty space phenomenon [20, p.70], [54, p.93] by which most of the mass in a high dimensional space is concentrated in a small region of the space. Thus, whenever the feature space has more than (say) six dimensions the analysis should be approached carefully. Employing projection pursuit, in which the density is analyzed along lower dimensional cuts, e.g., [27], is a possibility.

To conclude, the mean shift procedure is a valuable computational module whose versatility can make it an important component of any computer vision toolbox.

5 讨论

本文介绍的基于均值偏移的特征空间分析技术是一种通用工具,并不限于此处讨论的两个应用。由于输出的质量仅由核带宽(即分析的分辨率)控制,该技术也应易于集成到将控制权交给闭环过程的复杂视觉系统中。如[57]中对力场的研究那样,通过测试均值偏移方向在不同带宽下的稳定性,可以获得关于带宽选择的更多见解。本文开发的非参数工具箱适用于参数模型不够充分的各类计算机视觉任务,例如视觉监控中的背景建模[18]。

自主图像分割的完整解决方案是将带宽选择技术(如3.1节中讨论的技术)与自上而下与任务相关的高级信息相结合。在这种情况下,每个均值偏移过程都与最适合联合域局部结构的内核相关联。在充分利用这种数据驱动方法的好处之前,必须解决几个有趣的理论问题。我们目前正在调查这些问题。

均值偏移过程被底层密度函数的模式(局部极大值)吸引的能力,可以在优化框架中加以利用。Cheng[7]已经讨论了一个简单的示例。而通过引入适当的目标函数,优化问题可以在计算机视觉任务的背景下获得物理意义。例如,在[14]中,通过定义模型分布与目标候选分布之间的距离,在严重失真的图像序列中跟踪了非刚性目标。该距离在新帧感兴趣区域的每个像素上定义,并使用均值偏移过程寻找该度量中最接近目标先前位置的模式。

上述跟踪算法可以视为基于原位(in situ)优化的计算机视觉技术的一个实例。在此范式下,解是通过使用输入域来定义优化问题而获得的。原位优化是一种非常强大的方法。在[23]和[58]中,每个输入数据点都与一个局部场(投票核)相关联,从而产生更致密的结构,从中可以可靠地提取所寻求的信息(显著特征、表示基础矩阵的超平面)。

均值偏移过程的计算代价并不高。跟踪算法经过精心的C++实现后,可对视频流进行实时(30帧/秒)处理。虽然尚不清楚本文描述的分割算法能否做到如此之快,但鉴于它所提供的区域边界的质量,它可以在没有显著时间开销的情况下用于支持边缘检测。

核密度估计乃至一般的非参数技术,都不能随空间维数很好地扩展。这主要是由于空洞空间现象(empty space phenomenon)[20, p.70],[54, p.93]:高维空间中的大部分质量集中在空间的一小块区域内。因此,当特征空间超过(比如说)六个维度时,分析就应谨慎进行。一种可行的做法是采用投影寻踪,沿低维切面分析密度,例如[27]。

总而言之,均值偏移过程是一个有价值的计算模块,其多功能性使其成为任何计算机视觉工具箱的重要组成部分。

Appendix

附录

Proof of Theorem 1

If the kernel K has a convex and monotonically decreasing profile, the sequences \{y_j\}_{j=1,2...} and \{\hat{f}_{h,K}(j)\}_{j=1,2...} converge, and \{\hat{f}_{h,K}(j)\}_{j=1,2...} is also monotonically increasing.

Since n is finite, the sequence \hat{f}_{h,K} (21) is bounded, therefore it is sufficient to show that \hat{f}_{h,K} is strictly monotonic increasing, i.e., if y_j \neq y_{j+1} then \hat{f}_{h,K}(j)<\hat{f}_{h,K}(j+1), for j = 1,2.... Without loss of generality it can be assumed that y_j = 0, and thus from (16) and (21)
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)=\frac{c_{k,d}}{nh^d}\sum^n_{i=1} \left[ k\left(\left\| \frac{y_{j+1}-x_i}{h} \right\|^2\right)- k\left(\left\| \frac{x_i}{h} \right\|^2\right) \right] \tag{A.1}
The convexity of the profile k(x) implies that
k(x_2)\geq k(x_1)+k^{'}(x_1)(x_2-x_1) \tag{A.2}
for all x_1,x_2\in[0,\infty), x_1\neq x_2, and since g(x) = -k^{'}(x), the inequality (A.2) becomes
k(x_2)-k(x_1)\geq g(x_1)(x_1-x_2) \tag{A.3}
Using now (A.1) and (A.3) we obtain
\begin{aligned} \hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j) &\geq \frac{c_{k,d}}{nh^{d+2}}\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) [\|x_i\|^2-\|y_{j+1}-x_i\|^2]\\ &=\frac{c_{k,d}}{nh^{d+2}}\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) [2y^\top_{j+1}x_i-\|y_{j+1}\|^2]\\ &=\frac{c_{k,d}}{nh^{d+2}}\left[2y^\top_{j+1}\sum^n_{i=1}x_ig\left(\left\| \frac{x_i}{h} \right\|^2\right)-\|y_{j+1}\|^2\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) \right] \end{aligned} \tag{A.4}
and recalling (20) yields
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)=\frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}\|^2\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) \tag{A.5}
The profile k(x) being monotonically decreasing for all x\geq0, the sum \sum^n_{i=1}g\left(\left\|\frac{x_i}{h}\right\|^2\right) is strictly positive. Thus, as long as y_{j+1}\neq y_j=0, the right term of (A.5) is strictly positive, i.e., \hat{f}_{h,K}(j+1)>\hat{f}_{h,K}(j). Consequently, the sequence \{\hat{f}_{h,K}(j)\}_{j=1,2...} is convergent.

To prove the convergence of the sequence \{y_j\}_{j=1,2...}, (A.5) is rewritten for an arbitrary kernel location y_j\neq0. After some algebra we have
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)\geq \frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}-y_{j}\|^2 \sum^n_{i=1}g\left(\left\|\frac{y_j-x_i}{h} \right\|^2\right) \tag{A.6}
Summing now the two terms of the inequality (A.6) for indices j, j+1, ..., j+m-1, it results that
\begin{aligned} \hat{f}_{h,K}(j+m)-\hat{f}_{h,K}(j) &\geq \frac{c_{k,d}}{nh^{d+2}}\|y_{j+m}-y_{j+m-1}\|^2\sum^n_{i=1}g\left(\left\| \frac{y_{j+m-1}-x_i}{h} \right\|^2\right)+...\\ &+\frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}-y_{j}\|^2\sum^n_{i=1}g\left(\left\| \frac{y_{j}-x_i}{h} \right\|^2\right) \\ &\geq\frac{c_{k,d}}{nh^{d+2}}[\|y_{j+m}-y_{j+m-1}\|^2+...+\|y_{j+1}-y_{j}\|^2]M \\ &\geq\frac{c_{k,d}}{nh^{d+2}}\|y_{j+m}-y_{j}\|^2M \end{aligned} \tag{A.7}
where M represents the minimum (always strictly positive) of the sum \sum^n_{i=1}g\left(\left\|\frac{y_j-x_i}{h} \right\|^2\right) for all \{y_j\}_{j=1,2...}.

Since \{\hat{f}_{h,K}(j)\}_{j=1,2...} is convergent, it is also a Cauchy sequence. This property, in conjunction with (A.7), implies that \{y_j\}_{j=1,2...} is a Cauchy sequence; hence, it is convergent in the Euclidean space.

定理1的证明

如果核 K 具有凸且单调递减的廓线,则序列 \{y_j\}_{j=1,2...} 和 \{\hat{f}_{h,K}(j)\}_{j=1,2...} 收敛,且 \{\hat{f}_{h,K}(j)\}_{j=1,2...} 单调递增。

由于 n 有限,序列 \hat{f}_{h,K}(21)有界,因此只需证明 \hat{f}_{h,K} 严格单调递增,即对 j = 1,2...,若 y_j \neq y_{j+1},则 \hat{f}_{h,K}(j)<\hat{f}_{h,K}(j+1)。不失一般性,可以假设 y_j = 0,于是由(16)和(21)得
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)=\frac{c_{k,d}}{nh^d}\sum^n_{i=1} \left[ k\left(\left\| \frac{y_{j+1}-x_i}{h} \right\|^2\right)- k\left(\left\| \frac{x_i}{h} \right\|^2\right) \right] \tag{A.1}
廓线 k(x) 的凸性表明
k(x_2)\geq k(x_1)+k^{'}(x_1)(x_2-x_1) \tag{A.2}
对所有 x_1,x_2\in[0,\infty), x_1\neq x_2 成立,又因为 g(x) = -k^{'}(x),不等式(A.2)变为
k(x_2)-k(x_1)\geq g(x_1)(x_1-x_2) \tag{A.3}
由(A.1)和(A.3)得
\begin{aligned} \hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j) &\geq \frac{c_{k,d}}{nh^{d+2}}\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) [\|x_i\|^2-\|y_{j+1}-x_i\|^2]\\ &=\frac{c_{k,d}}{nh^{d+2}}\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) [2y^\top_{j+1}x_i-\|y_{j+1}\|^2]\\ &=\frac{c_{k,d}}{nh^{d+2}}\left[2y^\top_{j+1}\sum^n_{i=1}x_ig\left(\left\| \frac{x_i}{h} \right\|^2\right)-\|y_{j+1}\|^2\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) \right] \end{aligned} \tag{A.4}
结合(20)得
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)=\frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}\|^2\sum^n_{i=1}g\left(\left\| \frac{x_i}{h} \right\|^2\right) \tag{A.5}
由于廓线 k(x) 对所有 x\geq0 单调递减,和式 \sum^n_{i=1}g\left(\left\|\frac{x_i}{h}\right\|^2\right) 严格为正。因此,只要 y_{j+1}\neq y_j=0,(A.5)的右端就严格为正,即 \hat{f}_{h,K}(j+1)>\hat{f}_{h,K}(j)。因此,序列 \{\hat{f}_{h,K}(j)\}_{j=1,2...} 收敛。

为了证明序列 \{y_j\}_{j=1,2...} 的收敛性,将(A.5)对任意核位置 y_j\neq0 重写。经过一些代数运算,我们有
\hat{f}_{h,K}(j+1)-\hat{f}_{h,K}(j)\geq \frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}-y_{j}\|^2 \sum^n_{i=1}g\left(\left\|\frac{y_j-x_i}{h} \right\|^2\right) \tag{A.6}
将不等式(A.6)的两端对指标 j, j+1, ..., j+m-1 求和,可得
\begin{aligned} \hat{f}_{h,K}(j+m)-\hat{f}_{h,K}(j) &\geq \frac{c_{k,d}}{nh^{d+2}}\|y_{j+m}-y_{j+m-1}\|^2\sum^n_{i=1}g\left(\left\| \frac{y_{j+m-1}-x_i}{h} \right\|^2\right)+...\\ &+\frac{c_{k,d}}{nh^{d+2}}\|y_{j+1}-y_{j}\|^2\sum^n_{i=1}g\left(\left\| \frac{y_{j}-x_i}{h} \right\|^2\right) \\ &\geq\frac{c_{k,d}}{nh^{d+2}}[\|y_{j+m}-y_{j+m-1}\|^2+...+\|y_{j+1}-y_{j}\|^2]M \\ &\geq\frac{c_{k,d}}{nh^{d+2}}\|y_{j+m}-y_{j}\|^2M \end{aligned} \tag{A.7}
其中 M 表示和式 \sum^n_{i=1}g\left(\left\|\frac{y_j-x_i}{h} \right\|^2\right) 在所有 \{y_j\}_{j=1,2...} 上的最小值(总是严格为正)。

由于 \{\hat{f}_{h,K}(j)\}_{j=1,2...} 收敛,它也是柯西序列。该性质与(A.7)结合表明 \{y_j\}_{j=1,2...} 是柯西序列,因此它在欧几里得空间中收敛。
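The monotone increase proved above is easy to check numerically. The snippet below is a sanity-check sketch of ours (not from the paper): with a normal profile k(u) = exp(−u), the weights g = −k′ equal the kernel weights, so each mean shift step is the Gaussian-weighted mean, and the (unnormalized) density estimate at the iterate should never decrease.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # toy data set, d = 2
h = 1.0

def f_hat(y):
    """Kernel density estimate at y with a normal kernel,
    multiplicative constants dropped (they do not affect monotonicity)."""
    return np.exp(-np.sum(((y - X) / h) ** 2, axis=1)).sum()

y = np.array([3.0, -2.0])              # arbitrary start y_1
prev = f_hat(y)
for _ in range(50):                    # mean shift iterations, Eq. (20)
    w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1))
    y = (w[:, None] * X).sum(axis=0) / w.sum()
    cur = f_hat(y)
    assert cur >= prev - 1e-12         # Theorem 1: f_hat(j) is increasing
    prev = cur
```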

Proof of Theorem 2

The cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed.

We can assume w.l.g. that y_j=0 and y_{j+1}\neq y_j=0, since otherwise convergence has already been achieved. Therefore the mean shift vector \pmb{m}_{h,N}(0) is
\pmb{m}_{h,N}(0)=y_{j+1}=\frac{\sum^n_{i=1}x_i\exp(-\|\frac{x_i}{h}\|^2)} {\sum^n_{i=1}\exp(-\|\frac{x_i}{h}\|^2)} \tag{B.1}
We will show first that when the weights are given by a normal kernel centered at y_{j+1}, the weighted sum of the projections of (y_{j+1}-x_i) onto y_{j+1} is strictly negative, i.e.,
\sum^n_{i=1}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)<0 \tag{B.2}
The space R^d can be decomposed into the following three domains
\begin{aligned} D_1&=\{x\in R^d \mid y^\top_{j+1}x\leq\tfrac{1}{2}\|y_{j+1}\|^2 \}\\ D_2&=\{x\in R^d \mid \tfrac{1}{2}\|y_{j+1}\|^2<y^\top_{j+1}x\leq \|y_{j+1}\|^2\}\\ D_3&=\{x\in R^d \mid \|y_{j+1}\|^2< y^\top_{j+1}x\} \end{aligned} \tag{B.3}
and after some simple manipulations from (B.1) we can derive the equality
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) =\sum_{x_i\in D_1\bigcup D_3}(y^\top_{j+1}x_i-\|y_{j+1}\|^2)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.4}
In addition, for x_i\in D_2 we have \|y_{j+1}\|^2-y^\top_{j+1}x_i\geq 0, which implies
\|y_{j+1}-x_i\|^2=\|y_{j+1}\|^2+\|x_i\|^2-2y^\top_{j+1}x_i\geq\|x_i\|^2-\|y_{j+1}\|^2 \tag{B.5}
from where
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.6}
Introducing now (B.4) in (B.6) we have
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_1\bigcup D_3}(y^\top_{j+1}x_i-\|y_{j+1}\|^2)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.7}
and by adding to both sides of (B.7) the quantity
\sum_{x_i\in D_1\bigcup D_3}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)
after some algebra it results that
\sum^n_{i=1}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_1\bigcup D_3}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \times\left\{\exp\left[-\frac{2}{h^2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i) \right]-1\right\} \tag{B.8}
The right side of (B.8) is negative because (\|y_{j+1}\|^2-y^\top_{j+1}x_i) and the last product term have opposite signs in both the D_1 and D_3 domains. Therefore, the left side of (B.8) is also negative, which proves (B.2).

We can use now (B.2) to write
\|y_{j+1}\|^2<y^\top_{j+1}\frac{\sum^n_{i=1}x_i\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)}{\sum^n_{i=1}\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)}=y^\top_{j+1}y_{j+2} \tag{B.9}
from where
\frac{y^\top_{j+1}(y_{j+2}-y_{j+1})}{\|y_{j+1}\|\,\|y_{j+2}-y_{j+1}\|}>0 \tag{B.10}
or by taking into account (24)
\frac{\pmb{m}_{h,N}(y_j)^\top\pmb{m}_{h,N}(y_{j+1})}{\|\pmb{m}_{h,N}(y_j)\|\,\|\pmb{m}_{h,N}(y_{j+1})\|}>0

定理2的证明

当使用正态核时,两个连续的均值偏移向量夹角的余弦严格为正。

不失一般性,可以假设 y_j=0 且 y_{j+1}\neq y_j=0,否则已经收敛。因此,均值偏移向量 \pmb{m}_{h,N}(0) 为
\pmb{m}_{h,N}(0)=y_{j+1}=\frac{\sum^n_{i=1}x_i\exp(-\|\frac{x_i}{h}\|^2)} {\sum^n_{i=1}\exp(-\|\frac{x_i}{h}\|^2)} \tag{B.1}
我们首先证明,当权重由以 y_{j+1} 为中心的正态核给出时,(y_{j+1}-x_i) 在 y_{j+1} 上投影的加权和严格为负,即
\sum^n_{i=1}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)<0 \tag{B.2}
空间 R^d 可分解为以下三个域
\begin{aligned} D_1&=\{x\in R^d \mid y^\top_{j+1}x\leq\tfrac{1}{2}\|y_{j+1}\|^2 \}\\ D_2&=\{x\in R^d \mid \tfrac{1}{2}\|y_{j+1}\|^2<y^\top_{j+1}x\leq \|y_{j+1}\|^2\}\\ D_3&=\{x\in R^d \mid \|y_{j+1}\|^2< y^\top_{j+1}x\} \end{aligned} \tag{B.3}
对(B.1)做一些简单的运算,可以推出等式
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) =\sum_{x_i\in D_1\bigcup D_3}(y^\top_{j+1}x_i-\|y_{j+1}\|^2)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.4}
此外,对于 x_i\in D_2,有 \|y_{j+1}\|^2-y^\top_{j+1}x_i\geq 0,这意味着
\|y_{j+1}-x_i\|^2=\|y_{j+1}\|^2+\|x_i\|^2-2y^\top_{j+1}x_i\geq\|x_i\|^2-\|y_{j+1}\|^2 \tag{B.5}
由此可得
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.6}
将(B.4)代入(B.6),得
\sum_{x_i\in D_2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_1\bigcup D_3}(y^\top_{j+1}x_i-\|y_{j+1}\|^2)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \tag{B.7}
在(B.7)两边加上
\sum_{x_i\in D_1\bigcup D_3}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)
经过一些代数运算可得
\sum^n_{i=1}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right) \leq\exp\left(\left\|\frac{y_{j+1}}{h}\right\|^2\right)\sum_{x_i\in D_1\bigcup D_3}(\|y_{j+1}\|^2-y^\top_{j+1}x_i)\exp\left(-\left\|\frac{x_i}{h}\right\|^2\right) \times\left\{\exp\left[-\frac{2}{h^2}(\|y_{j+1}\|^2-y^\top_{j+1}x_i) \right]-1\right\} \tag{B.8}
(B.8)的右端为负,因为 (\|y_{j+1}\|^2-y^\top_{j+1}x_i) 与最后的乘积项在 D_1 和 D_3 域中都具有相反的符号。因此,(B.8)的左端也为负,从而证明了(B.2)。

现在利用(B.2)可写出
\|y_{j+1}\|^2<y^\top_{j+1}\frac{\sum^n_{i=1}x_i\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)}{\sum^n_{i=1}\exp\left(-\left\|\frac{y_{j+1}-x_i}{h}\right\|^2\right)}=y^\top_{j+1}y_{j+2} \tag{B.9}
由此得
\frac{y^\top_{j+1}(y_{j+2}-y_{j+1})}{\|y_{j+1}\|\,\|y_{j+2}-y_{j+1}\|}>0 \tag{B.10}
再结合(24),即得
\frac{\pmb{m}_{h,N}(y_j)^\top\pmb{m}_{h,N}(y_{j+1})}{\|\pmb{m}_{h,N}(y_j)\|\,\|\pmb{m}_{h,N}(y_{j+1})\|}>0
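As with Theorem 1, this property can be sanity-checked numerically. The sketch below (ours, not from the paper) runs the mean shift iteration (B.1) with a normal kernel and asserts that the cosine between consecutive shift vectors stays positive until convergence.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
h = 0.8

def step(y):
    """One mean shift step with a normal kernel, as in (B.1)."""
    w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1))
    return (w[:, None] * X).sum(axis=0) / w.sum()

y = np.array([4.0, 4.0])
prev_shift = None
for _ in range(30):
    y_next = step(y)
    shift = y_next - y
    if (prev_shift is not None
            and np.linalg.norm(shift) > 1e-9
            and np.linalg.norm(prev_shift) > 1e-9):
        cos = shift @ prev_shift / (np.linalg.norm(shift)
                                    * np.linalg.norm(prev_shift))
        assert cos > 0                 # Theorem 2: the path never doubles back
    prev_shift, y = shift, y_next
```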

参考文献

References
