Paper Translation 3-3: Flexible Metric Nearest Neighbor Classification

Flexible Metric Nearest Neighbor Classification

Jerome H. Friedman*
Stanford, California
November 11, 1994

*Supported by the National Science Foundation under grant DMS-9403804.

Abstract

The K-nearest-neighbor decision rule assigns an object of unknown class to the plurality class among the K labeled "training" objects that are closest to it. Closeness is usually defined in terms of a metric distance on the Euclidean space with the input measurement variables as axes. The metric chosen to define this distance can strongly affect performance. An optimal choice depends on the problem at hand as characterized by the respective class distributions on the input measurement space, and, within a given problem, on the location of the unknown object in that space. In this paper new types of K-nearest-neighbor procedures are described that estimate the local relevance of each input variable, or their linear combinations, for each individual point to be classified. This information is then used to separately customize the metric used to define distance from that object in finding its nearest neighbors. These procedures are a hybrid between regular K-nearest-neighbor methods and the tree-structured recursive partitioning techniques popular in statistics and machine learning.
1 Introduction
Nearest-neighbor methods are among the most popular for classification. They represent the earliest general (nonparametric) methods proposed for this problem and were heavily investigated in the fields of statistics and (especially) pattern recognition. Recently, renewed interest in them has emerged in the connectionist literature ("memory" methods) and also in machine learning ("instance-based" methods). Despite their basic simplicity, and the fact that many more sophisticated alternative techniques have been developed since their introduction, nearest-neighbor methods still remain among the most successful for many classification problems.
This paper presents extensions to the basic nearest-neighbor approach with the goal of enhancing its performance in certain situations. These are characterized by the fact that the measured variables input to the classification procedure may not all be equally relevant for classifying a new object. Moreover, this differential relevance may depend on the location of the object in the input measurement space; that is, the relevance of a particular input variable may differ from object to object within the same classification problem. It is well known that input variables of low relevance can degrade the performance of nearest-neighbor procedures if they are allowed to be equally influential with those of high relevance in defining the (near-neighbor) distance from the point to be classified.
If the relative (local) relevance of each input variable were known, this information could be used to advantage by constructing a metric to define nearest-neighbor distance that provides differential weighting for the inputs; variables of higher relevance receive more weight in defining the distance. Unfortunately, such information is seldom available in advance, so that it is either ignored, or attempts must be made to estimate it from the training data at hand. In this paper new types of nearest-neighbor procedures are described that estimate the local relevance of each input variable for each individual object to be classified. This information is then used to separately customize the metric used to define the distance from that object in finding its nearest neighbors.
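To make the idea concrete, here is a minimal sketch (not the procedure developed in this paper) of a K-nearest-neighbor classifier under a differentially weighted Euclidean metric; the data, the function name `weighted_knn_predict`, and the fixed weight vector are all illustrative assumptions.

```python
import numpy as np

def weighted_knn_predict(x, X_train, y_train, weights, K=5):
    """Plurality vote among the K training objects closest to x,
    with per-variable weights scaling each input's influence on
    the distance (higher weight = more relevant variable)."""
    # Weighted squared Euclidean distance from x to every training point.
    d2 = ((X_train - x) ** 2 * weights).sum(axis=1)
    nn = np.argsort(d2)[:K]                   # indices of the K closest objects
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]          # plurality class

# Toy data: the class depends only on x1; x2 is irrelevant noise
# on a large scale, which corrupts an unweighted distance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X[:, 1] *= 10.0

x0 = np.array([0.5, 0.0])
print(weighted_knn_predict(x0, X, y, weights=np.array([1.0, 1.0])))  # equal weights
print(weighted_knn_predict(x0, X, y, weights=np.array([1.0, 0.0])))  # x2 suppressed
```

With equal weights the neighbors are selected mostly by the noisy second variable; downweighting it recovers neighbors that are close in the relevant one. Estimating such weights locally, rather than fixing them globally, is what the procedures described below aim to do.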
The paper is organized as follows. The next section defines the (generic) classification problem, establishing notation for what follows. Section 3 focuses on the K-nearest-neighbor approach, characterizing its strengths and weaknesses. Two established methods for attempting to mitigate these weaknesses are then described: input variable subset selection, and the recursive partitioning approach basic to the tree-structured induction methods of machine learning. Possible limitations of these methods are described so as to motivate the new procedures described in Section 4. Sections 4.1 - 4.3 describe measures of local input variable relevance and how they can be estimated from the training data at hand. Section 4.4 describes one way to use these estimates (the "machete") to form a local metric for classifying a new object. Section 4.5 discusses generalizations of the basic machete through including additional ("derived") variables that are functions of the original inputs. These derived variables are individually customized to have high local relevance for the particular object being classified. Section 4.6 presents a further generalization (the "scythe") which defines a whole class of classification procedures that includes the machete and ordinary K-nearest-neighbor methods as special cases. The relative performances of several versions of these procedures, as well as the ordinary K-nearest-neighbor method and tree-based recursive partitioning methods, are examined in Section 5. Section 5.1 compares the methods on a set of artificially simulated examples, while Section 5.2 uses real data examples. A concluding summary is presented in Section 6.

2 Classification

In a classification problem one has objects, each characterized by a set of measurement values $x = (x_1, \cdots, x_p) \in R^p$, and each a member of one of $J$ groups ("classes"), $x \in \{G_j\}_1^J$. The particular group is unknown, and the goal is to assign the object to its correct group using its measurements $x$. More formally, let $L_{jk}$ be the loss (cost) associated with assigning it to the $j$th group, $x \rightarrow G_j$, when it is actually a member of the $k$th group, $x \in G_k$. The expected loss ("risk of misclassification") is then

$$R_j(x) = \sum_{k=1}^{J} L_{jk} \Pr(k \mid x) \qquad (1)$$

where $\Pr(k \mid x)$ is the probability that $x$ is a member of the $k$th group, given its particular set of measurement values. The risk (1) is minimized by the assignment $x \rightarrow G_{j^*}$ with

$$j^* = \arg \min_{1 \le j \le J} R_j(x), \qquad (2)$$

which reduces to

$$j^* = \arg \max_{1 \le j \le J} \Pr(j \mid x) \qquad (3)$$

if all misclassifications are considered equally costly,

$$L_{jk} = 1 - \delta_{jk}, \qquad \delta_{jk} = \begin{cases} 1 & j = k \\ 0 & j \ne k, \end{cases} \qquad (4)$$

so that the risk reduces to the misclassification ("error") rate. The rules (2) (3) are known as the "Bayes" decision rules (for the given problem and loss matrix $L_{jk}$), and their associated misclassification risk represents the minimum attainable value.
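To make (1)-(4) concrete, the following sketch evaluates the Bayes rule for a hypothetical three-class problem; the posterior values and the loss matrix are made up for illustration.

```python
import numpy as np

# Hypothetical posteriors Pr(k|x) at some point x, and a loss matrix
# L[j, k] = cost of assigning group j when the true group is k.
post = np.array([0.5, 0.3, 0.2])
L = np.array([[0.0, 1.0, 5.0],    # confusing group 3 for group 1 is costly
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

risk = L @ post                   # R_j(x) = sum_k L_jk Pr(k|x), eq. (1)
print("risks:", risk)             # [1.3, 0.7, 0.8]
print("Bayes assignment (2): group", risk.argmin() + 1)

# Under 0/1 loss (4), minimizing (1) is the same as maximizing Pr(j|x), eq. (3).
L01 = 1.0 - np.eye(3)
assert (L01 @ post).argmin() == post.argmax()
```

Note that under the unequal losses the Bayes rule picks group 2 even though group 1 has the largest posterior; the loss matrix, not the posterior alone, determines the assignment.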
In order to apply the Bayes rule, the (true) conditional probabilities $\{\Pr(j \mid x)\}_1^J$ must be known at every point $x \in R^p$ for which an assignment prediction is to be made. This is almost never the case. Under the "supervised learning" paradigm one has a ("training") sample of correctly labeled objects

$$\{x_n, g_n\}_1^N, \qquad g_n = j \Rightarrow x_n \in G_j, \qquad (5)$$

presumed to be a random sample drawn with the respective probabilities $\{\Pr(j \mid x_n)\}_1^J$. It is also presumed that the distribution of the training sample $\{x_n\}_1^N$ over the measurement space is (at least somewhat) representative of the corresponding distribution of the future "test" objects to be classified. The training data (5) are used to obtain estimates

$$\{\widehat{\Pr}(j \mid x)\}_1^J \qquad (6)$$

which are then used in (1) (2),

$$\hat{j}(x) = \arg \min_{1 \le j \le J} \sum_{k=1}^{J} L_{jk} \widehat{\Pr}(k \mid x), \qquad (7)$$

or in (3), to estimate the class assignments.
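The estimates (6) can come from any of the approaches discussed next. As a preview of the nearest-neighbor methods of Section 3, here is a hedged sketch in which $\widehat{\Pr}(j \mid x)$ is simply the fraction of each class among the K nearest training objects, plugged into (3); the synthetic data and the function name `knn_posterior` are assumptions for illustration.

```python
import numpy as np

def knn_posterior(x, X_train, g_train, J, K=10):
    """Estimate {Pr(j|x)} (6) by the class fractions among the K
    training objects nearest to x in ordinary Euclidean distance."""
    nn = np.argsort(((X_train - x) ** 2).sum(axis=1))[:K]
    return np.bincount(g_train[nn], minlength=J) / K

# Synthetic two-class training sample (5) with labels 0 and 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
g = (X[:, 0] + X[:, 1] > 0).astype(int)

p_hat = knn_posterior(np.array([1.0, 1.0]), X, g, J=2)
print("estimated posteriors:", p_hat)
print("assignment via (3):", p_hat.argmax())
```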
One class of techniques is based on density estimation, making use of Bayes' theorem:

$$\Pr(j \mid x) = \frac{\Pr(x \mid j)\,\Pr(j)}{\sum_{k=1}^{J} \Pr(x \mid k)\,\Pr(k)}. \qquad (8)$$

Here $\Pr(x \mid j)$ is the probability density of the $j$th class over the measurement space, and $\Pr(j)$ is the ("prior") probability of observing a class $j$ object in the absence of a set of measurement values $x$. The training sample is partitioned into $J$ groups according to the class labels, and the data in each group are separately used to estimate the respective class conditional probability densities $\Pr(x \mid j)$ over the measurement space.

The prior probabilities are either known in advance or estimated as the proportions of each class in the training sample. These estimates are then used to derive estimates of the location conditional probabilities (6) through (8). An example of this density estimation approach is "discriminant analysis" [see McLachlan (1992)], in which the class conditional densities are approximated by normal distributions and the training data in each class are used to estimate the respective parameters (mean vectors and covariance matrices). Other examples of this approach include methods based on (normal) mixtures [Chow and Chen (1992)] and learned vector quantization techniques [Kohonen (1990)].
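As a concrete instance of the density estimation route, the sketch below fits one normal density per class and combines them through (8), i.e. a minimal quadratic discriminant analysis in the spirit of McLachlan (1992); the synthetic data and the function name `qda_posterior` are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_posterior(x, X_train, g_train, J):
    """Estimate Pr(j|x) via Bayes' theorem (8) with normal
    class-conditional densities and proportion-based priors."""
    num = np.empty(J)
    for j in range(J):
        Xj = X_train[g_train == j]
        prior = len(Xj) / len(X_train)        # Pr(j): class proportion
        dens = multivariate_normal.pdf(       # Pr(x|j): fitted normal
            x, mean=Xj.mean(axis=0), cov=np.cov(Xj, rowvar=False))
        num[j] = dens * prior
    return num / num.sum()                    # normalization as in (8)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
g = np.repeat([0, 1], 100)
print(qda_posterior(np.array([0.8, 0.8]), X, g, J=2))
```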
Density estimation approaches to classification thus condition on the values of the class labels in order to estimate the probabilities. The second class of supervised learning techniques used for classification is based on regression; it conditions on the prediction point $x$ and directly estimates the probabilities (6), $\{\Pr(j \mid x)\}_1^J$. At a given prediction point $x$, the class label $g$ is regarded as a random variable from a multinomial distribution with probabilities $\{\Pr(j \mid x)\}_1^J$. Each potential value of $g$ (at $x$) is characterized by a separate "dummy" output variable
$$y_j = I(g = j), \qquad j = 1, \ldots, J, \qquad (9)$$

where $I(\cdot)$ indicates the truth of its argument. Clearly

$$f_j(x) := \Pr(j \mid x) = \Pr(y_j = 1 \mid x) = E(y_j \mid x) \qquad (10)$$

and

$$f_j(x) = \arg \min_{f} E[(y_j - f)^2 \mid x], \qquad j = 1, \ldots, J, \qquad (11)$$

so that the location conditional probabilities $\{f_j(x) = \Pr(j \mid x)\}_1^J$ are the solutions to a set of (conditional) least squares problems. The regression approach assumes that the training data (5), and the future data to be predicted, are random realizations of this process, and applies standard regression methodology to estimate each of the separate "target" functions $f_j(x)$ from the respective training samples

$$\{x_n, y_{jn}\}_{n=1}^N, \qquad j = 1, \ldots, J. \qquad (12)$$

Since they represent probabilities, the target functions $\{f_j(x)\}_1^J$ satisfy the constraints

$$0 \le f_j(x) \le 1, \qquad \sum_{j=1}^{J} f_j(x) = 1 \qquad (13)$$

for all values of $x$. CART [Breiman, Friedman, Olshen and Stone (1984)], projection pursuit [Friedman (1985)], neural networks [Lippmann (1989)], nearest-neighbor kernel methods, as well as many other techniques developed in machine learning and pattern recognition, directly or indirectly apply this regression paradigm to the classification problem.
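To illustrate (9)-(13), the sketch below encodes the labels as dummy outputs and fits each target function $f_j$ by linear least squares. The linear model is only a stand-in for whatever regressor (CART, a neural network, nearest neighbors) one actually uses; an unconstrained fit need not satisfy the bounds in (13), although with an intercept in the model the fitted probabilities do sum to one.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
g = (X[:, 0] > 0).astype(int)          # labels in {0, 1}, so J = 2
J = 2

Y = np.eye(J)[g]                       # dummy outputs (9): Y[n, j] = I(g_n = j)

# One least-squares fit per target function f_j, cf. (11)-(12);
# a linear model with an intercept serves as the stand-in regressor.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # shape (p + 1, J)

x0 = np.array([1.0, 0.5, 0.5])         # [intercept, x1, x2] at a query point
f_hat = x0 @ coef
print("f_hat:", f_hat, "sum:", f_hat.sum())    # sums to 1, cf. (13)
print("assignment via (3):", f_hat.argmax())
```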

3 Nearest-neighbor kernel methods
