Classifying data with Python: how do I classify data in Python using the nearest-neighbour algorithm?

I need to classify some data with (I hope) a nearest-neighbour algorithm. I've googled this problem and found a lot of libraries (including PyML, mlPy and Orange), but I'm unsure of where to start here.

How should I go about implementing k-NN using Python?

Solution

Particularly given the technique (k-Nearest Neighbors) that you mentioned in your Q, I would strongly recommend scikits.learn. [Note: after this Answer was posted, the lead developer of this Project informed me of a new homepage for this Project.]

A few features that I believe distinguish this library from the others (at least the other Python ML libraries that I have used, which is most of them):

  • an extensive diagnostics & testing library (including plotting modules, via Matplotlib), with feature-selection algorithms, confusion matrix, ROC, precision-recall, etc.;

  • a nice selection of 'batteries-included' data sets (including handwritten digits, facial images, etc.) particularly suited for ML techniques;

  • extensive documentation (a nice surprise given that this Project is only about two years old), including tutorials and step-by-step example code (which uses the supplied data sets).
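To make the first two points concrete, here is a minimal sketch that trains a k-NN classifier on the bundled handwritten-digits data set and inspects it with a confusion matrix. It assumes a current scikit-learn release (imported as `sklearn`); the split ratio, `random_state`, and `n_neighbors=5` are illustrative choices of mine, not values from the answer.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load one of the bundled data sets: 8x8 images of handwritten digits.
digits = load_digits()

# Hold out a quarter of the samples for evaluation (illustrative split).
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.25, random_state=0, stratify=digits.target,
)

# Fit a k-NN classifier and predict labels for the held-out samples.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted = clf.predict(X_test)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, predicted)
accuracy = accuracy_score(y_test, predicted)
print(cm)
print("accuracy:", accuracy)
```

Off-diagonal entries in the matrix show which digits get confused with which, which is usually more informative than the single accuracy number.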

Without exception (at least that I can think of at this moment) the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)

In the past 12 months, for instance, I have used ffnet (for MLP), neurolab (also for MLP), PyBrain (Q-Learning), and PyMVPA (SVM), all available from the Python Package Index. These vary significantly from each other w/r/t maturity, scope, and supplied infrastructure, but I found them all to be of very high quality.

Still, the best of these might be scikits.learn; for instance, I am not aware of any Python ML library, other than scikits.learn, that includes any of the three features I mentioned above (though a few have solid example code and/or tutorials, none that I know of integrate these with a library of research-grade data sets and diagnostic algorithms).

Second, given the technique you intend to use (k-nearest neighbor), scikits.learn is a particularly good choice. Scikits.learn includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.
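The difference between the two flavours can be sketched as follows. This assumes a current scikit-learn release, where the estimators are named `KNeighborsRegressor` and `KNeighborsClassifier`; the sine-curve toy data and the threshold used to derive labels are made up purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D data on a regular grid (hypothetical, for illustration only).
X = np.linspace(0, 5, 51).reshape(-1, 1)
y_value = np.sin(X).ravel()            # continuous target for regression
y_label = (y_value > 0).astype(int)    # binary target for classification

# Same neighbours, two kinds of answer.
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_label)

query = np.array([[1.0], [4.0]])
reg_pred = reg.predict(query)   # mean of the 3 nearest target values
clf_pred = clf.predict(query)   # majority label among the 3 nearest points

print(reg_pred)
print(clf_pred)
```

The regressor averages the target values of the nearest neighbours, while the classifier takes a majority vote over their labels.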

Using the scikits.learn k-nearest neighbor module (literally) couldn't be any easier:

>>> # import NumPy and the relevant scikits.learn module
>>> import numpy as NP
>>> from sklearn import neighbors as kNN

>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()

>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target

>>> # construct a classifier-builder by instantiating the kNN module's primary class
>>> # (KNeighborsClassifier in current releases; older scikits.learn releases
>>> # called it NeighborsClassifier)
>>> kNN1 = kNN.KNeighborsClassifier()

>>> # now construct ('train') the classifier by passing the data and class labels
>>> # to the classifier-builder; the echoed repr below is from the older release
>>> # used when this answer was written, and newer releases print it differently
>>> kNN1.fit(data, class_labels)
NeighborsClassifier(n_neighbors=5, leaf_size=20, algorithm='auto')
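Once fitted, the classifier can label new points. The sketch below assumes a current scikit-learn release (class name `KNeighborsClassifier`); the train/test split and the `new_flower` measurements are illustrative values I chose, not part of the original session.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Keep some samples aside so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.3, random_state=0, stratify=iris.target,
)

clf = neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# A hypothetical new flower: sepal length, sepal width,
# petal length, petal width (all in cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_label = clf.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_label]

# Fraction of held-out samples that were labelled correctly.
accuracy = clf.score(X_test, y_test)

print(predicted_name)
print(accuracy)
```

`predict` returns the integer class label; `iris.target_names` maps it back to the species name.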

What's more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder. Rather, the difficult step in building a production-grade k-nearest neighbor classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikits.learn includes an algorithm for a ball tree, which I know almost nothing about other than that it is apparently superior to the kd-tree (the traditional data structure for k-NN) because its performance doesn't degrade in higher-dimensional feature spaces.
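As a rough sketch of that storage layer, current scikit-learn releases expose both structures directly as `BallTree` and `KDTree`, and the classifier accepts an `algorithm` argument to pick one. The random 20-dimensional points below are made-up data; this shows only that both trees answer the same exact query, not how their speed compares.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree, KNeighborsClassifier

# Hypothetical data: 1000 random points in a 20-dimensional space.
rng = np.random.RandomState(0)
points = rng.rand(1000, 20)
query = points[:1]  # use the first point as the query

# Build both index structures over the same points.
ball = BallTree(points, leaf_size=30)
kd = KDTree(points, leaf_size=30)

# Each query returns (distances, indices) of the 5 nearest stored points.
ball_dist, ball_idx = ball.query(query, k=5)
kd_dist, kd_idx = kd.query(query, k=5)

print(ball_idx[0])
print(kd_idx[0])

# The classifier can be told which structure to use internally.
labels = rng.randint(0, 2, size=len(points))
clf = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree").fit(points, labels)
```

Both trees perform exact search, so they return the same neighbours; they differ only in how quickly they get there for a given data set and dimensionality.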

Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikits.learn includes a standalone module comprising various distance metrics, as well as testing algorithms for selecting the appropriate one.
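One simple way to compare metrics, assuming a current scikit-learn release, is to pass each candidate through the classifier's `metric` parameter and score it with cross-validation. This is my own sketch of the idea, not necessarily the selection tooling the answer refers to; the three candidate metrics and `cv=5` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Mean 5-fold cross-validated accuracy for each candidate distance metric.
scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores[metric] = cross_val_score(clf, iris.data, iris.target, cv=5).mean()

# Pick the metric with the highest mean accuracy.
best_metric = max(scores, key=scores.get)

for metric, score in scores.items():
    print(metric, round(score, 3))
print("best:", best_metric)
```

On a small, well-separated data set like iris the metrics tend to score similarly; the comparison matters more when features have very different scales or distributions.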

Finally, there are a few libraries that I have not mentioned, either because they are out of scope (PyML, Bayesian), because they are not primarily 'libraries' for developers but rather applications for end users (e.g., Orange), or because they have unusual or difficult-to-install dependencies (e.g., mlpy, which requires the gsl, which in turn must be built from source), at least on my OS, which is Mac OS X.

(Note: I am not a developer/committer for scikits.learn.)
