# The Importance of Preprocessing in Data Science and Machine Learning Pipelines (I): Centering, Scaling, and k-Nearest Neighbors

## k-Nearest Neighbors classification in machine learning

A visual description of k-Nearest Neighbors

from IPython.display import Image
Image(url= 'http://36.media.tumblr.com/d100eff8983aae7c5654adec4e4bb452/tumblr_inline_nlhyibOF971rnd3q0_500.png')
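
The voting idea behind k-NN can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation: the function name `knn_predict` and the toy points are assumptions made up for this example, and Euclidean distance is assumed as the metric.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest training points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those k neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # red
print(knn_predict(points, labels, (6.0, 7.0), k=3))  # blue
```

Because the prediction depends entirely on distances, any feature with a much larger numeric range will tend to decide which points count as "nearest".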

Implementing k-NN in Python (scikit-learn)

import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values # drop target variable
y1 = df['quality'].values
pd.DataFrame.hist(df, figsize = [15,15]);

y = y1 <= 5 # is the rating <= 5?
# plot histograms of original target variable
# and aggregated target variable
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.hist(y1);
plt.xlabel('original target value')
plt.ylabel('count')
plt.subplot(1, 2, 2);
plt.hist(y)
plt.xlabel('aggregated target value')
plt.show()

## k-NN: performance in practice and the train-test split

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn import neighbors, linear_model
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
knn_model_1 = knn.fit(X_train, y_train)
print('k-NN accuracy for test set: %f' % knn_model_1.score(X_test, y_test))
k-NN accuracy for test set: 0.612500

from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model_1.predict(X_test)
print(classification_report(y_true, y_pred))
             precision    recall  f1-score   support

False       0.66      0.64      0.65       179
True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320
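
The columns in the report above can be reproduced by hand from the true and predicted labels. The following is a minimal sketch for a single class; the helper name `binary_report` and the small label lists are hypothetical and are not taken from the wine data.

```python
def binary_report(y_true, y_pred, positive=True):
    """Return (precision, recall, f1, support) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples in this class
    return precision, recall, f1, support

# Hypothetical labels, for illustration only
y_true_demo = [True, True, True, False, False, False, False, True]
y_pred_demo = [True, False, True, False, True, False, False, True]

precision, recall, f1, support = binary_report(y_true_demo, y_pred_demo)
print(precision, recall, f1, support)  # 0.75 0.75 0.75 4
```

Precision asks how many of the predicted positives were correct, recall asks how many of the actual positives were found, and the F1 score is their harmonic mean.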

## Preprocessing: scaling in practice

from sklearn.preprocessing import scale
Xs = scale(X)
from sklearn.model_selection import train_test_split
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
knn_model_2 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))
print('k-NN score for training set: %f' % knn_model_2.score(Xs_train, y_train))
y_true, y_pred = y_test, knn_model_2.predict(Xs_test)
print(classification_report(y_true, y_pred))
k-NN score for test set: 0.712500
k-NN score for training set: 0.814699
precision    recall  f1-score   support

False       0.72      0.79      0.75       179
True       0.70      0.62      0.65       141

avg / total       0.71      0.71      0.71       320

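1. Predictor variables can have very different ranges, and in some cases, such as when using k-NN, their values need to be rescaled so that no single feature dominates the algorithm;
2. You want your features to be unit-independent, that is, not tied to the units of measurement: for example, you might have a feature measured in meters while I have the same feature measured in centimeters. If we each scale our data, the feature will be the same for both of us.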
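
Both points can be checked with a small, self-contained sketch. It assumes standardization to zero mean and unit variance using the population standard deviation; the helper `standardize` and all numbers below are invented for illustration.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Center to mean 0 and scale to unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Point 2: the same feature in meters and in centimeters
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]
scaled_m = standardize(heights_m)
scaled_cm = standardize(heights_cm)
same_after_scaling = all(abs(a - b) < 1e-9 for a, b in zip(scaled_m, scaled_cm))
print(same_after_scaling)  # True

# Point 1: an unscaled wide-range feature dominates the Euclidean distance.
# Two hypothetical samples: (wide-range feature, narrow-range feature)
a = (100.0, 0.990)
b = (110.0, 1.000)
squared_distance = math.dist(a, b) ** 2
share_of_wide_feature = (a[0] - b[0]) ** 2 / squared_distance
print(share_of_wide_feature)  # very close to 1: the narrow feature barely matters
```

After scaling, the two versions of the height feature coincide up to floating-point error, while in the unscaled pair of samples almost all of the distance comes from the wide-range feature.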

# Set the number of neighbors for k-NN
n_neig = 5

# Set sc = True if you want to scale your features
sc = False

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values # drop target variable

# Here we scale, if desired
if sc:
    X = scale(X)

# Target value
y1 = df['quality'].values # original target variable
y = y1 <= 5 # new target variable: is the rating <= 5?

# Split the data into a test set and a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train k-NN model and print performance on the test set
knn = neighbors.KNeighborsClassifier(n_neighbors = n_neig)
knn_model = knn.fit(X_train, y_train)
y_true, y_pred = y_test, knn_model.predict(X_test)
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
print(classification_report(y_true, y_pred))
<script.py> output:
k-NN accuracy for test set: 0.612500
precision    recall  f1-score   support

False       0.66      0.64      0.65       179
True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320

### Glossary

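k-Nearest Neighbors (k-NN): an algorithm for classification tasks in which a data point is assigned the label chosen by a majority vote of its k nearest neighbors, that is, the k closest training points.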
