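The Importance of Preprocessing in Data Science and the Machine Learning Pipeline (Part 1): Centering, Scaling, and k-Nearest Neighbors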

A visual description of k-Nearest Neighbors

from IPython.display import Image
Image(url= 'http://36.media.tumblr.com/d100eff8983aae7c5654adec4e4bb452/tumblr_inline_nlhyibOF971rnd3q0_500.png')

Implementing k-NN in Python (scikit-learn)

import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable
y1 = df['quality'].values
df.hist(figsize=(15, 15));

y = y1 <= 5 # is the rating <= 5?
# plot histograms of original target variable
# and aggregated target variable
plt.figure(figsize=(20,5));
plt.subplot(1, 2, 1 );
plt.hist(y1);
plt.xlabel('original target value')
plt.ylabel('count')
plt.subplot(1, 2, 2);
plt.hist(y.astype(int))  # cast booleans to 0/1 so the histogram bins are numeric
plt.xlabel('aggregated target value')
plt.show()

k-NN: performance in practice and the train/test split

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn import neighbors, linear_model
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)
knn_model_1 = knn.fit(X_train, y_train)
print('k-NN accuracy for test set: %f' % knn_model_1.score(X_test, y_test))
k-NN accuracy for test set: 0.612500

from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model_1.predict(X_test)
print(classification_report(y_true, y_pred))
             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320

Preprocessing: scaling in practice

from sklearn.preprocessing import scale
Xs = scale(X)
from sklearn.model_selection import train_test_split
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
knn_model_2 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))
print('k-NN score for training set: %f' % knn_model_2.score(Xs_train, y_train))
y_true, y_pred = y_test, knn_model_2.predict(Xs_test)
print(classification_report(y_true, y_pred))
k-NN score for test set: 0.712500
k-NN score for training set: 0.814699
             precision    recall  f1-score   support

      False       0.72      0.79      0.75       179
       True       0.70      0.62      0.65       141

avg / total       0.71      0.71      0.71       320

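1. Predictor variables can have very different ranges, and in some cases, such as with k-NN, their values need to be rescaled so that no single feature dominates the algorithm;
2. You want your features to be unit-independent, that is, not tied to the unit of measurement: for example, you might have a feature measured in meters while I have the same feature in centimeters. If we each scale our data, the feature will be identical for both of us.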
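
Both points can be illustrated with a small sketch. It uses only the standard library and made-up numbers: the `heights_m` and `incomes` values and the `standardize` helper are hypothetical and do not come from the wine dataset. `standardize` is meant to mirror the default behavior of `sklearn.preprocessing.scale`, which subtracts the mean and divides by the population standard deviation.

```python
import math
from statistics import fmean, pstdev


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical data: one feature with a small range, one with a large range.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
incomes = [30000.0, 52000.0, 41000.0, 75000.0, 38000.0]

# Point 1: without scaling, the Euclidean distance between two samples is
# driven almost entirely by the large-range feature.
a = (heights_m[0], incomes[0])
b = (heights_m[4], incomes[4])
raw_distance = math.dist(a, b)
income_gap = abs(incomes[0] - incomes[4])

# After standardizing, both features contribute on a comparable scale.
heights_s = standardize(heights_m)
incomes_s = standardize(incomes)
scaled_distance = math.dist(
    (heights_s[0], incomes_s[0]), (heights_s[4], incomes_s[4])
)

# Point 2: the same feature in centimeters standardizes to the same values.
heights_cm = [100 * h for h in heights_m]
heights_cm_s = standardize(heights_cm)
same_after_scaling = all(
    math.isclose(m, c, abs_tol=1e-9) for m, c in zip(heights_s, heights_cm_s)
)

print(raw_distance, income_gap)  # nearly identical: income dominates
print(scaled_distance)
print(same_after_scaling)
```

Because k-NN relies on distances between samples, the first comparison is why an unscaled large-range feature can decide which neighbors count as "nearest".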

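import pandas as pd
from sklearn import neighbors
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Set the number of neighbors for k-NN
n_neig = 5

# Set sc = True if you want to scale your features
sc = False

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable

# Here we scale, if desired
if sc:
    X = scale(X)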
# Target value
y1 = df['quality'].values # original target variable
y = y1 <= 5 # new target variable: is the rating <= 5?

# Split the data into a test set and a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train k-NN model and print performance on the test set
knn = neighbors.KNeighborsClassifier(n_neighbors = n_neig)
knn_model = knn.fit(X_train, y_train)
y_true, y_pred = y_test, knn_model.predict(X_test)
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
print(classification_report(y_true, y_pred))
<script.py> output:
k-NN accuracy for test set: 0.612500
             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320

Glossary

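k-Nearest Neighbors (k-NN): an algorithm for classification tasks in which a data point's label is decided by a vote among the k training points closest to it.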
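
As a rough illustration of this definition, here is a minimal from-scratch sketch of k-NN classification using Euclidean distance and a majority vote. The `knn_predict` helper and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier`, used throughout this post, is far more efficient and offers more options.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [False, False, False, True, True, True]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # False
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # True
```

With two classes, an odd `k` avoids tied votes; this sketch does not otherwise handle ties.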
