Machine Learning Basics: Wikipedia Translation Notes on Hyperparameter Selection, k-Nearest Neighbors, and a Simple sklearn Example

In the context of machine learning, hyperparameter optimization or model
selection is the problem of choosing a set of hyperparameters for a
learning algorithm, usually with the goal of optimizing a measure of
the algorithm's performance on an independent data set.
Cross-validation is often used to estimate this generalization performance.
Hyperparameter optimization contrasts with actual learning problems, which
are also often cast as optimization problems, but optimize a loss function
on the training set alone. In effect, learning algorithms learn parameters
that model/reconstruct their inputs well, while hyperparameter optimization
ensures that the model does not overfit its data by tuning, e.g.,
regularization.
(Note: regularization here refers to penalty terms such as those introduced by AIC and BIC.)
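
To make the distinction between parameters and hyperparameters concrete, here is a minimal sketch using scikit-learn. The choice of LogisticRegression, the value C=1.0, and the 5-fold split are assumptions made only for illustration; they do not come from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# C is a hyperparameter: it is fixed before training and sets the
# regularization strength (smaller C means stronger regularization).
model = LogisticRegression(C=1.0, max_iter=1000)

# Learning: the weights in coef_ are parameters fitted on the training data.
model.fit(X, y)
train_accuracy = model.score(X, y)

# Generalization estimate: each of the 5 folds is scored on data
# that was not used to fit the model for that fold.
cv_scores = cross_val_score(LogisticRegression(C=1.0, max_iter=1000), X, y, cv=5)

print("learned weight matrix shape:", model.coef_.shape)
print("training accuracy:", train_accuracy)
print("mean cross-validated accuracy:", cv_scores.mean())
```

Hyperparameter optimization would repeat the cross-validation step for several values of C and keep the one with the best mean score.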

Algorithms for hyperparameter optimization
Grid search
The traditional way of performing hyperparameter optimization has been grid
search, or a parameter sweep, which is simply an exhaustive search
through a manually specified subset of the hyperparameter space of a
learning algorithm. A grid search algorithm must be guided by some
performance metric, typically measured by cross-validation on the training
set or by evaluation on a held-out validation set.
Since the parameter space of a machine learner may include real-valued or
unbounded value spaces for certain parameters, manually set bounds and
discretization may be necessary before applying grid search.
For example, a typical soft-margin SVM classifier equipped with an RBF
(Gaussian) kernel has at least two hyperparameters that need to be tuned
for good performance on unseen data: a regularization constant C and a
kernel hyperparameter γ.
(Note: the soft margin presumably refers to the bound placed on the vector length in the constraints of the optimization problem.)
Both parameters are continuous, so to perform grid search, one selects a
finite set of "reasonable" values for each, ...

Grid search then trains an SVM with each pair (C, γ) in the Cartesian product
of these two sets and evaluates their performance on a held-out
validation set (or by internal cross-validation on the training set, in which
case multiple SVMs are trained per pair). Finally, the grid search algorithm
outputs the settings that achieved the highest score in the validation procedure.
Grid search suffers from the curse of dimensionality (the number of grid
points grows exponentially with the number of hyperparameters), but is often
embarrassingly parallel because typically the
hyperparameter settings it evaluates are independent of each other.
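
The procedure described above can be sketched by hand with itertools.product, which builds the Cartesian product of the two candidate sets. The candidate values for C and for the kernel hyperparameter (named gamma in scikit-learn's SVC) are hypothetical and chosen only for illustration, as are the iris data set and the 5-fold cross-validation.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical candidate sets; the source text does not fix these values.
C_values = [0.1, 1, 10]
gamma_values = [0.01, 0.1, 1]

results = []
# One SVM configuration per pair in the Cartesian product of the two sets.
for C, gamma in product(C_values, gamma_values):
    svm = SVC(kernel="rbf", C=C, gamma=gamma)
    # Internal cross-validation: 5 SVMs are trained for each pair.
    score = cross_val_score(svm, X, y, cv=5).mean()
    results.append((score, C, gamma))

# Report the setting with the highest mean validation score.
best_score, best_C, best_gamma = max(results, key=lambda r: r[0])
print(len(results), "settings evaluated")
print("best C:", best_C, "best gamma:", best_gamma, "score:", best_score)
```

Each iteration of the loop is independent of the others, which is why the search parallelizes so easily. Adding a third hyperparameter with three candidates would triple the number of settings.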

In pattern recognition, the k-nearest neighbors algorithm (k-NN)
is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
 In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
 If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

 In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
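
Both variants can be sketched in a few lines of plain Python. The function names, the toy data, and the use of Euclidean distance are assumptions made for illustration; the text above does not prescribe a distance measure.

```python
import math
from collections import Counter


def k_nearest(train_points, query, k):
    """Return indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return order[:k]


def knn_classify(train_points, train_labels, query, k):
    """Majority vote among the k nearest neighbors (ties are broken arbitrarily)."""
    neighbors = k_nearest(train_points, query, k)
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k):
    """Plain average of the values of the k nearest neighbors."""
    neighbors = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in neighbors) / k


# Toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, labels, (0.5, 0.5), k=3))
print(knn_regress(points, values, (0.5, 0.5), k=3))
```

Sorting all training points for every query is the simplest approach, not the fastest; it is used here only to keep the sketch short.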

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm
is among the simplest of all machine learning algorithms.

For both classification and regression, it can be useful to assign weights to the contributions of the neighbors,
so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
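
A minimal sketch of the 1/d weighting scheme for regression follows. The function name and toy data are hypothetical, and returning the stored value directly when a neighbor lies at distance zero is an assumption added here to avoid division by zero.

```python
import math


def weighted_knn_regress(train_points, train_values, query, k):
    """Average of the k nearest values, each weighted by 1/d."""
    pairs = sorted(
        ((math.dist(p, query), v) for p, v in zip(train_points, train_values)),
        key=lambda pair: pair[0],
    )
    nearest = pairs[:k]

    # Assumption: an exact match (d == 0) simply returns its own value.
    for d, v in nearest:
        if d == 0.0:
            return v

    weights = [1.0 / d for d, _ in nearest]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# One-dimensional toy data; the query sits closer to the middle point.
points = [(0.0,), (1.0,), (4.0,)]
values = [0.0, 10.0, 40.0]
query = (2.0,)

weighted = weighted_knn_regress(points, values, query, k=3)
unweighted = sum(values) / len(values)
print("weighted:", weighted, "unweighted:", unweighted)
```

With these numbers the distances are 2, 1, and 2, so the middle point gets twice the weight of the other two and pulls the estimate toward its value.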

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure
of the data.

The training examples are vectors in a multidimensional feature space, each with
a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to the query point.
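
The two phases map onto the fit and predict methods of scikit-learn's KNeighborsClassifier, as in the minimal sketch below. The iris data set, the 70/30 split, the random seed, and k = 5 are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples to play the role of unlabeled query points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k is the user-defined constant (n_neighbors in scikit-learn).
knn = KNeighborsClassifier(n_neighbors=5)

# "Training" mainly stores the labeled samples (scikit-learn may also index them).
knn.fit(X_train, y_train)

# Classification: each query gets the most frequent label among its 5 nearest neighbors.
predictions = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)

print("number of query points:", len(X_test))
print("accuracy on held-out data:", accuracy)
```

Passing weights="distance" to KNeighborsClassifier switches from a plain majority vote to votes weighted by inverse distance.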

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Combine two feature extractors: PCA components plus the best univariate feature.
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combine_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Transform the data once to inspect the combined feature matrix.
X_feature = combine_features.fit(X, y).transform(X)
# print(X_feature)

svm = SVC(kernel="linear")

# Chain feature extraction and the classifier so they are tuned together.
pipeline = Pipeline([("features", combine_features), ("svm", svm)])

# Nested parameters are addressed as <step>__<sub-step>__<parameter>.
param_grid = dict(
    features__pca__n_components=[1, 2, 3],
    features__univ_select__k=[1, 2],
    svm__C=[0.1, 1, 10],
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)

