"Building Machine Learning Systems with Python", Chapter 2: How to Classify Real-World Samples
We use the Iris dataset, which can be downloaded from http://archive.ics.uci.edu/ml/.
The first step is visualization; here we plot feature 0 against feature 1.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
labels = data['target_names'][data['target']]
# Plot each of the three classes with its own marker and color
for t, marker, c in zip(range(3), ">ox", "rgb"):
    plt.scatter(features[target == t, 0], features[target == t, 1],
                marker=marker, c=c)
plt.show()
Inspecting the plot, we see that petal length alone separates Iris Setosa from the other two classes, so we look for the split point:
plength = features[:, 2]
is_setosa = (labels == 'setosa')
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))
The output shows that 2 is a valid split point.
A new sample can therefore be classified as follows:
# example is a single new sample (one feature vector)
if example[2] < 2:
    print('Iris Setosa')
else:
    print('Iris Virginica or Iris Versicolour')
Next we drop Setosa and use the remaining data to distinguish virginica from versicolor:
features = features[~is_setosa]
labels = labels[~is_setosa]
virginica = (labels == 'virginica')
Now we loop over all possible features and thresholds to find the best feature and its corresponding threshold:
best_acc = -1.0
for fi in range(features.shape[1]):
    # Candidate thresholds: the sorted values of this feature
    thresh = features[:, fi].copy()
    thresh.sort()
    for t in thresh:
        pred = (features[:, fi] > t)
        acc = (pred == virginica).mean()
        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t
A new sample can then be classified:
if example[best_fi] > best_t:
    print('virginica')
else:
    print('versicolor')
Cross-validation (leave-one-out):
error = 0
for ei in range(len(features)):
    # Select every position except ei for training
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    model = learn_model(features[training], virginica[training])
    predictions = apply_model(features[testing], virginica[testing], model)
    error += np.sum(predictions != virginica[testing])
print(error)
In this code, learn_model and apply_model are user-defined functions that we have to implement ourselves.
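One possible sketch of those two functions (the signatures match the loop above; the internals are my own assumption): learn_model repeats the exhaustive feature/threshold search from earlier, and apply_model applies the stored threshold to new rows.

```python
import numpy as np

def learn_model(features, labels):
    """Exhaustively search every (feature, threshold) pair; keep the best."""
    best_acc = -1.0
    for fi in range(features.shape[1]):
        thresh = features[:, fi].copy()
        thresh.sort()
        for t in thresh:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > best_acc:
                best_acc = acc
                best_fi = fi
                best_t = t
    return (best_fi, best_t)

def apply_model(features, labels, model):
    """Predict True/False for each row using the learned threshold.
    labels is unused here; it is kept only to match the call site above."""
    fi, t = model
    return features[:, fi] > t
```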
Now let's look at a slightly more complex dataset, the Seeds dataset, which has seven features.
We first normalize the features to Z-scores: subtract the mean, then divide by the standard deviation.
# Subtract the per-feature mean
features -= features.mean(axis=0)
# Divide each feature by its standard deviation
features /= features.std(axis=0)
Another normalization method scales each feature into (0, 1) using its minimum and maximum values.
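A minimal sketch of that min-max scaling (my own code, not from the book; the sample array is made up for illustration):

```python
import numpy as np

features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 40.0]])

# Scale each column to [0, 1]: subtract the column minimum,
# then divide by the column range (max - min).
mins = features.min(axis=0)
maxs = features.max(axis=0)
scaled = (features - mins) / (maxs - mins)
```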
We use the nearest-neighbor method for classification:
def distance(p0, p1):
    'Computes squared Euclidean distance'
    return np.sum((p0 - p1) ** 2)

def nn_classify(training_set, training_labels, new_example):
    # Distance from the new example to every training point
    dists = np.array([distance(t, new_example) for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]
We can also use K-nearest neighbors: find the K closest points and decide the label by majority vote among them.
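That idea can be sketched by extending nn_classify above (the function name and vote logic are my own; the book does not give this code):

```python
import numpy as np

def knn_classify(training_set, training_labels, new_example, k=3):
    # Squared Euclidean distance from the new example to every training point
    dists = np.array([np.sum((t - new_example) ** 2) for t in training_set])
    # Indices of the k closest training points
    nearest = dists.argsort()[:k]
    # Majority vote among the k neighbors' labels
    votes = training_labels[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[counts.argmax()]
```

With k=1 this reduces to the nearest-neighbor classifier above; larger k smooths out noisy individual training points.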