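Machine Learning Day 04

XII. Support Vector Machines (SVM)
1. Principle

  1. Finding the optimal classification boundary:
    Correct: it assigns most samples to the right class.
    Generalizable: it maximizes the margin between the support vectors.
    Fair: it is equidistant from the support vectors of each class.
    Simple: it is linear, i.e. a line, a plane, or in general a separating hyperplane.
  2. Kernel-based dimension lifting:
    A feature transformation known as a kernel function adds new features, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in a higher-dimensional space.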
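
The dimension-lifting idea can be illustrated with a minimal sketch. The toy data and the helper `separable_by_threshold` are assumptions made up for this illustration; they are not part of the course scripts. Seven 1-D samples cannot be split by any single threshold, but become separable once the new feature x^2 is added.

```python
import numpy as np

# Toy 1-D data (hypothetical): the outer points are class 1, the inner points class 0
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])


def separable_by_threshold(values, labels):
    """Return True if a single threshold on `values` splits the two classes."""
    lo0, hi0 = values[labels == 0].min(), values[labels == 0].max()
    lo1, hi1 = values[labels == 1].min(), values[labels == 1].max()
    return bool(hi0 < lo1 or hi1 < lo0)


# Lift to a higher dimension by adding the squared feature
lifted = x ** 2

print(separable_by_threshold(x, y))       # original feature
print(separable_by_threshold(lifted, y))  # new feature x^2
```

In the lifted space, any threshold between 1 and 4 on x^2 plays the role of the linear boundary.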

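2. Classification results with different kernel functions

  1. Linear kernel: linear. No kernel-based dimension lifting is performed; a linear classification boundary is sought only in the original feature space.
    Code: svm_line.py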
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/multiple2.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # SVM classifier with a linear kernel
    model = svm.SVC(kernel='linear')
    model.fit(train_x, train_y)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    mp.figure('SVM Linear Classification',
              facecolor='lightgray')
    mp.title('SVM Linear Classification', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.show()

     

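  2. Polynomial kernel: poly. Adds higher-order powers and cross terms of the original features through a polynomial function.
    x1 x2 -> y
    x1 x2 x1^2 x1x2 x2^2 -> y  (degree-2 polynomial lifting)
    x1 x2 x1^2 x1x2 x2^2 x1^3 x1^2x2 x1x2^2 x2^3 -> y  (degree-3 polynomial lifting)
    Code: svm_poly.py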
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/multiple2.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # SVM classifier with a degree-3 polynomial kernel
    model = svm.SVC(kernel='poly', degree=3)
    model.fit(train_x, train_y)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    mp.figure('SVM Polynomial Classification',
              facecolor='lightgray')
    mp.title('SVM Polynomial Classification', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.show()
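
The polynomial lifting listed for this kernel can be written out explicitly. The sketch below uses a hypothetical helper, `poly_expand`, that enumerates every monomial up to a given degree. Note that this is only an illustration of the feature expansion: SVC with kernel='poly' obtains an equivalent effect through the kernel function and never builds these features explicitly.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_expand(sample, degree):
    """Return all monomials of the features in `sample`, from degree 1 up to `degree`."""
    features = []
    for d in range(1, degree + 1):
        # Each combination of feature indices corresponds to one monomial
        for combo in combinations_with_replacement(range(len(sample)), d):
            features.append(prod(sample[i] for i in combo))
    return features


# x1 = 2, x2 = 3 -> x1, x2, x1^2, x1*x2, x2^2
print(poly_expand([2.0, 3.0], 2))
# Two original features grow to nine with a degree-3 expansion
print(len(poly_expand([2.0, 3.0], 3)))
```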

     

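  3. Radial basis function kernel: rbf. Adds features based on a Gaussian function of the original sample features, so that the similarity between two samples decays smoothly as they move apart.
    Code: svm_rbf.py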
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/multiple2.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # SVM classifier with a radial basis function kernel
    model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
    model.fit(train_x, train_y)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    mp.figure('SVM RBF Classification',
              facecolor='lightgray')
    mp.title('SVM RBF Classification', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.show()
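
As a rough sketch of what the gamma parameter controls, the Gaussian RBF kernel value between two samples can be computed by hand as exp(-gamma * squared distance). The function name `rbf_kernel` and the sample points are assumptions for this illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


p, q = [0.0, 0.0], [3.0, 4.0]  # squared distance is 25
print(rbf_kernel(p, p, gamma=0.01))  # identical samples give the maximum similarity
print(rbf_kernel(p, q, gamma=0.01))  # small gamma: similarity decays slowly
print(rbf_kernel(p, q, gamma=1.0))   # large gamma: similarity decays quickly
```

A larger gamma makes each sample's influence more local, which tends to produce a more intricate boundary.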

     

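3. Balancing sample classes

  • ..., class_weight='balanced', ...
    Balancing the class weights gives samples of under-represented classes a higher weight and samples of over-represented classes a lower weight. This evens out the contribution of each class to the classifier, which can improve performance on imbalanced data.
    Code: svm_bal.py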
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/imbalance.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # SVM classifier with balanced class weights
    model = svm.SVC(kernel='rbf', C=100, gamma=1,
                    class_weight='balanced')
    model.fit(train_x, train_y)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    mp.figure('SVM Balanced Classification',
              facecolor='lightgray')
    mp.title('SVM Balanced Classification', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.show()
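
To make the weighting concrete, the sketch below computes balanced weights by hand as n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'. The helper name `balanced_class_weights` and the 90/10 label split are assumptions for this illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical imbalanced label set: 90 samples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives a weight nine times larger than the majority class, so both classes contribute the same total weight.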

     

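4. Confidence probability

  • Quantifies how trustworthy a predicted class is, based on the sample's distance from the classification boundary: the farther a sample lies from the boundary, the higher the confidence probability; the closer it lies to the boundary, the lower the confidence probability.
    Pass probability=True when constructing the model.
    model.predict_proba(input sample matrix) -> confidence probability matrix
    Prediction result (returned by model.predict()):
    Sample 1  Class 1
    Sample 2  Class 1
    Sample 3  Class 2
    Confidence probability matrix:
              Class 1  Class 2
    Sample 1  0.8      0.2
    Sample 2  0.9      0.1
    Sample 3  0.4      0.6
    Code: svm_prob.py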
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/multiple2.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # SVM classifier that can compute confidence probabilities
    model = svm.SVC(kernel='rbf', C=600, gamma=0.01,
                    probability=True)
    model.fit(train_x, train_y)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    prob_x = np.array([
        [2, 1.5],
        [8, 9],
        [4.8, 5.2],
        [4, 4],
        [2.5, 7],
        [7.6, 2],
        [5.4, 5.9]])
    print(prob_x)
    pred_prob_y = model.predict(prob_x)
    print(pred_prob_y)
    probs = model.predict_proba(prob_x)
    print(probs)
    mp.figure('SVM Confidence Probability',
              facecolor='lightgray')
    mp.title('SVM Confidence Probability', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y,
               cmap='cool', s=70, marker='D')
    for i in range(len(probs)):
        mp.annotate(
            '{}% {}%'.format(
                round(probs[i, 0] * 100, 2),
                round(probs[i, 1] * 100, 2)),
            xy=(prob_x[i, 0], prob_x[i, 1]),
            xytext=(12, -12),
            textcoords='offset points',
            horizontalalignment='left',
            verticalalignment='top',
            fontsize=9,
            bbox={'boxstyle': 'round,pad=0.6',
                  'fc': 'orange', 'alpha': 0.8})
    mp.show()
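
One common way to turn a signed distance from the boundary into a probability is a sigmoid fit (Platt scaling), which is what SVC uses when probability=True. The sketch below is only an illustration of that mapping: the coefficients a and b are hypothetical, whereas in practice they are fitted on the training data.

```python
import math


def platt_probability(decision_value, a=-1.0, b=0.0):
    """Map a signed distance from the boundary to a probability with a sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Signed distances: 0 is on the boundary, larger magnitudes are farther away
for distance in (0.0, 0.5, 3.0, -3.0):
    print(distance, round(platt_probability(distance), 4))
```

A sample on the boundary maps to 0.5, and the probability moves toward 0 or 1 as the sample moves away from the boundary on either side.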

     

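5. Grid search

  • ms.GridSearchCV(model, list of hyperparameter grids, cv=number of folds)
      -> model object
    model_object.fit(input set, output set)
    For every hyperparameter combination in the list, the given model is instantiated and evaluated with cv-fold cross-validation. The combination with the highest mean validation score is chosen as the best one, and a model is instantiated with it. Because the script below does not pass scoring, the score is the estimator's default (accuracy for SVC); pass scoring='f1_weighted' to select by F1 score instead.
    Code: svm_gs.py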
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.model_selection as ms
    import sklearn.svm as svm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/multiple2.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y, dtype=int)
    train_x, test_x, train_y, test_y = \
        ms.train_test_split(
            x, y, test_size=0.25, random_state=5)
    # List of hyperparameter grids
    params = [
        {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
        {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
        {'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
         'gamma': [1, 0.1, 0.01, 0.001]}]
    # Grid search for the best combination
    model = ms.GridSearchCV(
        svm.SVC(probability=True), params, cv=5)
    model.fit(train_x, train_y)
    for param, score in zip(
            model.cv_results_['params'],
            model.cv_results_['mean_test_score']):
        print(param, score)
    print(model.best_params_)
    print(model.best_score_)
    print(model.best_estimator_)
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_test_y = model.predict(test_x)
    cr = sm.classification_report(test_y, pred_test_y)
    print(cr)
    prob_x = np.array([
        [2, 1.5],
        [8, 9],
        [4.8, 5.2],
        [4, 4],
        [2.5, 7],
        [7.6, 2],
        [5.4, 5.9]])
    print(prob_x)
    pred_prob_y = model.predict(prob_x)
    print(pred_prob_y)
    probs = model.predict_proba(prob_x)
    print(probs)
    mp.figure('Grid Search', facecolor='lightgray')
    mp.title('Grid Search', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
               cmap='brg', s=80)
    mp.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y,
               cmap='cool', s=70, marker='D')
    for i in range(len(probs)):
        mp.annotate(
            '{}% {}%'.format(
                round(probs[i, 0] * 100, 2),
                round(probs[i, 1] * 100, 2)),
            xy=(prob_x[i, 0], prob_x[i, 1]),
            xytext=(12, -12),
            textcoords='offset points',
            horizontalalignment='left',
            verticalalignment='top',
            fontsize=9,
            bbox={'boxstyle': 'round,pad=0.6',
                  'fc': 'orange', 'alpha': 0.8})
    mp.show()
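
To see how many candidates the grid search evaluates, the list of grids from svm_gs.py can be expanded by hand. The helper `expand` is a hypothetical name for this sketch; it simply takes the Cartesian product of the value lists in each grid.

```python
from itertools import product

# Same grids as in svm_gs.py
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
     'gamma': [1, 0.1, 0.01, 0.001]}]


def expand(grid):
    """Return every combination of values in one grid as a dict."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


combos = [combo for grid in params for combo in expand(grid)]
print(len(combos))      # 4 linear + 2 poly + 16 rbf candidates
print(len(combos) * 5)  # number of fits with cv=5, before the final refit
```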

     

6. Event prediction
Code: svm_evt.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm


class DigitEncoder():

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)


data = []
# Binary classification
# with open('../../data/event.txt', 'r') as f:
# Multi-class classification
with open('../../data/events.txt', 'r') as f:
    for line in f.readlines():
        data.append(line[:-1].split(','))
data = np.delete(np.array(data).T, 1, 0)
encoders, x = [], []
for row in range(len(data)):
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25,
                        random_state=5)
model = svm.SVC(kernel='rbf',
                class_weight='balanced')
print(ms.cross_val_score(
    model, train_x, train_y, cv=3,
    scoring='accuracy').mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() /
      pred_test_y.size)
data = [['Tuesday', '12:30:00', '21', '23']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))
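
The script above relies on one encoder per column to turn strings into integers and back. As a rough illustration of what sp.LabelEncoder does for the string columns, the hypothetical class `SimpleLabelEncoder` below maps each distinct label to its position in the sorted list of labels. It is a simplified stand-in, not the scikit-learn implementation.

```python
class SimpleLabelEncoder:
    """Simplified stand-in for a label encoder (illustration only)."""

    def fit_transform(self, values):
        # Sorted distinct labels define the integer codes
        self.classes_ = sorted(set(values))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return [self._index[v] for v in values]

    def transform(self, values):
        # Labels not seen during fitting raise KeyError here
        return [self._index[v] for v in values]

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]


encoder = SimpleLabelEncoder()
codes = encoder.fit_transform(['Tuesday', 'Monday', 'Tuesday', 'Friday'])
print(codes)
print(encoder.inverse_transform(codes))
```

This is also why the prediction sample must only contain values that already occur in the training file: a label the encoder has never seen cannot be encoded.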


7. Traffic flow prediction (regression)
Code: svm_trf.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm


class DigitEncoder():

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)


data = []
# Regression
with open('../../data/traffic.txt', 'r') as f:
    for line in f.readlines():
        data.append(line[:-1].split(','))
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25,
                        random_state=5)
# Support vector regressor
model = svm.SVR(kernel='rbf', C=10, epsilon=0.2)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
data = [['Tuesday', '13:35', 'San Francisco', 'yes']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
# pred_y is a one-element array; take its single value before converting
print(int(pred_y[0]))
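
The epsilon parameter passed to svm.SVR defines a tube around the prediction inside which errors are ignored. A minimal sketch of this epsilon-insensitive loss follows; the function name and the sample values are assumptions for this illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


print(epsilon_insensitive_loss(10.0, 10.1))  # inside the tube
print(epsilon_insensitive_loss(10.0, 11.0))  # outside the tube
```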


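XIII. Clustering
Classification vs. clustering
class         cluster
supervised    unsupervised
1. Sample similarity: Euclidean distance
Euclid, "Elements"
P(x1) - Q(x2): |x1-x2| = sqrt((x1-x2)^2)
P(x1,y1) - Q(x2,y2): sqrt((x1-x2)^2+(y1-y2)^2)
P(x1,y1,z1) - Q(x2,y2,z2): sqrt((x1-x2)^2+(y1-y2)^2+(z1-z2)^2)
The similarity of two samples is expressed by their Euclidean distance: the square root of the sum of the squared differences between their corresponding feature values. The smaller the distance, the more similar the samples.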
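
The distance formulas above generalize to any number of features. A minimal sketch, with `euclidean_distance` as a name chosen for this illustration:

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared differences of corresponding features."""
    if len(p) != len(q):
        raise ValueError('samples must have the same number of features')
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # 2-D case
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # identical samples
```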
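2. K-means algorithm
Step 1: Randomly choose k samples as the centers of k clusters. Compute the Euclidean distance from every sample to each cluster center and assign the sample to the cluster whose center is nearest.
Step 2: Using the partition from step 1, compute the geometric center (mean) of each cluster and take it as the new cluster center. Repeat step 1 until the computed geometric centers coincide, or nearly coincide, with the current cluster centers.

  1. The number of clusters k must be known in advance.
    Evaluation metrics can be used to choose the best number of clusters.
  2. The initial choice of cluster centers affects the final partition.
    Prefer initial centers that are far apart from each other.
    Code: km.py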
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.cluster as sc
    import matplotlib.pyplot as mp
    x = []
    with open('../../data/multiple3.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data)
    x = np.array(x)
    # K-means clusterer
    model = sc.KMeans(n_clusters=4)
    model.fit(x)
    centers = model.cluster_centers_
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_y = model.predict(x)
    mp.figure('K-Means Cluster', facecolor='lightgray')
    mp.title('K-Means Cluster', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(x[:, 0], x[:, 1], c=pred_y, cmap='brg',
               s=80)
    mp.scatter(centers[:, 0], centers[:, 1], marker='+',
               c='gold', s=1000, linewidth=1)
    mp.show()
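
The two steps of the algorithm can also be sketched directly in NumPy. This is a simplified illustration under stated assumptions, not what sc.KMeans does internally: the function `kmeans`, the toy samples, and the fixed initial centers (used instead of a random choice so the result is reproducible) are all made up for this sketch.

```python
import numpy as np


def kmeans(x, centers, max_iter=100, tol=1e-6):
    """Alternate the assignment step and the center update until the centers settle."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Step 1: assign every sample to its nearest center
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))])
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, new_centers


# Hypothetical samples forming two well-separated groups
x = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
init_centers = x[[0, 3]]  # fixed initial centers, far apart
labels, centers = kmeans(x, init_centers)
print(labels)
print(centers)
```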
    

    Image quantization
    Code: quant.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.cluster as sc
    import matplotlib.pyplot as mp
    # scipy.misc.imread was removed in SciPy 1.2;
    # Pillow is used here instead to load the image.
    from PIL import Image
    
    
    # Quantize the colors (gray levels) of an image with K-means clustering
    def quant(image, n_clusters):
        x = image.reshape(-1, 1)
        model = sc.KMeans(n_clusters=n_clusters)
        model.fit(x)
        y = model.labels_
        centers = model.cluster_centers_.squeeze()
        return centers[y].reshape(image.shape)
    
    
    # Load the image as a grayscale float array
    original = np.array(
        Image.open('../../data/lily.jpg').convert('L'), dtype=float)
    quant4 = quant(original, 4)
    quant3 = quant(original, 3)
    quant2 = quant(original, 2)
    mp.figure('Image Quant', facecolor='lightgray')
    mp.subplot(221)
    mp.title('Original', fontsize=16)
    mp.axis('off')
    mp.imshow(original, cmap='gray')
    mp.subplot(222)
    mp.title('Quant-4', fontsize=16)
    mp.axis('off')
    mp.imshow(quant4, cmap='gray')
    mp.subplot(223)
    mp.title('Quant-3', fontsize=16)
    mp.axis('off')
    mp.imshow(quant3, cmap='gray')
    mp.subplot(224)
    mp.title('Quant-2', fontsize=16)
    mp.axis('off')
    mp.imshow(quant2, cmap='gray')
    mp.tight_layout()
    mp.show()

     

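3. Mean shift algorithm
First assume that every cluster in the sample space follows some known probability distribution. Then fit probability density functions to the statistical histogram of the samples, repeatedly shifting the center (mean) of each density function until the best fit is obtained. The peaks of these density functions are the cluster centers; each sample is then assigned to the cluster whose center is nearest to it.

  1. The number of clusters need not be known in advance; the algorithm identifies the number of histogram peaks automatically.
  2. The cluster centers do not depend on an initial guess, so the resulting partition is relatively stable.
  3. The sample space should follow some probability distribution; otherwise the accuracy of the algorithm drops considerably.
    Code: shift.py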
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.cluster as sc
    import matplotlib.pyplot as mp
    x = []
    with open('../../data/multiple3.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data)
    x = np.array(x)
    # Estimate the bandwidth, which determines the step by which
    # the probability density function is adjusted each time
    bw = sc.estimate_bandwidth(x, n_samples=len(x),
                               quantile=0.1)
    # Mean shift clusterer
    model = sc.MeanShift(bandwidth=bw, bin_seeding=True)
    model.fit(x)
    centers = model.cluster_centers_
    l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
    b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
    grid_x = np.meshgrid(np.arange(l, r, h),
                         np.arange(b, t, v))
    flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
    flat_y = model.predict(flat_x)
    grid_y = flat_y.reshape(grid_x[0].shape)
    pred_y = model.predict(x)
    mp.figure('Mean Shift Cluster', facecolor='lightgray')
    mp.title('Mean Shift Cluster', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
                  cmap='gray')
    mp.scatter(x[:, 0], x[:, 1], c=pred_y, cmap='brg',
               s=80)
    mp.scatter(centers[:, 0], centers[:, 1], marker='+',
               c='gold', s=1000, linewidth=1)
    mp.show()
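
The shifting procedure can be sketched in one dimension. This is a simplified illustration that assumes a flat (uniform) window of radius `bandwidth`; the function `mean_shift_1d` and the sample points are hypothetical. Each starting point is repeatedly moved to the mean of the samples inside its window, and starting points that end up at the same peak are merged, so the number of clusters is discovered rather than specified.

```python
def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-6):
    """Return the density peaks found by shifting every point to its local mean."""
    modes = []
    for start in points:
        center = start
        for _ in range(max_iter):
            # Samples inside the window around the current center
            window = [q for q in points if abs(q - center) <= bandwidth]
            new_center = sum(window) / len(window)
            if abs(new_center - center) < tol:
                break
            center = new_center
        # Keep the peak only if it is not already known
        if all(abs(center - m) >= bandwidth for m in modes):
            modes.append(center)
    return modes


# Hypothetical samples grouped around 1 and around 5
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
modes = mean_shift_1d(points, bandwidth=1.0)
print(len(modes))
print([round(m, 3) for m in modes])
```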

     

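4. Agglomerative hierarchical algorithm
Start by treating every sample as its own cluster. If the current number of clusters exceeds the desired number, start from each sample, find the other sample nearest to it, and merge the two into a larger cluster, which reduces the total number of clusters. Repeat this process until the number of clusters reaches the desired value.

  1. The number of clusters k must be known in advance.
    Evaluation metrics can be used to choose the best number of clusters.
  2. There is no notion of a cluster center, so clusters can only be formed within the training set; the cluster membership of unseen samples outside the training set cannot be determined.
  3. When deciding which samples to merge, connectivity can be used as a criterion in addition to distance.
    Code: agglo.py, spiral.py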
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.cluster as sc
    import matplotlib.pyplot as mp
    x = []
    with open('../../data/multiple3.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data)
    x = np.array(x)
    # Agglomerative hierarchical clusterer
    model = sc.AgglomerativeClustering(n_clusters=4)
    pred_y = model.fit_predict(x)
    mp.figure('Agglomerative Cluster',
              facecolor='lightgray')
    mp.title('Agglomerative Cluster', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.scatter(x[:, 0], x[:, 1], c=pred_y, cmap='brg',
               s=80)
    mp.show()
    Code: spiral.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.cluster as sc
    import sklearn.neighbors as nb
    import matplotlib.pyplot as mp
    n_samples = 500
    t = 2.5 * np.pi * (1 + 2 * np.random.rand(
        n_samples, 1))
    x = 0.05 * t * np.cos(t)
    y = 0.05 * t * np.sin(t)
    n = 0.05 * np.random.rand(n_samples, 2)
    x = np.hstack((x, y)) + n
    # Agglomerative clusterer without a connectivity constraint
    model_nonc = sc.AgglomerativeClustering(
        linkage='average', n_clusters=3)
    pred_y_nonc = model_nonc.fit_predict(x)
    # k-nearest-neighbors graph used as the connectivity constraint
    conn = nb.kneighbors_graph(
        x, 10, include_self=False)
    # Agglomerative clusterer with a connectivity constraint
    model_conn = sc.AgglomerativeClustering(
        linkage='average', n_clusters=3,
        connectivity=conn)
    pred_y_conn = model_conn.fit_predict(x)
    mp.figure('Nonconnectivity',
              facecolor='lightgray')
    mp.title('Nonconnectivity', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.axis('equal')
    mp.scatter(x[:, 0], x[:, 1], c=pred_y_nonc,
               cmap='brg', alpha=0.5, s=60)
    mp.figure('Connectivity',
              facecolor='lightgray')
    mp.title('Connectivity', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.axis('equal')
    mp.scatter(x[:, 0], x[:, 1], c=pred_y_conn,
               cmap='brg', alpha=0.5, s=60)
    mp.show()
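
The merging process can be sketched in a few lines for 1-D samples. This is an illustration under stated assumptions: it uses single linkage (the distance between two clusters is the distance between their closest members), whereas sc.AgglomerativeClustering uses Ward linkage by default and 'average' in spiral.py. The function `agglomerate` and the sample points are hypothetical.

```python
def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # every sample starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


points = [1.0, 1.1, 5.0, 5.2, 9.0]
clusters = agglomerate(points, 3)
print(clusters)
```

Because the result is just a grouping of the given samples, there is nothing like a cluster center that could be used to place a new, unseen sample.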

     
