第3章 无监督学习与预处理

1. 无监督学习的类型

  • 两种无监督学习
    • 数据集变换(数据集的无监督变换)
      • 创建数据新的表示的算法
        • 新的表示可能更容易被人或其他机器学习算法所理解
      • 常见应用
        • 降维
          • 接受包含许多特征的数据的高维表示
          • 找到表示该数据的一种新方法
          • 用较少的特征就可以概括其重要特性
          • 常见应用
            • 将数据降为二维(为了可视化)
        • 找到“构成”数据的各个组成部分
          • 常见应用
            • 对文本文档集合进行主题提取
              • 任务
                • 找到每个文档中讨论的未知主题
                • 学习每个文档中出现了哪些主题
              • 用于追踪社交媒体上的话题讨论
    • 聚类
      • 将数据划分成不同的组
      • 每个组包含相似的物项
      • 常见应用
        • 相册的智能分类
          • 提取所有的人脸
          • 将看起来相似的人脸分在一组

2. 无监督学习的挑战

  • 主要挑战:评估算法是否学到了有用的东西
  • 无监督学习算法一般用于不包含任何标签信息的数据,所以我们不知道正确的输出应该是什么
  • 我们没有办法“告诉”算法我们要的是什么
  • 通常来说,评估无监督算法结果的唯一方法就是人工检查
  • 如果数据科学家想要更好地理解数据,那么无监督算法通常可用于探索性的目的,而不是作为大型自动化系统的一部分
  • 无监督算法的另一个常见应用是作为监督算法的预处理步骤
    • 可以提高监督算法的精度
    • 可以减少内存占用和时间开销

3. 预处理与缩放

  • 对于数据缩放敏感的算法,可以对特征进行调节,使数据表示更适合于这些算法
  • 通常是对数据按特征进行简单的缩放和移动

3.1 不同类型的预处理

from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_scaling()

plt.tight_layout()
plt.show()

对数据集缩放和预处理的各种方法

  • 左侧:有两个特征的二分类数据
    • 第一个特征值:10~15
    • 第二个特征值:1~9
  • 右侧:4种数据变换方法
    • StandardScaler
      • 确保每个特征的平均值为0,方差为1,使所有特征都位于同一量级
      • 不能确保特征达到任何特定的最大值和最小值
    • RobustScaler
      • 确保每个特征的统计属性都位于同一范围
        • 中位数和四分位数
      • 忽略与其他点有很大不同的数据点(异常值)
    • MinMaxScaler
      • 使所有特征都刚好位于0~1之间
    • Normalizer
      • 对每个数据点进行缩放,使特征向量的欧式长度等于1
        • 将每个数据点投射到半径为1的圆(球面)上
      • 每个数据点的缩放比例都不相同
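
  • 手工复现这几种缩放(示意)
    • 下面的NumPy代码并非书中示例,仅用于说明上述按特征(或按样本)的计算方式,并与scikit-learn的结果对比

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

    # 任意构造的示例数据:4个样本、2个特征
    X = np.array([[10., 1.], [12., 3.], [14., 5.], [15., 9.]])

    # StandardScaler:每列减去均值,再除以标准差
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    print(np.allclose(X_std, StandardScaler().fit_transform(X)))      # True

    # MinMaxScaler:每列减去最小值,再除以极差(max - min)
    X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    print(np.allclose(X_mm, MinMaxScaler().fit_transform(X)))         # True

    # Normalizer:按行缩放,使每个样本的欧式长度为1
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    print(np.allclose(X_norm, Normalizer().fit_transform(X)))         # True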

3.2 应用数据变换

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()

scaler.fit(X_train)

# 对训练数据进行变换
X_train_scaled = scaler.transform(X_train)

# 打印缩放之后数据集属性
print("per-feature minimum after scaling:\n {}".format(X_train_scaled.min(axis=0)))
# per-feature minimum after scaling:
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
#   0. 0. 0. 0. 0. 0.]
print("per-feature maximum after scaling:\n {}".format(X_train_scaled.max(axis=0)))
# per-feature maximum after scaling:
#  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
#   1. 1. 1. 1. 1. 1.]

# 对测试数据进行变换
X_test_scaled = scaler.transform(X_test)

# 打印缩放之后数据集属性
print("per-feature minimum after scaling:\n {}".format(X_test_scaled.min(axis=0)))
# per-feature minimum after scaling:
#  [ 0.0336031   0.0226581   0.03144219  0.01141039  0.14128374  0.04406704
#    0.          0.          0.1540404  -0.00615249 -0.00137796  0.00594501
#    0.00430665  0.00079567  0.03919502  0.0112206   0.          0.
#   -0.03191387  0.00664013  0.02660975  0.05810235  0.02031974  0.00943767
#    0.1094235   0.02637792  0.          0.         -0.00023764 -0.00182032]

print("per-feature maximum after scaling:\n {}".format(X_test_scaled.max(axis=0)))
# per-feature maximum after scaling:
#  [0.9578778  0.81501522 0.95577362 0.89353128 0.81132075 1.21958701
#   0.87956888 0.9333996  0.93232323 1.0371347  0.42669616 0.49765736
#   0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585
#   1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793
#   0.9154725  1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]
  • 由于scaler是在X_train上拟合的,所以X_train变换后所有特征都刚好位于0~1之间;而X_test变换后,部分特征的最小值和最大值会超出0~1的范围

3.3 对训练数据和测试数据进行相同的缩放

import mglearn
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

# 绘制训练集和测试集
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("Original Data")

# 利用MinMaxScaler缩放数据
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 将正确缩放的数据可视化
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[1].set_title("Scaled Data")

# 单独对测试集进行缩放
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)

X_test_scaled_badly = test_scaler.transform(X_test)

# 将错误缩放的数据可视化
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="training set", s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^', c=mglearn.cm2(1), label="test set", s=60)
axes[2].set_title("Improperly Scaled Data")

for ax in axes:
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

plt.tight_layout()
plt.show()

对左图中的训练数据和测试数据同时缩放的效果(中)和分别缩放的效果(右)

  • 左图:未缩放的二维数据集
  • 中图:使用MinMaxScaler进行缩放
  • 右图:对训练集和测试集分别进行不同的缩放

快捷方式与高效的替代方法

scaler.fit(X).transform(X)
# 等效于
scaler.fit_transform(X)
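
# 一个简单的验证示意(非书中代码):以MinMaxScaler和cancer数据为例,两种写法的结果应完全一致
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X = load_breast_cancer().data

X_a = MinMaxScaler().fit(X).transform(X)
X_b = MinMaxScaler().fit_transform(X)

print(np.allclose(X_a, X_b))
# 应输出 True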

3.4 预处理对监督学习的作用

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

svm = SVC(C=100)
svm.fit(X_train, y_train)

print("test score: {:.3f}".format(svm.score(X_test, y_test)))
# test score: 0.944

# 使用0-1缩放进行预处理
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 在缩放后的训练数据上学习SVM
svm.fit(X_train_scaled, y_train)

# 在缩放后的测试集上计算分数
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.965

# 利用零均值和单位方差的缩放方法进行预处理
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 在缩放后的训练数据上学习SVM
svm.fit(X_train_scaled, y_train)

# 在缩放后的测试集上计算分数
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.958

4. 降维、特征提取与流形学习

4.1 主成分分析(PCA)

  • 一种旋转数据集的方法
    • 旋转后的特征在统计上不相关
    • 旋转后,通常根据新特征对解释数据的重要性来选择它的一个子集
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_pca_illustration()

plt.tight_layout()
plt.show()

用PCA做数据变换

  • 左上图:原始数据点

    1. 算法查找方差最大的方向(Component 1)

      • 数据中包含最多信息的方向
    2. 算法找到与第一个方向正交且包含最多信息的方向

    • 利用此方法找到的方向称为主成分
      • 数据方差的主要方向
    • 主成分的个数与原始特征相同
  • 右上图:旋转原始数据,使第一主成分与x轴平行且第二主成分与y轴平行

    • 旋转之前,数据减去平均值
      • 使变换后的数据以0为中心
  • 左下图:仅保留第一个主成分

    • 将二维数据降为一维数据
  • 右下图:反向旋转并将平均值重新加到数据中

    • 去除数据中的噪声影响
    • 将主成分中保留的那部分信息可视化
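
  • 用SVD手工实现PCA变换(示意)
    • 下面的代码并非书中示例,只是按上述步骤(中心化、找正交的方差最大方向、投影)给出一个基于NumPy的最简实现,并与scikit-learn的PCA对比

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA

    X = load_breast_cancer().data

    # 中心化:减去每个特征的平均值
    X_centered = X - X.mean(axis=0)

    # SVD分解:Vt的各行即按方差从大到小排列的主成分方向
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # 投影到前两个主成分上,相当于PCA(n_components=2)的变换结果
    X_pca_manual = X_centered @ Vt[:2].T

    X_pca_sklearn = PCA(n_components=2, svd_solver='full').fit_transform(X)

    # 每个主成分的符号可能相反,因此比较绝对值
    print(np.allclose(np.abs(X_pca_manual), np.abs(X_pca_sklearn)))   # 应输出 True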

4.1.1 将PCA应用于cancer数据集并可视化

  • 对每个特征分别计算两个类别的直方图

    import mglearn
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
    
    fig, axes = plt.subplots(15, 2, figsize=(10, 20))
    malignant = cancer.data[cancer.target == 0]
    benign = cancer.data[cancer.target == 1]
    
    ax = axes.ravel()
    
    for i in range(30):
        _, bins = np.histogram(cancer.data[:, i], bins=50)
        ax[i].hist(malignant[:, i], bins=bins, color=mglearn.cm3(0), alpha=.5)
        ax[i].hist(benign[:, i], bins=bins, color=mglearn.cm3(2), alpha=.5)
        ax[i].set_title(cancer.feature_names[i])
        ax[i].set_yticks(())
    
    ax[0].set_xlabel("Feature magnitude")
    ax[0].set_ylabel("Frequency")
    ax[0].legend(["malignant", "benign"], loc="best")
    
    fig.tight_layout()
    plt.show()
    

    乳腺癌数据集中每个类别的特征直方图

  • 利用PCA,可以捕捉到特征之间的主要相互作用

    1. 利用StandardScaler缩放数据

      from sklearn.preprocessing import StandardScaler
      from sklearn.datasets import load_breast_cancer
      
      cancer = load_breast_cancer()
      
      scaler = StandardScaler()
      scaler.fit(cancer.data)
      X_scaled = scaler.transform(cancer.data)
      
    2. 学习并应用PCA

      • 默认情况下,PCA仅旋转(移动)数据,并保留所有主成分
      from sklearn.preprocessing import StandardScaler
      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA
      
      cancer = load_breast_cancer()
      
      scaler = StandardScaler()
      scaler.fit(cancer.data)
      X_scaled = scaler.transform(cancer.data)
      
      # 保留数据的前两个主成分
      pca = PCA(n_components=2)
      # n_components: 保留的主成分个数
      
      # 对乳腺癌数据拟合PCA模型
      pca.fit(X_scaled)
      
      # 将数据变换到前两个主成分的方向上
      X_pca = pca.transform(X_scaled)
      
      print("Original shape: {}".format(str(X_scaled.shape)))
      # Original shape: (569, 30)
      
      print("Reduced shape: {}".format(str(X_pca.shape)))
      # Reduced shape: (569, 2)
      
    3. 对前两个主成分作图

      import mglearn
      from matplotlib import pyplot as plt
      from sklearn.preprocessing import StandardScaler
      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA
      
      cancer = load_breast_cancer()
      
      scaler = StandardScaler()
      scaler.fit(cancer.data)
      X_scaled = scaler.transform(cancer.data)
      
      pca = PCA(n_components=2)
      pca.fit(X_scaled)
      
      X_pca = pca.transform(X_scaled)
      
      plt.figure(figsize=(8, 8))
      mglearn.discrete_scatter(X_pca[:, 0], X_pca[:, 1], cancer.target)
      
      plt.legend(cancer.target_names, loc="best")
      plt.gca().set_aspect("equal")
      plt.xlabel("First principal component")
      plt.ylabel("Second principal component")
      
      plt.tight_layout()
      plt.show()
      

      利用前两个主成分绘制乳腺癌数据集的二维散点图

  • PCA的缺点:不容易对图中的两个轴做出解释

  • 主成分在PCA对象的components_属性中

  • 用热图将系数可视化

    from matplotlib import pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    
    cancer = load_breast_cancer()
    
    scaler = StandardScaler()
    scaler.fit(cancer.data)
    X_scaled = scaler.transform(cancer.data)
    
    pca = PCA(n_components=2)
    pca.fit(X_scaled)
    
    X_pca = pca.transform(X_scaled)
    
    plt.matshow(pca.components_, cmap='viridis')
    plt.yticks([0, 1], ["First component", "Second component"])
    plt.colorbar()
    plt.xticks(range(len(cancer.feature_names)), cancer.feature_names, rotation=60, ha='left')
    
    plt.xlabel("Feature")
    plt.ylabel("Principal components")
    plt.tight_layout()
    plt.show()
    

    乳腺癌数据集前两个主成分的热图
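
  • 辅助解读主成分(示意)
    • 这段代码并非书中示例:打印每个主成分中权重绝对值最大的5个特征,便于理解两个坐标轴的含义

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    cancer = load_breast_cancer()
    X_scaled = StandardScaler().fit_transform(cancer.data)

    pca = PCA(n_components=2)
    pca.fit(X_scaled)

    for i, component in enumerate(pca.components_):
        print("Component {}:".format(i + 1))
        # 按权重绝对值从大到小排序,取前5个特征
        for idx in np.argsort(np.abs(component))[::-1][:5]:
            print("  {:25} {:+.3f}".format(cancer.feature_names[idx], component[idx]))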

4.1.2 特征提取的特征脸

  • 思想:找到一种数据表示,它比原始表示更适合于分析
  • 应用实例:图像
    • 图像由像素构成
    • 通常存储为RGB强度
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_lfw_people
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

image_shape = people.images[0].shape

fix, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})

for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])

print("people.images.shape: {}".format(people.images.shape))
# people.images.shape: (3023, 87, 65)
# 3023张图像
# 87像素*65像素

print("Number of classes: {}".format(len(people.target_names)))
# Number of classes: 62
# 62个人

plt.tight_layout()
plt.show()

来自Labeled Faces in the Wild数据集的一些图像

  • 数据集有些偏斜

    • 参与分类的两个类别(或多个类别)样本数量差异很大
    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    
    # 计算每个目标出现的次数
    counts = np.bincount(people.target)
    
    # 将次数与目标名称一起打印出来
    for i, (count, name) in enumerate(zip(counts, people.target_names)):
        print("{0:25} {1:3}".format(name, count), end='   ')
        if (i + 1) % 3 == 0:
            print()
    # Alejandro Toledo           39   Alvaro Uribe               35   Amelie Mauresmo            21   
    # Andre Agassi               36   Angelina Jolie             20   Ariel Sharon               77   
    # Arnold Schwarzenegger      42   Atal Bihari Vajpayee       24   Bill Clinton               29   
    # Carlos Menem               21   Colin Powell              236   David Beckham              31   
    # Donald Rumsfeld           121   George Robertson           22   George W Bush             530   
    # Gerhard Schroeder         109   Gloria Macapagal Arroyo    44   Gray Davis                 26   
    # Guillermo Coria            30   Hamid Karzai               22   Hans Blix                  39   
    # Hugo Chavez                71   Igor Ivanov                20   Jack Straw                 28   
    # Jacques Chirac             52   Jean Chretien              55   Jennifer Aniston           21   
    # Jennifer Capriati          42   Jennifer Lopez             21   Jeremy Greenstock          24   
    # Jiang Zemin                20   John Ashcroft              53   John Negroponte            31   
    # Jose Maria Aznar           23   Juan Carlos Ferrero        28   Junichiro Koizumi          60   
    # Kofi Annan                 32   Laura Bush                 41   Lindsay Davenport          22   
    # Lleyton Hewitt             41   Luiz Inacio Lula da Silva  48   Mahmoud Abbas              29   
    # Megawati Sukarnoputri      33   Michael Bloomberg          20   Naomi Watts                22   
    # Nestor Kirchner            37   Paul Bremer                20   Pete Sampras               22   
    # Recep Tayyip Erdogan       30   Ricardo Lagos              27   Roh Moo-hyun               32   
    # Rudolph Giuliani           26   Saddam Hussein             23   Serena Williams            52   
    # Silvio Berlusconi          33   Tiger Woods                23   Tom Daschle                25   
    # Tom Ridge                  33   Tony Blair                144   Vicente Fox                32   
    # Vladimir Putin             49   Winona Ryder               24   
    
  • 降低数据偏斜

    • 每个人最多取50张图像
    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    # 将灰度值缩放到0到1之间,而不是在0到255之间
    # 以得到更好的数据稳定性
    X_people = X_people / 255.
    
  • 使用单一最近邻分类器(1-nn)

    • 寻找与要分类的人脸最为相似的人脸
    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    # 将数据分为训练集和测试集
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    # 使用一个邻居构建KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    
    print("test score: {:.3f}".format(knn.score(X_test, y_test)))
    # test score: 0.215
    
  • 使用PCA

    • 启用白化(whitening)选项

      • 将主成分缩放到相同的尺度
      • 其效果与在PCA变换之后再使用StandardScaler相同(见下方的验证示意)
      from matplotlib import pyplot as plt
      import mglearn
      
      mglearn.plots.plot_pca_whitening()
      
      plt.tight_layout()
      plt.show()
      

      利用启用白化的PCA进行数据变换
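
      • 一个简单的验证示意(并非书中代码):whiten=True的输出等价于先做普通PCA、再把每个主成分除以其标准差,即缩放到单位方差

      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA

      X = load_breast_cancer().data

      # 启用白化:每个主成分都被缩放到单位方差
      X_whitened = PCA(n_components=2, whiten=True, svd_solver='full').fit_transform(X)
      print(X_whitened.std(axis=0, ddof=1))   # 两个分量的标准差都应接近1

      # 等价于先做普通PCA,再把每个主成分除以各自的标准差
      X_pca = PCA(n_components=2, svd_solver='full').fit_transform(X)
      print(np.allclose(X_whitened, X_pca / X_pca.std(axis=0, ddof=1)))   # 应输出 True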

    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
    # 提取前100个主成分,并进行拟合
    
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    print("X_train_pca.shape: {}".format(X_train_pca.shape))
    # X_train_pca.shape: (1547, 100)
    
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train_pca, y_train)
    
    print("test score: {:.3f}".format(knn.score(X_test_pca, y_test)))
    # test score: 0.297
    
  • 主成分可视化

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    pca = PCA(n_components=100, random_state=0).fit(X_train)
    
    image_shape = people.images[0].shape
    
    fix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
    
    for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
        ax.imshow(component.reshape(image_shape), cmap='viridis')
        ax.set_title("{}. component".format((i + 1)))
    
    plt.tight_layout()
    plt.show()
    

    人脸数据集前15个主成分的成分向量

  • 尝试找到一些数字(PCA旋转后的新特征值),使我们可以将测试点表示为主成分的加权求和

    图解PCA:将图像分解为成分的加权求和

    • x_0、x_1等:该数据点在各个主成分上的系数
  • 对人脸数据进行变换

    • 将数据降维到只包含一些主成分,然后反向旋转回到原始空间
      • 回到原始特征空间的方法:inverse_transform
    import mglearn
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    image_shape = people.images[0].shape
    
    mglearn.plots.plot_pca_faces(X_train, X_test, image_shape)
    
    plt.tight_layout()
    plt.show()
    

    利用越来越多的主成分对三张人脸图像进行重建

  • 利用PCA的前两个主成分,将数据集中的所有人脸在散点图中可视化

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    import mglearn
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
    
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    mglearn.discrete_scatter(X_train_pca[:, 0], X_train_pca[:, 1], y_train)
    
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    
    plt.tight_layout()
    plt.show()
    

利用前两个主成分绘制人脸数据集的散点图

4.2 非负矩阵分解(NMF)

  • 提取有用的特征
  • 将每个数据点写成一些分量的加权求和
  • 希望分量和系数都大于或等于0
  • 只能应用于每个特征都是非负的数据
  • 对由多个独立源相加创建而成的数据特别有用
    • 多人说话的音轨
    • 多种乐器的音乐
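
  • NMF分解的最小示意(并非书中代码):把非负矩阵X近似分解为非负的系数矩阵W与非负的分量矩阵H的乘积,即X ≈ W·H

    import numpy as np
    from sklearn.decomposition import NMF

    # 任意构造的小型非负示例数据
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(6, 5))

    nmf = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
    W = nmf.fit_transform(X)   # 形状为(6, 2):每个样本在各分量上的非负系数
    H = nmf.components_        # 形状为(2, 5):非负的分量

    print(np.all(W >= 0), np.all(H >= 0))   # True True
    print(np.abs(X - W @ H).max())          # 重建误差:不为0,因为只保留了2个分量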

4.2.1 将NMF应用于模拟数据

from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_nmf_illustration()

plt.tight_layout()
plt.show()

两个分量的非负矩阵分解(左)和一个分量的非负矩阵分解(右)找到的分量

  • 左图:所有数据点都可以写成这两个分量的正数组合
  • 右图:指向平均值的分量
  • NMF使用随机初始化,根据随机种子的不同可能产生不同的结果

4.2.2 将NMF应用于人脸图像

  • NMF的主要参数

    • 想要提取的分量个数
      • 要小于输入特征的个数
  • 分量个数对NMF重建数据的影响

    from matplotlib import pyplot as plt
    import mglearn

    # 注意:这里需要前文在LFW人脸数据上得到的X_train、X_test和image_shape
    mglearn.plots.plot_nmf_faces(X_train, X_test, image_shape)

    plt.tight_layout()
    plt.show()
    

    利用越来越多分量的NMF重建三张人脸图像

    • 比PCA稍差
  • 提取一部分分量,并观察数据

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import NMF
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    
    X_people = X_people / 255.
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    image_shape = people.images[0].shape
    
    nmf = NMF(n_components=15, random_state=0)
    nmf.fit(X_train)
    
    X_train_nmf = nmf.transform(X_train)
    X_test_nmf = nmf.transform(X_test)
    
    fix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
    for i, (component, ax) in enumerate(zip(nmf.components_, axes.ravel())):
        ax.imshow(component.reshape(image_shape))
        ax.set_title("{}. component".format(i))
    
    plt.tight_layout()
    plt.show()
    

    使用15个分量的NMF在人脸数据集上找到的分量

    • 绘制分量4和7的图像

      import numpy as np
      from matplotlib import pyplot as plt
      from sklearn.datasets import fetch_lfw_people
      from sklearn.model_selection import train_test_split
      from sklearn.decomposition import NMF
      import ssl
      
      ssl._create_default_https_context = ssl._create_unverified_context
      
      people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
      mask = np.zeros(people.target.shape, dtype=np.bool_)
      
      for target in np.unique(people.target):
          mask[np.where(people.target == target)[0][:50]] = 1
      
      X_people = people.data[mask]
      y_people = people.target[mask]
      
      X_people = X_people / 255.
      
      X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
      
      image_shape = people.images[0].shape
      
      nmf = NMF(n_components=15, random_state=0)
      nmf.fit(X_train)
      
      X_train_nmf = nmf.transform(X_train)
      
      compn = 4
      # 按第4个分量排序,绘制前10张图像
      inds = np.argsort(X_train_nmf[:, compn])[::-1]
      fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
      for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
          ax.imshow(X_train[ind].reshape(image_shape))
      
      plt.tight_layout()
      plt.show()
      
      compn = 7
      # 按第7个分量排序,绘制前10张图像
      inds = np.argsort(X_train_nmf[:, compn])[::-1]
      fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
      for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
          ax.imshow(X_train[ind].reshape(image_shape))
      
      plt.tight_layout()
      plt.show()
      

      分量4系数较大的人脸

      分量7系数较大的人脸

  • 对信号进行处理

    import mglearn
    from matplotlib import pyplot as plt
    
    S = mglearn.datasets.make_signals()
    
    plt.figure(figsize=(6, 1))
    plt.plot(S, '-')
    plt.xlabel("Time")
    plt.ylabel("Signal")
    
    plt.tight_layout()
    plt.show()
    

    原始信号源

  • 将混合信号分解为原始信号

    import mglearn
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.decomposition import NMF, PCA
    
    S = mglearn.datasets.make_signals()
    
    # 将数据混合成100维的状态
    A = np.random.RandomState(0).uniform(size=(100, 3))
    X = np.dot(S, A.T)
    
    # 使用NMF还原信号
    nmf = NMF(n_components=3, random_state=42)
    S_ = nmf.fit_transform(X)
    
    # 使用PCA还原信号
    pca = PCA(n_components=3)
    H = pca.fit_transform(X)
    
    models = [X, S, S_, H]
    names = ['Observations (first three measurements)',
             'True sources',
             'NMF recovered signals',
             'PCA recovered signals']
    fig, axes = plt.subplots(4, figsize=(8, 4), gridspec_kw={'hspace': .5}, subplot_kw={'xticks': (), 'yticks': ()})
    
    for model, name, ax in zip(models, names, axes):
        ax.set_title(name)
        ax.plot(model[:, :3], '-')
    
    plt.tight_layout()
    plt.show()
    

    利用NMF和PCA还原混合信号源

4.3 用t-SNE进行流形学习

  • 流形学习算法

    • 用于可视化的算法
    • 允许进行复杂的映射
    • 可以给出较好的可视化
    • 算法计算训练数据的一种新表示,但不允许变换新数据
    • 因此不能应用于测试集,更多地用于探索性的数据分析
  • t-SNE

    • 思想:找到数据的一个二维表示,尽可能地保持数据点之间的距离
    • 步骤
      1. 给出每个数据点的随机二维表示
      2. 尝试让在原始特征空间中距离较近的点更加靠近,在原始特征空间中距离较远的点更加远离
    • 重点关注距离较近的点
    • 试图保留表示哪些点彼此邻近的信息
    • 仅根据原始空间中数据点之间的靠近程度就能将各个类别明确分开
  • 加载手写数字数据集

    from matplotlib import pyplot as plt
    from sklearn.datasets import load_digits
    
    digits = load_digits()
    
    fig, axes = plt.subplots(2, 5, figsize=(10, 5), subplot_kw={'xticks': (), 'yticks': ()})
    for ax, img in zip(axes.ravel(), digits.images):
        ax.imshow(img)
    
    plt.tight_layout()
    plt.show()
    

    digits数据集的示例图像

  • 用PCA将降到二维的数据可视化

    • 对前两个主成分作图,并按类别对数据点着色
    from matplotlib import pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    
    digits = load_digits()
    
    # 构建一个PCA模型
    pca = PCA(n_components=2)
    pca.fit(digits.data)
    
    # 将digits数据变换到前两个主成分的方向上
    digits_pca = pca.transform(digits.data)
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
              "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]
    
    plt.figure(figsize=(10, 10))
    plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max())
    plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max())
    
    for i in range(len(digits.data)):
        # 将数据实际绘制成文本,而不是散点
        plt.text(digits_pca[i, 0], digits_pca[i, 1], str(digits.target[i]),
                 color=colors[digits.target[i]],
                 fontdict={'weight': 'bold', 'size': 9})
    
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    
    plt.tight_layout()
    plt.show()
    

    利用前两个主成分绘制digits数据集的散点图

    • 0、4、6相对较好地分开
  • 将t-SNE应用于数据集

    • TSNE类没有transform方法
    • 调用fit_transform代替
      • 构建模型,并立刻返回变换后的数据
    from matplotlib import pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    
    digits = load_digits()
    
    tsne = TSNE(random_state=42)
    
    # 使用fit_transform而不是fit,因为TSNE没有transform方法
    digits_tsne = tsne.fit_transform(digits.data)
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
              "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]
    
    plt.figure(figsize=(10, 10))
    plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)
    plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)
    
    for i in range(len(digits.data)):
        # 将数据实际绘制成文本,而不是散点
        plt.text(digits_tsne[i, 0], digits_tsne[i, 1], str(digits.target[i]),
                 color=colors[digits.target[i]],
                 fontdict={'weight': 'bold', 'size': 9})
    
    plt.xlabel("t-SNE feature 0")
    plt.xlabel("t-SNE feature 1")
    
    plt.tight_layout()
    plt.show()
    

    利用t-SNE找到的两个分量绘制digits数据集的散点图

    • 大多数类别都形成一个密集的组

5. 聚类

  • 将数据集划分成组(簇)的任务
  • 目标:划分数据,使得一个簇内的数据点非常相似且不同簇内的数据点非常不同
  • 算法为每个数据点分配(或预测)一个数字,表示这个点属于哪个簇

5.1 k均值聚类

  • 试图找到代表数据特定区域的簇中心

  • 步骤

    1. 将每个数据点分配给最近的簇中心

    2. 将每个簇中心设置为所分配的所有数据点的平均值

    3. 重复执行以上两个步骤,直到簇的分配不再发生变化
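
  • 按上述三个步骤的一个极简NumPy实现(仅为示意,并非书中代码;实际使用时应采用scikit-learn的KMeans)

    import numpy as np
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(random_state=1)
    rng = np.random.RandomState(0)

    # 随机选取3个数据点作为初始簇中心
    centers = X[rng.choice(len(X), 3, replace=False)]

    for _ in range(10):
        # 步骤1:把每个点分配给最近的簇中心
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 步骤2:把每个簇中心更新为所分配数据点的平均值(空簇保持原中心)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(3)])
        # 步骤3:簇中心不再变化时结束
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    print(centers)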

  • 算法说明

    from matplotlib import pyplot as plt
    import mglearn
    
    mglearn.plots.plot_kmeans_algorithm()
    
    plt.tight_layout()
    plt.show()
    

    输入数据与k均值算法的三个步骤

    • 三角形:簇中心
    • 圆形:数据点
    • 颜色:簇成员
    • 寻找3个簇
      • 声明3个随机数据点为簇中心来将算法初始化
      • 运行迭代算法
        • 每个数据点被分配给距离最近的簇中心
        • 将簇中心修改为所分配点的平均值
      • 将这两步再重复2次;第三次迭代后,分配给各簇中心的数据点保持不变,算法结束
  • 簇中心的边界

    from matplotlib import pyplot as plt
    import mglearn
    
    mglearn.plots.plot_kmeans_boundaries()
    
    plt.tight_layout()
    plt.show()
    

    k均值算法找到的簇中心和簇边界

  • 使用k均值

    from matplotlib import pyplot as plt
    import mglearn
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    
    # 生成模拟的二维数据
    X, y = make_blobs(random_state=1)
    
    # 构建聚类模型
    kmeans = KMeans(n_clusters=3)
    # n_clusters: 簇的个数(默认为8)
    
    kmeans.fit(X)
    
    # 打印每个点的簇标签
    print("Cluster memberships:\n{}".format(kmeans.labels_))
    # Cluster memberships:
    # [0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2
    #  2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0
    #  0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]
    
    # predict方法也可以为新数据点分配簇标签
    print(kmeans.predict(X))
    # [0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2
    #  2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0
    #  0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]
    
    mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
    mglearn.discrete_scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
                             markers='^', markeredgewidth=2)
    
    plt.tight_layout()
    plt.show()
    
    • 每个元素都有一个标签

      • 不存在真实的标签
      • 标签本身没有先验意义
    • 绘制图像

      from matplotlib import pyplot as plt
      import mglearn
      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      
      X, y = make_blobs(random_state=1)
      
      kmeans = KMeans(n_clusters=3)
      kmeans.fit(X)
      
      mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
      mglearn.discrete_scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
                               markers='^', markeredgewidth=2)
      
      plt.tight_layout()
      plt.show()
      

      3个簇的k均值算法找到的簇分配和簇中心

    • 使用更多或更少的簇中心

      from matplotlib import pyplot as plt
      import mglearn
      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      
      X, y = make_blobs(random_state=1)
      
      fig, axes = plt.subplots(1, 2, figsize=(10, 5))
      
      # 使用2个簇中心
      kmeans = KMeans(n_clusters=2)
      kmeans.fit(X)
      assignments = kmeans.labels_
      
      mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[0])
      
      # 使用5个簇中心
      kmeans = KMeans(n_clusters=5)
      kmeans.fit(X)
      assignments = kmeans.labels_
      
      mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[1])
      
      plt.tight_layout()
      plt.show()
      

      使用2个簇(左)和5个簇(右)的k均值算法找到的簇分配

5.1.1 k均值的失败案例

  • 每个簇仅由其中心定义

    • 每个簇都是凸形
  • k均值只能找到相对简单的形状

  • k均值假设所有簇在某种程度上都具有相同的直径,总是将簇之间的边界刚好画在簇中心的中间位置

    from matplotlib import pyplot as plt
    import mglearn
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    
    # 生成簇标准差(密度)各不相同的数据
    X_varied, y_varied = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)
    
    y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)
    
    mglearn.discrete_scatter(X_varied[:, 0], X_varied[:, 1], y_pred)
    
    plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc='best')
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    plt.tight_layout()
    plt.show()
    

    簇的密度不同时,k均值找到的簇分配

    • 簇0和簇1都包含一些远离簇中其他点的点
  • k均值假设所有方向对每个簇都同等重要

    import numpy as np
    from matplotlib import pyplot as plt
    import mglearn
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    
    # 生成一些随机分组数据
    X, y = make_blobs(random_state=170, n_samples=600)
    rng = np.random.RandomState(74)
    
    # 变换数据使其拉长
    transformation = rng.normal(size=(2, 2))
    X = np.dot(X, transformation)
    
    # 将数据聚类成3个簇
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(X)
    y_pred = kmeans.predict(X)
    
    # 画出簇分配和簇中心
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='^', c=[0, 1, 2], s=100, linewidth=2)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    plt.tight_layout()
    plt.show()
    

    k均值无法识别非球形簇

  • 簇的形状很复杂

    from matplotlib import pyplot as plt
    import mglearn
    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans
    
    # 生成模拟的two moons数据(这次的噪声较小)
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    # 将数据聚类成2个簇
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(X)
    y_pred = kmeans.predict(X)
    
    # 画出簇分配和簇中心
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm2, s=60)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='^', c=[mglearn.cm2(0), mglearn.cm2(1)], s=100, linewidth=2)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    plt.tight_layout()
    plt.show()
    

    k均值无法识别具有复杂形状的簇

5.1.2 矢量量化,或者将k均值看作分解

  • 矢量量化:k均值是一种分解方法,其中每个点用单一分量来表示

  • 并排比较PCA、NMF和k均值,分别显示提取的分量,以及利用100个分量对测试集中人脸的重建

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import NMF, PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)
    
    nmf = NMF(n_components=100, random_state=0)
    nmf.fit(X_train)
    pca = PCA(n_components=100, random_state=0)
    pca.fit(X_train)
    kmeans = KMeans(n_clusters=100, random_state=0)
    kmeans.fit(X_train)
    
    X_reconstructed_pca = pca.inverse_transform(pca.transform(X_test))
    X_reconstructed_kmeans = kmeans.cluster_centers_[kmeans.predict(X_test)]
    X_reconstructed_nmf = np.dot(nmf.transform(X_test), nmf.components_)
    
    fig, axes = plt.subplots(3, 5, figsize=(8, 8), subplot_kw={'xticks': (), 'yticks': ()})
    
    fig.suptitle("Extracted Components")
    for ax, comp_kmeans, comp_pca, comp_nmf in zip(
            axes.T, kmeans.cluster_centers_, pca.components_, nmf.components_):
        ax[0].imshow(comp_kmeans.reshape(image_shape))
        ax[1].imshow(comp_pca.reshape(image_shape), cmap='viridis')
        ax[2].imshow(comp_nmf.reshape(image_shape))
    
    axes[0, 0].set_ylabel("kmeans")
    axes[1, 0].set_ylabel("pca")
    axes[2, 0].set_ylabel("nmf")
    
    plt.tight_layout()
    
    fig, axes = plt.subplots(4, 5, subplot_kw={'xticks': (), 'yticks': ()},figsize=(8, 8))
    
    fig.suptitle("Reconstructions")
    for ax, orig, rec_kmeans, rec_pca, rec_nmf in zip(
            axes.T, X_test, X_reconstructed_kmeans, X_reconstructed_pca, X_reconstructed_nmf):
        ax[0].imshow(orig.reshape(image_shape))
        ax[1].imshow(rec_kmeans.reshape(image_shape))
        ax[2].imshow(rec_pca.reshape(image_shape))
        ax[3].imshow(rec_nmf.reshape(image_shape))
    
    axes[0, 0].set_ylabel("original")
    axes[1, 0].set_ylabel("kmeans")
    axes[2, 0].set_ylabel("pca")
    axes[3, 0].set_ylabel("nmf")
    
    plt.tight_layout()
    plt.show()
    

对比k均值的簇中心与PCA和NMF找到的分量

利用100个分量(或簇中心)的k均值、PCA和NMF的图像重建的对比——k均值的每张图像中仅使用了一个簇中心

  • 用比输入维度更多的簇来对数据进行编码

    from matplotlib import pyplot as plt
    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans
    
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    kmeans = KMeans(n_clusters=10, random_state=0)
    kmeans.fit(X)
    y_pred = kmeans.predict(X)
    
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='^', s=60, c=range(kmeans.n_clusters),linewidth=2, cmap='Paired')
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    print("Cluster memberships:\n{}".format(kmeans.labels_))
    # Cluster memberships:
    # [9 2 5 4 2 7 9 6 9 6 1 0 2 6 1 9 3 0 3 1 7 6 8 6 8 5 2 7 5 8 9 8 6 5 3 7 0
    #  9 4 5 0 1 3 5 2 8 9 1 5 6 1 0 7 4 6 3 3 6 3 8 0 4 2 9 6 4 8 2 8 4 0 4 0 5
    #  6 4 5 9 3 0 7 8 0 7 5 8 9 8 0 7 3 9 7 1 7 2 2 0 4 5 6 7 8 9 4 5 4 1 2 3 1
    #  8 8 4 9 2 3 7 0 9 9 1 5 8 5 1 9 5 6 7 9 1 4 0 6 2 6 4 7 9 5 5 3 8 1 9 5 6
    #  3 5 0 2 9 3 0 8 6 0 3 3 5 6 3 2 0 2 3 0 2 6 3 4 4 1 5 6 7 1 1 3 2 4 7 2 7
    #  3 8 6 4 1 4 3 9 9 5 1 7 5 8 2]
    
    plt.tight_layout()
    plt.show()
    

    利用k均值的许多簇来表示复杂数据集中的变化

  • 将到每个簇中心的距离作为特征,可以得到一种表现力很强的数据表示

    • 使用transform方法
    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans
    
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    kmeans = KMeans(n_clusters=10, random_state=0)
    kmeans.fit(X)
    
    distance_features=kmeans.transform(X)
    print("Distance feature shape: {}".format(distance_features.shape))
    # Distance feature shape: (200, 10)
    
    print("Distance features:\n{}".format(distance_features))
    # Distance features:
    # [[0.9220768  1.46553151 1.13956805 ... 1.16559918 1.03852189 0.23340263]
    #  [1.14159679 2.51721597 0.1199124  ... 0.70700803 2.20414144 0.98271691]
    #  [0.78786246 0.77354687 1.74914157 ... 1.97061341 0.71561277 0.94399739]
    #  ...
    #  [0.44639122 1.10631579 1.48991975 ... 1.79125448 1.03195812 0.81205971]
    #  [1.38951924 0.79790385 1.98056306 ... 1.97788956 0.23892095 1.05774337]
    #  [1.14920754 2.4536383  0.04506731 ... 0.57163262 2.11331394 0.88166689]]
    

5.1.3 优点、缺点

  • 优点
    • 非常流行的聚类算法
    • 相对容易理解和实现
    • 运行速度相对较快
    • 可以轻松扩展到大型数据集
  • 缺点
    • 依赖于随机初始化
      • 算法的输出依赖于随机种子
      • 默认情况下,scikit-learn用10种不同的随机初始化将算法运行10次,并返回最佳结果(簇的方差之和最小)
    • 对簇形状的假设的约束性较强
    • 要求指定所要寻找的簇的个数(在现实世界的应用中可能并不知道这个数字)

5.2 凝聚聚类

  • 许多基于相同原则构建的聚类算法

    • 原则:算法首先声明每个点是自己的簇,然后合并两个最相似的簇,直到满足某种停止准则为止
      • 准则
        • scikit-learn:簇的个数
      • 链接准则:规定如何度量最相似的簇
        • 定义在两个现有的簇之间
        • scikit-learn中实现的三种选项
          • ward
            • 默认选项
            • 挑选两个簇进行合并,使得所有簇中的方差增加最小
            • 会得到大小差不多相等的簇
            • 用于大多数数据集
          • average
            • 将簇中所有点之间平均距离最小的两个簇合并
          • complete
            • 将簇中点之间最大距离最小的两个簇合并
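  • 链接准则的使用示意
    • 下面的代码并非书中示例,仅演示如何通过linkage参数在ward、average和complete之间切换

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(random_state=1)

    for linkage in ["ward", "average", "complete"]:
        agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
        labels = agg.fit_predict(X)
        print(linkage, labels[:10])   # 不同链接准则可能给出不同的簇划分
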
  • 二维数据集上的凝聚聚类过程

    • 寻找3个簇
    import matplotlib.pyplot as plt
    import mglearn
    
    mglearn.plots.plot_agglomerative_algorithm()
    
    plt.tight_layout()
    plt.show()
    

    凝聚聚类用迭代的方式合并两个最近的簇

  • 凝聚聚类对简单三簇数据的效果

    from matplotlib import pyplot as plt
    import mglearn
    from sklearn.datasets import make_blobs
    from sklearn.cluster import AgglomerativeClustering
    
    X, y = make_blobs(random_state=1)
    agg = AgglomerativeClustering(n_clusters=3)
    
    assignment = agg.fit_predict(X)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], assignment)
    
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    plt.tight_layout()
    plt.show()
    

    使用3个簇的凝聚聚类的簇分配

层次聚类与树状图

  • 同时查看所有可能的聚类

    from matplotlib import pyplot as plt
    import mglearn
    mglearn.plots.plot_agglomerative()
    
    plt.tight_layout()
    plt.show()
    

    凝聚聚类生成的层次化的簇分配(用线表示)以及带有编号的数据点

  • 树状图

    • 可以处理多维数据
    from matplotlib import pyplot as plt
    from sklearn.datasets import make_blobs
    from scipy.cluster.hierarchy import dendrogram, ward
    
    X, y = make_blobs(random_state=0, n_samples=12)
    
    # 将ward聚类应用于数据数组X
    # SciPy的ward函数返回一个数组,指定执行凝聚聚类时跨越的距离
    linkage_array = ward(X)
    
    # 现在为包含簇之间距离的linkage array绘制树状图
    dendrogram(linkage_array)
    
    # 在树中标记划分成两个簇或三个簇的位置
    ax = plt.gca()
    bounds = ax.get_xbound()
    ax.plot(bounds, [7.25, 7.25], '--', c='k')
    ax.plot(bounds, [4, 4], '--', c='k')
    
    ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15})
    ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})
    
    plt.xlabel("Sample index")
    plt.ylabel("Cluster distance")
    
    plt.tight_layout()
    plt.show()
    

    聚类的树状图(用线表示划分成两个簇和三个簇)

    • x轴:数据点
    • y轴:聚类算法中簇的合并时间
    • 分支长度:合并的簇之间的距离

5.3 DBSCAN

  • 优点

    • 不需要用户先验地设置簇的个数
    • 可以划分具有复杂形状的簇
    • 可以找出不属于任何簇的点
    • 可以扩展到相对较大的数据集
  • 缺点

    • 比凝聚聚类和k均值的运行速度稍慢
  • 原理:识别特征空间的“拥挤”区域中的点

    • “拥挤”区域(密集区域):区域中许多数据点靠近在一起
      • 密集区域中的点:核心样本
        • 如果在距一个给定数据点eps的距离内至少有min_samples个数据点,那么这个点就是核心样本
        • DBSCAN将彼此距离小于eps的核心样本放到同一个簇中
  • 思想:簇形成数据的密集区域,并由相对较空的区域隔开

  • 步骤

    1. 选取任意一个点
    2. 找到到这个点的距离小于等于eps的所有点
      • 如果距起始点的距离在eps之内的数据点个数小于min_samples,则这个点被标记为噪声
        • 这个点不属于任何簇
      • 如果距起始点的距离在eps之内的数据点个数不少于min_samples,则这个点被标记为核心样本,并被分配一个新的簇标签
    3. 访问该点的所有邻居(在距离eps以内)
      • 如果它们还没有被分配一个簇,则将刚刚创建的新的簇标签分配给它们
      • 如果它们是核心样本,那么依次访问其邻居
    4. 簇逐渐增大,直到在簇的eps距离内没有更多的核心样本为止
    5. 选取另一个未被访问过的点,重复以上步骤
  • eps和min_samples取不同值时的簇分类

    import matplotlib.pyplot as plt
    import mglearn
    
    mglearn.plots.plot_dbscan()
    
    plt.tight_layout()
    plt.show()
    # min_samples: 2 eps: 1.000000  cluster: [-1  0  0 -1  0 -1  1  1  0  1 -1 -1]
    # min_samples: 2 eps: 1.500000  cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
    # min_samples: 2 eps: 2.000000  cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
    # min_samples: 2 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
    # min_samples: 3 eps: 1.000000  cluster: [-1  0  0 -1  0 -1  1  1  0  1 -1 -1]
    # min_samples: 3 eps: 1.500000  cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
    # min_samples: 3 eps: 2.000000  cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
    # min_samples: 3 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
    # min_samples: 5 eps: 1.000000  cluster: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
    # min_samples: 5 eps: 1.500000  cluster: [-1  0  0  0  0 -1 -1 -1  0 -1 -1 -1]
    # min_samples: 5 eps: 2.000000  cluster: [-1  0  0  0  0 -1 -1 -1  0 -1 -1 -1]
    # min_samples: 5 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
    

    在min_samples和eps参数不同取值的情况下,DBSCAN找到的簇分配

    • -1:噪声
    • 实心:属于簇的点
    • 空心:噪声点
    • 较大的标记:核心样本
    • 较小的标记:边界点
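
  • DBSCAN的基本用法示意
    • 下面的代码并非书中示例:噪声点的簇标签为-1,核心样本的下标保存在core_sample_indices_属性中

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(random_state=0, n_samples=12)

    dbscan = DBSCAN(eps=1.5, min_samples=2)
    labels = dbscan.fit_predict(X)

    print("labels:", labels)                              # -1表示噪声
    print("core samples:", dbscan.core_sample_indices_)   # 核心样本的下标
    print("noise points:", np.sum(labels == -1))          # 噪声点个数
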
  • 使用StandardScaler或MinMaxScaler对数据进行缩放后,有时更容易找到eps的较好取值

  • 在two_moons数据集上运行DBSCAN的结果

    import mglearn
    from matplotlib import pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler
    
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    # 将数据缩放成平均值为0、方差为1
    scaler = StandardScaler()
    scaler.fit(X)
    X_scaled = scaler.transform(X)
    
    dbscan = DBSCAN()
    clusters = dbscan.fit_predict(X_scaled)
    
    # 绘制簇分配
    plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)
    
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    
    plt.tight_layout()
    plt.show()
    

    利用默认值eps=0.5的DBSCAN找到的簇分配

5.4 聚类算法的对比与评估

5.4.1 用真实值评估聚类

  • 用于评估聚类算法相对于真实聚类结果的指标

    • 调整Rand指数(adjusted Rand index,ARI)
      • 最佳值:1
      • 不相关:0
    • 归一化互信息(NMI)
      • 最佳值:1
      • 不相关:0
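
  • ARI与NMI的计算示意(非书中代码):两者都以真实标签和聚类标签作为输入,标签编号本身不同并不影响得分

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [1, 1, 0, 0, 2, 2]   # 编号不同,但划分与y_true完全一致

    print("ARI: {:.2f}".format(adjusted_rand_score(y_true, y_pred)))            # 1.00
    print("NMI: {:.2f}".format(normalized_mutual_info_score(y_true, y_pred)))   # 1.00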
  • 使用ARI比较k均值、凝聚聚类和DBSCAN算法

    import numpy as np
    import mglearn
    from matplotlib import pyplot as plt
    from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics.cluster import adjusted_rand_score
    
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    # 将数据缩放成平均值为0、方差为1
    scaler = StandardScaler()
    scaler.fit(X)
    X_scaled = scaler.transform(X)
    
    fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
    
    # 列出要使用的算法
    algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
    
    # 创建一个随机的簇分配,作为参考
    random_state = np.random.RandomState(seed=0)
    random_clusters = random_state.randint(low=0, high=2, size=len(X))
    
    # 绘制随机分配
    axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
    axes[0].set_title("Random assignment - ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))
    
    for ax, algorithm in zip(axes[1:], algorithms):
        # 绘制簇分配和簇中心
        clusters = algorithm.fit_predict(X_scaled)
        ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
        ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__, adjusted_rand_score(y, clusters)))
    
    plt.tight_layout()
    plt.show()
    

    利用监督ARI分数在two_moons数据集上比较随机分配、k均值、凝聚聚类和DBSCAN

  • 评估聚类时,不应该使用accuracy_score

    • 精度评估:分配的簇标签与真实值完全匹配
    • 但簇标签没有意义
    from sklearn.metrics.cluster import adjusted_rand_score
    from sklearn.metrics import accuracy_score
    
    # 这两种点标签对应于相同的聚类
    clusters1 = [0, 0, 1, 1, 0]
    clusters2 = [1, 1, 0, 0, 1]
    
    # 精度为0,因为二者标签完全不同
    print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))
    # Accuracy: 0.00
    
    # 调整rand分数为1,因为二者聚类完全相同
    print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))
    # ARI: 1.00
    

5.4.2 在没有真实值的情况下评估聚类

  • 不需要真实值的聚类评分指标

    • 轮廓系数
      • 计算一个簇的紧致度
      • 越大越好
      • 最大值:1
      • 不允许复杂的形状
  • 使用轮廓系数比较k均值、凝聚聚类和DBSCAN算法

    import numpy as np
    import mglearn
    from matplotlib import pyplot as plt
    from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics.cluster import silhouette_score
    
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    
    # 将数据缩放成平均值为0、方差为1
    scaler = StandardScaler()
    scaler.fit(X)
    
    X_scaled = scaler.transform(X)
    
    fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
    
    # 列出要使用的算法
    algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
    
    # 创建一个随机的簇分配,作为参考
    random_state = np.random.RandomState(seed=0)
    random_clusters = random_state.randint(low=0, high=2, size=len(X))
    
    # 绘制随机分配
    axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
    axes[0].set_title("Random assignment - ARI: {:.2f}".format(silhouette_score(X_scaled, random_clusters)))
    
    for ax, algorithm in zip(axes[1:], algorithms):
        # 绘制簇分配和簇中心
        clusters = algorithm.fit_predict(X_scaled)
        ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
        ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__, silhouette_score(X_scaled, clusters)))
    
    plt.tight_layout()
    plt.show()
    

    利用无监督的轮廓分数在two_moons数据集上比较随机分配、k均值、凝聚聚类和DBSCAN

  • 较好的评估聚类的策略:使用基于鲁棒性的聚类指标

    • 先向数据中添加一些噪声,或使用不同的参数设定
    • 然后运行算法,并对结果进行比较
    • 思想:如果许多算法参数和许多数据扰动返回相同的结果,那么它很可能是可信的

5.4.3 在人脸数据集上比较算法

  • 加载人脸数据

    • 使用数据的特征脸表示
      • 由100个成分的PCA(whiten=True)生成
    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    # 从lfw数据中提取特征脸,并对数据进行变换
    pca = PCA(n_components=100, whiten=True, random_state=0)
    # 100个成分
    
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
用DBSCAN分析人脸数据集
  • 应用DBSCAN

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    # 应用默认参数的DBSCAN
    dbscan = DBSCAN()
    labels = dbscan.fit_predict(X_pca)
    print("Unique labels: {}".format(np.unique(labels)))
    # Unique labels: [-1]
    
    • 所有数据点都被标记为噪声
    • 改进的两种方式
      • 增大eps参数
      • 减小min_samples参数
  • 减小min_samples参数

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    dbscan = DBSCAN(min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    print("Unique labels: {}".format(np.unique(labels)))
    # Unique labels: [-1]
    
    • 没有发生变化
  • 增大eps参数

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    dbscan = DBSCAN(min_samples=3, eps=15)
    labels = dbscan.fit_predict(X_pca)
    print("Unique labels: {}".format(np.unique(labels)))
    # Unique labels: [-1  0]
    
    • 得到了单一簇和噪声点
  • 查看数据点的情况

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    dbscan = DBSCAN(min_samples=3, eps=15)
    labels = dbscan.fit_predict(X_pca)
    
    # 计算所有簇中的点数和噪声中的点数
    # bincount不允许负值,所以我们需要加1
    # 结果中的第一个数字对应于噪声点
    print("Number of points per cluster: {}".format(np.bincount(labels + 1)))
    # Number of points per cluster: [  37 2026]
    
  • 查看所有的噪声点

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    dbscan = DBSCAN(min_samples=3, eps=15)
    labels = dbscan.fit_predict(X_pca)
    
    noise = X_people[labels == -1]
    
    fig, axes = plt.subplots(3, 9, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4))
    for image, ax in zip(noise, axes.ravel()):
        ax.imshow(image.reshape(image_shape))
    
    plt.tight_layout()
    plt.show()
    

    人脸数据集中被DBSCAN标记为噪声的样本

  • 异常值检测:尝试找出数据集中与其他数据都不匹配的那些数据

  • eps不同取值对应的结果

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    for eps in [1, 3, 5, 7, 9, 11, 13]:
        print("\neps={}".format(eps))
        dbscan = DBSCAN(eps=eps, min_samples=3)
        labels = dbscan.fit_predict(X_pca)
        print("Clusters present: {}".format(np.unique(labels)))
        print("Cluster sizes: {}".format(np.bincount(labels + 1)))
    # eps=1
    # Clusters present: [-1]
    # Cluster sizes: [2063]
    # 
    # eps=3
    # Clusters present: [-1]
    # Cluster sizes: [2063]
    # 
    # eps=5
    # Clusters present: [-1  0]
    # Cluster sizes: [2059    4]
    # 
    # eps=7
    # Clusters present: [-1  0  1  2  3  4  5  6]
    # Cluster sizes: [1954   75    4   14    6    4    3    3]
    # 
    # eps=9
    # Clusters present: [-1  0  1]
    # Cluster sizes: [1199  861    3]
    # 
    # eps=11
    # Clusters present: [-1  0]
    # Cluster sizes: [ 403 1660]
    # 
    # eps=13
    # Clusters present: [-1  0]
    # Cluster sizes: [ 119 1944]
    
  • 打印eps=7时7个簇中的图像

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    dbscan = DBSCAN(min_samples=3, eps=7)
    labels = dbscan.fit_predict(X_pca)
    
    for cluster in range(max(labels) + 1):
        mask = labels == cluster
        n_images = np.sum(mask)
        fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4), subplot_kw={'xticks': (), 'yticks': ()})
        
        for image, label, ax in zip(X_people[mask], y_people[mask], axes):
            ax.imshow(image.reshape(image_shape))
            ax.set_title(people.target_names[label].split()[-1])
    
        plt.tight_layout()
        plt.show()
    
用k均值分析人脸数据集
  • 提取簇

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    # 用k均值提取簇
    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    
    print("Cluster sizes k-means: {}".format(np.bincount(labels_km)))
    # Cluster sizes k-means: [ 70 198 139 109 196 351 207 424 180 189]
    
    • The cluster sizes are fairly even (from 70 to 424), much more balanced than the all-or-nothing split found by DBSCAN
  • 可视化

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    
    fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4))
    for center, ax in zip(km.cluster_centers_, axes.ravel()):
        ax.imshow(pca.inverse_transform(center).reshape(image_shape))
    
    plt.tight_layout()
    plt.show()
    

    将簇的数量设置为10时,k均值找到的簇中心

  • 绘制每个簇中心最典型和最不典型各5个图像

    import mglearn
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    km = KMeans(n_clusters=10, random_state=0)
    km.fit_predict(X_pca)
    
    mglearn.plots.plot_kmeans_faces(km, pca, X_pca, X_people, y_people, people.target_names)
    
    plt.tight_layout()
    plt.show()
    

    k均值为每个簇找到的样本图像
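
    As the cluster-center images above suggest, k-means can also be viewed as a decomposition method in which each face is represented by the center of its cluster (vector quantization). The following is a minimal sketch of that view under the same setup as above; showing the first five faces next to their cluster-center approximations is an arbitrary choice for illustration.

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask] / 255.
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    X_pca = pca.fit_transform(X_people)
    
    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    
    # vector quantization: replace each face by its cluster center in PCA space,
    # then map that center back to pixel space with the PCA inverse transform
    X_reconstructed = pca.inverse_transform(km.cluster_centers_[labels_km])
    
    # compare a few original faces with their cluster-center approximations
    fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 5))
    for i, (ax_top, ax_bottom) in enumerate(axes.T):
        ax_top.imshow(X_people[i].reshape(image_shape))
        ax_bottom.imshow(X_reconstructed[i].reshape(image_shape))
    axes[0, 0].set_ylabel("original")
    axes[1, 0].set_ylabel("k-means")
    
    plt.tight_layout()
    plt.show()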

用凝聚聚类分析人脸数据集
  • 提取簇

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    # 用ward凝聚聚类提取簇
    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    
    print("Cluster sizes agglomerative clustering: {}".format(np.bincount(labels_agg)))
    # Cluster sizes agglomerative clustering: [264 100 275 553  49  64 546  52  51 109]
    
  • 计算ARI来度量凝聚聚类与k均值给出的两种数据划分是否相似

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    from sklearn.metrics import adjusted_rand_score
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    
    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    
    print("ARI: {:.3f}".format(adjusted_rand_score(labels_agg, labels_km)))
    # ARI: 0.088
    
  • 绘制树状图

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.cluster.hierarchy import ward, dendrogram
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    
    linkage_array = ward(X_pca)
    
    # 现在我们为包含簇之间距离的linkage array绘制树状图
    plt.figure(figsize=(20, 5))
    dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
    
    plt.xlabel("Sample index")
    plt.ylabel("Cluster distance")
    
    plt.tight_layout()
    plt.show()
    

    凝聚聚类在人脸数据集上的树状图

  • 将10个簇可视化

    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
    mask = np.zeros(people.target.shape, dtype=np.bool_)
    
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    
    X_people = people.data[mask]
    y_people = people.target[mask]
    X_people = X_people / 255.
    
    image_shape = people.images[0].shape
    
    pca = PCA(n_components=100, whiten=True, random_state=0)
    pca.fit_transform(X_people)
    
    X_pca = pca.transform(X_people)
    
    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    
    n_clusters = 10
    for cluster in range(n_clusters):
        mask = labels_agg == cluster
        fig, axes = plt.subplots(1, 10, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(15, 8))
        axes[0].set_ylabel(np.sum(mask))
        for image, label, _, ax in zip(X_people[mask], y_people[mask], labels_agg[mask], axes):
            ax.imshow(image.reshape(image_shape))
            ax.set_title(people.target_names[label].split()[-1], fontdict={'fontsize': 9})
        plt.tight_layout()
        plt.show()
    

5.5 Summary of clustering methods

  • Applying and evaluating clustering is a highly qualitative process, and it is often most useful in the exploratory phase of data analysis
  • The three clustering algorithms
    • k-means
      • Allows you to specify the desired number of clusters
      • Represents each cluster by the mean of its members
      • Can also be viewed as a decomposition method, in which each data point is represented by its cluster center
    • DBSCAN
      • Lets you define "closeness" with the eps parameter, which indirectly controls the size of the clusters
      • Can detect "noise points" that are not assigned to any cluster
      • Can help determine the number of clusters automatically
      • Allows clusters to have complex shapes
    • Agglomerative clustering
      • Allows you to specify the desired number of clusters
      • Provides a full hierarchy of possible partitions of the data
      • The hierarchy can be inspected easily with a dendrogram
  • All three algorithms let you control the granularity of the clustering
  • All three methods can be used on large, real-world datasets, are relatively easy to understand, and allow clustering into many clusters (their differences are illustrated in the short sketch after this list)
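
  • A minimal comparison of these trade-offs (a sketch on the synthetic two_moons data rather than the face data; the eps value and the use of StandardScaler here are assumptions for this illustration, not taken from the analysis above)

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score
    from sklearn.preprocessing import StandardScaler
    
    # two interleaved half-moons: two clusters with a complex, non-convex shape
    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)
    
    # k-means and agglomerative clustering both need the number of clusters up front
    labels_km = KMeans(n_clusters=2, random_state=0).fit_predict(X_scaled)
    labels_agg = AgglomerativeClustering(n_clusters=2).fit_predict(X_scaled)
    
    # DBSCAN determines the number of clusters itself via eps / min_samples
    # and may additionally label some points as noise (-1)
    labels_db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
    
    for name, labels in [("k-means", labels_km),
                         ("agglomerative", labels_agg),
                         ("DBSCAN", labels_db)]:
        n_clusters_found = np.unique(labels[labels >= 0]).size
        print("{}: {} clusters, ARI vs. true moons = {:.2f}".format(
            name, n_clusters_found, adjusted_rand_score(y, labels)))

    On this kind of data, DBSCAN can typically recover the two half-moon shapes on its own, while k-means and agglomerative clustering (given n_clusters=2) tend to cut each moon in half; which behavior is preferable still has to be judged against the goal of the analysis.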