[Study Notes] Andrew Ng's Machine Learning | Chapter 11 | Unsupervised Learning

Latest recommended article published on 2024-07-13 20:38:58

Benjamin Chen.

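Category columns: [Study Notes] Andrew Ng's Machine Learning; Study Notes. Article tags: learning, machine learning, artificial intelligence, k-means, clustering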

Article link: https://blog.csdn.net/jermy00/article/details/131753216

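Brief Statement

Course-related links
The course is taught in English, so these notes record its content in English, with brief explanations alongside (originally written in Chinese).
These notes exist purely to help me understand the material more deeply. If anything here is wrong, please bear with me and point it out.
Many thanks to Professor Andrew Ng for his generous teaching!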

Contents

Brief Statement
Terminology
Unsupervised learning
  - Unsupervised Learning
  - Applications
K-means algorithm
  - K-means algorithm
Optimization objective
  - K-means optimization objective
Random initialization
Choosing the number of clusters
  - Elbow method
  - Choosing the value of K
Quotes from Professor Andrew Ng

Terminology

Clustering: 聚类
K-means: K均值
cluster centroids: 聚类中心
distortion function: 失真函数

Unsupervised learning

Unsupervised Learning

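Unsupervised learning works on data that has no labels.
A clustering algorithm groups such unlabeled data into distinct clusters.
Clustering is one kind of unsupervised learning algorithm, and the first one covered in the course.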

Applications

Organize computing clusters (getting large groups of machines to work together efficiently)
Social network analysis
Market segmentation
Astronomical data analysis

K‐means algorithm

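Randomly generate two points, called cluster centroids. Two are used in this example because the data is to be grouped into two clusters.
Cluster assignment: assign each data point to whichever cluster centroid is closest to it.
Move centroid: move each cluster centroid to the mean of the points assigned to it.
Input
1. K (number of clusters): the number of clusters to find in the data
2. Training set { x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾ }: a set of unlabeled training examples
3. x⁽ⁱ⁾ ∈ Rⁿ: each x⁽ⁱ⁾ is an n-dimensional real vector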

K-means algorithm

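Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ Rⁿ

Repeat {

    % cluster assignment step

    for i = 1 to m

        c⁽ⁱ⁾ := index (from 1 to K) of cluster centroid closest to x⁽ⁱ⁾, i.e. c⁽ⁱ⁾ := argmin over k of || x⁽ⁱ⁾ - μ_k ||²

    % move centroid step

    for k = 1 to K

        μ_k := average (mean) of points assigned to cluster k

}

If a cluster centroid ends up with no points assigned to it, the usual choice is to remove that centroid, which leaves K-1 clusters. If exactly K clusters are required, the centroid can be randomly reinitialized instead.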
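
The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the course's reference code: the function names (`sq_dist`, `assign_clusters`, `move_centroids`, `kmeans`) and the toy data are assumptions made here, cluster indices run from 0 to K-1 rather than 1 to K, and an empty cluster simply keeps its old centroid instead of being removed.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign_clusters(points, centroids):
    """Cluster assignment step: c[i] = index of the centroid closest to points[i]."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
            for x in points]

def move_centroids(points, labels, centroids):
    """Move centroid step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, c in zip(points, labels) if c == k]
        if not members:
            # Simplification (assumption): keep the old centroid rather than
            # removing it or reinitializing it.
            new_centroids.append(old)
            continue
        n = len(members)
        new_centroids.append(tuple(sum(col) / n for col in zip(*members)))
    return new_centroids

def kmeans(points, k, rng, max_iters=100):
    """Run K-means until the assignments stop changing (or max_iters is hit)."""
    centroids = rng.sample(points, k)  # initialize from K distinct training examples
    labels = None
    for _ in range(max_iters):
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        centroids = move_centroids(points, labels, centroids)
    return labels, centroids

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2, rng=random.Random(0))
print(labels)
print(centroids)
```

Stopping when the assignments no longer change is one common convergence test; a fixed iteration count or a threshold on centroid movement would work as well.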

Optimization objective

K-means optimization objective

c⁽ⁱ⁾ = index of cluster (1, 2, …, K) to which example x⁽ⁱ⁾ is currently assigned
μ_k = cluster centroid k, i.e. the position of the k-th cluster centroid
μ_(c⁽ⁱ⁾) = cluster centroid of cluster to which example x⁽ⁱ⁾ has been assigned
Optimization objective: the average squared distance between each example x⁽ⁱ⁾ and the centroid of the cluster it is assigned to. This cost function J is called the distortion.
Cluster assignment step: choose c⁽¹⁾, …, c⁽ᵐ⁾ to minimize the distortion while holding the centroids μ_1, …, μ_K fixed.
Move centroid step: choose μ_1, …, μ_K to minimize the distortion while holding the assignments c⁽¹⁾, …, c⁽ᵐ⁾ fixed.

$J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)=\frac{1}{m}\sum_{i=1}^m||x^{(i)}-\mu_{c^{(i)}}||^2$

$\min \limits_{c^{(1)},\cdots,c^{(m)}, \mu_1,\cdots,\mu_K} J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$
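
As a small worked example of the distortion, the sketch below evaluates J after every assignment step and every move step on three hand-picked points. The helper names (`distortion`, `assign`, `move`) and the data are illustrative assumptions, and cluster indices are 0-based. Because each step minimizes J over one group of variables with the other held fixed, the recorded values should never go up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion(points, labels, centroids):
    """J = (1/m) * sum_i ||x_i - mu_{c_i}||^2 (0-based cluster indices)."""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(points, labels)) / len(points)

def assign(points, centroids):
    """Choose each c_i to minimize J with the centroids held fixed."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
            for x in points]

def move(points, labels, k):
    """Choose each mu_k to minimize J with the assignments held fixed.
    Assumes every cluster has at least one point (true for this example)."""
    centroids = []
    for j in range(k):
        members = [x for x, c in zip(points, labels) if c == j]
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (2.0, 0.0)]  # deliberately poor starting centroids

history = []
labels = assign(points, centroids)
history.append(distortion(points, labels, centroids))
for _ in range(2):
    centroids = move(points, labels, 2)
    history.append(distortion(points, labels, centroids))
    labels = assign(points, centroids)
    history.append(distortion(points, labels, centroids))

print(history)  # J after each step; the sequence is non-increasing
```

Working it by hand, J starts at 64/3, drops to 32/3 after the first move step, to 20/3 after reassignment, and settles at 2/3 once the centroids reach (1, 0) and (10, 0).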

Random initialization

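K < m: the number of cluster centroids K should be smaller than the number of training examples m.
Randomly pick K training examples.
Set μ_1, …, μ_K equal to these K examples.
Different random initializations can lead K-means to different solutions, and an unlucky initialization can leave it stuck in a poor local optimum.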

For i = 1 to 100 { % typical number of runs: 50 to 1000

    Randomly initialize K-means

    Run K-means. Get c⁽¹⁾, …, c⁽ᵐ⁾, μ_1, μ_2, …, μ_K (the cluster assignments and the cluster centroids)

    Compute cost function (distortion) J(c⁽¹⁾, …, c⁽ᵐ⁾, μ_1, μ_2, …, μ_K)

}

Pick the clustering that gave the lowest cost J(c⁽¹⁾, …, c⁽ᵐ⁾, μ_1, μ_2, …, μ_K). When the number of clusters is small (K = 2 to 10), running multiple random initializations usually finds a better local optimum. When K is very large, multiple random initializations tend to bring little improvement.
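
A minimal sketch of this restart procedure is shown below. The names `run_kmeans` and `best_of_restarts`, the seed, and the toy data are assumptions for illustration. The data has three tight pairs of points, so a run that starts with two centroids in the same pair can converge to a noticeably worse local optimum than a run that starts with one centroid per pair.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_kmeans(points, k, rng, max_iters=100):
    """One K-means run from a random initialization.
    Returns (distortion J, labels, centroids); cluster indices are 0-based."""
    centroids = rng.sample(points, k)  # K distinct training examples
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                      for x in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, c in zip(points, labels) if c == j]
            if members:  # an empty cluster keeps its old centroid (simplification)
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(x, centroids[c]) for x, c in zip(points, labels)) / len(points)
    return cost, labels, centroids

def best_of_restarts(points, k, n_runs=100, seed=0):
    """Run K-means n_runs times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    runs = [run_kmeans(points, k, rng) for _ in range(n_runs)]
    best = min(runs, key=lambda run: run[0])
    return best, [run[0] for run in runs]

# Three tight pairs of points along a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0),
          (11.0, 0.0), (20.0, 0.0), (21.0, 0.0)]
(best_cost, best_labels, best_centroids), all_costs = best_of_restarts(points, k=3)
print("best J:", best_cost)
print("distinct J values seen:", sorted(set(all_costs)))
```

Printing the distinct costs makes the local optima visible: any value above the minimum comes from a run that got stuck.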

Choosing the number of clusters

Elbow method

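The horizontal axis is the number of clusters K; the vertical axis is the cost function J.
As the number of clusters increases, the distortion decreases.
Elbow method: the curve has an "elbow", with J dropping quickly before it and slowly after it.
Choose the value of K at the elbow of the curve as the number of clusters.
The elbow method is not used that often: on real clustering problems the curve is frequently ambiguous, with no clear elbow.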
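
To make the elbow curve concrete, the sketch below computes the best distortion found for each K from 1 to 6 on a toy data set. The function names (`run_kmeans`, `elbow_curve`), the number of restarts, and the data are assumptions for illustration; producing an actual plot is left out so the example needs only the standard library.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_kmeans(points, k, rng, max_iters=100):
    """One K-means run from a random initialization; returns its distortion J."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                      for x in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, c in zip(points, labels) if c == j]
            if members:  # an empty cluster keeps its old centroid (simplification)
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(x, centroids[c]) for x, c in zip(points, labels)) / len(points)

def elbow_curve(points, k_values, n_runs=50, seed=0):
    """For each K, keep the lowest J over n_runs random initializations."""
    rng = random.Random(seed)
    return {k: min(run_kmeans(points, k, rng) for _ in range(n_runs))
            for k in k_values}

# Three tight pairs of points, so the natural number of clusters is 3.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0),
          (11.0, 0.0), (20.0, 0.0), (21.0, 0.0)]
curve = elbow_curve(points, k_values=range(1, 7))
for k, cost in curve.items():
    print(f"K={k}  J={cost:.4f}")
```

On this toy data J falls steeply up to K = 3 and only slightly afterwards, which is the elbow shape described above. Real data sets often give a much smoother curve.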

Choosing the value of K

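K-means is usually run to obtain clusters for some later purpose, so a good way to choose K is to see which number of clusters serves that downstream purpose best.
Most of the time, the number of clusters K is still chosen by hand or from experience.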

Quotes from Professor Andrew Ng

“This will be exciting because this is our first unsupervised learning algorithm where we learn from unlabeled data instead of the label data.”
“The K Means algorithm is by far the most popular, by far the most widely used clustering algorithm.”
“But I do get asked this question quite a lot of how do you choose the number of clusters and so I just want to tell you know what are peoples’ current thinking on it, although the most common thing is actually to choose the number of clusters by hand.”