1.1 Introduction
- Unsupervised Learning
  1) Clustering
  2) Anomaly Detection
- Recommender System
- Reinforcement Learning
2.1 Clustering
What is clustering: the input dataset contains only features x, with no labels y; the algorithm has to find structure in the data on its own.
Applications: grouping similar news articles, market segmentation, DNA analysis, astronomical data analysis.
2.2 K-means intuition
First, K-means takes a random guess at where the centers of the clusters are; these centers are called cluster centroids.
Step 1: go through every data point, determine which cluster centroid it is closest to, and assign the point to that centroid.
Step 2: for each of the K groups of assigned points, compute the mean of all the points in the group and move that group's centroid to the mean; do this for every group. In other words, recompute the centroids.
One iteration is now complete, producing a new set of cluster centroids. Repeat steps 1 and 2 on all the data: look at each point, assign it to the nearest cluster centroid, then move each centroid to the mean of all the points assigned to it. Stop when the assignments and centroid positions no longer change (the algorithm has converged).
2.3 The K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$
Repeat {
    # Assign points to cluster centroids
    for i = 1 to m:
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    # Move cluster centroids
    for k = 1 to K:
        μk := average (mean) of points assigned to cluster k
}
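Below is a minimal NumPy sketch of this loop. The function name run_kmeans, the fixed iteration count, and the empty-cluster re-seeding are my own choices, not part of the course material.

import numpy as np

def run_kmeans(X, K, num_iters=10, seed=None):
    """Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters.

    Returns (centroids, c), where c[i] is the index of the centroid
    closest to X[i].
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Random initialization: pick K distinct training examples as centroids
    centroids = X[rng.choice(m, size=K, replace=False)].astype(float)
    for _ in range(num_iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)  # c[i] is in {0, ..., K-1}
        # Step 2: move each centroid to the mean of the points assigned to it
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
            else:
                # Re-seed an empty cluster with a random training example
                centroids[k] = X[rng.integers(m)]
    return centroids, c

For example, run_kmeans(np.random.randn(100, 2), K=3) returns 3 centroids and an assignment for each of the 100 points.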
2.4 Optimization objective
Notation:
- $c^{(i)}$ ($i = 1, \ldots, m$): index of the cluster (from 1 to $K$) to which example $x^{(i)}$ is currently assigned
- $\mu_k$: cluster centroid $k$
- $\mu_{c^{(i)}}$: cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Cost function (also called the distortion function):

$$J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

The optimization objective is $\min\limits_{c^{(1)}, \ldots, c^{(m)},\, \mu_1, \ldots, \mu_K} J$.
The cost function should go down (or stay the same) on every iteration; if it ever increases, there is a bug in the implementation. How quickly J is decreasing can also be used to decide whether the algorithm has converged.
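A one-line check of this, following the conventions of the run_kmeans sketch above (the name distortion_cost is mine):

import numpy as np

def distortion_cost(X, c, centroids):
    """Average squared distance from each point to its assigned centroid."""
    return float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))

Calling this after every iteration and asserting that it never increases is a cheap way to catch the bug described above.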
2.5 Initializing K-means
- Choose $K < m$ (the number of clusters must be less than the number of training examples).
- Randomly pick $K$ training examples as the initial cluster centroids.
- Set $\mu_1, \mu_2, \ldots, \mu_K$ to these $K$ examples.
Random initialization, implemented with multiple random restarts:
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c(1), …, c(m), μ1, …, μK.
    Compute the cost function (distortion) J.
}
Pick the set of clusters that gave the lowest cost J.
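A sketch of that loop in Python, reusing the hypothetical run_kmeans and distortion_cost helpers from the earlier sketches:

import numpy as np

def kmeans_best_of(X, K, num_trials=100):
    """Run K-means num_trials times from random initializations and
    keep the result with the lowest distortion J."""
    best_J, best = np.inf, None
    for _ in range(num_trials):
        centroids, c = run_kmeans(X, K)       # random init happens inside
        J = distortion_cost(X, c, centroids)
        if J < best_J:                        # keep the lowest-distortion run
            best_J, best = J, (centroids, c)
    return best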
2.6 Choosing the number of clusters
Method 1, the elbow method: plot the cost J as a function of the number of clusters K, and pick the "elbow", the point where the rate at which J decreases changes sharply (a plotting sketch follows below).
Method 2: decide based on the specific problem. Often you run K-means to get clusters for some later, downstream purpose; evaluate K-means based on a metric for how well it performs for that downstream purpose.
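For the elbow method, a plotting sketch, assuming matplotlib and the kmeans_best_of / distortion_cost helpers sketched earlier:

import numpy as np
import matplotlib.pyplot as plt

X = np.random.randn(200, 2)   # example data; replace with your own (m, n) matrix

Ks = list(range(1, 9))
Js = []
for K in Ks:
    centroids, c = kmeans_best_of(X, K)   # helper sketched in 2.5
    Js.append(distortion_cost(X, c, centroids))

plt.plot(Ks, Js, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("J (distortion)")
plt.show()

The "elbow" is the value of K after which the curve flattens out.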