Cluster Analysis
- Introduction to Unsupervised Learning
- Clustering
- Similarity or Distance Calculation
- Clustering as an Optimization Function
- Types of Clustering Methods
- Partitioning Clustering - KMeans & Meanshift
- Hierarchical Clustering - Agglomerative
- Density Based Clustering - DBSCAN
- Measuring Performance of Clusters
- Comparing All Clustering Methods
Introduction to Unsupervised Learning
- Unsupervised learning is a type of machine learning that draws inferences from unlabelled datasets.
- The model tries to find relationships within the data.
- The most common unsupervised learning method is clustering, which is used in exploratory data analysis to find hidden patterns or groupings in data.
Clustering
- A learning technique that groups a set of objects so that objects in the same group are more similar to each other than to objects in other groups.
- Applications of clustering include:
  - Automatically organizing data
  - Labeling data
  - Understanding the hidden structure of data
  - News clustering, to group similar news items together
  - Customer segmentation
  - Suggesting social groups
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# make_blobs generates a multi-class dataset with control over each cluster's center and standard deviation
from sklearn.datasets import make_blobs
- Generating natural clusters
X,y = make_blobs(n_features=2, n_samples=1000, centers=3, cluster_std=1, random_state=3)
plt.scatter(X[:,0], X[:,1], s=5, alpha=.5)
Distance or Similarity Function
- Data points in the same cluster are similar to each other, while data points in different clusters are dissimilar.
- We need a mechanism to measure similarity and difference between data points.
- Any of the following measures can be used:
  - Minkowski family of distances: Manhattan (p=1), Euclidean (p=2)
  - Cosine: suited to text data
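For reference, the measures named above can be written out as follows. The symbols x, y, n and p are notation assumed here for illustration; the cosine form is the "one minus cosine similarity" definition that scikit-learn's `cosine_distances` uses.

```latex
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p},
\qquad
d_1(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert \ \text{(Manhattan)},
\qquad
d_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \ \text{(Euclidean)}

d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2}
```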
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances, manhattan_distances
X = [[0, 1], [1, 1]]
print(euclidean_distances(X, X))         # pairwise Euclidean distances within X
print(euclidean_distances(X, [[0, 0]]))  # Euclidean distance of each row of X from the origin
print(manhattan_distances(X, X))         # pairwise Manhattan distances within X
[[0. 1.]
 [1. 0.]]
[[1.        ]
 [1.41421356]]
[[0. 1.]
 [1. 0.]]
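To make the distance measures less of a black box, here is a minimal NumPy sketch that computes them by hand. The helper names `minkowski_distance` and `cosine_distance` are hypothetical and written for illustration; they are not scikit-learn functions, and the cosine helper assumes neither vector is all zeros.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

origin, point = [0, 0], [3, 4]
print("Manhattan (p=1):", minkowski_distance(origin, point, p=1))
print("Euclidean (p=2):", minkowski_distance(origin, point, p=2))

# Two vectors pointing in the same direction but with different lengths,
# e.g. word counts of a short and a long document on the same topic.
short_doc, long_doc = [1, 2], [2, 4]
print("Cosine:", cosine_distance(short_doc, long_doc))
print("Euclidean:", minkowski_distance(short_doc, long_doc, p=2))
```

The last two lines hint at why cosine distance is often preferred for text: it depends only on the direction of the vectors, so a long and a short document with the same word proportions come out as (almost exactly) identical, while their Euclidean distance is not zero.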
Clustering as an Optimization Problem
- Maximize inter-cluster distances (distances between different clusters)
- Minimize intra-cluster distances (distances between points within the same cluster)
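The two objectives can be made concrete with a small sketch. It regenerates the synthetic blobs used earlier and compares the grouping that generated the data against an arbitrary random grouping. The helpers `intra_cluster_ss` and `min_inter_centroid_distance` are hypothetical names chosen for this illustration, and using centroid distances as the inter-cluster measure is an assumption; other definitions are possible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

# Same synthetic data as earlier; y holds the generating cluster of each point.
X, y = make_blobs(n_features=2, n_samples=1000, centers=3, cluster_std=1, random_state=3)

def intra_cluster_ss(X, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return float(total)

def min_inter_centroid_distance(X, labels):
    """Smallest Euclidean distance between any two cluster centroids."""
    centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    d = euclidean_distances(centroids)
    return float(d[np.triu_indices_from(d, k=1)].min())

# An arbitrary assignment of points to three groups, for contrast.
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X))

print("generating grouping:", intra_cluster_ss(X, y), min_inter_centroid_distance(X, y))
print("random grouping    :", intra_cluster_ss(X, random_labels),
      min_inter_centroid_distance(X, random_labels))
```

Under the optimization view, a good grouping should show a smaller within-cluster sum and a larger separation between centroids than a poor one, which is what this comparison is meant to illustrate.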
Types of Clustering
- Partitioning methods
  - Partition n data points into k partitions.
  - Initially, random partitions are created; data points are then gradually moved between partitions.
  - The distance between points is used to optimize the clusters.
  - KMeans and Meanshift are examples of partitioning methods.
- Hierarchical methods
  - These methods perform a hierarchical decomposition of the dataset.
  - One approach (agglomerative) treats each data point as its own cluster and repeatedly merges clusters into bigger ones.
  - The other approach (divisive) starts with one cluster containing all the data and keeps splitting it.
- Density-based methods
  - The techniques above are distance based; such methods tend to find only spherical clusters and are not suited to clusters of other shapes.
  - Density-based methods keep growing a cluster as long as the density in the neighborhood exceeds a certain threshold.
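As a rough illustration of the three families, the sketch below runs one scikit-learn estimator from each on a two-crescent dataset, where the clusters are clearly not spherical. The dataset (`make_moons`), the parameter values (`eps=0.2`, `min_samples=5`, `n_clusters=2`) and the use of the adjusted Rand index as a score are all choices made for this example, not taken from the text above.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: a standard example of non-spherical clusters.
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

models = {
    "KMeans (partitioning)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=2),
    "DBSCAN (density-based)": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_moons)
    # Adjusted Rand index: 1.0 means perfect agreement with the generating labels.
    scores[name] = adjusted_rand_score(y_moons, labels)
    print(f"{name}: adjusted Rand index = {scores[name]:.2f}")
```

On data like this, the density-based method would typically be expected to follow the crescent shapes more closely than KMeans, though the outcome depends heavily on `eps` and `min_samples`.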
Partitioning Method
KMeans
- Minimization criterion: within-cluster sum of squares.
-
The centroids are chosen in such a way that it mi