Python: Clustering

Author: a useful man

Category: Data Science

Article link: https://blog.csdn.net/sinat_23971513/article/details/105248609

Cluster Analysis

1. Introduction to Unsupervised Learning
2. Clustering
3. Similarity or Distance Calculation
4. Clustering as an Optimization Function
5. Types of Clustering Methods
6. Partitioning Clustering - KMeans & Meanshift
7. Hierarchical Clustering - Agglomerative
8. Density Based Clustering - DBSCAN
9. Measuring Performance of Clusters
10. Comparing All Clustering Methods

Introduction to Unsupervised Learning

Unsupervised learning is a type of machine learning that draws inferences from unlabelled datasets.
The model tries to find relationships within the data on its own.
The most common unsupervised learning method is clustering, which is used in exploratory data analysis to find hidden patterns or groupings in data.

Clustering

Clustering is a learning technique that groups a set of objects so that objects in the same group are more similar to each other than to objects in other groups.
Applications of clustering include:
- Automatically organizing data
- Labeling data
- Understanding the hidden structure of data
- News clustering, to group similar news stories together
- Customer segmentation
- Suggesting social groups

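import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
# make_blobs generates multi-class datasets and gives fine control over
# the center and standard deviation of each cluster
from sklearn.datasets import make_blobs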

Generating natural clusters

X,y = make_blobs(n_features=2, n_samples=1000, centers=3, cluster_std=1, random_state=3)
plt.scatter(X[:,0], X[:,1], s=5, alpha=.5)

Distance or Similarity Function

Data belonging to the same cluster are similar, and data belonging to different clusters are different.
We need a mechanism to measure similarity and difference between data points.
This can be done with any of the techniques below.

- Minkowski family of distances: Manhattan (p = 1), Euclidean (p = 2)
- Cosine: suited to text data

from sklearn.metrics.pairwise import euclidean_distances,cosine_distances,manhattan_distances
X = [[0, 1], [1, 1]]
# Pairwise Euclidean distances between the rows of X
print(euclidean_distances(X, X))
# Euclidean distance from each row of X to the origin
print(euclidean_distances(X, [[0,0]]))
# Pairwise Manhattan distances between the rows of X
print(manhattan_distances(X,X))

[[0. 1.]
 [1. 0.]]
[[1.        ]
 [1.41421356]]
[[0. 1.]
 [1. 0.]]
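
To make the link between the Minkowski family and the named distances concrete, here is a minimal sketch. The helper `minkowski_distance` and the sample vectors are illustrative assumptions, not part of the original notebook; the scikit-learn functions are the same ones imported above. It also exercises `cosine_distances`, which depends only on the direction of the vectors, a plausible reason it suits text data where document length varies.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    cosine_distances,
    euclidean_distances,
    manhattan_distances,
)


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


# Illustrative vectors chosen so that the two distances differ.
a, b = [1, 2], [4, 6]

# p = 1 reproduces the Manhattan distance, p = 2 the Euclidean distance.
d_manhattan = minkowski_distance(a, b, p=1)
d_euclidean = minkowski_distance(a, b, p=2)
print(d_manhattan, manhattan_distances([a], [b])[0, 0])
print(d_euclidean, euclidean_distances([a], [b])[0, 0])

# Cosine distance is 1 - cosine similarity, so scaling a vector
# (here b multiplied by 2) leaves the distance unchanged.
d_cos = cosine_distances([a], [b])[0, 0]
d_cos_scaled = cosine_distances([a], [[8, 12]])[0, 0]
print(d_cos, d_cos_scaled)
```

Each printed pair should agree, which is a quick check that the hand-written formula matches the library functions.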

Clustering as an Optimization Problem

Maximize inter-cluster distances, so that different clusters lie far apart
Minimize intra-cluster distances, so that points in the same cluster lie close together
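
The two objectives can be put into numbers. The sketch below is only an illustration under stated assumptions: it reuses the `make_blobs` data from earlier, treats the generated labels `y` as the grouping (something a real unsupervised problem would not have), and picks two simple measures of my own choosing, the within-cluster sum of squares and the smallest distance between centroids. A random regrouping of the same points is included for comparison.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, n_samples=1000, centers=3,
                  cluster_std=1, random_state=3)

labels = np.unique(y)
centroids = np.array([X[y == k].mean(axis=0) for k in labels])

# Intra-cluster spread: sum of squared distances from each point to the
# centroid of its own group (smaller is better).
within_ss = sum(
    np.sum((X[y == k] - centroids[i]) ** 2) for i, k in enumerate(labels)
)

# Inter-cluster separation: smallest distance between any two centroids
# (larger is better).
diffs = centroids[:, None, :] - centroids[None, :, :]
centroid_dists = np.sqrt((diffs ** 2).sum(axis=-1))
min_separation = centroid_dists[np.triu_indices(len(labels), k=1)].min()

# A random grouping of the same points, for comparison.
rng = np.random.default_rng(0)
y_random = rng.permutation(y)
random_centroids = np.array([X[y_random == k].mean(axis=0) for k in labels])
within_ss_random = sum(
    np.sum((X[y_random == k] - random_centroids[i]) ** 2)
    for i, k in enumerate(labels)
)

print(f"within-cluster SS (generated groups): {within_ss:.1f}")
print(f"within-cluster SS (random groups):    {within_ss_random:.1f}")
print(f"smallest centroid separation:         {min_separation:.2f}")
```

The generated grouping should show a far smaller within-cluster sum of squares than the random one, which is what a clustering algorithm tries to achieve without seeing the labels.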

Types of Clustering

Partitioning methods
- Partition n data points into k partitions.
- Initially, random partitions are created, and data points are then gradually moved between partitions.
- These methods use the distance between points to optimize the clusters.
- KMeans and Meanshift are examples of partitioning methods.
Hierarchical methods
- These methods perform a hierarchical decomposition of the dataset.
- One approach (agglomerative) treats each data point as its own cluster and repeatedly merges clusters into bigger ones.
- The other approach (divisive) starts with a single cluster and keeps splitting it.
Density-based methods
- The techniques above are all distance based; such methods can find only spherical clusters and are not suited to clusters of other shapes.
- Density-based methods keep growing a cluster as long as the density in its neighborhood exceeds a certain threshold.
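
The claim about non-spherical clusters can be checked on a small example. This is a rough sketch rather than a benchmark: the two-moons dataset, the `eps=0.2` and `min_samples=5` settings, and the use of the adjusted Rand index as a score are all my own illustrative choices, not taken from the original article.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical.
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

# One representative of each family; parameter values are illustrative.
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    predicted = model.fit_predict(X_moons)
    # Adjusted Rand index: 1.0 means the grouping matches the true moons.
    scores[name] = adjusted_rand_score(y_moons, predicted)
    print(f"{name:>14}: ARI = {scores[name]:.2f}")
```

On this data the density-based method should recover the two moons almost exactly, while KMeans cuts across them. Results on other datasets depend heavily on how `eps` and `min_samples` are chosen.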

Partitioning Method

KMeans

Minimization criterion: within-cluster sum of squares.

The centroids are chosen in such a way that it mi
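
As a minimal sketch of the within-cluster sum of squares criterion named above, the snippet below fits scikit-learn's `KMeans` to the blob data generated earlier and recomputes the criterion by hand. The choice of `n_clusters=3`, `n_init=10`, and `random_state=0` is an assumption made for illustration; scikit-learn reports the same quantity as the `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, n_samples=1000, centers=3,
                  cluster_std=1, random_state=3)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Recompute the within-cluster sum of squares by hand: squared distance from
# every point to the centroid of the cluster it was assigned to.
assigned_centroids = kmeans.cluster_centers_[cluster_ids]
wcss = np.sum((X - assigned_centroids) ** 2)

print(f"inertia_ reported by scikit-learn: {kmeans.inertia_:.2f}")
print(f"within-cluster sum of squares:     {wcss:.2f}")
```

The two printed values should match up to floating-point error.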
