Machine Learning: Clustering

只想安静的一个人

Category: Machine Learning

Article link: https://blog.csdn.net/u014258362/article/details/80985373


### Concept
For unlabeled data, the first thing we can do is look for samples that share similar features and assign them to the same group.

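#### K-means

Concept

The K-means algorithm tries to partition the given data into K disjoint groups, or clusters. Each cluster is represented by the mean of all its members (its centroid).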
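To make "the mean of all its members" concrete, here is a minimal sketch in plain Python. The helper name `centroid` and the sample points are illustrative assumptions, not part of the original listing.

```python
def centroid(members):
    """Return the coordinate-wise mean of a non-empty list of points."""
    dims = len(members[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dims))


# Three 2-D points that are assumed to belong to the same cluster.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(centroid(cluster))  # (3.0, 2.0)
```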

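Algorithm breakdown

1. For the unclassified samples, randomly pick K elements as the initial centroids.
2. Compute the distance between each sample and every centroid, assign the sample to the cluster of its nearest centroid, and recompute each centroid from the samples assigned to it.
3. Once the centroids move, the distances change, so the samples have to be reassigned.
4. Repeat steps 2 and 3 until a stopping condition is met.

Stopping conditions:
1. A chosen, sufficiently large number of iterations N has been reached.
2. No element moves from one cluster to another.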
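The steps and both stopping conditions can be sketched without any framework. The following is a minimal, dependency-free illustration, not the TensorFlow implementation used in this article. The function names, the `seed` parameter, and the choice to keep the old centroid when a cluster becomes empty are assumptions made for this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random initial centroids
    assignments = [-1] * len(points)
    for _ in range(max_iters):                   # stopping condition 1
        # Step 2 (first half): assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:       # stopping condition 2
            break
        assignments = new_assignments
        # Step 2 (second half): recompute each centroid from its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

When the loop ends because no assignment changed, every point is assigned to the centroid nearest to it. The result can still depend on the initial centroids.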

Diagram

Code implementation

__author__ = 'ding'
'''
Clustering: the K-means algorithm on a synthetic dataset.
Written against the TensorFlow 1.x graph API (tf.Session).
'''
import tensorflow as tf
import numpy as np
import time

import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.datasets import make_circles

DATA_TYPE = 'blobs'

if DATA_TYPE == 'circle':
    K = 2
else:
    K = 4

MAX_ITERS = 1000
start = time.time()

centers = [(-2, -2), (-2, 1.5), (1.5, -2), (2, 1.5)]

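# n_samples is the total number of samples to generate.
# n_features is the number of features per sample.
# centers gives the cluster centers (here, a list of fixed coordinates).
# noise is the standard deviation of the Gaussian noise added by make_circles.
# cluster_std is the standard deviation of each cluster; for example, to generate two
# classes where one is more spread out than the other, set cluster_std to [1.0, 3.0].
# The second return value (named features here) holds the true group label of each sample.
if DATA_TYPE == 'circle':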
    data, features = make_circles(n_samples=200, shuffle=True, noise=0.01, factor=0.4)
else:
    data, features = make_blobs(n_samples=200, centers=centers, n_features=2, cluster_std=0.8, shuffle=False,
                                random_state=42)

fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker='o', s=50)
plt.show()

fig, ax = plt.subplots()
ax.scatter(np.asarray(data).transpose()[0], np.asarray(data).transpose()[1], marker='o', s=250)
ax.scatter(data.transpose()[0], data.transpose()[1], marker='o', s=100, c=features, cmap=plt.cm.coolwarm)
plt.show()

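N = 200  # number of samples; must match n_samples above

points = tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
# The first K points serve as the initial centroids (not a random pick).
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, 2]))

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(centroids)

# Replicate centroids and points to shape [N, K, 2] so that every point
# can be compared with every centroid.
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
# Squared Euclidean distance from each point to each centroid, shape [N, K].
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids), axis=2)
# Index of the nearest centroid for each point.
best_centroids = tf.argmin(sum_squares, 1)
# True if at least one point would switch to a different cluster.
did_assignment_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))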


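def bucket_mean(data, bucket_ids, num_buckets):
    # Sum the rows of data that fall into each bucket, then divide by the bucket size.
    # A bucket with no members gives 0 / 0 here.
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count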


means = bucket_mean(points, best_centroids, K)
with tf.control_dependencies([did_assignment_change]):
    do_updates = tf.group(centroids.assign(means), cluster_assignments.assign(best_centroids))

changed = True
iters = 0

if DATA_TYPE == 'blobs':
    colourindexes = [2, 1, 4, 3]
else:
    colourindexes = [2, 1]

# Iterate until no assignment changes or MAX_ITERS is reached.
while changed and iters < MAX_ITERS:
    fig, ax = plt.subplots()
    iters += 1
    [changed, _] = sess.run([did_assignment_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])
    # Samples coloured by their current cluster, centroids drawn as triangles.
    ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker='o', s=200, c=assignments,
               cmap=plt.cm.coolwarm)
    ax.scatter(centers[:, 0], centers[:, 1], marker='^', s=550, c=colourindexes,
               cmap=plt.cm.plasma)
    ax.set_title('Iteration ' + str(iters))
    plt.savefig('Kmeans' + str(iters) + '.png')

# Report the result once, after the loop has finished.
plt.show()
end = time.time()
print(('Found in %.2f seconds' % (end - start)), iters, "iterations")
print('Centroids:')
print(centers)
print('Cluster assignments :', assignments)
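The `bucket_mean` helper in the listing computes a per-cluster sum and a per-cluster count, then divides one by the other. The same idea can be sketched in plain Python. This is an illustrative equivalent rather than the TensorFlow code itself, and returning `None` for an empty bucket is an assumption made here to avoid the 0 / 0 division.

```python
def bucket_mean(points, bucket_ids, num_buckets):
    """Mean of the points in each bucket; None for a bucket with no members."""
    dims = len(points[0])
    totals = [[0.0] * dims for _ in range(num_buckets)]
    counts = [0] * num_buckets
    for point, bucket in zip(points, bucket_ids):
        counts[bucket] += 1
        for d in range(dims):
            totals[bucket][d] += point[d]
    # Divide each per-bucket sum by the bucket size, skipping empty buckets.
    return [
        [t / counts[b] for t in totals[b]] if counts[b] else None
        for b in range(num_buckets)
    ]


pts = [(0.0, 0.0), (2.0, 2.0), (10.0, 0.0)]
print(bucket_mean(pts, [1, 1, 0], 3))  # [[10.0, 0.0], [1.0, 1.0], None]
```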

Output

![K-means output](https://img-blog.csdn.net/20180710130614641?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![K-means output](https://img-blog.csdn.net/20180710130626254?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

...

![K-means output](https://img-blog.csdn.net/20180710130648481?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

- K-means pros and cons

  Pros:
  1. Scales well.
  2. Widely applicable.

  Cons:
  1. It needs prior knowledge: the likely number of clusters has to be known in advance.
  2. Outliers distort the centroids, because the algorithm has no way to discard them.
  3. It does not work particularly well on clusters that are not roughly circular.
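The sensitivity to outliers follows directly from the centroid being a mean. The short sketch below uses made-up points and a hypothetical helper `mean_point` to show a single distant sample dragging the centroid away from the rest of the cluster.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


# Four points forming a tight square, plus one distant outlier (illustrative values).
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (21.5, 21.5)

clean = mean_point(cluster)
shifted = mean_point(cluster + [outlier])
print(clean)    # (1.5, 1.5)
print(shifted)  # (5.5, 5.5)
```

With the outlier included, the centroid lies outside the square formed by the four original points.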

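#### K-NN (k-nearest neighbors)

Concept
This method only needs to look at the class labels of the surrounding points, and it assumes that every sample belongs to one of the known classes. Unlike K-means, it therefore needs a labeled training set.
Algorithm breakdown

1. Set up the class labels of the training set.
2. Read the next sample to classify and compute the Euclidean distance from it to every sample in the training set.
3. Determine the class of the new sample from the training samples nearest to it by Euclidean distance: the K nearest samples vote.
4. Repeat the steps above until every test sample has been assigned a class.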
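The voting step can be sketched in plain Python as follows. This is a minimal illustration under assumed names (`knn_predict`, the toy training set, and the default `k=3`), not the TensorFlow listing from this article.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance; it ranks neighbors the same way as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: squared_distance(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy labeled training set: three points near the origin, three near (5, 5).
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]
test_points = [(0.5, 0.5), (5.5, 5.5)]

predictions = [knn_predict(train_points, train_labels, q) for q in test_points]
print(predictions)  # [0, 1]
```

With two classes, an odd `k` avoids tied votes.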

Diagram

Code implementation

__author__ = 'ding'
'''
Clustering: the K-nn algorithm on a synthetic dataset.
Written against the TensorFlow 1.x graph API (tf.Session).
'''
import tensorflow as tf
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt

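# make_circles is imported from sklearn.datasets; the samples_generator
# submodule is not available in newer scikit-learn releases.
from sklearn.datasets import make_circles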

N = 210
K = 2

MAX_ITERS = 1000
cut = int(N * 0.7)

start = time.time()

data, features = make_circles(n_samples=N, shuffle=True, noise=0.12, factor=0.4)
tr_data, tr_features = data[:cut], features[:cut]
te_data, te_features = data[cut:], features[cut:]

fig, ax = plt.subplots()
ax.scatter(tr_data.transpose()[0], tr_data.transpose()[1], marker='o', s=100, c=tr_features, cmap=plt.cm.coolwarm)
plt.show()

points = tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

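test = []

# Note: this listing uses only the single nearest neighbor (1-NN);
# the constant K defined above is not used for voting.
for i, j in zip(te_data, te_features):
    # Squared Euclidean distance from the test sample to every training sample.
    distances = tf.reduce_sum(tf.square(tf.subtract(i, tr_data)), axis=1)
    neighbor = tf.argmin(distances, 0)
    # The predicted label is the label of the nearest training sample.
    test.append(tr_features[sess.run(neighbor)])

print(test)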

fig, ax = plt.subplots()
ax.scatter(te_data.transpose()[0], te_data.transpose()[1], marker='o', s=100, c=test, cmap=plt.cm.coolwarm)
plt.show()

end = time.time()
print('Found in %.2f seconds' % (end - start))
print('Cluster assignments: ', test)
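The listing stores its predictions in `test` but never compares them with the held-out labels in `te_features`. One simple way to do that is to compute the fraction of matching labels. The helper below is a hypothetical addition with illustrative inputs, not part of the original code.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError('predicted and actual must have the same length')
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Illustrative labels: three of the four predictions are correct.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

In the listing above it could be called as `accuracy(test, te_features)`.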

Output

![K-NN training data](https://img-blog.csdn.net/20180710141843578?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

After the computation:

![K-NN output](https://img-blog.csdn.net/20180710141902347?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![K-NN output](https://img-blog.csdn.net/20180710141926527?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTQyNTgzNjI=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

- K-NN pros and cons

  Pros:
  1. Simple, with little to tune apart from the choice of K.
  2. No training phase: to change the model, we only need to add more training samples.

  Cons:
  1. High computational cost, since each new sample is compared against every training sample.
