Paper notes: Deep k-Means: Jointly clustering with k-Means and learning representations

Latest recommended article published on 2023-07-15 20:51:40

菜鸟上路dd

Tags: kmeans, machine learning

Article link: https://blog.csdn.net/2201_75349501/article/details/130308402

These notes record only what I personally consider important; for the full text, please see the original paper.

Original paper: Deep k-Means: Jointly clustering with k-Means and learning representations - ScienceDirect

Introduction

Clustering is a long-standing problem in the machine learning and data mining fields, and has accordingly fostered abundant research. Traditional clustering methods, e.g., k-Means and Gaussian Mixture Models (GMMs), fully rely on the original data representations and may then be ineffective when the data points (e.g., images and text documents) live in a high-dimensional space – a problem commonly known as the curse of dimensionality.

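Note: classical clustering methods such as k-Means and Gaussian Mixture Models (GMMs) operate directly on the original data representation, so they can become ineffective when the data points (e.g., images or text documents) live in a high-dimensional space. This is the well-known curse of dimensionality.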

Significant progress has been made in the last decade or so to learn better, low-dimensional data representations. The most successful techniques to achieve such high-quality representations rely on deep neural networks (DNNs), which apply successive non-linear transformations to the data in order to obtain increasingly high-level features. Auto-encoders (AEs) are a special instance of DNNs which are trained to embed the data into a (usually dense and low-dimensional) vector at the bottleneck of the network, and then attempt to reconstruct the input based on this vector. The appeal of AEs lies in the fact that they are able to learn representations in a fully unsupervised way. The representation learning breakthrough enabled by DNNs spurred the recent development of numerous deep clustering approaches which aim at jointly learning the data points’ representations as well as their cluster assignments.

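Note: over roughly the last decade, DNNs have become the most successful way to learn good low-dimensional representations, by stacking non-linear transformations that yield increasingly high-level features. An auto-encoder (AE) is a DNN trained to compress the input into a (usually dense, low-dimensional) vector at its bottleneck and then to reconstruct the input from that vector, so it learns representations in a fully unsupervised way. This progress is what spurred deep clustering: methods that jointly learn the data points' representations and their cluster assignments.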
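To make the auto-encoder idea concrete, here is a minimal sketch of my own, not code from the paper: a linear auto-encoder with a single bottleneck unit, trained by plain per-sample gradient descent on the squared reconstruction error. The toy data, the initial weights, the learning rate, and all function names are assumptions chosen for illustration; real AEs use several non-linear layers.

```python
def encode(x, w_enc):
    # Compress the input vector into one bottleneck value (the learned code).
    return sum(w * xi for w, xi in zip(w_enc, x))

def decode(h, w_dec):
    # Reconstruct the input from the bottleneck value.
    return [w * h for w in w_dec]

def reconstruction_loss(data, w_enc, w_dec):
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w_enc), w_dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train(data, w_enc, w_dec, epochs=300, lr=0.02):
    # Per-sample gradient descent; no labels are used anywhere.
    for _ in range(epochs):
        for x in data:
            h = encode(x, w_enc)
            err = [a - b for a, b in zip(decode(h, w_dec), x)]
            # Error signal propagated back through the decoder to the code h.
            back = sum(e * w for e, w in zip(err, w_dec))
            w_dec = [w - lr * 2.0 * e * h for w, e in zip(w_dec, err)]
            w_enc = [w - lr * 2.0 * back * xi for w, xi in zip(w_enc, x)]
    return w_enc, w_dec

# Toy 2-D data lying on a line, so a 1-D code can represent it.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_enc0, w_dec0 = [0.1, 0.1], [0.1, 0.1]
initial_loss = reconstruction_loss(data, w_enc0, w_dec0)
w_enc, w_dec = train(data, w_enc0, w_dec0)
final_loss = reconstruction_loss(data, w_enc, w_dec)
```

The value returned by `encode` plays the role of the embedding on which a deep clustering method would then operate.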

In this study, we specifically focus on the k-Means-related deep clustering problem.

Note: the paper focuses specifically on deep clustering methods related to k-Means.

Contrary to previous approaches that alternate between continuous gradient updates and discrete cluster assignment steps, we show here that one can solely rely on gradient updates to learn, truly jointly, representations and clustering parameters. This ultimately leads to a better deep k-Means method which is also more scalable as it can fully benefit from the efficiency of stochastic gradient descent (SGD). In addition, we perform a careful comparison of different methods by (a) relying on the same auto-encoders, as the choice of auto-encoders impacts the results obtained, (b) tuning the hyperparameters of each method on a small validation set, instead of setting them without clear criteria, and (c) enforcing, whenever possible, that the same initialization and sequence of SGD minibatches are used by the different methods. The last point is crucial to compare different methods as these two factors play an important role and the variance of each method is usually not negligible.

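Note: earlier approaches alternate between continuous gradient updates and discrete cluster-assignment steps. The authors show that gradient updates alone are enough to learn the representations and the clustering parameters truly jointly, which gives a better deep k-Means method that also scales better because it fully benefits from the efficiency of SGD. They also make the comparison fair: (a) all methods use the same auto-encoders, since that choice affects the results; (b) each method's hyperparameters are tuned on a small validation set rather than set without clear criteria; and (c) wherever possible, all methods share the same initialization and the same sequence of SGD minibatches. Point (c) matters because both factors strongly influence the outcome and the variance of each method is usually not negligible.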
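The excerpt above does not spell out how the discrete assignment step is avoided. As I understand the paper, the hard assignment is replaced by a softmax over negative distances, controlled by an inverse temperature; treat that reading, the name `alpha`, and the toy data below as my assumptions rather than the authors' exact formulation. A sketch of such a relaxed k-Means loss:

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_assignments(h, centers, alpha):
    # Softmax over negative squared distances. Subtracting the smallest
    # distance keeps the exponentials numerically stable.
    d = [sq_dist(h, r) for r in centers]
    d_min = min(d)
    w = [math.exp(-alpha * (dk - d_min)) for dk in d]
    z = sum(w)
    return [wk / z for wk in w]

def soft_kmeans_loss(points, centers, alpha):
    # Distance to each center, weighted by the soft assignment. Every
    # operation is smooth, so gradients can flow to points and centers.
    total = 0.0
    for h in points:
        g = soft_assignments(h, centers, alpha)
        total += sum(gk * sq_dist(h, r) for gk, r in zip(g, centers))
    return total

def hard_kmeans_loss(points, centers):
    # Classical k-Means objective with a discrete nearest-center choice.
    return sum(min(sq_dist(h, r) for r in centers) for h in points)

points = [[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.8]]
centers = [[0.1, 0.0], [4.0, 4.0]]
soft_small_alpha = soft_kmeans_loss(points, centers, alpha=0.01)
soft_large_alpha = soft_kmeans_loss(points, centers, alpha=100.0)
hard = hard_kmeans_loss(points, centers)
```

With a small `alpha` every point is spread across all centers; as `alpha` grows the soft loss approaches the classical k-Means loss, without ever needing a separate discrete assignment step.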

Related work

In the wake of the groundbreaking results obtained by DNNs in computer vision, several deep clustering algorithms were specifically designed for image clustering. These works have in common the exploitation of Convolutional Neural Networks (CNNs), which extensively contributed to last decade’s significant advances in computer vision. Inspired by agglomerative clustering, Yang et al. [30] proposed a recurrent process which successively merges clusters and learns image representations based on CNNs. In [7], the clustering problem is formulated as binary pairwise-classification so as to identify the pairs of images which should belong to the same cluster. Due to the unsupervised nature of clustering, the CNN-based classifier in this approach is only trained on noisily labeled examples obtained by selecting increasingly difficult samples in a curriculum learning fashion. Dizaji et al. [9] jointly trained a CNN auto-encoder and a multinomial logistic regression model applied to the AE’s latent space. Similarly, Hsu and Lin [13] alternate between representation learning and clustering where mini-batch k-Means is utilized as the clustering component. Differently from these works, Hu et al. [14] proposed an information-theoretic framework based on data augmentation to learn discrete representations, which may be applied to clustering or hash learning. Although these different algorithms obtained state-of-the-art results on image clustering, their ability to generalize to other types of data (e.g., text documents) is not guaranteed due to their reliance on essentially image-specific techniques – Convolutional Neural Network architectures and data augmentation.

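Note: after the breakthroughs of DNNs in computer vision, several deep clustering algorithms were designed specifically for images, all of them exploiting Convolutional Neural Networks (CNNs). Yang et al. [30] successively merge clusters and learn CNN-based image representations in a recurrent process inspired by agglomerative clustering. [7] casts clustering as binary pairwise classification (should two images belong to the same cluster?), training the CNN classifier on noisily labeled examples selected from easy to hard in a curriculum-learning fashion. Dizaji et al. [9] jointly train a CNN auto-encoder and a multinomial logistic regression model on the AE's latent space. Hsu and Lin [13] alternate between representation learning and clustering, with mini-batch k-Means as the clustering component. Hu et al. [14] instead propose an information-theoretic framework based on data augmentation that learns discrete representations usable for clustering or hash learning. These methods reach state-of-the-art results on images, but because they rely on image-specific techniques (CNN architectures and data augmentation) there is no guarantee that they generalize to other data types such as text documents.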

Nonetheless, many general-purpose – non-image-specific – approaches to deep clustering have also been recently designed. Generative models were proposed in Dilokthanakul et al. [8], Jiang et al. [18] which combine variational AEs and GMMs to perform clustering. Alternatively, Ji et al. [17], Peng et al. [25], [26] framed deep clustering as a subspace clustering problem in which the mapping from the original data space to a low-dimensional subspace is learned by a DNN. Xie et al. [28] defined the Deep Embedded Clustering (DEC) method which simultaneously updates the data points’ representations, initialized from a pre-trained AE, and cluster centers. DEC uses soft assignments which are optimized to match stricter assignments through a Kullback-Leibler divergence loss. IDEC was subsequently proposed in Guo et al. [11] as an improvement to DEC by integrating the AE’s reconstruction error in the objective function.

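Note: many general-purpose (not image-specific) deep clustering methods also exist. Dilokthanakul et al. [8] and Jiang et al. [18] propose generative models that combine variational AEs with GMMs. Ji et al. [17] and Peng et al. [25], [26] treat deep clustering as a subspace clustering problem in which a DNN learns the mapping from the original data space to a low-dimensional subspace. Deep Embedded Clustering (DEC), by Xie et al. [28], simultaneously updates the cluster centers and the data points' representations, which are initialized from a pre-trained AE; its soft assignments are optimized to match stricter assignments through a Kullback-Leibler divergence loss. IDEC (Guo et al. [11]) improves on DEC by adding the AE's reconstruction error to the objective function.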
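The DEC sentence is easier to follow with numbers. The sketch below follows the formulation I recall from the DEC paper (Student's t kernel with one degree of freedom for the soft assignments, and a squared, frequency-normalized target distribution); these formulas are not given in the notes above, so treat them, the function names, and the toy embeddings as assumptions.

```python
import math

def soft_assign(z, centers):
    # q[i][j]: similarity of embedding i to center j under a Student's t
    # kernel (one degree of freedom), normalized over the centers.
    rows = []
    for zi in z:
        k = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, mu)))
             for mu in centers]
        s = sum(k)
        rows.append([v / s for v in k])
    return rows

def target_distribution(q):
    # "Stricter" assignments: square q, divide by the soft cluster
    # frequency, then renormalize each row.
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        u = [row[j] ** 2 / f[j] for j in range(len(row))]
        s = sum(u)
        p.append([v / s for v in u])
    return p

def kl_divergence(p, q):
    # KL(P || Q), the quantity minimized so that q moves toward p.
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row))

z = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]  # toy embeddings
centers = [[0.5, 0.0], [5.5, 5.0]]
q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In this toy example each row of `p` puts more mass on its dominant cluster than the matching row of `q`, which is the sense in which the targets are stricter.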

Few approaches were directly influenced by k-Means clustering. The Deep Embedding Network (DEN) model first learns representations from an AE while enforcing locality-preserving constraints and group sparsity; clusters are then obtained by simply applying k-Means to these representations. Yet, as representation learning is decoupled from clustering, the performance is not as good as the one obtained by methods that rely on a joint approach. Besides [13], mentioned before in the context of images, the only study, to our knowledge, that directly addresses the problem of jointly learning representations and clustering with k-Means (and not an approximation of it) is the Deep Clustering Network (DCN) approach [29]. However, as in Hsu and Lin [13], DCN alternately learns (rather than jointly learns) the object representations, the cluster centroids and the cluster assignments, the latter being based on discrete optimization steps which cannot benefit from the efficiency of stochastic gradient descent. The approach proposed here, entitled Deep k-Means (DKM), addresses this problem.

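Note: few approaches build directly on k-Means. The Deep Embedding Network (DEN) first learns AE representations under locality-preserving and group-sparsity constraints and then simply runs k-Means on them; because representation learning is decoupled from clustering, it performs worse than joint methods. Besides [13], the only study the authors know of that jointly learns representations and clusters with k-Means itself (not an approximation of it) is the Deep Clustering Network (DCN) [29]. Like [13], however, DCN learns the object representations, the cluster centroids, and the cluster assignments alternately rather than jointly, and the assignments rely on discrete optimization steps that cannot benefit from the efficiency of SGD. Deep k-Means (DKM), the approach proposed in the paper, addresses this problem.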
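For contrast, this is what an alternating scheme with a discrete assignment step looks like in its simplest form. It is a generic k-Means (Lloyd-style) iteration written for illustration, not DCN's actual procedure; the function names and toy points are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    # Discrete step: pick the index of the nearest center. This argmin has
    # no useful gradient, so it cannot be trained by SGD.
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def update_centers(points, labels, centers):
    # Continuous step: move each center to the mean of its members.
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(list(old))  # keep the center of an empty cluster
            continue
        new_centers.append([sum(c) / len(members) for c in zip(*members)])
    return new_centers

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[1.0, 1.0], [8.0, 1.0]]
for _ in range(5):
    labels = assign(points, centers)                   # step 1: assignments
    centers = update_centers(points, labels, centers)  # step 2: centroids
```

The two steps never share a gradient: each one is solved while the other is held fixed, which is the "alternately rather than jointly" behaviour the paragraph above criticizes.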

Experiments

In order to evaluate the clustering results of our approach, we conducted experiments on different datasets and compared it against state-of-the-art standard and k-Means-related deep clustering models.

Note: the approach is evaluated on several datasets and compared against state-of-the-art standard clustering models and k-Means-related deep clustering models.

Datasets

The datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our approach. Image datasets consist of MNIST (70,000 images, 28 × 28 pixels, 10 classes) and USPS (9,298 images, 16 × 16 pixels, 10 classes) which both contain hand-written digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1 for MNIST, and between -1 and 1 for USPS). The text collections we considered are the 20 Newsgroups dataset (hereafter, 20NEWS) and the RCV1-v2 dataset (hereafter, RCV1). For 20NEWS, we used the whole dataset comprising 18,846 documents labeled into 20 different classes. Similarly to [11], [28], we sampled from the full RCV1-v2 collection a random subset of 10,000 documents, each of which pertains to only one of the four largest classes. Because of the text datasets’ sparsity, and as proposed in Xie et al. [28], we selected the 2000 words with the highest tf-idf values to represent each document.

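Note: the experiments use standard clustering benchmarks, covering both images and text to show that the approach is generally applicable. Images: MNIST (70,000 images, 28 × 28 pixels, 10 classes) and USPS (9,298 images, 16 × 16 pixels, 10 classes), both hand-written digits, reshaped to one-dimensional vectors with pixel intensities normalized to [0, 1] for MNIST and [-1, 1] for USPS. Text: 20 Newsgroups (20NEWS), used in full with 18,846 documents in 20 classes, and RCV1-v2 (RCV1), from which, as in [11], [28], a random subset of 10,000 documents is sampled, each belonging to exactly one of the four largest classes. Because the text data are sparse, each document is represented by the 2000 words with the highest tf-idf values, as proposed in Xie et al. [28].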
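The two preprocessing steps can be sketched as follows. This is an illustration under my own assumptions, not the authors' pipeline: raw pixel values are assumed to lie in 0..255, and words are ranked by their maximum tf-idf score over all documents (the excerpt does not say how the scores are aggregated). The function names are hypothetical.

```python
import math
from collections import Counter

def flatten_and_scale(image, low=0.0, high=1.0, max_value=255.0):
    # Reshape a 2-D image (list of rows) into a 1-D vector and map pixel
    # values from [0, max_value] to [low, high].
    flat = [px for row in image for px in row]
    return [low + (high - low) * px / max_value for px in flat]

def top_tfidf_words(docs, n_words):
    # docs is a list of token lists. Returns the n_words words with the
    # highest tf-idf score, ties broken alphabetically.
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    best = {}
    for doc in docs:
        for w, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[w])
            best[w] = max(best.get(w, 0.0), score)
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:n_words]

image = [[0, 255], [255, 0]]
mnist_style = flatten_and_scale(image)                      # range [0, 1]
usps_style = flatten_and_scale(image, low=-1.0, high=1.0)   # range [-1, 1]

docs = [["cat", "sat", "the"], ["dog", "ran", "the"], ["cat", "dog", "the"]]
vocabulary = top_tfidf_words(docs, 2)
```

A word that appears in every document, such as "the" here, gets an idf of zero and falls to the bottom of the ranking, which is why a tf-idf cut-off is a reasonable way to shrink a sparse vocabulary.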

Code: MaziarMF/deep-k-means (github.com)