Paper notes: Deep k-Means: Jointly clustering with k-Means and learning representations


Original paper: Deep k-Means: Jointly clustering with k-Means and learning representations - ScienceDirect


Clustering is a long-standing problem in the machine learning and data mining fields, and has accordingly fostered abundant research. Traditional clustering methods, e.g., k-Means and Gaussian Mixture Models (GMMs), rely fully on the original data representations and may therefore be ineffective when the data points (e.g., images and text documents) live in a high-dimensional space – a problem commonly known as the curse of dimensionality.


Significant progress has been made in the last decade or so to learn better, low-dimensional data representations. The most successful techniques to achieve such high-quality representations rely on deep neural networks (DNNs), which apply successive non-linear transformations to the data in order to obtain increasingly high-level features. Auto-encoders (AEs) are a special instance of DNNs which are trained to embed the data into a (usually dense and low-dimensional) vector at the bottleneck of the network, and then attempt to reconstruct the input based on this vector. The appeal of AEs lies in the fact that they are able to learn representations in a fully unsupervised way. The representation learning breakthrough enabled by DNNs spurred the recent development of numerous deep clustering approaches which aim at jointly learning the data points’ representations as well as their cluster assignments. 


In this study, we specifically focus on the k-Means-related deep clustering problem. 


Contrary to previous approaches that alternate between continuous gradient updates and discrete cluster assignment steps, we show here that one can solely rely on gradient updates to learn, truly jointly, representations and clustering parameters. This ultimately leads to a better deep k-Means method which is also more scalable as it can fully benefit from the efficiency of stochastic gradient descent (SGD). In addition, we perform a careful comparison of different methods by (a) relying on the same auto-encoders, as the choice of auto-encoders impacts the results obtained, (b) tuning the hyperparameters of each method on a small validation set, instead of setting them without clear criteria, and (c) enforcing, whenever possible, that the same initialization and sequence of SGD minibatches are used by the different methods. The last point is crucial to compare different methods as these two factors play an important role and the variance of each method is usually not negligible.
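
To make the idea of "gradient updates only" concrete, here is a minimal sketch of a differentiable stand-in for the hard k-Means assignment. It assumes a softmax relaxation over negative squared distances with an inverse temperature `alpha`; the excerpt above does not spell out the exact form, and the function names and toy values are hypothetical.

```python
import math

def softmax_assignments(h, centers, alpha):
    """Return (soft weights, squared distances) of embedding h to each center.

    Assumption: weights are a softmax over -alpha * squared Euclidean distance.
    """
    d = [sum((hi - ci) ** 2 for hi, ci in zip(h, c)) for c in centers]
    m = min(d)  # shift by the smallest distance for numerical stability
    w = [math.exp(-alpha * (dk - m)) for dk in d]
    z = sum(w)
    return [wk / z for wk in w], d

def soft_kmeans_loss(h, centers, alpha):
    """Distance to each center, weighted by its soft assignment."""
    g, d = softmax_assignments(h, centers, alpha)
    return sum(gk * dk for gk, dk in zip(g, d))

h = [0.0, 0.0]                      # toy embedding
centers = [[1.0, 0.0], [3.0, 0.0]]  # toy cluster centers
loss_uniform = soft_kmeans_loss(h, centers, alpha=0.0)   # equal weights
loss_sharp = soft_kmeans_loss(h, centers, alpha=50.0)    # close to hard k-Means
```

With `alpha = 0` every center gets the same weight, while a large `alpha` pushes almost all weight onto the nearest center, so the loss approaches the usual k-Means term. Because the expression is smooth in both `h` and `centers`, it can in principle be minimized with SGD together with a reconstruction loss, without a separate discrete assignment step.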


Related work

In the wake of the groundbreaking results obtained by DNNs in computer vision, several deep clustering algorithms were specifically designed for image clustering. These works have in common the exploitation of Convolutional Neural Networks (CNNs), which extensively contributed to last decade’s significant advances in computer vision. Inspired by agglomerative clustering, Yang et al. [30] proposed a recurrent process which successively merges clusters and learns image representations based on CNNs. In [7], the clustering problem is formulated as binary pairwise classification so as to identify the pairs of images which should belong to the same cluster. Due to the unsupervised nature of clustering, the CNN-based classifier in this approach is only trained on noisily labeled examples obtained by selecting increasingly difficult samples in a curriculum learning fashion. Dizaji et al. [9] jointly trained a CNN auto-encoder and a multinomial logistic regression model applied to the AE’s latent space. Similarly, Hsu and Lin [13] alternate between representation learning and clustering, where mini-batch k-Means is used as the clustering component. Unlike these works, Hu et al. [14] proposed an information-theoretic framework based on data augmentation to learn discrete representations, which may be applied to clustering or hash learning. Although these different algorithms obtained state-of-the-art results on image clustering, their ability to generalize to other types of data (e.g., text documents) is not guaranteed due to their reliance on essentially image-specific techniques – Convolutional Neural Network architectures and data augmentation.


Nonetheless, many general-purpose – non-image-specific – approaches to deep clustering have also been recently designed. Generative models which combine variational AEs and GMMs to perform clustering were proposed in Dilokthanakul et al. [8] and Jiang et al. [18]. Alternatively, Ji et al. [17], Peng et al. [25], [26] framed deep clustering as a subspace clustering problem in which the mapping from the original data space to a low-dimensional subspace is learned by a DNN. Xie et al. [28] defined the Deep Embedded Clustering (DEC) method, which simultaneously updates the cluster centers and the data points’ representations, the latter being initialized from a pre-trained AE. DEC uses soft assignments which are optimized to match stricter assignments through a Kullback-Leibler divergence loss. IDEC was subsequently proposed in Guo et al. [11] as an improvement to DEC by integrating the AE’s reconstruction error in the objective function.
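
The DEC mechanism described above (soft assignments pulled toward stricter ones through a KL loss) can be sketched as follows. This is an illustration, not the authors' code: it assumes the Student's t kernel with one degree of freedom for the soft assignments and the squared-and-renormalized target distribution from the DEC paper; the function names and toy data are hypothetical.

```python
import math

def student_t_assignments(z, centers):
    """Soft assignment q[i][j] of embedding z[i] to center j (Student's t kernel)."""
    q = []
    for zi in z:
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, c))) for c in centers]
        s = sum(w)
        q.append([wk / s for wk in w])
    return q

def target_distribution(q):
    """Stricter targets: square q, divide by soft cluster size, renormalize per row."""
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]  # soft cluster sizes
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(len(row))]
        s = sum(w)
        p.append([wk / s for wk in w])
    return p

def kl_divergence(p, q):
    """KL(P || Q), summed over all points and clusters."""
    return sum(pij * math.log(pij / qij)
               for pr, qr in zip(p, q)
               for pij, qij in zip(pr, qr) if pij > 0)

z = [[0.0], [0.2], [2.0]]   # toy one-dimensional embeddings
centers = [[0.0], [2.0]]    # toy cluster centers
q = student_t_assignments(z, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

On this toy data each row of `p` is more peaked than the corresponding row of `q`, so minimizing the KL term moves the embeddings and centers toward more confident assignments.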


Few approaches were directly influenced by k-Means clustering. The Deep Embedding Network (DEN) model first learns representations from an AE while enforcing locality-preserving constraints and group sparsity; clusters are then obtained by simply applying k-Means to these representations. Yet, as representation learning is decoupled from clustering, the performance is not as good as that of methods relying on a joint approach. Besides [13], mentioned before in the context of images, the only study, to our knowledge, that directly addresses the problem of jointly learning representations and clustering with k-Means (and not an approximation of it) is the Deep Clustering Network (DCN) approach [29]. However, as in Hsu and Lin [13], DCN alternately learns (rather than jointly learns) the object representations, the cluster centroids and the cluster assignments, the latter being based on discrete optimization steps which cannot benefit from the efficiency of stochastic gradient descent. The approach proposed here, entitled Deep k-Means (DKM), addresses this problem.


In order to evaluate the clustering results of our approach, we conducted experiments on different datasets and compared it against state-of-the-art standard and k-Means-related deep clustering models.



The datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our approach. The image datasets are MNIST (70,000 images, 28 × 28 pixels, 10 classes) and USPS (9,298 images, 16 × 16 pixels, 10 classes), which both contain hand-written digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1 for MNIST, and between -1 and 1 for USPS). The text collections we considered are the 20 Newsgroups dataset (hereafter, 20NEWS) and the RCV1-v2 dataset (hereafter, RCV1). For 20NEWS, we used the whole dataset comprising 18,846 documents labeled into 20 different classes. As in [11], [28], we sampled from the full RCV1-v2 collection a random subset of 10,000 documents, each of which pertains to only one of the four largest classes. Because of the text datasets’ sparsity, and as proposed in Xie et al. [28], we selected the 2,000 words with the highest tf-idf values to represent each document.
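
The image preprocessing described above (reshape to a one-dimensional vector, then normalize the intensities) might look like the sketch below. It assumes raw pixel values in the range 0 to 255 and a simple linear rescaling; neither detail is stated in the text, and the helper names are hypothetical.

```python
def flatten(image):
    """Reshape a 2-D image (list of rows) into a one-dimensional vector."""
    return [px for row in image for px in row]

def rescale(values, lo, hi, raw_min=0.0, raw_max=255.0):
    """Linearly map values from [raw_min, raw_max] to [lo, hi].

    Assumption: raw intensities lie in 0..255.
    """
    span = raw_max - raw_min
    return [lo + (hi - lo) * (v - raw_min) / span for v in values]

image = [[0, 255], [51, 204]]           # toy 2 x 2 "image"
vector = flatten(image)                 # [0, 255, 51, 204]
mnist_style = rescale(vector, 0.0, 1.0)   # intensities between 0 and 1
usps_style = rescale(vector, -1.0, 1.0)   # intensities between -1 and 1
```

A 28 × 28 MNIST image flattened this way yields a vector of length 784, and a 16 × 16 USPS image a vector of length 256.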

Model repository: MaziarMF/deep-k-means (github.com)

