An introduction to common text clustering methods
What is the difference between text clustering and text classification?
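- Classification uses predefined classes to which objects are assigned. The classifier has to be trained on a manually labeled corpus, so classification is a form of supervised learning.
Scope: classification suits cases where the categories are already fixed, for example shelving books according to the National Library classification scheme (国图分类法).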
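- Clustering has no predefined classes. It needs neither manual labeling nor a pre-trained classifier; the categories emerge during the clustering process itself, so clustering is a form of unsupervised learning. Clustering identifies similarities between objects, which are grouped into "clusters" according to their common characteristics.
Scope: with clustering you do not know in advance what you are looking for, and there is no preset target variable. It suits cases where the category scheme is uncertain, such as clustering search-engine results after retrieval (metasearch). [手心里的雪, 2017, Zhihu] Clustering is when you have no clue what types there are, and you want an algorithm to discover what types there might be.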
What are the input and output of text clustering?
Input: a word-vector matrix (the texts represented as vectors)
Output: a vector or matrix of cluster assignments
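To make the input and output concrete, here is a minimal sketch using only the Python standard library. It rests on several assumptions that are not in the notes above: the input matrix is taken to be an L2-normalized bag-of-words matrix with one row per document, the clustering algorithm is plain k-means, and the initial centroids are chosen with a simple deterministic farthest-first rule. The function names (`build_matrix`, `farthest_first_init`, `kmeans`) and the four sample sentences are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def build_matrix(docs):
    """Build an L2-normalized bag-of-words matrix: one row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[column[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        matrix.append([x / norm for x in row])
    return vocab, matrix


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(matrix, k):
    """Assumed init: start at row 0, then repeatedly take the row that is
    farthest from its nearest already-chosen centroid."""
    centroids = [list(matrix[0])]
    while len(centroids) < k:
        far = max(matrix, key=lambda row: min(sq_dist(row, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(matrix, k, max_iter=50):
    """Plain k-means; returns one cluster label per input row."""
    centroids = farthest_first_init(matrix, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            for row in matrix
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(matrix, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "cat dog pet",
    "dog cat pet food",
    "stock market price",
    "market stock trading price",
]
vocab, matrix = build_matrix(docs)  # input: 4 documents x 8 vocabulary terms
labels = kmeans(matrix, k=2)        # output: one cluster id per document
print(labels)  # [0, 0, 1, 1]
```

No labels are supplied anywhere: the two groups (pets and finance) come out of the vectors alone, which is the unsupervised setting described above. In practice the rows are more likely to be TF-IDF weights or averaged word embeddings, and the number of clusters `k` still has to be chosen by the user.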