Using LDA for Text Classification
LDA, or Latent Dirichlet Allocation, is one of the most widely used topic modelling algorithms. It is scalable and computationally fast, and, more importantly, it generates simple, comprehensible topics that are close to what a human reader would assign to a text. While LDA is mostly used for unsupervised tasks, e.g. topic modelling or document clustering, it can also be used as a feature extraction step for supervised tasks such as text classification. In this article we are going to assemble an LDA-based classifier and see how it performs. Let's go!
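To make the idea concrete before turning to the article's own tooling, here is a minimal illustration (not the article's code, which relies on tomotopy later on) of LDA as a feature extractor: the per-document topic proportions produced by LDA become the feature vector that an ordinary classifier is trained on.

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words counts -> document-topic mixtures -> classifier on topic proportions
train = fetch_20newsgroups(subset="train", categories=["rec.autos", "rec.sport.hockey"])
clf = make_pipeline(
    CountVectorizer(min_df=5, stop_words="english"),
    LatentDirichletAllocation(n_components=20, random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(train.data, train.target)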
Tools:
For simplicity, we're going to use the lda_classification Python package, which offers simple wrappers compatible with the scikit-learn estimator API for text preprocessing and text vectorization.
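Since both wrappers expose the usual fit/transform interface, they can in principle be chained in a scikit-learn Pipeline alongside a scaler and a classifier. The sketch below is only illustrative: the constructor arguments (e.g. num_of_topics) and the exact wiring of the steps are assumptions, not taken from the package's documentation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from lda_classification.model import TomotopyLDAVectorizer
from lda_classification.preprocess.spacy_cleaner import SpacyCleaner

pipe = Pipeline([
    ("clean", SpacyCleaner()),                               # spaCy-based tokenization/cleaning (arguments assumed)
    ("vectorize", TomotopyLDAVectorizer(num_of_topics=20)),  # document-topic proportions as features (kwarg assumed)
    ("scale", StandardScaler()),                             # scale topic proportions for the SVM
    ("clf", SVC()),                                          # downstream classifier
])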
The Classification Problem:
The 20 Newsgroups dataset is one of the best-known and most heavily referenced datasets in the field of natural language processing. It consists of around 18K news documents spread across various categories. To make the task a little less resource-heavy, we choose a subset of this dataset for our text classification problem. Since I really like following sports culture, I decided to pick the sport-related sections of this dataset. The categories in this subset are as follows:
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
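Loading just this subset is straightforward with scikit-learn's built-in fetcher; the snippet below is a small illustration consistent with the fetch_20newsgroups import used in the setup code further down.

from sklearn.datasets import fetch_20newsgroups

categories = ["rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey"]
raw = fetch_20newsgroups(subset="all", categories=categories)
docs, labels = raw.data, raw.target
print(f"{len(docs)} documents across {len(raw.target_names)} categories")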
Setting up the code!
Before we run the example, we import the modules we need:
import logging
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score, train_test_split, )
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from tomotopy import HDPModel
from lda_classification.model import TomotopyLDAVectorizer
from lda_classification.preprocess.spacy_cleaner import SpacyCleaner
#############################################
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
workers = 4 #Numbers of workers throughout the project
use_umap = False #make this True if you want to use UMAP for your visualizations
min_df = 5 #Minimum number for document frequency in the corpus
rm_top = 5 #Remove the top n most frequent words from the corpus
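The setup also imports tomotopy's HDPModel. One common use of HDP in this context is to estimate a reasonable number of topics automatically rather than picking one by hand; the helper below is a minimal sketch of that idea (not the article's own code), assuming the documents have already been tokenized into lists of words.

from tomotopy import HDPModel

def estimate_num_topics(tokenized_docs, min_df=5, rm_top=5, iterations=1000, workers=4):
    """Fit an HDP model and return how many topics stay 'alive' after training."""
    hdp = HDPModel(min_df=min_df, rm_top=rm_top)
    for doc in tokenized_docs:
        hdp.add_doc(doc)                      # each doc is a list of tokens
    hdp.train(iterations, workers=workers)    # Gibbs sampling iterations
    return hdp.live_k                         # number of topics still active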
