Using LDA for Text Classification

This article shows how to use LDA (Latent Dirichlet Allocation) for text classification, walks through the classification process, and provides links to related resources.


LDA, or Latent Dirichlet Allocation, is one of the most widely used topic modelling algorithms. It is scalable, computationally fast, and, more importantly, it generates simple, comprehensible topics close to what the human mind assigns when reading a text. While LDA is mostly used for unsupervised tasks, e.g. topic modelling or document clustering, it can also serve as a feature extraction system for supervised tasks such as text classification. In this article we are going to assemble an LDA-based classifier and see how it performs! Let's go!

Tools:

For simplicity, we're going to use the lda_classification Python package, which offers simple wrappers compatible with the scikit-learn estimator API for text preprocessing and text vectorization.
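"Compatible with the scikit-learn estimator API" means any object exposing fit/transform can slot into a Pipeline. A minimal sketch of that idea, using a hypothetical LowercaseCleaner in place of the package's actual preprocessor (SpacyCleaner), so it runs with scikit-learn alone:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class LowercaseCleaner(BaseEstimator, TransformerMixin):
    """Toy preprocessor: lowercases each document.

    Stands in for a real cleaner such as lda_classification's SpacyCleaner.
    """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]


pipe = Pipeline([
    ("clean", LowercaseCleaner()),     # preprocessing step
    ("vectorize", CountVectorizer()),  # vectorization step
])

docs = ["Hockey Night in Canada", "Baseball season opener"]
X = pipe.fit_transform(docs)
print(X.shape)  # (2, 7): two documents, seven unique terms
```

Any estimator that follows this fit/transform contract, including the package's wrappers, can be chained the same way.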

The Classification Problem:

The 20 News Group dataset is one of the best-known and most heavily referenced datasets in the field of natural language processing. It consists of around 18K news documents in various categories. To make the task a little less resource-heavy, we choose a subset of this dataset for our text classification problem. Since I really like following sports culture, I decided to choose the sport-related section of this dataset. The categories in this subset are as follows:

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
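These four categories can be passed to fetch_20newsgroups via its categories parameter. A short sketch (the download call is commented out so the snippet stays offline; the LabelEncoder lines show how the class names map to the integer targets used later):

```python
from sklearn.preprocessing import LabelEncoder

categories = [
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
]

# Fetch only the four sport-related categories of 20 News Group:
# data = fetch_20newsgroups(subset="all", categories=categories)

# LabelEncoder assigns integers in alphabetical order of the class names.
encoder = LabelEncoder()
labels = encoder.fit_transform(categories)
print(labels)  # [0 1 2 3]
```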

Setting up the code!

Before running the example, we import the modules we need:

import logging


import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score, )
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from tomotopy import HDPModel


from lda_classification.model import TomotopyLDAVectorizer
from lda_classification.preprocess.spacy_cleaner import SpacyCleaner


#############################################


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
workers = 4 # Number of workers used throughout the project


use_umap = False #make this True if you want to use UMAP for your visualizations


min_df = 5 # Minimum document frequency for terms in the corpus
rm_top = 5 # Remove the top 5 most frequent words from the corpus
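The imports above outline the rest of the workflow: preprocess the text, vectorize it with an LDA model, scale the topic features, and classify with an SVM. Since tomotopy and lda_classification may not be installed everywhere, here is a runnable stand-in that uses scikit-learn's own CountVectorizer and LatentDirichletAllocation in place of TomotopyLDAVectorizer; the tiny synthetic corpus and the topic count are assumptions for illustration only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny synthetic corpus standing in for the 20 News Group sports subset,
# repeated so the SVM sees enough samples per class.
docs = [
    "the puck hit the net in overtime hockey",
    "goalie saved the puck late in the hockey game",
    "the pitcher threw a fastball in the baseball game",
    "home run sealed the baseball inning",
] * 5
labels = [0, 0, 1, 1] * 5

pipeline = Pipeline([
    ("vectorize", CountVectorizer(min_df=1)),  # the article uses min_df=5
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
    ("scale", StandardScaler()),               # normalize topic proportions
    ("svm", SVC(kernel="rbf")),
])

pipeline.fit(docs, labels)
preds = pipeline.predict(docs)
print((preds == labels).mean())  # training accuracy
```

In the article's version, the tomotopy-based TomotopyLDAVectorizer (configured with min_df and rm_top above) replaces the CountVectorizer + LatentDirichletAllocation pair, and HDPModel can be used beforehand to estimate a sensible number of topics.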