[FastText in Text Classification] Paper Implementation: Bag of Tricks for Efficient Text Classification

Bigcrab__

Last modified: 2023-11-20 11:59:00


Category: Machine Learning. Tags: python, nlp, word2vec

First published: 2023-11-20 11:58:24

Article link: https://blog.csdn.net/m0_72947390/article/details/134449767


Bag of Tricks for Efficient Text Classification

Paper: Bag of Tricks for Efficient Text Classification
Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov
Year: 2016
Link: https://arxiv.org/abs/1607.01759

1. Complete Code

Just call the fasttext library directly; it gets the job done very quickly!

import fasttext

# data.train.txt is a text file with one training sentence and its label(s) per line. By default, labels are assumed to be words prefixed with __label__
model = fasttext.train_supervised('data.train.txt')

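# Return the three most probable labels for each input; with two input sentences that is six labels in total, returned as a (labels, probabilities) pair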
model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)

In the API, the label parameter is the label prefix; it defaults to __label__.
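To make the expected input format concrete, here is a minimal sketch of writing examples in the `__label__` convention described above. The helper names `to_fasttext_line` and `write_training_file` and the label names (`baking`, `equipment`, `cleaning`) are hypothetical and chosen only for illustration; labels are assumed to contain no whitespace.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one example as a line for a supervised fastText training file."""
    # One example per line, so collapse all whitespace (including newlines).
    clean_text = " ".join(text.split())
    # Each label becomes a separate token carrying the prefix.
    tags = " ".join(prefix + label for label in labels)
    return f"{tags} {clean_text}"


def write_training_file(path, examples, prefix="__label__"):
    """Write (labels, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for labels, text in examples:
            f.write(to_fasttext_line(labels, text, prefix) + "\n")


# Hypothetical labels, for illustration only.
examples = [
    (["baking"], "Which baking dish is best to bake a banana bread ?"),
    (["equipment", "cleaning"], "Why not put knives in the dishwasher?"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
```

Calling `write_training_file("data.train.txt", examples)` would then produce a file of the kind passed to `fasttext.train_supervised`.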

2. Paper Walkthrough

2.1 Model Architecture

A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schutze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).

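Because a plain linear classifier shares no parameters across features and classes, fastText factorizes it into low-rank matrices: the input features are embedded and averaged into a shared hidden representation, which is then fed to a linear classifier. When the number of classes is large, computing the softmax over every class becomes expensive, so Hierarchical Softmax or Negative Sampling can be used to speed up training.
In addition, the internal structure of each word can be taken into account: following the paper Enriching Word Vectors with Subword Information, words can be mapped to vectors using that paper's subword (character n-gram) method. The Bag of Tricks paper itself also adds word n-gram features, hashed into a fixed number of buckets.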
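As a rough illustration of the subword idea, the sketch below extracts character n-grams from a word wrapped in `<` and `>` boundary markers, as described in Enriching Word Vectors with Subword Information, and maps each n-gram to a bucket index. The function names, the default n-gram range, and the bucket count are assumptions for this example, and the FNV-1a hash is used only as a stand-in; it is not claimed to reproduce the library's exact bucket assignments.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `<word>` with min_n <= n <= max_n."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


def bucket(ngram, num_buckets=2_000_000):
    """Map an n-gram to a row index in a fixed-size embedding table."""
    return fnv1a_32(ngram) % num_buckets


trigrams = char_ngrams("where", 3, 3)
bucket_ids = [bucket(g) for g in trigrams]
```

Hashing into a fixed number of buckets keeps the embedding table bounded in size regardless of how many distinct n-grams occur, at the cost of occasional collisions.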
The model architecture is as follows:

The difference from Enriching Word Vectors with Subword Information is that there the model predicts words in order to learn word vectors, whereas here the output is a class (label). That's all!
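The architecture can be sketched as a forward pass in plain Python: look up an embedding for each input feature, average them into a hidden vector, apply a linear layer, and take a softmax over the classes. The names `forward`, `embeddings`, and `output_weights`, as well as the toy values, are assumptions made for this illustration and not the library's internals.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(feature_ids, embeddings, output_weights):
    """Average the feature embeddings, then score each class linearly."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for idx in feature_ids:
        for d in range(dim):
            hidden[d] += embeddings[idx][d]
    hidden = [h / len(feature_ids) for h in hidden]
    # One row of output_weights per class.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return hidden, softmax(scores)


# Toy parameters: 3 features with 2-dimensional embeddings, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_weights = [[1.0, 0.0], [0.0, 1.0]]
hidden, probs = forward([0, 2], embeddings, output_weights)
```

Because the hidden vector is an average, the order of the input features does not matter, which is what makes this a bag-of-features model.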

3. Implementation

The details are covered in the write-up of Enriching Word Vectors with Subword Information; here, a softmax over the classes is all that is needed.
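For the full-softmax case, one training step might look like the sketch below: compute the class probabilities, take the negative log-likelihood of the true label as the loss, and apply one step of stochastic gradient descent to both matrices. This is a simplified illustration under stated assumptions (full softmax rather than hierarchical softmax, plain SGD, hypothetical names and toy values), not the library's actual training loop.

```python
import math


def train_step(feature_ids, label, embeddings, output_weights, lr=0.1):
    """One SGD step on -log p(label); updates both matrices in place, returns the loss."""
    n, dim = len(feature_ids), len(embeddings[0])

    # Forward pass: averaged embeddings -> linear scores -> softmax.
    hidden = [sum(embeddings[i][d] for i in feature_ids) / n for d in range(dim)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])

    # Backward pass: d(loss)/d(scores) = probs - one_hot(label).
    dscores = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Gradient w.r.t. the hidden vector, computed before the weights change.
    dhidden = [
        sum(dscores[k] * output_weights[k][d] for k in range(len(output_weights)))
        for d in range(dim)
    ]
    for k, row in enumerate(output_weights):
        for d in range(dim):
            row[d] -= lr * dscores[k] * hidden[d]
    # Each input feature receives an equal 1/n share of the hidden gradient.
    for i in feature_ids:
        for d in range(dim):
            embeddings[i][d] -= lr * dhidden[d] / n
    return loss


# Toy parameters: 3 features, 2-dimensional embeddings, 2 classes.
embeddings = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
output_weights = [[0.1, 0.0], [0.0, 0.1]]
losses = [train_step([0, 1], 0, embeddings, output_weights, lr=0.5) for _ in range(50)]
```

Repeating the step on the same example should drive its loss down, which is a quick sanity check that the gradients have the right sign.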

4. Summary

This is far easier to implement than Enriching Word Vectors with Subword Information.

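The implementation is simple: use subwords to capture word-internal information, then apply a CBOW-style model and compute the loss against the labels.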