fastText: Text Classification

AI强仔

Last modified 2022-05-11 19:01:30

First published 2022-05-11 16:35:05

1 Introduction

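This post is a translated summary of the 2016 paper "Bag of Tricks for Efficient Text Classification", which introduces fastText, a fast text classifier. Using a standard multicore CPU, fastText can be trained on more than one billion words in under ten minutes, and it can classify half a million sentences among 312K classes in under a minute.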

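A simple and effective baseline for sentence classification is to represent each sentence as a bag of words (BoW) and train a linear classifier such as logistic regression or an SVM. However, linear classifiers do not share parameters among features and classes, which may limit their generalization, especially when a class has few training examples. The usual remedies are to factorize the linear classifier into low-rank matrices or to use a multilayer neural network.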

fastText, however, keeps using a linear classifier.

2 fastText

[Figure: fastText model architecture]

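fastText uses a simple linear model with a rank constraint. The objective is:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\bigl(f(BAx_n)\bigr)$$

where $x_n$ is the normalized bag of features of the $n$-th document and $A$ and $B$ are the weight matrices.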

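The first weight matrix A is a look-up table over the words. As the figure above shows, the n-gram features are embedded and averaged to form the text representation, which is then fed to a linear classifier. The text representation is a hidden variable that can be reused. The softmax function f is used for classification. For a set of N documents, the objective above is minimized, where y_n is the label of the n-th document.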
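The forward pass and loss described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the fastText implementation: the sizes, the random weights, and the toy documents are made up, and A is stored with one row per feature so that averaging rows plays the role of A x_n.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fasttext_forward(feature_ids, A, B):
    """Return class probabilities for one document.

    feature_ids: indices of the document's word / n-gram features.
    A: (num_features, h) look-up table, one embedding per feature.
    B: (k, h) weights of the linear classifier.
    """
    hidden = A[feature_ids].mean(axis=0)  # averaged embeddings = text representation
    return softmax(B @ hidden)

def negative_log_likelihood(docs, labels, A, B):
    """Mean of -log p(true label) over the documents."""
    losses = [-np.log(fasttext_forward(d, A, B)[y]) for d, y in zip(docs, labels)]
    return float(np.mean(losses))

# Toy setup (all values are assumptions for illustration).
rng = np.random.default_rng(0)
num_features, h, k = 20, 4, 3
A = rng.normal(scale=0.1, size=(num_features, h))
B = rng.normal(scale=0.1, size=(k, h))
docs = [[1, 5, 7], [2, 2, 9, 11]]
labels = [0, 2]

probs = fasttext_forward(docs[0], A, B)
loss = negative_log_likelihood(docs, labels, A, B)
print(probs, loss)
```

With small random weights the predicted distribution is close to uniform, so the loss starts near log(k); training would lower it by updating A and B.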

2.1 Hierarchical softmax

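When the number of classes is large, the linear classifier becomes expensive to compute. Its computational complexity is O(kh), where k is the number of classes and h is the dimension of the hidden text representation.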

To reduce the running time, we use a hierarchical softmax based on a Huffman coding tree, which lowers the computational complexity to $O(h \log_2 k)$.
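To see why a Huffman tree helps, the sketch below builds one from class frequencies and reports how deep each class sits in the tree. The depth is the number of binary decisions needed to reach that class, so frequent classes are cheaper to score. The class names and counts are invented for illustration, and the per-node probability computation of a real hierarchical softmax is not shown.

```python
import heapq
import itertools

def huffman_code_lengths(class_counts):
    """Return {label: depth of its leaf} for a Huffman tree built from counts."""
    tie = itertools.count()  # tie-breaker so heapq never has to compare dicts
    heap = [(count, next(tie), {label: 0}) for label, count in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf below moves one level down.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

counts = {"sports": 50, "politics": 30, "tech": 15, "cooking": 5}
depths = huffman_code_lengths(counts)

# Average number of node evaluations per example, weighted by class frequency.
total = sum(counts.values())
expected_depth = sum(counts[c] * depths[c] for c in counts) / total
print(depths, expected_depth)
```

In this toy case the most frequent class is reached after a single decision, while a flat softmax would score all four classes for every example.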

2.2 N-gram features

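A bag of words is invariant to word order, and modeling word order explicitly is often computationally expensive. Instead, we use a bag of n-grams as additional features, which captures some information about local word order while remaining efficient.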
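A small sketch of the idea: two sentences with the same words in a different order have identical unigram bags, but their bigrams differ. The helper names below are hypothetical and whitespace tokenization is an assumption made to keep the example short.

```python
def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Unigrams first, then bigrams, ... up to max_n-grams."""
    tokens = text.split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features

a = bag_of_ngrams("dog bites man")
b = bag_of_ngrams("man bites dog")

print(a)  # unigrams followed by bigrams
print(b)
print(sorted(a[:3]) == sorted(b[:3]))  # same unigrams
print(set(a) == set(b))                # different once bigrams are included
```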

3 Experimental results

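Table 1 shows that fastText's accuracy is competitive, only slightly below that of the very deep CNN (VDCNN). Table 2 shows that fastText's training time is very short.
[Image: results tables (Table 1: accuracy; Table 2: training time)]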

4 Usage

4.1 train_unsupervised: learning word vectors

import fasttext

# Skipgram model:
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, CBOW model:
model = fasttext.train_unsupervised('data.txt', model='cbow')

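Parameters (the names and defaults below match the older fasttext Python wrapper; the current official module spells several of them in camelCase, for example input, minCount, wordNgrams, lrUpdateRate, label, verbose and pretrainedVectors, so check the documentation of the installed version):

input_file                 training file path (required)
output                     output file path (required)
label_prefix               label prefix, default __label__
lr                         learning rate, default 0.1
lr_update_rate             learning-rate update rate, default 100
dim                        word vector dimension, default 100
ws                         context window size, default 5
epoch                      number of epochs, default 5
min_count                  minimum word frequency, default 5
word_ngrams                maximum word n-gram length, default 1
loss                       loss function {ns,hs,softmax}, default softmax
minn                       minimum character n-gram length, default 0
maxn                       maximum character n-gram length, default 0
thread                     number of threads, default 12
t                          sampling threshold, default 0.0001
silent                     disable log output from the C++ extension, default 1
encoding                   encoding of input_file, default utf-8
pretrained_vectors         existing word vectors (.vec file) to start from, default None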

4.2 train_supervised: text classification

model = fasttext.train_supervised('data.txt')
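train_supervised expects one example per line, with each label marked by a prefix (__label__ by default) followed by the text. The sketch below shows one way to produce such a file and then train, evaluate, and predict. The helper names, file paths, sample sentence, and hyperparameter values are assumptions chosen for illustration, not recommended settings.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one example: prefixed labels first, then the text on a single line."""
    clean = " ".join(text.split())  # collapse tabs/newlines so the example stays on one line
    tags = " ".join(prefix + label for label in labels)
    return f"{tags} {clean}"

def write_training_file(examples, path):
    """Write (labels, text) pairs to a file that train_supervised can read."""
    with open(path, "w", encoding="utf-8") as f:
        for labels, text in examples:
            f.write(to_fasttext_line(labels, text) + "\n")

def train_and_evaluate(train_path, valid_path):
    """Train a classifier, evaluate it, and predict labels for one sentence."""
    import fasttext  # imported lazily so the helpers above work without the library

    model = fasttext.train_supervised(
        input=train_path,
        epoch=25,        # illustrative values, tune for your data
        lr=0.5,
        wordNgrams=2,    # add word bigrams as features (section 2.2)
        loss="hs",       # hierarchical softmax (section 2.1)
    )
    n, precision, recall = model.test(valid_path)
    labels, probs = model.predict("which baking dish is best for banana bread ?", k=2)
    return model, (n, precision, recall), (labels, probs)

line = to_fasttext_line(["sports", "news"], "Local team\twins the cup")
print(line)
```

Calling train_and_evaluate("train.txt", "valid.txt") (hypothetical paths) would return the model together with the test metrics and the top two predicted labels.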

4.3 Methods available on a trained model

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model, reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.
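As a rough usage sketch, several of the methods listed above can be combined to inspect a trained model. describe_model is a hypothetical helper, and it assumes the model object came from train_supervised; note that predict expects a single line of text without a newline.

```python
def describe_model(model, sentence, k=3):
    """Collect a few facts about a trained model using the methods listed above."""
    labels, probs = model.predict(sentence, k=k)
    return {
        "dimension": model.get_dimension(),
        "num_labels": len(model.get_labels()),
        "sentence_vector_size": len(model.get_sentence_vector(sentence)),
        "top_predictions": [(label, float(p)) for label, p in zip(labels, probs)],
    }
```

For a supervised model, sentence_vector_size should equal dimension, and top_predictions pairs each returned label with its probability.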