fastText Text Classification: Deep Learning Series, the fastText Model & Help Documentation

Latest recommended article published on 2024-09-08 14:29:16

weixin_39684454

Tags: fasttext, text classification

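Abstract

Following the introduction to common NLP/CV datasets [1], this article introduces a lightweight, simple model that nonetheless performs quite well: the fastText model [2]. fastText was open-sourced by Facebook AI Research in 2016. It is light and fast: its accuracy is roughly on par with textCNN or slightly better, while its training time is several times to tens of times shorter than that of other models. That makes it well worth recommending for simple industrial use cases.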

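1 How the fastText model works

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016) [2] proposed the fastText model. Its architecture closely resembles the CBOW model in Word2Vec; the difference is that fastText predicts a label, whereas CBOW predicts the center word. Training a fastText text classifier generally also yields word embeddings, that is, the embeddings are a by-product of fastText classification.

As Figure 1 shows, fastText also has a three-layer architecture: an input layer, a hidden layer, and an output layer (hierarchical softmax). The input consists of the words of a document together with their n-gram features. These features represent a single document, and the whole text is used as the feature set for predicting the document's category.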
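
The three layers described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function name fasttext_forward, the toy embedding table and the output weights are all made up for the example, and a flat softmax stands in for the hierarchical softmax that fastText uses at the output.

```python
import math

def fasttext_forward(feature_ids, embeddings, output_weights):
    """Average the feature embeddings, apply a linear layer, then softmax."""
    dim = len(embeddings[0])
    # Hidden layer: element-wise mean of the embeddings of all input features
    # (words and n-grams are both just rows of the embedding table here).
    hidden = [
        sum(embeddings[i][d] for i in feature_ids) / len(feature_ids)
        for d in range(dim)
    ]
    # Output layer: one score per label.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    # Numerically stable flat softmax (an assumption made to keep the sketch short).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy, hand-picked numbers: 4 features with 2-dimensional embeddings, 2 labels.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
output_weights = [[1.0, 0.0], [0.0, 1.0]]
probs = fasttext_forward([0, 2], embeddings, output_weights)
print(probs)
```

The only per-document work is one average and one matrix-vector product, which is a large part of why the model is cheap to train and to serve.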

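In my view, it is fast for the following reasons:

(1) The model has only three layers in total, so its structure is simple;

(2) The text representation is simply the average of the feature vectors;

(3) At the output, fastText uses hierarchical softmax, which greatly reduces training time;

(4) The fastText we call directly is a word-vector and text-classification tool that Facebook open-sourced in 2016, and the project is written in C++.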
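
To make point (3) more concrete: the paper [2] builds the hierarchical softmax on a Huffman coding tree, so a frequent label sits near the root and needs fewer binary decisions than a rare one. The sketch below only computes the tree depth of each label from its frequency; the function name huffman_depths and the label counts are hypothetical, and the real library stores the tree differently.

```python
import heapq
from itertools import count

def huffman_depths(label_counts):
    """Return {label: depth in a Huffman tree built from label frequencies}."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), [label]) for label, freq in label_counts.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in label_counts}
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label below moves one level down.
        f1, _, labels1 = heapq.heappop(heap)
        f2, _, labels2 = heapq.heappop(heap)
        for label in labels1 + labels2:
            depths[label] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), labels1 + labels2))
    return depths

# Hypothetical label frequencies in a training set.
label_counts = {"finance": 50, "sports": 30, "tech": 15, "travel": 5}
depths = huffman_depths(label_counts)
print(depths)
```

With these counts the most frequent label is scored with a single binary decision, while the two rarest labels need three. With thousands of labels the average path length grows roughly logarithmically rather than linearly in the number of labels.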

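Figure 1

2 Using fastText

fasttext is the Python interface to Facebook's fastText. The package has two main uses: word vector representation and text classification. A training line for classification looks like this (a Chinese finance headline followed by its label, 证券, "securities"):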

财经:资管、私募、信托，傻傻分不清楚！__label__证券
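
As a rough illustration of how such lines might be produced, the helper below joins a text with its prefixed labels. The function name to_fasttext_line is hypothetical. The sketch assumes that labels should be whitespace-separated tokens carrying the label prefix, and that Chinese text has already been segmented into space-separated words by some external tokenizer.

```python
def to_fasttext_line(text, labels, label_prefix="__label__"):
    """Build one training line: normalized text followed by prefixed labels."""
    # Collapse newlines and repeated spaces so each example stays on one line.
    clean_text = " ".join(text.split())
    tags = " ".join(label_prefix + label for label in labels)
    return f"{clean_text} {tags}"

line = to_fasttext_line("example  very long\ntext 1", ["news", "finance"])
print(line)
```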

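According to the figures reported in the paper, fastText does very well on both speed (Figure 2) and accuracy (Figure 3) [2]. Its speed and small model size in particular suit online tasks with strict average-response-time requirements. It is a good choice for simple tasks and for obtaining embeddings with which to initialize text embeddings.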

(1) Word vector representation

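import fasttext
# Legacy fasttext Python API (0.8.x); newer releases replace these calls with fasttext.train_unsupervised
# skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # the words in the dictionary
# cbow model
model = fasttext.cbow('data.txt', 'model')
print(model.words)  # the words in the dictionary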

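Here data.txt is a UTF-8 encoded file, and the default character n-gram range is 3 to 6.

Training writes two files: model.bin and model.vec.

model.vec is a text file with one word vector per line; model.bin is a binary file containing the dictionary, the model, and all hyperparameters.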

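Getting the vector of an out-of-vocabulary (OOV) word, which is built from the word's character n-grams:
print(model['king'])  # vector for the word 'king'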
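
The reason an OOV word still gets a vector is that fastText represents a word through its character n-grams. The sketch below lists those n-grams for one word. The function name char_ngrams is hypothetical; the "<" and ">" boundary markers follow the convention described for fastText subword models, and the real library additionally hashes each n-gram into a bucket rather than keeping the strings.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("where", minn=3, maxn=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word such as a misspelling shares most of these n-grams with its correctly spelled neighbours, so summing the n-gram vectors still gives a usable representation.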

The model can be rebuilt from model.bin as follows:
model = fasttext.load_model('model.bin')
print(model.words)  # list of words in dictionary
print(model['king'])  # get the vector of the word 'king'

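(2) Text classification

Usage:
classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')  # the method used by the original author
Here data.train.txt is a text file in which every line contains its label(s); the default label prefix is __label__.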

Once the model is trained, it can be evaluated on a test set:
result = classifier.test('test.txt')
print('P@1:', result.precision)  # precision at 1
print('R@1:', result.recall)  # recall at 1
print('Number of examples:', result.nexamples)  # number of test examples
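
For readers unfamiliar with P@k and R@k, the following sketch shows one common way to compute them: precision divides the number of correct predictions by the number of predictions made, and recall divides it by the number of gold labels. The function name and the toy data are hypothetical, and this is an illustration of the metric rather than the library's own evaluation code.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists per example; gold: sets of true labels."""
    n_correct = n_predicted = n_gold = 0
    for ranked_labels, true_labels in zip(predicted, gold):
        top_k = ranked_labels[:k]
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return n_correct / n_predicted, n_correct / n_gold

# Three toy examples; the second one has two gold labels.
predicted = [["a", "b"], ["b", "a"], ["c", "a"]]
gold = [{"a"}, {"a", "b"}, {"b"}]
p_at_1, r_at_1 = precision_recall_at_k(predicted, gold, k=1)
print(p_at_1, r_at_1)
```

When every example has exactly one gold label, P@1 and R@1 coincide and both equal plain accuracy.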

A trained model can also be used for prediction:
texts = ['example very long text 1', 'example very long text 2']
labels = classifier.predict(texts)
print(labels)  # a nested list, one list of labels per input text: [[labels]]

# or with the probabilities included
labels = classifier.predict_proba(texts)
print(labels)

The k most likely labels can also be returned:
labels = classifier.predict(texts, k=3)
print(labels)

# with the probabilities included
labels = classifier.predict_proba(texts, k=3)
print(labels)
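
Conceptually, asking for the k most likely labels just means ranking the label probabilities and keeping the first k. A minimal sketch follows, assuming the probabilities for one text are already available as a dictionary; the function name top_k_labels and the numbers are hypothetical.

```python
import heapq

def top_k_labels(probabilities, k=1):
    """Return the k most probable (label, probability) pairs, best first."""
    return heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])

# Hypothetical probabilities for a single text.
probabilities = {"__label__finance": 0.6, "__label__sports": 0.1, "__label__tech": 0.3}
top2 = top_k_labels(probabilities, k=2)
print(top2)
```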

3 fasttext Python help documentation

fasttext can train word vectors in two ways, skipgram and cbow. The parameter help for the commonly used models is listed below:

(1) skipgram(params)

input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]

Example: model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)
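
The sampling threshold t is the least self-explanatory parameter in the list above. As I understand the fastText source, it controls subsampling of frequent words during skipgram/cbow training: a word with relative frequency f is kept with probability sqrt(t / f) + t / f, capped at 1. The sketch below is written under that assumption; the function name keep_probability and the counts are hypothetical.

```python
import math

def keep_probability(word_count, total_tokens, t=1e-4):
    """Probability of keeping one occurrence of a word during training.

    Assumes the rule sqrt(t / f) + t / f, capped at 1, where f is the
    word's relative frequency in the corpus.
    """
    f = word_count / total_tokens
    return min(1.0, math.sqrt(t / f) + t / f)

frequent = keep_probability(10_000, 1_000_000)  # f = 0.01
rare = keep_probability(1, 1_000_000)           # f = 0.000001
print(frequent, rare)
```

Under this rule a word that makes up 1% of the corpus is kept only about 11% of the time, while rare words are always kept, so lowering t discards very frequent words more aggressively.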

(2) cbow(params)

input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]

Example: model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)
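
The bucket parameter bounds the memory used by n-gram features: instead of storing every distinct n-gram, each one is hashed into one of a fixed number of slots. The sketch below uses the standard 32-bit FNV-1a hash, which to my understanding is the family of hash fastText uses. Treat the details as assumptions: the real implementation handles non-ASCII bytes slightly differently and offsets the index by the vocabulary size, and the function names here are hypothetical.

```python
def fnv1a_32(text):
    """Standard 32-bit FNV-1a hash of the UTF-8 bytes of a string."""
    h = 2166136261                            # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF       # FNV prime, wrapped to 32 bits
    return h

def ngram_bucket(ngram, bucket=2_000_000):
    """Map an n-gram to one of `bucket` shared embedding slots."""
    return fnv1a_32(ngram) % bucket

slot = ngram_bucket("<wh")
print(slot)
```

Different n-grams can collide in the same slot; a larger bucket value reduces collisions at the cost of a bigger model file.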

(3) Attributes of the model returned by skipgram and cbow

model.model_name       # Model name
model.words            # List of words in the dictionary
model.dim              # Size of word vector
model.ws               # Size of context window
model.epoch            # Number of epochs
model.min_count        # Minimal number of word occurrences
model.neg              # Number of negative sampled
model.word_ngrams      # Max length of word ngram
model.loss_name        # Loss function name
model.bucket           # Number of buckets
model.minn             # Min length of char ngram
model.maxn             # Max length of char ngram
model.lr_update_rate   # Rate of updates for the learning rate
model.t                # Value of sampling threshold
model.encoding         # Encoding of the model
model[word]            # Get the vector of specified word

(4) supervised(params)

input_file             training file path (required)
output                 output file path (required)
label_prefix           label prefix ['__label__']
lr                     learning rate [0.1]
lr_update_rate         change the rate of updates for the learning rate [100]
dim                    size of word vectors [100]
ws                     size of the context window [5]
epoch                  number of epochs [5]
min_count              minimal number of word occurrences [1]
neg                    number of negatives sampled [5]
word_ngrams            max length of word ngram [1]
loss                   loss function {ns, hs, softmax} [softmax]
bucket                 number of buckets [0]
minn                   min length of char ngram [0]
maxn                   max length of char ngram [0]
thread                 number of threads [12]
t                      sampling threshold [0.0001]
silent                 disable the log output from the C++ extension [1]
encoding               specify input_file encoding [utf-8]
pretrained_vectors     pretrained word vectors (.vec file) for supervised learning []

Example: classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__', thread=4)
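
For classification, word_ngrams is often the parameter worth tuning first, because word n-grams bring back some of the word-order information that averaging throws away. The sketch below shows which features a given setting adds. The function name word_ngrams_of is hypothetical, and the real library hashes these n-grams into buckets instead of storing them as strings.

```python
def word_ngrams_of(tokens, max_n=2):
    """All word n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams_of(["very", "long", "text"], max_n=2)
print(features)  # ['very', 'long', 'text', 'very long', 'long text']
```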

(5) Attributes and methods of the model returned by supervised

classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vector
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negative sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely labels
classifier.predict_proba(texts, k) # Predict the most likely labels together with their probabilities