How fastText text classification works
1. Tokenize the N documents to build a word vocabulary.
2. Extend the vocabulary with word-level / character-level n-grams (a hashing trick keeps the vocabulary from exploding).
3. For a single document, collect the index vector of its words and n-grams.
4. Embed that index vector to obtain the document's sequence of word embeddings.
5. Average the embedding sequence over the words.
6. Feed the averaged vector into a hierarchical softmax (a variant of multi-class softmax).
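The six steps can be sketched in Python. This is a toy illustration, not fastText's implementation: the bucket count and embedding dimension are shrunk and made up for the demo, the embeddings are untrained, and a plain softmax stands in for the hierarchical softmax fastText uses for speed:

```python
import numpy as np

HASH_BUCKETS = 100_000   # hashing trick caps the n-gram vocabulary (toy size)
EMBED_DIM = 100
NUM_LABELS = 735         # label count from the cooking dataset below

rng = np.random.default_rng(0)
# steps 1-2: words and hashed n-grams share one (untrained) embedding table
embeddings = rng.normal(0.0, 0.1, (HASH_BUCKETS, EMBED_DIM))
W_out = rng.normal(0.0, 0.1, (EMBED_DIM, NUM_LABELS))  # output layer

def ngram_ids(tokens, n=2):
    """Step 2: map each word n-gram to a bucket via the hashing trick."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [hash(g) % HASH_BUCKETS for g in grams]

def classify(text):
    tokens = text.lower().split()                    # tokenize
    ids = [hash(t) % HASH_BUCKETS for t in tokens]   # step 3: word indices
    ids += ngram_ids(tokens)                         # step 3: n-gram indices
    doc_vec = embeddings[ids].mean(axis=0)           # steps 4-5: embed, average
    logits = doc_vec @ W_out                         # step 6: plain softmax here
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(probs.argmax())                       # predicted label id
```

With random weights the prediction is meaningless, but the data flow matches the six steps above.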
A fastText text classification and tuning walkthrough
Install fastText: https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Download the dataset (cooking.stackexchange.tar.gz): https://download.csdn.net/download/cymy001/11222071
About the dataset: it consists of example questions from the cooking section of the Stack Exchange website together with their category labels; we will build a classifier on it that automatically recognizes the category of a cooking question. In the text file extracted from cooking.stackexchange.tar.gz, each line contains a list of labels (every label starts with the __label__ prefix) followed by the corresponding document. We train a model to predict the labels of a given document.
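The label format is easy to parse; the helper below (a hypothetical utility, not part of fastText) splits one line in the dataset's format into its labels and its text:

```python
def parse_line(line):
    """Split a fastText-format line into (labels, text)."""
    tokens = line.strip().split()
    labels = [t[len("__label__"):] for t in tokens if t.startswith("__label__")]
    text = " ".join(t for t in tokens if not t.startswith("__label__"))
    return labels, text

labels, text = parse_line(
    "__label__sauce __label__cheese "
    "How much does potato starch affect a cheese sauce recipe ?")
print(labels)  # ['sauce', 'cheese']
```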
Step 1: baseline
# Inspect the full dataset
prodeMBP:fastText_learn pro$ wc cooking.stackexchange.txt
15404 169582 1401900 cooking.stackexchange.txt
# Split off a training set and a validation set; the validation set estimates how well the learned classifier generalizes to new data
prodeMBP:fastText_learn pro$ head -n 12404 cooking.stackexchange.txt > cooking.train
prodeMBP:fastText_learn pro$ tail -n 3000 cooking.stackexchange.txt > cooking.valid
# Train the model (-input takes the training data, -output sets the path and file name for the saved model)
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking.train -output model_cooking
Read 0M words
Number of words: 14543
Number of labels: 735
Progress: 0.1% words/sec/thread: 50053 lr: 0.099944 loss: 15.778889
Progress: 0.1% words/sec/thread: 77722 lr: 0.099885 loss: 16.707058
Progress: 0.2% words/sec/thread: 94088 lr: 0.099843 loss: 17.041201
...
Progress: 99.1% words/sec/thread: 81627 lr: 0.000851 loss: 9.994794
Progress: 100.0% words/sec/thread: 81663 lr: 0.000000 loss: 9.992051 eta: 0h0m
# Try the trained model file model_cooking.bin on two examples
# (1) predict 1 label
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext predict model_cooking.bin -
Which baking dish is best to bake a banana bread ?
__label__baking
Why not put knives in the dishwasher?
__label__food-safety
# (2) predict 5 labels
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext predict model_cooking.bin - 5
Why not put knives in the dishwasher?
__label__food-safety __label__baking __label__bread __label__equipment __label__substitutions
On Stack Exchange this question carries three labels: equipment, cleaning, and knives. One of the five labels predicted by the model is correct, so precision is 0.20. Of the three true labels, only equipment was predicted by the model, so recall is 0.33.
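The arithmetic behind those two numbers, spelled out (using the label sets quoted above):

```python
predicted = {"food-safety", "baking", "bread", "equipment", "substitutions"}
true_labels = {"equipment", "cleaning", "knives"}

hits = len(predicted & true_labels)   # only 'equipment' is in both sets
precision = hits / len(predicted)     # 1 / 5 = 0.20
recall = hits / len(true_labels)      # 1 / 3 ≈ 0.33

print(precision, round(recall, 2))    # 0.2 0.33
```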
# Evaluate model_cooking.bin on the whole validation set; P@i/R@i are the precision/recall when predicting i labels
# (1) predict 1 label
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking.bin cooking.valid
N 3000
P@1 0.149
R@1 0.0646
Number of examples: 3000
# (2) predict 5 labels
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking.bin cooking.valid 5
N 3000
P@5 0.067
R@5 0.145
Number of examples: 3000
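`fasttext test` reports those per-example numbers averaged over the whole file. A minimal re-implementation sketch (the function name and the two made-up examples are illustrative, not fastText code):

```python
def precision_recall_at_k(gold, predictions, k):
    """gold: list of true-label sets; predictions: list of ranked label lists."""
    p_sum = r_sum = 0.0
    for true_labels, ranked in zip(gold, predictions):
        hits = len(true_labels & set(ranked[:k]))
        p_sum += hits / k                  # precision@k for this example
        r_sum += hits / len(true_labels)   # recall@k for this example
    n = len(gold)
    return p_sum / n, r_sum / n

gold = [{"equipment", "cleaning", "knives"}, {"baking"}]
preds = [["food-safety", "baking", "bread", "equipment", "substitutions"],
         ["baking", "bread", "oven", "cake", "dough"]]
p5, r5 = precision_recall_at_k(gold, preds, k=5)
```

This also shows why P@5 can be low while R@5 is higher: with only a few true labels per question, most of the five predictions are necessarily counted as wrong.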
Step 2: preprocessing the data
# Normalize the text: the raw words contain uppercase letters and punctuation
prodeMBP:fastText_learn pro$ cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed_1.txt
prodeMBP:fastText_learn pro$ head -n 12404 cooking.preprocessed_1.txt > cooking_1.train
prodeMBP:fastText_learn pro$ tail -n 3000 cooking.preprocessed_1.txt > cooking_1.valid
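The sed/tr pipeline above (pad the listed punctuation with spaces, then lowercase) can be mirrored in Python roughly as:

```python
import re

def normalize(line):
    # put spaces around the same punctuation the sed command targets
    line = re.sub(r"([.!?,'/()])", r" \1 ", line)
    return line.lower()

print(normalize("Why not put knives in the dishwasher?"))
```

Separating punctuation from words matters because fastText tokenizes on whitespace: without it, "dishwasher?" and "dishwasher" would be two different vocabulary entries.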
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_1
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 0.0% words/sec/thread: 85227 lr: 0.099987 loss: 15.532344
Progress: 0.1% words/sec/thread: 125621 lr: 0.099921 loss: 16.70705
···
Progress: 99.4% words/sec/thread: 88113 lr: 0.000576 loss: 9.895709
Progress: 100.0% words/sec/thread: 96064 lr: 0.000000 loss: 9.894776 eta: 0h0m
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_1.bin cooking_1.valid
N 3000
P@1 0.177
R@1 0.0767
Number of examples: 3000
Step 3: hyperparameter tuning
# Add the -epoch option, the number of times each example is seen during training.
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_2 -epoch 25
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 0.0% words/sec/thread: 41040 lr: 0.100000 loss: 15.532344
Progress: 0.0% words/sec/thread: 101927 lr: 0.099987 loss: 16.70705
···
Progress: 100.0% words/sec/thread: 88445 lr: 0.000026 loss: 7.17666
Progress: 100.0% words/sec/thread: 88444 lr: 0.000000 loss: 7.176500 eta: 0h0m
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_2.bin cooking_1.valid
N 3000
P@1 0.519
R@1 0.225
Number of examples: 3000
# Add the -lr option to change the learning rate, i.e. how much the model changes after processing each example.
# A learning rate of 0 means the model never changes and learns nothing; good learning rates lie in the range 0.1 - 1.0.
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_3 -lr 1.0
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 0.6% words/sec/thread: 83356 lr: 0.994433 loss: 15.532344
Progress: 0.6% words/sec/thread: 84618 lr: 0.994055 loss: 14.675364
Progress: 0.7% words/sec/thread: 81756 lr: 0.993252 loss: 13.116207
···
Progress: 99.6% words/sec/thread: 88352 lr: 0.004163 loss: 6.479478
Progress: 100.0% words/sec/thread: 88351 lr: 0.000000 loss: 6.479478 eta: 0h0m
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_3.bin cooking_1.valid
N 3000
P@1 0.568
R@1 0.245
Number of examples: 3000
# Use -lr and -epoch together
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_4 -lr 1.0 -epoch 25
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 0.0% words/sec/thread: 75736 lr: 0.999973 loss: 15.532344
Progress: 0.0% words/sec/thread: 76247 lr: 0.999843 loss: 16.707058
Progress: 0.0% words/sec/thread: 90871 lr: 0.999761 loss: 17.011614
···
Progress: 99.9% words/sec/thread: 88333 lr: 0.001316 loss: 4.354121
Progress: 100.0% words/sec/thread: 88332 lr: 0.000000 loss: 4.353338 eta: 0h0m
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_4.bin cooking_1.valid
N 3000
P@1 0.591
R@1 0.255
Number of examples: 3000
# Add word 2-grams
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext supervised -input cooking_1.train -output model_cooking_5 -lr 1.0 -epoch 25 -wordNgrams 2
Read 0M words
Number of words: 8952
Number of labels: 735
Progress: 0.0% words/sec/thread: 81584 lr: 0.999974 loss: 15.532344
Progress: 0.0% words/sec/thread: 158995 lr: 0.999657 loss: 16.70705
Progress: 0.0% words/sec/thread: 96518 lr: 0.999574 loss: 17.011614
Progress: 0.1% words/sec/thread: 87310 lr: 0.999470 loss: 17.178629
···
Progress: 99.8% words/sec/thread: 87492 lr: 0.001646 loss: 3.171684
Progress: 100.0% words/sec/thread: 87481 lr: 0.000000 loss: 3.171049 eta: 0h0m
prodeMBP:fastText_learn pro$ ./fastText-0.1.0/fasttext test model_cooking_5.bin cooking_1.valid
N 3000
P@1 0.613
R@1 0.265
Number of examples: 3000
The key steps that improved precision from 14.9% to 61.3%:
preprocessing the data;
tuning the number of epochs (option -epoch, typical range [5 - 50]);
tuning the learning rate (option -lr, typical range [0.1 - 1.0]);
using word n-grams (option -wordNgrams, typical range [1 - 5]).
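One convenient way to explore these ranges is a grid search over the CLI flags; the loop below only assembles the training commands (the value grids and output model names are illustrative, not recommendations):

```python
from itertools import product

epochs = [5, 25, 50]
lrs = [0.1, 0.5, 1.0]
ngrams = [1, 2, 3]

# build one `fasttext supervised` command per hyperparameter combination
commands = []
for epoch, lr, wn in product(epochs, lrs, ngrams):
    commands.append(
        f"./fastText-0.1.0/fasttext supervised -input cooking_1.train "
        f"-output model_e{epoch}_lr{lr}_wn{wn} "
        f"-epoch {epoch} -lr {lr} -wordNgrams {wn}")

print(len(commands))  # 27 configurations
```

Each command would then be run and scored with `fasttext test` on cooking_1.valid, keeping the model with the best P@1.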