[Hands-On NLP] Tearing Through the FastText Source Code (01): The Classifier's Prediction Process

This is an original article by the author; reproduction without the author's permission is prohibited.

Article link: https://blog.csdn.net/qq_28739605/article/details/104212532

Author: LogM

This article was originally published at https://segmentfault.com/u/logm/articles ; reposting is not permitted.

1. Source code

FastText source code: https://github.com/facebookre...

Source version covered in this article: Commits on Jun 27 2019, 979d8a9ac99c731d653843890c2364ade0f7d9d3

FastText papers:

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

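2. Overview

The FastText papers are fairly brief. Some details are left unexplained and I could not find answers online, so I decided to simply read the source code.

The classifier is FastText's most widely used feature, so we start digging at the classifier's predict path.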

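3. Diving in

Start with the main function at the program's entry point. As expected, it dispatches on the command name and calls the predict function.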
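Before reading the real listing, here is a minimal sketch of the dispatch pattern on its own. It is not fastText code: `dispatch` is a hypothetical helper that returns the name of the handler instead of calling it, and it covers only a subset of commands. The `predict` / `predict-prob` branch reflects my understanding of the upstream command line and should be checked against src/main.cc.

```cpp
#include <cassert>  // only needed if you try the assertion from the note below
#include <string>
#include <vector>

// Hypothetical helper, not part of fastText: maps a command line to the name
// of the handler that main would call. Only a subset of commands is modeled.
std::string dispatch(const std::vector<std::string>& args) {
  // args[0] is the program name, args[1] is the command.
  if (args.size() < 2) {
    return "usage";
  }
  const std::string& command = args[1];
  if (command == "skipgram" || command == "cbow" || command == "supervised") {
    return "train";
  }
  if (command == "test" || command == "test-label") {
    return "test";
  }
  // Assumption: both spellings are routed to the same predict handler.
  if (command == "predict" || command == "predict-prob") {
    return "predict";
  }
  // Unknown commands fall back to the usage message in this sketch.
  return "usage";
}
```

For example, `assert(dispatch({"fasttext", "predict", "model.bin", "test.txt"}) == "predict");` holds, where `model.bin` and `test.txt` are placeholder file names. The actual main function from the source follows.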

// File: src/main.cc
// Line: 403
int main(int argc, char** argv) {
  std::vector<std::string> args(argv, argv + argc);
  if (args.size() < 2) {
    printUsage();
    exit(EXIT_FAILURE);
  }
  std::string command(args[1]);
  if (command == "skipgram" || command == "cbow" || command == "supervised") {
    train(args);
  } else if (command == "test" || command == "test-label") {
    test(args);
  } else if (command == "quantize") {
    quantize(args);
  } else if (command == "print-word-vectors") {
    printWordVectors(args);
  } else if (co