How fastText Processes N-grams

lainegates

Category: Deep Learning. Tags: fastText, wordNgram, ngrams

Article link: https://blog.csdn.net/lainegates/article/details/77839847

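I recently spent some time studying how fastText works and learned quite a bit. This post covers only how fastText handles n-grams. The rest, such as training, is very similar to word2vec; interested readers can refer to the post "fastText source code analysis".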

Background

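A few points first:
(1) N-grams rest on the premise that the data is sparse. This is extremely important, and everything that follows depends on it;
(2) the hidden layer and the output layer that fastText uses have the same sizes in training and in prediction;
(3) fastText does not deliberately store the n-grams themselves; it only hashes them and uses the hash as a lookup index;
(4) fastText has two kinds of n-grams: the first is used for classification (supervised, which requires predefined labels), and the second is used for word vectors (cbow and skipgram).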
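Point (3) can be sketched as follows. This is a minimal illustration, not the library's code: `fnv1a` is assumed to mirror the 32-bit FNV-1a scheme that `Dictionary::hash` follows, and `ngramRow` is a hypothetical helper name introduced here. It shows that only a row index survives, never the n-gram string itself.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a over the bytes of `s` (assumed to mirror Dictionary::hash).
uint32_t fnv1a(const std::string& s) {
  uint32_t h = 2166136261u;  // FNV offset basis
  for (char c : s) {
    h ^= static_cast<uint32_t>(static_cast<int8_t>(c));
    h *= 16777619u;          // FNV prime; wraps modulo 2^32
  }
  return h;
}

// Hypothetical helper: map an n-gram to a row of the input matrix.
// Rows [0, nwords) belong to whole words; rows [nwords, nwords + bucket)
// are shared by all hashed n-grams, so distinct n-grams may collide.
int64_t ngramRow(const std::string& ngram, int32_t nwords, int32_t bucket) {
  assert(bucket > 0);
  return static_cast<int64_t>(nwords) +
         fnv1a(ngram) % static_cast<uint32_t>(bucket);
}
```

Because the string is discarded after hashing, two different n-grams can land on the same row. That is tolerable only when n-grams are sparse, which is why point (1) matters.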

Now for the main part, with the code as evidence.
We start with reading the data.

void Dictionary::readFromFile(std::istream& in) {
  std::string word;
  int64_t minThreshold = 1;
  while (readWord(in, word)) { // readWord reads one token per call; spaces and common non-printing characters act as separators
    add(word);
    if (ntokens_ % 1000000 == 0 && args_->verbose > 1) {
      std::cerr << "\rRead " << ntokens_  / 1000000 << "M words" << std::flush;
    }
    if (size_ > 0.75 * MAX_VOCAB_SIZE) {  // keep the total number of words and labels below the limit
      minThreshold++;
      threshold(minThreshold, minThreshold);  // over the limit: drop low-frequency entries
    }
  }
  threshold(args_->minCount, args_->minCountLabel);
  initTableDiscard();
  initNgrams();
  ... // code that prints statistics omitted
}
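The pruning loop above can be imitated with a plain hash map. The sketch below is an assumption-laden stand-in, not fastText's implementation: `pruneBelow` and `addToken` are hypothetical names, a single count map replaces the dictionary, and words and labels are not distinguished.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using Counts = std::unordered_map<std::string, int64_t>;

// Hypothetical stand-in for Dictionary::threshold:
// erase every entry seen fewer than minCount times.
void pruneBelow(Counts& counts, int64_t minCount) {
  for (auto it = counts.begin(); it != counts.end();) {
    if (it->second < minCount) {
      it = counts.erase(it);
    } else {
      ++it;
    }
  }
}

// Hypothetical stand-in for one iteration of the read loop: count the
// token, and once the table is more than 75% full, raise the bar and prune.
void addToken(Counts& counts, const std::string& token,
              std::size_t maxVocab, int64_t& minThreshold) {
  assert(maxVocab > 0);
  ++counts[token];
  if (counts.size() > 0.75 * maxVocab) {
    ++minThreshold;
    pruneBelow(counts, minThreshold);
  }
}
```

With `maxVocab = 4`, feeding `a a a b c d` leaves only `a`: the fourth distinct token pushes the table past 75% of capacity, the threshold rises to 2, and every token seen once is dropped.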

N-grams used for word vectors

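First, let's look at how the subwords of a word are computed.
The code below has to handle UTF-8 characters; for details see the post "A Brief Introduction to How UTF-8 Encoding Works".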

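// This function is called only by Dictionary::getSubword(...)
// Dictionary::getSubword(...) is used to train word-vector models (skipgram and cbow),
// to output word vectors (print-word-vectors and print-sentence-vectors), and to compute word similarity (nn and analogies)
// !!! In other words, in the source version examined here, the n-grams computed by this function are unrelated to classification !!!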
void Dictionary::computeSubwords(const std::string& word,
                               std::vector<int32_t>& ngrams) const {
  for (size_t i = 0; i < word.size(); i++) {
    std::string ngram;
    if ((word[i] & 0xC0) == 0x80) continue;
    for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
      ngram.push_back(word[j++]);
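      // handle UTF-8: append continuation bytes (10xxxxxx) so a multi-byte character stays whole; see the post mentioned above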
      while (j < word.size() && (word[j] & 0xC0) == 0x80) { 
        ngram.push_back(word[j++]);
      }
      // emit the n-gram if it is long enough and is not a lone boundary character
      if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
        int32_t h = hash(ngram) % args_->bucket;
        ngrams.push_back(nwords_ + h); // this id is what is later used to look up the subword
      }
    }
  }
}
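To see which character n-grams the loop above actually produces, here is a sketch that returns the n-gram strings instead of their hashed ids. `subwordStrings` is a hypothetical name, and `minn`/`maxn` are passed as plain arguments instead of being read from `args_`. The input is assumed to already carry the `<` and `>` boundary markers that fastText wraps around a word before computing subwords.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant of computeSubwords that collects the n-gram
// strings themselves, so the enumeration can be inspected directly.
std::vector<std::string> subwordStrings(const std::string& word,
                                        int minn, int maxn) {
  assert(0 <= minn && minn <= maxn);
  std::vector<std::string> out;
  const std::size_t lo = static_cast<std::size_t>(minn);
  const std::size_t hi = static_cast<std::size_t>(maxn);
  for (std::size_t i = 0; i < word.size(); i++) {
    // Never start an n-gram in the middle of a UTF-8 character.
    if ((word[i] & 0xC0) == 0x80) continue;
    std::string ngram;
    for (std::size_t j = i, n = 1; j < word.size() && n <= hi; n++) {
      ngram.push_back(word[j++]);
      // Pull in continuation bytes so n counts characters, not bytes.
      while (j < word.size() && (word[j] & 0xC0) == 0x80) {
        ngram.push_back(word[j++]);
      }
      // Skip n-grams that are too short or that are a lone boundary marker.
      if (n >= lo && !(n == 1 && (i == 0 || j == word.size()))) {
        out.push_back(ngram);
      }
    }
  }
  return out;
}
```

For `"<abc>"` with `minn = 2` and `maxn = 3`, the sketch yields `<a`, `<ab`, `ab`, `abc`, `bc`, `bc>`, `c>`. In the real function each of these strings is hashed into a bucket and discarded.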

N-grams used for classification

Next, the n-grams used for classification.
The key functions are Dictionary::getLine(...) and Dictionary::addWordNgrams(...).

int32_t Dictionary::getLine(std::istream& in,
                            std::vector<int32_t>& words,
                            std::vector<int32_t>& labels,
                            std::minstd_rand& rng) const {
  std::vector<int32_t> word_hashes;
  int32_t ntokens = getLine(in, words, word_hashes, labels, rng); // collect the word ids and the hash of each word
  if (args_->model == model_name::sup ) {  // !!! the key point: this is where classification and word vectors diverge !!!
    addWordNgrams(words, word_hashes, args_->wordNgrams);
  }
  return ntokens;
}

void Dictionary::addWordNgrams(std::vector<int32_t>& line,
                           const std::vector<int32_t>& hashes,
                           int32_t n) const {
  if (pruneidx_size_ == 0) return;
  for (int32_t i = 0; i < hashes.size(); i++) {
    uint64_t h = hashes[i];
    // "n" here is the wordNgrams setting
    for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
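      // This step puzzled me for a long time: the hash here is computed differently from Dictionary::hash(),
      // so how do the words get joined together? Eventually it clicked:
      // !!! fastText uses two kinds of n-grams !!!
      // Word-vector n-grams split a word into pieces: abc => a, ab, abc, b, bc, c
      // Classification n-grams combine adjacent words: a, b, c => a, ab, abc, b, bc, c (the unigrams are already in `line`; only the combinations are appended here). The maximum length is set by wordNgrams, and this is where fastText's advantage lies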
      h = h * 116049371 + hashes[j];  
      int64_t id = h % args_->bucket;
      if (pruneidx_size_ > 0) {
        if (pruneidx_.count(id)) {
          id = pruneidx_.at(id);
        } else {continue;}
      }
      line.push_back(nwords_ + id);
    }
  }
}
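The rolling hash in `addWordNgrams` can be tried in isolation. The sketch below is a simplified illustration: `wordNgramIds` is a hypothetical name, the pruning table is left out, `bucket` is passed in directly, and the result is the bucket id before the `nwords_` offset is added.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, pruning-free version of the loop in addWordNgrams:
// for every start position i, extend the hash one word at a time and
// emit one bucket id per word n-gram of length 2..n.
std::vector<int64_t> wordNgramIds(const std::vector<int32_t>& hashes,
                                  int32_t n, int32_t bucket) {
  assert(n >= 1 && bucket > 0);
  std::vector<int64_t> ids;
  const std::size_t maxLen = static_cast<std::size_t>(n);
  for (std::size_t i = 0; i < hashes.size(); i++) {
    uint64_t h = hashes[i];  // hash of the first word of the n-gram
    for (std::size_t j = i + 1; j < hashes.size() && j < i + maxLen; j++) {
      // Fold the next word's hash in; the n-gram text is never built.
      h = h * 116049371 + hashes[j];
      ids.push_back(static_cast<int64_t>(h % static_cast<uint64_t>(bucket)));
    }
  }
  return ids;
}
```

With word hashes `{1, 2, 3}`, `bucket = 1000` and `n = 2`, the bigrams map to ids 373 and 745; raising `n` to 3 adds id 386 for the trigram. The multiplier makes the hash order-sensitive, so "a b" and "b a" generally land in different buckets.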