Fundamentals: Multi-class Text Processing and Feature Engineering

自学AI的鲨鱼儿

Last modified: 2024-03-16 00:18:35


First published: 2020-12-25 17:28:14

Article link: https://blog.csdn.net/qq_16555103/article/details/110825849



0. Knowledge mind map

1. Mind map overview

2. Mind map link

Mind map: multi-class text classification and feature engineering

I. Language models

II. Text preprocessing

III. One-hot encoding, TF-IDF, and the topic models LSA/LDA

IV. Static distributed word vectors: word2vec [CBOW, skip-gram, GloVe]

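1. Deriving the CBOW loss function

The CBOW loss is simple to build. The hidden layer mean-pools the context word vectors, the pooled vector is projected to the vocabulary dimension, softmax turns the scores into probabilities, and a cross-entropy loss is taken against the one-hot vector of the center word. With window size c, input vectors v, output vectors u, vocabulary size V and corpus length T, the formulas are:

$h_{i} = \frac{1}{2c}\sum_{-c\leq j\leq c,\ j\neq 0} v_{w_{i+j}}$

$p\left ( w_{i}\mid context(w_{i}) \right ) = \frac{\exp\left ( u_{w_{i}}^{T} h_{i} \right )}{\sum_{k=1}^{V}\exp\left ( u_{k}^{T} h_{i} \right )}$

The larger the log-probability over the corpus, the better:

$\log P = \sum_{i=1}^{T}\log p\left ( w_{i}\mid context(w_{i}) \right )$

Hence Loss = -log P, that is:

$Loss = -\sum_{i=1}^{T}\left [ u_{w_{i}}^{T} h_{i} - \log \sum_{k=1}^{V}\exp\left ( u_{k}^{T} h_{i} \right ) \right ]$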
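The CBOW loss for a single window can be sketched in NumPy as follows. This is only an illustration under assumptions: the names `cbow_loss`, `W_in` and `W_out` and the toy sizes are made up for the example and do not come from the original article.

```python
import numpy as np

def cbow_loss(context_ids, center_id, W_in, W_out):
    """Negative log-probability of the center word given its context words."""
    h = W_in[context_ids].mean(axis=0)       # mean pooling over the context vectors
    scores = W_out @ h                       # one score per vocabulary word
    scores = scores - scores.max()           # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log-softmax
    return -log_probs[center_id]             # cross-entropy against the one-hot center word

rng = np.random.default_rng(0)
V, D = 6, 4                                  # toy vocabulary size and vector dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # input (context) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # output (center-word) vectors
loss = cbow_loss([0, 1, 3, 4], 2, W_in, W_out)
print(loss)
```

Summing `exp(-loss)` over every possible center word gives 1, which is a quick way to check that the softmax is normalised.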

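2. Deriving the skip-gram loss function

The idea behind the skip-gram loss:

① Use the center word Wi to predict each context word Wc, where p(Wc|Wi) is the predicted probability. Skip-gram maximises the product of the conditional probabilities of all context words given the center word Wi,

$\prod_{c\in context(i)} p\left ( w_{c}\mid w_{i} \right )$

Taking the log turns the product into a sum:

$\sum_{c\in context(i)} \log p\left ( w_{c}\mid w_{i} \right )$

② Apply step ① to every center word in the sentence and add the results. This total should be as large as possible:

$\log P = \sum_{i=1}^{T}\sum_{c\in context(i)} \log p\left ( w_{c}\mid w_{i} \right )$

③ p(Wc|Wi) can be written as a softmax over vector similarities, i.e. p(Wc|Wi) = $\frac{\exp\left ( u_{w_{c}}^{T} v_{w_{i}} \right )}{\sum_{k=1}^{V}\exp\left ( u_{k}^{T} v_{w_{i}} \right )}$, where v is the input (center-word) vector, u is the output (context-word) vector and V is the vocabulary size.

④ The final formula for P is:

$\log P = \sum_{i=1}^{T}\sum_{c\in context(i)}\left [ u_{w_{c}}^{T} v_{w_{i}} - \log \sum_{k=1}^{V}\exp\left ( u_{k}^{T} v_{w_{i}} \right ) \right ]$

⑤ loss = -log P. [Skip-gram still uses the softmax cross-entropy loss. The only difference from CBOW is that there are several one-hot targets, one per context word. See the mind map for how skip-gram runs step by step.]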
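As a rough sketch of the skip-gram loss for one center word, the softmax is computed once and the log-probabilities of all context words are summed. The names `skipgram_loss`, `W_in` and `W_out` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def skipgram_loss(center_id, context_ids, W_in, W_out):
    """Sum of -log p(w_c | w_i) over the context words of one center word."""
    scores = W_out @ W_in[center_id]         # similarity of the center word to every word
    scores = scores - scores.max()           # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log-softmax
    return -log_probs[context_ids].sum()     # several one-hot targets, one per context word

rng = np.random.default_rng(0)
V, D = 6, 4                                  # toy vocabulary size and vector dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # output (context-word) vectors
loss = skipgram_loss(2, [1, 3], W_in, W_out)
print(loss)
```

Because the log turns the product into a sum, the loss for the context `[1, 3]` equals the loss for `[1]` plus the loss for `[3]`.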

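3. Optimising the CBOW and skip-gram models

(1) Preprocessing optimisation: handling low-frequency words in word2vec [for details see: word2vec里面的数学与代码细节.pdf]

1. Why remove some low-frequency words:
    ① Deep learning and machine learning models are mostly statistical. Statistics are only meaningful when the quantity being counted occurs often enough, so words with very low counts carry no statistical significance and should be removed.
    ② Removing low-frequency words shrinks the vocabulary and speeds up word2vec training.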
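A minimal sketch of this filtering step, assuming the corpus is already tokenised into lists of words. The helper name `replace_rare` and the toy corpus are hypothetical; the `<unk>` marker follows the convention used in the code later in this article.

```python
from collections import Counter

def replace_rare(sentences, min_count, unk="<unk>"):
    """Map every token whose corpus count is below min_count to the unk marker."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
filtered = replace_rare(corpus, min_count=2)
print(filtered)
```

Here "dog" and "ran" occur only once, so with `min_count=2` they collapse into a single `<unk>` entry and the vocabulary drops from five words to four.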

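(2) Preprocessing optimisation: subsampling of high-frequency words in word2vec [concept]

a. Why subsample high-frequency words?

Subsampling reduces the influence of uninformative high-frequency words inside the n-gram window. Every token of every sample is dropped with a probability that depends on the word's frequency. [Because removal is probabilistic, word A may be dropped from the first sentence and still be present in the second.]
Example: 非常|优秀|的|人 (very | excellent | 的 | person) >>>> 非常|优秀|人. The particle "的" can appear in a huge number of sentences, so samples that contain it carry little information. After subsampling, high-frequency words are dropped with high probability, so the skip-gram window rarely produces training pairs such as p(的|优秀).

b. Does subsampling conflict with stop-word removal in text preprocessing?

Answer: no. Subsampling is effectively a probabilistic form of stop-word removal, and the two can be used together.

Stop words are words that contribute little to training, e.g. 的, is, the, a. [A stop-word list removes words that are useless in general-purpose settings, but some high-frequency words that hurt training may still be missing from the list. Subsampling drops such words with a probability based on their frequency: the higher the frequency, the higher the drop probability. Because removal is random, a word may be dropped from one sample and kept in another.]

c. Which comes first, subsampling or negative sampling?

In the reference word2vec implementation subsampling comes first: it filters tokens while a sentence is read, before context windows are built. Negative sampling happens later, once for each training pair taken from those windows.

d. Subsampling probability formula [reference: Negative Sampling]

In the formula below, t is the threshold, f(w) is the frequency of the word, and P(w) is the probability of discarding w:

$P\left ( w \right ) = 1-\sqrt{\frac{t}{f\left ( w \right )}}$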
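The effect of the threshold can be tried out with a small sketch. It assumes the discard probability from the word2vec paper, P(w) = 1 - sqrt(t / f(w)), clipped at 0; the function names `discard_prob` and `subsample` and the toy corpus are made up for the example.

```python
import math
import random
from collections import Counter

def discard_prob(freq, t=1e-3):
    """Probability of dropping a word with relative frequency freq (never negative)."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, t=1e-3, seed=0):
    """Drop each token independently with its word's discard probability."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [tok for tok in tokens
            if rng.random() >= discard_prob(counts[tok] / total, t)]

# Toy corpus: one very frequent word plus 5000 words that each occur once
tokens = ["the"] * 5000 + ["w%d" % i for i in range(5000)]
kept = subsample(tokens)
print(kept.count("the"), len(kept) - kept.count("the"))
```

Words whose frequency is at or below t are never dropped, while "the" (frequency 0.5) is dropped about 95% of the time, so different occurrences of the same word can meet different fates.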

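(3) Computation optimisation [turning the multi-class softmax into several logistic regressions]

TIP: both hierarchical softmax and negative sampling (NEG) turn the multi-class softmax problem into several binary logistic-regression problems.

Hierarchical softmax [Hierarchical Softmax] (models based on Hierarchical Softmax)

a. How does the binary tree of Hierarchical Softmax differ from the ordinary softmax?

1. The ordinary softmax is a single multi-class problem, and it becomes expensive because the vocabulary dimension is very large. Hierarchical softmax replaces it with a sequence of binary logistic regressions. Each internal node has its own parameter vector θ and makes one binary decision: label 1 goes to the left subtree, otherwise to the right. This sequence of decisions forms a binary tree, and the length of a word's code path is the cost of reaching that word. Much like binary search, the cost is O(log N).

2. How should this tree be read? The internal nodes play the role of the neurons of the softmax output layer, the leaves are the outputs of that layer (one per vocabulary word), and the root receives the input to the output layer.

b. Compared with the multi-class softmax output layer, why is Hierarchical Softmax, which runs several logistic regressions on a binary tree, cheaper?

1. The ordinary softmax has to score every class in the vocabulary, which costs O(N), where N is the vocabulary size.
   Hierarchical softmax follows the binary-search idea and reaches a class in O(log N).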
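To make the "several logistic regressions" idea concrete, here is a small sketch of how a word's probability is the product of the binary decisions along its path. The four-word tree, its paths and codes, and the name `hs_word_prob` are hypothetical; label 1 is read as "go left", following the convention above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_word_prob(h, path, code, theta):
    """Multiply the binary decisions made at each internal node on the path."""
    prob = 1.0
    for node, label in zip(path, code):
        p_left = sigmoid(theta[node] @ h)    # logistic regression at this node
        prob *= p_left if label == 1 else 1.0 - p_left
    return prob

# Toy tree: root 0 has internal children 1 (left) and 2 (right), each with two leaves
paths = {"a": [0, 1], "b": [0, 1], "c": [0, 2], "d": [0, 2]}
codes = {"a": [1, 1], "b": [1, 0], "c": [0, 1], "d": [0, 0]}

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 5))              # one parameter vector per internal node
h = rng.normal(size=5)                       # hidden-layer vector fed to the root
probs = {w: hs_word_prob(h, paths[w], codes[w], theta) for w in paths}
print(probs, sum(probs.values()))
```

Each word needs only two sigmoid evaluations here instead of a four-way softmax, and the leaf probabilities still sum to 1 without any explicit normalisation.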

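c. Different binary trees give Hierarchical Softmax different efficiency. Why is the Huffman tree the most efficient, and how is it built?

1. Word frequencies in the vocabulary are far from uniform. We want the total code-path length over all looked-up words to be as short as possible, so frequent words should get short paths, and the Huffman tree has exactly this property. It is built by repeatedly merging the two nodes with the smallest counts, so rare words end up deep in the tree and frequent words stay near the root.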
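The relationship between frequency and path length can be checked with a short sketch that only computes code lengths. It uses a heap instead of the two-pointer scan in the full code later in the article; the function name `huffman_code_lengths` and the toy counts are assumptions for the example.

```python
import heapq
import itertools

def huffman_code_lengths(counts):
    """Return the Huffman code length (tree depth) of every word."""
    tie = itertools.count()                  # tie-breaker so word lists are never compared
    heap = [(c, next(tie), [w]) for w, c in counts.items()]
    heapq.heapify(heap)
    depth = {w: 0 for w in counts}
    while len(heap) > 1:
        c1, _, words1 = heapq.heappop(heap)  # the two least frequent nodes
        c2, _, words2 = heapq.heappop(heap)
        for w in words1 + words2:            # every word below the merge gets one level deeper
            depth[w] += 1
        heapq.heappush(heap, (c1 + c2, next(tie), words1 + words2))
    return depth

counts = {"the": 50, "cat": 20, "sat": 15, "mat": 10, "zebra": 5}
lengths = huffman_code_lengths(counts)
total_bits = sum(counts[w] * lengths[w] for w in counts)
print(lengths, total_bits)
```

With these counts the most frequent word gets a 1-step path and the rarest words get 4 steps, for a total of 195 decisions over 100 tokens; a balanced tree over five words would need up to 3 decisions for every token.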

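d. What are the drawbacks and advantages of the Huffman-tree approach?

1. Drawback of the Huffman-tree softmax:
    Because of the way the tree is built, frequent words sit near the root, so their code paths are short and they are reached quickly. Rare words, however, tend to have very long code paths and are slow to train.

2. Huffman-tree softmax versus negative sampling on low-frequency words:
    As the procedure above shows, the hierarchical-softmax tree works like binary search. A rare word has a long code path, but the internal nodes on that path also take part in the decisions made while training other samples, which is why hierarchical softmax handles low-frequency words better than NEG.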

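Negative sampling [NEG] (models based on Negative Sampling)

a. How does negative sampling differ from the ordinary softmax?

1. Like hierarchical softmax, NEG turns the softmax multi-class problem into several binary logistic regressions. It no longer treats every other word in the vocabulary as a negative example. Instead it draws n negatives and adds the positive example of the current training window, so the vocab-dimensional multi-class problem becomes 1 + n binary classifications.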

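b. Deriving the NEG loss function (using the CBOW model)

b.1 Let Wo be the center word and context(Wo) its context words. Let Wi, i = 0, 1, 2, ..., neg, be the samples obtained by negative sampling, 1 + neg in total:

one positive example (label $y_{w_{i}} = 1$) and neg negative examples (label $y_{w_{i}} = 0$). $X_{w_{o}}$ is the hidden-layer vector computed from context(Wo) and $\Theta ^{w_{i}}$ is the parameter vector of sample Wi.

b.2 Logistic-regression probability: $\sigma \left ( x \right ) = \frac{1}{1+e^{-x}}$; below, $\sigma$ stands for the sigmoid function.

From logistic regression,

$P\left ( y_{w_{i}}\mid X_{w_{o}},\Theta ^{w_{i}} \right ) = \sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right )^{y_{w_{i}}}\left [ 1-\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right ) \right ]^{1-y_{w_{i}}}$

b.3 The joint probability of the 1 + neg samples is $P = \prod_{i=0}^{neg}\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right )^{y_{w_{i}}}\left [ 1-\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right ) \right ]^{1-y_{w_{i}}}$. Taking the log gives $L = \log P = \sum_{i=0}^{neg}\left \{ y_{w_{i}}\log \sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right )+\left ( 1-y_{w_{i}} \right )\log \left [ 1-\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right ) \right ] \right \}$.

The larger this value the better [note that this is for a single sample window].

b.4 Differentiate for gradient ascent. The derivative with respect to one $\Theta ^{w_{i}}$ involves only that sample, while the derivative with respect to $X_{w_{o}}$ sums over all 1 + neg samples:

$\frac{\partial L}{\partial \Theta ^{w_{i}}} = \left [ y_{w_{i}}-\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right ) \right ] X_{w_{o}}$

$\frac{\partial L}{\partial X_{w_{o}}} = \sum_{i=0}^{neg}\left [ y_{w_{i}}-\sigma \left ( X_{w_{o}}^{T}\Theta ^{w_{i}} \right ) \right ] \Theta ^{w_{i}}$

b.5 In word2vec both hierarchical softmax and NEG use stochastic gradient ascent, i.e. the parameters are updated once per sample window.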
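The gradients in b.4 and one stochastic gradient-ascent update can be sketched as below. The vectors are random toy values and the names `neg_log_likelihood` and `neg_gradients` are assumptions for the example; row 0 of `thetas` plays the positive sample and the remaining rows the negatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(x, thetas, labels):
    """Log of the joint probability of the 1 + neg samples for one window."""
    p = sigmoid(thetas @ x)
    return np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def neg_gradients(x, thetas, labels):
    err = labels - sigmoid(thetas @ x)       # y - sigma(x^T theta), one value per sample
    grad_thetas = err[:, None] * x[None, :]  # row i is the gradient for theta of sample i
    grad_x = err @ thetas                    # summed over the 1 + neg samples
    return grad_thetas, grad_x

rng = np.random.default_rng(0)
dim, neg = 8, 5
x = rng.normal(scale=0.5, size=dim)                    # hidden-layer vector of the window
thetas = rng.normal(scale=0.5, size=(1 + neg, dim))    # one parameter vector per sample
labels = np.array([1.0] + [0.0] * neg)                 # one positive, neg negatives

before = neg_log_likelihood(x, thetas, labels)
grad_thetas, grad_x = neg_gradients(x, thetas, labels)
alpha = 0.025                                          # learning rate
thetas_new = thetas + alpha * grad_thetas              # ascent: move along the gradient
x_new = x + alpha * grad_x
after = neg_log_likelihood(x_new, thetas_new, labels)
print(before, after)
```

The update touches only 1 + neg parameter vectors rather than the whole vocabulary, which is where the speed-up over the full softmax comes from.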

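c. NEG negative-sampling probability formula

$p\left ( w \right ) = \frac{count\left ( w \right )^{0.75}}{\sum_{u\in vocab} count\left ( u \right )^{0.75}}$, a heuristic formula obtained from experiments.

d. How NEG draws negative examples

1. A segment of length 1 is divided into vocab parts, one per word. The length of each part is the probability that the sampling formula assigns to the word from its frequency. Drawing a uniform random point on the segment and taking the word whose part contains it yields one negative example.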
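The segment idea can be sketched with a cumulative sum and a binary search. This is an illustrative alternative to the large lookup table used in the full code later in the article; the function names and toy counts are assumptions, and the exponent 0.75 is the value used in that code.

```python
import numpy as np

def negative_sampling_probs(counts, power=0.75):
    """Sampling probability of each word: count^power, normalised to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def draw_negatives(probs, n, rng):
    """Pick n word indices by dropping uniform points on the unit segment."""
    cum = np.cumsum(probs)                   # right end of each word's sub-segment
    cum[-1] = 1.0                            # guard against floating-point rounding
    points = rng.random(n)                   # uniform points in [0, 1)
    return np.searchsorted(cum, points, side="right")

counts = [100, 10, 1]                        # toy word counts, most frequent first
probs = negative_sampling_probs(counts)
rng = np.random.default_rng(0)
samples = draw_negatives(probs, 10000, rng)
print(probs, np.bincount(samples, minlength=3) / len(samples))
```

Raising the counts to the power 0.75 flattens the distribution: the rarest word is drawn more often than its raw frequency of 1/111 would suggest, and the most frequent word less often.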

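e. How do NEG and the Huffman-tree softmax differ in model quality?

1. Hierarchical softmax works better for low-frequency words [the Huffman tree in essence still trains against all negative classes; a rare word has a long code path, but it acts as a label in the training of every sample];
2. NEG works better for high-frequency words [NEG tends to draw frequent words as negatives, so frequent words usually serve as labels during training, low-frequency words take part in fewer updates, and their vectors end up worse];
3. NEG works better when the vector dimensionality is low.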

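4. Hyperparameter recommendations from Mikolov for CBOW and skip-gram:

1. Architecture: skip-gram trains more slowly but handles low-frequency words better than CBOW [CBOW mean-pools the hidden layer, which to some extent reduces the weight of low-frequency words]; CBOW runs faster.
2. Training algorithm: hierarchical softmax works better for low-frequency words [the Huffman tree in essence still trains against all negative classes; a rare word has a long code path, but it acts as a label in the training of every sample]. NEG works better for high-frequency words [NEG tends to draw frequent words as negatives, so frequent words usually serve as labels during training and low-frequency words take part in fewer updates, which makes their vectors worse]. NEG also works better when the vector dimensionality is low.
3. Subsampling of high-frequency words: on large corpora it can improve both accuracy and speed. The sampling value is usually between 1e-3 and 1e-5 [it can be lowered when the corpus is large].
4. Word-vector dimensionality: for Chinese:
5. Window size: about 10 for skip-gram and about 5 for CBOW.
7. Number of NEG negatives: 5 - 20 for small datasets, 2 - 5 for large datasets.
8. Before training word2vec, add the start marker <bol>, the end marker <eol> and the unknown-word marker <unk>. [<unk> covers the OOV cases: ① a word in the test set does not occur in the training set; ② some words in the training/test set have a frequency below min_count. Low-frequency words have no statistical significance and replacing them shrinks the vocabulary, so they are mapped to <unk>.]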

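5. Hand-written word2vec code [CBOW/skip-gram, Huffman tree/negative sampling]

5.1 Vocabulary construction code [includes the special markers <bol>, <eol> and <unk>, sorts by word frequency, and maps words below min_count to <unk>]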

import sys

class VocabItem:
    def __init__(self, word):
        self.word = word
        self.count = 0

class Vocab:
    def __init__(self, fi, min_count):
        """
            fi: path of the input corpus; tokens are separated by spaces
            min_count: minimum word count; words below it are mapped to <unk>
        """
        vocab_items = []
        vocab_hash = {}
        word_count = 0
        fi = open(fi, 'r',encoding='utf-8-sig')

        # Add special tokens <bol> (beginning of line) and <eol> (end of line)
        for token in ['<bol>', '<eol>']:
            vocab_hash[token] = len(vocab_items)  #
            vocab_items.append(VocabItem(token))

        for line in fi:
            tokens = line.split()  # split on any whitespace so the trailing newline does not stick to the last token
            for token in tokens:
                if token not in vocab_hash:
                    vocab_hash[token] = len(vocab_items)
                    vocab_items.append(VocabItem(token))
                    
                #assert vocab_items[vocab_hash[token]].word == token, 'Wrong vocab_hash index'
                vocab_items[vocab_hash[token]].count += 1
                word_count += 1
            
                if word_count % 10000 == 0:
                    sys.stdout.write("\rReading word %d" % word_count)
                    sys.stdout.flush()

            # Add special tokens <bol> (beginning of line) and <eol> (end of line)
            vocab_items[vocab_hash['<bol>']].count += 1
            vocab_items[vocab_hash['<eol>']].count += 1
            word_count += 2

        self.bytes = fi.tell()
        self.vocab_items = vocab_items         # List of VocabItem objects
        self.vocab_hash = vocab_hash           # Mapping from each token to its index in vocab
        self.word_count = word_count           # Total number of words in train file

        # Add special token <unk> (unknown) to cover OOV (out-of-vocabulary) words,
        # merge words occurring less than min_count into <unk>, and
        # sort vocab in descending order by frequency in train file
        self.__sort(min_count)

        #assert self.word_count == sum([t.count for t in self.vocab_items]), 'word_count and sum of t.count do not agree'
        print('Total words in training file: %d' % self.word_count)
        print('Total bytes in training file: %d' % self.bytes)
        print('Vocab size: %d' % len(self))

    def __getitem__(self, i):
        return self.vocab_items[i]

    def __len__(self):
        return len(self.vocab_items)

    def __iter__(self):
        return iter(self.vocab_items)

    def __contains__(self, key):
        return key in self.vocab_hash

    def __sort(self, min_count):
        tmp = []
        tmp.append(VocabItem('<unk>'))
        unk_hash = 0
        
        count_unk = 0
        for token in self.vocab_items:
            if token.count < min_count:
                count_unk += 1
                tmp[unk_hash].count += token.count
            else:
                tmp.append(token)

        tmp.sort(key=lambda token : token.count, reverse=True)

        # Update vocab_hash
        vocab_hash = {}
        for i, token in enumerate(tmp):
            vocab_hash[token.word] = i

        self.vocab_items = tmp
        self.vocab_hash = vocab_hash

        print()
        print ('Unknown vocab size:', count_unk)  # number of distinct words below min_count that were merged into <unk>

    def indices(self, tokens):
        return [self.vocab_hash[token] if token in self else self.vocab_hash['<unk>'] for token in tokens]

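5.2 word2vec code

Before training word2vec, add the start marker <bol>, the end marker <eol> and the unknown-word marker <unk>. [<unk> covers the OOV cases: ① a word in the test set does not occur in the training set; ② some words in the training/test set have a frequency below min_count. Low-frequency words have no statistical significance and replacing them shrinks the vocabulary, so they are mapped to <unk>.]

Note: gensim models do not add the <unk>, <bol> and <eol> markers, so this replacement can be done while preprocessing the corpus.

The code below runs on Linux/macOS; on Windows it may fail because of how the worker processes are started.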

import argparse
import math
import struct
import sys
import time
import warnings

import numpy as np

from multiprocessing import Pool, Value, Array

class VocabItem:
    def __init__(self, word):
        self.word = word
        self.count = 0
        self.path = None # Path (list of indices) from the root to the word (leaf)
        self.code = None # Huffman encoding

class Vocab:
    def __init__(self, fi, min_count):
        vocab_items = []
        vocab_hash = {}
        word_count = 0
        fi = open(fi, 'r',encoding='utf-8-sig')

        # Add special tokens <bol> (beginning of line) and <eol> (end of line)
        for token in ['<bol>', '<eol>']:
            vocab_hash[token] = len(vocab_items)  #
            vocab_items.append(VocabItem(token))

        for line in fi:
            tokens = line.split()  # split on any whitespace so the trailing newline does not stick to the last token
            for token in tokens:
                if token not in vocab_hash:
                    vocab_hash[token] = len(vocab_items)
                    vocab_items.append(VocabItem(token))
                    
                #assert vocab_items[vocab_hash[token]].word == token, 'Wrong vocab_hash index'
                vocab_items[vocab_hash[token]].count += 1
                word_count += 1
            
                if word_count % 10000 == 0:
                    sys.stdout.write("\rReading word %d" % word_count)
                    sys.stdout.flush()

            # Add special tokens <bol> (beginning of line) and <eol> (end of line)
            vocab_items[vocab_hash['<bol>']].count += 1
            vocab_items[vocab_hash['<eol>']].count += 1
            word_count += 2

        self.bytes = fi.tell()
        self.vocab_items = vocab_items         # List of VocabItem objects
        self.vocab_hash = vocab_hash           # Mapping from each token to its index in vocab
        self.word_count = word_count           # Total number of words in train file

        # Add special token <unk> (unknown) to cover OOV (out-of-vocabulary) words,
        # merge words occurring less than min_count into <unk>, and
        # sort vocab in descending order by frequency in train file
        self.__sort(min_count)

        #assert self.word_count == sum([t.count for t in self.vocab_items]), 'word_count and sum of t.count do not agree'
        print('Total words in training file: %d' % self.word_count)
        print('Total bytes in training file: %d' % self.bytes)
        print('Vocab size: %d' % len(self))

    def __getitem__(self, i):
        return self.vocab_items[i]

    def __len__(self):
        return len(self.vocab_items)

    def __iter__(self):
        return iter(self.vocab_items)

    def __contains__(self, key):
        return key in self.vocab_hash

    def __sort(self, min_count):
        tmp = []
        tmp.append(VocabItem('<unk>'))
        unk_hash = 0
        
        count_unk = 0
        for token in self.vocab_items:
            if token.count < min_count:
                count_unk += 1
                tmp[unk_hash].count += token.count
            else:
                tmp.append(token)

        tmp.sort(key=lambda token : token.count, reverse=True)

        # Update vocab_hash
        vocab_hash = {}
        for i, token in enumerate(tmp):
            vocab_hash[token.word] = i

        self.vocab_items = tmp
        self.vocab_hash = vocab_hash

        print()
        print ('Unknown vocab size:', count_unk)  # number of distinct words below min_count that were merged into <unk>

    def indices(self, tokens):
        return [self.vocab_hash[token] if token in self else self.vocab_hash['<unk>'] for token in tokens]

    def encode_huffman(self):
        # Build a Huffman tree
        vocab_size = len(self)
        count = [t.count for t in self] + [1e15] * (vocab_size - 1)
        parent = [0] * (2 * vocab_size - 2)
        binary = [0] * (2 * vocab_size - 2)
        
        pos1 = vocab_size - 1
        pos2 = vocab_size

        for i in range(vocab_size - 1):
            # Find min1
            if pos1 >= 0:
                if count[pos1] < count[pos2]:
                    min1 = pos1
                    pos1 -= 1
                else:
                    min1 = pos2
                    pos2 += 1
            else:
                min1 = pos2
                pos2 += 1

            # Find min2
            if pos1 >= 0:
                if count[pos1] < count[pos2]:
                    min2 = pos1
                    pos1 -= 1
                else:
                    min2 = pos2
                    pos2 += 1
            else:
                min2 = pos2
                pos2 += 1

            count[vocab_size + i] = count[min1] + count[min2]
            parent[min1] = vocab_size + i
            parent[min2] = vocab_size + i
            binary[min2] = 1

        # Assign binary code and path pointers to each vocab word
        root_idx = 2 * vocab_size - 2
        for i, token in enumerate(self):
            path = [] # List of indices from the leaf to the root
            code = [] # Binary Huffman encoding from the leaf to the root

            node_idx = i
            while node_idx < root_idx:
                if node_idx >= vocab_size: path.append(node_idx)
                code.append(binary[node_idx])
                node_idx = parent[node_idx]
            path.append(root_idx)

            # These are path and code from the root to the leaf
            token.path = [j - vocab_size for j in path[::-1]]
            token.code = code[::-1]

class UnigramTable:
    """
    Negative sampling:
    turns one multi-class classification into several binary classifications.
    A list of indices of tokens in the vocab following a power law distribution,
    used to draw negative samples.
    """
    def __init__(self, vocab):
        vocab_size = len(vocab)
        power = 0.75
        norm = sum([math.pow(t.count, power) for t in vocab]) # Normalizing constant

        table_size = int(1e8) # Length of the unigram table: the unit segment is cut into this many equal slots
        table = np.zeros(table_size, dtype=np.uint32)

        print('Filling unigram table')
        p = 0 # Cumulative probability
        i = 0
        for j, unigram in enumerate(vocab):
            p += float(math.pow(unigram.count, power))/norm
            while (i < table_size) and (float(i) / table_size < p):
                table[i] = j
                i += 1
        self.table = table

    def sample(self, count):
        indices = np.random.randint(low=0, high=len(self.table), size=count)
        return [self.table[i] for i in indices]

def sigmoid(z):
    if z > 6:
        return 1.0
    elif z < -6:
        return 0.0
    else:
        return 1 / (1 + math.exp(-z))

def init_net(dim, vocab_size):
    """
    Initialise the two weight matrices, each with its own scheme:
    syn0 is drawn from a uniform distribution and syn1 starts at zero.
    """
    # Init syn0 with random numbers from a uniform distribution on the interval [-0.5, 0.5]/dim
    tmp = np.random.uniform(low=-0.5/dim, high=0.5/dim, size=(vocab_size, dim))

    syn0 = np.ctypeslib.as_ctypes(tmp)
    syn0 = Array(syn0._type_, syn0, lock=False)  # shared memory for the worker processes

    # Init syn1 with zeros
    tmp = np.zeros(shape=(vocab_size, dim))
    syn1 = np.ctypeslib.as_ctypes(tmp)
    syn1 = Array(syn1._type_, syn1, lock=False) # shared memory for the worker processes

    return (syn0, syn1)

def train_process(pid):
    # Set fi to point to the right chunk of training file
    # Integer division: in Python 3 "/" returns a float, which fi.seek() rejects
    start = vocab.bytes // num_processes * pid
    end = vocab.bytes if pid == num_processes - 1 else vocab.bytes // num_processes * (pid + 1)
    fi.seek(start)
    #print 'Worker %d beginning training at %d, ending at %d' % (pid, start, end)

    alpha = starting_alpha

    word_count = 0
    last_word_count = 0

    while fi.tell() < end:
        line = fi.readline().strip()
        # Skip blank lines
        if not line:
            continue

        # Init sent, a list of indices of words in line
        sent = vocab.indices(['<bol>'] + line.split() + ['<eol>'])

        for sent_pos, token in enumerate(sent):
            if word_count % 10000 == 0:
                global_word_count.value += (word_count - last_word_count)
                last_word_count = word_count

                # Recalculate alpha
                alpha = starting_alpha * (1 - float(global_word_count.value) / vocab.word_count)
                if alpha < starting_alpha * 0.0001: alpha = starting_alpha * 0.0001

                # Print progress info
                sys.stdout.write("\rAlpha: %f Progress: %d of %d (%.2f%%)" %
                                 (alpha, global_word_count.value, vocab.word_count,
                                  float(global_word_count.value) / vocab.word_count * 100))
                sys.stdout.flush()

            # Randomize window size, where win is the max window size
            current_win = np.random.randint(low=1, high=win+1)
            context_start = max(sent_pos - current_win, 0)
            context_end = min(sent_pos + current_win + 1, len(sent))
            context = sent[context_start:sent_pos] + sent[sent_pos+1:context_end] # Turn into an iterator?

            # CBOW
            if cbow:
                # Compute neu1
                neu1 = np.mean(np.array([syn0[c] for c in context]), axis=0)
                assert len(neu1) == dim, 'neu1 and dim do not agree'

                # Init neu1e with zeros
                neu1e = np.zeros(dim)

                # Compute neu1e and update syn1
                if neg > 0:
                    classifiers = [(token, 1)] + [(target, 0) for target in table.sample(neg)]
                else:
                    classifiers = zip(vocab[token].path, vocab[token].code)
                for target, label in classifiers:
                    z = np.dot(neu1, syn1[target])
                    p = sigmoid(z)
                    g = alpha * (label - p)
                    neu1e += g * syn1[target] # Error to backpropagate to syn0
                    syn1[target] += g * neu1  # Update syn1

                # Update syn0
                for context_word in context:
                    syn0[context_word] += neu1e

            # Skip-gram
            else:
                for context_word in context:
                    # Init neu1e with zeros
                    neu1e = np.zeros(dim)

                    # Compute neu1e and update syn1
                    if neg > 0:
                        classifiers = [(token, 1)] + [(target, 0) for target in table.sample(neg)]
                    else:
                        classifiers = zip(vocab[token].path, vocab[token].code)
                    for target, label in classifiers:
                        z = np.dot(syn0[context_word], syn1[target])
                        p = sigmoid(z)
                        g = alpha * (label - p)
                        neu1e += g * syn1[target]              # Error to backpropagate to syn0
                        syn1[target] += g * syn0[context_word] # Update syn1

                    # Update syn0
                    syn0[context_word] += neu1e

            word_count += 1

    # Print progress info
    global_word_count.value += (word_count - last_word_count)
    sys.stdout.write("\rAlpha: %f Progress: %d of %d (%.2f%%)" %
                     (alpha, global_word_count.value, vocab.word_count,
                      float(global_word_count.value)/vocab.word_count * 100))
    sys.stdout.flush()
    fi.close()

def save(vocab, syn0, fo, binary):
    print ('Saving model to', fo)
    dim = len(syn0[0])
    if binary:
        # A file opened with 'wb' only accepts bytes, so text fragments are encoded first
        fo = open(fo, 'wb')
        fo.write(('%d %d\n' % (len(syn0), dim)).encode('utf-8'))
        fo.write(b'\n')
        for token, vector in zip(vocab, syn0):
            fo.write(('%s ' % token.word).encode('utf-8'))
            for s in vector:
                fo.write(struct.pack('f', s))
            fo.write(b'\n')
    else:
        # Explicit encoding so non-ASCII words are written the same way on every platform
        fo = open(fo, 'w', encoding='utf-8')
        fo.write('%d %d\n' % (len(syn0), dim))
        for token, vector in zip(vocab, syn0):
            word = token.word
            vector_str = ' '.join([str(s) for s in vector])
            fo.write('%s %s\n' % (word, vector_str))

    fo.close()

def __init_process(*args):
    global vocab, syn0, syn1, table, cbow, neg, dim, starting_alpha
    global win, num_processes, global_word_count, fi
    
    vocab, syn0_tmp, syn1_tmp, table, cbow, neg, dim, starting_alpha, win, num_processes, global_word_count = args[:-1]
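    # Same encoding as in Vocab; errors='ignore' skips a character that a chunk boundary may split
    fi = open(args[-1], 'r', encoding='utf-8-sig', errors='ignore')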
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        syn0 = np.ctypeslib.as_array(syn0_tmp)
        syn1 = np.ctypeslib.as_array(syn1_tmp)

def train(fi, fo, cbow, neg, dim, alpha, win, min_count, num_processes, binary):
    # Read train file to init vocab
    # min_count - Min count for words used to learn <unk>
    vocab = Vocab(fi, min_count)

    # Init net
    syn0, syn1 = init_net(dim, len(vocab))    # Vocab defines __len__(), so len(vocab) is the number of distinct vocabulary entries

    global_word_count = Value('i', 0)
    table = None
    if neg > 0:
        print ('Initializing unigram table')
        table = UnigramTable(vocab)
    else:
        print ('Initializing Huffman tree')
        vocab.encode_huffman()

    # Begin training using num_processes workers
    t0 = time.time()
    pool = Pool(processes=num_processes, initializer=__init_process,
                initargs=(vocab, syn0, syn1, table, cbow, neg, dim, alpha,
                          win, num_processes, global_word_count, fi))
    pool.map(train_process, range(num_processes))
    t1 = time.time()
    print()
    print ('Completed training. Training took', (t1 - t0) / 60, 'minutes')

    # Save model to file
    save(vocab, syn0, fo, binary)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-train', help='Training file', dest='fi', required=True)
    parser.add_argument('-model', help='Output model file', dest='fo', required=True)
    parser.add_argument('-cbow', help='1 for CBOW, 0 for skip-gram', dest='cbow', default=1, type=int)
    parser.add_argument('-negative', help='Number of negative examples (>0) for negative sampling, 0 for hierarchical softmax', dest='neg', default=5, type=int)
    parser.add_argument('-dim', help='Dimensionality of word embeddings', dest='dim', default=100, type=int)
    parser.add_argument('-alpha', help='Starting alpha', dest='alpha', default=0.025, type=float)
    parser.add_argument('-window', help='Max window length', dest='win', default=5, type=int) 
    parser.add_argument('-min-count', help='Min count for words used to learn <unk>', dest='min_count', default=1, type=int)
    parser.add_argument('-processes', help='Number of processes', dest='num_processes', default=1, type=int) # number of worker processes
    parser.add_argument('-binary', help='1 for output model in binary format, 0 otherwise', dest='binary', default=0, type=int)
    #TO DO: parser.add_argument('-epoch', help='Number of training epochs', dest='epoch', default=1, type=int)
    args = parser.parse_args()

    train(args.fi, args.fo, bool(args.cbow), args.neg, args.dim, args.alpha, args.win,
          args.min_count, args.num_processes, bool(args.binary))