Introduction to NLP for Beginners: News Text Classification - Task 5

Using Word2Vec and its basic principles

The basic idea behind the Word2Vec model is to predict the words that appear in a context. For each input text we pick a context window and a center word, and based on that center word we predict the probability of the other words in the window. Because of this, Word2Vec can easily learn vector representations for newly appearing words from newly added corpora, which makes it an efficient online learning method.

This post walks through the usage and principles of Word2Vec mainly in the form of code.

Import third-party modules

from gensim.models.word2vec import Word2Vec
import logging  # provides logging output
import numpy as np
import random
import pandas as pd
import torch

Installing torch took me a while here: the installation kept failing with errors. In the end I entered my machine configuration and environment on the official website to get the install command, and then ran it in the terminal:

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

The main idea of Word2Vec

The word and its context are used to predict each other, which corresponds to two algorithms (a small sketch follows after this list):

  1. Skip-gram (SG): predict the context from the center word
  2. Continuous Bag of Words (CBOW): predict the target word from its context
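
To make the skip-gram idea concrete, here is a minimal sketch (my own illustration, not part of the original competition code) that generates (center word, context word) training pairs from a tokenized sentence; the function name skipgram_pairs and the toy tokens are made up for this example.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the center word is used to predict each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        # the context is every word within `window` positions of the center, excluding the center itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(['2967', '6758', '339', '2021'], window=1))
# [('2967', '6758'), ('6758', '2967'), ('6758', '339'), ('339', '6758'), ('339', '2021'), ('2021', '339')]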

The Word2Vec workflow really has two parts: the first is building the model, the second is extracting the word embeddings from it:

  1. Modeling: build a neural network from the training data
  2. Extracting the embeddings: once the model is trained, read out the parameters learned from the training data, such as the hidden-layer weight matrix

The Skip-gram (SG) process

Given the training data, the neural network outputs a probability distribution: each probability says how likely each word in the vocabulary is to be the output word for a given input word.
In other words, the output probabilities tell us how likely each word in the vocabulary is to appear together with the input word.

Both the input word and the output word are one-hot encoded into sparse vectors (only a single position is 1).
To save computation, only the row of the weight matrix whose index matches the 1 in the one-hot vector is actually used (a small sketch follows below).
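
A minimal sketch (my own illustration) of why the one-hot multiplication reduces to a simple row lookup; the matrix sizes here are arbitrary:

import numpy as np

vocab_size, embed_dim = 6, 4
W = np.random.rand(vocab_size, embed_dim)   # input-to-hidden weight matrix

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# full matrix product vs. simply selecting the row at the word's index
hidden_from_matmul = one_hot @ W
hidden_from_lookup = W[word_index]
assert np.allclose(hidden_from_matmul, hidden_from_lookup)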

Skip-gram training

The Word2Vec model is a huge neural network (its weight matrices are very large).
Weight matrices with millions of entries combined with hundreds of millions of training samples would make naive training infeasible.

Solutions:

  1. Treat common word combinations or phrases as single 'words'

  2. Subsample high-frequency words to reduce the number of training samples

  3. Use 'negative sampling' for the optimization objective, so that each training sample only updates a small fraction of the model weights, which lowers the computational cost
    3.1 During negative sampling, a small set of negative words is randomly chosen and their weights are updated, together with the weights for the positive word
    3.2 The negative words are drawn from a 'unigram distribution': the probability of a word being picked as a negative sample depends on its frequency, and more frequent words are more likely to be picked
    3.3 In the negative-sampling code there is an array of 100 million elements, the 'unigram table', filled with word indices from the vocabulary. A word's negative-sampling probability times 100 million equals the number of times that word appears in the table; so to draw a negative sample we only need to generate a random number between 0 and 100 million and take the word whose index sits at that position in the table. The higher a word's sampling probability, the more slots it occupies in the table and the more likely it is to be chosen

  4. Huffman tree: the input is n nodes with weights (w1, w2, ..., wn); the output is the corresponding Huffman tree. After the tree is built, the leaf nodes are usually Huffman-coded: leaves with high weights sit close to the root while leaves with low weights sit far from it, so high-weight nodes get short codes and low-weight nodes get long codes. This minimizes the weighted path length of the tree and matches information theory: frequent words get shorter codes (a construction sketch follows after this list)
    4.1. Treat (w1, w2, ..., wn) as a forest of n trees, each containing a single node
    4.2. Merge the two trees in the forest whose roots have the smallest weights into a new tree; the two trees become its left and right subtrees, and the new root's weight is the sum of the two root weights
    4.3. Remove the two merged trees from the forest and add the new tree to it
    4.4. Repeat 4.2 and 4.3 until only one tree is left in the forest
    4.5. In Word2Vec, the convention is that a left child is coded as 1 and a right child as 0, and that the weight of the left subtree is no smaller than that of the right subtree

  5. Hierarchical Softmax: to avoid computing the softmax over every word in the vocabulary, Word2Vec replaces the mapping from the hidden layer to the output softmax layer with a Huffman tree. Building and using the tree:
    5.1. Build the Huffman tree from the labels and their frequencies (the more frequent a label, the shorter its path in the tree)
    5.2. Each leaf node of the Huffman tree represents one label. For a word w, define:
    5.2.1. p - the path from the root to the leaf node of w
    5.2.2. l - the number of nodes on path p
    5.2.3. p1, p2, ..., pl - the l nodes on path p, where p1 is the root and pl is the leaf node of w
    5.2.4. d2, d3, ..., dl ∈ {0, 1} - the Huffman code of w, made up of l-1 bits, where dj is the code of the j-th node on path p (the root has no code)
    5.2.5. θ1, θ2, ..., θ(l-1) ∈ R - the vectors attached to the non-leaf nodes on path p, where θj is the vector of the j-th non-leaf node
    5.3. A Huffman tree is a binary tree, so each internal node performs a binary classification. In Word2Vec, 1 stands for the negative class and 0 for the positive class, and the classification is done with the sigmoid function
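
To make steps 4.1-4.5 concrete, here is a small sketch of Huffman-tree construction and coding that follows the conventions above (left = 1, right = 0, left subtree no lighter than the right). It is my own illustration, not the gensim implementation:

import heapq
import itertools

def build_huffman_codes(freqs):
    """freqs: dict word -> frequency. Returns dict word -> Huffman code string."""
    counter = itertools.count()  # tie-breaker so heapq never compares the node dicts
    # 4.1 start with a forest of single-node trees
    heap = [(w, next(counter), {'word': word}) for word, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # 4.2 merge the two trees whose roots have the smallest weights
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # 4.5 convention: the left subtree's weight is no smaller than the right's
        left, right = (t1, t2) if w1 >= w2 else (t2, t1)
        # 4.3 put the merged tree back into the forest
        heapq.heappush(heap, (w1 + w2, next(counter), {'left': left, 'right': right}))
    # 4.4 only one tree remains; walk it to assign codes (left = 1, right = 0)
    codes = {}
    def walk(node, code):
        if 'word' in node:
            codes[node['word']] = code or '0'
            return
        walk(node['left'], code + '1')
        walk(node['right'], code + '0')
    walk(heap[0][2], '')
    return codes

print(build_huffman_codes({'the': 50, 'cat': 20, 'sat': 15, 'mat': 10, 'on': 5}))
# frequent words such as 'the' receive shorter codes than rare words such as 'on'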

Training word vectors with Word2Vec

'''
model = Word2Vec(sentences, workers=num_workers, size=num_features)
Parameters:
    sentences - the corpus; it can be a plain list, but for large corpora it is better to build it with BrownCorpus, Text8Corpus or LineSentence
    sg - training algorithm; 0 (the default) means CBOW, sg=1 means skip-gram
    size - dimensionality of the feature vectors, default 100. Larger sizes need more training data but give better results
    window - maximum distance within a sentence between the current word and the predicted word
    alpha - learning rate
    seed - random seed
    min_count - truncates the vocabulary: words with a frequency lower than min_count are dropped, default 5
    max_vocab_size - RAM limit while building the vocabulary.
                        If the number of unique words exceeds this limit, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million word types
    sample - threshold for randomly downsampling high-frequency words, default 1e-3, useful range (0, 1e-5)
    workers - number of worker threads used for training
    hs - hs=1 uses the hierarchical softmax trick, hs=0 uses negative sampling
    iter - number of training epochs, default 5
'''
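
As a usage illustration (my own sketch, using the same gensim 3.x parameter names as above), the model could be switched to skip-gram with hierarchical softmax; all values here are made up:

# hypothetical alternative configuration (illustrative values only):
# skip-gram (sg=1) with hierarchical softmax (hs=1), 50-dimensional vectors, 10 epochs
toy_sentences = [['2967', '6758', '339'], ['2021', '2967', '339']]
model_sg = Word2Vec(toy_sentences, sg=1, hs=1, size=50, window=5, min_count=1, workers=4, iter=10)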

Configure the logging output

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
'''
level - sets the logging level
format - specifies the output format
    %(asctime)s - time the log record was created
    %(levelname)s - textual logging level
    %(levelno)s - numeric logging level
    %(message)s - the logged message
    %(funcName)s - name of the function issuing the logging call
    %(lineno)d - line number of the logging call
    %(thread)d - thread ID
    %(process)d - process ID
'''

Set the random seeds

seed = 2020
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

10-fold cross-validation

fold_num = 10
data_file = r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\data\train_set.csv'

Define the fold-splitting function

def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    # tolist() converts the column to a list; only the first num samples are kept
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]
    # count the labeled samples; this should equal num
    total = len(labels)
    # build an index list covering all samples, from 0 to total - 1
    index = list(range(total))
    # shuffle the index list
    np.random.shuffle(index)
    all_texts = []
    all_labels = []
    # rebuild texts and labels in the shuffled order
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    label2id = {}
    # range(total) - 0 to total - 1
    # this step groups the sample indices of each label into a dictionary
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
    # create fold_num empty lists to hold the indices of each fold
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # the class distribution across these samples is uneven
        print(label, len(data))
        # split each class into fold_num parts, rounding down
        batch_size = int(len(data) / fold_num)
        # other is the remainder that does not divide evenly into the folds
        other = len(data) - batch_size * fold_num
        cur_start = 0
        for i in range(fold_num):
            # the first `other` folds get batch_size + 1 samples of this class, the rest get batch_size
            cur_batch_size = batch_size + 1 if i < other else batch_size
            print(cur_batch_size)
            # take consecutive, non-overlapping slices so every sample of this class lands in exactly one fold
            batch_data = data[cur_start: cur_start + cur_batch_size]
            all_index[i].extend(batch_data)
            cur_start += cur_batch_size
    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        num = len(all_index[fold])
        # gather the texts and labels that belong to this fold
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]
        if num > batch_size:
            # trim this fold to batch_size and move the surplus into the "other" pool for redistribution
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels
        assert batch_size == len(fold_labels)
        # shuffle the samples within each finished fold
        index = list(range(batch_size))
        np.random.shuffle(index)
        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])
        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)
    logging.info('Fold lens %s', str([len(data['label']) for data in fold_data]))
    return fold_data

Split the dataset into 10 folds

fold_data = all_data2fold(fold_num=10, num=200000)
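
As a quick sanity check (my own addition, relying only on the fold_data structure returned above), the per-fold sizes and label distributions can be inspected like this:

from collections import Counter

for i, fold in enumerate(fold_data):
    # each fold is a dict holding parallel 'label' and 'text' lists
    print(i, len(fold['label']), Counter(fold['label']).most_common(3))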

Build the training set for Word2Vec

fold_id = 9
train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
logging.info('Total %d docs.' % len(train_texts))
2020-07-31 22:47:16,645 INFO: Fold lens [20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000]

2020-07-31 22:48:19,883 INFO: Total 180000 docs.

As the log shows, the training set contains 180,000 documents (folds 0 through 8).
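
The remaining fold (fold_id = 9) is held out. If it were needed later as a validation set, it could be collected in the same way (a hypothetical sketch; the variable names are my own):

val_data = fold_data[fold_id]
val_texts, val_labels = val_data['text'], val_data['label']
logging.info('Validation fold %d holds %d docs.' % (fold_id, len(val_texts)))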

Start training

logging.info('Starting training...')

num_features = 100  # dimensionality of the word vectors
num_works = 8   # number of worker threads

train_texts = list(map(lambda x: list(x.split()), train_texts))
'''
split() with no arguments splits on any whitespace, including spaces, newlines and tabs
map() combined with a lambda applies the function to every element of the iterable and returns the results as a new iterable
'''
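
A tiny illustration (my own, with made-up token strings) of what this map/split step produces:

demo = ['2967 6758 339', '2021 2967']
print(list(map(lambda x: list(x.split()), demo)))
# [['2967', '6758', '339'], ['2021', '2967']]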

2020-07-31 22:49:01,006 INFO: Starting training...

Create the model instance

model = Word2Vec(train_texts, workers=num_works, size=num_features)

2020-07-31 22:54:54,262 INFO: collecting all words and their counts
2020-07-31 22:54:54,270 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 22:54:59,512 INFO: PROGRESS: at sentence #10000, processed 9148165 words, keeping 5311 word types
2020-07-31 22:55:04,945 INFO: PROGRESS: at sentence #20000, processed 18193591 words, keeping 5704 word types
2020-07-31 22:55:10,346 INFO: PROGRESS: at sentence #30000, processed 27309645 words, keeping 5908 word types
2020-07-31 22:55:15,596 INFO: PROGRESS: at sentence #40000, processed 36283313 words, keeping 6087 word types
2020-07-31 22:55:20,184 INFO: PROGRESS: at sentence #50000, processed 45410969 words, keeping 6225 word types
2020-07-31 22:55:26,618 INFO: PROGRESS: at sentence #60000, processed 54548917 words, keeping 6324 word types
2020-07-31 22:55:33,732 INFO: PROGRESS: at sentence #70000, processed 63656832 words, keeping 6391 word types
2020-07-31 22:55:39,213 INFO: PROGRESS: at sentence #80000, processed 72844840 words, keeping 6435 word types
2020-07-31 22:55:45,167 INFO: PROGRESS: at sentence #90000, processed 81847771 words, keeping 6481 word types
2020-07-31 22:55:50,762 INFO: PROGRESS: at sentence #100000, processed 91001853 words, keeping 6541 word types
2020-07-31 22:55:55,067 INFO: PROGRESS: at sentence #110000, processed 100092931 words, keeping 6586 word types
2020-07-31 22:56:00,453 INFO: PROGRESS: at sentence #120000, processed 109101979 words, keeping 6621 word types
2020-07-31 22:56:04,678 INFO: PROGRESS: at sentence #130000, processed 118112658 words, keeping 6672 word types
2020-07-31 22:56:10,503 INFO: PROGRESS: at sentence #140000, processed 126928025 words, keeping 6689 word types
2020-07-31 22:56:15,953 INFO: PROGRESS: at sentence #150000, processed 136177978 words, keeping 6718 word types
2020-07-31 22:56:20,138 INFO: PROGRESS: at sentence #160000, processed 145204090 words, keeping 6759 word types
2020-07-31 22:56:25,443 INFO: PROGRESS: at sentence #170000, processed 154259340 words, keeping 6788 word types
2020-07-31 22:56:30,140 INFO: collected 6826 word types from a corpus of 163331797 raw words and 180000 sentences
2020-07-31 22:56:30,148 INFO: Loading a fresh vocabulary
2020-07-31 22:56:30,588 INFO: effective_min_count=5 retains 5978 unique words (87% of original 6826, drops 848)
2020-07-31 22:56:30,588 INFO: effective_min_count=5 leaves 163330134 word corpus (99% of original 163331797, drops 1663)
2020-07-31 22:56:30,644 INFO: deleting the raw counts dictionary of 6826 items
2020-07-31 22:56:30,700 INFO: sample=0.001 downsamples 61 most-common words
2020-07-31 22:56:30,700 INFO: downsampling leaves estimated 140986777 word corpus (86.3% of prior 163330134)
2020-07-31 22:56:30,740 INFO: estimated required memory for 5978 words and 100 dimensions: 7771400 bytes
2020-07-31 22:56:30,740 INFO: resetting layer weights
2020-07-31 22:56:32,372 INFO: training model with 8 workers on 5978 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 22:56:33,476 INFO: EPOCH 1 - PROGRESS: at 0.71% examples, 965562 words/s, in_qsize 14, out_qsize 1
2020-07-31 22:56:34,493 INFO: EPOCH 1 - PROGRESS: at 1.35% examples, 908630 words/s, in_qsize 13, out_qsize 2
2020-07-31 22:56:35,501 INFO: EPOCH 1 - PROGRESS: at 2.03% examples, 927393 words/s, in_qsize 14, out_qsize 2
2020-07-31 22:56:36,513 INFO: EPOCH 1 - PROGRESS: at 2.66% examples, 923767 words/s, in_qsize 14, out_qsize 1
......
......

The call below treats the trained word vectors as final: it L2-normalizes them in place, which saves memory and speeds up similarity queries, but the model can no longer be trained further afterwards.

model.init_sims(replace=True)

2020-07-31 23:13:39,065 INFO: precomputing L2-norms of word weight vectors
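
Once the vectors are normalized they are ready for similarity queries. A small sketch (my own; it assumes the chosen token survived the min_count filter):

query_token = train_texts[0][0]
print(model.wv.most_similar(query_token, topn=5))  # the 5 closest tokens by cosine similarity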

Save the model

'''
A model saved with save() can be reloaded and training can continue from where it left off,
whereas a model exported with save_word2vec_format() cannot; the upside is that with binary=False the exported file is plain text and can be opened and inspected directly
'''
model.save(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin')

2020-07-31 23:14:08,564 INFO: saving Word2Vec object under D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin, separately None
2020-07-31 23:14:08,588 INFO: not storing attribute vectors_norm
2020-07-31 23:14:08,588 INFO: not storing attribute cum_table
2020-07-31 23:14:08,733 INFO: saved D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin

Load the model

model = Word2Vec.load(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin')

2020-07-31 23:17:06,610 INFO: loading Word2Vec object from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin
2020-07-31 23:17:06,682 INFO: loading wv recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.wv.* with mmap=None
2020-07-31 23:17:06,682 INFO: setting ignored attribute vectors_norm to None
2020-07-31 23:17:06,682 INFO: loading vocabulary recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.vocabulary.* with mmap=None
2020-07-31 23:17:06,682 INFO: loading trainables recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.trainables.* with mmap=None
2020-07-31 23:17:06,682 INFO: setting ignored attribute cum_table to None
2020-07-31 23:17:06,682 INFO: loaded D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin

Convert the format

'''
A model exported this way can be opened as plain text (or written as binary, depending on the binary flag),
but the export keeps only the word vectors and drops the tree structure built during training (see the Word2Vec build process above, where words are organized in a Huffman-tree-like structure),
so the exported model cannot be trained any further
'''
model.wv.save_word2vec_format(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.txt', binary=False)

2020-07-31 23:17:44,648 INFO: storing 5978x100 projection weights into D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.txt
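
The exported text file can later be loaded back as plain key-to-vector mappings without the training state (a sketch assuming the gensim 3.x KeyedVectors API; the path mirrors the one used above):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.txt', binary=False)
print(wv.vector_size, len(wv.vocab))  # 100 dimensions, 5978 tokens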

The next task will cover the use of BERT, including pretraining and fine-tuning.
