NLP - Word Embeddings - 2015: The C2W Model (Character to Word Embedding) [Character Embedding]

Original paper: "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation"

I. Overview

  • Learning word vectors matters for NLP applications: word vectors capture syntactic and semantic similarity between words in the embedding space.
  • However, in the word-vector setup every word is an independent unit. This independence assumption is problematic: similarity in form is, to some extent, evidence of similarity in function, especially in morphologically rich languages.
  • The relation between form and function is not absolute, though; to learn it, this paper runs a bidirectional LSTM over character embeddings.
  • The C2W model captures syntactic and semantic similarity between words well and achieves state-of-the-art results on two tasks.

Abstract of "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation"

  • We propose a new model that builds word representations from characters using a bidirectional LSTM.
  • Compared with traditional word-vector methods, the C2W model needs relatively few parameters, which come in two parts: the parameters that map characters to vectors, and the parameters of the composition module, the LSTM.
  • Although the model has few parameters and the form-function relation inside words is hard to learn, it achieves state-of-the-art results on language modeling and part-of-speech tagging.
  • The advantage is even larger in morphologically rich languages.

II. Shortcomings of Word Embeddings

1. The Reasoning Problem

Word vectors are independent of one another, so the model cannot reason from one word's form to a related word's representation.

From the paper: "Even though models based on word lookup tables are often observed to learn that cats, kings and queens exist in roughly the same linear correspondences to each other as cat, king and queen do, the model does not represent the fact that adding an s at the end of the word is evidence for this transformation. This means that word lookup tables cannot generate representations for previously unseen words, such as Frenchification, even if the components, French and -ification, are observed in other contexts."


2. The Vocabulary Size Problem

Even if copious data is available, it is impractical to actually store vectors for all word types.
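As a back-of-the-envelope illustration of this storage/parameter gap (the vocabulary size, character-set size and dimensions below are made-up illustrative values, not the paper's numbers), a word lookup table grows with the number of word types, while a C2W-style character model only needs character embeddings plus a BiLSTM and a projection:

# Illustrative parameter counts (all sizes are assumptions, not the paper's numbers).
vocab_size, word_dim = 100_000, 50                   # word lookup table
char_set, char_dim, char_hidden = 500, 50, 150       # character model

lookup_params = vocab_size * word_dim                # one vector per word type
# Both LSTM directions, counted PyTorch-style: 4 gates for input/hidden weights + 2 bias vectors.
lstm_params = 2 * (4 * char_hidden * (char_dim + char_hidden) + 8 * char_hidden)
c2w_params = char_set * char_dim + lstm_params + 2 * char_hidden * word_dim

print(f"word lookup table: {lookup_params:,} parameters")        # 5,000,000
print(f"C2W (chars + BiLSTM + projection): {c2w_params:,}")      # 282,400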


III. The Character-to-Word (C2W) Model


  • What a word-embedding model and a character-embedding model have in common: the input is a word, and the output is a vector representation of that word.
  • Where they differ (a minimal sketch follows this list):
    • A word-embedding model is trained to produce a vector for every word in the vocabulary; at lookup time a word's vector is read directly from the word-to-vector table.
    • A character-embedding model is trained to produce a vector for every character; to obtain a word's vector, the character vectors of the characters making up the word are fed through a BiLSTM, whose output is the word's vector.
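The character-to-word composition can be sketched as follows. This is a minimal sketch with toy dimensions and a hypothetical char2id mapping (the full model used in this post is listed in section IV):

# Minimal sketch of character-to-word composition (toy sizes, hypothetical char2id).
import torch
import torch.nn as nn

char2id = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # 0 = <pad>

char_embed = nn.Embedding(len(char2id) + 1, 16)
char_bilstm = nn.LSTM(input_size=16, hidden_size=32, bidirectional=True, batch_first=True)
to_word_vec = nn.Linear(2 * 32, 50)   # combine both directions into a 50-dim word vector

def word_vector(word):
    ids = torch.tensor([[char2id[c] for c in word]])      # [1, num_chars]
    states, _ = char_bilstm(char_embed(ids))               # [1, num_chars, 2*32]
    # forward state after the last char + backward state after the first char
    fwd, bwd = states[:, -1, :32], states[:, 0, 32:]
    return to_word_vec(torch.cat([fwd, bwd], dim=1))       # [1, 50]

print(word_vector("cats").shape)   # works for any word built from known characters, even unseen words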

1. Pros and Cons of the C2W Model

Pros:

  • It can handle out-of-vocabulary (OOV) words;
  • It can capture structural information among characters;
  • It can infer representations for words with similar structure;

Cons:

  • The model has to learn a large vocabulary "from scratch" (at training time every word representation must be produced by the LSTM, so it is slower than a word lookup table).
  • Natural text exhibits long-range dependencies spanning hundreds or thousands of time steps.
  • Character sequences are longer than word sequences, so more computation steps are needed.
  • At test time some word vectors can be precomputed and cached (see the sketch after this list), but producing representations for OOV words is still slow.
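The caching idea mentioned above can be sketched as follows. The cache and helper names are hypothetical (not from the paper or the code below), and the sketch reuses the word_vector() helper defined earlier:

# Hypothetical test-time cache: compute a word's vector once, then reuse it.
word_vec_cache = {}

def cached_word_vector(word):
    if word not in word_vec_cache:
        # Fall back to the (slow) character BiLSTM only for words seen for the first time,
        # e.g. the word_vector() sketch shown earlier.
        word_vec_cache[word] = word_vector(word)
    return word_vec_cache[word]

# Frequent words are then looked up as fast as a word-embedding table;
# OOV words still pay the BiLSTM cost on their first occurrence.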

2. When to Use the C2W Model

In practice, Character Embedding is usually combined with Word Embedding.

  • Character Embedding captures character-level information, but is not good at capturing semantic and syntactic information;
  • Character Embedding can handle the OOV problem;
  • Word Embedding is stronger at capturing syntactic and semantic information;
  • Tasks such as reading comprehension, pretrained language models, sequence labeling and named entity recognition usually combine the two (a sketch of the combination follows this list).
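A common way to combine the two, for example in sequence labeling, is to concatenate each token's word-table vector with a character-BiLSTM vector. The class name and dimensions below are illustrative assumptions, not from the paper:

# Sketch: concatenate a word-table embedding with a character-composed embedding.
import torch
import torch.nn as nn

class WordPlusCharEmbedding(nn.Module):
    def __init__(self, vocab_size=5000, n_chars=100,
                 word_dim=100, char_dim=16, char_hidden=25):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)   # <unk> row covers OOV words
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.char_bilstm = nn.LSTM(char_dim, char_hidden,
                                   bidirectional=True, batch_first=True)
        self.output_dim = word_dim + 2 * char_hidden

    def forward(self, word_ids, char_ids):
        # word_ids: [num_tokens], char_ids: [num_tokens, max_word_length]
        states, _ = self.char_bilstm(self.char_embed(char_ids))
        char_vec = torch.cat([states[:, -1, :states.size(2) // 2],
                              states[:, 0, states.size(2) // 2:]], dim=1)
        return torch.cat([self.word_embed(word_ids), char_vec], dim=1)

emb = WordPlusCharEmbedding()
print(emb(torch.zeros(7, dtype=torch.long),
          torch.zeros(7, 12, dtype=torch.long)).shape)   # torch.Size([7, 150])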

2.1 Tasks that need character-level information


  • Sequence labeling
  • Named entity recognition

2.2 Tasks with many OOV words


  • Adversarial examples

3. Comparison Experiments: Word Embedding vs. Character Embedding Models

3.1 Language modeling

The C2W model obtains the best language-modeling results in all five languages tested: English, Portuguese, Catalan, German and Turkish.

3.2 Part-of-speech (POS) tagging

The C2W model obtains the best result on English part-of-speech tagging.

4. Takeaways from the Paper

Key points from "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation":

  • The independence assumption between words is inherently problematic, especially in morphologically rich languages, where a more reasonable assumption is that words similar in form are likely also similar in function (syntax and semantics).
    This paper argues that this independence assumption is inherently problematic, in particular in morphologically rich languages (e.g., Turkish). In such languages, a more reasonable assumption would be that orthographic (formal) similarity is evidence for functional similarity. (Introduction P1)
  • The goal of this work is not to beat the existing benchmarks, but to show that much of the feature engineering behind them can be learned automatically from task-specific data.
    The goal of our work is not to overcome existing benchmarks, but show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. (5.5 Discussion P1)

IV. C2W Model Code

1. Splitting the corpus into training, validation and test sets

data_processing.py

# -*- coding: utf-8 -*-
import json
import nltk

# Each line of wiki_00 is expected to be a JSON object with a "text" field (e.g., WikiExtractor output).
datas = open("./wiki_00", encoding="utf-8").read().splitlines()
num_words = 0
f_train = open("train.txt", "w", encoding="utf-8")
f_valid = open("valid.txt", "w", encoding="utf-8")
f_test = open("test.txt", "w", encoding="utf-8")
for data in datas:
    data = json.loads(data, strict=False)
    sentences = data["text"]
    sentences = sentences.replace("\n\n", ". ")
    sentences = sentences.replace("\n", ". ")
    sentences = nltk.sent_tokenize(sentences)
    for sentence in sentences:
        sentence = nltk.word_tokenize(sentence)
        # Keep only sentences with 10 to 100 tokens.
        if len(sentence) < 10 or len(sentence) > 100:
            continue
        num_words += len(sentence)
        sentence = " ".join(sentence) + "\n"
        # The first ~1M words form the training set; the next 20k words each
        # go to the validation and test sets; stop once all splits are full.
        if num_words <= 1000000:
            f_train.write(sentence)
        elif num_words <= 1020000:
            f_valid.write(sentence)
        elif num_words <= 1040000:
            f_test.write(sentence)
        else:
            exit()

2. Data loading

data_load.py

#coding:utf-8
from torch.utils import data
import os
import numpy as np
import pickle
from collections import Counter
class Char_LM_Dataset(data.Dataset):  # must be a Dataset: main.py wraps it in a DataLoader
    def __init__(self, mode="train", max_word_length=16, max_sentence_length=100):

        self.path = os.path.abspath('.')
        if "data" not in self.path:
            self.path += "/data"
        self.mode = mode
        self.max_word_length = max_word_length
        self.max_sentence_length = max_sentence_length
        datas = self.read_file()
        datas, char_datas, weights = self.generate_data_label(datas)
        # Flatten so that each sample is one word position:
        #   datas:      [num_sentences * max_sentence_length] target word ids
        #   char_datas: [num_sentences * max_sentence_length, max_word_length] input char ids
        #   weights:    1 for real words, 0 for padding positions
        self.datas = datas.reshape([-1])
        self.char_datas = char_datas.reshape([-1, self.max_word_length])
        self.weights = weights
        print(self.datas.shape, self.char_datas.shape, weights.shape)
    def __getitem__(self, index):
        return self.char_datas[index], self.datas[index],self.weights[index]

    def __len__(self):
        return len(self.datas)
    def read_file(self):
        if self.mode == "train":
            datas = open(self.path+"/train.txt",encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]
            if not os.path.exists(self.path+"/word2id"):
                words = []
                chars = []
                for data in datas:
                    for word in data:
                        words.append(word.lower())
                        chars.extend(word)
                # Keep the 4,998 most frequent words (+ <pad>, <unk> = 5,000 word types)
                # and the 509 most frequent characters (+ <pad>, <unk>, <start> = 512).
                words = dict(Counter(words).most_common(5000-2))
                chars = dict(Counter(chars).most_common(512-3))

                word2id = {"<pad>":0,"<unk>":1}
                for word in words:
                    word2id[word] = len(word2id)
                char2id = {"<pad>":0,"<unk>":1,"<start>":2}
                for char in chars:
                    char2id[char] = len(char2id)
                self.word2id = word2id
                self.char2id = char2id
                pickle.dump(self.word2id,open(self.path+"/word2id","wb"))
                pickle.dump(self.char2id,open(self.path+"/char2id","wb"))
            else:
                self.word2id = pickle.load(open(self.path+"/word2id","rb"))
                self.char2id = pickle.load(open(self.path+"/char2id","rb"))
            return datas
        elif self.mode=="valid":
            datas = open(self.path+"/valid.txt",encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]
            self.word2id = pickle.load(open(self.path+"/word2id", "rb"))
            self.char2id = pickle.load(open(self.path+"/char2id", "rb"))
            return datas
        elif self.mode=="test":
            datas = open(self.path+"/test.txt",encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]
            self.word2id = pickle.load(open(self.path+"/word2id", "rb"))
            self.char2id = pickle.load(open(self.path+"/char2id", "rb"))
            return datas
    def generate_data_label(self,datas):
        char_datas = []
        weights = []
        for i,data in enumerate(datas):
            if i%1000==0:
                print (i,len(datas))
            # Start each sentence with a <start> "word" so that the characters of
            # word j-1 are the input used to predict word j (language-model shift).
            char_data = [[self.char2id["<start>"]]*self.max_word_length]
            for j,word in enumerate(data):
                char_word = []
                for char in word:
                    char_word.append(self.char2id.get(char,self.char2id["<unk>"]))
                char_word = char_word[0:self.max_word_length] + \
                            [self.char2id["<pad>"]]*(self.max_word_length-len(char_word))
                datas[i][j] = self.word2id.get(datas[i][j].lower(),self.word2id["<unk>"])
                char_data.append(char_word)
            weights.extend([1] * len(datas[i])+[0]*(self.max_sentence_length-len(datas[i])))
            datas[i] = datas[i][0:self.max_sentence_length]+[self.word2id["<pad>"]]*(self.max_sentence_length-len(datas[i]))
            char_datas.append(char_data)
            char_datas[i] = char_datas[i][0:self.max_sentence_length]+\
                            [[self.char2id["<pad>"]]*self.max_word_length]*(self.max_sentence_length-len(char_datas[i]))

        datas = np.array(datas)
        char_datas = np.array(char_datas)
        weights = np.array(weights)
        return  datas ,char_datas,weights
if __name__=="__main__":
    char_lm_dataset = Char_LM_Dataset()

3. The C2W model

model.py

# -*- coding: utf-8 -*-
import torch
import torch.nn as nn
import numpy as np


class C2W(nn.Module):
    def __init__(self, config):
        super(C2W, self).__init__()
        self.char_hidden_size = config.char_hidden_size
        self.word_embed_size = config.word_embed_size
        self.lm_hidden_size = config.lm_hidden_size
        self.character_embedding = nn.Embedding(config.n_chars, config.char_embed_size)
        self.sentence_length = config.max_sentence_length
        self.char_lstm = nn.LSTM(input_size=config.char_embed_size, hidden_size=config.char_hidden_size,
                                 bidirectional=True, batch_first=True)
        self.lm_lstm = nn.LSTM(input_size=self.word_embed_size, hidden_size=config.lm_hidden_size, batch_first=True)
        self.fc_1 = nn.Linear(2 * config.char_hidden_size, config.word_embed_size)
        self.fc_2 = nn.Linear(config.lm_hidden_size, config.vocab_size)

    def forward(self, x):
        # x: [batch_size * sentence_length, max_word_length] character ids
        input = self.character_embedding(x)            # -> [*, max_word_length, char_embed_size]
        char_lstm_result = self.char_lstm(input)       # BiLSTM over the characters of each word
        # Word vector = [forward state after the last character ; backward state after the first character]
        word_input = torch.cat([char_lstm_result[0][:, -1, 0:self.char_hidden_size],
                                char_lstm_result[0][:, 0, self.char_hidden_size:]], dim=1)
        word_input = self.fc_1(word_input)             # project down to word_embed_size
        word_input = word_input.view([-1, self.sentence_length, self.word_embed_size])
        lm_lstm_result = self.lm_lstm(word_input)[0].contiguous()   # word-level LSTM language model
        lm_lstm_result = lm_lstm_result.view([-1, self.lm_hidden_size])
        out = self.fc_2(lm_lstm_result)                # logits over the word vocabulary
        return out


class config:
    def __init__(self):
        self.n_chars = 64
        self.char_embed_size = 50
        self.max_sentence_length = 8
        self.char_hidden_size = 50
        self.lm_hidden_size = 150
        self.word_embed_size = 50
        self.vocab_size = 1000  # instance attribute (was mistakenly assigned to the class)


if __name__ == "__main__":
    cfg = config()
    c2w = C2W(cfg)
    # Smoke test: 64 words (8 sentences of 8 words), each padded to 16 character ids.
    # Embedding indices must be integer (long) tensors, not float numpy arrays.
    test = torch.from_numpy(np.zeros([64, 16], dtype=np.int64))
    print(c2w(test).shape)  # torch.Size([64, 1000]): vocabulary logits for each word position

4. Configuration

config.py

# -*- coding: utf-8 -*-

import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", type=bool, default=True, help="whether use gpu")
    parser.add_argument("--gpu", type=int, default=1, help="whether use gpu")
    parser.add_argument('--n_chars', type=int, default=512, help="number of characters")
    parser.add_argument("--char_embed_size",type=int,default=50,help="character embedding size")
    parser.add_argument("--max_word_length",type=int,default=16,help="max number of characters in word")
    parser.add_argument("--max_sentence_length",type=int,default=100,help="max number of words in sentence")
    parser.add_argument("--char_hidden_size",type=int,default=150,help="hidden size of char lstm")
    parser.add_argument("--lm_hidden_size",type=int,default=150,help="hidden size of lm lstm")
    parser.add_argument("--word_embed_size",type=int,default=50,help="word embedding size")
    parser.add_argument("--vocab_size",type=int,default=5000,help="number of words")
    parser.add_argument("--learning_rate",type=float,default=0.0005,help="learning rate during training")
    parser.add_argument("--batch_size",type=int,default=200,help="batch size during training")
    parser.add_argument("--seed",type=int,default=1,help="seed of random")
    parser.add_argument("--epoch",type=int,default=100,help="epoch of training")
    return parser.parse_args()

5. Training

main.py

# -*- coding: utf-8 -*-
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import C2W
from data_load import Char_LM_Dataset
from tqdm import tqdm
import config as argumentparser

config = argumentparser.ArgumentParser()


def get_test_result(data_iter, data_set):
    # Evaluate the model: return the average per-sentence perplexity on a split.
    model.eval()
    all_ppl = 0
    for data, label, weights in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
            weights = weights.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        # Mask padding positions, average the word losses within each sentence,
        # then exponentiate to get a per-sentence perplexity.
        ppl = (loss_now * weights.float()).view([-1, config.max_sentence_length])
        ppl = torch.sum(ppl, dim=1) / torch.sum((weights.view([-1, config.max_sentence_length])) != 0, dim=1).float()
        ppl = torch.sum(torch.exp(ppl))
        all_ppl += ppl.data.item()
    # len(data_set) counts word positions (sentences * max_sentence_length),
    # so this returns the mean perplexity per sentence.
    return all_ppl * config.max_sentence_length / data_set.__len__()


if __name__ == "__main__":
    # Create the configuration
    if config.cuda and torch.cuda.is_available():
        torch.cuda.set_device(config.gpu)
    training_set = Char_LM_Dataset(mode="train")
    # Each dataset sample is a single word position, so a batch of
    # batch_size * max_sentence_length samples corresponds to batch_size sentences.
    training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                                batch_size=config.batch_size * config.max_sentence_length,
                                                shuffle=False,
                                                num_workers=2)
    valid_set = Char_LM_Dataset(mode="valid")

    valid_iter = torch.utils.data.DataLoader(dataset=valid_set,
                                             batch_size=config.batch_size * config.max_sentence_length,
                                             shuffle=False,
                                             num_workers=0)
    test_set = Char_LM_Dataset(mode="test")

    test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                            batch_size=32 * 100,
                                            shuffle=False,
                                            num_workers=0)
    model = C2W(config)
    if config.cuda and torch.cuda.is_available():
        model.cuda()
    criterion = nn.CrossEntropyLoss(reduction="none")  # keep per-word losses so padding can be masked out
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    loss = -1
    for epoch in range(config.epoch):
        model.train()
        process_bar = tqdm(training_iter)
        for data, label, weights in process_bar:
            if config.cuda and torch.cuda.is_available():
                data = data.cuda()
                label = label.cuda()
                weights = weights.cuda()
            else:
                data = torch.autograd.Variable(data).long()
            label = torch.autograd.Variable(label).squeeze()
            out = model(data)
            loss_now = criterion(out, autograd.Variable(label.long()))
            # Masked per-sentence perplexity, reported on the progress bar.
            ppl = (loss_now * weights.float()).view([-1, config.max_sentence_length])
            ppl = torch.sum(ppl, dim=1) / torch.sum((weights.view([-1, config.max_sentence_length])) != 0, dim=1).float()
            ppl = torch.mean(torch.exp(ppl))
            # Optimize the mean loss over real (non-padding) words only.
            loss_now = torch.sum(loss_now * weights.float()) / torch.sum(weights != 0)
            if loss == -1:
                loss = loss_now.data.item()
            else:
                loss = 0.95 * loss + 0.05 * loss_now.data.item()
            process_bar.set_postfix(loss=loss, ppl=ppl.data.item())
            process_bar.update()
            optimizer.zero_grad()
            loss_now.backward()
            optimizer.step()
        print("Valid ppl is:", get_test_result(valid_iter, valid_set))
        print("Test ppl is:", get_test_result(test_iter, valid_set))