Paper: Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
Venue: EMNLP 2015
Institution: Carnegie Mellon University
Problems with prior work:
- Word vectors are learned independently of one another, so nothing can be inferred about unseen words
- The number of parameters grows with the vocabulary size
1. Background
- Learning word representations matters for NLP applications: word vectors capture syntactic and semantic similarity between words in the embedding space. [Background]
- However, word-lookup models assume words are independent of one another. This independence assumption is problematic: similarity in form often implies similarity in function, especially in morphologically rich languages such as Turkish. [Problem]
- Yet the relationship between form and function is not absolute (words similar in form may or may not be similar in function, e.g. lesson & lessen vs. coarse & course). To learn this relationship, the paper runs a bidirectional LSTM over character embeddings. [Analysis and solution]
- The resulting C2W model captures syntactic and semantic similarity between words well and achieves the best results on two tasks (language modeling and POS tagging). [Results]
2. Related Work
2.1 Problems with the current word-lookup mechanism:
- Inference: because each word vector is independent, no representation can be produced for an unseen word, even one composed of previously seen parts (e.g. Frenchification, even when both French and -fication appeared in earlier contexts)
- Because the number of distinct words is huge, avoiding OOV (out-of-vocabulary) words requires an ever larger vocabulary, and hence storing a large number of parameters
- Words are inherently related rather than mutually independent (a point also made in cognitive science)
2.2 Morpheme-based word representations
3. Model Architecture
3.1 Word lookup vs. C2W
- In common: both take a word as input and produce its representation
- Differences:
  - A word-embedding model reads the representation directly out of a lookup table;
  - The character model first maps each character to a character embedding (via a character lookup table), feeds the character embeddings into a bidirectional LSTM, and obtains the final word representation by a linear transformation of the forward and backward hidden outputs
- Concretely, the character embeddings are passed through the BiLSTM, the last hidden state of the forward LSTM and the last hidden state of the backward LSTM are taken, and a linear combination of the two produces the final word embedding, as shown below
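In the paper's notation (my transcription of its formula), with s^f_m the final state of the forward LSTM and s^b_0 the final state of the backward LSTM, the composition is a single linear layer:

$$e^C_w = D^f s^f_m + D^b s^b_0 + b_d$$

where D^f and D^b are projection matrices and b_d is a bias. The fc_1 layer in the code below implements exactly this, after concatenating the two states.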
Running a bidirectional LSTM over characters is clearly slower than a table lookup, but because the word representation is fixed as long as the model parameters do not change, the paper proposes caching the representations of common words in advance, balancing efficiency and accuracy.
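A minimal sketch of such a cache, reusing the modules of the C2W class defined in c2w.py below; the CachedC2W wrapper and its encode method are my own illustration, not the paper's code:

import torch

class CachedC2W:
    # Cache word representations of a trained C2W model: with frozen
    # parameters the embedding of a word never changes, so frequent
    # words can be encoded once and then reused.
    def __init__(self, model, char2id, max_word_length=16):
        self.model, self.char2id, self.max_word_length = model, char2id, max_word_length
        self.cache = {}

    def encode(self, word):
        if word in self.cache:
            return self.cache[word]
        ids = [self.char2id.get(c, self.char2id["<unk>"]) for c in word][:self.max_word_length]
        ids += [self.char2id["<pad>"]] * (self.max_word_length - len(ids))
        x = torch.tensor([ids], dtype=torch.long)
        with torch.no_grad():
            h = self.model.char_lstm(self.model.character_embedding(x))[0]
            emb = self.model.fc_1(torch.cat([h[:, -1, :self.model.char_hidden_size],
                                             h[:, 0, self.model.char_hidden_size:]], dim=1))
        self.cache[word] = emb
        return emb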
3.2 Strengths and weaknesses of the C2W model
3.3 Use cases for the C2W model
- Tasks that need character-level information, such as sequence labeling (POS tagging and named entity recognition)
- Tasks with many OOV words, such as adversarial examples
4. Experiments and Analysis
4.1 Language modeling
- Problem 1: training with character embeddings has no OOV words, but at test time a fixed-size output vocabulary is required (the softmax assigns a probability to each entry), so the authors build the vocabulary from word frequencies and map all rare words to an OOV token
- To fix the resulting train/test mismatch (no OOV tokens during training, but some at test time), words that occur only once in the training data are replaced by the OOV token with probability 0.5 during training; see the sketch below
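A minimal sketch of this stochastic replacement (my own illustration; the unk token name follows the data-loading code below):

import random
from collections import Counter

def randomize_unks(sentences, unk="<unk>", p=0.5):
    # Replace words that occur exactly once in the corpus with the OOV
    # token, with probability p, so the model also sees <unk> in training.
    freq = Counter(w for s in sentences for w in s)
    return [[unk if freq[w] == 1 and random.random() < p else w for w in s]
            for s in sentences]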
Hyperparameter settings
- d_C = 50: character embedding dimension;
- d_CS = 150: number of hidden units in the character BiLSTM;
- word representation dimension: 50;
- the language-model LSTM also uses 150 hidden units;
- batch size = 100, i.e. 100 sentences per step;
- learning rate 0.2, momentum 0.95;
- output vocabulary size: 5000
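For reference, a sketch collecting these settings in the shape of the config class used by c2w.py below; n_chars and max_sentence_length are not reported in the paper, so the values here are assumptions that follow the preprocessing code in this post:

class PaperConfig:
    # Hyperparameters as reported for the language-model experiment.
    char_embed_size = 50       # d_C: character embedding dimension
    char_hidden_size = 150     # d_CS: hidden units of the character BiLSTM
    word_embed_size = 50       # dimension of the composed word representation
    lm_hidden_size = 150       # hidden units of the language-model LSTM
    vocab_size = 5000          # output (softmax) vocabulary size
    batch_size = 100           # sentences per batch
    learning_rate = 0.2        # with momentum 0.95
    n_chars = 512              # assumption: character vocabulary size from data_load.py
    max_sentence_length = 100  # assumption: matches the preprocessing below

This can be passed straight to the model, e.g. model = C2W(PaperConfig).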
4.2 POS tagging
- Why use a POS tagging experiment to validate C2W?
- Because POS tagging needs character-level information: in English, suffixes such as -sion, -ment, and -tion signal nouns, while -ly signals adverbs; named entity recognition also relies on character information to some degree. Character encoders are therefore a natural fit to try in these tasks
- The reported results (table omitted here) show that even with character embeddings the RNN gains little, and still performs worse than word-embedding methods
- The best result uses d_CS = 150
5. Code Implementation
wiki百科英文语料下载地址:https://dumps.wikimedia.org/enwiki/latest/
wikiextractor:https://github.com/attardi/wikiextractor/tree/e4abb4cbd019b0257824ee47c23dd163919b731b
python WikiExtractor.py -o output -b 1000M enwiki-latest-pages-articles14.xml-p7697595p7744800.bz2 --json
pip install nltk; after installing, you still need to run the following in Python to download the tokenizer models (the punkt model is what sent_tokenize and word_tokenize require):
import nltk
nltk.download()
Or download https://pan.baidu.com/s/1hq7UUFU and extract it into the nltk data directory
data_processing.py
# -*- coding: utf-8 -*-
import json
import nltk

datas = open("./wiki_00", encoding="utf-8").read().splitlines()  # read the file and split it into lines
num_words = 0
f_train = open("train.txt", "w", encoding="utf-8")
f_valid = open("valid.txt", "w", encoding="utf-8")
f_test = open("test.txt", "w", encoding="utf-8")
for data in datas:
    data = json.loads(data, strict=False)  # each line is a JSON string
    sentences = data["text"]  # the parsed dict has id, url, title, text, etc.; we only need the text
    sentences = sentences.replace("\n\n", ". ")  # replace paragraph breaks with periods
    sentences = sentences.replace("\n", ". ")
    sentences = nltk.sent_tokenize(sentences)  # sent_tokenize splits into sentences; returns a list, one sentence per element
    for sentence in sentences:
        sentence = nltk.word_tokenize(sentence)  # word_tokenize splits the sentence into tokens
        if len(sentence) < 10 or len(sentence) > 100:  # drop sentences that are too short or too long
            continue
        num_words += len(sentence)
        sentence = " ".join(sentence) + "\n"
        # split into train / valid / test by cumulative word count and write to file
        if num_words <= 1000000:
            f_train.write(sentence)
        elif num_words <= 1020000:
            f_valid.write(sentence)
        elif num_words <= 1040000:
            f_test.write(sentence)
        else:
            exit()  # all splits are full; stop (buffers are flushed at interpreter shutdown)
data_load.py
# coding:utf-8
from torch.utils import data
import os
import numpy as np
import pickle
from collections import Counter  # for counting word frequencies

class Char_LM_Dataset(data.Dataset):  # subclass Dataset, not DataLoader
    def __init__(self, mode="train", max_word_length=16, max_sentence_length=100):
        self.path = os.path.abspath('.')
        if "data" not in self.path:
            self.path += "/data"
        self.mode = mode
        self.max_word_length = max_word_length
        self.max_sentence_length = max_sentence_length
        datas = self.read_file()
        datas, char_datas, weights = self.generate_data_label(datas)
        self.datas = datas.reshape([-1])  # flatten to 1-D: (num_sentences * max_sentence_length,)
        self.char_datas = char_datas.reshape([-1, self.max_word_length])  # (num_sentences * max_sentence_length, max_word_length)
        self.weights = weights
        print(self.datas.shape, self.char_datas.shape, weights.shape)

    def __getitem__(self, index):
        return self.char_datas[index], self.datas[index], self.weights[index]

    def __len__(self):
        return len(self.datas)

    def read_file(self):
        if self.mode == "train":
            datas = open(self.path + "/train.txt", encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]  # split on spaces so each element is one token
            if not os.path.exists(self.path + "/word2id"):
                words = []
                chars = []
                for data in datas:  # datas is a list whose elements are sentences
                    for word in data:  # each sentence is a list whose elements are words
                        words.append(word.lower())  # lowercase each word and collect it
                        chars.extend(word)  # split each word into characters and collect them
                words = dict(Counter(words).most_common(5000 - 2))  # keep the 5000-2 most frequent words; key = word, value = frequency
                chars = dict(Counter(chars).most_common(512 - 3))
                word2id = {"<pad>": 0, "<unk>": 1}  # build word2id
                for word in words:
                    word2id[word] = len(word2id)
                char2id = {"<pad>": 0, "<unk>": 1, "<start>": 2}  # build char2id
                for char in chars:
                    char2id[char] = len(char2id)
                self.word2id = word2id
                self.char2id = char2id
                pickle.dump(self.word2id, open(self.path + "/word2id", "wb"))  # save both vocabularies
                pickle.dump(self.char2id, open(self.path + "/char2id", "wb"))
            else:
                self.word2id = pickle.load(open(self.path + "/word2id", "rb"))
                self.char2id = pickle.load(open(self.path + "/char2id", "rb"))
            return datas
        elif self.mode == "valid":
            datas = open(self.path + "/valid.txt", encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]
            self.word2id = pickle.load(open(self.path + "/word2id", "rb"))
            self.char2id = pickle.load(open(self.path + "/char2id", "rb"))
            return datas
        elif self.mode == "test":
            datas = open(self.path + "/test.txt", encoding="utf-8").read().strip("\n").splitlines()
            datas = [s.split() for s in datas]
            self.word2id = pickle.load(open(self.path + "/word2id", "rb"))
            self.char2id = pickle.load(open(self.path + "/char2id", "rb"))
            return datas

    def generate_data_label(self, datas):  # build features (characters) and labels (words)
        char_datas = []
        weights = []  # padding positions get weight 0; the loss is multiplied by weights so padding does not contribute
        for i, data in enumerate(datas):
            if i % 1000 == 0:
                print(i, len(datas))  # progress: sentences processed / total sentences
            char_data = [[self.char2id["<start>"]] * self.max_word_length]  # prepend a start-of-sentence "word"
            # max_word_length is the maximum number of characters per word, i.e. each word is represented by max_word_length character ids
            for j, word in enumerate(data):
                char_word = []
                for char in word:
                    char_word.append(self.char2id.get(char, self.char2id["<unk>"]))
                    # dict.get falls back to the <unk> id when the character is not in the vocabulary
                char_word = char_word[0:self.max_word_length] + \
                            [self.char2id["<pad>"]] * (self.max_word_length - len(char_word))  # truncate long words, pad short ones
                datas[i][j] = self.word2id.get(datas[i][j].lower(), self.word2id["<unk>"])
                char_data.append(char_word)
            weights.extend([1] * min(len(datas[i]), self.max_sentence_length) + [0] * (self.max_sentence_length - len(datas[i])))
            # real words get weight 1, padded positions get 0; final shape: (num_sentences * max_sentence_length,)
            datas[i] = datas[i][0:self.max_sentence_length] + [self.word2id["<pad>"]] * (self.max_sentence_length - len(datas[i]))  # word matrix, shape: (num_sentences, max_sentence_length)
            char_datas.append(char_data)
            char_datas[i] = char_datas[i][0:self.max_sentence_length] + \
                            [[self.char2id["<pad>"]] * self.max_word_length] * (self.max_sentence_length - len(char_datas[i]))  # pad the sentence with all-<pad> words
        datas = np.array(datas)  # shape: (num_sentences, max_sentence_length)
        char_datas = np.array(char_datas)  # shape: (num_sentences, max_sentence_length, max_word_length)
        weights = np.array(weights)  # shape: (num_sentences * max_sentence_length,)
        return datas, char_datas, weights

if __name__ == "__main__":
    char_lm_dataset = Char_LM_Dataset()
c2w.py
# -*- coding: utf-8 -*-
import torch
import torch.nn as nn

class C2W(nn.Module):
    def __init__(self, config):
        super(C2W, self).__init__()
        self.char_hidden_size = config.char_hidden_size
        self.word_embed_size = config.word_embed_size
        self.lm_hidden_size = config.lm_hidden_size
        self.character_embedding = nn.Embedding(config.n_chars, config.char_embed_size)
        self.sentence_length = config.max_sentence_length
        self.char_lstm = nn.LSTM(input_size=config.char_embed_size, hidden_size=config.char_hidden_size,
                                 bidirectional=True, batch_first=True)  # batch_first defaults to False; set True because batch_size is the first dimension
        self.lm_lstm = nn.LSTM(input_size=self.word_embed_size, hidden_size=config.lm_hidden_size, batch_first=True)
        self.fc_1 = nn.Linear(2 * config.char_hidden_size, config.word_embed_size)  # linear composition into the word representation
        self.fc_2 = nn.Linear(config.lm_hidden_size, config.vocab_size)  # project to the vocabulary to predict the next word

    def forward(self, x):  # x: (batch_size * max_sentence_length) x max_word_length
        input = self.character_embedding(x)  # look up character embeddings,
        # shape: (batch_size * max_sentence_length) x max_word_length x char_embed_size
        char_lstm_result = self.char_lstm(input)
        # output shape: (batch_size * max_sentence_length) x max_word_length x (2 * hidden_size); forward and backward outputs are concatenated
        word_input = torch.cat([char_lstm_result[0][:, -1, 0:self.char_hidden_size],
                                char_lstm_result[0][:, 0, self.char_hidden_size:]], dim=1)
        word_input = self.fc_1(word_input)  # compose into word representations: (batch_size * max_sentence_length) x word_embed_size
        word_input = word_input.view([-1, self.sentence_length, self.word_embed_size])  # reshape to batch_size x max_sentence_length x word_embed_size
        lm_lstm_result = self.lm_lstm(word_input)[0].contiguous()  # contiguous so view() works; shape: batch_size x max_sentence_length x lm_hidden_size
        lm_lstm_result = lm_lstm_result.view([-1, self.lm_hidden_size])  # reshape to (batch_size * max_sentence_length) x lm_hidden_size
        out = self.fc_2(lm_lstm_result)
        return out

class config:
    def __init__(self):
        self.n_chars = 64
        self.char_embed_size = 50
        self.max_sentence_length = 8
        self.char_hidden_size = 50
        self.lm_hidden_size = 150
        self.word_embed_size = 50
        self.vocab_size = 1000

if __name__ == "__main__":
    config = config()
    c2w = C2W(config)
    test = torch.zeros(64, 16, dtype=torch.long)  # nn.Embedding needs a LongTensor of indices, not a numpy float array
    print(c2w(test).shape)  # torch.Size([64, 1000]): one next-word distribution per word position
train&test.py
# -*- coding: utf-8 -*-
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from c2w import C2W
from data_load import Char_LM_Dataset
from tqdm import tqdm
import config as argumentparser  # assumes a local config.py whose ArgumentParser() returns the hyperparameters
config = argumentparser.ArgumentParser()  # (cuda, gpu, batch_size, learning_rate, epoch, max_sentence_length, ...)

def get_test_result(data_iter, data_set):
    # evaluate on the validation / test set
    model.eval()
    all_ppl = 0
    for data, label, weights in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
            weights = weights.cuda()
        else:
            data = torch.autograd.Variable(data).long()
            label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        # perplexity, step by step: exp of the mean cross-entropy over the non-padding tokens of each sentence
        ppl = (loss_now * weights.float()).view([-1, config.max_sentence_length])
        ppl = torch.sum(ppl, dim=1) / torch.sum((weights.view([-1, config.max_sentence_length])) != 0, dim=1).float()
        ppl = torch.sum(torch.exp(ppl))
        all_ppl += ppl.data.item()
    return all_ppl * config.max_sentence_length / data_set.__len__()  # dataset length is num_sentences * max_sentence_length

if __name__ == "__main__":
    # Create the configuration
    if config.cuda and torch.cuda.is_available():
        torch.cuda.set_device(config.gpu)
    training_set = Char_LM_Dataset(mode="train")
    training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                                batch_size=config.batch_size * config.max_sentence_length,
                                                shuffle=False,
                                                num_workers=2)
    valid_set = Char_LM_Dataset(mode="valid")
    valid_iter = torch.utils.data.DataLoader(dataset=valid_set,
                                             batch_size=config.batch_size * config.max_sentence_length,
                                             shuffle=False,
                                             num_workers=0)
    test_set = Char_LM_Dataset(mode="test")
    test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                            batch_size=32 * 100,
                                            shuffle=False,
                                            num_workers=0)
    model = C2W(config)
    if config.cuda and torch.cuda.is_available():
        model.cuda()
    criterion = nn.CrossEntropyLoss(reduction="none")  # keep per-token losses so we can multiply by the padding weights before averaging
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    loss = -1
    for epoch in range(config.epoch):
        model.train()
        process_bar = tqdm(training_iter)
        for data, label, weights in process_bar:
            if config.cuda and torch.cuda.is_available():
                data = data.cuda()
                label = label.cuda()
                weights = weights.cuda()
            else:
                data = torch.autograd.Variable(data).long()
                label = torch.autograd.Variable(label).squeeze()
            out = model(data)
            loss_now = criterion(out, autograd.Variable(label.long()))
            ppl = (loss_now * weights.float()).view([-1, config.max_sentence_length])
            ppl = torch.sum(ppl, dim=1) / torch.sum((weights.view([-1, config.max_sentence_length])) != 0, dim=1).float()
            ppl = torch.mean(torch.exp(ppl))
            loss_now = torch.sum(loss_now * weights.float()) / torch.sum(weights != 0)  # masked mean loss
            if loss == -1:
                loss = loss_now.data.item()
            else:
                loss = 0.95 * loss + 0.05 * loss_now.data.item()  # exponential moving average for display
            process_bar.set_postfix(loss=loss, ppl=ppl.data.item())
            process_bar.update()
            optimizer.zero_grad()
            loss_now.backward()
            optimizer.step()
        print("Valid ppl is:", get_test_result(valid_iter, valid_set))
        print("Test ppl is:", get_test_result(test_iter, test_set))
These are my study notes from the 深度之眼 paper-reading course, written for my own learning only; if you spot any problems, discussion is welcome!