【TensorFlow】A Complete Chinese-English Machine Translation Project Based on a Seq2Seq Model (Part 1): Data Processing

Disclaimer

  Some of the code in this article is partial and only meant to convey the idea, so it cannot be copied and used as-is. For one thing, pasting the complete code would make the article far too long; for another, I prefer to write in a "teach people to fish" spirit. If anything is lacking, corrections are welcome!

Dataset

  When I started this project, the dataset I found came from https://wit3.fbk.eu/2015-01, a Chinese-English bilingual corpus of TED Talks. For reasons I could not pin down at the time, the results were poor, so I switched to the data used for the translation track of the 2017 AI Challenge. Download link: dataset. A sample of its contents:
Figure 1: a sample of the dataset

  The results on this dataset were decent. Looking at the data, this is probably because the TED Talk corpus is natural human speech, which is more complex and varied and contains many long sentences, whereas the vast majority of the AI Challenge data consists of simple short sentences like those in the figure above. I therefore suggest choosing a simple dataset and limiting the maximum sentence length used for training.

Data Processing

  • Sentence-level processing

    • Remove foreign symbols, including but not limited to digits, odd symbols such as "#@", and characters from the other language. In my own run I also removed punctuation; I am not sure that is the right call, so treat it as a matter of taste;
      sentence = re.sub(r"[a-zA-Z]", r"", sentence) if IsChinese(sentence) else sentence
      sentence = re.sub(r"[0-9]", r"", sentence)
      sentence = re.sub(r"[@#$%^-&*]", r"", sentence)
      sentence = re.sub(r"[,.!?';:,。!?’‘“”;:]", r"", sentence)
      
    • Word segmentation: for Chinese I use the Python library jieba, and for English the Python library nltk;
      # word segmentation
      sentence = jieba.lcut(sentence) if IsChinese(sentence) else nltk.word_tokenize(sentence)
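
      A note on nltk: word_tokenize relies on the punkt tokenizer models, which ship separately, so a fresh environment raises a LookupError until you download them once, for example:

      import nltk

      # One-time download of the tokenizer models that nltk.word_tokenize needs
      # (newer nltk versions may also ask for "punkt_tab").
      nltk.download("punkt")

      print(nltk.word_tokenize("I love you."))  # ['I', 'love', 'you', '.']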
      
    • Special handling, which varies with the specific language and dataset. For example, English can be lowercased; in my dataset the Chinese sentences also contained space-separated and parenthesised fragments that had no counterpart in the English sentences, so I simply removed them;
      # lowercase (do this while the sentence is still a single string)
      sentence = sentence.lower()
      
      # delete invalid characters: parenthesised fragments and spaces that only appear
      # in the Chinese half of the corpus (escape the parentheses, otherwise "(.*)" is
      # a capture group that deletes the whole sentence)
      sentence = re.sub(r"\(.*?\)", r"", sentence) if IsChinese(sentence) else sentence
      sentence = re.sub(r" ", r"", sentence) if IsChinese(sentence) else sentence
      
    • Add start and end markers. This part is up to you: you can use "bos" for the beginning and "eos" for the end, or define your own tokens. What I actually do is append an end marker to the source language and add both a start and an end marker to the target language;
      for sent_index in range(len(source_data)):
          # This loop is for dropping long sentences and building the vocabulary.
          if len(source_data[sent_index]) + 1 <= src_max_length and \
                  len(target_data[sent_index]) + 2 <= tgt_max_length:
              source.AddSentence(source_data[sent_index])
              target.AddSentence(target_data[sent_index])
              data.append((source_data[sent_index] + ["<eos>"],
                           ["<bos>"] + target_data[sent_index] + ["<eos>"]))
              # Actually just add "<eos>" to source data and "<bos>" and "<eos>" both to target data.
      
    • Pad the sentences: short sentences have to be lengthened so that they can be fed to the model in a single batch. Padding is usually done with 0, and 0 must therefore not be used as the index of any word in the vocabulary. There are two options: pad everything to the global maximum sentence length, or pad the sentences within a batch to the maximum length in that batch, which I will call "batch padding". The second option saves storage, compute, and time, so I recommend it.
      def DoBatchPadding(self, data):
          dataset = []
          for batch_index in range(len(data) // self.batch_size):
              # This loop decides how many batches are in the train data.
              src_batch = []
              tgt_batch = []
              for sent_index in range(batch_index * self.batch_size, (batch_index + 1) * self.batch_size):
                  # This loop processes every sentence in one batch. Each saved line holds the
                  # source indices followed by the target indices; index 1 is "<bos>", which only
                  # occurs at the start of the target sentence, so it marks the split point.
                  numbers = data[sent_index].split(" ")
                  bos_position = numbers.index("1")
                  src_batch.append([int(num) for num in numbers[:bos_position]])
                  tgt_batch.append([int(num) for num in numbers[bos_position:]])
              src_max_len = max([len(sent) for sent in src_batch])
              tgt_max_len = max([len(sent) for sent in tgt_batch])
              for sent_index in range(self.batch_size):
                  # This loop lengthens shorter sentences to the max length in the same batch.
                  src_batch[sent_index] += [0] * (src_max_len - len(src_batch[sent_index]))
                  tgt_batch[sent_index] += [0] * (tgt_max_len - len(tgt_batch[sent_index]))
              # convert_to_tensor is assumed to be TensorFlow's tf.convert_to_tensor, imported elsewhere.
              dataset.append((convert_to_tensor(src_batch), convert_to_tensor(tgt_batch)))
          return dataset
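
      As a side note, if you would rather not hand-roll the zero padding, TensorFlow ships a utility that performs the same per-batch padding; a small hedged example (the batch contents here are made up):

      import tensorflow as tf

      # A toy batch of index sequences with different lengths.
      src_batch = [[5, 12, 7], [3, 9], [4, 8, 15, 2]]

      # padding="post" appends zeros; with maxlen left unset, sequences are padded
      # to the longest one in this batch, which is exactly the "batch padding" idea.
      padded = tf.keras.preprocessing.sequence.pad_sequences(src_batch, padding="post")
      print(padded)
      # [[ 5 12  7  0]
      #  [ 3  9  0  0]
      #  [ 4  8 15  2]]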
      
  • Build the vocabulary. This is done after the sentences have been segmented; a Python dict is normally used to map word strings to numbers. Low-frequency words are removed from the vocabulary and replaced with a token such as "<unk>". Finally, the vocabulary has to be saved to a file, because it is needed on its own when testing the translation.

    class Language:
        def __init__(self, name):
            self.name = name
            self.word2index = None
            self.index2word = None
            self.word2frequency = {}
            self.vocab_size = None
            self.low_frequency = None

        def AddWord(self, word):
            if word not in self.word2frequency.keys():
                self.word2frequency[word] = 1
            else:
                self.word2frequency[word] += 1

        def AddSentence(self, sentence):
            # The sentence should already have been processed into a list of words.
            for word in sentence:
                self.AddWord(word)

        def DeleteLowFrequencyWord(self):
            self.word2index = ["<bos>", "<eos>", "<unk>"] + [word for (word, count) in self.word2frequency.items()
                                                             if count > self.low_frequency]
            self.vocab_size = len(self.word2index)
            # Indices start from 1 so that 0 stays reserved for padding.
            self.word2index = dict([(word, index + 1) for index, word in enumerate(self.word2index)])
            self.index2word = dict([(value, word) for (word, value) in self.word2index.items()])
    
    # Delete low frequency words.
    src_words_count = sum([len(sent[0]) for sent in data])
    tgt_words_count = sum([len(sent[1]) for sent in data])
    print("The count of all words in source language corpus is: {}".format(src_words_count))
    source.low_frequency = int(input())
    print("The count of all words in target language corpus is: {}".format(tgt_words_count))
    target.low_frequency = int(input())
    source.DeleteLowFrequencyWord()
    target.DeleteLowFrequencyWord()
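
    As mentioned above, the vocabulary also has to end up in a file so that the test/translation script can rebuild exactly the same word-index mapping. A minimal sketch using json (the file names are placeholders of my own, not the project's):

    import json

    # Persist the word-to-index mappings; the test script loads them back later.
    with open("source_vocab.json", "w", encoding="utf-8") as f:
        json.dump(source.word2index, f, ensure_ascii=False)
    with open("target_vocab.json", "w", encoding="utf-8") as f:
        json.dump(target.word2index, f, ensure_ascii=False)

    # At translation time:
    with open("source_vocab.json", "r", encoding="utf-8") as f:
        src_word2index = json.load(f)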
    
  • Limit the maximum sentence length. This matters: a simple machine translation model simply cannot handle overly long, complex sentences. I suggest a value around 20; of course, if most sentences in the dataset are even shorter, you can set a smaller maximum.

    My own experiments back this up. After segmentation I plotted the sentence length distribution, i.e. how many sentences there are of each length. The TED Talk dataset had quite a few long sentences; I picked 64, which I took to be a middling value, as the maximum length, and it did not work out (I did not keep that plot). For the AI Challenge dataset I do have the plots:
    (figure: source sentence length distribution)
    (figure: target sentence length distribution)
    Most sentences are shorter than 20, so in the end I set the maximum sentence length to 20.

    PlotSentenceLengthDistribution(source_data, target_data)
    print("Please input max sequence length for source language:")
    src_max_length = int(input())
    print("Please input max sequence length for target language:")
    tgt_max_length = int(input())
    
    source = Language("Chinese")
    target = Language("English")
    data = []
    for sent_index in range(len(source_data)):
        # This loop is for dropping long sentences and building the vocabulary.
        if len(source_data[sent_index]) + 1 <= src_max_length and \
                len(target_data[sent_index]) + 2 <= tgt_max_length:
            source.AddSentence(source_data[sent_index])
            target.AddSentence(target_data[sent_index])
            data.append((source_data[sent_index] + ["<eos>"],
                         ["<bos>"] + target_data[sent_index] + ["<eos>"]))
            # Actually just add "<eos>" to source data and "<bos>" and "<eos>" both to target data.
    
    import numpy
    from matplotlib import pyplot as plt

    def PlotSentenceLengthDistribution(src_data, tgt_data):
        src_length_dict = {}
        tgt_length_dict = {}
        for index in range(len(src_data)):
            # This loop counts the frequency of every sentence length.
            src_length = len(src_data[index])
            tgt_length = len(tgt_data[index])
            if src_length not in src_length_dict.keys():
                src_length_dict[src_length] = 1
            else:
                src_length_dict[src_length] += 1
            if tgt_length not in tgt_length_dict.keys():
                tgt_length_dict[tgt_length] = 1
            else:
                tgt_length_dict[tgt_length] += 1
        src_max_length = numpy.max(numpy.asarray(list(src_length_dict.keys())))
        tgt_max_length = numpy.max(numpy.asarray(list(tgt_length_dict.keys())))
        width = 0.4
        plt.bar(numpy.asarray(list(src_length_dict.keys())),
                numpy.asarray(list(src_length_dict.values())),
                width=width, color="b")
        plt.xticks(numpy.arange(0, src_max_length, src_max_length // 15),
                   labels=list(numpy.arange(0, src_max_length, src_max_length // 15)))
        plt.xlabel("source sentence length")
        plt.ylabel("nums")
        plt.title("source sentence length distribution")

        # If run in a terminal such as a remote server, save the figure, or you may never see it.
        plt.savefig("./distribution_of_source_sentence_length")

        plt.show()
        # Start a fresh figure so the target distribution is not drawn over the source one.
        plt.figure()
        plt.bar(numpy.asarray(list(tgt_length_dict.keys())),
                numpy.asarray(list(tgt_length_dict.values())),
                width=width, color="r")
        plt.xticks(numpy.arange(0, tgt_max_length, tgt_max_length // 15),
                   labels=list(numpy.arange(0, tgt_max_length, tgt_max_length // 15)))
        plt.xlabel("target sentence length")
        plt.ylabel("nums")
        plt.title("target sentence length distribution")
        plt.savefig("./distribution_of_target_sentence_length")
        plt.show()
    
  • Save the data. This is optional. Since I wrote my own data loader and the per-batch padding, and making the data order random in every epoch would otherwise mean re-reading and re-processing everything from scratch, which is time-consuming, I saved the data after tokenization and conversion to indices; each run then only needs to re-read it and redo the batch padding. Let's just say I trimmed the repeated work as much as I could.

    # Save train data and test data in the form of numbers.
    split = int(len(data) * 0.8)
    train_dataset = data[:split]
    test_dataset = data[split:]
    with open(train_save_path, "w", encoding="utf-8") as train:
        for src_sent, tgt_sent in train_dataset:
            train.write(" ".join([str(GetWordIndex(word, source.word2index)) for word in src_sent] +
                                 [str(GetWordIndex(word, target.word2index)) for word in tgt_sent]) + "\n")
    with open(test_save_path, "w", encoding="utf-8") as test:
        for src_sent, tgt_sent in test_dataset:
            test.write(" ".join([str(GetWordIndex(word, source.word2index)) for word in src_sent] +
                                [str(GetWordIndex(word, target.word2index)) for word in tgt_sent]) + "\n")
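
    The GetWordIndex helper used above is not shown in this article; its only job is to look a word up in word2index and fall back to the "<unk>" index for words that were dropped as low-frequency. A sketch of what such a helper can look like:

    def GetWordIndex(word, word2index):
        # Low-frequency words were removed from the vocabulary, so map them to "<unk>".
        return word2index.get(word, word2index["<unk>"])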
    
  • Load the data: read back the saved, mostly processed data, shuffle it randomly, and redo the batch padding.

    def Make(self):
        # First, load the data from train_data.txt and test_data.txt.
        train_data, test_data = self.LoadData()

        # Shuffle the training data; the test data keeps its original order.
        random.shuffle(train_data)

        # Do batch padding: within one batch all sentences have the same length,
        # but lengths may differ between batches.
        train_dataset = self.DoBatchPadding(train_data)
        if test_data is not None:
            test_dataset = self.DoBatchPadding(test_data)
            return train_dataset, test_dataset
        else:
            return train_dataset
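
    LoadData is not shown above either; it only reads back the two text files written in the saving step. A rough sketch, assuming the paths are stored on the object as self.train_data_path and self.test_data_path (attribute names I made up for illustration):

    import os

    def LoadData(self):
        # Read the index-encoded sentence pairs saved by the preprocessing step.
        with open(self.train_data_path, "r", encoding="utf-8") as train:
            train_data = [line.strip() for line in train if line.strip()]
        test_data = None
        if os.path.exists(self.test_data_path):
            with open(self.test_data_path, "r", encoding="utf-8") as test:
                test_data = [line.strip() for line in test if line.strip()]
        return train_data, test_data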
    

Complete source code for this part

DataPreprocess.py
DatasetMake.py
Tools.py

Article series

【TensorFlow】A Complete Chinese-English Machine Translation Project Based on a Seq2Seq Model (Part 1): Data Processing

【TensorFlow】A Complete Chinese-English Machine Translation Project Based on a Seq2Seq Model (Part 2): Model Training

             Writing this was not easy; if it helped, please like, bookmark, and follow. Thank you!
