SentencePiece, subword-nmt, and the BPE algorithm

BPE (Byte Pair Encoding) was applied to machine translation in 2016 to address the out-of-vocabulary (OOV) and rare-word problems: rare or unseen words are split into smaller subword units that are in the vocabulary (e.g. "lowest" becomes "low" + "est"). Paper: "Neural Machine Translation of Rare Words with Subword Units", published at ACL 2016.

http://www.sohu.com/a/115373230_465975

tensor2tensor also uses BPE-style subwords; the relevant files to look at are listed below (a rough usage sketch follows the file list):

data_generators/problem.py

data_generators/translate_ende.py
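
The subword vocabulary itself is built by text_encoder.SubwordTextEncoder, a BPE-like wordpiece scheme. The sketch below is from memory (class name and build_to_target_size arguments) and uses toy counts, so treat it as an assumption to check against the tensor2tensor source rather than a verified recipe.

from tensor2tensor.data_generators import text_encoder

# toy word-frequency table; in tensor2tensor this comes from the problem's corpus
token_counts = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}

# the last two arguments bound the binary search over the minimum token count
# used to hit (approximately) the target vocabulary size
encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    100, token_counts, 1, 1000)

ids = encoder.encode('the lowest rate')   # list of subtoken ids
print(ids, encoder.decode(ids))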

BPE algorithm implementations:

1. Reference: https://plmsmile.github.io/2017/10/19/subword-units/

import re

def process_raw_words(words, endtag='-'):
    '''Split each word into its individual characters and append an end-of-word tag.'''
    vocabs = {}
    for word, count in words.items():
        # insert a space before every letter
        word = re.sub(r'([a-zA-Z])', r' \1', word)
        word += ' ' + endtag
        vocabs[word] = count
    return vocabs
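
# Added sanity check (not in the original post): every letter becomes its own
# symbol and the end-of-word marker '-' is appended as a separate symbol.
assert process_raw_words({'low': 5}) == {' l o w -': 5}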

def get_symbol_pairs(vocabs):
    '''Collect all adjacent symbol pairs (length 2) in the vocabulary and count their occurrences.
    Args:
        vocabs: dict of (word, count); each word has already been split into its smallest symbols
    Returns:
        pairs: dict of ((symbol1, symbol2), count)
    '''
    pairs = dict()
    for word, freq in vocabs.items():
        # the symbols that make up the word
        symbols = word.split()
        for i in range(len(symbols) - 1):
            p = (symbols[i], symbols[i + 1])
            pairs[p] = pairs.get(p, 0) + freq
    return pairs
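
# Added sanity check: the adjacent-symbol pairs of ' l o w -', each counted
# with the word's frequency.
assert get_symbol_pairs({' l o w -': 5}) == {('l', 'o'): 5, ('o', 'w'): 5, ('w', '-'): 5}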

def merge_symbols(symbol_pair, vocabs):
    '''Replace every occurrence of the string 'a b' with 'ab' in all words of vocabs.
    Args:
        symbol_pair: (a, b), the two symbols to merge
        vocabs: dict of (word, count) where each word is written as space-separated subwords (symbols)
    Returns:
        vocabs_new: a new vocabulary with 'a b' replaced by 'ab'
    '''
    vocabs_new = {}
    raw = ' '.join(symbol_pair)
    merged = ''.join(symbol_pair)
    # escape characters that are special in regular expressions
    bigram = re.escape(raw)
    # only match 'a b' when both are whole symbols, not parts of a larger symbol
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, count in vocabs.items():
        word_new = p.sub(merged, word)
        vocabs_new[word_new] = count
    return vocabs_new
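
# Added sanity check: merging ('l', 'o') joins the two symbols into one. The
# (?<!\S)...(?!\S) guards ensure only whole symbols are merged, so an 'l' or 'o'
# that is already part of a larger symbol is left untouched.
assert merge_symbols(('l', 'o'), {' l o w -': 5}) == {' lo w -': 5}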

raw_words = {"low":5, "lower":2, "newest":6, "widest":3}
vocabs = process_raw_words(raw_words)

num_merges = 10
print(vocabs)
for i in range(num_merges):
    pairs = get_symbol_pairs(vocabs)
    # pick the pair with the highest frequency
    symbol_pair = max(pairs, key=pairs.get)
    vocabs = merge_symbols(symbol_pair, vocabs)
print(vocabs)

Output:

Before: {"low": 5, "lower": 2, "newest": 6, "widest": 3}
After BPE: {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}

{"low": 5, "lower": 2, "newest": 6, "widest": 3} is the original frequency of each word. In the final output, splitting on spaces gives the modeling units; here they are low-, low, e, r, -, newest-, wi, d and est-. Any output text can then be mapped onto a sequence of these units.
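
The point of BPE is to reuse these merges on new, possibly unseen words. The loop above does not record the merge order, so here is a minimal sketch that does; learn_bpe and apply_bpe are my own helper names built on the functions above, not part of the original post:

def learn_bpe(raw_words, num_merges=10):
    '''Same merge loop as above, but also return the merges in the order they were learned.'''
    vocabs = process_raw_words(raw_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_symbol_pairs(vocabs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocabs = merge_symbols(best, vocabs)
    return merges

def apply_bpe(word, merges, endtag='-'):
    '''Segment a new word by replaying the learned merges in order.'''
    symbols = list(word) + [endtag]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = learn_bpe(raw_words)
print(apply_bpe('lowest', merges))   # -> ['low', 'est-'], even though 'lowest' was never seen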


2. Reference: "Neural Machine Translation of Rare Words with Subword Units"

Paper walkthrough: http://www.sohu.com/a/115373230_465975

import re, collections

def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        print(symbols)
        print("len(symbols)     ---   ", len(symbols))
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    print("bigram    ", bigram)
    # the lookarounds make sure only whole symbols are merged
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        print("w_out    ", w_out)
        v_out[w_out] = v_in[word]
    return v_out
     
vocab = {'l o w ': 5, 'l o w e r ': 2,
         'n e w e s t ': 6, 'w i d e s t ': 3}
num_merges = 10

for i in range(num_merges):
    print("=#####################################=== ")
    pairs = get_stats(vocab)
    print("===========11111======= ")
    print(pairs)

    best = max(pairs, key=pairs.get)
    print("===========2222======= ")
    print("pairs.get   ", pairs.get)
    print("best   ", best)
    vocab = merge_vocab(best, vocab)
    print("vocab   ", vocab)

Personally, I still find SentencePiece the most convenient tool for subword tokenization.

SentencePiece

Reference: https://github.com/google/sentencepiece/tree/master/python

Training a tokenizer with 20k label IDs:

>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=/data/yelong/bpe_test/lib.txt --model_prefix=/data/yelong/bpe_test/bpe --vocab_size=20000 --model_type=bpe') 

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
with open('/data/yelong/bpe_test/wav/train/text.txt', 'a') as fid, \
     open('/data/yelong/bpe_test/wav/train/train.txt') as did:
    for line in did:
        # each line is "<utt-id> TWO COME MUSE MIGRATE ..."
        fields = line.strip().split()
        utt_id, words = fields[0], fields[1:]
        text = ' '.join(words)        # eg. "TWO COME MUSE MIGRATE"
        ids = sp.EncodeAsIds(text)    # list of piece ids
        strid = ' '.join(str(t) for t in ids)
        fid.write(utt_id + ' ' + strid + '\n')
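
To sanity-check what those IDs stand for, the same processor can also encode to pieces and decode IDs back to text (standard calls from the sentencepiece Python wrapper; the example sentence and the pieces shown are only illustrative):

print(sp.EncodeAsPieces('TWO COME MUSE MIGRATE'))
# e.g. ['▁TWO', '▁COME', '▁MUSE', '▁MIGRATE'] or smaller pieces, depending on the trained model
ids = sp.EncodeAsIds('TWO COME MUSE MIGRATE')
print(sp.DecodeIds(ids))   # -> 'TWO COME MUSE MIGRATE'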

Training produces two files, .model and .vocab.

bpe.vocab:

<unk>   0
<s>     0
</s>    0
▁T      -0
HE      -1
▁A      -2
▁THE    -3
IN      -4
▁S      -5
▁W      -6

This is just a piece-to-score mapping; the number in the right-hand column is not the ID. model_type can be one of several types (unigram (default), bpe, char, or word), and when unigram is chosen, for example, the right-hand column contains decimals (log probabilities), so it clearly is not an ID.

So I should not have written the alphabet in the nabu config as only 0-19996 (19996 is the last number in bpe.vocab); it should be 0-19999.

This has been verified: every ID from 0 to 19999 maps to a piece. Verification method:

% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("/data/yelong/bpe_test/bpe.model")
>>> for i in range(20000):
...     sp.IdToPiece(i)

All of them print a piece. (An ID without a corresponding piece would raise an error and exit.)
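
A quicker check of the same thing, assuming the model above is still loaded:

>>> sp.GetPieceSize()   # total number of pieces, so valid ids are 0 .. 19999
20000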
