Why write one
General-purpose tokenizers (e.g., spacy) don't handle a lot of punctuation well: when a punctuation mark is stuck to a token, pretrained word vectors like GloVe downstream often have no matching vocabulary entry, and the token turns into 'unk'. So I need to write my own tokenizer that cleans the raw text and splits the punctuation off first. For models like BERT or ELMo this is unnecessary, since they handle tokenization themselves (WordPiece or BPE subwords; ELMo works from characters).
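
A quick illustration of the problem (a minimal sketch; the tiny vocab below is a hypothetical stand-in for a real GloVe vocabulary):

# toy stand-in for a GloVe vocabulary (hypothetical, for illustration only)
vocab = {"this", "movie", "is", "great", "!"}

text = "This movie is great!"
naive = text.lower().split()                        # whitespace split leaves punctuation attached
print(naive)                                        # ['this', 'movie', 'is', 'great!']
print([t if t in vocab else "unk" for t in naive])
# ['this', 'movie', 'is', 'unk']  -- 'great!' misses the vocab entry for 'great'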
What it's written with
I used re. I'm still not very familiar with regular expressions; I can get by with the docs for now, and when I find time to study them more systematically I'll update this function accordingly.
import re

def clean_tokenize(data, lower=False):
    ''' Clean a raw string: separate all tokens with spaces and optionally lowercase them.
    Meant for word-level models such as an LSTM with GloVe vectors; subword models
    like BERT come with their own tokenizer (WordPiece or BPE) and don't need this.
    :param data: string, the raw input text
    :param lower: bool, lowercase every token if True
    :return: list, all cleaned tokens from the original input
    '''
    # data = re.sub(r"[^A-Za-z0-9(),!?\'\`\.]", " ", data)  # uncomment to replace any punctuation outside this set (e.g. '<>', '《》') with a space
    # expand some abbreviations
    data = re.sub(r"\'s", " \'s", data)    # keep 's as its own token (it has a GloVe entry)
    data = re.sub(r"n\'t", " not", data)   # imperfect for "won't"/"can't" -> "wo not"/"ca not"
    data = re.sub(r"\'ve", " have", data)
    data = re.sub(r"\'re", " are", data)
    data = re.sub(r"\'d", " would", data)  # actually it can be 'would', 'should', or 'had'
    data = re.sub(r"\'ll", " will", data)
    data = re.sub(r"\'m", " am", data)
    # put a space around every punctuation mark
    data = re.sub(r"\.", " . ", data)
    data = re.sub(r",", " , ", data)
    data = re.sub(r"!", " ! ", data)
    data = re.sub(r"\(", " ( ", data)
    data = re.sub(r"\)", " ) ", data)
    data = re.sub(r"\?", " ? ", data)
    # collapse repeated whitespace into a single space
    data = re.sub(r"\s{2,}", " ", data)
    data = data.lower() if lower else data
    # split on whitespace; punctuation is already spaced out above
    return data.split()
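
For example:

print(clean_tokenize("I don't like it, really!"))
# ['I', 'do', 'not', 'like', 'it', ',', 'really', '!']
print(clean_tokenize("It's great (really).", lower=True))
# ['it', "'s", 'great', '(', 'really', ')', '.']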