Why write one
General-purpose tokenizers (e.g., spacy) don't handle a lot of punctuation well: when a punctuation mark is stuck to a token, pretrained word vectors like GloVe downstream often have no matching vocabulary entry, and the token turns into 'unk'. So I need to write my own tokenizer that cleans the raw text and splits the punctuation off first. For models like BERT or ELMo this is unnecessary, since they handle tokenization themselves (WordPiece or BPE subwords; ELMo works from characters).
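
A quick illustration of the problem (a minimal sketch; the tiny vocab below is a hypothetical stand-in for a real GloVe vocabulary):

# toy stand-in for a GloVe vocabulary (hypothetical, for illustration only)
vocab = {"this", "movie", "is", "great", "!"}

text = "This movie is great!"
naive = text.lower().split()                        # whitespace split leaves punctuation attached
print(naive)                                        # ['this', 'movie', 'is', 'great!']
print([t if t in vocab else "unk" for t in naive])
# ['this', 'movie', 'is', 'unk']  -- 'great!' misses the vocab entry for 'great'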
What it's written with
I used re. I'm still not very familiar with regular expressions; I can get by with the docs for now, and when I find time to study them more systematically I'll update this function accordingly.
import re

def clean_tokenize(data, lower=False):
    ''' Clean a raw string: separate all tokens with spaces and optionally lowercase them.
    Meant for word-level models such as an LSTM with GloVe vectors; subword models
    like BERT come with their own tokenizer (WordPiece or BPE) and don't need this.
    :param data: string, the raw input text
    :param lower: bool, lowercase every token if True
    :return: list, all cleaned tokens from the original input
    '''
    # data = re.sub(r"[^A-Za-z0-9(),!?\'\`\.]", " ", data)  # uncomment to replace any punctuation outside this set (e.g. '<>', '《》') with a space
    # expand some abbreviations
    data = re.sub(r"\'s", " \'s", data)    # keep 's as its own token (it has a GloVe entry)
    data = re.sub(r"n\'t", " not", data)   # imperfect for "won't"/"can't" -> "wo not"/"ca not"
    data = re.sub(r"\'ve", " have", data)
    data = re.sub(r"\'re", " are", data)
    data = re.sub(r"\'d", " would", data)  # actually it can be 'would', 'should', or 'had'
    data = re.sub(r"\'ll", " will", data)
    data = re.sub(r"\'m", " am", data)
    # put a space around every punctuation mark
    data = re.sub(r"\.", " . ", data)
    data = re.sub(r",", " , ", data)
    data = re.sub(r"!", " ! ", data)
    data = re.sub(r"\(", " ( ", data)
    data = re.sub(r"\)", " ) ", data)
    data = re.sub(r"\?", " ? ", data)
    # collapse repeated whitespace into a single space
    data = re.sub(r"\s{2,}", " ", data)
    data = data.lower() if lower else data
    # split on whitespace; punctuation is already spaced out above
    return data.split()
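
For example:

print(clean_tokenize("I don't like it, really!"))
# ['I', 'do', 'not', 'like', 'it', ',', 'really', '!']
print(clean_tokenize("It's great (really).", lower=True))
# ['it', "'s", 'great', '(', 'really', ')', '.']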