Tokenizing text data with re regular expressions (personal notes)

Why write this

Typical tokenizers (e.g., spaCy) don't handle a lot of punctuation cases well. If the downstream model uses pretrained word vectors like GloVe, the vocabulary may have no matching entry when punctuation stays glued to a token (e.g., "word,"), so those tokens end up as 'unk'. Hence a hand-written tokenizer that cleans and splits the raw tokens first.
For models like BERT or ELMo that come with their own tokenizer (WordPiece or BPE), this isn't needed.
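
A toy illustration of the OOV problem described above; the tiny vocab set below is just a stand-in for a real GloVe vocabulary, and the sentence is made up for the example:

vocab = {"great", "movie", ","}   # pretend this is the GloVe vocabulary

raw = "great movie,".split()      # naive whitespace split keeps the comma attached
print([t if t in vocab else "unk" for t in raw])
# ['great', 'unk']  -- "movie," is not in the vocabulary

clean = ["great", "movie", ","]   # after splitting the punctuation off
print([t if t in vocab else "unk" for t in clean])
# ['great', 'movie', ',']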

What it's written with

I used re. I'm not very familiar with regular expressions yet; I can mostly get by with the docs. I plan to study them a bit more systematically when I have time, and this function will be updated accordingly.

import re

def clean_tokenize(data, lower=False):
    '''Clean and tokenize a string: separate punctuation from tokens with spaces
    and optionally lowercase everything.
    Mainly useful for models that don't require strict pre-tokenization,
    such as an LSTM with GloVe vectors (ELMo and BERT ship their own tokenizers).
    :param data: string, the raw text
    :param lower: bool, lowercase all tokens if True
    :return: list, all cleaned tokens from the original input
    '''
    # data = re.sub(r"[^A-Za-z0-9(),!?\'\`\.]", " ", data)  # uncomment to drop other punctuation, e.g. '<>', '《》'
    # expand some common contractions
    data = re.sub(r"\'s", " \'s", data)
    data = re.sub(r"n\'t", " not", data)   # note: "can't" becomes "ca not"
    data = re.sub(r"\'ve", " have", data)
    data = re.sub(r"\'re", " are", data)
    data = re.sub(r"\'d", " would", data)  # could also mean 'should' or 'had'
    data = re.sub(r"\'ll", " will", data)
    data = re.sub(r"\'m", " am", data)
    # surround punctuation with spaces so each mark becomes its own token
    data = re.sub(r"\.", " . ", data)
    data = re.sub(r",", " , ", data)
    data = re.sub(r"!", " ! ", data)
    data = re.sub(r"\(", " ( ", data)
    data = re.sub(r"\)", " ) ", data)
    data = re.sub(r"\?", " ? ", data)
    # collapse runs of whitespace into a single space
    data = re.sub(r"\s{2,}", " ", data)
    data = data.lower() if lower else data

    # split on any remaining non-word characters (keeping them), then split each
    # chunk on whitespace so no token carries internal spaces
    return [t for chunk in re.split(r"(\W+)", data) for t in chunk.split()]
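
A quick sanity check of the function above; the sample sentence is made up just for illustration:

print(clean_tokenize("I don't like this movie (at all)!", lower=True))
# ['i', 'do', 'not', 'like', 'this', 'movie', '(', 'at', 'all', ')', '!']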