NLP Study Notes 3: Data preparation, tokenization, and filtering.

Article link: https://blog.csdn.net/qq_35455206/article/details/131859790

tokenization

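Tokenization is the task of splitting a string of input characters into units and classifying them. The resulting tokens are then passed on to further processing.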
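As a minimal sketch of "splitting and classifying", the snippet below uses a regular expression to cut a string into tokens and label each one. The three classes (NUMBER, WORD, PUNCT) and the function name classify_tokens are assumptions made for this illustration, not part of the original notes.

```python
import re

# Hypothetical token classes, chosen only for this illustration.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),            # integers and simple decimals
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),   # words, optionally with one apostrophe
    ("PUNCT", r"[^\w\s]"),                   # any single punctuation character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def classify_tokens(text):
    """Return (token, class) pairs for every match found in the input string."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


print(classify_tokens("It costs 3.50 dollars, doesn't it?"))
```

The tokenizer used later in these notes is simpler and only splits on whitespace; a regex like this is one way to separate punctuation from words when that matters.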

Data Preparation

read a dataset

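Define a MyDatasetReader class to read the dataset. The class takes two inputs:

path: the path of the text file to process
lower: a boolean indicating whether the text must be lowercased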

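class MyDatasetReader(object):
    def __init__(self, path, lower=True):
        self.path = path
        self.lower = lower

    def __iter__(self):
        # the context manager closes the file once iteration finishes
        with open(self.path, 'r', encoding='utf-8') as f:
            for line in f:
                # yields only the current line, lowercased if requested
                yield line.lower() if self.lower else line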

On Colab we build the dataset from wiki_10k and look at the first 5 lines.

import os
from itertools import islice

data_path = '/content/drive/My Drive/Colab Notebooks/nlp_data'
dataset_path = os.path.join(data_path, "wiki_10k.txt")
# islice stops after exactly 5 lines
for line in islice(MyDatasetReader(dataset_path), 5):
    print(line)

Clean the dataset

We can decide which tokens to keep in the dataset. Depending on the NLP task, tokens can be filtered according to the following criteria.

filtering stopwords

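In many NLP tasks it is common to remove function words, that is, words with little semantic content such as conjunctions, prepositions, pronouns, and modal verbs. Different tasks can use different stopword lists.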

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopset = stopwords.words('english')
print(stopset)

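For example, if you want to classify documents by content, you want to capture the most salient features and discard the less relevant ones, such as articles and pronouns: they appear in almost every text and say almost nothing about its content.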
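A minimal sketch of the filtering step itself. To keep it runnable without downloading NLTK data, it uses a tiny hand-written stopword set (toy_stopset, an assumption for this example); in practice the NLTK list printed above could be passed in instead.

```python
# A tiny hand-written stopword list, used here only so the example runs
# without downloading NLTK data; it is not the NLTK English list.
toy_stopset = {"the", "a", "an", "of", "and", "it", "is", "in", "on"}


def remove_stopwords(tokens, stopset):
    """Keep only tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopset]


tokens = "The cat is on the roof of a house".split()
print(remove_stopwords(tokens, toy_stopset))  # ['cat', 'roof', 'house']
```

Storing the stopwords in a set rather than a list keeps each membership test constant-time, which matters once the corpus is large.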

remove numbers

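In some cases (such as topic modelling) removing stopwords is not enough. You may also want to drop numbers and keep only the words that are representative of a specific topic.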

Remove alphanumeric tokens

Alphanumeric tokens mix letters and digits. Such tokens may be produced by OCR errors, so in some cases we may want to remove them.
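The two filters described so far (numbers and alphanumeric tokens) can be sketched as follows. The function name filter_tokens and its two flags are hypothetical, and the patterns only cover ASCII letters and digits.

```python
import re

only_letters = re.compile(r"[A-Za-z]+")
only_digits = re.compile(r"[0-9]+")
letters_and_digits = re.compile(r"[A-Za-z0-9]+")


def filter_tokens(tokens, keep_numbers=False, keep_alphanumeric=False):
    """Keep words; optionally keep pure numbers and mixed letter-digit tokens."""
    kept = []
    for t in tokens:
        if only_letters.fullmatch(t):
            kept.append(t)                      # plain word: always kept
        elif only_digits.fullmatch(t):
            if keep_numbers:
                kept.append(t)                  # pure number
        elif letters_and_digits.fullmatch(t) and keep_alphanumeric:
            kept.append(t)                      # mix of letters and digits
    return kept


tokens = ["report", "2023", "c0mpany", "page", "4th"]
print(filter_tokens(tokens))                     # ['report', 'page']
print(filter_tokens(tokens, keep_numbers=True))  # ['report', '2023', 'page']
```

Using fullmatch rather than match makes sure the whole token satisfies the pattern, not just a prefix of it.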

Lowercasing tokens

To keep the model's vocabulary size under control and reduce sparsity, we usually lowercase all tokens.

Remove short tokens

In some cases you may want to remove short tokens, since many of them are stopwords.
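A small sketch of lowercasing followed by short-token removal. The helper name normalize and the threshold min_length=3 are assumptions chosen for the example; the right threshold depends on the language and the task.

```python
def normalize(tokens, min_length=3):
    """Lowercase every token, then drop tokens shorter than min_length characters."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if len(t) >= min_length]


raw = ["The", "the", "THE", "NLP", "is", "an", "Art"]
print(len(set(raw)))             # 7 distinct surface forms before normalization
print(len(set(normalize(raw))))  # 3 distinct forms afterwards: the, nlp, art
```

The drop from 7 to 3 distinct forms is the sparsity reduction mentioned above: variants of the same word collapse into one vocabulary entry.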

Using a predefined number of words

In many cases it is useful to fix the number of words in the model's vocabulary in advance, which keeps the model's complexity manageable.
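One simple way to cap the vocabulary, sketched with collections.Counter: count every token and keep the n most frequent ones. The function name top_n_vocab is hypothetical.

```python
from collections import Counter


def top_n_vocab(token_lists, n):
    """Return the n most frequent tokens over all tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]


docs = [["a", "b", "a", "c"], ["a", "b", "d"]]
print(top_n_vocab(docs, 2))  # ['a', 'b']
```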

Eliminating low-frequency words

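Because word frequencies follow a Zipfian distribution, a corpus contains many low-frequency words, which inflates the model's vocabulary size. word2vec also discards low-frequency words, because they are hard to model and predict.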
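A sketch of frequency-based filtering, assuming the threshold is expressed as a relative frequency (count divided by the total number of tokens). The name drop_rare and the threshold value are illustrative only.

```python
from collections import Counter


def drop_rare(token_lists, min_freq):
    """Remove tokens whose relative frequency in the corpus is below min_freq."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    keep = {w for w, c in counts.items() if c / total >= min_freq}
    return [[t for t in tokens if t in keep] for tokens in token_lists]


docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
# 8 tokens in total: "the" = 3/8, "cat" = 2/8, every other word = 1/8
print(drop_rare(docs, min_freq=0.2))  # [['the', 'cat'], ['the', 'cat'], ['the']]
```

Whether the comparison is strict or inclusive at the threshold is a design choice; this sketch keeps tokens whose frequency equals min_freq.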

from collections import Counter
import re

alpha = re.compile('^[a-zA-Z_]+$')                     # strings that contain only letters (or underscores)
alpha_or_num = re.compile('^(?:[a-zA-Z_]+|[0-9_]+)$')  # strings that are all letters or all digits, never a mix
alphanum = re.compile('^[a-zA-Z0-9_]+$')               # strings made of letters and/or digits

class MyTokenizer(object):
  def __init__(self,
               keepStopwords=False,   # indicates whether stopwords are kept (False removes them)
               keepNum = False,       # indicates whether the numbers have to be kept or not
               keepAlphaNum = False,  # indicates whether alphanumeric tokens have to be kept or not
               lower = True,          # indicates if the strings have to be lowercased
               minlength = 0,         # tokens shorter than this many characters are removed (ignored when minlength is 0)
               vocabSize = 5000,      # size of the vocabulary (keeps only the N most frequent words in the dataset)
               minfreq = 10e-5,       # relative frequency (10e-5 equals 1e-4) at or below which tokens are removed
               stopset = None,        # a custom collection of words that will be removed
               vocab = None           # a predefined vocabulary (only the words in it will be maintained)
               ):
    
    self.keepStopwords = keepStopwords
    self.keepNum = keepNum
    self.keepAlphaNum = keepAlphaNum
    self.lower = lower
    self.minlength = minlength
    self.vocabSize = vocabSize
    self.minfreq = minfreq
    self.vocab = vocab
    if not self.keepStopwords and not stopset:
      import string
      import nltk
      nltk.download('stopwords')
      from nltk.corpus import stopwords
      stopset = set(stopwords.words('english')+[p for p in string.punctuation])
    # an empty set (instead of None) keeps the membership test in cleanTokens valid
    self.stopset = set(stopset) if stopset else set()
    
  def tokenize(self, text):
    if not self.lower:
      return text.split()
    else:
      return [t.lower() for t in text.split()]

  def get_vocab(self, Tokens):
    Vocab = Counter()
    for tokens in Tokens:
      Vocab.update(tokens)
    self.vocab = Counter(Vocab)

  def cleanTokens(self, Tokens):
    tokens_n = sum(self.vocab.values())
    filtered_voc = self.vocab.most_common(self.vocabSize)
    stopset = self.stopset or ()   # tolerate a missing stopword list
    Freqs = Counter({t : f/tokens_n for t, f in filtered_voc if
                     f/tokens_n > self.minfreq and
                     t not in stopset
                     })
    words = list(Freqs.keys())
    # remove tokens shorter than minlength (a no-op when minlength is 0)
    words = [w for w in words if len(w) >= self.minlength]
    # remove tokens that contain numbers
    if not self.keepAlphaNum and not self.keepNum:
      words = [w for w in words if alpha.fullmatch(w)]
    # or just remove tokens that mix letters and numbers
    elif not self.keepAlphaNum:
      words = [w for w in words if alpha_or_num.fullmatch(w)]
    words.sort()
    self.words = words
    words2idx = {w : i for i, w in enumerate(words)}
    self.words2idx = words2idx
    print('Vocabulary')
    print(words[:12])
    # dict lookups are constant-time, unlike membership tests on the words list
    cleanTokens = []
    for tokens in Tokens:
      cleanTokens.append([t for t in tokens if t in words2idx])
    self.tokens = cleanTokens


tokenizer = MyTokenizer()
Tokens = [tokenizer.tokenize(text) for text in MyDatasetReader(dataset_path)]
tokenizer.get_vocab(Tokens)
print('Most frequent words')
print(tokenizer.vocab.most_common(12))
tokenizer.cleanTokens(Tokens)

Out-of-Vocabulary Words (OOV)

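Filtering tokens this way (removing stopwords, low-frequency words, and so on) is convenient because it controls the size of the vocabulary and the sparsity of the data. The drawback is that the model does not know how to handle out-of-vocabulary (OOV) words, that is, words outside its vocabulary. For example, some lemmas have a low frequency and may therefore be removed from the dataset. One way to handle this is to map every OOV word to the same vocabulary entry.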
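A minimal sketch of that mapping: reserve one index for unknown words and send every OOV token to it. The token string "<unk>" and the helper names are assumptions for this example; any reserved symbol would work.

```python
UNK = "<unk>"  # hypothetical name for the shared OOV entry


def build_index(vocab_words):
    """Assign index 0 to the OOV entry and consecutive indices to known words."""
    word2idx = {UNK: 0}
    for w in sorted(vocab_words):
        word2idx[w] = len(word2idx)
    return word2idx


def encode(tokens, word2idx):
    """Map each token to its index, falling back to the OOV index."""
    return [word2idx.get(t, word2idx[UNK]) for t in tokens]


word2idx = build_index({"cat", "dog", "sat"})
print(encode(["the", "cat", "sat"], word2idx))  # [0, 1, 3]
```

With this scheme the model sees a single shared entry for every unseen word, so it can still process the sentence even though the identity of the OOV word is lost.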

Subword Information

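Another approach exploits the morphology of the language. FastText (https://fasttext.cc/docs/en/crawl-vectors.html) does not build word representations at the token level. It uses character n-grams instead, which lets an OOV word be represented as the sum of its components (its n-grams). For example, with n=3 the word where is split into <wh, whe, her, ere, re>, where < and > mark the word boundaries.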
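The n-gram split can be reproduced with a few lines of Python. This is only a sketch of the extraction step for a single value of n; the function name char_ngrams is hypothetical and this is not the FastText implementation.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An OOV word such as wherever shares several of these n-grams with where, which is what lets a vector be assembled for it from n-grams seen during training.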

Byte Pair Encoding (BPE)

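Byte Pair Encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte. The algorithm can be adapted to word segmentation: it repeatedly counts all symbol pairs and replaces the most frequent pair with a new symbol (for example, 'A', 'B' is replaced by the new symbol 'AB').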

import re, collections

# compute the occurrences of symbol pairs
def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

# updates the vocabulary with a new sequence
def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# initial vocabulary (</w> marks the end of a token)
vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6,'w i d e s t </w>':3}
# number of merges to be performed
num_merges = 10

for i in range(num_merges):
    pairs = get_stats(vocab)            # compute the occurrences of symbol pairs
    best = max(pairs, key=pairs.get)    # find the most frequent pair
    vocab = merge_vocab(best, vocab)    # merge a symbols pair
    print(best)                         # print the merged symbols
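The loop above only learns the merge operations. As a hedged sketch of the next step, the learned merges can be replayed in order on a word that was never in the training vocabulary. The helpers learn_merges and segment are assumptions added for this illustration; get_stats and merge_vocab are repeated so that the block runs on its own.

```python
import collections
import re


def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in v_in.items()}


def learn_merges(vocab, num_merges):
    """Return the list of merges in the order they were learned."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:                      # nothing left to merge
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the merges in order."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_merges(vocab, 10)
print(segment('lowest', merges))  # ['low', 'est</w>']
```

The word lowest never appears in the toy vocabulary, yet it is segmented into two known units, low and est</w>, which is how BPE avoids OOV tokens.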