Understanding the Tokenizer in Practice

汨攸

Tags: artificial intelligence, NLP, deep learning

Article link: https://blog.csdn.net/m0_52965867/article/details/140921821

Tokenizer

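# Load the tokenizer with AutoTokenizer (Qwen2 here; the DeBERTa-v3 path is kept as an alternative)
from transformers import AutoTokenizer

model_name = "/kaggle/input/qwen2/transformers/qwen2-7b-instruct/1"
# model_name = "/kaggle/input/deberta-v3/pytorch/large/1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Optionally cap the maximum sequence length (this does not change the vocabulary size)
# tokenizer.model_max_length = max_length
# Register [CLS], [SEP] and [PAD] as special tokens; any the vocabulary lacks get new IDs
tokenizer.add_tokens(['[CLS]', '[SEP]', '[PAD]'], special_tokens=True)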

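In a text classification task, a [CLS] token can be added at the start of the input so that the model learns to recognize it and to use the representation at that position for the classification decision.
When several text segments are concatenated, [SEP] marks the boundary between them.
When sequences differ in length, [PAD] fills the shorter ones so that every sequence has the same length and can be processed in one batch.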

# Check and set special tokens if they are not present
if tokenizer.cls_token_id is None:
    tokenizer.cls_token_id = tokenizer.convert_tokens_to_ids('[CLS]')
if tokenizer.sep_token_id is None:
    tokenizer.sep_token_id = tokenizer.convert_tokens_to_ids('[SEP]')
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')

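If the tokenizer's cls_token_id (the ID of the [CLS] token) is None, tokenizer.convert_tokens_to_ids('[CLS]') looks up the ID of [CLS] and the result is assigned to cls_token_id. The same check is applied to sep_token_id and pad_token_id.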

This guarantees that the special-token IDs are defined before later code uses them, which avoids errors caused by an ID that is still None.
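To make the ID lookup concrete, here is a minimal sketch that uses a toy vocabulary instead of a real tokenizer. `ToyVocab` and its tiny word list are hypothetical stand-ins invented for illustration; only the method names `add_tokens` and `convert_tokens_to_ids` are borrowed from the code above, and the toy does not reproduce every detail of the library.

```python
class ToyVocab:
    """Toy stand-in for a tokenizer vocabulary (hypothetical, not a library class)."""

    def __init__(self, base_tokens):
        self.token_to_id = {tok: i for i, tok in enumerate(base_tokens)}
        self.base_size = len(self.token_to_id)  # size before any tokens are added

    def add_tokens(self, tokens):
        """Append tokens that are not yet known; return how many were added."""
        added = 0
        for tok in tokens:
            if tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.token_to_id)
                added += 1
        return added

    def convert_tokens_to_ids(self, token):
        return self.token_to_id[token]

    def __len__(self):
        return len(self.token_to_id)


vocab = ToyVocab(["hello", "world", "[PAD]"])
num_added = vocab.add_tokens(["[CLS]", "[SEP]", "[PAD]"])  # [PAD] already exists

cls_id = vocab.convert_tokens_to_ids("[CLS]")
sep_id = vocab.convert_tokens_to_ids("[SEP]")
pad_id = vocab.convert_tokens_to_ids("[PAD]")
print(num_added, cls_id, sep_id, pad_id, vocab.base_size, len(vocab))
```

In this toy, the newly added IDs (3 and 4) are not below the base size (3), so anything sized by the base vocabulary alone would be too small to index them. Hugging Face tokenizers behave similarly in that tokenizer.vocab_size excludes added tokens while len(tokenizer) includes them, which matters when choosing the size of an embedding table.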

import pandas as pd

def process(row):
    max_len = max_length - 2  # reserve room for the 2 separator tokens
    # Tokenize prompt (at most a quarter of the budget).
    # add_special_tokens=False because [CLS] and [SEP] are inserted by hand below.
    prompt_tokens = tokenizer(row['prompt'], truncation=True, max_length=max_len // 4,
                              add_special_tokens=False)['input_ids']
    remaining_length = max_len - len(prompt_tokens)

    # Tokenize response A (at most half of what is left)
    response_a_tokens = tokenizer(row['response_a'], truncation=True, max_length=remaining_length // 2,
                                  add_special_tokens=False)['input_ids']
    remaining_length -= len(response_a_tokens)

    # Tokenize response B (at most half of what is left after response A)
    response_b_tokens = tokenizer(row['response_b'], truncation=True, max_length=remaining_length // 2,
                                  add_special_tokens=False)['input_ids']

    # Concatenate: [CLS] prompt [SEP] response A [SEP] response B
    input_ids = [tokenizer.cls_token_id] + prompt_tokens + [tokenizer.sep_token_id] + response_a_tokens + [tokenizer.sep_token_id] + response_b_tokens
    # Segment IDs: 0 = [CLS] + prompt + [SEP], 1 = response A + [SEP], 2 = response B
    token_type_ids = [0] * (len(prompt_tokens) + 2) + [1] * (len(response_a_tokens) + 1) + [2] * len(response_b_tokens)
    attention_mask = [1] * len(input_ids)

    # Pad up to max_length; padded positions get attention_mask 0
    padding_length = max_length - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + [tokenizer.pad_token_id] * padding_length
        token_type_ids = token_type_ids + [0] * padding_length
        attention_mask = attention_mask + [0] * padding_length

    # Safety truncation in case the sequence is longer than max_length
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]
    attention_mask = attention_mask[:max_length]

    return input_ids, token_type_ids, attention_mask

df[['input_ids', 'token_type_ids', 'attention_mask']] = df.apply(lambda row: pd.Series(process(row)), axis=1)
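The length budget inside process can be checked with plain arithmetic. The sketch below is an illustration only: `token_budget` is a hypothetical helper, and it assumes that truncation keeps exactly min(text length, limit) tokens, which holds when the tokenizer adds no special tokens of its own.

```python
def token_budget(max_length, prompt_len, a_len, b_len):
    """Return the kept lengths of prompt, response A and response B after truncation."""
    max_len = max_length - 2  # same reservation as in process()
    kept_prompt = min(prompt_len, max_len // 4)
    remaining = max_len - kept_prompt
    kept_a = min(a_len, remaining // 2)
    remaining -= kept_a
    kept_b = min(b_len, remaining // 2)
    return kept_prompt, kept_a, kept_b


# All three texts are much longer than the window
kept = token_budget(max_length=512, prompt_len=1000, a_len=1000, b_len=1000)
total = sum(kept) + 3  # plus [CLS] and two [SEP]
print(kept, total, 512 - total)  # (127, 191, 96) 417 95
```

Under these assumptions the split is not symmetric: when both responses are long, response B keeps roughly half as many tokens as response A, and 95 of the 512 positions end up as padding. Whether that matters depends on the task; it is worth knowing when comparing two responses.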

The core of this function is the following line:

input_ids = [tokenizer.cls_token_id] + prompt_tokens + [tokenizer.sep_token_id] + response_a_tokens + [tokenizer.sep_token_id] + response_b_tokens

↑ These parts together form the encoded sequence: [CLS], the prompt, [SEP], response A, [SEP], and response B.

token_type_ids = [0] * (len(prompt_tokens) + 2) + [1] * (len(response_a_tokens) + 1) + [2] * len(response_b_tokens)

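For example, if len(prompt_tokens) = 5, len(response_a_tokens) = 3, and len(response_b_tokens) = 2, then token_type_ids is [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2]: seven 0s for [CLS], the prompt and the first [SEP], four 1s for response A and the second [SEP], and two 2s for response B. In models that accept segment IDs, this distinction helps the model tell the different parts of the input sequence apart.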
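The same example can be reproduced in a few lines. This is a sketch with placeholder token IDs: the values 101 and 102 for [CLS] and [SEP] and the content IDs are made up for illustration and do not come from any real vocabulary.

```python
CLS_ID, SEP_ID = 101, 102  # placeholder IDs chosen for illustration only

prompt_tokens = [11, 12, 13, 14, 15]
response_a_tokens = [21, 22, 23]
response_b_tokens = [31, 32]

input_ids = ([CLS_ID] + prompt_tokens + [SEP_ID]
             + response_a_tokens + [SEP_ID] + response_b_tokens)
token_type_ids = ([0] * (len(prompt_tokens) + 2)
                  + [1] * (len(response_a_tokens) + 1)
                  + [2] * len(response_b_tokens))

# Print each token ID next to the segment it belongs to
for token_id, segment in zip(input_ids, token_type_ids):
    print(token_id, segment)
```

Both lists have 13 entries. The first [SEP] is counted in segment 0 and the second [SEP] in segment 1, which is where the "+ 2" and "+ 1" in the expression come from.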

attention_mask = [1] * len(input_ids)

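An attention_mask of all 1s tells the model to attend to every position of input_ids. At this point the sequence contains no padding yet; the padding step that follows appends 0s to the mask so that [PAD] positions are ignored.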
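How the mask changes once padding is added can be sketched as follows. `pad_to_length` is a hypothetical helper that mirrors the padding and truncation steps of process; the token IDs and the pad ID 0 are placeholders.

```python
def pad_to_length(input_ids, attention_mask, max_length, pad_id):
    """Right-pad both lists to max_length, then cut anything beyond it."""
    padding_length = max_length - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + [pad_id] * padding_length
        attention_mask = attention_mask + [0] * padding_length
    return input_ids[:max_length], attention_mask[:max_length]


ids = [101, 11, 12, 102]
mask = [1] * len(ids)  # all real tokens, so all 1s
padded_ids, padded_mask = pad_to_length(ids, mask, max_length=6, pad_id=0)
print(padded_ids)   # [101, 11, 12, 102, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 0, 0]
```

The number of 1s in the padded mask equals the original sequence length, so the mask records where the real tokens end.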

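# Prepare data for training
# sq is assumed to refer to keras.preprocessing.sequence (the import is not shown in the original)
input_ids = train['input_ids']
attention_mask = train['attention_mask']

# Every row already has length max_length, so pad_sequences mainly turns the lists into 2-D arrays
X_train = sq.pad_sequences(input_ids, maxlen=max_length)
X_train_attention_mask = sq.pad_sequences(attention_mask, maxlen=max_length)

# labels are expected to be one-hot encoded to match the categorical_crossentropy loss used later
y_train = labels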

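sq.pad_sequences pads input_ids (and likewise attention_mask) to a length of max_length.

Why is there both an X_train and an X_train_attention_mask?
Both arrays are padded to the same length, so every position in X_train has a matching entry in X_train_attention_mask that says whether it holds a real token (1) or padding (0). Keeping the two aligned is what allows them to be fed to a model together.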
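For readers unsure what the padding call does, here is a pure-Python sketch of the behaviour. It assumes the Keras defaults for pad_sequences, namely padding='pre', truncating='pre' and a fill value of 0; `pad_pre` is a hypothetical re-implementation for illustration, not the library function, and it expects maxlen to be at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pure-Python sketch of 'pre' padding and 'pre' truncation."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)  # fill on the left
    return padded


result = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8], [9, 9, 9, 9]], maxlen=4)
print(result)  # [[0, 1, 2, 3], [5, 6, 7, 8], [9, 9, 9, 9]]
```

A sequence that already has exactly maxlen items passes through unchanged, which is the situation for rows that were padded to max_length beforehand. Shorter or longer rows would be padded or cut at the front, not at the end.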

Model

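# Define the LSTM model
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, GlobalMaxPooling1D, Dense, Dropout

def create_lstm_model(vocab_size, embedding_dim, max_length):
    model = Sequential([
        Input(shape=(max_length,), dtype=tf.int32, name='input_ids'),
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # token IDs -> dense vectors
        LSTM(256, return_sequences=True),  # keep per-position outputs for the pooling layer
        # BatchNormalization(),
        # Dropout(0.5),
        GlobalMaxPooling1D(),  # max over the sequence axis
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(3, activation='softmax')  # 3 output classes
    ])
    # categorical_crossentropy expects one-hot encoded labels
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model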

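# Parameters
# len(tokenizer) counts the tokens added with add_tokens; tokenizer.vocab_size does not,
# so sizing the embedding by vocab_size could leave added token IDs out of range
vocab_size = len(tokenizer)
# vocab_size = max_length
embedding_dim = 100
max_length = max_length
# The settings below are not used by create_lstm_model
max_features = len(tokenizer)
# max_features = max_length * 2
max_len = max_length
maxlen = max_len
batch_size = 16
embedding_dims = 100
nb_filter = 150
filter_length = 3
hidden_dims = 100
nb_epoch = 100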
# Create the model
model = create_lstm_model(vocab_size, embedding_dim, max_length)
model.summary()

Training

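from keras.callbacks import EarlyStopping

# Stop training once val_loss has not improved for 8 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=8, verbose=1)

# Train the model. The Sequential model declares a single input (input_ids), so only X_train
# is passed; using X_train_attention_mask would require a model with a second input.
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2,
                    callbacks=[early_stopping])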