Sentence-Transformer: Usage and Fine-Tuning Tutorial

Overview

A note on installation: the library pulls in quite a few dependencies, and the larger packages can download so slowly that pip times out, in which case you may need to download and install them separately. I initially assumed the package was Hugging Face's sentence-transformers, but that does not seem to be the case; I will look into the relationship between the two later.

The official Sentence-Transformer documentation is very thorough, with example code for just about every use case and detailed explanations. If you run into a problem, check the official documentation first.

This article covers two ways to use Sentence-Transformer: using a pre-trained model directly, and fine-tuning on your own dataset.

First, regardless of which scenario applies to you, install the following two libraries:

pip install -U sentence-transformers
pip install -U transformers
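
If the install succeeded, a quick sanity check (my addition, not part of the original tutorial) is to import both libraries and print their versions:

import sentence_transformers
import transformers

# Both packages expose a __version__ attribute
print("sentence-transformers:", sentence_transformers.__version__)
print("transformers:", transformers.__version__)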

Direct Use

Sentence-Transformer ships with a large number of pre-trained models. For the STS (Semantic Textual Similarity) task, some of the stronger models are:

  • roberta-large-nli-stsb-mean-tokens - STSb performance: 86.39
  • roberta-base-nli-stsb-mean-tokens - STSb performance: 85.44
  • bert-large-nli-stsb-mean-tokens - STSb performance: 85.29
  • distilbert-base-nli-stsb-mean-tokens - STSb performance: 85.16

The full list of STS models is available online (accessing it may require a VPN).

Here I pick the best of these models and run a semantic textual similarity task.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

Semantic textual similarity search means: given a query sentence, find the sentences in a corpus that are semantically closest to it.

The corpus is represented as a list of strings:

sentences = ['Lack of saneness',
             'Absence of sanity',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

With just a few lines of code we obtain a vector representation for every sentence.
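
By default, encode() on a list of sentences returns a NumPy array with one row per sentence; a quick check of the shape (this snippet is my addition):

print(type(sentence_embeddings))   # <class 'numpy.ndarray'>
print(sentence_embeddings.shape)   # (number of sentences, embedding dimension)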

Next, define the query sentence and compute its embedding:

query = 'Nobody has sane thoughts'  # A query sentence used to search for semantically similar sentences
queries = [query]
query_embeddings = model.encode(queries)

Use scipy to compute the cosine distance between the query embedding and every sentence embedding, then pick the three sentences with the smallest distance:

import scipy.spatial.distance

print("Semantic Search Results")
number_top_matches = 3
for query, query_embedding in zip(queries, query_embeddings):
    # Cosine distance between the query and every corpus sentence
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]
    # Sort corpus indices by ascending distance (most similar first)
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(number_top_matches))

    for idx, distance in results[0:number_top_matches]:
        print(sentences[idx].strip(), "(Cosine Score: %.4f)" % (1 - distance))

Here distance is the cosine distance between the two sentences, and 1 - distance can be read as a cosine similarity score: the higher the score, the closer the two sentences are semantically.
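The same top-k lookup can also be done with the library's util.semantic_search helper instead of scipy. A minimal sketch, reusing the model, sentences, and query defined above (encoding to tensors here is my choice, not from the original code):

from sentence_transformers import util

# Re-uses `model`, `sentences`, and `query` defined above.
corpus_embeddings = model.encode(sentences, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# For each query, semantic_search returns a list of {'corpus_id', 'score'} dicts,
# sorted by descending cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(sentences[hit['corpus_id']], "(Cosine Score: %.4f)" % hit['score'])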

Fine-Tune

The fine-tuning task is still STS. The dataset I use consists of Chinese-Korean sentence pairs stored in an Excel file, organized as described below.

The first column of each row is a Korean sentence and the second column is the corresponding Chinese sentence; together these rows form the positive examples. Negative examples are created simply by shuffling the row order of the two columns independently, so the sentences in a row no longer correspond.

I implement my own shuffle function here because random.shuffle() modifies the list in place, and I do not want to change the original data.

from copy import deepcopy
from random import randint

def shuffle(lst):
    # Fisher-Yates shuffle on a deep copy, so the original list is left untouched
    temp_lst = deepcopy(lst)
    m = len(temp_lst)
    while m:
        m -= 1
        i = randint(0, m)
        temp_lst[m], temp_lst[i] = temp_lst[i], temp_lst[m]
    return temp_lst
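
As an aside (not from the original post), the standard library can do the same thing in one line: random.sample() returns a new shuffled list and leaves the input untouched.

from random import sample

lst = ['a', 'b', 'c', 'd']          # example data
shuffled = sample(lst, k=len(lst))  # new shuffled list; lst itself is unchanged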

First, read the data:

import xlrd

f = xlrd.open_workbook('Ko2Cn.xlsx').sheet_by_name('Xbench QA')
Ko_list = f.col_values(0)  # column 0: all the Korean sentences
Cn_list = f.col_values(1)  # column 1: all the Chinese sentences

shuffle_Cn_list = shuffle(Cn_list)  # the Chinese sentences in shuffled order
shuffle_Ko_list = shuffle(Ko_list)  # the Korean sentences in shuffled order
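
Note that xlrd 2.0 and later no longer reads .xlsx files. If you hit that, pandas with the openpyxl engine is one alternative; this sketch assumes the same file layout as above (column 0 Korean, column 1 Chinese, no header row):

import pandas as pd

df = pd.read_excel('Ko2Cn.xlsx', sheet_name='Xbench QA', header=None, engine='openpyxl')
Ko_list = df[0].astype(str).tolist()  # column 0: Korean sentences
Cn_list = df[1].astype(str).tolist()  # column 1: Chinese sentences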

For fine-tuning, Sentence-Transformer expects the training data as a list of InputExample() objects, a class defined by the library's authors.

An InputExample() takes two arguments, texts and label: texts is itself a list holding one sentence pair, and label must be a float indicating how similar the pair is.

For example:

from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

The following code builds the dataset:

from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, evaluation, losses
from torch.utils.data import DataLoader

train_size = int(len(Ko_list) * 0.8)
eval_size = len(Ko_list) - train_size

# Define your train examples.
train_data = []
for idx in range(train_size):
    # Positive pair: a Korean sentence with its aligned Chinese translation
    train_data.append(InputExample(texts=[Ko_list[idx], Cn_list[idx]], label=1.0))
    # Negative pair: mismatched sentences taken from the shuffled lists
    train_data.append(InputExample(texts=[shuffle_Ko_list[idx], shuffle_Cn_list[idx]], label=0.0))

# Define your evaluation examples
sentences1 = Ko_list[train_size:]
sentences2 = Cn_list[train_size:]
sentences1.extend(list(shuffle_Ko_list[train_size:]))
sentences2.extend(list(shuffle_Cn_list[train_size:]))
scores = [1.0] * eval_size + [0.0] * eval_size

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

Next, define the model, then build the train dataset, the DataLoader, and the training loss (the SentencesDataset needs the model, which is why the model is defined first), and start training. I use a multilingual pre-trained model because the dataset contains both Chinese and Korean; Sentence-Transformer provides three multilingual pre-trained models:

  • distiluse-base-multilingual-cased: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Model is based on DistilBERT-multi-lingual.
  • xlm-r-base-en-ko-nli-ststb: Supported languages: English, Korean. Performance on Korean STSbenchmark: 81.47
  • xlm-r-large-en-ko-nli-ststb: Supported languages: English, Korean. Performance on Korean STSbenchmark: 84.05

# Define the model: either train from scratch or load a pre-trained model
model = SentenceTransformer('distiluse-base-multilingual-cased')

# Define the train dataset, the dataloader, and the train loss
train_dataset = SentencesDataset(train_data, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100,
          evaluator=evaluator,
          evaluation_steps=100,
          output_path='./Ko2CnModel')

Every 100 training steps, the model is evaluated on the validation set, and the checkpoint that performs best on the validation set is automatically saved to output_path.

To load the fine-tuned model for testing, use the following code:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('./Ko2CnModel')

# Sentences are encoded by calling model.encode()
emb1 = model.encode("터너를 이긴 푸들.")
emb2 = model.encode("战胜特纳的泰迪。")

cos_sim = util.pytorch_cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

 
