NLP (77): Dynamic Prompting in Text Completion

This article introduces how to apply dynamic prompting to text completion with large language models.

Taking text classification as the example task, with TREC as the sample dataset, we use LLM text completion to predict question categories, testing the Zero-Shot, Few-Shot, and Dynamic Few-Shot settings to verify that dynamic prompting effectively improves model performance.

Dataset

The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set and another 500 questions in the test set.

The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10 words and the vocabulary size is 8,700. The 6 coarse class labels are ABBR, ENTY, DESC, HUM, LOC, and NUM.

The dataset was collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10 that serve as the test set. All questions were labeled manually.

The dataset's HuggingFace page is https://huggingface.co/datasets/trec. It can be loaded with the datasets module as follows:

import openai
from datasets import load_dataset
from sklearn.metrics import classification_report

dataset = load_dataset("trec")

dataset

The output is:

DatasetDict({
    train: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 5452
    })
    test: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 500
    })
})

The first example in the test split is:

{'text': 'How far is it from Denver to Aspen ?',
 'coarse_label': 5,
 'fine_label': 40}
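
The integer labels can be decoded through the dataset's ClassLabel features; a quick check using the datasets library's int2str (given the coarse label order ABBR, ENTY, DESC, HUM, LOC, NUM, id 5 is NUM):

print(dataset['test'].features['coarse_label'].int2str(5))  # 'NUM'
print(dataset['test'].features['fine_label'].int2str(40))   # the corresponding fine-grained class name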

Preprocess the data as follows:

# name of the text and label column
label_type = 'coarse_label'
text_key = "text"
# create mapping of ids2class and class2id
id2class = dict((i, label) for i, label in enumerate(dataset['train'].features[label_type].names))
class2id = dict((label, i) for i, label in enumerate(dataset['train'].features[label_type].names))
# create a dictionary with classes as key and containing all the training examples within that class
class2TrainDataset = dict((label, []) for label in dataset['train'].features[label_type].names)
for example in dataset['train']:
    label = id2class[example[label_type]]
    class2TrainDataset[label].append(example[text_key])

Here, id2class and class2id map label ids to class names and class names to ids respectively, while class2TrainDataset groups the training examples by class.
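
A quick sanity check of these mappings (illustrative; the exact per-class counts depend on the dataset version):

print(id2class)
# expected: {0: 'ABBR', 1: 'ENTY', 2: 'DESC', 3: 'HUM', 4: 'LOC', 5: 'NUM'}
print({label: len(examples) for label, examples in class2TrainDataset.items()})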

Zero-Shot

Build the Zero-Shot prompt:

# a prompt for asking LLM to perform a task
task_prompt = "As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.\nFollowing are the semantic classes: ["
task_prompt += ", ".join([label for label in class2TrainDataset]) + "]"
# a prompt for asking LLM to generate the output for current task
query_prompt = "\nClassify the following question into one of the above classes. Please answer in a single word.\nquestion: "
answer_prompt = "\noutput: "

The Zero-Shot prompt for the first test example is then:

zeroshot_prompt = task_prompt + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> zeroshot_prompt

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

Call the OpenAI model to generate the completion. The helper code:

openai.api_key = "sk-xxx"
model_name = "gpt-3.5-turbo-instruct"

import tiktoken

# boost the logits of the coarse-class-name tokens so the model prefers emitting them
enc = tiktoken.encoding_for_model(model_name)
log_bias_dict = {}
for label in dataset['train'].features["coarse_label"].names:
    for token_id in enc.encode(label):
        log_bias_dict[token_id] = 5

# Text completion using GPT
def trim_text(text):
    # strip surrounding whitespace and any literal "\n" sequences
    return text.strip().strip('\\n').strip()

def generate_using_gpt(prompt):
    generated_sentence = ""
    try:
        # Create a completion for the provided prompt and parameters
        response = openai.Completion.create(
            model=model_name,
            prompt=prompt,
            max_tokens=3,
            temperature=0,
            top_p=1,
            stop=None,
            frequency_penalty=0,
            presence_penalty=0.0,
            logit_bias=log_bias_dict
        )

        choices = response.get("choices", [])
        if len(choices) == 0 or "text" not in choices[0]:
            print("Text not generated properly")
            return generated_sentence
        generated_sentence = choices[0]['text'].strip()

    except openai.error.APIError as e:
        # Handle API error here, e.g. retry or log
        print(f"OpenAI API returned an API Error: {e}")

    except openai.error.AuthenticationError as e:
        # Handle authentication error here, e.g. invalid API key
        print(f"OpenAI API returned an Authentication Error: {e}")

    except openai.error.APIConnectionError as e:
        # Handle connection error here
        print(f"Failed to connect to OpenAI API: {e}")

    except openai.error.InvalidRequestError as e:
        # Handle invalid request error here, e.g. malformed parameters
        print(f"Invalid Request Error: {e}")

    except openai.error.RateLimitError as e:
        # Handle rate limit error
        print(f"OpenAI API request exceeded rate limit: {e}")

    except openai.error.ServiceUnavailableError as e:
        # Handle service unavailable error
        print(f"Service Unavailable: {e}")

    except openai.error.Timeout as e:
        # Handle request timeout
        print(f"Request timed out: {e}")
    return generated_sentence

The model is gpt-3.5-turbo-instruct with max_tokens set to 3. To steer the output tokens toward the dataset's coarse class names, we use tiktoken to obtain the token ids of those class names and pass a logit_bias that boosts those ids.
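
To see which token ids actually receive the bias, the class names can be run through the tokenizer (purely an inspection step, not required by the pipeline):

for label in id2class.values():
    print(label, enc.encode(label))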

Test on the first example of the test set:

>>> generate_using_gpt(zeroshot_prompt)

'LOC'

Note that the gold label for this question is NUM (coarse_label 5), so the Zero-Shot prediction is actually wrong here. Now run the Zero-Shot prompt over the full test set:

# prompt without any examples from the training dataset
labels = []
predictions = []
for example in dataset['test']:
    zeroshot_prompt = task_prompt + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(zeroshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
        
report = classification_report(labels, predictions, digits=4)
print(report)

The evaluation results:

              precision    recall  f1-score   support

           0     0.6364    0.7778    0.7000         9
           1     0.4432    0.4149    0.4286        94
           2     0.7154    0.6377    0.6743       138
           3     0.9455    0.8000    0.8667        65
           4     0.8222    0.9136    0.8655        81
           5     0.8195    0.9646    0.8862       113

    accuracy                         0.7380       500
   macro avg     0.7304    0.7514    0.7369       500
weighted avg     0.7336    0.7380    0.7324       500

The weighted average F1 is 0.7324.
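
Since any generation that fails to map to a known class is recorded as -1, it is worth checking how often that happens before trusting the report (a small optional check, not part of the original walkthrough):

num_invalid = sum(1 for p in predictions if p == -1)
print(f"Predictions outside the label set: {num_invalid} / {len(predictions)}")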

Few-Shot

Next, we strengthen the prompt with Few-Shot examples, i.e. In-Context Learning (ICL), by taking the first training example of each class as the demonstrations:

# function to select a few examples from each class in the training dataset
def generateFewshotPrompt(class2TrainDataset, N=3):
    fewshot_prompt = "\nFollowing are some examples."
    for label in class2TrainDataset:
        for example in class2TrainDataset[label][:N]:
            fewshot_prompt += "\nquestion: " + example
            fewshot_prompt += "\noutput: " + label
    return fewshot_prompt

# prompt with one example from each class
fewshot_examples = generateFewshotPrompt(class2TrainDataset, N=1)
fewshot_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> fewshot_prompt

The Few-Shot prompt for the first test example:

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What is the full form of .com ?
output: ABBR
question: What films featured the character Popeye Doyle ?
output: ENTY
question: How did serfdom develop in and then leave Russia ?
output: DESC
question: What contemptible scoundrel stole the cork from my lunch ?
output: HUM
question: What sprawling U.S. state boasts the most airports ?
output: LOC
question: When was Ozzy Osbourne born ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

Evaluate the full test set with the Few-Shot prompt:

# prompt is created by adding one example in each of the classes 
labels = []
predictions = []
for example in dataset['test']:
    fewshot_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(fewshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
        
report = classification_report(labels, predictions, digits=4)
print(report)

The evaluation results:

              precision    recall  f1-score   support

           0     0.8182    1.0000    0.9000         9
           1     0.5217    0.5106    0.5161        94
           2     0.7727    0.7391    0.7556       138
           3     1.0000    0.8462    0.9167        65
           4     0.8021    0.9506    0.8701        81
           5     0.9474    0.9558    0.9515       113

    accuracy                         0.7980       500
   macro avg     0.8103    0.8337    0.8183       500
weighted avg     0.8001    0.7980    0.7969       500

The weighted average F1 is now 0.7969.

Dynamic Few-Shot

The Few-Shot prompt already performs much better than the Zero-Shot prompt. Is there still room for improvement?

What if we could choose the Few-Shot examples so that they are as similar as possible to the example under evaluation? This is the idea behind Dynamic Few-Shot: for each test example, select from each class of the training set the k examples (k=1 in this article) with the highest semantic similarity to it.

Measuring semantic similarity requires a base model, typically a text embedding model. This article uses all-mpnet-base-v2, with sentence_transformers computing the embeddings:

from sentence_transformers import SentenceTransformer, util
import numpy as np
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# loading Sentence Transformer based model
model = SentenceTransformer('all-mpnet-base-v2', device=device)

# extract embeddings for a set of examples
def ExtractEmbeddings(examples):
    embedding_ls = []
    for example in examples:
        embedding = model.encode(example)     
        embedding_ls.append(embedding)
    return embedding_ls

# extract embeddings for all the training examples
class2TrainDatasetWithEmbedding = {}
for label in class2TrainDataset:
    embeddings = ExtractEmbeddings(class2TrainDataset[label])
    class2TrainDatasetWithEmbedding[label] = [class2TrainDataset[label], embeddings]

In the code above, we load the all-mpnet-base-v2 model with sentence_transformers, embed the training examples of each class, and keep the resulting sentence embeddings in memory.
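
As a side note, model.encode also accepts a list of sentences and batches them internally, which is typically much faster than encoding one sentence at a time. A minimal batched variant (batch_size here is just an illustrative setting):

# batched equivalent of ExtractEmbeddings (a sketch, not the original code)
def ExtractEmbeddingsBatched(examples, batch_size=64):
    # returns an array of shape (len(examples), embedding_dim)
    return model.encode(examples, batch_size=batch_size)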

Then, for each test example, select from each class the single training example most similar to it, and assemble the Dynamic Few-Shot prompt:

# retrieve the training examples of one class, sorted by similarity to the input text
def getSimilarExamples(input_text, dataset, dataset_embedding):
    input_embedding = model.encode(input_text)
    # dot-product similarity (equivalent to cosine similarity for unit-norm embeddings)
    sim_score = util.dot_score(input_embedding, dataset_embedding)[0].cpu().numpy()
    topN_ids = np.argsort(-sim_score)
    return [dataset[i] for i in topN_ids]
    
def getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding):
    classwiseSimilarExamples = {}
    for label in class2TrainDataset:
        similarExamples = getSimilarExamples(input_text, class2TrainDatasetWithEmbedding[label][0], class2TrainDatasetWithEmbedding[label][1])
        classwiseSimilarExamples[label] = similarExamples
    return classwiseSimilarExamples
    
# generate a prompt with similar examples in each of the classes
def generateDynamicPrompt(input_text, class2TrainDatasetWithEmbedding, N=3):
    classwiseSimilarExamples = getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding)
    dynamic_prompt = "\nFollowing are some examples."
    for label in classwiseSimilarExamples:
        for example in classwiseSimilarExamples[label][:N]:
            dynamic_prompt += "\nquestion: " + example
            dynamic_prompt += "\noutput: " + label
    return dynamic_prompt
    
# dynamic prompt with one similar example in each of the classes
fewshot_examples = generateDynamicPrompt(dataset['test'][0][text_key], class2TrainDatasetWithEmbedding, N=1)
dynamic_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> dynamic_prompt

The Dynamic Few-Shot prompt for the first test example is now:

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What do the letters D.C. stand for in Washington , D.C. ?
output: ABBR
question: What race is 1 , 137 miles long ?
output: ENTY
question: Why is the mile 528 feet ?
output: DESC
question: Who lives at 39 Stone Canyon Way ?
output: HUM
question: What Colorado city owns its own glacier ?
output: LOC
question: How high is the city of Denver ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

The examples in this Dynamic Few-Shot prompt are clearly better matched to the test question than those in the static Few-Shot prompt.

Now evaluate the full test set again:

labels = []
predictions = []
for example in dataset['test']:
    fewshot_examples = generateDynamicPrompt(example[text_key], class2TrainDatasetWithEmbedding, N=1)
    dynamic_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(dynamic_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])
        
report = classification_report(labels, predictions, digits=4)
print(report)

The evaluation results:

              precision    recall  f1-score   support

           0     1.0000    0.7778    0.8750         9
           1     0.7083    0.7234    0.7158        94
           2     0.8615    0.8116    0.8358       138
           3     0.9508    0.8923    0.9206        65
           4     0.8824    0.9259    0.9036        81
           5     0.8926    0.9558    0.9231       113

    accuracy                         0.8560       500
   macro avg     0.8826    0.8478    0.8623       500
weighted avg     0.8572    0.8560    0.8557       500

The final weighted average F1 is 0.8557.

Summary

To summarize: evaluating gpt-3.5-turbo-instruct on the TREC test set under the Zero-Shot, Few-Shot, and Dynamic Few-Shot settings gives the following metrics:

Prompt              weighted avg F1
Zero-Shot           0.7324
Few-Shot            0.7969
Dynamic Few-Shot    0.8557

Clearly, the Dynamic Few-Shot prompt performs best, more than 12 percentage points above the Zero-Shot prompt, and this is without any fine-tuning of the model!

Dynamic Few-Shot prompting is well worth trying in everyday work as well.

References

  1. Basic_Samples/Completions/completions_with_dynamic_prompt.ipynb: https://github.com/Azure-Samples/openai/blob/main/Basic_Samples/Completions/completions_with_dynamic_prompt.ipynb
  2. TREC dataset: https://huggingface.co/datasets/trec

