Fine-tuning classification example

We will fine-tune an ada classifier to distinguish between two sports: baseball and hockey.

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

Data exploration

The newsgroup dataset can be loaded using sklearn. First we will look at the data itself:

print(sports_dataset['data'][0])
From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu                   Schaumburg, Illinois
dougb@casbah.acns.nwu.edu       708-576-8207                    
sports_dataset.target_names[sports_dataset['target'][0]]
'rec.sport.baseball'
len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")
Total examples: 1197, Baseball examples: 597, Hockey examples: 600

Data preparation

We transform the dataset into a pandas dataframe with one column for the prompt and one for the completion. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. For demonstration purposes and faster fine-tuning you could keep only a subset (the commented-out [:300] slice below); here we keep all 1,197 examples. In a real use case, the more examples the better the performance generally is.

import pandas as pd

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()
                                              prompt completion
0  From: dougb@comm.mot.com (Doug Bank)\nSubject:…   baseball
1  From: gld@cunixb.cc.columbia.edu (Gary L Dare)…     hockey
2  From: rudy@netcom.com (Rudy Wade)\nSubject: Re…   baseball
3  From: monack@helium.gas.uug.arizona.edu (david…     hockey
4  Subject: Let it be Known\nFrom: <ISSBTL@BYUVM…    baseball
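
The data preparation tool in the next step reads the dataset from a JSONL file named sport2.jsonl, so the dataframe needs to be exported first. A minimal sketch of that export step (the filename matches the one passed to the tool below):

# Write the prompt/completion pairs to a JSONL file for the preparation tool.
df.to_json("sport2.jsonl", orient='records', lines=True)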

Data preparation tool

We can now use a data preparation tool which will suggest a few improvements to our dataset before fine-tuning. Before launching the tool we upgrade the openai library to ensure we're using the latest data preparation tool. We additionally specify -q, which auto-accepts all suggestions.

!pip install --upgrade openai
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q
Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 11 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `sport2_prepared_train.jsonl` and `sport2_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

The tool helpfully suggests a few improvements to the dataset and splits it into a training and a validation set.

A suffix between the prompt and the completion is necessary to tell the model that the input text has ended and that it now needs to predict the class. Since we use the same separator in every example, the model learns that it should predict either baseball or hockey following the separator. A whitespace prefix on the completion is useful because most word tokens are tokenized with a leading space. The tool also recognized that this is likely a classification task, so it suggested splitting the dataset into training and validation sets. This will let us easily measure the expected performance on new data.
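
To make the separator and whitespace-prefix changes concrete, here is a small sketch that peeks at the first prepared training example (assuming the sport2_prepared_train.jsonl file written above):

import json

# Each line of the prepared file is a JSON object with "prompt" and "completion".
with open("sport2_prepared_train.jsonl") as f:
    example = json.loads(f.readline())

print(repr(example["prompt"][-10:]))   # the prompt now ends with '\n\n###\n\n'
print(repr(example["completion"]))     # the completion starts with a space, e.g. ' hockey'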

Fine-tuning

The tool suggests we run the following command to train on the dataset. Since this is a classification task, we would like to know how well our classifier generalizes to the provided validation set. The tool suggests adding --compute_classification_metrics --classification_positive_class " baseball" in order to compute the classification metrics.

We can simply copy the suggested command from the CLI tool. We additionally specify -m ada to fine-tune the cheaper and faster ada model, which is usually comparable in performance to slower and more expensive models on classification use cases.

!openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada
Upload progress: 100%|████████████████████| 1.52M/1.52M [00:00<00:00, 1.81Mit/s]
Uploaded file from sport2_prepared_train.jsonl: file-Dxx2xJqyjcwlhfDHpZdmCXlF
Upload progress: 100%|███████████████████████| 388k/388k [00:00<00:00, 507kit/s]
Uploaded file from sport2_prepared_valid.jsonl: file-Mvb8YAeLnGdneSAFcfiVcgcN
Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-07-30 13:15:50] Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
[2021-07-30 13:15:52] Fine-tune enqueued. Queue number: 0
[2021-07-30 13:15:56] Fine-tune started
[2021-07-30 13:18:55] Completed epoch 1/4
[2021-07-30 13:20:47] Completed epoch 2/4
[2021-07-30 13:22:40] Completed epoch 3/4
[2021-07-30 13:24:31] Completed epoch 4/4
[2021-07-30 13:26:22] Uploaded model: ada:ft-openai-2021-07-30-12-26-20
[2021-07-30 13:26:27] Uploaded result file: file-6Ki9RqLQwkChGsr9CHcr1ncg
[2021-07-30 13:26:28] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-openai-2021-07-30-12-26-20 -p <YOUR_PROMPT>

The model trains successfully in about ten minutes. We can see the model name is ada:ft-openai-2021-07-30-12-26-20, which we can use for inference.
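
Instead of copying the model name out of the log, it can also be looked up from the fine-tune job itself. A sketch assuming the legacy openai Python library (v0.x) used throughout this example and the job id shown above:

# Retrieve the fine-tune job and read the name of the resulting model.
fine_tune = openai.FineTune.retrieve(id="ft-2zaA7qi0rxJduWQpdvOvmGn3")
print(fine_tune["fine_tuned_model"])  # ada:ft-openai-2021-07-30-12-26-20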

[Advanced] Results and expected model performance

We can now download the results file to observe the expected performance on the held-out validation set.

!openai api fine_tunes.results -i ft-2zaA7qi0rxJduWQpdvOvmGn3 > result.csv
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)
(row 929, transposed for readability)
step                              930
elapsed_tokens                3027688
elapsed_examples                 3720
training_loss                0.044408
training_sequence_accuracy        1.0
training_token_accuracy           1.0
classification/accuracy      0.991597
classification/precision     0.983471
classification/recall             1.0
classification/auroc              1.0
classification/auprc              1.0
classification/f1.0          0.991667
validation_loss                   NaN
validation_sequence_accuracy      NaN
validation_token_accuracy         NaN

The accuracy reaches about 99.2%. On the plot below we can see how accuracy on the validation set improves during the training run.

results[results['classification/accuracy'].notnull()]['classification/accuracy'].plot()

[Figure: classification/accuracy on the validation set over training steps]

Using the model

We can now call the model to get predictions.

test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head()
                                              prompt completion
0  From: gld@cunixb.cc.columbia.edu (Gary L Dare)…     hockey
1  From: smorris@venus.lerc.nasa.gov (Ron Morris …     hockey
2  From: golchowy@alchemy.chem.utoronto.ca (Geral…     hockey
3  From: krattige@hpcc01.corp.hp.com (Kim Krattig…   baseball
4  From: warped@cs.montana.edu (Doug Dolven)\nSub…   baseball

We need to append the same separator to the prompt that we used during fine-tuning, in this case \n\n###\n\n. Since we only care about the class, we want the temperature to be as low as possible, and we only need a one-token completion to determine the model's prediction.

ft_model = 'ada:ft-openai-2021-07-30-12-26-20'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0)
res['choices'][0]['text']
' hockey'

To get the log probabilities, we can specify the logprobs parameter on the completion request:

res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]
<OpenAIObject at 0x7fe114e435c8> JSON: {
  " baseball": -7.6311407,
  " hockey": -0.0006307676
}

We can see that the model predicts hockey as far more likely than baseball, which is the correct prediction. By requesting logprobs, we can see the predicted (log) probability for each class.
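
Since logprobs are natural logarithms, they can be exponentiated to get probabilities. A minimal sketch, reusing the res object from the request above:

import math

# Convert log probabilities to probabilities for readability.
top_logprobs = res['choices'][0]['logprobs']['top_logprobs'][0]
probs = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}
print(probs)  # ' hockey' should be close to 1.0 and ' baseball' close to 0.0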

Generalization

Interestingly, our fine-tuned classifier is quite versatile. Despite being trained on emails to different mailing lists, it also successfully predicts tweets.

sample_hockey_tweet = """Thank you to the 
@Canes
 and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt! Really excited to start this new chapter in my career with the 
@DetroitRedWings
 !!"""
res = openai.Completion.create(model=ft_model, prompt=sample_hockey_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']
' hockey'
sample_baseball_tweet="""BREAKING: The Tampa Bay Rays are finalizing a deal to acquire slugger Nelson Cruz from the Minnesota Twins, sources tell ESPN."""
res = openai.Completion.create(model=ft_model, prompt=sample_baseball_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']
' baseball'
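
To wrap up, the classification call can be packaged into a small helper. This is a hypothetical convenience function, not part of the original walkthrough, but it keeps the separator and sampling settings in one place:

def classify_sport(text, model=ft_model):
    # Append the fine-tuning separator and ask for a single low-temperature token.
    response = openai.Completion.create(
        model=model,
        prompt=text + '\n\n###\n\n',
        max_tokens=1,
        temperature=0,
    )
    return response['choices'][0]['text']

print(classify_sport(sample_baseball_tweet))  # ' baseball'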