Recently, foreign media reported that ByteDance had been using OpenAI's technology to develop its own large language model, violating OpenAI's terms of service and leading to its account being suspended. I don't know the details of this case, so I won't comment on it here. But in fact, using the OpenAI API to generate high-quality training data is nothing new. An early and influential example is the Alpaca model. Alpaca was fine-tuned from Meta's LLaMA 7B on only 52K instruction-following examples, yet its performance was roughly on par with GPT-3.5. Remarkably, the total training cost was under $600: fine-tuning for 3 hours on 8 80GB A100 GPUs cost less than $100, and generating the data with the OpenAI API cost about $500.
How can a mere 52K examples produce such a striking result? The answer naturally draws attention to the content of the dataset and the principle behind its generation.
https://github.com/tatsu-lab/stanford_alpaca
1. The prompt behind the Alpaca instruction set
You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.
Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instructions.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
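To turn this prompt into training data, the Alpaca pipeline prepends the requirement text above to a handful of in-context seed tasks and asks the model to continue the numbered list. The sketch below shows the general idea of that prompt assembly; the helper name `encode_prompt` and the example seed tasks here are illustrative stand-ins, and in the real pipeline the seeds are sampled from `seed_tasks.jsonl` in the stanford_alpaca repo and the completed prompt is sent to the OpenAI API.

```python
# Minimal sketch of Alpaca-style self-instruct prompt assembly.
# Helper name and seed examples are hypothetical; the real requirement
# list (elided below) is the full prompt quoted in the article.

PROMPT_PREFIX = (
    "You are asked to come up with a set of 20 diverse task instructions. "
    "These task instructions will be given to a GPT model and we will "
    "evaluate the GPT model for completing the instructions.\n\n"
    "Here are the requirements:\n"
    "1. Try not to repeat the verb for each instruction to maximize diversity.\n"
    # ... remaining requirements elided for brevity ...
    "\nList of 20 tasks:\n"
)

def encode_prompt(seed_instructions):
    """Append a few seed tasks so the model continues the numbered list."""
    prompt = PROMPT_PREFIX
    for i, task in enumerate(seed_instructions, start=1):
        prompt += f"###\n{i}. Instruction: {task['instruction']}\n"
        prompt += f"{i}. Input:\n{task['input'] or '<noinput>'}\n"
        prompt += f"{i}. Output:\n{task['output']}\n"
    # End mid-list so the model's completion supplies new instructions.
    prompt += f"###\n{len(seed_instructions) + 1}. Instruction:"
    return prompt

seeds = [
    {"instruction": "Classify the sentiment of the given sentence.",
     "input": "I loved the movie.", "output": "positive"},
    {"instruction": "Give three tips for staying healthy.",
     "input": "", "output": "1. Eat a balanced diet. 2. Exercise. 3. Sleep well."},
]
print(encode_prompt(seeds))
```

The key design point is that the prompt deliberately stops at the next item number ("3. Instruction:"), so the model's completion is a continuation of the list, which the pipeline can then parse back into new (instruction, input, output) triples.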