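Train Your Own Document Library with OpenAI
Preface
Riding the current AI wave, I tried building a question-answering document library.
Project
Use OpenAI to "train" a simple document library (in practice, build a searchable index over your own documents). Publicly available data from the early days of Honor of Kings (王者荣耀) is used for testing here.
Install the environment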
!pip install gpt_index
!pip install langchain
Import packages
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import sys
#from google.colab import drive
import os
Set the environment variable (API key)
os.environ["OPENAI_API_KEY"] = 'Your OpenAI Key'
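Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key may already be set in the environment; the helper name `ensure_openai_key` is hypothetical and not part of gpt_index or langchain:

```python
import os
from getpass import getpass


def ensure_openai_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the API key, prompting only if the variable is not already set.

    `ensure_openai_key` is a hypothetical helper for illustration.
    `prompt_fn` is injectable so the function can be exercised without a terminal.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it does not end up in the notebook output
        key = prompt_fn("Enter your OpenAI API key: ").strip()
        if not key:
            raise ValueError("No API key provided")
        os.environ[env_var] = key
    return key
```

Calling `ensure_openai_key()` once before building the index would replace the hard-coded assignment.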
Define the functions
Add the files in a folder to the training dataset and output the trained data (the index)
def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    # define the LLM (text-ada-001 here; swap in another model name such as text-davinci-003 if preferred)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-ada-001", max_tokens=num_outputs))
    # load every document found in the folder
    documents = SimpleDirectoryReader(directory_path).load_data()
    # build the vector index and save it to disk for later reuse
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.save_to_disk('index.json')
    return index
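The numbers at the top of `construct_index` control how documents are cut into pieces that fit the model's context window. As a rough illustration only, chunking with overlap can be sketched as below. This is a simplified word-based splitter, not gpt_index's actual token-based implementation, and `split_with_overlap` is a hypothetical helper:

```python
def split_with_overlap(words, chunk_size, overlap):
    """Split a list of words into chunks of at most `chunk_size` items,
    where each chunk repeats the last `overlap` items of the previous one.

    Hypothetical helper for illustration; real splitters count tokens, not words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With the values used above, `chunk_size` would correspond to `chunk_size_limit = 600` and `overlap` to `max_chunk_overlap = 20`. The overlap keeps a sentence that straddles a boundary from being lost entirely in either chunk.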
Ask questions against the trained dataset
def ask_bot(input_index = 'index.json'):
    # load the index that construct_index saved to disk
    index = GPTSimpleVectorIndex.load_from_disk(input_index)
    # keep answering questions until the cell is interrupted
    while True:
        query = input('What do you want to ask the bot? \n')
        response = index.query(query, response_mode="compact")
        print("\nBot says: \n\n" + response.response + "\n\n\n")
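The loop in `ask_bot` runs until the cell is interrupted. A minimal sketch of a variant with an explicit exit word is shown below. The function `chat_loop` and its parameters are hypothetical, and the answering function is passed in so the loop itself does not depend on any particular library:

```python
def chat_loop(answer_fn, read_fn=input, write_fn=print, quit_words=("quit", "exit")):
    """Ask for questions until the user types a quit word; return the number answered.

    Hypothetical helper: `answer_fn` maps a question string to an answer.
    `read_fn` and `write_fn` are injectable so the loop can be tested without a console.
    """
    answered = 0
    while True:
        try:
            query = read_fn("What do you want to ask the bot? \n").strip()
        except (EOFError, KeyboardInterrupt):
            break  # treat Ctrl-D / Ctrl-C as a normal exit
        if query.lower() in quit_words:
            break
        if not query:
            continue  # skip empty input instead of sending it to the model
        write_fn("\nBot says: \n\n" + str(answer_fn(query)) + "\n")
        answered += 1
    return answered
```

With a loaded index, it could be called as `chat_loop(lambda q: index.query(q, response_mode="compact").response)`.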
Generate the training data
Call the function to add all the txt files in the content folder to the training data
index = construct_index("/content/")
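Before indexing, it can help to confirm which text files are actually in the folder, since each indexed document costs API calls. A small sketch using only the standard library; `list_txt_files` is a hypothetical helper and does not reflect how `SimpleDirectoryReader` selects files internally:

```python
from pathlib import Path


def list_txt_files(directory_path):
    """Return the sorted names of the .txt files directly inside a folder.

    Hypothetical helper for a quick sanity check; subfolders are not searched.
    """
    folder = Path(directory_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"{directory_path} is not a directory")
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
```

For example, `print(list_txt_files("/content/"))` would show the files expected to be indexed.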
Ask questions using the trained dataset
ask_bot('index.json')
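Conceptually, a vector index answers a question by comparing an embedding of the query with the embeddings of the stored chunks and passing the closest chunks to the model. The toy sketch below illustrates that retrieval step with hand-written vectors. It is an assumption-laden simplification, not gpt_index's code: real embeddings come from the OpenAI API and have many more dimensions, and `cosine_similarity` and `top_k` are hypothetical names:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=1):
    """Return the names of the k chunks whose vectors are most similar to the query.

    `chunk_vecs` maps a chunk name to its (toy) embedding vector.
    """
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

Only the retrieved chunks are placed in the prompt, which is why the library can work with document sets far larger than the model's input limit.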