After settling on this approach, we processed the original PDF files and converted them fully into Word documents; tables, captions, and other content embedded as images were converted to text accordingly. We then moved on to the code implementation. There are three main implementation approaches:
1. The sample code in the Playground uses an adapter-style approach that lets you specify Cognitive Search as a data source directly in the completion call. This approach still needs further work: the information it returns is not assembled into content with natural semantics, and when that information is combined with the prompt and sent back to the completion endpoint for normal processing, the call fails with "The extensions chat completions operation must have at least one extension". The root cause is that the adapter changes the completion request's target URL; there is little sample code online showing how to adjust this further, so it needs separate investigation;
2. Query through Cognitive Search's SearchClient class (hybrid mode, vector plus keyword, is generally recommended), then send the query results together with the prompt to the completion endpoint for processing;
3. Use LangChain. LangChain supports many vector stores, so Cognitive Search is not required; this option is worth exploring separately.
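For reference on approach 1: it boils down to calling the extensions variant of the chat completions endpoint with a `dataSources` entry in the body. Below is a minimal sketch of the request URL and payload shape; the endpoint, key, and index names are placeholders, and the field names follow the 2023 preview API and may differ in later versions.

```python
import json

def build_extensions_request(api_base, deployment_id, api_version,
                             search_endpoint, search_key, search_index_name,
                             user_question):
    """Assemble the URL and JSON body for the extensions chat completions call.

    This sketches the request shape that approach 1 relies on; the schema
    is taken from the 2023 preview API and may change in later versions.
    """
    url = (f"{api_base}openai/deployments/{deployment_id}"
           f"/extensions/chat/completions?api-version={api_version}")
    body = {
        "dataSources": [{
            "type": "AzureCognitiveSearch",
            "parameters": {
                "endpoint": search_endpoint,
                "key": search_key,
                "indexName": search_index_name,
            },
        }],
        "messages": [{"role": "user", "content": user_question}],
    }
    return url, json.dumps(body)

url, payload = build_extensions_request(
    "https://example.openai.azure.com/", "gpt4model", "2023-08-01-preview",
    "https://example.search.windows.net", "***", "index01",
    "sample question")
```

The "must have at least one extension" error mentioned above suggests the request reached this extensions URL without a valid `dataSources` array, which is why the adapter's URL rewriting matters.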
This article provides the code for the second approach:
import os
import json
import openai
import streamlit as st
import requests
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import Vector
from azure.search.documents.indexes.models import (
SearchIndex,
SearchField,
SearchFieldDataType,
SimpleField,
SearchableField,
SearchIndex,
SemanticConfiguration,
PrioritizedFields,
SemanticField,
SearchField,
SemanticSettings,
VectorSearch,
HnswVectorSearchAlgorithmConfiguration,
)
# References: https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-python-sample.ipynb
# Initialize openai; reading keys from Streamlit secrets here can be replaced with any other mechanism
openai.api_key = st.secrets["OPENAI_API_KEY"]
openai.api_type = "azure"
openai.api_version = "2023-08-01-preview"
openai.api_base = "https://***.openai.azure.com/"
deployment_id = "gpt4model"
search_endpoint = "https://***.search.windows.net"
search_key = st.secrets["SEARCH_KEY"]
search_index_name = "***index01"
credential = AzureKeyCredential(search_key)
search_client = SearchClient(endpoint=search_endpoint, index_name=search_index_name, credential=credential)
def generate_embeddings(text):
    # Embed the input text with the Azure OpenAI embedding deployment
    response = openai.Embedding.create(
        input=text, engine="embeddingmodel")
    embeddings = response['data'][0]['embedding']
    return embeddings
# Step 1: Query from Azure Cognitive Search
# Create query vector
prompt = "静脉留置针有什么特点?"  # "What are the characteristics of an indwelling IV catheter?"
vector = Vector(value=generate_embeddings(prompt), k=3, fields="contentVector")
results = search_client.search(
    search_text=prompt,     # keyword half of the hybrid query
    top=3,
    vectors=[vector],       # vector half of the hybrid query
    select=["title", "content"],
)
rawdata = ''
for result in results:
    rawdata += f"Title: {result['title']}\n"
    rawdata += f"Score: {result['@search.score']}\n"
    rawdata += f"Content: {result['content']}\n"
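One practical concern with the concatenation above: with `top=3` chunks, the combined context can grow beyond the model's context window. A simple character-budget guard is sketched below; the limit is an arbitrary placeholder, and a real implementation would count tokens rather than characters:

```python
def assemble_context(results, max_chars=6000):
    """Concatenate search hits into a context string, stopping at a budget."""
    parts, used = [], 0
    for result in results:
        piece = (f"Title: {result['title']}\n"
                 f"Content: {result['content']}\n")
        if used + len(piece) > max_chars:
            break  # drop lower-ranked hits rather than overflow the window
        parts.append(piece)
        used += len(piece)
    return "".join(parts)

# Illustrative hits: only the first fits under a tight budget
sample_hits = [
    {"title": "A", "content": "x" * 100},
    {"title": "B", "content": "y" * 100},
]
context = assemble_context(sample_hits, max_chars=150)
```

Because Cognitive Search returns hits in relevance order, truncating from the tail keeps the highest-scoring material.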
# Step 2: Query from OpenAI
# Append the retrieved context to the question, delimited by ### markers
prompt += '###\n' + rawdata + '\n###\n'
completion = openai.ChatCompletion.create(
    engine=deployment_id,
    messages=[{"role": "user", "content": prompt}],
)
rawdata = json.dumps(completion, ensure_ascii=False)
print(rawdata)
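Rather than dumping the raw JSON, the answer text can be pulled out of the response. The 0.x SDK's ChatCompletion response is dict-like, with the shape sketched below; the sample response here is a mock for illustration, not real model output:

```python
def extract_answer(completion):
    """Return the assistant's message text from a ChatCompletion response."""
    return completion["choices"][0]["message"]["content"]

# Mock of the response structure, for illustration only
mock_completion = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "placeholder answer"},
         "finish_reason": "stop"}
    ]
}
answer = extract_answer(mock_completion)
```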
How can this approach be optimized further? To be continued in the next article.