Latest ChatGPT / GPT-4 NLU in Practice: ChatPDF-Style Document Q&A (with ipynb and Python Source Code and Video). DataWhale's Open-Source Beginner's Guide to ChatGPT, from 0 to 1 (Part 5)




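The most popular of these is ChatPDF, an application built by the developer Mathis Lichtenberger. After you upload a PDF file to ChatPDF, you can hold a cross-language conversation with it and get answers grounded in the PDF's content; in other words, ChatPDF lets you chat with a PDF. Cross-language means that if the PDF is in English you can talk to it in Chinese, and vice versa. The core method builds on OpenAI's Chat API: create a semantic index for each paragraph of the PDF, then use the most closely related paragraphs to prompt the Chat API.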
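As a rough illustration of the "index every paragraph" idea, the sketch below splits text that has already been extracted from a PDF into paragraph chunks, the units that would later be embedded. The function name `split_paragraphs` and the `min_chars` threshold are assumptions made for illustration; this is not ChatPDF's actual implementation, and PDF text extraction itself is left out.

```python
import re


def split_paragraphs(text: str, min_chars: int = 20) -> list[dict]:
    """Split text on blank lines and return chunks with a running id.

    Hypothetical helper: chunks shorter than min_chars (for example stray
    page numbers) are dropped.
    """
    chunks = []
    for raw in re.split(r"\n\s*\n", text):
        para = " ".join(raw.split())  # collapse line breaks and runs of spaces
        if len(para) >= min_chars:
            chunks.append({"id": len(chunks), "text": para})
    return chunks


sample = (
    "First paragraph of the document,\nwrapped over two lines.\n\n"
    "12\n\n"
    "Second paragraph with enough text."
)
for chunk in split_paragraphs(sample):
    print(chunk["id"], chunk["text"])
```

Each returned chunk would then get its own embedding, so that a question can be matched against individual paragraphs rather than the whole file.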


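  • ChatPDF helps us study more effectively. Textbooks, handouts, and presentations all become easy to understand, and there is no need to spend hours paging through research papers and academic articles, so it supports academic growth more efficiently.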

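  • With ChatPDF we can easily unlock a vast store of knowledge. From historical documents to poetry and literary works, in whatever language, ChatPDF can understand the text and reply in the language you prefer. It satisfies curiosity and broadens horizons: the tool can answer questions about the content of any PDF file.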


Latest ChatGPT / GPT-4 Natural Language Understanding (NLU) in Practice: ChatPDF-Style Document Q&A





The Games of the XXIX Olympiad (Beijing 2008), also known as the 2008 Beijing Olympics, opened at 8:00 p.m. on 8 August 2008 in Beijing, the capital of China, and closed on 24 August.




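  • Recall: similar to QA, except that what is recalled this time is a Doc. This step simply selects the documents whose embeddings are most similar to the question's embedding.
  • Answer: the recalled documents and the question are submitted together as a prompt to the Completion/ChatCompletion endpoint, which returns the answer directly.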
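The two steps above can be sketched without any API call. In this minimal sketch, `embed` is a toy bag-of-words stand-in for a real embedding model, and `recall` and `build_prompt` are hypothetical helper names chosen for illustration; a real system would use model embeddings and then send the prompt to the Completion/ChatCompletion endpoint.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall(question: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Recall step: rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Answer step, first half: pack recalled documents and the question into one prompt."""
    return "Context:\n" + "\n".join(context) + "\n\nQ: " + question + "\nA:"


docs = [
    "The 2008 Beijing Olympics opened on 8 August 2008 and closed on 24 August.",
    "Qdrant is a vector search engine.",
]
question = "When did the 2008 Beijing Olympics open?"
prompt = build_prompt(question, recall(question, docs))
print(prompt)  # this string is what would be sent to the model
```

Only the Olympics sentence shares words with the question, so it is the one document recalled and placed in the prompt.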

ChatGPT API


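import openai
OPENAI_API_KEY = "your own API key"

openai.api_key = OPENAI_API_KEY
def complete(prompt):
    # Pre-1.0 openai SDK interface. The argument list is missing from the source;
    # the model name and sampling values below are assumptions.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=300,
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans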
# From the official documentation
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""
complete(prompt)
'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'



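# The Context passage and the question that followed this instruction are missing from the source;
# the string is closed here so that the code parses.
prompt = """Answer the question based on the Context below. Output the answer directly, without any surrounding context.
"""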


def ask(content):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: the model argument is missing from the source
        messages=[{"role": "user", "content": content}]
    )

    ans = response.get("choices")[0].get("message").get("content")
    return ans
ans = ask(prompt)


  First, load the dataset, taken from: openai-cookbook/olympics-1-collect-data.ipynb at 1f6c2304b401e931928e74e978d9a0b8a40d1cf7 · openai/openai-cookbook

import pandas as pd
df = pd.read_csv("./dataset/olympics_sections_text.csv")
df.shape
(3964, 4)
df.head()
   title                 heading                                         content                                             tokens
0  2020 Summer Olympics  Summary                                         The 2020 Summer Olympics (Japanese: 2020年夏季オリン...   726
1  2020 Summer Olympics  Host city selection                             The International Olympic Committee (IOC) vote...   126
2  2020 Summer Olympics  Impact of the COVID-19 pandemic                 In January 2020, concerns were raised about th...   374
3  2020 Summer Olympics  Qualifying event cancellation and postponement  Concerns about the pandemic began to affect qu...   298
4  2020 Summer Olympics  Effect on doping tests                          Mandatory doping tests were being severely res...   163


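  This time we use a different tool instead of Redis: Qdrant - Vector Search Engine. Compared with single-threaded Redis, Qdrant is easier to scale. Keep in mind, though, that tools should be chosen to fit the actual situation: over-optimization is often the original sin, and whatever fits is best. What really matters is abstracting the business logic so that it depends on any particular tool as little as possible; switching tools should then only require swapping an adapter.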
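One possible shape for that adapter idea is sketched below. The `VectorStore` interface, the `InMemoryStore` adapter, and the `top_title` function are all hypothetical names introduced for illustration; they are not part of Qdrant, Redis, or the original tutorial.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface that the business logic depends on."""

    def add(self, doc_id: int, vector: list[float], payload: dict) -> None: ...
    def search(self, vector: list[float], limit: int) -> list[dict]: ...


class InMemoryStore:
    """One adapter: brute-force cosine search over a Python list.

    A Qdrant- or Redis-backed adapter would expose the same two methods.
    """

    def __init__(self) -> None:
        self._items: list[tuple[int, list[float], dict]] = []

    def add(self, doc_id: int, vector: list[float], payload: dict) -> None:
        self._items.append((doc_id, vector, payload))

    def search(self, vector: list[float], limit: int) -> list[dict]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        scored = [(cos(vector, v), p) for _, v, p in self._items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [p for _, p in scored[:limit]]


def top_title(store: VectorStore, query_vector: list[float]) -> str:
    """Business logic: it knows only the VectorStore interface."""
    return store.search(query_vector, limit=1)[0]["title"]


store = InMemoryStore()
store.add(0, [1.0, 0.0], {"title": "high jump"})
store.add(1, [0.0, 1.0], {"title": "long jump"})
print(top_title(store, [0.9, 0.1]))
```

Because `top_title` never imports a specific database client, moving from one backend to another would mean writing a new class with `add` and `search`, with no change to the calling code.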


docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant


pip install qdrant-client


from openai.embeddings_utils import get_embedding, cosine_similarity


def get_embedding_direct(inputs):
    embed_model = "text-embedding-ada-002"

    res = openai.Embedding.create(
        input=inputs, engine=embed_model
    )
    return res
texts = [v.content for v in df.itertuples()]
import pnlp
emds = []
# Embed the texts in batches of 200 and collect one vector per text.
for idx, batch in enumerate(pnlp.generate_batches_by_size(texts, 200)):
    response = get_embedding_direct(batch)
    for v in response.data:
        emds.append(v.embedding)
    print(f"batch: {idx} done")
batch: 0 done
batch: 1 done
batch: 2 done
batch: 3 done
batch: 4 done
batch: 5 done
batch: 6 done
batch: 7 done
batch: 8 done
batch: 9 done
batch: 10 done
batch: 11 done
batch: 12 done
batch: 13 done
batch: 14 done
batch: 15 done
batch: 16 done
batch: 17 done
batch: 18 done
batch: 19 done
len(emds), len(emds[0])
(3964, 1536)
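The batch count in the log above follows directly from the dataset size. The sketch below uses a small generator, `batches`, as a hypothetical stand-in for `pnlp.generate_batches_by_size` (whose internals are not shown in this article) to check the arithmetic on placeholder texts.

```python
import math
from typing import Iterator


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder texts, same count as the dataset rows.
texts = [f"section {i}" for i in range(3964)]
sizes = [len(b) for b in batches(texts, 200)]
print(len(sizes), sizes[-1], math.ceil(3964 / 200))
```

With 3964 texts and a batch size of 200 this gives 20 batches, matching `batch: 0` through `batch: 19`, with 164 texts in the last one.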


from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)


# client = QdrantClient(":memory:")
# or
# client = QdrantClient(path="path/to/db")


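from qdrant_client.models import Distance, VectorParams

# The start of this call is missing from the source; the collection name is taken
# from the delete_collection line below.
client.recreate_collection(
    collection_name="doc_qa",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
# client.delete_collection("doc_qa")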


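# The list brackets and the upload call are missing from the source and are reconstructed here.
payload = [
    {"content": v.content, "heading": v.heading, "title": v.title, "tokens": v.tokens} for v in df.itertuples()
]
client.upload_collection(collection_name="doc_qa", vectors=emds, payload=payload)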


query = "Who won the 2020 Summer Olympics men's high jump?"
query_vector = get_embedding(query, engine="text-embedding-ada-002")
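hits = client.search(
    collection_name="doc_qa",
    query_vector=query_vector,
    limit=5  # the output below shows five hits
)
hits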
[ScoredPoint(id=236, version=3, score=0.90316474, payload={'content': 'The men\'s high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a \'jump off\'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men\'s high jump for Italy and Belarus, the first gold in the men\'s high jump for Italy and Qatar, and the third consecutive medal in the men\'s high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).', 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's high jump", 'tokens': 275}, vector=None),
 ScoredPoint(id=313, version=4, score=0.88258004, payload={'content': "The men's long jump event at the 2020 Summer Olympics took place between 31 July and 2 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (1 universality place was used in 2016). 31 athletes from 20 nations competed. Miltiadis Tentoglou won the gold medal, Greece's first medal in the men's long jump. Cuban athletes Juan Miguel Echevarría and Maykel Massó earned silver and bronze, respectively, the nation's first medals in the event since 2008.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's long jump", 'tokens': 136}, vector=None),
 ScoredPoint(id=284, version=4, score=0.8821836, payload={'content': "The men's pole vault event at the 2020 Summer Olympics took place between 31 July and 3 August 2021 at the Japan National Stadium. 29 athletes from 18 nations competed. Armand Duplantis of Sweden won gold, with Christopher Nilsen of the United States earning silver and Thiago Braz of Brazil taking bronze. It was Sweden's first victory in the event and first medal of any color in the men's pole vault since 1952. Braz, who had won in 2016, became the ninth man to earn multiple medals in the pole vault.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's pole vault", 'tokens': 112}, vector=None),
 ScoredPoint(id=222, version=3, score=0.876395, payload={'content': "The men's triple jump event at the 2020 Summer Olympics took place between 3 and 5 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (2 universality places were used in 2016). 32 athletes from 19 nations competed. Pedro Pichardo of Portugal won the gold medal, the nation's second victory in the men's triple jump (after Nelson Évora in 2008). China's Zhu Yaming took silver, while Hugues Fabrice Zango earned Burkina Faso's first Olympic medal in any event.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's triple jump", 'tokens': 139}, vector=None),
 ScoredPoint(id=205, version=3, score=0.86075026, payload={'content': "The men's 110 metres hurdles event at the 2020 Summer Olympics took place between 3 and 5 August 2021 at the Olympic Stadium. Approximately forty athletes were expected to compete; the exact number was dependent on how many nations used universality places to enter athletes in addition to the 40 qualifying through time or ranking (1 universality place was used in 2016). 40 athletes from 29 nations competed. Hansle Parchment of Jamaica won the gold medal, the nation's second consecutive victory in the event. His countryman Ronald Levy took bronze. American Grant Holloway earned silver, placing the United States back on the podium in the event after the nation missed the medals for the first time in Rio 2016 (excluding the boycotted 1980 Games).", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's 110 metres hurdles", 'tokens': 149}, vector=None)]



# Token budget for the context. The definition is missing from the source; 500 is an
# assumption that is consistent with the "Selected 2 document sections" output below.
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
separator_len = 3
def construct_prompt(question: str):
    query_vector = get_embedding(question, engine="text-embedding-ada-002")
    hits = client.search(
        collection_name="doc_qa",
        query_vector=query_vector,
        limit=5
    )
    choose = []
    length = 0
    indexes = []
    for hit in hits:
        doc = hit.payload
        length += doc["tokens"] + separator_len
        # Stop at the first section that would exceed the budget.
        if length > MAX_SECTION_LEN:
            break
        choose.append(SEPARATOR + doc["content"].replace("\n", " "))
        indexes.append(doc["title"] + doc["heading"])
    # Useful diagnostic information
    print(f"Selected {len(choose)} document sections:")
    print("\n".join(indexes))
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    return header + "".join(choose) + "\n\n Q: " + question + "\n A:"
prompt = construct_prompt("Who won the 2020 Summer Olympics men's high jump?")

print("===\n", prompt)
Selected 2 document sections:
Athletics at the 2020 Summer Olympics – Men's high jumpSummary
Athletics at the 2020 Summer Olympics – Men's long jumpSummary
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).
* The men's long jump event at the 2020 Summer Olympics took place between 31 July and 2 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (1 universality place was used in 2016). 31 athletes from 20 nations competed. Miltiadis Tentoglou won the gold medal, Greece's first medal in the men's long jump. Cuban athletes Juan Miguel Echevarría and Maykel Massó earned silver and bronze, respectively, the nation's first medals in the event since 2008.

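 Q: Who won the 2020 Summer Olympics men's high jump?
 A:
def complete(prompt):
    # The argument list is missing from the source; model and sampling values are assumptions.
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=300,
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans
complete(prompt)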
'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal.'
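The "Selected 2 document sections" line comes from the token budget applied inside `construct_prompt`. The sketch below replays that selection on the token counts of the five hits shown earlier; the helper name `select_within_budget` and the budget of 500 tokens are assumptions chosen because they reproduce the observed output.

```python
def select_within_budget(token_counts: list[int], budget: int, separator_len: int = 3) -> list[int]:
    """Return indexes of sections that fit, stopping at the first one that overflows."""
    chosen, length = [], 0
    for i, tokens in enumerate(token_counts):
        length += tokens + separator_len
        if length > budget:
            break
        chosen.append(i)
    return chosen


# Token counts of the five hits for the high jump query, in ranked order.
hit_tokens = [275, 136, 112, 139, 149]
print(select_within_budget(hit_tokens, budget=500))
```

The running total is 278 after the first section and 417 after the second; the third would bring it to 532, so only the high jump and long jump summaries are kept.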


def ask(content):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: the model argument is missing from the source
        messages=[{"role": "user", "content": content}]
    )

    ans = response.get("choices")[0].get("message").get("content")
    return ans


ans = ask(prompt)
"Gianmarco Tamberi and Mutaz Essa Barshim shared the gold medal in the men's high jump event at the 2020 Summer Olympics."


query = "Why was the 2020 Summer Olympics originally postponed?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 1 document sections:
Concerns and controversies at the 2020 Summer OlympicsSummary

Q: Why was the 2020 Summer Olympics originally postponed?
A: The 2020 Summer Olympics were originally postponed due to the COVID-19 pandemic.
query = "In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 2 document sections:
2020 Summer Olympics medal tableSummary
List of 2020 Summer Olympics medal winnersSummary

Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The United States won the most medals overall, with 113, and the most gold medals, with 39.
# ChatGPT
answer = ask(prompt)

print(f"\nQ: {query}\nA: {answer}")
Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The country that won the most medals at the 2020 Summer Olympics was the United States, with 113 medals, including 39 gold medals.
query = "What is the tallest mountain in the world?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 3 document sections:
Sport climbing at the 2020 Summer Olympics – Men's combinedRoute-setting
Ski mountaineering at the 2020 Winter Youth Olympics – Boys' individualSummary
Ski mountaineering at the 2020 Winter Youth Olympics – Girls' individualSummary

Q: What is the tallest mountain in the world?
A: I don't know.
# ChatGPT
answer = ask(prompt)

print(f"\nQ: {query}\nA: {answer}")
Q: What is the tallest mountain in the world?
A: I don't know.


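Latest ChatGPT / GPT-4 NLU in Practice: Entity Classification and Recognition, and Model Fine-Tuning

Latest ChatGPT / GPT-4 NLU in Practice: Intelligent Multi-Turn Dialogue Bots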




ChatGPT Usage Guide: Sentence and Token Classification, by @长琴










