Getting Started with LangChain (1): Model I/O


Official documentation: https://python.langchain.com/docs/get_started/introduction/

LangChain is a framework for quickly building applications on top of large language models (LLMs):

  • Highly abstracted components that can be assembled like building blocks to implement your application
  • Integration of external data into the LLM, such as API data, files, and external applications
  • Many customizable high-level LLM capabilities, such as Agents, RAG, and so on

The LangChain framework consists of six main parts:

  1. Model I/O: formats and manages the input and output of LLMs
  2. Retrieval: interacts with application-specific data, e.g. RAG; closely tied to vector databases, it searches a vector store for documents relevant to a question and uses them as augmented context for the LLM
  3. Agents: structures that decide which tool to use (high-level instructions), while tools are the interfaces that let the LLM interact with external systems
  4. Chains: block-style compositions for building programs, i.e. connecting multiple modules to implement complex applications
  5. Memory: stores application state while a chain runs, e.g. persisting conversation history so it can be reloaded at any time to keep long conversations consistent
  6. Callbacks: a callback mechanism that can trace the steps of any chain and log them

Model IO

https://python.langchain.com/docs/modules/model_io/

[Figure: Model I/O overview]

Model I/O is the core part that interacts directly with the LLM. As the figure above shows, it covers:

  • formatting the input, handled by the Prompts component
  • calling the LLM / Chat Model to generate a prediction
  • parsing the LLM's output, handled by the Output Parsers component

Let's first clarify the difference between an LLM and a Chat Model:

  • An LLM can be thought of as a text-completion model: it generates a piece of text related to the instruction the user provides;
  • A Chat Model, on the other hand, is a chat-oriented model that can hold multi-turn conversations.
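To make the difference concrete, here is a minimal sketch (assuming Tongyi is used as the provider, as in the later sections, and that the DASHSCOPE_API_KEY environment variable is set):

from langchain_community.llms import Tongyi
from langchain_community.chat_models import ChatTongyi
from langchain_core.messages import HumanMessage

llm = Tongyi()       # text in -> text out
chat = ChatTongyi()  # messages in -> message out

llm.invoke("Introduce the Phillips curve in one sentence")                       # returns a str
chat.invoke([HumanMessage(content="Introduce the Phillips curve in one sentence")])  # returns an AIMessage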

Prompts

Example code: prompt_templates.ipynb

A prompt is the set of instructions we, as users, provide to the LLM to steer its generation: it helps the model understand the context and produce relevant, coherent output, such as answering questions, completing sentences, or holding a conversation.

Prompt Templates

A prompt template is a predefined recipe for generating (formatting) a prompt for an LLM. A template may contain instructions, few-shot examples, and context and questions specific to a given task.

LangChain provides model-agnostic templates that can be reused across different LLMs.

PromptTemplate

  • Used to create a string prompt; it is essentially Python's str.format. It supports any number of variables, including none.
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    "Tell me a {adjective} joke about {content}."
)
prompt_template.format(adjective="funny", content="chickens")

"""
'Tell me a funny joke about chickens.'
"""

ChatPromptTemplate

  • Used to create prompts for chat models; the prompt is a list of chat messages.

Besides its text content, every chat message carries an extra role parameter. For example, OpenAI's ChatGPT generally uses three roles (most other LLMs follow the same convention):

  1. system: sets the persona for the conversation. It is optional; if present, the conversation must start with the system message, and if absent, the persona defaults to a plain assistant, equivalent to using "You are a helpful assistant." as the system message;
  2. user: our input as the user, stating what we want the assistant (the LLM) to answer;
  3. assistant: normally the assistant's (LLM's) output, used to store the multi-turn history (user-assistant-user-assistant-…). It can also be written by us to provide examples, e.g. few-shot prompting, covered in detail later.
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI bot. Your name is {name}."),
        ("human", "Hello, how are you doing?"),
        ("ai", "I'm doing well, thanks!"),
        ("human", "{user_input}"),
    ]
)

messages = chat_template.format_messages(name="Bob", user_input="What is your name?")

"""
[SystemMessage(content='You are a helpful AI bot. Your name is Bob.'),
 HumanMessage(content='Hello, how are you doing?'),
 AIMessage(content="I'm doing well, thanks!"),
 HumanMessage(content='What is your name?')]
"""

In LangChain, human corresponds to user above, and ai corresponds to assistant.

As you can see, the generated prompt is produced by filling in the format and building the corresponding Message objects, so Message objects can also be passed in directly when defining the template, as below:

from langchain.prompts import HumanMessagePromptTemplate
from langchain_core.messages import SystemMessage

chat_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful assistant that re-writes the user's text to "
                "sound more upbeat."
            )
        ),
        HumanMessagePromptTemplate.from_template("{text}"),
    ]
)

This provides a lot of flexibility for constructing and reusing chat prompts.
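Formatting this template then works the same way; for example:

messages = chat_template.format_messages(text="I don't like eating tasty things")

"""
[SystemMessage(content="You are a helpful assistant that re-writes the user's text to sound more upbeat."),
 HumanMessage(content="I don't like eating tasty things")]
"""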

Message Prompts

As the previous code example shows, MessagePromptTemplate is a template for constructing Message objects; the three roles map to three message templates: AIMessagePromptTemplate, SystemMessagePromptTemplate and HumanMessagePromptTemplate.

LangChain also provides a message template that accepts an arbitrary role: ChatMessagePromptTemplate.

from langchain.prompts import ChatMessagePromptTemplate

prompt = "May the {subject} be with you"

chat_message_prompt = ChatMessagePromptTemplate.from_template(
    role="Jedi", template=prompt
)
chat_message_prompt.format(subject="force")
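The result is a ChatMessage carrying the custom role:

"""
ChatMessage(content='May the force be with you', role='Jedi')
"""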

MessagesPlaceholder

  • The messages placeholder gives you full control over which messages are rendered during formatting. It is useful when you are unsure which role to use, or when you want to insert a list of messages.
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)

human_prompt = "Summarize our conversation so far in {word_count} words."
human_message_template = HumanMessagePromptTemplate.from_template(human_prompt)

chat_prompt = ChatPromptTemplate.from_messages(
    [MessagesPlaceholder(variable_name="conversation"), human_message_template]
)

As you can see, the chat template above is defined as a placeholder plus a human message. Unlike the fixed human message, the placeholder's messages are not determined up front; the data is only filled in when the prompt is finally formatted, as in the following code:

from langchain_core.messages import AIMessage, HumanMessage

human_message = HumanMessage(content="What is the best way to learn programming?")
ai_message = AIMessage(
    content="""\
1. Choose a programming language: Decide on a programming language that you want to learn.

2. Start with the basics: Familiarize yourself with the basic programming concepts such as variables, data types and control structures.

3. Practice, practice, practice: The best way to learn programming is through hands-on experience\
"""
)

chat_prompt.format_prompt(
    conversation=[human_message, ai_message], word_count="10"
).to_messages()

"""
[HumanMessage(content='What is the best way to learn programming?'),
 AIMessage(content='1. Choose a programming language: Decide on a programming language that you want to learn.\n\n2. Start with the basics: Familiarize yourself with the basic programming concepts such as variables, data types and control structures.\n\n3. Practice, practice, practice: The best way to learn programming is through hands-on experience'),
 HumanMessage(content='Summarize our conversation so far in 10 words.')]
"""

Few-shot Prompt

Finally, here is a very practical technique that comes up often in prompt engineering: few-shot prompting.

As mentioned above, models such as OpenAI's ChatGPT use three roles: system, user and assistant. We can pack examples into the user-assistant turns so that the model learns this knowledge from the context.

For example: user message_1 (input of example 1) -> assistant message_1 (output of example 1, i.e. how we want the model to answer) -> … -> user message_n (finally, our real input).

from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)

examples = [
    {"input": "2+2", "output": "4"},
    {"input": "2+3", "output": "5"},
]

# This is a prompt template used to format each individual example.
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

print(few_shot_prompt.format())
"""
Human: 2+2
AI: 4
Human: 2+3
AI: 5
"""

final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a wondrous wizard of math."),
        few_shot_prompt,
        ("human", "{input}"),
    ]
)

print(final_prompt.format(input='6+7'))
"""
System: You are a wondrous wizard of math.
Human: 2+2
AI: 4
Human: 2+3
AI: 5
Human: 6+7
"""

LCEL

Before going further, it is worth getting to know LangChain's basic building block, the Runnable interface; it makes the rest of LangChain much clearer and underpins LCEL, the LangChain Expression Language.

LangChain implements a "Runnable" protocol. Many basic components, including the prompt templates above, LLMs and Chat Models, and the Retrievers covered later, all implement this interface, which also makes it easier to build custom chains and components.

It defines the following methods (a brief sketch follows below):

  • invoke: call the chain (e.g. an LLM) with a single input
  • batch: call the chain with a list of inputs
  • stream: stream the results back

Plus the corresponding async methods, i.e. concurrency via Python's asyncio await syntax:

ainvoke, abatch, astream: the async counterparts of invoke, batch and stream above

  • astream_log: streams back intermediate-step execution information in addition to the final response
  • astream_events: streams execution events (a beta feature introduced in langchain-core 0.1.14)
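Here is a brief sketch of these calls on a Runnable, using a prompt template as the simplest example (the async variants must run inside an event loop):

from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template("Tell me a joke about {topic}")

prompt.invoke({"topic": "bears"})                      # single input -> PromptValue
prompt.batch([{"topic": "bears"}, {"topic": "cats"}])  # list of inputs -> list of PromptValues
for chunk in prompt.stream({"topic": "bears"}):        # iterator of output chunks
    print(chunk)

# async counterparts, e.g.:
# await prompt.ainvoke({"topic": "bears"})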

It also helps to look at the input and output types of each component, which will make the later sections easier to follow:

Component    | Input Type                                             | Output Type
Prompt       | Dictionary                                             | PromptValue
ChatModel    | Single string, list of chat messages or a PromptValue  | ChatMessage
LLM          | Single string, list of chat messages or a PromptValue  | String
OutputParser | The output of an LLM or ChatModel                      | Depends on the parser
Retriever    | Single string                                          | List of Documents
Tool         | Single string or dictionary, depending on the tool     | Depends on the tool

LLMs

Example code: llms.ipynb

Large Language Models (LLMs) are a core component of LangChain. Note, however, that LangChain does not serve LLMs itself; it provides a standard interface for talking to the many LLM providers (OpenAI, Cohere, and so on).

Concretely, LangChain's LLM component is an interface that takes a string as input and returns a string.

Below we use Tongyi Qianwen (Qwen) as our LLM provider. Many Chinese LLMs currently come with a free quota of several million tokens, which makes them very convenient for learning purposes.

Basic usage

Initializing the LLM.

from langchain.llms import Tongyi

# pass the api key as an argument
llm = Tongyi(dashscope_api_key='....')

# or set the `DASHSCOPE_API_KEY` environment variable
llm = Tongyi()

Generating text with the LLM.

1. Blocking mode

llm.invoke("有什么关于失业和通货膨胀的相关性的理论")

"""
'在经济学中,失业和通货膨胀之间的关系是通过一种叫做菲利普斯曲线(Phillips Curve)的理论来描述的.....'
"""

2. Streaming mode

for chunk in llm.stream(
    "有什么关于失业和通货膨胀的相关性的理论"
):
    print(chunk, end="\n", flush=True)
   
"""
失业
和
通
货膨胀之间的关系是
经济学中的一个重要概念,通常被描述
为菲利普斯曲线(Phill
ips Curve)的理论。该理论
由新西兰经济学家A.W.H.菲
利普斯在1958
年提出,后来被进一步发展和完善
.....
"""

3. Batch calls

llm.batch(
    [
        "简单介绍一个唐代诗人",
        "简单介绍一个宋代诗人"
    ]
)

"""
['李白,字太白,号青莲居士,是唐朝时期著名的浪漫主义诗人,被誉为“诗仙”......',
 '苏轼,字子瞻,号东坡居士,是北宋时期著名的文学家、书画家,被后人尊称为“唐宋八大家”之一......']
"""

Custom LLM

If you have deployed a model locally, for example with Ollama or Xinference, you can wrap it in LangChain's LLM interface yourself and then use it just like the built-in LLMs.

First, inherit from the LLM base class, which automatically makes the class a Runnable as described above. The following methods must then be implemented:

Method    | Description
_call     | Takes in a string and some optional stop words, and returns a string. Used by invoke.
_llm_type | A property that returns a string, used for logging purposes only.

  • As you can see, the LLM base class already implements invoke for us, but it relies on _call, which returns a string
  • In practice, though, models such as Tongyi instead override _generate to back invoke, because _generate returns an LLMResult, which can carry extra information such as token usage (see the sketch after the tables below)
  • The invoke call order is BaseLLM.invoke -> LLM._generate -> LLM._call, so once you implement _generate, _call is no longer needed

The following methods are optional:

Method              | Description
_identifying_params | Used to help with identifying the model and printing the LLM; should return a dictionary. This is a @property.
_acall              | Provides an async native implementation of _call, used by ainvoke.
_stream             | Method to stream the output token by token. (From the source, if _stream is not implemented, the default falls back to invoke and wraps the result in an iterator.)
_astream            | Provides an async native implementation of _stream; in newer LangChain versions, defaults to _stream.
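As a rough sketch of the _generate route mentioned above (a toy model that simply echoes the prompt; the model name and token counts in llm_output are placeholder values):

from typing import Any, List, Optional

from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import BaseLLM
from langchain_core.outputs import Generation, LLMResult


class EchoLLM(BaseLLM):
    """Toy LLM that overrides _generate instead of _call."""

    def _generate(
        self,
        prompts: List[str],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> LLMResult:
        # one list of Generations per input prompt
        generations = [[Generation(text=p)] for p in prompts]
        # llm_output can carry extra provider-level information such as token usage
        return LLMResult(
            generations=generations,
            llm_output={"model_name": "echo", "token_usage": {"total_tokens": 0}},
        )

    @property
    def _llm_type(self) -> str:
        return "echo"

Calling EchoLLM().invoke("hi") then returns "hi", and the llm_output dict becomes available to callback handlers (e.g. for usage tracking).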

Next, following the official tutorial and example, let's implement a very simple custom LLM whose only behaviour is to echo back the first n characters of the input.

from typing import Any, Dict, Iterator, List, Mapping, Optional

from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.outputs import GenerationChunk


class CustomLLM(LLM):
    """A custom LLM that echoes the first `n` characters of the input."""

    n: int

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Run the LLM on the given input.

        Override this method to implement the LLM logic.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the
                first occurrence of any of the stop substrings.
                If stop tokens are not supported consider raising NotImplementedError.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            The model output as a string. Actual completions SHOULD NOT include the prompt.
        """
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        return prompt[: self.n]

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        """Stream the LLM on the given prompt.

        This method should be overridden by subclasses that support streaming.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the
                first occurrence of any of these substrings.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            An iterator of GenerationChunks.
        """
        for char in prompt[: self.n]:
            chunk = GenerationChunk(text=char)
            if run_manager:
                run_manager.on_llm_new_token(chunk.text, chunk=chunk)

            yield chunk

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters."""
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": "CustomChatModel",
        }

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model. Used for logging purposes only."""
        return "custom"

It is used exactly like LangChain's built-in LLMs:

llm = CustomLLM(n=5)
print(llm)
"""
CustomLLM
Params: {'model_name': 'CustomChatModel'}
"""

llm.invoke("This is a foobar thing")
"""
'This '
"""

for token in llm.stream("hello"):
    print(token, end="|", flush=True)
"""
h|e|l|l|o|
"""

llm.batch(["woof woof woof", "meow meow meow"])
"""
['woof ', 'meow ']
"""

Caching

Caching is useful when the same generation request is issued multiple times: the reply comes straight from the cache, which both improves performance and cuts down requests to the LLM provider, saving cost.

The simplest cache is the in-memory one, implemented by InMemoryCache: every prompt and its result are stored in memory.

%%time

from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.predict("说一个笑话")
"""
CPU times: user 137 ms, sys: 77.9 ms, total: 215 ms
Wall time: 1.26 s

'一朵花为什么很好笑?因为它很有梗。'
"""

%%time

# The second time it is, so it goes faster
llm.predict("说一个笑话")
"""
CPU times: user 752 µs, sys: 317 µs, total: 1.07 ms
Wall time: 1.03 ms

'一朵花为什么很好笑?因为它很有梗。'
"""

Besides the in-memory cache, LangChain also provides caches backed by external tools, such as RedisCache, which survives service restarts and supports distributed caching.

LangChain ships with many built-in cache backends: see LLM Caching.
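For example, a minimal sketch of switching to Redis (assuming a local Redis instance is running; the exact import path may differ slightly between LangChain versions):

from redis import Redis
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisCache

set_llm_cache(RedisCache(redis_=Redis(host="localhost", port=6379)))

# identical requests are now served from Redis, even across process restarts
llm.invoke("说一个笑话")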

Token and cost tracking

Implementation code: callbacks, custom tongyi

LangChain has built-in token-usage tracking for the OpenAI API, reporting the tokens consumed and the cost of a specific request, but currently it only supports OpenAI and does not cover stream calls. Following the approach of the built-in implementation, I wrote a more generic usage tracker that also supports stream; its usage and returned results are identical to the built-in one.

(A gripe about a design flaw in LangChain here: the built-in stream call lives on the higher-level abstract (parent) class, so it is hard to put information into llm_output; it can only go into the generation_info returned by each implementation's _stream method. At the moment every LLM implementation is missing the model name and token usage there, so you have to override the model's _stream method yourself and put that information into generation_info.)

from callbacks.manager import get_generic_llms_callback

with get_generic_llms_callback() as cb:
    llm.invoke("有什么关于失业和通货膨胀的相关性的理论")
    print(cb)
"""
Tokens Used: 429
	Prompt Tokens: 18
	Completion Tokens: 411
Successful Requests: 1
Total Cost (CYN): ¥0.008579999999999999
"""

# stream calls require the re-implemented Tongyi class
from tongyi.llm import CustomTongyi

# pass dashscope_api_key as an argument
# or set the `DASHSCOPE_API_KEY` environment variable
llm = CustomTongyi()

with get_generic_llms_callback() as cb:
    for chunk in llm.stream(
    "有什么关于失业和通货膨胀的相关性的理论"):
        print(chunk, end="|", flush=True)
    print(cb)

"""
在|经济学|中|,失业和通货|膨胀之间的关系被广泛研究,主要|理论有菲利普斯曲线(| Phillips Curve)和纳克斯的“|自然失业率”理论.....
Tokens Used: 301
	Prompt Tokens: 18
	Completion Tokens: 283
Successful Requests: 1
Total Cost (CYN): ¥0.006019999999999999
"""

Chat Model

Example code: chatmodel.ipynb

As mentioned above, a Chat Model is a chat-oriented large language model that can hold multi-turn conversations.

LangChain has built-in integrations for most of the LLMs on the market; some models may lack an LLM implementation class, but all of them have a ChatModel implementation class.

Message types

Chat models take chat messages as input and output. Besides the three basic message types mentioned above, there are two more:

  1. SystemMessage: sets the persona for the conversation;
  2. HumanMessage: the user's input;
  3. AIMessage: the model's output;
  4. FunctionMessage: the result of a function call; besides the usual role and content parameters, it has an extra name parameter identifying the function that produced the result
  5. ToolMessage: the result of a tool call; it likewise has an extra tool_call_id parameter identifying the tool invocation that produced the result.

Function and tool calling are not covered in this chapter; they are left for a dedicated chapter later.
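For reference, a sketch of constructing each of the five message types (the function name and tool_call_id below are made-up values for illustration):

from langchain_core.messages import (
    AIMessage,
    FunctionMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)

SystemMessage(content="You are a math expert")
HumanMessage(content="What is the Pythagorean theorem?")
AIMessage(content="The Pythagorean theorem states that ...")
FunctionMessage(name="get_weather", content='{"temp": 26}')
ToolMessage(tool_call_id="call_123", content='{"temp": 26}')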

Basic usage

Using a Chat model is quite similar to using LLMs; only the input and output formats differ.

Initializing the chat model.

from langchain_community.chat_models import ChatTongyi

# pass the api key as an argument
chat = ChatTongyi(dashscope_api_key='.....')

# or set the `DASHSCOPE_API_KEY` environment variable
chat = ChatTongyi()

Generating text with the chat model.

1. Blocking mode

from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(content="你是一个数学专家"),
    HumanMessage(content="什么是勾股定理"),
]

chat.invoke(messages)

"""
AIMessage(content='勾股定理是古希腊数学家毕达哥拉斯发现的一个几何学基本定理,也被称为毕达哥拉斯定理......', response_metadata={'model_name': 'qwen-turbo', 'finish_reason': 'stop', 'request_id': '09f736aa-84b9-9a31-b249-1cf1a779b034', 'token_usage': {'input_tokens': 22, 'output_tokens': 142, 'total_tokens': 164}}, id='run-0978a38d-3611-4c89-97e2-05280f578591-0')
"""

As you can see, the chat model's invoke returns not only the reply content but also response_metadata, which carries some information about the model call.

2. Streaming mode

for chunk in chat.stream(messages):
    print(chunk.content, end="|", flush=True)
   
"""
勾|股|定|理是古希腊数学|家毕达哥拉斯发现的一个几何|学基本定理,也被称为毕|达哥拉斯定理......
"""

Of course, there are also batch calls, async calls, and so on, just like with LLMs, i.e. the basic methods covered in the LCEL section above.
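A brief sketch (the async calls must run inside an event loop):

chat.batch([messages, messages])  # one list of messages per request

# async counterparts, e.g.:
# await chat.ainvoke(messages)
# async for chunk in chat.astream(messages):
#     print(chunk.content, end="", flush=True)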

Custom Chat model

As noted above, a Chat model's inputs and outputs are messages, so a custom Chat model must conform to the same input and output types.

It should inherit from the BaseChatModel base class and implement the following methods:

Method/Property                | Description                                                      | Required/Optional
_generate                      | Use to generate a chat result from a prompt                      | Required
_llm_type (property)           | Used to uniquely identify the type of the model. Used for logging. | Required
_identifying_params (property) | Represent model parameterization for tracing purposes.           | Optional
_stream                        | Use to implement streaming.                                      | Optional
_agenerate                     | Use to implement a native async method.                          | Optional
_astream                       | Use to implement async version of _stream.                       | Optional

Next, another simple example: a model that echoes the first n characters of the last message in the prompt.

from typing import Any, AsyncIterator, Dict, Iterator, List, Optional

from langchain_core.callbacks import (
    AsyncCallbackManagerForLLMRun,
    CallbackManagerForLLMRun,
)
from langchain_core.language_models import BaseChatModel, SimpleChatModel
from langchain_core.messages import AIMessage, AIMessageChunk, BaseMessage, HumanMessage
from langchain_core.outputs import ChatGeneration, ChatGenerationChunk, ChatResult
from langchain_core.runnables import run_in_executor


class CustomChatModelAdvanced(BaseChatModel):
    """A custom chat model that echoes the first `n` characters of the input.

    When contributing an implementation to LangChain, carefully document
    the model including the initialization parameters, include
    an example of how to initialize the model and include any relevant
    links to the underlying models documentation or API.

    Example:

        .. code-block:: python

            model = CustomChatModel(n=2)
            result = model.invoke([HumanMessage(content="hello")])
            result = model.batch([[HumanMessage(content="hello")],
                                 [HumanMessage(content="world")]])
    """

    model_name: str
    """The name of the model"""
    n: int
    """The number of characters from the last message of the prompt to be echoed."""

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Override the _generate method to implement the chat model logic.

        This can be a call to an API, a call to a local model, or any other
        implementation that generates a response to the input prompt.

        Args:
            messages: the prompt composed of a list of messages.
            stop: a list of strings on which the model should stop generating.
                  If generation stops due to a stop token, the stop token itself
                  SHOULD BE INCLUDED as part of the output. This is not enforced
                  across models right now, but it's a good practice to follow since
                  it makes it much easier to parse the output of the model
                  downstream and understand why generation stopped.
            run_manager: A run manager with callbacks for the LLM.
        """
        # Replace this with actual logic to generate a response from a list
        # of messages.
        last_message = messages[-1]
        tokens = last_message.content[: self.n]
        message = AIMessage(
            content=tokens,
            additional_kwargs={},  # Used to add additional payload (e.g., function calling request)
            response_metadata={  # Use for response metadata
                "time_in_seconds": 3,
            },
        )
        ##

        generation = ChatGeneration(message=message)
        return ChatResult(generations=[generation])

    def _stream(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[ChatGenerationChunk]:
        """Stream the output of the model.

        This method should be implemented if the model can generate output
        in a streaming fashion. If the model does not support streaming,
        do not implement it. In that case streaming requests will be automatically
        handled by the _generate method.

        Args:
            messages: the prompt composed of a list of messages.
            stop: a list of strings on which the model should stop generating.
                  If generation stops due to a stop token, the stop token itself
                  SHOULD BE INCLUDED as part of the output. This is not enforced
                  across models right now, but it's a good practice to follow since
                  it makes it much easier to parse the output of the model
                  downstream and understand why generation stopped.
            run_manager: A run manager with callbacks for the LLM.
        """
        last_message = messages[-1]
        tokens = last_message.content[: self.n]

        for token in tokens:
            chunk = ChatGenerationChunk(message=AIMessageChunk(content=token))

            if run_manager:
                # This is optional in newer versions of LangChain
                # The on_llm_new_token will be called automatically
                run_manager.on_llm_new_token(token, chunk=chunk)

            yield chunk

        # Let's add some other information (e.g., response metadata)
        chunk = ChatGenerationChunk(
            message=AIMessageChunk(content="", response_metadata={"time_in_sec": 3})
        )
        if run_manager:
            # This is optional in newer versions of LangChain
            # The on_llm_new_token will be called automatically
            run_manager.on_llm_new_token(token, chunk=chunk)
        yield chunk

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model."""
        return "echoing-chat-model-advanced"

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters.

        This information is used by the LangChain callback system, which
        is used for tracing purposes, making it possible to monitor LLMs.
        """
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": self.model_name,
        }

It is called in the same way as the built-in Chat models:

model = CustomChatModelAdvanced(n=3, model_name="my_custom_model")

model.invoke(
    [
        HumanMessage(content="hello!"),
        AIMessage(content="Hi there human!"),
        HumanMessage(content="Meow!"),
    ]
)

"""
AIMessage(content='Meo', response_metadata={'time_in_seconds': 3}, id='run-ddb42bd6-4fdd-4bd2-8be5-e11b67d3ac29-0')
"""

# string input is also supported, equivalent to `[HumanMessage(content="cat vs dog")]`
for chunk in model.stream("cat vs dog"):
    print(chunk.content, end="|")
"""
c|a|t||
"""

Caching

The caching mechanism is essentially the same as for LLMs.

from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

%%time

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
chat.invoke("说一个笑话")

"""
CPU times: user 69.4 ms, sys: 9.48 ms, total: 78.9 ms
Wall time: 999 ms

AIMessage(content='小王剪了一个中分,然后他就变成了小全。', response_metadata={'model_name': 'qwen-turbo', 'finish_reason': 'stop', 'request_id': '7fb3f662-e3b1-9f2d-ae2c-0b8a1a909aec', 'token_usage': {'input_tokens': 11, 'output_tokens': 13, 'total_tokens': 24}}, id='run-8e663365-66d5-4876-9714-f88857aea39d-0')
"""


%%time

# The second time it is, so it goes faster
chat.invoke("说一个笑话")
"""
CPU times: user 1.27 ms, sys: 794 µs, total: 2.07 ms
Wall time: 1.33 ms

AIMessage(content='小王剪了一个中分,然后他就变成了小全。', response_metadata={'model_name': 'qwen-turbo', 'finish_reason': 'stop', 'request_id': '7fb3f662-e3b1-9f2d-ae2c-0b8a1a909aec', 'token_usage': {'input_tokens': 11, 'output_tokens': 13, 'total_tokens': 24}}, id='run-8e663365-66d5-4876-9714-f88857aea39d-0')
"""

Again, besides the in-memory cache, LangChain provides caches backed by external tools such as RedisCache, which survives restarts and supports distributed caching.

Other built-in cache backends: LLM Caching

Token and cost tracking

Implementation code: callbacks, custom tongyi

As mentioned above, following LangChain's built-in approach I implemented a more generic usage tracker that also supports stream, with usage and results identical to the built-in one. It works for Chat models too, but to record usage during stream calls you still need to use the custom Chat model class.

from callbacks.manager import get_generic_llms_callback

messages = [
    SystemMessage(content="你是一个数学专家"),
    HumanMessage(content="什么是勾股定理"),
]

with get_generic_llms_callback() as cb:
    chat.invoke(messages)
    print(cb)
"""
Tokens Used: 153
	Prompt Tokens: 22
	Completion Tokens: 131
Successful Requests: 1
Total Cost (CYN): ¥0.0027960000000000003
"""

from tongyi.chat_model import CustomChatTongyi

# pass dashscope_api_key as an argument
# or set the `DASHSCOPE_API_KEY` environment variable
chat = CustomChatTongyi(dashscope_api_key='....')

with get_generic_llms_callback() as cb:
    for chunk in chat.stream(messages):
        print(chunk.content, end="|", flush=True)
    
    print()
    print(cb)
"""
勾|股|定|理是古希腊数学|家毕达哥拉斯发现的一个几何|学基本定理,也被称为毕|达哥拉斯定理......
Tokens Used: 190
	Prompt Tokens: 22
	Completion Tokens: 168
Successful Requests: 1
Total Cost (CYN): ¥0.003536
"""

Key model parameters

Taking OpenAI GPT as an example, here are the key parameters that influence a large language model's behaviour; understanding them helps you use the models better (most mainstream models expose these parameters, and a sketch of setting them follows this list):

  1. max_tokens: the maximum number of tokens the model may return; max_tokens + input_tokens must fit within the model's context length (e.g. gpt-3.5-turbo-16k supports a context of up to 16k tokens)
  2. temperature: the sampling temperature; larger values make the output more random
  3. top_p: another sampling parameter; the model only considers the tokens within the top_p probability mass
  4. frequency_penalty: penalizes new tokens based on how frequently they have appeared so far; larger values reduce the model's tendency to repeat the same tokens
  5. presence_penalty: penalizes new tokens based on whether they have already appeared; larger values increase the model's tendency to move on to new topics
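For example, with ChatOpenAI these parameters can be set when constructing the model (a sketch, assuming langchain_openai is installed; depending on the version, some sampling parameters may need to be passed via model_kwargs instead):

from langchain_openai import ChatOpenAI

chat = ChatOpenAI(
    model="gpt-3.5-turbo",
    max_tokens=512,          # cap on generated tokens
    temperature=0.7,         # higher -> more random sampling
    top_p=0.9,               # only consider the top 90% probability mass
    frequency_penalty=0.5,   # discourage repeating the same tokens
    presence_penalty=0.5,    # encourage moving on to new topics
)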

Output Parsers

Example code: output_parser.ipynb

As we have seen, the output of an LLM or ChatModel is natural-language text, i.e. a string. When integrating, however, development becomes far easier if the model can return a specific format, such as the commonly used JSON.

Structured parsing

We can define a data-structure class whose member fields hold the parsed results, i.e. the parser extracts from the model's output text the information corresponding to the meaning of each field.

First, we define a Joke class:

  • It has two fields; each description tells the model what information that field should hold
  • @validator("setup") lets us check whether the model's parsed result matches our expectations
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator


# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="开启一个笑话的问题")
    punchline: str = Field(description="解答笑话的答案")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field

Next, using LangChain's built-in PydanticOutputParser: the Prompt component builds the prompt, which goes into the ChatModel component, and the model's output is then run through the parser to produce the parsed data.

This flow is exactly a Chain. As mentioned at the start, a chain is a sequence of connected components, executed step by step, each passing its result to the next. In LangChain, components are connected with the | operator.

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="根据用户的输入进行解答.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# And a query intended to prompt a language model to populate the data structure.
chain = prompt | chat | parser
chain.invoke({"query": "讲一个笑话"})
"""
Joke(setup='为什么电脑永远不会感冒?', punchline='因为它有Windows(Windows,意为窗户,这里指电脑不会打开,所以不会受冷)')
"""

Let's lift the hood on this parser a bit and see how it steers the prompt:

parser.get_format_instructions()
"""
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "开启一个笑话的问题", "type": "string"}, "punchline": {"title": "Punchline", "description": "解答笑话的答案", "type": "string"}}, "required": ["setup", "punchline"]}
```
"""

As you can see, the parser injects instructions into the prompt, telling the model to output a JSON string whose fields carry the meanings given in the description of each member of our Joke class; it is then straightforward to extract those fields and write them into the corresponding members.

JSON parsing

The JSON parser uses the same prompt instructions as the structured parser above; it simply returns the data as a dict.

from langchain_core.output_parsers import JsonOutputParser

# Set up a parser + inject instructions into the prompt template.
parser = JsonOutputParser(pydantic_object=Joke)

chain = prompt | chat | parser

chain.invoke({"query": "讲一个笑话"})
"""
{'setup': '为什么电脑永远不会感冒?', 'punchline': '因为它有Windows(窗户)但是不开!'}
"""

Other built-in parsers

LangChain ships with many parsers; see the official documentation for all supported parser types.
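For instance, the built-in CommaSeparatedListOutputParser turns comma-separated model output into a Python list (a brief sketch):

from langchain_core.output_parsers import CommaSeparatedListOutputParser

parser = CommaSeparatedListOutputParser()

parser.get_format_instructions()
# roughly: 'Your response should be a list of comma separated values, eg: `foo, bar, baz`'

parser.parse("red, green, blue")
# ['red', 'green', 'blue']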

Custom parsers

LangChain offers two ways to build a custom parser:

  1. Use RunnableLambda or RunnableGenerator, which fits naturally into LCEL.
  2. Inherit from one of the output parser base classes.

Runnable Lambdas.

from typing import Iterable

from langchain_core.messages import AIMessage, AIMessageChunk

def parse(ai_message: AIMessage) -> str:
    """Parse the AI message."""
    return ai_message.content.swapcase()


chain = chat | parse
chain.invoke("hello")
"""
'hELLO! hOW CAN i ASSIST YOU TODAY?'
"""

As you can see, this approach is very simple: just define a plain function that takes the Chat model's AIMessage output. In the example above, the parser swaps the case of the model's reply text.

Runnable Generators.

from langchain_core.runnables import RunnableGenerator


def streaming_parse(chunks: Iterable[AIMessageChunk]) -> Iterable[str]:
    for chunk in chunks:
        yield chunk.content.swapcase()


streaming_parse = RunnableGenerator(streaming_parse)

chain = chat | streaming_parse

for chunk in chain.stream("tell me about yourself in one sentence"):
    print(chunk, end="|", flush=True)
"""
i| AM| A| LARGE LANGUAGE MODEL CREATED BY| aLIBABA cLOUD, DESIGNED TO ANSWER QUESTIONS AND| PROVIDE INFORMATION ON VARIOUS TOPICS.|
"""

Streaming support is just as simple: define a function that takes an iterator of the Chat model's AIMessageChunk output and processes each streamed chunk as it arrives. The streaming_parse function handles each chunk, whereas the parse function above processes the model's complete output.

Inheriting from a parser base class.

from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import BaseOutputParser


# The [bool] desribes a parameterization of a generic.
# It's basically indicating what the return type of parse is
# in this case the return type is either True or False
class BooleanOutputParser(BaseOutputParser[bool]):
    """Custom boolean parser."""

    true_val: str = "YES"
    false_val: str = "NO"

    def parse(self, text: str) -> bool:
        cleaned_text = text.strip().upper()
        if cleaned_text not in (self.true_val.upper(), self.false_val.upper()):
            raise OutputParserException(
                f"BooleanOutputParser expected output value to either be "
                f"{self.true_val} or {self.false_val} (case-insensitive). "
                f"Received {cleaned_text}."
            )
        return cleaned_text == self.true_val.upper()

    @property
    def _type(self) -> str:
        return "boolean_output_parser"

As you can see, this is much the same as defining a parsing function directly, except that here we inherit from a base class and override the parse method. The example checks whether the LLM's reply is YES or NO and maps it to the boolean True or False.

Invocation follows LCEL just as above, so it is not shown again.
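As a quick standalone check, the parser can also be exercised directly:

parser = BooleanOutputParser()
parser.parse("YES")    # True
parser.parse("maybe")  # raises OutputParserException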

Parsing the LLM's raw output.

A model's output often includes extra metadata, so if your parser needs that information, use the approach below.

from typing import List

from langchain_core.exceptions import OutputParserException
from langchain_core.messages import AIMessage
from langchain_core.output_parsers import BaseGenerationOutputParser
from langchain_core.outputs import ChatGeneration, Generation


class StrInvertCase(BaseGenerationOutputParser[str]):
    """An example parser that inverts the case of the characters in the message.

    This is an example parser shown just for demonstration purposes and to keep
    the example as simple as possible.
    """

    def parse_result(self, result: List[Generation], *, partial: bool = False) -> str:
        """Parse a list of model Generations into a specific format.

        Args:
            result: A list of Generations to be parsed. The Generations are assumed
                to be different candidate outputs for a single model input.
                Many parsers assume that only a single generation is passed it in.
                We will assert for that
            partial: Whether to allow partial results. This is used for parsers
                     that support streaming
        """
        if len(result) != 1:
            raise NotImplementedError(
                "This output parser can only be used with a single generation."
            )
        generation = result[0]
        if not isinstance(generation, ChatGeneration):
            # Say that this one only works with chat generations
            raise OutputParserException(
                "This output parser can only be used with a chat generation."
            )
        return generation.message.content.swapcase()


chain = chat | StrInvertCase()

This is because the Generation handled here contains both the LLM's reply text and the extra generation_info.

And, as in this example, a parser can be restricted to work only with Chat model generations.

Full code repository

github
