Learning Large Models Together - LangChain's OutputParser

做个天秤座的程序猿

Last modified: 2024-07-17 13:20:39

Tags: langchain, outPutParser, large models

First published: 2024-06-04 11:16:59

Article link: https://blog.csdn.net/kljyrx/article/details/139436305

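Preface

LangChain is a framework for building language model applications. It provides many tools and classes that simplify interacting with language models. OutputParser is one of its key components: it parses the output generated by a language model and converts it into structured data that is easier to work with.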

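I. Overview of OutputParser

OutputParser is an abstract base class that defines the interface for parsing data out of a language model's output. When a language model generates text, the output is usually unstructured plain text. OutputParser provides a mechanism for converting that unstructured text into a structured form (a dictionary, a list, an object, and so on) that later code can process. In LangChain itself the corresponding base class is BaseOutputParser in langchain_core.output_parsers; the examples in this article use a simplified stand-in named OutputParser to keep the code self-contained.

A typical OutputParser class implements the following key methods:

parse method:
Takes the text generated by the model and returns the parsed, structured data.
get_format_instructions method:
(Optional) Returns a string of instructions describing how the model should format its output. This is very useful for prompt design. The simplified base class used in the examples below declares it abstract, so every example parser implements it.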
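
To make the role of the two methods concrete, here is a minimal sketch of how they are typically combined: the format instructions are appended to the prompt, and the model's reply is passed to parse. CommaListParser, ask, and fake_model are hypothetical names invented for this illustration, and fake_model stands in for a real model call.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List


class OutputParser(ABC):
    """Simplified stand-in for the interface described above."""

    @abstractmethod
    def parse(self, text: str) -> Any:
        ...

    @abstractmethod
    def get_format_instructions(self) -> str:
        ...


class CommaListParser(OutputParser):
    """Hypothetical parser that turns 'a, b, c' into ['a', 'b', 'c']."""

    def parse(self, text: str) -> List[str]:
        return [item.strip() for item in text.split(",") if item.strip()]

    def get_format_instructions(self) -> str:
        return "Reply with a comma-separated list and nothing else."


def ask(question: str, parser: OutputParser, call_model: Callable[[str], str]) -> Any:
    # Append the parser's instructions to the prompt, then parse the reply.
    prompt = f"{question}\n\n{parser.get_format_instructions()}"
    return parser.parse(call_model(prompt))


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption); always returns the same reply.
    return "red, green, blue"


if __name__ == "__main__":
    print(ask("Name three colors.", CommaListParser(), fake_model))
```

Keeping the instructions and the parsing logic in the same class means the prompt and the parser cannot drift apart when the format changes.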

II. JSON OutputParser

Below is a simple example showing how to implement an OutputParser that parses the language model's output as JSON.

from typing import Any, Dict
import json
from abc import ABC, abstractmethod

class OutputParser(ABC):
    @abstractmethod
    def parse(self, text: str) -> Any:
        pass
    
    @abstractmethod
    def get_format_instructions(self) -> str:
        pass

class JsonOutputParser(OutputParser):
    def parse(self, text: str) -> Dict:
        try:
            # Try to parse the text as JSON
            return json.loads(text)
        except json.JSONDecodeError as exc:
            # Surface parse failures as a ValueError, keeping the original cause
            raise ValueError("Failed to parse JSON") from exc

    def get_format_instructions(self) -> str:
        return "Please provide the output in JSON format."

# Example usage
if __name__ == "__main__":
    parser = JsonOutputParser()
    
    # Simulated language model output
    model_output = '{"name": "Alice", "age": 30}'
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

Output

Parsed output: {'name': 'Alice', 'age': 30}
Format instructions: Please provide the output in JSON format.

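Function description

parse method:
- Takes a string (the model output) and tries to parse it as JSON.
- If parsing succeeds, returns the parsed value (a dictionary for the JSON object in this example).
- If parsing fails, raises a ValueError.
get_format_instructions method:
- Returns a string telling the model how to format its output. In this example it asks the model to output JSON.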
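
In practice a model asked for JSON often wraps it in a Markdown code fence or adds a sentence before it, and json.loads rejects the reply as a whole. The sketch below shows one possible way to handle that case; parse_json_reply and FENCE are names made up for this example, not part of LangChain.

```python
import json
import re
from typing import Any

# A Markdown code fence (three backticks), built indirectly for readability.
FENCE = "`" * 3


def parse_json_reply(text: str) -> Any:
    """Parse JSON from a reply that may wrap it in a Markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    # Fall back to the whole reply when no fenced block is present.
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse JSON: {exc}") from exc


if __name__ == "__main__":
    reply = "Here is the data:\n" + FENCE + 'json\n{"name": "Alice", "age": 30}\n' + FENCE
    print(parse_json_reply(reply))
    print(parse_json_reply('{"ok": true}'))
```

This only looks at the first fenced block; a reply containing several blocks would need a more deliberate choice.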

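III. Custom Format Parser

Below is an example of a custom format parser, which parses language model output written in a custom format. Suppose we have a custom output format made up of predefined labels and their values. We will implement a CustomFormatOutputParser to parse output in this format.

1. The assumed custom format

Our custom format looks like this: each line contains a label and a value, separated by a colon: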

name: Alice
age: 30
location: Wonderland

2. Implementing CustomFormatOutputParser

We will create a CustomFormatOutputParser that parses output in the format above and converts it into a dictionary.

from typing import Any, Dict
from abc import ABC, abstractmethod

class OutputParser(ABC):
    @abstractmethod
    def parse(self, text: str) -> Any:
        pass
    
    @abstractmethod
    def get_format_instructions(self) -> str:
        pass

class CustomFormatOutputParser(OutputParser):
    def parse(self, text: str) -> Dict[str, Any]:
        result = {}
        lines = text.strip().split('\n')
        
        for line in lines:
            key, value = line.split(':', 1)
            result[key.strip()] = value.strip()
        
        return result

    def get_format_instructions(self) -> str:
        return "Please provide the output in the following format:\nname: <name>\nage: <age>\nlocation: <location>"

# Example usage
if __name__ == "__main__":
    parser = CustomFormatOutputParser()
    
    # Simulated language model output
    model_output = """
    name: Alice
    age: 30
    location: Wonderland
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

Output

Parsed output: {'name': 'Alice', 'age': '30', 'location': 'Wonderland'}
Format instructions: Please provide the output in the following format:
name: <name>
age: <age>
location: <location>
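
Every value in the parsed dictionary above is a string, including age. If later code needs typed values, one option is a small conversion step after parsing. The sketch below assumes a hypothetical CONVERTERS mapping from field name to converter; that mapping and the helper names are illustrative only.

```python
from typing import Any, Callable, Dict

# Hypothetical schema: field name -> converter applied to the raw string.
CONVERTERS: Dict[str, Callable[[str], Any]] = {"age": int}


def parse_key_values(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines into a dict of strings, skipping blank lines."""
    result: Dict[str, str] = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        key, value = line.split(":", 1)
        result[key.strip()] = value.strip()
    return result


def coerce(record: Dict[str, str]) -> Dict[str, Any]:
    """Apply a converter where one is registered; keep other values as strings."""
    typed: Dict[str, Any] = {}
    for key, value in record.items():
        convert = CONVERTERS.get(key, str)
        try:
            typed[key] = convert(value)
        except ValueError as exc:
            raise ValueError(f"Bad value for {key!r}: {value!r}") from exc
    return typed


if __name__ == "__main__":
    raw = parse_key_values("name: Alice\nage: 30\nlocation: Wonderland")
    print(coerce(raw))
```

Separating parsing from conversion keeps the parser generic while the schema stays in one place.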

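3. A more complex custom format

Suppose we have a more complex custom format in which labels and values may contain multiple words, and entries are separated by a blank line: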

Name: Alice Smith

Age: 30

Location: Wonderland

We can modify CustomFormatOutputParser to handle this more complex format:

class ComplexCustomFormatOutputParser(OutputParser):
    def parse(self, text: str) -> Dict[str, Any]:
        result = {}
        lines = text.strip().split('\n\n')  # Split entries on double newlines
        
        for line in lines:
            key, value = line.split(':', 1)
            result[key.strip()] = value.strip()
        
        return result

    def get_format_instructions(self) -> str:
        return "Please provide the output in the following format with each entry separated by a blank line:\nName: <full name>\n\nAge: <age>\n\nLocation: <location>"

# Example usage
if __name__ == "__main__":
    parser = ComplexCustomFormatOutputParser()
    
    # Simulated language model output
    model_output = """
    Name: Alice Smith

    Age: 30

    Location: Wonderland
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

Output

Parsed output: {'Name': 'Alice Smith', 'Age': '30', 'Location': 'Wonderland'}
Format instructions: Please provide the output in the following format with each entry separated by a blank line:
Name: <full name>

Age: <age>

Location: <location>

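In this version we modified the parse method to handle the more complex format, in which entries are separated by a blank line. We split the entries on double newlines and strip the surrounding whitespace from each key and value. Note that split('\n\n') only matches blank lines that are completely empty; a blank line containing spaces or tabs is not treated as a separator.

This example shows how to tailor an OutputParser to specific business requirements in order to parse language model output in a custom format.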
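
Model output is not always tidy: the separating line may contain stray spaces, or there may be more than one blank line between entries. A slightly more tolerant variant, sketched here under the assumption that any whitespace-only gap separates entries, splits with a regular expression and reports malformed entries explicitly. parse_entries is a hypothetical helper name.

```python
import re
from typing import Dict


def parse_entries(text: str) -> Dict[str, str]:
    """Split on blank lines (even ones containing spaces) and parse 'Key: value'."""
    result: Dict[str, str] = {}
    for entry in re.split(r"\n\s*\n", text.strip()):
        if ":" not in entry:
            raise ValueError(f"Entry has no 'key: value' form: {entry!r}")
        key, value = entry.split(":", 1)
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    sample = "Name: Alice Smith\n   \nAge: 30\n\n\nLocation: Wonderland\n"
    print(parse_entries(sample))
```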

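IV. Regular Expression Parser

A regular expression parser (RegexOutputParser) is a tool for extracting data that matches specific patterns from unstructured text. Regular expressions are powerful and flexible, and suit a wide range of complex parsing tasks.

1. Example: a regular expression parser

Suppose the language model's output contains some structured information, but that information is mixed into natural-language text. Our goal is to extract it. The following simple example shows how to implement a RegexOutputParser to parse a specific output format.

2. The assumed language model output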

Name: Alice
Age: 30
Location: Wonderland

3. Implementing RegexOutputParser

We will create a RegexOutputParser that uses regular expressions to parse output in the format above and convert it into a dictionary.

import re
from typing import Any, Dict
from abc import ABC, abstractmethod

class OutputParser(ABC):
    @abstractmethod
    def parse(self, text: str) -> Any:
        pass
    
    @abstractmethod
    def get_format_instructions(self) -> str:
        pass

class RegexOutputParser(OutputParser):
    def parse(self, text: str) -> Dict[str, Any]:
        result = {}
        # Define the regular expression patterns
        patterns = {
            'name': re.compile(r'Name:\s*(.*)'),
            'age': re.compile(r'Age:\s*(\d+)'),
            'location': re.compile(r'Location:\s*(.*)')
        }
        
        for key, pattern in patterns.items():
            match = pattern.search(text)
            if match:
                result[key] = match.group(1).strip()
        
        return result

    def get_format_instructions(self) -> str:
        return "Please provide the output in the following format:\nName: <name>\nAge: <age>\nLocation: <location>"

# Example usage
if __name__ == "__main__":
    parser = RegexOutputParser()
    
    # Simulated language model output
    model_output = """
    Name: Alice
    Age: 30
    Location: Wonderland
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

Output

Parsed output: {'name': 'Alice', 'age': '30', 'location': 'Wonderland'}
Format instructions: Please provide the output in the following format:
Name: <name>
Age: <age>
Location: <location>
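
When the field names are not known in advance, one pattern per field becomes awkward. As an alternative, the sketch below uses a single generic pattern with re.MULTILINE and finditer to collect every "Key: value" line while ignoring the surrounding prose. LINE_PATTERN and extract_pairs are illustrative names, and the approach assumes each field sits on its own line.

```python
import re
from typing import Dict

# A word at the start of a line, a colon, then the rest of the line.
# [ \t] is used instead of \s so that a match never spans a line break.
LINE_PATTERN = re.compile(r"^[ \t]*(\w+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_pairs(text: str) -> Dict[str, str]:
    """Collect every 'Key: value' line, lower-casing the keys."""
    return {m.group(1).lower(): m.group(2) for m in LINE_PATTERN.finditer(text)}


if __name__ == "__main__":
    reply = (
        "Sure, here is the profile you asked for.\n"
        "Name: Alice\n"
        "  Age: 30\n"
        "Location: Wonderland\n"
        "Let me know if you need more."
    )
    print(extract_pairs(reply))
```

The trade-off is precision: any prose line that happens to start with a single word and a colon, such as "Note: ...", is also picked up.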

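4. A more complex example

Suppose we have more complex output that includes additional information such as an email address and a phone number: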

Name: Alice Smith
Age: 30
Location: Wonderland
Email: alice@example.com
Phone: +1234567890

We can extend RegexOutputParser to handle this more complex format:

class ComplexRegexOutputParser(OutputParser):
    def parse(self, text: str) -> Dict[str, Any]:
        result = {}
        # Define the regular expression patterns
        patterns = {
            'name': re.compile(r'Name:\s*(.*)'),
            'age': re.compile(r'Age:\s*(\d+)'),
            'location': re.compile(r'Location:\s*(.*)'),
            'email': re.compile(r'Email:\s*([\w\.-]+@[\w\.-]+)'),
            'phone': re.compile(r'Phone:\s*(\+\d+.*)')
        }
        
        for key, pattern in patterns.items():
            match = pattern.search(text)
            if match:
                result[key] = match.group(1).strip()
        
        return result

    def get_format_instructions(self) -> str:
        return ("Please provide the output in the following format:\n"
                "Name: <name>\n"
                "Age: <age>\n"
                "Location: <location>\n"
                "Email: <email>\n"
                "Phone: <phone>")

# Example usage
if __name__ == "__main__":
    parser = ComplexRegexOutputParser()
    
    # Simulated language model output
    model_output = """
    Name: Alice Smith
    Age: 30
    Location: Wonderland
    Email: alice@example.com
    Phone: +1234567890
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

Output

Parsed output: {'name': 'Alice Smith', 'age': '30', 'location': 'Wonderland', 'email': 'alice@example.com', 'phone': '+1234567890'}
Format instructions: Please provide the output in the following format:
Name: <name>
Age: <age>
Location: <location>
Email: <email>
Phone: <phone>

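5. Notes

In this extended example, the regular expression parser is updated to handle more kinds of information, including email addresses and phone numbers. Each field has its own regular expression pattern, used to match and extract the value. A field whose pattern does not match is simply left out of the result, so the except ValueError branch in the usage example never fires for a missing field. This approach adapts flexibly to many complex output formats.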
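
If a missing field should count as a parse failure rather than be skipped silently, the parser can collect unmatched keys and raise. The following is a minimal sketch of that idea with two of the patterns from above; PATTERNS and parse_required are hypothetical names.

```python
import re
from typing import Dict

# Two of the patterns used above; every key listed here is treated as required.
PATTERNS = {
    "name": re.compile(r"Name:\s*(.*)"),
    "email": re.compile(r"Email:\s*([\w.-]+@[\w.-]+)"),
}


def parse_required(text: str) -> Dict[str, str]:
    """Extract all fields, raising ValueError if any pattern fails to match."""
    result: Dict[str, str] = {}
    missing = []
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[key] = match.group(1).strip()
        else:
            missing.append(key)
    if missing:
        raise ValueError("Missing fields: " + ", ".join(missing))
    return result


if __name__ == "__main__":
    print(parse_required("Name: Alice Smith\nEmail: alice@example.com"))
    try:
        parse_required("Name: Alice Smith")
    except ValueError as exc:
        print("Error:", exc)
```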

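V. Table Parser

A table parser parses structured tabular data such as CSV or Markdown tables. Below we create a TableOutputParser class that parses simple table data and converts it into a list of dictionaries, one per row.

1. The assumed table data

Suppose we have table data in Markdown format: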

| Name  | Age | Location   |
|-------|-----|------------|
| Alice | 30  | Wonderland |
| Bob   | 25  | Neverland  |

2. Implementing TableOutputParser

We will implement a TableOutputParser class to parse this kind of table data.

from typing import Any, Dict, List
from abc import ABC, abstractmethod
import re

class OutputParser(ABC):
    @abstractmethod
    def parse(self, text: str) -> Any:
        pass
    
    @abstractmethod
    def get_format_instructions(self) -> str:
        pass

class TableOutputParser(OutputParser):
    def parse(self, text: str) -> List[Dict[str, Any]]:
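        # Strip each line so indentation does not produce an empty leading cell
        lines = [line.strip() for line in text.strip().split('\n')]
        
        # Extract the header row
        headers = [header.strip() for header in lines[0].strip('|').split('|')]
        
        # Extract the data rows
        rows = []
        for line in lines[2:]:  # Skip the header and separator rows
            values = [value.strip() for value in line.strip('|').split('|')]
            row = dict(zip(headers, values))
            rows.append(row)
        
        return rows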

    def get_format_instructions(self) -> str:
        return (
            "Please provide the output in the following Markdown table format:\n"
            "| Name  | Age | Location   |\n"
            "|-------|-----|------------|\n"
            "| Alice | 30  | Wonderland |\n"
            "| Bob   | 25  | Neverland  |"
        )

# Example usage
if __name__ == "__main__":
    parser = TableOutputParser()
    
    # Simulated language model output
    model_output = """
    | Name  | Age | Location   |
    |-------|-----|------------|
    | Alice | 30  | Wonderland |
    | Bob   | 25  | Neverland  |
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

3. Output

Parsed output: [{'Name': 'Alice', 'Age': '30', 'Location': 'Wonderland'}, {'Name': 'Bob', 'Age': '25', 'Location': 'Neverland'}]
Format instructions: Please provide the output in the following Markdown table format:
| Name  | Age | Location   |
|-------|-----|------------|
| Alice | 30  | Wonderland |
| Bob   | 25  | Neverland  |
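
The parser above assumes the second line is always a separator row and that every data row has as many cells as the header; zip silently drops anything extra. A stricter variant, sketched below with made-up names (SEPARATOR_ROW, split_row, parse_markdown_table), checks both assumptions and raises ValueError when they do not hold.

```python
import re
from typing import Dict, List

# Matches a Markdown separator row such as |---|:---:|---| (outer pipes optional).
SEPARATOR_ROW = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def split_row(line: str) -> List[str]:
    """Split one table row into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) < 2 or not SEPARATOR_ROW.match(lines[1]):
        raise ValueError("Expected a header row followed by a separator row")
    headers = split_row(lines[0])
    rows = []
    for line in lines[2:]:
        cells = split_row(line)
        if len(cells) != len(headers):
            raise ValueError(
                f"Row has {len(cells)} cells, expected {len(headers)}: {line!r}"
            )
        rows.append(dict(zip(headers, cells)))
    return rows


if __name__ == "__main__":
    table = """
    | Name  | Age | Location   |
    |-------|-----|------------|
    | Alice | 30  | Wonderland |
    """
    print(parse_markdown_table(table))
```

Cells that themselves contain an escaped pipe are not handled by this sketch.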

4. Handling table data in CSV format

We can also implement a CSVOutputParser that parses table data in CSV format. The CSV data looks like this:

Name,Age,Location
Alice,30,Wonderland
Bob,25,Neverland

5. Implementing CSVOutputParser

import csv
from io import StringIO

class CSVOutputParser(OutputParser):
    def parse(self, text: str) -> List[Dict[str, Any]]:
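        # Drop blank lines and indentation so the first row read is the header
        lines = [line.strip() for line in text.strip().split('\n') if line.strip()]
        reader = csv.DictReader(StringIO('\n'.join(lines)))
        return [dict(row) for row in reader]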

    def get_format_instructions(self) -> str:
        return (
            "Please provide the output in the following CSV format:\n"
            "Name,Age,Location\n"
            "Alice,30,Wonderland\n"
            "Bob,25,Neverland"
        )

# Example usage
if __name__ == "__main__":
    parser = CSVOutputParser()
    
    # Simulated language model output
    model_output = """
    Name,Age,Location
    Alice,30,Wonderland
    Bob,25,Neverland
    """
    
    try:
        result = parser.parse(model_output)
        print("Parsed output:", result)
    except ValueError as e:
        print("Error:", e)
    
    format_instructions = parser.get_format_instructions()
    print("Format instructions:", format_instructions)

6. Output

Parsed output: [{'Name': 'Alice', 'Age': '30', 'Location': 'Wonderland'}, {'Name': 'Bob', 'Age': '25', 'Location': 'Neverland'}]
Format instructions: Please provide the output in the following CSV format:
Name,Age,Location
Alice,30,Wonderland
Bob,25,Neverland
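
One reason to use the csv module rather than splitting each line on commas is quoting: a value such as "Smith, Alice" contains a comma that must not be treated as a delimiter. The sketch below illustrates the difference; parse_csv is a hypothetical helper, and it assumes no quoted field spans several lines, because it trims each line before parsing.

```python
import csv
from io import StringIO
from typing import Dict, List


def parse_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row, tolerating indentation and blank lines."""
    cleaned = "\n".join(line.strip() for line in text.splitlines() if line.strip())
    # skipinitialspace ignores spaces that directly follow a delimiter.
    reader = csv.DictReader(StringIO(cleaned), skipinitialspace=True)
    return [dict(row) for row in reader]


if __name__ == "__main__":
    sample = (
        "Name,Age,Location\n"
        '"Smith, Alice",30,"Wonderland, East"\n'
        "Bob,25,Neverland\n"
    )
    for row in parse_csv(sample):
        print(row)
    # A naive split breaks the quoted values into extra pieces.
    print('"Smith, Alice",30,"Wonderland, East"'.split(","))
```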