LlamaIndex Structured Output

IT民工老包

Last modified 2024-07-14 22:37:20

First published 2024-07-14 22:33:34

Article link: https://blog.csdn.net/baoj2010/article/details/140423555

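We interact with a large language model through prompts: whatever we ask for in the prompt shapes what the model outputs.

So can we also ask the model to return structured data, such as JSON or YAML?

A first example

First, build an index: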

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every document found under ./data
documents = SimpleDirectoryReader("./data").load_data()
# Build a vector index over the loaded documents
index = VectorStoreIndex.from_documents(documents)

Define the output format

from typing import List
from pydantic import BaseModel


class Biography(BaseModel):
    """Data model for a biography."""

    name: str
    best_known_for: List[str]
    extra_info: str

Create a query_engine

# output_cls tells the query engine to return a Biography instead of free text
query_engine = index.as_query_engine(
    response_mode="tree_summarize", output_cls=Biography
)

response = query_engine.query("Who is Paul Graham?")

print(response.name)
# > 'Paul Graham'
print(response.best_known_for)
# > ['working on Bel', 'co-founding Viaweb', 'creating the programming language Arc']
print(response.extra_info)
# > "Paul Graham is a computer scientist, entrepreneur, and writer. He is best known for ..."

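As the example shows, the query result is no longer free text: its fields are those of our custom Biography class and can be read as ordinary attributes.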
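To make concrete what a typed result gives us, here is a minimal standard-library sketch of the underlying idea: take a JSON reply, check it against the expected fields, and build an object. This is not how LlamaIndex implements it. The dataclass stands in for the Pydantic model so the snippet has no third-party dependency, `parse_biography` is a hypothetical helper, and `sample_reply` is written by hand rather than taken from a real model.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Biography:
    """Stand-in for the Pydantic Biography model (assumption: same fields)."""

    name: str
    best_known_for: List[str]
    extra_info: str


def parse_biography(reply: str) -> Biography:
    """Parse a JSON reply and check it against the expected fields."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    known_for = data.get("best_known_for")
    if not isinstance(known_for, list):
        raise ValueError("'best_known_for' must be a list")
    if not all(isinstance(item, str) for item in known_for):
        raise ValueError("'best_known_for' must contain only strings")
    if not isinstance(data.get("extra_info"), str):
        raise ValueError("'extra_info' must be a string")
    return Biography(data["name"], known_for, data["extra_info"])


# Hypothetical reply text, written by hand for illustration.
sample_reply = json.dumps(
    {
        "name": "Paul Graham",
        "best_known_for": ["co-founding Viaweb"],
        "extra_info": "Computer scientist, entrepreneur, and writer.",
    }
)

bio = parse_biography(sample_reply)
print(bio.name)
```

A reply that is missing a field or has the wrong type raises ValueError instead of silently producing a half-filled object, which is the main practical benefit of validating against a schema.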

More complex output structures

The model can also handle more complex, nested data models.

from typing import List

from dotenv import find_dotenv, load_dotenv
from llama_index.core.program import LLMTextCompletionProgram
from pydantic import BaseModel

# Load environment variables (for example the API key) from a .env file
load_dotenv(find_dotenv())

class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]

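# English gloss of the prompt below: "Get the information for one album from the
# movie; it must include the artist name and the song list. Movie {movie_name}:"
prompt_template_str = """\
从电影中获取一个唱片信息，需要包含作家名和歌曲列表. \
电影 {movie_name}：\
"""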
program = LLMTextCompletionProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=True,
)

output = program(movie_name="功夫")

print(output)

Output:

Album(name='Kung Fu Hustle Soundtrack', artist='Various Artists', songs=[Song(title='Kung Fu Fighting', length_seconds=180), Song(title='Zhi Yao Wei Ni Huo Yi Tian', length_seconds=210), Song(title='The Axe Gang', length_seconds=195), Song(title='Sing Sing Sing', length_seconds=240)])
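The nesting in that result (an Album holding a list of Song objects) can be pictured with a small standard-library sketch. Again this only illustrates the idea and is not LlamaIndex code: the dataclasses stand in for the Pydantic models, `album_from_json` is a hypothetical helper, and `sample_json` is hand-written JSON that mirrors part of the recorded output above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Song:
    title: str
    length_seconds: int


@dataclass
class Album:
    name: str
    artist: str
    songs: List[Song]


def album_from_json(text: str) -> Album:
    """Build the nested Album object from a JSON string."""
    data = json.loads(text)
    # Each entry of "songs" becomes its own Song object.
    songs = [
        Song(title=str(item["title"]), length_seconds=int(item["length_seconds"]))
        for item in data["songs"]
    ]
    return Album(name=str(data["name"]), artist=str(data["artist"]), songs=songs)


# Hand-written JSON mirroring part of the recorded output above.
sample_json = """
{
  "name": "Kung Fu Hustle Soundtrack",
  "artist": "Various Artists",
  "songs": [
    {"title": "Kung Fu Fighting", "length_seconds": 180},
    {"title": "The Axe Gang", "length_seconds": 195}
  ]
}
"""

album = album_from_json(sample_json)
total_seconds = sum(song.length_seconds for song in album.songs)
print(album.name, total_seconds)
```

Because every song is a real object with an integer length, downstream code can do arithmetic on the result directly instead of re-parsing text.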

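Custom output parsing

LlamaIndex also lets you define how the model's output is converted. You write the parsing logic yourself, which gives more flexibility over the reply format.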

# Custom output parser
from llama_index.core.output_parsers import ChainableOutputParser


class CustomAlbumOutputParser(ChainableOutputParser):
    """Parse the line-based text format requested in the prompt into an Album."""

    def __init__(self, verbose: bool = False):
        self.verbose = verbose

    def parse(self, output: str) -> Album:
        if self.verbose:
            print(f"> Raw output:{output}")

        lines = output.split("\n")
        # First line: "<album name>, <album artist>"
        name, artist = lines[0].split(",")

        # Remaining lines: "<song title>, <length in seconds>"
        songs = []
        for line in lines[1:]:
            title, length_seconds = line.split(",")
            songs.append(Song(title=title, length_seconds=length_seconds))
        return Album(name=name, artist=artist, songs=songs)


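# English gloss of the prompt below: "Find the album associated with the movie
# {movie_name}; it needs an artist and a song list. Return it in this format:
# the first line is '<album name>, <album artist>', then one line per song as
# '<song title>, <song length (in seconds)>'."
prompt_template_str = """\
从电影 {movie_name} 中找出相关的唱片，需要有创作者和歌曲列表


返回的格式如下
第一行包括:
<唱片名>, <唱片作者>
接下来每一行代表一首歌
<歌名>, <歌曲时长(以秒计时)>
"""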

program = LLMTextCompletionProgram.from_defaults(
    output_parser=CustomAlbumOutputParser(verbose=True),
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=True
)

output = program(movie_name="长江七号")

print(output)

The output has two parts.

The first part is the model's raw reply, printed because the parser was created with verbose=True:

> Raw output:长江七号原声带, 万籁鸣
大江东去, 240
梦想的船, 180
长江之歌, 210
夜色温柔, 195
追梦人, 220

The second part is the parsed result:

Album(name='长江七号原声带', artist=' 万籁鸣', songs=[Song(title='大江东去', length_seconds=240), Song(title='梦想的船', length_seconds=180), Song(title='长江之歌', length_seconds=210), Song(title='夜色温柔', length_seconds=195), Song(title='追梦人', length_seconds=220)])
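Note the leading space in artist=' 万籁鸣': the parser above splits on "," without trimming whitespace, and it would also fail on a trailing blank line or a title that contains a comma. Below is a sketch of a more tolerant parsing function. It returns a plain dict so that it runs without LlamaIndex or Pydantic; `parse_album_text` is a hypothetical helper name, and `raw` is the raw reply recorded above.

```python
def parse_album_text(output: str) -> dict:
    """Parse '<album>, <artist>' followed by '<title>, <seconds>' lines."""
    # Trim every line and drop the blank ones before splitting.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty model output")

    # Split on the first comma only, so the artist may itself contain commas.
    name, artist = (part.strip() for part in lines[0].split(",", 1))

    songs = []
    for line in lines[1:]:
        # Split on the last comma, so the title may itself contain commas.
        title, seconds = line.rsplit(",", 1)
        songs.append({"title": title.strip(), "length_seconds": int(seconds)})
    return {"name": name, "artist": artist, "songs": songs}


# Raw reply copied from the verbose output above.
raw = """长江七号原声带, 万籁鸣
大江东去, 240
梦想的船, 180
长江之歌, 210
夜色温柔, 195
追梦人, 220
"""

album = parse_album_text(raw)
print(album["artist"])
```

Inside a custom parser's parse method, the same dict could then be turned into the model with Album(**album), since Pydantic builds the nested Song objects from the inner dicts.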

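Using Function Calling

LlamaIndex provides FunctionCallingProgram for generating structured responses. It requires a model that supports function calling; if yours does not, use LLMTextCompletionProgram to get structured output instead.

The two work on different principles: FunctionCallingProgram passes a description of the Pydantic object's structure to the model as a tool, while LLMTextCompletionProgram relies on the prompt to ask the model for structured output.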
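The difference between the two routes can be pictured with a small standard-library sketch. Everything here is hand-written for illustration: `album_schema` is a JSON Schema I wrote for the Album model used in this article, the tool envelope follows the OpenAI-style layout, and the prompt wording is made up. The payloads LlamaIndex actually sends may differ.

```python
import json

# JSON Schema describing Album, written by hand for illustration.
album_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "artist": {"type": "string"},
        "songs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "length_seconds": {"type": "integer"},
                },
                "required": ["title", "length_seconds"],
            },
        },
    },
    "required": ["name", "artist", "songs"],
}

# Function-calling route: the schema travels as a tool definition
# (OpenAI-style layout; other providers use different envelopes).
tool = {
    "type": "function",
    "function": {
        "name": "Album",
        "description": "Data model for an album.",
        "parameters": album_schema,
    },
}

# Prompt route: the same schema is pasted into the prompt text and the
# model is asked to answer with JSON that matches it.
prompt = (
    "Generate an example album.\n"
    "Reply with JSON that conforms to this schema:\n"
    + json.dumps(album_schema, indent=2)
)

print(tool["function"]["name"])
print(prompt.splitlines()[1])
```

In the first route the model answers with a tool call whose arguments are already JSON; in the second the reply is free text that still has to be parsed, which is why the prompt route depends more heavily on the model following instructions.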

from pydantic import BaseModel
from typing import List
from llama_index.llms.openai import OpenAI

from llama_index.core.program import FunctionCallingProgram

class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]
    
prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Every song item has title and length_seconds.
Using the movie {movie_name} as inspiration.\
"""
llm = OpenAI(model="gpt-3.5-turbo")

program = FunctionCallingProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=True,
    llm=llm
)

output = program(movie_name="The Shining")
print(output)

First part of the output (the verbose log):

=== Calling Function ===
Calling function: Album with args: {"name": "The Shining Soundtrack", "artist": "Various Artists", "songs": [{"title": "Main Title", "length_seconds": 180}, {"title": "The Overlook Hotel", "length_seconds": 240}, {"title": "Danny's Vision", "length_seconds": 200}, {"title": "Room 237", "length_seconds": 220}, {"title": "Redrum", "length_seconds": 190}, {"title": "The Maze", "length_seconds": 210}, {"title": "Here's Johnny", "length_seconds": 195}]}
=== Function Output ===
name='The Shining Soundtrack' artist='Various Artists' songs=[Song(title='Main Title', length_seconds=180), Song(title='The Overlook Hotel', length_seconds=240), Song(title="Danny's Vision", length_seconds=200), Song(title='Room 237', length_seconds=220), Song(title='Redrum', length_seconds=190), Song(title='The Maze', length_seconds=210), Song(title="Here's Johnny", length_seconds=195)]

Second part of the output (the returned object):

Album(name='The Shining Soundtrack', artist='Various Artists', songs=[Song(title='Main Title', length_seconds=180), Song(title='The Overlook Hotel', length_seconds=240), Song(title="Danny's Vision", length_seconds=200), Song(title='Room 237', length_seconds=220), Song(title='Redrum', length_seconds=190), Song(title='The Maze', length_seconds=210), Song(title="Here's Johnny", length_seconds=195)])

Function calling supports batch processing

prompt_template_str = """\
Generate example albums, with an artist and a list of songs, each song has title and length_seconds fields.
Using each movie below as inspiration. \

Here are the movies:
{movie_names}
"""
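llm = OpenAI(model="gpt-3.5-turbo")

program = FunctionCallingProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    verbose=True,
    # Pass the LLM explicitly; otherwise the variable above is never used
    llm=llm,
    # Let the model emit several Album tool calls in a single response
    allow_parallel_tool_calls=True,
)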
output = program(movie_names="The Shining, The Blair Witch Project, Saw")

print(output)

First part of the output (the verbose log):

=== Calling Function ===
Calling function: Album with args: {"name": "The Shining", "artist": "Wendy Carlos", "songs": [{"title": "Main Title", "length_seconds": 180}, {"title": "Rocky Mountains", "length_seconds": 240}, {"title": "Lontano", "length_seconds": 200}]}
=== Function Output ===
name='The Shining' artist='Wendy Carlos' songs=[Song(title='Main Title', length_seconds=180), Song(title='Rocky Mountains', length_seconds=240), Song(title='Lontano', length_seconds=200)]
=== Calling Function ===
Calling function: Album with args: {"name": "The Blair Witch Project", "artist": "Tony Cora", "songs": [{"title": "The Rustin Parr House", "length_seconds": 150}, {"title": "The Blair Witch Project", "length_seconds": 210}, {"title": "The House in the Woods", "length_seconds": 180}]}
=== Function Output ===
name='The Blair Witch Project' artist='Tony Cora' songs=[Song(title='The Rustin Parr House', length_seconds=150), Song(title='The Blair Witch Project', length_seconds=210), Song(title='The House in the Woods', length_seconds=180)]
=== Calling Function ===
Calling function: Album with args: {"name": "Saw", "artist": "Charlie Clouser", "songs": [{"title": "Hello Zepp", "length_seconds": 220}, {"title": "Bite the Hand That Bleeds", "length_seconds": 190}, {"title": "X Marks the Spot", "length_seconds": 205}]}
=== Function Output ===
name='Saw' artist='Charlie Clouser' songs=[Song(title='Hello Zepp', length_seconds=220), Song(title='Bite the Hand That Bleeds', length_seconds=190), Song(title='X Marks the Spot', length_seconds=205)]

The log shows that the function was called several times, once for each movie.
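Each call in the log carries one JSON argument payload, so a batch run yields one album per call. The sketch below decodes the three payloads copied from the log with the standard library only. In LlamaIndex the program does this decoding itself and, in this mode, presumably hands the albums back as a list; treat that as an assumption and check it against your installed version.

```python
import json

# Argument payloads copied from the verbose log above, one per function call.
logged_calls = [
    '{"name": "The Shining", "artist": "Wendy Carlos", "songs": [{"title": "Main Title", "length_seconds": 180}, {"title": "Rocky Mountains", "length_seconds": 240}, {"title": "Lontano", "length_seconds": 200}]}',
    '{"name": "The Blair Witch Project", "artist": "Tony Cora", "songs": [{"title": "The Rustin Parr House", "length_seconds": 150}, {"title": "The Blair Witch Project", "length_seconds": 210}, {"title": "The House in the Woods", "length_seconds": 180}]}',
    '{"name": "Saw", "artist": "Charlie Clouser", "songs": [{"title": "Hello Zepp", "length_seconds": 220}, {"title": "Bite the Hand That Bleeds", "length_seconds": 190}, {"title": "X Marks the Spot", "length_seconds": 205}]}',
]

# One decoded album per tool call, in the order the calls were made.
albums = [json.loads(args) for args in logged_calls]

for album in albums:
    total = sum(song["length_seconds"] for song in album["songs"])
    print(f'{album["name"]}: {len(album["songs"])} songs, {total} seconds')
```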