LIDA: Letting an LLM Visualize Data Automatically - "LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Info"

Latest recommended article published on 2024-10-12 12:32:09

chencjiajy

Category: Tools. Tags: visualization, paper reading, LLM

Article link: https://blog.csdn.net/beingstrong/article/details/132919223

1. Introduction

LIDA uses an LLM to visualize data automatically. Its basic information is as follows:

Paper title: LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
Project page: https://microsoft.github.io/lida/
GitHub: https://github.com/microsoft/lida
Paper download: arXiv link, ACL 2023 link

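Paper abstract: Systems that support users in the automatic creation of visualizations must address several subtasks: understanding the semantics of the data, enumerating relevant visualization goals, and generating visualization specifications. In this work, we pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) such as ChatGPT/GPT-4 and image generation models (IGMs) are suitable for addressing these tasks. We present LIDA, a novel tool for generating grammar-agnostic visualizations and infographics. LIDA comprises 4 modules: a SUMMARIZER that converts data into a rich but compact natural language summary; a GOAL EXPLORER that enumerates visualization goals given the data; a VISGENERATOR that generates, refines, executes and filters visualization code; and an INFOGRAPHER that uses IGMs to produce data-faithful stylized graphics. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive chart, infographic and data story generation. Learn more about the project at https://microsoft.github.io/lida/.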

2. Method

LIDA splits the visualization task into the four modules shown in the figure below:

[Figure: the four LIDA modules]

SUMMARIZER

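The SUMMARIZER module summarizes the contents of the dataset. The authors split it into two stages: 1. To reduce hallucination, the data is first processed with pandas to produce descriptive information for each field. 2. Optionally, the LLM enriches the stage-1 output with additional content, or the user adds extra summary information.

[Figure: the SUMMARIZER module]

The ablation results in Section 4.3 of the paper (figure below; lower error rate is better) show that the summary step matters, but the extra enrichment contributes little to performance. In other words, the stage-1 information extraction is enough and calling the LLM is unnecessary. (Here "schema" means a summary that contains only the column names.)

[Figure: ablation results from Section 4.3 of the paper]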
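To make the two ablation conditions concrete, here is a minimal, dependency-free sketch that contrasts a schema (column names only) with a stage-1 style summary. LIDA itself computes the summary with pandas; the helper names `schema_only` and `rule_based_summary` and the toy table are hypothetical and exist only for illustration.

```python
# Toy table as column -> values (hypothetical data, not from the paper).
table = {
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "price": [10.0, 12.5, 9.0, 30.0],
}


def schema_only(table: dict) -> list:
    """The 'schema' condition: column names and nothing else."""
    return list(table)


def rule_based_summary(table: dict, n_samples: int = 2) -> list:
    """Stage-1 style summary: per-column type, range, samples, cardinality."""
    summary = []
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        # bool is a subclass of int, so exclude it explicitly.
        is_number = bool(non_null) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in non_null
        )
        props = {}
        if is_number:
            props.update(dtype="number", min=min(non_null), max=max(non_null))
        else:
            props["dtype"] = "string"
        props["samples"] = non_null[:n_samples]
        props["num_unique_values"] = len(set(non_null))
        summary.append({"column": name, "properties": props})
    return summary


schema = schema_only(table)
summary = rule_based_summary(table)
```

The schema tells the model only that `city` and `price` exist, while the summary also exposes types, value ranges and sample values, which is the extra grounding the ablation credits for the lower error rate.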

The code for stage 1 is as follows:

    def get_column_properties(self, df: pd.DataFrame, n_samples: int = 3) -> list[dict]:
        """Get properties of each column in a pandas DataFrame"""
        properties_list = []
        for column in df.columns:
            dtype = df[column].dtype
            properties = {}
            if dtype in [int, float, complex]:
                properties["dtype"] = "number"
                properties["std"] = self.check_type(dtype, df[column].std())
                properties["min"] = self.check_type(dtype, df[column].min())
                properties["max"] = self.check_type(dtype, df[column].max())

            elif dtype == bool:
                properties["dtype"] = "boolean"
            elif dtype == object:
                # Check if the string column can be cast to a valid datetime
                try:
                    with warnings.catch_warnings():
                        warnings.simplefilter("ignore")
                        pd.to_datetime(df[column], errors='raise')
                        properties["dtype"] = "date"
                except ValueError:
                    # Check if the string column has a limited number of values
                    if df[column].nunique() / len(df[column]) < 0.5:
                        properties["dtype"] = "category"
                    else:
                        properties["dtype"] = "string"
            elif pd.api.types.is_categorical_dtype(df[column]):
                properties["dtype"] = "category"
            elif pd.api.types.is_datetime64_any_dtype(df[column]):
                properties["dtype"] = "date"
            else:
                properties["dtype"] = str(dtype)

            # add min max if dtype is date
            if properties["dtype"] == "date":
                try:
                    properties["min"] = df[column].min()
                    properties["max"] = df[column].max()
                except TypeError:
                    cast_date_col = pd.to_datetime(df[column], errors='coerce')
                    properties["min"] = cast_date_col.min()
                    properties["max"] = cast_date_col.max()
            # Add additional properties to the output dictionary
            nunique = df[column].nunique()
            if "samples" not in properties:
                non_null_values = df[column][df[column].notnull()].unique()
                n_samples = min(n_samples, len(non_null_values))
                samples = pd.Series(non_null_values).sample(n_samples, random_state=42).tolist()
                properties["samples"] = samples
            properties["num_unique_values"] = nunique
            properties["semantic_type"] = ""
            properties["description"] = ""
            properties_list.append({"column": column, "properties": properties})

        return properties_list

The system prompt used in stage 2 to have the LLM generate the enrichment is:

system_prompt = """
You are an experienced data analyst that can annotate datasets. Your instructions are as follows:
i) ALWAYS generate the name of the dataset and the dataset_description
ii) ALWAYS generate a field description.
iii.) ALWAYS generate a semantic_type (a single word) for each field given its values e.g. company, city, number, supplier, location, gender, longitude, latitude, url, ip address, zip code, email, etc
You must return an updated JSON dictionary without any preamble or explanation.
"""
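A minimal sketch of how a stage-2 call might be wired up, assuming the stage-1 summary is a JSON-serializable dictionary. The function `enrich_summary`, the `fields` key, the user-message wording and the stand-in `fake_llm` are all hypothetical; a real setup would replace `fake_llm` with an actual model call.

```python
import json
from typing import Callable


def enrich_summary(base_summary: dict, system_prompt: str,
                   llm: Callable[[list], str]) -> dict:
    """Ask an LLM to annotate a stage-1 summary; fall back to the input on bad JSON."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": "Annotate the dictionary below.\n" + json.dumps(base_summary)},
    ]
    reply = llm(messages)
    try:
        enriched = json.loads(reply)
    except json.JSONDecodeError:
        return base_summary  # keep the rule-based stage-1 result
    return enriched if isinstance(enriched, dict) else base_summary


def fake_llm(messages: list) -> str:
    """Stand-in for a real model (hypothetical): fills the empty annotation fields."""
    summary = json.loads(messages[-1]["content"].split("\n", 1)[1])
    summary["dataset_description"] = "Toy price data"
    for field in summary["fields"]:
        field["properties"]["semantic_type"] = "price"
    return json.dumps(summary)


base = {
    "name": "toy.csv",
    "dataset_description": "",
    "fields": [{"column": "price",
                "properties": {"dtype": "number", "semantic_type": ""}}],
}
enriched = enrich_summary(base, "You are an experienced data analyst ...", fake_llm)
```

Falling back to the stage-1 summary on malformed output fits the ablation finding above: the enrichment is optional, so a failed LLM call should not block the pipeline.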

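GOAL EXPLORER

This step has the LLM generate visualization goals from the data summary produced in the previous step. As in the figure below, each goal must contain a question (the hypothesis), a visualization (how to address the question), and a rationale (why this goal is worth pursuing). The authors found that asking the LLM for a rationale leads to more semantically meaningful goals.

[Figure: example goal with question, visualization, and rationale]

The system prompt for goal generation is: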

system_prompt = """You are a an experienced data analyst as well as visualization specialist who can generate a given number of insightful GOALS about data, when given a summary of the data. The VISUALIZATIONS YOU RECOMMEND MUST FOLLOW VISUALIZATION BEST PRACTICES (e.g., must use bar charts instead of pie charts for comparing quantities) AND BE MEANINGFUL (e.g., plot longitude and latitude on maps where appropriate).

The GOALS that you recommend must mention the exact fields from the dataset summary above. Your OUTPUT MUST BE ONLY A CODE SNIPPET of a JSON LIST in the format:
```[{ "index": 0,  "question": "What is the distribution of X", "visualization": "histogram of X", "rationale": "This tells about "}, .. ]
```
"""
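Because the prompt asks for a JSON list wrapped in a code fence, the reply has to be unwrapped before use. The following is a sketch of one way to do that, not LIDA's own parser; the `Goal` dataclass here is a simplified stand-in and `parse_goals` is a hypothetical helper.

```python
import json
import re
from dataclasses import dataclass

FENCE = "`" * 3  # built indirectly so this example contains no literal fence


@dataclass
class Goal:
    index: int
    question: str
    visualization: str
    rationale: str


def parse_goals(reply: str) -> list:
    """Strip an optional code fence, then load the JSON list into Goal objects."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    return [Goal(**item) for item in json.loads(payload)]


reply = (
    FENCE
    + '[{"index": 0, "question": "What is the distribution of price?", '
    + '"visualization": "histogram of price", '
    + '"rationale": "Shows skew and outliers."}]\n'
    + FENCE
)
goals = parse_goals(reply)
```

Loading into a dataclass makes a missing or misspelled key fail immediately with a `TypeError` instead of surfacing later in the chart-generation step.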

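VISGENERATOR

[Figure: the VISGENERATOR module; the code scaffold is on the left]

Code scaffold constructor: the first step generates a code scaffold, shown in the left half of the figure above. The scaffold consists of: 1. imports of the relevant packages; 2. an empty function stub that returns the visualization.

The implementation in the open-source code is: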

class ChartScaffold(object):
    """Return code scaffold for charts in multiple visualization libraries"""

    def __init__(
        self,
    ) -> None:

        pass

    def get_template(self, goal: Goal, library: str):
        mpl_pre = f"Set chart title to {goal.question}. If the solution requires a single value (e.g. max, min, median, first, last etc), ALWAYS add a line (axvline or axhline) to the chart, ALWAYS with a legend containing the single value (formatted with 0.2F). If using a <field> where semantic_type=date, YOU MUST APPLY the following transform before using that column i) convert date fields to date types using data[''] = pd.to_datetime(data[<field>], errors='coerce'), ALWAYS use  errors='coerce' ii) drop the rows with NaT values data = data[pd.notna(data[<field>])] iii) convert field to right time format for plotting.  ALWAYS make sure the x-axis labels are legible (e.g., rotate when needed). Use BaseMap for charts that require a map. Given the dataset summary, the plot(data) method should generate a {library} chart ({goal.visualization}) that addresses this goal: {goal.question}. DO NOT include plt.show(). The plot method must return a matplotlib object. Think step by step. \n"

        if library == "matplotlib":
            instructions = {"role": "assistant", "content": mpl_pre}
            template = \
                f"""
import matplotlib.pyplot as plt
import pandas as pd
<imports>
# plan -
def plot(data: pd.DataFrame):
    <stub> # only modify this section
    plt.title('{goal.question}', wrap=True)
    return plt;

chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""
        elif library == "seaborn":
            instructions = {"role": "assistant", "content": mpl_pre}

            template = \
                f"""
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
<imports>
# solution plan
# i.  ..
def plot(data: pd.DataFrame):

    <stub> # only modify this section
    plt.title('{goal.question}', wrap=True)
    return plt;

chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""

        elif library == "ggplot":
            instructions = {
                "role": "assistant",
                "content": f" If using a field where semantic_type=date, do the following i) convert date fields to date types using data[''] = pd.to_datetime(data[''], errors='coerce'), ALWAYS use  errors='coerce' ii) drop the rows with NaT values data = data[pd.notna(data[''])] iii) convert field to right time format for plotting.  ALWAYS make sure the x-axis labels are legible.  Solve the task  carefully by completing ONLY the <imports> AND <stub> section. Given the dataset summary, the plot(data) method should generate a {library} chart ({goal.visualization}) that addresses this goal: {goal.question}. The plot method must return a ggplot object (chart)`. Think step by step.p. \n",
            }

            template = \
                f"""
import plotnine as p9
<imports>
def plot(data: pd.DataFrame):
    chart = <stub>

    return chart;

chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line.. """

        elif library == "altair":
            instructions = {
                "role": "system",
                "content": f"If using a field where semantic_type=date, do the following i) convert date fields to date types using data[''] = pd.to_datetime(data[''], errors='coerce'), ALWAYS use  errors='coerce' ii) drop the rows with NaT values data = data[pd.notna(data[''])] iii) convert field to right time format for plotting.  ALWAYS make sure the x-axis labels are legible.  Solve the task  carefully by completing ONLY the <imports> AND <stub> section. Given the dataset summary, the plot(data) method should generate a {library} chart ({goal.visualization}) that addresses this goal: {goal.question}. Always add a type that is BASED on semantic_type to each field such as :Q, :O, :N, :T, :G. Use :T if semantic_type is year or date. The plot method must return an altair object (chart)`. Think step by step. \n",
            }
            template = \
                """
import altair as alt
<imports>
def plot(data: pd.DataFrame):
    <stub> # only modify this section
    return chart
chart = plot(data) # data already contains the data to be plotted.  Always include this line. No additional code beyond this line..
"""
        else:
            raise ValueError(
                "Unsupported library. Choose from 'matplotlib', 'seaborn', 'plotly', 'bokeh', 'ggplot', 'altair'."
            )

        return template, instructions
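To illustrate what a completed scaffold looks like, the sketch below fills the two placeholders of a shortened matplotlib template by plain string substitution. This is only an illustration: in LIDA the LLM returns the whole program rather than the placeholder contents, and `fill_scaffold` and the example stub are hypothetical.

```python
TEMPLATE = """import matplotlib.pyplot as plt
import pandas as pd
<imports>
def plot(data: pd.DataFrame):
    <stub> # only modify this section
    plt.title('What is the distribution of price?', wrap=True)
    return plt

chart = plot(data)"""


def fill_scaffold(template: str, imports: str, stub: str) -> str:
    """Substitute the two placeholders, the only parts the LLM is meant to change."""
    return template.replace("<imports>", imports).replace("<stub>", stub)


completed = fill_scaffold(
    TEMPLATE,
    imports="",
    stub="data['price'].plot.hist(bins=20)",
)

# Syntax check only: compile() parses the program without importing or running it.
compile(completed, "<scaffold>", "exec")
```

Fixing everything except `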

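Code generator: using a prompt that contains the scaffold code, the data summary and the visualization goal, the LLM generates n candidate visualization programs.

The code is as follows: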

system_prompt = """
You are a helpful assistant highly skilled in writing PERFECT code for visualizations. Given some code template, you complete the template to generate a visualization given the dataset and the goal described. The code you write MUST FOLLOW VISUALIZATION BEST PRACTICES ie. meet the specified goal, apply the right transformation, use the right visualization type, use the right data encoding, and use the right aesthetics (e.g., ensure axis are legible). The transformations you apply MUST be correct and the fields you use MUST be correct. The visualization CODE MUST BE CORRECT and MUST NOT CONTAIN ANY SYNTAX OR LOGIC ERRORS. You MUST first generate a brief plan for how you would solve the task e.g. what transformations you would apply e.g. if you need to construct a new column, what fields you would use, what visualization type you would use, what aesthetics you would use, etc.
YOU MUST ALWAYS return code using the provided code template. DO NOT add notes or explanations.
"""

class VizGenerator(object):
    """Generate visualizations from prompt"""

    def __init__(
        self
    ) -> None:

        self.scaffold = ChartScaffold()

    def generate(self, summary: Dict, goal: Dict,
                 textgen_config: TextGenerationConfig, text_gen: TextGenerator, library='altair'):
        """Generate visualization code given a summary and a goal"""

        library_template, library_instructions = self.scaffold.get_template(goal, library)
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "system", "content": f"The dataset summary is : {summary}"},
            library_instructions,
            {"role": "system", "content": f"Use the code template below \n {library_template}. DO NOT modify the rest of the code template."},
            {"role": "user",
             "content":
             "Always add a legend with various colors where appropriate. The visualization code MUST only use data fields that exist in the dataset (field_names) or fields that are transformations based on existing field_names). Only use variables that have been defined in the code or are in the dataset summary. You MUST return a full python code program that starts with an import statement. DO NOT add any explanation"}]

        # print(textgen_config.messages)
        completions: TextGenerationResponse = text_gen.generate(
            messages=messages, config=textgen_config)
        return [x['content'] for x in completions.text]

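Code executor: post-processes and executes the generated code. The output is a list of code snippets together with their corresponding raster images. (GitHub source)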
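The execute-and-filter idea can be sketched with the standard library alone. This is a simplified illustration, not LIDA's executor: rasterization is omitted, `ExecResult` and `execute_all` are hypothetical names, and the toy "charts" are plain numbers so the example runs without a plotting library.

```python
import traceback
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ExecResult:
    code: str
    status: bool
    chart: Any = None
    error: Optional[str] = None


def execute_all(code_specs: list, data: Any) -> list:
    """Run each candidate in its own namespace and collect the `chart` it defines."""
    results = []
    for code in code_specs:
        namespace = {"data": data}
        try:
            # exec runs arbitrary code: only use it on input you are willing to run.
            exec(code, namespace)
            results.append(ExecResult(code, True, chart=namespace.get("chart")))
        except Exception:
            results.append(ExecResult(code, False, error=traceback.format_exc()))
    return results


good = "def plot(data):\n    return sum(data)\n\nchart = plot(data)"
bad = "def plot(data):\n    return undefined_name\n\nchart = plot(data)"
results = execute_all([good, bad], data=[1, 2, 3])
working = [r for r in results if r.status]
```

Giving each candidate a fresh namespace keeps one failing snippet from affecting the others, and keeping the traceback makes it possible to report why a candidate was filtered out.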

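In addition, the LIDA demo provides the following interactive features:

Natural-language visualization refinement: a chart can be modified interactively with natural-language instructions, for example "translate chart to hindi" or "zoom in by 50%".

The system prompt used is: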

system_prompt = """
You are a helpful assistant highly skilled in modifying visualization code based on a summary of a dataset to follow instructions. Your modification should ONLY UPDATE the content of the plot(data) function/method. You MUST return a full program. DO NOT with NO backticks ```. DO NOT include any preamble text. Do not include explanations or prose.
"""

Visualization explanation: generates a natural-language description of the chart.

The system prompt used is:

system_prompt = """
You are a helpful assistant highly skilled in providing helpful, structured explanations of visualization of the plot(data: pd.DataFrame) method in the provided code. You divide the code into sections and provide a description of each section and an explanation. The first section should be named "accessibility" and describe the physical appearance of the chart (colors, chart type etc), the goal of the chart, as well the main insights from the chart.
You can explain code across the following 3 dimensions:
1. accessibility: the physical appearance of the chart (colors, chart type etc), the goal of the chart, as well the main insights from the chart.
2. transformation: This should describe the section of the code that applies any kind of data transformation (filtering, aggregation, grouping, null value handling etc)
3. visualization: step by step description of the code that creates or modifies the presented visualization.

Your output MUST be perfect JSON in THE FORM OF A VALID JSON LIST  e.g.,
[{"section": "accessibility", "code": "None", "explanation": ".."},  { "section":  "transformation",  "code": "..", "explanation": ".."}, { "section":  "visualization",  "code": "..", "explanation": ".."}].

The code part of the dictionary must come from the supplied code and should cover the explanation. The explanation part of the dictionary must be a string. The section part of the dictionary must be one of "accessibility", "transformation", "visualization" with no repetition. The list must have exactly 3 dictionaries.
"""
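The prompt states several structural constraints on the reply (exactly 3 dictionaries, one per section, string explanations), and these can be checked mechanically. The validator below is a sketch under that reading of the prompt; `validate_explanation` is a hypothetical helper, not part of LIDA.

```python
import json

SECTIONS = ("accessibility", "transformation", "visualization")


def validate_explanation(reply: str) -> list:
    """Check the reply against the constraints the system prompt states."""
    items = json.loads(reply)
    if not isinstance(items, list) or len(items) != 3:
        raise ValueError("expected a JSON list with exactly 3 dictionaries")
    if not all(isinstance(item, dict) for item in items):
        raise ValueError("every list entry must be a dictionary")
    # Three items whose sections form this exact set implies no repetition.
    if {item.get("section") for item in items} != set(SECTIONS):
        raise ValueError("sections must be exactly: " + ", ".join(SECTIONS))
    for item in items:
        if not isinstance(item.get("explanation"), str):
            raise ValueError("each explanation must be a string")
    return items


valid_reply = json.dumps([
    {"section": "accessibility", "code": "None", "explanation": "A blue histogram."},
    {"section": "transformation", "code": "data.dropna()", "explanation": "Drops nulls."},
    {"section": "visualization", "code": "plt.hist(...)", "explanation": "Draws bins."},
])
sections = [item["section"] for item in validate_explanation(valid_reply)]
```

A failed check is a natural trigger for retrying the request, since the prompt alone cannot guarantee that the model honors the format.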

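Visualization recommendation: given some context (such as goals or an existing chart), recommends additional charts to the user (for comparison, to provide extra information, and so on).
The system prompt used is: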

  system_prompt = """
  You are a helpful assistant highly skilled in recommending a DIVERSE set of visualization code. Your input is an example visualization code,  a summary of a dataset and an example visualization goal. Given this input, your task is to recommend an additional DIVERSE visualizations that a user may be interested. Your output considers different types of valid aggregations, chart types, and uses different variables from the data summary. THE CODE YOU GENERATE MUST BE CORRECT AND FOLLOW VISUALIZATION BEST PRACTICES. YOU MUST IMPORT ALL LIBRARIES THAT ARE USED.
  
  Your output MUST be a n code snippets separated by ******* (5 asterisks). Each snippet MUST BE AN independent code snippet (with one plot method) similar to the example code. For example
  
  ```python
  # code snippet 1
  ....
  ```
  *****
  
  ```python
  # code snippet 2
  ....
  ```
  
  ```python
  # code snippet n
  ....
  ```
  
  Do not include any text or explanation or prose. EACH CODE SNIPPET MUST BE A FULL PROGRAM (COMPLETE WITH IMPORT STATEMENT AND plot(data) method) THAT FOLLOWS THE STRUCTURE OF THE EXAMPLE VISUALIZATION CODE.
  """
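The prompt's separator is not fully consistent: it writes seven asterisks while saying five, and its example omits the separator between the later snippets. A parser that keys on the fenced blocks rather than on the asterisks is therefore more forgiving. The sketch below takes that approach; `split_recommendations` is a hypothetical helper and this is not necessarily how LIDA splits the reply.

```python
import re

FENCE = "`" * 3  # built indirectly so this example contains no literal fence


def split_recommendations(reply: str) -> list:
    """Return the body of every fenced python block, ignoring the separators."""
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    return [body.strip() for body in re.findall(pattern, reply, flags=re.DOTALL)]


reply = "\n".join([
    FENCE + "python",
    "# code snippet 1",
    "chart = 1",
    FENCE,
    "*****",
    "",
    FENCE + "python",
    "# code snippet 2",
    "chart = 2",
    FENCE,
])
snippets = split_recommendations(reply)
```

Each extracted snippet is a full program, so it can be passed to the same execution step as the originally generated candidates.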

INFOGRAPHER

Applies a diffusion model to the generated chart to produce stylized graphics, using the API of the Peacasso library.

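Evaluation

LIDA has two evaluation metrics:

Visualization Error Rate (VER): $VER=\frac{E}{T} \times 100$, where E is the number of generated visualizations whose code fails to compile and T is the total number of generated visualizations.
Self-Evaluated Visualization Quality (SEVQ): GPT-4 scores each visualization from 1 to 10 on 6 dimensions (code accuracy, data transformation, goal compliance, visualization type, data encoding, and aesthetics) and gives a rationale for each score.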
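The VER formula is simple enough to sketch directly. The version below counts E with Python's built-in `compile()`, which catches syntax errors only; this is an assumption made for the illustration, and a stricter count could also treat runtime failures as errors. The function name `visualization_error_rate` is hypothetical.

```python
def visualization_error_rate(code_specs: list) -> float:
    """VER = E / T * 100, with E counted via Python's built-in compile()."""
    if not code_specs:
        raise ValueError("need at least one generated visualization")
    errors = 0
    for code in code_specs:
        try:
            compile(code, "<generated>", "exec")  # parse only, nothing is run
        except SyntaxError:
            errors += 1
    return errors / len(code_specs) * 100


candidates = [
    "def plot(data):\n    return data\n",
    "def plot(data):\n    return len(data)\n",
    "def plot(data:\n    return data\n",  # missing closing parenthesis
    "chart = 1\n",
]
ver = visualization_error_rate(candidates)
```

Here one of four candidates fails to parse, so E = 1 and T = 4.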

The prompt used is:

system_prompt = """
You are a helpful assistant highly skilled in evaluating the quality of a given visualization code by providing a score from 1 (bad) - 10 (good) while providing clear rationale. YOU MUST CONSIDER VISUALIZATION BEST PRACTICES for each evaluation. Specifically, you can carefully evaluate the code across the following dimensions
- bugs (bugs):  are there bugs, logic errors, syntax error or typos? Are there any reasons why the code may fail to compile? How should it be fixed? If ANY bug exists, the bug score MUST be less than 5.
- Data transformation (transformation): Is the data transformed appropriately for the visualization type? E.g., is the dataset appropriated filtered, aggregated, or grouped  if needed?
- Goal compliance (compliance): how well the code meets the specified visualization goals?
- Visualization type (type): CONSIDERING BEST PRACTICES, is the visualization type appropriate for the data and intent? Is there a visualization type that would be more effective in conveying insights? If a different visualization type is more appropriate, the score MUST be less than 5.
- Data encoding (encoding): Is the data encoded appropriately for the visualization type?
- aesthetics (aesthetics): Are the aesthetics of the visualization appropriate for the visualization type and the data?

You must provide a score for each of the above dimensions.  Assume that data in chart = plot(data) contains a valid dataframe for the dataset. The `plot` function returns a chart (e.g., matplotlib, seaborn etc object).

Your OUTPUT MUST BE ONLY A CODE SNIPPET of a JSON LIST in the format:
```
[{ "dimension":  "bugs",  "score": 1, "rationale": " .."}, { "dimension":  "type",  "score": 1, "rationale": " .."},  ..]
```
"""
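A sketch of turning the evaluator's reply into numbers. The unwrapping mirrors the fenced JSON format requested by the prompt; averaging the per-dimension scores into a single value is an aggregation chosen here for illustration, and `parse_sevq` and `sevq_average` are hypothetical helpers.

```python
import json
from statistics import mean

FENCE = "`" * 3  # built indirectly so this example contains no literal fence


def parse_sevq(reply: str) -> dict:
    """Unwrap the fenced JSON list and map each dimension to its score."""
    payload = reply.strip().strip("`").strip()
    if payload.startswith("json"):
        payload = payload[len("json"):]
    return {item["dimension"]: item["score"] for item in json.loads(payload)}


def sevq_average(scores: dict) -> float:
    """Collapse the per-dimension scores into one number (an assumed aggregation)."""
    return mean(scores.values())


reply = (
    FENCE + "\n"
    + json.dumps([
        {"dimension": "bugs", "score": 8, "rationale": "No errors found."},
        {"dimension": "type", "score": 6, "rationale": "A bar chart may read better."},
    ])
    + "\n" + FENCE
)
scores = parse_sevq(reply)
average = sevq_average(scores)
```

Keeping the per-dimension dictionary alongside the average preserves the information needed to see which aspect of a chart pulled the score down.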

(The remaining content about the user interface is not covered here.)