Testing LLMs with Giskard

Article link: https://blog.csdn.net/stingfire/article/details/138730643

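Giskard is a platform for testing AI models. It supports functional validation, security testing, and compliance scanning. The tooling has two main parts: the Giskard Python library and a server component, Giskard Hub. The Python library is open source and hosted on GitHub: https://github.com/Giskard-AI/giskard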

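Testing with Giskard follows these steps:

1. Load a dataset and run functional validation;

2. Configure the vulnerability categories of interest and run a security scan;

3. Generate a test report and confirm the reported issues;

4. Generate test cases for those issues;

5. Bring in a third-party LLM for comparison and verification.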

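Besides LLMs, Giskard also supports testing NLP and vision models. This post uses LLM testing as the example for a quick start. Writing test code with the Giskard Python library is as "simple" as putting an elephant into a fridge:

1. Wrap the model as a Giskard model
2. Run a scan on that model
3. Generate a test report

Wrapping the Giskard model

An LLM cannot be tested directly; it has to be wrapped before anything else can be done. First install the required libraries: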

pip install "giskard[llm]" --upgrade
pip install "langchain<=0.0.301" "pypdf<=3.17.0" "faiss-cpu<=1.7.4" "openai<=0.28.1" "tiktoken<=0.5.1"

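In practice, I found that faiss installs most reliably on Python versions below 3.11, and that on Windows the CPU build (faiss-cpu) is the one to install.

Set the OpenAI API key as follows: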

import os

# Set the OpenAI API Key environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."

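The key lets Giskard use an OpenAI model to generate test cases and test data. Other models, for example ones served through ollama, can be used instead; see the Giskard documentation for details.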
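Hard-coding the key in a script makes it easy to commit by accident. A minimal sketch, assuming the key has already been exported in the shell, reads it from the environment and fails early when it is missing. The helper name require_api_key is hypothetical; it is not part of Giskard or the OpenAI client.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail early with a clear message.

    Hypothetical helper for illustration; not part of Giskard or the OpenAI client.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the scan")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found")
    except RuntimeError as exc:
        print(exc)
```

Passing the mapping as a parameter keeps the check easy to exercise with a plain dict instead of the real environment.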

Next, build the LLM application to be tested. The example below assembles it with langchain:

from langchain import OpenAI, FAISS, PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())

# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

# Test that everything works
climate_qa_chain.run({"query": "Is sea level rise avoidable? When will it stop?"})
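The chunk_size=1000 and chunk_overlap=100 settings above mean that neighbouring chunks share some text, so content near a chunk boundary appears in both neighbours. The sketch below illustrates the idea with a plain fixed-width sliding window. It is a deliberate simplification and not how RecursiveCharacterTextSplitter works internally (that class prefers to split on separators such as paragraph breaks); sliding_chunks is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters
parts = sliding_chunks(sample, chunk_size=20, chunk_overlap=5)
for part in parts:
    print(len(part), part)
```

A larger overlap reduces the chance that a relevant passage is cut in half, at the cost of more chunks to embed.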

Then wrap it as a Giskard model:

import giskard
import pandas as pd


def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]


# Don't forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
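Before spending API calls on a scan, it can help to check the wrapper's contract offline: one output per input row, in the same order. The sketch below swaps the real chain for a stub. StubChain, make_predict, and the canned answer are assumptions for illustration, and a plain dict of lists stands in for the pandas.DataFrame so that the snippet runs without any dependencies.

```python
class StubChain:
    """Stand-in for climate_qa_chain: returns a canned answer instead of calling an LLM."""

    def run(self, inputs):
        return f"stub answer to: {inputs['query']}"


def make_predict(chain):
    """Build a prediction function with the shape used by model_predict:
    it receives a table with a "question" column and returns one output per row."""

    def predict(df):
        return [chain.run({"query": question}) for question in df["question"]]

    return predict


predict = make_predict(StubChain())
table = {"question": ["Is sea level rise avoidable?", "What causes warming?"]}
answers = predict(table)
print(answers)
```

Building the function from a chain argument also makes it straightforward to wrap a different chain later without touching the prediction logic.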

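Running the scan

Call giskard.scan to scan the wrapped model. The only parameter restricts the scan to specific vulnerability categories; to save time, the official documentation sets it to hallucination so that only hallucination-related detectors run. If only is omitted, every category is scanned. For LLM scans the dataset argument is optional, so it is left out here. The full scan result is stored in full_report, which the following steps use.

report = giskard.scan(giskard_model, only="hallucination")  # quick scan: hallucination detectors only
full_report = giskard.scan(giskard_model)  # full scan: all categories, slower and uses more API calls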

Generating the report

A single call renders the report (display is available in Jupyter/IPython notebooks):

display(full_report)

# Save it to a file
full_report.to_html("scan_report.html")

The report can also be exported as Markdown:

display(full_report)

# Save it to a file
full_report.to_markdown("scan_report.md")

A test suite targeting the detected issues can also be generated from the report:

test_suite = full_report.generate_test_suite(name="Test suite generated by scan")
test_suite.run()
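In an automated pipeline, the outcome of test_suite.run() usually has to become a pass/fail exit status. A minimal sketch, assuming the returned result object exposes a boolean passed attribute (verify this against your installed Giskard version). exit_code_for is a hypothetical helper, and a SimpleNamespace stands in for the real result so the snippet runs offline.

```python
from types import SimpleNamespace


def exit_code_for(result):
    """Map a suite result to a process exit code: 0 when passed, 1 otherwise.

    Assumes the result has a boolean `passed` attribute; a missing attribute
    is treated as a failure rather than silently passing.
    """
    return 0 if getattr(result, "passed", False) else 1


# Stand-ins for the object returned by test_suite.run()
print(exit_code_for(SimpleNamespace(passed=True)))
print(exit_code_for(SimpleNamespace(passed=False)))
```

With a real suite, the call would look like sys.exit(exit_code_for(test_suite.run())).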

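Giskard's Python library also integrates with the pytest framework, so tests can be written as pytest scripts. See the documentation for details: https://docs.giskard.ai/en/stable/integrations/pytest/index.html

Example code: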

import pytest

from giskard import Dataset, Model, Suite, demo
from giskard.testing import test_accuracy, test_f1

model_raw, df = demo.titanic()

wrapped_dataset = Dataset(
    name="Test Data Set",
    df=df,
    target="Survived",
    cat_columns=["Pclass", "Sex", "SibSp", "Parch", "Embarked"],
)

wrapped_model = Model(model=model_raw, model_type="classification", name="Classifier v1")

suite = (
    Suite(
        default_params={
            "model": wrapped_model,
            "dataset": wrapped_dataset,
        }
    )
    .add_test(test_f1(threshold=0.6))
    .add_test(test_accuracy(threshold=1))  # Certain to fail
)


@pytest.fixture
def dataset():
    return wrapped_dataset


@pytest.fixture
def model():
    return wrapped_model


# Single wrapped test
def test_only_accuracy(dataset, model):
    test_accuracy(model=model, dataset=dataset, threshold=1).assert_()


# Parametrise tests from suite
@pytest.mark.parametrize("test_partial", suite.to_unittest(), ids=lambda t: t.fullname)
def test_giskard(test_partial):
    test_partial.assert_()