Building a Large Model from Scratch, Part 4: Hand-Building an Eval Evaluator

Latest recommended article published on 2024-07-24 22:13:06

1o0.0o1


Tags: algorithms, python, artificial intelligence

Article link: https://blog.csdn.net/qq_60489376/article/details/139246877


Hand-Building an Eval Evaluator

Common LLM evaluation metrics
Code implementation
References

Common LLM evaluation metrics

The overall workflow of LLM evaluation is shown in the figure below:
[Figure: LLM evaluation workflow]

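First, choose a suitable evaluation metric for the task type of the target dataset.
Write a guiding prompt for the model that fits the form of the target data.
Pick a suitable extraction method based on the model's raw predictions.
Compute a score for each pred and answer pair.

The evaluation metrics are introduced below.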
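The four steps above can be sketched as a tiny evaluation loop. This is only an illustration under stated assumptions: `exact_match`, `PROMPT`, `extract_answer`, `evaluate`, and `dummy_model` are hypothetical names, and the dummy model stands in for a real LLM call.

```python
from typing import Callable, Dict, List


def exact_match(pred: str, answer: str) -> float:
    """Step 1: a deliberately simple metric (1.0 on a case-insensitive exact match)."""
    return 1.0 if pred.strip().lower() == answer.strip().lower() else 0.0


# Step 2: a hypothetical guiding prompt for a short-answer task
PROMPT = "Question: {question}\nAnswer with a single word."


def extract_answer(raw_output: str) -> str:
    """Step 3: keep only the first line of the raw model output."""
    lines = raw_output.strip().splitlines()
    return lines[0] if lines else ""


def evaluate(model: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Step 4: score every (pred, answer) pair and average the scores."""
    scores = []
    for sample in dataset:
        raw = model(PROMPT.format(question=sample["question"]))
        scores.append(exact_match(extract_answer(raw), sample["answer"]))
    return sum(scores) / len(scores) if scores else 0.0


def dummy_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it always answers 'Paris'."""
    return "Paris\n(extra explanation that the extractor drops)"


data = [
    {"question": "Capital of France?", "answer": "paris"},
    {"question": "Capital of Japan?", "answer": "tokyo"},
]
print(evaluate(dummy_model, data))  # 0.5
```

Swapping in a different metric, prompt, or extractor changes one function at a time, which is the point of keeping the four steps separate.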

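F1 score

The F1 score is one of the key metrics for evaluating classification models. It combines precision and recall into a single number that balances the two.

So how does F1 apply to evaluating large models? To answer that, we first introduce the two quantities F1 is built from, precision and recall. Both of them are defined in terms of the confusion matrix, so we start there.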

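Confusion matrix

A confusion matrix describes the performance of a classification model by comparing the model's predictions with the actual labels. A typical confusion matrix for binary classification looks like this:
$\begin{array}{c|cc} & \text{Actual positive} & \text{Actual negative} \\ \hline \text{Predicted positive} & \text{TP} & \text{FP} \\ \text{Predicted negative} & \text{FN} & \text{TN} \\ \end{array}$
where:

TP (True Positives): the number of samples correctly predicted as positive.
FP (False Positives): the number of samples incorrectly predicted as positive.
FN (False Negatives): the number of samples incorrectly predicted as negative.
TN (True Negatives): the number of samples correctly predicted as negative.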
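As a rough illustration, the four counts can be tallied directly from two label lists. The function name `confusion_counts` and the sample labels are made up for this sketch; it assumes binary labels where 1 means the positive class.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted positive, actually positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```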

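Precision

Precision: the fraction of samples predicted as positive that are actually positive.
Formula:
$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$

Recall

Recall: the fraction of actually positive samples that are correctly predicted as positive.
Formula:
$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$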

F1 score

The F1 score is the harmonic mean of precision and recall.
Formula:
$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
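A minimal sketch of the three formulas, starting from raw counts. The function name `precision_recall_f1` and the counts are illustrative only, and returning 0.0 when a denominator is zero is a convention chosen here, not something the formulas dictate.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumed convention: a zero denominator yields 0.0 instead of an error
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two values, so a model cannot score well by maximizing only precision or only recall.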

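Using the F1 score with large models

Here is a simple example of how F1 is used to evaluate a large model:

Suppose we have a predicted text and a ground-truth text:

Predicted text (Prediction): "The cat is on the mat"
Ground-truth text (Ground Truth): "There is a cat on the mat"

First, text preprocessing

Predicted text preprocessing:
Original text: "The cat is on the mat"
Lowercased: "the cat is on the mat"
Punctuation removed: "the cat is on the mat"
Articles removed: "cat is on mat"

Ground-truth text preprocessing:
Original text: "There is a cat on the mat"
Lowercased: "there is a cat on the mat"
Punctuation removed: "there is a cat on the mat"
Articles removed: "there is cat on mat"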

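Tokenization

Tokenize the preprocessed text (for English, splitting on whitespace is enough):
Predicted text tokens:
["cat", "is", "on", "mat"]
Ground-truth text tokens:
["there", "is", "cat", "on", "mat"]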

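Computing the F1 score

Compute the F1 score from the tokens:
Find the common tokens:
Tokens in the predicted text: {"cat", "is", "on", "mat"}
Tokens in the ground-truth text: {"there", "is", "cat", "on", "mat"}
Common tokens: {"cat", "is", "on", "mat"}
Number of common tokens: 4
Compute precision:
P = number of common tokens / number of predicted tokens = 4 / 4 = 1.0
Compute recall:
R = number of common tokens / number of ground-truth tokens = 4 / 5 = 0.8
Compute the F1 score:
F1 = (2 * P * R) / (P + R) = (2 * 1.0 * 0.8) / (1.0 + 0.8) = 0.8889
So the F1 score for this example is 0.8889.

This example shows that the higher the F1 score, the closer the LLM's output is to the correct answer, and the better the model performs on the task.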
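The arithmetic above can be checked with a few lines of Python. This is a compact sketch of the same three stages (preprocess, tokenize, score); the helper names `normalize` and `token_f1` are chosen for this example, and the article's fuller implementation appears in the code section.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, drop articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a prediction and a ground-truth string."""
    pred_tokens = normalize(prediction)
    truth_tokens = normalize(ground_truth)
    # Counter intersection keeps each shared token with its minimum count
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(normalize("There is a cat on the mat"))  # ['there', 'is', 'cat', 'on', 'mat']
print(round(token_f1("The cat is on the mat", "There is a cat on the mat"), 4))  # 0.8889
```

Note that the word-boundary pattern removes "the" but leaves "there" untouched, which is why the ground truth keeps five tokens.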

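BLEU

BLEU is a widely used metric for evaluating machine translation output. It measures the similarity between generated text and reference text by counting overlapping n-grams.

It is built on n-grams: n consecutive words are treated as a single unit for comparison.

Again, an example makes this concrete.

Suppose we have the following translation output and reference translation:
Generated text (Candidate): "the cat is on the mat"
Reference text (Reference): "the cat sat on the mat"
Steps:
Tokenization:
Generated text tokens: ["the", "cat", "is", "on", "the", "mat"]
Reference text tokens: ["the", "cat", "sat", "on", "the", "mat"]
Count the n-gram overlap:
1-gram matches: 5 ("the", "cat", "on", "the", "mat")
2-gram matches: 3 ("the cat", "on the", "the mat")
3-gram matches: 1 ("on the mat")
4-gram matches: 0
Compute the n-gram precisions:
1-gram precision: 5/6 ≈ 0.8333
2-gram precision: 3/5 = 0.6
3-gram precision: 1/4 = 0.25
4-gram precision: 0/3 = 0.0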
Compute the BLEU score:
A brevity penalty (BP) is usually introduced to penalize generated text that is too short. When the candidate is longer than the reference, BP is simply 1; otherwise:
$BP = \exp\left(1 - \frac{\text{len}(\text{reference\_tokens})}{\text{len}(\text{candidate\_tokens})}\right) = \exp\left(1 - \frac{6}{6}\right) = \exp(0) = 1$
$\begin{aligned} BLEU &= BP \times \exp\left(\frac{1}{4} \left(\log(0.8333) + \log(0.6) + \log(0.25)\right)\right) \\ &= 1 \times \exp\left(\frac{1}{4} \left(-0.1823 + (-0.5108) + (-1.3863)\right)\right) \\ &= 1 \times \exp\left(\frac{1}{4} \left(-2.0794\right)\right) \\ &= \exp\left(-0.51985\right) \\ &\approx 0.595 \end{aligned}$
Since log(0.0) is undefined, the 4-gram term is left out of the sum here. This is a simplification for illustration: in the standard BLEU definition, a zero n-gram precision makes the whole score 0 unless smoothing is applied.
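The n-gram counts and the brevity penalty can be reproduced with a short sketch. It assumes a single reference and no smoothing, and the function names are illustrative. Unlike the walkthrough above, which drops the zero 4-gram term, this version follows the standard convention and returns 0 as soon as any n-gram precision is zero.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: List[str], reference: List[str], n: int) -> Tuple[int, int]:
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Intersection clips each candidate n-gram count at its reference count
    matches = sum((cand_counts & ref_counts).values())
    return matches, sum(cand_counts.values())


def brevity_penalty(candidate: List[str], reference: List[str]) -> float:
    """1.0 if the candidate is longer than the reference, else exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with uniform weights."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(candidate, reference, n)
        if matches == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_sum += math.log(matches / total)
    return brevity_penalty(candidate, reference) * math.exp(log_sum / max_n)


candidate_tokens = "the cat is on the mat".split()
reference_tokens = "the cat sat on the mat".split()
for n in range(1, 5):
    print(n, modified_precision(candidate_tokens, reference_tokens, n))
# 1 (5, 6)
# 2 (3, 5)
# 3 (1, 4)
# 4 (0, 3)
print(bleu(candidate_tokens, reference_tokens))  # 0.0
print(round(bleu(candidate_tokens, reference_tokens, max_n=3), 4))  # 0.5
```

With `max_n=3` the score is the cube root of 5/6 × 3/5 × 1/4 = 0.125, which is 0.5. For real evaluations, an established BLEU implementation with smoothing is usually a safer choice than a hand-rolled one.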

n-gram matching reflects the fluency of LLM-generated text: the higher the score at larger n, the more fluent the generated text.

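ROUGE

ROUGE is a family of metrics for evaluating automatic summarization and machine translation output, with a particular focus on recall. Commonly used variants include ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence, LCS), and ROUGE-W (weighted longest common subsequence).
Example:

Tokenization:

Generated text tokens: ["the", "cat", "is", "on", "the", "mat"]
Reference text tokens: ["the", "cat", "sat", "on", "the", "mat"]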

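ROUGE-1 (unigram overlap)

Count the unigram overlap:
Overlapping unigrams: "the", "cat", "on", "the", "mat", 5 in total (both occurrences of "the" count)
Compute recall, precision, and F1:
Recall = overlapping unigrams / reference unigrams = 5 / 6 ≈ 0.8333
Precision = overlapping unigrams / generated unigrams = 5 / 6 ≈ 0.8333
F1 = 2 * (precision * recall) / (precision + recall) ≈ 0.8333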

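ROUGE-L (longest common subsequence)

Compute the longest common subsequence (LCS):
LCS: "the cat on the mat", length 5

Compute recall, precision, and F1:

Recall = LCS length / reference length = 5 / 6 ≈ 0.8333
Precision = LCS length / generated length = 5 / 6 ≈ 0.8333
F1 = 2 * (precision * recall) / (precision + recall) ≈ 0.8333

The larger the ROUGE value, the more similar the generated text is to the reference, which suggests higher-quality output. In this respect it is somewhat similar to BLEU.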
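The LCS length can be computed with a standard dynamic-programming table. The sketch below uses illustrative function names and the plain F1 from the walkthrough above; the original ROUGE-L definition uses a weighted F-measure, so treat this as a simplified version.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate: List[str], reference: List[str]) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


candidate_tokens = "the cat is on the mat".split()
reference_tokens = "the cat sat on the mat".split()
print(lcs_length(candidate_tokens, reference_tokens))  # 5
print([round(v, 4) for v in rouge_l(candidate_tokens, reference_tokens)])  # [0.8333, 0.8333, 0.8333]
```

Unlike n-gram overlap, the LCS only requires tokens to appear in the same order, not next to each other, so "the cat ... on the mat" still counts as one match of length 5.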

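Code implementation

The code mainly implements the F1 score computation.

Defining the LLM classes

Define the classes for the LLM to be evaluated: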

"""
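Define the LLM module; here we only build a ChatGPT interface.
1. Define a base class first.
2. Inherit from it and fill in the interface.

Methods a base class should have:
1. __init__
2. chat()
3. load_model(): models come in two kinds, local open-source pretrained models and
   models called through an API; only local models need to override this method.

An LLM also needs a prompt template, which we can define directly.
"""
import os
from typing import List

from openai import OpenAI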


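class BaseModel(object):
    def __init__(self, path: str = ""):
        # Path to local model weights; unused by API-based models
        self.path = path

    def chat(self, prompt: str, history: List[dict], **kwargs):
        # Subclasses fill the prompt template with kwargs and return the model's reply
        raise NotImplementedError

    def load_model(self):
        # Only models loaded from local weights need to override this
        pass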


class OpenAIChatModel(BaseModel):
    def __init__(self, path: str = "", model: str = "gpt-3.5-turbo-0125"):
        """

        :param path: not used when calling the API
        :param model: name of the GPT model to call
        """
        super().__init__(path)
        self.model = model

    def chat(self, prompt: str, history: List[dict], **kwargs):
        # Pass the key and base URL to the client constructor explicitly
        self.client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url=os.getenv("OPENAI_BASE_URL"),
        )
        # Fill the prompt template and append it to the history as the user turn
        history.append(
            {
                "role": "user",
                "content": prompt.format(**kwargs),
            }
        )
        response = self.client.chat.completions.create(
            model=self.model, messages=history, max_tokens=2000, temperature=0.1
        )
        return response.choices[0].message.content

    # GPT is called through the API, so there is nothing to load locally


class OneAPI(BaseModel):
    def __init__(self, path: str = "", model: str = ""):
        """
        Wraps calls to various LLM APIs through a One API relay.
        :param path: not used
        :param model: name of the model to call; the relay must support it
        """
        super().__init__(path)
        self.model = model

    def chat(self, prompt: str, history: List[dict], **kwargs):
        # Build the client from the ONE_* variables; a bare OpenAI() call
        # would require OPENAI_API_KEY to be set instead
        self.client = OpenAI(
            api_key=os.getenv("ONE_API_KEY"),
            base_url=os.getenv("ONE_BASE_URL"),
        )
        # Fill the prompt template and append it to the history as the user turn
        history.append(
            {
                "role": "user",
                "content": prompt.format(**kwargs),
            }
        )
        response = self.client.chat.completions.create(
            model=self.model, messages=history, max_tokens=4096, temperature=0.1
        )
        return response.choices[0].message.content

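Defining the evaluation functions

Only the F1 evaluation method is implemented here, with separate versions for Chinese and English.

Chinese text is tokenized with jieba; English text is split on whitespace.
Both are preprocessed.
The F1 metric is then computed.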

"""
Build the evaluation utilities; this module mainly implements F1 score computation.

"""

import logging
import re
import string
from collections import Counter

import jieba

jieba.setLogLevel(logging.INFO)  # set jieba's log level to INFO to reduce console output

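# Text preprocessing functions
def normalize_zh_answer(s):
    """Preprocess Chinese text: lowercase, remove punctuation, remove whitespace."""

    def white_space_fix(text):
        """Remove all whitespace."""
        return "".join(text.split())

    def remove_punc(text):
        """Remove punctuation (both ASCII and Chinese)."""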
        cn_punctuation = "！？｡。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
        all_punctuation = set(string.punctuation + cn_punctuation)
        return "".join(ch for ch in text if ch not in all_punctuation)

    def lower(text):
        """Lowercase the text."""
        return text.lower()

    return white_space_fix(remove_punc(lower(s)))


def normalize_en_answer(s):
    """Preprocess English text: lowercase, remove punctuation, remove articles and extra whitespace."""

    def remove_articles(text):
        """Remove the articles a, an, and the."""
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        """Collapse runs of whitespace into single spaces."""
        return " ".join(text.split())

    def remove_punc(text):
        """Remove punctuation."""
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        """Lowercase the text."""
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


# F1 score computation
def f1_score_evaluation(prediction, ground_truth):
    """
    Compute the F1 score between two token lists.
    :param prediction: list of predicted tokens
    :param ground_truth: list of ground-truth tokens
    :return: F1 score
    """
    # Count the tokens shared by the prediction and the ground truth
    common = Counter(prediction) & Counter(ground_truth)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    # Compute precision and recall
    precision = num_same / len(prediction)
    recall = num_same / len(ground_truth)
    # Compute the F1 score
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


# F1 score for English text
def qa_f1_score(prediction, ground_truth):
    """
    Compute the F1 score for English question answering.
    :param prediction: predicted text
    :param ground_truth: ground-truth text
    :return: F1 score
    """
    # Preprocess the predicted and ground-truth text
    normalized_prediction = normalize_en_answer(prediction)
    normalized_ground_truth = normalize_en_answer(ground_truth)
    # Tokenize on whitespace
    prediction_tokens = normalized_prediction.split()
    ground_truth_tokens = normalized_ground_truth.split()
    # Compute the F1 score
    return f1_score_evaluation(prediction_tokens, ground_truth_tokens)


# F1 score for Chinese text
def qa_f1_zh_score(prediction, ground_truth):
    """
    Compute the F1 score for Chinese question answering.
    :param prediction: predicted text
    :param ground_truth: ground-truth text
    :return: F1 score
    """
    # Tokenize with jieba
    prediction_tokens = list(jieba.cut(prediction, cut_all=False))
    ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False))
    # Normalize each token
    prediction_tokens_norm = [normalize_zh_answer(t) for t in prediction_tokens]
    ground_truth_tokens_norm = [normalize_zh_answer(t) for t in ground_truth_tokens]
    # Drop tokens that became empty after normalization
    prediction_tokens = [t for t in prediction_tokens_norm if len(t) > 0]
    ground_truth_tokens = [t for t in ground_truth_tokens_norm if len(t) > 0]
    # Compute the F1 score
    return f1_score_evaluation(prediction_tokens, ground_truth_tokens)

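Defining the prompt template

The templates can be kept in a dictionary so that each dataset uses its own template. The prompt should also pin down the model's output format so that the answer is easy to extract. This example uses the gpt-3.5 model; since GPT can return JSON directly, the reply is simply loaded with json.loads. Other models may need the prompt to be tuned more carefully.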

PROMPT_TEMPLATE = {
    "NEWS_SUMMARY_PROMPT": """
        Please write a one-page summary based on the following news content.
        
        News:
        {context}

        Now, please write a one-page summary of all the news.

        Return the result in JSON format:
        {{
            "summary": ""
        }}
    """,
}
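Two details of this template are easy to miss: the doubled braces survive `str.format` as literal braces, and a model may wrap its JSON in extra text. The sketch below uses a shortened copy of the template and a hypothetical helper, `extract_summary`, that falls back to the first brace-delimited span when the reply is not pure JSON. It is one possible approach, not the article's method.

```python
import json
import re

# Shortened copy of the template, for illustration only
TEMPLATE = """
News:
{context}

Return the result in JSON format:
{{
    "summary": ""
}}
"""


def extract_summary(reply: str) -> str:
    """Parse the reply as JSON; fall back to the outermost {...} span if needed."""
    try:
        return json.loads(reply)["summary"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Greedy match from the first "{" to the last "}" in the reply
        match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in model reply")
        return json.loads(match.group(0))["summary"]


prompt = TEMPLATE.format(context="Stocks rose on Monday.")
print("{context}" in prompt, '"summary": ""' in prompt)  # False True

reply = 'Here is the result:\n{"summary": "Stocks rose."}\nHope this helps.'
print(extract_summary(reply))  # Stocks rose.
```

A stricter prompt reduces how often the fallback is needed, but having one keeps a single chatty reply from aborting a whole evaluation run.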

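Putting it together

Here we evaluate on the multi_news.jsonl dataset.
The data looks roughly like this:
[Figure: sample records from multi_news.jsonl]
The context field holds the news content and answers holds the reference summary. The final evaluation compares the LLM's generated summary against this answer.

The complete integration code is as follows: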

import json
import os

import pandas as pd
from utils.prompt import PROMPT_TEMPLATE
from utils.eval import qa_f1_score
from utils.LLM import OpenAIChatModel

# Set the environment variables (fill in your own key and base URL)
os.environ["OPENAI_API_KEY"] = ""
os.environ["OPENAI_BASE_URL"] = ""
# Read the JSONL file
file_path = "./datawhale组队学习/tinyEval/data/multi_news.jsonl"
data = []
with open(file_path, "r", encoding="utf-8") as file:
    for line in file:
        json_object = json.loads(line)
        data.append(json_object)

# Convert to a DataFrame and drop the unused column
df = pd.DataFrame(data)
df = df.drop(columns=["input"])

# Initialize the LLM
llm = OpenAIChatModel()

# Generate summaries for the first 10 news items
results = []
for content in df["context"][:10].to_list():
    try:
        result = llm.chat(PROMPT_TEMPLATE["NEWS_SUMMARY_PROMPT"], [], context=content)
        result = json.loads(result)["summary"]
        results.append(result)
    except Exception as exc:
        # If the API call or JSON parsing fails, stop here so that
        # results stay aligned with the answers
        print(f"Stopping early: {exc}")
        break

# Extract the reference answers
answers = df["answers"].apply(lambda x: x[0]).to_list()

# Align results and answers so both have the same length
min_length = min(len(results), len(answers))
results = results[:min_length]
answers = answers[:min_length]

# Compute the F1 score for each (prediction, reference) pair
f1_scores = [qa_f1_score(pred, true) for pred, true in zip(results, answers)]


# Compute the average F1 score (0.0 if no sample was scored)
average_f1 = sum(f1_scores) / len(f1_scores) if f1_scores else 0.0
print(f"Average F1 Score: {average_f1}")

# Print all F1 scores
print(f1_scores)

Final result:
Only the average over the first 10 samples is reported here.
[Figure: screenshot of the program output]

Average F1 Score: 0.30378376453565803
[0.33230769230769225, 0.369047619047619, 0.3072100313479624, 0.35119047619047616, 0.1945525291828794, 0.2845849802371542, 0.3561643835616438, 0.15625, 0.28409090909090906, 0.4024390243902439]

References

Datawhale: hand-building an Eval (includes the dataset)
