RAG Evaluation Tools and Their Metrics

1、TruLens

(1)Context Relevance: checks the retrieval step by verifying that each chunk of retrieved context is relevant to the input query; context relevance is evaluated using the structure of the serialized record.

(2)Groundedness: separates the response into individual claims and independently searches for evidence that supports each claim within the retrieved context.

(3)Answer Relevance: evaluates the relevance of the final response to the original query.
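These three checks together cover query-to-context, context-to-answer, and query-to-answer. A minimal sketch of their shapes, using word overlap as a toy stand-in for the LLM judge (the function names and the overlap heuristic are illustrative, not the TruLens API):

```python
# Toy versions of the three TruLens-style checks. A real setup would ask
# an LLM judge for each verdict; word overlap stands in for that here.

def _overlap(a: str, b: str) -> float:
    """Fraction of words in `a` that also appear in `b` (toy relevance proxy)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

def context_relevance(query: str, chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the input query."""
    return [_overlap(query, c) for c in chunks]

def groundedness(answer: str, context: str) -> float:
    """Split the answer into claims (here: sentences) and check each for
    supporting evidence in the retrieved context."""
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = [c for c in claims if _overlap(c, context) >= 0.5]
    return len(supported) / len(claims) if claims else 0.0

def answer_relevance(query: str, answer: str) -> float:
    """Score the final response against the query."""
    return _overlap(query, answer)
```

The three functions mirror the three edges of the pipeline: retrieval quality, generation grounding, and end-to-end relevance.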

2、Ragas

(1)Context Precision: evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all relevant chunks should appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
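A sketch of the rank-weighted computation, with the per-chunk relevance verdicts taken as input (in Ragas those verdicts come from an LLM given the question; producing them is out of scope here):

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks that hold relevant chunks.

    `relevance[k]` marks whether the chunk at rank k is relevant.
    Relevant chunks placed at top ranks score higher than the same
    chunks placed lower, which is exactly what the metric rewards.
    """
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / hits if hits else 0.0
```

Note how `[True, False]` scores higher than `[False, True]`: the same single relevant chunk is worth more at rank 1 than at rank 2.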

(2)Context Recall: measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Each sentence in the ground-truth answer is analyzed to determine whether it can be attributed to the retrieved context. Ideally, every sentence in the ground-truth answer should be attributable to the retrieved context.
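The sentence-level attribution described above can be sketched as follows; a word-overlap threshold stands in for the LLM's attribution judgment, which is the part Ragas actually delegates to a model:

```python
def context_recall(ground_truth: str, context: str) -> float:
    """Fraction of ground-truth sentences attributable to the retrieved
    context (toy heuristic in place of an LLM attribution call)."""
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    ctx_words = set(context.lower().split())

    def attributable(sentence: str) -> bool:
        words = set(sentence.lower().split())
        return len(words & ctx_words) / len(words) >= 0.5

    hits = sum(attributable(s) for s in sentences)
    return hits / len(sentences) if sentences else 0.0
```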

(3)Faithfulness: measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range; higher is better.

The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context.
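In other words, faithfulness reduces to the ratio of supported claims to total claims. A sketch, with the per-claim verdicts taken as input (in Ragas an LLM first extracts the claims from the answer and then judges each one against the context):

```python
def faithfulness(claim_verdicts: list[bool]) -> float:
    """Faithfulness = claims inferable from the context / total claims.
    `claim_verdicts[i]` is the (LLM-judged) verdict for claim i."""
    return sum(claim_verdicts) / len(claim_verdicts) if claim_verdicts else 0.0
```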

(4)Answer Relevancy: assesses how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0 and 1, where higher scores indicate better relevancy.

This is a reference-free metric. If you are looking to compare the ground-truth answer with the generated answer, refer to answer_correctness.

To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
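The final averaging step can be sketched as below. The generated questions are assumed to have already been produced by the LLM, and a toy bag-of-words vector stands in for a real embedding model:

```python
import math
from collections import Counter

def _embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would use an embedding model."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and the
    questions the LLM generated back from the answer."""
    q = _embed(question)
    sims = [_cosine(q, _embed(g)) for g in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0
```

If the answer drifts off-topic, the questions regenerated from it diverge from the original question and the mean similarity drops.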

3、Tonic Validate

(1)Answer similarity: measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer. This score is an end-to-end test of the RAG LLM.

(2)Answer consistency: the percentage of the RAG system's answer that can be attributed to the retrieved context; a float between 0 and 1.

(3)Retrieval precision: the percentage of retrieved context that is relevant to answering the question. For each context vector, the LLM evaluator is asked whether the context is relevant for answering the question. A float between 0 and 1.

(4)Augmentation precision: the percentage of information from the relevant context that appears in the answer. A float between 0 and 1.

(5)Augmentation accuracy: the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system.
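Metrics (2)-(5) above are all ratios over per-item LLM-judge verdicts. A sketch with the verdicts supplied as inputs (the example flag lists are made up for illustration; the LLM calls that would produce them are out of scope):

```python
# Tonic Validate-style ratio metrics, sketched over pre-computed verdicts.

def ratio(flags: list[bool]) -> float:
    """Fraction of True verdicts; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

# Retrieval precision: which retrieved chunks are relevant to the question?
retrieval_precision = ratio([True, True, False])       # 2 of 3 chunks relevant

# Augmentation precision: which relevant chunks made it into the answer?
augmentation_precision = ratio([True, False])          # 1 of 2 relevant chunks used

# Answer consistency: which answer statements trace back to the context?
answer_consistency = ratio([True, True, True, False])  # 3 of 4 statements grounded
```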