RAG Evaluation Tools and Their Metrics

1、TruLens

(1)Context Relevance: checks the retrieval step by verifying that each chunk of retrieved context is relevant to the input query; context relevance is evaluated using the structure of the serialized record.

(2)Groundedness: separates the response into individual claims and independently searches for evidence that supports each claim within the retrieved context.

(3)Answer Relevance: evaluates the relevance of the final response to the original query.
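These three checks together cover query-to-context, context-to-answer, and query-to-answer. A minimal sketch of their shapes, using word overlap as a toy stand-in for the LLM judge (the function names and the overlap heuristic are illustrative, not the TruLens API):

```python
# Toy versions of the three TruLens-style checks. A real setup would ask
# an LLM judge for each verdict; word overlap stands in for that here.

def _overlap(a: str, b: str) -> float:
    """Fraction of words in `a` that also appear in `b` (toy relevance proxy)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

def context_relevance(query: str, chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the input query."""
    return [_overlap(query, c) for c in chunks]

def groundedness(answer: str, context: str) -> float:
    """Split the answer into claims (here: sentences) and check each for
    supporting evidence in the retrieved context."""
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = [c for c in claims if _overlap(c, context) >= 0.5]
    return len(supported) / len(claims) if claims else 0.0

def answer_relevance(query: str, answer: str) -> float:
    """Score the final response against the query."""
    return _overlap(query, answer)
```

The three functions mirror the three edges of the pipeline: retrieval quality, generation grounding, and end-to-end relevance.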

2、Ragas

(1)Context Precision: evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all relevant chunks should appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
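A sketch of the rank-weighted computation, with the per-chunk relevance verdicts taken as input (in Ragas those verdicts come from an LLM given the question; producing them is out of scope here):

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks that hold relevant chunks.

    `relevance[k]` marks whether the chunk at rank k is relevant.
    Relevant chunks placed at top ranks score higher than the same
    chunks placed lower, which is exactly what the metric rewards.
    """
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / hits if hits else 0.0
```

Note how `[True, False]` scores higher than `[False, True]`: the same single relevant chunk is worth more at rank 1 than at rank 2.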

(2)Context Recall: measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Each sentence in the ground-truth answer is analyzed to determine whether it can be attributed to the retrieved context. Ideally, every sentence in the ground-truth answer should be attributable to the retrieved context.
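The sentence-level attribution described above can be sketched as follows; a word-overlap threshold stands in for the LLM's attribution judgment, which is the part Ragas actually delegates to a model:

```python
def context_recall(ground_truth: str, context: str) -> float:
    """Fraction of ground-truth sentences attributable to the retrieved
    context (toy heuristic in place of an LLM attribution call)."""
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    ctx_words = set(context.lower().split())

    def attributable(sentence: str) -> bool:
        words = set(sentence.lower().split())
        return len(words & ctx_words) / len(words) >= 0.5

    hits = sum(attributable(s) for s in sentences)
    return hits / len(sentences) if sentences else 0.0
```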

(3)Faithfulness: measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range; higher is better.

The generated answer is regarded as faithful if all the claims that are made in the answer can be inferred from the given context.
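In other words, faithfulness reduces to the ratio of supported claims to total claims. A sketch, with the per-claim verdicts taken as input (in Ragas an LLM first extracts the claims from the answer and then judges each one against the context):

```python
def faithfulness(claim_verdicts: list[bool]) -> float:
    """Faithfulness = claims inferable from the context / total claims.
    `claim_verdicts[i]` is the (LLM-judged) verdict for claim i."""
    return sum(claim_verdicts) / len(claim_verdicts) if claim_verdicts else 0.0
```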

(4)Answer Relevancy: assesses how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0 and 1, where higher scores indicate better relevancy.

This is a reference-free metric. If you are looking to compare the ground-truth answer with the generated answer, refer to answer_correctness.

To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
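The final averaging step can be sketched as below. The generated questions are assumed to have already been produced by the LLM, and a toy bag-of-words vector stands in for a real embedding model:

```python
import math
from collections import Counter

def _embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would use an embedding model."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and the
    questions the LLM generated back from the answer."""
    q = _embed(question)
    sims = [_cosine(q, _embed(g)) for g in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0
```

If the answer drifts off-topic, the questions regenerated from it diverge from the original question and the mean similarity drops.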

3、Tonic Validate

(1)Answer similarity: measures, on a scale from 0 to 5, how well the answer from the RAG system corresponds in meaning to a reference answer. This score is an end-to-end test of the RAG LLM.

(2)Answer consistency: the percentage of the RAG system's answer that can be attributed to the retrieved context; a float between 0 and 1.

(3)Retrieval precision: the percentage of retrieved context that is relevant to answering the question. For each context vector, the LLM evaluator is asked whether the context is relevant for answering the question. A float between 0 and 1.

(4)Augmentation precision: the percentage of information from the relevant context that appears in the answer. A float between 0 and 1.

(5)Augmentation accuracy: the percentage of retrieved context for which some portion of the context appears in the answer from the RAG system.
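Metrics (2)-(5) above are all ratios over per-item LLM-judge verdicts. A sketch with the verdicts supplied as inputs (the example flag lists are made up for illustration; the LLM calls that would produce them are out of scope):

```python
# Tonic Validate-style ratio metrics, sketched over pre-computed verdicts.

def ratio(flags: list[bool]) -> float:
    """Fraction of True verdicts; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

# Retrieval precision: which retrieved chunks are relevant to the question?
retrieval_precision = ratio([True, True, False])       # 2 of 3 chunks relevant

# Augmentation precision: which relevant chunks made it into the answer?
augmentation_precision = ratio([True, False])          # 1 of 2 relevant chunks used

# Answer consistency: which answer statements trace back to the context?
answer_consistency = ratio([True, True, True, False])  # 3 of 4 statements grounded
```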