Using LLMs to evaluate text quality
We compared three kinds of reference-free evaluation methods. The experimental results show that ChatGPT is capable of evaluating text quality effectively from various perspectives without references and outperforms most existing automatic metrics. In particular, the Explicit Score (asking the model to give a score directly), which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored. However, directly comparing the quality of two texts may lead to sub-optimal results. We believe this paper will provide valuable insights for evaluating text quality with LLMs and have released the used data.
How accurately can ChatGPT assess text quality without references?
It is feasible for ChatGPT to evaluate text quality without references, and it outperforms commonly used metrics even with a simple prompt design.
What is the most effective approach to evaluate text quality using ChatGPT?
Generally, using ChatGPT to generate an explicit score for text quality is the best and most stable method among the three we compared. We suggest using greedy decoding for more reliable results.
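As a rough illustration of the explicit-score approach, the sketch below builds a scoring prompt, sends it to a model, and parses the numeric reply. The prompt wording, the 1 to 10 scale, and the `generate` callable are all assumptions made for this example and are not taken from the paper; `generate` stands in for whatever LLM client is in use and should be configured for greedy decoding (for instance, temperature 0) so that repeated calls return the same score.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt; the paper's actual prompts are not reproduced here.
PROMPT_TEMPLATE = (
    "Score the following text for {aspect} on a scale from {low} to {high}, "
    "where a higher score means better quality. Reply with the number only.\n\n"
    "Text: {text}\n"
    "Score:"
)


def build_prompt(text: str, aspect: str = "fluency", low: int = 1, high: int = 10) -> str:
    """Fill the (assumed) template with the text and the aspect to judge."""
    return PROMPT_TEMPLATE.format(text=text, aspect=aspect, low=low, high=high)


def parse_score(reply: str, low: int = 1, high: int = 10) -> Optional[float]:
    """Return the first number in the reply, or None if it is missing or out of range."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    if not low <= value <= high:
        return None
    return value


def explicit_score(
    text: str,
    generate: Callable[[str], str],
    aspect: str = "fluency",
    low: int = 1,
    high: int = 10,
) -> Optional[float]:
    """Ask the model for a numeric score and parse it.

    `generate` is a placeholder for an LLM call (prompt in, reply text out).
    """
    reply = generate(build_prompt(text, aspect, low, high))
    return parse_score(reply, low, high)
```

Returning `None` for unparseable or out-of-range replies keeps malformed outputs from being silently counted as scores; how to handle them (retry, skip, default) is left to the caller.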
Why may directly comparing two texts using ChatGPT yield suboptimal results?
The main reason is that it is hard to define what counts as high-quality text.
Why is Implicit Score generally less effective than Explicit Score?
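The paper ran this experiment with a text-davinci model. The results show that the Implicit Score distribution is narrow and sharply peaked, whereas the Explicit Score distribution is smoother.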
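To make the contrast concrete, here is one way an implicit score could be computed from token log-probabilities. This is only a sketch under assumptions: it supposes the model is asked a yes/no question about the text and that the API returns log-probabilities for candidate first tokens; the normalized "Yes" probability used below is an illustrative choice, not necessarily the formula used in the paper.

```python
import math
from typing import Dict


def implicit_score(token_logprobs: Dict[str, float]) -> float:
    """Normalized probability of "yes" among the yes/no candidate tokens.

    `token_logprobs` maps candidate first tokens to their log-probabilities
    (a hypothetical input format). Variants such as " Yes" or "yes" are
    pooled together. Returns 0.5 when neither answer appears.
    """
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in token_logprobs.items():
        normalized = token.strip().lower()
        if normalized == "yes":
            p_yes += math.exp(logprob)
        elif normalized == "no":
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        return 0.5
    return p_yes / total
```

Unlike the explicit score, which is read from generated text, this value depends entirely on the model's token probabilities, which is why the shape of its distribution can differ from that of the explicit scores.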
References
https://arxiv.org/pdf/2304.00723.pdf