The Three Most Common Datasets for Clarifying Questions

Qulac

[aliannejadi/qulac](https://github.com/aliannejadi/qulac): Qulac: A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations.

qulac.json:

qulac.json contains the topics, facets, questions, and answers. This is the main file of Qulac. However, it may not be very straightforward to use this file for experiments directly. That is why we have provided some auxiliary data files which we describe in this document. In the qulac.json file, you will find these fields:

  • topic_id: the ID of the topic in TREC Web Track.
  • facet_id: the ID of the facet in TREC Web Track.
  • topic_facet_id: an ID corresponding to a topic and facet pair in the following format: %d-%d. For example, 21-1 corresponds to the first facet (facet_id=1) of the 21st topic in TREC Web Track data.
  • topic_facet_question_id: an ID corresponding to a topic, facet, and question triplet in the following format: %d-%d-%d. For example, 21-1-5 corresponds to the fifth question of the first facet of the 21st topic. Each row of the data is identified by this ID.
  • topic: the TREC topic (query).
  • topic_type: an str value indicating the type of a topic. Possible values are faceted and ambiguous.
  • facet_type: an str value indicating the type of a facet. Possible values are inf (i.e., informational) and nav (i.e., navigational).
  • topic_desc: a full description of the topic as it appears in the TREC Web Track data.
  • facet_desc: a full description of the facet (information need) as it appears in the TREC Web Track data.
  • question: a clarifying question that the system can pose to the user for the current topic and facet.
  • answer: an answer to the clarifying question, assuming that the user is in the context of the current row (i.e., the user’s initial query is topic, their information need is facet, and question has been posed to the user).
| topic_id | facet_id | topic_facet_id | topic_facet_question_id | topic | topic_type | facet_type | topic_desc | facet_desc | question | answer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 193 | 2 | 193-2 | 193-2-5 | dog clean up bags | faceted | inf | Can I order dog clean-up bags online? | Are there biodegradable products for the dispo… | are you looking for a way to dispose your dog … | im looking for dog waste bags that are biodegr… |
| 144 | 2 | 144-2 | 144-2-5 | trombone for sale | ambiguous | inf | information on where I could buy a new or used… | good places to sell a used trombone | are you looking for a place to sell a used tro… | yes |
| 78 | 3 | 78-3 | 78-3-7 | dieting | ambiguous | inf | Find “reasonable” dieting advice, that is no… | Find crash diet plans that promise quick weigh… | do you want to know if dieting is safe | i would like to know more on quick and safe di… |
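
A minimal sketch of loading qulac.json into a DataFrame. It assumes the file is in a layout that `pandas.read_json` understands (e.g., a column-oriented dict); if not, fall back to `json.load` and build the frame manually:

```python
import pandas as pd

# Load qulac.json; assumes a pandas-readable JSON layout.
qulac = pd.read_json("qulac.json")

# Each row is one (topic, facet, question, answer) record,
# keyed by topic_facet_question_id.
print(qulac[["topic_facet_question_id", "topic", "question", "answer"]].head())

# Example: all clarifying questions collected for one topic-facet pair.
print(qulac[qulac["topic_facet_id"] == "193-2"]["question"].tolist())
```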

qulac_hist012_dict.tar.gz:

qulac_hist012_dict.tar.gz can be used for experiments involving multi-turn conversations. As mentioned in [1], the conversations are artificially generated from the data available in qulac.json. After decompression, the dict has the following structure:

{ <record_id>: 
	{ 
	  'history_id': <the ID of conversation history (context)>,
	  'history_list': [
				{ 'question': <question1 string>,
				  'answer': <answer1 string> },
				{ 'question': <question2 string>,
				  'answer': <answer2 string> },
			    ],
	 'query': <query (topic) string>,
	 'question': <current question string>,
	 'answer': <current answer string>
  }
  ....
}
  • Record ID:

    topic_id - facet_id - past_question_id_1 - past_question_id_2 - current_question_id - answer_flag
    
    • answer_flag indicates whether the record refers to results obtained with (=1) or without (=0) the final answer. Records with shorter histories simply have fewer past-question components (e.g., 25-1-3-8-1 below has a single past question).
 '18-2-1-2-10-1': {	 
	'history_id': '18-2-1-2',
	'history_list': [{'answer': 'no i just want to find spreadsheets and templates',
			'question': 'are you interested in a service for wedding budgeting'},
			{'answer': 'yes i want to find some spreadsheets to help me budget',
			'question': 'are you looking for advice on wedding budgeting'}],
	'query': 'wedding budget calculator',
	'question': 'what is your projected budget for your wedding',
	'answer': 'i need to find a spreadsheet to figure it out'},

'25-1-3-8-1' : {	 
	'history_id': '25-1-3',
	'history_list': [{'answer': 'no i am looking for information on the greek mathematician euclid',
			'question': 'do you need directions to euclid ave'}],
	'query': 'euclid',
	'question': 'do you want to know related people',
	'answer': 'no i only want to know about one particular person'}
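
A short sketch of working with the decompressed dict. The pickle file name is an assumption (use whatever the archive actually unpacks to); the record-ID parsing follows the format above, with a variable number of past-question components:

```python
import pickle

# Assumed file name after decompressing qulac_hist012_dict.tar.gz.
with open("qulac_hist012_dict.pkl", "rb") as f:
    hist_dict = pickle.load(f)

record_id = "18-2-1-2-10-1"
record = hist_dict[record_id]

# topic - facet - zero or more past question IDs - current question ID - answer flag.
topic_id, facet_id, *past_qids, current_qid, answer_flag = record_id.split("-")

# Replay the conversation: context turns first, then the current turn.
for turn in record["history_list"]:
    print("Q:", turn["question"])
    print("A:", turn["answer"])
print("Q:", record["question"])
if answer_flag == "1":  # 1 = record includes the final answer
    print("A:", record["answer"])
```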

MIMICS

[microsoft/MIMICS](https://github.com/microsoft/MIMICS): MIMICS: A Large-Scale Data Collection for Search Clarification.

Each clarification in MIMICS consists of a clarifying question and up to five candidate answers:

| field | value |
| --- | --- |
| query | headaches |
| question | What do you want to know about this medical condition? |
| candidate answers (options) | symptom, treatment, causes, diagnosis, diet |

MIMICS contains three datasets (a loading sketch follows the list):

  • MIMICS-Click includes over 400k unique queries, their associated clarification panes, and the corresponding aggregated user interaction signals (i.e., clicks).

    ['#HASH#value excel', 'What version of Excel are you looking for?', '2010', '2013', '2016', '', '', 'medium', '0', '0.0', '0.0', '0.0', '0.0', '0.0']

    ['%2f', 'What language are you looking for?', 'javascript', 'python', '', '', '', 'medium', '0', '0.0', '0.0', '0.0', '0.0', '0.0']

    ['.net', 'Select one to refine your search', 'powershell .net', 'iis .net', 'windows .net', 'sql .net', 'exchange .net', 'high', '0', '0.0', '0.0', '0.0', '0.0', '0.0']

    ['.net 3.5 framework', 'Select one to refine your search', 'windows', 'powershell', 'xml', 'azure', 'json', 'high', '3', '0.8571428571428572', '0.0', '0.0', '0.14285714285714285', '0.0']

  • MIMICS-ClickExplore is an exploration dataset that includes aggregated user interaction signals for over 60k unique queries, each with multiple clarification panes.

    | Column(s) | Description |
    | --- | --- |
    | query | (string) The query text. |
    | question | (string) The clarifying question. |
    | option_1, …, option_5 | (string) Up to five candidate answers. |
    | impression_level | (string) A three-level impression label (i.e., low, medium, or high). |
    | engagement_level | (integer) A label in [0, 10] representing total user engagements. |
    | option_cctr_1, …, option_cctr_5 | (real) The conditional click probability on each candidate answer. |

    ['0 degrees', 'Select one to refine your search', 'celsius', 'kelvin', 'fahrenheit', '', '', 'medium', '0', '0.0', '0.0', '0.0', '0.0', '0.0']
    ['0 degrees', 'Select one to refine your search', 'fahrenheit', 'celsius', 'kelvin', '', '', 'medium', '4', '1.0', '0.0', '0.0', '0.0', '0.0']
    ['0 degrees', 'Select one to refine your search', 'boots for 0 degrees', 'gloves for 0 degrees', '', '', '', 'medium', '0', '0.0', '0.0', '0.0', '0.0', '0.0']

  • MIMICS-Manual includes over 2k unique real search queries. Each query-clarification pair in this dataset has been manually labeled by at least three trained annotators. It contains graded quality labels for the clarifying question, the candidate answer set, and the landing result page for each candidate answer.

    | Column(s) | Description |
    | --- | --- |
    | query | (string) The query text. |
    | question | (string) The clarifying question. |
    | option_1, …, option_5 | (string) Up to five candidate answers. |
    | question_label | (integer) A three-level quality label for the clarifying question. |
    | options_overall_label | (integer) A three-level quality label for the candidate answer set. |
    | option_label_1, …, option_label_5 | (integer) A three-level quality label for the landing result page of each candidate answer. |

['multiple system atrophy', 'What do you want to know about this medical condition?', 'symptom', 'treatment', 'causes', 'diagnosis', 'diet', '2', '2', '2', '2', '2', '2', '2']

['team fortress 2', 'What would you like to know about this game?', 'team fortress 2 steam', 'team fortress 2 mods', 'team fortress 2 gameplay', 'team fortress 2 cheats', '', '1', '2', '2', '2', '2', '2', '']

['google chrome exe', 'Select one to refine your search', '64 bit', '32 bit', '', '', '', '', '2', '2', '2', '', '', '']
['google chrome exe', 'Select one to refine your search', '32 bit', '64 bit', '', '', '', '', '2', '2', '2', '', '', '']
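
All three files are plain TSVs with the columns documented above, so they load identically. A minimal sketch with pandas; the file name/path is an assumption, adjust to your checkout:

```python
import pandas as pd

# MIMICS-Click: query, question, option_1..5, impression_level,
# engagement_level, option_cctr_1..5 (ClickExplore shares this layout).
click = pd.read_csv("MIMICS-Click.tsv", sep="\t")

# Panes that attracted clicks, with their conditional click distribution.
cctr_cols = [f"option_cctr_{i}" for i in range(1, 6)]
engaged = click[click["engagement_level"] > 0]
print(engaged[["query", "question", "engagement_level"] + cctr_cols].head())
```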

ClariQ

ConvAI3 Data Challenge

ClariQ is a part of this challenge.

The challenge ran in two stages:

  • stage 1: participants were provided with a static dataset consisting mainly of initial user requests, clarifying questions, and user answers
  • stage 2: human-in-the-loop evaluation

Stage1: initial dataset

The dataset consists of:

  • User Request: an initial user request in conversational form, labeled from 1 to 4 to reflect how much clarification is needed
    • 1: no clarification needed
    • 4: clarification is a must
  • Clarifying questions: a set of possible clarifying questions
  • User Answers: each question is supplied with a user answer

Stage2: human-in-the-loop

Stage 2 enabled the top-performing teams of the first stage to evaluate their models with the help of human evaluators. A system's performance is evaluated in two aspects:

  • how much the conversation helps the user find the information they are looking for
  • how natural and realistic the conversation appears to a human evaluator

ClariQ Dataset

[aliannejadi/ClariQ](https://github.com/aliannejadi/ClariQ): ClariQ: SCAI Workshop data challenge on conversational search clarification.

| Feature | Value |
| --- | --- |
| # train (dev) topics | 187 (50) |
| # faceted topics | 141 |
| # ambiguous topics | 57 |
| # single topics | 39 |
| # facets | 891 |
| # total questions | 3,929 |
| # single-turn conversations | 11,489 |
| # multi-turn conversations | ~1 million |
| # documents | ~2 million |

File Format

train.tsv and dev.tsv

They share the same format and contain topics, facets, questions, answers, and clarification-need labels.

  • topic_id: the ID of the topic (initial_request).
  • initial_request: the query (text) that initiates the conversation.
  • topic_desc: a full description of the topic as it appears in the TREC Web Track data.
  • clarification_need: a label from 1 to 4, indicating how much it is needed to clarify a topic.
  • facet_id: the ID of the facet.
  • facet_desc: a full description of the facet (information need) as it appears in the TREC Web Track data.
  • question_id: the ID of the question as it appears in question_bank.tsv.
  • question: a clarifying question that the system can pose to the user for the current topic and facet.
  • answer: an answer to the clarifying question, assuming that the user is in the context of the current row (i.e., the user’s initial query is initial_request, their information need is facet_desc, and question has been posed to the user).
| topic_id | initial_request | topic_desc | clarification_need | facet_id | facet_desc | question_id | question | answer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | I’m interested in dinosaurs | I want to find information about and pictures of dinosaurs. | 4 | F0159 | Go to the Discovery Channel’s dinosaur site, which has pictures of dinosaurs and games. | Q00173 | are you interested in coloring books | no i just want to find the discovery channels website |
| 14 | I’m interested in dinosaurs | I want to find information about and pictures of dinosaurs. | 4 | F0159 | Go to the Discovery Channel’s dinosaur site, which has pictures of dinosaurs and games. | Q03021 | which dinosaurs are you interested in | im not asking for that i just want to go to the discovery channel dinosaur page |
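
A minimal loading sketch for the single-turn files, assuming standard TSV parsing is sufficient:

```python
import pandas as pd

# train.tsv / dev.tsv share the columns listed above.
train = pd.read_csv("train.tsv", sep="\t")

# Topics that definitely need clarification (label 4), with sample Q/A pairs.
must_clarify = train[train["clarification_need"] == 4]
print(must_clarify[["initial_request", "question", "answer"]].head())
```
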
test.tsv

Contains only the list of test topics and their IDs.

| topic_id | initial_request |
| --- | --- |
| 201 | I would like to know more about raspberry pi |
| 202 | Give me information on uss carl vinson. |
question_bank.tsv

Contains all the questions in the collection. The TSV file has two columns: question_id and question (text).

| question_id | question |
| --- | --- |
| Q00001 | |
| Q02318 | what kind of medium do you want this information to be in |
| Q02319 | what kind of penguin are you looking for |
| Q02320 | what kind of pictures are you looking for |

Note: selecting Q00001 means selecting no question (its question text is empty).

dev_synthetic.pkl.tar.gz & train_synthetic.pkl.tar.gz

These files contain dicts of synthetically built multi-turn conversations (up to three turns).

{<record_id>: {'topic_id': <int>,
  'facet_id': <str>,
  'initial_request': <str>,
  'question': <str>,
  'answer': <str>,
  'conversation_context': [{'question': <str>,
   'answer': <str>},
  {'question': <str>,
   'answer': <str>}],
  'context_id': <int>},
  ...
  }

where

  • <record_id> is an int indicating the ID of the current conversation record.
    • While the dev set has multiple <record_id> values per <context_id>, the test file has only one.
  • 'topic_id', 'facet_id', and 'initial_request' indicate the topic, facet, and initial request of the current conversation, according to the single turn dataset.
  • 'question': current clarifying question that is being posed to the user.
  • 'answer': user’s answer to the clarifying question.
  • 'conversation_context' identifies the context of the current conversation. A context consists of previous turns in a conversation. As we see, it is a list of 'question' and 'answer' items. This list tells us which questions have been asked in the conversation so far, and what has been the answer to them.
  • 'context_id' is the ID of the conversation context. Participants should predict the next utterance for each context_id.
  2288: {'topic_id': 8,
  'facet_id': 'F0969',
  'initial_request': 'I want to know about appraisals.',
  'question': 'are you looking for a type of appraiser',
  'answer': 'yes jewelry',
  'conversation_context': [],
  'context_id': 969},
  
 1570812: {'topic_id': 293,
 'facet_id': 'F0729',
 'initial_request': 'Tell me about the educational advantages of social networking sites.',
 'question': 'which social networking sites would you like information on',
 'answer': 'i don have a specific one in mind just overall educational benefits to social media sites',
 'conversation_context': [{'question': 'what level of schooling are you interested in gaining the advantages to social networking sites',
   'answer': 'all levels'},
  {'question': 'what type of educational advantages are you seeking from social networking',
   'answer': 'i just want to know if there are any'}],
 'context_id': 976573}
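
A sketch of loading the extracted pickle; the inner file name is assumed to match the archive name minus .tar.gz:

```python
import pickle

with open("train_synthetic.pkl", "rb") as f:
    synthetic = pickle.load(f)

# Record IDs are ints; 1570812 is the second example above.
rec = synthetic[1570812]

# Input: initial request plus conversation context; the current question is
# what a system would be expected to produce for this context_id.
print("request:", rec["initial_request"])
for turn in rec["conversation_context"]:
    print("Q:", turn["question"], "| A:", turn["answer"])
print("next Q:", rec["question"], "| A:", rec["answer"])
```
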
single_turn_train_eval.pkl & multi_turn_***_eval.pkl.tar.gz

These files are dicts of pre-computed document relevance results after asking each question.

{ <evaluation_metric>:
    { <context_id>:
        { <question_id>:
            { 'no_answer': <float>,
              'with_answer': <float> },
          ...,
          'MAX':
            { 'no_answer': <float>,
              'with_answer': <float> },
          'MIN':
            { 'no_answer': <float>,
              'with_answer': <float> }
        },
      ...
    }
}
  • MAX and MIN: These refer to the maximum and minimum performance that the retrieval model achieves by asking the “best” and “worst” questions among the candidate questions.
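
A sketch of reading one pre-computed score out of this nested dict. The metric name, context ID, and question ID below are hypothetical placeholders; inspect the dict's keys for the real values:

```python
import pickle

with open("single_turn_train_eval.pkl", "rb") as f:
    eval_dict = pickle.load(f)

# "NDCG20" is an assumed metric name; check eval_dict.keys() for actual ones.
metric = eval_dict["NDCG20"]
context_id, question_id = 969, "Q00173"  # hypothetical lookup

scores = metric[context_id][question_id]
print("with answer:", scores["with_answer"], "| no answer:", scores["no_answer"])

# Headroom between asking the best and worst candidate question in this context.
spread = metric[context_id]["MAX"]["with_answer"] - metric[context_id]["MIN"]["with_answer"]
print("question-selection headroom:", spread)
```
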
top10k_docs_dict.pkl.tar.gz

A dict mapping each topic_id to a list of document IDs; it is useful for obtaining the top 10,000 documents as an initial ranking.

train.qrel & dev.qrel

These files contain the relevance assessments of the ClueWeb09 and ClueWeb12 collections for every facet in the train and dev sets, respectively:

<facet_id> 0 <document_id> <relevance_score>
F0001 0 clueweb09-en0038-74-08250 1
F0001 0 clueweb09-enwp01-17-11113 1
F0002 0 clueweb09-en0001-02-21241 1
F0002 0 clueweb09-en0006-52-11056 1
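
Since this is the standard four-column qrel layout, a small parsing sketch (file name as above):

```python
from collections import defaultdict

# Parse train.qrel into facet_id -> {document_id: relevance}.
qrels = defaultdict(dict)
with open("train.qrel") as f:
    for line in f:
        facet_id, _, doc_id, rel = line.split()
        qrels[facet_id][doc_id] = int(rel)

print(len(qrels["F0001"]), "judged documents for facet F0001")
```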