EduChat Paper: Key Takeaways

Abstract

learn domain-specific knowledge by pre-training on the educational corpus and stimulate various skills with tool use by fine-tuning on designed system prompts and instructions.

1 Introduction

LLMs obtained the ability of reasoning, long-range context modeling, and task generalization by training on large-scale textual corpus with some strategies, such as code pretraining (Chen et al., 2021), instruction tuning (Wei et al., 2022), and reinforcement learning from human feedback (RLHF) (Stiennon et al., 2020).

By training on large-scale text corpora with strategies such as RLHF, instruction tuning, and code pre-training, LLMs acquire reasoning, long-range context modeling, and task generalization.

However, there are several challenges in applying LLMs to the education domain. One challenge (C1) is that a gap remains between LLMs and educational experts: since LLMs are pre-trained on general corpora, they lack sufficient educational knowledge and cannot align well with real scenarios (e.g., essay assessment). The other challenge (C2) is that knowledge in the field of education keeps updating, while LLMs cannot learn up-to-date knowledge due to their training mechanism. Moreover, LLMs suffer from the hallucination problem and may generate responses that are not truthful.

Challenges EduChat faces:

1. General LLMs are trained on general-purpose corpora and perform poorly in education-specific scenarios (e.g., essay assessment).

2. Knowledge in the education field keeps evolving, while the model cannot learn continually updated knowledge.

3. The hallucination problem: the model may generate responses that are not truthful.

For C1, we pre-train LLMs on a large number of educational books (e.g., psychology, ancient poetry) and 4 million cleaned diverse instructions to learn the fundamental knowledge. Then, we fine-tune the model on 500 thousand high-quality customized instructions to activate education-specific functions (e.g., essay assessment, Socratic teaching and emotional support), by aligning with the feedback from psychology experts and frontline teachers.

For the first challenge, the model is first trained on a large number of education-domain books (e.g., psychology, classical poetry) and 4 million cleaned instructions so that it acquires fundamental education-domain knowledge. It is then fine-tuned on 500 thousand high-quality customized instructions (aligned with feedback from psychology experts and frontline teachers) to strengthen its education-specific functions.

For C2, we explore a retrieval-augmented technology, which enables LLMs to automatically judge the helpfulness of the retrieved information, and generate the response based on the relevant information and knowledge stored in LLMs. In this way, our EduChat can access the latest information from the internet, ensuring that the responses are accurate and credible.

For the second challenge, they develop a retrieval technique that lets the model fetch the latest information, judge on its own whether each retrieved piece is useful, and combine the useful pieces with the knowledge stored in the model to generate the response.

Diverse system prompts and instructions are designed to control the tool use and stimulate different skills, which alleviates the problem of hallucination and is more applicable in real education scenarios;

Diverse system prompts and instructions are used to address hallucination and the model's adaptability to the education domain.

2 Related Work

In education, Baladn et al. (2023) tune open-source LLMs for generating better teacher responses in BEA 2023 Shared Task (Tack et al., 2023). But challenges still exist, such as the lack of domain knowledge in general LLMs and the necessity for them to align with educational abilities (e.g., essay assessment, emotional support, and Socratic teaching).

Before this, Baladn and colleagues fine-tuned open-source LLMs to generate better teacher responses in the BEA 2023 Shared Task, but the results still fell short.

EduChat is pre-trained on a diverse education corpus to ensure the alignment of EduChat with educational abilities.

EduChat seems to put strong emphasis on its own diversity, whether that is its diverse corpus or its diverse system prompts and instructions.

3 Core Functions of EduChat

Retrieval-Augmented Open Question Answering (QA)

A QA system extended with retrieval is good at eliminating fabricated information and maintaining up-to-date knowledge.

Fine-grained Essay Assessment

overall scores, aspect-level ratings, and detailed comments on content, expression, paragraph, and overall evaluation.

can identify standout sentences, highlighting strengths and areas for improvement, enabling personalized guidance for students’ essay writing skills.

In short: the essay scoring criteria are fine-grained, and the model can identify standout sentences and suggest how to improve.

Socratic Teaching

Guided question-and-answer; the answer is not given to the student directly.

Psychology-based Emotional Support

Personally, I don't think this feature is needed.

4 Data Construction

4.1 Pre-training Data

Textbooks Data

In our research, we gather a vast amount of educational textbook and online question bank data from Chinese middle and high school exams for pre-training. Additionally, we enrich our model with over 70,000 Chinese poetries, providing detailed information on authors, backgrounds, and poetry appreciation to enhance its poetry creation and appreciation capabilities. To facilitate empathetic emotional support dialogues, we carefully select 60 famous works from hundreds of psychology books. These selected books belong to two main categories. The first category consists of 15 branches of psychological theory, including developmental and educational psychology, social psychology, behavioral psychology, counseling psychology and others. The second category contains various psychological practices, which offer practical cases of psychological consultation and emotional support dialogues. By incorporating the diverse fundamental data into pre-training, our model gains a deeper understanding of education and psychology, enabling it to generate more helpful responses.

Textbooks and online question banks from Chinese middle and high school exams. They also use 70,000 Chinese classical poems with appreciation notes to improve the model's ability to write and appreciate classical poetry.

The rest is about emotional support, so I skipped it.

Fundamental Instruction Data

To achieve a more natural human-computer interaction, we collect a large volume of bilingual instruction tuning data from reputable open-source repositories like Alpaca, BELLE (Ji et al., 2023), GPT4All, OpenAssistant, FLAN-CoT, and Firefly. The data spans various task types, enabling our models to acquire foundational instruction following capabilities for diverse instruction types. In addition, we source high-quality multi-turn dialogue data from MOSS (Sun et al., 2023), BELLE (Ji et al., 2023), COIG (Zhang et al., 2023a), LIMA (Zhou et al., 2023a), and ShareGPT. This data covers various dialogue contexts, including role-playing, creative writing, and code-related discussions, ensuring our models' competence in engaging and sustaining meaningful multi-turn conversations.

Instruction data and multi-turn dialogue data; the specific datasets are the ones named above. The multi-turn dialogue data even includes role-playing, creative writing, and code-related discussions.

4.2 Fine-tuning Data

we construct the Educational Instruction Data for fine-tuning, which covers retrieval-augmented open QA (22.6%), emotional support (29.4%), Socratic teaching (16.8%) and essay assessment (31.2%).

They construct an Educational Instruction Data set with these four parts.

Retrieval-Augmented Open QA Data

To address hallucination and timely knowledge issues in Open QA, we design a retrieval-augmented open QA technique. We sample high-quality data through ChatGPT scoring in relevant Open QA and Subject QA datasets. To tackle irrelevant retrieved content, we introduce self-checking: ChatGPT assesses whether the retrieved content helps answer the question and then generates the answer using a self-check, incorporating the useful retrieved content and the question. To maintain data quality, we manually verify the data during this process.

Roughly: crawl web data based on the question, then have ChatGPT judge how relevant the crawled content is to the question. For highly relevant data, ask once more (the self-check) whether the data actually helps answer the question, and then generate the answer. The resulting dataset is finally revised and filtered by hand.
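The scoring-then-filtering step above can be sketched as a small pipeline. The paper does not publish its prompts or thresholds, so `sample_high_quality`, `toy_score`, and the threshold of 8 here are all illustrative assumptions; in the paper the scorer is ChatGPT, while here a toy function stands in for it.

```python
from typing import Callable, List, Tuple

def sample_high_quality(qa_pairs: List[Tuple[str, str]],
                        llm_score: Callable[[str, str], int],
                        threshold: int = 8) -> List[Tuple[str, str]]:
    """Keep only QA pairs the scoring model rates at or above the threshold."""
    return [(q, a) for q, a in qa_pairs if llm_score(q, a) >= threshold]

# Toy stand-in for the real ChatGPT scorer: rate longer answers higher.
def toy_score(question: str, answer: str) -> int:
    return 9 if len(answer) > 20 else 3

pairs = [("What is RLHF?", "Reinforcement learning from human feedback."),
         ("Capital of France?", "Paris.")]
print(sample_high_quality(pairs, toy_score))
# keeps only the first, longer-answer pair
```

Passing the scorer in as a function keeps the pipeline testable without any API calls; swapping in a real LLM judge only changes `llm_score`.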

Emotional Support Data

To overcome the scarcity of Chinese emotional support dialogue data, we adopt a translation and expansion approach. We translate the widely-used English emotional support dataset, ESConv (Liu et al., 2021), into Chinese as ESConv-zh. After manual review and cleaning, we simulate multi-agent dialogues based on various patient scenarios within ESConv-zh and also collect real-life Chinese psychological counseling consultation data, incorporating patient information and diagnosis results. By training our models on diverse datasets, we empower them to provide robust emotional support and act as compassionate counselors during consultations.

Emotional support, skipped.

Socratic Teaching Data

Teachers play a key role in guiding and encouraging heuristic exploration rather than just providing answers. To support this, we generate dialogues simulating the Socratic teaching method by incorporating multi-step Q&A involving counter-questions, challenges, and inquiries. These dialogues are manually evaluated for accuracy, fluency, and progression from easy to complex questions. Integrating this dataset into training equips our model with a strong capability in Socratic teaching, distinguishing it from other LLMs that only offer direct answers.

A dataset is generated from many multi-turn dialogues containing counter-questions and probing inquiries, then manually filtered; it is used to give the model the ability to teach through guided Q&A.

Essay Assessment Data

The lack of timely and detailed feedback often hinders students’ writing improvement. To tackle this issue, we create a high-quality essay assessment dataset. Initially, we collect essays and employ ChatGPT to evaluate them in terms of content, expression, and overall quality. To ensure data quality, we invite pedagogical experts to manually curate the comments. This dataset empowers EduChat with the ability to provide students with high-quality feedback, aiding in the enhancement of their writing skills.

ChatGPT is asked to score essays on content, expression, and overall quality, and the essays together with ChatGPT's outputs are compiled into a dataset. Pedagogical experts (I suspect it's the authors themselves; what teacher would volunteer for such drudgery) then manually filter and revise it to improve the dataset's quality.

4.3 Data Preprocessing

To enhance data quality, we conduct semantic-level deduplication on the dataset. Using the sentence-transformers model (Reimers and Gurevych, 2019), we obtain sentence embeddings for each data point and calculate cosine similarity between all pairs of embeddings. For similarities exceeding a threshold of 0.7, we remove one of the duplicates. We implement the similarity calculation using CUDA for GPU acceleration, speeding up the process.

Finally, some actual technical detail.

The gist: deduplicate the dataset to improve quality. Compute a sentence embedding for each data point, then compute pairwise cosine similarities between embeddings; any pair above 0.7 counts as a duplicate and one copy is removed. The similarity computation is implemented with CUDA to speed it up.

5 EduChat

5.1 Training Procedure of EduChat

The training of EduChat is mainly divided into two stages: fundamental capabilities acquisition and educational skills acquisition. In the first stage, we pre-train the model on educational books and Q&A pairs (detailed in Section 4.1) to equip it with foundational knowledge across disciplines. Besides, large-scale instruction tuning and open-domain dialogue datasets are also incorporated to enable basic instruction following ability and dialogue ability (detailed in Section 4.2). In the second stage, we develop EduChat's pedagogical skills by fine-tuning the model on our carefully curated data, including retrieval-augmented open QA, emotional support, Socratic teaching and essay assessment datasets mentioned in Section 4.2.

EduChat's training has two stages. The first stage builds its basic capabilities using the Section 4.1 data (textbooks, instruction data, etc.); the second stage builds its educational skills using the custom datasets the authors built, described in Section 4.2.

5.2 Online Knowledge Retrieval

Existing generative LLMs all suffer from the issues of generating hallucinations and outdated information, which is detrimental to an educational model. To mitigate this problem, we introduce self-check as shown in Figure 2. Specifically, when online knowledge retrieval is enabled, the model picks useful retrieval results by asking itself "Is this helpful for answering the question?" and appends the filtered snippets before the dialogue history.

ChatGPT was used when building the dataset; at actual inference time there should be no ChatGPT, only the self-check. Roughly: search the question in a search engine, then ask the model itself whether each retrieved piece of content helps answer the question (the self-check), keep the helpful pieces, and use them to generate the answer.

The paper says this mitigates the hallucination problem and keeps the knowledge up to date.
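The inference-time self-check can be sketched as a filter plus prompt assembly. The yes/no question is quoted from the paper, but everything else here (`self_check_filter`, `build_prompt`, the keyword-overlap `toy_judge`) is an illustrative assumption; in EduChat the judge is the model's own answer to the self-check question.

```python
from typing import Callable, List

def self_check_filter(question: str, snippets: List[str],
                      judge: Callable[[str, str], bool]) -> List[str]:
    """Keep only the snippets the model itself judges helpful
    ("Is this helpful for answering the question?")."""
    return [s for s in snippets if judge(question, s)]

def build_prompt(question: str, snippets: List[str],
                 judge: Callable[[str, str], bool],
                 dialogue_history: str) -> str:
    """Prepend the filtered snippets before the dialogue history."""
    useful = self_check_filter(question, snippets, judge)
    return "\n".join(useful + [dialogue_history])

# Toy stand-in for the model's yes/no judgment: crude keyword overlap.
def toy_judge(question: str, snippet: str) -> bool:
    overlap = set(question.lower().split()) & set(snippet.lower().split())
    return len(overlap) >= 2

snips = ["educhat is an education chat model", "the weather is sunny today"]
print(self_check_filter("what is educhat model", snips, toy_judge))
# only the snippet about educhat survives the self-check
```

The point of the design is that retrieval results never reach the generation context unfiltered: the model answers the self-check question first, and only snippets that pass are prepended to the history.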

5.3 System Prompt Design

Roughly: this covers the design of the system prompts, including the model's self-introduction, which kinds of keywords work best, and which keywords enable or disable particular functions.

6 Experimental Results

6.1 Results of C-Eval

C-Eval is an evaluation suite for Chinese large language models: 13,948 multiple-choice questions spanning 52 disciplines at 4 difficulty levels.
