NLP365 Day 115 Paper Summary - SCIBERT: A Pretrained Language Model for Scientific Text


INSIDE AI NLP365

Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 257 days here. At the end of this article, you can find previous paper summaries grouped by NLP area :)

Today’s NLP paper is SCIBERT: A Pretrained Language Model for Scientific Text. Below are the key takeaways of the research paper.

Objective and Contribution

Released SCIBERT, a pretrained language model trained on multiple scientific corpora to perform different downstream scientific NLP tasks. These tasks include sequence tagging, sentence classification, dependency parsing, and many more. SCIBERT achieved new SOTA results on several of these downstream tasks. We also performed extensive experiments comparing fine-tuning with task-specific architectures, the effect of frozen embeddings, and the effect of in-domain vocabulary.

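The fine-tuning versus frozen-embeddings comparison is easy to picture in code. Below is a minimal sketch, not the authors' code, of the frozen setting using the released SCIBERT checkpoint on the Hugging Face hub (the model ID and example sentence are assumptions on my part); deleting the freezing loop recovers the end-to-end fine-tuning setting.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Released SCIBERT checkpoint (assumed Hugging Face model ID).
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Frozen-embedding setting: no gradients flow into SCIBERT, so its output
# serves as fixed features for a task-specific architecture on top.
for param in model.parameters():
    param.requires_grad = False

inputs = tokenizer("The glucose transporter GLUT1 is upregulated.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768) contextual embeddings
print(hidden.shape)
```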

Methodology

How is SCIBERT different from BERT?

  1. Scientific vocabulary (SCIVOCAB)
  2. Pretrained on scientific corpora

SCIBERT is based on the BERT architecture. Everything is the same as BERT except that it is pretrained on scientific corpora. BERT uses WordPiece to tokenise input text and build the vocabulary (BASEVOCAB) for the model. The vocabulary contains the most frequent words / subword units. We use the SentencePiece library to construct a new WordPiece vocabulary (SCIVOCAB) on scientific corpora. There is only a 42% overlap between BASEVOCAB and SCIVOCAB, showing how different scientific text is from general-domain text and why a new vocabulary is needed.

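As a rough illustration of the vocabulary step, here is a minimal sketch (not the authors' code) that trains a subword vocabulary on a scientific corpus with the SentencePiece library and measures its overlap with BERT's vocabulary. The file names are hypothetical, and BPE stands in for WordPiece, which SentencePiece does not expose directly.

```python
import sentencepiece as spm

# Train a ~30K-token subword vocabulary on a one-sentence-per-line corpus.
# "scientific_corpus.txt" is a hypothetical dump of the pretraining corpus.
spm.SentencePieceTrainer.train(
    input="scientific_corpus.txt",
    model_prefix="scivocab",
    vocab_size=30000,      # roughly the size of BERT's BASEVOCAB
    model_type="bpe",      # stand-in: SentencePiece has no WordPiece mode
)

def load_vocab(path):
    """Read a vocabulary file into a set of tokens (one token per line,
    ignoring the tab-separated score column SentencePiece appends)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n").split("\t")[0] for line in f}

scivocab = load_vocab("scivocab.vocab")        # written by the training call above
basevocab = load_vocab("bert_base_vocab.txt")  # hypothetical copy of BERT's vocab file
overlap = len(scivocab & basevocab) / len(scivocab)
print(f"Token overlap with BASEVOCAB: {overlap:.0%}")  # the paper reports 42%
```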

SCIBERT is trained on 1.14M papers from Semantic Scholar. The full text of the papers is used, including the abstracts. The papers have an average length of 154 sentences, and sentences are split using ScispaCy.

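For the sentence-splitting step, a minimal sketch with ScispaCy might look like the following, assuming the en_core_sci_sm model is installed (the example text is made up):

```python
import spacy

# Load ScispaCy's small scientific-text pipeline (installed separately
# from the scispacy package's model releases).
nlp = spacy.load("en_core_sci_sm")

text = (
    "We pretrain the model on 1.14M papers from Semantic Scholar. "
    "The full text of each paper is split into sentences before training."
)
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
```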

Experimental Setup

What are the downstream NLP tasks? As listed in the contribution above, they include sequence tagging, sentence classification, and dependency parsing.
