1. Title
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Authors: Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon
Affiliation: Microsoft Research
Year: 2021
2. Background
3. Main Content
Biomedical Language Understanding & Reasoning Benchmark (BLURB)
Domain-specific pretraining from scratch substantially outperforms continual pretraining of general-domain language models, showing that the prevailing assumption in favor of mixed-domain pretraining does not always hold.
3.1 Language Model Pretraining
3.1.1 Vocabulary
Byte-Pair Encoding (BPE); WordPiece
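Both BPE and WordPiece build a subword vocabulary by iteratively merging frequent symbol pairs, which is why an in-domain corpus yields in-domain subwords (e.g., whole biomedical terms instead of fragments). A minimal toy sketch of the BPE merge loop (the corpus and merge count are illustrative, not from the paper):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a small word-frequency corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Merge every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
corpus = {tuple("lymphoma"): 5, tuple("lymphocyte"): 3, tuple("low"): 2}
for _ in range(4):  # a few merge steps; real vocabularies use tens of thousands
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
# The shared stem "lymph" emerges as a single subword symbol.
```

WordPiece differs mainly in the merge-selection criterion (likelihood gain rather than raw pair frequency), but the loop structure is the same.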
3.1.2 Model Architecture
transformer architectures – BERT
3.1.3 Self-Supervision
Masked Language Model (MLM)
Next Sentence Prediction (NSP)
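The MLM objective corrupts roughly 15% of input positions and asks the model to recover the originals; of the selected positions, 80% become `[MASK]`, 10% are replaced by a random token, and 10% are left unchanged. A minimal sketch of this corruption step (token strings and the tiny vocabulary are illustrative):

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~15% of positions as prediction
    targets; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                # the model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return inputs, labels

tokens = ["the", "tumor", "suppressor", "gene", "was", "mutated"]
inputs, labels = mask_for_mlm(tokens, vocab=["cell", "node", "tumor"])
```

Unselected positions carry no label, so the loss is computed only over the corrupted ~15%.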
3.1.4 Advanced Pretraining Techniques
Variants of and additions to the pretraining tasks, such as whole-word masking (WWM).
3.2 Biomedical Language Model Pretraining
Pretraining on PubMed text yields better performance on biomedical NLP tasks.
For downstream applications, pretraining a domain-specific model from scratch works better than mixed-domain pretraining.
3.2.1 Mixed-Domain Pretraining
BioBERT initializes from standard BERT (pretrained on Wikipedia and BookCorpus), then continues pretraining with the MLM and NSP objectives on PubMed abstracts and PMC full-text articles.
BlueBERT is trained on PubMed text plus clinical notes from MIMIC-III.
Drawback: the vocabulary is still the general-domain vocabulary, which cannot adequately represent the target biomedical domain.
SciBERT: all of its training is done from scratch, including the vocabulary and the pretraining corpus.
3.2.2 Domain-Specific Pretraining from Scratch
Advantages: an in-domain vocabulary;
the language model is trained purely on in-domain data.
3.3 BLURB: A Comprehensive Benchmark for Biomedical NLP
Comparison with related work
Task descriptions
3.4 Task-Specific Fine-Tuning
3.4.1 A General Architecture for Fine-Tuning Neural Language Models
A framework for fine-tuning neural language models
3.4.2 Task-Specific Problem Formulation and Modeling Choices
NER:
TransformInput: returns the input sequence as is.
Featurizer: returns the BERT encoding of a given token.
Tagging scheme: BIO*; BIOUL; IO.
Classification layer: linear layer*; LSTM; CRF
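In the BIO scheme, the first token of an entity is tagged `B-<type>`, subsequent tokens `I-<type>`, and everything else `O`; the per-token classification layer then predicts these tags from the BERT encodings. A small sketch of converting token-level entity spans to BIO tags (the example sentence and span encoding are illustrative):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) token-level entity spans to BIO tags.
    `end` is exclusive; spans are assumed non-overlapping."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"            # entity-initial token
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"            # entity-internal tokens
    return tags

tokens = ["EGFR", "mutations", "cause", "lung", "cancer"]
tags = spans_to_bio(tokens, [(0, 1, "GENE"), (3, 5, "DISEASE")])
# tags == ["B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"]
```

BIOUL additionally distinguishes unit-length (`U-`) and entity-last (`L-`) tokens; IO drops the `B-`/`I-` distinction entirely.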
PICO:
TransformInput: returns the input sequence as is.
Featurizer: returns the BERT encoding of a given token.
Tagging scheme: BIO*; BIOUL; IO.
Classification layer: linear layer*; LSTM; CRF.
Relation Extraction:
TransformInput: entity (dummification*; start/end marker; original); relation ([CLS]*; original).
Featurizer: entity (dummy token; pooling); relation ([CLS] BERT encoding; concatenation of the mention BERT encoding).
Classification layer: linear layer; more sophisticated classifiers (e.g., MLP).
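Entity dummification replaces each entity mention with a single placeholder token derived from its type, so the relation classifier cannot overfit to specific entity names. A minimal sketch, where the `@TYPE$` token format is an illustrative choice rather than a detail taken from the paper:

```python
def dummify(tokens, spans):
    """Replace each (start, end, type) entity span (end exclusive,
    non-overlapping) with a single type-derived dummy token."""
    out = list(tokens)
    for start, end, etype in sorted(spans, reverse=True):  # right-to-left so
        out[start:end] = [f"@{etype}$"]                    # indices stay valid
    return out

tokens = ["EGFR", "mutations", "cause", "lung", "cancer"]
dummified = dummify(tokens, [(0, 1, "GENE"), (3, 5, "DISEASE")])
# dummified == ["@GENE$", "mutations", "cause", "@DISEASE$"]
```

The alternative start/end-marker transform keeps the original mention text and instead brackets it with special tokens.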
Sentence Similarity:
TransformInput: [CLS] S1 [SEP] S2 [SEP], for sentence pair S1, S2.
Featurizer: [CLS] BERT encoding.
Regression layer: linear regression.
Document Classification:
TransformInput: [CLS] D [SEP], for document D.
Featurizer: returns [CLS] BERT encoding.
Classification layer: linear layer.
Question Answering:
TransformInput: [CLS] Q [SEP] T [SEP], for question Q and reference text T.
Featurizer: returns [CLS] BERT encoding.
Classification layer: linear layer.
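The sequence-level tasks above share one input convention: a `[CLS]` token, then each text segment terminated by `[SEP]`, with the final `[CLS]` encoding fed to a linear classification (or, for sentence similarity, regression) layer. A minimal sketch of that shared TransformInput step, using naive whitespace splitting in place of real subword tokenization:

```python
def transform_input(segments):
    """Build a BERT-style token sequence: [CLS] seg1 [SEP] seg2 [SEP] ...
    Handles single texts (document classification) and pairs
    (sentence similarity: S1/S2; QA: question/reference text)."""
    tokens = ["[CLS]"]
    for seg in segments:
        tokens += seg.split() + ["[SEP]"]   # whitespace split stands in for
    return tokens                            # WordPiece tokenization

qa = transform_input(["what causes scurvy",
                      "vitamin C deficiency causes scurvy"])
# qa == ["[CLS]", "what", ..., "[SEP]", "vitamin", ..., "[SEP]"]
```

Only the Featurizer and output layer differ between these tasks; the input formatting is identical.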
3.5 Experiments
Datasets: PubMed abstracts (https://pubmed.ncbi.nlm.nih.gov/), downloaded in Feb. 2020: 14 million abstracts, 3.2 billion words, 21 GB.
PubMed Central (PMC) full text (https://www.ncbi.nlm.nih.gov/pmc/): 16.8 billion words, 107 GB.
Baseline models:
Experiment 1: Domain-Specific Pretraining vs. Mixed-Domain Pretraining
Experiment 2: ablation study of pretraining techniques
3.6 Fine-Tuning Analysis
Judging from these results, adding adversarial training yields only a modest improvement.
Different tagging schemes:
Assessment
The paper first reviews background on pretraining and proposes an evaluation benchmark for the biomedical domain (BLURB), describing each of its tasks. It then reports extensive experiments comparing purely biomedical-domain pretraining against mixed-domain pretraining. The experimental evaluation is fairly thorough.
References
Paper: https://arxiv.org/pdf/2007.15779.pdf
Hugging Face: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
Benchmark: https://microsoft.github.io/BLURB