1. Title
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Authors: Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon
Affiliation: Microsoft Research
Year: 2021
2. Background
3. Main Content
Biomedical Language Understanding & Reasoning Benchmark (BLURB)
Domain-specific pretraining from scratch substantially outperforms continual pretraining of general-domain language models, showing that the prevailing assumption in favor of mixed-domain pretraining does not always hold.
3.1 Language Model Pretraining
3.1.1 Vocabulary
Byte-Pair Encoding (BPE); WordPiece
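Both BPE and WordPiece build a subword vocabulary by iteratively merging frequent symbol pairs, which is why an in-domain corpus yields in-domain subwords (e.g., whole biomedical terms instead of fragments). A minimal toy sketch of the BPE merge loop (the corpus and merge count are illustrative, not from the paper):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a small word-frequency corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Merge every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
corpus = {tuple("lymphoma"): 5, tuple("lymphocyte"): 3, tuple("low"): 2}
for _ in range(4):  # a few merge steps; real vocabularies use tens of thousands
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
# The shared stem "lymph" emerges as a single subword symbol.
```

WordPiece differs mainly in the merge-selection criterion (likelihood gain rather than raw pair frequency), but the loop structure is the same.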
3.1.2 Model Architecture
transformer architectures – BERT
3.1.3 Self-Supervision
Masked Language Model (MLM)
Next Sentence Prediction (NSP)
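The MLM objective corrupts roughly 15% of input positions and asks the model to recover the originals; of the selected positions, 80% become `[MASK]`, 10% are replaced by a random token, and 10% are left unchanged. A minimal sketch of this corruption step (token strings and the tiny vocabulary are illustrative):

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~15% of positions as prediction
    targets; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                # the model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return inputs, labels

tokens = ["the", "tumor", "suppressor", "gene", "was", "mutated"]
inputs, labels = mask_for_mlm(tokens, vocab=["cell", "node", "tumor"])
```

Unselected positions carry no label, so the loss is computed only over the corrupted ~15%.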
3.1.4 Advanced Pretraining Techniques
Variants of and additions to the pretraining tasks, such as whole-word masking (WWM).
3.2 Biomedical Language Model Pretraining
Pretraining on PubMed text yields better performance on biomedical NLP tasks.
For downstream applications, pretraining a domain-specific model from scratch works better than mixed-domain pretraining.
3.2.1 Mixed-Domain Pretraining
BioBERT initializes from standard BERT (pretrained on Wikipedia and BookCorpus), then continues pretraining with the MLM and NSP objectives on PubMed abstracts and PMC full-text articles.
BlueBERT is trained on PubMed text plus clinical notes from MIMIC-III.
Drawback: the vocabulary is still the general-domain vocabulary, which cannot adequately represent the target biomedical domain.
SciBERT: all of its training is done from scratch, including the vocabulary and the pretraining corpus.
3.2.2 Domain-Specific Pretraining from Scratch
Advantages: an in-domain vocabulary;
the language model is trained purely on in-domain data.
3.3 BLURB: A Comprehensive Benchmark for Biomedical NLP
Comparison with related work
Task descriptions
3.4 Task-Specific Fine-Tuning
3.4.1 A General Architecture for Fine-Tuning Neural Language Models
A framework for fine-tuning neural language models
3.4.2 Task-Specific Problem Formulation and Modeling Choices
NER:
TransformInput: returns the input sequence as is.
Featurizer: returns the BERT encoding of a given token.
Tagging scheme: BIO*; BIOUL; IO.
Classification layer: linear layer*; LSTM; CRF
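In the BIO scheme, the first token of an entity is tagged `B-<type>`, subsequent tokens `I-<type>`, and everything else `O`; the per-token classification layer then predicts these tags from the BERT encodings. A small sketch of converting token-level entity spans to BIO tags (the example sentence and span encoding are illustrative):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) token-level entity spans to BIO tags.
    `end` is exclusive; spans are assumed non-overlapping."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"            # entity-initial token
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"            # entity-internal tokens
    return tags

tokens = ["EGFR", "mutations", "cause", "lung", "cancer"]
tags = spans_to_bio(tokens, [(0, 1, "GENE"), (3, 5, "DISEASE")])
# tags == ["B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"]
```

BIOUL additionally distinguishes unit-length (`U-`) and entity-last (`L-`) tokens; IO drops the `B-`/`I-` distinction entirely.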
PICO:
TransformInput: returns the input sequence as is.
Featurizer: returns the BERT encoding of a given token.
Tagging scheme: BIO*; BIOUL; IO.
Classification layer: linear layer*; LSTM; CRF.
Relation Extraction:
TransformInput: entity (dummification*; start/end marker; original); relation ([CLS]*; original).
Featurizer: entity (dummy token; pooling); relation ([CLS] BERT encoding; concatenation of the mention BERT encoding).
Classification layer: linear layer; more sophisticated classifiers (e.g., MLP).
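Entity dummification replaces each entity mention with a single placeholder token derived from its type, so the relation classifier cannot overfit to specific entity names. A minimal sketch, where the `@TYPE$` token format is an illustrative choice rather than a detail taken from the paper:

```python
def dummify(tokens, spans):
    """Replace each (start, end, type) entity span (end exclusive,
    non-overlapping) with a single type-derived dummy token."""
    out = list(tokens)
    for start, end, etype in sorted(spans, reverse=True):  # right-to-left so
        out[start:end] = [f"@{etype}$"]                    # indices stay valid
    return out

tokens = ["EGFR", "mutations", "cause", "lung", "cancer"]
dummified = dummify(tokens, [(0, 1, "GENE"), (3, 5, "DISEASE")])
# dummified == ["@GENE$", "mutations", "cause", "@DISEASE$"]
```

The alternative start/end-marker transform keeps the original mention text and instead brackets it with special tokens.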
Sentence Similarity:
TransformInput: [CLS] S1 [SEP] S2 [SEP], for sentence pair S1, S2.
Featurizer: [CLS] BERT encoding.
Regression layer: linear regression.
Document Classification:
TransformInput: [CLS] D [SEP], for document D.
Featurizer: returns [CLS] BERT encoding.
Classification layer: linear layer.
Question Answering:
TransformInput: [CLS] Q [SEP] T [SEP], for question Q and reference text T.
Featurizer: returns [CLS] BERT encoding.
Classification layer: linear layer.
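The sequence-level tasks above share one input convention: a `[CLS]` token, then each text segment terminated by `[SEP]`, with the final `[CLS]` encoding fed to a linear classification (or, for sentence similarity, regression) layer. A minimal sketch of that shared TransformInput step, using naive whitespace splitting in place of real subword tokenization:

```python
def transform_input(segments):
    """Build a BERT-style token sequence: [CLS] seg1 [SEP] seg2 [SEP] ...
    Handles single texts (document classification) and pairs
    (sentence similarity: S1/S2; QA: question/reference text)."""
    tokens = ["[CLS]"]
    for seg in segments:
        tokens += seg.split() + ["[SEP]"]   # whitespace split stands in for
    return tokens                            # WordPiece tokenization

qa = transform_input(["what causes scurvy",
                      "vitamin C deficiency causes scurvy"])
# qa == ["[CLS]", "what", ..., "[SEP]", "vitamin", ..., "[SEP]"]
```

Only the Featurizer and output layer differ between these tasks; the input formatting is identical.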
3.5 Experiments
Datasets: PubMed abstracts (https://pubmed.ncbi.nlm.nih.gov/), downloaded in Feb. 2020: 14 million abstracts, 3.2 billion words, 21 GB.
PubMed Central (PMC) full text (https://www.ncbi.nlm.nih.gov/pmc/): 16.8 billion words, 107 GB.
Baseline models:
Experiment 1: Domain-Specific Pretraining vs. Mixed-Domain Pretraining
Experiment 2: ablation study of pretraining techniques
3.6 Fine-Tuning Analysis
Judging from these results, adding adversarial training yields only a modest improvement.
Different tagging schemes:
Assessment
The paper first reviews background on pretraining and proposes an evaluation benchmark for the biomedical domain (BLURB), describing each of its tasks. It then reports extensive experiments comparing purely biomedical-domain pretraining against mixed-domain pretraining. The experimental evaluation is fairly thorough.
References
Paper: https://arxiv.org/pdf/2007.15779.pdf
Hugging Face: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
Benchmark: https://microsoft.github.io/BLURB