SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models

https://arxiv.org/pdf/2002.06652.pdf

I. INTRODUCTION

One limitation of BERT is that, due to the large model size, it is time-consuming to perform sentence-pair regression tasks such as clustering and semantic search.

One effective way to address this problem is to transform a sentence into a vector that encodes its semantic meaning.

Currently, a common sentence embedding approach with BERT-based models is to average the representations obtained from the last layer or to use the [CLS] token embedding for sentence-level prediction.

Note: the BERT model is too large for direct sentence-pair comparison, so sentences are encoded into vectors, using either the [CLS] token as the sentence-level representation or the average of the last-layer representations (a sketch of both pooling strategies follows).
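As a hedged illustration of these two pooling strategies, the sketch below uses the Hugging Face transformers library with "bert-base-uncased" (an illustrative checkpoint choice, not necessarily the one used in the paper) to compute both the average of the last-layer token vectors and the [CLS] vector.

```python
# Minimal sketch: the two common BERT pooling strategies mentioned above.
# Assumes the Hugging Face transformers library; "bert-base-uncased" is an
# illustrative choice of checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    last_hidden = out.last_hidden_state              # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    # Strategy 1: average the last-layer token representations.
    mean_pooled = (last_hidden * mask).sum(1) / mask.sum(1)
    # Strategy 2: take the [CLS] vector (the first token).
    cls_vector = last_hidden[:, 0, :]
    return mean_pooled.squeeze(0), cls_vector.squeeze(0)

mean_vec, cls_vec = embed("SBERT-WK dissects BERT-based word models.")
print(mean_vec.shape, cls_vec.shape)                 # both 768-dim for bert-base
```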

Different from SBERT, we investigate sentence embedding by studying the geometric structure of deep contextualized models and propose a new method by dissecting BERT-based word models.

Note: the work mainly studies the geometric structure of the model.

SBERT-WK inherits the strength of deep contextualized models, which are trained on both word- and sentence-level objectives. It is compatible with most deep contextualized models such as BERT [5] and RoBERTa [11].

Note: it inherits both the word- and sentence-level objectives and is compatible with deep contextualized models such as BERT and RoBERTa.

II. RELATED WORK

Traditional word embedding methods provide a static representation for a word in a vocabulary set.

First, it cannot deal with polysemy. Second, it cannot adjust the meaning of a word based on its contexts.

Note: these are the drawbacks of traditional static embeddings.

Sentence embedding methods can be categorized into two categories: non-parameterized and parameterized models.

Non-parameterized methods usually rely on high quality pre-trained word embedding methods. Following this line of averaging word embeddings, several weighted averaging methods were proposed, including tf-idf, SIF [21], uSIF [22] and GEM [23].
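As a hedged sketch of this weighted-averaging family, the snippet below implements a SIF-style sentence embedding (weight each word by a/(a + p(w)), then remove the first principal component); the word vectors and frequency counts are toy placeholders, and the details may differ from the exact formulation in [21].

```python
# Minimal sketch of SIF-style weighted averaging: weight each word vector by
# a / (a + p(w)), average, then remove the projection onto the first principal
# component. Word vectors and frequency counts here are toy placeholders.
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    dim = len(next(iter(word_vecs.values())))
    total = sum(word_freq.values())
    emb = []
    for sent in sentences:
        words = [w for w in sent.lower().split() if w in word_vecs]
        if not words:
            emb.append(np.zeros(dim))
            continue
        weights = [a / (a + word_freq[w] / total) for w in words]
        vecs = np.stack([word_vecs[w] for w in words])
        emb.append(np.average(vecs, axis=0, weights=weights))
    emb = np.stack(emb)
    # Common-component removal: subtract the projection onto the first
    # principal component of the sentence-embedding matrix.
    u, _, _ = np.linalg.svd(emb.T, full_matrices=False)
    pc = u[:, :1]                                    # (dim, 1)
    return emb - emb @ pc @ pc.T
```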

Parameterized models are more complex, and they usually perform better than non-parameterized models.

Note: these are the two categories of sentence embedding models.

However, unlike supervised tasks, universal sentence embedding methods in general do not have a clear objective function to optimize.

Note: universal sentence embedding methods lack a clear training objective to optimize.

IV. PROPOSED SBERT-WK METHOD

We propose a new sentence embedding method called SBERT-WK in this section. It consists of two steps (a rough code sketch follows the list below):

1) Determine a unified word representation for each word in a sentence by integrating its representations across layers, based on their alignment and novelty properties.

2) Conduct a weighted average of unified word representations based on the word importance measure to yield the ultimate sentence embedding vector.
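The following is a rough, simplified sketch of these two steps; the alignment/novelty weighting and the word-importance measure below only loosely follow the paper's description and are not the authors' exact formulas.

```python
# Rough sketch of the two SBERT-WK steps described above. `hidden_states` is
# assumed to be a (num_layers, num_words, dim) array for one sentence, with
# word pieces already aligned to words. The weighting scheme is an
# approximation of the paper's idea, not its exact formulation.
import numpy as np

def unify_word(layers, window=2):
    """Fuse one word's per-layer vectors (L, dim) into a single vector."""
    L, dim = layers.shape
    weights = np.zeros(L)
    for l in range(L):
        lo, hi = max(0, l - window), min(L, l + window + 1)
        ctx = np.delete(layers[lo:hi], l - lo, axis=0)   # neighboring layers
        # Alignment: average cosine similarity to the neighboring layers.
        align = np.mean([np.dot(layers[l], c) /
                         (np.linalg.norm(layers[l]) * np.linalg.norm(c) + 1e-12)
                         for c in ctx])
        # Novelty: norm of the component orthogonal to the neighbor subspace.
        q, _ = np.linalg.qr(ctx.T)                        # (dim, k)
        resid = layers[l] - q @ (q.T @ layers[l])
        novelty = np.linalg.norm(resid) / (np.linalg.norm(layers[l]) + 1e-12)
        weights[l] = 0.5 * align + 0.5 * novelty          # ad-hoc combination
    weights = np.clip(weights, 1e-6, None)
    weights /= weights.sum()
    return weights @ layers                               # (dim,)

def sbert_wk_like(hidden_states):
    """hidden_states: (L, num_words, dim) -> sentence vector (dim,)."""
    num_words = hidden_states.shape[1]
    words = [unify_word(hidden_states[:, w, :]) for w in range(num_words)]
    # Word importance: cross-layer variance, used here as a stand-in for the
    # paper's importance measure.
    imp = np.array([hidden_states[:, w, :].var() for w in range(num_words)])
    imp /= imp.sum()
    return np.sum(imp[:, None] * np.stack(words), axis=0)
```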

V. EXPERIMENTS

 • Semantic textual similarity tasks.

They predict the similarity between two given sentences. They can be used to indicate the embedding ability of a method in terms of clustering and information retrieval via semantic search.

• Supervised downstream tasks.

They measure embedding’s transfer capability to downstream tasks including entailment and sentiment classification.

• Probing tasks.

They have been proposed in recent years to measure the linguistic features captured by an embedding model and to provide fine-grained analysis.

For performance benchmarking, we compare SBERT-WK with the following 10 methods, including both parameterized and non-parameterized models:

1) Average of GloVe word embeddings;

2) Average of the last-layer token representations of BERT;

3) Use [CLS] embedding from BERT, where [CLS] is used for next sentence prediction in BERT;

4) SIF model [21], which is a non-parameterized model that provides a strong baseline in textual similarity tasks;

5) GEM model [23], which is a non-parameterized model deriving from the analysis of static word embedding space;

6) p-mean model [29] that incorporates multiple word embedding models;

7) Skip-Thought [24];

8) InferSent [25] with both GloVe and FastText versions;

9) Universal Sentence Encoder [30], which is a strong parameterized sentence embedding using multiple objectives and transformer architecture;

10) SBERT, which is a state-of-the-art sentence embedding model obtained by training a Siamese network over BERT.

Note: ten baseline methods are used for comparison.

A. Semantic Textual Similarity

To evaluate semantic textual similarity, we use 2012-2016 STS datasets [31]–[35].

They contain sentence pairs and labels between 0 and 5, which indicate their semantic relatedness. Some methods learn a complex regression model that maps sentence pairs to their similarity score.
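In contrast, embedding-based methods are typically scored without any task-specific regression: compute the cosine similarity of the two sentence embeddings and correlate it with the gold labels. A minimal sketch of this protocol is shown below (Spearman correlation is used here for illustration; the exact correlation metric varies across papers).

```python
# Minimal sketch of STS scoring with sentence embeddings: cosine similarity
# between the two embeddings, compared with the gold 0-5 labels via Spearman
# correlation. `embed` stands in for any sentence-embedding function.
import numpy as np
from scipy.stats import spearmanr

def sts_score(sentence_pairs, gold_labels, embed):
    sims = []
    for s1, s2 in sentence_pairs:
        v1, v2 = embed(s1), embed(s2)
        sims.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(sims, gold_labels)
    return rho   # rank correlation, often reported multiplied by 100
```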

In our experiments, we do not include the representation from the first three layers since their representations are less contextualized as reported in [20]. Some superficial information is captured by those representations and they play a subsidiary role in most tasks [8].

B. Supervised Downstream Tasks

For supervised tasks, we compare SBERT-WK with other sentence embedding methods in the following eight downstream tasks.

Note: eight downstream tasks are used for evaluation (a sketch of the usual evaluation protocol follows the list).

 • MR: Binary sentiment prediction on movie reviews [39].

• CR: Binary sentiment prediction on customer product reviews [40].

• SUBJ: Binary subjectivity prediction on movie reviews and plot summaries [41].

• MPQA: Phrase-level opinion polarity classification [42].

• SST2: Stanford Sentiment Treebank with binary labels [43].

• TREC: Question type classification with 6 classes [44].

• MRPC: Microsoft Research Paraphrase Corpus for paraphrase prediction [45].

• SICK-E: Natural language inference dataset [36].
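A minimal sketch of the usual SentEval-style protocol for these tasks, assuming a frozen sentence encoder and a simple logistic regression classifier trained on top of the embeddings (the data arrays are placeholders for any of the eight datasets):

```python
# Minimal sketch of a SentEval-style evaluation: freeze the sentence encoder
# and train a logistic regression classifier on top of the embeddings.
# The train/test arrays are placeholders for any of the eight datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_downstream(train_sents, train_labels, test_sents, test_labels, embed):
    X_train = np.stack([embed(s) for s in train_sents])
    X_test = np.stack([embed(s) for s in test_sents])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    return accuracy_score(test_labels, clf.predict(X_test))
```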

C. Probing Tasks

This could be attributed to the fact that SBERT pays more attention to sentence-level information in its training objective; it focuses more on sentence-pair similarities.

In contrast, the masked language modeling objective in BERT focuses more on the word or phrase level, while the next-sentence prediction objective captures inter-sentence information.

Probing tasks test word-level information and the inner structure of a sentence.

D. Ablation and Sensitivity Study

To verify the effectiveness of each module in the proposed SBERT-WK model, we conduct an ablation study by adding one module at a time. Also, the effect of two hyperparameters (the context window size and the starting layer selection) is evaluated. The averaged results on the textual semantic similarity datasets, including STS12-STS16 and STSB, are presented.

Inference time comparison of InferSent, BERT, XLNet, SBERT, and SBERT-WK; data are collected from 5 trials.
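A minimal sketch of how such per-encoder inference time could be measured over multiple trials (the `embed` callable and sentence list are placeholders; absolute numbers depend on hardware):

```python
# Minimal sketch of a multi-trial inference-time measurement; `embed` is any
# sentence-embedding callable and `sentences` a list of test sentences.
import time
import statistics

def time_encoder(embed, sentences, trials=5):
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        for s in sentences:
            embed(s)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)
```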

 VI. CONCLUSION AND FUTURE WORK

In this work, we provided an in-depth study of the evolving pattern of word representations across layers in deep contextualized models. Furthermore, we proposed a novel sentence embedding model, called SBERT-WK, that dissects deep contextualized models and leverages the diverse information learned in different layers for effective sentence representations. SBERT-WK is efficient, and it demands no further training. Evaluation was conducted on a wide range of tasks to demonstrate the effectiveness of SBERT-WK.
