BERT for unsupervised text tasks

This post discusses how we use BERT and similar self-attention architectures to address various text crunching tasks at Ether Labs.

Self-attention architectures have caught the attention of NLP practitioners in recent years. They were first proposed in Vaswani et al., where the authors used a multi-headed self-attention architecture for machine translation tasks.

BERT Architecture Overview

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al.
  • Each word in BERT gets n_layers × (num_heads × attn_vector) representations that capture the word in its current context
  • For example, in BERT base: n_layers = 12, num_heads = 12, attn_vector = dim(64)
  • In this case, we have 12 × 12 × 64 representational sub-spaces for each word to leverage
  • This leaves us with both a challenge and an opportunity: to leverage representations far richer than those of earlier LM architectures (see the sketch after this list)
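As a rough illustration of these representational sub-spaces, the sketch below inspects the layer-wise hidden states of BERT base. It assumes the HuggingFace transformers library (not part of the original write-up), and the 12 × 64 head view is only illustrative, since the encoder mixes head outputs through its projection and feed-forward layers.

```python
# A minimal sketch of inspecting BERT's layer-wise representations (assumes
# HuggingFace transformers). The 12 x 64 reshape is only an illustrative view
# of the hidden size, not the raw per-head attention outputs.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Performance appraisals are crucial.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # tuple: embedding layer + 12 encoder layers
print(len(hidden_states) - 1)           # 12 encoder layers
print(hidden_states[-1].shape)          # (1, seq_len, 768)

# 768 = 12 heads x 64 dims per head, i.e. 12 layers x 12 heads x dim(64) sub-spaces per token
per_head_view = hidden_states[-1].view(1, -1, 12, 64)
print(per_head_view.shape)              # (1, seq_len, 12, 64)
```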

Sentence relatedness with BERT

BERT representations can be a double-edged sword, given their richness. In our experiments with BERT, we have observed that they can often be misleading with conventional similarity metrics like cosine similarity. For example, consider the pair-wise cosine similarities in the case below, taken from a BERT model fine-tuned for HR-related discussions (a computation sketch follows the scores):

text1: Performance appraisals are both one of the most crucial parts of a successful business, and one of the most ignored.

text2: On the other, actual HR and business team leaders sometimes have a lackadaisical “I just do it because I have to” attitude.

text3: If your organization still sees employee appraisals as a concept they need to showcase just so they can “fit in” with other companies who do the same thing, change is the order of the day. How can you do that in a way that everyone likes?

text1 <> text2: 0.613270938396454

text1 <> text3: 0.634544332325459

text2 <> text3: 0.772294402122498
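Below is a minimal sketch of how such pair-wise cosine scores can be computed, assuming the HuggingFace transformers library and mean pooling over the last-layer token embeddings; the pooling used in our pipeline may differ, and since the scores above come from a fine-tuned model, a vanilla checkpoint will not reproduce the exact numbers.

```python
# A minimal sketch: pair-wise cosine similarity from BERT features
# (assumes HuggingFace transformers; mean pooling is an illustrative choice).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last-layer token embeddings into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return last_hidden.mean(dim=1).squeeze(0)             # (768,)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# the three example sentences from above (abbreviated here)
text1 = "Performance appraisals are both one of the most crucial parts of a successful business, and one of the most ignored."
text2 = "On the other, actual HR and business team leaders sometimes have a lackadaisical attitude."
text3 = "If your organization still sees employee appraisals as a concept to showcase, change is the order of the day."

print(cosine(embed(text1), embed(text2)))   # text1 <> text2
print(cosine(embed(text1), embed(text3)))   # text1 <> text3
print(cosine(embed(text2), embed(text3)))   # text2 <> text3
```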

A metric that ranks text1<>text3 higher than any other pair would be desirable. How do we get there?

Out of the box, BERT is pre-trained using two unsupervised tasks: Masked LM and Next Sentence Prediction (NSP).

Masked LM is a variant of the conventional language-model training setup, the next-word prediction task. For more details, please refer to section 3.1 in the original paper.

Next Sentence Prediction (NSP) is a novel task proposed by the authors to capture the relationship between sentences, beyond mere similarity.

For the text-pair relatedness challenge above, NSP seems an obvious fit. To extend its abilities beyond a single sentence pair, we have formulated a new training task.

From NSP to Context window

In a context-window setup, we label each pair of sentences occurring within a window of n sentences as 1, and 0 otherwise. For example, consider the following paragraph:

As a manager, it is important to develop several soft skills to keep your team charged. Invest time outside of work in developing effective communication skills and time management skills. Skills like these make it easier for your team to understand what you expect of them in a precise manner. Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Encourage them to give you feedback and ask any questions as well. Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems.

For a context window of n = 3, we generate the following training examples (a sketch of the pair-generation step follows the examples):

Invest time outside of work in developing effective communication skills and time management skills. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1

As a manager, it is important to develop several soft skills to keep your team charged. <SEP> Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. Label: 0

Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1
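Below is a minimal sketch of the pair-generation step. It assumes sentences have already been split (for instance with nltk's sent_tokenize) and reads "within a window of n sentences" as a sentence-index distance of less than n; the exact windowing and negative-sampling strategy in our pipeline may differ.

```python
# A minimal sketch of context-window pair generation from a list of sentences.
from itertools import combinations

def context_window_pairs(sentences, n=3, sep=" <SEP> "):
    """Label a sentence pair 1 if the sentences are within n of each other, else 0."""
    examples = []
    for i, j in combinations(range(len(sentences)), 2):
        label = 1 if abs(i - j) < n else 0
        examples.append((sentences[i] + sep + sentences[j], label))
    return examples

paragraph_sentences = [
    "As a manager, it is important to develop several soft skills to keep your team charged.",
    "Invest time outside of work in developing effective communication skills and time management skills.",
    "Skills like these make it easier for your team to understand what you expect of them in a precise manner.",
    "Check in with your team members regularly to address any issues and to give feedback about their work.",
    "Encourage them to give you feedback and ask any questions as well.",
    "Effective communications can help you identify issues and nip them in the bud before they escalate.",
]

for text, label in context_window_pairs(paragraph_sentences, n=3)[:3]:
    print(label, text)
```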

This training paradigm enables the model to learn relationships between sentences beyond pair-wise proximity. After fine-tuning BERT on HR data with the context-window task, we got the following pair-wise relatedness scores:

text1 <> text2: 0.1215614

text1 <> text3: 0.899943

text2 <> text3: 0.480266

This captures sentence relatedness beyond similarity. In practice, we use a weighted combination of cosine similarity and the context-window score to measure the relationship between two sentences.
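The combination itself is straightforward; below is a minimal sketch, where the context-window score is read as the fine-tuned model's probability that the pair is related and the weight alpha = 0.5 is an illustrative placeholder, not the value used in our pipeline.

```python
def relatedness(cosine_score: float, context_score: float, alpha: float = 0.5) -> float:
    """Weighted combination of cosine similarity and context-window relatedness.
    alpha = 0.5 is an illustrative placeholder, not a production setting."""
    return alpha * cosine_score + (1.0 - alpha) * context_score

# e.g. text1 <> text3: cosine 0.6345 and context-window score 0.8999
print(relatedness(0.634544332325459, 0.899943))   # ~0.767
```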

Document Embeddings

Generating feature representations for large documents (for retrieval tasks) has always been a challenge for the NLP community. Concatenating sentence representations makes them impractical for downstream tasks, and averaging or other aggregation approaches (such as p-means word embeddings) fail beyond a certain document length. We have explored several ways to address these problems and found the following approaches to be effective:

BERT+RNN Encoder

We have set up a supervised task to encode document representations, taking inspiration from RNN/LSTM-based sequence prediction tasks (a sketch follows the steps below).

[step-1] extract BERT features for each sentence in the document

[step-2] train an RNN/LSTM encoder to predict the next sentence's feature vector at each time step

[step-3] use the final hidden state of the RNN/LSTM as the encoded representation of the document
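A minimal sketch of this encoder, assuming PyTorch and pre-computed 768-dimensional BERT sentence features; the hidden size, optimiser and loss are illustrative choices rather than our production settings.

```python
# A minimal sketch of the BERT+RNN document encoder (assumes PyTorch).
import torch
import torch.nn as nn

class NextSentenceFeatureLSTM(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)   # predicts the next sentence feature

    def forward(self, sent_feats):                    # (batch, n_sentences, feat_dim)
        outputs, (h_n, _) = self.lstm(sent_feats)
        preds = self.proj(outputs)                    # one prediction per time step
        return preds, h_n[-1]                         # h_n[-1]: document representation

model = NextSentenceFeatureLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# toy batch: 4 documents x 10 sentences, with random stand-ins for BERT features
sent_feats = torch.randn(4, 10, 768)

# [step-2] at time step t, predict the feature vector of sentence t+1
preds, doc_embedding = model(sent_feats)
loss = loss_fn(preds[:, :-1, :], sent_feats[:, 1:, :])
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(doc_embedding.shape)   # (4, 512) -- [step-3] final hidden state as the document embedding
```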

This approach works well for smaller documents but is not effective for larger documents, due to the limitations of RNN/LSTM architectures.

Distributed Document Representations

Generating a single feature vector for an entire document fails to capture the whole essence of the document, even when using BERT-like architectures. We have reformulated the document-embedding problem as identifying the candidate text segments within the document which, in combination, capture its maximum information content. We use the following approaches to get distributed representations: feature clustering and feature graph partitioning.

Feature clustering

[step-1] split the candidate document into text chunks

[step-2] extract BERT feature for each text chunk

[step-3] run the k-means clustering algorithm on the candidate document, using the relatedness score (discussed in the previous section) as the similarity metric, until convergence

[step-4] use the text segments closest to each centroid as the document embedding candidates

A general rule of thumb is to use a large chunk size and a small number of clusters. In practice, these values can be fixed for a specific problem type; a sketch of this approach follows.
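Below is a minimal sketch of the clustering step, assuming scikit-learn. Note that standard k-means uses Euclidean distance, so the chunk features are L2-normalised here as a simplified stand-in for the relatedness score described earlier.

```python
# A minimal sketch of feature clustering for distributed document representations
# (assumes scikit-learn; L2-normalised features approximate cosine-style relatedness).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_document(chunks, chunk_features, n_clusters=5):
    """chunks: list of text chunks; chunk_features: (n_chunks, dim) BERT features."""
    feats = normalize(np.asarray(chunk_features))   # unit-length chunk features
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    candidates = []
    for c in range(n_clusters):
        # [step-4] pick the chunk closest to each centroid as an embedding candidate
        dists = np.linalg.norm(feats - km.cluster_centers_[c], axis=1)
        candidates.append(chunks[int(dists.argmin())])
    return candidates
```

The selected chunks (or their features) together act as the distributed representation of the document.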

Feature Graph Partitioning

[step-1] split the candidate document into text chunks

[step-2] extract BERT feature for each text chunk

[step-3] build a graph with nodes as text chunks and relatedness score between nodes as edge scores

[step-4] run community detection algorithms (e.g., the Louvain algorithm) to extract community subgraphs

[step-5] use graph metrics like node/edge centrality or PageRank to identify the most influential node in each sub-graph; these nodes serve as the document embedding candidates (a sketch follows)
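A minimal sketch of this pipeline, assuming networkx (2.8+ for louvain_communities) and a relatedness(chunk_a, chunk_b) scorer like the weighted combination described earlier; the edge threshold is an illustrative choice.

```python
# A minimal sketch of feature-graph partitioning for document embedding candidates
# (assumes networkx; `relatedness` is the pair-wise scorer described earlier).
import networkx as nx

def graph_candidates(chunks, relatedness, min_score=0.3):
    # [step-3] nodes are text chunks; edges are weighted by pair-wise relatedness
    G = nx.Graph()
    G.add_nodes_from(range(len(chunks)))
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            score = relatedness(chunks[i], chunks[j])
            if score >= min_score:              # optional sparsification of weak edges
                G.add_edge(i, j, weight=score)

    # [step-4] community detection; greedy_modularity_communities is an
    # alternative on older networkx versions
    communities = nx.community.louvain_communities(G, weight="weight", seed=0)

    # [step-5] the highest-PageRank node in each community is an embedding candidate
    pagerank = nx.pagerank(G, weight="weight")
    return [chunks[max(comm, key=pagerank.get)] for comm in communities]
```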

Conclusion

This post highlights some novel approaches to using BERT for various text tasks. These approaches can be easily adapted to various use cases with minimal effort. More to come on Language Models, NLP, Geometric Deep Learning, Knowledge Graphs, contextual search and recommendations. Stay tuned!

 

by Venkata Dikshit