DS Wannabe 5-AM Project: DS 30-Day Interview Prep, Day 17

This post covers Empirical Risk Minimization (ERM) and PAC learning from statistical learning theory, the deep language representation technique ELMo, pragmatic analysis, syntactic parsing, and the ULMFiT, BERT, XLNet, and Transformer models, showing how these techniques improve text understanding and processing in NLP.

Q1. What is ERM (Empirical Risk Minimization)?

Empirical risk minimization (ERM) is a principle in statistical learning theory that defines a family of learning algorithms and is used to give theoretical bounds on their performance. The idea is that we don’t know exactly how well an algorithm will work in practice (its true "risk") because we don’t know the true distribution of the data it will operate on; as an alternative, we measure its performance on a known set of training data.

We assume that our samples come from the true (unknown) distribution and use our dataset as an approximation of it. The loss computed over the data points in our dataset is called the empirical risk. It is “empirical” rather than “true” because the dataset is only a subset of the whole population.

When building a learning model, we pick the function that minimizes the empirical risk, i.e., the average discrepancy between predicted and actual outputs over the data points in the dataset. The process of finding this function is called empirical risk minimization (ERM). What we actually want to minimize is the true risk, but we don’t have the information needed to do that directly, so we hope that the empirical risk will be close to the true risk.

Let’s get a better understanding with an example.

Suppose we want to build a model that can differentiate between a male and a female based on specific features. If we select 150 random people in which the women happen to be very short and the men very tall, the model might incorrectly conclude that height is the differentiating feature. To build a truly accurate model, we would have to gather every woman and man in the world and extract the differentiating features. Unfortunately, that is not possible! So we select a small sample and hope it is representative of the whole population.
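A minimal sketch of ERM on this toy setup, with synthetic heights standing in for the features (illustrative data and names only): we pick the threshold classifier that minimizes the empirical risk, i.e., the average 0-1 loss on the sample.

```python
import numpy as np

# Toy sample: heights (cm) with labels (0 = female, 1 = male).
# Synthetic, illustrative data only -- in practice the true distribution is unknown.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(162, 6, 75), rng.normal(176, 6, 75)])
labels = np.concatenate([np.zeros(75), np.ones(75)])

def empirical_risk(threshold):
    """Average 0-1 loss on the sample: the empirical risk of the
    threshold classifier h(x) = 1[x >= threshold]."""
    predictions = (heights >= threshold).astype(int)
    return np.mean(predictions != labels)

# ERM: choose the hypothesis (here, a threshold) minimizing empirical risk.
candidates = np.linspace(heights.min(), heights.max(), 200)
best = min(candidates, key=empirical_risk)
print(f"ERM threshold: {best:.1f} cm, empirical risk: {empirical_risk(best):.3f}")
```

If the sample is unrepresentative (e.g., unusually short women and tall men), the learned threshold will still have low empirical risk but a high true risk, which is exactly the gap ERM hopes is small.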

Q2. What is PAC (Probably Approximately Correct)?

PAC: In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of machine learning. The PAC model quantifies how efficiently and reliably a learning algorithm can learn a target concept on a given task. "Probably approximately correct" means the algorithm outputs, with high probability ("probably"), a hypothesis or model that is approximately correct, i.e., whose error rate on unseen data does not exceed some predefined bound. PAC learning thus gives a way to assess an algorithm's ability to learn an unknown concept from a limited sample while guaranteeing good performance on new data.

The learner receives samples and must pick a generalization function (called the hypothesis) from a specific class of possible functions. The goal is that, with high probability, the selected function will have low generalization error. The learner must be able to learn the concept for any arbitrary approximation ratio, probability of success, or distribution of the samples.

A hypothesis class H is PAC (Probably Approximately Correct) learnable if there exist a sample-complexity function m_H(ε, δ) and a learning algorithm such that for any labeling function f, any distribution D over the domain of inputs X, and any ε, δ ∈ (0, 1), running the algorithm on m ≥ m_H(ε, δ) i.i.d. samples produces a hypothesis h whose true error is at most ε with probability at least 1 − δ. A labeling function is nothing other than a specific function f that labels the data in the domain.
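Written compactly (realizable case, matching the definition above):

```latex
% PAC guarantee: given m >= m_H(epsilon, delta) i.i.d. samples,
% the returned hypothesis h satisfies, for any f, D, epsilon, delta:
\Pr\big[\, L_{D,f}(h) \le \epsilon \,\big] \ge 1 - \delta,
\qquad \text{where } L_{D,f}(h) = \Pr_{x \sim D}\big[\, h(x) \ne f(x) \,\big].
```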

Q3. What is ELMo?

ELMo is a novel way to represent words as vectors or embeddings. These word embeddings help achieve state-of-the-art (SOTA) results in several NLP tasks:

It is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how those uses vary across linguistic contexts. The word vectors are learned functions of the internal states of a deep biLM (bidirectional language model) pre-trained on a large text corpus. They can easily be added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis. Unlike traditional word embeddings such as Word2Vec or GloVe, the representation ELMo (Embeddings from Language Models) produces for a word changes with its context, meaning the same word can have different embeddings in different sentences. ELMo achieves this through a pre-trained bidirectional LSTM (long short-term memory) language model, which better captures a word's semantic and syntactic information and improves performance on a variety of NLP tasks.
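A minimal sketch using AllenNLP's Elmo module (one common setup, not the only way to use ELMo; the options/weights paths are placeholders for the pretrained biLM files you download separately):

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths to the pretrained biLM config and weights.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

# The same word "bank" gets a different vector in each context.
sentences = [["I", "deposited", "cash", "at", "the", "bank"],
             ["We", "sat", "on", "the", "river", "bank"]]
character_ids = batch_to_ids(sentences)   # character-level ids, no fixed vocab
embeddings = elmo(character_ids)["elmo_representations"][0]
print(embeddings.shape)  # (batch, max_len, 1024) for the standard model
```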

Q4. What is Pragmatic Analysis in NLP?

Pragmatic Analysis (PA): It deals with knowledge that lies outside the documents and queries themselves, i.e., real-world knowledge. PA reinterprets what was described in terms of what was actually meant, deriving the various aspects of language that require real-world knowledge.

It deals with the overall communicative and social content and its effect on interpretation; that is, it abstracts the meaningful use of language in real situations. In this analysis, the main focus is always on reinterpreting what was said in terms of what was intended.

It helps users to discover this intended effect by applying a set of rules that characterize cooperative dialogues.

E.g., "close the window?" should be interpreted as a request instead of an order.

Pragmatic analysis is an area of natural language processing (NLP) that studies how language is used across different communicative situations. It focuses on understanding how language is actually used, including the intent behind utterances, contextual implications, the expectations of the participants, and social norms. Pragmatic analysis tries to uncover the meaning behind speech, recognize phenomena such as sarcasm, humor, and politeness, and understand hints and implied meaning in dialogue. In NLP, it helps machines better understand and generate natural language, especially in applications such as dialogue systems, sentiment analysis, and automatic summarization.

Q5. What is Syntactic Parsing?

Syntactic Parsing or Dependency Parsing: It is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated by various parsing algorithms. Parse trees are useful in applications such as grammar checking and, more importantly, play a critical role in the semantic analysis stage. For example, to answer the question “Who is the point guard for the LA Lakers in the next game?” we need to identify its subject, objects, and attributes to work out that the user wants the point guard of the LA Lakers specifically for the next game.

Example:
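A sketch with spaCy's dependency parser (spaCy is one common choice, not the only parser; it assumes the small English model en_core_web_sm has been downloaded with "python -m spacy download en_core_web_sm"):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a parser
doc = nlp("Who is the point guard for the LA Lakers in the next game?")

# Each token points to its syntactic head with a dependency label,
# exposing subjects, objects, and modifiers for semantic analysis.
for token in doc:
    print(f"{token.text:<8} {token.dep_:<10} head={token.head.text}")
```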


Q6. What is ULMFit?


Transfer learning in NLP (natural language processing) was an area that had not been explored with great success. But in May 2018, Jeremy Howard and Sebastian Ruder published the paper “Universal Language Model Fine-tuning for Text Classification,” which explores the benefits of using a pre-trained model for text classification. It proposes ULMFiT (Universal Language Model Fine-tuning for Text Classification), a transfer learning method that can be applied to any task in NLP, and the method outperforms the state of the art on six text classification tasks.

ULMFiT uses a regular LSTM in what was the state-of-the-art language model architecture at the time (AWD-LSTM). The LSTM network has three layers. A single architecture is used throughout, for pre-training as well as for fine-tuning.

ULMFiT achieves state-of-the-art results using novel techniques like:

  • Discriminative fine-tuning

  • Slanted triangular learning rates

  • Gradual unfreezing

Different layers of a neural network capture different types of information, so they should be fine-tuned to varying extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate.
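A minimal PyTorch sketch of discriminative fine-tuning via per-layer parameter groups (the three-layer model is a stand-in for the AWD-LSTM layers; the 2.6 divisor follows the ratio suggested in the ULMFiT paper):

```python
import torch
from torch import nn

# A stand-in 3-layer model; in ULMFiT these would be the AWD-LSTM layers.
model = nn.Sequential(nn.Linear(100, 64), nn.Linear(64, 32), nn.Linear(32, 2))

# Lower layers capture general features, so they get smaller learning rates.
# The ULMFiT paper suggests lr_{l-1} = lr_l / 2.6.
base_lr = 1e-3
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": base_lr / 2.6 ** 2},
    {"params": model[1].parameters(), "lr": base_lr / 2.6},
    {"params": model[2].parameters(), "lr": base_lr},
])
```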

Slanted Triangular Learning Rates

The model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters. Using a constant learning rate throughout training is not the best way to achieve this behaviour. Instead, Slanted Triangular Learning Rates (STLR) linearly increase the learning rate at first and then linearly decay it.
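A sketch of the STLR schedule using the paper's parameterization (cut_frac, ratio, and the defaults here follow the ULMFiT paper's notation):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t out of T total iterations:
    linear warm-up for the first cut_frac of training, then linear decay.
    ratio controls how much smaller the smallest lr is than lr_max."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                      # rising phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # decaying phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: a short, steep rise to lr_max, then a long decay.
for t in [0, 50, 100, 500, 999]:
    print(t, round(slanted_triangular_lr(t, 1000), 5))
```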

Gradual Unfreezing

Gradual unfreezing is the practice of unfreezing the layers gradually, which avoids catastrophic loss of the knowledge the model already possesses. It first unfreezes the top layer and fine-tunes all unfrozen layers for one epoch. It then unfreezes the next lower frozen layer and repeats, until all layers have been unfrozen and fine-tuned to convergence at the last iteration.
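A PyTorch-style sketch of that loop (train_one_epoch is a hypothetical, user-supplied training function, not a library call):

```python
from torch import nn

def gradual_unfreeze(model: nn.Sequential, train_one_epoch):
    """Unfreeze layers one at a time, from the top (last) layer down,
    fine-tuning all currently unfrozen layers for one epoch each time."""
    for p in model.parameters():        # start with everything frozen
        p.requires_grad = False
    for layer in reversed(list(model)): # top layer first, then deeper ones
        for p in layer.parameters():
            p.requires_grad = True
        train_one_epoch(model)          # fine-tune all unfrozen layers
```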

Q7. What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is an open-sourced NLP pre-training model developed by researchers at Google in 2018. A close relative of GPT (Generative Pre-Training), BERT has outperformed several models in NLP and provided top results in question answering, natural language inference (MNLI), and other frameworks.

What makes it unique among these models is that it is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Since it is open-sourced, anyone with machine learning knowledge can build an NLP model on top of it without sourcing massive datasets for training, saving time, energy, and resources.

How does it work?

Traditional context-free models (like word2vec or GloVe) generate a single word-embedding representation for each word in the vocabulary, which means the word “right” would have the same context-free representation in “I’m sure I’m right” and “Take a right turn.” BERT, however, represents each word based on both its previous and next context, making it bidirectional. While the concept of bidirectionality had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations in a deep neural network.
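A short sketch with the Hugging Face transformers library (not part of the original BERT release) showing that “right” gets a different contextual vector in each sentence; the pretrained model is downloaded on first use:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return BERT's contextual vector for the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embed_word("I'm sure I'm right.", "right")
v2 = embed_word("Take a right turn.", "right")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1: context changes the vector
```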

Q8.What is XLNet?

XLNet is a BERT-like model rather than a totally different one, and a very promising one. In short, XLNet is a generalized autoregressive pretraining method.

Autoregressive (AR) language model: a model that uses the context words to predict the next word. Here the context is constrained to a single direction, either forward or backward.

The advantage of AR language models is that they are good at generative natural language processing (NLP) tasks: because generation usually proceeds in the forward direction, AR language models naturally work well on such tasks.

But autoregressive language models have a disadvantage: they can only use the forward context or the backward context, which means they cannot use forward and backward context at the same time.
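In symbols: a forward AR model factorizes the sequence likelihood in one fixed order, while XLNet's permutation language modeling objective (from the XLNet paper) maximizes the expected likelihood over all factorization orders z, so each position sees both sides of its context in expectation:

```latex
% Forward autoregressive objective:
\max_{\theta} \; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\big(x_t \mid \mathbf{x}_{<t}\big)
% XLNet's permutation language modeling objective:
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \right]
```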

Q9. What is the transformer?

Transformer: It is a deep learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Like recurrent neural networks (RNNs), it is designed to handle ordered sequences of data, such as natural language, for tasks like machine translation and text summarization. Unlike RNNs, however, Transformers do not require the sequence to be processed in order. So, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during training.

Transformers were developed to solve the problem of sequence transduction, meaning any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc.

For a model to perform sequence transduction, it needs some sort of memory. For example, say we are translating the following sentence into another language (French):

“The Transformers” is a Japanese band. That band was formed in 1968, during the height of Japanese music history.

In the above example, the phrase “that band” in the second sentence refers to the band “The Transformers” introduced in the first sentence. When you read about the band in the second sentence, you know it is referring to “The Transformers.” That can be important for translation.

To translate sentences like these, a model needs to figure out these sorts of dependencies and connections. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used to deal with this problem because of their properties; the Transformer instead handles it with attention, sketched below.
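The Transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V from Vaswani et al. (2017), which lets every position attend to every other position in a single step. A minimal NumPy sketch (toy shapes, illustrative only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Every query attends to all keys at once, so long-range dependencies
    (like "that band" -> "The Transformers") cost a single step."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Toy self-attention: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)          # Q = K = V = X
print(out.shape)  # (4, 8)
```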

Q10. What is Text summarization?

Text summarization: It is the process of shortening a text document to create a summary of the significant points of the original document.

Types of Text Summarization Methods:

Text summarization methods are commonly classified into two types: extractive summarization, which selects the most important sentences from the original document and concatenates them, and abstractive summarization, which generates new sentences that convey the document's key points.
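As a quick illustration of abstractive summarization, a sketch with the Hugging Face transformers pipeline (it downloads a default pretrained model on first run; the input text and length limits are illustrative):

```python
from transformers import pipeline

# Abstractive summarization with a default pretrained model.
summarizer = pipeline("summarization")
article = (
    "The Transformer is a deep learning model introduced in 2017 that relies "
    "entirely on attention. Unlike RNNs, it does not process tokens in order, "
    "which allows far more parallelization during training."
)
print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])
```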

Summary

In machine learning, and especially in natural language processing (NLP), ULMFiT, BERT, XLNet, and the Transformer are important models and architectures, each with its own characteristics and applications:

  1. ULMFit (Universal Language Model Fine-tuning):

    ULMFiT is a transfer learning method based on a pre-trained language model, used mainly for text classification. It improves NLP task performance through a three-stage training process (pre-training, fine-tuning, and classifier training). Its base model is AWD-LSTM, a recurrent neural network (RNN).
  2. BERT (Bidirectional Encoder Representations from Transformers):

    BERT is a pre-trained language representation model proposed by Google in 2018 that uses the Transformer's encoder architecture. Its innovation is bidirectional training: it considers both the left and right context simultaneously when producing a word's representation, which yields significant performance gains on a variety of NLP tasks.
  3. XLNet:

    XLNet is a Transformer-based autoregressive pre-trained language model proposed jointly by Google Brain and CMU. Unlike BERT, XLNet captures bidirectional context by maximizing the expected likelihood over all permutations of the input sequence, which avoids the pretraining/fine-tuning mismatch BERT can introduce by training with [MASK] tokens. XLNet combines the advantages of autoregressive and autoencoding language modeling.
  4. Transformer:

    The Transformer is an architecture based entirely on attention, proposed by Vaswani et al. in 2017 for sequence-to-sequence tasks. It abandons the previously dominant recurrent structure in favor of multi-head attention that processes data in parallel. The Transformer architecture is the foundation of BERT, XLNet, and similar models and has profoundly influenced most of today's state-of-the-art NLP models.
