DS Wannabe's 5-AM Project: DS 30-Day Interview Prep, Day 19

Q1. What is LSI (Latent Semantic Indexing)?

Latent Semantic Indexing (LSI): It is an indexing and retrieval method that uses a mathematical technique called SVD (Singular Value Decomposition) to find patterns in the relationships between terms and concepts contained in an unstructured collection of text. It is based on the principle that words used in the same contexts tend to have similar meanings.

For example, "Tiger" and "Woods" occurring together are associated with the golfer rather than the animal, and "Paris" and "Hilton" occurring together are associated with the celebrity rather than the city or the hotel chain.

Example:

If you use LSI to index a collection of articles and the words “fan” and “regulator” appear together frequently enough, the search algorithm notices that the two terms are semantically close. A search for “fan” will therefore return items containing that term, but also items that contain only the word “regulator”. LSI does not understand word meaning or distance; by examining a sufficient number of documents, it only learns that the two terms tend to occur together. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.

(The original post includes a diagram comparing LSI results with plain keyword-search results; in it, W stands for a document.)

Latent Semantic Indexing (LSI) is a natural language processing and information retrieval technique used to discover the latent semantic structure in a collection of text. Using SVD, LSI reduces the high-dimensional term-document matrix to a lower-dimensional semantic space, revealing implicit relationships between words and documents.

A practical example

Suppose we have a corpus of documents covering topics such as "sports", "finance", and "technology". In the raw term-document matrix, each row represents a unique term, each column represents a document, and each cell holds the frequency (or weight) of that term in that document.

  1. Discovering latent topics: applying LSI reduces this high-dimensional matrix to a lower-dimensional "topic space". In the process, LSI may uncover implicit topics such as "ball sports", "stock market", or "programming languages": finer-grained versions of the original topics, or new topics that cut across the original category boundaries.

  2. Improving information retrieval: a user searching for "Python" probably wants documents about the programming language. With a plain term-frequency approach, only documents containing the word "Python" are retrieved. LSI can also surface related documents under the implicit "programming" topic, even if they never mention "Python", because they share the same semantic space through related words such as "programming" and "script".

  3. Document classification and clustering: LSI can help identify a document's topic even without obvious keyword cues. For example, an article about "Bitcoin trading" may never use the word "finance", yet LSI can place it under the "finance" topic because its vocabulary sits close to the finance-topic vocabulary in the semantic space.

These examples show how LSI improves information retrieval, document classification, and other text-related tasks by revealing the latent semantic structure of the data, making the results more accurate and relevant.
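As a minimal sketch of LSI in practice, here is scikit-learn's TruncatedSVD applied to a TF-IDF matrix; the toy corpus and the choice of two components are illustrative assumptions, not part of the original post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# A toy corpus; in practice this would be a large document collection.
docs = [
    "the fan stopped working so I replaced the regulator",
    "ceiling fan speed is controlled by the regulator",
    "the stock market fell sharply today",
    "investors watched the market and bond prices",
]

# Build the document-term TF-IDF matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# LSI: reduce the matrix to a 2-dimensional latent semantic space via SVD.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsi.fit_transform(X)  # documents projected into the latent space

print(doc_topics)  # fan/regulator documents cluster together, as do market documents
```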

Q2. What is Named Entity Recognition? What are some use cases of NER?

Named-entity recognition (NER): Also known as entity extraction or entity identification, it is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and places, expressions of time, quantities, monetary values, percentages, and more.

In each text document, particular terms represent specific entities that are more informative and appear in a distinct context. These named entities refer to real-world objects such as people, places, organizations, or institutions, and are often expressed by proper names. A naive approach is to find them by looking at the noun phrases in the text. NER, also known as entity chunking/extraction, is a popular technique in information extraction for analyzing and segmenting named entities and classifying them into predefined classes.

Named Entity Recognition use cases

  • Classifying content for news providers:
    NER can automatically scan entire articles and reveal which significant people, organizations, and places are discussed in them. Knowing the relevant tags for each article helps in automatically categorizing articles into defined hierarchies and enables smooth content discovery.

  • Customer Support:
    Let's say we are handling the customer support department of an electronics store with multiple branches worldwide, and we go through a large number of mentions in our customers' feedback, for instance a complaint that mentions a Fitbit bought at our Bangalore branch.

Now, if we pass that feedback through a Named Entity Recognition API, it pulls out the entities Bangalore (location) and Fitbit (product). This can then be used to categorize the complaint and assign it to the relevant department within the organization.
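As a sketch of what such an extraction can look like in code, here is the same idea with spaCy (assuming the en_core_web_sm model is installed; the feedback sentence is an illustrative stand-in):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

feedback = "The Fitbit I ordered from your Bangalore store arrived with a broken strap."
doc = nlp(feedback)

# Print each named entity with its predicted label (e.g. GPE for locations,
# ORG/PRODUCT for brands and products, depending on the model).
for ent in doc.ents:
    print(ent.text, ent.label_)
```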

These use cases show the practical value of named entity recognition for automating text analysis and information extraction, improving content discoverability, and making customer support more efficient.

Q3. What is perplexity?

Perplexity: It is a measurement of how well a probability model predicts a sample. In the context of NLP, perplexity (literally, "confusion") is one way to evaluate language models.

The term perplexity has three closely related meanings: it is a measure of how easy a probability distribution is to predict, a measure of how variable a prediction model is, and a measure of prediction error. The third meaning is calculated slightly differently, but all three share the same fundamental idea.

In more detail:

  1. Difficulty of prediction: perplexity measures how hard a probability distribution is to predict; the more spread out the distribution, the harder the prediction and the higher the perplexity.

  2. Variability of the model: perplexity reflects how uncertain the model's predictions are; the more uncertain they are, the higher the perplexity.

  3. Prediction error: perplexity can also be used as a measure of prediction error; it is calculated slightly differently, but it is still fundamentally measuring how accurately the model predicts.

For a language model, perplexity is usually defined as the inverse of the probability the model assigns to a test set, normalized by the number of words in the test set (that is, raised to the power 1/N, where N is the number of words). A lower perplexity means the model predicts the data better, i.e. it is less uncertain about the test data.
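A minimal sketch of the computation, assuming we already have the per-token probabilities the model assigned to a test sequence (the numbers are made up for illustration):

```python
import math

# Probabilities the language model assigned to each token in the test set,
# P(w_i | w_1 ... w_{i-1}); these values are illustrative only.
token_probs = [0.20, 0.05, 0.10, 0.30, 0.15]

N = len(token_probs)

# Perplexity = P(w_1 ... w_N)^(-1/N), computed in log space for stability.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / N
perplexity = math.exp(avg_neg_log_prob)

print(round(perplexity, 2))  # lower is better
```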

For example, if a language model predicts the word sequences in a test set well, it assigns them high probability, which yields a low perplexity and reflects good performance. Conversely, if it predicts the test sequences poorly, it assigns them low probability, which yields a high perplexity and reflects poor performance.

In short, perplexity is an important metric for evaluating language models and other probability models, because it tells us how well the model handles unseen data.

Q4. What is a language model?

Language Modelling (LM): It is one of the essential parts of modern NLP. There are many applications of language modelling, such as Machine Translation, Spell Correction, Speech Recognition, Summarization, Question Answering, Sentiment Analysis, etc. Each of these tasks requires a language model, which is needed to represent text in a form a machine can work with.

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability to the whole sequence.

It provides context to distinguish between phrases and words that sound similar. For example, in American English, the phrases "wreck a nice beach" and "recognize speech" sound alike but mean different things.

Data sparsity is a significant problem in building language models: most possible word sequences are never observed in training. One solution is to assume that the probability of a word depends only on the previous n − 1 words. This is called an n-gram model, or a unigram model when n = 1. The unigram model is also known as the bag-of-words model.
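As an illustration of the counting behind such a model, here is a tiny bigram (n = 2) sketch; the toy corpus and the add-one smoothing are illustrative choices to soften the data-sparsity problem:

```python
from collections import defaultdict, Counter

# Toy training corpus; a real model would be trained on far more text.
corpus = [
    ["i", "like", "to", "write"],
    ["i", "like", "to", "read"],
    ["they", "like", "to", "read"],
]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()
vocab = set()

for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    vocab.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, word):
    # P(word | prev) with add-one (Laplace) smoothing.
    return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + len(vocab))

print(bigram_prob("like", "to"))     # high: "like to" is frequent in the corpus
print(bigram_prob("like", "write"))  # low: never observed after "like"
```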

An n-gram model reduces complexity by limiting how far back a word's dependence reaches, which partially alleviates the data-sparsity problem. The unigram model is the simplest form and assumes each word is independent of the others; a bigram model assumes a word depends only on the single word before it, and so on.

In short, a language model captures the statistical properties of language by assigning probabilities to word sequences, and it provides the foundation for a wide range of NLP applications.

Q5. How does a language model help in NLP tasks?

The probabilities returned by a language model are most useful for comparing how likely different candidate sentences are to be "good sentences." This is useful in many practical tasks, for example:

Spell checking: We observe a word that is not recognized as a known word in a sentence. Using the edit-distance algorithm, we find the known words closest to the unknown word; these are the candidate corrections. For example, we observe the word "wurd" in the sentence "I like to write this wurd." The candidate corrections are ["word", "weird", "wind"]. How can we select the most likely correction for the suspected error "wurd"? The language model scores the sentence with each candidate substituted and picks the candidate that yields the most probable sentence.
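A self-contained sketch of that ranking step; the sentence scores below are made-up numbers standing in for what a trained language model would return:

```python
# Illustrative sentence log-probabilities such as a trained language model
# might assign; the numbers are invented for this example.
lm_scores = {
    "i like to write this word":  -9.1,
    "i like to write this weird": -12.4,
    "i like to write this wind":  -13.0,
}

def sentence_logprob(sentence):
    # Stand-in for scoring a sentence with a real language model.
    return lm_scores[sentence]

candidates = ["word", "weird", "wind"]
context = "i like to write this"

# Pick the candidate whose substituted sentence the model finds most probable.
best = max(candidates, key=lambda c: sentence_logprob(f"{context} {c}"))
print(best)  # -> word
```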

Automatic Speech Recognition: we receive a stream of phonemes as input; an acoustic model proposes candidate words for sub-sequences of that stream, and the language model ranks the most likely sequences of words compatible with those candidates.

Machine Translation: each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequence of candidate target words.

Word embedding is a text-representation technique in which words with similar meanings have similar representations in a vector space. Word embeddings are trained to map words into a continuous vector space so that the vectors capture semantic and syntactic relationships between words.

Word embeddings are a bridge between the human understanding of language and a machine's. With embeddings, words are represented as vectors in an n-dimensional space, which is essential for most natural language processing problems.

Word embeddings are not obtained at random; they are learned by training a neural network, and different embedding sets will differ depending on the training data and the model. A well-known implementation is Google's Word2Vec, which trains by predicting the words that appear around a given word. For the word "cat", for example, the network might predict words such as "kitten" and "feline". This intuition of placing semantically related words close to each other in the vector space is what allows embeddings to capture word meaning and usage context.

Overall, word embeddings give NLP a powerful tool for turning words into mathematical objects that machines can understand and process, supporting semantics-based tasks such as text classification, sentiment analysis, and machine translation.
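A minimal sketch of training and querying Word2Vec with gensim; the toy corpus is far too small to give meaningful vectors and only illustrates the API:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real embeddings need millions of sentences.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["a", "kitten", "is", "a", "young", "cat"],
    ["dogs", "and", "cats", "are", "popular", "pets"],
]

# Train a small skip-gram model (sg=1); vector_size and window are
# illustrative hyperparameters.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vector = model.wv["cat"]                       # 50-dimensional numpy array
print(model.wv.most_similar("cat", topn=3))    # nearest neighbours in the space
```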

Q6. Do you have an idea about fastText?

fastText: It is another word-embedding method and an extension of the word2vec model. Instead of learning vectors for words directly, it represents each word as a bag of character n-grams. For example, for the word “artificial” with n = 3, the fastText representation is <ar, art, rti, tif, ifi, fic, ici, ial, al>, where the angle brackets indicate the beginning and end of the word.

This helps capture the meaning of shorter words and allows the embeddings to understand prefixes and suffixes. Once the word has been represented by its character n-grams, a skip-gram model is trained to learn the embeddings. The model can be regarded as a bag-of-words model with a sliding window over the word, because no internal structure of the word beyond the n-grams is taken into account: as long as the characters fall within the window, the order of the n-grams does not matter.

fastText works well with rare words: even if a word was not seen during training, it can be broken down into character n-grams to obtain an embedding.

Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary, so handling out-of-vocabulary words is a huge advantage of fastText.
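A small sketch of the character n-gram decomposition described above (an illustrative helper, not fastText's actual implementation):

```python
def char_ngrams(word, n=3):
    # Wrap the word with boundary markers, as fastText does,
    # then slide a window of size n across it.
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
# Note the full sliding window also yields 'cia', which the abbreviated
# listing quoted above omits.
```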

In short, by decomposing words into smaller character combinations (n-grams), fastText can effectively handle and represent rare words, new words, and compound words, making it a flexible and powerful word-embedding method.

Q7. What is GloVe?

GloVe (Global Vectors) is a method for word representation. It is an unsupervised learning algorithm developed at Stanford that obtains word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. The resulting embeddings show interesting linear substructures of the words in vector space.

The GloVe model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a new word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

How does GloVe find meaning in statistics?

GloVe captures statistical information about words by building a word-word co-occurrence matrix, which records how often each word appears together with other words across the whole corpus. These co-occurrence patterns reveal semantic and syntactic relationships, because words that frequently appear in similar contexts tend to have similar meanings or related grammatical roles.

The model factorizes this co-occurrence matrix, mapping each word to a vector so that the vectors preserve the co-occurrence information. In the resulting space, semantically similar words end up close together, while unrelated words end up far apart.

What makes GloVe distinctive is that it combines global statistics (co-occurrence counts over the entire corpus) with local context information (relationships between words in specific contexts). The resulting embeddings capture not only word similarity but also richer linguistic patterns such as word analogies, which is why GloVe performs well on tasks like word analogy, word similarity, and named entity recognition.

GloVe aims to achieve two goals:

  • (1) Create word vectors that capture meaning in vector space.

  • (2) Take advantage of global count statistics instead of only local information.

Unlike word2vec, which learns by streaming over sentences, GloVe learns from a co-occurrence matrix and trains word vectors so that their differences predict co-occurrence ratios; GloVe weights the loss based on word frequency (a sketch of this objective appears at the end of this section). Somewhat surprisingly, word2vec and GloVe turn out to be remarkably similar, despite starting from entirely different points.

The main goal of the GloVe model is to exploit global statistics from large amounts of text, via the word-word co-occurrence matrix, to produce word vectors rich in semantic and syntactic information. In the vector space, these vectors reflect many kinds of relationships between words, including similarity, contrast, and co-occurrence. More specifically, GloVe's goals include:

  1. Capturing global co-occurrence statistics: unlike models based on a local context window (such as word2vec), GloVe focuses on word-word co-occurrence across the whole corpus, aiming to capture a broader range of lexical relationships.

  2. Producing meaningful word vectors: GloVe aims to produce vectors that reveal complex relationships between words. In this space, similarity can be measured by vector distance, and vector arithmetic (addition and subtraction) can expose more complex relations, such as "king" - "man" + "woman" ≈ "queen".

  3. Preserving linear substructure: GloVe tries to preserve linear substructure in the vector space, which allows word-analogy problems to be solved with simple vector arithmetic, for example finding "Rome", which relates to "Italy" as "Paris" relates to "France".

  4. Serving many NLP tasks: the vectors GloVe produces are intended to be useful across a wide range of NLP tasks, including text classification, sentiment analysis, named entity recognition, and machine translation, improving generalization and applicability.

  5. Handling rare words better: by weighting and smoothing the co-occurrence statistics, GloVe aims to improve the representation of infrequent words (although, as noted above, it still cannot produce vectors for words outside its vocabulary).
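As a sketch of how GloVe "weights the loss based on word frequency", here is its weighted least-squares objective for a single co-occurrence cell, written in plain Python with illustrative values:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights very frequent co-occurrences; zero when X_ij = 0.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cell_loss(w_i, w_j, b_i, b_j, x_ij):
    # Squared error between the dot product (plus biases) and the log
    # co-occurrence count, weighted by f(X_ij). The full GloVe loss sums
    # this over all pairs (i, j) with X_ij > 0.
    return glove_weight(x_ij) * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2

# Illustrative values: two 5-dimensional word vectors and a co-occurrence count.
w_i, w_j = np.random.rand(5), np.random.rand(5)
print(glove_cell_loss(w_i, w_j, b_i=0.1, b_j=-0.2, x_ij=25.0))
```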

Q8. Explain Gensim?

Gensim: It is billed as a Natural Language Processing package that does ‘Topic Modeling for Humans’, but it is practically much more than that. If you are unfamiliar with topic modeling, it is a technique to extract the underlying topics from large volumes of text.

Gensim provides algorithms like LDA and LSI (which we have already seen in previous questions) and the necessary sophistication to build high-quality topic models. It is an excellent library for processing text, working with word-vector models (such as FastText, Word2Vec, etc.), and building topic models. Another significant advantage of gensim is that it lets us handle large text files without loading the entire file into memory. In other words, it is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. Gensim is implemented in Python and Cython.

Gensim is designed to handle extensive text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.
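A minimal sketch of topic modeling with gensim's LDA; the toy corpus and the choice of two topics are illustrative:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents; real topic models need much larger corpora.
texts = [
    ["stock", "market", "investors", "shares"],
    ["market", "shares", "trading", "profit"],
    ["football", "match", "players", "goal"],
    ["players", "scored", "goal", "match"],
]

# Map tokens to ids and convert each document to a bag-of-words vector.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Stream the corpus through an online LDA model with two latent topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```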

In short, Gensim is a powerful NLP library that is especially well suited to topic modeling, text analysis, and word-vector work, and to processing and analyzing large-scale text data.

Q9. What is Encoder-Decoder Architecture?

The encoder-decoder architecture is a model structure widely used in natural language processing, especially in machine translation, text summarization, and question answering. It consists of two main parts: an encoder and a decoder.

Encoder: the encoder receives the input data, for example an English sentence, and processes it through its recurrent layers (RNN, LSTM, or GRU). It reads the input sentence word by word (token by token) and updates its internal state at each step. After all of the input has been processed, the encoder produces a final state that serves as a compressed representation of the input and captures its key information. This final state (sometimes called the context vector) is used as the initial state of the decoder.

In other words, the encoder's job is to read and understand the input, typically a sequence of word vectors, and turn it into a fixed-size internal representation. For example, the encoder reads the sentence "I am a student" and converts it into an internal state that provides the decoder with the information it needs.

Decoder: the decoder's job is to turn the context vector produced by the encoder into the target output, for example the translated sentence. It generates the output step by step, one element (such as one word) at a time. Starting from the context vector received from the encoder, the decoder produces the French translation "Je suis étudiant".

The decoder takes the encoder's final state as its own initial state and starts generating the output sequence, that is, the sentence in the target language. In the machine-translation example, if the input is an English sentence, the decoder's task is to produce the corresponding French translation. It generates the output one word (token) at a time until it emits an end-of-sequence symbol such as "<EOS>". At each step, the decoder's output can depend on the previously generated outputs, which keeps the generated translation coherent.

The "state" mentioned above usually refers to the encoder's final hidden state after processing the whole input; it is passed to the decoder to initialize the decoding process.
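A compact sketch of this architecture in PyTorch; the GRU layers, vocabulary sizes, and greedy decoding loop are illustrative choices, not a reference implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, state = self.rnn(self.embed(src))
        return state                         # final hidden state = context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_token, state):     # one decoding step
        output, state = self.rnn(self.embed(tgt_token), state)
        return self.out(output), state       # logits over the target vocabulary

# Toy usage: encode a source sentence, then decode greedily token by token.
SRC_VOCAB, TGT_VOCAB, SOS, EOS = 100, 100, 1, 2
encoder, decoder = Encoder(SRC_VOCAB), Decoder(TGT_VOCAB)

src = torch.randint(3, SRC_VOCAB, (1, 5))    # fake source token ids
state = encoder(src)
token = torch.tensor([[SOS]])
for _ in range(10):
    logits, state = decoder(token, state)
    token = logits.argmax(dim=-1)            # greedy choice of the next token
    if token.item() == EOS:
        break
```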

Q10. What is Context2Vec?

Assume you have a sentence like:

I can’t find May.

The word "May" may refer to the name of a month or to a person's name. You use the words surrounding it (the context) to determine the most suitable option. This is the Word Sense Disambiguation task, in which you investigate the actual meaning of a word using several semantic and linguistic techniques.

The Context2Vec idea is taken from the original CBOW Word2Vec model, but instead of relying on averaging the embeddings of the context words, it relies on a much more complex parametric model based on a single Bi-LSTM layer. (The original post includes a figure of the CBOW architecture.)

Context2Vec applies the same windowing concept, but instead of a simple averaging function it uses three stages to learn a complex parametric network:

  • A Bi-LSTM layer that takes left-to-right and right-to-left representations.

  • A feedforward network that takes the concatenated hidden representation and produces a hidden representation through learning the network parameters.

  • Finally, we apply the objective function to the network output.

Context2Vec uses the Word2Vec negative-sampling idea to compute the loss more efficiently. (The original post then shows some sample contexts together with their closest words.)

Context2Vec is a model for producing embeddings that take a word's context into account in order to capture its meaning more accurately. The idea comes from Word2Vec's CBOW (Continuous Bag of Words) model, but it uses a more complex model of the context instead of simply averaging the embeddings of the surrounding words.

In Word2Vec's CBOW model, embeddings are learned by predicting the center word from its context, where the context is a simple average of the embeddings of the surrounding words. Context2Vec instead uses a bidirectional LSTM (Bi-LSTM) layer to capture the sequential information before and after the target word, giving a fuller representation of its context.

The Bi-LSTM layer can capture the complex contextual relationships around each word, so each word's representation is built from how it is actually used in the sentence. This is very useful for word sense disambiguation: in the sentence "I can't find May", the word "May" could refer to the month or to a person named May, and Context2Vec uses the words around "May" to help decide the most appropriate meaning.

Overall, Context2Vec provides a powerful way to produce context-rich embeddings that reflect how words are actually used, which is especially valuable for ambiguous words and other complex linguistic phenomena.
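A rough sketch of encoding a word's context with a Bi-LSTM in PyTorch, illustrating the three stages described above; this is not the official Context2Vec implementation, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes the words around a target position into a single context vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Stage 1: Bi-LSTM produces left-to-right and right-to-left representations.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Stage 2: feedforward network over the concatenated hidden states.
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, emb_dim))

    def forward(self, token_ids, target_pos):
        outputs, _ = self.bilstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        context = outputs[:, target_pos, :]               # states at the target position
        return self.mlp(context)                          # context embedding

# Usage: encode the context around the ambiguous word "May" (token ids are fake).
enc = ContextEncoder(vocab_size=1000)
sentence = torch.tensor([[4, 17, 23, 88]])   # stands in for "i can't find May"
context_vec = enc(sentence, target_pos=3)
# Stage 3 (training) would score this vector against word embeddings,
# e.g. with a negative-sampling objective as in Word2Vec.
print(context_vec.shape)                     # torch.Size([1, 64])
```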
