Matrix Design for Vector Space Models in Natural Language Processing

I remember when I studied algebra for the first time, I had this weird urge to represent those letters as words. I had a few scribbles of random letters being used as numbers, and I would just play around with them.

Well, as it turns out, when I studied information theory and then word vectors, I was like “I knew it!!!”

One of the fundamentals of computer science is representing your knowledge in the form of numbers, or even better, as a combination of 0s and 1s. Machine Learning models are no exception; hence, converting text to numeric representations (or, as my professor would call them, proxies) is one of the building blocks of any kind of text analytics task.

The Philosophy of Representing Knowledge for NLP

There are two primary sources of human knowledge for comprehending human language:

  1. The dictionary
  2. The information we gather from reading books and artifacts

I would call the dictionary a structured learning artifact designed to aid natural human learning, while knowledge gathered from reading books is a sort of unstructured and even unforeseen process. These are the two key approaches to understanding natural language, and both have been systematically digitized so that an artificial being can learn from them (I mean the computer 😅).

Photo by Yeshi Kangrang on Unsplash

Dictionaries are implemented as WordNet: a wealth of knowledge that serves as a digital dictionary ready to be consumed by an NLP program, much like how we refer to a dictionary to learn the meaning of a particular word.

The other way of gaining knowledge is by learning from artifacts such as books, novels, item reviews, short messages, tweets, and so on. This knowledge can be effectively represented using vectors that capture the semantic knowledge embedded in the text. This is the approach I am going to briefly discuss here.

A Note on Vectors

Vectors are one-dimensional matrices used to represent a collection of numbers in a one-dimensional space. In Machine Learning, a feature vector is a one-dimensional vector used to represent all the numeric encodings of features for one particular instance of data. As the number of instances grows, the matrix grows with it. Similarly, in recommender systems, matrices are used to relate users to the items they purchased or viewed, where each vector represents the choices of one user. In psychology, vectors are used to assess psychometric features, and a feature vector would comprise the scores of each psychometric trait being assessed per person.

Thus, the use of vectors and matrices to quantify data into machine-interpretable formats is quite prevalent. In Natural Language Processing as well, words are represented in a vector space where each vector corresponds to the distribution of a particular word. This makes it possible to compute predictions by generalization, which is mostly achieved by calculating vector similarity, much like in a supervised classification problem.

Matrix Designs for Word Representations

“You shall know a word by the company it keeps”

~ J. R. Firth 1957: 11

So, as we have established already, we will be converting texts to their numeric representations, and we will do it by representing them as dense vectors of real numbers. These vectors, which collectively form a matrix, can be designed in such a way that we conserve the meaning we expect to derive from the piece of text.

Take a sentence, for example: "I love text analytics." The occurrence of the word 'love' can be converted into a 4 x 1 vector, where 4 is the vocabulary size (denoted by V).

Mathematically, the vector to represent ‘love’ could be written as :

[Image: one-hot encoded word-vector for the word "love"]

This is a form of one-hot encoding.

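As a quick sketch of how this could look in code (the tokenization and index ordering below are my own assumptions, not something the article specifies):

```python
import numpy as np

# Toy vocabulary built from the example sentence (V = 4).
tokens = "I love text analytics".lower().split()
word_to_index = {w: i for i, w in enumerate(dict.fromkeys(tokens))}

def one_hot(word):
    """Return a |V| x 1 one-hot column vector for `word`."""
    vec = np.zeros((len(word_to_index), 1), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("love").ravel())  # [0 1 0 0]
```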

If there are multiple sentences, then the simplest way to scale this is to add the new words to the vocabulary dimension of the vector and keep expanding this "occurrence" vector.

Then, if there are M tokens, you could create a V x M matrix that encodes the presence or absence of the vocabulary tokens at each position.

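A minimal sketch of such a |V| x M occurrence matrix, again with my own toy tokenization:

```python
import numpy as np

tokens = "i love text analytics".split()  # M = 4 tokens
word_to_index = {w: i for i, w in enumerate(dict.fromkeys(tokens))}

# One one-hot column per token position gives a |V| x M presence matrix.
occurrence = np.zeros((len(word_to_index), len(tokens)), dtype=int)
for j, word in enumerate(tokens):
    occurrence[word_to_index[word], j] = 1

print(occurrence)  # here the identity matrix, since all four tokens are distinct
```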

Now, let's reflect on this way of encoding and jot down the caveats. The most important thing in languages is that a word by itself does not give out predictive signals until it is put into a context for holistic understanding. For example, if I say "Good": yes, good is a positive word, but what is good? Or is there a "not" before "good"? This is why I labeled this the "occurrence" vector. It is a simple yes-or-no situation. It is not contextual.

The second problem is the massive vocabulary that could be generated if the word vector representation is not controlled. This would lead to expensive computations and time-consuming processes.

To capture, to some extent, the meaning of a group of words occurring together, we have the two most widely used matrix designs, discussed below:

  1. Word x Document Matrix

In this vector representation method, each word is represented by the frequency of its occurrence per document. The vector size would hence be |V| x D, where D is the number of documents. This vector is composed of real numbers and scales up with the addition of more documents. In other words, for the i-th word, the term frequency (tf) of that word in the j-th document is placed at the (i, j)-th element of the matrix.

To demonstrate this, consider these three documents:

Document 1:“ Listen, Harry can I have a go on it? Can I?”

Document 2: “I don’t think anyone should ride that broom just yet!” said Hermione shrilly.

Document 3: Harry and Ron looked at her.

The word-document vector representation would look something like this:

[Image: word-document term-frequency matrix for the three example documents]

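As a sketch, this matrix can be computed with scikit-learn's CountVectorizer. Note that its default tokenizer lowercases and drops one-character tokens such as "I" and "a", so the exact rows depend on the preprocessing you choose:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Listen, Harry can I have a go on it? Can I?",
    "\"I don't think anyone should ride that broom just yet!\" said Hermione shrilly.",
    "Harry and Ron looked at her.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse, shape D x |V| (documents x terms)
word_doc = X.T.toarray()            # transpose to |V| x D

for word, row in zip(vectorizer.get_feature_names_out(), word_doc):
    print(f"{word:10s} {row}")      # e.g. harry      [1 0 1]
```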

  2. Window-based Word x Word Matrix

This is also called a co-occurrence matrix, where we measure how many times a word has appeared in the vicinity of another word. For example, if a window size of 3 is being assessed, then we count how many times word i of the vocabulary occurs within the range j-3 to j+3 around the j-th word, and place that count at position (i, j) of the co-occurrence matrix. If the vocabulary size is |V|, then the final matrix will be of shape |V| x |V|.

To demonstrate this for window-size 3, the co-occurrence matrix for the previous example would be:

[Image: window-size-3 word-word co-occurrence matrix for the example documents]

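A sketch of how such window-based counts could be produced; the naive tokenizer here is my own choice, since the article does not specify its preprocessing:

```python
import re
import numpy as np

docs = [
    "Listen, Harry can I have a go on it? Can I?",
    "\"I don't think anyone should ride that broom just yet!\" said Hermione shrilly.",
    "Harry and Ron looked at her.",
]
window = 3

tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
idx = {w: i for i, w in enumerate(vocab)}

# For each word, count every neighbour within +/- `window` positions.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for toks in tokenized:
    for j, w in enumerate(toks):
        for k in range(max(0, j - window), min(len(toks), j + window + 1)):
            if k != j:
                cooc[idx[w], idx[toks[k]]] += 1

print(cooc.shape)                      # (|V|, |V|)
print(cooc[idx["harry"], idx["can"]])  # 1: "Harry" and "can" are adjacent in Document 1
```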

Had I used more examples, Harry, Ron, and Hermione would be more prominent than the other words in terms of frequency counts, as their names would occur together much more frequently (within a span of 3 words, forwards and backwards), which establishes semantic meaning in these stand-alone words.

The Dimension of the Word Vector

Photo by Farhan Azam on Unsplash

Defining the dimension of the vector is contextual to the problem being solved with NLP techniques. Consider a case where you only want to capture user ratings as "Very Good", "Good", "Average", "Not Good", or "Worse". These can be one-hot encoded, where the 1 corresponds to the given user rating. This is also a form of capturing user sentiment in the context of a specific item. The size of the dimension for this problem is 5, and every instance will be a 1 x 5 matrix.

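A minimal sketch of that 1 x 5 encoding, using the rating labels from the example above:

```python
RATINGS = ["Very Good", "Good", "Average", "Not Good", "Worse"]

def encode_rating(label):
    """One-hot encode a rating label into a 1 x 5 vector."""
    return [1 if r == label else 0 for r in RATINGS]

print(encode_rating("Good"))  # [0, 1, 0, 0, 0]
```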

Now, in terms of psychometric analysis, if you are encoding the Big Five scores per user, then there will be real numbers in the range of 0 to 50 for each of the Big Five personality traits, namely openness, conscientiousness, extraversion, agreeableness, and neuroticism. This vector is also of shape 1 x 5. Depending on the problem you are solving, these values could be scaled, since lower values on each of these personality tests indicate the opposite behavior of that particular trait. For example, [15, 12, 34, 44, 29] could be scaled to [-1, -1, 1, 1, 0].

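A sketch of one way that scaling could work. The low/high cut-offs below are my own assumption for illustration; the article does not state the thresholds it has in mind:

```python
def scale_trait(score, low=20, high=30):
    """Map a 0-50 trait score to -1 (low), 0 (middle), or +1 (high).
    The thresholds are illustrative assumptions, not from the article."""
    if score < low:
        return -1
    if score > high:
        return 1
    return 0

# openness, conscientiousness, extraversion, agreeableness, neuroticism
big_five = [15, 12, 34, 44, 29]
print([scale_trait(s) for s in big_five])  # [-1, -1, 1, 1, 0]
```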

If you are working on a gender classification task, then your vector representation could possibly be 1 x 2; and if you have 10k documents, then your dimension would be of the same order unless you strategize the length of the word vector.

The choice of dimensionality for word vectors has a huge influence on the performance of a word embedding. A small vector dimension would be insufficient to capture all the features (under-fitting), while a large dimension can lead to over-fitting. A large dimension also directly increases model complexity, computational cost, and training time by adding latency. Besides, the features computed from the word embeddings would be a linear or quadratic function of the dimension, which further impacts training time and computational cost.

Summary

Embedded meaning in words can be identified by using count statistics: representing a word's association with adjacent words, or with the document in which it resides, in the form of a vector. The more vocabulary and documents there are, the larger the matrix grows, which leads to the curse of dimensionality. Since these encoding strategies result in sparse matrices, they have associated scalability issues and are computationally expensive; operations such as the singular value decomposition of an M x N matrix cost O(MN²), which is undesirable. However, these two representations are the most widely used matrix design strategies and have proved quite effective in solving NLP problems such as text classification, non-contextual sentiment analysis, and document similarity.

Translated from: https://towardsdatascience.com/matrix-design-for-vector-space-models-in-natural-language-processing-fbef22c10399
