Machine Learning
In my previous article, I wrote about a content-based recommendation engine that used TF-IDF on Goodreads data. In this article, I use the same Goodreads data and build the recommendation engine using Word2Vec.
Like the previous article, I am going to use the same book descriptions to recommend books. The algorithms we use struggle to handle raw text data; they only understand data in numeric form. To make the text usable, we need to convert it into numbers, which NLP techniques achieve by turning raw text into vectors.
In the previous recommendation engine, I used TF-IDF to convert the raw text into vectors. However, TF-IDF does not capture semantic meaning, and it produces a sparse matrix. Research and breakthroughs are happening in NLP at an unprecedented pace, and neural network architectures have become famous for learning word representations, also called word embeddings.
Word embeddings are dense, low-dimensional features, whereas TF-IDF produces sparse, high-dimensional ones. Word embeddings also capture semantic meaning very well.
One of the significant breakthroughs was the word2vec embeddings introduced by Google in 2013; word2vec embeddings outperform TF-IDF in many ways. Another turning point in NLP was the Transformer network introduced in 2017. It was followed by a wave of research, including BERT (Bidirectional Encoder Representations from Transformers) and many other models now considered state of the art in NLP.
This article explores how average Word2Vec and TF-IDF Word2Vec can be used to build a recommendation engine. I will explore how BERT embeddings can be used in my next article.
What is Word2Vec?
Word2Vec is a simple neural network model with a single hidden layer. It predicts the adjacent words for each word in a sentence or corpus. The weights learned by the hidden layer of the model are then used as the word embeddings.
Let’s see how it works with the sentence below:
From the above, let's assume the word "theorist" is our input word. It has a context window of size 2, which means we consider only the 2 adjacent words on either side of the input word as its neighboring words.
Now, the task is to pick the nearby words (the words in the context window) one by one and find the probability of every word in the vocabulary being that selected adjacent word. The context window can be changed to suit our requirements.
Word2Vec has two model architecture variants: 1) Continuous Bag-of-Words (CBoW) and 2) Skip-Gram. The internet is literally flooded with articles about Word2Vec, so I have not explained it in detail here. Please check here for more details on Word2Vec.
In simpler terms, Word2Vec takes a word and gives a vector in D-dimensional space.
Please note, Word2Vec provides word embeddings that are low-dimensional (50–500 dimensions) and dense (not sparse; most values are non-zero). I used 300-dimensional vectors for this recommendation engine. As mentioned above, Word2Vec is good at capturing semantic meaning and relationships.
Training our own word embeddings is an expensive process and also requires large datasets. I don't have a large dataset, as the Goodreads data I scraped pertains only to the business and cooking genres, so I used Google's pre-trained word embeddings, which were trained on a large corpus such as Wikipedia and news articles.
The pre-trained embeddings give us the vectors for the words we want. A pre-trained model is a large collection of key-value pairs, where the keys are the words in the vocabulary and the values are their corresponding word vectors.
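As a minimal sketch of this lookup (assuming gensim and its downloader module; the snippets later in this article may load the model differently):

```python
import gensim.downloader as api

# Load Google's pre-trained vectors (a large download on first use).
# The model acts like a dict: word -> 300-dimensional dense vector.
google_model = api.load("word2vec-google-news-300")

vector = google_model["investor"]  # key-value lookup
print(vector.shape)                # (300,)
```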
For our problem, we need to convert each book description into a vector and find the similarity between these vectors to recommend books. Each book description is a sentence or a sequence of words. I tried two methods: average Word2Vec and TF-IDF Word2Vec.
Average Word2Vec
Let's take a random description example from our dataset.
Book title: The Four Pillars of Investing
Book Description: william bernstein american financial theorist neurologist research field modern portfolio theory research financial books individual investors wish manage equity field lives portland oregon
OK, how do we convert the above description into vectors? As you know, word2vec takes a word and gives a d-dimensional vector. First, we need to split the sentence into words and find the vector representation of each word in the sentence.
The above example has 23 words. Let's denote the words as w1, w2, w3, … w23 (w1 = william, w2 = bernstein, …, w23 = oregon) and calculate the Word2Vec vector for each of the 23 words.
Then, sum all the vectors and divide the sum by the total number of words in the description (N). The result can be denoted v1 and calculated as follows:
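Writing w2v(w_i) for the Word2Vec vector of word w_i, the calculation is:

$$v_1 = \frac{\mathrm{w2v}(w_1) + \mathrm{w2v}(w_2) + \dots + \mathrm{w2v}(w_{23})}{N} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{w2v}(w_i)$$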
Here, the vectors are d-dimensional (I used 300 dimensions),

N = the number of words in description 1 (here, 23), and

v1 = the vector representation of book description 1.
This is how we calculate average Word2Vec. In the same way, the other book descriptions can be converted into vectors. I have implemented this in Python, and the code snippets are given below.
TF-IDF Word2Vec
TF-IDF stands for term frequency-inverse document frequency. It helps calculate the importance of a given word relative to other words in the document and in the corpus. It combines two quantities, TF and IDF, into a single TF-IDF score. A detailed explanation of how TF-IDF works is beyond the scope of this article, as the internet is flooded with articles on it. Please check here for more details on TF-IDF.
Let's see how TF-IDF Word2Vec works.
Consider the same description example:
Book Description: william bernstein american financial theorist neurologist research field modern portfolio theory research financial books individual investors wish manage equity field lives portland oregon
Again, the description has 23 words. Let's denote them as w1, w2, w3, … w23 (w1 = william, w2 = bernstein, …, w23 = oregon).
Steps to calculate the TF-IDF Word2Vec:
- Calculate the TF-IDF score for each word in the above description. Let's call the scores tf1, tf2, tf3 … tf23. Note that a word's TF-IDF score is a scalar, not a d-dimensional vector.
- Calculate the Word2Vec vector for each word in the description.
- Multiply each word's Word2Vec vector by its TF-IDF score, and sum the weighted vectors.
- Then, divide the total by the sum of the TF-IDF scores. The result can be called v1 and is written as follows:
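With tf_i as the TF-IDF score of word w_i, this works out to:

$$v_1 = \frac{\sum_{i=1}^{N} tf_i \cdot \mathrm{w2v}(w_i)}{\sum_{i=1}^{N} tf_i}$$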
v1 = the vector representation of book description 1. This is the method for calculating TF-IDF Word2Vec. In the same way, we can convert the other descriptions into vectors. I have implemented this in Python, and the code snippets are given below.
Content-based recommendation system
A content-based recommendation system recommends books to a user based on the similarity between books. This recommender system recommends a book based on its description: it identifies the similarity between books from their descriptions. It also considers the user's previous book history in order to recommend similar books.
Example: If a user likes the novel "Tell Me Your Dreams" by Sidney Sheldon, then the recommender system recommends other Sidney Sheldon novels, or novels of the same genre.
We need to find books similar to a given book and then recommend those similar books to the user. How do we find whether two books are similar or dissimilar? A similarity measure is used for this: our recommender system uses Cosine Similarity between the description vectors to recommend books. For more details on similarity measures, please refer to this article.
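As a rough sketch of this step (assuming word_embeddings holds one description vector per book, as built later in this article, that df uses a default integer index, and that the 'title' and 'author' columns are hypothetical names):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, df, word_embeddings, top_n=5):
    """Return the top_n books whose description vectors have the
    highest cosine similarity to the given book's vector."""
    similarity = cosine_similarity(np.array(word_embeddings))
    idx = df.index[df["title"] == title][0]  # assumes a default integer index
    # Sort by similarity (descending) and skip the book itself at rank 0.
    best = np.argsort(similarity[idx])[::-1][1 : top_n + 1]
    return df.iloc[best][["title", "author"]]
```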
As I mentioned above, we are using goodreads.com data and don't have users' reading history. Hence, I am not able to use a collaborative filtering recommendation engine.
Data
I scraped book details from goodreads.com pertaining to business, non-fiction, and cooking genres.
Code
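A minimal sketch of loading the scraped data, assuming it was saved to a CSV (the file name here is hypothetical; the full code is in the GitHub repo linked at the end):

```python
import pandas as pd

# Hypothetical file name; the scraped dataset stores one book per row.
df = pd.read_csv("goodreads_data.csv")
print(df.shape)
df.head()
```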
Output
The data consists of 2,382 records in two genres: 1) Business (1,185 records) and 2) Non-Fiction (1,197 records). Each record contains the book title, description, author name, rating, and book image link.
Text Preprocessing
Cleaning the book descriptions and storing the cleaned descriptions in a new variable called 'cleaned':
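A sketch of the cleaning step, assuming NLTK stopwords and a hypothetical raw-description column named 'Desc':

```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Lowercase, keep letters only, and drop stopwords."""
    text = re.sub("[^a-zA-Z]", " ", str(text)).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

# 'Desc' is a hypothetical column name for the raw book description.
df["cleaned"] = df["Desc"].apply(clean_text)
```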
Recommendation Engine
We build two recommendation engines, using average Word2Vec and TF-IDF Word2Vec word embeddings.
Average Word2Vec
Splitting the descriptions into words and storing them in a list called 'corpus' for our Word2Vec model:
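A one-line sketch, assuming the cleaned descriptions live in df['cleaned']:

```python
# One list of tokens per cleaned book description.
corpus = [desc.split() for desc in df["cleaned"]]
```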
Loading the Google pre-trained Word2Vec model to supply vectors for the words in our corpus:
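A sketch of loading the pre-trained vectors from the downloaded binary (the file path is an assumption; the original code may instead build a Word2Vec model and intersect it with the pre-trained weights):

```python
from gensim.models import KeyedVectors

# Google's 300-dimensional GoogleNews vectors, loaded as read-only
# keyed vectors; the path assumes the binary sits in the working dir.
google_model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
```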
I created a function called vectors for generating the average Word2Vec embeddings and stored the results in a list called 'word_embeddings'. The code follows the steps written in the average Word2Vec explanation above.
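A sketch of that function under the assumptions above (google_model supports dict-style lookup and membership tests):

```python
import numpy as np

def vectors(corpus, model, dim=300):
    """Average Word2Vec: sum the vectors of the in-vocabulary words
    in each description, then divide by the number of words used."""
    word_embeddings = []
    for tokens in corpus:
        vec, count = np.zeros(dim), 0
        for word in tokens:
            if word in model:  # skip out-of-vocabulary words
                vec += model[word]
                count += 1
        if count:
            vec /= count
        word_embeddings.append(vec)
    return word_embeddings

word_embeddings = vectors(corpus, google_model)
```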
Top 5 Recommendations using Average Word2Vec
Let's recommend books similar to "The Da Vinci Code" by Dan Brown.
The model recommends other Dan Brown books based on the similarity in the book descriptions.
Let's recommend books similar to "The Murder of Roger Ackroyd" by Agatha Christie.
This book is a mystery thriller, and the engine recommends similar kinds of novels.
Building the TF-IDF Word2Vec Model
The code follows the same steps I described above for the TF-IDF Word2Vec process. I use the same corpus; only the word embeddings change.
Building the TF-IDF model
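A sketch using scikit-learn's TfidfVectorizer on the cleaned descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF over all cleaned descriptions; keep the word -> column
# mapping so each word's score can be looked up per description.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["cleaned"])
tfidf_vocab = tfidf.vocabulary_
```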
Building TF-IDF Word2Vec Embeddings
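A sketch that follows the steps listed earlier, reusing corpus, google_model, and the TF-IDF model above (row i of tfidf_matrix lines up with description i in corpus, since both come from df['cleaned']):

```python
import numpy as np

tfidf_embeddings = []
for i, tokens in enumerate(corpus):
    vec, weight_sum = np.zeros(300), 0.0
    for word in tokens:
        if word in google_model and word in tfidf_vocab:
            score = tfidf_matrix[i, tfidf_vocab[word]]  # scalar TF-IDF score
            vec += score * google_model[word]           # weight the word vector
            weight_sum += score
    if weight_sum:
        vec /= weight_sum  # divide by the sum of TF-IDF scores
    tfidf_embeddings.append(vec)
```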
Top 5 Recommendations using TF-IDF Word2Vec
Let's recommend books similar to "The Da Vinci Code" by Dan Brown.
We can see that this model recommends a Sherlock Holmes novel, and the output is different from the average Word2Vec model's.
Let's recommend books similar to "The Murder of Roger Ackroyd" by Agatha Christie.
Again, the recommendations differ from average Word2Vec, but it still gives novels similar to Agatha Christie's.
It seems TF-IDF Word2Vec gives more powerful recommendations than average Word2Vec here. However, this article only explores how to use average Word2Vec and TF-IDF Word2Vec to build a recommendation engine; it does not formally compare the results of the two models. The data I used is very small, and the results would certainly change with a larger dataset. We could also use fastText (from Facebook) or GloVe (from Stanford) pre-trained word embeddings instead of Google's.
Real-world recommendation systems are more robust and advanced. A/B testing is used to evaluate recommendation engines, and the business domain plays a major role in evaluating and picking the best one.
In my next article, I will show how to use BERT embeddings to build the same type of recommendation engine. You can find the entire code and data in my GitHub repo.
Thanks for reading. Keep learning and stay tuned for more!