Create a Recommendation System Based on Time Series Data Using Latent Dirichlet Allocation

Latest recommended article published 2021-10-25 15:57:55

weixin_26730921

Original article: https://towardsdatascience.com/create-a-recommendation-system-based-on-time-series-data-using-latent-dirichlet-allocation-2aa141b99e19

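This article shows how to build a recommendation system from time series data using Latent Dirichlet Allocation (LDA). It is based on an article from towardsdatascience and explores the topic in detail.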

Intro

Many believe that the main goal of user segmentation and recommendation systems is to predict or better understand what customers want or are likely to buy. Sometimes, however, the important question is not what customers are likely to buy, but when they are likely to buy it.

Consider the following examples: (1) An online store wants to send its most dedicated customers a coupon for a product that fits their preferences. However, the store also knows that customers differ in the time of day at which they usually shop; for example, some prefer to shop early in the morning while others prefer to shop late at night. Thus, the store wants to know exactly when each customer (or group of customers) is most likely to shop, and to send the coupon at the appropriate time. (2) A provider of network proxy services needs to allocate network bandwidth to its users efficiently. Ideally, it would like to group users according to their activity hours and allocate bandwidth accordingly.

These “business problems” already assume that users or customers want a certain product. The question they pose is when those users are most likely to buy or use it. In this post, I would like to share a very simple methodology that might help you answer such questions.

I will assume that you have a simple time series data set in the form of:

Time ('2020-01-05 10:30')
Entity (user, product, game)
Count (views, orders, downloads)
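
As a minimal sketch of what such a data set might look like in practice, the snippet below builds a tiny table in this Time / Entity / Count layout with pandas and reshapes it into an entity-by-hour count matrix. The column names (`time`, `entity`, `count`), the user names, and all values are hypothetical and chosen only for illustration; the choice of pandas is an assumption as well.

```python
import pandas as pd

# Hypothetical sample data in the Time / Entity / Count layout described above.
events = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2020-01-05 10:30",
                "2020-01-05 10:45",
                "2020-01-05 23:10",
                "2020-01-06 08:05",
                "2020-01-06 23:40",
            ]
        ),
        "entity": ["user_a", "user_a", "user_b", "user_a", "user_b"],
        "count": [3, 1, 2, 4, 5],
    }
)

# Derive the hour of day, then sum the counts per (entity, hour) pair.
events["hour"] = events["time"].dt.hour
hourly = events.pivot_table(
    index="entity", columns="hour", values="count", aggfunc="sum", fill_value=0
)

# Make sure all 24 hours exist as columns, even those never observed.
hourly = hourly.reindex(columns=range(24), fill_value=0)
print(hourly.loc[:, [8, 10, 23]])
```

Each row of `hourly` then describes when one entity tends to be active, which is a convenient shape for count-based models.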

I will show how LDA enables us to (1) segment users into groups or clusters according to the times at which they are more likely to be “active”, in quite an insightful way, and (2) estimate how strongly each user is associated with each group, based on that user's data. (The same analysis can also be applied to products, clustering them into groups according to the times at which they are more likely to be purchased, etc.)
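
To make these two goals concrete, here is a rough sketch that treats each user as a "document" and each hour of the day as a "word". It assumes scikit-learn's `LatentDirichletAllocation`, which is my choice for illustration and not necessarily the library used in the original post. The activity counts are synthetic, and the number of groups (2) is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical activity counts: 6 users x 24 hours of the day.
# The first three users are mostly active in the morning (hours 6-9),
# the last three mostly late at night (hours 21-23).
counts = rng.poisson(0.2, size=(6, 24))
counts[:3, 6:10] += rng.poisson(8, size=(3, 4))
counts[3:, 21:24] += rng.poisson(8, size=(3, 3))

# Users play the role of "documents", hours the role of "words".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_group = lda.fit_transform(counts)  # shape (6, 2); each row sums to 1

# (1) Segment: assign each user to the group with the largest weight.
segments = user_group.argmax(axis=1)

# (2) Association strength: the weight of each group for each user.
for user, (weights, segment) in enumerate(zip(user_group, segments)):
    print(f"user {user}: group {segment}, weights {np.round(weights, 2)}")
```

The hard assignment in `segments` answers the segmentation question, while the rows of `user_group` show how strongly each user is tied to each group.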

In the sections that follow, I will very briefly introduce LDA and the way it operates, and then dive right into the code.

Latent Dirichlet Allocation (LDA) Clustering: A (Very) Short Intro

LDA is most commonly used for topic modeling in NLP contexts, and that is indeed a good example use case for grasping its value. Topic modeling algorithms usually start with a “corpus”, a group of “documents” that consist of “words”, and try to use the observed mixture of words and the frequency with which they appear together in order to find latent patterns or topics within the documents.

For example, suppose that I have 3 documents: (1) “My dog ate my lunch”; (2) “I just love eating fruits”; (3) “I am allergic to dogs”. A topic modeling algorithm should be able to identify at least 2 latent topics or themes within this “corpus”: a topic that deals with dogs and a topic that deals with food or eating. Some topic modeling algorithms (such as LDA) would also be able to associate a document with more than one topic and tell us which topic is more dominant in the document. For example, these algorithms would be able to tell us that sentence 1 (“My dog ate my lunch”) is both about dogs and about food or eating, and also that it is more about dogs than it is about food.
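
The three example sentences can be run through a small pipeline to see what "associating a document with more than one topic" looks like as output. This is only a sketch, assuming scikit-learn's `CountVectorizer` and `LatentDirichletAllocation` (my choice of library) and 2 topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "My dog ate my lunch",
    "I just love eating fruits",
    "I am allergic to dogs",
]

# Bag-of-words counts; English stop words such as "my" and "to" are dropped.
vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(docs)

# Fit a 2-topic model; each row of doc_topics is a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(word_counts)

for doc, weights in zip(docs, doc_topics):
    print(f"{doc!r}: {weights.round(2)}")
```

Each document receives a weight for every topic, and the larger weight marks the dominant one. With only three short sentences and no stemming ("dog" and "dogs" count as different words), the fitted topics should not be expected to match the dog and food themes cleanly; the point here is the shape of the output.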

So how does LDA do this?

LDA makes at least 2 important working assumptions:

(1) Each topic is a probability distribution over words. Simply put, each topic is associated with certain words that occur with certain probabilities in a document that deals with the topic. One of the main outputs of an LDA model is a term-topic matrix that shows the probability of each word occurring in a certain topic. This matrix allows us to find semantic themes in each topic. For example, if WORD 0 and WORD 1 in the matrix below relate to animals, then we can say that TOPIC3 (or its theme) is animals, because words that relate to animals are very likely to occur within it.
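
As a hedged illustration of such a matrix, the sketch below fits a 2-topic model on a small made-up count matrix and normalizes the fitted weights into per-topic word probabilities. It assumes scikit-learn, which stores the matrix as topics by words (the transpose of a term-topic layout) in the `components_` attribute. The vocabulary labels and counts are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-word counts: 4 documents, vocabulary of 5 words.
vocab = ["WORD 0", "WORD 1", "WORD 2", "WORD 3", "WORD 4"]
X = np.array(
    [
        [5, 4, 0, 0, 1],
        [6, 3, 1, 0, 0],
        [0, 0, 4, 5, 3],
        [0, 1, 5, 4, 4],
    ]
)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# components_ holds unnormalized topic-word weights, shape (n_topics, n_words).
# Dividing each row by its sum gives P(word | topic).
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, row in enumerate(topic_word):
    top = row.argsort()[::-1][:2]
    print(f"TOPIC {k}: most probable words = {[vocab[i] for i in top]}")
```

Reading off the most probable words in each row is how a theme such as "animals" would be assigned to a topic.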
