Using Machine Learning to Personalize User Experience

E-commerce websites, such as shops and platforms with many users, are designed to meet the needs of their customers. Usually, a website behaves the same for every customer. However, this "one-size-fits-all" approach does not fit all situations. Understanding a customer's intention can help to improve their journey, e.g. by taking shortcuts or giving recommendations, and make it a better experience overall. This article shows how to use existing data on customer behavior to create a machine learning model that is capable of predicting intent.


Data Privacy

I personally don't like it when advertising technology companies like Google and Facebook track online activities extensively. Nevertheless, I think that individual websites can use personalization techniques without violating privacy, as long as the data is not shared with or linked to external services. It makes a difference whether the data is used to improve the customer experience or whether all activities are tracked across the Internet to generate advertising profits. Furthermore, any personalization should offer an opt-out.


Customer journey in data

Typically, a user's intention on a website can be understood by looking at their past interactions. In concrete terms, this means that a user leaves a sequence of events: the history of their page views and interactions. An event can be that a user makes a search query, opens an article page, or receives an e-mail. This data forms the basis for the following techniques, so the first step is to collect or extract it. Usually the raw data is already stored on web servers or in databases and then needs to be refined to be usable.

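As a small sketch of that refinement step (the log line, the regex, and the event names here are hypothetical; real formats depend on your server):

import re
from datetime import datetime

# hypothetical access-log line; the raw format depends on your web server
line = '198.51.100.7 - - [10/Oct/2020:13:55:36 +0000] "GET /product/42 HTTP/1.1" 200'
match = re.match(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)', line)
timestamp = datetime.strptime(match['ts'], '%d/%b/%Y:%H:%M:%S %z')
# map the raw request to a coarse event name (hypothetical mapping)
event_name = 'view_product' if match['path'].startswith('/product/') else 'other'
event = (match['ip'], timestamp, event_name)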

Example:


[Image: three different user event streams]

This image shows three different user journeys from the point where each user arrives on the website. In this case it is a simple webshop, and for this example the journeys are kept very simple. User 1 might be looking for a specific product, while User 2 might just be browsing through the pages, and User 3 just bought something. To start with a simple intent, we want to predict whether a user will make a purchase.


Training data

The first step is to assign an event id to each event; it can also be useful to map several similar events to one event id. This can be done manually or with the LabelEncoder in scikit-learn. It's a good idea to start with 1 as the first id, because 0 is used for padding.

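A minimal sketch of this encoding, assuming some hypothetical raw event names:

from sklearn.preprocessing import LabelEncoder

raw_events = ['search', 'view_product', 'search', 'add_to_cart']  # hypothetical event names
encoder = LabelEncoder()
# LabelEncoder assigns ids starting at 0, so shift by 1 to keep 0 free for padding
event_ids = encoder.fit_transform(raw_events) + 1
print(event_ids)  # [2 3 2 1]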

At this point, every event is a number and our data is just a sequence of numbers. The input for our classifier has to have a fixed size, which means every event sequence must be equal in length. To achieve that, we pad the data to a predefined length. By default, pad_sequences fills the missing events with zeros and inserts them before the sequence starts. If a sequence is longer than the desired length, it truncates the beginning of the sequence. The result is our X.


import numpy as np
import tensorflow as tf

num_events = 8  # example
seq_len = 10    # example
events = [
    [1, 2, 1, 2, 1],  # user1
    [3, 4, 2, 4, 1],  # user2
    [1, 5, 6, 7]]     # user3
# pre-pad with zeros (and pre-truncate) to a fixed length of seq_len
x = tf.keras.preprocessing.sequence.pad_sequences(events, maxlen=seq_len)

Now we need to find our y. This is highly use-case dependent, but in this example we will use information from another system which tells us that the customer bought something. Usually this is also just one of the events from above. User 3 bought something, so his target label is 1.


y = [0, 0, 1]

(x, y)
Output:
(array([[0, 0, 0, 0, 0, 1, 2, 1, 2, 1],
        [0, 0, 0, 0, 0, 3, 4, 2, 4, 1],
        [0, 0, 0, 0, 0, 0, 1, 5, 6, 7]]), [0, 0, 1])

Note that in real applications the number of unique events is likely in the thousands, and the length of the whole event stream is often several hundred. The number of events in click streams can differ considerably between users and sessions, so at some point we must make a cut. The exact number depends on the data, but the 90th percentile should be a good starting point.

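A quick sketch for deriving the cut-off, assuming events holds the raw, unpadded sequences as above:

import numpy as np

# derive the padding/truncation length from the 90th percentile of sequence lengths
seq_lengths = [len(seq) for seq in events]
seq_len = int(np.percentile(seq_lengths, 90))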

Windows in time series data

When predicting intent, it's important to divide the data into time windows. One window (X) represents the click data before time t0, and the second window represents the data after t0, where we expect the target event Y to happen. Unlike in the image below, click data is not a continuous value, but the idea of the window approach is the same: moving the window through the data also creates many sequences for one single user. We could, for example, use a window of 6 hours to predict whether the customer purchases something in the next 2 hours, and by sliding through the complete daily data we get several sequences (X and Y).

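A minimal sketch of this windowing, assuming a hypothetical per-user DataFrame with 'timestamp', 'event_id' and 'is_purchase' columns:

import pandas as pd

def make_windows(df, input_hours=6, target_hours=2, step_hours=1):
    # slide over one user's timestamped events: X is the event ids in the
    # input_hours before t0, y is 1 if a purchase happens within target_hours after t0
    samples = []
    t = df['timestamp'].min() + pd.Timedelta(hours=input_hours)
    end = df['timestamp'].max()
    while t <= end:
        x_mask = (df['timestamp'] >= t - pd.Timedelta(hours=input_hours)) & (df['timestamp'] < t)
        y_mask = (df['timestamp'] >= t) & (df['timestamp'] < t + pd.Timedelta(hours=target_hours))
        x_events = df.loc[x_mask, 'event_id'].tolist()
        label = int(df.loc[y_mask, 'is_purchase'].any())
        if x_events:
            samples.append((x_events, label))
        t += pd.Timedelta(hours=step_hours)
    return samples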

[Image: sliding time windows, from "Time-series Extreme Event Forecasting with Neural Networks at Uber"]

However, the most important thing is to make sure that your events are not self-fulfilling prophecies. If an event that leaks the label is included, your model is not very useful (e.g., including the pay-button click for purchase prediction). There is no general rule for which events to exclude, but if your classifier performs extremely well on such a task, you might have a leaky feature. Intent prediction will never give you highly accurate models, because the event data is usually not clean and specific enough.

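For example, a leaky event such as the pay-button click can simply be filtered out of the input sequences before training (the event id here is hypothetical):

# hypothetical ids of events that trivially reveal the label
LEAKY_EVENT_IDS = {8}  # e.g. the 'pay_button_click' event

events = [[e for e in seq if e not in LEAKY_EVENT_IDS] for seq in events]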

Modelling

Now it's time to put the data into an artificial neural network. Each event has an internal representation in the neural network, which is called an embedding. This representation is learned by the network during training. The network learns to build an embedding space in which every event is located according to its similarity to other events. Such representations make events comparable (see word embeddings). Furthermore, we can deal with many unique events without having to deal with high-dimensional vectors. The number of distinct events is the counterpart of the vocabulary size in NLP.

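To illustrate this comparability: after training, one could read the embedding matrix out of the model defined below and compare two events by cosine similarity (a sketch, assuming the Embedding layer is the model's second layer):

import numpy as np

emb_matrix = model.layers[1].get_weights()[0]  # shape: (num_events, embedding_dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine(emb_matrix[1], emb_matrix[2])  # similarity of events 1 and 2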

After the translation from event id to embedding representation is made (Embedding layer), the sequences must somehow be reduced to a single vector. LSTMs are the standard approach for such tasks. The additional Masking layer in the example below makes the LSTM ignore the padded zeros in the sequence.


In this example we generate 1000 random example sequences, which will obviously converge to about 50% accuracy; nevertheless, it shows the underlying idea.


num_events = 1000
seq_len = 100

# random demo data: 1000 sequences of length 100 with binary labels
y = np.random.choice(2, 1000, replace=True)
x = np.random.randint(num_events, size=(1000, seq_len))

net_in = tf.keras.layers.Input(shape=(seq_len,))
emb = tf.keras.layers.Embedding(num_events, 8, input_length=seq_len, mask_zero=True)(net_in)
mask = tf.keras.layers.Masking(mask_value=0)(emb)
lstm = tf.keras.layers.LSTM(64)(mask)
dense = tf.keras.layers.Dense(1, activation='sigmoid')(lstm)
model = tf.keras.Model(net_in, dense)
model.compile('adam', 'binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x, y, epochs=50, validation_split=0.2)

Replacing LSTMs with convolutions

In recent years, convolutional layers have shown good performance on sequence classification tasks as well. The idea is simply to move a 1D convolution over the sequence. For convolutional networks we use several "parallel" convolutional connections. In Keras we can build such a convolutional sequence classifier with this model:


num_events = 1000
seq_len = 100

# random demo data, as in the LSTM example above
y = np.random.choice(2, 1000, replace=True)
x = np.random.randint(num_events, size=(1000, seq_len))

net_in = tf.keras.layers.Input(shape=(seq_len,))
emb = tf.keras.layers.Embedding(num_events, 8, input_length=seq_len, mask_zero=False)(net_in)
# parallel 1D convolutions with different kernel sizes, each reduced by global pooling
c1 = tf.keras.layers.Conv1D(256, 3)(emb)
p1 = tf.keras.layers.GlobalMaxPooling1D()(c1)
c2 = tf.keras.layers.Conv1D(128, 7)(emb)
p2 = tf.keras.layers.GlobalMaxPooling1D()(c2)
c3 = tf.keras.layers.Conv1D(64, 11)(emb)
p3 = tf.keras.layers.GlobalMaxPooling1D()(c3)
c4 = tf.keras.layers.Conv1D(64, 15)(emb)
p4 = tf.keras.layers.GlobalAveragePooling1D()(c4)
c5 = tf.keras.layers.Conv1D(64, 19)(emb)
p5 = tf.keras.layers.GlobalAveragePooling1D()(c5)
c = tf.keras.layers.concatenate([p1, p2, p3, p4, p5])
bn = tf.keras.layers.BatchNormalization()(c)
dense = tf.keras.layers.Dense(128, activation='relu')(bn)
out = tf.keras.layers.Dense(1, activation='sigmoid')(dense)

model = tf.keras.Model(net_in, out)
model.compile('adam', 'binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x, y, epochs=50, validation_split=0.2)

Results on real data

The dataset I used contained about 13 million event streams, exported with a sliding-window approach, and the class imbalance was about 1:100. Unfortunately, I cannot share the dataset, only the results.


On this data, however, it is possible to compare LSTMs with Conv1D architectures. It turns out that the CNN approach outperforms LSTMs in several ways. First, convolutions are faster to compute, so the model trains much faster. Second, LSTMs are more sensitive to hyperparameters, so the model becomes more robust with convolutions; moreover, the accuracy even increases slightly. In the classifier characteristics shown below, there are notable differences in the precision/recall curves and slight differences in the ROC curves. Hence, I would suggest using the CNN approach.


[Image: AUC and precision-recall curves for the experiment]

Tips and tricks

In reality, data is not perfect: we often have duplicate events, crawlers and bots, or other noise. Before starting to classify sequences, make sure your data is cleaned. Filter out outliers. Merge duplicate events and split users' sessions at a meaningful inactivity time (see the sketch below). You can also bring time into the model (e.g., append something like "time since last event" to the embedding vector). It may also pay off to enforce not only a maximum but also a minimum sequence length. And if the target event is very rare, undersampling negative sequences is an option.

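A minimal sketch of two of these cleaning steps, assuming a per-user DataFrame user_df with sorted 'timestamp' and 'event_id' columns (hypothetical schema):

import pandas as pd

def split_sessions(df, max_gap_minutes=30):
    # start a new session whenever the gap between consecutive events is too large
    new_session = df['timestamp'].diff() > pd.Timedelta(minutes=max_gap_minutes)
    return [g['event_id'].tolist() for _, g in df.groupby(new_session.cumsum())]

def drop_consecutive_duplicates(seq):
    # merge runs of identical events into a single event
    return [e for i, e in enumerate(seq) if i == 0 or e != seq[i - 1]]

sessions = [drop_consecutive_duplicates(s) for s in split_sessions(user_df)]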

This post was first published here.

Translated from: https://towardsdatascience.com/using-machine-learning-to-personalize-user-experience-f5b6abd65602
