A Simple Trending Products Recommendation Engine in Python

by Chris Clark | February 28, 2017

This blog post originally appeared on Chris Clark’s blog. Chris is the cofounder of Grove Collaborative, a certified B-corp that delivers amazing, affordable, and effective natural products to your doorstep. We’re fans.

Background

Our product recommendations at Grove.co were boring. I knew that because our customers told us. When surveyed, the #1 thing they wanted from us was better product discovery. And looking at the analytics data, I could see customers clicking through page after page of recommendations, looking for something new to buy. We weren’t doing a good job surfacing the back half of our catalog. There was no serendipity.

We weren’t doing a good job of surfacing around half of our catalog.

One common way of increasing exposure to the long tail of products is by simply jittering the results at random. But injecting randomness has two issues: first, you need an awful lot of it to get products deep in the catalog to bubble up, and second, it breaks the framing of the recommendations and makes them less credible in the eyes of your customers.

What do I mean by ‘framing’? Let’s look at a famous example from Yahoo!

The Britney Spears Effect

Let’s say you’re reading about this weekend’s upcoming NFL game. Underneath that article are a bunch of additional articles, recommended for you by an algorithm. In the early 2000s, it turned out just about everyone wanted to read about Britney Spears, whether they would admit it or not.

So you get to the bottom of your Super Bowl game preview and it says “You might also like:” and then shows you an article about Britney and K-fed. You feel kind of insulted by the algorithm. Yahoo! thinks I want to read about Britney Spears??

Other people who read about recommender engines read…

But instead, what if it said “Other people who read this article read:”? Now…huh…ok – I’ll click. The framing gives me permission to click. This stuff matters!

Just like a good catcher can frame an on-the-margin baseball pitch for an umpire, showing product recommendations on a website in the right context puts customers in the right mood to buy or click.

“Recommended for you” — ugh. So the website thinks it knows me, eh? How about this instead:

“Households like yours frequently buy”

Now I have context. Now I understand. This isn’t a retailer shoving products in front of my face, it’s a helpful assemblage of products that customers just like me found useful. Chock-full of social proof!

Finding Some Plausible Serendipity

After an awesome brainstorming session with one of our investors, Paul Martino from Bullpen Capital, we came up with the idea of a trending products algorithm. We’ll take all of the add-to-cart actions every day, and find products that are trending upwards. Sometimes, of course, this will just reflect the activities of our marketing department (promoting a product in an email, for instance, would cause it to trend), but with proper standardization it should also highlight newness, trending search terms, and other serendipitous reasons a product might be of interest. It’s also easier for slower-moving products to make sudden gains in popularity, so this approach should bring some of those long-tail products to the surface.

Implementing a Trending Products Engine

First, let’s get our add-to-cart data. From our database, this is relatively simple; we track the creation time of every cart-product (we call it a ‘shipment item’) so we can just extract this using SQL. I’ve taken the last 20 days of cart data so we can see some trends (though really only a few days of data is needed to determine what’s trending):

SELECT v.product_id
, -(CURRENT_DATE - si.created_at::date) "age"
, COUNT(si.id)
FROM product_variant v
INNER JOIN schedule_shipmentitem si ON si.variant_id = v.id
WHERE si.created_at >= (now() - INTERVAL '20 DAYS')
AND si.created_at < CURRENT_DATE
GROUP BY 1, 2

I’ve simplified the above a bit (the production version has some subtleties around active products, paid customers, the circumstances in which the product was added, etc), but the shape of the resulting data is dead simple:

Each row represents the number of cart adds for a particular product on a particular day in the past 20 days. I use ‘age’ as -20 (20 days ago) to -1 (yesterday) so that, when visualizing the data, it reads left-to-right, past-to-present, intuitively.
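The same convention can be sanity-checked in a couple of lines of Python; the dates below are made up purely for illustration, mirroring the `-(CURRENT_DATE - si.created_at::date)` expression in the SQL above:

```python
import datetime

# 'age' is the negative number of days between the cart-add date and
# 'today', so yesterday maps to -1 and twenty days ago maps to -20
today = datetime.date(2017, 2, 28)    # hypothetical current date
created = datetime.date(2017, 2, 8)   # a cart-add from 20 days ago
age = -(today - created).days
print(age)  # -20
```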

Here’s sample data for 100 random products from our database. I’ve anonymized both the product IDs and the cart-adds in such a way that, when standardized, the results are completely real, but the individual data points don’t represent our actual business.

Basic Approach

Before we dive into the code, let’s outline the basic approach by visualizing the data. All the code for each intermediate step, and the visualizations, is included and explained later.

Here’s the add-to-carts for product 542, from the sample dataset:

The first thing we’ll do is apply a low-pass filter (a smoothing function) so daily fluctuations are attenuated.

Then we’ll standardize the Y-axis, so popular products are comparable with less popular products. Note the change in the Y-axis values.

Last, we’ll calculate the slopes of each line segment of the smoothed trend.

Our algorithm will perform these steps (in memory, of course, not visually) for each product in the dataset and then simply return the products with the greatest slope values in the past day, e.g. the max values of the red line at t=-1.

The CODE!

Let’s get into it! You can run all of the code in this post via a Python 2 Jupyter notebook or using Yhat’s Python IDE, Rodeo.

Here’s the code to produce the first chart (simply visualizing the trend). Just like we built up the charts, we’ll build from this code to create the final algorithm.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Read the data into a Pandas dataframe
df = pd.read_csv('sample-cart-add-data.csv')

# Group by ID & Age
cart_adds = pd.pivot_table(df, values='count', index=['id', 'age'])

ID = 542
trend = np.array(cart_adds[ID])

x = np.arange(-len(trend),0)
plt.plot(x, trend, label="Cart Adds")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title(str(ID))
plt.show()

It doesn’t get much simpler. I use the pandas pivot_table function to create an index of both product IDs and the ‘age’ dimension, which just makes it easy to select the data I want later.

Smoothing

Let’s write the smoothing function and add it to the chart:
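The original listing didn’t survive in this copy of the post; a sketch in the spirit of the SciPy Cookbook recipe it is described as being based on (the window size of 7 and Hamming default are my assumptions) looks like this:

```python
import numpy as np

def smooth(series, window_size=7, window=np.hamming):
    # Extend the series at both ends by inverting and mirroring the
    # edge points, so the window still 'fits' at the boundaries
    ext = np.r_[2 * series[0] - series[window_size - 1::-1],
                series,
                2 * series[-1] - series[-1:-window_size:-1]]
    # Normalize the window weights and slide them across the data
    weights = window(window_size)
    smoothed = np.convolve(weights / weights.sum(), ext, mode='same')
    # Trim the padding so the output lines up with the input
    return smoothed[window_size:-window_size + 1]
```

Calling `smooth(trend)` on the cart-add series from the previous block returns an array of the same length with the daily spikes damped.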

This function merits an explanation. First, it’s taken more-or-less from the SciPy Cookbook, but modified to be…less weird.

The smooth function takes a ‘window’ of weights, defined in this case by the Hamming Window, and ‘moves’ it across the original data, weighting adjacent data points according to the window weights.

NumPy provides a bunch of windows (Hamming, Hanning, Blackman, etc.), and you can get a feel for them at the command line:

>>> print np.hamming(7)
[ 0.08  0.31  0.77  1.    0.77  0.31  0.08]

That ‘window’ will be moved over the data set (‘convolved’) to create a new, smoothed set of data. This is just a very simple low-pass filter.

The padding step inverts and mirrors the first few and last few data points in the original series so that the window can still ‘fit’, even at the edge data points. This might seem a little odd, since at the end of the day we are only going to care about the final data point to determine our trending products. You might think we’d prefer to use a smoothing function that only examines historical data. But because the interpolation just mirrors the trailing data as it approaches the forward edge, there’s ultimately no net effect on the result.

Standardization

We need to compare products that average, for instance, 10 cart-adds per day to products that average hundreds or thousands. To solve this problem, we standardize the data by dividing by the Interquartile Range (IQR):
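That code also went missing in this copy; a minimal version matching the description (divide by the IQR and, as explained next, subtract the median) could be:

```python
import numpy as np

def standardize(series):
    # Scale by the interquartile range so popular and unpopular
    # products land on a comparable scale, and center on the median
    iqr = np.percentile(series, 75) - np.percentile(series, 25)
    return (series - np.median(series)) / iqr
```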

I also subtract the median so that the series more-or-less centers around 0, rather than 1. Note that this is standardization, not normalization; the difference is that normalization strictly bounds the values in the series within a known range (typically 0 to 1), whereas standardization just puts everything onto the same scale.

There are plenty of ways of standardizing data; this one is plenty robust and easy to implement.

Slopes

Really simple! To find the slope of the smoothed, standardized series at every point, just take a copy of the series, offset it by 1, and subtract. Visually, for some example data:

And in code:

slopes = smoothed_std[1:] - smoothed_std[:-1]
plt.plot(x[1:], slopes)

Boom! That was easy.

Putting it all together

Now we just need to repeat all of that, for every product, and find the products with the max slope value at the most recent time step.

The final implementation is below:
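The final listing was lost in this copy of the post as well; reassembling the steps above (using the smoothing and standardization sketches from earlier, and a `groupby` in place of `pivot_table` so the lookup by product ID works on current pandas) gives roughly:

```python
import numpy as np
import pandas as pd

def smooth(series, window_size=7, window=np.hamming):
    # Low-pass filter: mirror-pad the edges, convolve with the window
    ext = np.r_[2 * series[0] - series[window_size - 1::-1],
                series,
                2 * series[-1] - series[-1:-window_size:-1]]
    weights = window(window_size)
    return np.convolve(weights / weights.sum(), ext,
                       mode='same')[window_size:-window_size + 1]

def standardize(series):
    # Center on the median, scale by the interquartile range
    iqr = np.percentile(series, 75) - np.percentile(series, 25)
    return (series - np.median(series)) / iqr

def trending(df, n=5):
    # Expects one (id, age, count) row per product per day; returns
    # the n products with the steepest final slope
    cart_adds = df.groupby(['id', 'age'])['count'].sum()
    scores = []
    for product_id in df['id'].unique():
        trend = cart_adds[product_id].values.astype(float)
        slopes = np.diff(standardize(smooth(trend)))
        scores.append((product_id, slopes[-1]))  # slope at t = -1
    return sorted(scores, key=lambda t: t[1], reverse=True)[:n]

# Usage against the sample dataset from the post:
# df = pd.read_csv('sample-cart-add-data.csv')
# for product_id, score in trending(df):
#     print("Product %d (score: %.2f)" % (product_id, score))
```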

And the result:

Top 5 trending products:
Product 103 (score: 1.31)
Product 573 (score: 1.25)
Product 442 (score: 1.01)
Product 753 (score: 0.78)
Product 738 (score: 0.66)

That’s the core of the algorithm. It’s now in production, performing well against our existing algorithms. We have a few additional pieces we’re putting in place to goose the performance further:

  1. Throwing away any results from wildly unpopular products. Otherwise, products that fluctuate around 1-5 cart-adds per day too easily appear in the results just by jumping to 10+ adds for one day.

  2. Weighting products so that a product that jumps from an average of 500 adds/day to 600 adds/day has a chance to trend alongside a product that jumped from 20 to 40.


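Neither refinement appears in the listing above; a sketch of the first one, with a made-up threshold, might be as simple as:

```python
import numpy as np

MIN_AVG_DAILY_ADDS = 10  # hypothetical cutoff; tune for your catalog

def popular_enough(trend, min_avg=MIN_AVG_DAILY_ADDS):
    # Guard against tiny-volume products whose slopes are all noise
    return bool(np.mean(trend) >= min_avg)
```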
There is weirdly little material out there about trending algorithms – and it’s entirely possible (likely, even) that others have more sophisticated techniques that yield better results.

But for Grove, this hits all the marks: it’s explicable, serendipitous, and gets more clicks than any other product feed we’ve put in front of our customers.

There we have it, folks.

Translated from: https://www.pybloggers.com/2017/02/a-simple-trending-products-recommendation-engine-in-python/
