Segmenting the Toronto Restaurant Market: Lending an Analytics Hand to a Distressed Sector

The Toronto restaurant landscape is a bustling set of culinary traditions that is as rich as it is diverse. No surprise for a city touted as the most multicultural in the world. It isn’t hard to romanticize either; these are the places where Torontonians celebrate occasions, catch up with old friends, close important deals, sustain themselves on the go, and simply enjoy some of the best and most unique food offerings anywhere.

You could probably imagine then that the Toronto restaurant community is very near and dear to my heart. So, when the COVID-19 pandemic sent the sector teetering on the brink of catastrophe, I was compelled to lend whatever hand I could to help preserve the source of so many of my fondest memories. While the reality is that many doors have already been forced shut (a trend I anticipate will continue as the respite afforded by patio season draws to its end), I decided to put my skills to work and share some useful insights.

The following data science project is the result. It aims to guide restaurants’ recovery strategies through a broad overview of the market. More importantly, it hopes to inspire the data science community to lend their skills to this cause and further build on it. For any of my data science friends looking for their next side project, consider this an open invitation to collaborate. As you’ll see, there are many worthy data science problems that remain to be explored within the scope of this project; check out my GitHub repository for more details.

With that said, I want to keep this article as approachable as possible, so I’ve made every effort to annotate the technical nitty-gritty and keep things intuitive. So let’s get started!

Prompt

As alluded to in the title of this article, I set out to segment the Toronto restaurant market into distinct groups of more or less similar restaurants. The purpose of this is to help restaurants gain deeper insights into their respective competitive niche which they can use to parse out relevant trends from noise. In other words, find out how similar restaurants have responded to the pandemic by ruling out dissimilar ones. Along the way, I also wanted to explore some of the more salient features of the restaurant sector with particular attention given to neighbourhoods and the kinds of restaurants they play host to. To wrap things up, I consolidated my findings into an intuitive and interactive map — this will be the “deliverable”.

Data Requirements

You can’t really apply fancy analytics without data so let’s tend to that now. From a market segmentation perspective, we want to collect the most relevant features on as many restaurants as possible within Toronto. Relevant features might include:

  1. Price
  2. Average Review
  3. Number of Reviews
  4. Cuisines
  5. Location

A structured and readily available data set containing all of this information will likely be hard to come by (which I can now confirm was the case), so we’ll probably need to leverage sources of unstructured data such as online review forums and catalogs.

Solution

For the most part, this data can be scraped from the web by building a crawler, which essentially extracts information from a set of web pages. APIs offered by Google and Foursquare are an alternative, but I chose scraping, the freer of the two options. When scraping, we’ll want to keep the scope to a single website for the sake of consistency. So, we’ll look for a single platform (website) with an extensive and rich library of Toronto restaurants.

Lay of the Land

After a couple of hours of scraping, we now have data on approximately 5,000 restaurants in the Greater Toronto Area. The first thing I want to do is get familiar with it. To that end, there is no substitute for good old-fashioned exploratory data analysis. I go into greater depth in my Jupyter Notebooks, but for now I’ll describe the data and visualize some of its more interesting features.

Data:

After considerable wrangling, my data looks a little something like this:

[Figure: Sample of restaurants data]

Some Interesting Stats:

The first thing I wanted to find out was how the various types of cuisine are distributed. Sticking with the food theme, I made a waffle chart to get a sense of the 10 most common types of cuisine in the city and how they stack up against the rest.

[Figure: Waffle chart of Toronto restaurant distribution by cuisine]

As we can see, Asian food is by far the most common kind of cuisine, with 15.56% of all restaurants cooking up Asian dishes. This is followed by Bar Food (8.03%), Italian (6.84%), and Cafes (6.09%). It’s important to note that a restaurant can carry several cuisine labels at once, and those labels aren’t necessarily mutually exclusive (e.g. Japanese and Sushi). For instance, a restaurant might be labelled Italian, Pizza, and Desserts. That said, it’s interesting to see that 49.32% of all restaurants in our sample do not belong to any of the top 10 most common cuisines, signalling a rather diverse landscape.
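
Because a restaurant can carry several labels, the shares above are best read as "fraction of restaurants tagged with X", so they need not add up to 100%. Here is a minimal sketch of that calculation on a made-up sample; the tag lists and the `cuisine_shares` helper are illustrative assumptions, not the code from my notebooks.

```python
from collections import Counter

# Hypothetical sample: each restaurant carries a list of cuisine tags.
restaurants = [
    ["Italian", "Pizza", "Desserts"],
    ["Japanese", "Sushi"],
    ["Asian"],
    ["Bar Food"],
    ["Asian", "Japanese"],
]


def cuisine_shares(tag_lists, top_n=10):
    """Return the share of restaurants per top cuisine, plus the share
    of restaurants that carry none of the top cuisines."""
    n = len(tag_lists)
    # Count each cuisine at most once per restaurant.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    shares = {cuisine: count / n for cuisine, count in counts.most_common(top_n)}
    top = set(shares)
    outside = sum(1 for tags in tag_lists if not top & set(tags)) / n
    return shares, outside


shares, outside = cuisine_shares(restaurants, top_n=2)
print(shares, outside)
```

With `top_n=10` on the real data, `outside` would correspond to the 49.32% figure quoted above.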

Next, I was curious to see if any of my quantitative variables were correlated with one another. For example, do more expensive restaurants tend to get better ratings?

[Figure: Correlation matrix of number of reviews, average review, and cost for 2]

From the correlation matrix above, it looks like (on average) there aren’t really any meaningful relationships between the number of reviews, average review, and the cost for 2. But of course, this only looks at the aggregate of our data, and there may be more pronounced relationships within certain groups of restaurants. For example, it may turn out that more expensive fine dining restaurants garner better reviews than less expensive fine dining restaurants. This could be an interesting thing to look at in the future.
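
To make the aggregate-versus-group point concrete, here is a small sketch using pandas. The column names (`category`, `cost_for_2`, `avg_review`, `n_reviews`) and the numbers are invented for illustration; the real data set may be organized differently.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
df = pd.DataFrame({
    "category": ["Fine Dining"] * 4 + ["Cafe"] * 4,
    "cost_for_2": [120, 150, 180, 200, 15, 20, 25, 30],
    "avg_review": [4.0, 4.2, 4.5, 4.7, 4.4, 4.1, 4.3, 3.9],
    "n_reviews": [80, 95, 60, 120, 300, 150, 220, 90],
})
numeric_cols = ["cost_for_2", "avg_review", "n_reviews"]

# Pearson correlation matrix over the whole sample.
overall = df[numeric_cols].corr()

# The same matrix computed separately within each group of restaurants.
by_group = {name: g[numeric_cols].corr() for name, g in df.groupby("category")}

print(overall.round(2))
print(by_group["Fine Dining"].round(2))
```

In this toy sample the price/rating relationship is strong within fine dining even though the overall picture is muddier, which is the kind of effect a per-group analysis could surface.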

Segmenting the Market (Clustering)

Now for the main course: segmenting the Toronto restaurant market.

Disclaimer: This section gets a little technical. I break things down into plain English in the last paragraph so feel free to scroll there for the punchline.

Market segmentation is, of course, an unsupervised learning problem (i.e., our learning algorithm trains on unlabeled data). Clustering algorithms such as k-means and k-modes are tools we might apply to group our data into clusters of more or less similar restaurants. Each has its limitations, and the appropriate choice will ultimately depend on the data.

Our data has the particular property of being both numeric (average review, review counts, average cost, etc) and categorical (type of cuisine, occasion, etc). The first step we’ll want to take is to one-hot encode our categorical variables to obtain dummies that we can pass to our clustering algorithm.
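
As a rough illustration of the encoding step, here is a minimal sketch with pandas. It assumes the cuisines are stored as a comma-separated string per restaurant, which is a guess about the layout; the column names are hypothetical too.

```python
import pandas as pd

# Hypothetical sample; the real columns and separators may differ.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "avg_review": [4.1, 3.8, 4.6],
    "cuisines": ["Italian,Pizza,Desserts", "Japanese,Sushi", "Italian"],
})

# One 0/1 column per cuisine; a restaurant can be "hot" in several columns.
cuisine_dummies = df["cuisines"].str.get_dummies(sep=",")

# Swap the raw text column for its dummy columns.
encoded = pd.concat([df.drop(columns="cuisines"), cuisine_dummies], axis=1)
print(encoded)
```

For a categorical column that holds exactly one value per row, `pd.get_dummies` does the same job.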

Now that we have dummies, let’s take inventory. Most of our data is categorical and presently coded as dummies (each of which is a binary 0/1 variable, i.e. follows a Bernoulli distribution).

[Figure: Restaurant data with one-hot encoded categorical variables]

In fact, we can see that 232 out of our 235 features are categorical.

Since k-means aims to minimize the squared Euclidean distance between data points and their cluster centroids (which, for a fixed data set, is equivalent to maximizing the between-cluster sum of squares), it generally performs poorly on categorical data whose values belong to the discrete set {0, 1}. The k-modes algorithm is better equipped for categorical data, but conversely does not handle continuous data very well.

PCA:

To work around these constraints, we can reduce our data to its principal components through Principal Component Analysis (PCA). This should yield approximately continuous data across the board (though it does sacrifice some information). Given the sparseness of many of our dummies (some dummies are “hot” in less than 5% of instances), we should intuitively be able to reduce our data quite a bit without sacrificing too much explained variance. We will then be able to pass the result directly to our k-means algorithm as a feature set.

[Figure: Cumulative explained variance by number of principal components]

Running PCA and plotting the cumulative explained variance against the number of components, we see that roughly 100 components explain 95% of the overall variance in our data. So, we’ll reduce the dimensionality of our data to a feature set of its 100 principal components. In the end, our data goes from looking like this:

[Figure: Sample dataframe before PCA]

To something like this:

[Figure: Sample dataframe after PCA]
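
The PCA step described above can be sketched roughly as follows with scikit-learn. The matrix here is synthetic (500 rows of sparse random dummies standing in for the real 235-feature set), so the number of components it lands on will not match the 100 found on the actual data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded restaurant data:
# 500 rows, 40 sparse dummies that are "hot" about 5% of the time.
rng = np.random.default_rng(0)
X = (rng.random((500, 40)) < 0.05).astype(float)

# Fit a full PCA and accumulate the explained variance ratio.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto those components; this is the k-means feature set.
X_reduced = PCA(n_components=n_components).fit_transform(X)
print(n_components, X_reduced.shape)
```

scikit-learn can also pick the component count itself if given a fraction, as in `PCA(n_components=0.95)`. If numeric columns such as price sit on a much larger scale than the dummies, standardizing them first is usually worth considering, since PCA is sensitive to scale.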

Setting parameters (n_clusters):

We now have a feature set that is ready to be passed to our clustering algorithm (k-means). However, before we can do that, k-means requires us to predetermine the number of clusters we want to segment our restaurants into. Since we don’t have a priori knowledge of how many major segments make up the Toronto restaurant market, we’ll need another way of informing our choice of cluster number. The proposed solution, though admittedly a little more art than science, calls for iteratively running k-means for incremental numbers of clusters and plotting the corresponding sum of squared errors (SSE). Since the SSE should generally decrease monotonically for every increment of cluster number, we can’t simply minimize it without grossly overfitting our model. Instead, we’ll look for an inflection point where the rate of change in the SSE begins to really taper off (elbow method).

[Figure: Sum of squared errors (SSE) by number of clusters]

From the plot above, it looks like the rate of change starts to peter out at around 15 clusters, so we’ll set the n_clusters parameter to 15 and run our clustering algorithm on the previously defined feature set.
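
For readers who want to try the elbow method themselves, here is a minimal sketch using scikit-learn’s `KMeans`, whose `inertia_` attribute is the SSE. The data is synthetic (four well-separated blobs), so the elbow shows up at 4 rather than 15; the range of `k` values is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA feature set: four well-separated blobs.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10, size=(4, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

# Run k-means for an increasing number of clusters and record the SSE.
sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))

# Fit the final model at the chosen elbow and read off the cluster labels.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```

Plotting `sse` against `k` gives the elbow plot; the drop from one `k` to the next flattens out once the real structure has been captured.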

In Plain English:

As mentioned earlier, our objective was to segment the Toronto restaurant market into groups of more or less similar restaurants. We did this by applying a machine learning algorithm (k-means) which essentially combs through our restaurants, looks for patterns, and assigns each restaurant to a group of other restaurants by determining which it most resembles. In our case, we determined that there were roughly 15 major segments in the Toronto restaurant market so each restaurant was assigned to one out of 15 possible groups (clusters).

Finally, I created a table highlighting each group’s main features:

[Figure: Table of each cluster’s main features]

Looks like our model did a decent job of clustering restaurants into more or less consistent groups, though it would certainly improve with better data.
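
A summary table like the one above can be built with a pandas group-by over the cluster labels. This is only a sketch on invented rows: the `cluster` column is assumed to hold the k-means labels, and the other column names are hypothetical.

```python
import pandas as pd

# Hypothetical restaurants with their assigned cluster labels.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "cost_for_2": [30, 40, 120, 150, 135],
    "avg_review": [3.9, 4.1, 4.5, 4.6, 4.4],
    "top_cuisine": ["Cafe", "Cafe", "Italian", "French", "Italian"],
})

# One row per cluster: size, average price and rating, most frequent cuisine.
profile = df.groupby("cluster").agg(
    n_restaurants=("cost_for_2", "count"),
    mean_cost_for_2=("cost_for_2", "mean"),
    mean_review=("avg_review", "mean"),
    most_common_cuisine=("top_cuisine", lambda s: s.mode().iloc[0]),
)
print(profile)
```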

Interactive Map

Now that I have my clusters, I put everything into an interactive map.

How it works:

The map contains several layers — each displaying different insights — which can be accessed by hovering over the layer control icon (in the top right corner) and then selecting a mode by clicking. Some of the modes contain additional interactive features which allow you to access deeper insights by either hovering or clicking on clickable objects on the map.

Layers:

  1. openstreetmap: Simple map initiated to display Toronto.
  2. # Restaurants: Restaurant count (from data sample) by neighbourhood.
  3. Average Price (for 2): Average restaurant price for 2 people by neighbourhood.
  4. Average Rating: Average restaurant rating (out of 5) by neighbourhood.
  5. HeatMap: HeatMap illustrating restaurant density.
  6. Granular: Individually plotted restaurants (clickable).
  7. Granular (colour-coded by similarity): Individually plotted restaurants colour-coded by cluster (clickable). Each cluster (colour) represents a group of more or less similar restaurants.

[Figure: Interactive map of Toronto restaurants. Click on the layer controller (top-right) to access features.]

And there we have it! An interactive map of Toronto restaurants.

Future Direction

There are countless other interesting areas I would have liked to explore, but ultimately forwent in the interest of keeping the scope of this project manageable. Some of these included determining the factors driving reviews (regression), determining which restaurants face a heightened risk of closure (classification — though this would require more data), determining optimal street closures to accommodate increased outdoor seating (optimization), and many more. I leave these to the data science community at large.

Translated from: https://medium.com/swlh/segmenting-the-toronto-restaurant-market-lending-an-analytics-hand-to-a-distressed-sector-125e3c8a7f2b
