Building an End-to-End Customer Segmentation Solution with Alibaba Cloud

Imagine we run a retail store selling various products. To be more successful in business, especially in today’s competitive world, we have to understand our customers well, so that we can answer:

- Who are our best customers?

- Who are our potential customers?

- Which customers need to be targeted and retained?

- What are the characteristics of our customers?

One way to understand our customers is customer segmentation: the process of grouping customers based on common characteristics. We can use many variables to segment our customers; demographic, geographic, psychographic, technographic, and behavioral information is often used as a differentiator.

By enabling customer segmentation in the business, we can personalize our strategy to suit each segment’s characteristics, so that customer retention is maximized, customer experience is improved, ads perform better, and marketing costs are minimized.

So, how can we do this customer segmentation?

We will apply unsupervised machine learning techniques to segment the customers in a retail dataset, using Recency, Frequency, and Monetary (RFM) values, which have proven to be useful indicators of customer transaction behavior.

[Image: Recency, Frequency, and Monetary definitions. Image by Bima]

Recency measures how recently a customer last purchased, Frequency how often they purchase, and Monetary how much they spend in total.

We will leverage the following products to build this use case.

- Object Storage Service (OSS). OSS is an encrypted, secure, cost-effective, and easy-to-use object storage service that enables you to store, back up, and archive large amounts of data in the cloud, with guaranteed durability.

- MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.

- DataWorks is a Big Data platform product launched by Alibaba Cloud. It provides one-stop Big Data development, data permission management, offline job scheduling, and other features. Also, it offers all-around services, including Data Integration, DataStudio, Data Map, Data Quality, and DataService Studio.

- Machine Learning Platform for AI (PAI) provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.

- Data Lake Analytics (DLA) is an interactive analytics service that utilizes serverless architecture. DLA uses SQL interfaces to interact with client applications, complies with standard SQL syntax, and provides many familiar functions. DLA allows you to retrieve and analyze data from multiple data sources or locations, such as OSS and Table Store, for optimal data processing, analytics, and visualization, giving better insights and ultimately guiding better decision making.

We will start by preparing our data, then train the model, and finally create a pipeline for serving it.

[Image: End-to-end process. Image by Bima]

Data Preparation

First, let’s get to know our data. We use the Online Retail dataset from the UCI Machine Learning Repository, which consists of transaction data from 2010 and 2011.

[Image: Dataset overview. Image by Bima]
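For a quick local look before moving anything to the cloud, a minimal pandas sketch might be the following (assuming a CSV export of the UCI file; its columns are InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country):

```python
import pandas as pd

# Inspect a CSV export of the UCI Online Retail dataset.
df = pd.read_csv(
    "OnlineRetail.csv",
    encoding="ISO-8859-1",          # the common export uses Latin-1
    parse_dates=["InvoiceDate"],
)
print(df.shape)
print(df.head())
```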

Step 1: Data Collection

We will use the 2010 data for model training and rename that file to transaction_train.csv. The 2011 data serves as an example of daily data that needs to be processed, so we rename it to transaction_daily.csv. We then store both files in Alibaba Cloud Object Storage Service (OSS), as sketched below.

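Uploading can be done through the OSS console; a minimal sketch with the oss2 Python SDK looks like this (the endpoint, bucket name, and object keys are placeholders):

```python
import oss2

# Placeholder credentials and bucket details -- replace with your own.
auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-ap-southeast-1.aliyuncs.com", "retail-demo")

# Upload the training and daily transaction files to OSS.
bucket.put_object_from_file("data/transaction_train.csv", "transaction_train.csv")
bucket.put_object_from_file("data/transaction_daily.csv", "transaction_daily.csv")
```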

Step 2: Data Ingestion

We then use MaxCompute and DataWorks to orchestrate the process. This step ingests the data from OSS into MaxCompute using DataWorks.

First, we need to create a table in MaxCompute to store the data. The table must match our dataset schema. We can do this by right-clicking the business flow, choosing Create Table, and then defining the schema with a DDL statement, for example as sketched below.
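A hedged example of such a DDL, executed here through the PyODPS SDK for illustration (the project name and endpoint are placeholders; column names mirror the UCI schema):

```python
from odps import ODPS

# Placeholder connection details -- replace with your own.
o = ODPS("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>",
         project="retail_demo",
         endpoint="https://service.odps.aliyun.com/api")

# DDL matching the Online Retail dataset schema.
o.execute_sql("""
CREATE TABLE IF NOT EXISTS transaction_train (
    invoice_no   STRING,
    stock_code   STRING,
    description  STRING,
    quantity     BIGINT,
    invoice_date DATETIME,
    unit_price   DOUBLE,
    customer_id  STRING,
    country      STRING
)
""")
```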


Now that the table is ready, we create a Data Integration node to sync the data from OSS to MaxCompute. In this node, we set up the source and target, then map each column.


Then we run the node to ingest the data from OSS into MaxCompute and get our data in place.

Step 3: Data Cleaning and Transformation

Our data needs to be cleaned and transformed before we use it for model training. First, we clean out invalid values. Then we transform the data by calculating the Recency, Frequency, and Monetary values for each customer.

In DataWorks, we use a SQL node for this task. We start by creating a new table to store the result, then create the node and write a DML query to complete the task, roughly as sketched below.
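A hedged sketch of what that DML might look like, executed through PyODPS for illustration. The snapshot date, the customer_rfm target table, and the exact DATEDIFF syntax are assumptions about the MaxCompute SQL dialect:

```python
from odps import ODPS

o = ODPS("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>",
         project="retail_demo",
         endpoint="https://service.odps.aliyun.com/api")

# Aggregate raw transactions into one RFM row per customer:
#   recency   -- days since the last purchase, vs. a fixed snapshot date
#   frequency -- number of distinct invoices
#   monetary  -- total amount spent
o.execute_sql("""
INSERT OVERWRITE TABLE customer_rfm
SELECT  customer_id,
        DATEDIFF(CAST('2011-01-01 00:00:00' AS DATETIME),
                 MAX(invoice_date), 'dd')   AS recency,
        COUNT(DISTINCT invoice_no)          AS frequency,
        SUM(quantity * unit_price)          AS monetary
FROM    transaction_train
WHERE   customer_id IS NOT NULL
  AND   quantity > 0
  AND   unit_price > 0
GROUP BY customer_id
""")
```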


As a result of this task, we have RFM values for each customer.

[Image: Data transformation from raw data to RFM features. Image by Bima]

We will use this data preparation step in both the Model Training and Model Serving processes. For model training, we run it only once, until we have a model. In the Model Serving pipeline, however, this data preparation runs as a daily batch.

Model Training

We will use the K-Means unsupervised machine learning algorithm. K-Means clustering is the most widely used clustering algorithm; it divides n objects into k clusters so as to maintain high similarity within each group. Similarity is calculated based on the mean value of the objects in a cluster.

The algorithm randomly selects k objects, each of which initially represents the mean or center of a cluster. It then assigns the remaining objects to the nearest cluster based on their distance from each cluster center and recalculates each cluster’s mean. This process repeats until the criterion function converges.

K-Means assumes that object attributes come from a spatial vector, and its objective is to minimize the sum of squared errors within each group, as formalized below.
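Formally, with clusters C_1, …, C_k and centroids μ_1, …, μ_k, K-Means minimizes the within-cluster sum of squared errors:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2}
```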

In this step, we will create an experiment using PAI Studio and do optimization using Data Science Workshop (DSW). PAI Studio and DSW are part of Alibaba Cloud’s Machine Learning Platform for AI.

Creating an experiment in PAI Studio is relatively simple. PAI Studio already provides many functions as components that we can drag and drop onto the experiment pane; then we just need to connect the components. The image below shows the experiment we will build for our model.

[Image: Model training steps. Image by Bima]

Step 1: Data Exploration

In this step, we aim to understand the data. We explore it by generating descriptive statistics, creating a histogram, and creating a scatter plot to check the correlation between variables. We do this by adding components for those tasks after reading in the data.

[Image: Data exploration results. Image by Bima]

As a result, we see that the frequency and monetary data are skewed. Hence, we need to do feature engineering before creating the model.

Step 2: Feature Engineering

We apply a log transformation to handle the skew in our data. We also need to standardize the values before modeling, because K-Means uses distance as its measure and each parameter must be on the same scale. To do this, we create a feature transformation component and a standardization component.

Then, below this node, we save the standardization parameters into a table so that we can reuse them during deployment. A sketch of the equivalent logic follows.
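A minimal pandas/NumPy sketch of the same transformation, assuming the RFM table from the previous step. The key point is returning the fitted parameters so the serving pipeline can reuse them:

```python
import numpy as np
import pandas as pd

def fit_transform_rfm(rfm: pd.DataFrame):
    """Log-transform and standardize the RFM features, returning both
    the transformed data and the parameters needed at serving time."""
    features = ["recency", "frequency", "monetary"]
    logged = np.log1p(rfm[features])            # tame the right skew
    params = {"mean": logged.mean(), "std": logged.std()}
    standardized = (logged - params["mean"]) / params["std"]
    return standardized, params

# At training time: fit once, then persist `params`
# (e.g., to a MaxCompute table) for the serving pipeline.
```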

[Image: Feature transformation data journey. Image by Bima]

Step 3: Model Creation

Now it’s time to create our model. We will use K-Means to find customer clusters. To do this, we need to set the number of clusters as the model’s hyperparameter.

To find the optimal number of clusters, we use Data Science Workshop to iterate the modeling with different numbers of clusters and generate an elbow plot. The optimal number of clusters is where the sum of squared errors starts to flatten out.

DSW is a Jupyter-Notebook-like environment. Here we create an experiment by writing a Python script that generates the elbow plot, roughly as follows.
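A minimal scikit-learn sketch of that script, assuming the standardized RFM features from the feature engineering step are available as standardized:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# `standardized` is the log-transformed, standardized RFM table
# produced in the feature engineering step.
ks = range(2, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(standardized)
    inertias.append(model.inertia_)   # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared errors")
plt.title("Elbow plot")
plt.show()
```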

[Image: Elbow plot to find the optimal number of clusters. Image by Bima]

As a result, we find that the optimal number of clusters is five. We then use this as the hyperparameter for the K-Means component in PAI Studio and run the component.

As a result of this component, each customer gets a cluster_index. We can also visualize the results as a scatter plot colored by cluster_index. The component also produces the model that will be served later.

[Image: Segmentation results. Image by Bima]

Step 4: Save Cluster Results

Here we combine the original data with the output of the K-Means component. We then save the results to a MaxCompute table as our experiment output by creating a Write MaxCompute Table component.

Step 5: Labelling Customer Segments

We use a SQL component to calculate the average Recency, Frequency, and Monetary values for each cluster, so that we can understand each cluster’s characteristics and give each cluster a name, as sketched after the figures below.

[Image: Average values of each cluster. Image by Bima]

[Image: Segment details. Image by Bima]
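An equivalent pandas sketch of the profiling and naming step. The index-to-name mapping here is hypothetical, since the actual cluster indices depend on the K-Means run; the names come from the Conclusion:

```python
import pandas as pd

# `rfm_clusters` holds customer_id, recency, frequency, monetary,
# and the cluster_index assigned by K-Means.
profile = (rfm_clusters
           .groupby("cluster_index")[["recency", "frequency", "monetary"]]
           .mean()
           .round(1))
print(profile)

# Hypothetical mapping from cluster_index to a business-friendly name,
# chosen by inspecting the averages above.
segment_names = {0: "loyalist", 1: "potential", 2: "churn",
                 3: "potential loss", 4: "loss"}
rfm_clusters["segment"] = rfm_clusters["cluster_index"].map(segment_names)
```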

Now we go back to DataWorks and create a PAI node to run our experiment. We then use a SQL node with a DML query to label each cluster. Lastly, we create another Data Integration node to send the data from MaxCompute back to OSS.

[Image: Business flow in DataWorks. Image by Bima]

Model Training Summary

Those are the five steps needed to train the model using Alibaba Cloud products such as OSS, MaxCompute, DataWorks, and Machine Learning Platform for AI. In summary, the diagram below shows the model training architecture from data preparation to model training.

[Image: Customer segmentation model training architecture. Image by Bima]

Model Serving

A user’s recency, frequency, and monetary data can change over time, which means we need to continuously re-label users based on their current characteristics. Segmentation would be a challenging task if we did it manually, so we deploy the segmentation process to run automatically as a batch job once a day. The image below summarizes the steps for serving the model.

[Image: Customer segmentation model serving architecture. Image by Bima]

We begin by putting the daily data into OSS and then using DataWorks to transfer it from OSS to MaxCompute. After that, we clean and transform the data into RFM form, the same as before.

Then we create a new experiment as a PAI node. In this experiment, we apply the same feature engineering steps, Log Transform and Standardize, but we reuse the standardization parameters saved during training.

Next, we use a prediction component with the K-Means model created earlier to predict the segments of the new data. We then join the segment results with the original parameters and save them into a MaxCompute table. A sketch of the serving-side logic follows.
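A minimal sketch of that serving-side logic, reusing the params saved by the training sketch above; a fitted scikit-learn KMeans stands in for the PAI model here:

```python
import numpy as np
import pandas as pd

def transform_for_serving(rfm_daily: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply the training-time log transform and standardization
    parameters to a fresh batch of daily RFM data."""
    features = ["recency", "frequency", "monetary"]
    logged = np.log1p(rfm_daily[features])
    return (logged - params["mean"]) / params["std"]

# `kmeans` is the clustering model produced during training.
standardized_daily = transform_for_serving(rfm_daily, params)
rfm_daily["cluster_index"] = kmeans.predict(standardized_daily)
```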

After that, we create a SQL node to label each customer with the corresponding segment name. Then we send the results back to OSS as .csv files by creating a Data Integration node. The image below shows the overall flow in DataWorks and the PAI experiment.

[Image: Model serving business flow in DataWorks and PAI. Image by Bima]

Lastly, Data Lake Analytics (DLA), a serverless, cloud-native interactive query and analytics service, is used to connect the OSS files with external visualization tools such as Tableau, so that we can present the segmentation results to our business team for monitoring and use. The image below shows the customer segmentation dashboard created in Tableau.
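Since DLA exposes a MySQL-compatible interface, standard MySQL clients can query the tables it defines over OSS files. A hedged sketch with pymysql, where the endpoint, credentials, and the customer_segments table name are all placeholders:

```python
import pymysql

# Placeholder connection details for DLA's MySQL-compatible endpoint.
conn = pymysql.connect(host="<dla-endpoint>", port=3306,
                       user="<user>", password="<password>",
                       database="retail_demo")
try:
    with conn.cursor() as cur:
        # Count customers per segment from the results written to OSS.
        cur.execute("""
            SELECT segment, COUNT(*) AS customers
            FROM customer_segments
            GROUP BY segment
        """)
        for segment, customers in cur.fetchall():
            print(segment, customers)
finally:
    conn.close()
```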

[Image: Customer segmentation dashboard. Image by Bima]

Conclusion

As a result, five clusters have been identified: loyalist, potential, churn, potential loss, and loss. Our business team can now design and execute a marketing strategy to engage each segment.

In the process, we also learned how to combine various Alibaba Cloud products into an end-to-end machine learning pipeline that serves a business use case.

Bibliography

- https://www.alibabacloud.com/product/maxcompute

- https://www.alibabacloud.com/product/ide

- https://www.alibabacloud.com/product/data-lake-analytics

- https://www.alibabacloud.com/product/oss

Translated from: https://towardsdatascience.com/building-end-to-end-customer-segmentation-solution-with-alibaba-cloud-db326e90c2fb
