Customer Preferences in the Age of the Platform Business (with the Help of AI)

Marketing and product teams are tasked with understanding customers. To do so, they look at customer preferences — motivations, expectations and inclinations — which in combination with customer needs drive their purchasing decisions.

In my years as a data scientist I learned that customers — their preferences and needs — rarely (or never?) fall into simple objective buckets or segmentations we use to make sense of them. Instead, customer preferences and needs are complex, intertwined and constantly changing.

While understanding customers is already challenging enough, many modern digital businesses don’t know much about their products either. They operate digital platforms to facilitate the exchange between producers and consumers. The digital platform business model creates markets and communities with network effects that allow their users to interact and transact. The platform business does not control their inventory via a supply chain like linear businesses do.

[Image: Mohamed Hassan from Pixabay]

A good way to describe the platform business is that they do not own the means of production but instead create the means of connection. Examples of platform businesses are Amazon, Facebook, YouTube, Twitter, Ebay, AirBnB, property portals like Zillow, and aggregator businesses like travel booking websites. Over the last few decades, platform businesses have come to dominate the economy.

How can we use AI to make sense of our customers and products in the age of the platform business?

This blog post is a continuation of my previous discussion on the new gold standard of behavioural data in Marketing:

In this blog post we use a more advanced Deep Neural Network to model customers and products.

The Neural Network Architecture

We use a deep Neural Network with the following elements:

  1. Encoder: takes input data describing products or customers and maps it into Feature Embeddings. (An embedding is defined as a projection of some input into another more convenient representation space)

  2. Comparator: combines customer and product feature embeddings into a Preferences Tensor.

  3. Predictor: turns the preferences into a predicted purchase propensity.

We use the neural network to predict product purchases as its target, since we know that purchase decisions are driven by a customer’s preferences and needs. We therefore teach the encoders to extract such preferences and needs from customer behavioural data and from customer and product attributes.

We can analyse and cluster the learned customer and product features to derive a data driven segmentation. More on this later.

[Image: Morning Brew on Unsplash]

TensorFlow Implementation

The following code uses TensorFlow 2 and Keras to implement our Neural Network architecture:

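A minimal sketch of such a network is given below. The feature columns, vocabulary and layer sizes are illustrative assumptions; the layer named customer_features is the encoder output we extract again later for clustering.

import tensorflow as tf

# Illustrative feature columns -- real models define one set per tower
customer_cols = [tf.feature_column.numeric_column("customer_age")]
product_cols = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "product_category", ["books", "electronics", "fashion"]
        )
    )
]

customer_inputs = {
    "customer_age": tf.keras.Input(shape=(), name="customer_age", dtype=tf.float32)
}
product_inputs = {
    "product_category": tf.keras.Input(shape=(), name="product_category", dtype=tf.string)
}

# Encoders: map raw inputs into feature embeddings
customer_features = tf.keras.layers.DenseFeatures(customer_cols)(customer_inputs)
customer_features = tf.keras.layers.Dense(
    32, activation="relu", name="customer_features")(customer_features)
product_features = tf.keras.layers.DenseFeatures(product_cols)(product_inputs)
product_features = tf.keras.layers.Dense(
    32, activation="relu", name="product_features")(product_features)

# Comparator: combine both embeddings into a preferences tensor
preferences = tf.keras.layers.Concatenate(name="preferences")(
    [customer_features, product_features])
preferences = tf.keras.layers.Dense(32, activation="relu")(preferences)

# Predictor: turn the preferences into a purchase propensity
propensity = tf.keras.layers.Dense(1, activation="sigmoid", name="propensity")(preferences)

model = tf.keras.Model(inputs={**customer_inputs, **product_inputs}, outputs=propensity)
model.compile(optimizer="adam", loss="binary_crossentropy")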

The code creates TensorFlow feature columns and can use numerical as well as categorical features. We use the Keras functional API to define our customer preference neural network, which is compiled with the Adam optimiser and binary cross-entropy as the loss function.

Training Data with Spark

We will need training data for our customer preference model. As a platform business, your raw data will fall into the Big Data category. We use Spark to prepare terabytes of raw data from click streams, product searches and transactions. The challenge is to bridge the two technologies and feed the training data from Spark into TensorFlow.

The best format for large amounts of TensorFlow training data is the TFRecord file format, TensorFlow’s own binary storage format based on Protocol Buffers. The binary format greatly improves the performance of loading data and feeding it into model training. If you were to use, for example, CSV files, you would spend significant compute resources on loading and parsing your data rather than on training your neural network. The TFRecord file format makes sure your data pipeline is not the bottleneck of your neural network training.

The Spark-TensorFlow connector allows us to save TFRecords with Spark. Simply add it as a JAR to a new Spark session as follows:

spark = (
    SparkSession.builder
    .master("yarn")
    .appName(app_name)
    .config("spark.submit.deployMode", "cluster")
    .config("spark.jars.packages", "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0")
    .getOrCreate()
)

and write a Spark DataFrame to TFRecords as follows:

(
    training_feature_df
    .write.mode("overwrite")
    .format("tfrecords")
    .option("recordType", "Example")
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
    .save(path)
)

To load the TFRecords with TensorFlow you define the schema of your records and parse the data set into an iterator of python dictionaries using the TensorFlow dataset APIs:

SCHEMA = {
    "col_name1": tf.io.FixedLenFeature([], tf.string, default_value="Null"),
    "col_name2": tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

data = (
    tf.data.TFRecordDataset(list_of_file_paths, compression_type="GZIP")
    .map(
        lambda record: tf.io.parse_single_example(record, SCHEMA),
        num_parallel_calls=num_of_workers
    )
    .batch(num_of_records)
    .prefetch(num_of_batches)
)

Batch Scoring with Spark and PandasUDFs

After training our neural network there are obvious real-time scoring applications, for example, scoring search results in a product search to address choice paralysis on platforms with thousands or millions of products.

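As a sketch of such an application (the helper and its inputs are hypothetical, assuming the trained model accepts a dict of feature arrays), ranking a customer’s search results by predicted propensity could look like this:

import numpy as np

def rank_search_results(model, customer_features, candidate_products):
    # customer_features: dict of feature name -> scalar value for one customer
    # candidate_products: dict of feature name -> array over N candidate products
    n = len(next(iter(candidate_products.values())))
    model_input = {
        **{name: np.repeat(value, n) for name, value in customer_features.items()},
        **candidate_products,
    }
    propensities = model.predict(model_input).ravel()
    return np.argsort(-propensities)  # indices of best-matching products first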

But there is also an advanced analytics use-case: analysing the product and user features and preferences for insights, and creating a data-driven segmentation to help with product development. For this we score our entire customer base and product catalogue, capturing the outputs of the Encoders and Comparator of our model for clustering.

To capture the output of intermediary neural network layers, we can reshape our trained TensorFlow model as follows:

trained_customer_preference_model = tf.keras.models.load_model(path)
customer_feature_model = tf.keras.Model(
    inputs=trained_customer_preference_model.input,
    outputs=trained_customer_preference_model.get_layer(
        "customer_features").output
)

We score our users with Spark, using a PandasUDF to score a batch of users at a time for performance reasons:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Wrap and broadcast the model so it is de-serialised once per worker
customerFeatureModelWrapper = CustomerFeatureModelWrapper(path)
CUSTOMER_FEATURE_MODEL = spark.sparkContext.broadcast(customerFeatureModelWrapper)

@F.pandas_udf("array<float>", F.PandasUDFType.SCALAR)
def customer_features_udf(*cols):
    # FEATURE_COL_NAMES must match the model's named inputs
    model_input = {name: col.values for name, col in zip(FEATURE_COL_NAMES, cols)}
    model_output = CUSTOMER_FEATURE_MODEL.value.model.predict(model_input)
    return pd.Series([np.array(v) for v in model_output.tolist()])

(
    customer_df
    .withColumn(
        "customer_features",
        customer_features_udf(*model_input_cols)
    )
)

We have to wrap our TensorFlow model in a wrapper class to allow serialisation, broadcasting across the Spark cluster and de-serialisation of the model on all workers. I use MLflow to track model artifacts, but you could simply store them on any cloud storage without MLflow. Implement a download function fetching model artifacts from S3 or wherever you store your model.

class CustomerFeatureModelWrapper(object):
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = self._build(model_path)

    def __getstate__(self):
        # Pickle only the path: the TensorFlow model itself is not serialisable
        return self.model_path

    def __setstate__(self, model_path):
        # Rebuild the model on each Spark worker after de-serialisation
        self.model_path = model_path
        self.model = self._build(model_path)

    def _build(self, model_path):
        local_path = download(model_path)  # fetch artifacts from S3/MLflow
        return tf.keras.models.load_model(local_path)
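
The download function is left to the reader; a minimal sketch for S3 with boto3 (assuming model_path looks like s3://bucket/prefix) could be:

import os
import tempfile

import boto3

def download(model_path):
    # Copy a saved model directory from S3 into a local temp directory
    bucket_name, _, prefix = model_path.replace("s3://", "").partition("/")
    local_path = tempfile.mkdtemp()
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith("/"):  # skip directory markers
            continue
        target = os.path.join(local_path, os.path.relpath(obj.key, prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        bucket.download_file(obj.key, target)
    return local_path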

You can read more about how MLflow can help you with your Data Science Projects in my previous article:

Clustering and Segmentation

After scoring our customer base and product inventory with Spark we have a dataframe with feature and preference vectors as follows:

+-----------+---------------------------------------------------+
|product_id |product_features |
+-----------+---------------------------------------------------+
|product_1 |[-0.28878614, 2.026503, 2.352102, -2.010809, ... |
|product_2 |[0.39889023, -0.06328985, 1.634547, 3.3479023, ... |
+-----------+---------------------------------------------------+
[Image: Pixabay]

As a first step, we have to create a representative but much smaller sample of customers and products to use in clustering. It is important that you stratify your sample with equal numbers of customers and products per stratum. Commonly, we have many anonymous customers with few customer attributes (such as demographics) to stratify by. In such a situation, we can stratify customers by the attributes of the products they interact with, as a proxy. This follows our general assumption that preferences and needs drive purchase decisions. In Spark, you create a new column with the strata key, get the total counts of customers and products per stratum, and calculate the fraction per stratum to sample approximately even counts. You can use Spark's

DataFrameStatFunctions.sampleBy(col_with_strata_keys, dict_of_sample_fractions, seed)

to create a stratified sample.

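As a sketch (the strata_key column name and target size are illustrative assumptions), the per-stratum fractions could be computed like this:

# Aim for approximately even counts per stratum
target_per_stratum = 1000

strata_counts = customer_df.groupBy("strata_key").count().collect()

# Fraction per stratum, capped at 1.0 for strata smaller than the target
fractions = {
    row["strata_key"]: min(1.0, target_per_stratum / row["count"])
    for row in strata_counts
}

stratified_sample = customer_df.stat.sampleBy("strata_key", fractions, seed=42)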

To create our segmentation, we use t-SNE to visualise the high-dimensional feature vectors of our stratified data sample. t-SNE is a stochastic ML algorithm that reduces dimensionality for visualisation purposes in a way that keeps similar customers and products close together. This is also called a neighbour embedding. We can colour the t-SNE results by additional product attributes to interpret the clusters as part of our analysis and generate insights. After we obtain the results from t-SNE, we run DBSCAN on the t-SNE neighbour embeddings to find our clusters.

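A minimal sketch with scikit-learn (perplexity, eps and min_samples are illustrative assumptions that need tuning for your data):

from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# product_features: (n_samples, n_dims) array from the stratified sample
embedding_2d = TSNE(
    n_components=2,
    perplexity=30,
    init="pca",
    random_state=42,
).fit_transform(product_features)

# Cluster the 2-D neighbour embedding; noise points get the label -1
labels = DBSCAN(eps=2.0, min_samples=25).fit_predict(embedding_2d)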

With the cluster labels from the DBSCAN output we can calculate cluster centroids:

centroids = products[["product_features", "cluster"]].groupby(
    ["cluster"])["product_features"].apply(
        lambda x: np.mean(np.vstack(x), axis=0)
)

cluster
0    [0.5143338, 0.56946456, -0.26320028, 0.4439753...
1    [0.42414477, 0.012167327, -0.662183, 1.2258132...
2    [-0.0057945233, 1.2221531, -0.22178105, 1.2349...
...
Name: product_features, dtype: object

After obtaining our cluster centroids, we assign the entire customer base and product catalogue to their representative clusters, because so far we have only worked with a stratified sample of maybe 50,000 customers and products.

We use Spark again to assign all our customers and products to their closest cluster centroid. We use the L1 norm (or taxicab distance) to calculate the distance of customers/products to the cluster centroids, to emphasise per-feature alignment.

from pyspark.sql.types import FloatType
from pyspark.sql.window import Window

distance_udf = F.udf(
    lambda x, y, i: float(np.linalg.norm(np.array(x) - np.array(y), axis=0, ord=i)),
    FloatType()
)

customer_centroids = spark.read.parquet(path)
customer_clusters = (
    customer_dataframe
    .crossJoin(
        F.broadcast(customer_centroids)
    )
    .withColumn("distance", distance_udf("customer_centroid", "customer_features", F.lit(1)))
    .withColumn("distance_order", F.row_number().over(
        Window.partitionBy("customer_id").orderBy("distance")))
    .filter("distance_order = 1")
    .select("customer_id", "cluster", "distance")
)

+-----------+-------+---------+
|customer_id|cluster| distance|
+-----------+-------+---------+
| customer_1|      4|13.234212|
| customer_2|      4| 8.194665|
| customer_3|      1|  8.00042|
| customer_4|      3|14.705576|

We can then summarise our customer base to get the cluster prominence:

total_customers = customer_clusters.count()
(
    customer_clusters
    .groupBy("cluster")
    .agg(
        F.count("customer_id").alias("customers"),
        F.avg("distance").alias("avg_distance")
    )
    .withColumn("pct", F.col("customers") / F.lit(total_customers))
)

+-------+---------+------------------+-----+
|cluster|customers|      avg_distance|  pct|
+-------+---------+------------------+-----+
|      0|     xxxx|12.882028355869513| xxxx|
|      5|     xxxx|10.084179072882444| xxxx|
|      1|     xxxx|13.966814632296622| xxxx|

This completes all the steps needed to derive a data driven segmentation from our Neural Network embeddings:

Read more about segmentation and ways to extract insights from our model in my previous article:

Real-time Scoring

To learn more about how to deploy a model for real-time scoring, I recommend my previous article on the topic:

General Notes and Advice

  • Compared to the collaborative filtering approach in the linked article, the neural network learns to generalise, and a trained model can be used with new customers and new products. The neural network has no cold-start problem.

  • If you use at least some behavioural data as input for your customers, in addition to historic purchases and other customer profile data, your trained model can make purchase propensity predictions even for new customers without any transactional or customer profile data.

  • The learned product feature embeddings will cluster into a larger number of distinct clusters than your customer feature embeddings. It is not unusual for most customers to fall into one big cluster. This does NOT mean 90% of your customers are alike. As described in the introduction, most of your customers have complex, intertwined and changing preferences and needs. This means they cannot be separated into distinct groups; it does not mean they are the same. The simplification of a cluster cannot capture this, which only reiterates the need for machine learning to make sense of customers.

  • While many stakeholders will love the insights and segmentation the model can produce, the real value of the model is in its ability to predict a purchase propensity.

Jan is a successful thought leader and consultant in the data transformation of companies and has a track record of bringing data science into commercial production usage at scale. He has recently been recognised by dataIQ as one of the 100 most influential data and analytics practitioners in the UK.

Connect on LinkedIn: https://www.linkedin.com/in/janteichmann/

Read other articles: https://medium.com/@jan.teichmann

Translated from: https://towardsdatascience.com/customer-preferences-in-the-age-of-the-platform-business-with-the-help-of-ai-98b0eabf42d9
