PySpark + MMOE Multi-Task Learning Model in Practice

Background:

We need to optimize several page-level metrics such as clicks, likes, and comments, so we use an MMOE model to learn these page metrics jointly as multiple objectives. For privacy reasons the concrete features are omitted from the code below; in a real application you can add sequence features, hash bucketing, and similar preprocessing as needed.
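As a minimal sketch of the hash bucketing mentioned above (the column name device_id and the bucket count are hypothetical, not from the production code), a high-cardinality id column can be mapped into a fixed number of buckets before being wrapped in a SparseFeat:

import zlib

import pandas as pd

HASH_BUCKET_SIZE = 100000  # hypothetical bucket count

def hash_bucket(series: pd.Series, num_buckets: int = HASH_BUCKET_SIZE) -> pd.Series:
    # crc32 is deterministic across processes (unlike Python's salted built-in hash),
    # so the same id always lands in the same bucket.
    return series.astype(str).map(lambda x: zlib.crc32(x.encode('utf-8')) % num_buckets)

# e.g. data['device_id'] = hash_bucket(data['device_id'])
# The matching feature column would then use vocabulary_size=HASH_BUCKET_SIZE.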

Runtime environment:

deepctr[gpu], pyspark==2.4.0, pandas, scikit-learn, numpy, keras==2.2.4

Model file:

import os
import pandas as pd
import numpy as np
import tensorflow as tf

from time import time

from deepctr.feature_column import build_input_features, input_from_feature_columns, SparseFeat, DenseFeat, get_feature_names
from deepctr.layers.core import PredictionLayer, DNN
from deepctr.layers.utils import combined_dnn_input, reduce_sum

# from evaluation import evaluate_deepctr

def MMOE(dnn_feature_columns, num_experts=3, expert_dnn_hidden_units=(256, 128), tower_dnn_hidden_units=(64,),
         gate_dnn_hidden_units=(), l2_reg_embedding=0.00001, l2_reg_dnn=0, seed=1024, dnn_dropout=0,
         dnn_activation='relu',
         dnn_use_bn=False, task_types=('binary', 'binary', 'binary'), task_names=('click', 'like', 'comment')):
    """Instantiates the Multi-gate Mixture-of-Experts multi-task learning architecture.
    :param dnn_feature_columns: An iterable containing all the features used by deep part of the model.
    :param num_experts: integer, number of experts.
    :param expert_dnn_hidden_units: list,list of positive integer or empty list, the layer number and units in each layer of expert DNN.
    :param tower_dnn_hidden_units: list,list of positive integer or empty list, the layer number and units in each layer of task-specific DNN.
    :param gate_dnn_hidden_units: list,list of positive integer or empty list, the layer number and units in each layer of gate DNN.
    :param l2_reg_embedding: float. L2 regularizer strength applied to embedding vector
    :param l2_reg_dnn: float. L2 regularizer strength applied to DNN
    :param seed: integer ,to use as random seed.
    :param dnn_dropout: float in [0,1), the probability we will drop out a given DNN coordinate.
    :param dnn_activation: Activation function to use in DNN
    :param dnn_use_bn: bool. Whether use BatchNormalization before activation or not in DNN
    :param task_types: list of str, indicating the loss of each task, ``"binary"`` for binary logloss, ``"regression"`` for regression loss. e.g. ['binary', 'regression']
    :param task_names: list of str, indicating the prediction target of each task
    :return: a Keras model instance
    """
    num_tasks = len(task_names)
    if num_tasks <= 1:
        raise ValueError("num_tasks must be greater than 1")
    if num_experts <= 1:
        raise ValueError("num_experts must be greater than 1")

    if len(task_types) != num_tasks:
        raise ValueError("num_tasks must be equal to the length of task_types")

    for task_type in task_types:
        if task_type not in ['binary', 'regression']:
            raise ValueError("task must be binary or regression, {} is illegal".format(task_type))

    features = build_input_features(dnn_feature_columns)

    inputs_list = list(features.values())

    sparse_embedding_list, dense_value_list = input_from_feature_columns(features, dnn_feature_columns,
                                                                         l2_reg_embedding, seed)
    dnn_input = combined_dnn_input(sparse_embedding_list, dense_value_list)

    # build expert layer
    expert_outs = []
    for i in range(num_experts):
        expert_network = DNN(expert_dnn_hidden_units, dnn_activation, l2_reg_dnn, dnn_dropout, dnn_use_bn, seed=seed,
                             name='expert_' + str(i))(dnn_input)
        expert_outs.append(expert_network)

    expert_concat = tf.keras.layers.Lambda(lambda x: tf.stack(x, axis=1))(expert_outs)  # None,num_experts,dim

    mmoe_outs = []
    for i in range(num_tasks):  # one mmoe layer: nums_tasks = num_gates
        # build gate layers
        gate_input = DNN(gate_dnn_hidden_units, dnn_activation, l2_reg_dnn, dnn_dropout, dnn_use_bn, seed=seed,
                         name='gate_' + task_names[i])(dnn_input)
        gate_out = tf.keras.layers.Dense(num_experts, use_bias=False, activation='softmax',
                                         name='gate_softmax_' + task_names[i])(gate_input)
        gate_out = tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1))(gate_out)

        # gate multiply the expert
        gate_mul_expert = tf.keras.layers.Lambda(lambda x: reduce_sum(x[0] * x[1], axis=1, keep_dims=False),
                                                 name='gate_mul_expert_' + task_names[i])([expert_concat, gate_out])
        mmoe_outs.append(gate_mul_expert)

    task_outs = []
    for task_type, task_name, mmoe_out in zip(task_types, task_names, mmoe_outs):
        # build tower layer
        tower_output = DNN(tower_dnn_hidden_units, dnn_activation, l2_reg_dnn, dnn_dropout, dnn_use_bn, seed=seed,
                           name='tower_' + task_name)(mmoe_out)

        logit = tf.keras.layers.Dense(1, use_bias=False, activation=None)(tower_output)
        output = PredictionLayer(task_type, name=task_name)(logit)
        task_outs.append(output)

    model = tf.keras.models.Model(inputs=inputs_list, outputs=task_outs)
    return model
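Before wiring the model into the Spark pipeline, a quick smoke test helps confirm the architecture builds. This is a minimal sketch with made-up feature columns (user_id, item_id, and watch_time are placeholders, not the real features):

# Smoke test with hypothetical feature columns; not part of the production pipeline.
dummy_feature_columns = [
    SparseFeat('user_id', vocabulary_size=1000, embedding_dim=16),   # hypothetical sparse id
    SparseFeat('item_id', vocabulary_size=5000, embedding_dim=16),   # hypothetical sparse id
    DenseFeat('watch_time', 1),                                      # hypothetical dense feature
]
model = MMOE(dummy_feature_columns)
model.compile("adagrad", loss='binary_crossentropy')
model.summary()  # three output heads: click, like, comment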

PySpark training and model saving:

import pyspark
from sklearn.metrics import roc_auc_score

"""
	获取SparkSession
"""
def get_spark_session(app_name=""):
	spark_session = pyspark.sql.SparkSession.builder \
		.config('spark.driver.extraClassPath', '') \
		.config('spark.sql.parquet.compression.codec', 'none') \
		.config('spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation', 'true') \
		.config("spark.driver.memory", '8g') \
		.config("spark.executor.memory", '8g') \
		.config("spark.executor.cores", '4') \
		.config("spark.executor.instances", '40') \
		.config("spark.speculation", 'true') \
		.config("spark.kryoserializer.buffer.max", "2000m") \
		.config('spark.ui.showConsoleProgress', 'false') \
		.master("local[*]") \
		.appName(app_name) \
		.enableHiveSupport() \
		.getOrCreate()
	return spark_session

if __name__ == "__main__":
    epochs = 1
    batch_size = 128
    embedding_dim = 16
    target = ["click", "like", "comment"]
    sparse_features = []  # sparse feature names omitted for privacy

    dense_features = []   # dense feature names omitted for privacy

    print("1. Load data")
    spark_session = get_spark_session()
    sql_train = 'SELECT * from ${t1}'
    df = spark_session.sql(sql_train)

    df.persist()        # persist the DataFrame in memory
    df.printSchema()    # show column types
    print(df.columns)   # show column names

    # for index, value in enumerate(df.columns):
    #     print(index, value)

    df = df.toPandas()

    data = df[df['flag'] == 'train'][sparse_features + dense_features + target]
    val = df[df['flag'] == 'val'][sparse_features + dense_features + target]

    # 1. Fill NaNs in the dense features and apply a simple log transformation
    data[dense_features] = data[dense_features].fillna(0)
    val[dense_features] = val[dense_features].fillna(0)

    data[dense_features] = np.log(data[dense_features] + 1.0)
    val[dense_features] = np.log(val[dense_features] + 1.0)

    print('data.shape', data.shape)
    print('data.columns', data.columns.tolist())
    
    # 2. Count #unique values for each sparse field and record the dense feature field names.
    # Use the full DataFrame so the vocabulary also covers ids that only appear in the validation split.
    fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].max() + 1, embedding_dim=embedding_dim)
                              for feat in sparse_features] + [DenseFeat(feat, 1) for feat in dense_features]

    # fixlen_feature_columns = [DenseFeat(feat, 1) for feat in dense_features]

    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(dnn_feature_columns)

    # 3. Generate input data for the model
    train_model_input = {name: data[name] for name in feature_names}
    val_model_input = {name: val[name] for name in feature_names}

    # userid_list = val[''].astype(str).tolist()
    # test_model_input = {name: test[name] for name in feature_names}

    train_labels = [data[y].values for y in target]
    val_labels = [val[y].values for y in target]

    # 4. Define the model, then train, predict and evaluate
    train_model = MMOE(dnn_feature_columns)
    train_model.compile("adagrad", loss='binary_crossentropy')

    # print(train_model.summary())
    for epoch in range(epochs):
        history = train_model.fit(train_model_input, train_labels,
                                  batch_size=batch_size, epochs=1, verbose=1)

        # predict() returns one array per task, in the order of task_names: click, like, comment
        val_pred_ans = train_model.predict(val_model_input, batch_size=batch_size * 4)
        validation_click_roc_auc = roc_auc_score(val['click'], val_pred_ans[0])
        validation_like_roc_auc = roc_auc_score(val['like'], val_pred_ans[1])
        validation_comment_roc_auc = roc_auc_score(val['comment'], val_pred_ans[2])
        print('epoch %d val AUC (click/like/comment): %.4f %.4f %.4f'
              % (epoch, validation_click_roc_auc, validation_like_roc_auc, validation_comment_roc_auc))

    train_model.save('${MODEL_HOME}/....')

    # t1 = time()
    # pred_ans = train_model.predict(test_model_input, batch_size=batch_size * 20)
    # t2 = time()
    # print('Prediction time for %d samples across all targets (ms): %.3f' % (len(test), (t2 - t1) * 1000.0))
    # ts = (t2 - t1) * 1000.0 / len(test) * 2000.0
    # print('Average prediction time per 2000 samples across all targets (ms): %.3f' % ts)
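If the scores need to land back in the warehouse, one possible follow-up is to wrap the per-task predictions in a Spark DataFrame and write them to a Hive table. This is only a sketch: it assumes val_pred_ans and spark_session from the script above are still in scope, and the table name tmp.mmoe_val_pred is hypothetical.

# Collect the per-task scores into a pandas DataFrame, then hand it to Spark.
pred_df = pd.DataFrame({
    'click_score': val_pred_ans[0].ravel(),
    'like_score': val_pred_ans[1].ravel(),
    'comment_score': val_pred_ans[2].ravel(),
})
spark_pred_df = spark_session.createDataFrame(pred_df)
spark_pred_df.write.mode('overwrite').saveAsTable('tmp.mmoe_val_pred')  # hypothetical table name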

Model loading and prediction:

from tensorflow.keras.models import load_model
from deepctr.layers import custom_objects

val_model_input = {name: val[name] for name in feature_names}

# Load the saved model and predict. DeepCTR's custom layers must be passed via custom_objects,
# otherwise load_model cannot deserialize the saved graph.
test_model = load_model('', custom_objects)  # path omitted; point this at the model saved above
# test_model.summary()

# print(val_model_input)

pred_ans = test_model.predict(val_model_input)
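Because the model has one output head per task, predict() returns a list of arrays in the order of task_names. A small sketch for mapping them back to task names (the variable names here are illustrative only):

# predict() yields one array per task, ordered like task_names = ('click', 'like', 'comment').
task_names = ('click', 'like', 'comment')
scores = {name: pred_ans[i].ravel() for i, name in enumerate(task_names)}
print({name: s[:5] for name, s in scores.items()})  # peek at the first few scores per task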

