iQIYI User Retention Prediction (CSDN blog)

Article link: https://blog.csdn.net/weixin_45526335/article/details/122732559

iQIYI User Behavior Sequence Modeling

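Contents

User Behavior Sequence Modeling
Preface
I. Competition Background
II. Feature Extraction
- 1. User behavior sequence feature extraction
- 2. User attribute feature extraction
III. Modeling
IV. Summary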

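Preface

This is a user retention prediction competition organized by iQIYI. The task is to predict on how many of the next seven days a user will launch the app. It can be framed either as multi-class classification or as regression followed by threshold post-processing.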
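To illustrate the regression-plus-threshold option, here is a minimal sketch. The function name `postprocess` and the cut-offs `low` and `high` are hypothetical, not taken from the competition code; in practice such thresholds would be tuned on a validation set.

```python
def postprocess(pred, low=0.5, high=6.5):
    """Map a raw regression output to a valid number of days in [0, 7].

    `low` and `high` are illustrative thresholds (assumptions): predictions
    below `low` snap to 0, predictions above `high` snap to 7, and anything
    in between is kept unchanged after clipping.
    """
    pred = min(max(pred, 0.0), 7.0)  # a user cannot be active on <0 or >7 days
    if pred < low:
        return 0.0
    if pred > high:
        return 7.0
    return pred


raw_predictions = [-0.3, 0.2, 3.4, 6.8, 9.1]
cleaned = [postprocess(p) for p in raw_predictions]
print(cleaned)
```

Snapping only the extremes keeps the regression output intact for users with intermediate activity, where rounding could hurt an absolute-error metric.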

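I. Competition Background

The training set has 600,000 samples spread over several tables, covering user attributes, launch times, playback duration, and other fields. The leaderboard-A test set has 15,000 samples and provides only the user id and the time point to predict from. Participants have to construct the label y themselves.
Official site: http://challenge.ai.iqiyi.com/detail?raceId=61600f6cef1b65639cd5eaa6
Competition code: https://github.com/Actor12/aiqiyi-userremain
Dataset download: https://pan.baidu.com/s/1ZIlbWZATcQviutyjAS-jWQ (extraction code: pwk8)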
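Since the label is self-defined, one plausible construction is to count the distinct days with a launch in the seven days after the cut-off. The sketch below assumes dates are integer day indices and that the window is (end_date, end_date + 7]; `make_label` is a hypothetical helper, not the author's code.

```python
def make_label(launch_dates, end_date, horizon=7):
    """Count the distinct days in (end_date, end_date + horizon] with a launch.

    Assumes dates are integer day indices; `horizon=7` matches the
    seven-day prediction window of the task.
    """
    future_days = {d for d in launch_dates if end_date < d <= end_date + horizon}
    return len(future_days)


# A user who launched on days 98, 100, 101 (twice), 104 and 110.
dates = [98, 100, 101, 101, 104, 110]
label = make_label(dates, end_date=100)
print(label)
```

Using a set means several launches on the same day count once, which matches a target measured in days rather than in launches.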

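II. Feature Extraction

1. User behavior sequence feature extraction

The launch records must first be sorted by time, so that after grouping and extracting the launch dates and types, each user's sequence is in chronological order, which helps the downstream model extract information. The code below extracts each user's launch-date and launch-type sequences. In the end, the launch-type sequence was fed directly into a GRU for training. I also tried w2v (word2vec) to learn embeddings of the launch-date sequence, but training was too slow and the attempt did not work out.
Code (example):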

import pandas as pd

# Build the per-user sequences. `launch` is the launch-log table with the
# columns user_id, date and launch_type. Sort first so that every sequence
# comes out in chronological order.
launch = launch.sort_values(['user_id', 'date'])

user_id = []
launch_date_seq = []
launch_type_seq = []
for uid, grp in launch.groupby('user_id'):
    user_id.append(uid)
    # Keep real lists rather than their str() form, so that len() and sum()
    # in the statistics code further down operate on the elements.
    launch_date_seq.append(list(grp['date']))
    launch_type_seq.append(list(grp['launch_type']))

launch_grp = pd.DataFrame({
    'user_id': user_id,
    'launch_date': launch_date_seq,
    'launch_type': launch_type_seq,
})
launch_grp.head()

The two extracted sequences look like this:
[Image: the extracted launch_date and launch_type sequences per user]

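2. User attribute feature extraction

This part is routine feature engineering: group-by aggregations, target encoding, logical crosses, length statistics, and similar features (number of launch types per user, sequence length, playback duration over the last 30, 15, and 7 days, and so on). When building statistical features, take care to avoid feature leakage: first cut each sequence down to the part up to end_date and use only that as training data. Code (example):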

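def get_train_launch_date(row):
    """Return the prefix of row.launch_date that is not later than row.end_date.

    Relies on row.launch_date being sorted in ascending order: the scan stops
    at the first date after end_date.
    """
    count = 0
    launch_date_list = row.launch_date
    for i in launch_date_list:
        if row.end_date >= i:
            count += 1
        else:
            break

    return launch_date_list[:count]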
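Because the sequence is sorted, the same cut can be found with a binary search, and the aligned launch-type sequence can be truncated with the same index. This is a sketch under the assumption that both lists are aligned element by element; `truncate_at` is a hypothetical name.

```python
from bisect import bisect_right


def truncate_at(launch_dates, launch_types, end_date):
    """Keep only the events whose date is <= end_date.

    Assumes `launch_dates` is sorted ascending and `launch_types` is aligned
    with it element by element.
    """
    cut = bisect_right(launch_dates, end_date)  # index of the first date > end_date
    return launch_dates[:cut], launch_types[:cut]


dates = [95, 97, 100, 103, 106]
types = [0, 1, 0, 0, 1]
train_dates, train_types = truncate_at(dates, types, end_date=100)
print(train_dates, train_types)
```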

The statistical features for the training set are then computed on the truncated sequences, so nothing after end_date leaks into them. Code:

# Launch statistics. Use only the sequence up to end_date, otherwise the
# features leak future information; the truncation above takes care of that.
launch_grp['launch_times'] = [len(v) for v in launch_grp.launch_date.values]
# Number of launches of type 0 and of type 1, and the share of type 1
launch_grp['launch_type_0'] = [len(v)-sum(v) for v in launch_grp.launch_type.values]
launch_grp['launch_type_1'] = [sum(v) for v in launch_grp.launch_type.values]
launch_grp['launch_type_01rate'] = [sum(v)/len(v) if len(v)>0 else 0 for v in launch_grp.launch_type.values]
# Span between the first and the last launch
launch_grp['start_end_launch'] = [max(v)-min(v) if len(v)>0 else 0 for v in launch_grp.launch_date.values]

# Length of the launch_date sequence
launch_grp['launch_date_len'] = [len(v) for v in launch_grp.launch_date]

launch_grp.head()
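The feature list used later in this post also contains windowed counts such as `launch_times_31`, `launch_times_15`, and `launch_times_7`, whose computation is not shown. A minimal sketch of one way to derive them, assuming integer day indices and a window that ends at end_date inclusive (`count_in_window` is a hypothetical helper):

```python
def count_in_window(launch_dates, end_date, window):
    """Number of launches in the `window` days ending at end_date (inclusive).

    Assumes integer day indices, so the window covers the days
    end_date - window + 1 through end_date.
    """
    start = end_date - window + 1
    return sum(1 for d in launch_dates if start <= d <= end_date)


dates = [60, 88, 95, 97, 100]
end_date = 100
features = {
    f"launch_times_{w}": count_in_window(dates, end_date, w) for w in (31, 15, 7)
}
print(features)
```

The upper bound `d <= end_date` keeps the feature leak-free even if it is applied to a sequence that has not been truncated yet.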

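III. Modeling

The model inputs fall into two groups: behavior sequence features and statistical features such as user attributes. Every behavior sequence is truncated to the most recent month of launches (I also tried adding sequences for the last 15, 7, and 3 days). Each of the user's sequences goes through its own GRU, and the attribute and statistical features go through a small dense (DNN) branch. The branch outputs are then concatenated and passed to a ReLU output layer (the task is treated as regression, so there is no softmax).
The data is read in as follows: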

import math

import numpy as np
from tensorflow.keras.utils import Sequence


# Batch iterator: each element is one step holding `batch_size` samples.
# Reference: https://blog.csdn.net/weixin_37737254/article/details/103884255
class DataGenerator(Sequence):
    def __init__(self, df, batch_size):
        super().__init__()
        self.data = df
        self.num = df.shape[0]
        self.batch_size = batch_size
        # The best result so far uses only these first 18 features.
        # Commented out: 'launch_date_len_target_enc', 'start_end_launch',
        # 'launch_date_len', 'launch_type_0', 'launch_type_1'
        self.fea = ['father_id_score', 'cast_id_score', 'tag_score',
       'device_type', 'device_ram', 'device_rom', 'sex', 'age', 'education',
       'occupation_status', 'territory_score','launch_times', 
       'launch_times_31', 'launch_times_15', 'launch_times_7', 'playtime_31',
       'playtime_15', 'playtime_7']

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(self.num / self.batch_size)

    def __getitem__(self,idx):
        batch_data = self.data.iloc[idx*self.batch_size:(idx+1)*self.batch_size]

        input_1 = np.array([i for i in batch_data.launch_seq_31])
        input_2 = np.array([i for i in batch_data.playtime_seq])
        input_3 = np.array([i for i in batch_data.duration_prefer])
        input_4 = np.array([i for i in batch_data.interact_prefer])
        input_5 = np.array(batch_data[self.fea])
        # The sequence features above must be stored in the nested form [[], [], []]
        
        output = np.array(batch_data.label)

        return (input_1, input_2, input_3, input_4, input_5), output
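The nested-list shape mentioned in the comment matters because a GRU input has the shape (timesteps, features). How `launch_seq_31` was built is not shown in the post; the sketch below shows one assumed encoding, a daily 0/1 launch indicator over a fixed window, with a hypothetical helper `launch_indicator_seq`.

```python
def launch_indicator_seq(launch_dates, end_date, seq_len=32):
    """Build a fixed-length 0/1 sequence shaped (seq_len, 1).

    Position k corresponds to day end_date - seq_len + 1 + k; the value is
    1 if the user launched the app on that day, else 0.
    """
    active = set(launch_dates)
    first_day = end_date - seq_len + 1
    return [[1 if first_day + k in active else 0] for k in range(seq_len)]


# Short window for readability; the model in this post uses seq_len=32.
seq = launch_indicator_seq([95, 97, 100], end_date=100, seq_len=8)
print(seq)
```

Each inner list is one timestep with a single feature, so a batch of such sequences converts to an array of shape (batch, seq_len, 1).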

The final model structure is as follows:

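import tensorflow as tf


def build_model(seq_len,dur_seq_len,inter_seq_len, feature_num):
    # The input order must match DataGenerator.__getitem__:
    # (launch_seq_31, playtime_seq, duration_prefer, interact_prefer, features).
    # One GRU per behavior sequence.
    input_1 = tf.keras.Input(shape=(seq_len,1))
    output_1 = tf.keras.layers.GRU(32)(input_1)

    input_2 = tf.keras.Input(shape=(seq_len,1))
    output_2 = tf.keras.layers.GRU(32)(input_2)
    
    # NOTE: the generator feeds duration_prefer as the 3rd input and
    # interact_prefer as the 4th; check that their lengths really match
    # inter_seq_len and dur_seq_len respectively.
    input_3 = tf.keras.Input(shape=(inter_seq_len,1))
    output_3 = tf.keras.layers.GRU(11)(input_3)  #11
    
    input_4 = tf.keras.Input(shape=(dur_seq_len,1))
    output_4 = tf.keras.layers.GRU(16)(input_4)  #16
    
    # Dense branch for the attribute and statistical features.
    input_5 = tf.keras.Input(shape=(feature_num, ))
    output_5 = tf.keras.layers.Dense(64, activation="relu")(input_5)

    # Concatenate all branches (a Keras layer instead of a raw tf.concat).
    output = tf.keras.layers.Concatenate(axis=-1)([output_1, output_2,output_3,output_4,output_5])
#     output = tf.keras.layers.Dense(128, activation="relu")(output)
#     dp = tf.keras.layers.Dropout(0.15)(output)  # removing this improved the score by 0.002
    output = tf.keras.layers.Dense(64, activation="relu")(output)
    # Regression head: ReLU keeps the predicted number of days non-negative.
    output = tf.keras.layers.Dense(1, activation="relu")(output)

    model = tf.keras.Model(inputs=[input_1, input_2,input_3, input_4,input_5], outputs=output)

    return model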

Model training:

# `train`, `test` and `model_dir` (the checkpoint path) are defined elsewhere;
# `aiyiqi_metric` is the evaluation function given at the end of this post.
new_test = DataGenerator(test,100)

# Rows from index 594000 onward are held out as the validation set.
new_train = DataGenerator(train.iloc[:594000],100)
new_val = DataGenerator(train.iloc[594000:],100)
        
model = build_model(seq_len=32,dur_seq_len=16,inter_seq_len=11,feature_num=18)
model.summary()

model.compile(optimizer=tf.keras.optimizers.Adam(0.0008),loss="mse",metrics=["mse"])
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_mse", patience=3, restore_best_weights=True)
lr_reduce = tf.keras.callbacks.ReduceLROnPlateau(patience=2,monitor='val_mse', factor=0.1)
best_checkpoint = tf.keras.callbacks.ModelCheckpoint(model_dir,save_best_only=True, save_weights_only=False,verbose=1)
#model.fit(iter(train_bt),steps_per_epoch=len(train_bt),validation_data=iter(val_bt),validation_steps=len(val_bt),epochs=20,callbacks=[best_checkpoint,early_stopping,lr_reduce])
#model.save('./data/model/model_fold{}.h5'.format(kf))
# Model.fit accepts a Sequence directly; fit_generator is deprecated in TF 2.x.
model.fit(new_train,
          steps_per_epoch=len(new_train),
          epochs=20,
          verbose=1,
          validation_data=new_val,
          validation_steps=len(new_val),
#           use_multiprocessing=False,
#           workers=1,
          callbacks=[best_checkpoint,early_stopping,lr_reduce])
    
# Reload the best model of the current fold
best_model = tf.keras.models.load_model(model_dir)
# Inference on the test set
test_pred =  best_model.predict(new_test, steps=len(new_test))[:,0]
 
# Inference on the validation set
val_pred =  best_model.predict(new_val, steps=len(new_val))[:,0]

# Overall score on the validation set
y_true = train.iloc[594000:]['label']
score = aiyiqi_metric(y_true,val_pred)
print('Score: {}'.format(score))

Online evaluation metric:

def aiyiqi_metric(y_true,y_pred):
    y_true = list(y_true)
    y_pred = list(y_pred)
    score = 0
    for i in range(len(y_true)):
        score += abs(y_true[i]-y_pred[i])/7
    return 100*(1-score/len(y_true))
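Written as a formula, the function above computes the following, where $N$ is the number of samples, $y_i$ the true number of active days, and $\hat{y}_i$ the prediction (notation introduced here for illustration):

```latex
\text{score} = 100 \left( 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert y_i - \hat{y}_i \rvert}{7} \right)
```

The score is therefore a rescaled mean absolute error: for example, a mean absolute error of 0.7 days corresponds to a score of 90.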