Kaggle study notes - OTTO baseline 8 - candidate generation + LGBM ranker model

丰。。

Last modified: 2023-01-16 19:32:06


First published: 2023-01-09 14:53:46

Article link: https://blog.csdn.net/CSDNXXCQ/article/details/128613824


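Introduction

We try to develop a two-stage model consisting of a candidate generation model (co-visitation matrices) and a ranking model. This two-stage design is widely used at large technology companies.

Note that the candidate generation model should aim for high recall, while the ranking model should aim for relevance, placing the most relevant items first.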
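
To make the split of responsibilities concrete, here is a minimal sketch of a two-stage recommender on toy data. The names `generate_candidates` and `rank_candidates`, the toy `covisit` table and the `popularity` scores are hypothetical and only illustrate the flow; they are not the notebook's code, and a trained ranker would replace the popularity lookup.

```python
from collections import Counter

# Hypothetical toy co-visitation table: item -> items often seen with it
covisit = {1: [2, 3, 4], 2: [1, 5], 3: [1, 6]}
# Hypothetical relevance scores standing in for a trained ranker
popularity = {1: 0.9, 2: 0.4, 3: 0.7, 4: 0.2, 5: 0.8, 6: 0.1}

def generate_candidates(history, k=5):
    """Stage 1: cast a wide net over related items (aim for recall)."""
    counts = Counter()
    for aid in history:
        counts.update(covisit.get(aid, []))
    return [aid for aid, _ in counts.most_common(k) if aid not in history]

def rank_candidates(candidates, top_n=3):
    """Stage 2: order the candidates by a relevance score, best first."""
    return sorted(candidates, key=lambda aid: popularity[aid], reverse=True)[:top_n]

candidates = generate_candidates([1, 2])
final = rank_candidates(candidates)
```

Stage 1 only decides which items are considered at all; stage 2 decides their order, so a ground-truth item missed in stage 1 cannot be recovered later.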

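Step 1: Model training
Step 1.1 - Load the training data
The training data in this notebook is extracted with the following logic:
train_df = train_df[train_df['session'] % 10 == 1]

The labels for the training data are stored in test_labels.parquet, which contains the labels for both the training and the test sessions (this is only for quick experimentation).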
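
As a small illustration of why the modulo filter gives two non-overlapping samples, the sketch below splits some hypothetical session ids the same way (remainder 1 for training, remainder 0 for testing). The ids are made up.

```python
# Hypothetical session ids
sessions = list(range(100, 130))

# Same rule as the notebook: remainder 1 -> training sample, remainder 0 -> test sample
train_sessions = [s for s in sessions if s % 10 == 1]
test_sessions = [s for s in sessions if s % 10 == 0]

# A session id has exactly one remainder, so the two samples cannot overlap
overlap = set(train_sessions) & set(test_sessions)
```

Each sample contains roughly one tenth of the sessions, which is what keeps the experiment fast.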

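Step 1.2 - Feature engineering for the training data
All features are precomputed and saved as parquet files in Kaggle datasets. The validation data is under /kaggle/input/otto-validation, and the code below reads the precomputed features from /kaggle/input/ottoprecalculatedfeatureparquet.
Here we only do a simple join between the training data and the precomputed features.

Step 1.3 - Train the model on the training data
We train an LGBM ranker on the training data.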

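Step 2: Model inference
Step 2.1 - Load the test data
The test data in this notebook is extracted with the following logic:
test_df = test_df[test_df['session'] % 10 == 0]

Step 2.2 - Generate candidates (based on the Candidate ReRank baseline)
We use the logic from Chris Deotte's Candidate ReRank model to generate 40 potential candidates.
At this stage we generate 40 candidates per session and pass them to the ranker for the final ranking.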

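Step 2.3 - Ranker model
The aids recommended in the candidate generation stage are merged with the aids from the test data before ranking.
The feature preprocessing steps are exactly the same as in the training pipeline.
The test dataframe is passed to the model for prediction, and the scores are merged back into the dataframe.
Each session in the dataframe is sorted by score from high to low.
The top 20 results per session are extracted (the code below uses groupby('session') with limit(20)).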
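
The sort-then-truncate step can be sketched in plain Python as follows. The helper `top_n_per_session` and the toy rows are hypothetical; the notebook itself does this with polars.

```python
from collections import defaultdict

def top_n_per_session(rows, n=20):
    """rows: iterable of (session, aid, score); returns {session: [aid, ...]} best first."""
    by_session = defaultdict(list)
    for session, aid, score in rows:
        by_session[session].append((score, aid))
    return {
        s: [aid for score, aid in sorted(pairs, key=lambda p: p[0], reverse=True)[:n]]
        for s, pairs in by_session.items()
    }

# Toy scored rows: (session, aid, score)
rows = [(1, 10, 0.2), (1, 11, 0.9), (1, 12, 0.5), (2, 20, 0.1), (2, 21, 0.3)]
top2 = top_n_per_session(rows, n=2)
```

The important detail is that the sort is descending, so truncating keeps the highest-scored aids rather than the lowest.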
Step 2.4 - Export to CSV

Step 3: Model evaluation

VER = 6

import pandas as pd, numpy as np
from tqdm.notebook import tqdm
import os, sys, pickle, glob, gc
from collections import Counter
import itertools
import polars as pl
from gensim.models import Word2Vec

# Balance of type weighting 
# 0:clicks 1:carts 2:orders
type_weight = {0:0.5,
               1:9,
               2:0.5}
type_weight_multipliers = type_weight

# Use top X for clicks, carts and orders Top
clicks_th = 15 
carts_th  = 20 
orders_th = 20 

VER = 7

type_labels = {'clicks':0, 'carts':1, 'orders':2}

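Step 1: Model training

Step 1.1 - Generate the training data
We will recommend two types of aids (article ids) to the user:

Aids from the training dataset (the actual aids in the dataset, i.e. the user's real behavior; a user can add an item they clicked to the cart, or click the same item again)
Aids from candidate generation (aids recommended by the candidate generation logic, i.e. items related to the user's actual behavior)
Our ranker should therefore be able to rank both kinds of items together.

We want the training data to mimic what the ranker sees at inference time, when recommended candidates are present. We therefore need to generate some recommended items for the training set as well, so that the training set follows the same distribution as the inference data.

Generate training data from the test set (actual behavior)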
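
A minimal sketch of how the two sources can be tagged so that one ranker sees both. The helper `build_rows` is hypothetical; the column names `real_action`, `CG_ranking` and `ts` follow the notebook, where recommended aids get `ts = 0`, `real_action = 0` and a 0-based candidate rank.

```python
def build_rows(session, history, recommended):
    """history: list of (aid, ts) actually observed; recommended: list of candidate aids."""
    rows = []
    for aid, ts in history:
        # Real behavior: keeps its timestamp, has no candidate-generation rank
        rows.append({'session': session, 'aid': aid, 'ts': ts,
                     'real_action': 1, 'CG_ranking': 0})
    for rank, aid in enumerate(recommended):
        # Recommended aid: no timestamp, rank is its position in the candidate list
        rows.append({'session': session, 'aid': aid, 'ts': 0,
                     'real_action': 0, 'CG_ranking': rank})
    return rows

# Hypothetical session with one observed click and two candidates
rows = build_rows(7, [(100, 1661724000)], [200, 300])
```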

# Getting the actual aid from dataset

def load_train_data_sampled():    
    dfs = []
    for e, chunk_file in enumerate(glob.glob('../input/otto-validation/test_parquet/*')):
        chunk = pd.read_parquet(chunk_file)
        chunk.ts = (chunk.ts/1000).astype('int32')
        chunk['type'] = chunk['type'].map(type_labels).astype('int8')
        dfs.append(chunk)
        
    train_df = pd.concat(dfs).reset_index(drop=True) #.astype({"ts": "datetime64[ms]"})

    # Using different sample as the candidate generation for training
    train_df = train_df[train_df['session']%10 == 1]
    return train_df

train_df = load_train_data_sampled()
print('Sampled Training data has shape',train_df.shape)


# Flag marking that this aid comes from the user's actual behavior
train_df['real_action'] = 1

# CG stands for candidate generation. Since these aids are actual user behavior, they have no candidate ranking
train_df['CG_ranking'] = 0

# Only focus on click actions first
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
train_df_click = train_df[train_df['type'] == 0].copy()

# Previous three aids within each session, used for the embedding features.
# shift() leaves NaN at the start of a session; bfill() fills it with the next available value.
train_df_click['aid_last'] = train_df_click.groupby(['session']).aid.shift(1).bfill()
train_df_click['aid_second_last'] = train_df_click.groupby(['session']).aid.shift(2).bfill()
train_df_click['aid_third_last'] = train_df_click.groupby(['session']).aid.shift(3).bfill()

train_df_click = pl.from_pandas(train_df_click)

train_df_click = train_df_click.with_columns([
    pl.col('aid_last').cast(pl.Int32),
    pl.col('aid_second_last').cast(pl.Int32),
    pl.col('aid_third_last').cast(pl.Int32),
])
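
To show what the per-session `shift(k)` produces, here is a plain-Python sketch for a single session. The helper `previous_aids` is hypothetical; it returns `None` where pandas would return NaN, which the notebook then fills with `bfill()`.

```python
def previous_aids(aids, lag):
    """For each position, the aid seen `lag` steps earlier in the same session (None if absent)."""
    return [aids[i - lag] if i - lag >= 0 else None for i in range(len(aids))]

session_aids = [5, 6, 7, 8]          # hypothetical click sequence of one session
aid_last = previous_aids(session_aids, 1)
aid_second_last = previous_aids(session_aids, 2)
```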

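Generate training data from candidate generation
The code block below follows the same logic we use to generate candidates with the Candidate ReRank model.
See here for the detailed logic: https://www.kaggle.com/code/cdeotte/candidate-rerank-model-lb-0-575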
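
For readers unfamiliar with co-visitation matrices, the sketch below builds a tiny one by counting how often two aids occur in the same session and keeping the top partners per aid. This is a simplified, unweighted illustration with a hypothetical helper `build_covisitation`; the precomputed matrices loaded below may have been built with additional filters or weights.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_covisitation(sessions, top_k=2):
    """sessions: list of aid lists. Returns {aid: [most co-visited aids]}."""
    pair_counts = defaultdict(Counter)
    for aids in sessions:
        # Count every ordered pair of distinct aids that share a session
        for a, b in permutations(set(aids), 2):
            pair_counts[a][b] += 1
    return {a: [b for b, _ in cnt.most_common(top_k)] for a, cnt in pair_counts.items()}

# Hypothetical sessions
sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
top = build_covisitation(sessions)
```

The resulting dictionary has the same shape as `top_20_clicks` below: an aid maps to a short list of aids that tend to appear with it.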

%%time
# Generating recommended aid from the actual aid based on candidate generation
top_clicks = train_df.loc[train_df['type']== 0,'aid'].value_counts().index.values[:20] 


# Improved speed for 2X using polars. 
def pqt_to_dict(path):
    return pl.read_parquet(path).groupby('aid_x').agg(pl.col('aid_y').list()).to_pandas().set_index('aid_x').aid_y.apply(list).to_dict()

DISK_PIECES = 4

# LOAD THREE CO-VISITATION MATRICES
top_20_clicks = pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_20_clicks_v{VER}_0.pqt')

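for k in range(1,DISK_PIECES): 
    # Convert each piece to a dict before merging; passing the raw DataFrame to update() would add its column names as keys
    top_20_clicks.update(pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_20_clicks_v{VER}_{k}.pqt'))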

def suggest_clicks(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(40) if aid2 not in unique_aids]    
    result = unique_aids + top_aids2#[:20 - len(unique_aids)]
    # USE TOP20 TEST CLICKS
    return result + list(top_clicks)#[:20-len(result)]


pred_df_clicks = train_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)

train_df_click_recommended = pl.from_pandas(pd.DataFrame(pred_df_clicks, columns = ['aid']).reset_index()).explode("aid")
train_df_click_recommended = train_df_click_recommended.with_columns([
    pl.lit(0).alias('ts').cast(pl.Int32),    
    pl.lit(0).alias('type').cast(pl.Int8),
    pl.lit(0).alias('real_action').cast(pl.Int64),
    pl.col('session').cast(pl.Int32),
    pl.col('aid').cast(pl.Int32),
])
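# maintain_order=True keeps the groups in row order so the exploded ranks line up with the rows in the horizontal concat below
n_col_after_join = train_df_click_recommended.groupby('session', maintain_order=True).agg([
    pl.col('aid').cumcount().alias('CG_ranking')]).select(
    pl.col('CG_ranking').explode().cast(pl.Int64))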
train_df_click_recommended = pl.concat([train_df_click_recommended, n_col_after_join], how="horizontal")

Step 1.2 - Feature engineering

Compute sparse features

model = Word2Vec.load("/kaggle/input/ottoprecalculatedfeatureparquet/word2vec.model")
embedding_weight = np.load('/kaggle/input/ottoprecalculatedfeatureparquet/word2vec.model.wv.vectors.npy')
embedding_weight_neg = np.load('/kaggle/input/ottoprecalculatedfeatureparquet/word2vec.model.syn1neg.npy')

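# Note: this treats row i of the embedding matrix as the vector of aid i.
# gensim orders the rows by model.wv.index_to_key, so this mapping should be verified against the trained model.
embedding_weigh_dict_df = pl.from_pandas(pd.DataFrame(embedding_weight, columns = ['Embedding_' + str(x) for x in range(32)]).reset_index().rename(columns = {'index':'aid'})).with_columns(pl.col('aid').cast(pl.Int32))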

%%time



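# Join the embeddings of the current aid and of the previous three aids.
# Note: the last join has no how= argument, so it defaults to an inner join and drops rows whose aid_third_last has no match.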
train_df_click = train_df_click.join(embedding_weigh_dict_df, on = 'aid', how = 'left').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_last', how = 'left', suffix = 'last_1').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_second_last', how = 'left', suffix = 'last_2').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_third_last', suffix = 'last_3').select(pl.exclude(['aid_last', 'aid_second_last', 'aid_third_last']))

# Embeddings of the last three actions for the recommended aids: every recommended aid in a session reuses the embeddings attached to that session's last actual click
train_df_click_last_action_embedding = train_df_click.sort(['session', 'ts']).groupby(['session']).last().select(pl.exclude(['aid', 'ts','type','real_action', 'CG_ranking', 'aid_last', 'aid_second_last', 'aid_third_last'] + ['Embedding_' + str(i) for i in range(32)]))

train_df_click_recommended = train_df_click_recommended.join(embedding_weigh_dict_df, on = 'aid', how = 'left').join(train_df_click_last_action_embedding, on = 'session', how = 'left')

Compute dense features

# Creating feature by joining with pre-computed feature
aid_global_counter_all_types = pl.read_parquet('/kaggle/input/ottoprecalculatedfeatureparquet/aid_global_counter_all_types.pqt')

aid_global_user_counter_all_types = pl.read_parquet('/kaggle/input/ottoprecalculatedfeatureparquet/aid_global_user_counter_all_types.pqt')

aid_global_user_counter_all_types_time_weighted = pl.read_parquet('/kaggle/input/ottoprecalculatedfeatureparquet/aid_global_user_counter_all_types_time_weighted (1).pqt')

train_df_click = train_df_click.join(aid_global_counter_all_types, on='aid', suffix ='_global_counter').join(aid_global_user_counter_all_types, on='aid', suffix ='_user_counter').join(aid_global_user_counter_all_types_time_weighted, on='aid', suffix ='_timed_global_counter')
train_df_click_recommended = train_df_click_recommended.join(aid_global_counter_all_types, on='aid', suffix ='_global_counter').join(aid_global_user_counter_all_types, on='aid', suffix ='_user_counter').join(aid_global_user_counter_all_types_time_weighted, on='aid', suffix ='_timed_global_counter')

Merge the two training datasets

# Merging two type of data as training data
train_df_click_all = pl.concat([train_df_click_recommended, train_df_click], how = 'vertical')

Step 1.3 - Model training

# Merge the ground-truth labels into the training dataset
# Randomly keep 50% of the rows to reduce the training size (the sample is drawn over all rows, not only the negatives)

train_labels = pd.read_parquet('../input/otto-validation/test_labels.parquet')
train_labels['type'] = train_labels['type'].map(type_labels).astype('int8')
train_labels = pl.from_pandas(train_labels)
train_labels.head()
train_labels = train_labels.explode('ground_truth').with_columns([pl.col('ground_truth').alias('aid'), pl.lit(1).alias('label')]).with_columns([
    pl.col('ground_truth').cast(pl.Int32),
    pl.col('session').cast(pl.Int32),
    pl.col('aid').cast(pl.Int32),
])

train_df_click_all_sampled =  train_df_click_all.sample(n= int(len(train_df_click_all)*0.5))

train_df_click_all_sampled = train_df_click_all_sampled.join(train_labels, how='left', on=['session', 'type', 'aid']).with_columns(pl.col('label').fill_null(0))
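
The left join followed by `fill_null(0)` amounts to the following labelling rule, sketched here in plain Python with hypothetical data: a row is positive if its (session, type, aid) key appears in the ground truth, and negative otherwise.

```python
# Hypothetical ground truth: (session, type, aid) triples that are positives
positives = {(1, 0, 11), (2, 0, 21)}

rows = [
    {'session': 1, 'type': 0, 'aid': 10},
    {'session': 1, 'type': 0, 'aid': 11},
    {'session': 2, 'type': 0, 'aid': 21},
    {'session': 2, 'type': 0, 'aid': 22},
]

def attach_labels(rows, positives):
    """Left-join style labelling: 1 if the row's key is in the ground truth, else 0."""
    return [
        {**row, 'label': int((row['session'], row['type'], row['aid']) in positives)}
        for row in rows
    ]

labelled = attach_labels(rows, positives)
labels = [r['label'] for r in labelled]
```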

%%time

# Train the model. LGBMRanker seems to train quickly, so more features could be added
from lightgbm.sklearn import LGBMRanker

ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=20,
    importance_type='gain',
)

feature_cols = [
 'ts',
 'real_action',
 'CG_ranking',
 'orders',
 'clicks',
 'carts',
 'carts_user_counter',
 'clicks_user_counter',
 'orders_user_counter',
 'carts_timed_global_counter',
 'orders_timed_global_counter',
 'clicks_timed_global_counter',]

target = 'label'

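def get_session_lenghts(df):
    # maintain_order=True returns the group sizes in the same order as the sessions appear in df
    return df.groupby('session', maintain_order=True).agg([
        pl.col('session').count().alias('session_length')
    ])['session_length'].to_numpy()

# LightGBM expects the rows of each query group (session) to be contiguous and in the same order as `group`,
# so sort by session before computing the group sizes
train_df_click_all_sampled = train_df_click_all_sampled.sort('session')
session_lengths_train = get_session_lenghts(train_df_click_all_sampled)

ranker = ranker.fit(
    train_df_click_all_sampled[feature_cols].to_pandas(),
    train_df_click_all_sampled[target].to_pandas(),
    group=session_lengths_train,
)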
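
The `group` argument is a list of consecutive group sizes, so row i of the feature matrix is assigned to a session purely by position. The sketch below, with a hypothetical helper `group_sizes` and made-up session ids, shows how sizes are derived once the rows are sorted by session.

```python
from itertools import groupby

def group_sizes(sorted_sessions):
    """Sizes of consecutive runs of equal session ids (input must already be grouped)."""
    return [sum(1 for _ in run) for _, run in groupby(sorted_sessions)]

session_col = [3, 1, 3, 2, 1, 3]      # hypothetical, unsorted session column
session_sorted = sorted(session_col)  # [1, 1, 2, 3, 3, 3]
sizes = group_sizes(session_sorted)
```

If the rows were left unsorted, the same sizes would silently slice the data into groups that mix different sessions.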

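Step 2: Model inference

Step 2.1: Load the test data and the precomputed co-visitation matrices

For faster experimentation, we use only 1/10 of the validation test_parquet data.

We use a different set of sessions to simulate the test set.

We extract this separate set of sessions with test_df[test_df['session'] % 10 == 0].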

def load_test():    
    dfs = []
    for e, chunk_file in enumerate(glob.glob('../input/otto-validation/test_parquet/*')):
        chunk = pd.read_parquet(chunk_file)
        chunk.ts = (chunk.ts/1000).astype('int32')
        chunk['type'] = chunk['type'].map(type_labels).astype('int8')
        dfs.append(chunk)
    return pd.concat(dfs).reset_index(drop=True) #.astype({"ts": "datetime64[ms]"})

test_df = load_test()
test_df = test_df[test_df['session']%10 == 0]
print('Test data has shape',test_df.shape)
test_df.head()

%%time
# Improved speed for 2X using polars. 
def pqt_to_dict(path):
    return pl.read_parquet(path).groupby('aid_x').agg(pl.col('aid_y').list()).to_pandas().set_index('aid_x').aid_y.apply(list).to_dict()

DISK_PIECES = 4

# LOAD THREE CO-VISITATION MATRICES
top_20_clicks = pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_20_clicks_v{VER}_0.pqt')

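for k in range(1,DISK_PIECES): 
    # Convert each piece to a dict before merging; passing the raw DataFrame to update() would add its column names as keys
    top_20_clicks.update(pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_20_clicks_v{VER}_{k}.pqt'))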


top_20_buys = pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_15_carts_orders_v{VER}_0.pqt') 

for k in range(1,DISK_PIECES): 
    top_20_buys.update( pqt_to_dict( f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_15_carts_orders_v{VER}_{k}.pqt') )

top_20_buy2buy = pqt_to_dict(f'/kaggle/input/otto-covisitation-matrix-parquet-files/top_15_buy2buy_v{VER}_0.pqt') 

print('Here are size of our 3 co-visitation matrices:')
print( len( top_20_clicks ), len( top_20_buy2buy ), len( top_20_buys ) )

Step 2.2 Generate candidates with the ReRank model

top_clicks = test_df.loc[test_df['type']== 0,'aid'].value_counts().index.values[:20] 
top_carts = test_df.loc[test_df['type']== 1,'aid'].value_counts().index.values[:20]
top_orders = test_df.loc[test_df['type']== 2,'aid'].value_counts().index.values[:20]

def suggest_clicks(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(20) if aid2 not in unique_aids]    
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # USE TOP20 TEST CLICKS
    return result + list(top_clicks)[:20-len(result)]
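
The reranking branch for long sessions combines a recency weight with a per-type weight. The sketch below reproduces that scoring in plain Python. `recency_weights` is a hypothetical helper that mirrors `np.logspace(0.1, 1, n, base=2) - 1`, and `type_weight` copies the values defined at the top of the notebook.

```python
from collections import Counter

type_weight = {0: 0.5, 1: 9, 2: 0.5}   # 0: clicks, 1: carts, 2: orders

def recency_weights(n, start=0.1, stop=1.0, base=2.0):
    """base ** (evenly spaced exponents from start to stop) - 1, so later events weigh more."""
    if n == 1:
        return [base ** start - 1]
    step = (stop - start) / (n - 1)
    return [base ** (start + i * step) - 1 for i in range(n)]

def score_history(aids, types):
    """Sum recency weight * type weight for every occurrence of an aid."""
    scores = Counter()
    for aid, w, t in zip(aids, recency_weights(len(aids)), types):
        scores[aid] += w * type_weight[t]
    return scores

# Hypothetical session: click on 10, cart of 20, click on 10 again
weights = recency_weights(3)
scores = score_history([10, 20, 10], [0, 1, 0])
```

With these weights a single cart event outweighs two clicks on another item, which reflects the heavy cart weight of 9 in `type_weight`.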

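I have rewritten the function with the following changes:

The number of recommended aids is increased from 20 to 40. If we only recommended 20, the ranker could not actually improve performance, because the leaderboard recall score is computed regardless of order.
So here we recommend more candidates to increase candidate recall (i.e. the share of ground-truth aids covered by the recommended candidates).

Aids from the test set are not recommended. They are handled separately, because they have ts information and should not carry the CG_ranking (candidate generation ranking) feature.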
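
The argument for 40 candidates can be checked with a small example. The helper `recall_at_k` and the toy aids are hypothetical. Reordering a fixed set of 20 items leaves recall unchanged, whereas choosing the best 20 out of 40 can change it.

```python
def recall_at_k(predicted, ground_truth, k=20):
    """Share of ground-truth aids found in the first k predictions, ignoring their order."""
    hits = len(set(predicted[:k]) & set(ground_truth))
    return hits / min(len(ground_truth), k)

truth = [35]
candidates = list(range(40))            # 40 hypothetical candidates; aid 35 sits at position 35
shuffled_top20 = candidates[:20][::-1]  # the same first 20 items in a different order
reranked = [35] + [c for c in candidates if c != 35]  # a ranker that moves aid 35 to the top

r_first20 = recall_at_k(candidates, truth)
r_shuffled = recall_at_k(shuffled_top20, truth)
r_reranked = recall_at_k(reranked, truth)
```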

top_clicks = test_df.loc[test_df['type']== 0,'aid'].value_counts().index.values[:40] 
def suggest_clicks_40_candidates(df):
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=40:
        weights=np.logspace(0.1,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k,v in aids_temp.most_common(40)]
        return sorted_aids
    # USE "CLICKS" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(40) if aid2 not in unique_aids]    
    result = top_aids2[:40]
    # USE TOP40 TEST CLICKS
    return result + list(top_clicks)[:40-len(result)]

def suggest_carts(df):
    # User history aids and types
    aids = df.aid.tolist()
    types = df.type.tolist()
    
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    df = df.loc[(df['type'] == 0)|(df['type'] == 1)]
    unique_buys = list(dict.fromkeys(df.aid.tolist()[::-1]))
    
    # Rerank candidates using weights
    if len(unique_aids) >= 20:
        weights=np.logspace(0.5,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        
        # Rerank based on repeat items and types of items
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        
        # Rerank candidates using"top_20_carts" co-visitation matrix
        aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_buys if aid in top_20_buys]))
        for aid in aids2: aids_temp[aid] += 0.1
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    
    # Use "cart order" and "clicks" co-visitation matrices
    aids1 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    aids2 = list(itertools.chain(*[top_20_buys[aid]*2 for aid in unique_aids if aid in top_20_buys]))
    
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids1+aids2).most_common(20) if aid2 not in unique_aids] 
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    
    # USE TOP20 TEST CARTS

    return result + list(top_carts)[:20-len(result)]

def suggest_buys(df):
    
    # USER HISTORY AIDS AND TYPES
    aids=df.aid.tolist()
    types = df.type.tolist()
    # UNIQUE AIDS AND UNIQUE BUYS
    unique_aids = list(dict.fromkeys(aids[::-1] ))
    df = df.loc[(df['type']==1)|(df['type']==2)]
    unique_buys = list(dict.fromkeys( df.aid.tolist()[::-1] ))
    # RERANK CANDIDATES USING WEIGHTS
    if len(unique_aids)>=20:
        weights=np.logspace(0.5,1,len(aids),base=2, endpoint=True)-1
        aids_temp = Counter() 
        # RERANK BASED ON REPEAT ITEMS AND TYPE OF ITEMS
        for aid,w,t in zip(aids,weights,types): 
            aids_temp[aid] += w * type_weight_multipliers[t]
        # RERANK CANDIDATES USING "BUY2BUY" CO-VISITATION MATRIX
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        for aid in aids3: aids_temp[aid] += 0.1
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # USE "CART ORDER" CO-VISITATION MATRIX
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # USE "BUY2BUY" CO-VISITATION MATRIX
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # RERANK CANDIDATES
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2+aids3).most_common(20) if aid2 not in unique_aids] 
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # USE TOP20 TEST ORDERS
    return result + list(top_orders)[:20-len(result)]

Generate candidates for each action type using the co-visitation matrices.

# Using pandarallel to accelerate
from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 4,progress_bar=True)

%%time
# Improved speed for 2X using pandarallel
# pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).parallel_apply(
#     lambda x: suggest_clicks(x)
# )

# Improved speed for 2X using pandarallel
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).parallel_apply(
    lambda x: suggest_clicks_40_candidates(x)
)

pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).parallel_apply(
    lambda x: suggest_buys(x)
)

pred_df_carts = test_df.sort_values(["session", "ts"]).groupby(["session"]).parallel_apply(
    lambda x: suggest_carts(x)
)

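Step 2.3 Rank the candidates with the LGBM ranker

To run the ranker, we need to combine two data sources:

Aids provided by the test set
Aids recommended in the candidate generation stage
These two kinds of aids should be processed separately because some of their features differ (for example, aids from candidate generation have no ts (timestamp) information).

Process the aids from the test set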

# Extract the click actions from the test data
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
test_df_click = test_df[test_df['type'] == 0].copy()
test_df_click['real_action'] = 1
test_df_click['CG_ranking'] = 0

# Calculate the last three aid for the embedding calculation
test_df_click['aid_last'] = test_df_click.groupby(['session']).aid.shift(1).bfill()
test_df_click['aid_second_last'] = test_df_click.groupby(['session']).aid.shift(2).bfill()
test_df_click['aid_third_last'] = test_df_click.groupby(['session']).aid.shift(3).bfill()

test_df_click = pl.from_pandas(test_df_click)

test_df_click = test_df_click.with_columns([
    pl.col('aid_last').cast(pl.Int32),
    pl.col('aid_second_last').cast(pl.Int32),
    pl.col('aid_third_last').cast(pl.Int32),
])

Process the aids recommended by candidate generation

# Extract the candidates generated from the test data
clicks_candidate_df = pl.from_pandas(pd.DataFrame(pred_df_clicks, columns = ['aid']).reset_index())
clicks_candidate_df = clicks_candidate_df.explode('aid')
clicks_candidate_df = clicks_candidate_df.with_columns([
    pl.lit(0).alias('ts').cast(pl.Int32),    
    pl.lit(0).alias('type').cast(pl.Int8),
    pl.lit(0).alias('real_action').cast(pl.Int64),
    pl.col('session').cast(pl.Int32),
    pl.col('aid').cast(pl.Int32),
])
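# maintain_order=True keeps the groups in row order so the exploded ranks line up with the rows in the horizontal concat below
n_col_after_join = clicks_candidate_df.groupby('session', maintain_order=True).agg([
    pl.col('aid').cumcount().alias('CG_ranking')]).select(
    pl.col('CG_ranking').explode().cast(pl.Int64))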

test_df_click_recommended = pl.concat([clicks_candidate_df, n_col_after_join], how="horizontal")

Feature computation for the inference data

%%time

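# Join the embeddings of the current aid and of the previous three aids.
# Note: the last join has no how= argument, so it defaults to an inner join and drops rows whose aid_third_last has no match.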
test_df_click = test_df_click.join(embedding_weigh_dict_df, on = 'aid', how = 'left').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_last', how = 'left', suffix = 'last_1').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_second_last', how = 'left', suffix = 'last_2').join(embedding_weigh_dict_df, right_on = 'aid', left_on = 'aid_third_last', suffix = 'last_3').select(pl.exclude(['aid_last', 'aid_second_last', 'aid_third_last']))

# Embeddings of the last three actions for the recommended aids: every recommended aid in a session reuses the embeddings attached to that session's last actual click
train_df_click_last_action_embedding = test_df_click.sort(['session', 'ts']).groupby(['session']).last().select(pl.exclude(['aid', 'ts','type','real_action', 'CG_ranking', 'aid_last', 'aid_second_last', 'aid_third_last'] + ['Embedding_' + str(i) for i in range(32)]))

test_df_click_recommended = test_df_click_recommended.join(embedding_weigh_dict_df, on = 'aid', how = 'left').join(train_df_click_last_action_embedding, on = 'session', how = 'left')

# Calculating the feature on the inference data
test_df_click = test_df_click.join(aid_global_counter_all_types, on='aid', suffix ='_global_counter').join(aid_global_user_counter_all_types, on='aid', suffix ='_user_counter').join(aid_global_user_counter_all_types_time_weighted, on='aid', suffix ='_timed_global_counter')
test_df_click_recommended = test_df_click_recommended.join(aid_global_counter_all_types, on='aid', suffix ='_global_counter').join(aid_global_user_counter_all_types, on='aid', suffix ='_user_counter').join(aid_global_user_counter_all_types_time_weighted, on='aid', suffix ='_timed_global_counter')

# Combining the actual test data aid and recommended aid
test_df_click_all = pl.concat([test_df_click, test_df_click_recommended], how = 'vertical')

# Model inference
scores = ranker.predict(test_df_click_all[feature_cols].to_pandas())

# Appending the model score to the original dataframe
test_df_click_all = test_df_click_all.with_columns(pl.Series(name='score', values=scores))

# Getting the top 20 candidates from the prediction
clicks_pred_df = test_df_click_all.sort(['session', 'score'], reverse=True).groupby('session').agg([
    pl.col('aid').limit(20).list().alias('labels')
])

# Converting to pandas format and making it align with result format
clicks_pred_df = clicks_pred_df.with_columns(
    # session is an integer column, so cast it to a string before appending the suffix
    pl.col('session').cast(pl.Utf8) + '_clicks'
).to_pandas()

Step 2.4 Export to CSV

# clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_carts.add_suffix("_carts"), columns=["labels"]).reset_index()

pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str,x)))
pred_df.to_csv("validation_preds.csv", index=False)
pred_df.head()
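
The exported file has one row per session and action type, with the predicted aids joined by spaces. A minimal sketch of that formatting, using a hypothetical helper `to_submission_rows` and made-up ids:

```python
def to_submission_rows(preds, action):
    """preds: {session: [aid, ...]} -> list of (session_type, labels) tuples."""
    return [
        (f"{session}_{action}", " ".join(map(str, aids)))
        for session, aids in preds.items()
    ]

# Hypothetical predictions for two sessions
rows = to_submission_rows({12899779: [59625, 1253524], 12899780: [1142000]}, "clicks")
```

The evaluation code in the next step parses these strings back by splitting `session_type` on `_` and `labels` on spaces.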

Step 3: Model evaluation

# # FREE MEMORY
# del pred_df_clicks, pred_df_buys, clicks_pred_df, orders_pred_df, carts_pred_df
# del top_20_clicks, top_20_buy2buy, top_20_buys, top_clicks, top_orders, test_df
# _ = gc.collect()

%%time
# COMPUTE METRIC
score = 0
weights = {'clicks': 0.10, 'carts': 0.30, 'orders': 0.60}
for t in ['clicks','carts','orders']:
    sub = pred_df.loc[pred_df.session_type.str.contains(t)].copy()
    sub['session'] = sub.session_type.apply(lambda x: int(x.split('_')[0]))
    sub.labels = sub.labels.apply(lambda x: [int(i) for i in x.split(' ')])
    test_labels = pd.read_parquet('../input/otto-validation/test_labels.parquet')
    test_labels = test_labels.loc[test_labels['type']==t]
    test_labels = test_labels.merge(sub, how='left', on=['session'])
    test_labels = test_labels.dropna()
    test_labels['hits'] = test_labels.apply(lambda df: len(set(df.ground_truth).intersection(set(df.labels))), axis=1)
    test_labels['gt_count'] = test_labels.ground_truth.str.len().clip(0,20)
    recall = test_labels['hits'].sum() / test_labels['gt_count'].sum()
    score += weights[t]*recall
    print(f'{t} recall =',recall)
    
print('=============')
print('Overall Recall =',score)
print('=============')
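
The metric computed above is a weighted sum of per-type recall at 20. The sketch below restates it on toy data in plain Python. The helpers `type_recall` and `overall_score` are hypothetical, and the weights are the ones used in the loop above.

```python
WEIGHTS = {'clicks': 0.10, 'carts': 0.30, 'orders': 0.60}

def type_recall(pairs, k=20):
    """pairs: list of (predicted_aids, ground_truth_aids) for one action type."""
    hits = sum(len(set(pred[:k]) & set(gt)) for pred, gt in pairs)
    total = sum(min(len(gt), k) for pred, gt in pairs)
    return hits / total

def overall_score(per_type_pairs):
    """Weighted sum of the per-type recalls."""
    return sum(WEIGHTS[t] * type_recall(pairs) for t, pairs in per_type_pairs.items())

toy = {
    'clicks': [([1, 2, 3], [2])],      # 1 of 1 ground-truth aids found
    'carts':  [([1, 2, 3], [3, 9])],   # 1 of 2 ground-truth aids found
    'orders': [([1, 2, 3], [7])],      # 0 of 1 ground-truth aids found
}
score = overall_score(toy)
```

Because orders carry 60% of the weight, improving only the click ranker, as this notebook does, can move the overall score by at most 0.10.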
