Tmall Repeat Buyer Prediction Challenge: Feature Engineering

Latest recommended article published on 2024-07-12 19:09:18

jialun0116

Category: Tianchi Competition: Tmall Repeat Buyer Prediction Challenge. Tags: machine learning, deep learning, python

Article link: https://blog.csdn.net/qq_30031221/article/details/111499957

Competition link

The constructed features are:

'label', 'merchant_id', 'user_id',
'user_item_counts',
'user_cat_counts',
'user_seller_counts',
'user_seller_unique_counts',
'user_brand_counts',
'user_day_active_counts',
'user_action_type_counts',
'user_time_stamp_max',
'user_time_stamp_min',
'user_time_stamp_std',
'user_time_stamp_range_day',
'user_cat_most_1',
'user_cat_most_1_cnt',
'user_seller_most_1',
'user_seller_most_1_cnt',
'user_brand_most_1',
'user_brand_most_1_cnt',
'user_cnt_0', 'user_cnt_1', 'user_cnt_2', 'user_cnt_3',
'user_merchant_action_0',
'user_merchant_action_1',
'user_merchant_action_2',
'user_merchant_action_3',
'seller_user_counts',
'seller_user_unique_counts',
'seller_item_counts',
'seller_cat_counts',
'seller_brand_counts',
'seller_day_active_counts',
'seller_time_stamp_max',
'seller_time_stamp_min',
'seller_time_stamp_std',
'seller_time_stamp_range_day',
'seller_cnt_0',
'seller_cnt_1',
'seller_cnt_2',
'seller_cnt_3',
'age_0.0', 'age_1.0', 'age_2.0',
'age_3.0', 'age_4.0', 'age_5.0',
'age_6.0', 'age_7.0', 'age_8.0',
'gender_0.0',
'gender_1.0',
'gender_2.0'

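Background theory

Text representation models

Bag-of-words model: term weights are usually computed with TF-IDF
- TF-IDF(t,d) = TF(t,d) * IDF(t)
- TF(t,d) is the frequency of term t in document d
- IDF(t) is the inverse document frequency, which measures how much term t contributes to the meaning of a document
- IDF(t) = $\log\frac{a}{b+1}$, where a is the total number of documents and b is the number of documents containing term t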
```
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# `data_test` (a DataFrame) and `columns_list` (the names of its text columns)
# are assumed to be defined elsewhere.
# stop_words='english' selects scikit-learn's built-in English stop word list.
# cntVec = CountVectorizer(stop_words='english', ngram_range=(1, 1), max_features=10)
tfidfVec = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), max_features=10)

# Fit TF-IDF on each text column and stack the sparse matrices side by side
for i, col in enumerate(columns_list):
    data_ = tfidfVec.fit_transform(data_test[col])
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))
```
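
To make the weighting concrete, the TF-IDF definition given here can be evaluated by hand on a tiny corpus. This is a minimal sketch: the three documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and the IDF follows the log(a / (b + 1)) form used in this article. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match these.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens
docs = [
    ["shoes", "red", "shoes"],
    ["red", "dress"],
    ["blue", "shoes"],
]

def tf(term, doc):
    # Frequency of `term` within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(a / (b + 1)): a = number of documents, b = documents containing the term
    a = len(docs)
    b = sum(1 for doc in docs if term in doc)
    return math.log(a / (b + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("dress", docs[1], docs))  # "dress" appears in one document only, so its weight is positive
print(tf_idf("shoes", docs[0], docs))  # "shoes" appears in 2 of 3 documents, so log(3/3) gives 0
```

With this definition a term that occurs in almost every document gets a weight near zero (or even negative), which is the intended down-weighting of common terms.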
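N-gram
- A language model
- For a fragment of N words, assume that the N-th word depends only on the preceding N-1 words; the probability of the whole fragment is then the product of the conditional probabilities of its N words.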
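
As an illustration of the N-gram idea, the sketch below estimates bigram (N = 2) probabilities by counting and multiplies them to score a short sequence. The toy corpus and the function names are assumptions made for this example, and there is no smoothing, so an unseen bigram gives probability 0.

```python
from collections import Counter

# Toy corpus (hypothetical): tokenized sentences
corpus = [
    ["i", "like", "red", "shoes"],
    ["i", "like", "blue", "shoes"],
    ["i", "buy", "red", "shoes"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(words):
    # Product of the conditional probabilities of each word given the previous one
    # (the probability of the first word is left out to keep the sketch short)
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sequence_prob(["i", "like", "red", "shoes"]))
print(sequence_prob(["i", "like", "shoes"]))  # "like shoes" never occurs in the corpus
```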
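Topic models
- Bag-of-words and N-gram models cannot tell whether two different words or phrases share the same topic
- A topic model maps words or phrases with the same topic onto the same dimension, so that each dimension represents a topic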

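Word embedding

Word embedding is the collective name for models that turn words into vectors.
Each word is mapped to a dense vector in a low-dimensional space.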

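import gensim
import numpy as np
import pandas as pd

# Train a Word2Vec model on each user's seller sequence
# (all_data_test is the DataFrame built in section 5).
# gensim >= 4.0 names the dimension parameter `vector_size`; gensim 3.x called it `size`.
sentences = all_data_test['seller_path'].apply(lambda x: x.split(' '))
model = gensim.models.Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
# model.save("product2vec.model")
# model = gensim.models.Word2Vec.load("product2vec.model")

def mean_w2v_(x, model, size=100):
    # Average the vectors of the in-vocabulary tokens; return zeros if there are none.
    # gensim >= 4.0 exposes the vocabulary as `wv.key_to_index` (3.x used `wv.vocab`).
    if not isinstance(x, str):
        return np.zeros(size)
    vecs = [model.wv[word] for word in x.split(' ') if word in model.wv.key_to_index]
    if not vecs:
        return np.zeros(size)
    return np.mean(vecs, axis=0)

def get_mean_w2v(df_data, columns, model, size):
    # One averaged embedding per row
    data_array = [mean_w2v_(x, model, size) for x in df_data[columns]]
    return pd.DataFrame(data_array)

df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]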
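
The averaging step can be seen in isolation without training a model. In this sketch a plain dictionary of made-up 3-dimensional vectors stands in for the trained Word2Vec vocabulary (the names `toy_vectors` and `mean_vector` are hypothetical). Tokens missing from the dictionary are skipped, and a path with no known tokens falls back to a zero vector.

```python
import numpy as np

# Stand-in for a trained embedding table (values are invented for illustration)
toy_vectors = {
    "101": np.array([1.0, 0.0, 2.0]),
    "202": np.array([3.0, 4.0, 0.0]),
}

def mean_vector(path, vectors, size=3):
    # Average the vectors of the tokens that exist in the table
    found = [vectors[tok] for tok in path.split(" ") if tok in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

print(mean_vector("101 202 999", toy_vectors))  # "999" is unknown and ignored
print(mean_vector("999", toy_vectors))          # no known token: zero vector
```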

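Feature engineering approach: people, goods, and place
- User behavior features
- Shop features
- Features describing the user's interaction with the target shop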

1. Import packages and data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import gc
from collections import Counter
import copy

import warnings
warnings.filterwarnings("ignore")

train_file = '../datasets/data_format1/train_format1.csv'
test_file = '../datasets/data_format1/test_format1.csv'

user_info_file = '../datasets/data_format1/user_info_format1.csv'
user_log_file = '../datasets/da

2. Compress the data in memory

Choose a suitable dtype for each numeric feature column

def reduce_mem_usage(df_path):
    df = pd.read_csv(df_path)
    start_mem = df.memory_usage().sum() / 1024**2
    numerics = ['int8','int16','int32','int64','float16','float32','float64']
    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if(c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max):
                    df[col] = df[col].astype(np.int8)
                elif (c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max):
                    df[col] = df[col].astype(np.int16)
                elif (c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max):
                    df[col] = df[col].astype(np.int32)
                elif (c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max):
                    df[col] = df[col].astype(np.int64)
            else:
                if (c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max):
                    df[col] = df[col].astype(np.float16)
                elif (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
                    
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df  
  
  
train_data = reduce_mem_usage(train_file)
test_data = reduce_mem_usage(test_file)

user_info = reduce_mem_usage(user_info_file)
user_log = reduce_mem_usage(user_log_file)
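
The effect of choosing a smaller dtype is easy to check on a toy frame. The sketch below uses pandas' built-in `pd.to_numeric(..., downcast=...)` rather than hand-written range checks, so it is an alternative to `reduce_mem_usage`, not a copy of it; the column names and values are invented.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the default dtypes are int64 and float64
df = pd.DataFrame({
    "action_type": np.array([0, 1, 2, 3] * 250, dtype=np.int64),
    "item_id": np.arange(40000, 41000, dtype=np.int64),
    "price": np.arange(1000) * 0.5,
})
before = df.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that holds its values
df["action_type"] = pd.to_numeric(df["action_type"], downcast="integer")
df["item_id"] = pd.to_numeric(df["item_id"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Note that `downcast="float"` stops at float32, whereas the function above may go down to float16, which saves more memory at the cost of precision.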

3. Merge the training set, test set, user info table, and user log table

# First concatenate the training and test sets,
# then merge in the user info table.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
all_data = pd.concat([train_data, test_data])
all_data = all_data.sort_values(['user_id','merchant_id'])
all_data = all_data.merge(user_info,on=['user_id'],how='left')


# Join each user's log values into one space-separated string
list_join_func = lambda x: " ".join([str(i) for i in x])
# Aggregation applied to each log column
agg_dict = {
            'item_id' : list_join_func,
            'cat_id' : list_join_func,
            'seller_id' : list_join_func,
            'brand_id' : list_join_func,
            'time_stamp' : list_join_func,
            'action_type' : list_join_func}
# New names for the aggregated columns
rename_dict = {
            'item_id' : 'item_path',
            'cat_id' : 'cat_path',
            'seller_id' : 'seller_path',
            'brand_id' : 'brand_path',
            'time_stamp' : 'time_stamp_path',
            'action_type' : 'action_type_path'}

def merge_list(df,join_columns,join_data,agg_dict,rename_dict):
    # Aggregate join_data per key, then left-join the result onto df
    join_data = join_data.groupby(join_columns).agg(agg_dict).reset_index().rename(columns=rename_dict)
    df = df.merge(join_data,on=join_columns,how='left')
    return df
  
# Sort the user log by time so that every path is in chronological order
user_log = user_log.sort_values(['user_id','time_stamp'])
all_data_user_info_log = merge_list(all_data, 'user_id', user_log, agg_dict, rename_dict)
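
A toy run shows what this aggregation produces. The sketch below applies the same group-then-join idea to a three-row log with invented IDs; `toy_log`, `toy_pairs`, and `join_as_text` are hypothetical names, and only two columns are aggregated to keep the output small.

```python
import pandas as pd

# Invented log rows: user 1 has two actions, user 2 has one
toy_log = pd.DataFrame({
    "user_id": [1, 1, 2],
    "seller_id": [10, 11, 10],
    "action_type": [0, 2, 0],
    "time_stamp": [511, 1111, 620],
})
# Invented (user, merchant) pairs to be enriched; user 3 has no log rows
toy_pairs = pd.DataFrame({"user_id": [1, 2, 3], "merchant_id": [11, 10, 10]})

join_as_text = lambda values: " ".join(str(v) for v in values)

# Sort by time first so every path is chronological, then build one string per user
paths = (
    toy_log.sort_values(["user_id", "time_stamp"])
    .groupby("user_id")
    .agg({"seller_id": join_as_text, "action_type": join_as_text})
    .reset_index()
    .rename(columns={"seller_id": "seller_path", "action_type": "action_type_path"})
)
merged = toy_pairs.merge(paths, on="user_id", how="left")
print(merged)
```

User 3 has no log rows, so the left join leaves its paths as NaN. That is the case in which the `try`/`except` fallbacks of the statistics functions return -1.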

4. Define the feature statistics functions

# Each helper takes a space-separated path string and returns -1
# when the value is not a string (for example NaN for a missing log).

# Count all values in the path
def cnt_(x):
    try:
        return len(x.split(' '))
    except Exception:
        return -1
def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

# Count the distinct values in the path
def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except Exception:
        return -1
def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data

# Maximum value in the path
def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except Exception:
        return -1
def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

# Minimum value in the path
def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except Exception:
        return -1  
def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data

# Standard deviation of the values in the path
def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except Exception:
        return -1 
def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

# The n-th most frequent value in the path
def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except Exception:
        return -1
def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

# How many times the n-th most frequent value occurs
def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except Exception:
        return -1
def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data
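
For a single path string, these helpers reduce to a few one-liners. The following sketch works through one invented category path so that the meaning of each statistic is visible; it re-implements the logic inline instead of calling the helper functions, and the sample value is an assumption.

```python
from collections import Counter

cat_path = "5 7 5 9 5 7"   # invented category path for one user
tokens = cat_path.split(" ")

total = len(tokens)                     # what cnt_ counts: every action
distinct = len(set(tokens))             # what nunique_ counts: distinct values
top_value, top_count = Counter(tokens).most_common(1)[0]   # most_n / most_n_cnt with n=1

print(total, distinct, top_value, top_count)
```

The values stay strings after `split`, so the most frequent category comes back as the string `'5'` rather than an integer.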

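# Convert an integer timestamp of the form mmdd (e.g. 1111) to a datetime
def int_to_datetime(x):
    x = str(x)
    # The timestamp carries no year, so a fixed placeholder year (2020) is prepended
    x = '2020-'+x[:-2] + '-' + x[-2:]
    # Unparseable values (such as the -1 fallback) become NaT
    return pd.to_datetime(x, errors='coerce')
# Number of days between the latest and the earliest timestamp
def time_range(df_data, max_date_col,min_date_col,name):
    max_date = list(map(int_to_datetime, df_data[max_date_col]))
    min_date = list(map(int_to_datetime, df_data[min_date_col]))
    one_day = pd.Timedelta(days=1)
    # Dividing by one day gives a float number of days (NaN if a date is NaT)
    df_data[name] = [(mx - mn) / one_day for mx, mn in zip(max_date, min_date)]
    return df_data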
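
The day difference can be checked by hand with the standard library. This sketch assumes, as the conversion function does, that a timestamp such as 511 means May 11 and that the year is only a placeholder; `mmdd_to_date` is a hypothetical helper, not part of the original code.

```python
from datetime import date

def mmdd_to_date(value, year=2020):
    # Split an integer like 1111 into month (all but the last two digits) and day
    text = str(value)
    return date(year, int(text[:-2]), int(text[-2:]))

first_seen = mmdd_to_date(511)    # May 11
last_seen = mmdd_to_date(1111)    # November 11
print((last_seen - first_seen).days)
```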

# Count how many times each action type occurs in a path
def user_action_cnt(df_data,col_action,action_type,name):
    func = lambda x: len([i for i in x.split(' ') if i == action_type])
    df_data[name+'_'+action_type] = df_data[col_action].apply(func) 
    return df_data

# Count how many times the user performed a given action on this merchant.
# df_data is a single row here: seller_path and action_type_path are aligned position by position.
def user_merchant_mark(df_data,merchant_id,seller_path,action_type_path,action_type):
    seller_len = len(df_data[seller_path].split(' '))
    data_dict = {}
    data_dict[seller_path] = df_data[seller_path].split(' ')
    data_dict[action_type_path] = df_data[action_type_path].split(' ')
    #print(data_dict)
    mark = 0
    for i_ in range(seller_len):
        if data_dict[seller_path][i_] == str(df_data[merchant_id]):
            if data_dict[action_type_path][i_] == action_type:
                mark += 1
    return mark
  
# Apply user_merchant_mark to every row
def user_merchant_mark_all(df_data,merchant_id,seller_path,action_type_path,action_type,name):
    df_data[name+'_'+action_type] = df_data.apply(lambda x:user_merchant_mark(x,merchant_id,seller_path,action_type_path,action_type),axis=1)
    return df_data
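
The per-merchant counts pair each seller in `seller_path` with the action at the same position in `action_type_path`. A compact way to express that pairing is `zip`; the sketch below is an equivalent formulation on invented data, with `count_actions_on_merchant` as a hypothetical name.

```python
def count_actions_on_merchant(seller_path, action_type_path, merchant_id, action_type):
    # Walk both paths in step and count positions where seller and action both match
    sellers = seller_path.split(" ")
    actions = action_type_path.split(" ")
    return sum(
        1
        for seller, action in zip(sellers, actions)
        if seller == str(merchant_id) and action == action_type
    )

# Invented example: the user clicked seller 10 twice and bought from it once
seller_path = "10 11 10 10"
action_type_path = "0 0 0 2"
print(count_actions_on_merchant(seller_path, action_type_path, 10, "0"))  # clicks on seller 10
print(count_actions_on_merchant(seller_path, action_type_path, 10, "2"))  # purchases from seller 10
```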

5. Build the user profile

# Build the user profile on a small sample (the first 2000 rows);
# .copy() avoids pandas' SettingWithCopyWarning when columns are added
all_data_test = all_data_user_info_log.head(2000).copy()
# Number of distinct items the user interacted with
all_data_test = user_nunique(all_data_test,'item_path','user_item_counts')
# Number of distinct categories the user interacted with
all_data_test = user_nunique(all_data_test,'cat_path','user_cat_counts')
# Total number of shop interactions (not deduplicated)
all_data_test = user_cnt(all_data_test,'seller_path','user_seller_counts')
# Number of distinct shops the user interacted with
all_data_test = user_nunique(all_data_test,'seller_path','user_seller_unique_counts')
# Number of distinct brands the user interacted with
all_data_test = user_nunique(all_data_test,'brand_path','user_brand_counts')
# Number of days on which the user was active
all_data_test = user_nunique(all_data_test,'time_stamp_path','user_day_active_counts')
# Number of distinct action types the user performed
all_data_test = user_nunique(all_data_test,'action_type_path','user_action_type_counts')

# Latest action time
all_data_test = user_max(all_data_test, 'time_stamp_path','user_time_stamp_max')
# Earliest action time
all_data_test = user_min(all_data_test, 'time_stamp_path','user_time_stamp_min')
# Standard deviation of the action timestamps
all_data_test = user_std(all_data_test, 'time_stamp_path','user_time_stamp_std')
# Days between the earliest and latest action
all_data_test = time_range(all_data_test,'user_time_stamp_max','user_time_stamp_min','user_time_stamp_range_day')


# The user's most frequent category and its count (all action types are counted)
all_data_test = user_most_n(all_data_test, 'cat_path', 'user_cat_most_1', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'user_cat_most_1_cnt', n=1)
# The user's most frequent shop and its count
all_data_test = user_most_n(all_data_test,'seller_path','user_seller_most_1',n=1)
all_data_test = user_most_n_cnt(all_data_test,'seller_path','user_seller_most_1_cnt',n=1)
# The user's most frequent brand and its count
all_data_test = user_most_n(all_data_test, 'brand_path', 'user_brand_most_1', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'user_brand_most_1_cnt', n=1)
# The user's most frequent action type and its count
# all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_most_1', n=1)
# all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_most_1_cnt', n=1)

# Number of clicks by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '0', 'user_cnt')
# Number of add-to-cart actions by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '1', 'user_cnt')
# Number of purchases by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '2', 'user_cnt')
# Number of add-to-favorites actions by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '3', 'user_cnt')

# One-hot encode age range and gender
age_range = pd.get_dummies(all_data_test['age_range'],prefix='age')
gender_range = pd.get_dummies(all_data_test['gender'],prefix='gender')
all_data_test = all_data_test.join(age_range)
all_data_test = all_data_test.join(gender_range)
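
`pd.get_dummies` only creates columns for values that actually occur in the data it is given. The toy example below (invented values) shows the resulting column names, and why a category missing from a sample produces no column. That may be why a column such as `age_1.0` can be absent when only the first 2000 rows are used.

```python
import numpy as np
import pandas as pd

# Invented sample: age_range contains 0, 2 and a missing value, but no 1
sample = pd.DataFrame({"age_range": [0.0, 2.0, np.nan, 2.0]})

age_dummies = pd.get_dummies(sample["age_range"], prefix="age")
print(list(age_dummies.columns))   # one column per observed value
print(age_dummies.sum().tolist())  # the NaN row contributes to no column
```

If train and test are encoded separately, their dummy columns can differ for the same reason, so encoding the combined data (as done here) keeps the columns consistent.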

6. Build features for the user and the target merchant

# How many times the user performed actions 0, 1, 2, and 3 on this merchant
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','0','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','1','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','2','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','3','user_merchant_action')

all_data_test.columns
'''
Index(['label', 'merchant_id', 'prob', 'user_id', 'age_range', 'gender',
       'item_path', 'cat_path', 'seller_path', 'brand_path', 'time_stamp_path',
       'action_type_path', 'user_item_counts', 'user_cat_counts',
       'user_seller_counts', 'user_seller_unique_counts', 'user_brand_counts',
       'user_day_active_counts', 'user_action_type_counts',
       'user_time_stamp_max', 'user_time_stamp_min', 'user_time_stamp_std',
       'user_time_stamp_range_day', 'user_cat_most_1', 'user_cat_most_1_cnt',
       'user_seller_most_1', 'user_seller_most_1_cnt', 'user_brand_most_1',
       'user_brand_most_1_cnt', 'user_cnt_0', 'user_cnt_1', 'user_cnt_2',
       'user_cnt_3', 'age_0.0', 'age_2.0', 'age_3.0', 'age_4.0', 'age_5.0',
       'age_6.0', 'age_7.0', 'age_8.0', 'gender_0.0', 'gender_1.0',
       'gender_2.0', 'user_merchant_action_0', 'user_merchant_action_1',
       'user_merchant_action_2', 'user_merchant_action_3'],
      dtype='object')
'''

7. Build merchant features

# Group the log by merchant
list_join_func = lambda x: " ".join([str(i) for i in x])

agg_dict_seller = {
            'user_id':list_join_func,
            'item_id' : list_join_func,
            'cat_id' : list_join_func,
            'brand_id' : list_join_func,
            'time_stamp' : list_join_func,
            'action_type' : list_join_func}

rename_dict_seller = {
            'user_id':'user_path',
            'item_id' : 'item_path',
            'cat_id' : 'cat_path',
            'brand_id' : 'brand_path',
            'time_stamp' : 'time_stamp_path',
            'action_type' : 'action_type_path'}
# Build one row of path strings per merchant
user_log_seller = user_log.groupby('seller_id').agg(agg_dict_seller).reset_index().rename(columns=rename_dict_seller)
display(user_log_seller)

user_log_seller_test = user_log_seller

# Total number of actions on the shop
user_log_seller_test = user_cnt(user_log_seller_test,'user_path','seller_user_counts')
# Number of distinct users who interacted with the shop
user_log_seller_test = user_nunique(user_log_seller_test,'user_path','seller_user_unique_counts')
# Number of distinct items in the shop
user_log_seller_test = user_nunique(user_log_seller_test,'item_path','seller_item_counts')
# Number of distinct categories in the shop
user_log_seller_test = user_nunique(user_log_seller_test,'cat_path','seller_cat_counts')
# Number of distinct brands in the shop
user_log_seller_test = user_nunique(user_log_seller_test,'brand_path','seller_brand_counts')
# Number of days on which the shop had activity
user_log_seller_test = user_nunique(user_log_seller_test,'time_stamp_path','seller_day_active_counts')


# Latest action time
user_log_seller_test = user_max(user_log_seller_test, 'time_stamp_path','seller_time_stamp_max')
# Earliest action time
user_log_seller_test = user_min(user_log_seller_test, 'time_stamp_path','seller_time_stamp_min')
# Standard deviation of the action timestamps
user_log_seller_test = user_std(user_log_seller_test, 'time_stamp_path','seller_time_stamp_std')
# Days between the earliest and latest action
user_log_seller_test = time_range(user_log_seller_test,'seller_time_stamp_max','seller_time_stamp_min','seller_time_stamp_range_day')

# Total count of each action type received by the shop
# Number of clicks on the shop
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '0', 'seller_cnt')
# Number of add-to-cart actions on the shop
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '1', 'seller_cnt')
# Number of purchases from the shop
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '2', 'seller_cnt')
# Number of add-to-favorites actions on the shop
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '3', 'seller_cnt')

display(user_log_seller_test)
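
The same kind of merchant statistics can also be computed without building path strings, by aggregating the log directly. This is an alternative sketch using pandas named aggregation on invented rows, not the approach used in this article; the feature names mirror the ones in this section.

```python
import pandas as pd

# Invented log rows for two shops
toy_log = pd.DataFrame({
    "user_id":     [1, 1, 2, 3],
    "seller_id":   [10, 10, 10, 11],
    "item_id":     [100, 101, 100, 200],
    "action_type": [0, 2, 0, 0],
})

seller_features = (
    toy_log.groupby("seller_id")
    .agg(
        seller_user_counts=("user_id", "size"),            # all actions on the shop
        seller_user_unique_counts=("user_id", "nunique"),  # distinct users
        seller_item_counts=("item_id", "nunique"),         # distinct items
    )
    .reset_index()
)
# Count purchases (action_type == 2) per shop; shops without purchases get 0
purchases = toy_log[toy_log["action_type"] == 2].groupby("seller_id").size()
seller_features["seller_cnt_2"] = (
    seller_features["seller_id"].map(purchases).fillna(0).astype(int)
)
print(seller_features)
```

Aggregating directly avoids holding long strings in memory, which may matter on the full log; the path strings remain useful when the sequences themselves are needed, for example for Word2Vec.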

8. Merge the features

user_log_seller_test = user_log_seller_test.rename(columns={'seller_id':'merchant_id'})
all_data_test = all_data_test.merge(user_log_seller_test,on='merchant_id',how='left')
features_columns = [c for c in all_data_test.columns if c not in [ 'prob', 'age_range', 'gender',
       'item_path_x', 'cat_path_x', 'seller_path', 'brand_path_x',
       'time_stamp_path_x', 'action_type_path_x','item_path_y', 'cat_path_y', 'user_path', 'brand_path_y',
       'time_stamp_path_y', 'action_type_path_y',]]
display(features_columns)
'''
['label',
 'merchant_id',
 'user_id',
 'user_item_counts',
 'user_cat_counts',
 'user_seller_counts',
 'user_seller_unique_counts',
 'user_brand_counts',
 'user_day_active_counts',
 'user_action_type_counts',
 'user_time_stamp_max',
 'user_time_stamp_min',
 'user_time_stamp_std',
 'user_time_stamp_range_day',
 'user_cat_most_1',
 'user_cat_most_1_cnt',
 'user_seller_most_1',
 'user_seller_most_1_cnt',
 'user_brand_most_1',
 'user_brand_most_1_cnt',
 'user_cnt_0',
 'user_cnt_1',
 'user_cnt_2',
 'user_cnt_3',
 'user_merchant_action_0',
 'user_merchant_action_1',
 'user_merchant_action_2',
 'user_merchant_action_3',
 'seller_user_counts',
 'seller_user_unique_counts',
 'seller_item_counts',
 'seller_cat_counts',
 'seller_brand_counts',
 'seller_day_active_counts',
 'seller_time_stamp_max',
 'seller_time_stamp_min',
 'seller_time_stamp_std',
 'seller_time_stamp_range_day',
 'seller_cnt_0',
 'seller_cnt_1',
 'seller_cnt_2',
 'seller_cnt_3',
 'age_0.0',
 'age_1.0',
 'age_2.0',
 'age_3.0',
 'age_4.0',
 'age_5.0',
 'age_6.0',
 'age_7.0',
 'age_8.0',
 'gender_0.0',
 'gender_1.0',
 'gender_2.0']
'''
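
The `_x` and `_y` names in the exclusion list come from the merge: when both frames have a non-key column with the same name, pandas appends these suffixes by default. A small demonstration on invented frames (all names and values here are made up):

```python
import pandas as pd

user_side = pd.DataFrame({
    "merchant_id": [10, 11],
    "item_path": ["100 101", "200"],
    "user_cnt_0": [2, 1],
})
seller_side = pd.DataFrame({
    "merchant_id": [10],
    "item_path": ["100 101 100"],
    "seller_cnt_0": [3],
})

merged = user_side.merge(seller_side, on="merchant_id", how="left")
print(list(merged.columns))

# Keep only the numeric features by dropping the raw path strings
feature_columns = [c for c in merged.columns if not c.startswith("item_path")]
print(feature_columns)
```

Merchant 11 has no row on the seller side, so its seller features are NaN after the left join and may need to be filled before training.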