Tmall Repeat Buyer Prediction Challenge: Feature Engineering

Competition link

Features constructed:

'label', 'merchant_id', 'user_id',
'user_item_counts',
'user_cat_counts',
'user_seller_counts',
'user_seller_unique_counts',
'user_brand_counts',
'user_day_active_counts',
'user_action_type_counts',
'user_time_stamp_max',
'user_time_stamp_min',
'user_time_stamp_std',
'user_time_stamp_range_day',
'user_cat_most_1',
'user_cat_most_1_cnt',
'user_seller_most_1',
'user_seller_most_1_cnt',
'user_brand_most_1',
'user_brand_most_1_cnt',
'user_cnt_0', 'user_cnt_1', 'user_cnt_2', 'user_cnt_3',
'user_merchant_action_0',
'user_merchant_action_1',
'user_merchant_action_2',
'user_merchant_action_3',
'seller_user_counts',
'seller_user_unique_counts',
'seller_item_counts',
'seller_cat_counts',
'seller_brand_counts',
'seller_day_active_counts',
'seller_time_stamp_max',
'seller_time_stamp_min',
'seller_time_stamp_std',
'seller_time_stamp_range_day',
'seller_cnt_0',
'seller_cnt_1',
'seller_cnt_2',
'seller_cnt_3',
'age_0.0', 'age_1.0', 'age_2.0',
'age_3.0', 'age_4.0', 'age_5.0',
'age_6.0', 'age_7.0', 'age_8.0',
'gender_0.0',
'gender_1.0',
'gender_2.0'

Background theory

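  • Text representation models

    • Bag-of-words model: TF-IDF is commonly used to compute the weights

      • TF-IDF(t,d) = TF(t,d) * IDF(t)

      • TF(t,d) is the frequency of term t in document d

      • IDF(t) is the inverse document frequency, which measures how important term t is for conveying meaning

      • $IDF(t) = \log\frac{a}{b+1}$, where a = total number of documents and b = number of documents containing term t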

      • from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
        from scipy import sparse
        #cntVec = CountVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=10)
        tfidfVec = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=10)
        
        for i, col in enumerate(columns_list):
        	tfidfVec.fit(data_test[col])
        	data_ = tfidfVec.transform(data_test[col])
        	if i == 0:
        		data_cat = data_
        	else:
        		data_cat = sparse.hstack((data_cat, data_))
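To make the formula above concrete, here is a minimal hand-computed sketch that applies IDF(t) = log(a / (b + 1)) exactly as written. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions and not part of the original pipeline. Note that scikit-learn's `TfidfVectorizer` uses a differently smoothed IDF and L2-normalizes each row by default, so its numbers will not match this sketch.

```python
import math

# Hypothetical toy corpus: each "document" is a space-separated path of ids.
docs = ["a b a c", "b c", "c c d", "a d"]

def tf(term, doc):
    """Frequency of `term` among the tokens of `doc`."""
    words = doc.split(" ")
    return words.count(term) / len(words)

def idf(term, docs):
    """IDF(t) = log(a / (b + 1)), following the formula in the text."""
    a = len(docs)                                      # total number of documents
    b = sum(1 for d in docs if term in d.split(" "))   # documents containing the term
    return math.log(a / (b + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_a = tf_idf("a", docs[0], docs)  # "a" appears in 2 of 4 documents
score_c = tf_idf("c", docs[0], docs)  # "c" appears in 3 of 4 documents, so IDF = log(1) = 0
print(score_a, score_c)
```

With this smoothing, a term that appears in all but one document gets weight zero, and a term that appears in every document gets a negative weight.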
        
    • N-gram

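      • A language model
      • For a fragment of N words, assume that the occurrence of the N-th word depends only on the preceding N-1 words; the probability of the whole fragment is then the product of these conditional word probabilities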
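As an illustration of the product-of-probabilities idea, the following is a small sketch of a bigram model (N = 2) estimated from raw counts. The toy sequences and the function names `cond_prob` and `sequence_prob` are assumptions made for this example, and the probability of the first token is ignored to keep it short.

```python
from collections import Counter

# Hypothetical toy sequences, e.g. ordered ids from a user's log.
sequences = [["s1", "s2", "s3"], ["s1", "s2", "s4"], ["s2", "s3", "s1"]]

context_counts = Counter()  # how often each token appears as the previous token
bigram_counts = Counter()   # how often each (previous, current) pair appears
for seq in sequences:
    for prev, cur in zip(seq, seq[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def cond_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sequence_prob(seq):
    """Product of the conditional probabilities of each token given its predecessor."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= cond_prob(prev, cur)
    return p

p_seen = sequence_prob(["s1", "s2", "s3"])  # P(s2|s1) * P(s3|s2) = 1 * 2/3
p_unseen = sequence_prob(["s1", "s4"])      # the pair (s1, s4) never occurs
print(p_seen, p_unseen)
```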
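    • Topic models

      • Bag-of-words and N-gram models cannot tell whether two different words or phrases share the same topic
      • A topic model maps words or phrases with the same topic onto the same dimension, and that dimension represents the topic
    • Word embedding

      • A general term for models that represent words as vectors

      • Each word is mapped to a dense vector in a low-dimensional space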

      • import gensim
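        # Train a Word2Vec model on the seller paths (np and pd are imported in section 1).
        # Note: size= and model.wv.vocab are the gensim 3.x API; gensim 4.x uses vector_size= and model.wv.key_to_index.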
        model = gensim.models.Word2Vec(all_data_test['seller_path'].apply(lambda x: x.split(' ')), size=100, window=5, min_count=5, workers=4)
        # model.save("product2vec.model")
        # model = gensim.models.Word2Vec.load("product2vec.model")
        
        def mean_w2v_(x, model, size=100):
            try:
                i = 0
                for word in x.split(' '):
                    if word in model.wv.vocab:
                        i += 1
                        if i == 1:
                            vec = np.zeros(size)
                        vec += model.wv[word]
                return vec / i 
            except:
                return  np.zeros(size)
        
        def get_mean_w2v(df_data, columns, model, size):
            data_array = []
            for index, row in df_data.iterrows():
                w2v = mean_w2v_(row[columns], model, size)
                data_array.append(w2v)
            return pd.DataFrame(data_array)
        
        df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
        df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]
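The averaging step can be tried without a trained model. The sketch below is a simplified stand-in, not the author's function: a plain dict of made-up 3-dimensional vectors (`toy_vectors`) replaces the Word2Vec vocabulary, and the hypothetical helper `mean_vector` returns a zero vector when no token is known instead of relying on a bare `except`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
toy_vectors = {
    "10": np.array([1.0, 0.0, 2.0]),
    "20": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(path, vectors, size):
    """Average the vectors of the known tokens in a space-separated path."""
    found = [vectors[w] for w in path.split(" ") if w in vectors]
    if not found:
        # No token is in the vocabulary: fall back to a zero vector.
        return np.zeros(size)
    return np.mean(found, axis=0)

vec = mean_vector("10 20 99", toy_vectors, size=3)  # "99" is out of vocabulary and is skipped
empty = mean_vector("99", toy_vectors, size=3)
print(vec, empty)
```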
        
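  • Feature engineering approach: people, goods, and scene

    • User behavior features
    • Merchant features
    • Features describing the user's interaction with the target merchant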

1. Import packages and data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import gc
from collections import Counter
import copy

import warnings
warnings.filterwarnings("ignore")

train_file = '../datasets/data_format1/train_format1.csv'
test_file = '../datasets/data_format1/test_format1.csv'

user_info_file = '../datasets/data_format1/user_info_format1.csv'
user_log_file = '../datasets/da

2. Reduce the memory footprint of the data

  • Choose the smallest suitable dtype for each numeric column
def reduce_mem_usage(df_path):
    df = pd.read_csv(df_path)
    start_mem = df.memory_usage().sum() / 1024**2
    numerics = ['int8','int16','int32','int64','float16','float32','float64']
    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if(c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max):
                    df[col] = df[col].astype(np.int8)
                elif (c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max):
                    df[col] = df[col].astype(np.int16)
                elif (c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max):
                    df[col] = df[col].astype(np.int32)
                elif (c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max):
                    df[col] = df[col].astype(np.int64)
            else:
                if (c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max):
                    df[col] = df[col].astype(np.float16)
                elif (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
                    
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df  
  
  
train_data = reduce_mem_usage(train_file)
test_data = reduce_mem_usage(test_file)

user_info = reduce_mem_usage(user_info_file)
user_log = reduce_mem_usage(user_log_file)
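As a point of comparison, pandas can pick the smallest dtype itself through `pd.to_numeric(..., downcast=...)`. The sketch below is an alternative under stated assumptions, not the function used above: the toy column values are invented, and float columns are downcast only as far as float32, which avoids the roughly three significant decimal digits of precision that float16 offers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the id values are made up for illustration.
df = pd.DataFrame({
    "label": [0, 1, 0],
    "user_id": [34176, 231552, 421341],
    "score": [0.5, 1.25, 3.0],
})

before = df.memory_usage(deep=True).sum()
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        # Smallest signed integer type that can hold every value in the column.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        # downcast="float" never goes below float32.
        df[col] = pd.to_numeric(df[col], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(before, after)
```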

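3. Merge the training set, test set, user info table, and user log table

# First concatenate the training and test sets,
# then merge in the user info table

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
all_data = pd.concat([train_data, test_data])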
all_data = all_data.sort_values(['user_id','merchant_id'])
all_data = all_data.merge(user_info,on=['user_id'],how='left')


# Join each user's log records into one space-separated string per field
list_join_func = lambda x: " ".join([str(i) for i in x])
# Aggregation applied to each field
agg_dict = {
            'item_id' : list_join_func,
            'cat_id' : list_join_func,
            'seller_id' : list_join_func,
            'brand_id' : list_join_func,
            'time_stamp' : list_join_func,
            'action_type' : list_join_func}
# Rename the aggregated columns
rename_dict = {
            'item_id' : 'item_path',
            'cat_id' : 'cat_path',
            'seller_id' : 'seller_path',
            'brand_id' : 'brand_path',
            'time_stamp' : 'time_stamp_path',
            'action_type' : 'action_type_path'}

def merge_list(df,join_columns,join_data,agg_dict,rename_dict):
    join_data = join_data.groupby(join_columns).agg(agg_dict).reset_index().rename(columns=rename_dict)
    df = df.merge(join_data,on=join_columns,how='left')
    return df
  
# Sort the user log by user and time
user_log = user_log.sort_values(['user_id','time_stamp'])
all_data_user_info_log = merge_list(all_data, 'user_id', user_log, agg_dict, rename_dict)
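To see what the aggregated "path" columns look like, here is a small sketch of the same sort, group, and join pattern on a toy log. The rows in `toy_log` are invented, and only three of the six fields are aggregated to keep the example short.

```python
import pandas as pd

# Hypothetical toy log; time_stamp is an integer in mmdd form.
toy_log = pd.DataFrame({
    "user_id":     [1, 1, 2, 1],
    "seller_id":   [10, 20, 10, 10],
    "time_stamp":  [1111, 829, 1010, 1110],
    "action_type": [0, 2, 0, 0],
})

# Sort by time first so that each path is in chronological order.
toy_log = toy_log.sort_values(["user_id", "time_stamp"])

join = lambda x: " ".join(str(i) for i in x)
paths = (
    toy_log.groupby("user_id")
    .agg({"seller_id": join, "time_stamp": join, "action_type": join})
    .reset_index()
    .rename(columns={
        "seller_id": "seller_path",
        "time_stamp": "time_stamp_path",
        "action_type": "action_type_path",
    })
)
print(paths)
```

Because positions line up across the path columns, the i-th seller, time stamp, and action type of a user all describe the same log record, which the later per-merchant counting relies on.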


4. Define the feature statistics functions

# Total number of values
def cnt_(x):
    try:
        return len(x.split(' '))
    except:
        return -1
def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

# Number of distinct values
def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except:
        return -1
def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data

# Maximum value
def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except:
        return -1
def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

# Minimum value
def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except:
        return -1  
def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data

# Standard deviation
def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except:
        return -1 
def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

# The n-th most frequent value
def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except:
        return -1
def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

# Occurrence count of the n-th most frequent value
def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except:
        return -1
def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

# Convert an integer mmdd time stamp to a datetime (a fixed year, 2020, is prepended)
def int_to_datetime(x):
    x = str(x)
    x = '2020-'+x[:-2] + '-' + x[-2:]
    return pd.to_datetime(x, errors='ignore')
# Number of days between the latest and earliest time stamps
def time_range(df_data, max_date_col,min_date_col,name):
    max_date = list(map(int_to_datetime,copy.deepcopy(df_data[max_date_col])))
    min_date = list(map(int_to_datetime,copy.deepcopy(df_data[min_date_col])))
    ns_to_day = 1e9*60*60*24
    ans = []
    for i in range(len(max_date)):
        ans.append((max_date[i]-min_date[i]).value/ns_to_day)
    df_data[name] = ans
    return df_data

# Count how many times each user performed a given action type
def user_action_cnt(df_data,col_action,action_type,name):
    func = lambda x: len([i for i in x.split(' ') if i == action_type])
    df_data[name+'_'+action_type] = df_data[col_action].apply(func) 
    return df_data

# Count the user's past actions of a given type toward this merchant
def user_merchant_mark(df_data,merchant_id,seller_path,action_type_path,action_type):
    seller_len = len(df_data[seller_path].split(' '))
    data_dict = {}
    data_dict[seller_path] = df_data[seller_path].split(' ')
    data_dict[action_type_path] = df_data[action_type_path].split(' ')
    #print(data_dict)
    mark = 0
    for i_ in range(seller_len):
        if data_dict[seller_path][i_] == str(df_data[merchant_id]):
            if data_dict[action_type_path][i_] == action_type:
                mark += 1
    return mark
  
def user_merchant_mark_all(df_data,merchant_id,seller_path,action_type_path,action_type,name):
    df_data[name+'_'+action_type] = df_data.apply(lambda x:user_merchant_mark(x,merchant_id,seller_path,action_type_path,action_type),axis=1)
    return df_data
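The `time_range` function above converts each time stamp one at a time. As a possible alternative, the sketch below does the same day-difference calculation in a vectorized way. The helper name `mmdd_to_datetime`, the toy column names, and the values are assumptions for illustration; the year is the same fixed placeholder, and the result is an integer number of days rather than a float.

```python
import pandas as pd

def mmdd_to_datetime(s, year=2020):
    """Turn a Series of integer mmdd time stamps into datetimes in a fixed year."""
    s = s.astype(int)
    parts = pd.DataFrame({"year": year, "month": s // 100, "day": s % 100})
    # pd.to_datetime assembles dates from year/month/day columns.
    return pd.to_datetime(parts)

# Hypothetical toy values: latest and earliest time stamp per row.
toy = pd.DataFrame({"ts_max": [1111, 1012], "ts_min": [1101, 829]})
toy["range_day"] = (mmdd_to_datetime(toy["ts_max"]) - mmdd_to_datetime(toy["ts_min"])).dt.days
print(toy)
```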

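5. Build the user profile

# Build the user profile on the first 2000 rows (copied so that later column assignments do not write into a slice)
all_data_test = all_data_user_info_log.head(2000).copy()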
# Number of distinct items the user interacted with
all_data_test = user_nunique(all_data_test,'item_path','user_item_counts')
# Number of distinct categories the user interacted with
all_data_test = user_nunique(all_data_test,'cat_path','user_cat_counts')
# Total number of merchant interactions (not deduplicated)
all_data_test = user_cnt(all_data_test,'seller_path','user_seller_counts')
# Number of distinct merchants the user interacted with
all_data_test = user_nunique(all_data_test,'seller_path','user_seller_unique_counts')
# Number of distinct brands the user interacted with
all_data_test = user_nunique(all_data_test,'brand_path','user_brand_counts')
# Number of days on which the user was active
all_data_test = user_nunique(all_data_test,'time_stamp_path','user_day_active_counts')
# Number of distinct action types the user performed
all_data_test = user_nunique(all_data_test,'action_type_path','user_action_type_counts')

# Latest time stamp of the user's actions
all_data_test = user_max(all_data_test, 'time_stamp_path','user_time_stamp_max')
# Earliest time stamp of the user's actions
all_data_test = user_min(all_data_test, 'time_stamp_path','user_time_stamp_min')
# Standard deviation of the user's time stamps
all_data_test = user_std(all_data_test, 'time_stamp_path','user_time_stamp_std')
# Number of days between the earliest and latest actions
all_data_test = time_range(all_data_test,'user_time_stamp_max','user_time_stamp_min','user_time_stamp_range_day')


# Category the user interacted with most, and the count (over all action types: click, add-to-cart, purchase, favorite)
all_data_test = user_most_n(all_data_test, 'cat_path', 'user_cat_most_1', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'user_cat_most_1_cnt', n=1)
# Merchant the user interacted with most, and the count
all_data_test = user_most_n(all_data_test,'seller_path','user_seller_most_1',n=1)
all_data_test = user_most_n_cnt(all_data_test,'seller_path','user_seller_most_1_cnt',n=1)
# Brand the user interacted with most, and the count
all_data_test = user_most_n(all_data_test, 'brand_path', 'user_brand_most_1', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'user_brand_most_1_cnt', n=1)
# Most frequent action type, and the count
# all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_most_1', n=1)
# all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_most_1_cnt', n=1)

# Number of clicks by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '0', 'user_cnt')
# Number of add-to-cart actions by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '1', 'user_cnt')
# Number of purchases by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '2', 'user_cnt')
# Number of add-to-favorites actions by the user
all_data_test = user_action_cnt(all_data_test,  'action_type_path', '3', 'user_cnt')

# One-hot encode age range and gender
age_range = pd.get_dummies(all_data_test['age_range'],prefix='age')
gender_range = pd.get_dummies(all_data_test['gender'],prefix='gender')
all_data_test = all_data_test.join(age_range)
all_data_test = all_data_test.join(gender_range)
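One thing to watch with `pd.get_dummies` is that it only creates columns for values actually present, so a subset of rows can silently lack a column such as `age_1.0`. The sketch below shows one way to pin the column set by declaring the categories up front. The toy values are invented, and the assumption that age ranges run from 0.0 to 8.0 is taken from the feature list at the top of this post.

```python
import pandas as pd

# Hypothetical subset in which age range 1.0 (and several others) never occurs.
subset = pd.DataFrame({"age_range": [0.0, 2.0, 3.0]})

# Plain one-hot encoding: only the observed values become columns.
plain = pd.get_dummies(subset["age_range"], prefix="age")

# Declaring all nine categories first yields one column per category, observed or not.
age_dtype = pd.CategoricalDtype(categories=[float(i) for i in range(9)])
fixed = pd.get_dummies(subset["age_range"].astype(age_dtype), prefix="age")

print(list(plain.columns))
print(list(fixed.columns))
```

Fixing the categories keeps the training and test matrices aligned even when they are encoded separately.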

6. Build features for the user and the target merchant

# How many times the user performed actions 0, 1, 2, and 3 toward this merchant
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','0','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','1','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','2','user_merchant_action')
all_data_test = user_merchant_mark_all(all_data_test,'merchant_id','seller_path','action_type_path','3','user_merchant_action')
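The row-wise `apply` above re-splits the path strings for every row. A possible alternative, sketched here under assumptions, is to count (user, merchant, action) triples directly on the raw log and merge the result. The toy frames `toy_log` and `pairs` are invented, and this is not the method used in the post; it should produce the same counts when every log record is included.

```python
import pandas as pd

# Hypothetical toy log and toy (user, merchant) pairs to score.
toy_log = pd.DataFrame({
    "user_id":     [1, 1, 1, 2],
    "seller_id":   [10, 10, 20, 10],
    "action_type": [0, 2, 0, 0],
})
pairs = pd.DataFrame({"user_id": [1, 2], "merchant_id": [10, 10]})

counts = (
    toy_log.groupby(["user_id", "seller_id", "action_type"]).size()
    .unstack("action_type", fill_value=0)          # one column per observed action type
    .reindex(columns=[0, 1, 2, 3], fill_value=0)   # make sure all four action types exist
    .add_prefix("user_merchant_action_")
    .reset_index()
    .rename(columns={"seller_id": "merchant_id"})
)

# Pairs with no log records at all would get NaN from the left merge, hence fillna(0).
pairs = pairs.merge(counts, on=["user_id", "merchant_id"], how="left").fillna(0)
print(pairs)
```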

all_data_test.columns
'''
Index(['label', 'merchant_id', 'prob', 'user_id', 'age_range', 'gender',
       'item_path', 'cat_path', 'seller_path', 'brand_path', 'time_stamp_path',
       'action_type_path', 'user_item_counts', 'user_cat_counts',
       'user_seller_counts', 'user_seller_unique_counts', 'user_brand_counts',
       'user_day_active_counts', 'user_action_type_counts',
       'user_time_stamp_max', 'user_time_stamp_min', 'user_time_stamp_std',
       'user_time_stamp_range_day', 'user_cat_most_1', 'user_cat_most_1_cnt',
       'user_seller_most_1', 'user_seller_most_1_cnt', 'user_brand_most_1',
       'user_brand_most_1_cnt', 'user_cnt_0', 'user_cnt_1', 'user_cnt_2',
       'user_cnt_3', 'age_0.0', 'age_2.0', 'age_3.0', 'age_4.0', 'age_5.0',
       'age_6.0', 'age_7.0', 'age_8.0', 'gender_0.0', 'gender_1.0',
       'gender_2.0', 'user_merchant_action_0', 'user_merchant_action_1',
       'user_merchant_action_2', 'user_merchant_action_3'],
      dtype='object')
'''


7. Build merchant features

# Group the log by merchant
list_join_func = lambda x: " ".join([str(i) for i in x])

agg_dict_seller = {
            'user_id':list_join_func,
            'item_id' : list_join_func,
            'cat_id' : list_join_func,
            'brand_id' : list_join_func,
            'time_stamp' : list_join_func,
            'action_type' : list_join_func}

rename_dict_seller = {
            'user_id':'user_path',
            'item_id' : 'item_path',
            'cat_id' : 'cat_path',
            'brand_id' : 'brand_path',
            'time_stamp' : 'time_stamp_path',
            'action_type' : 'action_type_path'}
# Build the merchant-level paths
user_log_seller = user_log.groupby('seller_id').agg(agg_dict_seller).reset_index().rename(columns=rename_dict_seller)
display(user_log_seller)


user_log_seller_test = user_log_seller

# Total number of actions on the merchant
user_log_seller_test = user_cnt(user_log_seller_test,'user_path','seller_user_counts')
# Number of distinct users who interacted with the merchant
user_log_seller_test = user_nunique(user_log_seller_test,'user_path','seller_user_unique_counts')
# Number of distinct items at the merchant
user_log_seller_test = user_nunique(user_log_seller_test,'item_path','seller_item_counts')
# Number of distinct categories at the merchant
user_log_seller_test = user_nunique(user_log_seller_test,'cat_path','seller_cat_counts')
# Number of distinct brands at the merchant
user_log_seller_test = user_nunique(user_log_seller_test,'brand_path','seller_brand_counts')
# Number of days on which the merchant had activity
user_log_seller_test = user_nunique(user_log_seller_test,'time_stamp_path','seller_day_active_counts')


# Latest time stamp
user_log_seller_test = user_max(user_log_seller_test, 'time_stamp_path','seller_time_stamp_max')
# Earliest time stamp
user_log_seller_test = user_min(user_log_seller_test, 'time_stamp_path','seller_time_stamp_min')
# Standard deviation of the time stamps
user_log_seller_test = user_std(user_log_seller_test, 'time_stamp_path','seller_time_stamp_std')
# Number of days between the earliest and latest time stamps
user_log_seller_test = time_range(user_log_seller_test,'seller_time_stamp_max','seller_time_stamp_min','seller_time_stamp_range_day')

# Count the actions on the merchant by action type
# Number of clicks on the merchant
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '0', 'seller_cnt')
# Number of add-to-cart actions on the merchant
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '1', 'seller_cnt')
# Number of purchases at the merchant
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '2', 'seller_cnt')
# Number of add-to-favorites actions on the merchant
user_log_seller_test = user_action_cnt(user_log_seller_test,  'action_type_path', '3', 'seller_cnt')

display(user_log_seller_test)


8. Merge the features

user_log_seller_test = user_log_seller_test.rename(columns={'seller_id':'merchant_id'})
all_data_test = all_data_test.merge(user_log_seller_test,on='merchant_id',how='left')
features_columns = [c for c in all_data_test.columns if c not in [ 'prob', 'age_range', 'gender',
       'item_path_x', 'cat_path_x', 'seller_path', 'brand_path_x',
       'time_stamp_path_x', 'action_type_path_x','item_path_y', 'cat_path_y', 'user_path', 'brand_path_y',
       'time_stamp_path_y', 'action_type_path_y',]]
display(features_columns)
'''
['label',
 'merchant_id',
 'user_id',
 'user_item_counts',
 'user_cat_counts',
 'user_seller_counts',
 'user_seller_unique_counts',
 'user_brand_counts',
 'user_day_active_counts',
 'user_action_type_counts',
 'user_time_stamp_max',
 'user_time_stamp_min',
 'user_time_stamp_std',
 'user_time_stamp_range_day',
 'user_cat_most_1',
 'user_cat_most_1_cnt',
 'user_seller_most_1',
 'user_seller_most_1_cnt',
 'user_brand_most_1',
 'user_brand_most_1_cnt',
 'user_cnt_0',
 'user_cnt_1',
 'user_cnt_2',
 'user_cnt_3',
 'user_merchant_action_0',
 'user_merchant_action_1',
 'user_merchant_action_2',
 'user_merchant_action_3',
 'seller_user_counts',
 'seller_user_unique_counts',
 'seller_item_counts',
 'seller_cat_counts',
 'seller_brand_counts',
 'seller_day_active_counts',
 'seller_time_stamp_max',
 'seller_time_stamp_min',
 'seller_time_stamp_std',
 'seller_time_stamp_range_day',
 'seller_cnt_0',
 'seller_cnt_1',
 'seller_cnt_2',
 'seller_cnt_3',
 'age_0.0',
 'age_1.0',
 'age_2.0',
 'age_3.0',
 'age_4.0',
 'age_5.0',
 'age_6.0',
 'age_7.0',
 'age_8.0',
 'gender_0.0',
 'gender_1.0',
 'gender_2.0']
'''
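The `_x` and `_y` names excluded in the column filter come from pandas itself: when both frames in a merge share a non-key column name, the left copy gets the suffix `_x` and the right copy `_y`. The sketch below reproduces this on toy frames and shows the `suffixes` parameter as a more readable option. The frame contents and the suffix names `_user` and `_seller` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frames that both carry an 'item_path' column.
left = pd.DataFrame({"merchant_id": [10, 20], "item_path": ["1 2", "3"]})
right = pd.DataFrame({"merchant_id": [10], "item_path": ["1 2 9"], "seller_cnt_0": [3]})

# Default behaviour: overlapping columns become item_path_x (left) and item_path_y (right).
default = left.merge(right, on="merchant_id", how="left")

# Explicit suffixes make the origin of each column obvious.
named = left.merge(right, on="merchant_id", how="left", suffixes=("_user", "_seller"))

print(list(default.columns))
print(list(named.columns))
```

With `how='left'`, merchants that are missing from the right frame keep their row and receive NaN in the merchant-level columns.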