京东购买意向预测（三）特征工程

本文链接：https://blog.csdn.net/qfikh/article/details/103296300

特征工程

用户基本特征：

获取基本的用户特征，基于用户本身属性多为类别特征的特点，对age,sex,usr_lv_cd进行独热编码操作，对于用户注册时间暂时不处理，

商品基本特征：

根据商品文件获取基本的特征
针对属性a1,a2,a3进行独热编码
商品类别和品牌直接作为特征？？

评论特征：

分时间段，??
对评论数进行独热编码: 0表示无评论，1表示有1条评论，2表示有2-10条评论，3表示有11-50条评论，4表示大于50条评论，对0~4 进行独热编码

行为特征：

分时间段
对行为类别进行独热编码：对1~6进行独热编码
分别按照用户-类别行为分组和用户-类别-商品行为分组统计，然后计算
用户对同类别下其他商品的行为计数
不同时间累积的行为计数（3,5,7,10,15,21,30）

累积用户特征：

分时间段
用户不同行为的
购买转化率
均值

用户近期行为特征：

在上面针对用户进行累积特征提取的基础上，分别提取用户近一个月、近三天的特征，然后提取一个月内用户除去最近三天的行为占据一个月的行为的比重

用户对同类别下各种商品的行为:

用户对各个类别的各项行为操作统计
用户对各个类别操作行为统计占对所有类别操作行为统计的比重

累积商品特征:

分时间段
针对商品的不同行为的
购买转化率
均值

类别特征：

分时间段下各个商品类别的
购买转化率
均值

首先导包，定义常量以及基本方法

import time
from datetime import datetime
from datetime import timedelta
import pandas as pd
import pickle
import os
import math
import numpy as np

# 常量定义
action_2_path = r'data/JData_Action_201602.csv'
action_3_path = r'data/JData_Action_201603.csv'
action_4_path = r'data/JData_Action_201604.csv'

comment_path = r'data/JData_Comment.csv'
product_path = r'data/JData_Product.csv'
user_path = r'data/JData_User.csv'


comment_date = [
    "2016-02-01", "2016-02-08", "2016-02-15", "2016-02-22", "2016-02-29",
    "2016-03-07", "2016-03-14", "2016-03-21", "2016-03-28", "2016-04-04",
    "2016-04-11", "2016-04-15"
]

# 基本方法
def get_actions_0():
    action = pd.read_csv(action_2_path)
    return action

def get_actions_2():
    action = pd.read_csv(action_2_path)
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    return action
'''
数值100  占一个字节，字符串“100“ 占三个字节 不同类型数据占用的字节数  ，sys.getsizeof(a)=a占用的字节数, 
https://blog.csdn.net/u013679490/article/details/54408326
'''
def get_actions_3():
    action = pd.read_csv(action_3_path)
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')

    return action
def get_actions_4():
    action = pd.read_csv(action_4_path)
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')

    return action

#如果电脑性能好就不用分块
def get_actions_2_chunk():
    
    reader = pd.read_csv(action_2_path, iterator=True)
    reader[['user_id','sku_id','model_id','type','cate','brand']] = reader[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
            
    return action
def get_actions_3_chunk():
    
    reader = pd.read_csv(action_3_path, iterator=True)
    reader[['user_id','sku_id','model_id','type','cate','brand']] = reader[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
            
    return action
def get_actions_4_chunk():
    
    reader = pd.read_csv(action_4_path, iterator=True)
    reader[['user_id','sku_id','model_id','type','cate','brand']] = reader[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
            
    return action
# 读取并拼接所有行为记录文件

def get_all_action():
    action_2 = get_actions_2()
    action_3 = get_actions_3()
    action_4 = get_actions_4()
    actions = pd.concat([action_2, action_3, action_4]) # type: pd.DataFrame

    return actions


# 获取某个时间段的行为记录
def get_actions(start_date, end_date, all_actions):
    """
    :param start_date:
    :param end_date:
    :return: actions: pd.Dataframe
    """
    actions = all_actions[(all_actions.time >= start_date) & (all_actions.time < end_date)].copy()
    return actions

train_start_date = '2016-02-01'
train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)
train_end_date = train_end_date.strftime('%Y-%m-%d')
day = 3

start_date = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=day)
start_date = start_date.strftime('%Y-%m-%d')

print (start_date,'-->',train_end_date)

print(all_actions.shape) #(50601736, 7)

all_actions = get_all_action()
actions = get_actions(start_date, train_end_date, all_actions)
actions = actions[['user_id', 'cate', 'type']]
actions.head()

用户特征

用户基本特征

获取基本的用户特征，基于用户本身属性多为类别特征的特点，对age,sex,usr_lv_cd进行独热编码操作，对于用户注册时间暂时不处理。

from sklearn import preprocessing

def get_basic_user_feat():
    # 针对年龄的中文字符问题处理，首先是读入的时候编码，填充空值，然后将其数值化，最后独热编码，此外对于sex也进行了数值类型转换
    user = pd.read_csv(user_path, encoding='gbk')
    #user['age'].fillna('-1', inplace=True)
    #user['sex'].fillna(2, inplace=True)
    user.dropna(axis=0, how='any',inplace=True)
    user['sex'] = user['sex'].astype(int)    
    user['age'] = user['age'].astype(int)
    le = preprocessing.LabelEncoder()        # 方式一
    age_df = le.fit_transform(user['age'])
    age_df = pd.get_dummies(age_df, prefix='age')
    print(list(le.classes_))

#     age_df = pd.get_dummies(user['age'], prefix='age')
    sex_df = pd.get_dummies(user['sex'], prefix='sex')
    user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')
    user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)
    return user

basic_user_feat=get_basic_user_feat()
basic_user_feat.head(10)

from sklearn import preprocessing

def get_basic_user_feat():
    # 针对年龄的中文字符问题处理，首先是读入的时候编码，填充空值，然后将其数值化，最后独热编码，此外对于sex也进行了数值类型转换
    user = pd.read_csv(user_path, encoding='gbk')
    #user['age'].fillna('-1', inplace=True)
    #user['sex'].fillna(2, inplace=True)
    user.dropna(axis=0, how='any',inplace=True)
    user['sex'] = user['sex'].astype(int)    
    user['age'] = user['age'].astype(int)
#     le = preprocessing.LabelEncoder()    
#     age_df = le.fit_transform(user['age'])
#     age_df = pd.get_dummies(age_df, prefix='age')
#     print(list(le.classes_))

    age_df = pd.get_dummies(user['age'], prefix='age') # 方式二
    sex_df = pd.get_dummies(user['sex'], prefix='sex')
    user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')
    user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)
    return user

basic_user_feat=get_basic_user_feat()
basic_user_feat.head(10)

对比方式一和方式二，发现方式二的输出数值都是整数，方式二的age_-1 对应方式一的 age_0 ，虽然形式不同，但实质是一样的

商品特征

商品基本特征

根据商品文件获取基本的特征，针对属性a1,a2,a3进行独热编码，商品类别和品牌直接作为特征

def get_basic_product_feat():
    product = pd.read_csv(product_path)
    attr1_df = pd.get_dummies(product["a1"], prefix="a1")
    attr2_df = pd.get_dummies(product["a2"], prefix="a2")
    attr3_df = pd.get_dummies(product["a3"], prefix="a3")
    product = pd.concat([product[['sku_id', 'cate', 'brand']], attr1_df, attr2_df, attr3_df], axis=1)
    return product
basic_product_feat=get_basic_product_feat()
basic_product_feat.head()

评论特征

分时间段
对评论数进行独热编码不太清楚里面的逻辑

def get_comments_product_feat(end_date):
    comments = pd.read_csv(comment_path)
    comment_date_end = end_date
    comment_date_begin = comment_date[0]
    for date in reversed(comment_date):
        if date < comment_date_end:
            comment_date_begin = date
            break
    comments = comments[comments.dt==comment_date_begin]
    df = pd.get_dummies(comments['comment_num'], prefix='comment_num')
    # 为了防止某个时间段不具备评论数为0的情况（测试集出现过这种情况）
    for i in range(0, 5):
        if 'comment_num_' + str(i) not in df.columns:
            df['comment_num_' + str(i)] = 0
    df = df[['comment_num_0', 'comment_num_1', 'comment_num_2', 'comment_num_3', 'comment_num_4']]
    
    comments = pd.concat([comments, df], axis=1) # type: pd.DataFrame
        #del comments['dt']
        #del comments['comment_num']
    comments = comments[['sku_id', 'has_bad_comment', 'bad_comment_rate','comment_num_0', 'comment_num_1', 
                         'comment_num_2', 'comment_num_3', 'comment_num_4']]
    return comments

train_start_date = '2016-02-01'
train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)
train_end_date = train_end_date.strftime('%Y-%m-%d')
day = 3

start_date = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=day)
start_date = start_date.strftime('%Y-%m-%d')

在这里插入图片描述