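Feature Engineering
Basic user features:
- Get the basic user attributes. Since most user attributes are categorical, one-hot encode age, sex, and user_lv_cd; the registration date is left unprocessed for now.
Basic product features:
- Get the basic features from the product file
- One-hot encode the attributes a1, a2, a3
- Use the product category and brand directly as features (?)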
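Comment features:
- Split by time period (?)
- One-hot encode the comment count level: 0 = no comments, 1 = one comment, 2 = 2-10 comments, 3 = 11-50 comments, 4 = more than 50 comments; the levels 0-4 are one-hot encoded
Behavior features:
- Split by time period
- One-hot encode the behavior type (values 1-6)
- Group and count by user-category and by user-category-product, then compute:
  - the user's behavior counts on other products in the same category
  - cumulative behavior counts over different windows (3, 5, 7, 10, 15, 21, 30 days)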
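Cumulative user features:
- Split by time period
- For each behavior type of a user:
  - purchase conversion rate
  - mean
Recent user behavior features:
- Building on the cumulative user features above, extract the user's features for the last month and for the last three days, then compute the share of the month's behavior that falls outside the last three days
User behavior on products within the same category:
- Counts of each behavior type per user and category
- The share of a user's per-category behavior counts in that user's counts over all categories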
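Cumulative product features:
- Split by time period
- For each behavior type on a product:
  - purchase conversion rate
  - mean
Category features:
- For each product category within a time period:
  - purchase conversion rate
  - mean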
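As a rough illustration of the cumulative features in the outline, the sketch below counts each behavior type per user over several look-back windows and derives one purchase conversion rate. The toy data, the helper name `windowed_user_counts`, and the reading of type 1 as "browse" and type 4 as "purchase" are assumptions made for this example; the outline itself does not define them.

```python
import pandas as pd

# Toy action log; the column names mirror the action files (user_id, type, time).
toy_actions = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "type":    [1, 1, 4, 1, 1, 6],
    "time": ["2016-02-01 10:00:00", "2016-02-05 09:00:00",
             "2016-02-06 12:00:00", "2016-02-07 08:00:00",
             "2016-02-02 11:00:00", "2016-02-07 20:00:00"],
})


def windowed_user_counts(actions, end_date, days):
    """Count each behavior type per user in the `days` days before end_date (exclusive)."""
    end = pd.Timestamp(end_date)
    start = end - pd.Timedelta(days=days)
    ts = pd.to_datetime(actions["time"])
    window = actions[(ts >= start) & (ts < end)]
    counts = pd.crosstab(window["user_id"], window["type"])
    # Make sure all six type columns exist even if a type never occurs.
    counts = counts.reindex(columns=range(1, 7), fill_value=0)
    counts.columns = [f"{days}d_type_{t}" for t in counts.columns]
    return counts


# Two windows here for brevity; the outline lists 3, 5, 7, 10, 15, 21, 30.
feats = pd.concat(
    [windowed_user_counts(toy_actions, "2016-02-08", d) for d in (3, 7)],
    axis=1,
).fillna(0)

# Conversion rate over the 7-day window: purchases (assumed type 4) per
# browse (assumed type 1), set to 0 when the user never browsed.
browse = feats["7d_type_1"]
feats["7d_buy_per_browse"] = (feats["7d_type_4"] / browse).where(browse > 0, 0.0)
print(feats[["3d_type_1", "7d_type_1", "7d_type_4", "7d_buy_per_browse"]])
```

The same pattern would apply to products or categories by grouping on `sku_id` or `cate` instead of `user_id`.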
First, import the packages and define the constants and helper functions.
import time
from datetime import datetime
from datetime import timedelta
import pandas as pd
import pickle
import os
import math
import numpy as np
# Constants
action_2_path = r'data/JData_Action_201602.csv'
action_3_path = r'data/JData_Action_201603.csv'
action_4_path = r'data/JData_Action_201604.csv'
comment_path = r'data/JData_Comment.csv'
product_path = r'data/JData_Product.csv'
user_path = r'data/JData_User.csv'
comment_date = [
    "2016-02-01", "2016-02-08", "2016-02-15", "2016-02-22", "2016-02-29",
    "2016-03-07", "2016-03-14", "2016-03-21", "2016-03-28", "2016-04-04",
    "2016-04-11", "2016-04-15"
]
# Helper functions
def get_actions_0():
    # Raw February actions, without any dtype conversion
    action = pd.read_csv(action_2_path)
    return action

def get_actions_2():
    action = pd.read_csv(action_2_path)
    # Cast to float32 to halve memory relative to float64.
    # Note: float32 represents integers exactly only up to 2**24.
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    return action

'''
Why the cast matters: the value 100 fits in a single byte as an 8-bit integer,
while the string "100" needs three bytes of character data. Different data types
use different numbers of bytes per value. sys.getsizeof(a) returns the size of
the Python object a in bytes (including object overhead).
https://blog.csdn.net/u013679490/article/details/54408326
'''

def get_actions_3():
    action = pd.read_csv(action_3_path)
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    return action

def get_actions_4():
    action = pd.read_csv(action_4_path)
    action[['user_id','sku_id','model_id','type','cate','brand']] = action[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
    return action
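To make the memory argument concrete, here is a small sketch comparing the footprint of a float64 column with its float32 version, and showing where float32 stops representing integers exactly. The column contents are synthetic; whether every ID column in the real files stays below that limit is not checked here and would need to be verified on the data.

```python
import numpy as np
import pandas as pd

n = 100_000
ids = pd.DataFrame({"user_id": np.arange(n, dtype="float64")})

# memory_usage(index=False) reports the bytes used by the values only.
bytes_f64 = ids["user_id"].memory_usage(index=False)
bytes_f32 = ids["user_id"].astype("float32").memory_usage(index=False)
print(bytes_f64, bytes_f32)  # 8 bytes vs. 4 bytes per value

# float32 has a 24-bit significand, so integers are exact only up to 2**24.
exact_limit = 2 ** 24
survives = np.float32(exact_limit) == exact_limit
collides = np.float32(exact_limit + 1) == np.float32(exact_limit)
print(survives, collides)
```

If an ID column could exceed 2**24, an integer dtype such as `int32` would avoid collisions, at the cost of not being able to hold missing values as NaN.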
# If the machine has enough memory, chunked reading is unnecessary
def get_actions_2_chunk():
    reader = pd.read_csv(action_2_path, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            # Cast each chunk; the reader object itself cannot be indexed by column
            chunk[['user_id','sku_id','model_id','type','cate','brand']] = chunk[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
    return action

def get_actions_3_chunk():
    reader = pd.read_csv(action_3_path, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            chunk[['user_id','sku_id','model_id','type','cate','brand']] = chunk[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
    return action

def get_actions_4_chunk():
    reader = pd.read_csv(action_4_path, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(50000)
            chunk[['user_id','sku_id','model_id','type','cate','brand']] = chunk[['user_id','sku_id','model_id','type','cate','brand']].astype('float32')
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    action = pd.concat(chunks, ignore_index=True)
    return action
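An alternative way to write the chunked read is to pass `chunksize` to `pd.read_csv` and iterate over the result with a plain `for` loop. The sketch below uses an in-memory CSV with made-up rows in place of the real action files, so the column set and chunk size are illustrative only.

```python
import io
import pandas as pd

# Stand-in for one of the action CSV files (hypothetical rows).
csv_text = "user_id,sku_id,type\n" + "\n".join(
    f"{u},{u * 10},{u % 6 + 1}" for u in range(1, 11)
)

chunks = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Downcast each chunk before keeping it, so the full-width copy
    # never has to exist for the whole file at once.
    chunks.append(chunk.astype("float32"))

action = pd.concat(chunks, ignore_index=True)
print(len(chunks), action.shape, action.dtypes.unique())
```

The loop ends on its own when the file is exhausted, so no explicit `StopIteration` handling is needed.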
# Read and concatenate all action log files
def get_all_action():
    action_2 = get_actions_2()
    action_3 = get_actions_3()
    action_4 = get_actions_4()
    actions = pd.concat([action_2, action_3, action_4])  # type: pd.DataFrame
    return actions

# Get the action records within a given time period
def get_actions(start_date, end_date, all_actions):
    """
    :param start_date: start of the period (inclusive), e.g. '2016-02-01'
    :param end_date: end of the period (exclusive)
    :param all_actions: DataFrame with all action records
    :return: actions: pd.DataFrame
    """
    actions = all_actions[(all_actions.time >= start_date) & (all_actions.time < end_date)].copy()
    return actions
train_start_date = '2016-02-01'
train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)
train_end_date = train_end_date.strftime('%Y-%m-%d')
day = 3
start_date = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=day)
start_date = start_date.strftime('%Y-%m-%d')
print(start_date, '-->', train_end_date)
# Load the data before inspecting its shape
all_actions = get_all_action()
print(all_actions.shape)  # (50601736, 7)
actions = get_actions(start_date, train_end_date, all_actions)
actions = actions[['user_id', 'cate', 'type']]
actions.head()
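`get_actions` filters by comparing the `time` column with date strings directly. That works as long as the timestamps are zero-padded strings of the form `YYYY-MM-DD HH:MM:SS`, because such strings sort in chronological order. The sketch below checks this on a few made-up rows; the assumption about the timestamp format is mine and should be confirmed against the actual files.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "time": ["2016-01-31 23:59:59", "2016-02-01 00:00:00",
             "2016-02-03 23:59:59", "2016-02-04 00:00:00"],
})

start_date, end_date = "2016-02-01", "2016-02-04"

# Plain string comparison, as in get_actions: start inclusive, end exclusive.
by_string = toy[(toy.time >= start_date) & (toy.time < end_date)]

# The same filter on parsed timestamps.
ts = pd.to_datetime(toy["time"])
by_timestamp = toy[(ts >= pd.Timestamp(start_date)) & (ts < pd.Timestamp(end_date))]

print(by_string["user_id"].tolist())
```

Both filters select the same rows here; parsing to timestamps is the safer choice if the format is not guaranteed.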
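User features
Basic user features
Get the basic user attributes. Since most user attributes are categorical, one-hot encode age, sex, and user_lv_cd; the registration date is left unprocessed for now.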
from sklearn import preprocessing

def get_basic_user_feat():
    # Handling the Chinese characters in the age column: specify the encoding when reading,
    # deal with missing values, convert to numeric, then one-hot encode.
    # sex is converted to a numeric type as well.
    user = pd.read_csv(user_path, encoding='gbk')
    # Filling missing values is left commented out; rows with missing values are dropped instead
    #user['age'].fillna('-1', inplace=True)
    #user['sex'].fillna(2, inplace=True)
    user.dropna(axis=0, how='any', inplace=True)
    user['sex'] = user['sex'].astype(int)
    user['age'] = user['age'].astype(int)
    le = preprocessing.LabelEncoder()  # Method 1
    # Keep the index of `user` so that pd.concat below aligns row by row
    age_codes = pd.Series(le.fit_transform(user['age']), index=user.index)
    age_df = pd.get_dummies(age_codes, prefix='age')
    print(list(le.classes_))
    # age_df = pd.get_dummies(user['age'], prefix='age')
    sex_df = pd.get_dummies(user['sex'], prefix='sex')
    user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')
    user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)
    return user

basic_user_feat = get_basic_user_feat()
basic_user_feat.head(10)
from sklearn import preprocessing

def get_basic_user_feat():
    # Handling the Chinese characters in the age column: specify the encoding when reading,
    # deal with missing values, convert to numeric, then one-hot encode.
    # sex is converted to a numeric type as well.
    user = pd.read_csv(user_path, encoding='gbk')
    # Filling missing values is left commented out; rows with missing values are dropped instead
    #user['age'].fillna('-1', inplace=True)
    #user['sex'].fillna(2, inplace=True)
    user.dropna(axis=0, how='any', inplace=True)
    user['sex'] = user['sex'].astype(int)
    user['age'] = user['age'].astype(int)
    # le = preprocessing.LabelEncoder()
    # age_df = le.fit_transform(user['age'])
    # age_df = pd.get_dummies(age_df, prefix='age')
    # print(list(le.classes_))
    age_df = pd.get_dummies(user['age'], prefix='age')  # Method 2
    sex_df = pd.get_dummies(user['sex'], prefix='sex')
    user_lv_df = pd.get_dummies(user['user_lv_cd'], prefix='user_lv_cd')
    user = pd.concat([user['user_id'], age_df, sex_df, user_lv_df], axis=1)
    return user

basic_user_feat = get_basic_user_feat()
basic_user_feat.head(10)
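Comparing Method 1 and Method 2: the output of Method 2 consists entirely of integers, and Method 2's age_-1 corresponds to Method 1's age_0, because LabelEncoder maps the sorted age values to 0..n-1. The column labels differ, but the encoding is essentially the same. Method 1 can show floats if the dummy frame built from the encoder's NumPy output does not share the index of `user`, since `pd.concat` then fills the unmatched rows with NaN.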
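The comparison can be reproduced on a few made-up rows. To keep the sketch dependent on pandas only, `pd.factorize(..., sort=True)` stands in for `LabelEncoder` here (both map the sorted unique values to 0..n-1); the user table is hypothetical.

```python
import pandas as pd

# Hypothetical user table with one incomplete row that gets dropped.
user = pd.DataFrame({
    "user_id": [10, 11, 12, 13],
    "age": [-1, 2, None, 3],
}).dropna()
user["age"] = user["age"].astype(int)

# Stand-in for LabelEncoder: sorted unique values mapped to 0..n-1.
codes, classes = pd.factorize(user["age"], sort=True)

# Method 1, keeping the index of `user` on the encoded values.
m1 = pd.get_dummies(pd.Series(codes, index=user.index), prefix="age")
# Method 2, encoding the raw values directly.
m2 = pd.get_dummies(user["age"], prefix="age")

print(list(m1.columns), list(m2.columns))
same_matrix = (m1.to_numpy() == m2.to_numpy()).all()

# Without the shared index, concat aligns on labels and pads with NaN.
misaligned = pd.concat(
    [user["user_id"], pd.get_dummies(pd.Series(codes), prefix="age")], axis=1
)
print(misaligned.isna().any().any())
```

The two dummy matrices are identical apart from their column names, while the frame built without the shared index picks up NaN rows after the dropped record.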
Product features
Basic product features
Get the basic features from the product file: one-hot encode the attributes a1, a2, a3, and use the product category and brand directly as features.
def get_basic_product_feat():
    product = pd.read_csv(product_path)
    attr1_df = pd.get_dummies(product["a1"], prefix="a1")
    attr2_df = pd.get_dummies(product["a2"], prefix="a2")
    attr3_df = pd.get_dummies(product["a3"], prefix="a3")
    # cate and brand are kept as raw columns next to the one-hot attributes
    product = pd.concat([product[['sku_id', 'cate', 'brand']], attr1_df, attr2_df, attr3_df], axis=1)
    return product

basic_product_feat = get_basic_product_feat()
basic_product_feat.head()
Comment features
- Split by time period
- One-hot encode the comment count (the logic behind this is not entirely clear to me)
def get_comments_product_feat(end_date):
    comments = pd.read_csv(comment_path)
    comment_date_end = end_date
    comment_date_begin = comment_date[0]
    # Use the latest comment snapshot date that is strictly before end_date
    for date in reversed(comment_date):
        if date < comment_date_end:
            comment_date_begin = date
            break
    comments = comments[comments.dt == comment_date_begin]
    df = pd.get_dummies(comments['comment_num'], prefix='comment_num')
    # Guard against periods in which some comment count level (e.g. 0) does not occur;
    # this happened on the test set
    for i in range(0, 5):
        if 'comment_num_' + str(i) not in df.columns:
            df['comment_num_' + str(i)] = 0
    df = df[['comment_num_0', 'comment_num_1', 'comment_num_2', 'comment_num_3', 'comment_num_4']]
    comments = pd.concat([comments, df], axis=1)  # type: pd.DataFrame
    #del comments['dt']
    #del comments['comment_num']
    comments = comments[['sku_id', 'has_bad_comment', 'bad_comment_rate', 'comment_num_0', 'comment_num_1',
                         'comment_num_2', 'comment_num_3', 'comment_num_4']]
    return comments
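The logic of the function above has two parts: pick the latest comment snapshot before `end_date`, then one-hot encode the comment count level while guaranteeing that all five level columns exist. The sketch below replays both steps on made-up data; the snapshot list is shortened and the rows are hypothetical, and `reindex` is used here as a compact alternative to the column-by-column loop.

```python
import pandas as pd

# Part 1: choose the latest snapshot date strictly before end_date.
snapshot_dates = ["2016-02-01", "2016-02-08", "2016-02-15"]
end_date = "2016-02-10"
begin = max((d for d in snapshot_dates if d < end_date), default=snapshot_dates[0])

# Part 2: one-hot encode the levels; levels 0 and 4 do not occur in this toy snapshot.
comments = pd.DataFrame({
    "sku_id": [101, 102, 103],
    "comment_num": [1, 3, 2],
})
expected = [f"comment_num_{i}" for i in range(5)]
one_hot = (
    pd.get_dummies(comments["comment_num"], prefix="comment_num")
    .reindex(columns=expected, fill_value=0)  # add missing levels as all-zero columns
    .astype(int)
)

result = pd.concat([comments[["sku_id"]], one_hot], axis=1)
print(begin)
print(result)
```

Fixing the column set this way keeps the feature matrix the same width for every time period, which is what the guard loop in the function is for.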
train_start_date = '2016-02-01'
train_end_date = datetime.strptime(train_start_date, '%Y-%m-%d') + timedelta(days=3)
train_end_date = train_end_date.strftime('%Y-%m-%d')
day = 3
start_date = datetime.strptime(train_end_date, '%Y-%m-%d') - timedelta(days=day)
start_date = start_date.strftime('%Y-%m-%d')
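Behavior features
Split by time period
- One-hot encode the behavior type
- Group and count by user-category and by user-category-product, then compute:
  - the user's behavior counts on other products in the same category
  - the difference between the user's behavior count on the target product in the same category and the mean behavior count over the period
Running this step raised an error, so I did not continue; the cause of the error is unknown.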
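Setting the error aside (it cannot be diagnosed from the notes alone), the first steps listed in this section can be sketched on a small hypothetical frame: one-hot encode the type, count per user-category-product and per user-category, and obtain the counts on other products of the same category as the difference of the two. The data and the `action_*` column names are assumptions for illustration, and the mean-difference feature is not covered.

```python
import pandas as pd

# Hypothetical actions inside one time window.
actions = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2],
    "sku_id":  [10, 10, 11, 20, 10],
    "cate":    [8, 8, 8, 9, 8],
    "type":    [1, 6, 1, 1, 4],
})

# One-hot encode the behavior type; force columns for all six types.
type_cols = [f"action_{t}" for t in range(1, 7)]
dummies = (
    pd.get_dummies(actions["type"], prefix="action")
    .reindex(columns=type_cols, fill_value=0)
    .astype(int)
)
encoded = pd.concat([actions[["user_id", "sku_id", "cate"]], dummies], axis=1)

# Counts per user-category-product and per user-category.
per_sku = encoded.groupby(["user_id", "cate", "sku_id"], as_index=False)[type_cols].sum()
per_cate = encoded.groupby(["user_id", "cate"], as_index=False)[type_cols].sum()

# Behavior on other products of the same category =
# category total minus the count on the product itself.
merged = per_sku.merge(per_cate, on=["user_id", "cate"], suffixes=("", "_cate"))
for col in type_cols:
    merged[col + "_others"] = merged[col + "_cate"] - merged[col]

print(merged[["user_id", "cate", "sku_id", "action_1", "action_1_others"]])
```

Running the real pipeline on a small slice like this first can help narrow down where an error originates before loading the full action files.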