Null-value check
 | Name | Shape | inTarget |
---|---|---|---|
0 | Initial Data | (478138, 27) | (9021, 27) |
1 | instance_id is -1: | (0, 27) | (0, 27) |
2 | item_id is -1: | (0, 27) | (0, 27) |
3 | item_category_list is -1: | (0, 27) | (0, 27) |
4 | item_property_list is -1: | (0, 27) | (0, 27) |
5 | item_brand_id is -1: | (473, 27) | (9, 27) |
6 | item_city_id is -1: | (277, 27) | (0, 27) |
7 | item_price_level is -1: | (0, 27) | (0, 27) |
8 | item_sales_level is -1: | (913, 27) | (3, 27) |
9 | item_collected_level is -1: | (0, 27) | (0, 27) |
10 | item_pv_level is -1: | (0, 27) | (0, 27) |
11 | user_id is -1: | (0, 27) | (0, 27) |
12 | user_gender_id is -1: | (12902, 27) | (166, 27) |
13 | user_age_level is -1: | (964, 27) | (12, 27) |
14 | user_occupation_id is -1: | (964, 27) | (12, 27) |
15 | user_star_level is -1: | (964, 27) | (12, 27) |
16 | context_id is -1: | (0, 27) | (0, 27) |
17 | context_timestamp is -1: | (0, 27) | (0, 27) |
18 | context_page_id is -1: | (0, 27) | (0, 27) |
19 | predict_category_property is -1: | (0, 27) | (0, 27) |
20 | shop_id is -1: | (0, 27) | (0, 27) |
21 | shop_review_num_level is -1: | (0, 27) | (0, 27) |
22 | shop_review_positive_rate is -1: | (7, 27) | (0, 27) |
23 | shop_star_level is -1: | (0, 27) | (0, 27) |
24 | shop_score_service is -1: | (59, 27) | (3, 27) |
25 | shop_score_delivery is -1: | (59, 27) | (3, 27) |
26 | shop_score_description is -1: | (59, 27) | (3, 27) |
27 | is_trade is -1: | (0, 27) | (0, 27) |
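A table like the one above can be produced with a short loop. The following is a minimal sketch, assuming the data sits in a pandas DataFrame, that missing values are encoded as -1 (as the table suggests), and that is_trade == 1 marks a positive sample. The helper name `null_report` and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd


def null_report(df, target="is_trade", null_value=-1):
    """Count rows equal to null_value per column, overall and among positives."""
    rows = []
    for col in df.columns:
        is_null = df[col] == null_value
        rows.append({
            "Name": col,
            "null_rows": int(is_null.sum()),
            "null_rows_in_target": int((is_null & (df[target] == 1)).sum()),
        })
    return pd.DataFrame(rows)


# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "item_brand_id": [11, -1, 13, -1],
    "user_gender_id": [0, 1, -1, -1],
    "is_trade": [0, 1, 0, 1],
})
report = null_report(toy)
print(report)
```

Note that string-typed columns never compare equal to the integer -1, so a field stored as text would need `null_value="-1"` instead.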
Null-value handling
1. Fields with only a few missing values (-1):
Field | Null count | Null count in positive samples |
---|---|---|
item_brand_id | 473 | 9 |
item_city_id | 277 | 0 |
item_sales_level | 913 | 3 |
user_age_level | 964 | 12 |
user_occupation_id | 964 | 12 |
user_star_level | 964 | 12 |
shop_score_service | 59 | 3 |
shop_score_delivery | 59 | 3 |
shop_score_description | 59 | 3 |
shop_review_positive_rate | 7 | 0 |
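These missing values barely affect the positive samples, so the affected rows are simply dropped.
We also found that, within a single record, the missing values of the User attributes occur together, and the same holds for the Shop attributes.
2. For user_gender_id, the distribution (visualization link) shows that "family" accounts are rare and that a family includes both men and women, so the missing values are filled with 2 for now.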
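The two steps above (dropping the rows with rare missing values, then filling the gender field) could be sketched as follows. This is only an illustration under stated assumptions: -1 marks a missing value, and the filled field is user_gender_id, the one field with many nulls that is not in the drop list. The names `drop_cols` and `clean_nulls` and the toy frame are hypothetical.

```python
import pandas as pd

# Columns whose -1 rows are dropped (taken from the table above).
drop_cols = [
    "item_brand_id", "item_city_id", "item_sales_level",
    "user_age_level", "user_occupation_id", "user_star_level",
    "shop_score_service", "shop_score_delivery", "shop_score_description",
    "shop_review_positive_rate",
]


def clean_nulls(df, null_value=-1, gender_fill=2):
    """Drop rows with a null in any drop column, then fill user_gender_id."""
    present = [c for c in drop_cols if c in df.columns]
    keep = ~(df[present] == null_value).any(axis=1)
    out = df.loc[keep].copy()
    out["user_gender_id"] = out["user_gender_id"].replace(null_value, gender_fill)
    return out


# Toy data (values are made up).
toy = pd.DataFrame({
    "item_brand_id": [5, -1, 7, 8],
    "user_age_level": [1003, 1004, -1, 1002],
    "user_gender_id": [0, 1, 1, -1],
})
cleaned = clean_nulls(toy)
print(cleaned)
```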
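Handling of selected fields
1. The dataset has three list-type fields: item_category_list, item_property_list, and predict_category_property. Each of them is split as described below.
① item_category_list: looking at the list lengths, some records have a second subcategory. However, these extra second subcategories all belong to the same first subcategory, and none of those records is a positive sample, so we ignore them and still keep only the first two levels.
 | item_category_list | is_trade | item_category_list_len |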
---|---|---|---|
9251 | [2642175453151805566, 8868887661186419229] | 0 | 2 |
9252 | [2642175453151805566, 8868887661186419229] | 0 | 2 |
15749 | [2642175453151805566, 6233669177166538628] | 0 | 2 |
15750 | [2642175453151805566, 6233669177166538628] | 0 | 2 |
15751 | [2642175453151805566, 6233669177166538628] | 0 | 2 |
Reference code
# Take the first two entries of the ';'-separated list: top-level category and first subcategory
data_train03["item_category_sup"] = data_train03["item_category_list"].map(lambda x:x.split(";")[0])
data_train03["item_category_sub"] = data_train03["item_category_list"].map(lambda x:x.split(";")[1])
The two new columns
 | instance_id | item_category_sup | item_category_sub |
---|---|---|---|
0 | 0 | 7908382889764677758 | 5799347067982556520 |
1 | 1 | 7908382889764677758 | 5799347067982556520 |
2 | 2 | 7908382889764677758 | 5799347067982556520 |
3 | 3 | 7908382889764677758 | 5799347067982556520 |
4 | 4 | 7908382889764677758 | 5799347067982556520 |
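The length check described in ① can be done before splitting. A minimal sketch, assuming the raw item_category_list is a ';'-separated string as in the reference code; the toy IDs are made up and much shorter than the real ones.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "item_category_list": ["100;200", "100;200;300", "100;201"],
    "is_trade": [1, 0, 0],
})

parts = toy["item_category_list"].str.split(";")
toy["item_category_list_len"] = parts.str.len()

# Records with more than two levels, and how many of them are positive samples.
extra = toy[toy["item_category_list_len"] > 2]
print(len(extra), int(extra["is_trade"].sum()))

# Keep only the first two levels regardless of the list length.
toy["item_category_sup"] = parts.str[0]
toy["item_category_sub"] = parts.str[1]
```

If the second printed number is 0 on the real data, dropping the extra level loses no positive samples, which is the observation the text relies on.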
② item_property_list: split it directly into a set
Reference code
data_train03["item_property_list"] = data_train03["item_property_list"].map(lambda x:x.split(";"))
data_train05["item_property_list"] = data_train05["item_property_list"].map(lambda x: set(x))
The column after the change
 | instance_id | item_property_list |
---|---|---|
0 | 0 | [2072967855524022579, 5131280576272319091, 263... |
1 | 1 | [2072967855524022579, 5131280576272319091, 263... |
2 | 2 | [2072967855524022579, 5131280576272319091, 263... |
3 | 3 | [2072967855524022579, 5131280576272319091, 263... |
4 | 4 | [2072967855524022579, 5131280576272319091, 263... |
③ predict_category_property: put the categories and the properties into two separate columns, stored as sets (deduplicated) and ignoring the relationship between them for now, and drop the -1 values
Reference code
data_train04["predict_category_property2"] = data_train04["predict_category_property"].map(lambda x:x.split(";"))
# Extract predict_category
def get0(s):
    return s.split(":")[0]
# Extract predict_property as a 2-D list
def get0_(s):
    if (int(s.split(":")[0])!=(-1)):
        return s.split(":")[1].split(",")
    else:
        return [-1]
# list(...) makes the result a real list under Python 3, where map() returns an iterator
data_train04["predict_category_list"] = data_train04["predict_category_property2"].map(lambda line:list(map(get0,line)))
data_train04["predict_property_list"] = data_train04["predict_category_property2"].map(lambda line:list(map(get0_,line)))
# Flatten the 2-D list into a 1-D list
data_train04["predict_property_list"] = data_train04["predict_property_list"].map(lambda x: [j for i in x for j in i])
# Keep only the predict_property values that are not -1
data_train04["predict_property_list"] = data_train04["predict_property_list"].map(lambda line:[i for i in line if int(i)!=(-1)])
data_train04[["instance_id","predict_category_list","predict_property_list"]].head()
data_train05["predict_category_list"] = data_train05["predict_category_list"].map(lambda x: set(x))
data_train05["predict_property_list"] = data_train05["predict_property_list"].map(lambda x: set(x))
The two newly added columns
 | instance_id | predict_category_list | predict_property_list |
---|---|---|---|
0 | 0 | [5799347067982556520, 509660095530134768, 5755... | [9148482949976129397] |
1 | 1 | [5799347067982556520, 7908382889764677758] | [9172976955054793469, 1787573075717641245, 917... |
2 | 2 | [5799347067982556520, 7258015885215914736, 790... | [5131280576272319091, 5131280576272319091, 513... |
3 | 3 | [509660095530134768, 5799347067982556520, 7908... | [1787573075717641245, 9148482949976129397, 914... |
4 | 4 | [5799347067982556520, 7908382889764677758] | [9172976955054793469, 9172976955054793469] |
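The same parsing can also be written as one function that returns both sets at once. This is a sketch rather than the authors' code: it assumes the format implied by the reference code (entries separated by ';', category and properties separated by ':', properties separated by ',', and -1 meaning "nothing predicted"). The function name `parse_predicted` is hypothetical. Unlike the reference code, it also drops -1 from the category set.

```python
def parse_predicted(raw):
    """Split 'cat:prop,prop;cat:prop' into (categories, properties) sets."""
    categories, properties = set(), set()
    for entry in raw.split(";"):
        category, _, props = entry.partition(":")
        if category == "-1":
            continue  # no prediction for this entry
        categories.add(category)
        # Skip empty strings and the -1 placeholder among the properties.
        properties.update(p for p in props.split(",") if p and p != "-1")
    return categories, properties


# Toy input (IDs are made up).
cats, props = parse_predicted("10:1,2;20:-1;30:2,3")
print(cats, props)
```

On a DataFrame, something like `df["predict_category_property"].map(parse_predicted)` would give one tuple per row, which can then be unpacked into two columns.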
In the end we have three set-valued columns, plus the number of elements in each set (which will probably be useful later)
 | item_property_list | predict_category_list | predict_property_list | item_property_num | predict_category_num | predict_property_num |
---|---|---|---|---|---|---|
0 | {5131280576272319091, 3408398779125901630, 815... | {509660095530134768, 7908382889764677758, 5755... | {9148482949976129397} | 22 | 5 | 1 |
1 | {5131280576272319091, 3408398779125901630, 815... | {7908382889764677758, 5799347067982556520} | {1787573075717641245, 9148482949976129397, 917... | 22 | 2 | 5 |
2 | {5131280576272319091, 3408398779125901630, 815... | {7258015885215914736, 7908382889764677758, 579... | {5131280576272319091} | 22 | 3 | 1 |
3 | {5131280576272319091, 3408398779125901630, 815... | {509660095530134768, 1950314698730389427, 7492... | {9148482949976129397, 4038060334629950706, 664... | 22 | 5 | 8 |
4 | {5131280576272319091, 3408398779125901630, 815... | {7908382889764677758, 5799347067982556520} | {9172976955054793469} | 22 | 2 | 1 |
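The three *_num columns in the table are just the sizes of the sets. A minimal sketch of how they could be computed, assuming the three columns already hold Python sets; the toy values are made up.

```python
import pandas as pd

# Toy data: each cell already holds a set (values are made up).
toy = pd.DataFrame({
    "item_property_list": [{"a", "b", "c"}, {"a"}],
    "predict_category_list": [{"x"}, {"x", "y"}],
    "predict_property_list": [set(), {"p", "q"}],
})

for col in ["item_property_list", "predict_category_list", "predict_property_list"]:
    # item_property_list -> item_property_num, and so on.
    num_col = col.replace("_list", "_num")
    toy[num_col] = toy[col].map(len)

print(toy[["item_property_num", "predict_category_num", "predict_property_num"]])
```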
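2. Time conversion: the timestamps in the data are given in seconds; convert them to a date format
Reference code
import time
# Time conversion (unit: seconds)
data_train05['context_timestamp'] = data_train05['context_timestamp'].map(lambda x:
    time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(x)))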
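One caveat with time.localtime is that the result depends on the time zone of the machine running the code. A variant with an explicit time zone is sketched below. The choice of UTC+8 is an assumption made for illustration, not something the data description confirms, and the function name `to_datetime_str` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps should be read in UTC+8; adjust if the data says otherwise.
TZ = timezone(timedelta(hours=8))


def to_datetime_str(ts, tz=TZ):
    """Format a Unix timestamp (seconds) as 'YYYY-MM-DD HH:MM:SS' in a fixed zone."""
    return datetime.fromtimestamp(ts, tz).strftime("%Y-%m-%d %H:%M:%S")


print(to_datetime_str(0))  # 1970-01-01 08:00:00
```

It can be applied the same way as the reference code, for example `data_train05['context_timestamp'].map(to_datetime_str)`.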
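3. Data offset: some fields (shop star level, page ID, and so on) appear to have been shifted, possibly as part of anonymization and to make the fields easier to tell apart. The values are all four-digit numbers, and the thousands digit seems to be a per-field tag, so we take the values modulo 1000; otherwise, as raw numbers, they vary very little relative to their size.
Reference code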
data_train05['context_page_id'] = data_train05['context_page_id']%1000
data_train05['shop_star_level'] = (data_train05['shop_star_level']+1)%1000
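The guess about the thousands digit can be checked before applying the modulo. A minimal sketch with made-up toy values: if the thousands part is a constant tag, each column should show a single prefix. The toy shop_star_level values are chosen so that one of them ends in 999, which is presumably the kind of case the +1 in the reference code is meant to handle; that reading is an assumption.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "context_page_id": [4001, 4002, 4017],
    "shop_star_level": [5002, 5010, 4999],
})

# Distinct thousands prefixes per column.
prefixes = {col: sorted((toy[col] // 1000).unique().tolist()) for col in toy.columns}
print(prefixes)

# After shifting by 1, the star levels share a single prefix.
shifted = sorted(((toy["shop_star_level"] + 1) // 1000).unique().tolist())
print(shifted)
```

A column with more than one prefix would lose information under a plain `% 1000`, so it is worth running this check on the real data first.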
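Closing remarks
That is about everything we have thought of so far. We will keep this post updated, and we welcome suggestions in the comments.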
Authors:
Sun Jiazheng
Yu HaoXiong
Yan Zhi
Zhang Jinwei