[Tianchi Notes] IJCAI-18: Initial Data Cleaning


Missing-value check

    Name                              Shape         inTarget
0   Initial Data                      (478138, 27)  (9021, 27)
1   instance_id is -1:                (0, 27)       (0, 27)
2   item_id is -1:                    (0, 27)       (0, 27)
3   item_category_list is -1:         (0, 27)       (0, 27)
4   item_property_list is -1:         (0, 27)       (0, 27)
5   item_brand_id is -1:              (473, 27)     (9, 27)
6   item_city_id is -1:               (277, 27)     (0, 27)
7   item_price_level is -1:           (0, 27)       (0, 27)
8   item_sales_level is -1:           (913, 27)     (3, 27)
9   item_collected_level is -1:       (0, 27)       (0, 27)
10  item_pv_level is -1:              (0, 27)       (0, 27)
11  user_id is -1:                    (0, 27)       (0, 27)
12  user_gender_id is -1:             (12902, 27)   (166, 27)
13  user_age_level is -1:             (964, 27)     (12, 27)
14  user_occupation_id is -1:         (964, 27)     (12, 27)
15  user_star_level is -1:            (964, 27)     (12, 27)
16  context_id is -1:                 (0, 27)       (0, 27)
17  context_timestamp is -1:          (0, 27)       (0, 27)
18  context_page_id is -1:            (0, 27)       (0, 27)
19  predict_category_property is -1:  (0, 27)       (0, 27)
20  shop_id is -1:                    (0, 27)       (0, 27)
21  shop_review_num_level is -1:      (0, 27)       (0, 27)
22  shop_review_positive_rate is -1:  (7, 27)       (0, 27)
23  shop_star_level is -1:            (0, 27)       (0, 27)
24  shop_score_service is -1:         (59, 27)      (3, 27)
25  shop_score_delivery is -1:        (59, 27)      (3, 27)
26  shop_score_description is -1:     (59, 27)      (3, 27)
27  is_trade is -1:                   (0, 27)       (0, 27)
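
A table like the one above could be produced roughly as follows. This is only a sketch, not the code we actually ran: the helper name `missing_report` and the toy frame are made up for illustration; only the `-1` missing marker and the `is_trade` target column come from the data.

```python
import pandas as pd

def missing_report(df, target="is_trade", missing=-1):
    """Count rows where each column equals the missing marker,
    overall (Shape) and among positive samples (inTarget)."""
    positives = df[df[target] == 1]
    rows = []
    for col in df.columns:
        rows.append({
            "Name": f"{col} is {missing}:",
            "Shape": df[df[col] == missing].shape,
            "inTarget": positives[positives[col] == missing].shape,
        })
    return pd.DataFrame(rows)

# Toy data standing in for the real training set (hypothetical values).
toy = pd.DataFrame({
    "item_brand_id": [10, -1, 12, -1],
    "user_gender_id": [0, 1, -1, -1],
    "is_trade": [0, 1, 0, 1],
})
report = missing_report(toy)
print(report)
```

The first element of each shape tuple is the number of affected rows, which is how the table above should be read.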

Handling missing values

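1. Fields with few missing values:

Field                      Missing  Missing in positive samples
item_brand_id              473      9
item_city_id               277      0
item_sales_level           913      3
user_age_level             964      12
user_occupation_id         964      12
user_star_level            964      12
shop_score_service         59       3
shop_score_delivery        59       3
shop_score_description     59       3
shop_review_positive_rate  7        0

These have little effect on the positive samples, so the affected rows are simply dropped.

We also found that, within a single record, the missing values of the User attributes occur together, and likewise for the Shop attributes.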
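
The drop and the "missing together" observation can be sketched like this. The helper names `drop_missing_rows` and `missing_together` and the toy values are assumptions for illustration; the column names are the ones from the table above.

```python
import pandas as pd

# Columns taken from the table above.
DROP_IF_MISSING = [
    "item_brand_id", "item_city_id", "item_sales_level",
    "user_age_level", "user_occupation_id", "user_star_level",
    "shop_score_service", "shop_score_delivery", "shop_score_description",
    "shop_review_positive_rate",
]

def drop_missing_rows(df, cols=DROP_IF_MISSING, missing=-1):
    """Return a copy of df without rows where any listed column is missing."""
    present = [c for c in cols if c in df.columns]
    keep = ~(df[present] == missing).any(axis=1)
    return df[keep].copy()

def missing_together(df, cols, missing=-1):
    """True if in every row the listed columns are all missing or all present."""
    flags = df[cols] == missing
    return bool((flags.all(axis=1) | ~flags.any(axis=1)).all())

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "user_age_level": [1003, -1, 1005],
    "user_occupation_id": [2005, -1, 2002],
    "user_star_level": [3001, -1, 3006],
    "item_brand_id": [5, 7, -1],
})
user_cols = ["user_age_level", "user_occupation_id", "user_star_level"]
together = missing_together(toy, user_cols)
mixed = missing_together(toy, ["user_age_level", "item_brand_id"])
cleaned = drop_missing_rows(toy)
print(together, mixed, len(cleaned))
```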

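2. user_gender_id: the distribution (visualization link) shows that the family category is rare, and a family includes both men and women, so we tentatively fill the missing values with 2 (family).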
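
A minimal sketch of this fill, assuming the field in question is `user_gender_id` (the only remaining field with missing values) and that 2 is the family code; the toy values are made up.

```python
import pandas as pd

# Toy column (hypothetical values): -1 marks a missing gender.
toy = pd.DataFrame({"user_gender_id": [0, 1, -1, 2, -1]})

# Look at the distribution first, then replace the missing marker with 2.
print(toy["user_gender_id"].value_counts())
toy["user_gender_id"] = toy["user_gender_id"].replace(-1, 2)
print(toy["user_gender_id"].tolist())
```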


Processing selected fields

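1. The dataset has three list-type fields: item_category_list, item_property_list, and predict_category_property. We split and process each of them.

        ① item_category_list: looking at the length of the category list, some records have a second subcategory, but all of these extra second subcategories belong to the same first subcategory and none of those records is a positive sample, so we ignore them and still keep only two levels.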

           

       item_category_list                          is_trade  item_category_list_len
9251   [2642175453151805566, 8868887661186419229]  0         2
9252   [2642175453151805566, 8868887661186419229]  0         2
15749  [2642175453151805566, 6233669177166538628]  0         2
15750  [2642175453151805566, 6233669177166538628]  0         2
15751  [2642175453151805566, 6233669177166538628]  0         2

            Reference code

data_train03["item_category_sup"] = data_train03["item_category_list"].map(lambda x:x.split(";")[0])
data_train03["item_category_sub"] = data_train03["item_category_list"].map(lambda x:x.split(";")[1])

            The two new columns

   instance_id  item_category_sup    item_category_sub
0  0            7908382889764677758  5799347067982556520
1  1            7908382889764677758  5799347067982556520
2  2            7908382889764677758  5799347067982556520
3  3            7908382889764677758  5799347067982556520
4  4            7908382889764677758  5799347067982556520
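
The length check and the split can also be written with the vectorized `.str` accessor. This is a sketch on toy strings, not our original code: the column name `n_levels` and the second toy record are hypothetical.

```python
import pandas as pd

# Toy records (hypothetical); the second one has an extra category level.
toy = pd.DataFrame({
    "item_category_list": [
        "7908382889764677758;5799347067982556520",
        "7908382889764677758;2642175453151805566;8868887661186419229",
    ],
    "is_trade": [1, 0],
})

parts = toy["item_category_list"].str.split(";")
toy["n_levels"] = parts.str.len()          # how many levels each record has
toy["item_category_sup"] = parts.str[0]    # first level
toy["item_category_sub"] = parts.str[1]    # second level

# Records with an extra level: check whether any of them is a positive sample.
extra = toy[toy["n_levels"] > 2]
print(len(extra), int(extra["is_trade"].sum()))
```

If the second number printed is 0 on the real data, dropping the extra level loses nothing for the positive samples.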

        ② item_property_list: split directly into a set

            Reference code

data_train03["item_property_list"] = data_train03["item_property_list"].map(lambda x:x.split(";"))
data_train05["item_property_list"] = data_train05["item_property_list"].map(lambda x: set(x))

            The column after the change

   instance_id  item_property_list
0  0            [2072967855524022579, 5131280576272319091, 263...
1  1            [2072967855524022579, 5131280576272319091, 263...
2  2            [2072967855524022579, 5131280576272319091, 263...
3  3            [2072967855524022579, 5131280576272319091, 263...
4  4            [2072967855524022579, 5131280576272319091, 263...

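        ③ predict_category_property: put the categories and the properties into two separate columns, stored as sets (deduplicated), ignore the relationship between them for now, and remove -1

            Reference code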

# Split the raw string into "category:prop1,prop2,..." pairs
data_train04["predict_category_property2"] = data_train04["predict_category_property"].map(lambda x:x.split(";"))

# Extract predict_category from one pair
def get0(s):
    return s.split(":")[0]

# Extract predict_property from one pair as a list, so each row becomes a 2-D list
def get0_(s):
    if int(s.split(":")[0]) != -1:
        return s.split(":")[1].split(",")
    else:
        return [-1]

# list(...) is needed on Python 3, where map() returns a one-shot iterator
data_train04["predict_category_list"] = data_train04["predict_category_property2"].map(lambda line: list(map(get0, line)))
data_train04["predict_property_list"] = data_train04["predict_category_property2"].map(lambda line: list(map(get0_, line)))
# Flatten the 2-D property list
data_train04["predict_property_list"] = data_train04["predict_property_list"].map(lambda x: [j for i in x for j in i])

# Keep only the predict_property values that are not -1
data_train04["predict_property_list"] = data_train04["predict_property_list"].map(lambda line:[i for i in line if int(i)!=(-1)])
data_train04[["instance_id","predict_category_list","predict_property_list"]].head()

# Deduplicate by converting the lists to sets
data_train05["predict_category_list"] = data_train05["predict_category_list"].map(lambda x: set(x))
data_train05["predict_property_list"] = data_train05["predict_property_list"].map(lambda x: set(x))

            The two newly added columns

   instance_id  predict_category_list                              predict_property_list
0  0            [5799347067982556520, 509660095530134768, 5755...  [9148482949976129397]
1  1            [5799347067982556520, 7908382889764677758]         [9172976955054793469, 1787573075717641245, 917...
2  2            [5799347067982556520, 7258015885215914736, 790...  [5131280576272319091, 5131280576272319091, 513...
3  3            [509660095530134768, 5799347067982556520, 7908...  [1787573075717641245, 9148482949976129397, 914...
4  4            [5799347067982556520, 7908382889764677758]         [9172976955054793469, 9172976955054793469]
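
As a worked example on a single string, the same parsing can be folded into one function. This is a sketch under assumptions: the name `split_predictions` is hypothetical, the input uses shortened made-up ids, and unlike the reference code it drops -1 from the category set as well as from the property set.

```python
def split_predictions(raw):
    """Parse 'cat:prop,prop;cat:prop;...' into (categories, properties) sets,
    skipping -1 entries and ignoring which property belongs to which category."""
    categories, properties = set(), set()
    for pair in raw.split(";"):
        category, _, props = pair.partition(":")
        if category == "-1":
            continue  # no prediction for this slot
        categories.add(category)
        properties.update(p for p in props.split(",") if p and p != "-1")
    return categories, properties

# Shortened, made-up ids for readability.
cats, props = split_predictions("5799:917,178;7908:-1;509:917;-1")
print(sorted(cats), sorted(props))
```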

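            In the end there are three new set-valued columns, plus the number of elements in each set (this will probably be useful later)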

   item_property_list                                 predict_category_list                              predict_property_list                              item_property_num  predict_category_num  predict_property_num
0  {5131280576272319091, 3408398779125901630, 815...  {509660095530134768, 7908382889764677758, 5755...  {9148482949976129397}                              22                 5                     1
1  {5131280576272319091, 3408398779125901630, 815...  {7908382889764677758, 5799347067982556520}         {1787573075717641245, 9148482949976129397, 917...  22                 2                     5
2  {5131280576272319091, 3408398779125901630, 815...  {7258015885215914736, 7908382889764677758, 579...  {5131280576272319091}                              22                 3                     1
3  {5131280576272319091, 3408398779125901630, 815...  {509660095530134768, 1950314698730389427, 7492...  {9148482949976129397, 4038060334629950706, 664...  22                 5                     8
4  {5131280576272319091, 3408398779125901630, 815...  {7908382889764677758, 5799347067982556520}         {9172976955054793469}                              22                 2                     1
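
The three `_num` columns are just the set sizes. A minimal sketch on toy sets (the values are made up; the column names follow the table above):

```python
import pandas as pd

# Toy frame with set-valued columns (hypothetical values).
toy = pd.DataFrame({
    "item_property_list": [{"a", "b", "c"}, {"a"}],
    "predict_category_list": [{"x", "y"}, set()],
    "predict_property_list": [{"p"}, {"p", "q"}],
})

# One count column per set column: *_list -> *_num
for col in ["item_property_list", "predict_category_list", "predict_property_list"]:
    toy[col.replace("_list", "_num")] = toy[col].map(len)

print(toy[["item_property_num", "predict_category_num", "predict_property_num"]])
```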

2. Time conversion: the data gives timestamps in seconds, which we convert to a date format

            Reference code

import time

# Time conversion (timestamps are in seconds)
data_train05['context_timestamp'] = data_train05['context_timestamp'].map(lambda x:
                                                                           time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(x)))
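
Note that `time.localtime` uses the time zone of whatever machine runs the code. A vectorized alternative with an explicit time zone might look like the sketch below. The choice of `Asia/Shanghai` is our assumption (the data description does not state a zone), and the column name `context_time` and the toy timestamps are made up.

```python
import pandas as pd

# Toy timestamps in seconds (hypothetical values).
toy = pd.DataFrame({"context_timestamp": [0, 1537200000]})

# Parse as UTC seconds, then shift to the assumed local zone.
local = (pd.to_datetime(toy["context_timestamp"], unit="s", utc=True)
           .dt.tz_convert("Asia/Shanghai"))
toy["context_time"] = local.dt.strftime("%Y-%m-%d %H:%M:%S")
print(toy)
```

Fixing the zone keeps the result reproducible across machines, which matters once day or hour features are derived from it.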

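3. Data offset: some fields (shop star level, page ID, etc.) look like they were shifted, probably for anonymization and to make them easier to tell apart. The values are all four-digit numbers, and the thousands digit seems to act as a category marker, so we take the values modulo 1000; otherwise the numbers barely vary relative to their size.

            Reference code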

data_train05['context_page_id'] = data_train05['context_page_id']%1000
data_train05['shop_star_level'] = (data_train05['shop_star_level']+1)%1000
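
To sanity-check this guess before applying the modulo, one can look at the value range and the thousands digit of each field. A sketch on made-up values (the toy numbers are assumptions, not taken from the data):

```python
import pandas as pd

# Toy values (hypothetical) shaped like the four-digit codes described above.
toy = pd.DataFrame({
    "context_page_id": [4001, 4002, 4017],
    "shop_star_level": [4999, 5002, 5020],
})

# Inspect min, max and the distinct thousands digits of each field.
for col in ["context_page_id", "shop_star_level"]:
    print(col, toy[col].min(), toy[col].max(), sorted((toy[col] // 1000).unique()))

page = toy["context_page_id"] % 1000
star = (toy["shop_star_level"] + 1) % 1000
print(page.tolist(), star.tolist())
```

In this toy case the `+ 1` matters because the smallest star level ends in 999: without it, 4999 would map to 999 instead of 0 and the levels would no longer be contiguous.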

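Closing remarks

That is about all we have thought of so far. We will keep updating this post, and suggestions in the comments are very welcome.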


Authors:

Sun Jiazheng

Yu HaoXiong

Yan Zhi

Zhang Jinwei










             
