As is well known, CountVectorizer simply counts occurrence frequencies. But inside a groupby object, how do you count the frequency of each class separately? I spent a whole morning googling without finding a dedicated recipe or API for this, so I'm writing my own approach down in this post.
An example of the problem:
The category_1 column of df takes only two values, Y and N. The goal is to count, for each card_id, how many Y rows and how many N rows it has, then turn those counts into new features and merge them back into the original data.
>>> df.tail(20)
card_id category_1
29112341 C_ID_f9f906a4a2 Y
29112342 C_ID_f9f906a4a2 Y
29112343 C_ID_5cf3187742 N
29112344 C_ID_5cf3187742 N
29112345 C_ID_5cf3187742 N
29112346 C_ID_803aa0aed4 N
29112347 C_ID_803aa0aed4 N
29112348 C_ID_803aa0aed4 N
29112349 C_ID_62df280b20 Y
29112350 C_ID_62df280b20 Y
29112351 C_ID_62df280b20 N
29112352 C_ID_62df280b20 N
29112353 C_ID_e49b1996b0 N
29112354 C_ID_e49b1996b0 Y
29112355 C_ID_e49b1996b0 N
29112356 C_ID_2863d2fa95 Y
29112357 C_ID_2863d2fa95 Y
29112358 C_ID_5c240d6e3c N
29112359 C_ID_5c240d6e3c N
29112360 C_ID_5c240d6e3c N
First, group by both columns and get the counts:
>>> gdf = df.groupby(['card_id', 'category_1']).size()
>>> gdf
card_id category_1
C_ID_00007093c1 N 121
Y 28
C_ID_0001238066 N 121
Y 2
C_ID_0001506ef0 N 66
C_ID_0001793786 N 214
Y 2
C_ID_000183fdda N 140
Y 4
...
C_ID_fffeed3a89 N 154
Y 14
C_ID_ffff1d9928 N 10
Y 2
C_ID_ffff579d3a N 114
C_ID_ffff756266 N 21
Y 3
C_ID_ffff828181 N 179
Y 11
C_ID_fffffd5772 N 51
Y 33
Length: 469255, dtype: int64
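The step above can be reproduced on a tiny toy frame (the ids here are made up for illustration, not from the real data):

```python
import pandas as pd

# Toy data mirroring the card_id / category_1 layout above (hypothetical ids).
df = pd.DataFrame({
    'card_id':    ['a', 'a', 'a', 'b', 'b', 'c'],
    'category_1': ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# size() counts rows per (card_id, category_1) pair and returns
# a Series indexed by a two-level MultiIndex, just like gdf above.
gdf = df.groupby(['card_id', 'category_1']).size()
print(gdf)
```

Note that pairs which never occur (here, ('b', 'Y') and ('c', 'N')) simply produce no entry in the Series rather than a zero count, which is why the filtering-and-merging dance later in this post is needed.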
Next, convert the resulting MultiIndex Series into a DataFrame for later use:
>>> df_category_1_total = gdf.reset_index(name = 'category_1_count')
>>> df_category_1_total
card_id category_1 category_1_count
0 C_ID_00007093c1 N 121
1 C_ID_00007093c1 Y 28
2 C_ID_0001238066 N 121
3 C_ID_0001238066 Y 2
4 C_ID_0001506ef0 N 66
5 C_ID_0001793786 N 214
6 C_ID_0001793786 Y 2
7 C_ID_000183fdda N 140
8 C_ID_000183fdda Y 4
9 C_ID_00024e244b N 70
... ... ... ...
469245 C_ID_fffeed3a89 Y 14
469246 C_ID_ffff1d9928 N 10
469247 C_ID_ffff1d9928 Y 2
469248 C_ID_ffff579d3a N 114
469249 C_ID_ffff756266 N 21
469250 C_ID_ffff756266 Y 3
469251 C_ID_ffff828181 N 179
469252 C_ID_ffff828181 Y 11
469253 C_ID_fffffd5772 N 51
469254 C_ID_fffffd5772 Y 33
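The `name=` argument of `reset_index` is what labels the values column; the two MultiIndex levels become ordinary columns. A minimal sketch on the same style of toy data (made-up ids):

```python
import pandas as pd

df = pd.DataFrame({
    'card_id':    ['a', 'a', 'b'],
    'category_1': ['Y', 'N', 'N'],
})

gdf = df.groupby(['card_id', 'category_1']).size()

# name= gives the counts column its header; card_id and category_1
# move out of the index and become regular columns.
flat = gdf.reset_index(name='category_1_count')
print(flat.columns.tolist())  # ['card_id', 'category_1', 'category_1_count']
```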
Now extract the rows whose category is Y and reset the index into a fresh DataFrame:
>>> df_category_1_Y_count = df_category_1_total[df_category_1_total['category_1'] == 'Y'].reset_index()
>>> df_category_1_Y_count
index card_id category_1 category_1_count
0 1 C_ID_00007093c1 Y 28
1 3 C_ID_0001238066 Y 2
2 6 C_ID_0001793786 Y 2
3 8 C_ID_000183fdda Y 4
4 11 C_ID_0002709b5a Y 6
5 17 C_ID_00032df08f Y 62
6 21 C_ID_0003be3c83 Y 37
7 24 C_ID_00042d509c Y 41
8 26 C_ID_0004587331 Y 2
9 29 C_ID_0004888ddd Y 4
... ... ... ... ...
148498 469231 C_ID_fffd943b91 Y 32
148499 469234 C_ID_fffde15ab6 Y 19
148500 469238 C_ID_fffe78b232 Y 18
148501 469241 C_ID_fffea6de74 Y 33
148502 469243 C_ID_fffeced303 Y 138
148503 469245 C_ID_fffeed3a89 Y 14
148504 469247 C_ID_ffff1d9928 Y 2
148505 469250 C_ID_ffff756266 Y 3
148506 469252 C_ID_ffff828181 Y 11
148507 469254 C_ID_fffffd5772 Y 33
We don't actually need the index and category_1 columns, so drop them and then merge the result with train_data. (Each row of train_data corresponds to one card_id.)
>>> train_df.head()
first_active_month card_id feature_1 feature_2 feature_3 target
0 2017-06-01 C_ID_92a2005557 5 2 1 -0.820283
1 2017-01-01 C_ID_3d0044924f 4 1 0 0.392913
2 2016-08-01 C_ID_d639edf6cd 2 2 0 0.688056
3 2017-09-01 C_ID_186d6a6901 4 3 0 0.142495
4 2017-11-01 C_ID_cdbd2c0db2 1 3 0 -0.159749
Merge:
>>> df_category_1_Y_count.drop(['index', 'category_1'], axis = 1, inplace = True)
>>> df_category_1_Y_count.rename(columns = {'category_1_count': 'category_1_Y_count'}, inplace = True)
>>> df_category_1_Y_count.head()
card_id category_1_Y_count
0 C_ID_00007093c1 28
1 C_ID_0001238066 2
2 C_ID_0001793786 2
3 C_ID_000183fdda 4
4 C_ID_0002709b5a 6
>>> train_df = pd.merge(train_df, df_category_1_Y_count, on = 'card_id', how = 'left')
>>> train_df.tail()
first_active_month card_id feature_1 feature_2 feature_3 target category_1_Y_count
201898 2017-06-01 C_ID_123b4b8d1e 1 1 0 -3.360124 58.0
201899 2017-10-01 C_ID_58e359763e 5 1 1 -2.702214 5.0
201900 2017-07-01 C_ID_0032aebb26 3 2 1 -0.151734 9.0
201901 2017-11-01 C_ID_3814ea6382 3 2 1 -1.743145 1.0
201902 2017-01-01 C_ID_b9cd68366b 1 3 0 1.001529 15.0
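One thing to watch with the left join: a card_id that has no Y rows at all gets NaN in the new column (this also forces the column to float dtype, which is why the values above show a trailing .0). Filling those with 0 is usually what you want, since a missing id really means zero Y transactions. A sketch with hypothetical ids:

```python
import pandas as pd

train_df = pd.DataFrame({'card_id': ['a', 'b'], 'target': [0.1, 0.2]})
# 'b' never appears with category_1 == 'Y', so it is absent here.
y_counts = pd.DataFrame({'card_id': ['a'], 'category_1_Y_count': [3]})

merged = pd.merge(train_df, y_counts, on='card_id', how='left')
# 'b' now carries NaN; treat a missing id as zero Y transactions.
merged['category_1_Y_count'] = merged['category_1_Y_count'].fillna(0)
print(merged)
```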
For the N values of category_1, the same method applies.
Code summary:
gdf = df.groupby(['card_id', 'category_1']).size()
df_category_1_total = gdf.reset_index(name = 'category_1_count')
df_category_1_Y_count = df_category_1_total[df_category_1_total['category_1'] == 'Y'].reset_index()
df_category_1_Y_count.drop(['index', 'category_1'], axis = 1, inplace = True)
df_category_1_Y_count.rename(columns = {'category_1_count': 'category_1_Y_count'}, inplace = True)
train_df = pd.merge(train_df, df_category_1_Y_count, on = 'card_id', how = 'left')
Since I didn't find a convenient API for this, the approach may look a bit convoluted; if you know a better way, I'd welcome pointers.
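One possibly more compact alternative is `unstack()`, which pivots the category_1 index level into columns and yields the Y and N counts in a single frame (a sketch on the same hypothetical toy data, not benchmarked against the approach above):

```python
import pandas as pd

df = pd.DataFrame({
    'card_id':    ['a', 'a', 'a', 'b', 'b', 'c'],
    'category_1': ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# unstack() turns the category_1 index level into columns;
# fill_value=0 covers (id, category) pairs that never occur.
counts = (df.groupby(['card_id', 'category_1']).size()
            .unstack(fill_value=0)
            .rename(columns={'Y': 'category_1_Y_count',
                             'N': 'category_1_N_count'})
            .reset_index())
print(counts)
```

This skips the separate filter/drop/rename steps for Y and N, and a single left merge on card_id attaches both count columns at once.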