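1. Code
Split the training set into K folds. For each fold, compute the CTR on the other K-1 folds and merge it onto the held-out fold; repeat this K times so that every training row gets a CTR computed without its own label. Then compute the CTR on the whole training set and merge it onto the test set.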
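import pandas as pd
from sklearn.model_selection import StratifiedKFold

def ctr_fea(train, test, feature):
    """Compute smoothed CTR features on `train` and left-merge them onto `test`."""
    # CTR for each single feature.
    for fea in feature:
        print(fea)
        # Named aggregation: the dict-renaming form of agg() fails on recent pandas.
        temp = train[['label', fea]].groupby(fea)['label'].agg(
            **{fea + '_sum': 'sum', fea + '_count': 'count'}).reset_index()
        # The +10 in the denominator damps the CTR of rarely seen values.
        temp[fea + '_ctr'] = temp[fea + '_sum'] / (temp[fea + '_count'] + 10)
        test = test.merge(temp[[fea, fea + '_ctr']], on=fea, how='left')
    # CTR for each pair of features.
    for i in range(len(feature) - 1):
        for j in range(i + 1, len(feature)):
            col = [feature[i], feature[j]]
            name = '_'.join(col)
            print(col)
            temp = train[['label'] + col].groupby(col)['label'].agg(
                **{name + '_sum': 'sum', name + '_count': 'count'}).reset_index()
            temp[name + '_ctr'] = temp[name + '_sum'] / (temp[name + '_count'] + 10)
            test = test.merge(temp[col + [name + '_ctr']], on=col, how='left')
    return test

# train, test, label and feature (a list of column names) are assumed to exist already.
train['label'] = label
fold_frames = []
skf = StratifiedKFold(n_splits=5, random_state=2019, shuffle=True)
for i, (train_index, valid_index) in enumerate(skf.split(train, label)):
    print('fold_{}'.format(i + 1))
    # Statistics come from the other folds and are merged onto the held-out fold.
    temp = ctr_fea(train.iloc[train_index], train.iloc[valid_index], feature)
    fold_frames.append(temp)
# Rows are now in fold order, not in the original order of train.
train_new = pd.concat(fold_frames)
# For the test set, the statistics come from the whole training set.
test_new = ctr_fea(train, test, feature)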
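To make the effect of the +10 in the denominator and of how='left' concrete, here is a minimal sketch on toy data. The frames toy_train and toy_test, the column ad_id and the helper smoothed_ctr are all made up for illustration; only the formula and the merge mode come from the code above.

```python
import pandas as pd

# Hypothetical toy data: 'ad_id' is a categorical feature, 'label' is the click.
toy_train = pd.DataFrame({
    "ad_id": ["a", "a", "a", "b", "b", "c"],
    "label": [1, 0, 1, 0, 0, 1],
})
toy_test = pd.DataFrame({"ad_id": ["a", "b", "d"]})


def smoothed_ctr(df, key, prior=10):
    """Return one row per value of `key` with clicks / (impressions + prior)."""
    stats = df.groupby(key)["label"].agg(clicks="sum", shows="count").reset_index()
    stats[key + "_ctr"] = stats["clicks"] / (stats["shows"] + prior)
    return stats[[key, key + "_ctr"]]


ctr_table = smoothed_ctr(toy_train, "ad_id")
toy_test_ctr = toy_test.merge(ctr_table, on="ad_id", how="left")
print(toy_test_ctr)
```

Value "a" has 2 clicks in 3 rows, so its smoothed CTR is 2/13 rather than the raw 2/3. Value "d" never appears in toy_train, so the left merge leaves its CTR as NaN; such gaps have to be filled, or left to a model that handles missing values.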
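2. Notes
1. After the CTR features are added, the training and test sets are no longer in their original order (the training set is concatenated fold by fold, and merge resets the index). Sort by ID again; otherwise the rows no longer line up with label and training goes wrong.
2. Keep label.values and label = train['label'] apart: they are two different structures. The first is a NumPy array with no index; the second is a pandas Series that carries the row index.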
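Both notes can be checked on a small frame. The sketch below is only an illustration: the column name ID and the four toy rows are assumptions, and the two reset_index calls stand in for what merge does to each fold.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'ID' stands in for whatever unique row key the data has.
train = pd.DataFrame({"ID": [0, 1, 2, 3], "label": [1, 0, 0, 1]})
label = train["label"]  # pandas Series, keeps the row index

# Note 1: fold outputs are concatenated in fold order, each with a reset index.
fold_a = train.iloc[[2, 0]].reset_index(drop=True)
fold_b = train.iloc[[3, 1]].reset_index(drop=True)
train_new = pd.concat([fold_a, fold_b])
order_broken = train_new["ID"].tolist()

# Sorting by ID and rebuilding the index restores the original row order.
train_new = train_new.sort_values("ID").reset_index(drop=True)
order_fixed = train_new["ID"].tolist()

# Note 2: a Series is aligned by index, a NumPy array is assigned by position.
shuffled = train.iloc[[2, 0, 3, 1]][["ID"]]
by_index = shuffled.assign(y=label)["y"].tolist()
by_position = shuffled.assign(y=label.values)["y"].tolist()

print(order_broken, order_fixed)
print(by_index, by_position)
```

With the Series, each row of shuffled receives its own label because pandas matches on the index. With label.values the labels are written in positional order, so they land on the wrong rows unless the frame is first sorted back into its original order.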