Categorical Feature Encoding Challenge II | Kaggle
This is a Kaggle classification competition. The dataset comes with no stated business context and contains only categorical data: binary features, nominal features of both low and high cardinality, ordinal features of both low and high cardinality, and (potentially) cyclic features.
Competitors are given 600K training rows and must predict the binary-class probability for a 400K-row test set.
1. Data Visualization
The dataset divides its features into four groups: binary features (bin), multi-level nominal features (nom), ordinal features (ord), and (potentially) cyclic features (day/month).
Among the binary features, the distributions of bin_0, bin_1, and bin_2 track the distribution of the target more closely, whereas for bin_3 and bin_4 the 0-1 class proportions appear to differ from the target's distribution, suggesting weaker correlation. The bar charts seem to confirm this:
In the charts, the horizontal axis shows the feature's 0-1 categories and color distinguishes the 0-1 classes of the target. For bin_0, bin_1, and bin_2, the proportion of target classes clearly differs across feature values; for bin_3 and bin_4 the proportions look nearly identical.
Chi-square tests show that every binary feature except bin_3 is significantly associated with the target.
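The test itself is straightforward; here is a minimal sketch (my reconstruction, not the original notebook code) that runs scipy's chi-square test of independence on a feature-target contingency table:

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("train.csv")
for col in ["bin_0", "bin_1", "bin_2", "bin_3", "bin_4"]:
    # Contingency table of the binary feature against the target
    table = pd.crosstab(df[col], df["target"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{col}: chi2={chi2:.2f}, p={p:.4f}")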
For the nominal features with few categories, the categories show a clear association with the target, while some categories are plainly imbalanced ("very little data"), which adds difficulty to modeling.
The bar charts for the nominal features show that the categories differ markedly not only in counts but also in the distribution of the target across categories. These multi-level nominal features are a major challenge for modeling.
The ordinal features show no clear monotonic relationship with the target, so they can also be treated as unordered categorical features.
As for the time features, the raw categories show relatively less data for September, October, and Thursdays. After transforming them with sine/cosine functions and a Fourier transform, no obvious periodic effect emerged, so during modeling they were treated as ordinary categorical features.
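For reference, the sine/cosine treatment maps a cyclic feature onto the unit circle so that the last and first values of a period end up adjacent; a minimal sketch (my reconstruction, assuming day runs 1-7 and month runs 1-12 as in this dataset):

import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")
# (assumed periods) day: 1-7, month: 1-12
for col, period in [("day", 7), ("month", 12)]:
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)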
In short, with no business background to draw on, visualization yields limited information. Moreover, some categorical features have 1000+ levels and the interactions between categories are unknown; together these issues make the modeling far from trivial.
2. Models Used: A Brief Overview
In the earlier Rossmann competition, competitors already discovered that embedding layers handle high-cardinality categorical data well, and models for tabular data have accumulated since. In this competition many entrants used click-through-rate (CTR) models to good effect, so below is a brief tour of the main CTR models.
FM (factorization machine): an improved logistic regression that adds second-order feature-interaction terms (see the formula after this list).
FFM: an improvement on FM that introduces the notion of a field; a feature uses a different latent vector for each field it interacts with.
Wide and Deep: the Wide part is a linear model over one-hot inputs, continuous variables, and interaction/hand-crafted features; the Deep part embeds the categorical features and passes them through fully connected DNN layers.
DeepFM: replaces the linear part of Wide and Deep with an FM, and unifies the inputs by embedding all features.
DCN: builds on Wide and Deep, replacing the linear part with a Cross Network in which every layer's output is multiplied by the original input features. It likewise unifies inputs by embedding the discrete features.
xDeepFM: despite the name it is closer to DCN. It adds a CIN (Compressed Interaction Network) layer on top of Wide and Deep; the CIN extends the Cross Network with the field concept, treating the elements of one field as a whole and using shared weights when they interact with other features.
PNN: performs feature interaction at field granularity, in Inner Product and Outer Product variants. Inner Product takes pairwise inner products of the fields' embeddings; Outer Product takes pairwise outer products (and since an outer product is a matrix, it is further contracted with a weight tensor W of matching shape). The products are concatenated and fed into fully connected layers.
AutoInt: argues that shallow models are limited by their interaction order while DNNs do poorly at implicit high-order interactions, so it adds an attention mechanism. Discrete and continuous features are both embedded on input and projected into three matrices, Query, Key, and Values; the inner product of Query and Key measures similarity, a softmax turns it into attention weights, and the attention is multiplied by Values to give one head's output.
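For reference, the FM prediction mentioned in the first item is, in its standard published form (not code from this competition),

$$\hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle v_i, v_j \rangle x_i x_j,$$

and the pairwise term can be computed in O(kd) time via

$$\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{d} v_{i,f}\,x_i\right)^2 - \sum_{i=1}^{d} v_{i,f}^2\,x_i^2\right],$$

which is exactly the sum-of-squares trick implemented by the FM code in Appendix 1.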
3. Modeling Strategy
The modeling strategy here drew on two references:
Categorical Feature Encoding Challenge II | Kaggle
The former builds a single deep-learning model for prediction and also showcases the authors' own package, deepTables; the latter splits the data into 50 folds, each time holding out one part as the validation set to train 50 different models, then averages the 50 models' predictions. I originally wanted to run every model through all 50 folds, but the process is very slow and some of the models simply underperformed, so I did not.
In the end I used:
1. DeepFM from deepCTR, averaged over 50 folds;
2. my own hand-written xDeepFM, averaged over 50 folds;
3. my own hand-written AutoInt, trained for 100 epochs.
DeepFM (deepCTR)
import numpy as np
import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from deepctr_torch.inputs import SparseFeat, get_feature_names
from deepctr_torch.models import DeepFM

DEVICE = torch.device("cuda:0")

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_ttl = pd.concat([df_train, df_test], axis=0)
df_ttl.reset_index(drop=True, inplace=True)
train_shape = df_train.shape[0]
del df_train, df_test

def preprocess_features(X):
    # Every column is categorical; label-encode all of them
    for col in X.columns:
        X[col] = X[col].astype("category")
    for col in X.columns:
        X[col] = LabelEncoder().fit_transform(X[col])
    return X

X_ttl = preprocess_features(df_ttl.drop(["target", "id"], axis=1))
y_train = df_ttl.loc[:train_shape - 1, "target"]

input_feats = X_ttl.columns
sparse_columns = [SparseFeat(feat, X_ttl[feat].nunique()) for feat in input_feats]
linear_feature_cols = sparse_columns
dnn_feature_cols = sparse_columns
feature_names = get_feature_names(linear_feature_cols + dnn_feature_cols)
# Test rows start at position train_shape
test_input = {name: X_ttl.loc[train_shape:, name] for name in feature_names}

skf = StratifiedKFold(n_splits=50, shuffle=True)
test_res = []
for fold, (train_idx, valid_idx) in enumerate(
        skf.split(X_ttl.loc[:train_shape - 1, :].values, y_train.values)):
    train_input = {name: X_ttl.loc[train_idx, name] for name in feature_names}
    valid_input = {name: X_ttl.loc[valid_idx, name] for name in feature_names}
    model = DeepFM(linear_feature_cols, dnn_feature_cols,
                   dnn_hidden_units=(256, 256), dnn_dropout=0,
                   dnn_activation=nn.Mish, dnn_use_bn=False,
                   task="binary", device=DEVICE)
    # deepctr-torch expects metric names as strings
    model.compile('adam', loss='binary_crossentropy', metrics=['auc'])
    model.fit(train_input, y_train.values[train_idx],
              validation_data=(valid_input, y_train.values[valid_idx]),
              batch_size=128, epochs=10, verbose=0)
    pred_valid = model.predict(valid_input)
    auc = roc_auc_score(y_train.values[valid_idx], pred_valid.flatten())
    torch.save(model.state_dict(), f"deep_FM_FOLD{fold}.txt")
    print(f"FOLD{fold+1} AUC:{auc}")
    test_res.append(model.predict(test_input))
    torch.cuda.empty_cache()

# Average the 50 fold predictions on the test set
res_test = (np.sum(test_res, axis=0) / 50).flatten()
The per-fold results for DeepFM are shown in the figure.
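The averaged predictions above still need to be written out; a minimal sketch (my addition, and the output file name is hypothetical), mirroring the submission code used for xDeepFM below:

result = pd.read_csv("sample_submission.csv")
result["target"] = res_test  # averaged 50-fold DeepFM predictions
result.to_csv("deepFM_50fold.csv", index=None)  # hypothetical file name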
Hand-written xDeepFM
Data preprocessing:
import os
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

DEVICE = torch.device("cuda:0")

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

def preprocess_features(X, onehot_cols):
    # Explicit orderings for the ordinal features
    # (ord_order[0] corresponds to ord_0, which is already numeric and is
    # left as a plain categorical column below)
    ord_order = [
        [1.0, 2.0, 3.0],
        ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster'],
        ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']
    ]
    for i in range(1, 3):
        ord_order_dict = {v: j for j, v in enumerate(ord_order[i])}
        X[f'ord_{i}_en'] = X[f'ord_{i}'].map(ord_order_dict).fillna(-1)
    # ord_3 .. ord_5 are encoded by lexicographic order
    for i in range(3, 6):
        ord_order_dict = {v: j for j, v in enumerate(sorted(X[f'ord_{i}'].dropna().unique()))}
        X[f'ord_{i}_en'] = X[f'ord_{i}'].map(ord_order_dict).fillna(-1)
    cont_cols = [c for c in X.columns if c.endswith('_en')]
    X_cont = X.loc[:, cont_cols]
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_cont = imputer.fit_transform(X_cont)
    X.drop(cont_cols, axis=1, inplace=True)
    for col in X.columns:
        X[col] = X[col].astype("category")
    # One-hot encoding for the low-cardinality columns
    onehot_data = pd.get_dummies(X[onehot_cols])
    # Label encoding for everything (input to the embedding layers)
    for col in X.columns:
        X[col] = LabelEncoder().fit_transform(X[col])
    return X, X_cont, onehot_data

df_ttl = pd.concat([df_train, df_test], axis=0)
df_ttl.reset_index(drop=True, inplace=True)
train_shape = df_train.shape[0]
del df_train, df_test

# One-hot only the low-cardinality columns (exclude nom_5..nom_9 and ord_5)
onehot_cols = [c for c in df_ttl.columns
               if c not in ["nom_" + str(j) for j in range(5, 10)] + ["ord_5", "id", "target"]]
X, X_cont, onehot_data = preprocess_features(df_ttl.drop(["target", "id"], axis=1),
                                             onehot_cols=onehot_cols)

class CustomDataSet(Dataset):
    def __init__(self, x, X_cont, onehot, y):
        super(CustomDataSet, self).__init__()
        self.x = x
        self.onehot = onehot
        self.X_cont = X_cont
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return (torch.IntTensor(self.x[idx]), torch.FloatTensor(self.X_cont[idx]),
                torch.IntTensor(self.onehot[idx]), torch.FloatTensor(self.y[idx]))

# The target column of the test rows is NaN and serves only as a placeholder
test_set = CustomDataSet(X.iloc[train_shape:, :].values, X_cont[train_shape:],
                         onehot_data.iloc[train_shape:, :].values,
                         df_ttl.loc[train_shape:, "target"].values.reshape(-1, 1))
Model definition
class CIN(nn.Module):
    """
    Compressed Interaction Network.
    References: https://zhuanlan.zhihu.com/p/96827361 and the
    implementation in the DeepCTR-Torch package.
    """
    def __init__(self, input_shape, layer_shapes, activation=F.relu, split_half=True):
        super(CIN, self).__init__()
        self.input_shape = input_shape  # [None, m, D]
        self.m = input_shape[1]
        self.D = input_shape[2]
        self.layer_shapes = [self.m] + layer_shapes
        self.activation = activation
        self.split_half = split_half
        if self.split_half:
            for i in range(len(self.layer_shapes) - 1):
                self.layer_shapes[i + 1] = self.layer_shapes[i + 1] // 2
        self.conv_layers = nn.ModuleList([
            nn.Conv1d(in_channels=self.layer_shapes[0] * self.layer_shapes[i],
                      out_channels=self.layer_shapes[i + 1],
                      kernel_size=1, stride=1, dilation=1, bias=True)
            for i in range(len(self.layer_shapes) - 1)
        ])
        self.output_linear = nn.Linear(sum(self.layer_shapes) - self.m, 1)

    def forward(self, x):
        # x: [None, m, D]
        x_list = []
        x0 = x.unsqueeze(dim=2)  # [None, m, 1, D]
        for i in range(len(self.layer_shapes) - 1):
            # [None,m,1,D] * [None,1,Hk,D] => [None,m,Hk,D]
            xi = x0 * x.unsqueeze(dim=1)
            xi = xi.view(xi.shape[0], -1, self.D)
            xi = self.activation(self.conv_layers[i](xi))  # [None, H_{k+1}, D]
            x_list.append(xi)
            x = xi
        # Sum-pool over the embedding dimension, then map to a scalar
        return self.output_linear(torch.sum(torch.cat(x_list, 1), 2))


class DNN(nn.Module):
    """
    input_shape: input size
    units: number of neurons per layer
    p_dropout: dropout rate
    activation: activation function
    """
    def __init__(self, input_shape, units, p_dropout, activation=F.mish, use_BN=True):
        super(DNN, self).__init__()
        self.input_shape = input_shape
        self.units = [input_shape[-1]] + units
        self.p_dropout = p_dropout
        self.activation = activation
        self.use_BN = use_BN
        self.dnns = nn.ModuleList([nn.Linear(i, j, bias=True)
                                   for i, j in zip(self.units[:-1], self.units[1:])])
        if self.p_dropout > 0:
            self.dropout = nn.Dropout(p_dropout)
        if use_BN:
            self.bn = nn.ModuleList([nn.BatchNorm1d(self.units[i + 1])
                                     for i in range(len(self.units) - 1)])

    def forward(self, x):
        for i in range(len(self.dnns)):
            x = self.dnns[i](x)
            if self.use_BN:
                x = self.bn[i](x)
            x = self.activation(x)
            if self.p_dropout > 0:
                x = self.dropout(x)
        return x


class xDeepFM(nn.Module):
    def __init__(self, Embedding_units, Embedding_nums, cate_shape, cont_shape,
                 dnn_units, p_dropout, cin_layers, activation=F.relu, use_BN=True):
        super(xDeepFM, self).__init__()
        self.Embedding_units = Embedding_units
        self.Embedding_nums = Embedding_nums
        self.cont_shape = cont_shape
        self.cate_shape = cate_shape
        self.dnn_units = dnn_units  # the last entry must be 1 so the parts can be summed
        self.p_dropout = p_dropout
        self.activation = activation
        self.cin_layers = cin_layers
        self.Linear = nn.Linear(cont_shape + sum(self.Embedding_units), 1)
        self.Embedding = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_units[i])
                                        for i in range(len(self.Embedding_nums))])
        self.DNN = DNN([None, self.cont_shape + sum(self.Embedding_units)],
                       self.dnn_units, self.p_dropout, self.activation, use_BN)
        # The CIN assumes every field shares the same embedding size
        self.CIN = CIN([None, cate_shape, self.Embedding_units[0]], self.cin_layers)
        self.out_layer = nn.Linear(1, 1)
        for weight in self.parameters():
            try:
                nn.init.xavier_uniform_(weight)
            except ValueError:
                # xavier init is undefined for 1-d tensors (biases); skip them
                continue

    def forward(self, x_cate, x_cont):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding[i](x_cate[:, i]))
        x_cate_embedded_DNN = torch.cat(x_cate_embedded, 1)
        if x_cont is None:
            dnn_part = self.DNN(x_cate_embedded_DNN)
            linear_part = self.Linear(x_cate_embedded_DNN)
        else:
            dnn_part = self.DNN(torch.cat([x_cate_embedded_DNN, x_cont], 1))
            linear_part = self.Linear(torch.cat([x_cate_embedded_DNN, x_cont], 1))
        # Stack the per-field embeddings into [None, m, D] for the CIN
        x_cate_embedded = torch.cat(
            [x_cate_embedded[i].reshape(x_cate_embedded[i].shape[0], 1, self.Embedding_units[i])
             for i in range(len(self.Embedding_nums))], 1)
        cin_part = self.CIN(x_cate_embedded)
        out = torch.sigmoid(self.out_layer(linear_part + dnn_part + cin_part))
        return out
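Before launching 50 folds it is worth a quick sanity check; here is a minimal smoke test (my addition, with made-up field sizes) that runs the model on random tensors and verifies the output shape:

# Hypothetical smoke test: 3 categorical fields with 5/7/4 levels, 2 continuous features
fake_nums = [5, 7, 4]
fake_model = xDeepFM([8, 8, 8], fake_nums, cate_shape=3, cont_shape=2,
                     dnn_units=[16, 1], p_dropout=0.0, cin_layers=[6, 6])
fake_cate = torch.randint(0, 4, (10, 3))
fake_cont = torch.randn(10, 2)
print(fake_model(fake_cate, fake_cont).shape)  # torch.Size([10, 1])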
Training
# (Reconstructed: these definitions were missing from the original snippet)
Embedding_nums = [int(X[col].nunique()) for col in X.columns]  # vocabulary size per field
cate_shape = X.shape[1]        # number of categorical fields
cont_shape = X_cont.shape[1]   # number of continuous (encoded ordinal) features
Embedding_units = [20 for _ in Embedding_nums]

n_splits = 50
skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
batch_size = 128
for fold, (train_idx, valid_idx) in enumerate(
        skf.split(X.loc[:train_shape - 1, :].values,
                  df_ttl.loc[:train_shape - 1, "target"].values)):
    train_set = CustomDataSet(X.iloc[train_idx, :].values, X_cont[train_idx],
                              onehot_data.iloc[train_idx, :].values,
                              df_ttl.loc[train_idx, "target"].values.reshape(-1, 1))
    valid_set = CustomDataSet(X.iloc[valid_idx, :].values, X_cont[valid_idx],
                              onehot_data.iloc[valid_idx, :].values,
                              df_ttl.loc[valid_idx, "target"].values.reshape(-1, 1))
    model = xDeepFM(Embedding_units, Embedding_nums, cate_shape, cont_shape,
                    [300, 300, 1], 0.3, [200, 200], activation=F.mish).to(DEVICE)
    optim = torch.optim.Adam(model.parameters(), lr=1e-2)
    scheduler = torch.optim.lr_scheduler.CyclicLR(optim, 1e-7, 1e-4, mode='exp_range',
                                                  gamma=1.0, scale_mode='cycle',
                                                  cycle_momentum=False)
    loss_func = nn.BCELoss()
    dl_train = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    dl_valid = DataLoader(valid_set, batch_size=batch_size * 4)
    best_auc = 0
    best_valid_loss = np.inf
    patient = 3
    for epoch in range(10):
        epoch_loss = 0
        valid_loss = 0
        model.train()
        for i, (x, x_cont, onehot, y) in enumerate(dl_train):
            # the one-hot part is unused by this model
            x, x_cont, y = x.to(DEVICE), x_cont.to(DEVICE), y.to(DEVICE)
            optim.zero_grad()
            pred = model(x, x_cont)
            loss = loss_func(pred, y)
            # Manual L2 regularization over all parameters
            for para in model.parameters():
                loss = loss + torch.sum(torch.square(para)) * 1e-5
            epoch_loss += loss.item()
            loss.backward()
            optim.step()
            scheduler.step()
        valid_pred = []
        valid_y = []
        model.eval()
        for i, (x, x_cont, onehot, y) in enumerate(dl_valid):
            x, x_cont, y = x.to(DEVICE), x_cont.to(DEVICE), y.to(DEVICE)
            with torch.no_grad():
                pred = model(x, x_cont)
                loss = loss_func(pred, y)
                for para in model.parameters():
                    loss = loss + torch.sum(torch.square(para)) * 1e-5
                valid_loss += loss.item()
            valid_pred.append(pred)
            valid_y.append(y)
        valid_pred = torch.cat(valid_pred).cpu().numpy()
        valid_y = torch.cat(valid_y).cpu().numpy()
        auc = roc_auc_score(valid_y.flatten(), valid_pred.flatten())
        # Early stopping: an epoch counts against patience only when both
        # the validation loss and the AUC got worse
        if valid_loss > best_valid_loss and auc < best_auc:
            patient -= 1
            if patient == 0:
                break
            else:
                continue
        best_valid_loss = valid_loss
        best_auc = auc
        torch.save(model.state_dict(), f"xDeepFM_FOLD_{fold+1}.txt")
    print(f"FOLD{fold+1} AUC:{best_auc}")
    torch.cuda.empty_cache()
Predicting and saving the results
dl_test = DataLoader(test_set, batch_size=512)
test_res = []
for fold in range(50):
    model.load_state_dict(torch.load(f"xDeepFM_FOLD_{fold+1}.txt"))
    model.eval()
    tmp = []
    for i, (x, x_cont, onehot, y) in enumerate(dl_test):
        x, x_cont = x.to(DEVICE), x_cont.to(DEVICE)
        with torch.no_grad():
            pred = model(x, x_cont)
        tmp.append(pred.cpu().numpy())
    test_res.append(tmp)
    print(f"{fold+1} complete")

# Average the 50 folds and write the submission
xDeepFM_res = np.array([np.concatenate([i.flatten() for i in test_res[j]])
                        for j in range(len(test_res))])
xDeepFM_res = np.sum(xDeepFM_res, 0) / 50
result = pd.read_csv("sample_submission.csv")
result["target"] = xDeepFM_res
result.to_csv("xDeepFM_custom.csv", index=None)
Hand-written AutoInt (100 epochs)
Model definition
class InteractingLayer(nn.Module):
    """
    Multi-head self-attention interaction layer.
    Reference: https://blog.csdn.net/qq_42363032/article/details/126004209
    """
    def __init__(self, num_heads, Embedding_unit):
        super(InteractingLayer, self).__init__()
        self.num_heads = num_heads
        self.Embedding_unit = Embedding_unit
        assert self.Embedding_unit % self.num_heads == 0, \
            "Embedding_unit must be divisible by num_heads"
        self.qkv_size = self.Embedding_unit // self.num_heads
        self.Weight_Q = nn.Parameter(torch.Tensor(self.Embedding_unit, self.Embedding_unit))
        self.Weight_K = nn.Parameter(torch.Tensor(self.Embedding_unit, self.Embedding_unit))
        self.Weight_V = nn.Parameter(torch.Tensor(self.Embedding_unit, self.Embedding_unit))
        self.W_R = nn.Parameter(torch.Tensor(self.Embedding_unit, self.Embedding_unit))
        for weight in self.parameters():
            nn.init.xavier_uniform_(weight)

    def forward(self, x):
        # [batch_size, fields, emb_size] => [batch_size, fields, emb_size]
        query = torch.matmul(x, self.Weight_Q)
        key = torch.matmul(x, self.Weight_K)
        value = torch.matmul(x, self.Weight_V)
        # [batch_size, fields, emb_size] => [num_heads, batch_size, fields, qkv_size]
        query = torch.stack(torch.split(query, self.qkv_size, dim=2))
        key = torch.stack(torch.split(key, self.qkv_size, dim=2))
        value = torch.stack(torch.split(value, self.qkv_size, dim=2))
        att = torch.matmul(query, key.transpose(-2, -1))
        att = F.softmax(att, dim=-1)
        res = torch.matmul(att, value)
        # Concatenate the heads back: => [batch_size, fields, emb_size]
        res = torch.cat(torch.split(res, 1), -1)
        res = torch.squeeze(res, 0)
        # Residual connection through W_R
        res = res + torch.matmul(x, self.W_R)
        res = F.relu(res)
        return res


class AutoInt(nn.Module):
    def __init__(self, num_heads, num_attnlayers, Embedding_nums, Embedding_unit,
                 DNN_units, p_dropout_dnn, activation=F.relu, use_BN=True):
        super(AutoInt, self).__init__()
        self.num_heads = num_heads
        self.Embedding_nums = Embedding_nums
        self.Embedding_unit = Embedding_unit
        self.Embedding_layers = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_unit)
                                               for i in range(len(Embedding_nums))])
        self.Interactive = nn.ModuleList([InteractingLayer(num_heads, Embedding_unit)
                                          for _ in range(num_attnlayers)])
        self.dnn_input_shape = [None, len(self.Embedding_nums) * self.Embedding_unit]
        self.DNN = DNN(self.dnn_input_shape, DNN_units, p_dropout_dnn, activation, use_BN)
        self.Linear = nn.Linear(self.dnn_input_shape[-1], 1)
        self.Linear_interactive = nn.Linear(
            len(self.Embedding_nums) * self.Embedding_unit + DNN_units[-1], 1)

    def forward(self, x):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding_layers[i](x[:, i]).unsqueeze(1))
        x = torch.cat(x_cate_embedded, 1)  # [batch_size, fields, emb_size]
        linear_part = self.Linear(torch.flatten(x, start_dim=1))
        DNN_part = self.DNN(torch.flatten(x, start_dim=1))
        for layer in self.Interactive:
            x = layer(x)
        Interactive_part = x.reshape(-1, len(self.Embedding_nums) * self.Embedding_unit)
        out = linear_part + self.Linear_interactive(torch.cat([Interactive_part, DNN_part], 1))
        return torch.sigmoid(out)
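As with xDeepFM, a minimal smoke test (my addition, with made-up vocabulary sizes) confirms the forward pass:

# Hypothetical smoke test: 4 fields, 2 heads, 3 attention layers
fake_nums = [5, 7, 4, 6]
fake_model = AutoInt(2, 3, fake_nums, Embedding_unit=8, DNN_units=[16, 8], p_dropout_dnn=0.0)
fake_x = torch.randint(0, 4, (10, 4))
print(fake_model(fake_x).shape)  # torch.Size([10, 1])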
Training the model
model = AutoInt(2, 3, Embedding_nums, 20, [300, 300], 0.3).to(DEVICE)
optim = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=1e-3)
loss_func = nn.BCELoss()
scheduler = torch.optim.lr_scheduler.CyclicLR(optim, 1e-7, 1e-4, mode='exp_range',
                                              gamma=1.0, scale_mode='cycle',
                                              cycle_momentum=False)
batch_size = 512
os.makedirs("AutoInt", exist_ok=True)
# Early-stopping state must live outside the epoch loop
patient = 5
best_auc = 0
best_loss = np.inf
for epoch in range(100):
    epoch_loss = 0
    valid_loss = 0
    # Re-sample a 20% validation split of the training rows each epoch
    valid_idxs = set(np.random.choice(train_shape, int(train_shape * 0.2), replace=False))
    train_idx = [i for i in range(train_shape) if i not in valid_idxs]
    valid_idx = list(valid_idxs)
    train_set = CustomDataSet(X.iloc[train_idx, :].values, X_cont[train_idx],
                              onehot_data.iloc[train_idx, :].values,
                              df_ttl.loc[train_idx, "target"].values.reshape(-1, 1))
    valid_set = CustomDataSet(X.iloc[valid_idx, :].values, X_cont[valid_idx],
                              onehot_data.iloc[valid_idx, :].values,
                              df_ttl.loc[valid_idx, "target"].values.reshape(-1, 1))
    dl_train = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    dl_valid = DataLoader(valid_set, batch_size=batch_size)
    model.train()
    for i, (x, x_cont, onehot, y) in enumerate(dl_train):
        x, y = x.to(DEVICE), y.to(DEVICE)
        optim.zero_grad()
        pred = model(x.int())
        loss = loss_func(pred, y)
        epoch_loss += loss.item()
        loss.backward()
        optim.step()
        scheduler.step()
    valid_pred = []
    valid_y = []
    model.eval()
    for i, (x, x_cont, onehot, y) in enumerate(dl_valid):
        x, y = x.to(DEVICE), y.to(DEVICE)
        with torch.no_grad():
            pred = model(x.int())
            loss = loss_func(pred, y)
            valid_loss += loss.item()
        valid_y.append(y)
        valid_pred.append(pred)
    valid_pred = torch.cat(valid_pred).cpu().numpy()
    valid_y = torch.cat(valid_y).cpu().numpy()
    auc = roc_auc_score(valid_y.flatten(), valid_pred.flatten())
    if auc < best_auc and epoch_loss > best_loss:
        patient -= 1
        if patient == 0:
            break
        else:
            continue
    best_auc = max(best_auc, auc)
    best_loss = min(best_loss, epoch_loss)
    torch.save(model.state_dict(), "AutoInt/AutoInt_FOLD_best.txt")
    print(f"epoch:{epoch+1},loss:{epoch_loss},valid_loss:{valid_loss},auc:{auc}")
Predicting and saving the results
# Load the best checkpoint saved during training before predicting
model.load_state_dict(torch.load("AutoInt/AutoInt_FOLD_best.txt"))
model.eval()
test_res = []
tmp = []
dl_test = DataLoader(test_set, batch_size=512)
for i, (x, x_cont, onehot, y) in enumerate(dl_test):
    x = x.to(DEVICE)
    with torch.no_grad():
        pred = model(x.int())
    tmp.append(pred.cpu().numpy())
test_res.append(tmp)
AutoInt_res = np.array([np.concatenate([i.flatten() for i in test_res[j]])
                        for j in range(len(test_res))])
result = pd.read_csv("sample_submission.csv")
result["target"] = AutoInt_res.flatten()
result.to_csv("AutoInt_100.csv", index=None)
In terms of results, xDeepFM did best. That said, mature packages such as DeepCTR and DeepTables presumably optimize things like the autograd machinery and model regularization; I rolled my own implementations purely as a learning exercise, and for real problems the existing packages should be the first choice.
Also, running AutoInt for 50 folds would have taken far too long, so that strategy was skipped; it is therefore not fair to conclude outright that "AutoInt underperforms."
4. Appendix 1: Custom Models That Went Unused
Below are some custom models I wrote following the first-place strategy. They underperformed and went unused; I would welcome comments pointing out the oversights.
FM+DCN+DNN:
## FM
class FM(nn.Module):
    def __init__(self, k, input_shape):
        """
        k: number of latent factors for the second-order interactions
        """
        super(FM, self).__init__()
        self.k = k
        self.w0 = nn.Parameter(torch.zeros(1))  # bias
        self.w1 = nn.Parameter(nn.init.normal_(torch.empty(input_shape[-1], 1)))  # linear weights
        self.v = nn.Parameter(nn.init.normal_(torch.empty(input_shape[-1], self.k)))  # factor matrix

    def forward(self, x):
        linear_part = torch.matmul(x, self.w1) + self.w0
        # (sum v x)^2 - sum (v x)^2 trick for the pairwise interactions
        inter_part1 = torch.pow(torch.matmul(x, self.v), 2)
        inter_part2 = torch.matmul(torch.pow(x, 2), torch.pow(self.v, 2))
        out = linear_part + (torch.sum(inter_part1 - inter_part2, dim=-1) / 2).reshape(-1, 1)
        return torch.sigmoid(out)


class FM_Layer(nn.Module):
    """Second-order interaction term only, for use inside a larger model."""
    def __init__(self):
        super(FM_Layer, self).__init__()

    def forward(self, x):
        square_of_sum = torch.pow(torch.sum(x, dim=1, keepdim=True), 2)
        sum_of_square = torch.sum(x * x, dim=1, keepdim=True)
        cross = square_of_sum - sum_of_square
        cross = torch.sum(cross, dim=-1, keepdim=True) * 0.5
        return cross


class CrossLayer(nn.Module):
    def __init__(self, layer_num, input_shape):
        super(CrossLayer, self).__init__()
        self.layer_num = layer_num
        self.Linears = nn.ModuleList([nn.Linear(input_shape[-1], 1, bias=False)
                                      for _ in range(self.layer_num)])
        for w in self.parameters():
            nn.init.xavier_normal_(w)

    def forward(self, x):
        x0 = x.unsqueeze(dim=2)
        xl = x.unsqueeze(dim=2)
        for i in range(self.layer_num):
            # Outer product x0 xl^T, then a linear map back to a column vector.
            # Example: two rows of 3-d features (already expanded by one dim):
            #   x0 = [[[1.],[2.],[3.]], [[4.],[5.],[6.]]]
            #   xl = [[[1.],[2.],[3.]], [[4.],[5.],[6.]]]
            # each row is crossed with itself, giving a 2x3x3 tensor:
            #   [[[ 1.,  2.,  3.], [ 2.,  4.,  6.], [ 3.,  6.,  9.]],
            #    [[16., 20., 24.], [20., 25., 30.], [24., 30., 36.]]]
            xl = xl + self.Linears[i](torch.bmm(x0, xl.permute((0, 2, 1))))
        out = torch.squeeze(xl, dim=2)
        return out
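For reference, the published DCN cross layer computes

$$x_{l+1} = x_0\, x_l^{\top} w_l + b_l + x_l,$$

and the CrossLayer above implements the same recursion but omits the bias term $b_l$.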
class CrossNet(nn.Module):
    def __init__(self, Embedding_units, Embedding_nums, layer_num, input_shape,
                 dnn_units, p_dropout, activation=F.relu):
        super(CrossNet, self).__init__()
        self.layer_num = layer_num
        self.input_shape = input_shape
        self.dnn_units = dnn_units
        self.p_dropout = p_dropout
        self.activation = activation
        self.Embedding_units = Embedding_units
        self.Embedding_nums = Embedding_nums
        self.DNN = DNN(self.input_shape, self.dnn_units, self.p_dropout, self.activation)
        self.Cross = CrossLayer(self.layer_num, self.input_shape)
        self.Embedding_Layers = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_units[i])
                                               for i in range(len(Embedding_nums))])
        self.final_inputshape = self.dnn_units[-1] + input_shape[-1]
        self.Output_Linear = nn.Linear(self.final_inputshape, 1)

    def forward(self, x_cate, x_cont):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding_Layers[i](x_cate[:, i]))
        x_cate_embedded = torch.cat(x_cate_embedded, 1)
        DNN_part = self.DNN(torch.cat([x_cate_embedded, x_cont], dim=1))
        Cross_part = self.Cross(torch.cat([x_cate_embedded, x_cont], dim=1))
        x = torch.cat([Cross_part, DNN_part], 1)
        out = torch.sigmoid(self.Output_Linear(x))
        return out


class FmDcnDnn(nn.Module):
    def __init__(self, Embedding_units, Embedding_nums, layer_num, input_shape,
                 input_shape_FM, dnn_units, p_dropout, activation=F.relu):
        super(FmDcnDnn, self).__init__()
        self.layer_num = layer_num
        self.input_shape = input_shape
        self.dnn_units = dnn_units  # the last entry must be 1 so the parts can be summed
        self.p_dropout = p_dropout
        self.activation = activation
        self.Embedding_units = Embedding_units
        self.Embedding_nums = Embedding_nums
        self.DNN = DNN(self.input_shape, self.dnn_units, self.p_dropout, self.activation)
        self.Cross = CrossLayer(self.layer_num, self.input_shape)
        self.Embedding_Layers = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_units[i])
                                               for i in range(len(Embedding_nums))])
        self.FMLayer = FM_Layer()
        self.DNN_Cross = DNN(self.input_shape, self.dnn_units, self.p_dropout, self.activation)
        self.final_inputshape = self.dnn_units[-1] + input_shape[-1]
        self.Output_Linear = nn.Linear(self.final_inputshape, 1)

    def forward(self, x_cate, x_cont):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding_Layers[i](x_cate[:, i]))
        x_cate_embedded = torch.cat(x_cate_embedded, 1)
        if x_cont is None:
            concat_input = x_cate_embedded
        else:
            concat_input = torch.cat([x_cate_embedded, x_cont], dim=1)
        DNN_part = self.DNN(concat_input)
        DNN_cross_part = self.DNN_Cross(concat_input)
        Cross_part = self.Cross(concat_input)
        DCN_part = self.Output_Linear(torch.cat([Cross_part, DNN_cross_part], 1))
        FM_part = self.FMLayer(concat_input)
        x = DNN_part + DCN_part + FM_part
        out = torch.sigmoid(x)
        return out
FM+DCN+OuterPNN
class InnerProductLayer(nn.Module):
    def __init__(self, input_shape):
        # input_shape: [None, fields, k], where k is the embedding size
        super(InnerProductLayer, self).__init__()
        self.input_shape = input_shape
        self.fields = self.input_shape[1]
        self.k = self.input_shape[2]

    def forward(self, x):
        # Pairwise inner products of all field embeddings
        tmp = torch.matmul(x, x.permute(0, 2, 1))
        res = []
        for row in range(self.fields):
            for col in range(row + 1, self.fields):
                res.append(tmp[:, row, col].reshape(-1, 1))
        out = torch.cat(res, 1)  # [None, fields*(fields-1)/2]
        return out


class OuterProductLayer(nn.Module):
    def __init__(self, input_shape):
        super(OuterProductLayer, self).__init__()
        self.input_shape = input_shape
        self.fields = self.input_shape[1]
        self.k = self.input_shape[2]
        # One k x k kernel per field pair, used to contract each outer product to a scalar
        self.kernel = nn.Parameter(torch.Tensor(self.k, int(self.fields * (self.fields - 1) / 2), self.k))
        nn.init.xavier_uniform_(self.kernel)

    def forward(self, x):
        row = []
        col = []
        for i in range(self.fields - 1):
            for j in range(i + 1, self.fields):
                row.append(i)
                col.append(j)
        p = torch.cat([x[:, i:i + 1, :] for i in row], dim=1)  # [None, pairs, k]
        q = torch.cat([x[:, j:j + 1, :] for j in col], dim=1)  # [None, pairs, k]
        p = p.unsqueeze(dim=1)                                 # [None, 1, pairs, k]
        kp = torch.sum(torch.mul(p, self.kernel), dim=-1)      # [None, k, pairs]
        kp = torch.transpose(kp, 2, 1)                         # [None, pairs, k]
        kp = torch.sum(torch.mul(kp, q), dim=-1)               # [None, pairs]
        return kp


class PNN(nn.Module):
    def __init__(self, Embedding_nums, Embedding_unit, dnn_units, p_dropout,
                 activation=F.relu, mode="both"):
        # Here every category shares the same embedding size Embedding_unit
        super(PNN, self).__init__()
        self.Embedding_nums = Embedding_nums  # its length is the number of fields
        self.Embedding_unit = Embedding_unit  # k
        self.dnn_units = dnn_units
        self.mode = mode
        self.p_dropout = p_dropout
        self.activation = activation
        fields = len(self.Embedding_nums)
        n_pairs = fields * (fields - 1) // 2
        if self.mode == "both":
            # inner + outer pair features plus the flattened embeddings
            self.dnn_input = [None, 2 * n_pairs + fields * self.Embedding_unit]
        else:
            # a single product type plus the flattened embeddings
            self.dnn_input = [None, n_pairs + fields * self.Embedding_unit]
        self.Embedding = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_unit)
                                        for i in range(fields)])
        self.product_input_shape = [None, fields, self.Embedding_unit]
        self.dnn = DNN(self.dnn_input, self.dnn_units, self.p_dropout, self.activation)
        self.InnerProduct = InnerProductLayer(self.product_input_shape)
        self.OuterProduct = OuterProductLayer(self.product_input_shape)

    def forward(self, x):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding[i](x[:, i]))
        x_cate_embedded = torch.stack(x_cate_embedded).permute(1, 0, 2)  # [None, fields, k]
        flat = x_cate_embedded.reshape(-1, len(self.Embedding_nums) * self.Embedding_unit)
        if self.mode == "both":
            inner_part = self.InnerProduct(x_cate_embedded)
            outer_part = self.OuterProduct(x_cate_embedded)
            out = torch.cat([flat, inner_part, outer_part], 1)
        elif self.mode == "inner":
            out = torch.cat([flat, self.InnerProduct(x_cate_embedded)], 1)
        else:
            out = torch.cat([flat, self.OuterProduct(x_cate_embedded)], 1)
        out = self.dnn(out)
        return out


class FMCrossPNN(nn.Module):
    def __init__(self, Embedding_units, Embedding_nums, layer_num, input_shape,
                 input_shape_FM, dnn_units, p_dropout, activation=F.relu):
        super(FMCrossPNN, self).__init__()
        self.layer_num = layer_num
        self.input_shape = input_shape
        self.dnn_units = dnn_units  # the last entry must be 1 so the parts can be summed
        self.p_dropout = p_dropout
        self.activation = activation
        self.Embedding_units = Embedding_units  # a single shared embedding size here
        self.Embedding_nums = Embedding_nums
        fields = len(self.Embedding_nums)
        self.opnn_output_shape = [None, int(fields * (fields - 1) / 2)]
        self.DNN_opnn = DNN(self.opnn_output_shape, self.dnn_units, self.p_dropout, self.activation)
        self.OuterProduct = OuterProductLayer([None, fields, Embedding_units])
        self.Cross = CrossLayer(self.layer_num, self.input_shape)
        self.Embedding_Layers = nn.ModuleList([nn.Embedding(self.Embedding_nums[i], self.Embedding_units)
                                               for i in range(fields)])
        self.FMLayer = FM_Layer()
        self.cross_output = input_shape[-1]
        self.cross_linear = nn.Linear(self.cross_output, 1)

    def forward(self, x_cate):
        x_cate_embedded = []
        for i in range(len(self.Embedding_nums)):
            x_cate_embedded.append(self.Embedding_Layers[i](x_cate[:, i]).reshape([-1, 1, self.Embedding_units]))
        opnn_input = torch.cat(x_cate_embedded, 1)
        normal_input = opnn_input.reshape([-1, len(self.Embedding_nums) * self.Embedding_units])
        Outer_part = self.DNN_opnn(self.OuterProduct(opnn_input))
        Cross_part = self.Cross(normal_input)
        DCN_part = self.cross_linear(Cross_part)
        FM_part = self.FMLayer(normal_input)
        x = Outer_part + DCN_part + FM_part
        out = torch.sigmoid(x)
        return out
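A minimal smoke test for FMCrossPNN (my addition, with made-up sizes), which may be useful when hunting for the oversights mentioned above:

# Hypothetical smoke test: 4 fields, shared embedding size 8
fake_nums = [5, 7, 4, 6]
fake_model = FMCrossPNN(Embedding_units=8, Embedding_nums=fake_nums, layer_num=2,
                        input_shape=[None, 4 * 8], input_shape_FM=None,
                        dnn_units=[16, 1], p_dropout=0.0)
fake_x = torch.randint(0, 4, (10, 4))
print(fake_model(fake_x).shape)  # torch.Size([10, 1])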
5. Appendix 2: Tuning Logistic Regression
There is a Kaggle kernel that achieves a good score with plain logistic regression, tuned with optuna. Since the tuning is time-consuming, I only ran the final model fit here, but the full code is attached.
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import optuna

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")
train.drop("id", axis=1, inplace=True)
test.drop("id", axis=1, inplace=True)
y_train = train["target"]
train.drop("target", axis=1, inplace=True)

X_ttl = pd.concat([train, test])
X_ttl.reset_index(drop=True, inplace=True)
# Sparse one-hot encoding of every column
encoded = pd.get_dummies(X_ttl, columns=X_ttl.columns, sparse=True)
X_train = encoded.iloc[:train.shape[0], :]
X_test = encoded.iloc[train.shape[0]:, :]

kf = StratifiedKFold(n_splits=10)

def optuna_objective(trial):
    # Search the regularization strength on a log scale
    C = trial.suggest_float("C", 10e-10, 10, log=True)
    model = LogisticRegression(C=C, class_weight="balanced",
                               max_iter=10000, solver="lbfgs", n_jobs=-1)
    # Negate the AUC because optuna minimizes by default
    score = -cross_val_score(model, X_train, y_train, cv=kf, scoring="roc_auc").mean()
    return score

study = optuna.create_study()
study.optimize(optuna_objective, n_trials=50, show_progress_bar=True)

# Fit with the best C found by the search
model = LogisticRegression(C=0.07536298444122952, class_weight="balanced",
                           max_iter=10000, solver="lbfgs", n_jobs=-1)
model.fit(X_train, y_train)
logistic_res = model.predict_proba(X_test)
submission["target"] = logistic_res[:, 1]
submission.to_csv("submission_logistics.csv", index=None)