[Semi-final Top-Rank Sharing (2)] Keep This Ace Optimization Guide and Climb the Leaderboard Without the Stress

Tencent Advertising Algorithm Competition

Published 2020-07-31 11:38:36


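The semi-final of the 2020 Tencent Advertising Algorithm Competition has concluded, and the final defense round will be held at 14:00 on August 3 at the Tencent Binhai Building in Shenzhen. For details on the final and to book the livestream, see:

"The Final Is Here! The Top Ten Teams Gather for the Last Battle!"

While the external track was fiercely contested, Tencent also teamed up with 码客 to open an internal track for employees. Team 大雄 (Daxiong), which finished second on the internal semi-final leaderboard, was invited to this top-rank sharing session to explain how they approached the problem. Their strategy throughout the competition shows strong time management and plenty of practical experience. How do you reduce the training burden while keeping the gains from optimization? Here is what they have to say.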

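01 Understanding the Task

The task in this year's Tencent Advertising Algorithm Competition is user profiling: predict a user's age and gender from the user's ad click behavior and the associated ad information.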

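02 Data Fields

time: day-level timestamp; nunique: 91
user_id: randomly assigned number from 1 to N; nunique: about 4,000,000
creative_id: id of the ad creative the user clicked; nunique: 2618159
click_times: number of times the user clicked that creative on that day; nunique: 54
ad_id: id of the ad the creative belongs to (one ad may contain several displayable creatives); nunique: 2379475
product_id: id of the product promoted by the ad; nunique: 34111
product_category: category id of the product promoted by the ad; nunique: 18
advertiser_id: id of the advertiser; nunique: 52861
industry: id of the advertiser's industry; nunique: 326
age: user age bucket [1-10]
gender: user gender [1,2]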

03 Model Input

The final solution uses only five id sequences as model input:
'creative_id'
'ad_id'
'advertiser_id'
'product_id'
'industry'
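
The post does not show how the per-user id sequences were built from the click log, so the following is only a minimal sketch under assumptions: the toy rows, the `build_sequences` helper and the tuple layout are hypothetical, and only the field names come from the data description above. It groups clicks by user and orders them by day, keeping `click_times` aligned with each id.

```python
from collections import defaultdict

# Toy click log rows: (time, user_id, creative_id, click_times).
# The values are made up for illustration only.
click_log = [
    (3, 1, 101, 1),
    (1, 1, 205, 2),
    (2, 2, 101, 1),
    (2, 1, 307, 1),
]


def build_sequences(rows):
    """Group clicks by user and order each user's clicks by day."""
    per_user = defaultdict(list)
    # sorted() is stable, so rows sharing a day keep their original order.
    for time, user_id, creative_id, click_times in sorted(rows, key=lambda r: r[0]):
        per_user[user_id].append((creative_id, click_times))
    ids = {u: [c for c, _ in v] for u, v in per_user.items()}
    clicks = {u: [n for _, n in v] for u, v in per_user.items()}
    return ids, clicks


id_seqs, click_seqs = build_sequences(click_log)
print(id_seqs[1])     # creative_id sequence of user 1, in time order
print(click_seqs[1])  # click_times aligned with that sequence
```

The same grouping would be repeated for each of the five id columns, so that position k in every sequence refers to the same click.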

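Because I could only compete outside working hours, I gave up on feature engineering and simply left training jobs running every day while I tuned the inputs and the architecture. The final solution is not complicated; as long as you keep the time cost of trial and error under control, I believe anyone can get a good result. Below I share what I learned from tuning each of these two parts.

The model input directly sets the ceiling of the model. After trying many options, I found that three things affect the input most directly: the choice of valid tokens, the word2vec embedding training stage, and shuffling the input. The choice of valid tokens determines both whether training is effective and how much memory the embedding matrix consumes; since the organizers did not provide TI-ONE, this went a long way toward easing the memory shortage. Here, ids that appear only once are treated as uninformative. Merging them, together with the ids that do not appear in both the training and test sets, into a single id greatly reduces the training burden and did not hurt the results.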

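        # val_cnt (per-id occurrence counts) and id_map (a dict) are assumed to be defined earlier
        differ = set(train[col].unique()).symmetric_difference(set(test[col].unique()))  # ids present in only one of train/test
        common = set(train[col].unique()) & set(test[col].unique())  # ids present in both train and test
        for v in val_cnt[val_cnt == 1].index:  # ids that occur only once share index 0
            id_map[v] = 0
        for v in differ:  # ids missing from either train or test also share index 0
            id_map[v] = 0
        for i, v in enumerate(common):  # ids in both sets get their own index, starting at 1
            id_map[v] = i + 1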
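
To make the remapping concrete, here is a small self-contained sketch on toy lists. It is not the author's code: `build_id_map` is a hypothetical helper, occurrence counts are assumed to be taken over train and test combined, and the common ids are sorted only to make the mapping reproducible.

```python
from collections import Counter


def build_id_map(train_ids, test_ids):
    """Map raw ids to compact indices; index 0 is the shared 'unknown' bucket."""
    counts = Counter(train_ids)
    counts.update(test_ids)  # assumption: count occurrences over train + test
    train_set, test_set = set(train_ids), set(test_ids)
    id_map = {}
    # ids that occur exactly once share bucket 0
    for v, c in counts.items():
        if c == 1:
            id_map[v] = 0
    # ids seen in only one of the two splits also share bucket 0
    for v in train_set ^ test_set:
        id_map[v] = 0
    # ids seen in both splits get their own index, starting at 1
    for i, v in enumerate(sorted(train_set & test_set)):
        id_map[v] = i + 1
    return id_map


train_ids = [10, 10, 11, 12, 13]
test_ids = [10, 11, 11, 14]
id_map = build_id_map(train_ids, test_ids)
encoded_train = [id_map[v] for v in train_ids]
print(id_map)
print(encoded_train)
```

With this mapping the embedding matrix needs only one row per id shared by both splits plus one row for the merged bucket, which is where the memory saving comes from.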

For w2v training I ended up using skip-gram, with key parameters min_count=1, size=256 and window=10. The best size and window do not generalize well, so just try a few values.

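        # gensim 3.x argument names; gensim 4 renamed size/iter to vector_size/epochs
        # and wv.index2word to wv.index_to_key
        model = models.Word2Vec(list_d, sg=1, min_count=1, size=256, window=10, workers=48, iter=10)
        We = []
        if '0' in model.wv:
            for i in tqdm(range(len(model.wv.index2word))):
                We.append(model.wv[str(i)].reshape((1, -1)))
        else:
            # id 0 never occurs in this column: keep row 0 as zeros so that row i still matches id i
            We.append(np.zeros((1, model.wv.vector_size)))
            for i in tqdm(range(len(model.wv.index2word))):
                We.append(model.wv[str(i + 1)].reshape((1, -1)))
        We = np.vstack(We)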
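
The key property of the matrix built above is that row i holds the vector of the token whose remapped id is i. The sketch below illustrates that alignment with plain NumPy; the `vectors` dict is only a stand-in for `model.wv`, and `build_embedding_matrix` is a hypothetical helper that fills rows by index rather than assuming the ids are contiguous.

```python
import numpy as np


def build_embedding_matrix(vectors, dim):
    """Stack vectors so that row i is the embedding of token str(i)."""
    n_rows = max(int(tok) for tok in vectors) + 1
    We = np.zeros((n_rows, dim), dtype=np.float32)
    for tok, vec in vectors.items():
        We[int(tok)] = vec  # ids without a vector keep an all-zero row
    return We


dim = 4
rng = np.random.default_rng(0)
# Stand-in for model.wv: token string -> vector. Token '0' is absent on purpose.
vectors = {str(i): rng.normal(size=dim).astype(np.float32) for i in (1, 2, 3)}

We = build_embedding_matrix(vectors, dim)
# One extra all-zero row at the end, used later as the padding index.
We_padded = np.vstack([We, np.zeros((1, dim), dtype=np.float32)])
print(We.shape, We_padded.shape)
```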

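For input construction I tried several operations: forward order, reverse order, random shuffle, and repeating each id click_times times. After repeating by click_times, sequence_length should be increased accordingly; taking the 95th percentile of the sequence lengths is enough.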

for col in tqdm(['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry', 'product_category']):
    list_d = pd.read_pickle('./idlist/{}_list.pkl'.format(col))
    We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(col))
    We = np.vstack([We, np.zeros(config.embeddingSize)])  # extra all-zero row, used as the padding id
    list_d = list(list_d)
    for i in range(len(list_d)):
        # repeat each id click_times times (click_times holds the aligned per-user click counts)
        ret = []
        for j in range(len(list_d[i])):
            ret += [list_d[i][j]] * click_times[i][j]
        list_d[i] = ret

        # truncate long sequences; pad short ones with the index of the zero row
        if len(list_d[i]) > config.sequenceLength:
            list_d[i] = list_d[i][:config.sequenceLength]
        else:
            list_d[i] += [len(We) - 1] * (config.sequenceLength - len(list_d[i]))
    list_d = np.array(list_d)
    list_d = list_d.astype(np.int32)  # reduce memory usage
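
The expansion, length selection and padding steps can be tried on a toy example. This is a sketch rather than the author's pipeline: the helper names, the toy sequences and `pad_id` are hypothetical, and the sequence length is picked with `np.percentile` as one way to read "95% of the sequence lengths".

```python
import numpy as np


def expand_by_clicks(ids, clicks):
    """Repeat every id as many times as it was clicked."""
    out = []
    for i, n in zip(ids, clicks):
        out.extend([i] * n)
    return out


def pad_or_truncate(seq, length, pad_id):
    """Cut the sequence to `length`, or right-pad it with `pad_id`."""
    return seq[:length] + [pad_id] * max(0, length - len(seq))


id_seqs = [[5, 7], [3], [2, 2, 9, 4]]
click_seqs = [[2, 1], [1], [1, 1, 3, 1]]
pad_id = 10  # hypothetical: index of the all-zero row appended to the embedding matrix

expanded = [expand_by_clicks(s, c) for s, c in zip(id_seqs, click_seqs)]
lengths = [len(s) for s in expanded]
seq_len = int(np.percentile(lengths, 95))  # covers about 95% of the users without truncation
batch = np.array([pad_or_truncate(s, seq_len, pad_id) for s in expanded], dtype=np.int32)
print(seq_len)
print(batch)
```

Recomputing the percentile after the expansion matters, because repeating ids by click_times makes every sequence at least as long as before.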

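import numpy as np
from keras.utils import Sequence


class DataSequence(Sequence):
    def __init__(self, xs, y, batch_size=128, shuffle=True):
        self.xs = xs
        self.y = y
        self.batch_size = batch_size
        self.size = xs[0].shape[0]
        self.shuffle = shuffle
        if self.shuffle:
            # Restore the same RNG state before every shuffle so that all inputs
            # and the labels get the same permutation (the arrays are shuffled in place)
            state = np.random.get_state()
            for x in self.xs:
                np.random.set_state(state)
                np.random.shuffle(x)
            np.random.set_state(state)
            np.random.shuffle(self.y)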
    def __len__(self):
        return int(np.ceil(self.size / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_idx = np.arange(idx * self.batch_size, min((idx + 1) * self.batch_size, self.size))
        batch_xs = [x[batch_idx] for x in self.xs]
        batch_y = self.y[batch_idx]
        # with probability 0.8, shuffle the order inside each sample, using the same permutation for every id field
        if self.shuffle:
            x = []
            for i in range(len(batch_xs)):
                x.append(batch_xs[i].copy())
            for i in range(len(x[0])):
                p = np.random.rand()
                if p < 0.8:
                    state = np.random.get_state()
                    for j in range(len(batch_xs)):
                        np.random.set_state(state)
                        np.random.shuffle(x[j][i])
            batch_xs = x
        return batch_xs, batch_y
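
The class above keeps the id fields aligned by restoring the same RNG state before each shuffle. An equivalent way to get the same guarantee, shown here only as an alternative sketch (the helper names and the toy arrays are hypothetical), is to draw one permutation and index every array with it.

```python
import numpy as np


def shuffle_together(arrays, rng):
    """Apply one random permutation to every array so positions stay aligned."""
    perm = rng.permutation(len(arrays[0]))
    return [a[perm] for a in arrays]


def maybe_shuffle(arrays, p, rng):
    """Shuffle with probability p, otherwise return the arrays unchanged."""
    if rng.random() < p:
        return shuffle_together(arrays, rng)
    return arrays


rng = np.random.default_rng(42)
creative = np.array([11, 12, 13, 14, 15])
advertiser = np.array([21, 22, 23, 24, 25])

s_creative, s_advertiser = maybe_shuffle([creative, advertiser], p=1.0, rng=rng)
# Every creative id still sits at the same position as its own advertiser id.
print(s_creative, s_advertiser)
```

Indexing with a permutation returns copies, so it uses more memory than the in-place shuffle in the class above; that trade-off is worth keeping in mind on a dataset of this size.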

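04 Model Architecture

On the architecture side I tried LSTM and CNN_Inception structures. The CNN eventually reached about 1.47 as well. A transformer combined with LSTM also worked well, but I never managed to tune it to beat the pure LSTM. You can of course use a transformer-only model, but my results with it were poor; if you are interested, try tuning the open-source implementations from CyberZHG or Hugging Face. My impression is that for this dataset, **fewer heads beat more heads, and more layers beat fewer layers.** You can also modify it to use only Q and K and drop the dense layer, giving a slimmed-down multi-head. In the end I implemented the model framework in both Keras and Torch (competing solo, I had to find some way to get diversity for the final ensemble). The model structures are as follows: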

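## LSTM, Keras version
import gc
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Concatenate, SpatialDropout1D, Bidirectional,
                          CuDNNLSTM, GlobalMaxPooling1D, Dropout, Dense)


def LSTM(config, n_cls=10):
    cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
    n_in = len(cols)
    inputs = []
    outputs = []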
    for i in range(n_in):
        We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(cols[i]))
        We = np.vstack([We, np.zeros(config.embeddingSize)])
        inp = Input(shape=(config.sequenceLength,), dtype="int32")
        x = Embedding(We.shape[0], We.shape[1], weights=[We], trainable=False)(inp)
        inputs.append(inp)
        outputs.append(x)
        del We
        gc.collect()

    embedding_model = Model(inputs, outputs)

    inputs = []
    for i in range(n_in):
        inp = Input(shape=(config.sequenceLength, config.embeddingSize,))
        inputs.append(inp)

    all_input = Concatenate()(inputs)
    all_input = SpatialDropout1D(0.2)(all_input)
    lstm1 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(all_input)
    lstm2 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(lstm1)
    pool_1 = GlobalMaxPooling1D()(lstm1)
    pool_2 = GlobalMaxPooling1D()(lstm2)
    pool = Concatenate()([pool_1, pool_2])
    pool = Dropout(0.2)(pool)

    outputs = Dense(n_cls, activation='softmax')(pool)
    lstm_model = Model(inputs, outputs)
    model = Model(embedding_model.inputs, lstm_model(embedding_model.outputs))

    return model, lstm_model

## LSTM, Torch version
import gc
import numpy as np
import torch as t
import torch.nn as nn


class LSTM(nn.Module):
    def __init__(self, n_cls=10):
        super(LSTM, self).__init__()
        emb_outputs = []

        cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
        n_in = len(cols)
        for i in range(n_in):
            We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(256)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)

        for i in range(n_in):
            We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(128)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
            del We
            gc.collect()

        self.encoders = nn.ModuleList(emb_outputs)
        self.emb_drop = nn.Dropout(p=0.2)
        self.lstm = nn.LSTM(input_size=(256 + 128) * 5, hidden_size=384, num_layers=2, bias=True, batch_first=True,
                            dropout=0.2, bidirectional=True)
        self.max_pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.fc = nn.Sequential(nn.Linear(384, n_cls))
        self.fc_drop = nn.Dropout(p=0.2)

    def forward(self, xs):
        inp = [self.encoders[i](x) for i, x in enumerate(xs)] + [self.encoders[i + 5](x) for i, x in enumerate(xs)]
        x = t.cat(inp, 2)
        x = self.emb_drop(x)
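        x = self.lstm(x)[0]  # (batch, seq_len, 2 * 384)
        x = self.max_pool(x)  # MaxPool1d pools the last (feature) axis: (batch, seq_len, 384)
        x = t.max(x, dim=1)[0]  # max over time: (batch, 384)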
        x = self.fc_drop(x)
        logits = self.fc(x)
        return logits

## CNN_Inception, Torch version
from collections import OrderedDict


class Inception(nn.Module):
    def __init__(self, cin, co, relu=True, norm=True):
        super(Inception, self).__init__()
        assert(co%4==0)
        cos=[co//4]*4
        self.activa=nn.Sequential()
        if norm:self.activa.add_module('norm',nn.BatchNorm1d(co))
        if relu:self.activa.add_module('relu',nn.ReLU(True))
        self.branch1 =nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin,cos[0], 1,stride=1)),
            ]))
        self.branch2 =nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin,cos[1], 1)),
            ('norm1', nn.BatchNorm1d(cos[1])),
            ('relu1', nn.ReLU(inplace=True)),
            ('conv3', nn.Conv1d(cos[1],cos[1], 3,stride=1,padding=1)),
            ]))
        self.branch3 =nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin,cos[2], 3,padding=1)),
            ('norm1', nn.BatchNorm1d(cos[2])),
            ('relu1', nn.ReLU(inplace=True)),
            ('conv3', nn.Conv1d(cos[2],cos[2], 5,stride=1,padding=2)),
            ]))
        self.branch4 =nn.Sequential(OrderedDict([
            #('pool',nn.MaxPool1d(2)),
            ('conv3', nn.Conv1d(cin,cos[3], 3,stride=1,padding=1)),
            ]))
    def forward(self,x):
        branch1=self.branch1(x)
        branch2=self.branch2(x)
        branch3=self.branch3(x)
        branch4=self.branch4(x)
        result=self.activa(t.cat((branch1,branch2,branch3,branch4),1))
        return result
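
The four branches can only be concatenated along the channel axis if they all keep the sequence length unchanged. The following sketch checks that arithmetic with the standard 1-D convolution output-length formula; the `(kernel, padding)` pairs are copied from the Inception block above, while `seq_len = 100` and the helper name are hypothetical.

```python
def conv1d_out_len(length, kernel, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (the formula used by PyTorch's Conv1d)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


seq_len = 100  # hypothetical sequence length
# (kernel, padding) of the convolutions in each branch, in order.
branches = {
    'branch1': [(1, 0)],
    'branch2': [(1, 0), (3, 1)],
    'branch3': [(3, 1), (5, 2)],
    'branch4': [(3, 1)],
}

out_lens = {}
for name, layers in branches.items():
    length = seq_len
    for kernel, padding in layers:
        length = conv1d_out_len(length, kernel, padding)
    out_lens[name] = length

in_channels = (256 + 128) * 5  # two embeddings per id field, five fields
print(out_lens, in_channels)
```

Because each kernel of size k is paired with padding (k - 1) / 2, every branch preserves the length, and the four outputs of co / 4 channels each concatenate back to co channels.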


class CNN(nn.Module):
    def __init__(self, n_cls=10):
        super(CNN, self).__init__()
        emb_outputs = []

        cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
        n_in = len(cols)
        for i in range(n_in):
            We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(256)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)

        for i in range(n_in):
            We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(128)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
            del We
            gc.collect()

        self.encoders = nn.ModuleList(emb_outputs)
        self.emb_drop = nn.Dropout(p=0.2)
        self.embed_conv = nn.Sequential(
            Inception(1920, 1024),  # (batch, (256 + 128) * 5, seq_len) -> (batch, 1024, seq_len)
            Inception(1024, 1024),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 1024),  # forward() max-pools over time, leaving 1024 features per sample
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.2),
            nn.Linear(1024, n_cls)
        )

    def forward(self, xs):
        inp = [self.encoders[i](x) for i, x in enumerate(xs)] + [self.encoders[i + 5](x) for i, x in enumerate(xs)]
        x = t.cat(inp, 2)
        x = self.emb_drop(x)
        x = self.embed_conv(x.permute(0, 2, 1))
        x = t.max(x.permute(0, 2, 1), dim=1)[0]
        logits = self.fc(x)
        return logits