Kaggle in Practice: Click-Through Rate Prediction

hellozhxy

Original article: https://zhuanlan.zhihu.com/p/32500652

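Requires TensorFlow 1.0 and Python 3.5.

Project repository: chengstone/kaggle_criteo_ctr_challenge-

Click-through rate (CTR) prediction estimates the probability that a user will click a given ad. By scoring every ad impression, it finds the ads a user is most likely to click, which makes it one of the most important algorithms in advertising technology.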

Downloading the dataset

We use the Criteo dataset from the Display Advertising Challenge on Kaggle.

To download the dataset, run the following commands in a terminal (script path: ./data/download.sh):

wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz

tar zxf dac.tar.gz

rm -f dac.tar.gz

mkdir raw

mv ./*.txt raw/

After extraction, train.txt is 11.7 GB and test.txt is 1.35 GB.

That is too much data, so we only use the first 1 million rows.

head -n 1000000 test.txt > test_sub100w.txt

head -n 1000000 train.txt > train_sub100w.txt

Then rename the files back to train.txt and test.txt, keeping them in the same location.

Data fields

Label

Target variable that indicates if an ad was clicked (1) or not (0).

I1-I13

A total of 13 columns of integer features (mostly count features).

C1-C26

A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.

The data contains a Label field indicating whether the ad was clicked, 13 numeric features I1-I13 (dense input), and 26 categorical features C1-C26 (sparse input).
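As a rough illustration of this layout, the sketch below splits one raw line into its label, integer, and categorical parts. It assumes the tab-separated format with empty strings for missing values that the preprocessing code later in this article relies on; the function name `parse_criteo_line` and the sample values are made up for the example.

```python
def parse_criteo_line(line, has_label=True):
    """Split one tab-separated raw line into (label, ints, cats)."""
    fields = line.rstrip('\n').split('\t')
    offset = 1 if has_label else 0
    label = int(fields[0]) if has_label else None
    # Empty strings mark missing values; keep them as None.
    ints = [int(v) if v != '' else None for v in fields[offset:offset + 13]]
    cats = [v if v != '' else None for v in fields[offset + 13:offset + 39]]
    return label, ints, cats


# A made-up line: label 1, I1..I13 (I2 missing), C1..C26 (C3 missing).
int_part = ['5', '', '3'] + ['0'] * 10
cat_part = ['68fd1e64', '80e26c9b', ''] + ['a1b2c3d4'] * 23
example = '\t'.join(['1'] + int_part + cat_part) + '\n'

label, ints, cats = parse_criteo_line(example)
print(label, ints[:3], cats[:3])
```

Lines from test.txt carry no label, so they would be parsed with `has_label=False`.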

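Network model

The model consists of three sub-networks: an FFM (Field-aware Factorization Machine), an FM (Factorization Machine), and a DNN. The FM network itself has two components, GBDT and FM. Data preprocessing usually involves feature engineering, such as feature crossing, to find the features that help prediction, and doing that well takes real skill.

This time we skip feature engineering, combine these components with a deep neural network, and leave feature selection to the model. FFM uses LibFFM, FM uses LibFM, and GBDT uses LightGBM; you could use xgboost instead.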

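GBDT

Given the training data, GBDT trains a number of trees. We use the leaf node that each tree outputs for a sample, and feed these leaf nodes to FM as categorical features. For more on using decision trees this way, see Facebook's paper Practical Lessons from Predicting Clicks on Ads at Facebook.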

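FM addresses the feature-combination problem on large, sparse data. Here is the formula (second-order terms only):

$$y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij} x_i x_j$$

Here $n$ is the number of features in a sample, $x_i$ is the value of the i-th feature, and $w_0$, $w_i$, $w_{ij}$ are the model parameters.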

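The formula is a linear model plus the feature combinations $x_i x_j$, which only matter when $x_i$ and $x_j$ are both non-zero. In practice, training the parameters of these combinations is hard. Input data is usually sparse, so $x_i$ and $x_j$ are 0 most of the time, and the parameter of a combination can only be trained to a meaningful value when both features are non-zero.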

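Take shopping-related features as an example: women may pay more attention to cosmetics or jewelry, while men may pay more attention to sporting goods or electronics, which shows that training feature combinations is worthwhile. A product feature may have hundreds or thousands of categories, and we usually one-hot encode categorical features, so a single feature becomes hundreds of dimensions. Add the other categorical features and the input feature space blows up, so sparsity is an unavoidable challenge in real problems.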

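To make the second-order parameters trainable, we bring in matrix factorization. The previous article discussed a movie recommender: we built user feature vectors and movie feature vectors, and the dot product of the two gave a user's rating for a movie. Multiplying the user feature matrix by the movie feature matrix gives the rating matrix of all users for all movies.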

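Viewed in reverse, the rating matrix can be factorized into a user matrix and a movie matrix, and each entry of the rating matrix corresponds to a combination parameter $w_{ij}$ discussed above.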

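For the parameter matrix $W$ we use matrix factorization, decomposing each parameter $w_{ij}$ into the dot product of two vectors (called latent vectors). The matrix is then factorized as $W = V^T V$, each parameter is $w_{ij} = \langle v_i, v_j \rangle$, and $v_i$ is the latent vector of the i-th feature. The second-order FM formula becomes:

$$y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$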

That is the idea behind the FM model.
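To make the factorized second-order term concrete, here is a minimal pure-Python sketch. It computes the pairwise term directly from the latent vectors and also with the well-known linear-time rewrite $\frac{1}{2}\sum_f [(\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2]$. That rewrite comes from the FM literature rather than from this article, and the function names, sizes, and random weights are illustrative assumptions; LibFM's actual implementation is not shown here.

```python
import random


def fm_pairwise_naive(x, V):
    """Sum over i<j of <v_i, v_j> * x_i * x_j, computed directly."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, V):
    """Same quantity in O(n*k) using the sum-of-squares identity."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total


def fm_predict(x, w0, w, V):
    """Second-order FM score: bias + linear term + factorized pairwise term."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return linear + fm_pairwise_fast(x, V)


rng = random.Random(0)
n, k = 6, 3                              # 6 features, latent dimension 3
x = [1.0, 0.0, 0.0, 1.0, 0.5, 0.0]       # a sparse input row
w0 = 0.1
w = [rng.uniform(-1, 1) for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

print(fm_pairwise_naive(x, V), fm_pairwise_fast(x, V), fm_predict(x, w0, w, V))
```

Both pairwise functions should print the same value up to floating-point error; only features with non-zero $x_i$ contribute, which is why the sparse case stays cheap.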

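The leaf nodes output by GBDT are used as the input for training the FM model, so our FM network requires training both GBDT and FM. As you can see, this CTR prediction network is considerably more complex, with more factors and hyperparameters affecting the final result. Training of the FM and GBDT components is covered later in this article.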

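FFM

Next we need to train the FFM model. FFM adds the concept of a field on top of FM. A product column, for example, is one categorical feature that can be split into many different features, but those features all belong to the same field. Put differently, all values of one categorical feature go into the same field.

This is a one-to-many relationship. Take an occupation column: it is one feature, and after one-hot encoding it becomes N features. Those N features all belong to occupation, so occupation is a field.

We train latent vectors through feature combinations, so each feature $x_i$ learns one latent vector $v_{i,f_j}$ for every field $f_j$ of the other features. In other words, a latent vector depends not only on the feature but also on the field. The model formula is:

$$y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i,f_j}, v_{j,f_i} \rangle x_i x_j$$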
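The field-aware lookup is easier to see in code. The sketch below scores one sample given as (field, feature index, value) triples. The function name `ffm_score`, the nested-list layout `latent[feature][field]`, and the weights are assumptions made for illustration; this is not LibFFM's implementation.

```python
def ffm_score(features, latent, w0=0.0, w=None):
    """Score one sample.

    features: list of (field, feature_index, value) triples.
    latent[j][f]: latent vector that feature j uses against field f.
    """
    score = w0
    if w is not None:
        score += sum(w[j] * v for _, j, v in features)
    for a in range(len(features)):
        f1, j1, x1 = features[a]
        for b in range(a + 1, len(features)):
            f2, j2, x2 = features[b]
            # Feature j1 uses the vector it learned for field f2, and vice versa.
            v1 = latent[j1][f2]
            v2 = latent[j2][f1]
            score += sum(p * q for p, q in zip(v1, v2)) * x1 * x2
    return score


# Two features, two fields, latent dimension 2 (made-up weights).
latent = [
    [[0.1, 0.2], [1.0, 2.0]],    # feature 0: vectors for field 0 and field 1
    [[0.5, -1.0], [0.3, 0.3]],   # feature 1: vectors for field 0 and field 1
]
sample = [(0, 0, 1.0), (1, 1, 1.0)]   # (field, feature index, value) triples
print(ffm_score(sample, latent))
```

With these weights the only pair is feature 0 (field 0) with feature 1 (field 1), so the score is the dot product of `latent[0][1]` and `latent[1][0]`.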

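DNN

Now the DNN part. The input is split into two parts: numeric features (dense input) and categorical features (sparse input). We again do not use one-hot encoding; instead the categorical features go through an embedding layer to produce several embedding vectors. These are concatenated with the numeric features and passed through three fully connected layers with ReLU activation. The output of the third fully connected layer is then concatenated with the outputs of the FFM and FM fully connected layers and passed to the final fully connected layer.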
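The data flow just described can be sketched as a tiny forward pass in pure Python. Everything specific here is an assumption for illustration: the layer widths, the embedding size, the vocabulary size, the random weights, and the simplification of treating the FFM and FM outputs as single numbers. The project's real network is built in TensorFlow and is not reproduced here.

```python
import math
import random

rng = random.Random(0)

EMBED_DIM = 4              # hypothetical embedding size
VOCAB = 10                 # hypothetical number of categorical ids
N_DENSE, N_SPARSE = 13, 26


def init(n_in, n_out):
    """Random weight matrix of shape [n_in][n_out] plus a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = []
    for column, b in zip(zip(*weights), biases):
        s = sum(i * wgt for i, wgt in zip(inputs, column)) + b
        out.append(max(0.0, s) if relu else s)
    return out


embedding = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in range(VOCAB)]
layers = [init(N_DENSE + N_SPARSE * EMBED_DIM, 16), init(16, 8), init(8, 4)]
final = init(4 + 2, 1)     # third-layer output + FFM output + FM output


def forward(dense_x, sparse_ids, ffm_out, fm_out):
    # Embedding lookup replaces one-hot encoding for the categorical ids.
    h = list(dense_x)
    for idx in sparse_ids:
        h.extend(embedding[idx])
    # Three fully connected layers with ReLU.
    for weights, biases in layers:
        h = dense(h, weights, biases)
    # Concatenate with the FFM and FM outputs, then the final sigmoid layer.
    logit = dense(h + [ffm_out, fm_out], *final, relu=False)[0]
    return 1.0 / (1.0 + math.exp(-logit))


p = forward([0.05] * N_DENSE, [i % VOCAB for i in range(N_SPARSE)], 0.2, -0.1)
print(p)
```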

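The target Label indicates whether the ad was clicked and takes only two values: 1 (clicked) and 0 (not clicked). The last layer of the network therefore performs logistic regression: the final fully connected layer uses a sigmoid activation to output the probability that the ad is clicked.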

LogLoss is the loss function and FTRL is the learning algorithm.
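For reference, the sigmoid output and the log loss can be written in a few lines. This is a minimal sketch of the standard definitions; the clipping constant `eps` is an assumption added to avoid taking the log of 0.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


print(sigmoid(0.0))                      # a logit of 0 means "no preference"
print(log_loss([1, 0], [0.5, 0.5]))      # equals ln(2) for uninformed predictions
```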

Paper on FTRL: Ad_click_prediction_a_view_from_the_trenches
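To give a feel for what FTRL does, here is a small per-coordinate FTRL-Proximal sketch for logistic regression, written from the update rules in that paper. The class name, the hyperparameter defaults, and the toy data are assumptions for illustration; the project itself uses TensorFlow's optimizer rather than this code.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse input."""

    def __init__(self, n_features, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_features   # accumulated adjusted gradients
        self.n = [0.0] * n_features   # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0                # L1 keeps small coordinates at exactly 0
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / step

    def predict(self, x):
        """x is a dict {feature index: value}."""
        s = sum(self._weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # guard against overflow in exp()
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v           # gradient of log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
        return p


# Toy data: feature 0 always goes with a click, feature 1 never does.
model = FTRLProximal(n_features=2)
data = [({0: 1.0}, 1), ({1: 1.0}, 0)]
for _ in range(20):
    for x, y in data:
        model.update(x, y)

print(model.predict({0: 1.0}), model.predict({1: 1.0}))
```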

I modified the LibFFM and LibFM code, so please use my versions from the repository.

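Preprocessing the dataset

Generate the neural network input
Generate the FFM input
Generate the GBDT input

First we preprocess the inputs for the DNN, FFM, and GBDT. Numeric features I1-I13 are scaled to decimals between 0 and 1. For categorical features, values that occur fewer than cutoff (a hyperparameter) times are ignored; the frequently occurring values are kept as the features of that categorical field, and are then numbered within each field.

For example, take two categorical fields C1 and C2. C1 has features a (more than cutoff occurrences), b (fewer than cutoff), and c (more than cutoff); C2 has features x and y (both more than cutoff). The retained features are C1: a, c and C2: x, y. They are numbered per field: for C1, a and c get feature ids 0 and 1; for C2, x and y also get 0 and 1.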
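The C1/C2 example can be reproduced with a short sketch. The function name `build_field_dicts` and the toy rows are made up for illustration, and the ids start at 0 to match the paragraph above. The preprocessing code shown later in this article differs in one detail: it reserves id 0 for '<unk>' and numbers the retained values from 1.

```python
from collections import Counter


def build_field_dicts(rows, cutoff):
    """rows: list of tuples holding one categorical value per field."""
    n_fields = len(rows[0])
    counters = [Counter() for _ in range(n_fields)]
    for row in rows:
        for f, value in enumerate(row):
            if value != '':
                counters[f][value] += 1
    dicts = []
    for counter in counters:
        # Keep values seen at least `cutoff` times, most frequent first.
        kept = [(v, c) for v, c in counter.items() if c >= cutoff]
        kept.sort(key=lambda vc: (-vc[1], vc[0]))
        dicts.append({v: i for i, (v, _) in enumerate(kept)})
    return dicts


# Columns are (C1, C2); value 'b' appears only once and falls below the cutoff.
rows = [('a', 'x'), ('a', 'y'), ('c', 'x'), ('c', 'y'),
        ('a', 'x'), ('b', 'y'), ('c', 'x')]
dicts = build_field_dicts(rows, cutoff=2)
print(dicts)
```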

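The categorical input is processed differently for FFM and GBDT, so we cover them separately.

GBDT

GBDT processing is simpler: the per-field feature id of each of C1-C26 is used directly as input. The GBDT input format is: Label I1-I13 C1-C26. An actual row might look like: 0 decimal1 decimal2 ~ decimal13 1 (C1 feature id) 0 (C2 feature id) ~ C26 feature id. Here the C1 feature id is 1, meaning the C1 feature is c, and the C2 feature is x.

Here is a real generated row: 0 0.05 0.004983 0.05 0 0.021594 0.008 0.15 0.04 0.362 0.166667 0.2 0 0.04 2 3 0 0 1 1 0 3 1 0 0 0 0 3 0 0 1 4 1 3 0 0 2 0 1 0

Sorry, my writing is not great. If the paragraph above is confusing, just read the code :)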

FFM

The FFM input is more complex. See the official GitHub README for details; the relevant excerpt is:

It is important to understand the difference between field and feature. For example, if we have a raw data like this:

Click  Advertiser  Publisher
=====  ==========  =========
0        Nike        CNN
1        ESPN        BBC

Here, we have

* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionares, one for field and one for features, like this:

DictField[Advertiser] -> 0
DictField[Publisher]  -> 1

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN]   -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM format data:

0 0:0:1 1:1:1
1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.

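Fields should be easy to understand. Features are divided differently from GBDT: in the GBDT processing above, each categorical field is numbered independently, so C1 has features 0~n and C2 has features 0~n. For FFM, all features are numbered together. In the example, C1 is Advertiser with two features and C2 is Publisher with two features, so joint numbering gives 0~3. With the independent numbering used for GBDT it would look like this: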

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Publisher-CNN]   -> 0
DictFeature[Publisher-BBC]   -> 1

Now suppose there is a third row; let's see how to construct the FFM input:

Click  Advertiser  Publisher
=====  ==========  =========
0        Nike        CNN
1        ESPN        BBC
0        Lining      CNN

Following the rule, it should look like this:

DictFeature[Advertiser-Nike]   -> 0
DictFeature[Publisher-CNN]     -> 1
DictFeature[Advertiser-ESPN]   -> 2
DictFeature[Publisher-BBC]     -> 3
DictFeature[Advertiser-Lining] -> 4

Our FFM input processing differs slightly: after all features of one field are numbered, numbering continues with the next field, so the final feature numbering is:

DictFeature[Advertiser-Nike]   -> 0
DictFeature[Advertiser-ESPN]   -> 1
DictFeature[Advertiser-Lining] -> 2
DictFeature[Publisher-CNN]     -> 3
DictFeature[Publisher-BBC]     -> 4

In our data, numbering starts at I1 and covers I1-I13, so the numbering for C1 starts with an offset of 13.
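The field-by-field numbering boils down to a running offset per field. The sketch below reproduces the Advertiser/Publisher numbering and then shows the effect of reserving the first 13 ids for I1-I13. The helper name `field_offsets` is hypothetical; the preprocessing code later in the article builds the same kind of offsets in `categorial_feature_offset`.

```python
def field_offsets(dict_sizes, start=0):
    """Return the first global id of each field, numbering fields one after another."""
    offsets = []
    running = start
    for size in dict_sizes:
        offsets.append(running)
        running += size
    return offsets


fields = {
    'Advertiser': ['Nike', 'ESPN', 'Lining'],
    'Publisher': ['CNN', 'BBC'],
}
sizes = [len(values) for values in fields.values()]
offsets = field_offsets(sizes)

global_ids = {}
for (field, values), off in zip(fields.items(), offsets):
    for local_id, value in enumerate(values):
        global_ids[field + '-' + value] = off + local_id

print(global_ids)
print(field_offsets(sizes, start=13))   # ids shifted past the 13 numeric features
```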

Here is a real FFM input row: 0 0:0:0.05 1:1:0.004983 2:2:0.05 3:3:0 4:4:0.021594 5:5:0.008 6:6:0.15 7:7:0.04 8:8:0.362 9:9:0.166667 10:10:0.2 11:11:0 12:12:0.04 13:15:1 14:29:1 15:64:1 16:76:1 17:92:1 18:101:1 19:107:1 20:122:1 21:131:1 22:133:1 23:143:1 24:166:1 25:179:1 26:209:1 27:216:1 28:243:1 29:260:1 30:273:1 31:310:1 32:317:1 33:318:1 34:333:1 35:340:1 36:348:1 37:368:1 38:381:1

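DNN

The DNN input is less complicated: the decimals for I1-I13 and the unified numbering for C1-C26, as in FFM but without the offset of 13, followed by the Label.

A real row looks like this: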

0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,305,320,327,335,355,368,0
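The DNN row and the FFM row describe the same sample, so one can be derived from the other. The sketch below converts the DNN row above into field:feature:value form by adding 13 to every categorical id. The function name `dnn_row_to_ffm` is made up, and it joins tokens with spaces to match how the rows are displayed in this article, whereas the actual preprocessing code writes tab-separated files.

```python
def dnn_row_to_ffm(row, n_dense=13):
    """Convert 'dense..., categorical ids..., label' into an FFM-style row."""
    parts = row.strip().split(',')
    dense, cats, label = parts[:n_dense], parts[n_dense:-1], parts[-1]
    tokens = [label]
    # Numeric features: field i, feature i, value = the normalized decimal.
    tokens += ['{}:{}:{}'.format(i, i, v) for i, v in enumerate(dense)]
    # Categorical features: field 13+i, feature id shifted by 13, value 1.
    tokens += ['{}:{}:1'.format(i + n_dense, int(c) + n_dense)
               for i, c in enumerate(cats)]
    return ' '.join(tokens)


dnn_row = ('0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,'
           '2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,'
           '305,320,327,335,355,368,0')
ffm_row = dnn_row_to_ffm(dnn_row)
print(ffm_row)
```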

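That is all there is to explain, so let's look at the code. It generates the training, validation, and test data together, so it takes a while to run.

Core code walkthrough

For the complete code, see the project repository.

The code below comes from preprocess.py in Baidu's deep_fm, with a few additions; I did not reinvent the wheel :)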

import collections
import os
import random
import sys

import numpy as np

# There are 13 integer features and 26 categorical features
continous_features = range(1, 14)
categorial_features = range(14, 40)

# Clip integer features. The clip point for each integer feature
# is derived from the 95% quantile of the total values in each feature
continous_clip = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]

class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])

class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        for i in range(0, self.num_feature):
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())

            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))

def preprocess(datadir, outdir):
    """
    All the 13 integer features are normalized to continuous values and these
    continuous features are combined into one vector with dimension 13.

    Each of the 26 categorical features are one-hot encoded and all the one-hot
    vectors are combined into one sparse binary vector.
    """
    dists = ContinuousFeatureGenerator(len(continous_features))
    dists.build(os.path.join(datadir, 'train.txt'), continous_features)

    dicts = CategoryDictGenerator(len(categorial_features))
    dicts.build(
        os.path.join(datadir, 'train.txt'), categorial_features, cutoff=200)#200 50

    dict_sizes = dicts.dicts_sizes()
    categorial_feature_offset = [0]
    for i in range(1, len(categorial_features)):
        offset = categorial_feature_offset[i - 1] + dict_sizes[i - 1]
        categorial_feature_offset.append(offset)

    random.seed(0)

    # 90% of the data are used for training, and 10% of the data are used
    # for validation.
    train_ffm = open(os.path.join(outdir, 'train_ffm.txt'), 'w')
    valid_ffm = open(os.path.join(outdir, 'valid_ffm.txt'), 'w')

    train_lgb = open(os.path.join(outdir, 'train_lgb.txt'), 'w')
    valid_lgb = open(os.path.join(outdir, 'valid_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'train.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid.txt'), 'w') as out_valid:
            with open(os.path.join(datadir, 'train.txt'), 'r') as f:
                for line in f:
                    features = line.rstrip('\n').split('\t')
                    continous_feats = []
                    continous_vals = []
                    for i in range(0, len(continous_features)):

                        val = dists.gen(i, features[continous_features[i]])
                        continous_vals.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                        continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                    categorial_vals = []
                    categorial_lgb_vals = []
                    for i in range(0, len(categorial_features)):
                        val = dicts.gen(i, features[categorial_features[i]]) + categorial_feature_offset[i]
                        categorial_vals.append(str(val))
                        val_lgb = dicts.gen(i, features[categorial_features[i]])
                        categorial_lgb_vals.append(str(val_lgb))

                    continous_vals = ','.join(continous_vals)
                    categorial_vals = ','.join(categorial_vals)
                    label = features[0]
                    if random.randint(0, 9999) % 10 != 0:
                        out_train.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        train_ffm.write('\t'.join(label) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                        
                        train_lgb.write('\t'.join(label) + '\t')
                        train_lgb.write('\t'.join(continous_feats) + '\t')
                        train_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

                    else:
                        out_valid.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        valid_ffm.write('\t'.join(label) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                                                
                        valid_lgb.write('\t'.join(label) + '\t')
                        valid_lgb.write('\t'.join(continous_feats) + '\t')
                        valid_lgb.write('\t'.join(categorial_lgb_vals) + '\n')
                        
    train_ffm.close()
    valid_ffm.close()

    train_lgb.close()
    valid_lgb.close()

    test_ffm = open(os.path.join(outdir, 'test_ffm.txt'), 'w')
    test_lgb = open(os.path.join(outdir, 'test_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'test.txt'), 'w') as out:
        with open(os.path.join(datadir, 'test.txt'), 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')

                continous_feats = []
                continous_vals = []
                for i in range(0, len(continous_features)):
                    val = dists.gen(i, features[continous_features[i] - 1])
                    continous_vals.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                    continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                categorial_vals = []
                categorial_lgb_vals = []
                for i in range(0, len(categorial_features)):
                    val = dicts.gen(i,
                                    features[categorial_features[i] -
                                             1]) + categorial_feature_offset[i]
                    categorial_vals.append(str(val))

                    val_lgb = dicts.gen(i, features[categorial_features[i] - 1])
                    categorial_lgb_vals.append(str(val_lgb))

                continous_vals = ','.join(continous_vals)
                categorial_vals = ','.join(categorial_vals)

                out.write(','.join([continous_vals, categorial_vals]) + '\n')
                
                test_ffm.write('\t'.join(['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                test_ffm.write('\t'.join(
                    ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                                                                
                test_lgb.write('\t'.join(continous_feats) + '\t')
                test_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    test_ffm.close()
    test_lgb.close()
    return dict_sizes

Training FFM

The data is ready, so we call LibFFM to train the FFM model.

The learning rate is 0.1 with 32 iterations; the trained model is saved as model_ffm.

import subprocess, sys, os, time

NR_THREAD = 1
cmd = './libffm/libffm/ffm-train --auto-stop -r 0.1 -t 32 -s {nr_thread} -p ./data/valid_ffm.txt ./data/train_ffm.txt model_ffm'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

Training output:

['First check if the text file has already been converted to binary format (1.3 seconds)\n',
 'Binary file found. Skip converting text to binary\n',
 'First check if the text file has already been converted to binary format (0.2 seconds)\n',
 'Binary file found. Skip converting text to binary\n',
 'iter   tr_logloss   va_logloss      tr_time\n',
 '   1      0.49339      0.48196         12.8\n',
 '   2      0.47621      0.47651         25.9\n',
 '   3      0.47149      0.47433         39.0\n',
 '   4      0.46858      0.47277         51.2\n',
 '   5      0.46630      0.47168         63.0\n',
 '   6      0.46447      0.47092         74.7\n',
 '   7      0.46269      0.47038         86.4\n',
 '   8      0.46113      0.47000         98.0\n',
 '   9      0.45960      0.46960        109.6\n',
 '  10      0.45811      0.46940        121.2\n',
 '  11      0.45660      0.46913        132.5\n',
 '  12      0.45509      0.46899        144.3\n',
 '  13      0.45366      0.46903\n',
 'Auto-stop. Use model at 12th iteration.\n']
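If you want to pick the best iteration programmatically instead of reading the log, the lines returned by `os.popen(cmd).readlines()` can be parsed. This is a minimal sketch that assumes the column layout seen in the output above (iter, tr_logloss, va_logloss, and an optional tr_time); the function name `best_iteration` is made up.

```python
def best_iteration(log_lines):
    """Return (iteration, va_logloss) with the lowest validation log loss."""
    best = None
    for line in log_lines:
        parts = line.split()
        # Data rows start with an integer iteration number.
        if len(parts) >= 3 and parts[0].isdigit():
            iteration, va_logloss = int(parts[0]), float(parts[2])
            if best is None or va_logloss < best[1]:
                best = (iteration, va_logloss)
    return best


# A few lines copied from the training output above.
log = [
    'iter   tr_logloss   va_logloss      tr_time\n',
    '  11      0.45660      0.46913        132.5\n',
    '  12      0.45509      0.46899        144.3\n',
    '  13      0.45366      0.46903\n',
    'Auto-stop. Use model at 12th iteration.\n',
]
print(best_iteration(log))
```

This agrees with the auto-stop message: validation log loss bottoms out at iteration 12 and rises at iteration 13.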

With the FFM model trained, we feed the training, validation, and test data through it to get the FFM layer output. The output files are named *.out.logit.

cmd = './libffm/libffm/ffm-predict ./data/train_ffm.txt model_ffm tr_ffm.out'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/valid_ffm.txt model_ffm va_ffm.out'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/test_ffm.txt model_ffm te_ffm.out true'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

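Training GBDT

Now we call LightGBM to train the GBDT model. Because decision trees overfit easily, we set the number of trees to 32 and the number of leaves to 30, leave the depth unset, and use a learning rate of 0.05.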

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

def lgb_pred(tr_path, va_path, _sep = '\t', iter_num = 32):
    # load or create your dataset
    print('Load data...')
    df_train = pd.read_csv(tr_path, header=None, sep=_sep)
    df_test = pd.read_csv(va_path, header=None, sep=_sep)
    
    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values
    
    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
    
    # specify your configurations as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': {'l2', 'auc', 'logloss'},
        'num_leaves': 30,
#         'max_depth': 7,
        'num_trees': 32,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }
    
    print('Start training...')
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=iter_num,
                    valid_sets=lgb_eval,
                    feature_name=["I1","I2","I3","I4","I5","I6","I7","I8","I9","I10","I11","I12","I13","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    categorical_feature=["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    early_stopping_rounds=5)
    
    print('Save model...')
    # save model to file
    gbm.save_model('lgb_model.txt')
    
    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

    return gbm,y_pred,X_train,y_train

Training output:

[1]	valid_0's l2: 0.241954	valid_0's auc: 0.70607
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l2: 0.234704	valid_0's auc: 0.715608
[3]	valid_0's l2: 0.228139	valid_0's auc: 0.717791
[4]	valid_0's l2: 0.222168	valid_0's auc: 0.72273
[5]	valid_0's l2: 0.216728	valid_0's auc: 0.724065
[6]	valid_0's l2: 0.211819	valid_0's auc: 0.725036
[7]	valid_0's l2: 0.207316	valid_0's auc: 0.727427
[8]	valid_0's l2: 0.203296	valid_0's auc: 0.728583
[9]	valid_0's l2: 0.199582	valid_0's auc: 0.730092
[10]	valid_0's l2: 0.196185	valid_0's auc: 0.730792
[11]	valid_0's l2: 0.193063	valid_0's auc: 0.732316
[12]	valid_0's l2: 0.190268	valid_0's auc: 0.733773
[13]	valid_0's l2: 0.187697	valid_0's auc: 0.734782
[14]	valid_0's l2: 0.185351	valid_0's auc: 0.735636
[15]	valid_0's l2: 0.183215	valid_0's auc: 0.736346
[16]	valid_0's l2: 0.181241	valid_0's auc: 0.737393
[17]	valid_0's l2: 0.179468	valid_0's auc: 0.737709
[18]	valid_0's l2: 0.177829	valid_0's auc: 0.739096
[19]	valid_0's l2: 0.176326	valid_0's auc: 0.740135
[20]	valid_0's l2: 0.174948	valid_0's auc: 0.741065
[21]	valid_0's l2: 0.173675	valid_0's auc: 0.742165
[22]	valid_0's l2: 0.172499	valid_0's auc: 0.742672
[23]	valid_0's l2: 0.171471	valid_0's auc: 0.743246
[24]	valid_0's l2: 0.17045	valid_0's auc: 0.744415
[25]	valid_0's l2: 0.169582	valid_0's auc: 0.744792
[26]	valid_0's l2: 0.168746	valid_0's auc: 0.745478
[27]	valid_0's l2: 0.167966	valid_0's auc: 0.746282
[28]	valid_0's l2: 0.167264	valid_0's auc: 0.74675
[29]	valid_0's l2: 0.166582	valid_0's auc: 0.747429
[30]	valid_0's l2: 0.16594	valid_0's auc: 0.748392
[31]	valid_0's l2: 0.165364	valid_0's auc: 0.748986
[32]	valid_0's l2: 0.164844	valid_0's auc: 0.749362
Did not meet early stopping. Best iteration is:
[32]	valid_0's l2: 0.164844	valid_0's auc: 0.749362
Save model...
Start predicting...
The rmse of prediction is: 0.406009502303

Let's sort the features by importance:

def ret_feat_impt(gbm):
    gain = gbm.feature_importance("gain").reshape(-1, 1) / sum(gbm.feature_importance("gain"))
    col = np.array(gbm.feature_name()).reshape(-1, 1)
    return sorted(np.column_stack((col, gain)),key=lambda x: x[1],reverse=True)

[array(['I6', '0.1978774213012332'],
       dtype='<U32'), array(['I11', '0.1892171073393491'],
       dtype='<U32'), array(['C13', '0.09876586224832032'],
       dtype='<U32'), array(['I7', '0.09328723289667494'],
       dtype='<U32'), array(['C15', '0.07837089393651243'],
       dtype='<U32'), array(['I1', '0.06896606612740637'],
       dtype='<U32'), array(['C18', '0.03397325870627491'],
       dtype='<U32'), array(['C4', '0.03194220375573926'],
       dtype='<U32'), array(['I13', '0.027751948092299045'],
       dtype='<U32'), array(['C14', '0.022884477973766117'],
       dtype='<U32'), array(['C17', '0.01758709018584479'],
       dtype='<U32'), array(['I3', '0.01745531293913725'],
       dtype='<U32'), array(['C24', '0.015748415135270675'],
       dtype='<U32'), array(['C7', '0.014203757070472703'],
       dtype='<U32'), array(['I8', '0.013413268591324624'],
       dtype='<U32'), array(['C11', '0.012366386458128355'],
       dtype='<U32'), array(['C10', '0.011022221770323784'],
       dtype='<U32'), array(['I5', '0.01042866903792042'],
       dtype='<U32'), array(['C16', '0.010389410428237439'],
       dtype='<U32'), array(['I9', '0.009918639946598076'],
       dtype='<U32'), array(['C2', '0.006787009911825981'],
       dtype='<U32'), array(['C12', '0.005168884905437884'],
       dtype='<U32'), array(['I4', '0.00468917800335175'],
       dtype='<U32'), array(['C26', '0.003364625407413743'],
       dtype='<U32'), array(['C23', '0.0031263193710805628'],
       dtype='<U32'), array(['C21', '0.0008737398560005959'],
       dtype='<U32'), array(['C19', '0.00042059860405565207'],
       dtype='<U32'), array(['I2', '0.0'],
       dtype='<U32'), array(['I10', '0.0'],
       dtype='<U32'), array(['I12', '0.0'],
       dtype='<U32'), array(['C1', '0.0'],
       dtype='<U32'), array(['C3', '0.0'],
       dtype='<U32'), array(['C5', '0.0'],
       dtype='<U32'), array(['C6', '0.0'],
       dtype='<U32'), array(['C8', '0.0'],
       dtype='<U32'), array(['C9', '0.0'],
       dtype='<U32'), array(['C20', '0.0'],
       dtype='<U32'), array(['C22', '0.0'],
       dtype='<U32'), array(['C25', '0.0'],
       dtype='<U32')]
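One detail worth noting in the output above: stacking the names and the gains into one NumPy array turns the gains into strings (dtype '<U32'), so the sort compares text rather than numbers. That happens to work for the values shown, but a value printed in scientific notation would sort incorrectly. A minimal alternative sketch keeps the shares as floats; the function name `rank_by_gain` and the gain values are made up for illustration.

```python
def rank_by_gain(names, gains):
    """Return (name, share of total gain) pairs, largest share first."""
    total = float(sum(gains))
    shares = [(name, g / total) for name, g in zip(names, gains)]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


names = ['I2', 'I11', 'I6', 'C13']
gains = [0.0, 30.0, 50.0, 20.0]          # made-up gain values
ranked = rank_by_gain(names, gains)
print(ranked)

# Why string sorting is fragile: '9e-05' sorts ahead of '0.5' as text.
print(sorted(['0.5', '9e-05'], reverse=True))
```

With a trained model, `names` and `gains` would come from `gbm.feature_name()` and `gbm.feature_importance("gain")`, as in the function above.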

Analyzing parameters with eli5

import eli5 

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import csv
import numpy as np

with open('./data/train_eli5.csv', 'rt') as f:
    data = list(csv.DictReader(f))

_all_xs = [{k: v for k, v in row.items() if k != 'clicked'} for row in data]
_all_ys = np.array([int(row['clicked']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))

# from xgboost import XGBClassifier
import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=UserWarning)
class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf =  lgb.LGBMClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted

evaluate(pipeline)

booster = clf.booster_   # if this raises an error, use clf.booster() instead
original_feature_names = booster.feature_name
booster.feature_names = vec.get_feature_names()
# recover original feature names
booster.feature_names = original_feature_names

Output:

899991 items total, 25.5% true
Accuracy: 0.776 ± 0.003

from eli5 import show_weights
show_weights(clf, vec=vec)

from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)

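Generating FM data from the LightGBM output

For the data format, see the format description in the libFM 1.4.2 manual.

GBDT is trained, and we pass the leaf nodes it outputs to FM as the input X. There are 30 leaf nodes in total, so the data fed to FM is formatted as index:value for the non-zero entries of X.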

A real row looks like this: 0 0:31 1:61 2:93 3:108 4:149 5:182 6:212 7:242 8:277 9:310 10:334 11:365 12:401 13:434 14:465 15:491 16:527 17:552 18:589 19:619 20:648 21:678 22:697 23:744 24:770 25:806 26:826 27:862 28:899 29:928 30:955 31:988
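Each token in such a row is tree index:global leaf id. The sketch below shows how one row could be built from the per-tree leaf indices that `predict(..., pred_leaf=True)` returns, using the same arithmetic as the function that follows (leaf + num_leaves * tree index + 1). The function name `leaves_to_fm_row` and the leaf values are made up for illustration.

```python
def leaves_to_fm_row(label, leaf_indices, leaves_per_tree):
    """Format one sample as 'label<TAB>tree:global_leaf_id<TAB>...'.

    leaf_indices[i] is the leaf the sample reaches in tree i;
    leaves_per_tree[i] is the number of leaves of tree i.
    """
    tokens = [] if label is None else [str(label)]
    for tree_idx, leaf in enumerate(leaf_indices):
        # Shift each tree's leaves into its own id range, starting from 1.
        global_id = leaf + leaves_per_tree[tree_idx] * tree_idx + 1
        tokens.append('{}:{}'.format(tree_idx, global_id))
    return '\t'.join(tokens)


# Three hypothetical trees with 10 leaves each.
row = leaves_to_fm_row(0, [4, 0, 7], [10, 10, 10])
print(row)
print(leaves_to_fm_row(None, [4, 0, 7], [10, 10, 10]))   # test rows have no label
```

Note that multiplying a tree's own leaf count by its index gives non-overlapping id ranges only when every tree has the same number of leaves.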

def generat_lgb2fm_data(outdir, gbm, dump, tr_path, va_path, te_path, _sep = '\t'):
    with open(os.path.join(outdir, 'train_lgb2fm.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid_lgb2fm.txt'), 'w') as out_valid:
            with open(os.path.join(outdir, 'test_lgb2fm.txt'), 'w') as out_test:
                df_train_ = pd.read_csv(tr_path, header=None, sep=_sep)
                df_valid_ = pd.read_csv(va_path, header=None, sep=_sep)
                df_test_= pd.read_csv(te_path, header=None, sep=_sep)

                y_train_ = df_train_[0].values
                y_valid_ = df_valid_[0].values                

                X_train_ = df_train_.drop(0, axis=1).values
                X_valid_ = df_valid_.drop(0, axis=1).values
                X_test_= df_test_.values
   
                train_leaves= gbm.predict(X_train_, num_iteration=gbm.best_iteration, pred_leaf=True)
                valid_leaves= gbm.predict(X_valid_, num_iteration=gbm.best_iteration, pred_leaf=True)
                test_leaves= gbm.predict(X_test_, num_iteration=gbm.best_iteration, pred_leaf=True)

                tree_info = dump['tree_info']
                tree_counts = len(tree_info)
                for i in range(tree_counts):
                    train_leaves[:, i] = train_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    valid_leaves[:, i] = valid_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    test_leaves[:, i] = test_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
#                     print(train_leaves[:, i])
#                     print(tree_info[i]['num_leaves'])

                for idx in range(len(y_train_)):            
                    out_train.write((str(y_train_[idx]) + '\t'))
                    out_train.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(train_leaves[idx]) if float(val) != 0 ]) + '\n')
                    
                for idx in range(len(y_valid_)):                   
                    out_valid.write((str(y_valid_[idx]) + '\t'))
                    out_valid.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(valid_leaves[idx]) if float(val) != 0 ]) + '\n')
                    
                for idx in range(len(X_test_)):                   
                    out_test.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(test_leaves[idx]) if float(val) != 0 ]) + '\n')

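Training FM

The FM training data is ready, so we call LibFM to train.

We run 64 iterations with SGD at a learning rate of 0.00000001, and save the trained model to the file fm_model.

In the training log, the Train and Test values are accuracy, not loss.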

cmd = './libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim ’1,1,8’ -iter 64 -method sgd -learn_rate 0.00000001 -regular ’0,0,0.01’ -init_stdev 0.1 -save_model fm_model'
os.popen(cmd).readlines()

Training output:

['----------------------------------------------------------------------------\n',
 'libFM\n',
 '  Version: 1.4.4\n',
 '  Author:  Steffen Rendle, srendle@libfm.org\n',
 '  WWW:     http://www.libfm.org/\n',
 'This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.\n',
 'This is free software, and you are welcome to redistribute it under certain\n',
 'conditions; for details see license.txt.\n',
 '----------------------------------------------------------------------------\n',
 'Loading train...\t\n',
 'has x = 1\n',
 'has xt = 0\n',
 'num_rows=899991\tnum_values=28799712\tnum_features=32\tmin_target=0\tmax_target=1\n',
 'Loading test... \t\n',
 'has x = 1\n',
 'has xt = 0\n',
 'num_rows=100009\tnum_values=3200288\tnum_features=32\tmin_target=0\tmax_target=1\n',
 '#relations: 0\n',
 'Loading meta data...\t\n',
 'learnrate=1e-08\n',
 'learnrates=1e-08,1e-08,1e-08\n',
 '#iterations=64\n',
 "SGD: DON'T FORGET TO SHUFFLE THE ROWS IN TRAINING DATA TO GET THE BEST RESULTS.\n",
 '#Iter=  0\tTrain=0.625438\tTest=0.619484\n',
 '#Iter=  1\tTrain=0.636596\tTest=0.632013\n',
 '#Iter=  2\tTrain=0.627663\tTest=0.623114\n',
 '#Iter=  3\tTrain=0.609776\tTest=0.606605\n',
 '#Iter=  4\tTrain=0.563581\tTest=0.56092\n',
 '#Iter=  5\tTrain=0.497907\tTest=0.495655\n',
 '#Iter=  6\tTrain=0.461677\tTest=0.461408\n',
 '#Iter=  7\tTrain=0.453666\tTest=0.452639\n',
 '#Iter=  8\tTrain=0.454026\tTest=0.453419\n',
 '#Iter=  9\tTrain=0.456836\tTest=0.455919\n',
 '#Iter= 10\tTrain=0.46032\tTest=0.459339\n',
 '#Iter= 11\tTrain=0.466546\tTest=0.465358\n',
 '#Iter= 12\tTrain=0.473565\tTest=0.472317\n',
 '#Iter= 13\tTrain=0.481726\tTest=0.480967\n',
 '#Iter= 14\tTrain=0.492357\tTest=0.491216\n',
 '#Iter= 15\tTrain=0.504419\tTest=0.502935\n',
 '#Iter= 16\tTrain=0.517793\tTest=0.516214\n',
 '#Iter= 17\tTrain=0.533604\tTest=0.532102\n',
 '#Iter= 18\tTrain=0.552926\tTest=0.5515\n',
 '#Iter= 19\tTrain=0.575645\tTest=0.573198\n',
 '#Iter= 20\tTrain=0.59418\tTest=0.590887\n',
 '#Iter= 21\tTrain=0.610691\tTest=0.607815\n',
 '#Iter= 22\tTrain=0.626138\tTest=0.623384\n',
 '#Iter= 23\tTrain=0.640751\tTest=0.637923\n',
 '#Iter= 24\tTrain=0.65393\tTest=0.652141\n',
 '#Iter= 25\tTrain=0.666099\tTest=0.6641\n',
 '#Iter= 26\tTrain=0.677933\tTest=0.675419\n',
 '#Iter= 27\tTrain=0.689539\tTest=0.687108\n',
 '#Iter= 28\tTrain=0.700177\tTest=0.697397\n',
 '#Iter= 29\tTrain=0.709265\tTest=0.706156\n',
 '#Iter= 30\tTrain=0.716553\tTest=0.713266\n',
 '#Iter= 31\tTrain=0.723218\tTest=0.719635\n',
 '#Iter= 32\tTrain=0.729163\tTest=0.726065\n',
 '#Iter= 33\tTrain=0.734428\tTest=0.731354\n',
 '#Iter= 34\tTrain=0.738863\tTest=0.735844\n',
 '#Iter= 35\tTrain=0.74284\tTest=0.740323\n',
 '#Iter= 36\tTrain=0.746316\tTest=0.743793\n',
 '#Iter= 37\tTrain=0.749123\tTest=0.746333\n',
 '#Iter= 38\tTrain=0.751573\tTest=0.748493\n',
 '#Iter= 39\tTrain=0.753264\tTest=0.750292\n',
 '#Iter= 40\tTrain=0.754803\tTest=0.751642\n',
 '#Iter= 41\tTrain=0.756011\tTest=0.753062\n',
 '#Iter= 42\tTrain=0.756902\tTest=0.753892\n',
 '#Iter= 43\tTrain=0.757642\tTest=0.754872\n',
 '#Iter= 44\tTrain=0.758293\tTest=0.755372\n',
 '#Iter= 45\tTrain=0.758855\tTest=0.755782\n',
 '#Iter= 46\tTrain=0.759293\tTest=0.756322\n',
 '#Iter= 47\tTrain=0.759695\tTest=0.756652\n',
 '#Iter= 48\tTrain=0.760084\tTest=0.756982\n',
 '#Iter= 49\tTrain=0.760343\tTest=0.757252\n',
 '#Iter= 50\tTrain=0.76055\tTest=0.757332\n',
 '#Iter= 51\tTrain=0.760706\tTest=0.757582\n',
 '#Iter= 52\tTrain=0.760944\tTest=0.757842\n',
 '#Iter= 53\tTrain=0.761035\tTest=0.757952\n',
 '#Iter= 54\tTrain=0.761173\tTest=0.758152\n',
 '#Iter= 55\tTrain=0.761291\tTest=0.758382\n',
 '#Iter= 56\tTrain=0.76142\tTest=0.758412\n',
 '#Iter= 57\tTrain=0.761541\tTest=0.758452\n',
 '#Iter= 58\tTrain=0.761677\tTest=0.758572\n',
 '#Iter= 59\tTrain=0.76175\tTest=0.758692\n',
 '#Iter= 60\tTrain=0.761829\tTest=0.758822\n',
 '#Iter= 61\tTrain=0.761855\tTest=0.758862\n',
 '#Iter= 62\tTrain=0.761918\tTest=0.759002\n',
 '#Iter= 63\tTrain=0.761988\tTest=0.758972\n',
 'Final\tTrain=0.761988\tTest=0.758972\n',
 'Writing FM model to fm_model\n']

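With the FM model trained, we feed the training, validation and test data through it to obtain the FM layer's output. The output files are named *.fm.logits.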

cmd = "./libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix tr"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/valid_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix va"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/test_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix te -test2predict true"
os.popen(cmd).readlines()

Building the model

embed_dim = 32
sparse_max = 30000 # sparse_feature_dim = 117568
sparse_dim = 26
dense_dim = 13
out_dim = 400

Define the input placeholders

import tensorflow as tf
def get_inputs():
    dense_input = tf.placeholder(tf.float32, [None, dense_dim], name="dense_input")
    sparse_input = tf.placeholder(tf.int32, [None, sparse_dim], name="sparse_input")
    FFM_input = tf.placeholder(tf.float32, [None, 1], name="FFM_input")
    FM_input = tf.placeholder(tf.float32, [None, 1], name="FM_input")
    
    targets = tf.placeholder(tf.float32, [None, 1], name="targets")
    LearningRate = tf.placeholder(tf.float32, name = "LearningRate")
    return dense_input, sparse_input, FFM_input, FM_input, targets, LearningRate

Feed in the categorical features and fetch their embedding vectors from the embedding layer

def get_sparse_embedding(sparse_input):
    with tf.name_scope("sparse_embedding"):
        sparse_embed_matrix = tf.Variable(tf.random_uniform([sparse_max, embed_dim], -1, 1), name = "sparse_embed_matrix")
        sparse_embed_layer = tf.nn.embedding_lookup(sparse_embed_matrix, sparse_input, name = "sparse_embed_layer")
        sparse_embed_layer = tf.reshape(sparse_embed_layer, [-1, sparse_dim * embed_dim])
    return sparse_embed_layer
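
To make the shapes concrete, here is a minimal pure-Python sketch of what the embedding lookup followed by the reshape does. It is not the TensorFlow implementation; the toy sizes and the helper name `embed_and_flatten` are assumptions made for illustration only.

```python
import random

def embed_and_flatten(embed_matrix, sparse_rows):
    """Look up one embedding vector per categorical index and
    concatenate the vectors of each row into one flat list."""
    flat_rows = []
    for row in sparse_rows:
        flat = []
        for idx in row:
            flat.extend(embed_matrix[idx])  # plays the role of the embedding lookup
        flat_rows.append(flat)
    return flat_rows

# Toy sizes (assumptions for illustration); the article uses
# sparse_max = 30000, embed_dim = 32 and sparse_dim = 26.
toy_vocab, toy_embed_dim, toy_sparse_dim = 10, 4, 3
random.seed(0)
toy_matrix = [[random.uniform(-1, 1) for _ in range(toy_embed_dim)]
              for _ in range(toy_vocab)]

batch = [[1, 5, 9], [0, 0, 2]]  # two samples, three categorical indices each
out = embed_and_flatten(toy_matrix, batch)
print(len(out), len(out[0]))  # 2 12  (batch size, sparse_dim * embed_dim)
```

With the article's settings, each sample would become a vector of 26 * 32 = 832 values, which is the width the reshape in `get_sparse_embedding` produces.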

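Feed in the numerical features, concatenate them with the embedding vectors, and pass the result through three fully connected layers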

def get_dnn_layer(dense_input, sparse_embed_layer):
    with tf.name_scope("dnn_layer"):
        input_combine_layer = tf.concat([dense_input, sparse_embed_layer], 1)  #(?, 845 = 832 + 13)
        fc1_layer = tf.layers.dense(input_combine_layer, out_dim, name = "fc1_layer", activation=tf.nn.relu)
        fc2_layer = tf.layers.dense(fc1_layer, out_dim, name = "fc2_layer", activation=tf.nn.relu)
        fc3_layer = tf.layers.dense(fc2_layer, out_dim, name = "fc3_layer", activation=tf.nn.relu)
    return fc3_layer
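
As a rough sanity check on the sizes involved, the sketch below counts the trainable parameters implied by the constants defined earlier. It assumes each dense layer has a bias term (the default for `tf.layers.dense`); the helper `dense_params` is hypothetical and not part of the project.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

embed_dim, sparse_max, sparse_dim, dense_dim, out_dim = 32, 30000, 26, 13, 400

# Width of the concatenated DNN input: 13 dense values + 26 embeddings of size 32.
dnn_input_width = dense_dim + sparse_dim * embed_dim
# Size of the embedding table.
embedding_params = sparse_max * embed_dim
# fc1 maps the input to out_dim; fc2 and fc3 map out_dim to out_dim.
dnn_params = (dense_params(dnn_input_width, out_dim)
              + 2 * dense_params(out_dim, out_dim))

print(dnn_input_width)   # 845
print(embedding_params)  # 960000
print(dnn_params)        # 659200
```

Under these assumptions the embedding table holds more parameters than the three dense layers combined.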

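Build the computation graph

As described earlier, the FFM and FM outputs each pass through a fully connected layer, are concatenated with the output of the three fully connected layers over the numerical features and embedding vectors, and the combined vector goes through logistic regression.

The loss is LogLoss, minimized with FtrlOptimizer.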

tf.reset_default_graph()
train_graph = tf.Graph()
with train_graph.as_default():
    dense_input, sparse_input, FFM_input, FM_input, targets, lr = get_inputs()
    sparse_embed_layer = get_sparse_embedding(sparse_input)
    fc3_layer = get_dnn_layer(dense_input, sparse_embed_layer)

    ffm_fc_layer = tf.layers.dense(FFM_input, 1, name = "ffm_fc_layer")
    fm_fc_layer = tf.layers.dense(FM_input, 1, name = "fm_fc_layer")
    feature_combine_layer = tf.concat([ffm_fc_layer, fm_fc_layer, fc3_layer], 1)  #(?, 402)

    with tf.name_scope("inference"):
        logits = tf.layers.dense(feature_combine_layer, 1, name = "logits_layer")
        pred = tf.nn.sigmoid(logits, name = "prediction")
    
    with tf.name_scope("loss"):
        # LogLoss; logistic regression onto the click-through rate
#         cost = tf.losses.sigmoid_cross_entropy(targets, logits )
        sigmoid_cost = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits, name = "sigmoid_cost")
        logloss_cost = tf.losses.log_loss(labels=targets, predictions=pred)
        cost = logloss_cost # + sigmoid_cost
        loss = tf.reduce_mean(cost)
    # Minimize the loss
#     train_op = tf.train.AdamOptimizer(lr).minimize(loss)  #cost
    global_step = tf.Variable(0, name="global_step", trainable=False)
    optimizer = tf.train.FtrlOptimizer(lr)  #tf.train.FtrlOptimizer(lr)  AdamOptimizer
    gradients = optimizer.compute_gradients(loss)  #cost
    train_op = optimizer.apply_gradients(gradients, global_step=global_step)
    
    # Accuracy
    with tf.name_scope("score"):
        correct_prediction = tf.equal(tf.to_float(pred > 0.5), targets)
        accuracy = tf.reduce_mean(tf.to_float(correct_prediction), name="accuracy")
        
#     auc, uop = tf.contrib.metrics.streaming_auc(pred, targets)
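
For readers who want to see what the LogLoss in the graph above measures, here is a small pure-Python sketch of the sigmoid and the mean binary cross-entropy. It is an illustration under simple assumptions (probabilities clipped by a hypothetical `eps`), not a reproduction of TensorFlow's exact numerics.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

logits = [2.0, -1.0, 0.0]   # raw outputs of the final dense layer
labels = [1, 0, 1]          # 1 = clicked, 0 = not clicked
probs = [sigmoid(z) for z in logits]
print(round(log_loss(labels, probs), 4))  # about 0.3778
```

Mathematically this is the same quantity as sigmoid cross-entropy computed directly on the logits, which is why the graph can define `sigmoid_cost` and `logloss_cost` side by side and use either one.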

Hyperparameters

The dataset is large, so we run only one epoch.

# Number of Epochs
num_epochs = 1
# Batch Size
batch_size = 32

# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 25

save_dir = './save'

ffm_tr_out_path = './tr_ffm.out.logit'
ffm_va_out_path = './va_ffm.out.logit'
fm_tr_out_path = './tr.fm.logits'
fm_va_out_path = './va.fm.logits'
train_path = './data/train.txt'
valid_path = './data/valid.txt'

Read the FFM output

ffm_train = pd.read_csv(ffm_tr_out_path, header=None)    
ffm_train = ffm_train[0].values

ffm_valid = pd.read_csv(ffm_va_out_path, header=None)    
ffm_valid = ffm_valid[0].values

Read the FM output

fm_train = pd.read_csv(fm_tr_out_path, header=None)    
fm_train = fm_train[0].values

fm_valid = pd.read_csv(fm_va_out_path, header=None)    
fm_valid = fm_valid[0].values

Read the dataset

Read the DNN data together with the FM and FFM outputs and concatenate them.

train_data = pd.read_csv(train_path, header=None)    
train_data = train_data.values

valid_data = pd.read_csv(valid_path, header=None)    
valid_data = valid_data.values

cc_train = np.concatenate((ffm_train.reshape(-1, 1), fm_train.reshape(-1, 1), train_data), 1)
cc_valid = np.concatenate((ffm_valid.reshape(-1, 1), fm_valid.reshape(-1, 1), valid_data), 1)

np.random.shuffle(cc_train)
np.random.shuffle(cc_valid)

train_y = cc_train[:,-1]
test_y = cc_valid[:,-1]

train_X = cc_train[:,0:-1]
test_X = cc_valid[:,0:-1]
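
The concatenation above fixes a column layout that the training loop later relies on through hard-coded index lists. The sketch below spells that layout out. It assumes, as the code implies, that each row of train.txt and valid.txt holds 13 numerical values, then 26 categorical indices, then the label; the constant and function names are hypothetical.

```python
# Column layout of one row of cc_train / cc_valid, as implied by the
# concatenation and by the indices used in the feed dictionaries.
FFM_COL = 0                             # FFM output
FM_COL = 1                              # FM output
DENSE_COLS = list(range(2, 2 + 13))     # I1-I13 -> columns 2..14
SPARSE_COLS = list(range(15, 15 + 26))  # C1-C26 -> columns 15..40
LABEL_COL = 41                          # last column, split off as y

def split_row(row):
    """Split one 42-value row into the five model inputs."""
    return (row[FFM_COL], row[FM_COL],
            [row[i] for i in DENSE_COLS],
            [row[i] for i in SPARSE_COLS],
            row[LABEL_COL])

row = list(range(42))  # dummy row whose values equal their column index
ffm, fm, dense, sparse, label = split_row(row)
print(len(dense), len(sparse), label)  # 13 26 41
```

Naming the ranges this way makes it easier to check that the literal lists `[2, ..., 14]` and `[15, ..., 40]` passed to `x.take` cover exactly the dense and sparse columns.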

Train the network

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import time
import datetime
from sklearn.metrics import log_loss
from sklearn.model_selection import learning_curve  # sklearn.learning_curve was removed in scikit-learn 0.20
from sklearn import metrics
def train_model(num_epochs):
    losses = {'train':[], 'test':[]}
    acc_lst = {'train':[], 'test':[]}
    pred_lst = []

    with tf.Session(graph=train_graph) as sess:
        
        
        # Keep track of gradient values and sparsity
        grad_summaries = []
        for g, v in gradients:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name.replace(':', '_')), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name.replace(':', '_')), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)
            
        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))
         
        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", loss)
#         acc_summary = tf.scalar_summary("accuracy", accuracy)
         
        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
    
        # Inference summaries
        inference_summary_op = tf.summary.merge([loss_summary])
        inference_summary_dir = os.path.join(out_dir, "summaries", "inference")
        inference_summary_writer = tf.summary.FileWriter(inference_summary_dir, sess.graph)
    
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        saver = tf.train.Saver()
        for epoch_i in range(num_epochs):
            
            # Create batch generators for the training and validation sets
            train_batches = get_batches(train_X, train_y, batch_size)
            test_batches = get_batches(test_X, test_y, batch_size)
        
            # Training iterations; record the training loss
            for batch_i in range(len(train_X) // batch_size):
                x, y = next(train_batches)
    
                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14],1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40],1),
                    FFM_input: np.reshape(x.take(0,1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1,1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
    #             _ = sess.run([train_op], feed)  #cost
                step, train_loss, summaries, _, prediction, acc = sess.run(
                    [global_step, loss, train_summary_op, train_op, pred, accuracy], feed)  #cost
                
                prediction = prediction.reshape(y.shape)
                losses['train'].append(train_loss)

                acc_lst['train'].append(acc)
                train_summary_writer.add_summary(summaries, step)  #

                if 0 < np.mean(y) < 1:  # AUC is undefined unless both classes are present
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1
                    
                # Show every <show_every_n_batches> batches
                if (epoch_i * (len(train_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print('{}: Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}  accuracy = {}  auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(train_X) // batch_size),
                        train_loss,
                        acc,
                        auc))
#                     print(metrics.classification_report(y, np.float32(prediction > 0.5)))
                    
            # Iterations over the validation data
            for batch_i  in range(len(test_X) // batch_size):
                x, y = next(test_batches)
                
                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14],1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40],1),
                    FFM_input: np.reshape(x.take(0,1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1,1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # Get Prediction
                step, test_loss, summaries, prediction, acc = sess.run(
                    [global_step, loss, inference_summary_op, pred, accuracy], feed)  #cost
    
                # Record the validation loss and accuracy
                prediction = prediction.reshape(y.shape)
                losses['test'].append(test_loss)

                acc_lst['test'].append(acc)
                inference_summary_writer.add_summary(summaries, step)  #
                pred_lst.append(prediction)

                if 0 < np.mean(y) < 1:  # AUC is undefined unless both classes are present
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                time_str = datetime.datetime.now().isoformat()
                if (epoch_i * (len(test_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    print('{}: Epoch {:>3} Batch {:>4}/{}   test_loss = {:.3f}  accuracy = {}  auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(test_X) // batch_size),
                        test_loss,
                        acc,
                        auc))
                    print(metrics.classification_report(y, np.float32(prediction > 0.5)))

        # Save Model
        saver.save(sess, save_dir)  #, global_step=epoch_i
        print('Model Trained and Saved')
        save_params((losses, acc_lst, pred_lst, save_dir))
    return losses, acc_lst, pred_lst, save_dir
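
The training loop pulls exactly `len(X) // batch_size` batches from a generator, so rows that do not fill a whole batch are never read. The sketch below shows that effect with a hypothetical generator `iter_batches`; it is an assumption about how such a helper could look, not the project's own `get_batches`. The row count 100009 is taken from the libFM log above and is assumed to match the validation file.

```python
def iter_batches(xs, ys, batch_size):
    """Yield consecutive (x, y) slices of at most batch_size rows."""
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ys[start:start + batch_size]

# How many rows are skipped when only full batches are consumed.
n_valid, batch_size = 100009, 32
full_batches = n_valid // batch_size
leftover = n_valid - full_batches * batch_size
print(full_batches, leftover)  # 3125 9

# Tiny demonstration: 10 rows, batch size 4 -> two full batches are read,
# and rows 8 and 9 are never consumed.
xs = list(range(10))
ys = [v % 2 for v in xs]
batches = iter_batches(xs, ys, 4)
consumed = [next(batches) for _ in range(len(xs) // 4)]
print([x for x, _ in consumed])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

This is why an evaluation that compares the collected predictions with the labels has to drop the last 9 labels, as the slice `test_y[:-9]` does.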

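Print the training information on the validation set

Mean accuracy
Mean loss
Mean AUC
Mean predicted click-through rate
Precision, recall, F1 score and related metrics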

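Most samples in the data are negative and positives are scarce, so a model that always predicts 0 already reaches 75% accuracy. Accuracy is therefore not a trustworthy metric here.

We need to look at the precision and recall of the positive class, and above all at the LogLoss value, because the competition is scored on LogLoss rather than AUC.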

def train_info():
    # Flatten the per-batch predictions into one column vector.
    preds = np.array(pred_lst).reshape(-1, 1)
    # test_y[:-9] drops the rows that did not fill a whole batch (9 here, which
    # matches 100009 validation rows with batch_size = 32), so labels align with preds.
    labels = test_y[:-9]
    print("Test Mean Acc : {}".format(np.mean(acc_lst['test'])))
    print("Test Mean Loss : {}".format(np.mean(losses['test'])))
    print("Mean Auc : {}".format(metrics.roc_auc_score(labels, preds)))
    print("Mean prediction : {}".format(np.mean(preds)))
    print(metrics.classification_report(labels, np.float32(preds > 0.5)))
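
To make the point about accuracy concrete, here is a small pure-Python sketch on synthetic labels. It assumes a 25% click rate (matching the 75% figure quoted above) and compares a model that never predicts a click with one that always predicts the average click rate. The helper functions are written out for illustration and are not the scikit-learn implementations.

```python
import math

def accuracy(labels, preds):
    """Fraction of samples whose predicted class equals the label."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)

def recall(labels, preds):
    """Fraction of true positives that were predicted as positive."""
    positives = sum(labels)
    hits = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    return hits / positives

def log_loss(labels, probs):
    """Mean binary cross-entropy of predicted probabilities."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

labels = [1] * 25 + [0] * 75       # synthetic data: 25% positives
all_zero = [0] * len(labels)       # a "model" that never predicts a click
base_rate = [0.25] * len(labels)   # always predict the average click rate

print(accuracy(labels, all_zero))             # 0.75
print(recall(labels, all_zero))               # 0.0
print(round(log_loss(labels, base_rate), 3))  # 0.562
```

Under this assumed click rate, a constant predictor scores 75% accuracy with zero recall, and its LogLoss is about 0.56. A validation LogLoss clearly below that value suggests the model has learned more than the base rate.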

Viewing the loss in TensorBoard

Summary

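That is the complete click-through-rate prediction workflow. We did not train on the full dataset, and many hyperparameters remain to be tuned. Judging from a single epoch, the LogLoss on the validation set is 0.46 and the other metrics fall between 75% and 80%, which is roughly in line with the accuracy obtained when training the FFM, GBDT and FM models.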

Copyright notice: this article comes from Cheng Shidong's (程世东) Zhihu column and is an original work; please credit the source when reposting: Kaggle实战——点击率预估.