A Two-Stage Ensemble of Diverse Models for Advertisement Ranking in KDD Cup 2012

BUPT_WX

Published 2016-06-03 20:27:54

Article link: https://blog.csdn.net/BUPT_WX/article/details/51581526

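KDD Cup 2012 was a competition on search advertising. The task was to predict the click-through rate (CTR) of an ad from a search engine's historical logs.
Each record can be viewed as a vector of (#click, #impression, DisplayURL, AdID, AdvertiserID, Depth, Position, QueryID, KeywordID, TitleID, DescriptionID, UserID). The tokens of each query, title, and description are also provided; each token is the hash value of the corresponding word.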
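The paper's approach has three stages:
 
  Generating the individual models
  Blending on the validation set
  Combining the blended models and the individual models on the test set
  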
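 The authors split the training set into a validation set and a sub-training set, and every model is trained twice. The first time, it is trained on the sub-training set and used to predict the validation set. The resulting model is then carried into the second round, which trains on the full training set and predicts on the full test set.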
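The two-round scheme above can be sketched roughly as follows. This is only an illustration: the random split, the 25% validation fraction, the (clicks, impressions) record layout, and the toy "global CTR" model are all assumptions made for the example, not details taken from the paper.

```python
import random

def split_train(records, valid_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (sub_train, validation)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

def fit_global_ctr(records):
    """Toy stand-in for a real model: the overall CTR of the fitted records."""
    clicks = sum(c for c, _ in records)
    impressions = sum(n for _, n in records)
    return clicks / impressions if impressions else 0.0

def predict_global_ctr(model, record):
    """The toy model predicts the same CTR for every record."""
    return model

def two_pass(train, test, fit, predict, valid_fraction=0.25):
    sub_train, valid = split_train(train, valid_fraction)
    # Round 1: fit on the sub-training set, predict the validation set.
    model_1 = fit(sub_train)
    valid_pred = [predict(model_1, r) for r in valid]
    # Round 2: fit on the full training set, predict the test set.
    model_2 = fit(train)
    test_pred = [predict(model_2, r) for r in test]
    return valid, valid_pred, test_pred

# Hypothetical toy data: (clicks, impressions); test clicks are unknown.
train = [(1, 10), (0, 5), (2, 20), (0, 10), (1, 5), (0, 20), (3, 30), (1, 100)]
test = [(None, 10), (None, 40)]
valid, valid_pred, test_pred = two_pass(train, test, fit_global_ctr, predict_global_ctr)
```

The validation predictions are what the blending stage would consume, while the test predictions from the second round are what the final ensemble would combine.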
Features 
 
  Categorical features 
 Some raw fields in the dataset carry useful information and are used directly as features. They are: AdID, QueryID, KeywordID, UserID, and the ad's position.
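Basic sparse features 
 These are binary features expanded from the categorical features. For example, the dataset contains 22,023,547 distinct UserIDs, so UserID is expanded into a 22,023,547-dimensional binary vector that serves as a new feature. The expanded features are: AdID, AdvertiserID, QueryID, KeywordID, TitleID, DescriptionID, UserID, DisplayURL, user's gender, user's age, depth of session, position of ad, and the combination of position and depth. The query's tokens, title's tokens, description's tokens, and keyword's tokens are also expanded into binary features. (All of these features are sparse.)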
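As a rough illustration of this expansion, the sketch below assigns one column to every distinct (field, value) pair and stores each record as the list of its active column indices, which is one common way to hold such sparse binary vectors. The helper names `build_index` and `to_sparse` and the dict-based record layout are hypothetical.

```python
def build_index(records, fields):
    """Assign a column index to every distinct (field, value) pair."""
    index = {}
    for rec in records:
        for field in fields:
            key = (field, rec[field])
            if key not in index:
                index[key] = len(index)
    return index

def to_sparse(rec, fields, index):
    """Return the sorted column indices whose binary feature is 1 for this record."""
    return sorted(index[(f, rec[f])] for f in fields if (f, rec[f]) in index)

# Hypothetical toy records; the real data has millions of distinct IDs.
records = [
    {"UserID": "u1", "AdID": "a1"},
    {"UserID": "u2", "AdID": "a1"},
]
fields = ["UserID", "AdID"]
index = build_index(records, fields)
rows = [to_sparse(rec, fields, index) for rec in records]
```

Storing only the active indices keeps memory proportional to the number of fields per record rather than to the 22,023,547-wide vector.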
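Click-through rate features 
 For the categorical features AdID, AdvertiserID, QueryID, KeywordID, TitleID, DescriptionID, UserID, DisplayURL, user's age, user's gender, depth, position, and (depth - position)/depth, the corresponding average CTR is computed as a separate feature. For example, for each AdID, the average click-through rate over all records with that AdID becomes a feature. Such a feature estimates the CTR given the category. To avoid a zero denominator when computing CTR, a smoothed CTR is used: 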
      
       
$$\text{pseudo-CTR} = \frac{\#click + \alpha \times \beta}{\#impression + \beta}$$

where $\alpha$ is set to 0.05 and $\beta$ is set to 75.
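A minimal sketch of the smoothed CTR, using the values of α and β given above. Summing clicks and impressions per category value before applying the formula is an assumption of this example, as are the function names and the record layout.

```python
from collections import defaultdict

ALPHA = 0.05  # prior CTR, as stated in the text
BETA = 75     # smoothing strength, as stated in the text

def pseudo_ctr(clicks, impressions, alpha=ALPHA, beta=BETA):
    """Smoothed CTR: (#click + alpha * beta) / (#impression + beta)."""
    return (clicks + alpha * beta) / (impressions + beta)

def pseudo_ctr_by_key(records, key):
    """Aggregate clicks and impressions per category value, then smooth."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for rec in records:
        clicks[rec[key]] += rec["click"]
        impressions[rec[key]] += rec["impression"]
    return {k: pseudo_ctr(clicks[k], impressions[k]) for k in impressions}

# Hypothetical toy records.
records = [
    {"AdID": "a1", "click": 2, "impression": 10},
    {"AdID": "a1", "click": 3, "impression": 15},
    {"AdID": "a2", "click": 0, "impression": 0},
]
ctr_by_ad = pseudo_ctr_by_key(records, "AdID")
```

With no impressions at all the estimate falls back to α, and it moves toward the raw click ratio as impressions accumulate.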
Other features 
 The numerical value of depth, the numerical value of position, and the relative position, which is (depth - position)/depth. There are also length features, such as the number of tokens associated with each QueryID.
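Token similarity features 
 To exploit the information in the tokens associated with QueryID, KeywordID, TitleID, and DescriptionID, the pairwise similarities among the four are computed and represented as a six-dimensional vector ($\binom{4}{2} = 6$ pairs). Two methods are used to compute the similarity: the first is cosine similarity over tf-idf vectors, and the second is LDA.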
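The tf-idf variant of the six-dimensional similarity vector might look like the sketch below. The specific idf formula (log(N/df) + 1), the raw-count term frequency, and the function names are assumptions; the paper's exact weighting is not given here.

```python
import math
from collections import Counter
from itertools import combinations

def idf_table(documents):
    """Inverse document frequency per token (assumed variant: log(N/df) + 1)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {tok: math.log(n / df[tok]) + 1.0 for tok in df}

def tfidf(tokens, idf):
    """Sparse tf-idf vector as a dict; tokens unseen in idf get weight 0."""
    tf = Counter(tokens)
    return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v[tok] for tok, w in u.items() if tok in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_vector(query, keyword, title, description, idf):
    """Cosine similarity for each of the 6 pairs among the 4 token lists."""
    vecs = [tfidf(t, idf) for t in (query, keyword, title, description)]
    return [cosine(a, b) for a, b in combinations(vecs, 2)]

# Hypothetical hashed tokens for one record.
query, keyword, title, description = [1, 2], [1, 2], [3], [4]
idf = idf_table([query, keyword, title, description])
sim = similarity_vector(query, keyword, title, description, idf)
```

The pairs come out in the order (query, keyword), (query, title), (query, description), (keyword, title), (keyword, description), (title, description).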
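Individual models used 
  
 The models fall into three categories: classification models, regression models, and ranking models. 
 
  Classification models 
 The authors build binary classifiers for the task: the training data is divided into positive and negative samples according to whether the ad was clicked, and a model is trained to separate the positives from the negatives. Two models are used here: naive Bayes and logistic regression.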
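One plausible way to obtain positive and negative samples from aggregated records is sketched below: a record with c clicks out of n impressions yields a positive instance with weight c and a negative instance with weight n - c. This weighting scheme and the name `expand_instances` are assumptions made for illustration; the text above only says the data is divided by whether the ad was clicked.

```python
def expand_instances(records):
    """Turn (click, impression) records into weighted (features, label, weight) instances."""
    instances = []
    for rec in records:
        clicks, impressions = rec["click"], rec["impression"]
        features = {k: v for k, v in rec.items() if k not in ("click", "impression")}
        if clicks > 0:
            instances.append((features, 1, clicks))                # clicked impressions
        if impressions - clicks > 0:
            instances.append((features, 0, impressions - clicks))  # unclicked impressions
    return instances

# Hypothetical toy records.
records = [
    {"click": 1, "impression": 3, "AdID": "a1"},
    {"click": 0, "impression": 2, "AdID": "a2"},
]
instances = expand_instances(records)
```

A classifier that accepts per-sample weights could then be trained on these instances without materialising every single impression.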
Naive Bayes 
 Features used: cat_User, cat_Query, cat_Ad, cat_Position, similarity_tfidf
Logistic regression 
 Features used: similarity_tfidf, pCTR_Ad, pCTR_Advertiser, pCTR_Query, pCTR_User, binary_User, binary_Ad, binary_Query, value_User, value_Query, num_Query, num_Title, num_Description, num_Keyword, binary_Gender, binary_Age, binary_PositionDepth, binary_QueryTokens
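Regression models 
 These predict CTR directly. The rationale is that a regression model assigns a higher CTR to instances that are more likely to be clicked. The authors use two regression models: ridge regression and support vector regression.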
Ridge regression 
  
 Features used: binary_Age, binary_Gender, num_Imp_Ad, num_Imp_Advertiser, aCTR_Ad, aCTR_Advertiser, num_Depth, num_Position, num_RPosition, num_Query, num_Title, num_Description, num_Keyword, num_idf_Query, num_idf_Title, num_idf_Description, num_idf_Keyword
Support vector regression (SVR) 
  
 Features used: pCTR_Ad, pCTR_Query, pCTR_Title, pCTR_Description, pCTR_User, pCTR_Url, pCTR_Age, pCTR_RPosition, similarity_topic_6, similarity_topic_20
Ranking Models
Combining Regression and Ranking
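Matrix factorization (MF) models 
 Because similar users are likely to click similar ads, the task can be treated as a collaborative filtering problem, to which matrix factorization techniques can be applied.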
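To make the collaborative-filtering view concrete, here is a generic matrix factorization sketch trained by stochastic gradient descent on (user, ad, observed value) triples. It is not the authors' model: the squared-error objective, the hyperparameters, the function names, and the toy data are all assumptions.

```python
import random

def train_mf(observations, n_factors=2, lr=0.1, reg=0.01, epochs=500, seed=0):
    """Learn latent factors P (users) and Q (ads) with SGD on squared error."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in observations})
    ads = sorted({a for _, a, _ in observations})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {a: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for a in ads}
    for _ in range(epochs):
        for u, a, r in observations:
            err = r - predict_mf(P, Q, u, a)
            for k in range(n_factors):
                p_uk, q_ak = P[u][k], Q[a][k]
                # Gradient step with L2 regularisation on both factors.
                P[u][k] += lr * (err * q_ak - reg * p_uk)
                Q[a][k] += lr * (err * p_uk - reg * q_ak)
    return P, Q

def predict_mf(P, Q, user, ad):
    """Predicted value is the dot product of the user and ad factors."""
    return sum(p * q for p, q in zip(P[user], Q[ad]))

def squared_error(P, Q, observations):
    return sum((r - predict_mf(P, Q, u, a)) ** 2 for u, a, r in observations)

# Hypothetical toy data: 1.0 means the user clicked the ad, 0.0 means not.
observations = [("u1", "a1", 1.0), ("u1", "a2", 0.0),
                ("u2", "a1", 0.0), ("u2", "a2", 1.0)]
P0, Q0 = train_mf(observations, epochs=0)  # untrained factors, same seed
P, Q = train_mf(observations)
```

Comparing `squared_error(P, Q, observations)` with `squared_error(P0, Q0, observations)` gives a quick check that training actually fits the observed user-ad pairs.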

                
