A Two-Stage Ensemble of Diverse Models for Advertisement Ranking in KDD Cup 2012

Article link: https://blog.csdn.net/BUPT_WX/article/details/51581526
                    
                        
                    
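KDD Cup 2012 was a competition on search advertising. The task was to predict the click-through rate (CTR) of an ad from a search engine's historical logs.
Each record can be viewed as a vector of (#click, #impression, DisplayURL, AdID, AdvertiserID, Depth, Position, QueryID, KeywordID, TitleID, DescriptionID, UserID). The tokens of each query, title, and description are also provided; each token is the hash value of the corresponding word.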
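The paper proceeds in three steps:
 
  Generate the individual models.
  Blend them using the validation set.
  Combine the blended models and the individual models using the test set.
  
 The authors split the training set into a validation set and a sub-training set, and each model is trained twice. The first time, it is trained on the sub-training set and predicts on the validation set. The model obtained this way is then carried into the second round, where it is trained on the whole training set and evaluated on the whole test set.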
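The two training rounds described above can be sketched as follows. This is a minimal illustration, not the authors' code: `split_train`, `GlobalCtrModel`, and `train_twice` are hypothetical names, the 20% validation fraction is an assumption, and the placeholder model simply predicts the overall CTR of whatever data it was fit on.

```python
import random


def split_train(records, valid_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split into (sub_train, validation)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


class GlobalCtrModel:
    """Placeholder model (assumption): predicts the overall training CTR."""

    def fit(self, records):
        clicks = sum(r["click"] for r in records)
        imps = sum(r["impression"] for r in records)
        self.ctr = clicks / imps if imps else 0.0
        return self

    def predict(self, records):
        return [self.ctr for _ in records]


def train_twice(make_model, train, test):
    """Round 1: fit on sub-train, predict validation.
    Round 2: fit the same kind of model on all of train, predict test."""
    sub_train, valid = split_train(train)
    valid_pred = make_model().fit(sub_train).predict(valid)
    test_pred = make_model().fit(train).predict(test)
    return valid, valid_pred, test_pred
```

The validation predictions are what a later blending step would consume, while the test predictions come from a model that has seen the full training set.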
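Features
 
  Categorical features
 Some raw fields in the dataset carry useful information and are used directly as features: AdID, QueryID, KeywordID, UserID, and the ad's position.
Basic sparse features
 Binary features expanded from the categorical features. For example, the dataset contains 22,023,547 distinct UserIDs, so UserID is expanded into a 22,023,547-dimensional binary vector that serves as a new feature. The expanded features are: AdID, AdvertiserID, QueryID, KeywordID, TitleID, DescriptionID, UserID, DisplayURL, user's gender, user's age, depth of session, position of ad, and the combination of position and depth. The query's tokens, title's tokens, description's tokens, and keyword's tokens are also expanded into binary features. (All of these features are sparse.)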
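As a small illustration of this expansion, the sketch below maps each distinct ID to a column index and stores only the non-zero entry, which is what makes a 22,023,547-dimensional vector practical. The helper names `build_index` and `one_hot_sparse` and the sample IDs are made up for the example.

```python
def build_index(values):
    """Assign each distinct value a column index, in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index


def one_hot_sparse(value, index):
    """Return the sparse binary vector as {column: 1}; unseen values give {}."""
    col = index.get(value)
    return {} if col is None else {col: 1}


# Hypothetical UserIDs; the real data has millions of distinct values.
user_ids = ["u7", "u3", "u7", "u9"]
user_index = build_index(user_ids)
rows = [one_hot_sparse(u, user_index) for u in user_ids]
```

Each row holds a single non-zero entry regardless of how many distinct IDs exist, so memory grows with the number of records rather than with the vector dimension.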
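Click-through rate features
 For the categorical features AdID, AdvertiserID, QueryID, KeywordID, TitleID, DescriptionID, UserID, DisplayURL, user's age, user's gender, depth, position, and (depth-position)/depth, the average CTR of each category value is computed as a separate feature. For example, for each AdID, the average click-through rate over all records sharing that AdID becomes one feature. Such a feature estimates the CTR given the category. To avoid a zero denominator (and unstable estimates for categories with very few impressions), a smoothed CTR is used:
 $$\text{pseudo-CTR} = \frac{\#\text{click} + \alpha \times \beta}{\#\text{impression} + \beta}$$
 where $\alpha$ is set to 0.05 and $\beta$ to 75.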
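The smoothed estimate, (#click + α×β) / (#impression + β) with α = 0.05 and β = 75, can be computed per category value as in the sketch below. The function names and the record layout (dicts with `click` and `impression` keys) are assumptions made for illustration.

```python
from collections import defaultdict

ALPHA = 0.05  # prior CTR, as stated in the text
BETA = 75     # prior strength, as stated in the text


def pseudo_ctr(clicks, impressions, alpha=ALPHA, beta=BETA):
    """Smoothed CTR: (#click + alpha * beta) / (#impression + beta)."""
    return (clicks + alpha * beta) / (impressions + beta)


def pseudo_ctr_by_key(records, key):
    """Aggregate clicks and impressions per value of `key`, then smooth."""
    clicks = defaultdict(int)
    imps = defaultdict(int)
    for r in records:
        clicks[r[key]] += r["click"]
        imps[r[key]] += r["impression"]
    return {k: pseudo_ctr(clicks[k], imps[k]) for k in imps}
```

With no impressions at all the estimate falls back to α, and as impressions grow it approaches the raw ratio #click / #impression.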
Other features
 The numerical value of depth, the numerical value of position, and the relative position, which is (depth - position)/depth. There are also length features, such as the number of tokens for each QueryID.
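Token similarity features
 To exploit the information in the tokens of QueryID, KeywordID, TitleID, and DescriptionID, the pairwise similarities between them are computed and represented as a six-dimensional vector (one entry for each of the $C_4^2 = 6$ pairs). Two methods are used to compute similarity: the first is cosine similarity over tf-idf vectors, and the second is LDA.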
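The tf-idf variant can be sketched as below: build a sparse tf-idf vector for each of the four token lists, then take the cosine similarity of every pair, which yields six values. This is only an illustration. The idf table is assumed to be precomputed and passed in as a dict, and the function names and field keys are hypothetical.

```python
import math
from itertools import combinations

FIELDS = ["query", "keyword", "title", "description"]


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector {token: tf * idf}; tokens missing from idf weigh 0."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0.0) + idf.get(t, 0.0)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_features(token_lists, idf):
    """token_lists maps each name in FIELDS to a list of hashed tokens.
    Returns the six pairwise cosine similarities in a fixed order."""
    vecs = {name: tfidf_vector(token_lists[name], idf) for name in FIELDS}
    return [cosine(vecs[a], vecs[b]) for a, b in combinations(FIELDS, 2)]
```

The order of the six values follows `itertools.combinations`, starting with (query, keyword) and ending with (title, description).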
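Individual models used
  
 The models fall into three categories: classification models, regression models, and ranking models.
 
  Classification models
 The authors build binary classifiers for this problem. They split the training set into positive and negative samples according to whether the ad was clicked, then train models to separate the positive samples from the negative ones. Two models are used here: naive Bayes and logistic regression.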
Naive Bayes
 Features used: cat_User, cat_Query, cat_Ad, cat_Position, similarity_tfidf
Logistic regression
 Features used: similarity_tfidf, pCTR_Ad, pCTR_Advertiser, pCTR_Query, pCTR_User, binary_User, binary_Ad, binary_Query, value_User, value_Query, num_Query, num_Title, num_Description, num_Keyword, binary_Gender, binary_Age, binary_PositionDepth, binary_QueryTokens
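To make the classification setup concrete, here is a minimal logistic regression trained by stochastic gradient descent on sparse feature dicts, with clicked instances labeled 1 and unclicked ones labeled 0. It is a sketch under assumptions: the optimizer, learning rate, epoch count, and the absence of regularization are illustrative choices, not the paper's configuration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted click probability for a sparse feature dict x."""
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in x.items()))


def train_logreg(data, epochs=50, lr=0.5):
    """data: list of (sparse_features, label) pairs with label in {0, 1}.
    Plain SGD on the log loss; hyperparameters are illustrative."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, bias, x) - y  # gradient of log loss w.r.t. z
            bias -= lr * err
            for f, v in x.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights, bias
```

The predicted probabilities can then be used directly as ranking scores for the ads.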
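Regression models
 These predict the CTR directly. The rationale is that a regression model assigns a higher CTR to instances that are more likely to be clicked. The authors use two regression models: ridge regression and support vector regression.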
Ridge Regression
  
 Features used: binary_Age, binary_Gender, num_Imp_Ad, num_Imp_Advertiser, aCTR_Ad, aCTR_Advertiser, num_Depth, num_Position, num_RPosition, num_Query, num_Title, num_Description, num_Keyword, num_idf_Query, num_idf_Title, num_idf_Description, num_idf_Keyword
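For reference, ridge regression has the closed-form solution w = (XᵀX + λI)⁻¹ Xᵀy. The sketch below solves that system in plain Python for a small dense feature matrix. It assumes no intercept term and λ > 0, and the function names are hypothetical; it shows the technique only and makes no claim about how the authors trained their model on the sparse features listed above.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return w minimizing ||Xw - y||^2 + lam * ||w||^2 (no intercept)."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(gram, rhs)


def ridge_predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))
```

Increasing `lam` shrinks the weights toward zero, which trades some fit on the training data for stability.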
Support Vector Regression (SVR)
  
 Features used: pCTR_Ad, pCTR_Query, pCTR_Title, pCTR_Description, pCTR_User, pCTR_Url, pCTR_Age, pCTR_RPosition, similarity_topic_6, similarity_topic_20
Ranking Models
Combining Regression and Ranking
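Matrix factorization models (MF models)
 Because similar users are likely to click similar ads, the task can be viewed as a collaborative filtering problem, so matrix factorization techniques can be applied.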
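The collaborative filtering view can be illustrated with a basic matrix factorization trained by SGD: each user and each ad gets a small latent vector, and the predicted CTR is their dot product. This is a generic sketch, not the specific MF variant from the paper. The squared-error objective, the hyperparameters, and the names `train_mf` and `predict_mf` are all assumptions.

```python
import random


def train_mf(ratings, n_factors=2, epochs=300, lr=0.05, reg=0.01, seed=0):
    """ratings: list of (user, ad, observed_ctr) triples.
    Returns latent factor dicts P (users) and Q (ads)."""
    rng = random.Random(seed)
    P, Q = {}, {}
    for u, a, _ in ratings:
        P.setdefault(u, [rng.uniform(-0.1, 0.1) for _ in range(n_factors)])
        Q.setdefault(a, [rng.uniform(-0.1, 0.1) for _ in range(n_factors)])
    for _ in range(epochs):
        for u, a, r in ratings:
            pu, qa = P[u], Q[a]
            err = r - sum(x * y for x, y in zip(pu, qa))
            for k in range(n_factors):
                pk, qk = pu[k], qa[k]
                # Gradient step on squared error with L2 regularization.
                pu[k] += lr * (err * qk - reg * pk)
                qa[k] += lr * (err * pk - reg * qk)
    return P, Q


def predict_mf(P, Q, user, ad):
    """Dot product of latent vectors; 0.0 for users or ads never seen."""
    if user not in P or ad not in Q:
        return 0.0
    return sum(x * y for x, y in zip(P[user], Q[ad]))
```

A user who never appears in training gets a score of 0.0 here; a real system would need a better fallback, such as a smoothed CTR for the ad.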