Modeling CTR Prediction with Machine Learning (Part 1)

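This article explores how to build a CTR (click-through rate) prediction model with machine learning. It first selects features and down samples the dataset, then tests a model on simple features and tunes its parameters with grid search. For the CTR prediction challenge on Kaggle, the training set is so large that it needs special handling to fit in memory. Finally, some experimental results are shown.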

Competition URL: https://www.kaggle.com/c/avazu-ctr-prediction

Dataset overview:

train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks
are subsampled according to different strategies.
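Once decompressed, train.csv is 5.6 GB and contains a very large number of samples. As a rule of thumb, a 200 MB CSV file (20 to 30 columns) takes about 1 GB of memory when pandas reads it into a DataFrame, so a dataset of this size generally does not fit in the memory of a single machine.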
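
One way to work within that memory limit is to stream the file in chunks rather than load it at once. The sketch below is only an illustration: the helper name `count_clicks_in_chunks`, the chunk size, and the tiny in-memory CSV standing in for train.csv are all assumptions, not part of the original workflow.

```python
import io

import pandas as pd


def count_clicks_in_chunks(csv_source, chunksize=100_000):
    """Stream a CSV in chunks and accumulate row and click counts.

    Only one chunk is held in memory at a time, so peak usage is bounded
    by the chunk size rather than by the size of the whole file.
    """
    total_rows = 0
    total_clicks = 0
    peak_chunk_bytes = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total_rows += len(chunk)
        total_clicks += int(chunk["click"].sum())
        # memory_usage(deep=True) also counts the contents of object columns
        chunk_bytes = int(chunk.memory_usage(deep=True).sum())
        peak_chunk_bytes = max(peak_chunk_bytes, chunk_bytes)
    return total_rows, total_clicks, peak_chunk_bytes


# Tiny in-memory stand-in for train.csv (hypothetical rows).
demo_csv = io.StringIO(
    "id,click,C1\n1,0,1005\n2,1,1005\n3,0,1002\n4,1,1005\n5,0,1005\n"
)
rows, clicks, peak = count_clicks_in_chunks(demo_csv, chunksize=2)
print(rows, clicks, peak)
```

With the real file, the same function could be called as `count_clicks_in_chunks('train.csv')` to get the class balance before deciding how many rows to keep.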

test - Test set. 1 day of ads for testing your model predictions.
Once decompressed, test.csv is 673 MB, which is not very large.

sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
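
To make the All-0.5 Benchmark concrete, here is a minimal sketch that builds a constant submission and scores a constant prediction. It assumes the submission file has the columns `id` and `click` and that the competition is scored with logarithmic loss; the helper names and the toy ids and labels are made up for illustration.

```python
import math

import pandas as pd


def make_constant_submission(ids, value=0.5):
    """Build a submission frame that predicts the same probability for every id."""
    return pd.DataFrame({"id": list(ids), "click": value})


def log_loss_constant(labels, p):
    """Mean log loss when every example receives the same probability p."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in labels)
    return -total / len(labels)


# Hypothetical ids and labels, only to exercise the helpers.
submission = make_constant_submission(["a1", "a2", "a3"])
baseline = log_loss_constant([0, 1, 0, 0], 0.5)
print(submission)
print(baseline)  # equals ln(2) for any labels when p = 0.5
```

Under that assumption, any model worth submitting should score below ln(2), roughly 0.693.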

Reduce the dataset by selecting features and down sampling

# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 12:51:31 2017

@author: JR.Lu
"""
import pandas as pd
import numpy as np

train_df=pd.read_csv('train.csv',nrows=10000000)
test_df=pd.read_csv('test.csv')

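#down sampling
#note: the counts in the comments below add up to 20000000 rows,
#so they appear to come from a larger read than the nrows used above
temp_0=train_df.click==0
data_0=train_df[temp_0] # non-clicks: 16546986./20000000, about 0.8273493 of the rows
temp_1=train_df.click==1
data_1=train_df[temp_1] # clicks: 3453014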
data_0_ed=data_0[0:len(data_1)]
data_downsampled=pd.concat([data_1,data_0_ed])

#select features
#choose features by how each column affects the label; this is done with groupby
#train_df.groupby(train_df['device_model'])['click'].mean()
#'banner_pos' is listed once; repeating it would duplicate the column
columns_select_test=['id','device_type','C1','C15','C16','banner_pos','site_category']
columns_select=['click','device_type','C1','C15','C16','banner_pos','site_category']
data_downsampled_1=data_downsampled[columns_select]
test_small=test_df[columns_select_test]

# shuffle the data
sampler = np.random.permutation(len(data_downsampled_1))
data_downsampled_1=data_downsampled_1.take(sampler)
data_downsampled_1.to_csv('train_small.csv')
test_small.to_csv('test_small.csv')
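
The two ideas in the script above, inspecting the click rate per feature value with groupby and balancing the classes, can be sketched on toy data as follows. Note one deliberate difference: the script keeps the first non-click rows, while this sketch draws them at random, which is a swapped-in variant and not the author's method. The helper names, the seed, and the synthetic frame are assumptions.

```python
import numpy as np
import pandas as pd


def click_rate_by_value(df, column, label="click"):
    """Mean click rate and row count for each distinct value of `column`."""
    return df.groupby(column)[label].agg(["mean", "count"])


def balanced_downsample(df, label="click", seed=0):
    """Keep every click plus an equally sized random sample of non-clicks.

    Assumes clicks are the minority class; the result is shuffled.
    """
    clicks = df[df[label] == 1]
    non_clicks = df[df[label] == 0].sample(n=len(clicks), random_state=seed)
    return pd.concat([clicks, non_clicks]).sample(frac=1, random_state=seed)


# Synthetic stand-in for the training data (about 17% clicks).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "click": (rng.rand(1000) < 0.17).astype(int),
    "banner_pos": rng.randint(0, 3, size=1000),
})

rates = click_rate_by_value(toy, "banner_pos")
small = balanced_downsample(toy)
print(rates)
print(len(small), small["click"].mean())
```

A column whose values show clearly different click rates, each backed by enough rows, is a reasonable candidate feature; the row count is reported alongside the mean so that rare values are not over-interpreted.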

Next, test the model with simple features and select its parameters using grid search

# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 20:36:46 2017

@author: JR.Lu
"""
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_mod
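
The listing above breaks off before the estimator is named, so the following is only a sketch of what a grid search over simple categorical features might look like. The choice of `LogisticRegression`, the one-hot encoding step, the `C` grid, the `neg_log_loss` scoring, and the synthetic data are all assumptions and are not taken from the original script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic integer-coded categorical features (stand-in for train_small.csv).
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(400, 3))
base = (X[:, 0] == 1).astype(int)      # label driven by the first feature
noise = rng.rand(400) < 0.1            # flip about 10% of the labels
y = np.where(noise, 1 - base, base)

# One-hot encode the categories, then fit a logistic regression.
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="neg_log_loss", cv=3)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated log loss of the best setting
```

Scoring with log loss rather than accuracy keeps the search focused on the quality of the predicted probabilities, which is what a CTR model is ultimately asked to produce.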