Competition URL: https://www.kaggle.com/c/avazu-ctr-prediction
Dataset overview:
train - Training set. 10 days of click-through data, ordered chronologically. Non-clicks and clicks
are subsampled according to different strategies.
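Train.csv is 5.6 GB after decompression and contains a very large number of samples. As a rough guide, a 200 MB CSV (20 to 30 columns) read with pandas into a DataFrame takes about 1 GB of memory, so a dataset this large is generally too much for a single machine's memory.
test - Test set. 1 day of ads for testing your model predictions.
Test.csv is 673 MB after decompression, which is not very large.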
sampleSubmission.csv - Sample submission file in the correct format, corresponds to the All-0.5 Benchmark.
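One way around the memory concern for train.csv is to stream the file in chunks rather than load it all at once. The following is a minimal sketch, not part of the original scripts: the helper name `count_clicks_in_chunks` and the tiny in-memory CSV are made up for illustration, and only the pandas `read_csv` options `usecols` and `chunksize` are relied on.

```python
import io
import pandas as pd

def count_clicks_in_chunks(csv_source, chunksize=1_000_000):
    """Stream a CSV in fixed-size chunks and tally rows and clicks.

    Only the 'click' column is loaded, so peak memory stays near one chunk.
    """
    total_rows = 0
    total_clicks = 0
    for chunk in pd.read_csv(csv_source, usecols=["click"], chunksize=chunksize):
        total_rows += len(chunk)
        total_clicks += int(chunk["click"].sum())
    return total_rows, total_clicks

# Tiny in-memory stand-in for train.csv (made-up rows, for illustration only).
demo_csv = "id,click,C1\n1,0,1005\n2,1,1005\n3,0,1002\n4,0,1005\n5,1,1010\n"
rows, clicks = count_clicks_in_chunks(io.StringIO(demo_csv), chunksize=2)
print(rows, clicks)
```

With the real file, passing 'train.csv' as `csv_source` would give the overall click ratio without holding the whole dataset in memory.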
First, shrink the dataset through feature selection and down sampling:
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 12:51:31 2017
@author: JR.Lu
"""
import pandas as pd
import numpy as np
train_df=pd.read_csv('train.csv',nrows=10000000)
test_df=pd.read_csv('test.csv')
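# Down sampling
temp_0=train_df.click==0
data_0=train_df[temp_0] # non-clicks: 16546986 of 20000000 rows, about 0.8273 (counts are out of 20000000 rows, not the nrows=10000000 read above)
temp_1=train_df.click==1
data_1=train_df[temp_1] # clicks: 3453014
# Keep as many non-clicks as clicks; the slice takes the earliest rows because the data is chronological
data_0_ed=data_0[0:len(data_1)]
data_downsampled=pd.concat([data_1,data_0_ed])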
# Select features
# Choose features by how much each column affects the label; groupby is used for this, e.g.
# train_df.groupby(train_df['device_model'])['click'].mean()
# ('banner_pos' was listed twice in the original lists; the duplicate is removed here)
columns_select_test=['id','device_type','C1','C15','C16','banner_pos','site_category']
columns_select=['click','device_type','C1','C15','C16','banner_pos','site_category']
data_downsampled_1=data_downsampled[columns_select]
test_small=test_df[columns_select_test]
# Shuffle the data
sampler = np.random.permutation(len(data_downsampled_1))
data_downsampled_1=data_downsampled_1.take(sampler)
data_downsampled_1.to_csv('train_small.csv')
test_small.to_csv('test_small.csv')
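The script above takes the first non-click rows as its negative sample and mentions groupby as the tool for judging features. As a hedged variation, the sketch below samples each class at random instead and wraps the groupby check in a helper. The function names `balanced_downsample` and `click_rate_by_value` and the toy DataFrame are hypothetical, introduced only for illustration; random sampling is a swapped-in alternative, not the author's method.

```python
import pandas as pd

def balanced_downsample(df, label="click", seed=0):
    """Randomly sample the same number of rows from each class.

    The smaller class is kept whole; the larger one is cut down to match.
    """
    pos = df[df[label] == 1]
    neg = df[df[label] == 0]
    n = min(len(pos), len(neg))
    sampled = pd.concat([pos.sample(n=n, random_state=seed),
                         neg.sample(n=n, random_state=seed)])
    # Shuffle so clicks and non-clicks are interleaved, then drop the old index.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

def click_rate_by_value(df, column, label="click"):
    """Mean click rate and row count for each value of one categorical column."""
    return (df.groupby(column)[label]
              .agg(["mean", "count"])
              .sort_values("mean", ascending=False))

# Toy data (made up) standing in for a slice of train.csv.
demo = pd.DataFrame({
    "click":      [0, 0, 0, 0, 0, 0, 1, 1],
    "banner_pos": [0, 0, 0, 1, 0, 0, 1, 1],
})
small = balanced_downsample(demo)
rates = click_rate_by_value(demo, "banner_pos")
print(small)
print(rates)
```

Reporting the row count next to the mean helps avoid over-reading a high click rate that rests on only a handful of rows.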
Next, test the model with these simple features, using grid search to tune the parameters:
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 01 20:36:46 2017
@author: JR.Lu
"""
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_mod