Ad Click-Through Rate Prediction (Featured Case Study) | Online Ad Click-Through Rate Prediction

Latest recommended article published on 2023-06-16 10:34:46

weixin_39600291

Article tags: ad click-through rate prediction

Article link: https://blog.csdn.net/weixin_39600291/article/details/112669657

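1 Data Description
2 Inspecting the Data
3 Data Preprocessing
3.1 Handling Non-Numeric Features
4 Building Baseline Models
5 Exploratory Analysis and Feature Engineering
5.1 Visualizing Feature Distributions
5.2 Handling Unevenly Distributed Features
5.3 Feature Correlation
6 Model Training and Model Selection
6.1 Data Preparation
6.2 Logistic Regression, Decision Tree, Random Forest, and AdaBoost Models
6.3 Hyperparameter Tuning
7 Predicting on the Test Data
8 Summary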

1 Data Description

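The click-through rate (CTR) of an online ad depends not only on where the ad is placed and on which site, but also on the app and device the user is using. In this project the data is split into three files: the training set ctr_train.csv, the training labels ctr_labels.csv, and the test set ctr_test.csv. A CTR prediction model is built and trained on the training data and outputs predicted probabilities for the test set. Binary cross-entropy on the test set is used as the evaluation metric. The fields are described in the table below: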

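Field	Description
id	User ID
click	Modeling target: whether the ad was clicked (1 = clicked, 0 = not clicked)
C1	Anonymous categorical variable
banner_pos	Position of the ad on the web page
site_id	Site ID
site_domain	Site domain
site_category	Site category
app_id	App ID
app_domain	App domain
app_category	App category
device_id	Device ID
device_ip	Device IP
device_model	Device model
device_type	Device type
device_conn_type	Device connection type
C14-C21	Anonymous categorical variables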
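The evaluation metric named above, binary cross-entropy, is the mean over all rows of -[y·log(p) + (1-y)·log(1-p)], where y is the 0/1 label and p is the predicted click probability. The following is a minimal sketch with made-up numbers (the arrays are illustrative, not project data) that checks a hand computation against scikit-learn's log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# Illustrative labels and predicted click probabilities (not project data)
y_true = np.array([1, 0, 0, 1, 0])
p_click = np.array([0.8, 0.1, 0.3, 0.6, 0.2])

# Binary cross-entropy computed directly from its definition
manual = -np.mean(y_true * np.log(p_click) + (1 - y_true) * np.log(1 - p_click))

# The same quantity via scikit-learn
from_sklearn = log_loss(y_true, p_click)

print(round(manual, 4), round(from_sklearn, 4))
```

Lower values are better; confident wrong predictions are penalized heavily because of the logarithm.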

2 Inspecting the Data

First, import all the libraries used in this project.

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Import the libraries used in this project
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import VotingClassifier

Load the data with the read_csv() function from pandas.

train_data = pd.read_csv('./input/ctr_train.csv')
labels = pd.read_csv('./input/ctr_labels.csv', header=None)

Note: the training label file ctr_labels.csv has no header row, so header=None must be passed when reading it.
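To illustrate the note above, here is a small sketch that reads a headerless file from an in-memory string (the contents are made up and stand in for ctr_labels.csv). Passing names= in the same call is an optional alternative to assigning the column name afterwards:

```python
import io
import pandas as pd

# Stand-in for a headerless label file (contents are made up)
csv_text = "0\n1\n0\n"

# header=None stops pandas from consuming the first row as column names;
# names=... assigns the column name in the same call
labels_demo = pd.read_csv(io.StringIO(csv_text), header=None, names=["click"])

print(labels_demo.shape)
```

Without header=None the first label would be used as the column name and one row would be lost.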

To make the later analysis easier to read, the training set and its labels are concatenated into a single DataFrame.

labels.columns = ['click']
data = pd.concat([train_data, labels], axis=1)
data.head()

	id	hour	C1	banner_pos	site_id	site_domain	site_category	app_id	app_domain	app_category	...	C14	C15	C16	C17	C18	C19	C20	C21
0	1.04215E+19	14102100	1005	0	6256f5b4	28f93029	f028772b	ecad2386	7801e8d9	07d7df22	...	16859	320	50	1887	3	39	-1	23
1	1.02382E+19	14102100	1002	0	2c4ed2f7	c4e18dd6	50e219e0	ecad2386	7801e8d9	07d7df22	...	21699	320	50	2497	3	43	100151	42
2	1.58943E+19	14102100	1005	1	856e6d3f	58a89a43	f028772b	ecad2386	7801e8d9	07d7df22	...	19771	320	50	2227	0	687	100077	48
3	1.59655E+19	14102100	1005	0	d9750ee7	98572c79	f028772b	ecad2386	7801e8d9	07d7df22	...	15701	320	50	1722	0	35	100084	79
4	1.17335E+19	14102100	1005	0	d6137915	bb1ef334	f028772b	ecad2386	7801e8d9	07d7df22	...	16920	320	50	1899	0	431	100077	117

5 rows × 24 columns

Check the basic information of the data to see whether there are missing values.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                40000 non-null  float64
 1   hour              40000 non-null  int64  
 2   C1                40000 non-null  int64  
 3   banner_pos        40000 non-null  int64  
 4   site_id           40000 non-null  object 
 5   site_domain       40000 non-null  object 
 6   site_category     40000 non-null  object 
 7   app_id            40000 non-null  object 
 8   app_domain        40000 non-null  object 
 9   app_category      40000 non-null  object 
 10  device_id         40000 non-null  object 
 11  device_ip         40000 non-null  object 
 12  device_model      40000 non-null  object 
 13  device_type       40000 non-null  float64
 14  device_conn_type  40000 non-null  float64
 15  C14               40000 non-null  float64
 16  C15               40000 non-null  float64
 17  C16               40000 non-null  float64
 18  C17               40000 non-null  float64
 19  C18               40000 non-null  float64
 20  C19               40000 non-null  float64
 21  C20               40000 non-null  float64
 22  C21               40000 non-null  float64
 23  click             40000 non-null  int64  
dtypes: float64(11), int64(4), object(9)
memory usage: 7.3+ MB

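The data contains 9 categorical (object-typed) features and no missing values. Next we look at how many distinct categories each of these 9 features has.

3 Data Preprocessing

3.1 Handling Non-Numeric Features

Count the distinct categories in each non-numeric feature.

# Collect the categorical features in the not_num list
not_num = ['site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 
           'device_ip', 'device_model']

# Use a list comprehension to output the number of categories per feature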
[len(data[i].unique()) for i in not_num]

[652, 551, 16, 521, 41, 17, 3556, 21601, 1956]

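Most of these 9 categorical features have hundreds or even thousands of categories (device_ip alone has 21,601). One-hot encoding them would add a very large number of columns and increase the computational cost, so we plan to use tree-based models, which do not require one-hot encoding, and only label-encode the features.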
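Summing the category counts listed above gives 28,911, which is how many columns one-hot encoding these 9 features would produce. The sketch below contrasts the two encodings on a toy frame (the values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two string-valued features (values are made up)
toy = pd.DataFrame({
    "site_id": ["a1", "b2", "c3", "a1", "d4"],
    "app_id": ["x", "y", "x", "z", "x"],
})

# One-hot encoding adds one column per distinct category
one_hot = pd.get_dummies(toy)

# Label encoding keeps one column per feature
encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))

print(one_hot.shape[1], encoded.shape[1])
```

Label encoding imposes an arbitrary order on the categories, which tree models tolerate better than linear models do.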

Label-encode the non-numeric features.

le = LabelEncoder()
for i in not_num:
    data[i] = le.fit_transform(data[i])

data.head()

	id	hour	C1	banner_pos	site_id	site_domain	site_category	app_id	app_domain	...	C14	C15	C16	C17	C18	C19	C20	C21
0	1.04E+19	14102100	1005	0	260	92	14	489	19	...	16859	320	50	1887	3	39	-1	23
1	1.02E+19	14102100	1002	0	101	426	6	489	19	...	21699	320	50	2497	3	43	100151	42
2	1.59E+19	14102100	1005	1	337	179	14	489	19	...	19771	320	50	2227	0	687	100077	48
3	1.60E+19	14102100	1005	0	542	334	14	489	19	...	15701	320	50	1722	0	35	100084	79
4	1.17E+19	14102100	1005	0	530	413	14	489	19	...	16920	320	50	1899	0	431	100077	117

5 rows × 24 columns

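Above, the categorical features of the training data were label-encoded (the test set is encoded separately in section 7). The codes run from 0 to the number of categories minus 1.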
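Because the loop above reuses a single LabelEncoder and refits it per column, the fitted mappings are not kept, and a mapping refitted on other data will generally differ. One possible way to keep training and test codes consistent is to fit an encoder once on the training data and reuse it. The sketch below is an assumption rather than the author's method: it uses scikit-learn's OrdinalEncoder (the handle_unknown option requires scikit-learn 0.24 or later) on made-up toy frames:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames (values are made up); "zz" never appears in training
train_demo = pd.DataFrame({"app_id": ["aa", "bb", "aa", "cc"]})
test_demo = pd.DataFrame({"app_id": ["bb", "zz", "aa"]})

# Fit on the training data only; unseen test categories map to -1
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train_demo[["app_id"]])
test_codes = enc.transform(test_demo[["app_id"]])

print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

With this approach "bb" receives the same code in both frames, so any later step that refers to a specific code keeps its meaning on the test set.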

4 Building Baseline Models

Split the data, holding out 20% as validation data.

X = data.drop('click', axis=1)
y = data['click']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

Build logistic regression, decision tree, random forest, and AdaBoost models.

# Instantiate the logistic regression, decision tree, random forest, and AdaBoost models
lr = LogisticRegression(random_state=42)
dt = DecisionTreeClassifier(max_depth=1, random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=1, random_state=42)
ab = AdaBoostClassifier(n_estimators=200, random_state=42)

log_loss_list = [] # binary cross-entropy of each model
for i in [lr, dt, rf, ab]:
    i.fit(train_x, train_y) # train the model
    probs = i.predict_proba(test_x) # predicted class probabilities on the validation split
    log_loss_list.append(round(log_loss(test_y, probs), 4)) # compute the binary cross-entropy and store it

former = pd.DataFrame(log_loss_list, index=['Logistic regression', 'Decision tree', 'Random forest', 'AdaBoost'], columns=['Binary cross-entropy'])
former

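	Binary cross-entropy
Logistic regression	0.6931
Decision tree	0.4457
Random forest	0.4441
AdaBoost	0.6896

These initial results serve as a baseline. Next we try feature engineering to improve the models further.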
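The logistic regression loss of 0.6931 equals ln 2 to four decimals, which is exactly the loss of predicting a probability of 0.5 for every row. This suggests, although the original does not verify it, that the baseline logistic regression is outputting probabilities close to 0.5 everywhere. A quick check with made-up labels:

```python
import numpy as np
from sklearn.metrics import log_loss

# Illustrative labels (made up); any mix of 0s and 1s gives the same result
y_demo = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# A model that outputs 0.5 for every row
p_const = np.full(len(y_demo), 0.5)

print(round(log_loss(y_demo, p_const), 4), round(np.log(2), 4))
```

Any model scoring near 0.693 is therefore doing no better than an uninformed 50/50 guess on this metric.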

5 Exploratory Analysis and Feature Engineering

5.1 Visualizing Feature Distributions

# Set a font that can render Chinese characters, to avoid garbled text
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Create the figure
p = plt.figure(figsize=(16, 9))
for i, j in enumerate(data.columns):
    p.add_subplot(4, 6, i+1)   # add a subplot
    plt.scatter(data[j].value_counts().index.tolist(), data[j].value_counts().values.tolist())  # scatter plot of value vs. count
    plt.xlabel(j) # x-axis label
    plt.ylabel('Count') # y-axis label

# Adjust the subplot layout automatically
plt.tight_layout()
plt.subplots_adjust()

plt.show() # display the figure

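The 24 subplots above show the distribution of each feature and of the target label; the horizontal axis is the feature value and the vertical axis is the number of rows with that value. From these plots, id and hour appear to be irrelevant features and are dropped directly. The values of site_id, site_domain, app_id, app_domain, and device_id are very unevenly distributed, so they are handled in the next section.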

data = data.drop(['id', 'hour'], axis=1)

5.2 Handling Unevenly Distributed Features

Visualize the unevenly distributed features.

# Collect the unevenly distributed features in a list, then plot each one
not_balance = ['site_id', 'site_domain', 'app_id', 'app_domain', 'device_id']

for i in not_balance:
    plt.figure(figsize=(16, 4)) # create the figure
    plt.scatter(data[i].value_counts().index.tolist(), data[i].value_counts().values.tolist()) # scatter plot of value vs. count
    plt.xlabel(i) # x-axis label
    plt.ylabel('Count') # y-axis label
    
    plt.show() # display the figure

The 5 plots above show clearly that these 5 features are unevenly distributed. Next we look at the 5 most frequent values of each feature.

for i in not_balance:
    print('-' * 35)
    print(data[i].value_counts()[:5])
print('-' * 35)

-----------------------------------
72     13967
339     8821
225     2856
542     1459
563     1083
Name: site_id, dtype: int64
-----------------------------------
529    13967
426     9463
432     2856
334     1486
276     1143
Name: site_domain, dtype: int64
-----------------------------------
489    31179
519      936
466      681
26       464
203      452
Name: app_id, dtype: int64
-----------------------------------
19    33126
5      3052
22      936
34      757
15      681
Name: app_domain, dtype: int64
-----------------------------------
2391    34716
2728       66
726        28
2282       27
2090       24
Name: device_id, dtype: int64
-----------------------------------

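Features in which a single value occurs more than 30,000 times are binarized, here using the apply() method, a lambda expression, and Python's conditional (ternary) expression.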

data['app_id'] = data['app_id'].apply(lambda x : 1 if x == 489 else 0)
data['app_domain'] = data['app_domain'].apply(lambda x : 1 if x == 19 else 0)
data['device_id'] = data['device_id'].apply(lambda x : 1 if x == 2391 else 0)
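The binarization above hard-codes the dominant codes (489, 19, 2391). As an alternative sketch, not the author's code, the dominant value can be looked up with value_counts() and compared in a vectorized way. The series below is made up for illustration:

```python
import pandas as pd

# Toy encoded column (values are made up); 489 dominates
app_id_demo = pd.Series([489, 489, 12, 489, 7, 489])

# Look up the most frequent value instead of hard-coding it
top_value = app_id_demo.value_counts().idxmax()

# Vectorized comparison: 1 where the value equals the dominant one, else 0
is_top = (app_id_demo == top_value).astype(int)

print(top_value, is_top.tolist())
```

The dominant value should be determined on the training data and then reused unchanged for the test data.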

For site_id and site_domain, the values are binned into intervals using the cut() method.

# First specify the bin edges for `site_id` and `site_domain`
bins_site_id = [-1, 71.9, 72, 224.9, 225, 338.9, 339, 541.9, 542, 562.9, 563, np.inf]
bins_site_domain = [-1, 275.9, 276, 333.9, 334, 425.9, 426, 431.9, 432, 528.9, 529, np.inf]

# Then use cut() to encode each value by the interval it falls into
data['site_id'] = pd.cut(data['site_id'], bins=bins_site_id, labels=range(len(bins_site_id)-1))
data['site_domain'] = pd.cut(data['site_domain'], bins=bins_site_domain, labels=range(len(bins_site_domain)-1))
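The bin edges above are chosen so that each of the five most frequent codes gets an interval of its own, while the codes in between share one interval. A shortened sketch with toy values shows how pd.cut assigns the interval labels:

```python
import numpy as np
import pandas as pd

# Shortened version of the site_id bins: isolate the codes 72 and 225
bins_demo = [-1, 71.9, 72, 224.9, 225, np.inf]
codes_demo = pd.Series([0, 72, 100, 225, 300])

# Intervals are right-closed by default: (-1, 71.9], (71.9, 72], (72, 224.9], ...
binned = pd.cut(codes_demo, bins=bins_demo, labels=range(len(bins_demo) - 1))

print(binned.tolist())
```

The result has pandas category dtype; .astype(int) converts it to plain integers if a later step needs that.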

5.3 Feature Correlation

# Set a font that can render Chinese characters, to avoid garbled text
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

plt.figure(figsize=(13, 13)) # create the figure
sns.heatmap(data.corr(), annot=True, vmax=1, vmin=-1, square=True, cmap='Blues') # correlation heatmap
plt.show() # display the figure

The correlation heatmap above shows clearly that device_type and C1 are highly correlated, as are C14 and C17, so only one feature from each pair needs to be kept.

data = data.drop(['C1', 'C17'], axis=1)
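Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses synthetic data and an arbitrary threshold of 0.8, both of which are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data (made up): b is a linear function of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds the (arbitrary) threshold of 0.8
pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.8]

print(pairs)
```

For each reported pair, one of the two columns can then be dropped, as done above for C1 and C17.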

6 Model Training and Model Selection

6.1 Data Preparation

Split the data, holding out 20% as validation data.

X = data.drop('click', axis=1)
y = data['click']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

6.2 Logistic Regression, Decision Tree, Random Forest, and AdaBoost Models

# Instantiate the logistic regression, decision tree, random forest, and AdaBoost models
lr = LogisticRegression(random_state=42)
dt = DecisionTreeClassifier(max_depth=1, random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=1, random_state=42)
ab = AdaBoostClassifier(n_estimators=200, random_state=42)

log_loss_list = [] # binary cross-entropy of each model
for i in [lr, dt, rf, ab]:
    i.fit(train_x, train_y) # train the model
    probs = i.predict_proba(test_x) # predicted class probabilities on the validation split
    log_loss_list.append(round(log_loss(test_y, probs), 4)) # compute the binary cross-entropy and store it

latter = pd.DataFrame(log_loss_list, index=['Logistic regression', 'Decision tree', 'Random forest', 'AdaBoost'], columns=['Binary cross-entropy'])
latter

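	Binary cross-entropy
Logistic regression	0.4379
Decision tree	0.4457
Random forest	0.4446
AdaBoost	0.6897

Comparison before and after feature engineering.

former.columns = ['Before']
latter.columns = ['After']
pd.concat([former, latter], axis=1)

	Before	After
Logistic regression	0.6931	0.4379
Decision tree	0.4457	0.4457
Random forest	0.4441	0.4446
AdaBoost	0.6896	0.6897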

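After feature engineering, the logistic regression loss dropped from 0.6931 to 0.4379, the decision tree was unchanged, and the random forest rose slightly (from 0.4441 to 0.4446) but stayed below 0.5. The AdaBoost loss remained high. We therefore tune the hyperparameters of the decision tree and the random forest next, and drop AdaBoost from further consideration.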

6.3 Hyperparameter Tuning

Run a grid search for the decision tree model.

param_grid = {
    'criterion' : ['gini', 'entropy'],
    'max_depth': [1, 2, 3, 4, 5], 
    'random_state' : [42]
}

dt = DecisionTreeClassifier()

grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(train_x, train_y)
dt_best = grid_search.best_estimator_
dt_best # display the best model

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy', max_depth=4, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=42, splitter='best')

dt_best.fit(train_x, train_y) # train the model
probs = dt_best.predict_proba(test_x) # predicted class probabilities on the validation split
print(round(log_loss(test_y, probs), 4)) # binary cross-entropy

0.4228
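Two details of the grid search are worth spelling out. With scoring='neg_log_loss' the scores are negated, so best_score_ is a negative number whose sign must be flipped to obtain the cross-validated loss. With the default refit=True, best_estimator_ is already fitted on the full data passed to fit(), so calling fit() on it again is harmless but not required. A minimal sketch on synthetic data (make_classification stands in for the project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [1, 2, 3]},
    cv=5,
    scoring="neg_log_loss",  # scores are negated so that larger is better
)
grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated *negative* log loss
cv_log_loss = -grid.best_score_

# With the default refit=True, best_estimator_ is already fitted on all of X_demo
probs_demo = grid.best_estimator_.predict_proba(X_demo)

print(grid.best_params_, round(cv_log_loss, 4))
```

The cross-validated loss is computed on training folds only, so it can differ from the loss measured on the held-out validation split.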

Run a grid search for the random forest model.

param_grid = {
    'criterion' : ['gini', 'entropy'],
    'n_estimators': [100, 200, 300],
    'max_features': [8, 10, 12],
    'max_depth' : [5, 6, 7],
    'random_state' : [42]
}

rf = RandomForestClassifier()

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(train_x, train_y)
rf_best = grid_search.best_estimator_
rf_best # display the best model

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='entropy', max_depth=7, max_features=12, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False)

rf_best.fit(train_x, train_y) # train the model
probs = rf_best.predict_proba(test_x) # predicted class probabilities on the validation split
print(round(log_loss(test_y, probs), 4)) # binary cross-entropy

0.4118

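Hyperparameter tuning reduced the loss of the decision tree and the random forest to 0.4228 and 0.4118 respectively.

7 Predicting on the Test Data

Loading the data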

test_data = pd.read_csv('./input/ctr_test.csv')

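Data preprocessing

# Label encoding
# Caution: a LabelEncoder refitted on the test set derives its codes from the test set's
# own categories, so these codes are not guaranteed to match the training codes
# (for example the values 489, 19 and 2391 used below)
not_num = ['site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 
           'device_ip', 'device_model']
le = LabelEncoder()
for i in not_num:
    test_data[i] = le.fit_transform(test_data[i])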

Feature engineering

test_data = test_data.drop(['id', 'hour'], axis=1)

test_data['app_id'] = test_data['app_id'].apply(lambda x : 1 if x == 489 else 0)
test_data['app_domain'] = test_data['app_domain'].apply(lambda x : 1 if x == 19 else 0)
test_data['device_id'] = test_data['device_id'].apply(lambda x : 1 if x == 2391 else 0)

bins_site_id = [-1, 71.9, 72, 224.9, 225, 338.9, 339, 541.9, 542, 562.9, 563, np.inf]
bins_site_domain = [-1, 275.9, 276, 333.9, 334, 425.9, 426, 431.9, 432, 528.9, 529, np.inf]
test_data['site_id'] = pd.cut(test_data['site_id'], bins=bins_site_id, labels=range(len(bins_site_id)-1))
test_data['site_domain'] = pd.cut(test_data['site_domain'], bins=bins_site_domain, labels=range(len(bins_site_domain)-1))

test_data = test_data.drop(['C1', 'C17'], axis=1)

Prediction

pred = rf_best.predict(test_data)
pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
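The call above returns hard 0/1 class labels. Since the project description asks for predicted probabilities and the metric is binary cross-entropy, predict_proba is the more fitting call for the final output. The sketch below uses synthetic data and a small random forest as stand-ins for the prepared test features and rf_best:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared training and test features
X_all, y_all = make_classification(n_samples=300, n_features=6, random_state=42)
X_fit, y_fit, X_new = X_all[:200], y_all[:200], X_all[200:]

model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
model.fit(X_fit, y_fit)

# classes_ gives the column order of predict_proba; pick the column for class 1
click_col = list(model.classes_).index(1)
click_prob = model.predict_proba(X_new)[:, click_col]

print(click_prob[:5].round(3))
```

In the project itself the equivalent would be to take the class-1 column of rf_best.predict_proba(test_data).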

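8 Summary

In this case study we first loaded and preprocessed the data and built models directly to obtain baseline results. We then applied feature engineering, rebuilt the models, and compared the results with the baseline; only the logistic regression loss decreased substantially. Next we considered the effect of the hyperparameters and tuned them, which made the random forest the best model. Finally, we generated predictions for the test set.

This case study was written by Cai Meng (蔡猛), a student of the 数据酷客 Creation Camp.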
