Tencent Ads Competition

368chen

Article link: https://blog.csdn.net/qq_16236875/article/details/89711685

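1 First pass: count exposures over the full exposure log, join the counts with the static ad data to build the training set, split multi-valued fields into separate rows, fill missing values with 0, and drop the bid column. The lgb (LightGBM) model's predictions scored 43.16.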

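2 Second round of data processing

Code: /data/chenli/algo.qq.com_641013010_testa/testA/totalExposureLog_uniq/Exposure_count.py and /data/chenli/algo.qq.com_641013010_testa/testA/totalExposureLog_uniq/lgb.py

2.1 Missing values in the static ad data.

Creation time equal to 0: 9290 records.

Ad account id: no missing values.

Product id empty: 238932

Industry id: 229

Creative size: 135686

Drop records with an empty creative size or a multi-valued industry id, and change product id -1 and industry id -1 to 0. The resulting file is: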

ad_static_feature_filter.out

Of the original 760866 records, 502157 remain.
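The filtering rules above can be sketched as follows. This is a minimal illustration, not the author's script: the dictionary keys (`ad_id`, `size`, `industry_id`, `product_id`), the `"NA"` marker, and the comma as multi-value separator are assumptions made for the example.

```python
def clean_static_rows(rows):
    """Apply the filtering rules to static ad records.

    Each row is a dict of strings; the keys used here are hypothetical.
    Multi-valued fields are assumed to be comma-separated.
    """
    cleaned = []
    for row in rows:
        if row["size"] in ("", "NA"):      # empty creative size: drop
            continue
        if "," in row["industry_id"]:      # multi-valued industry id: drop
            continue
        row = dict(row)                    # copy so the input is not mutated
        for key in ("product_id", "industry_id"):
            if row[key] == "-1":           # -1 is rewritten to 0
                row[key] = "0"
        cleaned.append(row)
    return cleaned


rows = [
    {"ad_id": "1", "size": "34", "industry_id": "12", "product_id": "-1"},
    {"ad_id": "2", "size": "", "industry_id": "12", "product_id": "7"},
    {"ad_id": "3", "size": "40", "industry_id": "5,9", "product_id": "7"},
]
kept = clean_static_rows(rows)
print([r["ad_id"] for r in kept])
```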

2.2 Remove exact duplicate rows from totalExposureLog.out: 102386695 records originally, 101507776 remain.

Command: sort totalExposureLog.out | uniq > totalExposureLog_uniq.log

/data/chenli/algo.qq.com_641013010_testa/testA/totalExposureLog_uniq/totalExposureLog_uniq.log

Count exposures per 24-hour day, without taking creation time, modification time, and so on into account.

Result: /data/chenli/algo.qq.com_641013010_testa/testA/totalExposureLog_uniq/totalExposureLog_uniq_count.log
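A per-day exposure count could look like the sketch below. The column positions (`time_col`, `ad_col`), the tab separator, the epoch-second timestamps, and the UTC+8 day boundary are all assumptions for illustration; the actual layout used by Exposure_count.py is not shown in these notes.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumed day boundary: UTC+8


def daily_exposure_counts(lines, time_col=1, ad_col=4):
    """Count exposures per (ad id, calendar day).

    `lines` are tab-separated log rows; `time_col` holds an epoch timestamp
    in seconds and `ad_col` the exposed ad id (both positions hypothetical).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        day = datetime.fromtimestamp(int(fields[time_col]), tz=CST).strftime("%Y%m%d")
        counts[(fields[ad_col], day)] += 1
    return counts


log = [
    "req1\t1550275200\tpos1\tu1\tad1",
    "req2\t1550278800\tpos1\tu2\tad1",
    "req3\t1550361600\tpos2\tu1\tad1",
]
counts = daily_exposure_counts(log)
```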

2.3 Join totalExposureLog_uniq_count.log with the static data

/data/chenli/algo.qq.com_641013010_testa/testA/totalExposureLog_uniq/totalExposureLog_count1_static_uniq.txt

2181057 records.

After splitting multi-valued creative sizes into separate rows: 2181079 records, in totalExposureLog_count1_static_uniq_res.txt
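Splitting a multi-valued field into one row per value can be sketched like this. The comma separator and the dictionary keys are assumptions for the example.

```python
def explode_multivalue(rows, column, sep=","):
    """Return one output row per value found in `column`.

    A row whose `column` holds "a,b" becomes two rows, one with "a" and one
    with "b"; all other fields are copied unchanged.
    """
    out = []
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)
            new_row[column] = value
            out.append(new_row)
    return out


sample = [
    {"ad_id": "1", "size": "34,40", "count": "12"},
    {"ad_id": "2", "size": "20", "count": "3"},
]
exploded = explode_multivalue(sample, "size")
```

Each split row keeps the original count, so the row total grows by the number of extra values, which matches the small increase from 2181057 to 2181079 records.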

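3 Third round of data processing:

3.1 Fill the exposure of old ad ids with their historical average, and use the predictions in H:\qq\totalExposureLog_uniq\for_25\2_no_product_type (best score 49.5884) as the new predicted values. Result: 50.7618

Monotonicity adjustment: (mean of the predictions) × bid / (mean bid): 76.8542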
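One way to read this adjustment is sketched below: within each ad, the mean prediction is rescaled by bid / (mean bid), so a higher bid always gets a higher predicted exposure. Grouping by ad id and the tuple layout `(ad_id, bid, prediction)` are assumptions; the notes do not spell out how the means are taken.

```python
from collections import defaultdict


def adjust_monotonic(samples):
    """Rescale predictions so they increase with bid inside each ad.

    `samples` is a list of (ad_id, bid, prediction) tuples (hypothetical
    layout). Returns the adjusted predictions in input order.
    """
    groups = defaultdict(list)
    for ad_id, bid, pred in samples:
        groups[ad_id].append((bid, pred))

    stats = {}
    for ad_id, items in groups.items():
        mean_bid = sum(b for b, _ in items) / len(items)
        mean_pred = sum(p for _, p in items) / len(items)
        stats[ad_id] = (mean_bid, mean_pred)

    adjusted = []
    for ad_id, bid, _ in samples:
        mean_bid, mean_pred = stats[ad_id]
        # Guard against a zero mean bid; fall back to the mean prediction.
        adjusted.append(mean_pred * bid / mean_bid if mean_bid else mean_pred)
    return adjusted


samples = [("a", 100, 5.0), ("a", 200, 3.0), ("b", 50, 2.0)]
adjusted = adjust_monotonic(samples)
```

In the sample, ad `a` originally had a lower prediction for the higher bid; after the adjustment the order follows the bids.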

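3.2 Fill the exposure of old ad ids with their historical average; fill the exposure of new ad ids with the historical average for the matching ad size and product id features.

Result: lgb_submission_no_price_size_id_log.csv

Predictions below 0 replaced by bid/10 from the test set: 56.2773; after the monotonicity adjustment: 78.99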

H:\qq\result\lgv_v2\lgb_submission_no_price_size_id_log\调过单调性的

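Predictions below 0 replaced by 0: 53.1888

3.3 Fill with the most complete historical exposure:

/data1/data/chenli/for_25/lgv_v2/from_all_data/known_sample_submission.txt has a little over 1300 records; with all other rows set to 0: 53.9856
Predictions below 0 replaced by bid/10 from the test set: H:\qq\result\lgv_v2\from_all_data\submission_before.csv

Monotonicity adjustment, keeping the original values for ads that have historical exposure: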

F:\桌面\RNA-seq1\qq\lgb_v2_monotonicity_with_old.py

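Result: H:\qq\result\lgv_v2\from_all_data\调单调性\submission.csv: 69.1675

Monotonicity adjustment without keeping the original values, i.e. ads with historical exposure are adjusted too: 78.92

3.4 Use the prediction file H:\qq\result\lgv_v2\lgb_submission_no_price_size_id\lgb_submission_no_price_size_id.csv

Post-processing: fill with 0, then adjust monotonicity: F:\桌面\RNA-seq1\qq\lgb_v2_monotonicity_with_old.py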

78.7521

With price added: H:\qq\result\lgv_v2\lgb_submission_with_price_size_id\submission.csv: 78.7521

4 Fourth round of data processing: change how the training set is built, drop bid, and predict the count for each id.

/data/chenli/algo.qq.com_641013010_testa/testA/4_train_data_without_bid/totalExposureLog_uniq_count_without_bid.log

Compute the historical mean for known ids:

Code: /data/chenli/algo.qq.com_641013010_testa/testA/4_train_data_without_bid/get_known_count.py

Result: /data/chenli/algo.qq.com_641013010_testa/testA/4_train_data_without_bid/known_sample_submission.txt

Join the exposure data with the static data on id:

Result: /data/chenli/algo.qq.com_641013010_testa/testA/4_train_data_without_bid/totalExposureLog_count1_static_uniq.txt

Then split the comma-separated values: totalExposureLog_count1_static_process_uniq.py

Result: totalExposureLog_count1_static_uniq_res.txt

Predict with lgb on /data/chenli/algo.qq.com_641013010_testa/testA/4_train_data_without_bid/totalExposureLog_count1_static_uniq_res.txt, with no log transform on price or label.

Post-processing: F:\桌面\RNA-seq1\qq\lgb_v2_monotonicity_with_old.py

69.9659

With log: 69.9659

Without price: training set: totalExposureLog_count1_static_uniq_res_without_price.txt

Code: F:\桌面\RNA-seq1\qq\get_totalExposureLog_count1_static_uniq_res_without_price.py

Result: negatives replaced by 0, then bid/10000 added to enforce monotonicity: H:\qq\result\4_train_data_without_bid\lgb_submission_size_id_log_CV3_no_price\submission.csv

79.6392

H:\qq\result\4_train_data_without_bid\lgb_submission_size_id_log_CV3_no_product_type_no_price

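Fill with historical values, replace negatives with 0, then add bid/10000 to both the historical values and the remaining predictions to enforce monotonicity: 81.5981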
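The post-processing chain described here can be sketched as below. It assumes the base value does not depend on bid (bid was dropped from the features in this round), so adding bid/10000 is enough to make the output strictly increasing in bid for a given ad. The function name, argument layout, and dictionaries are hypothetical and do not reproduce lgb_v4_monotonicity_with_old.py.

```python
def build_submission(test_rows, model_pred, history_mean):
    """Combine history fill, negative clipping, and a small bid offset.

    test_rows    : list of (sample_id, ad_id, bid)   (hypothetical layout)
    model_pred   : dict ad_id -> model prediction for that ad
    history_mean : dict ad_id -> historical mean exposure, known ads only
    """
    out = []
    for sample_id, ad_id, bid in test_rows:
        # Prefer the historical value when the ad has been seen before.
        base = history_mean.get(ad_id, model_pred[ad_id])
        base = max(base, 0.0)              # negatives become 0
        out.append((sample_id, base + bid / 10000))
    return out


test_rows = [(1, "a", 100), (2, "a", 300), (3, "b", 50)]
model_pred = {"a": 7.5, "b": -2.0}
history_mean = {"a": 12.0}
submission = build_submission(test_rows, model_pred, history_mean)
```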

Post-processing to produce the submission: F:\桌面\RNA-seq1\qq\lgb_v4_monotonicity_with_old.py

"H:\\qq\\result\\4_train_data_without_bid\\lgb_submission_size_id_log_CV3_no_product_type_no_price\\submission.csv"

5 Fifth round of data processing: combine the operation data and the static data, keeping only data with a complete day of exposure

Code: /data1/data/chenli/for_25/5_yitian_time/static_data_with_operation_data-3.py

Result: /data1/data/chenli/for_25/5_yitian_time/ad_static_feature_filter.out, about 500,000 records

Then count the exposure of each id on each day: get_five_exposure_count.py

Result: /data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count.txt, a little over 400,000 ids

Compute the mean exposure of each id: /data1/data/chenli/for_25/5_yitian_time/totalExposureLog_count1_static_uniq_average.txt

Code: /data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count_average.py

Handle multi-valued fields in the result (comma-separated): totalExposureLog_count1_static_uniq_average_process.txt

Prediction: /data1/data/chenli/for_25/5_yitian_time/lgb_without_price.py

Result: /data1/data/chenli/for_25/5_yitian_time/lgb_submission_with_price_size_id_log_no_price_no_product_type.csv

Post-processing: F:\桌面\RNA-seq1\qq\lgb_v4_monotonicity_with_old.py

Result: H:\qq\result\5_完整时间\submission.csv: 79.09

Switch the loss from MAPE to MAE, with the target transform y = log(y + 1). Code:

lgb_without_price_mae.py

lgb_v4_monotonicity_with_old_mae.py

Result: H:\qq\result\5_完整时间\mae\submission.csv: 72.818
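The y = log(y + 1) transform and its inverse can be sketched as follows. Training with MAE on the transformed target penalizes relative rather than absolute differences, which is presumably why it was tried in place of a MAPE loss. The helper names are hypothetical and are not taken from lgb_without_price_mae.py.

```python
import math


def to_log_target(y):
    """Transform an exposure count before training: y -> log(y + 1)."""
    return math.log1p(y)


def from_log_prediction(z):
    """Invert the transform on a model output, clipping negatives to 0."""
    return max(math.expm1(z), 0.0)


def mae(y_true, y_pred):
    """Mean absolute error, computed here on already transformed values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# In log space, predicting 20 for 10 costs about as much as 200 for 100.
small = mae([to_log_target(10)], [to_log_target(20)])
large = mae([to_log_target(100)], [to_log_target(200)])
```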

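Using my teammate's data with my model and monotonicity method: 75.4068

5.2 Reprocess the static data: drop records with a multi-valued product id, a multi-valued creative size, or a multi-valued ad industry id.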

H:\\qq\\ad_static_feature_filter_v2.out

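Code: F:\桌面\RNA-seq1\qq\static_process.py

Result: /data1/data/chenli/for_25/5_yitian_time/new_static/submission.csv: 79.09

Predictions divided by 4: 83.5808

Predictions divided by 5: 83.9204

Predictions divided by 10: 83.9404

Predictions min-max normalized, (x - min) / (max - min): 84.5168

log(x+1) normalization: 84.3736

Blend of the 84.5168 submission with my teammate's 82 submission: 84.2495

Normalized, ×25: 85.1784

Normalized, ×15: 85.0124

Normalized, ×20: 85.3514, mean about 18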
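The min-max normalization followed by a constant multiplier can be sketched as below. The function name and the handling of a constant input are assumptions; the multiplier corresponds to the ×15, ×20, ×25 factors tried above.

```python
def min_max_scale(values, factor):
    """Map values to [0, 1] with (x - min) / (max - min), then multiply.

    If all values are equal the range is zero, so every output is 0.0
    (a choice made for this sketch).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * factor for v in values]


preds = [2.0, 4.0, 6.0]
scaled = min_max_scale(preds, 20)
mean_scaled = sum(scaled) / len(scaled)
```

The scaling keeps the ordering of the predictions and only changes their range, so the multiplier mainly controls the mean and spread of the submitted values.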

Plus historical values scaled ×5100: 85.8071, mean 18.867424, 146.409786

Normalized, ×18, plus historical values scaled ×5100: 85.7455, mean 18.170867, 146.468141

/data1/data/chenli/for_25/5_yitian_time/new_static/new_history_sacle/change_std.py

Make the mean smaller and the variance smaller: 85.8061

same_sample_target_dict1[key2]=((float(same_sample_target_dict[key2])-min_value)/((max_value-min_value) )*1.1)*23

value=((float(known_sample_dict[line_list[1]])-min_know_sample)/((max_know_sample-min_know_sample)*1.1))*5400

Make the mean larger and the variance smaller: 85.8061

same_sample_target_dict1[key2]=((float(same_sample_target_dict[key2])-min_value)/((max_value-min_value) )*1.5)*28

value=((float(known_sample_dict[line_list[1]])-min_know_sample)/((max_know_sample-min_know_sample)*1.6))*7000

Enlarge the variance using my teammate's method: 85.8061

H:\qq\result\5_完整时间\new_static\归一化\change_std\submission.csv

H:\qq\result\5_完整时间\new_static\归一化\history_scale\submission.csv

New historical-value scaling, doubling the large values and halving the small ones: 85.018, mean 18.450576

Add SMAPE and tune the parameters:

learning_rate=0.01,
n_estimators=20,

Mean mse: 153.681786537, std mse: 60.294994631. All mse: 20096.299746574.
Mean smape: 0.000018600, std smape: 0.000008742. All smape: 0.000004448.

learning_rate=0.01,
n_estimators=40,

Mean mse: 153.681785082, std mse: 60.294995299. All mse: 20096.299709243.
Mean smape: 0.000018600, std smape: 0.000008742. All smape: 0.000004448.

learning_rate=0.01,
n_estimators=60,

Mean mse: 153.681786558, std mse: 60.294994770. All mse: 20096.299759089.
Mean smape: 0.000018600, std smape: 0.000008742. All smape: 0.000004448.

learning_rate=0.05,
n_estimators=40,

Mean mse: 167.347358596, std mse: 52.372359150. All mse: 20062.579857791.
Mean smape: 0.000019398, std smape: 0.000007727. All smape: 0.000003346.
Result: lgb_submission_new_static_0.01_60.csv

learning_rate=0.01,
n_estimators=60,

Mean mse: 167.347358596, std mse: 52.372359150. All mse: 20062.579857791.
Mean smape: 0.000019398, std smape: 0.000007727. All smape: 0.000003346.

Chose learning_rate=0.01, n_estimators=40.

Using SMAPE, the score dropped to 82.7365
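For reference, a common definition of SMAPE is sketched below: the absolute error divided by the average magnitude of truth and prediction, averaged over samples. This is an assumed formulation; the exact variant behind the logged values above is not shown in these notes.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in the range [0, 2].

    A pair where both truth and prediction are 0 contributes 0, to avoid
    dividing by zero.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        total += 0.0 if denom == 0 else abs(p - t) / denom
    return total / len(y_true)


perfect = smape([100, 5], [100, 5])
worst = smape([0], [10])
swapped = smape([1, 3], [3, 1])
```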

5.2.4 Compute the historical median:

/data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count_median.py

Result: /data1/data/chenli/for_25/5_yitian_time/totalExposureLog_count1_static_uniq_median.txt
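Computing a per-id median from daily counts can be sketched as follows. The `(ad_id, day, count)` tuple layout is an assumption and does not reproduce get_five_exposure_count_median.py.

```python
from collections import defaultdict
from statistics import median


def median_exposure(daily_counts):
    """Return {ad_id: median daily exposure}.

    `daily_counts` is an iterable of (ad_id, day, count) tuples
    (hypothetical layout).
    """
    per_ad = defaultdict(list)
    for ad_id, _day, count in daily_counts:
        per_ad[ad_id].append(count)
    return {ad_id: median(counts) for ad_id, counts in per_ad.items()}


daily = [
    ("a", "20190301", 1),
    ("a", "20190302", 100),
    ("a", "20190303", 3),
    ("b", "20190301", 4),
    ("b", "20190302", 6),
]
medians = median_exposure(daily)
```

Unlike the mean, the median of ad `a` is not pulled up by the single day with 100 exposures.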

Code: /data1/data/chenli/for_25/5_yitian_time/new_static/median/lgb_v4_monotonicity_with_old.py

Submission:

H:\qq\result\5_完整时间\new_static\归一化\median\submission_85.8364.csv

Without changing to 4350, still using 5100:

85.7389

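Making the large values larger: 85.7909; the variance increased.

Filling with the values from the 19th: 85.2293

Predict the median directly with the model. Result: /data1/data/chenli/for_25/5_yitian_time/new_static/median_predict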

84.9236

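Setting values below 0.5 to 0: score unchanged

rank/10000: 85.8423

Round down at 2, up at 8, with rank*3: 85.925

rank*7: 85.9535

Round down at 5, up at 7, with rank*7: 85.956

Round down at 5, never up, with rank*7: 85.9335

Round down at 5, up at 8, with rank*7: 85.9335

Round down at 4, up at 7: 85.929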
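The threshold rounding variants above can be read as follows: look at the first decimal digit, round down when it is at or below one threshold, round up when it is at or above another, and leave the value unchanged in between. The sketch below implements that reading for non-negative values; the interpretation itself is an assumption, since the notes give only the shorthand.

```python
import math


def threshold_round(x, down_at, up_at):
    """Round a non-negative x by its first decimal digit.

    digit <= down_at -> round down; digit >= up_at -> round up;
    otherwise the value is returned unchanged. Pass up_at=10 to never
    round up.
    """
    whole = math.floor(x)
    # The small epsilon absorbs floating-point error in the fractional part.
    digit = int((x - whole) * 10 + 1e-9)
    if digit <= down_at:
        return float(whole)
    if digit >= up_at:
        return float(whole + 1)
    return x


examples = [threshold_round(v, 5, 7) for v in (3.5, 3.6, 3.7)]
```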

5.2.5 Using product_type:

Mean mse: 178.076122690, std mse: 45.446324348. All mse: 20118.303165090.
Mean smape: 0.000015740, std smape: 0.000008321. All smape: 0.000003527.

Mean mse: 178.124333730, std mse: 45.417656607. All mse: 20119.597510196.
Mean smape: 0.000015473, std smape: 0.000008725. All smape: 0.000003527.

Score dropped: 85.7413

Using product_type and product id

Mean mse: 182.235124232, std mse: 42.884181205. All mse: 20078.758639709.
Mean smape: 0.000019011, std smape: 0.000006515. All smape: 0.000003485.

Using product id

Mean mse: 180.041759753, std mse: 44.131588175. All mse: 20095.480800340.
Mean smape: 0.000018791, std smape: 0.000007167. All smape: 0.000003534.

85.3832

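5.3 Process the operation data:

Drop the records dated 20190230; all of them are modifications that set the status to inactive. 1292 rows in total.

Remove the 1522 rows with creation time 0 and operation type "modify".

Remove duplicates: 3064 rows deleted.

I did not finish this processing and used my teammate's data instead: /data1/data/chenli/for_25/5_yitian_time/new_static/队友的数据/pre_ad_operation.csv

Compute the inactive time periods for each id.

Remember to keep the records with creation time 0 that do not appear in the operation data.

Code: F:\桌面\RNA-seq1\qq\get_Invalid_time_for_each_id.py and F:\桌面\RNA-seq1\qq\get_five_exposure_count_baoliu2_process.py

Merge get_five_exposure_count_baoliu1.txt and get_five_exposure_count_baoliu22.txt into get_five_exposure_count_baoliu.txt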
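Deriving the inactive periods of each id from status changes can be sketched as below. The event layout `(ad_id, time, is_active)` and the `end_time` cutoff are assumptions for illustration; this does not reproduce get_Invalid_time_for_each_id.py.

```python
def inactive_periods(events, end_time):
    """Return {ad_id: [(start, end), ...]} for spans where the ad is inactive.

    events   : list of (ad_id, time, is_active) status changes
    end_time : closing time used for ads that are still inactive at the end
    """
    periods = {}
    open_since = {}                      # ad_id -> start of current inactive span
    for ad_id, t, is_active in sorted(events, key=lambda e: (e[0], e[1])):
        if not is_active and ad_id not in open_since:
            open_since[ad_id] = t        # the ad just became inactive
        elif is_active and ad_id in open_since:
            periods.setdefault(ad_id, []).append((open_since.pop(ad_id), t))
    for ad_id, start in open_since.items():
        periods.setdefault(ad_id, []).append((start, end_time))
    return periods


events = [("a", 30, False), ("a", 10, False), ("a", 20, True), ("b", 5, True)]
result = inactive_periods(events, end_time=40)
```

Days that fall inside one of these periods can then be excluded when keeping only complete days of exposure.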

Training: median fill, rank/10000.

Result: /data1/data/chenli/for_25/5_yitian_time/new_get_five_exposure_count_baoliu, ×38

85.729

*22 84.4326

5.6 Using smooth_smape: 82.615

Mean mse: 141.236253879, std mse: 67.403471372. All mse: 20098.378660878.
Mean smape: 0.000019701, std smape: 0.000006800. All smape: 0.000004454.

5.7 Add bid, pctr, ecpm, uid

/data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count_with_bid_pctr_ecpm1.py extracts the bid of each id on each day

Result: /data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count_bid_uid1.txt

The following columns are, in order, the lists of pctr, quality_ecpm, totalecpm, and uid.

/data1/data/chenli/for_25/5_yitian_time/get_five_exposure_count_average_bid_average.py

Result: /data1/data/chenli/for_25/5_yitian_time/totalExposureLog_count1_static_uniq_average_pctr_uid.txt

With the most frequent uid:

totalExposureLog_count1_static_uniq_average_pctr_uid_max.txt
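Picking the most frequent uid for each ad can be sketched with `collections.Counter`. The `(ad_id, uid)` pair layout is an assumption made for the example.

```python
from collections import Counter, defaultdict


def most_frequent_uid(exposures):
    """Return {ad_id: uid that appears most often in that ad's exposures}.

    `exposures` is an iterable of (ad_id, uid) pairs (hypothetical layout).
    On a tie, the uid encountered first wins.
    """
    per_ad = defaultdict(Counter)
    for ad_id, uid in exposures:
        per_ad[ad_id][uid] += 1
    return {ad_id: c.most_common(1)[0][0] for ad_id, c in per_ad.items()}


pairs = [("a", "u1"), ("a", "u2"), ("a", "u2"), ("b", "u9")]
top_uid = most_frequent_uid(pairs)
```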

pctr, ecpm1, and ecpm2 for the test set:

Btest_sample_new_no_price_pctr_ecpm.txt

Dropping uid: training set: totalExposureLog_count1_static_uniq_average_pctr_without_uidlist.txt

Test set: Btest_sample_new_no_price_pctr_ecpm_without_uidlist.txt

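6 Leaderboard B: other participants' experience

Ids above 1000:

The second one is the result on the training set. My 8 features are size, industry id, product type, product id, account id, bid, day of week, and delivery time-slot length, expanded to 2000+ dimensions, scoring 83. Then come the historical exposure plus the historical operations "if".

Did Yu Lao (鱼佬) use only the user id and none of the other user information?

For audience targeting I only counted the number; I really don't know how to use the rest. So did you compute that from the audience targeting in the operation data?

Teammate's method:

Data preprocessing:
Static data:
Delete ads whose creative size is NA
Delete ads whose creation time is 0
Drop records with multi-valued ad size
Drop records with multi-valued product id
Drop records with multi-valued industry id
Add the creation time in date format
Operation data, removed:
NaN records
Exact duplicate records
Records with creation time 0 and operation type "modify" (1)
Frequent operations on the same modified field at the same time
Fill operation records with create time = 0 from the static ad data; those that still cannot be found in the static data are deleted
Exposure data:
Remove exact duplicate records
Remove records with the same user, the same request, and the same ad position
Remove records with abnormal pctr
Remove records with abnormal bid
Remove records with abnormal quality ecpm
Remove records with abnormal total ecpm
Training data:
1. Take the ads common to the operation data and the static data to form the configuration data; delete ads with multiple modification operations in one day and operation data before Feb 16.
2. Compute the exposure of each ad on each day; for ads with no exposure, fill pctr and similar fields with the daily mean and fill the exposure with 0. The daily exposure of each ad forms the training data.
3. Train on the exposure from Feb 16 to Mar 18, use Mar 19 as the offline validation set, and predict with an lgb model; for test ads that appeared in the training set, fill with the exposure of the last day of historical data; replace negatives with 0.
Features used: account id, product id, industry id, ad creative size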
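The date split and the final fill step of the teammate's method can be sketched as below. The row layout, the `YYYYMMDD` day strings, and both function names are assumptions; the model training itself is left out.

```python
def split_by_date(rows, train_start="20190216", train_end="20190318",
                  valid_day="20190319"):
    """Split daily rows into a training range and one validation day.

    Each row is a dict with a 'day' key in YYYYMMDD form (hypothetical
    layout); string comparison works because the format is fixed width.
    """
    train = [r for r in rows if train_start <= r["day"] <= train_end]
    valid = [r for r in rows if r["day"] == valid_day]
    return train, valid


def postprocess(test_ad_ids, preds, last_day_exposure):
    """Use last-day exposure for known ads, clipped predictions otherwise."""
    out = []
    for ad_id, pred in zip(test_ad_ids, preds):
        if ad_id in last_day_exposure:
            out.append(last_day_exposure[ad_id])   # ad seen in training data
        else:
            out.append(max(pred, 0.0))             # negatives become 0
    return out


rows = [{"day": "20190215"}, {"day": "20190216"},
        {"day": "20190318"}, {"day": "20190319"}]
train, valid = split_by_date(rows)
final = postprocess(["a", "b", "c"], [3.2, -1.5, 4.0], {"a": 17})
```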
