Preface
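The iPinYou dataset is one of the earlier benchmarks for CTR prediction and bidding strategies. Weinan Zhang also released a standardized tool for processing it, but it appears to be written for Python 2. I have adapted it to Python 3, and I also describe some problems you may run into.
Open-source repository: https://github.com/jingranburangyongzhongwen/make-ipinyou-data_py3
By the way, Typora is now paid software; an alternative is https://marktext.app/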
Usage
Step 0
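First, download the dataset from Baidu Netdisk: http://pan.baidu.com/s/1kTwX2mF. The UCL link no longer works, but even with Baidu Netdisk's speed limit the download finishes in about one night.
You will then have the folder ipinyou.contest.dataset. Assume its path is ~/ipinyou.contest.dataset.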
Step 1
Download make-ipinyou-data from wnzhang/make-ipinyou-data.
Step 2
Update the symbolic link to ipinyou.contest.dataset inside original-data.
lkf@ubuntu:~/make-ipinyou-data/original-data$ ln -sfn ~/ipinyou.contest.dataset ipinyou.contest.dataset
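Before running the later steps, it can help to confirm that the link really points at the dataset. The following is only a minimal sketch: the function name check_dataset_link is hypothetical, and the expected subfolders training2nd and training3rd are taken from the make output shown later in this post, so adjust them if your copy of the dataset is laid out differently.

```python
import os


def check_dataset_link(link_path, expected_subdirs=("training2nd", "training3rd")):
    """Return a list of problems found with the dataset symlink (empty if none)."""
    problems = []
    if not os.path.islink(link_path):
        problems.append(f"{link_path} is not a symbolic link")
    # realpath follows the link, so a dangling link resolves to a missing path.
    target = os.path.realpath(link_path)
    if not os.path.isdir(target):
        problems.append(f"link target {target} is not an existing directory")
        return problems
    for name in expected_subdirs:
        if not os.path.isdir(os.path.join(target, name)):
            problems.append(f"missing subdirectory {name} in {target}")
    return problems
```

For example, calling check_dataset_link("original-data/ipinyou.contest.dataset") from the make-ipinyou-data folder should return an empty list if the link is set up as described above.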
Step 3
Replace the python folder there with the one from this repository. Make sure the files in it have permission 775 or 777.
lkf@ubuntu:~/make-ipinyou-data/python$ chmod 777 *
Step 4
In the make-ipinyou-data folder, run make all. It takes about 30 minutes, after which you will get:
859M ./3358
482M ./2259
1.3G ./3427
1.4G ./3386
56K ./python
5.4G ./all
1.6G ./1458
396M ./2261
804K ./.git
1.1G ./3476
4.0K ./original-data
135M ./2997
776M ./2821
14G .
Data format
Taking the 1458 folder as an example, it contains the following files:
featindex.txt test.log.txt test.yzx.txt train.log.txt train.yzx.txt
Here train.log.txt and test.log.txt use the same data format as described in the article. Each line is one log record, as follows:
click weekday hour bidid timestamp logtype ipinyouid useragent IP region city adexchange domain url urlid slotid slotwidth slotheight slotvisibility slotformat slotprice creative bidprice payprice keypage advertiser usertag
0 4 00 81aced04baad90f9358aa39a4521cd6f 20130606000104828 1 Vhk7ZAnxDIuOjCn windows_ie 115.45.195.* 216 219 1 trqRTJkrBoq7JsNr5SqfNX f41292b3547399af082eccc2ad28f23c null mm_34022157_3445226_11175096 336 280 2 1 0 77819d3e0b3467fe5c7b16d68ad923a1 300 51 bebefa5efe83beee17a3d245e7c5085b 1458 10006,10110
0 4 00 572fa35095e8b6c30b1aa871e52b2d 20130606000105075 1 Z0n7Ce1GPe5-toc windows_chrome 120.40.95.* 124 125 1 trqRTJjrXqf7FmMs 23510110004d0dcb593e2e3c1fc46e28 null mm_13991432_2298120_9467354 336 280 0 1 0 77819d3e0b3467fe5c7b16d68ad923a1 300 87 bebefa5efe83beee17a3d245e7c5085b 1458 10031,13042,10110
0 4 00 e1e44b8a725b957a626991ecb15b56f 20130606000105119 1 VhkE1w9iOeu2eWz windows_ie 60.163.144.* 94 98 1 5On-q5uvgN171m58uG c2015a85d584d92f7da991154996254f null mm_10024662_3445902_11178359 336 280 2 1 0 77819d3e0b3467fe5c7b16d68ad923a1 300 33 bebefa5efe83beee17a3d245e7c5085b 1458 10006,10110
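To make the column layout concrete, here is a minimal sketch that pairs each field of a log line with its column name from the header. It assumes the fields are separated by whitespace, as they appear in this post; the real files may use tabs and may contain empty fields, in which case the splitting would need to change. The names HEADER and parse_log_line are hypothetical helpers, not part of make-ipinyou-data.

```python
# Column names copied from the header line of train.log.txt shown above.
HEADER = (
    "click weekday hour bidid timestamp logtype ipinyouid useragent IP "
    "region city adexchange domain url urlid slotid slotwidth slotheight "
    "slotvisibility slotformat slotprice creative bidprice payprice "
    "keypage advertiser usertag"
).split()


def parse_log_line(line, columns=HEADER):
    """Split one log line on whitespace and pair each field with its column name."""
    fields = line.split()
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# First data row from the 1458 example above.
sample = (
    "0 4 00 81aced04baad90f9358aa39a4521cd6f 20130606000104828 1 "
    "Vhk7ZAnxDIuOjCn windows_ie 115.45.195.* 216 219 1 "
    "trqRTJkrBoq7JsNr5SqfNX f41292b3547399af082eccc2ad28f23c null "
    "mm_34022157_3445226_11175096 336 280 2 1 0 "
    "77819d3e0b3467fe5c7b16d68ad923a1 300 51 "
    "bebefa5efe83beee17a3d245e7c5085b 1458 10006,10110"
)
record = parse_log_line(sample)
print(record["click"], record["IP"], record["bidprice"], record["payprice"])
```

All values are kept as strings. Converting click, bidprice and payprice to integers is left to the caller.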
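The other files may exist to reduce file size. I am not sure about this, and I think the *.log.txt files should be enough on their own. If you are interested, look into it and let me know: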
featindex.txt maps the features to their indexes. For example, 8:115.45.195.* 29 means that the 8th column in train.log.txt (counting from 0, which is the IP column) with the string 115.45.195.* maps to feature index 29.
train.yzx.txt and test.yzx.txt are the mapped vector data for train.log.txt and test.log.txt. The format is y: click, z: winning price, and x: features. Such data is in the standard form as introduced in iPinYou Benchmarking.
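As a rough illustration of how these two files could be read, the sketch below parses one featindex.txt entry and one yzx line. The featindex example is the one quoted above. The yzx line is hypothetical: this post does not show one, so the layout "y z index:value index:value ..." is an assumption that you should check against your own train.yzx.txt. Both function names are made up for this example, and featindex lines that do not have the column:value form are not handled.

```python
def parse_featindex_line(line):
    """Parse 'column:value index' into (column, value, index)."""
    # The index is the last whitespace-separated token on the line.
    key, index = line.rsplit(None, 1)
    # Only the first colon separates the column number from the raw value.
    column, _, value = key.partition(":")
    return int(column), value, int(index)


def parse_yzx_line(line):
    """Parse an assumed 'y z index:value ...' line into (y, z, feature_indexes)."""
    parts = line.split()
    y, z = int(parts[0]), int(parts[1])
    features = [int(token.split(":", 1)[0]) for token in parts[2:]]
    return y, z, features


# Entry quoted in the description of featindex.txt above.
print(parse_featindex_line("8:115.45.195.* 29"))

# Hypothetical yzx line, written only to illustrate the assumed layout.
print(parse_yzx_line("0 51 29:1"))
```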
Possible problems
After running make all, you may see errors like the following:
mkdir -p ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training2nd/imp.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training2nd/clk.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training3rd/imp.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training3rd/clk.*.bz2 ./original-data/ipinyou.contest.dataset/train
bzip2 -d ./original-data/ipinyou.contest.dataset/train/*
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt already exists.
Makefile:11: recipe for target 'init' failed
make: *** [init] Error 2
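This may be because some files in the iPinYou dataset have already been decompressed, or because make-ipinyou-data already contains some generated files.
My suggestion is to delete everything and start from scratch. I had to start over several times while making the Python 3 changes.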
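If you would like to see which files are causing the "is not a bzip2 file" messages before deleting anything, a small check like the one below may help. It is only a sketch: it relies on the fact that bzip2 files start with the three bytes BZh, and the function name find_non_bzip2_files is hypothetical. It lists the leftovers and does not delete anything.

```python
import os

# bzip2 streams start with these three magic bytes.
BZIP2_MAGIC = b"BZh"


def find_non_bzip2_files(directory):
    """Return the names of regular files in directory that do not start with the bzip2 magic."""
    leftovers = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            if f.read(3) != BZIP2_MAGIC:
                leftovers.append(name)
    return leftovers
```

For example, find_non_bzip2_files("original-data/ipinyou.contest.dataset/train") would list already-decompressed files such as the clk.20130606.txt named in the log above, if they are still present.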