Python 3 changes for make-ipinyou-data

Preface

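The iPinYou dataset is one of the earlier benchmarks for CTR prediction and bidding strategies. Weinan Zhang also released a standardized tool for it, but it appears to be written for Python 2. Here I adapt it to Python 3 and also describe some problems you may run into.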

Open-source repository: https://github.com/jingranburangyongzhongwen/make-ipinyou-data_py3

By the way, Typora is now paid software; one alternative is https://marktext.app/

Usage

Step 0

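First, download the dataset from Baidu Netdisk: http://pan.baidu.com/s/1kTwX2mF. The UCL link no longer works, but even with Baidu Netdisk's speed limit the download finishes in roughly one night.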

You will then have the folder ipinyou.contest.dataset. Assume its path is ~/ipinyou.contest.dataset

Step 1

Download make-ipinyou-data from wnzhang/make-ipinyou-data.

Step 2

Update the soft link for the folder ipinyou.contest.dataset inside original-data.

lkf@ubuntu:~/make-ipinyou-data/original-data$ ln -sfn ~/ipinyou.contest.dataset ipinyou.contest.dataset
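
Before moving on, it can help to confirm that the link really resolves to the downloaded dataset. The following is only a minimal sketch: the helper name missing_subfolders is hypothetical, and the two expected subfolder names are taken from the cp commands in the make output shown later in this post, not from any official layout description.

```python
import os
import tempfile
from pathlib import Path

def missing_subfolders(link, expected=("training2nd", "training3rd")):
    """Return the expected subfolders that cannot be reached through link."""
    target = Path(link).resolve()  # follows the symlink to the real folder
    return [name for name in expected if not (target / name).is_dir()]

# Demo on a throwaway directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp, "dataset")
    (dataset / "training2nd").mkdir(parents=True)
    link = Path(tmp, "ipinyou.contest.dataset")
    os.symlink(dataset, link)
    print(missing_subfolders(link))  # ['training3rd']
```

An empty list means every expected subfolder is reachable through the link.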

Step 3

Replace its python folder with the one from this repository. Make sure the files' permissions are 775 or 777.

lkf@ubuntu:~/make-ipinyou-data/python$ chmod 777 *

Step 4

In the make-ipinyou-data folder, run make all. It takes about 30 minutes, after which you will have:

859M    ./3358
482M    ./2259
1.3G    ./3427
1.4G    ./3386
56K    ./python
5.4G    ./all
1.6G    ./1458
396M    ./2261
804K    ./.git
1.1G    ./3476
4.0K    ./original-data
135M    ./2997
776M    ./2821
14G    .

Data format

Take the 1458 folder as an example. It contains these files:

featindex.txt  test.log.txt  test.yzx.txt  train.log.txt  train.yzx.txt

Of these, train.log.txt and test.log.txt use the same data format as described in the paper. Each line is one log record, as follows:

click	weekday	hour	bidid	timestamp	logtype	ipinyouid	useragent	IP	region	city	adexchange	domain	url	urlid	slotid	slotwidth	slotheight	slotvisibility	slotformat	slotprice	creative	bidprice	payprice	keypage	advertiser	usertag
0	4	00	81aced04baad90f9358aa39a4521cd6f	20130606000104828	1	Vhk7ZAnxDIuOjCn	windows_ie	115.45.195.*	216	219	1	trqRTJkrBoq7JsNr5SqfNX	f41292b3547399af082eccc2ad28f23c	null	mm_34022157_3445226_11175096	336	280	2	1	0	77819d3e0b3467fe5c7b16d68ad923a1	300	51	bebefa5efe83beee17a3d245e7c5085b	1458	10006,10110
0	4	00	572fa35095e8b6c30b1aa871e52b2d	20130606000105075	1	Z0n7Ce1GPe5-toc	windows_chrome	120.40.95.*	124	125	1	trqRTJjrXqf7FmMs	23510110004d0dcb593e2e3c1fc46e28	null	mm_13991432_2298120_9467354	336	280	0	1	0	77819d3e0b3467fe5c7b16d68ad923a1	300	87	bebefa5efe83beee17a3d245e7c5085b	1458	10031,13042,10110
0	4	00	e1e44b8a725b957a626991ecb15b56f	20130606000105119	1	VhkE1w9iOeu2eWz	windows_ie	60.163.144.*	94	98	1	5On-q5uvgN171m58uG	c2015a85d584d92f7da991154996254f	null	mm_10024662_3445902_11178359	336	280	2	1	0	77819d3e0b3467fe5c7b16d68ad923a1	300	33	bebefa5efe83beee17a3d245e7c5085b	1458	10006,10110
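
Since these files are tab-separated with a header row, they can be read with Python's csv module. The sketch below is an illustration rather than part of the repository: read_log is a hypothetical helper, and the sample uses only a few of the columns above to stay short.

```python
import csv
import io

def read_log(stream):
    """Yield one dict per record from a tab-separated *.log.txt stream."""
    # QUOTE_NONE: treat quote characters in URLs or user agents as plain text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Every field is read as a string; convert the numeric ones used here.
        row["click"] = int(row["click"])
        row["payprice"] = int(row["payprice"])
        # usertag is a comma-separated list of tag ids.
        row["usertag"] = row["usertag"].split(",")
        yield row

# Abbreviated sample: a subset of the real columns, values from the first record.
sample = (
    "click\tweekday\thour\tpayprice\tadvertiser\tusertag\n"
    "0\t4\t00\t51\t1458\t10006,10110\n"
)
records = list(read_log(io.StringIO(sample)))
print(records[0]["payprice"], records[0]["usertag"])  # 51 ['10006', '10110']
```

For a real file, pass open("train.log.txt", newline="", encoding="utf-8") instead of the in-memory sample; the encoding is an assumption.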

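The other files may exist to reduce file size. I am not sure about this, and I think the *.log.txt files alone should be enough. If you are interested, look into it and let me know: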

  • featindex.txt maps the features to their indexes. For example, 8:115.45.195.* 29 means that the string 115.45.195.* in column 8 of train.log.txt (counting from 0, which is the IP column) maps to feature index 29.
  • train.yzx.txt and test.yzx.txt are the mapped vector data for train.log.txt and test.log.txt. The format is y: click, z: winning price, and x: features. This is the standard form introduced in iPinYou Benchmarking.
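
To make the two formats concrete, here is a rough sketch of how they might be read. Both helper names are hypothetical. The exact on-disk layout is an assumption: featindex.txt lines are taken to be a column:value key followed by whitespace and an index, and yzx lines are taken to be y, z, then whitespace-separated index:value pairs. Check a few lines of your own output before relying on it.

```python
def load_featindex(lines):
    """Map 'column:value' keys to integer feature indexes."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last run of whitespace so the key is kept intact.
        key, idx = line.rsplit(None, 1)
        index[key] = int(idx)
    return index

def parse_yzx(line):
    """Split one yzx line into (click, winning price, feature indexes)."""
    fields = line.split()
    y, z = int(fields[0]), int(fields[1])
    # Assumed layout: each feature is 'index:value'; only the index is kept.
    x = [int(f.split(":")[0]) for f in fields[2:]]
    return y, z, x

featindex = load_featindex(["8:115.45.195.*\t29\n"])
print(featindex["8:115.45.195.*"])  # 29
print(parse_yzx("0 51 0:1 29:1"))   # (0, 51, [0, 29])
```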

Possible problems

You may see errors like the following after running make all:

mkdir -p ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training2nd/imp.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training2nd/clk.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training3rd/imp.*.bz2 ./original-data/ipinyou.contest.dataset/train
cp ./original-data/ipinyou.contest.dataset/training3rd/clk.*.bz2 ./original-data/ipinyou.contest.dataset/train
bzip2 -d ./original-data/ipinyou.contest.dataset/train/*
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130606.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130607.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130608.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130609.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130610.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130611.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20130612.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131019.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131020.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131021.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131022.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131023.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131024.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131025.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131026.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt -- using ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/clk.20131027.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130606.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130607.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130608.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130609.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130610.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130611.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20130612.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131019.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131020.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131021.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131022.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131023.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131024.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131025.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131026.txt already exists.
bzip2: Can't guess original name for ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt -- using ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt.out
bzip2: ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt is not a bzip2 file.
bzip2: Output file ./original-data/ipinyou.contest.dataset/train/imp.20131027.txt already exists.
Makefile:11: recipe for target 'init' failed
make: *** [init] Error 2

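This may be because some of the iPinYou data files have already been decompressed, or because make-ipinyou-data already contains some files from an earlier run.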

My suggestion is to delete everything and start over. I had to start over several times myself while making the Python 3 changes.
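
One way to spot this situation before re-running make is to look for files in the train folder that are not .bz2 archives, since those are the ones bzip2 -d complains about in the log above. This is only a sketch with a hypothetical helper name, stale_files; it reports files and leaves deletion to you.

```python
import tempfile
from pathlib import Path

def stale_files(train_dir):
    """Return names of files in train_dir that are not .bz2 archives."""
    train = Path(train_dir)
    if not train.is_dir():
        return []
    return sorted(p.name for p in train.iterdir()
                  if p.is_file() and p.suffix != ".bz2")

# Demo on a throwaway directory that mimics a half-finished earlier run.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "clk.20130606.txt").touch()
    Path(tmp, "imp.20130606.txt.bz2").touch()
    print(stale_files(tmp))  # ['clk.20130606.txt']
```

On a real checkout you would call it on original-data/ipinyou.contest.dataset/train, the path that appears in the log. A non-empty result suggests leftovers from a previous run.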
