A step-by-step beginner's guide to a big data competition


ssdut_yrp


Article link: https://blog.csdn.net/yrp_ssdut/article/details/21522959


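A disclaimer first: most of the ideas here come from another write-up, with some of my own mixed in. Experts who don't like it, please go easy on me.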

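My classmates and I recently formed a team for a fairly large competition, the Tmall Recommendation Algorithm Challenge, hoping to put what we learned at university to use and to challenge ourselves. If you are taking part as well, let's discuss.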

Keywords: https://www.coursera.org/, one regression plus one collaborative filtering

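STEP1. A first glance at the data shows that the time column is actually in Chinese, so convert it to a workable date format first, and simply assume the data is from 2013.

(For handling Chinese text in Python, see here)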

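import re

def parse_date(raw_date):
    # Accept GBK-encoded bytes (a Python 2 str) as well as already-decoded text.
    if isinstance(raw_date, bytes):
        raw_date = raw_date.decode("gbk")
    # After decoding, every Chinese character counts as one character, so a date
    # is: month digits, a month marker, day digits, a day marker. Taking the two
    # digit runs handles one- and two-digit values and ignores a trailing newline.
    month, day = [int(n) for n in re.findall(r"\d+", raw_date)[:2]]
    return 2013, month, day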
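
As a quick check of the parsing idea, here is a minimal, self-contained sketch written for Python 3, where strings are already Unicode and no GBK decoding is needed. The helper name `parse_cn_date` and the sample strings are illustrative assumptions, not part of the competition data.

```python
import re
from datetime import date

def parse_cn_date(text, year=2013):
    """Turn a string such as '6月25日' into a date; the year is assumed."""
    # Take the first two digit runs as month and day.
    month, day = (int(n) for n in re.findall(r"\d+", text)[:2])
    return date(year, month, day)

print(parse_cn_date("6月2日"))
print(parse_cn_date("6月25日\n"))  # a trailing newline is ignored
```

Returning a `date` directly, rather than a tuple, is only a convenience here; the tuple form above works just as well with `date(*parse_date(...))`.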

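STEP2. Since more recent records should carry more weight, take April 15 as day zero and re-express the time column as a day offset while splitting the data into two sets. The validation set only needs purchase records, so filter out everything else.

(A Python approach to processing massive data; for details see http://blog.csdn.net/quicktest/article/details/7453189#comments)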

from datetime import date

def split_file(raw_file, seperate_day, begin_date):
    train = open("train.csv", "w")
    validation = open("validation.csv", "w")
    next(raw_file)  # skip the header row
    for line in raw_file:  # stream line by line instead of loading the whole file
        entry = line.strip().split(",")
        entry_date = date(*parse_date(entry[3]))
        date_delta = (entry_date - begin_date).days
        if date_delta < seperate_day:
            # training row: user, brand, action type, days since begin_date
            train.write(",".join(entry[:3]) + "," + str(date_delta) + "\n")
        elif int(entry[2]) == 1:
            # validation keeps only purchases (type 1): user, brand
            validation.write(",".join(entry[:2]) + "\n")
            print(",".join(entry[:2]))
    train.close()
    validation.close()
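
To make the "April 15 as day zero" idea concrete, the following small sketch computes day offsets and decides which set a date falls into. It uses the same two dates that STEP4 passes in; the helper names `day_offset` and `which_set` are hypothetical, and the sketch ignores the purchase-only filter applied to the validation set.

```python
from datetime import date

BEGIN = date(2013, 4, 15)  # day zero
SPLIT = date(2013, 7, 15)  # first day that belongs to the validation window

def day_offset(d, begin=BEGIN):
    """Number of days between d and the zero point."""
    return (d - begin).days

def which_set(d, begin=BEGIN, split=SPLIT):
    """Return 'train' for dates before the split day, else 'validation'."""
    if day_offset(d, begin) < day_offset(split, begin):
        return "train"
    return "validation"

print(day_offset(date(2013, 6, 25)))  # 71
print(day_offset(SPLIT))              # 91
print(which_set(date(2013, 7, 14)))   # train
print(which_set(date(2013, 7, 15)))   # validation
```

With these dates, offsets 0 through 90 go to training and anything from offset 91 onward is a validation candidate.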

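STEP3. Once the validation set is generated, merge its records per user so that the validation results are also in the format required for submission.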

def generate_result(validation):
    entrys = validation.readlines()
    entrys.sort(key=lambda x: x.split(",")[0])
    result = open("result.txt", "w")
    cur_id = None
    cur_result = []
    for entry in entrys:
        uid, tid = entry.strip().split(",")
        if uid != cur_id:
            # a new user starts: write out the previous user's brands first
            if cur_id is not None:
                result.write(cur_id + "\t" + ",".join(sorted(set(cur_result))) + "\n")
            cur_id = uid
            cur_result = []
        cur_result.append(tid)
    # write the last user, who is never followed by a "new user" line
    if cur_id is not None:
        result.write(cur_id + "\t" + ",".join(sorted(set(cur_result))) + "\n")
    result.close()
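
The same per-user merge can also be sketched with `itertools.groupby`, which avoids tracking the current user by hand. This is an alternative illustration rather than the approach used above; the function name `group_by_user` and the sample ids are made up.

```python
from itertools import groupby

def group_by_user(pairs):
    """Collapse (user_id, brand_id) pairs into 'user<TAB>brand1,brand2' lines."""
    lines = []
    # groupby only merges adjacent items, so sort by user first.
    for uid, group in groupby(sorted(pairs), key=lambda p: p[0]):
        brands = sorted({bid for _, bid in group})  # drop duplicate brands
        lines.append(uid + "\t" + ",".join(brands))
    return lines

sample = [("u2", "b9"), ("u1", "b3"), ("u1", "b1"), ("u1", "b3")]
for line in group_by_user(sample):
    print(line)
```

Because every group is emitted inside the loop, there is no separate "last user" case to remember.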

STEP4. Put these functions together to produce the initial training set, validation set, and final result.

from datetime import date

SEPERATEDAY = date(2013, 7, 15)
BEGINDAY = date(2013, 4, 15)
raw_file = open("t_alibaba_data.csv")
split_file(raw_file, (SEPERATEDAY - BEGINDAY).days, BEGINDAY)
raw_file.close()
validation = open("validation.csv")
generate_result(validation)
validation.close()

STEP5. You also need to run the evaluation on the validation set locally, comparing the algorithm's predictions with the validation results.

from collections import defaultdict
predict_num = 0
hit_num = 0
brand = 0
result = defaultdict(set)
f = open("result.txt")  # the file written by generate_result in STEP3
for line in f:
    uid, bid = line.strip().split("\t")
    result[uid] = set(bid.split(","))
    brand += len(result[uid])
f.close()
f = open("predict.txt")
for line in f:
    # strip the newline so the last brand id on each line can still match
    uid, bid = line.strip().split("\t")
    bid = bid.split(",")
    predict_num += len(bid)
    if uid not in result:
        continue
    for i in bid:
        if i in result[uid]:
            hit_num += 1
f.close()
print("predict num is %d" % predict_num)
print("hit num is %d" % hit_num)
print("total brand is %d" % brand)
precision = float(hit_num) / predict_num
recall = float(hit_num) / brand
print("precision is %f" % precision)
print("recall is %f" % recall)
# with no hits, precision + recall is 0, so report F1 as 0 instead of dividing
f1 = 2 * precision * recall / (precision + recall) if hit_num else 0.0
print("F1 is %f" % f1)
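
For a quick sanity check without any files, the same metrics can be wrapped in a small function and run on toy data. This is a sketch for Python 3; the name `evaluate` and the sample dictionaries are assumptions, and it treats each user's predictions as a set, so a brand predicted twice counts once.

```python
def evaluate(predicted, actual):
    """Both arguments map user id -> set of brand ids.

    Returns (precision, recall, f1).
    """
    hits = sum(len(brands & actual.get(uid, set()))
               for uid, brands in predicted.items())
    n_pred = sum(len(b) for b in predicted.values())
    n_actual = sum(len(b) for b in actual.values())
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_actual if n_actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

predicted = {"u1": {"b1", "b2"}, "u2": {"b5"}, "u3": {"b7"}}
actual = {"u1": {"b1", "b3"}, "u2": {"b5"}}
print(evaluate(predicted, actual))
```

In this toy case there are 2 hits out of 4 predictions and 3 actual purchases, so precision is 1/2, recall is 2/3, and F1 is 4/7.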
