"Alibaba" Cup BUPT Data Mining Competition (Part 1)

zhihua_bupt

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/geekmanong/article/details/50803154

"Alibaba" Cup BUPT Data Mining Competition

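I. Competition Overview

On Tmall, tens of millions of users discover products they like through brands every day; brands are the most important link between consumers and products.
The task in this competition is to build each user's brand preferences from roughly the last 200 days of their Tmall behavior logs, and to classify users by gender and age.
Users are divided into 12 classes by gender and age; participants are advised to take class imbalance into account (the skew is not severe).
The competition provides about 700 MB of data in total, split 6:2:2 among the training set, test set 1, and test set 2.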
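
One common way to act on the class-imbalance advice above is to weight each class inversely to its frequency. The snippet below is a minimal sketch of that idea, not something prescribed by the competition: the function name `class_weights` and the toy labels are made up for illustration, and the formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for its "balanced" class weights.

```python
from collections import Counter


def class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy labels, invented for illustration (not competition data).
toy_labels = [1, 1, 1, 1, 2, 2, 3, 3]
weights = class_weights(toy_labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes below 1, so a classifier that accepts per-class or per-sample weights pays more attention to the rare ones. With only a mild skew, the weights stay close to 1 and the effect is small.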

II. Data Description

Competition data types:

log_train file:

info_train file:

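III. Evaluation Metric

The leaderboard is based on the two most commonly used metrics, precision and recall. The formulas are as follows:

The weighted F1 score computed from the formulas above is the final ranking metric.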
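
The competition's exact formulas are not reproduced in this text, so the following is only a sketch of the most common reading of "weighted F1": compute precision, recall, and F1 for each class, then average the per-class F1 scores weighted by the number of true samples in each class. That weighting scheme and the function name `weighted_f1` are assumptions, not confirmed details of the official scoring.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed weighting scheme)."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for c, n_true in support.items():
        precision = correct[c] / predicted[c] if predicted[c] else 0.0
        recall = correct[c] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy example, invented for illustration.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

In the toy example the per-class F1 scores are 1, 2/3, and 4/5, and every class has two true samples, so the weighted result is their plain mean, 74/90.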

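IV. Data Analysis

1. Total data volume

2. Class distribution

Analysis: From the figure above, we can guess that classes 1-6 are female users and classes 7-12 are male users.

Implementation code: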

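#!/usr/bin/env python
# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
from matplotlib import font_manager

def goforit():
    # Count the users in each class; the file holds one integer label per line.
    labdic={}
    with open("info_train.txt",'r') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            lab=int(line)
            labdic[lab]=labdic.get(lab,0)+1
    # CJK-capable font for the Chinese axis labels (raw string keeps the backslashes literal).
    zhfont = font_manager.FontProperties(fname=r'C:\Windows\Fonts\ukai.ttc')
    x=list(range(1,len(labdic)+1))
    y=[labdic[i] for i in x]  # assumes the labels are exactly 1..N
    fig=plt.figure()
    ax=fig.add_subplot(111)
    ax.scatter(x,y)
    ax.axis([0,13,0,30000])  # axis ranges; x ends at 13 so the bar for class 12 is not clipped
    plt.plot(x,y,'-r')
    plt.bar(x,y,alpha=.5,color='g')
    plt.xlabel(u'用户类别', fontproperties=zhfont,fontsize=26)  # "user class"
    plt.ylabel(u'用户数量', fontproperties=zhfont,fontsize=26)  # "number of users"
    plt.show()

if __name__ == '__main__':
    goforit()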

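3. Analysis of user purchase dates

Analysis: The figure above shows that day 185 is the purchase peak, which is presumably the "Double 11" (November 11) shopping festival. Days 1-10 are the lowest point, which we might guess to be the Spring Festival, although if the day indices are consecutive calendar days, day 1 would fall in mid-May, about six months before Double 11, so this second guess is doubtful.

Implementation code: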

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
from matplotlib import font_manager

def timePurchase():
    # Count the log records for each day; the file holds one integer day index per line.
    indexDict={}
    with open("time_stamp.txt",'r') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            index=int(line)
            indexDict[index]=indexDict.get(index,0)+1
    # CJK-capable font for the Chinese axis labels (raw string keeps the backslashes literal).
    zhfont = font_manager.FontProperties(fname=r'C:\Windows\Fonts\ukai.ttc')
    x=list(range(1,len(indexDict)+1))
    y=[indexDict[i] for i in x]  # assumes every day 1..N appears at least once
    fig=plt.figure()
    ax=fig.add_subplot(111)
    ax.scatter(x,y)
    ax.axis([0,186,5,50000])  # axis ranges; x runs from 0 to 186 so the first and last bars are not clipped
    plt.plot(x,y,'-r')
    plt.bar(x,y,alpha=.5,color='g')
    plt.xlabel(u'消费日期', fontproperties=zhfont,fontsize=26)  # "purchase date"
    plt.ylabel(u'用户数量', fontproperties=zhfont,fontsize=26)  # "number of users"
    plt.show()

if __name__ == '__main__':
    timePurchase()
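
The peak and lowest days mentioned in the analysis can also be read directly from the counts instead of from the plot. The snippet below is a small sketch of that check: the function name `busiest_and_quietest` and the toy day indices are invented for illustration, and it assumes the same input as the listing above, one integer day index per log record.

```python
from collections import Counter


def busiest_and_quietest(day_indices):
    """Return (peak_day, peak_count, low_day, low_count) from per-record day indices."""
    counts = Counter(day_indices)
    peak_day = max(counts, key=counts.get)
    low_day = min(counts, key=counts.get)
    return peak_day, counts[peak_day], low_day, counts[low_day]


# Toy day indices, invented for illustration (not competition data).
toy_days = [1, 2, 2, 3, 3, 3, 185, 185, 185, 185, 185]
result = busiest_and_quietest(toy_days)
print(result)
```

Days with no records at all never appear in the counter, so the "lowest" day here is the quietest day that has at least one record.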

4. Statistical features

To be continued...