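Data Analysis
The main value of data analysis is getting familiar with the whole dataset: which data each file contains, what each field actually means, and how the features correlate with one another. In a recommendation setting this mostly means analyzing users' basic attributes, articles' basic attributes, and the distributions of user-article interactions. All of this informs the choice of recall strategies and the feature engineering that follows.
Tip: when feature engineering and hyperparameter tuning no longer improve the score, come back and look at the data again from a new angle; it may suggest a way to push the score higher.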
Imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('font', family='SimHei', size=13)
import os,gc,re,warnings,sys
warnings.filterwarnings("ignore")
Loading the data
path = './data_raw/'
#####train
trn_click = pd.read_csv(path+'train_click_log.csv')
#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])
item_df = pd.read_csv(path+'articles.csv')
item_df = item_df.rename(columns={
'article_id': 'click_article_id'}) # rename so the column matches the click log in later merges
item_emb_df = pd.read_csv(path+'articles_emb.csv')
#####test
tst_click = pd.read_csv(path+'testA_click_log.csv')
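Before going further, it can help to confirm that each file loaded with the columns the later steps rely on. The following is a minimal sketch that uses a small in-memory stand-in for a click log rather than the real CSV files; the helper name `check_columns` is hypothetical and the sample values are illustrative only.

```python
import io
import pandas as pd

def check_columns(df, required, name):
    """Raise a clear error if `df` lacks any column listed in `required`."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"{name} is missing columns: {missing}")
    return df

# Tiny stand-in for a click log file (illustrative values, not real data).
sample_csv = io.StringIO(
    "user_id,click_article_id,click_timestamp\n"
    "1,10,1507029570190\n"
    "1,11,1507029571478\n"
)
sample_click = pd.read_csv(sample_csv)

# Fails early with a readable message if a column was renamed or dropped.
check_columns(
    sample_click,
    ["user_id", "click_article_id", "click_timestamp"],
    "sample_click",
)
print(sample_click.shape)
```

The same helper could be applied to `trn_click`, `tst_click`, and `item_df` right after loading, assuming the column names shown in the tables of this section.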
Data preprocessing
Computing each user's click rank and click count
# Rank each user's clicks by timestamp (rank 1 = most recent click)
trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
# Count how many clicks each user made and store the result in a new column, click_cnts
trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')
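To make the two derived columns concrete, here is a small sketch on a toy frame (the data is made up). It also shows one possible way to handle tied timestamps: with the default `method='average'`, ties receive a fractional rank that `astype(int)` then truncates, whereas `method='first'` gives every click a distinct integer rank. Using `method='first'` is an assumption about what is wanted here, not something the original code does.

```python
import pandas as pd

# Toy click log: user 1 has three clicks, user 2 has two clicks at the same time.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_timestamp": [100, 300, 200, 50, 50],
})

by_user = toy.groupby("user_id")["click_timestamp"]

# Rank 1 is the most recent click; method="first" keeps tied ranks distinct.
toy["rank"] = by_user.rank(ascending=False, method="first").astype(int)

# Number of clicks per user, broadcast back onto every row of that user.
toy["click_cnts"] = by_user.transform("count")

print(toy)
```

For user 1 the most recent click (timestamp 300) gets rank 1 and the oldest (timestamp 100) gets rank 3, and every row of that user carries `click_cnts` equal to 3.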
Data overview
User click log: training set
trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
| | user_id | click_article_id | click_timestamp | click_environment | click_deviceGroup | click_os | click_country | click_region | click_referrer_type | rank | click_cnts | category_id | created_at_ts | words_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 199999 | 160417 | 1507029570190 | 4 | 1 | 17 | 1 | 13 | 1 | 11 | 11 | 281 | 1506942089000 | 173 |
1 | 199999 | 5408 | 1507029571478 | 4 | 1 | 17 | 1 | 13 | 1 | 10 | 11 | 4 | 1506994257000 | 118 |
2 | 199999 | 50823 | 1507029601478 | 4 | 1 | 17 | 1 | 13 | 1 | 9 | 11 | 99 | 1507013614000 | 213 |
3 | 199998 | 157770 | 1507029532200 | 4 | 1 | 17 | 1 | 25 | 5 | 40 | 40 | 281 | 1506983935000 | 201 |
4 | 199998 | 96613 | 1507029671831 | 4 | 1 | 17 | 1 | 25 | 5 | 39 | 40 | 209 | 1506938444000 | 185 |
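A left merge silently keeps clicks whose article has no row in the article table and fills their article columns with NaN. The sketch below, on made-up toy frames, shows one way to check for that with the `indicator` and `validate` options of `DataFrame.merge`; whether any such unmatched clicks exist in the real data is not something this section states.

```python
import pandas as pd

# Toy stand-ins for the click log and the article table (illustrative values).
clicks = pd.DataFrame({
    "user_id": [1, 1, 2],
    "click_article_id": [10, 11, 99],   # article 99 is absent from `articles`
})
articles = pd.DataFrame({
    "click_article_id": [10, 11],
    "category_id": [281, 4],
})

merged = clicks.merge(
    articles,
    how="left",
    on="click_article_id",
    validate="many_to_one",  # raises if an article id is duplicated in `articles`
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# Rows found only in the click log have no article metadata.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

The non-null counts reported by `info()` on the merged frame give the same information in aggregate: a lower count on the article columns would indicate unmatched clicks.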
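Meaning of each field in train_click_log.csv
- user_id: unique identifier of the user
- click_article_id: unique identifier of the article the user clicked
- click_timestamp: timestamp of the click
- click_environment: environment in which the user clicked the article
- click_deviceGroup: device group used for the click
- click_os: operating system used for the click
- click_country: country the user was in at click time
- click_region: region the user was in at click time
- click_referrer_type: source of the article at click time (referrer type)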
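The timestamp values shown in the table above (for example 1507029570190) have 13 digits, which suggests Unix epoch milliseconds. Under that assumption, a short sketch of converting them to readable datetimes looks like this:

```python
import pandas as pd

# Two click_timestamp values copied from the head() output above.
ts_ms = pd.Series([1507029570190, 1507029571478], name="click_timestamp")

# Assumption: the values are Unix epoch milliseconds (UTC).
click_dt = pd.to_datetime(ts_ms, unit="ms")

print(click_dt)
print(click_dt.dt.hour.tolist())  # hour of day, a possible time feature
```

Readable datetimes make it easier to judge the time span the log covers and to derive features such as hour of day or day of week.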
# Summary of the user click log
trn_click.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1112623 entries, 0 to 1112622
Data columns (total 14 columns):
user_id 1112623 non-null int64
click_article_id 1112623 non-null int64
click_timestamp 1112623 non-null int64
click_environment 1112623 non-null int64
click_deviceGroup 1112623 non-null int64
click_os 1112623 non-null int64
click_country 1112623 non-null int64
click_region 1112623 non-null int64
click_referrer_type 1112623 non-null int64
rank 1112623 non-null int32
click_cnts 1112623 non-null int64
category_id 1112623 non-null int64
created_at_ts 1112623 non-null int64
words_count 1112623 non-null int64
dtypes: int32(1), int64(13)
memory usage: 123.1 MB
trn_click.describe()
| | user_id | click_article_id | click_timestamp | click_environment | click_deviceGroup | click_os | click_country | click_region | click_referrer_type | rank | click_cnts | category_id | created_at_ts | words_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 | 1.112623e+06 |
mean | 1.221198e+05 | 1.951541e+05 | 1.507588e+12 | 3.947786e+00 | 1.815981e+00 | 1.301976e+01 | 1.310776e+00 | 1.813587e+01 | 1.910063e+00 | 7.118518e+00 | 1.323704e+01 | 3.056176e+02 | 1.506598e+12 | 2.011981e+02 |
std | 5.540349e+04 | 9.292286e+04 | 3.363466e+08 | 3.276715e-01 | 1.035170e+00 | 6.967844e+00 | 1.618264e+00 | 7.105832e+00 | 1.220012e+00 | 1.016095e+01 | 1.631503e+01 | 1.155791e+02 | 8.343066e+09 | 5.223881e+01 |
min | 0.000000e+00 | 3.000000e+00 | 1.507030e+12 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.000000e+00 | 1.166573e+12 | 0.000000e+00 |
25% | 7.934700e+04 | 1.239090e+05 | 1.507297e+12 | 4.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.000000e+00 | 1.300000e+01 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 2.500000e+02 | 1.507220e+12 | 1.700000e+02 |
50% | 1.309670e+05 | 2.038900e+05 | 1.507596e+12 | 4.000000e+00 | 1.000000e+00 | 1.700000e+01 | 1.000000e+00 | 2.100000e+01 | 2.000000e+00 | 4.000000e+00 | 8.000000e+00 | 3.280000e+02 | 1.507553e+12 | 1.970000e+02 |
75% | 1.704010e+05 | 2.777120e+05 | 1.507841e+12 | 4.000000e+00 | 3.000000e+00 | 1.700000e+01 | 1.000000e+00 | 2.500000e+01 | 2.000000e+00 | 8.000000e+00 | 1.600000e+01 | 4.100000e+02 | 1.507756e+12 | 2.280000e+02 |
max | 1.999990e+05 | 3.640460e+05 | 1.510603e+12 | 4.000000e+00 | 5.000000e+00 | 2.000000e+01 | 1.100000e+01 | 2.800000e+01 | 7.000000e+00 | 2.410000e+02 | 2.410000e+02 | 4.600000e+02 | 1.510666e+12 | 6.690000e+03 |
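`describe()` reports means and standard deviations for every numeric column, but several of these columns (for example `click_environment`, `click_deviceGroup`, `click_os`) appear to hold integer codes, for which a mean is hard to interpret. Assuming they are categorical codes, counting distinct values and their shares is usually more informative. A sketch on made-up data:

```python
import pandas as pd

# Toy frame with the same column names as the click log (illustrative values).
sample = pd.DataFrame({
    "click_environment": [4, 4, 4, 2],
    "click_deviceGroup": [1, 1, 3, 1],
    "click_os": [17, 17, 2, 17],
})

# Assumption: these columns are categorical codes, not quantities.
coded_cols = ["click_environment", "click_deviceGroup", "click_os"]

# Number of distinct codes per column.
print(sample[coded_cols].nunique())

# Share of rows taken by each code, per column.
for col in coded_cols:
    shares = sample[col].value_counts(normalize=True)
    print(col, shares.round(2).to_dict())
```

On the real data the same loop could be run over `trn_click[coded_cols]` to see whether a single code dominates a column.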
# The training set contains 200,000 users
trn_click.user_id.nunique()
200000
trn_click.groupby('user_id')['click_article_id'].count().min() # every user in the training set clicked at least two articles
2
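The minimum is only one point of the per-user click distribution. As a hedged sketch on toy data, the same `groupby` result can be summarized more fully to see how many users sit at each click count:

```python
import pandas as pd

# Toy click log (illustrative values): users 1 and 3 click twice, user 2 three times.
sample = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3],
    "click_article_id": [10, 11, 10, 12, 13, 11, 14],
})

# Clicks per user, as in the cell above.
clicks_per_user = sample.groupby("user_id")["click_article_id"].count()

# Summary statistics of the per-user click counts.
print(clicks_per_user.describe())

# How many users have exactly k clicks, ordered by k.
users_by_click_count = clicks_per_user.value_counts().sort_index()
print(users_by_click_count)
```

A distribution concentrated at small counts would mean most users have short histories, which matters when choosing recall strategies that depend on a user's past clicks.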
Plotting histograms for a rough look at the basic attribute distributions
plt.figure()
plt