Recommend News 02: Data Analysis

Latest recommended article published on 2024-08-11 19:46:27

Gocara

Tags: recommender systems

Article link: https://blog.csdn.net/qq_34903176/article/details/110219484

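Data analysis

The main value of data analysis is getting familiar with the dataset as a whole: which data each file contains, what each field in each file actually means, and how the features in the dataset correlate with one another. In a recommendation setting, this mostly means analyzing the basic attributes of users, the basic attributes of articles, and the distributions of user-article interactions. All of this informs the choice of recall strategy and the feature engineering later on.

Tip: when feature engineering and hyperparameter tuning no longer improve the score, come back and analyze the data again from a new angle; it may suggest new ideas for improving the score.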

Imports

%matplotlib inline
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('font', family='SimHei', size=13)

import os,gc,re,warnings,sys
warnings.filterwarnings("ignore")

Load the data

path = './data_raw/'

#####train
trn_click = pd.read_csv(path+'train_click_log.csv')
#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])
item_df = pd.read_csv(path+'articles.csv')
item_df = item_df.rename(columns={
   'article_id': 'click_article_id'})  # rename so the key matches the click log in later merges
item_emb_df = pd.read_csv(path+'articles_emb.csv')

#####test
tst_click = pd.read_csv(path+'testA_click_log.csv')

Data preprocessing

Compute each user's click rank and click count

# Rank each user's clicks by timestamp (rank 1 = most recent click)
trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)

# Count how many articles each user clicked and store it in a new column, click_cnts
trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')
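
To make the two derived columns concrete, here is a minimal sketch on a hand-made log (the `toy` frame and its values are made up for illustration). It also passes `method="first"` to `rank`, which is an addition and not part of the code above: with the default `method="average"`, two clicks by the same user with an identical timestamp would receive a fractional rank such as 1.5, which `astype(int)` would then truncate.

```python
import pandas as pd

# Hypothetical toy click log: user 1 has three clicks, user 2 has two.
toy = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2],
    "click_timestamp": [100, 300, 200, 50, 60],
})

# Rank 1 is the most recent click of each user; method="first" breaks ties
# by row order so that every rank is a whole number.
toy["rank"] = (
    toy.groupby("user_id")["click_timestamp"]
       .rank(ascending=False, method="first")
       .astype(int)
)

# Number of clicks per user, broadcast back onto every row of that user.
toy["click_cnts"] = toy.groupby("user_id")["click_timestamp"].transform("count")

print(toy)
```

For user 1 the click at timestamp 300 gets rank 1 and the click at timestamp 100 gets rank 3, while `click_cnts` is 3 on all three rows.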

Data overview

User click log: training set

trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
0	199999	160417	1507029570190	4	1	17	1	13	1	11	11	281	1506942089000	173
1	199999	5408	1507029571478	4	1	17	1	13	1	10	11	4	1506994257000	118
2	199999	50823	1507029601478	4	1	17	1	13	1	9	11	99	1507013614000	213
3	199998	157770	1507029532200	4	1	17	1	25	5	40	40	281	1506983935000	201
4	199998	96613	1507029671831	4	1	17	1	25	5	39	40	209	1506938444000	185
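
A left merge keeps every click even when its article is missing from the article table, in which case the article columns are filled with NaN. The sketch below shows one way to check for such rows, using the `validate` and `indicator` options of `DataFrame.merge`. The `clicks` and `articles` frames are hypothetical toy data, not the competition files.

```python
import pandas as pd

# Hypothetical toy data: article 30 is missing from the article table.
clicks = pd.DataFrame({"user_id": [1, 1, 2], "click_article_id": [10, 20, 30]})
articles = pd.DataFrame({"click_article_id": [10, 20], "category_id": [281, 4]})

merged = clicks.merge(
    articles,
    how="left",
    on="click_article_id",
    validate="many_to_one",  # raise if an article id is duplicated in `articles`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Clicks whose article has no metadata are marked "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

On the real data, an empty `unmatched` frame would mean that every clicked article has metadata.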

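Meaning of each field in train_click_log.csv

user_id: unique identifier of the user
click_article_id: unique identifier of the clicked article
click_timestamp: timestamp of the click
click_environment: environment in which the click happened
click_deviceGroup: device group used for the click
click_os: operating system used for the click
click_country: country the user was in when clicking
click_region: region the user was in when clicking
click_referrer_type: referrer type, i.e. the source through which the user reached the article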
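
The 13-digit `click_timestamp` values look like Unix time in milliseconds. The source does not state the unit, so treat that as an assumption. Under that assumption, a small sketch of converting them into readable datetimes:

```python
import pandas as pd

# Two click_timestamp values copied from the head() output above.
ts = pd.Series([1507029570190, 1507029571478], name="click_timestamp")

# Assumption: the values are milliseconds since 1970-01-01 UTC.
click_dt = pd.to_datetime(ts, unit="ms")
print(click_dt)

# Gap between consecutive clicks, in seconds.
gap_seconds = click_dt.diff().dt.total_seconds()
print(gap_seconds)
```

Under this reading, both clicks fall on 2017-10-03, about 1.3 seconds apart.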

# Summary of the user click log
trn_click.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1112623 entries, 0 to 1112622
Data columns (total 14 columns):
user_id                1112623 non-null int64
click_article_id       1112623 non-null int64
click_timestamp        1112623 non-null int64
click_environment      1112623 non-null int64
click_deviceGroup      1112623 non-null int64
click_os               1112623 non-null int64
click_country          1112623 non-null int64
click_region           1112623 non-null int64
click_referrer_type    1112623 non-null int64
rank                   1112623 non-null int32
click_cnts             1112623 non-null int64
category_id            1112623 non-null int64
created_at_ts          1112623 non-null int64
words_count            1112623 non-null int64
dtypes: int32(1), int64(13)
memory usage: 123.1 MB

trn_click.describe()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
count	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06
mean	1.221198e+05	1.951541e+05	1.507588e+12	3.947786e+00	1.815981e+00	1.301976e+01	1.310776e+00	1.813587e+01	1.910063e+00	7.118518e+00	1.323704e+01	3.056176e+02	1.506598e+12	2.011981e+02
std	5.540349e+04	9.292286e+04	3.363466e+08	3.276715e-01	1.035170e+00	6.967844e+00	1.618264e+00	7.105832e+00	1.220012e+00	1.016095e+01	1.631503e+01	1.155791e+02	8.343066e+09	5.223881e+01
min	0.000000e+00	3.000000e+00	1.507030e+12	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.166573e+12	0.000000e+00
25%	7.934700e+04	1.239090e+05	1.507297e+12	4.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.300000e+01	1.000000e+00	2.000000e+00	4.000000e+00	2.500000e+02	1.507220e+12	1.700000e+02
50%	1.309670e+05	2.038900e+05	1.507596e+12	4.000000e+00	1.000000e+00	1.700000e+01	1.000000e+00	2.100000e+01	2.000000e+00	4.000000e+00	8.000000e+00	3.280000e+02	1.507553e+12	1.970000e+02
75%	1.704010e+05	2.777120e+05	1.507841e+12	4.000000e+00	3.000000e+00	1.700000e+01	1.000000e+00	2.500000e+01	2.000000e+00	8.000000e+00	1.600000e+01	4.100000e+02	1.507756e+12	2.280000e+02
max	1.999990e+05	3.640460e+05	1.510603e+12	4.000000e+00	5.000000e+00	2.000000e+01	1.100000e+01	2.800000e+01	7.000000e+00	2.410000e+02	2.410000e+02	4.600000e+02	1.510666e+12	6.690000e+03

# The training set contains 200,000 users
trn_click.user_id.nunique()

trn_click.groupby('user_id')['click_article_id'].count().min()  # every user in the training set clicked at least two articles

Plot histograms for a rough look at the distributions of the basic attributes

plt.figure()
plt
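
The plotting code is cut off in the source at this point. As a stand-in, here is a rough sketch, not the author's code, of how per-attribute histograms could be drawn. The `toy` frame, the choice of columns, and the figure layout are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for trn_click with two of its categorical columns.
toy = pd.DataFrame({
    "click_environment": [4, 4, 4, 2, 1, 4],
    "click_deviceGroup": [1, 1, 3, 3, 1, 4],
})

cols = ["click_environment", "click_deviceGroup"]
fig, axes = plt.subplots(1, len(cols), figsize=(10, 4))
for ax, col in zip(axes, cols):
    # One bar per distinct value, with bar height = number of clicks.
    counts = toy[col].value_counts().sort_index()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(col)
    ax.set_ylabel("clicks")
fig.tight_layout()
```

Counting values first and drawing bars keeps integer-coded categories such as `click_os` from being binned together, which can happen with a default numeric histogram.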
