A Beginner's Zero-Basics Introduction to News Recommendation: Task02

蒙拉

Tags: python

Article link: https://blog.csdn.net/weixin_45925255/article/details/110244734

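Learning goals:

To improve on the Baseline results from Task01, analyzing the data carefully from a fresh angle may be a good starting point.
In Task02 I mainly got familiar with the basic shape of every downloaded dataset and with how the features relate across datasets, so that the attributes of users, the attributes of articles, and the relationships between them are clear.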

Learning content:

Review and summarize the data-processing approach of Task02

What I learned:

0. Packages used

# Import the required packages
%matplotlib inline
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns  # matplotlib covers the basic charts; seaborn makes these plots richer
plt.rc('font', family='SimHei', size=13)  # SimHei is a Chinese font, so Chinese labels render

import os, gc, re, warnings, sys
warnings.filterwarnings("ignore")

1. Use info() to inspect the train_click_log.csv, articles.csv, articles_emb.csv, and testA_click_log.csv tables
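The info() calls that follow assume the four tables are already loaded into DataFrames, which the post does not show. A minimal sketch of that step could look like this. The helper name load_tables, the directory argument, and the assumption that the raw article key is called article_id (renamed here to click_article_id) are mine, not taken from the post.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir):
    """Load the four tables from data_dir (hypothetical helper and location)."""
    data_dir = Path(data_dir)
    trn_click = pd.read_csv(data_dir / "train_click_log.csv")
    tst_click = pd.read_csv(data_dir / "testA_click_log.csv")
    item_df = pd.read_csv(data_dir / "articles.csv")
    item_emb_df = pd.read_csv(data_dir / "articles_emb.csv")
    # Assumption: the raw article key is named article_id. Rename it so it
    # matches the click_article_id column used in the click logs.
    item_df = item_df.rename(columns={"article_id": "click_article_id"})
    return trn_click, tst_click, item_df, item_emb_df


# Usage (the directory name is a placeholder):
# trn_click, tst_click, item_df, item_emb_df = load_tables("./data_raw")
```

If the article key already has the right name, rename() simply leaves the columns unchanged.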

# articles.csv: news article information table
item_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364047 entries, 0 to 364046
Data columns (total 4 columns):
click_article_id    364047 non-null int64
category_id         364047 non-null int64
created_at_ts       364047 non-null int64
words_count         364047 non-null int64
dtypes: int64(4)
memory usage: 11.1 MB

# articles_emb.csv: embedding vectors of the news articles
item_emb_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364047 entries, 0 to 364046
Columns: 250 entries, emb_0 to emb_249
dtypes: float64(250)
memory usage: 694.4 MB

# testA_click_log.csv: test set user click log
tst_click.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 518010 entries, 0 to 518009
Data columns (total 14 columns):
user_id                518010 non-null int64
click_article_id       518010 non-null int64
click_timestamp        518010 non-null int64
click_environment      518010 non-null int64
click_deviceGroup      518010 non-null int64
click_os               518010 non-null int64
click_country          518010 non-null int64
click_region           518010 non-null int64
click_referrer_type    518010 non-null int64
rank                   518010 non-null int32
click_cnts             518010 non-null int64
category_id            518010 non-null int64
created_at_ts          518010 non-null int64
words_count            518010 non-null int64
dtypes: int32(1), int64(13)
memory usage: 57.3 MB

# train_click_log.csv: training set user click log
trn_click.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1112623 entries, 0 to 1112622
Data columns (total 11 columns):
user_id                1112623 non-null int64
click_article_id       1112623 non-null int64
click_timestamp        1112623 non-null int64
click_environment      1112623 non-null int64
click_deviceGroup      1112623 non-null int64
click_os               1112623 non-null int64
click_country          1112623 non-null int64
click_region           1112623 non-null int64
click_referrer_type    1112623 non-null int64
rank                   1112623 non-null int32
click_cnts             1112623 non-null int64
dtypes: int32(1), int64(10)
memory usage: 89.1 MB

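From the output above, none of the four tables contains null values. There are 364,047 distinct news articles, and articles.csv and articles_emb.csv have the same number of rows, so each article has a corresponding embedding vector.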

Data overview

User click log: training set

Rank each user's clicks

# Rank each user's clicks by timestamp (rank 1 is the most recent click)
trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
print(tst_click['rank'])

Output:
0         19
1         18
2          5
3          4
4          7
          ..
518005     1
518006     4
518007     3
518008     2
518009     1
Name: rank, Length: 518010, dtype: int32

# Count how many articles each user clicked and store it in a new column, click_cnts
trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts']
Output:
0         19
1         19
2          5
3          5
4          7
          ..
518005    14
518006    14
518007    14
518008    14
518009    14
Name: click_cnts, Length: 518010, dtype: int64

The new column that holds each user's click count is named click_cnts.
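To see what the rank and click_cnts columns contain, here is a small sketch on made-up data (the toy frame and the extra rank_first column are my own illustration). It also shows one thing worth knowing: with the default ranking method, two clicks with the same timestamp share an averaged rank, and astype(int) then truncates it, so ranks can repeat. Passing method="first" is one possible way to keep ranks unique per user.

```python
import pandas as pd

# Toy click log: user 2 has two clicks with the same timestamp.
clicks = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2],
    "click_timestamp": [10, 30, 20, 5, 5],
})

# Same call as in the post: rank 1 is the most recent click of each user.
clicks["rank"] = (
    clicks.groupby("user_id")["click_timestamp"].rank(ascending=False).astype(int)
)

# Variant: method="first" breaks ties by row order, so ranks stay unique per user.
clicks["rank_first"] = (
    clicks.groupby("user_id")["click_timestamp"]
    .rank(ascending=False, method="first")
    .astype(int)
)

# transform("count") broadcasts each user's total click count to all of that user's rows.
clicks["click_cnts"] = clicks.groupby("user_id")["click_timestamp"].transform("count")

print(clicks)
```

For user 1 the rank column is 3, 1, 2, and every row of that user carries click_cnts equal to 3.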
2. Number of users

# The training set contains 200,000 users
trn_click.user_id.nunique()
Output:
200000

unique() returns all unique values of a column as an array (numpy.ndarray).
nunique() returns the number of unique elements in the object, that is, the count of unique values.
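A tiny sketch of the difference, on a made-up Series (the values are only for illustration):

```python
import pandas as pd

user_ids = pd.Series([3, 1, 3, 2, 1], name="user_id")

unique_values = user_ids.unique()  # numpy array of distinct values, in order of first appearance
n_unique = user_ids.nunique()      # number of distinct values (NaN is excluded by default)

print(unique_values, n_unique)
```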

3. Minimum and maximum number of articles clicked per user

trn_click.groupby('user_id')['click_article_id'].count().min()  # every user in the training set clicked at least two articles
Output: 2

trn_click.groupby('user_id')['click_article_id'].count().max()
Output: 241
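The two calls above each recompute the per-user counts. As a small sketch on made-up data, the counts can be computed once and summarized in a single step with agg; the toy frame and the added median are my own illustration.

```python
import pandas as pd

clicks = pd.DataFrame({
    "user_id":          [1, 1, 2, 2, 2, 3, 3],
    "click_article_id": [10, 11, 10, 12, 13, 11, 14],
})

# Number of clicks per user, computed once.
clicks_per_user = clicks.groupby("user_id")["click_article_id"].count()

# Several summary statistics in one pass.
summary = clicks_per_user.agg(["min", "max", "median"])
print(summary)
```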

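Test set user click log

1. Left-join the article information table onto the click log on click_article_id. The code below does this for trn_click; the tst_click.info() output shown earlier already includes the article columns (category_id, created_at_ts, words_count), so the same merge was evidently applied to tst_click as well.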

trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
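A left join like this should keep exactly one row per click. A hedged sketch on toy data of how that could be checked: validate="many_to_one" makes pandas raise an error if an article id is duplicated in the article table, and the NaN count shows clicks whose article has no metadata. The toy frames are made up.

```python
import pandas as pd

clicks = pd.DataFrame({"user_id": [1, 1, 2], "click_article_id": [10, 11, 10]})
articles = pd.DataFrame({
    "click_article_id": [10, 11, 12],
    "category_id":      [1, 2, 2],
    "words_count":      [200, 350, 180],
})

# Left join: keep every click and attach the article attributes to it.
merged = clicks.merge(
    articles, how="left", on="click_article_id", validate="many_to_one"
)

# Sanity checks: row count unchanged, and no click lost its article metadata.
n_missing = int(merged["category_id"].isna().sum())
print(len(merged) == len(clicks), n_missing)
```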

# The test set contains 50,000 users
tst_click.user_id.nunique()
Output: 50000

Summary statistics for tst_click:
tst_click.describe()

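[Figure: output of tst_click.describe()]
This shows that the training and test users are completely disjoint.
Training user IDs run from 0 to 199999, while user IDs in test set A run from 200000 to 249999.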
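Rather than reading the ID ranges off describe(), the disjointness can also be checked directly. A minimal sketch with made-up ID lists standing in for trn_click['user_id'] and tst_click['user_id']:

```python
import pandas as pd

# Toy stand-ins for the two user_id columns.
trn_users = pd.Series([0, 1, 2, 2])
tst_users = pd.Series([3, 4, 4])

# Users that appear in both sets; an empty set means the sets are disjoint.
overlap = set(trn_users) & set(tst_users)

print(len(overlap))
print(trn_users.min(), trn_users.max(), tst_users.min(), tst_users.max())
```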

News article information table

print(item_df['category_id'].nunique())  # 461 article categories
item_df['category_id'].hist()  # histogram of category_id (pandas uses 10 bins by default)
item_df.shape  # 364047 articles
Output: (364047, 4)
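With 461 categories and only 10 default bins, the histogram lumps many category ids into each bar. If the goal is the number of articles per category, value_counts() is one possible alternative. A sketch on a made-up table (toy_items is a stand-in for item_df):

```python
import pandas as pd

toy_items = pd.DataFrame({
    "click_article_id": [0, 1, 2, 3, 4, 5],
    "category_id":      [1, 1, 1, 2, 2, 7],
})

# Articles per category, largest category first.
articles_per_category = toy_items["category_id"].value_counts()
print(articles_per_category.head(10))
```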

Data analysis

# Number of times users clicked news articles
user_click_count.loc[:,'count'].value_counts()
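The post does not show how user_click_merge and user_click_count are built. One plausible construction, which is my assumption rather than the author's confirmed code, is to stack the training and test logs and count clicks per (user, article) pair. Toy data stands in for the real logs.

```python
import pandas as pd

trn = pd.DataFrame({
    "user_id": [1, 1, 1], "click_article_id": [10, 10, 11], "click_timestamp": [1, 2, 3],
})
tst = pd.DataFrame({
    "user_id": [2, 2], "click_article_id": [10, 12], "click_timestamp": [4, 5],
})

# Assumption: user_click_merge is the training and test click logs stacked together.
user_click_merge = pd.concat([trn, tst], ignore_index=True)

# Assumption: user_click_count holds one row per (user, article) pair with its click count.
user_click_count = (
    user_click_merge.groupby(["user_id", "click_article_id"])["click_timestamp"]
    .count()
    .reset_index(name="count")
)

# How many pairs were clicked once, twice, and so on.
count_distribution = user_click_count.loc[:, "count"].value_counts()

# Share of (user, article) pairs that were clicked exactly once.
share_clicked_once = (user_click_count["count"] == 1).mean()
print(count_distribution, share_clicked_once)
```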

# Check whether users' click environments vary much: randomly sample 5 users and look at the distribution of their click environments
sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=5, replace=False)
sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]
cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']
for _, user_df in sample_users.groupby('user_id'):
    plot_envs(user_df, cols, 2, 3)
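plot_envs is called above but never defined in the post. The following is a hypothetical reconstruction, not the author's code: it draws one bar chart of value counts per column on an r by c grid, using plain matplotlib. The toy user_df is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_envs(df, cols, n_rows, n_cols):
    """Draw one bar chart of value counts per column in cols (hypothetical helper)."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 5), squeeze=False)
    for ax, col in zip(axes.ravel(), cols):
        counts = df[col].value_counts()
        # Use string labels so every distinct value gets its own bar.
        ax.bar(counts.index.astype(str), counts.values)
        ax.set_title(col)
        ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    return fig


# Toy data for a single user: the click environment barely changes.
user_df = pd.DataFrame({
    "click_environment": [4, 4, 4],
    "click_deviceGroup": [1, 1, 3],
    "click_os":          [17, 17, 17],
})
fig = plot_envs(user_df, ["click_environment", "click_deviceGroup", "click_os"], 1, 3)
```

A user whose environment is stable shows a single tall bar in most panels.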

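The plots are omitted here. We can see that 1,605,541 records (about 99.2%) show no repeated reading of an article. Because the training and test sets together contain only 250,000 users, this count must refer to (user, article) pairs rather than users. Only a very small number of users clicked the same article more than once. The click environment of the vast majority of users is fairly stable.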

Distribution of the number of news clicks per user

# Top 50 users by click count
plt.plot(user_click_item_count[:50])

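[Figure: click counts of the top 50 users]
A user's click count reflects how active the user is. The top 50 users by click count each have more than 100 clicks.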

# Users ranked 25000 to 50000 by click count
plt.plot(user_click_item_count[25000:50000])

[Figure: click counts of the users ranked 25000 to 50000]
Many users have two or fewer clicks; these users can be regarded as inactive.
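user_click_item_count is plotted above without being defined. One plausible definition, assumed here rather than taken from the post, is the per-user click counts sorted in descending order. The sketch also computes the share of inactive users directly instead of reading it from the plot; the toy frame is made up.

```python
import pandas as pd

user_click_merge = pd.DataFrame({
    "user_id":          [1, 1, 1, 2, 3, 3],
    "click_article_id": [10, 11, 12, 10, 11, 13],
})

clicks_per_user = user_click_merge.groupby("user_id")["click_article_id"].count()

# Assumption: the list plotted in the post is these counts sorted from most to least active.
user_click_item_count = sorted(clicks_per_user, reverse=True)

# Share of users with two or fewer clicks ("inactive" in the post's sense).
inactive_share = (clicks_per_user <= 2).mean()
print(user_click_item_count, inactive_share)
```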

News click count analysis

plt.plot(item_click_count[:100])

[Figure: click counts of the 100 most-clicked articles]
The 100 most-clicked articles each have more than 1,000 clicks.

plt.plot(item_click_count[:20])

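[Figure: click counts of the 20 most-clicked articles]
The 20 most-clicked articles each have more than 2,500 clicks; these can be defined as hot news.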
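item_click_count is likewise not defined in the post. A sketch under the assumption that it is the per-article click counts sorted in descending order, plus one way the hot-news definition could be turned into a list of article ids. The threshold of 2 is a placeholder for the toy data; the post uses 2,500 on the real data.

```python
import pandas as pd

user_click_merge = pd.DataFrame({
    "user_id":          [1, 2, 3, 1, 2, 3],
    "click_article_id": [10, 10, 10, 11, 11, 12],
})

clicks_per_article = user_click_merge.groupby("click_article_id")["user_id"].count()

# Assumption: the list plotted in the post is these counts sorted from most to least clicked.
item_click_count = sorted(clicks_per_article, reverse=True)

# Articles whose click count exceeds a threshold (placeholder value for the toy data).
HOT_THRESHOLD = 2
hot_articles = clicks_per_article[clicks_per_article > HOT_THRESHOLD].index.tolist()
print(item_click_count, hot_articles)
```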

Users' news category preferences

plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))

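[Figure: number of distinct news categories read per user, sorted in descending order]
From the figure above, a small fraction of users read across an extremely wide range of categories, while most users stay below 20 categories.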

plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))

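[Figure: mean word count of the articles read per user, sorted in descending order]
We can see that most users prefer news articles of 200 to 400 words.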
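The two observations above are read off plots. As a rough sketch on made-up data, the same per-user quantities can be put in one table and the shares computed directly; the column names n_categories and mean_words are my own.

```python
import pandas as pd

user_click_merge = pd.DataFrame({
    "user_id":     [1, 1, 2, 3, 3],
    "category_id": [1, 2, 1, 3, 3],
    "words_count": [200, 300, 500, 100, 500],
})

# One row per user: number of distinct categories read and mean article length.
per_user = user_click_merge.groupby("user_id").agg(
    n_categories=("category_id", "nunique"),
    mean_words=("words_count", "mean"),
)

# Share of users reading fewer than 20 categories.
share_below_20 = (per_user["n_categories"] < 20).mean()

# Share of users whose mean article length lies between 200 and 400 words (inclusive).
share_200_400 = per_user["mean_words"].between(200, 400).mean()
print(per_user, share_below_20, share_200_400)
```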

That's all for today. See you next time.
