EDA

晃晃我的半瓶水

Article link: https://blog.csdn.net/qq_41834327/article/details/110253304

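News Recommendation: Task 02, Data Analysis

The data consists of the training-set user click log, the test-set user click log, the article metadata, and the article embeddings.
The value of data analysis: getting familiar with the basic state of the whole dataset, that is, which data each file holds, what each field in each file actually means, and how the features relate to one another.
For news recommendation, the main things to analyze are the state of the users themselves, the relationship between users and articles, the relatedness between articles, and the basic attributes of the articles. Analyzing these helps with choosing recall strategies later and points feature engineering in a concrete direction.

Import libraries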

%matplotlib inline
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

plt.rc('font', family = 'SimHei', size = 13)  # set the default font and font size for plots

import os, re, sys, warnings
warnings.filterwarnings('ignore')   # suppress runtime warnings

Load the datasets

path = r'D:/新闻推荐/'

trn_click = pd.read_csv(path + 'train_click_log.csv')
item_df = pd.read_csv(path + 'articles.csv')
item_df = item_df.rename(columns = {'article_id' : 'click_article_id'})
item_emb_df = pd.read_csv(path + 'articles_emb.csv')

tst_click = pd.read_csv(path + 'testA_click_log.csv')

Inspect the loaded data and look for obvious relationships between features

trn_click.head(3)

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type
0	199999	160417	1507029570190	4	1	17	1	13	1
1	199999	5408	1507029571478	4	1	17	1	13	1
2	199999	50823	1507029601478	4	1	17	1	13	1

item_df.head(3)

	click_article_id	category_id	created_at_ts	words_count
0	0	0	1513144419000	168
1	1	1	1405341936000	189
2	2	1	1408667706000	250

item_emb_df.head(3)

	article_id	emb_0	emb_1	emb_2	emb_3	emb_4	emb_5	emb_6	emb_7	emb_8	...	emb_240	emb_241	emb_242	emb_243	emb_244	emb_245	emb_246	emb_247	emb_248	emb_249
0	0	-0.161183	-0.957233	-0.137944	0.050855	0.830055	0.901365	-0.335148	-0.559561	-0.500603	...	0.321248	0.313999	0.636412	0.169179	0.540524	-0.813182	0.286870	-0.231686	0.597416	0.409623
1	1	-0.523216	-0.974058	0.738608	0.155234	0.626294	0.485297	-0.715657	-0.897996	-0.359747	...	-0.487843	0.823124	0.412688	-0.338654	0.320786	0.588643	-0.594137	0.182828	0.397090	-0.834364
2	2	-0.619619	-0.972960	-0.207360	-0.128861	0.044748	-0.387535	-0.730477	-0.066126	-0.754899	...	0.454756	0.473184	0.377866	-0.863887	-0.383365	0.137721	-0.810877	-0.447580	0.805932	-0.285284

3 rows × 251 columns

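Convert the timestamp in the user log into a rank feature that is easier to read and analyze

Notes:

groupby() groups the rows by user_id, so that the operations below are applied within each user.
rank() is applied to click_timestamp within each group. A larger timestamp means a later click, so ranking in descending order (ascending = False) gives the most recent click rank 1 and keeps the rank unambiguous.
rank() returns floats, so the result is cast to an integer type. With the default method = 'average', tied timestamps share a fractional rank that the cast truncates; method = 'first' avoids fractional ranks.
transform() takes the function to apply (or its name) and returns a result aligned with the original rows.
merge() takes the table to merge in as its first argument; 'on' names the key column(s) the two tables are joined on, and 'how' sets the join type (for example 'left').
describe() shows summary statistics of the data.
info() shows how the DataFrame is stored: column names, non-null counts, dtypes, and memory usage.
nunique() returns the number of distinct values.
unique() returns all the distinct values.
count() returns the number of non-null elements.
value_counts() returns each distinct value together with how many times it occurs.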
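The notes on rank() and transform() can be tried out on a toy click log. This is a minimal sketch: the values are made up for illustration, and only the column names come from the dataset. It shows how tied timestamps behave under the default method = 'average' and under method = 'first'.

```python
import pandas as pd

# Toy click log with made-up values; user 1 has two clicks sharing a timestamp.
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_timestamp": [100, 200, 200, 50, 70],
})

# Default method="average": tied rows share a fractional rank (1.5 here).
clicks["rank_avg"] = clicks.groupby("user_id")["click_timestamp"].rank(ascending=False)

# method="first": ties are broken by row order, so every rank is a whole number.
clicks["rank_first"] = (
    clicks.groupby("user_id")["click_timestamp"]
    .rank(ascending=False, method="first")
    .astype(int)
)

# transform("count") broadcasts each user's click count back onto every row.
clicks["click_cnts"] = clicks.groupby("user_id")["click_timestamp"].transform("count")

print(clicks)
```

If the log contains no tied timestamps within a user, both methods give the same ranks.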

trn_click['rank'] = trn_click.groupby('user_id')['click_timestamp'].rank(ascending = False).astype(int)
tst_click['rank'] = tst_click.groupby('user_id')['click_timestamp'].rank(ascending = False).astype(int)

trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')

trn_click.head(2)

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts
0	199999	160417	1507029570190	4	1	17	1	13	1	11	11
1	199999	5408	1507029571478	4	1	17	1	13	1	10	11

trn_click = trn_click.merge(item_df, how = 'left', on = ['click_article_id'])
trn_click.head(2)

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
0	199999	160417	1507029570190	4	1	17	1	13	1	11	11	281	1506942089000	173
1	199999	5408	1507029571478	4	1	17	1	13	1	10	11	4	1506994257000	118

trn_click.sort_values(by = 'user_id')

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
1112620	0	157507	1508211702520	4	1	17	1	25	2	1	2	281	1508236945000	370
1112619	0	30760	1508211672520	4	1	17	1	25	2	2	2	26	1508185091000	162
1112602	1	63746	1508211346889	4	1	17	1	25	6	1	2	133	1508142585000	162
1112601	1	289197	1508211316889	4	1	17	1	25	6	2	2	418	1508179909000	176
1112600	2	168401	1508211468695	4	3	20	1	25	2	1	2	297	1507663321000	215
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1081895	199999	218355	1508176867088	4	1	17	1	13	1	1	11	352	1508155745000	202
660731	199999	161191	1507665351186	4	1	17	1	13	1	6	11	281	1507646579000	285
660732	199999	42223	1507665381186	4	1	17	1	13	1	5	11	67	1507648195000	186
211041	199999	123909	1507226987864	4	1	17	1	13	1	8	11	250	1507198955000	240
0	199999	160417	1507029570190	4	1	17	1	13	1	11	11	281	1506942089000	173

1112623 rows × 14 columns
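A left merge keeps every row of the click log even when an article has no metadata. The sketch below, on made-up toy tables, uses the indicator option of merge() to count such unmatched clicks; the table contents and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy tables with made-up values; article 99 has no metadata row.
clicks = pd.DataFrame({"user_id": [1, 1, 2], "click_article_id": [10, 11, 99]})
articles = pd.DataFrame({"click_article_id": [10, 11, 12], "category_id": [5, 5, 7]})

# how="left" keeps all clicks; indicator=True adds a "_merge" column that
# records whether each row found a match in the right-hand table.
merged = clicks.merge(articles, how="left", on="click_article_id", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("clicks without article metadata:", unmatched)
```

In the real data, info() reports no null values after the merge, so every clicked article has a metadata row.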

trn_click.describe()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
count	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06	1.112623e+06
mean	1.221198e+05	1.951541e+05	1.507588e+12	3.947786e+00	1.815981e+00	1.301976e+01	1.310776e+00	1.813587e+01	1.910063e+00	7.118518e+00	1.323704e+01	3.056176e+02	1.506598e+12	2.011981e+02
std	5.540349e+04	9.292286e+04	3.363466e+08	3.276715e-01	1.035170e+00	6.967844e+00	1.618264e+00	7.105832e+00	1.220012e+00	1.016095e+01	1.631503e+01	1.155791e+02	8.343066e+09	5.223881e+01
min	0.000000e+00	3.000000e+00	1.507030e+12	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.166573e+12	0.000000e+00
25%	7.934700e+04	1.239090e+05	1.507297e+12	4.000000e+00	1.000000e+00	2.000000e+00	1.000000e+00	1.300000e+01	1.000000e+00	2.000000e+00	4.000000e+00	2.500000e+02	1.507220e+12	1.700000e+02
50%	1.309670e+05	2.038900e+05	1.507596e+12	4.000000e+00	1.000000e+00	1.700000e+01	1.000000e+00	2.100000e+01	2.000000e+00	4.000000e+00	8.000000e+00	3.280000e+02	1.507553e+12	1.970000e+02
75%	1.704010e+05	2.777120e+05	1.507841e+12	4.000000e+00	3.000000e+00	1.700000e+01	1.000000e+00	2.500000e+01	2.000000e+00	8.000000e+00	1.600000e+01	4.100000e+02	1.507756e+12	2.280000e+02
max	1.999990e+05	3.640460e+05	1.510603e+12	4.000000e+00	5.000000e+00	2.000000e+01	1.100000e+01	2.800000e+01	7.000000e+00	2.410000e+02	2.410000e+02	4.600000e+02	1.510666e+12	6.690000e+03

trn_click.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1112623 entries, 0 to 1112622
Data columns (total 14 columns):
 #   Column               Non-Null Count    Dtype
---  ------               --------------    -----
 0   user_id              1112623 non-null  int64
 1   click_article_id     1112623 non-null  int64
 2   click_timestamp      1112623 non-null  int64
 3   click_environment    1112623 non-null  int64
 4   click_deviceGroup    1112623 non-null  int64
 5   click_os             1112623 non-null  int64
 6   click_country        1112623 non-null  int64
 7   click_region         1112623 non-null  int64
 8   click_referrer_type  1112623 non-null  int64
 9   rank                 1112623 non-null  int32
 10  click_cnts           1112623 non-null  int64
 11  category_id          1112623 non-null  int64
 12  created_at_ts        1112623 non-null  int64
 13  words_count          1112623 non-null  int64
dtypes: int32(1), int64(13)
memory usage: 123.1 MB

trn_click['user_id'].nunique()

trn_click.groupby(['user_id'])['click_article_id'].count().min()
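The difference between nunique(), unique(), count(), and value_counts() is easiest to see on a small Series that contains a missing value. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values; None becomes NaN.
s = pd.Series([3, 1, 3, None, 1, 3], name="click_os")

print(s.nunique())       # number of distinct non-null values
print(s.unique())        # distinct values in order of appearance, NaN included
print(s.count())         # number of non-null entries
print(s.value_counts())  # frequency of each non-null value, most frequent first
```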

plt.subplot() places each plot in a cell of a grid of subplots on the figure

plt.figure(figsize = (15,20))
base_cols = [ 'click_article_id', 'click_timestamp', 'click_environment',
       'click_deviceGroup', 'click_os', 'click_country', 'click_region',
       'click_referrer_type', 'rank', 'click_cnts']
for i, col in enumerate(base_cols, start = 1):
    plt.subplot(5, 2, i)
    # ten most frequent values of the column and their counts
    v = trn_click[col].value_counts().head(10)
    # use the index and values of the Series directly, so the code does not depend
    # on the column names produced by reset_index(), which differ between pandas versions
    ax = sns.barplot(x = v.index, y = v.values)
    for item in ax.get_xticklabels():
        item.set_rotation(90)
    plt.title(col)
plt.tight_layout()
plt.show()

[Figure: bar charts of the ten most frequent values of each column in base_cols; image not available (output_20_0.png)]

trn_click.columns

Index(['user_id', 'click_article_id', 'click_timestamp', 'click_environment',
       'click_deviceGroup', 'click_os', 'click_country', 'click_region',
       'click_referrer_type', 'rank', 'click_cnts', 'category_id',
       'created_at_ts', 'words_count'],
      dtype='object')

tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])
tst_click.head()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	category_id	created_at_ts	words_count
0	249999	160974	1506959142820	4	1	17	1	13	2	19	281	1506912747000	259
1	249999	160417	1506959172820	4	1	17	1	13	2	18	281	1506942089000	173
2	249998	160974	1506959056066	4	1	12	1	13	2	5	281	1506912747000	259
3	249998	202557	1506959086066	4	1	12	1	13	2	4	327	1506938401000	219
4	249997	183665	1506959088613	4	1	17	1	15	5	7	301	1500895686000	256

tst_click.describe()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	category_id	created_at_ts	words_count
count	518010.000000	518010.000000	5.180100e+05	518010.000000	518010.000000	518010.000000	518010.000000	518010.000000	518010.000000	518010.000000	518010.000000	5.180100e+05	518010.000000
mean	227342.428169	193803.792550	1.507387e+12	3.947300	1.738285	13.628467	1.348209	18.250250	1.819614	15.521785	305.324961	1.506883e+12	210.966331
std	14613.907188	88279.388177	3.706127e+08	0.323916	1.020858	6.625564	1.703524	7.060798	1.082657	33.957702	110.411513	5.816668e+09	83.040065
min	200000.000000	137.000000	1.506959e+12	1.000000	1.000000	2.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.265812e+12	0.000000
25%	214926.000000	128551.000000	1.507026e+12	4.000000	1.000000	12.000000	1.000000	13.000000	1.000000	4.000000	252.000000	1.506970e+12	176.000000
50%	229109.000000	199197.000000	1.507308e+12	4.000000	1.000000	17.000000	1.000000	21.000000	2.000000	8.000000	323.000000	1.507249e+12	199.000000
75%	240182.000000	272143.000000	1.507666e+12	4.000000	3.000000	17.000000	1.000000	25.000000	2.000000	18.000000	399.000000	1.507630e+12	232.000000
max	249999.000000	364043.000000	1.508832e+12	4.000000	5.000000	20.000000	11.000000	28.000000	7.000000	938.000000	460.000000	1.509949e+12	3082.000000

tst_click['user_id'].nunique()

tst_click.groupby(['user_id'])['click_article_id'].count().min()
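The describe() outputs above show training user IDs running from 0 to 199999 and test user IDs from 200000 to 249999, which suggests the two sets of users do not overlap. A direct check is a set intersection; the sketch below uses made-up toy frames, and the names trn and tst are placeholders.

```python
import pandas as pd

# Toy stand-ins for the two click logs (made-up values).
trn = pd.DataFrame({"user_id": [0, 0, 1, 2]})
tst = pd.DataFrame({"user_id": [3, 3, 4]})

# Users that appear in both logs; an empty set means no overlap.
overlap = set(trn["user_id"]) & set(tst["user_id"])
print("users in both sets:", len(overlap))
```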

pd.concat([item_df.head(), item_df.tail()])

	click_article_id	category_id	created_at_ts	words_count
0	0	0	1513144419000	168
1	1	1	1405341936000	189
2	2	1	1408667706000	250
3	3	1	1408468313000	230
4	4	1	1407071171000	162
364042	364042	460	1434034118000	144
364043	364043	460	1434148472000	463
364044	364044	460	1457974279000	177
364045	364045	460	1515964737000	126
364046	364046	460	1505811330000	479

item_df['words_count'].value_counts()

176     3485
182     3480
179     3463
178     3458
174     3456
        ... 
845        1
710        1
965        1
847        1
1535       1
Name: words_count, Length: 866, dtype: int64

print(item_df['category_id'].nunique())
item_df['category_id'].hist()

461

<matplotlib.axes._subplots.AxesSubplot at 0x17f256f2a90>

[Figure: histogram of category_id]

item_df.head()

	click_article_id	category_id	created_at_ts	words_count
0	0	0	1513144419000	168
1	1	1	1405341936000	189
2	2	1	1408667706000	250
3	3	1	1408468313000	230
4	4	1	1407071171000	162

item_df.shape

(364047, 4)

item_emb_df.head()

	article_id	emb_0	emb_1	emb_2	emb_3	emb_4	emb_5	emb_6	emb_7	emb_8	...	emb_240	emb_241	emb_242	emb_243	emb_244	emb_245	emb_246	emb_247	emb_248	emb_249
0	0	-0.161183	-0.957233	-0.137944	0.050855	0.830055	0.901365	-0.335148	-0.559561	-0.500603	...	0.321248	0.313999	0.636412	0.169179	0.540524	-0.813182	0.286870	-0.231686	0.597416	0.409623
1	1	-0.523216	-0.974058	0.738608	0.155234	0.626294	0.485297	-0.715657	-0.897996	-0.359747	...	-0.487843	0.823124	0.412688	-0.338654	0.320786	0.588643	-0.594137	0.182828	0.397090	-0.834364
2	2	-0.619619	-0.972960	-0.207360	-0.128861	0.044748	-0.387535	-0.730477	-0.066126	-0.754899	...	0.454756	0.473184	0.377866	-0.863887	-0.383365	0.137721	-0.810877	-0.447580	0.805932	-0.285284
3	3	-0.740843	-0.975749	0.391698	0.641738	-0.268645	0.191745	-0.825593	-0.710591	-0.040099	...	0.271535	0.036040	0.480029	-0.763173	0.022627	0.565165	-0.910286	-0.537838	0.243541	-0.885329
4	4	-0.279052	-0.972315	0.685374	0.113056	0.238315	0.271913	-0.568816	0.341194	-0.600554	...	0.238286	0.809268	0.427521	-0.615932	-0.503697	0.614450	-0.917760	-0.424061	0.185484	-0.580292

5 rows × 251 columns

item_emb_df.shape

(364047, 251)

user_click_merge = pd.concat([trn_click, tst_click])
user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg(['count']).reset_index()
user_click_count[:10]

	user_id	click_article_id	count
0	0	30760	1
1	0	157507	1
2	1	63746	1
3	1	289197	1
4	2	36162	1
5	2	168401	1
6	3	36162	1
7	3	50644	1
8	4	39894	1
9	4	42567	1

user_click_count[user_click_count['count'] > 7]

	user_id	click_article_id	count
311242	86295	74254	10
311243	86295	76268	10
393761	103237	205948	10
393763	103237	235689	10
576902	134850	69463	13

user_click_count['count'].unique()

array([ 1,  2,  4,  3,  6,  5, 10,  7, 13], dtype=int64)

user_click_count.loc[:, 'count'].value_counts()

1     1605541
2       11621
3         422
4          77
5          26
6          12
10          4
7           3
13          1
Name: count, dtype: int64
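The counts printed above can be turned into a share of repeated clicks with a little arithmetic. The numbers in the dictionary are copied from the value_counts() output; the variable names are illustrative.

```python
# Number of (user, article) pairs for each click count, from the output above.
counts = {1: 1605541, 2: 11621, 3: 422, 4: 77, 5: 26, 6: 12, 10: 4, 7: 3, 13: 1}

total_pairs = sum(counts.values())
repeated_pairs = sum(n for k, n in counts.items() if k > 1)
total_clicks = sum(k * n for k, n in counts.items())
share = repeated_pairs / total_pairs

print(total_pairs, repeated_pairs, total_clicks)
print(f"share of pairs clicked more than once: {share:.2%}")
```

The total of 1630633 clicks equals the 1112623 training rows plus the 518010 test rows, so the table accounts for every click. Fewer than 1% of the (user, article) pairs involve a repeated click.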

def plot_envs(df, cols, r, c):
    plt.figure(figsize = (10, 5))
    i = 1
    for col in cols:
        plt.subplot(r,c,i)
        i = i + 1
        v = df[col].value_counts()   # counts of each distinct value in the column
        fig = sns.barplot(x = v.index, y = v.values)
        for item in fig.get_xticklabels():
            item.set_rotation(90)
        plt.title(col)
    plt.tight_layout()
    plt.show()

Randomly pick five distinct users from the test set and plot each user's log separately, to check whether a single user's click environment stays stable.

sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size = 5, replace = False)
sample_users = tst_click[tst_click['user_id'].isin(sample_user_ids)]
cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region',
       'click_referrer_type']
for _, user_df in sample_users.groupby('user_id'):
    plot_envs(user_df, cols, 2, 3)

[Figure: bar charts of the six click-environment columns, one figure per sampled user]

sample_users.head()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	category_id	created_at_ts	words_count
51836	230439	162655	1506974999046	4	1	17	1	13	2	4	281	1506949610000	245
51837	230439	166581	1506975626449	4	1	17	1	13	2	3	289	1506963755000	210
51838	230439	225055	1506975709313	4	1	17	1	13	2	2	354	1506953474000	245
51839	230439	199198	1506975847167	4	1	17	1	13	2	1	323	1506958823000	221
52490	230231	160974	1506975219671	4	3	2	1	16	2	10	281	1506912747000	259

tst_click.columns

Index(['user_id', 'click_article_id', 'click_timestamp', 'click_environment',
       'click_deviceGroup', 'click_os', 'click_country', 'click_region',
       'click_referrer_type', 'rank', 'category_id', 'created_at_ts',
       'words_count'],
      dtype='object')

Count the number of articles each user clicked and inspect the distribution of click counts across all users.

user_click_item_count = sorted(user_click_merge.groupby(['user_id'])['click_article_id'].count(), reverse = True)
plt.plot(user_click_item_count)

[<matplotlib.lines.Line2D at 0x17f2eabd198>]

[Figure: click counts per user, sorted in descending order]

plt.plot(user_click_item_count[:50])

[<matplotlib.lines.Line2D at 0x17f2c0f9048>]

[Figure: click counts of the 50 most active users]

plt.plot(user_click_item_count[25000: 50000])

[<matplotlib.lines.Line2D at 0x17f2e9a5080>]

[Figure: click counts of the users ranked 25,000 to 50,000 by activity]
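Besides plotting slices of the sorted counts, the same distribution can be summarized numerically. A minimal sketch on a made-up toy log, assuming the same column names as the dataset:

```python
import pandas as pd

# Toy click log with made-up values.
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3],
    "click_article_id": [10, 11, 12, 13, 10, 14, 10],
})

# Number of clicks per user, then summary statistics and selected quantiles.
per_user = clicks.groupby("user_id")["click_article_id"].count()
print(per_user.describe())
print(per_user.quantile([0.5, 0.9]))
```

Quantiles make it easy to state, for example, how many clicks the most active 10% of users have, which is harder to read off a line plot.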

Count how many times each article was clicked and inspect the distribution of click counts across all articles in the dataset.

item_click_count = sorted(user_click_merge.groupby(['click_article_id'])['user_id'].count(), reverse = True)
plt.plot(item_click_count)

[<matplotlib.lines.Line2D at 0x17f2c084e10>]

[Figure: click counts per article, sorted in descending order]

plt.plot(item_click_count[:100])

[<matplotlib.lines.Line2D at 0x17f256c6cf8>]

[Figure: click counts of the 100 most clicked articles]

plt.plot(item_click_count[:20])

[<matplotlib.lines.Line2D at 0x17f2ea1f588>]

[Figure: click counts of the 20 most clicked articles]

plt.plot(item_click_count[3500:])

[<matplotlib.lines.Line2D at 0x17f2b67ae10>]

[Figure: click counts of the articles ranked 3,500 and below]

plt.plot(item_click_count[:1000])

[<matplotlib.lines.Line2D at 0x17f2e9d3b70>]

[Figure: click counts of the 1,000 most clicked articles]

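Article co-occurrence frequency: how often two articles are clicked one right after the other

Sort the data by timestamp.
Within each user (grouped by user_id), take the sequence of article IDs and use shift(-1) to obtain the ID of the next article clicked. A user's last click has no next article, so next_item is NaN for that row. groupby() drops rows with NaN keys by default, so these rows are left out of the pair counts below; alternatively, fillna() can replace the missing values with an agreed sentinel.
Group by the pair (current article ID, next article ID) and count the rows in each group with agg(), which gives the number of times the two articles occur consecutively. Then reset the index and sort by the count column in descending order.

tmp = user_click_merge.sort_values(['click_timestamp'])
tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].shift(-1)
union_item = tmp.groupby(['click_article_id', 'next_item'])['click_timestamp'].agg(['count']).reset_index().sort_values('count', ascending = False)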
union_item[['count']].describe()

	count
count	433597.000000
mean	3.184139
std	18.851753
min	1.000000
25%	1.000000
50%	1.000000
75%	2.000000
max	2202.000000
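The co-occurrence count is easier to follow on a log small enough to check by hand. This is a sketch on made-up data, not the notebook's exact code: it drops the rows without a next article explicitly and counts pairs with size().

```python
import pandas as pd

# Toy log with made-up values: both users click article 10 and then article 11.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "click_article_id": [10, 11, 12, 10, 11, 13],
    "click_timestamp": [1, 2, 3, 1, 2, 3],
})

log = log.sort_values(["user_id", "click_timestamp"])
# Next article clicked by the same user; NaN for each user's last click.
log["next_item"] = log.groupby("user_id")["click_article_id"].shift(-1)

pairs = (
    log.dropna(subset=["next_item"])
    .groupby(["click_article_id", "next_item"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(pairs)
```

The pair (10, 11) comes out on top with a count of 2, and the two rows without a successor contribute nothing.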

x = union_item['click_article_id']
y = union_item['count']
plt.scatter(x, y)

<matplotlib.collections.PathCollection at 0x17fe9eb2ac8>

[Figure: scatter plot of co-occurrence count against click_article_id]

plt.plot(union_item['count'].values[40000:])

[<matplotlib.lines.Line2D at 0x17fe9f842e8>]

[Figure: co-occurrence counts of the article pairs ranked 40,000 and below]

Inspect article diversity across the whole dataset, measured by the number of clicks per category

plt.plot(user_click_merge['category_id'].value_counts().values)

[<matplotlib.lines.Line2D at 0x17fea035320>]

[Figure: click counts per category, sorted in descending order]

plt.plot(user_click_merge['category_id'].value_counts().values[150:])

[<matplotlib.lines.Line2D at 0x17fea0dc8d0>]

[Figure: click counts of the categories ranked 150 and below]

Inspect the distribution of article length (word count) across the whole dataset

user_click_merge['words_count'].describe()

count    1.630633e+06
mean     2.043012e+02
std      6.382198e+01
min      0.000000e+00
25%      1.720000e+02
50%      1.970000e+02
75%      2.290000e+02
max      6.690000e+03
Name: words_count, dtype: float64

plt.plot(user_click_merge['words_count'].values)

[<matplotlib.lines.Line2D at 0x17fea18e710>]

[Figure: words_count of each clicked article, in row order; image not available (output_61_1.png)]

Look at the categories of the articles each user clicked, that is, how spread out each user's interests are.

plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse = True))

[<matplotlib.lines.Line2D at 0x17fea23ecc0>]

[Figure: number of distinct categories clicked per user, sorted in descending order]

user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()

	user_id	category_id
count	250000.000000	250000.000000
mean	124999.500000	4.573188
std	72168.927986	4.419800
min	0.000000	1.000000
25%	62499.750000	2.000000
50%	124999.500000	3.000000
75%	187499.250000	6.000000
max	249999.000000	95.000000

Inspect the distribution of the article length users prefer.

plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse = True))

[<matplotlib.lines.Line2D at 0x17f23595c18>]

[Figure: mean words_count of clicked articles per user, sorted in descending order]

plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse = True)[1000:45000])

[<matplotlib.lines.Line2D at 0x17f25910c18>]

[Figure: mean words_count of clicked articles for the users ranked 1,000 to 45,000]

user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()

	user_id	words_count
count	250000.000000	250000.000000
mean	124999.500000	205.830189
std	72168.927986	47.174030
min	0.000000	8.000000
25%	62499.750000	187.500000
50%	124999.500000	202.000000
75%	187499.250000	217.750000
max	249999.000000	3434.500000

Scale the timestamp columns to the range [0, 1] with min-max normalization, which gives values that are easier to work with.

from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])
user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])
user_click_merge = user_click_merge.sort_values(['click_timestamp'])

user_click_merge.head(5)

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
18	249990	162300	0.000000	4	3	20	1	25	2	5	NaN	281	0.989186	193
2	249998	160974	0.000002	4	1	12	1	13	2	5	NaN	281	0.989092	259
30	249985	160974	0.000003	4	1	17	1	8	2	8	NaN	281	0.989092	259
50	249979	162300	0.000004	4	1	17	1	25	2	2	NaN	281	0.989186	193
25	249988	160974	0.000004	4	1	17	1	21	2	17	NaN	281	0.989092	259
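With its default feature range of (0, 1), MinMaxScaler maps each value x to (x - min) / (max - min). The sketch below reproduces that formula with NumPy alone; the three timestamps are illustrative values, not a sample drawn from the data.

```python
import numpy as np

# Illustrative millisecond timestamps.
ts = np.array([1506959142820, 1507029570190, 1510603000000], dtype=np.float64)

# Min-max normalization: the smallest value maps to 0, the largest to 1.
scaled = (ts - ts.min()) / (ts.max() - ts.min())
print(scaled)
```

The scaling is monotonic, so the order of the clicks is preserved; only the unit changes.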

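Indexing with a single column name returns that column as a Series. Because its shape differs from the original data (one-dimensional instead of two-dimensional), later processing can run into unnecessary trouble; MinMaxScaler, for example, expects two-dimensional input. Indexing with a list of column names returns a DataFrame, which keeps the same format as the original data and is the safer choice.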

user_click_merge[['click_timestamp']].head(5)

	click_timestamp
18	0.000000
2	0.000002
30	0.000003
50	0.000004
25	0.000004
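The difference between the two indexing styles shows up in the type and shape of the result. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"click_timestamp": [0.0, 0.5, 1.0], "user_id": [1, 1, 2]})

as_series = df["click_timestamp"]    # single label: one-dimensional Series
as_frame = df[["click_timestamp"]]   # list of labels: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```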

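Define the gap between two consecutive timestamps as a rough proxy for the time a user spends on each article. The function below takes the difference between each row's timestamp and the previous row's timestamp, obtained with shift(1). The first row of each user has no previous row, so shift(1) produces NaN there; fillna(0) replaces it, which means the first "gap" of each user equals that row's scaled timestamp itself. Only the size of the gap matters, not its sign, so the absolute value is taken.

def mean_diff_time_func(df, col):
    df = df[[col]].copy()   # keep only the column of interest; copy so the group is not modified
    df['time_shift1'] = df[col].shift(1).fillna(0)   # previous timestamp; 0 for the user's first row
    df['diff_time'] = abs(df[col] - df['time_shift1'])
    return df['diff_time'].mean()

mean_diff_click_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x : mean_diff_time_func(x, 'click_timestamp'))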

plt.plot(sorted(mean_diff_click_time.values, reverse = True))

[<matplotlib.lines.Line2D at 0x17f1e8d7ba8>]

[Figure: mean gap between click times per user, sorted in descending order]

mean_diff_creat_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x : mean_diff_time_func(x, 'created_at_ts'))

plt.plot(sorted(mean_diff_creat_time.values, reverse = True))

[<matplotlib.lines.Line2D at 0x17f258475c0>]

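[Figure: mean gap between the creation times of consecutively clicked articles per user, sorted in descending order]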
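An alternative way to compute the per-user mean gap is groupby().diff(). This is a different technique from mean_diff_time_func: diff() leaves NaN in each user's first row and mean() skips it, so the first click does not count as a gap. The sketch uses made-up values, and the names log, gaps, and mean_gap are placeholders.

```python
import pandas as pd

# Toy log with made-up, already scaled timestamps.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_timestamp": [0.10, 0.30, 0.40, 0.20, 0.80],
})

log = log.sort_values(["user_id", "click_timestamp"])
# Gap to the previous click of the same user; NaN for each user's first click.
gaps = log.groupby("user_id")["click_timestamp"].diff().abs()
# Mean gap per user; NaN values are ignored by mean().
mean_gap = gaps.groupby(log["user_id"]).mean()
print(mean_gap)
```

A user with a single click ends up with NaN here rather than a value, which makes such users easy to treat separately.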

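Use zip() to pair the article IDs in the embedding table with their row indices, then dict() to build a dictionary that maps each article ID to its row index. After that, drop the article ID column so that the remaining columns hold only the embedding vectors and are convenient to use later.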

item_idx_2_rawid_dict = dict(zip(item_emb_df['article_id'], item_emb_df.index))

del item_emb_df['article_id']
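The mapping matters when article IDs do not coincide with row positions. A minimal sketch with a made-up three-row embedding table; the names emb, id_to_row, and vectors are hypothetical.

```python
import pandas as pd

# Made-up embedding table whose article IDs are not 0, 1, 2.
emb = pd.DataFrame({
    "article_id": [7, 3, 9],
    "emb_0": [0.1, 0.2, 0.3],
    "emb_1": [0.4, 0.5, 0.6],
})

# Map each article ID to the row that holds its vector.
id_to_row = dict(zip(emb["article_id"], emb.index))
# Keep only the embedding columns as a plain array.
vectors = emb.drop(columns="article_id").to_numpy()

print(id_to_row)
print(vectors[id_to_row[9]])  # vector of article 9
```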

Use np.ascontiguousarray() to store the article embeddings in contiguous memory and convert them to 32-bit NumPy floats

item_emb_np = np.ascontiguousarray(item_emb_df.values, dtype = np.float32)
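What np.ascontiguousarray() changes can be seen on a small transposed array, which is a view that is not stored row by row. A minimal sketch with illustrative values:

```python
import numpy as np

# Transposing a 2-D array gives a view that is not C-contiguous.
a = np.arange(6, dtype=np.float64).reshape(2, 3).T
# Copy into row-major contiguous memory and convert to float32 in one step.
b = np.ascontiguousarray(a, dtype=np.float32)

print(a.flags["C_CONTIGUOUS"], b.flags["C_CONTIGUOUS"], b.dtype)
```

The values are unchanged; only the memory layout and the precision differ.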

Randomly sample fifteen distinct users from the click data and take all of their records.

sub_user_ids = np.random.choice(user_click_merge['user_id'].unique(), size = 15, replace = False)
sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]
sub_user_info.head()

	user_id	click_article_id	click_timestamp	click_environment	click_deviceGroup	click_os	click_country	click_region	click_referrer_type	rank	click_cnts	category_id	created_at_ts	words_count
24395	240315	300470	0.002052	4	1	17	1	25	5	11	NaN	428	0.989182	203
24396	240315	160974	0.002060	4	1	17	1	25	5	10	NaN	281	0.989092	259
137929	240315	272143	0.019313	4	1	17	1	25	5	9	NaN	399	0.989235	184
137930	240315	198659	0.019322	4	1	17	1	25	5	8	NaN	323	0.989235	191
154069	240315	156560	0.025085	4	1	17	1	25	1	7	NaN	281	0.989222	185

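Compute the similarity between the embeddings of the articles each user clicked. Loop over the user's clicked articles in order; for each one, look up its row index in the dictionary that maps article IDs to rows of the embedding array and fetch its embedding, then fetch the embedding of the next article in the same way. The similarity is the cosine similarity $\mathrm{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert_2 \, \lVert B \rVert_2}$. A 0 is appended for the last article, which has no successor.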

def get_item_sim_list(df):
    sim_list = []
    item_list = df['click_article_id'].values
    for i in range(0, len(item_list)-1):
        emb1 = item_emb_np[item_idx_2_rawid_dict[item_list[i]]]
        emb2 = item_emb_np[item_idx_2_rawid_dict[item_list[i + 1]]]
        sim_list.append(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * (np.linalg.norm(emb2))))
    sim_list.append(0)
    return sim_list

for _, user_df in sub_user_info.groupby('user_id'):
    item_sim_list = get_item_sim_list(user_df)
    plt.plot(item_sim_list)

[Figure: cosine similarity between consecutively clicked articles, one line per sampled user]
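The loop in get_item_sim_list can also be written without an explicit Python loop. The function below is a sketch of a vectorized variant, not the notebook's method: consecutive_cosine is a hypothetical name, and it assumes that no embedding is the zero vector (which would cause a division by zero).

```python
import numpy as np

def consecutive_cosine(embs):
    """Cosine similarity between each row and the next row of a 2-D array."""
    embs = np.asarray(embs, dtype=np.float32)
    # Scale every row to unit length, so a dot product equals the cosine.
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Row-wise dot product of each vector with its successor.
    return np.sum(unit[:-1] * unit[1:], axis=1)

# Made-up vectors: identical, orthogonal, then opposite directions.
demo = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, -1.0]])
sims = consecutive_cosine(demo)
print(sims)
```

For n vectors the result has n - 1 entries; the notebook's version appends a trailing 0 so that the list has one entry per click.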

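Summary

The data analysis so far yields the following key findings, which will help with feature construction and analysis later on:

The user IDs in the training set and the test set do not overlap, so the model has never seen the test-set users.
Users do click the same article more than once, but all such cases are in the training set.
A single user's click environment is not always the same, so statistical features can be used when building features from these columns.
Users differ widely in how many articles they click, which can be turned into a user-activity feature.
Articles differ widely in how many times they are clicked, which can be turned into an article-popularity feature.
The articles a user reads are fairly strongly related to one another, so whether a user is interested in an article often depends to a large extent on the articles the user clicked before.
The word counts of the articles users click vary considerably, which can reflect each user's preference for article length.
The topics of the articles users click also vary considerably, which can reflect each user's topic preferences.
The time gaps between clicks differ from user to user, which can reflect each user's preference for article freshness.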
