Building an E-commerce User Profile Label Table

Latest recommended article published on 2024-07-25 16:24:11

Larry Westside

Tags: data analysis

Article link: https://blog.csdn.net/Larry_Westside/article/details/117898930


```python
## I. Import modules
# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
# Set a font that can render Chinese characters
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
# Stop the minus sign '-' from rendering as a box
matplotlib.rcParams['axes.unicode_minus'] = False

import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine

import gc
import warnings
warnings.filterwarnings('ignore')
from datetime import datetime

# Load the order data set
df = pd.read_excel("order_data.xlsx")

# Load the user detail data set
df_user = pd.read_excel("user_data.xlsx")
# Order data + user data
# Browse:           behavior_type=1
# Add to favorites: behavior_type=2
# Add to cart:      behavior_type=3
# Purchase:         behavior_type=4
df.head()
```

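1. Data cleaning

Three checks: 1. duplicates, 2. missing values, 3. outliers (skipped for now, because no clear definition of an outlier is available here). For each check:
- Detect
- Count
- Handle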
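
The detect, count, handle steps listed above can be sketched on a small made-up table. The frame `toy_users` and its values are hypothetical stand-ins for `user_data.xlsx`, not the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one duplicated row and one entirely empty column
toy_users = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "gender": ["F", "M", "M", None],
    "Unnamed: 8": [np.nan, np.nan, np.nan, np.nan],
})

# Detect and count fully duplicated rows
n_duplicates = int(toy_users.duplicated().sum())

# Count missing values per column
missing_per_column = toy_users.isnull().sum()

# Handle: drop duplicated rows, then drop columns that are entirely empty
cleaned = toy_users.drop_duplicates().dropna(axis=1, how="all")
```

Dropping with `how="all"` only removes columns that contain no values at all, so a partially filled column such as `gender` is kept for a later decision.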

# 1. Duplicates
df_user[df_user.duplicated()]  # no duplicated rows

df_user.info()  # the 'Unnamed: 8' column may need handling

# 2. Missing values
df_user.isnull().sum()  # 'Unnamed: 8' is entirely empty, so drop the column

del df_user['Unnamed: 8']
df_user

2. Date and time-period handling

df.info()  # no nulls, but the dtype of the time column may need converting
df.head()

# Split the time field into a date and an hour of day
df['date'] = df['time'].str[0:10]
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
df['time'] = df['time'].str[11:]
df['time'] = df['time'].astype(int)

df

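# Bin the hour into 'early morning', 'morning', 'noon', 'afternoon', 'evening' (left-open, right-closed intervals)
df['hour'] = pd.cut(df['time'],bins=[-1,5,10,13,18,24],labels=['early morning','morning','noon','afternoon','evening'])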

df
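
To make the interval boundaries concrete, here is a minimal sketch of the split-and-bin step on a few made-up timestamps. The frame `toy`, the `"YYYY-MM-DD HH"` layout, and the column names `hour_of_day` and `period` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical timestamps in the "YYYY-MM-DD HH" layout that the slicing assumes
toy = pd.DataFrame({"time": ["2014-12-06 00", "2014-12-06 05", "2014-12-06 06",
                             "2014-12-07 13", "2014-12-07 23"]})

# Characters 0-9 are the date, characters from 11 onward are the hour
toy["date"] = pd.to_datetime(toy["time"].str[0:10], format="%Y-%m-%d")
toy["hour_of_day"] = toy["time"].str[11:].astype(int)

# Left-open, right-closed bins: (-1, 5], (5, 10], (10, 13], (13, 18], (18, 24]
toy["period"] = pd.cut(
    toy["hour_of_day"],
    bins=[-1, 5, 10, 13, 18, 24],
    labels=["early morning", "morning", "noon", "afternoon", "evening"],
)
```

The first bin starts at -1 rather than 0 so that hour 0 falls inside a left-open interval; hour 5 still counts as early morning and hour 13 as noon because the right edge is closed.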


3. Build the user label table

# Create the user label table; every label built below is merged into it
labels = df_user

labels.head()

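I. User active time

1.1 Active browsing period
1.2 Active purchase period
a. Active browsing period

Processing steps:

a. Extract the browse records (behavior_type=1)
b. Group by user id and time period, count the records, and find each user's maximum count
c. Keep each user id with its most active period; if several periods tie for the maximum, join them with commas

Values of behavior_type:

Browse: behavior_type=1
Add to favorites: behavior_type=2
Add to cart: behavior_type=3
Purchase: behavior_type=4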
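
The three steps above can be sketched end to end on a tiny made-up log. This is a compact variant that uses `transform("max")` in place of a separate merge; the frame `toy` and the column name `max_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical behavior log; "hour" already holds a binned period label
toy = pd.DataFrame({
    "user_id":       [1, 1, 1, 1, 2, 2, 2],
    "item_id":       [10, 11, 12, 13, 20, 21, 22],
    "behavior_type": [1, 1, 1, 1, 1, 1, 4],
    "hour": ["morning", "morning", "evening", "evening", "noon", "noon", "noon"],
})

# a. keep browse records only
browse = toy[toy["behavior_type"] == 1]

# b. count per (user, period) and attach each user's maximum count
counts = (browse.groupby(["user_id", "hour"])["item_id"]
                .count()
                .reset_index(name="hour_counts"))
counts["max_counts"] = counts.groupby("user_id")["hour_counts"].transform("max")

# c. keep the top period(s) per user and join ties with commas
top = counts[counts["hour_counts"] == counts["max_counts"]]
active = (top.groupby("user_id")["hour"]
             .agg(lambda s: ",".join(s))
             .reset_index(name="time_browse"))
```

User 1 browses equally often in two periods, so both are kept and joined; the order of the joined labels follows the sorted group keys, not the order of events.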

# Group by user and period, count browse events
time_browse = df[df['behavior_type']==1].groupby(['user_id','hour']).item_id.count().reset_index()

time_browse

time_browse.rename(columns={'item_id':'hour_counts'},inplace=True)

# Find each user's highest per-period browse count
time_browse_max = time_browse.groupby('user_id').hour_counts.max().reset_index()
time_browse_max.head()

time_browse_max.rename(columns={'hour_counts':'read_counts_max'},inplace=True)
time_browse_max.head()

time_browse = pd.merge(time_browse,time_browse_max,how='left',on='user_id')

time_browse


# Select each user's period(s) with the most browse events; join ties with commas
time_browse_hour = time_browse.loc[time_browse['hour_counts']==time_browse['read_counts_max'],'hour'].groupby(time_browse['user_id']).aggregate(lambda x:','.join(x)).reset_index()

time_browse_hour.sample(10)

# Add the active browsing period to the user label table
labels = pd.merge(labels,time_browse_hour,how='left',on='user_id')
labels.rename(columns={'hour':'time_browse'},inplace=True)

labels

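b. Active purchase period

Processing steps:

a. Extract the purchase records (behavior_type=4)
b. Group by user id and time period, count the records, and find each user's maximum count
c. Keep each user id with its most active period; if several periods tie for the maximum, join them with commas

# Group by user and period, count purchases
time_buy = df[df['behavior_type']==4].groupby(['user_id','hour']).item_id.count().reset_index()
time_buy.rename(columns={'item_id':'hour_counts'},inplace=True)
time_buy.head()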

# Find each user's highest per-period purchase count
time_buy_max = time_buy.groupby('user_id').hour_counts.max().reset_index()
time_buy_max.rename(columns={'hour_counts':'buy_counts_max'},inplace=True)

time_buy_max.head()

time_buy = pd.merge(time_buy,time_buy_max,how='left',on='user_id')
time_buy.head(10)

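# Select each user's period(s) with the most purchases; join ties with commas
time_buy_hour = time_buy.loc[time_buy['hour_counts']==time_buy['buy_counts_max'],'hour'].groupby(time_buy['user_id']).aggregate(lambda x:','.join(x)).reset_index()
time_buy_hour.head()

# Add the active purchase period to the user label table before the intermediates are deleted
# (assumption: the column name 'time_buy' is chosen by analogy with 'time_browse')
labels = pd.merge(labels,time_buy_hour,how='left',on='user_id')
labels.rename(columns={'hour':'time_buy'},inplace=True)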

del time_browse
del time_buy
del time_browse_hour
del time_browse_max
del time_buy_hour
del time_buy_max
gc.collect()


labels.head()

II. Category-related user behavior

2.1 Most browsed category
2.2 Most favorited category
2.3 Most carted category
2.4 Most purchased category
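
One possible way to derive these four labels is sketched below on made-up data. This is not the article's code; the helper `top_category`, the frame `toy`, and the label names `cate_most_browse` and `cate_most_buy` are all hypothetical.

```python
import pandas as pd

# Hypothetical behavior log with the columns user_id, item_id, item_category
toy = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2, 2],
    "item_id":       [10, 11, 12, 20, 21, 22],
    "item_category": [100, 100, 200, 300, 400, 400],
    "behavior_type": [1, 1, 1, 1, 4, 4],
})

def top_category(log: pd.DataFrame, behavior_type: int, label: str) -> pd.DataFrame:
    """Return each user's most frequent category for one behavior type."""
    subset = log[log["behavior_type"] == behavior_type]
    counts = (subset.groupby(["user_id", "item_category"])["item_id"]
                    .count()
                    .reset_index(name="n"))
    # idxmax keeps the first row with the highest count, so ties are not joined here
    best = counts.loc[counts.groupby("user_id")["n"].idxmax()]
    return best[["user_id", "item_category"]].rename(columns={"item_category": label})

most_browsed = top_category(toy, 1, "cate_most_browse")
most_bought = top_category(toy, 4, "cate_most_buy")
```

Each result can then be left-merged into a label table on `user_id`, in the same way as the active-period labels. If tied categories should all be kept, the comma-joining approach used for time periods would be needed instead of `idxmax`.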

# First, extract the data set needed for each behavior type
df_browse = df.loc[df['behavior_type']==1,['user_id','item_id','item_category']]
df_collect = df.loc[df['behavior_type']==2,['user_id','item_id','item_category']]
df_cart = df.loc[df['behavior_type']==3,['user_id',
