Data Analysis: New Retail User Analysis

Latest recommended article published on 2024-09-11 12:37:05

永远十八的小仙女~

Article link: https://blog.csdn.net/muyuhen/article/details/136167427

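This article shows how to use Python for e-commerce data analysis. It covers the data fields, Pandas operations (reading and cleaning the data), the definition and analysis of high-potential users, and the visualization of user behavior. Worked examples walk through data preprocessing, extracting user-behavior features, and visualizing the results.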

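I. Data Field Information

action (user behavior)

Field	Meaning
user_id	User ID
sku_id	Product (SKU) ID
type	Behavior type
time	Time of the behavior
cate	Category ID

user (user data)

Field	Meaning
user_id	User ID
age	Age
sex	Sex
user_lv_cd	User level
browse_num	Number of browses
addcart_num	Number of add-to-cart actions
delcart_num	Number of remove-from-cart actions
buy_num	Number of purchases
favor_num	Number of favorites
click_num	Number of clicks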

II. Reading the Data

import pandas as pd  # import Pandas under the alias pd
# Path to the CSV file
path = r'F:\data\action.csv'
# Read the CSV in chunks of 10,000 rows; with chunksize set, read_csv returns an iterator
data = pd.read_csv(path, chunksize=10000)
# Empty list that collects the chunks as they are read
chunks = []
# Each chunk is already a DataFrame, so it can be appended directly
for chunk in data:
    chunks.append(chunk)
# Concatenate all chunks into one large DataFrame
action = pd.concat(chunks)
# Print summary information: column names, non-null counts, dtypes
action.info()
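
The chunked read above can be tried without the original file. The following is a minimal sketch under assumptions: the CSV text, the tiny `chunksize=2`, the `usecols` selection, and the name `action_demo` are made up for illustration; only `pd.read_csv(..., chunksize=...)` and `pd.concat` come from the article.

```python
import io
import pandas as pd

# Hypothetical stand-in for action.csv (all values are made up).
csv_text = """user_id,sku_id,time,type,cate
1,100,2016-02-01 10:00:00,1,4
1,100,2016-02-03 11:00:00,4,4
2,200,2016-02-02 09:30:00,6,5
2,201,2016-02-02 09:31:00,1,5
3,300,2016-02-05 12:00:00,2,4
"""

# With chunksize set, read_csv returns an iterator of DataFrames.
# usecols limits parsing to the columns that are needed later.
reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=2,
    usecols=['user_id', 'sku_id', 'time', 'type', 'cate'],
)

# Each chunk is already a DataFrame, so it can be collected directly.
chunks = [chunk for chunk in reader]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
action_demo = pd.concat(chunks, ignore_index=True)
print(len(chunks), action_demo.shape)
```

Five rows read two at a time produce three chunks, which `pd.concat` stitches back into a single five-row frame.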

# Keep only the columns 'user_id', 'sku_id', 'time', 'type', 'cate'
action = action[['user_id', 'sku_id', 'time', 'type', 'cate']]
# Print summary information for the selected columns
action.info()

# Print the first five rows of the selected data
print(action.head())

III. Data Cleaning

# Count the missing values in each column
print(action.isnull().sum())
# user_id    0
# sku_id     0
# time       0
# type       0
# cate       0
# dtype: int64

# Count the fully duplicated rows in the DataFrame
print(action.duplicated().sum())  # 7262534
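
`duplicated()` flags every row that repeats an earlier row in full, so the number above counts the extra copies rather than the distinct rows that repeat. The article keeps these rows, and it does not say whether they are logging duplicates or genuinely repeated actions. A small sketch on made-up data (the `toy` frame is hypothetical) shows what is being counted and what `drop_duplicates()` would keep:

```python
import pandas as pd

# Made-up data: the first three rows are identical.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'sku_id':  [10, 10, 10, 20],
    'type':    [1, 1, 1, 4],
})

# The first occurrence is not flagged; only the later copies are.
n_dup = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()
print(n_dup, len(deduped))
```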

# Show descriptive statistics for all columns (counts, frequencies, summary statistics)
print(action.describe(include='all'))

# Show summary information for the DataFrame: dtypes, non-null counts, and so on
action.info()

# Convert the 'user_id' column to int64
action['user_id'] = action['user_id'].astype('int64')
# Convert the 'time' column to datetime
action['time'] = pd.to_datetime(action['time'])
# Print the DataFrame info again to confirm the new dtypes
action.info()

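IV. High-Potential User Analysis

1. Identifying high-potential users

A high-potential user should have the following characteristics:
1) The user must have made a purchase.
2) The user purchased a product and also interacted with it in other ways (such as browsing, clicking, or favoriting).
3) The time between the purchase of a product and the other interactions (browsing, clicking, favoriting, and so on) is greater than 1 day.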
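
One possible reading of these three criteria as code is sketched below on made-up data. The helper name `find_high_potential` and its parameters `buy_type` and `min_days` are hypothetical, and the sketch assumes that `type == 4` means purchase. It mirrors the steps this section takes (last purchase time per user, first action time per user, day difference) and, like them, compares times per user rather than per product.

```python
import pandas as pd

def find_high_potential(action, buy_type=4, min_days=1):
    """Hypothetical helper: users whose last purchase is more than
    `min_days` whole days after their first recorded action."""
    buys = action[action['type'] == buy_type]
    last_buy = (buys.groupby('user_id')['time'].max()
                    .reset_index(name='last_buy_time'))
    # Restrict to buyers, then take each buyer's earliest action of any type.
    buyers = action[action['user_id'].isin(last_buy['user_id'])]
    first_action = (buyers.groupby('user_id')['time'].min()
                          .reset_index(name='first_action_time'))
    out = pd.merge(last_buy, first_action, on='user_id')
    # .dt.days keeps only whole days of the difference.
    out['days'] = (out['last_buy_time'] - out['first_action_time']).dt.days
    return out[out['days'] > min_days]

toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'type':    [1, 4, 1, 4, 1],
    'time': pd.to_datetime([
        '2016-02-01 08:00', '2016-02-04 09:00',  # user 1: browses, buys 3 days later
        '2016-02-01 08:00', '2016-02-01 20:00',  # user 2: buys the same day
        '2016-02-01 08:00',                      # user 3: never buys
    ]),
})
result = find_high_potential(toy)
print(result['user_id'].tolist())
```

Only user 1 qualifies here: user 2 buys within a day of the first action, and user 3 never buys.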

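2. Analyzing high-potential users

# Print the unique values of the 'type' column
print(action['type'].unique())  # [6 1 4 2 3 5]

# By the funnel principle, purchases should be the rarest behavior,
# so type=4 is taken to be purchase and 1, 2, 3, 5, 6 are the other behaviors
# Count how often each value of 'type' occurs
print(action['type'].value_counts())
# 6    8219746
# 1    4715843
# 2     154935
# 3      72294
# 5      24837
# 4      12279
# Name: count, dtype: int64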
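
To put the funnel argument in numbers, the counts can be turned into shares of all recorded actions. This is a small sketch that copies the counts from the output above; the names `type_counts`, `total`, and `share_pct` are illustrative and not part of the article's code.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
type_counts = pd.Series(
    {6: 8219746, 1: 4715843, 2: 154935, 3: 72294, 5: 24837, 4: 12279},
    name='count',
)

total = int(type_counts.sum())
# Share of each behavior type among all recorded actions, in percent.
share_pct = (type_counts / total * 100).round(3)
print(total)
print(share_pct)
```

Type 4 comes out as the smallest share, roughly 0.093% of all actions, which is consistent with treating it as the purchase step at the bottom of the funnel.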

# Select the rows where 'type' is 4 (purchases)
action_type4 = action[action['type'] == 4]
print(action_type4)

# Only one category is analyzed here
# Narrow the purchases down to rows where 'cate' is also 4
action_type4 = action_type4[action_type4['cate'] == 4]
print(action_type4)

# Group by 'user_id' and take the maximum of 'time' for each user (time of last purchase)
ac_lastbuytime = action_type4.groupby('user_id')['time'].max().reset_index()
print(ac_lastbuytime)

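# Merge the last purchase times with the original data set on 'user_id'
# Both frames have a 'time' column, so the result holds 'time_x' (last purchase) and 'time_y' (action time)
# Note: action is not filtered by cate or sku_id here, so all of each buyer's actions are included
ac_all_buy = pd.merge(ac_lastbuytime, action, on='user_id')
print(ac_all_buy.head())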
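
The `time_y` column used in the next step comes from how `pd.merge` handles clashing column names. The sketch below uses two made-up frames (`last_buy` and `actions` are hypothetical names) to show the default suffixes and the effect of the default inner join.

```python
import pandas as pd

# Made-up last purchase time for a single buyer.
last_buy = pd.DataFrame({
    'user_id': [1],
    'time': pd.to_datetime(['2016-02-04 09:00']),
})
# Made-up action log containing a buyer (1) and a non-buyer (2).
actions = pd.DataFrame({
    'user_id': [1, 1, 2],
    'time': pd.to_datetime(['2016-02-01 08:00', '2016-02-04 09:00',
                            '2016-02-02 10:00']),
    'type': [1, 4, 1],
})

# Both frames have a non-key column named 'time', so pandas appends the
# default suffixes: '_x' for the left frame and '_y' for the right one.
merged = pd.merge(last_buy, actions, on='user_id')
print(merged.columns.tolist())
print(len(merged))
```

Because the default is an inner join, user 2, who has no purchase in `last_buy`, drops out of the result.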

# Print the shape of the merged DataFrame
print(ac_all_buy.shape)  # (1457770, 6)

# Group the merged data by user and take the minimum of 'time_y' (time of first interaction)
ac_firsttime = ac_all_buy.groupby('user_id')['time_y'].min().reset_index()
print(ac_firsttime)

# Merge the last purchase times with the first interaction times on 'user_id'
df = pd.merge(ac_lastbuytime, ac_firsttime, on='user_id')
# Rename the columns of the merged DataFrame
df.columns = ['user_id', 'last_buy_time', 'first_action_time']
print(df.head())

# Print summary information for the merged DataFrame
df.info()

# Compute the difference between the two time columns in whole days and store it as a new column 'days'
df['days'] = (pd.to_datetime(df['last_buy_time']) - pd.to_datetime(df['first_action_time'])).dt.days
print(df.head())
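
One detail worth checking is that `.dt.days` returns only the whole-day component of a time difference. The sketch below, on two made-up rows (the `demo` frame and the `days_exact` column are illustrative), compares it with the fractional number of days.

```python
import pandas as pd

demo = pd.DataFrame({
    'first_action_time': pd.to_datetime(['2016-02-01 08:00', '2016-02-01 08:00']),
    'last_buy_time':     pd.to_datetime(['2016-02-03 07:00', '2016-02-03 09:00']),
})
delta = demo['last_buy_time'] - demo['first_action_time']

# .dt.days keeps only the whole-day component of each timedelta.
demo['days'] = delta.dt.days
# Fractional days, for comparison.
demo['days_exact'] = delta / pd.Timedelta(days=1)
print(demo[['days', 'days_exact']])
```

The first row spans 1 day 23 hours and gets `days == 1`, so a filter such as `days > 1` effectively requires at least two full days between the first action and the last purchase.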

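# Keep the rows where 'days' is greater than 1
# ('days' holds whole days only, so this means a gap of at least 2 full days)
high_pot = df[df['days'] > 1]
print(high_pot)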

# Read the user data from a CSV file
user = pd.read_csv(r'F:\data\user.csv')
print(user.head())

# Merge the high-potential users with the user data on 'user_id'
user_high = pd.merge(user, high_pot, on='user_id')
print(user_high)

# Save the merged data to a CSV file (index=False avoids writing the row index as an extra column)
user_high.to_csv('./user_high.csv', index=False)

V. Visualization

# Age analysis: print the unique values of the 'age' column
print(user_high['age'].unique())  # [ 3.  2. -1.  4.  6.  5.]

# Group by 'age' and count the users in each age group
user_age_count = user_high.groupby('age').count()
print(user_age_count)
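
`groupby('age').count()` returns one count per remaining column, and each count covers only the non-null values in that column. The sketch below uses a made-up frame (`toy` is hypothetical) to show how that can differ from `size()`, which counts rows per group.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in 'sex'.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'age':     [3.0, 3.0, 4.0, -1.0],
    'sex':     [0.0, np.nan, 1.0, 2.0],
})

# count() reports non-null values per column, so columns can disagree.
per_column = toy.groupby('age').count()
# size() reports the number of rows in each group.
per_group = toy.groupby('age').size()
print(per_column)
print(per_group)
```

Taking the `user_id` column of the `count()` result, as the article does next, gives the number of users per age value as long as `user_id` has no missing entries.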

from pyecharts.charts import Bar  # only the Bar chart class is needed here
# Sort the user counts in descending order
dd = user_age_count['user_id'].sort_values(ascending=False)
print(dd)
# age
#  3.0    1012
#  4.0     485
# -1.0     169
#  2.0     136
#  5.0      57
#  6.0      23
# Name: user_id, dtype: int64

# Age values in sorted order
x_data = dd.index.tolist()
print(x_data)  # [3.0, 4.0, -1.0, 2.0, 5.0, 6.0]
# User counts in sorted order
y_data = dd.values.tolist()
print(y_data)  # [1012, 485, 169, 136, 57, 23]
# Draw a bar chart with pyecharts
# The age values are passed as strings, which is the safer choice for a category axis
bar = (
    Bar()
    .add_xaxis([str(x) for x in x_data])
    .add_yaxis('Age distribution', y_data)
)
# Save the chart as an HTML file
bar.render('./bar.html')
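
The raw counts can also be expressed as shares of all high-potential users, which may be easier to read on a chart. This is a sketch that copies the counts from the sorted output above; `age_counts`, `share_pct`, `labels`, and `values` are illustrative names, and the sketch only prepares the data without calling pyecharts.

```python
import pandas as pd

# Counts copied from the sorted output above (age value -> number of users).
age_counts = pd.Series(
    {3.0: 1012, 4.0: 485, -1.0: 169, 2.0: 136, 5.0: 57, 6.0: 23}
)

total_users = int(age_counts.sum())
# Share of each age value among all high-potential users, in percent.
share_pct = (age_counts / total_users * 100).round(1)

# String labels and plain Python floats, ready to pass to a chart.
labels = [str(code) for code in share_pct.index]
values = share_pct.tolist()
print(total_users)
print(list(zip(labels, values)))
```

Age value 3.0 accounts for a little over half of the 1,882 high-potential users. The `labels` and `values` lists could be handed to `add_xaxis` and `add_yaxis` in place of the raw counts.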