Electronics Sales Insights: Monthly Trends, User Behavior, and Store Strategy

Article link: https://blog.csdn.net/weixin_53029015/article/details/117606871

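Project Background

This report uses the sales data of an electronics store as its dataset. It explores the data from both the store's and the users' perspectives to understand how the online business is performing and how users spend, and ends with sales recommendations for the store.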

Data Cleaning

# Import libraries
import pandas as pd
import numpy as np
import os
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
df = pd.read_csv(r"D:\数据分析项目\电子产品销售\电子产品销售分析.csv")
df.head()

	Unnamed: 0	event_time	order_id	product_id	category_id	category_code	brand	price	user_id	age	sex	local
0	0	2020-04-24 11:50:39 UTC	2294359932054536986	1515966223509089906	2.268105e+18	electronics.tablet	samsung	162.01	1.515916e+18	24.0	女	海南
1	1	2020-04-24 11:50:39 UTC	2294359932054536986	1515966223509089906	2.268105e+18	electronics.tablet	samsung	162.01	1.515916e+18	24.0	女	海南
2	2	2020-04-24 14:37:43 UTC	2294444024058086220	2273948319057183658	2.268105e+18	electronics.audio.headphone	huawei	77.52	1.515916e+18	38.0	女	北京
3	3	2020-04-24 14:37:43 UTC	2294444024058086220	2273948319057183658	2.268105e+18	electronics.audio.headphone	huawei	77.52	1.515916e+18	38.0	女	北京
4	4	2020-04-24 19:16:21 UTC	2294584263154074236	2273948316817424439	2.268105e+18	NaN	karcher	217.57	1.515916e+18	32.0	女	广东

Selecting a Subset

The first column is only a row number. The DataFrame already has an index, so this column is dropped.

df.drop(columns=["Unnamed: 0"], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564169 entries, 0 to 564168
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   event_time     564169 non-null  object 
 1   order_id       564169 non-null  int64  
 2   product_id     564169 non-null  int64  
 3   category_id    564169 non-null  float64
 4   category_code  434799 non-null  object 
 5   brand          536945 non-null  object 
 6   price          564169 non-null  float64
 7   user_id        564169 non-null  float64
 8   age            564169 non-null  float64
 9   sex            564169 non-null  object 
 10  local          564169 non-null  object 
dtypes: float64(4), int64(2), object(5)
memory usage: 47.3+ MB
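
As a side note, an `Unnamed: 0` column typically shows up when a CSV was saved with its index included. The sketch below uses a tiny made-up CSV (an assumption for illustration, not the real file) to compare two ways of handling it: dropping the column after loading, as above, or passing `index_col=0` to `pd.read_csv` so the column becomes the index directly.

```python
import io
import pandas as pd

# Made-up two-row CSV whose first header cell is empty, like the real file seems to be.
csv_text = (
    ",event_time,price\n"
    "0,2020-04-24 11:50:39 UTC,162.01\n"
    "1,2020-04-24 14:37:43 UTC,77.52\n"
)

# Option 1: load everything, then drop the redundant column.
with_extra = pd.read_csv(io.StringIO(csv_text))
dropped = with_extra.drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv to use the first column as the index.
direct = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(dropped.columns))
print(list(direct.columns))
```

Either way the remaining columns are the same; the second option just saves a step.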

Standardizing Data Types

df.dtypes

event_time        object
order_id           int64
product_id         int64
category_id      float64
category_code     object
brand             object
price            float64
user_id          float64
age              float64
sex               object
local             object
dtype: object
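
One thing worth keeping in mind about the dtypes above: `user_id` and `category_id` are stored as float64. A float64 can only represent integers exactly up to 2**53, and the IDs here are around 1.5e18, so distinct IDs that are close together may end up as the same float. The sketch below illustrates this with hypothetical IDs of a similar magnitude (they are not taken from the dataset). Whether it matters here depends on how the IDs were written to the CSV in the first place.

```python
import pandas as pd

# Hypothetical IDs near 1.5e18. At this magnitude float64 values are 256 apart,
# so start from a multiple of 256 to make the rounding easy to follow.
base = (1515915625440000000 // 256) * 256
ids = pd.Series([base + 1, base + 2, base + 300], dtype="int64")

as_float = ids.astype("float64")

n_unique_int = ids.nunique()        # three distinct integer IDs
n_unique_float = as_float.nunique() # base+1 and base+2 both round to base

print(n_unique_int, n_unique_float)
```

If exact user counts matter, it may be safer to read such columns as integers or strings where the file allows it.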

# Convert event_time to datetime (the first 19 characters drop the trailing " UTC")
df['event_time'] = pd.to_datetime(df['event_time'].str[:19],format="%Y-%m-%d %H:%M:%S")

# Derive time variables
df['Month'] = df['event_time'].dt.month
df['Day'] = df['event_time'].dt.day
df['Dayofweek'] = df['event_time'].dt.dayofweek  # Monday=0 ... Sunday=6
df['hour'] = df['event_time'].dt.hour
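
To make the conversion above easier to check, here is a minimal sketch on two timestamps taken from the sample rows. It repeats the same steps: slice off the " UTC" suffix, parse with an explicit format, then pull out the date parts with the `.dt` accessor. The names `raw`, `parsed` and `parts` are only used for this illustration.

```python
import pandas as pd

raw = pd.Series(["2020-04-24 11:50:39 UTC", "2020-04-26 08:45:57 UTC"])

# Keep "YYYY-MM-DD HH:MM:SS" (19 characters) and parse it as a naive datetime.
parsed = pd.to_datetime(raw.str[:19], format="%Y-%m-%d %H:%M:%S")

parts = pd.DataFrame({
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
    "Dayofweek": parsed.dt.dayofweek,  # Monday=0, so Friday=4 and Sunday=6
    "hour": parsed.dt.hour,
})
print(parts)
```

Note that the parsed values carry no timezone information, so the hours are still UTC hours and not local time.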

df.head()

	event_time	order_id	product_id	category_id	category_code	brand	price	user_id	age	sex	local	Month	Day	Dayofweek	hour
0	2020-04-24 11:50:39	2294359932054536986	1515966223509089906	2.268105e+18	electronics.tablet	samsung	162.01	1.515916e+18	24.0	女	海南	4	24	4	11
1	2020-04-24 11:50:39	2294359932054536986	1515966223509089906	2.268105e+18	electronics.tablet	samsung	162.01	1.515916e+18	24.0	女	海南	4	24	4	11
2	2020-04-24 14:37:43	2294444024058086220	2273948319057183658	2.268105e+18	electronics.audio.headphone	huawei	77.52	1.515916e+18	38.0	女	北京	4	24	4	14
3	2020-04-24 14:37:43	2294444024058086220	2273948319057183658	2.268105e+18	electronics.audio.headphone	huawei	77.52	1.515916e+18	38.0	女	北京	4	24	4	14
4	2020-04-24 19:16:21	2294584263154074236	2273948316817424439	2.268105e+18	NaN	karcher	217.57	1.515916e+18	32.0	女	广东	4	24	4	19

Handling Missing and Duplicate Values

# Check for missing values
df.isnull().sum()

event_time            0
order_id              0
product_id            0
category_id           0
category_code    129370
brand             27224
price                 0
user_id               0
age                   0
sex                   0
local                 0
Month                 0
Day                   0
Dayofweek             0
hour                  0
dtype: int64

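# Two columns have missing values: category_code is missing 129,370 rows and brand is missing 27,224 rows.
# These gaps do not materially affect the analysis of store sales or user behavior,
# but the other fields in those rows do matter, so the rows are kept and the gaps are filled with 'missing'.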

df.fillna('missing',inplace=True)

df.isnull().sum()
# All missing values have been filled

event_time       0
order_id         0
product_id       0
category_id      0
category_code    0
brand            0
price            0
user_id          0
age              0
sex              0
local            0
Month            0
Day              0
Dayofweek        0
hour             0
dtype: int64
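
The fill above is applied to the whole DataFrame, which works here because only two columns contain gaps. As an alternative sketch, `fillna` also accepts a dict so that only named columns are touched. The toy frame below is made up for illustration and reuses a few values from the sample rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "category_code": ["electronics.tablet", np.nan, "electronics.audio.headphone"],
    "brand": ["samsung", "karcher", np.nan],
    "price": [162.01, 217.57, 77.52],
})

# Fill only the two text columns; numeric columns are left alone.
filled = toy.fillna({"category_code": "missing", "brand": "missing"})
missing_after = filled.isnull().sum()

print(filled)
print(missing_after)
```

Restricting the fill this way avoids accidentally writing the string 'missing' into a numeric column if one ever contains NaN.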

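# Check for and handle duplicate rows
df.duplicated().sum()   # number of rows that exactly repeat an earlier row
df.drop_duplicates()    # returns a de-duplicated copy; df itself is unchanged unless the result is assigned back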

	event_time	order_id	product_id	category_id	category_code	brand	price	user_id	age	sex	local	Month	Day	Dayofweek	hour
0	2020-04-24 11:50:39	2294359932054536986	1515966223509089906	2.268105e+18	electronics.tablet	samsung	162.01	1.515916e+18	24.0	女	海南	4	24	4	11
2	2020-04-24 14:37:43	2294444024058086220	2273948319057183658	2.268105e+18	electronics.audio.headphone	huawei	77.52	1.515916e+18	38.0	女	北京	4	24	4	14
4	2020-04-24 19:16:21	2294584263154074236	2273948316817424439	2.268105e+18	missing	karcher	217.57	1.515916e+18	32.0	女	广东	4	24	4	19
5	2020-04-26 08:45:57	2295716521449619559	1515966223509261697	2.268105e+18	furniture.kitchen.table	maestro	39.33	1.515916e+18	20.0