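Electronics Sales Data Analysis

Project Background

This report uses sales data for electronic products as its dataset. It explores the data from both the store's and the users' perspectives to understand how the online sales business is performing and how users behave as buyers, and it closes with sales recommendations for the store.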

Data Cleaning

#Import libraries
import pandas as pd
import numpy as np
import os
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
#Load the data
df = pd.read_csv(r"D:\数据分析项目\电子产品销售\电子产品销售分析.csv")
df.head()
Unnamed: 0 event_time order_id product_id category_id category_code brand price user_id age sex local
0 0 2020-04-24 11:50:39 UTC 2294359932054536986 1515966223509089906 2.268105e+18 electronics.tablet samsung 162.01 1.515916e+18 24.0 海南
1 1 2020-04-24 11:50:39 UTC 2294359932054536986 1515966223509089906 2.268105e+18 electronics.tablet samsung 162.01 1.515916e+18 24.0 海南
2 2 2020-04-24 14:37:43 UTC 2294444024058086220 2273948319057183658 2.268105e+18 electronics.audio.headphone huawei 77.52 1.515916e+18 38.0 北京
3 3 2020-04-24 14:37:43 UTC 2294444024058086220 2273948319057183658 2.268105e+18 electronics.audio.headphone huawei 77.52 1.515916e+18 38.0 北京
4 4 2020-04-24 19:16:21 UTC 2294584263154074236 2273948316817424439 2.268105e+18 NaN karcher 217.57 1.515916e+18 32.0 广东

Selecting a Subset

The first column is only a row number. The DataFrame already has an index, so this column is dropped.

df.drop(["Unnamed: 0"],axis=1,inplace= True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564169 entries, 0 to 564168
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   event_time     564169 non-null  object 
 1   order_id       564169 non-null  int64  
 2   product_id     564169 non-null  int64  
 3   category_id    564169 non-null  float64
 4   category_code  434799 non-null  object 
 5   brand          536945 non-null  object 
 6   price          564169 non-null  float64
 7   user_id        564169 non-null  float64
 8   age            564169 non-null  float64
 9   sex            564169 non-null  object 
 10  local          564169 non-null  object 
dtypes: float64(4), int64(2), object(5)
memory usage: 47.3+ MB
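
One detail in the `info()` output is worth a second look: `category_id` and `user_id` are stored as `float64`. IDs of this size (around 1.5e18) are larger than 2**53, so a float cannot represent every one of them exactly, and distinct users could collapse into the same value. The sketch below illustrates the effect with made-up IDs and shows one possible workaround, reading ID columns as strings. It assumes the raw CSV stores the full integer IDs; if the file itself already holds values like `1.515916e+18`, the lost digits cannot be recovered this way.

```python
import io

import pandas as pd

# Two distinct, made-up IDs of the same magnitude as user_id in the dataset.
id_a = 1_515_916 * 10**12
id_b = id_a + 1

as_int = pd.Series([id_a, id_b], dtype="int64")
as_float = as_int.astype("float64")

# Near 1.5e18 neighbouring float64 values are 256 apart, so the two IDs merge.
print(as_int.nunique(), as_float.nunique())

# Possible workaround: parse ID columns as text so no digits are lost.
# The CSV content here is a hypothetical two-column sample, not the real file.
csv_text = "order_id,user_id\n2294359932054536986,1515915625441993984\n"
ids = pd.read_csv(io.StringIO(csv_text), dtype={"order_id": str, "user_id": str})
print(ids.loc[0, "user_id"])
```

Keeping IDs as strings costs some memory, but grouping and counting by user then works on exact values.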

Standardization

df.dtypes
event_time        object
order_id           int64
product_id         int64
category_id      float64
category_code     object
brand             object
price            float64
user_id          float64
age              float64
sex               object
local             object
dtype: object
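#Convert data types: keep the first 19 characters (dropping the " UTC" suffix) and parse as datetime
df['event_time'] = pd.to_datetime(df['event_time'].str[:19],format="%Y-%m-%d %H:%M:%S")
#Derive time features (Dayofweek: Monday = 0)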
df['Month'] = df['event_time'].dt.month
df['Day'] = df['event_time'].dt.day
df['Dayofweek']=df['event_time'].dt.dayofweek
df['hour'] = df['event_time'].dt.hour
df.head()
event_time order_id product_id category_id category_code brand price user_id age sex local Month Day Dayofweek hour
0 2020-04-24 11:50:39 2294359932054536986 1515966223509089906 2.268105e+18 electronics.tablet samsung 162.01 1.515916e+18 24.0 海南 4 24 4 11
1 2020-04-24 11:50:39 2294359932054536986 1515966223509089906 2.268105e+18 electronics.tablet samsung 162.01 1.515916e+18 24.0 海南 4 24 4 11
2 2020-04-24 14:37:43 2294444024058086220 2273948319057183658 2.268105e+18 electronics.audio.headphone huawei 77.52 1.515916e+18 38.0 北京 4 24 4 14
3 2020-04-24 14:37:43 2294444024058086220 2273948319057183658 2.268105e+18 electronics.audio.headphone huawei 77.52 1.515916e+18 38.0 北京 4 24 4 14
4 2020-04-24 19:16:21 2294584263154074236 2273948316817424439 2.268105e+18 NaN karcher 217.57 1.515916e+18 32.0 广东 4 24 4 19
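
Because the " UTC" suffix is sliced off before parsing, the derived `hour` and `Day` columns are in UTC. If the buyers are in mainland China (an assumption, suggested only by the province names in `local`), the local hour is eight hours later and the calendar day can shift. A minimal sketch of converting before deriving features, using two timestamps from the sample above and the assumed zone `Asia/Shanghai`:

```python
import pandas as pd

raw = pd.Series(["2020-04-24 11:50:39 UTC", "2020-04-24 19:16:21 UTC"])

# Parse as before, then mark the values as UTC explicitly.
utc_time = pd.to_datetime(raw.str[:19], format="%Y-%m-%d %H:%M:%S").dt.tz_localize("UTC")

# Assumed target zone; replace it if the store operates elsewhere.
local_time = utc_time.dt.tz_convert("Asia/Shanghai")

features = pd.DataFrame({
    "utc_hour": utc_time.dt.hour,
    "local_hour": local_time.dt.hour,
    "local_day": local_time.dt.day,
})
print(features)
```

Whether to convert depends on the question being asked: hourly and day-of-week purchase patterns are easier to interpret in the customers' local time.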

Handling Missing Values and Duplicates

#Check for missing values
df.isnull().sum()
event_time            0
order_id              0
product_id            0
category_id           0
category_code    129370
brand             27224
price                 0
user_id               0
age                   0
sex                   0
local                 0
Month                 0
Day                   0
Dayofweek             0
hour                  0
dtype: int64
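#Two columns have missing values: category_code is missing 129370 rows and brand is missing 27224 rows. These gaps have little effect on the analysis of store sales or user purchasing behavior,
#but the other columns in those rows still matter, so the rows are kept and the missing values in these two columns are filled with 'missing'.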
df.fillna('missing',inplace=True)
df.isnull().sum()
#All missing values have been filled
event_time       0
order_id         0
product_id       0
category_id      0
category_code    0
brand            0
price            0
user_id          0
age              0
sex              0
local            0
Month            0
Day              0
Dayofweek        0
hour             0
dtype: int64
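
Calling `fillna('missing')` on the whole DataFrame works here because only the two text columns contain gaps. If a numeric column such as `price` ever had missing values, the same call would put the string `'missing'` into it. A slightly more defensive variant, sketched on a small made-up frame, restricts the fill to the two text columns:

```python
import numpy as np
import pandas as pd

# Made-up rows; the NaN in price is added only to show that it is left untouched.
sample = pd.DataFrame({
    "category_code": ["electronics.tablet", np.nan],
    "brand": ["samsung", np.nan],
    "price": [162.01, np.nan],
})

text_cols = ["category_code", "brand"]
sample[text_cols] = sample[text_cols].fillna("missing")

print(sample)
print(sample.isnull().sum())
```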
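#Check for and remove duplicate rows
df.duplicated()
#drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
df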
event_time order_id product_id category_id category_code brand price user_id age sex local Month Day Dayofweek hour
0 2020-04-24 11:50:39 2294359932054536986 1515966223509089906 2.268105e+18 electronics.tablet samsung 162.01 1.515916e+18 24.0 海南 4 24 4 11
2 2020-04-24 14:37:43 2294444024058086220 2273948319057183658 2.268105e+18 electronics.audio.headphone huawei 77.52 1.515916e+18 38.0 北京 4 24 4 14
4 2020-04-24 19:16:21 2294584263154074236 2273948316817424439 2.268105e+18 missing karcher 217.57 1.515916e+18 32.0 广东 4 24 4 19
5 2020-04-26 08:45:57 2295716521449619559 1515966223509261697 2.268105e+18 furniture.kitchen.table maestro 39.33 1.515916e+18 20.0
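
Two things are easy to miss in the duplicate step. First, `drop_duplicates()` returns a new DataFrame and leaves the original unchanged unless the result is assigned back. Second, it helps to count the duplicates before removing them, since identical rows in order data might also represent several units of the same product in one order (the dataset does not say which). A small sketch on made-up rows:

```python
import pandas as pd

# Made-up order lines; the first two rows are identical.
orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 10, 20],
    "price": [162.01, 162.01, 77.52],
})

# Number of rows that repeat an earlier row exactly.
n_dupes = int(orders.duplicated().sum())

# The original frame is untouched; the deduplicated copy is a new object.
deduped = orders.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(orders), len(deduped))
```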