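Project Background
This report uses sales data from an electronics retailer as its dataset. It explores the data from both the store's and the users' perspectives to understand how the online sales business is performing and how users behave as shoppers, and closes with sales recommendations for the store.
Data Cleaning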
import pandas as pd
import numpy as np
import os
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"D:\数据分析项目\电子产品销售\电子产品销售分析.csv")
df.head()
| | Unnamed: 0 | event_time | order_id | product_id | category_id | category_code | brand | price | user_id | age | sex | local |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 | 24.0 | 女 | 海南 |
| 1 | 1 | 2020-04-24 11:50:39 UTC | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 | 24.0 | 女 | 海南 |
| 2 | 2 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 | 38.0 | 女 | 北京 |
| 3 | 3 | 2020-04-24 14:37:43 UTC | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 | 38.0 | 女 | 北京 |
| 4 | 4 | 2020-04-24 19:16:21 UTC | 2294584263154074236 | 2273948316817424439 | 2.268105e+18 | NaN | karcher | 217.57 | 1.515916e+18 | 32.0 | 女 | 广东 |
Selecting a Subset
The first column is just a row number. The DataFrame already has an index, so drop it.
df.drop(columns=["Unnamed: 0"], inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564169 entries, 0 to 564168
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event_time 564169 non-null object
1 order_id 564169 non-null int64
2 product_id 564169 non-null int64
3 category_id 564169 non-null float64
4 category_code 434799 non-null object
5 brand 536945 non-null object
6 price 564169 non-null float64
7 user_id 564169 non-null float64
8 age 564169 non-null float64
9 sex 564169 non-null object
10 local 564169 non-null object
dtypes: float64(4), int64(2), object(5)
memory usage: 47.3+ MB
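As an alternative to dropping the row-number column after loading, `pd.read_csv` can use it as the index directly through its `index_col` parameter. The sketch below uses a small hypothetical in-memory CSV (`sample_csv`, with only a few of the real columns) rather than the actual file, so the column names and values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the exported file: the empty first
# header is what pandas names "Unnamed: 0" on a default read.
sample_csv = (
    ",event_time,price\n"
    "0,2020-04-24 11:50:39 UTC,162.01\n"
    "1,2020-04-24 14:37:43 UTC,77.52\n"
)

# Default read: the row numbers come back as a regular column.
with_extra = pd.read_csv(io.StringIO(sample_csv))

# index_col=0 uses that first column as the index instead.
without_extra = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach ends with the same columns; reading with `index_col=0` simply avoids the extra `drop` step.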
Standardization
df.dtypes
event_time object
order_id int64
product_id int64
category_id float64
category_code object
brand object
price float64
user_id float64
age float64
sex object
local object
dtype: object
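One thing worth checking in the dtypes above: `user_id` and `category_id` are stored as `float64`, and a float64 cannot represent every integer of this size exactly, so two distinct IDs may collapse into one value. The sketch below illustrates the effect with two made-up IDs (they are not taken from the dataset). Whether the raw CSV still holds full-precision IDs is not known from this report; if it does, passing something like `dtype={"user_id": "string"}` to `pd.read_csv` would be one way to keep them intact.

```python
import pandas as pd

# Two hypothetical 19-digit IDs that differ only in the last digit.
ids = pd.Series([1515915625441993984, 1515915625441993985], dtype="int64")

# Converting to float64 rounds both to the same representable value.
as_float = ids.astype("float64")

distinct_as_int = ids.nunique()
distinct_as_float = as_float.nunique()

print(distinct_as_int, distinct_as_float)
```

If user counts or per-user aggregates matter later in the analysis, comparing `nunique()` on the float column against a string-typed read of the same column is a quick sanity check.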
df['event_time'] = pd.to_datetime(df['event_time'].str[:19],format="%Y-%m-%d %H:%M:%S")
df['Month'] = df['event_time'].dt.month
df['Day'] = df['event_time'].dt.day
df['Dayofweek']=df['event_time'].dt.dayofweek
df['hour'] = df['event_time'].dt.hour
df.head()
| | event_time | order_id | product_id | category_id | category_code | brand | price | user_id | age | sex | local | Month | Day | Dayofweek | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-24 11:50:39 | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 | 24.0 | 女 | 海南 | 4 | 24 | 4 | 11 |
| 1 | 2020-04-24 11:50:39 | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 | 24.0 | 女 | 海南 | 4 | 24 | 4 | 11 |
| 2 | 2020-04-24 14:37:43 | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 | 38.0 | 女 | 北京 | 4 | 24 | 4 | 14 |
| 3 | 2020-04-24 14:37:43 | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 | 38.0 | 女 | 北京 | 4 | 24 | 4 | 14 |
| 4 | 2020-04-24 19:16:21 | 2294584263154074236 | 2273948316817424439 | 2.268105e+18 | NaN | karcher | 217.57 | 1.515916e+18 | 32.0 | 女 | 广东 | 4 | 24 | 4 | 19 |
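Two details of the derived time columns are easy to misread. `dt.dayofweek` counts from Monday as 0, so the value 4 above means Friday. The `hour` column is in UTC, because the " UTC" suffix was sliced off before parsing. The sketch below shows how the hour could be shifted to a local time zone before any hour-of-day analysis. It assumes shoppers are on China Standard Time (`Asia/Shanghai`), which the province names in `local` suggest but the report does not state.

```python
import pandas as pd

# Two raw timestamps in the same format as the event_time column.
raw = pd.Series(["2020-04-24 11:50:39 UTC", "2020-04-24 19:16:21 UTC"])

# Same parsing as in the report: drop the " UTC" suffix, then parse.
ts = pd.to_datetime(raw.str[:19], format="%Y-%m-%d %H:%M:%S")

# Mark the values as UTC, then convert to the assumed local zone.
ts_local = ts.dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")

features = pd.DataFrame({
    "dayofweek": ts.dt.dayofweek,   # Monday=0 ... Sunday=6
    "hour_utc": ts.dt.hour,
    "hour_local": ts_local.dt.hour,
})
print(features)
```

With an 8-hour offset, an order placed at 19:16 UTC falls at 03:16 local time on the following day, so the day and day-of-week columns would shift as well if they were derived from the local timestamps.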
Handling Missing and Duplicate Values
df.isnull().sum()
event_time 0
order_id 0
product_id 0
category_id 0
category_code 129370
brand 27224
price 0
user_id 0
age 0
sex 0
local 0
Month 0
Day 0
Dayofweek 0
hour 0
dtype: int64
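To put these counts in proportion, they can be expressed as a share of all rows. The sketch below plugs in the figures reported above (564169 rows, 129370 missing `category_code` values, 27224 missing `brand` values); on the real DataFrame, `df.isnull().mean()` gives the same ratios directly.

```python
import pandas as pd

# Figures taken from the df.info() and df.isnull().sum() output above.
total_rows = 564169
missing_counts = pd.Series({"category_code": 129370, "brand": 27224})

# Share of rows with a missing value, in percent.
missing_share = (missing_counts / total_rows * 100).round(2)
print(missing_share)
```

Roughly 23% of rows lack a category and about 5% lack a brand, which is too many to drop without shrinking the dataset noticeably, so filling with a placeholder is a reasonable choice here.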
df.fillna('missing',inplace=True)
df.isnull().sum()
event_time 0
order_id 0
product_id 0
category_id 0
category_code 0
brand 0
price 0
user_id 0
age 0
sex 0
local 0
Month 0
Day 0
Dayofweek 0
hour 0
dtype: int64
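Calling `fillna('missing')` on the whole DataFrame works here because only the two text columns contain gaps. A slightly more defensive variant, sketched below on a small made-up frame (`toy`), passes a dict so that only the named columns are filled and numeric columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaN in price is invented for illustration,
# the real price column has no missing values.
toy = pd.DataFrame({
    "category_code": ["electronics.tablet", np.nan],
    "brand": [np.nan, "karcher"],
    "price": [162.01, np.nan],
})

# Fill only the text columns; price keeps its NaN and its float dtype.
filled = toy.fillna({"category_code": "missing", "brand": "missing"})
print(filled)
```

This keeps a stray string from ever landing in a numeric column, which would otherwise turn that column into `object` dtype and break later aggregations.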
# drop_duplicates() returns a new DataFrame, so assign the result back to df
df = df.drop_duplicates()
df
| | event_time | order_id | product_id | category_id | category_code | brand | price | user_id | age | sex | local | Month | Day | Dayofweek | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-24 11:50:39 | 2294359932054536986 | 1515966223509089906 | 2.268105e+18 | electronics.tablet | samsung | 162.01 | 1.515916e+18 | 24.0 | 女 | 海南 | 4 | 24 | 4 | 11 |
| 2 | 2020-04-24 14:37:43 | 2294444024058086220 | 2273948319057183658 | 2.268105e+18 | electronics.audio.headphone | huawei | 77.52 | 1.515916e+18 | 38.0 | 女 | 北京 | 4 | 24 | 4 | 14 |
| 4 | 2020-04-24 19:16:21 | 2294584263154074236 | 2273948316817424439 | 2.268105e+18 | missing | karcher | 217.57 | 1.515916e+18 | 32.0 | 女 | 广东 | 4 | 24 | 4 | 19 |
5 |
2020-04-26 08:45:57 |
2295716521449619559 |
1515966223509261697 |
2.268105e+18 |
furniture.kitchen.table |
maestro |
39.33 |
1.515916e+18 |
20.0 | <