Market Basket Analysis
Variable Descriptions
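Variable | Meaning | Notes |
---|---|---|
ReceiptID | Receipt number | |
Value | Amount paid | |
pmethod | Payment method | 1 = cash, 2 = credit card, 3 = electronic payment, 4 = other |
sex | Sex | 1 = male, 2 = female |
homeown | Home ownership | 1 = yes, 2 = no, 3 = unknown |
income | Income | |
age | Age | |
Other columns | Product categories | Quantity purchased in each product category |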
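The coded columns in the table can be mapped to readable labels before plotting or grouping. The following is only a minimal sketch: the toy DataFrame, the dictionary names, and the English label strings are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Code-to-label mappings taken from the variable table above.
PMETHOD_LABELS = {1: "cash", 2: "credit card", 3: "electronic payment", 4: "other"}
SEX_LABELS = {1: "male", 2: "female"}
HOMEOWN_LABELS = {1: "yes", 2: "no", 3: "unknown"}

# Hypothetical toy rows standing in for the real receipts.
toy = pd.DataFrame({"pmethod": [1, 3, 4], "sex": [2, 1, 2], "homeown": [1, 3, 2]})

# Replace each numeric code with its label; unmapped codes would become NaN.
decoded = toy.assign(
    pmethod=toy["pmethod"].map(PMETHOD_LABELS),
    sex=toy["sex"].map(SEX_LABELS),
    homeown=toy["homeown"].map(HOMEOWN_LABELS),
)
print(decoded)
```

Using `assign` leaves the coded frame untouched, so the numeric codes remain available for modelling.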
Data Import
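%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure matplotlib for Chinese text: use the SimHei font and render the minus sign correctly
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False

# Load the data; 'path' is a placeholder for the workbook location.
# sheet_name=1 selects the second worksheet (the index is 0-based).
data = pd.read_excel('path', sheet_name=1)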
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58000 entries, 0 to 57999
Data columns (total 50 columns):
ReceiptID 58000 non-null int64
Value 58000 non-null float64
pmethod 58000 non-null int64
sex 58000 non-null int64
homeown 58000 non-null int64
income 57999 non-null float64
age 57999 non-null float64
PostCode 48208 non-null object
nchildren 57998 non-null float64
fruit 58000 non-null object
freshmeat 58000 non-null int64
dairy 58000 non-null int64
MozerallaCheese 58000 non-null int64
cannedveg 57999 non-null float64
cereal 57991 non-null float64
frozenmeal 58000 non-null int64
frozendessert 58000 non-null int64
PizzaBase 57999 non-null float64
TomatoSauce 58000 non-null int64
frozen fish 58000 non-null int64
bread 58000 non-null int64
milk 57999 non-null float64
softdrink 58000 non-null int64
fruitjuice 58000 non-null int64
confectionery 57999 non-null float64
fish 58000 non-null int64
vegetables 58000 non-null int64
icecream 58000 non-null int64
energydrink 58000 non-null int64
tea 58000 non-null int64
coffee 58000 non-null int64
laundryPowder 58000 non-null int64
householCleaners 58000 non-null int64
corn chips 58000 non-null int64
Frozen yogurt 58000 non-null int64
Chocolate 58000 non-null int64
Olive Oil 58000 non-null int64
Baby Food 58000 non-null int64
Napies 58000 non-null int64
banana 58000 non-null int64
cat food 58000 non-null int64
dog food 58000 non-null int64
mince 58000 non-null int64
sunflower Oil 58000 non-null int64
chicken 58000 non-null int64
vitamins 58000 non-null int64
deodorants 58000 non-null int64
dishwashingliquid 58000 non-null int64
onions 58000 non-null int64
lettuce 58000 non-null int64
dtypes: float64(9), int64(39), object(2)
memory usage: 22.1+ MB
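Conclusion
- There are 58,000 observations in total. Nine columns have missing values: PostCode (9,792 missing), cereal (9), nchildren (2), and income, age, cannedveg, PizzaBase, milk, and confectionery (1 each).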
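The missing-value counts can also be read off directly instead of being derived from the `info()` output. A minimal sketch, assuming a small made-up frame in place of the real `data`; the values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real `data` has 58,000 rows and 50 columns.
toy = pd.DataFrame({
    "income": [50000.0, np.nan, 72000.0],
    "age": [34.0, 41.0, np.nan],
    "bread": [1, 0, 2],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the real frame, the same two lines would list only the columns that need imputation or removal.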
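Data Preprocessing
PostCode has many missing values (9,792 of 58,000) and contributes little to the subsequent analysis, so the column is dropped.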
# Output file name ('path' is a placeholder)
outputfile = 'path'
# Drop the PostCode column
data.drop(['PostCode'], axis=1, inplace=True)
# Write the result to the output file
data.to_excel(outputfile)
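Impute income with the mode
# Output file name ('path' is a placeholder)
outputfile = 'path'
# Fill missing values with the mode (the most frequent value)
data['income'] = data['income'].fillna(data['income'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)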
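One detail of mode imputation is worth seeing on a small example: `Series.mode()` returns every tied mode in ascending order, so `mode()[0]` picks the smallest of them. The sketch below uses hypothetical income values, not figures from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical income values; 30000 and 50000 both appear twice.
income = pd.Series([50000.0, 30000.0, np.nan, 50000.0, 30000.0])

# mode() ignores NaN and returns all tied modes sorted ascending.
modes = income.mode()

# Fill the gap with the first (smallest) mode.
filled = income.fillna(modes[0])
print(modes.tolist())
print(filled.tolist())
```

When the mode is unique, as is typical for a large column, the tie-breaking rule has no effect.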
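Impute age with the mean
# Output file name ('path' is a placeholder)
outputfile = 'path'
# Fill missing values with the column mean
data['age'] = data['age'].fillna(data['age'].mean())
# Write the result to the output file
data.to_excel(outputfile)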
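Impute nchildren with the previous value (forward fill)
# Output file name ('path' is a placeholder)
outputfile = 'path'
# Fill each missing value with the preceding row's value;
# ffill() replaces fillna(method='pad'), which is deprecated in recent pandas versions
data['nchildren'] = data['nchildren'].ffill()
# Write the result to the output file
data.to_excel(outputfile)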
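Forward fill copies the value from the preceding row, so the result depends on row order, and a missing value in the very first row has nothing to copy from. A minimal sketch with hypothetical nchildren values:

```python
import numpy as np
import pandas as pd

# Hypothetical nchildren values with a gap at the start and in the middle.
nchildren = pd.Series([np.nan, 2.0, np.nan, np.nan, 0.0])

# Each NaN takes the last valid value seen above it.
filled = nchildren.ffill()
print(filled.tolist())

# The leading NaN is still missing after the fill.
print(int(filled.isnull().sum()))
```

Checking the remaining count after the fill is a cheap way to confirm that no leading gap was left behind.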
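Impute cannedveg with the mode
# Output file name ('path' is a placeholder)
outputfile = 'path'
# Fill missing values with the mode (the most frequent value)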
data['cannedveg'] = data['cannedveg'].fillna(data['cannedveg']