1 业务背景
数据集来源于:kaggle数据集(链接),该数据集记录了某全球超市四年的销售数据,通过分析该超市四年内的销售数据,从不同角度出发,分析经营现状,发掘提高销量的销售策略,利用数据找到新的增长点,并提出建议。
1.1 分析思路及指标
1. 数据总览 (Data Describe)
2. 数据预处理 (Data Preprocessing)
- 数据类型转换
- 空值、缺失值、异常值处理
3. 数据分析 (Data Analysis)
4. 模型构建 (Bulid Model)
- RFM 用户价值模型
- Kmeans 机器学习模型
5. 模型评估 (Evaluate)
- TSNE降维可视化
6. 结论建议 (Conclusion)
2 数据加载与清洗
2.1 载入数据分析库及数据
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
sns.set(style='white')
sns.set_context('notebook', font_scale=1.5, rc={
'lines.linewidth':2.5,'figure.figsize':(17, 10)})
large = 22; med = 16; small = 12
params = {
'axes.titlesize': large,
'legend.fontsize': med,
'figure.figsize': (16, 10),
'axes.labelsize': med,
'xtick.labelsize': med,
'ytick.labelsize': med,
'figure.titlesize': large}
plt.rcParams.update(params)
plt.rcParams['figure.figsize'] = (17, 10)
plt.rcParams["font.family"] = 'SimHei'
plt.rcParams["axes.unicode_minus"] = False
warnings.filterwarnings("ignore")
df = pd.read_csv('./superstore_dataset2011-2015.csv',encoding='ISO-8859-1')
df.head().append(df.tail())
Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | City | State | ... | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Shipping Cost | Order Priority | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42433 | AG-2011-2040 | 1/1/2011 | 6/1/2011 | Standard Class | TB-11280 | Toby Braunhardt | Consumer | Constantine | Constantine | ... | OFF-TEN-10000025 | Office Supplies | Storage | Tenex Lockers, Blue | 408.300 | 2 | 0.0 | 106.1400 | 35.46 | Medium |
1 | 22253 | IN-2011-47883 | 1/1/2011 | 8/1/2011 | Standard Class | JH-15985 | Joseph Holt | Consumer | Wagga Wagga | New South Wales | ... | OFF-SU-10000618 | Office Supplies | Supplies | Acme Trimmer, High Speed | 120.366 | 3 | 0.1 | 36.0360 | 9.72 | Medium |
2 | 48883 | HU-2011-1220 | 1/1/2011 | 5/1/2011 | Second Class | AT-735 | Annie Thurman | Consumer | Budapest | Budapest | ... | OFF-TEN-10001585 | Office Supplies | Storage | Tenex Box, Single Width | 66.120 | 4 | 0.0 | 29.6400 | 8.17 | High |
3 | 11731 | IT-2011-3647632 | 1/1/2011 | 5/1/2011 | Second Class | EM-14140 | Eugene Moren | Home Office | Stockholm | Stockholm | ... | OFF-PA-10001492 | Office Supplies | Paper | Enermax Note Cards, Premium | 44.865 | 3 | 0.5 | -26.0550 | 4.82 | High |
4 | 22255 | IN-2011-47883 | 1/1/2011 | 8/1/2011 | Standard Class | JH-15985 | Joseph Holt | Consumer | Wagga Wagga | New South Wales | ... | FUR-FU-10003447 | Furniture | Furnishings | Eldon Light Bulb, Duo Pack | 113.670 | 5 | 0.1 | 37.7700 | 4.70 | Medium |
51285 | 32593 | CA-2014-115427 | 31-12-2014 | 4/1/2015 | Standard Class | EB-13975 | Erica Bern | Corporate | Fairfield | California | ... | OFF-BI-10002103 | Office Supplies | Binders | Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl | 13.904 | 2 | 0.2 | 4.5188 | 0.89 | Medium |
51286 | 47594 | MO-2014-2560 | 31-12-2014 | 5/1/2015 | Standard Class | LP-7095 | Liz Preis | Consumer | Agadir | Souss-Massa-Draâ | ... | OFF-WIL-10001069 | Office Supplies | Binders | Wilson Jones Hole Reinforcements, Clear | 3.990 | 1 | 0.0 | 0.4200 | 0.49 | Medium |
51287 | 8857 | MX-2014-110527 | 31-12-2014 | 2/1/2015 | Second Class | CM-12190 | Charlotte Melton | Consumer | Managua | Managua | ... | OFF-LA-10004182 | Office Supplies | Labels | Hon Color Coded Labels, 5000 Label Set | 26.400 | 3 | 0.0 | 12.3600 | 0.35 | Medium |
51288 | 6852 | MX-2014-114783 | 31-12-2014 | 6/1/2015 | Standard Class | TD-20995 | Tamara Dahlen | Consumer | Juárez | Chihuahua | ... | OFF-LA-10000413 | Office Supplies | Labels | Hon Legal Exhibit Labels, Alphabetical | 7.120 | 1 | 0.0 | 0.5600 | 0.20 | Medium |
51289 | 36388 | CA-2014-156720 | 31-12-2014 | 4/1/2015 | Standard Class | JM-15580 | Jill Matthias | Consumer | Loveland | Colorado | ... | OFF-FA-10003472 | Office Supplies | Fasteners | Bagged Rubber Bands | 3.024 | 3 | 0.2 | -0.6048 | 0.17 | Medium |
10 rows × 24 columns
2.2 总览数据
数据维度有点多,显示不全,先转置看看各个维度的数据大概长什么样
df[-5:].T
51285 | 51286 | 51287 | 51288 | 51289 | |
---|---|---|---|---|---|
Row ID | 32593 | 47594 | 8857 | 6852 | 36388 |
Order ID | CA-2014-115427 | MO-2014-2560 | MX-2014-110527 | MX-2014-114783 | CA-2014-156720 |
Order Date | 31-12-2014 | 31-12-2014 | 31-12-2014 | 31-12-2014 | 31-12-2014 |
Ship Date | 4/1/2015 | 5/1/2015 | 2/1/2015 | 6/1/2015 | 4/1/2015 |
Ship Mode | Standard Class | Standard Class | Second Class | Standard Class | Standard Class |
Customer ID | EB-13975 | LP-7095 | CM-12190 | TD-20995 | JM-15580 |
Customer Name | Erica Bern | Liz Preis | Charlotte Melton | Tamara Dahlen | Jill Matthias |
Segment | Corporate | Consumer | Consumer | Consumer | Consumer |
City | Fairfield | Agadir | Managua | Juárez | Loveland |
State | California | Souss-Massa-Draâ | Managua | Chihuahua | Colorado |
Country | United States | Morocco | Nicaragua | Mexico | United States |
Postal Code | 94533 | NaN | NaN | NaN | 80538 |
Market | US | Africa | LATAM | LATAM | US |
Region | West | Africa | Central | North | West |
Product ID | OFF-BI-10002103 | OFF-WIL-10001069 | OFF-LA-10004182 | OFF-LA-10000413 | OFF-FA-10003472 |
Category | Office Supplies | Office Supplies | Office Supplies | Office Supplies | Office Supplies |
Sub-Category | Binders | Binders | Labels | Labels | Fasteners |
Product Name | Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl | Wilson Jones Hole Reinforcements, Clear | Hon Color Coded Labels, 5000 Label Set | Hon Legal Exhibit Labels, Alphabetical | Bagged Rubber Bands |
Sales | 13.904 | 3.99 | 26.4 | 7.12 | 3.024 |
Quantity | 2 | 1 | 3 | 1 | 3 |
Discount | 0.2 | 0 | 0 | 0 | 0.2 |
Profit | 4.5188 | 0.42 | 12.36 | 0.56 | -0.6048 |
Shipping Cost | 0.89 | 0.49 | 0.35 | 0.2 | 0.17 |
Order Priority | Medium | Medium | Medium | Medium | Medium |
df.columns
Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
'Profit', 'Shipping Cost', 'Order Priority'],
dtype='object')
- 各维度分别为:行编号、订单编号、订购日期、发货日期、运送方式、客户ID、客户姓名、客户类型、客户城市、客户所在州、客户国家、邮编、店铺所在区域、店铺所属州、产品ID、类别、子类别、产品名称、销售额、销售量、折扣、利润、运输费、订单优先级 共24个维度
- 行编号无用,直接删除
del df['Row ID']
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 51290 non-null object
1 Order Date 51290 non-null object
2 Ship Date 51290 non-null object
3 Ship Mode 51290 non-null object
4 Customer ID 51290 non-null object
5 Customer Name 51290 non-null object
6 Segment 51290 non-null object
7 City 51290 non-null object
8 State 51290 non-null object
9 Country 51290 non-null object
10 Postal Code 9994 non-null float64
11 Market 51290 non-null object
12 Region 51290 non-null object
13 Product ID 51290 non-null object
14 Category 51290 non-null object
15 Sub-Category 51290 non-null object
16 Product Name 51290 non-null object
17 Sales 51290 non-null float64
18 Quantity 51290 non-null int64
19 Discount 51290 non-null float64
20 Profit 51290 non-null float64
21 Shipping Cost 51290 non-null float64
22 Order Priority 51290 non-null object
dtypes: float64(5), int64(1), object(17)
memory usage: 9.0+ MB
2.3 查询、处理空值
print('邮编缺失数据占比: {:.2%}'.format(df['Postal Code'].isnull().sum() / len(df)))
邮编缺失数据占比: 80.51%
邮编一栏缺失数据超过80%,又因为邮编对数据几乎无影响,采取措施为直接删除此列
del df['Postal Code']
2.4 查询重复记录
df.duplicated().sum()
0
无重复记录
2.5 更改数据类型,并创建时间维度字段
# 将订单日期更改为日期格式
df['Order Date']=pd.to_datetime(df['Order Date'])
# 添加年和月的列,便于后续进行时间维度的分析
df['year'] = df['Order Date'].dt.year
df['month'] = df['Order Date'].dt.month
3 经营分析
3.1 销售额分析
# 创建销售额透视表
sales = pd.pivot_table(df, values='Sales', index='month', columns='year',aggfunc=[np.sum])
sales.columns = ['2011年','2012年','2013年','2014年']
sales.index=['1月','2月','3月','4月','5月','6月','7月','8月','9月','10月','11月','12月']
sales
2011年 | 2012年 | 2013年 | 2014年 | |
---|---|---|---|---|
1月 | 138241.30042 | 162800.89338 | 206459.19582 | 268265.52240 |
2月 | 134969.94086 | 152661.15144 | 191062.77216 | 244159.30486 |
3月 | 171455.59372 | 201608.72784 | 230547.79468 | 347720.96868 |
4月 | 128833.47034 | 187469.96192 | 233181.34844 | 302132.54000 |
5月 | 148146.72092 | 218960.16042 | 304509.96336 | 304798.83604 |
6月 | 189338.43966 | 249289.77172 | 341162.34370 | 372577.23298 |
7月 | 162034.69756 | 174394.02808 | 223642.65664 | 278672.17326 |
8月 | 219223.49524 | 271669.66086 | 323876.60716 | 432731.04194 |
9月 | 255237.89698 | 256567.85308 | 326897.27044 | 405436.70584 |
10月 | 204675.07846 | 239321.09904 | 270121.87570 | 406659.41500 |
11月 | 214934.29386 | 270723.05356 | 383039.21248 | 508954.73156 |
12月 | 292359.96752 | 291972.33306 | 371245.40880 | 427757.39800 |
3.1.1 时间维度销售额
3.1.1.1 年度销售额及增长率
rise_12 = (sales.sum()[1]-sales.sum()[0])/sales.sum()[0]
rise_13 = (sales.sum()[2]-sales.sum()[1])/sales.sum()[1]
rise_14 = (sales.sum()[3]-sales.sum()[2])/sales.sum()[2]
rise_rate = [0, rise_12, rise_13, rise_14]
sales_sum=pd.DataFrame({
'Sales_sum':sales.sum()})
sales_sum['rise_rate'] = rise_rate
sales_sum.index=pd.Series(['2011年','2012年','2013年','2014年'])
sales_sum
Sales_sum | rise_rate | |
---|---|---|
2011年 | 2.259451e+06 | 0.000000 |
2012年 | 2.677439e+06 | 0.184995 |
2013年 | 3.405746e+06 | 0.272017 |
2014年 | 4.299866e+06 | 0.262533 |
x1 = x2 = sales_sum.index
y1 = sales_sum.Sales_sum
y2 = sales_sum.rise_rate
# 建立左侧纵坐标画板
fig, ax1 = plt.subplots()
# 画柱状图
bar = ax1.bar(x1, y1, alpha=0.8, label = '销售额')
# 显示左侧纵坐标
ax1.set_ylabel('销售额', fontsize=20)
plt.yticks(range(0,5500000,500000),('0','50W','100W','150W','200W','250W','300W','350W','400W','450W'))
ax1.legend(bar, ("销售额",), loc = [0.05, 0.9])
plt.tick_params(labelsize=14)
# 建立右侧坐标画板
ax2 = ax1.twinx()
# 画折线图
line = ax2.plot(x2, y2, marker='o' ,c='y')
# 折线图显示标识
for a, b in zip(x2, y2):
ax2.text(a, b, "%.2f" % (100 * b) + '%', ha='center', fontsize=20)
ax2.set_ylabel('增长率', fontsize=20, rotation=270)
ax2.legend(line, ('增长率',), loc = [0.05, 0.85])
plt.tick_params(labelsize=14)
ax2.set_title("年度销售额及其增长率", fontsize=22);
从年度销售额及其增长率来看:
- 该超市2011年销售额只有226万,2014年销售额达到了430万,GMV提升非常快,从行业发展阶段来看,该超市应该处于导入期或成长期
- 该超市主营办公用品,据了解,全球领先的办公用品供应商包括:Costco、Office Depot、Staples、Tesco和沃尔玛,查询沃尔玛2014年销售额得知,沃尔玛2014年销售额为4731亿美元,可见与全球大型超市对比,差距大,可提升空间较大
- 从增长率来看,增长率非常高,12-13年、13-14年GMV增长率接近均超过25%,而沃尔玛2013-2014的增长率大概为3%,可见该超市潜力非常大
- 了解了超市整体销售额后,再对每年每月的销售额进行分析,了解不同月份的销售情况,找出是否有淡旺季之分,找出重点销售月份,以便制定经营策略与业绩的月度及季度指标拆分
3.1.1.2 月度销售额
sales.plot.area(colormap = 'tab20c', stacked=False)
plt.title('月度销售额')
plt.ylabel('销售额')
plt.xlabel('月份');
通过不同年份月度销售额我们可以看出:
- 该超市2011年-2014年每一年的销售额同比上一年都是上升趋势,销售季节性明显,总体上半年是淡季,下半年是旺季
- 上半年中6月份销售额比较高,下半年中7月份的销售额偏低。对于旺季的月份,运营推广等策略要继续维持,还可以加大投入,提高整体销售额;对于淡季的月份,可以结合产品特点进行新产品拓展,举办一些促销活动等吸引客户
3.2 利润分析
# 创建利润透视表
profit = pd.pivot_table(df, values='Profit', index='month', columns='year',aggfunc=