Modeling 1: Data Plotting

Latest recommended article published on 2024-09-14 10:41:30

冒的感觉的疯子


Tags: pandas, mathematical modeling, data analysis

Article link: https://blog.csdn.net/weixin_42878736/article/details/118541692


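This analysis is based on member purchase data. Through preprocessing, grouping, and aggregation it examines how age band and gender relate to purchase amount. pandas is used to clean and transform the data, build age intervals, compute the purchase frequency and total amount within each interval, and plot the results, showing how spending is distributed across gender and age.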

Summary generated automatically by CSDN.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_rows= 10
import datetime
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family']='sans-serif'

fec1 = pd.read_excel(r'C:\Users\Administrator\Desktop\c题\c附件\会员消费明细查询2016.xlsx')
#fec2 = pd.read_excel(r'C:\Users\Administrator\Desktop\c题\c附件\会员消费明细查询2017.xlsx')

fec1['date'] = pd.to_datetime(fec1['消费时间'])
#fec2['date'] = pd.to_datetime(fec2['消费时间'])
fec1['month']=fec1['date'].apply(lambda x:x.month)
#fec2['month']=fec2['date'].apply(lambda x:x.month+12)

	会员卡号	卡类型	会员编号	性别	年龄	单据号	消费时间	商铺编码	消费金额	消费积分	date	month
0	8088001625	白金卡	MN20150307198159	女	42	58504802	2016/12/31 21:58:52	112	238.0	238	2016-12-31 21:58:52	12
1	8088012005	白金卡	MN20140430166571	男	33	6051201612310076	2016/12/31 21:38:43	605-1	1436.0	1436	2016-12-31 21:38:43	12
2	8086102716	VIP卡	MN20131210149893	女	63	58503611	2016/12/31 21:21:44	112	200.0	200	2016-12-31 21:21:44	12
3	8086232983	VIP卡	MN20131222152491	女	46	58503502	2016/12/31 21:20:37	112	42.8	42	2016-12-31 21:20:37	12
4	8088001625	白金卡	MN20150307198159	女	42	58503422	2016/12/31 21:18:50	112	150.0	150	2016-12-31 21:18:50	12
...	...	...	...	...	...	...	...	...	...	...	...	...
736428	8086129227	VIP卡	MN20121231041002	男	41	B124201601012198672	2016/1/1 9:56:09	B124	164.0	49	2016-01-01 09:56:09	1
736429	8086223710	VIP卡	MN20150407199239	男	29	B124201601012166847	2016/1/1 9:56:09	B124	84.0	25	2016-01-01 09:56:09	1
736430	8086226933	VIP卡	MN20130727107860	女	33	708201601010043	2016/1/1 9:56:09	708	94.0	94	2016-01-01 09:56:09	1
736431	8086233209	VIP卡	MN20131212150274	男	69	708201601010036	2016/1/1 9:56:09	708	181.0	181	2016-01-01 09:56:09	1
736432	8086227574	VIP卡	MN20130901118045	女	39	B124201601112201812	2016/1/1 0:00:00	B124	22.0	6	2016-01-01 00:00:00	1

736433 rows × 12 columns

Column names stay in Chinese because the code refers to them: 会员卡号 = member card number, 卡类型 = card type (白金卡 = platinum card, VIP卡 = VIP card, 普通卡 = regular card, 南昌普通卡 = Nanchang regular card), 会员编号 = member ID, 性别 = gender (女 = female, 男 = male), 年龄 = age, 单据号 = receipt number, 消费时间 = purchase time, 商铺编码 = shop code, 消费金额 = purchase amount, 消费积分 = points earned. The 年龄段 column added later is the age band.

#fec = pd.concat([fec1,fec2])
#fec['date'] = pd.to_datetime(fec['消费时间'])
#fec['month'] = fec['date'].apply(lambda x: x.month)
#fec
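
The commented-out lines above hint at combining the 2016 and 2017 files and numbering the 2017 months 13 to 24. A minimal sketch of that idea on made-up rows follows; the frames `y2016` and `y2017` and the column `month_index` are hypothetical names, not part of the original notebook. It also uses the vectorised `.dt.month` accessor in place of `apply(lambda x: x.month)`.

```python
import pandas as pd

# Hypothetical stand-ins for the 2016 and 2017 workbooks (made-up rows).
y2016 = pd.DataFrame({'消费时间': ['2016/1/1 9:56:09', '2016/12/31 21:58:52'],
                      '消费金额': [164.0, 238.0]})
y2017 = pd.DataFrame({'消费时间': ['2017/1/5 10:00:00', '2017/3/2 18:30:00'],
                      '消费金额': [50.0, 75.0]})

# Stack the two years into one frame with a fresh 0..n-1 index.
both = pd.concat([y2016, y2017], ignore_index=True)
both['date'] = pd.to_datetime(both['消费时间'])

# Vectorised accessor instead of apply(lambda x: x.month).
both['month'] = both['date'].dt.month

# Running month index: 1-12 for the first year, 13-24 for the second.
first_year = both['date'].dt.year.min()
both['month_index'] = (both['date'].dt.year - first_year) * 12 + both['month']

print(both[['date', 'month', 'month_index']])
```

Deriving the offset from the year avoids hard-coding `+12` separately for each file.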

fec1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736433 entries, 0 to 736432
Data columns (total 12 columns):
会员卡号     736433 non-null object
卡类型      736433 non-null object
会员编号     736433 non-null object
性别       731993 non-null object
年龄       736433 non-null int64
单据号      736433 non-null object
消费时间     736433 non-null object
商铺编码     736433 non-null object
消费金额     736433 non-null float64
消费积分     736433 non-null int64
date     736433 non-null datetime64[ns]
month    736433 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(3), object(7)
memory usage: 67.4+ MB

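bins=np.array([0,10,20,30,40,50,60,70,80,90,100,120])  # age bin edges
fec1['年龄段']=pd.cut(fec1['年龄'],bins,labels=[1,2,3,4,5,6,7,8,9,10,11])  # age band 1-11; intervals are right-closed, so age 0 becomes NaN
fec1.select_dtypes('number').corr(method='spearman')  # Spearman correlation of the numeric columns (not displayed: a cell only shows its last expression)
fec1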

	会员卡号	卡类型	会员编号	性别	年龄	单据号	消费时间	商铺编码	消费金额	消费积分	date	month	年龄段
0	8088001625	白金卡	MN20150307198159	女	42	58504802	2016/12/31 21:58:52	112	238.0	238	2016-12-31 21:58:52	12	5
1	8088012005	白金卡	MN20140430166571	男	33	6051201612310076	2016/12/31 21:38:43	605-1	1436.0	1436	2016-12-31 21:38:43	12	4
2	8086102716	VIP卡	MN20131210149893	女	63	58503611	2016/12/31 21:21:44	112	200.0	200	2016-12-31 21:21:44	12	7
3	8086232983	VIP卡	MN20131222152491	女	46	58503502	2016/12/31 21:20:37	112	42.8	42	2016-12-31 21:20:37	12	5
4	8088001625	白金卡	MN20150307198159	女	42	58503422	2016/12/31 21:18:50	112	150.0	150	2016-12-31 21:18:50	12	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...
736428	8086129227	VIP卡	MN20121231041002	男	41	B124201601012198672	2016/1/1 9:56:09	B124	164.0	49	2016-01-01 09:56:09	1	5
736429	8086223710	VIP卡	MN20150407199239	男	29	B124201601012166847	2016/1/1 9:56:09	B124	84.0	25	2016-01-01 09:56:09	1	3
736430	8086226933	VIP卡	MN20130727107860	女	33	708201601010043	2016/1/1 9:56:09	708	94.0	94	2016-01-01 09:56:09	1	4
736431	8086233209	VIP卡	MN20131212150274	男	69	708201601010036	2016/1/1 9:56:09	708	181.0	181	2016-01-01 09:56:09	1	7
736432	8086227574	VIP卡	MN20130901118045	女	39	B124201601112201812	2016/1/1 0:00:00	B124	22.0	6	2016-01-01 00:00:00	1	4

736433 rows × 13 columns
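
The way pd.cut treats the bin edges matters for this data set, because some records have age 0. The sketch below uses a handful of made-up ages (the Series `ages` is hypothetical) to show that the default intervals are left-open and right-closed, so 0 falls outside (0, 10] and becomes NaN unless include_lowest=True is passed.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 10, 11, 42, 120])  # made-up ages for illustration
bins = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120])

# Default: (0, 10], (10, 20], ... so the value 0 is not covered by any bin.
right_closed = pd.cut(ages, bins)

# include_lowest=True closes the first interval on the left so that 0 is kept.
with_lowest = pd.cut(ages, bins, include_lowest=True)

print(right_closed.isna().tolist())
print(with_lowest.isna().tolist())
print(right_closed.cat.codes.tolist())  # -1 marks a missing category
```

This is consistent with the missing 年龄段 values that the isnull check reports further down.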

unique_cands = fec1['会员编号'].unique()  # all distinct values in the 会员编号 (member ID) column
len(unique_cands)
fec1['会员编号'].nunique()
unique_cands

array(['MN20150307198159', 'MN20140430166571', 'MN20131210149893', ...,
       'MN2015091900255', 'MN2015071000008', 'MN20130122046159'], dtype=object)
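
unique() and nunique() answer slightly different questions when a column contains missing values. The toy Series below is made up (`ids` is a hypothetical name); in the real data 会员编号 has no missing values, so both approaches should agree there.

```python
import pandas as pd

ids = pd.Series(['MN01', 'MN02', 'MN01', None, 'MN03'])  # made-up IDs

distinct = ids.unique()                     # array of distinct values, missing value included
n_with_missing = len(distinct)              # counts the missing value as one entry
n_default = ids.nunique()                   # ignores missing values by default
n_keep_missing = ids.nunique(dropna=False)  # counts the missing value too

print(n_with_missing, n_default, n_keep_missing)
```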

fec1.isnull().any()  # check each column for missing values

会员卡号     False
卡类型      False
会员编号     False
性别        True
年龄       False
         ...  
消费金额     False
消费积分     False
date     False
month    False
年龄段       True
Length: 13, dtype: bool
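
any() only says whether a column has gaps. From the info() output above, 性别 has 736433 - 731993 = 4440 missing entries. A small sketch of how such counts and shares could be obtained directly, on a made-up frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({'性别': ['女', None, '男', '女'],
                    '年龄': [42, 33, 0, 63]})  # made-up rows

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```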

Select the needed columns with dataframe[[col1, col2, ...]]

data_1 = fec1[['会员编号','卡类型','性别','年龄','单据号','商铺编码','消费金额','date']]
data_1

	会员编号	卡类型	性别	年龄	单据号	商铺编码	消费金额	date
0	MN20150307198159	白金卡	女	42	58504802	112	238.0	2016-12-31 21:58:52
1	MN20140430166571	白金卡	男	33	6051201612310076	605-1	1436.0	2016-12-31 21:38:43
2	MN20131210149893	VIP卡	女	63	58503611	112	200.0	2016-12-31 21:21:44
3	MN20131222152491	VIP卡	女	46	58503502	112	42.8	2016-12-31 21:20:37
4	MN20150307198159	白金卡	女	42	58503422	112	150.0	2016-12-31 21:18:50
...	...	...	...	...	...	...	...	...
736428	MN20121231041002	VIP卡	男	41	B124201601012198672	B124	164.0	2016-01-01 09:56:09
736429	MN20150407199239	VIP卡	男	29	B124201601012166847	B124	84.0	2016-01-01 09:56:09
736430	MN20130727107860	VIP卡	女	33	708201601010043	708	94.0	2016-01-01 09:56:09
736431	MN20131212150274	VIP卡	男	69	708201601010036	708	181.0	2016-01-01 09:56:09
736432	MN20130901118045	VIP卡	女	39	B124201601112201812	B124	22.0	2016-01-01 00:00:00

736433 rows × 8 columns

#bins=np.array([0,20,30,40,50,60,70,80,90,100])  # age bin edges
bins=np.array([0,10,20,30,40,50,60,70,80,90,100,120])  # age bin edges; intervals are left-open, right-closed

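def rfm_convert1(x):
    # RFM-style scoring helper; it is not called anywhere below.
    # Keys are column positions in x, values are the names of the score columns.
    rfm_dict = {0:'R',
                2:'M'}
    for i in rfm_dict:  # iterate over the mapped column positions
        if i == 3:
            # equal-width binning into 10 bins labelled 10..100
            labels = [10,20,30,40,50,60,70,80,90,100]
            x[rfm_dict[i]] = pd.cut(x.iloc[:,i],10,labels=labels)
        else:
            # quantile binning into 5 groups scored 1..5
            labels = np.arange(1,6)    
            x[rfm_dict[i]] = pd.qcut(x.iloc[:,i],q=np.linspace(0,1,num=6),labels=labels)
    return x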

labels1 = [10,20,30,40,50,60,70,80,90,100,120]
data_1['年龄段']=pd.cut(data_1['年龄'],bins,labels=labels1,include_lowest=True)  # include_lowest=True closes the first bin on the left, so age 0 is included
data_1.sort_values(by='年龄段',ascending=False)  # ascending=False sorts from largest to smallest

D:\anaconda3\lib\site-packages\ipykernel_launcher.py:17: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

	会员编号	卡类型	性别	年龄	单据号	商铺编码	消费金额	date	年龄段
319589	MN20130109042454	VIP卡	男	92	813201607271407	813	120.0	2016-07-27 13:19:57	100
415883	MN20130109042454	VIP卡	男	92	209201606121519	209	135.0	2016-06-12 15:51:35	100
595627	MN20130324060927	VIP卡	女	98	B135201603153647	B135	45.0	2016-03-15 18:56:11	100
595626	MN20130324060927	VIP卡	女	98	B135201603153699	B135	18.0	2016-03-15 18:56:11	100
595625	MN20130324060927	VIP卡	女	98	B135201603153698	B135	25.0	2016-03-15 18:56:11	100
...	...	...	...	...	...	...	...	...	...
534275	MN2015090100295	VIP卡	男	0	6031201604170015	603-1	308.0	2016-04-17 13:22:50	10
201409	MN2016090200005	普通卡	男	0	B124201609232091130	B124	37.0	2016-09-23 13:38:15	10
534276	MN2015122700201	VIP卡	女	0	CB001201604170097025	CB001	649.0	2016-04-17 13:22:50	10
201407	MN2016062300093	普通卡	男	0	B124201609232168665	B124	19.0	2016-09-23 13:38:15	10
368216	MN2015120500197	VIP卡	女	0	701201607040608002	701	205.0	2016-07-04 18:03:55	10

736433 rows × 9 columns
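
The SettingWithCopyWarning above appears because data_1 was produced by selecting columns from fec1 and a new column was then assigned to it. One common way to avoid the warning, sketched here on a made-up frame, is to take an explicit copy first. The names `fec` and `data` are hypothetical, and the sketch assumes the extra memory of a copy is acceptable.

```python
import pandas as pd

fec = pd.DataFrame({'会员编号': ['MN01', 'MN02', 'MN03'],
                    '年龄': [0, 42, 98],
                    '消费积分': [1, 2, 3]})  # made-up rows

# .copy() makes `data` an independent frame, so adding a column is unambiguous.
data = fec[['会员编号', '年龄']].copy()

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120]
labels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120]  # upper edge of each bin
data['年龄段'] = pd.cut(data['年龄'], bins, labels=labels, include_lowest=True)

print(data)
print('年龄段' in fec.columns)  # the original frame is left untouched
```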

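df.pivot_table(index='student', values='grades', aggfunc=fun)

pivot_table reshapes the data so it can be examined along different dimensions; aggfunc is the function used to aggregate the values within each group during the reshaping.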

by_occupation_1= data_1.pivot_table('消费金额',index=['年龄段'],aggfunc='count')
print(by_occupation_1)
by_occupation_1.plot.pie(subplots = True,autopct='%.2f', fontsize=10, figsize=(5,5))
plt.show()

       消费金额
年龄段        
10   171482
20     1461
30    62698
40   252688
50   146145
..      ...
70    36302
80     8516
90      849
100     211
120       0

[11 rows x 1 columns]
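
To make the role of aggfunc concrete, here is a small sketch on made-up rows (`toy` is a hypothetical frame). It shows the same pivot with two different aggregation functions, and that a one-dimensional pivot gives the same numbers as a groupby.

```python
import pandas as pd

toy = pd.DataFrame({'年龄段': [40, 40, 50, 50, 50],
                    '消费金额': [10.0, 20.0, 5.0, 5.0, 30.0]})  # made-up rows

# Same layout, different aggregation: number of purchases vs. total amount.
counts = toy.pivot_table('消费金额', index='年龄段', aggfunc='count')
totals = toy.pivot_table('消费金额', index='年龄段', aggfunc='sum')

# The equivalent groupby for the count.
same_counts = toy.groupby('年龄段')['消费金额'].count()

print(counts)
print(totals)
print(same_counts)
```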

[Figure: pie chart of the number of purchases per age band]

Aggregation with .groupby

print(data_1)
f_value = data_1.groupby('年龄段')['会员编号'].agg([('2018年消费频次','count')])
print(f_value)
f_value.plot.pie(subplots = True,autopct='%.2f', fontsize=20, figsize=(8, 8))
plt.show()

                    会员编号   卡类型 性别  年龄                  单据号   商铺编码    消费金额  \
0       MN20150307198159   白金卡  女  42             58504802    112   238.0   
1       MN20140430166571   白金卡  男  33     6051201612310076  605-1  1436.0   
2       MN20131210149893  VIP卡  女  63             58503611    112   200.0   
3       MN20131222152491  VIP卡  女  46             58503502    112    42.8   
4       MN20150307198159   白金卡  女  42             58503422    112   150.0   
...                  ...   ... ..  ..                  ...    ...     ...   
736428  MN20121231041002  VIP卡  男  41  B124201601012198672   B124   164.0   
736429  MN20150407199239  VIP卡  男  29  B124201601012166847   B124    84.0   
736430  MN20130727107860  VIP卡  女  33      708201601010043    708    94.0   
736431  MN20131212150274  VIP卡  男  69      708201601010036    708   181.0   
736432  MN20130901118045  VIP卡  女  39  B124201601112201812   B124    22.0   

                      date 年龄段  
0      2016-12-31 21:58:52  50  
1      2016-12-31 21:38:43  40  
2      2016-12-31 21:21:44  70  
3      2016-12-31 21:20:37  50  
4      2016-12-31 21:18:50  50  
...                    ...  ..  
736428 2016-01-01 09:56:09  50  
736429 2016-01-01 09:56:09  30  
736430 2016-01-01 09:56:09  40  
736431 2016-01-01 09:56:09  70  
736432 2016-01-01 00:00:00  40  

[736433 rows x 9 columns]
     2018年消费频次
年龄段           
10      171482
20        1461
30       62698
40      252688
50      146145
..         ...
70       36302
80        8516
90         849
100        211
120          0

[11 rows x 1 columns]
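
The groupby above names its output column through a list of (name, function) tuples. A hedged alternative is pandas named aggregation, which can compute several summaries in one pass. The sketch uses made-up rows, and the output column names `frequency`, `members`, and `total_amount` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'年龄段': [40, 40, 50],
                    '会员编号': ['MN01', 'MN01', 'MN02'],
                    '消费金额': [10.0, 20.0, 5.0]})  # made-up rows

summary = toy.groupby('年龄段').agg(
    frequency=('会员编号', 'count'),    # number of purchases in the band
    members=('会员编号', 'nunique'),    # number of distinct members in the band
    total_amount=('消费金额', 'sum'),   # total spend in the band
)

print(summary)
```

Counting rows and counting distinct members are different quantities: one member with many purchases raises `frequency` but not `members`.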

[Figure: pie chart of purchase frequency per age band]

xiangguan = by_occupation_1.join(f_value)
xiangguan.corr(method='spearman')

	消费金额	2018年消费频次
消费金额	1.0	1.0
2018年消费频次	1.0	1.0

Both columns count the same rows per age band (neither 消费金额 nor 会员编号 has missing values), so the two series are identical and the Spearman correlation is trivially 1.0.

f_value

	2018年消费频次
年龄段
10	171482
20	1461
30	62698
40	252688
50	146145
...	...
70	36302
80	8516
90	849
100	211
120	0

11 rows × 1 columns

by_occupation_1= fec1.pivot_table('消费金额',index=['卡类型'],columns='性别',aggfunc='count')
by_occupation_1

性别	女	男
卡类型
VIP卡	330876	201053
南昌普通卡	496	287
普通卡	55431	55664
白金卡	63558	24628

by_occupation_1.plot(kind='barh')  # horizontal bar chart
plt.show()

[Figure: horizontal bar chart of purchase counts by card type and gender]


bins=np.array([0,10,20,30,40,50,60,70,80,90,100,120])  # age bin edges
labels=pd.cut(fec1['年龄'],bins)

0         (40, 50]
1         (30, 40]
2         (60, 70]
3         (40, 50]
4         (40, 50]
            ...   
736428    (40, 50]
736429    (20, 30]
736430    (30, 40]
736431    (60, 70]
736432    (30, 40]
Name: 年龄, Length: 736433, dtype: category
Categories (11, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (70, 80] < (80, 90] < (90, 100] < (100, 120]]

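Using grouped.size().unstack(0)

stack() pivots columns into rows ("stacking" them).

unstack() is the inverse of stack(): it pivots rows into columns. Here unstack(0) moves index level 0 (卡类型) into the columns.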

grouped=fec1.groupby(['卡类型',labels])
grouped.size().unstack(0)

卡类型	VIP卡	南昌普通卡	普通卡	白金卡
年龄
(0, 10]	637.0	NaN	209.0	108.0
(10, 20]	1088.0	NaN	159.0	214.0
(20, 30]	49024.0	2.0	5643.0	8029.0
(30, 40]	199018.0	618.0	17393.0	35659.0
(40, 50]	112961.0	159.0	7788.0	25237.0
(50, 60]	44441.0	2.0	2991.0	8647.0
(60, 70]	26276.0	2.0	828.0	9196.0
(70, 80]	7037.0	NaN	273.0	1206.0
(80, 90]	788.0	NaN	3.0	58.0
(90, 100]	210.0	NaN	1.0	NaN
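
The stack/unstack description above can be tried on a tiny made-up frame. In this sketch `toy` and the `age_band` column are hypothetical. It also passes fill_value=0, which is one way to get zeros instead of the NaN cells seen in the table above for combinations that never occur.

```python
import pandas as pd

toy = pd.DataFrame({'卡类型': ['VIP卡', 'VIP卡', '白金卡', 'VIP卡'],
                    'age_band': ['30-40', '40-50', '30-40', '30-40']})  # made-up rows

# size() returns a Series indexed by (卡类型, age_band).
sizes = toy.groupby(['卡类型', 'age_band']).size()

# unstack(0) moves index level 0 (卡类型) into the columns;
# fill_value=0 replaces missing combinations with 0 instead of NaN.
wide = sizes.unstack(0, fill_value=0)

# stack() reverses the move: the columns become the inner index level again.
long_again = wide.stack()

print(wide)
print(long_again)
```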

#by_occupation_4= fec.pivot_table('消费金额',index=labels,columns=['卡类型'],aggfunc='count')
#by_occupation_4.plot.pie(subplots = True,autopct='%.2f', fontsize=30, figsize=(50, 50))

by_occupation_5= fec1.pivot_table('消费金额',index=labels,columns=['卡类型'],aggfunc='sum')
by_occupation_5.plot.pie(subplots = True,autopct='%.2f', fontsize=10, figsize=(20, 20))
plt.show()

[Figure: pie charts, one per card type, of total purchase amount by age interval]

bins=np.array([0,10,20,30,40,50,60,70,80,90,100,120])  # age bin edges
labels=pd.cut(fec1['年龄'],bins)
by_occupation_4= fec1.pivot_table('消费金额',index=labels,aggfunc='sum')
by_occupation_4

	消费金额
年龄
(0, 10]	3.319693e+05
(10, 20]	6.612430e+05
(20, 30]	1.941047e+07
(30, 40]	7.539194e+07
(40, 50]	4.943245e+07
...	...
(60, 70]	1.047164e+07
(70, 80]	1.649638e+06
(80, 90]	1.621774e+05
(90, 100]	2.897200e+04
(100, 120]	NaN

11 rows × 1 columns

by_occupation_4.plot(kind='barh')  # horizontal bar chart
plt.show()

[Figure: horizontal bar chart of total purchase amount by age interval]

by_occupation_4.plot.pie(subplots = True,autopct='%.2f', fontsize=20, figsize=(10, 10))
plt.show()

[Figure: pie chart of total purchase amount by age interval]

fec1['date'] = pd.to_datetime(fec1['消费时间'])
#fec2['date'] = pd.to_datetime(fec2['消费时间'])
fec1['month'] = fec1['date'].apply(lambda x: x.month)
#fec2['month'] = fec2['date'].apply(lambda x: x.month)
fec1

	会员卡号	卡类型	会员编号	性别	年龄	单据号	消费时间	商铺编码	消费金额	消费积分	date	month
0	8088001625	白金卡	MN20150307198159	女	42	58504802	2016/12/31 21:58:52	112	238.0	238	2016-12-31 21:58:52	12
1	8088012005	白金卡	MN20140430166571	男	33	6051201612310076	2016/12/31 21:38:43	605-1	1436.0	1436	2016-12-31 21:38:43	12
2	8086102716	VIP卡	MN20131210149893	女	63	58503611	2016/12/31 21:21:44	112	200.0	200	2016-12-31 21:21:44	12
3	8086232983	VIP卡	MN20131222152491	女	46	58503502	2016/12/31 21:20:37	112	42.8	42	2016-12-31 21:20:37	12
4	8088001625	白金卡	MN20150307198159	女	42	58503422	2016/12/31 21:18:50	112	150.0	150	2016-12-31 21:18:50	12
...	...	...	...	...	...	...	...	...	...	...	...	...
736428	8086129227	VIP卡	MN20121231041002	男	41	B124201601012198672	2016/1/1 9:56:09	B124	164.0	49	2016-01-01 09:56:09	1
736429	8086223710	VIP卡	MN20150407199239	男	29	B124201601012166847	2016/1/1 9:56:09	B124	84.0	25	2016-01-01 09:56:09	1
736430	8086226933	VIP卡	MN20130727107860	女	33	708201601010043	2016/1/1 9:56:09	708	94.0	94	2016-01-01 09:56:09	1
736431	8086233209	VIP卡	MN20131212150274	男	69	708201601010036	2016/1/1 9:56:09	708	181.0	181	2016-01-01 09:56:09	1
736432	8086227574	VIP卡	MN20130901118045	女	39	B124201601112201812	2016/1/1 0:00:00	B124	22.0	6	2016-01-01 00:00:00	1

736433 rows × 12 columns

data_3= fec1.pivot_table('消费金额',index='month',aggfunc='sum')
data_3.plot(kind='bar')  # vertical bar chart
plt.show()

[Figure: bar chart of total purchase amount by month]

#data_33= fec2.pivot_table('消费金额',index='month',aggfunc='sum')
#data_33.plot(kind='bar')  # vertical bar chart

Counting card-type changes

car_label = {'VIP卡':'a','南昌普通卡':'b','普通卡':'c','白金卡':'d'}
fec1['party'] = fec1['卡类型'].map(car_label)
data2 = fec1.groupby('会员编号')['party'].unique()
data2

会员编号
MN20120928000004       [a]
MN20120928000005       [d]
MN20120928000007       [a]
MN20120928000016       [a]
MN20120928000017       [a]
                     ...  
MN2018022100100     [c, a]
MN2018032400067     [c, a]
MN2018051900017        [c]
MN2018051900057        [c]
MN2018051900133        [c]
Name: party, Length: 82391, dtype: object

data33 = data2[data2.apply(len)==2]
data33

会员编号
MN20120928000020    [d, a]
MN20120928000040    [d, a]
MN20120928000046    [d, a]
MN20120928000047    [d, a]
MN20120928000067    [d, a]
                     ...  
MN2017102100160     [c, a]
MN2017122300006     [c, a]
MN2018011800118     [c, a]
MN2018022100100     [c, a]
MN2018032400067     [c, a]
Name: party, Length: 14218, dtype: object
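
The filter above keeps members whose list of card types has length 2. A possible alternative, sketched on made-up rows (`toy` is a hypothetical frame), is to count distinct card types per member with nunique(). The sketch also shows that unique() lists values in the order they first appear in the rows, so the order of each pair depends on how the file is sorted.

```python
import pandas as pd

toy = pd.DataFrame({
    '会员编号': ['MN01', 'MN01', 'MN02', 'MN03', 'MN03', 'MN03'],
    '卡类型': ['白金卡', 'VIP卡', 'VIP卡', '普通卡', '普通卡', 'VIP卡'],
})  # made-up rows

# Number of distinct card types seen for each member.
n_types = toy.groupby('会员编号')['卡类型'].nunique()

# Members that appear with exactly two card types.
changed = n_types[n_types == 2].index.tolist()

# unique() keeps order of first appearance within each member's rows.
order_mn03 = list(toy.groupby('会员编号')['卡类型'].unique()['MN03'])

print(changed)
print(order_mn03)
```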

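def f(x):
    # x is a pair of card-type codes (a: VIP卡, b: 南昌普通卡, c: 普通卡, d: 白金卡).
    # Map the ordered pair to a code from 1 to 12.
    # `==` compares string values; `is` would test object identity.
    if x[0] == 'a':
        if x[1] == 'b':
            x = 1
        elif x[1] == 'c':
            x = 2
        else:
            x = 3

    elif x[0] == 'b':
        if x[1] == 'a':
            x = 4
        elif x[1] == 'c':
            x = 5
        else:
            x = 6

    elif x[0] == 'c':
        if x[1] == 'a':
            x = 7
        elif x[1] == 'b':
            x = 8
        else:
            x = 9

    elif x[0] == 'd':
        if x[1] == 'a':
            x = 10
        elif x[1] == 'b':
            x = 11
        else:
            x = 12
    return x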
data43 = data33.apply(f)
data6 = pd.DataFrame(list(zip(data33, data43)))
data6

	0	1
0	[d, a]	10
1	[d, a]	10
2	[d, a]	10
3	[d, a]	10
4	[d, a]	10
...	...	...
14213	[c, a]	7
14214	[c, a]	7
14215	[c, a]	7
14216	[c, a]	7
14217	[c, a]	7

14218 rows × 2 columns

data6[1].value_counts()

7     12800
2       709
10      617
3        43
9        27
12       18
1         4
Name: 1, dtype: int64
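
The numbering produced by f can also be written as a lookup table, which makes the meaning of each code easier to read back. This is a sketch under the assumption that every pair holds two different codes from a to d; the names `pair_code` and `pairs` are hypothetical. Unlike the else branches of f, a dictionary lookup raises KeyError for an unexpected pair instead of assigning it a code silently.

```python
import pandas as pd

codes = ['a', 'b', 'c', 'd']  # a: VIP卡, b: 南昌普通卡, c: 普通卡, d: 白金卡

# Build the table (first, second) -> 1..12 in the same order f uses.
pair_code = {}
n = 1
for first in codes:
    for second in codes:
        if first != second:
            pair_code[(first, second)] = n
            n += 1

pairs = pd.Series([['d', 'a'], ['c', 'a'], ['a', 'c']])  # made-up pairs
encoded = pairs.apply(lambda p: pair_code[tuple(p)])

print(pair_code)
print(encoded.tolist())
```

Read against this table, the value counts above say that code 7, the pair (c, a), accounts for 12800 of the 14218 members with two card types.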

pd.to_numeric(data6[1])

0        10
1        10
2        10
3        10
4        10
         ..
14213     7
14214     7
14215     7
14216     7
14217     7
Name: 1, Length: 14218, dtype: int64