这是一个电商数据分析题目,可以用excel做,可以用sql做,可以用python做
题目如下: 数据源如下: 链接:https://pan.baidu.com/s/1gM2NoS_1SxUCgPbf-9mgYg 提取码:9zrs
Excel篇:其他人用数据透析表做的 SQL篇:我遇到一些问题,重新用pyhton做 Python篇:如下
import pandas as pd
import numpy as np
import matplotlib. pyplot as plt
data = pd. read_csv( '数据源.csv' )
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3057: DtypeWarning: Columns (1,2,3,4,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
df = data. copy( )
data. dropna( how= "all" , inplace= True )
data. drop_duplicates( inplace= True )
data. head( )
id week date order_d order_e quantity amount phone account 0 1.0 NaN 2017/6/8 GU20170608923396-1 GU20170608923396 1.0 2750.0 1500026566 penny.li@cn.gucci.com 1 2.0 NaN 2017/6/8 GU20170608923428-1 GU20170608923428 1.0 2250.0 16201251750 vincent.qiu@gmail.com 2 3.0 NaN 2017/6/8 GU20170608923430-1 GU20170608923430 1.0 3900.0 15010529091 412526292@qq.com 3 4.0 NaN 2017/6/8 GU20170608923431-1 GU20170608923431 1.0 3300.0 12621601625 1941327726@qq.com 4 5.0 NaN 2017/6/8 GU20170608923435-1 GU20170608923435 1.0 2250.0 12917062526 aliutina@163.com
data. date = pd. to_datetime( df. date, format = '%Y-%m-%d' )
data. phone = df. phone. astype( 'str' )
data. describe( )
id quantity amount count 293657.000000 293657.000000 293657.000000 mean 146829.000000 1.008050 4533.377958 std 84771.618337 0.259776 3904.373436 min 1.000000 1.000000 350.000000 25% 73415.000000 1.000000 1750.000000 50% 146829.000000 1.000000 3650.000000 75% 220243.000000 1.000000 6450.000000 max 293657.000000 62.000000 71500.000000
len ( data)
293657
data. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 293657 entries, 0 to 293656
Data columns (total 9 columns):
id 293657 non-null float64
week 5491 non-null object
date 293657 non-null datetime64[ns]
order_d 293657 non-null object
order_e 293657 non-null object
quantity 293657 non-null float64
amount 293657 non-null float64
phone 293657 non-null object
account 254716 non-null object
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 22.4+ MB
data. isnull( ) . sum ( )
id 0
week 288166
date 0
order_d 0
order_e 0
quantity 0
amount 0
phone 0
account 38941
dtype: int64
列account有缺失值,Week只有本周有数值,之前没有数值
len ( data. account. value_counts( ) )
199179
len ( data. phone. value_counts( ) )
234490
df_week = data[ data. week == 'Sep W36' ]
len ( df_week. phone. value_counts( ) )
5054
len ( df_week)
5491
len ( df_history)
288166
df_history = data[ data. week != 'Sep W36' ]
len ( df_history. phone. value_counts( ) )
230260
第一部分:
本周新客人数(开店至今本周第一次购买):4230人 本周一共有5491条消费记录,用户5054人,新客4230,老客824人,新客消费4483条,老客消费1008条 本周新客消费记录保存在df_week,本周老客消费记录保存在df_history
df2 = pd. merge( df_week, df_history, how= 'left' , on= 'phone' )
df2. head( )
id_x week_x date_x order_d_x order_e_x quantity_x amount_x phone account_x id_y week_y date_y order_d_y order_e_y quantity_y amount_y account_y 0 288167.0 Sep W36 2020-08-30 GU202008291446552-1 GU202008291446552 1.0 6900.0 15120002542 15120002542 NaN NaN NaT NaN NaN NaN NaN NaN 1 288168.0 Sep W36 2020-08-30 GU202008291446404-1 GU202008291446404 1.0 10200.0 15941716566 15714177539 NaN NaN NaT NaN NaN NaN NaN NaN 2 288169.0 Sep W36 2020-08-30 GU202008291447049-1 GU202008291447049 1.0 6900.0 12721266779 12721236779 NaN NaN NaT NaN NaN NaN NaN NaN 3 288170.0 Sep W36 2020-08-30 GU202008291446576-1 GU202008291446576 1.0 2900.0 16091666427 13091333427 NaN NaN NaT NaN NaN NaN NaN NaN 4 288171.0 Sep W36 2020-08-30 GU202008291446393-1 GU202008291446393 1.0 14300.0 17627600559 17627600559 95877.0 NaN 2019-08-26 GU201908261110787-1 GU201908261110787 1.0 10500.0 17627600559
len ( df2)
10519
df2. id_y. isnull( ) . sum ( )
4483
把于本周第一次消费的数据提取出来,保存在df_sep_first 本周合计销售金额:25653800 本周新客销售金额:20929940 本周新客销售占比 81.59%
df_week. amount. sum ( )
25653800.0
df_sep_first = df2[ df2. id_y. isnull( ) ]
len ( df_sep_first)
4483
df_sep_first. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4483 entries, 0 to 10518
Data columns (total 17 columns):
id_x 4483 non-null float64
week_x 4483 non-null object
date_x 4483 non-null datetime64[ns]
order_d_x 4483 non-null object
order_e_x 4483 non-null object
quantity_x 4483 non-null float64
amount_x 4483 non-null float64
phone 4483 non-null object
account_x 3961 non-null object
id_y 0 non-null float64
week_y 0 non-null object
date_y 0 non-null datetime64[ns]
order_d_y 0 non-null object
order_e_y 0 non-null object
quantity_y 0 non-null float64
amount_y 0 non-null float64
account_y 0 non-null object
dtypes: datetime64[ns](2), float64(6), object(9)
memory usage: 630.4+ KB
df_sep_first. amount_x. sum ( )
20929940.0
本周新客会员人数,即在df_sep_first中,account有多少个,3961
df_sep_first. account_x. count( )
3961
第二部分:
近一年累计消费金额≥¥68,000的总人数:202 #创建满足1的数据框df_oneyear2 满足1条件的本周产生购买的新产生客人数: 满足1条件的本周产生购买的原达标客人数: 满足上述2个条件的人数消费总金额:
data. date. max ( )
Timestamp('2020-09-05 00:00:00')
df_oneyear1 = data[ ( data. date < '2019-09-05' ) ]
( df_oneyear1. pivot_table( index= 'phone' , values= 'amount' , aggfunc= 'sum' ) . amount > 68000 ) . sum ( )
202
下面把202条数据提取出来到df_oneyear2
df_oneyear2 = df_oneyear1. pivot_table( index= 'phone' , values= 'amount' , aggfunc= 'sum' )
df_oneyear2 = df_oneyear2[ df_oneyear2. amount > 68000 ]
len ( df_oneyear2)
202
思路:
难点在于,这里是聚合求和,满足近1年消费大于68000的聚合求和,已经存到df_oneyear2 本周数据df_week,其中新客df_sep_first,减去之后即为本周消费的老客,这里我们先看本周之内全部满足1条件的人数 先对df_week进行聚合求和,然后再和df_oneyear2相交,查到本周内全部累计消费》68000的用户 这里易错点,拿df_week直接和df_oneyear2相交,结果不对,因为没有聚合求和
df_week. pivot_table( index= 'phone' , values= 'amount' , aggfunc= 'sum' ) . sort_values( by= 'amount' , ascending= False ) . head( )
amount phone 12721265027 67600.0 12611996246 49000.0 15626452464 42500.0 16771162166 37000.0 16261501212 34500.0
结果发现本周之内累计消费最高67600,都不满足68000,所以第2和第3都是0人
第三部分
近一年有多次消费,且至少有一笔消费金额≥¥18,000 近一年仅一次消费,且该笔消费金额≥¥18,000 近一年累计消费金额<¥68,000
满足上述条件的总人数: 满足1条件的本周产生购买的新产生客人数: 满足1条件的本周产生购买的原达标客人数: 满足上述2个条件的人数消费总金额:
思路:第一题的总人数,先求出3个条件每一个满足条件的人数,再求和
df_oneyear1. head( )
id week date order_d order_e quantity amount phone account 0 1.0 NaN 2017-06-08 GU20170608923396-1 GU20170608923396 1.0 2750.0 1500026566 penny.li@cn.gucci.com 1 2.0 NaN 2017-06-08 GU20170608923428-1 GU20170608923428 1.0 2250.0 16201251750 vincent.qiu@gmail.com 2 3.0 NaN 2017-06-08 GU20170608923430-1 GU20170608923430 1.0 3900.0 15010529091 412526292@qq.com 3 4.0 NaN 2017-06-08 GU20170608923431-1 GU20170608923431 1.0 3300.0 12621601625 1941327726@qq.com 4 5.0 NaN 2017-06-08 GU20170608923435-1 GU20170608923435 1.0 2250.0 12917062526 aliutina@163.com
a = df_oneyear1[ df_oneyear1. amount > 18000 ]
a. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1438 entries, 51 to 98105
Data columns (total 9 columns):
id 1438 non-null float64
week 0 non-null object
date 1438 non-null datetime64[ns]
order_d 1438 non-null object
order_e 1438 non-null object
quantity 1438 non-null float64
amount 1438 non-null float64
phone 1438 non-null object
account 1096 non-null object
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 112.3+ KB
a. phone. value_counts( ) > 1
16262121224 True
17621541994 True
12647119726 True
16129000006 True
16652119121 True
12064295506 True
12520221500 True
16964522200 True
16521512766 True
16612052679 True
16996176222 True
12500226762 True
15222227475 True
16201967491 True
12405291559 True
16570712222 True
15267120909 True
15210924674 True
16606609226 True
16552005664 True
16552401027 True
16752720064 True
15625676110 True
16916976602 True
15210951271 True
12676106654 True
16916055229 True
16264169746 True
12610649665 True
12662560267 True
...
16466210740 False
12600679612 False
12001729520 False
12420052226 False
15907952196 False
15625229997 False
16921950066 False
12696912090 False
16979201079 False
12100221766 False
16201609226 False
16261646692 False
16611225626 False
15927546424 False
15912621004 False
17600446922 False
12260064657 False
16727962992 False
16962049492 False
19921426110 False
16217566946 False
12210562700 False
16296267629 False
15144694469 False
16166279955 False
12970221916 False
16477045216 False
15292277991 False
15207910610 False
15157020292 False
Name: phone, Length: 1329, dtype: bool