Let's illustrate with the table below:
print(dfoff)
   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received        Date
0  1439408         2632        NaN           NaN       0.0            NaN  20160217.0
1  1439408         4663    11002.0        150:20       1.0     20160528.0         NaN
2  1439408         2632     8591.0          20:1       0.0     20160217.0         NaN
3  1439408         2632     1078.0          20:1       0.0     20160319.0         NaN
4  1439408         2632     8591.0          20:1       0.0     20160613.0         NaN
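The dfoff frame comes from the Tianchi O2O coupon dataset. For readers without the data file, a tiny stand-in with the same columns (its values are just the five rows shown above) makes the later snippets runnable:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfoff with the same schema as the Tianchi data
# (only the five rows printed above, for illustration)
dfoff = pd.DataFrame({
    'User_id':       [1439408, 1439408, 1439408, 1439408, 1439408],
    'Merchant_id':   [2632, 4663, 2632, 2632, 2632],
    'Coupon_id':     [np.nan, 11002.0, 8591.0, 1078.0, 8591.0],
    'Discount_rate': [np.nan, '150:20', '20:1', '20:1', '20:1'],
    'Distance':      [0.0, 1.0, 0.0, 0.0, 0.0],
    'Date_received': [np.nan, 20160528.0, 20160217.0, 20160319.0, 20160613.0],
    'Date':          [20160217.0, np.nan, np.nan, np.nan, np.nan],
})
print(dfoff)
```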
Suppose we now want to group by Date_received (the date a coupon was received) and count. That can be done like this:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received']).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date'], dtype='object')
               Date
Date_received
20160101.0       74
20160102.0       67
20160103.0       74
20160104.0       98
20160105.0      107
as_index defaults to True.
Let's set it to False and see what changes:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
Index(['Date_received', 'Date'], dtype='object')
   Date_received  Date
0     20160101.0    74
1     20160102.0    67
2     20160103.0    74
3     20160104.0    98
4     20160105.0   107
Comparing the two should make it clear at a glance. as_index does exactly what its name suggests: it controls whether the group key becomes the index. With the key as the index (the first result), the frame has only one column left; without it (the second result), the index falls back to the default 0, 1, 2, ... and Date_received remains an ordinary column in the table.
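A quick way to convince yourself of the relationship between the two forms, on a tiny made-up frame (the names df, key, val are illustrative): with the default as_index=True, calling reset_index() afterwards produces exactly what as_index=False gives directly.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# as_index=True (the default) puts the group key into the index...
g_true = df.groupby('key').count()
# ...and reset_index() turns it back into an ordinary column,
# which is exactly what as_index=False produces directly.
g_false = df.groupby('key', as_index=False).count()

assert g_true.reset_index().equals(g_false)
print(g_false)
```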
Counting Date_received this way is actually not correct; it is used above only to demonstrate as_index. Why it is wrong is explained below, for anyone interested.
---------------------------------------------------------------------------------------------------------------------------------------------------------
Finally, a note on what to watch for when pairing grouping with an aggregation function (count is one of them).
Without any aggregation function:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False)
print(couponbydate)
<pandas.core.groupby.DataFrameGroupBy object at 0x7f90d445d710>
As you can see, the variable couponbydate is a GroupBy object: it merely holds some intermediate data about the group key ['Date_received'] and performs no computation at all. We can then apply aggregation functions to this grouped intermediate data to run particular calculations, such as count, mean, and so on.
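A minimal sketch of this laziness on toy data (the names are illustrative): the split is recorded once, and several different aggregations can then be computed from the same GroupBy object.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})

# Grouping alone does no computation -- it just records the split.
grouped = df.groupby('key')

# The same GroupBy object can feed several aggregations:
counts = grouped.count()   # non-null values per group
means = grouped.mean()     # per-group averages

print(counts)
print(means)
```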
Now let's look at the full result after counting:
couponbydate = dfoff[dfoff['Date_received'].notnull()].groupby(['Date_received'],as_index=False).count()
print(couponbydate.head(5))
   Date_received  User_id  Merchant_id  Coupon_id  Discount_rate  Distance  Date
0     20160101.0      554          554        554            554       447    74
1     20160102.0      542          542        542            542       439    67
2     20160103.0      536          536        536            536       429    74
3     20160104.0      577          577        577            577       474    98
4     20160105.0      691          691        691            691       579   107
First, notice that when the group key Date_received is not used as the index, it appears as the first column.
What do the numbers under all the other columns mean? They are simply the counts: in the first row, Date_received is 20160101.0, and the 554 means that 20160101.0 appears 554 times in the data. It has nothing to do with which particular column it sits under, be it User_id or Merchant_id, which is why most of the numbers within each row are identical.
At this point you may wonder about the exceptions: what about the 447 and the 74 in the first row? By that logic they should also be 554! The cause is missing values (NaN). Concretely:
print(dfoff[dfoff['Date_received']==20160101.0][['Date_received','Date']])
(Only part of the result is shown here.) In the rows where Date_received == 20160101.0, Date is mostly null. How many such rows are there?
print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].notnull())].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Date'].isnull())].shape[0])
554
74
480
So 480 of them have a null Date, meaning only 74 are non-null, and that is exactly why groupby shows 74 rather than 554 under the Date column. We can verify this with the Merchant_id column as well:
print(dfoff[dfoff['Date_received']==20160101.0].shape[0])
print(dfoff[(dfoff['Date_received']==20160101.0)&(dfoff['Merchant_id'].notnull())].shape[0])
554
554
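The NaN behaviour just described can be reproduced on a four-row toy frame. count() tallies only non-null cells, column by column, while size() (a different aggregation) counts every row in the group regardless of nulls:

```python
import numpy as np
import pandas as pd

# Four rows received on the same day, only one of them used
df = pd.DataFrame({
    'Date_received': [20160101.0] * 4,
    'Date': [20160102.0, np.nan, np.nan, np.nan],
})

g = df.groupby('Date_received')
# count() only tallies non-null cells...
print(g['Date'].count())   # 1 for the group
# ...while size() counts every row in the group, NaN or not.
print(g.size())            # 4 for the group
```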
With this analysis in hand, looking back at the counting method given at the beginning, it is indeed incorrect (it actually counts rows where Date_received is non-null and Date is also non-null):
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Date']].groupby(['Date_received'],as_index=False).count()
print(couponbydate.columns)
print(couponbydate.head(5))
Index(['Date_received', 'Date'], dtype='object')
   Date_received  Date
0     20160101.0    74
1     20160102.0    67
2     20160103.0    74
3     20160104.0    98
4     20160105.0   107
What we really want is simply how many times each date actually appears in Date_received. The correct version is:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
   Date_received  count
0     20160101.0    554
1     20160102.0    542
2     20160103.0    536
3     20160104.0    577
4     20160105.0    691
As you can see, 20160101.0 appears 554 times, not 74.
Someone might object: this isn't right either, isn't it counting on the basis that Date_received is non-null and Coupon_id is also non-null? Couldn't there be rows where Date_received is 20160101.0 but Coupon_id is null, which would then be missed? Yes, exactly right!
The reason this version is nevertheless correct comes from the meaning of the data. Date_received is the date a coupon was received and Coupon_id is the coupon's id, so whenever Date_received holds a date, Coupon_id is necessarily non-null. Date, by contrast, is the date the coupon was used; Date_received can hold a date while Date is null, because receiving a coupon does not mean using it. The same count could also be done with:
couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Discount_rate']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['Date_received','count']
because Discount_rate is the discount rate, and every non-null Date_received necessarily comes with a non-null Discount_rate.
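An alternative worth noting (not from the original text): groupby(...).size() counts rows directly, so it gives the right answer without having to find a column that is guaranteed non-null. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date_received': [20160101.0, 20160101.0, 20160102.0],
    'Coupon_id': [1.0, np.nan, 2.0],
})

# size() counts rows per group, so even the row with a NaN
# Coupon_id is included in the tally for 20160101.0
counts = df.groupby('Date_received').size().reset_index(name='count')
print(counts)
```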
Of course, there is one more option besides these: the value_counts() method:
print(dfoff['Date_received'].value_counts())
20160129.0    71658
20160125.0    65904
20160124.0    39481
20160131.0    35427
20160128.0    34334
20160207.0    33319
20160130.0    33226
20160126.0    26027
20160123.0    24045
20160521.0    19859
...
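On a small made-up Series, value_counts() can be checked against the groupby-based count: it tallies each distinct value, sorted by frequency in descending order.

```python
import pandas as pd

s = pd.Series([20160129.0, 20160129.0, 20160125.0])

# value_counts() returns per-value counts, most frequent first
vc = s.value_counts()
print(vc)

# Up to ordering, it matches grouping the Series by itself and taking size()
assert vc.to_dict() == s.groupby(s).size().to_dict()
```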
To wrap up, here is a small hands-on exercise, a task from a Tianchi competition, adapted from others' work.
First, count the number of coupons handed out each day and the number that were eventually used:
import pandas as pd
import matplotlib.pyplot as plt

# unique received dates; NaN != NaN, so this comparison filters out the NaN entry
date_received_type = dfoff['Date_received'].unique()
date_received_type = sorted(date_received_type[date_received_type == date_received_type])

couponbydate = dfoff[dfoff['Date_received'].notnull()][['Date_received', 'Coupon_id']].groupby(['Date_received'], as_index=False).count()
couponbydate.columns = ['date_received', 'count']
buybydate = dfoff[(dfoff['Date_received'].notnull()) & (dfoff['Date'].notnull())][['Date_received', 'Date']].groupby(['Date_received'], as_index=False).count()
buybydate.columns = ['date_buy', 'count']

plt.figure(figsize=(12, 8))
# the dates are floats like 20160101.0, so cast to int strings before parsing
date_received_dt = pd.to_datetime([str(int(d)) for d in date_received_type], format='%Y%m%d')
plt.bar(date_received_dt, couponbydate['count'], label='number of coupons received')
plt.bar(date_received_dt, buybydate['count'], label='number of coupons used')
plt.yscale('log')
plt.ylabel('Count')
plt.xlabel('Date')
plt.legend()
plt.show()
-------------------------------------------------------------------------------------------------------------------------------------------------------------
To summarize:
as_index controls whether the group key becomes the index. If it does, the resulting table no longer has that column; otherwise it keeps it.
In a grouped count, the first column holds the distinct values of the group key, and the numbers under the remaining columns are how many times each key value appears; they are independent of the particular column, so all columns show the same numbers. To count, just take the group key plus any one other column, and each row of the result gives a key value and its count.
There is one special case: if the chosen column contains null values, groupby's count does not include them.
————————————————
Copyright notice: this is an original article by CSDN blogger 「weixin_42001089」, released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_42001089/article/details/83747730