文章目录
pandas__任务4.5 创建透视表与交叉表
4.5 创建透视表与交叉表
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
代码4-67 使用订单号作为透视表索引制作透视表
%%html
<img src='./image/1.png', width=900, height=400>
<img src=’./image/1.png’, width=900, height=400>
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/zuoye')
detail = pd.read_sql_table('meal_order_detail1',con = engine)
detailPivot = pd.pivot_table(detail[['order_id','counts','amounts']],index = 'order_id')
print('以order_id作为分组键创建的订单透视表为:\n',
detailPivot.head())
D:\Study\anaconda\lib\site-packages\pymysql\cursors.py:170: Warning: (1366, "Incorrect string value: '\\xD6\\xD0\\xB9\\xFA\\xB1\\xEA...' for column 'VARIABLE_VALUE' at row 1")
result = self._query(query)
以order_id作为分组键创建的订单透视表为:
amounts counts
order_id
1002 32.000 1.0000
1003 30.125 1.2500
1004 43.875 1.0625
1008 63.000 1.0000
1011 57.700 1.0000
在不特殊指定聚合函数aggfunc时,会默认使用numpy.mean进行聚合运算,numpy.mean会自动过滤掉非数值类型数据。可以通过指定aggfunc参数修改聚合函数。
代码 4-68 修改聚合函数后的透视表
detailPivot1 = pd.pivot_table(detail[[
'order_id','counts','amounts']],
index = 'order_id',aggfunc = np.sum)
print('以order_id作为分组键创建的订单销量与售价总和透视表为:\n',
detailPivot1.head())
以order_id作为分组键创建的订单销量与售价总和透视表为:
amounts counts
order_id
1002 224.0 7.0
1003 241.0 10.0
1004 702.0 17.0
1008 315.0 5.0
1011 577.0 10.0
和groupby方法分组的时候相同,pivot_table函数在创建透视表的时候分组键index可以有多个。
代码 4-69 使用订单号和菜品名称作为索引的透视表
detailPivot2 = pd.pivot_table(detail[[
'order_id','dishes_name',
'counts','amounts']],
index = ['order_id','dishes_name'],
aggfunc = np.sum)
print('以order_id和dishes_name作为分组键创建的订单\
销量与售价总和透视表为:\n',detailPivot2.head())
以order_id和dishes_name作为分组键创建的订单销量与售价总和透视表为:
amounts counts
order_id dishes_name
1002 凉拌菠菜 27.0 1.0
南瓜枸杞小饼干 19.0 1.0
焖猪手 58.0 1.0
独家薄荷鲜虾牛肉卷 45.0 1.0
白胡椒胡萝卜羊肉汤 35.0 1.0
通过设置columns参数可以指定列分组。
代码 4-70 指定菜品名称为列分组键的透视表
detailPivot2 = pd.pivot_table(detail[[
'order_id','dishes_name','counts','amounts']],
index = 'order_id',
columns = 'dishes_name',
aggfunc = np.sum)
print('以order_id和dishes_name作为行列分组键创建的\
透视表前5行4列为:\n',detailPivot2.iloc[:5,:4])
以order_id和dishes_name作为行列分组键创建的透视表前5行4列为:
amounts
dishes_name 42度海之蓝 北冰洋汽水 38度剑南春 50度古井贡酒
order_id
1002 NaN NaN NaN NaN
1003 NaN NaN NaN NaN
1004 NaN NaN NaN NaN
1008 NaN NaN NaN NaN
1011 99.0 NaN NaN NaN
当全部数据列数很多时,若只想要显示某列,可以通过指定values参数来实现。
代码 4-71 指定某些列制作透视表
detailPivot4 = pd.pivot_table(detail[['order_id','dishes_name','counts','amounts']],
index = 'order_id',
values = ['counts','amounts'],
aggfunc = np.sum)
print('以order_id作为行分组键counts作为值创建的\
透视表前5行为:\n',detailPivot4.head())
以order_id作为行分组键counts作为值创建的透视表前5行为:
amounts counts
order_id
1002 224.0 7.0
1003 241.0 10.0
1004 702.0 17.0
1008 315.0 5.0
1011 577.0 10.0
当某些数据不存在时,会自动填充NaN,因此可以指定fill_value参数,表示当存在缺失值时,以指定数值进行填充。
代码 4-72 对透视表中的缺失值进行填充
detailPivot5 = pd.pivot_table(detail[[
'order_id','dishes_name','counts','amounts']],
index = 'order_id',
columns = 'dishes_name',
aggfunc = np.sum,fill_value = 0)
print('空值填0后以order_id和dishes_name为行列分组键\
创建透视表前5行4列为:\n',detailPivot5.iloc[:5,:4])
空值填0后以order_id和dishes_name为行列分组键创建透视表前5行4列为:
amounts
dishes_name 42度海之蓝 北冰洋汽水 38度剑南春 50度古井贡酒
order_id
1002 0 0 0 0
1003 0 0 0 0
1004 0 0 0 0
1008 0 0 0 0
1011 99 0 0 0
可以更改margins参数,查看汇总数据。
代码 4-73 在透视表中添加汇总数据
detailPivot6 = pd.pivot_table(detail[[
'order_id','dishes_name','counts','amounts']],
index = 'order_id',columns = 'dishes_name',
aggfunc = np.sum,fill_value = 0,
margins = True)
print('添加margins后以order_id和dishes_name为分组键\
的透视表前5行后4列为:\n',detailPivot6.iloc[:5,-4:])
添加margins后以order_id和dishes_name为分组键的透视表前5行后4列为:
counts
dishes_name 黄油曲奇饼干 黄花菜炒木耳 黑米恋上葡萄 All
order_id
1002 0 0 0 7.0
1003 0 0 0 10.0
1004 0 1 0 17.0
1008 0 0 0 5.0
1011 0 0 0 10.0
4.5.2 使用crosstab函数创建交叉表
交叉表是一种特殊的透视表,主要用于计算分组频率。利用pandas提供的crosstab函数可以制作交叉表,crosstab函数的常用参数和使用格式如下。
%%html
<img src='./image/2.png', width=900, height=400>
<img src=’./image/2.png’, width=900, height=400>
由于交叉表是透视表的一种,其参数基本保持一致,不同之处在于crosstab函数中的index,columns,values填入的都是对应的从Dataframe中取出的某一列。
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
代码 4-74 使用crosstab函数制作交叉表
detailCross = pd.crosstab(index=detail['order_id'],
columns = detail['dishes_name'],
values = detail['counts'],aggfunc = np.sum)
print('以order_id和dishes_name为分组键\
counts为值的透视表前5行5列为:\n',detailCross.iloc[:5,:5])
以order_id和dishes_name为分组键counts为值的透视表前5行5列为:
dishes_name 42度海之蓝 北冰洋汽水 38度剑南春 50度古井贡酒 52度泸州老窖
order_id
1002 NaN NaN NaN NaN NaN
1003 NaN NaN NaN NaN NaN
1004 NaN NaN NaN NaN NaN
1008 NaN NaN NaN NaN NaN
1011 1.0 NaN NaN NaN NaN
代码 4-75 订单详情表单日菜品成交总额与总数透视表
detail['place_order_time'] = pd.to_datetime(
detail['place_order_time'])
detail['date'] = [i.date() for i in detail['place_order_time']]
PivotDetail = pd.pivot_table(detail[[
'date','dishes_name','counts','amounts']],
index ='date',aggfunc = np.sum,
margins = True)
print('订单详情表单日菜品成交总额与总数透视表前5行5列为:\n',
PivotDetail.head())
订单详情表单日菜品成交总额与总数透视表前5行5列为:
amounts counts
date
2016-08-01 9366.0 233.0
2016-08-02 6125.0 151.0
2016-08-03 6890.0 192.0
2016-08-04 7549.0 169.0
2016-08-05 8671.0 224.0
代码 4-76 订单详情表单个菜品单日成交总额透视表
CrossDetail = pd.crosstab(
index=detail['date'],columns=detail['dishes_name'],
values = detail['amounts'],
aggfunc = np.sum,margins = True)
print('订单详情表单日单个菜品成交总额交叉表后5行5列为:\n',
CrossDetail.iloc[-5:,-5:])
订单详情表单日单个菜品成交总额交叉表后5行5列为:
dishes_name 黄尾袋鼠西拉子红葡萄酒 黄油曲奇饼干 黄花菜炒木耳 黑米恋上葡萄 All
date
2016-08-07 230.0 32.0 105.0 99.0 31306.0
2016-08-08 46.0 NaN NaN 33.0 6532.0
2016-08-09 138.0 NaN 35.0 99.0 7155.0
2016-08-10 46.0 NaN 70.0 33.0 10231.0
All 736.0 80.0 525.0 561.0 125992.0