pandas Pivot Tables and Crosstabs

About
  • Pivot tables and crosstabs are very similar, and their parameters largely overlap; in fact, a pivot table with margins enabled gives essentially the same result a crosstab does (a short toy sketch follows this list).
  • One difference: pivot_table() is a DataFrame method (a top-level pd.pivot_table() also exists), while crosstab() is only available as a top-level pandas function.
  • There is also pivot(), whose reshaping parameters look much the same but which performs no aggregation; it exists both as a DataFrame method and as pd.pivot().
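To make the relationship concrete, here is a minimal sketch on made-up toy data (the names 'dept', 'subject' and 'amount' below are invented for illustration, not taken from the dataset used later):

import pandas as pd

toy = pd.DataFrame({'dept':    ['A', 'A', 'B', 'B'],
                    'subject': ['x', 'y', 'x', 'y'],
                    'amount':  [1.0, 2.0, 3.0, 4.0]})

# DataFrame method: groups by dept x subject and aggregates 'amount'.
pt = toy.pivot_table(index='dept', columns='subject', values='amount', aggfunc='sum')

# Top-level pandas function: the same groupers and aggfunc give the same table.
ct = pd.crosstab(index=toy['dept'], columns=toy['subject'],
                 values=toy['amount'], aggfunc='sum')
print(pt.equals(ct))   # should print True

# pivot() only reshapes, with no aggregation, so every dept/subject pair must be unique.
reshaped = toy.pivot(index='dept', columns='subject', values='amount')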

# Import modules
import pandas as pd
# Read the data
names = ['month', 'day', 'proof', 'dept', 'subject', 'occrual']
df = pd.read_excel('data.xlsx', sheet_name=0, header=None, names=names, skiprows=1)
df.head(2)
   month  day    proof  dept   subject  occrual
0      1   29  记-0023  一车间    邮寄费      5.0
1      1   29  记-0021  一车间  出租车费     14.8

# Data wrangling: build a date column from the month/day columns and use it as the index
df['date'] = pd.to_datetime('2017-' + df['month'].astype(str) + '-' + df['day'].astype(str))
del df['month']
del df['day']
df = df.set_index('date')
df.head(2)
              proof  dept   subject  occrual
date
2017-01-29  记-0023  一车间    邮寄费      5.0
2017-01-29  记-0021  一车间  出租车费     14.8

Pivot tables

First, let's go through the docstring (worth a read if you're interested; it isn't hard):
help(df.pivot_table)
Help on method pivot_table in module pandas.core.frame:

pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
    Create a spreadsheet-style pivot table as a DataFrame
    
    Parameters
    ----------
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed,
        it is being used as the same manner as column values.
    aggfunc : function, list of functions, dict, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names
        (inferred from the function objects themselves)
        If dict is passed, the key is column to aggregate and value
        is function or list of functions.
    fill_value : scalar, default None
        Value to replace missing values with.
    margins : bool, default False
        Add all row / columns (e.g. for subtotal / grand totals).
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    margins_name : str, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    observed : bool, default False
        This only applies if any of the groupers are Categoricals.
        If True: only show observed values for categorical groupers.
        If False: show all values for categorical groupers.
   
    
    Returns
    -------
    DataFrame
        An Excel style pivot table.
    
    Examples
    --------
    >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
    ...                          "bar", "bar", "bar", "bar"],
    ...                    "B": ["one", "one", "one", "two", "two",
    ...                          "one", "one", "two", "two"],
    ...                    "C": ["small", "large", "large", "small",
    ...                          "small", "large", "small", "small",
    ...                          "large"],
    ...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    ...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    >>> df
         A    B      C  D  E
    0  foo  one  small  1  2
    1  foo  one  large  2  4
    2  foo  one  large  2  5
    3  foo  two  small  3  5
    4  foo  two  small  3  6
    5  bar  one  large  4  6
    6  bar  one  small  5  8
    7  bar  two  small  6  9
    8  bar  two  large  7  9
    
    This first example aggregates values by taking the sum.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
    C        large  small
    A   B
    bar one    4.0    5.0
        two    7.0    6.0
    foo one    4.0    1.0
        two    NaN    6.0
    
    We can also fill missing values using the `fill_value` parameter.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum, fill_value=0)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two      0      6
    
    The next example aggregates by taking the mean across multiple columns.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': np.mean})
    >>> table
                    D         E
    A   C
    bar large  5.500000  7.500000
        small  5.500000  8.500000
    foo large  2.000000  4.500000
        small  2.333333  4.333333
    
    We can also calculate multiple types of aggregations for any given
    value column.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': [min, max, np.mean]})
    >>> table
                    D    E
                mean  max      mean  min
    A   C
    bar large  5.500000  9.0  7.500000  6.0
        small  5.500000  9.0  8.500000  8.0
    foo large  2.000000  5.0  4.500000  4.0
        small  2.333333  6.0  4.333333  2.0
Selected parameters
  • columns: the key(s) whose values form the pivot table's columns
  • index: the key(s) used as the pivot table's row index
  • values: the column(s) to aggregate (summed, averaged, and so on)
  • aggfunc: a single aggregation, or several passed as a list
  • fill_value: the value used in place of NaN in the result
  • margins: add an extra row and column holding the aggfunc result over each row/column (see the toy sketch after this list)
  • margins_name: the label of that totals row/column when margins=True; it defaults to 'All'
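Before moving on to the real data, a tiny toy sketch (again made-up values, independent of the df above) that exercises fill_value, margins and margins_name together:

import pandas as pd

toy = pd.DataFrame({'dept':    ['A', 'A', 'B'],
                    'subject': ['x', 'y', 'x'],
                    'amount':  [10.0, 20.0, 30.0]})

table = toy.pivot_table(index='dept', columns='subject', values='amount',
                        aggfunc='sum',         # a single aggregation; a list works too
                        fill_value=0,          # the missing B/'y' cell becomes 0 instead of NaN
                        margins=True,          # append a totals row and column
                        margins_name='total')  # label for that row/column (default 'All')
print(table)
# Expected, roughly:
# subject      x     y  total
# dept
# A         10.0  20.0   30.0
# B         30.0   0.0   30.0
# total     40.0  20.0   60.0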
# Pivot table of each department's total occrual per month
monthly_pivot = df.pivot_table(columns='dept', values='occrual', index=df.index.month, aggfunc='sum', fill_value=0)
monthly_pivot
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部
date
1       31350.57   9594.98   2392.25       0.00   3942.00  18461.74   7956.20  13385.20
2          18.00  10528.06   2131.00       0.00   7055.00  18518.58  11167.00  16121.00
3       32026.57  14946.70   4645.06       0.00  17491.30  21870.66  40314.92  28936.58
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33
10         16.00  14478.15  21223.89       0.00  14538.85  22863.39  41951.80  16894.00
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12

Multiple aggregation functions, multiple columns and multiple index keys can all be passed as lists:
monthly_pivot = df.pivot_table(columns=['dept'], values='occrual', index=df.index.month, aggfunc=['sum', 'mean'], fill_value=0)
monthly_pivot
             sum
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部
date
1       31350.57   9594.98   2392.25       0.00   3942.00  18461.74   7956.20  13385.20
2          18.00  10528.06   2131.00       0.00   7055.00  18518.58  11167.00  16121.00
3       32026.57  14946.70   4645.06       0.00  17491.30  21870.66  40314.92  28936.58
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33
10         16.00  14478.15  21223.89       0.00  14538.85  22863.39  41951.80  16894.00
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12

             mean
dept         一车间       二车间   人力资源部        技改办       经理室       财务部      销售1部      销售2部
date
1       10450.190000   685.355714   797.416667      0.000000   985.500000  3692.348000   994.525000  3346.300000
2          18.000000   701.870667  1065.500000      0.000000   705.500000  3703.716000  1116.700000  4030.250000
3        8006.642500   515.403448   929.012000      0.000000  1249.378571  2733.832500  4031.492000  2630.598182
4        5760.680000   970.220000   690.233333   1886.266667  1030.250000  2377.106250  1385.440000  5581.140000
5       17690.245000   743.043548   564.414000  11021.945000  2182.453846  3261.874444  2147.608824  4173.413750
6        7215.314000   586.631290   701.700000  10135.341818   884.040000  2885.618333  1192.100000  3247.534167
7         967.780000   664.123333   701.026667  18318.466667  2194.133333  2892.618333  4153.199412  4976.306875
8           6.333333   968.287500   944.170000   9018.125000  1326.047500  3297.098571  4276.808000  4050.910000
9        7048.780000   497.778571  2143.700000  15754.983333  1771.716667  2773.682500   955.386471  3843.410000
10         16.000000   629.484783  4244.778000      0.000000  1615.427778  3266.198571  2996.557143  1877.111111
11       5188.947500   731.679167   439.794545   1087.716000  1545.960714  2402.057333  1089.603333  5685.794118
12      48986.580000   576.107632   331.603333  22922.212222  1908.894737  1618.550345  1055.094324  3544.010909

Crosstabs

Again, let's look at the docstring first:
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: str = 'All', dropna: bool = True, normalize=False) -> 'DataFrame'
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    
        .. versionadded:: 0.21.0
    
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
 
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
# Crosstab of each department's total occrual per month, with a grand-total row and column
monthly_crosstab = pd.crosstab(index=df.index.month, columns=df['dept'], values=df['occrual'], aggfunc='sum', rownames=['month'], margins=True, margins_name='sum')
monthly_crosstab
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部         sum
month
1       31350.57   9594.98   2392.25        NaN   3942.00  18461.74   7956.20  13385.20    87082.94
2          18.00  10528.06   2131.00        NaN   7055.00  18518.58  11167.00  16121.00    65538.64
3       32026.57  14946.70   4645.06        NaN  17491.30  21870.66  40314.92  28936.58   160231.79
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70   104421.55
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31   378550.06
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41   252898.02
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91   271141.66
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83   253554.75
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33   197818.47
10         16.00  14478.15  21223.89        NaN  14538.85  22863.39  41951.80  16894.00   131966.08
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50   237855.85
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12   540360.55
sum    362680.36 222340.89  64949.01  663217.43 198309.28 292995.48 383438.02 493489.89  2681420.36
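As claimed at the start, pivot_table with margins enabled gives essentially the same table. Assuming the same df built above, the call below should reproduce this crosstab, apart from the row-index label and the fact that the 技改办 cells stay NaN unless a fill_value is passed (a sketch, not re-run against the original workbook):

monthly_pivot_totals = df.pivot_table(index=df.index.month, columns='dept',
                                      values='occrual', aggfunc='sum',
                                      margins=True, margins_name='sum')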

The normalize parameter
data = {'name': ['aa', 'bb', 'aa', 'cc', 'bb'],
        'a': [0, 1, 1, 2, 0],
        'b': [5, 6, 7, 8, 9]}
df1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4])
df1
  name  a  b
0   aa  0  5
1   bb  1  6
2   aa  1  7
3   cc  2  8
4   bb  0  9
pd.crosstab(index=df1['name'], values=df1['b'], columns=df1['a'], aggfunc='sum', margins=True)
a        0     1    2  All
name
aa     5.0   7.0  NaN   12
bb     9.0   6.0  NaN   15
cc     NaN   NaN  8.0    8
All   14.0  13.0  8.0   35
The normalize argument:
  • normalize=True or 'all': every cell is divided by the grand total of all values
  • normalize='index': every cell is divided by its row total
  • normalize='columns': every cell is divided by its column total (a sketch of the 'index' and 'columns' variants follows the example below)
pd.crosstab(index=df1['name'], values=df1['b'], columns=df1['a'], normalize=True, aggfunc='sum', margins=True)
a            0         1         2       All
name
aa    0.142857  0.200000  0.000000  0.342857
bb    0.257143  0.171429  0.000000  0.428571
cc    0.000000  0.000000  0.228571  0.228571
All   0.400000  0.371429  0.228571  1.000000
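For completeness, a short sketch of the other two normalize modes, reusing the toy df1 defined above:

# With normalize='index' every row should sum to 1; with 'columns' every column should.
by_row = pd.crosstab(index=df1['name'], columns=df1['a'], values=df1['b'],
                     aggfunc='sum', normalize='index')
by_col = pd.crosstab(index=df1['name'], columns=df1['a'], values=df1['b'],
                     aggfunc='sum', normalize='columns')
print(by_row.sum(axis=1))   # expected: 1.0 for every row
print(by_col.sum(axis=0))   # expected: 1.0 for every column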

If you want a deeper understanding, go spend more time with the docstrings and source.
