pandas Pivot Tables and Crosstabs

About
  • Pivot tables and crosstabs are very similar, and their parameters largely overlap; in fact, a pivot table with margins enabled gives essentially the same result a crosstab does (a short toy sketch follows this list).
  • One difference: pivot_table() is a DataFrame method (a top-level pd.pivot_table() also exists), while crosstab() is only available as a top-level pandas function.
  • There is also pivot(), whose reshaping parameters look much the same but which performs no aggregation; it exists both as a DataFrame method and as pd.pivot().
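To make the relationship concrete, here is a minimal sketch on made-up toy data (the names 'dept', 'subject' and 'amount' below are invented for illustration, not taken from the dataset used later):

import pandas as pd

toy = pd.DataFrame({'dept':    ['A', 'A', 'B', 'B'],
                    'subject': ['x', 'y', 'x', 'y'],
                    'amount':  [1.0, 2.0, 3.0, 4.0]})

# DataFrame method: groups by dept x subject and aggregates 'amount'.
pt = toy.pivot_table(index='dept', columns='subject', values='amount', aggfunc='sum')

# Top-level pandas function: the same groupers and aggfunc give the same table.
ct = pd.crosstab(index=toy['dept'], columns=toy['subject'],
                 values=toy['amount'], aggfunc='sum')
print(pt.equals(ct))   # should print True

# pivot() only reshapes, with no aggregation, so every dept/subject pair must be unique.
reshaped = toy.pivot(index='dept', columns='subject', values='amount')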

# Import modules
import pandas as pd
# Read the data
names = ['month', 'day', 'proof', 'dept', 'subject', 'occrual']
df = pd.read_excel('data.xlsx', sheet_name=0, header=None, names=names, skiprows=1)
df.head(2)
   month  day    proof  dept   subject  occrual
0      1   29  记-0023  一车间    邮寄费      5.0
1      1   29  记-0021  一车间  出租车费     14.8

# Data wrangling: build a date column from the month/day columns and use it as the index
df['date'] = pd.to_datetime('2017-' + df['month'].astype(str) + '-' + df['day'].astype(str))
del df['month']
del df['day']
df = df.set_index('date')
df.head(2)
              proof  dept   subject  occrual
date
2017-01-29  记-0023  一车间    邮寄费      5.0
2017-01-29  记-0021  一车间  出租车费     14.8

Pivot tables

First, let's go through the docstring (worth a read if you're interested; it isn't hard):
help(df.pivot_table)
Help on method pivot_table in module pandas.core.frame:

pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
    Create a spreadsheet-style pivot table as a DataFrame
    
    Parameters
    ----------
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed,
        it is being used as the same manner as column values.
    aggfunc : function, list of functions, dict, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names
        (inferred from the function objects themselves)
        If dict is passed, the key is column to aggregate and value
        is function or list of functions.
    fill_value : scalar, default None
        Value to replace missing values with.
    margins : bool, default False
        Add all row / columns (e.g. for subtotal / grand totals).
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    margins_name : str, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.
    observed : bool, default False
        This only applies if any of the groupers are Categoricals.
        If True: only show observed values for categorical groupers.
        If False: show all values for categorical groupers.
   
    
    Returns
    -------
    DataFrame
        An Excel style pivot table.
    
    Examples
    --------
    >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
    ...                          "bar", "bar", "bar", "bar"],
    ...                    "B": ["one", "one", "one", "two", "two",
    ...                          "one", "one", "two", "two"],
    ...                    "C": ["small", "large", "large", "small",
    ...                          "small", "large", "small", "small",
    ...                          "large"],
    ...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    ...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    >>> df
         A    B      C  D  E
    0  foo  one  small  1  2
    1  foo  one  large  2  4
    2  foo  one  large  2  5
    3  foo  two  small  3  5
    4  foo  two  small  3  6
    5  bar  one  large  4  6
    6  bar  one  small  5  8
    7  bar  two  small  6  9
    8  bar  two  large  7  9
    
    This first example aggregates values by taking the sum.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
    C        large  small
    A   B
    bar one    4.0    5.0
        two    7.0    6.0
    foo one    4.0    1.0
        two    NaN    6.0
    
    We can also fill missing values using the `fill_value` parameter.
    
    >>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum, fill_value=0)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two      0      6
    
    The next example aggregates by taking the mean across multiple columns.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': np.mean})
    >>> table
                    D         E
    A   C
    bar large  5.500000  7.500000
        small  5.500000  8.500000
    foo large  2.000000  4.500000
        small  2.333333  4.333333
    
    We can also calculate multiple types of aggregations for any given
    value column.
    
    >>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': [min, max, np.mean]})
    >>> table
                    D    E
                mean  max      mean  min
    A   C
    bar large  5.500000  9.0  7.500000  6.0
        small  5.500000  9.0  8.500000  8.0
    foo large  2.000000  5.0  4.500000  4.0
        small  2.333333  6.0  4.333333  2.0
Selected parameters
  • columns: the key(s) whose values form the pivot table's columns
  • index: the key(s) used as the pivot table's row index
  • values: the column(s) to aggregate (summed, averaged, and so on)
  • aggfunc: a single aggregation, or several passed as a list
  • fill_value: the value used in place of NaN in the result
  • margins: add an extra row and column holding the aggfunc result over each row/column (see the toy sketch after this list)
  • margins_name: the label of that totals row/column when margins=True; it defaults to 'All'
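Before moving on to the real data, a tiny toy sketch (again made-up values, independent of the df above) that exercises fill_value, margins and margins_name together:

import pandas as pd

toy = pd.DataFrame({'dept':    ['A', 'A', 'B'],
                    'subject': ['x', 'y', 'x'],
                    'amount':  [10.0, 20.0, 30.0]})

table = toy.pivot_table(index='dept', columns='subject', values='amount',
                        aggfunc='sum',         # a single aggregation; a list works too
                        fill_value=0,          # the missing B/'y' cell becomes 0 instead of NaN
                        margins=True,          # append a totals row and column
                        margins_name='total')  # label for that row/column (default 'All')
print(table)
# Expected, roughly:
# subject      x     y  total
# dept
# A         10.0  20.0   30.0
# B         30.0   0.0   30.0
# total     40.0  20.0   60.0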
# Pivot table of each department's total occrual per month
monthly_pivot = df.pivot_table(columns='dept', values='occrual', index=df.index.month, aggfunc='sum', fill_value=0)
monthly_pivot
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部
date
1       31350.57   9594.98   2392.25       0.00   3942.00  18461.74   7956.20  13385.20
2          18.00  10528.06   2131.00       0.00   7055.00  18518.58  11167.00  16121.00
3       32026.57  14946.70   4645.06       0.00  17491.30  21870.66  40314.92  28936.58
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33
10         16.00  14478.15  21223.89       0.00  14538.85  22863.39  41951.80  16894.00
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12

Multiple aggregation functions, multiple columns and multiple index keys can all be passed as lists:
monthly_pivot = df.pivot_table(columns=['dept'], values='occrual', index=df.index.month, aggfunc=['sum', 'mean'], fill_value=0)
monthly_pivot
             sum
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部
date
1       31350.57   9594.98   2392.25       0.00   3942.00  18461.74   7956.20  13385.20
2          18.00  10528.06   2131.00       0.00   7055.00  18518.58  11167.00  16121.00
3       32026.57  14946.70   4645.06       0.00  17491.30  21870.66  40314.92  28936.58
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33
10         16.00  14478.15  21223.89       0.00  14538.85  22863.39  41951.80  16894.00
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12

             mean
dept         一车间       二车间   人力资源部        技改办       经理室       财务部      销售1部      销售2部
date
1       10450.190000   685.355714   797.416667      0.000000   985.500000  3692.348000   994.525000  3346.300000
2          18.000000   701.870667  1065.500000      0.000000   705.500000  3703.716000  1116.700000  4030.250000
3        8006.642500   515.403448   929.012000      0.000000  1249.378571  2733.832500  4031.492000  2630.598182
4        5760.680000   970.220000   690.233333   1886.266667  1030.250000  2377.106250  1385.440000  5581.140000
5       17690.245000   743.043548   564.414000  11021.945000  2182.453846  3261.874444  2147.608824  4173.413750
6        7215.314000   586.631290   701.700000  10135.341818   884.040000  2885.618333  1192.100000  3247.534167
7         967.780000   664.123333   701.026667  18318.466667  2194.133333  2892.618333  4153.199412  4976.306875
8           6.333333   968.287500   944.170000   9018.125000  1326.047500  3297.098571  4276.808000  4050.910000
9        7048.780000   497.778571  2143.700000  15754.983333  1771.716667  2773.682500   955.386471  3843.410000
10         16.000000   629.484783  4244.778000      0.000000  1615.427778  3266.198571  2996.557143  1877.111111
11       5188.947500   731.679167   439.794545   1087.716000  1545.960714  2402.057333  1089.603333  5685.794118
12      48986.580000   576.107632   331.603333  22922.212222  1908.894737  1618.550345  1055.094324  3544.010909

Crosstabs

Again, let's look at the docstring first:
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: str = 'All', dropna: bool = True, normalize=False) -> 'DataFrame'
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    
        .. versionadded:: 0.21.0
    
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
 
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
# Crosstab of each department's total occrual per month, with a grand-total row and column
monthly_crosstab = pd.crosstab(index=df.index.month, columns=df['dept'], values=df['occrual'], aggfunc='sum', rownames=['month'], margins=True, margins_name='sum')
monthly_crosstab
dept       一车间     二车间  人力资源部      技改办     经理室     财务部    销售1部    销售2部         sum
month
1       31350.57   9594.98   2392.25        NaN   3942.00  18461.74   7956.20  13385.20    87082.94
2          18.00  10528.06   2131.00        NaN   7055.00  18518.58  11167.00  16121.00    65538.64
3       32026.57  14946.70   4645.06        NaN  17491.30  21870.66  40314.92  28936.58   160231.79
4        5760.68  20374.62   2070.70   11317.60   4121.00  19016.85  13854.40  27905.70   104421.55
5       70760.98  23034.35   2822.07  154307.23  28371.90  29356.87  36509.35  33387.31   378550.06
6       36076.57  18185.57   2105.10  111488.76  13260.60  17313.71  15497.30  38970.41   252898.02
7        4838.90  21916.07   2103.08   54955.40  19747.20  17355.71  70604.39  79620.91   271141.66
8          19.00  27112.05   3776.68   72145.00  10608.38  23079.69  64152.12  52661.83   253554.75
9       14097.56  13937.80  12862.20   47264.95  21260.60  22189.46  16241.57  49964.33   197818.47
10         16.00  14478.15  21223.89        NaN  14538.85  22863.39  41951.80  16894.00   131966.08
11      20755.79  26340.45   4837.74    5438.58  21643.45  36030.86  26150.48  96658.50   237855.85
12     146959.74  21892.09   3979.24  206299.91  36269.00  46937.96  39038.49  38984.12   540360.55
sum    362680.36 222340.89  64949.01  663217.43 198309.28 292995.48 383438.02 493489.89  2681420.36
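As claimed at the start, pivot_table with margins enabled gives essentially the same table. Assuming the same df built above, the call below should reproduce this crosstab, apart from the row-index label and the fact that the 技改办 cells stay NaN unless a fill_value is passed (a sketch, not re-run against the original workbook):

monthly_pivot_totals = df.pivot_table(index=df.index.month, columns='dept',
                                      values='occrual', aggfunc='sum',
                                      margins=True, margins_name='sum')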

The normalize parameter
data = {'name': ['aa', 'bb', 'aa', 'cc', 'bb'],
        'a': [0, 1, 1, 2, 0],
        'b': [5, 6, 7, 8, 9]}
df1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4])
df1
  name  a  b
0   aa  0  5
1   bb  1  6
2   aa  1  7
3   cc  2  8
4   bb  0  9
pd.crosstab(index=df1['name'], values=df1['b'], columns=df1['a'], aggfunc='sum', margins=True)
a        0     1    2  All
name
aa     5.0   7.0  NaN   12
bb     9.0   6.0  NaN   15
cc     NaN   NaN  8.0    8
All   14.0  13.0  8.0   35
The normalize argument:
  • normalize=True or 'all': every cell is divided by the grand total of all values
  • normalize='index': every cell is divided by its row total
  • normalize='columns': every cell is divided by its column total (a sketch of the 'index' and 'columns' variants follows the example below)
pd.crosstab(index=df1['name'], values=df1['b'], columns=df1['a'], normalize=True, aggfunc='sum', margins=True)
a            0         1         2       All
name
aa    0.142857  0.200000  0.000000  0.342857
bb    0.257143  0.171429  0.000000  0.428571
cc    0.000000  0.000000  0.228571  0.228571
All   0.400000  0.371429  0.228571  1.000000
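For completeness, a short sketch of the other two normalize modes, reusing the toy df1 defined above:

# With normalize='index' every row should sum to 1; with 'columns' every column should.
by_row = pd.crosstab(index=df1['name'], columns=df1['a'], values=df1['b'],
                     aggfunc='sum', normalize='index')
by_col = pd.crosstab(index=df1['name'], columns=df1['a'], values=df1['b'],
                     aggfunc='sum', normalize='columns')
print(by_row.sum(axis=1))   # expected: 1.0 for every row
print(by_col.sum(axis=0))   # expected: 1.0 for every column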

If you want a deeper understanding, go spend more time with the docstrings and source.
