百分之99的人都不知道关于pandas的stack的冷知识

先上stack的定义

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html

    def stack(self, level=-1, dropna=True):
        """
        Stack the prescribed level(s) from columns to index.
        Return a reshaped DataFrame or Series having a multi-level
        index with one or more new inner-most levels compared to the current
        DataFrame. The new inner-most levels are created by pivoting the
        columns of the current dataframe:
          - if the columns have a single level, the output is a Series;
          - if the columns have multiple levels, the new index
            level(s) is (are) taken from the prescribed level(s) and
            the output is a DataFrame.
        Parameters
        ----------
        level : int, str, list, default -1
            Level(s) to stack from the column axis onto the index
            axis, defined as one index or label, or a list of indices
            or labels.
        dropna : bool, default True
            Whether to drop rows in the resulting Frame/Series with
            missing values. Stacking a column level onto the index
            axis can create combinations of index and column values
            that are missing from the original dataframe. See Examples
            section.
        Returns
        -------
        DataFrame or Series
            Stacked dataframe or series.
        """

stack 主要是把列转成索引,返回一个变形后的 dataframe (原列是多层的)或 series (原列的单层的)。

参数只有 level 和 dropna 两个。

level 就是需要转的层级,可以传入int类型,层级的最上层是0,所以默认值-1就是指最下面那层;还可以传入str,需要设置columns.names;还可以传入list,就是转多层,注意顺序。

dropna 是决定是否丢弃全是missing values的行,什么意思呢,看下面例子:

In [10]: df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
    ...:                                     index=['cat', 'dog'],
    ...:                                     columns=multicol2)

In [11]: df_multi_level_cols3
Out[11]:
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0

In [12]: df_multi_level_cols3.stack(dropna=False)
Out[12]:
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

In [13]: df_multi_level_cols3.stack(dropna=True)
Out[13]:
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

细心的朋友应该能看出区别吧,没错,这就是这篇文章的重点了。

dropna参数的妙用

你现在有这样一个数据集:

In [22]: df = pd.DataFrame({'sex':['male','female','female'],
							'type':['student','teacher','student'],
							'age':[7,28,8]})

In [23]: df
Out[23]:
      sex     type  age
0    male  student    7
1  female  teacher   28
2  female  student    8

你需要根据性别和职业分组求年龄的均值,很简单对吧:

In [24]: df1 = df.groupby(['sex','type'])[['age']].mean()

In [25]: df1
Out[25]:
                age
sex    type
female student    8
       teacher   28
male   student    7

诶,这时候你发现,因为这个数据集没有male的teacher,产生的缺失值导致你这个表少了一行,但是你老板就偏偏要你做成这种样式的excel表格:

excel

咋办呢,你会想到用reindex对索引进行排序:

In [34]: df1.reindex(index=['student','teacher'], level=1)
Out[34]:
                age
sex    type
female student    8
       teacher   28
male   student    7

可惜,reindex对多层和单层的作用好像不太相同,结果并不理想。

你又想到了直接构建multiIndex来排序:

In [35]: multicol1 = pd.MultiIndex.from_product([('male', 'female'),
    ...:                                     ('student', 'teacher')])

In [36]: multicol1
Out[36]:
MultiIndex([(  'male', 'student'),
            (  'male', 'teacher'),
            ('female', 'student'),
            ('female', 'teacher')],
           )

In [37]: df1.reindex(index=multicol1)
Out[37]:
                 age
male   student   7.0
       teacher   NaN
female student   8.0
       teacher  28.0

成功了,但是实际生产中的索引并不会这么简单,构建索引的时候可能会踩到别的坑。

那么,接下来我们来用 stack 实现:

In [38]: df1.unstack().stack(dropna=False)
Out[38]:
                 age
sex    type
female student   8.0
       teacher  28.0
male   student   7.0
       teacher   NaN

OHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值